Data-centric View of Sensornets: An Overview

Puru Kulkarni, Vijay Sundaram, Bhuvan Urgaonkar

Motivation • Ubiquitous presence of sensor networks • Communication, computation, limited storage, sensing capabilities • Used to sense, actuate, control • Sensors everywhere = Data everywhere! • Require an infrastructure for data access and storage

Overview • Sensors sense/generate data • Users/Applications interested in data or some measure of data • Common user operations are: • Queries and Monitoring • Actuate and Control

Typical Queries • Historical • What is the average rainfall over the past 2 days? • Current • What is the current temperature in Rm# 226? • Long Running • Temperature in Rm# 226 over the next 4 hours, every 30 seconds

Issues • How to identify relevant sensors? • Computation vs. Communication tradeoff • Where to process query? • inside the sensor network (route query) • Need new techniques • at a centralized location (route data) • Large amounts of data transfer (not efficient) • Data gathering may not reflect query rate • How to process query? • queries on streaming data

DataSpace: Querying and Monitoring Deeply Networked Collections in Physical Space (T. Imielinski and S. Goel, Rutgers University) • Billions of objects populate space • Each produces and locally stores data • Location aware • Can be selectively monitored, queried and controlled • Physical world enhanced with data

Characteristics • Dataspace • Data lives on the object • Users access not only “local” information but can navigate entire dataspace • Spatial world divided in 3-D datacubes • CS Bldg. , street, block etc • Communication, messaging and computation techniques for querying and monitoring required

Querying and Monitoring • Queries are spatially driven • Steps: • Identify relevant datacubes • Identify relevant nodes (dataflocks) • Datacube directory service • Aggregation for queries on several datacubes • e.g.: Information about Manhattan taxi cabs
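
The two lookup steps above (find the relevant datacubes, then the nodes inside them) can be sketched as follows. This is a minimal illustration, not the DataSpace implementation: it assumes datacubes are axis-aligned cubes with a fixed edge length, and `CUBE_SIDE`, `datacube_of` and `DatacubeDirectory` are hypothetical names.

```python
from collections import defaultdict

CUBE_SIDE = 100.0  # assumed edge length of one datacube, in metres


def datacube_of(x, y, z, side=CUBE_SIDE):
    """Map a 3-D position to the integer index of the datacube containing it."""
    return (int(x // side), int(y // side), int(z // side))


class DatacubeDirectory:
    """Toy directory service: datacube index -> ids of the nodes inside it."""

    def __init__(self):
        self._nodes = defaultdict(set)

    def register(self, node_id, position):
        """Record that a node currently sits at the given (x, y, z) position."""
        self._nodes[datacube_of(*position)].add(node_id)

    def nodes_in(self, cube):
        """Return the nodes registered in a single datacube."""
        return set(self._nodes.get(cube, set()))

    def nodes_in_region(self, lo, hi):
        """Return the nodes in every datacube overlapping the box [lo, hi]."""
        lo_c, hi_c = datacube_of(*lo), datacube_of(*hi)
        found = set()
        for cube, nodes in self._nodes.items():
            if all(l <= c <= h for l, c, h in zip(lo_c, cube, hi_c)):
                found |= nodes
        return found
```

A query over several datacubes would call `nodes_in_region` and then aggregate the answers returned by the nodes it finds.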

Architecting DataSpace • Network as DataSpace engine • multicast mechanisms (each node has an IP address!) • group membership based on • physical location • attribute (temperature, #vehicles etc) • multicast fits the selective node addressing needed to access relevant data • e.g.: what is the average temperature in CS Bldg? • Query reaches only the sensors that are in the CS Bldg datacube and have the corresponding group address

Network as DataSpace engine • DataSpace address = <space-handle> <subject-handle> • Space handle (based on the location of the datacube) encodes datacube information • Subject handle (based on the attribute of interest) identifies the attributes that are part of a multicast group • DataSpace address is an IPv6 multicast address • E.g.: Space handle: 224.4.5, Subject handle: 8, DataSpace address: 224.4.5.8
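
The address composition in the example can be sketched as a simple concatenation. This follows the dotted notation of the slide's example only; the function names are hypothetical, and how the handles are actually packed into a multicast address is not specified here.

```python
def dataspace_address(space_handle: str, subject_handle: int) -> str:
    """Join a space handle and a subject handle into one dotted address."""
    return f"{space_handle}.{subject_handle}"


def split_dataspace_address(address: str):
    """Recover (space_handle, subject_handle) from a dotted address."""
    space_handle, _, subject = address.rpartition(".")
    return space_handle, int(subject)


# Reproduces the example above.
print(dataspace_address("224.4.5", 8))
```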

Geographic Routing infrastructure • Route messages based on physical location rather than IP address • Use GPS coordinates for locations • Avoids use of multicast for routing queries to datacubes • Once the query reaches a region, use multicast

Geographic Routing infrastructure • Geo-router (routes based on datacube location) • Geo-node (issues query to nodes in datacube) • Geo-host (processes geographic messages) • Approach • Route query to datacube • Geo-nodes route query within datacube • multicast with a TTL of 1
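
As an illustration of routing on location rather than on IP address, the sketch below uses greedy forwarding: each hop hands the query to the neighbor closest to the destination datacube. Greedy forwarding is a common technique swapped in here for illustration; the slides do not say which algorithm the geo-routers use, and `next_hop` is a hypothetical name.

```python
import math


def next_hop(current, neighbors, dest):
    """Return the neighbor strictly closer to dest than current, or None.

    All positions are coordinate tuples of the same dimension. None means
    no neighbor makes progress, so greedy forwarding is stuck at this node.
    """
    best, best_dist = None, math.dist(current, dest)
    for candidate in neighbors:
        d = math.dist(candidate, dest)
        if d < best_dist:
            best, best_dist = candidate, d
    return best
```

Once the query arrives inside the target datacube, delivery to the individual nodes would switch to the TTL-1 multicast described above.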

The Sensor Network as a Database • Govindan, Hellerstein, Hong, Madden, Franklin, Shenker • Querying the Physical World • Bonnet, Gehrke, Seshadri

Sensornet Database architecture • Given a routing and access mechanism, how to process queries? • Provide a DB-view to users/apps • well understood programming interface • common data operations use computation in network • help energy-efficiency • allow users to be unaware of actual network, but treat it as a database • Sensor Network + Data => Sensor Network Database

What is required? • Core DB operations tailored for sensor networks • Design appropriate building blocks for DB operations • Join, aggregation, grouping, selection etc

Sensornet Database Architecture • Two important ideas: • in-network implementations of primitive database query operators such as grouping, aggregation, and joins • group communication and routing protocols, with possible processing at intermediate nodes, implement the operators in an application-independent way

Sensornet Database Architecture • Relax the semantics of database queries to allow approximate results • relaxation enables energy-efficient implementations even given the expected high level of network dynamics • A sensor network is a proxy for a continuous real-world phenomenon, and by nature samples that phenomenon discretely at some rate, with some degree of error.

In-network Implementation • JOIN operator • selection over cross-product of a pair of tables • Tuples generated at different nodes might be joined at a single node • Some JOIN implementations are blocking • Blocking is infeasible in sensor networks • tables can contain unbounded streams of data • amount of memory available is limited • Need to retool these operations • Pipelining • Partitioning

Non-Blocking Pipelined Joins • Symmetric hash-join: • Maintains two hash tables (keyed by the column(s) used for the join) • On an input tuple, looks up matching tuples in the other input's hash table • Outputs any matching results • Ripple joins: • Statistically sample the two tables to be joined, in order to produce a stream of joined tuples • Relative rates at which the two tables are sampled adapt to match the variance produced by the data in each • low-energy approach to obtain approximate answers
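
A minimal sketch of the symmetric hash-join described above, assuming tuples arrive one at a time tagged with the input they belong to. The class and parameter names are hypothetical. The hash tables here grow without bound, so a real sensor node would additionally need a window or eviction policy.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Non-blocking equi-join: emits matches as soon as each tuple arrives."""

    def __init__(self, key_left, key_right):
        # Functions extracting the join key from a left or right tuple.
        self.key = {"L": key_left, "R": key_right}
        # One hash table per input, keyed by the join column value.
        self.tables = {"L": defaultdict(list), "R": defaultdict(list)}

    def insert(self, side, tup):
        """Store tup in its own table and return the new (left, right) matches."""
        other = "R" if side == "L" else "L"
        k = self.key[side](tup)
        self.tables[side][k].append(tup)
        # Probe the other input's table; .get avoids creating empty entries.
        matches = self.tables[other].get(k, [])
        if side == "L":
            return [(tup, m) for m in matches]
        return [(m, tup) for m in matches]
```

Because neither input has to be read completely before results appear, the operator works on unbounded streams and can feed partial answers downstream.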

Partitioning • Partitioning: • tuples are partitioned based on their join-column values and redistributed on the fly across multiple nodes; • the work of joining the individual partitions is done in parallel by each of the nodes • Partitions can be defined by value, geographically, or by sensor type, and a node (or nodes) can be designated to perform the join for the partition
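
Partitioning by value can be sketched as below: every tuple is sent to the node that owns its join-column value, so matching tuples from both tables meet at the same node. This is an illustration under assumptions: the hash-based assignment and the names `owner` and `redistribute` are hypothetical, and a geographic or sensor-type partition would replace `owner` with a different mapping.

```python
import zlib
from collections import defaultdict


def owner(join_value, nodes):
    """Pick the node responsible for a join-column value (stable hash)."""
    digest = zlib.crc32(repr(join_value).encode("utf-8"))
    return nodes[digest % len(nodes)]


def redistribute(tuples, key, nodes):
    """Group tuples by the node that will join their partition."""
    buckets = defaultdict(list)
    for t in tuples:
        buckets[owner(key(t), nodes)].append(t)
    return dict(buckets)
```

Each node then joins only the partitions it received, and the per-node joins run in parallel.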

In-network Implementation • Aggregation operators • summarization of one or more columns into a single numerical value, e.g. SUM, COUNT, AVERAGE, MIN, MAX etc • query is flooded through the network and the responses are routed back along the reverse-path tree • results are aggregated across several nodes • E.g.: to calculate AVERAGE, each node returns (SUM, COUNT) values to its parent • Can be a very common operator
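
The AVERAGE example can be sketched as a recursive pass over the reverse-path tree, where each node merges its own reading with the (SUM, COUNT) pairs of its children. The tree and reading dictionaries are hypothetical inputs chosen for illustration; on a real network each recursive call would be a message from child to parent.

```python
def partial_average(node, children, reading):
    """Return the (SUM, COUNT) pair for the subtree rooted at node."""
    total, count = reading[node], 1
    for child in children.get(node, []):
        child_sum, child_count = partial_average(child, children, reading)
        total += child_sum
        count += child_count
    return total, count


# Hypothetical routing tree: root has children a and b, a has child c.
children = {"root": ["a", "b"], "a": ["c"]}
reading = {"root": 20.0, "a": 22.0, "b": 24.0, "c": 26.0}

total, count = partial_average("root", children, reading)
print(total / count)  # average over all four nodes
```

Sending (SUM, COUNT) rather than a local average is what keeps the result exact: averages of averages would weight subtrees of different sizes equally.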

Distributed Sensnet DBs • How to represent devices in DBs on sensornets? • ADTs (Abstract Data Types) • Methods correspond to sensing functionality • Virtual Relations (VRs) store local data • Network used for query operations

Virtual Relation • VR with attributes as • Inputs to an ADT (device) function • Arguments to an ADT function • Output of the function • Timestamp of the function

Virtual Relation • Some VR properties • records are never updated or deleted • is naturally partitioned over the sensnet (each device takes care of its set of VR records) • What does this mean? – a distributed DB • Records from the VRs (distributed over the devices) are processed using distributed query execution plans
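
A virtual relation can be pictured as an append-only log of device-function invocations held on the device itself. The sketch below is an assumption-laden illustration: `VRRecord`, `VirtualRelation`, and the field names are hypothetical, and the sensing method is passed in as an ordinary Python callable standing in for the ADT function.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)  # frozen: records are never updated once written
class VRRecord:
    device_id: str
    args: Tuple[Any, ...]  # arguments passed to the ADT function
    output: Any            # value the function returned
    timestamp: float       # when the function was invoked


class VirtualRelation:
    """Append-only relation stored on, and owned by, a single device."""

    def __init__(self, device_id: str, method: Callable[..., Any]):
        self.device_id = device_id
        self._method = method
        self._records: List[VRRecord] = []

    def invoke(self, timestamp: float, *args: Any) -> VRRecord:
        """Call the device function and append the result as a new record."""
        record = VRRecord(self.device_id, args, self._method(*args), timestamp)
        self._records.append(record)
        return record

    def scan(self, predicate: Callable[[VRRecord], bool] = lambda r: True):
        """Local selection: the per-device fragment of a distributed plan."""
        return [r for r in self._records if predicate(r)]
```

Since there is no update or delete method, the only way the relation changes is by appending, and each device answers only for its own fragment.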

Approximate Results • Energy-efficiency can be achieved using approximate aggregates • Uniform sampling: • Tuples are uniformly sampled and the resulting average is assumed to represent the actual average • Packet loss might invalidate the statistical assumptions that the resulting confidence intervals depend on • Logarithmic sampling: • The number of respondents (or the size of memory needed for the count) scales logarithmically with the size of the network • Provides looser error bounds but uses significantly less memory or communication
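
Uniform sampling can be sketched as follows: only a random subset of readings is collected and its mean stands in for the true average. This is a minimal sketch with a hypothetical function name; it ignores packet loss, which, as noted above, can bias the sample.

```python
import random


def sampled_average(readings, sample_size, rng=None):
    """Estimate the mean of readings from a uniform sample without replacement."""
    if not readings:
        raise ValueError("no readings to sample")
    rng = rng or random.Random()
    sample = rng.sample(list(readings), min(sample_size, len(readings)))
    return sum(sample) / len(sample)
```

Querying 10 of 100 nodes this way cuts the number of responses by a factor of ten, at the cost of an estimate whose error depends on the variance of the readings.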

Complex query evaluation • R x S x T • What order to follow? • (RxS)xT or Rx(SxT) or (RxT)xS • Decided by query optimizer • Usually depends on table size • With Sensornet DB • Need adaptive policy to route tuples based on • Energy consumption • Topology • Loss rates
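
As a rough illustration of ordering by table size, the sketch below joins first the pair of tables whose cross-product is smallest. This heuristic is an assumption made for the example: real optimizers use selectivity estimates, and a sensornet policy would also weigh energy, topology, and loss rates as listed above. `choose_first_join` is a hypothetical name.

```python
from itertools import combinations


def choose_first_join(sizes):
    """Pick the pair of tables with the smallest cross-product to join first.

    sizes maps table name -> estimated row count. Returns (pair, remaining).
    """
    names = sorted(sizes)
    pair = min(combinations(names, 2), key=lambda p: sizes[p[0]] * sizes[p[1]])
    remaining = [n for n in names if n not in pair]
    return pair, remaining


# With these assumed sizes, S and T are joined first, then R.
print(choose_first_join({"R": 1000, "S": 10, "T": 50}))
```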

Conclusions • Explosion of data from sensor networks needs an infrastructure for access, storage etc • Organizing sensors • Datacubes • Other techniques? • Identifying relevant sensors is a prerequisite to fetching data • Dataspace provided two solutions • Other approaches?

Conclusions • Sensornets as Distributed DB • Provide a database view to sensornet data • Pros • App development easy • In-network processing helps resource usage • Cons • Distributed DB can be difficult • Requires retooling DB operations for sensornets • Other approaches?

Representations for Device Functions • Internal Representation • We can't use traditional OO DB methods • they all demand immediate access • with the asynchronous nature of sensornets this is unacceptable

Overview • Direction of sensor network progress • Small form-factor devices • On-board computation • Wireless communication • Increased sensing capabilities • Improved OS and networking functionalities • Prediction: • Every device (> $1) will have some sensor • Ubiquitous presence of sensor networks

Overview • Typical sensor networks usage: • Sense, collect and convey data • Provides a ubiquitous computing platform • Applications query/monitor sensed data • Ecosystem dynamics • Temperature/weather sensing • Automobile traffic analysis • Data-centric network, generated data more important than node identity

Requirements • Addressing • Identify relevant sensors • How to access/process data? • Communicate data and process centrally • Compute query at node and perform DB operations • Interface for querying/monitoring and control

What to do with data? • Answer queries/give useful info • How ?? • Centralized approach • Communicate data • Store and process all data at central location (traditional DB approach) • Is all temporal data to be stored? • Communication overhead?

What to do with data? • De-centralized approach • Communicate query (query routing) • Required data attribute of node • Node stores and communicates data to queries • Processing at node • Computation overhead • Computation overhead smaller than communication! • How to aggregate data? • How to route queries? • How to map nodes to addresses for communication purposes?

Need for Decentralization • Centralized (Traditional databases) • Inefficient use of resources • Large amounts of data communicated to central location • All sensors send data all the time • Dissociates access to device from query load • Communication more expensive than computation • Decentralized (Distributed DBs) • Data on devices • In-network query processing
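
The communication argument can be made concrete by counting link transmissions for one round of an aggregate query over a routing tree. The sketch assumes every node holds one reading, ignores message sizes and losses, and uses hypothetical function names; `parent` maps each node to its parent, with the root mapped to `None`.

```python
def depth(node, parent):
    """Number of hops from node up to the root."""
    hops = 0
    while parent[node] is not None:
        node = parent[node]
        hops += 1
    return hops


def centralized_messages(parent):
    """Every raw reading is relayed hop by hop all the way to the root."""
    return sum(depth(n, parent) for n in parent)


def in_network_messages(parent):
    """Each non-root node sends one aggregated message to its parent."""
    return sum(1 for n in parent if parent[n] is not None)


# Hypothetical 4-node chain: c -> b -> a -> root.
chain = {"root": None, "a": "root", "b": "a", "c": "b"}
print(centralized_messages(chain), in_network_messages(chain))
```

On a chain the centralized count grows quadratically with the number of nodes while the in-network count grows linearly, which is the gap the decentralized design exploits.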

Pipelining Benefits • Provide streamed partial answers, hence, can enable query refinement • Schemes like ripple joins form a low energy approach to obtain approximate answers and can be used together with sampling
