Data Management in Sensor Networks By Jinbao Li, Zhipeng Cai, and Jianzhong Li
Introduction • The purpose of data management in sensor networks is to separate the logical view (name, access, operation) from the physical view of the data • Users and applications need only be concerned with the logical structure of queries, not with the details of the sensor network • From the data management point of view, the data management system of a sensor network can be seen as a distributed database system, but it differs from traditional ones
• The data management system of a sensor network organizes and manages the sensed information from the monitored area and answers queries from users or applications • A sensor network data management system is a system for extracting, storing and managing sensor data • This chapter discusses the methods and techniques of data management in sensor networks, including • the difference between data management systems in sensor networks and in traditional distributed database systems • the architecture of a data management system in a sensor network
• the data model and the query language • the storing and indexing techniques of sensor data • the query processing techniques • two examples of data management systems in sensor networks: TinyDB and Cougar
1. Difference Between Data Management Systems In Sensor Networks and In Distributed Database Systems
WSN vs. Distributed Database • In traditional distributed database systems, data management and query processing are just applications of the network system, and the details of the network need not be of concern. In a sensor network, by contrast, the details of the network must be taken into account • The data produced by a sensor is an infinite data stream, and traditional database systems cannot manage infinite data streams.
WSN vs. Distributed Database (cont.) • The data sensed by sensors are not always accurate • The data management system of a sensor network needs to reduce wasted power to extend the network lifetime • Traditional database systems do not have the ability to process long-running queries
WSN vs. Distributed Database (cont.) • Query processing techniques in a traditional database system are not suitable for sensor networks • query optimization technique • locks and unlocks the data • infinite and uncertain data streams • real-time query processing
WSN vs. Distributed Database (cont.) • The amount of data from sensors is very large and not all of the data can be stored • Sensor networks need power-efficient, in-network distributed data processing algorithms
Four system models • centralized model • semi-distributed model • distributed model • hierarchical model
2.1 Centralized Model • In a centralized model, query processing and access to sensor networks are separated • The centralized approach proceeds in two steps • data is extracted from the sensor network in a predefined way and is stored in a database located on a central server • query processing takes place on the centralized database
Centralized Model (cont.) • Disadvantages: • the central server is the performance bottleneck and single point of failure • all sensors are required to send data to the central server, which incurs large communication cost
2.2 Semi-distributed Model • Certain computations can be performed on the raw data at each sensor node • Two representative systems for this model: Fjord and Cougar • Fjord, part of Telegraph (a project under development at UC Berkeley), is an adaptive dataflow system • Fjord has two major components: an adaptive query processing engine and a sensor proxy
Fjord • Fjord's query processing engine combines push and pull mechanisms • In Fjord, data streams (infinite sequences of [tuple, timestamp] pairs) are pushed to the query processing engine instead of being pulled as in traditional database systems • At the same time, non-sensor data is pulled by the query processing engine • Fjord integrates Eddy to adaptively change the execution plan according to the computing environment on a tuple-by-tuple basis.
Fjord (cont.) • The second major component of Fjord is the sensor proxy, an interface between a single sensor and the query processor, as shown in Figure 1 • A sensor node only needs to deliver data to its sensor proxy. The sensor proxy then delivers the data to the query processor. • The sensor proxy directs sensors to perform certain local computations to aggregate samples in a predefined way. • The sensor proxy actively monitors sensors, evaluates user needs and current power conditions, and programs and controls each sensor's sampling rate and delivery rate to achieve acceptable battery lifetime and performance
Cougar • Cougar is a sensor database project developed at Cornell University [7, 4] • The basic idea of this project is to push as much computation as possible into the sensor network to reduce the communication between sensor nodes and front-end server(s) • In this model, the query workload determines the data that should be extracted from sensors. Only relevant data are extracted from the sensor network
2.3 Distributed Model • In this model, each sensor is assumed to have high storage, computation and communication capabilities • Distributed Hash Table (DHT) • Each sensor samples, senses and detects events. A hash function is then applied to the event key, and the event is stored at a "home" sensor node, the node closest to the hash value of the event key. • To process a query, the same hash function is applied first. The query is then sent to the node with the closest hash value for further processing
Distributed Hash Table (cont.) • This model pushes all computation and communication to sensor nodes. • The problem with this model is that sensors are assumed to have almost the same communication and computation capabilities as normal computers • A DHT is suitable only for key-based queries, and it incurs a large communication cost.
2.4 Hierarchical model • This model includes two layers: sensor network layer and proxy network layer • This model combines in-network programming, adaptive query processing and efficient content-based search techniques • Sensors have three functions: receiving commands from proxy, performing local computation and delivering data to proxy • Sensor nodes receive control commands including sampling rate, delivery rate and operations that need to be performed from the proxy layer.
• Proxies have five functions: receiving queries from users, issuing control commands and other information to sensors, receiving data from sensors, processing queries, and delivering query results to users • Each proxy processes the query in a decentralized way and delivers the results to the users. • Computation and communication loads are distributed among all proxies.
3.1 Data Model • The data model in TinyDB simply extends the traditional relational data model. It defines the sensed data as a single, infinitely long logical table • This table conceptually contains one row for each reading generated by any sensor, and hence the table can be thought of as streaming infinitely over time • Table 1 is an example of the relational table for TinyDB.
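The idea of a single, infinitely long logical table can be sketched as a Python generator. This is only an illustration: the column names (`epoch`, `nodeid`, `light`, `temp`), the value range and the function name `sensors_table` are assumptions, not taken from Table 1 or from TinyDB itself.

```python
import itertools
import random


def sensors_table(node_ids, seed=0):
    """Yield rows of a conceptually infinite table: one row per node per epoch.

    Column names and value ranges are hypothetical placeholders.
    """
    rng = random.Random(seed)
    for epoch in itertools.count():          # never terminates on its own
        for node in node_ids:
            yield {
                "epoch": epoch,
                "nodeid": node,
                "light": rng.randint(0, 1023),
                "temp": rng.randint(0, 1023),
            }


# Because the table never ends, a consumer can only ever look at a finite prefix.
first_rows = list(itertools.islice(sensors_table([1, 2, 3]), 6))
for row in first_rows:
    print(row)
```

The generator makes the streaming nature explicit: rows do not exist until they are sampled, and a query can only be evaluated over the rows seen so far.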
Data Model (cont.) • The Cougar system developed at Cornell University views a sensor network as a large distributed database system • Each sensor corresponds to a node in a distributed database system and stores part of the data • Cougar does not send the data at each sensor to a central node for storage and processing. Instead, it tries to process the data in a distributed way within the sensor network
3.2 Query Language • Query schemes proposed include snapshot query, continuous query, event-based query, life-cycle based query and accuracy-based query • TinyDB's query language is based on SQL, and we will refer to it as TinySQL • TinySQL supports selection, projection, setting the sampling rate, group aggregation, user-defined aggregation, event triggers, lifetime queries, setting storage points and simple joins
TinySQL • The grammar of the TinySQL query language is as follows:
SELECT select-list
[FROM sensors]
WHERE predicate
[GROUP BY gb-list]
[HAVING predicate]
[TRIGGER ACTION command-name[(param)]]
[EPOCH DURATION time]
TinySQL (cont.) • Examples of TinyDB queries (l, v and thresh stand for user-supplied thresholds):
SELECT room_no, AVERAGE(light), AVERAGE(volume)
FROM sensors
GROUP BY room_no
HAVING AVERAGE(light) > l AND AVERAGE(volume) > v
EPOCH DURATION 5min

SELECT temp
FROM sensors
WHERE temp > thresh
TRIGGER ACTION SetSnd(512)
EPOCH DURATION 512
TinySQL (cont.)
SELECT nodeID, light
FROM sensors
WHERE light > 200
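To make the semantics of the grouped query concrete, the following Python sketch evaluates the GROUP BY / HAVING example over the readings of a single epoch. It is not TinyDB code: the function name `avg_by_room`, the dictionary layout of a reading and the sample values are assumptions chosen for illustration.

```python
from collections import defaultdict


def avg_by_room(readings, light_min, volume_min):
    """Emulate, for one epoch, the query
        SELECT room_no, AVERAGE(light), AVERAGE(volume) FROM sensors
        GROUP BY room_no
        HAVING AVERAGE(light) > light_min AND AVERAGE(volume) > volume_min
    """
    groups = defaultdict(list)
    for r in readings:                      # GROUP BY room_no
        groups[r["room_no"]].append(r)

    result = []
    for room in sorted(groups):
        rows = groups[room]
        avg_light = sum(r["light"] for r in rows) / len(rows)
        avg_volume = sum(r["volume"] for r in rows) / len(rows)
        # HAVING filters whole groups, after aggregation.
        if avg_light > light_min and avg_volume > volume_min:
            result.append((room, avg_light, avg_volume))
    return result


# Hypothetical readings collected during one epoch.
epoch = [
    {"room_no": 1, "light": 300, "volume": 40},
    {"room_no": 1, "light": 500, "volume": 60},
    {"room_no": 2, "light": 100, "volume": 90},
]
print(avg_by_room(epoch, light_min=200, volume_min=30))  # [(1, 400.0, 50.0)]
```

In a running system the same evaluation would be repeated once per EPOCH DURATION, each time over the newest readings.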
Query Language in Cougar • Cougar also provides a SQL-like query language • The grammar of the query language in Cougar is as follows:
SELECT select-list
FROM [Sensordata S]
[WHERE predicate]
[GROUP BY attributes]
[HAVING predicate]
DURATION time-interval
EVERY time-span
Query Language in Cougar (cont.) • An example is given below:
SELECT AVG(R.concentration)
FROM ChemicalSensor R
WHERE R.loc IN region
HAVING AVG(R.concentration) > 0.6
DURATION (now, now+3600)
EVERY 10
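The DURATION and EVERY clauses describe a long-running query that is re-evaluated periodically. The Python sketch below simulates that behaviour with a logical clock instead of real time. The function names, the `sample` callback and the choice to evaluate at offsets 0, 10, 20, ... are assumptions for illustration, not Cougar's actual implementation.

```python
def run_long_query(sample, region, duration, every, threshold=0.6):
    """Simulate: SELECT AVG(concentration) ... HAVING AVG(...) > threshold
    DURATION (now, now + duration) EVERY every.

    `sample(sensor, t)` is a hypothetical callback returning one reading.
    Returns a list of (time offset, average) pairs that passed the HAVING test.
    """
    results = []
    for t in range(0, duration, every):     # one evaluation per period
        values = [sample(sensor, t) for sensor in region]
        if not values:                      # no sensor in the region
            continue
        avg = sum(values) / len(values)
        if avg > threshold:
            results.append((t, avg))
    return results


def sample(sensor, t):
    """Hypothetical reading: the concentration rises above 0.6 at t = 20."""
    return 0.7 if t >= 20 else 0.5


print(run_long_query(sample, ["s1", "s2"], duration=40, every=10))
```

With `duration=3600` and `every=10`, as in the example query, the loop performs 360 evaluations.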
Storage and Index Techniques in Sensor Networks • For sensor networks, one of the most challenging problems is how to name data • In data-centric storage systems, every data item generated by a sensor is stored at some sensor(s) in the network according to its name • Because the storage location is determined by the name, the corresponding data can be found in the sensor network in the same way
4.1 Data Centric Naming • Hierarchical naming: for example, data generated by a camera sensor may be named US/Universities/USC/CS/camera1 • Attribute-value naming: the same data may instead be described by a set of attribute-value pairs:
type = camera
value = image.jpg
location = "CS Dept, Univ. of Southern California"
• These naming schemes implicitly define a set of ways in which the data may be accessed
4.2 The Performance of Data-centric Storage Systems • Data names are used when storing and retrieving data in data-centric storage. The system uses a mapping from the name of a data item to a sensor node to decide where the data is stored • Figure 4 describes such a data-centric storage algorithm [16, 18]. Assume sensor nodes A and B want to insert data named bird-sighting and this name hashes to node C, so the data is routed to node C by the routing protocol. • Similarly, a query uses the name of the data to determine the location where the data is stored, and the query is sent to that sensor.
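A naming scheme determines how data can be looked up. As a rough illustration, the Python sketch below shows a prefix lookup for hierarchical names and a pattern match for attribute-value names. The helper names `under` and `matches` are hypothetical and not part of any system described here.

```python
def under(name, prefix):
    """Hierarchical naming: True if `name` lies in the subtree named by `prefix`."""
    name_parts = name.split("/")
    prefix_parts = prefix.split("/")
    return name_parts[:len(prefix_parts)] == prefix_parts


def matches(name, pattern):
    """Attribute-value naming: True if every attribute in `pattern`
    has the same value in `name`."""
    return all(name.get(attr) == value for attr, value in pattern.items())


hierarchical_name = "US/Universities/USC/CS/camera1"
attribute_name = {
    "type": "camera",
    "value": "image.jpg",
    "location": "CS Dept, Univ. of Southern California",
}

print(under(hierarchical_name, "US/Universities/USC"))   # True
print(matches(attribute_name, {"type": "camera"}))       # True
print(matches(attribute_name, {"type": "microphone"}))   # False
```

Hierarchical names favour access by prefix (everything under USC), while attribute-value names allow access by any combination of attributes.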
The Performance of Data-centric Storage Systems (cont.) • Besides data centric storage, we consider two alternatives: • an External Storage scheme in which all events are stored at a node outside the network; • for external storage, the cost of accessing events is zero, since all events are available at one node • there is an energy cost in sending events to this node, and significant energy is spent at nodes near the external node in receiving all these events (these nodes become hot-spots) • a Local Storage scheme where each event is stored at the node at which it is generated; • local storage incurs zero communication cost in storing the data, but incurs a large communication cost, a network flood, in accessing the data
The Performance of Data-centric Storage Systems (cont.) • Analysis shows that the data-centric storage scheme becomes preferable as the size of the network increases, or when many more events are generated than are queried • Consider a network of n nodes, in which the cost of sending a message to all nodes (e.g., a flood) is O(n) and the cost of sending a message to a designated node is O(√n). See Table 3.
In Table 3, De is the total number of detected events, Q is the number of queries, and Dq is the number of events returned as answers to the Q queries
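The comparison of the three storage schemes can be explored numerically. Table 3 is not reproduced in this text, so the formulas in the sketch below are an assumption: they follow the usual asymptotic accounting in which a flood costs n messages and a message to a designated node costs √n messages, with constant factors dropped. The function name and the sample numbers are hypothetical.

```python
import math


def storage_costs(n, d_e, q, d_q):
    """Approximate message counts for three storage schemes (assumed model).

    n   : number of nodes
    d_e : total number of detected events (De)
    q   : number of queries (Q)
    d_q : number of events returned as answers (Dq)
    "total" is the network-wide message count, "hotspot" the load on the
    busiest node.
    """
    unicast = math.sqrt(n)   # assumed cost of reaching one designated node
    flood = n                # assumed cost of reaching every node
    return {
        # Every event is shipped to the external node.
        "external": {"total": d_e * unicast, "hotspot": d_e},
        # Storing is free; every query floods and answers are sent back.
        "local": {"total": q * flood + d_q * unicast, "hotspot": q + d_q},
        # Events, queries and answers are each routed to or from a home node.
        "data_centric": {
            "total": (q + d_e + d_q) * unicast,
            "hotspot": q + d_q,
        },
    }


costs = storage_costs(n=1_000_000, d_e=1000, q=50, d_q=100)
for scheme, cost in costs.items():
    print(scheme, cost)
```

Under these assumptions, data-centric storage avoids both the flood cost of local storage and the hot-spot of external storage once the network is large and only a fraction of the events is queried.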
4.3 Mechanisms for Data-centric Storage • the essence of a data-centric storage system is captured at its interface, which supports a put() operation that stores data by name within the network, and a get() operation that retrieves data by name • In this section, we describe a system called a Geographic Hash Table (GHT)
GHT • In a GHT, event names are hashed to a geographic location (i.e., an x, y coordinate); the location looks random, but every node computes the same location for the same name. This mapping is many-to-one • A put() operation stores an event at the node which is closest to the hashed location, and a get() operation retrieves one or more events from that node • Applications can determine which parts of an event name are used to compute the geographic hash value • Data is routed to the hashed node by the GPSR routing protocol
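The put()/get() interface of a geographic hash table can be sketched in a few lines of Python. This is a toy model under stated assumptions: the class `ToyGHT`, the use of SHA-256 as the hash function and the 100 x 100 deployment area are illustrative choices, and the home node is found with global knowledge of all positions rather than by GPSR routing.

```python
import hashlib
import math


def geo_hash(name, width, height):
    """Hash an event name to an (x, y) location inside the deployment area."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    x = int.from_bytes(digest[:4], "big") / 2**32 * width
    y = int.from_bytes(digest[4:8], "big") / 2**32 * height
    return (x, y)


class ToyGHT:
    """Toy geographic hash table; node lookup uses global knowledge."""

    def __init__(self, node_positions, width=100.0, height=100.0):
        self.positions = dict(node_positions)        # node id -> (x, y)
        self.width, self.height = width, height
        self.storage = {node: {} for node in self.positions}

    def home_node(self, name):
        """Node closest to the hashed location (ties broken by node id)."""
        target = geo_hash(name, self.width, self.height)
        return min(self.positions,
                   key=lambda n: (math.dist(self.positions[n], target), n))

    def put(self, name, event):
        node = self.home_node(name)
        self.storage[node].setdefault(name, []).append(event)
        return node

    def get(self, name):
        return list(self.storage[self.home_node(name)].get(name, []))


ght = ToyGHT({"A": (10, 10), "B": (90, 10), "C": (50, 80)})
first = ght.put("bird-sighting", "seen by A")
second = ght.put("bird-sighting", "seen by B")
print(first == second)               # True: same name, same home node
print(ght.get("bird-sighting"))      # ['seen by A', 'seen by B']
```

Because the hash is deterministic, the node that stores an event and the node that answers a query for it are found independently, without any directory.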
GPSR: an Overview • GPSR is a geographic routing protocol that was originally designed for mobile ad-hoc networks • Given the coordinates of a node, GPSR routes a packet to that node using location information only • GPSR contains two different algorithms: greedy forwarding and perimeter forwarding
GPSR (cont.) • Greedy forwarding • Assume each node in a network knows its own location and the locations of its neighbors • When a node receives a message destined for location D, it sends the message to a neighbor C which is closer to D than itself • Such a neighbor might not always exist; in this case, GPSR invokes perimeter routing at that node
GPSR (cont.) • Perimeter routing • When a packet finds itself at a node which has no neighbors closer to the destination than itself, we say that the packet has encountered a void • Voids can result from irregular deployment of nodes, as well as from radio-opaque obstacles • A natural technique to route around a void is the right-hand rule (Figure 5) • According to this rule, a traversal walks around the perimeter of the void • When this traversal reaches a node that is closer to D than A, the node at which the void was encountered, greedy forwarding resumes.
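Greedy forwarding can be sketched as follows. The Python code is an illustration only: nodes are represented by their (x, y) positions, neighbors are all nodes within an assumed `radio_range`, and the function names are hypothetical. Among the neighbors that are closer to the destination, the sketch picks the one closest to it, and it reports a void instead of switching to perimeter routing.

```python
import math


def greedy_next_hop(current, neighbors, dest):
    """Return the neighbor closest to `dest`, or None if no neighbor is
    strictly closer to `dest` than `current` (a void)."""
    best = min(neighbors, key=lambda p: math.dist(p, dest), default=None)
    if best is not None and math.dist(best, dest) < math.dist(current, dest):
        return best
    return None


def greedy_route(start, dest, positions, radio_range):
    """Forward greedily from `start` toward `dest`.

    Returns (path, delivered). `delivered` is False when greedy forwarding
    gets stuck, which is where GPSR would invoke perimeter routing.
    """
    path = [start]
    current = start
    while current != dest:
        neighbors = [p for p in positions
                     if p != current and math.dist(p, current) <= radio_range]
        nxt = greedy_next_hop(current, neighbors, dest)
        if nxt is None:
            return path, False
        path.append(nxt)
        current = nxt
    return path, True


line = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(greedy_route((0, 0), (3, 0), line, radio_range=1.0))

# The only neighbor of (0, 0) is farther from the destination: a void.
gap = [(0, 0), (0, 1), (3, 0)]
print(greedy_route((0, 0), (3, 0), gap, radio_range=1.0))
```

The loop always terminates because each hop strictly reduces the distance to the destination.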
GPSR (cont.) • Assume that GHT hashes an event to a destination location d, and, without loss of generality, that no node exists at that location (Figure 6) • When a packet returns in perimeter mode to the node that originated the perimeter traversal, the corresponding event is stored at that node.