
Hierarchical distributed data classification in wireless sensor networks

Hierarchical distributed data classification in wireless sensor networks. Xu Cheng, Ji Xu, Jian Pei, Jiangchuan Liu. School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada (to appear in Computer Communications 2010, Elsevier). Presented by Binh Tran.



Presentation Transcript


  1. Hierarchical distributed data classification in wireless sensor networks Xu Cheng, Ji Xu, Jian Pei, Jiangchuan Liu School of Computing Science, Simon Fraser University Burnaby, British Columbia, Canada (To appear in Computer Communications 2010, Elsevier) • Presented by Binh Tran • 03/24/2010

  2. Outline • Introduction • Classification Basis • Decision-tree-based hierarchical distributed classification • Results • Conclusion

  3. Introduction (1/2) • Problem: given the huge amount of sensed data, classifying them becomes a critical task in many applications • For example, in wildlife monitoring, sensor nodes continuously sense physical phenomena (temperature, humidity, sunlight, number of animals, etc.). If the readings exceed given thresholds, the environment is suitable. After learning the relation between the physical phenomena and the class labels from training data, we can answer inquiries about an environment coming from an external source; these inquiries are the unseen data. • Many applications: • Wildlife monitoring • Military target tracking and surveillance • Hazardous environment exploration • Natural disaster relief • Goal: high classification accuracy with low storage and communication overhead

  4. Introduction (2/2) • Solution: a novel decision-tree-based hierarchical distributed classification in energy-constrained sensor networks. • Organizes the sensor nodes and performs classification in a localized, iterative and bottom-up manner, utilizing a decision tree method. • Starting from each leaf node, a classifier is built based on its local training data • An upstream node builds a new classifier based on the classifiers from its children • These local classifiers are iteratively enhanced from bottom to top and finally reach the base station, yielding a global classifier for all the data distributed across the sensor nodes. • The energy consumption for transmission can be significantly reduced. • Observations: • A high tree saves more energy, but a wide tree achieves higher accuracy. • Training a new classifier from a mix of downstream classifiers and the local training dataset requires generating a pseudo training dataset

  5. Classification basis (1/2): Classification • Classification is the task of assigning objects to one of several predefined categories. • Two-step process of data classification • Step 1: • A model is built that describes a pre-determined set of data classes or concepts using database samples. • Each sample is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. • Step 2: • The model is used for classification. • First, the predictive accuracy of the model (or classifier) is estimated using a test set of class-labeled samples. • If the accuracy of the model is considered acceptable, the model can be used to classify future data whose class label is not known.

  6. Classification basis (2/2): Decision tree (1/3) • A decision tree is a mapping from observations about an item to conclusions about its target value. • A decision tree has three types of nodes: • A root node: no incoming edges • Internal nodes: one incoming edge and two or more outgoing edges • Leaf nodes: one incoming edge and no outgoing edge. Each leaf node is assigned a class label. • Two common criteria for constructing the decision tree • [Information gain]: the simplest criterion, using the entropy measure $Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$, where S is the dataset, c is the number of classes, and $p_i$ is the proportion of each class. • The information gain is calculated as $Gain(S, A) = Entropy(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|} Entropy(S_v)$, where V(A) is the set of all possible values for attribute A, and $S_v$ is the subset of S for which attribute A has value v. (See the sketch below.)
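
As a companion to the formulas above, here is a minimal Python sketch of the entropy and information-gain measures. The dataset representation (a list of dicts with a "class" key) is an illustrative assumption, not the paper's data structure.

```python
import math
from collections import Counter

def entropy(samples):
    """Entropy(S) = -sum_i p_i * log2(p_i), over the class proportions in S."""
    counts = Counter(s["class"] for s in samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(samples, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = len(samples)
    remainder = 0.0
    for value in {s[attribute] for s in samples}:
        subset = [s for s in samples if s[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(samples) - remainder
```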

  7. Classification basis (2/2): Decision tree (2/3) • [Gain ratio]: there exists a natural bias in information gain, as it favors attributes with many values • For example, in weather forecasting, the attribute "date" may have the highest information gain, but it leads to a very broad decision tree of depth one and is inapplicable to any future data. • Gain ratio (a more advanced criterion) penalizes such attributes by incorporating split information

  8. Classification basis (2/2): Decision tree (3/3) • The split information $SplitInformation(S, A) = -\sum_{v \in V(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$ is sensitive to how broadly and uniformly the attribute splits the data. • The gain ratio is calculated as $GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}$ (see the sketch below) • Many other methods exist to build a decision tree • ID3 algorithm: recursively constructs a tree in a top-down, divide-and-conquer manner. It uses information gain as the measure to determine the best attribute. • C4.5 algorithm: an improvement of ID3. • However, neither ID3 nor C4.5 can directly combine the local classifiers • ⇒ An enhanced C4.5 algorithm is needed
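
Continuing the sketch above (it reuses entropy, information_gain and Counter from that snippet), the gain ratio might look as follows; again an illustrative sketch, not the paper's code.

```python
def split_information(samples, attribute):
    """SplitInformation(S, A) = -sum_v |S_v|/|S| * log2(|S_v|/|S|)."""
    total = len(samples)
    counts = Counter(s[attribute] for s in samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(samples, attribute):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    split_info = split_information(samples, attribute)
    if split_info == 0.0:          # attribute takes a single value in S
        return 0.0
    return information_gain(samples, attribute) / split_info
```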

  9. Decision-tree-based hierarchical distributed classification (1/5): System overview • N sensor nodes are distributed in a field. • Each node collects data within its area. • The data reporting follows a spanning tree rooted at the base station • Each sensor node first collects its local training data. If it is a leaf node, it builds a classifier by a learning algorithm. • The node then sends the classifier to its parent node • (decision tree = "classifier") • An upstream node combines its children's classifiers with its local training data to build an enhanced classifier. • From bottom to top, the classifiers reach the base station, yielding a global classifier • For each child node, the parent generates a set of pseudo training data from the child's classifier, then combines all these data with its own training data to build the enhanced classifier • The pseudo training data should closely reflect the characteristics of the original training data. The amount of pseudo data is also an important concern.

  10. Decision-tree-based hierarchical distributed classification (2/5): Constructing Spanning Tree (1/3) • Questions: • How does the shape of the tree affect the result of the classification? • How do the height and width of the tree affect the energy consumption and the accuracy of the classifier?

  11. Decision-tree-based hierarchical distributed classification (2/5): Constructing Spanning Tree (2/3) • We consider two kinds of spanning tree • T1: each sensor node is the child of the next node closer to the base station (a chain) • T2: all sensor nodes are children of the base station (a star) • The energy consumption is proportional to the transmitted data size and the square of the distance, where s is the data size transmitted by a sensor node and L is the distance between the base station and the farthest node • Node 0: base station; nodes 1…n: sensor nodes

  12. Decision-tree-based hierarchical distributed classification (2/5): Constructing Spanning Tree (3/3) • The energy consumption is proportional to the transmitted data size and the square of the distance, where s is the data size transmitted by a sensor node and L is the distance between the base station and the farthest node (see the derivation sketch below) • Observations: • In terms of energy consumption, T2 is clearly larger than T1 • To save more energy, a high tree (T1) is preferred • On the other hand, noise in the classification is inevitable, and the noise accumulates along the spanning tree. • Hence, the larger the height is, the less accurate the classifier will be. • Therefore, we need an approach to control the shape of the spanning tree during its construction
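
The slide's energy formulas were lost in the transcript; the following is a hedged reconstruction assuming n sensor nodes evenly spaced along a line of length L (node i at distance iL/n from the base station), each transmitting s units of data, with per-hop cost proportional to s·d². Under these assumptions T2 is indeed much more expensive than T1:

```latex
% Chain (T1): each node forwards its classifier over one hop of length L/n.
% Star (T2): node i transmits directly over distance iL/n.
\begin{align*}
E_{T_1} &\propto \sum_{i=1}^{n} s\left(\tfrac{L}{n}\right)^{2} = \frac{sL^{2}}{n},\\
E_{T_2} &\propto \sum_{i=1}^{n} s\left(\tfrac{iL}{n}\right)^{2} = \frac{sL^{2}(n+1)(2n+1)}{6n},\\
\frac{E_{T_2}}{E_{T_1}} &= \frac{(n+1)(2n+1)}{6} \gg 1 \quad \text{for large } n .
\end{align*}
```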

  13. Decision-tree-based hierarchical distributed classification (3/5): Building Spanning Tree • Local decision trees are built by an enhanced version of the widely used C4.5 algorithm • C4.5 algorithm: • At each node of the tree, choose the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. • The attribute with the highest normalized information gain is chosen to make the decision • The algorithm then recurses on the smaller sublists. • However, C4.5 does not keep information about the attribute distribution or the amount of the original training data, preventing pseudo data from being recovered from a classifier. • Solution: each leaf node records the count of each class, thus keeping knowledge about the amount of samples used to build each branch of the decision tree (see the sketch below). • The basic C4.5 stops if all samples belong to the same class. The information of the other attributes is then missing, which can cause problems with a heterogeneous data distribution across different sensor nodes. • E.g. if all the training data satisfy {temperature: <10 degrees; humidity: <20%, class: negative}, then with the basic C4.5 algorithm only one attribute, say temperature, will appear in the decision tree, and the information about humidity is completely missing. This leads to a set of pseudo data generated with humidity uniformly distributed from 0% to 99%, which is clearly not the case for the original data.
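
A minimal sketch of the leaf bookkeeping described above: each leaf records the per-class counts and the attribute constraints along its branch, so that pseudo data can later be regenerated. The LeafNode type and the attribute names are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class LeafNode:
    """Leaf of the enhanced C4.5 tree: keeps the count of every class (not just
    the majority label) and the attribute constraints of its branch."""
    class_counts: dict = field(default_factory=dict)  # e.g. {"positive": 10, "negative": 90}
    constraints: dict = field(default_factory=dict)   # e.g. {"temperature": (10, 20), "sunlight": "normal"}

    @property
    def label(self):
        # The prediction is still the majority class, but the counts are preserved
        # so that noise information survives when classifiers are merged.
        return max(self.class_counts, key=self.class_counts.get)
```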

  14. Decision-tree-based hierarchical distributed classification (4/5): Generating Pseudo Data • Pseudo data generation is one of the most important steps. A critical challenge is to generate data that are as close to the original data as possible. • The distribution of each attribute should closely resemble that of the original data. • How many pseudo data should be generated? Ideally the same amount as the original data, but then a sensor node close to the base station would have to generate a huge amount of pseudo data, which is impossible given the limited memory of the sensor. • A "preservation factor", ranging from 0 to 1 (the base station always has a factor of 1), controls the amount of generated pseudo data. • For example, suppose a decision tree leaf node represents a rule of when • temperature is between 10 and 20, • humidity is between 20 and 40, and • sunlight is normal. • There are 10 positive and 90 negative class labels. • Assume preservation factor = 0.6; we then randomly generate (10+90)x0.6 = 60 data items that satisfy the attribute requirement. • Each item has a probability 10/100 = 0.1 of being assigned a positive class label and 0.9 of being negative (see the sketch below). • The original data are partitioned among the decision tree leaf nodes, so the combined pseudo data will largely reflect the characteristics of the original training data.
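
Following the worked example above (60 pseudo items, each positive with probability 0.1), a possible sketch of pseudo data generation for one leaf; it reuses the hypothetical LeafNode from the previous sketch and is not the paper's code.

```python
import random

def generate_pseudo_data(leaf, preservation_factor):
    """Generate roughly (count at leaf) * preservation_factor pseudo samples:
    attribute values are drawn within the leaf's constraints and labels follow
    the recorded class proportions."""
    total = sum(leaf.class_counts.values())
    n_pseudo = int(round(total * preservation_factor))
    labels = list(leaf.class_counts)
    weights = [leaf.class_counts[label] for label in labels]
    pseudo = []
    for _ in range(n_pseudo):
        sample = {}
        for attr, constraint in leaf.constraints.items():
            if isinstance(constraint, tuple):    # numerical range, e.g. (10, 20)
                sample[attr] = random.uniform(*constraint)
            else:                                # categorical value, e.g. "normal"
                sample[attr] = constraint
        sample["class"] = random.choices(labels, weights=weights, k=1)[0]
        pseudo.append(sample)
    return pseudo
```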

  15. Decision-tree-based hierarchical distributed classification (5/5): Hierarchical Classification • The sensor nodes in the network are organized by a spanning tree. • Leaf node: builds the decision tree with its locally sensed training data and sends the decision tree to its parent. • An intermediate node periodically checks whether there is any new classifier from its children. If yes, it generates a set of pseudo data for each new classifier and combines them with its local data. • If the node has not built a classifier before, it combines the generated pseudo data with its local data and performs the learning algorithm. • Once a classifier has been built, the previously received classifiers are discarded; the node generates pseudo data from its own classifier with factor = 1 and combines them with the other pseudo datasets to build the new decision tree. • The base station builds the global classifier; it waits for all children's classifiers before building it. (A simplified sketch of the procedure follows.)
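
A much simplified, single-round sketch of the bottom-up procedure (the real protocol is asynchronous and incremental, as the slide describes). The spanning tree is a nested dict, train_fn is any learner over labelled samples that returns a classifier exposing leaves() as LeafNode objects, and generate_pseudo_data is the sketch above; all of these names are assumptions for illustration.

```python
def hierarchical_classify(node, train_fn, preservation_factor, is_base_station=True):
    """Recursively build the classifier for `node`, replacing each child's raw
    training data with pseudo data regenerated from the child's classifier."""
    training = list(node["local_data"])
    for child in node.get("children", []):
        child_clf = hierarchical_classify(child, train_fn, preservation_factor,
                                          is_base_station=False)
        # Only the classifier crosses the link; pseudo data are regenerated here.
        factor = 1.0 if is_base_station else preservation_factor
        for leaf in child_clf.leaves():
            training.extend(generate_pseudo_data(leaf, factor))
    return train_fn(training)
```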

  16. Decision-tree-based hierarchical distributed classification (6/6): Further Discussion • Inherits the effectiveness and efficiency of C4.5 when building classifiers. • To keep high accuracy and achieve the goals of saving energy and storage space: • Class count: records the counts and indicates the distribution of all the class labels in the original dataset. In other words, it keeps the "noise" information, because the noise may be important under a heterogeneous data distribution. • For example, suppose one decision tree branch has a negative label because the numbers of positive and negative training data satisfying the constraint are 1 and 9, while another decision tree branch has a positive label for the same attribute constraint because the counts are 99 and 1. Without recording the class label counts, we have no idea which one is more accurate and may treat them equally. In fact, combining the two training sets, we obtain a positive label with class counts 100 versus 10. The problem is particularly severe with heterogeneous data. • Preservation factor: determines the amount of pseudo data to generate, because of the limited memory of the sensor node. The smaller the preservation factor, the more dominant the local training data are. • For example, suppose a node has 4 children, each having 200 training data items, and the node itself has 200 training data items. • If the preservation factor = 1.0, the node learns from 1000 data items, of which 20% are its local data. • If the factor = 0.1, the node learns from 280 data items, of which about 71% are its local data (see the sketch below). • Intuitively, • if the pseudo data represent the original data very well, the greater the factor, the more accurate the classifier is, because every area should be treated equally. • Otherwise, the greater the factor, the more noise it introduces, decreasing the accuracy. • How representative the pseudo data are of the original data is therefore crucial
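
The arithmetic in the preservation-factor example above, as a tiny sketch (the numbers come from the slide; the function name is illustrative):

```python
def training_mix(n_children=4, child_samples=200, local_samples=200, factor=1.0):
    """Return (total samples the node learns from, fraction that is local data)."""
    pseudo = n_children * child_samples * factor
    total = pseudo + local_samples
    return total, local_samples / total

print(training_mix(factor=1.0))  # (1000.0, 0.2)    -> 20% local data
print(training_mix(factor=0.1))  # (280.0, 0.714..) -> about 71% local data
```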

  17. Performance Evaluation (1/6): Configuration and Dataset • m x m randomly deployed sensor nodes (m = 7) • Spanning trees with heights of 3, 4 and 5 • Data consist of three attributes (temperature, humidity and sunlight) and a class label. • Temperature and humidity: numerical attributes ranging from 0 to 49 and 0 to 99 • The last attribute (sunlight): categorical attribute (weak, normal and strong) • The class label is positive or negative • Data are randomly generated (see the sketch below) • The noise level is the probability that a data item carries a class label other than the one determined by the rules (label noise, not transmission error) • 10 training datasets; five of the 10 have 1% noise, the rest 10% noise • For each dataset, two versions are made: • Heterogeneous data: the data distribution depends on the location of the sensor nodes. Each data item's coordinate is calculated from its attributes, and the item is assigned to a sensor that is not yet full (200 data items per sensor) • Homogeneous data: independent of location, randomly and uniformly distributed across the sensor nodes. All the training data are randomly divided and assigned to the sensors as local training data (i.e. each has 200 training data items) • 10 test datasets are generated, each with 2000 data items; five of the 10 have 1% noise, the rest 10% noise
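
A hedged sketch of how one synthetic training item with the attribute ranges above could be drawn; the labelling rule is left as a parameter because the slide does not give it, and the label is flipped with the stated noise probability.

```python
import random

def generate_sample(noise, rule):
    """Draw one synthetic item (temperature 0-49, humidity 0-99, categorical
    sunlight); `rule` maps the attributes to "positive"/"negative", and with
    probability `noise` the label is flipped (label noise, not transmission error)."""
    sample = {
        "temperature": random.randint(0, 49),
        "humidity": random.randint(0, 99),
        "sunlight": random.choice(["weak", "normal", "strong"]),
    }
    label = rule(sample)
    if random.random() < noise:
        label = "negative" if label == "positive" else "positive"
    sample["class"] = label
    return sample
```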

  18. Performance Evaluation (2/6): Baseline for Comparison • Ensemble method (E): constructs a set of base classifiers and takes a majority vote on their predictions for classification (see the sketch below). • It can significantly improve prediction accuracy, because if the base classifiers are independent, the ensemble makes a wrong prediction only if more than half of the base classifiers are wrong. • The best-possible accuracy of learning from the entire dataset (A): assume one sensor node collects all the data and builds the classifier.
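
The ensemble baseline's majority vote, as a short sketch (it assumes each base classifier exposes a predict method, which is an assumption for illustration):

```python
from collections import Counter

def ensemble_predict(classifiers, sample):
    """Every base classifier votes; the majority label wins
    (ties are broken by Counter's ordering)."""
    votes = Counter(clf.predict(sample) for clf in classifiers)
    return votes.most_common(1)[0][0]
```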

  19. Performance Evaluation (3/6): Impact of preservation factor, noise and height for heterogeneous data • With a small preservation factor (PF), a big height and large noise, the preservation factor does affect the accuracy. • With small noise (1%) and PF > 0.4, regardless of height, the preservation factor does not affect the accuracy. • With noise = 10%, a PF of at least 0.7 does not decrease the accuracy. • With a small PF and a large height, the accuracy is lower. • When the factor is large enough, the mechanism to generate pseudo data works quite well. • When the factor is small, the accuracy becomes relatively lower, because the areas close to the base station generally dominate (while different areas should be considered equally). • When the height is higher, the noise accumulates and the accuracy is reduced. • Compared with E, with a large PF the approach achieves much higher accuracy; the greater the noise, the larger the difference. This is because in E, for heterogeneous data, each sensor only learns part of the data (i.e. in E, only a few classifiers are responsible for a certain test item; if given unlabeled data much different from its training data, a base classifier essentially guesses a class label at random).

  20. Performance Evaluation (4/6): Comparison of enhanced and basic C4.5 algorithm (1/2) • Because the basic C4.5 is not suitable for generating pseudo data under a heterogeneous data distribution, the enhanced and basic C4.5 algorithms are compared on heterogeneous data with different noise levels. • With a large enough PF, regardless of height, the accuracy is not affected. • With a smaller PF and a larger height, the accuracy of the basic C4.5 algorithm is much lower than that of the proposed approach. • The difference is that when all samples belong to the same class label, the basic algorithm stops the recursion, while the enhanced approach continues. • If the basic ID3 were used, numerical attributes would be treated as categorical attributes, reducing the accuracy.

  21. Performance Evaluation (4/6): Comparison of enhanced and basic C4.5 algorithm (2/2)

  22. Performance Evaluation (5/6): Impact of training data distribution (1/2) • For heterogeneous data, with a small PF the accuracy is low. For homogeneous data, the PF does not affect the accuracy. With PF > 0.5, the two achieve similar accuracy. • This is because under a homogeneous data distribution all the nodes have similar training data, so the classifiers built are almost the same, which leads to high similarity between the generated pseudo data and the local training data. Thus, the accuracy is independent of the PF for homogeneous data. • In E, the accuracy for heterogeneous data is much lower than that for homogeneous data and that of the proposed approach.

  23. Performance Evaluation (5/6): Impact of training data distribution (2/2)

  24. Performance Evaluation (6/6): Comparison of energy consumption • [Figures: homogeneous data with 1% noise; heterogeneous data; homogeneous data with 10% noise] • The energy consumption of the proposed hierarchical classification is compared with that of the ensemble method. • The total energy consumption consists of computation and transmission energy; the computation energy is neglected. The transmission energy depends on the size of the transmitted data and the distance between two nodes. • The energy consumption of the approach is much lower than that of the ensemble method in all situations when the height > 1 (on average, saving 70% of the energy spent). The greater the height, the more energy is saved, because a classifier is only forwarded to the parent, whereas the ensemble approach forwards everything to the base station. • The data distribution does not significantly affect the energy consumption. • The noise does not affect the energy consumption much, because the noise does not change the transmission distance and does not noticeably change the size of the decision trees.

  25. Conclusion • A novel decision-tree-based hierarchical distributed classification approach in wireless sensor networks in which the data distribution is heterogeneous, a setting that has seldom been studied before. • Sensor nodes are organized in a spanning tree; local classifiers are built by individual nodes and merged along the routing path. • Pseudo data generated from children's classifiers are combined with new local data to build each classifier. • The approach maintains high classification accuracy with very low storage and communication overhead.
