Algorithms for Distributed Supervised and Unsupervised Learning Haimonti Dutta The Center for Computational Learning Systems (CCLS) Columbia University, New York.
About ME • BCSE from Jadavpur University, Kolkata • MS from Temple University, PA • Ph.D. from University of Maryland, Baltimore County • Research Interests • Machine Learning and Data Mining • Data Intensive Computing • Distributed Data Mining and Optimization • Website: www1.ccls.columbia.edu/~dutta
The Data Avalanche • High Energy Physics (CERN) • Subatomic particles are accelerated to nearly the speed of light and then collided • Measured at time intervals of 1 nanosecond • 1 petabyte of data • Astronomy (SDSS, 2MASS) • Telescopes observe galaxies, stars, quasars • On the order of 215 million objects, approx. 100–200 attributes per object • Genome Sequences (Human Genome Project) • Advanced Imaging (fMRI, CT scans) • Simulation Data (climate modeling, earth observation) • Time Series Data (EEG, ECG, ECoG)
Astronomy Sky Surveys • Example – Sloan Digital Sky Survey • Telescope observes galaxies, stars, quasars • A few hundred attributes for each observed object • Data Release 5 • 8000 square degrees • 215 million objects
Graph, Graph, Everywhere • [Figure: four example graphs: Aspirin; Yeast protein interaction network; Co-author network; Internet. Credit: from H. Jeong et al., Nature 411, 41 (2001)] • Slide borrowed from Dr. Jiawei Han’s tutorial on graph algorithms
Centralized versus Distributed Data Mining • [Figure, centralized: DB 1 ... DB N feed a Centralized Data Repository, on which the Data Mining Model is built] • [Figure, distributed: Data / Compute Node 1 ... Data / Compute Node N build the Data Mining Model through Distributed Computation] • Problems of Centralizing Data: (1) Communication Cost (2) Privacy Loss
Issues unique to DDM • Communication • Machine Learning on a central server – No communication cost incurred • Distributed Mining – communication not free, treated as a ‘resource’ • Incomplete knowledge (input data, surrounding network structure, etc.) • Coping with failures – Many things can go wrong! • Timing and synchrony
Road Map • Problem Motivation • Potential Applications • DDM Basics • Data Distribution • Communication Protocols • Gossip-based communication • Randomized Gossip • Converge cast, Up cast and Down cast • Distributed Classification • Distributed Clustering • Mining on Large Scale Systems: Challenges and Open Problems
Data Distribution • Horizontal Partitioning: Each site has exactly the same set of attributes • Example: A department store using a standard database schema for its customer base. The same database is maintained at different geographic locations. • Vertical Partitioning: Different attributes are observed at different sites. • Example: The astronomy sky surveys described earlier
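The two partitioning schemes can be illustrated with a minimal sketch. The customer table, the column names, and the helper functions below are hypothetical and chosen only for illustration; the vertical split assumes that every site shares a common key (`id`) so that rows can be matched across sites.

```python
# Hypothetical toy table: rows are customers, columns are attributes.
columns = ["id", "age", "city", "spend"]
rows = [
    (1, 34, "NYC", 120.0),
    (2, 51, "Boston", 80.5),
    (3, 27, "NYC", 42.0),
    (4, 45, "Chicago", 310.0),
]

def horizontal_partition(rows, n_sites):
    """Split by rows: every site keeps all attributes for a subset of rows."""
    return [rows[i::n_sites] for i in range(n_sites)]

def vertical_partition(rows, columns, site_columns, key="id"):
    """Split by columns: each site keeps the shared key plus its own attributes."""
    k = columns.index(key)
    sites = []
    for cols in site_columns:
        idx = [k] + [columns.index(c) for c in cols]
        sites.append([tuple(r[i] for i in idx) for r in rows])
    return sites

h_sites = horizontal_partition(rows, 2)
v_sites = vertical_partition(rows, columns, [["age", "city"], ["spend"]])
print(h_sites)  # same schema at both sites, different rows
print(v_sites)  # same rows at both sites, different attributes
```

In the horizontal case a model can be trained locally at each site and then combined; in the vertical case no single site sees a complete tuple, which is why the later slides need dot-product style protocols.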
Timing and Synchrony • Synchronous Model • A message sent by node v at time P must reach neighbor u by time P+1 at the latest • System driven by a global clock • In each round: send messages to neighbors, receive messages from neighbors, perform computation • [Figure: nodes v and u with local clocks; a message sent at time P arrives by time P+1]
Timing and Synchrony (contd.) • Asynchronous Model • Algorithms are event driven • No access to a global clock • Messages from one processor to a neighbor arrive within a finite but unpredictable time • Question: How do you know whether a message was sent by a neighbor or not? • Non-deterministic in nature • Messages may be delivered in arbitrary order
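An example: Asynchronous Messages • Inputs: node v holds X = 0, node u holds X = 1 • Protocol of v and u: send X to P • Protocol of P: upon getting a message, print it • Because delivery order is arbitrary, P may print 0 then 1, or 1 then 0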
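The non-determinism in this example can be simulated with a small sketch. The function name and the use of uniform random delays are assumptions made for illustration; any finite, unpredictable delay would serve the same purpose.

```python
import random

def run_async(inputs, rng):
    """Deliver each sender's message to P after a random finite delay.

    `inputs` maps a sender name to its value X. Returns what P prints,
    in arrival order.
    """
    # Each message gets an arbitrary finite delay. P has no global clock,
    # so it only observes the order in which messages happen to arrive.
    in_flight = [(rng.random(), sender, x) for sender, x in inputs.items()]
    in_flight.sort()  # earliest arrival first
    printed = []
    for _delay, _sender, x in in_flight:
        printed.append(x)  # P's protocol: upon getting a message, print it
    return printed

rng = random.Random(0)
outcomes = {tuple(run_async({"v": 0, "u": 1}, rng)) for _ in range(200)}
print(sorted(outcomes))  # the distinct print orders observed at P
```

Repeating the run many times shows that the same protocol and the same inputs can produce different outputs, which is what makes asynchronous algorithms harder to reason about.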
Communication Protocols: Part 1 • Broadcast • Disseminate a message M from source s to all vertices in the network • Common strategy: Use a spanning tree T rooted at the source s • Tree-cast – Internal vertex gets message from parent and forwards to children
Communication Protocols: Part 1 • Convergecast • Lets the source detect that the broadcast operation has terminated (termination detection) • Acknowledgement echoes travel from the leaves back up the tree to the source
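A minimal sketch of tree-cast followed by convergecast is given below. It assumes the spanning tree is already known and is given as a child-list dictionary; the sequential recursion stands in for real message passing, and the vertex names are hypothetical.

```python
def broadcast_and_echo(tree, root):
    """Tree-cast a message from `root`, then convergecast acknowledgements.

    `tree` maps each vertex to the list of its children in the spanning tree.
    Returns (delivery_order, ack_order).
    """
    delivered, acked = [], []

    def visit(v):
        delivered.append(v)            # v receives M from its parent
        for child in tree.get(v, []):
            visit(child)               # v forwards M to each child
        # v echoes an acknowledgement only after all its children have acked
        acked.append(v)

    visit(root)
    return delivered, acked

# Hypothetical spanning tree rooted at the source s.
tree = {"s": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
delivered, acked = broadcast_and_echo(tree, "s")
print(delivered)
print(acked)
```

The source is the last vertex to complete its acknowledgement, so when its own echo fires it knows every vertex has received the message.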
Gossip-based communication • Based on the spread of an epidemic in a large population • Susceptible, infected and dead nodes • The “epidemic” spreads exponentially fast • [Figure: gossip among Node 1, Node 2, Node 3, Node 4, Node 5]
Randomized Gossip • Nodes contact any one neighbor chosen at random • Models can be asynchronous or synchronous • Asynchronous – a single clock ticks according to a rate-$n$ Poisson process at times $Z_k$, $k \ge 1$; the inter-tick times $|Z_{k+1} - Z_k|$ are exponential with rate $n$ • Synchronous – time is slotted uniformly across all nodes • Reference: S. Boyd et al., “Randomized Gossip Algorithms”, IEEE Transactions on Information Theory, 2006.
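The asynchronous time model can be sketched with pairwise averaging, a common use of randomized gossip. The averaging update, the ring topology, and all names below are assumptions for illustration; the slide itself only specifies the clock model and the random choice of neighbor.

```python
import random

def randomized_gossip(values, neighbors, ticks, rng):
    """Asynchronous pairwise averaging on a graph.

    values: initial value per node; neighbors: adjacency lists.
    Returns (final values, elapsed time).
    """
    x = list(values)
    n = len(x)
    t = 0.0
    for _ in range(ticks):
        t += rng.expovariate(n)       # inter-tick time is exponential with rate n
        i = rng.randrange(n)          # the tick belongs to a uniformly random node
        j = rng.choice(neighbors[i])  # node i contacts one neighbor chosen at random
        x[i] = x[j] = (x[i] + x[j]) / 2.0
    return x, t

n = 8
ring = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
start = [float(i) for i in range(n)]
final, elapsed = randomized_gossip(start, ring, ticks=2000, rng=random.Random(1))
print(final, elapsed)
```

Each exchange preserves the sum of the two values involved, so the network-wide average is invariant while the individual values contract toward it.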
Decision Tree Induction • Example of Quinlan’s ID3 (Play / No Play)
Decision Tree Built on the Data • [Figure: decision tree with tests Outlook = sunny, Outlook = overcast, Outlook = rain, Humidity <= 75 and Windy = true; leaves labeled Play / No Play]
Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Examples are partitioned recursively based on selected attributes • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left • From: Data Mining: Concepts and Techniques
Attribute Selection: Information Gain • Select the attribute with the highest information gain • Let $p_i$ be the probability that an arbitrary tuple in $D$ belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$ • Expected information (entropy) needed to classify a tuple in $D$, with $m$ classes: $Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$ • Information needed (after using $A$ to split $D$ into $v$ partitions) to classify $D$: $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)$ • Information gained by branching on attribute $A$: $Gain(A) = Info(D) - Info_A(D)$
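The entropy and gain computation can be sketched as follows. The data is an assumption: the Play / No Play table is not reproduced in this transcript, so the sketch uses the per-Outlook class counts of the commonly cited 14-row weather data set (sunny: 2 yes / 3 no, overcast: 4 yes / 0 no, rain: 3 yes / 2 no).

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a label list: -sum_i p_i * log2(p_i)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    total = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    info_a = sum(len(part) / total * info(part) for part in partitions.values())
    return info(labels) - info_a

# Assumed class counts per Outlook value (not taken from the slides).
outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rain"] * 5
play = (["yes"] * 2 + ["no"] * 3) + ["yes"] * 4 + (["yes"] * 3 + ["no"] * 2)
gain = info_gain(outlook, play)
print(round(info(play), 3), round(gain, 3))
```

With these counts the overcast partition is pure and contributes zero entropy, which is why Outlook scores a comparatively high gain.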
Distributed Decision Tree Construction • Adam sends Betty the binary vector for “Outlook = Rainy” • Betty constructs the binary vectors for “Humidity = High & Play = Yes” and “Humidity = Normal & Play = Yes” • The dot products count the tuples satisfying “Outlook = Rainy & Humidity = Normal & Play = Yes” and “Outlook = Rainy & Humidity = High & Play = Yes” • Example obtained from: C. Giannella, K. Liu, T. Olsen and H. Kargupta, “Communication efficient construction of decision trees over heterogeneously distributed data”, ICDM 2004
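A small sketch of the Adam and Betty exchange, using exact dot products of 0/1 indicator vectors. The six-row table and the helper names are hypothetical; the cited paper goes further and approximates these dot products to save communication.

```python
# Hypothetical 6-row table, vertically partitioned:
# Adam holds Outlook; Betty holds Humidity and the class label Play.
outlook = ["rainy", "sunny", "rainy", "rainy", "overcast", "rainy"]
humidity = ["high", "high", "normal", "high", "normal", "normal"]
play = ["yes", "no", "yes", "no", "yes", "yes"]

def indicator(predicate, *cols):
    """Binary vector with a 1 for every row that satisfies the predicate."""
    return [1 if predicate(*row) else 0 for row in zip(*cols)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Adam's side: one vector for "Outlook = Rainy".
adam_vec = indicator(lambda o: o == "rainy", outlook)

# Betty's side: one vector per (Humidity value, Play = Yes) combination.
betty_high = indicator(lambda h, p: h == "high" and p == "yes", humidity, play)
betty_norm = indicator(lambda h, p: h == "normal" and p == "yes", humidity, play)

# Each dot product counts the rows where both conditions hold.
count_high = dot(adam_vec, betty_high)
count_norm = dot(adam_vec, betty_norm)
print(count_high, count_norm)
```

These counts are exactly the class frequencies needed to evaluate information gain for Humidity within the Rainy branch, without either party revealing its raw columns as anything other than a bit vector.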
A technique from Random Projection • Simple technique that has been useful in developing approximation algorithms • Given $n$ points in a Euclidean space like $\mathbb{R}^n$, project down to a random $k$-dimensional subspace for $k \ll n$. • If $k$ is “medium-size” like $O(\varepsilon^{-2} \log n)$, then the approximation preserves many interesting quantities. • If $k$ is small like 1, then one can often still get something useful.
Johnson and Lindenstrauss Lemma • Given $n$ points in $\mathbb{R}^n$, if we project randomly to $\mathbb{R}^k$ for $k = O(\varepsilon^{-2} \log n)$, then with high probability all pairwise distances are preserved up to a factor of $1 \pm \varepsilon$ (after scaling by $(n/k)^{1/2}$).
Distributed Dot Product Estimation Using Random Projection • Data Matrix: Site A – $n \times p$, Site B – $n \times q$ • Normalize data • A and B get a random number generation seed • Generate an $l \times n$ random matrix $R$ ($l \ll n$) • A sends $RA$ and B sends $RB$ to site S • Compute $D = (RA)^T (RB) / l$ • $E[D] = E[A^T (R^T R) B / l] = A^T E[R^T R] B / l \approx A^T B$ (Johnson and Lindenstrauss lemma); when the entries of $R$ are i.i.d. with zero mean and unit variance, $E[R^T R] = l I$
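The protocol can be sketched in plain Python for the simplest case of one column per site. Several choices here are assumptions rather than details from the slides: the entries of R are drawn i.i.d. from N(0, 1), the data columns are synthetic, and both sites regenerate R from the shared seed instead of exchanging it.

```python
import math
import random

def random_matrix(seed, l, n):
    """l x n matrix with i.i.d. N(0, 1) entries, reproducible from the seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(l)]

def project(R, column):
    """Compute R @ column for a single data column of length n."""
    return [sum(r * x for r, x in zip(row, column)) for row in R]

def normalize(column):
    norm = math.sqrt(sum(x * x for x in column))
    return [x / norm for x in column]

def estimate_dot(a, b, l, seed):
    """Estimate the dot product of a and b from l-dimensional projections only."""
    n = len(a)
    Ra = project(random_matrix(seed, l, n), a)     # computed at site A
    Rb = project(random_matrix(seed, l, n), b)     # computed at site B, same seed
    return sum(x * y for x, y in zip(Ra, Rb)) / l  # computed at site S

n, l = 40, 10
a = normalize([math.sin(i) + 1.5 for i in range(n)])        # synthetic column at A
b = normalize([math.cos(i / 3.0) + 1.5 for i in range(n)])  # synthetic column at B
true_dot = sum(x * y for x, y in zip(a, b))
estimates = [estimate_dot(a, b, l, seed) for seed in range(400)]
mean_estimate = sum(estimates) / len(estimates)
print(true_dot, mean_estimate)
```

A single estimate with small l is noisy; averaging over many seeds is done here only to make the unbiasedness visible. In practice l trades communication cost (l numbers per column instead of n) against estimation variance.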
Distributed Decision Tree Construction • At each site, locally determine which attribute has the largest information gain • Keep track of the global attribute (AG) with largest information gain • For each distinct value of AG -- branch leading from root node will be constructed • Site with AG will send projection to other sites • Leaf node determination – 1. all instances have the same class 2. Minimum number of objects allowable in a class reached 3. Child of a node is empty
PLANET: Parallel Learning for Assembling Numerous Ensemble Trees • Ref: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo, “PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce”, VLDB 2009 • Components • Controller (maintains a ModelFile) • MapReduceQueue and InMemoryQueue
Classifier Design by Linear Programming • Classification can be posed as an LP problem • $k$-th instance $x_k$, weight vector $W$ • $x_k W \ge d$ • $e_k$ is the error associated with instance $k$ • LP is written as $XW + E = D + S$, where $S$ contains the slack variables • Assume that each node in a network has a data set: how can the classification problem be solved? • Reference: H. Dutta and H. Kargupta, “Distributed Linear Programming and Resource Management for Data Mining in Distributed Environments”, ICDM 2008.
Classification Function of Ensemble Classifier • [Figure: trees $f_1(x), f_2(x), f_3(x), \ldots, f_n(x)$ feed a Weighted Sum] • $f(x) = \sum_i a_i f_i(x)$ • $a_i$: weight for Tree $i$ • $f_i(x)$: classification of Tree $i$
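The weighted sum can be sketched for the binary case. Encoding each tree's output as -1 or +1 and taking the sign of the sum as the final label are assumptions for illustration, as are the three hypothetical decision stumps standing in for trees.

```python
def ensemble_score(classifiers, weights, x):
    """Weighted sum of the individual classifications: sum_i a_i * f_i(x)."""
    return sum(a * f(x) for a, f in zip(weights, classifiers))

def ensemble_predict(classifiers, weights, x):
    """Assumed decision rule for labels in {-1, +1}: sign of the weighted sum."""
    return 1 if ensemble_score(classifiers, weights, x) >= 0 else -1

# Hypothetical stumps standing in for Tree 1..3.
stumps = [
    lambda x: 1 if x[0] > 0.5 else -1,
    lambda x: 1 if x[1] > 0.2 else -1,
    lambda x: 1 if x[0] + x[1] > 1.0 else -1,
]
weights = [0.6, 0.25, 0.15]  # hypothetical weights a_i

print(ensemble_predict(stumps, weights, (0.9, 0.0)))
print(ensemble_predict(stumps, weights, (0.1, 0.8)))
```

In the first call the heavily weighted stump outvotes the other two; how the weights are obtained depends on the ensemble method (for example, boosting derives them from each weak learner's error).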
Ensemble Approach • Bagging (Breiman, 96) • Boosting (Freund and Schapire, 99) • Arcing (Breiman, 97) • Stacking (Wolpert, 92) • Rotation Forest (Kuncheva et al)
The Distributed Boosting Algorithm • k distributed sites storing homogeneously partitioned data • At each local site, initialize the local distribution Δj • Keep track of the global initial distribution by broadcasting Δj • For each iteration across all sites • Draw indices from the local data set based on the global distribution • Train a weak learner and distribute it to all sites • Create an ensemble by combining weak learners; use the ensemble to compute the weak hypothesis • Compute weights and redistribute them to all sites • Update the distribution and repeat until termination. • Reference: A. Lazarevic and Z. Obradovic, “The Distributed Boosting Algorithm”, KDD 2001.
Road Map • Problem Motivation • An Astronomy Application • DDM Basics • Data Distribution • Synchronous vs Asynchronous algorithms • Communication Protocols • Gossip-based communication • Randomized Gossip • Converge cast, Up cast and Down cast • Distributed Classification • Distributed Clustering / Outlier Detection • Mining on Large Scale Systems: Challenges and Open Problems
CPCA: Collective Principal Component Analysis-based Clustering • Kargupta et al. (KAIS, 2001) • At each local site: perform PCA, project data onto the PCs, apply clustering in the lower dimension • Central Coordinator sends global PCs to the sites • Perform local clustering at the sites with global PCs
KDEC: Distributed Density Based Clustering • Matthias Klusch et al. (IJCAI 2003) • Homogeneous data partitioning across nodes • Local sites and a Helper agent • Assume the global kernel function and bandwidth are agreed upon • Local density estimates are made at each site • Global KDE is obtained by summing the local estimates • The global estimate is sent back to the local sites, which cluster their data • Points that can be connected by a continuous uphill path to the same local maximum are in the same cluster • A privacy-preserving variation also exists
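The summation step can be sketched in one dimension. The Gaussian kernel, the shared sampling grid, the bandwidth, and the site data are all assumptions for illustration; the sketch stops at locating density peaks and does not implement the uphill-path cluster assignment.

```python
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def local_density(points, grid, h):
    """Unnormalized kernel estimate of one site's data, sampled on a shared grid."""
    return [sum(gaussian_kernel((g - p) / h) for p in points) for g in grid]

def global_density(local_estimates):
    """Helper's job: sum the local estimates point-wise."""
    return [sum(vals) for vals in zip(*local_estimates)]

# Hypothetical homogeneously partitioned 1-D data at three sites.
site_data = [[1.0, 1.2, 0.8], [5.0, 5.3], [1.1, 4.9, 5.1]]
h = 0.5                               # agreed bandwidth
grid = [i * 0.25 for i in range(29)]  # agreed sampling grid, 0.0 to 7.0

global_est = global_density([local_density(pts, grid, h) for pts in site_data])

# Grid points that are higher than both neighbors approximate the density peaks.
peaks = [grid[i] for i in range(1, len(grid) - 1)
         if global_est[i] > global_est[i - 1] and global_est[i] > global_est[i + 1]]
print(peaks)
```

Because the kernel estimate is a sum over data points, summing the per-site estimates gives the same values as estimating on the pooled data, so no raw points need to leave a site.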
Parallel K-means • Dhillon and Modha • Chunk the data (homogeneous partitioning) • A randomly chosen node selects the initial cluster centroids • Each node computes distances between the centroids and its local data • After each iteration, the independent per-node results are reduced (combined) • MPI is used to implement the procedure • Parallelization is different from the distributed setting!
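The per-node work and the reduce step can be sketched in plain Python. This is a single-process stand-in: the loop over chunks replaces the MPI processes, the list comprehension replaces the MPI reduction, and the 1-D data and initial centroids are hypothetical.

```python
def partial_sums(chunk, centroids):
    """Work done on one node: assign local points, return per-cluster sums and counts."""
    k = len(centroids)
    sums, counts = [0.0] * k, [0] * k
    for x in chunk:
        j = min(range(k), key=lambda c: (x - centroids[c]) ** 2)  # nearest centroid
        sums[j] += x
        counts[j] += 1
    return sums, counts

def reduce_step(partials, centroids):
    """Combine the independent per-node results into new global centroids."""
    new = []
    for j in range(len(centroids)):
        total = sum(s[j] for s, _ in partials)
        count = sum(c[j] for _, c in partials)
        new.append(total / count if count else centroids[j])  # keep empty clusters in place
    return new

def kmeans(chunks, centroids, iterations=10):
    for _ in range(iterations):
        partials = [partial_sums(chunk, centroids) for chunk in chunks]  # one per node
        centroids = reduce_step(partials, centroids)
    return centroids

# Hypothetical 1-D data split across three nodes.
chunks = [[1.0, 1.5, 9.0], [0.5, 8.5, 9.5], [2.0, 10.0]]
centroids = kmeans(chunks, [0.0, 5.0])
print(centroids)
```

Only k sums and k counts per node cross the network in each iteration, regardless of how many points a node holds, which is what makes the reduction cheap.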
Summary • Data Avalanche in scientific disciplines • Distributed Data Mining – a relatively new field in the past 15 years • Data Distribution and Communication Protocols • How does the distributed data affect mining? • Algorithms for Decision Tree construction, Boosting in distributed settings • Unsupervised Learning in Distributed Environments • A lot more to be done theoretically and empirically! • Interested in collaborating – send email to haimonti@ccls.columbia.edu