Algorithms for Distributed Supervised and Unsupervised Learning

Algorithms for Distributed Supervised and Unsupervised Learning Haimonti Dutta The Center for Computational Learning Systems (CCLS) Columbia University, New York.

About ME • BCSE from Jadavpur University, Kolkata • MS from Temple University, PA • Ph.D. from University of Maryland, Baltimore County • Research Interests • Machine Learning and Data Mining • Data Intensive Computing • Distributed Data Mining and Optimization • Website: www1.ccls.columbia.edu/~dutta

The Data Avalanche • High Energy Physics (CERN) • Subatomic particles are accelerated to nearly the speed of light and then collided • Measured at time intervals of 1 nanosecond • 1 petabyte of data • Astronomy (SDSS, 2MASS) • Telescopes observe galaxies, stars, quasars • On the order of 215 million objects, approximately 100 – 200 attributes per object • Genome Sequences (Human Genome Project) • Advanced Imaging (fMRI, CT scans) • Simulation Data (climate modeling, earth observation) • Time Series Data (EEG, ECG, ECoG)

Astronomy Sky Surveys • Example – Sloan Digital Sky Survey • Telescope observes galaxies, stars, quasars • Few hundreds of attributes for each observed object. • The Data Release 5 • 8000 square degrees • 215 million objects.

Graph, Graph, Everywhere • [Figure panels: Aspirin; Yeast protein interaction network; Co-author network; Internet. Source label: from H. Jeong et al., Nature 411, 41 (2001)] • Slide borrowed from Dr. Jiawei Han’s tutorial on graph algorithms

Centralized versus Distributed Data Mining • [Figure: centralized setting, in which DB 1 ... DB N feed a Centralized Data Repository and a Data Mining Model, versus Distributed Computation, in which Data / Compute Node 1 ... Data / Compute Node N contribute to a Data Mining Model] • Problems of Centralizing Data: (1) Communication Cost (2) Privacy Loss

Issues unique to DDM • Communication • Machine Learning on a central server – No communication cost incurred • Distributed Mining – communication not free, treated as a ‘resource’ • Incomplete knowledge (input data, surrounding network structure, etc.) • Coping with failures – Many things can go wrong! • Timing and synchrony

Road Map • Problem Motivation • Potential Applications • DDM Basics • Data Distribution • Communication Protocols • Gossip-based communication • Randomized Gossip • Converge cast, Up cast and Down cast • Distributed Classification • Distributed Clustering • Mining on Large Scale Systems: Challenges and Open Problems

Data Distribution • Horizontal Partitioning: Each site has exactly the same set of attributes • Example: A departmental store using a standard database schema for its customer base. Same database maintained at different geographic locations. • Vertical Partitioning: Different attributes are observed at different sites. • Example: Astronomy example described earlier
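
The two partitioning schemes can be illustrated with a toy table. This is only a sketch: the records, the attribute names and the helper functions `horizontal_partition` and `vertical_partition` are invented for illustration, and the shared `id` key used to line up vertically partitioned rows is an assumption not stated on the slide.

```python
# Toy customer table; the records and attribute names are invented for illustration.
records = [
    {"id": 1, "age": 34, "city": "NYC", "spend": 120.0},
    {"id": 2, "age": 51, "city": "LA", "spend": 75.5},
    {"id": 3, "age": 27, "city": "NYC", "spend": 310.0},
    {"id": 4, "age": 45, "city": "SF", "spend": 42.0},
]


def horizontal_partition(rows, n_sites):
    """Split the rows across sites; every site sees the full attribute set."""
    return [rows[i::n_sites] for i in range(n_sites)]


def vertical_partition(rows, attribute_groups, key="id"):
    """Split the attributes across sites; every site sees all rows plus a shared key."""
    return [
        [{name: row[name] for name in [key] + list(group)} for row in rows]
        for group in attribute_groups
    ]


horizontal_sites = horizontal_partition(records, 2)
vertical_sites = vertical_partition(records, [["age"], ["city", "spend"]])
```

In the horizontal case each site holds a subset of the customers with every attribute; in the vertical case each site holds every customer but only some of the attributes.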

Timing and Synchrony • Synchronous Model • Message sent by node v at time P must reach neighbor u by time P+1 at the latest • System driven by global clock • Send message to neighbors, receive messages from neighbors, perform computation • [Figure: nodes v and u with local clocks; a message sent at time P arrives by P+1]

Timing and Synchrony contd .. • Asynchronous Model • Algorithms are event driven • No access to a global clock • Messages from one processor to neighbor arrive within a finite but unpredictable time • Question: How do you know whether a message was sent by a neighbor or not? • Non-deterministic in nature • Arbitrary ordering of messages delivered

An example: Asynchronous Messages • Input: node v holds X = 0, node u holds X = 1 • Protocol at v and u: send X to P • Protocol at P: upon getting a message, print it

Communication Protocols: Part 1 • Broadcast • Disseminate a message M from source s to all vertices in the network • Common strategy: Use a spanning tree T rooted at the source s • Tree-cast – Internal vertex gets message from parent and forwards to children
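
A minimal sketch of tree-cast, assuming the spanning tree is already known and given as a parent-to-children map. The function name `tree_cast`, the example tree and the returned counters are illustrative choices, not part of the original slides.

```python
from collections import deque


def tree_cast(children, source, message):
    """Forward `message` from `source` down a rooted spanning tree.

    `children` maps each vertex to the list of its children in the tree.
    Returns the delivered messages, the number of messages sent, and the
    tree depth (the number of synchronous rounds needed).
    """
    received = {source: message}
    depth = {source: 0}
    sent = 0
    queue = deque([source])
    while queue:
        v = queue.popleft()
        # An internal vertex forwards what it got from its parent to each child.
        for child in children.get(v, []):
            received[child] = received[v]
            depth[child] = depth[v] + 1
            sent += 1
            queue.append(child)
    return received, sent, max(depth.values())


# Hypothetical spanning tree rooted at the source "s".
tree = {"s": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
received, sent, rounds = tree_cast(tree, "s", "M")
```

On a tree with n vertices this uses n - 1 messages, one per tree edge.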

Communication Protocols: Part 1 • Convergecast • Source can detect that the broadcast operation has terminated (termination detection) • Acknowledgement echoes
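
Termination detection with acknowledgement echoes can be sketched as follows, assuming the same kind of rooted spanning tree as in a tree-cast. The rule used here (a leaf echoes immediately, an internal vertex echoes once all of its children have echoed) is the usual textbook formulation; the function name `convergecast` and the example tree are hypothetical.

```python
def convergecast(children, root):
    """Return the (child, parent) echoes in an order consistent with the protocol.

    A vertex sends its echo to its parent only after every one of its own
    children has echoed, so the root hears its last echo only when the whole
    tree has received the broadcast.
    """
    echoes = []

    def visit(v, parent):
        for child in children.get(v, []):
            visit(child, v)
        if parent is not None:
            echoes.append((v, parent))

    visit(root, None)
    return echoes


# Hypothetical spanning tree rooted at the source "s".
tree = {"s": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
order = convergecast(tree, "s")
```

The source can declare the broadcast finished as soon as it holds an echo from each of its children.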

Gossip-based communication • Based on spread of an epidemic in a large population • Susceptible, infected and dead nodes • The “epidemic” spreads exponentially fast • [Figure: gossip among Node 1, Node 2, Node 3, Node 4 and Node 5]

Randomized Gossip • Nodes contact any one neighbor chosen at random • Models can be asynchronous or synchronous • Asynchronous model – a single clock ticks according to a rate-$n$ Poisson process at times $Z_k$, $k \ge 1$, where the inter-tick times $Z_{k+1} - Z_k$ are exponential with rate $n$ • Synchronous model – time is slotted uniformly across all nodes • Reference: S. Boyd et al., “Randomized Gossip Algorithms”, IEEE Transactions on Information Theory, 2006.
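
A small simulation of randomized gossip in the asynchronous model. The slide only states how a neighbor is chosen; the pairwise averaging update used here is the problem studied in the cited Boyd et al. paper and is an assumption about what the nodes compute. Each tick of the single global clock is modeled by waking one node uniformly at random. The ring topology, the initial values and the name `randomized_gossip` are illustrative.

```python
import random


def randomized_gossip(values, neighbors, ticks, seed=0):
    """Simulate asynchronous randomized gossip for averaging.

    At every clock tick one node wakes up uniformly at random, contacts one
    neighbor chosen at random, and both replace their values by the average.
    """
    rng = random.Random(seed)
    x = list(values)
    for _ in range(ticks):
        i = rng.randrange(len(x))
        j = rng.choice(neighbors[i])
        x[i] = x[j] = (x[i] + x[j]) / 2.0
    return x


# Hypothetical network: a ring of 6 nodes, node i holding the value i.
n = 6
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
initial = [float(i) for i in range(n)]
result = randomized_gossip(initial, ring, ticks=2000)
```

Each exchange preserves the sum of the values, so every node drifts toward the global average (2.5 in this example) without any node ever seeing all the data.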

Road Map • Problem Motivation • Potential Applications • DDM Basics • Data Distribution • Communication Protocols • Gossip-based communication • Randomized Gossip • Converge cast, Up cast and Down cast • Distributed Classification • Distributed Clustering • Mining on Large Scale Systems: Challenges and Open Problems

Decision Tree Induction • Example of Quinlan’s ID3 (Play / No Play)

Decision Tree Built on the Data • [Figure: decision tree with branches Outlook = sunny, Outlook = overcast and Outlook = rain from the root; inner tests Humidity <= 75 and Windy = true; leaves labeled Play or No Play]

Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Examples are partitioned recursively based on selected attributes • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left • (Source: Data Mining: Concepts and Techniques)

Attribute Selection: Information Gain • Select the attribute with the highest information gain • Let $p_i$ be the probability that an arbitrary tuple in $D$ belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$ • Expected information (entropy) needed to classify a tuple in $D$: $Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$ • Information needed (after using $A$ to split $D$ into $v$ partitions) to classify $D$: $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} Info(D_j)$ • Information gained by branching on attribute $A$: $Gain(A) = Info(D) - Info_A(D)$
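
The entropy and information-gain quantities named on this slide can be computed in a few lines. This is a sketch: the training table is not part of the transcript, so the class counts below are assumed to follow the commonly used 14-example weather data (9 play, 5 no play), and the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Expected information needed to classify a tuple, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy([row[target] for row in rows]) - remainder


# Assumed counts per Outlook value (not taken from the slides).
rows = (
    [{"outlook": "sunny", "play": "yes"}] * 2
    + [{"outlook": "sunny", "play": "no"}] * 3
    + [{"outlook": "overcast", "play": "yes"}] * 4
    + [{"outlook": "rain", "play": "yes"}] * 3
    + [{"outlook": "rain", "play": "no"}] * 2
)
gain = information_gain(rows, "outlook", "play")
```

With these counts the entropy of the class label is about 0.940 bits and the gain from splitting on `outlook` is about 0.247 bits.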

Distributed Decision Tree Construction • Adam sends Betty “Outlook = Rainy” • Betty constructs “Humidity = High & Play = Yes” and “Humidity = Normal & Play = Yes” • Dot product represents tuples “Outlook = Rainy & Humidity = Normal & Play = Yes” AND “Outlook = Rainy & Humidity = High & Play = Yes” • Example obtained from: C. Giannella, K. Liu, T. Olsen and H. Kargupta, “Communication efficient construction of decision trees over heterogeneously distributed data”, ICDM 2004

A technique from Random Projection • Simple technique that has been useful in developing approximation algorithms • Given $n$ points in Euclidean space like $R^n$, project down to a random $k$-dimensional subspace for $k \ll n$. • If $k$ is “medium-size” like $O(\epsilon^{-2} \log n)$, then the approximation preserves many interesting quantities. • If $k$ is small like 1, then one can often still get something useful.

Johnson and Lindenstrauss Lemma • Given $n$ points in $R^n$, if we project randomly to $R^k$, for $k = O(\epsilon^{-2} \log n)$, then with high probability all pairwise distances are preserved up to a factor of $1 \pm \epsilon$ (after scaling by $(n/k)^{1/2}$).

Distributed Dot Product Estimation Using Random Projection • Data Matrix: Site A – $n \times p$, Site B – $n \times q$ • Normalize data • A and B get a random number generation seed • Generate an $l \times n$ random matrix $R$ ($l \ll n$) • A sends $RA$ and B sends $RB$ to site S • Compute $D = (RA)^T (RB) / l$ • $E[D] = E[A^T (R^T R) B / l] = A^T E[R^T R] B / l \approx A^T B$ (Johnson-Lindenstrauss lemma)
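
A pure-Python sketch of the estimation steps, reduced to one column at each site so that the result is a single dot product. Several details are assumptions made for illustration: the entries of the random matrix are drawn i.i.d. from a standard normal distribution, "normalize" is taken to mean scaling each column to unit length, and the sizes, seeds and function names are hypothetical.

```python
import math
import random


def random_matrix(l, n, seed):
    """l x n matrix of i.i.d. N(0, 1) entries; the same seed yields the same matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(l)]


def project(R, column):
    """Compute R times a length-n column, giving a length-l vector."""
    return [sum(r * x for r, x in zip(row, column)) for row in R]


def estimate_dot(projected_a, projected_b):
    """Combine the two projections at the third site: (Ra)^T (Rb) / l."""
    return sum(u * v for u, v in zip(projected_a, projected_b)) / len(projected_a)


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


n, l, shared_seed = 1000, 250, 42
data_rng = random.Random(7)
a = normalize([data_rng.random() for _ in range(n)])  # column held by site A
b = normalize([data_rng.random() for _ in range(n)])  # column held by site B

# Each site rebuilds the same R from the shared seed and ships only l numbers.
ra = project(random_matrix(l, n, shared_seed), a)
rb = project(random_matrix(l, n, shared_seed), b)

estimate = estimate_dot(ra, rb)
exact = sum(x * y for x, y in zip(a, b))
```

Each site transmits l values instead of n, and the error of the estimate shrinks roughly like $1/\sqrt{l}$, so l trades communication against accuracy.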

Distributed Decision Tree Construction • At each site, locally determine which attribute has the largest information gain • Keep track of the global attribute (AG) with largest information gain • For each distinct value of AG -- branch leading from root node will be constructed • Site with AG will send projection to other sites • Leaf node determination – 1. all instances have the same class 2. Minimum number of objects allowable in a class reached 3. Child of a node is empty

PLANET: Parallel Learning for Assembling Numerous Ensemble Trees • Ref: B Panda, J. S. Herbach, S. Basu, R. J. Bayardo, “PLANET: Massively Parallel Learning of Tree Ensembles with Map Reduce”, VLDB 2009 • Components • Controller (maintains a ModelFile) • MapReduceQueue and InMemoryQueue

Classifier Design by Linear Programming • Classification can be posed as an LP problem • $k$th instance $x_k$, weight vector $W$ • $x_k W \ge d$ • $e_k$ is the error associated with instance $k$ • LP is written as $XW + E = D + S$, where $S$ contains the slack variables • Assume that each node in a network has a data set; how can the classification problem be solved? • Reference: H. Dutta and H. Kargupta, “Distributed Linear Programming and Resource Management for Data Mining in Distributed Environments”, ICDM 2008.

Ensemble Learning in Distributed Environments • [Figure: trees T0, T1 and T2]

Classification Function of Ensemble Classifier • [Figure: outputs $f_1(x), f_2(x), f_3(x), \ldots, f_n(x)$ feed a weighted sum] • $f(x) = \sum_i a_i f_i(x)$ • $a_i$: weight for Tree $i$ • $f_i(x)$: classification of Tree $i$
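
The weighted sum can be written directly in code. In this sketch the base classifiers are hypothetical threshold rules that output +1 or -1, the weights are made up, and taking the sign of the weighted sum as the final label is an assumption about how the score is turned into a class.

```python
def ensemble_score(classifiers, weights, x):
    """Weighted sum of the individual classifications: sum_i a_i * f_i(x)."""
    return sum(a * f(x) for a, f in zip(weights, classifiers))


def ensemble_predict(classifiers, weights, x):
    """Final label in {-1, +1}, taken as the sign of the weighted sum."""
    return 1 if ensemble_score(classifiers, weights, x) >= 0 else -1


# Hypothetical base classifiers standing in for trees trained at different sites.
classifiers = [
    lambda x: 1 if x[0] > 0.5 else -1,
    lambda x: 1 if x[1] > 0.5 else -1,
    lambda x: 1 if x[0] + x[1] > 1.0 else -1,
]
weights = [0.5, 0.3, 0.2]

score_a = ensemble_score(classifiers, weights, (0.9, 0.2))
label_a = ensemble_predict(classifiers, weights, (0.9, 0.2))
label_b = ensemble_predict(classifiers, weights, (0.1, 0.8))
```

For the first point two of the three classifiers vote +1 with a combined weight of 0.7 against 0.3, so the ensemble predicts +1; for the second point the ensemble predicts -1.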

Ensemble Approach • Bagging (Breiman, 96) • Boosting (Freund and Schapire, 99) • Arcing (Breiman, 97) • Stacking (Wolpert, 92) • Rotation Forest (Kuncheva et al)

The Distributed Boosting Algorithm • k distributed sites storing homogeneously partitioned data • At each local site, initialize the local distribution Δj • Keep track of the global initial distribution by broadcasting Δj • For each iteration across all sites • Draw indices from the local data set based on the global distribution • Train a weak learner and distribute to all sites • Create an ensemble by combining weak learners; use the ensemble to compute the weak hypothesis • Compute weights, and re-distribute to all sites • Update distribution and repeat until termination. • Reference: A. Lazarevic and Z. Obradovic, “The Distributed Boosting Algorithm”, KDD 2001.

Road Map • Problem Motivation • An Astronomy Application • DDM Basics • Data Distribution • Synchronous vs Asynchronous algorithms • Communication Protocols • Gossip-based communication • Randomized Gossip • Converge cast, Up cast and Down cast • Distributed Classification • Distributed Clustering / Outlier Detection • Mining on Large Scale Systems: Challenges and Open Problems

CPCA: Collective Principal Component Analysis-based Clustering • Kargupta et al. (KAIS, 2001) • [Figure: a Central Coordinator sends global PCs to the local sites] • At each local site: perform PCA, project data onto the PCs, apply clustering in lower dimension • Perform local clustering at the sites with global PCs

KDEC: Distributed Density Based Clustering • Matthias Klusch et al. (IJCAI 2003) • Homogeneous data partitioning across nodes • Local sites and a Helper agent • Assume global kernel function and bandwidth are agreed upon • Local density estimates are made at each site • Global KDE is obtained by summing local estimates • The global estimate is sent back to the local sites, which then cluster their data • Points that can be connected by a continuous uphill path to the same local maximum are in the same cluster • Privacy preserving variation also exists
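
A one-dimensional sketch of the density-summing idea. It is not the KDEC implementation: it assumes a Gaussian kernel, evaluates the densities on a shared grid of sample points, and replaces the continuous uphill path by a walk over neighboring grid points. The data, bandwidth, grid and function names are all hypothetical.

```python
from math import exp


def local_density(points, grid, h):
    """Unnormalized Gaussian kernel sums of one site's points at the grid locations."""
    return [sum(exp(-0.5 * ((g - p) / h) ** 2) for p in points) for g in grid]


def global_density(local_estimates):
    """Helper agent: add up the local estimates grid point by grid point."""
    return [sum(values) for values in zip(*local_estimates)]


def climb(index, density):
    """Walk uphill over neighboring grid points until a local maximum is reached."""
    while True:
        best = index
        for nb in (index - 1, index + 1):
            if 0 <= nb < len(density) and density[nb] > density[best]:
                best = nb
        if best == index:
            return index
        index = best


def cluster(points, grid, density):
    """Label each point by the grid index of the local maximum it climbs to."""
    labels = []
    for p in points:
        start = min(range(len(grid)), key=lambda i: abs(grid[i] - p))
        labels.append(climb(start, density))
    return labels


grid = [i * 0.1 for i in range(61)]   # agreed sample points on [0, 6]
h = 0.5                               # agreed bandwidth
site_1 = [0.8, 1.0, 5.0, 5.2]
site_2 = [1.0, 1.2, 4.8, 5.0]

density = global_density([local_density(s, grid, h) for s in (site_1, site_2)])
labels_1 = cluster(site_1, grid, density)
labels_2 = cluster(site_2, grid, density)
```

Only the grid of density values crosses the network; the raw points stay at their sites, and both sites end up with consistent cluster labels because they climb the same global density.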

Parallel K-means • Dhillon and Modha • Chunk data ( homogeneous partitioning) • Random node selects cluster centroids • Distance between centroids and local data computed • After each iteration, independent results to be reduced • Use of MPI to implement the procedure • Parallelization is different from the Distributed setting!
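
The chunk-and-reduce structure of parallel k-means can be sketched without MPI by running the per-chunk step in a plain loop. In this sketch the initial centroids are passed in explicitly rather than chosen by a random node, and the data and function names are illustrative; in an MPI program the list of per-chunk statistics would be combined with a reduction across processes instead.

```python
def local_stats(chunk, centroids):
    """Per-chunk step: assign each point to its nearest centroid, keep sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in chunk:
        nearest = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += point[d]
    return sums, counts


def reduce_stats(stats, centroids):
    """Reduction step: merge the independent per-chunk results into new centroids."""
    k, dim = len(centroids), len(centroids[0])
    updated = []
    for j in range(k):
        total = sum(counts[j] for _, counts in stats)
        if total == 0:
            updated.append(list(centroids[j]))  # keep a centroid that attracted no points
        else:
            updated.append([sum(sums[j][d] for sums, _ in stats) / total for d in range(dim)])
    return updated


def parallel_kmeans(chunks, centroids, iterations=10):
    for _ in range(iterations):
        stats = [local_stats(chunk, centroids) for chunk in chunks]  # one chunk per process
        centroids = reduce_stats(stats, centroids)
    return centroids


chunks = [
    [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)],
    [(1.0, 0.0), (1.0, 1.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)],
]
centers = parallel_kmeans(chunks, [(0.0, 0.0), (10.0, 10.0)])
```

Only k sums and k counts per chunk are exchanged in each iteration, which is why the reduction is cheap compared with moving the data itself.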

Summary • Data Avalanche in scientific disciplines • Distributed Data Mining – a relatively new field in the past 15 years • Data Distribution and Communication Protocols • How does the distributed data affect mining? • Algorithms for Decision Tree construction, Boosting in distributed settings • Unsupervised Learning in Distributed Environments • A lot more to be done theoretically and empirically! • Interested in collaborating – send email to haimonti@ccls.columbia.edu
