This work by PhD student Xiaofeng Zhang explores distributed model-based learning, focusing on data clustering and dimension reduction techniques. It covers linear approaches (SVD, PCA), pairwise-distance methods (MDS), topographic maps, and manifold learning (LLE). The research addresses the challenges of incomplete data and privacy, offering solutions such as hierarchical models and privacy control in distributed environments. The role of Gaussian mixture models and the evaluation of model accuracy and efficiency are also discussed.
Distributed Model-Based Learning
PhD student: Zhang, Xiaofeng
I. Model-Based Learning
• Methods used in data clustering and dimension reduction:
  1. Linear methods: SVD, PCA, kernel PCA, etc.
  2. Pairwise-distance methods: multidimensional scaling (MDS), etc.
  3. Topographic maps: elastic net, SOM, generative topographic mapping (GTM), etc.
  4. Manifold learning: LLE, etc.
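The first family above can be illustrated concretely. A minimal sketch of linear dimension reduction (PCA computed via SVD, the two methods the slide pairs together); the function name `pca_reduce` and the toy data are illustrative, not from the original work:

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce X (n_samples x n_features) to k dimensions via SVD-based PCA."""
    Xc = X - X.mean(axis=0)                            # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = principal axes
    return Xc @ Vt[:k].T                               # project onto top-k axes

# Toy example: 2-D points lying near a line collapse to 1-D.
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=100)])
Z = pca_reduce(X, 1)
print(Z.shape)  # (100, 1)
```

Because the data are nearly one-dimensional, the single retained component captures almost all of the variance.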
Characteristics:
• Cope with incomplete data
• Better at explaining the data
• Support visualization
• GTM as an example: fits a Gaussian distribution over the dataset
Collaborative Filtering using GTM:
• Dataset: movie ratings, scaled to [0, 1]
• Each color represents a class of movie
• Visualized on a 2-D plane: romance vs. action
  (blue: action; pink: romance)
Centralized GTM in CF:
• Centralized dataset
  • Large scale: billions of records
  • Expensive to maintain
• Distributed requirements
  • Security concerns: banks, government, military
  • Privacy sensitivity: banks, commercial sites, personal sites
  • Scalability: centralizing the data is expensive
  • Real-time, high-volume data streams
• Distributed learning of statistical models is therefore an important issue
II. Related Work
• Distributed information retrieval
  • Globally: building a P2P network
  • Locally: routing a query
  • Globally: matching the query against a distributed dataset
Distributed Data Mining
• Partitioning of the dataset
  • Horizontal (homogeneous): partitions share the same attributes
  • Vertical (heterogeneous): partitions have different attributes
• Approaches:
  • Distributed k-NN
  • Density-based methods
  • Distributed Bayesian networks
  • For example, a global virtual table is built for a vertical partition
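Distributed k-NN on a horizontal partition can be sketched simply: each site returns its local top-k candidates, and a coordinator merges them. For horizontal partitioning this merge is exact, since the global nearest neighbors must appear among the per-site candidates. The function names here are illustrative, not from the original work:

```python
import numpy as np

def local_knn(partition, query, k):
    """Each site returns its k nearest points (distance, point) to the query."""
    d = np.linalg.norm(partition - query, axis=1)
    idx = np.argsort(d)[:k]
    return list(zip(d[idx], partition[idx]))

def distributed_knn(partitions, query, k):
    """Coordinator merges the per-site candidates and keeps the global top-k."""
    candidates = [c for p in partitions for c in local_knn(p, query, k)]
    candidates.sort(key=lambda dc: dc[0])
    return [pt for _, pt in candidates[:k]]

rng = np.random.default_rng(1)
parts = [rng.normal(size=(50, 2)) for _ in range(3)]  # 3 horizontal partitions
q = np.zeros(2)
nn = distributed_knn(parts, q, k=5)
```

Only k candidates per site cross the network, instead of the full partitions.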
Approaches to distributed learning:
• Mediator-based
• Agent-based
• Grid-based
• Middleware-based
• Density-based
• Model-based
III. Our Approach
• Sparse local data underlying a global model
• Problem overview:
  • Learn three local models
  • Globally merge the local models
  • Decide whether to merge again
A related approach
• Artificial data: a Gaussian mixture model over the global dataset
• MCMC sampling to learn each local model
• The global model is learned from the averaged local models
• Privacy cost distribution: a Gaussian distribution
Density-based merging approach
• The combined global model:
  p(x) = Σ_{i=1..K} α_i p_i(x)
• K: the number of components
• p_i(x): a Gaussian component
• α_i: the weight of component i, satisfying Σ_{i=1..K} α_i = 1
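The combined global model above can be sketched directly: a weighted sum of Gaussian component densities whose weights sum to 1. This is a minimal sketch with isotropic components; the specific weights and means are made-up example values, not from the original experiments:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of an isotropic Gaussian component with variance `var`."""
    d = x.shape[-1]
    norm = (2 * np.pi * var) ** (-d / 2)
    return norm * np.exp(-np.sum((x - mean) ** 2, axis=-1) / (2 * var))

def mixture_pdf(x, weights, means, variances):
    """p(x) = sum_i alpha_i * p_i(x), with sum_i alpha_i = 1."""
    return sum(a * gaussian_pdf(x, m, v)
               for a, m, v in zip(weights, means, variances))

# Two local components merged into one global mixture (weights sum to 1).
weights = np.array([0.3, 0.7])
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
variances = [1.0, 0.5]
x = np.array([[0.0, 0.0], [3.0, 3.0]])
p = mixture_pdf(x, weights, means, variances)
```

At the first component's mean, the density is dominated by that component's term, 0.3 · (2π)^(-1).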
Merging criteria
• Q = argmax(L_ij) + argmin(Cos_ij)
• L_ij: likelihood measure
• Cos_ij: privacy cost between two models
• Two considerations:
  • Privacy cost
  • Likelihood of data being generated by the other model
Steps:
• Learn models locally
• Merge according to likelihood and privacy control
• Stop merging when no clusters are density-connected
• Learn the parameters of the global GMM (K, etc.)
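The likelihood-driven merging loop can be sketched as follows. This is a simplified illustration, not the paper's exact algorithm: components are merged greedily when each mean is sufficiently likely under the other component (a stand-in for density-connectedness), and the merged variance is a weighted average that ignores the spread between the two means:

```python
import numpy as np

def gauss(x, mean, var):
    """Isotropic Gaussian density at x."""
    d = len(mean)
    return (2 * np.pi * var) ** (-d / 2) * np.exp(-np.sum((x - mean) ** 2) / (2 * var))

def merge_components(means, variances, weights, threshold):
    """Greedily merge pairs whose means are likely under each other's component.
    Stops when no pair exceeds the threshold (no longer density-connected)."""
    means, variances, weights = list(means), list(variances), list(weights)
    merged = True
    while merged and len(means) > 1:
        merged = False
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                # symmetric likelihood: each mean under the other component
                lik = min(gauss(means[i], means[j], variances[j]),
                          gauss(means[j], means[i], variances[i]))
                if lik > threshold:
                    w = weights[i] + weights[j]
                    m = (weights[i] * means[i] + weights[j] * means[j]) / w
                    v = (weights[i] * variances[i] + weights[j] * variances[j]) / w
                    for k in sorted((i, j), reverse=True):
                        del means[k], variances[k], weights[k]
                    means.append(m); variances.append(v); weights.append(w)
                    merged = True
                    break
            if merged:
                break
    return means, variances, weights

# Two nearby components merge; the distant one survives on its own.
m, v, w = merge_components(
    means=[np.array([0.0]), np.array([0.2]), np.array([5.0])],
    variances=[1.0, 1.0, 1.0],
    weights=[0.3, 0.3, 0.4],
    threshold=0.1)
```

The weights remain normalized after merging, so the result is still a valid mixture.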
Hierarchical Approach
• Six local models
• Merge according to a similarity measure
• Each level can be controlled by the privacy cost
• Learn a hierarchical model bottom-up
• After a global model is learned, changing the privacy control level changes the model
Model selection
• Sim_ij = Dist(Cost(D_i), Cost(D_j)) < Const
• Cost(D_i): the dataset transformed by the cost function
• Dist(x, y): the distance between two datasets
• Merge when the similarity measure falls below the threshold
• Steps:
  1. Learn a local model from each local dataset.
  2. Based on the predefined privacy control function, merge local models to form a hierarchical global model.
  3. Relabel the local models when the privacy control level changes.
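One bottom-up merging pass under the Sim_ij criterion can be sketched as below. The concrete choices are assumptions for illustration only: `cost` is a hypothetical privacy transform (additive noise) and `dist` compares dataset means; the original leaves both functions abstract:

```python
import numpy as np

def cost(D, noise=0.1, rng=None):
    """Hypothetical privacy-cost transform: perturb the dataset with noise."""
    if rng is None:
        rng = np.random.default_rng(0)
    return D + noise * rng.normal(size=D.shape)

def dist(Di, Dj):
    """Distance between two (transformed) datasets: gap between their means."""
    return np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0))

def merge_pass(datasets, const):
    """One bottom-up pass: merge every pair with Sim_ij below the threshold."""
    merged, used = [], set()
    for i in range(len(datasets)):
        if i in used:
            continue
        group = datasets[i]
        for j in range(i + 1, len(datasets)):
            if j not in used and dist(cost(datasets[i]), cost(datasets[j])) < const:
                group = np.vstack([group, datasets[j]])
                used.add(j)
        merged.append(group)
    return merged

# Three local datasets: the first two are similar, the third is far away.
local = [np.full((10, 2), c) for c in (0.0, 0.1, 5.0)]
level1 = merge_pass(local, const=1.0)
```

Repeating `merge_pass` on its own output, with the threshold (or the cost function) adjusted per level, yields the hierarchy the slide describes.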
Privacy Control by Data Sampling
• Previously: privacy was controlled through the privacy function
• Here: control the privacy-sensitive dataset directly
• D1' = D1 ∪ O_a21(D2) ∪ O_a31(D3) ∪ O_a41(D4)
• D2' = O_a12(D1) ∪ D2 ∪ O_a32(D3) ∪ O_a42(D4)
• …
• O_a12: a sampling operator over the dataset
• New local datasets are reconstructed by sampling from the other local datasets at some privacy control level
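The reconstruction above can be sketched with a simple sampling operator: each new local dataset keeps its own data whole and adds a privacy-controlled fraction sampled from every other site. The interpretation of the levels as sampling fractions is an assumption for illustration:

```python
import numpy as np

def O(D, a, rng):
    """Sampling operator: draw a fraction `a` of rows from dataset D
    (the privacy control level decides how much of D is shared)."""
    n = int(a * len(D))
    idx = rng.choice(len(D), size=n, replace=False)
    return D[idx]

def reconstruct(datasets, levels, rng):
    """Rebuild each local dataset: keep D_i whole, add samples from the others.
    levels[j][i] plays the role of a_ji, the share of D_j given to site i."""
    new = []
    for i, Di in enumerate(datasets):
        parts = [Di] + [O(Dj, levels[j][i], rng)
                        for j, Dj in enumerate(datasets) if j != i]
        new.append(np.vstack(parts))
    return new

rng = np.random.default_rng(3)
datasets = [rng.normal(size=(100, 2)) for _ in range(3)]
levels = [[0.2] * 3 for _ in range(3)]  # share 20% with every other site
new_sets = reconstruct(datasets, levels, rng)
```

With three sites sharing 20% each, every reconstructed dataset has 100 + 20 + 20 = 140 rows; lowering a level toward 0 shares less and protects that site's data more.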
P2P Approach
• Local "small worlds" within the network
• A local global model per small world
• Each node stores local network information
• Trust propagates to connected nodes
• Knowledge is passed on to connected small worlds
Algorithm:
1. Learn a global model for each small world of local nodes.
2. Pass the global information back to each node in the small world.
3. Node_i passes its trust relationship to its connected outer small-world nodes at a certain level.
4. The connected nodes merge their local models with the newly received knowledge.
5. Update the connected global model's knowledge and propagate it to all local models in the small world.
6. Sum all the knowledge collected at L3, update G2, then repeat steps 3-6 until the stopping criterion is satisfied: the iteration limit is reached or the global model changes little.
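The iterate-until-little-change loop of steps 3-6 resembles gossip-style averaging; the sketch below substitutes that well-known technique, with each node's "model knowledge" collapsed to a scalar and trust reduced to plain adjacency. This is not the authors' exact protocol, only an illustration of the stopping rule and the neighbor-to-neighbor propagation:

```python
def gossip_round(values, adjacency):
    """One round: each node averages its value with its connected neighbors."""
    new = values.copy()
    for i, neigh in enumerate(adjacency):
        new[i] = (values[i] + sum(values[j] for j in neigh)) / (1 + len(neigh))
    return new

def propagate(values, adjacency, max_iter=100, tol=1e-6):
    """Repeat until the iteration limit is hit or the values change little
    (the slide's stopping criterion)."""
    for _ in range(max_iter):
        new = gossip_round(values, adjacency)
        if max(abs(a - b) for a, b in zip(new, values)) < tol:
            return new
        values = new
    return values

# Two 'small worlds' bridged by one link through node 2.
adjacency = [[1, 2], [0, 2], [0, 1, 3], [2, 4], [3]]
values = [1.0, 1.0, 1.0, 5.0, 5.0]
result = propagate(values, adjacency)
```

Because the bridged graph is connected, repeated rounds drive both small worlds toward a single shared value.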
IV. Model Evaluation
• Effectiveness criteria
  • Precision: how accurate the model is
  • Recall: how much of the relevant data the model covers
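The two effectiveness criteria have standard set-based definitions, sketched below (the example item sets are made up for illustration):

```python
def precision_recall(predicted, relevant):
    """Precision: fraction of the model's outputs that are correct.
    Recall: fraction of the relevant data the model covers."""
    predicted, relevant = set(predicted), set(relevant)
    tp = len(predicted & relevant)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# 4 items predicted, 3 relevant, 2 correct -> precision 1/2, recall 2/3.
p, r = precision_recall(predicted={1, 2, 3, 4}, relevant={2, 3, 5})
```

Evaluating a distributed model against its centralized counterpart on both scores shows how much accuracy the distribution sacrifices.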
Efficiency criteria
• Communication cost
  • Assuming bandwidth is fixed, cost is proportional to partition size
  • Maximum data transferred
• Overhead: compare the three approaches against the centralized way
• Complexity: computational complexity
V. Experimental Issues
• Another representation for the dataset: site vectors instead of document vectors
• Pick out meaningful representatives of the local models
• Compare LLE vs. GTM, etc.
• Change the privacy distribution to control the shape of the global model