
Distributed Model-Based Learning






Presentation Transcript


  1. Distributed Model-Based Learning PhD student: Zhang, Xiaofeng

  2. I. Model-Based Learning • Methods used in data clustering and dimension reduction • 1. Linear methods: SVD, PCA, kernel PCA, etc. • 2. Pairwise distance methods: multidimensional scaling (MDS), etc. • 3. Topographic maps: elastic net, SOM, generative topographic mapping (GTM), etc. • 4. Manifold learning: locally linear embedding (LLE), etc.
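As a minimal illustration of the first family above, linear dimension reduction by PCA can be computed through the SVD of the centered data matrix. The data here is an assumed toy example, not from the slides.

```python
# Minimal PCA-via-SVD sketch: center the data, take the SVD, and
# project onto the top principal components.
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto the top n_components principal directions."""
    X_centered = X - X.mean(axis=0)
    # Economy SVD: the rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

X = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 1.0],
              [3.0, 1.0, 1.0]])
Z = pca(X, n_components=2)   # 4 samples reduced from 3 to 2 dimensions
```

Because the projection is linear and applied to centered data, the projected points are themselves centered at the origin.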

  3. Characteristics: • Cope with incomplete data • Better at explaining the data • Support visualization • GTM as an example • A Gaussian distribution over the dataset

  4. Collaborative Filtering using GTM: • Dataset: movie data • Ratings on movies in [0, 1] • Each color represents a class of movie • Visualized in a 2-D plane • Romance vs. Action • Blue: Action • Pink: Romance

  5. Centralized GTM in CF: • Centralized dataset • Large scale: billions of records • Expensive to maintain • Distributed requirements • Security concerns: banks, government, military • Privacy-sensitive: banks, commercial sites, personal sites • Scalability • Expensive to centralize • Real-time, huge data streams • A distributed way of learning statistical models is therefore an important issue

  6. II. Related Work • Distributed information retrieval • Globally: building a P2P network • Locally: routing a query • Globally: matching the query against a distributed dataset

  7. Distributed Data Mining • Partitioning of the dataset • Horizontal or homogeneous: attributes are the same across partitions • Vertical or heterogeneous: attributes differ across partitions • Approaches: • Distributed KNN • Density-based • Distributed Bayesian networks • For example, a global virtual table is built for a vertical partition
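The two partition schemes above can be sketched on a small array: horizontal partitions split the records, vertical partitions split the attributes, and for the vertical case a global virtual table rejoins the attribute blocks. The data and site names are illustrative.

```python
# Horizontal vs. vertical partitioning of one dataset across sites.
import numpy as np

data = np.arange(24).reshape(6, 4)   # 6 records, 4 attributes

# Horizontal (homogeneous): each site holds different records, same attributes.
site_a, site_b = data[:3], data[3:]

# Vertical (heterogeneous): each site holds different attributes of all records.
site_c, site_d = data[:, :2], data[:, 2:]

# A "global virtual table" for the vertical case rejoins the attribute blocks.
virtual_table = np.hstack([site_c, site_d])
```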

  8. Approaches to distributed learning: • Mediator based • Agent based • Grid based • Middleware based • Density-based • Model-based

  9. III. Our Approach • Sparse local data • An underlying global model • Problem review • Three local models • Globally merge the local models • Merge again or not?

  10. A related approach • Artificial data • A Gaussian mixture model over the global dataset • MCMC sampling to learn each local model • From the averaged local models, learn the global model • Privacy cost distribution: a Gaussian distribution

  11. Density-based merging approach • The combined global model: p(x_t) = Σ_{i=1}^{K} α_i p_i(x_t) • K: the number of components • p_i(x_t): a Gaussian component • α_i: the weight of component i, satisfying Σ_{i=1}^{K} α_i = 1
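A minimal sketch of the combined global model: a K-component Gaussian mixture whose weights sum to one. One-dimensional components are used for brevity; the parameters are illustrative.

```python
# Evaluate p(x) = sum_i alpha_i * p_i(x) for a 1-D Gaussian mixture.
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian component p_i."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """Combined model: weighted sum of K Gaussian components."""
    assert abs(sum(weights) - 1.0) < 1e-9   # the alpha_i must sum to 1
    return sum(a * gaussian_pdf(x, m, s)
               for a, m, s in zip(weights, mus, sigmas))

p = mixture_pdf(0.0, weights=[0.5, 0.5], mus=[-1.0, 1.0], sigmas=[1.0, 1.0])
```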

  12. Merging criteria • Q = argmax(Lij) + argmin(Cosij) • Lij: likelihood measure • Cosij: privacy cost between two models • Two considerations: • Privacy cost • The likelihood that data is generated by the other model
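One way to read the criterion above is as a pair score that rewards likelihood and penalizes privacy cost, with the best-scoring pair merged first. The combined score (a plain difference) and the example numbers are illustrative assumptions, not the slides' exact formulation.

```python
# Pick the model pair to merge: high likelihood L_ij, low privacy cost Cos_ij.
def best_merge_pair(likelihood, privacy_cost):
    """likelihood[i][j]: how well model j explains model i's data;
    privacy_cost[i][j]: privacy cost of merging models i and j."""
    pairs = [(i, j) for i in likelihood for j in likelihood[i]]
    # Favour high likelihood and low privacy cost.
    return max(pairs,
               key=lambda p: likelihood[p[0]][p[1]] - privacy_cost[p[0]][p[1]])

L = {0: {1: -3.0, 2: -8.0}, 1: {2: -5.0}}   # log-likelihood measures
C = {0: {1: 0.5, 2: 0.1}, 1: {2: 0.2}}      # privacy costs
pair = best_merge_pair(L, C)                # models 0 and 1 score best here
```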

  13. Steps: • Locally learn the models • Merge according to the likelihood and the privacy control • Merging stops when no clusters are density-connected • Learn the parameters of the global GMM (K, etc.)
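The merge loop in the steps above can be sketched as a greedy procedure over 1-D Gaussian components: repeatedly merge the closest pair while any pair is still "density connected", approximated here by a simple mean-distance threshold. The threshold, the moment-matched merge, and the toy components are assumptions for illustration.

```python
# Greedy merging of (weight, mean) components until none are close enough.
def merge_components(components, threshold=1.0):
    """components: list of (weight, mean); merge until no pair is within threshold."""
    comps = list(components)
    while True:
        best = None
        for i in range(len(comps)):
            for j in range(i + 1, len(comps)):
                d = abs(comps[i][1] - comps[j][1])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:          # no density-connected pair left: stop
            return comps
        _, i, j = best
        (wi, mi), (wj, mj) = comps[i], comps[j]
        w = wi + wj
        merged = (w, (wi * mi + wj * mj) / w)   # moment-matched merged component
        comps = [c for k, c in enumerate(comps) if k not in (i, j)] + [merged]

result = merge_components([(0.25, 0.0), (0.25, 0.4), (0.25, 5.0), (0.25, 5.3)])
```

With the toy input, the two near-zero components merge, the two near-five components merge, and the loop stops with two well-separated components whose weights still sum to one.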

  14. Hierarchical Approach • Six local models • Merge according to the similarity measure • Each level can be controlled by the privacy cost • Learn a hierarchical model bottom-up • After a global model is learned, changing the privacy control level can change the model

  15. Model selection • Simij = Dist(Cost(Di), Cost(Dj)) < Const • Cost(Di): transform the dataset using the cost function • Dist(x, y): compute the distance between two datasets • If smaller than the threshold, merge • Steps: • 1. Learn a local model from the local dataset. • 2. Based on the predefined privacy control function, merge local models to form a hierarchical global model. • 3. Relabel the local models according to the changed privacy level.
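The test Simij = Dist(Cost(Di), Cost(Dj)) < Const can be sketched with placeholder choices for both functions: a coarsening transform standing in for the privacy cost function, and a difference of dataset means standing in for Dist. Both placeholders and the threshold are assumptions, not the slides' definitions.

```python
# Merge decision: compare privacy-transformed datasets against a threshold.
def cost(dataset, grain=0.1):
    """Placeholder privacy transform: coarsen each value to a grid."""
    return [round(x / grain) * grain for x in dataset]

def dist(xs, ys):
    """Placeholder distance: difference of dataset means."""
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))

def should_merge(d_i, d_j, const=0.5):
    return dist(cost(d_i), cost(d_j)) < const

merge = should_merge([1.0, 1.2, 0.9], [1.1, 1.3, 1.0])   # means are close
```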

  16. Privacy Control by Data Sampling • Previously, privacy was controlled through the privacy function • Here, instead, control the privacy-sensitive dataset itself • D1’ = D1 U Oa21 (D2) U Oa31(D3) U Oa41(D4) • D2’ = Oa12 (D1) U D2 U Oa32(D3) U Oa42(D4) • … • Oa12: an operator over the dataset • New local datasets are reconstructed by sampling from the other local datasets at some privacy control level
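The reconstruction D1' = D1 U Oa21(D2) U ... can be sketched by treating the operator O as sampling a privacy-controlled fraction of another site's records. The sampling fraction and the toy datasets are assumptions for illustration.

```python
# Rebuild each site's dataset from its own records plus privacy-controlled
# samples drawn from the other sites.
import random

def privacy_sample(dataset, fraction, rng):
    """The operator O: draw a fraction of another site's records."""
    k = int(len(dataset) * fraction)
    return rng.sample(dataset, k)

def reconstruct(own, others, fraction, rng):
    """D' = D  U  O(D_other) for every other site."""
    new = list(own)
    for d in others:
        new.extend(privacy_sample(d, fraction, rng))
    return new

rng = random.Random(0)
d1, d2, d3 = [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]
d1_new = reconstruct(d1, [d2, d3], fraction=0.5, rng=rng)
```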

  17. P2P Approach • Local small worlds in the network • A global model per small world • Each node stores local network information • Trust propagates to connected nodes • Knowledge is passed to connected small worlds

  18. Algorithm: • 1. Learn a global model for each small world of local nodes. • 2. Pass the global information back to each node in the small world. • 3. Node_i passes its trust relationship, at a certain value, to the nodes it connects to in outer small worlds. • 4. The connected nodes merge their local model with the new knowledge into another model. • 5. Update the connected global model's knowledge and propagate it to all the local models in the small world. • 6. Sum all the knowledge L3 collected, update G2, then repeat steps 3-6 until the loop criterion is satisfied: the iteration limit is reached or the global model changes little.

  19. IV. Model Evaluation • Effectiveness criteria • Precision: how accurate the model can be • Recall: how much of the right data the model covers
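The two effectiveness criteria above have a standard set-based form: precision is the fraction of retrieved items that are right, recall the fraction of right items that are retrieved. The item sets here are assumed toy labels.

```python
# Precision and recall over sets of item ids.
def precision_recall(predicted, relevant):
    """predicted, relevant: sets of item ids."""
    tp = len(predicted & relevant)                      # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall({1, 2, 3, 4}, {2, 3, 5})   # 2 of 4 retrieved are right
```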

  20. Efficiency criteria • Communication cost • Assuming the bandwidth is the same, it is proportional only to the partition size • Maximum data transferred • Overhead • Compare the three approaches with the centralized way • Complexity • Computational complexity

  21. V. Experimental Issues • Another approach for the dataset: site vectors instead of document vectors • Pick out meaningful representatives of the local models • LLE vs. GTM, etc. • Change the privacy distribution to control the shape of the global model

  22. Question & Answer
