
Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce

Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang. Internet Services Research Center (ISRC), Microsoft Research, Redmond.


Presentation Transcript


  1. Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce. Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang. Internet Services Research Center (ISRC), Microsoft Research, Redmond

  2. Internet Services Research Center (ISRC) • Advancing the state of the art in online services • Dedicated to accelerating innovations in search and ad technologies • Representing a new model for moving technologies quickly from research projects to improved products and services

  3. Dyadic Data on the Web • The Web abounds with dyadic data • Web search: term by document, query by clicked URL, web linkage, … • Advertising: query by ad, bid term by ad, user by ad, … • Social media: tag by image, user by community, friendship graph, … • Common characteristics • Good source for discovering latent relationships • High-dimensional, sparse, nonnegative, dynamic

  4. Nonnegative Matrix Factorization (NMF) • Effective tool to uncover latent relationships in nonnegative matrices, with many applications [Berry et al., 2007; Sra & Dhillon, 2006] • Interpretable dimensionality reduction [Lee & Seung, 1999] • Document clustering [Shahnaz et al., 2006; Xu et al., 2006] • Challenge: Can we scale NMF to million-by-million matrices?

  5. NMF Algorithm [Lee & Seung, 2000]
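The body of this slide did not survive in the transcript. As a reminder of what the cited algorithm does, here is a minimal sketch of the Lee and Seung multiplicative updates for the squared-error objective, written in plain Python on list-of-lists matrices. The helper names (`matmul`, `transpose`, `nmf_step`, `frobenius_error`) and the small `eps` guard against division by zero are assumptions of this sketch, not taken from the slides.

```python
import random

def matmul(X, Y):
    """Dense product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frobenius_error(A, W, H):
    """Squared Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))

def nmf_step(A, W, H, eps=1e-9):
    """One round of multiplicative updates: H first, then W."""
    Wt = transpose(W)
    # H <- H * (W^T A) / (W^T W H), element-wise
    num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
    H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
         for rh, rn, rd in zip(H, num, den)]
    Ht = transpose(H)
    # W <- W * (A H^T) / (W H H^T), element-wise
    num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
    W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
         for rw, rn, rd in zip(W, num, den)]
    return W, H

# Toy run: factor a random 6-by-5 matrix with k = 2.
random.seed(0)
A = [[random.random() for _ in range(5)] for _ in range(6)]
W = [[random.random() for _ in range(2)] for _ in range(6)]
H = [[random.random() for _ in range(5)] for _ in range(2)]
initial_error = frobenius_error(A, W, H)
for _ in range(50):
    W, H = nmf_step(A, W, H)
final_error = frobenius_error(A, W, H)
```

Because every update multiplies by a ratio of nonnegative quantities, W and H stay nonnegative as long as they start that way.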

  6. Parallel NMF [Robila & Maciak, 2006] • Parallelism on multi-core machines • Partition along the long dimension for parallelism • Assuming all matrices can be held in shared memory

  7. Distributed NMF • Data partition: A, W and H across machines
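The partition diagram is lost, so the exact layout is not recoverable from this transcript. One plausible layout, shown purely as an assumption, stores A as sparse `(i, j, value)` triples, W as rows and H as columns, and assigns each record to a machine by its index. The `partition` helper and the modulo scheme are hypothetical.

```python
from collections import defaultdict

def partition(records, num_machines):
    """Assign (key, value) records to machines by key modulo num_machines.

    Hypothetical scheme for illustration; a real system would typically hash.
    """
    shards = defaultdict(list)
    for key, value in records:
        shards[key % num_machines].append((key, value))
    return dict(shards)

# A as sparse (i, j, value) triples; W as rows w_i; H as columns h_j (k = 2).
A_triples = [(0, 1, 2.0), (1, 0, 1.0), (2, 2, 3.0), (3, 1, 4.0)]
W_rows = [(i, [0.5, 0.5]) for i in range(4)]   # (row index, k-vector)
H_cols = [(j, [1.0, 2.0]) for j in range(3)]   # (column index, k-vector)

V = 2  # number of machines
A_shards = partition([(i, (j, a)) for i, j, a in A_triples], V)
W_shards = partition(W_rows, V)
H_shards = partition(H_cols, V)
```

Keying A by row index in this sketch places each nonzero on the same machine as the row of W it will be multiplied with; keying by column index would do the same for H.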

  8. Computing DNMF: The Big Picture

  9. [Diagram: overall data flow through Map-I to Map-V and Reduce-I, Reduce-II, Reduce-III, Reduce-V]

  10. [Diagram: Map-I, Map-II, Reduce-I, Reduce-II]

  11. [Diagram: Map-III, Map-IV, Reduce-III]

  12. [Diagram: Map-V, Reduce-V]

  13. [Diagram: overall data flow, repeated from slide 9]
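The diagrams above name the MapReduce phases but not what each one computes. As a rough illustration of how the matrix products in the NMF update for H could be phrased as map and reduce steps, the sketch below computes W^T A and W^T W with a tiny in-memory stand-in for a MapReduce job. The `map_reduce` helper, the toy data, and the way the work is split are assumptions of this sketch; how they correspond to Map-I through Map-V is not recoverable from the transcript.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """In-memory stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def vec_sum(vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(parts) for parts in zip(*vectors)]

def outer(w):
    """Outer product w w^T as a list of rows."""
    return [[a * b for b in w] for a in w]

# Toy input: A as (i, j, a_ij) triples and W as a dict of rows (k = 2).
A_triples = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
W = {0: [1.0, 0.0], 1: [0.5, 2.0]}

# Column j of X = W^T A is the sum over i of a_ij * w_i.
X = map_reduce(
    A_triples,
    mapper=lambda t: [(t[1], [t[2] * w for w in W[t[0]]])],
    reducer=lambda j, vectors: vec_sum(vectors),
)

# B = W^T W is the sum over i of the k-by-k outer products w_i w_i^T.
B = map_reduce(
    W.items(),
    mapper=lambda item: [(0, outer(item[1]))],
    reducer=lambda _, mats: [vec_sum(rows) for rows in zip(*mats)],
)[0]

print(X)  # {0: [1.0, 0.0], 1: [3.5, 6.0]}
print(B)  # [[1.25, 1.0], [1.0, 4.0]]
```

Both products reduce to sums of small k-dimensional pieces, so no machine needs to hold all of A or W at once.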

  14. Experimental Evaluation • Synthesized data on a sandbox cluster • No interference from other jobs • Performance with various parameters • Real-world data on a commercial cluster • Real-world scalability

  15. Synthesized Data on Sandbox Cluster • A Hadoop cluster with 8 workers in total • Worker: Pentium-IV CPU, 1 or 2 cores, 1 to 2 GB memory, 150 GB hard drive • V: number of workers in the cluster • Matrix simulator • Generates an m-by-n matrix with sparsity δ • k: factorization dimensionality • Defaults:
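The simulator itself is not shown in the slides. A minimal sketch of such a generator follows, assuming δ is the fraction of nonzero cells (consistent with the later "linear to m×n×δ" observation) and that values are drawn uniformly from (0, 1]. The function name and the sizes in the example are hypothetical; the slide's default values are missing from the transcript.

```python
import random

def simulate_sparse_matrix(m, n, delta, seed=0):
    """Return (i, j, value) triples; each cell is nonzero with probability delta.

    Values are uniform in (0, 1], so every emitted entry is strictly positive.
    """
    rng = random.Random(seed)
    return [(i, j, 1.0 - rng.random())
            for i in range(m) for j in range(n)
            if rng.random() < delta]

triples = simulate_sparse_matrix(m=100, n=100, delta=0.1)
print(len(triples))  # close to m * n * delta = 1000
```

This loop visits every cell, which is fine for small tests; for very sparse, very large matrices one would sample nonzero positions directly instead.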

  16. Computation Breakdown • dominates the computation • is lightweight • The sparser, the faster

  17. Performance w.r.t. Parameters • Linear in m×n×δ • Linear in factorization dimension k • Sub-ideal speedup w.r.t. cluster size V

  18. Scalability on Real-world Data • User-by-Website matrix • Browsed URLs of opt-in users, represented by UID • URLs trimmed to site level • http://www.cnn.com/breakingnews --> www.cnn.com • Experiments on Microsoft SCOPE • SCOPE: Structured Computations Optimized for Parallel Execution [Chaiken et al., VLDB’08]
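Trimming a URL to site level, as in the cnn.com example above, can be sketched with Python's standard `urllib.parse`. The function name `site_of` is hypothetical, and the slides do not say how the authors actually performed the trimming.

```python
from urllib.parse import urlsplit

def site_of(url):
    """Return the lowercased host of a URL, dropping scheme, port, path and query."""
    return urlsplit(url).hostname

print(site_of("http://www.cnn.com/breakingnews"))  # www.cnn.com
```

Grouping by host collapses all pages of one site into a single column of the user-by-website matrix.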

  19. Executions w.r.t. Iterations • Observations • Longer total elapsed time • Shorter time per iteration • Reason • Overlapped computation across iterations [Plot: normalized elapsed time vs. iterations]

  20. Scalability w.r.t. Matrix Size • 3 hours per iteration; 20 iterations take around 20 × 3 × 0.72 ≈ 43 hours • Less than 7 hours on a 43.9M-by-769M matrix with 4.38 billion nonzero values
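The arithmetic on this slide can be checked directly. The sketch below reproduces the 43-hour estimate and also derives the density of the large matrix from the figures quoted. Reading 0.72 as a per-iteration discount from overlapped computation (slide 19) is an assumption; the transcript does not state where the factor comes from.

```python
iterations = 20
hours_per_iteration = 3
overlap_factor = 0.72  # from the slide's formula; its interpretation is assumed

total_hours = iterations * hours_per_iteration * overlap_factor

# Density of the 43.9M-by-769M matrix with 4.38 billion nonzeros.
rows, cols, nonzeros = 43.9e6, 769e6, 4.38e9
density = nonzeros / (rows * cols)

print(round(total_hours, 1))  # 43.2
print(f"{density:.2e}")       # 1.30e-07
```

A density near 1.3 × 10⁻⁷ means roughly one nonzero per 7.7 million cells, which is why cost proportional to the number of nonzeros matters at this scale.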

  21. Conclusion • NMF is an effective tool to uncover latent structures in dyadic data, which is abundant on the Web • NMF is amenable to MapReduce • Distributed NMF solves the scalability challenge • Applications down the road

  22. Q&A Thank You!
