
LDA Training System

LDA Training System. xueminzhao@tencent.com 8/22/2012. Outline: Introduction; SparseLDA; Rethinking LDA: Why Priors Matter; LDA Training System Design: MapReduce-LDA.


Presentation Transcript


  1. LDA Training System xueminzhao@tencent.com 8/22/2012

  2. Outline • Introduction • SparseLDA • Rethinking LDA: Why Priors Matter • LDA Training System Design: MapReduce-LDA

  3. Outline • Introduction • SparseLDA • Rethinking LDA: Why Priors Matter • LDA Training System Design: MapReduce-LDA

  4. Problem – Text Relevance • Q1: apple pie • Q2: iphone crack • Doc1: Apple Computer Inc. is a well-known company located in California, USA. • Doc2: The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family.

  5. Topic Models

  6. Topic Model – Generative Process

  7. Topic Model - Inference

  8. Latent Dirichlet Allocation

  9. Outline • Introduction • SparseLDA • Rethinking LDA: Why Priors Matter • LDA Training System Design: MapReduce-LDA

  10. Gibbs Sampling for LDA

  11. Gibbs Sampling for LDA

  12. Document-Topic Statistics

  13. Topic-Word Statistics

  14. For each token,

  15. For each token,

  16. For each token,

  17. For each token,

  18. For each token,

  19. Sample a new topic

  20. For each token,

  21. Summary so far

  22. The normalizing constant

  23. The normalizing constant

  24. The normalizing constant

  25. Statistics are sparse

  26. Summary so far

  27. Huge savings: time and memory
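The SparseLDA slides above survive only as titles, so the following is a minimal sketch of the bucket decomposition described in the Yao, Mimno, and McCallum KDD'09 paper listed in the references, not the implementation from this deck. It splits the normalizing constant of the collapsed Gibbs conditional into a smoothing-only term, a document-topic term, and a topic-word term. All function and variable names are assumptions chosen for illustration, the counts are assumed to exclude the token being resampled, and dense Python lists stand in for the sparse structures and cached sums that a real implementation would use.

```python
def dense_weights(alpha, beta, V, n_td, n_wt, n_t):
    """Unnormalized collapsed-Gibbs weight of every topic for one token."""
    return [(alpha[t] + n_td[t]) * (beta + n_wt[t]) / (beta * V + n_t[t])
            for t in range(len(alpha))]


def bucket_terms(alpha, beta, V, n_td, n_wt, n_t):
    """Split the same weights into three buckets of (topic, mass) pairs.

    q: topic-word bucket, non-zero only where the word occurs in the topic
    r: document-topic bucket, non-zero only where the topic occurs in the doc
    s: smoothing-only bucket, non-zero for every topic
    """
    T = len(alpha)
    denom = [beta * V + n_t[t] for t in range(T)]
    q = [(t, (alpha[t] + n_td[t]) * n_wt[t] / denom[t])
         for t in range(T) if n_wt[t] > 0]
    r = [(t, n_td[t] * beta / denom[t]) for t in range(T) if n_td[t] > 0]
    s = [(t, alpha[t] * beta / denom[t]) for t in range(T)]
    return q, r, s


def sample_topic(alpha, beta, V, n_td, n_wt, n_t, u):
    """Map u in [0, 1) to a topic by walking the q, r, s buckets in order."""
    buckets = bucket_terms(alpha, beta, V, n_td, n_wt, n_t)
    total = sum(mass for bucket in buckets for _, mass in bucket)
    x = u * total
    last = len(alpha) - 1
    for bucket in buckets:
        for t, mass in bucket:
            x -= mass
            last = t
            if x < 0:
                return t
    return last  # guard against floating-point rounding at the very end


if __name__ == "__main__":
    import random

    # Hypothetical toy counts: 4 topics, vocabulary of 10 words.
    alpha = [0.1, 0.1, 0.1, 0.1]
    beta, V = 0.01, 10
    n_td = [3, 0, 0, 1]      # topic counts in the current document
    n_wt = [2, 0, 5, 0]      # counts of the current word in each topic
    n_t = [20, 15, 30, 10]   # total tokens assigned to each topic

    q, r, s = bucket_terms(alpha, beta, V, n_td, n_wt, n_t)
    print("bucket masses q, r, s:",
          sum(m for _, m in q), sum(m for _, m in r), sum(m for _, m in s))
    print("dense normalizer:", sum(dense_weights(alpha, beta, V, n_td, n_wt, n_t)))
    print("sampled topic:",
          sample_topic(alpha, beta, V, n_td, n_wt, n_t, random.random()))
```

The three bucket masses add up to the dense normalizer because (α_t + n_td)(β + n_wt) expands into exactly those three terms. The savings come from the q and r buckets touching only topics with non-zero counts, which is where most of the probability mass usually sits.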

  28. Outline • Introduction • SparseLDA • Rethinking LDA: Why Priors Matter • LDA Training System Design: MapReduce-LDA

  29. Priors for LDA

  30. Priors for LDA

  31. Priors for LDA

  32. Priors for LDA

  33. Priors for LDA

  34. Comparing Priors for LDA

  35. Optimizing m

  36. Selecting T

  37. Outline • Introduction • SparseLDA • Rethinking LDA: Why Priors Matter • LDA Training System Design: MapReduce-LDA

  38. Overview

  39. MapReduce Jobs

  40. Scalability • Assumptions - 40 GB of memory per machine; - 5 words per doc. • Scalability - if #docs <= 1,000,000,000, there is no limit on #topics; - if #topics < 14,000, there is no limit on #docs.

  41. Experiment for Correctness Validation

  42. References • D. Blei, Andrew Ng, and M. Jordan, Latent Dirichlet Allocation, JMLR 2003. • Thomas L. Griffiths and Mark Steyvers, Finding Scientific Topics, PNAS 2004. • Gregor Heinrich, Parameter Estimation for Text Analysis, Technical Report, 2009. • Limin Yao, David Mimno, and Andrew McCallum, Efficient Methods for Topic Model Inference on Streaming Document Collections, KDD 2009. • Hanna M. Wallach, David Mimno, and Andrew McCallum, Rethinking LDA: Why Priors Matter, NIPS 2009. • David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling, Distributed Inference for Latent Dirichlet Allocation, NIPS 2007. • Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang, PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications, AAIM 2009. • Xueminzhao, LDA design doc, http://x.x.x.x/~xueminzhao/html_docs/internal/modules/lda.html.

  43. Thanks!
