LDA Training System xueminzhao@tencent.com 8/22/2012
Outline • Introduction • SparseLDA • Rethinking LDA: Why Priors Matter • LDA Training System Design: MapReduce-LDA
Problem – Text Relevance • Q1: apple pie • Q2: iphone crack • Doc1: Apple Computer Inc. is a well-known company located in California, USA. • Doc2: The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family.
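To make the problem concrete, here is a minimal sketch of plain term-overlap scoring applied to the two queries and two documents above. The tokenizer and the `overlap` score are illustrative assumptions, not part of the original slides; the point is only that surface keyword matching cannot tell Doc1 from Doc2 for Q1, and finds nothing at all for Q2, which is presumably the gap a topic model such as LDA is meant to close.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Queries and documents taken from the slide.
queries = {"Q1": "apple pie", "Q2": "iphone crack"}
docs = {
    "Doc1": "Apple Computer Inc. is a well-known company located in California, USA.",
    "Doc2": "The apple is the pomaceous fruit of the apple tree, "
            "species Malus domestica in the rose family.",
}

def overlap(query: str, doc: str) -> int:
    """Hypothetical relevance score: number of distinct terms shared."""
    return len(tokens(query) & tokens(doc))

scores = {
    (q_name, d_name): overlap(q_text, d_text)
    for q_name, q_text in queries.items()
    for d_name, d_text in docs.items()
}

for (q_name, d_name), score in sorted(scores.items()):
    print(q_name, d_name, score)
```

Q1 shares only the term "apple" with each document, so both get the same score even though only Doc2 is about the fruit. Q2 shares no term with either document, even though Doc1 is about the company that makes the iPhone.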
Scalability • Assumptions - 40 GB of memory per machine; - 5 words per doc. • Scalability - if #<docs> <= 1,000,000,000, no #<topics> limit; - if #<topics> < 14,000, no #<docs> limit.
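The slide does not show how these limits were derived. The following back-of-envelope sketch is one possible reading, under assumptions that are not in the original: each token costs 8 bytes (a 4-byte word id plus a 4-byte topic assignment), the word-topic count table is dense with 4-byte counters, and 40 GB means 40 × 10^9 bytes. Under the first assumption the document limit of 1,000,000,000 falls out exactly; the second only shows what vocabulary size the 14,000-topic limit would imply, and should not be read as the author's actual figure.

```python
# Figures from the slide.
MEMORY_BYTES = 40 * 10**9      # 40 GB per machine (decimal GB assumed)
WORDS_PER_DOC = 5

# Assumptions for illustration only (not stated in the slides).
BYTES_PER_TOKEN = 8            # 4-byte word id + 4-byte topic assignment
BYTES_PER_COUNT = 4            # one 32-bit counter per (word, topic) cell

def max_docs_in_memory(memory_bytes: int, words_per_doc: int,
                       bytes_per_token: int) -> int:
    """Number of documents whose tokens fit in one machine's memory."""
    return memory_bytes // (words_per_doc * bytes_per_token)

def implied_vocab_size(memory_bytes: int, num_topics: int,
                       bytes_per_count: int) -> int:
    """Vocabulary size at which a dense word-topic table fills memory."""
    return memory_bytes // (num_topics * bytes_per_count)

max_docs = max_docs_in_memory(MEMORY_BYTES, WORDS_PER_DOC, BYTES_PER_TOKEN)
implied_vocab = implied_vocab_size(MEMORY_BYTES, 14_000, BYTES_PER_COUNT)

print(f"docs that fit on one machine: {max_docs:,}")
print(f"vocabulary implied by a 14,000-topic limit: {implied_vocab:,}")
```

Read this way, the two conditions are complementary: either the whole corpus fits on one machine, or the whole model does, and whichever does not fit can be partitioned across machines.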
References • David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. JMLR 2003. • Thomas L. Griffiths and Mark Steyvers. Finding Scientific Topics. PNAS 2004. • Gregor Heinrich. Parameter Estimation for Text Analysis. Technical Report, 2009. • Limin Yao, David Mimno, and Andrew McCallum. Efficient Methods for Topic Model Inference on Streaming Document Collections. KDD 2009. • Hanna M. Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why Priors Matter. NIPS 2009. • David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed Inference for Latent Dirichlet Allocation. NIPS 2007. • Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications. AAIM 2009. • Xueminzhao. LDA design doc. http://x.x.x.x/~xueminzhao/html_docs/internal/modules/lda.html.