LDA Training System

LDA Training System xueminzhao@tencent.com 8/22/2012

Outline
• Introduction
• SparseLDA
• Rethinking LDA: Why Priors Matter
• LDA Training System Design: MapReduce-LDA

Problem – Text Relevance
• Q1: apple pie
• Q2: iphone crack
• Doc1: Apple Computer Inc. is a well-known company located in California, USA.
• Doc2: The apple is the pomaceous fruit of the apple tree, species Malus domestica in the rose family.

Topic Models

Topic Model – Generative Process

Topic Model - Inference

Latent Dirichlet Allocation

Gibbs Sampling for LDA

Document-Topic Statistics

Topic-Word Statistics

For each token,

Sample a new topic
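The per-token step named on these slides can be sketched as a standard collapsed Gibbs sampler in the style of Griffiths and Steyvers. This is a minimal illustration, not the training system's code: the function name `gibbs_lda`, the symmetric priors `alpha` and `beta`, and the dense count tables are all assumptions made for the example.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors (illustrative sketch).

    docs is a list of documents, each a list of integer word ids in [0, vocab_size).
    """
    rng = random.Random(seed)
    # Count tables: document-topic, topic-word, and per-topic totals.
    n_dt = [[0] * num_topics for _ in docs]
    n_tw = [[0] * vocab_size for _ in range(num_topics)]
    n_t = [0] * num_topics

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_d.append(t)
            n_dt[d][t] += 1
            n_tw[t][w] += 1
            n_t[t] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = z[d][i]
                n_dt[d][t] -= 1
                n_tw[t][w] -= 1
                n_t[t] -= 1
                # Unnormalized conditional p(z = k | all other assignments).
                weights = [
                    (n_dt[d][k] + alpha) * (n_tw[k][w] + beta)
                    / (n_t[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                # Sample a new topic and add the token back into the counts.
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                n_dt[d][t] += 1
                n_tw[t][w] += 1
                n_t[t] += 1
    return z, n_dt, n_tw
```

Computing all `num_topics` weights for every token is the cost that the SparseLDA part of the talk aims to reduce.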

For each token,

Summary so far

The normalizing constant

Statistics are sparse
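For reference, the decomposition these slides appear to build on is the SparseLDA one from Yao, Mimno, and McCallum (KDD'09). The notation below is an assumption, not taken from the slides: $n_{t|d}$ is the count of topic $t$ in document $d$, $n_{w|t}$ the count of word $w$ in topic $t$, $n_t$ the total token count of topic $t$, $V$ the vocabulary size, $\alpha_t$ the document-topic prior, and $\beta$ the topic-word prior.

```latex
\begin{align*}
p(z = t \mid w, d)
  &\propto \frac{(\alpha_t + n_{t|d})\,(\beta + n_{w|t})}{\beta V + n_t} \\
  &= \underbrace{\frac{\alpha_t \beta}{\beta V + n_t}}_{\text{smoothing-only term}}
   + \underbrace{\frac{n_{t|d}\,\beta}{\beta V + n_t}}_{\text{document-topic term}}
   + \underbrace{\frac{(\alpha_t + n_{t|d})\,n_{w|t}}{\beta V + n_t}}_{\text{topic-word term}}
\end{align*}
```

Summing each term over $t$ gives three parts of the normalizing constant. The first does not depend on the document or the word, the second is nonzero only for topics present in the document, and the third is nonzero only for topics in which the word occurs, which is where the sparsity of the statistics pays off.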

Summary so far

Huge savings: time and memory

Priors for LDA

Comparing Priors for LDA

Optimizing m

Selecting T

Overview

MapReduce Jobs

Scalability
• Assumptions
  - memory: 40 GB per machine;
  - 5 words per doc.
• Scalability
  - if #<docs> <= 1,000,000,000, no #<topics> limit;
  - if #<topics> < 14,000, no #<docs> limit.

Experiment for Correctness Validation

References
• D. Blei, Andrew Ng, and M. Jordan. Latent Dirichlet Allocation. JMLR 2003.
• Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. PNAS 2004.
• Gregor Heinrich. Parameter estimation for text analysis. Technical Report, 2009.
• Limin Yao, David Mimno, and Andrew McCallum. Efficient Methods for Topic Model Inference on Streaming Document Collections. KDD'09.
• Hanna M. Wallach, David Mimno, and Andrew McCallum. Rethinking LDA: Why Priors Matter. NIPS 2009.
• David Newman, Arthur Asuncion, Padhraic Smyth, and Max Welling. Distributed Inference for Latent Dirichlet Allocation. NIPS 2007.
• Yi Wang, Hongjie Bai, Matt Stanton, Wen-Yen Chen, and Edward Y. Chang. PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications. AAIM 2009.
• Xueminzhao. LDA design doc. http://x.x.x.x/~xueminzhao/html_docs/internal/modules/lda.html.

Thanks!
