
A General Optimization Framework for Smoothing Language Models on Graph Structures


Presentation Transcript


  1. A General Optimization Framework for Smoothing Language Models on Graph Structures Qiaozhu Mei, Duo Zhang, ChengXiang Zhai University of Illinois at Urbana-Champaign

  2. The language modeling approach to retrieval (overview diagram). Document d, "a text mining paper", yields a document language model (LM) θd: p(w|d), estimated by maximum likelihood: text 4/100 = 0.04, mining 3/100 = 0.03, clustering 1/100 = 0.01, ..., data = 0, computing = 0, .... Query q, "data mining", yields a query language model θq: p(w|q): data 1/2 = 0.5, mining 1/2 = 0.5. The retrieval method scores documents with a similarity function between the two models, e.g., Kullback-Leibler divergence. Smoothing replaces θd with a smoothed doc LM θd': p(w|d'): text = 0.039, mining = 0.028, clustering = 0.01, ..., data = 0.001, computing = 0.0005, ...; the query model may likewise be smoothed to p(w|q'), e.g., data = 0.4, mining = 0.4, clustering = 0.1, ....

  3. Smoothing a Document Language Model. Retrieval performance depends on the estimated LM, which in turn depends on how the LM is smoothed. Smoothing serves two purposes: (1) estimate a more accurate distribution from sparse data, e.g., moving the MLE text 4/100 = 0.04, mining 3/100 = 0.03, assoc. 1/100 = 0.01, clustering 1/100 = 0.01 toward text = 0.039, mining = 0.028, assoc. = 0.009 (and further toward text = 0.038, mining = 0.026, assoc. = 0.008 with more aggressive smoothing); and (2) assign non-zero probability to unseen words, e.g., data = 0.001, computing = 0.0005 instead of zero.
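
A minimal sketch of what interpolation-based smoothing computes, using hypothetical toy counts that echo the slide's numbers (the interpolation weight and collection counts are assumptions):

```python
from collections import Counter

def jm_smooth(doc_counts, coll_counts, lam=0.1):
    """Interpolate the document MLE with a collection (background)
    model so that unseen words receive non-zero probability."""
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    return {w: (1 - lam) * doc_counts.get(w, 0) / doc_len
               + lam * coll_counts[w] / coll_len
            for w in coll_counts}        # smooth over the full vocabulary

# Toy numbers echoing the slide: "text" seen 4/100 times, "data" unseen.
doc = Counter({"text": 4, "mining": 3, "clustering": 1, "other": 92})
coll = Counter({"text": 300, "mining": 300, "clustering": 100,
                "data": 50, "computing": 25, "other": 9225})
lm = jm_smooth(doc, coll)
print(lm["text"])   # 0.9 * 0.04 + 0.1 * 0.03 = 0.039
print(lm["data"])   # 0.1 * 0.005 = 0.0005, no longer zero
```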

  4. Previous Work on Smoothing. Estimate a reference language model θref from the collection (corpus) [Ponte & Croft 98], from document clusters [Liu & Croft 04], or from nearest-neighbor documents [Kurland & Lee 04]; then interpolate the MLE with the reference LM.

  5. Problems of Existing Methods • Smoothing with the global background: ignores the collection structure • Smoothing with document clusters: ignores local structure inside each cluster • Smoothing with neighbor documents: ignores the global structure • Different heuristics for θref and interpolation: no clear objective function to optimize, and no guidance on how to further improve the existing methods

  6. Research Questions • What is the right corpus structure to use? • What are the criteria for a good smoothing method? • Accurate language model? • What are we ending up optimizing? • Could there be a general optimization framework?

  7. Our Contribution • Formulation of smoothing as optimization over graph structures • A general optimization framework for smoothing both document LMs and query LMs • Novel instantiations of the framework lead to more effective smoothing methods

  8. A Graph-based Formulation of Smoothing: a novel and general view. Treat the collection as a graph of documents. For a fixed word w, the values p(w|d) form a surface on top of the graph: the MLE surface is bumpy (e.g., p(w|d1) and p(w|d2) can differ sharply for neighboring documents d1, d2 when projected onto a plane), while the smoothed surface varies gradually. Smoothed LM = smoothed surface!
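
A minimal sketch of the "collection = graph" view, building a kNN document graph with cosine similarity as the edge weight (the TF-IDF vectorizer and k are assumptions; the paper's exact construction may differ):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def knn_doc_graph(docs, k=5):
    """Vertices are documents; each document is linked to its k most
    cosine-similar neighbors, with w(u, v) = cosine similarity."""
    X = TfidfVectorizer().fit_transform(docs)
    sim = cosine_similarity(X)
    np.fill_diagonal(sim, 0.0)               # no self-loops
    W = np.zeros_like(sim)
    for u in range(sim.shape[0]):
        nbrs = np.argsort(sim[u])[-k:]       # indices of u's top-k neighbors
        W[u, nbrs] = sim[u, nbrs]
    return np.maximum(W, W.T)                # symmetrize the graph
```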

  9. Covering Existing Models. Each existing method corresponds to smoothing over a particular graph: smoothing with the global background = a star graph centered on the background model; smoothing with document clusters = a forest whose roots are pseudo-documents (the clusters C1, ..., C4 in the figure); smoothing with nearest neighbors = a local graph around each document. In every case, collection = graph and smoothed LM = smoothed surface.

  10. Instantiations of the Formulation: Document Graphs

  11. Smoothing over Word Graphs. Alternatively, build a similarity graph of words. Given a document d, the values {p(w|d)} form a surface over the word graph, with vertex values p(wu|d), p(wv|d) normalized by degree as p(wu|d)/Deg(u). Smoothed LM = smoothed surface!

  12. The General Objective of Smoothing. Two competing terms: fidelity to the MLE, and smoothness of the surface. The objective is weighted by the importance of vertices, w(u), and the weights of edges, w(u,v) (e.g., inverse distance).

  13. The Optimization Framework • Criteria: • Fidelity: keep close to the MLE • Surface smoothness: local and global consistency • Constraint: each smoothed model must remain a valid probability distribution • Unified optimization objective: fidelity to MLE + smoothness of the surface
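
In symbols, one standard form of such an objective (a sketch consistent with the slide's two criteria; the exact normalization and trade-off parameterization in the paper may differ):

```latex
O(\{f_u\}) = (1-\lambda) \sum_{u \in V} w(u)\,\bigl(f_u - \hat{f}_u\bigr)^2
           + \lambda \sum_{(u,v) \in E} w(u,v)
             \left(\frac{f_u}{\sqrt{\mathrm{Deg}(u)}}
                 - \frac{f_v}{\sqrt{\mathrm{Deg}(v)}}\right)^2
```

The first term penalizes departure from the MLE values f̂u on important vertices; the second penalizes a bumpy surface across strongly weighted edges; λ trades the two off.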

  14. The Procedure of Smoothing. (1) Define the graph: construct a document/word graph and define reasonable w(u) and w(u,v). (2) Define the surfaces: define a reasonable fu. (3) Smooth the surfaces: iterative updating, followed by additional Dirichlet smoothing. A sketch of the iterative update follows.
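
A minimal sketch of the iterative updating step, in the usual label-propagation style for fidelity-plus-smoothness objectives (the specific update rule and λ are assumptions, not the paper's exact formula):

```python
import numpy as np

def smooth_surface(W, f_hat, lam=0.5, iters=50):
    """Each iteration pulls f_u toward its MLE value f_hat[u]
    (fidelity) and toward the weighted average of its neighbors'
    values (smoothness)."""
    deg = W.sum(axis=1)
    deg[deg == 0] = 1.0                      # guard isolated vertices
    f = f_hat.astype(float)
    for _ in range(iters):
        f = (1 - lam) * f_hat + lam * (W @ f) / deg
    return f
```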

  15. Smoothing Language Models using a Document Graph. Construct a kNN graph of documents, with w(u) = Deg(u) and w(u,v) = cosine similarity. Instantiate fu = p(w|du) to smooth the document language model, or alternatively fu = s(q, du) to smooth a document relevance score, e.g., (Diaz 05). Apply additional Dirichlet smoothing afterwards; see the combined sketch below.
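
A hypothetical end-to-end example reusing the two helpers sketched above (the document texts and MLE values are toy data):

```python
import numpy as np

docs = ["text mining of documents", "data mining survey",
        "clustering of text data"]
W = knn_doc_graph(docs, k=2)            # document graph, as sketched earlier
f_hat = np.array([0.04, 0.00, 0.01])    # MLE p(w|d_u) for one word w
f = smooth_surface(W, f_hat, lam=0.5)   # smoothed p(w|d_u) per document
# In practice this runs per word (or uses s(q, d_u) as the surface),
# and the result is interpolated with a Dirichlet-smoothed background.
```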

  16. Smoothing Language Models using a Word Graph. Construct a kNN graph of words, with w(u) = Deg(u) and w(u,v) = pointwise mutual information (PMI). The surface fu can instantiate either a document language model or a query language model, again with additional Dirichlet smoothing afterwards.
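
A minimal sketch of PMI edge weights for the word graph, estimated from document-level co-occurrence (the count thresholds and the choice to keep only positive PMI are assumptions):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_word_graph(docs, min_count=2):
    """w(u, v) = PMI(u, v) = log p(u, v) / (p(u) p(v)), with the
    probabilities estimated from document co-occurrence counts."""
    n = len(docs)
    word_df, pair_df = Counter(), Counter()
    for d in docs:
        words = sorted(set(d.split()))
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    W = {}
    for (u, v), c in pair_df.items():
        if c >= min_count:
            pmi = math.log(c * n / (word_df[u] * word_df[v]))
            if pmi > 0:                 # keep only positively associated pairs
                W[(u, v)] = pmi
    return W
```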

  17. Intuitive Interpretation: Smoothing using a Word Graph. The smoothed model corresponds to the stationary distribution of a Markov chain over words: writing a document is a random walk on the word Markov chain that writes down w whenever it passes through w.
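
A minimal sketch of this interpretation: power iteration for the stationary distribution of the row-normalized word graph (assumes every word has at least one edge):

```python
import numpy as np

def stationary_distribution(W, iters=1000, tol=1e-10):
    """Long-run fraction of time a random walk on the word graph
    spends at each word, i.e., pi = pi P for the row-normalized
    transition matrix P."""
    P = W / W.sum(axis=1, keepdims=True)
    pi = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        nxt = pi @ P
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    return pi
```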

  18. Intuitive Interpretation: Smoothing using a Document Graph. The smoothed probability corresponds to the absorption probability into the "1" state of a Markov chain over documents: writing a word w in a document is a random walk on the document Markov chain that writes down w if it reaches the absorbing "1" state (rather than the "0" state). Intuitively, each document acts as its neighbors do.
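
A minimal sketch of the absorption-probability view: with "1" and "0" as absorbing states, the probability of absorption at "1" from each transient document solves a small linear system (the state indexing here is hypothetical):

```python
import numpy as np

def absorption_prob(P, one_states, zero_states):
    """P is the full transition matrix, including the absorbing '1'
    and '0' states.  Solves (I - Q) h = r on the transient states,
    where Q is transient-to-transient and r collects one-step
    transitions into the '1' states."""
    absorbing = set(one_states) | set(zero_states)
    transient = [s for s in range(P.shape[0]) if s not in absorbing]
    Q = P[np.ix_(transient, transient)]
    r = P[np.ix_(transient, list(one_states))].sum(axis=1)
    h = np.linalg.solve(np.eye(len(transient)) - Q, r)
    return dict(zip(transient, h))
```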

  19. Experiments. Four instantiations are evaluated against baselines including Liu and Croft '04 and Tao '06: • Smooth document LM on a document graph (DMDG) • Smooth document LM on a word graph (DMWG) • Smooth relevance score on a document graph (DSDG) • Smooth query LM on a word graph (QMWG) • Evaluate using MAP; a sketch of the metric follows
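
For reference, a minimal sketch of the MAP (mean average precision) metric used in the evaluation:

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list of doc ids."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """runs: query id -> ranked doc ids; qrels: query id -> set of
    relevant doc ids.  MAP is the mean AP over all queries."""
    return sum(average_precision(runs[q], qrels[q]) for q in runs) / len(runs)
```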

  20. Effectiveness of the Framework. Wilcoxon test: *, **, *** denote significance levels 0.1, 0.05, 0.01. † DMWG reranks only the top 3000 results, which usually yields lower performance than ranking all documents. Findings: graph-based smoothing >> baseline; smoothing the doc LM >> smoothing the relevance score >> smoothing the query LM.
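
The significance test pairs the per-query AP values of two runs; a sketch with scipy (the AP values are hypothetical toy data):

```python
from scipy.stats import wilcoxon

# Hypothetical per-query average-precision values, aligned by query,
# for the baseline run and the graph-smoothed run.
ap_baseline = [0.21, 0.35, 0.10, 0.42, 0.28, 0.19, 0.33, 0.25]
ap_graph    = [0.24, 0.36, 0.15, 0.45, 0.27, 0.23, 0.38, 0.29]

stat, p_value = wilcoxon(ap_baseline, ap_graph)
print(f"Wilcoxon signed-rank p = {p_value:.3f}")
```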

  21. Comparison with Existing Models Graph-based smoothing > state-of-the-art More iterations > Single iteration (similar to DELM)

  22. Combined with Pseudo-Feedback. [Diagram: the query LM is smoothed over the word graph; the top-ranked documents are then smoothed over the document graph and reranked.]

  23. Related Work • Language modeling in Information Retrieval; smoothing using collection model • (Ponte & Croft 98); (Hiemstra & Kraaij 98); (Miller et al. 99); (Zhai & Lafferty 01), etc. • Smoothing using corpus structures • Cluster structure: (Liu & Croft 04), etc. • Nearest Neighbors: (Kurland & Lee 04), (Tao et al. 06) • Relevance score propagation (Diaz 05), (Qin et al. 05) • Graph-based learning • (Zhu et al. 03); (Zhou et al. 04), etc.

  24. Conclusions • Smoothing language models using document/word graphs • A general optimization framework • Various effective instantiations • Improved performance over state-of-the-art • Future Work: • Combine document graphs with word graphs • Study alternative ways of constructing graphs

  25. Thanks!

  26. Parameter Tuning; Fast Convergence (backup plots)
