
Dragon Star Program Course: Information Retrieval. Statistical Language Models for IR

ChengXiang Zhai (翟成祥), Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, and Statistics, University of Illinois at Urbana-Champaign





Presentation Transcript


  1. Dragon Star Program Course: Information Retrieval. Statistical Language Models for IR. ChengXiang Zhai (翟成祥), Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, and Statistics, University of Illinois at Urbana-Champaign. http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu

  2. Outline • More about statistical language models in general • Systematic review of language models for IR • The basic language modeling approach • Advanced language models • KL-divergence retrieval model and feedback • Language models for special retrieval tasks

  3. More about statistical language models in general

  4. What is a Statistical LM? • A probability distribution over word sequences • p(“Today is Wednesday”) ≈ 0.001 • p(“Today Wednesday is”) ≈ 0.0000000000001 • p(“The eigenvalue is positive”) ≈ 0.00001 • Context/topic dependent! • Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model

  5. Why is a LM Useful? • Provides a principled way to quantify the uncertainties associated with natural language • Allows us to answer questions like: • Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition) • Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval) • Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)

  29. Source-Channel Framework (Model of Communication System [Shannon 48]) Source → Transmitter (encoder) → X (with P(X)) → Noisy Channel (with P(Y|X)) → Y → Receiver (decoder) → X’ → Destination. The receiver recovers X’ = argmaxX P(X|Y) = argmaxX P(Y|X)P(X) (Bayes Rule). When X is text, p(X) is a language model. Many examples: Speech recognition: X = word sequence, Y = speech signal; Machine translation: X = English sentence, Y = Chinese sentence; OCR error correction: X = correct word, Y = erroneous word; Information retrieval: X = document, Y = query; Summarization: X = summary, Y = document
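
  As a toy illustration of the decoding rule X’ = argmaxX P(Y|X)P(X), the sketch below makes up an OCR-style example; all names and probabilities are invented for illustration only.

```python
# Toy noisy-channel decoding: pick the source X that maximizes P(X|Y), which is
# proportional to P(Y|X) * P(X) by Bayes rule.
prior = {"form": 0.6, "farm": 0.4}                         # language model P(X) (toy values)
channel = {("form", "farn"): 0.2, ("farm", "farn"): 0.5}   # channel model P(Y|X) for Y = "farn"

observed = "farn"
decoded = max(prior, key=lambda x: channel.get((x, observed), 0.0) * prior[x])
print(decoded)  # "farm": the channel likelihood outweighs the language-model prior here
```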

  7. Basic Issues • Define the probabilistic model • Event, Random Variables, Joint/Conditional Prob’s • P(w1 w2 ... wn) = f(θ1, θ2, …, θm) • Estimate model parameters • Tune the model to best fit the data and our prior knowledge • θi = ? • Apply the model to a particular task • Many applications

  8. The Simplest Language Model(Unigram Model) • Generate a piece of text by generating each word independently • Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn) • Parameters: {p(wi)} p(w1)+…+p(wN)=1 (N is voc. size) • Essentially a multinomial distribution over words • A piece of text can be regarded as a sample drawn according to this word distribution
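
  A minimal sketch of the independence assumption above, with a made-up toy distribution; a real model would be estimated from a corpus, cover the full vocabulary, and be smoothed.

```python
import math

# Toy unigram model (hypothetical probabilities, not estimated from real data).
unigram = {"today": 0.01, "is": 0.05, "wednesday": 0.002, "the": 0.06}

def unigram_log_prob(words, model, unseen_prob=1e-9):
    # p(w1 ... wn) = p(w1) * p(w2) * ... * p(wn); work in log space to avoid
    # underflow. Unseen words get a tiny floor (real systems smooth properly).
    return sum(math.log(model.get(w, unseen_prob)) for w in words)

print(unigram_log_prob("today is wednesday".split(), unigram))
```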

  9. Text Generation with Unigram LM A (unigram) language model θ gives p(w|θ); sampling from it generates a document d, and given θ, p(d|θ) varies according to d. Topic 1 (text mining paper): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, … Topic 2 (health / food nutrition paper): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …

  10. Estimation of Unigram LM Given a document with word counts (text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1; total #words = 100), estimate the (unigram) language model p(w|θ) = ? The relative-frequency estimate gives text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, … How good is the estimated model? It gives our document sample the highest probability, but it doesn’t generalize well… More about this later…
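
  A minimal sketch of the relative-frequency (maximum-likelihood) estimate used above, reusing the counts from the slide; the helper name is just illustrative.

```python
def mle_unigram(word_counts, total=None):
    # Maximum-likelihood estimate: p(w|theta) = c(w, d) / |d|.
    total = total or sum(word_counts.values())
    return {w: c / total for w, c in word_counts.items()}

counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
          "algorithm": 2, "query": 1, "efficient": 1}
model = mle_unigram(counts, total=100)   # the slide's document has 100 words in all
print(model["text"], model["query"])     # 0.1 and 0.01, i.e. 10/100 and 1/100
```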

  11. Empirical distribution of words • There are stable language-independent patterns in how people use natural languages • A few words occur very frequently; most occur rarely. E.g., in news articles, • Top 4 words: 10~15% of word occurrences • Top 50 words: 35~40% of word occurrences • The most frequent word in one corpus may be rare in another

  12. Zipf’s Law Plotting word frequency against word rank (by frequency) gives a sharply skewed curve: the highest-ranked words are stop words (and account for the biggest data structure), the mid-frequency words are the most useful ones (Luhn 57), and the rare tail raises the question “is ‘too rare’ a problem?” Zipf’s law: rank * frequency ≈ constant. The generalized Zipf’s law is applicable in many domains.
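
  A quick way to eyeball rank * frequency ≈ constant on any text; the sample string below is just a stand-in for a real corpus.

```python
from collections import Counter

def zipf_table(text, top=10):
    # Rank words by frequency and report rank * frequency, which Zipf's law
    # predicts to be roughly constant.
    counts = Counter(text.lower().split())
    ranked = counts.most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked[:top], start=1)]

sample = "the cat sat on the mat and the dog sat on the log"   # toy text
for rank, word, freq, product in zipf_table(sample):
    print(rank, word, freq, product)
```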

  13. More Sophisticated LMs • N-gram language models • In general, p(w1 w2 ... wn)=p(w1)p(w2|w1)…p(wn|w1 …wn-1) • n-gram: conditioned only on the past n-1 words • E.g., bigram: p(w1 ... wn)=p(w1)p(w2|w1) p(w3|w2) …p(wn|wn-1) • Remote-dependence language models (e.g., Maximum Entropy model) • Structured language models (e.g., probabilistic context-free grammar) • Will not be covered in detail in this course. If interested, read [Jelinek 98, Manning & Schutze 99, Rosenfeld 00]
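
  For contrast with the unigram model, a minimal (unsmoothed) bigram sketch, conditioning each word only on its predecessor; the training sentences are toy data and real n-gram LMs would add smoothing.

```python
import math
from collections import Counter

def train_bigram(sentences):
    # Collect bigram and unigram counts; p(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1}).
    bigrams, unigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams, unigrams

def bigram_log_prob(sentence, bigrams, unigrams, floor=1e-9):
    tokens = ["<s>"] + sentence.split()
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = bigrams.get((prev, cur), 0) / max(unigrams.get(prev, 0), 1)
        logp += math.log(p if p > 0 else floor)   # unsmoothed; real LMs smooth here
    return logp

b, u = train_bigram(["text mining is fun", "text mining is hard"])
print(bigram_log_prob("text mining is fun", b, u))
```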

  14. Why Just Unigram Models? • Difficulty in moving toward more complex models • They involve more parameters, so need more data to estimate (A doc is an extremely small sample) • They increase the computational complexity significantly, both in time and space • Capturing word order or structure may not add so much value for “topical inference” • But, using more sophisticated models can still be expected to improve performance ...

  15. Evaluation of SLMs • Direct evaluation criterion: How well does the model fit the data to be modeled? • Example measures: Data likelihood, perplexity, cross entropy, Kullback-Leibler divergence (mostly equivalent) • Indirect evaluation criterion: Does the model help improve the performance of the task? • Specific measure is task dependent • For retrieval, we look at whether a model helps improve retrieval accuracy • We hope more “reasonable” LMs would achieve better retrieval performance

  16. What You Should Know • How the source-channel framework can model many different problems • Why unigram LMs seem to be sufficient for IR • Zipf’s law

  17. Systematic Review of Language Models for IR

  18. Representative LMs for IR (up to 2006), grouped by thread (timeline 1998-2005+):
  • Basic LM (query likelihood): query likelihood scoring (Ponte & Croft 98; Hiemstra & Kraaij 99; Miller et al. 99); parameter sensitivity (Ng 00); smoothing examined (Zhai & Lafferty 01a); theoretical justification (Lafferty & Zhai 01a, 01b); two-stage LMs (Zhai & Lafferty 02); URL prior (Kraaij et al. 02); Bayesian query likelihood (Zaragoza et al. 03); time prior (Li & Croft 03)
  • Improved basic LM: beyond unigram (Song & Croft 99); translation model (Berger & Lafferty 99); term-specific smoothing (Hiemstra 02); title LM (Jin et al. 02); concept likelihood (Srikanth & Srihari 03); cluster LM (Kurland & Lee 04); dependency LM (Gao et al. 04); cluster smoothing (Liu & Croft 04; Tao et al. 06); thesauri (Cao et al. 05)
  • Query/relevance model & feedback: relevance LM (Lavrenko & Croft 01); Markov-chain query model (Lafferty & Zhai 01b); model-based FB (Zhai & Lafferty 01b); rel. query FB (Nallapati et al. 03); parsimonious LM (Hiemstra et al. 04); pseudo query (Kurland et al. 05); query expansion (Bai et al. 05); robust estimation (Tao & Zhai 06)
  • Special IR tasks: Xu & Croft 99; Xu et al. 01; Lavrenko et al. 02; Zhang et al. 02; Cronen-Townsend et al. 02; Si et al. 02; Ogilvie & Callan 03; Zhai et al. 03; Kurland & Lee 05; Shen et al. 05; Tan et al. 06
  • Dissertations: Ponte 98; Hiemstra 01; Berger 01; Zhai 02; Lavrenko 04; Kraaij 04; Srikanth 04; Tao 06; Kurland 06

  19. Ponte & Croft’s Pioneering Work [Ponte & Croft 98] • Contribution 1: • A new “query likelihood” scoring method: p(Q|D) • [Maron and Kuhns 60] had the idea of query likelihood, but didn’t work out how to estimate p(Q|D) • Contribution 2: • Connecting LMs with text representation and weighting in IR • [Wong & Yao 89] had the idea of representing text with a multinomial distribution (relative frequency), but didn’t study the estimation problem • Good performance is reported using the simple query likelihood method

  20. Early Work (1998-1999) • At about the same time as SIGIR 98, in TREC 7, two groups explored similar ideas independently: BBN [Miller et al., 99] & Univ. of Twente [Hiemstra & Kraaij 99] • In TREC-8, Ng from MIT motivated the same query likelihood method in a different way [Ng 99] • All following the simple query likelihood method; methods differ in the way the model is estimated and the event model for the query • All show promising empirical results • Main problems: • Feedback is explored heuristically • Lack of understanding why the method works….

  21. Later Work (1999-) • Attempt to understand why LMs work [Zhai & Lafferty 01a, Lafferty & Zhai 01a, Ponte 01, Greiff & Morgan 03, Sparck Jones et al. 03, Lavrenko 04] • Further extend/improve the basic LMs [Song & Croft 99, Berger & Lafferty 99, Jin et al. 02, Nallapati & Allan 02, Hiemstra 02, Zaragoza et al. 03, Srikanth & Srihari 03, Nallapati et al 03, Li &Croft 03, Gao et al. 04, Liu & Croft 04, Kurland & Lee 04,Hiemstra et al. 04,Cao et al. 05, Tao et al. 06] • Explore alternative ways of using LMs for retrieval (mostly query/relevance model estimation) [Xu & Croft 99, Lavrenko & Croft 01, Lafferty & Zhai 01a, Zhai & Lafferty 01b, Lavrenko 04, Kurland et al. 05, Bai et al. 05,Tao & Zhai 06] • Explore the use of SLMs for special retrieval tasks [Xu & Croft 99, Xu et al. 01, Lavrenko et al. 02, Cronen-Townsend et al. 02, Zhang et al. 02, Ogilvie & Callan 03, Zhai et al. 03, Kurland & Lee 05, Shen et al. 05, Balog et al. 06, Fang & Zhai 07]

  22. Review of LM for IR: Part 1. Basic Language Modeling Approach

  23. The Basic LM Approach [Ponte & Croft 98] Each document is viewed through its own language model: e.g., a text mining paper (text ?, mining ?, association ?, clustering ?, …, food ?, …) vs. a food nutrition paper (food ?, nutrition ?, healthy ?, diet ?, …). Given the query “data mining algorithms”, ask: which model would most likely have generated this query?

  24. Ranking Docs by Query Likelihood For each document di in the collection d1, d2, …, dN, estimate a document LM and compute the query likelihood p(q|d1), p(q|d2), …, p(q|dN); rank the documents by this likelihood.
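
  A hedged end-to-end sketch of query-likelihood ranking; Jelinek-Mercer smoothing and the λ value are illustrative choices here, not the settings of the cited papers, and the two documents are toy data.

```python
import math
from collections import Counter

def collection_model(docs):
    # p(w|C): relative frequency of w in the whole collection.
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query, doc, p_coll, lam=0.5):
    # log p(q|d) with Jelinek-Mercer smoothing:
    # p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|C)
    counts, dlen = Counter(doc.split()), len(doc.split())
    score = 0.0
    for w in query.split():
        p = (1 - lam) * counts[w] / dlen + lam * p_coll.get(w, 1e-9)
        score += math.log(p)
    return score

docs = ["text mining algorithms for association rules",
        "food nutrition and healthy diet"]
p_c = collection_model(docs)
query = "data mining algorithms"
ranked = sorted(docs, key=lambda d: query_likelihood(query, d, p_c), reverse=True)
print(ranked[0])   # the text-mining document should rank first
```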

  25. Modeling Queries: Different Assumptions • Multi-Bernoulli: Modeling word presence/absence • q = (x1, …, x|V|), xi = 1 for presence of word wi; xi = 0 for absence • Parameters: {p(wi=1|d), p(wi=0|d)} with p(wi=1|d) + p(wi=0|d) = 1 • Multinomial (Unigram LM): Modeling word frequency • q = q1,…,qm, where qj is a query word • c(wi,q) is the count of word wi in query q • Parameters: {p(wi|d)} with p(w1|d) + … + p(w|V||d) = 1 • [Ponte & Croft 98] uses Multi-Bernoulli; most other work uses multinomial • Multinomial seems to work better [Song & Croft 99, McCallum & Nigam 98, Lavrenko 04]

  26. Retrieval as LM Estimation • Document ranking based on query likelihood: log p(q|d) = Σi log p(qi|d) = Σw c(w,q) log p(w|d), where p(w|d) is the document language model • Retrieval problem ≈ Estimation of p(wi|d) • Smoothing is an important issue, and distinguishes different approaches • Many smoothing methods are available

  27. Which smoothing method is the best? It depends on the data and the task! Cross validation is generally used to choose the best method and/or set the smoothing parameters… For retrieval, Dirichlet prior performs well… Backoff smoothing [Katz 87] doesn’t work well due to a lack of 2nd-stage smoothing… Note that many other smoothing methods exist See [Chen & Goodman 98] and other publications in speech recognition…
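
  For reference, sketches of two standard smoothing choices discussed in this part, Jelinek-Mercer interpolation and the Dirichlet prior; the parameter values below are arbitrary placeholders, not recommended settings.

```python
def jelinek_mercer(c_wd, doc_len, p_wC, lam=0.1):
    # p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|C)
    return (1 - lam) * c_wd / doc_len + lam * p_wC

def dirichlet_prior(c_wd, doc_len, p_wC, mu=2000):
    # p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)
    return (c_wd + mu * p_wC) / (doc_len + mu)

# Example: a word seen 3 times in a 100-word document, with p(w|C) = 0.001.
print(jelinek_mercer(3, 100, 0.001))
print(dirichlet_prior(3, 100, 0.001))
```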

  28. Comparison of Three Methods[Zhai & Lafferty 01a] Comparison is performed on a variety of test collections

  29. The Dual-Role of Smoothing [Zhai & Lafferty 02] [Figure: retrieval precision vs. smoothing parameter for keyword and verbose queries, long and short] Why does query type affect smoothing sensitivity?

  30. Another Reason for Smoothing Intuitively, d2 should have a higher score, but with the unsmoothed estimates p(q|d1) > p(q|d2)…
  Query = “the algorithms for data mining”
                        the      algorithms   for      data      mining
  pDML(w|d1):           0.04     0.001        0.02     0.002     0.003
  pDML(w|d2):           0.02     0.001        0.01     0.003     0.004
  p(w|REF):             0.2      0.00001      0.2      0.00001   0.00001
  Smoothed p(w|d1):     0.184    0.000109     0.182    0.000209  0.000309
  Smoothed p(w|d2):     0.182    0.000109     0.181    0.000309  0.000409
  d2 matches the content words at least as well: p(“algorithms”|d1) = p(“algorithms”|d2), p(“data”|d1) < p(“data”|d2), p(“mining”|d1) < p(“mining”|d2). So we should make p(“the”) and p(“for”) less different for all docs, and smoothing helps achieve this goal…

  31. Two-stage Smoothing [Zhai & Lafferty 02] P(w|d) = (1-λ) · [c(w,d) + μ·p(w|C)] / (|d| + μ) + λ·p(w|U) • Stage 1 (Dirichlet prior, Bayesian): explain unseen words by smoothing with the collection LM p(w|C), controlled by μ • Stage 2 (two-component mixture): explain noise in the query by mixing with the user background model p(w|U), controlled by λ; p(w|U) can be approximated by p(w|C)
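
  A minimal sketch of the two-stage formula, assuming p(w|U) is approximated by p(w|C) as the slide notes; the μ and λ values are placeholders.

```python
def two_stage(c_wd, doc_len, p_wC, p_wU=None, mu=2000, lam=0.1):
    # Stage 1: Dirichlet prior smoothing with the collection LM p(w|C).
    # Stage 2: mix with the user background model p(w|U) to explain query noise.
    p_wU = p_wC if p_wU is None else p_wU   # approximation suggested on the slide
    dirichlet = (c_wd + mu * p_wC) / (doc_len + mu)
    return (1 - lam) * dirichlet + lam * p_wU

print(two_stage(3, 100, 0.001))
```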

  32. Estimating μ using leave-one-out [Zhai & Lafferty 02] For each word wi in a document d, predict it from the rest of the document, P(wi|d − wi), under Dirichlet smoothing; summing these log-probabilities over the collection gives the leave-one-out log-likelihood, and μ is chosen as its maximum likelihood estimator, found numerically with Newton’s method.
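
  A hedged sketch of the leave-one-out objective in the form given by [Zhai & Lafferty 02]; a simple grid search stands in for Newton's method here, and the toy collection is made up.

```python
import math
from collections import Counter

def leave_one_out_loglik(docs, p_coll, mu):
    # l(mu|C) = sum_d sum_w c(w,d) * log( (c(w,d) - 1 + mu*p(w|C)) / (|d| - 1 + mu) )
    total = 0.0
    for d in docs:
        counts, dlen = Counter(d.split()), len(d.split())
        for w, c in counts.items():
            total += c * math.log((c - 1 + mu * p_coll[w]) / (dlen - 1 + mu))
    return total

docs = ["text mining text mining algorithms", "food nutrition healthy food diet"]
coll = Counter(w for d in docs for w in d.split())
n = sum(coll.values())
p_coll = {w: c / n for w, c in coll.items()}

# Grid search over candidate mu values (Newton's method would be used in practice).
best_mu = max((1, 5, 10, 50, 100, 500),
              key=lambda m: leave_one_out_loglik(docs, p_coll, m))
print(best_mu)
```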

  33. Why would “leave-one-out” work? Suppose two authors each write 20 words. Author 1 (small vocabulary, words repeat): “abc abc ab c d d abc cd d d abd ab ab ab ab cd d e cd e”. Author 2 (large vocabulary, many words seen only once): “abc abc ab c d d abe cb e f acf fb ef aff abef cdc db ge f s”. Suppose we keep sampling and get 10 more words: which author is likely to “write” more new words? Now, suppose we leave “e” out… For author 1, μ doesn’t have to be big; for author 2, μ must be big (more smoothing). The amount of smoothing is closely related to the underlying vocabulary size.

  34. Estimating λ using Mixture Model [Zhai & Lafferty 02] Stage 1: for each document di (i = 1…N), estimate P(w|di) with the Dirichlet prior μ estimated in stage 1 (leave-one-out). Stage 2: treat the query Q = q1…qm as drawn from the two-component mixture (1−λ)p(w|di) + λp(w|U), and estimate λ as the maximum likelihood estimator, computed with the Expectation-Maximization (EM) algorithm.

  35. Automatic 2-stage results ≈ Optimal 1-stage results [Zhai & Lafferty 02] Average precision (3 DB’s + 4 query types, 150 topics); * indicates significant difference. Completely automatic tuning of parameters IS POSSIBLE!

  36. Variants of the Basic LM Approach • Different smoothing strategies • Hidden Markov Models (essentially linear interpolation) [Miller et al. 99] • Smoothing with an IDF-like reference model [Hiemstra & Kraaij 99] • Performance tends to be similar to the basic LM approach • Many other possibilities for smoothing [Chen & Goodman 98] • Different priors • Link information as prior leads to significant improvement of Web entry page retrieval performance [Kraaij et al. 02] • Time as prior [Li & Croft 03] • PageRank as prior [Kurland & Lee 05] • Passage retrieval [Liu & Croft 02]

  37. Review of LM for IR: Part 2. Advanced Language Modeling Approaches

  38. Improving the Basic LM Approach • Capturing limited dependencies • Bigrams/Trigrams [Song & Croft 99]; Grammatical dependency [Nallapati & Allan 02, Srikanth & Srihari 03, Gao et al. 04] • Generally insignificant improvement as compared with other extensions such as feedback • Full Bayesian query likelihood [Zaragoza et al. 03] • Performance similar to the basic LM approach • Translation model for p(Q|D,R) [Berger & Lafferty 99, Jin et al. 02, Cao et al. 05] • Addresses polysemy and synonymy; improves over the basic LM methods, but computationally expensive • Cluster-based smoothing/scoring [Liu & Croft 04, Kurland & Lee 04, Tao et al. 06] • Improves over the basic LM, but computationally expensive • Parsimonious LMs [Hiemstra et al. 04]: • Using a mixture model to “factor out” non-discriminative words

  39. Translation Models • Directly modeling the “translation” relationship between words in the query and words in a doc • When relevance judgments are available, (q,d) serves as data to train the translation model • Without relevance judgments, we can use synthetic data [Berger & Lafferty 99], <title, body> pairs [Jin et al. 02], or thesauri [Cao et al. 05] • Basic translation model: p(Q|D) = Πi Σw pt(qi|w) p(w|D), where pt(qi|w) is the translation model and p(w|D) is the regular doc LM

  40. Cluster-based Smoothing/Scoring • Cluster-based smoothing: Smooth a document LM with a cluster of similar documents [Liu & Croft 04]: improves over the basic LM, but insignificantly • Document expansion smoothing: Smooth a document LM with the neighboring documents (essentially one cluster per document) [Tao et al. 06]: improves over the basic LM more significantly • Cluster-based query likelihood: Similar to the translation model, but “translate” the whole document to the query through a set of clusters [Kurland & Lee 04], combining p(C|D) (how likely doc D belongs to cluster C) with p(Q|C) (the likelihood of Q given C); only effective when interpolated with the basic LM scores

  41. Feedback and Doc/Query Generation Two generation directions, both estimated from (query, doc, relevance) tuples such as (q1,d1,1), (q1,d2,1), (q1,d3,1), (q1,d4,0), (q1,d5,0), (q3,d1,1), (q4,d1,1), (q5,d1,1), (q6,d2,1), (q6,d3,0): the classic probabilistic model estimates the rel. doc model P(D|Q,R=1) and non-rel. doc model P(D|Q,R=0) from docs judged for a query (query-based feedback), while query likelihood (“language model”) estimates the “rel. query” model P(Q|D,R=1) from queries associated with a doc (doc-based feedback). Initial retrieval: query as rel doc vs. doc as rel query; P(Q|D,R=1) is more accurate. Feedback: P(D|Q,R=1) can be improved for the current query and future docs; P(Q|D,R=1) can also be improved, but for the current doc and future queries.

  42. Overview of Feedback Techniques • Feedback as machine learning: many possibilities • Standard ML: Given examples of relevant (and non-relevant) documents, learn how to classify a new document as either “relevant” or “non-relevant”. • “Modified” ML: Given a query and examples of relevant (and non-relevant) documents, learn how to rank new documents based on relevance • Challenges: • Sparse data • Censored sample • How to deal with query? • Modeling noise in pseudo feedback (as semi-supervised learning) • Feedback as query expansion: traditional IR • Step 1: Term selection • Step 2: Query expansion • Step 3: Query term re-weighting • Traditional IR is still robust (Rocchio), but ML approaches can potentially be more accurate

  43. Difficulty in Feedback with Query Likelihood • Traditional query expansion [Ponte 98, Miller et al. 99, Ng 99] • Improvement is reported, but there is a conceptual inconsistency • What’s an expanded query, a piece of text or a set of terms? • Avoid expansion • Query term reweighting [Hiemstra 01, Hiemstra 02] • Translation models [Berger & Lafferty 99, Jin et al. 02] • Only achieving limited feedback • Doing relevant query expansion instead [Nallapati et al 03] • The difficulty is due to the lack of a query/relevance model • The difficulty can be overcome with alternative ways of using LMs for retrieval (e.g., relevance model [Lavrenko & Croft 01] , Query model estimation [Lafferty & Zhai 01b; Zhai & Lafferty 01b])

  44. Two Alternative Ways of Using LMs • Classic Probabilistic Model: Doc-Generation as opposed to Query-Generation • Natural for relevance feedback • Challenge: Estimate p(D|Q,R=1) without relevance feedback; relevance model [Lavrenko & Croft 01] provides a good solution • Probabilistic Distance Model: Similar to the vector-space model, but with LMs as opposed to TF-IDF weight vectors • A popular distance function: Kullback-Leibler (KL) divergence, covering query likelihood as a special case • Retrieval is now to estimate query & doc models, and feedback is treated as query LM updating [Lafferty & Zhai 01b; Zhai & Lafferty 01b] • Both methods outperform the basic LM significantly

  45. Relevance Model Estimation [Lavrenko & Croft 01] • Question: How to estimate P(D|Q,R) (or p(w|Q,R)) without relevant documents? • Key idea: • Treat query as observations about p(w|Q,R) • Approximate the model space with document models • Two methods for decomposing p(w,Q) • Independent sampling (Bayesian model averaging) • Conditional sampling: p(w,Q) = p(w)p(Q|w) • Original formula in [Lavrenko & Croft 01]

  46. Query Model Estimation[Lafferty & Zhai 01b, Zhai & Lafferty 01b] • Question: How to estimate a better query model than the ML estimate based on the original query? • “Massive feedback”: Improve a query model through co-occurrence pattern learned from • A document-term Markov chain that outputs the query [Lafferty & Zhai 01b] • Thesauri, corpus [Bai et al. 05,Collins-Thompson & Callan 05] • Model-based feedback: Improve the estimate of query model by exploiting pseudo-relevance feedback • Update the query model by interpolating the original query model with a learned feedback model [ Zhai & Lafferty 01b] • Estimate a more integrated mixture model using pseudo-feedback documents [ Tao & Zhai 06]

  47. Review of LM for IR: Part 3. KL-divergence retrieval model and feedback

  48. Kullback-Leibler (KL) Divergence Retrieval Model • Unigram similarity model: score(Q,D) = −D(θQ || θD) = Σw p(w|θQ) log p(w|θD) + query entropy (ignored for ranking) • Retrieval ≈ Estimation of θQ and θD • Special case: θQ = empirical distribution of q recovers “query likelihood”
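
  A minimal sketch of KL-divergence scoring; only the cross-entropy term is computed, since the query entropy does not affect ranking, and the models below are toy values (a real θD would be a smoothed document model over the full vocabulary).

```python
import math

def kl_score(query_model, doc_model, floor=1e-12):
    # score(Q, D) = sum_w p(w|theta_Q) * log p(w|theta_D)
    # = -D(theta_Q || theta_D) up to the query entropy, which is constant per query.
    return sum(p_q * math.log(doc_model.get(w, floor))
               for w, p_q in query_model.items())

# With theta_Q set to the empirical query distribution, this reduces to
# (length-normalized) query likelihood.
query_model = {"data": 1/3, "mining": 1/3, "algorithms": 1/3}
doc_model = {"text": 0.18, "mining": 0.09, "data": 0.05, "algorithms": 0.03}
print(kl_score(query_model, doc_model))
```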

  49. Feedback as Model Interpolation (Rocchio for Language Models) Query Q gives a query model θQ; document D gives a document model θD; ranking uses D(θQ′ || θD) to produce the results. The feedback docs F = {d1, d2, …, dn} are fed to a generative model to obtain a feedback model θF, and the query model is updated as θQ′ = (1−α)θQ + αθF, with α = 0 meaning no feedback and α = 1 meaning full feedback.
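
  A minimal sketch of the interpolation step, assuming θF has already been estimated (e.g., with the mixture model on the next slide); the α value and the toy models are illustrative.

```python
def interpolate(query_model, feedback_model, alpha=0.5):
    # theta_Q' = (1 - alpha) * theta_Q + alpha * theta_F
    # alpha = 0 -> no feedback; alpha = 1 -> full feedback.
    vocab = set(query_model) | set(feedback_model)
    return {w: (1 - alpha) * query_model.get(w, 0.0)
               + alpha * feedback_model.get(w, 0.0) for w in vocab}

theta_q = {"data": 0.34, "mining": 0.33, "algorithms": 0.33}
theta_f = {"mining": 0.3, "clustering": 0.2, "association": 0.2, "text": 0.3}
print(interpolate(theta_q, theta_f, alpha=0.5))
```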

  50. Generative Mixture Model Each word w in the feedback docs F = {d1,…,dn} is generated either from the background model P(w|C) with probability λ (background words) or from the topic model P(w|θF) with probability 1−λ (topic words); λ = noise in feedback documents. θF is the maximum likelihood estimate: θF = argmaxθ log p(F|θ), where log p(F|θ) = Σd Σw c(w,d) log[(1−λ)p(w|θ) + λp(w|C)].
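
  A sketch of EM for this fixed-λ mixture, estimating the feedback model from pseudo-feedback documents; the document contents, collection model, and λ are toy values, and the helper name is illustrative.

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, p_coll, lam=0.5, iters=20):
    # Each word occurrence in F is generated by the background model p(w|C)
    # with probability lam, or by the topic model p(w|theta_F) with probability 1-lam.
    counts = Counter(w for d in feedback_docs for w in d.split())
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}        # uniform initialization
    for _ in range(iters):
        # E-step: probability that each word occurrence came from the topic model.
        z = {w: (1 - lam) * theta[w] /
                ((1 - lam) * theta[w] + lam * p_coll.get(w, 1e-9)) for w in vocab}
        # M-step: re-estimate theta_F from the topic-attributed counts.
        weighted = {w: counts[w] * z[w] for w in vocab}
        norm = sum(weighted.values())
        theta = {w: weighted[w] / norm for w in vocab}
    return theta

fdocs = ["text mining algorithms", "mining frequent patterns", "the the of of"]
coll = {"the": 0.1, "of": 0.08, "text": 0.001, "mining": 0.001,
        "algorithms": 0.001, "frequent": 0.001, "patterns": 0.001}
print(estimate_feedback_model(fdocs, coll))   # stop words get pushed toward the background
```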
