A Language Modeling Approach to Information Retrieval

한 경 수, 2002-04-02
• Introduction
• Previous Work
• Model Description
• Empirical Results
• Conclusions and Future Work
• Relevance Feedback in LM

Introduction: Indexing model of the probabilistic retrieval model
• Indexing model: a model of the assignment of indexing terms to documents
• Indexing model of the 2-Poisson model
  • Identifies useful indexing terms by the difference in their rate of occurrence between documents that are elite for a given term and documents without the property of eliteness.
• The current indexing models have not led to improved retrieval results.
  • This is due to two unwarranted assumptions:
    • Documents are members of pre-defined classes.
      • Combinatorial explosion of elite sets
    • The parametric assumption
      • It is unnecessary to construct a parametric model of the data when we have the actual data.
LM Approach to IR

Introduction: Retrieval based on probabilistic LM
• Treat the generation of queries as a random process.
• Approach
  • Infer a language model for each document.
  • Estimate the probability of generating the query according to each of these models.
  • Rank the documents according to these probabilities.
• Intuition
  • Users …
    • Have a reasonable idea of the terms that are likely to occur in documents of interest.
    • Will choose query terms that distinguish these documents from others in the collection.
  • Collection statistics …
    • Are integral parts of the language model.
    • Are not used heuristically, as in many other approaches.

Introduction: Probabilistic IR
[Figure: an information need is expressed as a query, which is matched against the documents d1, d2, …, dn of the document collection.]

Introduction: IR based on LM
[Figure: an information need is expressed as a query; each document d1, d2, …, dn of the document collection is linked to the query by generation.]

Previous Work
• Differences from the 2-Poisson model
  • Makes no distributional assumptions.
  • Does not distinguish a subset of specialty words.
  • Does not assume a preexisting classification of documents into elite and non-elite sets.
• Difference from the Robertson & Sparck Jones model and the Croft & Harper model
  • Does not focus on relevance, except to the extent that the process of query production is correlated with it.
• Fuhr model
• INQUERY
• Kwok; Wong & Yao; Kalt

Model Description: Query generation probability
• Ranking formula
  • The probability of producing the query given the language model of document d:
    $\hat{p}(Q \mid M_d) = \prod_{t \in Q} \hat{p}_{ml}(t, d)$, where $\hat{p}_{ml}(t, d) = \frac{tf_{(t,d)}}{dl_d}$
  • Assumption: given a particular language model, the query terms occur independently.
  • $M_d$: language model of document d
  • $tf_{(t,d)}$: raw tf of term t in document d
  • $dl_d$: total number of tokens in document d
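The ranking formula can be illustrated with a minimal Python sketch. It assumes the maximum likelihood estimate tf / dl for each query term and multiplies the per-term estimates under the independence assumption. The toy documents, the query, and the function names are hypothetical and not taken from the slides.

```python
from collections import Counter

def p_ml(term, doc_tokens):
    """Maximum likelihood estimate: raw tf of the term divided by document length."""
    tf = Counter(doc_tokens)[term]
    return tf / len(doc_tokens)

def query_likelihood_ml(query_tokens, doc_tokens):
    """Product of the per-term estimates (query terms assumed independent)."""
    prob = 1.0
    for term in query_tokens:
        prob *= p_ml(term, doc_tokens)
    return prob

# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "satellite launch contract satellite".split(),
    "d2": "rocket launch services".split(),
}
query = ["satellite", "launch"]

scores = {name: query_likelihood_ml(query, toks) for name, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

In this toy data d2 does not contain "satellite", so its score is exactly zero, which is the problem the next slide addresses.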

Model Description: Insufficient data
• Zero probability
  • We do not wish to assign a probability of zero to a document that is missing one or more of the query terms.
  • It is a somewhat radical assumption to infer that $p(t \mid M_d) = 0$.
• Assumption
  • A non-occurring term is possible, but no more likely than what would be expected by chance in the collection.
  • If $tf_{(t,d)} = 0$, then $\hat{p}(t \mid M_d) = \frac{cf_t}{cs}$
  • $cf_t$: raw count of term t in the collection
  • $cs$: raw collection size (total number of tokens in the collection)
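A small sketch of this back-off, assuming the term probability falls back to the collection rate (raw collection count over collection size) whenever the term does not occur in the document. The toy documents and the function name are hypothetical.

```python
from collections import Counter

def p_with_backoff(term, doc_tokens, collection_counts, collection_size):
    """tf / dl if the term occurs in the document, otherwise cf / cs."""
    tf = Counter(doc_tokens)[term]
    if tf > 0:
        return tf / len(doc_tokens)
    return collection_counts[term] / collection_size

# Hypothetical toy collection, for illustration only.
d1 = "satellite launch contract satellite".split()
d2 = "rocket launch services".split()

collection_counts = Counter(d1 + d2)           # cf for every term
collection_size = sum(collection_counts.values())  # cs, here 7 tokens

p_missing = p_with_backoff("satellite", d2, collection_counts, collection_size)
p_present = p_with_backoff("satellite", d1, collection_counts, collection_size)
print(p_missing, p_present)
```

Here "satellite" is absent from d2 but occurs twice among the 7 collection tokens, so d2 receives 2/7 for that term instead of zero.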

Model Description: Averaging for robustness
• If we could get an arbitrarily large sample of data from $M_d$, we could be reasonably confident in the maximum likelihood estimator.
• We only have a document-sized sample from that distribution.
• To circumvent this problem, we need an estimate from a larger amount of data:
  $\hat{p}_{avg}(t) = \frac{\sum_{d:\, t \in d} \hat{p}_{ml}(t, d)}{df_t}$
• $df_t$: document frequency of t
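The averaged estimate can be sketched as follows, assuming it is the mean of the maximum likelihood estimates tf / dl over the documents that contain the term, so that the divisor is the document frequency. The toy documents and function names are hypothetical.

```python
def p_ml(term, doc_tokens):
    """Maximum likelihood estimate tf / dl for one document."""
    return doc_tokens.count(term) / len(doc_tokens)

def p_avg(term, docs):
    """Mean of p_ml over the documents that contain the term (divisor = df)."""
    containing = [toks for toks in docs if term in toks]
    if not containing:
        return 0.0  # term does not occur anywhere in the collection
    return sum(p_ml(term, toks) for toks in containing) / len(containing)

# Hypothetical toy collection, for illustration only.
docs = [
    "satellite launch contract satellite".split(),
    "rocket launch services".split(),
]

avg_launch = p_avg("launch", docs)        # (1/4 + 1/3) / 2
avg_satellite = p_avg("satellite", docs)  # only one document contains it
print(avg_launch, avg_satellite)
```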

Model Description: The risk
• We cannot and are not assuming that every document containing t is drawn from the same language model.
• There is some risk in using the mean to estimate $p(t \mid M_d)$.
• If we used the mean by itself, there would be no distinction between documents with different term frequencies.
• The risk for a term t in a document d (geometric distribution):
  $\hat{R}_{t,d} = \left(\frac{1}{1 + \bar{f}_t}\right) \times \left(\frac{\bar{f}_t}{1 + \bar{f}_t}\right)^{tf_{(t,d)}}$
• As the tf gets further away from the normalized mean, the mean probability becomes riskier to use as an estimate.
• $\bar{f}_t$: mean term frequency of term t in documents where t occurs, normalized by document length ($= \hat{p}_{avg}(t) \times dl_d$)
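A minimal sketch of the risk function, assuming the geometric form used in the original Ponte and Croft paper: with normalized mean frequency f (the averaged probability times the document length), the risk is (1 / (1 + f)) × (f / (1 + f))^tf. The function name and the numeric inputs are hypothetical.

```python
def risk(tf, avg_prob, doc_length):
    """Geometric-distribution risk for a term with raw frequency tf in a document.

    avg_prob is the averaged term probability over documents containing the term;
    f_bar is that mean rescaled to the length of this document.
    """
    f_bar = avg_prob * doc_length
    return (1.0 / (1.0 + f_bar)) * (f_bar / (1.0 + f_bar)) ** tf

# Hypothetical values: averaged probability 0.25 and a 4-token document give f_bar = 1.
r_tf1 = risk(tf=1, avg_prob=0.25, doc_length=4)
r_tf3 = risk(tf=3, avg_prob=0.25, doc_length=4)
print(r_tf1, r_tf3)
```

With these inputs the value shrinks as tf grows, so a document whose term frequency lies far above the normalized mean gives less weight to the mean when the two estimates are combined.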

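Model Description: Combining the two estimates
$\hat{p}(t \mid M_d) = \hat{p}_{ml}(t, d)^{(1 - \hat{R}_{t,d})} \times \hat{p}_{avg}(t)^{\hat{R}_{t,d}}$ if $tf_{(t,d)} > 0$; $\frac{cf_t}{cs}$ otherwise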
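The pieces of the model can be put together in one sketch. It assumes the combination from the original Ponte and Croft paper: the maximum likelihood estimate raised to (1 − risk) times the averaged estimate raised to risk when the term occurs in the document, and the collection rate cf / cs otherwise. Documents are then scored by the product over query terms. The toy collection and all function names are hypothetical.

```python
from collections import Counter

def collection_stats(docs):
    """Return (cf, cs, p_avg) for a dict mapping document name -> token list."""
    cf = Counter()
    for toks in docs.values():
        cf.update(toks)
    cs = sum(cf.values())
    p_avg = {}
    for term in cf:
        ml = [toks.count(term) / len(toks) for toks in docs.values() if term in toks]
        p_avg[term] = sum(ml) / len(ml)  # len(ml) is the document frequency
    return cf, cs, p_avg

def p_term(term, toks, cf, cs, p_avg):
    """Combined estimate of the term probability under the document model."""
    tf = toks.count(term)
    if tf == 0:
        return cf[term] / cs  # back off to the collection rate
    dl = len(toks)
    f_bar = p_avg[term] * dl
    risk = (1.0 / (1.0 + f_bar)) * (f_bar / (1.0 + f_bar)) ** tf
    return (tf / dl) ** (1.0 - risk) * p_avg[term] ** risk

def score(query, toks, cf, cs, p_avg):
    """Product of the combined estimates over the query terms."""
    prob = 1.0
    for term in query:
        prob *= p_term(term, toks, cf, cs, p_avg)
    return prob

# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "satellite launch contract satellite".split(),
    "d2": "rocket launch services".split(),
}
cf, cs, p_avg = collection_stats(docs)
query = ["satellite", "launch"]
scores = {name: score(query, toks, cf, cs, p_avg) for name, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Unlike a pure maximum likelihood score, d2 now receives a small non-zero score even though it lacks "satellite". A query term that never occurs in the collection still yields zero in this sketch.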

Model Description: Analysis of the formulation
• Generalization: formulation of the LM for IR
  [Equation: its components are labelled "general language model" and "individual-document model".]
• Conception
  • The user has a document in mind and generates the query from this document.
  • The equation represents the probability that the document the user had in mind was in fact this one.

Empirical Results: Experiment environment
• Data
  • TREC topics 202-250 on TREC disks 2 and 3
    • Natural language queries consisting of one sentence each
  • TREC topics 51-100 on TREC disk 3 using the concept fields
    • Lists of good terms
      • <num>Number: 054
      • <dom>Domain: International Economics
      • <title>Topic: Satellite Launch Contracts
      • <desc>Description:
        • …
      • <con>Concept(s):
        • Contract, agreement
        • Launch vehicle, rocket, payload, satellite
        • Launch services, …
        • …

Empirical Results: Recall/Precision Experiments (1)

Empirical Results: Recall/Precision Experiments (2)

Empirical Results: Improving the Basic Model (1)
• Smoothing the estimate of the average probability for terms with low document frequency
  • The estimate is based on a small amount of data,
  • so it could be sensitive to outliers.
• Binned estimate
  • Bin the low-frequency data by document frequency.
  • Cutoff: df = 100
  • Use the binned estimate for the average.
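The slide does not say how the binned value is computed, so the following is only one possible reading. It assumes that terms with document frequency below the cutoff are grouped by their exact df, and that each such term's average probability is replaced by the mean over its bin. The function name, the strict "<" comparison, and the toy numbers are all assumptions.

```python
from collections import defaultdict

def binned_p_avg(p_avg, df, cutoff=100):
    """Replace p_avg of low-df terms by the mean p_avg of all terms with the same df.

    Assumption: one bin per distinct df value below the cutoff; terms at or above
    the cutoff keep their own averaged estimate.
    """
    bins = defaultdict(list)
    for term, prob in p_avg.items():
        if df[term] < cutoff:
            bins[df[term]].append(prob)
    bin_mean = {d: sum(vals) / len(vals) for d, vals in bins.items()}
    return {
        term: bin_mean[df[term]] if df[term] < cutoff else prob
        for term, prob in p_avg.items()
    }

# Hypothetical values, for illustration only.
p_avg = {"a": 0.1, "b": 0.3, "c": 0.05}
df = {"a": 2, "b": 2, "c": 500}
smoothed = binned_p_avg(p_avg, df)
print(smoothed)
```

In this toy input "a" and "b" share a bin and are both moved to the bin mean, which damps the effect of an outlier in either one, while the frequent term "c" is left unchanged.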

Empirical Results: Improving the Basic Model (2)

Empirical Results: Improving the Basic Model (3)

Conclusions & Future Work
• Conclusions
  • A novel way of looking at the problem of text retrieval, based on probabilistic language modeling
    • Conceptually simple and explanatory
  • LM will provide effective retrieval and can be improved to the extent that the following conditions can be met:
    • Our language models are accurate representations of the data.
    • Users understand our approach to retrieval.
    • Users have some sense of term distribution.
  • The ability to think about retrieval in a new way
• Future Work
  • Estimate of the default probability
    • The current estimator could in some strange cases assign a higher probability to a non-occurring query term.
  • Query expansion

Relevance Feedback in LM: LM approach to multiple relevant documents
• Current LM approach
  • Allows for N+1 language models
    • N individual document models (N = collection size) + 1 general language model
  • The relationship between the general language model and the individual document models is never raised.
    • How can a document be generated from one language model when the entire collection is generated from a different one?
• We need …
  • A general model for some accumulation of text, which is modified (not replaced) by a local model for some smaller part of the same text.

Relevance Feedback in LM: 3-level model (1)
• 3-level model
  • Whole-collection model
  • Specific-topic model (relevant-documents model)
  • Individual-document model
• Relevance hypothesis
  • A request (query; topic) is generated from a specific-topic model {whole-collection model, specific-topic model}.
  • Iff a document is relevant to the topic, the same model will apply to the document.
    • It will replace part of the individual-document model in explaining the document.
• The probability of relevance of a document
  • The probability that this model explains part of the document
  • The probability that the {whole-collection, specific-topic, individual-document} combination is better than the {whole-collection, individual-document} combination

Relevance Feedback in LM: 3-level model (2)
[Figure: an information need, a query, and the documents d1, d2, …, dn of the document collection, linked by generation.]

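Geometric distribution (1)
• Geometric distribution
  • When Bernoulli trials with success probability p are repeated until the first success, let X be the total number of trials. The distribution of the random variable X is the geometric distribution.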

Geometric distribution (2)
• Example
  • One run of an experiment costs 100,000 won. The probability that the experiment succeeds is 0.2, and the experiment is repeated until it succeeds. What total cost should we expect?
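The example can be worked through numerically. The slide poses the question without an answer; the sketch below uses the standard result that the expected number of trials of a geometric distribution is 1/p, and cross-checks it against a truncated sum of k × p × (1 − p)^(k − 1). Variable names are illustrative.

```python
p_success = 0.2
cost_per_trial = 100_000  # won, from the example

# Mean of the geometric distribution (number of trials until the first success).
expected_trials = 1 / p_success
expected_cost = expected_trials * cost_per_trial

# Cross-check: truncated series for E[X]; the tail beyond 999 terms is negligible.
approx_trials = sum(
    k * p_success * (1 - p_success) ** (k - 1) for k in range(1, 1000)
)
print(expected_trials, expected_cost, approx_trials)
```

Under these assumptions about 5 trials are expected, for a total expected cost of about 500,000 won.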
