
A Language Modeling Approach to Information Retrieval

한 경 수, 2002-04-02. Outline: Introduction, Previous Work, Model Description, Empirical Results, Conclusions and Future Work, Relevance Feedback in LM.


Presentation Transcript


  1. A Language Modeling Approach to Information Retrieval
     한 경 수, 2002-04-02
     • Introduction
     • Previous Work
     • Model Description
     • Empirical Results
     • Conclusions and Future Work
     • Relevance Feedback in LM

  2. Introduction: Indexing model of probabilistic retrieval model
     • An indexing model is a model of the assignment of indexing terms to documents.
     • Indexing model of the 2-Poisson model
       • Identifies useful indexing terms by the difference in their rate of occurrence between documents that are elite for a given term and documents without the property of eliteness.
     • The current indexing models have not led to improved retrieval results.
       • This is due to two unwarranted assumptions:
         • Documents are members of pre-defined classes.
           • This leads to a combinatorial explosion of elite sets.
         • The parametric assumption
           • It is unnecessary to construct a parametric model of the data when we have the actual data.
     LM Approach to IR

  3. Introduction: Retrieval based on probabilistic LM
     • Treat the generation of queries as a random process.
     • Approach
       • Infer a language model for each document.
       • Estimate the probability of generating the query according to each of these models.
       • Rank the documents according to these probabilities.
     • Intuition
       • Users …
         • Have a reasonable idea of terms that are likely to occur in documents of interest.
         • Will choose query terms that distinguish these documents from others in the collection.
       • Collection statistics …
         • Are integral parts of the language model.
         • Are not used heuristically as in many other approaches.

  4. Introduction: Probabilistic IR
     [Diagram with labels: information need, query, matching, documents d1, d2, …, dn, document collection]

  5. Introduction: IR based on LM
     [Diagram with labels: information need, query, generation, documents d1, d2, …, dn, document collection]

  6. Previous Work
     • Differences from the 2-Poisson model: the LM approach
       • Does not make distributional assumptions.
       • Does not distinguish a subset of specialty words.
       • Does not assume a preexisting classification of documents into elite and non-elite sets.
     • Difference from the Robertson & Sparck Jones model and the Croft & Harper model
       • Does not focus on relevance, except to the extent that the process of query production is correlated with it.
     • Fuhr model
     • INQUERY
     • Kwok, Wong & Yao, Kalt

  7. Model Description: Query generation probability
     • Ranking formula
       • The probability of producing the query given the language model of document d:
         $\hat{p}(Q \mid M_d) = \prod_{t \in Q} \hat{p}_{ml}(t, d)$, where $\hat{p}_{ml}(t, d) = \frac{tf_{(t,d)}}{dl_d}$
       • Assumption: given a particular language model, the query terms occur independently.
     • $M_d$: language model of document d
     • $tf_{(t,d)}$: raw tf of term t in document d
     • $dl_d$: total number of tokens in document d
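The ranking formula can be illustrated with a minimal sketch, assuming documents and queries are already tokenized into lists of strings. The function names and the toy collection are hypothetical and chosen only for illustration.

```python
from collections import Counter

def p_ml(term, doc_tokens):
    """Maximum likelihood estimate: raw tf of the term divided by document length."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def query_likelihood(query_tokens, doc_tokens):
    """Product of per-term ML estimates, assuming query terms occur independently."""
    prob = 1.0
    for term in query_tokens:
        prob *= p_ml(term, doc_tokens)
    return prob

# Toy collection (illustrative data, not from the experiments).
docs = {
    "d1": "satellite launch contract signed for satellite".split(),
    "d2": "rocket launch delayed".split(),
}
query = ["satellite", "launch"]

# Rank document ids by the probability of generating the query.
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)
print(ranking)
```

Note that d2 receives a probability of exactly zero because it lacks the term "satellite", which is the problem of insufficient data.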

  8. Model Description: Insufficient data
     • Zero probability
       • We do not wish to assign a probability of zero to a document that is missing one or more of the query terms.
       • It would be a somewhat radical assumption to infer that $p(t \mid M_d) = 0$.
     • Assumption
       • A non-occurring term is possible, but no more likely than what would be expected by chance in the collection.
       • If $tf_{(t,d)} = 0$, then $\hat{p}(t \mid M_d) = \frac{cf_t}{cs}$.
     • $cf_t$: raw count of term t in the collection
     • $cs$: raw collection size (total number of tokens in the collection)
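A minimal sketch of this fallback, assuming tokenized documents: a term that occurs in the document gets its within-document relative frequency, and a missing term gets its collection-wide relative frequency. The name `p_term` and the toy data are hypothetical.

```python
from collections import Counter

def p_term(term, doc_tokens, collection_counts, collection_size):
    """Per-term probability with a collection-based default for missing terms."""
    tf = Counter(doc_tokens)[term]
    if tf > 0:
        # Term occurs in the document: use its relative frequency there.
        return tf / len(doc_tokens)
    # Term is absent: no more likely than chance in the whole collection.
    return collection_counts[term] / collection_size

# Toy collection (illustrative data only).
docs = [
    "satellite launch contract".split(),
    "rocket launch delayed today".split(),
]
collection_counts = Counter(tok for doc in docs for tok in doc)
collection_size = sum(collection_counts.values())

present = p_term("launch", docs[0], collection_counts, collection_size)
absent = p_term("rocket", docs[0], collection_counts, collection_size)
print(present, absent)
```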

  9. Model Description: Averaging for robustness
     • If we could get an arbitrarily sized sample of data from $M_d$, we could be reasonably confident in the maximum likelihood estimator.
     • We only have a document-sized sample from that distribution.
     • To circumvent this problem, we need an estimate from a larger amount of data: the mean of the maximum likelihood estimates $\hat{p}_{ml}(t, d)$ over the documents containing t,
       $\hat{p}_{avg}(t) = \frac{\sum_{d:\, t \in d} \hat{p}_{ml}(t, d)}{df_t}$
     • $df_t$: document frequency of t
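The averaged estimate can be sketched as follows, assuming tokenized documents. The function name `p_avg` and the toy data are hypothetical; returning 0.0 for a term that occurs nowhere is an assumption made only so that the function is total.

```python
from collections import Counter

def p_avg(term, docs):
    """Mean of the ML estimates over the documents that contain the term."""
    estimates = [Counter(d)[term] / len(d) for d in docs if term in d]
    if not estimates:
        return 0.0  # assumption: term occurs in no document
    # len(estimates) is the document frequency of the term.
    return sum(estimates) / len(estimates)

# Toy collection (illustrative data only).
docs = [
    ["launch", "launch", "satellite", "contract"],  # p_ml(launch) = 0.5
    ["rocket", "launch", "delayed", "today"],       # p_ml(launch) = 0.25
    ["market", "report"],                           # does not contain "launch"
]
launch_avg = p_avg("launch", docs)
print(launch_avg)
```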

  10. Model Description: The Risk
      • We cannot and do not assume that every document containing t is drawn from the same language model.
      • There is some risk in using the mean to estimate $p(t \mid M_d)$.
      • If we used the mean by itself, there would be no distinction between documents with different term frequencies.
      • The risk for a term t in a document d (geometric distribution):
        $\hat{R}_{t,d} = \left(\frac{1}{1 + \bar{f}_t}\right) \times \left(\frac{\bar{f}_t}{1 + \bar{f}_t}\right)^{tf_{(t,d)}}$
      • As the tf gets further away from the normalized mean, the mean probability becomes riskier to use as an estimate.
      • $\bar{f}_t$: mean term frequency of term t in documents where t occurs, normalized by document length (= $\hat{p}_{avg}(t) \times dl_d$, the average probability of t times the length of d)
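A minimal sketch of the geometric risk function, assuming the average probability of the term and the document length are already known. The function and parameter names are hypothetical.

```python
def risk(tf, p_avg_t, doc_len):
    """Geometric-distribution risk for a term with raw frequency tf in a document."""
    # Mean term frequency normalized to this document's length.
    f_bar = p_avg_t * doc_len
    return (1.0 / (1.0 + f_bar)) * (f_bar / (1.0 + f_bar)) ** tf

# With p_avg = 0.25 and a 4-token document, the normalized mean frequency is 1.
low_tf = risk(tf=1, p_avg_t=0.25, doc_len=4)
high_tf = risk(tf=3, p_avg_t=0.25, doc_len=4)
print(low_tf, high_tf)
```

In this example the value falls from 0.25 to 0.0625 as tf rises from 1 to 3.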

  11. Model Description: Combining the two estimates
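The equation on this slide did not survive in the transcript. The following is a minimal sketch, assuming the risk value is used as a mixing exponent between the maximum likelihood estimate and the mean (ml^(1 - R) × avg^R) for terms that occur in the document, with the collection-based default for terms that do not. Scoring the query as a product over query terms only is a simplifying assumption, and all names and data are hypothetical.

```python
from collections import Counter

def score_term(term, doc, docs, coll_counts, coll_size):
    """Combined per-term estimate (assumed form: ml^(1-R) * avg^R)."""
    tf = Counter(doc)[term]
    if tf == 0:
        # Default for a non-occurring term: collection relative frequency.
        return coll_counts[term] / coll_size
    dl = len(doc)
    p_ml = tf / dl
    containing = [d for d in docs if term in d]
    p_avg = sum(Counter(d)[term] / len(d) for d in containing) / len(containing)
    f_bar = p_avg * dl
    r = (1.0 / (1.0 + f_bar)) * (f_bar / (1.0 + f_bar)) ** tf
    return p_ml ** (1.0 - r) * p_avg ** r

def score_query(query, doc, docs, coll_counts, coll_size):
    """Product of combined estimates over the query terms (simplified)."""
    prob = 1.0
    for term in query:
        prob *= score_term(term, doc, docs, coll_counts, coll_size)
    return prob

# Toy collection (illustrative data only).
d1 = "satellite launch contract satellite".split()
d2 = "rocket launch delayed today".split()
docs = [d1, d2]
coll_counts = Counter(tok for d in docs for tok in d)
coll_size = sum(coll_counts.values())
query = ["satellite", "launch"]

s1 = score_query(query, d1, docs, coll_counts, coll_size)
s2 = score_query(query, d2, docs, coll_counts, coll_size)
print(s1, s2)
```

Unlike a pure maximum likelihood ranking, d2 keeps a non-zero score even though it lacks "satellite".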

  12. Model Description: Analysis of the formulation
      • Generalization: formulation of the LM for IR
        [Equation not preserved; its two components are labeled "general language model" and "individual-document model"]
      • Conception
        • The user has a document in mind and generates the query from this document.
        • The equation represents the probability that the document the user had in mind was in fact this one.

  13. Empirical Results: Experiment Environment
      • Data
        • TREC topics 202-250 on TREC disks 2 and 3
          • Natural language queries consisting of one sentence each
        • TREC topics 51-100 on TREC disk 3, using the concept fields
          • Lists of good terms
      • Example topic
        <num>Number: 054
        <dom>Domain: International Economics
        <title>Topic: Satellite Launch Contracts
        <desc>Description:
        …
        <con>Concept(s):
        Contract, agreement
        Launch vehicle, rocket, payload, satellite
        Launch services, …
        …

  14. Empirical Results: Recall/Precision Experiments (1)

  15. Empirical Results: Recall/Precision Experiments (2)

  16. Empirical Results: Improving the Basic Model (1)
      • Smoothing the estimate of the average probability for terms with low document frequency
        • For such terms the estimate is based on a small amount of data,
        • so it could be sensitive to outliers.
      • Binned estimate
        • Bin the low-frequency data by document frequency.
        • Cutoff: df = 100
        • Use the binned estimate for the average.
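The slide gives no detail on how the bins are formed, so the following is only one plausible reading: terms whose document frequency is below the cutoff are grouped by their exact df, and each such term's average probability is replaced by the mean of its bin. The function name, the strict "below cutoff" comparison, and the data are all assumptions.

```python
from collections import defaultdict

def binned_p_avg(p_avg, df, cutoff=100):
    """Replace the average probability of low-df terms by the mean of their df bin."""
    bins = defaultdict(list)
    for term, freq in df.items():
        if freq < cutoff:  # assumption: strictly below the cutoff counts as low frequency
            bins[freq].append(p_avg[term])
    bin_mean = {freq: sum(vals) / len(vals) for freq, vals in bins.items()}
    # Low-df terms take the bin mean; other terms keep their own estimate.
    return {
        term: bin_mean[df[term]] if df[term] < cutoff else value
        for term, value in p_avg.items()
    }

# Illustrative values only.
p_avg = {"a": 0.1, "b": 0.3, "c": 0.05}
df = {"a": 2, "b": 2, "c": 500}
smoothed = binned_p_avg(p_avg, df)
print(smoothed)
```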

  17. Empirical Results: Improving the Basic Model (2)

  18. Empirical Results: Improving the Basic Model (3)

  19. Conclusions & Future Work
      • Conclusions
        • A novel way of looking at the problem of text retrieval, based on probabilistic language modeling
          • Conceptually simple and explanatory
        • LM will provide effective retrieval and can be improved to the extent that the following conditions can be met:
          • Our language models are accurate representations of the data.
          • Users understand our approach to retrieval.
          • Users have some sense of term distribution.
        • The ability to think about retrieval in a new way
      • Future Work
        • Estimate of the default probability
          • The current estimator could in some strange cases assign a higher probability to a non-occurring query term.
        • Query expansion

  20. Relevance Feedback in LM: LM approach to multiple relevant documents
      • Current LM approach
        • Allows for N+1 language models
          • N (collection size) individual document models + the general language model
        • The question of the relationship between the general language model and the individual document models is never raised.
          • How can a document be generated from one language model when the entire collection is generated from a different one?
      • We need …
        • A general model for some accumulation of text, which is modified (not replaced) by a local model for some smaller part of the same text.

  21. Relevance Feedback in LM: 3-level model (1)
      • 3-level model
        • Whole collection model
        • Specific-topic model (relevant-documents model)
        • Individual-document model
      • Relevance hypothesis
        • A request (query; topic) is generated from a specific-topic model, combined with the whole collection model.
        • Iff a document is relevant to the topic, the same model will apply to the document.
          • It will replace part of the individual-document model in explaining the document.
      • The probability of relevance of a document
        • The probability that this model explains part of the document
        • The probability that the three-model combination (whole collection, specific-topic, individual-document) is better than the two-model combination without the specific-topic model

  22. Relevance Feedback in LM: 3-level model (2)
      [Diagram with labels: information need, query, generation, documents d1, d2, …, dn, document collection]

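  23. Geometric distribution (1)
      • Geometric distribution
        • Repeat Bernoulli trials with success probability p until the first success, and let X be the total number of trials. The distribution of the random variable X is the geometric distribution.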
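Written out, the definition above corresponds to the standard probability mass function of the geometric distribution: the first success on trial k requires k - 1 failures followed by one success.

```latex
P(X = k) = (1 - p)^{k-1}\, p, \qquad k = 1, 2, 3, \dots
```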

  24. Geometric distribution (2)
      • Example
        • One run of an experiment costs 100,000 won. The experiment succeeds with probability 0.2 and is repeated until it succeeds. What total cost should be expected?
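The slide poses the question without an answer. A worked solution, using the standard result that a geometric random variable with success probability p has mean 1/p:

```latex
E[X] = \frac{1}{p} = \frac{1}{0.2} = 5
\qquad\Longrightarrow\qquad
E[\text{cost}] = 100{,}000 \times E[X] = 500{,}000 \text{ won}
```

On average five runs are needed, so the expected total cost is 500,000 won.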
