This document presents an overview of language models used in information retrieval, covering types such as unigram and bigram models and the query likelihood model. It discusses the importance of smoothing techniques for handling unseen words and evaluates retrieval using precision and recall. The approach is compared with other models, highlighting its mathematical and conceptual advantages while noting its limitations in capturing relevance. Examples illustrate how these models rank documents by the probability of the query.
Language Models for Information Retrieval Andy Luong and Nikita Sudan
Outline • Language Model • Types of Language Models • Query Likelihood Model • Smoothing • Evaluation • Comparison with other approaches
Language Model • A language model is a function that puts a probability measure over strings drawn from some vocabulary.
Language Models • Rank by P(q | Md), the probability of the query under the document's language model, instead of directly modeling P(R=1 | q, d)
Example • Doc1: “frog said that toad likes frog” • Doc2: “toad likes frog” • M1 (MLE unigram): P(frog) = 2/6 = 1/3, P(said) = P(that) = P(toad) = P(likes) = 1/6 • M2: P(toad) = P(likes) = P(frog) = 1/3
Example Continued q = “frog likes toad”, where 0.8 is the probability of continuing after a token and 0.2 the probability of stopping P(q | M1) = (1/3)*(1/6)*(1/6)*0.8*0.8*0.2 ≈ 0.0012 P(q | M2) = (1/3)*(1/3)*(1/3)*0.8*0.8*0.2 ≈ 0.0047 P(q | M1) < P(q | M2)
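A minimal sketch (not part of the original slides) that reproduces these numbers under an MLE unigram model; the continue/stop probabilities 0.8 and 0.2 are taken from the example above.

```python
from collections import Counter

def unigram_query_likelihood(query, doc, p_continue=0.8):
    """P(q | Md) under an MLE unigram model with continue/stop probabilities."""
    counts = Counter(doc.split())
    total = sum(counts.values())
    tokens = query.split()
    prob = 1.0
    for t in tokens:
        prob *= counts[t] / total  # MLE term probability; 0 if t is unseen
    # continue after every token except the last, then stop
    return prob * p_continue ** (len(tokens) - 1) * (1 - p_continue)

doc1 = "frog said that toad likes frog"
doc2 = "toad likes frog"
print(unigram_query_likelihood("frog likes toad", doc1))  # ≈ 0.0012
print(unigram_query_likelihood("frog likes toad", doc2))  # ≈ 0.0047
```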
Types of Language Models • CHAIN RULE: P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t1 t2) P(t4|t1 t2 t3) • UNIGRAM LM: P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4) • BIGRAM LM: P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)
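For illustration (not in the original slides), a minimal MLE bigram estimator in the same style; the function name is my own:

```python
from collections import Counter

def bigram_prob(sequence, text):
    """P(t1..tn) = P(t1) * product of P(ti | ti-1), all MLE-estimated from text."""
    tokens = text.split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    seq = sequence.split()
    prob = unigrams[seq[0]] / len(tokens)
    for prev, cur in zip(seq, seq[1:]):
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

print(bigram_prob("toad likes frog", "frog said that toad likes frog"))
# P(toad) = 1/6, P(likes | toad) = 1, P(frog | likes) = 1  ->  1/6
```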
Multinomial distribution • P(q) = (L! / (tf_{t1,q}! · tf_{t2,q}! · … · tf_{tM,q}!)) · P(t1)^tf_{t1,q} · P(t2)^tf_{t2,q} · … · P(tM)^tf_{tM,q}, where L = tf_{t1,q} + … + tf_{tM,q} is the query length • Term frequency matters; term order does not • M is the size of the term vocabulary
Query Likelihood Model • Infer a language model Md(i) for each document d(i) • Estimate P(q | Md(i)) • Rank documents by these probabilities (a minimal sketch follows below)
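Putting the three steps together, a minimal ranking loop, assuming the unigram_query_likelihood sketch from the earlier example:

```python
def rank(query, docs):
    """Rank documents by query likelihood, highest P(q | Md) first."""
    scored = [(unigram_query_likelihood(query, d), d) for d in docs]
    return sorted(scored, reverse=True)

for score, doc in rank("frog likes toad",
                       ["frog said that toad likes frog", "toad likes frog"]):
    print(f"{score:.6f}  {doc}")
```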
Smoothing • Basic intuition: a new or unseen word in the document gives P(t | Md) = 0 • A single zero probability makes P(q | Md) = 0 for the whole query • Why else should we smooth? Even observed counts from a single short document over- or under-estimate the true term probabilities
Smoothing Continued • Non-occurring term: its smoothed probability should be bounded by its collection frequency, P(t | Md) ≤ cf_t / T, where cf_t is the number of occurrences of t in the collection and T is the total number of tokens in the collection • Linear interpolation language model (Jelinek-Mercer): P(t | d) = λ P(t | Md) + (1 − λ) P(t | Mc), where Mc is the collection language model
Example • Doc1: “frog said that toad likes frog” • Doc2: “toad likes frog” • Collection model Mc (9 tokens): P(frog) = 3/9 = 1/3, P(said) = 1/9, P(that) = 1/9, P(toad) = 2/9, P(likes) = 2/9
Example Continued q = “frog said”, λ = ½ P(q | M1) = [(1/3 + 1/3)*(1/2)] * [(1/6 + 1/9)*(1/2)] ≈ 0.046 P(q | M2) = [(1/3 + 1/3)*(1/2)] * [(0 + 1/9)*(1/2)] ≈ 0.018 P(q | M1) > P(q | M2)
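A sketch of Jelinek-Mercer smoothing (my own code, not from the slides) that reproduces the two values above:

```python
from collections import Counter

def jm_query_likelihood(query, doc, collection, lam=0.5):
    """P(q | d) with linear interpolation:
    P(t | d) = lam * P(t | Md) + (1 - lam) * P(t | Mc)."""
    d_tokens, c_tokens = doc.split(), collection.split()
    d_counts, c_counts = Counter(d_tokens), Counter(c_tokens)
    prob = 1.0
    for t in query.split():
        p_doc = d_counts[t] / len(d_tokens)
        p_col = c_counts[t] / len(c_tokens)
        prob *= lam * p_doc + (1 - lam) * p_col
    return prob

doc1 = "frog said that toad likes frog"
doc2 = "toad likes frog"
collection = doc1 + " " + doc2
print(jm_query_likelihood("frog said", doc1, collection))  # ≈ 0.046
print(jm_query_likelihood("frog said", doc2, collection))  # ≈ 0.018
```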
Evaluation • Precision = |relevant documents ∩ retrieved documents| / |retrieved documents| • Recall = |relevant documents ∩ retrieved documents| / |relevant documents|
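A set-based sketch of these two metrics (illustrative code, not from the slides):

```python
def precision_recall(relevant, retrieved):
    """Set-based precision and recall for a single query."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    return hits / len(retrieved), hits / len(relevant)

p, r = precision_recall(relevant={"d1", "d3"}, retrieved={"d1", "d2"})
print(p, r)  # 0.5 0.5
```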
Tf-Idf • A term's importance increases proportionally with the number of times it appears in the document, but is offset by how frequently the term appears in the corpus.
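The slides do not fix a weighting scheme; a sketch of one common variant, tf × log(N/df), is:

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """tf-idf with raw term frequency and log inverse document frequency."""
    tf = Counter(doc.split())[term]
    df = sum(1 for d in corpus if term in d.split())
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = ["frog said that toad likes frog", "toad likes frog"]
print(tf_idf("said", corpus[0], corpus))  # 1 * log(2/1) ≈ 0.69
print(tf_idf("frog", corpus[0], corpus))  # 2 * log(2/2) = 0.0
```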
Pros and Cons • “Mathematically precise, conceptually simple, computationally tractable and intuitively appealing.” • Relevance is not explicitly modeled: P(q | Md) stands in for P(R=1 | q, d)
Query vs. Document Model • (a) Query likelihood: rank by P(q | Md) • (b) Document likelihood: rank by P(d | Mq) • (c) Model comparison: rank by how close the query model Mq and document model Md are, e.g. by negative KL divergence (a sketch follows below)
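As a sketch of option (c), ranking by negative KL divergence between a query model and a document model; the epsilon smoothing here is an illustrative assumption, not something specified in the slides:

```python
import math
from collections import Counter

def neg_kl_score(query, doc, vocab, eps=1e-9):
    """-KL(Mq || Md) over a shared vocabulary; higher means a better match."""
    q_counts, d_counts = Counter(query.split()), Counter(doc.split())
    q_len, d_len = len(query.split()), len(doc.split())
    score = 0.0
    for t in vocab:
        # eps keeps both models strictly positive so the log is defined
        pq = (q_counts[t] + eps) / (q_len + eps * len(vocab))
        pd = (d_counts[t] + eps) / (d_len + eps * len(vocab))
        score -= pq * math.log(pq / pd)
    return score

vocab = {"frog", "said", "that", "toad", "likes"}
print(neg_kl_score("frog likes toad", "toad likes frog", vocab))
```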