Probabilistic Models in Information Retrieval SI650: Information Retrieval

SI650: Information Retrieval, Winter 2010. School of Information, University of Michigan. Most slides are from Prof. ChengXiang Zhai’s lectures.

Presentation Transcript


  1. Probabilistic Models in Information Retrieval. SI650: Information Retrieval, Winter 2010, School of Information, University of Michigan. Most slides are from Prof. ChengXiang Zhai’s lectures

  2. Essential Probability & Statistics

  3. Prob/Statistics & Text Management • Probability & statistics provide a principled way to quantify the uncertainties associated with natural language • Allow us to answer questions like: • Given that we observe “baseball” three times and “game” once in a news article, how likely is it that the article is about “sports”? (text categorization, information retrieval) • Given that a user is interested in sports news, how likely is the user to use “baseball” in a query? (information retrieval)

  4. Basic Concepts in Probability • Random experiment: an experiment with uncertain outcome (e.g., flipping a coin, picking a word from text) • Sample space: all possible outcomes, e.g., • Tossing 2 fair coins, S = {HH, HT, TH, TT} • Event: a subset of the sample space • E ⊆ S, E happens iff the outcome is in E, e.g., • E={HH} (all heads) • E={HH,TT} (same face) • Impossible event ({}), certain event (S) • Probability of an event: 0 ≤ P(E) ≤ 1, s.t. • P(S)=1 (outcome always in S) • P(A ∪ B)=P(A)+P(B), if A ∩ B = ∅ (e.g., A=same face, B=different face)
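To make these definitions concrete, here is a minimal Python sketch (illustrative only; the events and fair-coin probabilities follow the slide) that enumerates the two-coin sample space and checks the additivity rule for the disjoint events “same face” and “different face”:

```python
from itertools import product

# Sample space for tossing 2 fair coins: {HH, HT, TH, TT}
sample_space = ["".join(p) for p in product("HT", repeat=2)]
prob = {outcome: 1 / len(sample_space) for outcome in sample_space}  # fair coins

def P(event):
    """Probability of an event = sum of probabilities of its outcomes."""
    return sum(prob[o] for o in event)

same_face = {"HH", "TT"}
diff_face = {"HT", "TH"}

print(P(same_face))              # 0.5
print(P(diff_face))              # 0.5
# Disjoint events: P(A ∪ B) = P(A) + P(B)
print(P(same_face | diff_face))  # 1.0 (here the union is the certain event S)
```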

  5. Example: Flip a Coin Sample space: {Head, Tail} Fair coin: • p(H) = 0.5, p(T) = 0.5 Unfair coin, e.g.: • p(H) = 0.3, p(T) = 0.7 Flipping two fair coins: • Sample space: {HH, HT, TH, TT} Example in modeling text: • Flip a coin to decide whether or not to include a word in a document • Sample space = {appearance, absence}

  6. Example: Toss a Die Sample space: S = {1,2,3,4,5,6} Fair die: • p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6 Unfair die: p(1) = 0.3, p(2) = 0.2, ... N-sided die: • S = {1, 2, 3, 4, …, N} Example in modeling text: • Toss a die to decide which word to write in the next position • Sample space = {cat, dog, tiger, …}
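As a quick illustration of this “word die”, here is a minimal Python sketch (the toy vocabulary and probabilities are invented for illustration) that samples each next word from a categorical distribution:

```python
import random

# A "die" over a toy vocabulary: each face is a word (probabilities are made up).
vocab = ["cat", "dog", "tiger"]
probs = [0.3, 0.5, 0.2]  # must sum to 1

random.seed(0)
# Toss the die 10 times to decide which word to write in each position.
text = random.choices(vocab, weights=probs, k=10)
print(" ".join(text))
```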

  7. Basic Concepts of Prob. (cont.) • Joint probability: P(A ∩ B), also written as P(A, B) • Conditional probability: P(B|A) = P(A ∩ B)/P(A) • P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B) • So, P(A|B) = P(B|A)P(A)/P(B) (Bayes’ Rule) • For independent events, P(A ∩ B) = P(A)P(B), so P(A|B) = P(A) • Total probability: If A1, …, An form a partition of S, then • P(B) = P(B ∩ S) = P(B, A1) + … + P(B, An) (why?) • So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) = P(B|Ai)P(Ai)/[P(B|A1)P(A1)+…+P(B|An)P(An)] • This allows us to compute P(Ai|B) based on P(B|Ai)
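A small Python sketch of Bayes’ rule with the total-probability expansion; the partition A1..A3 and all the numbers are invented for illustration:

```python
# Hypothetical partition A1..A3 of the sample space, with priors P(Ai)
prior = {"A1": 0.5, "A2": 0.3, "A3": 0.2}
# Hypothetical likelihoods P(B|Ai)
likelihood = {"A1": 0.10, "A2": 0.40, "A3": 0.80}

# Total probability: P(B) = sum_i P(B|Ai) P(Ai)
p_b = sum(likelihood[a] * prior[a] for a in prior)

# Bayes' rule: P(Ai|B) = P(B|Ai) P(Ai) / P(B)
posterior = {a: likelihood[a] * prior[a] / p_b for a in prior}

print(p_b)        # ≈ 0.33
print(posterior)  # the posteriors sum to 1
```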

  8. Revisit: Bayes’ Rule. Hypothesis space: H = {H1, …, Hn}; evidence: E. P(Hi|E) = P(E|Hi)P(Hi)/P(E), where P(Hi|E) is the posterior probability of Hi, P(Hi) is the prior probability of Hi, and P(E|Hi) is the likelihood of the data/evidence if Hi is true. In text classification: H is the class space; E is the data (features). If we want to pick the most likely hypothesis H*, we can drop P(E).

  9. Random Variable • X: S → ℝ (a “measure” of the outcome) • E.g., number of heads, all same face?, … • Events can be defined according to X • E(X=a) = {si | X(si) = a} • E(X≥a) = {si | X(si) ≥ a} • So, probabilities can be defined on X • P(X=a) = P(E(X=a)) • P(X≥a) = P(E(X≥a)) • Discrete vs. continuous random variables (think of “partitioning the sample space”)

  10. Getting to Statistics ... • We are flipping an unfair coin, but P(Head)=? (parameter estimation) • If we see the results of a huge number of random experiments, then we can estimate P(Head) by the relative frequency count(Heads)/count(all flips) • But what if we only see a small sample (e.g., 2)? Is this estimate still reliable? If we flip twice and get two tails, does that mean P(Head) = 0? • In general, statistics has to do with drawing conclusions about the whole population based on observations of a sample (data)
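A minimal Python sketch of this relative-frequency (maximum likelihood) estimate, assuming a made-up “true” coin bias just to generate data:

```python
import random

random.seed(1)
true_p_head = 0.3  # assumed for the simulation; unknown to the estimator

def flip(n):
    return ["H" if random.random() < true_p_head else "T" for _ in range(n)]

def ml_estimate(flips):
    # Relative frequency: count(Heads) / count(all flips)
    return flips.count("H") / len(flips)

print(ml_estimate(flip(100000)))  # close to 0.3 with a huge sample
print(ml_estimate(flip(2)))       # with only 2 flips the estimate is unreliable (often 0.0)
```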

  11. Parameter Estimation • General setting: • Given a (hypothesized & probabilistic) model that governs the random experiment • The model gives a probability of any data p(D|θ) that depends on the parameter θ • Now, given actual sample data X={x1,…,xn}, what can we say about the value of θ? • Intuitively, take your best guess of θ -- “best” means “best explaining/fitting the data” • Generally an optimization problem

  12. Maximum Likelihood vs. Bayesian • Maximum likelihood estimation • “Best” means “data likelihood reaches maximum” • Problem: unreliable with a small sample • Bayesian estimation • “Best” means being consistent with our “prior” knowledge and explaining the data well • Problem: how to define the prior?
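A minimal sketch contrasting the two estimators on the coin example, assuming a Beta(a, b) prior on P(Head); the prior and its parameters are my illustrative choice, not part of the slides:

```python
def ml_estimate(heads, flips):
    # Maximum likelihood: relative frequency
    return heads / flips

def map_estimate(heads, flips, a=5, b=5):
    # Bayesian (MAP) estimate with a Beta(a, b) prior on P(Head):
    # posterior mode = (heads + a - 1) / (flips + a + b - 2)
    return (heads + a - 1) / (flips + a + b - 2)

# Two flips, both tails: ML says P(Head)=0, while the prior keeps MAP away from 0.
print(ml_estimate(0, 2))   # 0.0
print(map_estimate(0, 2))  # 0.4
```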

  13. Illustration of Bayesian Estimation. Posterior: p(θ|X) ∝ p(X|θ)p(θ); likelihood: p(X|θ), with X=(x1,…,xN); prior: p(θ). [Figure: θ0 is the prior mode, θ the posterior mode, and θml the ML estimate; the posterior is a compromise between the prior and the likelihood.]

  14. Maximum Likelihood Estimate. Data: a document d with counts c(w1), …, c(wN), and length |d|. Model: multinomial distribution M with parameters {p(wi)}. Likelihood: p(d|M). Maximum likelihood estimator: M* = argmax_M p(d|M). Problem: a document is short; if c(wi) = 0, does this mean p(wi) = 0?

  15. If you want to know how we get it… Data: a document d with counts c(w1), …, c(wN), and length |d|. Model: multinomial distribution M with parameters {p(wi)}. Likelihood: p(d|M) ∝ Π_i p(wi)^c(wi), so the log-likelihood is l(d|M) = Σ_i c(wi) log p(wi). We tune the p(wi) to maximize l(d|M) subject to Σ_i p(wi) = 1, using the Lagrange multiplier approach: maximize l(d|M) + λ(Σ_i p(wi) − 1). Setting the partial derivatives to zero gives c(wi)/p(wi) + λ = 0; the constraint then forces λ = −|d|, yielding the ML estimate p(wi) = c(wi)/|d|.
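A minimal Python sketch of this ML estimate; the word counts below are invented for illustration:

```python
from collections import Counter

# Hypothetical word counts c(w) for a document d (|d| = total count).
counts = Counter({"text": 10, "mining": 5, "association": 3,
                  "database": 3, "algorithm": 2, "query": 1})
doc_length = sum(counts.values())

# ML estimate: p(w) = c(w) / |d|
p_ml = {w: c / doc_length for w, c in counts.items()}
print(p_ml["text"])  # 10/24 here; any unseen word gets probability 0
```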

  16. Statistical Language Models

  17. What is a Statistical Language Model? • A probability distribution over word sequences • p(“Today is Wednesday”) ≈ 0.001 • p(“Today Wednesday is”) ≈ 0.0000000000001 • p(“The eigenvalue is positive”) ≈ 0.00001 • Context-dependent! • Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model

  18. Why is a Language Model Useful? • Provides a principled way to quantify the uncertainties associated with natural language • Allows us to answer questions like: • Given that we see “John” and “feels”, how likely are we to see “happy” as opposed to “habit” as the next word? (speech recognition) • Given that we observe “baseball” three times and “game” once in a news article, how likely is it that the article is about “sports”? (text categorization, information retrieval) • Given that a user is interested in sports news, how likely is the user to use “baseball” in a query? (information retrieval)

  19. Source-Channel Framework (Communication System). Source → Transmitter (encoder) → X → Noisy Channel → Y → Receiver (decoder) → X’ → Destination. The source emits X with probability P(X), the channel corrupts it according to P(Y|X), and the receiver must recover X from Y, i.e., compute P(X|Y) = ? When X is text, p(X) is a language model.

  20. Speech Recognition. Speaker → Words (W) → Noisy Channel → Acoustic signal (A) → Recognizer → Recognized words. Given the acoustic signal A, find the word sequence W: P(W|A) = ? ∝ P(W)·P(A|W), where P(W) is the language model and P(A|W) is the acoustic model.

  21. Machine Translation. English speaker → English words (E) → Noisy Channel → Chinese words (C) → Translator → English translation. Given a Chinese sentence C, find its English translation E: P(E|C) = ? ∝ P(E)·P(C|E), where P(E) is the English language model and P(C|E) is the English→Chinese translation model.

  22. Spelling/OCR Error Correction. Original text → Original words (O) → Noisy Channel → “Erroneous” words (E) → Corrector → Corrected text. Given the corrupted text E, find the original text O: P(O|E) = ? ∝ P(O)·P(E|O), where P(O) is the “normal” language model and P(E|O) is the spelling/OCR error model.

  23. Basic Issues • Define the probabilistic model • Event, random variables, joint/conditional probabilities • P(w1 w2 ... wn) = f(θ1, θ2, …, θm) • Estimate model parameters • Tune the model to best fit the data and our prior knowledge • θi = ? • Apply the model to a particular task • Many applications

  24. The Simplest Language Model (Unigram Model) • Generate a piece of text by generating each word INDEPENDENTLY • Thus, p(w1 w2 ... wn) = p(w1)p(w2)…p(wn) • Parameters: {p(wi)}, with p(w1)+…+p(wN) = 1 (N is the vocabulary size) • Essentially a multinomial distribution over words • A piece of text can be regarded as a sample drawn according to this word distribution

  25. Modeling Text As Tossing a Die Sample space: S = {w1,w2,w3,… wN} Die is unfair: • P(cat) = 0.3, P(dog) = 0.2, ... • Toss a die to decide which word to write in the next position E.g., Sample space = {cat, dog, tiger} P(cat) = 0.3; P(dog) = 0.5; P(tiger) = 0.2 Text = “cat cat dog” P(Text) = P(cat) * P(cat) * P(dog) = 0.3 * 0.3 * 0.5 = 0.045
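A one-line check of this computation under the unigram independence assumption, using the same toy probabilities as the slide:

```python
p = {"cat": 0.3, "dog": 0.5, "tiger": 0.2}

def p_text(words):
    # Unigram model: multiply the per-word probabilities
    result = 1.0
    for w in words:
        result *= p[w]
    return result

print(round(p_text("cat cat dog".split()), 6))  # 0.3 * 0.3 * 0.5 = 0.045
```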

  26. Text Generation with Unigram LM. A (unigram) language model θ, with word distribution p(w|θ), generates a document by sampling words. • Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, … → generates a text mining paper • Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … → generates a food nutrition paper

  27. Estimation of Unigram LM. Given a document (a “text mining paper”, total #words = 100) with counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …, estimate the (unigram) language model θ: p(w|θ) = ? The ML estimates are the relative frequencies: text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …

  28. Empirical Distribution of Words • There are stable language-independent patterns in how people use natural languages • A few words occur very frequently; most occur rarely. E.g., in news articles: • Top 4 words: 10~15% of word occurrences • Top 50 words: 35~40% of word occurrences • The most frequent word in one corpus may be rare in another

  29. Revisit: Zipf’s Law • rank * frequency ≈ constant • [Figure: word frequency vs. word rank (by frequency); the highest-frequency words are stop words (the biggest data structure), the mid-frequency band holds the most useful words (Luhn 57), and the low-frequency tail raises the question of whether “too rare” is a problem] • Generalized Zipf’s law is applicable in many domains
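A small Python sketch of the rank × frequency check; the counts are invented to be roughly Zipfian (real corpora follow the law only approximately):

```python
# Hypothetical word frequencies, constructed so that frequency ∝ 1/rank.
freqs = {"the": 1000, "of": 500, "to": 333, "and": 250, "a": 200, "in": 167}

ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq, rank * freq)  # rank * frequency stays roughly constant
```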

  30. Problem with the ML Estimator • What if a word doesn’t appear in the text? • In general, what probability should we give a word that has not been observed? • If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words • This is what “smoothing” is about …

  31. Language Model Smoothing (Illustration). [Figure: P(w) vs. w — the smoothed LM lies slightly below the maximum likelihood estimate on words seen in the document and assigns non-zero probability to unseen words.]

  32. How to Smooth? • All smoothing methods try to • discount the probability of words seen in a document • re-allocate the extra counts so that unseen words will have a non-zero count • Method 1 (Additive smoothing): add a constant δ to the count of each word: p(w|d) = (c(w,d) + δ) / (|d| + δ|V|), where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size; δ = 1 gives “add one” (Laplace) smoothing • Problems?
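A minimal Python sketch of additive (add-δ) smoothing over a fixed vocabulary; the document and vocabulary are invented, and δ = 1 reproduces Laplace/add-one smoothing:

```python
from collections import Counter

vocab = ["text", "mining", "association", "database", "algorithm", "query", "clustering"]
doc = "text mining text database algorithm text mining query".split()
counts = Counter(doc)

def p_additive(w, delta=1.0):
    # p(w|d) = (c(w,d) + delta) / (|d| + delta * |V|)
    return (counts[w] + delta) / (len(doc) + delta * len(vocab))

print(p_additive("text"))        # seen word: discounted relative frequency
print(p_additive("clustering"))  # unseen word: small but non-zero
print(sum(p_additive(w) for w in vocab))  # ≈ 1.0 over the vocabulary
```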

  33. How to Smooth? (cont.) • Should all unseen words get equal probabilities? • We can use a reference model to discriminate unseen words: p(w|d) = p_seen(w|d) (a discounted ML estimate) if w is seen in d, and p(w|d) = αd·p(w|REF) (a reference language model) otherwise, where αd makes the probabilities sum to 1.

  34. Other Smoothing Methods • Method 2 (Absolute discounting): subtract a constant δ from the count of each seen word: p_seen(w|d) = (max(c(w,d) − δ, 0) + δ|d|u·p(w|REF)) / |d|, where |d|u is the number of unique words in d, and αd = δ|d|u/|d| • Method 3 (Linear interpolation, Jelinek-Mercer): “shrink” the ML estimate uniformly toward p(w|REF): p(w|d) = (1 − λ)·c(w,d)/|d| + λ·p(w|REF), with αd = λ

  35. Other Smoothing Methods (cont.) • Method 4 (Dirichlet prior/Bayesian): assume μ·p(w|REF) pseudo counts for each word: p(w|d) = (c(w,d) + μ·p(w|REF)) / (|d| + μ), with αd = μ/(|d| + μ) • Method 5 (Good-Turing): assume the total count of unseen events to be n1 (the number of singletons), and adjust the counts of seen events in the same way
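A minimal Python sketch of two of the methods above, Jelinek-Mercer interpolation and Dirichlet prior smoothing, against a collection (reference) language model; the document, collection, λ, and μ are all invented for illustration:

```python
from collections import Counter

# Hypothetical document and collection (reference) statistics.
doc = "text mining text database algorithm text mining query".split()
collection = ("text mining data database query algorithm food nutrition "
              "text data mining clustering").split()

c_doc, d_len = Counter(doc), len(doc)
c_col, col_len = Counter(collection), len(collection)

def p_ref(w):
    # Collection (reference) language model, here itself ML-estimated.
    return c_col[w] / col_len

def p_jm(w, lam=0.5):
    # Jelinek-Mercer: (1 - lambda) * p_ml(w|d) + lambda * p(w|REF)
    return (1 - lam) * c_doc[w] / d_len + lam * p_ref(w)

def p_dirichlet(w, mu=10.0):
    # Dirichlet prior: (c(w,d) + mu * p(w|REF)) / (|d| + mu)
    return (c_doc[w] + mu * p_ref(w)) / (d_len + mu)

for w in ["text", "food"]:  # one word seen in the doc, one unseen in the doc
    print(w, p_jm(w), p_dirichlet(w))
```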

  36. So, which method is the best? It depends on the data and the task! Many other sophisticated smoothing methods have been proposed… Cross validation is generally used to choose the best method and/or set the smoothing parameters… For retrieval, Dirichlet prior performs well… Smoothing will be discussed further in the course…

  37. Language Models for IR

  38. Text Generation with Unigram LM. A (unigram) language model θ, with word distribution p(w|θ), generates a document by sampling words. • Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, … → generates a text mining paper • Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … → generates a food nutrition paper

  39. Estimation of Unigram LM. Given a document (a “text mining paper”, total #words = 100) with counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …, estimate the (unigram) language model θ: p(w|θ) = ? The ML estimates are the relative frequencies: text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …

  40. Language Models for Retrieval (Ponte & Croft 98). Each document is represented by its language model: for the text mining paper, text ?, mining ?, association ?, clustering ?, …, food ?, …; for the food nutrition paper, food ?, nutrition ?, healthy ?, diet ?, … Given the query “data mining algorithms”, which model would most likely have generated this query?

  41. Ranking Docs by Query Likelihood. For each document di in the collection d1, …, dN, estimate a document language model (Doc LM) and compute the query likelihood p(q|d1), p(q|d2), …, p(q|dN); rank the documents by this likelihood.
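A minimal Python sketch of query-likelihood ranking over a toy two-document collection, using Jelinek-Mercer smoothing toward the collection model; the documents, query, and λ are invented for illustration:

```python
import math
from collections import Counter

docs = {
    "d1": "text mining algorithms and text clustering algorithms".split(),
    "d2": "food nutrition and healthy diet".split(),
}
collection = [w for d in docs.values() for w in d]
c_col, col_len = Counter(collection), len(collection)

def score(query, doc, lam=0.5):
    """log p(q|d) with Jelinek-Mercer smoothing toward the collection model."""
    c_doc, d_len = Counter(doc), len(doc)
    log_p = 0.0
    for w in query:
        p = (1 - lam) * c_doc[w] / d_len + lam * c_col[w] / col_len
        log_p += math.log(p)  # assumes every query word occurs in the collection
    return log_p

query = "data mining algorithms".split()
# "data" does not occur in this toy collection, so drop it to keep probabilities non-zero.
query = [w for w in query if c_col[w] > 0]
ranking = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranking)  # d1 (the text mining doc) ranks first
```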

  42. Retrieval as Language Model Estimation • Document ranking based on query likelihood: log p(q|d) = Σ_i log p(wi|d), where q = w1 w2 … wn and p(wi|d) is the document language model • Retrieval problem → estimation of p(wi|d) • Smoothing is an important issue, and distinguishes different approaches

  43. How to Estimate p(w|d)? • Simplest solution: Maximum Likelihood Estimator • P(w|d) = relative frequency of word w in d • What if a word doesn’t appear in the text? P(w|d)=0 • In general, what probability should we give a word that has not been observed? • If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words • This is what “smoothing” is about …

  44. A General Smoothing Scheme • All smoothing methods try to • discount the probability of words seen in a doc • re-allocate the extra probability so that unseen words will have a non-zero probability • Most use a reference model (collection language model) to discriminate unseen words: p(w|d) = p_seen(w|d) (a discounted ML estimate) if w is seen in d, and p(w|d) = αd·p(w|C) (the collection language model) otherwise

  45. Smoothing & TF-IDF Weighting • Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain: log p(q|d) = Σ_{w ∈ q, c(w,d)>0} c(w,q)·log[ p_seen(w|d) / (αd·p(w|C)) ] + |q|·log αd + Σ_{w ∈ q} c(w,q)·log p(w|C), where the first term provides TF weighting (numerator) and IDF weighting (the 1/p(w|C) factor), the |q|·log αd term provides document length normalization (a long doc is expected to have a smaller αd), and the last term is document-independent and can be ignored for ranking • Smoothing with p(w|C) ≈ TF-IDF + length normalization

  46. Three Smoothing Methods (Zhai & Lafferty 01) • Simplified Jelinek-Mercer: shrink uniformly toward p(w|C): p(w|d) = (1 − λ)·p_ml(w|d) + λ·p(w|C) • Dirichlet prior (Bayesian): assume μ·p(w|C) pseudo counts: p(w|d) = (c(w,d) + μ·p(w|C)) / (|d| + μ) • Absolute discounting: subtract a constant δ: p(w|d) = (max(c(w,d) − δ, 0) + δ|d|u·p(w|C)) / |d|

  47. Assume We Use Dirichlet Smoothing. The query likelihood then becomes (up to a document-independent constant): log p(q|d) = Σ_{w ∈ q, c(w,d)>0} c(w,q)·log[1 + c(w,d)/(μ·p(w|C))] + |q|·log( μ/(|d| + μ) ), where c(w,d) is the document term frequency, c(w,q) is the query term frequency, 1/p(w|C) plays a role similar to IDF, |d| is the document length, and the sum only considers words in the query.
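A small Python sketch (toy data, invented μ) that checks this decomposition numerically: the directly computed Dirichlet-smoothed log query likelihood equals the rewritten form plus the document-independent term Σ_w c(w,q)·log p(w|C):

```python
import math
from collections import Counter

doc = "text mining algorithms and text clustering algorithms".split()
collection = doc + "food nutrition and healthy diet data mining".split()
query = "data mining algorithms".split()

c_d, d_len = Counter(doc), len(doc)
c_c, c_len = Counter(collection), len(collection)
mu = 10.0

def p_C(w):
    # Collection language model (ML-estimated on the toy collection).
    return c_c[w] / c_len

# Direct form: sum over query tokens of log of the Dirichlet-smoothed p(w|d).
direct = sum(math.log((c_d[w] + mu * p_C(w)) / (d_len + mu)) for w in query)

# Rewritten form from the slide, plus the document-independent constant.
rewritten = (sum(math.log(1 + c_d[w] / (mu * p_C(w))) for w in query if c_d[w] > 0)
             + len(query) * math.log(mu / (d_len + mu))
             + sum(math.log(p_C(w)) for w in query))

print(abs(direct - rewritten) < 1e-12)  # True: the two forms agree
```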

  48. The Notion of Relevance. Relevance can be modeled as: • P(d→q) or P(q→d): probabilistic inference, with different inference systems → Inference network model (Turtle & Croft, 91), Prob. concept space model (Wong & Yao, 95) • Similarity of (Rep(q), Rep(d)), with different representations & similarity measures → Vector space model (Salton et al., 75), Prob. distr. model (Wong & Yao, 89) • P(r=1|q,d), r ∈ {0,1}: probability of relevance → Regression model (Fox 83); generative models: doc generation → Classical prob. model (Robertson & Sparck Jones, 76), query generation → LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)

  49. Query Generation and Document Generation • Document ranking based on the conditional probability of relevance P(R=1|Q,D), with R ∈ {0,1} • Ranking based on document generation → Okapi/BM25 • Ranking based on query generation → language models

  50. The Notion of Relevance. Relevance can be modeled as: • P(d→q) or P(q→d): probabilistic inference, with different inference systems → Inference network model (Turtle & Croft, 91), Prob. concept space model (Wong & Yao, 95) • Similarity of (Rep(q), Rep(d)), with different representations & similarity measures → Vector space model (Salton et al., 75), Prob. distr. model (Wong & Yao, 89) • P(r=1|q,d), r ∈ {0,1}: probability of relevance → Regression model (Fox 83); generative models: doc generation → Classical prob. model (Robertson & Sparck Jones, 76), query generation → LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
