Modeling and Solving Term Mismatch for Full-Text Retrieval

Presentation Transcript


  1. Modeling and Solving Term Mismatch for Full-Text Retrieval • Le Zhao, lezhao@cs.cmu.edu • School of Computer Science, Carnegie Mellon University • April 16, 2012 @ Microsoft Research, Redmond

  2. What is Full-Text Retrieval • The task • The Cranfield evaluation [Cleverdon 1960] • abstracts away the user, • allows objective & automatic evaluations • (Diagram: the user issues a query to the retrieval engine, which searches the document collection and returns results to the user)

  3. Where are We (Going)? • Current retrieval models • formal models since the 1970s, best ones from the 1990s • based on simple collection statistics (tf.idf), no deep understanding of natural language texts • Perfect retrieval • Query: “information retrieval”, A: “… text search …” (the answer should imply the query) • Textual entailment (a difficult natural language task) • Searcher frustration [Feild, Allan and Jones 2010] • Still far away; what has been holding us back?

  4. Two Long Standing Problems in Retrieval • Term mismatch • [Furnas, Landauer, Gomez and Dumais 1987] • No clear definition in retrieval • Query dependent term importance (P(t | R)) • Traditionally, idf (rareness) • P(t | R) [Robertson and Spärck Jones 1976; Greiff 1998] • Few clues about estimation • This work • connects the two problems, • shows they can result in huge gains in retrieval, • and uses a predictive approach toward solving both problems.

  5. What is Term Mismatch & Why Care? • Job search • You look for information retrieval jobs on the market. They want text search skills. • cost you job opportunities (50% even if you are careful) • Legal discovery • You look for bribery or foul play in corporate documents. They say grease, pay off. • cost you cases • Patent/Publication search • cost businesses • Medical record retrieval • cost lives

  6. Prior Approaches • Document: • Full text indexing • Instead of only indexing key words • Stemming • Include morphological variants • Document expansion • Inlink anchor, user tags • Query: • Query expansion, reformulation • Both: • Latent Semantic Indexing • Translation based models

  7. Main Questions Answered • Definition • Significance (theory & practice) • Mechanism (what causes the problem) • Model and solution

  8. Definition of Mismatch • Mismatch P(t̄ | Rq) can be directly calculated given relevance judgments for q • (Venn diagram: within the collection, the set of relevant documents for q, the "all relevant jobs", overlaps the set of documents that contain t, "retrieval"; relevant jobs outside the overlap are the mismatched ones) • mismatch P(t̄ | Rq) = 1 − term recall P(t | Rq)
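
The definition is simple enough to compute directly. A minimal sketch, assuming relevance judgments are available as sets of document terms (the data structures here are hypothetical):

```python
# Minimal sketch: term recall P(t | Rq) and mismatch P(t-bar | Rq) from
# relevance judgments. `relevant_docs` is a hypothetical list of term sets.

def term_recall(term, relevant_docs):
    """Fraction of relevant documents that contain the term: P(t | Rq)."""
    hits = sum(1 for doc in relevant_docs if term in doc)
    return hits / len(relevant_docs) if relevant_docs else 0.0

def term_mismatch(term, relevant_docs):
    """P(t-bar | Rq) = 1 - term recall."""
    return 1.0 - term_recall(term, relevant_docs)

# Toy example: only 1 of 4 relevant "job" documents says "retrieval".
relevant = [{"text", "search", "skills"},
            {"text", "search", "engineer"},
            {"information", "retrieval", "research"},
            {"search", "ranking", "jobs"}]
print(term_recall("retrieval", relevant))    # 0.25
print(term_mismatch("retrieval", relevant))  # 0.75
```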

  9. How Often Do Terms Mismatch? (Example TREC-3 topics)

  10. Main Questions • Definition • P(t | R) or P(t̄ | R): simple, estimated from relevant documents, used to analyze mismatch • Significance (theory & practice) • Mechanism (what causes the problem) • Model and solution

  11. Term Mismatch & Probabilistic Retrieval Models: Binary Independence Model • [Robertson and Spärck Jones 1976] • Optimal ranking score for each document d • Term weight for Okapi BM25 • Other advanced models behave similarly • Used as effective features in Web search engines • (The slide annotates the two factors of the term weight: term recall and idf (rareness))
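
The weight formula on this slide is an image that does not survive the transcript; a standard reconstruction of the Robertson/Spärck Jones term weight, split into the two factors the slide annotates, is:

```latex
% BIM optimal ranking score for document d, summed over query terms present in d.
% The left term is the term-recall part; the right term behaves like idf (rareness).
\[
\mathrm{score}(d) \;=\; \sum_{t \in q \cap d} \left[
\log \frac{P(t \mid R)}{1 - P(t \mid R)}
\;+\;
\log \frac{1 - P(t \mid \bar{R})}{P(t \mid \bar{R})}
\right]
\]
```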

  12. Term Mismatch & Probabilistic Retrieval Models: Binary Independence Model • [Robertson and Spärck Jones 1976] • Optimal ranking score for each document d • “Relevance Weight”, “Term Relevance” • P(t | R): the only part of the weight about the query & relevance • (Annotated factors: term recall and idf (rareness))

  13. Main Questions • Definition • Significance • Theory (as idf & only part about relevance) • Practice? • Mechanism (what causes the problem) • Model and solution

  14. Term Mismatch & Probabilistic Retrieval Models: Binary Independence Model • [Robertson and Spärck Jones 1976] • Optimal ranking score for each document d • “Relevance Weight”, “Term Relevance” • P(t | R): the only part of the weight about the query & relevance • (Annotated factors: term recall and idf (rareness))

  15. Without Term Recall • The emphasis problem for idf-only term weighting • (Not only for BIM, but also tf.idf models) • Emphasize high idf (rare) terms in query • “prognosis/viability of a political third party in U.S.” (Topic 206)

  16. Ground Truth (Term Recall) • Query: prognosis/viability of a political third party • (Chart: true term recall per query term; idf-only weighting puts the emphasis on the wrong terms)

  17. Top Results (Language model) • Query: prognosis/viability of a political third party • 1. … discouraging prognosis for 1991 … 2. … Politics … party … Robertson's viability as a candidate … 3. … political parties … 4. … there is no viable opposition … 5. … A third of the votes … 6. … politics … party … two thirds … 7. … third ranking political movement … 8. … political parties … 9. … prognosis for the Sunday school … 10. … third party provider … • All are false positives: an emphasis / mismatch problem, not a precision problem. • (Major Web search engines do better, but still have top-10 false positives. Emphasis / mismatch is also a problem for large search engines!)

  18. Without Term Recall • The emphasis problem for idf-only term weighting • Emphasize high idf (rare) terms in query • “prognosis/viability of a political third party in U.S.” (Topic 206) • False positives throughout rank list • especially detrimental at top rank • No term recall hurts precision at all recall levels • How significant is the emphasis problem?

  19. Failure Analysis of 44 Topics from TREC 6-8 • RIA workshop 2003 (7 top research IR systems, >56 expert*weeks); failure analyses of retrieval models & techniques are still standard today • (Chart of failure causes: mismatch accounts for 27%; recall term weighting and mismatch-guided expansion, both based on term mismatch prediction, target these failures)

  20. Main Questions • Definition • Significance • Theory: as idf & only part about relevance • Practice: explains common failures, other behavior: Personalization, WSD, structured • Mechanism (what causes the problem) • Model and solution

  21. Failure Analysis of 44 Topics from TREC 6-8 • RIA workshop 2003 (7 top research IR systems, >56 expert*weeks) • (Chart of failure causes: mismatch accounts for 27%; recall term weighting and mismatch-guided expansion, both based on term mismatch prediction, target these failures)

  22. True Term Recall Effectiveness • +100% over BIM (in precision at all recall levels) • [Robertson and Spärck Jones 1976] • +30-80% over Language Model, BM25 (in MAP) • This work • For a new query w/o relevance judgments, • need to predict P(t | R) • Predictions don't need to be very accurate to show performance gain

  23. Main Questions • Definition • Significance • Theory: as idf & only part about relevance • Practice: explains common failures, other behavior, • +30 to 80% potential from term weighting • Mechanism (what causes the problem) • Model and solution

  24. How Often Do Terms Mismatch? • Same term, different recall (examples from TREC 3 topics) • (Table: term recall varies from 0 to 1 and differs from idf)

  25. Statistics • Term recall across all query terms (average ~55-60%) • TREC 3 titles, 4.9 terms/query: average 55% term recall • TREC 9 descriptions, 6.3 terms/query: average 59% term recall

  26. Statistics • Term recall on shorter queries (average ~70%) • TREC 9 titles, 2.5 terms/query: average 70% term recall • TREC 13 titles, 3.1 terms/query: average 66% term recall

  27. Statistics • Term recall is query dependent (but for many terms, variance is small) • (Chart: term recall for repeating terms; 364 recurring words from TREC 3-7, 350 topics)

  28. P(t | R) vs. idf • (Scatter plot: P(t | R) against df/N for TREC 4 desc query terms, as in Greiff, 1998)

  29. Prior Prediction Approaches • Croft/Harper combination match (1979) • treats P(t | R) as a tuned constant, or estimated from PRF • when >0.5, rewards docs that match more query terms • Greiff's (1998) exploratory data analysis • Used idf to predict overall term weighting • Improved over basic BIM • Metzler's (2008) generalized idf • Used idf to predict P(t | R) • Improved over basic BIM • Simple feature (idf), limited success • Missing piece: P(t | R) = term recall = 1 – term mismatch

  30. What Factors can Cause Mismatch? • Topic centrality (Is the concept central to the topic?) • “Laser research related or potentially related to defense” • “Welfare laws propounded as reforms” • Synonyms (How often do they replace the original term?) • “retrieval” == “search” == … • Abstractness • “Laser research … defense”, “Welfare laws” • “Prognosis/viability” (rare & abstract)

  31. Main Questions • Definition • Significance • Mechanism • Causes of mismatch: Unnecessary concepts, replaced by synonyms or more specific terms • Model and solution

  32. Designing Features to Model the Factors • We need to • identify synonyms/searchonyms of a query term • in a query dependent way • External resource? (WordNet, wiki, or query log) • Biased (coverage problem, collection independent) • Static (not query dependent) • Not easy, not used here • Term-term similarity in concept space! • Local LSI (Latent Semantic Indexing) • Top ranked documents (e.g. 200) • Dimension reduction (LSI keeps e.g. 150 dimensions)
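
A rough sketch of the local LSI step, assuming scikit-learn is available and `top_docs` holds the raw text of the ~200 top-ranked documents for the query; the function name and parameter values are illustrative, not the exact setup of this work:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def local_lsi_searchonyms(query_term, top_docs, n_dims=150, k=10):
    """Return the k terms most similar to query_term in the local concept space."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(top_docs)                  # docs x terms matrix
    n_dims = min(n_dims, min(X.shape) - 1)           # guard for small samples
    svd = TruncatedSVD(n_components=n_dims).fit(X)
    # Term vectors in the reduced space: columns of Sigma * V^T, transposed.
    term_vecs = (svd.components_ * svd.singular_values_[:, None]).T
    vocab = vec.vocabulary_
    if query_term not in vocab:
        return []
    sims = term_vecs @ term_vecs[vocab[query_term]]  # unnormalized similarity
    terms = vec.get_feature_names_out()
    top = np.argsort(-sims)[:k]
    return [(terms[i], float(sims[i])) for i in top]
```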

  33. Synonyms from Local LSI • (Figure: searchonyms of a query term in the local LSI concept space, shown with the term's P(t | Rq))

  34. Synonyms from Local LSI • P(t | Rq) • (1) Magnitude of self similarity – term centrality • (2) Avg similarity of supporting terms – concept centrality • (3) How likely synonyms replace term t in the collection

  35. Features that Model the Factors (correlation with P(t | R) in parentheses) • idf (–0.1339) • Term centrality (0.3719): self-similarity, i.e. vector length of t, after dimension reduction • Concept centrality (0.3758): avg similarity of supporting terms (top synonyms) • Replaceability (–0.1872): how frequently synonyms appear in place of the original query term in collection documents • Abstractness (–0.1278): users modify abstract terms with concrete terms, e.g. “effects on the US educational program”, “prognosis of a political third party”

  36. Prediction Model • Regression modeling • Model: M: <f1, f2, .., f5> → P(t | R) • Train on one set of queries (known relevance), • test on another set of queries (unknown relevance) • RBF kernel Support Vector Regression
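
A minimal sketch of this regression step with scikit-learn; the feature values and hyperparameters below are made-up placeholders, not numbers from the thesis:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# One row per training query term:
# [idf, term_centrality, concept_centrality, replaceability, abstractness]
train_features = np.array([
    [2.1, 0.62, 0.55, 0.10, 0.05],   # hypothetical values, e.g. "party"
    [5.7, 0.18, 0.21, 0.45, 0.30],   # e.g. "viability"
    [3.3, 0.40, 0.48, 0.22, 0.12],   # e.g. "political"
])
train_recall = np.array([0.98, 0.20, 0.72])  # P(t | R) from judged training queries

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.05))
model.fit(train_features, train_recall)

# Predict term recall for the terms of a new, unjudged query; clip into [0, 1].
test_features = np.array([[4.0, 0.35, 0.40, 0.25, 0.15]])
print(np.clip(model.predict(test_features), 0.0, 1.0))
```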

  37. Experiments • Term recall prediction error • L1 loss (absolute prediction error) • Term recall based term weighting retrieval • Mean Average Precision (overall retrieval success) • Precision at top 10 (precision at top of rank list)
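
These measurements reduce to standard definitions; a small sketch of generic implementations, not the evaluation scripts used in the experiments:

```python
def l1_loss(predicted, true):
    """Mean absolute error between predicted and true term recall values."""
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(true)

def average_precision(ranked_docs, relevant):
    """AP for one query; MAP averages this over all queries."""
    hits, prec_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            prec_sum += hits / rank
    return prec_sum / len(relevant) if relevant else 0.0

def precision_at_k(ranked_docs, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranked_docs[:k] if doc in relevant) / k
```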

  38. Term Recall Prediction Example • Query: prognosis/viability of a political third party (trained on TREC 3) • (Chart: predicted term recall for each query term, annotated with the resulting emphasis)

  39. Term Recall Prediction Error • L1 loss: the lower, the better • (Chart: L1 prediction loss across test collections)

  40. Main Questions • Definition • Significance • Mechanism • Model and solution • Can be predicted, Framework to design and evaluate features

  41. A General View of Retrieval Modeling as Transfer Learning • The traditional restricted view sees a retrieval model as • a document classifier for a given query. • More general view: A retrieval model really is • a meta-classifier, responsible for many queries, • mapping a query to a document classifier. • Learning a retrieval model == transfer learning • Using knowledge from related tasks (training queries) to classify documents for a new task (test query) • Our features and model facilitate the transfer. • More general view → more principled investigations and more advanced techniques

  42. Using P(t | R) in Retrieval Models • In BM25 • Binary Independence Model • In Language Modeling (LM) • Relevance Model [Lavrenko and Croft 2001] • Only term weighting, no expansion.
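
The formulas on this slide are images; a hedged reconstruction of one standard way P(t | R) enters each model as a query term weight (not necessarily the exact variants used in this work):

```latex
% BM25 with relevance information: the RSJ weight, which contains P(t|R),
% takes the place of the plain idf factor.
\[
\mathrm{BM25}(d,q) = \sum_{t \in q}
\frac{(k_1 + 1)\,\mathit{tf}_{t,d}}
     {k_1\left((1-b) + b\,\tfrac{|d|}{\mathrm{avgdl}}\right) + \mathit{tf}_{t,d}}
\cdot
\log \frac{P(t \mid R)\,\bigl(1 - P(t \mid \bar{R})\bigr)}
          {\bigl(1 - P(t \mid R)\bigr)\,P(t \mid \bar{R})}
\]
% Language modeling: weight each query term by P(t|R), in the spirit of the
% relevance model of Lavrenko and Croft (2001); no expansion terms are added.
\[
\mathrm{score}(d, q) \;\propto\; \sum_{t \in q} P(t \mid R)\,\log P(t \mid d)
\]
```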

  43. Predicted Recall Weighting: 10-25% gain (MAP) • (Table: recall-weighted LM vs. baseline LM on desc queries; datasets listed as train -> test; “*”: significantly better by sign & randomization tests)

  44. Predicted Recall Weighting: 10-20% gain (top precision) • (Table: recall-weighted LM vs. baseline LM on desc queries; datasets listed as train -> test; “*”: Prec@10 is significantly better; “!”: Prec@20 is significantly better)

  45. Main Questions • Definition • Significance • Mechanism • Model and solution • Term weighting solves emphasis problem for long queries • Mismatch problem?

  46. Failure Analysis of 44 Topics from TREC 6-8 • RIA workshop 2003 (7 top research IR systems, >56 expert*weeks) • (Chart of failure causes: mismatch accounts for 27%; recall term weighting and mismatch-guided expansion, both based on term mismatch prediction, target these failures)

  47. Recap: Term Mismatch • Term mismatch ranges 30%-50% on average • Relevance matching can degrade quickly for multi-word queries • Solution: Fix every query term

  48. Conjunctive Normal Form (CNF) Expansion • E.g. Keyword query: placement of cigarette signs on television watched by children → Manual CNF: (placement OR place OR promotion OR logo OR sign OR signage OR merchandise) AND (cigarette OR cigar OR tobacco) AND (television OR TV OR cable OR network) AND (watch OR view) AND (children OR teen OR juvenile OR kid OR adolescent) • Expressive & compact (1 CNF == 100s of alternatives) • Used by lawyers, librarians and other expert searchers • But tedious to create
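
As an illustration of the format (a toy helper, not the tooling expert searchers actually use), a CNF query string can be assembled from one synonym list per concept:

```python
def to_cnf(concepts):
    """Join each concept's synonyms with OR, then join the clauses with AND."""
    return " AND ".join("(" + " OR ".join(syns) + ")" for syns in concepts)

print(to_cnf([
    ["placement", "place", "promotion", "logo", "sign", "signage", "merchandise"],
    ["cigarette", "cigar", "tobacco"],
    ["television", "TV", "cable", "network"],
    ["watch", "view"],
    ["children", "teen", "juvenile", "kid", "adolescent"],
]))
# (placement OR place OR ...) AND (cigarette OR cigar OR tobacco) AND ...
```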

  49. Diagnostic Intervention • Diagnose term mismatch • Terms that need help: placement of cigarette signs on television watched by children • Guided expansion intervention: (placement OR place OR promotion OR logo OR sign OR signage OR merchandise) AND cigar AND television AND watch AND (children OR teen OR juvenile OR kid OR adolescent) • Goal • Least amount of user effort → near optimal performance • E.g. expand 2 terms → 90% improvement
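
The diagnosis step can be as simple as ranking query terms by predicted term recall and expanding only the worst few; a sketch with hypothetical predictions:

```python
def terms_to_expand(predicted_recall, budget=2):
    """Return the `budget` query terms with the lowest predicted P(t | R)."""
    return sorted(predicted_recall, key=predicted_recall.get)[:budget]

# Hypothetical predictions for the example query.
preds = {"placement": 0.26, "cigarette": 0.87, "television": 0.79,
         "watched": 0.65, "children": 0.38}
print(terms_to_expand(preds))   # ['placement', 'children']
```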

  50. Diagnostic Intervention (We Hope to) • (Diagram: start from the keyword query (child AND cigar); diagnose which term needs more help (child > cigar); suggest searchonyms (child → teen); expand into (child OR teen) AND cigar)
