Lecture 9: Probabilistic Retrieval

Prof. Ray Larson, University of California, Berkeley, School of Information. Principles of Information Retrieval.

Presentation Transcript


  1. Lecture 9: Probabilistic Retrieval • Principles of Information Retrieval • Prof. Ray Larson, University of California, Berkeley, School of Information

  2. Mini-TREC
     • Need to make groups
       • Today: give me a note with group members (names and login names)
     • Systems
       • SMART (not recommended…): ftp://ftp.cs.cornell.edu/pub/smart
       • MG (we have a special version if interested): http://www.mds.rmit.edu.au/mg/welcome.html
       • Cheshire II & 3
         • II = ftp://cheshire.berkeley.edu/pub/cheshire & http://cheshire.berkeley.edu
         • 3 = http://cheshire3.sourceforge.org
       • Zprise (older search system from NIST): http://www.itl.nist.gov/iaui/894.02/works/zp2/zp2.html
       • IRF (new Java-based IR framework from NIST): http://www.itl.nist.gov/iaui/894.02/projects/irf/irf.html
       • Lemur: http://www-2.cs.cmu.edu/~lemur
       • Lucene (Java-based text search engine): http://jakarta.apache.org/lucene/docs/index.html
       • Galago (also Java-based): http://www.galagosearch.org
       • Others? (See http://searchtools.com)

  3. Mini-TREC • Proposed Schedule
     • February 9: Database and previous queries
     • March 2: Report on system acquisition and setup
     • March 9: New queries for testing…
     • April 18: Results due
     • April 20: Results and system rankings
     • April 27: Group reports and discussion

  4. Today • Review • Clustering and Automatic Classification • Probabilistic Models • Probabilistic Indexing (Model 1) • Probabilistic Retrieval (Model 2) • Unified Model (Model 3) • Model 0 and real-world IR • Regression Models • The “Okapi Weighting Formula”

  5. Today • Review • Clustering and Automatic Classification • Probabilistic Models • Probabilistic Indexing (Model 1) • Probabilistic Retrieval (Model 2) • Unified Model (Model 3) • Model 0 and real-world IR • Regression Models • The “Okapi Weighting Formula”

  6. Review: IR Models • Set Theoretic Models • Boolean • Fuzzy • Extended Boolean • Vector Models (Algebraic) • Probabilistic Models (probabilistic)

  7. Similarity Measures • Simple matching (coordination level match) • Dice’s Coefficient • Jaccard’s Coefficient • Cosine Coefficient • Overlap Coefficient
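The formulas for these five measures did not survive in the transcript. As a minimal sketch, assuming documents and queries are represented as sets of terms (the binary case), the standard textbook definitions can be written as below. The function names and the sample term sets are illustrative, not taken from the lecture.

```python
import math

def simple_matching(x, y):
    """Coordination level match: number of shared terms."""
    return len(x & y)

def dice(x, y):
    """Dice's coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    """Jaccard's coefficient: |X ∩ Y| / |X ∪ Y|."""
    return len(x & y) / len(x | y)

def cosine(x, y):
    """Cosine coefficient for binary vectors: |X ∩ Y| / sqrt(|X| * |Y|)."""
    return len(x & y) / math.sqrt(len(x) * len(y))

def overlap(x, y):
    """Overlap coefficient: |X ∩ Y| / min(|X|, |Y|)."""
    return len(x & y) / min(len(x), len(y))

# Hypothetical query and document term sets (both non-empty).
query = {"probabilistic", "retrieval", "model"}
doc = {"retrieval", "model", "vector", "space"}

for measure in (simple_matching, dice, jaccard, cosine, overlap):
    print(measure.__name__, round(measure(query, doc), 3))
```

All five share the same numerator (the overlap count) and differ only in how they normalize for set size, which is one reason they tend to rank documents similarly.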

  8. Documents in Vector Space [Figure: documents D1 to D11 plotted as points in a three-dimensional term space with axes t1, t2 and t3]

  9. Vector Space Visualization

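  10. Vector Space with Term Weights and Cosine Matching
      • Di = (di1, wdi1; di2, wdi2; …; dit, wdit)
      • Q = (qi1, wqi1; qi2, wqi2; …; qit, wqit)
      • Example in a two-term space (Term A, Term B): Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
      [Figure: Q, D1 and D2 drawn as vectors, with Term A on the horizontal axis and Term B on the vertical axis, both running from 0 to 1.0]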
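The example vectors on this slide can be checked numerically. The sketch below computes the cosine of the angle between the query and each document; the helper name `cosine` is an assumption, while the vectors are the ones given on the slide.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two weighted term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector matches nothing
    return dot / (norm_u * norm_v)

# Vectors from the slide, as (Term A weight, Term B weight).
Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

print(round(cosine(Q, D1), 2))  # 0.73
print(round(cosine(Q, D2), 2))  # 0.98
```

D2 points in nearly the same direction as Q, so it ranks above D1 even though D1 is the longer vector.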

  11. Document/Document Matrix

  12. Hierarchical Methods
      Single Link Dissimilarity Matrix:

             1    2    3    4
        2   .4
        3   .4   .2
        4   .3   .3   .3
        5   .1   .4   .4   .1

      Hierarchical methods: polythetic, usually exclusive, ordered. Clusters are order-independent.

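  13. Threshold = .1
      Single Link Dissimilarity Matrix:

             1    2    3    4
        2   .4
        3   .4   .2
        4   .3   .3   .3
        5   .1   .4   .4   .1

      Links at this threshold (1 = dissimilarity at or below the threshold):

             1    2    3    4
        2    0
        3    0    0
        4    0    0    0
        5    1    0    0    1

      [Figure: graph on nodes 1 to 5 with the links from the matrix above]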

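  14. Threshold = .2
      Dissimilarity matrix:

             1    2    3    4
        2   .4
        3   .4   .2
        4   .3   .3   .3
        5   .1   .4   .4   .1

      Links at this threshold (1 = dissimilarity at or below the threshold):

             1    2    3    4
        2    0
        3    0    1
        4    0    0    0
        5    1    0    0    1

      [Figure: graph on nodes 1 to 5 with the links from the matrix above]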

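  15. Threshold = .3
      Dissimilarity matrix:

             1    2    3    4
        2   .4
        3   .4   .2
        4   .3   .3   .3
        5   .1   .4   .4   .1

      Links at this threshold (1 = dissimilarity at or below the threshold):

             1    2    3    4
        2    0
        3    0    1
        4    1    1    1
        5    1    0    0    1

      [Figure: graph on nodes 1 to 5 with the links from the matrix above]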
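The three threshold slides can be reproduced with a short sketch. It assumes that, in single-link clustering, two items fall in the same cluster whenever a chain of links at or below the threshold connects them, so the clusters are the connected components of the threshold graph. The dissimilarities are the ones from the matrix on slides 12 to 15; the function name and the union-find helper are illustrative.

```python
# Lower-triangular dissimilarity matrix from the slides, keyed by (row, column).
DISSIMILARITY = {
    (2, 1): 0.4,
    (3, 1): 0.4, (3, 2): 0.2,
    (4, 1): 0.3, (4, 2): 0.3, (4, 3): 0.3,
    (5, 1): 0.1, (5, 2): 0.4, (5, 3): 0.4, (5, 4): 0.1,
}

def single_link_clusters(dissimilarity, items, threshold):
    """Return the connected components of the graph that links every
    pair whose dissimilarity is at or below the threshold."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers to the component representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in dissimilarity.items():
        if d <= threshold + 1e-9:  # small tolerance for float comparison
            parent[find(a)] = find(b)

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())

for t in (0.1, 0.2, 0.3):
    print(t, single_link_clusters(DISSIMILARITY, range(1, 6), t))
# 0.1 [[1, 4, 5], [2], [3]]
# 0.2 [[1, 4, 5], [2, 3]]
# 0.3 [[1, 2, 3, 4, 5]]
```

Raising the threshold only ever merges clusters, which is what gives the method its hierarchy, and the result does not depend on the order in which pairs are examined.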

  16. K-means & Rocchio Clustering
      • Agglomerative methods: polythetic, exclusive or overlapping, unordered. Clusters are order-dependent.
      • Rocchio’s method:
        1. Select initial centers (i.e., seed the space)
        2. Assign docs to highest matching centers and compute centroids
        3. Reassign all documents to centroid(s)
      [Figure: documents grouped around cluster centers]
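The three steps above can be sketched as a seeded, centroid-based loop. This is an illustration under stated assumptions rather than Rocchio's exact procedure: matching is done with the cosine coefficient, each document goes to exactly one center (the exclusive case), and the assign-and-recompute step is repeated until assignments stop changing, as in k-means. All names and the sample vectors are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))

def cluster(docs, seeds, max_rounds=10):
    centers = list(seeds)                     # step 1: seed the space
    assignment = None
    for _ in range(max_rounds):
        # Steps 2 and 3: (re)assign each doc to its best-matching center.
        new_assignment = [
            max(range(len(centers)), key=lambda c: cosine(doc, centers[c]))
            for doc in docs
        ]
        if new_assignment == assignment:      # stable, so stop
            break
        assignment = new_assignment
        for c in range(len(centers)):         # recompute centroids
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:
                centers[c] = centroid(members)
    return assignment, centers

docs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.8)]
assignment, centers = cluster(docs, seeds=[(1.0, 0.0), (0.0, 1.0)])
print(assignment)  # [0, 0, 1, 1]
```

Different seeds can produce different clusters, which is the order-dependence the slide mentions.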

  17. Clustering • Advantages: • See some main themes • Disadvantage: • Many ways documents could group together are hidden • Thinking point: what is the relationship to classification systems and facets?

  18. Automatic Class Assignment
      1. Create pseudo-documents representing intellectually derived classes.
      2. Search using document contents.
      3. Obtain ranked list.
      4. Assign document to N categories ranked over threshold, OR assign to top-ranked category.
      • Automatic class assignment: polythetic, exclusive or overlapping, usually ordered. Clusters are order-independent, usually based on an intellectually derived scheme.
      [Figure: documents submitted to a search engine and sorted into classes]

  19. Automatic Categorization in Cheshire II • Cheshire supports a method we call “classification clustering” that relies on having a set of records that have been indexed using some controlled vocabulary. • Classification clustering has the following steps…

  20. Start with a collection of documents.

  21. Classify and index with a controlled vocabulary. Ideally, use a database that is already indexed.

  22. Problem: Controlled vocabularies can be difficult for people to use, e.g. “pass mtr veh spark ign eng”

  23. Solution: Entry Level Vocabulary Indexes (EVIs) • “pass mtr veh spark ign eng” = “Automobile”

  24. EVI example • User query: “Automobile” • EVI 1 maps it to the index term “pass mtr veh spark ign eng” • EVI 2 maps it to the index term “automobiles” OR “internal combustion engines”

  25. But why stop there? [Figure: an index with its EVI]

  26. “Which EVI do I use?” [Figure: several indexes, each with its own EVI]

  27. EVI to EVIs [Figure: a second-level EVI (EVI2) sitting above the EVIs of several indexes]

  28. Why not treat language the same way? • Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil

  29. Find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil • Digital library resources • Statistical association

  30. Cheshire II - Two-Stage Retrieval • Using the LC Classification System • Pseudo-Document created for each LC class containing terms derived from “content-rich” portions of documents in that class (e.g., subject headings, titles, etc.) • Permits searching by any term in the class • Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first. • User selects classes to feed back for the “second stage” search of documents. • Can be used with any classified/Indexed collection.

  31. Cheshire EVI Demo

  32. Problems with Vector Space • There is no real theoretical basis for the assumption of a term space • it is more for visualization than having any real basis • most similarity measures work about the same regardless of model • Terms are not really orthogonal dimensions • Terms are not independent of all other terms

  33. Today • Review • Clustering and Automatic Classification • Probabilistic Models • Probabilistic Indexing (Model 1) • Probabilistic Retrieval (Model 2) • Unified Model (Model 3) • Model 0 and real-world IR • Regression Models • The “Okapi Weighting Formula”

  34. Probabilistic Models • Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query • Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle) • Relies on accurate estimates of probabilities

  35. Probability Ranking Principle • “If a reference retrieval system’s response to each request is a ranking of the documents in the collection in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.” (Stephen E. Robertson, J. Documentation, 1977)

  36. Model 1 – Maron and Kuhns • Concerned with estimating probabilities of relevance at the point of indexing: • If a patron came with a request using term ti, what is the probability that she/he would be satisfied with document Dj?

  37. Probability theory (detour) • To get to the Bayesian statistical inference used in both Models 1 and 2…

  38. Probability Theory • Bayes’ Rule (the basis of Bayesian inference) says: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$

  39. Bayes’ theorem • For example, with A = disease and B = symptom, p(A | B) is the probability of the disease given that the symptom has been observed.

  40. Bayes’ Theorem: Application • Toss a fair coin. If it lands heads up, draw a ball from box 1; otherwise, draw a ball from box 2. If the ball is blue, what is the probability that it was drawn from box 2?
      • Box 1: p(box1) = .5, P(red ball | box1) = .4, P(blue ball | box1) = .6
      • Box 2: p(box2) = .5, P(red ball | box2) = .5, P(blue ball | box2) = .5
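The worked answer is not preserved in the transcript. The following sketch applies Bayes' theorem to the numbers on the slide: the prior for each box is multiplied by the likelihood of a blue ball, then normalized by the total probability of drawing blue. The function and dictionary names are illustrative.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over a set of hypotheses:
    P(h | e) = P(e | h) P(h) / sum over h' of P(e | h') P(h')."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # P(e), here P(blue)
    return {h: joint[h] / evidence for h in joint}

priors = {"box1": 0.5, "box2": 0.5}          # fair coin
blue_given_box = {"box1": 0.6, "box2": 0.5}  # P(blue ball | box)

result = posterior(priors, blue_given_box)
print(round(result["box2"], 4))  # 0.4545
```

By hand: P(box2 | blue) = (.5 × .5) / (.5 × .6 + .5 × .5) = .25 / .55 ≈ .45, so seeing a blue ball makes box 2 slightly less likely than its prior of .5.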

  41. Bayes’ Theorem: Application in IR Goal: want to estimate the probability that a document D is relevant to a given query. It is often useful to estimate log odds of probability of relevance
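The equations for this slide are missing from the transcript. A sketch of the standard derivation follows, using $R$ for "relevant" and $\bar{R}$ for "not relevant"; these symbols are assumptions, since the slide's own notation is not preserved.

```latex
% Bayes' theorem applied to relevance, for a document D and a fixed query:
P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)}
\qquad
P(\bar{R} \mid D) = \frac{P(D \mid \bar{R})\, P(\bar{R})}{P(D)}

% Dividing the two cancels P(D); taking logs gives the log odds of relevance:
\log O(R \mid D)
  = \log \frac{P(R \mid D)}{P(\bar{R} \mid D)}
  = \log \frac{P(D \mid R)}{P(D \mid \bar{R})} + \log \frac{P(R)}{P(\bar{R})}
```

The log odds are convenient because $P(D)$ drops out and the prior term is the same for every document, so ranking depends only on the likelihood ratio.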

  42. Bayes’ Theorem: Application in IR • If documents are represented by binary vectors, the log odds of relevance lead to Robertson & Sparck Jones term weighting
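The derivation on this slide did not survive extraction. The usual binary independence argument is sketched below; the symbols $x_i$, $p_i$ and $q_i$ are assumptions introduced for illustration, and the step relies on assuming that terms occur independently within the relevant and the non-relevant sets.

```latex
% Document as a binary vector: D = (x_1, \dots, x_t), x_i \in \{0, 1\}
% p_i = P(x_i = 1 \mid R), \quad q_i = P(x_i = 1 \mid \bar{R})

P(D \mid R) = \prod_{i=1}^{t} p_i^{x_i} (1 - p_i)^{1 - x_i}
\qquad
P(D \mid \bar{R}) = \prod_{i=1}^{t} q_i^{x_i} (1 - q_i)^{1 - x_i}

\log \frac{P(D \mid R)}{P(D \mid \bar{R})}
  = \sum_{i=1}^{t} x_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
  + \sum_{i=1}^{t} \log \frac{1 - p_i}{1 - q_i}
```

The second sum is the same for every document, so only the first sum affects ranking: each term present in the document contributes the weight $\log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}$.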

  43. Bayes’ Theorem: Application in IR

  44. Model 1 • A patron submits a query (call it Q) consisting of some specification of her/his information need. Different patrons submitting the same stated query may differ as to whether or not they judge a specific document to be relevant. The function of the retrieval system is to compute for each individual document the probability that it will be judged relevant by a patron who has submitted query Q. Robertson, Maron & Cooper, 1982

  45. Model 1 Bayes • A is the class of events of using the system • Di is the class of events of Document i being judged relevant • Ij is the class of queries consisting of the single term Ij • P(Di|A,Ij) = probability that if a query is submitted to the system then a relevant document is retrieved
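The formula that accompanied these definitions is not in the transcript. Using the slide's own symbols, Bayes' rule conditioned on $A$ gives the identity below; whether the slide presented it in exactly this form is an assumption.

```latex
P(D_i \mid A, I_j) = \frac{P(D_i \mid A)\; P(I_j \mid A, D_i)}{P(I_j \mid A)}
```

Read this way, $P(D_i \mid A)$ is a prior for document $i$ being judged relevant, and $P(I_j \mid A, D_i)$ is the probability that a patron who would judge document $i$ relevant uses term $I_j$ in the query, which is the quantity an indexer is asked to estimate in Model 1.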

  46. Model 2 • Documents have many different properties; some documents have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties. Robertson, Maron & Cooper, 1982

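  47. Model 2 – Robertson & Sparck Jones • Given a term t and a query q:

                                 Document relevance
      Document indexing       +          -                Total
      +                       r          n - r            n
      -                       R - r      N - n - R + r    N - n
      Total                   R          N - R            N

      Here N is the number of documents in the collection, R the number relevant to q, n the number indexed with t, and r the number that are both relevant and indexed with t.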

  48. Robertson-Sparck Jones Weights • Retrospective formulation: $w = \log \frac{r / (R - r)}{(n - r) / (N - n - R + r)}$

  49. Robertson-Sparck Jones Weights • Predictive formulation: $w = \log \frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)}$
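Both formulations can be computed from the four counts in the contingency table on slide 47. The sketch below is a minimal illustration: the function name, the `k` parameter and the sample counts are assumptions. Setting `k=0` gives the retrospective weight (the log of the odds ratio of the table), and `k=0.5` gives the predictive weight, where 0.5 is added to every cell so that the estimate stays finite when a cell is zero.

```python
import math

def rsj_weight(r, n, R, N, k=0.5):
    """Robertson-Sparck Jones relevance weight for one term.

    r: relevant documents containing the term
    n: documents containing the term
    R: relevant documents
    N: documents in the collection
    k: smoothing constant (0 = retrospective, 0.5 = predictive)
    """
    relevant_odds = (r + k) / (R - r + k)
    nonrelevant_odds = (n - r + k) / (N - n - R + r + k)
    return math.log(relevant_odds / nonrelevant_odds)

# Hypothetical counts: the term occurs in 8 of 10 relevant documents
# and in 20 of 100 documents overall.
print(round(rsj_weight(8, 20, 10, 100, k=0), 3))    # 3.258
print(round(rsj_weight(8, 20, 10, 100, k=0.5), 3))  # 3.061
```

The retrospective form is undefined when any cell of the table is zero (for example, when every relevant document contains the term), which is the practical reason for the 0.5 correction.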

  50. Probabilistic Models: Some Unifying Notation • D = All present and future documents • Q = All present and future queries • (Di,Qj) = A document query pair • x = class of similar documents, • y = class of similar queries, • Relevance is a relation:
