This lecture provides an overview of information retrieval (IR) as it relates to database management systems (DBMS). It covers the history and evolution of IR, comparing classical database queries to text searches. Key concepts include indexing methodologies, precision and recall metrics, query models like Boolean and Vector Space, and the probabilistic ranking principle. Students are encouraged to prepare with lecture notes, engage in discussions, and understand the relationship between data models, query semantics, and retrieval functionalities.
CS511 Design of Database Management Systems Lecture 13: Information Retrieval: Overview Kevin C. Chang
Announcements • Midterm format: • Wednesday 2:00-3:15pm • open notes, papers, books; calculators OK (won't be needed); no PDAs • 75 points (for 75 minutes) • 4 problems • Prob. 1: True/False questions • Prob. 2-4: longer problems • Preparation: • study lecture notes, HW, SGP -- use them to review the papers • ask why we ask that... • discuss with peers • think more (beyond what is stated) and try to relate issues
Some History • Early Days -- • 1945: V. Bush's article "As We May Think" • 1957: H. P. Luhn's idea of word counting and matching • Indexing & Evaluation Methodology (1960s) • SMART system (G. Salton's group) • Cranfield test collection (C. Cleverdon's group) • indexing: automatic can be as good as manual • IR Models (1970s & 1980s) … • Large-scale Evaluation & Applications (1990s) • TREC (D. Harman & E. Voorhees, NIST) • large-scale Web search
?? Text Search vs. Database Queries • Two related areas: • information retrieval (IR) • databases • traditionally separate -- brought together by the Web • ?? Any differences in • data models? • query semantics? • desirable functionalities?
Text vs. Rel. DB: Art vs. Algebra • Data models: • unstructured text vs. well-structured data • Query semantics: • fuzzy vs. well-defined • text search: to satisfy an “information need” <-- art • DB queries: to perform data computation <-- algebra • relevant vs. correct answers • ranked vs. Boolean answers • Functionalities: • read-mostly vs. read-write/transactions/concurrency control ...
Recall: Measuring False-Negatives • Recall = |x| / |relevant|, where x = relevant ∩ retrieved • e.g.: relevant = {D1, D2}, retrieved = {D1, D3, D4} • recall R = 1 / 2 = 0.5 • there is 1 false negative: D2 • ? How to fool recall? • [Venn diagram: x = relevant ∩ retrieved, inside the whole collection]
Precision: Measuring False-Positives • Precision = |x| / |retrieved|, where x = relevant ∩ retrieved • e.g.: relevant = {D1, D2}, retrieved = {D1, D3, D4} • precision P = 1 / 3 = 0.33 • there are 2 false positives: D3 and D4 • [Venn diagram: x = relevant ∩ retrieved, inside the whole collection]
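A minimal sketch of both metrics, using the example sets from these two slides (document names are illustrative):

```python
# Minimal sketch: recall and precision for the example on the slides.
relevant = {"D1", "D2"}
retrieved = {"D1", "D3", "D4"}

hits = relevant & retrieved               # x = relevant ∩ retrieved
recall = len(hits) / len(relevant)        # 1/2 = 0.5  (D2 is the false negative)
precision = len(hits) / len(retrieved)    # 1/3 ≈ 0.33 (D3, D4 are false positives)

print(f"recall={recall:.2f}, precision={precision:.2f}")
```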
Models • Boolean: criteria-based • Vector space: similarity-based • Probabilistic: probability-based
Boolean Model • Query: • Q1: data AND web • Q2: (knowledge OR information) AND base • Q3: data NOT info • Documents: • D1: “web data and web queries” • D2: “digital data index” • D3: “data base for dummies”
Boolean Model • View: Satisfaction by criteria • Query: a Boolean expression • Q1: data AND web • Document: a Boolean conjunction • D1: “web data and web queries” = web AND data AND queries • Query results: • {D | D implies Q}, i.e., all docs that satisfy Q
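As a minimal sketch of this view (whitespace tokenization and the variable names are simplifications, not from the slides), a conjunctive query matches exactly the documents whose term sets contain every query term:

```python
# Minimal sketch of Boolean (conjunctive) matching over the slide's documents.
docs = {
    "D1": "web data and web queries",
    "D2": "digital data index",
    "D3": "data base for dummies",
}
doc_terms = {d: set(text.split()) for d, text in docs.items()}

def matches_and(query_terms, terms):
    # Q1: data AND web  ->  every query term must appear in the document
    return all(t in terms for t in query_terms)

print([d for d, terms in doc_terms.items() if matches_and({"data", "web"}, terms)])
# only D1 implies "data AND web"
```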
Boolean Queries: Problems • Query matching is “exact” and not flexible • exact matching can result in too few or too many matches • Hard to formulate the right query • what is the query for “documents about color printer”? • Results are not ranked/ordered for exploration • Boolean is binary: yes or no • In short: “relevance” is not captured • traditional DB queries are similarly bad at “fuzzy” concepts • new research work on top-k queries
Vector Space Model • View: Similarity of content • Intuitions: • docs consist of words --> put docs in the word space • space: n dimensions for n words • similarity becomes geometric comparison • document-query similarity = vector-vector similarity • [figure: document vector D and query vector Q in the term space]
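A minimal sketch of this geometric view, using raw term counts and cosine similarity (the toy document and query are illustrative; real systems weight terms, e.g., with TF-IDF):

```python
import math

# Put a piece of text into the word space as a term-count vector.
def to_vector(text):
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

# Document-query similarity = vector-vector similarity (cosine of the angle).
def cosine(v1, v2):
    dot = sum(v1[w] * v2.get(w, 0) for w in v1)
    norm = math.sqrt(sum(x * x for x in v1.values())) * math.sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0

d = to_vector("web data and web queries")
q = to_vector("web data")
print(cosine(d, q))
```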
Probabilistic Models • View: Probability of relevance • the “probabilistic ranking principle” • Estimate and rank by P(R | Q, D) • or by the log-odds: log [ P(R=1 | Q, D) / P(R=0 | Q, D) ]
Probabilistic Models • To rank by the log-odds, i.e. (see next page), by the sum over query terms t_i that appear in D of log [ p_i (1 - q_i) / (q_i (1 - p_i)) ] • p_i: prob. that t_i appears in a relevant doc; q_i: prob. that it appears in an irrelevant doc • Assume p_i the same for all query terms • Assume q_i = n_i / N • N is the collection size; i.e., “all” docs are treated as “irrelevant” • Similar to using “IDF” • intuition: e.g., “apple computer” in a computer DB
Probabilistic Models • To rank by (under the above assumptions) the sum over query terms t_i that appear in D of log [ (N - n_i) / n_i ], up to a constant • i.e., each matched query term contributes an IDF-like weight
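A minimal sketch under the slide's assumptions (p_i constant across query terms, q_i = n_i/N); the collection statistics below are made up for illustration:

```python
import math

# Each query term that appears in the document contributes roughly
# log((N - n_i) / n_i), an IDF-like weight; the constant from p_i is dropped
# since it does not change the ranking.
def prob_score(query_terms, doc_terms, df, N):
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            n_i = df.get(t, 0)
            if 0 < n_i < N:
                score += math.log((N - n_i) / n_i)
    return score

# Toy statistics: in a computer DB "computer" is very common, "apple" is rare,
# so "apple" dominates the score for the query "apple computer".
N = 1000
df = {"apple": 20, "computer": 900}
print(prob_score({"apple", "computer"}, {"apple", "computer", "store"}, df, N))
```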
System Architecture • [diagram: User --> query --> Query Rep; docs --> INDEXING --> Doc Rep; Query Rep + Doc Rep --> Ranking --> results --> User; User's judgments --> Feedback --> back into the query]
Technique: Term Selection/Weighting • Basis for matching query with document • Query and document should be represented using the same units/terms • Controlled vocabulary vs. full text indexing
What is a good indexing term? • Specific (phrases) or general (single words)? • Luhn found that words with middle frequency are most useful • not too specific • not too general • All words or a (controlled) subset? • when term weighting is used, it becomes a matter of weighting rather than selecting indexing terms • more later
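A minimal sketch of the middle-frequency heuristic (the thresholds and toy term sets are arbitrary illustrations, not from the slides):

```python
from collections import Counter

# Keep terms whose document frequency is neither too low (too specific)
# nor too high (too general).
def select_index_terms(doc_term_sets, min_df=2, max_df_ratio=0.6):
    df = Counter(t for terms in doc_term_sets for t in set(terms))
    n_docs = len(doc_term_sets)
    return {t for t, n in df.items() if n >= min_df and n / n_docs <= max_df_ratio}

docs = [{"data", "web"}, {"data", "index"}, {"data", "base"}, {"web", "search"}]
print(select_index_terms(docs))   # "data" is too common, singleton terms too rare
```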
Technique: Stemming • Words with similar meanings should be mapped to the same indexing term • Stemming: mapping all inflectional forms of words to the same root form, e.g. • computer -> compute • computation -> compute • computing -> compute • Porter’s Stemmer is popular for English • In general: clustering of “synonym” words
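A minimal sketch using NLTK's implementation of the Porter stemmer (assumes the nltk package is installed; the actual root string it produces may differ slightly from the slide's "compute", typically "comput"):

```python
from nltk.stem import PorterStemmer

# Map inflectional/derivational variants to a common root form.
stemmer = PorterStemmer()
for word in ["computer", "computation", "computing"]:
    print(word, "->", stemmer.stem(word))
```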
Technique: Stopwords • A “common word” that bears little semantic content • prepositions: for, on, … • articles: a, an, the • non-informative words (collection-specific) • e.g., “database” in this class • e.g., “PC” in a computer collection • You can search the Web for stopword lists
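A minimal sketch of stopword removal (the list below is a tiny illustrative subset; real systems use longer, often collection-specific, lists):

```python
# Drop common words that carry little semantic content before indexing.
STOPWORDS = {"a", "an", "the", "for", "on", "and", "of"}

def remove_stopwords(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(remove_stopwords("a data base for the dummies"))   # ['data', 'base', 'dummies']
```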
Technique: Relevance Feedback (or Query Modification) • Motivation: easier to judge results than to formulate queries right • [diagram: Query --> Retrieval Engine (over the Document collection) --> Results: d1 3.5, d2 2.4, …, dk 0.5, …; the User judges the results (d1 +, d2 -, d3 +, …, dk -, …); Feedback turns these judgments into an Updated query]
Pseudo Feedback • Motivation: top results are often relevant • [diagram: same loop, but without user judgments -- the top 10 results are simply assumed relevant (d1 +, d2 +, d3 +, …, dk -, …) and Feedback uses them to produce the Updated query]
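The slides do not name a specific update formula; one standard choice is a Rocchio-style update, sketched below (vectors are term-weight dicts; the mixing weights are conventional defaults, not from the slides):

```python
# q' = alpha*q + beta*centroid(relevant docs) - gamma*centroid(non-relevant docs)
def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(query) | {t for d in rel_docs + nonrel_docs for t in d}
    new_q = {}
    for t in terms:
        rel = sum(d.get(t, 0) for d in rel_docs) / len(rel_docs) if rel_docs else 0
        non = sum(d.get(t, 0) for d in nonrel_docs) / len(nonrel_docs) if nonrel_docs else 0
        new_q[t] = alpha * query.get(t, 0) + beta * rel - gamma * non
    return new_q

# Relevance feedback uses the user's judged docs; pseudo feedback simply
# treats the top-k retrieved docs as relevant and applies the same update.
q = {"color": 1.0, "printer": 1.0}
print(rocchio(q, [{"color": 2, "printer": 1, "laser": 1}], []))
```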
Technique: Inverted List • t_i --> <d1, …>, …, <dn, …> • E.g.: • color --> <d1, …>, <d2, …>, <d5, …> • printer --> <d2, …>, <d5, …>, <d8, …> • How to evaluate Q: color AND printer? • How to evaluate Q: “color printer”? • what info to maintain in each entry? • More later…
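A minimal sketch of building and querying an inverted list (doc ids and texts are illustrative; real entries also keep positions/frequencies, which are needed for the phrase query "color printer" and for scoring):

```python
from collections import defaultdict

# term -> set of doc ids containing it (a simplified posting list)
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, t1, t2):
    # Q: color AND printer  ->  intersect the two posting lists
    return index.get(t1, set()) & index.get(t2, set())

docs = {"d2": "color printer on sale", "d5": "color laser printer", "d8": "printer ink"}
index = build_index(docs)
print(sorted(boolean_and(index, "color", "printer")))   # ['d2', 'd5']
```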
DB Meets IR • Multimedia databases • relational data + text, images, audio, video… • Fuzzy retrieval for relational data • similarity, preference-based queries • e.g., product search in e-commerce • XML represents text-based data • IR-style search will be helpful • how can we extend it to retrieve XML documents?
?? Web Search • Text IR as a natural starting point: • the Web as a collection of HTML documents • find pages that satisfy an information need • Web search as the killer app of IR! • Web search vs. traditional document search • ?? how are they related? • ?? any differences or new issues? • ?? why do search engines give lousy results?
Web Search: New Issues and Challenges • Highly topic-heterogeneous documents • the notion of a “collection” is lost • stopwords and the IDF scheme for term selection/weighting are challenged • Structured/semi-structured documents • Highly linked pages: the collection is no longer flat • how to use links cleverly -- link analysis (more in TW2) • ideas from social networks for “standing” or “importance” • Extremely large scale: billions of docs and counting • Many documents/data hidden behind databases • Multi-lingual documents • Spamming
What’s Next • Vector space model
End of Talk