This lecture provides an overview of information retrieval (IR) as it relates to database management systems (DBMS). It covers the history and evolution of IR, comparing classical database queries to text searches. Key concepts include indexing methodologies, precision and recall metrics, query models like Boolean and Vector Space, and the probabilistic ranking principle. Students are encouraged to prepare with lecture notes, engage in discussions, and understand the relationship between data models, query semantics, and retrieval functionalities.
CS511 Design of Database Management Systems Lecture 13: Information Retrieval: Overview Kevin C. Chang
Announcements • Midterm format: • Wednesday 2:00-3:15pm • open notes, papers, books; calculators OK (won't be needed); no PDAs • 75 points (for 75 minutes) • 4 problems • Prob. 1: True/False questions • Prob. 2-4: longer problems • Preparation: • study lecture notes, HW, SGP -- use them to review the papers • ask why we ask that... • discuss with peers • think more (beyond what is stated) and try to relate issues
Some History • Early Days -- • 1945: V. Bush's article "As We May Think" • 1957: H. P. Luhn's idea of word counting and matching • Indexing & Evaluation Methodology (1960s) • SMART system (G. Salton's group) • Cranfield test collection (C. Cleverdon's group) • indexing: automatic can be as good as manual • IR Models (1970s & 1980s) … • Large-scale Evaluation & Applications (1990s) • TREC (D. Harman & E. Voorhees, NIST) • large-scale Web search
?? Text Search vs. Database Queries • Two related areas: • information retrieval (IR) • databases • traditionally separate -- brought together by the Web • ?? Any differences in • data models? • query semantics? • desirable functionalities?
Text vs. Rel. DB: Art vs. Algebra • Data models: • unstructured text vs. well-structured data • Query semantics: • fuzzy vs. well-defined • text search: to satisfy an “information need” <-- art • DB queries: to perform data computation <-- algebra • relevant vs. correct answers • ranked vs. Boolean answers • Functionalities: • read-mostly vs. read-write/transactions/concurrency control ...
Recall: Measuring False-Negatives • Recall = |x| / |relevant|, where x = relevant ∩ retrieved • e.g.: relevant = {D1, D2}, retrieved = {D1, D3, D4} • recall R = 1 / 2 = 0.5 • there is 1 false negative: D2 • ? How to fool recall? • [Venn diagram: x = relevant ∩ retrieved, inside the whole collection]
Precision: Measuring False-Positives • Precision = |x| / |retrieved|, where x = relevant ∩ retrieved • e.g.: relevant = {D1, D2}, retrieved = {D1, D3, D4} • precision P = 1 / 3 = 0.33 • there are 2 false positives: D3 and D4 • [Venn diagram: x = relevant ∩ retrieved, inside the whole collection]
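A minimal sketch of both metrics, using the example sets from these two slides (document names are illustrative):

```python
# Minimal sketch: recall and precision for the example on the slides.
relevant = {"D1", "D2"}
retrieved = {"D1", "D3", "D4"}

hits = relevant & retrieved               # x = relevant ∩ retrieved
recall = len(hits) / len(relevant)        # 1/2 = 0.5  (D2 is the false negative)
precision = len(hits) / len(retrieved)    # 1/3 ≈ 0.33 (D3, D4 are false positives)

print(f"recall={recall:.2f}, precision={precision:.2f}")
```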
Models • Boolean: criteria-based • Vector space: similarity-based • Probabilistic: probability-based
Boolean Model • Query: • Q1: data AND web • Q2: (knowledge OR information) AND base • Q3: data NOT info • Documents: • D1: “web data and web queries” • D2: “digital data index” • D3: “data base for dummies”
Boolean Model • View: Satisfaction by criteria • Query: a Boolean expression • Q1: data AND web • Document: a Boolean conjunction • D1: “web data and web queries” = web AND data AND queries • Query results: • {D | D implies Q}, i.e., all docs that satisfy Q
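As a minimal sketch of this view (whitespace tokenization and the variable names are simplifications, not from the slides), a conjunctive query matches exactly the documents whose term sets contain every query term:

```python
# Minimal sketch of Boolean (conjunctive) matching over the slide's documents.
docs = {
    "D1": "web data and web queries",
    "D2": "digital data index",
    "D3": "data base for dummies",
}
doc_terms = {d: set(text.split()) for d, text in docs.items()}

def matches_and(query_terms, terms):
    # Q1: data AND web  ->  every query term must appear in the document
    return all(t in terms for t in query_terms)

print([d for d, terms in doc_terms.items() if matches_and({"data", "web"}, terms)])
# only D1 implies "data AND web"
```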
Boolean Queries: Problems • Query matching is “exact” and not flexible • exact matching can result in too few or too many matches • Hard to formulate the right query • what is the query for “documents about color printer”? • Results are not ranked/ordered for exploration • Boolean is binary: yes or no • In short: “relevance” is not captured • traditional DB queries are similarly bad at “fuzzy” concepts • new research work on top-k queries
Vector Space Model • View: Similarity of content • Intuitions: • docs consist of words --> put docs in the word space • space: n dimensions for n words • similarity becomes geometric comparison • document-query similarity = vector-vector similarity • [figure: document vector D and query vector Q in the term space]
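A minimal sketch of this geometric view, using raw term counts and cosine similarity (the toy document and query are illustrative; real systems weight terms, e.g., with TF-IDF):

```python
import math

# Put a piece of text into the word space as a term-count vector.
def to_vector(text):
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

# Document-query similarity = vector-vector similarity (cosine of the angle).
def cosine(v1, v2):
    dot = sum(v1[w] * v2.get(w, 0) for w in v1)
    norm = math.sqrt(sum(x * x for x in v1.values())) * math.sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0

d = to_vector("web data and web queries")
q = to_vector("web data")
print(cosine(d, q))
```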
Probabilistic Models • View: Probability of relevance • the “probabilistic ranking principle” • Estimate and rank by P(R | Q, D) • or by the log-odds: log [ P(R=1 | Q, D) / P(R=0 | Q, D) ]
Probabilistic Models • To rank by the log-odds, i.e. (see next page), by the sum over query terms t_i that appear in D of log [ p_i (1 - q_i) / (q_i (1 - p_i)) ] • p_i: prob. that t_i appears in a relevant doc; q_i: prob. that it appears in an irrelevant doc • Assume p_i the same for all query terms • Assume q_i = n_i / N • N is the collection size; i.e., “all” docs are treated as “irrelevant” • Similar to using “IDF” • intuition: e.g., “apple computer” in a computer DB
Probabilistic Models • To rank by (under the above assumptions) the sum over query terms t_i that appear in D of log [ (N - n_i) / n_i ], up to a constant • i.e., each matched query term contributes an IDF-like weight
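A minimal sketch under the slide's assumptions (p_i constant across query terms, q_i = n_i/N); the collection statistics below are made up for illustration:

```python
import math

# Each query term that appears in the document contributes roughly
# log((N - n_i) / n_i), an IDF-like weight; the constant from p_i is dropped
# since it does not change the ranking.
def prob_score(query_terms, doc_terms, df, N):
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            n_i = df.get(t, 0)
            if 0 < n_i < N:
                score += math.log((N - n_i) / n_i)
    return score

# Toy statistics: in a computer DB "computer" is very common, "apple" is rare,
# so "apple" dominates the score for the query "apple computer".
N = 1000
df = {"apple": 20, "computer": 900}
print(prob_score({"apple", "computer"}, {"apple", "computer", "store"}, df, N))
```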
System Architecture • [diagram: User --> query --> Query Rep; docs --> INDEXING --> Doc Rep; Query Rep + Doc Rep --> Ranking --> results --> User; User's judgments --> Feedback --> back into the query]
Technique: Term Selection/Weighting • Basis for matching query with document • Query and document should be represented using the same units/terms • Controlled vocabulary vs. full text indexing
What is a good indexing term? • Specific (phrases) or general (single words)? • Luhn found that words with middle frequency are most useful • not too specific • not too general • All words or a (controlled) subset? • when term weighting is used, it becomes a matter of weighting rather than selecting indexing terms • more later
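A minimal sketch of the middle-frequency heuristic (the thresholds and toy term sets are arbitrary illustrations, not from the slides):

```python
from collections import Counter

# Keep terms whose document frequency is neither too low (too specific)
# nor too high (too general).
def select_index_terms(doc_term_sets, min_df=2, max_df_ratio=0.6):
    df = Counter(t for terms in doc_term_sets for t in set(terms))
    n_docs = len(doc_term_sets)
    return {t for t, n in df.items() if n >= min_df and n / n_docs <= max_df_ratio}

docs = [{"data", "web"}, {"data", "index"}, {"data", "base"}, {"web", "search"}]
print(select_index_terms(docs))   # "data" is too common, singleton terms too rare
```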
Technique: Stemming • Words with similar meanings should be mapped to the same indexing term • Stemming: mapping all inflectional forms of words to the same root form, e.g. • computer -> compute • computation -> compute • computing -> compute • Porter’s Stemmer is popular for English • In general: clustering of “synonym” words
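A minimal sketch using NLTK's implementation of the Porter stemmer (assumes the nltk package is installed; the actual root string it produces may differ slightly from the slide's "compute", typically "comput"):

```python
from nltk.stem import PorterStemmer

# Map inflectional/derivational variants to a common root form.
stemmer = PorterStemmer()
for word in ["computer", "computation", "computing"]:
    print(word, "->", stemmer.stem(word))
```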
Technique: Stopwords • A “common word” that bears little semantic content • prepositions: for, on, … • articles: a, an, the • non-informative words (collection-specific) • e.g., “database” in this class • e.g., “PC” in a computer collection • You can search the Web for stopword lists
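A minimal sketch of stopword removal (the list below is a tiny illustrative subset; real systems use longer, often collection-specific, lists):

```python
# Drop common words that carry little semantic content before indexing.
STOPWORDS = {"a", "an", "the", "for", "on", "and", "of"}

def remove_stopwords(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(remove_stopwords("a data base for the dummies"))   # ['data', 'base', 'dummies']
```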
Technique: Relevance Feedback (or Query Modification) • Motivation: easier to judge results than to formulate queries right • [diagram: Query --> Retrieval Engine (over the Document collection) --> Results: d1 3.5, d2 2.4, …, dk 0.5, …; the User judges the results (d1 +, d2 -, d3 +, …, dk -, …); Feedback turns these judgments into an Updated query]
Pseudo Feedback • Motivation: top results are often relevant • [diagram: same loop, but without user judgments -- the top 10 results are simply assumed relevant (d1 +, d2 +, d3 +, …, dk -, …) and Feedback uses them to produce the Updated query]
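The slides do not name a specific update formula; one standard choice is a Rocchio-style update, sketched below (vectors are term-weight dicts; the mixing weights are conventional defaults, not from the slides):

```python
# q' = alpha*q + beta*centroid(relevant docs) - gamma*centroid(non-relevant docs)
def rocchio(query, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(query) | {t for d in rel_docs + nonrel_docs for t in d}
    new_q = {}
    for t in terms:
        rel = sum(d.get(t, 0) for d in rel_docs) / len(rel_docs) if rel_docs else 0
        non = sum(d.get(t, 0) for d in nonrel_docs) / len(nonrel_docs) if nonrel_docs else 0
        new_q[t] = alpha * query.get(t, 0) + beta * rel - gamma * non
    return new_q

# Relevance feedback uses the user's judged docs; pseudo feedback simply
# treats the top-k retrieved docs as relevant and applies the same update.
q = {"color": 1.0, "printer": 1.0}
print(rocchio(q, [{"color": 2, "printer": 1, "laser": 1}], []))
```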
Technique: Inverted List • t_i --> <d1, …>, …, <dn, …> • E.g.: • color --> <d1, …>, <d2, …>, <d5, …> • printer --> <d2, …>, <d5, …>, <d8, …> • How to evaluate Q: color AND printer? • How to evaluate Q: “color printer”? • what info to maintain in each entry? • More later…
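A minimal sketch of building and querying an inverted list (doc ids and texts are illustrative; real entries also keep positions/frequencies, which are needed for the phrase query "color printer" and for scoring):

```python
from collections import defaultdict

# term -> set of doc ids containing it (a simplified posting list)
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, t1, t2):
    # Q: color AND printer  ->  intersect the two posting lists
    return index.get(t1, set()) & index.get(t2, set())

docs = {"d2": "color printer on sale", "d5": "color laser printer", "d8": "printer ink"}
index = build_index(docs)
print(sorted(boolean_and(index, "color", "printer")))   # ['d2', 'd5']
```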
DB Meets IR • Multimedia databases • relational data + text, images, audio, video… • Fuzzy retrieval for relational data • similarity, preference-based queries • e.g., product search in e-commerce • XML represents text-based data • IR-style search will be helpful • how can we extend it to retrieve XML documents?
?? Web Search • Text IR as a natural starting point: • the Web as a collection of HTML documents • find pages that satisfy an information need • Web search as the killer app of IR! • Web search vs. traditional document search • ?? how are they related? • ?? any differences or new issues? • ?? why do search engines give lousy results?
Web Search: New Issues and Challenges • Highly topic-heterogeneous documents • the notion of a “collection” is lost • stopwords and the IDF scheme for term selection/weighting are challenged • Structured/semi-structured documents • Highly linked pages: the collection is no longer flat • how to use links cleverly -- link analysis (more in TW2) • ideas from social networks for “standing” or “importance” • Extremely large scale: billions of docs and counting • Many documents/data hidden behind databases • Multi-lingual documents • Spamming
What’s Next • Vector space model
End of Talk