CS276A Text Information Retrieval, Mining, and Exploitation

CS276A Text Information Retrieval, Mining, and Exploitation: Lecture 4, 15 Oct 2002

Recap of last time • Index size • Index construction techniques • Dynamic indices • Real world considerations

Back-of-the-envelope index size calculation • Number of docs = n = 40M • Number of terms = m = 1M • Use Zipf's law to estimate the number of postings entries: • n + n/2 + n/3 + … + n/m ≈ n ln m ≈ 560M postings entries • This is just a word-document index, not one that includes positional information
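The estimate above can be checked numerically. This is a minimal sketch, assuming the i-th most frequent term appears in n/i documents; the function name `zipf_postings_estimate` is made up for illustration. It compares the exact harmonic sum with the n ln m shortcut.

```python
import math

def zipf_postings_estimate(n_docs: int, n_terms: int) -> tuple[float, float]:
    """Return (exact sum of n/i for i = 1..m, the n * ln(m) approximation)."""
    exact = sum(n_docs / rank for rank in range(1, n_terms + 1))
    approx = n_docs * math.log(n_terms)
    return exact, approx

exact, approx = zipf_postings_estimate(40_000_000, 1_000_000)
print(f"harmonic sum: {exact / 1e6:.0f}M postings, n ln m: {approx / 1e6:.0f}M postings")
```

The two figures differ by a few percent, which is well within back-of-the-envelope tolerance.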

Merge sort of 56 sorted runs • Merge tree of log₂ 56 ≈ 6 layers. • During each layer, read runs into memory in blocks of 10M, merge, write back. • [Figure: runs 1 to 4 on disk]

Merge sort of 56 sorted runs • How do you write back long merged runs? • Wait to accumulate 10M-sized output blocks before writing back. • Thus amortize seek time over block transfer. • [Figure: runs 1 to 4 on disk]
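The layer count of the merge tree can be illustrated with a small in-memory sketch. It assumes runs are merged two at a time per layer and does not model disk blocks or seek times; the function name `merge_runs_pairwise` and the random test data are illustrative only.

```python
import heapq
import random

def merge_runs_pairwise(runs):
    """Merge sorted runs two at a time, layer by layer; return (merged, layers)."""
    layers = 0
    while len(runs) > 1:
        next_layer = []
        for i in range(0, len(runs), 2):
            pair = runs[i:i + 2]  # the last "pair" may hold a single run
            next_layer.append(list(heapq.merge(*pair)))
        runs = next_layer
        layers += 1
    return (runs[0] if runs else []), layers

random.seed(0)
runs = [sorted(random.sample(range(10_000), 20)) for _ in range(56)]
merged, layers = merge_runs_pairwise(runs)
print(layers, len(merged))
```

With 56 runs the layer sizes go 56, 28, 14, 7, 4, 2, 1, which matches the estimate of about 6 layers.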

Today’s topics • Ranking models • The vector space model • Inverted indexes with term weighting • Evaluation with ranking models

Ranking models in IR • Key idea: • We wish to return, in order, the documents most likely to be useful to the searcher • To do this, we want to know which documents best satisfy a query • An obvious idea is that if a document talks about a topic more, then it is a better match • A query should then just specify terms that are relevant to the information need, without requiring that all of them be present • A document is relevant if it contains many of the terms

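Binary term presence matrices • Record whether a document contains a word: each document is a binary vector in {0,1}^v, where v is the vocabulary size • What we have mainly assumed so far • Idea: Query satisfaction = overlap measure: |X ∩ Y|, the number of terms that query X and document Y have in common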

Overlap matching • What are the problems with the overlap measure? • It doesn’t consider: • Term frequency in document • Term scarcity in collection (document mention frequency) • Length of documents • (And queries: score not normalized)

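Overlap matching • One can normalize in various ways, writing X for the set of query terms and Y for the set of document terms: • Jaccard coefficient: |X ∩ Y| / |X ∪ Y| • Cosine measure: |X ∩ Y| / (√|X| × √|Y|) • What documents would score best using Jaccard against a typical query? • Does the cosine measure fix this problem?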
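The two questions above are easier to reason about with numbers. This is a minimal sketch using the standard set-based definitions of overlap, Jaccard, and binary cosine; the query and the two documents are invented for illustration.

```python
import math

def overlap(x: set, y: set) -> int:
    """Number of terms shared by the two term sets."""
    return len(x & y)

def jaccard(x: set, y: set) -> float:
    """Shared terms divided by all distinct terms in either set."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def cosine_binary(x: set, y: set) -> float:
    """Cosine of the angle between the two binary term vectors."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))

query = {"brutus", "caesar"}
short_doc = {"brutus", "caesar", "rome"}
long_doc = {"brutus", "caesar"} | {f"w{i}" for i in range(98)}  # 100 distinct terms

for name, doc in [("short", short_doc), ("long", long_doc)]:
    print(name, overlap(query, doc), round(jaccard(query, doc), 3),
          round(cosine_binary(query, doc), 3))
```

Both documents contain every query term, so the raw overlap ties them. Jaccard and cosine both favor the short document, though cosine penalizes length less steeply because it divides by a square root.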

Count term-document matrices • We haven’t considered frequency of a word • Count of a word in a document: • Bag of words model • Document is a vector in ℕ^v • Normalization: Calpurnia vs. Calphurnia

Weighting term frequency: tf • What is the relative importance of • 0 vs. 1 occurrence of a term in a doc • 1 vs. 2 occurrences • 2 vs. 3 occurrences … • Unclear: more seems better, but a lot isn’t necessarily much better than a few • Can just use the raw tf count • Another option commonly used in practice:
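One damping that fits the "more is better, but with diminishing returns" intuition is logarithmic scaling. This is a minimal sketch, assuming the widely used form 1 + log(tf) for tf > 0 and 0 otherwise; the natural log and the function name `log_scaled_tf` are choices made for this illustration.

```python
import math

def log_scaled_tf(tf: int) -> float:
    """Sublinear term-frequency weight: 0 for absent terms, 1 + ln(tf) otherwise."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

for tf in (0, 1, 2, 3, 10, 1000):
    print(tf, round(log_scaled_tf(tf), 3))
```

The jump from 0 to 1 occurrence is the largest, and each additional occurrence adds less than the previous one.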

Dot product matching • Match is dot product of query and document • [Note: 0 if orthogonal (no words in common)] • Rank by match • It still doesn’t consider: • Term scarcity in collection (document mention frequency) • Length of documents and queries • Not normalized
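A small sketch of dot product matching over raw term counts shows the weaknesses listed above. It assumes whitespace tokenization and lowercase folding; the three toy documents are invented.

```python
from collections import Counter

def dot_product_match(query: str, doc: str) -> int:
    """Dot product of the raw term-count vectors of query and document."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return sum(count * d[term] for term, count in q.items())

docs = {
    "d1": "hernia repair after hernia surgery",
    "d2": "the cat sat on the mat",
    "d3": "the the the the hernia",
}
query = "the hernia"
ranking = sorted(docs, key=lambda name: dot_product_match(query, docs[name]), reverse=True)
print([(name, dot_product_match(query, docs[name])) for name in ranking])
```

Here d3 wins mainly by repeating "the", and d2 ties with d1 without mentioning hernia at all. That is the term-scarcity problem.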

Weighting should depend on the term overall • Which of these tells you more about a doc? • 10 occurrences of hernia? • 10 occurrences of the? • Suggest looking at collection frequency (cf) • But document frequency (df) may be better:
Word        cf      df
try         10422   8760
insurance   10440   3997
• Document frequency weighting is only possible in a known (static) collection.

tf x idf term weights • tf x idf measure combines: • term frequency (tf) • measure of term density in a doc • inverse document frequency (idf) • measure of informativeness of a term: its rarity across the whole corpus • could just be the reciprocal of the raw count of documents the term occurs in (idf_i = 1/df_i) • but by far the most commonly used version is: idf_i = log(n / df_i), where n is the number of documents • See Kishore Papineni, NAACL 2, 2002 for theoretical justification

Summary: tf x idf (or tf.idf) • Assign a tf.idf weight to each term i in each document d • Increases with the number of occurrences within a doc • Increases with the rarity of the term across the whole corpus • What is the weight of a term that occurs in all of the docs?
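The closing question can be answered by computing a few weights. This is a minimal sketch, assuming raw tf multiplied by idf = ln(n / df), with n the number of documents; the function name `tfidf_weights` and the three toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf_weights(docs: list[list[str]]) -> list[dict[str, float]]:
    """Per-document tf.idf weights using raw tf and idf = ln(n / df)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [
    "the hernia the patient".split(),
    "the insurance claim".split(),
    "the try the try".split(),
]
weights = tfidf_weights(docs)
print(weights[0])
```

Under this form of idf, a term that occurs in every document gets ln(1) = 0, so its weight is zero no matter how often it appears.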

Real-valued term-document matrices • Function (scaling) of count of a word in a document: • Bag of words model • Each document is a vector in ℝ^v • Here log-scaled tf.idf

Documents as vectors • Each doc j can now be viewed as a vector of tf.idf values, one component for each term • So we have a vector space • terms are axes • docs live in this space • even with stemming, may have 20,000+ dimensions • (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live: transposable data)

Why turn docs into vectors? • First application: Query-by-example • Given a doc d, find others “like” it. • Now that d is a vector, find vectors (docs) “near” it.

Intuition • [Figure: documents d1 to d5 drawn as vectors on term axes t1, t2, t3, with angles θ and φ marked] • Postulate: Documents that are “close together” in vector space talk about the same things.

The vector space model • Query as vector: • We regard the query as a short document • We return the documents ranked by the closeness of their vectors to the query, which is also represented as a vector. • Developed in the SMART system (Salton, c.1970) and used as standard by TREC participants and web IR systems
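Treating the query as a short document can be sketched end to end. The example below makes several assumptions: raw tf times ln(n / df) weights, cosine as the closeness measure, whitespace tokenization, and query terms unseen in the collection are dropped. All function names and the toy collection are illustrative.

```python
import math
from collections import Counter

def build_idf(docs: dict[str, str]) -> dict[str, float]:
    """idf = ln(n / df) for every term in the collection."""
    n = len(docs)
    df = Counter(t for text in docs.values() for t in set(text.split()))
    return {t: math.log(n / df[t]) for t in df}

def vectorize(text: str, idf: dict[str, float]) -> dict[str, float]:
    """Sparse tf.idf vector; terms outside the collection vocabulary are ignored."""
    tf = Counter(text.split())
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """Score every document against the query vector, best first."""
    idf = build_idf(docs)
    q = vectorize(query, idf)
    scores = {name: cosine(q, vectorize(text, idf)) for name, text in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "d1": "jaguar speed in the wild",
    "d2": "the new jaguar car review",
    "d3": "the stock market fell",
}
ranking = rank("jaguar car", docs)
print(ranking)
```

No document is required to contain every query term. d1 still receives a small nonzero score for matching "jaguar" alone, while d3 scores zero.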

Desiderata for proximity • If d1 is near d2, then d2 is near d1. • If d1 is near d2, and d2 is near d3, then d1 is not far from d3. • No doc is closer to d than d itself.

First cut • Distance between vectors d1 and d2 is the length of the vector |d1 – d2|. • Euclidean distance • Why is this not a great idea? • We still haven’t dealt with the issue of length normalization • Long documents would be more similar to each other by virtue of length, not topic • However, we can implicitly normalize by looking at angles instead

Cosine similarity • Distance between vectors d1 and d2 is captured by the cosine of the angle θ between them. • Note: this is similarity, not distance • [Figure: vectors d1 and d2 separated by angle θ, plotted on term axes t1, t2, t3]

Cosine similarity • Cosine of angle between two vectors • The denominator involves the lengths of the vectors • So the cosine measure is also known as the normalized inner product
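As a minimal sketch, the normalized inner product can be written as dot(u, v) / (|u| |v|) for dense vectors of equal length. The function name and the example vectors are illustrative. The example also shows that scaling a document, as if it were concatenated with itself, leaves the cosine unchanged while the Euclidean distance grows.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

d = [3.0, 1.0, 0.0]
d_doubled = [2 * x for x in d]   # same term proportions, twice the length
other = [0.0, 0.0, 5.0]          # shares no terms with d

print(cosine_similarity(d, d_doubled), math.dist(d, d_doubled))
print(cosine_similarity(d, other))
```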

Cosine similarity exercises • Exercise: Rank the following by decreasing cosine similarity: • Two docs that have only frequent words (the, a, an, of) in common. • Two docs that have no words in common. • Two docs that have many rare words in common (wingspan, tailfin).

Normalized vectors • A vector can be normalized (given a length of 1) by dividing each of its components by the vector's length • This maps vectors onto the unit sphere (the unit circle in two dimensions), so that every document vector d has |d| = 1 • Then longer documents don’t get more weight • For normalized vectors, the cosine is simply the dot product: cos(d1, d2) = d1 · d2
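A short sketch of length normalization, assuming dense vectors and Euclidean length. The helper names `normalize` and `dot` and the two-dimensional example vectors are illustrative.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Divide each component by the Euclidean length; leave zero vectors unchanged."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v] if length else list(v)

def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

d1 = normalize([3.0, 4.0])
d2 = normalize([30.0, 40.0])   # ten times longer, same direction
d3 = normalize([4.0, 3.0])

print(dot(d1, d1), dot(d1, d2), dot(d1, d3))
```

After normalization the long and short versions of the same document coincide, and a plain dot product already gives the cosine.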

Exercise • Euclidean distance between vectors d1 and d2: |d1 – d2| = √(Σ_i (d1_i – d2_i)²) • Show that, for normalized vectors, Euclidean distance gives the same closeness ordering as the cosine measure
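One possible line of argument for the exercise, assuming both vectors have unit length and writing θ for the angle between them, expands the squared distance:

```latex
\begin{align*}
|d_1 - d_2|^2 &= (d_1 - d_2)\cdot(d_1 - d_2) \\
              &= |d_1|^2 + |d_2|^2 - 2\, d_1 \cdot d_2 \\
              &= 2 - 2\cos\theta \qquad \text{since } |d_1| = |d_2| = 1 .
\end{align*}
```

The distance is therefore a strictly decreasing function of cos θ, so ranking by increasing Euclidean distance gives the same order as ranking by decreasing cosine.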

Example • Docs: Austen's Sense and Sensibility (SAS), Pride and Prejudice (PAP); Bronte's Wuthering Heights (WH) • cos(SAS, PAP) = .996 x .993 + .087 x .120 + .017 x 0.0 = 0.999 • cos(SAS, WH) = .996 x .847 + .087 x .466 + .017 x .254 = 0.888
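The two sums can be recomputed directly from the listed components. This sketch assumes the three-component vectors are already length-normalized, so the cosine reduces to a dot product. The variable names are illustrative, and the terms behind the three components are not named here.

```python
sas = (0.996, 0.087, 0.017)   # Sense and Sensibility
pap = (0.993, 0.120, 0.0)     # Pride and Prejudice
wh = (0.847, 0.466, 0.254)    # Wuthering Heights

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

cos_sas_pap = dot(sas, pap)
cos_sas_wh = dot(sas, wh)
print(round(cos_sas_pap, 3), round(cos_sas_wh, 3))
```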

Digression: spamming indices • This was all invented before the days when people were in the business of spamming web search engines: • Indexing a sensible passive document collection vs. • An active document collection, where people (and indeed, service companies) are trying to shape documents in an attempt to achieve ranking function maximization

Digression: ranking in Machine Learning • Our problem is: • Given document collection D and query q, return a ranking of D according to relevance to q. • Such ranking problems have been much less studied in machine learning than classification/regression problems • But much more interest recently, e.g., • W.W. Cohen, R.E. Schapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10:243–270, 1999. • And subsequent research

Digression: ranking in Machine Learning • Many “WWW” applications are ranking (aka ordinal regression) problems: • Text information retrieval • Image similarity search (QBIC) • Book/movie recommendations • Collaborative filtering • Meta-search engines

Summary: What’s the real point of using vector spaces? • Key: A user’s query can be viewed as a (very) short document. • Query becomes a vector in the same space as the docs. • Can measure each doc’s proximity to it. • Natural measure of scores/ranking – no longer Boolean.

Evaluation II • Evaluation of ranked results: • You can return any number of results ordered by similarity • By taking various numbers of documents (levels of recall), you can produce a precision-recall curve
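A minimal sketch of how the points on such a curve arise: walk down the ranked list and record (recall, precision) after each result. It assumes binary relevance judgments and a known total number of relevant documents; the function name and the five-result ranking are invented for illustration.

```python
def precision_recall_points(ranked_relevance, total_relevant):
    """Return a (recall, precision) pair after each rank position."""
    points = []
    hits = 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        hits += is_relevant  # True counts as 1, False as 0
        points.append((hits / total_relevant, hits / k))
    return points

# Relevance of the top 5 results; 4 relevant documents exist in the collection.
ranked = [True, False, True, True, False]
points = precision_recall_points(ranked, total_relevant=4)
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Precision drops at each non-relevant result and recovers partly at each relevant one, which gives the raw curve its sawtooth shape.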

Precision-recall curves

Interpolated precision • If you can increase precision by increasing recall, then you should get to count that…

Evaluation • There are various other measures • Precision at a fixed retrieval level (a fixed number of top-ranked results) • This is perhaps the most appropriate measure for web search: all people want to know is how many good matches there are in the first one or two pages of results • 11-point interpolated average precision • The standard measure in the TREC competitions: take the interpolated precision at 11 recall levels, from 0 to 1 in steps of 0.1 (the value at recall 0 is always interpolated), and average them
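The 11-point measure can be sketched as follows. It assumes the usual definition of interpolated precision at recall level r, namely the highest precision observed at any recall of at least r, with 0 when that recall is never reached. The function names and the toy ranking are illustrative.

```python
def interpolated_precision(points, recall_level):
    """Highest precision at any recall >= recall_level (0.0 if none is reached)."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates, default=0.0)

def eleven_point_average_precision(ranked_relevance, total_relevant):
    """Average interpolated precision over recall levels 0.0, 0.1, ..., 1.0."""
    points = []
    hits = 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        hits += is_relevant
        points.append((hits / total_relevant, hits / k))
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(points, r) for r in levels) / 11

# Relevance of the top 5 results; 4 relevant documents exist in the collection.
ranked = [True, False, True, True, False]
score = eleven_point_average_precision(ranked, total_relevant=4)
print(round(score, 3))
```

In this toy ranking recall never exceeds 0.75, so the levels 0.8, 0.9, and 1.0 each contribute zero to the average.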

We’ll use more notions from linear algebra next lecture • Matrix, vector • Transpose and product • Rank • Eigenvalues and eigenvectors.

Resources, and beyond • MG 4.4–4.5, MIR 2.5. • Next steps • Computing cosine similarity efficiently. • Dimensionality reduction. • Probabilistic approaches to IR
