CS 430: Information Discovery


Presentation Transcript


  1. CS 430: Information Discovery Lecture 9 Term Weighting and Ranking

  2. Course Administration

  3. Building an Index
  documents → assign document IDs → text (document numbers and field numbers*) → break into words → words → stoplist → non-stoplist words → stemming* → stemmed words → term weighting* → terms with weights → index database
  *Indicates optional operation. (From Frakes, page 7.)
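
To make the pipeline concrete, here is a minimal Python sketch of these steps under stated assumptions: the stoplist is a tiny sample, and crude_stem is a hypothetical placeholder for a real stemmer such as Porter's. Term weighting, the other optional step, is covered on the following slides.

```python
from collections import defaultdict

STOPLIST = {"the", "a", "an", "of", "and", "to", "in"}  # tiny sample stoplist

def crude_stem(word):
    """Placeholder for a real stemmer (e.g. Porter): just strips a
    trailing 's'. Stemming is marked optional in the pipeline above."""
    return word[:-1] if word.endswith("s") else word

def build_index(documents):
    """Follow the pipeline: assign document IDs, break into words,
    drop stoplist words, stem, and record which documents contain
    each term."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):        # assign document IDs
        for word in text.lower().split():            # break into words
            if word not in STOPLIST:                 # stoplist
                index[crude_stem(word)].add(doc_id)  # stemming*
    return index
```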

  4. Term weighting
  Zipf's Law: if the words w in a collection are ranked by their frequency f(w), with rank r(w), they roughly fit the relation:
  r(w) * f(w) = c
  This suggests that some terms are more effective than others in retrieval. In particular, relative frequency is a useful measure: it identifies terms that occur with substantial frequency in some documents, but with relatively low overall collection frequency. Term weights are functions used to quantify these concepts.
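
A minimal sketch of how one might check this relation on a collection (the lowercase whitespace tokenizer and the output format are illustrative assumptions, not part of the lecture):

```python
from collections import Counter

def zipf_products(text):
    """Rank words by frequency and report r(w) * f(w) for each.

    If the collection is Zipf-like, the products should be roughly
    constant. Tokenization is a crude lowercase whitespace split,
    purely for illustration.
    """
    freqs = Counter(text.lower().split())
    for rank, (word, f) in enumerate(freqs.most_common(), start=1):
        print(f"{rank:4d}  {word:15s}  f(w)={f:5d}  r(w)*f(w)={rank * f}")
```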

  5. Inverse Document Frequency Weighting
  Principle:
  (a) Weight is proportional to the number of times that the term appears in the document.
  (b) Weight is inversely proportional to the number of documents that contain the term.
  Very simple approach: w_ij = f_ij / d_j
  where:
  w_ij is the weight given to term j in document i
  f_ij is the frequency with which term j appears in document i
  d_j is the number of documents that contain term j
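
A minimal sketch of this very simple weight for a small in-memory collection (the list-of-token-lists representation is an assumption for illustration):

```python
from collections import Counter

def simple_weights(docs):
    """Compute w_ij = f_ij / d_j for a list of tokenized documents.

    docs: list of term lists, e.g. [["a", "b", "a"], ["b", "c"]].
    Returns one dict per document, mapping term -> weight.
    """
    # d_j: number of documents containing each term j
    d = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        f = Counter(doc)  # f_ij for this document
        weights.append({t: f[t] / d[t] for t in f})
    return weights
```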

  6. Term Frequency
  Concept: A term that appears many times within a document is likely to be more important than a term that appears only once.

  7. Term Frequency
  Suppose term j appears f_ij times in document i. Let m_i = max_j(f_ij), i.e., m_i is the maximum frequency of any term in document i.
  Term frequency (TF): tf_ij = f_ij / m_i
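
In code this is a one-pass computation per document (a sketch; documents are assumed to be non-empty token lists):

```python
from collections import Counter

def term_frequencies(doc):
    """tf_ij = f_ij / m_i for one document.

    doc: non-empty list of terms in document i.
    m_i is the largest raw count of any term in the document.
    """
    f = Counter(doc)
    m_i = max(f.values())
    return {term: count / m_i for term, count in f.items()}
```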

  8. Inverse Document Frequency
  Concept: A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.

  9. Inverse Document Frequency
  For term j:
  number of documents = n
  document frequency (number of documents in which term j occurs) = d_j
  One possible measure is n / d_j, but this over-emphasizes small differences. Therefore a more useful definition is:
  Inverse document frequency (IDF): idf_j = log2(n / d_j) + 1
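
The definition translates directly (a sketch; n and d_j are assumed to be known counts, with d_j >= 1):

```python
import math

def idf(n, d_j):
    """idf_j = log2(n / d_j) + 1.

    n:   total number of documents in the collection
    d_j: number of documents in which term j occurs (d_j >= 1)
    """
    return math.log2(n / d_j) + 1
```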

  10. Example of Inverse Document Frequency
  Example: n = 1,000 documents
  term k     d_k       idf_k
  A          100       4.32
  B          500       2.00
  C          900       1.13
  D          1,000     1.00
  From: Salton and McGill
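
To see where these values come from, take term A: idf_A = log2(1,000/100) + 1 = log2(10) + 1 ≈ 3.32 + 1 = 4.32. At the other extreme, term D occurs in every document, so idf_D = log2(1) + 1 = 0 + 1 = 1.00, the minimum possible weight.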

  11. Weighting
  Practical experience has demonstrated that weights of the following form perform well in a wide variety of circumstances:
  (weight of term j in document i) = (term frequency) * (inverse document frequency)
  The standard weighting scheme is:
  w_ij = tf_ij * idf_j = (f_ij / m_i) * (log2(n / d_j) + 1)
  Frakes, Chapter 14, discusses many variations on this basic scheme.
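
Putting the two factors together for a small in-memory collection (a sketch under the same assumptions as the earlier snippets: documents are non-empty token lists):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Standard scheme: w_ij = (f_ij / m_i) * (log2(n / d_j) + 1).

    docs: list of tokenized documents (non-empty term lists).
    Returns one dict per document, mapping term -> weight.
    """
    n = len(docs)
    d = Counter(term for doc in docs for term in set(doc))  # d_j
    weights = []
    for doc in docs:
        f = Counter(doc)         # f_ij
        m_i = max(f.values())    # m_i
        weights.append({
            t: (f[t] / m_i) * (math.log2(n / d[t]) + 1) for t in f
        })
    return weights
```

For example, tfidf_weights([["cat", "cat", "dog"], ["dog", "fish"]]) gives "cat" weight 2.0 in the first document but "dog" only 0.5 there, since "dog" occurs in both documents and earns the minimum idf.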

  12. Page Rank Algorithm (Google) Concept: The rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages.

  13. Intuitive Model
  A user:
  1. Starts at a random page on the web
  2. Selects a random hyperlink from the current page and jumps to the corresponding page
  3. Repeats Step 2 a very large number of times
  Pages are ranked according to the relative frequency with which they are visited.
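
This model can be simulated directly (a minimal sketch; the dict-of-adjacency-lists representation is an assumption, and every page is assumed to have at least one outgoing link, since this version of the model has no escape from dead ends):

```python
import random

def random_surfer(pages, start, steps=100_000):
    """Estimate page ranks by simulating the intuitive model.

    pages: dict mapping each page to the list of pages it links to.
    Counts how often each page is visited and returns relative
    frequencies, which are the ranks under this model.
    """
    visits = {p: 0 for p in pages}
    current = start
    for _ in range(steps):
        # Step 2: follow a random hyperlink from the current page
        current = random.choice(pages[current])
        visits[current] += 1
    total = sum(visits.values())
    return {p: v / total for p, v in visits.items()}
```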

  14. Page Ranks
  Link matrix: each row lists the 1-entries for a cited page, one per page that cites it.
  P1: 1 1 1
  P2: 1
  P3: 1
  P4: 1 1 1 1
  P5: 1
  P6: 1 1
  Number of links from each citing page P1-P6: 2 1 4 1 2 2

  15. Normalize by Number of Links from Page
  Matrix B: each entry of the link matrix divided by the number of links out of its citing page.
  P1: 1 0.25 0.5
  P2: 0.25
  P3: 0.5
  P4: 0.5 0.25 1 0.5
  P5: 0.5
  P6: 0.25 0.5
  Number of links from each citing page P1-P6: 2 1 4 1 2 2

  16. Weighting of Pages
  Initially all pages have weight 1:
  w_1 = (1, 1, 1, 1, 1, 1)^T
  Recalculate the weights:
  w_2 = B w_1 = (1.75, 0.25, 0.50, 2.25, 0.50, 0.75)^T

  17. Page Ranks (Basic Algorithm)
  • Iterate w_k = B w_{k-1} until the weights converge
  • The resulting w is the eigenvector of B with the largest eigenvalue (its principal eigenvector)
  • It ranks the pages by the links to them, normalized by the number of citations from each page and weighted by the ranking of the citing pages
  • The basic algorithm:
  • calculates the ranks for all pages
  • lists hits in page rank order
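
In matrix form this is power iteration (a minimal numpy sketch, assuming B is available as a square array like the one built on the previous slides; renormalizing at each step fixes the scale without changing the ranking):

```python
import numpy as np

def basic_page_rank(B, iterations=50):
    """Iterate w_k = B w_{k-1} from an all-ones vector.

    Under the usual conditions for power iteration, w converges to
    the principal eigenvector of B, i.e. the page ranks.
    """
    w = np.ones(B.shape[0])
    for _ in range(iterations):
        w = B @ w
        w /= w.sum()  # keep the scale fixed from step to step
    return w
```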

  18. Google PageRank Model
  A user:
  1. Starts at a random page on the web
  2a. With probability p, selects any random page on the web and jumps to it
  2b. With probability 1 - p, selects a random hyperlink from the current page and jumps to the corresponding page
  3. Repeats Steps 2a and 2b a very large number of times
  Pages are ranked according to the relative frequency with which they are visited.
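
The simulation from slide 13 needs only a small change to add the random jump (a sketch under the same assumptions as before; the default p = 0.15 mirrors the commonly quoted damping factor d = 0.85 and is not fixed by the lecture):

```python
import random

def damped_surfer(pages, start, p=0.15, steps=100_000):
    """Random-surfer simulation with teleportation.

    With probability p, jump to a uniformly random page (Step 2a);
    otherwise follow a random hyperlink (Step 2b).
    """
    all_pages = list(pages)
    visits = {q: 0 for q in all_pages}
    current = start
    for _ in range(steps):
        if random.random() < p:
            current = random.choice(all_pages)       # Step 2a: teleport
        else:
            current = random.choice(pages[current])  # Step 2b: follow a link
        visits[current] += 1
    total = sum(visits.values())
    return {q: v / total for q, v in visits.items()}
```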

  19. Google: PageRank
  The Google PageRank algorithm is usually written with the following notation:
  If page A has pages T1, ..., Tn pointing to it:
  • d: damping factor
  • C(A): number of links out of A
  Iterate until the ranks converge:
  PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
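
The iteration can be written directly from this formula (a sketch; the graph representation and the d = 0.85 default are assumptions, the latter being the value commonly quoted for the original algorithm):

```python
def page_rank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum over citing pages T of
    PR(T) / C(T), where C(T) is the number of links out of T.

    links: dict mapping each page to the list of pages it links to.
    """
    pr = {p: 1.0 for p in links}
    # Invert the graph once: which pages T point to each page A?
    incoming = {a: [t for t in links if a in links[t]] for a in links}
    for _ in range(iterations):
        # Build the new ranks from the old ones in one pass
        pr = {
            a: (1 - d) + d * sum(pr[t] / len(links[t]) for t in incoming[a])
            for a in pr
        }
    return pr
```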

  20. Compare TF.IDF to PageRank
  With TF.IDF, documents are ranked depending on how well they match a specific query.
  With PageRank, pages are ranked in order of importance, with no reference to a specific query.
