CS 430 / INFO 430: Information Retrieval

CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2

Course Administration Some thoughts about Google as an organization

Indexing the Web Goals: Precision Short queries applied to very large numbers of items lead to large numbers of hits. • The goal is that the first 10-100 hits presented should satisfy the user's information need -- this requires ranking hits in an order that fits the user's requirements. • Recall is not an important criterion. Completeness of the index is not an important factor. • Comprehensive crawling is unnecessary.

Graphical Methods Document A provides information about document B. Document A refers to document B.

Anchor Text The source of Document A contains the marked-up text: <a href="http://www.cis.cornell.edu/">The Faculty of Computing and Information Science</a> The anchor text: The Faculty of Computing and Information Science can be considered descriptive metadata about the document: http://www.cis.cornell.edu/
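A minimal sketch of how anchor text could be harvested as metadata about the target page, using Python's standard html.parser module. The class name AnchorTextParser and the helper anchor_texts are illustrative names, not part of the lecture, and a production crawler would need to handle malformed markup and relative URLs.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (target URL, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (href, text) pairs
        self._href = None    # href of the <a> element currently open
        self._chunks = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._chunks).split())
            self.pairs.append((self._href, text))
            self._href = None


def anchor_texts(html):
    """Return the list of (href, anchor text) pairs found in html."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


source = ('<p>See <a href="http://www.cis.cornell.edu/">The Faculty of '
          'Computing and Information Science</a> for details.</p>')
pairs = anchor_texts(source)
```

Each pair can then be indexed under the target URL rather than under the page that contains the link.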

Concept of Relevance and Importance Document measures Relevance, as conventionally defined, is binary (relevant or not relevant). It is usually estimated by the similarity between the terms in the query and each document. Importance measures documents by their likelihood of being useful to a variety of users. It is usually estimated by some measure of popularity. Web search engines rank documents by a combination of estimates of relevance and importance.

Ranking Options 1. Paid advertisers 2. Manually created classification 3. Vector space ranking with corrections for document length, and extra weighting for specific fields, e.g., title, anchors, etc. 4. Popularity, e.g., PageRank The details of 3 and the balance between 3 and 4 are not made public.

Citation Graph [Figure: a paper, the papers that it cites, and the papers by which it is cited.] Note that journal citations always refer to earlier work.

Bibliometrics Techniques that use citation analysis to measure the similarity of journal articles or their importance Bibliographic coupling: two papers that cite many of the same papers Co-citation: two papers that were cited by many of the same papers Impact factor (of a journal): frequency with which the average article in a journal has been cited in a particular year or period
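The two similarity measures can be illustrated with a small sketch. The citation data and the function names below are hypothetical, chosen only to show the counting involved.

```python
# Hypothetical citation data: paper -> set of papers it cites.
cites = {
    "A": {"X", "Y", "Z"},
    "B": {"X", "Y"},
    "C": {"Z"},
    "D": {"A", "B"},
    "E": {"A", "B", "C"},
}


def bibliographic_coupling(cites, p, q):
    """Number of papers that are cited by both p and q."""
    return len(cites.get(p, set()) & cites.get(q, set()))


def co_citation(cites, p, q):
    """Number of papers that cite both p and q."""
    return sum(1 for refs in cites.values() if p in refs and q in refs)


coupling_ab = bibliographic_coupling(cites, "A", "B")  # A and B both cite X and Y
cocitation_ab = co_citation(cites, "A", "B")           # D and E each cite both A and B
```

Bibliographic coupling is fixed once both papers are published, whereas co-citation counts can grow as later papers cite the pair.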

Bibliometrics: Impact Factor Impact Factor (Garfield, 1972) • Set of journals in Journal Citation Reports of the Institute for Scientific Information • Impact factor of a journal j in a given year is the average number of citations received by papers published in the previous two years of journal j. Impact factor counts in-degrees of nodes in the network. Influence Weight (Pinski and Narin, 1976) • A journal is influential if, recursively, it is heavily cited by other influential journals.
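The impact factor definition can be written as a ratio. The symbols C_j and N_j below are labels introduced here for illustration, following the usual two-year form of the measure.

```latex
\mathrm{IF}_j(y) = \frac{C_j(y)}{N_j(y)}
% C_j(y): citations received during year y by papers that journal j
%         published in years y-1 and y-2
% N_j(y): number of papers that journal j published in years y-1 and y-2
```

For example, a journal that published 100 papers over the previous two years, which together received 250 citations this year, would have an impact factor of 2.5.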

Graphical Analysis of Hyperlinks on the Web [Figure: link graph on six pages, numbered 1 to 6. A page that links to many other pages is a hub; a page that many pages link to is an authority.]

Graphical Methods on Web Links Choices • Graph of full Web or subgraph • In-links to a node or all links Algorithms • Hubs and Authorities -- subgraph, all links (Kleinberg, 1997) • PageRank -- full graph, in-links only (Brin and Page, 1998) See: J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46 (1999), for descriptions of all these methods. Or take: CS/INFO 685, The Structure of Information Networks

PageRank Algorithm Used to estimate importance of documents. Concept: The rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages. PageRank is essentially a modified version of Pinski and Narin's influence weights applied to the Web graph.

Intuitive Model (Basic Concept) Basic (no damping) A user: 1. Starts at a random page on the web 2. Selects a random hyperlink from the current page and jumps to the corresponding page 3. Repeats Step 2 a very large number of times Pages are ranked according to the relative frequency with which they are visited.

Matrix Representation
Columns: citing page (from). Rows: cited page (to).

          P1   P2   P3   P4   P5   P6   Number
P1         0    0    0    0    1    0      1
P2         1    0    1    0    0    0      2
P3         1    1    0    1    0    0      3
P4         1    1    0    0    1    1      4
P5         1    0    0    0    0    0      1
P6         0    0    0    0    1    0      1
Number     4    2    1    1    3    1

Basic Algorithm: Normalize by Number of Links from Page
Normalized link matrix B. Columns: citing page. Rows: cited page.

          P1     P2     P3     P4     P5     P6
P1        0      0      0      0      0.33   0
P2        0.25   0      1      0      0      0
P3        0.25   0.5    0      1      0      0
P4        0.25   0.5    0      0      0.33   1
P5        0.25   0      0      0      0      0
P6        0      0      0      0      0.33   0
Number    4      2      1      1      3      1
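A short sketch of the normalization step. The link structure below is an assumption: it is the one consistent with the row and column counts of the example matrix and with the weights shown on the later iteration slides. The function name normalized_link_matrix is illustrative.

```python
# Assumed out-links of the six-page example: page -> pages it links to.
links = {
    1: [2, 3, 4, 5],
    2: [3, 4],
    3: [2],
    4: [3],
    5: [1, 4, 6],
    6: [4],
}


def normalized_link_matrix(links):
    """Return B as a list of rows, with B[i][j] = 1 / outdegree(j)
    if page j links to page i, and 0 otherwise."""
    pages = sorted(links)
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    B = [[0.0] * n for _ in range(n)]
    for source, targets in links.items():
        for target in targets:
            B[index[target]][index[source]] = 1.0 / len(targets)
    return B


B = normalized_link_matrix(links)
# Every column sums to 1 because each page divides its weight
# equally among the pages it links to.
column_sums = [sum(B[i][j] for i in range(len(B))) for j in range(len(B))]
```

This sketch assumes every page has at least one out-link, as in the example; pages with no out-links would need separate handling.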

Basic Algorithm: Weighting of Pages
Initially all pages have weight 1: w_0 = (1, 1, 1, 1, 1, 1)^T
Recalculate weights: w_1 = B w_0 = (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)^T
If the user starts at a random page, the jth element of w_1, divided by the number of pages, is the probability of reaching page j after one step.

Basic Algorithm: Iterate
Iterate: w_k = B w_{k-1}

        w_0    w_1    w_2    w_3    ... converges to ...    w
P1       1     0.33   0.08   0.03                          0.00
P2       1     1.25   1.83   2.80                          2.39
P3       1     1.75   2.79   2.06                          2.39
P4       1     2.08   1.12   1.05                          1.19
P5       1     0.25   0.08   0.02                          0.00
P6       1     0.33   0.08   0.03                          0.00

The sum of the weights is the number of pages.
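The undamped iteration can be sketched in a few lines of plain Python. The link structure is the one assumed for the six-page example (it is consistent with the counts and weights on the slides), and the helper names are illustrative.

```python
# Assumed out-links of the six-page example: page -> pages it links to.
links = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}
n = len(links)

# Normalized link matrix: B[i][j] = 1 / outdegree(j+1) if page j+1 links to page i+1.
B = [[0.0] * n for _ in range(n)]
for source, targets in links.items():
    for target in targets:
        B[target - 1][source - 1] = 1.0 / len(targets)


def mat_vec(M, w):
    """Multiply matrix M (list of rows) by vector w."""
    return [sum(m * x for m, x in zip(row, w)) for row in M]


def iterate(B, steps):
    """Return [w_0, w_1, ..., w_steps] with w_0 all ones and w_k = B w_{k-1}."""
    history = [[1.0] * len(B)]
    for _ in range(steps):
        history.append(mat_vec(B, history[-1]))
    return history


history = iterate(B, 200)
w1, w2, w = history[1], history[2], history[-1]
```

For this graph the weights drain out of pages 1, 5 and 6 and settle on pages 2, 3 and 4 in the ratio 2 : 2 : 1, that is, about 2.4, 2.4 and 1.2, which agrees with the converged column on the slide to within rounding.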

Graphical Analysis of Hyperlinks on the Web [Figure: the same link graph on pages 1 to 6.] There is no link out of {2, 3, 4}

Google PageRank with Damping A user: 1. Starts at a random page on the web 2a. With probability d, selects any random page and jumps to it 2b. With probability 1-d, selects a random hyperlink from the current page and jumps to the corresponding page 3. Repeats Step 2a and 2b a very large number of times Pages are ranked according to the relative frequency with which they are visited.
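The damped random-surfer model can be simulated directly. This is a sketch under stated assumptions: the six-page link structure is the one assumed for the lecture's example, d = 0.3 is taken from the later iteration slide, and the function name random_surfer and the fixed seed are illustrative choices.

```python
import random

# Assumed out-links of the six-page example: page -> pages it links to.
links = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}


def random_surfer(links, d, steps, seed=0):
    """Simulate the surfer and return visit frequencies scaled to sum to n."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)                  # Step 1: start at a random page
    for _ in range(steps):
        if rng.random() < d:
            current = rng.choice(pages)          # Step 2a: jump to any random page
        else:
            current = rng.choice(links[current])  # Step 2b: follow a random link
        visits[current] += 1
    n = len(pages)
    return {page: n * visits[page] / steps for page in pages}


weights = random_surfer(links, d=0.3, steps=200_000)
ranking = sorted(weights, key=weights.get, reverse=True)
```

With enough steps the scaled frequencies approach the weights produced by the damped matrix iteration, so simulation is useful mainly as a check on intuition rather than as a practical way to compute ranks.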

The PageRank Iteration The basic method iterates using the normalized link matrix, B: w_k = B w_{k-1}. The limit, w, is the principal eigenvector of B (the eigenvector with the largest eigenvalue). PageRank iterates using a damping factor, d: w_k = d w_0 + (1 - d) B w_{k-1}. w_0 is a vector with every element equal to 1. d is a constant found by experiment.

The PageRank Iteration The iteration expression with damping can be re-written. Let R be a matrix with every element equal to 1/n. Then R w_{k-1} = w_0 (the sum of the elements of w_{k-1} equals n). Let P = dR + (1 - d)B. The iteration formula w_k = d w_0 + (1 - d) B w_{k-1} is equivalent to w_k = P w_{k-1}, so that w is the principal eigenvector of P. Extra slide added November 22, 2005
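The equivalence can be checked in a few lines. This derivation uses only the definitions on the slide, plus the fact that each column of B sums to 1 (every page distributes its whole weight over its out-links).

```latex
% Each row of R averages the elements of w_{k-1}, which sum to n:
(R\,w_{k-1})_i = \frac{1}{n}\sum_{j=1}^{n} (w_{k-1})_j = 1 = (w_0)_i

% Hence:
P\,w_{k-1} = \bigl(dR + (1-d)B\bigr)\,w_{k-1}
           = d\,w_0 + (1-d)\,B\,w_{k-1}
           = w_k

% The sum of the weights is preserved, so the argument applies at every step:
\sum_{i=1}^{n} (w_k)_i = d\,n + (1-d)\sum_{i=1}^{n} (B\,w_{k-1})_i
                       = d\,n + (1-d)\,n = n
```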

Iterate with Damping
Iterate: w_k = P w_{k-1} (d = 0.3)

        w_0    w_1    w_2    w_3    ... converges to ...    w
P1       1     0.53   0.41   0.39                          0.38
P2       1     1.18   1.46   1.80                          1.68
P3       1     1.53   2.03   1.78                          1.87
P4       1     1.76   1.29   1.26                          1.31
P5       1     0.48   0.39   0.37                          0.37
P6       1     0.53   0.41   0.39                          0.38

Iteration expression corrected: November 22, 2005
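A sketch of the damped iteration with d = 0.3, written both as w_k = d w_0 + (1 - d) B w_{k-1} and as multiplication by P = dR + (1 - d)B. The link structure is the one assumed for the six-page example, and the helper names are illustrative.

```python
# Assumed out-links of the six-page example: page -> pages it links to.
links = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}
n = len(links)
d = 0.3

# Normalized link matrix B and the combined matrix P = dR + (1 - d)B,
# where every element of R is 1/n.
B = [[0.0] * n for _ in range(n)]
for source, targets in links.items():
    for target in targets:
        B[target - 1][source - 1] = 1.0 / len(targets)
P = [[d / n + (1 - d) * B[i][j] for j in range(n)] for i in range(n)]


def mat_vec(M, w):
    """Multiply matrix M (list of rows) by vector w."""
    return [sum(m * x for m, x in zip(row, w)) for row in M]


def damped_iteration(B, d, steps):
    """Return [w_0, ..., w_steps] using w_k = d*w_0 + (1 - d)*B*w_{k-1}."""
    w0 = [1.0] * len(B)
    history = [w0]
    for _ in range(steps):
        Bw = mat_vec(B, history[-1])
        history.append([d * w0[i] + (1 - d) * Bw[i] for i in range(len(B))])
    return history


history = damped_iteration(B, d, 100)
w1, w = history[1], history[-1]
w1_via_P = mat_vec(P, history[0])   # same step, written as w_1 = P w_0
```

Unlike the undamped case, every page keeps a weight of at least d, so pages 1, 5 and 6 no longer fall to zero.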

Google: PageRank The Google PageRank algorithm is usually written with the following notation. Page A has pages T_i pointing to it. • d: damping factor • C(A): number of links out of A Iterate until: [expression missing from this transcript] Note added 12/1/05: the parameter d used in this expression corresponds to (1 - d), where d is the parameter used in Slides 20-23.
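The expression that followed "Iterate until:" did not survive in this transcript. For reference, the form published by Brin and Page (1998), which uses the notation listed on the slide, is shown below; whether the slide used exactly this layout is an assumption.

```latex
PR(A) = (1 - d) + d \sum_{i} \frac{PR(T_i)}{C(T_i)}
% PR(A):   PageRank of page A
% T_i:     the pages that link to A
% C(T_i):  number of links out of page T_i
% d:       damping factor in Brin and Page's convention
```

Substituting 1 - d for d gives, element by element, the damped iteration used earlier in the lecture, which is the relationship that the note on the slide describes.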

Information Retrieval Using PageRank Simple Method Consider all hits (i.e., all documents that match the query in the Boolean sense) as equal. Display the hits ranked by PageRank. The disadvantage of this method is that it gives no attention to how closely a document matches a query.

Combining Term Weighting with Reference Pattern Ranking Combined Method 1. Find all documents that contain the terms in the query vector. 2. The similarity, using conventional term weighting, between the query and document j is s_j. 3. The rank of document j using PageRank or other reference pattern ranking is p_j. 4. Calculate a combined rank c_j = λ s_j + (1 - λ) p_j, where λ is a constant. 5. Display the hits ranked by c_j. This method is used in several commercial systems, but the details have not been published.
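A minimal sketch of steps 4 and 5, assuming the weighting constant (called lam here) multiplies the similarity score and that both scores have already been scaled to comparable ranges. The scores and the function name combined_rank are hypothetical; the commercial details are not published.

```python
def combined_rank(similarity, pagerank, lam):
    """Return c_j = lam * s_j + (1 - lam) * p_j for every matching document j."""
    return {doc: lam * similarity[doc] + (1 - lam) * pagerank[doc]
            for doc in similarity}


def ranked_hits(similarity, pagerank, lam):
    """Return the matching documents ordered by decreasing combined rank."""
    scores = combined_rank(similarity, pagerank, lam)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical scores for three documents that match the query.
similarity = {"d1": 0.9, "d2": 0.4, "d3": 0.6}   # s_j, term-weighting similarity
pagerank = {"d1": 0.1, "d2": 0.8, "d3": 0.5}     # p_j, scaled importance

balanced = ranked_hits(similarity, pagerank, lam=0.5)
relevance_heavy = ranked_hits(similarity, pagerank, lam=0.9)
```

With lam = 0.5 the popular but weakly matching document d2 comes first; with lam = 0.9 the closely matching document d1 comes first, which shows how much the choice of the constant matters.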

Problems with PageRank Most pages have very small PageRanks. • For searches that return large numbers of hits, there are usually a reasonable number of pages with high PageRank. • For searches that return smaller numbers of hits, e.g., highly specific queries, all the pages may have very small PageRanks, so that it is difficult to rank them in a sensible order. Example: A search by a customer for information about a product may rank a large number of mail order businesses that sell the product above the manufacturer's site that provides a specification for the product. When PageRanks are this small, a small number of links can make big changes to the rank.

Problems with Graphical Methods (Anchor Text + PageRank) Google Bomb: a collective hyperlinking strategy intended to change the search results of a specific term or phrase. Examples The "failure" Google bomb promoted George W. Bush’s page on whitehouse.gov to the number one rank in a search of the phrase "failure." The "Jew" Google bomb demoted an anti-Semitic Web site from number one rank with a search of "Jew," and promoted the wikipedia.org definition of "Jew" to number one. See: Clifford Tatum, 2005, http://www.firstmonday.org/issues/issue10_10/tatum/

Advanced Graphical Methods: www.teoma.com • Carry out a search • Divide Web sites found by a search into clusters, known as communities • Calculate authority within communities • Calculate hubs within communities, known as experts Note: Teoma does not publish the precise algorithms it uses
