220 likes | 338 Vues
This lecture covers the origins and motivations behind Google’s PageRank algorithm, highlighting its design goals and foundational concepts in distributed algorithms and systems. We'll explore how PageRank transformed web search by exploiting the topological structure of hypertextual systems, addressing the limitations of existing search engine methods, and introducing key ideas such as the importance of backlinks and the treatment of dangling links. You'll gain insight into the mathematical underpinnings of PageRank and its implications for information retrieval on the web.
E N D
Lecture 4 CS492 Special Topics in Computer Science Distributed Algorithms and Systems
“The PageRank Citation Ranking:Bringing Order to the Web” L. Page, S. Brin, R. Motwani, T Winograd 1998 Fall 2008 CS492
Origin of “Google” Hostnames Active http://news.netcraft.com Fall 2008 CS492 • Googol • 10^100 • Motivation behind • Human maintained indices such as Yahoo! • Explosive growth
Design Goals of Google Fall 2008 CS492 • Improved search quality • In 1997, 1 out of 4 top search engines found itself • High precision in finding relevant document was necessary • Academic search engine research • Search engine technology went commercial: an black art • To build systems that a good number of people could use • To build an architecture to support novel research on large-scale Web data
Weakness of Existing Approaches Fall 2008 CS492 • Calculate similarities • Based on flat, vector-space model of each page • Prone to cheating (Web spamming or search engine persuasion)
Basic Idea of PageRank Fall 2008 CS492 Exploit the topological structure of hypertextual systems
Simple Example A 0.4 C 0.4 B 0.2 Fall 2008 CS492
Related Work Fall 2008 CS492 • Academic citation analysis • Similarities • Graph structure; paper = node, web page = node citation = link, URL = link • “node” authority independent of “node” content • Differences • Uniform unit of info (paper) versus great variability in quality, usage, citations, and length • Equal link weight vs variable importance A backlink from Yahoo! vs. from a friend
Which Page Should Be Ranked Higher? JohnDoe A B Fall 2008 CS492
Simple Expression page rank of set of pages pointing at out-degree of Question: role of c? Answer: total rank of all web pages constant Fall 2008 CS492
Dangling links Fall 2008 CS492 • Pages without outgoing pointers • Example: Pages not yet downloaded • Do not affect the calculation much • Remove them, calculate ranks, and add them back
Loop A C B Question: ranks of A, B, and C? Answer: infinite! (rank sink) Fall 2008 CS492
Basic Algorithm page rank of set of pages pointing at out-degree of dumping factor Fall 2008 CS492
Matrix Representation where and Fall 2008 CS492 Question: Where to start?
Iterative Algorithm where and Question: Will it converge? Fall 2008 CS492
Example [LM04] Fall 2008 CS492
Turn the Problem into a Markov Process [LM04] Fall 2008 CS492
Evenly Split Rank of Dangling Links [LM04] Fall 2008 CS492
Final Solution Fall 2008 CS492 Eigenvector of P = steady state rank
Spam Rank [BGS05] Fall 2008 CS492
Questions Fall 2008 CS492 • Where to start? • Find a nondegenerate start vector • What if there are two pages that point to each other and no one else and there is a page that points to one of them? • Role of dumping factor guarantees no rank sink
References Fall 2008 CS492 [BP98] Sergey Brin, Lawrence Page, “The anatomy of a large-scale hypertextual Web search engine,” Computer Networks and ISDN Systems, Vol. 30, 1998. [BGS05] Monica Bianchini, Marco Gori, Franco Scarselli, “Inside PageRank,” ACM Transactions on Internet Technology, Vol. 5, No. 1, Feb. 2005. [LM04] Amy N. Langville, Carl Meyer, “Deeper inside PageRank,” Internet Mathematics, Vol. I, No. 3, 2004. [K99] Jon Kleinberg, “Authoritative sources in a Hyperlinked Environment,” Journal of the ACM 46:5 (1999).