
Lecture #10 PageRank

Lecture #10: PageRank. CS492 Special Topics in Computer Science: Distributed Algorithms and Systems. Topics include the origin of “Google” (googol, 10^100), the motivation behind it (human-maintained indices such as Yahoo! and the explosive growth of the Web), and the design goals of Google.


Presentation Transcript


  1. Lecture #10: PageRank • CS492 Special Topics in Computer Science: Distributed Algorithms and Systems

  2. Origin of “Google” • Googol = 10^100 • Motivation behind • Human-maintained indices such as Yahoo! • Explosive growth (figure: “Hostnames” and “Active” site counts, source: http://news.netcraft.com)

  3. Design Goals of Google • Improved search quality • In 1997, only 1 of the top 4 commercial search engines could find itself (return its own page for a query on its own name) • High precision in finding relevant documents was necessary • Academic search engine research • Search engine technology had gone commercial and become a black art • To build systems that a good number of people could use • To build an architecture to support novel research on large-scale Web data

  4. Weakness of Existing Approaches • Calculate similarities • Based on a flat, vector-space model of each page • Prone to cheating (Web spamming or search engine persuasion)

  5. Basic Idea of PageRank Exploit the topological structure of hypertextual systems

  6. Simple Example (figure: three-page graph with ranks A = 0.4, B = 0.2, C = 0.4)

  7. Related Work • Academic citation analysis • Similarities • Graph structure: paper = node and citation = link; web page = node and URL = link • “Node” authority is independent of “node” content • Differences • Uniform unit of information (paper) versus great variability in quality, usage, citations, and length • Equal link weight versus variable importance, e.g., a backlink from Yahoo! versus one from a friend

  8. Which Page Should Be Ranked Higher? (figure: pages JohnDoe, A, and B)

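  9. Simple Expression • $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$ • $R(u)$: page rank of $u$ • $B_u$: set of pages pointing at $u$ • $N_v$: out-degree of $v$ • Question: role of $c$? • Answer: it keeps the total rank of all web pages constant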
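
The simple expression can be tried out on a tiny graph. The sketch below assumes the simplified ranking described in [PBMW], where every page splits its rank evenly among its out-links and a constant c renormalizes the total. The link structure (A links to B and C, B links to C, C links to A) is an assumption chosen for illustration, and the function name `simple_rank` is hypothetical.

```python
def simple_rank(links, iterations=100):
    """Iterate R(u) = c * sum over v in B_u of R(v) / N_v.

    links maps every page to the list of pages it points at.
    Assumes every page has at least one outgoing link.
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for v, outs in links.items():
            for u in outs:
                new[u] += rank[v] / len(outs)  # v gives R(v)/N_v to each target
        total = sum(new.values())
        # c is chosen so that the total rank stays equal to 1
        rank = {p: r / total for p, r in new.items()}
    return rank


# Assumed three-page graph, not taken from the slides
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simple_rank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

On this assumed graph the ranks approach A = 0.4, B = 0.2, C = 0.4, which agrees with the values shown in the simple example of slide 6.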

  10. Dangling links • Pages without outgoing pointers • Example: Pages not yet downloaded • Do not affect the calculation much • Remove them, calculate ranks, and add them back
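
The removal step can be sketched as follows. This is only one possible reading of "remove them": the helper name `prune_dangling` is hypothetical, and the choice to repeat the removal until no dangling page remains is an assumption (dropping a page can leave another page without outgoing links). The step that adds the pages back after ranking is not shown.

```python
def prune_dangling(links):
    """Repeatedly drop pages without outgoing links.

    links maps a page to the pages it points at. Pages that appear only
    as link targets (for example, pages not yet downloaded) are treated
    as dangling. Returns (pruned graph, removed pages in removal order).
    """
    graph = {p: set(outs) for p, outs in links.items()}
    for outs in links.values():
        for u in outs:
            graph.setdefault(u, set())  # target seen only as a link
    removed = []
    while True:
        dangling = sorted(p for p, outs in graph.items() if not outs)
        if not dangling:
            break
        for p in dangling:
            del graph[p]
            removed.append(p)
        for outs in graph.values():
            outs.difference_update(dangling)  # drop links to removed pages
    return graph, removed


# Assumed example: D is linked from A but has not been downloaded
links = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"]}
graph, removed = prune_dangling(links)
print(graph, removed)
```

Note that pruning changes the out-degree of the pages that pointed at a removed page (A goes from three out-links to two here), so the ranks computed on the pruned graph differ slightly from those of the full graph.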

  11. Loop (figure: loop among pages A, C, and B) • Question: ranks of A, B, and C? • Answer: infinite! (rank sink)
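
A rank sink can be observed with a few lines of code. The sketch below assumes an undamped update in which each page passes its whole rank along its out-links, and it assumes a graph in which a feeder page S (a hypothetical name) points into the loop A, B, C; the slide's figure may differ.

```python
def undamped_step(links, rank):
    """One update in which every page gives away all of its rank."""
    new = {p: 0.0 for p in links}
    for v, outs in links.items():
        for u in outs:
            new[u] += rank[v] / len(outs)
    return new


# Assumed graph: S feeds the loop A -> B -> C -> A, and nothing leaves the loop
links = {"S": ["A"], "A": ["B"], "B": ["C"], "C": ["A"]}
rank = {p: 0.25 for p in links}
for _ in range(10):
    rank = undamped_step(links, rank)

loop_total = rank["A"] + rank["B"] + rank["C"]
print(rank["S"], loop_total)
```

All rank that enters the loop stays there: S ends with no rank at all, and the loop holds the entire total. If fresh rank kept arriving from outside, the loop's share would grow without bound, which is the sense of "infinite" on the slide.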

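  12. Basic Algorithm • $R(u) = (1 - d) + d \sum_{v \in B_u} \frac{R(v)}{N_v}$, the damped form given in [BP98] • $R(u)$: page rank of $u$ • $B_u$: set of pages pointing at $u$ • $N_v$: out-degree of $v$ • $d$: damping factor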
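
A minimal sketch of the damped update, assuming the form R(u) = (1 - d) + d * sum of R(v)/N_v over the pages v pointing at u, as in [BP98]. The value d = 0.85, the function name `damped_rank`, and the example graph (a feeder page S pointing into a closed loop) are assumptions for illustration.

```python
def damped_rank(links, d=0.85, iterations=200):
    """Iterate R(u) = (1 - d) + d * sum over v in B_u of R(v) / N_v.

    links maps every page to the list of pages it points at.
    Assumes every page has at least one outgoing link.
    """
    rank = {p: 1.0 for p in links}
    for _ in range(iterations):
        new = {p: 1.0 - d for p in links}  # every page receives (1 - d)
        for v, outs in links.items():
            for u in outs:
                new[u] += d * rank[v] / len(outs)
        rank = new
    return rank


# Assumed graph: S feeds the loop A -> B -> C -> A
links = {"S": ["A"], "A": ["B"], "B": ["C"], "C": ["A"]}
rank = damped_rank(links)
for page in sorted(rank):
    print(page, round(rank[page], 4))
```

With damping, the page S keeps a rank of 1 - d even though no page links to it, and the loop no longer absorbs everything. In this form the ranks sum to the number of pages rather than to 1.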

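  13. Matrix Representation • $\mathbf{R} = (1 - d)\,\mathbf{1} + d\,A\,\mathbf{R}$, where $A_{uv} = 1/N_v$ if page $v$ links to page $u$ (and $0$ otherwise, with $N_v$ the out-degree of $v$ and $d$ the damping factor) and $\mathbf{1}$ is the all-ones vector • Question: Where to start?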

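  14. Iterative Algorithm • $\mathbf{R}^{(t+1)} = (1 - d)\,\mathbf{1} + d\,A\,\mathbf{R}^{(t)}$, where $A_{uv} = 1/N_v$ if page $v$ links to page $u$ (and $0$ otherwise, with $N_v$ the out-degree of $v$ and $d$ the damping factor) and $\mathbf{1}$ is the all-ones vector • Question: Will it converge?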
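
The convergence question can be explored numerically. The sketch below assumes the matrix form R = (1 - d) 1 + d A R, with A[u][v] = 1/N_v when page v links to page u. The helper names, the tolerance, the all-ones start vector, and the three-page graph are assumptions. For d < 1 and a matrix A whose columns sum to at most 1, each step shrinks the L1 distance between two candidate rank vectors by at least a factor d, so the iteration settles on a single fixed point.

```python
def build_matrix(links, pages):
    """Return A with A[u][v] = 1/N_v if v links to u, else 0.

    Assumes each page lists every target at most once.
    """
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    A = [[0.0] * n for _ in range(n)]
    for v, outs in links.items():
        for u in outs:
            A[index[u]][index[v]] = 1.0 / len(outs)
    return A


def iterate(A, d=0.85, tol=1e-10, max_iter=1000):
    """Repeat R <- (1 - d) * 1 + d * A R until the L1 change is below tol."""
    n = len(A)
    r = [1.0] * n  # assumed start vector
    for step in range(1, max_iter + 1):
        nxt = [(1.0 - d) + d * sum(A[i][j] * r[j] for j in range(n))
               for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(nxt, r))
        r = nxt
        if delta < tol:
            return r, step
    return r, max_iter


# Assumed three-page graph, not taken from the slides
pages = ["A", "B", "C"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
A = build_matrix(links, pages)
r, steps = iterate(A)
print(dict(zip(pages, r)), "after", steps, "steps")
```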

  15. Example [LM04]

  16. Turn the Problem into a Markov Process [LM04]

  17. Evenly Split Rank of Dangling Links [LM04]

  18. Final Solution • The eigenvector of the Markov transition matrix P for eigenvalue 1 is the steady-state rank
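
The steps of slides 16 to 18 can be sketched end to end. The code below follows the construction described in [LM04] as far as it is summarized here: rows of dangling pages are replaced by a uniform row, and a uniform teleportation term is mixed in so that P is a stochastic matrix. The mixing weight `alpha = 0.85`, the function names, and the four-page graph are assumptions for illustration.

```python
def transition_matrix(links, pages, alpha=0.85):
    """Build a row-stochastic matrix P for the random-surfer Markov chain."""
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    P = []
    for v in pages:
        outs = links.get(v, [])
        if outs:
            row = [0.0] * n
            for u in outs:
                row[index[u]] += 1.0 / len(outs)
        else:
            row = [1.0 / n] * n  # dangling page: split its rank evenly
        # mix in a uniform jump so that every entry of P is positive
        P.append([alpha * x + (1.0 - alpha) / n for x in row])
    return P


def stationary(P, iterations=200):
    """Power method for the left eigenvector: pi <- pi P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Assumed graph: D has no outgoing links
pages = ["A", "B", "C", "D"]
links = {"A": ["B", "C", "D"], "B": ["C"], "C": ["A"]}
P = transition_matrix(links, pages)
pi = stationary(P)
print(dict(zip(pages, (round(x, 4) for x in pi))))
```

The vector `pi` sums to 1 and is left unchanged by a further multiplication with P, which is what "steady state" means for this chain.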

  19. Spam Rank [BGS05]

  20. Questions • Where to start? • Find a nondegenerate start vector • What if there are two pages that point to each other and to no one else, and there is a page that points to one of them? • The damping factor guarantees that there is no rank sink

  21. References • [PBMW] L. Page, S. Brin, R. Motwani, T. Winograd, “The PageRank citation ranking: bringing order to the web,” WWW 1998. • [BP98] S. Brin, L. Page, “The anatomy of a large-scale hypertextual Web search engine,” Computer Networks and ISDN Systems, Vol. 30, 1998. • [BGS05] M. Bianchini, M. Gori, F. Scarselli, “Inside PageRank,” ACM Transactions on Internet Technology, Vol. 5, No. 1, Feb. 2005. • [LM04] A. N. Langville, C. Meyer, “Deeper inside PageRank,” Internet Mathematics, Vol. 1, No. 3, 2004. • [K99] J. Kleinberg, “Authoritative sources in a hyperlinked environment,” Journal of the ACM, Vol. 46, No. 5, 1999.
