How PageRank Works

Ketan Mayer-Patel, University of North Carolina, January 31, 2011.


Presentation Transcript


  1. How PageRank Works Ketan Mayer-Patel University of North Carolina January 31, 2011

  2. Me vs. Jeff • Me • High school: public school in Texas • College: the University of California, Berkeley • Faculty member at: UNC • Jeff • High school: hoity-toity, private all-boys school in Jersey • College: Stanford • Faculty member at: Duke

  3. The World Wide Web • A Simple Request/Response System • Request for web page. • Web page returned.

  4. Making The Request • How do you make a web request? • Use a browser. • Specify what you want directly. • Follow a link. • Turns out we very rarely specify documents directly. • Uniform Resource Locator (URL) • http://server-name.com/path/to/a/page • Two key characteristics of hyperlinks: • Directional • Unilateral

  5. Web Search In Three Easy Steps • What’s step one? • Cut a hole in the box.

  6. Web Search In Three Easy Steps • First, crawl. • Try to find all of the web pages. • Follow the links. • Second, index. • Organize what you find. • Lots of secret sauce here. • Third, query. • Usually, text query words. • Retrieves a list of related pages. • Usually because they contain the query text.
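
The three steps above can be sketched on a toy scale. This is only an illustration, not how a real search engine is built: the `WEB` dictionary, its example URLs, and the function names are all made up for this sketch, and the "web" lives in memory rather than being fetched over HTTP.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "http://a.example/": ("tar heels basketball", ["http://b.example/"]),
    "http://b.example/": ("duke basketball", ["http://a.example/", "http://c.example/"]),
    "http://c.example/": ("page rank notes", []),
}

def crawl(web, seed):
    """Step 1: follow links breadth-first from a seed URL to find pages."""
    seen, queue = {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        for link in web[url][1]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

def build_index(web, urls):
    """Step 2: map each word to the set of URLs whose text contains it."""
    index = {}
    for url in urls:
        for word in web[url][0].split():
            index.setdefault(word, set()).add(url)
    return index

def query(index, text):
    """Step 3: return the URLs that contain every query word (unranked)."""
    matches = [index.get(word, set()) for word in text.split()]
    return set.intersection(*matches) if matches else set()

if __name__ == "__main__":
    found = crawl(WEB, "http://a.example/")
    index = build_index(WEB, found)
    print(sorted(query(index, "basketball")))
```

The query step returns an unordered set, which is exactly the gap the next slide raises: something still has to decide which result to list first.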

  7. Which to list first? • Possible clues: • Number of times the query term appears • Where it appears • Title, body text, URL, metadata, etc. • How it appears • Style of text • Role of text • Position in the document graph • This is what distinguished Google from other search engines at the time.

  8. PageRank • Supposedly named after Larry Page • Part of his research in grad school at Stanford • Patented while he was in grad school; Stanford holds the patent. • Licensed to Google for about 1.8 million shares of Google stock. • Stanford later sold the shares for about $336M

  9. Document Graph

  10. Probability Distribution of a Random Walk • Start walking the graph. • After some reasonably long amount of time, stop. • What’s the chance that you are on a particular page? • Larger chance => more important page • Is this actually true? • Maybe, maybe not
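
A minimal simulation of this idea, assuming a made-up three-page graph (the graph from the slides is not reproduced here). The names `LINKS`, `random_walk`, and `estimate_distribution` are illustrative only. The walk is repeated many times and the share of walks that stop on each page is used as the estimate.

```python
import random
from collections import Counter

# Hypothetical graph (not the one from the slides): page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def random_walk(links, start, steps, rng):
    """Follow `steps` randomly chosen links from `start`; return the final page."""
    page = start
    for _ in range(steps):
        page = rng.choice(links[page])
    return page

def estimate_distribution(links, walks=10_000, steps=50, seed=0):
    """Repeat the walk and record how often each page is the stopping point."""
    rng = random.Random(seed)
    pages = sorted(links)
    stops = Counter(
        random_walk(links, rng.choice(pages), steps, rng) for _ in range(walks)
    )
    return {page: stops[page] / walks for page in pages}

if __name__ == "__main__":
    for page, share in estimate_distribution(LINKS).items():
        print(page, round(share, 3))
```

In this particular graph, pages A and C should each come out near 0.4 and B near 0.2, since B only receives half of A's traffic.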

  11. Random Walk Example

  12. Random Walk Example

  13. Random Walk Example

  14. Random Walk Example

  15. Random Walk Example

  16. Random Walk Example

  17. Random Walk Example

  18. Trapdoors and Dead Ends • Hotel California: Can’t ever leave. • Shangri-La: Can’t ever get here.

  19. Spider Traps

  20. Fixing Our Random Walk • What can we do to fix it? • Add a bit more randomness. • At each step, with probability α, jump to any random page. • Otherwise, randomly follow a link. • Provides a way in to / out of trapdoors / dead ends and spider traps.
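
A sketch of the fixed walk, again on a made-up graph. Here page "X" is a dead end and nothing links to page "D". One detail is an assumption of this sketch rather than something the slide states: when the walker is on a page with no outgoing links, it always jumps.

```python
import random
from collections import Counter

# Hypothetical graph (not from the slides). "X" has no outgoing links
# (can't leave) and no page links to "D" (can't get there by following links).
LINKS = {
    "A": ["B", "X"],
    "B": ["A"],
    "D": ["A"],
    "X": [],
}

def teleporting_walk(links, steps, alpha, rng):
    """Random walk that jumps to a uniformly random page with probability alpha."""
    pages = sorted(links)
    page = rng.choice(pages)
    for _ in range(steps):
        # Assumption: also jump whenever there is no link to follow.
        if rng.random() < alpha or not links[page]:
            page = rng.choice(pages)
        else:
            page = rng.choice(links[page])
    return page

def estimate_distribution(links, alpha=0.15, walks=10_000, steps=50, seed=0):
    """Share of walks that stop on each page."""
    rng = random.Random(seed)
    stops = Counter(teleporting_walk(links, steps, alpha, rng) for _ in range(walks))
    return {page: stops[page] / walks for page in sorted(links)}

if __name__ == "__main__":
    for page, share in estimate_distribution(LINKS).items():
        print(page, round(share, 3))
```

With the jumps in place, every page ends up with a nonzero share, including "D", and the walker no longer gets stuck on "X".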

  21. Random Walk Scalability • Problem: Would need to simulate the random walk over and over again to even come close to discovering the underlying probability distribution. • Easy to do for small graphs. • Pain in the ass for large ones. • Markov Chain • Tool for analyzing stochastic processes. • Power method

  22. Power Method Equation • N : Number of documents • Rk : Page rank of document k • Lk : Number of outgoing links in k • δ(k,j) : Delta function for links between k and j • δ(k,j) = 1 if and only if there exists a link from document k to document j
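
The equation itself is not part of this transcript. A form consistent with the symbols defined on this slide would be the following, where the iteration superscript (i) is notation added here and a document with no outgoing links contributes nothing because δ(k,j) = 0 for it:

```latex
R_j^{(i+1)} \;=\; \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^{(i)}}{L_k}
```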

  23. Power Method Equation • Our definition is circular. • To calculate the page rank of a page we need to already know the page rank of other pages. • Iterative solution. • Start with an initial assignment. • Basically set the page rank of every page to 1/N. • Why 1/N? • Calculate an updated value for every page using the current values. • Keep repeating until the values are stable.

  24. Power Method Equation • Intuition: • Page rank of a document is the sum of its fair share of the page ranks of the pages that link to the document.
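
The iterative procedure and the "fair share" intuition can be sketched in a few lines. The graph and the function name `power_method` are made up for illustration, and this is the plain version without any fix for dead ends or spider traps.

```python
# Hypothetical graph (not the one from the slides): page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def power_method(links, iterations):
    """Start every page at 1/N, then repeatedly hand each page's rank to the
    pages it links to, split evenly over its outgoing links."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for k, outgoing in links.items():
            for j in outgoing:
                # Page j receives its fair share of page k's current rank.
                new_rank[j] += rank[k] / len(outgoing)
        rank = new_rank
    return rank

if __name__ == "__main__":
    for page, value in power_method(LINKS, 50).items():
        print(page, round(value, 3))
```

On this graph the values settle at roughly 0.4, 0.2, and 0.4 for A, B, and C. A page with an empty link list would simply drop its rank each round, since nothing in the loop passes it on.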

  25. Example (i = 0): 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

  26. Example (i = 1): 0.025 0.075 0.125 0.05 0.1 0.1 0.1 0.2 0.125 0

  27. Example (i = 10): 0.015 0.051 0.189 0.036 0.134 0.072 0.154 0.071 0.015 0 • Something is wrong! The page ranks now sum to about 0.74 instead of 1.

  28. Power Method v2 • Dead ends leak. • Spider traps slowly collect everything. • Translating our random walk solution: • Add a “virtual” link from every document to every other document. • Define a weighting factor α between 0.0 and 1.0. • Distribute a proportion α of your page rank over the virtual links. • Distribute the remaining (1 - α) of your page rank over the real links.
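
The v2 equation is not part of this transcript. One way to write the rule described above, reusing N, Rk, Lk, and δ(k,j) from slide 22, is sketched below. It assumes each document's α share is split N ways (a jump to any page, itself included), which matches the random-walk description on slide 20 but may differ from the form shown in the talk.

```latex
R_j^{(i+1)} \;=\; \alpha \sum_{k=1}^{N} \frac{R_k^{(i)}}{N}
\;+\; (1-\alpha) \sum_{k=1}^{N} \delta(k,j)\,\frac{R_k^{(i)}}{L_k}
```

Under this assumption the first term is the same for every document j, and it reduces to α/N only while the page ranks sum to 1.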

  30. Convergence • Typical value for α is 0.15. • Convergence typically occurs in about 50 iterations even for large graphs.
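
A sketch of the v2 iteration with a stopping test, so the number of iterations can be observed directly. The graph, the function name `page_rank`, and the tolerance are illustrative choices. The sketch assumes every page has at least one outgoing link; a dead end would still lose the (1 - α) part of its rank under this rule.

```python
# Hypothetical graph (not from the slides). Every page has an outgoing link;
# nothing links to "D".
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def page_rank(links, alpha=0.15, tolerance=1e-9, max_iterations=1000):
    """Iterate until the total change between rounds drops below `tolerance`.
    Returns the ranks and the number of iterations used."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for iteration in range(1, max_iterations + 1):
        # Virtual links: every page sends alpha * R_k / N to every page.
        jump_share = alpha * sum(rank.values()) / n
        new_rank = {page: jump_share for page in pages}
        # Real links: the remaining (1 - alpha) is split over outgoing links.
        for k, outgoing in links.items():
            for j in outgoing:
                new_rank[j] += (1 - alpha) * rank[k] / len(outgoing)
        change = sum(abs(new_rank[page] - rank[page]) for page in pages)
        rank = new_rank
        if change < tolerance:
            return rank, iteration
    return rank, max_iterations

if __name__ == "__main__":
    ranks, used = page_rank(LINKS)
    print("iterations:", used)
    for page, value in ranks.items():
        print(page, round(value, 3))
```

Page "D" ends up with exactly the jump share α/N, since the virtual links are the only way to reach it. A looser tolerance stops the loop earlier.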

  31. Example (i = 10): 0.024 0.074 0.115 0.061 0.112 0.073 0.107 0.105 0.034 0.011

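  32. Example (i = 10), value from slide 27 vs. value from slide 31 for each page: 0.015 vs. 0.024 • 0.051 vs. 0.074 • 0.189 vs. 0.115 • 0.036 vs. 0.061 • 0.134 vs. 0.112 • 0.072 vs. 0.073 • 0.154 vs. 0.107 • 0.071 vs. 0.105 • 0.015 vs. 0.034 • 0 vs. 0.011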

  33. Billions and billions • How do you do this with billions of documents? • Can be implemented using matrix math. • Special techniques for sparse matrices. • PageRank is roughly equivalent to the first (principal) eigenvector of the link matrix.
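
A sketch of the matrix view, under the assumptions that no document is a dead end and that the page ranks sum to 1. The symbols M, G, r, and the all-ones vector 1 are notation introduced here, not taken from the slides.

```latex
M_{jk} = \frac{\delta(k,j)}{L_k}, \qquad
G = \frac{\alpha}{N}\,\mathbf{1}\mathbf{1}^{\mathsf{T}} + (1-\alpha)\,M, \qquad
\mathbf{r}^{(i+1)} = G\,\mathbf{r}^{(i)}
```

Once the iteration is stable, G r = r, so r is an eigenvector of G with eigenvalue 1. M has one nonzero entry per link, which is why sparse techniques apply. The dense first term never needs to be stored, because multiplying it by r just adds (α/N) times the sum of the ranks to every entry.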

  34. Gaming The System • Google Bomb! • Create a lot of links to the page that you want to be highly ranked. • Create your own spider trap. • Relatively easy to combat by discounting links that come from the same domain. • Comment spam. • Porn trap.

  35. Last Notes • Stanford Sucks! • GO HEELS!

  36. Bad Math • When originally presented, the final version of the power method equation was shown as: • The simplification for the first term is wrong and should have been:
