Learn about the algorithm used by Google to rank web pages based on importance. Understand how PageRank calculates the probability distribution of web pages and assigns them a rank value.
 
                
Google Search Engine* CS461 Lecture, Department of Computer Science, Iowa State University • “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, S. Brin and L. Page, in Proceedings of WWW’98 • “The PageRank Citation Ranking: Bringing Order to the Web”, L. Page, S. Brin, R. Motwani, and T. Winograd, Technical Report, Stanford University, 1998
What to cover today • PageRank • Google Architecture
Problem Statement • Ultimate version • Find what I want • In most cases, I don’t know exactly, or cannot express clearly, what I want • “What-I-want” can be estimated using a set of keywords • Simplified version • Find the files that are most related to a set of keywords
Naïve Solution • How it works • Download the entire Internet to a local machine • Search and return all files containing the set of keywords • Problems: all files are treated as equally important • Could return tons of files, but most of them are not what I want • Since most users only check the first few files, this scheme rarely finds anything useful
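The naïve scheme above can be sketched in a few lines. This is only an illustration: the file names and contents are made up, and the "downloaded Internet" is assumed to be a small in-memory dictionary.

```python
def naive_search(files, keywords):
    """Return the names of all files that contain every keyword.

    No ranking is applied: every match is treated as equally important.
    """
    wanted = {k.lower() for k in keywords}
    hits = []
    for name, text in files.items():
        words = set(text.lower().split())
        if wanted <= words:  # all keywords appear in this file
            hits.append(name)
    return hits


# Hypothetical local copy of three files (not from the lecture)
files = {
    "a.html": "google search engine ranking",
    "b.html": "search for cheap flights",
    "c.html": "ranking of search results",
}

print(naive_search(files, ["search", "ranking"]))  # ['a.html', 'c.html']
```

The result is an unordered set of matches, which is exactly the weakness the slide points out: nothing tells the user which match to look at first.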
Ranking Based on Hit Rate • How it works • A file is ranked higher if it is visited more frequently • Problems • Could be affected by fake hits • A highly ranked file attracts more visits, so it will be ranked higher and higher
Ranking Based on Citation • Basic idea • A paper is important if it is cited by many papers • Each paper has a set of references that link to the related work • A pioneering paper typically has a high citation count • An HTML page is more important if it is linked to by many other pages • Each page may link to other pages • Problems • Publication of academic papers is well controlled • Many are peer-reviewed • Chronologically ordered • Internet files could be anything
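Citation-style ranking amounts to counting incoming links. A minimal sketch, assuming a hypothetical four-page link graph stored as a dictionary from each page to the pages it links to:

```python
def backlink_counts(links):
    """Count how many pages link to each page (its 'citations')."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical link graph: page -> pages it links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

counts = backlink_counts(links)
ranked = sorted(counts, key=counts.get, reverse=True)
print(counts)     # {'A': 1, 'B': 1, 'C': 3, 'D': 0}
print(ranked[0])  # C
```

Every link counts the same here, no matter how important the page it comes from is.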
Proposed: PageRank • Basic idea • A page with many links to it is more likely to be useful than one with few links to it • Just like citation • The links from a page that itself is the target of many links are likely to be particularly important • This is something new • Each link has a different weight [Figure: a page with its back links and a forward link]
Proposed: PageRank • How it works • Each page is ranked using a value called PageRank (PR) • A page’s PR depends on the PRs of its back link pages
PR(A) = (1 - d) + d * [PR(T1)/C(T1) + … + PR(Tn)/C(Tn)]
• d: damping factor, normally set to 0.85 • T1, …, Tn: pages that point to page A • PR(A): PageRank of page A • PR(Ti): PageRank of page Ti, which points to page A • C(Ti): the number of links going out of page Ti
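The formula translates directly into code. In this sketch the function name `pagerank_of` and the sample values for T1 and T2 are illustrative assumptions, not part of the lecture.

```python
def pagerank_of(backlinks, pr, out_count, d=0.85):
    """PR(A) = (1 - d) + d * [PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)].

    backlinks: the pages T1..Tn that point to page A
    pr:        current PageRank of each page
    out_count: C(T), the number of links going out of each page
    """
    return (1 - d) + d * sum(pr[t] / out_count[t] for t in backlinks)


# Hypothetical values: T1 has PR 1 and two outgoing links,
# T2 has PR 1 and one outgoing link.
pr = {"T1": 1.0, "T2": 1.0}
out_count = {"T1": 2, "T2": 1}

# 0.15 + 0.85 * (1/2 + 1/1) = 1.425 (up to floating-point rounding)
print(pagerank_of(["T1", "T2"], pr, out_count))
```

A link from T2 is worth twice as much as one from T1 because T2 splits its rank over fewer outgoing links.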
Proposed: PageRank • Properties of PageRank formula • PageRanks form a probability distribution over web pages, so the normalized sum of all web pages' PageRanks will be one • Challenge of calculating PageRanks • The links can form cycles, e.g., A → B → A, so each page's PR depends on the other's [Figure: Page A and Page B link to each other]
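A short derivation shows why the normalized sum is one. It assumes that the PageRanks have settled to fixed values and that each of the N pages has at least one outgoing link; the symbol N (the number of pages) is introduced here for illustration. Summing the formula over all pages, each page T appears C(T) times on the right-hand side, once for every page it links to:

```latex
\begin{align*}
\sum_{A} PR(A)
  &= N(1-d) + d \sum_{A} \sum_{T \to A} \frac{PR(T)}{C(T)} \\
  &= N(1-d) + d \sum_{T} C(T)\,\frac{PR(T)}{C(T)} \\
  &= N(1-d) + d \sum_{T} PR(T) \\
\Rightarrow \quad \sum_{A} PR(A) &= N
\end{align*}
```

Under these assumptions the average PageRank is 1, and dividing every PR by N gives values that sum to one.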
PageRank Calculation • Assign each page an initial rank value • Could be any number (seed) • Repeat calculations until the rank of each page does not change much • Example: Page A and Page B link to each other, d = 0.85
PR(A) = (1 - d) + d(PR(B)/1)
PR(B) = (1 - d) + d(PR(A)/1)
Seed = 1
PR(A) = 0.15 + 0.85 * 1 = 1
PR(B) = 0.15 + 0.85 * 1 = 1
PageRank Calculation (continued) • Page A and Page B link to each other, d = 0.85 • Seed = 0
1) PR(A) = 0.15 + 0.85 * 0 = 0.15
   PR(B) = 0.15 + 0.85 * 0.15 = 0.2775
2) PR(A) = 0.15 + 0.85 * 0.2775 = 0.385875
   PR(B) = 0.15 + 0.85 * 0.385875 = 0.47799375
3) PR(A) = 0.15 + 0.85 * 0.47799375 = 0.5562946875
   PR(B) = 0.15 + 0.85 * 0.5562946875 = 0.622850484375
PageRank Calculation (continued) • Page A and Page B link to each other, d = 0.85 • Seed = 40
1) PR(A) = 0.15 + 0.85 * 40 = 34.15
   PR(B) = 0.15 + 0.85 * 34.15 = 29.1775
2) PR(A) = 0.15 + 0.85 * 29.1775 = 24.950875
   PR(B) = 0.15 + 0.85 * 24.950875 = 21.35824375
3) ……
• Observation: It doesn’t matter what seed value you use; once the PageRank calculations settle down, the “normalized probability distribution” (the average PageRank for all pages) will be 1.0
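The observation can be checked with a small script. This is a sketch of the two-page example only; the function name, the stopping tolerance, and the round limit are assumptions added for illustration. As on the slides, PR(B) is computed from the freshly updated PR(A).

```python
def iterate_two_pages(seed, d=0.85, tol=1e-9, max_rounds=1000):
    """Iterate PR(A) = (1-d) + d*PR(B) and PR(B) = (1-d) + d*PR(A)."""
    pr_a = pr_b = seed
    for rounds in range(1, max_rounds + 1):
        new_a = (1 - d) + d * pr_b
        new_b = (1 - d) + d * new_a  # uses the updated PR(A)
        change = max(abs(new_a - pr_a), abs(new_b - pr_b))
        pr_a, pr_b = new_a, new_b
        if change < tol:  # ranks no longer change much
            break
    return pr_a, pr_b, rounds


for seed in (1, 0, 40):
    a, b, rounds = iterate_two_pages(seed)
    print(f"seed={seed}: PR(A)={a:.6f} PR(B)={b:.6f} after {rounds} rounds")
```

All three seeds settle at PR(A) = PR(B) = 1; only the number of rounds needed differs.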
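Example of Calculation (0) • Four pages: Page A, Page B, Page C, Page D [Figure: link graph of the four pages]
Example of Calculation (1) • Every page starts with PR = 1: Page A = 1, Page B = 1, Page C = 1, Page D = 1
Example of Calculation (2) • Each page passes on 0.85 of its PR, split evenly over its outgoing links • Page A: 1*0.85/2 to Page B and 1*0.85/2 to Page C • Page B: 1*0.85 to Page C • Page C: 1*0.85 to Page A • Page D: 1*0.85 to Page C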
Each page keeps 0.15 (= 1 - d) that it does not pass on, so we get:
Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
Page D: receives nothing, but keeps the 0.15 not transferred = 0.15
Result: Page A = 1, Page B = 0.575, Page C = 2.275, Page D = 0.15
Example of Calculation (3) • Starting values: Page A = 1, Page B = 0.575, Page C = 2.275, Page D = 0.15
Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
Page D: receives nothing, so it remains at 0.15
Result: Page A = 2.08375, Page B = 0.575, Page C = 1.19125, Page D = 0.15
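The two hand-computed iterations can be reproduced with a short script. The link graph below is read off the "from Page X" attributions in the calculation (A links to B and C; B, C and D each have one outgoing link); the names `LINKS` and `step` are illustrative. All pages are updated at once from the previous iteration's values, as in the slides.

```python
# Link graph inferred from the worked example: page -> pages it links to
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}


def step(pr, links, d=0.85):
    """One PageRank iteration using only the previous iteration's values."""
    new_pr = {page: 1 - d for page in links}  # the part "not transferred"
    for source, targets in links.items():
        share = d * pr[source] / len(targets)  # split evenly over out-links
        for target in targets:
            new_pr[target] += share
    return new_pr


pr0 = {page: 1.0 for page in LINKS}  # every page starts at 1
pr1 = step(pr0, LINKS)
pr2 = step(pr1, LINKS)

for label, pr in (("iteration 1", pr1), ("iteration 2", pr2)):
    print(label, {page: round(value, 5) for page, value in pr.items()})
```

Rounded to five decimals, the first iteration gives A = 1, B = 0.575, C = 2.275, D = 0.15 and the second gives A = 2.08375, B = 0.575, C = 1.19125, D = 0.15.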
Example of Calculation (4) • After 20 iterations, we get: Page A = 1.490, Page B = 0.783, Page C = 1.577, Page D = 0.15 • In reality: PageRank for 26,000,000 web pages can be computed in a few hours on a medium-size workstation (1998)
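Repeating the update until the values stop changing can be sketched as follows. The function name, the tolerance `tol`, and the iteration limit are assumptions for illustration, and the sketch assumes every page has at least one outgoing link and that the graph is the four-page example (A links to B and C; B, C and D each link to one page).

```python
def pagerank(links, d=0.85, seed=1.0, tol=1e-6, max_iter=1000):
    """Iterate the PageRank formula until no page changes by more than tol."""
    pr = {page: seed for page in links}
    for iteration in range(1, max_iter + 1):
        new_pr = {page: 1 - d for page in links}
        for source, targets in links.items():
            for target in targets:
                new_pr[target] += d * pr[source] / len(targets)
        change = max(abs(new_pr[page] - pr[page]) for page in links)
        pr = new_pr
        if change < tol:
            break
    return pr, iteration


# Four-page example graph: page -> pages it links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks, iterations = pagerank(links)
for page, value in ranks.items():
    print(f"Page {page}: {value:.3f}")
print("sum of PageRanks:", round(sum(ranks.values()), 3))
```

Rounded to three decimals the settled values agree with the slide (A = 1.490, B = 0.783, C = 1.577, D = 0.150), and they sum to 4, i.e. an average of 1 per page.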
Result • Page C has the highest PageRank, and Page A has the next highest: Page C is the most important page in this link structure • More iterations lead to stable PageRank values, which can then be used to rank the pages returned for a keyword search
PageRank Summary • PageRank is a citation importance ranking • Approximate measure of importance or quality • Number of citations or backlinks • Each citation has a different weight • The pages with high PageRanks are those that are linked to by many pages and/or by important pages (e.g., Yahoo!) • Questions: how can you improve the ranking of your web pages? • Creating dummy sites that link to your main site? • Increasing internal links and/or decreasing external links?