Delve into the depths of PageRank, Google's defining metric, exploring its motivation, implementation, and impact on search results. Learn how proximity, types of hits, and link importance influence rankings. Uncover the challenges of citation analysis and web manipulation.
The Anatomy of a Large-Scale Hypertextual Web Search Engine • A review by: Adam Chamberlain, Adrian Hudnott, Rob Garrood & Ben Smith • November 2005
Agenda • Introduction • Overview of Google • PageRank • Motivation & Description • Example • Issues & Comparison • Further Work • Application • Conclusions
Introduction • About the paper • Brin & Page, 1998, Stanford University • Details a prototype search engine, Google • Covers both architecture and algorithms • Cited in web metrics with relation to significance • Also relevant to Web Graph Properties • PageRank • Covered in a separate paper from Brin & Page • Is the primary metric used in the paper
Overview : What is Google? • Web search engine • Tackles issues faced by previous crawlers of scalability and manipulation • Academic • Built on strong understanding of web metrics • Use of hyperlink structures • Transparent • Initially released into the public domain • Support for informatics research
Overview : Architecture • [Architecture diagram. Components shown: URL Server, Crawler, Store Server, Repository, Indexer, Anchors, URL Resolver, Links, Barrels, Sorter, Doc Index, Lexicon, Checksums, PageRank, Searcher]
Overview: Google Architecture (Explanation for handout only.) • URL Server: Finds pages to surf. • Crawler: Downloads pages and places them in the repository. • Store Server: Document compression. • Repository: Cached copies of most web pages. • Indexer: Creates the forward index (documents → words) and extracts hyperlink tags into the Anchors file. • URL Resolver: Converts relative URLs into absolute URLs and creates the Links file. • Links file: Ordered pairs of document IDs where a hyperlink exists between them. • Sorter: Re-sorts the forward index to create the inverted index (words → documents) and creates the Lexicon. • Lexicon: Dictionary of all possible search keywords. • Doc Index: Maps document identifier codes to URLs. • PageRank: An influential web metric used to sort Google’s matches. • Searcher: Performs searches!
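The URL Resolver step can be illustrated with a minimal Python sketch. It is not Google's implementation: the function name `resolve_links`, the `doc_ids` table and the example.org URLs are hypothetical, chosen only to show relative URLs becoming absolute and then becoming ordered pairs of document IDs, as in the Links file.

```python
from urllib.parse import urljoin

def resolve_links(base_url, hrefs, doc_ids):
    """Turn the hrefs found on base_url into (from_id, to_id) pairs.

    doc_ids is a hypothetical stand-in for the Doc Index (URL -> document ID).
    Links to pages that have no document ID yet are skipped in this sketch.
    """
    pairs = []
    src = doc_ids[base_url]
    for href in hrefs:
        absolute = urljoin(base_url, href)  # relative URL -> absolute URL
        if absolute in doc_ids:
            pairs.append((src, doc_ids[absolute]))
    return pairs

# Hypothetical crawl of three pages.
doc_ids = {
    "http://example.org/a/index.html": 0,
    "http://example.org/a/b.html": 1,
    "http://example.org/c.html": 2,
}
links = resolve_links(
    "http://example.org/a/index.html",
    ["b.html", "../c.html", "/missing.html"],
    doc_ids,
)
print(links)
```

The resulting pairs are the kind of input the PageRank computation needs: a list of directed edges between document IDs.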
Overview : Forward Index • Indexer identifies keyword ‘hits’ in a document • Maps document (page) IDs to word IDs in the Lexicon • Word IDs are partially sorted into barrels • 64 of these • Word IDs within a barrel are unsorted. • An individual document may be spread over several barrels. • However, the forward index alone is not useful for search!
Overview : Inverted Index • Want to know in what documents a key word occurs • Need the ‘Inverted Index’ • Sorts the forward index into its inverted form • Function performed by the ‘Sorter’
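The relationship between the two indexes can be sketched in a few lines of Python. This is a toy illustration only: it ignores barrels and hit details, and the document and word IDs are made up.

```python
from collections import defaultdict

def invert(forward_index):
    """Build word ID -> sorted document IDs from document ID -> word IDs."""
    inverted = defaultdict(set)
    for doc_id, word_ids in forward_index.items():
        for word_id in word_ids:
            inverted[word_id].add(doc_id)
    return {word_id: sorted(docs) for word_id, docs in inverted.items()}

# Hypothetical forward index: document 1 contains word 10 twice, and so on.
forward = {1: [10, 11, 10], 2: [11, 12], 3: [10]}
inverted = invert(forward)
print(inverted[10])  # documents in which word 10 occurs
```

A search for a keyword then becomes a single lookup in `inverted` instead of a scan over every document.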
Overview : Ranking System • Proximity of keyword ‘hits’ • This is the sum of the distances between them • Hits have ‘types’ • Types: body text, heading text, anchor text, URL, … • Relative font size factor used • Count how many hits occur for each type and each range of proximity values • Apply a function to each type-proximity count • These form a type-proximity vector, C
Overview : Ranking System (2) • [Plot: f(x) against hit count, x] • V = C·W (dot product) is computed. • W is the importance associated with each type-proximity class. • Combine V with the PageRank score • Effect of increasing hits declines • Prevents large scale manipulation
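The scoring steps on the two ranking slides can be sketched as follows. Everything specific here is an assumption made for illustration: the four type-proximity classes, the weights in `W`, the tapering function `taper` (a capped logarithm) and the weighted sum in `final_score` are hypothetical, since the slides do not give the real function, weights or combination rule.

```python
import math

# Hypothetical type-proximity classes and their weights W (not from the paper).
CLASSES = ["heading/near", "anchor/near", "body/near", "body/far"]
W = [5.0, 4.0, 2.0, 0.5]

def taper(count, cap=8):
    """Hypothetical f(x): rises with the hit count, flat beyond `cap`."""
    return math.log1p(min(count, cap))

def ir_score(counts, weights=W):
    """V = C . W, where C is the tapered type-proximity count vector."""
    c = [taper(x) for x in counts]
    return sum(ci * wi for ci, wi in zip(c, weights))

def final_score(counts, pagerank, alpha=0.5):
    """Assumed combination: a weighted sum of V and the PageRank score."""
    return alpha * ir_score(counts) + (1 - alpha) * pagerank

honest = [1, 2, 3, 4]
stuffed = [1, 2, 3, 4000]  # same page with a keyword repeated in the body text
print(ir_score(honest), ir_score(stuffed))
```

Because `taper` flattens out, repeating a keyword thousands of times adds almost nothing to the score, which is the anti-manipulation effect the slide describes.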
PageRank : Motivation • Academic Citation Analysis* attempted, but… • Web has no formal quality control or peer review • Possible to inflate citation counts artificially • Web pages vary more than academic papers • Consider: • One link from the University’s main page, or one link from Yahoo’s main page… • Which citation should carry the higher weight ? *Also known as bibliometrics
PageRank : Description • Informal Definition: • “A page has a high rank if the sum of the ranks of its backlinks is high” • Handles ‘Yahoo’ case on previous slide • Intuitive Definition: • Corresponds to the Random Surfer Model • User keeps clicking on links ‘linearly’ then gets bored and restarts at a random location • Now for the maths…
PageRank : Description (2) • Formal Definition: $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + cE(u)$ • c is a ‘damping’ factor, was 0.85 • N_v is the number of out-links from page v • B_u is the set of pages that link to page u (its backlinks) • cE(u) corresponds to the surfer getting ‘bored’
PageRank : Example • Considering an example network • [Diagram: network of five pages A, B, C, D and E] • Calculating A: • c = damping factor • N = out-degree • R = PageRank
PageRank : Example (2) • [Diagram: network of five pages A, B, C, D and E] • Initially set all PageRank to 1 • First Iteration:
PageRank : Example (3) • Repeat process for B, C, D and E • Feed computed values into next iteration
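The iteration described in the example slides can be sketched in Python. Two things here are assumptions: the link structure in `links` is invented (the diagram of the A to E network did not survive), and the update uses the damping form given in the Brin & Page paper, R(u) = (1 - c) + c · Σ R(v)/N_v, which corresponds to a uniform ‘bored surfer’ term.

```python
def pagerank_step(links, ranks, c=0.85):
    """One iteration: recompute every page's rank from the previous ranks."""
    new = {}
    for u in links:
        # Sum R(v) / N_v over every page v that links to u.
        backlink_sum = sum(
            ranks[v] / len(out) for v, out in links.items() if u in out
        )
        new[u] = (1 - c) + c * backlink_sum
    return new

def pagerank(links, c=0.85, tol=1e-9, max_iter=200):
    """Start every rank at 1 and iterate until the ranks stop changing."""
    ranks = {u: 1.0 for u in links}
    for _ in range(max_iter):
        new = pagerank_step(links, ranks, c)
        if max(abs(new[u] - ranks[u]) for u in links) < tol:
            return new
        ranks = new
    return ranks

# Hypothetical network: page -> pages it links to (every page has an out-link).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"], "E": ["A", "D"]}

first = pagerank_step(links, {u: 1.0 for u in links})
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this invented network C collects links from three pages and ends up ranked highest, while E, which nobody links to, keeps only the baseline 1 - c.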
PageRank : Analysis • Converges in roughly log n iterations • Constrained by the time to build a full-text index more than anything • Rank ‘Sinks’ • Caused by, e.g., two pages that point to each other but not to any other pages: rank accumulates • Solved by random surfer model • Manipulation – ‘Google Bombing’ • French Military ‘Victories’ links to ‘Defeats’ • ‘Miserable Failure’ links to George Bush biography
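A rank sink can be reproduced on a three-page toy graph. The sketch below is an illustration under assumptions: the pages S, X and Y are hypothetical, and the damped update uses the (1 - c) + c · Σ R(v)/N_v form from the Brin & Page paper. Setting c = 1 removes the ‘bored surfer’ term entirely.

```python
def iterate(links, c, steps):
    """Run a fixed number of rank updates, starting from rank 1 everywhere."""
    ranks = {u: 1.0 for u in links}
    for _ in range(steps):
        ranks = {
            u: (1 - c) + c * sum(
                ranks[v] / len(out) for v, out in links.items() if u in out
            )
            for u in links
        }
    return ranks

# Hypothetical sink: S links in, X and Y link only to each other.
links = {"S": ["X"], "X": ["Y"], "Y": ["X"]}

no_restart = iterate(links, c=1.0, steps=50)     # pure link following
with_restart = iterate(links, c=0.85, steps=50)  # random surfer gets bored
print(no_restart)
print(with_restart)
```

Without restarts all of the rank drains out of S and stays trapped in the X and Y loop; with the restart term S retains a share of 1 - c, so the loop can no longer absorb everything.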
PageRank : Comparison • Web Graph Properties • Uses graph of the entire web: depends on full crawl • More sophisticated than simply summing in/out-degrees • Web Page Significance • Uses Boolean Spread Activation – match all words • Enhanced citation analysis – building on work of Kleinberg, Egghe & Rousseau • Doesn’t suffer from Tightly Knit Communities effect of Kleinberg’s Hubs & Authorities
PageRank : Further Work • Personalised PageRank, Haveliwala, 1999 • In-memory, block-oriented algorithm • PageRank can be computed in an hour on a PIII 450 MHz using less than 100 MB of main memory • Compute PageRank on the client-side • Use local information: bookmarks, searches, history • Provide the link structure of the web on a DVD • 11/11/05, “Personalized Search” released
PageRank : Further Work (2) • Topic Sensitive PageRank, Haveliwala, 2002 • Improve Google by giving weight to the informational relationship between sites • A) Uniform Results • Similar to ‘current’ Google but with topics • B) Personalised to a particular user • Based on previous searches and users’ surfing habits
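One way to picture personalised or topic-biased ranking is to make the ‘bored surfer’ restart at preferred pages rather than anywhere. The sketch below is an assumption-laden toy, not Haveliwala's algorithm: the four-page graph, the `restart` vectors and the normalised update R(u) = c · Σ R(v)/N_v + (1 - c) · E(u), with E summing to 1, are all chosen for illustration.

```python
def biased_pagerank(links, restart, c=0.85, steps=100):
    """PageRank where the surfer restarts according to `restart` (sums to 1)."""
    n = len(links)
    ranks = {u: 1.0 / n for u in links}
    for _ in range(steps):
        ranks = {
            u: c * sum(
                ranks[v] / len(out) for v, out in links.items() if u in out
            ) + (1 - c) * restart.get(u, 0.0)
            for u in links
        }
    return ranks

# Hypothetical site graph: page -> pages it links to.
links = {
    "news": ["sport", "tech"],
    "sport": ["news"],
    "tech": ["news", "code"],
    "code": ["tech"],
}

uniform = biased_pagerank(links, {u: 0.25 for u in links})
personal = biased_pagerank(links, {"code": 1.0})  # user bookmarked "code"
print(round(uniform["code"], 3), round(personal["code"], 3))
```

Restarting only at the bookmarked page lifts that page and its neighbours relative to the uniform ranking, which is the intuition behind using bookmarks, searches or topics as the bias.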
Applications : Google • Google Inc. • Largest search engine • Technologies utilised by others (e.g. Yahoo!) • Biggest ever technology IPO, 2004 • Redefining search • Set a trend for other search providers • Raised importance of quality web search results • Combining information retrieval methods • Business model based on advertising • Potential area for conflict • Over 100 factors now influence results
Applications : PageRank • Back-link prediction • Desire for optimal web crawling strategy • Better indicator than citation counts! • Improving user navigation • ‘The PageRank Proxy’ • Providing PageRank information with links • Establishing trust • Wealth of authors on the web, who to trust? • Use PageRank to rate trust
Applications : The Future • Internal Development • Project no longer in academic realm • Lacks the transparency initially intended • Role of PageRank unclear • Likely focus on extensions and results tuning • External Development • APIs • Allowing innovative use of Google technologies • Open Source Code • Focused on developing infrastructure
Conclusions • Academic Background • Success from strong academic understanding • Raised profile of informatics and search • Good platform for future research • Success as a failure • Intention for transparency and use in academia • Commercial success has removed transparency • Potentially bad for further research in this area
Summary • We have seen: • The architecture used by Google • PageRank as a web metric • Strengths and potential manipulations • The commercial success of Google • Applications • Potential areas of future research
References • Work by Brin & Page (now at Google) • Brin, S., Page, L. (1998), ‘The anatomy of a large-scale hypertextual Web search engine’, Computer Networks and ISDN Systems, 30(1-7):107-117. • Page, L., Brin, S., Motwani, R. and Winograd, T. (1998), ‘The PageRank Citation Ranking: Bringing Order to the Web’, Stanford Digital Library Technologies Project. • More papers at: http://www.google.com on many aspects of web metrics and search in general • PageRank • http://www.iprcom.com/papers/pagerank/ • Take a look at the example at: http://www.dcs.warwick.ac.uk/~csucbu • http://en.wikipedia.org/wiki/Google_bomb
References (2) • Further Developments • Haveliwala, T. H. (1999), ‘Efficient computation of PageRank’, Technical report, Stanford University, Stanford, CA. • Haveliwala, T. H. (2002), ‘Topic-sensitive PageRank’, in Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, May 2002. • Commercial Aspect • http://money.cnn.com/2004/04/29/technology/google/ • http://www.google.com/corporate/history.html • Web Metrics • Dhyani, D., Ng, W. K. and Bhowmick, S. S. (2002), ‘A survey of Web metrics’, ACM Computing Surveys, 34(4):469-503.