“The Anatomy of a Large-Scale Hypertextual Web Search Engine” ’98: Google case. Angela Fogarolli, afogarol@dit.unitn.it, 07/06/2006
Roadmap • Google design goals • System features • PageRank • Anchor Text • Others • System architecture • System functionalities • Crawling • Indexing • Searching • Conclusion
Google design goals • Improve search quality • Improve search engine usability • Improve scalability on large web data.
System feature: PageRank PageRank is the probability that a random surfer visits a page. It is based on the citation (link) graph of the web. • It does not count links from all pages equally: it normalizes each link by the number of links on the page it comes from. • PageRank recursively propagates weights through the link structure of the web.
PageRank Calculation PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) Page A has pages T1…Tn which point to it • d is a damping factor, usually set to 0.85 • C(A) is the number of links going out of page A • A page has a high PageRank if many pages point to it, or if some pages that point to it have a high PageRank themselves.
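The formula above can be evaluated by repeated substitution until the values settle. The following is a minimal sketch, not Google's implementation: the function name `pagerank`, the fixed iteration count, the starting value of 1.0 and the three-page toy graph are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of distinct pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Assumed starting point: every page begins with the same value.
    pr = {page: 1.0 for page in pages}

    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each source T contributes PR(T) / C(T), where C(T) is its out-degree.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A and B both link to C, and C links back to A.
toy_links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
```

In this toy graph C ends up with the highest value because two pages point to it, A comes next because its single incoming link is from the highly ranked C, and B, which nobody links to, keeps only the base value 1-d.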
System feature: Anchor Text The text of a link is usually associated with the page the link is on. In addition, Google associates it with the page the link points to. Advantages: • Anchors often provide more accurate descriptions of web pages than the pages themselves. • Anchors may exist for documents which cannot be indexed by a text-based search engine (images, programs and databases).
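A small sketch of this double association follows. The function `index_anchor_text`, the tuple layout and the sample URLs are hypothetical; the point is only that a non-text target such as an image becomes searchable through the words of the links that point to it.

```python
from collections import defaultdict


def index_anchor_text(links):
    """Map each page URL to the set of anchor words credited to it.

    `links` is a list of (source_url, target_url, anchor_text) tuples.
    """
    words_for_page = defaultdict(set)
    for source, target, text in links:
        for word in text.lower().split():
            words_for_page[source].add(word)  # the page the link is on
            words_for_page[target].add(word)  # the page the link points to
    return words_for_page


# Hypothetical link from an HTML page to an image that has no text of its own.
sample_links = [
    ("http://a.example/index.html", "http://b.example/logo.png", "Company logo"),
]
anchor_index = index_anchor_text(sample_links)
```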
System features: Others • Extensive use of proximity in search: Google keeps location information for all hits. • Presentation details such as font size are taken into account when weighting hits.
System Architecture • Crawling is done by several distributed crawlers. • The fetched web pages are sent to the storeserver, which compresses them and stores them in a repository. • Each parsed web page has an ID number called a docID. • The indexer reads the repository, uncompresses the documents and parses them. Each doc is converted into a set of hits. The indexer distributes the hits into a set of barrels. The indexer also extracts the links from the web pages and stores them in an anchors file.
System Architecture • The URLresolver reads the anchors file and converts relative URLs into absolute URLs and then into docIDs. It generates a database of links, which are pairs of docIDs. • The link database is used to compute PageRank. • The sorter takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index.
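The URLresolver step can be sketched with the standard library's `urljoin`. Everything else here is an assumption for illustration: the name `resolve_links`, the (page URL, href) record format of the anchors data, and the policy of handing out docIDs as consecutive integers in order of first appearance.

```python
from urllib.parse import urljoin


def resolve_links(anchors):
    """Turn (page_url, href) records into a docID table and a list of docID pairs."""
    doc_ids = {}

    def doc_id(url):
        # Assign the next free integer the first time a URL is seen.
        return doc_ids.setdefault(url, len(doc_ids))

    link_pairs = []
    for page_url, href in anchors:
        absolute = urljoin(page_url, href)  # relative URL -> absolute URL
        link_pairs.append((doc_id(page_url), doc_id(absolute)))
    return doc_ids, link_pairs


# Hypothetical anchors records with relative hrefs.
anchors = [
    ("http://example.com/docs/index.html", "../about.html"),
    ("http://example.com/docs/index.html", "guide.html"),
    ("http://example.com/about.html", "docs/index.html"),
]
doc_ids, link_pairs = resolve_links(anchors)
```

The resulting `link_pairs` list plays the role of the link database: it is exactly the edge list a PageRank computation needs.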
System functionalities: Crawling • Google has a fast distributed crawling system: at peak speed it crawls over 100 web pages per second using 4 crawlers. • A single URLServer serves lists of URLs to a number of crawlers (typically 3).
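The division of labour between one URL server and several crawlers can be pictured with a queue that hands out batches of URLs. This is a toy sketch under stated assumptions: the class `URLServer` below, its methods, the batch size and the skipping of already-seen URLs are illustrative choices, not details given in the slides, and no fetching is performed.

```python
from collections import deque


class URLServer:
    """Hands out batches of pending URLs to whichever crawler asks next."""

    def __init__(self, seed_urls):
        self._queue = deque()
        self._seen = set()
        for url in seed_urls:
            self.add(url)

    def add(self, url):
        # Assumed policy: a URL is queued only the first time it is seen.
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def next_batch(self, size):
        batch = []
        while self._queue and len(batch) < size:
            batch.append(self._queue.popleft())
        return batch


server = URLServer([
    "http://a.example/",
    "http://b.example/",
    "http://a.example/",  # duplicate, ignored
    "http://c.example/",
])
# Three hypothetical crawlers each request a batch of two URLs.
batches = [server.next_batch(2) for _ in range(3)]
```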
System functionalities: Indexing • Parsing: the parser must handle a huge range of possible errors. • Indexing docs into barrels: each doc is parsed and encoded into a number of barrels. Every word is converted into a wordID using an in-memory hash table, the lexicon. Then the word occurrences are translated into hit lists and written into the forward barrels. • Sorting: to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full-text inverted barrel.
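The path from documents to an inverted index can be sketched in two steps. The code below is a simplification under stated assumptions: a hit is reduced to a word position, a single barrel is used instead of several, tokenization is a plain whitespace split, and the names `build_forward_barrel` and `invert` are hypothetical.

```python
from collections import defaultdict


def build_forward_barrel(docs, lexicon):
    """Return {docID: {wordID: [positions]}}; `lexicon` maps word -> wordID."""
    forward = {}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for position, word in enumerate(text.lower().split()):
            # New words get the next free wordID in the in-memory lexicon.
            word_id = lexicon.setdefault(word, len(lexicon))
            hits[word_id].append(position)
        forward[doc_id] = dict(hits)
    return forward


def invert(forward):
    """Re-sort the forward barrel by wordID: {wordID: [(docID, [positions])]}."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, positions in forward[doc_id].items():
            inverted[word_id].append((doc_id, positions))
    return dict(sorted(inverted.items()))


lexicon = {}
docs = {0: "web search engine", 1: "search the web"}
forward = build_forward_barrel(docs, lexicon)
inverted = invert(forward)
```

The forward barrel answers "which words occur in this doc", while the inverted barrel answers "which docs contain this word", which is the lookup a query needs.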
System functionalities: Searching The searcher is run by a web server and uses the lexicon together with the inverted index and PageRank to answer a query. Single word search • Google looks at the document’s hit list for that word. • It calculates the IR weight of the doc: count weight (based on the number of occurrences) x type weight. • It computes the final rank by combining the IR weight with PageRank. Multiple word search • Hits that occur close together in a doc are weighted higher than hits that occur far apart. • For every set of matched hits a proximity is computed, based on how far apart the hits are in the doc. • IR weight = count weight x type-prox weight
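A rough sketch of the scoring steps listed above follows. The slides give no numeric weights and do not say how the IR weight and PageRank are combined, so the values in `TYPE_WEIGHTS`, the cap in `count_weight`, the linear blend in `final_rank` and the use of the smallest position gap as a proximity measure are all assumptions chosen only to make the example concrete.

```python
# Hypothetical type weights; the slides do not give actual values.
TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "plain": 1.0}


def count_weight(count, cap=8):
    """Grow with the number of occurrences, then level off (assumed cap)."""
    return float(min(count, cap))


def ir_weight(hit_counts):
    """Sum of count weight x type weight over the hit types of one document."""
    return sum(
        count_weight(count) * TYPE_WEIGHTS[hit_type]
        for hit_type, count in hit_counts.items()
    )


def final_rank(hit_counts, pagerank, alpha=0.5):
    """Blend IR weight and PageRank; the linear blend is an assumption."""
    return alpha * ir_weight(hit_counts) + (1 - alpha) * pagerank


def min_distance(positions_a, positions_b):
    """Smallest gap between any hit of word A and any hit of word B in one doc."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


score = final_rank({"title": 1, "plain": 3}, pagerank=2.0)
gap = min_distance([0, 10], [3, 12])
```

Capping the count weight keeps a page that repeats a word many times from outranking a page that mentions it once in the title, which is the intent behind weighting hits by type rather than by raw frequency.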
Conclusion Google is a scalable architecture for: • gathering • indexing • searching web pages. It guarantees quality of search using PageRank, anchor text and proximity information.