“The Anatomy of a Large-Scale Hypertextual Web Search Engine” ‘98

“The Anatomy of a Large-Scale Hypertextual Web Search Engine” ‘98 Google case Angela Fogarolli afogarol@dit.unitn.it 07/06/2006

Roadmap • Google design goals • System features • Page Rank • Anchor Text • Others • System architecture • System functionalities • Crawling • Indexing • Searching • Conclusion

Google design goals • Improve search quality • Improve search engine usability • Improve scalability on large web data.

System feature: PageRank PageRank is the probability that a random surfer visits a page. PageRank is based on the citation (link) graph of the web. • It does not count links from all pages equally. It normalizes each page's links by the number of links on that page. • PageRank recursively propagates weights through the link structure of the web.

PageRank Calculation PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) Page A has pages T1…Tn which point to it • d is a damping factor, usually set to 0.85 • C(A) is the number of links going out of page A • A page has a high PageRank if many pages point to it, or if some pages that point to it have a high PageRank themselves.
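The formula on this slide can be evaluated by simple repeated substitution. The following is a minimal sketch, not the production algorithm: the function name `pagerank`, the toy graph `toy_links`, the starting value of 1.0 per page and the fixed iteration count are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to (hypothetical input format).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page T contributes PR(T) divided by its out-link count C(T).
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Toy graph: A links to B and C, B links to C, C links back to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)
```

In this toy graph C ends up ranked above B, because C receives B's whole weight plus half of A's, while B only receives half of A's. With this form of the formula, the scores of a graph in which every page has at least one out-link sum to the number of pages.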

System feature: Anchor Text Most search engines associate the text of a link with the page the link is on. In addition, Google associates it with the page the link points to. Advantages: • Anchors often provide more accurate descriptions of web pages than the pages themselves. • Anchors may exist for documents which cannot be indexed (images, programs and databases).
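As a rough illustration of crediting anchor text to the target page, the sketch below groups link texts by the URL they point to. The function name `index_anchor_text`, the tuple layout and the example URLs are hypothetical; only the target-side association is shown.

```python
from collections import defaultdict


def index_anchor_text(links):
    """Credit each link's text to the page it points to.

    `links` is a list of (source_url, target_url, anchor_text) tuples
    (an assumed input format for this sketch).
    """
    anchors_by_target = defaultdict(list)
    for source_url, target_url, anchor_text in links:
        anchors_by_target[target_url].append(anchor_text)
    return dict(anchors_by_target)


# An image cannot be indexed by its own content, but its anchors describe it.
sample_links = [
    ("http://a.example/", "http://b.example/logo.png", "company logo"),
    ("http://c.example/", "http://b.example/logo.png", "b.example logo image"),
]
anchor_index = index_anchor_text(sample_links)
```

Here the image URL becomes searchable through the two anchor texts even though the image itself contains no indexable words.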

System features: Others • Extensive use of proximity in search: Google keeps location information for all hits. • Presentation details such as font size are taken into account when weighting hits.

System Architecture • Several distributed crawlers fetch the web pages. • The fetched web pages are sent to the storeserver, which compresses them and stores them in a repository. • Each parsed web page has an ID number called a docID. • The indexer reads the repository, uncompresses the documents and parses them. Each doc is converted into a set of hits. The indexer distributes the hits into a set of barrels. The indexer also takes the links in the web pages and stores them in an anchors file.

System Architecture • The URLresolver reads the anchors file and converts relative URLs into absolute URLs and then into docIDs. It generates a database of links, which are pairs of docIDs. • The link database is used to compute PageRank. • The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index.
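The URL-resolving step can be sketched with the standard library's `urllib.parse.urljoin`. This is only an illustration: the function name `resolve_links`, the in-memory `doc_ids` dictionary and the sequential docID assignment are assumptions, not details taken from the slides.

```python
from urllib.parse import urljoin


def resolve_links(anchors, doc_ids):
    """Turn (source_url, href) records into (source_docID, target_docID) pairs.

    `doc_ids` maps absolute URL -> docID and is extended as new URLs appear
    (sequential numbering is an assumption of this sketch).
    """
    link_pairs = []
    for source_url, href in anchors:
        absolute = urljoin(source_url, href)  # relative URL -> absolute URL
        for url in (source_url, absolute):
            if url not in doc_ids:
                doc_ids[url] = len(doc_ids)
        link_pairs.append((doc_ids[source_url], doc_ids[absolute]))
    return link_pairs


doc_ids = {}
anchors = [
    ("http://example.com/a/index.html", "../b/page.html"),
    ("http://example.com/b/page.html", "http://example.com/a/index.html"),
]
pairs = resolve_links(anchors, doc_ids)
```

The resulting list of docID pairs plays the role of the link database that feeds the PageRank computation.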

System functionalities: Crawling • Google has a fast distributed crawling system: at peak it crawls over 100 web pages per second using 4 crawlers. • A single URLserver serves lists of URLs to a number of crawlers (typically 3).

System functionalities: Indexing • Parsing: the parser must handle a huge range of possible errors. • Indexing docs into barrels: each doc is parsed and encoded into a number of barrels. Every word is converted into a wordID using an in-memory hash table, the lexicon. Then the word occurrences are translated into hit lists and written into the forward barrels. • Sorting: to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full-text inverted barrel.
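The indexing and sorting steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions: a single in-memory list stands in for the forward barrels, hits carry only a word position, and the names `build_forward_index` and `invert` are hypothetical.

```python
def build_forward_index(docs, lexicon):
    """Return (doc_id, word_id, position) hits ordered by docID.

    `lexicon` maps word -> wordID and grows as new words are seen.
    """
    hits = []
    for doc_id, text in sorted(docs.items()):
        for position, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.append((doc_id, word_id, position))
    return hits


def invert(forward_hits):
    """Re-sort hits by wordID and group them as word_id -> {doc_id: [positions]}."""
    inverted = {}
    by_word = sorted(forward_hits, key=lambda hit: (hit[1], hit[0], hit[2]))
    for doc_id, word_id, position in by_word:
        inverted.setdefault(word_id, {}).setdefault(doc_id, []).append(position)
    return inverted


lexicon = {}
docs = {1: "web search engine", 2: "search the web"}
forward = build_forward_index(docs, lexicon)
inverted = invert(forward)
```

Looking up `inverted[lexicon["search"]]` then returns every document containing the word together with the positions of its hits, which is the information the searcher needs.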

System functionalities: Searching The searcher is run by a web server and uses the lexicon together with the inverted index and PageRank to answer a query. Single word search • Google looks at the document's hit list for that word. • It calculates the IR weight of the doc: count weight (number of occurrences) x type weight. • It computes the final rank by combining the IR weight with the PageRank. Multiple word search • Hits that occur close together in a doc are weighted higher than hits occurring far apart. • For every set of matched hits, a proximity is computed. Proximity is based on how far apart the hits are in the doc. • IR weight = count weight x type-prox weight
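A toy version of the single-word scoring and of the proximity measure might look as follows. Every number and name here is an assumption for illustration: the slides do not give the type weights, how counts are converted into count weights, or how the IR weight and PageRank are combined, so `TYPE_WEIGHTS`, `COUNT_CAP`, `alpha` and the weighted sum are hypothetical choices.

```python
TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "plain": 1.0}  # hypothetical values
COUNT_CAP = 8  # hypothetical: counts stop adding weight beyond this point


def ir_weight(hit_counts):
    """Sum of count weight x type weight over the hit types of one document."""
    return sum(
        min(count, COUNT_CAP) * TYPE_WEIGHTS[hit_type]
        for hit_type, count in hit_counts.items()
    )


def final_rank(hit_counts, pagerank, alpha=0.5):
    """Combine IR weight and PageRank; the weighted sum is an assumed rule."""
    return alpha * ir_weight(hit_counts) + (1 - alpha) * pagerank


def min_distance(positions_a, positions_b):
    """Smallest gap between any hit of word A and any hit of word B in one doc."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


score = final_rank({"title": 1, "plain": 3}, pagerank=2.0)
gap = min_distance([2, 40], [5, 41])
```

Capping the count keeps a page from gaining rank simply by repeating a word many times, and a smaller `min_distance` would map to a higher proximity weight in a multi-word query.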

Conclusion Google is a scalable architecture for: • gathering • indexing • searching web pages. It improves the quality of search results using PageRank, anchor text and proximity information.
