Web Search – Summer Term 2006 VI. Web Search - Indexing

(c) Wolfgang Hürst, Albert-Ludwigs-University. Indexing in the 1st Google engine.


Presentation Transcript


  1. Web Search – Summer Term 2006. VI. Web Search - Indexing. (c) Wolfgang Hürst, Albert-Ludwigs-University

  2. Indexing in the 1st Google engine
     - Parsing of the HTML pages in the repository
     - Indexing of the documents
       - Store indexed documents in barrels
       - Encode each word as a wordID
       - Create a lexicon that maps words to wordIDs
       - Store hit lists in forward barrels
       (Note: the indexing process is parallelized)
     - Sorting
       - Sort anchor and title hits from the forward barrels into inverted barrels, and all other hits into full-text inverted barrels
     Now: description of the major data structures
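The indexing steps on this slide can be illustrated with a minimal in-memory sketch. It is not the original implementation: the function names (`build_forward_index`, `invert`), the whitespace tokenizer, and the use of plain word positions as "hits" are all assumptions made for illustration, and barrels are reduced to ordinary dictionaries.

```python
from collections import defaultdict


def build_forward_index(docs, lexicon):
    """Map each docID to {wordID: [positions]}, growing the lexicon on the fly.

    docs is {doc_id: text}; lexicon is a dict word -> wordID (hypothetical layout).
    """
    forward = {}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for position, word in enumerate(text.lower().split()):
            # Assign the next free wordID the first time a word is seen.
            word_id = lexicon.setdefault(word, len(lexicon))
            hits[word_id].append(position)
        forward[doc_id] = dict(hits)
    return forward


def invert(forward):
    """Re-sort the forward index by wordID: {wordID: [(docID, hits), ...]}."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, hits in forward[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(sorted(inverted.items()))


docs = {1: "web search indexing", 2: "web crawling"}
lexicon = {}
forward = build_forward_index(docs, lexicon)
inverted = invert(forward)
```

Here `forward` plays the role of the forward barrels and `inverted` the role of the inverted barrels; the separation into anchor/title barrels and full-text barrels is left out to keep the sketch short.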

  3. Architecture of the 1st Google Search Engine (cf. [2], Fig. 1)
     - Components shown in the figure: Crawlers, URL Server, Store Server, Repository, Indexer, Anchors, URL Resolver, Links, Doc Index, Lexicon, DumpLexicon, Barrels, Sorters, PageRank, Searcher
     - Repository: each entry consists of DOCID, ECODE, URL_LEN, PAGE_LEN, URL, PAGE
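The repository layout (DOCID, ECODE, URL_LEN, PAGE_LEN, URL, PAGE) is a length-prefixed record format. The following sketch shows how such records could be written and read back with Python's `struct` module. The field widths in `HEADER`, the little-endian byte order, and the UTF-8 encoding are assumptions chosen for illustration; the slide does not specify them, and compression is omitted.

```python
import struct

# Assumed header layout: docid (uint32), ecode (uint8),
# url_len (uint16), page_len (uint32); "<" means no padding.
HEADER = struct.Struct("<IBHI")


def pack_record(docid, ecode, url, page):
    """Serialize one repository record as header + URL bytes + page bytes."""
    url_b = url.encode("utf-8")
    page_b = page.encode("utf-8")
    return HEADER.pack(docid, ecode, len(url_b), len(page_b)) + url_b + page_b


def unpack_record(buf, offset=0):
    """Read one record starting at offset; return (record, next_offset)."""
    docid, ecode, url_len, page_len = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    url = buf[start:start + url_len].decode("utf-8")
    page = buf[start + url_len:start + url_len + page_len].decode("utf-8")
    return (docid, ecode, url, page), start + url_len + page_len


repo = pack_record(1, 0, "http://a.example/", "<html>A</html>")
repo += pack_record(2, 0, "http://b.example/", "<html>B</html>")
first, nxt = unpack_record(repo)
second, end = unpack_record(repo, nxt)
```

Because every record carries its own lengths, records can be appended one after another and scanned sequentially without a separate index.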

  4. Architecture of the 1st Google Search Engine: Document Index (cf. [2], Fig. 1)
     - Document index: docID ->
       - Current document status
       - Pointer to repository
       - Document checksum
       - Various statistics
       - Document info (URL + title) if the document has been crawled
       - Pointer to URL list otherwise
     - Additional file to convert URLs to docIDs: URL checksum -> docID
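The "URL checksum -> docID" file can be pictured as a table sorted by checksum that is searched with binary search. The sketch below is one plausible way to do this, not the original code: the choice of CRC-32 as checksum and the names `url_checksum`, `build_table`, and `lookup` are assumptions, and checksum collisions are ignored.

```python
import bisect
import zlib


def url_checksum(url):
    """Hypothetical checksum function; CRC-32 is used only as a stand-in."""
    return zlib.crc32(url.encode("utf-8"))


def build_table(url_to_docid):
    """Return a list of (checksum, docID) pairs sorted by checksum."""
    return sorted((url_checksum(u), d) for u, d in url_to_docid.items())


def lookup(table, url):
    """Binary-search the sorted table; return the docID or None if absent."""
    key = url_checksum(url)
    # (key,) sorts before every (key, docID), so this finds the first match.
    i = bisect.bisect_left(table, (key,))
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None


table = build_table({"http://a.example/": 1, "http://b.example/": 2})
```

With this table, `lookup(table, "http://b.example/")` yields the docID stored for that URL, and a lookup in an empty table yields `None`.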

  5. Architecture of the 1st Google Search Engine: Anchors and Links (cf. [2], Fig. 1)
     - Anchors: source, destination, and anchor text
     - Links: pairs of docIDs

  6. Architecture of the 1st Google Search Engine: Lexicon and Barrels (cf. [2], Fig. 1)
     - Inverted index: word -> document
       - Lexicon: WORDID, NDOCS
       - Inverted barrels: for each WORDID, NDOCS entry a list of DOCID, NO-OF-HITS, HIT1, HIT2, ...
     - Forward index: document -> word
       - Forward barrels: for each DOCID a list of WORDID, NO-OF-HITS, HIT1, HIT2, ..., terminated by a NULL WORDID
     - Hits:
       - Fancy hit (URL, title, anchor text, meta tag): capitalization, font size, type, position in document
       - Plain hit (everything else): capitalization, font size, position in document
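The two hit types can be made concrete by packing each hit into a 16-bit integer. The bit widths used below (1 bit capitalization, 3 bits font size, 12 bits position for plain hits; font size 7 as a marker for fancy hits, followed by 4 bits type and 8 bits position) follow the compact encoding described in [2] as far as I recall it; treat them, and the function names, as assumptions rather than as a specification.

```python
FANCY_FONT = 0b111  # Assumed marker value: font size 7 signals a fancy hit.


def encode_plain(capitalized, font_size, position):
    """Pack a plain hit: 1 bit cap | 3 bits font size | 12 bits position."""
    if not 0 <= font_size < FANCY_FONT:
        raise ValueError("font size must be in 0..6 for plain hits")
    position = min(position, 0xFFF)  # Clamp positions that do not fit.
    return (int(capitalized) << 15) | (font_size << 12) | position


def encode_fancy(capitalized, hit_type, position):
    """Pack a fancy hit: 1 bit cap | 111 | 4 bits type | 8 bits position."""
    if not 0 <= hit_type <= 0xF:
        raise ValueError("hit type must fit in 4 bits")
    position = min(position, 0xFF)
    return (int(capitalized) << 15) | (FANCY_FONT << 12) | (hit_type << 8) | position


def decode(hit):
    """Unpack a 16-bit hit into a dictionary of its fields."""
    cap = bool(hit >> 15)
    font = (hit >> 12) & 0b111
    if font == FANCY_FONT:
        return {"kind": "fancy", "cap": cap,
                "type": (hit >> 8) & 0xF, "pos": hit & 0xFF}
    return {"kind": "plain", "cap": cap, "font": font, "pos": hit & 0xFFF}


plain = encode_plain(True, 3, 100)
fancy = encode_fancy(False, 1, 2)
```

Reserving one font-size value as a marker lets both hit types share the same two bytes, at the cost of a shorter position field for fancy hits.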

  7. Architecture of the 1st Google Search Engine: PageRank (cf. [2], Fig. 1)

  8. Query Processing
     - Repository: DOCID, ECODE, URLLEN, PAGELEN, URL, PAGE
     - Document index: docID -> current document status, pointer to repository, document checksum, various statistics, document info (URL + title)
     - PageRank
     - Hit list: capitalization, font size, type, position in document
     - Lexicon: WORDID, NDOCS
     - Inverted index / barrels: DOCID, NO-OF-HITS, HIT1, HIT2, ...
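How these structures could interact at query time is sketched below: query words are mapped to wordIDs via the lexicon, the corresponding document lists are fetched from the inverted index, intersected, and ranked. The scoring rule (number of hits multiplied by PageRank) is a deliberately simple placeholder and an assumption of this sketch, as are the function name `search` and the dictionary-based data layout; the ranking function of the actual engine is more elaborate.

```python
def search(query, lexicon, inverted, pagerank, k=10):
    """Return up to k docIDs that contain every query word, best first."""
    word_ids = []
    for word in query.lower().split():
        if word not in lexicon:
            return []  # An unknown word cannot match any document.
        word_ids.append(lexicon[word])
    if not word_ids:
        return []

    # One {docID: hits} mapping per query word.
    doclists = [dict(inverted.get(w, [])) for w in word_ids]
    candidates = set(doclists[0])
    for doclist in doclists[1:]:
        candidates &= set(doclist)

    def score(doc_id):
        # Placeholder ranking: total hit count weighted by PageRank.
        n_hits = sum(len(doclist[doc_id]) for doclist in doclists)
        return n_hits * pagerank.get(doc_id, 0.0)

    return sorted(candidates, key=lambda d: (-score(d), d))[:k]


lexicon = {"web": 0, "search": 1, "crawling": 2}
inverted = {
    0: [(1, [0]), (2, [0, 4])],
    1: [(1, [1]), (2, [1])],
    2: [(3, [0])],
}
pagerank = {1: 0.5, 2: 0.2, 3: 0.3}
results = search("web search", lexicon, inverted, pagerank)
```

In this toy data set both documents 1 and 2 contain "web" and "search"; document 1 is ranked first because its higher PageRank outweighs the extra hit in document 2.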

  9. Further reading
     Note: The information above comes from a 1998 paper (with a collection of 25 million pages). Newer information about the infrastructure and data structures used by Google (today?) can be found in the following references, which are available at http://labs.google.com/papers/:
     - Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
     - Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google File System
     - Luiz Andre Barroso, Jeffrey Dean, Urs Hoelzle: Web Search for a Planet: The Google Cluster Architecture

  10. References - Indexing
      [1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001. Chapter 4 (Indexing)
      [2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998. Chapter 4 (System Anatomy)
      [3] S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina: "Building a Distributed Full-Text Index for the Web", ACM Transactions on Information Systems, Vol. 13/3, July 2001
