Web Search – Summer Term 2006 VI. Web Search - Indexing

Presentation Transcript


  1. Web Search – Summer Term 2006. VI. Web Search - Indexing. (c) Wolfgang Hürst, Albert-Ludwigs-University

  2. General Web Search Engine Architecture
     [Figure: architecture diagram with the components Client, WWW, Crawler(s), Crawl Control, Page Repository, Collection Analysis Module, Indexer Module, Indexes (Text, Structure, Utility), Query Engine and Ranking, and the flows Queries, Results and Usage Feedback. Cf. [1], Fig. 1.]

  3. Types of (generic) indexes
     1. Text index = "traditional", text-based index
     "Inverted files have traditionally been the index structure choice of the web" [3]
     Main purpose: identification and selection of relevant pages
     Special characteristics:
     - Size and rate of change
     - Consider anchor text and surrounding text

  4. Types of (generic) indexes
     2. Structure / link index = description of the linkage between web pages
     Usually modeled as a graph (nodes = pages, directed edges = links)
     Main purpose: provide structure information (esp. neighborhood relationships), usually to create the ranking
     Problem: requires a scalable and efficient representation of a VERY large graph
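The graph model on this slide can be illustrated with a small adjacency-set sketch. It assumes page identifiers are plain strings and that the whole graph fits in memory, which is exactly what does not hold at web scale; the function name `build_link_index` is hypothetical.

```python
from collections import defaultdict

def build_link_index(links):
    """Build forward and backward adjacency sets from (source, target) pairs."""
    out_links = defaultdict(set)  # page -> pages it links to
    in_links = defaultdict(set)   # page -> pages linking to it
    for src, dst in links:
        out_links[src].add(dst)
        in_links[dst].add(src)
    return out_links, in_links

# Toy graph: nodes are pages, directed edges are hyperlinks.
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
out_links, in_links = build_link_index(links)
print(sorted(out_links["a"]))  # forward neighborhood of page a
print(sorted(in_links["c"]))   # backward neighborhood of page c
```

Keeping both directions makes neighborhood queries cheap in either direction, at the cost of storing every edge twice.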

  5. Types of (generic) indexes
     3. Utility index: stores additional, search-engine-dependent information needed for page selection and relevance estimation, e.g.
     - PageRank
     - Site index
     - Special site-related characteristics etc.
     Main purpose: usually to speed up processing time

  6. Text Index (= Inverted File)
     Inverted file: generally a mapping term -> document (web page)
     - Posting (t, l): pair of term t and location l
     - Sometimes: payload field to store additional information
     In addition: lexicon (dictionary) with
     - List of all terms in the index
     - Related statistics (IDF, ...)
     Note: similar to traditional IR, but size and rate of change require special techniques
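A minimal sketch of the structure described on this slide, for illustration only. It assumes that a location is a (document id, word position) pair, that tokenization is plain whitespace splitting, and that IDF is computed as log(N / df); none of these choices are stated in the slides, and `build_inverted_index` is a hypothetical name.

```python
import math
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> text. Returns (index, lexicon)."""
    index = defaultdict(list)  # term -> list of locations (doc_id, position)
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id].lower().split()):
            index[term].append((doc_id, position))
    n_docs = len(docs)
    lexicon = {}
    for term, postings in index.items():
        df = len({doc_id for doc_id, _ in postings})  # document frequency
        lexicon[term] = {"df": df, "idf": math.log(n_docs / df)}
    return dict(index), lexicon

docs = {1: "cat dog cat", 2: "dog rat", 3: "cat"}
index, lexicon = build_inverted_index(docs)
print(index["cat"])    # all locations of the term "cat"
print(lexicon["dog"])  # per-term statistics kept in the lexicon
```

The lexicon is kept separate from the posting lists so that per-term statistics can be looked up without touching the (much larger) lists.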

  7. The WebBase System as an example for a distributed text index [1,3]
     [Figure: WebBase architecture diagram with the labels Web Pages, Distributors, Indexers, Statistician, Intermediate Runs, Inverted Index, Query Servers, Stage 1 and Stage 2.]

  8. WebBase Architecture - 3 Types of Nodes
     [Figure: WebBase architecture diagram highlighting the three node types: distributors, indexers and query servers.]

  9. WebBase Indexing Process - 2 Stages
     [Figure: WebBase architecture diagram with Stage 1 and Stage 2 marked.]

  10. WebBase - Distributed inverted index organization
      Two strategies:
      - Local inverted files
      - Global inverted files
      [Figure: WebBase architecture diagram.]
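The slide only names the two strategies. As the terms are commonly used (and as far as the cited sources [1,3] describe them), a local inverted file partitions the index by document, so every node holds lists for its own subset of pages, while a global inverted file partitions by term, so each node holds the complete list for its subset of terms. The sketch below illustrates the difference; the modulo assignment and the byte-sum function `term_node` are assumptions made only to keep the example deterministic.

```python
from collections import defaultdict

def term_node(term, n_nodes):
    """Hypothetical deterministic term-to-node assignment (byte sum modulo)."""
    return sum(term.encode("utf-8")) % n_nodes

def local_partition(postings, n_nodes):
    """Partition by document: each node indexes a disjoint subset of pages."""
    nodes = [defaultdict(list) for _ in range(n_nodes)]
    for term, doc_id in postings:
        nodes[doc_id % n_nodes][term].append(doc_id)
    return nodes

def global_partition(postings, n_nodes):
    """Partition by term: each node holds full lists for a subset of terms."""
    nodes = [defaultdict(list) for _ in range(n_nodes)]
    for term, doc_id in postings:
        nodes[term_node(term, n_nodes)][term].append(doc_id)
    return nodes

postings = [("cat", 1), ("cat", 2), ("dog", 2), ("rat", 3)]
local_nodes = local_partition(postings, 2)
global_nodes = global_partition(postings, 2)
print([dict(n) for n in local_nodes])   # "cat" appears on both nodes
print([dict(n) for n in global_nodes])  # "cat" appears on exactly one node
```

With local files a query has to be sent to every node; with global files it only goes to the nodes that own the query terms.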

  11. WebBase - Parallelizing the indexing process
      [Figure: WebBase architecture diagram.]

  12. Parallel index construction (Indexers)
      Input: stream of web pages from the repository
      Output: sorted runs / intermediate runs (sorted postings of a subset of the repository)
      [Figure: three in-memory phases: loading (web pages), processing (parsing, tokenization, sorting) and flushing (sorted runs).]

  13. Parallel index construction (Indexers)
      Software pipeline to create sorted runs (multi-threaded execution)
      [Figure: timeline showing loading (L), processing (P) and flushing (F) phases overlapping over time.]
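The three phases from the two slides above can be sketched sequentially as follows. This is a simplified illustration, not the WebBase implementation: the phases run one after another instead of in overlapping threads, runs stay in memory instead of being written to disk, and the names `make_sorted_runs`, `merge_runs` and `buffer_pages` are hypothetical. The final merge step corresponds to combining intermediate runs into one sorted posting sequence.

```python
import heapq

def make_sorted_runs(pages, buffer_pages):
    """Turn a stream of (doc_id, text) pages into sorted runs of postings."""
    runs = []
    for start in range(0, len(pages), buffer_pages):
        batch = pages[start:start + buffer_pages]   # loading: fill the buffer
        postings = [(term, doc_id)                  # processing: parse, tokenize
                    for doc_id, text in batch
                    for term in text.lower().split()]
        postings.sort()                             # processing: sort postings
        runs.append(postings)                       # flushing: emit a sorted run
    return runs

def merge_runs(runs):
    """Merge already-sorted runs into one sorted sequence of postings."""
    return list(heapq.merge(*runs))

pages = [(1, "dog cat"), (2, "rat cat"), (3, "cat dog")]
runs = make_sorted_runs(pages, buffer_pages=2)
merged = merge_runs(runs)
print(runs)
print(merged)
```

In a pipelined version, one thread would load the next buffer while a second processes the current one and a third flushes the previous one, which is what the L/P/F timeline on the slide depicts.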

  14. WebBase - Collecting global statistics
      [Figure: WebBase architecture diagram.]

  15. Collecting global statistics (Statistician)
      - Avoid disk accesses (expensive!): communicate with the statistician only if data is already in memory (i.e. during merging or flushing)
      - Avoid intensive communication between indexer and statistician: only send partly sorted (summarized) postings
      Two strategies to collect statistical information on term level:
      - ME strategy (during merging)
      - FL strategy (during flushing)

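  16. ME strategy
      [Figure: worked example with two indexers and the statistician.]
      - Indexer 1 (inverted lists): CAT (6,2) (3,1); DOG (8,3); RAT (8,3) (4,1). Sends (DOG, 1), (CAT, 2), (RAT, 2).
      - Indexer 2 (inverted lists): CAT (4,2) (3,3) (7,1); DOG (5,2) (9,1). Sends (DOG, 2), (CAT, 3).
      - Statistician: aggregates to (DOG, 3), (CAT, 5), (RAT, 2).
      - Indexer 1 (lexicon) receives DOG: 3, CAT: 5, RAT: 2; indexer 2 (lexicon) receives DOG: 3, CAT: 5.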
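The ME example can be traced with a short sketch that reuses the numbers from the slide. It assumes that the summary sent for a term is the number of postings in that indexer's inverted list, which matches the figures shown; the function names are hypothetical.

```python
from collections import Counter

def summarize(inverted_lists):
    """Per-term summary an indexer sends while merging: (term, local count)."""
    return {term: len(postings) for term, postings in inverted_lists.items()}

def aggregate(summaries):
    """Statistician side: add up the local counts of all indexers."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

def reply(total, summary):
    """Global statistics returned to an indexer, only for terms it reported."""
    return {term: total[term] for term in summary}

indexer1 = {"cat": [(6, 2), (3, 1)], "dog": [(8, 3)], "rat": [(8, 3), (4, 1)]}
indexer2 = {"cat": [(4, 2), (3, 3), (7, 1)], "dog": [(5, 2), (9, 1)]}

s1, s2 = summarize(indexer1), summarize(indexer2)
total = aggregate([s1, s2])
print(dict(total))
print(reply(total, s2))  # lexicon statistics for indexer 2
```

Each indexer only receives statistics for terms it actually holds, which is why indexer 2 gets no entry for RAT.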

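  17. FL strategy
      [Figure: worked example with two indexers and the statistician's hash table, shown during and after processing.]
      - Indexer 1 (sorted runs): CAT (6,1), DOG (8,3), CAT (2,1), CAT (6,2), RAT (4,3), RAT (8,1). Sends (CAT, 1), (DOG, 1), (CAT, 2), (RAT, 2).
      - Indexer 2 (sorted runs): DOG (4,2), CAT (5,2), DOG (5,1), DOG (7,2). Sends (DOG, 1), (CAT, 1), (DOG, 2).
      - Statistician hash table during processing: DOG ?, CAT ?, RAT ?; after processing: DOG 4, CAT 4, RAT 2.
      - Indexer 1 (lexicon) receives DOG: 4, CAT: 4, RAT: 2; indexer 2 (lexicon) receives DOG: 4, CAT: 4.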
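A rough sketch of the FL idea using the slide's numbers: every time an indexer flushes a sorted run, it sends a per-term summary of that run to the statistician, which accumulates the counts in a hash table. The grouping of postings into individual runs below is an assumption inferred from the messages in the figure, and the class and function names are hypothetical.

```python
from collections import Counter

class Statistician:
    """Accumulates per-term counts in an in-memory hash table."""
    def __init__(self):
        self.table = Counter()

    def receive(self, summary):
        self.table.update(summary)

    def global_counts(self, terms):
        return {term: self.table[term] for term in terms}

def flush_run(run, statistician):
    """Summarize a run while it is still in memory, then emit it sorted."""
    statistician.receive(Counter(term for term, _ in run))
    return sorted(run)

# Run boundaries are assumed; postings are (term, (x, y)) pairs from the slide.
indexer1_runs = [
    [("cat", (6, 1)), ("dog", (8, 3))],
    [("cat", (2, 1)), ("cat", (6, 2))],
    [("rat", (4, 3)), ("rat", (8, 1))],
]
indexer2_runs = [
    [("dog", (4, 2))],
    [("cat", (5, 2)), ("dog", (5, 1)), ("dog", (7, 2))],
]

stat = Statistician()
for run in indexer1_runs + indexer2_runs:
    flush_run(run, stat)
print(dict(stat.table))
print(stat.global_counts(["dog", "cat"]))  # statistics for indexer 2's lexicon
```

Because the counts are final only after all runs have been flushed, the table entries are unknown ("?") during processing, as the slide indicates.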

  18. Summary: ME vs. FL strategy
      General observations:
      - Relatively low overhead (both strategies)
      - Confirmed experimentally ("less than 5% for a 2 million page collection")
      Summary of characteristics (+/-):
                        Statistician load   Memory usage   Parallelism
        ME (merging)    +-                  +              +-
        FL (flushing)   -                   -              ++

  19. The WebBase System - Summary
      [Figure: WebBase architecture diagram with the labels Web Pages, Distributors, Indexers, Statistician, Intermediate Runs, Inverted Index, Query Servers, Stage 1 and Stage 2.]

  20. References - Indexing
      [1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001. Chapter 4 (Indexing)
      [2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998. Chapter 4 (System Anatomy)
      [3] S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina: "Building a Distributed Full-Text Index for the Web", ACM Transactions on Information Systems, Vol. 19/3, July 2001
