
High Performance Index Build Algorithms for Intranet Search Engines



Presentation Transcript


  1. High Performance Index Build Algorithms for Intranet Search Engines Marcus Fontoura, Eugene Shekita, Jason Zien, Sridhar Rajagopalan, Andreas Neumann fontoura@almaden.ibm.com

  2. Agenda
     • Overview and problem description
     • Global analysis
     • Major data structures for index build
     • Index build algorithm

  3. Overview and problem description
     • Trevi's goal is to provide high-quality intranet search capability to corporate portals such as w3.ibm.com
     • Trevi is a scalable text search engine being developed by a joint IBM Research and Software Group team
     • This talk focuses on how to efficiently incorporate global analysis into the index build process

  4. Global analysis (GA)
     • Duplicate detection
         • Computes a fingerprint for each page (64-bit shingle)
         • Masters are identified using the (previous) static rank
     • Anchor text (D1: <a href="D2">Trevi</a>)
         • Appends anchor text tokens to the documents they point to
     • Static rank
         • Host in-degree, i.e., the number of hosts that point to a page (~ PageRank on the IBM intranet)
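The three analyses on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Trevi's implementation: the slide does not say how the 64-bit shingle is computed, so the sketch takes the minimum 64-bit hash over all k-token shingles (a min-hash), and the names shingle_fingerprint, pick_masters and host_in_degree are hypothetical. It also assumes that links from a page's own host count toward its in-degree, which the slide does not specify.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlparse


def shingle_fingerprint(tokens, k=4):
    """Return a 64-bit fingerprint: the minimum hash over all k-token shingles.

    Assumption: min-hash over word shingles; the slide only says "64 bit shingle".
    """
    if len(tokens) < k:
        shingles = [" ".join(tokens)]
    else:
        shingles = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    return min(
        int.from_bytes(hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest(), "big")
        for s in shingles
    )


def pick_masters(docs, prev_rank):
    """Group documents by fingerprint and keep one master per group.

    The master is the document with the highest previous static rank
    (ties broken by URL so the result is deterministic).
    """
    groups = defaultdict(list)
    for url, tokens in docs.items():
        groups[shingle_fingerprint(tokens)].append(url)
    return {
        fp: max(urls, key=lambda u: (prev_rank.get(u, 0), u))
        for fp, urls in groups.items()
    }


def host_in_degree(links):
    """Count, for each target URL, the number of distinct hosts linking to it.

    links is an iterable of (source_url, target_url) pairs.
    """
    hosts = defaultdict(set)
    for src, dst in links:
        hosts[dst].add(urlparse(src).netloc)
    return {dst: len(h) for dst, h in hosts.items()}
```

Two links from the same host count once in host_in_degree, which is what distinguishes host in-degree from plain link in-degree.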

  5. Index build requires GA
     • Rebuild the inverted text index and update the global analysis (GA)
         • Duplicate documents are deleted from the index
         • Anchor text is indexed together with the document's content
         • Static rank gives the index ordering, allowing for early termination during query evaluation
     • The time to rebuild the index will be dominated by the GA time as the analyses get more complex
         • Semantic search
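To illustrate why a rank-ordered index allows early termination, here is a minimal Python sketch. It assumes that document ids are assigned in descending static rank order (id 0 is the best document), that results are ranked by static rank alone, and that the query is a conjunction of terms; the function name top_k_conjunctive is hypothetical, and the sets built for the longer lists are a simplification that a real engine would replace with a streaming merge.

```python
def top_k_conjunctive(postings, terms, k):
    """Return the k best documents containing every term.

    postings maps a term to its ascending list of doc ids. Because ids
    follow static rank order, the first k matches are the top k results
    and the scan can stop there.
    """
    lists = [postings.get(t, []) for t in terms]
    if not lists or any(not lst for lst in lists):
        return []
    lists.sort(key=len)                      # drive the scan with the shortest list
    others = [set(lst) for lst in lists[1:]]
    hits = []
    for doc in lists[0]:
        if all(doc in s for s in others):
            hits.append(doc)
            if len(hits) == k:
                break                        # early termination
    return hits
```

With postings {"index": [0, 2, 3, 5, 8], "build": [1, 2, 5, 8, 9]} and k = 2, the scan stops at document 5 and never looks at document 8.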

  6. Major data structures
     • Store
         • Storage for the tokenized version of each document
     • Index
         • Inverted text index over the Store
     • Delta store and delta index
         • Small versions of the Store and Index with new and modified documents
         • Allow for hourly updates of the Index content

  7. Index build algorithm (1/3)
     • Index build merges the current version of the Store (Store_i) with the current version of the DeltaStore and generates the new version of the Store and the new Index, Store_{i+1} and Index_{i+1}
     [Figure: Index Build takes Store_i and DeltaStore as inputs and produces Store_{i+1} and Index_{i+1}]
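The merge step can be sketched as follows. This is a toy in-memory model, not the on-disk format used by Trevi: it assumes the Store and DeltaStore are dictionaries mapping a document id to its token list, that a DeltaStore entry replaces the Store entry with the same id, and the names merge_stores and build_index are hypothetical.

```python
from collections import defaultdict


def merge_stores(store, delta_store):
    """Produce the next Store: DeltaStore entries add to or replace Store entries."""
    new_store = dict(store)
    new_store.update(delta_store)
    return new_store


def build_index(store):
    """Build an inverted index: token -> list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id in sorted(store):
        for pos, token in enumerate(store[doc_id]):
            index[token].append((doc_id, pos))
    return dict(index)


# One build cycle: the current Store plus the DeltaStore give the next Store and Index.
store_i = {1: ["intranet", "search"], 2: ["index", "build"]}
delta_store = {2: ["fast", "index", "build"], 3: ["search", "engine"]}
store_next = merge_stores(store_i, delta_store)
index_next = build_index(store_next)
```

Here document 2 is a modified document and document 3 is new, so both come from the DeltaStore, while document 1 is carried over unchanged.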

  8. Index build algorithm (2/3)
     • Index build using global analysis
     [Figure labels: Global Analysis, Index Build, DeltaIndex Build; Store_i, DeltaStore, Store_{i+1}, Dup_{i+1}, AnchorText_{i+1}, Rank_{i+1}, Index_{i+1}; DeltaStore_j, DeltaStore_{j+1}, DeltaIndex_{j+1}; newly crawled documents]

  9. Index build algorithm (3/3)
     • Index build using lagging global analysis
     • Global Analysis and DeltaIndex build can proceed in parallel
     [Figure labels: Global Analysis, Index Build, DeltaIndex Build; Store_i, DeltaStore, Store_{i+1}, Index_{i+1}; GA inputs, GA_i, GA_{i+1}; DeltaStore_j, DeltaStore_{j+1}, DeltaIndex_{j+1}; newly crawled documents]
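One way to read the lagging scheme is that the build of Index_{i+1} uses the previous analysis results GA_i while GA_{i+1} is computed concurrently for the next cycle. The Python sketch below illustrates that reading only; the slide does not spell out the data flow, the analysis here is a toy stand-in (rank equals document length), and the names global_analysis, index_build and build_cycle are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def global_analysis(store):
    """Toy stand-in for GA: the 'rank' of a document is its token count."""
    return {doc: len(tokens) for doc, tokens in store.items()}


def index_build(store, delta_store, ga):
    """Merge the stores and order documents by the (lagging) rank in ga."""
    new_store = {**store, **delta_store}
    order = sorted(new_store, key=lambda d: (-ga.get(d, 0), d))
    return new_store, order


def build_cycle(store, delta_store, ga_prev):
    """Run one cycle: build with GA_i while GA_{i+1} is computed in parallel."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ga_future = pool.submit(global_analysis, {**store, **delta_store})
        build_future = pool.submit(index_build, store, delta_store, ga_prev)
        new_store, order = build_future.result()
        ga_next = ga_future.result()
    return new_store, order, ga_next
```

The point of the design is that the index build never waits for the analysis: ga_next is only consumed by the following cycle, so documents that are new in this cycle are ordered with a default rank until then.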

  10. Indexing algorithm
     • Radix sort
         • Linear time sorting
         • Flexibility in defining the sort criteria
     • Bigger sort buffers increase performance
     • Pipelining load and sort phases
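A least-significant-digit radix sort shows both properties listed on this slide: each pass is linear in the number of records, and the sort criteria are defined simply by how the integer key is built. The sketch below is a generic textbook version in Python, not the Trevi implementation; the function name radix_sort and the packing of (term_id, doc_id) into one 64-bit key are assumptions made for the example.

```python
def radix_sort(records, key=lambda r: r, key_bytes=8):
    """Stable LSD radix sort on a non-negative integer key.

    One bucketing pass per key byte, so the cost is
    O(key_bytes * len(records)).
    """
    for shift in range(0, key_bytes * 8, 8):
        buckets = [[] for _ in range(256)]
        for rec in records:
            buckets[(key(rec) >> shift) & 0xFF].append(rec)
        records = [rec for bucket in buckets for rec in bucket]
    return records


# Sort postings by term id, then by doc id, by packing both into one key.
postings = [(3, 1), (1, 7), (3, 0), (1, 2), (2, 5)]
ordered = radix_sort(postings, key=lambda p: (p[0] << 32) | p[1])
```

Changing the key function changes the sort order without touching the sort itself, for example packing a static rank ahead of the doc id.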

  11. Experimental results
     • Lagging global analysis does not degrade quality
     • More than 25% performance improvement
     • Even more advantageous when the analyses are more complex
     • Indexing algorithm scales linearly with the number of documents
     • Superior performance when compared to several state-of-the-art indexing algorithms

  12. Hardware and software architectures
     [Figure labels: Crawler, Crawled Documents, Index Build, Query Server, data copy, Store, Index, DeltaStore, DeltaIndex, Local Gigabit Switch, IP Sprayer, Link to the global IBM Intranet]
