
Type Less, Find More: Fast Autocompletion Search with a Succinct Index

SIGIR 2006 in Seattle, USA, August 6 - 11. Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany. Joint work with Ingmar Weber.


Presentation Transcript


  1. Type Less, Find More: Fast Autocompletion Search with a Succinct Index. SIGIR 2006 in Seattle, USA, August 6 - 11. Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany. Joint work with Ingmar Weber.

  2. It's useful • Basic autocompletion: • saves typing • no more information than necessary: salton • find out about formulations used: autocomplete, autocompose • error correction: autocomplit, autocompleet

  3. It's more useful • Complete to phrases • phrase voronoi diagram → add word voronoi_diagram to index • Complete to subwords • compound word eigenproblem → add word problem to index • Complete to category names • author Börkur Sigurbjörnsson → add sigurbjörnson:börkur::author and börkur::sigurbjörnson:author • Faceted search • add ct:conference:sigir • add ct:author:Börkur_Sigurbjörnson • add ct:year:2005 • Workshop on Faceted Search on Thursday • All via the same mechanism

  4. Related Engines

  5. Related Engines

  6. Basic Problem Definition • Query • a set D of documents (= hits for the first part of the query) • a range W of words (= potential completions of last word) • Answer • all documents D' from D, containing a word from W • all words W' from W, contained in a document from D • Extensions (see paper) • ranking (best hits from D' and best completions from W') • positional information (proximity queries) • First try: inverted index (INV)
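
The problem definition on this slide can be written down directly as a brute-force reference, which is handy for checking any index against. This is only a sketch under assumptions: the word range W is taken to be a prefix range, and the function name `complete_bruteforce` and the toy corpus are hypothetical, not taken from the talk.

```python
from typing import Dict, Set, Tuple


def complete_bruteforce(
    docs: Dict[int, Set[str]], D: Set[int], prefix: str
) -> Tuple[Set[int], Set[str]]:
    """Return (D', W') straight from the definition, with no index at all."""
    d_prime: Set[int] = set()
    w_prime: Set[str] = set()
    for doc_id in D:
        # Words of this document that fall into the range W (here: a prefix range).
        matches = {w for w in docs.get(doc_id, set()) if w.startswith(prefix)}
        if matches:
            d_prime.add(doc_id)
            w_prime |= matches
    return d_prime, w_prime


# Hypothetical toy corpus: doc id -> words it contains.
toy_docs = {
    3: {"sigir03", "salton"},
    16: {"sigirforum", "search"},
    18: {"sigir", "salton", "salvador"},
    25: {"sigir04", "salvador"},
    8: {"salary"},
}
D = {3, 16, 18, 25}  # hits for the first part of the query, e.g. sigir*
d_prime, w_prime = complete_bruteforce(toy_docs, D, "sal")
print(sorted(d_prime), sorted(w_prime))
```

Document 8 contains a word from W but is not in D, so neither the document nor its word `salary` appears in the answer.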

  7. Processing 1-word queries with INV • For example, sigir* • D: all documents • W: all words matching sigir* • Iterate over all words from W • sigir: Doc. 18, Doc. 53, Doc. 591, ... • sigir03: Doc. 3, Doc. 66, Doc. 765, ... • sigir04: Doc. 25, Doc. 98, Doc. 221, ... • sigirlist: Doc. 67, Doc. 189, Doc. 221, ... • sigirforum: Doc. 16, Doc. 110, Doc. 141, ... • Merge the document lists • D': Doc. 3, Doc. 16, Doc. 18, Doc. 25, … • Output all words from range as completions • W': sigir, sigir03, sigir04, sigirlist, … Expensive! Trivial for 1-word queries
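
The 1-word case above can be sketched as a range lookup in a sorted vocabulary followed by a multi-way merge of the posting lists. This is an illustrative sketch, not the authors' implementation: the function name `inv_prefix_query` is hypothetical, the posting lists reuse the example numbers from the slide, and the `"\uffff"` sentinel assumes no indexed word contains that character.

```python
import bisect
import heapq
from typing import Dict, List, Tuple


def inv_prefix_query(
    index: Dict[str, List[int]], prefix: str
) -> Tuple[List[int], List[str]]:
    """Answer prefix* on an inverted index: merge the lists of all matching words."""
    vocab = sorted(index)  # a real index would keep the vocabulary sorted already
    lo = bisect.bisect_left(vocab, prefix)
    hi = bisect.bisect_left(vocab, prefix + "\uffff")
    words = vocab[lo:hi]  # the range W of potential completions
    merged: List[int] = []
    # Multi-way merge of the sorted posting lists, dropping duplicate doc ids.
    for doc_id in heapq.merge(*(index[w] for w in words)):
        if not merged or merged[-1] != doc_id:
            merged.append(doc_id)
    return merged, words


inv_index = {
    "sigir": [18, 53, 591],
    "sigir03": [3, 66, 765],
    "sigir04": [25, 98, 221],
    "sigirlist": [67, 189, 221],
    "sigirforum": [16, 110, 141],
    "salton": [3, 18, 66],
}
hits_1word, completions_1word = inv_prefix_query(inv_index, "sigir")
print(hits_1word[:4], completions_1word)
```

The merge touches every posting of every matching word, which is the expensive step; listing the completions is just the vocabulary range itself.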

  8. Processing multi-word queries with INV • For example, sigir* sal* • D: Doc. 3, Doc. 16, Doc. 18, Doc. 25, … (hits for sigir*) • W: all words matching sal* • Iterate over all words from W • salary: Doc. 8, Doc. 23, Doc. 291, ... • salesman: Doc. 24, Doc. 36, Doc. 165, ... • salton: Doc. 3, Doc. 18, Doc. 66, ... • salutation: Doc. 56, Doc. 129, Doc. 251, ... • salvador: Doc. 18, Doc. 21, Doc. 25, ... • Intersect each list with D, then merge • D': Doc. 3, Doc. 18, Doc. 25, … • Output all words with non-empty intersection • W': salton, salvador • Most intersections are empty, but INV has to compute them all!
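
A minimal sketch of the multi-word case, using the posting lists from the slide, makes the cost visible: one intersection per potential completion, even when most come back empty. The function name `inv_complete` and the returned intersection counter are illustrative assumptions, and Python sets stand in for a proper sorted-list intersection.

```python
from typing import Dict, List, Set, Tuple


def inv_complete(
    index: Dict[str, List[int]], D: Set[int], prefix: str
) -> Tuple[List[int], List[str], int]:
    """Intersect D with the list of every word in the range, then merge."""
    hits: Set[int] = set()
    completions: List[str] = []
    intersections = 0
    for word in sorted(index):
        if not word.startswith(prefix):
            continue
        intersections += 1  # INV pays for this even if the result is empty
        common = D.intersection(index[word])
        if common:
            completions.append(word)
            hits |= common
    return sorted(hits), completions, intersections


sal_index = {
    "salary": [8, 23, 291],
    "salesman": [24, 36, 165],
    "salton": [3, 18, 66],
    "salutation": [56, 129, 251],
    "salvador": [18, 21, 25],
}
D_sigir = {3, 16, 18, 25}  # hits for sigir*
hits_multi, completions_multi, n_intersections = inv_complete(sal_index, D_sigir, "sal")
print(hits_multi, completions_multi, n_intersections)
```

On this toy data five intersections are computed and only two are non-empty.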

  9. INV — Problems • Asymptotic time complexity is bad (for our problem) • many intersections (one per potential completion) • has to merge/sort (the non-empty intersections) • Still hard to beat INV in practice • highly compressible • half the space on disk means half the time to read it • INV has very good locality of access • the ratio random access time/sequential access time is 50,000 for disk, and still 100 for main memory • simple code • instruction cache, branch prediction, etc.

  10. A Hybrid Index (HYB) • Basic Idea: have lists for ranges of words • salary – salvador: Doc. 3, Doc. 16, Doc. 18, Doc. 25, ... • Problem: not enough to show completions • Solution: store the word(s) along with each doc id • salary – salvador: Doc. 3, Doc. 16, Doc. 18, Doc. 25, ... with words salary, salvador, salton, salary, salton, salvador • But this looks very wasteful
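
The idea of storing the word along with each doc id can be sketched as a block of (doc id, word) pairs sorted by doc id, answered with a single sequential scan. This is an assumption-laden illustration: the name `hyb_query` and the block contents are hypothetical, and the query range is assumed to lie inside one block.

```python
from typing import Iterable, List, Tuple


def hyb_query(
    block: List[Tuple[int, str]], D: Iterable[int], prefix: str
) -> Tuple[List[int], List[str]]:
    """One pass over a block of (doc id, word) pairs sorted by doc id."""
    d_set = set(D)
    hits: List[int] = []
    completions = set()
    for doc_id, word in block:
        if doc_id in d_set and word.startswith(prefix):
            # The block is sorted by doc id, so duplicates are adjacent.
            if not hits or hits[-1] != doc_id:
                hits.append(doc_id)
            completions.add(word)
    return hits, sorted(completions)


# Hypothetical block for the word range salary – salvador.
sal_block = [
    (3, "salton"), (8, "salary"), (18, "salton"), (18, "salvador"),
    (21, "salvador"), (23, "salary"), (24, "salesman"), (25, "salvador"),
    (36, "salesman"), (56, "salutation"), (66, "salton"),
]
hyb_hits, hyb_completions = hyb_query(sal_block, [3, 16, 18, 25], "sal")
print(hyb_hits, hyb_completions)
```

Hits and completions fall out of the same scan, with no per-word intersection and no separate merge step.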

  11. HYB — Details • HYB has a block for each word range, conceptually: • Replace doc ids by gaps and words by frequency ranks: • Encode both gaps and ranks such that x → log2 x bits • gaps: +0 → 0, +1 → 10, +2 → 110 • ranks: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110 • An actual block of HYB • How well does it compress? Which block size?
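
The two transformations on this slide (doc ids to gaps, words to frequency ranks) can be sketched as follows. The helper names and the toy block are hypothetical. Note that `elias_gamma` is a swapped-in stand-in for "small numbers get short codes": it is not the code table shown on the slide, and it uses about 2·log2 x bits rather than log2 x.

```python
from collections import Counter
from typing import List, Tuple


def to_gaps_and_ranks(
    block: List[Tuple[int, str]]
) -> Tuple[List[int], List[int], List[str]]:
    """Turn (doc id, word) pairs into doc-id gaps and word frequency ranks."""
    if not block:
        return [], [], []
    doc_ids = [d for d, _ in block]
    # The first "gap" is the first doc id itself; repeated doc ids give gap 0.
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    freq = Counter(w for _, w in block)
    by_freq = sorted(freq, key=lambda w: (-freq[w], w))  # most frequent first
    rank = {w: r for r, w in enumerate(by_freq, start=1)}
    return gaps, [rank[w] for _, w in block], by_freq


def elias_gamma(x: int) -> str:
    """Elias-gamma code of x >= 1 as a bit string (stand-in universal code)."""
    if x < 1:
        raise ValueError("Elias-gamma is defined for x >= 1")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


# Hypothetical block over words A-D, where A is most frequent and B least.
toy_block = [
    (1, "D"), (3, "A"), (3, "C"), (5, "A"), (5, "B"),
    (6, "A"), (9, "A"), (9, "C"), (10, "C"), (10, "D"),
]
gaps, ranks, words_by_rank = to_gaps_and_ranks(toy_block)
gap_codes = [elias_gamma(g + 1) for g in gaps]  # shift by 1 because gaps can be 0
print(gaps, ranks, words_by_rank)
```

With this toy block the rank order comes out as A, C, D, B, matching the ordering used on the slide.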

  12. INV vs. HYB — Space Consumption • Theorem: The empirical entropy of INV is $\sum_i n_i \cdot (1/\ln 2 + \log_2(n/n_i))$ • Theorem: The empirical entropy of HYB with block size $\varepsilon \cdot n$ is $\sum_i n_i \cdot ((1+\varepsilon)/\ln 2 + \log_2(n/n_i))$ • $n_i$ = number of documents containing the $i$-th word, $n$ = number of documents • Nice match of theory and practice
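
To get a feel for the two formulas, they can be evaluated on made-up document frequencies. The sketch below only transcribes the expressions from the slide; the function names and the toy numbers are assumptions, and every n_i is assumed to satisfy 1 ≤ n_i ≤ n.

```python
import math
from typing import Sequence


def entropy_inv(doc_freqs: Sequence[int], n: int) -> float:
    """Sum over words of n_i * (1/ln 2 + log2(n / n_i)), in bits."""
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


def entropy_hyb(doc_freqs: Sequence[int], n: int, eps: float) -> float:
    """Sum over words of n_i * ((1 + eps)/ln 2 + log2(n / n_i)), in bits."""
    return sum(
        ni * ((1 + eps) / math.log(2) + math.log2(n / ni)) for ni in doc_freqs
    )


# Hypothetical collection: 1000 documents, four words with these document counts.
n_docs = 1000
toy_freqs = [500, 100, 10, 1]
h_inv = entropy_inv(toy_freqs, n_docs)
h_hyb = entropy_hyb(toy_freqs, n_docs, eps=0.1)
print(f"INV: {h_inv:.1f} bits, HYB: {h_hyb:.1f} bits")
```

The two expressions differ only by ε/ln 2 bits per posting, so for small ε the HYB bound stays close to the INV bound.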

  13. INV vs. HYB — Query Time • Theoretical analysis → see paper • Experiment: type ordinary queries from left to right • sig, sigi, sigir, sigir sal, sigir salt, sigir salto, sigir salton • [Results table with columns INV and HYB; values not preserved] • HYB better by an order of magnitude

  14. System Design — High Level View • Compute Server: C++ • Web Server: PHP • User Client: JavaScript • Debugging such an application is hell!

  15. Summary of Results • Properties of HYB • highly compressible (just like INV) • fast prefix-completion queries (perfect locality of access) • fast indexing (no full inversion necessary) • Autocompletion and more • phrase and subword completion, semantic completion, XML support, … • faceted search (Workshop Talk on Thursday) • efficient DB joins: author[sigir sigmod] (NEW) • All with one and the same (efficient) mechanism

  16. INV vs. HYB — Space Consumption • Definition: empirical entropy H = optimal number of bits • Theorem: H(INV) = $\sum_i n_i \cdot (1/\ln 2 + \log_2(n/n_i))$ • Theorem: The empirical entropy of HYB with block size $\varepsilon \cdot n$ is $\sum_i n_i \cdot ((1+\varepsilon)/\ln 2 + \log_2(n/n_i))$ • $n_i$ = number of documents containing the $i$-th word, $n$ = number of documents • Perfect match of theory and practice

  17. INV vs. HYB — Space Consumption • Theorem: Entropy(INV) = $\sum_i n_i \cdot (1/\ln 2 + \log_2(n/n_i))$ • Theorem: Entropy(HYB) = $\sum_i n_i \cdot ((1+\varepsilon)/\ln 2 + \log_2(n/n_i))$ • We define a notion of empirical entropy in the paper, in terms of $n_i$ = number of documents containing the $i$-th word and $n$ = number of documents • Perfect match of theory and practice

  18. HYB vs. INV — Query Time

  19. Processing a 1-word Query with INV • Processing a 1-word query, e.g., sigir* • Iterate over all words matching sigir* • Merge the document lists

  20. Processing sigir* sal with INV • Iterate over all words matching sigir* • sigir: Doc. 18, Doc. 53, Doc. 591, ... • sigir03: Doc. 3, Doc. 66, Doc. 765, ... • sigir04: Doc. 25, Doc. 98, Doc. 221, ... • sigirlist: Doc. 67, Doc. 189, Doc. 221, ... • sigirforum: Doc. 16, Doc. 110, Doc. 141, ... • Merge the document lists • Hits D': Doc. 3, Doc. 16, Doc. 18, … • Output all words from range as completions • Completions W': sigir, sigir03, sigir04, … Expensive! Trivial for 1-word queries

  21. Using an Inverted Index (INV) Problem 1: one intersection per potential completion Problem 2: merging of non-empty intersections

  22. HYB — Details • HYB has a block for each word range • Figure labels: document ids, words, gaps, ranks by frequency • universal encoding: small gaps/ranks => short codes • gaps: +0 → 0, +1 → 10, +2 → 110 • ranks: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110 • one block of HYB

  23. INV vs. HYB — Query Time • [Results table with columns INV and HYB; values not preserved] • avg = average time per keystroke • max = maximum time per keystroke (outliers removed)

  24. Start with DEMO • autocomp • sig • sigir • sigir sal • sal

  25. Related Search Engine Features • Complete from precompiled list of queries • Google Suggest • AllTheWeb Livesearch • … • Desktop Search engines • Apple Spotlight • Copernic Desktop Search • …
