Engineering a Set Intersection Algorithm for Information Retrieval

Alex Lopez-Ortiz, UNB / InterNAP. Joint work with Ian Munro and Erik Demaine.

Overview • Web Search Engine Basics • Algorithms for set operations • Theoretical Analysis • Experimental Analysis • Engineering an Improved Algorithm • Conclusions

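Web Search Engine Basics • Crawl: sequential gathering process • Document ID (DocID) for each web page [Figure: four crawled pages with DocIDs 1 to 4. Page 1 is a "Cool sites" page listing SIGIR, SIGACT and SIGCOMM; pages 2, 3 and 4 are the SIGIR, SIGACT and SIGCOMM pages. Example URL: http://acm.org/home.html]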

Indexing: List of entries of type <word, docID1, docID2, ...> E.g. <cool, 1> <SIGACT, 1, 3> <SIGCOMM, 1, 4> <SIG, 1, 2, 3, 4> [Figure: the four example pages, DocIDs 1 to 4]

Postings set: Set of DocIDs containing a word or pattern. E.g. SIGACT: {1,3} • SIGCOMM: {1,4} [Figure: the four example pages, DocIDs 1 to 4]
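
As a rough illustration of how postings sets fall out of an index, here is a minimal Python sketch. The page texts and the name `build_index` are assumptions made up for this example (only the DocIDs and the SIGACT/SIGCOMM postings come from the slides), and it indexes whole words only, so pattern entries such as <SIG, 1, 2, 3, 4> are not covered.

```python
from collections import defaultdict

# Hypothetical page contents, loosely modeled on the four-page example.
pages = {
    1: "cool sites SIGIR SIGACT SIGCOMM",
    2: "SIGIR",
    3: "SIGACT",
    4: "SIGCOMM",
}

def build_index(pages):
    """Map each word to the sorted list of DocIDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.split():
            index[word].add(doc_id)
    # Store postings sorted, since later set operations rely on that order.
    return {word: sorted(ids) for word, ids in index.items()}

index = build_index(pages)
print(index["SIGACT"])   # postings set for SIGACT
print(index["SIGCOMM"])  # postings set for SIGCOMM
```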

Search Engine Basics (cont.) Postings set stored implicitly/explicitly in a string matching data structure • PAT tree/array • Inverted word index • Suffix trees • KMP (grep) ...

String Matching Problem • Different performance characteristics for each solution • Time/Space tradeoff (empirical) • Linear time/linear space lower bound [Demaine/L-O, SODA 2001]

Search Engine Basics (cont.) A user query is of the form: keyword1 op keyword2 op … op keywordn, where each op is one of {and, or}. E.g. computer and science or internet

Evaluating a Boolean Query The interpretation of a boolean query is the mapping: • keyword → postings set • and → ∩ (set intersection) • or → ∪ (set union) E.g. {computer} ∩ {science} ∪ {internet}
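
The mapping can be sketched in a few lines of Python. The function name `evaluate`, the toy index, and the left-to-right evaluation order are all assumptions for illustration; the slides do not say how operator precedence is resolved.

```python
def evaluate(query, index):
    """Evaluate 'w1 op w2 op ...' left to right, with op in {and, or}."""
    tokens = query.split()
    result = set(index.get(tokens[0], ()))
    # Tokens alternate: keyword, operator, keyword, operator, ...
    for op, word in zip(tokens[1::2], tokens[2::2]):
        postings = set(index.get(word, ()))
        if op == "and":
            result &= postings      # set intersection
        elif op == "or":
            result |= postings      # set union
        else:
            raise ValueError(f"unknown operator: {op}")
    return sorted(result)

# Toy postings sets (made-up DocIDs).
toy_index = {"computer": [1, 2, 5], "science": [2, 5, 7], "internet": [3, 5]}
print(evaluate("computer and science or internet", toy_index))
```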

Set Operations for Web Search Engines • Average postings set size > 10 million • Postings sets are sorted

Intersection Time Complexity • Worst case linear in the size of the postings sets: Θ(n), e.g. {1,3,5,7} ∩ {1,3,5,7} • Linear in the size of the output? E.g. {1,3,5,7} ∩ {2,4,6,8} (empty output, yet the two sets fully interleave)
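
To make the two examples concrete, here is a small sketch of the textbook linear merge on two sorted lists, instrumented to count loop iterations (one three-way comparison each). The name `merge_intersect` is an assumption for this example.

```python
def merge_intersect(a, b):
    """Intersect two sorted lists; also return the number of loop iterations."""
    i = j = 0
    out = []
    steps = 0
    while i < len(a) and j < len(b):
        steps += 1
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out, steps

print(merge_intersect([1, 3, 5, 7], [1, 3, 5, 7]))
print(merge_intersect([1, 3, 5, 7], [2, 4, 6, 8]))
```

On the second input the output is empty, but the merge still has to walk through both lists, which is why output size alone does not bound the work.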

Adaptive Algorithms • Assume the intersection is empty. What is the minimum number of comparisons needed to ascertain this fact? Example: {1,2,3,4} ∩ {5,6,7,8}

Much Ado About Nothing • A sequence of comparisons is a proof of non-intersection if every possible instance of sets satisfying said sequence has empty intersection. • E.g. • A = {1,3,5,7} • B = {2,4,6,8} • a_1 < b_1 < a_2 < b_2 < a_3 < b_3 < a_4 < b_4
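
For the two-set case, one way to get a feel for proof length is to count how often the merged order switches between A and B: each switch needs one comparison of the form "last element of this run < first element of the next run". The sketch below does that count. It is an illustration under that assumption for two disjoint sorted sets, not the k-set measure from the talk, and the function name is made up.

```python
def shortest_proof_two_sets(a, b):
    """Count comparisons in a minimal disjointness proof for two disjoint sets."""
    if set(a) & set(b):
        raise ValueError("sets intersect; no proof of non-intersection exists")
    # Tag every element with its origin and look at the merged order.
    tagged = sorted([(x, "A") for x in a] + [(x, "B") for x in b])
    # One comparison per boundary between a run of A and a run of B.
    return sum(1 for prev, cur in zip(tagged, tagged[1:]) if prev[1] != cur[1])

print(shortest_proof_two_sets([1, 3, 5, 7], [2, 4, 6, 8]))  # fully interleaved
print(shortest_proof_two_sets([1, 2, 3, 4], [5, 6, 7, 8]))  # one boundary
```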

Adaptive Algorithms • In [SODA 2000] we proposed an adaptive algorithm that intersects k sets in k · |shortest proof of non-intersection| steps. • Ideal for crawled, “bursty” data sets

How does it work? • <SIGACT, 1, 3, i, n> [Figure: the postings entry laid out over the DocID universe set: 1, _, 3, ..., i, ..., n]

Measuring Performance • 100MB Web Crawl • 5000 queries from Google

Baseline Standard Algorithm • Sort sets by size • Candidate answer set is the smallest set • For each set S in increasing order by size • For each element e in the candidate set • Binary search for e in S • If e is not found, remove it from the candidate set • Remove elements before e in S (later searches in S start after them)
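
The steps above can be sketched as follows. This is a minimal reading of the slide, not the code used in the experiments: `baseline_intersect` is a made-up name, the input is assumed to be a non-empty list of sorted, duplicate-free lists, and "remove elements before e in S" is modeled by moving the lower bound of the binary search forward.

```python
from bisect import bisect_left

def baseline_intersect(sets):
    """Intersect sorted lists, filtering the smallest one against the others."""
    ordered = sorted(sets, key=len)          # sort sets by size
    candidates = list(ordered[0])            # smallest set is the candidate set
    for s in ordered[1:]:
        lo = 0                               # elements of s before lo are discarded
        survivors = []
        for e in candidates:
            lo = bisect_left(s, e, lo)       # binary search for e in s[lo:]
            if lo == len(s):
                break                        # s is exhausted; no later e can match
            if s[lo] == e:
                survivors.append(e)
        candidates = survivors
    return candidates

print(baseline_intersect([[6, 7, 10, 11, 14],
                          [4, 8, 10, 11, 15],
                          [1, 2, 4, 5, 7, 8, 9]]))
```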

Upper Bound: Adaptive/Traditional Two-Smallest Algorithm

Lower Bound: Adaptive/Shortest Proof

Middle Bound: Adaptive/ Encoding of Shortest Proof

Side by Side Middle Bound Lower Bound

Possible Improvements • Adaptive performs best with two or three sets • Traditional algorithm often terminates after the first pair of sets • Galloping seems better than binary search • Adaptive keeps a dynamic definition of “smallest set” • Candidate elements aggressively tested

Example {6, 7,10,11,14} {4, 8,10,11,15} {1, 2, 4, 5, 7, 8, 9}

Experimental Results Test orthogonally each possible improvement • Cyclic or Two Smallest • Symmetric • Update Smallest • Advance on Common Element • Gallop Factor/Binary Search

Binary Search vs. Gallop
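
For readers unfamiliar with galloping, here is a minimal sketch of the idea: probe at exponentially growing distances from the current position until the target is bracketed, then binary search inside the bracket. The function name and the `factor` parameter are assumptions for this example; the talk's actual implementation is not shown in the slides.

```python
from bisect import bisect_left

def gallop_search(s, target, start=0, factor=2):
    """Return the leftmost index i >= start with s[i] >= target (len(s) if none)."""
    n = len(s)
    if start >= n or s[start] >= target:
        return min(start, n)
    lo = start                 # invariant: s[lo] < target
    step = 1
    hi = lo + step
    while hi < n and s[hi] < target:
        lo = hi
        step *= factor         # widen the jump each time we undershoot
        hi = lo + step
    # The answer lies in (lo, min(hi, n)]; finish with a binary search there.
    return bisect_left(s, target, lo + 1, min(hi, n))

s = [1, 2, 4, 5, 7, 8, 9, 12]
print(gallop_search(s, 10))
```

The cost depends on how far the target is from the starting position rather than on the length of the whole list, which is what makes it attractive when successive searches land close together.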

Advance on Common Element

Small Adaptive Combines best of Adaptive and Two-Smallest • Two-smallest • Symmetric • Advance on common element • Update on smallest • Gallop with factor 2
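
A rough sketch of how several of these ingredients might fit together is given below. It is an illustration only, not the authors' implementation: all names are made up, the "symmetric" variant is not modeled, and inputs are assumed to be sorted, duplicate-free lists. Each round takes a candidate from the set with the fewest remaining elements (update on smallest), gallops for it with factor 2 in the other sets from smallest to largest, and steps past it wherever it is found (advance on common element).

```python
from bisect import bisect_left

def _gallop(s, target, start):
    """Leftmost index i >= start with s[i] >= target, found by doubling steps."""
    n = len(s)
    if start >= n or s[start] >= target:
        return min(start, n)
    lo, step = start, 1
    hi = lo + step
    while hi < n and s[hi] < target:
        lo, step = hi, step * 2
        hi = lo + step
    return bisect_left(s, target, lo + 1, min(hi, n))

def small_adaptive_sketch(sets):
    """Intersect sorted lists, re-ranking them by remaining size every round."""
    lists = [list(s) for s in sets]
    pos = [0] * len(lists)            # pos[i]: first unexamined index in lists[i]
    result = []
    if not lists:
        return result
    while all(pos[i] < len(lists[i]) for i in range(len(lists))):
        # Update on smallest: order sets by how many elements they have left.
        order = sorted(range(len(lists)), key=lambda i: len(lists[i]) - pos[i])
        first = order[0]
        e = lists[first][pos[first]]  # candidate from the currently smallest set
        pos[first] += 1
        in_all = True
        for i in order[1:]:
            pos[i] = _gallop(lists[i], e, pos[i])
            if pos[i] == len(lists[i]) or lists[i][pos[i]] != e:
                in_all = False        # e is missing here; drop this candidate
                break
            pos[i] += 1               # advance on common element
        if in_all:
            result.append(e)
    return result

print(small_adaptive_sketch([[1, 2, 3, 10, 11], [2, 10, 11, 20], [0, 2, 11]]))
```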

Small Adaptive

Small Adaptive • Small Adaptive is faster than Two-Smallest • Aggregate speed-up: 2.9x in number of comparisons • Faster than Adaptive

Conclusions • Faster intersection algorithm for Web Search Engines • Adaptive measure for set operations • Information theoretic “middle bound” • Standard speed-up techniques for other settings THE END

Query Log [Figure: number of queries for each total size. Axes: total # of elements in a query; number of queries]

Example {6, 7,10,11,14} {4, 8,10,11,15} {1, 2, 4, 5, 7, 8, 9, 12}
