This paper presents a refined algorithm for efficient set intersection operations, crucial for information retrieval in web search engines. By analyzing established algorithms, theoretical frameworks, and experimental data, we propose improvements to adaptively handle large document postings sets common in search queries. We discuss the crawling process, indexing strategies, and performance metrics, culminating in a comparison between traditional and adaptive algorithms. Our findings indicate a significant speed-up of query processing, enhancing the effectiveness of web search technologies.
Engineering a Set Intersection Algorithm for Information Retrieval Alex Lopez-Ortiz UNB / InterNAP Joint work with Ian Munro and Erik Demaine
Overview • Web Search Engine Basics • Algorithms for set operations • Theoretical Analysis • Experimental Analysis • Engineering an Improved Algorithm • Conclusions
Web Search Engine Basics • Crawl: sequential gathering process • Document ID (DocID) for each web page [Figure: four crawled pages with DocIDs 1 to 4. Page 1 reads "Cool sites: SIGIR, SIGACT, SIGCOMM"; pages 2, 3, and 4 are SIGIR, SIGACT, and SIGCOMM. URL shown: http://acm.org/home.html]
Indexing: List of entries of type <word, docID1, docID2, ...> E.g. <cool, 1> <SIGACT, 1, 3> <SIGCOMM, 1, 4> <SIG, 1, 2, 3, 4>
Postings set: Set of docIDs containing a word or pattern. E.g. SIGACT → {1,3} • SIGCOMM → {1,4}
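The indexing step can be illustrated with a minimal sketch. It assumes whitespace-separated words and a toy crawl whose page contents are made up to resemble the slide's example; `build_index` and `docs` are hypothetical names, and pattern entries such as <SIG, 1, 2, 3, 4> are not handled.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the sorted list of docIDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    # Postings are kept sorted, as later slides assume.
    return {word: sorted(ids) for word, ids in index.items()}

# Toy crawl; the page texts are illustrative assumptions.
docs = {
    1: "cool sites SIGIR SIGACT SIGCOMM",
    2: "SIGIR",
    3: "SIGACT",
    4: "SIGCOMM",
}
index = build_index(docs)
print(index["SIGACT"], index["SIGCOMM"])
```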
Search Engine Basics (cont.) Postings set stored implicitly/explicitly in a string matching data structure • PAT tree/array • Inverted word index • Suffix trees • KMP (grep) ...
String Matching Problem • Different performance characteristics for each solution • Time/Space tradeoff (empirical) • Linear time/linear space lower bound [Demaine/L-O, SODA 2001]
Search Engine Basics (cont.) A user query is of the form: keyword1 op keyword2 op … op keywordn, where each op is one of {and, or} E.g. computer and science or internet
Evaluating a Boolean Query The interpretation of a boolean query is the mapping: • keyword → postings set • and → ∩ (set intersection) • or → ∪ (set union) E.g. {computer} ∩ {science} ∪ {internet}
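A minimal sketch of this mapping follows. The slide does not state operator precedence, so left-to-right evaluation is an assumption here; `evaluate` and the sample `postings` are hypothetical.

```python
def evaluate(query, postings):
    """Evaluate 'k1 op k2 op ...' left to right, where op is 'and' or 'or'."""
    tokens = query.split()
    result = set(postings.get(tokens[0], ()))
    # Operators sit at odd positions, keywords at even positions.
    for op, keyword in zip(tokens[1::2], tokens[2::2]):
        other = set(postings.get(keyword, ()))
        if op == "and":
            result &= other   # set intersection
        elif op == "or":
            result |= other   # set union
        else:
            raise ValueError(f"unknown operator: {op}")
    return sorted(result)

# Made-up postings sets for illustration only.
postings = {"computer": [1, 2, 5], "science": [2, 5, 7], "internet": [3]}
print(evaluate("computer and science or internet", postings))
```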
Set Operations for Web Search Engines • Average postings set size > 10 million • Postings sets are sorted
Intersection Time Complexity • Worst case linear in the size of the postings sets: Θ(n) E.g. {1,3,5,7} ∩ {1,3,5,7} • In the size of the output? {1,3,5,7} ∩ {2,4,6,8}
Adaptive Algorithms • Assume the intersection is empty. What is the minimum number of comparisons needed to ascertain this fact? Examples {1,2,3,4} ∩ {5,6,7,8}
Much Ado About Nothing • A sequence of comparisons is a proof of non-intersection if every possible instance of sets satisfying said sequence has empty intersection. • E.g. A = {1,3,5,7}, B = {2,4,6,8}: a1 < b1 < a2 < b2 < a3 < b3 < a4 < b4
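For two sorted, disjoint sets, one way to build such a proof is to compare elements only where the merged order switches from one set to the other. The sketch below counts those comparisons; `run_boundary_proof_length` is a hypothetical helper, and treating this count as the proof length for the two-set case is an assumption, not a statement from the slides.

```python
def run_boundary_proof_length(a, b):
    """Count comparisons of the form 'last of one run < first of the next run'
    in the merged order of two disjoint sets."""
    if set(a) & set(b):
        raise ValueError("sets intersect, so no proof of non-intersection exists")
    # Tag every element with the set it came from, then merge.
    merged = sorted([(x, "a") for x in a] + [(x, "b") for x in b])
    # Each switch of source between neighbours costs one comparison.
    return sum(1 for (_, s), (_, t) in zip(merged, merged[1:]) if s != t)

interleaved = run_boundary_proof_length([1, 3, 5, 7], [2, 4, 6, 8])
separated = run_boundary_proof_length([1, 2, 3, 4], [5, 6, 7, 8])
print(interleaved, separated)
```

The interleaved sets need a comparison at every neighbouring pair, while the separated sets need only the single comparison 4 < 5.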
Adaptive Algorithms • In [SODA 2000] we proposed an adaptive algorithm that intersects k sets in k · |shortest proof of non-intersection| steps. • Ideal for crawled, “bursty” data sets
How does it work? • <SIGACT, 1, 3, i, n> [Figure: the postings 1, _, 3, ..., i, ..., n laid out over the DocID universe set]
Measuring Performance • 100MB Web Crawl • 5000 queries from Google
Baseline Standard Algorithm • Sort sets by size • Candidate answer set is smallest set • For each set S in increasing order by size • For each element e in candidate set • Binary search for e in S • If e is not found remove from candidate set • Remove elements before e in S
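The steps above can be sketched as follows, assuming each postings set is a sorted Python list without duplicates. `baseline_intersect` is a hypothetical name, and "remove elements before e in S" is modelled by a moving lower bound `lo` rather than by deleting elements.

```python
from bisect import bisect_left

def baseline_intersect(sets):
    """Intersect sorted lists: smallest set first, binary search in the rest."""
    if not sets:
        return []
    ordered = sorted(sets, key=len)      # sort sets by size
    candidates = list(ordered[0])        # candidate answer set is the smallest set
    for s in ordered[1:]:                # remaining sets in increasing order of size
        lo = 0                           # everything in s before lo is already ruled out
        kept = []
        for e in candidates:
            lo = bisect_left(s, e, lo)   # binary search for e in s[lo:]
            if lo < len(s) and s[lo] == e:
                kept.append(e)           # e survives this set
        candidates = kept                # elements not found are dropped
    return candidates

print(baseline_intersect([[6, 7, 10, 11, 14], [4, 8, 10, 11, 15], [1, 2, 4, 5, 7, 8, 9]]))
```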
Side by Side [Figure: plots labelled Middle Bound and Lower Bound]
Possible Improvements • Adaptive performs best in two-three sets • Traditional algorithm often terminates after first pair of sets • Galloping seems better than binary search • Adaptive keeps a dynamic definition of “smallest set” • Candidate elements aggressively tested
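Galloping (doubling) search, mentioned above as an alternative to plain binary search, can be sketched like this. It assumes a sorted list and a doubling factor of 2; `gallop` is a hypothetical helper, not code from the talk.

```python
from bisect import bisect_left

def gallop(s, target, start=0):
    """Return the smallest index i >= start with s[i] >= target, or len(s) if none."""
    lo = hi = start
    step = 1
    # Probe at start, start+1, start+3, start+7, ... until we reach or pass target.
    while hi < len(s) and s[hi] < target:
        lo = hi + 1          # everything up to hi is known to be smaller than target
        hi += step
        step *= 2
    # The answer lies in s[lo:hi]; finish with a binary search on that window.
    return bisect_left(s, target, lo, min(hi, len(s)))

s = [1, 2, 4, 5, 7, 8, 9, 12]
print(gallop(s, 7), gallop(s, 10), gallop(s, 13))
```

The cost depends on how far the target lies from `start` rather than on the length of the list, which is why it suits searches that repeatedly resume from the previous position.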
Example {6, 7,10,11,14} {4, 8,10,11,15} {1, 2, 4, 5, 7, 8, 9}
Experimental Results Test orthogonally each possible improvement • Cyclic or Two Smallest • Symmetric • Update Smallest • Advance on Common Element • Gallop Factor/Binary Search
Small Adaptive Combines best of Adaptive and Two-Smallest • Two-smallest • Symmetric • Advance on common element • Update on smallest • Gallop with factor 2
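The following is a loose sketch of how these ingredients might fit together, not the authors' implementation. It assumes sorted, duplicate-free lists; `small_adaptive_sketch` and `_gallop` are hypothetical names, and the "symmetric" option is not modelled.

```python
from bisect import bisect_left

def _gallop(s, target, start):
    """Smallest index i >= start with s[i] >= target, found by doubling (factor 2)."""
    lo = hi = start
    step = 1
    while hi < len(s) and s[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    return bisect_left(s, target, lo, min(hi, len(s)))

def small_adaptive_sketch(sets):
    """Intersect sorted lists, always drawing the candidate from the set
    with the fewest remaining elements."""
    if not sets:
        return []
    pos = [0] * len(sets)                  # first unexamined index in each set
    result = []
    while all(pos[i] < len(s) for i, s in enumerate(sets)):
        # Update on smallest: re-rank sets by remaining size every round.
        order = sorted(range(len(sets)), key=lambda i: len(sets[i]) - pos[i])
        small = order[0]
        e = sets[small][pos[small]]        # candidate element
        pos[small] += 1
        in_all = True
        for i in order[1:]:                # second smallest is searched first
            pos[i] = _gallop(sets[i], e, pos[i])
            if pos[i] == len(sets[i]):
                return result              # set exhausted: nothing more can match
            if sets[i][pos[i]] != e:
                in_all = False             # e is eliminated; pick a new candidate
                break
            pos[i] += 1                    # advance on common element
        if in_all:
            result.append(e)
    return result

print(small_adaptive_sketch([[1, 2, 3, 10, 11], [2, 10, 11, 20], [0, 2, 11]]))
```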
Small Adaptive • Small Adaptive is faster than Two-Smallest • Aggregate speed-up of 2.9x, measured in comparisons • Faster than Adaptive
Conclusions • Faster intersection algorithm for Web Search Engines • Adaptive measure for set operations • Information theoretic “middle bound” • Standard speed-up techniques for other settings THE END
Query Log [Figure: number of queries for each total size, where total size is the total # of elements in a query]
Example {6, 7,10,11,14} {4, 8,10,11,15} {1, 2, 4, 5, 7, 8, 9, 12}