Fast Two-Sided Error-Tolerant Search

Hannah Bast, Marjan Celikik
University of Freiburg, Germany
KEYS 2010

Motivation
• Handling uncertainty in text search is important
• Query side: users make mistakes when typing the query
  • Either due to mistyping
  • Or because they do not know the correct spelling or have incomplete knowledge of the underlying data
• Document side: the documents themselves contain mistakes
  • Those who type the documents also make mistakes
  • OCR errors
Efficient Two-Sided Error-Tolerant Search

State of the Art
• Not much work on fast error-tolerant search
  • There is prior work on document-side error tolerance
  • Overall, only a few relevant papers in the literature
• BASELINE: replace each query word by a disjunction of similar words
• A lot of work has been done on approximate string matching / searching

BASELINE is anything but efficient
• Example: the query
  fast AND list AND intersction
  becomes
  fast AND list AND (intersection OR interrsection OR intersession OR intersacitionn OR intrasection OR …)
• There can be hundreds of similar words!
• Large list merging and disk I/O overhead
• But the current state of the art is not much faster than BASELINE …
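The BASELINE expansion can be sketched in a few lines. This is a minimal illustration, assuming Levenshtein (edit) distance as the similarity measure and a linear scan over the vocabulary; the function names, the toy vocabulary, and the threshold are hypothetical and not taken from the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic program (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similar_words(word, vocabulary, threshold):
    """All vocabulary words within `threshold` edits of `word` (sorted)."""
    return sorted(w for w in vocabulary if edit_distance(word, w) <= threshold)


def baseline_expand(query_words, vocabulary, threshold=1):
    """Replace each query word by a disjunction of its similar words."""
    parts = []
    for q in query_words:
        # Fall back to the word itself if nothing in the vocabulary is close.
        alternatives = similar_words(q, vocabulary, threshold) or [q]
        if len(alternatives) == 1:
            parts.append(alternatives[0])
        else:
            parts.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(parts)


# Hypothetical toy vocabulary; a real one can yield hundreds of alternatives.
vocabulary = {"fast", "list", "intersection", "intersction", "house"}
print(baseline_expand(["fast", "list", "intersction"], vocabulary))
```

Every alternative in the disjunction corresponds to one posting list that has to be read and merged, which is where the overhead comes from.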

Our Approach - Clustering
• Based on clustering of the vocabulary
• A vocabulary V is the set of all words in a corpus
• The clusters may overlap, i.e. a word can belong to a few clusters
• Definition (cover)
  • Let q be a keyword, K a clustering of V, and $S_q$ the set of all words within a distance threshold T of q. An exact cover of $S_q$ is a set of clusters from K with union $S_q$. An approximate cover of $S_q$ does not necessarily contain all of $S_q$
  • The number of clusters n in the cover is called the cover index
  • Precision of a cover is defined as $|S_q \cap C| / |C|$, where C is the union of the clusters in the cover
  • Recall of a cover is defined as $|S_q \cap C| / |S_q|$
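The cover measures can be made concrete with plain sets. The sketch below assumes the usual set-based reading: precision is the fraction of words in the union of the chosen clusters that are similar to q, and recall is the fraction of similar words that the union contains. The function name and the example words are hypothetical.

```python
def cover_metrics(similar, cover):
    """Cover index, precision and recall of `cover` for the word set `similar`.

    similar: set of words considered similar to the keyword q
    cover:   list of clusters, each a set of words
    """
    union = set().union(*cover)  # all words reachable through the cover
    hits = similar & union       # similar words the cover actually contains
    return {
        "cover_index": len(cover),
        "precision": len(hits) / len(union) if union else 0.0,
        "recall": len(hits) / len(similar) if similar else 1.0,
    }


# Hypothetical example: 4 similar words, 2 clusters with 5 distinct words.
similar = {"algorithm", "algoithm", "algoirthm", "aglorithm"}
cover = [
    {"algorithm", "algoithm", "logarithm"},
    {"algoirthm", "algorithmic"},
]
print(cover_metrics(similar, cover))
```

Here the cover is approximate: it misses "aglorithm" (recall below 1) and drags in "logarithm" and "algorithmic" (precision below 1).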

Our Approach - Clustering
• Compute a clustering so that for each q we can compute a good cover:
  • (C1) with cover index as small as possible
  • (C2) with recall as large as possible
  • (C3) with precision as large as possible
  • (C4) with frequency-weighted overlap as small as possible

Using the Clustering – Indexing
• For each occurrence of a word, determine its clusters
• Add corresponding artificial postings to the index by prepending the cluster ids
• Example: "house" occurs in Doc. 7012 and belongs to clusters 165 and 9823
  house → Doc. 7012
  C:165:house → Doc. 7012
  C:9823:house → Doc. 7012
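The indexing step can be illustrated with a toy in-memory inverted index. This is only a sketch: the dictionary-based index, the whitespace tokenizer, and the `word_clusters` mapping are assumptions made for the example, while the `C:<cluster id>:<word>` term format follows the slide.

```python
from collections import defaultdict


def build_index(docs, word_clusters):
    """Build a toy inverted index with artificial cluster postings.

    docs:          dict doc_id -> text
    word_clusters: dict word -> set of cluster ids the word belongs to
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # Ordinary posting for the word itself.
            index[word].append(doc_id)
            # One artificial posting per cluster the word belongs to.
            for cluster_id in sorted(word_clusters.get(word, ())):
                index[f"C:{cluster_id}:{word}"].append(doc_id)
    return dict(index)


# The example from the slide: "house" is in clusters 165 and 9823.
docs = {7012: "the house"}
word_clusters = {"house": {165, 9823}}
index = build_index(docs, word_clusters)
for term in sorted(index):
    print(term, index[term])
```

Because all artificial terms of one cluster share the prefix `C:<cluster id>:`, a single prefix query retrieves the postings of every word in that cluster.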

Using the Clustering – Query Time
• For each q, compute its set of similar words and all affected cluster ids
• Compute a cover with minimal cover index
  • For the given cover recall (and precision), there is no cover with a smaller cover index (similar to the set cover problem)
• Transform q into a disjunction of prefix queries, e.g. algoritm → C:59:* OR C:1017:*
• Use efficient prefix search to process the transformed query (we use the HYB index)
[Figure: the query "algoritm" and the words algorithm, alggorithm, algoithm, algoirthm, alggorithluq, aglorithm, logarithm, algorithmica, algorithmic, …, annotated with cluster ids (59, 1017, 201, 221, 56, 61, 472); clusters 59 and 1017 are labeled]
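The query-time steps can be sketched as follows. Since finding a cover with minimal cover index resembles set cover, this example swaps in the standard greedy set-cover heuristic; that choice is an assumption and not necessarily the method used by the authors. The cluster memberships are hypothetical and were picked so that the result matches the transformed query on the slide.

```python
def greedy_cover(similar, clusters, min_recall=1.0):
    """Pick cluster ids greedily until `min_recall` of `similar` is covered.

    similar:  set of words similar to the keyword
    clusters: dict cluster_id -> set of words (hypothetical memberships)
    """
    needed = min_recall * len(similar)
    remaining = dict(clusters)
    covered, chosen = set(), []
    while len(covered) < needed and remaining:
        # Cluster that adds the most uncovered similar words
        # (ties go to the smallest cluster id).
        best = max(sorted(remaining),
                   key=lambda cid: len((remaining[cid] & similar) - covered))
        gain = (remaining.pop(best) & similar) - covered
        if not gain:
            break  # no remaining cluster improves recall
        covered |= gain
        chosen.append(best)
    return chosen


def to_prefix_query(cluster_ids):
    """Turn the chosen clusters into a disjunction of prefix queries."""
    return " OR ".join(f"C:{cid}:*" for cid in cluster_ids)


similar = {"algorithm", "alggorithm", "algoithm", "algoirthm", "aglorithm"}
clusters = {
    59: {"algorithm", "alggorithm", "aglorithm", "algorithmic"},
    201: {"alggorithm"},
    472: {"aglorithm", "algoirthm"},
    1017: {"algorithm", "algoithm", "algoirthm", "logarithm"},
}
chosen = greedy_cover(similar, clusters)
print(to_prefix_query(chosen))
```

The greedy heuristic does not guarantee a minimal cover index in general; it is used here only because it is short and easy to follow.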

Computing a Clustering
• How to compute a clustering with the favorable properties (C1) – (C4)?
• It’s easy to optimize for (C1) alone, but then (C2) will suffer
• It’s easy to optimize for (C1) – (C3) alone, but then (C4) will suffer, etc.
[Figure: clusters v, x, y, z containing "algorithm" and misspellings such as algoirtm, algoithm, a1gor1thm, aglorithmm, algortm, algoritluq, algoritw2, …; the resulting artificial postings are C:v:algorithm, C:x:algorithm, C:y:algorithm, C:z:algorithm]

Experimental results
• Average query times
• Average number of clusters and similar words

Experimental results
• Average cover precision and recall
• Index sizes
