Big, Bigger Biggest

Large scale issues: Phrase queries and common words; OCR
Tom Burton West, Hathi Trust Project

Hathi Trust Large Scale Search Challenges
• Goal: Design a system for full-text search that will scale to 5 million to 20 million volumes (at a reasonable cost).
• Challenges:
  • Must scale to 20 million full-text volumes
  • Very long documents compared to most large-scale search applications
  • Multilingual collection
  • OCR quality varies

Index Size, Caching, and Memory
• Our documents average about 300 pages, which is about 700 KB of OCR.
• Our 5 million document index is between 2 and 3 terabytes.
• About 300 GB per million documents
• A large index means disk I/O is the bottleneck
• Tradeoff: JVM vs. OS memory
• Solr uses OS memory (disk I/O caching) for caching of postings
• Memory available for disk I/O caching has the most impact on response time (assuming adequate cache warming)
• Fitting the entire index in memory is not feasible with a terabyte-size index

Response time varies with query (times in ms)
• Average: 673
• Median: 91
• 90th percentile: 328
• 99th percentile: 7,504
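The gap between average and median above is what a few very slow queries produce. A minimal sketch of that effect follows; the latency values are invented for illustration (they are not taken from our query logs), and `percentile` is a hypothetical helper using the nearest-rank method.

```python
import statistics

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Hypothetical response times in ms: 100 queries, mostly fast.
latencies_ms = [40, 55, 60, 70, 80, 90, 95, 110, 150, 300] * 10
latencies_ms[-1] = 120_000   # one two-minute query
latencies_ms[-11] = 30_000   # one thirty-second query

summary = {
    "average": statistics.mean(latencies_ms),
    "median": statistics.median(latencies_ms),
    "p90": percentile(latencies_ms, 90),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

Two slow queries out of a hundred leave the median and 90th percentile untouched but move the average by more than an order of magnitude, which is why the percentiles are reported alongside it.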

Slowest 5% of queries
• The slowest 5% of queries took about 1 second or longer.
• The slowest 1% of queries took between 10 seconds and 2 minutes.
• The slowest 0.5% of queries took between 30 seconds and 2 minutes.
• These queries affect the response time of other queries:
  • Cache pollution
  • Contention for resources
• The slowest queries are phrase queries containing common words.

Query processing
• Phrase queries use the position index (Boolean queries do not).
• The position index accounts for 85% of index size.
• The position list for a common word such as “the” can be many GB in size.
• This causes lots of disk I/O.
• Solr depends on the operating system’s disk cache to reduce disk I/O requirements for words that occur in more than one query.
• I/O from phrase queries containing common words pollutes the cache.
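The difference between the two query types can be sketched with a toy in-memory positional index. This is not how Lucene stores postings (there the positions live in a separate on-disk file); the function names `build_index`, `boolean_and`, and `phrase` are hypothetical and only illustrate which data each query type has to read.

```python
from collections import defaultdict

def build_index(docs):
    """Map term -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def boolean_and(index, terms):
    """Docs containing every term. Only doc ids are consulted, never positions."""
    doc_sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

def phrase(index, terms):
    """Docs where the terms occur consecutively. Every term's position list is read."""
    hits = set()
    for doc_id in boolean_and(index, terms):
        starts = index[terms[0]][doc_id]
        if any(all(s + i in index[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.add(doc_id)
    return hits

docs = {
    1: "the man in the moon",
    2: "the moon and the man",
}
index = build_index(docs)
print(boolean_and(index, ["man", "moon"]))
print(phrase(index, ["man", "moon"]))
print(phrase(index, ["the", "moon"]))
```

For a word like “the”, the position lists touched by `phrase` grow with every occurrence in the collection, while `boolean_and` only needs one entry per document.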

Slow Queries
• Slowest test query: “the lives and literature of the beat generation” took 2 minutes.
• 4 MB of data read for the Boolean query.
• 9,000+ MB read for the phrase query.

Why not use Stop Words?
• The word “the” occurs more than 4 billion times in our 1 million document index.
• Removing “stop” words (“the”, “of”, etc.) is not desirable for our use cases.
• Couldn’t search for many phrases:
  • “to be or not to be”
  • “the who”
  • “man in the moon” vs. “man on the moon”
• Stop words in one language are content words in another language:
  • German stop words “war” and “die” are content words in English
  • English stop words “is” and “by” are content words (“ice” and “village”) in Swedish
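A short sketch of what stop-word removal does to the example phrases. The `STOP_WORDS` set is a small illustrative list, not the list of any particular analyzer, and `remove_stop_words` is a hypothetical helper.

```python
# Illustrative English stop list (an assumption, not a real analyzer's list).
STOP_WORDS = {"the", "of", "to", "be", "or", "not", "in", "on", "is", "by"}

def remove_stop_words(query):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]

print(remove_stop_words("to be or not to be"))
print(remove_stop_words("the who"))
print(remove_stop_words("man in the moon"))
print(remove_stop_words("man on the moon"))
```

The first query loses every token, and the last two become indistinguishable from each other.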

“CommonGrams”
• Ported the Nutch “CommonGrams” algorithm to Solr
• Create bigrams selectively for any two-word sequence containing common terms
• Slowest query: “The lives and literature of the beat generation”
• Resulting query tokens: “the-lives” “lives-and” “and-literature” “literature-of” “of-the” “the-beat” “generation”
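The query-side tokenization shown above can be sketched as follows. This is a simplified illustration, not the Nutch or Solr code: `common_grams_query` and the `COMMON` word list are hypothetical, the “-” separator follows the slide, and the sketch assumes a single word is kept only when no bigram covers it.

```python
def common_grams_query(tokens, common_words, sep="-"):
    """Form a bigram for every adjacent pair that contains a common word.

    A single word is emitted only if it is not part of any bigram.
    """
    out = []
    covered = [False] * len(tokens)
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and (tok in common_words or nxt in common_words):
            out.append(f"{tok}{sep}{nxt}")
            covered[i] = covered[i + 1] = True
        elif not covered[i]:
            out.append(tok)
    return out

# Hypothetical common-word list for the example.
COMMON = {"the", "and", "of", "in", "on"}
query = "the lives and literature of the beat generation".split()
grams = common_grams_query(query, COMMON)
print(grams)
```

Because a bigram such as “the-beat” is far rarer than “the”, its position list is much shorter, so the phrase query reads much less data from disk.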

Standard index vs. CommonGrams [Figure: Standard Index | CommonGrams]

Comparison of Response time (ms)

Other issues
• Analyze your slowest queries
  • We analyzed the slowest queries from our query logs and discovered additional “common words” to be added to our list.
  • We used the Solr Admin panel to run our slowest queries from our logs with the “debug” flag checked.
  • We discovered that words such as “l’art” were being split into two-token phrase queries.
  • We used the Solr Admin Analysis tool and determined that the analyzer we were using was the culprit.
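The “l’art” problem can be reproduced with a toy tokenizer. Neither function below is the Solr analyzer we used; both are hypothetical regular-expression tokenizers that only show how splitting on the apostrophe turns one word into two tokens, which a query parser may then treat as a phrase.

```python
import re

def split_on_punctuation(text):
    """Tokenize on runs of word characters, so an apostrophe splits a word."""
    return re.findall(r"\w+", text.lower())

def keep_apostrophes(text):
    """Tokenize but keep word-internal apostrophes (straight or curly)."""
    return re.findall(r"\w+(?:['’]\w+)*", text.lower())

print(split_on_punctuation("l'art"))
print(keep_apostrophes("l'art"))
print(keep_apostrophes("l’art pour l’art"))
```

With the first tokenizer a one-word query becomes a two-token phrase containing the very frequent token “l”, which is the kind of query that showed up in our slow-query logs.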

Other issues
• We broke Solr … temporarily
  • Dirty OCR in combination with over 200 languages creates indexes with over 2.4 billion unique terms
  • Solr/Lucene index size was limited to 2.1 billion unique terms
  • Patched: Now it’s 274 billion
• Dirty OCR is difficult to remove without removing “good” words.
• Because the Solr/Lucene tii/tis index uses pointers into the frequency and position files, we suspect that the performance impact is minimal compared to disk I/O demands, but we will be testing soon.
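The two limits are consistent with simple integer arithmetic. The sketch below assumes that the 2.1 billion figure is the maximum signed 32-bit integer and that the patched figure is that value multiplied by a term index interval of 128; both are assumptions made for illustration and are not stated on the slide.

```python
INT32_MAX = 2**31 - 1          # assumed source of the 2.1 billion limit
TERM_INDEX_INTERVAL = 128      # assumed interval between indexed terms

old_limit = INT32_MAX
new_limit = INT32_MAX * TERM_INDEX_INTERVAL
unique_terms = 2_400_000_000   # "over 2.4 billion unique terms"

print(f"old limit: {old_limit:,}")
print(f"new limit: {new_limit:,}")
print("exceeds old limit:", unique_terms > old_limit)
print("fits new limit:", unique_terms < new_limit)
```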
