Advanced Indexing Techniques with Apache Lucene - Payloads

Michael Busch (buschmi@apache.org)
http://people.apache.org/~buschmi/apachecon/

Agenda
• Part 1: Inverted Index 101
  • Posting Lists
  • Stored Fields vs. Payloads
• Part 2: Use cases for Payloads
  • BoostingTermQuery
  • Simple facet counting

Lucene’s data structures
[Diagram: a search runs against the Inverted Index and yields Hits; stored fields are then retrieved from the Store to produce the Results.]

Query: not

c:\docs\einstein.txt: The important thing is not to stop questioning.
c:\docs\shakespeare.txt: To be or not to be.

• String comparison slow!
• Solution: Inverted index

Inverted index
Query: not

Document IDs:
0  c:\docs\einstein.txt: The important thing is not to stop questioning.
1  c:\docs\shakespeare.txt: To be or not to be.

Term          Document IDs
be            1
important     0
is            0
not           0, 1
or            1
questioning   0
stop          0
to            0, 1
the           0
thing         0
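The term-to-document-ID table above can be sketched in a few lines of plain Java. This is a toy model for illustration only, not Lucene's implementation: the class name ToyInvertedIndex, its methods, and the lowercase/split tokenization are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index (hypothetical, not Lucene): term -> IDs of documents containing it. */
final class ToyInvertedIndex {
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /** Documents must be added in ascending docId order so posting lists stay sorted. */
    void add(int docId, String text) {
        // Assumed tokenization: lowercase, split on anything that is not a letter.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (term.isEmpty()) continue; // split can yield a leading empty string
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each document once, even if the term occurs several times.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) docs.add(docId);
        }
    }

    /** Returns the document IDs for a term, or an empty list if the term is unknown. */
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), List.of());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(0, "The important thing is not to stop questioning.");
        index.add(1, "To be or not to be.");
        System.out.println(index.lookup("not")); // prints [0, 1]
        System.out.println(index.lookup("be"));  // prints [1]
    }
}
```

A query for a single term becomes one map lookup instead of a string comparison against every document.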

Inverted index
Query: "not to"

Token positions within each document:
0  c:\docs\einstein.txt:    The(0) important(1) thing(2) is(3) not(4) to(5) stop(6) questioning(7)
1  c:\docs\shakespeare.txt: To(0) be(1) or(2) not(3) to(4) be(5)

Document IDs alone cannot answer a phrase query: they do not tell whether "not" is directly followed by "to".

Inverted index
Query: "not to"

Term          Document ID: Positions
be            1: 1, 5
important     0: 1
is            0: 3
not           0: 4        1: 3
or            1: 2
questioning   0: 7
stop          0: 6
to            0: 5        1: 0, 4
the           0: 0
thing         0: 2
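To make the role of positions concrete, here is a minimal sketch of a two-term phrase lookup over a positional index. It is a toy model, not Lucene code: the class ToyPositionalIndex, the phrase method, and the tokenization are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional index (hypothetical, not Lucene): term -> (docId -> positions). */
final class ToyPositionalIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    void add(int docId, String text) {
        int position = 0;
        // Assumed tokenization: lowercase, split on anything that is not a letter.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    /** Returns IDs of documents in which 'second' occurs directly after 'first'. */
    List<Integer> phrase(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        Map<Integer, List<Integer>> a = postings.getOrDefault(first, Map.of());
        Map<Integer, List<Integer>> b = postings.getOrDefault(second, Map.of());
        for (Map.Entry<Integer, List<Integer>> e : a.entrySet()) {
            List<Integer> secondPositions = b.get(e.getKey());
            if (secondPositions == null) continue; // 'second' is not in this document
            for (int pos : e.getValue()) {
                if (secondPositions.contains(pos + 1)) { // adjacent occurrence found
                    hits.add(e.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyPositionalIndex index = new ToyPositionalIndex();
        index.add(0, "The important thing is not to stop questioning.");
        index.add(1, "To be or not to be.");
        System.out.println(index.phrase("not", "to")); // prints [0, 1]
        System.out.println(index.phrase("to", "be"));  // prints [1]
    }
}
```

Both example documents match "not to" (positions 4/5 and 3/4), while only the Shakespeare line matches "to be".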

Inverted index with Payloads
[Diagram: the same posting lists for the two example documents, now with three layers per entry: Document IDs, Positions, and Payloads. A payload (shown as "B" on the slide) is attached to an individual term occurrence.]

So far…
• String comparison slow
• Inverted index used to accelerate search
• Store positions in posting lists to allow phrase searches
• Store payloads in posting lists to store arbitrary data with each position

Lucene’s data structures
[Diagram: a search runs against the Inverted Index and yields Hits; stored fields are then retrieved from the Store to produce the Results.]

Store
Documents:
• Field 1: title
• Field 2: content
• Field 3: hashvalue

Store layout (fields grouped by document):
D0: F1 F2 F3 | D1: F1 F2 F3 | D2: F1 F2 F3

Store
D0: F1 F2 F3 | D1: F1 F2 F3 | D2: F1 F2 F3
• Optimized for random access
• Document-locality

Posting list with Payloads
[Diagram: F3 is crossed out of each document's entry in the Store and kept instead as the payload of a posting list with one entry per document, each at position 0.]
• Optimized for scanning and skipping
• Space-efficient encoding

Agenda
• Part 1: Inverted Index 101
  • Posting Lists
  • Stored Fields vs. Payloads
• Part 2: Use cases for Payloads
  • BoostingTermQuery
  • Simple facet counting

Payloads - API

org.apache.lucene.analysis.Token
    void setPayload(Payload payload)

org.apache.lucene.index.Payload
    Payload(byte[] data)
    Payload(byte[] data, int offset, int length)

Payloads - API

org.apache.lucene.index.TermPositions
    boolean next()
    int doc()
    int freq()
    int nextPosition()
    int getPayloadLength()
    byte[] getPayload(byte[] data, int offset)

Example: BoostingTermQuery
Use case:
• Score certain occurrences of a term higher than others
• E.g.: Query: 'warning'
  doc1: "HURRICANE WARNING"
  doc2: "The Warning Label Generator is a fun way to generate your own warning labels!" (www.warninglabelgenerator.com)

Example: BoostingTermQuery
Analyzer:

    final byte BoldBoost = 5;
    …
    Token token = new Token(…);
    …
    if (isBold) {
        // store the boost as a one-byte payload on this occurrence
        token.setPayload(new Payload(new byte[] {BoldBoost}));
    }
    …
    return token;

Example: BoostingTermQuery
Similarity:

    Similarity boostingSimilarity = new DefaultSimilarity() {
        // @Override
        public float scorePayload(byte[] payload, int offset, int length) {
            if (length == 1) return payload[offset];
            return 1; // neutral boost when no single-byte payload is present
        }
    };

Example: BoostingTermQuery
BoostingTermQuery:

    Query btq = new BoostingTermQuery(new Term("field", "searchterm"));

Searching:

    Searcher searcher = new IndexSearcher(…);
    searcher.setSimilarity(boostingSimilarity);
    …
    Hits hits = searcher.search(btq);
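The effect of per-occurrence boosts on ranking can be sketched without Lucene. The toy model below is an assumption for illustration: the class ToyPayloadScoring is hypothetical, the score is simply the sum of per-occurrence boosts (not Lucene's actual scoring formula), and it assumes the occurrence in doc1 was indexed as bold with BoldBoost = 5.

```java
import java.util.List;

/** Toy model (hypothetical, not Lucene) of scoring term occurrences by payload. */
final class ToyPayloadScoring {
    /** One occurrence of the query term; payload is null when nothing was stored. */
    static final class Occurrence {
        final int position;
        final byte[] payload;

        Occurrence(int position, byte[] payload) {
            this.position = position;
            this.payload = payload;
        }
    }

    /** Same rule as the slide's scorePayload: a one-byte payload is the boost, else neutral. */
    static float scorePayload(byte[] payload) {
        return (payload != null && payload.length == 1) ? payload[0] : 1f;
    }

    /** Assumed toy formula: the sum of the boosts of all occurrences in one document. */
    static float score(List<Occurrence> occurrences) {
        float sum = 0f;
        for (Occurrence o : occurrences) sum += scorePayload(o.payload);
        return sum;
    }

    public static void main(String[] args) {
        // doc1: "HURRICANE WARNING", one occurrence, assumed bold (payload 5)
        List<Occurrence> doc1 = List.of(new Occurrence(1, new byte[] {5}));
        // doc2: two plain occurrences of "warning", no payload
        List<Occurrence> doc2 = List.of(new Occurrence(1, null), new Occurrence(12, null));
        System.out.println(score(doc1)); // prints 5.0
        System.out.println(score(doc2)); // prints 2.0
    }
}
```

Under these assumptions the single boosted occurrence in doc1 outranks the two plain occurrences in doc2, which is the behavior the use case asks for.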

Example from java-user: Unique Doc Ids
Use case:
• Store a unique document id (UID) that maps to a row in a database table
• Retrieve UID at search time to influence matching/scoring
• FieldCache takes too long to load

Example from java-user: Unique Doc Ids
Solution:
• Index one special term for each document, e.g. ID:UID
• Index one occurrence for each document
• Store UID in the Payload of the occurrence

Example from java-user: Unique Doc Ids
For indexing: TokenStream

    class SinglePayloadTokenStream extends TokenStream {
        boolean done = false;
        int uid; // set via setUID

        public void setUID(int uid) {...}

        public Token next() throws IOException {
            if (done) return null;
            Token token = new Token("UID", 0, 0);
            // Payload takes a byte[]; intToBytes is a helper (not shown on the
            // slide) that must mirror bytesToInt used at retrieval time.
            token.setPayload(new Payload(intToBytes(uid)));
            done = true;
            return token;
        }
    }

Example from java-user: Unique Doc Ids
For retrieving: TermPositions

    public int[] getCachedUIDs(IndexReader reader) throws IOException {
        int[] cache = new int[reader.maxDoc()];
        TermPositions tp = reader.termPositions(new Term("ID", "UID"));
        byte[] buffer = new byte[4];
        while (tp.next()) {            // iterate over docs
            tp.nextPosition();         // only one pos per doc
            tp.getPayload(buffer, 0);  // copy the 4 payload bytes into buffer
            cache[tp.doc()] = bytesToInt(buffer);
        }
        tp.close();
        return cache;
    }
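The retrieval code calls bytesToInt, which the slides do not define. A minimal sketch of such a helper and its indexing-side counterpart follows; the class name UidCodec, the method intToBytes, and the choice of a 4-byte big-endian encoding are assumptions, since any encoding works as long as both sides agree.

```java
import java.nio.ByteBuffer;

/** Hypothetical helpers for storing an int UID in a 4-byte payload (big-endian). */
final class UidCodec {
    /** Encodes the value as 4 bytes, most significant byte first. */
    static byte[] intToBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    /** Decodes the first 4 bytes of the buffer, most significant byte first. */
    static int bytesToInt(byte[] buffer) {
        return ByteBuffer.wrap(buffer, 0, 4).getInt();
    }

    public static void main(String[] args) {
        byte[] payload = intToBytes(123456789);
        System.out.println(payload.length);      // prints 4
        System.out.println(bytesToInt(payload)); // prints 123456789
    }
}
```

A fixed 4-byte width matches the `new byte[4]` buffer used when reading the payloads back.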

Example from java-user: Unique Doc Ids
Performance:
• Load UIDs for 2M docs into memory
• FieldCache: 16.5 s
• Payloads: 430 ms

Example: (Very) Simple facet counting
Use case:
• Collection with docs from different sources
• Show top-n results from each source instead of top-n results from entire collection

Example: (Very) Simple facet counting
Analyzer:

    public TokenStream tokenStream(String fieldName, Reader reader) {
        if (fieldName.equals("_facet")) {
            return new TokenStream() {
                boolean done = false;

                public Token next() {
                    if (done) return null;
                    Token token = new Token(…);
                    // one token per document, carrying the site hash as payload
                    token.setPayload(new Payload(computeHash(url)));
                    done = true;
                    return token;
                }
            };
        }
        … // other fields are tokenized as usual (not shown on the slide)
    }

Example: (Very) Simple facet counting
HitCollector:
• Use different PriorityQueues for different sites
• Instead of returning top-n results of the whole data set, return top-n results per site
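The per-site collection described above can be sketched with one bounded min-heap per site. This is plain Java rather than a Lucene HitCollector: the class PerSiteTopN, its collect and topDocs methods, and passing the site hash as an int are assumptions for illustration; in the slides the site hash would be read from the payload.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical collector that keeps the n best-scoring hits separately for each site. */
final class PerSiteTopN {
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private static final Comparator<Hit> BY_SCORE =
            Comparator.comparingDouble((Hit h) -> h.score);

    private final int n;
    // site hash -> min-heap of the best hits seen so far for that site
    private final Map<Integer, PriorityQueue<Hit>> queues = new HashMap<>();

    PerSiteTopN(int n) {
        this.n = n;
    }

    /** Called once per matching document, like a hit collector callback. */
    void collect(int doc, float score, int siteHash) {
        PriorityQueue<Hit> queue =
                queues.computeIfAbsent(siteHash, s -> new PriorityQueue<Hit>(BY_SCORE));
        queue.add(new Hit(doc, score));
        if (queue.size() > n) queue.poll(); // evict the weakest hit of this site
    }

    /** Returns the doc IDs of the site's top hits, best score first. */
    List<Integer> topDocs(int siteHash) {
        PriorityQueue<Hit> queue = queues.get(siteHash);
        List<Integer> docs = new ArrayList<>();
        if (queue == null) return docs;
        List<Hit> hits = new ArrayList<>(queue);
        hits.sort(BY_SCORE.reversed());
        for (Hit h : hits) docs.add(h.doc);
        return docs;
    }

    public static void main(String[] args) {
        PerSiteTopN collector = new PerSiteTopN(2);
        collector.collect(1, 0.9f, 100);
        collector.collect(2, 0.5f, 100);
        collector.collect(3, 0.7f, 100);
        collector.collect(4, 0.2f, 200);
        System.out.println(collector.topDocs(100)); // prints [1, 3]
        System.out.println(collector.topDocs(200)); // prints [4]
    }
}
```

Each queue never holds more than n hits, so memory grows with the number of sites rather than with the number of matching documents.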

Example: (Very) Simple facet counting
Summary
• In this example: facet (site) used for scoring, but extendable for facet counting
• Good performance due to locality of facet values

Example: Efficient Numeric Search
Use case:
• Find documents that have a numeric value in a specific range, e.g. all docs with a date > 2006 and < 2007
Currently in Lucene:
• RangeQuery
• Store all values in the dictionary
• Query expansion

Example: Efficient Numeric Search
[Diagram: the Dictionary holds one term per date (01/01/2006, 01/02/2006, 01/04/2006, …, 12/30/2006), each pointing to its own Postinglist.]
Query: [01/05/2006 TO 11/25/2006]
Problem: A large number of postinglists have to be processed

Example: Efficient Numeric Search
Idea:
• Index special term, e.g. 'numeric:date' and store actual value in a Payload for each doc
• Problem: Postinglist can become very big -> entire list has to be processed
• Solution: Hybrid approach

Example: Efficient Numeric Search
[Diagram: the Dictionary holds one term per month (date:01/2006, date:02/2006, …, date:12/2006). Each Postinglist stores Document IDs, Positions (where the date occurred), and Payloads (the day).]

Example: Efficient Numeric Search
• Tradeoff between number of postinglists to process and size of postinglists
• Significant speedup possible with good choice of chunk size
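The hybrid approach can be sketched as follows, assuming a chunk size of one month: one posting list per month, with the day of the month kept as a per-posting payload. The class ToyDateIndex and its methods are hypothetical and only model the idea; they are not Lucene code.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model (hypothetical): one posting list per month, day of month as payload. */
final class ToyDateIndex {
    static final class Posting {
        final int doc;
        final byte day; // payload: day of month

        Posting(int doc, byte day) {
            this.doc = doc;
            this.day = day;
        }
    }

    // month key -> postings of all documents dated within that month
    private final TreeMap<Integer, List<Posting>> postings = new TreeMap<>();

    static int monthKey(LocalDate date) {
        return date.getYear() * 12 + date.getMonthValue() - 1;
    }

    void add(int doc, LocalDate date) {
        postings.computeIfAbsent(monthKey(date), k -> new ArrayList<>())
                .add(new Posting(doc, (byte) date.getDayOfMonth()));
    }

    /** Returns doc IDs with a date in [from, to], both bounds inclusive. */
    List<Integer> range(LocalDate from, LocalDate to) {
        List<Integer> hits = new ArrayList<>();
        int lo = monthKey(from);
        int hi = monthKey(to);
        if (lo > hi) return hits;
        // Only the month lists overlapping the range are visited.
        for (Map.Entry<Integer, List<Posting>> e
                : postings.subMap(lo, true, hi, true).entrySet()) {
            int key = e.getKey();
            for (Posting p : e.getValue()) {
                // Payloads only matter in the first and last month of the range.
                boolean afterStart = key > lo || p.day >= from.getDayOfMonth();
                boolean beforeEnd = key < hi || p.day <= to.getDayOfMonth();
                if (afterStart && beforeEnd) hits.add(p.doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyDateIndex index = new ToyDateIndex();
        index.add(0, LocalDate.of(2006, 1, 3));
        index.add(1, LocalDate.of(2006, 1, 5));
        index.add(2, LocalDate.of(2006, 6, 15));
        index.add(3, LocalDate.of(2006, 11, 25));
        index.add(4, LocalDate.of(2006, 11, 26));
        // Query [01/05/2006 TO 11/25/2006] from the slides
        System.out.println(index.range(LocalDate.of(2006, 1, 5), LocalDate.of(2006, 11, 25)));
        // prints [1, 2, 3]
    }
}
```

The query from the slides touches at most 11 month lists instead of one list per distinct day; a coarser chunk (e.g. a year) would mean fewer but longer lists, which is the tradeoff named above.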

Conclusion
• Payloads offer great flexibility
• Payloads are stored very space-efficiently
• Sophisticated data structures enable efficient skipping over payloads
• Payloads should be used whenever special data is required for finding hits and scoring

Outlook
• Finalize API (currently Beta)
• Add more out-of-the-box query types
• Per-document Payloads – updateable
• FieldCache implementation that uses Payloads

Questions?
http://people.apache.org/~buschmi/apachecon/