
SOLR SIDE-CAR INDEX






Presentation Transcript


  1. SOLR SIDE-CAR INDEX • Andrzej Bialecki • LucidWorks • ab@lucidworks.com

  2. About the speaker • Started using Lucene in 2003 (1.2-dev…) • Created Luke – the Lucene Index Toolbox • Apache Nutch, Hadoop, Solr committer, Lucene PMC member • LucidWorks engineer

  3. Agenda • Challenge: incremental document updates • Existing solutions and workarounds • Sidecar index strategy and components • Scalability and performance • QA

  4. Challenge: incremental document updates • Incremental update (field-level update): modification of a part of a document • Sounds like fundamentally useful functionality! • But Lucene / Solr doesn’t offer true field-level updates (yet!) • An “update” is really a sequence of “retrieve old document, update fields, add updated document, delete old document” • The “atomic update” functionality in Solr is (useful) syntactic sugar
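The retrieve / modify / re-add cycle behind an “atomic update” can be sketched in plain Java. This is a toy model of the behavior only; the class and method names are illustrative, not Solr API:

```java
import java.util.HashMap;
import java.util.Map;

public class AtomicUpdateSketch {
    private final Map<String, Map<String, Object>> index = new HashMap<>();

    // Index a field on a (possibly new) document.
    public void setField(String id, String field, Object value) {
        index.computeIfAbsent(id, k -> new HashMap<>()).put(field, value);
    }

    public Object getField(String id, String field) {
        return index.get(id).get(field);
    }

    // What an "atomic update" really does: retrieve the old document,
    // patch the field in memory, delete the old copy, re-add the whole doc.
    public void atomicUpdate(String id, String field, Object value) {
        Map<String, Object> old = index.remove(id);        // retrieve + delete old
        Map<String, Object> updated = new HashMap<>(old);  // copy all stored fields
        updated.put(field, value);                         // modify one field
        index.put(id, updated);                            // re-add whole document
    }
}
```

Note that the whole document round-trips through `atomicUpdate` even though only one field changed – which is exactly why this is syntactic sugar rather than a true field-level update.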

  5. Common use cases for field updates • Documents composed logically of two parts with different update schedules • E.g. mostly static documents with some quickly changing fields • Two different classes of data in changing fields • Numeric / boolean fields: e.g. popularity, in-stock status, promo campaigns • Text fields: e.g. reviews, tags, click-through feedback, user profiles • Challenge: how to integrate these modifications with the main index content? • Re-indexing whole documents isn’t always an option

  6. True full-text (inverted fields) incremental updates • Very complex issue, broad impact on many Lucene internals • Inverted index structure is not optimized for partial document updates • At least another 6-12 months away? • LUCENE-4258 – work in progress

  7. Handling updates via full re-index • If the corpus is small, or incremental updates infrequent… just re-index everything! • Pros: • Relatively easy to implement – update source documents and re-index • Allows adding all types of data, including e.g. labels as searchable text • Cons: • Infeasible for larger corpora or frequent updates, time-wise and cost-wise • Requires keeping around the source documents • Sometimes inconvenient, when documents are assembled in a complex pipeline

  8. Handling updates via Solr’s ExternalFileField • Pros: • Simple to implement • Updates are easy – just file edits, no need to re-index • Cons: • Only docId => number mappings for a field • Not suitable for full-text searchable field updates • E.g. can’t support user-generated labels attached to a doc • Still useful if a simple “popularity”-type metric is sufficient • Internally implemented as an in-memory ValueSource usable by function queries • Example file contents: doc0=1.5 doc1=2.5 doc2=0.5 …
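The external file format shown above is just one key=value pair per line. A minimal parser sketch for that layout (this models the file format, not Solr’s actual loader class):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExternalFileParser {
    // Parses "docKey=value" lines into a doc-key -> float map.
    public static Map<String, Float> parse(String fileContents) {
        Map<String, Float> values = new LinkedHashMap<>();
        for (String line : fileContents.split("\\R")) {   // split on any line break
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // skip blanks/comments
            int eq = line.indexOf('=');
            values.put(line.substring(0, eq),
                       Float.parseFloat(line.substring(eq + 1)));
        }
        return values;
    }
}
```

Updating a value really is “just a file edit”: rewrite the line and reload – no document is re-indexed.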

  9. Numeric DocValues updates • Since Lucene/Solr 4.6 … to be released Really Soon • Details can be found in LUCENE-5189 • As simple as: indexWriter.updateNumericDocValue(term, field, value) • Neatly solves the problem of numeric updates: popularity, in-stock, etc. • Some limitations: • Massive updates are still somewhat costly until the next merge (like deletes) • Can only update existing fields • Obviously doesn’t address full-text inverted field updates
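The semantics of that call can be modeled in plain Java: every document matching a term gets one per-document numeric value overwritten, without rewriting any other field. This is a behavioral sketch only (the real API is Lucene’s `IndexWriter.updateNumericDocValue(Term, String, long)`); the class here is illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NumericDocValuesSketch {
    // Per-document key (e.g. the "id" field) and numeric docvalues columns.
    private final List<String> idByDoc = new ArrayList<>();
    private final Map<String, long[]> numericColumns = new HashMap<>();

    public int addDoc(String id) { idByDoc.add(id); return idByDoc.size() - 1; }

    public void setColumn(String field, long[] values) { numericColumns.put(field, values); }

    public long get(String field, int docId) { return numericColumns.get(field)[docId]; }

    // Mirrors updateNumericDocValue(new Term("id", id), field, value):
    // docs matching the term get the new value; nothing else is rewritten.
    public void updateNumericDocValue(String id, String field, long value) {
        long[] column = numericColumns.get(field);
        for (int doc = 0; doc < idByDoc.size(); doc++) {
            if (idByDoc.get(doc).equals(id)) column[doc] = value;
        }
    }
}
```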

  10. Lucene ParallelReader overview • Pretends that two or more IndexReader-s are slices of the same index • Slices contain data for different fields • Both stored and inverted parts are supported • Data for matching docs is joined on the fly • Structure of all indexes MUST match 1:1 !!! • The same number of segments • The same count of docs per segment • Internal document ID-s must match 1:1 • The list of deletes is taken from the first index • Sounds cool … but in practice it’s rarely used: • It’s very difficult to meet these requirements • This is even more difficult in the presence of index updates and merges • [Slide diagram: a ParallelReader joining a main IR holding fields f1, f2, … with a parallel IR holding fields f3, f4, …, segment by segment, with identical docId layouts 0–6]
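The on-the-fly join can be sketched as merging two per-docId field maps, which only works while both slices have identical doc counts and ordering. A toy model of the idea (not Lucene’s actual ParallelReader code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ParallelJoinSketch {
    // Returns the merged field view of one document, joined purely by
    // internal docId position across the two slices.
    public static Map<String, String> document(
            List<Map<String, String>> mainSlice,      // holds f1, f2, ...
            List<Map<String, String>> parallelSlice,  // holds f3, f4, ...
            int docId) {
        if (mainSlice.size() != parallelSlice.size())
            throw new IllegalStateException("slice structures must match 1:1");
        Map<String, String> joined = new HashMap<>(mainSlice.get(docId));
        joined.putAll(parallelSlice.get(docId));
        return joined;
    }
}
```

The fragility is visible right in the precondition: the moment one index merges or flushes differently from the other, positional joining silently returns the wrong fields.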

  11. Handling updates via ParallelReader • Pros: • All types of data (e.g. searchable full-text labels) can be added • Cons: • Must ensure that the other index always matches the structure of the main index • Complicated and fragile (rebuild on every update?) • No tools to manage this parallel index in Solr • [Slide diagram: a ParallelReader joining a main IR (fields f1, f2, …) with a parallel IR (fields f3, f4, …) across matching segments]

  12. Sidecar Index Components for Solr • Uses the ParallelReader strategy for field updates • “Main” and “sidecar” data comes from two different Solr collections • “Sidecar” collection is updated independently from the main collection • “Sidecar” collection is used as a source of document fields for building and updating a parallel index • Integrates the management of ParallelReader (“sidecar index”) into Solr • Initial creation of ParallelReader, including synchronization of internal ID-s • Tracking of updates and IndexReader.reopen(…) events • Partly based on a version of Click Framework in LucidWorks Search • Available under Apache License here: http://github.com/LucidWorks/sidecar_index

  13. “Main”, “sidecar” collections and parallel index • “Main” collection contains only the parts of documents with “main” fields • “Sidecar” collection is a source of documents with “sidecar” fields • SidecarIndexReaderFactory creates and maintains the parallel index (sidecar index) • “Main” collection uses SidecarIndexReader that acts as ParallelReader • Main index is updated as usual, via the “main” collection’s IndexWriter • [Slide diagram: inside Solr, Main_collection and Sidecar_collection feed a SidecarIndexReader that spans the main index and the sidecar index]

  14. Implementation details • SidecarIndexReaderFactory extends Solr’s IndexReaderFactory • newReader(Directory, SolrCore) – initial open • newReader(IndexWriter, SolrCore) – NRT open • SidecarIndexReader acts like a ParallelReader • Solr wants a DirectoryReader, but ParallelReader is not a DirectoryReader • Basically had to re-implement the logic from ParallelReader • ParallelReader challenges: • How to synchronize internal ID-s? • How to create segments that are of the same size as those of the main index? • How to handle deleted documents? • How to handle updates to the main index? • How to handle updates to the sidecar data?

  15. ParallelReader challenges and solutions • How to synchronize internal ID-s? • The “main” collection is traversed sequentially by internal docId • The primary key is retrieved for each document • The matching document is found in the “sidecar” collection • The matching document is added to the “sidecar” index • Very costly phase! • Random seek and retrieval from the “sidecar” collection • Primary key lookup is fast • … but stored field retrieval and indexing isn’t • [Slide diagram: the main IR is scanned by docId, each primary key (e.g. q=id:D) is looked up in the sidecar collection, and the matching sidecar fields f3, f4, … are appended in main-index order]
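The synchronization pass described above can be sketched as a sequential scan of the main index’s primary keys with a random lookup into the sidecar collection for each one. A minimal model (class and method names are hypothetical, not the sidecar codebase):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SidecarSyncSketch {
    // Builds the sidecar index in main-index docId order, so positions align 1:1.
    public static List<Map<String, String>> build(
            List<String> mainPrimaryKeys,                  // main index, by internal docId
            Map<String, Map<String, String>> sidecarByKey  // sidecar collection, by primary key
    ) {
        List<Map<String, String>> sidecarIndex = new ArrayList<>();
        for (String key : mainPrimaryKeys) {               // sequential main scan (cheap)
            Map<String, String> fields = sidecarByKey.get(key); // random lookup per doc
            // Docs missing from the sidecar collection get an empty placeholder.
            sidecarIndex.add(fields != null ? fields : new LinkedHashMap<>());
        }
        return sidecarIndex;
    }
}
```

The per-document random retrieval in this loop is the “very costly phase” the slide refers to: the key lookup is cheap, but fetching and re-indexing the sidecar fields for every main document is not.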

  16. ParallelReader challenges and solutions • Optimization 1: don’t rebuild data for unmodified segments • Optimization 2 (cheating): ignore NRT segments • How to handle deleted docs? • Insert dummy (empty) documents so that the number and the order of documents still match • [Slide diagram: a deleted slot in the main IR is mirrored by a dummy document in the sidecar IR, and NRT segments are skipped]
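The dummy-document trick can be sketched directly: a deleted main-index slot still consumes an internal docId, so the sidecar side emits an empty document at the same position to preserve alignment. An illustrative model (names are hypothetical):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DummyDocSketch {
    // Re-inserts empty "dummy" docs at the positions deleted in the main index,
    // so the sidecar's doc count and ordering still match 1:1.
    public static List<Map<String, String>> withDummies(
            List<Map<String, String>> liveSidecarDocs, // sidecar docs for live main docs, in order
            Set<Integer> deletedDocIds,                // internal docIds deleted in main
            int totalMainDocs) {
        List<Map<String, String>> aligned = new ArrayList<>();
        int live = 0;
        for (int doc = 0; doc < totalMainDocs; doc++) {
            if (deletedDocIds.contains(doc)) {
                aligned.add(new HashMap<>());          // dummy placeholder doc
            } else {
                aligned.add(liveSidecarDocs.get(live++));
            }
        }
        return aligned;
    }
}
```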

  17. Implementation: SidecarMergePolicy • How to create segments that are of the same size as the “main” index? • Carefully manage the “sidecar” index creation: • IndexWriter uses SerialMergeScheduler to prevent out-of-order merges • Force a flush when reaching the next target count of documents • Merges are enforced using SidecarMergePolicy, which tracks the sizes of the “main” index segments • [Slide diagram: SidecarMergePolicy target sizes mirror the main index – Seg0: 4 docs, Seg1: 2 docs, Seg2: 1 doc]
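The effect of forcing flushes at target doc counts can be sketched as partitioning the sidecar doc stream into segments whose sizes mirror the main index (e.g. targets 4, 2, 1 for a 7-doc main index, as in the slide). A simplified model, not the actual SidecarMergePolicy code:

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentAlignSketch {
    // Splits the sidecar doc stream into segments whose doc counts exactly
    // match the main index's segment sizes ("flush at each target count").
    public static List<List<Integer>> splitBySegments(List<Integer> sidecarDocs,
                                                      int[] targetSizes) {
        List<List<Integer>> segments = new ArrayList<>();
        int next = 0;
        for (int size : targetSizes) {
            segments.add(new ArrayList<>(sidecarDocs.subList(next, next + size)));
            next += size;
        }
        if (next != sidecarDocs.size())
            throw new IllegalStateException("doc counts must match the main index");
        return segments;
    }
}
```

In the real implementation the hard part is not the arithmetic but preventing the IndexWriter from flushing or merging on its own schedule, which is why SerialMergeScheduler and a custom merge policy are both needed.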

  18. Implementation: SidecarIndexReader • Re-implements the logic of ParallelReader • ParallelReader != DirectoryReader • Exposes the Directory of the “main” index for replication • Replicas need the “sidecar” collection replica to rebuild the sidecar index locally • If document routing and shard placement are the same, then we don’t have to use distributed search – all data will be local • reopen(…) avoids rebuilding unmodified segments • reopen(…) uses SidecarIndexReaderFactory to rebuild the sidecar index when necessary: • When there’s a major merge in the “main” index • When “sidecar” data is updated • Ref-counting of IndexReaders at different levels is very tricky!

  19. Example configuration in solrconfig.xml

  <indexReaderFactory name="IndexReaderFactory"
                      class="com.lucid.solr.sidecar.SidecarIndexReaderFactory">
    <str name="docIdField">id</str>
    <str name="sourceCollection">source</str>
    <bool name="enabled">true</bool>
  </indexReaderFactory>

  20. Example use case: integration of click-through data • Raw click-through data: • Query, query_time, docId, click_time [, user] • Aggregated click-through data: • User-generated popularity score: F(number and timing of clicks per docId) • Numeric updates • User-generated labels: F(top-N queries that led to clicks on docId) • Full-text searchable updates • User profiles: F(top-N queries per user, top-N docId-s clicked, etc) • … • Queries can now be expanded to score based on TF/IDF in user-generated labels
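The popularity aggregation above can be sketched by folding the raw click records into per-document scores. The scoring function F here is a plain click count as a placeholder; the slide deliberately leaves F open (e.g. it could weight by click recency):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClickAggregatorSketch {
    // Each raw record: { query, query_time, docId, click_time }.
    public static Map<String, Long> popularity(List<String[]> clicks) {
        Map<String, Long> scores = new HashMap<>();
        for (String[] click : clicks) {
            scores.merge(click[2], 1L, Long::sum); // F = plain click count here
        }
        return scores;
    }
}
```

The resulting docId => score map is exactly the kind of numeric “sidecar” data that would be written into the sidecar collection (or, for the numeric-only case, handled by DocValues updates or an external file).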

  21. Scalability and performance

  22. Scalability and performance • Initial full rebuild is very costly • ~0.6 ms / document • 1 million docs = 600 sec = 10 min • Not even close to “real time” … • The cost of handling new segments in the “main” index depends on the size of the segments • Major merge events will trigger a full rebuild • BUT: search-time cost is negligible

  23. Caveats • Combination of ref-counting in Lucene, Solr and ParallelReader is difficult to track • The sidecar code is still unstable and occasionally explodes • Performance of full rebuild quickly becomes the bottleneck on frequent updates • So the main use case is massive but infrequent updates of “sidecar” data • Code: http://github.com/LucidWorks/sidecar_index • Fixes and contributions are welcome – the code is Apache licensed

  24. Agenda • Challenge: incremental document updates • Existing solutions and workarounds • Sidecar index strategy and components • Scalability and performance • QA

  25. QA Andrzej Bialecki ab@lucidworks.com
