Apache Lucene and Apache Solr Performance Tuning

Mark Miller (markrmiller@apache.org)

Brief Intro To Lucene
• Lucene: Java library for building and searching “inverted” indices.
• Small, efficient, fast
• Approx. 1 MB jar file

Inverted Index
• Think of a book index: look up a term and get the places where it appears.
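
The book-index analogy can be sketched in a few lines of Java. This is a toy model only, not Lucene's data structures: the class name InvertedIndexSketch, its methods, and the simple tokenizer are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: maps each term to the ordered list of document ids containing it.
// Illustration only; Lucene's real index is segmented, compressed, and stored on disk.
class InvertedIndexSketch {
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Lowercases the text, splits on anything that is not a-z or 0-9,
    // and records the new document id under every term it contains.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // Skip the id if this term was already seen in the same document.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of documents containing the term (empty list if none).
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new ArrayList<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("Solr is built on Lucene");
        index.addDocument("Lucene is a Java library");
        System.out.println(index.search("lucene")); // [0, 1]
        System.out.println(index.search("solr"));   // [0]
    }
}
```

A search never scans document text; it only looks up the term and reads its list, which is why lookups stay fast as the collection grows.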

Segments
• Incremental indexing

Index Files
• segments file
• .fnm – field names (indexed? payloads? term vectors?)
• .tis .tii – term dictionary
• .frq – term frequencies
• .fdt .fdx – stored fields
• .tvx .tvd .tvf – term vectors (freq and optional pos/offset)
• .nrm – norms
• .del – deletions

Brief Intro To Solr
• Solr: search server built on top of Lucene
• Manages index views, provides different access protocols (HTTP, Java, PHP, Ruby, etc.).
• Adds many features: faceting, spellchecking, distribution, replication, caching, etc.

Solr's solrconfig.xml
• Controls Solr's settings and, in some cases, Lucene's settings
• The example solrconfig is exactly that: an example (a starting point, not an end point)

solrconfig.xml
• useCompoundFile: writes each segment's files into a single .cfs file; slower indexing (~10%)
• mergeFactor: controls how often merges occur and the number of segments
• ramBufferSizeMB (generally better than maxBufferedDocs)

solrconfig.xml
• <reopenReaders>true</reopenReaders>

solrconfig.xml
• <queryResultWindowSize>20</queryResultWindowSize>
• <queryResultMaxDocsCached>200</queryResultMaxDocsCached>

solrconfig.xml
• <useColdSearcher>false</useColdSearcher>
• <maxWarmingSearchers>2</maxWarmingSearchers>

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">solr rocks</str><str name="start">0</str><str name="rows">10</str></lst>
    <lst><str name="q">static firstSearcher warming query from solrconfig.xml</str></lst>
  </arr>
</listener>

Solr Caches
• Turn them off?
• Size them correctly
• Look at your cache stats to decide what to do (e.g. hits, evictions)
• Play with autowarmCount


• Solr has two cache implementations: LRUCache, based on a synchronized LinkedHashMap, and FastLRUCache, based on a ConcurrentHashMap. FastLRUCache has faster gets and slower puts in single-threaded operation, so it is generally faster than LRUCache when the cache hit ratio is high (> 75%), and may be faster in other scenarios on multi-CPU systems.
• The example solrconfig.xml uses FastLRUCache for the filter cache.
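
To make the "synchronized LinkedHashMap" idea concrete, here is a minimal sketch of an LRU cache that also tracks the hit and eviction numbers worth watching in the cache stats. It is not Solr's LRUCache source: the class LruCacheSketch, its counters, and its method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: an access-ordered LinkedHashMap guarded by synchronized methods.
// Names and counters are hypothetical; this is not Solr's implementation.
class LruCacheSketch<K, V> {
    private final int maxSize;
    private long lookups;
    private long hits;
    private long evictions;
    private final Map<K, V> map;

    LruCacheSketch(int maxSize) {
        this.maxSize = maxSize;
        // accessOrder=true moves an entry to the end of the iteration order on each access,
        // so the eldest entry is always the least recently used one.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                boolean evict = size() > LruCacheSketch.this.maxSize;
                if (evict) {
                    evictions++;
                }
                return evict;
            }
        };
    }

    // Every get takes the lock, which is the contention a concurrent map avoids.
    synchronized V get(K key) {
        lookups++;
        V value = map.get(key);
        if (value != null) {
            hits++;
        }
        return value;
    }

    synchronized void put(K key, V value) {
        map.put(key, value);
    }

    synchronized double hitRatio() {
        return lookups == 0 ? 0.0 : (double) hits / lookups;
    }

    synchronized long evictionCount() {
        return evictions;
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>(2);
        cache.put("q1", "r1");
        cache.put("q2", "r2");
        cache.get("q1");         // hit; q1 becomes most recently used
        cache.put("q3", "r3");   // over capacity: evicts q2
        cache.get("q2");         // miss
        System.out.println("hitRatio=" + cache.hitRatio() + " evictions=" + cache.evictionCount());
        // hitRatio=0.5 evictions=1
    }
}
```

A low hit ratio combined with a high eviction count suggests the cache is too small for the working set; a low hit ratio with few evictions suggests the entries are simply not being reused.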

NIOFSDirectory
• An FSDirectory implementation that uses java.nio's FileChannel positional read, which allows multiple threads to read from the same file without synchronizing.
• Solr selects it automatically on non-Windows systems; on Windows it performs poorly due to a Sun JVM bug.

Lucene Autocommit
• There is no guarantee when exactly an auto commit will occur (it used to be after every flush, but it is now after every completed merge, as of 2.4).

Merge Policy

Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxThreadCount">3</int>
</mergeScheduler>

solrconfig.xml
• termIndexInterval: this parameter determines the amount of computation required per query term, regardless of the number of documents that contain that term. In particular, it is the maximum number of other terms that must be scanned before a term is located and its frequency and position information may be processed.
• In a large index with user-entered query terms, query processing time is likely to be dominated not by term lookup but by the processing of frequency and positional data.
• In a small index, or when many uncommon query terms are generated (e.g., by wildcard queries), term lookup may become a dominant cost.
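
The "maximum number of other terms that must be scanned" can be illustrated with a simplified model: keep every Nth term of a sorted dictionary in memory, binary-search that sample, then scan forward within one block. This sketch is an assumption-laden toy (class TermIndexSketch, the field lastScanCount, and the generated terms are all hypothetical) and does not reflect Lucene's actual file format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified model of a sampled term index: every interval-th term is held in RAM,
// and a lookup scans at most `interval` dictionary terms after a binary search.
class TermIndexSketch {
    private final String[] sortedTerms;                      // stands in for the full term dictionary
    private final List<String> sampled = new ArrayList<>();  // every interval-th term, kept in memory
    private final int interval;
    int lastScanCount;                                       // terms examined by the latest lookup's scan

    TermIndexSketch(String[] terms, int interval) {
        this.sortedTerms = terms.clone();
        Arrays.sort(this.sortedTerms);
        this.interval = interval;
        for (int i = 0; i < sortedTerms.length; i += interval) {
            sampled.add(sortedTerms[i]);
        }
    }

    // Returns the term's position in the sorted dictionary, or -1 if absent.
    int lookup(String term) {
        lastScanCount = 0;
        // Binary search for the last sampled term that is <= the target.
        int lo = 0, hi = sampled.size() - 1, block = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (sampled.get(mid).compareTo(term) <= 0) {
                block = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (block < 0) return -1; // target sorts before the first term
        // Linear scan inside the block: bounded by the interval.
        int start = block * interval;
        int end = Math.min(start + interval, sortedTerms.length);
        for (int i = start; i < end; i++) {
            lastScanCount++;
            int cmp = sortedTerms[i].compareTo(term);
            if (cmp == 0) return i;
            if (cmp > 0) break;
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] terms = new String[1000];
        for (int i = 0; i < terms.length; i++) {
            terms[i] = String.format(Locale.ROOT, "term%03d", i);
        }
        TermIndexSketch coarse = new TermIndexSketch(terms, 128);
        coarse.lookup("term127");
        System.out.println("interval 128 scanned " + coarse.lastScanCount); // 128
        TermIndexSketch fine = new TermIndexSketch(terms, 16);
        fine.lookup("term127");
        System.out.println("interval 16 scanned " + fine.lastScanCount);    // 16
    }
}
```

A smaller interval shortens the scan but keeps more sampled terms in memory, which is the trade-off the setting exposes.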

Solr's schema.xml
• Controls how content is going to be processed and stored in Solr.
• Again, the version that comes with Solr is an example, not a final schema for your application.

schema.xml
• Only store the fields you need to retrieve; use stored="false" for the rest.
• Lazy loading (on by default in solrconfig.xml) can help if you have large stored fields that are not always returned.
• Don't index fields you only want to return: indexed="false"

schema.xml
• copyField → copies fields into a target field
• Remove unused copyFields.
• Consider searching a single copyField target rather than many fields.
• You probably don't want to store the target of a copyField.

schema.xml
• Consider Trie field types for numerics
• Breaks up numerics into multiple tokens
• Much faster range query performance on large indexes
• Doesn't yet work with some features (e.g. faceting, though date faceting does currently work with TrieDateField)
• Replaces both plain numerics and sortable numerics (unless you need "sortMissingFirst" or "sortMissingLast")

schema.xml
• Omit norms where it makes sense (omitNorms="true")
• You lose index-time boosting and document length normalization
• Norms take up a byte per document in RAM, allocated for every document per field no matter how many documents have that field: byte[maxDoc]
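
The byte[maxDoc] cost adds up quickly across fields. The arithmetic below is a back-of-the-envelope sketch; the document and field counts are made-up inputs, and the class and method names are hypothetical.

```java
// Rough estimate of norms memory: one byte per document for every field that keeps norms.
class NormsMemorySketch {
    // maxDoc: number of documents in the index; fieldsWithNorms: fields without omitNorms.
    static long normsBytes(long maxDoc, int fieldsWithNorms) {
        return maxDoc * fieldsWithNorms;
    }

    public static void main(String[] args) {
        // Hypothetical index: 10 million documents, 20 fields with norms enabled.
        long bytes = normsBytes(10_000_000L, 20);
        System.out.println(bytes + " bytes, about " + bytes / (1024 * 1024) + " MB");
        // 200000000 bytes, about 190 MB
    }
}
```

Because the array is sized by maxDoc, a field that appears in only a handful of documents costs as much as one that appears in all of them.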

schema.xml
• Use omitTermFreqAndPositions when it makes sense
• true by default except for text fields.
• Drops tf and position info for a field
• Can be useful for short db-type fields, where you want term matching but not scores or positional matching.

Leading Wildcard Performance
• Very slow by default: enumerates every term in the index
• Lucene's QueryParser does not allow it by default; Solr hasn't allowed it at all in the past.
• Use solr.ReversedWildcardFilterFactory
• A filter that reverses tokens to provide faster leading wildcard and prefix queries. Add this filter to the index analyzer, but not the query analyzer. The standard Solr query parser (SolrQuerySyntax) will use this to reverse wildcard and prefix queries to improve performance (for example, translating myfield:*foo into myfield:oof*). To avoid collisions and false matches, reversed tokens are indexed with a prefix that should not otherwise appear in indexed text.
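
The reversal trick can be sketched with a sorted set of terms: index each token both as-is and reversed behind a marker, then answer *foo as a prefix scan for the reversed form. This is a toy illustration, not the filter's source; the class ReversedWildcardSketch, its methods, and the choice of '\u0001' as marker are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy model of reversed-token indexing for leading wildcards.
// The marker character is a hypothetical choice; it only needs to be absent from real text.
class ReversedWildcardSketch {
    static final char MARKER = '\u0001';
    private final NavigableSet<String> terms = new TreeSet<>();

    private static String reversedWithMarker(String s) {
        return MARKER + new StringBuilder(s).reverse().toString();
    }

    // Index time: store the token and its marked, reversed form.
    void index(String token) {
        terms.add(token);
        terms.add(reversedWithMarker(token));
    }

    // Query time: "*suffix" becomes a prefix scan over the reversed terms,
    // so only the matching range of the sorted set is visited.
    List<String> leadingWildcard(String suffix) {
        String prefix = reversedWithMarker(suffix);
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) break;
            // Drop the marker and reverse back to the original token.
            matches.add(new StringBuilder(term.substring(1)).reverse().toString());
        }
        return matches;
    }

    public static void main(String[] args) {
        ReversedWildcardSketch index = new ReversedWildcardSketch();
        index.index("foo");
        index.index("barfoo");
        index.index("food");
        System.out.println(index.leadingWildcard("foo")); // [foo, barfoo]
    }
}
```

The cost is a larger term dictionary, since every token is stored twice, in exchange for turning a full term enumeration into a bounded range scan.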

JVM Settings
• Most important: -Xmx -Xms
• Most RAM usage: field caches, Solr caches, index searchers (term index, norms?)
• How to choose -Xmx -Xms?
• Leave room for the filesystem cache

Filesystem Cache
• Leave room for it.
• Warming queries help fire it up
• Ensure important files are in the cache? cat *.prx *.frq *.tis > /dev/null

Garbage Collection Tuning
• Large multi-gig heap? Choose your collector:
• Likely the concurrent low-pause collector, but perhaps the parallel (throughput) collector.
• Adventurous? Try the G1 collector. Still likely buggy. -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC

GC Tuning
• Parallel compaction is used by default in JDK 6, but can be enabled by adding the option -XX:+UseParallelOldGC to the command line in JDK 5 update 6 and later.
• With CMS, ParNew (-XX:+UseParNewGC) is on by default on multiprocessor machines.

Logging
• Solr logging is chatty: it defaults to INFO
• Raising the level can increase performance
• It's often not worth the information loss though
