Solr Performance & Key Innovations

Solr Performance & Key Innovations. Yonik Seeley, Lucid Imagination, yonik@lucidimagination.com, May 26 2011. Solr 3.1 Highlights. Numeric range facets (similar to date faceting). New spatial search, including spatial filtering, boosting and sorting capabilities.

Presentation Transcript


  1. Solr Performance & Key Innovations Yonik Seeley, Lucid Imagination, yonik@lucidimagination.com, May 26 2011

  2. Solr 3.1 Highlights • Numeric range facets (similar to date faceting). • New spatial search, including spatial filtering, boosting and sorting capabilities. • Example Velocity driven search UI at http://localhost:8983/solr/browse • A new faster termvector-based highlighter. • Extended dismax (edismax) query parser with support for fielded queries, enhanced relevancy, and full lucene syntax support. • Distributed search support for the Spell check and Terms components.

  3. Solr 3.1 Highlights (continued) • Suggester, a fast trie-based autocomplete component. • Sort results by any function query. • JSON document indexing. • CSV response format • Apache UIMA integration for metadata extraction. • Tons of optimizations, bugfixes, and new analysis capabilities via Apache Lucene 3.1.

  4. What’s not in 3.1? • Result Grouping (AKA Field Collapsing) • Pivot Faceting • SolrCloud • Pseudo-fields • Pseudo-join • Relevancy function queries • Per-segment faceting • *Tons* of new Lucene performance/efficiency goodness

  5. Recent Lucene Performance • TieredMergePolicy – the new default • Much better for incremental indexing / NRT • Ignores segment order when selecting best merge • Takes deletes into account • Does not over-merge (no cascading merges) • Finite State Transducer (FST) based terms index

  6. DocumentsWriterPerThread (DWPT)
     • Flushing new segment is now concurrent w/ indexing
     • Use multiple indexing threads/connections
     • When max mem is hit, biggest DWPT is concurrently flushed
     [Diagram: indexing threads feed an IndexWriter that holds several in-memory DWPTs; each DWPT flushes its own segment to disk (_1_0.tiv, _1_0.prx, _1_0.frq, …; _2_0.tiv, _2_0.prx, _2_0.frq, …; _3_0.tiv, _3_0.prx, _3_0.frq, …)]

  7. Solr Cloud
     http://.../solr/collection1?distrib=true
     [Diagram: the request is load-balanced as sub-requests across shard1 and shard2, each with replica1, replica2 and replica3; a ZooKeeper quorum of ZK nodes holds the tree below]
     /livenodes
         server1:8983/solr
         server2:8983/solr
         server2:8983/solr
     /collections
         /collection1
             configName=myconf
             /shards
                 /shard1
                     server1:8983/solr
                     server2:8983/solr
                 /shard2
                     server3:8983/solr
                     server4:8983/solr
     /configs
         /myconf
             solrconfig.xml
             schema.xml

  8. Solr Cloud: Getting Started
     http://wiki.apache.org/solr/SolrCloud
     java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar
     • -Dbootstrap_confdir and -Dcollection.configName: upload /solr/conf to ZK and call it "myconf"
     • -DzkRun: run an internal ZK server
     http://localhost:8983/solr/collection1/admin/zookeeper.jsp

  9. Distributed Requests
     • Explicitly specify node addresses to load-balance across
       shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
         • A list of equivalent nodes is separated by "|"
         • Different phases of the same distributed request use the same node
     • Specify logical shard ids to search across
       shards=NY_shard,NJ_shard
     • Query across all shards in the collection
       http://localhost:8983/solr/collection1/select?distrib=true
     • public CloudSolrServer(String zkHost)
         • SolrJ Java client that load-balances across all nodes in cluster
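
The explicit shards syntax above can be assembled programmatically. The following is a minimal sketch, assuming the caller already knows the replica addresses of each shard; the helper name build_shards_param and the query q=solr are illustrative and do not come from the slides.

```python
from urllib.parse import urlencode


def build_shards_param(shard_replicas):
    """Join equivalent replicas with '|' and separate shards with ','."""
    return ",".join("|".join(replicas) for replicas in shard_replicas)


# Two shards, each with two equivalent replicas (addresses from the slide).
shards = build_shards_param([
    ["localhost:8983/solr", "localhost:8900/solr"],
    ["localhost:7574/solr", "localhost:7500/solr"],
])

# URL-encode the value so it can be appended to a /select request.
query_string = urlencode({"q": "solr", "shards": shards})
print(shards)
print(query_string)
```

Keeping the shard layout as a list of lists makes it easy to add or remove a replica without editing a long parameter string by hand.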

  10. Extended Dismax Parser
     • Superset of dismax
     • Designed to directly handle user queries w/o exceptions
       &defType=edismax&q=foo&qf=body
     • Fixes edge cases where dismax could still throw exceptions: OR AND NOT - "
     • Full lucene syntax support
         • Tries lucene syntax first
         • Smart escaping is done if there are syntax errors
     • Optionally supports treating "and"/"or" as AND/OR in lucene syntax
     • Fielded queries (e.g. myfield:foo) even in degraded mode
     • uf parameter controls what field names may be directly specified in "q"

  11. Extended Dismax Parser (continued)
     • boost parameter for multiplicative boost-by-function
     • Pure negative query clauses
       Example: solr OR (-solr)
     • Enhanced term proximity boosting
         • pf2=myfield results in term bigrams in sloppy phrase queries
           myfield:"aa bb cc" -> myfield:"aa bb" myfield:"bb cc"
     • Enhanced stopword handling
         • Stopwords omitted in main query, but added in optional proximity boosting part
           Example: q=solr is awesome & qf=myfield & pf2=myfield -> +myfield:(solr awesome) (myfield:"solr is" myfield:"is awesome")
         • Currently controlled by the absence of StopWordFilter in index analyzer, and presence in query analyzer
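
The pf2 bigram expansion and the stopword example above can be mimicked in a few lines. This is only an illustration of the rewrite shown on the slide, not the parser's actual implementation; the function names and the whitespace-split term list are assumptions.

```python
def pf2_phrases(field, terms):
    """Build one phrase clause per pair of adjacent terms (term bigrams)."""
    return ['%s:"%s %s"' % (field, a, b) for a, b in zip(terms, terms[1:])]


def sketch_rewrite(field, terms, stopwords):
    """Drop stopwords from the mandatory part, keep them in the bigram boosts."""
    main_terms = [t for t in terms if t not in stopwords]
    main = "+%s:(%s)" % (field, " ".join(main_terms))
    boosts = " ".join(pf2_phrases(field, terms))
    return "%s (%s)" % (main, boosts)


phrases = pf2_phrases("myfield", ["aa", "bb", "cc"])
rewritten = sketch_rewrite("myfield", ["solr", "is", "awesome"], {"is"})
print(phrases)
print(rewritten)
```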

  12. Faceting Performance Improvements
     • For facet.method=enum, speed up initial population of the filterCache (i.e. first time facet): from 30% to 32x improvement
     • Optimized facet.method=fc for multi-valued fields and large facet.limit – up to 3x faster
     • Optimized deep facet paging – up to 10x faster with really large facet.offsets
     • Less memory consumed by field cache entries
     • Per-segment faceting with facet.method=fcs
         • Only faster when re-opening index frequently (many times a second)
         • Only works for single-valued fields

  13. Pivot Faceting • Other names that could have made sense: • Grid Faceting, Cross-Product Faceting, Matrix Faceting • Syntax: facet.pivot=field1,field2,field3,… facet.pivot=cat,inStock

  14. Pivot Faceting (continued)
     http://...&facet=true&facet.pivot=cat,popularity
     "facet_counts":{
       "facet_pivot":{
         "cat,popularity":[{
             "field":"cat",
             "value":"electronics",
             "count":14,                  (14 docs w/ cat==electronics)
             "pivot":[{
                 "field":"popularity",
                 "value":"6",
                 "count":5},              (5 docs w/ cat==electronics && popularity==6)
               { "field":"popularity",
                 "value":"7",
                 "count":4},
               { "field":"popularity",
                 "value":"1",
                 "count":2}]},
           { "field":"cat",
             "value":"memory",
             "count":3,
             "pivot":[]},
           […]
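
A client usually needs to walk the nested "pivot" lists to get one count per value combination. The sketch below flattens an abbreviated copy of the response above; the function name flatten_pivot is hypothetical, and the sample data is cut down from the slide.

```python
def flatten_pivot(entries, path=()):
    """Return (path, count) pairs, where path is a tuple of (field, value)."""
    rows = []
    for entry in entries:
        current = path + ((entry["field"], entry["value"]),)
        rows.append((current, entry["count"]))
        # Recurse into the next pivot level, if any.
        rows.extend(flatten_pivot(entry.get("pivot", []), current))
    return rows


# Abbreviated response, using the values shown on the slide.
response = {"facet_counts": {"facet_pivot": {"cat,popularity": [
    {"field": "cat", "value": "electronics", "count": 14, "pivot": [
        {"field": "popularity", "value": "6", "count": 5},
        {"field": "popularity", "value": "7", "count": 4},
    ]},
    {"field": "cat", "value": "memory", "count": 3, "pivot": []},
]}}}

rows = flatten_pivot(response["facet_counts"]["facet_pivot"]["cat,popularity"])
for path, count in rows:
    print(" && ".join("%s==%s" % pair for pair in path), count)
```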

  15. Range Faceting
     • Like Date faceting, but more generic
     http://...&facet=true
       &facet.range=price
       &facet.range.start=0
       &facet.range.end=500
       &facet.range.gap=50
     "facet_counts":{
       "facet_ranges":{
         "price":{
           "counts":{
             "0.0":5,
             "50.0":2,
             "100.0":0,
             "150.0":2,
             "200.0":0,
             "250.0":1,
             "300.0":2,
             "350.0":2,
             "400.0":0,
             "450.0":1},
           "gap":50.0,
           "start":0.0,
           "end":500.0}}}}
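
Each key under "counts" is the lower bound of one bucket. As a rough sketch, assuming the gap divides the start-to-end span evenly (as it does in the slide's example), the bucket boundaries can be reproduced like this; range_buckets is an illustrative helper, not a Solr API.

```python
def range_buckets(start, end, gap):
    """Return (lower, upper) bounds for each bucket between start and end."""
    buckets = []
    lower = start
    while lower < end:
        buckets.append((lower, lower + gap))
        lower += gap
    return buckets


# Counts copied from the slide's response.
counts = {"0.0": 5, "50.0": 2, "100.0": 0, "150.0": 2, "200.0": 0,
          "250.0": 1, "300.0": 2, "350.0": 2, "400.0": 0, "450.0": 1}

buckets = range_buckets(0.0, 500.0, 50.0)
keys = [str(lower) for lower, _ in buckets]
total = sum(counts.values())
for (lower, upper), key in zip(buckets, keys):
    print("price %s to %s: %d" % (lower, upper, counts[key]))
```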

  16. Spatial Search
     Step 1: Index some locations!
       <field name="name">The Alpine Shop</field>
       <field name="store">44.013617,-73.168264</field>
     Step 2: Decide where you are
       &pt=44.0153371,-73.16734
       &d=1
       &sfield=store
     Step 3: Profit!
       Spatial Filter: &fq={!geofilt}
       Bounding Box: &fq={!bbox}
       Distance Function: &sort=geodist() asc
       Returning the distance: &fl=geodist() (Pseudo-fields!)
     Note: You can now sort by any arbitrary function query!
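
The three steps can be combined into a single request. Below is a minimal sketch that only builds the parameter list; the helper name spatial_query and the match-all q=*:* are assumptions added for illustration, while pt, d, sfield, fq and sort come from the slide.

```python
from urllib.parse import urlencode


def spatial_query(pt, d, sfield, filter_parser="geofilt", q="*:*"):
    """Build request parameters for a spatial filter sorted by distance."""
    return [
        ("q", q),
        ("fq", "{!%s}" % filter_parser),  # geofilt or bbox
        ("pt", pt),                       # where you are: "lat,lon"
        ("d", str(d)),                    # distance limit
        ("sfield", sfield),               # field holding the locations
        ("sort", "geodist() asc"),        # nearest first
    ]


params = spatial_query("44.0153371,-73.16734", 1, "store")
query_string = urlencode(params)
bbox_params = spatial_query("44.0153371,-73.16734", 1, "store", filter_parser="bbox")
print(query_string)
```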

  17. Pseudo-Fields
     Returns other info along with document stored fields
     • Function queries
       fl=name,location,geodist(),add(myfield,10)
     • Fieldname globs
       fl=id,attr_*
     • Multiple "fl" (field list) values
       &fl=id,attr_*&fl=geodist()&fl=termfreq(text,'solr')
     • Aliasing
       fl=id,location:loc,_dist_:geodist()
     • Future: inlined highlighting, "explain", sort-values, group-value
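
Repeating "fl" means the client must send the same parameter name several times. A small sketch of how that looks with a list of pairs follows; the match-all q=*:* is an assumption, and a real geodist() call would also need the pt and sfield parameters from the spatial slide.

```python
from urllib.parse import urlencode, parse_qs

# A list of pairs (rather than a dict) lets the same key appear repeatedly.
params = [
    ("q", "*:*"),
    ("fl", "id,attr_*"),
    ("fl", "geodist()"),
    ("fl", "termfreq(text,'solr')"),
]

query_string = urlencode(params)

# Decoding the query string gives back all three field-list values in order.
fl_values = parse_qs(query_string)["fl"]
print(query_string)
print(fl_values)
```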

  18. Result Grouping / Field Collapsing • Goal • Limit the number of results per category • “category” normally defined by unique values in a field • Uses • Web Search – collapse by web site • Email threads – collapse by thread id • Ecommerce/retail • Show the top 5 items for each store category (music, movies, etc)

  19. Field Collapsing by Site

  20. Result Grouping by Category Field Collapse on Product Type

  21. Group by Field
     http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact
     "grouped":{
       "manu_exact":{
         "matches":3,
         "groups":[{
             "groupValue":"Belkin",
             "doclist":{"numFound":2,"start":0,"docs":[
                 { "id":"IW-02",
                   "name":"iPod & iPod Mini USB 2.0 Cable"}]
             }},
           { "groupValue":"Apple Computer Inc.",
             "doclist":{"numFound":1,"start":0,"docs":[
                 { "id":"MA147LL/A",
                   "name":"Apple 60 GB iPod with Video Playback Black"}]
             }}]}}}
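
Reading a grouped response means iterating over "groups" and then over each group's "doclist". The following sketch summarizes the response above; summarize_groups is an illustrative name, and the data is copied from the slide.

```python
grouped = {"manu_exact": {"matches": 3, "groups": [
    {"groupValue": "Belkin",
     "doclist": {"numFound": 2, "start": 0, "docs": [
         {"id": "IW-02", "name": "iPod & iPod Mini USB 2.0 Cable"}]}},
    {"groupValue": "Apple Computer Inc.",
     "doclist": {"numFound": 1, "start": 0, "docs": [
         {"id": "MA147LL/A",
          "name": "Apple 60 GB iPod with Video Playback Black"}]}},
]}}


def summarize_groups(grouped, field):
    """Return (group value, total hits in group, ids of returned docs)."""
    return [
        (g["groupValue"], g["doclist"]["numFound"],
         [d["id"] for d in g["doclist"]["docs"]])
        for g in grouped[field]["groups"]
    ]


summary = summarize_groups(grouped, "manu_exact")
for value, num_found, ids in summary:
    print(value, num_found, ids)
```

Note that numFound counts every document in the group, while "docs" holds only as many as the request asked for, which is why the Belkin group reports 2 hits but lists one document.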

  22. Group by Query
     http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5
     "grouped":{
       "price:[0 TO 99.99]":{
         "matches":3,
         "doclist":{"numFound":2,"start":0,"docs":[
             { "id":"IW-02",
               "name":"iPod & iPod Mini USB 2.0 Cable"},
             { "id":"F8V7067-APL-KIT",
               "name":"Belkin Mobile Power Cord for iPod"}]
         }},
       "price:[100 TO *]":{
         "matches":3,
         "doclist":{"numFound":1,"start":0,"docs":[
             { "id":"MA147LL/A",
               "name":"Apple 60 GB iPod with Video Playback Black"}]
         }}}}

  23. Grouping Params

  24. Pseudo-Join
     • id: blog1
       name: Solr ‘n Stuff
       owner: Yonik Seeley
       started: 2007-10-26
     • id: blog2
       name: lifehacker
       owner: Gawker Media
       started: 2005-1-31
     • id: post1
       blog_id: blog1
       author: Yonik Seeley
       title: Solr relevancy function queries
       body: Lucene’s default ranking […]
     • id: post2
       blog_id: blog1
       author: Yonik Seeley
       title: Solr result grouping
       body: Result Grouping, also called […]
     • id: post3
       blog_id: blog2
       author: Whitson Gordon
       title: How to Install Netflix on Almost Any Android Device
     Restrict to blogs mentioning netflix
       fq={!join from=blog_id to=id}body:netflix
     • Finds all documents matching "netflix"
     • Maps to different docs by following blog_id to id
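
The two-step behaviour (match, then follow the "from" field to the "to" field) can be imitated in memory. This is a sketch of the semantics described above, not Solr's implementation: pseudo_join is a hypothetical helper, fields are treated as single-valued, and the match is done on "title" because the slide does not show the body of post3.

```python
docs = [
    {"id": "blog1", "name": "Solr 'n Stuff", "owner": "Yonik Seeley"},
    {"id": "blog2", "name": "lifehacker", "owner": "Gawker Media"},
    {"id": "post1", "blog_id": "blog1", "title": "Solr relevancy function queries"},
    {"id": "post2", "blog_id": "blog1", "title": "Solr result grouping"},
    {"id": "post3", "blog_id": "blog2",
     "title": "How to Install Netflix on Almost Any Android Device"},
]


def pseudo_join(docs, from_field, to_field, matches):
    """Collect from_field values of matching docs, return docs whose to_field is among them."""
    from_values = {d[from_field] for d in docs if from_field in d and matches(d)}
    return [d for d in docs if d.get(to_field) in from_values]


# Like {!join from=blog_id to=id}: posts mentioning netflix -> their blogs.
blogs = pseudo_join(docs, "blog_id", "id",
                    lambda d: "netflix" in d.get("title", "").lower())

# The reverse direction, {!join from=id to=blog_id}: blogs by owner -> their posts.
posts = pseudo_join(docs, "id", "blog_id",
                    lambda d: d.get("owner") == "Yonik Seeley")

print([d["id"] for d in blogs])
print([d["id"] for d in posts])
```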

  25. Pseudo-Join Examples
     • Only show posts from blogs started after 2010
       q=foo&fq={!join from=id to=blog_id}started:[2010 TO *]
     • If any post in a blog mentions "obama", then search all posts in that blog for "bomb" (self-join)
       q=bomb&fq={!join from=blog_id to=blog_id}obama
     • If any blog post mentions "obama", then search all websites with the same blog owner for "bomb"
       q=bomb&fq={!join from=owner to=website_owner}{!join from=blog_id to=id}obama

  26. Cross-Core Join
     http://localhost:8983/solr/collection1/select?q=foo&fq={!join fromIndex=sec1 from=security_groups to=security}user:john
     Single Solr Server with two cores:
     sec1
       • id: mary
         security_groups: managers, employees
       • id: john
         security_groups: employees
     collection1
       • id: doc1
         security: managers
         title: doc for managers only
         body: …
       • id: doc1
         security: managers, employees
         title: doc for everyone
         body: …

  27. Pseudo-Join vs Grouping

  28. Auto-Suggest
     • Many people previously used terms component
         • Can be slow for a large corpus
     • New auto-suggest builds off SpellCheck component
         • TST implementation: compact memory based trie
         • FST implementation: slower to build, but smaller & faster lookup
     • Based on a field in the main index, or on a dictionary file
     http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult
     "spellcheck":{
       "suggestions":[
         "ult",{
           "numFound":1,
           "startOffset":0,
           "endOffset":3,
           "suggestion":["ultrasharp"]},
         "collation","ultrasharp"]}}
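
In this JSON format, "suggestions" is a flat list that alternates names and values, so a client has to pair them up. The sketch below does that for the response above; parse_suggestions is a hypothetical helper, and it simply skips the trailing "collation" entry because its value is a string rather than an object.

```python
import json

raw = """{"spellcheck":{"suggestions":[
  "ult",{"numFound":1,"startOffset":0,"endOffset":3,
         "suggestion":["ultrasharp"]},
  "collation","ultrasharp"]}}"""


def parse_suggestions(response):
    """Map each input token to its list of suggested completions."""
    flat = response["spellcheck"]["suggestions"]
    result = {}
    # Even positions hold names, odd positions hold the matching values.
    for name, value in zip(flat[0::2], flat[1::2]):
        if isinstance(value, dict):
            result[name] = value["suggestion"]
    return result


suggestions = parse_suggestions(json.loads(raw))
print(suggestions)
```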

  29. Index with JSON
     $ URL=http://localhost:8983/solr/update/json
     $ curl $URL -H 'Content-type:application/json' -d '
     [
       {
         "id" : "978-0641723445",
         "cat" : ["book","hardcover"],
         "title" : "The Lightning Thief",
         "author" : "Rick Riordan",
         "series_t" : "Percy Jackson and the Olympians",
         "sequence_i" : 1,
         "genre_s" : "fantasy",
         "inStock" : true,
         "price" : 12.50,
         "pages_i" : 384
       }
     ]'
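
The same update can be prepared from a program instead of curl. This sketch only builds the request object and does not send it; passing it to urllib.request.urlopen would be the sending step, and that assumes a Solr server is listening at the slide's URL.

```python
import json
import urllib.request

URL = "http://localhost:8983/solr/update/json"

docs = [{
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}]

# Serialize the document list exactly as the curl example sends it.
payload = json.dumps(docs)

# Supplying a body makes this a POST request.
request = urllib.request.Request(
    URL,
    data=payload.encode("utf-8"),
    headers={"Content-type": "application/json"},
)
print(request.get_method(), request.full_url)
```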

  30. Query Results in CSV
     http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv
     name,price,cat,popularity
     iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
     Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
     Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
     • Can handle multi-valued fields (see "cat" field in example)
     • Completely compatible with the CSV update handler (can round-trip)
     • Results are streamed – good for dumping entire parts of the index
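
The CSV output can be read with a standard CSV parser. The sketch below parses the three rows above and splits the quoted "cat" column back into a list; it assumes the values themselves contain no escaped commas, and parse_solr_csv is an illustrative name.

```python
import csv
import io

csv_text = (
    'name,price,cat,popularity\n'
    'iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1\n'
    'Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1\n'
    'Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10\n'
)


def parse_solr_csv(text, multi_valued=("cat",)):
    """Parse CSV rows into dicts and split multi-valued columns on commas."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for field in multi_valued:
            row[field] = row[field].split(",")
        rows.append(row)
    return rows


rows = parse_solr_csv(csv_text)
for row in rows:
    print(row["name"], row["price"], row["cat"])
```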

  31. http://localhost:8983/solr/browse

  32. Q&A
