Search Stack Secrets
Ryan Gehring - Indiegogo
Practical Search for Rubyists
• Elasticsearch / Solr / alternatives roundup.
• Essential plugins you need to install today.
• Semi-SOA search design.
• Schemaless is for amateurs! Mappings = friend.
• Problem solving with analyzers.
• Avoiding the Tire DSL: query JSON ingredients.
Elasticsearch vs. Solr vs. …
• Horizontal scalability.
• Great API.
• Developer support (analyzers, etc.).
• Downside: a slightly less great Ruby client.
Awesome Plugins
• elasticsearch-head: a web front end for an Elasticsearch cluster. http://mobz.github.com/elasticsearch-head
• Elasticsearch Paramedic: a simple yet sexy tool to monitor and inspect Elasticsearch clusters.
• Elasticsearch JDBC river: https://github.com/jprante/elasticsearch-river-jdbc
One solid service-y and Rails 4-approved design
• A web form in the view supplies GET parameters and submits to a search controller.
• The search controller permits the allowed parameters via strong parameters and instantiates a search object.
• The search model translates the parameters into a query, using either Tire (the Ruby client) or raw JSON.
• The query is fired and the results are served!
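The search-object step can be sketched in plain Ruby, without Rails or Tire. This is a minimal illustration, not the design from the talk: the class name `ProjectSearch`, the parameter names `q`, `category`, and `page`, and the page size are all assumptions. It only shows the idea of taking already-permitted parameters and turning them into a JSON query body.

```ruby
require 'json'

# Hypothetical search object; every name in this class is an assumption
# made for illustration.
class ProjectSearch
  # Mirrors the whitelist the controller would enforce with strong parameters.
  ALLOWED_KEYS = %w[q category page].freeze
  PER_PAGE = 10

  def initialize(params)
    # Keep only known, non-blank parameters.
    @params = params.each_with_object({}) do |(key, value), kept|
      key = key.to_s
      text = value.to_s.strip
      kept[key] = text if ALLOWED_KEYS.include?(key) && !text.empty?
    end
  end

  # Builds the query body as a plain Hash, ready to serialize as JSON.
  def to_query
    page = [@params.fetch('page', '1').to_i, 1].max
    {
      'query' => { 'bool' => { 'must' => must_clauses } },
      'from'  => (page - 1) * PER_PAGE,
      'size'  => PER_PAGE
    }
  end

  private

  def must_clauses
    text_clause =
      if @params.key?('q')
        { 'query_string' => { 'query' => @params['q'] } }
      else
        { 'match_all' => {} }
      end
    clauses = [text_clause]
    clauses << { 'term' => { 'category' => @params['category'] } } if @params.key?('category')
    clauses
  end
end

search = ProjectSearch.new('q' => 'robot', 'category' => 'film', 'admin' => 'true', 'page' => '2')
puts JSON.pretty_generate(search.to_query)
```

Keeping the translation in one object means the controller never builds query JSON itself, so the query model can change without touching the controller or the view.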
Mappings + Analyzers: Ingredients for Success!
• Elasticsearch is schemaless by default, but you can optimize by providing a schema (a mapping).
• A mapping specifies which fields to index and how to analyze and tokenize them.
• These analyzers help a lot!
Problem solving with analyzers
• My search isn't robust to misspellings!
  • N-gram
  • Edge n-gram
• My search isn't robust to plurals / caps / whitespace / etc.
  • Snowball (standard + lowercase + some English-language stemming + stopwording)
• I can only solve one of these at once!
  • Multi-field analysis.
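To see why n-grams help with misspellings, it can be useful to generate the tokens by hand. The helpers below are a plain-Ruby approximation, not Elasticsearch's own tokenizer: the function names and the gram sizes are assumptions, and the real analyzers may emit tokens in a different order.

```ruby
# All substrings of length min..max, roughly what an n-gram analyzer indexes.
def ngrams(token, min, max)
  (min..max).flat_map do |n|
    (0..token.length - n).map { |i| token[i, n] }
  end
end

# Only prefixes of length min..max, roughly what an edge n-gram analyzer indexes.
def edge_ngrams(token, min, max)
  (min..[max, token.length].min).map { |n| token[0, n] }
end

# A misspelled query still shares several trigrams with the indexed term,
# so it can still match.
indexed = ngrams('elastic', 3, 3)
typed   = ngrams('elastc', 3, 3)
puts "shared trigrams: #{(indexed & typed).inspect}"

# Edge n-grams match what a user has typed so far (search-as-you-type).
puts "prefixes: #{edge_ngrams('search', 2, 4).inspect}"
```

The trade-off is index size and precision: every extra gram is another token that can match, which is why the n-gram field is usually combined with a stricter field rather than used alone.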
Problem solving with boosts
• Boosts are a concept from Lucene; they are multipliers on scores.
• You can set the relative importance of matching fields. Example: title -> 10 vs. free_text -> 1.
• You can set the relative importance of matching on ANALYZED fields. Example: ngram_title -> 6, snowball_title -> 10.
• Bonus for fields with exact token matches.
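One way to express such boosts is the `field^boost` notation in the `fields` list of a `query_string` query. The sketch below builds that query body as a Ruby Hash; the helper name `boosted_query` is an assumption, and the field names and weights are simply the examples from the slide.

```ruby
require 'json'

# Builds a query_string query whose fields carry per-field boosts.
# `boosts` maps field name => multiplier.
def boosted_query(text, boosts)
  fields = boosts.map { |field, boost| "#{field}^#{boost}" }
  { 'query' => { 'query_string' => { 'query' => text, 'fields' => fields } } }
end

# The stemmed (snowball) title outranks the fuzzy n-gram title, so exact
# token matches score higher than partial ones.
body = boosted_query('robot arm', 'snowball_title' => 10, 'ngram_title' => 6, 'free_text' => 1)
puts JSON.pretty_generate(body)
```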
Key queries in Elasticsearch
• Filtered Query
  • Apply binary filters to an arbitrary query; try it with the query_string query type for full-text, analyzed search queries plus filters.
• Custom Score Query
  • Provide the exact equation for scoring. You can take mathematical transforms of variables using MVEL, or even Python with the right plugin.
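As a rough illustration of the two query types named above, the bodies can be assembled as Ruby Hashes and serialized to JSON. Treat this as a historical sketch: `filtered` and `custom_score` belong to the early Elasticsearch releases this talk targets, and later versions replaced them (a `bool` query with a `filter` clause, and `function_score`). The helper names, the `status` and `popularity` fields, and the scoring script are assumptions.

```ruby
require 'json'

# Full-text query_string query narrowed by a yes/no term filter.
def filtered_query(text, field, value)
  {
    'filtered' => {
      'query'  => { 'query_string' => { 'query' => text } },
      'filter' => { 'term' => { field => value } }
    }
  }
end

# Wraps any query and replaces its score with a scripted equation.
def custom_score_query(inner_query, script)
  { 'custom_score' => { 'query' => inner_query, 'script' => script } }
end

inner = filtered_query('robot arm', 'status', 'live')
body  = { 'query' => custom_score_query(inner, "_score * doc['popularity'].value") }
puts JSON.pretty_generate(body)
```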
Theoretical Section
• Integrating models via custom scoring.
• Learning models: a qualitative, quantitative process.
• Data sources and paradigms.
• Key metrics for search.
• Monitoring statistical model performance.
Custom score queries are regression equations. You can use supervised learning methods to train them over time, as Google does.
Statistical learning & search
• Clickstream models
  • Logistic regression
  • Binary target: click / no click
  • Learn boosts, coefficients, etc.
• Paired comparison models
  • Logistic regression
  • Binary target: A > B
  • Learn boosts, coefficients, etc.
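A clickstream model of this kind can be sketched with a tiny logistic regression trained by batch gradient descent. Everything concrete here is an assumption made for illustration: the two features (a title match score and a free-text match score), the made-up observations, and the learning rate and epoch count. In practice a statistics library would replace the hand-written loop; the point is only that the learned coefficients play the role of boosts.

```ruby
def sigmoid(z)
  1.0 / (1.0 + Math.exp(-z))
end

# weights[0] is the intercept; weights[i + 1] belongs to features[i].
def predict(weights, features)
  z = weights[0] + features.each_with_index.sum { |x, i| weights[i + 1] * x }
  sigmoid(z)
end

# Batch gradient descent on the logistic log-loss.
def train_logistic(rows, labels, epochs: 2000, rate: 0.5)
  weights = Array.new(rows.first.length + 1, 0.0)
  epochs.times do
    gradient = Array.new(weights.length, 0.0)
    rows.each_with_index do |features, n|
      error = predict(weights, features) - labels[n]
      gradient[0] += error
      features.each_with_index { |x, i| gradient[i + 1] += error * x }
    end
    weights = weights.each_with_index.map { |w, i| w - rate * gradient[i] / rows.length }
  end
  weights
end

# Made-up observations: [title_match_score, free_text_match_score] per impression.
rows   = [[1.0, 0.2], [0.9, 0.8], [0.8, 0.1], [0.7, 0.9],
          [0.3, 0.9], [0.2, 0.1], [0.1, 0.7], [0.0, 0.3]]
clicks = [1, 1, 1, 1, 0, 0, 0, 0] # binary target: click / no click

weights = train_logistic(rows, clicks)
puts "intercept, title, free_text: #{weights.map { |w| w.round(2) }.inspect}"
```

A paired comparison model uses the same machinery: each row becomes the difference between the feature vectors of results A and B, and the label is 1 when A was preferred.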
Search model training is a qualitative-first process.
• Review search algorithms before you push them.
• Have other people review search results before you push them.
• Make your app robust to new search query models: abstract the regression to a query model.
• Do side-by-side qualitative search QA.
Search success metrics… any Googlers here?
• Items consumed per session, for browse pages.
• 1 - abandoned search %, for search pages.
• Conversion rate originating from the search page.
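These three metrics are simple ratios over session logs. The sketch below assumes a hypothetical log shape (Hashes with `:items_consumed`, `:clicked_result`, and `:converted` keys) and assumes that an abandoned search means a search session with no result click; neither definition comes from the talk.

```ruby
# Average number of items consumed per browse session.
def items_per_session(browse_sessions)
  return 0.0 if browse_sessions.empty?
  browse_sessions.sum { |s| s[:items_consumed] }.to_f / browse_sessions.length
end

# 1 - abandoned fraction: the share of search sessions with a result click.
def search_success_rate(search_sessions)
  return 0.0 if search_sessions.empty?
  abandoned = search_sessions.count { |s| !s[:clicked_result] }
  1.0 - abandoned.to_f / search_sessions.length
end

# Share of search sessions that ended in a conversion.
def search_conversion_rate(search_sessions)
  return 0.0 if search_sessions.empty?
  search_sessions.count { |s| s[:converted] }.to_f / search_sessions.length
end

# Made-up sample sessions.
browse = [{ items_consumed: 2 }, { items_consumed: 4 }]
search = [
  { clicked_result: true,  converted: true  },
  { clicked_result: true,  converted: false },
  { clicked_result: true,  converted: false },
  { clicked_result: false, converted: false }
]

puts "items per browse session: #{items_per_session(browse)}"
puts "search success rate:      #{search_success_rate(search)}"
puts "search conversion rate:   #{search_conversion_rate(search)}"
```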
Search model learning
• Explain output: the ultimate training data, in a nasty, semi-structured mess.
• Built an AST parser for Lucene explain output so you can get clean rows of observations.
• Every query's intimate scoring details are logged into a DB as lines of training data.
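The parser described above is not shown in the talk, so the following is only a simplified stand-in. It assumes the explanation arrives as JSON in which each node has a `value`, a `description`, and optional nested `details`, and it flattens that tree into one row per node. The sample explanation, its numbers, and the row layout are made up for illustration.

```ruby
require 'json'

# Walks an explanation tree depth-first and emits one flat row per node.
def flatten_explanation(node, query_id, depth = 0, rows = [])
  rows << {
    query_id: query_id,
    depth: depth,
    value: node['value'],
    description: node['description']
  }
  (node['details'] || []).each do |child|
    flatten_explanation(child, query_id, depth + 1, rows)
  end
  rows
end

# Made-up explanation for a single hit.
sample = JSON.parse(<<~EXPLAIN)
  {
    "value": 2.0,
    "description": "sum of:",
    "details": [
      { "value": 1.5, "description": "weight(title:robot)" },
      { "value": 0.5, "description": "weight(free_text:robot)" }
    ]
  }
EXPLAIN

flatten_explanation(sample, 'q-001').each { |row| puts row.inspect }
```

Rows in this shape can be inserted into a database table as they are, which is what makes them usable as training data later.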
Search model monitoring
• You can calculate stability metrics for thousands of queries between two models and highlight the least stable queries.
• You can monitor prediction accuracy on clickstream data for performance degradation.
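The talk does not say which stability metric it used. One plausible choice, assumed here, is the Jaccard overlap of the top-k results that two models return for the same query; the queries with the lowest overlap are the ones worth reviewing by hand. The function names, the value of k, and the sample result IDs are all illustrative.

```ruby
# Jaccard overlap of the top-k result IDs from two models (1.0 = identical sets).
def top_k_overlap(results_a, results_b, k)
  a = results_a.first(k)
  b = results_b.first(k)
  union = a | b
  return 1.0 if union.empty?
  (a & b).length.to_f / union.length
end

# Returns the `limit` queries whose top-k results changed the most.
def least_stable(results_by_query_a, results_by_query_b, k: 3, limit: 2)
  scores = results_by_query_a.map do |query, results|
    [query, top_k_overlap(results, results_by_query_b.fetch(query, []), k)]
  end
  scores.sort_by { |_query, overlap| overlap }.first(limit)
end

# Made-up ranked result IDs per query for the old and the new model.
old_model = { 'robot' => [1, 2, 3], 'drone' => [4, 5, 6], 'camera' => [7, 8, 9] }
new_model = { 'robot' => [1, 2, 3], 'drone' => [4, 9, 8], 'camera' => [1, 2, 3] }

least_stable(old_model, new_model).each do |query, overlap|
  puts "#{query}: overlap #{overlap.round(2)}"
end
```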