Mastering Apache Solr: Advanced Search Solutions

Working with Apache Solr: how we can use Solr to solve common information retrieval problems. Prepared and presented by Fabrizio Celli - UNICEF

Outline • What is Apache Solr • Field Types • What to index • Advanced features • Spellchecker • Faceting • Boosting • Synonyms • Multilingualism • SolrCloud

What is Apache Solr • Solr is an open-source search platform used to build search applications • Built on top of Apache Lucene, an open-source, Java-based information retrieval library • Designed to drive powerful document retrieval applications • It offers faceting (guided navigation), spell checking, highlighting, scoring/boosting, and many other features

Why Solr? • RESTful API: you don’t need to have Java programming skills • Document oriented • Text-centric and sorted by relevance: mostly used to search text documents, with results ranked by their relevance to the user’s query • Full-text search: tokens, phrases, spell check, wildcard, and auto-complete • Highly scalable • Open source

Once the user makes a request to search a text, the application prepares a query object using that text. It can apply filtering, boosting, etc. • Documents must be indexed so that they can be retrieved based on certain keys, instead of the whole contents of the document • [Diagram labels: Documents, Indexing Component, Index, Query, Query Processing, Searching, Ranking, Results]

Document oriented • We can have several indexes, called cores, in the same Solr instance • A Solr index consists of one or more documents • A document consists of one or more fields • In relational database terminology, a document corresponds to a table row and a field to a column • Every field has a type • Float, long, double, date, text • Custom types • A field can be indexed (so it can be used to match user queries), stored (so it can be returned in the result set), and multi-valued
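The difference between indexed and stored fields can be sketched with a toy inverted index. This is only an illustration, not how Solr is implemented: the SCHEMA dictionary, the field names and the two sample documents are all made up for the example.

```python
from collections import defaultdict

# Hypothetical schema: which fields are searchable and which are returned.
SCHEMA = {
    "title":    {"indexed": True,  "stored": True},
    "comments": {"indexed": False, "stored": True},
    "keywords": {"indexed": True,  "stored": False},
}

def build_index(docs):
    """Map each lowercased term of every indexed field to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            if SCHEMA[field]["indexed"]:
                for term in value.lower().split():
                    index[term].add(doc_id)
    return index

def search(term, docs, index):
    """Return only the stored fields of the documents that match the term."""
    results = []
    for doc_id in sorted(index.get(term.lower(), set())):
        stored = {f: v for f, v in docs[doc_id].items() if SCHEMA[f]["stored"]}
        results.append(stored)
    return results

docs = {
    1: {"title": "Rice production in China", "comments": "draft", "keywords": "rice cereals"},
    2: {"title": "Child mortality in Africa", "comments": "mentions rice", "keywords": "health"},
}
index = build_index(docs)
print(search("rice", docs, index))
```

Document 2 is not returned for "rice" because the word only occurs in its non-indexed comments field, and the keywords field never appears in the results because it is not stored.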

Analyzer/Tokenizer • When we define a new field type, we also have to specify an analyzer to be used at indexing time and one to be used at query time • The analyzer is composed of a series of transformations (tokenizers, filters, etc.) that are applied before indexing the document or before processing the query • Very important to ensure correct information retrieval • We can create tokens to search specific lexical units • For indexing, we want to simplify and normalize words: lowercasing everything, eliminating punctuation and accents, and so on. Doing so can increase recall because, for example, "rice", "Rice" and "RICE" would all match a query for "rice" • At query time, we may want to remove stop words, like articles, prepositions, conjunctions, etc. • https://lucene.apache.org/solr/guide/6_6/tokenizers.html

Whitespace tokenizer: splits the text on whitespace and returns sequences of non-whitespace characters as tokens, including any punctuation. In: "To be, or what?" Out: "To", "be,", "or", "what?" • Stop words: removing stop words allows the user to make more user-friendly queries. You can define stop words in an external file! In: "Child Mortality in Africa and Australia" Out: "Child Mortality Africa Australia" • Further splitting and concatenation after the whitespace tokenizer: In: "hot-spot 100+42 XL40" Whitespace Tokenizer: "hot-spot", "100+42", "XL40" Out: "hot", "spot", "hotspot", "100", "42", "10042", "XL", "40", "hot-spot", "100+42", "XL40"
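The first two examples above can be imitated in a few lines of Python. This is a rough sketch of an analysis chain, not Solr's actual tokenizer or filter code; the STOP_WORDS set is a made-up stand-in for the external stop-word file.

```python
import re

# Hypothetical stop-word list; in Solr this would usually live in an external file.
STOP_WORDS = {"in", "and", "or", "the", "of", "to", "a"}

def whitespace_tokenize(text):
    """Split on whitespace only, so punctuation stays attached to the tokens."""
    return text.split()

def analyze(text):
    """Tokenize, strip punctuation, lowercase, then drop stop words."""
    tokens = []
    for raw in whitespace_tokenize(text):
        cleaned = re.sub(r"[^\w-]", "", raw).lower()
        if cleaned and cleaned not in STOP_WORDS:
            tokens.append(cleaned)
    return tokens

print(whitespace_tokenize("To be, or what?"))
print(analyze("Child Mortality in Africa and Australia"))
```

Running the same analyze function on both the documents and the user query is what makes "Rice" and "RICE" match a query for "rice".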

What to index? • You can index everything that can be expressed as text • What to index depends on what content you want your users to be able to query • Usually you want to index and make searchable the name of a resource, keywords, categories • Other fields, like footnotes or comments, should be excluded from the default search because they may make the results less relevant. You can mark them as «stored» if you want to retrieve them, or you can even exclude them from the index • You can create a copy field in which you put all the fields that you want to make available in the default search • Other fields can be searched directly using the field name (e.g. AltTitle:"war and peace") • Example: "Child Mortality in Afghanistan and Ethiopia in 2010"

SDMX example: indexing dimensions

Some Advanced Features

Spellchecker • A spellchecker is a software feature that checks for misspellings in a text • The source can be terms in a field, an external text file, or fields in other indexes • IndexBasedSpellChecker: uses the Solr index as the source for a parallel index • DirectSolrSpellChecker: uses terms from the Solr index without building a parallel index • FileBasedSpellChecker: uses an external file • WordBreakSolrSpellChecker: offers suggestions by combining adjacent query terms and/or breaking terms into multiple words
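The idea behind suggesting a correction from a list of known terms can be sketched with plain edit distance. This is an illustration only and is not the algorithm used by any of the Solr spellcheckers listed above; the vocabulary and the max_distance threshold are assumptions made for the example.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def suggest(word, vocabulary, max_distance=2):
    """Return the closest vocabulary term, or None if nothing is close enough."""
    distance, best = min((levenshtein(word, term), term) for term in vocabulary)
    return best if distance <= max_distance else None

# Hypothetical vocabulary standing in for the terms of an index field.
vocabulary = ["rice", "price", "maize", "america"]
print(suggest("riec", vocabulary))
```

Here the vocabulary plays the role of the spellchecker source, which in Solr could be an index field or an external file.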

Configuration • solrconfig.xml: define one or more spellcheckers and a request handler • Example queries: • http://localhost:8080/mySolr/collection1/spell?q=riec • http://localhost:8080/mySolr/collection1/spell?q=riec%20in%20america


Faceting • Faceting is the classification of search results into categories based on indexed terms • Faceting makes it easy for users to explore search results, narrowing in on exactly the results they are looking for • Categories come with numerical counts of how many matching documents were found for each of them

An Example http://agris.fao.org/agris-search/searchIndex.do?query=organophosphorus+compounds Click to narrow your search and reduce the number of results!

How to do that in Solr • Identify an index field you want to use to categorize your results. Usually, a field containing countries or keywords is the most appropriate one • <field indexed="true" multiValued="true" name="agrovoc_facet" stored="true" termVectors="true" type="lowercase"/> • Enable the TermVector search component in solrconfig.xml. It returns additional information about the documents matching your search, such as term frequency, position, and offset (it is optional: faceting itself only needs facet=true and facet.field in the request) • <searchComponent name="tvComponent" class="solr.TermVectorComponent"/> • Enable the RequestHandler that uses it

Query! http://localhost:8080/mySolr/collection1/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=agrovoc_facet
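A client then has to read the counts out of the response. The sketch below assumes the default JSON layout, in which each facet field is a flat list alternating term and count; SAMPLE_RESPONSE is a made-up, trimmed response body, not real output from the AGRIS index.

```python
import json

# Hypothetical response body, trimmed to the parts used here.
SAMPLE_RESPONSE = """
{
  "response": {"numFound": 3, "docs": []},
  "facet_counts": {
    "facet_fields": {
      "agrovoc_facet": ["rice", 2, "maize", 1, "wheat", 0]
    }
  }
}
"""

def facet_counts(response_text, field):
    """Turn the flat [term, count, term, count, ...] list into a dict, skipping empty facets."""
    flat = json.loads(response_text)["facet_counts"]["facet_fields"][field]
    terms, counts = flat[0::2], flat[1::2]
    return {term: count for term, count in zip(terms, counts) if count > 0}

print(facet_counts(SAMPLE_RESPONSE, "agrovoc_facet"))
```

Each remaining entry can be rendered as a clickable category with its document count, as in the AGRIS example.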

Accessing the score • If no other sort order is specified, results are sorted by relevance score • The relevance score is classically computed with the TF-IDF algorithm (term frequency–inverse document frequency), a numerical statistic intended to reflect how important a word is to a document in a collection (from Solr 6 the default similarity is BM25, a refinement of TF-IDF) • You can access the score for each result by adding "score" to the fl (field list) parameter • Factors influencing the score (+ raises it, - lowers it): Term Frequency +, Inverse Document Frequency + (rarity across the collection), Field length -, Coordination Factor + • http://localhost:8080/mySolr/collection1/select?q=rice&wt=json&indent=true&fl=*,score
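To make the effect of term frequency, rarity and field length concrete, here is a textbook TF-IDF computation on a toy corpus. It is a simplified formula chosen for illustration, not the exact scoring function used by Lucene or Solr, and the three documents are invented.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Textbook TF-IDF: relative term frequency times log inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)  # dividing by length penalizes long fields
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        return 0.0
    return tf * math.log(len(corpus) / containing)

# Hypothetical, already tokenized documents.
corpus = [
    "rice production in china".split(),
    "rice rice trade".split(),
    "child mortality in africa".split(),
]

# Rank document positions by their score for the query term "rice".
ranked = sorted(range(len(corpus)), key=lambda i: tf_idf("rice", corpus[i], corpus), reverse=True)
print(ranked)
```

The second document ranks first because "rice" occurs more often in a shorter text, and a rarer term such as "china" gets a higher weight than "rice" within the first document.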

Boosting at query time • The score is influenced by how you set up your query parameters and whether or not you apply boosting techniques • We may run the default search on a copy field, which contains data from all fields we want to use for our search, but we may also want certain fields to score higher • For example, let’s assume we want to search for the term "rice" and we want matches in the "subject" field to score higher than matches in the "title" field. There are two ways: • q=title:rice subject:rice^2 (using the standard query parser) • q=rice&defType=dismax&qf=title subject^2 (using the DisMax query parser) • http://localhost:8080/mySolr/collection1/select?q=rice&defType=dismax&qf=title+subject^2&wt=json&indent=true&fl=*,score
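Building such a request by hand is error-prone because of the URL encoding, so a small helper can assemble it. The sketch below uses only the Python standard library; the base URL is the one shown in the slides, while the dismax_url helper and its field_boosts argument are made up for the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://localhost:8080/mySolr/collection1/select"  # URL used in the slides

def dismax_url(text, field_boosts):
    """Build a DisMax request where each field gets an optional ^boost in qf."""
    qf = " ".join(
        field if boost == 1 else f"{field}^{boost}"
        for field, boost in field_boosts.items()
    )
    params = {"q": text, "defType": "dismax", "qf": qf, "fl": "*,score", "wt": "json"}
    return f"{BASE_URL}?{urlencode(params)}"

url = dismax_url("rice", {"title": 1, "subject": 2})
print(url)
# Decode the query string again to check what the server would receive.
print(parse_qs(urlsplit(url).query)["qf"])
```

The second print decodes the query string again, which is a quick way to check that the boosts survived the encoding.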

Boosting at indexing time • We can also decide at indexing time that some documents are more important than others (for example, the most recent documents, although in that case we could simply sort by date) • Or that some fields are more important than others • <add> <doc boost="2.5"> <field name="title">Rice production in China</field> <field name="subject" boost="2.0">rice</field> </doc> </add>

Synonyms • Synonyms are words that mean the same thing within the context where they are used • It is possible to use the SynonymFilterFactory to cope with synonyms at query time • However, the recommended approach for dealing with synonyms like the one below is to expand them when indexing (e.g. tokenization can break multi-word synonyms at query time) • Synonyms are defined in a text file: sea biscuit => sea biscit, seabiscuit
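What expanding at indexing time means can be sketched as follows. This toy version works on whole strings with simple substring replacement, which is an assumption made to keep the example short; the real filter works on token streams. The parse_rule and expand helpers are hypothetical.

```python
def parse_rule(line):
    """Parse an explicit mapping 'a, b => c, d' into {source phrase: [target phrases]}."""
    left, right = line.split("=>")
    sources = [s.strip() for s in left.split(",")]
    targets = [t.strip() for t in right.split(",")]
    return {source: targets for source in sources}

def expand(text, mapping):
    """Return the variants of the text obtained by replacing a source phrase with each target."""
    lowered = text.lower()
    variants = []
    for source, targets in mapping.items():
        if source in lowered:
            variants.extend(lowered.replace(source, target) for target in targets)
    return variants or [lowered]

mapping = parse_rule("sea biscuit => sea biscit, seabiscuit")
print(expand("Sea Biscuit wins", mapping))
```

Indexing every variant means a later query for "seabiscuit" can match a document that originally said "sea biscuit" without any synonym handling at query time.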

Multilingualism • When we have data in various languages and we do not know the number of languages (or they may change in the future, or there are too many of them), we can use dynamic fields • <dynamicField indexed="true" multiValued="true" name="title_*" stored="true" type="html_ws"/> • At indexing time, we identify the concrete field by adding the language code after the underscore • title_eng, title_fra, title_ita • The Solr Admin panel can be used to generate statistics • We can also have a copy field that collects the content in all the languages • <copyField dest="title" source="title_*"/>
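On the client side, this naming convention translates into very little code. The sketch below builds a document with one title_<language> field per language and then imitates what the copyField rule would do inside Solr; the "id" field, the helper names and the sample titles are assumptions made for the example.

```python
from fnmatch import fnmatchcase

DYNAMIC_PATTERN = "title_*"  # dynamicField name from the slide
COPY_DEST = "title"          # copyField dest from the slide

def build_document(doc_id, titles_by_language):
    """Create one title_<language code> field per available language."""
    doc = {"id": doc_id}  # "id" is a hypothetical unique key field
    for lang, title in titles_by_language.items():
        doc[f"title_{lang}"] = title
    return doc

def apply_copy_field(doc):
    """Imitate the copyField rule: gather all title_* values into the catch-all field."""
    copied = [value for name, value in doc.items() if fnmatchcase(name, DYNAMIC_PATTERN)]
    return {**doc, COPY_DEST: copied}

doc = build_document("1", {"eng": "Rice", "fra": "Riz", "ita": "Riso"})
print(apply_copy_field(doc))
```

In a real setup only build_document would run in the indexing client, since Solr applies the copyField rule itself when the document is added.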

Expanding a query • We can analyze and edit the user query before sending it to the Solr index • Boosting is one example of such editing • But we can also expand the query • For example, if we have a database of concepts/keywords in several languages, we can use it to translate the user query into other languages and match results that are not available only in English • The user searches in English and we can return results indexed in other languages • This applies not only to multilingual search, but also to synonyms and any other enrichment we can do at query time

Multilingual Query Expansion Module: An Example • 1. Query AGRIS • 2. Expand the query (sub-steps 2.1 and 2.2) • 3. Return the expanded query Q1 • 4. Use Q1 to query the AGRIS index • [Diagram components: AGRIS website, Query pattern analyzer, Query expander, AGRIS core index, AGROVOC label index] • Reference: Fabrizio Celli, Johannes Keizer, "Enabling Multilingual Search through Controlled Vocabularies: the AGRIS Approach", Metadata and Semantics Research, pp. 237-248 (Springer)
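A minimal sketch of such an expansion step is shown below. It is not the AGRIS module: the LABELS table is a tiny made-up stand-in for a multilingual vocabulary such as AGROVOC, and joining the terms with AND is an assumption about how the final query should behave.

```python
# Hypothetical multilingual label table, standing in for a controlled vocabulary.
LABELS = {
    "rice": {"fra": "riz", "ita": "riso", "spa": "arroz"},
    "maize": {"fra": "maïs", "ita": "mais"},
}

def expand_query(user_query):
    """Replace each known term with an OR group of its translations."""
    clauses = []
    for term in user_query.lower().split():
        translations = sorted(set(LABELS.get(term, {}).values()) - {term})
        variants = [term] + translations
        quoted = " OR ".join(f'"{v}"' for v in variants)
        clauses.append(f"({quoted})" if len(variants) > 1 else quoted)
    return " AND ".join(clauses)

print(expand_query("rice production"))
```

The expanded string corresponds to the query Q1 of step 3, which is then sent to the index in place of the original user query.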

SolrCloud • Subset of optional features in Solr to enable and simplify horizontal scaling • Solr is distributed over several servers, using a cluster of nodes • In the classic master/slave setup, one server acts as master: the index is created there, then replicated to all slave servers. In SolrCloud, each shard instead has a leader that receives updates and forwards them to its replicas • Why? Fault tolerance and high availability

Thanks!
