High-quality searching with Lucene


Presentation Transcript


  1. High-quality searching with Lucene Kristoffer Dyrkorn Senior Consultant, BEKK Consulting

  2. Dish of the day • How does a search engine work? • Applying Lucene • Searching and quality • Experiences

  3. whois 127.0.0.1 • Developer/architect at BEKK Consulting • Systems integration • Java, open source toolkits, information retrieval • Languages, usability

  4. How it works (diagram) • Search engine • Query processing: query parsing, lookups, hits • Index (copy of data), with a link to the original data source • Indexing: parsing (HTML, DOC, PDF, XML, ...), field mapping

  5. The index • In practice: A table • 1 row = 1 Document (web page, Word document, PDF, etc) • 1 column = 1 Field (document contents or metadata) • Stored in a file system, a database or RAM • Field mapping, storage parameters
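The table view of the index on slide 5 can be sketched in a few lines. This is plain Python, not Lucene's API or storage format; the field names, the sample rows and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import defaultdict

# One row = one document; one key = one field (contents or metadata).
# Field names and sample rows are invented for this sketch.
ROWS = [
    {"url": "/a.html", "title": "Lucene intro", "body": "Lucene is a search library"},
    {"url": "/b.pdf", "title": "Indexing PDF", "body": "Parsing PDF files before indexing"},
]


def build_inverted_index(rows, fields):
    """Map (field, term) -> set of row numbers whose field contains the term."""
    index = defaultdict(set)
    for row_number, row in enumerate(rows):
        for field in fields:
            # Naive tokenizer: lowercase and split on whitespace.
            for term in row[field].lower().split():
                index[(field, term)].add(row_number)
    return index


def lookup(index, field, term):
    """Return the sorted row numbers whose field contains the term."""
    return sorted(index.get((field, term.lower()), set()))


INDEX = build_inverted_index(ROWS, ["title", "body"])
```

The "url" field is kept in each row but not indexed here, which loosely mirrors the slide's point that field mapping and storage parameters are decided per field.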

  6. Getting data • Web pages, files, databases • Presentation end • Data end • Beware of • Data formats, character encodings, multiple languages • (Diagram labels: simplified pages, full pages, web server, app server, database)

  7. Letting users find data • From words to exact definition • Query parsing • Term • Expansion and aggregation • What to find, what to avoid • Presentation • Mandatory fields • Helpful functionality
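A minimal sketch of the "what to find, what to avoid" step on slide 7. It assumes the `+term` / `-term` prefixes familiar from Lucene's classic query syntax, but the parser and matcher below are simplified stand-ins written for this example, not Lucene code.

```python
def parse_query(text):
    """Split a query string into required, prohibited and optional terms."""
    parsed = {"required": [], "prohibited": [], "optional": []}
    for token in text.lower().split():
        if token.startswith("+") and len(token) > 1:
            parsed["required"].append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            parsed["prohibited"].append(token[1:])
        else:
            parsed["optional"].append(token)
    return parsed


def matches(parsed, document_terms):
    """Decide whether a document (given as a list of terms) satisfies the query."""
    terms = set(document_terms)
    if any(term not in terms for term in parsed["required"]):
        return False
    if any(term in terms for term in parsed["prohibited"]):
        return False
    if not parsed["required"]:
        # Without required terms, at least one optional term must be present,
        # so a query made only of prohibited terms matches nothing.
        return any(term in terms for term in parsed["optional"])
    return True
```

For example, `matches(parse_query("+lucene -solr"), ["lucene", "search"])` accepts the document, while adding "solr" to its terms rejects it.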

  8. Why searching? • Several paths to information • Search, navigate, or use agents • Tradeoffs • Text-based search is gaining ground • But still has limitations • Apply technology, usability and care • Or else.... “This is the end we won't take any more Say goodbye” Searching, seek and destroy (Hetfield/Ulrich)

  9. Lucene • API • Index storage and lookups • Sophisticated ranking • Query parser • Related documents • Performance, efficiency • To the point • News! Compressed/binary data fields, virtual indexes • Open Source modules • Presentation utilities • Sophisticated language tools • Categorisation

  10. Lucene vs a full system (diagram labels) • Search engine • Administration • User interface • Lucene API • Synonyms • Stemming • Highlighting • Summary • Did you mean? • (Open Source) • Queries • Language processing • Index storage and lookups • Format parsers • Connectors

  11. Applying Lucene (diagram) • Classes: MultiFieldQueryParser, Query.setBoost(), BooleanQuery, IndexSearcher, Hits, Document, Field • Query ("hello world") -> Field 1: Query 1 & Boost 1 • Query -> Field 2: Query 2 & Boost 2 • (Extra queries): Query N & Boost N • Filter spec -> Filterfield: Query 1, Query 2, Query 3, combined in a BooleanQuery filter
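The idea behind slide 11 appears to be that one user query is expanded into one query per field, each with its own boost, and the results are combined. The sketch below imitates that idea in plain Python; it does not call MultiFieldQueryParser or BooleanQuery, and the additive scoring is an assumption that is far simpler than Lucene's ranking.

```python
def expand_over_fields(terms, boosts):
    """Turn one user query into one (field, term, boost) clause per field."""
    return [(field, term, boost) for field, boost in boosts.items() for term in terms]


def score(document, clauses):
    """Add the boost of every clause whose term occurs in its field."""
    total = 0.0
    for field, term, boost in clauses:
        if term in document.get(field, "").lower().split():
            total += boost
    return total


def search(documents, query, boosts):
    """Return ids of matching documents, best score first (ties by id)."""
    clauses = expand_over_fields(query.lower().split(), boosts)
    scored = [(score(doc, clauses), doc["id"]) for doc in documents]
    hits = [(s, doc_id) for s, doc_id in scored if s > 0]
    return [doc_id for s, doc_id in sorted(hits, key=lambda hit: (-hit[0], hit[1]))]


# Hypothetical sample data: a title match should outrank a body match.
DOCS = [
    {"id": "a", "title": "hello world", "body": "nothing relevant"},
    {"id": "b", "title": "other", "body": "hello world again"},
    {"id": "c", "title": "unrelated", "body": "text"},
]
BOOSTS = {"title": 2.0, "body": 1.0}
```

Calling `search(DOCS, "hello world", BOOSTS)` ranks the title match ahead of the body match and leaves out the document that matches neither field.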

  12. Problems, problems, problems • Can’t find it! • Extraction • Character encoding • Field mapping • Spelling? Fuzzy search? • Synonyms? • Searching is slow • Term expansion • Being too clever • Concurrency • Field contents and size, stop words • Index size, chunk size, fragmentation • Storage alternatives • Juice
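Slide 12 lists fuzzy search under "Can't find it!" and term expansion under "Searching is slow". The sketch below connects the two: fuzzy matching is commonly based on edit distance, and a naive implementation has to compare the query term against every term in the vocabulary. The vocabulary and the `max_edits` default are assumptions; Lucene's own fuzzy matching is implemented differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


def fuzzy_candidates(term, vocabulary, max_edits=1):
    """Scan the whole vocabulary for terms within max_edits of the query term."""
    return sorted(word for word in vocabulary if edit_distance(term, word) <= max_edits)
```

Every candidate returned becomes an extra term in the expanded query, which is one reason inexact searching costs more than an exact lookup.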

  13. Hints, hints, hints • Know your data! • Distribution and decisiveness • Relevancy • Re-apply statistics • Monitor hits • Embrace change! • Content, users, traffic • Test! • Define critical queries • Specify, verify and tune
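The "define critical queries" and "specify, verify and tune" hints on slide 13 can be turned into a small regression check. This is only a sketch: `fake_search`, the query strings and the document ids are hypothetical, and `search_fn` stands for whatever function runs a query against the real index and returns ranked ids.

```python
def check_critical_queries(search_fn, expectations, top_n=3):
    """Return the queries whose expected document is missing from the top hits."""
    failures = []
    for query, expected_id in expectations.items():
        hits = search_fn(query)[:top_n]
        if expected_id not in hits:
            failures.append((query, expected_id, hits))
    return failures


def fake_search(query):
    """Stand-in for a real search call; returns canned ranked ids."""
    canned = {
        "opening hours": ["contact", "about"],
        "price list": ["news", "blog", "jobs", "prices"],
    }
    return canned.get(query, [])


# Specification: each critical query and the page that must appear near the top.
CRITICAL = {"opening hours": "contact", "price list": "prices"}
FAILURES = check_critical_queries(fake_search, CRITICAL)
```

Here "price list" fails because the expected page only shows up in fourth place. Re-running such a check after each relevancy tweak shows whether tuning one query has damaged another.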

  14. The ancient Chinese art of Chi Ting • Relevancy tuning • Apply care! • Achieve objective value • Overdetermination • Stemming and (non-) synonyms • meaning, mean, evil • Keep it secret
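One possible reading of the "meaning, mean, evil" example on slide 14, offered as an assumption rather than the speaker's explanation: a stemmer reduces "meaning" to "mean", a synonym list maps "mean" to "evil", and a search for "meaning" then returns documents about evil. The stemmer and synonym table below are deliberately naive and invented for this sketch.

```python
def naive_stem(word):
    """Strip a few common suffixes, keeping at least three characters."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical synonym table, keyed by stem.
SYNONYMS = {"mean": {"evil"}}


def expand(term):
    """Stem the term, then add any synonyms registered for the stem."""
    stem = naive_stem(term.lower())
    return {stem} | SYNONYMS.get(stem, set())
```

With these rules `expand("meaning")` contains both "mean" and "evil", which illustrates how stemming and synonym expansion applied together can overshoot.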

  15. Adding search to sites • Data or presentation end? • Consider pros and cons • New page templates? • Separate index pages • Modify page templates? • DIV tags • Indexing impact
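Slide 15 mentions modifying page templates with DIV tags. One way this might work, assumed here rather than stated in the slides, is to wrap the indexable part of each page in a marked DIV so that the indexer skips navigation and footers. The sketch uses Python's standard `html.parser`; the class name "searchable" is hypothetical.

```python
from html.parser import HTMLParser


class MarkedDivExtractor(HTMLParser):
    """Collect text found inside DIV elements carrying a marker class."""

    def __init__(self, marker_class="searchable"):
        super().__init__()
        self.marker_class = marker_class
        self.depth = 0  # DIV nesting depth inside the marked region
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Enter the region at a marked DIV; count nested DIVs once inside it.
        if self.depth > 0 or self.marker_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_indexable_text(html, marker_class="searchable"):
    """Return the text inside marked DIVs, joined by single spaces."""
    parser = MarkedDivExtractor(marker_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The indexing impact is that only the marked text reaches the index, so repeated menu and footer words no longer match every query.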

  16. Quality • Make all available information available • Support inexact searching • Be clear and decisive • Provide hints and help • Be quick

  17. Thank you! • More at • http://lucene.apache.org • http://www.jguru.com/faq/Lucene • http://www.getopt.org/luke/ • Questions? kristoffer [at] bekk.no Doug Cutting
