High-quality searching with Lucene
Kristoffer Dyrkorn
Senior Consultant, BEKK Consulting

Dish of the day
• How does a search engine work?
• Applying Lucene
• Searching and quality
• Experiences

whois 127.0.0.1
• Developer/architect at BEKK Consulting
• Systems integration
• Java, open source toolkits, information retrieval
• Languages, usability

How it works
[Diagram: a search engine with two flows around the index (a copy of the data, with a link to the original data source). Indexing: parsing of HTML, DOC, PDF, XML, ..., then field mapping. Query processing: query parsing, lookups, hits.]

The index
• In practice: a table
• 1 row = 1 Document (web page, Word document, PDF, etc.)
• 1 column = 1 Field (document contents or metadata)
• Stored in a file system, a database or RAM
• Field mapping, storage parameters

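The table view above can be sketched in a few lines of plain Python. This is a toy illustration, not Lucene's API or storage format: the field names (`title`, `body`, `url`), the sample rows and the helpers `build_index` and `lookup` are all made up for the example. It keeps the rows as stored data and builds a lookup structure from (field, term) to row numbers for the fields chosen for indexing.

```python
from collections import defaultdict

# Toy "table": one row per document, one column per field.
# Field names and contents are invented for illustration.
documents = [
    {"title": "Lucene in practice", "body": "indexing and searching with lucene", "url": "/a"},
    {"title": "Search quality", "body": "relevancy tuning and testing", "url": "/b"},
]


def build_index(rows, indexed_fields):
    """Map (field, term) -> set of row numbers whose field contains the term."""
    index = defaultdict(set)
    for row_number, row in enumerate(rows):
        for field in indexed_fields:
            for term in row[field].lower().split():
                index[(field, term)].add(row_number)
    return index


# "url" is stored in the rows but deliberately not indexed (a field mapping choice).
index = build_index(documents, indexed_fields=["title", "body"])


def lookup(field, term):
    """Return the stored rows whose given field contains the term."""
    row_numbers = index.get((field, term.lower()), set())
    return [documents[i] for i in sorted(row_numbers)]
```

Calling `lookup("body", "lucene")` returns the first row only; the choice of which columns to index and which to merely store corresponds to the "field mapping, storage parameters" bullet.
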
Getting data
• Web pages, files, databases
• Presentation end
• Data end
• Beware of: data formats, character encodings, multiple languages
[Diagram: database, app server and web server, with full pages and simplified pages as sources.]

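The character encoding warning is easy to reproduce. The sketch below is plain Python and assumes only two candidate encodings (UTF-8 and Latin-1); the helper name `decode_document` is hypothetical. It shows how a Norwegian word turns into garbage when bytes are decoded with the wrong encoding, and one cautious way to fall back when the declared encoding fails.

```python
def decode_document(raw_bytes, declared_encoding="utf-8"):
    """Decode fetched bytes, falling back to Latin-1 if the declared encoding fails.

    Latin-1 maps every byte to a character, so the fallback never raises,
    but it can still produce wrong text if the real encoding was something else.
    """
    try:
        return raw_bytes.decode(declared_encoding)
    except UnicodeDecodeError:
        return raw_bytes.decode("latin-1")


word = "søk"  # Norwegian for "search"

# UTF-8 bytes read as Latin-1 decode without error but yield mojibake,
# which then ends up in the index and can never be matched by a query.
garbled = word.encode("utf-8").decode("latin-1")

# Latin-1 bytes declared as UTF-8 raise an error, which the fallback handles.
recovered = decode_document(word.encode("latin-1"), declared_encoding="utf-8")
```

The silent case is the dangerous one: nothing fails at indexing time, yet the document cannot be found.
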
Letting users find data
• From words to exact definition
• Query parsing
• Term
• Expansion and aggregation
• What to find, what to avoid
• Presentation
• Mandatory fields
• Helpful functionality

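As a rough illustration of going "from words to exact definition", the following sketch parses a query string into required, prohibited and optional terms. The `+` and `-` prefixes mirror the convention used in Lucene's query syntax, but the parser itself is a simplified stand-in written for this example, and `parse_query` and `matches` are hypothetical names.

```python
def parse_query(text):
    """Split a query into required (+term), prohibited (-term) and optional terms."""
    parsed = {"required": [], "prohibited": [], "optional": []}
    for token in text.lower().split():
        if token.startswith("+") and len(token) > 1:
            parsed["required"].append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            parsed["prohibited"].append(token[1:])
        else:
            parsed["optional"].append(token)
    return parsed


def matches(parsed, document_text):
    """Decide whether a document satisfies the parsed query."""
    words = set(document_text.lower().split())
    # "What to avoid": any prohibited term rules the document out.
    if any(term in words for term in parsed["prohibited"]):
        return False
    # "What to find": all required terms must be present.
    if parsed["required"]:
        return all(term in words for term in parsed["required"])
    # With no required terms, at least one optional term must match.
    return any(term in words for term in parsed["optional"])
```

For example, `parse_query("+lucene -database search")` yields one required, one prohibited and one optional term; a real parser would also handle phrases, fields and term expansion.
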
Why searching?
• Several paths to information
• Search, navigate, or use agents
• Tradeoffs
• Text-based search is gaining ground
• But still has limitations
• Apply technology, usability and care
• Or else....
“This is the end we won't take any more Say goodbye”
Searching, seek and destroy (Hetfield/Ulrich)

Lucene
• API
• Index storage and lookups
• Sophisticated ranking
• Query parser
• Related documents
• Performance, efficiency
• To the point
• News! Compressed/binary data fields, virtual indexes
• Open Source modules
• Presentation utilities
• Sophisticated language tools
• Categorisation

Lucene vs a full system
[Diagram labels: Search engine; Administration; User interface; Lucene API; Synonyms, Stemming, Highlighting, Summary, "Did you mean?" (Open Source); Queries; Language processing; Index storage and lookups; Format parsers; Connectors.]

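Applying Lucene
[Diagram: the user query ("hello world") is mapped to one query per field (Query -> Field 1, Query -> Field 2, ...) with MultiFieldQueryParser. Each per-field query gets its own boost via Query.setBoost() (Query 1 & Boost 1 ... Query N & Boost N), and these are combined with any extra queries in a BooleanQuery. A filter spec is mapped to queries against filter fields (Query 1..3 -> Filterfield), combined in a BooleanQuery to form the filter. IndexSearcher takes the query and the filter and returns Hits; each hit is a Document made up of Fields.]
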
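The idea on this slide (one query per field, each with its own boost, combined into a single query and restricted by a filter) can be imitated in a small, self-contained sketch. This is plain Python and not Lucene code: the scoring is a simple boosted term count rather than Lucene's ranking, and the documents, the field names, the boost values and the `search` function are assumptions made for the example.

```python
# Invented sample documents; "section" plays the role of a filter field.
documents = [
    {"id": 1, "title": "hello world", "body": "a first program", "section": "tutorial"},
    {"id": 2, "title": "release notes", "body": "hello to the world of search", "section": "news"},
    {"id": 3, "title": "hello again", "body": "nothing else", "section": "news"},
]

# One boost per searched field: title matches count twice as much as body matches.
field_boosts = {"title": 2.0, "body": 1.0}


def score(document, query_terms, boosts):
    """Sum, over all fields, boost * number of query terms found in that field."""
    total = 0.0
    for field, boost in boosts.items():
        words = document[field].lower().split()
        matched = sum(1 for term in query_terms if term in words)
        total += boost * matched
    return total


def search(query, boosts, filter_spec=None):
    """Return (score, id) pairs, best first, for documents passing the filter."""
    terms = query.lower().split()
    hits = []
    for document in documents:
        # The filter only restricts the candidate set; it does not affect scores.
        if filter_spec and any(document.get(f) != v for f, v in filter_spec.items()):
            continue
        s = score(document, terms, boosts)
        if s > 0:
            hits.append((s, document["id"]))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits
```

With these sample rows, `search("hello world", field_boosts)` ranks document 1 first because both terms match in the boosted title field, and adding `{"section": "news"}` as the filter removes it from the results without changing the scores of the others.
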
Problems, problems, problems
• Can’t find it!
• Extraction
• Character encoding
• Field mapping
• Spelling? Fuzzy search?
• Synonyms?
• Searching is slow
• Term expansion
• Being too clever
• Concurrency
• Field contents and size, stop words
• Index size, chunk size, fragmentation
• Storage alternatives
• Juice

Hints, hints, hints
• Know your data!
• Distribution and decisiveness
• Relevancy
• Re-apply statistics
• Monitor hits
• Embrace change!
• Content, users, traffic
• Test!
• Define critical queries
• Specify, verify and tune

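One way to act on "define critical queries" and "specify, verify and tune" is a small regression check that runs after every tuning change. The sketch below is an assumption about how such a check might look, not something taken from the talk: `check_critical_queries`, the canned `toy_search` function and the example queries and paths are all hypothetical.

```python
def check_critical_queries(search, critical_queries, top_n=3):
    """Return the critical queries whose expected document is not in the top N hits."""
    failures = []
    for query, expected_id in critical_queries.items():
        top_hits = search(query)[:top_n]
        if expected_id not in top_hits:
            failures.append((query, expected_id, top_hits))
    return failures


def toy_search(query):
    """Stand-in for the real search call; returns ranked document ids."""
    canned = {
        "opening hours": ["/contact", "/about"],
        "price list": ["/news", "/home"],
    }
    return canned.get(query, [])


# Specification: for each critical query, the page that must appear near the top.
critical = {"opening hours": "/contact", "price list": "/prices"}

failures = check_critical_queries(toy_search, critical)
```

Here `failures` contains the "price list" query, which signals that a change to content, boosts or analysis has pushed the expected page out of the top results.
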
The ancient Chinese art of Chi Ting
• Relevancy tuning
• Apply care!
• Achieve objective value
• Overdetermination
• Stemming and (non-) synonyms
• meaning, mean, evil
• Keep it secret

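One possible reading of the "meaning, mean, evil" bullet is that stemming and synonym expansion can chain into matches the user never intended. The sketch below illustrates that reading only; the suffix-stripping stemmer is deliberately naive, and the synonym table and the `expand` helper are invented for the example.

```python
def naive_stem(word):
    """Strip a few common suffixes; far cruder than a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical synonym table: "mean" in the sense of "nasty".
synonyms = {"mean": {"evil"}}


def expand(term):
    """Return the set of terms a query term expands to after stemming and synonyms."""
    stem = naive_stem(term.lower())
    return {term.lower(), stem} | synonyms.get(stem, set())
```

With this setup, `expand("meaning")` includes "evil": the stemmer reduces "meaning" to "mean", and the synonym table then adds a word from an unrelated sense. Each step looks reasonable alone, which is why the slide asks for care.
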
Adding search to sites
• Data or presentation end?
• Consider pros and cons
• New page templates?
• Separate index pages
• Modify page templates?
• DIV tags
• Indexing impact

Quality
• Make all available information available
• Support inexact searching
• Be clear and decisive
• Provide hints and help
• Be quick

Thank you!
• More at:
• http://lucene.apache.org
• http://www.jguru.com/faq/Lucene
• http://www.getopt.org/luke/
• Questions? kristoffer [at] bekk.no
Doug Cutting