Lucene-Demo

Lucene-Demo Brian Nisonger

Intro
• No details about Implementation/Theory
• See Treehouse Wiki - Lucene for additional info
• Set of Java classes
• Not an end-to-end solution
• Designed to allow rapid development of IR tools

Index
• The first step is to take a set of text documents and build an Index
• Demo: IndexFiles on Pongo
• Two major classes
  • Analyzer
    • Used to tokenize data
    • More on this later
  • IndexWriter
    • IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);

Index Writer
• IndexWriter creates an index of documents
• First argument is the directory in which to build/find the index
• Second argument is the Analyzer used to tokenize the documents
• Third argument determines whether a new index should be created

Analyzer
• Standard Analyzer
• Porter Stemming w/ Stop Words
• Krovetz Stemmer-Example

    package org.apache.lucene.analysis;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.*;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.LowerCaseTokenizer;
    import org.apache.lucene.analysis.KStemFilter;
    import java.io.Reader;

    public class KStemAnalyzer extends Analyzer
    {
        public final TokenStream tokenStream(String fieldName, Reader reader)
        {
            return new KStemFilter(new LowerCaseTokenizer(reader));
        }
    }
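The idea behind an analyzer chain (tokenize, lowercase, filter) can be sketched in plain Java without any Lucene classes. This is only an illustration under assumptions: the class name ToyAnalyzer and its stop list are hypothetical, and no stemming is performed, so it is not equivalent to the KStemAnalyzer above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ToyAnalyzer {
    // Hypothetical stop list for illustration; not Lucene's built-in list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    /** Splits on non-letters, lowercases each token, and drops stop words. */
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Run one step past the end so the final token is flushed.
        for (int i = 0; i <= text.length(); i++) {
            char c = i < text.length() ? text.charAt(i) : ' ';
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                String token = current.toString();
                if (!STOP_WORDS.contains(token)) {
                    tokens.add(token);
                }
                current.setLength(0);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Index of the Documents"));
    }
}
```

A stemming step would slot in where the token is added to the list, in the same way KStemFilter wraps LowerCaseTokenizer.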

Analyzer-II
• Snowball Stemmer
  • A stemmer language created by Porter, used to build Stemmers
  • Multilingual analyzers/Stemmers
  • Porter2
  • Fully integrated with Lucene 1.9.1
• MyAnalyzer (Home Built)
• Demo

Adding Documents
• The next step after creating an index is to add documents
  • writer.addDocument(FileDocument.Document(file));
• Remember we already determined how the document will be tokenized
• Fields
  • Can split a document into parts such as document title, body, date created, paragraphs

Adding Documents-II
• Assigns Token/doc ID
  • For why this is important see Lucene - TreeHouse Wiki
• Create some type of loop to add all the documents
• This is the actual creation of the Index; before this we had merely set the Index parameters
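The loop that adds documents and assigns doc IDs can be pictured with a toy inverted index in plain Java. This is a minimal sketch, assuming a simple letters-only tokenizer; ToyIndex, addDocument, and lookup are hypothetical names chosen for illustration and say nothing about how Lucene stores its index internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class ToyIndex {
    private final List<String> docNames = new ArrayList<>();
    // term -> ascending list of IDs of the documents containing it
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /** Adds one document and returns the doc ID assigned to it. */
    public int addDocument(String name, String text) {
        int docId = docNames.size();
        docNames.add(name);
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    /** Returns the IDs of the documents containing the term (empty if none). */
    public List<Integer> lookup(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        String[][] files = {
            {"a.txt", "Lucene builds an index"},
            {"b.txt", "The index is searched"}
        };
        // The "some type of loop" from the slide.
        for (String[] file : files) {
            index.addDocument(file[0], file[1]);
        }
        System.out.println(index.lookup("index"));
    }
}
```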

Finalizing Index Creation
• After that the Index is optimized with writer.optimize();
  • Merges etc.
• The Index is closed with writer.close();

Searching an Index
• Open Index
  • IndexReader reader = IndexReader.open(index);
• Create Searcher
  • Searcher searcher = new IndexSearcher(reader);
• Assign Analyzer
  • Use the same Analyzer used to create the Index (Why?)
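One way to approach the "Why?" is a small plain-Java experiment: if the index was built with a lowercasing analysis and the query is analyzed differently, query terms may no longer match the stored terms. The sketch below uses hypothetical names (AnalyzerMismatch, lowercasing, caseSensitive, matches) and no Lucene classes.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class AnalyzerMismatch {
    /** Index-time analysis in this sketch: split on whitespace and lowercase. */
    public static Set<String> lowercasing(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** A different analysis: split on whitespace but keep the original case. */
    public static Set<String> caseSensitive(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** True when every query term is present among the indexed terms. */
    public static boolean matches(Set<String> indexedTerms, Set<String> queryTerms) {
        return !queryTerms.isEmpty() && indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        Set<String> indexed = lowercasing("Apache Lucene Demo");
        // Same analysis at query time: the term is found.
        System.out.println(matches(indexed, lowercasing("Lucene")));
        // Different analysis at query time: "Lucene" is not the stored "lucene".
        System.out.println(matches(indexed, caseSensitive("Lucene")));
    }
}
```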

Searching an Index-II
• Parse/Create query
  • Query query = QueryParser.parse(line, field, analyzer);
  • Takes a line, looks for a particular field, and runs it through an analyzer to create a query
• Determine which documents are matches
  • Hits hits = searcher.search(query);

Retrieving Documents
• Hits creates a collection of documents
• Using a loop we can reference each doc
  • Document doc = hits.doc(i);
• This allows us to get info about the document
  • Name of document, date it was created, words in document
  • Relevancy Score (TF/IDF)
• Demo
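To give a feel for the TF/IDF relevancy score mentioned above, here is a minimal sketch of the textbook form, term frequency times ln(N / document frequency). It is an assumption-laden simplification: Lucene's actual scoring adds further factors, and ToyTfIdf and score are hypothetical names.

```java
public class ToyTfIdf {
    /**
     * Textbook TF-IDF for one term in one document:
     * (occurrences in the document) * ln(total documents / documents containing the term).
     * Returns 0 when the term does not occur in the document.
     */
    public static double score(String term, int docId, String[][] docs) {
        int termFreq = 0;
        for (String token : docs[docId]) {
            if (token.equals(term)) {
                termFreq++;
            }
        }
        if (termFreq == 0) {
            return 0.0;
        }
        int docFreq = 0;
        for (String[] doc : docs) {
            for (String token : doc) {
                if (token.equals(term)) {
                    docFreq++;
                    break; // count each document once
                }
            }
        }
        return termFreq * Math.log((double) docs.length / docFreq);
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"lucene", "index", "lucene"},
            {"index", "search"},
            {"search", "query"}
        };
        System.out.println(score("lucene", 0, docs));
        System.out.println(score("index", 0, docs));
    }
}
```

In this toy collection "lucene" scores higher than "index" for the first document, because it occurs more often there and appears in fewer documents overall.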

Finishing Searching
• Return list of documents
• Close Reader

Other Functions
• Spans (Example from http://lucene.apache.org/java/docs/api/index.html)
  • Useful for Phrasal matching
  • Allows for Passage Retrieval
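The intuition behind span-style phrasal matching is that token positions are kept, so a query can ask whether two terms occur close together and in order. The plain-Java sketch below is loosely modeled on that idea only; ToySpanMatcher and near are hypothetical names, and the slop parameter here simply means the maximum number of tokens allowed between the two terms.

```java
public class ToySpanMatcher {
    /**
     * Returns true if `second` occurs after `first` with at most `slop`
     * tokens between them. A slop of 0 means the terms must be adjacent.
     */
    public static boolean near(String[] tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            // j - i - 1 is the number of tokens between positions i and j.
            for (int j = i + 1; j < tokens.length && j - i - 1 <= slop; j++) {
                if (tokens[j].equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "quick", "brown", "fox"};
        System.out.println(near(tokens, "quick", "brown", 0));
        System.out.println(near(tokens, "quick", "fox", 0));
        System.out.println(near(tokens, "quick", "fox", 1));
    }
}
```

Returning the matching positions instead of a boolean would give the start and end of a passage, which is the basis for passage retrieval.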

Questions?
• Any questions, comments, jokes, opinions?

I said “Good Day”
• The END
