
Lucene-Demo



Presentation Transcript


  1. Lucene-Demo Brian Nisonger

  2. Intro • No details about implementation or theory • See the Treehouse Wiki page on Lucene for additional info • Lucene is a set of Java classes • Not an end-to-end solution • Designed to allow rapid development of IR tools

  3. Index • The first step is to take a set of text documents and build an index • Demo: IndexFiles on Pongo • Two major classes • Analyzer • Used to tokenize data • More on this later • IndexWriter • IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);

  4. IndexWriter • IndexWriter creates an index of documents • The first argument is the directory in which to build or find the index • The second argument is the Analyzer used to tokenize the documents • The third argument determines whether a new index should be created

  5. Analyzer • Standard Analyzer • Porter Stemming w/ Stop Words • Krovetz Stemmer-Example:

      package org.apache.lucene.analysis;

      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.standard.*;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.StopFilter;
      import org.apache.lucene.analysis.LowerCaseTokenizer;
      import org.apache.lucene.analysis.KStemFilter;
      import java.io.Reader;

      public class KStemAnalyzer extends Analyzer
      {
          public final TokenStream tokenStream(String fieldName, Reader reader)
          {
              return new KStemFilter(new LowerCaseTokenizer(reader));
          }
      }

  6. Analyzer-II • Snowball Stemmer • A language created by Porter for writing stemmers • Multilingual analyzers/stemmers • Porter2 • Fully integrated with Lucene 1.9.1 • MyAnalyzer (home-built) • Demo
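
To make the idea of an analysis chain concrete, here is a minimal sketch in plain Java of what a home-built analyzer does conceptually: lowercase, tokenize, drop stop words, then stem. It does not use Lucene's Analyzer API; the class name `ToyAnalyzer`, the stop-word list, and the crude suffix-stripping rule are all assumptions made for illustration and are far simpler than the Porter, Krovetz, or Snowball stemmers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain in plain Java; this is NOT Lucene's Analyzer API.
class ToyAnalyzer {
    // Hypothetical stop-word list, chosen only for illustration.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Lowercase the text, split on non-letters, drop stop words, then stem.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            tokens.add(stem(raw));
        }
        return tokens;
    }

    // Crude suffix stripping; a stand-in for a real stemmer, not Porter or KStem.
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        // prints [index, document]
        System.out.println(analyze("The indexing of documents"));
    }
}
```

In Lucene itself the same layering is expressed by wrapping a tokenizer in one or more filters, as in the KStemAnalyzer example on slide 5.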

  7. Adding Documents • The next step after creating an index is to add documents • writer.addDocument(FileDocument.Document(file)); • Remember, we have already determined how the document will be tokenized • Fields • Can split a document into parts such as title, body, date created, paragraphs

  8. Adding Documents-II • Assigns token/doc IDs • For why this is important, see the Lucene page on the TreeHouse Wiki • Create some type of loop to add all the documents • This is the actual creation of the index; before this step we merely set the index parameters
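
The loop and the doc ID assignment described above can be sketched with a toy in-memory inverted index. This is plain Java written for illustration, not Lucene code and not Lucene's index format; `ToyIndex`, its fields, and the sample file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index in plain Java; an illustration, not Lucene's index format.
class ToyIndex {
    // term -> sorted set of IDs of the documents that contain it
    final Map<String, TreeSet<Integer>> postings = new TreeMap<>();
    final List<String> names = new ArrayList<>();

    // Adds one document and returns the ID assigned to it.
    int addDocument(String name, String text) {
        int docId = names.size();   // IDs are handed out in insertion order
        names.add(name);
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    public static void main(String[] args) {
        String[][] files = {
            {"a.txt", "Lucene builds an index"},
            {"b.txt", "An index of terms"}
        };
        ToyIndex index = new ToyIndex();
        for (String[] file : files) {   // the loop that adds every document
            index.addDocument(file[0], file[1]);
        }
        System.out.println(index.postings);
    }
}
```

Each term ends up mapped to the IDs of the documents that contain it, which is the structure a searcher consults later.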

  9. Finalizing Index Creation • After that, the index is optimized with writer.optimize(); • Merges etc. • The index is closed with writer.close();

  10. Searching an Index • Open Index • IndexReader reader = IndexReader.open(index); • Create Searcher • Searcher searcher = new IndexSearcher(reader); • Assign Analyzer • Use the same Analyzer that was used to create the index (Why?)
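
One way to see why the analyzer must match is the toy example below, in plain Java rather than Lucene. It assumes a crude index-time "stemmer" that only strips a trailing "s"; the class and method names are hypothetical. Terms stored in the index are the analyzed forms, so a query analyzed differently looks up a string that was never stored.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy illustration of analyzer mismatch; plain Java, not Lucene code.
class AnalyzerMismatch {
    // Index-time analysis: lowercase and strip a trailing "s" (a crude stand-in for stemming).
    static String stemmed(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    // A different analysis that only lowercases.
    static String lowercased(String word) {
        return word.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Set<String> indexedTerms = new HashSet<>();
        for (String word : "Stemmers reduce words".split(" ")) {
            indexedTerms.add(stemmed(word));   // index holds: stemmer, reduce, word
        }
        // Same analysis at query time: the term is found (prints true).
        System.out.println(indexedTerms.contains(stemmed("Stemmers")));
        // Different analysis at query time: "stemmers" was never stored (prints false).
        System.out.println(indexedTerms.contains(lowercased("Stemmers")));
    }
}
```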

  11. Searching an Index-II • Parse/Create query • Query query = QueryParser.parse(line, field, analyzer); • Takes a line, looks for a particular field, and runs it through an analyzer to create a query • Determine which documents match • Hits hits = searcher.search(query);
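
As a rough picture of what happens between parsing a query and getting matches back, the sketch below runs a query line through the same simple tokenization as the index and intersects the posting sets. It is plain Java over a hypothetical in-memory postings map, not Lucene's QueryParser or Searcher, and it assumes every query term is required (AND), which is a simplification.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy conjunctive (AND) search over an in-memory postings map; not Lucene's QueryParser.
class ToySearch {
    // Returns the IDs of documents that contain every term of the query line.
    static TreeSet<Integer> search(String line, Map<String, TreeSet<Integer>> postings) {
        TreeSet<Integer> result = null;
        for (String term : line.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: start from its postings
            } else {
                result.retainAll(docs);         // later terms: keep only shared doc IDs
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        Map<String, TreeSet<Integer>> postings = new HashMap<>();
        postings.put("lucene", new TreeSet<>(Arrays.asList(0, 2)));
        postings.put("index", new TreeSet<>(Arrays.asList(0, 1)));
        // prints [0]
        System.out.println(search("Lucene index", postings));
    }
}
```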

  12. Retrieving Documents • Hits creates a collection of documents • Using a loop we can reference each doc • Document doc = hits.doc(i); • This allows us to get info about the document • Name of document, date it was created, words in document • Relevancy score (TF/IDF) • Demo
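
For readers unfamiliar with TF/IDF, here is a minimal sketch of the textbook form, score = tf × ln(N / df), in plain Java. It is not Lucene's scoring formula, which includes further factors; the class name `ToyTfIdf` and the two-document corpus are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Textbook TF-IDF in plain Java; Lucene's own scoring has additional factors.
class ToyTfIdf {
    // Number of times the term occurs among the document's tokens.
    static int termFrequency(String term, List<String> doc) {
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // score = tf * ln(N / df); 0 when no document in the corpus contains the term.
    static double score(String term, List<String> doc, List<List<String>> corpus) {
        int df = 0;
        for (List<String> d : corpus) {
            if (d.contains(term)) {
                df++;
            }
        }
        if (df == 0) {
            return 0.0;
        }
        return termFrequency(term, doc) * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<String> d0 = Arrays.asList("lucene", "index", "index");
        List<String> d1 = Arrays.asList("index", "search");
        List<List<String>> corpus = Arrays.asList(d0, d1);
        System.out.println(score("lucene", d0, corpus));  // rare term: ln(2)
        System.out.println(score("index", d0, corpus));   // term in every doc: 0.0
    }
}
```

A term that appears in every document contributes nothing, however often it occurs, while a term found in few documents is weighted up.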

  13. Finishing Searching • Return list of documents • Close Reader

  14. Other Functions • Spans (Example from http://lucene.apache.org/java/docs/api/index.html) • Useful for Phrasal matching • Allows for Passage Retrieval
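
The idea behind position-based matching can be sketched without Lucene: if token positions are known, a phrase matches wherever its terms occur at consecutive positions, and the start position tells you which passage to return. The code below is plain Java and does not use Lucene's span classes; `ToyPhraseMatch` and the sample sentence are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy positional phrase matching in plain Java; not Lucene's span query API.
class ToyPhraseMatch {
    // Returns every token position at which the whole phrase starts.
    static List<Integer> phrasePositions(String[] tokens, String[] phrase) {
        List<Integer> starts = new ArrayList<>();
        if (phrase.length == 0) {
            return starts;   // an empty phrase matches nothing
        }
        for (int i = 0; i + phrase.length <= tokens.length; i++) {
            boolean match = true;
            for (int j = 0; j < phrase.length; j++) {
                if (!tokens[i + j].equals(phrase[j])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String[] tokens = "open source search engine for open source".split(" ");
        // prints [0, 5]
        System.out.println(phrasePositions(tokens, new String[] {"open", "source"}));
    }
}
```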

  15. Questions? • Any questions, comments, jokes, opinions?

  16. I said “Good Day” • The END
