
The Lucene Search Engine

The Lucene Search Engine. Kira Radinsky. Modified by Amit Gross for Lucene 4. Based on material from Thomas Paul and Steven J. Owens.


Presentation Transcript


  1. The Lucene Search Engine
     Kira Radinsky
     Modified by Amit Gross for Lucene 4
     Based on material from Thomas Paul and Steven J. Owens

  2. What is Lucene?
     • Doug Cutting's grandmother's middle name
     • An open-source set of Java classes
     • Search engine / document classifier / indexer
     • Developed by Doug Cutting (1996)
         • Xerox / Apple / Excite / Nutch / Yahoo / Cloudera
         • Hadoop founder, member of the board of directors of the Apache Software Foundation
     • A Jakarta Apache product with strong open-source community support
     • High-performance, full-featured text search engine library
     • Easy-to-use yet powerful API

  3. Use the Source, Luke
     • Document
     • Field
         • Represents a section of a Document: a name for the section plus the actual data.
     • Analyzer
         • Abstract class (provides the interface)
         • Turns a Document's text into tokens for later indexing
         • StandardAnalyzer class (a concrete implementation)
     • IndexWriter
         • Creates and maintains indexes.
     • IndexSearcher
         • Searches through an index.
     • QueryParser
         • Parses a search string into a Query that can be run against an index.
     • Query
         • Abstract class that contains the search criteria created by the QueryParser.
     • TopDocs
         • Contains the top K hits (document IDs and their scores) found in a search by an IndexSearcher.

  4. Indexing a Document

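  5. Document from an article

     // Uses org.apache.lucene.document.* and org.apache.lucene.document.Field.Store
     private Document createDocument(String article, String author, String title,
                                     String topic, String url, Date dateWritten) {
         Document document = new Document();  // start from an empty document
         document.add(new TextField("author", author, Store.YES));
         document.add(new TextField("title", title, Store.YES));
         document.add(new TextField("topic", topic, Store.YES));
         document.add(new TextField("article", article, Store.NO));  // searchable, but not stored
         document.add(new StoredField("URL", url));                  // stored, but not searchable
         // StringField takes a String value, so the Date is converted first
         document.add(new StringField("Date",
                 DateTools.dateToString(dateWritten, DateTools.Resolution.DAY), Store.NO));
         return document;
     }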

  6. The Field Object
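Slide 5 uses three field classes. In Lucene 4, a TextField is indexed and tokenized, a StringField is indexed verbatim as a single token, and a StoredField is only kept for retrieval. The toy model below is a minimal sketch of that distinction in plain Java. `ToyFieldKind` and `searchableTerms` are hypothetical names, not Lucene types, and the whitespace tokenization is a simplification of what a real analyzer does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model of the field kinds used on slide 5; not part of Lucene.
enum ToyFieldKind {
    TEXT_FIELD(true, true),      // indexed and split into tokens
    STRING_FIELD(true, false),   // indexed verbatim as one token
    STORED_FIELD(false, false);  // kept for retrieval only, not searchable

    final boolean indexed;
    final boolean tokenized;

    ToyFieldKind(boolean indexed, boolean tokenized) {
        this.indexed = indexed;
        this.tokenized = tokenized;
    }

    // Returns the terms that a search could match for the given field value.
    List<String> searchableTerms(String value) {
        if (!indexed) {
            return Collections.emptyList();
        }
        if (!tokenized) {
            return Collections.singletonList(value);
        }
        // Simplified tokenization: lowercase and split on whitespace.
        return Arrays.asList(value.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }
}
```

Under this model, an exact-match field such as a date or an ID fits STRING_FIELD, free text fits TEXT_FIELD, and data that is only displayed with the results (such as the URL) fits STORED_FIELD.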

  7. Store a Document in the index

     Directory dir = FSDirectory.open(new File("lucene-index"));

     private void indexDocument(Document document) throws Exception {
         Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
         IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
         IndexWriter writer = new IndexWriter(dir, iwc);
         writer.addDocument(document);
         writer.close();  // commits the changes and releases the index write lock
     }
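To give a feel for what adding a document to an index involves, here is a minimal sketch of an inverted index in plain Java: a map from each term to the IDs of the documents that contain it. `ToyInvertedIndex` and its methods are hypothetical names for illustration. Lucene's on-disk index is far more elaborate (segments, term positions, scoring data), so this is a conceptual aid, not a description of its implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy in-memory inverted index; not Lucene's data structure.
class ToyInvertedIndex {
    // term -> sorted IDs of the documents containing that term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    // Tokenizes the text, records each term, and returns the new document's ID.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Looks up the documents containing a single term (case-insensitive).
    SortedSet<Integer> docsContaining(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySortedSet());
    }
}
```

A lookup is then a single map access per term, which is why searching an index is much cheaper than scanning every document.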

  8. Analyzers and Tokenizers
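An analyzer turns raw text into the tokens that are actually indexed. As a rough illustration, the sketch below splits text on non-alphanumeric characters, lowercases each token, and drops a few stop words. `ToyAnalyzer` and its stop-word list are assumptions made for this example. StandardAnalyzer performs similar steps in Lucene 4, but with a proper Unicode-aware tokenizer and its own stop-word set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer: tokenize, lowercase, remove stop words. Not a Lucene class.
class ToyAnalyzer {
    // Illustrative stop-word list, chosen for this example only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

The same analysis has to be applied to the query text as to the indexed text, otherwise query terms will not match the stored tokens. This is why the slides pass an analyzer to both the IndexWriterConfig and the QueryParser.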

  9. Adding to an Index

     public void indexArticle(String article, String author, String title,
                              String topic, String url, Date dateWritten) throws Exception {
         Document document = createDocument(article, author, title, topic, url, dateWritten);
         indexDocument(document);
     }

  10. Searching the Index

  11. Searching

      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
      IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
      // "article" is the default field for query terms that do not name a field
      QueryParser qp = new QueryParser(Version.LUCENE_45, "article", analyzer);
      Query q = qp.parse(searchString);
      TopDocs top = searcher.search(q, numResults);

  12. Extracting Document objects

      for (ScoreDoc sd : top.scoreDocs) {
          Document doc = searcher.doc(sd.doc);
          // display the articles that were found to the user
      }
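The call `searcher.search(q, numResults)` returns only the `numResults` best-scoring hits, ordered by score. One common way to keep the best K of many scored candidates is a bounded min-heap, sketched below with plain Java collections. `ToyTopK` is a hypothetical name and the scores are supplied by the caller. This illustrates the general technique and should not be read as Lucene's actual collector code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Toy top-K selection over (docId -> score) pairs; not a Lucene class.
class ToyTopK {
    static List<Map.Entry<Integer, Double>> topK(Map<Integer, Double> scores, int k) {
        // Min-heap ordered by score: the weakest retained hit sits at the head.
        PriorityQueue<Map.Entry<Integer, Double>> heap =
                new PriorityQueue<>((x, y) -> Double.compare(x.getValue(), y.getValue()));
        for (Map.Entry<Integer, Double> hit : scores.entrySet()) {
            heap.add(hit);
            if (heap.size() > k) {
                heap.poll();  // evict the current lowest-scoring hit
            }
        }
        // Return the survivors ordered from best to worst score.
        List<Map.Entry<Integer, Double>> best = new ArrayList<>(heap);
        best.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        return best;
    }
}
```

Because the heap never holds more than K + 1 entries, memory stays proportional to K rather than to the number of matching documents.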

  13. Search Criteria
      Supports several kinds of search: AND, OR, and NOT; fuzzy searches; proximity searches; wildcard searches; and range searches.
      • author:Henry relativity AND "quantum physics"
      • "string theory" NOT Einstein
      • "Galileo Kepler"~5
      • author:Johnson date:[01/01/2004 TO 01/31/2004]
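A proximity query such as `"Galileo Kepler"~5` asks for documents in which the two terms occur close to each other. The sketch below shows the basic idea in plain Java by checking whether two terms appear within a given number of token positions. `ToyProximity` is a hypothetical helper, and its distance rule is a simplification. Lucene's slop value is defined in terms of position moves, so the results are not guaranteed to be identical.

```java
import java.util.Locale;

// Toy proximity check: are two terms within maxGap token positions? Not a Lucene class.
class ToyProximity {
    static boolean within(String text, String first, String second, int maxGap) {
        String a = first.toLowerCase(Locale.ROOT);
        String b = second.toLowerCase(Locale.ROOT);
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        int lastA = -1;  // most recent position of the first term
        int lastB = -1;  // most recent position of the second term
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(a)) lastA = i;
            if (tokens[i].equals(b)) lastB = i;
            // Comparing the most recent positions is enough to find any close pair.
            if (lastA >= 0 && lastB >= 0 && Math.abs(lastA - lastB) <= maxGap) {
                return true;
            }
        }
        return false;
    }
}
```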

  14. Thread Safety
      • Indexing and searching are not only thread safe, but process safe. This means that:
          • Multiple index searchers can read the Lucene index files at the same time.
          • An index writer or reader can edit the Lucene index files while searches are ongoing.
          • Multiple index writers or readers can try to edit the Lucene index files at the same time (it is important to close the index writer/reader so that it releases the file lock).
      • The query parser is not thread safe.
      • The index writer, however, is thread safe.
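The last two points suggest a usage pattern: share a single index writer between threads, but give each thread its own query parser. The sketch below shows that pattern with stand-in classes in plain Java. `ToyWriter`, `ToyParser`, and `indexConcurrently` are hypothetical names and none of this is Lucene code. The writer stand-in is made thread safe with an atomic counter, and the parser stand-in is confined to one thread with a `ThreadLocal`.

```java
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy illustration of "share the writer, do not share the parser".
class ToySharing {
    // Stand-in for a thread-safe writer: one instance shared by all threads.
    static class ToyWriter {
        private final AtomicInteger docCount = new AtomicInteger();
        void addDocument(String text) { docCount.incrementAndGet(); }
        int numDocs() { return docCount.get(); }
    }

    // Stand-in for a parser with unsynchronized mutable state: one per thread.
    static class ToyParser {
        private String lastInput;
        String parse(String input) {
            lastInput = input.trim();
            return lastInput.toLowerCase(Locale.ROOT);
        }
    }

    static int indexConcurrently(int threads, int docsPerThread) {
        ToyWriter sharedWriter = new ToyWriter();
        ThreadLocal<ToyParser> parsers = ThreadLocal.withInitial(ToyParser::new);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < docsPerThread; i++) {
                    // parsers.get() returns this thread's private parser instance
                    sharedWriter.addDocument(parsers.get().parse(" Doc " + i));
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("indexing tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sharedWriter.numDocs();
    }
}
```

Creating a new parser per query is an equally simple alternative when parser construction is cheap.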

  15. Luke
      • Luke is a handy development tool that lets you inspect an existing Lucene index.
      • http://code.google.com/p/luke/
