
Indexing



Presentation Transcript


  1. Indexing: The essential step in searching

  2. Review a bit
     • We have seen so far:
       • Crawling
         • In the abstract and as implemented
         • Your own code and Nutch
         • If you are unsure about anything related to crawling, be sure to speak up now!
       • Collection building
         • Once you have crawled, you have a collection of documents. Presumably, you want to be able to retrieve the documents that are relevant to a specified information need.

  3. Information Retrieval
     • Finding the specific bit of information in the collection that satisfies a need and allows a user to complete a task.
     • Remember – a web search does not search the web directly.
       • It searches in the index created when the web pages were found and analyzed.
     • Last class, we saw the basic structure of indexing. Review:

  4. Inverted index construction
     Documents to be indexed: Friends, Romans, countrymen.
       → Tokenizer
     Token stream: Friends  Romans  Countrymen
       → Linguistic modules (stop words, stemming, capitalization, cases, etc.)
     Modified tokens: friend  roman  countryman
       → Indexer
     Inverted index:
       friend      → 2, 4
       roman       → 1, 2
       countryman  → 13, 16
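The first two stages of the pipeline above (tokenizer, then linguistic modules) can be sketched in a few lines of Python. This is only an illustration, not the method used in class or in Nutch: the stop-word list, the `IRREGULAR` lookup table, and the "strip a trailing s" rule are toy assumptions standing in for a real stemmer.

```python
import re

# Toy stop-word list and irregular-plural table; both are illustrative assumptions.
STOP_WORDS = {"the", "a", "an", "and", "of", "to"}
IRREGULAR = {"countrymen": "countryman"}

def tokenize(text):
    """Split raw text into a token stream of words (apostrophes are kept)."""
    return re.findall(r"[A-Za-z']+", text)

def normalize(token):
    """Lowercase the token, then apply a deliberately naive singularization."""
    token = token.lower()
    if token in IRREGULAR:
        return IRREGULAR[token]
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def modified_tokens(text):
    """Tokenize, drop stop words, and normalize what remains."""
    return [normalize(t) for t in tokenize(text) if t.lower() not in STOP_WORDS]

print(modified_tokens("Friends, Romans, countrymen."))
# ['friend', 'roman', 'countryman']
```

A production system would replace `normalize` with a proper stemmer or lemmatizer; the point here is only the order of the stages.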

  5. Indexer steps: Token sequence
     • Sequence of (modified token, document ID) pairs: initially all the tokens from document 1, then all the tokens from document 2, etc., without regard for duplication.
     Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
     Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
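As a rough sketch of this step, the Python below emits the (token, docID) pairs for the two example documents. The only modification applied to tokens here is lowercasing, which is an assumption made for brevity; the names `DOCS` and `token_pairs` are hypothetical.

```python
import re

# The two example documents from the slide, keyed by document ID.
DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def token_pairs(docs):
    """Return (token, doc_id) pairs in document order, duplicates included."""
    pairs = []
    for doc_id in sorted(docs):
        # Lowercasing is the only normalization applied in this sketch.
        for token in re.findall(r"[a-z']+", docs[doc_id].lower()):
            pairs.append((token, doc_id))
    return pairs

pairs = token_pairs(DOCS)
print(pairs[:5])
# [('i', 1), ('did', 1), ('enact', 1), ('julius', 1), ('caesar', 1)]
print(len(pairs))
# 29
```

Note that repeated tokens such as `('killed', 1)` appear twice at this stage; merging happens later.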

  6. Indexer steps: Sort
     • Sort by term
     • Then by docID within each term
     This is the core indexing step.
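A minimal sketch of the sort, assuming the pairs are held in memory as Python tuples (a large collection would need an external, disk-based sort instead). The sample pairs are a hand-picked subset of the Julius Caesar example.

```python
# A few (term, doc_id) pairs in the order they were read from the documents.
pairs = [
    ("i", 1), ("did", 1), ("enact", 1), ("julius", 1), ("caesar", 1),
    ("so", 2), ("let", 2), ("it", 2), ("be", 2), ("with", 2), ("caesar", 2),
]

# Primary key: the term. Secondary key: the docID within each term.
sorted_pairs = sorted(pairs, key=lambda pair: (pair[0], pair[1]))

for term, doc_id in sorted_pairs:
    print(term, doc_id)
```

After sorting, all entries for the same term sit next to each other, which is what makes the merge into postings lists straightforward.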

  7. Indexer steps: Dictionary & Postings
     • Multiple entries for a term within a single document are merged.
     • The result is split into a dictionary and postings.
     • Document frequency information is added.
     Dictionary: each term with the number of documents in which the term appears.
     Postings: the IDs of the documents that contain the term.
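One way to sketch this step in Python is shown below. It assumes the input is already sorted by term and then by docID; the function name `build_index` and the use of plain dicts and lists are illustrative choices, not a description of how Nutch stores its index.

```python
def build_index(sorted_pairs):
    """Merge sorted (term, doc_id) pairs into a dictionary and postings.

    Returns (dictionary, postings) where dictionary maps each term to its
    document frequency and postings maps each term to its list of docIDs.
    """
    postings = {}
    for term, doc_id in sorted_pairs:
        doc_ids = postings.setdefault(term, [])
        # Duplicates are adjacent because the input is sorted, so comparing
        # against the last docID is enough to merge repeated entries.
        if not doc_ids or doc_ids[-1] != doc_id:
            doc_ids.append(doc_id)
    dictionary = {term: len(doc_ids) for term, doc_ids in postings.items()}
    return dictionary, postings

# Hand-picked, already sorted subset of the Julius Caesar example.
sorted_pairs = [
    ("brutus", 1), ("brutus", 2),
    ("caesar", 1), ("caesar", 2), ("caesar", 2),
    ("killed", 1), ("killed", 1),
    ("noble", 2),
]

dictionary, postings = build_index(sorted_pairs)
print(dictionary)  # {'brutus': 2, 'caesar': 2, 'killed': 1, 'noble': 1}
print(postings)    # {'brutus': [1, 2], 'caesar': [1, 2], 'killed': [1], 'noble': [2]}
```

Here "killed" occurs twice in document 1 but contributes a single posting, so its document frequency is 1.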

  8. Spot check
     • Complete the indexing for the following two “documents.”
       • Of course, the examples have to be very small to be manageable. Imagine that you are indexing the entire news stories.
     • Construct the charts as seen on the previous slide.
     • Put your solution in the Blackboard discussion board “Indexing – Spot Check 1.” You will find it on the content homepage.
     Document 1: Pearson and Google Jump Into Learning Management With a New, Free System
     Document 2: Pearson adds free learning management tools to Google Apps for Education
