Indexing: The essential step in searching
Review a bit
• We have seen so far:
  • Crawling
    • In the abstract and as implemented
    • Your own code and Nutch
    • If you are unsure about anything related to crawling, be sure to speak up now!
  • Collection Building
    • Once you have crawled, you have a collection of documents. Presumably, you want to be able to retrieve the documents that are relevant to a specified information need.
Information Retrieval
• Finding the specific bit of information in the collection that satisfies a need and allows a user to complete a task.
• Remember: a web search does not search the web directly.
  • It searches the index created when the web pages were found and analyzed.
• Last class, we saw the basic structure of indexing. A review follows.
Inverted index construction
• Documents to be indexed: "Friends, Romans, countrymen."
• Tokenizer produces the token stream: Friends, Romans, Countrymen
• Linguistic modules (stop words, stemming, capitalization, cases, etc.) produce the modified tokens: friend, roman, countryman
• Indexer produces the inverted index:
  friend → 2, 4
  roman → 1, 2
  countryman → 13, 16
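The first two stages of the pipeline can be sketched in a few lines of Python. This is only an illustration, not the code used in class: the function names, the tiny stop list, and the crude suffix rules standing in for a real stemmer are all assumptions made for this example.

```python
import re

# Tiny illustrative stop list (assumption); real systems use much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}

def tokenize(text):
    """Tokenizer: split raw text into a stream of word tokens."""
    return re.findall(r"[A-Za-z']+", text)

def normalize(token):
    """Stand-in for the linguistic modules: case folding plus two crude suffix rules."""
    token = token.lower()
    if token.endswith("men"):                    # countrymen -> countryman
        return token[:-3] + "man"
    if token.endswith("s") and len(token) > 3:   # friends -> friend
        return token[:-1]
    return token

def modified_tokens(text):
    """Drop stop words, then normalize what is left."""
    return [normalize(t) for t in tokenize(text) if t.lower() not in STOP_WORDS]

print(modified_tokens("Friends, Romans, countrymen."))
# ['friend', 'roman', 'countryman']
```

A production system would replace `normalize` with a proper stemmer or lemmatizer; the two suffix rules here only cover the words in this example.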
Indexer steps: Token sequence
• Sequence of (modified token, document ID) pairs.
• Initially, all the tokens from document 1, then all the tokens from document 2, etc., without regard for duplication.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
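As a rough sketch of this step, the two example documents can be turned into a list of (token, docID) pairs as follows. The only "modification" applied here is lowercasing, which is an assumption for the sake of a short example; the function name `token_sequence` is likewise made up.

```python
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def token_sequence(docs):
    """Emit (token, doc_id) pairs in document order, duplicates included."""
    pairs = []
    for doc_id in sorted(docs):
        # Lowercasing is the only normalization in this sketch (assumption).
        for token in re.findall(r"[a-z']+", docs[doc_id].lower()):
            pairs.append((token, doc_id))
    return pairs

pairs = token_sequence(docs)
print(pairs[:5])
# [('i', 1), ('did', 1), ('enact', 1), ('julius', 1), ('caesar', 1)]
```

Note that repeated tokens such as ("killed", 1) and ("caesar", 2) appear twice at this stage; duplicates are only merged later.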
Indexer steps: Sort
• Sort by terms
• Then by docID within each term
This is the core indexing step.
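In Python, sorting a list of (term, docID) tuples gives exactly this order, because tuples compare element by element: first by term, then by docID. The short pair list below is a hand-picked subset of the example documents, chosen only to keep the sketch small.

```python
# A few (term, doc_id) pairs in the order they were read (illustrative subset).
pairs = [
    ("i", 1), ("did", 1), ("enact", 1), ("julius", 1), ("caesar", 1),
    ("brutus", 1), ("caesar", 2), ("brutus", 2),
]

# Tuples sort lexicographically: by term first, then by doc_id within a term.
sorted_pairs = sorted(pairs)

for term, doc_id in sorted_pairs:
    print(term, doc_id)
```

After sorting, all entries for the same term sit next to each other, which is what makes the merge into postings lists straightforward.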
Indexer steps: Dictionary & Postings
• Multiple entries for a term in a single document are merged.
• The result is split into a Dictionary and Postings.
• Document frequency information is added.
  • Dictionary: each term with the number of documents in which the term appears.
  • Postings: the IDs of the documents that contain the term.
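One possible way to perform this step on an already sorted pair list is sketched below. The two-dict layout (`dictionary` for document frequencies, `postings` for docID lists) and the function name `build_index` are assumptions chosen for clarity, not a prescribed data structure.

```python
from itertools import groupby

# Sorted (term, doc_id) pairs; duplicates within a document are still present.
sorted_pairs = [
    ("brutus", 1), ("brutus", 2),
    ("caesar", 1), ("caesar", 2), ("caesar", 2),
    ("killed", 1), ("killed", 1),
]

def build_index(sorted_pairs):
    """Merge duplicates and split the result into dictionary and postings."""
    dictionary = {}  # term -> document frequency
    postings = {}    # term -> sorted list of doc IDs containing the term
    # groupby relies on equal terms being adjacent, which sorting guarantees.
    for term, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        doc_ids = sorted({doc_id for _, doc_id in group})  # set merges duplicates
        postings[term] = doc_ids
        dictionary[term] = len(doc_ids)
    return dictionary, postings

dictionary, postings = build_index(sorted_pairs)
print(dictionary)  # {'brutus': 2, 'caesar': 2, 'killed': 1}
print(postings)    # {'brutus': [1, 2], 'caesar': [1, 2], 'killed': [1]}
```

Here "caesar" occurs twice in document 2 but contributes a single posting, and its document frequency is 2 because it appears in two distinct documents.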
Spot check
• Complete the indexing for the following two "documents."
• Of course, the examples have to be very small to be manageable. Imagine that you are indexing the entire news stories.
• Construct the charts as seen on the previous slide.
• Put your solution in the Blackboard discussion board "Indexing – Spot Check 1." You will find it on the content homepage.
Document 1: Pearson and Google Jump Into Learning Management With a New, Free System
Document 2: Pearson adds free learning management tools to Google Apps for Education