
Creating an Inverted File Database for Efficient Text Tokenization and Location Tracking

This project constructs an inverted file database that tokenizes text documents and records every occurrence of each token. Each document is tokenized, and each token is associated with a list of its locations within the text. Tokens are recognized by their admissible symbols rather than by splitting on delimiters. The final output is stored in an Oracle database, with the sorted tokens in the first column, document names in the second, and the token locations in the remaining columns. The project covers both simple and complex data.

liza

Presentation Transcript


  1. Project Description 2: Inverted List Database

  2. Create an Inverted File • Tokenize a text document and attach to each token a list of the locations where that token appears • Sort and store the results in an Oracle database

  3. Tokenizer • Define the admissible symbols for a token; we will not use delimiters to capture tokens • Keep a record of the position of each token
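The tokenizer described on this slide can be sketched as follows. This is a minimal illustration, not the project's actual code: the slides do not list the admissible symbols, so the pattern below (runs of letters, digits, and apostrophes, or a single punctuation mark) and the names `TOKEN_PATTERN` and `tokenize` are assumptions. Tokens are matched directly instead of splitting the text on delimiters, and positions are counted from 1.

```python
import re

# Assumption: a token is a run of letters, digits, or apostrophes,
# or else one single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']")


def tokenize(text):
    """Return a list of (token, position) pairs; positions start at 1."""
    matches = TOKEN_PATTERN.finditer(text)
    return [(m.group(), pos) for pos, m in enumerate(matches, start=1)]


doc1 = "He is a dumb teacher Dumb! Dumb! and Dumb!"
for token, position in tokenize(doc1):
    print(token, position)
```

Matching admissible symbols, rather than splitting on whitespace, is what lets punctuation such as "!" count as a token with its own position.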

  4. Tokenizer Example: Document1: He is a dumb teacher Dumb! Dumb! and Dumb! Document2: He is a great council. His advices are really great. He truly helps.

  5. Tokenizer Inverted File for document 1 (continued): dumb 4; Dumb 6; Dumb 8; Dumb 11; He 1; is 2; teacher 5

  6. Tokenizer Example: Inverted File for document 1: ! 12; ! 7; ! 9; a 3; and 10

  7. Tokenizer Inverted File for document 1: ! 7, 9, 12 (frequency = 3/12); a 3; and 10; Dumb 4, 6, 8, 11; He 1; is 2; teacher 5
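The merged listing above groups every position of a token into one entry and reports a frequency. A minimal sketch of that grouping step is shown here, under stated assumptions: the token pattern is a guess at the admissible symbols, `build_inverted_file` is a hypothetical name, and tokens are folded to lowercase because the slide lists "dumb" and "Dumb" under a single entry.

```python
import re
from collections import defaultdict

# Assumption: same admissible-symbol pattern as a simple word/punctuation tokenizer.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']")


def build_inverted_file(text):
    """Return (positions_by_token, total_token_count) for one document.

    Tokens are lowercased so that "dumb" and "Dumb" share one entry.
    """
    positions = defaultdict(list)
    total = 0
    for total, match in enumerate(TOKEN_PATTERN.finditer(text), start=1):
        positions[match.group().lower()].append(total)
    return dict(positions), total


doc1 = "He is a dumb teacher Dumb! Dumb! and Dumb!"
inverted, total = build_inverted_file(doc1)
for token in sorted(inverted):
    locations = inverted[token]
    print(token, locations, f"frequency = {len(locations)}/{total}")
```

Sorting the lowercased tokens reproduces the slide's order, with "!" first and "teacher" last.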

  8. Tokenizer Inverted File for document 2: . (period) 6, 12; a 3; advices 8; are 9; council 5; great 4, 11; He 1, 13; His 7; is 2; really 10

  9. Create a Token Database: Organize an inverted file for the following documents • For simple data • For complex data

  10. Token Database • Store the tokens in the database • The first column holds the sorted tokens • The second column holds the document names • The rest of each tuple keeps the locations of the token • This is the so-called inverted list
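The tuple layout described on this slide can be sketched in memory as follows. This is only an illustration of the row shape (token, document name, then locations); loading the rows into Oracle is not shown, and the token pattern, the lowercase folding, and the name `token_rows` are assumptions rather than part of the original project.

```python
import re

# Assumption: a token is a run of letters, digits, or apostrophes,
# or else one single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']")


def token_rows(documents):
    """Build (token, document_name, loc1, loc2, ...) tuples sorted by token."""
    index = {}
    for name, text in documents.items():
        matches = TOKEN_PATTERN.finditer(text)
        for position, match in enumerate(matches, start=1):
            key = (match.group().lower(), name)
            index.setdefault(key, []).append(position)
    # Sorting by (token, document) gives the sorted first column.
    return [(token, name, *locs) for (token, name), locs in sorted(index.items())]


documents = {
    "Document1": "He is a dumb teacher Dumb! Dumb! and Dumb!",
    "Document2": "He is a great council. His advices are really great. He truly helps.",
}
for row in token_rows(documents):
    print(row)
```

Because a token can occur any number of times, the rows have different lengths; a relational table would need either a fixed maximum number of location columns or one row per location, a design choice the slides leave open.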
