This project builds an inverted file: it tokenizes text documents and records every position at which each token occurs. Tokens are recognized by their admissible symbols rather than by splitting on delimiters, and each token is paired with the list of its positions in the text. The result is stored in an Oracle database in which the first column holds the sorted tokens, the second column holds the document name, and the remaining columns hold the token's locations. The token database is organized for both simple and complex data.
Create an Inverted File
• Tokenize a text document, and attach to each token the list of locations at which that token appears
• Sort the results and store them in an Oracle database
Tokenizer
• Define the admissible symbols for a token; we will not use delimiters to capture the tokens.
• Keep a record of the position of each token
Tokenizer Example:
• Document1: He is a dumb teacher Dumb! Dumb! and Dumb!
• Document2: He is a great council. His advices are really great. He truly helps.
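The tokenizer described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not list the admissible symbols, so the pattern below (a run of letters or digits, or a single punctuation mark) is a guess that happens to reproduce the positions in the example. The names `TOKEN_PATTERN` and `tokenize` are hypothetical.

```python
import re

# Assumption: a token is either a run of letters/digits or one punctuation mark.
# Tokens are matched by their admissible symbols; the text is never split on delimiters.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    """Return (token, position) pairs, where position is the 1-based token ordinal."""
    return [
        (match.group(), position)
        for position, match in enumerate(TOKEN_PATTERN.finditer(text), start=1)
    ]


document1 = "He is a dumb teacher Dumb! Dumb! and Dumb!"
document2 = "He is a great council. His advices are really great. He truly helps."

for token, position in tokenize(document1):
    print(token, position)
```

Because punctuation is matched as its own token, each "!" in Document1 receives a position just like a word does.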
Tokenizer Example: Inverted File for document 1 (one entry per occurrence)
• ! 12
• ! 7
• ! 9
• a 3
• and 10
• dumb 4
• Dumb 6
• Dumb 8
• Dumb 11
• He 1
• is 2
• teacher 5
Tokenizer: Inverted File for document 1 (occurrences of the same token merged)
• ! 7, 9, 12 (frequency = 3/12)
• a 3
• and 10
• Dumb 4, 6, 8, 11
• He 1
• is 2
• teacher 5
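The merged listing above can be produced by grouping positions per token. The sketch below is a hedged illustration, not the author's implementation: it assumes the same token rule as before (letters/digits or a single punctuation mark) and assumes that tokens are case-folded, which is inferred from "dumb" and "Dumb" sharing one entry. The function name `build_inverted_file` is hypothetical, and merged tokens are reported in lowercase.

```python
import re
from collections import defaultdict


def build_inverted_file(text):
    """Return (entries, total): entries is a sorted list of (token, positions)."""
    # Assumed token rule: a run of letters/digits, or a single punctuation mark.
    tokens = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)
    positions = defaultdict(list)
    for position, token in enumerate(tokens, start=1):
        # Assumption: case is folded so that "dumb" and "Dumb" share one entry.
        positions[token.lower()].append(position)
    entries = [(token, positions[token]) for token in sorted(positions)]
    return entries, len(tokens)


document1 = "He is a dumb teacher Dumb! Dumb! and Dumb!"
entries, total = build_inverted_file(document1)
for token, locations in entries:
    # Frequency is the number of occurrences over the total token count.
    print(token, locations, f"frequency = {len(locations)}/{total}")
```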
Tokenizer: Inverted File for document 2
• . (period) 6, 12
• a 3
• advices 8
• are 9
• council 5
• great 4, 11
• He 1, 13
• His 7
• is 2
• really 10
Create a Token Database
Organize an inverted file for the following documents
• For simple data
• For complex data
Token database
• Store the tokens in the database
• The first column holds the sorted tokens
• The second column holds the document names
• The rest of each tuple holds the locations of the token
• This is the so-called inverted list
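The table layout above might look as follows. The slides target Oracle, but this sketch substitutes Python's built-in `sqlite3` module with an in-memory database so that it runs without a server; the table name `inverted_list`, the column names, and the choice to pack the locations into one comma-separated text column are all assumptions made for illustration. The token rule and case folding are the same assumptions as in the tokenizer example.

```python
import re
import sqlite3

# Assumed token rule: a run of letters/digits, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")

documents = {
    "Document1": "He is a dumb teacher Dumb! Dumb! and Dumb!",
    "Document2": "He is a great council. His advices are really great. He truly helps.",
}

# SQLite stands in for Oracle here; the schema is hypothetical.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE inverted_list ("
    "token TEXT, document TEXT, locations TEXT, "
    "PRIMARY KEY (token, document))"
)

for name, text in documents.items():
    postings = {}
    for position, match in enumerate(TOKEN_PATTERN.finditer(text), start=1):
        # Assumption: tokens are case-folded before they are stored.
        postings.setdefault(match.group().lower(), []).append(position)
    connection.executemany(
        "INSERT INTO inverted_list VALUES (?, ?, ?)",
        [(token, name, ",".join(map(str, locs))) for token, locs in postings.items()],
    )

# Reading the table back ordered by token gives the sorted first column.
rows = connection.execute(
    "SELECT token, document, locations FROM inverted_list ORDER BY token, document"
).fetchall()
for row in rows:
    print(row)
```

A design with one row per (token, document, location) would be easier to query by position; the single packed column is used here only because it mirrors the "rest of the tuple" wording on the slide.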