Creating an Inverted File Database for Efficient Text Tokenization and Location Tracking

Project Description 2: Inverted List Database

Create an Inverted File • Tokenize a text document, and attach to each token a list of the locations where that token appears • Sort and store these results in an Oracle database

Tokenizer • Define the admissible symbols for a token; we will not use delimiters to capture tokens • Keep a record of the position of each token
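The tokenizer rule above can be sketched in Python. This is only an illustration: the slides do not list the admissible symbols, so the pattern below assumes a token is either a run of letters and digits or a single punctuation mark, and the names `TOKEN_PATTERN` and `tokenize` are hypothetical.

```python
import re

# Assumption: a token is a run of letters/digits, or one punctuation mark.
# Tokens are matched directly by this pattern; the text is never split on
# delimiters, and whitespace is simply not matched.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    """Return (token, position) pairs, with positions counted from 1."""
    matches = TOKEN_PATTERN.finditer(text)
    return [(m.group(), position) for position, m in enumerate(matches, start=1)]


pairs = tokenize("He is a dumb teacher Dumb! Dumb! and Dumb!")
for token, position in pairs:
    print(token, position)
```

Under this assumed pattern each "!" counts as its own token, so the sample sentence yields 12 tokens.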

Tokenizer Example • Document1: He is a dumb teacher Dumb! Dumb! and Dumb! • Document2: He is a great council. His advices are really great. He truly helps.

Tokenizer Example: Inverted File for document 1: ! 7; ! 9; ! 12; a 3; and 10

Tokenizer Example: Inverted File for document 1 (continued): dumb 4; Dumb 6; Dumb 8; Dumb 11; He 1; is 2; teacher 5

Tokenizer Inverted File for document 1: ! 7, 9, 12 (frequency = 3/12); a 3; and 10; Dumb 4, 6, 8, 11; He 1; is 2; teacher 5
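A minimal Python sketch of how such an inverted file could be built for document 1. It rests on two assumptions that the slides do not state explicitly: tokens are letter/digit runs or single punctuation marks, and tokens are grouped case-insensitively (which is what merging "dumb" and "Dumb" into one entry suggests). The function name `build_inverted_file` is hypothetical.

```python
import re
from collections import defaultdict

# Assumption: same admissible-symbol rule as the tokenizer slides imply.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def build_inverted_file(text):
    """Map each token to the sorted list of positions where it appears."""
    tokens = TOKEN_PATTERN.findall(text)
    postings = defaultdict(list)
    for position, token in enumerate(tokens, start=1):
        # Assumption: case is folded so "dumb" and "Dumb" share one entry.
        postings[token.lower()].append(position)
    # Sort by token so the listing reads like the slide.
    return dict(sorted(postings.items())), len(tokens)


index, total = build_inverted_file("He is a dumb teacher Dumb! Dumb! and Dumb!")
for token, locations in index.items():
    # Frequency is the number of occurrences over the total token count.
    print(token, locations, f"frequency = {len(locations)}/{total}")
```

Because positions are appended in reading order, each location list is already sorted and no separate sort of the positions is needed.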

Tokenizer Inverted File for document 2: . (period) 6, 12, 16; a 3; advices 8; are 9; council 5; great 4, 11; He 1, 13; helps 15; His 7; is 2; really 10; truly 14

Create a Token Database • Organize an inverted file for the following documents • For simple data • For complex data

Token database • Store the tokens in the database • The first column holds the sorted tokens • The second column holds the document names • The rest of each tuple keeps the locations of the token • This is the so-called inverted list
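The table layout described above could look roughly like the following sketch. The project targets Oracle, but this example uses Python's built-in `sqlite3` module as a stand-in so that it runs without a database server. The table name `inverted_list`, its column names, and the choice to pack the locations into one comma-separated text column are all assumptions made for illustration.

```python
import re
import sqlite3
from collections import defaultdict

# Assumption: tokens are letter/digit runs or single punctuation marks.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")

documents = {
    "Document1": "He is a dumb teacher Dumb! Dumb! and Dumb!",
    "Document2": "He is a great council. His advices are really great. He truly helps.",
}

# SQLite in memory stands in for the Oracle database named in the project.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inverted_list ("
    " token TEXT, document TEXT, locations TEXT,"
    " PRIMARY KEY (token, document))"
)

for name, text in documents.items():
    postings = defaultdict(list)
    for position, token in enumerate(TOKEN_PATTERN.findall(text), start=1):
        postings[token.lower()].append(position)  # assumed case folding
    conn.executemany(
        "INSERT INTO inverted_list VALUES (?, ?, ?)",
        [(tok, name, ",".join(map(str, locs))) for tok, locs in postings.items()],
    )

# Read the rows back sorted by token, then by document name.
rows = conn.execute(
    "SELECT token, document, locations FROM inverted_list ORDER BY token, document"
).fetchall()
for row in rows:
    print(row)
```

Packing the locations into one column keeps the sketch short; a design with one row per (token, document, location) would be easier to query by position.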
