This project creates clusters of related concepts, each representing a specific knowledge domain, using a knowledge-based vector space model (VSM). With Wikipedia as the knowledge domain, the project contrasts traditional Bag of Words (BOW) indexing with knowledge-based indexing. The implementation involves constructing knowledge-based vectors for documents using term similarity measures, extracting similar documents from Wikipedia, and applying a document clustering algorithm built on Wikipedia's structure. The tools used are Wikipedia database dumps (MySQL), the JWPL API, and the Lucene API.
Information Retrieval Project
Creation of clusters of concepts that represent a domain corpus
Background
• Vector Space Model.
• Knowledge-based Vector Space Model.
• Wikipedia as a knowledge domain.
• BOW indexing versus knowledge-based indexing.
• Indexing Wikipedia.
• Wikipedia-based concept clustering.
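To make the BOW baseline concrete, here is a minimal sketch of Bag of Words vectors compared with cosine similarity. The function names and the toy sentences are illustrative assumptions, not part of the project code; a real index would be built with Lucene rather than with whitespace tokenization.

```python
import math
from collections import Counter


def bow_vector(text):
    """Bag-of-words vector: term -> raw count (lowercased whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents (hypothetical): d1 and d2 are about the same topic
# but share no surface terms; d1 and d3 share two terms.
d1 = bow_vector("car engine repair")
d2 = bow_vector("automobile motor maintenance")
d3 = bow_vector("car engine maintenance")

sim_12 = cosine(d1, d2)
sim_13 = cosine(d1, d3)
```

Here `sim_12` is zero even though both documents concern vehicles, because BOW only matches identical terms. This vocabulary mismatch is the gap that knowledge-based indexing is meant to address.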
Knowledge-based VSM for Text Clustering
• Problem definition:
  • Create clusters of related concepts, where each cluster represents a specific knowledge domain.
  • Build knowledge-based vectors for the documents in a given corpus, based on term similarity measures within each document.
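One possible way to read "knowledge-based vectors based on term similarity measures" is sketched below: each concept's weight in a document is the sum of the document's term frequencies, each scaled by that term's relatedness to the concept. This weighting scheme, the `RELATEDNESS` table, and the function names are assumptions for illustration only; in the project, the relatedness scores would come from the provided term-term relatedness code, whose exact formula is not stated here.

```python
from collections import Counter

# Hypothetical symmetric relatedness scores in [0, 1].
# Pairs that are not listed are treated as unrelated (0.0).
RELATEDNESS = {
    frozenset(["car", "automobile"]): 0.9,
    frozenset(["engine", "motor"]): 0.8,
}


def relatedness(a, b):
    """Look up the toy relatedness of two terms; identical terms score 1.0."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset([a, b]), 0.0)


def knowledge_vector(text, vocabulary):
    """Weight each concept by the relatedness-scaled counts of the document's terms."""
    counts = Counter(text.lower().split())
    return {
        concept: sum(tf * relatedness(term, concept) for term, tf in counts.items())
        for concept in vocabulary
    }


vocab = ["car", "automobile", "engine", "motor"]
kv = knowledge_vector("car engine", vocab)
```

Under these assumptions, a document that mentions only "car" and "engine" still receives non-zero weight on "automobile" and "motor", so it can match documents that use those related terms instead.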
Given:
• Wikipedia index.
• Working code for knowledge-based corpus indexes.
• Working code to define term-term relatedness weights.
• Working similarity code to extract a document from Wikipedia that is similar to an existing one.
• Algorithm for document clustering based on the Wikipedia structure.
Email me at:
• eea7236@louisiana.edu
• Elshaimaa.ali@hotmail.com
Required to implement:
• Build a knowledge-based VSM index for the documents in two different domain corpora, using the given term similarity code.
• Implement the given Wikipedia structure-based clustering algorithm.
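The Wikipedia structure-based clustering algorithm itself is supplied with the project and is not reproduced here. As a stand-in that only shows how document vectors could feed a clustering step, the sketch below uses a simple single-pass threshold clustering over cosine similarity. The algorithm choice, the `threshold` value, and the toy vectors are all assumptions, not the given method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def threshold_cluster(vectors, threshold=0.5):
    """Single-pass clustering (stand-in, not the given algorithm).

    Each document joins the first cluster whose seed vector is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []  # list of (seed_vector, member_names)
    for name, vec in vectors.items():
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]


# Hypothetical knowledge-based vectors for three documents.
docs = {
    "d1": {"car": 1.0, "engine": 1.0},
    "d2": {"car": 1.0, "motor": 0.8, "engine": 0.5},
    "d3": {"protein": 1.0, "cell": 1.0},
}
clusters = threshold_cluster(docs)
```

With these toy vectors, the two vehicle documents fall into one cluster and the biology document into another. The result of a single-pass scheme depends on document order, which is one reason a structure-based algorithm may be preferable.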
Tools that will be used
• Wikipedia database dumps (MySQL database).
• JWPL API to access the Wikipedia database dumps.
• Lucene API to build indexes.
• Assistance and code will be provided to help with using the APIs.