CSE 450 – Web Mining Seminar, Professor Brian D. Davison, Fall 2005
A Project Presentation on Identifying most descriptive terms, by Osama Ahmed Khan, 12/16/2005
Problem
• Finding the most descriptive terms for a particular document in a collection of documents (web pages)
• Estimating the best description for a new location in a higher-dimensional space
Terminology
• Term: an adjective-noun pair (bi-gram), denoted ti
• Document: the content of a page, denoted di
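The term definition above can be illustrated with a small sketch. It assumes the text has already been part-of-speech tagged with Penn Treebank style tags (JJ* for adjectives, NN* for nouns); the slides do not say which tagger or tag set was used, and the function name `adjective_noun_bigrams` is hypothetical.

```python
def adjective_noun_bigrams(tagged_tokens):
    """Return every adjacent (adjective, noun) pair as a lowercase 'adj noun' term.

    tagged_tokens is a list of (word, tag) tuples. Penn Treebank tags are an
    assumption made for this sketch, not something stated in the slides.
    """
    terms = []
    for (word, tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag.startswith("JJ") and next_tag.startswith("NN"):
            terms.append(f"{word.lower()} {next_word.lower()}")
    return terms


tagged = [
    ("The", "DT"), ("quick", "JJ"), ("fox", "NN"),
    ("likes", "VBZ"), ("fresh", "JJ"), ("grapes", "NNS"),
]
terms = adjective_noun_bigrams(tagged)
print(terms)
```

Each extracted string then plays the role of one term ti in the slides that follow.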
Algorithm
• Create a 2-D matrix A (t x d) holding the frequency of each term ti in each document di
• Create a 3-D matrix B (d x t x t) holding, for each document di, the frequency with which each term ti co-occurs with every other term tj
• Sort the pairs titj of each document di in descending order of co-occurrence frequency; the highest-ranked pairs titj are the descriptive terms for that document di
• Extract the first n pairs from the sorted index of each document di, where n is supplied by the user
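The four steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's C++/TMI code: the slides do not define how co-occurrence is counted, so the sketch assumes a pair co-occurs once for every sentence that contains both terms, and it stores A and B sparsely as counters instead of dense matrices. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def term_frequencies(docs):
    """A[d][t]: how often term t occurs in document d (sparse stand-in for matrix A)."""
    return [Counter(term for sentence in doc for term in sentence) for doc in docs]


def pair_frequencies(docs):
    """B[d][(ti, tj)]: number of sentences of document d containing both terms.

    Counting per sentence is an assumption; the slides leave the window open.
    """
    result = []
    for doc in docs:
        counts = Counter()
        for sentence in doc:
            # sorted(set(...)) gives every unordered pair one canonical key
            for pair in combinations(sorted(set(sentence)), 2):
                counts[pair] += 1
        result.append(counts)
    return result


def top_pairs(pair_counts, n):
    """The n most frequent pairs; ties are broken alphabetically for determinism."""
    ranked = sorted(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pair for pair, _ in ranked[:n]]


# Each document is a list of sentences; each sentence is a list of terms.
docs = [
    [["red car", "fast engine"],
     ["red car", "fast engine", "old road"],
     ["red car", "old road"]],
    [["green tea", "hot water"], ["green tea"]],
]
A = term_frequencies(docs)
B = pair_frequencies(docs)
descriptions = [top_pairs(b, 1) for b in B]
print(descriptions)
```

Storing only the non-zero pairs keeps memory proportional to the pairs actually observed, which matters because the dense B grows as d x t x t.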
Algorithm (contd.)
• A document is represented as a point in a higher-dimensional space with t(t-1)/2 coordinates, where each dimension is a pair titj
• Any coordinate missing for a document di is assigned a value of zero
• A new document dj located in the t(t-1)/2-dimensional space is best described by using the Mahalanobis distance metric to find the document, among the other (d-1) documents, at minimum distance from dj
• A new document dj identified in the t(t-1)/2-dimensional space without its coordinates being known is best described by using a k-Nearest Neighbors approach
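A rough sketch of the distance step follows. It assumes every document is already a dense vector of pair counts, and it adds a small ridge term to the covariance diagonal because a covariance matrix estimated from few documents in many dimensions is usually singular; the ridge, the function names, and the toy vectors are assumptions of this sketch. The slides do not say how neighbours are found when coordinates are unknown, so the k-nearest-neighbour part here simply ranks documents by the same distance.

```python
import math


def covariance(rows, ridge=1e-3):
    """Sample covariance of the document vectors plus a ridge on the diagonal."""
    n, dim = len(rows), len(rows[0])
    mu = [sum(col) / n for col in zip(*rows)]
    cov = [[0.0] * dim for _ in range(dim)]
    for r in rows:
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += (r[i] - mu[i]) * (r[j] - mu[j]) / (n - 1)
    for i in range(dim):
        cov[i][i] += ridge  # keeps the matrix invertible (assumption of this sketch)
    return cov


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis(u, v, cov):
    """sqrt((u - v)^T cov^-1 (u - v)), computed without forming the inverse."""
    diff = [a - b for a, b in zip(u, v)]
    squared = sum(d * s for d, s in zip(diff, solve(cov, diff)))
    return math.sqrt(max(0.0, squared))


def nearest(new_doc, docs, k=1):
    """Indices of the k documents closest to new_doc; k=1 is the minimum-distance case."""
    cov = covariance(docs)
    order = sorted(range(len(docs)), key=lambda i: mahalanobis(new_doc, docs[i], cov))
    return order[:k]


# Toy vectors: t = 3 terms gives t(t-1)/2 = 3 pair dimensions.
docs = [[2.0, 1.0, 0.0], [2.0, 0.0, 1.0], [0.0, 3.0, 1.0], [0.0, 2.0, 2.0]]
print(nearest([0.0, 3.0, 1.0], docs, k=2))
```

The new document would then borrow its description from the top pairs of the returned neighbours.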
Dataset
• Pre-processed data provided by Xiaoguang Qi: http://wume.cse.lehigh.edu/~xiq204/topics/
Implementation
• Code
  • Text Mining Infrastructure (TMI) http://hddi.cse.lehigh.edu
  • C++
• Metrics
  • Precision
  • Recall
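For reference, the two metrics can be computed as below. This sketch assumes the evaluation compares the set of extracted terms for a document against a set of reference (relevant) terms; the slides do not state what the reference set was, and the example terms are made up.

```python
def precision_recall(extracted, relevant):
    """Precision = hits / extracted terms, recall = hits / relevant terms."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


p, r = precision_recall(
    ["red car", "old road", "fast engine", "blue sky"],  # terms the system returned
    ["red car", "fast engine", "long trip"],             # hypothetical reference terms
)
print(p, r)
```

Here two of the four extracted terms are relevant, and two of the three relevant terms were found.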
Applications
• Topic detection through search engines
• Finding document representations in different domains
Open Problems
• Finding an approximate transformation (if any exists) from the t-dimensional space to a new k-dimensional space, where k = t(t-1)/2, when the set of documents D is also represented in the k-dimensional space
• Estimating the best description of a document in either of the two spaces when its coordinates in one of the spaces are missing