Text Mining: Fast Phrase-based Text Indexing and Matching

LORNET Theme 4
Text Mining: Fast Phrase-based Text Indexing and Matching
Khaled Hammouda, Ph.D. Student
PAMI Research Group, University of Waterloo
Waterloo, Ontario, Canada

The Problem
[Diagram: documents from the Web / LOR (text documents, web documents, discussion articles, ...) pass through Automatic Clustering/Grouping into topic groups such as Pattern Recognition, Programming Languages, Data Mining, and Database Systems]
How do we judge similarity?

Clustering Documents
• Group similar documents together
• Maximize intra-cluster similarity
• Minimize inter-cluster similarity
• Need to accurately calculate document similarity

Document Similarity
How similar is each document to every other document?
Very time consuming: comparing all pairs of n documents is O(n^2).
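To make the quadratic cost concrete, the following minimal sketch applies a similarity function to every unordered pair of documents. The names `pairwise` and `shared_words` are hypothetical, and counting shared words is only a placeholder, not the measure used in the talk.

```python
from itertools import combinations

def pairwise(docs, sim):
    """Apply `sim` to every unordered pair of documents."""
    return {(i, j): sim(docs[i], docs[j])
            for i, j in combinations(range(len(docs)), 2)}

def shared_words(a, b):
    """Placeholder similarity (an assumption): number of shared words."""
    return len(set(a.split()) & set(b.split()))

docs = ["river rafting trips", "river fishing trips", "database systems"]
scores = pairwise(docs, shared_words)

# The number of comparisons is n * (n - 1) / 2, which grows as O(n^2).
n = len(docs)
num_comparisons = n * (n - 1) // 2
```

Doubling the collection size roughly quadruples the number of comparisons, which is why the per-pair matching step has to be cheap.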

Document Similarity
• Information-theoretic measure (Dekang Lin, 1998): [formula not preserved in the transcript]
• How do we intersect every pair of documents without sacrificing efficiency?
• What features should we intersect?
  • Words
  • Phrases
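The choice between words and phrases as the features to intersect can be illustrated with a small sketch. It uses Jaccard overlap as a stand-in (not the information-theoretic measure cited above) and word bigrams as a simple approximation of phrases; both choices, and all function names, are assumptions made for illustration.

```python
def words(text):
    """Word features: the set of distinct terms."""
    return set(text.lower().split())

def bigrams(text):
    """Phrase features (approximation): the set of adjacent word pairs."""
    toks = text.lower().split()
    return set(zip(toks, toks[1:]))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

d1 = "river rafting vacation plan"
d2 = "plan vacation rafting river"

word_sim = jaccard(words(d1), words(d2))        # identical vocabulary
phrase_sim = jaccard(bigrams(d1), bigrams(d2))  # no adjacent pair in common
```

The two documents are indistinguishable when only words are intersected, while the phrase features show that they share no local context.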

Fast Phrase-based Document Indexing and Matching
• Document Index Graph Structure
  • A model based on a digraph representation of the phrases in the document set
  • Nodes correspond to unique terms
  • Edges maintain phrase representation
  • A phrase is a path in the graph
  • The model is an inverted list (terms → documents)
  • Nodes carry term weight information for each document in which they appear
  • Shared phrases can be matched efficiently
• Phrase-based Features
  • Phrases are a more informative feature than individual words → local context matching
  • Represent sentences rather than words
  • Facilitate phrase matching between documents
  • Achieve accurate pair-wise document similarity
  • Avoid the high dimensionality of the vector space model
  • Allow incremental processing
[Figure: Document Index Graph]
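The structure described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the author's implementation: the class and method names are hypothetical, the raw term count stands in for the per-document term weight, and sentences are split on whitespace.

```python
from collections import defaultdict

class DocumentIndexGraph:
    """Simplified sketch: nodes are unique terms, edges link consecutive terms."""

    def __init__(self):
        # Node table: term -> {doc_id: occurrence count}. The raw count is a
        # stand-in (assumption) for the per-document term weight.
        self.nodes = defaultdict(lambda: defaultdict(int))
        # Edge table: (term, next_term) -> {doc_id: [(sentence_no, position)]}.
        self.edges = defaultdict(lambda: defaultdict(list))

    def add_document(self, doc_id, sentences):
        """Index one document and return the phrases it shares with earlier ones.

        Returns {other_doc_id: set of phrases}, each phrase a tuple of terms.
        Every growing prefix (two or more terms) of a matched run is recorded.
        """
        shared = defaultdict(set)
        for s_no, sentence in enumerate(sentences):
            terms = sentence.lower().split()
            for term in terms:
                self.nodes[term][doc_id] += 1
            # Maps (other_doc, sentence_no, position) of the previously matched
            # edge to the index in `terms` where that matching run started.
            active = {}
            for i in range(len(terms) - 1):
                edge = (terms[i], terms[i + 1])
                next_active = {}
                for other, occurrences in self.edges[edge].items():
                    if other == doc_id:
                        continue
                    for o_s, o_p in occurrences:
                        # Extend a run only if the previous edge matched at the
                        # preceding position of the same sentence.
                        start = active.get((other, o_s, o_p - 1), i)
                        next_active[(other, o_s, o_p)] = start
                        shared[other].add(tuple(terms[start:i + 2]))
                active = next_active
                self.edges[edge][doc_id].append((s_no, i))
        return dict(shared)

graph = DocumentIndexGraph()
first = graph.add_document("d1", ["river rafting trips", "mild river rafting"])
second = graph.add_document("d2", ["wild river rafting trips", "fishing trips"])
```

Because matching happens while a document is inserted, each new document is compared against everything already indexed in a single pass, which is what makes incremental processing possible.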

Document Index Graph
[Figure: example graph; legible labels: "river", "vacation plan", "river rafting", "trips"]

Phrase-based Document Indexing
[Figures: Document Index Graph (size scalability); Document Index Graph (internal structure); Document Index Graph (time performance)]

Effect of using phrase-based similarity over individual words
[Figures: Effect of using phrase similarity (F-measure); Effect of using phrase similarity (Entropy)]
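For readers unfamiliar with the two quality measures named in these figures, here is a minimal sketch of their commonly used definitions for clustering evaluation. The slides do not say which variant was used, so the formulas, the base-2 logarithm, and the function names are assumptions.

```python
import math
from collections import Counter

def clustering_f_measure(labels, clusters):
    """Class-size-weighted best F-score over clusters (higher is better)."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = joint[(c, k)]
            if n_ck == 0:
                continue
            precision, recall = n_ck / n_k, n_ck / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total

def clustering_entropy(labels, clusters):
    """Cluster-size-weighted entropy of class labels (lower is better)."""
    n = len(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for (c, k), n_ck in joint.items():
        p = n_ck / cluster_sizes[k]
        total -= (cluster_sizes[k] / n) * p * math.log2(p)
    return total

labels = ["a", "a", "b", "b"]        # true classes
perfect = [0, 0, 1, 1]               # clusters that match the classes
mixed = [0, 1, 0, 1]                 # clusters that split every class
```

A clustering that reproduces the classes exactly scores an F-measure of 1 and an entropy of 0; the fully mixed assignment above scores 0.5 and 1 bit respectively.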

Applications
• Grouping search engine results on the fly (incremental processing)
• Creating taxonomies of documents (Yahoo! and Open Directory style)
• Implementing “Find Related” or “Find Similar” features of information retrieval systems
• Automatic generation of descriptive phrases about a set of documents (i.e., labeling clusters)
• Detecting plagiarism

Collaboration
• Provide data mining services (primarily text mining) for other groups
• Opportunity for collaboration with U of Saskatchewan:
  • I-Help Discussion System
  • Course Delivery Tools
• Others are welcome

Questions
• Instant Messaging
  • MSN Messenger: lornet_uw@hotmail.com
• E-mail
  • lornet@pami.uwaterloo.ca
