A Cross-Collection Mixture Model for Comparative Text Mining

ChengXiang Zhai (1), Atulya Velivelli (2), Bei Yu (3)
(1) Department of Computer Science; (2) Department of Electrical and Computer Engineering; (3) Graduate School of Library and Information Science
University of Illinois, Urbana-Champaign, U.S.A.

Motivation
• Many applications involve a comparative analysis of several comparable text collections, e.g.,
  • Given news articles from different sources (about the same event), can we extract what is common to all the sources and what is unique to one specific source?
  • Given customer reviews about 3 different brands of laptops, can we extract the common themes (e.g., battery life, speed, warranty) and compare the three brands in terms of each common theme?
  • Given web sites about companies selling similar products, can we analyze the strengths and weaknesses of each company?
• Existing work in text mining has conceptually focused on a single collection of text and is thus inadequate for comparative text analysis
• We aim to develop methods for comparing multiple collections of text and performing comparative text mining

Comparative Text Mining (CTM)
Problem definition:
• Given a comparable set of text collections
• Discover & analyze their common and unique properties
[Figure: a pool of text collections C1, C2, ..., Ck is mapped to common themes and to C1-specific, C2-specific, ..., Ck-specific themes.]

Example: Summarizing Customer Reviews
[Figure: IBM, Apple, and Dell laptop reviews, shown with the ideal results from comparative text mining.]

A More Realistic Setup of CTM
[Figure: IBM, Apple, and Dell laptop reviews, shown with a common word distribution and collection-specific word distributions.]

A Basic Approach: Simple Clustering
• Pool all documents together and perform clustering
• Hopefully, some clusters reflect common themes and others reflect specific themes
• However, we can’t “force” a common theme to cover all collections
[Figure: the pooled documents are modeled with a background distribution θB and theme distributions θ1, θ2, θ3, θ4, ...]

Improved Clustering: Cross-Collection Mixture Models
• Explicitly distinguish and model common themes and specific themes
• Fit a mixture model to the text data
• Estimate parameters using EM
• Clusters are more meaningful
[Figure: collections C1, C2, ..., Cm share a background distribution θB. Each theme j = 1, ..., k has a common distribution θj and collection-specific distributions θj,1 (specific to C1), θj,2 (specific to C2), ..., θj,m (specific to Cm).]

Details of the Mixture Model
[Figure: "generating" word w in document d in collection Ci. With probability λB the word comes from the background distribution θB, which accounts for noise (common non-informative words). With probability 1-λB a theme j = 1, ..., k is chosen with weight πd,j; the word then comes from the common distribution θj with probability λC, or from the collection-specific distribution θj,i with probability 1-λC.]
Parameters:
• λB = noise level (manually set)
• λC = common-specific tradeoff (manually set)
• θ's and π's are estimated with maximum likelihood
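The generative process on this slide can be turned into a small EM routine. The following is a minimal sketch, not the authors' implementation: the slides give no update formulas, so the E- and M-steps here are derived directly from the mixture as described. The function name, the default values of `lam_B` (noise level) and `lam_C` (common-specific tradeoff), the `eps` smoothing, and fixing the background distribution to pooled word frequencies are all assumptions made for illustration.

```python
import numpy as np

def fit_cross_collection_mixture(counts, coll, k, lam_B=0.9, lam_C=0.5,
                                 n_iter=50, seed=0, eps=1e-12):
    """EM sketch. counts: (D, V) term counts; coll: (D,) collection index per doc."""
    counts = np.asarray(counts, dtype=float)
    coll = np.asarray(coll)
    D, V = counts.shape
    m = int(coll.max()) + 1
    rng = np.random.default_rng(seed)

    # Assumption: background model fixed to pooled relative word frequencies.
    theta_B = counts.sum(axis=0) / counts.sum()
    theta = rng.dirichlet(np.ones(V), size=k)            # common themes, (k, V)
    theta_spec = rng.dirichlet(np.ones(V), size=(k, m))  # specific themes, (k, m, V)
    pi = rng.dirichlet(np.ones(k), size=D)               # theme weights per doc, (D, k)
    log_lik = []

    for _ in range(n_iter):
        common = lam_C * theta[None, :, :]                                  # (1, k, V)
        specific = (1 - lam_C) * theta_spec[:, coll, :].transpose(1, 0, 2)  # (D, k, V)
        theme = common + specific                                           # (D, k, V)
        weighted = pi[:, :, None] * theme
        mix = weighted.sum(axis=1)                                          # (D, V)
        total = lam_B * theta_B[None, :] + (1 - lam_B) * mix
        # Log-likelihood of the parameters used in this iteration.
        log_lik.append(float((counts * np.log(total + eps)).sum()))

        # E-step: posteriors of "not background", theme j, and "common".
        p_not_B = (1 - lam_B) * mix / (total + eps)
        p_theme = weighted / (mix[:, None, :] + eps)
        p_common = common / (theme + eps)
        resp = counts[:, None, :] * p_not_B[:, None, :] * p_theme  # expected counts

        # M-step: renormalize expected counts (eps keeps every entry positive).
        pi = resp.sum(axis=2) + eps
        pi /= pi.sum(axis=1, keepdims=True)
        theta = (resp * p_common).sum(axis=0) + eps
        theta /= theta.sum(axis=1, keepdims=True)
        for i in range(m):
            in_i = coll == i
            spec = (resp[in_i] * (1 - p_common[in_i])).sum(axis=0) + eps
            theta_spec[:, i, :] = spec / spec.sum(axis=1, keepdims=True)

    return theta, theta_spec, pi, log_lik
```

Calling `fit_cross_collection_mixture(counts, coll, k=2)` on a document-term count matrix returns the common theme distributions, the collection-specific distributions, the per-document theme weights, and the log-likelihood trace. The dense (D, k, V) arrays keep the sketch short but would need a sparse formulation for anything beyond small collections.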

Experiments
• Two data sets
  • War news (2 collections)
    • Iraq war: a combination of 30 articles from CNN and BBC websites
    • Afghan war: a combination of 26 articles from CNN and BBC websites
  • Laptop customer reviews (3 collections)
    • Apple iBook Mac: 34 reviews downloaded from epinions.com
    • Dell Inspiron: 22 reviews downloaded from epinions.com
    • IBM Thinkpad: 42 reviews downloaded from epinions.com
• On each data set, we compare a simple mixture model with the cross-collection mixture model

Comparison of Simple and Cross-Collection Clustering
[Table: results of simple clustering shown next to results of cross-collection clustering.]
Results from cross-collection clustering are more meaningful

Cross-Collection Clustering Results (Laptop Reviews)
• Top words serve as “labels” for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive])
• These word distributions can be used to segment text and add hyperlinks between documents
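Reading off the top words of a theme as its label amounts to sorting the theme's word distribution by probability. A minimal sketch follows; the helper name `top_words` and the probabilities in the example are made up for illustration and are not results from the slides.

```python
def top_words(word_probs, n=3):
    """Return the n most probable words of a theme as its label."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

# Hypothetical word distribution for one common theme (values invented).
battery_theme = {"battery": 0.40, "hours": 0.30, "sound": 0.10, "the": 0.05}
print(top_words(battery_theme, n=2))  # ['battery', 'hours']
```

The same helper could be applied to each common and each collection-specific distribution to build a side-by-side comparison of labels.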

Summary and Future Work
• We defined a new text mining problem, comparative text mining (CTM), which has many applications
• We proposed and evaluated a cross-collection mixture model for CTM
• Experimental results show that the proposed cross-collection model is more effective for CTM than a simple mixture model
• Future work
  • Further improve the mixture model and estimation method (e.g., consider proximity, MAP estimation)
  • Use the model to segment documents and create hyperlinks between segments (e.g., feed the learned word distributions into HMMs for segmentation)
