
A Cross-Collection Mixture Model for Comparative Text Mining

A Cross-Collection Mixture Model for Comparative Text Mining. ChengXiang Zhai 1 Atulya Velivelli 2 Bei Yu 3 1 Department of Computer Science 2 Department of Electrical and Computer Engineering 3 Graduate School of Library and Information Science


Presentation Transcript


  1. A Cross-Collection Mixture Model for Comparative Text Mining
     ChengXiang Zhai (1), Atulya Velivelli (2), Bei Yu (3)
     (1) Department of Computer Science
     (2) Department of Electrical and Computer Engineering
     (3) Graduate School of Library and Information Science
     University of Illinois, Urbana-Champaign, U.S.A.

  2. Motivation
     • Many applications involve a comparative analysis of several comparable text collections, e.g.:
       • Given news articles from different sources (about the same event), can we extract what is common to all the sources and what is unique to one specific source?
       • Given customer reviews of three different brands of laptops, can we extract the common themes (e.g., battery life, speed, warranty) and compare the three brands in terms of each common theme?
       • Given web sites of companies selling similar products, can we analyze the strengths and weaknesses of each company?
     • Existing work in text mining has conceptually focused on a single collection of text and is thus inadequate for comparative text analysis.
     • We aim to develop methods for comparing multiple collections of text and performing comparative text mining.

  3. Comparative Text Mining (CTM)
     Problem definition:
     • Given a comparable set of text collections
     • Discover and analyze their common and unique properties
     [Figure: a pool of text collections (Collection C1, Collection C2, …, Collection Ck) mapped to common themes, C1-specific themes, C2-specific themes, …, Ck-specific themes.]

  4. Example: Summarizing Customer Reviews
     [Figure: ideal results from comparative text mining over IBM laptop reviews, Apple laptop reviews, and Dell laptop reviews.]

  5. A More Realistic Setup of CTM
     [Figure: for IBM laptop reviews, Apple laptop reviews, and Dell laptop reviews, a common word distribution alongside collection-specific word distributions.]

  6. A Basic Approach: Simple Clustering
     • Pool all documents together and perform clustering
     • Hopefully, some clusters reflect common themes and others reflect collection-specific themes
     • However, we can't "force" a common theme to cover all collections
     [Figure: a background model θ_B and themes θ_1, θ_2, θ_3, θ_4, …]

  7. Improved Clustering: Cross-Collection Mixture Models
     • Explicitly distinguish and model common themes and collection-specific themes
     • Fit a mixture model to the text data
     • Estimate parameters using EM
     • Clusters are more meaningful
     [Figure: for collections C1, C2, …, Cm, a background model θ_B; each theme j = 1, …, k has a common distribution θ_j and collection-specific distributions θ_{j,1}, θ_{j,2}, …, θ_{j,m}.]
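
The fitting procedure named on this slide can be sketched as follows. This is a minimal illustration of EM for a mixture with a fixed background, common theme distributions, and collection-specific theme distributions; it is not the authors' implementation. The function name `fit_cross_collection`, the parameter names `lam_b` and `lam_c` (the two manually set weights), their default values, the random initialization, and fixing the background to pooled word frequencies are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def normalize(weights):
    """Scale a dict of non-negative weights so that its values sum to 1."""
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def fit_cross_collection(collections, k, lam_b=0.9, lam_c=0.5, iters=30, seed=0):
    """EM sketch. `collections` is a list of non-empty collections, each a
    list of tokenized documents. lam_b and lam_c are illustrative defaults."""
    rng = random.Random(seed)
    docs = [(i, Counter(doc)) for i, coll in enumerate(collections) for doc in coll]
    m = len(collections)
    pooled = Counter()
    for _, counts in docs:
        pooled.update(counts)
    vocab = sorted(pooled)
    theta_b = normalize(pooled)  # assumption: background = pooled word frequencies

    def random_dist():
        return normalize({w: rng.random() + 0.1 for w in vocab})

    common = [random_dist() for _ in range(k)]                        # theta_j
    specific = [[random_dist() for _ in range(m)] for _ in range(k)]  # theta_{j,i}
    pi = [[1.0 / k] * k for _ in docs]                                # pi_{d,j}
    history = []  # log-likelihood before each update
    for _ in range(iters):
        new_common = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        new_specific = [[dict.fromkeys(vocab, 0.0) for _ in range(m)] for _ in range(k)]
        new_pi = [[0.0] * k for _ in docs]
        loglik = 0.0
        for d, (i, counts) in enumerate(docs):
            for w, c in counts.items():
                # E-step: split the count of w in d among the hidden sources.
                com = [lam_c * common[j][w] for j in range(k)]
                spe = [(1 - lam_c) * specific[j][i][w] for j in range(k)]
                theme = [pi[d][j] * (com[j] + spe[j]) for j in range(k)]
                p_w = lam_b * theta_b[w] + (1 - lam_b) * sum(theme)
                loglik += c * math.log(p_w)
                for j in range(k):
                    # Expected count generated by theme j (not the background).
                    post = c * (1 - lam_b) * theme[j] / p_w
                    # Fraction of that count credited to the common distribution.
                    share = com[j] / (com[j] + spe[j])
                    new_pi[d][j] += post
                    new_common[j][w] += post * share
                    new_specific[j][i][w] += post * (1 - share)
        history.append(loglik)
        # M-step: renormalize the expected counts.
        common = [normalize(t) for t in new_common]
        specific = [[normalize(t) for t in row] for row in new_specific]
        pi = [[v / sum(row) for v in row] for row in new_pi]
    return common, specific, pi, history
```

Because every document carries its collection index, each common distribution is updated from all collections while each collection-specific distribution sees only its own documents; this is what lets a common theme span collections, which pooled clustering cannot enforce.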

  8. Details of the Mixture Model
     [Figure: "generating" word w in document d in collection Ci. With probability λ_B the word is drawn from the background θ_B, which accounts for noise (common non-informative words). With probability 1-λ_B a theme j is chosen with weight π_{d,j}; the word is then drawn from the common distribution θ_j with probability λ_C, or from the collection-specific distribution θ_{j,i} with probability 1-λ_C.]
     Parameters:
     • λ_B = noise level (manually set)
     • λ_C = common-specific tradeoff (manually set)
     • The θ's and π's are estimated with maximum likelihood
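
The generative picture on this slide can be written as a single word probability. The formula below is a reconstruction, assuming the symbols lost in the transcript are λ_B and λ_C for the two manually set weights, π_{d,j} for the weight of theme j in document d, and θ_B, θ_j, θ_{j,i} for the background, common, and collection-specific word distributions. The count notation c(w, d), the number of times w occurs in d, is introduced here for the log-likelihood.

```latex
p(w \mid d, C_i) = \lambda_B \, p(w \mid \theta_B)
  + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}
    \bigl[ \lambda_C \, p(w \mid \theta_j) + (1 - \lambda_C) \, p(w \mid \theta_{j,i}) \bigr]

\log p(C_1, \dots, C_m) = \sum_{i=1}^{m} \sum_{d \in C_i} \sum_{w} c(w, d) \, \log p(w \mid d, C_i)
```

Maximizing the second expression over the θ's and π's, with λ_B and λ_C held fixed, is the maximum likelihood estimation the slide refers to.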

  9. Experiments
     • Two data sets
       • War news (2 collections)
         • Iraq war: a combination of 30 articles from the CNN and BBC websites
         • Afghan war: a combination of 26 articles from the CNN and BBC websites
       • Laptop customer reviews (3 collections)
         • Apple iBook Mac: 34 reviews downloaded from epinions.com
         • Dell Inspiron: 22 reviews downloaded from epinions.com
         • IBM Thinkpad: 42 reviews downloaded from epinions.com
     • On each data set, we compare a simple mixture model with the cross-collection mixture model

  10. Comparison of Simple and Cross-Collection Clustering
     [Figure: results of simple clustering shown next to results of cross-collection clustering.]
     Results from cross-collection clustering are more meaningful.

  11. Cross-Collection Clustering Results (Laptop Reviews)
     • Top words serve as "labels" for common themes (e.g., [sound, speakers], [battery, hours], [cd, drive])
     • These word distributions can be used to segment text and add hyperlinks between documents
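
Reading a label off a learned theme amounts to taking its highest-probability words. A minimal sketch, assuming a theme is stored as a word-to-probability dict; the helper name `top_words` and the toy probabilities are invented for illustration and are not results from the talk:

```python
def top_words(dist, n=2):
    """Return the n most probable words of a word distribution.

    Ties are broken alphabetically so the output is deterministic.
    """
    ranked = sorted(dist.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Toy distribution with made-up probabilities.
toy_theme = {"battery": 0.4, "hours": 0.3, "cd": 0.2, "drive": 0.1}
label = top_words(toy_theme, n=2)
```

Here `label` is `["battery", "hours"]`, in the same spirit as the bracketed labels on the slide.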

  12. Summary and Future Work
     • We defined a new text mining problem, comparative text mining (CTM), which has many applications
     • We proposed and evaluated a cross-collection mixture model for CTM
     • Experimental results show that the proposed cross-collection model is more effective for CTM than a simple mixture model
     • Future work
       • Further improve the mixture model and estimation method (e.g., consider proximity, MAP estimation)
       • Use the model to segment documents and create hyperlinks between segments (e.g., feed the learned word distributions into HMMs for segmentation)
