Explore methods for extracting common and specific themes from multiple text collections using a cross-collection mixture model. Applications include opinion extraction, news summarization, and more.
DAIS: The Database and Information Systems Laboratory at The University of Illinois at Urbana-Champaign
Large Scale Information Management

Comparative Text Mining
Q. Mei, C. Liu, H. Su, A. Velivelli, B. Yu, C. Zhai

[Figure: theme evolution over time spans t1, t2, …, T. Example theme word distributions: "gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …"; "microarray 0.2, gene 0.1, protein 0.05"; "information 0.2, topic 0.1, classification 0.1, text 0.05"; "web 0.3, classification 0.1, topic 0.1"; "rules 0.0142, association 0.0064, support 0.0053, …". Labels: theme similarity; evolutionary transition.]

[Figure: decoding the collection with theme models θ1, θ2, θ3 and a background model B. Example theme word distributions: "warning 0.3, system 0.2, …"; "aid 0.1, donation 0.05, support 0.02, …"; "statistics 0.2, loss 0.1, dead 0.05, …". Background: "is 0.05, the 0.04, a 0.03, …". P(w|θ) = output probability.]

[Figure caption, Iraq war and Afghan war sample results: Collection-specific themes indicate different roles of “United Nations” in the two wars.]

Cross-Collection Text Mining
• Goal: Extract common themes and specific themes from comparable collections
• Applications: Opinion extraction, business intelligence, news summarization, etc.

Example (IBM, APPLE, and DELL laptop reviews): common themes and their collection-specific variants
• Battery life: “IBM” specific: Long, 4-3 hrs; “APPLE” specific: Medium, 3-2 hrs; “DELL” specific: Short, 2-1 hrs
• Hard disk: “IBM” specific: Large, 80-100 GB; “APPLE” specific: Small, 5-10 GB; “DELL” specific: Medium, 20-50 GB
• Speed: “IBM” specific: Slow, 100-200 MHz; “APPLE” specific: Very fast, 3-4 GHz; “DELL” specific: Moderate, 1-2 GHz

[Figure: cross-collection mixture model. Labels: background B; for each theme j = 1, …, k, a model of theme j in common and models of theme j specific to C1, C2, …, Cm; per-document theme weights (subscripts d,1; …; d,k); mixing weights B and 1 - B (background vs. themes) and C and 1 - C (common vs. collection-specific); word w.]
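The poster's cross-collection mixture model diagram can be read as a word-generation probability: a word comes from the background with some weight, and otherwise from a theme, where each theme blends a common model with a collection-specific one. The sketch below is one plausible reading of that diagram, not the authors' implementation. The names `lambda_b`, `lambda_c`, and `pi_d` are assumptions (the original symbols were lost in extraction), and the toy distributions are invented for illustration only.

```python
from typing import Dict, List

Dist = Dict[str, float]  # word -> probability


def cross_collection_word_prob(
    word: str,
    background: Dist,      # background word distribution
    common: List[Dist],    # common model of each theme
    specific: List[Dist],  # model of each theme specific to this document's collection
    pi_d: List[float],     # assumed name: per-document theme weights, summing to 1
    lambda_b: float,       # assumed name: weight of the background model
    lambda_c: float,       # assumed name: weight of the common model within a theme
) -> float:
    """Probability of `word` in a document of one collection under the mixture."""
    theme_part = 0.0
    for weight, theta_common, theta_specific in zip(pi_d, common, specific):
        blended = (lambda_c * theta_common.get(word, 0.0)
                   + (1.0 - lambda_c) * theta_specific.get(word, 0.0))
        theme_part += weight * blended
    return lambda_b * background.get(word, 0.0) + (1.0 - lambda_b) * theme_part


# Toy data, invented for illustration (two themes, one collection).
background = {"the": 0.5, "battery": 0.1, "disk": 0.1, "long": 0.15, "large": 0.15}
common = [{"battery": 0.6, "life": 0.4}, {"disk": 0.7, "hard": 0.3}]
specific = [{"long": 0.8, "battery": 0.2}, {"large": 0.9, "disk": 0.1}]
pi_d = [0.75, 0.25]

p_battery = cross_collection_word_prob(
    "battery", background, common, specific, pi_d, lambda_b=0.2, lambda_c=0.5
)
print(round(p_battery, 4))
```

In a full system these distributions and weights would be estimated from the collections (the KDD 2004 reference describes the estimation); here they are fixed by hand so that the mixing structure is easy to see.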
Sample results (comparing news articles about the Iraq war and the Afghan war):
• The common theme indicates that “United Nations” is involved in both wars

• Many applications involve a comparative analysis of several text collections
• Existing work in text mining has conceptually focused on a single collection of text and is thus inadequate for comparative text analysis
• We aim to develop methods for comparing multiple collections of text and performing comparative text mining

Model: a mixture model for cross-collection comparative text mining (“generating” word w in doc d in collection Ci)

Reference:
• C. Zhai, A. Velivelli, and B. Yu. A Cross-Collection Mixture Model for Comparative Text Mining. KDD 2004.

Temporal Text Mining
• Goal: Extract evolutionary theme patterns from a time-labeled collection
• Applications: News summarization, literature analysis, opinion monitoring, etc.

Sample results:
• Theme evolution graph and threads of the Tsunami data set (axes: time; theme evolution thread). Themes shown: Statistics of Death and Loss; Statistics of Further Impact; Immediate Reports; Personal Experience of Survivors; Donations from Countries; Aid from Local Areas; Aid from the World; Research Inspired; Lessons from Tsunami; Specific Events of Aid
• Theme life cycles from the CNN news data set (rising and dropping themes): the first 2 weeks are mostly about “aid from the world”; the next 2 weeks are mostly about “personal experience”
• Theme life cycles of KDD abstracts

Models: “generating” word w in doc d in the collection
[Figure: temporal model diagram. Labels: the collection (Doc1, Doc3, …); document d; themes 1, 2, …, k with per-document theme weights (subscripts d,1; d,2; …; d,k); background B (themes: 1 - B); word w; theme spans; evolutionary transitions.]

Reference:
• Q. Mei and C. Zhai. Discovering Evolutionary Theme Patterns from Text -- An Exploration of Temporal Text Mining. KDD 2005.

Spatiotemporal Text Mining
• Goal: model the spatiotemporal theme patterns from a collection of text.
• model the mixture of topics: common themes
• spatiotemporal content analysis: theme life cycles, theme coverage snapshots
• Applications: Weblog mining, search result summarization, opinion tracking, business intelligence, etc.

Sample results (Weblog data about “Hurricane Katrina”, 5 weeks, U.S.):
• Week 1: The theme is the strongest along the Gulf of Mexico
• Week 2: The discussion moves towards the northern and western states
• Week 3: The theme is distributed more uniformly over the states
• Week 4: The theme is again strong along the east coast and the Gulf of Mexico
• Week 5: The theme fades out in most states

Spatiotemporal model: document d at time t and location l
[Figure: spatiotemporal model diagram. Labels: spatiotemporal context (time = t; location = l); document d; mixing weights TL and 1 - TL; theme coverage P(i|t,l) and P(i|d); themes 1, …, i, …, k with word distributions P(w|i); background B (themes: 1 - B) with word distribution P(w|B); word w; compute theme life cycles; compute theme snapshots.]

Reference:
• Q. Mei, C. Liu, H. Su, and C. Zhai. A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. WWW 2006.
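The spatiotemporal model diagram lists the quantities P(w|B), P(w|i), P(i|d), and P(i|t,l) together with the weights B and TL. A minimal sketch of how these could combine into a word probability follows. It assumes that TL weights the context-based coverage P(i|t,l) and 1 - TL weights the document-based coverage P(i|d); the diagram residue does not settle which weight goes with which term. The parameter names `lambda_b` and `lambda_tl` and all toy numbers are hypothetical.

```python
from typing import Dict, List

Dist = Dict[str, float]  # word -> probability


def spatiotemporal_word_prob(
    word: str,
    background: Dist,                # P(w|B)
    themes: List[Dist],              # P(w|i) for each theme i
    p_theme_given_doc: List[float],  # P(i|d), summing to 1
    p_theme_given_tl: List[float],   # P(i|t,l), summing to 1
    lambda_b: float,                 # assumed name: background weight (B in the diagram)
    lambda_tl: float,                # assumed name: context weight (TL in the diagram)
) -> float:
    """Probability of `word` in document d written at time t and location l."""
    theme_part = 0.0
    for theta, p_doc, p_tl in zip(themes, p_theme_given_doc, p_theme_given_tl):
        # Assumption: coverage interpolates document-level and context-level weights.
        coverage = (1.0 - lambda_tl) * p_doc + lambda_tl * p_tl
        theme_part += coverage * theta.get(word, 0.0)
    return lambda_b * background.get(word, 0.0) + (1.0 - lambda_b) * theme_part


# Toy data, invented for illustration (two themes, one time/location context).
st_background = {"the": 0.6, "storm": 0.2, "aid": 0.2}
st_themes = [{"storm": 0.5, "flood": 0.5}, {"aid": 0.75, "donation": 0.25}]
p_doc = [0.5, 0.5]
p_tl = [0.9, 0.1]

p_storm = spatiotemporal_word_prob(
    "storm", st_background, st_themes, p_doc, p_tl, lambda_b=0.2, lambda_tl=0.5
)
print(round(p_storm, 4))
```

With P(i|t,l) estimated for every week and state, comparing its values across weeks at one location would give a theme life cycle, and comparing across locations in one week would give a coverage snapshot, as in the Hurricane Katrina results.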