
Comparative Text Mining



Presentation Transcript


  1. Comparative Text Mining

Q. Mei, C. Liu, H. Su, A. Velivelli, B. Yu, C. Zhai
DAIS, The Database and Information Systems Laboratory at The University of Illinois at Urbana-Champaign
Large Scale Information Management

• Many applications involve a comparative analysis of several text collections.
• Existing work in text mining has conceptually focused on a single collection of text and is therefore inadequate for comparative text analysis.
• We aim to develop methods for comparing multiple collections of text and performing comparative text mining.

Cross-Collection Text Mining

• Goal: Extract common themes and specific themes from comparable collections.
• Applications: Opinion extraction, business intelligence, news summarization, etc.

Illustration (IBM, APPLE, and DELL laptop reviews):

Common themes   "IBM" specific       "APPLE" specific      "DELL" specific
Battery life    Long, 4-3 hrs        Medium, 3-2 hrs       Short, 2-1 hrs
Hard disk       Large, 80-100 GB     Small, 5-10 GB        Medium, 20-50 GB
Speed           Slow, 100-200 MHz    Very fast, 3-4 GHz    Moderate, 1-2 GHz

Model: a mixture model for cross-collection comparative text mining.

[Figure: "Generating" word w in doc d in collection Ci. A word is drawn either from the background B or from one of k themes; each theme has a version common to all collections and versions specific to C1, C2, ..., Cm. Diagram labels: B and 1-B; C and 1-C; d,1 ... d,k.]

Sample results (comparing news articles about the Iraq war and the Afghan war):
• The common theme indicates that "United Nations" is involved in both wars.
• Collection-specific themes indicate different roles of "United Nations" in the two wars.

Reference:
• C. Zhai, A. Velivelli, and B. Yu. A Cross-Collection Mixture Model for Comparative Text Mining. KDD 2004.

Temporal Text Mining

• Goal: Extract evolutionary theme patterns from a time-labeled collection.
• Applications: News summarization, literature analysis, opinion monitoring, etc.

Models:

[Figure: theme spans and evolutionary transitions. Themes extracted in time spans t1, t2, ..., T are linked by theme similarity; a link is an evolutionary transition. Example themes: "gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038 …"; "microarray 0.2, gene 0.1, protein 0.05"; "Information 0.2, topic 0.1, classification 0.1, text 0.05"; "web 0.3, classification 0.1, topic 0.1"; "rules 0.0142, association 0.0064, support 0.0053 …".]

[Figure: "Generating" word w in doc d in the collection. A word is drawn from the background B or from one of themes 1 ... k. Diagram labels: B and 1-B; d,1, d,2, ..., d,k; Doc1, Doc3, ...]

[Figure: decoding the collection. Themes θ1, θ2, θ3 and the background B emit words with output probability P(w|θ). Example themes: "warning 0.3, system 0.2 .."; "Aid 0.1, donation 0.05, support 0.02 .."; "statistics 0.2, loss 0.1, dead 0.05 .."; background: "Is 0.05, the 0.04, a 0.03 ..".]

Sample results:

[Figure: theme evolution graph and theme evolution threads of the Tsunami data set, plotted over time. Themes shown: Statistics of death and loss; Statistics of further impact; Immediate reports; Personal experience of survivors; Donations from countries; Aid from local areas; Aid from the world; Research inspired; Lessons from Tsunami; Specific events of aid.]

[Figure: theme life cycles from the CNN news data set and theme life cycles of KDD abstracts, with rising and dropping themes marked.]

• The first 2 weeks are mostly about "aid from the world".
• The next 2 weeks are mostly about "personal experience".

Reference:
• Q. Mei and C. Zhai. Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining. KDD 2005.

Spatiotemporal Text Mining

• Goal: Model the spatiotemporal theme patterns from a collection of text.
  • Model the mixture of topics: common themes.
  • Spatiotemporal content analysis: theme life cycles, theme coverage snapshots.
• Applications: Weblog mining, search result summarization, opinion tracking, business intelligence, etc.

Spatiotemporal model:

[Figure: generating word w in document d at time t and location l. A word is drawn from the background with P(w|B) or from one of themes 1 ... i ... k with P(w|i); the theme is chosen either from the document, P(i|d), or from the spatiotemporal context (time = t, location = l), P(i|t,l). Diagram labels: B and 1-B; TL and 1-TL. Outputs: compute theme life cycles; compute theme snapshots.]

Sample results (Weblog data about "Hurricane Katrina", 5 weeks, U.S.):
• Week 1: The theme is the strongest along the Gulf of Mexico.
• Week 2: The discussion moves towards the northern and western states.
• Week 3: The theme is distributed more uniformly over the states.
• Week 4: The theme is again strong along the east coast and the Gulf of Mexico.
• Week 5: The theme fades out in most states.

Reference:
• Q. Mei, C. Liu, H. Su, and C. Zhai. A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. WWW 2006.
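
The spatiotemporal model's generation of a word can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the diagram labels "B / 1-B" and "TL / 1-TL" are mixing weights (called lam_b and lam_tl here), that P(i|d) and P(i|t,l) are blended by lam_tl before a theme emits the word, and every number in the toy data is made up for the example.

```python
def word_probability(word, p_bg, themes, p_theme_doc, p_theme_ctx, lam_b, lam_tl):
    """Probability of `word` in a document d written at time t and location l.

    Assumed reading of the poster's diagram (hypothetical parameter names):
      lam_b        weight of the background model, P(w|B)
      lam_tl       weight of the spatiotemporal context when choosing a theme
      p_theme_doc  P(i|d), theme coverage of the document
      p_theme_ctx  P(i|t,l), theme coverage of the (time, location) context
      themes       list of word distributions, P(w|i)
    """
    theme_part = 0.0
    for i, theme in enumerate(themes):
        # Blend document-level and context-level theme coverage.
        coverage = (1.0 - lam_tl) * p_theme_doc[i] + lam_tl * p_theme_ctx[i]
        theme_part += coverage * theme.get(word, 0.0)
    # Mix the background model with the theme mixture.
    return lam_b * p_bg.get(word, 0.0) + (1.0 - lam_b) * theme_part


# Toy data, invented for illustration only.
vocabulary = ["the", "storm", "aid"]
p_bg = {"the": 0.5, "storm": 0.25, "aid": 0.25}
themes = [
    {"storm": 0.75, "aid": 0.25},  # a "storm" theme
    {"storm": 0.25, "aid": 0.75},  # an "aid" theme
]
p_theme_doc = [0.5, 0.5]   # P(i|d)
p_theme_ctx = [1.0, 0.0]   # P(i|t,l): this context covers only the storm theme

probs = {
    w: word_probability(w, p_bg, themes, p_theme_doc, p_theme_ctx,
                        lam_b=0.5, lam_tl=0.5)
    for w in vocabulary
}
for w, p in probs.items():
    print(w, p)
```

Because each component is a proper distribution over the vocabulary, the mixed probabilities also sum to one. Under this reading, theme life cycles would come from tracking P(i|t,l) over t for a fixed location, and theme snapshots from comparing it across locations for a fixed time.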
