Current Research in Data Mining Research Group

Jiawei Han
Data Mining Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign
Acknowledgements: NSF, ARL, ARO, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo! Labs, LinkedIn, HP Labs & Boeing
September 10, 2014

Outline
• An Introduction to Data Mining Research Group
• Pattern Discovery Methods
• Mining Heterogeneous Information Networks
• Construction of Heterogeneous Information Networks from Unstructured Data
• TextCube and OLAP heterogeneous networks
• Mining Cyber-Physical Systems and Networks
• Conclusions

Data Mining and Data Warehousing: Jiawei Han’s Group at CS, UIUC
• Mining patterns and knowledge discovery from massive data
• Data mining in heterogeneous information networks
• Exploring broad applications of data mining
• Developed popular data mining algorithms: FPgrowth, gSpan, PrefixSpan, RankingCube, TruthFinder, NetClus, RankClass, …
• 600+ research papers; most cited author/group in data mining
• ACM Fellow, IEEE Fellow, ACM SIGKDD Innovation Award, W. Wallace McDowell Award; students: ACM KDD Dissertation Awards (2008, 2013), …
• Textbook, “Data Mining: Concepts and Techniques,” adopted worldwide
• Funded as NSCTA (Network Science Collaborative Technology Alliance) by ARL [09-14, 15-19], ARO, AFOSR (MURI), NSF, NASA, DHS, Boeing, MSR, Google, Yahoo!, HP Labs, …
• Graduated 37 Ph.D.’s, who joined Google, Microsoft Research, Yahoo! Labs, Facebook, and Twitter or became professors (13)
• Supervising 17 Ph.D. students, 4 M.S. students & 5 visitors/postdocs

Data Mining Research Group in CS, Univ. Illinois
• Prominent student awards
  • SIGKDD or SIGMOD Ph.D. Dissertation Awards/Runner-Ups
  • 10-year impact paper awards
  • Best student paper awards, best papers, best posters, …
  • KDDCUP 2013 Runner-Up Award
  • IBM/Microsoft/NSF/NDSEG Ph.D. Fellowships
• Graduation:
  • Professors at UCSB, PSU, SUNY Buffalo, Northeastern, FSU, MSU, Notre Dame, CUHK, …
  • Researchers at IBM, MSR, Google Research, Yahoo! Labs, Facebook, Twitter, NEC, etc.

Mining Sequential Patterns from Shopping Sequences
• Sequential pattern mining: Given a set of (shopping) sequences, find the complete set of frequent subsequences
• Example from a sequence database: <a(bc)dc> is a subsequence of s = <a(abc)(ac)d(cf)>
• Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
• Idea of PrefixSpan: mine the database projected on each frequent prefix
  • Prefix <a>: s|<a>: ( , 2), i.e., <(abc)(ac)d(cf)>
  • Prefix <ab>: s|<ab>: ( , 4), i.e., <(_c)(ac)d(cf)>
• Idea of CloSpan: reduce redundancy among the mined patterns
• Our innovation:
  • PrefixSpan (TKDE’04): 1598 citations
  • CloSpan (SDM’03): 568 citations
  • FPgrowth (SIGMOD’00): 4956 citations
• Difficulty: generalizing to biosequence mining, which involves approximate patterns & noise
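The subsequence and support definitions on this slide can be checked with a small sketch. Only the sequence s and the two patterns come from the slide; the other three database sequences are assumed here for illustration (the slide's database figure was not preserved), and the helper names `seq`, `is_subsequence`, and `support` are hypothetical. This is a containment check, not PrefixSpan itself.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]


def seq(*elements: str) -> Sequence:
    """Build a sequence; each string is one itemset, e.g. 'abc' -> {a, b, c}."""
    return [frozenset(e) for e in elements]


def is_subsequence(pattern: Sequence, sequence: Sequence) -> bool:
    """True if the itemsets of `pattern` are contained, in order,
    in distinct elements of `sequence` (greedy earliest match)."""
    pos = 0
    for itemset in pattern:
        # Advance to the first remaining element that contains this itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next itemset must match a later element
    return True


def support(pattern: Sequence, database: List[Sequence]) -> int:
    """Number of database sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database)


# s is taken from the slide; the remaining sequences are assumed.
s = seq("a", "abc", "ac", "d", "cf")
database = [
    s,
    seq("ad", "c", "bc", "ae"),
    seq("ef", "ab", "df", "c", "b"),
    seq("e", "g", "af", "c", "b", "c"),
]

contains = is_subsequence(seq("a", "bc", "d", "c"), s)
ab_c_support = support(seq("ab", "c"), database)
```

With this assumed database, `<(ab)c>` is contained in two sequences, so it meets min_sup = 2.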

Mining Frequent Subgraph Patterns from Graph DBs
• Graph pattern mining: Given a set of graphs, find the complete set of frequent subgraphs
• Graph dataset: e.g., a chemical compound database; frequent patterns shown for min support = 2
• Idea of gSpan: graph pattern growth (k-edge to (k+1)-edge) + completeness of right-most extension
• Idea of CloseGraph: under what condition can we stop searching the children of a pattern G, i.e., terminate early?
• Our innovation:
  • gSpan (ICDM’02): 1319 citations
  • CloseGraph (KDD’03): 520 citations (does not mine subgraphs covered by their super-patterns)
• Experiment: NCI/NIH AIDS antiviral screen compound data, minsup = 5%
• Extended to mine structures in large single networks (VLDB’11)

Graph Indexing and Graph Similarity Search
• Graph search: Given a query graph Q, find all the graphs in the graph DB containing Q
• gIndex key idea: index on frequent and discriminative substructures (mined)
• Grafil key idea: explore feature similarity (approximate features)
• Graph index helps search (plots: # candidates vs. query size; # indices vs. DB size)
• Our innovation:
  • gIndex (SIGMOD’04): 419 citations
  • Grafil (SIGMOD’05): similarity search
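The filtering role of a substructure index can be sketched as follows. This is a simplified illustration, not gIndex itself: substructure features are abstracted as plain strings, the graph ids and feature names are hypothetical, and the final verification of each candidate by a subgraph isomorphism test is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_index(graph_features: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Inverted index: feature -> ids of the graphs that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for graph_id, features in graph_features.items():
        for feature in features:
            index[feature].add(graph_id)
    return dict(index)


def candidates(index: Dict[str, Set[str]],
               query_features: Iterable[str],
               all_ids: Iterable[str]) -> Set[str]:
    """Graphs that contain every indexed feature of the query.
    Features missing from the index cannot prune anything."""
    result = set(all_ids)
    for feature in query_features:
        if feature in index:
            result &= index[feature]
    return result


# Hypothetical feature sets, standing in for mined substructures.
graph_features = {
    "g1": {"C-C", "C-O", "ring6"},
    "g2": {"C-C", "C-N"},
    "g3": {"C-C", "C-O"},
}
index = build_index(graph_features)
candidate_ids = candidates(index, {"C-C", "C-O"}, graph_features.keys())
```

Only the graphs in `candidate_ids` would then need the expensive containment check against Q.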

CoDense: Mining Frequent Coherent Dense Subgraphs across Multiple Microarray Datasets
• Input: multiple microarray datasets (genes g1, g2, … by conditions c1, c2, …, cm), one graph per experiment
• Coherent dense graphs:
  • Frequency: all edges occur in ≥ k graphs
  • Coherency: correlated edge occurrences
  • Density: subgraph density ≥ threshold
• Discovery (figure labels): MRP49, YDR115W, MRPL51, PHB1, ATP12, PET100, ATP17, MRPL37, ACN9, MRPL38, MRPL39, MRPL32, FMC1, MRPS18
• Red: PHB1, ATP17, MRPL51, MRPL39, MRPL49, MRPL51, PET100; GO:0006091 (generation of precursor metabolites and energy; p-value = 0.001339)
• Our innovation: CoDense (ISMB’05): mining noisy microarray data to derive interesting dense subgraphs (collab. w. USC: Jasmine Zhou)

Data Mining Process of CoDense
• Step 1: graphs G1 to G6 are combined (add/cut) into a summary graph Ĝ
• Step 2: MODES on Ĝ gives Sub(Ĝ)
• Step 3: edge occurrence profiles
• Step 4: second-order graph S
• Step 5: MODES on S gives Sub(S)
• Step 6: restore G and MODES gives Sub(G)
Edge occurrence profiles:
E     G1  G2  G3  G4  G5  G6
c-e   0   0   1   1   1   1
c-f   0   1   0   1   1   1
c-h   0   0   0   1   1   1
c-i   0   0   1   1   1   0
e-f   0   0   0   1   1   1
…     …   …   …   …   …   …
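The step from edge occurrence profiles to a second-order graph can be sketched with the five profile rows that survive from the slide. The similarity measure (Pearson correlation) and the 0.8 threshold are assumptions made for illustration; the slide only says that edge occurrences should be correlated.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple

# Rows of the edge occurrence profile table (columns G1..G6).
profiles: Dict[str, List[int]] = {
    "c-e": [0, 0, 1, 1, 1, 1],
    "c-f": [0, 1, 0, 1, 1, 1],
    "c-h": [0, 0, 0, 1, 1, 1],
    "c-i": [0, 0, 1, 1, 1, 0],
    "e-f": [0, 0, 0, 1, 1, 1],
}


def pearson(x: List[int], y: List[int]) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def second_order_edges(profiles: Dict[str, List[int]],
                       threshold: float) -> List[Tuple[str, str]]:
    """Connect two first-order edges when their occurrence profiles
    are correlated at or above `threshold` (an assumed parameter)."""
    return [
        (e1, e2)
        for e1, e2 in combinations(sorted(profiles), 2)
        if pearson(profiles[e1], profiles[e2]) >= threshold
    ]


second_order = second_order_edges(profiles, threshold=0.8)
```

Under these assumptions only c-h and e-f, whose profiles are identical, become neighbors in the second-order graph.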

Mining Heterogeneous Information Networks
• Heterogeneous networks: multiple object types and/or multiple link types
• Examples: the Facebook network; the DBLP bibliographic network (Venue, Paper, Author); the IMDB movie network (Movie, Studio, Director, Actor)
• Homogeneous networks are information-lossy projections of heterogeneous networks!
• Directly mining information-richer heterogeneous networks
• Current work: mining DBLP (CS bibliographic DB), PubMed, news, tweets, data.gov, …

Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining!
• DBLP: a computer science bibliographic database (>2 M papers, >0.7 M authors, >10 K venues), …
• Figure: a sample publication record in DBLP
• Power of heterogeneous network modeling: treat Author, Venue, Term, and Paper all as first-class citizens!

RankClus: Rank-Based Clustering
• RankClus (EDBT’09)/NetClus (KDD’09): integrate ranking & clustering for mining heterogeneous information networks
• Figure: DBLP schema
• Applications: rank treatments for AIDS from MEDLINE; RankCompete: organize your photo album automatically!

RankClass: Integration of Ranking and Classification
• Knowledge propagation via multi-typed heterogeneous networks
• Our innovation:
  • DBLP: 4-field data set (DB, DM, AI, IR) forming a heterogeneous information network
  • Rank objects within each class (with extremely limited label information)
  • Obtain high classification accuracy and excellent rankings within each class
• ECML PKDD’10/KDD’11: integrate ranking and classification; small training set; knowledge propagation across typed links; efficient and scalable
• Potential applications: biological network mining

Meta-Path Guided Similarity Search in Networks
• Similarity search: find similar objects in networks
• Who are most similar to AnHai Doan?
• Researchers shown: AnHai Doan (CS, Wisconsin; database area; PhD: 2002), Jignesh Patel (CS, Wisconsin; database area; PhD: 1998), Amol Deshpande (CS, Maryland; database area; PhD: 2004), Jun Yang (CS, Duke; database area; PhD: 2001)
• Meta-path: meta-level description of a path between two objects
• Different meta-paths carry rather different semantics
• Example on the DBLP network schema: Author-Paper-Venue-Paper-Author (APVPA)
• Our innovation: PathSim (VLDB’11): similarity search in heterogeneous networks; a balanced similarity measure; user guidance by selecting different meta-paths
• Application in biomedical domain: IBM: search for close relationships among diseases, drugs, treatments, side effects, and explanations
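What "balanced" means can be made concrete. For a symmetric meta-path, PathSim as defined in the VLDB’11 paper scores a pair by s(x, y) = 2 · |paths x→y| / (|paths x→x| + |paths y→y|). The sketch below evaluates it for APVPA, where the number of path instances between two authors is the dot product of their per-venue paper counts. The author names and counts are hypothetical.

```python
from typing import Dict

VenueCounts = Dict[str, int]  # venue -> number of papers by the author


def path_count(x: VenueCounts, y: VenueCounts) -> int:
    """Number of APVPA path instances between two authors:
    the dot product of their venue count vectors."""
    return sum(count * y.get(venue, 0) for venue, count in x.items())


def pathsim(x: VenueCounts, y: VenueCounts) -> float:
    """PathSim under a symmetric meta-path: twice the connecting paths,
    normalized by each author's paths back to themselves."""
    denominator = path_count(x, x) + path_count(y, y)
    return 2 * path_count(x, y) / denominator if denominator else 0.0


# Hypothetical publication counts per venue.
authors: Dict[str, VenueCounts] = {
    "junior_a": {"SIGMOD": 2, "VLDB": 1},
    "junior_b": {"SIGMOD": 2, "VLDB": 1},
    "prolific": {"SIGMOD": 50, "VLDB": 20},
}

peer_score = pathsim(authors["junior_a"], authors["junior_b"])
star_score = pathsim(authors["junior_a"], authors["prolific"])
```

The two authors with the same publication profile score 1.0, while the highly visible author scores low against them even though many more paths connect them; this is the sense in which the measure favors peers over popular objects.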

PathPredict: Meta-Path Based Relationship Prediction
• Who will be your new coauthors?
• Meta path-guided prediction: infer or predict new relationships among multi-typed links
• Network schema: object types author, paper, topic, venue; relations write/write-1, publish/publish-1, mention/mention-1, cite/cite-1, contain/contain-1
• Our contribution: PathPredict (ASONAM’11): co-author prediction (A-P-A) using topological features encoded by meta-paths, e.g., (A-P→P-A)
• Which meta-path is more important? Different meta-paths have different prediction power (p-values obtained from the DBLP data)
• Application: co-author prediction for Jian Pei: only 42 among 4809 candidates are true first-time co-authors! (Trained on data collected in [1996, 2002]; testing period: [2003, 2009])

Truth Analysis: Enhancing the Quality of Heterogeneous Information Networks
• Motivation: information provided can be untrustworthy, error-prone, missing, …
• Model: info providers (w1 to w4) make claims (f1 to f4) about objects (o1, o2)
• Application: handling conflicting claims on biomedical properties
• Experimental datasets: large and real
  • Book Authors from abebooks.com (1263 books, 879 sources, 48153 claims, 2420 book-author, 100 labeled)
  • Movie Directors from Bing (15073 movies, 12 sources, 108873 claims, 33526 movie-director, 100 labeled)
• Our contribution:
  • TruthFinder (TKDE’08): mutual enhancement of trustworthiness of info providers and claims
  • Latent Truth Model (VLDB’12): modeling two-sided truth (multiple facts, two-sided claims)
• Example (Harry Potter; positive vs. negative claims, correct vs. incorrect claims): IMDB: high precision, high recall; Netflix: high precision, low recall; BadSource: low precision, low recall
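The mutual enhancement between providers and claims can be sketched as a fixed-point iteration. This is a simplified illustration and not the published TruthFinder formulas: claim confidence is taken as the probability that at least one of its providers is right, provider trust as the mean confidence of its claims, and any further refinements of the published method are omitted. The provider-to-claim assignment and the initial trust of 0.8 are hypothetical.

```python
from math import prod
from typing import Dict, List, Tuple


def truth_discovery(provides: Dict[str, List[str]],
                    fact_object: Dict[str, str],
                    iterations: int = 5,
                    initial_trust: float = 0.8
                    ) -> Tuple[Dict[str, str], Dict[str, float]]:
    """Alternate between claim confidence and provider trust, then
    pick the most confident claim per object."""
    providers_of: Dict[str, List[str]] = {}
    for provider, facts in provides.items():
        for fact in facts:
            providers_of.setdefault(fact, []).append(provider)

    trust = {provider: initial_trust for provider in provides}
    confidence: Dict[str, float] = {}
    for _ in range(iterations):
        # A claim is wrong only if all of its providers are wrong.
        confidence = {
            fact: 1 - prod(1 - trust[p] for p in ps)
            for fact, ps in providers_of.items()
        }
        # A provider is as trustworthy as its claims are confident.
        trust = {
            provider: sum(confidence[f] for f in facts) / len(facts)
            for provider, facts in provides.items()
        }

    best: Dict[str, str] = {}
    for fact, obj in fact_object.items():
        if obj not in best or confidence[fact] > confidence[best[obj]]:
            best[obj] = fact
    return best, trust


# Hypothetical wiring of providers w1..w4, claims f1..f4, objects o1, o2.
provides = {"w1": ["f1"], "w2": ["f1", "f3"], "w3": ["f2", "f3"], "w4": ["f4"]}
fact_object = {"f1": "o1", "f2": "o1", "f3": "o2", "f4": "o2"}
truths, trust = truth_discovery(provides, fact_object)
```

Claims backed by several providers gain confidence, which in turn raises the trust of those providers; the isolated provider w4 stays at its initial trust.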

Hierarchical Relationship Discovery
• From partially ordered objects to a hierarchy (tree)
• Based on NLP or other techniques to extract partially ordered objects
• Using constraints to discover relationships
• Figures: singleton potential; pairwise potential function: cases; discovery of the Kenny family tree

Recursive Construction of a Topical Hierarchy by Phrase Mining
• The framework of CATHY (Constructing A Topical HierarchY): term co-occurrence network, topic discovery, topical phrase mining and ranking, recursive construction

Growing Parallel Paths (WWW 2011) Result:

WinaCS: Web Information Network Analysis for Computer Science Database records can be found on link paths!

Research-Insight [SIGMOD’13 Demo]
• Query on “Jim Gray”
• Query on “Machine Learning”
• Advisor-advisee result for “Kevin Chang”
• Potential collaborators for “Jiawei Han”

Event Cube: An Overview
• Event Cube: an organized approach for mining and understanding anomalous aviation events (funded by NASA, 2008-2010)
• Flow: multidimensional text database, Event Cube representation, analysis support, analyst
• Dimensions with drill-down and roll-up: Topic (e.g., turbulence; Encounter: birds; Deviation: undershoot, overshoot), Time (1998: 98.01, 98.02; 1999: 99.01, 99.02), Location (CA, FL, TX; LAX, SJC, MIA, AUS)
• Analysis support: multidimensional OLAP, ranking, cause analysis, topic summarization/comparison

Text/Topic Cube: General Idea
• Heterogeneous: categorical attributes + unstructured text
• Event report: categorical attributes (ACN, Time, Location, Place, Environment, …) + text data
• How to combine?
• Our solution: Cube: categorical attributes; Measure (text/topic model): unstructured text
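The combination of categorical dimensions with a text measure can be sketched as below. Term counts are used as a simple stand-in for the text/topic model measure, and the three reports are hypothetical; the function names `build_cube` and `roll_up` are illustrative, not the system's API.

```python
from collections import Counter
from typing import Dict, List, Tuple

Cell = Tuple[str, ...]

# Hypothetical event reports: categorical attributes plus free text.
reports = [
    {"time": "1998", "location": "LAX", "text": "turbulence on approach"},
    {"time": "1998", "location": "SJC", "text": "birds near runway"},
    {"time": "1999", "location": "LAX", "text": "turbulence and overshoot"},
]


def build_cube(reports: List[Dict[str, str]],
               dims: List[str]) -> Dict[Cell, Counter]:
    """Group reports into cells keyed by their dimension values;
    the measure of a cell is the term count of its texts."""
    cube: Dict[Cell, Counter] = {}
    for report in reports:
        key = tuple(report[d] for d in dims)
        cube.setdefault(key, Counter()).update(report["text"].split())
    return cube


def roll_up(cube: Dict[Cell, Counter],
            dims: List[str], drop: str) -> Dict[Cell, Counter]:
    """Aggregate away one dimension by merging the measures of the
    cells that agree on the remaining dimensions."""
    i = dims.index(drop)
    rolled: Dict[Cell, Counter] = {}
    for key, term_counts in cube.items():
        rolled.setdefault(key[:i] + key[i + 1:], Counter()).update(term_counts)
    return rolled


dims = ["time", "location"]
cube = build_cube(reports, dims)
by_location = roll_up(cube, dims, drop="time")
```

Because term counts add up, the measure of a rolled-up cell can be computed from its child cells without rereading the reports; a topic-model measure would need its own aggregation rule.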

Effective OLAP Exploration
• TopCells (ICDE’10): ranking aggregated cells (objects) in TextCube
• TEXplorer (CIKM’11): integrating keyword-based ranking and OLAP exploration
• TEXplorer system example (“Healthcare Reform”): top-1 dimension: Person; top-2 dimension: Org; top-3 dimension: Time (2010, 2008, 2004)

EventCube Snapshot: Query Result

MoveMine: Mining Moving Object Databases
• A system that mines moving object patterns: Z. Li, et al., “MoveMine: Mining Moving Object Databases”, SIGMOD’10 (system demo)

Mining Spatiotemporal and Mobility Data
• Figures: raw movement data (time series view: longitude and latitude vs. time in hours); density map with spots #1 to #4
• Spot #1: Office
• Spot #2: Commuting city
• Spot #3: Home
• Spot #4: Vacation place

Mining Periodicity in Sparse Data [KDD’12]
• Event has a period of 20
• Occurrences of the event happen between 20k+5 and 20k+10
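The example on this slide can be reproduced with a simple folding test: for each candidate period, reduce the observed timestamps modulo that period and measure how many fall into one short window. This is a simplified stand-in, not the measure from the KDD’12 paper. The window width of 6, the candidate range, and the synthetic observations (with every fourth-cycle occurrence dropped to mimic sparse data) are assumptions.

```python
from typing import Iterable, List


def window_score(times: List[int], period: int, width: int) -> float:
    """Fraction of events inside the best circular window of `width`
    residues modulo `period`, minus the fraction a uniform spread
    would put there."""
    residues = [t % period for t in times]
    best = max(
        sum(1 for r in residues if (r - start) % period < width)
        for start in range(period)
    )
    return best / len(times) - width / period


def detect_period(times: List[int], candidates: Iterable[int], width: int) -> int:
    """Candidate period whose folded timestamps are most concentrated."""
    return max(candidates, key=lambda p: window_score(times, p, width))


# Synthetic sparse observations: the event occurs between 20k+5 and 20k+10,
# and the cycles with k % 4 == 1 are not observed at all.
times = [20 * k + 5 + (k % 6) for k in range(40) if k % 4 != 1]

best_period = detect_period(times, range(7, 61), width=6)
```

Subtracting `width / period` penalizes short periods, for which any window trivially covers most residues; without that term, period 10 would tie with period 20 on this data.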

GeoTopic Discovery: Mining Spatial Text
• Geo-tagged photos with landscape (coast vs. desert vs. mountain)
• Methods compared: LDM, TDM, GeoFolk, LGTA
• Z. Yin, et al., GeoTopic Discovery and Comparison, WWW’11

LPTA: Latent Periodic Topic Analysis: Discovery of Temporal Patterns of Topics
• Periodic topic: repeating at regular intervals
• Background topic: covered uniformly over the entire period
• Bursty topic: a transient topic that is intensively covered only in a certain time period
• Integration of both text and time in analysis
• Figure: time distribution of topics

Social Relationship Mining from Sensor Trace Data
• T-Motif: a time interval [S,T] such that
  • many positive pairs meet at that time
  • few negative pairs meet at that time
• Ex.: MIT Reality Mining dataset:
  • 94 people tracked for 10 months
  • Use only spatiotemporal info
• Algorithms for efficient mining of T-motifs and effective classification
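The T-motif definition can be turned into a simple score: the share of positive pairs that meet inside [S,T] minus the share of negative pairs that do. The scoring rule, the meeting hours, and the reading of "positive" as pairs that have the relationship of interest are assumptions for illustration; the slide does not give the actual mining algorithm.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]
Interval = Tuple[int, int]


def meets_in(meeting_hours: List[int], interval: Interval) -> bool:
    """True if the pair has at least one meeting with S <= hour <= T."""
    start, end = interval
    return any(start <= hour <= end for hour in meeting_hours)


def motif_score(meetings: Dict[Pair, List[int]],
                labels: Dict[Pair, bool],
                interval: Interval) -> float:
    """Share of positive pairs meeting in the interval minus the
    share of negative pairs meeting in it."""
    positive = [p for p, is_positive in labels.items() if is_positive]
    negative = [p for p, is_positive in labels.items() if not is_positive]
    pos_rate = sum(meets_in(meetings[p], interval) for p in positive) / len(positive)
    neg_rate = sum(meets_in(meetings[p], interval) for p in negative) / len(negative)
    return pos_rate - neg_rate


# Hypothetical hours of day at which each pair was observed together.
meetings = {
    ("a", "b"): [19, 21],
    ("a", "c"): [10, 20],
    ("b", "d"): [9, 14],
    ("c", "d"): [11],
}
labels = {("a", "b"): True, ("a", "c"): True, ("b", "d"): False, ("c", "d"): False}

candidate_intervals = [(9, 17), (18, 23)]
best_interval = max(candidate_intervals,
                    key=lambda iv: motif_score(meetings, labels, iv))
```

In this toy data the evening interval separates the two groups perfectly, so whether a pair meets in it becomes a useful feature for classifying unlabeled pairs.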

Mining RFID Data to Explore Trajectories
• Warehousing and mining RFID data
• Example path: (Factory, T1, T2) → (Shipping, T3, T4) → (Warehouse, T5, T6) → (Shelf, T7, T8) → (Checkout, T9, T10)

Conclusions
• An Introduction to Data Mining Research Group
• Pattern Discovery Methods
• Mining Heterogeneous Information Networks
• Construction of Heterogeneous Information Networks from Unstructured Data
• TextCube and OLAP heterogeneous networks
• Mining Cyber-Physical Systems and Networks
• Lots to be done in this promising research frontier!
