Data Mining / KDD

Data Mining / KDD Definition: “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad). Let us find something interesting!

Research Focus of UH-DMML: Helping Scientists to Make Sense of their Data. Geographical Information Systems (GIS); Machine Learning; Data Mining; High Performance Computing. Output: Graduated 12 PhD students (5 in 2009-11) and 79 Master's students. Christoph F. Eick

Research Areas and Projects
The Data Mining and Machine Learning Group focuses its research on:
• Spatial Data Mining
• Clustering
• Helping Scientists to Make Sense out of their Data
• Classification and Prediction
Current and Planned Projects:
• Spatial Clustering Algorithms with Plug-in Fitness Functions and Other Non-Traditional Clustering Approaches
• Patch-based Prediction Techniques
• Mining Point of Interest (POI) Datasets and its Application to Urban Computing and Understanding Causes of Alcohol Addiction
• Data Mining with a Lot of Cores
• Educational Data Mining

Mining POI Datasets
Motivation:
• Many POI datasets (e.g. in Google Earth) are becoming available now.
• http://bloomington.in.gov/documents/viewDocument.php?document_id=2455;dir=building/buildingfootprints/shape
• https://data.cityofchicago.org/Buildings/Building-Footprints/w2v3-isjw (Buildings of the City of Chicago, 830,000 polygons)
Challenges:
• Extract valuable knowledge from such datasets: Data Mining
• Facilitate querying and visualization of such datasets: HPC / BigData Initiative

Summarizing the Composition of Spatial Datasets. Given: A spatial dataset which covers an area of interest. Output: A partitioning of the area of interest into uniform regions. Applications: Urban Computing (http://www.cs.uic.edu/~urbcomp2013/index.html) / Alcohol Addiction

Non-Traditional Clustering Algorithms: Clustering Algorithms with Plug-in Fitness Functions; Creating Polygon Models for Spatial Clusters; Mining Spatio-Temporal Datasets; Agglomerative Clustering and Hotspot Discovery Algorithms; Prototype-based Clustering; Parallel Computing; Randomized Hill Climbing with a Lot of Cores

Current Suite of Spatial Clustering Algorithms
• Representative-based: SCEC, CLEVER
• Grid-based: SCMRG,…
• Agglomerative: MOSAIC
• Density-based: DCONTOUR (not really plug-in, but some fitness functions can be simulated)
Remark: All algorithms partition a dataset into clusters by maximizing a reward-based, plug-in fitness function.
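
The remark on this slide can be illustrated with a minimal sketch of a reward-based, plug-in fitness function. The additive form (sum of reward(c) times |c| to the power beta), the purity-based reward, the threshold, and all names below are assumptions made for illustration; the slides do not specify the actual functions used by SCEC, CLEVER, SCMRG, or MOSAIC.

```python
from collections import Counter


def purity_reward(cluster, threshold=0.5):
    """Hypothetical plug-in reward: how far class purity exceeds a threshold."""
    if not cluster:
        return 0.0
    majority_count = Counter(cluster).most_common(1)[0][1]
    purity = majority_count / len(cluster)
    return max(0.0, purity - threshold)


def fitness(clustering, reward=purity_reward, beta=1.01):
    """Assumed additive form: sum of reward(c) * |c|**beta over all clusters.

    `reward` is the plug-in part: any function mapping a cluster to a
    non-negative number can be passed in without changing the algorithm.
    """
    return sum(reward(c) * len(c) ** beta for c in clustering)


# Each cluster is represented here only by the class labels of its objects.
pure_split = [["a", "a"], ["b", "b"]]
mixed = [["a", "a", "b", "b"]]
print(fitness(pure_split), fitness(mixed))
```

A clustering algorithm built this way only calls `fitness`, so swapping the reward function changes what counts as an interesting cluster without touching the search procedure.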

MOSAIC: a Clustering Algorithm that Supports Plug-in Fitness Functions. MOSAIC supports plug-in fitness functions and provides a generic framework that integrates representative-based clustering, agglomerative clustering, and proximity graphs; it approximates arbitrary-shape clusters using unions of small convex polygons. Fig. 6: An illustration of MOSAIC’s approach, (a) input and (b) output.
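
The idea of agglomerative merging driven by a plug-in fitness function can be sketched as follows. This is not MOSAIC itself: the greedy best-pair strategy, the explicit neighbor list standing in for a proximity graph, the toy fitness function, and all names are assumptions for illustration only.

```python
def fitness(clustering, beta=1.5):
    """Toy stand-in fitness: rewards large, label-pure clusters."""
    total = 0.0
    for cluster in clustering:
        purity = max(cluster.count(x) for x in set(cluster)) / len(cluster)
        total += max(0.0, 2 * purity - 1) * len(cluster) ** beta
    return total


def greedy_merge(clusters, neighbors, fitness_fn=fitness):
    """Repeatedly merge the neighboring pair that improves fitness the most.

    clusters:  dict mapping a cluster id to a list of class labels
    neighbors: iterable of id pairs that are allowed to merge
               (a stand-in for the edges of a proximity graph)
    """
    clusters = {k: list(v) for k, v in clusters.items()}
    neighbors = {frozenset(p) for p in neighbors}
    current = fitness_fn(list(clusters.values()))
    while True:
        best = None
        for pair in neighbors:
            a, b = sorted(pair)
            trial = [v for k, v in clusters.items() if k not in pair]
            trial.append(clusters[a] + clusters[b])
            score = fitness_fn(trial)
            if score > current and (best is None or score > best[0]):
                best = (score, a, b)
        if best is None:  # no merge improves the fitness: stop
            return list(clusters.values()), current
        current, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
        # Redirect edges of the absorbed cluster b to a, drop self-loops.
        neighbors = {frozenset(a if k == b else k for k in p) for p in neighbors}
        neighbors = {p for p in neighbors if len(p) == 2}


initial = {0: ["a", "a"], 1: ["a"], 2: ["b", "b"]}
merged, score = greedy_merge(initial, [(0, 1), (1, 2)])
print(merged, score)
```

In this toy run the two "a" clusters are merged, while merging with the "b" cluster is rejected because it would lower the fitness.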

Patch-based Prediction Techniques • New Algorithms for Regression Tree Induction • New Decision Tree Induction Algorithms • Multi-Target Regression • Spatial Prediction Techniques

Helping Scientists to Make Sense Out of their Data. Figure 1: Co-location regions involving deep and shallow ice on Mars. Figure 2: Interestingness hotspots where both income and CTR are high. Figure 3: Mining hurricane trajectories.

Other Unassigned Research Topics • Trajectory Classification and Prediction • Collocation Mining • Creating Parallel Versions of Existing Clustering Algorithms • Models for the Evolution of Spatial Datasets • Hierarchical Learning Algorithms • … Figure: Ozone Hotspot Evolution (panel labels: 5p, 3p, 7p, ?)

UH-DMML Mission Statement
The Data Mining and Machine Learning Group at the University of Houston aims to develop data analysis, data mining, and machine learning techniques and to apply them to challenging problems in geology, astronomy, urban computing, ecology, environmental sciences, web advertising, and medicine. The group has a strong background in clustering and spatial data mining. Current research areas include: clustering algorithms with plug-in fitness functions, association analysis, mining related spatial datasets, patch-based prediction techniques, summarizing the composition of spatial datasets, change and progression analysis, and data mining with a lot of cores.
Website: http://www2.cs.uh.edu/~UH-DMML/index.html
Research Group Publications: http://www2.cs.uh.edu/~ceick/pub.html
Data Mining Course Website: http://www2.cs.uh.edu/~ceick/DM/DM.html
Machine Learning Course Website: http://www2.cs.uh.edu/~ceick/ML/ML.html

Reading Material
Urban Computing/Spatial Clustering: SIGKDD Urban Computing Workshop 2013 Paper
Agglomerative Clustering: R. Jiamthapthaksin, C. F. Eick, and S. Lee, GAC-GEO: A Generic Agglomerative Clustering Framework for Geo-referenced Datasets, in Knowledge and Information Systems (KAIS).
Patch-based Prediction Techniques: MLDM 2013 Paper, ACM-GIS 2010 Paper
Data Mining with a Lot of Cores: ParCo 2011 Paper
GIS/Creating Polygon Models: ACM-GIS 2013 Submission
Machine Learning Course Website: http://www2.cs.uh.edu/~ceick/ML/ML.html
Collocation Mining: ACM-GIS 2008 Paper
Spatial Clustering and Association Analysis: W. Ding, C. F. Eick, X. Yuan, J. Wang, and J.-P. Nicot, A Framework for Regional Association Rule Mining and Scoping in Spatial Datasets, Geoinformatica (2011) 15:1-28, DOI 10.1007/s10707-010-0111-6, January 2011.
Supervised Clustering: TAI 2005 Paper

What Courses Should You Take to Conduct Research in this Research Group? Data Mining, Machine Learning, Parallel Programming, AI, Software Design, Data Structures, Databases, Visualization, Evolutionary Computing, Image Processing, GIS courses, Geometry, Optimization.

Some UH-DMML Graduates (1)
Tae-wan Ryu, Professor, Department of Computer Science, California State University, Fullerton
Dr. Wei Ding, Assistant Professor, Department of Computer Science, University of Massachusetts, Boston
Sharon M. Tuttle, Professor, Department of Computer Science, Humboldt State University, Arcata, California

Some UH-DMML Graduates (2)
Ruth Miller, PhD: Washington University in St. Louis, Postdoc at the Midwest Alcohol Research Center, Department of Psychiatry; Adjunct Instructor, Department of Computer Science
Chun-sheng Chen, PhD: TidalTV, Baltimore (an internet advertising company)
Rachsuda Jiamthapthaksin, PhD: Lecturer, Assumption University, Bangkok, Thailand
Justin Thomas, MS: Section Supervisor at Johns Hopkins University Applied Physics Laboratory
Mei-kang Wu, MS: Microsoft, Bellevue, Washington
Jing Wang, MS: AOL, California

Models for Progression of Hotspots and Other Spatial Objects. Figures: Ozone Hotspot Evolution (panel labels: 5p, 3p, 7p, ?); Building Evolution; Progression of Glaucoma.

Mining Related Datasets Using Polygon Analysis
Work on a methodology that does the following:
• Generate polygons from spatial cluster extensions / from continuous density or interpolation functions.
• Meta cluster polygons / sets of polygons
• Extract interesting patterns / create summaries from polygonal meta clusters
Examples: Analysis of Glaucoma Progression; Analysis of Ozone Hotspots

Clustering and Hotspot Discovery in Labeled Graphs
Potential problems to be investigated:
1. Clustering Proteins Based on Their Interactions
2. Generalize the Region Discovery Framework to Graph Partitioning Using Plug-in Interestingness Functions
3. …
4. …

Methodologies and Tools to Analyze and Mine Related Datasets
Subtopics:
• Disparity Analysis / Emergent Pattern Discovery (“how do two groups differ with respect to their patterns?”) [SDE10]
• Change Analysis (“what is new/different?”) [CVET09]
• Correspondence Clustering (“mining interesting relationships between two or more datasets”) [RE10]
• Meta Clustering (“cluster the cluster models of multiple datasets”)
• Analyzing Relationships between Polygonal Cluster Models
Example: Analyze changes with respect to regions of high variance of earthquake depth (Time 1 vs. Time 2). Emerging regions are based on the novelty change predicate: Novelty(r’) = r’ − (r1 ∪ … ∪ rk)
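
The novelty change predicate on this slide can be sketched in a few lines, assuming it denotes the part of a new region r’ that is not covered by any of the earlier regions r1 … rk. Representing regions as sets of grid cells is an assumption made for illustration; the function and variable names are hypothetical.

```python
def novelty(new_region, old_regions):
    """Return the cells of new_region not covered by any old region.

    new_region:  set of grid cells, e.g. (row, col) tuples, at Time 2
    old_regions: iterable of such sets, the regions found at Time 1
    """
    covered = set().union(*old_regions)
    return set(new_region) - covered


old = [{(0, 0), (0, 1)}, {(3, 3)}]          # regions r1, r2 at Time 1
new = {(0, 1), (1, 1), (1, 2)}              # region r' at Time 2
emerging = novelty(new, old)
print(sorted(emerging))
```

A non-empty result marks the part of r’ that emerged between the two time frames; with polygonal regions the same idea would use a polygon difference instead of a set difference.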

Mining Spatial Trajectories
• Goal: Understand and characterize motion patterns
• Themes investigated: clustering and summarization of trajectories, classification based on trajectories, likelihood assessment of trajectories, prediction of trajectories.
Figures: Arctic Tern; Arctic Tern Migration; Hurricanes in the Gulf of Mexico

Current UH-DMML Activities: Mining Related Datasets & Polygon Analysis; Regional Knowledge Extraction; Cluster Correspondence Analysis; Yahoo! User Modeling; Strasbourg Building Evolution; Understanding Glaucoma; Knowledge Scoping; POLY/TRAJ-SNN; Regional Association Analysis; Discrepancy Mining; Polygonal Meta Clustering; Air Pollution Analysis; Parallel CLEVER; TRAJ-CLEVER; Poly-CLEVER; Regional Regression; Classification; Clustering; Cluster Polygon Generation; SCMRG; Sub-Trajectory Mining; Trajectory Density Estimation; MOSAIC; Repository Clustering; Trajectory Mining; Animal Motion Analysis; Cougar^2; Spatial Clustering Algorithms with Plug-in Fitness Functions

Extracting Regional Knowledge from Spatial Datasets
Application 1: Supervised Clustering [EVJW07]
Application 2: Regional Association Rule Mining and Scoping [DEWY06, DEYWN07]
Application 3: Find Interesting Regions with respect to Continuous Variables [CRET08]
Application 4: Regional Co-location Mining Involving Continuous Variables [EPWSN08]
Application 5: Find “representative” regions (Sampling)
Application 6: Regional Regression [CE09]
Application 7: Multi-Objective Clustering [JEV09]
Application 8: Change Analysis in Spatial Datasets [RE09]
Figure: Wells in Texas, RD-Algorithm with b=1.01 and b=1.04 (green: safe well with respect to arsenic; red: unsafe well)

A Framework for Extracting Regional Knowledge from Spatial Datasets
Objective: Develop and implement an integrated framework to automatically discover interesting regional patterns in spatial datasets.
Diagram (Framework for Mining Regional Knowledge): Domain Experts; Spatial Databases; Regional Knowledge; Integrated Data Set; Regional Association Rule Mining Algorithms; Measures of Interestingness; Fitness Functions; Family of Clustering Algorithms; Ranked Set of Interesting Regions and their Properties
Figures: Hierarchical Grid-based & Density-based Algorithms; Spatial Risk Patterns of Arsenic

REG^2: a Regional Regression Framework
• Motivation: Regression functions vary over space rather than being constant.
• Goal: To discover regions with strong relationships between dependent and independent variables and to extract their regional regression functions.
• Clustering algorithms with plug-in fitness functions are employed to find such regions; the employed fitness functions reward regions with a low generalization error.
• Various schemes are explored to estimate the generalization error: example weighting, regularization, penalizing model complexity, using validation sets,…
Figures: Discovered Regions and Regression Functions; REG^2 Outperforms Other Models in SSE_TR; Regularization Improves Prediction Accuracy
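
The motivation for regional regression can be illustrated with a small sketch that compares one global linear fit with one fit per region. This is not the REG^2 implementation: the regions are given rather than discovered, the error is the plain training sum of squared errors rather than an estimate of the generalization error, and the data and names are made up for illustration.

```python
def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def sse(points, model):
    """Sum of squared errors of a fitted line on the given points."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)


# Two hypothetical regions whose x-y relationship differs.
region_a = [(0, 0), (1, 2), (2, 4)]      # y = 2x
region_b = [(0, 10), (1, 9), (2, 8)]     # y = 10 - x
everything = region_a + region_b

global_sse = sse(everything, fit_line(everything))
regional_sse = sum(sse(r, fit_line(r)) for r in (region_a, region_b))
print(global_sse, regional_sse)
```

A fitness function in the spirit of the slide would reward a region when its own regression function predicts well, for example by using a held-out validation error in place of the training error computed here.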

Finding Regional Co-location Patterns in Spatial Datasets
Objective: Find co-location regions using various clustering algorithms and novel fitness functions.
Applications:
1. Finding regions on planet Mars where shallow and deep ice are co-located, using point and raster datasets. In Figure 1, regions in red have very high co-location and regions in blue have anti co-location.
2. Finding co-location patterns involving chemical concentrations with values on the wings of their statistical distribution in Texas’ ground water supply. Figure 2 indicates discovered regions and their associated chemical patterns.
Figure 1: Co-location regions involving deep and shallow ice on Mars. Figure 2: Chemical co-location patterns in Texas water supply.
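
One simple way to score co-location of two continuous variables inside a region is sketched below. It assumes the score is the mean product of z-scores over the objects in the region, so that jointly high (or jointly low) values give a positive score and opposite values give a negative one. This measure, the data, and all names are illustrative assumptions and not necessarily the fitness functions used in the work described on the slide.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values using the mean and population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def colocation_score(values_a, values_b, region):
    """Mean product of z-scores of two variables over the objects in a region.

    region is a list of object indices; positive scores suggest co-location,
    negative scores suggest anti co-location.
    """
    za, zb = z_scores(values_a), z_scores(values_b)
    return mean([za[i] * zb[i] for i in region])


# Hypothetical measurements at six locations; indices 3..5 form one region.
shallow = [1, 2, 3, 10, 11, 12]
deep = [2, 1, 3, 12, 10, 11]
other = [12, 11, 10, 3, 2, 1]
region = [3, 4, 5]

print(colocation_score(shallow, deep, region))
print(colocation_score(shallow, other, region))
```

Plugged into a clustering algorithm as the reward, a score of this kind would favor regions in which both variables lie on the same wing of their distributions.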
