Data Mining Research and Applications

Presentation Transcript

  1. Data Mining Research and Applications. Workshop on Cyberinfrastructure for Environmental Research and Education, October 31, 2002. Steve Tanner, Information Technology and Systems Center, University of Alabama in Huntsville. stanner@itsc.uah.edu, 256.824.5143, www.itsc.uah.edu

  2. Key Questions: • What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure? • What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens? • How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system? • How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?

  3. Data Mining • Data Mining is an interdisciplinary field drawing from areas such as statistics, machine learning, pattern recognition and others • Automated discovery of patterns, anomalies, etc. from vast observational and model data sets • Derived knowledge for decision making, predictions and disaster response • ADaM – Algorithm Development and Mining System datamining.itsc.uah.edu

  4. Techniques used for Data Mining. Data Mining systems usually involve a toolbox of many different techniques and a means for combining them • Clustering Techniques: K Means, Isodata, Maximum • Pattern Recognition: Bayes Classifier, Minimum Distance Classifier • Image Analysis: Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations • Genetic Algorithms • Neural Networks • Etc.
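
As an illustration of the first clustering technique listed on this slide, here is a minimal K Means sketch in plain Python. It is not ADaM code: the function name `k_means`, the 2-D points, and the starting centers are assumptions chosen so the example is easy to follow.

```python
def k_means(points, centers, iterations=10):
    """Cluster 2-D points around the given starting centers (Lloyd iterations)."""
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        labels = [
            min(
                range(len(centers)),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, labels

# Made-up sample data: two well-separated groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers, labels = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

The starting centers are passed in explicitly so the run is deterministic; production toolkits typically pick them randomly or with a seeding heuristic.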

  5. Typical Everyday Encounters with Data Mining • Google: complex algorithm sequence to decide order • Amazon.com: additional purchase suggestions • Credit card fraud: event notification of odd usage. Most current Data Mining applications are text based. Text provides an easily readable source of heterogeneous data. Mining of scientific data sets is more complex.

  6. User Perspective and Data Perspective of the Data Mining Process [Figure labels, User Perspective: Preprocessing, Transformation, Analysis, Decision; Information, Knowledge; Value, Volume. Data Perspective: Data Stores, Dataset, Data, Calibration & Navigation, Dataset Specific Algorithms, Domain Specific Algorithms]

  7. Data Mining and Scientific Analysis. Data Mining: • Provides automation of the analysis process • Can be used for dimensionality reduction when manual examination of data is impossible • Can have limitations • May not utilize domain knowledge • May be difficult to prove validity of the results • There may not be a physical basis • Should be viewed as a complementary tool and not a replacement for scientific analysis. Scientific Analysis: • Harnesses human analysis capabilities • Highly creative • Based on theory and hypothesis formulation • Physical basis is normally used for algorithms • Drawing insights about the underlying phenomena • Rapidly widening gap between data collection capabilities and the ability to analyze data • Potential for vast amounts of data to go unused

  8. Similarity between Data Mining and Scientific Analysis Process

  9. Mining Environments. Mining Framework (ADaM): • Complete System (Client and Engine) • Mining Engine (User provides its own client) • Application Specific Mining Systems • Operations Tool Kit • Stand Alone Mining Algorithms • Data Fusion. Distributed/Federated Mining: • Distributed services • Distributed data • Chaining using Interchange Technologies. On-board Mining (EVE): • Real time and distributed mining • Processing environment constraints

  10. Using the Mining Framework: Focusing on the information in data

  11. The ADaM Processing Model. Processing stages: Input, Preprocessing, Analysis, Output. Data flow: Raw Data -> Translated Data -> Preprocessed Data -> Patterns/Models -> Results • Input: PIP-2, SSM/I Pathfinder, SSM/I TDR, SSM/I NESDIS Lvl 1B, SSM/I MSFC Brightness Temp, US Rain, Landsat, ASCII Grass Vectors (ASCII Text), HDF, HDF-EOS, GIF, Intergraph Raster, Others... • Preprocessing: Selection and Sampling (Subsetting, Subsampling, Select by Value, Coincidence Search); Grid Manipulation (Grid Creation, Bin Aggregate, Bin Select, Grid Aggregate, Grid Select, Find Holes); Image Processing (Cropping, Inversion, Thresholding); Others... • Analysis: Clustering (K Means, Isodata, Maximum); Pattern Recognition (Bayes Classifier, Min. Dist. Classifier); Image Analysis (Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations); Genetic Algorithms; Neural Networks; Others... • Output: GIF Images, HDF Raster Images, HDF Scientific Data Sets, HDF-EOS, Polygons (ASCII, DXF), SSM/I MSFC Brightness Temp, TIFF Images, GeoTIFF, Others...

  12. Iterative Nature of the Data Mining Process [Figure labels: Evaluation and Presentation; Knowledge; Discovery; Mining; Selection and Transformation; Cleaning and Integration; Preprocessing; Data]

  13. Distributed/Federated Mining: Meshing data and algorithms to generate knowledge

  14. ADaM: Mining Environment for Scientific Data • The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata. • Contains over 120 different operations • Operations range from specialized, data-set-specific science algorithms to digital image processing techniques and processing modules for automatic pattern recognition, machine perception, neural networks, genetic algorithms and others

  15. Classification Based on Texture Features and Edge Density • Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds • Comparison based on • Accuracy of detection • Amount of time required to classify Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery
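
The slide does not say which texture features were used. As one hedged illustration, the sketch below computes a gray-level co-occurrence matrix (one of the image analysis operations listed earlier in the deck) and derives a contrast measure from it. The function names and the tiny sample images are assumptions for illustration only.

```python
def cooccurrence_matrix(image, levels, dx=1, dy=0):
    """Count how often gray level i has gray level j at offset (dx, dy)."""
    glcm = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                glcm[image[r][c]][image[r2][c2]] += 1
    return glcm

def contrast(glcm):
    """Contrast = sum over (i, j) of p(i, j) * (i - j)^2; larger for rougher texture."""
    total = sum(sum(row) for row in glcm)
    return sum(
        count / total * (i - j) ** 2
        for i, row in enumerate(glcm)
        for j, count in enumerate(row)
    )

# Made-up 4-level images: a smooth field and a speckled, cumulus-like field.
smooth = [[1, 1, 1, 1],
          [1, 1, 1, 1],
          [2, 2, 2, 2],
          [2, 2, 2, 2]]
speckled = [[0, 3, 0, 3],
            [3, 0, 3, 0],
            [0, 3, 0, 3],
            [3, 0, 3, 0]]
smooth_contrast = contrast(cooccurrence_matrix(smooth, levels=4))
speckled_contrast = contrast(cooccurrence_matrix(speckled, levels=4))
```

With the default horizontal offset, the smooth field has zero contrast while the speckled field has a high value, which is the kind of signature a texture-based classifier could use.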

  16. Parallel Version of Cloud Extraction • GOES images can be used to recognize cumulus cloud fields • Cumulus clouds are small and do not show up well in 4 km resolution IR channels • Detection of cumulus cloud fields in GOES can be accomplished by using texture features or edge detectors • Three edge detection filters are used together to detect cumulus clouds, an approach that lends itself to implementation on a parallel cluster [Figure labels: Master, Slave 1, Slave 2, Slave 3; GOES Image -> Laplacian Filter, Sobel Horizontal Filter, Sobel Vertical Filter -> Energy Computation -> Classifier -> Cumulus Cloud Mask]
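
The filter-then-classify flow on this slide can be sketched as follows. Several parts are assumptions, not details from the presentation: "energy" is taken to be the squared filter response, the classifier is replaced by a simple threshold on the summed energy, and a thread pool stands in for the worker nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_HORIZONTAL = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # responds to horizontal edges
SOBEL_VERTICAL = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # responds to vertical edges

def filter_energy(image, kernel):
    """Apply a 3x3 kernel to interior pixels and return the squared response."""
    rows, cols = len(image), len(image[0])
    energy = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            response = sum(
                kernel[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3) for j in range(3)
            )
            energy[r][c] = response * response
    return energy

def cumulus_mask(image, threshold):
    """Run the three filters concurrently, then threshold the summed energy."""
    kernels = [LAPLACIAN, SOBEL_HORIZONTAL, SOBEL_VERTICAL]
    with ThreadPoolExecutor(max_workers=3) as pool:
        energies = list(pool.map(lambda k: filter_energy(image, k), kernels))
    rows, cols = len(image), len(image[0])
    return [
        [1 if sum(e[r][c] for e in energies) > threshold else 0 for c in range(cols)]
        for r in range(rows)
    ]

# Made-up test images: a flat field and the same field with one bright pixel.
flat = [[10] * 5 for _ in range(5)]
spike = [row[:] for row in flat]
spike[2][2] = 50
flat_mask = cumulus_mask(flat, threshold=1000)
spike_mask = cumulus_mask(spike, threshold=1000)
```

Because the three filters are independent of one another, each can run on its own worker and only the energy maps need to be gathered for classification. On a real cluster, separate processes or nodes would replace the threads used here.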

  17. Automated Data Analysis for Boundary Detection and Quantification • Analysis of polar cap auroras in large volumes of spacecraft UV images • Science Rationale: Indicators to predict geomagnetic storms, which can • Damage satellites • Disrupt radio connections • Developing different mining algorithms to detect and quantify the polar cap boundary [Figure label: Polar Cap Boundary]

  18. Detecting Signatures • Science Rationale: Mesocyclone signatures in Radar data are indicators of Tornadic activity • Developing an algorithm based on wind velocity shear signatures • Improve accuracy and reduce false alarm rates

  19. Genetic Subtyping Using Hierarchical Clustering • Biologists are interested in comparing DNA sequences to see how closely related they are to one another • Phylogenetic trees are constructed by performing hierarchical clustering on DNA sequences using genetic distance as a distance measure • Such trees show which organisms are most likely to share common ancestors, and may provide information about how various subtypes of organisms evolved • This information is useful when studying disease-causing organisms such as viruses and bacteria, because genetically similar types should behave in similar ways
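
A minimal sketch of the tree-building idea described on this slide. The slide does not name the genetic distance or the linkage rule, so two common simple choices are assumed here: the p-distance (fraction of differing positions in aligned, equal-length sequences) and average linkage. The sequences are made up.

```python
def p_distance(a, b):
    """Fraction of aligned positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def hierarchical_clustering(sequences):
    """Average-linkage agglomerative clustering; returns a nested-tuple tree."""
    # Each cluster is (tree, member_names); start with one cluster per sequence.
    clusters = [(name, [name]) for name in sequences]

    def linkage(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(p_distance(sequences[a], sequences[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Made-up aligned sequences: A and B are near-identical, as are C and D.
sequences = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "TTGTACCTGC",
    "D": "TTGTACCTGA",
}
tree = hierarchical_clustering(sequences)
```

The nested tuples mirror the branching of a phylogenetic tree: sequences merged earliest are the most similar under the chosen distance.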

  20. Mining on Data Ingest: Tropical Cyclone Detection. Advanced Microwave Sounding Unit (AMSU-A) Data • Mining Plan: • Water cover mask to eliminate land • Laplacian filter to compute temperature gradients • Science algorithm to estimate wind speed • Contiguous regions with wind speeds above a desired threshold identified • Additional test to eliminate false positives • Maximum wind speed and location produced. Results are placed on the web, made available to the National Hurricane Center and the Joint Typhoon Warning Center, and stored for further analysis. pm-esip.msfc.nasa.gov/ [Figure labels: Data Archive; Calibration / Limb Correction / Converted to Tb; Mining Environment; Result (Hurricane Floyd); Knowledge Base; Further Analysis]
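
The later steps of the mining plan (mask, threshold, contiguous regions, false-positive test, maximum wind and location) can be sketched as below. The wind-speed grid is taken as an input because the science algorithm that produces it is not described in the slides. The `min_cells` size test is a hypothetical stand-in for the unspecified false-positive check, and all values are made up.

```python
def find_cyclone_candidates(wind_speed, is_water, threshold, min_cells=2):
    """Return (max_speed, (row, col)) for each contiguous over-water region above threshold."""
    rows, cols = len(wind_speed), len(wind_speed[0])
    seen = [[False] * cols for _ in range(rows)]
    results = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0] or not is_water[r0][c0] or wind_speed[r0][c0] <= threshold:
                continue
            # Flood fill over 4-connected neighbors to collect one region.
            stack, region = [(r0, c0)], []
            seen[r0][c0] = True
            while stack:
                r, c = stack.pop()
                region.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols and not seen[nr][nc]
                            and is_water[nr][nc] and wind_speed[nr][nc] > threshold):
                        seen[nr][nc] = True
                        stack.append((nr, nc))
            # Hypothetical false-positive test: drop regions smaller than min_cells.
            if len(region) >= min_cells:
                peak = max(region, key=lambda rc: wind_speed[rc[0]][rc[1]])
                results.append((wind_speed[peak[0]][peak[1]], peak))
    return results

# Made-up grid: one real region, one land pixel, one isolated pixel.
wind = [
    [5, 5, 5, 5, 40],
    [5, 35, 45, 5, 5],
    [5, 38, 60, 5, 5],
    [5, 5, 5, 5, 36],
]
water = [[True] * 5 for _ in range(4)]
water[0][4] = False  # the 40 in the top-right corner is over land
candidates = find_cyclone_candidates(wind, water, threshold=33)
```

In this example the land pixel is masked out, the isolated single-cell reading is rejected by the size test, and the remaining region yields its peak wind speed and grid location.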

  21. Multiple Mining Environments: Passive Microwave ESIP Information System [Figure labels: Data Ingest & Processing; AMSU-A Ingest; TMI Ingest and Product Generation; AMSU Product Generation; ADaM-based Processing; ADaM Servers (Input, Process, Output; Subset/Grid/Format); Custom Processing; Order Staging; PM-ESIP Catalog; Distributed Data Stores (TMI, AMSU-A, SSM/I, SSM/T2); Web Interfaces & Applications; Visualization & Exploration; Temperature Trends; STT Application; Data Ordering; FTP; AMSU-A Images; Cyclone Winds]

  22. Interoperability: Accessing Heterogeneous Data. The Problem: • Science data comes in: • Different formats, types and structures • Different states of processing (raw, calibrated, derived, modeled or interpreted) • Enormous volumes • Heterogeneity leads to data usability problems • One approach: Standard data formats • Difficult to implement and enforce • Can’t anticipate all needs • Some data can’t be modeled or is lost in translation • The cost of converting legacy data. The Solution: • A better approach: Interchange Technologies • Earth Science Markup Language (ESML) [Figure, problem: DATA FORMAT 1, 2, 3 -> FORMAT CONVERTER, READER 1, READER 2 -> APPLICATION] [Figure, solution: DATA FORMAT 1, 2, 3, each paired with an ESML FILE -> ESML LIBRARY -> APPLICATION]
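
The interchange idea on this slide is that an external description of each file, plus one shared library, replaces per-format readers and converters. The sketch below illustrates that idea only. The dictionary keys and the `read_with_description` function are hypothetical and do not reflect actual ESML syntax or the ESML library API.

```python
import struct

def read_with_description(raw, description):
    """Decode raw bytes into a list of floats, guided only by the description."""
    if description["encoding"] == "ascii":
        text = raw.decode("ascii")
        return [float(token) for token in text.split(description["delimiter"])]
    if description["encoding"] == "binary":
        # byte_order is a struct prefix: "<" little-endian, ">" big-endian.
        fmt = description["byte_order"] + description["type_code"]
        return [value for (value,) in struct.iter_unpack(fmt, raw)]
    raise ValueError("unsupported encoding: " + description["encoding"])

# Two made-up "files" holding the same values in different formats,
# each paired with its own (hypothetical) description.
ascii_file = b"271.5,272.0,268.25"
ascii_description = {"encoding": "ascii", "delimiter": ","}

binary_file = struct.pack(">3f", 271.5, 272.0, 268.25)
binary_description = {"encoding": "binary", "byte_order": ">", "type_code": "f"}

from_ascii = read_with_description(ascii_file, ascii_description)
from_binary = read_with_description(binary_file, binary_description)
```

The application calls a single function for both files. Supporting a new format means writing a new description rather than a new reader, and the original data files are left unconverted.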

  23. Chained Image Processing Services • Service chaining is used to integrate modules, or services, developed on distributed platforms and in different languages into a single processing solution. [Figure labels: WMS (Java/Windows); Chained Services: Format (Perl/Linux), Resample (Perl/C, Linux), GeoCrop (Perl/Linux), Draw Image (Perl/C, Linux); Data Streams; Data Reader (Java/C+, Windows); ESML; ESML Lib; Data Files; Knowledge Base]
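
Service chaining amounts to feeding the output of one service into the next. In the sketch below, local Python functions stand in for the remote services named in the figure. The order of the stages and what each stage does to the data are assumptions, since the transcript does not preserve the diagram's arrows.

```python
from functools import reduce

def chain(*services):
    """Compose services so the output of each becomes the input of the next."""
    return lambda data: reduce(lambda value, service: service(value), services, data)

# Stand-in services operating on a grid (list of rows). The services in the
# presentation run on separate hosts and are written in different languages.
def resample(grid):
    return [row[::2] for row in grid[::2]]   # keep every second row and column

def geo_crop(grid):
    return [row[:2] for row in grid[:2]]     # keep the top-left 2x2 window

def format_as_text(grid):
    return "\n".join(" ".join(str(v) for v in row) for row in grid)

pipeline = chain(resample, geo_crop, format_as_text)
grid = [[r * 10 + c for c in range(6)] for r in range(6)]
output = pipeline(grid)
```

Because each stage only depends on the data handed to it, stages can be reordered, swapped, or hosted elsewhere without changing the others.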

  24. Data Integration using Web Mapping Services • AMSU-A data overlaid with MCS and Cyclone events for September 2000, merged with world boundaries from Globe. [Figure labels: Countries; Coastlines; Cyclone Events; MCS Events; AMSU-A Channel 01; Knowledge Base; AMSU-A; ITSC; Globe]

  25. Fused Displays from Multiple Servers Analysis: Correlate MCSs and cyclones with atmospheric temperatures for September 2000.

  26. Multi-Level Mining: Concept Hierarchy for Data Mining and Fusion [Figure labels: Concept Mining; Decision Support; Conceptual Level: Event A, Event B; Feature Set I: Feature I, Feature II, Feature III; Feature X, Feature Y; Data File Level: Model and Observation Data]

  27. On-Board Real-Time Processing Sensor Control/Targeting EVE – Environment for On-board Processing • Anomaly detection • Data Mining • Autonomous Decision Making • Immediate response • Direct satellite to Earth delivery of results www.itsc.uah.edu/eve

  28. A Reconfigurable Web of Interacting Sensors [Figure labels: Communications; Weather; Satellite Constellations; Military; Ground Network (three instances)]

  29. Example Plan: Threshold events in AMSU-A Streaming Data EVE
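
The transcript keeps only the title of this example plan. As a hedged illustration of what thresholding events in streaming data might look like, the generator below reports each upward crossing of a threshold as readings arrive. The readings and the threshold value are made up, and `threshold_events` is not an EVE function.

```python
def threshold_events(stream, threshold):
    """Yield (index, value) each time the stream rises above the threshold."""
    above = False
    for index, value in enumerate(stream):
        if value > threshold and not above:
            yield index, value            # report only the upward crossing
        above = value > threshold

# Made-up readings standing in for a stream of sensor values.
readings = [250.1, 251.0, 262.4, 265.0, 255.2, 249.9, 261.3]
events = list(threshold_events(readings, threshold=260.0))
```

Processing one value at a time with constant memory suits the on-board setting described earlier in the deck, where processing resources are constrained and results should be available immediately.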

  30. Data Integration and Mining: From Global Information to Local Knowledge • Emergency Response • Precision Agriculture • Urban Environments • Weather Prediction

  31. Key Questions: • What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure? • What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens? • How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system? • How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?