Distributed Data Mining Research in the NASA Intelligent Systems Program

NASA AISRP Meeting, NASA Ames, Moffett Field, CA, April 4 – 6, 2005
Kirk D. Borne
School of Computational Sciences, George Mason University, Fairfax, Virginia
kborne@gmu.edu

Outline
• What is ISP?
• Short Descriptions
• Space Science Project
• Earth Science Project

What is (was) ISP?
• The NASA Code R / Ames ISP (Intelligent Systems Program) had 2 components:
  – Open competition (low TRL = pure research; TRL = Technology Readiness Level)
  – NASA Center-to-Center Mission Infusion Tasks (mid-TRL = development and application)
• Mission Infusion Teams were tasked with taking low-TRL technology to mid-TRL by infusing it into a mission, project, or existing program.

Funded ISP Mission Infusion Projects
• Two funded projects (both are now completed):
  – Distributed Data Mining in the NVO*
      PI: K. Borne, GMU
      Co-I: Cynthia Cheung, NASA
      Collaborator: Hillol Kargupta, UMBC
      Space Science infusion task
  – Automated Wildfire Detection (and Prediction) through Artificial Neural Networks
      PI: Jerry Miller, NASA
      Co-I: K. Borne, GMU
      Collaborators: NOAA-NESDIS staff
      Earth Science infusion task
• Projects were funded by the IDU (Intelligent Data Understanding) component of ISP
*NVO = National Virtual Observatory

AISRP Projects – still more data mining
• Machine Learning and Data Mining for Automatic Detection and Interpretation of Solar Events
  – Develop an automatic system for CME detection, tracking, characterization, and source region location
  – Discover specific associations between solar events (CME) and Earth events (Space Weather)
  – PI: Art Poland, GMU
  – Co-I’s: Jie Zhang, K. Borne, Harry Wechsler
• Novel Approaches to Semi-supervised Data Exploration
  – Develop an efficient and effective automated system for astronomical object classification, with an emphasis on star-galaxy discrimination and morphological galaxy classification
  – PI: David Bazell, Eureka
  – Co-I’s: K. Borne (GMU), David Miller (Penn St.)

Short Descriptions of ISP Projects
• Distributed Data Mining (DDM) in the NVO
  – Search for examples of interacting/colliding/merging galaxies across multiple distributed databases
  – Apply Distributed Learning
  – Apply Distributed Classification
  – Use DDM algorithms being developed by UMBC group
  – Apply algorithms within NVO data environment
• Automated Wildfire Detection (and Prediction) through Artificial Neural Networks (ANN)
  – Identify all wildfires in Earth-observing satellite images
  – Train ANN to mimic human analysts’ classifications
  – Apply ANN to new data (from 3 remote-sensing satellites: GOES, AVHRR, MODIS)
  – Extend NOAA fire product from USA to the whole Earth

Searching, retrieving, mining, integrating, and analyzing geographically distributed data repositories is one of the major challenges in data mining today

Why so many telescopes and databases? … Because …
Many great astronomical discoveries have come from inter-comparisons of various wavelengths:
• Quasars
• Gamma-ray bursts
• Ultraluminous IR galaxies
• X-ray black-hole binaries
• Radio galaxies
• . . .

Distributed Data Mining – 2 perspectives:
– robust statistical analysis of “typical” events
– automated search for “rare” events
Figure: The clustering of data clouds (dc#) within a multidimensional parameter space (p#). Such a mapping can be used to search for and identify clusters, voids, outliers, one-of-kinds, relationships, and associations among arbitrary parameters in a database (or among various parameters in geographically distributed databases). Credit: S. G. Djorgovski

Space Science Project: Distributed Data Mining in the NVO
– A case study to find colliding, interacting, and merging galaxies among the IR-luminous galaxy population
• Examine several distributed databases (HST, 2MASS, Sloan, FIRST, IRAS)
• Solve a particular science problem (an NVO science scenario)
NASA Information Power Grid (IPG)

NVO Science Cases & Drivers (from Aspen 2001 NVO Workshop)
• Solar System: NEOs, Long-Period Comets, TNOs, Killer Asteroids!!!
• The Digital Galaxy: Find star streams and populations -- relics of past/present assembly phase. Identify components of disk, thick disk, bulge, halo, arms, ??
• The Low-Surface Brightness Universe: spatial filtering, multi-wavelength searches, intersection of the image and catalog domains
• Panchromatic Census of AGN (Active Galactic Nuclei): Complete sample of the AGN zoo, their emission mechanisms, and their environments
• Precision Cosmology & Large-Scale Structure: **Hierarchical Assembly History of Galaxies and Structure**, Cosmological Parameters, Dark Matter and Galaxy Biasing as f(z)
• Precision science of any kind that depends on very large sample sizes
• "Survey Science Deluxe"
• Search for rare and exotic objects (e.g., high-z QSOs, high-z SNe, L/T dwarfs)
• Serendipity: Explore new domains of parameter space (e.g., time domain, or "color-color space" of all kinds)
**This is the scientific goal of the ISP-funded project described here.

Colliding and Merging Galaxies: Building Blocks of the Universe

Ultra-Luminous Infrared Galaxies (ULIRGs) and other IR-Luminous Galaxies (LIRGs): Nearly 100% are involved in collisions and mergers
ULIRGs: the most luminous galaxies in the Universe

Merger Tree - Galaxy Merger Family History
Figure: galaxy merger tree, with the time axis labeled from Past to Present.
The goal of this study is to identify collision and merger remnant candidates at increasing redshift, in order to measure the galaxy hierarchical mass assembly rate as a function of cosmic epoch.

Distributed Data Mining in the NVO
1. Identify classes of galaxies among several large photometric catalogs (e.g., 2MASS, SDSS, FIRST, NVSS, etc.): the galaxy class is either normal or IR-luminous (the latter being indicative of collision/merger activity)
2. Identify all known examples of ULIRGs:
   • linked to Starburst Galaxies, Gamma-Ray Bursts, Quasars, Hierarchical Galaxy Assembly, etc.
3. Learn new properties of ULIRGs (e.g., Association Rule Mining) by examining multiple distributed databases.
4. Build a classifier from these rules.
5. Find new cases of ULIRGs in the distributed databases.
6. Results will contribute to understanding of many classes of astronomical phenomena.
7. Techniques will be applicable to NVO, LWS, other VxOs, JWST science program, ..., and E/PO projects (e.g., mining Kepler mission catalog by students; or VO@Home)
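As a rough illustration of step 3, the sketch below computes the two standard quantities behind association rule mining, support and confidence, for a single candidate rule. The attribute names and the five toy records are invented purely for illustration; they are not data or results from the project, and the project's actual rule-mining algorithms are not described in these slides.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    Each record is a set of attribute flags for one object.
    """
    n_total = len(records)
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Toy records (hypothetical flags, NOT real catalog data).
records = [
    {"ir_luminous", "radio_source", "disturbed_morphology"},
    {"ir_luminous", "disturbed_morphology"},
    {"ir_luminous"},
    {"radio_source"},
    {"normal"},
]

support, confidence = rule_stats(
    records, {"ir_luminous"}, {"disturbed_morphology"}
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Rules with high enough support and confidence would then be candidates for the classifier in step 4; the thresholds to use are a design choice not given here.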

An example of clustering in a 3-dimensional color-color parameter space using data from two different (distributed) astronomical databases. In this case, the 3 colors are pairings of 2MASS near-IR and Sloan optical magnitudes. Plot provided by H.Kargupta (UMBC)
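A minimal sketch of how such a color-color vector could be assembled from two separate catalogs is shown below. It assumes the objects have already been cross-matched to a shared identifier (real cross-matching is done on sky position), and the choice of the three colors (g − J, r − J, r − K) is an assumption, since the slide does not say which pairings were used. The magnitudes are made-up placeholders, not catalog values.

```python
# Hypothetical, already cross-matched catalog extracts (placeholder magnitudes).
sdss = {
    "obj1": {"g": 17.9, "r": 17.1},
    "obj2": {"g": 19.2, "r": 18.8},
    "obj3": {"g": 18.4, "r": 17.9},   # no near-IR counterpart below
}
twomass = {
    "obj1": {"J": 15.6, "K": 14.5},
    "obj2": {"J": 17.5, "K": 16.9},
}


def color_vectors(optical, near_ir):
    """Build a 3-D color vector for every object present in both catalogs."""
    vectors = {}
    for obj_id in optical.keys() & near_ir.keys():
        o, n = optical[obj_id], near_ir[obj_id]
        # Each color pairs an optical magnitude with a near-IR magnitude.
        vectors[obj_id] = (o["g"] - n["J"], o["r"] - n["J"], o["r"] - n["K"])
    return vectors


vectors = color_vectors(sdss, twomass)
for obj_id in sorted(vectors):
    print(obj_id, tuple(round(c, 2) for c in vectors[obj_id]))
```

The resulting vectors are what a clustering algorithm would operate on; objects missing from either catalog simply drop out of the joined sample.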

Science Result
IRAS F12509+3122 (Redshift = 0.780)
• Successfully completed one true proof-of-concept science case within a small subset of the IRAS, HST, and FIRST databases.
• We re-discovered exactly the type of object that we are hoping to find automatically with our data mining tools:
  – We found a very distant hyper-luminous infrared galaxy, one of the brightest galaxies in the known Universe.
  – This particular galaxy was previously known (catalogued), but we re-discovered it serendipitously.
• References:
  – K. Borne, "Distributed Data Mining in the National Virtual Observatory", SPIE Data Mining & Knowledge Discovery V, vol. 5098, p. 211 (2003).
  – K. Borne, "A National Virtual Observatory (NVO) Science Case: Properties of Very Luminous IR Galaxies (VLIRGs)", in "The Emergence of Cosmic Structure", p. 307 (2003).

Additional application areas of ISP-funded NVO data mining project
• Application of XML to distributed data mining:
  – ADQL (Astronomical Data Query Language)
  – XMLA (XML for Analysis)
  – PMML (Predictive Modeling Markup Language)
• Application of different data mining techniques:
  – Bayes classification
  – Neural nets
  – Decision trees
  – Association rule mining
  – Genetic Algorithms for rapid data modeling
  – Supervised and Unsupervised Learning algorithms for robust classification
• Application of Beowulf clusters to parallel high-performance data mining
• Application to new mission data sets: GALEX, Spitzer, WISE, JWST, LWS, Sensor Webs, Constellations (distributed Sciencecraft)

NASA Intelligent Systems (IS) Project, Intelligent Data Understanding (IDU)
Earth Science Project: Automated Wildfire Detection Through Artificial Neural Networks
• Jerry Miller (P.I.), NASA, GSFC
• Dr. Kirk Borne (Co-I), GMU
• Dr. Brian Thomas, University of Maryland
• Dr. Zhenping Huang, University of Maryland
• Yuechen Chi, GMU
• Donna McNamara, NOAA-NESDIS, Camp Springs, MD
• George Serafino, NOAA-NESDIS, Camp Springs, MD

NOAA’S HAZARD MAPPING SYSTEM
NOAA’s Hazard Mapping System (HMS) is an interactive processing system that allows trained satellite analysts to manually integrate data from 3 automated fire detection algorithms corresponding to the GOES, AVHRR and MODIS sensors. The result is a quality-controlled fire product in graphic (Fig 1), ASCII (Table 1) and GIS formats for the continental US.
Figure – Hazard Mapping System (HMS) Graphic Fire Product for day 5/19/2003

OVERALL TASK OBJECTIVES
• To mimic the NOAA-NESDIS Fire Analysts’ subjective decision-making and fire detection algorithms with a Neural Network in order to:
  – remove subjectivity in results
  – improve automation & consistency
  – allow NESDIS to expand coverage globally

Hazard Mapping System (HMS) ASCII Fire Product

OLD FORMAT           NEW FORMAT (as of May 16, 2003)
Lon, Lat             Lon, Lat, Time, Satellite, Method of Detection
-80.531, 25.351      -80.597, 22.932, 1830, MODIS AQUA, MODIS
-81.461, 29.072      -79.648, 34.913, 1829, MODIS, ANALYSIS
-83.388, 30.360      -81.048, 33.195, 1829, MODIS, ANALYSIS
-95.004, 30.949      -83.037, 36.219, 1829, MODIS, ANALYSIS
-93.579, 30.459      -83.037, 36.219, 1829, MODIS, ANALYSIS
-108.264, 27.116     -85.767, 49.517, 1805, AVHRR NOAA-16, FIMMA
-108.195, 28.151     -84.465, 48.926, 2130, GOES-WEST, ABBA
-108.551, 28.413     -84.481, 48.888, 2230, GOES-WEST, ABBA
-108.574, 28.441     -84.521, 48.864, 2030, GOES-WEST, ABBA
-105.987, 26.549     -84.557, 48.891, 1835, MODIS AQUA, MODIS
-106.328, 26.291     -84.561, 48.881, 1655, MODIS TERRA, MODIS
-106.762, 26.152     -84.561, 48.881, 1835, MODIS AQUA, MODIS
-106.488, 26.006     -89.433, 36.827, 1700, MODIS TERRA, MODIS
-106.516, 25.828     -89.750, 36.198, 1845, GOES, ANALYSIS
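The two record layouts above can be read with a few lines of code. The following is a minimal sketch, not NOAA's own reader: the `FireRecord` class and `parse_hms_line` function are hypothetical names, and the time field is kept as the raw four-digit string because the slides do not state its time zone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FireRecord:
    lon: float
    lat: float
    time_hhmm: Optional[str] = None   # absent in the old format
    satellite: Optional[str] = None   # absent in the old format
    method: Optional[str] = None      # absent in the old format


def parse_hms_line(line: str) -> FireRecord:
    """Parse one HMS ASCII record in either the old or the new layout."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) == 2:              # old format: Lon, Lat
        return FireRecord(float(fields[0]), float(fields[1]))
    if len(fields) == 5:              # new format: Lon, Lat, Time, Satellite, Method
        lon, lat, time_hhmm, satellite, method = fields
        return FireRecord(float(lon), float(lat), time_hhmm, satellite, method)
    raise ValueError(f"unrecognized HMS record: {line!r}")


old = parse_hms_line("-80.531, 25.351")
new = parse_hms_line("-85.767, 49.517, 1805, AVHRR NOAA-16, FIMMA")
print(old)
print(new)
```

Keeping the extra fields optional lets one code path handle files written before and after the May 16, 2003 format change.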

Figure: GOES CH2 (3.78 - 4.03 μm) – Northern Florida Fire, 2003: Day 126, –82.10 Deg West Longitude, 30.49 Deg North Latitude
File: florida_ch2.png

NOAA-NESDIS FIRE DETECTION SYSTEM
Figure (data-flow diagram), summarized by sensor:
• GOES East/West Imager (NOAA S/C): 5 channels, 10-bit words, GVAR format
  – WF-ABBA fire detection: channels 1, 2, 4 (0.62, 3.9, 10.7 μm)
  – NASA tap-off point for imagery: MCIDAS (COTS), 10-bit; channels 1, 2, 4 (0.62, 3.9, 10.7 μm), 8-bit words, LCC
• AVHRR on NOAA 14-17 (NOAA S/C): 5 channels, 10-bit words, HRPT format
  – FIMMA fire detection: channels 2, 3b, 4, 5 (0.91, 3.7, 10.8, 12 μm)
  – Imagery: geo-correction, TERASCAN (COTS), 10-bit; channels 1, 2, 3b (0.63, 0.91, 3.7 μm), 8-bit words, LCC
• MODIS on Terra/Aqua (NASA S/C): 36 channels, 12-bit words, HDF format
  – MODIS MOD14 fire product: channels 2, 22, 31 (0.86, 3.9, 11 μm)
  – Imagery: bow-tie effect removal, MCIDAS (COTS); channels 1, 2, 22 (0.66, 0.86, 3.96 μm), 8-bit words, LCC
• Hazard Mapping System (HMS), with ENVI: fire analysts combine these inputs into the daily NOAA fire product (automated algorithms and manual additions)
Abbreviations: WF-ABBA = Wildfire Automated Biomass Burning Algorithm; FIMMA = Fire Identification Mapping and Monitoring Algorithm; LCC = Lambert Conformal Conic Projection; MCIDAS = Man Computer Interactive Data Access System

SIMPLIFIED DATA EXTRACTION PROCEDURE
Figure (flow diagram):
• Data: GOES (96 files/day), AVHRR (25 files/day), MODIS (14 files/day), and the daily HMS ASCII fire product
• Geographic coordinates (lat/lon) from the fire product are converted to image coordinates (row/col) through an ENVI function call, using the image references
• Spectral data are read at those image coordinates
• Bad data points are filtered out
• Output: neural network training set
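The extraction steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the project used an ENVI function call on Lambert Conformal Conic imagery, whereas this sketch swaps in a simple regular lat/lon grid; the function names are hypothetical; and the 7×7 window (49 pixels per band) is an assumption chosen to match the 49 inputs per band on the network slide.

```python
import numpy as np


def latlon_to_rowcol(lat, lon, lat_top, lon_left, deg_per_pixel):
    """Map geographic coordinates to pixel indices on a regular lat/lon grid
    (an assumed stand-in for the ENVI conversion)."""
    row = int(round((lat_top - lat) / deg_per_pixel))
    col = int(round((lon - lon_left) / deg_per_pixel))
    return row, col


def extract_windows(image, points, lat_top, lon_left, deg_per_pixel, half=3):
    """Return one flattened (2*half+1)^2 window per usable (lat, lon) point."""
    n_rows, n_cols = image.shape
    samples = []
    for lat, lon in points:
        r, c = latlon_to_rowcol(lat, lon, lat_top, lon_left, deg_per_pixel)
        # Filter 1: drop points whose window falls off the image.
        if r - half < 0 or c - half < 0 or r + half >= n_rows or c + half >= n_cols:
            continue
        window = image[r - half:r + half + 1, c - half:c + half + 1]
        # Filter 2: drop windows containing missing or invalid pixels.
        if not np.all(np.isfinite(window)):
            continue
        samples.append(window.ravel())
    return np.array(samples)


# Synthetic 20x20 single-band image and two fire locations (the second is off-image).
image = np.arange(400, dtype=float).reshape(20, 20)
points = [(30.49, -82.10), (40.0, -82.10)]
windows = extract_windows(image, points, lat_top=31.0, lon_left=-83.0, deg_per_pixel=0.1)
print(windows.shape)
```

Repeating the extraction for each of the three input bands and concatenating the results would give one training vector per fire location.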

DECISION REGIONS AND BOUNDARIES FOR HIGHLY IDEAL SCATTER PLOT CLUSTERING PATTERNS
Figure: two schematic scatter plots, each with axes X1 and X2.
• Single Fire Signature: Fire and Background clusters
• Multiple Fire Signatures: Surface Fire, Crown Fire, Ground Fire, and Background clusters

Scatter Plot of Background-Subtracted GOES CH 1 vs. CH 2
Fire (lower) and non-fire (upper) separation of clusters
2003: June 2, Northern Florida
File: scatter_fires12.png
(GOES CH1, CH2, CH4 are input to neural network)

Scatter Plot of Background-Subtracted GOES CH 2 vs. CH 4
Fire (left) and non-fire (right) separation of clusters
2003: June 2, Northern Florida
File: scatter_fires22.png
(GOES CH1, CH2, CH4 are input to neural network)

Neural Network Configuration
Figure: three-layer network, with connections (weights) between successive layers.
• Input Layer 0: Band A (inputs 1 - 49), Band B (inputs 50 - 98), Band C (inputs 99 - 147)
• Hidden Layer 1
• Output Layer 2: output classification
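A forward pass through a network of this shape can be sketched as below. Only the 147-element input layout (three bands of 49 values) comes from the slide; the hidden-layer size, the sigmoid activations, and the single output unit are assumptions, and the weights here are random rather than trained.

```python
import numpy as np

N_BANDS, PER_BAND = 3, 49          # 3 x 49 = 147 inputs, as on the slide
N_HIDDEN = 10                      # assumption: hidden size is not given

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(N_BANDS * PER_BAND, N_HIDDEN))  # layer 0 -> 1
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.1, size=(N_HIDDEN, 1))                   # layer 1 -> 2
b2 = np.zeros(1)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def classify(band_a, band_b, band_c):
    """Return a fire score in (0, 1) for one sample of three 49-value bands."""
    x = np.concatenate([band_a, band_b, band_c])  # inputs 1-49, 50-98, 99-147
    hidden = sigmoid(x @ W1 + b1)
    return float(sigmoid(hidden @ W2 + b2)[0])


score = classify(np.zeros(49), np.zeros(49), np.zeros(49))
print(f"untrained score: {score:.3f}")
```

In practice the weights would be fitted to the analyst-labeled training set, and the score thresholded to give the fire / non-fire classification.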

Typical Error Matrix (for MODIS instrument)

                                 TRAINING DATA
Neural Network Classification    Fire         NonFire      Totals
Fire                             2834 (TP)    173 (FP)     3007
NonFire                          318 (FN)     3103 (TN)    3421
Totals                           3152         3276         6428

TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative

Typical Measures of Accuracy
• Overall Accuracy = (TP+TN)/(TP+TN+FP+FN)
• Producer’s Accuracy (fire) = TP/(TP+FN)
• Producer’s Accuracy (nonfire) = TN/(FP+TN)
• User’s Accuracy (fire) = TP/(TP+FP)
• User’s Accuracy (nonfire) = TN/(TN+FN)

Accuracy of our NN Classification
• Overall Accuracy = 92.4%
• Producer’s Accuracy (fire) = 89.9%
• Producer’s Accuracy (nonfire) = 94.7%
• User’s Accuracy (fire) = 94.2%
• User’s Accuracy (nonfire) = 90.7%
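The quoted percentages follow directly from the MODIS error matrix (TP = 2834, FP = 173, FN = 318, TN = 3103). A short check, with the function name chosen here only for illustration:

```python
TP, FP, FN, TN = 2834, 173, 318, 3103   # counts from the MODIS error matrix


def accuracy_measures(tp, fp, fn, tn):
    """Compute the five accuracy measures defined on the slide."""
    return {
        "overall": (tp + tn) / (tp + tn + fp + fn),
        "producer_fire": tp / (tp + fn),
        "producer_nonfire": tn / (fp + tn),
        "user_fire": tp / (tp + fp),
        "user_nonfire": tn / (tn + fn),
    }


measures = accuracy_measures(TP, FP, FN, TN)
for name, value in measures.items():
    print(f"{name}: {100 * value:.1f}%")
# overall: 92.4%
# producer_fire: 89.9%
# producer_nonfire: 94.7%
# user_fire: 94.2%
# user_nonfire: 90.7%
```

Producer's accuracy is measured against the reference (training) labels, so it reflects omission errors; user's accuracy is measured against the network's own output, so it reflects commission errors.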

Summary

Summary – NVO Data Mining Applications
Data Mining Resource Guide for Space Science: http://nvo.gsfc.nasa.gov/nvo_datamining.html
Sample Data Mining Applications within the NVO:
• Discover data stored in geographically distributed heterogeneous systems.
• Search huge databases for trends and correlations in high-dimensional parameter spaces: identify new properties or new classes of objects.
• Search for rare, one-of-a-kind, and exotic objects in huge databases.
• Identify temporal variations in objects from millions or billions of observations.
• Identify moving objects in huge survey catalogs and image databases.
• Identify parameter glitches / anomalies / deviations either in static databases (e.g., archives) or in dynamic data (e.g., science / telemetry / engineering data streams from remote satellites).
• Find clusters, nearest neighbors, outliers, and/or zones of avoidance in the distribution of astrophysical objects or other observables in arbitrary parameter spaces.
• Serendipitously explore the huge databases that will be part of the NVO, through access to distributed, autonomous, federated, heterogeneous, multi-wavelength, multi-mission astrophysics data archives.
http://www.us-vo.org/

Addressing NASA Exploration Challenges through Intelligent Data Understanding
• Autonomy: “making systems more intelligent”
• Robotic Networks: “enabling networks of cooperating robotic systems”
• Data-Rich Virtual Presence: “local and remote, both real-time and asynchronous virtual presence to enable effective science and robust operations (including tele-presence, tele-science, tele-supervision)”
Source: Human & Robotic Technology, Program Formulation Plan, 15 May 2004
