System Support for High Performance Scientific Data Mining

System Support for High Performance Scientific Data Mining Gagan Agrawal Ruoming Jin Raghu Machiraju S. Parthasarathy Department of Computer and Information Sciences Ohio State University

Scientific Data Mining Problem • Datasets used for scientific data mining are large – particularly from simulations • Our understanding of what algorithms and parameters will give desired insights is limited • Time required for implementing different algorithms and running them with different parameters on large datasets slows down the scientific data mining process

Project Overview • FREERIDE (Framework for Rapid Implementation of datamining engines) as the base system • Already demonstrated for a variety of standard mining algorithms • Working for feature analysis and mining of simulation data currently

FREERIDE offers: • The ability to rapidly prototype a high-performance mining implementation • Distributed memory parallelization • Shared memory parallelization • Ability to process large and disk-resident datasets • Only modest modifications to a sequential implementation for the above three

Popular algorithms have a common canonical loop Can be used as the basis for supporting a common middleware Key Observation from Mining Algorithms While( ) { forall( data instances d) { I = process(d) R(I) = R(I) op d } ……. }

Performance of Shared Memory Parallelization K-means clustering

Performance on Cluster of SMPs Apriori Association Mining

SPIES On (a) FREERIDE • Developed a new communication efficient decision tree construction algorithm – Statistical Pruning of Intervals for Enhanced Scalability (SPIES) • Combines RainForest with statistical pruning of intervals of numerical attributes to reduce memory requirements and communication volume • Does not require sorting of data, or partitioning and writing-back of records

Broader Research Agenda

Applying FREERIDE for Scientific Data Mining • Focusing on feature extraction, tracking, and mining approach developed by Machiraju et al. • A feature is a region of interest in a dataset • A suite of algorithms for extracting and tracking them

A Feature Analysis Algorithm Data Transform Tour Grid Operator Aggregate Classify Points Denoise Track Rank Catalog ROIs Classify-Aggregate

Ongoing Work – Parallelization Using FREERIDE • Most of the steps involve generalized reductions - supported well in FREERIDE • Extensions to FREERIDE required for aggregation and tracking steps • Overall, FREERIDE can allow rapid implementation of scalable versions of a variety of steps and algorithms that are part of the feature mining paradigm

System Support for High Performance Scientific Data Mining