Nagiza Samatova & George Ostrouchov Computer Science and Mathematics Division

Multi-agent based High-Dimensional Cluster Analysis
SciDAC SDM-ISIC Kickoff Meeting, July 10-11, 2001
Nagiza Samatova & George Ostrouchov
Computer Science and Mathematics Division
Oak Ridge National Laboratory

Science-driven Bottlenecks
• Data management and data mining algorithms: not scalable to petabytes of scientific data
• Retrieving data subsets from storage systems: too slow, especially for tertiary storage
• Transferring large datasets between sites: inefficient
• Navigating between heterogeneous, distributed data sources: very user intensive
• I/O techniques: access rates too low
Major Focus: To improve the transfer of large datasets
Approaches:
• To implement effective high-bandwidth transfers (Randy Burris)
• To minimize the amount of data transferred

Minimizing the amount of scientific simulation data transfer: State of the Art
• Data compression utilities (zip, compress, etc.):
  • large overheads
  • modest compression rates
• Post-processing data analysis tools (like PCMDI):
  • scientists must wait for the simulation to complete
  • can use lots of CPU cycles on long-running simulations
  • can use up to 50% more storage and require unnecessary data transfer for data-intensive simulations
• Simulation monitoring tools:
  • interfere with simulations
  • lack flexibility

Improvements through Multi-level Data Minimization Mechanisms
• Simulation level: data stream (not simulation) monitoring tools for:
  • “Any-time” feedback to decide whether to terminate a simulation, restart with new parameters, or continue
  • Filtering runs to decide whether to transfer to a central archive, keep locally, or delete
• Comparative analysis level: application-specific search engines for:
  • Simulation data comparison, especially against archived databases
  • Distributed simulation data query, search, and retrieval
• In-depth analysis level: application-specific inference engines for:
  • Inferring rules relating fragments in two or more simulation outputs
  • New scientific discoveries

How we will address these needs
• Our Approach: Develop ASPECT (Adaptable Simulation Product Exploration via Clustering Toolkit), which includes:
  • Dynamic first-look multivariate time series miner (Level I)
  • Distributed time-series query, search, and retrieval engine (Level II)
  • Time-series-based rules inference engine (Level III)
• Our Strategy:
  • Leverage existing work
  • Expand our prior work
  • Integrate with other SDM tasks
  • Work closely with application scientists
  • Develop ASPECT in an iterative fashion

Our work will leverage
• Distributed Scientific Data Mining Research (Probe/MICS) [SOA+01a, SOA+01b]
• Analysis of Large Scientific Datasets (LDRD/ORNL) [DFL+96, DFL+00, DFL+00]
• Statistical Downscaling for Climate (LDRD/ORNL) [PDO00]

Distributed Scientific Data Mining Research (funded under Probe/MICS)
• Motivation
• Big picture
• SDM-ETC related effort
Relevance to our task: Levels II and III
Limitations w.r.t. our task: enabling-technology research, not application-specific

Motivation for Scientific Data Mining Research under Probe
• Existing data mining tools have limited applicability to the emerging scientific data sets, which are:
  • Massive (terabytes to petabytes)
    • Existing methods do not scale in terms of time, storage, or number of dimensions.
    • Need scalable data analysis algorithms.
  • Distributed (e.g., across computational grids, multiple files, disks, tapes)
    • Existing methods work on a single, centralized dataset. Data transfer is prohibitive (high bandwidth requirements, security/privacy concerns).
    • Need distributed data analysis algorithms.
  • Dynamic
    • Existing methods work with static datasets. Any change requires complete re-computation.
    • Need dynamic (updating & downdating) techniques.
  • High-dimensional
    • The usual assumptions about homogeneity or ergodicity cannot be made.
    • Need segmented dimension reduction methods.

Our Approach: Distributed agents and peer-to-peer negotiation
• Strategy
  • to perform data mining in a distributed and recursive fashion
  • with reasonable data transfer overheads
• Key idea
  • Generate local components using distributed agents
  • Merge these components into a global system via peer-to-peer agents’ collaboration and negotiation
• Requirements for Resulting System
  • Qualitative comparability
  • Computational complexity reduction
  • Scalability
  • Communication acceptability
  • Flexibility (in the choice of a local algorithm)
  • Visual representation sufficiency

Background: Hierarchical Clustering
[Figure: a distance matrix for items A, B, C, D, E; the corresponding spanning tree with dissimilarity measures; and the resulting dendrogram]
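As a minimal sketch of the background idea, the following builds a dendrogram (as an ordered list of merges) from a distance matrix using single-linkage agglomerative clustering. The linkage rule, the function name `single_linkage`, and the dissimilarity values are assumptions chosen for illustration; they are not the values from the slide's figure.

```python
from itertools import combinations


def single_linkage(labels, dist):
    """Agglomerative single-linkage clustering.

    labels: list of item names
    dist:   dict mapping frozenset({a, b}) -> dissimilarity
    Returns the merges as (cluster_a, cluster_b, height), in merge order.
    """
    clusters = [frozenset([x]) for x in labels]
    merges = []

    def linkage(pair):
        # Single linkage: smallest pairwise dissimilarity between two clusters
        a, b = pair
        return min(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical dissimilarities for five items (not the slide's matrix)
pairs = {
    ("A", "B"): 0.25, ("A", "C"): 0.60, ("A", "D"): 0.70, ("A", "E"): 0.80,
    ("B", "C"): 0.50, ("B", "D"): 0.75, ("B", "E"): 0.90,
    ("C", "D"): 0.40, ("C", "E"): 0.65,
    ("D", "E"): 0.60,
}
dist = {frozenset(k): v for k, v in pairs.items()}
merges = single_linkage(list("ABCDE"), dist)
for a, b, height in merges:
    print(sorted(a), "+", sorted(b), "at height", height)
```

Each merge height is the level at which two branches join in the dendrogram; cutting the tree at a chosen height yields a flat clustering.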

SDM-ETC Tie-in: Distributed Hierarchical Clustering
Problem Description:
• Given: a data set with N d-dimensional data items distributed across multiple data sites
• Task: determine a hierarchical decomposition of this dataset
Application of Clustering:
• Database management
• Multi-dimensional indexing
• Data mining
• and….

RACHET: Distributed Clustering Algorithm
Control flow of RACHET
[Flowchart: generate local dendrograms at the distributed sites; transmit the local dendrograms; merge the local dendrograms centrally into a global dendrogram; a "Comparable quality?" check with an "Improve: increase k" branch; reconstruct geometry for visualization (optional)]

Centroid Descriptive Statistics: summarized cluster representation
Question: How many statistical parameters are sufficient to make clustering decisions (merging or splitting clusters)?
Descriptive statistics of a cluster whose centroid is computed from Nc points:
• Nc: number of data points in the cluster
• square norm of the centroid
• radius of the cluster
• sum of the centroid components
• minimum centroid component
• maximum centroid component
Features:
• [expression missing] vs. [expression missing] space cost
• Sufficient for efficiently calculating all measurements involved in making clustering decisions
• Sufficient for visualization
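To make the six-number summary concrete, here is a minimal sketch that computes it from raw points. The function name `describe`, the dictionary keys, and the choice of radius (root-mean-square distance of the points from the centroid) are assumptions for illustration; the slide does not give the exact definitions.

```python
import math


def describe(points):
    """Summarize a cluster by six numbers derived from its centroid."""
    n = len(points)
    dim = len(points[0])
    centroid = [sum(p[j] for p in points) / n for j in range(dim)]
    # Assumption: radius = RMS distance of the points from the centroid
    sq_dists = [sum((p[j] - centroid[j]) ** 2 for j in range(dim)) for p in points]
    return {
        "n": n,                                     # number of data points
        "normsq": sum(c * c for c in centroid),     # square norm of centroid
        "radius": math.sqrt(sum(sq_dists) / n),     # radius of the cluster
        "sum": sum(centroid),                       # sum of centroid components
        "min": min(centroid),                       # minimum centroid component
        "max": max(centroid),                       # maximum centroid component
    }


# Four corners of a 2 x 4 rectangle; the centroid is (1, 2)
stats = describe([(0.0, 0.0), (2.0, 0.0), (2.0, 4.0), (0.0, 4.0)])
print(stats)
```

The summary has a fixed size regardless of the dimension d, whereas the centroid itself grows with d.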

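Updating Descriptive Statistics
Merging Theorem: Given the descriptive statistics of two clusters, the following statements hold for the descriptive statistics of the cluster formed by merging them: [statements not recovered]
[Figure: two clusters labeled S1 and S2, with C1, C2, and O marked]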
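The theorem's statements did not survive extraction, so the following is only a sketch of how such an update can work, not the authors' formulas. It assumes each cluster is summarized by a count, the square norm of its centroid, an RMS radius, and the sum, minimum, and maximum of the centroid components, and that the Euclidean distance `d` between the two centroids is known. The function name `merge_stats` and the dictionary keys are hypothetical.

```python
import math


def merge_stats(s1, s2, d):
    """Merge two cluster summaries; d is the distance between their centroids."""
    n = s1["n"] + s2["n"]
    w1, w2 = s1["n"] / n, s2["n"] / n  # the merged centroid is w1*c1 + w2*c2
    # Dot product c1.c2 from ||c1 - c2||^2 = ||c1||^2 + ||c2||^2 - 2 c1.c2
    dot = (s1["normsq"] + s2["normsq"] - d * d) / 2.0
    normsq = w1 * w1 * s1["normsq"] + w2 * w2 * s2["normsq"] + 2 * w1 * w2 * dot
    # RMS radius: within-cluster scatter plus the scatter between the centroids
    radius = math.sqrt(w1 * s1["radius"] ** 2 + w2 * s2["radius"] ** 2 + w1 * w2 * d * d)
    return {
        "n": n,
        "normsq": normsq,
        "radius": radius,
        "sum": w1 * s1["sum"] + w2 * s2["sum"],  # exact: the sum is linear in the centroid
        # Min and max of the merged centroid can only be bounded from the summaries
        "min_lower_bound": w1 * s1["min"] + w2 * s2["min"],
        "max_upper_bound": w1 * s1["max"] + w2 * s2["max"],
    }


# Cluster 1: points (0,0), (2,0), centroid (1,0). Cluster 2: points (0,4), (2,4), centroid (1,4).
s1 = {"n": 2, "normsq": 1.0, "radius": 1.0, "sum": 1.0, "min": 0.0, "max": 1.0}
s2 = {"n": 2, "normsq": 17.0, "radius": 1.0, "sum": 5.0, "min": 1.0, "max": 4.0}
merged = merge_stats(s1, s2, d=4.0)
print(merged)
```

In this example the merged centroid is (1, 2), so the exact square norm is 5 and the exact RMS radius is the square root of 5; the count, square norm, radius, and sum update exactly, while the minimum and maximum components are only bounded.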

Euclidean Distance Approximation
• Squared Euclidean distance: [formula missing]; transmission cost [expression missing]
• Lower and upper bounds: [formula missing]; transmission cost [expression missing]
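The slide's formulas were not recovered. As one illustration of how a distance between two centroids can be bounded from scalar summaries alone, the sketch below uses the triangle inequality on the centroid norms. This is a standard bound and an assumption here; RACHET's own bounds may be different and tighter. The function name `distance_bounds` is hypothetical.

```python
import math


def distance_bounds(normsq1, normsq2):
    """Bound ||c1 - c2|| using only the square norms of the two centroids.

    Triangle inequality: | ||c1|| - ||c2|| | <= ||c1 - c2|| <= ||c1|| + ||c2||.
    """
    r1, r2 = math.sqrt(normsq1), math.sqrt(normsq2)
    return abs(r1 - r2), r1 + r2


# Centroids (1, 0) and (1, 4): square norms 1 and 17, true distance 4
lower, upper = distance_bounds(1.0, 17.0)
print(lower, "<= 4.0 <=", upper)
```

Only two scalars per cluster are needed for this bound, independent of the dimension d, which is the kind of saving a summary-based approximation aims for.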

RACHET Performance Analysis: linear, O(N), in time, space, and transmission, given |S| << N and k << N

Analysis of Large Scientific Datasets
Focus: Univariate time series data
Applications: ARM, EEG
Relevance to our task: Level III
Limitations w.r.t. our task:
• No support for dynamic & distributed time series
• No support for multivariate time series

Local Models for Global Analysis and Comparison of Data Series
• Strategy
  • Segment series
  • Model the usual to find the unusual
• Key ideas
  • Fit simple local models to segments
  • Use parameters for global analysis and monitoring
• Resulting system
  • Detects specific events (targeted or unusual)
  • Provides a global description of one or several data series
  • Provides data reduction to parameters of local model

From Local Models to Annotated Time Series
• Segment series (100 obs)
• Fit simple local model (c0, c1, c2, ||e||, ||e||2)
• Select extreme (10%)
• Cluster extreme (4)
• Map back to series
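The first steps of this pipeline can be sketched as follows, under stated assumptions: the local model is taken to be a least-squares quadratic with coefficients c0, c1, c2 over segment-relative time, a single residual 2-norm stands in for the error terms, and "extreme" is interpreted as the largest residual norm. The clustering and map-back steps are omitted. All function names are hypothetical.

```python
import math


def fit_quadratic(y):
    """Least-squares fit y[i] ~ c0 + c1*t + c2*t^2 with t = i / len(y).

    Returns (c0, c1, c2, residual 2-norm).
    """
    n = len(y)
    t = [i / n for i in range(n)]
    # Augmented normal equations: A[i][j] = sum t^(i+j), rhs[i] = sum y * t^i
    s = [sum(x ** k for x in t) for k in range(5)]
    a = [[s[i + j] for j in range(3)] + [sum(v * x ** i for x, v in zip(t, y))]
         for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3 x 4 system
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * p for x, p in zip(a[r], a[col])]
    c = [a[i][3] / a[i][i] for i in range(3)]
    resid = math.sqrt(sum((v - (c[0] + c[1] * x + c[2] * x * x)) ** 2
                          for x, v in zip(t, y)))
    return c[0], c[1], c[2], resid


def local_models(series, seg_len=100):
    """Split the series into consecutive full segments and fit each one."""
    starts = range(0, len(series) - seg_len + 1, seg_len)
    return [fit_quadratic(series[i:i + seg_len]) for i in starts]


def extreme_segments(models, fraction=0.10):
    """Indices of the segments with the largest residual norms (assumed criterion)."""
    k = max(1, round(fraction * len(models)))
    order = sorted(range(len(models)), key=lambda i: models[i][3], reverse=True)
    return sorted(order[:k])


# Synthetic series: a linear trend with one spike inside segment 7
series = [0.01 * i for i in range(1000)]
series[750] += 5.0
models = local_models(series)
print(extreme_segments(models))
```

Each segment of 100 observations is reduced to four numbers, which is the data reduction the slides describe; the selected segment indices are what would be mapped back onto the series as annotations.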

Statistical Downscaling for Climate
Focus: Image time series
Application: Climate
Relevance to our task: Levels I and II
Limitation w.r.t. our task: Works as a post-processing tool

Climate Downscaling Contains Several Post-Processing Tools

Trend and Periodic Components Provide a Concise Description of Model Run
• Filter periodic and trend components
• Compute EOFs
• Monitor model run

Summary of where efforts are needed
• Research:
  • Multivariate time series datasets
  • Dynamic versions of time series processing & analysis tools
  • Application-specific distributed & dynamic clustering
  • Application-specific rules inference algorithms
• Implementations:
  • ASPECT’s framework
  • Simulation data monitoring engine:
    • with pluggable user-driven data analysis modules
    • with “any-time”, “real-time” analysis (not post-processing)
    • with no or very little interference with the simulation
  • Simulation data query, search, & retrieval engine
  • Simulation data rules inference engine
• A lot of integration work…

Integration with other SDM-ETC tasks
Task areas:
1) Storage and retrieval of very large datasets
2) Access optimization of distributed data
3) Data mining and discovery of access patterns
4) Distributed, heterogeneous data access
  • Multi-tier metadata system for querying heterogeneous data sources (LLNL, Georgia Tech)
  • Knowledge-based federation of heterogeneous databases (SDSC)
5) Agent technology
  • Enabling communication among tools and data (ORNL, NCSU)
Levels: a) Storage Level, b) File Level, c) Dataset Level, d) Dataset Federation Level
Efforts under task areas 1) to 3), across the levels:
• Analysis of application-level query patterns (LLNL, NWU)
• Optimizing shared access to tertiary storage (LBNL, ORNL)
• High-dimensional indexing techniques (LBNL)
• Multi-agent high-dimensional cluster analysis (ORNL)
• MPI I/O: implementation based on file-level hints (ANL, NWU)
• Low level API for grid I/O (ANL)
• Dimension reduction and sampling (LLNL, LBNL)
• Parallel I/O: improving parallel access from clusters (ANL, NWU)
• Adaptive file caching in a distributed system (LBNL)
• Optimization of low-level data storage, retrieval and transport (ORNL)
• [Grid Enabling Technology]
