
Tools and Techniques for the Data Grid



  1. Tools and Techniques for the Data Grid Gagan Agrawal The Ohio State University

  2. Overall Motivation • Computation has long become an integral part of any scientific discipline • Parallels theory and experiments • Last 2 (or more) decades have seen Computational-X emerge • Major emphasis on computational modeling • Involved CS support for high-end computing • In last 5-10 years, X-Informatics is emerging • Data-driven science and engineering applications • Needs CS support for high-end and distributed computing

  3. Context: Grid Computing • Wide area collaborations and pooling of resources • Natural synergy with data-intensive applications • Wide-area sharing of data • Using distributed resources for data analysis • Stage multiple tasks: data generation, processing, visualization

  4. Scientific Data Analysis on (Grid-based) Data Repositories • Scientific data repositories • Large volume: gigabyte, terabyte, petabyte scale • Distributed datasets • Generated/collected by scientific simulations or instruments • Data could be streaming in nature • Scientific data analysis pipeline: data specification, data organization, data extraction, data movement, data analysis, data visualization

  5. Opportunities • Scientific simulations and data collection instruments generating large scale data • Rapidly increasing wide-area bandwidths • Grid standards enabling sharing of data • Service/grid model of computing • Plug and play application modules / data sources

  6. Existing Efforts • Data grids recognized as important component of grid/distributed computing • Major topics • Efficient/Secure Data Movement • Replica Selection • Metadata catalogs / Metadata services • Setting up workflows

  7. Open Issues • Accessing/retrieving/processing data from scientific repositories • Need to deal with low-level formats • Integrating tools and services that provide or require data in different formats • Support for processing streaming data in a distributed environment • Developing scalable data analysis applications

  8. Ongoing Projects • Automatic Data Virtualization • On the fly data integration in a distributed environment • Middleware for Processing Streaming Data • Compiling XQuery on Scientific and Streaming Data • Middleware for Scalable Data Processing • Data Mining Algorithms and Systems

  9. Coastal Forecasting and Change Detection (Lake Erie)

  10. An Example Application Scenario

  11. Outline • Automatic Data Virtualization • Relational/SQL • XML/XQuery based • Data Integration • Middleware for Streaming Data • Cluster and Grid-based data mining middleware

  12. Automatic Data Virtualization: Motivation • Emergence of grid-based data repositories • Can enable sharing of data in an unprecedented way • Access mechanisms for remote repositories • Complex low-level formats make accessing and processing of data difficult • Main desired functionality • Ability to select, download, and process a subset of data

  13. Data Virtualization • (Figure: a data service provides a data virtualization, i.e., an abstract view, of an underlying dataset.) • By the Global Grid Forum's DAIS working group: • A Data Virtualization describes an abstract view of data • A Data Service implements the mechanism to access and process data through the Data Virtualization

  14. Our Approach: Automatic Data Virtualization • Automatically create data services • A new application of compiler technology • A metadata descriptor describes the layout of data on a repository • An abstract view is exposed to the users • Two implementations: • Relational/SQL-based • XML/XQuery-based

  15. System Overview • Queries against the virtual relational view take the form:

SELECT <Data Elements>
FROM <Dataset Name>
WHERE ... AND Filter(<Data Element>);
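
For instance, against the IPARS oil reservoir dataset used on the following slides, such a query might look like this (a hypothetical example; the attribute and dataset names come from the schema on slide 18, and Filter stands for a user-defined filtering function):

SELECT X, Y, Z, SOIL
FROM IparsData
WHERE TIME >= 1 AND TIME <= 500 AND REL = 2 AND Filter(SOIL);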

  16. Design a Meta-data Description Language • Requirements • Specify the relationship of a dataset to the virtual dataset schema • Describe the dataset physical layout within a file • Describe the dataset distribution on nodes of one or more clusters • Specify the subsetting index attributes • Easy to use for data repository administrators and also convenient for our code generation

  17. Design Overview • Dataset Schema Description Component • Dataset Storage Description Component • Dataset Layout Description Component

  18. An Example • Oil Reservoir Management • The dataset comprises several realizations of a simulation on the same grid • For each realization and each grid point, a number of attributes are stored • The dataset is stored on a 4-node cluster

Component I: Dataset Schema Description
[IPARS]            // {* Dataset schema name *}
REL = short int    // {* Data type definition *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float

Component II: Dataset Storage Description
[IparsData]                  // {* Dataset name *}
DatasetDescription = IPARS   // {* Dataset schema for IparsData *}
DIR[0] = osu0/ipars
DIR[1] = osu1/ipars
DIR[2] = osu2/ipars
DIR[3] = osu3/ipars

  19. Data Layout Description Component

DATASET "ROOT" {
  DATATYPE { ... }
  DATAINDEX { ... }
  DATA { DATASET dataset1 DATASET dataset2 DATASET dataset3 }
  DATASET "dataset1" {
    DATATYPE { ... }
    DATASPACE { ... }
    DATA { data1 data2 data3 }
  }
  DATASET "dataset2" {
    DATATYPE { ... }
    DATASPACE { ... }
    DATA { data4 }
  }
  DATASET "dataset3" { ... }
}

(Figure: the corresponding dataset tree, with ROOT containing dataset1, dataset2, and dataset3, and the data files as leaves.)

  20. An Example: Component III: Dataset Layout Description • Oil Reservoir Management • The LOOP keyword captures the repetitive structure within a file • The grid has 4 partitions (0-3) • "IparsData" comprises "ipars1" and "ipars2": "ipars1" describes the data files with the spatial coordinates stored, while "ipars2" specifies the data files with the other attributes stored

DATASET "IparsData" {                // {* Name for Dataset *}
  DATATYPE { IPARS }                 // {* Schema for Dataset *}
  DATAINDEX { REL TIME }
  DATA { DATASET ipars1 DATASET ipars2 }
  DATASET "ipars1" {
    DATASPACE {
      LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 { X Y Z }
    }
    DATA { $DIR[$DIRID]/COORDS $DIRID = 0:3:1 }
  }                                  // {* end of DATASET "ipars1" *}
  DATASET "ipars2" {
    DATASPACE {
      LOOP TIME 1:500:1 {
        LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 { SOIL SGAS }
      }
    }
    DATA { $DIR[$DIRID]/DATA$REL $REL = 0:3:1 $DIRID = 0:3:1 }
  }                                  // {* end of DATASET "ipars2" *}
}

  21. Automatic Virtualization Using Meta-data • Aligned file chunks have the form:

{num_rows, {File1, Offset1, Num_Bytes1}, {File2, Offset2, Num_Bytes2}, ..., {Filem, Offsetm, Num_Bytesm}}

• Our tool parses the meta-data descriptor and generates function code. At run time, the query supplies the parameters that invoke the generated functions to create the aligned file chunks.
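
Concretely, an aligned file chunk can be represented by a small data structure along these lines (a minimal Java sketch; the type and field names are hypothetical, not the system's actual code):

import java.util.List;

// One {File, Offset, Num_Bytes} entry of an aligned file chunk.
record Segment(String file, long offset, long numBytes) {}

// An aligned file chunk: num_rows rows whose attribute values are
// read from one contiguous segment per attribute file.
record AlignedFileChunk(long numRows, List<Segment> segments) {}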

  22. Compiler Analysis

Data_Extract {
  Find_File_Groups()
  Process_File_Groups()
}
Find_File_Groups {
  Let S be the set of files that match against the query
  Classify the files in S by the set of attributes they have
  Let S1, ..., Sm be the resulting m sets
  T = Ø
  foreach {s1, ..., sm}, si ∈ Si {     {* cartesian product between S1, ..., Sm *}
    if the values of the implicit attributes are not inconsistent {
      T = T ∪ {s1, ..., sm}
    }
  }
  Output T
}
Process_File_Groups {
  foreach {s1, ..., sm} ∈ T {
    Find_Aligned_File_Chunks()
    Supply the implicit attributes for each file chunk
    foreach aligned file chunk {
      Check against the index
      Compute offset and length
      Output the aligned file chunk
    }
  }
}

(Figure: the meta-data descriptor drives generation of the create-AFC, process-AFC, and index & extraction function code.)
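
The cartesian-product enumeration at the heart of Find_File_Groups can be sketched in Java as follows (illustrative only; the consistency test over implicit attributes, such as the $REL and $DIRID values encoded in file paths, is passed in as a predicate):

import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class FileGrouping {
    // Enumerate S1 x ... x Sm and keep the groups whose implicit
    // attribute values are mutually consistent.
    static List<List<String>> findFileGroups(List<List<String>> sets,
                                             Predicate<List<String>> consistent) {
        List<List<String>> groups = new ArrayList<>();
        enumerate(sets, 0, new ArrayList<>(), consistent, groups);
        return groups;
    }

    private static void enumerate(List<List<String>> sets, int i, List<String> current,
                                  Predicate<List<String>> consistent,
                                  List<List<String>> groups) {
        if (i == sets.size()) {                 // one complete group {s1, ..., sm}
            if (consistent.test(current)) groups.add(new ArrayList<>(current));
            return;
        }
        for (String file : sets.get(i)) {       // extend the group with a file from Si
            current.add(file);
            enumerate(sets, i + 1, current, consistent, groups);
            current.remove(current.size() - 1);
        }
    }
}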

  23. Outline • Automatic Data Virtualization • Relational/SQL • XML/XQuery based • Information Integration • Middleware for Streaming Data • Coarse-grained pipelined parallelism

  24. XML/XQuery Implementation • (Figure: XQuery queries are posed against a virtual XML view of data physically stored in HDF5, NetCDF, text files, relational databases, and other formats.)

  25. Programming/Query Language • High-level declarative languages ease application development • Popularity of Matlab for scientific computations • New challenges in compiling them for efficient execution • XQuery is a high-level language for processing XML datasets • Derived from database, declarative, and functional languages! • XPath (a subset of XQuery) embedded in an imperative language is another option
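
As a flavor of the language, a simple XQuery over an XML view of the IPARS data might read as follows (a hypothetical sketch; the element names assume an XML view exposing the schema of slide 18):

for $p in doc("ipars.xml")/dataset/point
where $p/TIME = 100 and $p/SOIL > 0.7
return
  <result> { $p/X, $p/Y, $p/Z, $p/SGAS } </result>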

  26. Approach / Contributions • Use of XML Schemas to provide high-level abstractions on complex datasets • Using XQuery with these Schemas to specify processing • Issues in Translation • High-level to low-level code • Data-centric transformations for locality in low-level codes • Issues specific to XQuery • Recognizing recursive reductions • Type inferencing and translation
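
For example, a reduction written as a recursive XQuery function, which the compiler must recognize and convert into an efficient loop, could look like this (a minimal illustrative sketch):

declare function local:sum($vals as xs:double*) as xs:double {
  if (fn:empty($vals)) then 0.0
  else $vals[1] + local:sum(fn:subsequence($vals, 2))
};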

  27. System Architecture • (Figure: XQuery sources are compiled against an external schema; an XML Mapping Service relates the logical XML schema to the physical XML schema, and the compiler generates C++/C code.)

  28. Outline • Automatic Data Virtualization • Relational/SQL • XML/XQuery based • Information Integration • Middleware for Streaming Data • Cluster and Grid-based data mining middleware

  29. Data Integration: Overall Goal • Tools for data integration driven by: • Data explosion • Data size & number of data sources • New analysis tools • Autonomous resources • Heterogeneous data representation & various interfaces • Frequent Updates • Common Situations: • Flat-file datasets • Ad-hoc sharing of data

  30. Current Approaches • Manually written wrappers • Problems: O(N²) wrappers are needed for N formats, and O(N) of them must be rewritten for a single update • Mediator-based integration systems • Problems: need a common intermediate format, causing unnecessary data transformation • Integration using web/grid services • Needs all tools to be web services (all data in XML?)

  31. Our Approach • Automatically generate wrappers • Stand-alone programs • For integrated DBs, (grid) workflow systems • Transform data in files of arbitrary formats • No domain- or format-specific heuristics • Layout information provided by users • Help biologists write layout descriptors using data mining techniques • Particularly attractive for • flat-file datasets • ad hoc data sharing • data grid environments

  32. Our Approach: Advantages • Advantages: • No DB or query support required • One descriptor per resource needed • No unnecessary transformation • New resources can be integrated on-the-fly

  33. Our Approach: Challenges • Description language • Format and logical view of data in flat files • Easy to interpret and write • Wrapper generation and execution • Correspondence between data items • Separating wrapper analysis and execution • Interactive tools for writing layout descriptors • What data mining techniques to use?

  34. Wrapper Generation System Overview • (Figure: the system's components include a Parser for the Layout Descriptor and Schema Descriptors, a Mapping Generator producing the Data Entry Representation and Schema Mapping, an Application Analyzer emitting WRAPINFO, and a runtime with a DataReader over the source dataset, a DataWriter for the target dataset, and a Synchronizer connecting the two.)
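
The runtime half of a generated wrapper might be organized around interfaces like the following (a speculative Java sketch; only the component names DataReader, DataWriter, and Synchronizer come from the slide):

interface DataReader {
    // Read the next entry from the source dataset, or null at the end.
    String[] readEntry();
}

interface DataWriter {
    // Write one entry, already mapped to the target layout.
    void writeEntry(String[] entry);
}

// The synchronizer drives reading, mapping, and writing.
class Synchronizer {
    private final DataReader reader;
    private final DataWriter writer;

    Synchronizer(DataReader reader, DataWriter writer) {
        this.reader = reader;
        this.writer = writer;
    }

    void run() {
        String[] entry;
        while ((entry = reader.readEntry()) != null) {
            writer.writeEntry(entry);   // schema mapping applied by the writer
        }
    }
}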

  35. Outline • Automatic Data Virtualization • Relational/SQL • XML/XQuery based • Information Integration • Middleware for Streaming Data • Coarse-grained pipelined parallelism

  36. Streaming Data Model • Continuous data arrival and processing • Emerging model for data processing • Sources that produce data continuously: sensors, long running simulations • WAN bandwidths growing faster than disk bandwidths • Active topic in many computer science communities • Databases • Data Mining • Networking ….

  37. Summary/Limitations of Current Work • Focus on • centralized processing of stream from a single source (databases, data mining) • communication only (networking) • Many applications involve • distributed processing of streams • streams from multiple sources

  38. Motivating Application: Network Fault Management System • (Figure: a network of switches feeding the network fault management system.)

  39. Motivating Application (2): Computer Vision Based Surveillance

  40. Features of Distributed Streaming Processing Applications • Data sources could be distributed • Over a WAN • Continuous data arrival • Enormous volume • Probably can’t communicate it all to one site • Results from analysis may be desired at multiple sites • Real-time constraints • A real-time, high-throughput, distributed processing problem

  41. Need for a Grid-Based Stream Processing Middleware • Application developers interested in data stream processing • Would like to have abstracted: • Grid standards and interfaces • Adaptation functionality • Would like to focus on algorithms only • GATES is a middleware for • Grid-based • Self-adapting Data Stream Processing

  42. Adaptation for Real-time Processing • Analysis on streaming data is approximate • Accuracy and execution rate trade-off can be captured by certain parameters (Adaptation parameters) • Sampling Rate • Size of summary structure • Application developers can expose these parameters and a range of values

  43. API for Adaptation

public class SamplingStage implements StreamProcessing {
  ...
  void init() {
    ...
    // Expose the adaptation parameter and its allowed range to GATES.
    GATES.informationAboutAdjustmentParameter(min, max, 1);
    ...
  }
  ...
  void work(buffer in, buffer out) {
    ...
    while (true) {
      Image img = getFromBufferInGATES(in);
      // Ask the middleware for the currently suggested parameter value.
      samplingRatio = GATES.getSuggestedParameter();
      Image imgSample = Sampling(img, samplingRatio);
      putToBufferInGATES(imgSample, out);
    }
    ...
  }
}

  44. Outline • Automatic Data Virtualization • Relational/SQL • XML/XQuery based • Information Integration • Middleware for Streaming Data • Cluster and Grid-based data mining middleware

  45. Scalable Mining Problem • Our understanding of what algorithms and parameters will give desired insights is often limited • The time required for creating scalable implementations of different algorithms and running them with different parameters on large datasets slows down the data mining process

  46. Mining in a Grid Environment • A data mining application in a grid environment: • Needs to exploit different forms of available parallelism • Needs to deal with different data layouts and formats • Needs to adapt to resource availability

  47. FREERIDE Overview • Framework for Rapid Implementation of Datamining Engines • Demonstrated for a variety of standard mining algorithms • Targeted distributed-memory parallelism, shared-memory parallelism, and their combination • Can be used as a basis for scalable grid-based data mining implementations • Published in SDM 01, SDM 02, SDM 03, Sigmetrics 02, Europar 02, IPDPS 03, and IEEE TKDE (to appear)
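
The FREERIDE publications describe a generalized-reduction programming model; a rough sketch of what such an interface might look like is below (the names and signatures are illustrative assumptions, not the actual FREERIDE API):

import java.util.List;

// A generalized-reduction interface in the spirit of FREERIDE
// (illustrative only; the real API may differ).
interface GeneralizedReduction<Chunk, RObj> {
    // Update the local reduction object with one chunk of the dataset;
    // runs independently on each node or thread.
    void localReduction(Chunk chunk, RObj reductionObject);

    // Combine per-node/per-thread reduction objects into a single result.
    RObj globalCombine(List<RObj> reductionObjects);
}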

  48. FREERIDE-G • Data processing may not be feasible where the data resides • Need to identify resources for data processing • Need to abstract data retrieval, movement and parallel processing

  49. Students Involved • Recent Ph.D. Grads (2005-06): • Ruoming Jin (Kent State University) • Wei Du (Yahoo) • Xiaogang Li (Ask.com) • Liang Chen (Amazon) • Li Weng (Oracle) • Current Students: • Xuan Zhang (graduating Winter 07) • Kaushik Sinha (joint with Misha Belkin) • Leonid Glimcher (4th year) • Qian Zhu (3rd year) • Wenjing Ma (2nd year) • David Chiu (2nd year) • Fan Wang (2nd year)

  50. Some Newer Topics • Resource allocation, fault tolerance, and process migration in GATES (Qian Zhu) • FREERIDE-G using SRB (Leonid Glimcher) • FREERIDE on newer architectures (Wenjing Ma) • Deep web mining (for bioinformatics) (Fan Wang) • Service-oriented composition of data and services (David Chiu)
