An Approach for Automatic Data Virtualization

An Approach for Automatic Data Virtualization Li Weng, Gagan Agrawal et al.

Motivating Applications • Data-driven applications from science, engineering, and biomedicine: Oil Reservoir Management, Water Contamination Studies, Cancer Studies using MRI, Telepathology with Digitized Slides, Satellite Data Processing, Virtual Microscope … [Figures: Oil Reservoir Management; Magnetic Resonance Imaging]

Opportunity and Issues • Emergence of grid-based data repositories • Can enable sharing of data in an unprecedented way • Access mechanisms for remote repositories • Complex low-level formats make accessing and processing of data difficult • Main desired functionality • Ability to select and download a subset of data

Current Approaches • Databases • Relational model using SQL • Properties of transactions: Atomicity, Consistency, Isolation, Durability • Good! But is it too heavyweight for read-mostly scientific data? • Manual implementation based on low-level datasets • Need detailed understanding of low-level formats • HDF5, NetCDF, etc. • No single established standard • BinX, BFD, DFDL • Machine-readable descriptions, but the application is dependent on a specific layout

Data Virtualization • By Global Grid Forum’s DAIS working group: • A Data Virtualization describes an abstract view of data. • A Data Service implements the mechanism to access and process data through the Data Virtualization. [Figure: dataset, Data Virtualization (an abstract view of data), Data Service]

Our Approach: Automatic Data Virtualization • Automatically create data services • A new application of compiler technology • A meta-data descriptor describes the layout of data on a repository • An abstract view is exposed to the users • This paper: • Relational table view • Specify subsetting through SQL Select and Where statements

Outline • Introduction • Motivation • System overview • System design and algorithm • Design a meta-data descriptor • Automatic data virtualization using our meta-data descriptor • Experimental results • Related work • Conclusions and future work

System Overview SELECT < Data Elements > FROM < Dataset Name > WHERE …. AND Filter( < Data Element> );

STORM Runtime System • A middleware to support data selection, data partitioning, and data transfer operations on flat-file datasets hosted on a parallel system. • Services: Query service, Data source service, Indexing service, Filtering service, Partition generation service, Data mover service

Scientific datasets • Large volume • Gigabyte, Terabyte, Petabyte, … • Stored as binary/character flat files with highly repetitive structure • Distributed datasets • Generated/collected by scientific simulations or instruments • Multi-dimensional datasets • Spatial and/or temporal coordinates as subsetting index attributes • Filtering attributes

Design a Meta-data Description Language • Requirements • Specify the relationship of a dataset to the virtual dataset schema • Describe the dataset physical layout within a file • Describe the dataset distribution on nodes of one or more clusters • Specify the subsetting index attributes • Easy to use for data repository administrators and also convenient for our code generation

Design Overview • Dataset Schema Description Component • Dataset Storage Description Component • Dataset Layout Description Component

An Example • Oil Reservoir Management • The dataset comprises several simulations (realizations) on the same grid. • For each realization and each grid point, a number of attributes are stored. • The dataset is stored on a 4-node cluster.
Component I: Dataset Schema Description
    [IPARS]              // {* Dataset schema name *}
    REL = short int      // {* Data type definition *}
    TIME = int
    X = float
    Y = float
    Z = float
    SOIL = float
    SGAS = float
Component II: Dataset Storage Description
    [IparsData]                   // {* Dataset name *}
    DatasetDescription = IPARS    // {* Dataset schema for IparsData *}
    DIR[0] = osu0/ipars
    DIR[1] = osu1/ipars
    DIR[2] = osu2/ipars
    DIR[3] = osu3/ipars
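The schema and storage components above are simple enough to read with a few lines of code. The following is a minimal sketch, not the authors' tool: `parse_descriptor` is a hypothetical helper, and it assumes one entry per line, `[Name]` section headers, `key = value` pairs, and `//` starting a comment.

```python
import re

# Descriptor text copied from Components I and II of the example slide.
DESCRIPTOR = """
[IPARS]              // {* Dataset schema name *}
REL = short int      // {* Data type definition *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float

[IparsData]                   // {* Dataset name *}
DatasetDescription = IPARS    // {* Dataset schema for IparsData *}
DIR[0] = osu0/ipars
DIR[1] = osu1/ipars
DIR[2] = osu2/ipars
DIR[3] = osu3/ipars
"""


def parse_descriptor(text):
    """Return {section name: {key: value}} for the descriptor text.

    Hypothetical helper: the line-oriented syntax handled here is an
    assumption based on the slide, not a published grammar.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:
            current = sections.setdefault(header.group(1), {})
            continue
        if current is None:
            raise ValueError("entry outside of a section: " + line)
        key, _, value = line.partition("=")
        current[key.strip()] = value.strip()
    return sections


sections = parse_descriptor(DESCRIPTOR)
schema = sections["IPARS"]          # attribute name -> declared type
storage = sections["IparsData"]     # schema reference and node directories
```

Here `schema` keeps the attribute order of the slide (REL, TIME, X, Y, Z, SOIL, SGAS), and `storage["DatasetDescription"]` names the schema section that the dataset uses.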

Data Layout Description Component
    DATASET "ROOT" {
        DATATYPE { … }
        DATAINDEX { … }
        DATA { DATASET dataset1 DATASET dataset2 DATASET dataset3 }
        DATASET "dataset1" {
            DATATYPE { … }
            DATASPACE { … }
            DATA { data1 data2 data3 }
        }
        DATASET "dataset2" {
            DATATYPE { … }
            DATASPACE { … }
            DATA { data4 }
        }
        DATASET "dataset3" { …. }
    }
[Figure: tree with Dataset Root; dataset 1, dataset 2, dataset 3; Data1 to Data6]

An Example: Oil Reservoir Management
Component III: Dataset Layout Description
    DATASET "IparsData" {                 //{* Name for Dataset *}
        DATATYPE { IPARS }                //{* Schema for Dataset *}
        DATAINDEX { REL TIME }
        DATA { DATASET ipars1 DATASET ipars2 }
        DATASET "ipars1" {
            DATASPACE {
                LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 {
                    X Y Z
                }
            }
            DATA { $DIR[$DIRID]/COORDS $DIRID = 0:3:1 }
        } //{* end of DATASET "ipars1" *}
        DATASET "ipars2" {
            DATASPACE {
                LOOP TIME 1:500:1 {
                    LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 {
                        SOIL SGAS
                    }
                }
            }
            DATA { $DIR[$DIRID]/DATA$REL $REL = 0:3:1 $DIRID = 0:3:1 }
        } //{* end of DATASET "ipars2" *}
    }
• The LOOP keyword captures the repetitive structure within a file. • The grid has 4 partitions (0~3). • "IparsData" comprises "ipars1" and "ipars2". "ipars1" describes the data files that store the spatial coordinates; "ipars2" describes the data files that store the other attributes.
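The range notation in the DATA and LOOP clauses can be expanded mechanically. The sketch below is an illustration only: `expand_range`, `expand_files`, and `grid_range` are hypothetical helpers, and it assumes that `lo:hi:step` is inclusive of `hi` (consistent with the "4 partitions (0~3)" remark) and that `$DIR[$DIRID]` refers to the directory list of Component II.

```python
from itertools import product

# Directory list taken from Component II (Dataset Storage Description).
DIRS = ["osu0/ipars", "osu1/ipars", "osu2/ipars", "osu3/ipars"]


def expand_range(spec):
    """Expand 'lo:hi:step' into a list; the upper bound is assumed inclusive."""
    lo, hi, step = (int(part) for part in spec.split(":"))
    return list(range(lo, hi + 1, step))


def expand_files(pattern, bindings, dirs=DIRS):
    """Enumerate (path, variable values) for a DATA clause.

    pattern  -- e.g. "$DIR[$DIRID]/DATA$REL"
    bindings -- e.g. {"REL": "0:3:1", "DIRID": "0:3:1"}
    """
    names = sorted(bindings)
    files = []
    for values in product(*(expand_range(bindings[n]) for n in names)):
        env = dict(zip(names, values))
        # Resolve the directory first, then the remaining $NAME variables.
        path = pattern.replace("$DIR[$DIRID]", dirs[env["DIRID"]])
        for name, value in env.items():
            path = path.replace("$" + name, str(value))
        files.append((path, env))
    return files


def grid_range(dirid):
    """Grid points covered by LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1."""
    return dirid * 100 + 1, (dirid + 1) * 100


coord_files = expand_files("$DIR[$DIRID]/COORDS", {"DIRID": "0:3:1"})
data_files = expand_files("$DIR[$DIRID]/DATA$REL",
                          {"REL": "0:3:1", "DIRID": "0:3:1"})
```

Under these assumptions "ipars1" expands to one COORDS file per directory and "ipars2" to one DATA file per (directory, REL) pair; the variable values kept alongside each path are the implicit attributes of that file.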

Automatic Virtualization Using Meta-data • Aligned file chunks: {num_rows, {File1, Offset1, Num_Bytes1}, {File2, Offset2, Num_Bytes2}, ……, {Filem, Offsetm, Num_Bytesm}} • Our tool parses the meta-data descriptor and generates function code. At run time, the query provides the parameters used to invoke the generated functions, which create the Aligned File Chunks.
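One way to picture an aligned file chunk is as a small record plus a reader that stitches its segments into rows. This is a hedged sketch rather than the generated code: the `Segment` and `AlignedFileChunk` classes, the extra `attrs` field, and `read_rows` are hypothetical, and the reader assumes 4-byte little-endian floats stored record by record within each segment.

```python
import struct
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    path: str                # File_i
    offset: int              # Offset_i, in bytes from the start of the file
    num_bytes: int           # Num_Bytes_i
    attrs: Tuple[str, ...]   # attributes stored per row (hypothetical field)


@dataclass
class AlignedFileChunk:
    num_rows: int
    segments: List[Segment]


def read_rows(chunk):
    """Materialize the rows of one chunk as dicts of attribute values.

    Assumes every segment holds chunk.num_rows records of 4-byte
    little-endian floats, one value per attribute in seg.attrs.
    """
    names = []
    per_segment = []
    for seg in chunk.segments:
        with open(seg.path, "rb") as f:
            f.seek(seg.offset)
            raw = f.read(seg.num_bytes)
        width = len(seg.attrs)
        values = struct.unpack("<%df" % (chunk.num_rows * width), raw)
        # Slice the flat value list into one tuple per row.
        per_segment.append([values[i * width:(i + 1) * width]
                            for i in range(chunk.num_rows)])
        names.extend(seg.attrs)
    rows = []
    for parts in zip(*per_segment):
        flat = [value for part in parts for value in part]
        rows.append(dict(zip(names, flat)))
    return rows
```

The point of the alignment is visible in `zip(*per_segment)`: row i of every segment belongs to the same logical table row, so no join or lookup is needed when the attributes of a row are spread over several files.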

Compiler Analysis
    Data_Extract {
        Find_File_Groups()
        Process_File_Groups()
    }
    Find_File_Groups {
        Let S be the set of files that match against the query
        Classify files in S by the set of attributes they have
        Let S1, …, Sm be the m sets
        T = Ø
        foreach {s1, …, sm}, si ∈ Si {    {* cartesian product between S1, …, Sm *}
            If the values of implicit attributes are consistent {
                T = T ∪ {s1, …, sm}
            }
        }
        Output T
    }
    Process_File_Groups {
        foreach {s1, …, sm} ∈ T
            Find_Aligned_File_Chunks()
            Supply implicit attributes for each file chunk
            foreach Aligned File Chunk {
                Check against index
                Compute offset and length
                Output the aligned file chunk
            }
    }
[Figure: Meta-data descriptor; Create AFC; Process AFC; Index & Extraction function code]
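The Find_File_Groups step can be sketched in a few lines. This is an illustrative reading of the pseudocode, not the authors' generated code: files are represented as hypothetical dicts with a `path`, the tuple of `attrs` they store, and the `implicit` attribute values encoded in their names, and "consistent" is taken to mean that no implicit attribute has two different values within a group.

```python
from itertools import product


def find_file_groups(files):
    """Group query-matching files so that each group covers all attribute sets.

    files -- list of dicts: {"path": str, "attrs": tuple, "implicit": dict}
    Returns a list of tuples, one file from each attribute class.
    """
    # Classify files by the set of attributes they store (S1, ..., Sm).
    classes = {}
    for f in files:
        classes.setdefault(f["attrs"], []).append(f)

    groups = []
    # Cartesian product between S1, ..., Sm.
    for combo in product(*classes.values()):
        merged = {}
        consistent = True
        for f in combo:
            for name, value in f["implicit"].items():
                # setdefault records the first value seen for this attribute.
                if merged.setdefault(name, value) != value:
                    consistent = False
        if consistent:
            groups.append(combo)
    return groups


# Files left after matching a query with REL in {0, 1} against the IPARS layout.
coords = [{"path": "osu%d/ipars/COORDS" % d, "attrs": ("X", "Y", "Z"),
           "implicit": {"DIRID": d}} for d in range(4)]
data = [{"path": "osu%d/ipars/DATA%d" % (d, r), "attrs": ("SOIL", "SGAS"),
         "implicit": {"DIRID": d, "REL": r}}
        for d in range(4) for r in range(2)]
groups = find_file_groups(coords + data)
```

With these inputs the product has 4 × 8 = 32 candidate combinations, of which only the 8 that agree on DIRID survive.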

An Example (using the Component III layout description of "IparsData" shown earlier) • Consider a query selecting a subset with REL values of 0 and 1 and TIME from 1 to 100. • Exclude DATA2, DATA3 • Exclude COORD2, COORD3 • Decide eight file groups: for k = 0, 1, 2, 3: DIR[k]/{COORD0, DATA0} and DIR[k]/{COORD1, DATA1} • Create 100 Aligned File Chunks for each file group
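The chunk counts in this example follow from simple offset arithmetic, sketched below under stated assumptions: 4-byte floats, no file headers, and one chunk per TIME step covering the 100 grid points of a partition. The function `chunks_for_group` is hypothetical. The file names follow the layout descriptor (one COORDS file per directory), whereas the bullets above label the coordinate files COORD0 and COORD1; the group count is the same either way.

```python
FLOAT_BYTES = 4      # assumption: 4-byte floats
GRID_POINTS = 100    # grid points per partition, from the LOOP GRID range


def chunks_for_group(dir_path, rel, t_lo, t_hi):
    """Aligned file chunks for one file group and TIME in [t_lo, t_hi].

    Each chunk is (num_rows, [(file, offset, num_bytes), ...]).
    """
    coord_bytes = GRID_POINTS * 3 * FLOAT_BYTES   # X Y Z per grid point
    step_bytes = GRID_POINTS * 2 * FLOAT_BYTES    # SOIL SGAS per grid point
    chunks = []
    for t in range(t_lo, t_hi + 1):
        chunks.append((GRID_POINTS, [
            # Coordinates do not depend on TIME: same bytes for every step.
            (dir_path + "/COORDS", 0, coord_bytes),
            # LOOP TIME starts at 1, so step t begins (t - 1) blocks in.
            (dir_path + "/DATA%d" % rel, (t - 1) * step_bytes, step_bytes),
        ]))
    return chunks


all_chunks = []
for k in range(4):            # DIR[0] .. DIR[3]
    for rel in (0, 1):        # REL values selected by the query
        all_chunks.extend(chunks_for_group("osu%d/ipars" % k, rel, 1, 100))
```

Eight file groups times 100 TIME steps gives 800 chunks in total, each describing 100 rows.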

Experimental Setup & Design • A Linux cluster connected via Switched Fast Ethernet. Each node has a PIII 933 MHz CPU, 512 MB of main memory, and three 100 GB IDE disks. • Three sets of experiments: • Code generation ability • Scalability evaluation • Comparison with hand-written code

Test the ability of our code generation tool • Layout0 - original layout from the application collaborators • Layout1 - all data stored as a table in a file • Layout2 - all data in a file and each attribute stored as an array • Layout3 - Layout1 split into multiple files based on the value of the time step • Layout4 - like Layout3, but each attribute stored as an array in each data file • Layout5 - data stored in 7 files, where the first file holds the spatial coordinates and the other attributes are divided among 6 files • Layout6 - like Layout5, but each attribute stored as an array in each data file

Test the ability of our code generation tool (Oil Reservoir Management) • The performance difference is within 4%~10% relative to Layout0. • The tool correctly and efficiently handles a variety of different layouts for the same data.

Evaluate the Scalability of Our Tool • Scale the number of nodes hosting the Oil reservoir management dataset • Extract a subset of interest at the size of 1.3GB • The execution times scale almost linearly. • The performance difference varies between 5%~34%, with an average difference of 16%.

Comparison with hand-written code • Oil reservoir management dataset stored on 16 nodes: performance difference is within 17%, with an average difference of 14%. • Satellite data processing dataset stored on a single node: performance difference is within 4%.

Related Work • Describe data on the Grid • BinX and Binary Format Description • HDF5 • Parallel / distributed databases • Data cube • Magda on top of MySQL • Oracle’s external tables • OpeNDAP • SRS

Conclusions and Future Work • An automatic approach to support data virtualization for large distributed scientific datasets in low-level formats. • Design of a meta-data description language • Compiler-based strategy to generate extractor code automatically • The dataset can be stored in the format it is generated in, and no effort is involved in loading it into a database system. • Experimental evaluation demonstrates the efficacy and efficiency of our tool • Future work • Experimental studies of more real data-driven and interactive applications with larger scientific datasets in distributed and heterogeneous computing environments • Extend computation capability and flexibility by supporting user-defined aggregates • Integration of multiple datasets in the grid computing environment

Comparison with an existing database (PostgreSQL) 6GB data for Satellite data processing. The total storage required after loading the data in PostgreSQL is 18GB. Create Index for both spatial coordinates and S1 in PostgreSQL. No special performance tuning applied for the experiment.
