A Grid-Based Middleware for Scalable Processing of Remote Data

Presentation Transcript


1. A Grid-Based Middleware for Scalable Processing of Remote Data
Leonid Glimcher
Advisor: Gagan Agrawal

2. Abundance of Data
Data is generated everywhere:
• Sensors (intrusion detection, satellites)
• Scientific simulations (fluid dynamics, molecular dynamics)
• Business transactions (purchases, market trends)
Analysis is needed to translate data into knowledge, but growing dataset sizes create problems for data mining.

3. Emergence of Grid Computing
Provides access to:
• Resources
• Data that would otherwise be inaccessible
Aims to provide a stable foundation for distributed services. This comes at a price: standards are needed to integrate distributed services:
• Open Grid Services Architecture (OGSA)
• Web Services Resource Framework (WSRF)

4. Remote Data Analysis
The Grid is a good fit for remote data analysis, but the details can be very tedious:
• Data retrieval, movement, and caching
• Parallel data processing
• Resource allocation
• Application configuration
Middleware is useful for abstracting away these details.

5. Our Approach
Support the development of scalable applications that process remote data using middleware: FREERIDE-G (Framework for Rapid Development of Datamining Engines in Grid).
[Architecture diagram: middleware user, repository cluster, compute cluster]

6. FREERIDE-G Middleware
Realized:
• Active Data Repository (ADR) based implementation
• Storage Resource Broker (SRB) based implementation: compliance with remote repository standards
• Performance modeling and prediction framework
Future:
• Compliance with grid computing standards
• Support for data integration and load balancing

7. Publications
Published:
• "Parallelizing EM Clustering Algorithm on a Cluster of SMPs", Euro-Par'04
• "Scaling and Parallelizing a Scientific Feature Mining Application Using a Cluster Middleware", IPDPS'04
• "Parallelizing a Defect Detection and Categorization Application", IPDPS'05
• "FREERIDE-G: Supporting Applications that Mine Remote Data Repositories", ICPP'06
• "A Performance Prediction Framework for Grid-Based Data Mining Applications", IPDPS'07
In submission:
• "A Middleware for Remote Mining of Data From SRB-based Servers", submitted to SC'07
• "Middleware for Data Mining Applications on Clusters and Grids", invited to the special issue of the Journal of Parallel and Distributed Computing (JPDC) on "Parallel Techniques for Information Extraction"

8. Outline
• Introduction
• FREERIDE-G over SRB
• Performance prediction for FREERIDE-G
• Future directions:
  • Standardization as a Grid Service
  • Data integration and load balancing
• Related work
• Conclusion

9. FREERIDE-G Architecture
[Architecture diagram: compute nodes connect to a data host, where the SRB Master forks SRB Agents that consult MCAT]
Per compute node:
• I/O registration and connection establishment
• While (more chunks to process):
  • I/O request dispatched
  • Pending I/O polled
  • Retrieved data chunks analyzed

10. SRB Data Host
SRB Master:
• Connection establishment
• Authentication
• Forks an agent to service I/O
SRB Agent:
• Performs remote I/O
• Services multiple client API requests
MCAT:
• Catalogs metadata associated with datasets, users, and resources
• Services metadata queries (through the SRB agent)

11. SDSC Storage Resource Broker
A standard for remote data storage and access:
• Data can be distributed across organizations and heterogeneous storage systems
• Used to host data for FREERIDE-G
• Access is provided through a client API

12. Compute Node
There are typically more compute nodes than data hosts. Each node:
• Registers I/O (from an index)
• Connects to a data host
While (chunks remain to process):
• Dispatch I/O request(s)
• Poll pending I/O
• Process retrieved chunks
A sketch of this retrieval loop follows below.
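The loop above overlaps retrieval with processing by keeping several asynchronous requests in flight. The following C++ sketch illustrates the pattern; the DataHostConnection class, Chunk type, and method names are hypothetical stand-ins for the middleware's internals, not the actual FREERIDE-G API.

    #include <cstddef>
    #include <queue>
    #include <vector>

    // Hypothetical stand-ins for the middleware internals (names assumed).
    struct Chunk { std::size_t id; std::vector<char> bytes; };

    class DataHostConnection {
        std::queue<std::size_t> pending_;  // stub: dispatched requests land here
    public:
        void dispatch(std::size_t chunk_id) { pending_.push(chunk_id); }  // async I/O request
        bool ready() const { return !pending_.empty(); }                  // poll for completion
        Chunk retrieve() {                                                // collect one chunk
            Chunk c{pending_.front(), {}};
            pending_.pop();
            return c;
        }
    };

    // Process all chunks assigned to this node, keeping up to `concurrency`
    // retrievals in flight so that I/O overlaps with computation.
    void process_assigned_chunks(DataHostConnection& host,
                                 const std::vector<std::size_t>& assigned,
                                 std::size_t concurrency,
                                 void (*process)(const Chunk&)) {
        std::size_t next = 0, in_flight = 0;
        while (next < assigned.size() || in_flight > 0) {
            // Dispatch requests until the concurrency window is full.
            while (in_flight < concurrency && next < assigned.size()) {
                host.dispatch(assigned[next++]);
                ++in_flight;
            }
            // Poll pending I/O and process one retrieved chunk.
            if (host.ready()) {
                Chunk c = host.retrieve();
                --in_flight;
                process(c);
            }
        }
    }

With concurrency = 1 this degenerates to strictly serial retrieve-then-process; slides 19 and 20 measure how raising the chunk size and the concurrency level improves the overlap.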

13. FREERIDE-G Processing Structure
Key observation: most data mining algorithms follow a canonical loop.
Middleware API:
• Subset of the data to be processed
• Reduction object
• Local and global reduction operations
• Iterator
Derived from the precursor system, FREERIDE. The canonical loop:

    While (not done) {
        forall (data instances d) {
            I = process(d)
            R(I) = R(I) op d
        }
        .......
    }

An illustrative interface sketch follows below.
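Based on the API elements listed above, the interface might look roughly as follows in C++; the class and method names are assumed for exposition and are not the actual FREERIDE-G headers.

    // Illustrative shape of the generalized-reduction interface.
    template <typename Instance, typename ReductionObject>
    class ReductionTask {
    public:
        virtual ~ReductionTask() = default;
        // Local reduction: map an instance to a slot of the reduction
        // object and fold it in (the "R(I) = R(I) op d" step above).
        virtual void local_reduce(const Instance& d, ReductionObject& R) = 0;
        // Global reduction: merge per-node reduction objects after a pass.
        virtual void global_reduce(ReductionObject& into,
                                   const ReductionObject& from) = 0;
        // Iterator control: does the algorithm need another pass?
        virtual bool converged(const ReductionObject& R) const = 0;
    };

The key property is that all cross-node communication is confined to the (usually small) reduction object, which is what makes the canonical loop easy to parallelize.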

14. Classic Data Mining Applications
Tested here:
• Expectation-Maximization (EM) clustering
• k-means clustering
• k-nearest-neighbor search
Previously on FREERIDE:
• Apriori and FP-growth frequent itemset miners
• RainForest decision tree construction
The middleware is useful for a wide variety of algorithms; a k-means sketch follows below.
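To make the canonical loop concrete, here is how k-means might map onto the interface sketched above. This is an illustration consistent with the slides, not the shipped code; the cluster count, dimensionality, and names are assumed.

    #include <array>
    #include <cstddef>
    #include <limits>

    constexpr std::size_t K = 8, DIM = 3;  // assumed cluster count and dimensionality
    struct Point { std::array<double, DIM> x; };

    // Reduction object: per-cluster coordinate sums and member counts,
    // plus the centroids from the previous iteration.
    struct KmeansState {
        std::array<Point, K> centroids{};
        std::array<std::array<double, DIM>, K> sum{};
        std::array<std::size_t, K> count{};
    };

    // Local reduction: assign a point to its nearest centroid, accumulate.
    void local_reduce(const Point& d, KmeansState& R) {
        std::size_t best = 0;
        double best_dist = std::numeric_limits<double>::max();
        for (std::size_t k = 0; k < K; ++k) {
            double dist = 0;
            for (std::size_t i = 0; i < DIM; ++i) {
                const double diff = d.x[i] - R.centroids[k].x[i];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = k; }
        }
        for (std::size_t i = 0; i < DIM; ++i) R.sum[best][i] += d.x[i];
        ++R.count[best];
    }

    // Global reduction: merge partial sums and counts from another node.
    void global_reduce(KmeansState& into, const KmeansState& from) {
        for (std::size_t k = 0; k < K; ++k) {
            for (std::size_t i = 0; i < DIM; ++i) into.sum[k][i] += from.sum[k][i];
            into.count[k] += from.count[k];
        }
    }

After the global reduction, each centroid is recomputed as sum/count and the loop repeats until assignments stabilize; only the sums and counts cross the network, never the data chunks themselves.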

15. Scientific Applications
VortexPro:
• Finds vortices in volumetric fluid/gas flow datasets
Defect detection:
• Finds defects in molecular lattice datasets
• Categorizes defects into classes

16. Experimental Setup (LAN)
A single cluster of dual-processor 2.4 GHz Opteron 250 nodes connected through Mellanox InfiniBand (1 Gb).
Goals:
• Evaluate data and compute node scalability
• Quantify the overhead of using SRB (vs. ADR and TCP/IP)
The Active Data Repository implementation uses ADR for data retrieval and TCP/IP streams for data delivery to the compute nodes (for processing).

17. LAN Results
Applications:
• Defect detection (16.4 GB)
• k-means (6.4 GB)
Observations:
• In configurations with equal numbers of compute nodes and data hosts, ADR outperforms SRB
• In configurations with more compute nodes than data hosts, SRB wins
• Scalability is very good with respect to both data hosts and compute nodes

18. Setup 2: Organizational Grid
Data is hosted on Opteron 250s (as before) and processed on Opteron 254s; the two clusters are connected through two 10 Gb optical fibers.
Goals:
• Investigate the effects of remote processing:
  • Effect of data chunk size on remote retrieval overhead
  • Effect of retrieval concurrency on overhead

19. Effects of Chunk Size
Applications:
• Vortex detection (13.6 GB)
• kNN search (6.4 GB)
Observations:
• Chunk size has no effect on execution time in the LAN
• Enlarging the chunk size in the organizational grid decreases remote retrieval overheads
• The difference is accounted for by the ability to overlap retrieval with processing

20. Effects of Concurrent Retrieval
Applications:
• Defect detection (12.2 GB)
• k-means clustering (6.4 GB)
Observations:
• Retrieval and processing are more overlapped
• All applications tested benefited from an increase in concurrency level
• Remote retrieval overhead can be reduced effectively

21. Outline
• Introduction
• FREERIDE-G over SRB
• Performance prediction for FREERIDE-G
• Future directions:
  • Standardization as a Grid Service
  • Data integration and load balancing
• Related work
• Conclusion

22. Motivation for Performance Prediction
Dataset replica selection is complicated by:
• Cluster size (number of partitions)
• Network speed between the repository and compute clusters
Compute cluster selection is complicated by:
• Cluster size (processing parallelism)
• Network speed (data delivery)
• Processing structure
Modeling performance is essential for selecting data replicas and compute resources.

23. Approach and Processing Structure
Three phases of execution:
• Retrieval at the data server
• Data delivery to the compute nodes
• Parallel processing at the compute nodes
A special processing structure, generalized reduction, makes the model tractable:

    Texec = Tdisk + Tnetwork + Tcompute

24. Information Needed
• Dataset replica (repository cluster) size
• Compute cluster size
• Connecting network transfer rate
• Application profile:
  • 3-stage execution-time breakdown
  • Reduction operation characteristics
  • Reduction object size
A sketch of such a profile follows below.
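The inputs above could be packaged roughly as follows. This struct is purely illustrative of what the predictor consumes; the field names are assumed and are not the framework's actual data structures.

    #include <cstddef>

    // Illustrative inputs to the performance predictor (names assumed).
    struct ApplicationProfile {
        double t_disk0;                  // retrieval time measured on a base run
        double t_network0;               // delivery time on the base run
        double t_compute0;               // processing time on the base run
        double reduction_op_cost;        // reduction operation characteristics
        std::size_t reduction_obj_bytes; // reduction object size (comm. volume)
    };

    struct TargetConfiguration {
        std::size_t storage_nodes;       // dataset replica (repository) size
        std::size_t compute_nodes;       // compute cluster size
        double network_rate;             // connecting network transfer rate
        std::size_t dataset_bytes;       // dataset size
    };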

25. Initial Phases
The data retrieval and delivery phases both scale with:
• Dataset size
• Number of storage nodes
Data delivery also scales with:
• Network transfer rate
The appropriate component of the execution-time breakdown is scaled using factor ratios, as formalized below.
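One plausible formalization of this factor-ratio scaling, assuming each component is linear in the factors named above (subscript 0 marks the profiled base run; D is dataset size, N is the number of storage nodes, B is the network transfer rate):

    Tdisk = Tdisk,0 × (D / D0) × (N0 / N)
    Tnetwork = Tnetwork,0 × (D / D0) × (N0 / N) × (B0 / B)

This is a sketch of the idea rather than the exact published model.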

26. Data Processing Phase
Processing has two components:
• Parallel processing on each node, which scales with:
  • the number of cluster nodes
  • the dataset size
• Sequential work:
  • reduction object communication
  • global reduction
The common processing structure is leveraged for accurate processing-time prediction (see the formula below).
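Under the same linear-scaling assumption, the compute term of Texec might decompose as follows, where P is the number of compute nodes and the last two terms depend on the reduction object size rather than the dataset size:

    Tcompute = Tparallel,0 × (D / D0) × (P0 / P) + Tcomm(P) + Tglobal

Again, this is an illustrative sketch consistent with the slide, not the thesis model itself.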

27. Predicting for Different Resources
Requires scaling factors:
• for each of the 3 phases of computation
• derived from a representative set of applications
• the application being modeled does not have to be part of the representative set
An average is used for scaling:
• susceptible to outliers
• accuracy is affected

28. Summarized Prediction Results
Modeling the reduction in detail pays off. The model correctly captures changes in (very accurately):
• Dataset size
• Parallel configuration:
  • storage cluster
  • compute cluster
• Network speed
• Underlying resources (less accurately)

29. Outline
• Introduction
• FREERIDE-G over SRB
• Performance prediction for FREERIDE-G
• Future directions:
  • Standardization as a Grid Service
  • Data integration and load balancing
• Related work
• Conclusion

30. FREERIDE-G Grid Service
The emergence of the Grid leads to computing standards and infrastructures. The data host component is already integrated through SRB for remote data access.
• OGSA: the previous standard for Grid Services
• WSRF: the standard that merges Web and Grid Service definitions
The Globus Toolkit is the most fitting infrastructure for the Grid Service conversion of FREERIDE-G.

31. Data Mining Service Provider
[Architecture diagram: a FREERIDE-G user supplies dataset and repository info plus configuration to a web server exposing WSDL API functions; a launcher/deployer and resource manager start data servers (1..N) and compute nodes (1..M), with parallelization and remote retrieval support provided by the middleware]

32. Conversion Challenges
• Interfacing the C++ middleware with GT 4.0 (Java):
  • Use SWIG to generate a Java Native Interface (JNI) wrapper
• Integrating the data mining service through a WSDL interface:
  • Encapsulate the service as a resource
• Middleware deployment using GT 4.0:
  • Use the Web Service Grid Resource Allocation and Management (WS-GRAM) module for deployment

33. Need for Load Balancing & Integration
Data integration:
• The physical layout differs from the application's (logical) view
• An automatically generated wrapper connects the two views
Load balancing:
• Horizontal vs. vertical views
• Uneven distribution of vertical views
• Different network transfer rates (for delivery of different views)

34. Run-time Load Balancing
• Based on the unbalanced (initial) distribution, compute an even partitioning scheme (across repository nodes)
• Based on the difference between the initial and even distributions, determine data senders and receivers
• Match up sender/receiver pairs
• Organize the multiple concurrent transfers required to alleviate the data delivery bottleneck
A sketch of the matching step follows below.
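The pairing step might look roughly like the following greedy matching in C++. It is a sketch under the assumption that chunks can be moved freely between repository nodes; it is not the middleware's actual algorithm.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Transfer { std::size_t from, to, chunks; };

    // Greedily pair overloaded repository nodes (senders) with underloaded
    // ones (receivers) until every node holds roughly its even share.
    std::vector<Transfer> plan_transfers(std::vector<std::size_t> load) {
        std::size_t total = 0;
        for (std::size_t l : load) total += l;
        const std::size_t share = total / load.size();  // even partitioning target

        std::vector<Transfer> plan;
        std::size_t s = 0, r = 0;
        while (true) {
            while (s < load.size() && load[s] <= share) ++s;  // next sender
            while (r < load.size() && load[r] >= share) ++r;  // next receiver
            if (s == load.size() || r == load.size()) break;
            const std::size_t amount =
                std::min(load[s] - share, share - load[r]);
            plan.push_back({s, r, amount});
            load[s] -= amount;
            load[r] += amount;
        }
        return plan;  // the planned transfers can then run concurrently
    }

Because each planned transfer has distinct sender/receiver pairs at any point in the scan, several of them can proceed concurrently, which is what alleviates the delivery bottleneck.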

35. Approach to Integration
The available bandwidth is also used to determine the level of concurrency. After all vertical views of the chunks are redistributed to their destination repository nodes, a wrapper converts the data.
Automatically generated wrapper:
• Input: physical data layout (storage layout)
• Output: logical data layout (application view)
An illustrative wrapper appears below.
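As an illustration only (the real wrappers are generated automatically from layout metadata), a hand-written wrapper converting a column-split physical layout into the row-oriented records an application expects might look like this in C++:

    #include <cstddef>
    #include <vector>

    // Logical view: one record per data point (illustrative).
    struct Record { double x, y, z; };

    // Physical layout: each attribute stored as a separate vertical view.
    struct PhysicalChunk {
        std::vector<double> xs, ys, zs;  // equal-length attribute arrays
    };

    // Wrapper: reassemble logical records from the vertical views.
    std::vector<Record> to_logical_view(const PhysicalChunk& c) {
        std::vector<Record> records;
        records.reserve(c.xs.size());
        for (std::size_t i = 0; i < c.xs.size(); ++i)
            records.push_back({c.xs[i], c.ys[i], c.zs[i]});
        return records;
    }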

36. Outline
• Introduction
• FREERIDE-G over SRB
• Performance prediction for FREERIDE-G
• Future directions:
  • Standardization as a Grid Service
  • Data integration and load balancing
• Related work
• Conclusion

37. Grid Computing
Standards:
• Open Grid Services Architecture (OGSA)
• Web Services Resource Framework (WSRF)
Implementations:
• Globus Toolkit, WSRF.NET, WSRF::Lite
Other Grid systems:
• Condor and Legion
• Workflow composition systems: GridLab, Nimrod/G (GridBus), Cactus, GRIST

38. Data Grids
Metadata cataloging:
• Artemis project
Remote data retrieval:
• SRB, SEMPLAR
• DataCutter, STORM
Stream processing middleware:
• GATES
• dQUOB
Data integration:
• Automatic wrapper generation

39. Remote Data in Grid
KnowledgeGrid tool-set:
• Developing data mining workflows (composition)
GridMiner toolkit:
• Composition of distributed, parallel workflows
DiscoveryNet layer:
• Creation, deployment, and management of services
DataMiningGrid framework:
• Distributed knowledge discovery
Other projects partially overlap in goals with ours.

40. Resource Allocation Approaches
Three broad categories of resource allocation:
• Heuristic approaches to mapping
• Prediction through modeling:
  • Statistical estimation/prediction
  • Analytical modeling of parallel applications
• Simulation-based performance prediction

41. Conclusions
FREERIDE-G is middleware for scalable processing of remote data:
• Support for high-end processing
• Eases the use of parallel configurations
• Hides the details of data movement and caching
• Predictable and scalable performance
• Compliance with remote repository standards
Future work:
• Compliance with grid computing standards
• Support for data integration and load balancing
