A Grid-Based Middleware for Scalable Processing of Remote Data

Presentation Transcript


1. A Grid-Based Middleware for Scalable Processing of Remote Data
Leonid Glimcher
Advisor: Gagan Agrawal

2. Abundance of Data
Data is generated everywhere:
• Sensors (intrusion detection, satellites)
• Scientific simulations (fluid dynamics, molecular dynamics)
• Business transactions (purchases, market trends)
Analysis is needed to translate data into knowledge, but growing dataset sizes create problems for data mining.

3. Emergence of Grid Computing
Provides access to:
• Resources
• Data that would otherwise be inaccessible
Aims to provide a stable foundation for distributed services. This comes at a price: standards are needed to integrate distributed services:
• Open Grid Services Architecture (OGSA)
• Web Services Resource Framework (WSRF)

4. Remote Data Analysis
The Grid is a good fit for remote data analysis, but the details can be very tedious:
• Data retrieval, movement, and caching
• Parallel data processing
• Resource allocation
• Application configuration
Middleware is useful for abstracting away these details.

5. Our Approach
Support the development of scalable applications that process remote data using middleware: FREERIDE-G (Framework for Rapid Development of Datamining Engines in Grid).
[Architecture diagram: middleware user, repository cluster, compute cluster]

6. FREERIDE-G Middleware
Realized:
• Active Data Repository (ADR) based implementation
• Storage Resource Broker (SRB) based implementation: compliance with remote repository standards
• Performance modeling and prediction framework
Future:
• Compliance with grid computing standards
• Support for data integration and load balancing

7. Publications
Published:
• "Parallelizing EM Clustering Algorithm on a Cluster of SMPs", Euro-Par'04
• "Scaling and Parallelizing a Scientific Feature Mining Application Using a Cluster Middleware", IPDPS'04
• "Parallelizing a Defect Detection and Categorization Application", IPDPS'05
• "FREERIDE-G: Supporting Applications that Mine Remote Data Repositories", ICPP'06
• "A Performance Prediction Framework for Grid-Based Data Mining Applications", IPDPS'07
In submission:
• "A Middleware for Remote Mining of Data From SRB-based Servers", submitted to SC'07
• "Middleware for Data Mining Applications on Clusters and Grids", invited to the special issue of the Journal of Parallel and Distributed Computing (JPDC) on "Parallel Techniques for Information Extraction"

8. Outline
• Introduction
• FREERIDE-G over SRB
• Performance prediction for FREERIDE-G
• Future directions:
  • Standardization as a Grid Service
  • Data integration and load balancing
• Related work
• Conclusion

9. FREERIDE-G Architecture
[Architecture diagram: compute nodes connect to a data host, where the SRB Master forks SRB Agents that consult MCAT]
Per compute node:
• I/O registration and connection establishment
• While (more chunks to process):
  • I/O request dispatched
  • Pending I/O polled
  • Retrieved data chunks analyzed

10. SRB Data Host
SRB Master:
• Connection establishment
• Authentication
• Forks an agent to service I/O
SRB Agent:
• Performs remote I/O
• Services multiple client API requests
MCAT:
• Catalogs metadata associated with datasets, users, and resources
• Services metadata queries (through the SRB agent)

11. SDSC Storage Resource Broker
A standard for remote data storage and access:
• Data can be distributed across organizations and heterogeneous storage systems
• Used to host data for FREERIDE-G
• Access is provided through a client API

12. Compute Node
There are typically more compute nodes than data hosts. Each node:
• Registers I/O (from an index)
• Connects to a data host
While (chunks remain to process):
• Dispatch I/O request(s)
• Poll pending I/O
• Process retrieved chunks
A sketch of this retrieval loop follows below.
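The loop above overlaps retrieval with processing by keeping several asynchronous requests in flight. The following C++ sketch illustrates the pattern; the DataHostConnection class, Chunk type, and method names are hypothetical stand-ins for the middleware's internals, not the actual FREERIDE-G API.

    #include <cstddef>
    #include <queue>
    #include <vector>

    // Hypothetical stand-ins for the middleware internals (names assumed).
    struct Chunk { std::size_t id; std::vector<char> bytes; };

    class DataHostConnection {
        std::queue<std::size_t> pending_;  // stub: dispatched requests land here
    public:
        void dispatch(std::size_t chunk_id) { pending_.push(chunk_id); }  // async I/O request
        bool ready() const { return !pending_.empty(); }                  // poll for completion
        Chunk retrieve() {                                                // collect one chunk
            Chunk c{pending_.front(), {}};
            pending_.pop();
            return c;
        }
    };

    // Process all chunks assigned to this node, keeping up to `concurrency`
    // retrievals in flight so that I/O overlaps with computation.
    void process_assigned_chunks(DataHostConnection& host,
                                 const std::vector<std::size_t>& assigned,
                                 std::size_t concurrency,
                                 void (*process)(const Chunk&)) {
        std::size_t next = 0, in_flight = 0;
        while (next < assigned.size() || in_flight > 0) {
            // Dispatch requests until the concurrency window is full.
            while (in_flight < concurrency && next < assigned.size()) {
                host.dispatch(assigned[next++]);
                ++in_flight;
            }
            // Poll pending I/O and process one retrieved chunk.
            if (host.ready()) {
                Chunk c = host.retrieve();
                --in_flight;
                process(c);
            }
        }
    }

With concurrency = 1 this degenerates to strictly serial retrieve-then-process; slides 19 and 20 measure how raising the chunk size and the concurrency level improves the overlap.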

13. FREERIDE-G Processing Structure
Key observation: most data mining algorithms follow a canonical loop.
Middleware API:
• Subset of the data to be processed
• Reduction object
• Local and global reduction operations
• Iterator
Derived from the precursor system, FREERIDE. The canonical loop:

    While (not done) {
        forall (data instances d) {
            I = process(d)
            R(I) = R(I) op d
        }
        .......
    }

An illustrative interface sketch follows below.
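Based on the API elements listed above, the interface might look roughly as follows in C++; the class and method names are assumed for exposition and are not the actual FREERIDE-G headers.

    // Illustrative shape of the generalized-reduction interface.
    template <typename Instance, typename ReductionObject>
    class ReductionTask {
    public:
        virtual ~ReductionTask() = default;
        // Local reduction: map an instance to a slot of the reduction
        // object and fold it in (the "R(I) = R(I) op d" step above).
        virtual void local_reduce(const Instance& d, ReductionObject& R) = 0;
        // Global reduction: merge per-node reduction objects after a pass.
        virtual void global_reduce(ReductionObject& into,
                                   const ReductionObject& from) = 0;
        // Iterator control: does the algorithm need another pass?
        virtual bool converged(const ReductionObject& R) const = 0;
    };

The key property is that all cross-node communication is confined to the (usually small) reduction object, which is what makes the canonical loop easy to parallelize.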

14. Classic Data Mining Applications
Tested here:
• Expectation-Maximization (EM) clustering
• k-means clustering
• k-nearest-neighbor search
Previously on FREERIDE:
• Apriori and FP-growth frequent itemset miners
• RainForest decision tree construction
The middleware is useful for a wide variety of algorithms; a k-means sketch follows below.
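To make the canonical loop concrete, here is how k-means might map onto the interface sketched above. This is an illustration consistent with the slides, not the shipped code; the cluster count, dimensionality, and names are assumed.

    #include <array>
    #include <cstddef>
    #include <limits>

    constexpr std::size_t K = 8, DIM = 3;  // assumed cluster count and dimensionality
    struct Point { std::array<double, DIM> x; };

    // Reduction object: per-cluster coordinate sums and member counts,
    // plus the centroids from the previous iteration.
    struct KmeansState {
        std::array<Point, K> centroids{};
        std::array<std::array<double, DIM>, K> sum{};
        std::array<std::size_t, K> count{};
    };

    // Local reduction: assign a point to its nearest centroid, accumulate.
    void local_reduce(const Point& d, KmeansState& R) {
        std::size_t best = 0;
        double best_dist = std::numeric_limits<double>::max();
        for (std::size_t k = 0; k < K; ++k) {
            double dist = 0;
            for (std::size_t i = 0; i < DIM; ++i) {
                const double diff = d.x[i] - R.centroids[k].x[i];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = k; }
        }
        for (std::size_t i = 0; i < DIM; ++i) R.sum[best][i] += d.x[i];
        ++R.count[best];
    }

    // Global reduction: merge partial sums and counts from another node.
    void global_reduce(KmeansState& into, const KmeansState& from) {
        for (std::size_t k = 0; k < K; ++k) {
            for (std::size_t i = 0; i < DIM; ++i) into.sum[k][i] += from.sum[k][i];
            into.count[k] += from.count[k];
        }
    }

After the global reduction, each centroid is recomputed as sum/count and the loop repeats until assignments stabilize; only the sums and counts cross the network, never the data chunks themselves.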

15. Scientific Applications
VortexPro:
• Finds vortices in volumetric fluid/gas flow datasets
Defect detection:
• Finds defects in molecular lattice datasets
• Categorizes defects into classes

16. Experimental Setup (LAN)
A single cluster of dual-processor 2.4 GHz Opteron 250 nodes connected through Mellanox InfiniBand (1 Gb).
Goals:
• Evaluate data and compute node scalability
• Quantify the overhead of using SRB (vs. ADR and TCP/IP)
The Active Data Repository implementation uses ADR for data retrieval and TCP/IP streams for data delivery to the compute nodes (for processing).

17. LAN Results
Applications:
• Defect detection (16.4 GB)
• k-means (6.4 GB)
Observations:
• In configurations with equal numbers of compute nodes and data hosts, ADR outperforms SRB
• In configurations with more compute nodes than data hosts, SRB wins
• Scalability is very good with respect to both data hosts and compute nodes

18. Setup 2: Organizational Grid
Data is hosted on Opteron 250s (as before) and processed on Opteron 254s; the two clusters are connected through two 10 Gb optical fibers.
Goals:
• Investigate the effects of remote processing:
  • Effect of data chunk size on remote retrieval overhead
  • Effect of retrieval concurrency on overhead

19. Effects of Chunk Size
Applications:
• Vortex detection (13.6 GB)
• kNN search (6.4 GB)
Observations:
• Chunk size has no effect on execution time in the LAN
• Enlarging the chunk size in the organizational grid decreases remote retrieval overheads
• The difference is accounted for by the ability to overlap retrieval with processing

20. Effects of Concurrent Retrieval
Applications:
• Defect detection (12.2 GB)
• k-means clustering (6.4 GB)
Observations:
• Retrieval and processing are more overlapped
• All applications tested benefited from an increase in concurrency level
• Remote retrieval overhead can be reduced effectively

21. Outline
• Introduction
• FREERIDE-G over SRB
• Performance prediction for FREERIDE-G
• Future directions:
  • Standardization as a Grid Service
  • Data integration and load balancing
• Related work
• Conclusion

22. Motivation for Performance Prediction
Dataset replica selection is complicated by:
• Cluster size (number of partitions)
• Network speed between the repository and compute clusters
Compute cluster selection is complicated by:
• Cluster size (processing parallelism)
• Network speed (data delivery)
• Processing structure
Modeling performance is essential for selecting data replicas and compute resources.

23. Approach and Processing Structure
Three phases of execution:
• Retrieval at the data server
• Data delivery to the compute nodes
• Parallel processing at the compute nodes
A special processing structure, generalized reduction, makes the model tractable:

    Texec = Tdisk + Tnetwork + Tcompute

24. Information Needed
• Dataset replica (repository cluster) size
• Compute cluster size
• Connecting network transfer rate
• Application profile:
  • 3-stage execution-time breakdown
  • Reduction operation characteristics
  • Reduction object size
A sketch of such a profile follows below.
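The inputs above could be packaged roughly as follows. This struct is purely illustrative of what the predictor consumes; the field names are assumed and are not the framework's actual data structures.

    #include <cstddef>

    // Illustrative inputs to the performance predictor (names assumed).
    struct ApplicationProfile {
        double t_disk0;                  // retrieval time measured on a base run
        double t_network0;               // delivery time on the base run
        double t_compute0;               // processing time on the base run
        double reduction_op_cost;        // reduction operation characteristics
        std::size_t reduction_obj_bytes; // reduction object size (comm. volume)
    };

    struct TargetConfiguration {
        std::size_t storage_nodes;       // dataset replica (repository) size
        std::size_t compute_nodes;       // compute cluster size
        double network_rate;             // connecting network transfer rate
        std::size_t dataset_bytes;       // dataset size
    };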

25. Initial Phases
The data retrieval and delivery phases both scale with:
• Dataset size
• Number of storage nodes
Data delivery also scales with:
• Network transfer rate
The appropriate component of the execution-time breakdown is scaled using factor ratios, as formalized below.
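One plausible formalization of this factor-ratio scaling, assuming each component is linear in the factors named above (subscript 0 marks the profiled base run; D is dataset size, N is the number of storage nodes, B is the network transfer rate):

    Tdisk = Tdisk,0 × (D / D0) × (N0 / N)
    Tnetwork = Tnetwork,0 × (D / D0) × (N0 / N) × (B0 / B)

This is a sketch of the idea rather than the exact published model.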

26. Data Processing Phase
Processing has two components:
• Parallel processing on each node, which scales with:
  • the number of cluster nodes
  • the dataset size
• Sequential work:
  • reduction object communication
  • global reduction
The common processing structure is leveraged for accurate processing-time prediction (see the formula below).
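Under the same linear-scaling assumption, the compute term of Texec might decompose as follows, where P is the number of compute nodes and the last two terms depend on the reduction object size rather than the dataset size:

    Tcompute = Tparallel,0 × (D / D0) × (P0 / P) + Tcomm(P) + Tglobal

Again, this is an illustrative sketch consistent with the slide, not the thesis model itself.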

27. Predicting for Different Resources
Requires scaling factors:
• for each of the 3 phases of computation
• derived from a representative set of applications
• the application being modeled does not have to be part of the representative set
An average is used for scaling:
• susceptible to outliers
• accuracy is affected

28. Summarized Prediction Results
Modeling the reduction in detail pays off. The model correctly captures changes in (very accurately):
• Dataset size
• Parallel configuration:
  • storage cluster
  • compute cluster
• Network speed
• Underlying resources (less accurately)

29. Outline
• Introduction
• FREERIDE-G over SRB
• Performance prediction for FREERIDE-G
• Future directions:
  • Standardization as a Grid Service
  • Data integration and load balancing
• Related work
• Conclusion

30. FREERIDE-G Grid Service
The emergence of the Grid leads to computing standards and infrastructures. The data host component is already integrated through SRB for remote data access.
• OGSA: the previous standard for Grid Services
• WSRF: the standard that merges Web and Grid Service definitions
The Globus Toolkit is the most fitting infrastructure for the Grid Service conversion of FREERIDE-G.

31. Data Mining Service Provider
[Architecture diagram: a FREERIDE-G user supplies dataset and repository info plus configuration to a web server exposing WSDL API functions; a launcher/deployer and resource manager start data servers (1..N) and compute nodes (1..M), with parallelization and remote retrieval support provided by the middleware]

32. Conversion Challenges
• Interfacing the C++ middleware with GT 4.0 (Java):
  • Use SWIG to generate a Java Native Interface (JNI) wrapper
• Integrating the data mining service through a WSDL interface:
  • Encapsulate the service as a resource
• Middleware deployment using GT 4.0:
  • Use the Web Service Grid Resource Allocation and Management (WS-GRAM) module for deployment

33. Need for Load Balancing & Integration
Data integration:
• The physical layout differs from the application's (logical) view
• An automatically generated wrapper connects the two views
Load balancing:
• Horizontal vs. vertical views
• Uneven distribution of vertical views
• Different network transfer rates (for delivery of different views)

34. Run-time Load Balancing
• Based on the unbalanced (initial) distribution, compute an even partitioning scheme (across repository nodes)
• Based on the difference between the initial and even distributions, determine data senders and receivers
• Match up sender/receiver pairs
• Organize the multiple concurrent transfers required to alleviate the data delivery bottleneck
A sketch of the matching step follows below.
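The pairing step might look roughly like the following greedy matching in C++. It is a sketch under the assumption that chunks can be moved freely between repository nodes; it is not the middleware's actual algorithm.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Transfer { std::size_t from, to, chunks; };

    // Greedily pair overloaded repository nodes (senders) with underloaded
    // ones (receivers) until every node holds roughly its even share.
    std::vector<Transfer> plan_transfers(std::vector<std::size_t> load) {
        std::size_t total = 0;
        for (std::size_t l : load) total += l;
        const std::size_t share = total / load.size();  // even partitioning target

        std::vector<Transfer> plan;
        std::size_t s = 0, r = 0;
        while (true) {
            while (s < load.size() && load[s] <= share) ++s;  // next sender
            while (r < load.size() && load[r] >= share) ++r;  // next receiver
            if (s == load.size() || r == load.size()) break;
            const std::size_t amount =
                std::min(load[s] - share, share - load[r]);
            plan.push_back({s, r, amount});
            load[s] -= amount;
            load[r] += amount;
        }
        return plan;  // the planned transfers can then run concurrently
    }

Because each planned transfer has distinct sender/receiver pairs at any point in the scan, several of them can proceed concurrently, which is what alleviates the delivery bottleneck.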

35. Approach to Integration
The available bandwidth is also used to determine the level of concurrency. After all vertical views of the chunks are redistributed to their destination repository nodes, a wrapper converts the data.
Automatically generated wrapper:
• Input: physical data layout (storage layout)
• Output: logical data layout (application view)
An illustrative wrapper appears below.
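As an illustration only (the real wrappers are generated automatically from layout metadata), a hand-written wrapper converting a column-split physical layout into the row-oriented records an application expects might look like this in C++:

    #include <cstddef>
    #include <vector>

    // Logical view: one record per data point (illustrative).
    struct Record { double x, y, z; };

    // Physical layout: each attribute stored as a separate vertical view.
    struct PhysicalChunk {
        std::vector<double> xs, ys, zs;  // equal-length attribute arrays
    };

    // Wrapper: reassemble logical records from the vertical views.
    std::vector<Record> to_logical_view(const PhysicalChunk& c) {
        std::vector<Record> records;
        records.reserve(c.xs.size());
        for (std::size_t i = 0; i < c.xs.size(); ++i)
            records.push_back({c.xs[i], c.ys[i], c.zs[i]});
        return records;
    }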

36. Outline
• Introduction
• FREERIDE-G over SRB
• Performance prediction for FREERIDE-G
• Future directions:
  • Standardization as a Grid Service
  • Data integration and load balancing
• Related work
• Conclusion

37. Grid Computing
Standards:
• Open Grid Services Architecture (OGSA)
• Web Services Resource Framework (WSRF)
Implementations:
• Globus Toolkit, WSRF.NET, WSRF::Lite
Other Grid systems:
• Condor and Legion
• Workflow composition systems: GridLab, Nimrod/G (GridBus), Cactus, GRIST

38. Data Grids
Metadata cataloging:
• Artemis project
Remote data retrieval:
• SRB, SEMPLAR
• DataCutter, STORM
Stream processing middleware:
• GATES
• dQUOB
Data integration:
• Automatic wrapper generation

39. Remote Data in Grid
KnowledgeGrid tool-set:
• Developing data mining workflows (composition)
GridMiner toolkit:
• Composition of distributed, parallel workflows
DiscoveryNet layer:
• Creation, deployment, and management of services
DataMiningGrid framework:
• Distributed knowledge discovery
Other projects partially overlap in goals with ours.

40. Resource Allocation Approaches
Three broad categories of resource allocation:
• Heuristic approaches to mapping
• Prediction through modeling:
  • Statistical estimation/prediction
  • Analytical modeling of parallel applications
• Simulation-based performance prediction

41. Conclusions
FREERIDE-G is middleware for scalable processing of remote data:
• Support for high-end processing
• Eases the use of parallel configurations
• Hides the details of data movement and caching
• Predictable and scalable performance
• Compliance with remote repository standards
Future work:
• Compliance with grid computing standards
• Support for data integration and load balancing
