A Middleware for Developing and Deploying Scalable Remote Mining Services

Presentation Transcript

  1. A Middleware for Developing and Deploying Scalable Remote Mining Services Leonid Glimcher, Gagan Agrawal, DataGrid Lab

  2. Abundance of Data Data is generated everywhere: • Sensors (intrusion detection, satellites) • Scientific simulations (fluid dynamics, molecular dynamics) • Business transactions (purchases, market trends) Analysis is needed to translate data into knowledge, but growing data sizes create problems for data mining

  3. Grid and Cloud Computing Provides access to otherwise inaccessible: • Resources • Data Goal of Grid computing: • to provide a stable foundation for distributed services Price of such a foundation: • standards needed for distributed service integration (OGSA and WSRF) Cloud computing: access to resources (data storage and processing) as services available for consumption

  4. Remote Data Analysis Remote data analysis: the Grid is a good fit, but the details can be very tedious: • Data retrieval, movement, and caching • Parallel data processing • Resource allocation • Application configuration Middleware can be useful to abstract away these details

  5. Our Approach Supporting development of scalable applications that process remote data using middleware: FREERIDE-G (Framework for Rapid Implementation of Datamining Engines in Grid) [Diagram: middleware user, repository cluster, compute cluster]

  6. Outline • Motivation • Introduction • Middleware Overview • Challenges • Experimental Evaluation • Related work • Conclusion

  7. FREERIDE-G Grid Service Grid computing standards and infrastructures: • OGSA vs. WSRF (merged Grid and Web Services) Data hosts compliant with repository standards (SRB) MPICH-G2: a pre-WS mechanism to support MPI Globus Toolkit and MPICH-G2 are the most fitting infrastructure for the Grid Service conversion of FREERIDE-G

  8. SDSC Storage Resource Broker A standard for remote data: • Storage • Access Data can be distributed across organizations and heterogeneous storage systems Used to host data for FREERIDE-G; access is provided through a client API

  9. FREERIDE-G Processing Structure KEY observation: most data mining algorithms follow a canonical loop Middleware API: • Subset of data to be processed • Reduction object • Local and global reduction operations • Iterator Derived from the precursor system FREERIDE While (not done) { forall (data instances d) { I = process(d); R(I) = R(I) op d } … }

  10. FREERIDE-G Evolution FREERIDE: data stored locally FREERIDE-G: • ADR responsible for remote data retrieval • SRB responsible for remote data retrieval FREERIDE-G grid service: a Grid service featuring • Load balancing • Data integration

  11. Evolution [Diagram: FREERIDE (application, local data) evolving into FREERIDE-G-ADR (over ADR), FREERIDE-G-SRB (over SRB), and FREERIDE-G-GT (over Globus)]

  12. FREERIDE-G System Architecture [Architecture diagram]

  13. Implementation Challenges I • Interaction with Code Repository • SWIG (Simplified Wrapper and Interface Generator) • XML descriptors of API functions • Each API function wrapped in its own class [Diagram: C++ to C++/Java via SWIG, with XML descriptors]

  14. Implementation Challenges II Integration with MPICH-G2 and Globus Toolkit: • Supports MPI • Deployed through pre-WS Globus components • Hides potential heterogeneity in service startup and management

  15. FREERIDE-G Applications VortexPro: • Finds vortices in volumetric fluid/gas flow datasets Kmeans Clustering: • Clusters points based on Euclidean distance in an attribute space EM Clustering: • Another distance-based clustering algorithm

  16. Experimental Setup Organizational Grid: • Data hosted on an Opteron 250 cluster • Processed on an Opteron 254 cluster • Connected using two 10 Gb optical fibers Goals: • Demonstrate parallel scalability of applications • Evaluate overhead of using MPICH-G2 and Globus Toolkit deployment mechanisms [Diagram: repository cluster, compute cluster]

  17. Scalability Evaluation Scalable implementations with respect to the number of: • Compute nodes • Data repository nodes [Charts: Vortex Detection with a 14.8 GB dataset (top); Kmeans Clustering with a 6.4 GB dataset (bottom)]

  18. Scalability Evaluation II Sub-linear speedup as compute nodes are scaled up is explained by the sequential scheduling of concurrent data reads [Charts: Kmeans Clustering (25.6 GB), EM Clustering (25.6 GB)]

  19. Deployment Overhead Evaluation A small overhead is clearly associated with using Globus and MPICH-G2 for middleware deployment. Kmeans Clustering with a 6.4 GB dataset: 18-20% overhead.

  20. Deployment Overhead Evaluation II Vortex Detection with a 14.8 GB dataset: 17-20% overhead. The overhead: • scales with the total execution time • doesn't affect overall data processing service scalability

  21. Outline • Motivation • Introduction • Middleware Overview • Challenges • Experimental Evaluation • Related work • Conclusion

  22. Grid Computing Standards: • Open Grid Service Architecture (OGSA) • Web Services Resource Framework (WSRF) Implementations: • Globus Toolkit, WSRF.NET, WSRF::Lite Other Grid systems: • Condor and Legion • Workflow composition systems: GridLab, Nimrod/G (GridBus), Cactus, GRIST

  23. Data Grids Metadata cataloging: • Artemis project Remote data retrieval: • SRB, SEMPLAR • DataCutter, STORM Stream processing middleware: • GATES • dQUOB Data integration: • Automatic wrapper generation

  24. Remote Data in Grid KnowledgeGrid tool-set: • Developing data mining workflows (composition) GridMiner toolkit: • Composition of distributed, parallel workflows DiscoveryNet layer: • Creation, deployment, and management of services DataMiningGrid framework: • Distributed knowledge discovery These projects partially overlap in goals with ours

  25. Conclusions FREERIDE-G: middleware for scalable processing of remote data, now available as a grid service • Compliance with grid and repository standards • Support for high-end processing • Eases use of parallel configurations • Hides details of data movement and caching • Scalable performance with respect to data and compute nodes • Low deployment overheads compared to the non-grid version

  26. SRB Data Host SRB Master: • Connection establishment • Authentication • Forks an agent to service I/O SRB Agent: • Performs remote I/O • Services multiple client API requests MCAT: • Catalogs data associated with datasets, users, and resources • Services metadata queries (through the SRB agent)

  27. Compute Node More compute nodes than data hosts Each node: • Registers I/O (from index) • Connects to a data host While (chunks to process): • Dispatch I/O request(s) • Poll pending I/O • Process retrieved chunks

  28. FREERIDE-G in Action [Diagram: the compute node performs I/O registration and connection establishment with the SRB Master on the data host, which consults MCAT and forks an SRB Agent; then, while more chunks remain to process, the compute node dispatches I/O requests, polls pending I/O, and analyzes the retrieved data chunks]

  29. Conversion Challenges • Interfacing the C++ middleware with GT 4.0 (Java) • SWIG to generate a Java Native Interface (JNI) wrapper • Integrating the data mining service through a WSDL interface • Encapsulating the service as a resource • Middleware deployment using GT 4.0 • Use of the Web Service Grid Resource Allocation and Management component (WS-GRAM) for deployment