A Middleware for Developing and Deploying Scalable Remote Mining Services

Leonid Glimcher, Gagan Agrawal
DataGrid Lab

Abundance of Data
Data generated everywhere:
• Sensors (intrusion detection, satellites)
• Scientific simulations (fluid dynamics, molecular dynamics)
• Business transactions (purchases, market trends)
Analysis needed to translate data into knowledge
Growing data size creates problems for data mining

Grid and Cloud Computing
Provides access to otherwise inaccessible:
• Resources
• Data
Goal of Grid Computing:
• to provide a stable foundation for distributed services
Price for such foundation:
• Standards needed for distributed service integration (OGSA & WSRF)
Cloud computing: access resources (data storage and processing) as services available for consumption

Remote Data Analysis
• Remote data analysis
• Grid is a good fit
Details can be very tedious:
• Data retrieval, movement and caching
• Parallel data processing
• Resource allocation
• Application configuration
Middleware can be useful to abstract away details

Our Approach
Supporting development of scalable applications that process remote data using middleware: FREERIDE-G (Framework for Rapid Implementation of Datamining Engines in Grid)
[Figure: middleware user, repository cluster, compute cluster]

Outline
• Motivation
• Introduction
• Middleware Overview
• Challenges
• Experimental Evaluation
• Related work
• Conclusion

FREERIDE-G Grid Service
Grid computing standards and infrastructures:
• OGSA vs. WSRF (merged Grid and Web Services)
Data hosts compliant with repository standards (SRB)
MPICH-G2: pre-WS mechanism to support MPI
Globus Toolkit and MPICH-G2: most fitting infrastructure for Grid Service conversion of FREERIDE-G

SDSC Storage Resource Broker
Standard for remote data:
• Storage
• Access
Data can be distributed across organizations and heterogeneous storage systems
Used to host data for FREERIDE-G
Access provided through client API

FREERIDE-G Processing Structure
Key observation: most data mining algorithms follow a canonical loop
Middleware API:
• Subset of data to be processed
• Reduction object
• Local and global reduction operations
• Iterator
Derived from precursor system FREERIDE

While( ) {
    forall( data instances d) {
        I = process(d)
        R(I) = R(I) op d
    }
    …….
}
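A minimal Python sketch of one pass of the canonical loop, for illustration only. The names `run_reduction`, `process`, `op`, and `initial` are hypothetical and are not part of the FREERIDE-G API; the sketch assumes the reduction object can be modeled as a dictionary and omits the outer iteration.

```python
from collections import defaultdict

def run_reduction(data, process, op, initial):
    """One pass of the loop: R(I) = R(I) op d for each data instance d."""
    reduction = defaultdict(lambda: initial)  # reduction object R (assumed dict-like)
    for d in data:
        i = process(d)                        # I = process(d): element of R to update
        reduction[i] = op(reduction[i], d)    # R(I) = R(I) op d
    return dict(reduction)

# Hypothetical usage: sum integers grouped by parity.
totals = run_reduction(
    data=[1, 2, 3, 4, 5],
    process=lambda d: d % 2,
    op=lambda acc, d: acc + d,
    initial=0,
)
```

Because the update only requires `op` to be associative and commutative, the data instances can be processed in any order, which is what makes the loop easy to parallelize.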

FREERIDE-G Evolution
FREERIDE
• Data stored locally
FREERIDE-G
• ADR responsible for remote data retrieval
• SRB responsible for remote data retrieval
FREERIDE-G grid service
• Grid service featuring:
  • Load balancing
  • Data integration

Evolution
[Figure: FREERIDE, FREERIDE-G-ADR, FREERIDE-G-SRB, FREERIDE-G-GT; diagram labels: Application, Data, ADR, SRB, Globus]

FREERIDE-G System Architecture

Implementation Challenges I
• Interaction with Code Repository
• Simplified Wrapper and Interface Generator (SWIG)
• XML descriptors of API functions
• Each API function wrapped in own class
[Figure: diagram labels: C++, C++, Java, SWIG, XML]

Implementation Challenges
Integration with MPICH-G2 and Globus Toolkit
• Supports MPI
• Deployed through pre-WS Globus components
• Hides potential heterogeneity in:
  • service startup
  • management

FREERIDE-G Applications
VortexPro:
• Finds vortices in volumetric fluid/gas flow datasets
Kmeans Clustering:
• Clusters points based on Euclidean distance in an attribute space
EM Clustering:
• Another distance-based clustering algorithm
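As an illustration of how a clustering application might map onto local and global reduction operations, here is a hedged Python sketch of one k-means iteration. The function names `local_reduction` and `global_reduction` and the layout of the reduction object (per-cluster sums and counts) are assumptions made for this example, not the middleware's actual interface.

```python
def local_reduction(chunk, centers):
    """Accumulate per-cluster coordinate sums and counts for one data chunk."""
    dims = len(centers[0])
    sums = [[0.0] * dims for _ in centers]
    counts = [0] * len(centers)
    for point in chunk:
        # Nearest center by squared Euclidean distance.
        nearest = min(
            range(len(centers)),
            key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
        )
        counts[nearest] += 1
        for j, p in enumerate(point):
            sums[nearest][j] += p
    return sums, counts

def global_reduction(partials, centers):
    """Merge per-node reduction objects and compute the updated centers."""
    dims = len(centers[0])
    sums = [[0.0] * dims for _ in centers]
    counts = [0] * len(centers)
    for part_sums, part_counts in partials:
        for k in range(len(centers)):
            counts[k] += part_counts[k]
            for j in range(dims):
                sums[k][j] += part_sums[k][j]
    # Keep the old center if a cluster received no points.
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else list(centers[k])
        for k in range(len(centers))
    ]

# Hypothetical usage: two "compute nodes", each holding one chunk.
chunks = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
centers = [(1.0, 1.0), (9.0, 9.0)]
partials = [local_reduction(chunk, centers) for chunk in chunks]
new_centers = global_reduction(partials, centers)
```

Each node touches only its own chunk during the local step, so only the small sums-and-counts object needs to be exchanged in the global step.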

Experimental setup
Organizational Grid:
• Data hosted on Opteron 250 cluster
• Processed on Opteron 254 cluster
• Connected using two 10 GB optical fibers
Goals:
• Demonstrate parallel scalability of applications
• Evaluate overhead of using MPICH-G2 and Globus Toolkit deployment mechanisms
[Figure: repository cluster and compute cluster]

Scalability Evaluation
Scalable implementations with respect to numbers of:
• Compute nodes
• Data repository nodes
[Figure: Vortex Detection with 14.8 GB dataset; Kmeans Clustering with 6.4 GB dataset (bottom)]

Scalability Evaluation II
Sub-linear speedup for scaled up compute nodes explained by sequential scheduling of concurrent data reads
[Figure: Kmeans Clustering (25.6 GB); EM Clustering (25.6 GB)]

Deployment Overhead Evaluation
A small overhead is associated with using Globus and MPICH-G2 for middleware deployment.
Kmeans Clustering with 6.4 GB dataset: 18-20%.

Deployment Overhead Evaluation II
Vortex Detection with 14.8 GB dataset: 17-20%.
Overhead:
• scales with the total execution time
• does not affect overall data processing service scalability

Outline
• Motivation
• Introduction
• Middleware Overview
• Challenges
• Experimental Evaluation
• Related work
• Conclusion

Grid Computing
Standards:
• Open Grid Service Architecture
• Web Services Resource Framework
Implementations:
• Globus toolkit, WSRF.NET, WSRF::Lite
Other Grid Systems:
• Condor and Legion
• Workflow composition systems:
  • GridLab, Nimrod/G (GridBus), Cactus, GRIST

Data Grids
Metadata cataloging:
• Artemis project
Remote data retrieval:
• SRB, SEMPLAR
• DataCutter, STORM
Stream processing middleware:
• GATES
• dQUOB
Data integration:
• Automatic wrapper generation

Remote Data in Grid
KnowledgeGrid tool-set:
• Developing data mining workflows (composition)
GridMiner toolkit:
• Composition of distributed, parallel workflows
DiscoveryNet layer:
• Creation, deployment, management of services
DataMiningGrid framework:
• Distributed knowledge discovery
Other projects partially overlap in goals with ours

Conclusions
FREERIDE-G: middleware for scalable processing of remote data, now as a grid service
• Compliance with grid and repository standards
• Support for high-end processing
• Ease the use of parallel configurations
• Hide details of data movement and caching
• Scalable performance with respect to data and compute nodes
• Low deployment overheads as compared to non-grid version

SRB Data Host
SRB Master:
• Connection establishment
• Authentication
• Forks agent to service I/O
SRB Agent:
• Performs remote I/O
• Services multiple client API requests
MCAT:
• Catalogs data associated with datasets, users, resources
• Services metadata queries (through SRB agent)

Compute Node
More compute nodes than data hosts
Each node:
• Registers IO (from index)
• Connects to data host
While (chunks to process)
• Dispatch IO request(s)
• Poll pending IO
• Process retrieved chunks
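The dispatch, poll, and process steps above can be sketched as follows. This is an illustrative Python model only: `read_chunk` is a hypothetical stand-in for a remote read through the data host's client API, `max_pending` is an assumed limit on requests in flight, and a thread pool merely simulates asynchronous I/O.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def read_chunk(chunk_id):
    """Hypothetical stand-in for a remote read of one data chunk."""
    return list(range(chunk_id * 4, chunk_id * 4 + 4))

def process_chunks(chunk_ids, max_pending=2):
    """Dispatch reads, wait on pending ones, and process chunks as they arrive."""
    total = 0
    remaining = list(chunk_ids)
    pending = set()
    with ThreadPoolExecutor(max_workers=max_pending) as pool:
        while remaining or pending:
            # Dispatch I/O requests up to the allowed number in flight.
            while remaining and len(pending) < max_pending:
                pending.add(pool.submit(read_chunk, remaining.pop(0)))
            # Poll pending I/O: block until at least one read has completed.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            # Process retrieved chunks (here: a simple running sum).
            for future in done:
                total += sum(future.result())
    return total

result = process_chunks(range(4))
```

Keeping several requests in flight lets retrieval of the next chunks overlap with processing of the ones already received.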

FREERIDE-G in Action
[Figure: interaction between Compute Nodes and the Data Host (SRB Master, SRB Agents, MCAT). Steps shown: I/O registration; connection establishment; while (more chunks to process): I/O request dispatched, pending I/O polled, retrieved data chunks analyzed]

Conversion Challenges
• Interfacing C++ middleware with GT 4.0 (Java)
  • SWIG to generate Java Native Interface (JNI) wrapper
• Integrating data mining service through WSDL interface
  • Encapsulate service as a resource
• Middleware deployment using GT 4.0
  • Use Web Services Grid Resource Allocation and Management (WS-GRAM) for deployment
