This document explores advanced strategies for efficient data access and query processing in high-performance computing environments. It discusses parallel and random I/O techniques, the use of high-level I/O libraries like PnetCDF and HDF5, and the integration of various file systems, including Lustre and GPFS. Special attention is given to challenges in scaling systems for next-generation architectures and the importance of metadata in optimizing access patterns. Additionally, it addresses issues around fault tolerance, data security, and the deployment of scalable data integration frameworks in distributed environments.
Efficient Access and Query, Data Integration
Group 4
Group coordinators: Alok Choudhary, Rob Ross
Parallel and Random I/O
• I/O stacks
  • High-level I/O libraries (PnetCDF, HDF5, SILO)
  • I/O middleware (MPI-IO)
  • Parallel file systems (Lustre, GPFS, PVFS)
  • Other shared file systems (CXFS, GFS, Panasas, QFS)
  • Solutions may exist
    • Performance/scalability are "ok"
    • Will these scale to next-generation systems (e.g., BG/L, Red Storm)?
• Random I/O
  • Query metadata for optimizing seemingly random accesses
• Research and development
  • Scale! Not just an engineering problem.
  • DB-like query operations (more later)
  • Recognizing and/or passing on access-pattern information, then acting on it
    • Related to metadata issues
  • Execution of application code at the I/O server (active disks)
  • (User) metadata as file system constructs
• Hardening and packaging
  • Large FC configurations
  • Fault tolerance
  • System support
• Deployment and maintenance
  • Low-BW, serial applications are in good shape
  • High-BW, embarrassingly parallel, task farming
Parallel and Random I/O
• Gaps with priority
  • Scaling of the parallel I/O stack
    • Both scaling the number of clients, and
    • Scaling the size of the file system (number of files/objects)
  • APIs for passing more information to the system
    • Already present in MPI-IO to some extent, and in some PFSs, but not adequate; support is also needed at the high-level I/O library
  • Management of large-scale storage
    • Fault tolerance
    • Autonomic (self-managing, etc.) storage
  • Connecting PFSs to hierarchical storage systems efficiently
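The "passing access-pattern information, then acting on it" idea above can be sketched in miniature: if the application (or a high-level library) tells the middleware its byte-range requests up front, the middleware can coalesce seemingly random small accesses into a few large ones. This is a hypothetical, pure-Python illustration of the technique, not any real MPI-IO or PFS API; the function name `coalesce` and its parameters are assumptions for the sketch.

```python
# Sketch: acting on access-pattern information passed down the I/O stack.
# Hypothetical illustration -- not part of any real MPI-IO or PFS interface.

def coalesce(offsets, lengths, gap_limit=0):
    """Merge byte-range requests whose inter-request gap is <= gap_limit.

    A middleware layer that knows the full access pattern can turn many
    small, seemingly random requests into a few large sequential ones.
    """
    reqs = sorted(zip(offsets, lengths))
    merged = []
    for off, length in reqs:
        if merged and off - (merged[-1][0] + merged[-1][1]) <= gap_limit:
            prev_off, prev_len = merged[-1]
            # Extend the previous request to cover this one.
            merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
        else:
            merged.append((off, length))
    return merged

# Three contiguous 4 KiB requests collapse into one 12 KiB request.
print(coalesce([0, 4096, 8192], [4096, 4096, 4096]))  # [(0, 12288)]
```

Real implementations (e.g., two-phase collective I/O in MPI-IO) do this across processes as well as within one, which is where the scaling gaps above bite.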
Large-Scale, Feature-Based Queries
• Lots of dimensions
  • Existing indexing techniques aren't particularly good for this
  • Not worth building an index at all in some instances
• Research and development
  • Parallel update problem with existing representations
  • When to linear scan; streaming
  • Hardware-assisted searching (e.g., Netezza, NexQL, Seisint)
• Hardening and packaging
  • Bitmapped indexing, in some use
• Deployment and maintenance
  • Relational DBs
  • Object DBs
Large-Scale, Feature-Based Queries
• Gaps with priorities
  • Scalability of techniques, such as indexing, as a solution to this problem
  • Support for runtime feature extraction
  • Concurrent update (addition) to indices
    • Only for some groups
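The bitmapped indexing mentioned above can be sketched with a toy example: one bit vector per attribute value, so a selection predicate becomes a bitwise OR of the matching bitmaps. This is a minimal, hypothetical illustration under the assumption of a low-cardinality attribute; production bitmap indexes add compression and binning, and the `append` method only hints at the concurrent-update gap noted above.

```python
# Toy bitmap index for feature-based selections over a low-cardinality
# attribute. Hypothetical sketch; real bitmap indexes compress and bin.

class BitmapIndex:
    def __init__(self, values):
        self.n = len(values)
        self.bitmaps = {}  # attribute value -> int used as a bit vector
        for i, v in enumerate(values):
            self.bitmaps[v] = self.bitmaps.get(v, 0) | (1 << i)

    def append(self, value):
        # Concurrent append to indices is one of the gaps above; this
        # single-threaded version just sets the next row's bit.
        self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << self.n)
        self.n += 1

    def query(self, predicate):
        """Row ids whose value satisfies predicate, via bitwise OR."""
        mask = 0
        for v, bm in self.bitmaps.items():
            if predicate(v):
                mask |= bm
        return [i for i in range(self.n) if mask >> i & 1]

idx = BitmapIndex([3, 7, 3, 9, 7])
idx.append(9)
print(idx.query(lambda v: v >= 7))  # [1, 3, 4, 5]
```

The "lots of dimensions" problem is visible even here: with many attributes, each query touches many bitmaps, which is why scalability of such techniques is listed as a gap.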
Query Processing over Files
• DB-like operations on files
  • Structured data files such as HDF5, PnetCDF, SILO
  • Alternative APIs, file-format independent
    • Java database objects, ODMG
• Research and development
  • What should the API look like?
  • Protocols for accessing databases in distributed environments with arbitrary backends (e.g., GGF DAIS group)
• Hardening and packaging
  • Ad hoc query package (LLNL work)
    • Range queries over SILO mesh data
  • Root (HEP community)
    • Operates on files in its internal file format
• Deployment and maintenance
  • Nothing
Query Processing over Files
• Gaps with priorities
  • Determining the API for this query processing
    • What capabilities are needed from this API?
  • Implementing this API for common file formats
    • Appropriate underlying optimizations may impact the whole I/O stack (e.g., query optimization, cache management)
  • Extensible, parallel runtime for aiding in the use of this API, constructing queries, etc.
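One way to picture the format-independent API question raised above: the query layer programs against an abstract record interface, and per-format adapters (for HDF5, PnetCDF, SILO, ...) implement it. The sketch below uses CSV as a stand-in adapter so it stays self-contained; the class and function names (`Dataset`, `range_query`) are assumptions of this illustration, not a proposed standard.

```python
# Sketch of a file-format-independent query API: DB-like range
# selection against an abstract dataset interface. The CSV adapter
# stands in for real HDF5/PnetCDF/SILO adapters.

import csv
import io

class Dataset:
    """Abstract record source; a real adapter would wrap a file format."""
    def rows(self):
        raise NotImplementedError

class CSVDataset(Dataset):
    def __init__(self, text):
        self.text = text

    def rows(self):
        for row in csv.DictReader(io.StringIO(self.text)):
            yield {k: float(v) for k, v in row.items()}

def range_query(dataset, column, lo, hi):
    """Range selection, oblivious to the underlying file format."""
    return [r for r in dataset.rows() if lo <= r[column] <= hi]

data = CSVDataset("x,temp\n0,1.5\n1,4.0\n2,2.5\n")
print(range_query(data, "temp", 2.0, 5.0))
# [{'x': 1.0, 'temp': 4.0}, {'x': 2.0, 'temp': 2.5}]
```

The gap list above is about exactly what is elided here: which capabilities beyond range selection the interface needs, and how its implementation should push optimizations down into the I/O stack rather than scanning every record.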
Data Integration
• Digital libraries, federations, and warehousing
• Research and development
  • Tools for aiding in the creation of warehouses; ontology creation
  • Fine-grained access control
  • Security in federated/distributed environments (pharma, etc.)
    • Applies even to the queries, not just the data itself
• Hardening and packaging
  • Digital libraries (SRB)
  • Many one-off instances of domain-specific integrations
• Deployment and maintenance
  • DiscoveryLink (IBM) and other commercial packages: frameworks for doing data integration with their DB offerings
  • Linking similar (relational) DBs together isn't too difficult
Data Integration
• Gaps with priorities
  • Converging on a language for describing metadata for communities
  • Tools to support wrapping and integrating complex data
    • From arbitrary sources (free text, mesh data, etc.), including files
    • For this domain (a community already exists looking at the bio domain)
  • Provenance
  • Security
    • Cross-domain access and authentication
    • Encryption of both queries and data
    • Authentication of data sources
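The "wrapping and integrating" gap above follows the classic mediator/wrapper pattern: each heterogeneous source gets a wrapper that emits records in a shared schema, and a mediator unions the wrapped views. The sketch below is a hypothetical illustration; the shared schema (`id`, `name`, `source`) and the two source formats are invented for the example.

```python
# Mediator/wrapper sketch for data integration. Each wrapper maps one
# source format into a shared schema; the mediator unions the results.
# Schema and source formats are hypothetical.

def wrap_relational(rows):
    # Relational source already close to the shared schema: (id, name).
    for r in rows:
        yield {"id": r[0], "name": r[1], "source": "rdb"}

def wrap_flat_file(lines):
    # Flat-file source uses "name|id"; the wrapper remaps its fields.
    for line in lines:
        name, ident = line.strip().split("|")
        yield {"id": int(ident), "name": name, "source": "file"}

def mediator(*wrapped_sources):
    """Integrated view: union of all wrapped sources, ordered by id."""
    out = [rec for src in wrapped_sources for rec in src]
    return sorted(out, key=lambda r: r["id"])

view = mediator(wrap_relational([(2, "neon")]),
                wrap_flat_file(["helium|1"]))
print([r["name"] for r in view])  # ['helium', 'neon']
```

As the bullets above note, the hard parts are everything this sketch omits: agreeing on the shared schema across a community, wrapping genuinely complex data (free text, mesh data), and carrying provenance and security through the integrated view.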