This document explores advanced strategies for efficient data access and query processing in high-performance computing environments. It discusses parallel and random I/O techniques, the use of high-level I/O libraries like PnetCDF and HDF5, and the integration of various file systems, including Lustre and GPFS. Special attention is given to challenges in scaling systems for next-generation architectures and the importance of metadata in optimizing access patterns. Additionally, it addresses issues around fault tolerance, data security, and the deployment of scalable data integration frameworks in distributed environments.
Efficient Access and Query, Data Integration
Group 4
Group coordinators: Alok Choudhary, Rob Ross
Parallel and Random I/O
• I/O stacks
  • High-level I/O libraries (PnetCDF, HDF5, SILO)
  • I/O middleware (MPI-IO)
  • Parallel file systems (Lustre, GPFS, PVFS)
  • Other shared file systems (CXFS, GFS, Panasas, QFS)
  • Solutions may exist
    • Performance/scalability are "ok"
    • Will these scale to next-generation systems (e.g., BG/L, Red Storm)?
• Random I/O
  • Query metadata for optimizing seemingly random accesses
• Research and development
  • Scale! Not just an engineering problem.
  • DB-like query operations (more later)
  • Recognizing and/or passing on access-pattern information, then acting on it
    • Related to metadata issues
  • Execution of application code at the I/O server (active disks)
  • (User) metadata as file system constructs
• Hardening and packaging
  • Large FC configurations
  • Fault tolerance
  • System support
• Deployment and maintenance
  • Low-BW, serial applications are in good shape
  • High-BW, embarrassingly parallel, task farming
Parallel and Random I/O
• Gaps with priority
  • Scaling of the parallel I/O stack
    • Both scaling the number of clients, and
    • Scaling the size of the file system (number of files/objects)
  • APIs for passing more information to the system
    • Already present in MPI-IO to some extent, and in some PFSs, but not adequate; support is also needed at the high-level I/O library
  • Management of large-scale storage
    • Fault tolerance
    • Autonomic (self-managing, etc.) storage
  • Connecting PFSs to hierarchical storage systems efficiently
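The "passing access-pattern information, then acting on it" idea above can be sketched in miniature: if the application (or a high-level library) tells the middleware its byte-range requests up front, the middleware can coalesce seemingly random small accesses into a few large ones. This is a hypothetical, pure-Python illustration of the technique, not any real MPI-IO or PFS API; the function name `coalesce` and its parameters are assumptions for the sketch.

```python
# Sketch: acting on access-pattern information passed down the I/O stack.
# Hypothetical illustration -- not part of any real MPI-IO or PFS interface.

def coalesce(offsets, lengths, gap_limit=0):
    """Merge byte-range requests whose inter-request gap is <= gap_limit.

    A middleware layer that knows the full access pattern can turn many
    small, seemingly random requests into a few large sequential ones.
    """
    reqs = sorted(zip(offsets, lengths))
    merged = []
    for off, length in reqs:
        if merged and off - (merged[-1][0] + merged[-1][1]) <= gap_limit:
            prev_off, prev_len = merged[-1]
            # Extend the previous request to cover this one.
            merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
        else:
            merged.append((off, length))
    return merged

# Three contiguous 4 KiB requests collapse into one 12 KiB request.
print(coalesce([0, 4096, 8192], [4096, 4096, 4096]))  # [(0, 12288)]
```

Real implementations (e.g., two-phase collective I/O in MPI-IO) do this across processes as well as within one, which is where the scaling gaps above bite.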
Large-Scale, Feature-Based Queries
• Lots of dimensions
  • Existing indexing techniques aren't particularly good for this
  • Not worth building an index at all in some instances
• Research and development
  • Parallel update problem with existing representations
  • When to linear scan; streaming
  • Hardware-assisted searching (e.g., Netezza, NexQL, Seisint)
• Hardening and packaging
  • Bitmapped indexing, in some use
• Deployment and maintenance
  • Relational DBs
  • Object DBs
Large-Scale, Feature-Based Queries
• Gaps with priorities
  • Scalability of techniques, such as indexing, as a solution to this problem
  • Support for runtime feature extraction
  • Concurrent update (addition) to indices
    • Only for some groups
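The bitmapped indexing mentioned above can be sketched with a toy example: one bit vector per attribute value, so a selection predicate becomes a bitwise OR of the matching bitmaps. This is a minimal, hypothetical illustration under the assumption of a low-cardinality attribute; production bitmap indexes add compression and binning, and the `append` method only hints at the concurrent-update gap noted above.

```python
# Toy bitmap index for feature-based selections over a low-cardinality
# attribute. Hypothetical sketch; real bitmap indexes compress and bin.

class BitmapIndex:
    def __init__(self, values):
        self.n = len(values)
        self.bitmaps = {}  # attribute value -> int used as a bit vector
        for i, v in enumerate(values):
            self.bitmaps[v] = self.bitmaps.get(v, 0) | (1 << i)

    def append(self, value):
        # Concurrent append to indices is one of the gaps above; this
        # single-threaded version just sets the next row's bit.
        self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << self.n)
        self.n += 1

    def query(self, predicate):
        """Row ids whose value satisfies predicate, via bitwise OR."""
        mask = 0
        for v, bm in self.bitmaps.items():
            if predicate(v):
                mask |= bm
        return [i for i in range(self.n) if mask >> i & 1]

idx = BitmapIndex([3, 7, 3, 9, 7])
idx.append(9)
print(idx.query(lambda v: v >= 7))  # [1, 3, 4, 5]
```

The "lots of dimensions" problem is visible even here: with many attributes, each query touches many bitmaps, which is why scalability of such techniques is listed as a gap.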
Query Processing over Files
• DB-like operations on files
  • Structured data files such as HDF5, PnetCDF, SILO
  • Alternative APIs, file-format independent
    • Java database objects, ODMG
• Research and development
  • What should the API look like?
  • Protocols for accessing databases in distributed environments with arbitrary backends (e.g., GGF DAIS group)
• Hardening and packaging
  • Ad hoc query package (LLNL work)
    • Range queries over SILO mesh data
  • Root (HEP community)
    • Operates on files in its internal file format
• Deployment and maintenance
  • Nothing
Query Processing over Files
• Gaps with priorities
  • Determining the API for this query processing
    • What capabilities are needed from this API?
  • Implementing this API for common file formats
    • Appropriate underlying optimizations may impact the whole I/O stack (e.g., query optimization, cache management)
  • Extensible, parallel runtime for aiding in the use of this API, constructing queries, etc.
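One way to picture the format-independent API question raised above: the query layer programs against an abstract record interface, and per-format adapters (for HDF5, PnetCDF, SILO, ...) implement it. The sketch below uses CSV as a stand-in adapter so it stays self-contained; the class and function names (`Dataset`, `range_query`) are assumptions of this illustration, not a proposed standard.

```python
# Sketch of a file-format-independent query API: DB-like range
# selection against an abstract dataset interface. The CSV adapter
# stands in for real HDF5/PnetCDF/SILO adapters.

import csv
import io

class Dataset:
    """Abstract record source; a real adapter would wrap a file format."""
    def rows(self):
        raise NotImplementedError

class CSVDataset(Dataset):
    def __init__(self, text):
        self.text = text

    def rows(self):
        for row in csv.DictReader(io.StringIO(self.text)):
            yield {k: float(v) for k, v in row.items()}

def range_query(dataset, column, lo, hi):
    """Range selection, oblivious to the underlying file format."""
    return [r for r in dataset.rows() if lo <= r[column] <= hi]

data = CSVDataset("x,temp\n0,1.5\n1,4.0\n2,2.5\n")
print(range_query(data, "temp", 2.0, 5.0))
# [{'x': 1.0, 'temp': 4.0}, {'x': 2.0, 'temp': 2.5}]
```

The gap list above is about exactly what is elided here: which capabilities beyond range selection the interface needs, and how its implementation should push optimizations down into the I/O stack rather than scanning every record.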
Data Integration
• Digital libraries, federations, and warehousing
• Research and development
  • Tools for aiding in the creation of warehouses; ontology creation
  • Fine-grained access control
  • Security in federated/distributed environments (pharma, etc.)
    • Applies even to the queries, not just the data itself
• Hardening and packaging
  • Digital libraries (SRB)
  • Many one-off instances of domain-specific integrations
• Deployment and maintenance
  • DiscoveryLink (IBM) and other commercial packages: frameworks for doing data integration with their DB offerings
  • Linking similar (relational) DBs together isn't too difficult
Data Integration
• Gaps with priorities
  • Converging on a language for describing metadata for communities
  • Tools to support wrapping and integrating complex data
    • From arbitrary sources (free text, mesh data, etc.), including files
    • For this domain (a community already exists looking at the bio domain)
  • Provenance
  • Security
    • Cross-domain access and authentication
    • Encryption of both queries and data
    • Authentication of data sources
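The "wrapping and integrating" gap above follows the classic mediator/wrapper pattern: each heterogeneous source gets a wrapper that emits records in a shared schema, and a mediator unions the wrapped views. The sketch below is a hypothetical illustration; the shared schema (`id`, `name`, `source`) and the two source formats are invented for the example.

```python
# Mediator/wrapper sketch for data integration. Each wrapper maps one
# source format into a shared schema; the mediator unions the results.
# Schema and source formats are hypothetical.

def wrap_relational(rows):
    # Relational source already close to the shared schema: (id, name).
    for r in rows:
        yield {"id": r[0], "name": r[1], "source": "rdb"}

def wrap_flat_file(lines):
    # Flat-file source uses "name|id"; the wrapper remaps its fields.
    for line in lines:
        name, ident = line.strip().split("|")
        yield {"id": int(ident), "name": name, "source": "file"}

def mediator(*wrapped_sources):
    """Integrated view: union of all wrapped sources, ordered by id."""
    out = [rec for src in wrapped_sources for rec in src]
    return sorted(out, key=lambda r: r["id"])

view = mediator(wrap_relational([(2, "neon")]),
                wrap_flat_file(["helium|1"]))
print([r["name"] for r in view])  # ['helium', 'neon']
```

As the bullets above note, the hard parts are everything this sketch omits: agreeing on the shared schema across a community, wrapping genuinely complex data (free text, mesh data), and carrying provenance and security through the integrated view.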