NCSA-NARA investigations of HDF5 in support of EXPRESS-driven data
Mike Folk, The HDF NARA Project
PDES, Inc. Offsite Meeting, September 24-29, 2006

Acknowledgement This report is based upon work supported by the National Archives and Records Administration (NARA) through the grant NARA NSF 0202 GPG. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NARA. PDES, Inc. Offsite Sept 2006
Participants
Mike Folk, Vailin Choi, Elena Pourmal – The HDF Group
Mark Conrad and Bob Chadduck – NARA
David Price – EuroSTEP
Keith Hunten – Lockheed-Martin
Steve Cooper and Denny Moore – Electric Boat
Others

HDF5 is
• A file format for managing any kind of data
• A software system to manage data in the format
• Suited especially to large-volume or complex data
• Suited for every size and type of system
• Open file format, open software

Definitions
• “HDF” – Hierarchical Data Format
  • Originated in 1988
  • NCSA at University of Illinois at Urbana-Champaign
• “HDF5”
  • Successor to HDF, introduced in 1998

An HDF5 file is a container into which you can put your data objects.
[Figure: a file holding a palette and a table]
lat | lon | temp
----|-----|-----
12 | 23 | 3.1
15 | 24 | 4.2
17 | 21 | 3.6

HDF5 data model
• HDF5 file – container for data objects
• Primary Objects
  • Groups
  • Datasets
• Additional ways to organize data
  • Attributes for metadata
  • Sharable objects
  • Storage and access properties
Everything else is built from these parts.

HDF “groups” for organizing objects in files
[Figure: the root group “/” and a group “/foo” holding a 3-D array, a table (lat, lon, temp), a palette, two raster images and a 2-D array]

HDF5 “dataset” for holding the data
[Figure: a dataset shown as metadata plus a data array]
Metadata
• Dataspace: Rank 3; Dimensions Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
• Datatype: IEEE 32-bit float
• Attributes: time = 32.4, pressure = 987, temp = 56
• Storage info: chunked, compressed
Data
• The array elements themselves

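The dataset on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the h5py and NumPy packages (the slides do not mention a Python binding); the file name, the dataset name `measurements`, and the chunk shape are made up for the example. The file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory HDF5 file (core driver with no backing store).
with h5py.File("dataset_demo.h5", "w", driver="core", backing_store=False) as f:
    # Dataspace: rank 3, dimensions 4 x 5 x 7. Datatype: IEEE 32-bit float.
    data = np.arange(4 * 5 * 7, dtype="float32").reshape(4, 5, 7)
    dset = f.create_dataset(
        "measurements",          # hypothetical dataset name
        data=data,
        chunks=(2, 5, 7),        # storage info: chunked (chunk shape is an assumption)
        compression="gzip",      # storage info: compressed
    )

    # Attributes: small named values attached to the dataset.
    dset.attrs["time"] = 32.4
    dset.attrs["pressure"] = 987
    dset.attrs["temp"] = 56

    # Read the metadata back.
    rank = dset.ndim
    dims = dset.shape
    dtype = dset.dtype
    storage = (dset.chunks, dset.compression)
    attrs = dict(dset.attrs)

print(rank, dims, dtype, storage, attrs)
```

The point of the sketch is that dataspace, datatype, attributes and storage settings all travel with the array inside the file, so a reader needs no side information to interpret it.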
Datatypes (array elements)
• Datatype – how to interpret a data element
• Two classes: atomic and compound

Datatypes
• HDF5 atomic types
  • normal integer & float
  • user-definable (e.g. 13-bit integer)
  • fixed length and variable length multiples (e.g. strings)
  • references to objects/dataset regions
  • enumeration - names mapped to integers
  • array
• HDF5 compound types
  • Records with fields – comparable to C structs
  • Members can be atomic or compound types

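As a rough illustration of a compound type and an enumeration, the sketch below stores the lat/lon/temp table from the earlier slide as records and a small enumerated dataset next to it. It assumes h5py (2.10 or later) and NumPy; the dataset names and the enumeration members `STEEL` and `ALUMINUM` are hypothetical.

```python
import h5py
import numpy as np

# Compound type: one record per row, comparable to a C struct.
record_type = np.dtype([("lat", "i4"), ("lon", "i4"), ("temp", "f4")])
rows = np.array([(12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)], dtype=record_type)

# Enumeration: names mapped to integers (member names are made up).
material_type = h5py.enum_dtype({"STEEL": 0, "ALUMINUM": 1}, basetype="i1")

with h5py.File("types_demo.h5", "w", driver="core", backing_store=False) as f:
    table = f.create_dataset("table", data=rows)
    materials = f.create_dataset("materials", (3,), dtype=material_type)
    materials[...] = [0, 1, 1]

    # Read everything back, then pick one field of the compound type.
    stored = table[:]
    lats = stored["lat"]
    temps = stored["temp"]

    # Recover the name-to-integer mapping from the stored datatype.
    material_names = h5py.check_enum_dtype(materials.dtype)
    material_codes = materials[:]

print(lats, temps, material_names, material_codes)
```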
“Groups”
• A mechanism for collections of related objects
• Every file starts with a root group
• Similar to UNIX directories
• Can have attributes
[Figure: the root group “/” with members named tom, dick, harry, a, b and c]

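The directory analogy can be made concrete with a short sketch, again assuming h5py. The member names come from the figure, but which group contains which is not recoverable from the slide text, so the nesting below is an assumption, as is the attribute name.

```python
import h5py

with h5py.File("groups_demo.h5", "w", driver="core", backing_store=False) as f:
    # Every file starts with a root group, addressed as "/".
    root_name = f.name

    # Groups directly under the root.
    tom = f.create_group("tom")
    dick = f.create_group("dick")
    f.create_group("harry")

    # Nested members (this nesting is assumed for the example).
    tom.create_group("a")
    f.create_group("/tom/b")      # absolute path, like a UNIX path
    dick.create_group("c")

    # Groups can carry attributes too.
    tom.attrs["description"] = "objects belonging to tom"

    top_level = sorted(f.keys())
    all_paths = []
    f.visit(all_paths.append)     # walk the whole hierarchy
    all_paths.sort()
    has_b = "tom/b" in f
    path_of_c = f["dick/c"].name

print(root_name, top_level, all_paths, has_b, path_of_c)
```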
Special Storage Options
• chunked: Better subsetting access time; extendable
• compressed: Improves storage efficiency, transmission speed
• extendable: Arrays can be extended in any direction
• Split file: Metadata in one file, raw data in another.
[Figure: Dataset “Fred” stored as a split file; File A holds the metadata for Fred, File B holds the data for Fred]

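A minimal sketch of the chunked, compressed and extendable options, assuming h5py and NumPy. The shapes, chunk size and fill pattern are arbitrary choices for the example; the split-file option is not shown.

```python
import h5py
import numpy as np

with h5py.File("storage_demo.h5", "w", driver="core", backing_store=False) as f:
    fred = f.create_dataset(
        "fred",
        shape=(100, 100),
        maxshape=(None, None),   # extendable in both dimensions
        chunks=(10, 10),         # stored as 10 x 10 tiles
        compression="gzip",      # each chunk is compressed separately
        dtype="f4",
    )
    fred[:] = 1.0

    # Extend the array, then fill only the newly added rows.
    fred.resize((150, 120))
    fred[100:150, :] = 2.0

    # Subsetting: read a 10 x 10 block that straddles old and new rows.
    block = fred[95:105, 0:10]
    untouched_corner = fred[0, 110]   # new column never written: default fill value
    final_shape = fred.shape

print(final_shape, block[0, 0], block[9, 0], untouched_corner)
```

Reading `block` touches only the chunks that overlap the requested region, which is where the better subsetting access time on the slide comes from.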
Mesh Example, in HDFView
[Screenshot: a mesh dataset displayed in HDFView]

HDF5 Software
[Figure: software stack, top to bottom: Tools & Applications; HDF I/O Library; HDF File]

Features of library
• Ability to create and access complex data structures
• Fast, flexible I/O
• Data transformation and filtering during I/O
• Flexible API for power users
• Compatibility with common data models
• Able to represent all common data structures
• Supports key language models – C, Fortran, Java, etc.

Other info
• Library and tools run almost anywhere
• Other software from THG
  • Java viewer
  • Command-line utilities
• Other software
  • Commercial (IDL, Matlab, Labview, etc.)
  • Community (EOS, ASCI, etc.)
  • Integration with other software (SRB, databases, etc.)

Making HDF useful for your application
• There are many ways to organize and access data in HDF5
• How do we apply these capabilities to a particular domain, such as product data?
• We have to decide how we will organize and access our data in a way that best addresses our needs.
• And create data models, APIs and tools as appropriate to support our applications.
• Or adapt existing data models, APIs and tools as appropriate to support our applications.

1. NASA Earth Observing System (EOS)
[Figure: EOS satellites and instruments whose data use HDF-EOS]
• Terra: CERES, MISR, MODIS, MOPITT
• Aqua (6/01): CERES, MODIS, AMSR
• Aura: TES, HIRDLS, MLS, OMI

2. Advanced Simulation & Computing (ASC)
Question: How do we maintain a nuclear stockpile in the absence of testing?
Answer: Very large simulations

ASC Data requirements
• Large datasets (> a terabyte)
• Fast I/O on massive parallel systems
• Complex data and extensive metadata
• Availability on leading edge systems

3. Bioinformatics: managing genomic data
[Figure: sample DNA sequence text]
caacaagccaaaactcgtacaa
Cgagatatctcttggaaaaact
gctcacaatattgacgtacaag
gttgttcatgaaactttcggta
Acaatcgttgacattgcgacct
aatacagcccagcaagcagaat

DNA sequencing workflows are complex
• Diverse formats
• Highly redundant data
• Multiple levels of information
• Complex associations
• Repeated file processing
• Non-scalable storage
• Lack of persistence

BioHDF
HDF5 as binary exchange format for bioinformatics

Boeing flight test
[Figure: HDF-Time-history and HDF-PACKET]

[Figure: HDF5 software architecture, layers listed top to bottom]
• Apps: simulation, visualization, remote sensing… Examples: thermonuclear simulations, product modeling, data mining tools, visualization tools, climate models
• Application-specific API or GUI, with common application-specific data models: BioHDF, SAF, HDF-Packet, Matlab, HDF-EOS (communities shown: LANL, LLNL, SNL, Grids, COTS, NASA)
• HDF5 data model & API
• HDF5 serial & parallel I/O
• HDF5 virtual file layer (I/O drivers): Split Files, MPI I/O, Custom, Stdio, Stream
• Storage: a file in HDF5 format; split metadata and raw data files; a file on a parallel file system; a user-defined device; across the network or to/from another application or library

2. Why is there interest in HDF5 for product data? (Courtesy of David Price, EuroSTEP)
Needs
• STEP and related models exist using EXPRESS
• ASCII, XML STEP formats defined, software developed
• But ASCII/XML don’t adapt well for highly voluminous, complex data
  • Finite element analysis
  • Computational fluid dynamics
  • Heterogeneous product data

EuroSTEP project
• VIVACE: “Value Improvement through a Virtual Aeronautical Collaborative Enterprise”
• Deliverable: EXPRESS-driven Large Volume Binary Data Representation

Survey of State of the Art
• Candidates
  • ASN.1 : Abstract Syntax Notation 1
  • HDF5 : Hierarchical Data Format
  • XML/Binary
  • CGNS : CFD General Notation System
  • SDAI implementation by LKSoft
• Found HDF5 most suitable for very large scientific datasets and complex relationships

[Figure: the HDF5 software architecture with STEP added, layers listed top to bottom]
• Apps: product model applications alongside simulation, visualization, remote sensing… Examples: thermonuclear simulations, product modeling, data mining tools, visualization tools, climate models
• Application-specific APIs: STEP-HDF5 alongside BioHDF, SAF, HDF-Packet, Matlab, HDF-EOS (communities shown: LANL, LLNL, SNL, Grids, COTS, NASA)
• Common application-specific data models, now including STEP data models
• HDF5 data model & API
• HDF5 serial & parallel I/O
• HDF5 virtual file layer (I/O drivers): Split Files, MPI I/O, Custom, Stdio, Stream
• Storage: a file in HDF5 format; split metadata and raw data files; a file on a parallel file system; a user-defined device; across the network or to/from another application or library

NCSA-THG NARA Research
• Investigate the viability of scientific data formats, such as HDF5, for long-term preservation of engineering data in the federal archives

Heterogeneous data aggregation, with HDF5
• Goal: Using NARA’s TWR collection, investigate the possibilities and limitations of using HDF5 as a container for archiving heterogeneous collections of records, with special attention to STEP data.

Activities
• Use files, datatypes, structures in NARA TWR collection – STEP files, photos, schematics, etc.
• Map these to HDF5 objects and structures, exploiting features of HDF5
• Assess benefits and costs in terms of storage efficiency and accessibility
• Investigate use of HDF5 as container for collection

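One way to picture the container idea is the sketch below, which packs a STEP file and a photo into a single HDF5 file as byte arrays under a common group. It assumes h5py and NumPy. The group layout, dataset names and attribute names are hypothetical and are not the mapping used in the project; the STEP text is an abbreviated stand-in rather than a valid exchange file, and the photo bytes are placeholder data.

```python
import h5py
import numpy as np

# Stand-in record contents (not real collection data).
step_bytes = (
    b"ISO-10303-21;\nHEADER;\nENDSEC;\nDATA;\n"
    b"#1=CARTESIAN_POINT('',(0.,0.,0.));\n"
    b"ENDSEC;\nEND-ISO-10303-21;\n"
)
photo_bytes = bytes(range(256))


def as_byte_array(raw):
    """Copy raw bytes into a writable 1-D uint8 array."""
    return np.frombuffer(raw, dtype=np.uint8).copy()


with h5py.File("container_demo.h5", "w", driver="core", backing_store=False) as f:
    # One group per record set; intermediate groups are created as needed.
    record = f.create_group("records/part_001")

    step = record.create_dataset("model.stp", data=as_byte_array(step_bytes))
    step.attrs["format"] = "STEP Part 21"

    photo = record.create_dataset(
        "photo", data=as_byte_array(photo_bytes), compression="gzip"
    )
    photo.attrs["format"] = "placeholder image bytes"

    # Retrieve a record byte-for-byte and list the container contents.
    recovered_step = f["records/part_001/model.stp"][:].tobytes()
    recovered_photo = f["records/part_001/photo"][:].tobytes()
    members = sorted(record.keys())

print(members, len(recovered_step), len(recovered_photo))
```

Storing records as opaque bytes keeps them bit-exact but does not exploit HDF5 structure; mapping STEP entities to HDF5 datasets and compound types, as the slides go on to discuss, is the richer alternative.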
Relationship with EuroSTEP, Electric Boat, et al.
• Working together to develop mappings from EXPRESS to HDF5
• Sharing data for testing
• Periodic meetings to share information and coordinate research
• Some involvement with standardization

Investigating I/O efficiency and size
• Explore different datatypes and storage options for b-spline surface models (later: finite element models)
• Two types of data – b-splines themselves and cartesian points
• Variables
  • Different HDF5 datatypes
  • Dataset compression
  • Use of extra indexes in HDF5 for fast access

Some results
• Small files
  • HDF5 not appreciably better than STEP, sometimes worse
• Large files
  • Compression always made HDF5 files smaller
  • Even without compression, HDF5 storage better
  • Indexing approach also tended to save space
• Lessons
  • HDF5 can provide very efficient storage for cartesian points
  • Choice of data types and data storage is important

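To see why packed binary storage suits cartesian points, the toy comparison below encodes the same synthetic points three ways: as STEP-style text lines, as packed 64-bit floats, and as packed floats compressed with zlib. It uses only the Python standard library and does not involve HDF5 or the project's test data, so the sizes it prints say nothing about the study's actual measurements; the point set and compression level are arbitrary.

```python
import struct
import zlib

# Synthetic, highly regular points (an assumption; real models are less regular).
points = [(i * 0.5, (i % 10) * 0.25, 1.0) for i in range(1000)]

# STEP-style text: one CARTESIAN_POINT entity instance per line.
text = "".join(
    f"#{n}=CARTESIAN_POINT('',({x!r},{y!r},{z!r}));\n"
    for n, (x, y, z) in enumerate(points, start=1)
).encode("ascii")

# Packed binary: three little-endian 64-bit floats (24 bytes) per point.
binary = b"".join(struct.pack("<3d", *p) for p in points)

# Packed binary with general-purpose compression.
compressed = zlib.compress(binary, 6)

# Round trip to confirm that no precision is lost.
unpacked = zlib.decompress(compressed)
restored = [struct.unpack_from("<3d", unpacked, 24 * i) for i in range(len(points))]

sizes = {"text": len(text), "binary": len(binary), "compressed": len(compressed)}
print(sizes)
```

For a handful of points the fixed overhead of any container format dominates, which is consistent with the small-file result on the slide.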
HDF5 as container
HDFView Demo

HDF Information
• HDF Information Center
  • http://hdfgroup.org/
• HDF Help email address
  • help@hdfgroup.org
• HDF users mailing list
  • hdfnews@hdfgroup.org