New Features in HDF5 SPEEDUP Workshop - HDF5 Tutorial
Why new features?
• HDF5 1.8.0 was released in February 2008
• Major update of the HDF5 1.6.* series (stable set of features and APIs since 1998)
• New features
    • 200 new APIs
    • Changes to file format
    • Changes to APIs
    • Backward compatible
• New releases in November 2008
    • HDF5 1.6.8 and 1.8.2
    • Minor bug fixes
    • Support for new platforms and compilers

Information about the release
http://www.hdfgroup.org/HDF5/doc/
Follow the “New Features and Compatibility Issues” links
Why new features?
• Need to address some deficiencies in the initial design
• Examples:
    • Big overhead in file sizes
    • Non-tunable metadata cache implementation
    • Handling of free space in a file

Why new features?
• Need to address new requirements
• Add support for
    • New types of indexing (object creation order)
    • Big volumes of variable-length data (DNA sequences)
    • Simultaneous real-time streams (fast append to one-dimensional datasets)
    • UTF-8 encoding for objects’ path names
    • Accessing objects stored in other HDF5 files (external or user-defined links)
Outline
• Dataset and datatype improvements
• Group improvements
• Link revisions
• Shared object header messages
• Metadata cache improvements
• Error handling
• Backward/forward compatibility
• HDF5 and NetCDF-4
Dataset and Datatype Improvements

Text-based data type descriptions
• Why:
    • Simplify data type creation
    • Make data type creation code more readable
    • Facilitate debugging by printing the text description of a data type
• What:
    • New routines to create an HDF5 data type from a text description and to get a text description of an HDF5 data type

Text data type description
Example
    /* Create the data type from a DDL text description */
    dtype = H5LTtext_to_dtype("H5T_IEEE_F32BE\n", H5LT_DDL);
    /* Convert the data type back to text: the first call, with a NULL
       buffer, returns the required length in str_len */
    H5LTdtype_to_text(dtype, NULL, H5LT_DDL, &str_len);
    dt_str = (char*)calloc(str_len, sizeof(char));
    H5LTdtype_to_text(dtype, dt_str, H5LT_DDL, &str_len);
Serialized datatypes and dataspaces
• Why:
    • Allow datatype and dataspace info to be transmitted between processes
    • Allow datatype/dataspace to be stored in non-HDF5 files
• What:
    • A new set of routines to serialize/deserialize HDF5 datatypes and dataspaces

Serialized datatypes and dataspaces
Example
    /* Find the buffer length and encode a datatype into the buffer */
    status = H5Tencode(t_id, NULL, &cmpd_buf_size);
    cmpd_buf = (unsigned char*)calloc(1, cmpd_buf_size);
    H5Tencode(t_id, cmpd_buf, &cmpd_buf_size);
    /* Decode a binary description of a datatype and return a datatype handle */
    t_id = H5Tdecode(cmpd_buf);
Integer to float conversion during I/O
• Why:
    • HDF5 1.6 and earlier supported conversion only within the same class (16-bit integer to 32-bit integer, 64-bit float to 32-bit float)
    • Conversion needed to support the NetCDF 4 programming model
• What:
    • Integer to float conversion supported during I/O

Integer to float conversion during I/O
Example: conversion is transparent to the application
    /* Create a dataset of 64-bit little-endian floating-point type */
    dset_id = H5Dcreate(loc_id, "Mydata", H5T_IEEE_F64LE, space_id,
                        H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    /* Write integer data to "Mydata" */
    status = H5Dwrite(dset_id, H5T_NATIVE_INT, …);
Revised conversion exception handling
• Why:
    • Give apps greater control over exceptions (range errors, etc.) during datatype conversion
    • Needed to support the NetCDF 4 programming model
• What:
    • Revised conversion exception handling

Revised conversion exception handling
• To handle exceptions during conversions, register a handling function through H5Pset_type_conv_cb().
• Cases of exception:
    • H5T_CONV_EXCEPT_RANGE_HI
    • H5T_CONV_EXCEPT_RANGE_LOW
    • H5T_CONV_EXCEPT_TRUNCATE
    • H5T_CONV_EXCEPT_PRECISION
    • H5T_CONV_EXCEPT_PINF
    • H5T_CONV_EXCEPT_NINF
    • H5T_CONV_EXCEPT_NAN
• Return values: H5T_CONV_ABORT, H5T_CONV_UNHANDLED, H5T_CONV_HANDLED
Compression filter for n-bit data
• Why:
    • Compact storage for user-defined datatypes
• What:
    • When data is stored on disk, padding bits are chopped off and only significant bits are stored
    • Supports most datatypes
    • Works with compound datatypes

N-bit compression example
• In memory, one value of an N-bit datatype is stored like this:
    | byte 3 | byte 2 | byte 1 | byte 0 |
    |????????|????SPPP|PPPPPPPP|PPPP????|
    S - sign bit, P - significant bit, ? - padding bit
• After passing through the N-bit filter, all padding bits are chopped off, and the bits are stored on disk like this:
    |    1st value    |    2nd value    |
    |SPPPPPPP PPPPPPPP|SPPPPPPP PPPPPPPP|...
• The opposite (decompression) happens when going from disk to memory
• Limited to integer and floating-point data
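The bit layout above can be sketched in plain C. This is only an illustration of the 16-bit precision, 4-bit offset case drawn on the slide, not the HDF5 filter itself: the function names (nbit_mask, nbit_extract, nbit_restore, nbit_pack16) are hypothetical, and restoring padding bits as zeros is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative helpers only; none of these names are HDF5 API. */

/* Mask with the low `precision` bits set. */
uint32_t nbit_mask(unsigned precision)
{
    assert(precision >= 1 && precision <= 32);
    return (precision == 32) ? UINT32_MAX
                             : ((UINT32_C(1) << precision) - 1u);
}

/* Keep only the `precision` significant bits that start at bit `offset`;
   everything marked '?' in the diagram is dropped. */
uint32_t nbit_extract(uint32_t value, unsigned precision, unsigned offset)
{
    assert(precision + offset <= 32);
    return (value >> offset) & nbit_mask(precision);
}

/* Put the significant bits back at `offset`.
   Assumption of this sketch: padding bits come back as zero. */
uint32_t nbit_restore(uint32_t bits, unsigned precision, unsigned offset)
{
    assert(precision + offset <= 32);
    return (bits & nbit_mask(precision)) << offset;
}

/* Pack n values with 16 significant bits each into 2 bytes per value,
   most significant byte first, as in the on-disk picture.
   Returns the number of bytes written to out. */
size_t nbit_pack16(const uint32_t *values, size_t n, unsigned offset,
                   uint8_t *out)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t bits = nbit_extract(values[i], 16, offset);
        out[2 * i]     = (uint8_t)(bits >> 8);
        out[2 * i + 1] = (uint8_t)(bits & 0xFFu);
    }
    return 2 * n;
}
```

With these helpers, two 4-byte values in memory become 4 bytes in the packed buffer, which is the halving shown in the diagram.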
N-bit compression example
Example
    /* Create an N-bit datatype */
    dt_id = H5Tcopy(H5T_STD_I32LE);
    H5Tset_precision(dt_id, 16);
    H5Tset_offset(dt_id, 4);
    /* Create and write a dataset */
    dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl_id, …);
    H5Pset_nbit(dcpl_id);
    dset_id = H5Dcreate(…,…,…,…,…,dcpl_id,…);
    H5Dwrite(dset_id,…,…,…,…,buf);

Offset+size storage filter
• Why:
    • Use less storage when less precision is needed
• What:
    • Performs a scale/offset operation on each value
    • Truncates the result to fewer bits before storing
    • Currently supports integers and floats
    • Precision may be lost
Example with floating-point type
• Data: {104.561, 99.459, 100.545, 105.644}
• Choose scaling factor: decimal precision to keep, e.g. scale factor D = 2
1. Find the minimum value (offset): 99.459
2. Subtract the minimum value from each element
   Result: {5.102, 0, 1.086, 6.185}
3. Scale the data by multiplying by 10^D = 100
   Result: {510.2, 0, 108.6, 618.5}
4. Round the data to integers
   Result: {510, 0, 109, 619}
5. Pack and store using the minimum number of bits
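The five steps above can be sketched in plain C. This is a minimal illustration of the arithmetic, not the library's implementation; the helper names (so_scale, so_encode, so_min_bits, so_decode) are hypothetical. Note that the tie at 618.5 in step 4 may round either way in binary floating point, so the last value can come out as 618 or 619 depending on how the inputs are represented.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helpers only; none of these names are HDF5 API. */

/* Returns 10^d for a small non-negative d (avoids needing libm). */
double so_scale(int d)
{
    double s = 1.0;
    assert(d >= 0);
    for (int i = 0; i < d; i++)
        s *= 10.0;
    return s;
}

/* Steps 1-4: find the minimum, subtract it, scale by 10^d, round.
   Returns the offset (the minimum); rounded values are written to q. */
double so_encode(const double *data, size_t n, int d, long *q)
{
    assert(n > 0);
    double min = data[0];
    for (size_t i = 1; i < n; i++)
        if (data[i] < min)
            min = data[i];

    double scale = so_scale(d);
    for (size_t i = 0; i < n; i++)
        /* data[i] - min is never negative, so adding 0.5 and truncating
           rounds to the nearest integer */
        q[i] = (long)((data[i] - min) * scale + 0.5);
    return min;
}

/* Step 5 needs the bit width of the largest rounded value. */
unsigned so_min_bits(const long *q, size_t n)
{
    long max = 0;
    unsigned bits = 0;
    for (size_t i = 0; i < n; i++)
        if (q[i] > max)
            max = q[i];
    while (max > 0) {
        bits++;
        max >>= 1;
    }
    return bits;
}

/* Inverse of so_encode for one value; the result is approximate because
   step 4 discarded everything past d decimal places. */
double so_decode(long q, double offset, int d)
{
    return offset + (double)q / so_scale(d);
}
```

For the slide's data, the largest rounded value is a little over 600, so each element fits in 10 bits instead of the 32 or 64 bits of the original floating-point type. The price is a reconstruction error of up to about half a unit in the last kept decimal place.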
Offset+size storage filter
Example
    /* Use scale+offset filter on integer data; let the library figure out
       the minimum number of bits necessary to store the data without
       loss of precision */
    H5Pset_scaleoffset(dcpl_id, H5Z_SO_INT, H5Z_SO_INT_MINBITS_DEFAULT);
    H5Pset_chunk(dcpl_id,…,…);
    dset_id = H5Dcreate(…,…,…,…,…,dcpl_id,…);
    /* Use scale+offset filter on floating-point data; compression may
       be lossy */
    H5Pset_scaleoffset(dcpl_id, H5Z_SO_FLOAT_DSCALE, 2);
“NULL” Dataspace
• Why:
    • Allow datasets with no elements to be described
    • NetCDF 4 needed a “place holder” for attributes
• What:
    • A dataset with no dimensions, no data

NULL dataspace
Example
    /* Create a dataset with a "NULL" dataspace */
    sp_id = H5Screate(H5S_NULL);
    dset_id = H5Dcreate(…,"IntArray",…,sp_id,…,…,…);

    HDF5 "SDS.h5" {
    GROUP "/" {
       DATASET "IntArray" {
          DATATYPE H5T_STD_I32LE
          DATASPACE NULL
          DATA {
          }
       }
    }
    }
HDF5 file format revision
• Why:
    • Address deficiencies of the original file format
    • Address space overhead in an HDF5 file
    • Enable new features
• What:
    • New routine that instructs the HDF5 library to create all objects using the latest version of the HDF5 file format (compare with the default: the earliest version in which the object became available, e.g. array datatype)
    • Will talk about the versioning later

HDF5 file format revision
Example
    /* Use the latest version of the file format for each object
       created in a file */
    fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_latest_format(fapl_id, 1);
    fid = H5Fcreate(…,…,…,fapl_id);
    /* or */
    fid = H5Fopen(…,…,fapl_id);
Group Revisions

Better large group storage
• Why:
    • Faster, more scalable storage and access for large groups
• What:
    • New format and method for storing groups with many links

Informal benchmark
• Create a file and a group in the file
• Create up to 10^6 groups with one dataset in each group
• Compare file sizes and performance of HDF5 1.8.1 using the latest group format with HDF5 1.8.1 (default, old format) and 1.6.7
• Note: default 1.8.1 and 1.6.7 became very slow after 700,000 groups
[Figure: Time to open and read a dataset]
[Figure: Time to close the file]
[Figure: File size]
Access links by creation-time order
• Why:
    • Allow iteration and lookup of a group’s links (children) by creation order as well as by name order
    • Support the netCDF access model for netCDF 4
• What:
    • Option to access objects in a group according to relative creation time

Access links by creation-time order
Example
    /* Track and index creation order of the links */
    H5Pset_link_creation_order(gcpl_id,
        (H5P_CRT_ORDER_TRACKED | H5P_CRT_ORDER_INDEXED));
    /* Create a group */
    gid = H5Gcreate(fid, GNAME, H5P_DEFAULT, gcpl_id, H5P_DEFAULT);
Example: h5dump --group=1 tordergr.h5
    HDF5 "tordergr.h5" {
    GROUP "1" {
       GROUP "a" {
          GROUP "a1" {
          }
          GROUP "a2" {
             GROUP "a21" {
             }
             GROUP "a22" {
             }
          }
       }
       GROUP "b" {
       }
       GROUP "c" {
       }
    }
    }

Example: h5dump --sort_by=creation_order
    HDF5 "tordergr.h5" {
    GROUP "1" {
       GROUP "c" {
       }
       GROUP "b" {
       }
       GROUP "a" {
          GROUP "a1" {
          }
          GROUP "a2" {
             GROUP "a22" {
             }
             GROUP "a21" {
             }
          }
       }
    }
    }
“Compact groups”
• Why:
    • Save space and access time for small groups
    • If groups are small, they don’t need B-tree overhead
• What:
    • Alternate storage for groups with few links
    • Default storage when “latest format” is specified
    • Library converts to “original” storage (B-tree based) using a default or user-specified threshold

“Compact groups”
• Example
    • File with 11,600 groups
    • With original group structure, file size ~ 20 MB
    • With compact groups, file size ~ 12 MB
    • Total savings: 8 MB (40%)
    • Average savings/group: ~700 bytes
Compact groups
Example
    /* Change storage to "dense" if the number of group members is bigger
       than 16 and go back to compact storage if the number of group
       members is smaller than 12 */
    H5Pset_link_phase_change(gcpl_id, 16, 12);
    /* Create a group */
    g_id = H5Gcreate(…,…,…,gcpl_id,…);
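The two thresholds form a hysteresis band, so a group that hovers around one size does not keep flipping between storage types. The plain C model below sketches that switching rule as described in the comment above; it does not call HDF5, and the group_model type and its functions are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one group's storage state; not an HDF5 type. */
typedef struct {
    unsigned max_compact; /* more links than this: switch to dense     */
    unsigned min_dense;   /* fewer links than this: back to compact    */
    unsigned nlinks;      /* current number of links in the group      */
    bool     dense;       /* false = compact storage, true = dense     */
} group_model;

/* Start with an empty group in compact storage.
   Assumption of this model: min_dense does not exceed max_compact. */
group_model group_model_init(unsigned max_compact, unsigned min_dense)
{
    assert(min_dense <= max_compact);
    group_model g = { max_compact, min_dense, 0u, false };
    return g;
}

/* Adding a link may push a compact group over the upper threshold. */
void group_model_add_link(group_model *g)
{
    g->nlinks++;
    if (!g->dense && g->nlinks > g->max_compact)
        g->dense = true;
}

/* Removing a link may drop a dense group under the lower threshold. */
void group_model_remove_link(group_model *g)
{
    assert(g->nlinks > 0);
    g->nlinks--;
    if (g->dense && g->nlinks < g->min_dense)
        g->dense = false;
}
```

With thresholds 16 and 12, a group becomes dense when its 17th link is added and stays dense while it shrinks back to 12 links; only at 11 links does the model return to compact storage.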
Intermediate group creation
• Why:
    • Simplify creation of a series of connected groups
    • Avoid having to create each intermediate group separately, one by one
• What:
    • Intermediate groups can be created when creating an object in a file, with one function call

Intermediate group creation
• Want to create “/A/B/C/dset1”
• “A” exists, but “B”, “C”, and “dset1” do not
• One call creates groups “B” and “C”, then creates “dset1”
[Diagram: before the call, “/” contains “A”; after the call, “/” contains “A”, which contains “B”, then “C”, then “dset1”]

Intermediate group creation
Example
    /* Create link creation property list */
    lcrp_id = H5Pcreate(H5P_LINK_CREATE);
    /* Set flag for intermediate group creation;
       groups B and C will be created automatically */
    H5Pset_create_intermediate_group(lcrp_id, TRUE);
    ds_id = H5Dcreate(file_id, "/A/B/C/dset1",…,…, lcrp_id,…,…);
Link Revisions

What are links?
• Links connect groups to their members
• “Hard” links point to a target by address
• “Soft” links store the path to a target
[Diagram: a root group with a hard link (<address>) and a soft link (“/target dataset”) to a dataset]

New: External Links
• Why:
    • Access objects stored in other HDF5 files in a transparent way
• What:
    • Store location of file and path within that file
    • Can link across files

New: External Links
• External link object “External_link” in file1.h5 points to the group /A/B/C/D/E in file2.h5
[Diagram: the root group of file1.h5 holds a group containing “External_link”, which stores “file2.h5” and “/A/B/C/D/E”; the root group of file2.h5 leads to the “target object”]
External links
Example
    /* Create an external link */
    H5Lcreate_external(TARGET_FILE, "/A/B/C/D/E",
                       source_file_id, "External_link", …,…);
    /* We will use the external link to create a group in the target file */
    gr_id = H5Gcreate(source_file_id, "External_link/F",…,…,…);
    /* We can access group "External_link/F" in the source file and
       group "/A/B/C/D/E/F" in the target file */
New: User-defined Links
• Why:
    • Allow applications to create their own kinds of links and link operations, such as
        • Create a “hard” external link that finds an object by address
        • Create a link that accesses a URL
        • Keep track of how often a link is accessed, or other behavior
• What:
    • Applications can create new kinds of links by supplying custom callback functions
    • Can do anything HDF5 hard, soft, or external links do

Traversing an HDF5 file