Multi-threading in HDF5: Paths Forward

Multi-threading in HDF5:Paths Forward Current implementation - Future directions HDF5 Workshop at PSI

Outline • Introduction • Current implementation • Paths forward: • Improve concurrency • Reduce latency • Conclusions and Recommendations HDF5 Workshop at PSI

Introduction HDF5 Workshop at PSI

Introduction • HDF5 design principles • Flexibility • Adaptability to new computational environments • Current challenges: • Multi-threaded applications run on multi-core systems • HDF5 thread-safe library cannot support concurrency built into such applications HDF5 Workshop at PSI

Current implementation HDF5 Workshop at PSI

Current Implementation • HDF5 uses single global semaphore • Controls modification of memory and file data structures: • One thread at a time enters the library • An application thread enters HDF5 API routine, acquires semaphore • Other threads are blocked until the thread completes API call and releases semaphore • No simultaneous modifications of data structures that can cause file corruption • No race conditions when several threads try to modify a memory data structure HDF5 Workshop at PSI

Current Implementation • Pros: • Current implementation provides thread-safety needed to avoid corruption of data structures • Cons: • No concurrent use of HDF5 library by multi-threaded applications HDF5 Workshop at PSI

Paths forward HDF5 Workshop at PSI

Improving Concurrency • Replace single global semaphore with semaphores that guard individual data structures • Pros: • Greater level of concurrency • No corruption of internal data structures • Each thread waits only when it needs to modify a data structure locked by another thread • Reduces waiting time for a resource to become available HDF5 Workshop at PSI

Improving Concurrency • Cons: • Replacing the global semaphore with individual semaphores, locks, etc. requires careful analysis of HDF5 data structures and their interactions • 300K lines of C code in library will require 4-6 FTE years of knowledgeable staff • Significant future maintenance effort • Testing challenges HDF5 Workshop at PSI

Reducing Latency • Reduce waiting time for each thread to acquire global semaphore • Reduce time by removing known HDF5 bottlenecks: • I/O performance • “Compute bound” (CB) operations • Datatype conversions • Compression and other filters • General overhead • E.g., structures for storing and accessing chunked datasets and metadata HDF5 Workshop at PSI

Reducing Latency: I/O Performance • Support asynchronous I/O (AIO) access to data in HDF5 file • AIO initiated within the library in response to an API call • Completes in the backgroundafter API call has returned • Global semaphore is released when API call returns – less waiting time HDF5 Workshop at PSI

Reducing Latency: CB operations • Use multiple threads within HDF5 library to • Perform datatype conversion • Perform compression on one chunk • Multiple threads work on one chunk • Perform compression on many chunks • Each thread works on a chunk HDF5 Workshop at PSI

Reducing Latency: General Optimizations • Traditional optimizations • Some examples: • Algorithm improvements for handling • Chunk cache • Hyperslabselections • Memory usage • Data structure improvements • Chunk indices with O(1) lookup speed • Advanced B-tree implementations HDF5 Workshop at PSI

Reducing Latency • Pros: • Smaller development effort, ~ 1.5 FTE years • Localized changes to the library • Easier to maintain • Incremental improvements • Cons: • Still uses global semaphore HDF5 Workshop at PSI

Recommendations HDF5 Workshop at PSI

Decision • Reduce Latency • Decision factors: • Available expertise • Cost • Already funded features: • AIO • Using multiple threads to compress a chunk • Future maintainability HDF5 Workshop at PSI

Other considerations • Approaches are not mutually exclusive • Both can be implemented in the future if funding is available HDF5 Workshop at PSI

Thank You! Questions? HDF5 Workshop at PSI

Multi-threading in HDF5: Paths Forward