Floating-Point Data Compression at 75 Gb/s on a GPU

Presentation Transcript

  1. Floating-Point Data Compression at 75 Gb/s on a GPU
  Molly A. O'Neil and Martin Burtscher, Department of Computer Science

  2. Introduction
  • Scientific simulations on HPC clusters
    • Run on interconnected compute nodes
    • Produce and transfer lots of floating-point data
  • Data storage and transfer are expensive and slow
  • Compute nodes have multiple cores but only one link
  • Interconnects are getting faster
    • Lonestar: 40 Gb/s InfiniBand
    • Speeds of up to 100 Gb/s soon
  March 2011

  3. Introduction (cont.)
  • Compression
    • Reduced storage, faster transfer
    • Only useful when done in real time
      • Saturate the network with compressed data
    • Requires a compressor tailored to the hardware's capabilities
  • GFC algorithm for IEEE 754 double-precision data
    • Designed specifically for GPU hardware (CUDA)
    • Provides a reasonable compression ratio and operates above the throughput of emerging networks

  4. Lossless Data Compression
  • Dictionary-based (Lempel-Ziv family) [gzip, lzop]
  • Variable-length entropy coders (Huffman, arithmetic coding)
  • Run-length encoding [fax]
  • Transforms (Burrows-Wheeler) [bzip2]
  • Special-purpose FP compressors [FPC, FSD, PLMI]
    • Prediction and leading-zero suppression
  • None of these offer real-time speeds for state-of-the-art networks

  5. GFC Algorithm
  • GPUs require thousands of parallel activities, but compression is a generally serial operation
  • Divide the data into n chunks, processed in parallel
    • Best performance: choose n to match the maximum number of resident warps
  • Each chunk is composed of 32-word subchunks
    • One double per warp thread
  • Use the previous subchunk to provide prediction values
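
The prediction step above can be sketched sequentially in Python. This is a simplified, hypothetical illustration, not the actual CUDA kernel: the real GFC packs sign and leading-zero counts into compact per-value headers, whereas here each residual simply gets a one-byte length prefix.

```python
import struct

def compress_subchunk(curr, prev):
    """Delta-encode one 32-double subchunk against the previous subchunk,
    then strip leading zero bytes from each residual (a sequential sketch
    of the warp-parallel GFC scheme; header format is simplified)."""
    out = bytearray()
    for c, p in zip(curr, prev):
        # Reinterpret the doubles as 64-bit integers and take their
        # difference modulo 2^64 (integer subtraction of the bit patterns).
        ci = struct.unpack('<Q', struct.pack('<d', c))[0]
        pi = struct.unpack('<Q', struct.pack('<d', p))[0]
        delta = (ci - pi) & 0xFFFFFFFFFFFFFFFF
        # Leading-zero suppression: drop high-order zero bytes.
        body = delta.to_bytes(8, 'big').lstrip(b'\x00')
        out.append(len(body))          # header: residual length in bytes (0-8)
        out.extend(body)
    return bytes(out)

def decompress_subchunk(blob, prev):
    """Invert compress_subchunk, reconstructing the subchunk bit-exactly."""
    vals, pos = [], 0
    for p in prev:
        n = blob[pos]; pos += 1
        delta = int.from_bytes(blob[pos:pos + n], 'big'); pos += n
        pi = struct.unpack('<Q', struct.pack('<d', p))[0]
        ci = (pi + delta) & 0xFFFFFFFFFFFFFFFF
        vals.append(struct.unpack('<d', struct.pack('<Q', ci))[0])
    return vals
```

In the GPU implementation, the 32 threads of a warp perform these per-value steps in lockstep, one double per thread, iterating over the subchunks of their chunk.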

  6. Dimensionality
  • Many scientific data sets exhibit dimensionality
    • Interleaved coordinates from multiple dimensions
  • Optional dimensionality parameter to GFC
    • Determines the index of the previous subchunk to use as the prediction
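
The effect of the dimensionality parameter can be illustrated with a small helper (a hypothetical sketch, not GFC code): for data interleaved as x0, y0, z0, x1, y1, …, predicting each value from the one `dim` positions earlier keeps each coordinate stream predicting itself.

```python
def predictors(values, dim):
    """Return the prediction for each element: the element dim positions
    earlier (same coordinate axis in interleaved data), or 0.0 when no
    earlier element of that axis exists."""
    return [values[i - dim] if i >= dim else 0.0 for i in range(len(values))]
```

With dim=1 an x value would be predicted by the preceding z value, which typically has a very different magnitude; with dim=3 each axis predicts itself, so the residuals shrink.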

  7. GFC Algorithm (cont.)

  8. GPU Optimizations
  • Low thread divergence (few if statements)
    • Some short enough to be predicated
  • Coalesced memory accesses by packing/unpacking data in shared memory (for compute capability < 2.0)
  • Very little inter-thread communication and synchronization
    • Prefix sum only
  • Warp-based implementation
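
The prefix sum mentioned above is needed because each subchunk compresses to a different size: an exclusive prefix sum over the per-subchunk sizes yields the byte offset where each one's output begins. A sequential sketch (on the GPU this would be a parallel scan across the warp):

```python
def exclusive_prefix_sum(sizes):
    """Exclusive prefix sum over per-subchunk compressed sizes.
    Each offset is where that subchunk's output starts; the running
    total is the overall compressed length."""
    offsets, total = [], 0
    for s in sizes:
        offsets.append(total)   # offset before adding this size
        total += s
    return offsets, total
```

Because this is the only point where threads must cooperate, the rest of the compressor runs without synchronization.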

  9. Evaluation Method
  • Systems
    • Two quad-core 2.53 GHz Xeons
    • NVIDIA FX 5800 GPU (compute capability 1.3)
  • 13 datasets: real-world data (19 – 277 MB)
    • Observational data, simulation results, MPI messages
  • Comparisons
    • Compression ratio vs. 5 compressors in common use
    • Throughput vs. pFPC (the fastest known CPU compressor)

  10. Compression Ratio
  • 1.188 (range: 1.01 – 3.53)
  • Low (floating-point data), but in line with other algorithms
  • Largely independent of the number of chunks
  • When done in real time, compression at this ratio can greatly speed up MPI applications
    • 3% – 98% speedup [Ke et al., SC'04]
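
A back-of-envelope model (my own simplification, not from the talk) shows why even a modest ratio helps once compression is free: the link carries ratio-times more logical data per second, capped by the compressor's own throughput.

```python
def effective_bandwidth(link_gbps, ratio, compress_gbps):
    """Effective logical transfer rate when compression fully overlaps
    communication: the link moves ratio x more data per second, but the
    pipeline can never exceed the compressor's throughput."""
    return min(link_gbps * ratio, compress_gbps)
```

On a 40 Gb/s InfiniBand link, a ratio of 1.188 gives an effective rate of about 47.5 Gb/s, well under GFC's 75+ Gb/s compression speed, so the compressor never becomes the bottleneck.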

  11. Throughput
  • Compression: 75 – 87 Gb/s (mean: 77.9 Gb/s)
  • Decompression: 90 – 121 Gb/s (mean: 96.6 Gb/s)
  • 4x faster than pFPC on 8 cores (2 CPUs)
  • Improves on pFPC's compression-ratio-versus-performance trend

  12. NEW: Fermi Throughput
  • Fermi improvements:
    • Faster, simpler memory accesses
    • Hardware support for a count-leading-zeros operation
  • Compression ratio: 1.187
  • Compression: 119 – 219 Gb/s (harmonic mean: 167.5 Gb/s)
  • Decompression: 169 – 219 Gb/s (harmonic mean: 180.3 Gb/s)
  • Compresses over 9.5x faster than pFPC on 8 x86 cores

  13. Summary
  • GFC algorithm
    • Chunks up the data; each warp processes its chunk iteratively in 32-word subchunks
    • No communication required between warps
  • Throughput of at least 75 Gb/s (compression) and 90 Gb/s (decompression) on the GTX-285, and 119 Gb/s and 169 Gb/s on Fermi, with a compression ratio of 1.19
  • CUDA source code is freely available at http://www.cs.txstate.edu/~burtscher/research/GFC/

  14. Conclusions
  • The GPU can compress much faster than the PCIe bus can transfer the data
  • But…
    • The PCIe bus will become faster
    • CPU and GPU are increasingly on a single die
    • GPU-to-GPU and GPU-to-NIC transfers coming?
  • GFC is the first compressor with the potential to deliver real-time FP data compression at current and emerging network speeds