Floating-Point Data Compression at 75 Gb/s on a GPU


Presentation Transcript


1. Floating-Point Data Compression at 75 Gb/s on a GPU
Molly A. O’Neil and Martin Burtscher
Department of Computer Science, Texas State University
March 2011

2. Introduction
• Scientific simulations on HPC clusters
  • Run on interconnected compute nodes
  • Produce and transfer lots of floating-point data
• Data storage and transfer are expensive and slow
  • Compute nodes have multiple cores but only one link
• Interconnects are getting faster
  • Lonestar (Texas Advanced Computing Center): 40 Gb/s InfiniBand
  • Speeds of up to 100 Gb/s soon

3. Introduction (cont.)
• Compression: reduced storage, faster transfer
  • Only useful when done in real time
  • Saturate the network with compressed data
  • Requires a compressor tailored to the hardware's capabilities
• GFC: a compression algorithm for IEEE 754 double-precision data
  • Designed specifically for GPU hardware (CUDA)
  • Provides a reasonable compression ratio and operates above the throughput of emerging networks

4. Lossless Data Compression
• Dictionary-based (Lempel-Ziv family) [gzip, lzop]
• Variable-length entropy coders (Huffman, arithmetic coding)
• Run-length encoding [fax]
• Transforms (Burrows-Wheeler) [bzip2]
• Special-purpose FP compressors [FPC, FSD, PLMI]
  • Prediction and leading-zero suppression (sketched below)
• None of these offer real-time speeds for state-of-the-art networks
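The predict-and-suppress idea behind these FP compressors is compact enough to show directly. Below is a minimal, illustrative sketch (hypothetical helper name; XOR residual in the style of FPC, whereas GFC itself encodes a difference with a separate sign bit): when the prediction is close, the top bytes of the residual are zero and only the surviving low-order bytes need to be stored.

```cuda
#include <cstdint>
#include <cstring>

// Hypothetical helper: encode one double against its prediction.
// Writes the surviving residual bytes to 'out' and returns their count;
// a real compressor also stores this count (e.g., in a 4-bit field).
int encode_value(double value, double prediction, uint8_t* out)
{
    uint64_t v, p;
    memcpy(&v, &value, 8);             // reinterpret doubles as raw bits
    memcpy(&p, &prediction, 8);
    uint64_t residual = v ^ p;         // close prediction -> leading zero bytes

    int zeroBytes = 0;                 // count leading zero bytes (0..8)
    while (zeroBytes < 8 && ((residual >> (56 - 8 * zeroBytes)) & 0xFF) == 0)
        ++zeroBytes;

    int n = 8 - zeroBytes;             // significant bytes to emit
    for (int i = 0; i < n; ++i)        // most significant surviving byte first
        out[i] = (uint8_t)(residual >> (8 * (n - 1 - i)));
    return n;
}
```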

5. GFC Algorithm
• GPUs require 1000s of parallel activities, but compression is a generally serial operation
• Divide the data into n chunks, processed in parallel
  • Best performance: choose n to match the maximum number of resident warps
• Each chunk is composed of 32-word subchunks
  • One double per warp thread
• Use the previous subchunk to provide the prediction values (see the kernel sketch below)
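A structural sketch of what the slide describes: one warp per chunk, iterating over 32-double subchunks, with each lane predicting from the same lane of the previous subchunk. All names are invented and the residual emission is elided; this is a skeleton under those assumptions, not the released GFC kernel.

```cuda
// Sketch only: warp-per-chunk layout with subchunk-based prediction.
__global__ void gfc_encode_sketch(const unsigned long long* data,
                                  int doublesPerChunk)    // multiple of 32
{
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x & 31;
    const unsigned long long* chunk = data + (size_t)warp * doublesPerChunk;

    unsigned long long prev = 0;                      // prediction for subchunk 0
    for (int base = 0; base < doublesPerChunk; base += 32) {
        unsigned long long cur = chunk[base + lane];  // coalesced 32-word load
        long long residual = (long long)(cur - prev); // same lane, prev subchunk
        // ... emit sign, leading-zero-byte count, and surviving residual bytes;
        // a warp prefix sum over the byte counts yields the output offsets ...
        prev = cur;                                   // next iteration's predictor
    }
}
```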

6. Dimensionality
• Many scientific data sets display dimensionality
  • Interleaved coordinates from multiple dimensions
• Optional dimensionality parameter to GFC
  • Determines the index of the previous subchunk to use as the prediction (see the sketch below)
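Why stepping back whole subchunks works: with d interleaved dimensions, words 32k+lane and 32(k-d)+lane differ by 32*d positions, a multiple of d, so subchunk k-d holds the same coordinate dimension in every lane as subchunk k. A hypothetical helper expressing that choice:

```cuda
// Hypothetical helper: index of the subchunk to predict from.
// With dimensionality d, subchunk k - d holds the same coordinate dimension
// in every lane, because the word index shifts by 32*d (a multiple of d).
int predictionSubchunk(int k, int d)
{
    return (k >= d) ? (k - d) : -1;   // -1: no predictor yet, predict zero
}
```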

7. GFC Algorithm (cont.)
(diagram slide; no transcript text)

8. GPU Optimizations
• Low thread divergence (few if statements)
  • Some short enough to be predicated
• Coalesced memory accesses, by packing/unpacking data in shared memory (for CC < 2.0)
• Very little inter-thread communication and synchronization
  • Prefix sum only (see the sketch below)
• Warp-based implementation
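The one cooperative step, the prefix sum over per-lane compressed sizes, is small enough to sketch. This version assumes warp shuffles (CUDA 9 and later); the 2011 code targeting CC 1.3/2.0 would have used shared memory instead, but the result is the same: each lane's inclusive sum of byte counts, i.e., its output offset.

```cuda
// Inclusive Kogge-Stone scan across one warp; lane i returns the sum of
// 'bytes' from lanes 0..i. Assumes a full, converged warp.
__device__ int warpInclusiveScan(int bytes)
{
    int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1) {
        int n = __shfl_up_sync(0xffffffffu, bytes, offset);
        if (lane >= offset) bytes += n;   // lanes below 'offset' keep their value
    }
    return bytes;
}
```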

9. Evaluation Method
• Systems
  • Two quad-core 2.53 GHz Xeons
  • NVIDIA FX 5800 GPU (CC 1.3)
• 13 datasets: real-world data (19 – 277 MB)
  • Observational data, simulation results, MPI messages
• Comparisons
  • Compression ratio vs. 5 compressors in common use
  • Throughput vs. pFPC (the fastest known CPU compressor)

10. Compression Ratio
• 1.188 (range: 1.01 – 3.53)
• Low (FP data), but in line with other algorithms
• Largely independent of the number of chunks
• When done in real time, compression at this ratio can greatly speed up MPI apps
  • 3% – 98% speedup [Ke et al., SC’04]

11. Throughput
• Compression: 75 – 87 Gb/s (mean: 77.9 Gb/s)
• Decompression: 90 – 121 Gb/s (mean: 96.6 Gb/s)
• 4x faster than pFPC on 8 cores (2 CPUs)
• Improves on pFPC’s compression-ratio vs. performance trend

12. NEW: Fermi Throughput
• Fermi improvements:
  • Faster, simpler memory accesses
  • Hardware support for a count-leading-zeros operation (see the sketch below)
• Compression ratio: 1.187
• Compression: 119 – 219 Gb/s (harmonic mean: 167.5 Gb/s)
• Decompression: 169 – 219 Gb/s (harmonic mean: 180.3 Gb/s)
• Compresses over 9.5x faster than pFPC on 8 x86 cores
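The count-leading-zeros operation the slide mentions is exposed in CUDA as the __clzll() intrinsic, so the leading-zero-byte count of a residual collapses to one instruction plus a shift rather than a software loop; a sketch:

```cuda
// Leading zero bytes of a 64-bit residual via the hardware clz instruction.
// __clzll(0) is defined to return 64, so the all-zero case falls out naturally.
__device__ int leadingZeroBytes(long long residual)
{
    return __clzll(residual) >> 3;    // zero bits -> whole zero bytes
}
```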

13. Summary
• GFC algorithm
  • Chunks up the data; each warp processes a chunk iteratively in 32-word subchunks
  • No communication required between warps
• Minimum throughput of 75 Gb/s (encode) and 90 Gb/s (decode) on the GTX-285, and 119 Gb/s and 169 Gb/s on Fermi, with a compression ratio of 1.19
• CUDA source code is freely available at http://www.cs.txstate.edu/~burtscher/research/GFC/

14. Conclusions
• The GPU can compress much faster than the PCIe bus can transfer the data
• But…
  • The PCIe bus will become faster
  • CPU and GPU are increasingly on a single die
  • GPU-to-GPU and GPU-to-NIC transfers coming?
• GFC is the first compressor with the potential to deliver real-time FP data compression at current and emerging network speeds
