Exascale radio astronomy M Clark
Contents • GPU Computing • GPUs for Radio Astronomy • The problem is power • Astronomy at the Exascale
What is a GPU? • Kepler K20X (2012) • 2688 processing cores • 3995 SP Gflops peak • Effective SIMD width of 32 threads (warp) • Deep memory hierarchy • As we move away from registers • Bandwidth decreases • Latency increases • Limited on-chip memory • 65,536 32-bit registers per SM • 48 KiB shared memory per SM • 1.5 MiB L2 • Programmed using a thread model
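The thread model and the small on-chip memories above can be made concrete with a minimal CUDA sketch (illustrative, not from the talk): each block of 256 threads stages its tile of the input in shared memory, then each thread squares one element.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block of 256 threads stages a tile of the input in shared memory,
// then every thread squares one element. Illustrates the thread/block
// hierarchy and the small per-SM shared memory mentioned above.
__global__ void square(const float *in, float *out, int n)
{
    __shared__ float tile[256];          // lives in the 48 KiB of shared memory per SM
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                     // all threads in the block see the tile
    if (i < n) out[i] = tile[threadIdx.x] * tile[threadIdx.x];
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = i;

    square<<<(n + 255) / 256, 256>>>(in, out, n);   // grid of blocks, 256 threads each
    cudaDeviceSynchronize();

    printf("out[3] = %f\n", out[3]);     // 9.0
    cudaFree(in); cudaFree(out);
    return 0;
}
```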
Minimum Port, Big Speed-up: only the critical functions of the application code are ported to the GPU; the rest of the sequential code stays on the CPU.
Strong CUDA GPU Roadmap: normalized SGEMM / W, 2008-2016, rising from Tesla (CUDA) and Fermi (FP64) through Kepler (Dynamic Parallelism) and Maxwell (DX12) to Pascal (Unified Memory, 3D Memory, NVLink).
Introducing NVLink and Stacked Memory • NVLink: GPU high-speed interconnect, 80-200 GB/s, planned support for POWER CPUs • Stacked Memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy efficient per bit
NVLink Enables Data Transfer at the Speed of CPU Memory: the Tesla GPU reads its stacked memory (HBM) at 1 Terabyte/s and talks to the CPU over NVLink at 80 GB/s, comparable to the CPU's DDR4 memory at 50-75 GB/s.
Radio Telescope Data Flow: N antennas → RF samplers (digital, O(N)) → correlator (O(N²), real-time) → calibration & imaging (O(N²) and O(N log N) steps; real-time and post real-time).
Where can GPUs be Applied? • Cross correlation – GPUs are ideal • Performance similar to CGEMM • High-performance open-source library: https://github.com/GPU-correlators/xGPU • Calibration and Imaging • Gridding – coordinate mapping of input data onto a regular grid • Arithmetic intensity scales with the convolution kernel size • Compute-bound problem maps well to GPUs • Dominant time sink in the compute pipeline • Other image processing steps • cuFFT – highly optimized Fast Fourier Transform library • PFB (polyphase filter bank) – computational intensity increases with the number of taps • Coordinate transformations and resampling
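As a sketch of the cuFFT step, the host code below plans and runs a single 2-D complex-to-complex FFT over a hypothetical 4096 x 4096 uv-grid; in a real imaging pipeline the input would be the gridded visibilities and the inverse transform would yield the dirty image.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical grid size; a real pipeline would use the gridded visibility
// plane produced by the convolutional gridder as input.
#define NX 4096
#define NY 4096

int main()
{
    cufftComplex *grid;
    cudaMalloc(&grid, sizeof(cufftComplex) * NX * NY);
    cudaMemset(grid, 0, sizeof(cufftComplex) * NX * NY);

    cufftHandle plan;
    cufftPlan2d(&plan, NX, NY, CUFFT_C2C);      // plan a 2-D complex FFT

    // In-place inverse transform: gridded visibilities -> dirty image
    cufftExecC2C(plan, grid, grid, CUFFT_INVERSE);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(grid);
    printf("2-D FFT of a %d x %d grid complete\n", NX, NY);
    return 0;
}
```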
GPUs in Radio Astronomy • Already an essential tool in radio astronomy • ASKAP – Western Australia • LEDA – United States of America • LOFAR – Netherlands (+ Europe) • MWA – Western Australia • NCRA – India • PAPER – South Africa
Cross Correlation on GPUs • Cross correlation is essentially GEMM • Hierarchical locality
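A minimal sketch of the GEMM correspondence, assuming the channelized voltages of one frequency channel are stored column-major as an N x T complex matrix X: the visibility matrix is V = X X^H, which one cublasCgemm call computes. (xGPU uses a dedicated register-tiled kernel that also exploits the Hermitian symmetry of V; the cuBLAS call is only to show the mapping.)

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative sizes only: N antenna inputs, T time samples per integration.
#define N 256
#define T 1024

int main()
{
    // X holds the channelized voltages, one column per time sample
    // (column-major, as cuBLAS expects); V is the N x N visibility matrix.
    cuComplex *X, *V;
    cudaMalloc(&X, sizeof(cuComplex) * N * T);
    cudaMalloc(&V, sizeof(cuComplex) * N * N);
    cudaMemset(X, 0, sizeof(cuComplex) * N * T);   // real data would come from the F-engine

    cublasHandle_t handle;
    cublasCreate(&handle);

    // V = X * X^H : every pair of inputs correlated and integrated over T samples
    cuComplex alpha = make_cuComplex(1.0f, 0.0f);
    cuComplex beta  = make_cuComplex(0.0f, 0.0f);
    cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_C,
                N, N, T, &alpha, X, N, X, N, &beta, V, N);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(X); cudaFree(V);
    printf("computed %d x %d visibility matrix\n", N, N);
    return 0;
}
```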
Correlator Efficiency: X-engine GFLOPS per watt (log scale), 2008-2016, rising from Tesla (0.35 TFLOPS sustained) through Fermi (>1 TFLOPS sustained) and Kepler (>2.5 TFLOPS sustained) toward Maxwell and Pascal.
Software Correlation Flexibility • Why do software correlation? • Software correlators inherently have a high degree of flexibility • Software correlation can do on-the-fly reconfiguration • Subset correlation at increased bandwidth • Subset correlation at decreased integration time • Pulsar binning • Easy classification of data (RFI thresholding) • Software is portable; the correlator code is unchanged since 2010 • Already running on the 2016 architecture
HPC's Biggest Challenge: Power • The power of a 300 Petaflop CPU-only supercomputer = power for the city of San Francisco
The End of Historic Scaling (C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011)
The End of Voltage Scaling • In the Good Old Days • Leakage was not important, and voltage scaled with feature size • L' = L/2, V' = V/2, C' = C/2 • E' = C'V'² = E/8 • f' = 2f • D' = 1/L'² = 4D • P' = P • Halve L and get 4x the transistors and 8x the capability for the same power • The New Reality • Leakage has limited threshold voltage, largely ending voltage scaling • L' = L/2, V' ≈ V, C' = C/2 • E' = C'V² = E/2 • f' = 2f • D' = 1/L'² = 4D • P' = 4P • Halve L and get 4x the transistors and 8x the capability for 4x the power, or 2x the capability for the same power in ¼ the area
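Writing the same argument out, with the capacitance scaling C' = C/2 (implicit in the slide's E' = CV²) made explicit and chip power taken as P = D · f · E:

```latex
\[
\text{Dennard era } (V' = V/2):\quad
E' = C'V'^2 = \frac{C}{2}\left(\frac{V}{2}\right)^{2} = \frac{E}{8},
\qquad
P' = D'f'E' = 4D \cdot 2f \cdot \frac{E}{8} = P
\]
\[
\text{Post-Dennard } (V' \approx V):\quad
E' = C'V^{2} = \frac{C}{2}V^{2} = \frac{E}{2},
\qquad
P' = D'f'E' = 4D \cdot 2f \cdot \frac{E}{2} = 4P
\]
```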
Major Software Implications • Need to expose massive concurrency • Exaflop at O(GHz) clocks => O(billion-way) parallelism! • Need to expose and exploit locality • Data motion is more expensive than computation • > 100:1 ratio of global to local energy • Need to start addressing resiliency in the applications
How Parallel is Astronomy? • SKA1-LOW specifications • 1024 dual-pol stations => 2,096,128 visibilities • 262,144 frequency channels • 300 MHz bandwidth • Correlator • 5 Pflops of computation • Data-parallel across visibilities • Task-parallel across frequency channels • O(trillion-way) parallelism
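A worked check of those figures (visibilities here are the cross-correlations of the 2048 dual-polarization inputs, excluding autocorrelations):

```latex
\[
N_{\text{inputs}} = 2 \times 1024 = 2048, \qquad
N_{\text{vis}} = \frac{N_{\text{inputs}}(N_{\text{inputs}}-1)}{2} = 2{,}096{,}128
\]
\[
N_{\text{vis}} \times N_{\text{chan}} = 2{,}096{,}128 \times 262{,}144 \approx 5.5 \times 10^{11}
\quad\Rightarrow\quad O(\text{trillion-way}) \text{ parallelism}
\]
```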
How Parallel is Astronomy? • SKA1-LOW specifications • 1024 dual-pol stations => 2,096,128 visibilities • 262,144 frequency channels • 300 MHz bandwidth • Gridding (W-projection) • Kernel size 100x100 • Parallel across kernel size and visibilities (J. Romein) • O(10 billion-way) parallelism
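A much-simplified CUDA sketch of that parallelization (illustrative sizes and data layout, not the production gridder): one thread per point of the 100 x 100 convolution support, each looping over a batch of visibilities and accumulating into the uv-grid with atomic adds.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

#define K    100          // convolution kernel support (100 x 100, as above)
#define GRID 4096         // side length of the uv-grid (illustrative)

struct Visibility { int u, v; float2 val; };

// One thread per point of the K x K convolution support; each thread loops
// over the visibilities and accumulates its contribution into the uv-grid.
__global__ void grid_visibilities(const Visibility *vis, int nvis,
                                  const float2 *kernel,   // K*K convolution weights
                                  float2 *uvgrid)         // GRID*GRID complex grid
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= K * K) return;
    int du = idx % K, dv = idx / K;
    float2 w = kernel[idx];

    for (int n = 0; n < nvis; n++) {
        int u = vis[n].u + du, v = vis[n].v + dv;
        float2 x = vis[n].val;
        float re = w.x * x.x - w.y * x.y;   // complex multiply: kernel * visibility
        float im = w.x * x.y + w.y * x.x;
        atomicAdd(&uvgrid[v * GRID + u].x, re);
        atomicAdd(&uvgrid[v * GRID + u].y, im);
    }
}

int main()
{
    int nvis = 1 << 16;                     // illustrative batch of visibilities
    Visibility *vis;  float2 *kernel, *uvgrid;
    cudaMallocManaged(&vis,    nvis * sizeof(Visibility));
    cudaMallocManaged(&kernel, K * K * sizeof(float2));
    cudaMallocManaged(&uvgrid, (size_t)GRID * GRID * sizeof(float2));
    cudaMemset(uvgrid, 0, (size_t)GRID * GRID * sizeof(float2));
    for (int n = 0; n < nvis; n++) {
        vis[n].u = 1000; vis[n].v = 1000; vis[n].val = make_float2(1.f, 0.f);
    }
    for (int i = 0; i < K * K; i++) kernel[i] = make_float2(1.f / (K * K), 0.f);

    grid_visibilities<<<(K * K + 255) / 256, 256>>>(vis, nvis, kernel, uvgrid);
    cudaDeviceSynchronize();
    printf("grid cell (1000,1000): %f\n", uvgrid[1000 * GRID + 1000].x);
    return 0;
}
```

Romein's scheme improves on this by keeping the running sum in registers while consecutive visibilities fall on the same grid cell and writing to memory only when the target cell changes, which removes most of the atomic traffic.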
Energy Efficiency Drives Locality: approximate energies on a 28 nm IC: 20 pJ for a 64-bit DP operation, 50 pJ for a 256-bit access to an 8 kB SRAM, 26 pJ and 256 pJ for moving 256 bits on-chip (up to 20 mm), 500-1000 pJ for an efficient off-chip link, and 16000 pJ for a DRAM read/write.
Energy Efficiency Drives Locality: bar chart of energy per operation, in picojoules.
Energy Efficiency Drives Locality • This is observable today • We have lots of tunable parameters: • Register tile size: how much work should each thread do? • Thread block size: how many threads should work together? • Input precision: size of the input words • Quick and dirty cross-correlation example • 4x4 => 8x8 register tiling • 8.5% faster, 5.5% lower power => 14% improvement in GFLOPS / watt
SKA1-LOW Sketch: N = 1024 stations with 8-bit digitization; RF samplers, correlator (O(10) PFLOPS), and calibration & imaging (O(100) PFLOPS) linked by 50 Tb/s and 10 Tb/s data streams.
Do we need Moore's Law? • Moore's law comes from shrinking the process • Moore's law is slowing down • Dennard scaling is dead • Increasing wafer costs mean that it takes longer to move to the next process
Improving Energy Efficiency @ Iso-Process • We don’t know how to build the perfect processor • Huge focus on improved architecture efficiency • Better understanding of a given process • Compare Fermi vs. Kepler vs. Maxwell architectures @ 28 nm • GF117: 96 cores, peak 192 Gflops • GK107: 384 cores, peak 770 Gflops • GM107: 640 cores, peak 1330 Gflops • Use cross-correlation benchmark • Only measure GPU power
Improving Energy Efficiency @ 28 nm: cross-correlation GFLOPS / watt for GF117, GK107, and GM107, with architectural improvements of 55% and 80% between successive generations on the same process.
How Hot is Your Supercomputer? • 1. TSUBAME-KFC – Tokyo Tech, oil cooled: 4503 MFLOPS / watt • 2. Wilkes Cluster – U. Cambridge, air cooled: 3631 MFLOPS / watt • Number 1 is 24% more efficient than number 2
Temperature is Power • Power is dynamic and static • Dynamic power is work • Static power is leakage • Dominant term is sub-threshold leakage: $I_{\mathrm{sub}} \propto A\,\frac{W}{L}\,e^{(V_s - V_{th})/(n\,v_T)}$ • Voltage terms: $V_s$: gate-to-source voltage; $V_{th}$: switching threshold voltage; $n$: transistor sub-threshold swing coefficient • Device specifics: $A$: technology-specific constant; $L$, $W$: device channel length & width • Thermal voltage: $v_T = k_B T / q$, with $k_B = 8.62\times10^{-5}$ eV/K, giving $v_T \approx$ 26 mV at room temperature
Temperature is Power: measured power versus temperature for a GeForce GTX 580 (GF110, 40 nm) and a Tesla K20m (GK110, 28 nm).
Tuning for Power Efficiency • A given processor does not have a fixed power efficiency • Dependent on • Clock frequency • Voltage • Temperature • Algorithm • Tune in this multi-dimensional space for power efficiency • E.g., cross-correlation on Kepler K20 • 12.96 -> 18.34 GFLOPS / watt • Bad news: no two processors are exactly alike
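A minimal host-side sketch of how such tuning can be measured, sampling board power through the NVML API (device index and sampling strategy are illustrative):

```cuda
#include <nvml.h>
#include <cstdio>

// Host-side sketch: sample board power through NVML while a benchmark kernel
// (e.g. the correlator X-engine) runs, so GFLOPS / watt can be computed for
// each point in the clock / block-size / precision tuning space.
int main()
{
    nvmlInit();

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);      // GPU 0; pick the device under test

    unsigned int mw = 0;
    nvmlDeviceGetPowerUsage(dev, &mw);        // instantaneous board power in milliwatts
    printf("board power: %.1f W\n", mw / 1000.0);

    // A real harness would sample in a loop on a separate thread while the
    // benchmark kernel runs, then average and divide sustained GFLOPS by watts.

    nvmlShutdown();
    return 0;
}
```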
Precision is Power • Multiplier power scales approximately with the square of the operand width • Most computation is done in FP32 / FP64 • Should use the minimum precision required by the science • Maxwell GPUs have 16-bit integer multiply-add at FP32 rate • Algorithms should increasingly use hierarchical precision • Only invoke high precision when necessary • Signal processing folks have known this for a long time • The lesson is feeding back into the HPC community...
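A sketch of hierarchical precision in a correlation inner loop, assuming 8-bit quantized complex voltages (char2) and 32-bit integer accumulation; the result is widened to floating point only when partial sums are folded into the visibility.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// 8-bit quantized complex voltages of two inputs are correlated with 32-bit
// integer accumulators; the result is widened to floating point only when the
// partial sums are folded into the visibility.
__global__ void corr_int8(const char2 *x, const char2 *y, float2 *vis, int nsamp)
{
    int re = 0, im = 0;                                  // 32-bit accumulators
    for (int t = blockIdx.x * blockDim.x + threadIdx.x; t < nsamp;
         t += gridDim.x * blockDim.x) {
        char2 a = x[t], b = y[t];
        re += a.x * b.x + a.y * b.y;                     // x * conj(y), integer CMAC
        im += a.y * b.x - a.x * b.y;
    }
    atomicAdd(&vis->x, (float)re);                       // widen to float once, here
    atomicAdd(&vis->y, (float)im);
}

int main()
{
    const int nsamp = 1 << 20;
    char2 *x, *y;  float2 *vis;
    cudaMallocManaged(&x, nsamp * sizeof(char2));
    cudaMallocManaged(&y, nsamp * sizeof(char2));
    cudaMallocManaged(&vis, sizeof(float2));
    for (int t = 0; t < nsamp; t++) { x[t] = make_char2(3, 1); y[t] = make_char2(2, -1); }
    *vis = make_float2(0.f, 0.f);

    corr_int8<<<64, 256>>>(x, y, vis, nsamp);
    cudaDeviceSynchronize();
    printf("visibility: %.0f %+.0fi\n", vis->x, vis->y);  // (3+i)(2+i) summed per sample
    return 0;
}
```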
Conclusions • Astronomy has an insatiable appetite for compute • Many-core processors are a perfect match to the processing pipeline • Power is a problem, but • Astronomy has oodles of parallelism • Key algorithms possess locality • Precision requirements are well understood • Scientists and engineers are wedded to the problem • Astronomy is perhaps the ideal application for the exascale