Operational Weather Forecasting using GPUs

Dr. Shujia Zhou • Lawrence Sebald

NOAA Long Wave Radiation Code
• Production version of the weather forecast model code
• Accounts for about 10-15% of the global weather forecast simulation time
• NOAA is interested in accelerating this code so that it can be called once per hour rather than once every three hours, as is done now
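The target above implies a rough speedup requirement. If the radiation code takes a fraction f of the run time today, calling it three times as often makes the relative total (1 - f) + 3f unless each call gets faster. The C sketch below works through that arithmetic; the function name is illustrative, and the model assumes the rest of the forecast is unaffected by the change.

```c
#include <assert.h>

/* Relative total run time (1.0 = today's run time) when the radiation code
 * is called `calls_ratio` times as often and each call runs `speedup` times
 * faster. `fraction` is the share of run time spent in radiation today
 * (0.10 to 0.15 according to the slide).
 * Assumption: everything outside the radiation code is unchanged. */
double relative_runtime(double fraction, double calls_ratio, double speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(calls_ratio > 0.0 && speedup > 0.0);
    return (1.0 - fraction) + fraction * calls_ratio / speedup;
}
```

Under these assumptions, hourly calls (calls_ratio = 3) without any acceleration would lengthen the run to about 1.2 to 1.3 times its current duration, and a 3x speedup of the radiation code would keep the total unchanged.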

NOAA Long Wave Radiation Code Structure
• Approximately 4000 lines of Fortran 90 code
• Additionally, approximately 30,000 lines of raw data embedded in the code
• Code is structured around many random accesses into lookup tables in RAM
• Algorithmically, the lookup tables reduce the running time from O(L^2) to O(L) on a CPU
• Efficient on a CPU, but very inefficient on a GPU

Code Structure and Memory Requirements
[Figure: code structure diagram showing rtrn or rtrnmr, main, taugb## (01-16), rlwinit, cldprop, taumol, lwrad]
*: Time stated for taumol includes time used by the taugb## functions
**: Only one of these two functions (rtrn or rtrnmr) is used

Optimization Differences between CPU and GPU
• CPU
  • Each core has fairly large caches
  • For instance, on Intel Nehalem: 32KB L1 (data) and 256KB L2 per core, 4-12MB L3 shared
  • Precomputed lookup tables often provide a decent speedup over brute-force computation
  • NASA Goddard solar (short wave) radiation code and NOAA long wave radiation code are optimized in this way
• GPU
  • Each group of cores shares a much smaller on-chip memory (16 cores with 16KB in Tesla, 32 cores with 64KB in Fermi)
  • Brute-force calculation is more efficient due to the large number of SIMD cores (512 in Fermi)
  • Streaming computation with many threads is preferable to lookup-table-centric programming
  • Reversing the lookup table approach back to computational functions reduces memory consumption
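The contrast between the two styles can be sketched in C. This is a minimal illustration, not code from the NOAA model: physics_fn is a hypothetical stand-in for an expensive physical function, and the table size and input range are assumed values. The lookup version costs one data-dependent memory access per evaluation and needs the whole table resident; the direct version needs only arithmetic and a few registers per thread.

```c
#include <assert.h>

#define TBL_SIZE 1024      /* hypothetical table resolution */
#define X_MAX    10.0f     /* hypothetical upper bound of the input range */

/* Hypothetical stand-in for an expensive physical function of x. */
float physics_fn(float x)
{
    return 1.0f / (1.0f + x * x);
}

static float tbl[TBL_SIZE + 1];

/* CPU style: fill the table once, then index into it on every evaluation. */
void init_tbl(void)
{
    for (int i = 0; i <= TBL_SIZE; ++i)
        tbl[i] = physics_fn(X_MAX * (float)i / (float)TBL_SIZE);
}

/* Lookup version: nearest-entry read, one memory access per call. */
float eval_lookup(float x)
{
    assert(x >= 0.0f && x <= X_MAX);
    int i = (int)(x / X_MAX * (float)TBL_SIZE + 0.5f);
    return tbl[i];
}

/* Brute-force version: no table, only arithmetic. */
float eval_compute(float x)
{
    return physics_fn(x);
}
```

On a GPU, eval_compute is the shape the slide argues for: every thread does the same arithmetic with no shared table to fetch from, at the cost of repeating the computation.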

Translation from Fortran 90 to C
• Used a NOAA tool known as F2C-ACC to translate the Fortran 90 code to C
• C is better supported than Fortran for GPU programming, and will generally be supported first on future chips as well
• Fortran for GPUs has only recently been supported, by a compiler from PGI
• Little documentation, few examples, potentially less efficient than C code
• F2C-ACC did a relatively good job of translating the raw computation code, but the tool is not perfect
• Took approximately 3 months to hand-tune the conversion
• Hand editing of the translated code was necessary
• Some portions of the code were much more negatively impacted than others due to features not implemented in F2C-ACC (lookup tables were translated very poorly)
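One detail any Fortran-to-C translation has to handle, by tool or by hand, is array indexing: Fortran arrays are 1-based and column-major, while C arrays are 0-based and row-major. The sketch below shows one common way to address a Fortran-ordered 2-D array through a flat C buffer. The helper names are hypothetical and this is not F2C-ACC output.

```c
#include <assert.h>
#include <stddef.h>

/* Fortran declaration: REAL :: a(ni, nj), 1-based, column-major
 * (first index varies fastest). Returns the flat C offset of a(i, j). */
size_t ftn_idx2(size_t i, size_t j, size_t ni, size_t nj)
{
    assert(i >= 1 && i <= ni);
    assert(j >= 1 && j <= nj);
    return (i - 1) + (j - 1) * ni;
}

/* Sum column j of a Fortran-ordered array stored in a flat C buffer.
 * The inner index i walks contiguous memory, as it would in Fortran. */
float sum_column(const float *a, size_t j, size_t ni, size_t nj)
{
    float s = 0.0f;
    for (size_t i = 1; i <= ni; ++i)
        s += a[ftn_idx2(i, j, ni, nj)];
    return s;
}
```

Keeping the Fortran memory layout and translating only the index expression means data tables can be copied over unchanged, which matters for a code that carries large embedded tables.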

NOAA Long Wave Radiation: CUDA Issues
• Due to the memory requirements of the lookup-table-centric code, it cannot be compiled with CUDA for a GPU, or even with OpenCL on an IBM JS22 (POWER6)
• Each thread requires approximately 1MB of local storage (registers/memory), which is more than CUDA/OpenCL can cope with
• The GPU reserves the per-thread memory 32 times to cover a full warp, even if fewer than 32 threads are active within the warp
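The warp effect described above is simple arithmetic, sketched here in C. The function names are illustrative; the only inputs taken from the slide are the warp size of 32 and the figure of roughly 1MB per thread.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE 32   /* threads per warp, as stated on the slide */

/* Number of warps needed to hold `active_threads` threads (rounded up). */
size_t warps_needed(size_t active_threads)
{
    return (active_threads + WARP_SIZE - 1) / WARP_SIZE;
}

/* Local storage reserved when every warp is charged for all 32 threads,
 * whether or not they are all active. */
size_t local_bytes_reserved(size_t bytes_per_thread, size_t active_threads)
{
    assert(bytes_per_thread > 0);
    return warps_needed(active_threads) * WARP_SIZE * bytes_per_thread;
}
```

With about 1MB per thread, even a single active thread reserves about 32MB, and a 33rd thread doubles that to about 64MB.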

NOAA Long Wave Radiation: Status
• Successfully ported cldprop() to GPU
• Successfully ported taugb##() to GPU
• Currently optimizing the performance of these functions
• We plan to reverse the pre-calculated lookup tables back to brute-force computation
• Need to find the original code and/or re-implement from the AER documentation
