
CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming

This chapter explores the fundamentals of floating-point representation and the IEEE-754 floating-point standard. It also discusses the speed, accuracy, and precision of floating-point arithmetic on the GeForce 8800 under CUDA, deviations from IEEE-754, the accuracy of device runtime functions, the -use_fast_math compiler option, and future performance considerations.

Presentation Transcript


  1. Fall 2010 • Jih-Kwon Peir • Computer Information Science Engineering • University of Florida CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming

  2. Chapter 7: Floating Point Considerations

  3. Objective • To understand the fundamentals of floating-point representation • To know the IEEE-754 Floating Point Standard • GeForce 8800 CUDA Floating-point speed, accuracy and precision • Deviations from IEEE-754 • Accuracy of device runtime functions • -use_fast_math compiler option • Future performance considerations • To understand CUDA on Multi-cores

  4. GPU Floating Point Features

  5. What is IEEE floating-point format? • A floating point binary number consists of three parts: • sign (S), exponent (E), and mantissa (M). • Each (S, E, M) pattern uniquely identifies a floating point number. • For each bit pattern, its IEEE floating-point value is derived as: • value = (-1)^S * M * 2^E, where 1.0B ≤ M < 10.0B • The interpretation of S is simple: S=0 results in a positive number and S=1 a negative number.

  6. Normalized Representation • Specifying that 1.0B ≤ M < 10.0B makes the mantissa value for each floating point number unique. • For example, the only mantissa value allowed for 0.5D is M = 1.0 • 0.5D = 1.0B * 2^-1 • Neither 10.0B * 2^-2 nor 0.1B * 2^0 qualifies • Because all mantissa values are of the form 1.XX…, one can omit the “1.” part in the representation. • The mantissa value of 0.5D in a 2-bit mantissa is 00, which is derived by omitting “1.” from 1.00.

  7. Exponent Representation • In an n-bit exponent representation, 2^(n-1) - 1 is added to its 2's complement representation to form its excess representation. • See the table for a 3-bit exponent representation. • A simple unsigned integer comparator can then be used to compare the magnitude of two FP numbers. • Symmetric range for +/- exponents (111 reserved)

  8. A simple, hypothetical 5-bit FP format • Assume 1-bit S, 2-bit E, and 2-bit M • 0.5D = 1.00B * 2^-1 • 0.5D = 0 00 00, where S = 0, E = 00, and M = (1.)00
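
  To make the decoding above concrete, here is a minimal host-side C sketch (not from the slides) that interprets the hypothetical 5-bit pattern; the excess-1 bias and the implied leading "1." follow the slide's conventions, and the function name decode5 is made up:

    #include <stdio.h>
    #include <math.h>

    /* Hypothetical 5-bit format: 1-bit sign, 2-bit excess-1 exponent,
     * 2-bit mantissa with an implied leading "1." (normalized numbers only). */
    float decode5(unsigned bits)
    {
        unsigned s = (bits >> 4) & 0x1;             /* sign bit          */
        int      e = (int)((bits >> 2) & 0x3) - 1;  /* excess-1 exponent */
        unsigned m = bits & 0x3;                    /* 2-bit mantissa    */
        float mant = 1.0f + m / 4.0f;               /* 1.MM in binary    */
        float val  = ldexpf(mant, e);               /* mant * 2^e        */
        return s ? -val : val;
    }

    int main(void)
    {
        printf("%g\n", decode5(0x00));   /* bit pattern 0 00 00 prints 0.5, as on the slide */
        return 0;
    }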

  9. Representable Numbers • The representable numbers of a given format are the set of all numbers that can be exactly represented in the format. • See the table for the representable numbers of an unsigned 3-bit integer format. • (Figure: number line marking the representable values 0 through 7.)

  10. Representable Numbers of a 5-bit Hypothetical IEEE Format • (Table of representable numbers.) • Cannot represent zero!

  11. Flush to Zero • Treat all bit patterns with E=0 as 0.0 • This takes away several representable numbers near zero and lumps them all into 0.0 • For a representation with a large M, a large number of representable numbers will be removed.

  12. Flush to Zero • (Number line of the representable numbers under flush-to-zero.)

  13. Denormalized Numbers • The actual method adopted by the IEEE standard is called denormalized numbers or gradual underflow. • The method relaxes the normalization requirement for numbers very close to 0. • Whenever E=0, the mantissa is no longer assumed to be of the form 1.XX; rather, it is assumed to be 0.XX. In general, if the n-bit exponent is 0, the value is 0.M * 2^(-2^(n-1) + 2)

  14. Denormalization • (Number line of the representable numbers with denormalized values filling the gap near zero.)

  15. Arithmetic Instruction Throughput • int and float add, shift, min, max and float mul, mad: 4 cycles per warp • int multiply (*) is by default 32-bit • requires multiple cycles / warp • Use __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply • Integer divide and modulo are expensive • Compiler will convert literal power-of-2 divides to shifts • Be explicit in cases where compiler can’t tell that divisor is a power of 2! • Useful trick: foo % n == foo & (n-1) if n is a power of 2
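
  The two integer tricks above can be sketched in a short CUDA kernel (illustrative only; the kernel name and the assumption that n is a power of 2 are mine):

    /* n is assumed to be a power of 2 */
    __global__ void indexDemo(int *out, int n)
    {
        /* 24-bit multiply intrinsic: 4-cycle-class integer multiply on G80 */
        int tid = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;

        /* Power-of-2 modulo trick from the slide: tid % n == tid & (n - 1) */
        out[tid] = tid & (n - 1);
    }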

  16. Arithmetic Instruction Throughput • Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp • These are the versions prefixed with “__” • Examples: __rcp(), __sin(), __exp() • Other functions are combinations of the above • y / x == rcp(x) * y == 20 cycles per warp • sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp

  17. Runtime Math Library • There are two types of runtime math operations • __func(): direct mapping to hardware ISA • Fast but low accuracy (see prog. guide for details) • Examples: __sin(x), __exp(x), __pow(x,y) • func(): compile to multiple instructions • Slower but higher accuracy (5 ulp, units in the last place, or less) • Examples: sin(x), exp(x), pow(x,y) • The -use_fast_math compiler option forces every func() to compile to __func()
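
  A hedged sketch contrasting the two flavors; note that the single-precision intrinsics are spelled __sinf()/__expf() in device code, and the kernel below is illustrative only:

    __global__ void mathDemo(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = in[i];

        float accurate = sinf(x) * expf(x);       /* library versions: slower, higher accuracy */
        float fast     = __sinf(x) * __expf(x);   /* hardware intrinsics: fast, lower accuracy */

        out[i] = accurate - fast;                 /* copy the difference out for inspection    */
    }

  Compiling with nvcc -use_fast_math would make the sinf()/expf() calls above map to the intrinsic versions automatically.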

  18. Make your program float-safe! • Future hardware will have double precision support • G80 is single-precision only • Double precision will have additional performance cost • Careless use of double or undeclared types may run more slowly on G80+ • Important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed • Add ‘f’ specifier on float literals: • foo = bar * 0.123; // double assumed • foo = bar * 0.123f; // float explicit • Use float version of standard library functions • foo = sin(bar); // double assumed • foo = sinf(bar); // single precision explicit
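
  A minimal float-safe kernel sketch applying both rules from the slide (names are illustrative):

    __global__ void scaleAndSin(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        float x = in[i] * 0.123f;   /* 'f' suffix keeps the literal single precision      */
        out[i]  = sinf(x);          /* sinf(), not sin(): single-precision library call   */
    }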

  19. Deviations from IEEE-754 • Addition and Multiplication are IEEE 754 compliant • Maximum 0.5 ulp (units in the last place) error • However, they are often combined into a multiply-add (FMAD) • The intermediate result is truncated • Division is non-compliant (2 ulp) • Not all rounding modes are supported • Denormalized numbers are not supported • No mechanism to detect floating-point exceptions

  20. GPU Floating Point Features

  21. Floating-Point Calculation Results Can Depend on Execution Order • Order 1: 1.00*2^0 + 1.00*2^0 + 1.00*2^-2 + 1.00*2^-2 = 1.00*2^1 + 1.00*2^-2 + 1.00*2^-2 = 1.00*2^1 + 1.00*2^-2 = 1.00*2^1 • Order 2: (1.00*2^0 + 1.00*2^0) + (1.00*2^-2 + 1.00*2^-2) = 1.00*2^1 + 1.00*2^-1 = 1.01*2^1 • (With a limited mantissa, each 2^-2 term added to 2^1 individually is rounded away in Order 1, but their combined value 2^-1 is large enough to change the result in Order 2.) • Pre-sorting is often used to increase the stability of floating point results.
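
  The same effect shows up with ordinary 32-bit float on the host; in this illustrative C snippet (values chosen so that each small term falls below the rounding threshold of the large one), the two summation orders give different results:

    #include <stdio.h>

    int main(void)
    {
        float big = 1.0e8f, small = 3.0f;   /* the ulp of float at 1e8 is 8 */

        /* Order 1: each small term is added to the big one separately and is rounded away. */
        float order1 = (big + small) + small;

        /* Order 2: the small terms are combined first; their sum is large enough to change the result. */
        float order2 = big + (small + small);

        printf("order1 = %.1f\norder2 = %.1f\n", order1, order2);   /* 100000000.0 vs 100000008.0 */
        return 0;
    }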

  22. Supplement Material

  23. CUDA Device Memory Space: Review • (Figure: the CUDA memory model: the host plus a device grid of blocks (0,0) and (1,0); each thread has its own registers and local memory, each block has shared memory, and the grid shares global, constant, and texture memory.) • Each thread can: • R/W per-thread registers • R/W per-thread local memory • R/W per-block shared memory • R/W per-grid global memory • Read only per-grid constant memory • Read only per-grid texture memory • The host can R/W global, constant, and texture memories using copy functions
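
  As a reminder of the host-side copy functions mentioned above, here is a minimal global-memory sketch (array size and names are arbitrary; error checking omitted):

    int main(void)
    {
        const int N = 256;
        float h_data[N];
        /* ... fill h_data on the host ... */

        float *d_data;                                        /* per-grid global memory      */
        cudaMalloc((void**)&d_data, N * sizeof(float));
        cudaMemcpy(d_data, h_data, N * sizeof(float),         /* host writes global memory   */
                   cudaMemcpyHostToDevice);

        /* ... launch kernels that read/write d_data ... */

        cudaMemcpy(h_data, d_data, N * sizeof(float),         /* host reads the results back */
                   cudaMemcpyDeviceToHost);
        cudaFree(d_data);
        return 0;
    }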

  24. Parallel Memory Sharing • Local Memory: per-thread • Private per thread • Auto variables, register spill • Shared Memory: per-Block • Shared by threads of the same block • Inter-thread communication • Global Memory: per-application • Shared by all threads • Inter-Grid communication • (Figure: a thread, a block, and sequential grids in time (Grid 0, Grid 1) mapped to local, shared, and global memory respectively.)

  25. SM Memory Architecture • (Figure: two SMs (SM 0, SM 1), each with an MT issue unit, SPs, and shared memory, running threads t0 t1 t2 … tm from their assigned blocks and sharing a texture fetch unit with texture L1 and L2 memory. Courtesy: John Nicols, NVIDIA) • Threads in a block share data & results • In Memory and Shared Memory • Synchronize at barrier instruction • Per-Block Shared Memory Allocation • Keeps data close to processor • Minimizes trips to global Memory • Shared Memory is dynamically allocated to blocks, one of the limiting resources

  26. SM Register File • Register File (RF) • 32 KB (8K entries) for each SM in G80 • TEX pipe can also read/write RF • 2 SMs share 1 TEX • Load/Store pipe can also read/write RF • (Figure: SM datapath: L1 I-cache, multithreaded instruction buffer, constant L1 cache, shared memory, register file, operand select, MAD and SFU units.)

  27. Programmer View of Register File • There are 8192 registers in each SM in G80 • This is an implementation decision, not part of CUDA • Registers are dynamically partitioned across all blocks assigned to the SM • Once assigned to a block, a register is NOT accessible by threads in other blocks • Each thread in the same block only accesses registers assigned to itself • (Figure: the register file partitioned among 3 blocks vs. 4 blocks.)

  28. Matrix Multiplication Example • If each block has 16×16 threads and each thread uses 10 registers, how many threads can run on each SM? • Each block requires 10*256 = 2560 registers • 8192 = 3 * 2560 + change • So, three blocks can run on an SM as far as registers are concerned • How about if each thread increases its use of registers by 1? • Each block now requires 11*256 = 2816 registers • 8192 < 2816 * 3 • Only two blocks can run on an SM, a 1/3 reduction of parallelism!!! • Programmers’ responsibility!!!
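
  The register arithmetic above can be reproduced with a throwaway host-side helper (illustrative only; it ignores the other occupancy limits):

    #include <stdio.h>

    /* Blocks that fit in one SM's register file (G80: 8192 registers per SM). */
    int blocksPerSM(int threadsPerBlock, int regsPerThread, int regsPerSM)
    {
        return regsPerSM / (threadsPerBlock * regsPerThread);
    }

    int main(void)
    {
        printf("%d\n", blocksPerSM(16 * 16, 10, 8192));   /* 3 blocks */
        printf("%d\n", blocksPerSM(16 * 16, 11, 8192));   /* 2 blocks */
        return 0;
    }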

  29. More on Dynamic Partitioning • Dynamic partitioning gives more flexibility to compilers/programmers • One can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each • This allows for finer-grained threading than traditional CPU threading models • The compiler can trade off between instruction-level parallelism and thread-level parallelism • Note: not only registers, but also shared memory!!

  30. ILP vs. TLP Example • Assume that a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, each thread uses 10 registers, and global loads take 200 cycles • 3 blocks can run on each SM • If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load • Only two blocks can run on each SM • However, one only needs 200/(8*4) = 7 warps to tolerate the memory latency • Two blocks have 16 warps, so the performance can actually be higher!

  31. Memory Layout of a Matrix in C • (Figure: a 4×4 matrix with elements Mx,y and its row-major linear layout in memory: M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 … M3,3.)

  32. Memory Coalescing • When accessing global memory, peak performance utilization occurs when all threads in a half warp access contiguous memory locations. • (Figure: matrices Md and Nd of width WIDTH; one access pattern by threads 1 and 2 is not coalesced, the other is coalesced.)

  33. Memory Layout of a Matrix in C • (Figure: access direction in the kernel code such that the elements touched by threads T1–T4 during the same time period are not adjacent in the row-major layout: not coalesced.)

  34. Memory Layout of a Matrix in C • (Figure: access direction in the kernel code such that threads T1–T4 touch adjacent elements of the row-major layout during each time period: coalesced.)
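
  The contrast in the three slides above can be sketched as two CUDA kernels (illustrative only; WIDTH and the launch configuration are assumptions):

    #define WIDTH 1024   /* assumed matrix width, a multiple of the half-warp size */

    /* Coalesced: consecutive threads (threadIdx.x) touch consecutive addresses in a
     * row, so a half-warp reads one contiguous segment of global memory. */
    __global__ void readCoalesced(float *out, const float *M)
    {
        int row = blockIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        out[row * WIDTH + col] = M[row * WIDTH + col];
    }

    /* Not coalesced: consecutive threads touch addresses WIDTH elements apart
     * (walking down a column), so the half-warp's accesses cannot be combined. */
    __global__ void readStrided(float *out, const float *M)
    {
        int col = blockIdx.y;
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        out[row * WIDTH + col] = M[row * WIDTH + col];
    }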

  35. Constants • Immediate address constants • Indexed address constants • Constants stored in DRAM, and cached on chip • L1 per SM • A constant value can be broadcast to all threads in a Warp • Extremely efficient way of accessing a value that is common for all threads in a block! • (Figure: SM datapath, as on the SM Register File slide.)
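
  A small sketch of constant memory and its broadcast behavior (the symbol coeff and the kernel are illustrative, not from the slides):

    __constant__ float coeff[16];            /* stored in DRAM, cached per SM, read-only in kernels */

    __global__ void scale(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        /* Every thread of the warp reads coeff[0]: one broadcast from the constant cache. */
        data[i] *= coeff[0];
    }

    /* Host side: constant memory is written with cudaMemcpyToSymbol. */
    void setCoeff(const float *h_coeff)
    {
        cudaMemcpyToSymbol(coeff, h_coeff, 16 * sizeof(float));
    }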

  36. Shared Memory • Each SM has 16 KB of Shared Memory • 16 banks of 32-bit words • CUDA uses Shared Memory as shared storage visible to all threads in a thread block • Fast read and write access • Not used explicitly for pixel shader programs • we dislike pixels talking to each other • (Figure: SM datapath, as on the SM Register File slide.)

  37. Parallel Memory Architecture • In a parallel machine, many threads access memory • Therefore, memory is divided into banks • Essential to achieve high bandwidth • Each bank can service one address per cycle • A memory can service as many simultaneous accesses as it has banks • Multiple simultaneous accesses to a bank result in a bank conflict • Conflicting accesses are serialized • (Figure: shared memory divided into banks 0 through 15.)

  38. Bank Addressing Examples • No bank conflicts: linear addressing with stride 1 (thread i maps to bank i) • No bank conflicts: random 1:1 permutation (each of threads 0–15 maps to a distinct bank)

  39. Bank Addressing Examples • 2-way bank conflicts: linear addressing with stride 2 (two threads map to each bank used) • 8-way bank conflicts: linear addressing with stride 8 (eight threads map to each of the two banks used)

  40. How addresses map to banks on G80 • Each bank has a bandwidth of 32 bits per clock cycle • Successive 32-bit words are assigned to successive banks • G80 has 16 banks • So bank = (32-bit word address) % 16 • Same as the size of a half-warp • Memory access scheduling is half-warp based • No bank conflicts between different half-warps, only within a single half-warp.
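
  A one-line helper that instantiates the mapping rule (assuming the G80 figures above: 16 banks, 32-bit words; the function name is made up):

    /* Bank that the i-th 32-bit shared-memory word falls into on G80. */
    __host__ __device__ int bankOfWord(int wordIndex)
    {
        return wordIndex & 15;    /* same as wordIndex % 16 */
    }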

  41. Shared memory bank conflicts • Shared memory is as fast as registers if there are no bank conflicts • The fast case: • If all threads of a half-warp access different banks, there is no bank conflict • If all threads of a half-warp access the identical address, there is no bank conflict (broadcast) • The slow case: • Bank Conflict: multiple threads in the same half-warp access the same bank • Must serialize the accesses • Cost = max # of simultaneous accesses to a single bank
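
  A hedged sketch of the fast and slow cases in one kernel (assumes a launch with 256 threads per block and an input array of at least 512 floats; names are illustrative):

    __global__ void bankCases(float *out, const float *in)
    {
        __shared__ float buf[512];
        int tid = threadIdx.x;

        buf[tid]       = in[tid];         /* stride-1 writes: conflict-free                        */
        buf[tid + 256] = in[tid + 256];
        __syncthreads();

        float a = buf[tid];        /* stride 1: each thread of a half-warp hits a different bank   */
        float b = buf[0];          /* all threads read one address: broadcast, no conflict         */
        float c = buf[2 * tid];    /* stride 2: pairs of threads share a bank, 2-way conflict      */
        out[tid] = a + b + c;
    }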

  42. Linear Addressing • Given: __shared__ float shared[256]; float foo = shared[baseIndex + s * threadIdx.x]; • This is only bank-conflict-free if s shares no common factors with the number of banks • 16 on G80, so s must be odd • (Figure: thread-to-bank mappings for s=1 and s=3, both conflict-free.)

  43. Differences between Global and Shared Memory • Global memory: use coalesced reads (see the programming guide). It is likely that the bandwidth of the global memory interface is 64 bytes, hence it can coalesce 16 32-bit words from a half warp; or read using a texture with good spatial locality within each warp. • Shared memory: use the multiple (16) banks for independent accesses from a half warp. • There are many interesting discussions on the memory issue in the CUDA Forum: http://forums.nvidia.com/index.php?showtopic=63919
