
An Evaluation of Graphics Processors as Stream Co-Processors




Presentation Transcript


  1. An Evaluation of Graphics Processors as Stream Co-Processors Half-baked paper Francois Labonte, Ian Buck, Mark Horowitz, Christos Kozyrakis

  2. Graphics chips are fast

  3. And they’re now programmable
  • Traditional pipeline: Application → Command → Geometry → Rasterization → Texture → Fragment → Display
  • Programmable pipeline: the same stages, with the Geometry stage replaced by a Vertex Program and the Fragment stage replaced by a Fragment Program

  4. GPU ISA
  • Short vectors of 4 words; words can be bytes, 16-bit fp, or 32-bit fp
  • Vector instructions: ADD, SUB, MUL, MAX, MIN, SGE, SLT, MAD, CMP, DP3, DP4
  • Scalar instructions: RCP, RSQ, LG2, COS, SIN
  • Texture instructions (memory reads); fetches can also be dependent
  • Swizzle: allows the positions of words in an operand vector to be switched
  • Word mask: write enable on each output word

  5. Two programmable parts
  • Vertex programs are run on each vertex
  • Vertex programs can have conditionals, but no texture access
  • 4 parallel pipes in the current generation
  • Fragment programs are run on each fragment (pixel)
  • No conditionals; many textures, dependent texture access
  • 8 parallel pipes in the current generation
  • We’ll concentrate on fragment programs from now on

  6. Meanwhile at Stanford and other places
  • Some people are obsessed with stream computing.
  • Being among them, and wanting to program different architectures using streams, we have devised a Stream Virtual Machine (SVM).
  • The SVM abstracts an underlying architecture so that a compiler can produce stream code for it.

  7. SVM concept
  • The SVM is a co-processor model: a thread processor controls a stream co-processor, which has a special fast memory.
  • Not a far stretch from GPUs: the CPU is the thread processor, graphics memory is the stream register file, and the GPU is the stream processor.

  8. The Goal of this Paper
  • Evaluate the performance of the GPU as an SVM
  • Look at the mechanisms that are available
  • Limitations of GPU programming
  • Architectural issues with the current generation
  • Look at both Nvidia’s and ATI’s best offerings:
  • Nvidia GeForceFX 5900 Ultra (NV35)
  • ATI Radeon 9800 Pro (R350)

  9. Bandwidth between host and GPU
  • Host→GPU: 350 MB/s
  • GPU→Host: 181 MB/s
  • For comparison, AGP 2.0 (4x) is 1066 MB/s and AGP 3.0 (8x) is 2133 MB/s, so measured rates fall far short of the bus peak

  10. Strided memory access
  (Figure: 1D view of fragment memory accessed with a stride)
  • Study the memory accesses of 2D textures: the most common access pattern when using the output of one fragment program as input to the next pass
  • We are mapping a 1D memory space into 2D; there are two ways to do it: row major and column major
  • We perform strided memory accesses where the stride is varied

  11. Strided Bandwidth • ALU-limited up to 3 textures; 13 GB/s is the maximum

  12. Random memory access
  • Experiment setup:
  • Read a randomly initialized texture
  • Use the texture read to index multiple other textures (which are thus randomly accessed)
  • Determine the size of the cache

  13. Random Memory Access Bandwidth • Small texture cache: 8×8 texels × 4 words × 4 bytes = 1 kB (need more data points between 8 and 16)

  14. Floating Point Ops per second
  • ATI is stable at about 3 G Inst/s: a 380 MHz clock × 8 parallel pipelines => 3040 M Inst/s
  • Some instructions are implemented as multiple internal instructions (COS takes 10)
  • Nvidia rocks for pure MUL and ADD

  15. Nvidia’s funky architecture • NV35 is rumored to process fragments in 2×2 quads, so 4 times the pipeline shown above, which has 3 MADD units => 12 × 450 MHz = 5400 M Inst/s

  16. Dependence on number of live registers
  • ATI is solid as a rock
  • The Nvidia Ferrari quickly becomes a skateboard: from the previous diagram, my guess is that you lose the 2 final MUL units if you use more than 3 registers (a drop of 1/3), but then it gets worse, much worse.

  17. Reductions
  • In a kernel you sometimes want to perform an operation that is commutative and associative, e.g.:
    for (i = 0; i < N; i++) { /* do some work */ sum += work; }
  • This is fully parallel: with N processors we can compute N partial sums and add them at the end
  • On the GPU we cannot carry state from one fragment to another, so we need multiple passes to combine results in a tree fashion. This is a programming issue (the GPU doesn’t let us access state across fragments).

  18. Drops at square textures? (need to check) • Plot against the reduction we could do if we could carry state across fragments
