
Optimizing Ray Tracing on the Cell Microprocessor


Presentation Transcript


  1. Optimizing Ray Tracing on the Cell Microprocessor David Oguns

  2. Agenda • Goals and Motivation • What is SIMD? • Cell Architectural Overview • Implementation & Challenges • Demo • Performance Results • Future Improvements • Questions?

  3. Goals and Motivation • <3 Performance • Cell is a radically new architecture • Not just multicore programming • Get my hands dirty with SIMD • Learning

  4. Single Instruction Multiple Data • Normal processors execute a single instruction for a single result (Single Instruction Single Data). • SIMD processors execute a single instruction across multiple data for multiple results. • SIMD processors are sometimes called vector, stream, or DSP processors. • Useful in graphics acceleration, signal processing, and simulation applications.

  5. SISD vs SIMD Approach • SISD: A = 1 + 2; B = 3 + 4; C = 5 + 6; D = 7 + 8; • SIMD: V = { 1, 3, 5, 7 } + { 2, 4, 6, 8 };

  6. SIMD Continued... • Desktop CPUs usually have SIMD units • MMX, SSE, 3DNow!, AltiVec • Hardware likely to make heavy use of SIMD • GPUs • Ageia PhysX • Supercomputers • Signal processors in multimedia devices or sensors • Potential to be order(s) of magnitude faster in pure math...

  7. It's not all peachy • The previous example was contrived to show data and instruction parallelism. • The sum 1 + 2 did not depend on 3 + 4, or vice versa. • SIMDizing a single sum A + B + C + D is less efficient • SIMD can't help with a dependent chain like A + B + C. • Data must be aligned properly in vectors.

  8. Cell Overview • Asymmetric multicore processor designed by the STI group (Sony/Toshiba/IBM) for high performance computing applications. • 1 Power Processing Element (PPE) • Much like Intel/AMD desktop CPUs, but cheaper. • 8 Synergistic Processing Elements (SPEs) • SIMD processors • Element Interconnect Bus (EIB) • Four unidirectional rings, each 16 bytes wide • Connects the PPE, SPEs, main memory, and FlexIO. • Many ways to accomplish inter-element communication

  9. PPE • 64-bit, 3.2 GHz • Dual threaded • 512KB L2 cache • In-order execution • Transparent access to main memory • SIMD unit called AltiVec

  10. SPE • SIMD core clocked at 3.2 GHz • Even pipeline for most execution instructions • Odd pipeline for load/store/DMA/branch hints • 128 × 128-bit register file • 256KB Local Store • Very, very low latency (7 cycles) • Memory Flow Controller • No direct access to main memory! • No branch prediction hardware

  11. EIB • Arbitrates all communication between elements in the Cell. • Runs at half the clock speed of the PPE and SPEs • ~300 GB/s theoretical bandwidth • ~200 GB/s observed • Main memory is connected to the EIB, not the PPE. • FlexIO

  12. Intercore Communication • DMA (Direct Memory Access) • Like memcpy on crack. • Used primarily to keep data moving • Latency ~ hundreds of cycles • Non-blocking calls • Mailboxes • Short messages to and from SPEs • Signals and Interrupts

  13. http://www.blachford.info/computer/Cell/Cell0_v2.html

  14. Implementation • Could use a stock ray tracer and simply run it on the PPE • Let's make it run faster! • Multithreading (partitioning, synchronization) • Pthreads, libspe2 • SIMDizing (both on the PPE's AltiVec and the SPUs) • IBM toolchain language intrinsics • General C multiplatform development

  15. Multithreading or Partitioning? • How do we divide up work between N SPEs? • Goals are to load balance and minimize synchronization • Data-driven design

  16. Data Partitioning • Each SPE is given basic information • Scene address, frame buffer address, ray buffer address, number of pixels to process, sqrt(samples per pixel), numSpes, depth • Entire scene copied over to the LS via DMA. • Very small. • SPE N processes every Nth pixel • Load balances very well • No synchronization with PPE necessary*
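The strided scheme can be sketched as a plain-C loop; `shade_pixel` and `process_stride` are hypothetical stand-ins for the SPE-side code that generates and traces one pixel's rays:

```c
/* Strided partitioning: SPE n handles pixels n, n + numSpes, n + 2*numSpes, ...
   shade_pixel is a placeholder for generating and tracing one pixel's rays. */
static void shade_pixel(unsigned pixel) { (void)pixel; }

unsigned process_stride(unsigned spe_id, unsigned num_spes, unsigned num_pixels)
{
    unsigned count = 0;
    for (unsigned p = spe_id; p < num_pixels; p += num_spes) {
        shade_pixel(p);
        count++;
    }
    return count;  /* pixels handled by this SPE */
}
```

With 6 SPEs over a 1024x768 frame (786,432 pixels), every SPE gets exactly 131,072 pixels, which is why the scheme load balances without any coordination between SPEs.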

  17. First Approach • PPE • Set up scene • Generate primary rays in ray buffers • Initialize workloads for N SPEs • Launch each SPE thread • Wait for SPEs to finish running. • When the SPEs are done, the frame buffer is ready • SPE • Receive workload • Use multibuffered DMA to transfer and process primary rays • Output pixels using a DMA list to main memory. • Terminate when the DMA for outgoing pixels is complete.
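The multibuffered (double-buffered) streaming pattern can be sketched in plain C; `memcpy` stands in for the asynchronous `mfc_get` DMA, and `BUF_RAYS`, `ray_t`, and `trace` are made-up names, so this only illustrates the tile structure, not real transfer/compute overlap:

```c
#include <string.h>

#define BUF_RAYS 64  /* made-up tile size */

typedef struct { float o[3], d[3]; } ray_t;  /* toy ray */

static float trace(const ray_t *r) { return r->o[0]; }  /* placeholder shader */

void stream_rays(const ray_t *src, float *out, unsigned total)
{
    ray_t buf[2][BUF_RAYS];
    unsigned first = total < BUF_RAYS ? total : BUF_RAYS;
    memcpy(buf[0], src, first * sizeof(ray_t));  /* "DMA" tile 0 in */

    for (unsigned base = 0; base < total; base += BUF_RAYS) {
        unsigned cur = (base / BUF_RAYS) & 1;
        unsigned next = base + BUF_RAYS;
        if (next < total) {  /* start fetching the next tile into the other buffer */
            unsigned n = total - next < BUF_RAYS ? total - next : BUF_RAYS;
            memcpy(buf[cur ^ 1], src + next, n * sizeof(ray_t));
        }
        /* process the current tile; on real hardware the next DMA is in flight */
        unsigned n = total - base < BUF_RAYS ? total - base : BUF_RAYS;
        for (unsigned i = 0; i < n; i++)
            out[base + i] = trace(&buf[cur][i]);
    }
}
```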

  18. First Approach = FAIL • DMA is hard to use for streaming data... • Must implement a multibuffered solution for streaming rays • Outputting scattered pixels is a pain • Generating all primary rays at once uses a lot of memory. • Ray size: 48 bytes (36 bytes padded to 48) • Pixel size: 4 bytes. • 3MB frame buffer -> 36MB ray buffer (1024x768) • Super sampling makes it even worse! • 3MB frame buffer -> 144MB ray buffer at 4x super sampling • Removed the ray buffer from the normal ray tracer as well.

  19. Second Approach • PPE • Set up scene • Extra view plane information in scene • Initialize workloads for N SPEs • Launch each SPE thread • Poll each SPE's outbound mailbox for outgoing pixels. • Write pixel to frame buffer. • Done when last pixel received. • SPE • Receive info about workload • Generate and process primary rays on the fly • Output pixel using mailbox (blocking call)

  20. SIMDizing • Implemented my own vector library • Modified using conditional compilation • Later removed entire calls to my own functions • IBM Toolchain allows for some clean syntax. • Some of the compiler intrinsics work for both PPE AltiVec and SPEs.
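A hypothetical sketch of the conditional-compilation approach: the same `vec4_add` name maps to an SPU intrinsic when building for the SPU and to scalar C elsewhere (only the scalar path is exercised here; the `__SPU__` branch assumes the IBM toolchain's `spu_intrinsics.h`):

```c
/* Hypothetical vec4 add selected by conditional compilation: the SPU path
   uses spu_add on the native vector type, everything else falls back to
   plain scalar C. */
#ifdef __SPU__
#include <spu_intrinsics.h>
typedef vec_float4 vec4;
static inline vec4 vec4_add(vec4 a, vec4 b) { return spu_add(a, b); }
#else
typedef struct { float x, y, z, w; } vec4;
static inline vec4 vec4_add(vec4 a, vec4 b) {
    vec4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}
#endif
```

Keeping one call site (`vec4_add`) with two implementations is what lets the same ray tracing code compile for PPE, SPE, and plain desktop builds.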

  21. General C Multiplatform Development • Loading the scene was identical (PPE) • Rewrote start-up code for PPE and SPE • Ray tracing / shading algorithms • Identical on the SPE • Standard C math library not available; used SIMD Math instead.

  22. Challenges • First time doing something like this. • IBM SDK does not install easily. • Using the IBM SDK libraries was a pain. • Indirect nesting in data structures • Keeping data structures the same size on PPE and SPE. • PPE pointer: 64 bits; SPE pointer: 32 bits • Aligning memory...

  23. “Another common error in SPU programs is a DMA that specifies an invalid combination of Local Store address, effective address, and transfer size. The alignment rules for DMAs specify that transfers for less than 16 bytes must be "naturally aligned," meaning that the address must be divisible by the size. Transfers of 16 bytes or more must be 16-byte aligned. The size can have a value of 1, 2, 4, 8, 16, or a multiple of 16 bytes to a maximum of 16KB. In addition, the low-order four bits of the Local Store address must match the low-order four bits of effective address (in other words, they must have the same alignment within a quadword). Any DMA that violates one of these rules will generate an alignment exception which is presented to the user as a bus error.” This explains most of it... http://www.ibm.com/developerworks/power/library/pa-celldebug/
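The quoted alignment rules can be encoded as a plain-C validity check (illustrative only; the real MFC raises a bus error rather than returning a flag):

```c
#include <stdbool.h>
#include <stdint.h>

/* Checks a DMA (local store address, effective address, size) triple against
   the rules quoted above: size is 1, 2, 4, 8, 16, or a multiple of 16 up to
   16KB; small transfers are naturally aligned, large ones 16-byte aligned;
   and the low four bits of both addresses must match. */
bool dma_args_valid(uint32_t ls_addr, uint64_t ea, uint32_t size)
{
    if (size == 0 || size > 16 * 1024) return false;
    if (size < 16) {
        if (size != 1 && size != 2 && size != 4 && size != 8) return false;
        if (ls_addr % size || ea % size) return false;  /* natural alignment */
    } else {
        if (size % 16) return false;
        if (ls_addr % 16 || ea % 16) return false;      /* 16-byte alignment */
    }
    /* same alignment within a quadword */
    return (ls_addr & 0xF) == (ea & 0xF);
}
```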

  24. Demo (sort of...)

  25. Performance Results • Recursion depth of 4 • PS3 Cell implies use of 6 SPEs • No PPE multithreading • Pentium 4 trials used no hyperthreading.

  26. Performance Results • Charts for slides 26-29 are not included in the transcript.

  30. Performance Results • Run Time versus Number of SPEs • Test run at 1600x1200x4

  31. Future Improvements • Partition data differently • Eliminate all synchronization • More efficient use of SIMD • Colors were essentially vectors. • Eliminate branches, add branch hints • Learn to use IBM's profiling tools • Real time rendering is possible!

  32. Questions?
