Why multi-threading/multi-core?
• Clock rates are stagnant
• Future CPUs will be predominantly multi-thread/multi-core
  • Xbox 360 has 3 cores
  • PS3 will be multi-core
  • >70% of PC sales will be multi-core by end of 2006
  • Most Windows Vista systems will be multi-core
• Two performance possibilities:
  • Single-threaded? Minimal performance growth
  • Multi-threaded? Exponential performance growth
Design for Multithreading
• Good design is critical
• Bad multithreading can be worse than no multithreading
  • Deadlocks, synchronization bugs, poor performance, etc.
Bad Multithreading
[diagram: five threads, Thread 1 through Thread 5]
Good Multithreading
[diagram: a main thread and game threads coordinating dedicated threads for physics, rendering, particle systems, animation/skinning, networking, and file I/O]
Another Paradigm: Cascades
[diagram: five threads (Input, Physics, AI, Rendering, Present), each owning one pipeline stage, with work for frames 1-4 cascading from thread to thread]
• Advantages:
  • Synchronization points are few and well-defined
• Disadvantages:
  • Increases latency (for constant frame rate)
  • Needs simple (one-way) data flow
Typical Threaded Tasks
• File Decompression
• Rendering
• Graphics Fluff
• Physics
File Decompression
• Most common CPU-heavy thread on the Xbox 360
• Easy to multithread
• Allows use of aggressive compression to improve load times
• Don't throw a thread at a problem better solved by offline processing
  • Texture compression, file packing, etc.
Rendering
• Separate update and render threads
• Rendering on multiple threads (D3DCREATE_MULTITHREADED) works poorly
  • Exception: Xbox 360 command buffers
• Special case of the cascades paradigm
  • Pass render state from update to render
  • With constant workload: same latency, better frame rate
  • With increased workload: same frame rate, worse latency
Graphics Fluff
• Extra graphics that don't affect gameplay
  • Procedurally generated animating cloud textures
  • Cloth simulations
  • Dynamic ambient occlusion
  • Procedurally generated vegetation, etc.
  • Extra particles, better particle physics, etc.
• Easy to synchronize
• Potentially expensive, but if the core is otherwise idle...?
Physics?
• Could cascade from update to physics to rendering
  • Makes use of three threads
  • May be too much latency
• Could run physics on many threads
  • Uses many threads while doing physics
  • May leave threads mostly idle elsewhere
Overcommitted Multithreading?
[diagram: game thread, physics, particle systems, animation/skinning, and three rendering threads all running at once]
How Many Threads?
• No more than one CPU-intensive software thread per core
  • 3-6 on Xbox 360
  • 1-? on PC (1-4 for now; query the core count at run time, as sketched below)
• Too many busy threads adds complexity and lowers performance
  • Context switches are not free
• Can have many non-CPU-intensive threads
  • I/O threads that block, or intermittent tasks
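A minimal sketch of querying that core count on Windows with GetSystemInfo; the helper name is an illustration added here, not part of the original slides:

#include <windows.h>

// Ask the OS how many logical processors (hardware threads) are
// available; cap the number of CPU-intensive worker threads at this.
DWORD GetHardwareThreadCount()
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return si.dwNumberOfProcessors;  // counts SMT/Hyper-Threading siblings too
}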
Case Study: Kameo (Xbox 360)
• Started single-threaded
• Rendering was taking half of the frame time, so it was moved to a separate thread
• Two render-description buffers created to communicate from update to render
  • Linear read/write access for best cache usage
  • Doesn't copy constant data
• File I/O and decompression on other threads
Separate Rendering Thread
[diagram: the update thread fills Buffer 0 or Buffer 1 while the render thread reads the other]
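As a rough illustration of the two-buffer handoff in this diagram, here is a minimal producer/consumer sketch using Win32 semaphores. The RenderDescription type, the names, and the semaphore-based signaling are assumptions for illustration only, not Kameo's actual code:

#include <windows.h>

struct RenderDescription { /* flattened render state written by update, read by render */ };

RenderDescription g_buffer[2];
HANDLE g_freeBuffers  = CreateSemaphore(NULL, 2, 2, NULL); // both buffers start free
HANDLE g_readyBuffers = CreateSemaphore(NULL, 0, 2, NULL); // none filled yet

// Update thread: produce one frame of render state.
void UpdateFrame(int frame)
{
    WaitForSingleObject(g_freeBuffers, INFINITE);     // claim a free buffer
    RenderDescription& desc = g_buffer[frame & 1];
    // ... write this frame's render state into desc (linear writes) ...
    ReleaseSemaphore(g_readyBuffers, 1, NULL);        // publish it to the render thread
}

// Render thread: consume one frame of render state.
void RenderFrame(int frame)
{
    WaitForSingleObject(g_readyBuffers, INFINITE);    // wait for a filled buffer
    const RenderDescription& desc = g_buffer[frame & 1];
    // ... issue draw calls from desc ...
    ReleaseSemaphore(g_freeBuffers, 1, NULL);         // hand the buffer back to update
}

With two buffers the update thread can run at most one frame ahead of the render thread, which is where the extra frame of latency described earlier comes from.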
Case Study: Kameo (Xbox 360) • Total usage was ~2.2-2.5 cores
Case Study: Project Gotham Racing • Total usage was ~2.0-3.0 cores
Available Synchronization Objects
• Events
• Semaphores
• Mutexes
• Critical Sections
• Don't use SuspendThread()
  • Some titles have used this for synchronization
  • Can easily lead to deadlocks
  • Interacts badly with the Visual Studio debugger
Exclusive Access: Mutex

// Initialize
HANDLE mutex = CreateMutex(0, FALSE, 0);

// Use
void ManipulateSharedData()
{
    WaitForSingleObject(mutex, INFINITE);
    // Manipulate stuff...
    ReleaseMutex(mutex);
}

// Destroy
CloseHandle(mutex);
Exclusive Access: CRITICAL_SECTION

// Initialize
CRITICAL_SECTION cs;
InitializeCriticalSection(&cs);

// Use
void ManipulateSharedData()
{
    EnterCriticalSection(&cs);
    // Manipulate stuff...
    LeaveCriticalSection(&cs);
}

// Destroy
DeleteCriticalSection(&cs);
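Not on the original slides, but a common C++ companion to the code above: a small scope guard guarantees LeaveCriticalSection runs on every exit path, including early returns. A minimal sketch (the class name is illustrative):

class ScopedCriticalSection
{
public:
    explicit ScopedCriticalSection(CRITICAL_SECTION& cs) : m_cs(cs)
    {
        EnterCriticalSection(&m_cs);
    }
    ~ScopedCriticalSection()
    {
        LeaveCriticalSection(&m_cs);
    }
private:
    CRITICAL_SECTION& m_cs;
    ScopedCriticalSection(const ScopedCriticalSection&);            // non-copyable
    ScopedCriticalSection& operator=(const ScopedCriticalSection&);
};

// Use
void ManipulateSharedData()
{
    ScopedCriticalSection lock(cs);
    // Manipulate stuff... the lock is released when 'lock' goes out of scope
}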
Lockless Programming
• Trendy technique that uses clever programming to share resources without locking
• Includes InterlockedXXX(), lockless message passing, Double-Checked Locking, etc.
• Very hard to get right:
  • Compiler can reorder instructions
  • CPU can reorder instructions
  • CPU can reorder reads and writes
• Not as fast as avoiding synchronization entirely
Lockless Messages: Buggy

void SendMessage(void* input)
{
    // Wait for the message to be 'empty'.
    while (g_msg.filled)
        ;
    memcpy(g_msg.data, input, MESSAGESIZE);
    g_msg.filled = true;
}

void GetMessage()
{
    // Wait for the message to be 'filled'.
    while (!g_msg.filled)
        ;
    memcpy(localMsg.data, g_msg.data, MESSAGESIZE);
    g_msg.filled = false;
}
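The slide above leaves the bug in place: nothing stops the compiler or the CPU from reordering the memcpy relative to the update of g_msg.filled, so the reader can observe filled == true before the data is actually there. One way to repair it, shown here as a sketch using C++11 std::atomic (which postdates the original talk), is to give the flag acquire/release semantics:

#include <atomic>
#include <cstring>

const size_t MESSAGESIZE = 256;

struct Message
{
    char data[MESSAGESIZE];
    std::atomic<bool> filled{false};
};

Message g_msg;

void SendMessage(const void* input)
{
    // Wait for the message to be 'empty'.
    while (g_msg.filled.load(std::memory_order_acquire))
        ;
    memcpy(g_msg.data, input, MESSAGESIZE);
    // Release store: the memcpy above must be visible to the
    // consumer before it can observe filled == true.
    g_msg.filled.store(true, std::memory_order_release);
}

void GetMessage(void* output)
{
    // Wait for the message to be 'filled'.
    while (!g_msg.filled.load(std::memory_order_acquire))
        ;
    memcpy(output, g_msg.data, MESSAGESIZE);
    // Release store: the copy out must finish before the
    // producer is allowed to overwrite the buffer.
    g_msg.filled.store(false, std::memory_order_release);
}

On hardware with weak memory ordering (such as the Xbox 360's PowerPC cores) the same effect requires explicit memory barriers; a plain volatile flag is not enough.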
Synchronization Tips/Costs
• Synchronization is moderately expensive when there is no contention
  • Hundreds to thousands of cycles
• Synchronization can be arbitrarily expensive when there is contention!
• Goals:
  • Synchronize rarely
  • Hold locks briefly
  • Minimize shared data
Threading File I/O & Decompression
• First: use large reads and asynchronous I/O (see the sketch below)
• Then: consider compression to accelerate loading
• Don't do format conversions etc. that are better done at build time!
• Have resource proxies to allow rendering to continue
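A minimal sketch of one large asynchronous read with the Win32 overlapped I/O API; the helper name and the minimal error handling are illustrative only:

#include <windows.h>

// Kick off one large asynchronous read and return immediately; the
// caller waits on the OVERLAPPED event only when it needs the data.
HANDLE BeginAsyncRead(const char* path, void* buffer, DWORD bytes, OVERLAPPED* ov)
{
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return INVALID_HANDLE_VALUE;

    ZeroMemory(ov, sizeof(*ov));
    ov->hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);  // signaled on completion

    if (!ReadFile(file, buffer, bytes, NULL, ov) &&
        GetLastError() != ERROR_IO_PENDING)
    {
        CloseHandle(ov->hEvent);
        CloseHandle(file);  // a real error, not just "still in flight"
        return INVALID_HANDLE_VALUE;
    }
    return file;
}

// Later, when the data is actually needed:
//   DWORD bytesRead;
//   GetOverlappedResult(file, &ov, &bytesRead, TRUE);   // TRUE = wait for completion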
File I/O Implementation Details
• vector<Resource*> g_resources;
• Worst design: decompressor locks g_resources while decompressing
• Better design: decompressor adds resources to the vector after decompressing
  • Still requires the renderer to sync on every resource access
• Best design: two Resource* vectors (sketched below)
  • Renderer has a private vector; no locking required
  • Decompressor uses a shared vector; syncs when adding a new Resource*
  • Renderer moves Resource* from the shared to the private vector once per frame
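A minimal sketch of the "best design" above, assuming a CRITICAL_SECTION guards the shared vector; the names and the Resource type are placeholders:

#include <windows.h>
#include <vector>

struct Resource;

std::vector<Resource*> g_sharedResources;   // written by the decompressor thread
std::vector<Resource*> g_privateResources;  // touched only by the render thread
CRITICAL_SECTION       g_sharedLock;        // initialized elsewhere

// Decompressor thread: publish a finished resource (brief lock).
void PublishResource(Resource* resource)
{
    EnterCriticalSection(&g_sharedLock);
    g_sharedResources.push_back(resource);
    LeaveCriticalSection(&g_sharedLock);
}

// Render thread, once per frame: move new resources into the private
// vector, then render from the private vector with no further locking.
void AdoptNewResources()
{
    EnterCriticalSection(&g_sharedLock);
    g_privateResources.insert(g_privateResources.end(),
                              g_sharedResources.begin(),
                              g_sharedResources.end());
    g_sharedResources.clear();
    LeaveCriticalSection(&g_sharedLock);
}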
Profiling Multithreaded Apps
• Need thread-aware profilers
• Profiling may hide many synchronization stalls
• Home-grown spin locks make profiling harder
• Consider instrumenting calls to synchronization functions (see the sketch below)
  • Don't use locks in the instrumentation
• Windows: Intel VTune, AMD CodeAnalyst, and the Visual Studio Team System Profiler
• Xbox 360: PIX, XbPerfView, etc.
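One simple way to do that instrumentation without introducing locks of its own (a sketch added here, not from the original slides; the names are illustrative, and TryEnterCriticalSection requires _WIN32_WINNT >= 0x0400): wrap the critical section calls and count acquisitions and contentions with interlocked increments.

#include <windows.h>

struct InstrumentedLock
{
    CRITICAL_SECTION cs;            // initialized elsewhere
    volatile LONG    acquisitions;  // total times the lock was taken
    volatile LONG    contentions;   // times a thread had to wait for it
};

void InstrumentedEnter(InstrumentedLock* lock)
{
    InterlockedIncrement(&lock->acquisitions);
    // TryEnter tells us whether we would have blocked.
    if (!TryEnterCriticalSection(&lock->cs))
    {
        InterlockedIncrement(&lock->contentions);
        EnterCriticalSection(&lock->cs);   // now block for real
    }
}

void InstrumentedLeave(InstrumentedLock* lock)
{
    LeaveCriticalSection(&lock->cs);
}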
Naming Threads

typedef struct tagTHREADNAME_INFO
{
    DWORD  dwType;     // must be 0x1000
    LPCSTR szName;     // pointer to name (in user addr space)
    DWORD  dwThreadID; // thread ID (-1 = caller thread)
    DWORD  dwFlags;    // reserved for future use, must be zero
} THREADNAME_INFO;

void SetThreadName(DWORD dwThreadID, LPCSTR szThreadName)
{
    THREADNAME_INFO info;
    info.dwType     = 0x1000;
    info.szName     = szThreadName;
    info.dwThreadID = dwThreadID;
    info.dwFlags    = 0;

    __try
    {
        RaiseException(0x406D1388, 0,
                       sizeof(info) / sizeof(DWORD), (DWORD*)&info);
    }
    __except (EXCEPTION_CONTINUE_EXECUTION)
    {
    }
}

// Usage:
SetThreadName(-1, "Main thread");
Windows Tips
• Avoid using wglMakeCurrent or this.Invoke()
  • Best to do all rendering calls from a single thread
• Test on multiple machines and configurations
  • Single-core, SMT (i.e. Hyper-Threading), dual-core, Intel and AMD chips, multi-socket multi-core (4+ cores)