
Seismic Duck: “Reflection Seismology Was Never So Much Fun!”








  1. Seismic Duck: “Reflection Seismology Was Never So Much Fun!” Presented by: Kobie Shmuel, Nadav Elyahu

  2. Outline
  • Preface
  • Threading Building Blocks (Intel® TBB)
  • Background
  • Overall parallelization strategy for the game
  • Threading
  • Conclusion and further optimizations

  3. Preface • Seismic Duck is a freeware game about reflection seismology: imaging underground structures by sending sound waves into the ground and interpreting the echoes. • The game was parallelized with Intel® TBB: it was rewritten by Arch D. Robison, the architect of Intel® TBB. The original game was written for Macs in the mid-1990s. • The goals of Seismic Duck are to demonstrate some physics and to be an interesting game.

  4. Preface • Despite the complexity of the realistic model behind it, the game is very intuitive and is targeted at (bright) children as well. • Seismic Duck tries to be qualitatively accurate but quantitatively interesting. Time and space scales are therefore distorted: the sound waves are too slow, and the horizontal scale covers a very large lateral distance in proportion to the vertical scale.

  5. Threading Building Blocks (Intel® TBB) A C++ template library developed by Intel for writing software that takes advantage of multi-core processors. The library consists of data structures and algorithms that let a programmer avoid some of the complications arising from native threading packages. Instead, the library abstracts access to the processors: operations are treated as "tasks", which are allocated to individual cores dynamically by the library's run-time engine, which also automates efficient use of the CPU cache. Arch D. Robison was the architect of TBB and the lead developer of KAI C++; he developed the game Seismic Duck as a hobby.

  6. Background Seismic Duck runs three independent core computations: • Seismic wave propagation and rendering. • Gas/oil/water flow through a reservoir and rendering. • Seismogram rendering. All three run in parallel using parallel_invoke (a template function that evaluates several functions in parallel); a minimal sketch follows below. At a high level, this follows the 'Three Layer Cake' pattern (a method of parallel programming for shared-memory hardware that aims to maximize performance, readability, and flexibility). Still, that level of parallelism was not enough for a good animation speed: the wave-propagation simulation was the bottleneck.
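  To make the structure concrete, here is a minimal sketch of launching the three computations with tbb::parallel_invoke; the three routine names are placeholders, not Seismic Duck's actual functions:

      #include <tbb/parallel_invoke.h>

      // Hypothetical per-frame routines standing in for the game's three
      // core computations (the names are assumptions).
      void update_and_render_waves();
      void update_and_render_reservoir();
      void render_seismogram();

      void advance_one_frame() {
          // parallel_invoke runs the callables in parallel and returns
          // only after all three have completed.
          tbb::parallel_invoke(update_and_render_waves,
                               update_and_render_reservoir,
                               render_seismogram);
      }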

  7. Background – Numerical simulation of waves Waves are simulated using the FDTD (finite-difference time-domain) method. Five 2D arrays represent scalar fields over a 2D grid. Three of the arrays represent variables that step through time; the other two represent rock properties.
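  For reference, a plausible data layout for the five grids (a sketch; the names match the update code on the following slides, while the sizes and padding are assumptions):

      constexpr int M = 512, N = 512;   // grid dimensions (assumed)

      static float U [M+2][N+2];   // pressure-like field, stepped through time
      static float Vx[M+2][N+2];   // x-velocity, stepped through time
      static float Vy[M+2][N+2];   // y-velocity, stepped through time
      static float A [M+2][N+2];   // rock-property coefficient (fixed)
      static float B [M+2][N+2];   // rock-property coefficient (fixed)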

  8. Background – Numerical simulation of waves A “leap frog” algorithm is used. The advantage of staggering and leap-frogging is that it delivers results accurate to 2nd order for the cost of a 1st-order approach. The update operations are beautifully simple:

      forall i, j {
          Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
          Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
      }
      forall i, j {
          U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
      }
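  Spelled out as plain C++ over the grids sketched above, the two sweeps look as follows (an illustration with assumed loop bounds, not the game's code):

      // Sweep 1: update the velocity fields from the current U field.
      for (int i = 1; i < M; ++i)
          for (int j = 1; j < N; ++j) {
              Vx[i][j] += (A[i][j+1] + A[i][j]) * (U[i][j+1] - U[i][j]);
              Vy[i][j] += (A[i+1][j] + A[i][j]) * (U[i+1][j] - U[i][j]);
          }
      // Sweep 2: update U from the freshly updated velocity fields.
      for (int i = 1; i < M; ++i)
          for (int j = 1; j < N; ++j)
              U[i][j] += B[i][j] * ((Vx[i][j] - Vx[i][j-1]) + (Vy[i][j] - Vy[i-1][j]));

  Because the second sweep starts only after the first has finished, every grid point within a sweep is independent; this is what makes the two-sweep version easy to parallelize, at the cost of touching memory twice per time step.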

  9. Overall parallelization strategy The TBB demo "seismic" and Seismic Duck use two very different approaches to parallelizing these updates. The TBB demo is meant to show how to use a TBB feature (parallel_for) and a basic parallel pattern, and to keep things as simple as possible (a sketch follows below). Seismic Duck is written for high performance, at the expense of increased complexity: it uses several parallel patterns and has more ambitious numerical modeling. The pattern behind the TBB demo is "Odd-Even Communication Group" in time and a geometric decomposition pattern in space. The big drawback of the Odd-Even pattern is memory bandwidth.
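  A minimal sketch of the demo-style approach for the first sweep, assuming the grids declared earlier; each sweep becomes its own parallel_for, so row ranges can be split freely across cores:

      #include <tbb/parallel_for.h>
      #include <tbb/blocked_range.h>

      void velocity_sweep() {
          // Within this sweep, each (i,j) writes only Vx/Vy at (i,j) and
          // reads only A and U, so row ranges are independent.
          tbb::parallel_for(tbb::blocked_range<int>(1, M),
              [](const tbb::blocked_range<int>& r) {
                  for (int i = r.begin(); i != r.end(); ++i)
                      for (int j = 1; j < N; ++j) {
                          Vx[i][j] += (A[i][j+1] + A[i][j]) * (U[i][j+1] - U[i][j]);
                          Vy[i][j] += (A[i+1][j] + A[i][j]) * (U[i+1][j] - U[i][j]);
                      }
              });
      }

  The U sweep is handled the same way. Alternating the two parallel sweeps is what keeps the scheme simple, and also what forces two passes over memory per time step.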

  10. Overall parallelization strategy Using the Odd-Even pattern, each grid point is loaded once from main memory per time step.
  First sweep (update Vx and Vy):
  • 6 memory references (read A, U, Vx, Vy; write Vx, Vy)
  • 6 floating-point adds
  • 2 floating-point multiplications
  Second sweep (update U):
  • 5 memory references (read B, Vx, Vy, U; write U)
  • 4 floating-point adds
  • 1 floating-point multiplication

  11. Overall parallelization strategy The key consideration is the (6+4) floating-point additions per (6+5) memory references: C = (6+4)/(6+5) = 10/11 ≈ 0.91, a serious bottleneck. Typical hardware of the time (a Core 2 Duo) can sustain roughly C = 12 floating-point operations per memory reference, but the Odd-Even code achieves only C ≈ 0.91. Thus the Odd-Even version delivers a small fraction of the machine's theoretical peak floating-point performance.

  12. Overall parallelization strategy Improving C with a single fused sweep (update Vx, Vy, and U together):
  • 8 memory references (read A, B, U, Vx, Vy; write Vx, Vy, U)
  • 10 floating-point adds
  • 3 floating-point multiplications
  Now C = 10/8 = 1.25, about a 37% improvement over the C ≈ 0.91 of the original code.

  13. Overall parallelization strategy However, the improvement in C complicates parallelism:

      for( int i=1; i<m; ++i )
          for( int j=1; j<n; ++j ) {
              Vx[i][j] += (A[i][j+1]+A[i][j])*(U[i][j+1]-U[i][j]);
              Vy[i][j] += (A[i+1][j]+A[i][j])*(U[i+1][j]-U[i][j]);
              U[i][j] += B[i][j]*((Vx[i][j]-Vx[i][j-1]) + (Vy[i][j]-Vy[i-1][j]));
          }

  The fused code has a sequential loop nest, because it must preserve the following constraints:
  • The update of Vx[i][j] must use the value of U[i][j+1] from the previous sweep.
  • The update of Vy[i][j] must use the value of U[i+1][j] from the previous sweep.
  • The update of U[i][j] must use the values of Vx[i][j-1] and Vy[i-1][j] from the current sweep.

  14. Threading Treating the grids as having map coordinates, a grid point must be updated after the points north and west of it, but before the points south and east of it. One way to parallelize the loop nest is the Wavefront pattern: execution sweeps diagonally from the northwest corner to the southeast corner. But that pattern has relatively high synchronization costs, and in this context it also has poor cache locality, because it tends to schedule adjacent grid points on different processors. The alternative: geometric decomposition and the Ghost Cell pattern.

  15. Threading – Wavefront Pattern Problem: Data elements are laid out as multidimensional grids representing a logical plane or space. The dependencies between the elements result in computations that resemble a diagonal sweep, which creates a nontrivial problem of distributing work between the parallel processing units.
  Driving forces:
  • The workload must stay balanced as the diagonal wavefront sweeps across the elements.
  • Processing units must minimize idle time while others are executing.
  • The overall system must perform efficiently.

  16. Threading – Wavefront Pattern Solution: Data distribution is the critical factor; processor idle time must be minimized, and the computational load at a given instant differs throughout the sweep.
  • Simple partitions across rows or columns are discouraged.
  • The most widely used data distribution scheme in practice is block-cyclic distribution, sketched below.
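  For concreteness, a tiny sketch of a block-cyclic mapping (the names and parameters are illustrative):

      // Deal fixed-size blocks of columns out to workers round-robin, so each
      // worker owns work spread across the whole sweep rather than one region.
      int owner_of_column(int col, int block_size, int num_workers) {
          int block = col / block_size;   // which block the column falls in
          return block % num_workers;     // blocks are assigned cyclically
      }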

  17. Threading – Wavefront Pattern Threading strategies:
  • Fixed block-based strategy
  • Cyclical fixed block-based strategy
  • Variable cyclical block-based strategy

  18. Threading – Wavefront Pattern Example: a dynamic programming matrix (DPM) generated by a biological sequence alignment algorithm.
  • The reported speedups are not necessarily the best achievable, since the blocking factor greatly affects the results.
  • They are meant to show that, for a little effort, it is possible to quickly build a parallel program with reasonable speedups.

  19. Threading – Wavefront Pattern Delegation of work to threads: in the Wavefront pattern, the delegation of work to threads is handled by a work queue, combined with:
  • Diagonal synchronization
  • Prerequisite synchronization
  One possible implementation is sketched below.
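  One common realization in TBB is a dependency-counting work queue: each block tracks how many of its north/west prerequisites are still pending and is enqueued the moment the count reaches zero. A minimal sketch, with process_block, the block counts, and the grid-of-blocks dimensions all assumed:

      #include <atomic>
      #include <tbb/task_group.h>

      constexpr int MB = 8, NB = 8;        // blocks per dimension (assumed)
      std::atomic<int> pending[MB][NB];    // unmet prerequisites per block
      tbb::task_group tg;

      void process_block(int bi, int bj);  // user-supplied block update (assumed)

      void spawn_when_ready(int bi, int bj) {
          tg.run([=] {
              process_block(bi, bj);
              // Finishing this block releases its south and east neighbors;
              // whichever reaches zero pending prerequisites becomes runnable.
              if (bi + 1 < MB && --pending[bi + 1][bj] == 0) spawn_when_ready(bi + 1, bj);
              if (bj + 1 < NB && --pending[bi][bj + 1] == 0) spawn_when_ready(bi, bj + 1);
          });
      }

      void run_wavefront() {
          for (int i = 0; i < MB; ++i)
              for (int j = 0; j < NB; ++j)
                  pending[i][j] = (i > 0) + (j > 0);  // north + west prerequisites
          spawn_when_ready(0, 0);                     // only the NW corner starts ready
          tg.wait();
      }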

  20. Threading – Ghost Cell Pattern Problem: Computations divided geometrically into chunks on different processors are not embarrassingly parallel; specifically, the points at the borders of a chunk require the values of points from the neighboring chunks. How can values be transferred between processors in an efficient, structured manner? The set of points that influences the calculation of a point is often called a stencil. Retrieving remote points introduces communication operations, which lead to high latency.

  21. Threading – Ghost Cell Pattern Solution: Allocate additional space for a layer of ghost cells around the edges of each chunk, and on every iteration have each pair of neighbors exchange their borders. Variations:
  • Wide halo (a halo several cells deep, exchanged less often)
  • Corner cells
  • Multi-dimensional border exchange
  NOTE: unnecessary synchronization may slow the implementation down; for example, MPI_Send (and MPI_Sendrecv) may block until the receiver starts to receive the message.

  22. Threading In Seismic Duck, the Wide Halo method is used, and C has been increased from 0.91 to 1.25. Each chunk can be updated independently by a different thread, except around its border. To see the exception, consider the interaction of chunk 0 and chunk 1. Let i0 be the index of the last row of chunk 0 and let i1 be the index of the first row of chunk 1:
  • The update of Vy[i0][j] must use the value of U[i1][j] from the previous sweep.
  • The update of U[i1][j] must use the value of Vy[i0][j] from the current sweep.
  The Ghost Cell pattern enables the chunks to be updated in parallel: each chunk becomes a separate grid with an extra row of grid points added above and below it, as in the sketch below.
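  A shared-memory sketch of the exchange step (the chunk layout and names below are assumptions, not the game's actual code): before each update, a chunk's border rows are mirrored into its neighbors' ghost rows, after which each thread can sweep its own chunk without touching the others.

      #include <cstring>

      constexpr int N = 1024;   // grid width (assumed)
      constexpr int ROWS = 64;  // interior rows per chunk (assumed)

      // Row 0 and row ROWS+1 are ghost rows mirroring the neighbors' borders.
      struct Chunk {
          float U [ROWS + 2][N];
          float Vx[ROWS + 2][N];
          float Vy[ROWS + 2][N];
      };

      void exchange_ghost_rows(Chunk& upper, Chunk& lower) {
          // lower needs upper's last interior rows of U and Vy (its "north");
          // upper needs lower's first interior row of U (its "south").
          std::memcpy(lower.U [0],        upper.U [ROWS], sizeof upper.U [ROWS]);
          std::memcpy(lower.Vy[0],        upper.Vy[ROWS], sizeof upper.Vy[ROWS]);
          std::memcpy(upper.U [ROWS + 1], lower.U [1],    sizeof lower.U [1]);
      }

  With a wide halo, several ghost rows are copied at once and the exchange runs only every few time steps, trading a little redundant computation for fewer synchronizations.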

  23. Conclusion and further optimizations Several parallelization techniques have been discussed. Further optimizations were made through vectorization and cache optimizations. The game is fun; you should try it out (it is a free download).

  24. THE END
