
Simulating Collective Effects on GPUs



Presentation Transcript


  1. Simulating Collective Effects on GPUs
  Final Presentation @ CERN, 19.02.2016
  Master's Thesis CSE, ETHZ: Stefan Hegglin
  Supervision: Prof. Dr. P. Arbenz (ETHZ), Dr. K. Li (CERN BE-ABP-HSC)

  2. Overview
  • PyHEADTAIL
  • Goal of Thesis
  • Implementation / Methods
  • Results for users, developers & profiling

  3. PyHEADTAIL
  • Simulation code at CERN to study collective effects
  • Scriptable
  • Fully dynamic (Python)
  • Extensible: PyECLOUD, …
  • Easy to use
  Reference: Beam Instabilities: Numerical Model, Kevin Li, unpublished manuscript

  4. PyHEADTAIL: Model
  Drift/kick model:
  • Splits the synchrotron into segments
  • Linear tracking along ring segments
  • Kicks applied between segments: collective effects, dampers, …
  Macro-particles: typically 10^5-10^7 are tracked
  (Figure: macro-particle bunch in the x, y, z coordinates)
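  To make the drift/kick picture concrete, here is a minimal, self-contained sketch of the idea (plain NumPy, not PyHEADTAIL code; all numbers are purely illustrative):

      import numpy as np

      # one bunch of macro-particles in normalised transverse coordinates (x, x')
      n_macroparticles = 100000
      x  = np.random.normal(0., 1e-3, n_macroparticles)
      xp = np.random.normal(0., 1e-3, n_macroparticles)

      n_segments = 10
      mu = 2 * np.pi * 0.31 / n_segments        # phase advance per segment (made-up tune)
      c, s = np.cos(mu), np.sin(mu)

      for turn in range(100):
          for segment in range(n_segments):
              # linear tracking through the segment (a rotation in phase space)
              x, xp = c * x + s * xp, -s * x + c * xp
              # simplified stand-in for a collective kick applied between segments
              xp -= 1e-6 * np.mean(x)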

  5. Scope of Thesis
  • Develop a GPU interface for PyHEADTAIL
  • Speedup
  • Simplicity for the user
  • Simplicity for other developers
  • Make the CPU/GPU switch easy
  • Strategy used: PyCUDA as the interface between Python and CUDA

  6. Methods / Implementation: Strategies
  • Use GPUArrays as much as possible → no rewriting of code (see the sketch below)
  • Hide all GPU-specific algorithms behind a layer → transparent to the user and to other developers
  • Some adaptations to PyCUDA to unify the interface and increase its scope
  • Streams to compute statistics of independent dimensions → speedup
  • Optimise functions only if profiling deems it necessary
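  As an illustration of the GPUArray strategy (a PyCUDA sketch, not PyHEADTAIL source): because GPUArray mimics the NumPy ndarray interface, the same function body can run on either backend without rewriting.

      import numpy as np
      import pycuda.autoinit                 # creates a CUDA context
      import pycuda.gpuarray as gpuarray

      def kick(x, xp, strength):
          # identical source for both backends: GPUArray supports ndarray-style arithmetic
          xp -= strength * x

      x_cpu  = np.random.rand(1000000)
      xp_cpu = np.random.rand(1000000)
      kick(x_cpu, xp_cpu, 0.1)               # NumPy arrays: runs on the CPU

      x_gpu, xp_gpu = gpuarray.to_gpu(x_cpu), gpuarray.to_gpu(xp_cpu)
      kick(x_gpu, xp_gpu, 0.1)               # GPUArrays: the same line launches GPU kernels
      xp_new = xp_gpu.get()                  # explicit transfer back to the host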

  7. Results: Software Design
  • Set up the simulation
  • Set the context with the context manager
  • Track the bunch: the correct implementation is chosen automatically
  • Move the data back to the CPU
  The user does not need to care about the system internals.

  8. Context & Contextmanager: Details
  Context: general/pmath.py, a module which contains:
  • One dictionary per context referencing the function implementations (GPU/CPU)
  • A function to update the currently active dictionary: it spills its contents into the module-global namespace, so functions are callable via pmath.functionname()
  Contextmanager: gpu/contextmanager.py, a class which can be used in a with-statement:
      with GPU(bunch):
          track()
  • It switches the implementations by updating the active dictionary (update_active_dict() upon entering the with-statement)
  • It moves the bunch data to and from the GPU in its __enter__ and __exit__ methods
  (A condensed sketch of this pattern follows below.)
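  A condensed, self-contained sketch of this pattern (simplified relative to the real pmath/contextmanager modules; the GPU entries assume PyCUDA and cover only two functions):

      import numpy as np
      import pycuda.autoinit
      import pycuda.gpuarray as gpuarray
      import pycuda.cumath as cumath

      # pmath-like part: one dictionary of implementations per context
      _CPU_func_dict = {'sin': np.sin,     'device': lambda: 'CPU'}
      _GPU_func_dict = {'sin': cumath.sin, 'device': lambda: 'GPU'}

      def update_active_dict(func_dict):
          # spill the chosen implementations into the module-global namespace,
          # so they become callable as pmath.sin(), pmath.device(), ...
          globals().update(func_dict)

      update_active_dict(_CPU_func_dict)     # CPU is the default context

      # contextmanager-like part: a class usable in a with-statement
      class GPU(object):
          def __init__(self, bunch):
              self.bunch = bunch
          def __enter__(self):
              update_active_dict(_GPU_func_dict)
              self.bunch.x = gpuarray.to_gpu(self.bunch.x)   # move bunch data to the device
              return self
          def __exit__(self, exc_type, exc_value, traceback):
              self.bunch.x = self.bunch.x.get()              # ...and back to the host
              update_active_dict(_CPU_func_dict)

  With something like this in place, wrapping the tracking loop in "with GPU(bunch):" switches both the data location and the active implementations, and switches them back on exit.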

  9. Results: Qualitative
  User: add two lines of code to the script (the import and the with-statement):
      import contextmanager.GPU as GPU
      with GPU(bunch):
          for n in xrange(nturns):
              machine.track(bunch)
  ...and set use_cython=False in the new BasicSynchrotron class.
  Developer: write code (almost) as before:
  • Dispatch all mathematical function calls via the context: sin(x) --> pm.sin(x)
  • Statements like these work out of the box thanks to GPUArrays:
      bunch.z -= a * bunch.mean_x()

  10. Developer: Available functions
  • Mathematical: sin, cos, exp, arcsin, min, max, floor
  • Statistical: mean, std, emittance
  • Array: diff, cumsum, seq, arange, argsort, apply_permutation, take, convolve (runs on the CPU even in the GPU context)
  • Slicing: mean_per_slice, std_per_slice, particles_within_cuts, macroparticles_per_slice, searchsortedleft, searchsortedright
  • Creation: zeros, ones
  • Marker: device
  • Monitor: init_bunch_buffer, init_slice_buffer
  Example:
      def my_kick(bunch):
          print 'Running on ', pm.device()
          bunch.x -= pm.sin(bunch.z)
          bunch.xp *= bunch.mean_xp()
          a = pm.zeros(100, dtype=np.float64)
          bunch.z = pm.take(bunch.dp, indices)
          …
          bunch.z -= 3 * a

  11. Developer: How to add new functionality
  Example: a kick needs an FFT which is not yet dispatched. What do I do?
      def my_kick(bunch):
          print 'Running on ', pm.device()
          p = np.fft.fft(bunch.x)
          p *= factor
          bunch.x = np.fft.ifft(p)
  What to do:
  • Create an entry in the pmath function dictionaries for both CPU and GPU, with the same interface
  • Implement both versions and call them from the dictionary (not necessary for one-liners such as np.cos: store them in the dict directly)
  • Add tests which compare the two versions in test_dispatch.py (see the examples there)
  • Check PyCUDA and scikit-cuda before implementing your own kernels!
  (A sketch of such an entry is shown below.)
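  A hedged sketch of the first two steps for the FFT example (the dictionary names follow the sketch after slide 8; the GPU path assumes scikit-cuda and, for brevity, a complex128 input):

      import numpy as np
      import pycuda.gpuarray as gpuarray
      from skcuda import fft as cu_fft

      # CPU version: a one-liner can be stored in the dictionary directly
      _CPU_func_dict['fft'] = np.fft.fft

      # GPU version: wrap cuFFT (via scikit-cuda) behind the same interface
      def _fft_gpu(x_gpu):
          # GPUArray in, GPUArray out; assumes a complex128 input (a Z2Z transform)
          xf_gpu = gpuarray.empty(x_gpu.shape, np.complex128)
          plan = cu_fft.Plan(x_gpu.shape, np.complex128, np.complex128)
          cu_fft.fft(x_gpu, xf_gpu, plan)
          return xf_gpu

      _GPU_func_dict['fft'] = _fft_gpu

  After registering and testing both entries, my_kick can simply call pm.fft(bunch.x) in either context.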

  12. User: Available trackers & kicks
  Available and tested in branch feature/PyPIC_integration → Adrian
  • Tracker: transverse map with/without detuning and dispersion, RF systems, linear longitudinal, drift, …
  • Kick: wake kick (all types; the convolution is currently performed on the CPU!), damper, rfq
  • Slicing: uniform bin slicing
  • Monitor: bunchmonitor, slicemonitor

  13. Results: Profiling
  Transverse map:
  • Embarrassingly parallel
  • Speedup of up to 27x, saturating above 10^6 macro-particles
  • Why not faster? → PyCUDA/GPUArray overhead (discussed on slide 14)
  Wake field (500 slices):
  • Speedup of up to 6x
  • Convolution on the CPU: < 10% of the runtime
  • No speedup below 10^5 macro-particles
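  The slides do not show how these timings were taken; as a generic, hedged illustration, a GPU section can be bracketed with PyCUDA CUDA events along these lines:

      import numpy as np
      import pycuda.autoinit
      import pycuda.driver as drv
      import pycuda.gpuarray as gpuarray

      x = gpuarray.to_gpu(np.random.rand(1000000))

      start, stop = drv.Event(), drv.Event()
      start.record()
      for _ in range(100):
          x = 2.0 * x + 1.0          # the GPU section being timed
      stop.record()
      stop.synchronize()             # wait until the GPU has finished
      print('elapsed: %.3f ms' % start.time_till(stop))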

  14. Discussion: GPUArray overhead
  A statement like x = a*x + b*y invokes 3 kernel calls via PyCUDA:
      tmp  ← b*y
      tmp2 ← a*x
      x    ← tmp + tmp2
  This leads to:
  • Low arithmetic intensity
  • Lots of kernel-call overhead, especially for small problem sizes
  • Memory allocation → mitigated by using a memory pool
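  For completeness, one standard PyCUDA technique to avoid the temporaries and the extra launches (a generic sketch; the slides do not state that PyHEADTAIL uses it here) is to fuse the expression into a single elementwise kernel:

      import numpy as np
      import pycuda.autoinit
      import pycuda.gpuarray as gpuarray
      from pycuda.elementwise import ElementwiseKernel

      # x = a*x + b*y in one kernel launch, with no temporaries
      axpby = ElementwiseKernel(
          "double a, double *x, double b, double *y",
          "x[i] = a * x[i] + b * y[i]",
          "axpby")

      x = gpuarray.to_gpu(np.random.rand(1000000))
      y = gpuarray.to_gpu(np.random.rand(1000000))
      axpby(np.float64(2.0), x, np.float64(3.0), y)   # one kernel call instead of three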

  15. Results: Benchmark Study
  Typical application: LHC at injection, instability with wake field & damper
  • CPU time: ~ 1 day
  • GPU time: 5x less
  • Two lines of code added to the script
  • Benchmarked against the CPU version: the results agree


  17. Results: Quantitative II
  Typical application runtime: 5x faster on the GPU

  18. Conclusion
  PyHEADTAIL was successfully ported to GPUs and benchmarked against the previous implementation.
  + easy to use
  + extensible
  + maintainable (Python)
  - not fully exploiting the GPU
  - dependent on PyCUDA
  (Image credit: https://commons.wikimedia.org)

  19. Backup: GPU Metrics
