
Simulating Collective Effects on GPUs



Presentation Transcript


  1. Simulating Collective Effects on GPUs
  Final Presentation @ CERN, 19.02.2016
  Master's Thesis CSE, ETHZ: Stefan Hegglin
  Supervision: Prof. Dr. P. Arbenz (ETHZ), Dr. K. Li (CERN BE-ABP-HSC)

  2. Overview
  • PyHEADTAIL
  • Goal of Thesis
  • Implementation / Methods
  • Results for users, developers & profiling

  3. PyHEADTAIL
  • Simulation code at CERN to study collective effects
  • Scriptable
  • Fully dynamic (Python)
  • Extensible: PyECLOUD, …
  • Easy to use
  Reference: Beam Instabilities: Numerical Model, Kevin Li, unpublished manuscript

  4. PyHEADTAIL: Model
  Drift/kick model:
  • Splits the synchrotron into segments
  • Linear tracking along ring segments
  • Kicks applied between segments: collective effects, dampers, …
  Macro-particles: typically 10^5-10^7 are tracked
  (Figure: macro-particle bunch in the x, y, z coordinates)
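  To make the drift/kick picture concrete, here is a minimal, self-contained sketch of the idea (plain NumPy, not PyHEADTAIL code; all numbers are purely illustrative):

      import numpy as np

      # one bunch of macro-particles in normalised transverse coordinates (x, x')
      n_macroparticles = 100000
      x  = np.random.normal(0., 1e-3, n_macroparticles)
      xp = np.random.normal(0., 1e-3, n_macroparticles)

      n_segments = 10
      mu = 2 * np.pi * 0.31 / n_segments        # phase advance per segment (made-up tune)
      c, s = np.cos(mu), np.sin(mu)

      for turn in range(100):
          for segment in range(n_segments):
              # linear tracking through the segment (a rotation in phase space)
              x, xp = c * x + s * xp, -s * x + c * xp
              # simplified stand-in for a collective kick applied between segments
              xp -= 1e-6 * np.mean(x)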

  5. Scope of Thesis
  • Develop a GPU interface for PyHEADTAIL
  • Speedup
  • Simplicity for the user
  • Simplicity for other developers
  • Make the CPU/GPU switch easy
  • Strategy used: PyCUDA as the interface between Python and CUDA

  6. Methods / Implementation: Strategies
  • Use GPUArrays as much as possible → no rewriting of code (see the sketch below)
  • Hide all GPU-specific algorithms behind a layer → transparent to the user and to other developers
  • Some adaptations to PyCUDA to unify the interface and increase its scope
  • Streams to compute statistics of independent dimensions → speedup
  • Optimise functions only if profiling deems it necessary
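  As an illustration of the GPUArray strategy (a PyCUDA sketch, not PyHEADTAIL source): because GPUArray mimics the NumPy ndarray interface, the same function body can run on either backend without rewriting.

      import numpy as np
      import pycuda.autoinit                 # creates a CUDA context
      import pycuda.gpuarray as gpuarray

      def kick(x, xp, strength):
          # identical source for both backends: GPUArray supports ndarray-style arithmetic
          xp -= strength * x

      x_cpu  = np.random.rand(1000000)
      xp_cpu = np.random.rand(1000000)
      kick(x_cpu, xp_cpu, 0.1)               # NumPy arrays: runs on the CPU

      x_gpu, xp_gpu = gpuarray.to_gpu(x_cpu), gpuarray.to_gpu(xp_cpu)
      kick(x_gpu, xp_gpu, 0.1)               # GPUArrays: the same line launches GPU kernels
      xp_new = xp_gpu.get()                  # explicit transfer back to the host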

  7. Results: Software Design
  • Set up the simulation
  • Set the context with the context manager
  • Track the bunch: the correct implementation is chosen automatically
  • Move the data back to the CPU
  The user does not need to care about the system internals.

  8. Context & Contextmanager: Details
  Context: general/pmath.py, a module which contains:
  • One dictionary per context referencing the function implementations (GPU/CPU)
  • A function to update the currently active dictionary: it spills its contents into the module-global namespace, so functions are callable via pmath.functionname()
  Contextmanager: gpu/contextmanager.py, a class which can be used in a with-statement:
      with GPU(bunch):
          track()
  • It switches the implementations by updating the active dictionary (update_active_dict() upon entering the with-statement)
  • It moves the bunch data to and from the GPU in its __enter__ and __exit__ methods
  (A condensed sketch of this pattern follows below.)
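  A condensed, self-contained sketch of this pattern (simplified relative to the real pmath/contextmanager modules; the GPU entries assume PyCUDA and cover only two functions):

      import numpy as np
      import pycuda.autoinit
      import pycuda.gpuarray as gpuarray
      import pycuda.cumath as cumath

      # pmath-like part: one dictionary of implementations per context
      _CPU_func_dict = {'sin': np.sin,     'device': lambda: 'CPU'}
      _GPU_func_dict = {'sin': cumath.sin, 'device': lambda: 'GPU'}

      def update_active_dict(func_dict):
          # spill the chosen implementations into the module-global namespace,
          # so they become callable as pmath.sin(), pmath.device(), ...
          globals().update(func_dict)

      update_active_dict(_CPU_func_dict)     # CPU is the default context

      # contextmanager-like part: a class usable in a with-statement
      class GPU(object):
          def __init__(self, bunch):
              self.bunch = bunch
          def __enter__(self):
              update_active_dict(_GPU_func_dict)
              self.bunch.x = gpuarray.to_gpu(self.bunch.x)   # move bunch data to the device
              return self
          def __exit__(self, exc_type, exc_value, traceback):
              self.bunch.x = self.bunch.x.get()              # ...and back to the host
              update_active_dict(_CPU_func_dict)

  With something like this in place, wrapping the tracking loop in "with GPU(bunch):" switches both the data location and the active implementations, and switches them back on exit.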

  9. Results: Qualitative
  User: add two lines of code to the script (the import and the with-statement):
      import contextmanager.GPU as GPU
      with GPU(bunch):
          for n in xrange(nturns):
              machine.track(bunch)
  ...and set use_cython=False in the new BasicSynchrotron class.
  Developer: write code (almost) as before:
  • Dispatch all mathematical function calls via the context: sin(x) --> pm.sin(x)
  • Statements like these work out of the box thanks to GPUArrays:
      bunch.z -= a * bunch.mean_x()

  10. Developer: Available functions
  • Mathematical: sin, cos, exp, arcsin, min, max, floor
  • Statistical: mean, std, emittance
  • Array: diff, cumsum, seq, arange, argsort, apply_permutation, take, convolve (runs on the CPU even in the GPU context)
  • Slicing: mean_per_slice, std_per_slice, particles_within_cuts, macroparticles_per_slice, searchsortedleft, searchsortedright
  • Creation: zeros, ones
  • Marker: device
  • Monitor: init_bunch_buffer, init_slice_buffer
  Example:
      def my_kick(bunch):
          print 'Running on ', pm.device()
          bunch.x -= pm.sin(bunch.z)
          bunch.xp *= bunch.mean_xp()
          a = pm.zeros(100, dtype=np.float64)
          bunch.z = pm.take(bunch.dp, indices)
          …
          bunch.z -= 3 * a

  11. Developer: How to add new functionality
  Example: a kick needs an FFT which is not yet dispatched. What do I do?
      def my_kick(bunch):
          print 'Running on ', pm.device()
          p = np.fft.fft(bunch.x)
          p *= factor
          bunch.x = np.fft.ifft(p)
  What to do:
  • Create an entry in the pmath function dictionaries for both CPU and GPU, with the same interface
  • Implement both versions and call them from the dictionary (not necessary for one-liners such as np.cos: store them in the dict directly)
  • Add tests which compare the two versions in test_dispatch.py (see the examples there)
  • Check PyCUDA and scikit-cuda before implementing your own kernels!
  (A sketch of such an entry is shown below.)
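  A hedged sketch of the first two steps for the FFT example (the dictionary names follow the sketch after slide 8; the GPU path assumes scikit-cuda and, for brevity, a complex128 input):

      import numpy as np
      import pycuda.gpuarray as gpuarray
      from skcuda import fft as cu_fft

      # CPU version: a one-liner can be stored in the dictionary directly
      _CPU_func_dict['fft'] = np.fft.fft

      # GPU version: wrap cuFFT (via scikit-cuda) behind the same interface
      def _fft_gpu(x_gpu):
          # GPUArray in, GPUArray out; assumes a complex128 input (a Z2Z transform)
          xf_gpu = gpuarray.empty(x_gpu.shape, np.complex128)
          plan = cu_fft.Plan(x_gpu.shape, np.complex128, np.complex128)
          cu_fft.fft(x_gpu, xf_gpu, plan)
          return xf_gpu

      _GPU_func_dict['fft'] = _fft_gpu

  After registering and testing both entries, my_kick can simply call pm.fft(bunch.x) in either context.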

  12. User: Available trackers & kicks
  Available and tested in branch feature/PyPIC_integration → Adrian
  • Tracker: transverse map with/without detuning and dispersion, RF systems, linear longitudinal, drift, …
  • Kick: wake kick (all types; the convolution is currently performed on the CPU!), damper, rfq
  • Slicing: uniform bin slicing
  • Monitor: bunchmonitor, slicemonitor

  13. Results: Profiling
  Transverse map:
  • Embarrassingly parallel
  • Speedup of up to 27x, saturating above 10^6 macro-particles
  • Why not faster? → PyCUDA/GPUArray overhead (discussed on slide 14)
  Wake field (500 slices):
  • Speedup of up to 6x
  • Convolution on the CPU: < 10% of the runtime
  • No speedup below 10^5 macro-particles
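  The slides do not show how these timings were taken; as a generic, hedged illustration, a GPU section can be bracketed with PyCUDA CUDA events along these lines:

      import numpy as np
      import pycuda.autoinit
      import pycuda.driver as drv
      import pycuda.gpuarray as gpuarray

      x = gpuarray.to_gpu(np.random.rand(1000000))

      start, stop = drv.Event(), drv.Event()
      start.record()
      for _ in range(100):
          x = 2.0 * x + 1.0          # the GPU section being timed
      stop.record()
      stop.synchronize()             # wait until the GPU has finished
      print('elapsed: %.3f ms' % start.time_till(stop))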

  14. Discussion: GPUArray overhead
  A statement like x = a*x + b*y invokes 3 kernel calls via PyCUDA:
      tmp  ← b*y
      tmp2 ← a*x
      x    ← tmp + tmp2
  This leads to:
  • Low arithmetic intensity
  • Lots of kernel-call overhead, especially for small problem sizes
  • Memory allocation → mitigated by using a memory pool
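  For completeness, one standard PyCUDA technique to avoid the temporaries and the extra launches (a generic sketch; the slides do not state that PyHEADTAIL uses it here) is to fuse the expression into a single elementwise kernel:

      import numpy as np
      import pycuda.autoinit
      import pycuda.gpuarray as gpuarray
      from pycuda.elementwise import ElementwiseKernel

      # x = a*x + b*y in one kernel launch, with no temporaries
      axpby = ElementwiseKernel(
          "double a, double *x, double b, double *y",
          "x[i] = a * x[i] + b * y[i]",
          "axpby")

      x = gpuarray.to_gpu(np.random.rand(1000000))
      y = gpuarray.to_gpu(np.random.rand(1000000))
      axpby(np.float64(2.0), x, np.float64(3.0), y)   # one kernel call instead of three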

  15. Results: Benchmark Study
  Typical application: LHC at injection, instability with wake field & damper
  • CPU time: ~ 1 day
  • GPU time: 5x less
  • Two lines of code added to the script
  • Benchmarked against the CPU version: the results agree


  17. Results: Quantitative II
  Typical application runtime: 5x faster on the GPU

  18. Conclusion
  PyHEADTAIL was successfully ported to GPUs and benchmarked against the previous implementation.
  + easy to use
  + extensible
  + maintainable (Python)
  - not fully exploiting the GPU
  - dependent on PyCUDA
  (Image credit: https://commons.wikimedia.org)

  19. Backup: GPU Metrics
