
“GPUs in NA62 trigger system”


Presentation Transcript


  1. “GPUs in NA62 trigger system” Gianluca Lamanna (CERN) NA62 Collaboration meeting in Brussels 9.9.2010

  2. Outline Short reminder Use of GPUs in NA62 trigger system Study of algorithms performances: resolution and timing Towards a working prototype Conclusions

  3. GPU Idea: reminder • Nowadays GPUs (the processors on standard PC video cards) are very powerful, with computing power exceeding 1 Teraflops. • GPUs are designed for digital imaging, video games and computer graphics. • In this context the main problem is to apply the same operations to a large quantity of data (for instance move an object, transform a part of an image, etc.) • The architecture of the processor is highly parallel. • In recent years there have been many efforts to use GPUs for high performance computing in various fields (GPGPU). • Is it possible to use GPUs for “hard real-time” computing? Is it possible to build a trigger system based on GPUs for high performance online selection? Is it possible to select events with high efficiency using cheap off-the-shelf components?

  4. GPU characteristics: reminder SIMD architecture. Large number of cores. The single cores are grouped in multiprocessors sharing a small quantity of on-chip memory (very fast). Huge quantity of external memory accessible from each core. Particular care is needed in programming the chip to exploit the architecture. Big performance gains are guaranteed for parallelizable problems. Two logical levels of parallelization: parallel algorithm and parallel execution. • TESLA C1060: • 240 cores • 30 multiprocessors • 4 GB RAM • 102 GB/s memory bandwidth • PCI-e gen2 connection. The GPU (“device”) always has to be connected to a CPU (“host”)!
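To make these two levels of parallelism concrete, here is a minimal CUDA sketch (not from the talk; the kernel, names and sizes are purely illustrative): a grid of blocks is scheduled on the multiprocessors, and each block owns a small buffer in the fast on-chip shared memory.

```cuda
// Minimal illustration of the CUDA programming model used throughout this talk.
#include <cuda_runtime.h>

__global__ void scale_hits(const float *in, float *out, int n, float gain)
{
    __shared__ float cache[256];                    // per-block on-chip scratch space

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        cache[threadIdx.x] = in[i] * gain;          // stage the result in shared memory
        out[i] = cache[threadIdx.x];                // write it back to global memory
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));

    // 256 threads per block, enough blocks to cover all n elements.
    scale_hits<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```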

  5. The “GPU trigger” [Diagram comparing three architectures] Standard trigger system (custom HW): FE digitization → L0 trigger primitives in a trigger processor → L1 pipeline in PCs. “Triggerless” (commercial PCs): FE digitization → PCs. “Quasi-triggerless” with GPUs (commercial GPU system): FE digitization + buffer + (trigger primitives) → L0 in PCs+GPU → L1 in PCs+GPU.

  6. Benefits The main advantage of using GPUs is to have huge computing power in a compact system → cheap, off-the-shelf, large consumer sector in continuous development, easy high-level programming (C/C++), fully reconfigurable system, minimum custom hardware, very innovative (nobody in the world is using GPUs for triggering at the moment!). The software trigger levels are the natural place for the GPUs → reduction of farm dimension. At L0 the GPUs could be used to design more flexible and efficient trigger algorithms based on high quality fast reconstructed primitives → more physics channels collected, lower bandwidth employed, higher purity for the main channels (e.g. hyperCP, goldstino). [See other physics examples in my talks in Anacapri (2.9.2009) and at the NA62 Physics Handbook Workshop (12.12.2009)]

  7. L1 “GPU trigger” The use of GPUs in the software trigger levels is straightforward. The GPUs act as “co-processors” that increase the power of the PCs, allowing faster decisions with a smaller number of CPU cores involved. [Diagram: TEL62 → L1 PC (+GPU) → L2 PC (+GPU), 1 MHz input, 100 kHz output, with L0TP and L1TP] • The RICH L1 dimension is dominated by the computing power: • the 4 Gigabit links from the 4 RICH TELL1s could, in principle, be managed by a single PC. • Assuming 10 s of total time for L1 and 1 MHz of input rate, the time budget for a single event is 1 us. Assuming 200 us for ring reconstruction (and other small tasks), we need 200 cores (25 PCs) to produce the L1 trigger primitives. Using GPUs, a reduction by a factor of 200 is not impossible (see later). The LKr is read at L1 at 100 kHz, producing about 173 Gb/s. At least 20 PCs have to be dedicated to the event building (assuming 10 GbE links after a switch). Each PC sees 5 kHz, resulting in 200 us of maximum processing time. Also in this case the GPUs could be employed to guarantee this time budget, avoiding a big increase in farm cost (in the TD we assumed 4 ms per event).
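Written out explicitly (a sketch of the arithmetic only; the 8 cores per PC is an assumption implied by “200 cores (25 PCs)”):

```latex
t_{\text{event}} = \frac{10\ \text{s}}{1\ \text{MHz}\times 10\ \text{s}} = 1\ \mu\text{s},
\qquad
N_{\text{cores}} \simeq \frac{t_{\text{ring}}}{t_{\text{event}}}
                = \frac{200\ \mu\text{s}}{1\ \mu\text{s}} = 200
\;\Rightarrow\; 200/8 = 25\ \text{PCs}.
```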

  8. L0 “GPU trigger” In the L0 GPU one event has to be processed in 100 ns and the total latency must stay within the 1 ms L0 budget! [Flow: TEL62 (10 MHz) → L0 GPU → L0TP (1 MHz output); max 1 ms latency] Data arrive → the protocol stack is possibly managed in the receiver card. Transfer into RAM → the non-deterministic behavior of the CPU should be avoided (real-time OS). Transfer of a packet of data into the GRAM of the video card → the PCI-e gen2 is fast enough; concurrent transfer during processing; max 100 us. Processing → as fast as possible! Send the results back to the PC → done!
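The “concurrent transfer during processing” step can be sketched with CUDA streams (illustrative only: the packet size, buffer layout and the dummy process_packet kernel are assumptions, not the actual trigger code). While packet k is processed in one stream, packet k+1 is already being copied to the GPU in the other.

```cuda
#include <cuda_runtime.h>

__global__ void process_packet(const unsigned *hits, float *rings, int nEvents)
{
    int ev = blockIdx.x * blockDim.x + threadIdx.x;
    if (ev < nEvents) rings[ev] = 0.f;              // placeholder for the real ring finder
}

int main()
{
    const int nEvents = 1000, wordsPerEvent = 8;    // assumed packet layout
    unsigned *h_hits, *d_hits[2];
    float *d_rings[2];
    cudaStream_t stream[2];

    // Pinned host memory is required for truly asynchronous copies.
    cudaMallocHost((void **)&h_hits, 2 * nEvents * wordsPerEvent * sizeof(unsigned));
    for (int s = 0; s < 2; ++s) {
        cudaMalloc((void **)&d_hits[s],  nEvents * wordsPerEvent * sizeof(unsigned));
        cudaMalloc((void **)&d_rings[s], nEvents * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }

    for (int packet = 0; packet < 10; ++packet) {
        int s = packet & 1;                         // alternate between the two streams
        cudaMemcpyAsync(d_hits[s], h_hits + s * nEvents * wordsPerEvent,
                        nEvents * wordsPerEvent * sizeof(unsigned),
                        cudaMemcpyHostToDevice, stream[s]);
        process_packet<<<(nEvents + 255) / 256, 256, 0, stream[s]>>>(d_hits[s],
                                                                     d_rings[s], nEvents);
        // The results would go back with another cudaMemcpyAsync on the same stream.
    }
    cudaDeviceSynchronize();
    return 0;
}
```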

  9. Example: RICH 2 TEL62 for 1 spot (1000 PMs). The GPU needs the position of the PMs → 10 bits, so 3 hits can be put in a single word: 20 hits x 5 MHz / 3 ≈ 1.2 Gb/s. 1 link for R/O, 1 link for standard primitives (histograms), 2 links for the hit positions for the GPU. [Diagram: 2 TEL62, 4 TDCBs each → GbE links: R/O ~50 MB/s, std primitives ~22 MB/s, GPU primitives ~1.2 Gb/s → L0 GPU] It would be very useful to have rings (center and radius) at L0: measurement of particle velocity; first ingredient for PID (we also need the spectrometer); possibility to have missing mass information (assuming the particle mass and the Pt kick). Very useful for rare decays (K→πγγ, K→πγ, Ke2γ, …).
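The 3-hits-per-word packing and the quoted bandwidth can be sketched as follows (the exact bit layout of the TEL62 word is an assumption here, not the real data format):

```cuda
#include <cstdint>
#include <cstdio>

// Pack three 10-bit PM identifiers (0..1023) into one 32-bit word.
static uint32_t pack_hits(uint32_t pm0, uint32_t pm1, uint32_t pm2)
{
    return (pm0 & 0x3FF) | ((pm1 & 0x3FF) << 10) | ((pm2 & 0x3FF) << 20);
}

// Unpack them on the receiving side (e.g. before handing the hits to the GPU).
static void unpack_hits(uint32_t word, uint32_t pm[3])
{
    pm[0] = word & 0x3FF;
    pm[1] = (word >> 10) & 0x3FF;
    pm[2] = (word >> 20) & 0x3FF;
}

int main()
{
    uint32_t pm[3];
    unpack_hits(pack_hits(17, 511, 999), pm);
    printf("%u %u %u\n", pm[0], pm[1], pm[2]);      // 17 511 999

    // Bandwidth quoted on the slide: 20 hits * 5 MHz / 3 hits-per-word * 32 bits.
    double gbps = 20.0 * 5e6 / 3.0 * 32.0 / 1e9;
    printf("~%.1f Gb/s\n", gbps);                   // ~1.1 Gb/s, consistent with ~1.2 Gb/s
    return 0;
}
```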

  10. Ring finding with GPU: generalities • The parallelization offered by the GPU architecture can be exploited in two different ways: • In the algorithm: parts of the same algorithm are executed on different cores in order to speed up the single event processing. • Processing many events at the same time: each core processes a single event. • The two “ways” are usually mixed. The “threads” running in a multiprocessor (8 cores each) communicate through the very fast shared memory (1 TB/s bandwidth). The data are stored in the huge global memory (a “packet” of N events is periodically copied into the global memory). • In general we have two approaches: • each thread performs very simple operations → heavy use of shared memory • each thread performs harder computations → light use of shared memory

  11. On the CPU Minimization of a likelihood as a function of the hit-to-ring distance [functional form shown as a formula on the slide], with dr0 = 4.95, s = 2.9755 and A = -2·a·s·exp(0.5)·(dr0 + s), valid inside the PM acceptance and constant outside (from the RICH test beam analysis, Antonino). The minimization is done using the MINUIT package with ROOT.

  12. POMH Each PM (1000) is considered as the center of a circle. For each center a histogram is constructed with the distances between the center and the hits (<32). The whole processor is used for a single event (huge number of centers); each single thread computes a few distances; several histograms are computed in different shared memory spaces. Not natural for the processor: it isn't possible to process more than one event at the same time (the parallelism is fully exploited to speed up the computation). • Very important: particular care has to be taken when writing in shared memory to avoid “conflicts” → in this algorithm it isn't easy to avoid conflicts! (in case of conflicts the writes to memory are serialized → loss of time)

  13. DOMH Exactly the same algorithm as POMH, but with a different resource assignment. The system is exploited in a more natural way: each block is dedicated to a single event, using the shared memory for one histogram. Several events are processed in parallel at the same time. It is easier to avoid conflicts in shared and global memory. [Diagram: global memory holding the events; one block per event, each with its own shared memory; 1 event → M threads (each thread handles N PMs)]
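A minimal sketch of this block-per-event mapping (the same distance-histogram idea as POMH, but with one shared-memory histogram per event). The bin width, histogram size and launch configuration (64 threads per block, one block per event) are assumptions, and the host-side setup is omitted:

```cuda
#include <cuda_runtime.h>

#define MAX_HITS  32
#define N_PM      1000
#define N_BINS    64
#define BIN_CM    0.5f                              // assumed histogram granularity

__constant__ float c_pmX[N_PM], c_pmY[N_PM];        // PM positions in constant memory

// Launch with <<<nEvents, 64>>>: one block per event.
__global__ void domh_kernel(const int *hitPM, const int *nHits,
                            float *ringX, float *ringY, float *ringR)
{
    __shared__ int   hist[N_BINS];
    __shared__ float hx[MAX_HITS], hy[MAX_HITS];
    __shared__ int   bestCount, bestCentre, bestBin;

    int ev = blockIdx.x;
    int n  = nHits[ev];

    if (threadIdx.x < n) {                          // stage the hits of this event
        int pm = hitPM[ev * MAX_HITS + threadIdx.x];
        hx[threadIdx.x] = c_pmX[pm];
        hy[threadIdx.x] = c_pmY[pm];
    }
    if (threadIdx.x == 0) { bestCount = -1; bestCentre = 0; bestBin = 0; }
    __syncthreads();

    for (int c = 0; c < N_PM; ++c) {                // loop over candidate centres
        if (threadIdx.x < N_BINS) hist[threadIdx.x] = 0;
        __syncthreads();

        if (threadIdx.x < n) {                      // one thread per hit
            float dx = hx[threadIdx.x] - c_pmX[c];
            float dy = hy[threadIdx.x] - c_pmY[c];
            int bin = (int)(sqrtf(dx * dx + dy * dy) / BIN_CM);
            if (bin < N_BINS) atomicAdd(&hist[bin], 1);
        }
        __syncthreads();

        if (threadIdx.x == 0)                       // keep the most populated bin
            for (int b = 0; b < N_BINS; ++b)
                if (hist[b] > bestCount) { bestCount = hist[b]; bestCentre = c; bestBin = b; }
        __syncthreads();
    }

    if (threadIdx.x == 0) {                         // ring candidate for this event
        ringX[ev] = c_pmX[bestCentre];
        ringY[ev] = c_pmY[bestCentre];
        ringR[ev] = (bestBin + 0.5f) * BIN_CM;
    }
}
```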

  14. HOUGH Each hit is the center of a test circle with a given radius. The ring center is the best common point of the test circles. PM positions → constant memory; hits → global memory; test circle prototypes → constant memory; 3D space for the histograms in shared memory (2D grid vs. test circle radius). Limitations due to the total shared memory amount (16 kB). One thread for each center (hit) → 32 threads (in one thread block) per event. [Diagram: 3D accumulator in X, Y and radius]
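A hedged sketch of the voting scheme (one block per event, one thread per hit). The grid granularity, the set of test radii and the coordinate convention are assumptions chosen so that the 3D accumulator fits in the 16 kB of shared memory; host-side setup is omitted:

```cuda
#include <cuda_runtime.h>

#define MAX_HITS 32
#define NX 16
#define NY 16
#define NR 8
#define CELL_CM 1.0f                                // assumed grid granularity
#define R0_CM   9.0f                                // assumed first test radius
#define DR_CM   1.0f                                // assumed radius step
#define N_ANG   32                                  // angular samples per test circle

// Launch with <<<nEvents, MAX_HITS>>>; hit coordinates are assumed to be
// given relative to the centre of the search region.
__global__ void hough_kernel(const float *hitX, const float *hitY, const int *nHits,
                             float *cx, float *cy, float *cr)
{
    __shared__ int acc[NR][NY][NX];                 // 8*16*16*4 B = 8 kB accumulator

    int ev = blockIdx.x;
    for (int i = threadIdx.x; i < NR * NY * NX; i += blockDim.x)
        ((int *)acc)[i] = 0;
    __syncthreads();

    if (threadIdx.x < nHits[ev]) {                  // one thread per hit
        float x = hitX[ev * MAX_HITS + threadIdx.x];
        float y = hitY[ev * MAX_HITS + threadIdx.x];
        for (int r = 0; r < NR; ++r) {
            float R = R0_CM + r * DR_CM;
            for (int a = 0; a < N_ANG; ++a) {       // vote along the test circle
                float phi = 2.f * 3.14159265f * a / N_ANG;
                int ix = (int)floorf((x + R * cosf(phi)) / CELL_CM) + NX / 2;
                int iy = (int)floorf((y + R * sinf(phi)) / CELL_CM) + NY / 2;
                if (ix >= 0 && ix < NX && iy >= 0 && iy < NY)
                    atomicAdd(&acc[r][iy][ix], 1);
            }
        }
    }
    __syncthreads();

    if (threadIdx.x == 0) {                         // most voted (x, y, r) cell wins
        int best = -1, bi = 0;
        for (int i = 0; i < NR * NY * NX; ++i)
            if (((int *)acc)[i] > best) { best = ((int *)acc)[i]; bi = i; }
        int r = bi / (NY * NX), iy = (bi / NX) % NY, ix = bi % NX;
        cx[ev] = (ix - NX / 2 + 0.5f) * CELL_CM;
        cy[ev] = (iy - NY / 2 + 0.5f) * CELL_CM;
        cr[ev] = R0_CM + r * DR_CM;
    }
}
```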

  15. TRIPLS In each thread the center of the ring is computed using three points (a “triplet”). For the same event, several triplets are examined at the same time. Not all the possible triplets are considered: a fixed number, depending on the number of events. Each thread fills a 2D histogram in order to decide the final center. The radius is obtained from the center by averaging the distances to the hits. The vector with the indices of the triplet combinations is loaded once into constant memory at the beginning. The “noise” can induce “fake” centers, but the procedure has been demonstrated to converge for a sufficient number of combinations (at least for a not too small number of hits).
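The per-triplet center is just the circumcenter of three hits; a sketch of that computation (the triplet selection from constant memory and the 2D voting histogram are not shown):

```cuda
#include <cstdio>
#include <cmath>

// Circumcentre of three points: the point equidistant from the three hits,
// obtained by subtracting the circle equations pairwise (a small linear system).
__host__ __device__ inline bool circum_centre(float x1, float y1, float x2, float y2,
                                              float x3, float y3, float *cx, float *cy)
{
    float d = 2.f * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2));
    if (fabsf(d) < 1e-6f) return false;             // (nearly) collinear triplet
    float s1 = x1 * x1 + y1 * y1;
    float s2 = x2 * x2 + y2 * y2;
    float s3 = x3 * x3 + y3 * y3;
    *cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d;
    *cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d;
    return true;
}

int main()
{
    // Three points on a circle of radius 5 centred at (1, 2).
    float cx, cy;
    if (circum_centre(6.f, 2.f, 1.f, 7.f, -4.f, 2.f, &cx, &cy))
        printf("centre = (%.2f, %.2f)\n", cx, cy);  // prints (1.00, 2.00)
    return 0;
}
```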

  16. MATH A purely mathematical approach: conformal mapping. It is based on the inversion of the ring equation after a conformal transformation [formula shown on the slide]. After a second order approximation, the center and radius are obtained by solving a linear system. The shared memory is not used at all. The solution can be obtained by exploiting many cores for the same event or just one core per event: the latter has been proved to be more efficient by at least a factor of two.
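As a hedged illustration of the “one thread = one event, no shared memory” structure, here is the standard algebraic (Kåsa-type) least-squares circle fit, which also reduces to a 3x3 linear system. It has the same flavour as the conformal-mapping formulae of the talk but is not necessarily identical to them; the launch configuration and event layout are assumptions.

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define MAX_HITS 32

// Launch e.g. with <<<(nEvents + 255) / 256, 256>>>: one thread fits one event.
__global__ void math_fit(const float *hx, const float *hy, const int *nHits,
                         float *cx, float *cy, float *cr, int nEvents)
{
    int ev = blockIdx.x * blockDim.x + threadIdx.x;
    if (ev >= nEvents) return;

    // Normal equations of  min sum_i (x_i^2 + y_i^2 + B x_i + C y_i + D)^2 :
    //   [Sxx Sxy Sx] [B]   [-Sxz]
    //   [Sxy Syy Sy] [C] = [-Syz]      with z = x^2 + y^2
    //   [Sx  Sy  n ] [D]   [-Sz ]
    int n = nHits[ev];
    float Sx = 0, Sy = 0, Sxx = 0, Syy = 0, Sxy = 0, Sxz = 0, Syz = 0, Sz = 0;
    for (int i = 0; i < n; ++i) {
        float x = hx[ev * MAX_HITS + i], y = hy[ev * MAX_HITS + i];
        float z = x * x + y * y;
        Sx += x; Sy += y; Sxx += x * x; Syy += y * y; Sxy += x * y;
        Sxz += x * z; Syz += y * z; Sz += z;
    }
    float a11 = Sxx, a12 = Sxy, a13 = Sx, b1 = -Sxz;
    float a21 = Sxy, a22 = Syy, a23 = Sy, b2 = -Syz;
    float a31 = Sx,  a32 = Sy,  a33 = (float)n, b3 = -Sz;
    float det =  a11 * (a22 * a33 - a23 * a32)
               - a12 * (a21 * a33 - a23 * a31)
               + a13 * (a21 * a32 - a22 * a31);
    if (n < 3 || fabsf(det) < 1e-6f) { cx[ev] = cy[ev] = cr[ev] = 0.f; return; }

    // Cramer's rule for the 3x3 system.
    float B = ( b1 * (a22 * a33 - a23 * a32) - a12 * (b2 * a33 - a23 * b3)
              + a13 * (b2 * a32 - a22 * b3)) / det;
    float C = ( a11 * (b2 * a33 - a23 * b3) - b1 * (a21 * a33 - a23 * a31)
              + a13 * (a21 * b3 - b2 * a31)) / det;
    float D = ( a11 * (a22 * b3 - b2 * a32) - a12 * (a21 * b3 - b2 * a31)
              + b1 * (a21 * a32 - a22 * a31)) / det;

    cx[ev] = -B / 2.f;                              // circle: x^2 + y^2 + Bx + Cy + D = 0
    cy[ev] = -C / 2.f;
    cr[ev] = sqrtf(cx[ev] * cx[ev] + cy[ev] * cy[ev] - D);
}
```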

  17. Resolution plots Toy MC: 100000 rings generated with random center and random radius but with a fixed number of hits (17 for these plots), in 100 packets of 1000 events. POMH and DOMH give, as expected, very similar results. HOUGH shows secondary peaks due to convergence on a wrong radius. Very strange HOUGH radius result: probably a bug in the code. [Plots: R - R_gen residuals for each algorithm]

  18. Resolution plots TRIPL and MATH show better resolution with respect to the “memory-based” algorithms. The CPU shows strange tails, mainly due to rings not fully included in the acceptance (the likelihood doesn't converge).

  19. Resolution vs Nhits The resolution depends only slightly on the number of hits. The difference between X and Y is due to the different packing of the PMs in X and Y. In the last plot the HOUGH result is off scale. The MATH resolution is better than the CPU resolution! [Plots: resolution (cm) vs Nhits for X, Y and R]

  20. Resolution plots (with noise) In order to study the stability of the results against possible noise, random hits are added to the generated ring in the MC. A variable percentage of noise is considered (0%, 5%, 10%, 13%, 18%). POMH, DOMH and HOUGH are only marginally influenced by noise. For noise > 10%, non-gaussian tails become predominant in the resolution of TRIPL and MATH. Shifts of the central value are observed at large values of noise.

  21. Noise plots Resolution as a function of the noise percentage. The noise in the RICH is expected to be quite low, according to tests on the prototype. [Plots: resolution (cm) vs noise for X, Y and R]

  22. Time plots The execution time is evaluated using an internal timestamp counter in the GPU; the resolution of the counter is 1 us. The time per event is obtained as the average over packets of 1000 events. The plots are obtained for events with 20 hits. [Plots: processing time per event (us) for each algorithm]

  23. Time vs Nhits plots The execution time depends on the number of hits. This dependence is quite small on the GPU (at least for the 4 faster algorithms) and is larger on the CPU. The best result at the moment is 50 ns per ring with MATH, but… [Plots: time per event (us) vs Nhits]

  24. MATH optimization The time result in MATH depends on the total occupancy of concurrent threads in the GPU. In particular it depends on the size of a logic block (a multiprocessor internally schedules the execution of a block of N threads, but the real parallelization is over 16 threads). It isn't a priori obvious which number of threads optimizes the execution: it depends on memory occupancy, single-thread time, divergence between the threads in the block, … The result before was for 256 threads in one block; the optimum is 16 threads in one block (the plot above is for 17-hit events, the result before was for 20-hit events). Ring computed in 4.4 ns!
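The block-size optimization can be reproduced with a simple scan like the one below (illustrative; math_fit stands for the ring-fit kernel, assumed to be defined elsewhere, e.g. the sketch on slide 16, and device-linked with -rdc=true if it lives in another file). Kernel time is measured with CUDA events.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void math_fit(const float *, const float *, const int *,
                         float *, float *, float *, int);   // defined elsewhere

void scan_block_size(const float *d_hx, const float *d_hy, const int *d_n,
                     float *d_cx, float *d_cy, float *d_cr, int nEvents)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    for (int threads = 16; threads <= 512; threads *= 2) {
        int blocks = (nEvents + threads - 1) / threads;

        cudaEventRecord(t0, 0);
        math_fit<<<blocks, threads>>>(d_hx, d_hy, d_n, d_cx, d_cy, d_cr, nEvents);
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);

        float ms = 0.f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("%3d threads/block: %.3f ms for %d events (%.1f ns/event)\n",
               threads, ms, nEvents, 1e6f * ms / nEvents);
    }
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
}
```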

  25. Data transfer time • The other time we need to know is the time to transfer the data into the video card and the results out of it. • The transfer times are studied as a function of the packet size. • They are, more or less, linear with the size of the packet. • Obviously they are, to first approximation, independent of the algorithm. • For packets of 1000 events: • RAM→GRAM = 120 us • GRAM→RAM = 41 us. Some algorithms require extra time to transfer data into constant memory (for example TRIPLS needs the index vector to choose among the various combinations of hits).
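The RAM→GRAM and GRAM→RAM transfer times can be measured with CUDA events and pinned (page-locked) host memory, for example as below (the event size and packet sizes are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    for (int nEvents = 125; nEvents <= 2000; nEvents *= 2) {
        size_t bytes = (size_t)nEvents * 32 * sizeof(unsigned);  // assumed event size
        unsigned *h_buf, *d_buf;
        cudaMallocHost((void **)&h_buf, bytes);     // pinned host buffer
        cudaMalloc((void **)&d_buf, bytes);

        cudaEventRecord(t0, 0);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);

        float ms = 0.f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("%5d events (%6zu kB): RAM -> GRAM in %.1f us\n",
               nEvents, bytes / 1024, 1000.f * ms);

        cudaFreeHost(h_buf);
        cudaFree(d_buf);
    }
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return 0;
}
```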

  26. Comparison between MATH and CPUMATH It is quite unfair to compare the MATH result with the CPU (MINUIT) result: in that case we have an improvement by a factor of 180000! CPUMATH (the MATH algorithm on the CPU) gives good results in terms of resolution and time. In any case the timing result isn't very stable, depending on the activity of the CPU. In this case the improvement is “only” a factor of 25. For instance, the L1 RICH farm would need a factor of 25 fewer processors to do the same job (essentially we need only one PC).

  27. Time budget in MATH [Plot of the measured time budget for a packet of 1000 events; total ≈ 400 us]

  28. Time budget in MATH • For packets of 1000 events we have (real time for the operations): • 120 us to transfer the data into the video card • 4.4 us to process all the events • 41 us to transfer the results back to the PC. • The measured elapsed time to obtain the final results for packets of 1000 events is 400 us! • Actually, on the TESLA card, it is possible to transfer data (a different packet) during the kernel execution in the processor. • On the new FERMI card it is possible to have two separate transfer streams: data and results. • The “wrong” time budget for the kernel execution in the plots (1/3 of the total time) is due to the double call of the kernel for the “warm up” → this probably can be avoided for each packet. • Other “tricks” could be used to decrease the total time → probably a factor of two is still possible: 200 us!
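The “warm up” can be paid once at initialization instead of once per packet; a minimal sketch (illustrative, not the actual trigger code):

```cuda
#include <cuda_runtime.h>

__global__ void warm_up_kernel() {}

void init_gpu()
{
    warm_up_kernel<<<1, 1>>>();     // pays the one-off context / code-load cost here
    cudaDeviceSynchronize();        // ...so the timing of the real packets stays stable
}
```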

  29. Triggering with MATH A single PC with a TESLA video card could process events at 10 MHz with 200 us of latency. The L0 trigger level requires an overall latency of 1 ms → the GPU can be used to generate high quality primitives! • To complete the system we need three more elements: • X CARD: a smart gigabit receiver card in which the ethernet protocol handling and the decoding are done in an FPGA (the link carries headers, trailers, words with 3 hits of 10 bits each… the video card wants only the hit coordinates), to avoid extra work in the CPU → an Altera Stratix IV development board has been bought in Pisa. • Real-time OS: any non-deterministic behavior should be avoided; with 200 us of execution it is very easy to get fluctuations of several factors due to extra CPU activity → a group in Ferrara is studying the transfer time with standard and real-time OS. • Transmitter card: in principle it could be a standard ethernet card, but the X CARD should be adapted to rebuild the ethernet packet, adding the timestamp after a fixed latency (in order to avoid extra work in the CPU).

  30. Conclusions & ToDo • Several algorithms have been tested on the GPU, using a toy MC to study the resolution with and without noise. • The processing time per event has been measured for each algorithm: the best result is 4.4 ns per ring! • Including the transfer time, the latency for packets of 1000 events is around 200 us (thanks to the linearity of the transfer time, a packet of 500 events is processed in 100 us). • This latency allows us to imagine a system that builds high quality primitives for the RICH at L0. • ToDo: • Test (and modify) the algorithms for the 2-ring case. • Test the GPU with a real transfer from another PC (everything is ready in Pisa). • Test the new FERMI card (already bought in Pisa). • Final refinement of the procedure. • Start thinking about the X Card. • …
