
The DAQ/HLT system of the ATLAS experiment



Presentation Transcript


  1. The DAQ/HLT system of the ATLAS experiment. André Anjos, University of Wisconsin/Madison, 05 November 2008. On behalf of the ATLAS TDAQ collaboration.

  2. Design characteristics
  • Triggering is done in 3 levels
  • High-level triggers (L2 and EF) are implemented in software; L2 is RoI-based
  • HLT/Offline software share components
  • Event size is about 1.6 MB
  • Level-1 operates at 40 MHz
  • Level-2 operates at 100 kHz
  • Event Filter operates at ~3 kHz
  • Storage records at ~200 Hz (see the rate arithmetic sketched below)
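
  The rate figures above imply a large rejection at each level. A minimal sketch (not ATLAS code; all numbers are taken from the slide) that just works out the per-level rejection factors:

    // Per-level rejection factors implied by the design rates quoted on the slide.
    #include <cstdio>

    int main() {
        const double l1_in  = 40e6;   // bunch-crossing rate seen by Level-1 [Hz]
        const double l1_out = 100e3;  // Level-1 accept rate [Hz]
        const double l2_out = 3e3;    // Level-2 accept rate [Hz]
        const double ef_out = 200.0;  // Event Filter / storage rate [Hz]

        std::printf("L1 rejection: ~%.0f\n", l1_in / l1_out);   // ~400
        std::printf("L2 rejection: ~%.0f\n", l1_out / l2_out);  // ~33
        std::printf("EF rejection: ~%.0f\n", l2_out / ef_out);  // ~15
        std::printf("Overall:      ~%.0f\n", l1_in / ef_out);   // ~200000
        return 0;
    }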

  3. Detailed overview
  [Dataflow diagram: detectors feed Read-Out Drivers (ROD) and Read-Out Buffers (ROB); the RoI Builder (ROIB), L2 supervisors (L2SV), L2 processors (L2P), DFM, SFIs, Event Filter nodes (EFN/EFP) and SFOs are connected by the data-collection network (DCN); the high-level triggers run on commodity computing and networking, with heavily multi-threaded DAQ applications in the dataflow.]
  • Level 1: 40 MHz input, L1 accept rate 100 kHz, latency 2.5 μs; detector read-out at 160 GB/s
  • Level 2: RoI-based; RoI requests fetch ~2% of the event data from the Read-Out Systems (ROS), ~3+5 GB/s; decision in O(10) ms; L2 accept rate ~3 kHz (see the sketch below)
  • Event Builder (EB): assembles full events from the SFIs
  • Event Filter (EF): event processing parallelized through multi-processing, O(1) s per event; EF accept rate ~0.2 kHz, ~300 MB/s to storage
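
  To make the RoI mechanism concrete, here is a hedged sketch of the Level-2 processing logic; the types and helpers (receiveL1Result, requestRobs, runL2Algorithms, sendL2Decision) are hypothetical stand-ins for the real dataflow services, not the actual L2PU interfaces:

    #include <vector>

    struct RoI     { float eta, phi; };           // Level-1 Region of Interest
    struct RobData { std::vector<char> payload; };

    // Hypothetical helpers standing in for the real dataflow services.
    std::vector<RoI>     receiveL1Result();             // from the RoI Builder via the L2SV
    std::vector<RobData> requestRobs(const RoI& roi);   // network request to the ROSes
    bool                 runL2Algorithms(const RoI& roi, const std::vector<RobData>& data);
    void                 sendL2Decision(bool accept);   // decision back to the L2SV/DFM

    void processOneEvent() {
        bool accept = false;
        for (const RoI& roi : receiveL1Result()) {
            // Data is pulled on demand, RoI by RoI (~2% of the event), so most
            // events are rejected before the full event is ever read out.
            const std::vector<RobData> data = requestRobs(roi);
            if (runL2Algorithms(roi, data)) { accept = true; break; }
        }
        sendL2Decision(accept);   // only accepted events proceed to event building
    }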

  4. Testing the Dataflow
  • Make sure the current system can handle high rates, oscillations and unforeseen problems (crashes, timeouts)
  • Testing conditions:
  • HLT loaded with a 10^31 menu
  • Mixed sample of simulated data (background + signal)
  • 4 L2 supervisors
  • 2880 L2PUs (70% of the final L2 farm size)
  • 94 SFIs
  • 310 EFDs + 2480 PTs (~20% of the final system)

  5. Level-2
  • Able to sustain 60 kHz through the system
  • Able to handle unforeseen events
  • Timings for event processing are at specification
  [Plots: almost 60 kHz into L2, about 80% of the design rate; with the 1-second timeout, very few messages are lost (see the timed-wait sketch below)]
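
  The "1 second timeout" suggests a bounded wait on each outstanding data request. A minimal sketch of that idea (assumed structure, not the real dataflow code): a worker waits at most the timeout for a reply before giving up on that request:

    #include <chrono>
    #include <condition_variable>
    #include <mutex>

    struct PendingRequest {
        std::mutex m;
        std::condition_variable cv;
        bool replied = false;

        // Returns false if the reply did not arrive within the timeout (e.g. 1 s).
        bool waitForReply(std::chrono::milliseconds timeout) {
            std::unique_lock<std::mutex> lk(m);
            return cv.wait_for(lk, timeout, [this] { return replied; });
        }

        // Called by the network receive thread when the reply shows up.
        void onReply() {
            { std::lock_guard<std::mutex> lk(m); replied = true; }
            cv.notify_one();
        }
    };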

  6. Event Building
  • LVL2-driven EB rate: 4.2 kHz (3.5 kHz)
  • Small event size: 800 kB (1.6 MB)
  • Throughput: ~3.5 GB/s (5 GB/s)
  • Limited by Event Filter capacity: only using 20% of the final farm
  [Plots: Event Builder rate (Hz) and aggregated Event Builder bandwidth (MB/s)]

  7. Event Building Performance
  [Plot: event-building performance versus design; installed EF bandwidth, available EF bandwidth, extrapolated building-only performance and predicted EB+EF performance degradation, with points for Cosmics '08 and the 10^31 menu testing.]

  8. Event Filter
  • Multi-process approach: 1 EFD per machine, multiple PTs
  • EFD/PT data communication through shared heaps (sketched below)
  • "Quasi-offline" reconstruction, seeded by L2
  • SFIs work as "event servers": the EF gets data at ~3 kHz
  • Data is pushed to the SFOs at ~200 Hz
  [Diagram: SFI → EFD → PTs → SFO, ~3 kHz in, ~200 Hz out, 10^31 menu]
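
  A hedged sketch of the "shared heap" idea (assumed layout and names, not the real EFD/PT code): one POSIX shared-memory segment is created by the EFD and mapped by every PT on the node, so full events never have to cross a socket between the two:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstddef>

    constexpr std::size_t kHeapSize = 64 * 1024 * 1024;     // illustrative size
    constexpr const char* kHeapName = "/efd_shared_heap";   // hypothetical name

    // EFD side: create and map the segment, then write event data into it.
    void* createSharedHeap() {
        int fd = shm_open(kHeapName, O_CREAT | O_RDWR, 0600);
        if (fd < 0) return nullptr;
        if (ftruncate(fd, kHeapSize) != 0) { close(fd); return nullptr; }
        void* mem = mmap(nullptr, kHeapSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return mem == MAP_FAILED ? nullptr : mem;
    }

    // PT side: map the same segment read-only and run the selection on the
    // event found at whatever offset the EFD hands over in a control message.
    const void* attachSharedHeap() {
        int fd = shm_open(kHeapName, O_RDONLY, 0);
        if (fd < 0) return nullptr;
        void* mem = mmap(nullptr, kHeapSize, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);
        return mem == MAP_FAILED ? nullptr : mem;
    }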

  9. Event Storage
  • Final stage of TDAQ: data is written into files and asynchronously transferred to mass storage
  • Streaming capabilities (express, physics, calibration, ...)
  • Farm of 5 nodes with a total storage area of 50 TB, RAID-5
  • Provides a sustained I/O rate of 550 MB/s; peak rate > 700 MB/s (target is 300 MB/s)
  • Absorbs fluctuations and spikes; hot-spare capabilities
  • Fast recovery in case of mass storage failure
  • It is a component regularly used at the design specifications
  [Plot: transferring about 1 TB/hour during cosmics data taking; ATLAS throughput since August ~0.8 PB]

  10. Cosmics data taking
  • 216 million events with an average size of 2.1 MB = 453 TB
  • 400,000 files
  • HLT+DAQ problems tagged in about 2.5% of the total events (timeouts or crashes)
  • 21 inclusive streams

  11. ATLAS HLT & Multi-core
  • End of the "frequency-scaling era": more parallelism is needed to achieve the expected throughput
  • Event parallelism is inherent to typical high-energy-physics selection and reconstruction programs
  • ATLAS has a large code base mostly written and designed in the "pre-multi-core era"
  • Baseline design for the HLT:
  • Event Filter: multi-processing since the beginning; at 1-2 s processing time it needs ~6000 cores to achieve 3 kHz
  • Level-2: multi-threading (but may fall back to multi-processing); at ~40 ms processing time it needs ~4000 cores to achieve 100 kHz (see the arithmetic below)
  • We have explored both multi-threading and multiple processes
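
  A minimal sketch of the core-count arithmetic behind those figures (numbers from the slide; the model simply multiplies the input rate by the per-event processing time, assuming one event in flight per core):

    #include <cstdio>

    int main() {
        // Event Filter: ~2 s per event at a 3 kHz input rate.
        std::printf("EF cores needed: ~%.0f\n", 3e3 * 2.0);      // ~6000
        // Level-2: ~40 ms per event at a 100 kHz input rate.
        std::printf("L2 cores needed: ~%.0f\n", 100e3 * 0.040);  // ~4000
        return 0;
    }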

  12. Level-2 & Multi-processing
  • When we started, we understood that multi-processing would mean:
  • longer context switches
  • special data transfer mechanisms
  • more applications to control
  • more clients to configure
  • more resources to monitor
  • a more chaotic dataflow
  • Apparently, lots of problems in many places! But most of the tools were already in place.

  13. Level-2 & Multi-threading
  • Multi-threading allows sharing the application space
  • The evident way to solve most of the problems mentioned before
  • Only problem: making the HLT code thread-safe (offline components are shared)
  • But: MT is not only about safety, it is also about efficiency!

  14. Performance bottlenecks
  [Plot: L2PU with 3 worker threads; worker threads blocked during initialization and event processing]
  • Level-2 is thread-safe, but not efficient
  • At first sight, it is "nobody's" direct fault
  • What is the real issue? Since we are importing code, it is difficult to keep track of what is "correctly" coded and what is not. From release to release, something else broke or became less efficient.

  15. And still…
  • ATLAS has a large code base mostly written and designed in the "pre-multi-core era": which other packages hide surprises?
  • Synchronization problems are not fun to debug
  • How do we model software development so that our hundreds of developers can understand it and code efficiently?
  • Current trends in OS development show improved context-switching times and more tools for inter-process synchronization
  • What if 1 thread crashes?
  • MP performance is almost identical to MT
  • The EF baseline is MP!
  [Plot: MP versus MT comparison on one machine with 8 cores in total]

  16. Summary & Outlook on MP for HLT
  • Multi-threading, despite being more powerful, lacks support tools and specialized developers
  • The base offline infrastructure was created in the "pre-multi-core era": MT efficiency is difficult in our case...
  • The TDAQ infrastructure is proven to work when using MP for the HLT
  • Event processing scales well using MP
  • Techniques being investigated for sharing immutable (constant) data:
  • a common shared memory block
  • OS fork() + copy-on-write (see the sketch below)
  • Importance understood, R&D being set up at CERN: http://indico.cern.ch/conferenceDisplay.py?confId=28823
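
  A hedged sketch of the fork() + copy-on-write option (illustrative only, not the production code): the parent builds the large read-only data once, then forks the workers, which share the physical pages for as long as they only read them:

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // Stand-in for large immutable data (geometry, field maps, menus, ...).
        std::vector<double> constants(10000000, 1.0);

        const int nWorkers = 4;                  // illustrative value
        for (int i = 0; i < nWorkers; ++i) {
            pid_t pid = fork();
            if (pid == 0) {
                // Child: pages of 'constants' stay shared; the kernel copies a
                // page lazily only if the child ever writes to it.
                double sum = 0.0;
                for (double c : constants) sum += c;
                std::printf("worker %d summed %.0f constants\n", i, sum);
                _exit(0);
            }
        }
        while (wait(nullptr) > 0) {}             // parent reaps all workers
        return 0;
    }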

  17. Conclusions
  • It works: the RoI mechanism, the 3-level trigger, a highly distributed system; the HLT is currently MP-based
  • With simulated data: L2 (at 70% of its final size) can sustain 60 kHz over many hours; the EB can sustain the design rate comfortably; processing times for the HLT are within the design range
  • With cosmics data: took nearly 1 petabyte of detector data; stable operations for months
  • We are ready to take new physics data in 2009!

  18. First event from beam
  [Event displays: first day of LHC operations; detection of the beam dump at a collimator near ATLAS.]

  19. Backup

  20. Level-2
  • L2-ROS communication is currently over UDP (a request sketch follows below)
  • 2% × 1.6 MB × 100 kHz ≈ 3.2 GB/s of RoI traffic (small!)
  • The ROS is designed to stand a maximum hit (request) rate of 30 kHz
  • Each ROS is connected to a fixed detector location: hot-ROS effect
  [Diagram: Level-1 Trigger and RoI Builder (100 kHz) feed the L2 supervisors (10 kHz), which assign events to the L2PUs; the L2PUs request data over the DC network from the Read-Out System, whose ROSes have a fixed detector mapping (TIL, MUON, LAR, PIX, SCT, TRT).]
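
  A hedged sketch of what an L2-to-ROS request over UDP could look like (the message layout, address and port are hypothetical; this is not the real request protocol): one small datagram per ROB request, with the fragment coming back on the same socket, guarded by the timeout discussed earlier:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdint>

    struct RobRequest {             // hypothetical wire format
        std::uint32_t level1Id;     // event identifier assigned by Level-1
        std::uint32_t robId;        // which read-out buffer fragment to fetch
    };

    bool sendRobRequest(const char* rosIp, std::uint16_t port, RobRequest req) {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0) return false;

        sockaddr_in ros{};
        ros.sin_family = AF_INET;
        ros.sin_port   = htons(port);
        inet_pton(AF_INET, rosIp, &ros.sin_addr);

        ssize_t sent = sendto(s, &req, sizeof(req), 0,
                              reinterpret_cast<sockaddr*>(&ros), sizeof(ros));
        // The ROB fragment would be read back with recvfrom(), subject to the
        // request timeout, before the socket is closed.
        close(s);
        return sent == static_cast<ssize_t>(sizeof(req));
    }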

  21. Controls & Configuration
  • Coordinates all the applications during data-taking:
  • the first beam run included ~7000 applications on 1500 nodes
  • the DAQ configuration database accounts for ~100,000 objects
  • as soon as the HLT farm is complete we expect O(20k) applications distributed over 3000 nodes
  • Control software operates over the control network infrastructure
  • Based on a CORBA communication library
  • Decoupled from the dataflow
  • Some facilities provided: distributed application handling and configuration, resource granting, and an expert system for automatic recovery
  • The same HW/SW infrastructure is also exploited by the monitoring services

  22. Multithreading: Compiler Comparison
  [Chart: multi-threaded performance of STL vector, list and string across compilers; gcc 2.95 is not a valid comparison because its string is not thread-safe.]
  • Need technology tracking: compilers, debuggers, performance assessment tools

  23. Infrastructure
  • A complex (and scalable) infrastructure is needed to handle the system:
  • file servers, boot servers, monitoring servers
  • security
  • user management (e.g. roles)
  • At first beam: 50 infrastructure nodes already installed; will be > 100 in the final system
  • ~1300 users allowed into the ATLAS on-line computing system

  24. Data Acquisition Strategy
  • Based on three trigger levels
  • LVL1: hardware trigger
  • LVL2: farm of 500 1U PCs; reconstruction within Regions of Interest (RoI) defined by LVL1
  • EF: farm of 1900 1U PCs; complete event reconstruction
  • LVL2 and the EF together form the High Level Trigger (HLT)
  [Diagram: Read-Out System, ~3 kHz into the Event Filter, ~200 Hz to storage]

  25. HLT Hardware
  • 850 PCs installed: 8 cores each (2 x Intel Harpertown 2.5 GHz), 16 GB RAM
  • Single motherboard, cold-swappable power supply, network booted
  • 2 on-board GbE interfaces: one for control and IPMI, one for data
  • Dual connection to the data-collection and back-end networks via VLANs (XPU nodes)
  • Can act both as L2 and EF processors

  26. HLT and Offline Software
  [UML diagram: the HLT selection software (steering, HLT algorithms, event data model, data manager, monitoring and metadata services) runs inside the L2PU application and the Event Filter processing tasks; it imports offline components (Athena/Gaudi, event data model, reconstruction algorithms, StoreGate interface) from the offline core software, while ROB data collection comes from the HLT dataflow software.]
  • HLT selection software built on the ATHENA/GAUDI framework
  • Reuses offline components (illustrated below)
  • Common to Level-2 and the EF
  • Offline algorithms are used in the EF
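
  A hedged illustration of that reuse pattern (simplified stand-ins, not the real GaudiKernel headers or the ATLAS steering): selection code is written against an algorithm interface with initialize/execute/finalize hooks, so the same algorithm class can run in offline reconstruction or inside the online HLT framework:

    #include <memory>
    #include <vector>

    enum class StatusCode { SUCCESS, FAILURE };

    // Simplified stand-in for the Gaudi Algorithm base class.
    class Algorithm {
    public:
        virtual ~Algorithm() = default;
        virtual StatusCode initialize() = 0;  // once, at configuration time
        virtual StatusCode execute()    = 0;  // once per event (or per RoI at L2)
        virtual StatusCode finalize()   = 0;  // once, at end of run
    };

    // Hypothetical HLT selection algorithm; the same class could run offline.
    class MuonSelection : public Algorithm {
    public:
        StatusCode initialize() override { return StatusCode::SUCCESS; }
        StatusCode execute()    override { /* reconstruct, then accept/reject */
                                            return StatusCode::SUCCESS; }
        StatusCode finalize()   override { return StatusCode::SUCCESS; }
    };

    // The steering owns an ordered chain of algorithms and runs it per event.
    StatusCode runChain(std::vector<std::unique_ptr<Algorithm>>& chain) {
        for (auto& alg : chain)
            if (alg->execute() != StatusCode::SUCCESS) return StatusCode::FAILURE;
        return StatusCode::SUCCESS;
    }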

  27. Multi-threading Performance
  [Plot: L2PU with 3 worker threads; worker threads blocked during initialization and event processing]
  • Standard Template Library (STL) and multi-threading: the L2PU processes events independently in each worker thread
  • The default STL memory allocation scheme (a common memory pool) for containers is inefficient for the L2PU processing model → frequent locking
  • The L2PU processing model favors independent memory pools for each thread (a simplified sketch follows below)
  • Use the pthread allocator / DF_ALLOCATOR in containers; the solution for strings is to avoid them
  • Needs changes in offline software and its external software:
  • insert DF_ALLOCATOR in containers
  • compile utility libraries with DF_ALLOCATOR
  • design large containers to allocate memory once and reset the data during event processing
  • Evaluation of the problem with gcc 3.4 and icc 8: results with simple test programs (also used to understand the original findings) indicate considerable improvement (also for strings) in the libraries shipped with the new compilers
  • Insertion of the special allocator in offline code may be avoided when the new compilers are used
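
  A hedged sketch of the per-thread memory-pool idea (far simpler than the real pthread allocator or DF_ALLOCATOR): each worker thread draws container memory from its own thread_local arena, so allocations never contend on a shared lock, and the whole arena is reset once per event instead of freeing objects one by one:

    #include <cstddef>
    #include <new>
    #include <vector>

    struct ThreadArena {
        std::vector<char> buffer;
        std::size_t used = 0;
        explicit ThreadArena(std::size_t bytes = 1 << 20) : buffer(bytes) {}

        void* allocate(std::size_t n) {
            const std::size_t a = alignof(std::max_align_t);
            used = (used + a - 1) / a * a;                // keep allocations aligned
            if (used + n > buffer.size()) throw std::bad_alloc();
            void* p = buffer.data() + used;
            used += n;
            return p;
        }
        void reset() { used = 0; }    // called once per event; no per-object frees
    };

    inline ThreadArena& arena() { static thread_local ThreadArena a; return a; }

    template <class T>
    struct PerThreadAllocator {
        using value_type = T;
        PerThreadAllocator() = default;
        template <class U> PerThreadAllocator(const PerThreadAllocator<U>&) {}
        T* allocate(std::size_t n) { return static_cast<T*>(arena().allocate(n * sizeof(T))); }
        void deallocate(T*, std::size_t) {}   // memory is reclaimed by arena().reset()
    };
    template <class T, class U>
    bool operator==(const PerThreadAllocator<T>&, const PerThreadAllocator<U>&) { return true; }
    template <class T, class U>
    bool operator!=(const PerThreadAllocator<T>&, const PerThreadAllocator<U>&) { return false; }

    // Usage sketch: std::vector<int, PerThreadAllocator<int>> cells;  // no shared pool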
