This paper discusses the online performance monitoring of the Third ALICE Data Challenge, including the testbed infrastructure, monitoring system, and performance results.
Online Performance Monitoring of the Third ALICE Data Challenge
W. Carena¹, R. Divia¹, P. Saiz², K. Schossmaier¹, A. Vascotto¹, P. Vande Vyvre¹
CERN EP-AID¹, EP-AIP²
NEC2001, Varna, Bulgaria, 12-18 September 2001
Contents
• ALICE Data Challenges
• Testbed infrastructure
• Monitoring system
• Performance results
• Conclusions
ALICE Data Acquisition
Data flow in the final system:
• ALICE detectors
• up to 20 GB/s into the Local Data Concentrators (LDC): readout, ~300 nodes
• up to 2.5 GB/s into the Global Data Collectors (GDC): event building, ~100 nodes
• up to 1.25 GB/s into the CASTOR mass storage system
ALICE Data Challenges
• What? Put together components to demonstrate the feasibility, reliability, and performance of our present prototypes.
• Where? The ALICE common testbed uses the hardware of the common CERN LHC testbed.
• When? This exercise is repeated every year, progressively enlarging the testbed.
• Who? Joint effort between the ALICE online and offline groups and two groups of the CERN IT division.
Schedule:
• ADC I: March 1999
• ADC II: March-April 2000
• ADC III: January-March 2001
• ADC IV: 2nd half of 2002 (?)
Goals of the ADC III
• Performance, scalability, and stability of the system (10% of the final system)
• 300 MB/s event building bandwidth
• 100 MB/s over the full chain during a week
• 80 TB into the mass storage system
• Online monitoring tools
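As a rough cross-check of how the sustained-rate goal and the volume goal relate, the sketch below converts a rate and a duration into a data volume and back. It assumes decimal units (1 TB = 10^6 MB), and the function names are illustrative only; the slides do not say whether the 80 TB goal was meant to be reached at the 100 MB/s full-chain rate.

```python
def volume_tb(rate_mb_s: float, hours: float) -> float:
    """Volume in TB moved at a sustained rate, assuming 1 TB = 1e6 MB."""
    return rate_mb_s * hours * 3600 / 1e6


def hours_for_volume(volume_in_tb: float, rate_mb_s: float) -> float:
    """Hours needed to move a given volume at a sustained rate."""
    return volume_in_tb * 1e6 / rate_mb_s / 3600


if __name__ == "__main__":
    week_hours = 7 * 24
    # One week at the 100 MB/s full-chain goal
    print(f"{volume_tb(100, week_hours):.2f} TB in one week")   # 60.48 TB
    # Time to store 80 TB at the same rate
    print(f"{hours_for_volume(80, 100) / 24:.1f} days for 80 TB")  # 9.3 days
```

Under these assumptions, one week at 100 MB/s yields about 60 TB, so the 80 TB goal implies either a longer running time or periods at a higher rate.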
ADC III Testbed Hardware
• Farm: 80 standard PCs, dual PIII @ 800 MHz, Fast and Gigabit Ethernet, Linux kernel 2.2.17
• Network: 6 switches from 3 manufacturers, copper and fiber media, Fast and Gigabit Ethernet
• Disks: 8 disk servers, dual PIII @ 700 MHz, 20 IDE data disks, 750 GB mirrored
• Tapes: 3 HP NetServers, 12 tape drives, 1000 cartridges, 60 GB capacity, 10 MB/s bandwidth
ADC III Monitoring
• Minimum requirements
  • LDC/GDC throughput (individual and aggregate)
  • Data volume (individual and aggregate)
  • CPU load (user and system)
  • Identification: time stamp, run number
  • Plots accessible on the Web
• Online monitoring tools
  • PEM (Performance and Exception Monitoring) from CERN IT-PDP was not ready for ADC III
  • Fabric monitoring: developed by CERN IT-PDP
  • ROOT I/O: measures mass storage throughput
  • CASTOR: measures disk/tape/pool statistics
  • DATESTAT: prototype development by EP-AID, EP-AIP
Fabric Monitoring
• Collect CPU, network I/O, and swap statistics
• Send UDP packets to a server
• Display current status and history using Tcl/Tk scripts
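The collect-and-send pattern listed above can be sketched as follows. This is not the CERN IT-PDP implementation: the JSON payload, the field names, and the use of the load average as a stand-in for the CPU statistics are all assumptions made for illustration, and the demonstration runs over the loopback interface only.

```python
import json
import os
import socket
import time


def collect_sample() -> dict:
    """Gather one sample; the load average stands in for the CPU statistics."""
    load1, _, _ = os.getloadavg()  # available on Unix-like systems only
    return {"host": socket.gethostname(), "time": time.time(), "load1": load1}


def send_sample(sock: socket.socket, server: tuple, sample: dict) -> None:
    """Serialize the sample as JSON (an assumed wire format) and send one datagram."""
    sock.sendto(json.dumps(sample).encode("utf-8"), server)


def receive_sample(sock: socket.socket) -> dict:
    """Server side: read one datagram and decode it."""
    payload, _addr = sock.recvfrom(4096)
    return json.loads(payload.decode("utf-8"))


if __name__ == "__main__":
    # Loopback demonstration: the "server" and the node are the same machine.
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server_sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server_sock.settimeout(2.0)
    node_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    send_sample(node_sock, server_sock.getsockname(), collect_sample())
    print(receive_sample(server_sock))
    node_sock.close()
    server_sock.close()
```

UDP keeps the per-node overhead low because no connection is held open, at the cost of occasionally losing a sample, which is usually acceptable for periodic status data.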
ROOT I/O Monitoring
• Measures aggregate throughput to the mass storage system
• Collects measurements in a MySQL database
• Displays history and histogram using ROOT on Web pages
DATESTAT Architecture
[Diagram; components in the order they appear]
• DATE v3.7 running on the LDC and GDC nodes
• dateStat.c, with top and DAQCONTROL
• DATE Info Logger
• Log files (~200 KB/hour/node)
• Perl script, gnuplot script
• Statistics files
• C program, gnuplot/CGI script
• MySQL database
• http://alicedb.cern.ch/statistics
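The step from per-node log files to statistics can be illustrated with a small sketch. The record layout used here (timestamp in seconds, run number, node name, role, cumulative byte counter) is hypothetical, as the slides do not show the actual DATESTAT log format, and Python is used in place of the Perl and C programs named in the diagram.

```python
def parse_line(line: str):
    """Split one hypothetical log record: '<time_s> <run> <node> <role> <total_bytes>'."""
    ts, run, node, role, total = line.split()
    return float(ts), int(run), node, role, int(total)


def rates_mb_s(lines, role: str):
    """Per-node and aggregate throughput in MB/s for one role (LDC or GDC).

    Assumes each node's records are in chronological order and that the
    byte counter is cumulative within the run.
    """
    first, last = {}, {}
    for line in lines:
        ts, _run, node, node_role, total = parse_line(line)
        if node_role != role:
            continue
        first.setdefault(node, (ts, total))  # keep the earliest record
        last[node] = (ts, total)             # overwrite with the latest record
    per_node = {}
    for node, (t0, b0) in first.items():
        t1, b1 = last[node]
        if t1 > t0:
            per_node[node] = (b1 - b0) / (t1 - t0) / 1e6
    return per_node, sum(per_node.values())


if __name__ == "__main__":
    sample = [
        "0 101 ldc01 LDC 0",
        "10 101 ldc01 LDC 270000000",
        "0 101 ldc02 LDC 0",
        "10 101 ldc02 LDC 280000000",
        "0 101 gdc01 GDC 0",
        "10 101 gdc01 GDC 550000000",
    ]
    per_node, aggregate = rates_mb_s(sample, "LDC")
    print(per_node, aggregate)
```

Keeping the time stamp and run number in every record matches the identification requirement listed for ADC III monitoring and lets the individual and aggregate figures be derived from the same data.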
Selected DATESTAT Results
• Result 1: DATE standalone run, equal subevent size
• Result 2: Dependence on subevent size
• Result 3: Dependence on the number of LDC/GDC
• Result 4: Full chain, ALICE-like subevents
Result 1/1: DATE standalone
11 LDC x 11 GDC nodes, 420...440 KB subevents, 18 hours
• Aggregate rate: 304 MB/s
• Volume: 19.8 TB (4E6 events)
Result 1/2: DATE standalone
11 LDC x 11 GDC nodes, 420...440 KB subevents, 18 hours
• LDC rate: 27.1 MB/s
• LDC load: 12% user, 27% sys
Result 1/3: DATE standalone
11 LDC x 11 GDC nodes, 420...440 KB subevents, 18 hours
• GDC rate: 27.7 MB/s
• GDC load: 1% user, 37% sys
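The per-node rates of Result 1 can be checked against the quoted aggregate rate with a few lines of arithmetic. The sketch assumes that the quoted LDC and GDC rates are representative of all 11 nodes of each kind, which the slides do not state explicitly; the helper names are illustrative.

```python
def aggregate_mb_s(per_node_mb_s: float, nodes: int) -> float:
    """Aggregate rate if every node runs at the same per-node rate."""
    return per_node_mb_s * nodes


def relative_deviation(value: float, reference: float) -> float:
    """Signed deviation of value from reference, as a fraction of reference."""
    return (value - reference) / reference


if __name__ == "__main__":
    quoted_aggregate = 304.0  # MB/s, from Result 1/1
    ldc_total = aggregate_mb_s(27.1, 11)  # 11 LDCs at 27.1 MB/s each
    gdc_total = aggregate_mb_s(27.7, 11)  # 11 GDCs at 27.7 MB/s each
    print(f"LDC side: {ldc_total:.1f} MB/s "
          f"({relative_deviation(ldc_total, quoted_aggregate):+.1%})")
    print(f"GDC side: {gdc_total:.1f} MB/s "
          f"({relative_deviation(gdc_total, quoted_aggregate):+.1%})")
```

Both sides come out within about 2% of the quoted 304 MB/s, which is consistent with traffic spread evenly over the 11 x 11 configuration.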
Result 2: Dependence on subevent size, DATE standalone
13 LDC x 13 GDC nodes, 50…60 KB subevents, 1.1 hours
• Aggregate rate: 556 MB/s
Result 3: Dependence on #LDC/#GDC, DATE standalone
Gigabit Ethernet:
• max. 30 MB/s per LDC
• max. 60 MB/s per GDC
Result 4/1: Full chain
20 LDC x 13 GDC nodes, ALICE-like subevents, 59 hours
• Aggregate rate: 87.6 MB/s
• Volume: 18.4 TB (3.7E6 events)
Result 4/2: Full chain
20 LDC x 13 GDC nodes, ALICE-like subevents, 59 hours
• GDC rate: 6.8 MB/s
• GDC load: 6% user, 23% sys
Result 4/3: Full chain
20 LDC x 13 GDC nodes, ALICE-like subevents, 59 hours
• LDC rate: 1.1 MB/s (60 KB subevents, Fast Ethernet)
• LDC load: 0.8% user, 2.7% sys
Grand Total
• Maximum throughput in DATE: 556 MB/s for symmetric traffic, 350 MB/s for ALICE-like traffic
• Maximum throughput in full chain: 120 MB/s without migration, 86 MB/s with migration
• Maximum volume per run: 54 TB with DATE standalone, 23.6 TB with full chain
• Total volume through DATE: at least 500 TB
• Total volume through full chain: 110 TB
• Maximum duration per run: 86 hours
• Maximum events per run: 21E6
• Maximum subevent size: 9 MB
• Maximum number of nodes: 20x15
• Number of runs: 2200
Summary
• Most of the ADC III goals were achieved
  • PC/Linux platforms are stable and reliable
  • Ethernet technology is reliable and scalable
  • DATE standalone is running well
  • Full chain needs to be further analyzed
  • Next ALICE Data Challenge in the 2nd half of 2002
• Online Performance Monitoring
  • DATESTAT prototype performed well
  • Helped to spot bottlenecks in the DAQ system
  • The Zagreb team is re-designing and re-engineering the DATESTAT prototype
Future Work
• Polling agent
  • obtain performance data from all components
  • keep the agent simple, uniform, and extendable
  • support several platforms (UNIX, application software)
• Transport & Storage
  • use communication with low overhead
  • maintain a common format in a central database
• Processing
  • apply efficient algorithms to filter and correlate logged data
  • permanently store performance results in a database
• Visualization
  • use a common GUI (Web-based, ROOT objects)
  • provide different views (levels, time scale, color codes)
  • automatically generate plots, histograms, reports, e-mail, ...