Software and Grid Computing Challenges for LHC: Insights from the ATLAS Collaboration

Software & GRIDComputing for LHC(Computing Work in a Big Group) Frederick Luehring luehring@indiana.edu February 10, 2011 P410/P609 Class Presentation

Outline • This talk will have three main sections: • A short introduction about the LHC and ATLAS • LHC/ATLAS serves as an example of large scale scientific computing • Writing software in large collaboration • Processing a large datastream P410/P609 Presentation

Hal’s Motivation for this Talk • Hal wanted me to tell you about how a large scientific collaboration deals with data processing to give you that perspective on scientific computing. • Of course even a large experiment has individuals writing individual algorithms just as a small experiment does. • However in the end for the large collaboration, some process has to glue all of those individual algorithms together and use the resulting code to analyze data. P410/P609 Presentation

A Little About LHC & ATLAS P410/P609 Presentation

What are LHC & ATLAS? • LHC is the “Large Hadronic Collider” which is the largest scientific experiment ever built. • This is the accelerator that produces the energetic proton beams that we use to look for the new physics discoveries • ATLAS (A Toroidal LHC ApparatuS) is one of four experiments at LHC and one of the two large general purpose experiments at CERN. • The ATLAS detector will produce Petabytes of data each year it runs. • Billions of Swiss Francs, Euros, Dollars, Yen, etc. have been spent to build LHC and ATLAS. P410/P609 Presentation

LHC Project • Atlas is one of two general purpose detectors at LHC (CMS is the other detector). P410/P609 Presentation

Atlas Detector Numbers are electronic channel count 1.2x106 190,000 3,600 1.45x108 10,000 P410/P609 Presentation

Atlas Detector Construction Notice the size of the man. P410/P609 Presentation

RAW 1.6MB/event ESD Event Summary Data .5MB/event AOD Analysis object data .1 MB/event TAG Event=level metadata tags 1 KB/event DPD Derived physics data N-tuple type, user based 40 MHz (80 TB/sec) level 1 (hardware trigger) O(75) KHz (150 GB/sec) level 2 (RoI analysis) O(2) KHz (4 GB/sec) level 3 (Event Filter) 200 Hz O(400) MB/sec Data Producer P410/P609 Presentation

Writing Software in a Collaboration P410/P609 Presentation

Collaborative SW Building in ATLAS • There are just over 800 people registered and about 400 active ATLAS software developers (3/2009). • These developers are located in ~40 countries world-wide and using their code, ATLAS must build timely, working software releases. • Relatively few computing professionals – most people are physicists. • Democratic coordination of effort. • No way to fire anyone… • Central SVN code repository at CERN (US mirror) • There are ~7 Million Lines of Code in ~3000 packages have been written mainly in C++ and Python (also Fortran, Java, PERL, SQL). • The main platform is Linux/gcc (SLC5 which is basically RHEL5) • Had to close parts of our computing infrastructure to the world (much info no longer findable using Google). P410/P609 Presentation

Collaborative SW Building in ATLAS • The ATLAS Software Infrastructure Team (SIT) provides the infrastructure used by these developers to collaboratively produce many versions of the ATLAS software. • There have been 16(!) major releases of the ATLAS SW since May of 2000 and hundreds(!) of smaller releases. • Note: ATLAS shares lots of code between the online (Trigger/DAQ) software and the offline software. • This talk is mainly about the offline portion of the software. • Online code is the software used to select which 200-400 beam crossings per second will be written our for further analysis (out of 40 million bunch crossings per second). • This sharing of offline and online code is unusual and results from design of the ATLAS Trigger. Much of the trigger is run on commodity PCs instead of specially built electronics. P410/P609 Presentation

Membership of the SIT • The ATLAS Software Infrastructure Team is made up of ~30 people. • There are only a few people working full time on the SIT activities and most people are volunteers working part time. • Many of the SIT members are computing professionals with a few physicists & technical physicists included. • Those ~30 people amount to ~12 Full Time Equivalents (FTEs) • Just like the developers, the SIT members are spread out worldwide with a core of about 4-5 people at CERN. • Most SIT members receive only service work credit. • Quarterly meetings at CERN, bi-weekly phone calls and huge numbers of emails are used to coordinate the work. P410/P609 Presentation

Software Versions/Diversity • Beyond the large number of developers and the large amount of code, the SIT is supporting many versions of the software. • Each community (e. g. Simulation, Reconstruction, Cosmic Running, Monte Carlo production) can & do request builds. • This puts a large stress on the SIT because often (usually!) priorities conflict between the various groups. • Various meetings are held to define the priorities. • To aid in managing this diversity, the software is partitioned into ~10 offline SW projects (light blue on diagram). • This allows the SIT to separate work flows and to control dependencies. P410/P609 Presentation

A Few Words about the Online Code • The online code must be written to a high standard. • If the online code is wrong, there is a risk of not recording data about the physics events of interest. Events not selected by the trigger are lost forever and cannot be recovered. • The trigger has multiple levels and the lowest level is executed millions of times per second. While the first use of offline-developed code actually begins at the second trigger level which “only” runs 75 kHz, even a tiny memory leak per execution is an enormous problem. • When memory is exhausted, the trigger algorithm must be restarted and this takes a significant amount of time. • There is a very strict testing and certification process for the online code and this has a direct effect on any offline code that is used in the online. P410/P609 Presentation

Offline Code Release Strategy • We have had to significantly modify our code release building strategy. • Previously we had nightly builds every night, development builds every month, production releases every six months, and bug fix production releases as needed. • Typically there were at most two releases operating at one time: one for fixing bugs in the current production release and one that was developing new functionality for the next production release. • Now there are ~35 releases (many built opt & debug). • As mentioned before these releases are for simulation, central data reprocessing, migration to new external software, etc. • The main development, production, and bug fix releases are copies of nightly builds. • There are also several releases producing patches. These patch releases replace buggy code on the fly as the software starts up. • 44 servers (220 cores) are used to build the releases and do basic tests every night! P410/P609 Presentation

Multiple Releases and Coordinators • The needs of the many user communities and the need to validate new versions of external software (e.g. ROOT, GEANT) have lead to a intricate release coordination system to manage the multiple releases. • Each release has its own release coordinator who: • Controls what tags (new versions of SW packages) are allowed in the release • Checks the relevant validation tests each day • Moves new code from a “validation” (trial) state to a “development” (candidate) state to being in the final release. • Tracks problems along the way with bug reports. • There are also two release builders on shift to build the releases once the OK to build a release is given. • For important releases, many people may have to signoff while for minor releases, the release coordinator just asks the shifters to build it. • We hold weekly global release coordination meetings to track issues needing resolution before the pending releases can be built. • These meetings also look at the relative priority of the pending releases. P410/P609 Presentation

Multiple Releases and Coordinators • Even with aggressive release coordination and extensive validation, it takes time and multiple iterations of the building cycle to get a final production version of the software. • Two mechanisms exist for adding adjustments and corrections to stable releases: • Patch releases that allow the software to patch itself at runtime (does not require a full rebuild of the release). • Bug fix releases that are full releases build rolling up the patches. • It is a constant struggle to meet the release schedule. • This is our toughest issue. • In the end you can’t cut a software release until all of the bugs are fixed and nightly test suite passes. P410/P609 Presentation

Coding • ATLAS does have a set of coding standards. • Originally there was an ATLAS developed system to automatically check compliance the coding rules. • This system was discarded as being too hard to maintain. • Constant tweaking was required to get it to work correctly. • ATLAS still does have a naming convention for packages and publically viewable objects. • Even though this is essentially enforced by an honor system, most developers comply with it. • The release coordinators manually find compiler warnings and get the developers to fix them. • A large fraction of compiler warnings indicate real problems. • Floating point problems and unchecked status codes generally lead to fatal terminating the program. P410/P609 Presentation

User Coding • The final layer of code which is generally written by the end-user (i.e. physicists) and is often written to a lower a standard of coding. • This is a real problem because this code is used to make the results tables and plots that go into journal articles. • Often this coding is done under extreme time pressure because high energy physics is very competitive. • To help protect against problems, a tool set (the Physics Analysis Tools) have been developed. Not everyone uses the official tools and some user code is still required even if you are using the official tools. • There is a guidebook (“The Analysis Workbook”) contains numerous examples of how to write good analysis code. • A lengthy period of review and interrogation of any result is required before it can be published. P410/P609 Presentation

Software Release Tools • The SIT uses a number of tools to build, test, and distribute the SW: • SVN – the code repository that holds the code submitted by the developers • Tag Collector (TC) – manages which software versions are used in the release • CMT – manages software configuration, build, and use & generates the makefile • NICOS – drives nightly builds of the ATLAS software making use of a variety of tools • Selected nightlies become official releases • Four automatic systems validate the software builds • CMT, CMT-based packaging tools, Pacman - create & install the software distribution kit. • In addition to the above tools, we also use Doxygen to document our code automatically and 3 workbooks to teach developers/users how to use the software. “Bedrock” base allowing the management of many developers and releases The control framework of the software building engine Automatic Validation SVN Easy to use distribution tools P410/P609 Presentation

SW Validation • ATLAS uses several tools to validate the building and the installation of its software. • AtNight (ATN) testing is part of the nightly builds and runs over 300 “smoke” tests each night. • Full Chain Testing (FCT) checks the entire software chain. • Run Time Tester (RTT) runs a long list of user defined tests against all of the software projects for many builds daily. • Tier 0 Chain Testing (TCT) is used to check that the ATLAS software is installed and running correctly during data taking. • Kit Validation (KV) is a standalone package that is used to validate that the ATLAS software is installed and functioning correctly on all ATLAS production sites (Tier 1/2/3). • The release coordinators actively monitor each day’s test results to see if new SW can be promoted to candidate production releases and check that the candidate production release is working. P410/P609 Presentation

SW Validation: Use of RTT • The RTT is the primary ATLAS SW validation system. • The RTT runs a significant number of validation tests against 10 software builds each day. • The RTT runs between 10 & 500 test jobs against a release. • The test jobs can run for several hours and test a package intensively. • Currently the RTT runs ~1300 test jobs each day. • The RTT uses a pool of 12 cores to schedule the test jobs and 368 cores to run the test jobs. • Additional scheduling cores are currently being brought online. • The RTT also needs a small disk pool of 1.4 TB. • The RTT system is highly scalable: if more tests are needed or more releases need testing, the RTT team can (and has!) added more nodes to the pool of servers used by the RTT system. P410/P609 Presentation

ATLAS Code Distribution • ATLAS has used the Pacman open source product to install the offline software for the last several years. • The installation uses pacman caches and mirrors. • Recently we begun using “pacballs” to install the software. A pacball is: • A self-installing executable. • Stamped with an md5sum in its file name. • The pacballs come in two sizes: • Large (~3 GB) for full releases (dev, production, & bug fix). • Small (~100 MB) for patches that are applied to full releases. • Pacballs are used to install the software everywhere from individual laptops to large production clusters. • ATLAS will now probably go an rpm based install. P410/P609 Presentation

Documentation • ATLAS documents its code using Doxygen. • All developers are expected to put normal comments and special Doxygen comments into their source code. • Once a day the Doxygen processor formats the comments and header information about the classes and members into web pages. • Twiki pages are then used to organize the Doxygen-generated web pages into coherent content. • There are other ways to see the Doxygen pages including an alphabetical class listing and links from within the tag collector. • ATLAS now has three “workbooks”: • The original introductory (getting started) workbook. • A software developers workbook. • A physics analysis workbook. • ATLAS has had three coding “stand-down” weeks to write documentation. • Significant time/effort has been spent to review the documentation. • In spite of this, the documentation still needs more work. P410/P609 Presentation

Summary of Collaborative SW Building • The SIT has adapted our tools and methods to the growing demand to support more developers, releases, validation, and documentation. • Even though the SIT is small, it provides effective central support for a very large collaborative software project. • Producing working code on a schedule has been difficult but we have learned what is required to do the job: • Strong tools to ease the handling of millions of lines of code. • Lots of testing and code validation. • Lots of careful (human) coordination. • We are continue to proactively look for way to further improve the ATLAS software infrastructure. P410/P609 Presentation

Processing Peta Bytes of Data • Hint: • 1 PB = 1,000 TB = 1,000,000 GB • Or in some cases: 1 PB =1,024 TB = 1,048,576 GB P410/P609 Presentation

Large Scale Data Production • Many scientific endeavors (physics, astronomy, biology, etc.) run a large data processing systems to analyze the contents of enormous datasets. • In these cases it is generally no longer possible for a single site (even a large laboratory) to have enough computing power and enough disk storage to hold and process the datasets. • This leads to the use of multiple data centers (distributed data processing) and having to split the data over multiple data centers. • Currently many experiments pre-position the data split over various data centers and once the data is positioned “send the job to the data” instead of “sending the data to the job”. • ATLAS disconinued most pre-positioning when it was discovered that much of the pre-positioned data was never used. P410/P609 Presentation

The Scale of LHC Data Production • The LHC beams cross every 25 ns. Each bunch crossing will produce ~23 particle interactions when the design luminosity of 1034 cm-2s-1 is achieved. • Initial LHC operation is at much lower luminosity (~1030 - 1032 cm-2s-1). • The Atlas trigger selects 200 or even 400 bunch crossings per second for full read-out. • The readout data size ranges from 1.7 MB/event initially to about 3.4 MB/event at full luminosity. • When LHC is fully operational, the LHC beams will be available to the experiments for ~107s/year. • Simple arithmetic gives 3.4 to 6.8 PB of readout data a year (depending on the exact luminosity). P410/P609 Presentation

Data Production Scale (continued) • However, to produce publishable results additional data is needed. • A Monte Carlo simulation dataset of comparable size to the readout (RAW + summary) data is required. • The Atlas detector has a huge number of channels in each subsystem. Most of the detectors require substantial amounts of calibration data to produce precision results. • All of this puts the amount of data well above 5 PB even in the first full years of running. P410/P609 Presentation

What is the Computing Model? • The Atlas computing model is the design of the Atlas grid-based computing resources. The computing model contains details such as: • Where the computing resources will be located. • What tasks will be done at each computing resource. • How much CPU power is needed. • How much storage is needed. • How much network bandwidth is needed. • All of these items must be specified as a function of time as the machine ramps up. • Much of the computing model has had to be modified on the fly as problems have arisen during the initial running. P410/P609 Presentation

Multi-Tiered Computing Model • The computing model used by the LHC experiments is a multi-tiered model that takes advantage of the large number of geographically distributed institutes involved and the availability of high speed networks. • The model maximizes the usage of resources from all over the world. Many of the resources are “free” to use. • The model makes it easier for the various funding agencies to contribute computing resources to the experiments. • Atlas has ~240 institutes on 5 continents. • The model is made possible by using the grid. • The grid allows the efficient use of distributed resources. • The grid also reflects the political reality that funding agencies like to build computer centers in their own country. P410/P609 Presentation

~80-160 TB/sec Event Builder 4-8 GB/sec Event Filter~1.5MSI2k • Some data for calibration and monitoring to institutes • Calibrations flow back 400-800 MB/sec Tier 0 T0 ~5MSI2k HPSS ~ 300MB/s/T1 /expt Tier 1 UK Regional Center (RAL) US Regional Center - BNL Italian Regional Center French Regional Center HPSS HPSS HPSS HPSS Tier 2 Tier2 Center ~200kSI2k Tier2 Center ~200kSI2k Tier2 Center ~200kSI2k Tier2 Center ~200kSI2k Lancaster ~0.25TIPS Liverpool Manchester Sheffield Physics data cache 100 - 1000 MB/s Desktop(Tier 3) Workstations This is an old diagram, so the number are out of date. Image supplied by Jim Shank. The Atlas Data Flow • ~9 PB/year • No simulation • ~2 MSI2k/T1 • ~2 PB/year/T1 • ~200 TB/year/T2 Massive, distributed computing is required to extract rare signals of new physics from the ATLAS detector P410/P609 Presentation F. Luehring Pg. 33

A Physicist’s View of Grid Computing • A physicist working at LHC starts from the point of view that we must process a lot of data. • Raw experimental data (data from the particle interactions at LHC). ~3-5 PB/year • Simulated (Monte Carlo) data needed to understand the experiment data. More PB/year • Physicists and computing resources are all over the world. • We are not Walmart and our funding is small on the scale of the data. This means: • We must use all of our own computing resources efficiently. • We must take advantage of anyone else’s resources that we can use: Opportunistic Use. (e.g. IU’s own Quarry) • Funding agencies like grids because they use resources efficiently. P410/P609 Presentation

Grid Middleware • Grid middleware is the “glue” that binds together the computing resources which are all over the world. • Middleware is the software between the OS and the application software and computing resources. • Careful authentication and authorization are the key to running a grid. • Authentication is knowing for sure who is asking to use a resource and is done with an X.509 certificate. • Authorization is knowing that an authenticated user is allowed to to use your resources. • A number of projects write grid middleware. • Perhaps the best known is the Globus project. • Grid middleware is often packaged by projects like VDT. • VDT stands for Virtual Data Tool Kit. P410/P609 Presentation

Physical Parts of a Grid • Compute Elements (CEs) • Usually quoted in numbers of “cores” now that a CPU chip has more then one processor it. • Storage Elements (SEs) • Mostly disk but lots of tape for preserving archival copies of event and Monte Carlo data. • Lots of high speed networks • Data is distributed exclusively using networks. Shipping physical containers such as data tapes to a site never happens (at least for ATLAS). P410/P609 Presentation

Grid use at ATLAS • ATLAS uses three grids: • EGEE/LCG based in Europe • Open Science Grid (OSG) based in the US • Nordugrid based in Scandinavia • Resources scale available to ATLAS currently: • ~30,000 CPUs (cores in our jargon) • Several PB of disk and tape • High speed (up to 10 GBps links) networking. • Everything runs on Linux • Things will need to grow to keep up with data production • The resources are arranged hieratically in tiers: • CERN is Tier 0, National Resources Tier 1, Regional Resources Tier 2, and Individual Institutes Tier 3 P410/P609 Presentation

The Result of the Process • In the end we filter the huge data set down to data from a few interesting particle collisions and make plots and print scientific papers. • There are multiple levels data “distillation” and filtering to get down to a data set that a physicist can process on his laptop or local clusters. • At all stages of filtering the worry is always that you will accidentally remove the interesting data. • This is data mining on a grand scale. P410/P609 Presentation

ATLAS Production System • ATLAS uses a own system called Panda to run jobs (both production and user analysis) on the grid. • Panda uses a “pilot” system where test jobs are periodically sent to each site to look for unused cores. When an available core is found, the input data is loaded on a local disk of the selected server and the pilot starts the actual job. • The input data must be present on the site for the job to run. This has been found to greatly reduce the number of failures compared to transferring the data from another site after the job starts. • Panda has a central database, operates using central servers, and has a monitoring system so that users can follow the progress of their jobs. • Human input plus software algorithms are used to set the relative priorities of jobs. • A large team of people creates & submits the jobs, monitors the system for hardware & software failures, and assists users. P410/P609 Presentation

Scale of ATLAS Data Production • ATLAS processes data through out the world. • All sites also contribute personnel to the effort. • In addition there are central teams in most countries and CERN that contribute personnel. • There are perhaps 20,000 cores worldwide assigned to run ATLAS jobs. • On an average day, ATLAS runs ~100,000 - ~150,000 jobs of which a few percent fail. • Keep this machinery running and up to date with current, validated software is a real challenge. • Occasionally we run out of work while waiting for new software versions. • Typical tier 0 & 1 run event data analysis, tier 2 run Monte Carlo and all levels 0-3 run user analysis jobs. P410/P609 Presentation

User Analysis Scenario • It is possible to work efficiently at a Tier 3 or a laptop: • Subscribe an interesting dataset to the Tier 2 • Use data retrieval to get the sample data onto the Tier 3 • Develop/test your software package on the Tier 3 • Run your Monte Carlo locally on the Tier 3 (if needed!) • Then use pathena (Panda Athena) to send large numbers of data analysis jobs to a Tier 2 or Tier (wherever the data is) • Process DP3Ds and ntuples on the Tier 3 or laptop. • High energy physics uses data analysis tool ROOT • Used interactively and in batch to process final ntuples into final plots. • Used throughout the hep community. • Based object-oriented (OO) design and interface. • Notice we are back to working as a single individual. P410/P609 Presentation

Summing Up Data Processing • Datasets are too large to process at one site. • The Grid is used to process the data over large number of geographically distributed data sites. • There is a hierarchy of tiered sites using to process the data. • The processing of the data requires the participation of lots people spread throughout the world. • After a lot of work, the distributed data processing system is working reasonably reliably. • Users are now able to effectively do data analysis in this environment. • No doubt further refinements will be required. P410/P609 Presentation

The Future of Data Processing • The current hot buzz words are: • “Cloud” which is a sort of commercial computing grid run the big internet companies: Amazon, Yahoo, Google, etc. The cloud is a sort of black box for computing. • This has the potential to be much cheaper than the grid but issue remain about whether the cloud can deal with the enormous datasets required by some physics calculations. • The advantage of the grid is that many (but not all) of the servers are owned physics/academic projects and can be tuned as needed. • “Virtual Machine” which is a software/middleware layer that allows a server running one OS version to run jobs written in another OS or version of the OS. • This virtualization layer used extract a huge overhead of wasted CPU cycles but modern virtual machine software uses a few percent of the CPU time on the target machine. • An obvious result of the use of virtualization is that it is now possible to run physics codes written in c++/gcc/Linux on Windows servers. P410/P609 Presentation

The Future (continued) • “GPUs” or graphical processing units (essentially graphics cards) can be used to process certain types of physics data. This amounts to using a graphics card a computer. • People have also played with using gaming consoles because they are sold below their fabrication cost (the money for the game makers is in the game licensing fees and game sales). • “SSDs” or Solid State Drives are under consideration to replace hard (disk) drives because they provide significant improvements in the speed of access to stored data and are physically small. • Various technical and cost issues have to be worked out before SSDs become practical. • Personal experience with putting heavily used databases on SSDs shows great potential. P410/P609 Presentation

Software and Grid Computing Challenges for LHC: Insights from the ATLAS Collaboration