
Databases & Metadata Progress



Presentation Transcript


  1. Databases & Metadata Progress
  Elizabeth Gallas - Oxford
  Software and Computing Workshop, Friday Plenary Session, October 25, 2013

  2. CHEP 2013: Database applications, operations & metadata related contributions
  • Alastair: Squid monitoring tools - a common solution for LHC experiments
  • Graeme: ATLAS Job Transforms: A Data Driven Workflow Engine
  • Elizabeth: Utility of collecting metadata to manage a large scale conditions database in ATLAS
  • Jerome: Looking back on 10 years of the ATLAS Metadata Interface
  • Jack: The future of event-level information repositories, indexing and selection in ATLAS
  • Lasha: A Tool for Conditions Tag Management in ATLAS
  • Andrea: A J2EE based server for Muon Spectrometer Alignment monitoring in the ATLAS detector
  • Dario: The ATLAS EventIndex: an event catalogue for experiments collecting large amounts of data
  • Laura: Experiment Dashboard Task Monitor for managing ATLAS user analysis on the Grid
  • Peter: Next-Generation Navigational Infrastructure and the ATLAS Event Store
  • Alex: ATLAS Nightly Build System Upgrade
  • Carlos: Integrated framework for data quality assessment and DB management for the ATLAS Tile …
  • Charilaos: DCS Data Viewer, an application that accesses ATLAS DCS historical data
  Plus many ADC applications based on databases (AGIS, Panda, Task Management, Rucio, ProdSys), including:
  • Gancho: Next generation database relational solutions for ATLAS distributed computing

  3. Database Session Agenda
  • Intro: announcements and short reports on Frontier and TAG Operations/Services
  • EventIndex
  • Workshop: Tuesday afternoon this week (http://indico.cern.ch/event/272494)

  4. Database Operations Reports (CERN IT, ATLAS DBA): Marcin Blaszczyk, Luca Canali, Gancho Dimitrov
  • Stable services on all RAC clusters
  • Good communication between DBAs and application experts
  • New servers, storage and software upgrade: Q1 of 2014
  • Cycle hardware (new for production, current to standby, etc.)
  • Software version strategy: depends on validation in progress
  • Application side:
  • PVSS re-organization: move to index organized tables
  • DSS (Detector Safety System): move to online DB
  • JEDI tables in place alongside Panda in production
  • So far: managed 108 tasks; will increase as ProdSys2 moves in
  • Rucio: considerable refinements in design for optimization of tables and tablespaces, separating attribute, fact, transient, and historical data
  • Full scalability test on new ADCR hardware in November 2013
  • Can use it to test software releases and upgrades too

  5. COOL and CORAL updates from Andrea Valassi (last report was in 2012)
  • Details: slides and TWiki PersistencyReleaseNotes
  • CORAL: improved handling of network glitches
  • CoralServerProxy: address HLT crash issues, other connection handling
  • Platforms and infrastructure:
  • CVS → SVN; support on gcc 4.8 with C++11, clang 3.3, icc 13
  • To do: CMT → CMake, Savannah → Jira, Quattor → Puppet
  • Ongoing progress using various utilities: static code and memory analysis, time profiling
  • Investigating:
  • Oracle client 12.1.0.10
  • Kerberos authentication
  • COOL performance validation on Oracle 12c server: "disappointing"
  • COOL vector payload (replacing CoraCool): ATLAS SCT online actively testing using nightlies
  • Port to ROOT6 (pyCOOL)

  6. LS1 Conditions Database Progress (1): M.Borodin (A.Artamonov), A.Formica, E.Gallas, D.South (V. Radescu)
  • Regular reports in meetings: LS1 Conditions DB Weekly and Data Prep Coordination
  • Issues and progress:
  • Considerable experience and tool development over Run 1 in the COOL Tag Coordination area
  • Maintain hopes of single BK Global Tags for Run 1 (2010-2013); progress subsystem by subsystem
  • MC: run-dependent conditions now possible in recent software releases, thanks to John Chapman
  • A.Formica's preliminary IOV completeness developments useful → plans to aggregate into COMA for use by the CTB (Cool Tag Browser); a minimal completeness-check sketch follows this slide
  • COMA metadata content:
  • Now stores "Current" and "Next" Cool Tag State designations → new AMI based entry interface, similar to Data Periods entry
  • Expand metrics to incorporate the IOV completeness info mentioned above, plus other useful bookkeeping on COOL updates and other metrics
  • Goal: help the CTB (Cool Tag Browser) towards better functionality and usability via:
  • Providing optimized data sources to suit the desired functionality
  • New COMA content, PL/SQL and RESTful services
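The IOV completeness idea can be illustrated with a small hypothetical sketch (not A.Formica's actual tool): given the IOVs stored for a folder tag, check whether they cover a run's full interval without gaps. The IOV representation and the example range are assumptions for illustration only.

```python
# Hypothetical sketch of an IOV completeness check; not the actual COMA/CTB code.
# An IOV is modelled as a (since, until) pair, e.g. in run/lumi-block units.

def iov_gaps(iovs, range_start, range_end):
    """Return the sub-ranges of [range_start, range_end) not covered by any IOV."""
    gaps = []
    cursor = range_start
    for since, until in sorted(iovs):
        if since > cursor:                 # uncovered stretch before this IOV
            gaps.append((cursor, min(since, range_end)))
        cursor = max(cursor, until)
        if cursor >= range_end:
            break
    if cursor < range_end:                 # tail of the range left uncovered
        gaps.append((cursor, range_end))
    return gaps

# Example: IOVs for one folder tag over a run spanning luminosity blocks 1-500
iovs = [(1, 120), (120, 300), (350, 500)]
print(iov_gaps(iovs, 1, 500))              # -> [(300, 350)]: the tag is incomplete
```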

  7. LS1 Conditions Database Progress (2): M.Borodin (A.Artamonov), A.Formica, E.Gallas, D.South (V. Radescu)
  • Run 1 instance unwieldy (table volume issues and obsolete definitions confound management)
  • Creation of a new instance for Run 2:
  • Migration of data underway in INT8R
  • Adapt Athena and subsystem code accordingly
  • Progress moving some external references (POOL) to inline storage
  • Specifically online, but also useful offline
  • A. Artamonov: investigating the best course folder by folder, in coordination with subsystem experts* (*when available)
  • Improvements underway for AtlCool* tools:
  • "Smart clone" functionality will reduce data volume, using Current/Next designations from COMA (see the sketch after this slide)
  • New UPD type Folder Tags needed to enable LAr-only processing of "noise bursts" before ES1
  • Optimizations to the CondDsMgr (POOL files in DDM)
  • Additional issues: SLC6 and Rucio migration
  • Address size and distribution of DB Releases (exclude from SW kit)
  • Adjust the model for CVMFS deployment, taking into account the needs of HPC and some cloud resources
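As a rough illustration of the "smart clone" idea, here is a hypothetical sketch, not the actual AtlCool* implementation: when cloning a global tag, keep only folder tags that COMA designates as Current or Next, so obsolete folder tags are not copied into the new instance. The data structures, folder paths and tag names below are illustrative assumptions.

```python
# Hypothetical sketch of "smart clone" tag filtering; not the actual AtlCool* tool.
# A global tag is modelled as {folder_path: folder_tag}; COMA designations are
# modelled as the set of folder tags marked "Current" or "Next".

def smart_clone(global_tag, coma_designated):
    """Return (kept, dropped) folder tags for a reduced clone of a global tag."""
    kept, dropped = {}, {}
    for folder, folder_tag in global_tag.items():
        if folder_tag in coma_designated:
            kept[folder] = folder_tag
        else:
            dropped[folder] = folder_tag    # obsolete: not copied to the new instance
    return kept, dropped

# Illustrative-only folder paths and tag names
global_tag = {
    "/LAR/BadChannels": "LARBadChannels-RUN1-01",
    "/SCT/DAQ/Config":  "SCTConfig-OLD-00",
}
coma_designated = {"LARBadChannels-RUN1-01"}   # tags marked Current or Next in COMA
kept, dropped = smart_clone(global_tag, coma_designated)
print(kept)      # folder tags cloned
print(dropped)   # folder tags left behind
```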

  8. Frontier Operations
  • Frontier server status: continuous smooth operations!
  • CHEP: "Squid monitoring tools - a common solution for LHC experiments", a joint contribution from ATLAS, CMS and CERN IT experts: https://indico.cern.ch/contributionDisplay.py?contribId=63&sessionId=9&confId=214784
  • ADC Operations Session:
  • Significant improvements in Frontier site configuration: "Auto-Setup validation" talk by Alastair Dewhurst, Alessandro deSalvo, Alessandro diGirolamo, I Ueda
  • https://indico.cern.ch/materialDisplay.py?contribId=25&sessionId=6&materialId=slides&confId=210658
  • Lots of improvements to Squid monitoring from Laura Sargsyan: details in Alessandra Forti's "Monitoring Status and Plans" talk
  • https://indico.cern.ch/materialDisplay.py?contribId=24&sessionId=6&materialId=slides&confId=210658
  • Work beginning to move Frontier boxes at CERN to SL6 and Puppet (configuration management)

  9. Frontier Squid Log Parsing
  • Useful to understand sources of load when anomalous usage is observed
  • Caveat: must analyse the logs before they are "purged" → within a few days
  • A script extracts the PanDA ID from squid logs to identify which tasks are causing load (a minimal sketch follows this slide)
  • Alex Hamilton (summer student at RAL): https://indico.cern.ch/getFile.py/access?contribId=2&resId=0&materialId=slides&confId=277040
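A minimal sketch of the kind of log parsing described here, not Alex Hamilton's actual script: it assumes the squid access log carries a client identification token containing the PanDA ID somewhere on each line, and simply counts hits per extracted ID. The regular expression and the token format are assumptions.

```python
# Hypothetical sketch of squid access-log parsing to rank PanDA IDs by request count.
# Assumes each line carries a token such as "pandaID=1234567890"; the exact
# field name and layout are assumptions, not the real ATLAS log format.
import re
import sys
from collections import Counter

PANDA_ID = re.compile(r"pandaID[=:]\s*(\d+)")   # assumed token format

def count_panda_ids(log_lines):
    counts = Counter()
    for line in log_lines:
        match = PANDA_ID.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    with open(sys.argv[1]) as log:
        for panda_id, hits in count_panda_ids(log).most_common(10):
            print(f"{panda_id}\t{hits}")        # top sources of Frontier/squid load
```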

  10. AMI (Solveig Albrand for the AMI Team)
  • New member: Jerome Odier hit the ground running
  • Provided a JSON parser to check the grammar of job reports; see Graeme's presentation Tuesday (a toy validation sketch follows this slide)
  • LS1: adapting AMI when changes to upstream sources are determined or other requirements known: JEDI, DEFT, ProdSys2 …
  • Some progress: JEDI finished 1 task in September, 7 in October
  • Metadata Review: many implications, for example:
  • Change to AMI tag implementation, train model deployment, dataType field may have a subtype, carriage configuration
  • Good news: the merging step will become transparent
  • New version of pyAMI:
  • Parser improvements, new MC criteria available …
  • Can install pyAMI stand-alone on Linux, Mac and Windows
  • Monitoring: continues to prove useful in growing ways
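To make the idea of "checking the grammar of job reports" concrete, here is a toy sketch (not the actual AMI parser): it verifies that a job report is well-formed JSON and that a few expected keys are present. The key names are purely illustrative assumptions.

```python
# Toy sketch of job-report grammar checking; not the actual AMI JSON parser.
import json

REQUIRED_KEYS = {"exitCode", "files", "resource"}   # illustrative key names only

def check_job_report(text):
    """Return a list of problems found in a job report; an empty list means it passes."""
    try:
        report = json.loads(text)
    except ValueError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(report, dict):
        return ["top-level structure is not a JSON object"]
    return [f"missing required key: {key}" for key in REQUIRED_KEYS - report.keys()]

print(check_job_report('{"exitCode": 0, "files": []}'))   # -> missing "resource"
```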

  11. COMA: Event Counts by Trigger
  • Originates in a use case of the TAG DB: trigger-wise event counts are a requested use case from many in the physics community
  • These counts have been aggregated into COMA per Run, per Stream, per Trigger (reports now in production); a minimal aggregation sketch follows this slide
  • But some counts are missing because datasets failed integrity checks available within the TAG infrastructure
  • Investigations required a broader look into the integrity of upstream processing chains: work with the AMI team, access the AMI DB
  • Determine provenance
  • Get event counts of upstream data products for ALL collections
  • Investigations are ongoing …
  • Some findings on the following slides, based on analysis of:
  • 966 Runs (2009-2013), 17954 datasets analysed (Run/Stream/AMI Tag)
  • Over 6 billion events processed to get counts per trigger (for datasets meeting integrity checks)
  • Ultimately, the goal is to:
  • Understand which collections are important to fix, and then fix them
  • Add aggregated counts over user-specified temporal ranges into Reports
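The aggregation itself is conceptually simple; the sketch below is hypothetical (the COMA production aggregation runs in the database, not in Python) and accumulates per-event trigger decisions into counts keyed by run, stream and trigger chain. The input record layout, run number and trigger names are illustrative assumptions.

```python
# Hypothetical sketch of trigger-wise event counting per (run, stream, trigger);
# the real COMA aggregation is done in Oracle, not in Python.
from collections import defaultdict

def aggregate(event_records):
    """event_records: iterable of (run, stream, [passed trigger chains])."""
    counts = defaultdict(int)
    for run, stream, triggers in event_records:
        for trigger in triggers:
            counts[(run, stream, trigger)] += 1
    return counts

events = [
    (215456, "physics_Muons",  ["EF_mu24i_tight", "EF_mu36_tight"]),
    (215456, "physics_Muons",  ["EF_mu24i_tight"]),
    (215456, "physics_Egamma", ["EF_e24vhi_medium1"]),
]
for key, n in sorted(aggregate(events).items()):
    print(key, n)
```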

  12. Issues in Event Counts by Trigger or Dataset
  • Issues found (in total < 1% of datasets analysed, some important to fix):
  • TAG Catalog inconsistent with TAG upload: being fixed by Qizhi
  • TAG uploads missing: failure to produce or merge TAG files or replicate to CERN (data12 Bphysics stream, Periods L, J, C)
  • AMI counts for upstream products not available (INVALID)
  • Upstream products have file losses
  • TAG upload inconsistent with TAG files: TAG files replaced after upload (data12 delayed streams); in progress: clean-up and re-upload
  • TAG files inconsistent with AOD: TAG file or merging problem from upstream products
  • TAG files inconsistent with RAW (AOD, ESD not available or valid?): data13_hip, many Runs and Streams
  • AOD count inconsistency: additional cross checks added to AMI and some specific datasets fixed
  • AOD event losses: identify when this occurs for users
  • RAW inconsistent with SFO: identify when this occurs for users
  • Thanks: Qizhi Zhang, Gancho Dimitrov, Solveig Albrand, Tier-0, for the underlying methods used to collect the data and for help with repairs

  13. TAG Operations
  Personnel losses:
  • Roman Sorokoletov (DBA) departed last month for http://nxcgroup.com/ in Lausanne
  • Tom Doherty (Glasgow): contract ends ~this month; implemented the TAG skimming service (HI reprocessing), integrating the TAG infrastructure with transforms, Ganga and ProdSys
  Status and plans:
  • Try to maintain existing services and continue upload operations despite decreasing manpower, until the EventIndex successfully reaches a production state:
  • Robust cataloguing and storage of Run 1 and incoming data
  • Production level interfaces accessing that data

  14. EventIndex Contents
  • Discussions in the last 3 months with use case "owners"
  • Event picking use case is clear: query with run number, event number, trigger stream, data format, AMI tag (or future equivalent to identify the processing cycle); a toy lookup sketch follows this slide
  • Old event selection use case is now superseded by the new ASG derivation framework: dropped for the time being
  • Production integrity use case still valid: event counts at different production stages must match!
  • EventServer use case becoming prominent: query with GUID
  • Poster and paper for CHEP on the above basis
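A toy sketch of the event picking lookup described above (hypothetical; the real EventIndex backend is Hadoop-based, not an in-memory dictionary): records are keyed by the attributes listed on the slide and return the GUID of the file holding the event in the requested format. Field names, run numbers and GUIDs are illustrative assumptions.

```python
# Toy sketch of EventIndex-style event picking; not the actual EventIndex service.
from typing import NamedTuple, Optional

class EIKey(NamedTuple):
    run: int
    event: int
    stream: str
    data_format: str       # e.g. RAW, ESD, AOD
    amitag: str            # identifies the processing cycle

# Illustrative index: key -> GUID of the file containing the event
INDEX = {
    EIKey(215456, 123456789, "physics_Muons", "AOD", "f473_m1218"):
        "8A5C2E10-1234-5678-9ABC-DEF012345678",
}

def pick_event(run, event, stream, data_format, amitag) -> Optional[str]:
    """Return the file GUID for the requested event, or None if not indexed."""
    return INDEX.get(EIKey(run, event, stream, data_format, amitag))

print(pick_event(215456, 123456789, "physics_Muons", "AOD", "f473_m1218"))
```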

  15. EventIndex Producer (3)
  [Diagram] Flow of EI information from the pilot transformation to CERN: on the worker node (WN), the algorithm writes the output file, then validates it and extracts the EI info into an EI info file; the EI info is split and sent through ActiveMQ brokers to the EI load server at CERN, where the EI load process stores it in HDFS on the Hadoop cluster.
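The "split and send EI info (ActiveMQ)" step might look roughly like the sketch below, assuming the stomp.py client; the broker host, credentials, queue name and record fields are all placeholders, not the actual ATLAS configuration.

```python
# Hypothetical sketch of sending EI info records to an ActiveMQ broker via stomp.py;
# broker address, queue name, credentials and record layout are placeholders.
import json
import stomp

def send_ei_records(records, host="eibroker.example.org", port=61613,
                    queue="/queue/atlas.eventindex", chunk_size=500):
    conn = stomp.Connection([(host, port)])
    conn.connect("eiwriter", "secret", wait=True)        # placeholder credentials
    try:
        # split the EI info into chunks so each message stays a manageable size
        for i in range(0, len(records), chunk_size):
            chunk = records[i:i + chunk_size]
            conn.send(destination=queue, body=json.dumps(chunk),
                      headers={"persistent": "true"})
    finally:
        conn.disconnect()

records = [{"run": 215456, "event": 123456789, "guid": "8A5C2E10-..."}]
send_ei_records(records)
```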

  16. Hadoop Infrastructure (2)
  • Ideas to be tried in the near future (an inverted-index sketch follows this slide):
  • Store all data in Hadoop in files with a "simple" format (CSV or similar)
  • Create a "catalogue" of data in Hadoop, containing information on where to find the requested information and which tool to use for the given query; the catalogue itself could be in simple files or in HBase
  • Create a clever directory structure in HDFS so that the most common queries can be resolved at the directory level and even a full scan would act on a small number of files
  • Create maps (can be done with Hadoop tools) for the common search items to make searches faster
  • Create "inverted indices" for some of the more complex search items (e.g. trigger chains) to allow fast count and search operations using those items
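To make the inverted index idea concrete, here is a hypothetical Hadoop-streaming style mapper/reducer pair in Python (not the tool planned on the next slide): the mapper reads CSV event records and emits one line per fired trigger chain keyed by the chain name, and the reducer counts events per chain. The CSV column layout is an assumption. It could be driven with the standard Hadoop streaming jar, passing "map" or "reduce" as the script argument.

```python
# Hypothetical Hadoop-streaming mapper/reducer building an inverted index of
# trigger chains -> event counts; the CSV layout (run,event,triggers) is assumed.
import sys

def mapper(lines):
    for line in lines:
        run, event, triggers = line.rstrip("\n").split(",", 2)
        for chain in triggers.split(";"):             # e.g. "EF_mu24i_tight;EF_mu36_tight"
            if chain:
                print(f"{chain}\t{run}:{event}")      # key <tab> value

def reducer(lines):
    current, count = None, 0
    for line in lines:                                # streaming input is sorted by key
        chain, _ref = line.rstrip("\n").split("\t", 1)
        if chain != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = chain, 0
        count += 1
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```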

  17. Hadoop Infrastructure (5)
  • Work plan:
  • Julius Hrivnac will implement "by hand" a tool to create the catalogue and maps
  • Jack Cranshaw will have a look at how to create inverted indices
  • Andrea Favareto will try his set of queries on any new format and also collaborate with Julius and Jack on the infrastructure
  • We should also:
  • Reformat the existing data (from the TAG DB), removing the payload info that we think we won't need in the EI (much smaller records)
  • Copy over from the TAG DB all data11_7TeV (all processing stages) and create in Hadoop the full structure with references to RAW, ESD, AOD (at least) to optimise the directory structure
  • Task still not assigned; volunteers needed
  • We'll discuss progress at our regular weekly meetings and we'll get together mid-December to plan the next steps

  18. Summary and Outlook
  • Progress made on the EI data provider design and on the core architecture for EI storage and retrieval
  • Missing manpower for the development of the "ELSSI equivalent" functionality
  • Additional services will come along once the core system is at least minimally operational
  • No need for additional hardware or servers this year, but we'll use the next few months to estimate the needs of a full production system
  • Next check point mid-December

  19. Final Alert!
  • As announced many times since the June S&C Workshop:
  • Stronger authentication (incl. case sensitive passwords) and limiting external direct DB access (from outside the CERN network) is being finalized
  • Final stage: non-CERN network access to the ADCR database will not be allowed as of 29th October 2013
  • Details in Gancho's talk
  [Diagram: ADCR, ATLR and ATLARC databases on the CERN offline network]
