CERN DB Services: Status, Activities, Announcements Marcin Blaszczyk - IT-DB Replication Technology Evolution for ATLAS Data Workshop, 3rd of June 2014
Recap • Last workshop: 16th Nov 2010 – at that time • We were using 10.2.0.4 • We were installing new hardware to replace RAC3 & RAC4 • RAC8 in “Safehost” for standbys • RAC9 for integration DBs • 11.2 evaluation process • 10.2.0.5 upgrade under planning • Infrastructure for Physics DB Services • Quad-core machines with 16GB of RAM • FC infrastructure for storage (~2500 disks)
Things have changed… • Service evolution • RAC8 in Safehost for standbys installed • Performed in Q3 2010 • To ensure geographical separation for disaster recovery (DR) • New standby installations - one for each production DB • 10.2.0.5 upgrade • Performed in Q1 2011
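The per-production-DB standby installations mentioned above are typically instantiated from the running primary. A minimal sketch of one common 11g approach (RMAN active duplication); the connection strings and options are illustrative, not the actual CERN procedure:

```sql
# Hedged sketch: instantiating a physical standby over the network
# (Oracle 11g "from active database" duplication; names are illustrative).
rman TARGET sys@prod_db AUXILIARY sys@standby_db

RMAN> DUPLICATE TARGET DATABASE FOR STANDBY
        FROM ACTIVE DATABASE
        DORECOVER
        NOFILENAMECHECK;
```

DORECOVER applies the archived redo generated during the copy, so the standby comes up only slightly behind the primary before managed recovery takes over.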
Oracle 11gR2 • SW upgrade + HW migration • Target version 11.2.0.3 • Performed in Q1 2012 • HW migration • New HW installations (RAC10 & RAC11) • 8-core (16-thread) CPUs, 48GB of memory • Move from ASM to NAS • NetApp NAS storage • Replication technology • Usage of Streams replication gradually reduced • Usage of Active Data Guard has grown
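For reference, what distinguishes Active Data Guard from a plain physical standby is that the standby stays open read-only while redo apply continues. A hedged sketch of how that mode is enabled on an 11gR2 standby (standard syntax; not a transcript of the CERN setup):

```sql
-- Hedged sketch: opening a physical standby as an Active Data Guard replica.
-- On the standby: stop redo apply, open read-only, then restart apply.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE OPEN READ ONLY;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  USING CURRENT LOGFILE DISCONNECT;

-- Verify: an ADG standby reports OPEN_MODE = 'READ ONLY WITH APPLY'.
SELECT open_mode, database_role FROM v$database;
```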
Offloading with ADG • Offloading Backups to ADG • Significantly reduces load on primary • Removes sequential I/O of full backup • Offloading Queries to ADG • Transactional workload runs on primary • Read-only workload can be moved to ADG • Examples of workload on our ADGs: • Ad-hoc queries, analytics and long-running reports, parallel queries, unpredictable workload and test queries • ORA-1555 (snapshot too old) • Sporadic occurrences • Oracle bug – to be confirmed whether present in 11.2.0.4
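Backup offloading works because primary and standby share the same DBID, so backups taken on the standby are valid for restoring the primary. A hedged sketch of running the backup against the ADG instead of the primary (connection names and recovery-catalog setup are illustrative):

```sql
# Hedged sketch: offloading RMAN backups to the ADG standby.
# A recovery catalog is needed so backup metadata is visible
# from both sides of the Data Guard configuration.
rman TARGET sys@standby_db CATALOG rman@rman_catalog

RMAN> BACKUP AS COMPRESSED BACKUPSET DATABASE
        PLUS ARCHIVELOG;
```

This removes the heavy sequential full-backup I/O from the primary entirely; only redo transport load remains there.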
New Architecture with ADG • Both configurations run in maximum performance mode, with redo transport from the primary • 1. Low load ADG: Primary Database → one Active Data Guard, used both for users’ access and for disaster recovery • 2. Busy & critical ADG: Primary Database → one Active Data Guard for users’ access, plus a separate Active Data Guard for disaster recovery • Goals: • Disaster recovery • Offloading read-only workload
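The redo transport shown in the diagram is configured on the primary; ASYNC shipping corresponds to the maximum performance mode named above. A hedged sketch with illustrative DB_UNIQUE_NAMEs (prim_db, adg_db):

```sql
-- Hedged sketch: ASYNC redo transport (maximum performance mode)
-- from the primary to one ADG destination; names are illustrative.
ALTER SYSTEM SET log_archive_config = 'DG_CONFIG=(prim_db,adg_db)';
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=adg_db ASYNC VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
   DB_UNIQUE_NAME=adg_db';
ALTER SYSTEM SET log_archive_dest_state_2 = ENABLE;
```

The "busy & critical" variant would simply add a second destination (log_archive_dest_3) for the dedicated DR standby.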
IT-DB Service on 11gR2 • IT-DB service much more stable • Workload has been stabilized • High loads and node reboots eliminated • More powerful HW • Offloading to ADG helps a lot • 11g clusterware more stable • Storage model benefited from using NAS • A single (or even multiple) disk failure no longer affects the DB service • Faster and less vulnerable Streams replication
Preparation for Run2 • Oracle SW • No single release fits the entire Run 2 period • New software versions: • 11.2.0.4 vs 12.1.0.1 • New HW • 32-thread CPUs, 128/256GB memory • New NetApp storage model • More SSD cache • Consolidated storage
Hardware upgrades in Q1 2014 • New servers and storage • Servers: more RAM, more CPU • 128GB of RAM (48GB in current prod machines) • Storage: more SSD cache • Newer NetApp model • Consolidated storage • Refresh cycle of OS and OS-related software • Puppet & RHEL 6 • Refresh cycle of our HW • New HW for production • Current production HW will be moved to standby
Software upgrades in Q1 2014 • Available Oracle releases • 11.2.0.4 • 12.1.0.1 • Evolution – how to balance • Stable services • Latest releases for bug fixes • Newest releases for new features • Fit with LHC schedule
DBAs & workload validation • DBAs - can do: • Test upgrades of integration and production databases • Share experience across user communities • Database CAPTURE and REPLAY testing with RAT (Real Application Testing) • Capture workload from production and replay it in the upgraded DB • Useful to catch bugs and regressions • Unfortunately it cannot cover all edge cases
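The capture/replay cycle uses the standard DBMS_WORKLOAD_CAPTURE and DBMS_WORKLOAD_REPLAY packages. A hedged sketch of the two ends of the cycle (directory object and capture name are illustrative; the capture must be preprocessed and wrc replay clients started, which is omitted here):

```sql
-- Hedged sketch of a RAT cycle; 'upgrade_test' and RAT_DIR are illustrative.
-- On production: capture one hour of real workload.
BEGIN
  DBMS_WORKLOAD_CAPTURE.START_CAPTURE(name     => 'upgrade_test',
                                      dir      => 'RAT_DIR',
                                      duration => 3600);
END;
/

-- Later, on the upgraded test DB, after DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE
-- and starting the wrc replay clients:
BEGIN
  DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(replay_name => 'upgrade_test',
                                         replay_dir  => 'RAT_DIR');
  DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY;
  DBMS_WORKLOAD_REPLAY.START_REPLAY;
END;
/
```

Replay reports then highlight divergence (errors, row-count differences, performance regressions) against the captured baseline.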
Validation by the users • Validation by the application owners is very valuable to reduce risk • Functional tests • Tests with ‘real world’ data sizes • Tests with concurrent workload • The criticality depends • On the complexity of the application • On how well they can test their SQL
Recent Changes: Q1-Q2 2014 • DB services for Experiments/WLCG • Target version 11.2.0.4 • Exceptions - target 12c • ATLARC • LHCBR • A few more IT-DB services • Interventions took 2-5 hours of DB downtime • Depending on system complexity: standby infrastructure, number of nodes etc.
Upgrade technique - overview • Primary and standby RAC databases linked by Data Guard redo transport, with RW access on the primary throughout most of the procedure • Starting state: Clusterware 11g + RDBMS 11.2.0.3 • Intermediate state: Clusterware 12c + RDBMS 11.2.0.3 • Final state: Clusterware 12c + RDBMS 11.2.0.4 • DATABASE downtime limited to the RDBMS upgrade step • Upgrade complete!
Phased approach to 12c • Some DBs already on 12.1 version • ATLARC, LHCBR • Smooth upgrade • No major issues discovered so far • Following Oracle SW evolution, depending on • Feedback on next 12c releases (12.2) • Testing status • Possibility to schedule upgrades • Next possible slot for upgrades to 12c 1st patchset • Technical stop Q4 2014/Q1 2015? • Candidates: offline DBs (ATLR, CMSR, LCGR…)
Monitoring & Security • Monitoring • RacMon • EM12c • Strmmon • Support level during LS1 • Best effort • Security • AuditMon • Firewall rules for external access • For ADCR in 2013 • For ATLR in 2014
IT-DB Operations Report ATLAS databases • Production DBs: 12 nodes*, ~69 TB of data • ATONR: 2 nodes, ~8 TB • ADCR: 4 nodes, ~19.5 TB • ATLR: 3 nodes, ~20.5 TB • ATLARC: 2 nodes, ~17 TB • *ATLAS DASHBOARD (1 node of WLCG database), ~4 TB • Standby DBs: 14 nodes, ~75 TB of data • ATONR_ADG: 2 nodes; ATONR_DG: 2 nodes • ADCR_ADG: 4 nodes; ADCR_DG: 3 nodes • ATLR_DG: 3 nodes • Integration DBs: 4 nodes, ~18 TB of data • INTR: 2 nodes, ~7.5 TB • INT8R: 2 nodes, ~9 TB • **ATLASINT: 2 nodes, ~2 TB (will be consolidated with INT8R) • Nearly 165 TB of space, 30 database servers • 12* databases (11 RAC clusters + 1 dedicated RAC node*)
Replication for ATLAS - plans • Replication changes overview • PVSS • Read-only replica: Active Data Guard • COOL • Online -> Offline: GoldenGate • Offline -> Tier1s: GoldenGate • MUON • Streams stopped once the new ATLAS solution for custom data movement is in place
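For the COOL paths, GoldenGate replaces Streams with a capture (Extract) process on the source and an apply (Replicat) process on the target. A hedged sketch of minimal parameter files for the Online -> Offline path; process names, the ggadmin user, and the ATLAS_COOL schema are illustrative placeholders, not the actual configuration:

```sql
-- Hedged sketch: minimal GoldenGate parameter files (illustrative names).

-- Extract on the online DB (dirprm/ext1.prm):
EXTRACT ext1
USERID ggadmin, PASSWORD ********
EXTTRAIL ./dirdat/aa
TABLE ATLAS_COOL.*;

-- Replicat on the offline DB (dirprm/rep1.prm):
REPLICAT rep1
USERID ggadmin, PASSWORD ********
ASSUMETARGETDEFS
MAP ATLAS_COOL.*, TARGET ATLAS_COOL.*;
```

A data pump Extract would normally sit between the two to ship the trail files over the network; it is omitted here for brevity.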
Conclusions • Focus on stability for DB services • Software evolution • Critical services have just moved to 11.2.0.4 • Long-term perspective: keep testing towards 12c • HW evolution • Technology evolution for replication • ADG & GG will fully replace Oracle Streams
Acknowledgements • Work presented here on behalf of: • CERN Database Group
Thank you! Marcin.Blaszczyk@cern.ch Replication Technology Evolution for ATLAS Data Workshop, 3rd of June 2014