
Database Operations

Elizabeth Gallas - Oxford. ADC Weekly, September 13, 2011.

Presentation Transcript


  1. Database Operations
     Elizabeth Gallas - Oxford
     ADC Weekly, September 13, 2011

  2. Overview
     • Brief notes
     • Oracle 11g validation
     • ATLR
     • Replication
     • User incidents (since S&C Week)
     • Frontier
     • ADCR
     Elizabeth Gallas - Databases

  3. Brief Notes
     • LFC migration
       • See Graeme’s talks …
     • ATLARC / TAG Services
       • Popular: Event Picking & other TAG Services/Reports
       • Increasing requests for queries/cross checks using TAG DB
     • AMI Database Master Server: issues at Lyon late in July → full recovery, no data loss (early August)
     • DBA issue help: DQ2, Panda, DDM, AKTR, AGIS …
       • Indexing
       • Query optimization
       • Development improvements
     • AGIS Schema
       • Running in production mode on integration (INTR) server → needs to move to production ASAP
     • Oracle 11g testing

  4. Oracle 11g Validation
     • All production DBs will upgrade to Oracle 11g
       • Scheduled: very early January 2012
     • Testing reduces risks!
       • Participation of developers – essential
       • DBAs & resources ready to help (platforms available since late May)
     • DBAs initiated validation campaign in August
       • As announced in Roman’s talk (S&C Week – July)
     • ATLARC may upgrade to 11g in October 2011
       • Take early advantage: features, performance improvements
     • Latest was summarized yesterday in Gancho’s talk at the ADC Development meeting: https://twiki.cern.ch/twiki/bin/viewauth/Atlas/DBOpsValidation11g

  5. ATLR Status
     … August: no holiday … DB usage is “evolving” (growing) …
     • Developers finding increased utility for Conditions data
       • We have powerful tools to access this data
       • People using it in new ways, a great thing!
     • Release 17: increased DB access
       • Studying logs to quantify differences
     • Tier-0: increased capacity … other bottlenecks loosened (file staging) … Database access now limiting Tier-0 job throughput
       → Recent Technical Stop used for testing Frontier usage by Tier-0 (coordinated with Frontier experts)
       • No problems using CERN Frontier; improved DB access time
       • BUT: some jobs had more DB retrievals for MUONALIGN
       • (See Hans’ talk in ADC Development meeting yesterday)
     • Trigger Reprocessing:
       • Early August: bug (improper disconnects) problems: fixed
       • Currently: Trigger experts speeding up validation cycle
       • Use OFFSITE resources (Tier-1s): timescale: ASAP
       • Development effort to later (also) use Frontier: test “in the next month”

  6. Oracle Streams
     • Recent request to run Trigger Reprocessing at BNL
       • Need to export ATLAS_CONF_TRIGGER_REPR to BNL
     • Decided to add to Oracle Streams
       • By default, it will go to all Tier-1s
       • Added benefit … available if/when these jobs use Frontier
     • Steps: adding this Schema to Oracle Streams
       • Must ensure stability of all schemas under replication: https://twiki.cern.ch/twiki/bin/view/Atlas/DatabaseSchemasUnderReplication
       • This Schema: 200 MB (not a volume issue)
       • Owner account locking
       • Trigger expert (Joerg) working with DBAs:
         • Small schema changes required to meet requirements
     • If all goes according to plan, intervention this week to add this Schema to the replication to all Tier-1s
       • Wednesday 10:00 – 12:30
       • Requires replication to be stopped during intervention

  7. Incidents: User Access to Conditions
     2 Frontier crashes at CERN Frontier site in 1 week
     • Follow up: users working independently on different projects
       • Developer: looking into SCT noise
       • Developer: adding info to Lumi Data Summary Metadata Reports
     • Why did Frontier crash? Under investigation (memory issue?)
     Frontier “load” last week: “intense queries” from L1 Calo studies
     • Query time usually <2 sec, these were 20-30 seconds
     • Follow up with developer
       • Query is a reasonable request
       • Executed in reasonable time given nature of request
       • Look for ways to improve queries
     → Raise number of Frontier DB connections from 10 to 20
     Additional Notes:
     → Incidents: reasoning behind dedicated Frontier launchpad for Tier-0
     • Incidents NOT a problem on Oracle side, just for Frontier
     • Tracking down these issues reflects a lot of improvements in Frontier monitoring and understanding of Frontier logging
       • An ongoing effort
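
The contrast on this slide between typical query times (under 2 seconds) and the "intense" ones (20-30 seconds) suggests a simple check when following up on Frontier load. The sketch below is only an illustration: the `QueryTiming` record, the labels, and the sample durations are hypothetical and do not come from Frontier logs or any Frontier API. Only the 2-second figure is taken from the slide.

```python
from dataclasses import dataclass

# Typical query time quoted on the slide: under 2 seconds.
TYPICAL_MAX_SECONDS = 2.0


@dataclass
class QueryTiming:
    label: str      # hypothetical identifier for a query
    seconds: float  # hypothetical measured duration in seconds


def flag_slow_queries(timings, threshold=TYPICAL_MAX_SECONDS):
    """Return timings at or above the threshold, slowest first."""
    slow = [t for t in timings if t.seconds >= threshold]
    return sorted(slow, key=lambda t: t.seconds, reverse=True)


# Made-up sample data, for illustration only.
sample = [
    QueryTiming("typical_lookup", 0.8),
    QueryTiming("intense_query_a", 24.0),
    QueryTiming("intense_query_b", 30.0),
]
slow = flag_slow_queries(sample)
for t in slow:
    print(f"{t.label}: {t.seconds:.1f} s")
```

Sorting slowest first keeps the queries most worth discussing with their developers at the top of the list.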

  8. Tier-1s / Frontier Status
     • Oracle+Frontier servers:
       • RAL, Lyon, KIT, BNL, TRIUMF and CERN
     • Frontier Meetings: Aug 11, Aug 25, Sep 9 https://www.racf.bnl.gov/docs/services/frontier/meetings/minutes
       • Skipping weeks with Tier-1 Service Coordination meetings
     • Current failover strategy:
       • Some Frontier launchpads still not open (as recommended)
       • Frontier fail-over only to sites with open access configuration and resilient server deployment
     • Need updated Frontier https://savannah.cern.ch/bugs/index.php?86408
       • Needed for failover to work
       • WAS thought NOT to be urgent … changed our minds … when specific sites had issues / hurricanes … raise urgency
       • To be included in LCG 60(d)
     • Improving Frontier monitoring and follow up on frequent/intense queries
       • Still work and investigations to be done – takes time
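
The failover rule stated on this slide (fail over only to sites with an open access configuration and a resilient server deployment) can be written as a simple filter. This is a minimal sketch under stated assumptions: the `Launchpad` record, its two boolean flags, and the placeholder site names are hypothetical, and say nothing about the actual configuration of any real site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Launchpad:
    site: str          # placeholder site name (hypothetical)
    open_access: bool  # assumed flag: launchpad open to outside clients
    resilient: bool    # assumed flag: resilient server deployment


def failover_candidates(launchpads, home_site):
    """Sites other than home_site that meet both failover conditions."""
    return [
        lp.site
        for lp in launchpads
        if lp.site != home_site and lp.open_access and lp.resilient
    ]


# Made-up configuration, for illustration only.
pads = [
    Launchpad("site_a", open_access=True, resilient=True),
    Launchpad("site_b", open_access=False, resilient=True),
    Launchpad("site_c", open_access=True, resilient=False),
    Launchpad("site_d", open_access=True, resilient=True),
]
candidates = failover_candidates(pads, home_site="site_a")
print(candidates)
```

A site that fails either condition is excluded outright rather than ranked lower, which matches the "only to sites with" wording of the slide.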

  9. ADCR Status
     • ADCR Database
       • Early August:
         • Alerts of storage and Oracle ASM problems.
         • Made controlled switch to standby hardware.
       • Added to standby for robustness, capacity:
         • 2 storage arrays
         • 3rd node
       • Current status:
         • SR open to Oracle on primary hardware - in progress.
     → From Gancho: ADCR on standby hardware … performing better …
       Doubling of buffer pool cache (now 13 GB), thus fewer IOPS …
       Adding 2 storage arrays: ADCR has 72 disks (instead of 4 arrays = 48 disks)
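
The disk counts quoted from Gancho are consistent with each other if every storage array holds the same number of disks. That is an assumption; the slide does not state it. A short check of the arithmetic:

```python
# Figures from the slide.
ARRAYS_BEFORE = 4
DISKS_BEFORE = 48
ARRAYS_ADDED = 2

# Assumption (not stated on the slide): all arrays hold the same number of disks.
disks_per_array = DISKS_BEFORE // ARRAYS_BEFORE
arrays_after = ARRAYS_BEFORE + ARRAYS_ADDED
disks_after = arrays_after * disks_per_array

print(f"{disks_per_array} disks per array, {arrays_after} arrays, {disks_after} disks")
```

Under that assumption each array holds 12 disks, and 6 arrays give the 72 disks reported for the standby hardware.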
