Monitoring best practices & tools for running highly available databases

Presentation Transcript

  1. Monitoring best practices & tools for running highly available databases. Miguel Anjo & Dawid Wojcik. DM meeting – 20 May 2008

  2. Oracle Real Application Clusters

  3. Architecture (diagram: RAC1, RAC2, RAC3, RAC4, RAC5, RAC6)

  4. Highly Available databases – Oracle ‘services’ • Resources distributed among Oracle services • Applications assigned to dedicated service • On node failure, resources re-distributed
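
The redistribution described on this slide can be illustrated with a toy model. The sketch below assumes each service lists "preferred" nodes and spare "available" nodes, and that a failed preferred node is replaced by a live spare. The function name and node lists are hypothetical, and this is not Oracle Clusterware's actual placement algorithm.

```python
def place_service(preferred, available, live):
    """Toy model: return the nodes a service runs on, given the live nodes.

    Hypothetical simplification: the service stays on its live preferred
    nodes, and each failed preferred node is replaced by one live spare.
    """
    running = [node for node in preferred if node in live]
    spares = [node for node in available if node in live]
    missing = len(preferred) - len(running)
    return running + spares[:missing]


# All nodes up: the service runs on its preferred nodes only.
print(place_service(["RAC1", "RAC2"], ["RAC3"], {"RAC1", "RAC2", "RAC3"}))
# RAC2 fails: the spare node RAC3 takes over its share.
print(place_service(["RAC1", "RAC2"], ["RAC3"], {"RAC1", "RAC3"}))
```

Applications keep connecting to the service name, so in this model they do not need to know which node currently hosts it.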

  5. Highly Available databases – Apps and DB release cycle • Applications’ release cycle: Development service → Validation service → Production service • Database software release cycle: Production service version 10.2.0.n → Validation service version 10.2.0.(n+1) → Production service version 10.2.0.(n+1)

  6. Why monitor? • Monitor (n.) • Computer Science. A program that observes, supervises, or controls the activities of other programs. • Need to keep all components in healthy state • We are prepared for single failures, some double failures • Commitment to give 24/7 best effort service • SW misbehavior affecting performance • Trends might indicate need to grow system • Security breaches

  7. Monitoring participants

  8. Monitoring participants

  9. What we monitor • 25 database clusters • 124 servers, 450 cores, 150 disk arrays, 2000 disks at Tier0 • 10 Tier1 sites for Streams replication • 150+ Oracle ‘services’ / applications • 2000+ user schemas • 1M+ connections/day

  10. PDB-Backup • 2 node cluster • Using Oracle Clusterware • Running: RACMon (monitoring agents), StreamMon (monitoring agents), backups, scripts repository • Monitored by Lemon; set as Critical in Operator procedures

  11. Monitored components • Servers: accessibility, CDB state (tools: Lemon + RACMon + OEM) • Disk arrays: accessibility; state given by controller; firmware, disk state, disk size, disk speed (tools: Lemon + RACMon) • Database SW: Clusterware state, service accessibility, space available, Oracle Streams (tools: RACMon + OEM + StreamMon) • Database usage: OS CPU, I/O; user sessions, CPU, I/O; user quotas, tablespace usage; bad usage (short connections, bind variables); table fragmentation (tools: RACMon, reports)

  12. Best practices (I) • No overhead to DB (monitored object) • Monitor as much as possible • Presentation layer simple & compact • Possibility to drill down

  13. Best practices (II) • Hierarchy of alarms and notifications • Simplicity → reliability • Centralized version vs. deployed everywhere • Independent blocks (monitoring, dashboard, reporting) for HA

  14. Monitoring tools • Monitoring tools: Lemon, SLS; Basic Monitoring (in-house development); SQL scripts (reactive monitoring); RACMon (in-house development, openlab); StreamMon (in-house development, openlab); OEM, Oracle Enterprise Manager (Grid Control), openlab • Service-oriented monitoring tools: experiment reports; DB Availability & Performance pages

  15. Basic monitoring • Checking every 5 minutes • Each failure → e-mail with error • 3 consecutive failures → SMS • Almost perfect for single-instance databases • Limitations: on RAC, the system survives single HW failures; users connect to a ‘service’, not a database instance; no monitoring of other components (storage, clusterware); no dashboard view
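
The escalation rule on this slide (an e-mail on every failed check, an SMS after three failures in a row) can be sketched as a small state machine. The names below are hypothetical, and sending the SMS only once, at the moment the streak reaches three, is an assumption; the slide does not say whether it repeats.

```python
from dataclasses import dataclass

SMS_THRESHOLD = 3  # consecutive failures before an SMS, as stated on the slide


@dataclass
class CheckState:
    """Tracks the current failure streak for one monitored database."""
    consecutive_failures: int = 0


def notifications_for(state, check_ok):
    """Return the notifications to send after one periodic check."""
    if check_ok:
        state.consecutive_failures = 0  # a success resets the streak
        return []
    state.consecutive_failures += 1
    actions = ["email"]  # every failure produces an e-mail
    if state.consecutive_failures == SMS_THRESHOLD:
        actions.append("sms")  # assumption: SMS is sent once per streak
    return actions


state = CheckState()
for ok in [False, False, False, True]:
    print(notifications_for(state, ok))
```

Keeping the streak counter separate from the check itself means the same rule can be applied to any probe that returns success or failure.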

  16. DBA monitoring • SQL scripts – reactive monitoring (ad-hoc monitoring) • Pros: easy to use; fast real-time information • Cons: no global overview; diagnoses a single problem at a time; requires expert knowledge

  17. RACMon requirements • Reliable (24/7) • Easy to use and configure • Provides up-to-date information (frequent runs) • Centralized – no configuration or deployment on RAC side • Web interface (RAC monitoring dashboard) – one common place for RACs’ status • Monitoring of Oracle services (DB and user level) and Oracle clusterware • Monitoring of ASM instances (diskgroups and failgroups) • Monitoring of other parts of the infrastructure – backups, storage, … (easy extensibility) • Notifications sent via e-mail & SMS to DBAs • Availability numbers (over extended periods of time) • Disabling monitoring for specific machines or clusters (scheduled and unscheduled intervention logbook)
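
Two of the requirements above, availability numbers and an intervention logbook, interact: checks made during a logged intervention should probably not count against availability. The sketch below shows one way to compute this. It is an assumption about how the two could be combined, not a description of RACMon's implementation, and the function and data layout are hypothetical.

```python
from datetime import datetime, timedelta


def availability(checks, interventions):
    """Fraction of successful checks, ignoring logged intervention windows.

    checks: list of (timestamp, ok) tuples from periodic probes.
    interventions: list of (start, end) tuples; end is exclusive.
    Returns None when no check falls outside an intervention.
    """
    counted = [
        ok
        for ts, ok in checks
        if not any(start <= ts < end for start, end in interventions)
    ]
    if not counted:
        return None
    return sum(counted) / len(counted)


t0 = datetime(2008, 5, 20)
# Twelve checks, five minutes apart; checks 3, 4 and 8 fail.
checks = [(t0 + timedelta(minutes=5 * i), i not in (3, 4, 8)) for i in range(12)]
# A logged intervention covering checks 3 and 4.
logbook = [(t0 + timedelta(minutes=15), t0 + timedelta(minutes=25))]
print(availability(checks, logbook))
```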

  18. RACMon Architecture

  19. RACMon - examples

  20. RACMon - examples

  21. RACMon • Pros/Features: • Customized for our environment • Gives an overview of all our HW and RACs • Configurable alerts (via email and SMS) and alert levels (production or non-production systems) • Drill down details available via multiple links to other types of monitoring software (OEM, Lemon, StreamMon) • Cons: • Requires manpower for development

  22. Oracle Streams “Oracle Streams enables the propagation and management of data, transactions and events in a data stream either within a database, or from one database to another.”

  23. StreamMon

  24. StreamMon

  25. StreamMon • Streams availability and usage monitoring • Built-in alerting in case of any error in the Streams stack • Pros: • Monitoring of all T1 sites in one place (Streams monitoring not available in any other tool, including OEM) • Convenient and easy to use web interface • Advanced plotting utilities • Cons: • Required manpower for development (currently in maintenance only) • Uses non-standard libraries, requires customized server

  26. Oracle Enterprise Manager • Architecture: • Agent running on each server uploads information to the central repository; if the repository is not available, the agent caches data • Management Service provides insight into any monitored target's details • Management Service sends e-mails (SMSes) based on configured metrics and policies • Proactive monitoring possible (actions based on problem diagnostics)
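
The agent behaviour described above (upload to a central repository, cache while it is unreachable) is a store-and-forward pattern. The following is a minimal sketch of that pattern only. The class, its methods and the `upload` callable are hypothetical and do not reflect the OEM agent's real interfaces.

```python
from collections import deque


class Agent:
    """Store-and-forward sketch: cache metrics while the repository is down."""

    def __init__(self, upload, max_cached=1000):
        # upload: hypothetical callable that raises ConnectionError
        # when the central repository cannot be reached.
        self._upload = upload
        # Bounded cache: the oldest metrics are dropped when it is full.
        self._cache = deque(maxlen=max_cached)

    def report(self, metric):
        """Queue a metric and try to deliver everything queued so far."""
        self._cache.append(metric)
        return self.flush()

    def flush(self):
        """Upload cached metrics in order; stop at the first failure."""
        while self._cache:
            try:
                self._upload(self._cache[0])
            except ConnectionError:
                return False
            self._cache.popleft()  # remove only after a successful upload
        return True

    def pending(self):
        return len(self._cache)
```

Removing a metric from the cache only after its upload succeeds keeps delivery in order and avoids losing data when the repository drops out mid-flush.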

  27. Oracle Enterprise Manager • Oracle Enterprise Manager Grid Control features

  28. Oracle Enterprise Manager • Pros: • Highly configurable alerts, metrics and notification policies • Advanced and easy to use web interface • Easy drill down • External product – fully supported • Cons: • Universal – requires more navigation • No global overview (per target oriented) • Customization for many targets requires much work • Bugs may be intrusive (e.g. affecting streams, excessive memory/CPU consumption, storage, DB instances) • Manpower required for maintenance and configuration • Not reliable enough for 24/7 monitoring

  29. Weekly reports • Targeted to experiment DBAs and Coordinators • Information about • Bookkeeping – application names, contacts • Resource usage – sessions, CPU, logical and physical I/O • Security – connection errors, expiring passwords, unused schemas • Space – consumed, fragmentation, recycle bin • Bad usage – short connections, queries missing bind variables
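
One common heuristic for the "queries missing bind variables" item is to replace literals in the SQL text with a placeholder and look for statements that collapse to the same shape. The sketch below illustrates that heuristic only. It is not the method used by the weekly reports (which the slides describe as PHP scripts), and the regular expression, function names and `min_variants` threshold are assumptions.

```python
import re
from collections import defaultdict

# Matches quoted string literals ('' is an escaped quote) and bare numbers.
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def normalize(sql):
    """Replace every literal with '?' so similar statements share one shape."""
    return _LITERAL.sub("?", sql)


def missing_bind_candidates(statements, min_variants=3):
    """Return {shape: distinct statement count} for shapes seen often."""
    groups = defaultdict(set)
    for sql in statements:
        groups[normalize(sql)].add(sql)
    return {
        shape: len(variants)
        for shape, variants in groups.items()
        if len(variants) >= min_variants
    }


statements = [
    "SELECT * FROM runs WHERE id = 1",
    "SELECT * FROM runs WHERE id = 2",
    "SELECT * FROM runs WHERE id = 3",
    "SELECT * FROM runs WHERE id = :id",  # already uses a bind variable
]
print(missing_bind_candidates(statements))
```

A shape that appears with many distinct literal values is a candidate for rewriting with a bind variable; statements that already use `:name` placeholders are left unchanged by the normalization.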

  30. Weekly reports • PHP scripts • Generate report over last 7 days • Specific to one RAC cluster

  31. Weekly reports

  32. Weekly reports • Current functionality • Simple way to visualize whole DB usage • Concentrates on main users (dynamic) • Easy to spot problems (color coded) • Very good feedback from our users • Now working on user configurable reports

  33. DB availability and performance page • PHP, aggregation of other tools • Requested by experiments • Dashboard of “current” DB activity • Almost real-time monitoring (up to last hour) • Application resource usage • No extra load: uses SLS, RACMon, StreamMon, weekly reports • Possibility to drill down

  34. DB availability and performance page

  35. Summary • Many monitoring components developed for our environment • Out of the box tools not sufficient • Open frameworks – new features easily added • Feedback given to Oracle Enterprise Manager development (openlab) • Very good feedback from T1s and experiments • Components included in experiment dashboards, WLCG ServiceMaps, SLS