250 likes | 364 Vues
This session explores high availability (HA) and disaster recovery (DR) from multiple perspectives. Led by Alok Pareek, the presentation covers the evolution of HA technologies, the shift from Mean Time to Failure (MTTF) to Mean Time to Recovery (MTTR), and real-world examples. Key areas include understanding transaction failure models, evaluating HA technologies, handling unplanned and planned outages, and applying effective solutions in enterprise applications to ensure continuous operation. Gain insights into the best practices and trade-offs involved in ensuring data integrity and availability.
E N D
NoCOUG Fall Conference 2005 High Availability & Disaster Recovery: A 360o View Alok Pareek
Agenda • Introduction • High Availability - 2005 • Industry Shift from MTTF to MTTR • Understanding transaction failure models • Evaluating HA technologies • Real-World HA Examples • About GoldenGate • Questions & Answers
Speaker Introduction/Background • GoldenGate Software Architect • Transactional Data Management and High Availability Solutions • 10 years in Oracle kernel development • Data Storage Technologies • Recovery and High Availability Group • High Speed Data Movement • Key Features • Cross platform/Heterogeneous transportable tablespaces (10g) • Multi threaded redo generation (9.2) • Implementation of multiple block size caches (9.0) • (TSPITR) Tablespace point in time recovery (8.0) • Owner of LGWR process (redo generation) algorithms (8i 10.2)
High Availability, Disaster Recovery • Definition • Ratio of system uptime to sum of uptime and downtime Availability = MTTF/(MTTF+MTTR) • Challenges • Addressing Performance vs. Reliability in computer systems • Hardware Faults, Software Bugs, Human errors are realities in any complex system deployment • Enterprise applications need to function 24x7 • Disasters are no longer a distant threat • Inadequate planning to handle outages • Shift in industry (and academic research) focus • From fault tolerance to reducing MTTR
HA/DR – Systematic View Database 1 Active 2 3 Unplanned outage Planned outage • Node death • Power failure • Upgrades • Migrations System Failure System Changes • Physical Media • Logical corruption Data Failure Data Changes • Maintenance
1) High Availability Concerns – No Outage Database 1 Active • Performance • DSS vs. OLTP • conflicting requirments • Mixed workload • Data validation • Data Transformation
2) Availability Concerns – Unplanned Database Unplanned outage Common Approaches • Database Restore/Recovery • RAID • Shared Disk Clusters • Standby database System failure Data failure
Understanding Database Failures • Failure points • Statement • Process • Instance • Database • Site • Failure types • Physical (Media, corruption, inconsistency amongst redundant copies) • Logical (Incorrect DML, out-of-synch, accidental table drop) • Failure Handling • Automatic • Manual
3) Availability Concerns – Planned Outage Database Planned Outage • Selected windows of downtime • Phased approach to maintenance System Changes Data Changes
Conventional Backup/Recovery RAID multiple hard disks behaving as a single large fast drive (mirrors/stripes/duplexing/parity) Snapshots Block Level Database Replication Change Level Database Replication Remote Mirroring Solutions Transactional Data Management Differentiating HA Technologies High Availability and Disaster Recovery Roll Forward / File Protection
HA Technologies & Tradeoffs • Block based database replication • Standby kept in constant recovery (inactive) mode • Useful for strict disaster recovery only, not HA • Cannot be used for reporting in recovery mode • No write access for distributed load balancing • Application response times suffer after failover • Cannot address availability across heterogeneous systems • Change based database replication • Trigger or log based • Not optimized for real time performance • Intrusive, Complex • Cannot address availability across heterogeneous systems
HA Technologies & Tradeoffs • Remote mirroring solutions • Volume managers maintain mirrors of local writes on a set of remote volumes • Useful for file protection • Physical distance to remote volumes is a critical limitation • No protection from logical corruption, or storage stack corruption • Message basedlogical writes sent by primary host over IP to remote hosts (synchronously/asynchronously) • Write ordering must be maintained by primary host • Remote volumes are standby-only, applications cannot access them • No protection from logical corruption • Hardware based • Storage arrays propagate IOs to storage arrays at a secondary site • Secondary arrays are inaccessible during replication • No protection from logical corruption • Only useful for block availability during DR
HA Technologies & Tradeoffs • Transactional Data Management • Addresses low-latency in HA hybrid computing environments (built on 1 Safe) • Management of transactional streams -- captures, transforms, routes, delivers and verifies data transactions in real time • Real time, heterogeneous, data integrity, low impact • Use cases for HA, DR, data integration, distributed computing • Not for file-level replication Log-based data capture (can be selective) Route Transform (if needed) Apply Target Source Sub-second latency
HA/DR – Solution Examples Database 1 Active 2 3 Unplanned outage Planned outage • Node death • Power failure • Upgrades • Migrations System Failure System Changes • Physical Media • Logical corruption Data Failure Data Changes • Maintenance
Real-World HA Configuration (Active) Primary “Master” • Uni/Bi-directional configuration – if one system fails, the other is immediately available and ready to take over. • For… • Highest Availability • Maximized ROI on hardware (transaction balancing) • Data Centers with Active/Passive • Example areas: • 24x7 (ATMs, Online Banking) • Online Retail • Real-time Analytics, Warehousing Active Secondary “Master” Live/Active
Real-World HA Configuration (Transaction Off-Loading) Processing Commit Transactions • Improve availability and performance of transaction processing by offloading query load to lower-cost DBs/platforms • For… • Horizontal scalability • Improved performance • Example areas: • Online Reservations • Online Lookups Active Live/Active Lookup Query Handling
Real-World DR Configuration (Failover) Database Active • An HA implementation that captures and applies data to a failover system in real time. • For… • Fast failover (No restore) • Do root-cause analysis later! • Surgical Repair (Dynamic, Selective undo) • Example areas: • 24x7/mission-critical applications • Strict SLA requirements Unplanned outage System failure Data failure Failover Database
Real-World Configuration (Switchover) System Changes Data Changes Current Database • Zero-Downtime Migrations • Rolling Upgrades • Zero-Downtime Maintenance • For… • 24x7 availability • Reduced windows for system maintenance • Example areas: • Can’t afford downtime to do in-place upgrade Planned Outage “Switchover” Database
About GoldenGate Software GoldenGate Software is a privately held software company that offers Transactional Data Management solutions. 250 customers... 1500+ solutions implemented… in 35 countries Established, Loyal Customer Base Leading Industry Solutions 18,000 Node ATM Network with 24/7 Availability Achieving paperless enterprise for this visionary healthcare provider Saving $ millions with real-time DW and zero downtime migrations. Database tiering handles average of 300,000 updates/hour, peaks at 800,000/hour
Provides guaranteed capture, routing, transformation, delivery, and verification of data transactions across heterogeneous environments in real time. GoldenGate & Transactional Data Management TDM must be: • Real time Moves with sub-second latency • Heterogeneous Moves transactions across different databases and platforms • Transactional Maintains transaction integrity GoldenGate differentiates on: • Performance Handles thousands of transactions per second with very low impact on IT systems • Extensibility & Flexibility Open architecture to meet demanding customer needs and data environments • Reliability Supports continuous operations and availability
How GoldenGate Works Capture: Committed changes are captured as they occur by reading the transaction logs. Trail files: Stages and queues data for routing. Route: Data is compressed, encrypted for routing to targets. Delivery: Applies transactional data with guaranteed integrity. GoldenGate Veridata™: Reports on data discrepancies between in-use DBs
Questions & Answers Contact Information: Alok Pareek www.goldengate.com (415) 777-0200 apareek@goldengate.com 301 Howard Street, Suite 2100, San Francisco CA 94105
Lag points address Logical failures One-to-Many One-to-One Source Source Target Target (current) Target (1 hour lag) Target (1 day lag)