Is It Time to Reassess Your Availability Approach?

David Edborg, Chief Architect, EMC Assured Availability Services

Abstract
Breakout C: Is It Time to Reassess Your Availability Approach?
Information availability is in the middle of a period of intense change. Business continuity and disaster recovery strategies that were acceptable even a year ago are now being questioned, and organizations must constantly re-examine their information availability capabilities. For IT infrastructure, data center, and disaster recovery managers, it has become a challenge to keep up. This presentation can help. We will address the following questions:
• What's causing the current shift in information availability?
• What are the emerging technologies that IT managers should be discussing with their teams?
• Is near-zero recovery time a realistic expectation?
• How can companies deliver always-on IT while reducing cost and risk?
David Edborg, Chief Architect, EMC Availability Services, EMC

Which Philosophy Do You Subscribe To?
Moore’s Law (Gordon Moore): “… the number of transistors on a microchip would double every two years.”
Or
Gretzky’s Rule (Gretzky): “I skate to where the puck is going to be, not where it has been.”

Availability Challenges Today
• Users Expect Zero Downtime
• Planned Outage Approval Difficult
• Unplanned Outages Unacceptable
• Pressure to Reduce Spend and Expand Services
• HA and DR Increased Complexity and Cost

Availability Impact of Event Types
1. Data Center Move (< 1%)
• Relocations
• Natural Disaster
• Business Change, Merger or Acquisition
2. Unscheduled Events (15%)
• Technical Failure
• Operational Failure
3. Scheduled Events (~85%)
• Maintenance, migrations, backups/restores, batch jobs, installations or upgrades
• Data warehouse extracts, builds, and loads

Typical IT Availability Charter
• External: Maximize Uptime
• Internal: Provide Scheduled Outages; Handle Unscheduled Outages; Accommodate Data Center Moves

Traditional DR vs. Continuous Availability
Traditional DR: Utility company has a power failure at its primary data center. The backup DC, with Tier-1 standby equipment, is 40 miles away and has power.
• Decision: Wait until power is restored
• Reason: Fail-over and fail-back take too long; critical apps have DR, non-critical apps do not
• Downtime: 16 hours
Continuous Availability: Hospital has a power failure at its primary data center, with a Continuous Availability (CA) architecture in place.
• Decision: No event, no decision
• Reason: CA environment with load-balanced production enabled in both centers; if one site goes down, the processing load automatically migrates to the other
• Downtime: 0 hours

Why Rethink DR Traditions?
• Expensive to Implement
• Expensive to Maintain
• Expensive to Test
• … Unreliable
And it isn’t useful for most Availability Events.
During your last major disruption, did you even consider using your DR solution?

So, … how did we get here?
• High Availability: In-data-center Application Restart
• Traditional Disaster Recovery: Tape Backup and Offsite Rotation
• Advanced Recovery: Replication to Second Site
Two Different Disciplines & Technologies to Deal with SPOFs

Where is the Technology Going?
• Continuous Availability: Application continues without disruption (0-Downtime)
• High Availability: In-data-center Application Restart
• Traditional Disaster Recovery: Tape Backup and Offsite Rotation
• Advanced Recovery: Replication to Second Site
Convergence

The Journey to Continuous Availability
• Continuous Availability: Application continues without disruption (0-Downtime)
Convergence: One Common Discipline & Set of Technologies to Deal with SPOFs

Why Make the Journey?
Single availability solution:
• Eliminate downtime for multiple scenarios
• Eliminate idle assets
• Reduce the costs of DR/HA testing
• Make Verification/Auditing easy
Potential 28-50% Reduction in Compute Cost …

Continuous Availability Characteristics
• Two-site Parallel Transaction Processing Architecture
• Off-the-Shelf Technology
• Non-invasive application adaptation
• Continuous Availability (CA) Service Level
• Always-On
• App or Service always available in at least one site
• Able to sustain all single failures including site loss
• Transactions automatically re-routed
• CA Apps maintained with little to no disruption
Diagram (Site A and Site B): Distribution Layer, Presentation Layer, Application Layer, DB / File Layer, Storage Layer (Distributed Virtual Volume), DCI Layer-2 Adjacency. Transactions Flow to Either Site; Transactions Redirect on Site Failure.

Reducing the Cost Curve by Putting Idle Assets to Use
Availability Architecture Transformation
• Traditional Tier-1 DB (DB Cluster): Svr 100% + HA+1 100% + DR 100% = 300% (△ 200%)
• Traditional Tier-1 Web: Svr 100% + HA+.2 20% + DR 100% = 220% (△ 120%)
• EMC CA (Fractional Provisioning): Site A 60% + Site B 60% + 0 = 120% (△ 20%)

Fractional Compute Model Concepts
• Provision each site with average compute
• Presumption: most servers are modeled to run at 55% - 70%
• Headroom is used for peaking
• Aggregate in pool provides: need, peaking, and loss protection (traditional HA and DR)
• Fractional Computing Math:
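The slide's own math is not reproduced in this transcript. As a minimal sketch under stated assumptions, the provisioning percentages quoted in the deck (100% server + HA + DR for traditional designs, 60% per site for fractional provisioning) can be tallied as follows. The function names are hypothetical and exist only for illustration.

```python
# Sketch of provisioning totals, expressed as percent of production need.
# Function names are hypothetical; the percentages are the ones quoted in the deck.

def traditional_total(need=100, ha=100, dr=100):
    """Production need plus dedicated HA and DR capacity."""
    return need + ha + dr

def fractional_total(per_site=60, sites=2):
    """Every site is active and provisioned with a fraction of need."""
    return per_site * sites

def overhead(total, need=100):
    """Capacity provisioned beyond production need (the delta)."""
    return total - need

def capacity_after_site_loss(per_site=60, sites=2):
    """Capacity still online when one site is lost."""
    return per_site * (sites - 1)

tier1_db = traditional_total(100, 100, 100)   # Svr + HA+1 + DR
tier1_web = traditional_total(100, 20, 100)   # Svr + HA+.2 + DR
ca = fractional_total(60, 2)                  # Site A + Site B

for name, total in [("Tier-1 DB", tier1_db), ("Tier-1 Web", tier1_web), ("CA", ca)]:
    print(f"{name}: {total}% provisioned, {overhead(total)}% over need")

print(f"CA capacity after losing one site: {capacity_after_site_loss()}%")
```

With 60% per site, the surviving site's capacity falls inside the 55% - 70% range at which servers are presumed to run, which is consistent with the claim that average compute remains available after a site loss.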

Fractional Compute Provision v. Traditional DR
Production/HA & DR, Assuming Internal Recovery
[Chart: CPU provisioning (100% to 300%), RTO, and RPO compared across Traditional, All 2SHA, CA & 2SHA, and Full CA configurations.]

Site Loss – DR v CA Alternative
Disaster Recovery:
• Evaluation
• Outage / disaster extent / estimate of duration of outage
• Make decision to fail over, i.e., declare a disaster
• Invoke BC process
• Initiate fail-over process; push the “big red buttons”
• Handle outage calls
• Figure out how to come home
Continuous Availability:
• Immediately have average compute available – do nothing
• Evaluation
• Outage / disaster extent / estimate of duration of outage
• Triage
• Determine any workloads to defer
• Take down low-use/low-priority apps / reallocate virtual CPU
• Open load balancer when site is back online
Which scenario would you rather deal with?

Why Take The Journey to CA?
• Save CAPEX: Fewer Servers; Less Storage (Fewer Copies)
• Save OPEX: Tech Refresh Seldom Requires Outage; Impact From Test; Headcount/Labor, Licenses, Space / Power / Cooling, Maintenance and Patches…
Reduce Cost

A Model for Continuous Availability
• Stretch Farms and Clusters between sites
• Stretch an A/A DB with a locking mechanism between sites
• Add SAN storage
• Add Networking
• Upgrade Local Load Balancing to Global
• Data Center Interconnect (DCI)
• Spanned VLAN (VPLS/OTV)
• WAN Connections
• Add Data Coherency Mechanism (e.g. EMC VPLEX)
The Application now is abstracted and spanned between sites.
Diagram (Site A and Site B): Distribution Layer, Presentation Layer, Application Layer, DB / File Layer, Storage Layer (VPLEX Distributed Virtual Volume), DCI Layer-2 Adjacency.

Different Layers Can Be Independent
• CA at the Distribution Layer
• CA at the Presentation and App Logic Layer (Static)
• CA at the DB and File System Layers
• CA at the Storage Layer (VPLEX WITNESS)

What if Existing Sites Exceed Metro* Distance?
Options:
• Create a small presence at a CoLo within metro distance for one leg
• Establish the two sides in an existing data center by creating independent pods
• Establish two sides in buildings on a campus
• Added value: VMware FT – no in-flight transaction loss
* The Metro Distance Requirement is 5ms RTT, or roughly 60 miles or 100km.

What About Metro Distance Limits?
Earthquake and Hurricane Considerations
• Historically, the major impact span of US continental earthquakes has been under 43 miles
• Nearly all US coastal areas are susceptible to hurricanes, but the impact dissipates as the storm moves inland
[Figure: USGS Earthquake Map 1900-2002. Red Span 0 – 69KM; Green Span 70-299KM]
http://www.nhc.noaa.gov/breakpoints.shtml

What About Metro Distance Limits? http://blogs.ei.columbia.edu/wp-content/uploads/2012/11/storm_surge_map_final.jpg

What About Metro Distance Limits?
Maximum disaster radius for last 100 yrs. has been ~25 miles (40km)
• 2N+1 solution deployed out-of-region
• SRM deployed to automate fail-over
Diagram: VPLEX WITNESS; RPA; Replication Network; Site C (out-of-region) with RPA, Development, Test/QA, Anchor.

Or Break Processing Down Between Pods And Use Geographically Dispersed Pods to Back Up Each Other

Leveraging CA Constructs to Reduce Outages
• Take a site offline for maintenance
• Take an app offline in a site for rolling maintenance
• Encapsulate site configuration
• Encapsulate human errors
• Data Warehouse Loads/Extracts
Applies to: Unscheduled Outages and Scheduled Outages

Availability Testing In Lieu of DR Testing
• Take App Down in Site-A; App Continues to run in Site-B
• Trace a transaction thru Site-B
• Bring App Up in Site-A; Trace a transaction thru Site-A
Fed Requirement: Regular DR Testing or Use Capability Regularly
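The test sequence above can be sketched as a toy simulation. This is an illustrative model only, assuming a simple round-robin balancer over two active sites; the class `TwoSiteBalancer` and the function `availability_test` are hypothetical names and do not represent any EMC or load-balancer product API.

```python
# Toy model of an availability test across two active sites.
# All names here are hypothetical and exist only for illustration.

class TwoSiteBalancer:
    """Round-robin router over whichever sites are currently up."""

    def __init__(self, sites=("Site-A", "Site-B")):
        self.order = list(sites)
        self.up = {site: True for site in sites}
        self.trace = []          # (transaction id, serving site) pairs
        self._counter = 0

    def take_down(self, site):
        self.up[site] = False

    def bring_up(self, site):
        self.up[site] = True

    def route(self, txn_id):
        """Send one transaction to the next healthy site and record it."""
        healthy = [s for s in self.order if self.up[s]]
        if not healthy:
            raise RuntimeError("no site available")
        site = healthy[self._counter % len(healthy)]
        self._counter += 1
        self.trace.append((txn_id, site))
        return site


def availability_test(lb):
    """Follow the slide's steps: down in A, trace via B, up in A, trace via A."""
    lb.take_down("Site-A")
    served_while_down = lb.route("txn-1")        # only Site-B can serve this
    lb.bring_up("Site-A")
    served_after = {lb.route("txn-2"), lb.route("txn-3")}
    return served_while_down, served_after


lb = TwoSiteBalancer()
while_down, after = availability_test(lb)
print("Served while Site-A was down:", while_down)
print("Sites serving after Site-A returned:", sorted(after))
```

The recorded `trace` list plays the role of tracing a transaction through each site: it shows which site served each transaction at every step of the test.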

Case Study – Power & Gas Utility
How Continuous Availability Reduces Costs & Required Resources
Current State & Requirements
• Studied 4 applications with various recovery requirements
• Current recovery uses repurposed QA systems
• Current state has 73 servers (Need-60, HA-13, DR-Dedicated-0)
• DB Replication, file systems not protected
• Requirements: improve recovery & availability posture; DR solution needs to be scalable
Solution Options (Traditional Prod/HA/DR vs. CA)
• Traditional DR Solution (replication & standby equipment): Need-56, HA-22, DR-56 = Total 132 Servers
• Converged Prod/HA/DR Solution: Site-A 44, Site-B 44 = Total 88 Servers
Benefits
• Potential to Reduce Server Count from 132 to 88
• Reduce Server Count by 33%
• Improved Availability Posture
• $3.1M Cost Avoidance

Case Study – Global Life Sciences Firm
How CA Improved Availability, Eliminated DR, & Reduced Cost
Issues (Traditional Prod/HA/DR vs. CA)
• SAP ERP & Critical IT has 352 servers
• Need-173, HA-57, DR-122 = 203% of need
• RTO < 24 hrs, RPO < 5 min, SLA 3x9’s
• Concerns: DR plan in place, but no coming home plan; failover of top DR tiers strands other apps; idle and out-of-sync assets
Continuous Availability Solution
• Converged HA & DR Architectures
• VPLEX / RAC / OTV / vSphere MSC
• SAP stays up regardless of failure scenario
• Most App Transactions under CA
• Low use Apps deployed as 2-Site HA
• Improved confidence in availability
Benefits
• Reduced Server Count from 352 to 250
• Reduced Server Count by 29%
• Reduced Cost $18M over 3 years
• Eliminate RPO & RTO
• Eliminate Idle Assets, DR & Fail-Over Time
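The headline percentages in the two case studies follow from the server counts quoted on the slides. The sketch below reproduces that arithmetic; the helper names are hypothetical, and the utility figures use the slide's stated totals (132 and 88) directly.

```python
# Reproduces the case-study percentages from the server counts on the slides.
# Helper names are hypothetical; all counts are the ones quoted in the deck.

def reduction_percent(before, after):
    """Percent reduction in server count, rounded to the nearest whole percent."""
    return round(100 * (before - after) / before)

def percent_of_need(total, need):
    """Total provisioned servers as a percent of production need."""
    return round(100 * total / need)

# Power & Gas Utility: stated totals for traditional DR vs. converged solution.
utility_traditional = 132
utility_converged = 44 + 44            # Site-A + Site-B

# Global Life Sciences Firm: Need + HA + DR vs. the CA solution.
life_sci_need, life_sci_ha, life_sci_dr = 173, 57, 122
life_sci_traditional = life_sci_need + life_sci_ha + life_sci_dr
life_sci_ca = 250

print("Utility reduction:", reduction_percent(utility_traditional, utility_converged), "%")
print("Life sciences provisioned vs. need:",
      percent_of_need(life_sci_traditional, life_sci_need), "%")
print("Life sciences reduction:", reduction_percent(life_sci_traditional, life_sci_ca), "%")
```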

Forrester Consulting Study Results*
Active-Active Data Centers Provide Operational And Financial Benefits
• Unite HA and DR into a single approach: 89% of respondents agreed or strongly agreed
• Leverage off-the-shelf technology: 69% of organizations agreed or strongly agreed
• Reduce DR capital expenditures: 67% of organizations agreed or strongly agreed that they were able to reduce capital expenditures by combining HA and DR
• Reduce the downtime for all IT services & applications: 86% of organizations agreed or strongly agreed that AA DC reduced downtime for all IT services and apps
* Question Details In Appendix

Assured Availability Services
• Continuous Availability
• Back-up and Recovery
• Disaster Recovery
• Managed Availability
Service types: Advisory Service, Implementation Service, Readiness Service, Management Service

Summary
Continuous Availability Technologies can:
• Increase Availability
• Reduce Cost and Complexity
• Be built with off-the-shelf technology with little to no invasive application changes

Summary • Quiz: Can you Find the Arrow in the FedEx Logo? • Moral: Sometimes solutions are in front of us and we just can’t see them

What is unique about EMC? We leverage technology to provide certainty in availability.

APPENDIX

Forrester Consulting Study Results Active-Active Data Centers Provide Operational And Financial Benefits Source: A commissioned study conducted by Forrester Consulting on behalf of EMC Corporation, January 2013 Full Report @ http://tinyurl.com/owsjxyg
