Designing Resilient Large-Scale Cloud Applications

Mark Simms (@mabsimms), Principal Program Manager, Windows Azure Customer Advisory Team
Resilient Cloud Applications

Session Objectives
• Designing resilient large-scale services requires careful design and architecture choices
• This session will explore key patterns & practices for highly available cloud services, illustrated with customer examples
• Interactivity rocks -> please ask questions throughout!

Setting the Stage

Setting the stage
• Scalability
• Availability
• Insight

Setting the stage
• Maximize service availability for consumers: ensure customers (and client devices) can access and use the service
• Minimize impact of failure on consumers: degrade gracefully, isolate faults, fall back to alternate delivery paths
• Maximize performance and capacity: services that are “live” but cannot handle desired/required demand are not available


Musings on application design
• Traditional web service design (N-tier)
• Make “everything stateless”
• Separate logic from data (state)
• Leverage specialized external state services
  • Cache, load balancer, relational database, document database, key/value store, etc.

Musings on application design
• No service is an island
• Dependencies on other internal and external services
• Trading time-to-market and agility for control

What’s in a workload?
• #1: Without the relational database, the application cannot fulfill any workloads
• #2: The relational database is an external service, subject to partial availability

Designing for Failure

Decompose by Workload
• Applications are composed of one or more workloads, each with different profiles, requirements and boundaries: Management, Availability, Operational, Cost, Health, Security, Capacity, etc.
• Products like SharePoint and Windows Server are designed with this principle in mind
• Decomposition allows for workload-specific optimization: technology selections, scalability and availability approaches, etc.

What are the “9”s
• Study Windows Azure Platform SLAs:
  • Compute External Connectivity: 99.95% (2 or more instances)
  • Compute Instance Availability: 99.9% (2 or more instances)
  • Storage Availability: 99.9%
  • SQL Azure Availability: 99.9%

The Truth About 9s
[Figure: a composite API built on the Contoso, Fabrikam, Duwamish, TailSpin and Northwind APIs. Each dependency is labeled with a 99.99% SLA and the composite with a 99.95% SLA. Composite SLA = product (*) of the dependency SLAs.]
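The multiplication behind the composite figure can be sketched as follows. This assumes every dependency is required and that dependencies fail independently; the function names are illustrative, not from the deck. Five dependencies at 99.99% each multiply out to roughly 99.95%.

```python
from math import prod


def composite_availability(dependencies):
    """Availability of a service that needs ALL dependencies up.

    Assumption: failures are independent, so availabilities multiply.
    """
    return prod(dependencies)


def downtime_minutes_per_month(availability, days=30):
    """Expected unavailable minutes in a month of `days` days."""
    return (1 - availability) * days * 24 * 60


if __name__ == "__main__":
    dependency_slas = [0.9999] * 5  # five APIs at 99.99% each
    composite = composite_availability(dependency_slas)
    print(f"composite availability: {composite:.4%}")
    print(f"expected downtime: {downtime_minutes_per_month(composite):.1f} min/month")
```

The takeaway is that each added hard dependency lowers the best SLA the composite can offer, even when every individual dependency looks excellent.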

Define Your SLAs
• Sports API: 99.99% all the time
• Live Scores + Commentary: 100% during games, 0% when no game
• Team, Player, League Stats: 99% all the time

Design for Failure
Given enough scale, time and pressure, all components or services will fail
• Your application will experience 1..N failures
How will your application behave?
• Gracefully handle failure modes, continue to deliver value
• Not so gracefully …
Fault types:
• Transient: temporary service interruptions, self-healing
• Enduring: require intervention

Failure Scope
• Region: regions may become unavailable. Connectivity issues, acts of nature
• Service: entire services may fail. Service dependencies (internal and external), configuration and code issues
• Node: individual nodes may fail. Connectivity issues (transient failures), hardware failures

Handling Transient and Enduring Failures
• Use fault-handling frameworks that recognize transient errors
• Make it part of the background “noise”
• Appropriate retry and backoff policies
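A minimal sketch of a retry-and-backoff policy, assuming the caller can tell transient errors apart from enduring ones. `TransientError` and `retry_with_backoff` are hypothetical names for illustration, not part of any fault-handling framework mentioned in the deck.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, throttling)."""


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Run `operation`, retrying transient failures with exponential backoff.

    Any other exception is treated as enduring and propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            # Exponential backoff, capped, with random jitter so that many
            # clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

The attempt cap matters as much as the backoff: an unbounded retry loop turns a transient fault into a blocked caller.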


Handling Transient and Enduring Failures
• At some point, your request is blocking the line
• Fail gracefully, and get out of the queue!
• Anti-patterns:
  • Too much trust in downstream services and client proxies
  • Not bounding non-deterministic calls
  • Blocking synchronous operations
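One way to bound a non-deterministic call is to run it on a worker pool and stop waiting after a deadline. This is a sketch using Python's standard `concurrent.futures`; `call_with_deadline` and the pool size are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small, bounded pool: a slow dependency cannot consume unlimited threads.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(fn, timeout_s, fallback):
    """Return fn() if it finishes within timeout_s seconds, else `fallback`."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort only: cancel() cannot interrupt a call that is already
        # running, but the caller gets out of the queue right away.
        future.cancel()
        return fallback


if __name__ == "__main__":
    def slow_dependency():
        time.sleep(0.5)
        return "fresh data"

    print(call_with_deadline(slow_dependency, timeout_s=0.05, fallback="cached data"))
```

Note that the worker thread keeps running after the timeout, so the pool size is the real upper bound on how much a misbehaving dependency can hold up.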

Sample Retry Policies

Circuit Breaker at Netflix
Error rate threshold criteria:
• A request to a remote service times out
• Thread pool and bounded task queue used to interact with a service dependency are at 100%
• Client library used to interact with a service dependency throws an exception
[Figure: circuit switch with On and Off positions]

Circuit Breaker at Netflix - Fallbacks
• Custom fallback: client library can provide an invokable callback method. Can also use locally available data on the API server (cookie or cache) to generate a fallback response
• Fail silent: return a null value. Useful if the data is optional
• Fail fast: when data is required and there’s no good fallback. Negative UX impact, but keeps the API healthy
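A minimal circuit breaker with a pluggable fallback might look like the sketch below. It is not Netflix's implementation: for brevity it trips on a count of consecutive failures rather than an error-rate threshold, and the class name and parameters are assumptions.

```python
import time


class CircuitBreaker:
    """Simplified breaker: opens after N consecutive failures (an assumption;
    the slide describes an error-rate threshold)."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: do not touch the dependency
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # success closes the circuit again
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2)

    def failing_call():
        raise ConnectionError("dependency down")

    # Fail silent: the fallback returns None for optional data.
    print(breaker.call(failing_call, fallback=lambda: None))
    # Custom fallback: serve locally cached data instead.
    print(breaker.call(failing_call, fallback=lambda: {"source": "cache"}))
```

A fail-fast policy fits the same shape: pass a fallback that raises, so the caller gets an immediate error instead of waiting on an unhealthy dependency.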

Deployment Redundancy
• Within a Datacenter
• Across Data Centers
• Across On Premise and Cloud
• Across Cloud Providers
• Traffic Management

Failure Points
Definition: design elements that can cause an outage. Focus on identifying design elements that are subject to external change. For example:
• Database connection
• Website connection
• Configuration file
• Registry key
Categories of common Failure Points:
• ACLs, Database access, External web site/service access, Transactions, Configuration, Capacity, Network

Failure Modes
Definition: a predictable root cause of the outage that occurs at a Failure Point.
Examples of failure modes:
• Configuration file is not in correct location
• Too much traffic overusing resources
• Database reaches maximum capacity
The following would not be considered a failure mode:
• Product bugs
• Symptoms of problems
• Informational occurrences

Failure Mode Example
Potential Failure Points:
• Database Server
• Database
• Table
• Configuration File

    public int GetBusinessData(string[] parameters)
    {
        try
        {
            // Failure point: configuration file
            var config = Config.Open(_configPath);
            // Failure points: database server, database
            var conn = ConnectToDB(config.ConnectString);
            // Failure points: stored procedure, table
            var data = conn.GetData(_sproc, parameters);
            return data;
        }
        catch (Exception e)
        {
            // Every failure mode collapses into the same event ID
            WriteEventLogEvent(100, E_ExceptionInDal);
            throw;
        }
    }

Potential Failure Modes:
• DB Server not responding
• DB offline
• DB access denied
• Sproc execute denied
• DB doesn’t exist
• DB timeout on connect
• Index corrupt
• Database corrupt
• Table doesn’t exist
• Table corrupt
• Config file missing or invalid

Design for operations

Running a Live Site Service

Running without Insight / Telemetry

Capturing Insight
• Log all internal/external “transactions” (database, web services, etc.)
  • Application context (module/component)
  • Host context (server/role/instance/process)
  • Timing information (start/stop/duration)
  • Activity identifier
• Consolidate logs to central system / dashboard for health monitoring and troubleshooting

Capturing Insight
• Capture timing and context information through helper delegates (background noise)
• Capture contextual errors (inner exceptions, etc.) on error
• Logging library is asynchronous (fire-and-forget) to avoid blocking
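The points above can be sketched with Python's standard logging module: a context manager plays the role of the helper delegate, and a queue-backed handler keeps callers from blocking on log I/O. The logger name, the `tracked` helper and the key=value message layout are assumptions for illustration.

```python
import logging
import os
import queue
import socket
import time
import uuid
from contextlib import contextmanager
from logging.handlers import QueueHandler, QueueListener

# Callers only enqueue records; a background listener thread does the I/O.
_log_queue = queue.Queue()
logger = logging.getLogger("telemetry")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(QueueHandler(_log_queue))
listener = QueueListener(_log_queue, logging.StreamHandler())
listener.start()


@contextmanager
def tracked(operation, component, activity_id=None):
    """Log timing, context and outcome for one internal/external call."""
    activity_id = activity_id or uuid.uuid4().hex
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield activity_id
    except Exception:
        outcome = "error"
        # logger.exception attaches the traceback, including chained causes.
        logger.exception("op=%s activity=%s failed", operation, activity_id)
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "op=%s component=%s host=%s pid=%d activity=%s outcome=%s duration_ms=%.1f",
            operation, component, socket.gethostname(), os.getpid(),
            activity_id, outcome, duration_ms,
        )


if __name__ == "__main__":
    with tracked("GetBusinessData", "dal") as activity_id:
        time.sleep(0.01)  # stand-in for a database or web service call
    listener.stop()  # flush pending records at shutdown
```

Passing the same activity identifier to every call made on behalf of one request is what lets a central dashboard stitch the records back together.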

Many Options Windows Azure Diagnostics

Designing for Insight
• Instrument for production logging: if you didn’t capture it, it didn’t happen
• Implement inter-service monitoring and alerting: capture and quantify inter-service behavior and activity
• Run-time configurable logging: enable activation (capture or delivery) of additional channels at run-time

Define ALM

Updating Configuration
• For a production service, configuration == code
• Need a rigorous ALM process for rolling out (and rolling back) updates to both

Updating Services
“We want global, simultaneous production rollouts of our new code”
Are you sure about that?
Production rollouts:
• Running N, N+1 concurrently
• Rolling load over to N+1, ability to fall back
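Running N and N+1 side by side and rolling load over in steps could be sketched like this. The `Rollout` class, the step sizes and the health signal are all hypothetical; a real deployment would drive this from a traffic manager and monitoring data.

```python
import random


class Rollout:
    """Shifts traffic from version N to N+1 in steps (hypothetical helper)."""

    def __init__(self, steps=(0.05, 0.25, 0.5, 1.0)):
        self.steps = steps
        self.stage = -1  # -1: all traffic still on version N

    @property
    def share_new(self):
        """Fraction of requests currently routed to version N+1."""
        return 0.0 if self.stage < 0 else self.steps[self.stage]

    def advance(self, healthy):
        """Move to the next step if N+1 looks healthy, else fall back to N."""
        if not healthy:
            self.stage = -1
        elif self.stage < len(self.steps) - 1:
            self.stage += 1
        return self.share_new

    def route(self, rand=random.random):
        """Pick the version that should serve one request."""
        return "N+1" if rand() < self.share_new else "N"


if __name__ == "__main__":
    rollout = Rollout()
    print(rollout.advance(healthy=True))   # a small share moves to N+1
    print(rollout.advance(healthy=False))  # problem detected: back to N only
```

Because version N keeps running throughout, falling back is a routing change rather than a redeployment.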

What is a health model?
Managed Entity
• Logical piece of an application
• A component that makes sense to an operator
• Each entity has a health state
• Entities can be external or internal
• Multiple instances of an entity may exist
Aspect
• Break down health state by functional team
• Must be mutually exclusive
• Group by organizational responsibility, e.g. security, performance, backup
• May be specific or non-technology, e.g. orders shipped
Operational Condition
• Defines level of operation currently available
• Normal state is fully functional
• Well designed applications may support partial operation, e.g. read only

Troubleshooting Workflow
• Detection: Is there a problem?
• Classification: What’s not working, how bad is it?
• Diagnosis: Why is there a problem?
• Recovery: What needs to be done to fix it?
• Verification: Is the problem really gone?

Resources
• Failsafe: Guidance for Resilient Cloud Architectures (http://msdn.microsoft.com/en-us/library/jj853352.aspx)
• Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services (http://msdn.microsoft.com/en-us/library/windowsazure/jj717232.aspx)
• Designing and Deploying Internet Scale Services (https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf)

Design for Scale

Unit of Scale
• Workloads: Messaging, Collaboration, Productivity
• Resources*: 4 x Web Servers (8 CPU), 100 GB Database, 10 GB Blob Storage
• Demands: 10K Active Users, 1K Concurrent Users, <2 second response time
(*) Other details such as operational demand, resources and workloads omitted for simplicity
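Capacity planning with a unit of scale reduces to a small calculation. The sketch below uses the resource and demand figures from the slide as defaults; the `ScaleUnit` type and function names are assumptions, and the response-time demand is left out because it cannot be derived from counts alone.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class ScaleUnit:
    """One unit of scale: the resources that serve a known slice of demand."""
    web_servers: int = 4
    database_gb: int = 100
    blob_storage_gb: int = 10
    active_users: int = 10_000
    concurrent_users: int = 1_000


def units_required(unit, active_users, concurrent_users):
    """Whole units needed so that neither demand exceeds capacity."""
    return max(
        1,
        ceil(active_users / unit.active_users),
        ceil(concurrent_users / unit.concurrent_users),
    )


def resources_for(unit, count):
    """Total resources to provision for `count` units."""
    return {
        "web_servers": unit.web_servers * count,
        "database_gb": unit.database_gb * count,
        "blob_storage_gb": unit.blob_storage_gb * count,
    }


if __name__ == "__main__":
    unit = ScaleUnit()
    count = units_required(unit, active_users=400_000, concurrent_users=25_000)
    print(count, resources_for(unit, count))
```

Scaling in whole units keeps each deployment identical to one that has already been tested against its demand targets.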

Scale by Units
[Figure: demand and resources over time, with levels marked at 100K and 400K]

Example
[Figure: Workload 1 and Workload 2 by month, January to December, with bottom, ramp and peak phases]

Data Partitioning
• Decomposition and Partitioning
• Understanding the 3 Vs
• Horizontal Partitioning
• Vertical Partitioning
• Hybrid Partitioning

Understanding the 3 Vs
• Volume: How large is the data today?
• Velocity: How fast is it growing?
• Variety: What type(s) of data are involved?

Understanding Queryability
• What? What types of queries are done, and what data set(s) and transformations are required to deliver them?
• When? How often must the data be queried? In real time or once a day, month, quarter, or year?

Horizontal Partitioning

Vertical Partitioning

Hybrid Partitioning
