Resilient Cloud Applications

Mark Simms (@mabsimms), Principal Program Manager, Windows Azure Customer Advisory Team. Session objectives: designing resilient large-scale services requires careful design and architecture choices.


Presentation Transcript


  1. Mark Simms (@mabsimms), Principal Program Manager, Windows Azure Customer Advisory Team. Resilient Cloud Applications

  2. Session Objectives • Designing resilient large-scale services requires careful design and architecture choices • This session will explore key patterns & practices for highly available cloud services, illustrated with customer examples • Interactivity rocks -> please ask questions throughout!

  3. Setting the Stage

  4. Setting the stage • Scalability • Availability • Insight

  5. Setting the stage • Maximize service availability for consumers: ensure customers (and client devices) can access and use the service • Minimize impact of failure on consumers: degrade gracefully, isolate faults, fall back to alternate delivery paths • Maximize performance and capacity: services that are “live” but cannot handle desired/required demand are not available

  6. Musings on application design • Traditional web service design (N-tier) • Make “everything stateless”

  7. Musings on application design • Traditional web service design (N-tier) • Make “everything stateless” • Separate logic from data (state) • Leverage specialized external state services • Cache, load balancer, relational database, document database, key/value store, etc

  8. Musings on application design • No service is an island • Dependencies on other internal and external services • Trading time-to-market and agility for control

  9. What’s in a workload? #1: without the relational database the application cannot fulfill any workloads #2: the relational database is an external service, subject to partial availability

  10. Designing for Failure

  11. Decompose by Workload • Applications are composed of one or more workloads, each with different profiles, requirements and boundaries (management, availability, operational, cost, health, security, capacity, etc.) • Products like SharePoint and Windows Server are designed with this principle in mind • Decomposition allows for workload-specific optimization (technology selections, scalability and availability approaches, etc.)

  12. What are the “9”s • Study Windows Azure Platform SLAs: • Compute External Connectivity: 99.95% (2 or more instances) • Compute Instance Availability: 99.9% (2 or more instances) • Storage Availability: 99.9% • SQL Azure Availability: 99.9%

  13. The Truth About 9s [Diagram: Contoso API (labelled “Composite”) depends on Fabrikam API, Duwamish API (also labelled “Composite”), TailSpin API and Northwind API. SLA labels on the diagram: one 99.95% SLA and six 99.99% SLAs. Caption: “SLA = *”.]
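The arithmetic behind the slide's “SLA = *” caption can be sketched as follows, reading the caption as a product of the component SLAs. This is a minimal illustration that assumes every dependency is required to serve a request and that failures are independent; the function names are hypothetical, and the seven values are the SLA labels from the diagram.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def composite_availability(availabilities):
    """Availability of a service that needs every listed dependency to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


def downtime_minutes_per_year(availability):
    """Expected minutes of unavailability per (non-leap) year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# SLA labels from the slide: one at 99.95% and six at 99.99%.
slas = [0.9995] + [0.9999] * 6
composite = composite_availability(slas)

print(f"composite availability: {composite:.4%}")
print(f"expected downtime: {downtime_minutes_per_year(composite):.0f} min/year")
```

Under these assumptions the composite comes out below 99.9%, which is lower than any single SLA in the chain.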

  14. Define Your SLAs • Sports API: 99.99%, all the time • Live Scores + Commentary: 100% during games, 0% when no game • Team, Player, League Stats: 99%, all the time

  15. Design for Failure • Given enough scale, time and pressure, all components or services will fail • Your application will experience 1..N failures • How will your application behave? • Gracefully handle failure modes, continue to deliver value • Not so gracefully … • Fault types: • Transient: temporary service interruptions, self-healing • Enduring: require intervention

  16. Failure Scope • Region: regions may become unavailable (connectivity issues, acts of nature) • Service: entire services may fail (service dependencies, internal and external; configuration and code issues) • Node: individual nodes may fail (connectivity issues (transient failures), hardware failures)

  17. Handling Transient and Enduring Failures • Use fault-handling frameworks that recognize transient errors • Make it part of the background “noise” • Appropriate retry and backoff policies
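A retry and backoff policy of the kind listed above might look like the sketch below. It is not the framework used in the talk: `TransientError`, `retry_with_backoff` and all default values are hypothetical, and the policy shown (exponential backoff with full jitter) is one common choice among several.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are expected to clear on their own."""


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying only on TransientError.

    The delay ceiling doubles after each failed attempt, capped at max_delay,
    and the actual wait is drawn at random below that ceiling (jitter) so that
    many clients do not retry in lockstep. Any other exception propagates
    immediately, because retrying an enduring fault only adds load.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the wait can be replaced in tests; production callers would leave it at its default.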

  18. Handling Transient and Enduring Failures

  19. Handling Transient and Enduring Failures • At some point, your request is blocking the line • Fail gracefully, and get out of the queue! • Anti-patterns: • Too much trust in downstream services and client proxies • Not bounding non-deterministic calls • Blocking synchronous operations
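One way to avoid the “not bounding non-deterministic calls” anti-pattern is to put a deadline on every call to a dependency. The sketch below uses Python's standard `concurrent.futures`; `call_with_deadline` and `slow_dependency` are hypothetical names, and the pool size and timings are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# A small, bounded pool: callers cannot pile up unlimited threads
# behind a slow dependency.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_deadline(func, timeout_seconds, fallback=None):
    """Run func() on the pool and wait at most timeout_seconds for its result.

    On timeout the caller gets the fallback and moves on. Note that Python
    cannot interrupt a thread that is already running, so the worker may
    still finish in the background; cancel() only helps if it has not started.
    """
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        future.cancel()
        return fallback


def slow_dependency():
    """Stand-in for a downstream service that answers too late."""
    time.sleep(0.5)
    return "late answer"
```

For example, `call_with_deadline(slow_dependency, 0.05, fallback="cached")` returns `"cached"` instead of holding the caller for half a second.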

  20. Sample Retry Policies

  21. Circuit Breaker at Netflix • An error-rate threshold switches the circuit On / Off • Criteria counted as errors: • A request to a remote service times out • Thread pool and bounded task queue used to interact with a service dependency are at 100% • Client library used to interact with a service dependency throws an exception

  22. Circuit Breaker at Netflix - Fallbacks • Custom fallback: client library can provide an invokable callback method; can also use locally available data on the API server (cookie or cache) to generate a fallback response • Fail silent: return a null value; useful if the data is optional • Fail fast: when data is required and there’s no good fallback; negative UX impact, but keeps the API healthy
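The circuit breaker and its fallbacks can be sketched in a few lines. This is not Netflix's implementation: for brevity it trips on a count of consecutive failures rather than on an error rate, and the class name, parameters and defaults are all hypothetical.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def is_open(self):
        """True while calls should be short-circuited to the fallback."""
        if self._opened_at is None:
            return False
        # Once reset_after has elapsed, let a trial call through ("half-open").
        return self._clock() - self._opened_at < self.reset_after

    def call(self, operation, fallback):
        """Run operation() unless the circuit is open; use fallback() otherwise."""
        if self.is_open():
            return fallback()
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The three fallback styles from the slide map onto the `fallback` argument: a custom fallback returns cached or locally available data, fail silent is `lambda: None`, and fail fast is a fallback that raises immediately.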

  23. Deployment Redundancy • Within a Datacenter • Across Data Centers • Across On Premise and Cloud • Across Cloud Providers • Traffic Management

  24. Failure Points definition: design elements that can cause an outage. Focus on identifying design elements that are subject to external change. For example: • Database connection • Website connection • Configuration file • Registry key Categories of common Failure Points: • ACLs, Database access, External web site/service access, Transactions, Configuration, Capacity, Network

  25. Failure Modes definition: a predictable root cause of the outage that occurs at a Failure Point. Examples of failure modes: • Configuration file is not in correct location • Too much traffic overusing resources • Database reaches maximum capacity The following would not be considered a failure mode: • Product bugs • Symptoms of problems • Informational occurrences

  26. Failure Mode Example • Potential Failure Points: • Database Server • Database • Table • Configuration File

        public int GetBusinessData(string[] parameters)
        {
            try
            {
                // Failure point: configuration file
                var config = Config.Open(_configPath);
                // Failure points: database server, database
                var conn = ConnectToDB(config.ConnectString);
                // Failure points: stored procedure, table
                var data = conn.GetData(_sproc, parameters);
                return data;
            }
            catch (Exception e)
            {
                // Every failure mode is reported as the same event (100).
                WriteEventLogEvent(100, E_ExceptionInDal);
                throw;
            }
        }

      • Potential Failure Modes: • DB Server not responding • DB offline • DB access denied • Sproc execute denied • DB doesn’t exist • DB timeout on connect • Index corrupt • Database corrupt • Table doesn’t exist • Table corrupt • Config file missing or invalid

  27. Design for operations

  28. Running a Live Site Service

  29. Running without Insight / Telemetry

  30. Capturing Insight • Log all internal/external “transactions” (database, web services, etc) • Application context (module/component) • Host context (server/role/instance/process) • Timing information (start/stop/duration) • Activity identifier • Consolidate logs to central system / dashboard for health monitoring and troubleshooting

  31. Capturing Insight • Capture timing and context information through helper delegates (background noise) • Capture contextual errors (inner exceptions, etc.) on error • Logging library is asynchronous (fire-and-forget) to avoid blocking
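A helper that captures timing and context around each call could be sketched as below, using a Python context manager in place of the helper delegates mentioned on the slide. The name `timed_call`, the logger name and the log field names are hypothetical.

```python
import logging
import time
import uuid
from contextlib import contextmanager

logger = logging.getLogger("telemetry")


@contextmanager
def timed_call(component, operation, activity_id=None, log=logger):
    """Log component, operation, activity id, outcome and duration of a block.

    The activity id is generated when the caller does not pass one, so that
    related log lines can be correlated later in a central system.
    """
    activity_id = activity_id or uuid.uuid4().hex
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield activity_id
    except Exception as exc:
        outcome = f"error:{type(exc).__name__}"  # record the error type, then re-raise
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000.0
        log.info(
            "component=%s operation=%s activity=%s outcome=%s duration_ms=%.1f",
            component, operation, activity_id, outcome, duration_ms,
        )
```

Usage is a one-line wrapper around the call, for example `with timed_call("orders", "db.query"):` followed by the database call. To keep logging from blocking the request path, the logger can be attached to the standard library's `logging.handlers.QueueHandler`, with a `QueueListener` doing the actual writes on another thread.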

  32. Many Options Windows Azure Diagnostics

  33. Designing for Insight • Instrument for production logging: if you didn’t capture it, it didn’t happen • Implement inter-service monitoring and alerting: capture and quantify inter-service behavior and activity • Run-time configurable logging: enable activation (capture or delivery) of additional channels at run-time

  34. Define ALM

  35. Updating Configuration • For a production service configuration == code • Need rigorous ALM process for rolling out (and rolling back) updates to both.

  36. Updating Services • “We want global, simultaneous production rollouts of our new code” Are you sure about that? • Production rollouts: • Running N, N+1 concurrently • Rolling load over to N+1, ability to fall back

  37. What is a health model? • Managed Entity: logical piece of an application; a component that makes sense to an operator; each entity has a health state; entities can be external or internal; multiple instances of an entity may exist • Aspect: break down health state by functional team; must be mutually exclusive; group by organizational responsibility, e.g. security, performance, backup; may be specific or non-technology, e.g. orders shipped • Operational Condition: defines level of operation currently available; normal state is fully functional; well designed applications may support partial operation, e.g. read only
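The three concepts on this slide fit naturally into a small data structure. The sketch below is an illustration only: the class and member names are hypothetical, and the rule that an entity's health is the worst condition among its aspects is an assumption, not something the slide states.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Condition(IntEnum):
    """Operational condition, ordered from best to worst."""
    NORMAL = 0   # fully functional
    PARTIAL = 1  # partial operation, e.g. read only
    DOWN = 2     # not available


@dataclass
class ManagedEntity:
    """A logical piece of the application that makes sense to an operator."""
    name: str
    # Aspect name (e.g. "security", "performance", "backup") -> Condition
    aspects: dict = field(default_factory=dict)

    def health(self):
        """Assumed rollup: the entity is as healthy as its worst aspect."""
        return max(self.aspects.values(), default=Condition.NORMAL)


orders_db = ManagedEntity(
    "orders-db",
    {"security": Condition.NORMAL, "performance": Condition.PARTIAL},
)
print(orders_db.name, orders_db.health().name)
```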

  38. Troubleshooting Workflow • Detection: Is there a problem? • Classification: What’s not working, how bad is it? • Diagnosis: Why is there a problem? • Recovery: What needs to be done to fix it? • Verification: Is the problem really gone?

  39. Resources • Failsafe: Guidance for Resilient Cloud Architectures (http://msdn.microsoft.com/en-us/library/jj853352.aspx) • Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services (http://msdn.microsoft.com/en-us/library/windowsazure/jj717232.aspx) • Designing and Deploying Internet Scale Services (https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf)

  40. Design for Scale

  41. Scale: Unit of Scale • Workloads: Messaging, Collaboration, Productivity • Resources*: 4 x Web Servers (8 CPU), 100 GB Database, 10 GB Blob Storage • Demands: 10K Active Users, 1K Concurrent Users, <2 second response time • (*) Other details such as operational demand, resources and workloads omitted for simplicity
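Capacity planning with such a unit reduces to a small calculation. The sketch below takes the unit's resources and demand limits from the slide; the `ScaleUnit` type, the `units_required` function and the example demand figures are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUnit:
    """One unit of scale, with the figures shown on the slide."""
    web_servers: int = 4
    database_gb: int = 100
    blob_storage_gb: int = 10
    active_users: int = 10_000      # demand one unit can carry
    concurrent_users: int = 1_000   # demand one unit can carry


def units_required(unit, active_users, concurrent_users):
    """Whole units needed so that neither demand limit is exceeded."""
    by_active = math.ceil(active_users / unit.active_users)
    by_concurrent = math.ceil(concurrent_users / unit.concurrent_users)
    return max(1, by_active, by_concurrent)


unit = ScaleUnit()
needed = units_required(unit, active_users=100_000, concurrent_users=8_000)
print(f"{needed} units, {needed * unit.web_servers} web servers")
```

Growth is then planned in whole units: whichever limit is hit first decides when the next unit is deployed.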

  42. Scale by Units [Chart: demand and resources over time; marked levels: 100K and 400K]

  43. Example [Chart: Workload 1 and Workload 2 by month, January to December, with phases labelled Bottom, Ramp and Peak]

  44. Data Partitioning • Decomposition and Partitioning • Understanding the 3 Vs • Horizontal Partitioning • Vertical Partitioning • Hybrid Partitioning

  45. Understanding the 3Vs • Volume: How large is the data today? • Velocity: How fast is it growing? • Variety: What type(s) of data are involved?

  46. Understanding Queryability • What? What types of queries are done and what data set(s) and transformations are required to deliver them? • When? How often must the data be queried? In real time or once a day, month, quarter, or year?

  47. Horizontal Partitioning
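The slide's diagram is not preserved in this transcript. As a generic illustration of horizontal partitioning (rows of the same shape spread across several stores by key), the sketch below assigns each key to a shard with a stable hash. The function name, the key format and the shard count are hypothetical, and hashing is only one of several possible partitioning schemes.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a partition key to a shard index in [0, shard_count).

    A cryptographic hash is used because it is stable across processes and
    machines, unlike Python's built-in hash() for strings.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Every row for a given customer lands on the same shard.
for customer_id in ("customer-17", "customer-42", "customer-99"):
    print(customer_id, "-> shard", shard_for(customer_id, 4))
```

A plain modulo scheme like this one moves most keys when the shard count changes, so it suits a fixed number of partitions.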

  48. Vertical Partitioning

  49. Hybrid Partitioning
