Building Big: Lessons Learned from Windows Azure Customers

Mark Simms (@mabsimms), Christian Martinez • Windows Azure Customer Advisory Team • 4-554

Setting the stage • This is meant to be an interactive discussion – if you don’t ask questions, we will! • This session will be customer stories, patterns & code. • We will get deeply nerdy with .NET and Azure services. • Designing resilient large-scale services requires careful design and architecture choices • In this session we will explore key scenarios extracted from customer engagements, and what happens @ big scale. • Windows Azure Customer Advisory Team (CAT) • Works with internal and external customers to build out some of the largest applications on Azure • Get our hands dirty on all aspects of delivery: design, implementation and, all too often, firefighting

Story time with Christian

A large web site, processing asynchronous work

Connected device(s) service, asynchronous processing • 100k+ connected devices publishing activity reports • Target end to end latency (including cellular link) – 8 seconds • Target throughput 5000 messages / second

Connected device(s) service, asynchronous processing • Batch receiving messages for throughput • Flag completion for individual messages

Serialized processing – increasing latency • Batching receive for chunky communication – needed to meet throughput goals • Processing messages in sequence drives up latency

Switch to parallel processing
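The move from sequential to parallel handling of a received batch can be sketched as follows. This is a minimal illustration in Python rather than the .NET code used in the session; `process_batch`, `handler` and `max_concurrency` are hypothetical names, and the `completed` list stands in for flagging completion on each individual message.

```python
import asyncio

async def process_batch(messages, handler, max_concurrency=8):
    """Process one received batch concurrently, completing messages one by one."""
    gate = asyncio.Semaphore(max_concurrency)  # bounds the number of in-flight handlers
    completed, failed = [], []

    async def run_one(msg):
        async with gate:
            try:
                await handler(msg)
                completed.append(msg)  # stand-in for flagging this message complete
            except Exception:
                failed.append(msg)     # left uncompleted so it can be redelivered

    await asyncio.gather(*(run_one(m) for m in messages))
    return completed, failed
```

Messages in the batch no longer wait behind each other, while the semaphore keeps one large batch from starting an unbounded number of handlers at once.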

Something isn’t right • Initial performance very smooth • App quickly spikes to 100% CPU on all cores • Execution time spikes to minutes!

What does windbg say? • Most threads blocked in FindEntry of Dictionary • Using a Dictionary to look up the message handlers
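The slides do not show the fix. One common cause of this symptom in .NET is a non-thread-safe Dictionary being modified while other threads read it. A hedged sketch of one alternative, a handler table built once at startup and never mutated afterwards, is shown below in Python; the handler names and message shape are assumptions made for illustration.

```python
from types import MappingProxyType

def handle_activity(msg):
    return ("activity", msg["body"])

def handle_heartbeat(msg):
    return ("heartbeat", msg["body"])

# Built once at startup; the read-only view rejects any later mutation,
# so worker threads only ever perform lookups against a fixed table.
HANDLERS = MappingProxyType({
    "activity": handle_activity,
    "heartbeat": handle_heartbeat,
})

def dispatch(msg):
    handler = HANDLERS.get(msg["type"])
    if handler is None:
        raise KeyError(f"no handler for message type {msg['type']!r}")
    return handler(msg)
```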

Something still isn’t right • Large variations in avg/max latency • After time, processing rate drops to ~5 msg / second • CPU at ~ 0%

What does PerfView have to say? System.Core!System.Dynamic.Utils.TypeExtensions.GetParametersCached http://channel9.msdn.com/Series/PerfView-Tutorial/Tutorial-12-Wall-Clock-Time-Investigation-Basics

Asynchronous & queue based processing • Looks simple enough… • Required messaging exchange patterns for queuing (pub/sub, competing consumer) • Partitioning and load balancing (affinity) for queue resources • Latency vs. throughput – batching • Resources vs. latency – bounding concurrency of task execution • Message dispatch – dynamic vs. fixed function tables • Poison messages, retries • Idempotent processing
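The last two bullets (poison messages and retries, idempotent processing) can be illustrated with a small in-memory sketch. It assumes each message carries a unique id and that the queue redelivers a message until it is completed; the class name, the `max_attempts` value and the in-memory `seen` set are hypothetical, and a real service would keep this state in durable storage.

```python
class IdempotentProcessor:
    """Skips duplicate deliveries and sets aside messages that keep failing."""

    def __init__(self, handler, max_attempts=3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.seen = set()      # ids already processed successfully
        self.attempts = {}     # id -> number of delivery attempts so far
        self.dead_letter = []  # poison messages parked for inspection

    def process(self, msg_id, body):
        if msg_id in self.seen:
            return "duplicate"  # redelivery of completed work: do nothing
        attempt = self.attempts.get(msg_id, 0) + 1
        self.attempts[msg_id] = attempt
        try:
            self.handler(body)
        except Exception:
            if attempt >= self.max_attempts:
                self.dead_letter.append((msg_id, body))
                return "dead-lettered"
            return "retry"
        self.seen.add(msg_id)
        return "processed"
```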

Large website, scale-out relational data storage • (Very) Large scale website, backed by 500 Azure SQL databases • Physically collapsed web/app tiers to reduce latency • What can happen during periods of extreme success?

Large website, scale-out relational data storage • Each cloud service has a single public IP (VIP) • Each Azure SQL Database cluster also has a single public IP • 120 web role instances, 500 databases • Connection pool default size = 100 • What’s the limit?
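The slide leaves the question open; the arithmetic below only works out the worst case implied by the numbers on it. It assumes one connection pool per database on every instance, and it compares the total with the 16-bit TCP source-port space available between one source IP and one destination IP and port. Actual platform limits are not stated in the slides and may well be lower.

```python
WEB_ROLE_INSTANCES = 120
DATABASES = 500
POOL_MAX_SIZE = 100  # default connection pool size from the slide

# Assumption: every instance holds one pool per database connection string.
max_conn_per_instance = DATABASES * POOL_MAX_SIZE
max_conn_total = WEB_ROLE_INSTANCES * max_conn_per_instance

# A TCP connection is identified by (source IP, source port, dest IP, dest port).
# With one VIP talking to one cluster IP on a fixed port, only the 16-bit
# source port varies.
TCP_SOURCE_PORTS = 2 ** 16

oversubscription = max_conn_total / TCP_SOURCE_PORTS

print(max_conn_per_instance)  # 50000
print(max_conn_total)         # 6000000
```

Under these assumptions the pools could ask for roughly 90 times more connections than a single VIP-to-cluster address pair can carry.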

Large website, leveraging external services • (Very) Large scale website, leveraging an external service for content moderation • Protected the external service dependency with a retry policy • On average called in 0.5% of service calls

Unintended consequences • Too much trust in downstream services and client proxies • Not bounding non-deterministic calls • Blocking synchronous operations • No load shedding
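Two of the missing protections above, bounding non-deterministic calls and shedding load, can be sketched together. This is an illustrative Python sketch, not the customer's code; `BoundedCaller`, `Overloaded` and the default limits are hypothetical.

```python
import asyncio

class Overloaded(Exception):
    """Raised when a call is rejected instead of being queued."""

class BoundedCaller:
    def __init__(self, max_in_flight=50, timeout_s=2.0):
        self.max_in_flight = max_in_flight
        self.timeout_s = timeout_s
        self.in_flight = 0

    async def call(self, make_call):
        # Load shedding: fail fast rather than pile more work onto a slow dependency.
        if self.in_flight >= self.max_in_flight:
            raise Overloaded("too many calls in flight")
        self.in_flight += 1
        try:
            # Bounding: the downstream call can never take longer than timeout_s.
            return await asyncio.wait_for(make_call(), timeout=self.timeout_s)
        finally:
            self.in_flight -= 1
```

A caller would wrap each downstream request as `await caller.call(lambda: fetch_moderation(doc))`, where `fetch_moderation` is a placeholder for the real client call, and treat both `Overloaded` and a timeout as reasons to degrade gracefully.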

Large website, asynchronous document processing • Rich clients (mobile and desktop) publishing documents for processing • Using Shared Access Signature (SAS) tokens for direct writes to storage • Looks like a good design…

Large website, asynchronous document processing • Storage account URI is “hard coded” into the client application • Need to update all 100k+ client applications to change storage account
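One way to avoid baking the storage location into clients is to have them ask the service where to upload. The sketch below is an assumption about how that indirection could look, not something shown in the slides: `issue_upload_target`, the account URLs and the path layout are all hypothetical, and a real service would also attach a short-lived SAS token to the response.

```python
import zlib

# Server-side configuration: can change without touching any client.
STORAGE_ACCOUNTS = [
    "https://account-a.example.invalid",
    "https://account-b.example.invalid",
]

def issue_upload_target(client_id, document_name, accounts=STORAGE_ACCOUNTS):
    """Return the URL a client should write to for this upload."""
    # crc32 gives a stable spread of clients across the configured accounts.
    index = zlib.crc32(client_id.encode("utf-8")) % len(accounts)
    return f"{accounts[index]}/uploads/{client_id}/{document_name}"
```

Because the account list lives on the service, adding or replacing a storage account changes what `issue_upload_target` returns without redeploying client applications.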

Design Choices & Challenges

Exploration – Data Design • Optimize for the most stringent case • Simplicity is king • No one, true solution • Devices and Services workload – connected embedded devices and applications streaming data to the cloud • 100k+ devices, growing 50k / month • Regional affinity (North America only)

Option 1: Relational – Considerations and Challenges • Cannot fulfill with a single database • Exceeds transactional throughput limit • Data growth will exceed practical size limits • Insert heavy workload • Pressure on transaction log • Partitioning keys? • Device ID, User account? • Partitioning approach • Bucket, range, lookup?
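The three partitioning approaches named in the last bullet can be contrasted with a short sketch. The function names, split points and lookup table are illustrative assumptions; the slides do not specify how keys were mapped to databases.

```python
import bisect
import zlib

def bucket_partition(device_id, partition_count):
    """Bucket (hash): spreads keys evenly; changing the count remaps most keys."""
    return zlib.crc32(device_id.encode("utf-8")) % partition_count

# Keys below "g" go to partition 0, "g" up to "n" to 1, "n" up to "t" to 2, the rest to 3.
RANGE_SPLIT_POINTS = ["g", "n", "t"]

def range_partition(device_id, split_points=RANGE_SPLIT_POINTS):
    """Range: keeps neighbouring keys together; can develop hot ranges."""
    return bisect.bisect_right(split_points, device_id.lower())

def lookup_partition(device_id, table, default=0):
    """Lookup: explicit placement per key; most flexible, but the table must be maintained."""
    return table.get(device_id, default)
```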

Option 1: Relational – Considerations and Challenges • Periodic query spike on bulk reporting • Impact to online operations (30M+ rows) • Rebalancing • Moving data between partitions / databases • Distribution of reference data (relational model) • Keeping in sync • Impact of noisy neighbors (Azure SQL DB) • Variable latency, pushback under heavy load • Cost of management (SQL IaaS) • Cost of automation for patching, maintenance

Tackling the Insert Challenge • Inserting large volumes of streaming data into a data store • Data store is governed on number of operations (transactions) • Trade consistency for throughput – enqueue, batch and publish • Get: increased throughput, shift work to “cheap” resource (app memory) • Give up: full durability (potential data loss)
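A minimal sketch of the enqueue, batch and publish trade-off, assuming a `publish` callable that writes one batch to the data store in a single operation. The class name and batch size are hypothetical.

```python
class BatchingWriter:
    """Buffers records in application memory and publishes them in batches."""

    def __init__(self, publish, max_batch=100):
        self.publish = publish      # called with a list of records, one store operation
        self.max_batch = max_batch
        self.buffer = []

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        # Anything still in self.buffer is lost if the process dies first:
        # this is the durability given up in exchange for throughput.
        if self.buffer:
            batch, self.buffer = self.buffer, []
            self.publish(batch)
```

In practice `flush` would also run on a timer so that a quiet period does not leave records sitting in memory indefinitely.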

Tackling the Insight Challenge • Challenge: know that your site is having issues before Twitter does • This is not a randomly chosen anecdote. • Instrument, collect, analyze - react • Best: buy your way to victory (AppDynamics, New Relic, etc) • Also need to instrument application effectively for “contextual” data (aka, logging)

Instrumenting Applications • Instrument for production logging • If you didn’t log & capture it, it didn’t happen • Implement inter-service monitoring and alerting • Nothing interesting happens on a single instance • Run-time configurable logging • Enable activation (capture or delivery) of additional channels at run-time • Getting logging right • All logging must be asynchronous • Buffer and filter before pushing to remote service or store
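The last two bullets (asynchronous logging, buffer and filter before the sink) can be sketched with Python's standard `logging.handlers.QueueHandler` and `QueueListener`. The session itself demonstrates .NET EventSource; this is only an analogous illustration, and `build_async_logger` and its defaults are hypothetical.

```python
import logging
import logging.handlers
import queue

def build_async_logger(name, sink, min_level=logging.WARNING, capacity=10_000):
    """Return (logger, listener); callers enqueue records, a background thread writes them."""
    buffer = queue.Queue(maxsize=capacity)  # bounded in-memory buffer
    sink.setLevel(min_level)                # filter applied before the slow sink
    listener = logging.handlers.QueueListener(
        buffer, sink, respect_handler_level=True
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(logging.handlers.QueueHandler(buffer))
    logger.propagate = False
    listener.start()
    return logger, listener
```

Request threads only pay for an in-memory enqueue, and the level on `sink` can be lowered at run time (`sink.setLevel(logging.DEBUG)`) to capture more detail without redeploying. Calling `listener.stop()` drains the buffer before shutdown.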

Bringing down a production system with logging…

Demo: Instrumenting Applications with Event Source

Option 2: Compositional Azure Storage • Querying by device • By time - direct { Pk, Rk } lookup • By day - direct { Pk } max of 2880 records per partition • Batch transfer by time frame • Parallel download of all blobs matching timeframe pattern • Adding scale capacity • 20k operations per storage account • This isn’t a relational workload • Per-device insert and lookup • Periodic batch transfer • Per-device lookup • Natural fit for table storage • Device ID = Pk • Data type = Rk • Periodic batch transfer • Natural fit for blob storage • Instance + Timestamp = blob id • Buffer and write into blocks • Roll over on time interval (10 min)
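The blob naming and rollover scheme can be sketched as below. The exact id format is an assumption; the slide only says the id combines instance and timestamp and rolls over every 10 minutes. The figure of 2880 records per day is consistent with one record every 30 seconds, which is an inference rather than something the slide states.

```python
from datetime import datetime, timezone

ROLLOVER_MINUTES = 10

def blob_id(instance, ts):
    """Instance + timestamp id, with the timestamp floored to the rollover window."""
    window_start = ts.replace(
        minute=ts.minute - ts.minute % ROLLOVER_MINUTES, second=0, microsecond=0
    )
    return f"{instance}/{window_start:%Y%m%d-%H%M}"

def records_per_day(interval_seconds):
    return 24 * 60 * 60 // interval_seconds

print(blob_id("web_3", datetime(2013, 6, 27, 14, 37, 12, tzinfo=timezone.utc)))
# web_3/20130627-1430
print(records_per_day(30))
# 2880
```

Because every writer produces ids with the same time prefix for a given window, a batch transfer can list and download all blobs for a time frame in parallel.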

Azure Storage Account - Blob

Azure Storage Account - Table

Azure Storage Account - Queues

User centric web application • Services site for mobile device applications • 1M+ users at launch, 1M+ users added per month • Front ended by Android, iOS, Windows Phone • Personalized information feeds and data sets • Examples: browsing history, shopping cart • Assuming up to 30% of user base can be online at any point in time • Maximum response latency 250 ms @ 99th percentile
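The sizing numbers on this slide translate into a simple capacity projection and a latency check. The helper names are hypothetical, and the nearest-rank percentile is one common definition, not necessarily the one used in the session.

```python
def registered_users(months_after_launch, at_launch=1_000_000, added_per_month=1_000_000):
    return at_launch + added_per_month * months_after_launch

def peak_online(users, online_percent=30):
    """Upper bound on concurrently online users (30% assumption from the slide)."""
    return users * online_percent // 100

def percentile_nearest_rank(samples_ms, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples_ms)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]

def meets_latency_target(samples_ms, target_ms=250, pct=99):
    return percentile_nearest_rank(samples_ms, pct) <= target_ms

print(peak_online(registered_users(0)))   # 300000
print(peak_online(registered_users(12)))  # 3900000
```

At launch the design has to tolerate up to 300,000 users online at once, and about 3.9 million a year later if growth holds.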

Tearing apart the architecture • Where are the scalability bottlenecks? • Where are the availability and failure points? • Where are the key insight and instrumentation points?

Demo: Implementing an information publishing site

Recap • Know the numbers – platform scalability targets • Compute, storage, networking and platform services • Scalability == capacity * efficiency • Watch out for shared resources and contention points • At high load and concurrency “interesting” things happen • Default to asynchronous, bound all calls • Insight is power – measuring and observation of behavior • Without rich telemetry and instrumentation – down to the call level – apps are running blind • Buy your way to victory, leverage asynchronous and structured logging

Resources • Failsafe: Building scalable, resilient cloud services • http://channel9.msdn.com/Series/FailSafe • Cloud Service Fundamentals - Reference code for Azure • http://code.msdn.microsoft.com/Cloud-Service-Fundamentals-4ca72649
