
Scaling and stabilizing large server side infrastructure Yoshinori Matsunobu Principal Infrastructure Architect DeNA


Presentation Transcript


  1. Scaling and stabilizing large server side infrastructure. Yoshinori Matsunobu, Principal Infrastructure Architect, DeNA

  2. Who am I. 2006-2010: Lead MySQL Consultant at MySQL AB (now Oracle). Sep 2010-: database and infrastructure engineer at DeNA: eliminating single points of failure, establishing no-downtime operations, performance optimizations to reduce the total number of servers, avoiding sudden performance stalls, and more. Speaker at many conferences such as the MySQL Conference and OSCON. Oracle ACE Director since 2011.

  3. DeNA: company background. One of the largest social game providers in Japan: the social game platform "Mobage" and many social game titles, with the subsidiary ngmoco:) in San Francisco. Feature phone, smartphone, and PC games localized for Japan. Operating thousands of servers in multiple datacenters: 2-3 billion page views per day, 35+ million users. Received the O'Reilly MySQL Award in 2011. Note: this is not a sponsored session, so the talk and slides are neutral.

  4. Agenda: scaling strategies, cloud vs. physical servers, performance practices, administration practices.

  5. Server side apps: what's difficult? At least these parameter values differ per user, change very frequently, and must not be lost: name, job, equipment, ATK points, DEF points, social points, current LP, max LP, estimated time of recovery, current BP, max BP, EXP/LV, etc.

  6. Server side apps: what's difficult? Dynamic data must not be lost: it should be stored in (stable) databases and should be highly available, so some kind of "data redundancy" is needed. Data is selected and updated very frequently: the top page is the most frequently accessed and all of the latest values need to be fetched; name does not change, but LP/EXP change very frequently, so a caching (read scaling) and write tuning strategy should be planned. Data size can be huge: 100KB per user x 100M users = 10TB in total, so some kind of "data distribution" strategy might be needed. The focus here is "text" data (images/movies are read only and can be located on static content servers, so they are easy from the server side point of view). In this respect, online games are not so different from general web services.

  7. Characteristics of online/social games. It is very difficult to predict growth: social games grow (or shrink) rapidly. For example, you might estimate 10 million page views per day before going live; in a good case you get 100 million page views/day, in a bad case only 10K page views/day. It is necessary to prepare enough servers to handle the traffic, but it takes time to purchase, ship, and set up 100 physical servers, and too many unused servers might kill your company. For smaller companies, using cloud services (AWS, Rackspace, etc.) is much less risky. For larger companies, servers in stock can probably be reused.

  8. Scaling strategy: single server, multiple-tier servers, scaling reads / database redundancy, horizontal partitioning / sharding, distributing across regions.

  9. Single server. Web/app and database are only logically separated on one machine. Issues: single point of failure; service capacity is very limited (soon becoming CPU or disk I/O bound).

  10. H/W failure happens often. At DeNA we run thousands of machines in production: ~N,000 web servers, ~1,000 database (mainly MySQL) servers, ~100 caching (memcached, etc.) servers. Most server failures are caused by H/W problems: disk I/O errors, memory corruption, etc. Mature middleware is stable enough; MySQL has not crashed because of its own bugs for years. Do not be too afraid of upgrading middleware (but prepare for it): older software (CentOS 4, MySQL 5.0, etc.) has lots of bugs that will never be fixed.

  11. Multiple-tier servers. Web and database tiers are physically separated: multiple web servers, a single database server; the web servers scale. Issues: the database server is still a single point of failure, service capacity is limited by the database server's performance, and a single database server can't handle TBs of active game data.

  12. Scaling reads / database redundancy. This is probably the most common deployment in the world: data is replicated from a master to slaves, and web/app servers read from cache servers and slave databases while writing to the master. Read traffic can be scaled with cache servers and slave servers. Issues: the master database is still a single point of failure, and writes are not scalable.
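
The read path in this deployment is commonly implemented as a cache-aside lookup that falls back to a replication slave, while writes always go to the master. Below is a minimal sketch of that pattern, assuming hypothetical `cache`, `master`, and `slaves` connection objects (illustrative interfaces, not DeNA's actual framework):

```python
import random

def get_user_status(cache, slaves, user_id):
    """Cache-aside read: try the cache first, then a replication slave."""
    key = "user_status:%d" % user_id
    row = cache.get(key)                      # e.g. a memcached-style client
    if row is None:
        slave = random.choice(slaves)         # spread reads across slaves
        row = slave.query("SELECT * FROM user_status WHERE user_id = %s", user_id)
        cache.set(key, row, 30)               # short TTL; these values change often
    return row

def update_user_lp(cache, master, user_id, new_lp):
    """Writes always go to the master; invalidate the cache afterwards."""
    master.execute("UPDATE user_status SET lp = %s WHERE user_id = %s", new_lp, user_id)
    cache.delete("user_status:%d" % user_id)  # next read repopulates from a slave
```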

  13. Replication for read scaling / redundancy. Asynchronous replication is the majority: in case of a master crash you might lose some of the latest data; most (open source) database software does not support synchronous replication, and sync replication does not perform well between remote datacenters. Be careful about replication delay: replication is single threaded in most databases, and game users are strict about even a few seconds of delay (how would you feel if you couldn't find virtual items that you bought just now?), especially during limited-time gaming events. We use SSDs on slave servers so that replication slaves can keep up.

  14. Resolving server addresses (our case). Web/app servers get the mapping info from MyDNS (a global catalog database) and cache it locally, then connect by logical name, e.g. connect(ff_m), connect(ff_s). Example mappings: ff_m -> db1001; ff_s -> db1011:100, db1012:100, db1013:0; gundam_m -> db2103; gundam_s -> db1713, db1714, db1715, db1716; and so on. Other approaches: using distribution-aware databases such as MongoDB (good if the database software itself is stable), or using a load balancer for distributing access to slaves (increases the number of servers and response time).
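
The per-host numbers in the slave mappings (e.g. db1013:0 taking no traffic) suggest weighted selection on the client side. The sketch below shows one plausible way to resolve a logical name from a locally cached catalog; the data structures, TTL, and `fetch_from_catalog` callback are assumptions for illustration, not the actual MyDNS client:

```python
import random
import time

# Hypothetical in-process cache of the global catalog: logical name -> (hosts, fetch time).
_catalog_cache = {}
CATALOG_TTL = 60  # seconds before re-fetching the mapping info

def resolve(name, fetch_from_catalog):
    """Return one host for a logical name such as 'ff_s', honoring weights."""
    entry = _catalog_cache.get(name)
    if entry is None or time.time() - entry[1] > CATALOG_TTL:
        # fetch_from_catalog('ff_s') -> [('db1011', 100), ('db1012', 100), ('db1013', 0)]
        entry = (fetch_from_catalog(name), time.time())
        _catalog_cache[name] = entry
    hosts = [h for h, w in entry[0] if w > 0]      # weight 0 means "no read traffic"
    weights = [w for h, w in entry[0] if w > 0]
    return random.choices(hosts, weights=weights, k=1)[0]
```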

  15. Scaling writes. Database write operations (INSERT, UPDATE, DELETE) mostly do random disk reads as well as writes: INSERT reads target index blocks (2,000-4,000/s), UPDATE reads matched records (200-2,000/s), DELETE reads matched records and indexes (100-2,000/s). Increasing RAM helps to reduce the number of random reads, and performance highly depends on storage devices, so using SSDs (great at random reads) is a good practice. Still, a single database server is not enough to handle massive write requests.

  16. Horizontal partitioning / sharding. Data is divided into shards, e.g. shard 1 holds 1<=uid<1M and shard 2 holds 1M<=uid<2M, each with its own master, slaves, and cache, so both reads and writes are scalable. How should application programs choose the proper shard? Hashing or a mapping table. Issues: each master database still might be a single point of failure (depending on the database product), a datacenter crash results in service failure, and re-sharding (moving data) is painful.

  17. Approaches for sharding. Developing an application framework: create a catalog (mapping) table (i.e. user_id=1..1M => shard1), look up the catalog table (and cache it), then access the proper shard. Many people have taken this approach so far; once you create the framework you can reuse it for many other games, though re-sharding (moving data between shards) is still difficult. Using sharding-aware database products: the database client library automatically selects the proper shard; many recent distributed NoSQL databases support this (MongoDB, Membase, HBase, and many more are popular), but newer databases tend to have lots of bugs and stability problems. A minimal sketch of the catalog-table approach is shown below.
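
A minimal sketch of the catalog-table approach, using the uid ranges from the previous slide; the table layout, hostnames, and helper names are illustrative only, not DeNA's framework:

```python
# Hypothetical catalog: each entry maps a user_id range to a shard's master host.
SHARD_CATALOG = [
    # (min_uid inclusive, max_uid exclusive, shard name)
    (1,         1000000, "shard1"),
    (1000000,   2000000, "shard2"),
]

SHARD_MASTERS = {
    "shard1": "db1001.example.internal",   # hypothetical hostnames
    "shard2": "db2103.example.internal",
}

def shard_for_user(user_id):
    """Look up the shard owning a given user_id. In production the catalog
    would be loaded from a catalog table and cached, not hard-coded."""
    for low, high, shard in SHARD_CATALOG:
        if low <= user_id < high:
            return shard
    raise KeyError("no shard mapped for user_id=%d" % user_id)

# Usage: connect to SHARD_MASTERS[shard_for_user(1234567)] for that user's writes.
```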

  18. Don’t move data into different shard, but move entire data to higher-spec environment (larger RAM, SSDs, etc) It is easy by using replication slaves Practically we’ve been able to avoid re-sharding Approaches to avoid Re-Sharding Instance 1 Instance 1 Instance 1 Instance 2 Instance 2 Instance 2 …. Running N shards within single server Running a dedicated shard per server …. Moving to higher-spec server

  19. Handling rapidly growing user counts. On one of our most popular online games, we started with 2 database shards + 3 slaves per shard. The number of registered users in the first two days after launch was much higher than expected, so we added two more shards dynamically and mapped all new users to the new shards. We have heavily used range partitioning for removing older data: older data can be dropped very quickly (milliseconds), which has helped a lot to keep total data size down (less than 250GB of database size per shard).
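
Dropping old data via range partitioning is near-instant because it removes whole partitions instead of deleting rows one by one. The sketch below shows the kind of MySQL DDL involved, with an illustrative table and partition naming scheme (not the actual schema):

```python
# Illustrative monthly range partitioning on an activity-log style table.
CREATE_SQL = """
CREATE TABLE user_action_log (
  user_id  BIGINT NOT NULL,
  acted_at DATETIME NOT NULL,
  action   VARCHAR(64) NOT NULL
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(acted_at)) (
  PARTITION p201106 VALUES LESS THAN (TO_DAYS('2011-07-01')),
  PARTITION p201107 VALUES LESS THAN (TO_DAYS('2011-08-01')),
  PARTITION pmax    VALUES LESS THAN MAXVALUE
)
"""

def drop_old_partition(cursor, partition_name):
    """DROP PARTITION removes the partition's data file directly, so it
    completes in milliseconds regardless of how many rows it holds.
    partition_name must come from a trusted list (it cannot be bound as a
    query parameter)."""
    cursor.execute("ALTER TABLE user_action_log DROP PARTITION %s" % partition_name)

# e.g. drop_old_partition(cur, "p201106") once that month's data is no longer needed.
```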

  20. Mitigating replication delay. For older games using HDDs on slaves, even 1,000+ updates per second causes replication delay; replacing HDDs with SATA SSDs on slaves is a good practice. With many updates (4,000+ per second), some slaves fell behind the master even with SSDs, so we also worked on: increasing RAM (from 32GB to 64GB per shard), reducing the total write volume (innodb-doublewrite=0 helps in InnoDB/MySQL), and avoiding sudden short stalls (using the xfs filesystem instead of ext3). We migrated to higher-spec servers without downtime and haven't needed to re-shard so far.

  21. Cloud vs. physical servers. DeNA uses physical servers (thousands of them); some of our subsidiaries use AWS / App Engine.

  22. Advantages of cloud servers. Initial costs are very small: you don't need to buy 10 servers (which may cost more than $50,000) for a new game that might be closed within a few months. No lead time to add servers: it is not uncommon for new physical H/W components to take a month to arrive. No penalty to remove servers: if unused physical servers can't be used anywhere else, they just waste money.

  23. Disadvantages of cloud servers. It takes longer to analyze problems: network configuration and storage devices are black boxes; problems caused by disks and the network happen often, but it takes longer to find root causes, and this really matters for games that generate $10M/month. Limited choices for performance optimization; what I want is: direct-attached PCIe SSDs for handling massive reads/writes, customizing Linux to reduce TCP connection timeouts (from 3 seconds to 0.5-1 second), and updating device drivers. Per-server performance tends to be (much) lower than a physical server, and cloud is more expensive than physical servers once your system becomes large.

  24. Distributing across DCs/regions. Availability: a single datacenter crash should not result in service failure. Latency / response time: the round trip time (RTT) between Tokyo and San Francisco exceeds 100ms, so 10+ round trips within one HTTP request will take more than 1 second. If you plan to release games in Japan/APAC, I don't recommend serving all content from the US; use a CDN (to serve static content from the APAC region) or run servers in APAC.

  25. Distributing across multiple regions (e.g. a Tokyo DC and a US East DC, each with its own web/cache tier, master DBs, and slave DBs). Network latency must be considered (100+ms RTT): web servers should access local databases as much as possible. Conflict detection and resolution is a very tough issue: what if user_id=1 was updated in both regions at the same time? A "my region ID" per user should be considered: when updating users whose home region is Tokyo, always access the masters in Tokyo. Bulk operations should be considered to reduce round trips between web and DB: stored procedures, a proxy server, etc. A region-aware routing sketch follows.
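
A minimal sketch of "home region" routing, assuming each user record carries a home-region attribute and hypothetical per-region hostnames; this illustrates the idea only and is not the production code:

```python
# Hypothetical per-region master hosts.
MASTERS = {
    "tokyo":   "master.tokyo.example.internal",
    "us-east": "master.us-east.example.internal",
}

LOCAL_REGION = "tokyo"   # set per datacenter at deploy time

def master_for_update(user_home_region):
    """Always write to the master in the user's home region, so the same row
    is never updated concurrently in two regions (no conflict resolution needed)."""
    return MASTERS[user_home_region]

def slave_for_read():
    """Reads go to slaves in the local region to avoid the 100+ms cross-region RTT;
    they may be slightly stale if the user's home region is remote."""
    return "slave.%s.example.internal" % LOCAL_REGION
```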

  26. Monitoring and performance practices: monitoring, fighting against stalls, improving per-server performance, consolidating servers.

  27. Monitoring servers. Server activity: reachable via ping, http, mysql, etc. H/W and OS errors: memory usage (to avoid running out of memory), disk failure, disk block failure, RAID controller failure, network errors (sometimes caused by bad switches), clock time.

  28. Monitoring server performance. Resource utilization per second: load average (not perfect, but better than nothing), concurrency (the number of running threads: useful to detect stalls), disk busy rate (svctm, %iowait), CPU utilization (especially on web servers), network traffic, free memory (swapping is bad; be careful about memory leaks). Web servers: HTTP response time, the number of processes/threads that can accept new HTTP connections. Database servers: queries per second, bad queries, long-running transactions (blocking other clients), replication delay. A monitoring sketch follows.
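
A minimal sketch of collecting two of the database metrics above (queries per second and replication delay) over a standard MySQL DB-API cursor; the polling interval is illustrative:

```python
import time

def queries_per_second(cursor, interval=1.0):
    """Derive QPS from the 'Questions' status counter sampled twice."""
    cursor.execute("SHOW GLOBAL STATUS LIKE 'Questions'")
    before = int(cursor.fetchone()[1])
    time.sleep(interval)
    cursor.execute("SHOW GLOBAL STATUS LIKE 'Questions'")
    after = int(cursor.fetchone()[1])
    return (after - before) / interval

def replication_delay_seconds(cursor):
    """Return Seconds_Behind_Master from a slave, or None if not replicating."""
    cursor.execute("SHOW SLAVE STATUS")
    row = cursor.fetchone()
    if row is None:
        return None
    columns = [desc[0] for desc in cursor.description]
    return dict(zip(columns, row)).get("Seconds_Behind_Master")
```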

  29. Best practices for minimizing replication delay. Application side: continuously monitor bad queries (especially when deploying new modules); do not run massive updates in a single DML statement (LOAD DATA...; ALTER TABLE...); reduce the number of DML statements (INSERT+UPDATE+DELETE -> single UPDATE). Infrastructure side: use SSDs on slave servers, use larger RAM, use the xfs filesystem. A chunked-update sketch follows.
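
One way to avoid "massive updates in a single DML statement" is to break a big purge into small batches, so the single-threaded slave SQL thread never has to replay one huge transaction. A minimal sketch with an illustrative table name:

```python
import time

def delete_in_chunks(conn, chunk_size=1000, pause=0.1):
    """Delete expired rows in small batches instead of one giant DELETE,
    so each replicated transaction stays short and slaves can keep up."""
    cur = conn.cursor()
    while True:
        cur.execute(
            "DELETE FROM expired_items WHERE expires_at < NOW() LIMIT %s",
            (chunk_size,),
        )
        conn.commit()                  # commit each small batch separately
        if cur.rowcount < chunk_size:
            break                      # fewer rows than the limit: we are done
        time.sleep(pause)              # give slaves a chance to catch up
```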

  30. Performance analysis. Identify why the server is slow or resource hungry. Web server: profile functions with long elapsed time (e.g. using NYTProf for Perl-based applications). Database server: bad queries (full table scans, etc.) and indexing; only a few query patterns consume more than 90% of the time.

  31. Thundering herd / C10K. Most middleware can't handle 100K+ concurrent requests per server, so work hard for stable request processing. A sudden cache miss (CDN bug, memcached bug, etc.) will send burst requests toward backend servers. Case: a memcached bug (recently fixed) crashed memcached when thousands of persistent connections were established; all requests to memcached then went to the database servers as cache misses, the backend database servers couldn't handle 10x more queries, and the service went down. A cache-stampede mitigation sketch follows.
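
A common mitigation for this kind of cache-miss storm is to let only one client recompute a missing cache entry while the others briefly back off instead of all hitting the database at once. A minimal sketch assuming a memcached-style client (get/set/add/delete) and illustrative key names:

```python
import time

def get_with_stampede_protection(cache, key, recompute, ttl=60, lock_ttl=5):
    """On a cache miss, only the client that wins the 'add' lock queries the
    database; everyone else waits briefly instead of stampeding the backend."""
    value = cache.get(key)
    if value is not None:
        return value
    # add() is atomic: it succeeds only if the lock key does not exist yet.
    if cache.add(key + ":lock", "1", lock_ttl):
        value = recompute()            # single database query for this key
        cache.set(key, value, ttl)
        cache.delete(key + ":lock")
        return value
    # Lost the race: wait briefly and retry the cache before falling back.
    time.sleep(0.05)
    cached = cache.get(key)
    return cached if cached is not None else recompute()
```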

  32. Stalls. Stalls cause bad response time and unstable resource usage; in bad cases web servers or database servers go down. They are especially serious on database servers, because all web servers are affected. Identifying stalls is important but difficult: very often the root causes are inside the middleware's source code. I often take stack traces and dig into the MySQL source code.

  33. Avoiding sudden performance drops. Some unstable database servers suddenly drop performance in certain situations; low performance is a problem because we can't meet customers' demands during that time. Though product A is better on average, product B is much more stable (picture a response-time-over-time chart where product A spikes occasionally while product B stays flat). Don't trust benchmarks: vendors' benchmarks show the best score but don't show the bad numbers.

  34. Monitoring stalls. Symptoms: all clients are blocked for a short period of time (less than one second to a few seconds); the number of internal running threads grows significantly (5-20 on average, but suddenly grows to 1000+ and "Too many connections" errors are thrown); response time increases. A simple stall-detection sketch follows.
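
A minimal sketch of watching for the running-thread spikes described above; the threshold and polling interval are illustrative:

```python
import time

def watch_threads_running(cursor, threshold=100, interval=1.0):
    """Poll Threads_running and report sudden spikes, which usually indicate
    a stall (normally 5-20, jumping to 1000+ when everything is blocked)."""
    while True:
        cursor.execute("SHOW GLOBAL STATUS LIKE 'Threads_running'")
        running = int(cursor.fetchone()[1])
        if running > threshold:
            print("possible stall: Threads_running=%d" % running)
        time.sleep(interval)
```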

  35. Avoiding stalls. Avoid holding locks for a long time: other clients that need the locks are blocked. Establishing a new TCP connection sometimes takes 3 seconds because of SYN retry. Prohibit cheating by malicious users, i.e. massive requests from the same user id (a simple per-user rate limiter sketch follows). Be careful when choosing database products: most newly announced database products don't care about stalls.
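
A minimal sketch of throttling massive requests from a single user id, using a memcached-style counter; the limits are illustrative:

```python
def allow_request(cache, user_id, limit=30, window=10):
    """Allow at most `limit` requests per `window` seconds per user id."""
    key = "reqcount:%d" % user_id
    if cache.add(key, 1, window):      # first request in this window
        return True
    count = cache.incr(key)            # atomic counter increment
    # If the window expired between add() and incr(), treat it as a new window.
    return count is None or count <= limit
```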

  36. Improving per-server performance. To handle 1 million queries per second: at 1,000 queries/sec per server you need 1,000 servers in total; at 10,000 queries/sec per server, only 100 servers. The additional 900 servers would cost $10M initially and $1M every year, so if you can increase per-server throughput, you can reduce the total number of servers and TCO.

  37. Recent H/W trends. 64-bit Linux + large RAM: 60GB to 128GB of RAM is quite common. SSD: database performance can be N times faster, and using SSDs on MySQL slaves is a good practice to eliminate replication delay. Network and CPU: don't use 100Mbps networking; CPU speed matters because there is still a lot of single-threaded code inside middleware.

  38. Consolidating web/DB servers. How do you handle unpopular games? Running a small game on high-end servers is not cost effective. Recent H/W is fast, and running N DB instances on a single server is not uncommon; DeNA consolidates 2-10 games on a single database server.

  39. Performance is not everything: high availability, disaster recovery (backups), security, etc. Be careful about malicious users: bots / repeated access via tools, duplicating items / real money trades, illegal logins, etc.

  40. Administration practices: automating setups, operations without downtime, automating failover.

  41. Automation. Automation is important to reduce operational costs: the number of dev/ops engineers cannot grow as quickly as the number of servers. At DeNA, hundreds of servers are managed per devops engineer; initial server setup (installing the OS, setting up the web server) can be done within 30 minutes, and servers can be added to or removed from services in seconds to minutes. Do not automate (in production) what you do not understand: how do you resume a failover when automated failover stops with errors in the middle? Without understanding it in depth, it is very hard to recover.

  42. Automating installation/setup. Automate software installation / filesystem partitioning / etc.: Kickstart + a local yum repository, or OS cloning (copying an entire OS image, including software packages and conf files, from a read-only base server; we use this approach). Automate initial configuration: not all servers are identical (hostname, IP address, disk size, some parameters such as server-id in MySQL); use a configuration manager like Chef/Puppet (we use a similar internal tool). Automate continuous configuration checking: sometimes people change configurations tentatively and forget to set them back (a drift-check sketch follows).
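
A minimal sketch of continuous configuration checking against MySQL: compare a few expected global variables with what the server actually runs. The expected values here are illustrative; in practice they would come from the configuration manager:

```python
# Illustrative expected settings, keyed by MySQL global variable name.
EXPECTED_VARIABLES = {
    "innodb_doublewrite": "OFF",
    "max_connections": "10000",
}

def check_config_drift(cursor):
    """Return a list of (variable, expected, actual) tuples that drifted."""
    drifted = []
    for name, expected in EXPECTED_VARIABLES.items():
        cursor.execute("SHOW GLOBAL VARIABLES LIKE %s", (name,))
        row = cursor.fetchone()
        actual = row[1] if row else None
        if actual != expected:
            drifted.append((name, expected, actual))
    return drifted
```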

  43. Moving servers: games that are expanding rapidly. For expanding games: add more web servers; add more replication slaves or cache servers (for read scaling); add more shards (for write scaling; DeNA has a Perl-based sharding framework on the application side, so we can add new shards without stopping services); scale up the master's H/W or upgrade the MySQL version (more RAM, HDD->SSD/PCIe SSD, faster CPU, etc.).

  44. Moving servers: games that are shrinking gradually. For shrinking games: remove web servers, remove replication slaves, migrate to lower-spec machines, consolidate a few servers onto a single machine.

  45. Moving database servers. For many databases (MySQL etc.) there is only one master server (replication master) per shard, and moving the master is not trivial; many people allocate scheduled maintenance downtime. We want to move master servers more frequently: for scaling up or scaling down (online games have many more such opportunities than non-game web services), for upgrading the MySQL version or updating non-dynamic parameters, and for working around power outages (moving games to a remote datacenter).

  46. Desire for easier operations. In many cases people do not want to allocate a maintenance window: it requires announcing to users, coordinating with customer support, etc.; longer downtime reduces revenue and hurts the brand; and operations staff get exhausted by too much midnight work. Reducing maintenance time is important for managing hundreds or thousands of servers.

  47. Our case. Previously: allocating a 30+ minute maintenance window after 2:00am, announcing on the top page of the game, coordinating with the customer support team; we couldn't do it often because it was painful. Currently: migrating to a new server gracefully within 0.5-3 seconds with an online MySQL master switching tool ("MHA for MySQL", OSS); no maintenance window is allocated, so it can be done in the daytime. Note: be very careful about error handling when using databases that support an "automated master switch": killing database sessions might result in data inconsistency, and if the tool waits a long time for sessions to disconnect, downtime will be longer.

  48. Automating failover. The master database server is usually a single point of failure and is difficult to fail over. Bleeding-edge databases that support automated failover often don't work as expected (split brain, false positives, etc.). In our case: using MHA for automated MySQL failover, it takes 9-12 seconds to detect a failure and 0.5-N seconds to complete the failover; failures are mostly caused by H/W problems.

  49. Manual, simple failover. In extreme crash scenarios (i.e. datacenter failure), automated failover is dangerous; identifying the root cause should be the first priority. But failover should still be doable with a simple enough command: just one command.

  50. Summary. Understanding scaling solutions is important for providing large, growing social games. Stable performance is important to avoid sudden service outages; do not trust sales talk, as much middleware still has many stall problems. Online games tend to grow or shrink rapidly, so solutions for setting up and migrating servers (including the master database) are important.
