Scalability By Alex Huang
Current Status • 10k resources managed per management server node • Scales out horizontally (must disable stats collector) • Real production deployment of tens of thousands of resources • Internal testing with software simulators up to 30k physical resources with 30k VMs managed by 4 management server nodes • We believe we can at least double that scale per management server node
Balancing Incoming Requests • Each management server has two worker thread pools for incoming requests: effectively two servers in one. • Executor threads provided by Tomcat • Job threads waiting on a job queue • Incoming requests that require mostly DB operations are short in duration and are executed directly by the executor threads, since incoming requests are already load balanced by the load balancer. • Incoming requests needing resources, which often run for a long time, are checked against ACLs by the executor threads, then queued and picked up by job threads. • The number of job threads is scaled to the number of DB connections available to the management server. • Requests may take a long time depending on resource constraints, but they don’t fail.
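The two-pool design above can be sketched as follows. This is an illustrative sketch, not CloudStack's actual code: the names (`handle_request`, `job_worker`, `DB_CONNECTIONS`) and the pool size are assumptions. Short DB-bound requests run inline on executor threads; long resource-bound requests are ACL-checked, then queued for job threads sized to the DB connection pool.

```python
import queue
import threading

DB_CONNECTIONS = 4          # assumed pool size; job threads scale to this

job_queue = queue.Queue()

def job_worker():
    while True:
        job = job_queue.get()
        if job is None:            # shutdown sentinel
            break
        job()                      # long-running resource operation
        job_queue.task_done()

job_threads = [threading.Thread(target=job_worker, daemon=True)
               for _ in range(DB_CONNECTIONS)]
for t in job_threads:
    t.start()

def handle_request(request, is_resource_bound, acl_check):
    """Executor-thread entry point (the servlet container would call this)."""
    if not is_resource_bound:
        return request()           # short DB operation: run inline
    if not acl_check(request):
        raise PermissionError("ACL check failed")
    job_queue.put(request)         # picked up later by a job thread
    return "queued"
```

Because the job pool is capped at the DB connection count, queued work can never oversubscribe the least abundant resource, which is the point of the design.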
The Much Harder Problem • CloudStack performs a number of tasks on behalf of users, and those tasks increase with the number of virtual and physical resources available • VM Sync • SG (security group) Sync • Hardware capacity monitoring • Virtual resource usage statistics collection • More to come • When done in the hundreds, no big deal. • As the numbers increase, the problem magnifies. • How do we scale this horizontally across management servers?
Comparison of Two Approaches • Stats Collector – collects capacity statistics • Fires every five minutes to collect stats on host CPU and memory capacity • Smart server, dumb client model: the resource only collects info; the management server processes it • Runs the same way on every management server • VM Sync • Fires every minute • Peer-to-peer model: the resource does a full sync on connection and delta syncs thereafter. The management server trusts the resource for correct information. • Only runs against resources connected to that management server node
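The full-sync-then-delta model can be sketched like this. It is a minimal illustration, assuming hypothetical `Resource` and `ManagementServer` classes: the resource remembers what it last pushed and sends only the states that changed out-of-band, while the management server simply trusts the reports.

```python
class Resource:
    """The agent side: the hypervisor's view is authoritative."""
    def __init__(self, vm_states):
        self.vm_states = dict(vm_states)   # read from the physical hardware
        self.last_pushed = {}

    def full_sync(self):
        """Sent once, when the resource connects to a management server."""
        self.last_pushed = dict(self.vm_states)
        return dict(self.vm_states)

    def delta_sync(self):
        """Pushed by the resource: only states changed since the last push."""
        delta = {vm: s for vm, s in self.vm_states.items()
                 if self.last_pushed.get(vm) != s}
        self.last_pushed.update(delta)
        return delta

class ManagementServer:
    def __init__(self):
        self.view = {}                     # trusts the resource's reports

    def apply(self, states):
        self.view.update(states)
```

The steady-state cost is proportional to the number of out-of-band changes, not the number of VMs, which is why the model scales to a much larger data set.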
Numbers • Assume 10k hosts and 500k VMs (50 VMs per host) • Stats Collector • Fires off 10k requests every 5 minutes, or about 33 requests a second. • Bad but not too bad: occupies 33 threads every second. • But just wait: • 2 management servers: 66 requests • 3 management servers: 99 requests • It gets worse as the number of management servers increases, because the collector does not auto-balance across management servers. • Oh, but it gets worse still: the 10k hosts are now spread across 3 management servers. While 99 requests are generated, the number of threads involved is three-fold, because requests must be routed to the management server that owns the host. • It keeps the management servers about 20% busy even with no load from incoming requests. • VM Sync • Fires off 1 request at resource connection to sync about 50 VMs • Then the resource pushes changes: it knows what it has pushed before and only pushes changes that are out-of-band. • So essentially no threads are occupied for a much larger data set.
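The arithmetic behind the figures above is worth making explicit. Every management server fires the same five-minute timer over all 10k hosts, so the request rate grows linearly with the number of management servers:

```python
hosts = 10_000
interval_s = 5 * 60                 # stats collector fires every 5 minutes

per_ms = hosts // interval_s        # ~33 requests/second per management server

# Each management server runs the same collector over all hosts,
# so total load scales linearly with the number of servers:
for n_ms in (1, 2, 3):
    print(f"{n_ms} management server(s): {n_ms * per_ms} requests/second")

# VM Sync, by contrast, costs one request per host at connection time
# and only out-of-band deltas afterwards: ~0 steady-state requests.
```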
What’s the Down Side? • Resources must reconcile the VM states caused by management server commands with the VM states they collect from the physical hardware, so they require more CPU. • Resources must use more memory to keep what amounts to a journal of changes since the last sync point. • But data centers are full of these two resources (CPU and memory).
Resource Load Balancing • As a management server is added to the cluster, resources are rebalanced seamlessly: • MS2 signals MS1 to hand over a resource • MS1 waits for in-flight commands on the resource to finish • MS1 holds further commands in a queue • MS1 signals MS2 to take over • MS2 connects • MS2 signals MS1 to complete the transfer • MS1 discards the resource and flows the held commands to MS2 • Listeners are provided so business logic can listen on connection status and adjust work based on who’s connected. • Because each process only works on resources connected to its own management server, work is auto-balanced between management servers. • This also reduces message routing between the management servers.
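The handover steps above can be sketched as a small state machine. This is an illustrative sketch under assumed names (`MgmtServer`, `begin_handover`, `complete_handover`), not CloudStack's actual classes; the signaling between servers and the wait for in-flight commands are elided.

```python
class MgmtServer:
    def __init__(self, name):
        self.name = name
        self.resources = set()
        self.held = []            # commands queued while a transfer is underway
        self.dispatched = []      # commands actually sent to resources
        self.transferring = None

    def begin_handover(self, resource):
        # Steps 1-2: MS2 asked for the resource; MS1 waits for in-flight
        # commands to finish (elided here) and starts holding new ones.
        self.transferring = resource

    def send_command(self, resource, cmd):
        if self.transferring == resource:
            self.held.append(cmd)              # step 3: hold during transfer
        else:
            self.dispatched.append((resource, cmd))

    def complete_handover(self, new_owner, resource):
        # Steps 4-7: MS2 has connected; MS1 discards the resource and
        # flows the held commands to the new owner.
        self.resources.discard(resource)
        new_owner.resources.add(resource)
        for cmd in self.held:
            new_owner.send_command(resource, cmd)
        self.held.clear()
        self.transferring = None
```

Holding commands rather than rejecting them is what makes the rebalance seamless: callers never see a failure, only a slightly longer latency during the transfer window.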
Designing for Scalability • Take advantage of the most abundant resources in a data center (CPU, RAM) • Auto-scale to the least abundant resource (DB) • Do not hold DB connections or transactions across resource calls. • Use a lock-table implementation (Merovingian2 or the GenericDao.acquireLockInTable() call) rather than database row locks in this situation. • Database row locks are still fine for quick, short lockouts. • Balance resource-intensive tasks as the number of management server nodes increases and decreases. • Use job queues to balance long-running processes across management servers. • Make use of resource rebalancing in CloudStack to auto-balance your workload.
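The lock-table pattern can be illustrated as below. This is a hedged sketch, not the Merovingian2 or GenericDao.acquireLockInTable() implementation: a lock is an ordinary row keyed by name, and the unique-key insert succeeds for exactly one taker, so no row lock or open transaction is held across the long resource call.

```python
import sqlite3

# In-memory stand-in for the shared management database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE op_lock (name TEXT PRIMARY KEY, holder TEXT)")

def acquire(name, holder):
    """Try to take the named lock; the PRIMARY KEY makes this atomic."""
    try:
        db.execute("INSERT INTO op_lock (name, holder) VALUES (?, ?)",
                   (name, holder))
        db.commit()
        return True
    except sqlite3.IntegrityError:
        return False                       # someone else holds the lock

def release(name, holder):
    db.execute("DELETE FROM op_lock WHERE name = ? AND holder = ?",
               (name, holder))
    db.commit()
```

The transaction here is commit-immediately and lasts microseconds; the long-running resource call happens between `acquire` and `release` with no DB connection pinned, which is exactly the "do not hold connections across resource calls" rule above.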
Reliability By Alex Huang
The Five W’s of Unreliability • What is unreliable? Everything. • Who is unreliable? Developers & administrators. • When does unreliability happen? 3:04 a.m., no matter which time zone… any time. • Where does unreliability happen? In carefully planned, everything-has-been-considered data centers. • How does unreliability happen? Rather nonchalantly.
Dealing with Unreliability • Don’t assume! • Don’t bang your head against the wall! • Know when you don’t know any better. • Ask for help!
Designs against Unreliability • Management servers keep a heartbeat with the DB: one ping a minute. • A management server self-fences if it cannot write its heartbeat. • Other management servers wait to make sure the down management server is no longer writing its heartbeat, then signal interested software to recover. • Checkpoint at every call to a resource, and write code to recover from those checkpoints. • Database records are not actually deleted, to help with manual recovery when needed. • Write code that is idempotent. • Respect modularity when writing your code.
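The heartbeat-and-fence design can be sketched as below. This is an illustration under assumed names (`write_heartbeat`, `find_dead_peers`, the grace period), not the actual CloudStack code: each server writes a timestamp once a minute, self-fences if the write fails, and peers only declare a server dead after its heartbeat has stayed stale past a grace period.

```python
HEARTBEAT_INTERVAL = 60     # one ping a minute
GRACE_PERIODS = 2           # peers wait this many missed intervals

heartbeats = {}             # stands in for the heartbeat table in the DB

def write_heartbeat(server_id, now, db_up=True):
    if not db_up:
        return "self-fence"            # cannot write: stop doing work
    heartbeats[server_id] = now
    return "ok"

def find_dead_peers(my_id, now):
    """Peers whose heartbeat is older than the grace window."""
    cutoff = now - GRACE_PERIODS * HEARTBEAT_INTERVAL
    return [s for s, ts in heartbeats.items()
            if s != my_id and ts < cutoff]
```

Self-fencing and the waiting period are two halves of one guarantee: the down server stops writing before anyone else starts recovering its work, so two servers never act on the same resources at once.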