Architecture and optimization on 1000 nodes cluster in China Mobile

Li Hao, Junwei Liu, Yuntong Jin, Yingxin Cheng
lihao@cmss.chinamobile.com
ecloud.10086.cn

Agenda
• Practice of OpenStack in CMCC
• Architecture & Deployment
• Optimization
• 1000-node Performance

Practice of OpenStack in CMCC
• 1 public cloud with 2 pools, 1 private cloud with 2 pools
• IT public cloud (Guangzhou): 1000 nodes in total, 600 nova-compute
• Public cloud (Beijing): 1000 nodes in total, 530 nova-compute
• Private cloud: 6000 nodes in total, 3000 nodes per pool
• The public cloud (Guangzhou & Beijing) is online (https://ecloud.10086.cn)
• The private cloud will go online in June 2017

Public Cloud Topological Graph

Details of The Topology
• Horizontally extended services
  • Horizontal extension of the core services
  • HAProxy + Keepalived + LVS provide load balancing
  • There is no single point of failure
  • 5+ nodes are used to deploy the API gateway
  • Each service resides on a different rack
• Multi-AZ, multi-HA
  • Divided into NFV AZ, CPU-bound AZ, big data AZ, exclusive-physical-machine AZ, nova AZ
  • Windows HA, Linux HA

Details of The Topology
• Multi-pool
  • The Guangzhou and Beijing resource pools are independent
  • OP (China Mobile Resource Management Platform) manages multiple resource pools

Details of The Deployment
• Ansible is used for deployment
• Modules used: shell, copy, script
• Deployment is semi-automated

Optimization
• Infrastructure optimization
  • Every physical machine has three bonds in active-backup mode
  • The business network and the storage network are separated, so business traffic and storage traffic do not interfere with each other
  • The core services use SSDs
  • Make full use of network resources
• Optimization of the middleware components
  • Optimization of HAProxy and Keepalived
  • Optimization of MariaDB Galera
  • Optimization of RabbitMQ
• Optimization of OpenStack components

Optimization of the middleware components
• RabbitMQ
  • Increased the RabbitMQ connection pool
  • Added a heartbeat check
• HAProxy
  • Configured a reasonable number of HAProxy processes
  • Strengthened health checks of back-end services
• Keepalived
  • Enhanced exception handling for Keepalived
  • Added exception handling for NICs
• MySQL
  • Increased the connection pool
  • Solved the deadlock problem

Optimization-nova
• Profile optimization
  • Set max_pool_size = 1000 to increase nova's processing capacity
  • Increase the number of workers to raise nova-api's concurrent processing capacity
  • Set scheduler_host_subset_size = 10 to improve scheduling concurrency
  • Use reasonable CPU, disk, and memory ratios to increase the number of VMs running on a single machine
  • Use more scheduler filters for more precise scheduling
  • Use configdrive
• Nova new features
  • Change the VM secret, change the hostname, and execute commands in the VM through the nova API
  • Optimized the volume-attach and volume-detach code
• Practice optimization
  • All images are cached on each compute node, greatly speeding up VM provisioning
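The effect of scheduler_host_subset_size can be illustrated with a toy model. This is a sketch, not Nova's scheduler code: the helper names (`pick_host`, `count_collisions`), the host names, and the weights are all hypothetical. It assumes the option makes the scheduler choose randomly among the N best-weighed hosts, so that concurrent requests are less likely to pile onto the same host and trigger retries.

```python
import random


def pick_host(weighed_hosts, subset_size, rng):
    """Pick randomly among the `subset_size` best-weighed hosts.

    weighed_hosts is a list of (host_name, weight); a higher weight is better.
    """
    ranked = sorted(weighed_hosts, key=lambda hw: hw[1], reverse=True)
    subset = ranked[:max(1, subset_size)]
    return rng.choice(subset)[0]


def count_collisions(weighed_hosts, subset_size, concurrent_requests, seed=0):
    """Count requests that chose a host already chosen by another request.

    All requests see the same (stale) host weights, which models several
    requests being scheduled at the same moment.
    """
    rng = random.Random(seed)
    picks = [pick_host(weighed_hosts, subset_size, rng)
             for _ in range(concurrent_requests)]
    return len(picks) - len(set(picks))


# Hypothetical pool of 50 hosts with strictly decreasing weights.
hosts = [(f"compute-{i}", 100 - i) for i in range(50)]

# With a subset of 1, every request picks the single best host.
print(count_collisions(hosts, subset_size=1, concurrent_requests=8))
# With a subset of 10, the picks are spread over the ten best hosts.
print(count_collisions(hosts, subset_size=10, concurrent_requests=8))
```

A larger subset trades some placement quality for fewer collisions, which is why the value has to be tuned rather than simply maximized.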

Optimization-neutron
• Profile optimization
  • Increase the number of workers
  • Increase database connections
• Speed up ovs-agent restart
• Rally test results

Optimization-neutron2
• Restarting ovs-agent is very slow
  • Using ovs-ofctl and ovs-vsctl to distribute flow tables and configure ovsdb ports is time consuming: configuring one flow table takes 0.4 seconds on average, which becomes very slow at large scale
• Optimization:
  • Stop distributing duplicate flow tables
  • Process concurrently
  • Don't delete ports and flow tables on restart (already optimized in the L version)
  • Use a local controller and deliver flow tables through the local RYU controller (already optimized in the L version)
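The first two optimizations above, skipping duplicates and processing concurrently, can be sketched as follows. This is an illustration only, not the ovs-agent implementation: `dedupe_flows`, `install_flows`, `estimate_restart_seconds`, the `install_one` callback, and the flow strings are hypothetical, and the estimate simply assumes the 0.4-second average cost per flow table quoted above with perfectly parallel workers.

```python
from concurrent.futures import ThreadPoolExecutor

SECONDS_PER_FLOW = 0.4  # average cost per flow table quoted on the slide


def estimate_restart_seconds(flow_count, workers=1):
    """Rough estimate: flows are installed in rounds of `workers` at a time."""
    rounds = -(-flow_count // workers)  # ceiling division
    return rounds * SECONDS_PER_FLOW


def dedupe_flows(flows):
    """Drop repeated flow entries while keeping the first-seen order."""
    seen = set()
    unique = []
    for flow in flows:
        if flow not in seen:
            seen.add(flow)
            unique.append(flow)
    return unique


def install_flows(flows, install_one, workers=4):
    """Install each distinct flow once, using a pool of worker threads.

    install_one is a caller-supplied function (hypothetical here) that
    pushes a single flow entry to the switch.
    """
    unique = dedupe_flows(flows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(install_one, unique))
    return len(unique)


# Illustrative flow strings; a real agent would build these from port data.
flows = ["in_port=1,actions=output:2",
         "in_port=2,actions=output:1",
         "in_port=1,actions=output:2"]
installed = []
count = install_flows(flows, installed.append, workers=2)
print(count, "distinct flows installed")
print(estimate_restart_seconds(5000), "s serial vs",
      estimate_restart_seconds(5000, workers=8), "s with 8 workers")
```

The estimate ignores contention on the switch itself, so real gains from concurrency would likely be smaller than the ideal figure.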

Optimization-monitoring
• Use ceilometer + gnocchi + influxdb
  • In a large-scale deployment, querying samples with ceilometer + mongodb is almost impossible
  • With gnocchi, sample queries return within 5 seconds for 10000 virtual machines
  • Supports senlin for elastic scaling
• CMCC monitoring architecture:

Optimization-cinder
• Profile optimization
  • Support a sheepdog cluster and an IP SAN cluster at the same time
  • Added the heartbeat_timeout_threshold parameter for bi-directional heartbeat checks
  • Adjust the provisioned_capacity parameter to set a reasonable ratio for volume capacity
  • Raise the osapi_max_limit value to increase the number of entries displayed
  • Configure a reasonable QoS to limit the I/O of each volume
• High availability optimization
  • Multiple cinder-volume services are configured with the same host
  • Pacemaker manages the multiple cinder-volume clusters
  • The cinder API uses HAProxy & Keepalived for load balancing

Optimization-glance
• Glance new features
  • Images are shared across different pools
  • Images can be downloaded with download tools, e.g. FlashGet, wget
• Practice optimization
  • nova-compute reaches glance-api through its storage-network address instead of the glance-api endpoint address
  • glance-api uses the image cache

Optimization-keystone
• Profile optimization
  • Use more workers
  • Run Keystone in HTTPD
  • Use the Fernet token format
• Rally test: PKIZ vs Fernet (5000 concurrent)

Nova 1000-node performance
• Technique
  • 2-phase analysis
  • White-box profiling
  • State machine parsing
  • Component-level costs
• 5 ~ 2000 concurrent requests
  • Scheduler bottleneck
  • Scheduler saturation
  • Compute cost
• Statistics
  • Throughput: 1.78 requests/second
  • Failure rate: up to 41.1%
  • Retry rate: up to 26.3%

OpenStack troubleshooting • Failure tracing

OpenStack optimizations
Detailed whitepaper: https://01.org/sites/default/files/performance_analysis_and_tuning_in_china_mobiles_openstack_production_cloud.pdf
• Issues
  • Database deadlocks
  • Neutron port creation failure
  • Keystone authentication failure
• Improvements (800 requests)
  • Failure rate: 25.7% -> 0%
  • Retry rate: 29.0% -> 0.83%
  • Wall clock: 470 -> 150 sec
  • Throughput: 1.31 -> 4.35 req/sec
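To put the before/after figures above on a common scale, the short sketch below derives improvement ratios from them. The helper names (`speedup`, `percent_reduction`) are hypothetical; the input numbers are the ones reported for the 800-request run, and nothing here reproduces how the original measurements were taken.

```python
def speedup(before, after):
    """How many times larger `after` is than `before` (for throughput)."""
    return after / before


def percent_reduction(before, after):
    """Relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100.0


# Figures reported for the 800-request run.
wall_clock_sec = (470, 150)
throughput_rps = (1.31, 4.35)
retry_rate_pct = (29.0, 0.83)

print(f"Wall clock reduced by {percent_reduction(*wall_clock_sec):.1f}%")
print(f"Throughput improved {speedup(*throughput_rps):.2f}x")
print(f"Retry rate reduced by {percent_reduction(*retry_rate_pct):.1f}%")
```

By these ratios, throughput roughly tripled while the wall-clock time dropped by about two thirds.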

THANKS FOR WATCHING
