Benchmarking Swift

Eamonn O’Toole, Mark Seger

Agenda
• Benchmarking with HP’s getput
• Procedure, tools and operation
• Case study
• Selecting servers for HP’s public cloud

The Benchmarking Bible
• Scripts work best for repeatability, both for load generation and for measurement
• Test from the bottom of the stack up
• Longer runs tend to reduce cache effects
• The middle of the test is as important as the duration
• Avoid changing more than one thing at a time
• It will take as long as it will take
• There’s no such thing as a coincidence!
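Two of these rules, scripting for repeatability and changing one thing at a time, are easy to enforce in a driver script. The following is a minimal Python sketch of a one-factor-at-a-time sweep planner: every run it yields differs from a fixed baseline in exactly one parameter, so any change in the result can be attributed to that parameter. The function name `one_factor_sweep` and the parameter names `size`, `procs` and `clients` are hypothetical, not getput options; map them onto whatever your load generator actually accepts.

```python
from typing import Any, Dict, Iterator, List, Tuple

Config = Dict[str, Any]


def one_factor_sweep(baseline: Config,
                     variations: Dict[str, List[Any]]) -> Iterator[Tuple[str, Config]]:
    """Yield (changed_parameter, config) pairs for a one-factor-at-a-time sweep.

    The first pair is the unmodified baseline; every later config differs from
    the baseline in exactly one parameter.
    """
    yield "baseline", dict(baseline)
    for name, values in variations.items():
        if name not in baseline:
            raise KeyError(f"parameter {name!r} has no baseline value")
        for value in values:
            if value == baseline[name]:
                continue  # identical to the baseline run, so skip it
            config = dict(baseline)
            config[name] = value
            yield name, config


if __name__ == "__main__":
    # Hypothetical parameters; translate each config into real tool options.
    base = {"size": "1k", "procs": 1, "clients": 1}
    for changed, cfg in one_factor_sweep(base, {"procs": [1, 2, 4, 8], "size": ["4k"]}):
        print(f"{changed:8s} {cfg}")
```

Keeping the plan as data also makes a run repeatable: the same baseline and variations always produce the same sequence of tests.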

Size Matters
• Large objects
  • IOPS are low, so pay attention to MB/sec
  • They use a lot of bandwidth, so make sure the network is wide enough
  • They use a lot of CPU, so you could need ~1 core per stream per client
• Small objects
  • MB/sec is low, so pay attention to IOPS
  • Network bandwidth is less of a concern, but latency is
  • CPU requirements are relatively low as well
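The two regimes are linked by simple arithmetic: bandwidth equals operation rate times object size. Below is a minimal Python sketch of that conversion. It assumes, as the getput results later in this deck suggest, that sizes such as `1k` and `1g` are binary multiples and that MB/Sec means 1024*1024 bytes per second (the 1GB upload shown later took 13.201 s, and 1024 MB / 13.201 s = 77.57 MB/sec, matching its reported rate). The helper names are hypothetical.

```python
# Assumption: sizes use binary multiples and "MB" means 1024*1024 bytes,
# which is what the getput numbers in this deck are consistent with.
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
MB = 1024 ** 2


def size_to_bytes(size: str) -> int:
    """Convert a size string such as '1k', '4k' or '1g' to bytes."""
    size = size.strip().lower()
    if size and size[-1] in UNITS:
        return int(size[:-1]) * UNITS[size[-1]]
    return int(size)


def mb_per_sec(ops_per_sec: float, size: str) -> float:
    """Bandwidth implied by an operation rate at a fixed object size."""
    return ops_per_sec * size_to_bytes(size) / MB


def ops_per_sec(mb_sec: float, size: str) -> float:
    """Operation rate implied by a bandwidth at a fixed object size."""
    return mb_sec * MB / size_to_bytes(size)


if __name__ == "__main__":
    print(f"1g at 77.57 MB/sec   -> {ops_per_sec(77.57, '1g'):.2f} ops/sec")
    print(f"1k at 132.40 ops/sec -> {mb_per_sec(132.40, '1k'):.2f} MB/sec")
    print(f"1k at 1600 ops/sec   -> {mb_per_sec(1600, '1k'):.2f} MB/sec")
```

A large-object stream at tens of MB/sec is well under one operation per second, while even 1600 small-object PUTs per second move only about 1.6 MB/sec, which is why each case needs its own headline metric.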

Collectl
• Developed about a dozen years ago
• Open source, hosted on SourceForge
• Collects fine-grained metrics
  • CPU, disk, network, memory and more
  • Process level, including I/O
• Can generate stats in real time or record them for later playback
• In playback mode, can summarize metrics for each process
• Colplot generates plots for visualizing overall performance

Getput Tools
• Designed exclusively for Swift benchmarking
• Many options for simulating different behaviors
  • PUTs, GETs, DELETEs
  • Object sizes
  • Number of clients
  • Number of processes
  • Level of container sharing
• Options for running tests
  • Ranges for numbers of objects, processes and clients
  • Define pre/post-test initialization/analysis scripts
• The complete list is beyond the scope of this talk

Getting Started
• You need Swift credentials exported to your environment
• If swift stat works, getput will work, and if it doesn’t, it won’t!

Simple put, get, del
$ ./getput.py -cc -oo -n1 -s1k -tp,g,d
Rank Test Clts Proc OSize Start    End      MB/Sec Ops Ops/Sec Errs Latency Median LatRange
0    put  1    1    1k    20:28:04 20:28:04 0.01   1   7.30    0    0.137   0.137  0.14-00.14
0    get  1    1    1k    20:28:04 20:28:04 0.13   1   132.40  0    0.008   0.008  0.01-00.01
0    del  1    1    1k    20:28:04 20:28:04 0.06   1   66.32   0    0.015   0.015  0.02-00.02

Multiple sizes, multiple numbers of processes
$ ./getput.py -cc -oo -n1 -s1k,2k -tp --procs 1,2
Rank Test Clts Proc OSize Start    End      MB/Sec Ops Ops/Sec Errs Latency Median LatRange
0    put  1    1    1k    20:32:55 20:32:55 0.02   1   19.92   0    0.050   0.050  0.05-00.05
0    put  1    1    2k    20:32:55 20:32:55 0.07   1   37.50   0    0.027   0.027  0.03-00.03
Rank Test Clts Proc OSize Start    End      MB/Sec Ops Ops/Sec Errs Latency Median LatRange
0    put  1    2    1k    20:32:55 20:32:55 0.03   2   28.82   0    0.071   0.084  0.06-00.08
0    put  1    2    2k    20:32:56 20:32:56 0.21   2   109.18  0    0.019   0.022  0.02-00.02

Note that the 1KB PUTs are a lot slower than the 2KB PUTs
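Once command lines like these become part of a benchmark, it helps to generate them from a script rather than type them by hand. The following is a minimal Python sketch that rebuilds the two getput command lines above. It uses only the options that appear in those examples (`-cc`, `-oo`, `-n`, `-s`, `-t`, `--procs`) and copies `-cc -oo` verbatim; the helper names `getput_argv` and `run_getput` are hypothetical. Actually running getput still requires the script and Swift credentials in the environment, exactly as above.

```python
import shlex
import subprocess
from typing import List, Sequence


def getput_argv(sizes: Sequence[str], tests: Sequence[str],
                procs: Sequence[int] = (1,), nobjs: int = 1,
                script: str = "./getput.py") -> List[str]:
    """Build a getput argument list from the options used in the examples above."""
    argv = [script, "-cc", "-oo", f"-n{nobjs}",
            "-s" + ",".join(sizes),      # e.g. -s1k,2k
            "-t" + ",".join(tests)]      # e.g. -tp,g,d
    if list(procs) != [1]:
        argv += ["--procs", ",".join(str(p) for p in procs)]
    return argv


def run_getput(argv: List[str]) -> str:
    """Run getput and return its report (raises CalledProcessError on failure)."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    print(shlex.join(getput_argv(["1k"], ["p", "g", "d"])))
    print(shlex.join(getput_argv(["1k", "2k"], ["p"], procs=[1, 2])))
```

Passing an argument list to `subprocess.run` avoids shell quoting issues; `shlex.join` is used only to print the equivalent command for the log.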

Watching with collectl

Large object upload
$ ./getput.py -cc -oo -n1 -s1g -tp
Rank Test Clts Proc OSize Start    End      MB/Sec Ops Ops/Sec Errs Latency Median LatRange
0    put  1    1    1g    20:43:51 20:44:05 77.57  1   0.08    0    13.201  13.201 13.20-13.20

Network rate is NOT smooth

collectl output for a large object upload:
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#Time    cpu sys inter ctxsw KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut
20:52:28   4   0   440   217      0     0     12      1   13    26     3     24
20:52:29  13   3   454   100      0     0     44     11    0     3     0      2
20:52:30  10   1 14913 30949      0     0      0      0  841 14253 57030  39605
20:52:31  12   1 20892 44930      0     0      0      0 1221 20841 82666  57154
20:52:32  12   1 21092 44454      0     0      0      0 1248 21296 82315  56894
20:52:33  11   1 19808 40054      0     0      0      0 1162 19839 76518  52928
20:52:34   6   0 16505 33347      0     0      0      0  927 15824 69085  47908
20:52:35   7   0 17832 34715      0     0      0      0 1028 17541 67448  46858
20:52:36   6   0 20819 42114      0     0      0      0 1219 20785 80389  55628
20:52:37   9   0 10210 20885      0     0      0      0  591 10080 40941  28290
20:52:38   6   0 20067 39984      0     0     12      1 1160 19802 75784  52552
20:52:39   8   0 21208 44885      0     0     56     14 1263 21552 82416  56985
20:52:40  12   1 18289 36995      0     0      0      0 1073 18311 71868  49758
20:52:41   8   0 20044 37608      0     0      0      0 1223 20872 94048  64743
20:52:42   8   0 17100 28888      0     0      0      0  850 14503 91449  62516
20:52:43  12   0 19396 35053      0     0      0      0 1143 19512 92891  63792
20:52:44   6   0  5005  6023      0     0      0      0  178  3025 25813  17467
20:52:45   6   0   364   142      0     0      0      0    0     2     0      2
20:52:46   4   0   188    72      0     0      0      0    0     1     0      1
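One way to put a number on "not smooth" is to parse the collectl lines and compute the spread of KBOut while the upload is running. The following is a minimal Python sketch that assumes the 13-column layout shown above (one timestamp plus twelve integer counters). The 1000 KB/sec threshold used to pick out the active interval is an arbitrary choice, the helper names are hypothetical, and the sample rows are copied from the output above.

```python
from statistics import mean, pstdev

COLUMNS = ["time", "cpu", "sys", "inter", "ctxsw", "KBRead", "Reads",
           "KBWrit", "Writes", "KBIn", "PktIn", "KBOut", "PktOut"]

SAMPLE = """\
#Time cpu sys inter ctxsw KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut
20:52:28 4 0 440 217 0 0 12 1 13 26 3 24
20:52:29 13 3 454 100 0 0 44 11 0 3 0 2
20:52:30 10 1 14913 30949 0 0 0 0 841 14253 57030 39605
20:52:31 12 1 20892 44930 0 0 0 0 1221 20841 82666 57154
20:52:32 12 1 21092 44454 0 0 0 0 1248 21296 82315 56894
20:52:33 11 1 19808 40054 0 0 0 0 1162 19839 76518 52928
20:52:34 6 0 16505 33347 0 0 0 0 927 15824 69085 47908
20:52:35 7 0 17832 34715 0 0 0 0 1028 17541 67448 46858
20:52:36 6 0 20819 42114 0 0 0 0 1219 20785 80389 55628
20:52:37 9 0 10210 20885 0 0 0 0 591 10080 40941 28290
20:52:38 6 0 20067 39984 0 0 12 1 1160 19802 75784 52552
20:52:39 8 0 21208 44885 0 0 56 14 1263 21552 82416 56985
20:52:40 12 1 18289 36995 0 0 0 0 1073 18311 71868 49758
20:52:41 8 0 20044 37608 0 0 0 0 1223 20872 94048 64743
20:52:42 8 0 17100 28888 0 0 0 0 850 14503 91449 62516
20:52:43 12 0 19396 35053 0 0 0 0 1143 19512 92891 63792
20:52:44 6 0 5005 6023 0 0 0 0 178 3025 25813 17467
20:52:45 6 0 364 142 0 0 0 0 0 2 0 2
20:52:46 4 0 188 72 0 0 0 0 0 1 0 1
"""


def parse_collectl(text: str) -> list:
    """Turn collectl output lines into dicts, skipping '#' header lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#") or len(fields) != len(COLUMNS):
            continue
        row = {"time": fields[0]}
        row.update((name, int(value)) for name, value in zip(COLUMNS[1:], fields[1:]))
        rows.append(row)
    return rows


def rate_spread(rows: list, column: str = "KBOut", active_min: int = 1000) -> dict:
    """Summarize one column over the samples where it exceeds active_min."""
    active = [row[column] for row in rows if row[column] > active_min]
    avg = mean(active)
    return {"samples": len(active), "mean": avg, "min": min(active),
            "max": max(active), "cv": pstdev(active) / avg}


if __name__ == "__main__":
    stats = rate_spread(parse_collectl(SAMPLE))
    print(f"{stats['samples']} active seconds, KBOut {stats['min']}-{stats['max']}, "
          f"mean {stats['mean']:.0f}, coefficient of variation {stats['cv']:.2f}")
```

On the sample above, the active interval covers 15 one-second samples, with KBOut ranging from 25813 to 94048 KB/sec.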

Running a Benchmark
$ gpsuite --suite 1kobjs
Test Clts Proc OSize Start    End      MB/Sec Ops   Ops/Sec Errs Latency Median LatRange
put  1    1    1k    11:36:24 11:38:24 0.02   2081  17.34   0    0.058   0.052  0.01-00.78
get  1    1    1k    11:38:54 11:39:11 0.12   2081  119.78  0    0.008   0.007  0.01-00.27
del  1    1    1k    11:39:41 11:40:15 0.06   2081  60.50   0    0.017   0.011  0.01-00.75
put  1    2    1k    11:40:45 11:42:45 0.03   4030  33.58   0    0.060   0.052  0.01-01.03
get  1    2    1k    11:43:15 11:43:31 0.25   4030  258.12  0    0.008   0.007  0.01-00.25
del  1    2    1k    11:44:01 11:44:33 0.12   4030  126.27  0    0.016   0.011  0.01-00.76
put  1    4    1k    11:45:03 11:47:03 0.06   7864  65.50   0    0.061   0.052  0.01-00.97
get  1    4    1k    11:47:33 11:47:48 0.50   7864  514.76  0    0.008   0.007  0.01-00.22
del  1    4    1k    11:48:18 11:49:01 0.21   7864  210.04  0    0.019   0.011  0.01-00.84
put  1    8    1k    11:49:31 11:51:31 0.12   14711 122.56  0    0.065   0.052  0.01-00.99
get  1    8    1k    11:52:01 11:52:16 0.95   14711 975.96  0    0.008   0.007  0.01-00.25
del  1    8    1k    11:52:46 11:53:37 0.29   14711 298.07  0    0.027   0.011  0.01-01.23
put  1    16   1k    11:54:07 11:56:07 0.24   29435 245.23  0    0.065   0.052  0.01-01.33
get  1    16   1k    11:56:37 11:56:52 1.88   29435 1927.82 0    0.008   0.007  0.01-00.26
del  1    16   1k    11:57:23 11:58:31 0.45   29435 459.14  0    0.035   0.012  0.01-00.96
put  1    32   1k    11:59:01 12:01:01 0.38   46277 385.58  0    0.083   0.053  0.01-01.04
get  1    32   1k    12:01:31 12:01:44 3.58   46277 3662.58 0    0.009   0.007  0.01-00.62
del  1    32   1k    12:02:14 12:03:40 0.54   46277 549.55  0    0.058   0.012  0.01-03.43
put  1    48   1k    12:04:11 12:06:11 0.51   62605 521.51  0    0.092   0.054  0.01-01.49
get  1    48   1k    12:06:41 12:06:56 4.41   62605 4520.88 0    0.011   0.007  0.01-00.53
del  1    48   1k    12:07:26 12:09:07 0.63   62605 640.82  0    0.075   0.021  0.01-02.23
• PUTs scale linearly through 16 processes; rate increases are slower at 32 and 48
• GETs look really good through 32 processes and slow down a bit at 48
• DELs had some irregular latencies in the upper range
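"Scale linearly" can be made precise by dividing each run's throughput by the single-process rate times the number of processes, so that 1.0 means perfectly linear scaling. The following is a minimal Python sketch using the Ops/Sec figures from the table above. This efficiency metric is a common convention, not something gpsuite reports, and the helper name is hypothetical.

```python
from typing import Dict, List, Tuple

# (processes, Ops/Sec) copied from the 1KB gpsuite table above
PUT_1K = [(1, 17.34), (2, 33.58), (4, 65.50), (8, 122.56),
          (16, 245.23), (32, 385.58), (48, 521.51)]
GET_1K = [(1, 119.78), (2, 258.12), (4, 514.76), (8, 975.96),
          (16, 1927.82), (32, 3662.58), (48, 4520.88)]


def scaling_efficiency(results: List[Tuple[int, float]]) -> Dict[int, float]:
    """Throughput per process relative to the first (smallest) run; 1.0 is linear."""
    base_procs, base_rate = results[0]
    per_proc = base_rate / base_procs
    return {procs: rate / (procs * per_proc) for procs, rate in results}


if __name__ == "__main__":
    put, get = scaling_efficiency(PUT_1K), scaling_efficiency(GET_1K)
    for procs in put:
        print(f"{procs:3d} procs  PUT {put[procs]:.2f}  GET {get[procs]:.2f}")
```

On these numbers, PUT efficiency is about 0.88 at 8 and 16 processes and drops to about 0.69 at 32 and 0.63 at 48. GETs stay above 0.95 through 32 processes and fall to about 0.79 at 48.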

Example of getput maxing out
Note: this cluster only had 1 object server
$ gpsuite --suite 1kobjs
Test Clts Proc OSize Start    End      MB/Sec Ops   Ops/Sec Errs Latency LatRange
put  1    1    1k    16:43:50 16:48:50 0.02   6116  20.39   0    0.049   0.01-01.08
put  8    32   1k    16:50:17 16:55:18 0.24   62506 207.68  0    0.154   0.01-03.84
put  8    128  1k    16:56:51 17:01:56 0.08   32430 107.51  0    1.191   0.01-07.12
put  8    256  1k    17:03:40 17:08:45 0.08   35056 115.50  0    2.216   0.01-09.40
put  8    512  1k    17:09:36 17:14:44 0.16   44663 147.10  0    3.481   0.01-308.52
put  8    1024 1k    17:15:41 17:20:49 0.16   44179 145.98  0    7.122   0.01-308.87
Wow!
• Look at the latencies growing in both average and range
• Also notice we’ve hit the wall at a little under 150 IOPS
• BUT Swift did keep on chugging along
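The growing latencies follow from Little's law. In a closed loop where each process keeps one request outstanding, mean latency is roughly the number of concurrent processes divided by throughput, so once throughput stops rising, every extra process just adds queueing time. The following is a minimal Python sketch that checks this against the table above. It assumes that the Proc column is the total across all clients and that getput processes issue requests back to back with no think time. Both are assumptions, though the numbers are consistent with them. The helper name is hypothetical.

```python
# (Proc, Ops/Sec, Latency in seconds) copied from the table above
RUNS = [(1, 20.39, 0.049), (32, 207.68, 0.154), (128, 107.51, 1.191),
        (256, 115.50, 2.216), (512, 147.10, 3.481), (1024, 145.98, 7.122)]


def littles_law_latency(concurrency: int, throughput: float) -> float:
    """Mean time per request W = L / X for L outstanding requests at rate X."""
    return concurrency / throughput


if __name__ == "__main__":
    for procs, ops, measured in RUNS:
        predicted = littles_law_latency(procs, ops)
        print(f"{procs:5d} procs: predicted {predicted:.3f}s, measured {measured:.3f}s")
```

Every row agrees to within about 2%. A cluster stuck near 150 ops/sec can therefore only absorb more concurrency with proportionally longer latencies.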

Selecting servers for HP’s public cloud
• Get a better understanding of Swift performance and optimise the hardware/Swift combination
• Two different hardware configurations
  • 12-disk data servers
    • Dedicated proxy servers
    • Data servers host account/container/object services
    • 5:1 data-servers:proxy-servers
  • 60-disk data servers
    • Dedicated proxy servers
    • Data servers host object services only
    • Container/account services on separate servers from object services
    • 1:1 data-servers:proxy-servers
• Concentrate on transaction rates, especially PUTs of small objects (1KB to 10KB)
  • Most objects in production are small (50% <= 20KB)
  • A high transaction rate exercises the CPU and the container and proxy services

Configuration 1
• Proxy servers
  • 12 physical cores, 2666MHz
  • 96GB RAM
  • 10 GigE
  • 2*2TB 7200 RPM drives (mirror)
  • ½ U width
• Data servers
  • 12 physical cores, 2666MHz
  • 24GB RAM
  • 1 GigE
  • 12*2TB 7200 RPM drives
  • 1U high
  • Run object, container & account services
[Diagram: proxy server with mirrored Disk1/Disk2; data Servers 1 to 5, each with Disks 1 to 12]

First set of measurements: “idle” system
• Idle: no external PUTs, GETs, DELETEs etc.
• This system has 123K containers & 17M objects per data server
• Graph: measurements with different services turned on and off
• Significant “idle” CPU load
• The biggest contributor to “idle” CPU burn is the container replicator

CPU measurements: 1KB object PUTs

I/O measurements: 1KB object PUTs

Observations on Configuration 1 measurements
• Idle CPU burn is 34%, increasing to 38% at the maximum achieved PUT rate (approx. 338 PUTs/s)
• Container services are the major CPU hogs
• The small amount of memory hurts performance: most reads go to disk rather than cache
• Major source of reads: object auditor
• Object server reads grow approx. linearly with PUT rate (it reads 6x as much as it writes for 1KB PUTs)
• Running the container service alongside the object service hurts I/O: the container data flushes object data from cache

Conclusions from Configuration 1 measurements
• PUT throughput (1KB) is limited by read IOPS
• Keep container and object services separate
• The object service consumes relatively little CPU
• Large amounts of RAM for buffer cache will help increase performance

Configuration 2
• Proxy/Container & Account Servers
  • Same server type for proxy services & container/account services
  • 12 physical cores, 2666MHz
  • 96-192GB RAM
  • 10 GigE
  • 4*1TB 7200 RPM disks, in a variety of RAID configurations
  • ½ U width
• Object servers
  • 12 physical cores, 2666MHz
  • 96GB RAM
  • 10 GigE
  • 60*2TB 7200 RPM disks
  • 4.3U high
• Note
  • Used many combinations of server & Swift services
  • Used many variations of server details, e.g. RAID config
  • Results reported here are for one specific server/Swift service config
[Diagram: proxy/container/account server with Disks 1 to 4; object server with Disks 1 to 60]

Performance measurements: 4KB object PUTs

Observations on Configuration 2 measurements
• We achieved a maximum throughput of approx. 1600 PUTs/s using 1KB objects, and 2000 PUTs/s using 4KB objects
• Dramatic jump in CPU usage, particularly on the object server, for the 2000 PUTs/s run
  • Benefiting from hyperthreading
• On the Proxy/Account&Container server, the dominant processes are the proxy-server and the container-server
• All reads are satisfied from cache on both the object server and the Proxy/Account&Container server

Conclusions from Configuration 2 measurements
• Massive increase in operation throughput
  • 5x Configuration 1 (per rack U)
• Proxy services and account/container services can coexist
• Object auditing time is probably an issue with 60 disks
  • Estimated over 200 days for the auditor to walk that many disks on a “full” system
  • Possible solution: parallel object auditor
  • Patch under review: https://review.openstack.org/#/c/59778/
• Next steps
  • Detailed object auditor measurements
  • Large container measurements using SSDs, striped disks
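The auditor concern can be sanity-checked with back-of-the-envelope arithmetic. The following is a minimal Python sketch that assumes a "full" system means all 60 × 2TB of raw capacity must be read (decimal terabytes). It does not reproduce how the 200-day estimate was made. It only works out the aggregate read rate that estimate implies, and what an idealized parallel auditor with one worker per disk would do to it, assuming workers add throughput linearly. Real speedups would depend on how the patch divides the work and on disk and CPU contention. The helper names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def implied_rate(total_bytes: float, days: float) -> float:
    """Aggregate bytes/sec needed to read total_bytes in the given number of days."""
    return total_bytes / (days * SECONDS_PER_DAY)


def audit_days(total_bytes: float, bytes_per_sec: float, workers: int = 1) -> float:
    """Days for one full pass, assuming workers add throughput linearly."""
    return total_bytes / (bytes_per_sec * workers) / SECONDS_PER_DAY


if __name__ == "__main__":
    raw = 60 * 2 * 10**12                  # 60 disks x 2TB, raw capacity (assumed)
    rate = implied_rate(raw, 200)          # what a 200-day pass implies
    print(f"implied serial audit rate: {rate / 1e6:.1f} MB/s")
    print(f"one worker per disk: {audit_days(raw, rate, workers=60):.1f} days")
```

Under these assumptions, the 200-day figure corresponds to roughly 7 MB/s of aggregate audit throughput.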

Links
collectl: http://collectl.sourceforge.net/
getput: https://github.com/markseger/getput
