
Presentation Transcript


  1. Virginia POWER User Group, May 19, 2015. What's New: Performance Features for IBM PowerVM & POWER8. Steve Nasypany, nasypany@us.ibm.com

  2. General Performance News

  3. Optimization Redbook • New! Draft available now! • POWER7 & POWER8 • PowerVM Hypervisor • AIX, i & Linux • Java, WAS, DB2… • Compilers & optimization • Performance tools & tuning • http://www.redbooks.ibm.com/redpieces/abstracts/sg248171.html

  4. Quick View of POWER8 • POWER8 Migration & Best Practices: http://www14.software.ibm.com/webapp/set2/sas/f/best/home.html • SAP, Oracle, Siebel results linked here: http://www-03.ibm.com/systems/power/hardware/benchmarks/erp.html • IBM Power Systems Performance Report: POWER8 Single-Thread, SMT2, SMT4 & SMT8 numbers! • Per the report: uplift from SMT2 to SMT4 is 30%; uplift from SMT4 to SMT8 is 7%; uplift from Single-Thread to SMT8 is 100% • Per-thread versus total throughput scaling should be very linear, as threads are more equally biased in POWER8 (covered later)

  5. Dynamic System Optimizer • The Dynamic System Optimizer (DSO) function in AIX is not supported on POWER8 • An AIX function, formerly called Active System Optimizer (ASO); the aso daemon function is available for free • Additional charged features: Autonomic Large Page (16 MB) Migration and Autonomic Processor Pre-Fetch Control • AIX asoo commands will not execute anything on POWER8 • If you migrate from POWER7 with it enabled, it will remain enabled, but the aso daemon will not do anything. No performance concern, but you can disable it if you find the aso logs annoying (see the example below) • Future support is based on two issues: the benefit of DSO was not judged a high priority for Scale Out systems, and functional support is not so much a technical issue as a testing-resources issue • The Lab is interested in feedback from customers who want Scale Up support for POWER8. Complain to CTS or me and I will forward it to development • This has no impact on the Dynamic Platform Optimizer (DPO): DSO optimizes threads within a virtual machine (OS) instance, DPO optimizes virtual machine placement within a frame
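The aso daemon and its tunables are managed with the asoo command and SRC. A minimal sketch for disabling it, assuming the aso_active tunable and the aso subsystem name (verify both with asoo -a and lssrc on your level):

    # List ASO tunables and their current values
    asoo -a
    # Persistently disable ASO/DSO (assumes the aso_active tunable exists on this level)
    asoo -p -o aso_active=0
    # Stop the daemon now (assumes the SRC subsystem is named aso)
    stopsrc -s aso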

  6. Java • Java 7.1 SR1 is the preferred level for POWER7 and POWER8 • Java 6 SR7 is the minimum recommended level for POWER7, as it contains optimizations for POWER7 and defaults to 64KB (versus 4KB) pages for Java Virtual Machines (JVMs) on AIX • Java 7.1 is optimized to use specific hardware features of POWER8; the JIT compiler will automatically detect the platform architecture and generate code optimized for that platform • WAS 8.5.2.2 • RHEL 6 and SLES 11 Linux support the use of 64KB pages for JVMs • As with all legacy levels, Java applications with a small memory footprint typically perform better in 32-bit; applications with larger memory requirements should use 64-bit • A variety of other Java optimizations for AIX & Linux are covered in Section 8.3 of the Performance Optimization and Tuning Techniques for IBM Power Systems Processors Including IBM POWER8 Redbook
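To confirm that 64KB pages are available and actually in use by a JVM on AIX, something like the following can be checked (a sketch; the PID is a placeholder):

    # List the page sizes this LPAR supports (4K, 64K, 16M, ...)
    pagesize -a
    # Inspect the page sizes (PSize column) in use by a running JVM process
    svmon -P 1234567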

  7. Utilization, Simultaneous Multithreading & Virtual Processors

  8. Review: POWER6 vs POWER7/8 SMT Utilization • Simulated single-threaded process on 1 core, 1 Virtual Processor: the utilization values change, while in each case physical consumption can be reported as 1.0 • Real-world production workloads will involve dozens to thousands of threads, so users may not notice any difference at the "macro" scale • See Simultaneous Multi-Threading on POWER7 Processors by Mark Funk: http://www.ibm.com/systems/resources/pwrsysperf_SMT4OnP7.pdf • POWER5/6 utilization does not account for SMT; POWER7/8 utilization is calibrated in hardware • [Diagram: one busy hardware thread ("busy" = user% + system%) on a single core reports 100% busy on POWER6 SMT2, roughly 70% busy on POWER7 SMT2, roughly 60 to 63% busy on POWER7/POWER8 SMT4, and roughly 56% busy on POWER8 SMT8; the remaining hardware threads are idle]

  9. POWER6 vs POWER7/POWER8 Dispatch • There is a difference in how workloads are distributed across cores in POWER7 & POWER8 • In POWER5 & POWER6, the primary and secondary SMT threads are loaded to ~80% utilization before another Virtual Processor is unfolded • In POWER7, all of the primary threads (defined by how many VPs are available) are loaded to at least ~50% utilization before the secondary threads are used; only once the secondary threads are loaded will the tertiary threads be dispatched. This is referred to as Raw Throughput mode • Why? Raw Throughput provides the highest per-thread throughput and best response times, at the expense of activating more physical cores • [Diagram: on POWER6 SMT2, another Virtual Processor is activated when the primary and secondary threads reach ~80% busy; on POWER7/8 SMT4, another Virtual Processor is activated when the primary thread reaches ~50% busy]

  10. Review: POWER6 vs POWER7/8 Dispatch • [Diagram: on POWER6, work is spread across the primary and secondary threads of proc0-proc3; on POWER7/POWER8 (Raw mode), work is concentrated on the primary threads of proc0-proc3 (lcpu 0-3, 4-7, 8-11, 12-15) before the secondary and tertiary threads are used] • Once a Virtual Processor is dispatched, the Physical Consumption metric will typically increase to the next whole number • Put another way, the more Virtual Processors you assign, the higher your Physical Consumption is likely to be in POWER7/POWER8

  11. POWER7/POWER8 Consumption • POWER7/POWER8 will activate more cores at lower utilization levels than earlier architectures when excess VPs are present • Customers may complain that the physical consumption metric (reported as physc or pc) is equal to or even higher than it was before migrating from earlier architectures • Every POWER7/POWER8 customer with this complaint has also been found to have significantly higher idle% percentages than on the earlier architectures • Consolidation of workloads may result in many more VPs assigned to a new POWER7 or POWER8 partition • Just because we let you set very high ratios of Virtual Processors to Entitlement (20:1 now on some POWER7+ and POWER8) does not mean that is always optimal. Your choices have consequences. There is no magic ratio for all environments. If you want more education on VP vs Entitlement, ask for that education • More VPs can result in lower affinity: a broader spread across the shared pool and memory domains. Lower affinity leads to more cycles, and more cycles lead to lower performance
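One way to see this in the field is to watch physical consumption against entitlement and idle time over an interval with lparstat; the 5-second interval and 6 samples below are only an illustration:

    # Partition-wide utilization every 5 seconds, 6 samples:
    # compare physc (cores consumed) with %entc (percent of entitlement) and %idle
    lparstat 5 6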

  12. Virtual Processor Dispatch • A recurring question in AIX is "how many Virtual Processors am I using?" • The physical consumption metric (physc or pc) could be used to approximate activity if the VP folding algorithm was understood and the workload was stable (typically, 1 to 2 VPs higher than physc) • Tools like sar, mpstat and nmon could be used to display logical CPUs and divine how many Virtual Processors were active by looking at SMT sets (each mapping to a VP) and their logical CPU statistics (utilization and context switches) • A new mpstat option, mpstat -v, provides information on Virtual Processor activity: it displays the delta Virtual Timebase (VTB), which is time charged to a dispatched VP. If the Virtual Timebase is 0, the processor statistics associated with that VP are not shown, simplifying the output • Requires AIX 7.1 TL3 SP2
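A typical invocation takes an interval and sample count like the other mpstat modes; a sketch (the 10-second interval and 3 samples are arbitrary):

    # Per-Virtual-Processor dispatch statistics every 10 seconds, 3 samples
    mpstat -v 10 3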

  13. Virtual Processors Dispatched - mpstat -v
    vcpu lcpu     us    sy    wa     id          pbusy             pc  VTB(ms)
    ---- ---- ------ ----- ----- ------ -------------- --------------  -------
       0       55.88  0.53  0.00  43.59  0.34[ 56.4%]   0.60[119.7%]       649
            0  55.88  0.52  0.00   0.47  0.34[ 56.4%]   0.34[ 56.9%]         -
            1   0.00  0.00  0.00  13.95  0.00[  0.0%]   0.08[ 13.9%]         -
            2   0.00  0.00  0.00  15.04  0.00[  0.0%]   0.09[ 15.0%]         -
            3   0.00  0.01  0.00  14.13  0.00[  0.0%]   0.08[ 14.1%]         -
       4       56.26  0.92  0.00  42.82  0.07[ 57.2%]   0.13[ 25.5%]       209
            4  56.26  0.87  0.00   1.28  0.07[ 57.1%]   0.07[ 58.4%]         -
            5   0.00  0.04  0.00  14.11  0.00[  0.0%]   0.02[ 14.1%]         -
            6   0.00  0.01  0.00  13.69  0.00[  0.0%]   0.02[ 14.8%]         -
            7   0.00  0.01  0.00  13.75  0.00[  0.0%]   0.02[ 13.9%]         -
       8       60.92  0.50  0.00  38.58  0.15[ 61.4%]   0.25[ 49.0%]       404
            8  60.92  0.49  0.00   0.64  0.15[ 61.4%]   0.15[ 62.0%]         -
            9   0.00  0.00  0.00  12.61  0.00[  0.0%]   0.03[ 12.9%]         -
           10   0.00  0.00  0.00  12.66  0.00[  0.0%]   0.03[ 13.0%]         -
           11   0.00  0.00  0.00  12.67  0.00[  0.0%]   0.03[ 13.0%]         -
     ALL      173.05  1.95  0.00 124.99  0.56[175.0%]   0.97[194.2%]      1262
  Note: the vcpu value appears to be tied to the lowest logical CPU number in the SMT set. In this example there are only 3 active VPs, and vcpu does not represent some internal AIX numbering scheme

  14. Migration Guidance

  15. Migrations: Dispatching, SMT… Will I have a problem? • If you are migrating between POWER7 and POWER8: not a problem. The AIX SMT4 default makes these migrations "apples-to-apples" and the default dispatcher behaves the same • If you are migrating from POWER5/POWER6 to POWER8: maybe a problem. POWER7 & POWER8 behave the same way, and now that you understand the dispatch behavior, you know why customers may complain • What are my options? Get the VP counts right the first time. Do not do 1:1 VP sizings for larger partitions between POWER5/6 and POWER7/8; this will get you into trouble! • If a customer ignores updated VP sizings, consider using the Scaled Throughput tunings, where AIX uses more SMT threads before dispatching another VP. See the backup material for detail and guidance

  16. POWER8 SMT Default: Why SMT4? • AIX 6.1 will only support SMT4, and most customers are still running AIX 6.1 • After early experiences with POWER7, AIX chose the conservative path for POWER8, at the expense of some capacity • Most workloads will be fine with SMT4 or SMT8. All those problems you thought were SMT issues in POWER7 weren't: they were firmware, affinity, an aggressive dispatcher, too many VPs. We also avoid application scalability issues made visible by more SMT threads but often blamed incorrectly on SMT • The Lab view is that most customers do not run at the utilization levels (> 80%) needed to benefit from SMT8. The reality is, many if not most of our customers do not run at utilization levels that fully exercise SMT4 • SMT4 is the best of all worlds for now, but there are now more options to exploit SMT. This can be done via the Scaled Throughput tunings, which are covered in the backup material

  17. POWER8 SMT: Should I use SMT8? • In any PoC or benchmark where you are going to drive to 80% utilization, absolutely try SMT8; don't leave capacity on the table. You can't get to the highest rPerf without SMT8. OLTP databases, large WAS application servers, etc. have seen 5 to 15% increases • We should be open to letting experienced customers try SMT8. These customers typically know what they are doing and understand whether higher SMT is appropriate for their environment • It is easy and free to test SMT4 and SMT8 modes, with no reboot required (see the example below) • For new customers/applications, review the software stack. If the application space is well known on AIX, it should not be a problem. If the application is new to AIX or Linux, it should be tested for scaling issues (the product may never have been tested to 24 cores / 192 logical CPUs)
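Switching SMT modes on AIX is a dynamic smtctl operation, for example:

    # Show the current SMT mode and per-processor thread status
    smtctl
    # Switch to SMT8 immediately, no reboot required
    smtctl -t 8 -w now
    # Switch back to SMT4 immediately
    smtctl -t 4 -w now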

  18. POWER8 SMT: Flexible SMT • POWER7 & POWER8 differ in SMT bias. In POWER7, there is a correlation between the hardware thread number (logical CPU 0, 1, 2 & 3) and physical resources within the processor, and lower threads may also have a higher priority • POWER8 hardware threads are equally biased and provide the same performance regardless of which thread is active. This is true for AIX & Linux • For AIX, you do not need to worry about using bindprocessor or RSET function with various threads, or always "pinning" to a Virtual Processor's primary hardware thread for the best performance • This topic, called Flexible SMT, is covered in more detail in Section 4.2 of the tuning Redbook. AIX will dynamically adjust between SMT and ST mode based on the workload utilization • A 1:1 equivalent in Linux does not really exist, but I expect similar function will migrate to Linux and/or PowerKVM • https://www.ibm.com/developerworks/community/blogs/aixpert/entry/local_near_far_memory_part_4_aggressive_intelligent_threads46?lang=en

  19. POWER8 SMT Opinion: What about Linux? • The Linux space is a bit more complicated. As of right now, there does not appear to be seamless handling of SMT across all Linux distros and PowerKVM comparable to PowerVM hosting AIX and i • Most Linux workloads are more scale out than scale up: smaller partitions, more HPC-like, with manual SMT tunings and manual bindings to processors. IBM and the industry are working on this • SMT can be dynamically changed (see the example below). Distros have added more SMT awareness and NUMA tooling (numastat). Visibility of SMT through the host & client layers may differ between distros • A "Split-Core" function is offered where a single core with SMT8 is split into four SMT2 "cores" from the guest perspective • Rely on guidance provided by the Linux OS and application space. LTC is very responsive to DeveloperWorks Community questions: https://www.ibm.com/developerworks/community/forums/html/forum?id=a95a744c-e8fd-4228-a57a-1ae837efe457&ps=25
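On Linux on Power, SMT is typically queried and changed with the ppc64_cpu utility from the powerpc-utils package; a minimal sketch:

    # Show the current SMT setting
    ppc64_cpu --smt
    # Change to SMT4 on the fly
    ppc64_cpu --smt=4
    # Show which threads of each core are online
    ppc64_cpu --info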

  20. Migrating Memory & Storage I/O • If your environment has been memory constrained, consider profiling existing workloads for Active Memory Expansion (AME). We are getting many field questions about this feature in 2015 • The AIX amepat tool can profile running workloads and generates an output report with guidance on recommended expansion factors and the CPU use required to implement them. You can select a target architecture of POWER7 or POWER8. It is supported on AIX 6.1 with POWER6 and above (see the example below) • For storage I/O, use existing tools and the knowledge base for planning. Ask for a Disk Magic study • Use documents/tools at IBM Techdocs; search for documents on POWER8 or written by Dan Braden, Sue Baker, John Hock: http://www-03.ibm.com/support/techdocs/atsmastr.nsf/Web/TechDocs • For example, the updated Fibre Channel Planning tool estimates adapters required based on IOPS, MB/sec, paths and LUN counts (it should work fine for System i & AIX): http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS5166
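amepat is normally run against a live, representative workload; a sketch (the 60-minute duration is arbitrary, and the exact option for selecting a POWER7 vs POWER8 target should be checked against the amepat usage on your AIX level):

    # Monitor the running workload for 60 minutes, then report recommended
    # AME expansion factors and the estimated CPU cost of compression
    amepat 60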

  21. Migrating Network & General Tuning • For all network efforts, see Steve Knudson's and/or Alexander Paul's presentations (10 Gb SEA tuning, SR-IOV, etc.) • High packet counts (>100K/sec) or low-latency, tiny-packet workloads require tuning; learn about mtu_bypass (largesend), see the example below • Beware of using Large Receive & Send on a VIOS with Linux clients: Linux does not support this feature (LTC is trying!), and mixing AIX/i clients with Linux on the same virtual Ethernet/SEA will result in performance issues, so separate the Linux clients • See the Lab's Performance Tuning Best Practices links: single sheets for POWER7 & POWER8, plus Transition and Service Strategy guidance, all at https://www-304.ibm.com/support/customercare/sas/f/best/home.html
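mtu_bypass enables largesend on an AIX virtual Ethernet interface; a sketch, assuming the interface is en0 (confirm the attribute exists on your adapter and level first):

    # Check whether the interface has the attribute and its current value
    lsattr -El en0 -a mtu_bypass
    # Enable largesend (TCP segmentation offload) over the virtual Ethernet
    chdev -l en0 -a mtu_bypass=on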

  22. Scaled Throughput

  23. What is Scaled Throughput? • Scaled Throughput is an alternative to the default "Raw" AIX scheduling mechanism • It is an alternative for some customers, at the cost of some performance. It is not an alternative to addressing AIX and pHyp defects, partition placement issues, unrealistic entitlement settings, or excessive Virtual Processor assignments • It will dispatch more SMT threads to a VP/core before unfolding more VPs. It can be considered more like the POWER6 folding mechanism, but this is a generalization, not a technical statement • Supported on POWER7/POWER7+ with AIX 6.1 TL08 & AIX 7.1 TL02. It does not apply to dedicated partitions unless you enable VP folding • Raw vs Scaled performance: Raw provides the highest per-thread throughput and best response times at the expense of activating more physical cores. Scaled provides the highest core throughput at the expense of per-thread response times and throughput. It also provides the highest system-wide throughput per VP because hardware thread capacity is "not left on the table"

  24. Raw vs Scaled • [Diagram: under the Raw default, work spreads to the primary threads of proc0-proc3 (lcpu 0-3, 4-7, 8-11, 12-15) before secondary and tertiary threads are used; under Scaled Mode 2 the same work is packed onto the primary and secondary threads of fewer cores, and under Scaled Mode 4 onto all four threads of still fewer cores] • Once a Virtual Processor is dispatched, physical consumption will typically increase to the next whole number • POWER8 mode + AIX 7.1 supports Scaled Mode 8

  25. Scaled Throughput: Tuning • The tunings are not restricted, but you can be sure that anyone experimenting with this without understanding the mechanism may suffer significant performance impacts • It is a dynamic schedo tunable. The actual thresholds used by these modes are not documented and may change at any time • schedo -p -o vpm_throughput_mode=<n>, where: 0 = legacy Raw mode (default); 1 = Scaled or "Enhanced Raw" mode with a higher threshold than legacy; 2 = Scaled mode, use primary and secondary SMT threads; 4 = Scaled mode, use all four SMT threads; 8 = Scaled mode, use eight SMT threads (POWER8, AIX 7.1 required) • The tunable schedo vpm_throughput_core_threshold sets a core count at which to switch from Raw to Scaled mode. This allows fine-tuning for workloads depending on utilization level: VPs will "ramp up" quickly to a desired number of cores, and then be more conservative under the chosen Scaled mode (see the example below)
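For example, the following would put a partition into Scaled mode 2 while keeping Raw-style unfolding for the first few cores; the values 2 and 4 are illustrative only and should be tested before production use:

    # Pack work onto primary and secondary SMT threads before unfolding more VPs
    schedo -p -o vpm_throughput_mode=2
    # Stay with Raw behavior until roughly 4 cores are active, then apply Scaled mode
    schedo -p -o vpm_throughput_core_threshold=4
    # Review the current settings
    schedo -a | grep vpm_throughput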

  26. Scaled Throughput: Guidance • Workloads: workloads with many light-weight threads with short dispatch cycles and low I/O (the same types of workloads that benefit well from SMT). Customers who are easily meeting network and I/O SLAs may find the tradeoff between higher latencies and lower core consumption attractive, as may customers who will not reduce over-allocated VPs and prefer to see POWER6-like behavior. Use mpstat -v in AIX 7.1 TL3 to view Virtual Processor dispatches • Performance: it depends, we can't guarantee what all workloads will do. Mode 1 may see little or no impact, but higher per-core utilization with lower physical consumption (typically 10-15%). Workloads that do not benefit from SMT and use Mode 2 or Mode 4 will see double-digit per-thread performance degradation (higher latency, slower completion times) • POWER6 workloads migrating to POWER7 or POWER8 and using Mode 2 will likely perform as well or better, and minimize complaints about higher-than-expected physical consumption. Many POWER7 workloads could migrate to POWER8 Mode 2 and reduce core usage without performance impact • These are non-restricted dynamic tunings, easily tested like SMT mode changes

  27. Raw Throughput: Default and Mode 1 • AIX will typically allocate 2 extra Virtual Processors as the workload scales up and is more instantaneous in nature • VPs are activated and deactivated one second at a time • Mode 1 is more of a modification to the Raw (Mode 0) throughput mode, using a higher utilization threshold and a moving average to reduce VP oscillation • It is less aggressive about VP activations; many workloads may see little or no performance impact

  28. Scaled Throughput: Modes 2 & 4 • Mode 2 utilizes both the primary and secondary SMT threads. Somewhat like POWER6 SMT2, eight threads are collapsed onto four cores, and "Physical Busy" or utilization percentage reaches ~80% of Physical Consumption • Mode 4 utilizes the primary, secondary and tertiary SMT threads. Eight threads are collapsed onto two cores, and "Physical Busy" or utilization percentage reaches 90-100% of Physical Consumption

  29. Tuning (other) • Never adjust the legacy vpm_fold_threshold without L3 Support guidance • Remember that Virtual Processors activate and deactivate on 1-second boundaries • The legacy schedo tunable vpm_xvcpus allows enablement of more VPs than required by the workload. This is rarely needed, and is overridden when Scaled mode is active • If you use the RSET or bindprocessor function and bind a workload to a secondary thread, that VP will always stay in at least SMT2 mode; if you bind to a tertiary thread, that VP cannot leave SMT4 mode • POWER8 threads are more balanced, whereas lower POWER7 threads typically have a higher priority. These functions should only be used to bind to primary threads unless you know what you are doing or are an application developer familiar with the RSET API • Use bindprocessor -s to list primary, secondary and tertiary threads (see the example below)
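A quick way to see which logical CPUs are primary, secondary and tertiary threads before binding; a sketch (the argument 0 is a placeholder, and the exact semantics of -s should be confirmed on your level):

    # List the SMT threads (logical CPUs) associated with processor 0
    bindprocessor -s 0
    # List all bindable logical processors for comparison
    bindprocessor -q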
