QLogic TrueScale InfiniBand Advantage Performance and Scaling

QLogic TrueScale InfiniBand Advantage: Performance and Scaling, April 2011

Agenda
• TrueScale Performance
• TrueScale® Architecture
• Performance Advantage
• Enabling Performance Factors
• Real Applications
• Customer Examples

End-to-End Connectivity Solutions for High Performance Computing Clusters
• ASICs: Host and Switch
• Adapters: Standard and Custom Form Factor
• Systems: Directors and Edges
• Software: Host, Fabric, Element
• Cables: Copper, Fiber Optic

QLogic TrueScale HPC Interconnect: Augmenting basic InfiniBand for HPC clusters
Categories: MANAGEABILITY, PERFORMANCE, FABRIC EFFICIENCY, CAPABILITY
Features: Optimized Host Architecture, InfiniBand Fabric Suite, Adaptive Routing, GPGPU Support, Dispersive Routing, Virtual Fabrics, Advanced Topologies

What are some of the factors that determine performance?
• HPC Cluster
  - Size of Cluster = Speed of Simulation
  - Available Memory = Complexity or size of the model
• Processors
  - Speed & Generation of the Processor = Speed of Simulation
  - GPGPU = Performance for specific applications
• Interconnect – Designed for HPC
  - Messaging Rate = Performance, Scalability and Latency
  - Collectives = Performance and Scalability
  - Latency = Speed of Simulation
  - Bandwidth = Speed and Performance, Simplification

TrueScale Architecture

Bit of HPC History
Early 2000s
• InfiniBand Focus
  - Designed for the enterprise data center market and an I/O paradigm
  - Backbone network as a replacement for Ethernet and Fibre Channel
  - Incorporate best data center features of all interconnects and protocols
• Servers
  - Single Core / Dual Socket
  - Limited processor speed
  - Slower PCI, PCI-X buses
Mid-2000s
• InfiniBand Finds Its Niche
  - High Performance Computing Clusters market
  - Low-Latency / High-Bandwidth advantages
  - Primary message paradigm: MPI
• Servers
  - Multiple cores per CPU
  - Multi-socket servers are the norm
  - Processors faster with more internal bandwidth
  - PCI Express
Timeline labels: Conventional IB, TrueScale IB

Comparison of Architectural Approaches: Traditional vs. TrueScale QDR
• MPI libraries shown: Open MPI, MVAPICH, MVAPICH2, HP MPI, Intel MPI, QLogic MPI, Platform MPI, MPICH2; also SHMEM
• Traditional
  - Design: On-board protocol processing engine and state
  - API: RDMA Verbs
  - Stack labels: Applications, MPI Libraries (Verbs-based), OFED ULPs, Verbs Provider / Driver, Traditional HCA
• TrueScale QDR
  - Design: Host-based protocol processing engine and state
  - API: PSM and RDMA Verbs
  - Stack labels: Applications, MPI Libraries (Verbs-based), OFED ULPs, Verbs Provider / Driver, PSM, TrueScale HCA
• Layer labels: Generic, Adapter Specific
• InfiniBand Wire Transports

Difficult to Beat a Modern x86 Processor • STREAM Copy Bandwidth • Processor Execution Rate 24x Dual Socket = 48x Dual Socket Westmere = 72x

Software Architecture Difference: Traditional (Verbs) vs. TrueScale (PSM, Performance Scaled Messaging)
Traditional – Verbs
• Based on RDMA and QP programming model
• Connection-oriented approach with heavyweight state
• RDMA model requires significant memory pinning for send and receive
• Retrofitted for HPC
• No receiver tag matching semantics
TrueScale – PSM
• PSM is lightweight with targeted semantics
• 1/13th the user-space code of Verbs
• Connectionless with minimal on-adapter state: no chance of cache misses as the fabric scales
• High message rate and short-message efficiency
• Amenable to receiving out-of-order packets
• Long message transfers similar to RDMA model (typically >64KB)
PSM (and its underlying hardware) is designed for HPC

Scalable Performance: Message Rate Performance

QDR Message Rate – Nehalem: QLE7340 vs. ConnectX vs. ConnectX-2
TrueScale Difference: 5x
• TrueScale Performance Difference
  - TrueScale delivers about 5 times more messages
  - TrueScale: ~2.8M messages/sec/core
  - Mellanox: ~530K messages/sec/core
• Mellanox reaches maximum message performance at 3 core pairs
• Superior scalability with TrueScale
• Messaging performance = cluster scalability and performance
Results easily replicated: OSU’s osu_mbw_mr test, running on 2 nodes at 1 through 8 processes per node; 2x Xeon X5570 2.93 GHz Nehalem CPUs per node; MVAPICH 1.2 on all adapters; RHEL 5.3

QDR Message Rate – Westmere
TrueScale Difference: 7.8x (non-coalesced messages)
• TrueScale Performance Difference
  - TrueScale delivers about 8 times more messages
  - TrueScale: ~2M messages/sec/core
  - Mellanox: ~250K messages/sec/core
• Mellanox reaches maximum message performance at 3 core pairs
• Messaging performance = cluster scalability and application performance
Results easily replicated: OSU’s osu_mbw_mr test, running on 2 nodes at 1 through 12 processes per node; 2x Xeon X5670 2.93 GHz Westmere CPUs per node; MVAPICH 1.2 on all adapters; RHEL 5.4
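The headline multipliers on the two message-rate slides can be checked against the per-core rates they quote. The sketch below is only that arithmetic; the dictionary layout and the function name `advantage` are illustrative, and the inputs are the rounded figures from the slides, so the Westmere result comes out at 8.0x rather than the 7.8x shown on the chart.

```python
# Approximate per-core message rates quoted on the Nehalem and Westmere slides.
RATES = {
    "Nehalem": {"truescale": 2_800_000, "mellanox": 530_000},
    "Westmere": {"truescale": 2_000_000, "mellanox": 250_000},
}


def advantage(truescale: float, mellanox: float) -> float:
    """Return how many times more messages/sec/core TrueScale delivers."""
    return truescale / mellanox


for platform, rates in RATES.items():
    ratio = advantage(rates["truescale"], rates["mellanox"])
    print(f"{platform}: {ratio:.1f}x")
```
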

Scalable Performance: Collectives Performance

QLogic TrueScale Performance Difference: Collective – Barrier
IMB (Pallas) Collectives Benchmark, Barrier Test. Source: Voltaire whitepaper; TrueScale results from LLNL
• TrueScale Collective Acceleration
  - 168 times performance advantage over standard IB adapter implementations
  - 11% to 24% better performance than Voltaire’s FCA
All MPIs and standard collective algorithms are supported

QLogic TrueScale Performance Difference: Collective – AllReduce
IMB (Pallas) Collectives Benchmark, AllReduce Benchmark. Source: Voltaire whitepaper; TrueScale results from LLNL
• TrueScale Collective Acceleration
  - 154 times performance advantage over standard IB adapter implementations
  - 6% to 17% better performance than Voltaire’s FCA
All MPIs and standard collective algorithms are supported

MPI Collectives Performance at Scale
• The performance of MPI collectives is also critical for application performance scalability on large clusters
• PSM's native MPI collectives performance offers near-perfect scaling on very large clusters without requiring any hardware acceleration
• This performance is achievable across all major MPI libraries and across all forms of collectives

PSM – Performance Summary
Note: Results achieved with top-bin Westmere processors and well-tuned systems.
1 – osu_bw, osu_latency and osu_mbw_mr are executable names of tests from the OSU MPI Benchmark suite
2 – LLNL actual performance results
3 – osu_mbw_mr is the OSU Multiple Bandwidth, Message Rate Benchmark
4 – Source: Voltaire FCA Whitepaper; results without Collective Acceleration hardware/code

PSM – Performance at Scale: Customer Example

Gov. Agency Benchmark Results: ConnectX vs. TrueScale

LLNL’s industry partnerships have delivered world-class Linux clusters to our science and national security programs
• Coastal dedicated to LLNL National Ignition Campaign
• Latest Intel Westmere + QLogic InfiniBand clusters
• 24 SUs or 4,000 nodes in 2010
• Over 500 TF/s
• Multiple LLNL programs
• Demonstrated success of QLogic IB: competition in the InfiniBand market
• Highly scalable
• New clusters will move into production in early November

Matt Leininger from LLNL

Sierra is the most scalable system LLNL has ever deployed

LLNL Summary http://www.qlogic.com/Products/Pages/HPCLearnMore.aspx

Scalable Performance: HPC and ISV MPI Application Performance

Performance Tests: ANSYS Application
ANSYS FLUENT Benchmark Tests: http://ansys.com/Support/Platform+Support/Benchmarks+Overview
• ANSYS FLUENT 12.0
  - Reacting Flow Case w/Eddy Dissipation: Eddy_417K
  - Turbomachinery Flow: Turbo_500K
  - External Flow Over an Aircraft Wing: Aircraft_2M
  - External Flow Over a Passenger Sedan: Sedan_4M
  - External Flow Over a Truck Body: Truck_14M
  - External Flow Over a Truck Body w/Polyhedral Mesh: Truck_poly_14M
  - External Flow Over a Truck Body: Truck_111M

TrueScale Best in Class: ANSYS FLUENT with Westmere
TrueScale Difference
• TrueScale single-rail IB results vs. dual-rail IB for Mellanox
• TrueScale performance advantage: 2.5% to 11%
• Average performance difference: 9.4%
Clusters compared
• iDataplex, Xeon 5670, TrueScale IB
• SGI ALTIX_ICE_8400EX, INTEL_X5670, Mellanox IB, dual IB rail implementation

ANSYS FLUENT: Apples-to-Apples Comparisons
Clusters compared
• Intel Server, Xeon 5670, TrueScale IB
• SGI ALTIX_XE1300C, INTEL_X5670, Mellanox IB, single IB rail implementation
TrueScale Performance Difference
• Published ANSYS FLUENT 12.1 results
• Average performance difference: 84%
• Truck 111M is the largest model and therefore less sensitive to the performance of the interconnect. Even so, TrueScale offers 20% better overall performance.

ANSYS FLUENT: Nehalem Results
TrueScale Difference – Nehalem
• Eddy 417K profile is small messages but heavy node-to-node communications
• TrueScale advantage increases from 3.6% at 2 nodes to 8% at 16 nodes
• Both configurations’ CPU utilization is 100%
Configuration
• Intel servers, Xeon 5570
  - TrueScale QDR QLE7340
  - Mellanox QDR ConnectX-2

ANSYS FLUENT: Westmere Results
Chart callouts: TrueScale advantage 188%; scaling efficiency 68% vs. 23%
Clusters compared
• Intel Server, Xeon 5670, TrueScale IB
• SGI ALTIX_XE1300C, INTEL_X5670, Mellanox IB
TrueScale Difference – Westmere
• Results are from the ANSYS FLUENT benchmark site
• Westmere results: TrueScale has a 188% advantage
• TrueScale scaling efficiency: 68% vs. 23% for Mellanox
• Mellanox reaches the point of diminishing returns at 8 nodes
• Both configurations’ CPU utilization is 100%
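The slide reports scaling efficiency (68% vs. 23%) without defining it. One common definition, assumed here and not confirmed by the deck, is achieved speedup divided by ideal linear speedup. The sketch below applies that definition to made-up benchmark ratings; the function name and the numbers are hypothetical and are not taken from the ANSYS results.

```python
def scaling_efficiency(base_nodes: int, base_rating: float,
                       nodes: int, rating: float) -> float:
    """Achieved speedup over the baseline divided by ideal linear speedup.

    'rating' is any throughput-style score where higher is better
    (an assumption; the slide does not say which metric was used).
    """
    speedup = rating / base_rating
    ideal = nodes / base_nodes
    return speedup / ideal


# Hypothetical example: a 1-node rating of 100 that grows to 600 on 8 nodes.
print(f"{scaling_efficiency(1, 100.0, 8, 600.0):.0%}")
```

With this definition, an interconnect that stops adding throughput beyond some node count shows efficiency falling as nodes are added, which is the pattern the slide describes as diminishing returns.
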

CPU Utilization Profile: Interconnect-Sensitive Model
TrueScale Difference: CPU and MPI statistics show
• The Eddy 417K cell model is highly sensitive to interconnect performance: with a limited number of cells to process per core, most of the application’s time is spent in MPI communications
• MPI applications tend to use 100% of the available CPU cycles
• A more efficient interconnect leaves more cycles for the application
• TrueScale provides 5x more cycles for application (user) processing

CPU Utilization Profile: Large Model
TrueScale Difference: CPU and MPI statistics show
• Truck 111M is the largest model and less sensitive to the performance of the interconnect. Each Westmere core has to process 266 times more cells per step than with Eddy 417K, so more time is spent in processing than in communication
• A more efficient interconnect leaves more cycles for the application
• TrueScale still provides more cycles for application (user) processing
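The 266x figure follows from the two model sizes when both run on the same number of cores. A quick check, assuming the cell counts implied by the benchmark names (about 417K and 111M cells) and a hypothetical core count:

```python
EDDY_CELLS = 417_000        # assumed from the name Eddy_417K
TRUCK_CELLS = 111_000_000   # assumed from the name Truck_111M


def cells_per_core(total_cells: int, cores: int) -> float:
    """Average number of cells each core processes per step."""
    return total_cells / cores


cores = 192  # hypothetical; the ratio does not depend on this value
ratio = cells_per_core(TRUCK_CELLS, cores) / cells_per_core(EDDY_CELLS, cores)
print(round(ratio))  # 266
```
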

CFD Application: OpenFOAM Simulation (independent vendor)

Reservoir Simulation: Schlumberger Eclipse (independent vendor)
Chart axis: Seconds. TrueScale Difference: 8%

Weather Modeling – WRF (major university)
• Test is an average of three runs
• TrueScale achieves a 19% performance advantage at scale

General-Purpose Graphics Processing Unit (GPGPU) Overview

HPC Performance Demands
• The high performance computing market demands more performance
• GPU parallel processing capabilities provide an answer for attaining higher levels of HPC performance
• The parallel processing capabilities of the GPU allow complex computing tasks to be divided up across hundreds of processors within the GPU
• GPU performance and architecture place higher demands on cluster communications
• InfiniBand is the only interconnect to handle the higher performance and communication requirements that GPUs place on the cluster

QLogic TrueScale – GPUDirect Implementation
• GPUDirect
  - Optimizes memory access and transfers for communication between GPU nodes across InfiniBand
  - Provides for faster communications across a GPU-based cluster through direct
Diagram labels: GPU, CPU, System Memory, GPU Memory
• QLogic Solution
  - Only requires an update to the NVIDIA driver; other implementations require Linux kernel patches and special IB drivers
  - No memory region conflicts
  - No impact on GPGPU performance
  - Maintains latency and message rate performance
• 44% performance improvement with TrueScale GPUDirect Update

Performance with and without GPUDirect Update
• Test Configuration
  - 8 GPU Cluster
  - Tesla M2050 GPGPU (2/Server)
  - Intel Servers: Dual X5570 2.93 GHz, 24GB Memory/Server
• 44% performance improvement with TrueScale GPUDirect Update

QLogic TrueScale Performance Difference
• 9.6% performance advantage with TrueScale
• Performance difference increases with cluster size
• Test Configuration
  - 8 GPU Cluster
  - Tesla M2050 GPGPU (2/Server)
  - Intel Servers: Dual X5570 2.93 GHz
• Amber B/M Configuration
  - 8 GPU Cluster
  - Tesla M2050 GPGPU (2/Server)
  - Intel Servers: Dual X5670 2.93 GHz
  - (http://ambermd.org/gpus/benchmarks.htm#Benchmarks)

QLogic TrueScale Performance Difference
• 3.9% average performance advantage with TrueScale InfiniBand as the interconnect
• 5% to 6% TrueScale performance advantage with models that are more dependent on the interconnect: DHFR and FactorIX
Models: DHFR (23k atoms), FactorIX (90.9k atoms), Cellulose (408.6k atoms)
• Test Configuration
  - 8 GPU Cluster
  - Tesla M2050 GPU (2/Server)
  - Intel Servers: Dual X5570 2.93 GHz
• Amber B/M Configuration
  - 8 GPU Cluster
  - Tesla M2050 GPU (2/Server)
  - Intel Servers: Dual X5670 2.93 GHz
  - (http://ambermd.org/gpus/benchmarks.htm#Benchmarks)

Performance Per Watt
• GPU Performance with TrueScale
  - Up to 44% increase in performance versus implementations without GPUDirect
  - Provides up to 10% additional performance over other InfiniBand interconnects
  - Scaling efficiency that is on average 15%+ better than other clusters based on GPUs, Intel-based servers and other InfiniBand interconnects
• Power Consumption
  - TrueScale HCAs require 20% to 50% less power than other HCAs
• #3 on the Green500 List
  - The combination of the above allows an NVIDIA GPU and QLogic TrueScale based cluster to achieve 933 Mflops/Watt

QLogic TrueScale GPUDirect Update: Advantages
• Easier to install and support
  - Does not require a Linux kernel patch or special IB drivers
• Optimizes performance
  - Up to 44% increase in performance versus implementations without GPUDirect
• TrueScale performance advantage
  - Provides up to 10% additional performance over other InfiniBand interconnects
• More efficient scaling
  - Top 500 GPU clusters with TrueScale InfiniBand offer 20% better scaling efficiency than most other GPU clusters
• Performance per watt advantage
  - TrueScale’s power consumption is 20% to 50% less than that of other InfiniBand interconnects
• The combination of TrueScale’s performance, scaling efficiency and power utilization produced the #3 system on the Green500 list

Maximize HPC Resources through More Effective Management

Introducing: InfiniBand Fabric Suite 6.0
“Scaling HPC Performance Efficiently”
• Management
  - Fabric Wide Adaptive Routing
  - Torus / Mesh Advanced Topologies
• Performance
  - Performance Scaled Messaging (PSM) for MPI and SHMEM Applications
  - Source Directed Dispersive Routing
• Usability
  - Virtual Fabrics and Quality of Service
  - Boot over IB
  - FastFabric Tools & Fabric Viewer

IFS Management
• Adaptive Routing
  - Fabric Manager identifies equivalent paths per destination ID
  - As message patterns change and thresholds are exceeded, routes can be adjusted to alleviate congestion
  - Traffic flows are moved transparently to circumvent congested areas in the fabric without user intervention
  - Allows for more consistent and predictable performance in congested environments
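The adaptive routing behavior described above (equivalent paths per destination, rerouting when a threshold is exceeded) can be pictured with a toy model. This is a conceptual sketch only, not the Fabric Manager's algorithm or API; the `Route` class, the `adapt` function, the load metric and the "pick the least-loaded path" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Route:
    paths: List[str]  # equivalent paths to one destination (hypothetical names)
    active: str       # path currently carrying the traffic flow


def adapt(route: Route, load: Dict[str, float], threshold: float) -> str:
    """Move the flow to the least-loaded equivalent path when the active
    path's load exceeds the threshold; otherwise leave the route alone."""
    if load[route.active] > threshold:
        route.active = min(route.paths, key=lambda p: load[p])
    return route.active


route = Route(paths=["A", "B", "C"], active="A")
print(adapt(route, {"A": 0.9, "B": 0.2, "C": 0.5}, threshold=0.8))  # B
```

Because the switch happens inside `adapt`, the caller sending traffic never chooses a path itself, which mirrors the slide's point that flows move without user intervention.
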

IFS Management
• Advanced Topologies
  - Torus/mesh topologies are attractive to many customers as a cost savings measure due to fewer switching components
  - However, improved traffic management and failure isolation capabilities are required
  - Multiple routing algorithms help eliminate congestion and resulting deadlocks
  - Inevitable large-scale fabric disruptions are handled gracefully by routing around problem areas

IFS Performance
• Performance Scaled Messaging (PSM)
  - Provides a fast path to the QLogic TrueScale ASIC with stateless, cut-through packet delivery with minimal overhead
  - Designed from the ground up to support MPI-like semantics
  - Accelerated performance is enabled for the most widely available MPI libraries
  - Fully functional vFabric support is enabled natively for all MPI libraries supported by PSM
  - Support for multiple traffic classes to enable QoS objectives

IFS Performance
• Dispersive Routing
  - Overcomes limitations of traditional static IB routing between source and destination
  - Dispersive Routing enables multiple routes between pairs of sources and destinations
  - Using load balancing techniques, traffic is routed across these paths for improved overall bandwidth efficiency
  - Support for unordered packets enables optimal performance for different traffic classes
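To make the dispersive routing idea concrete, the sketch below spreads the packets of one message across several routes and lets the receiver reassemble them regardless of arrival order. The slide only says "load balancing techniques"; round-robin is an assumption chosen for simplicity, and the function names, route names and packet format are hypothetical.

```python
from itertools import cycle
from typing import Dict, List, Sequence, Tuple

Packet = Tuple[int, bytes]  # (sequence number, payload)


def disperse(packets: Sequence[Packet], routes: Sequence[str]) -> Dict[str, List[Packet]]:
    """Assign each packet to a route in round-robin order."""
    assignment: Dict[str, List[Packet]] = {r: [] for r in routes}
    for pkt, route in zip(packets, cycle(routes)):
        assignment[route].append(pkt)
    return assignment


def reassemble(arrived: Sequence[Packet]) -> bytes:
    """Rebuild the message from packets that arrived in any order."""
    return b"".join(payload for _, payload in sorted(arrived))


packets = [(0, b"he"), (1, b"ll"), (2, b"o!")]
by_route = disperse(packets, ["r1", "r2"])
# Simulate route r2 delivering before route r1.
print(reassemble(by_route["r2"] + by_route["r1"]))  # b'hello!'
```

The receiver sorting by sequence number is what stands in for the "support for unordered packets" bullet: without it, sending one message over several routes would require in-order delivery across paths.
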
