
High Performance and Productivity Computing with Windows HPC

High Performance and Productivity Computing with Windows HPC. Phil Pennington Windows HPC Microsoft Corporation. Supercomputing Reached the Petaflop. IBM RoadRunner at Los Alamos National Lab. HPC at Microsoft. 2004 Windows HPC team established 2005 Windows Server 2003 SP1 x64


Presentation Transcript


  1. High Performance and Productivity Computing with Windows HPC Phil Pennington Windows HPC Microsoft Corporation

  2. Supercomputing Reached the Petaflop: IBM RoadRunner at Los Alamos National Lab

  3. HPC at Microsoft
     • 2004: Windows HPC team established
     • 2005: Windows Server 2003 SP1 x64
     • 2005: Microsoft launches HPC entry at SC'05 in Seattle with Bill Gates keynote
     • 2006: Windows Compute Cluster Server 2003 ships
     • 2007: Microsoft named one of the Top 5 companies to watch in HPC at SC'07
     • 2008: Windows HPC Server 2008

  4. [Chart: cluster rankings over time. Data points as labeled on the slide:]
     • Winter 2005, Microsoft: 4 procs, 9.46 GFlops
     • Spring 2006, NCSA, #130: 896 cores, 4.1 TF
     • Spring 2007, Microsoft, #106: 2048 cores, 9 TF, 58.8%
     • Fall 2007, Microsoft, #116: 2048 cores, 11.8 TF, 77.1% (30% efficiency improvement)
     • Spring 2008, Aachen, #100: 2096 cores, 18.8 TF, 76.5%
     • Spring 2008, Umea, #40: 5376 cores, 46 TF, 85.5%
     • Spring 2008, NCSA, #23: 9472 cores, 68.5 TF, 77.7%
     Legend: Windows HPC Server 2008, Windows Compute Cluster 2003
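The "30% efficiency improvement" on slide 4 can be checked against the two Microsoft data points. The sketch below assumes the percentages are the ratio of measured to theoretical peak performance, which the slide does not state; the function names are illustrative only.

```python
# Figures quoted on slide 4 for the 2048-core Microsoft cluster.
# Assumption: the percentage is measured performance / theoretical peak.
runs = {
    "Spring 2007": {"measured_tf": 9.0, "efficiency": 0.588},
    "Fall 2007": {"measured_tf": 11.8, "efficiency": 0.771},
}


def implied_peak_tf(measured_tf, efficiency):
    """Theoretical peak implied by a measured result and its efficiency."""
    return measured_tf / efficiency


def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


before, after = runs["Spring 2007"], runs["Fall 2007"]

# Same hardware should imply (nearly) the same theoretical peak in both runs.
peak_before = implied_peak_tf(**before)
peak_after = implied_peak_tf(**after)

gain_tf = relative_gain(before["measured_tf"], after["measured_tf"])
gain_eff = relative_gain(before["efficiency"], after["efficiency"])

print(f"implied peak: {peak_before:.2f} TF vs {peak_after:.2f} TF")
print(f"throughput gain: {gain_tf:.1%}, efficiency gain: {gain_eff:.1%}")
```

Both runs imply a peak of about 15.3 TF, and both the throughput and the efficiency rise by roughly 31% in relative terms, which is consistent with the slide's rounded "30%".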

  5. HPC Clusters in Every Lab X64 Server

  6. Explosion of Data: Experiments, Simulations, Archives, Literature. Petabytes, doubling every 2 years

  7. The Data Pipeline (courtesy Catherine van Ingen, MSR)

  8. New Breed of HPC: Computational Finance
     • Modern finance differentiates by the quality, breadth and rapidity of building internal models of global markets and executing on them profitably
     • Very large datasets (tens of TB), changing daily → realtime
     • Tick-by-tick data, yield curves, past trades and closing prices, fundamental data, news, video
     • Overnight and realtime computation
     • Finding patterns, building trading strategies, backtesting, portfolio optimization, derivatives pricing, risk simulation for thousands of scenarios
     • HPC grids growing to tens of thousands of nodes
     • Data is moving from databases to scale-out caches
     • Enterprise management, security, policy and accounting requirements
     • Extreme developer productivity requirements
     • Develop, test and deploy models in production in DAYS
     • Scale to tens of thousands of cores
     • Usable by thousands of domain experts, not parallel-programming wizards

  9. Parallelism Everywhere
     Today's Architecture: Heat becoming an unmanageable problem!
     [Chart: Power Density (W/cm2), log scale from 1 to 10,000, '70 to '10, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 and Pentium® processors, against reference levels Hot Plate, Nuclear Reactor, Rocket Nozzle and Sun's Surface]
     To Grow, To Keep Up, We Must Embrace Parallel Computing
     [Chart: GOPS, log scale from 16 to 32,768, 2004 to 2015: Many-core Peak Parallel GOPs vs. Single Threaded Perf (10% per year); Parallelism Opportunity 80X]
     "… we see a very significant shift in what architectures will look like in the future ... fundamentally the way we've begun to look at doing that is to move from instruction level concurrency to … multiple cores per die. But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations."
     Pat Gelsinger, Chief Technology Officer, Senior Vice President, Intel Corporation. Intel Developer Forum, Spring 2004 (February 19, 2004)

  10. What will you look for? Overall scalability

  11. Challenge: High Productivity Computing
      "Make high-end computing easier and more productive to use. Emphasis should be placed on time to solution, the major metric of value to high-end computing users… A common software environment for scientific computation encompassing desktop to high-end systems will enhance productivity gains by promoting ease of use and manageability of systems."
      2004 High-End Computing Revitalization Task Force, Office of Science and Technology Policy, Executive Office of the President

  12. Microsoft's Productivity Vision
      Windows HPC allows you to accomplish more, in less time, with reduced effort by leveraging users' existing skills and integrating with the tools they are already using.
      Administrator
      • Integrated Turnkey Solution
      • Simplified Setup and Deployment
      • Built-In Diagnostics
      • Efficient Cluster Utilization
      • Integrates with IT Infrastructure and Policies
      Application Developer
      • Highly Productive Parallel Programming Frameworks
      • Service-Oriented HPC Applications
      • Support for Key HPC Development Standards
      • Unix Application Migration
      End-User
      • Seamless Integration with Workstation Applications
      • Integrated Collaboration and Workflow Solutions
      • Secure Job Execution and Data Access
      • World-class Performance

  13. Windows HPC Server 2008
      • Complete, integrated platform for computational clustering
      • Built on top of the proven Windows Server 2008 platform
      • Integrated development environment
      • Available at http://www.microsoft.com/hpc

  14. Windows HPC Server 2008
      Job & Resource Scheduling
      • Integrated security via Active Directory
      • Support for batch, interactive and service-oriented applications
      • High availability scheduling
      • Interoperability via OGF's HPC Basic Profile
      Systems Management
      • Rapid large scale deployment and built-in diagnostics suite
      • Integrated monitoring, management and reporting
      • Familiar UI and rich scripting interface
      HPC Application Models
      • MS-MPI stack based on MPICH2 reference implementation
      • Performance improvements for RDMA networking and multi-core shared memory
      • MS-MPI integrated with Windows Event Tracing
      Storage
      • Access to SQL, Windows and Unix file servers
      • Key parallel file server vendor support (GPFS, Lustre, Panasas)
      • In-memory caching options

  15. Typical HPC Cluster Topology
      [Diagram: The Corporate IT Infrastructure (Systems Management, Windows Update, Monitoring, AD, DNS, DHCP) and the Admin / User Cons connect over the Public Network to the Compute Cluster. The cluster contains a Head Node (Job Scheduler, WDS, NAT, MPI, Management) and Compute Nodes (Node Manager, MPI, Management), linked by a Private Network and an MPI Network.]

  16. Integrated Job Scheduler

  17. Job Scheduler Architecture
      [Diagram: Users and Admins, Job Validation, Scheduler, Store, Resource Allocation, Resource Controller, Compute Nodes]

  18. Submitting a job on 9472 cores
      • Start time < 2 seconds

      Id                   : 584
      JobTemplate          : Default
      Priority             : Normal
      JobType              : Batch
      NodeGroups           :
      OrderBy              :
      State                : Finished
      Name                 :
      UserName             : CCE\jeffb
      Project              :
      RequestedNodes       :
      ResourceRequest      : 9472-9472 cores
      MinMemory            :
      MaxMemory            :
      AllocatedNodes       :
      SubmitTime           : 4/1/2008 10:51:53 PM
      StartTime            : 4/1/2008 10:51:54 PM
      EndTime              : 4/1/2008 10:58:58 PM
      PendingReason        :
      ChangeTime           : 4/1/2008 10:58:58 PM
      Wait time            : 00:00:00:00
      Elapsed time         : 00:00:07:04
      ErrorMessage         :
      RequeueCount         : 0
      TaskCount            : 1
      ConfiguringTaskCount : 0
      QueuedTaskCount      : 0
      RunningTaskCount     : 0
      FinishedTaskCount    : 1
      FailedTaskCount      : 0
      CanceledTaskCount    : 0
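The timing claims on slide 18 follow directly from the three timestamps in the job record. A minimal check, assuming the timestamps use US month/day/year order with a 12-hour clock:

```python
from datetime import datetime

# Assumed timestamp layout: month/day/year, 12-hour clock with AM/PM.
FMT = "%m/%d/%Y %I:%M:%S %p"

submit_time = datetime.strptime("4/1/2008 10:51:53 PM", FMT)
start_time = datetime.strptime("4/1/2008 10:51:54 PM", FMT)
end_time = datetime.strptime("4/1/2008 10:58:58 PM", FMT)

start_latency = start_time - submit_time  # time spent waiting to start
elapsed = end_time - start_time           # time spent running

print("start latency:", start_latency.total_seconds(), "seconds")
print("elapsed:", elapsed)
```

The job started one second after submission, within the "< 2 seconds" stated on the slide, and ran for 7 minutes 4 seconds, matching the reported elapsed time of 00:00:07:04.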

  19. Placement via Job Context: Node Grouping, Job Templates, Filters
      • Application Aware: an ISV application (requires nodes where the application is installed), for example MATLAB
      • Capacity Aware: a multi-threaded application (requires a machine with many cores); a big model (requires large-memory machines)
      • NUMA Aware: a 4-way Structural Analysis MPI job
      [Diagram: quad-core and 32-core machines showing cores (C0 to C3), processors (P0 to P3), memory (M) and IO]

  20. Node/Socket/Core Allocation
      • Windows HPC Server can help your application make the best use of multi-core systems
      [Diagram: Node 1 and Node 2, each with sockets S0 to S3 and cores P0 to P3 per socket, showing where jobs J1, J2 and J3 are placed]
      J1: /numsockets:3 /exclusive:false
      J3: /numcores:4 /exclusive:false
      J2: /numnodes:1
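The three request granularities on slide 20 (nodes, sockets, cores) can be illustrated with a toy first-fit allocator. This is only a sketch: the `Cluster` class, its method names and the first-fit policy are hypothetical, the 2 x 4 x 4 layout is read off the slide's diagram labels, and the `/exclusive` flag is not modeled. It does not reflect how the actual scheduler places jobs.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy cluster model; a core is identified as (node, socket, core)."""
    nodes: int
    sockets_per_node: int
    cores_per_socket: int
    owner: dict = field(default_factory=dict)  # core -> job name

    def cores_of(self, node, socket=None):
        """All cores of a node, or of one socket on that node."""
        sockets = range(self.sockets_per_node) if socket is None else [socket]
        return [(node, s, c) for s in sockets
                for c in range(self.cores_per_socket)]

    def _claim(self, job, units, wanted):
        """First fit: claim the first `wanted` units whose cores are all free."""
        free = [u for u in units if all(core not in self.owner for core in u)]
        if len(free) < wanted:
            return []  # a real scheduler would queue the job instead
        claimed = [core for unit in free[:wanted] for core in unit]
        for core in claimed:
            self.owner[core] = job
        return claimed

    def allocate_nodes(self, job, count):
        units = [self.cores_of(n) for n in range(self.nodes)]
        return self._claim(job, units, count)

    def allocate_sockets(self, job, count):
        units = [self.cores_of(n, s) for n in range(self.nodes)
                 for s in range(self.sockets_per_node)]
        return self._claim(job, units, count)

    def allocate_cores(self, job, count):
        units = [[core] for n in range(self.nodes) for core in self.cores_of(n)]
        return self._claim(job, units, count)


# Assumed layout from the diagram: 2 nodes x 4 sockets x 4 cores.
cluster = Cluster(nodes=2, sockets_per_node=4, cores_per_socket=4)
j1 = cluster.allocate_sockets("J1", 3)  # like /numsockets:3
j3 = cluster.allocate_cores("J3", 4)    # like /numcores:4
j2 = cluster.allocate_nodes("J2", 1)    # like /numnodes:1

for name, cores in (("J1", j1), ("J3", j3), ("J2", j2)):
    print(name, "->", len(cores), "cores on node(s)", sorted({n for n, _, _ in cores}))
```

In this toy run J1 and J3 share the first node (three whole sockets plus the remaining four cores), while J2 receives the second node to itself, which mirrors the idea that coarser requests reserve whole units and finer requests fill the gaps.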

  21. Single Management Console
      • Group compute nodes based on hardware, software and custom attributes; act on groupings
      • Pivoting enables correlating nodes and jobs together
      • Track long-running operations and access operation history
      • Receive alerts for failures
      • List or Heat Map view: cluster at a glance

  22. Integrated Monitoring

  23. Comprehensive Diagnostics Suite

  24. Built-in Reporting

  25. Easy to Deploy

  26. Easy to Configure

  27. Evolving HPC Application Support
      V1 (focusing on batch applications): Job Scheduler
      • Resource allocation
      • Process launching
      • Resource usage tracking
      • Integrated MPI execution
      • Integrated security
      V2 (focusing on interactive applications): Job Scheduler + WCF Service Broker
      • WS Virtual Endpoint Reference
      • Request load balancing
      • Integrated service activation
      • Service lifetime management
      • Integrated WCF tracing
      [Diagram: V1 launches App.exe instances; V2 hosts Service (DLL) instances]

  28. HPC + WCF Services Compute Scenario
      1. User submits job (Workstation to Head Node).
      2. Session Manager starts WCF Broker job and WCF Service job for client.
      3. Requests (Workstation to WCF Broker Nodes)
      4. Requests (WCF Broker Nodes to Compute Nodes)
      5. Responses (Compute Nodes to WCF Broker Nodes)
      6. Responses (WCF Broker Nodes to Workstation)

  29. HPC + WCF Integrated Solutions
      [Diagram: Head Node (Job Mgmt, Cluster Mgmt, Scheduling, Resource Mgmt) passes Jobs through the Scheduler and collects Results. Each Compute Node provides Job Execution for a User App, using either MPI or Service Oriented execution of UDFs.]

  30. HPC + WCF Programming Model

      Sequential:

          for (i = 0; i < 100000000; i++)
          {
              r[i] = worker.DoWork(dataSet[i]);
          }
          reduce(r);

      Parallel:

          Session session = new Session(startInfo);
          PricingClient client = new PricingClient(binding, session.EndpointAddress);

          // Send every request without waiting for its response.
          for (i = 0; i < 100000000; i++)
          {
              client.BeginDoWork(dataSet[i], new AsyncCallback(callback), i);
          }

          // Invoked once per completed request.
          void callback(IAsyncResult handle)
          {
              r = client.EndDoWork(handle);
              // aggregate results
              reduce(r);
          }
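The contrast on slide 30 between a blocking loop and a fire-all-requests-then-aggregate-in-a-callback loop can be tried locally without a cluster. The sketch below is an analogy only: it uses a Python thread pool in place of the WCF broker, and `do_work` is a hypothetical stand-in for the remote service call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def do_work(item):
    """Hypothetical stand-in for the remote DoWork service call."""
    return item * item


data_set = list(range(1000))

# Sequential: each call blocks until its result is available.
sequential_total = 0
for item in data_set:
    sequential_total += do_work(item)

# Parallel: submit every request up front, aggregate as responses arrive.
totals = {"sum": 0}
totals_lock = threading.Lock()


def on_done(future):
    """Callback run once per completed request; folds in one result."""
    with totals_lock:  # callbacks may run on different worker threads
        totals["sum"] += future.result()


with ThreadPoolExecutor(max_workers=8) as pool:
    for item in data_set:
        pool.submit(do_work, item).add_done_callback(on_done)
# Leaving the `with` block waits for all outstanding requests.

parallel_total = totals["sum"]
print(sequential_total == parallel_total)
```

As in the slide's version, the aggregation step must tolerate results arriving in any order and from several threads, which is why the running total is guarded by a lock here.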

  31. Sub-millisecond round-trips

  32. High Throughput

  33. HPC MPI Programming Model
      • Traditional HPC
      • mpiexec communicates with each node's MPI Service to start worker processes
      mpiexec -n 6 app.exe
      [Diagram: on the head node, the job scheduler runs mpiexec; on each compute node, an MPI Service starts the worker processes (P), six in total]

  34. MPI.NET
      • Supports all .NET languages (C#, C++, F#, ..., even Visual Basic!)
      • Natural expression of MPI in C#
      • Negligible overhead (relative to C) over TCP

          if (world.Rank == 0)
              world.Send("Hello, World!", 1, 0);
          else
          {
              string msg = world.Receive<string>(0, 0);
          }

          string[] hostnames = comm.Gather(MPI.Environment.ProcessorName, 0);

          double pi = 4.0 * comm.Reduce(dartsInCircle, (x, y) => x + y, 0) / totalDartsThrown;
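The last MPI.NET line on slide 34 is the reduction step of a Monte Carlo estimate of pi: every rank counts how many random darts land inside the unit quarter circle, and the counts are summed at the root. The sketch below simulates the ranks inside one Python process (no MPI library involved); the rank count, dart count and helper name are illustrative assumptions.

```python
import random
from functools import reduce

RANKS = 4                 # simulated MPI ranks (assumption)
DARTS_PER_RANK = 50_000   # darts thrown by each rank (assumption)


def throw_darts(rank, darts):
    """Count darts landing inside the unit quarter circle for one rank."""
    rng = random.Random(rank)  # per-rank seed keeps the run reproducible
    hits = 0
    for _ in range(darts):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


# Each simulated rank computes its local count independently.
per_rank_hits = [throw_darts(rank, DARTS_PER_RANK) for rank in range(RANKS)]

# Equivalent of the reduce-with-sum at the root rank.
darts_in_circle = reduce(lambda x, y: x + y, per_rank_hits)
total_darts_thrown = RANKS * DARTS_PER_RANK

pi_estimate = 4.0 * darts_in_circle / total_darts_thrown
print(f"pi is approximately {pi_estimate:.4f}")
```

The fraction of darts inside the quarter circle approximates pi/4, so multiplying the reduced count by 4 and dividing by the total number of darts gives the estimate, exactly as in the slide's one-liner.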

  35. NetworkDirect: A new RDMA networking interface built for speed and stability
      • Verbs-based design for close fit with native, high-perf networking interfaces
      • Equal to hardware-optimized stacks for MPI micro-benchmarks
      • 2 usec latency, 2 GB/sec bandwidth on ConnectX
      • OpenFabrics driver for Windows includes support for Network Direct, Winsock Direct and IPoIB protocols
      [Diagram: user-mode and kernel-mode networking stack. Components: Socket-Based App, MPI App, MS-MPI, Windows Sockets (Winsock + WSD), WinSock Direct Provider, NetworkDirect Provider, User Mode Access Layer, TCP, IP, NDIS, Mini-port Driver, Hardware Driver, Networking Hardware. Paths: TCP/Ethernet Networking, RDMA Networking, Kernel By-Pass. Legend: (ISV) App, CCP Component, OS Component, IHV Component]

  36. Devs can't tune what they can't see: MS-MPI integrated with Event Tracing for Windows
      • Single, time-correlated log of OS, driver, MPI, and app events
      • CCS-specific additions
        • High-precision CPU clock correction
        • Log consolidation from multiple compute nodes into a single record of parallel app execution
      • Dual purpose
        • Performance analysis
        • Application troubleshooting
      • Trace data display
        • Visual Studio & Windows ETW tools
        • Intel Collector/Analyzer
        • Vampir
        • Jumpshot

  37. Enables Analysis of MPI Traffic

  38. Enables Optimization Strategies
      [Screenshot panels: Count of machines and distinct communicating pairs. Statistical summary of counts. Statistical summary of sizes. Sender / receiver pairs, senders on vertical axis. Bubble chart has bubble area proportional to size of chart. Histogram of counts. Histogram of sizes. Scatter plot of sizes (vertical axis) vs counts. Large scale problem before optimization (linpack, 2048 cores). Large scale problem after optimization.]

      Usage and notes: The overall idea is that we can log, live, the communication traffic that occurs during an executing run, and then optimize the traffic based on either latency or bandwidth metrics. Real-world usage is:
      • Run your scenario with traffic analysis on
      • Optimize for latency or bandwidth, depending on the characteristics of the app
      • Save a machine file representing the changes
      • Rerun your task, passing -machinefile to mpiexec, and check whether performance improves

      Walkthrough of the zipped-up material:
      • Unzip to a folder
      • Start the health client. This takes an IP address and port, but you can use arbitrary ones because we are not doing live traffic work
        • Healthclient 10.1.1.1 6000
      • Choose the view / view traffic menu option
      • Load one of the provided traffic files
        • Traffic_64.txt is a 64 node linpack run
        • Traffic_2048.txt is a 2048 node linpack run
      • Open the RHM menu over the traffic view; it offers a number of options:
        • Show counts and show size let you flip the UI between showing counts, sizes or both on the bubble chart
        • Histograms lets you flip the vertical axis on the histograms to logarithmic, which is useful when the data distributions are very uneven
        • Optimize For… lets you choose to optimize for latency, bandwidth or a combination of the two. The implementation is straightforward: it weights the proportion of sizes and counts when calculating the final layout
        • SHM / Network ratio lets you set the relative speed of your network compared to SHM. For GigE, 100:1 or 1000:1 is good; for NWD it is more like 2:1 or 5:1
        • Optimize performs the optimization (currently a greedy clustering algorithm)
        • View optimized / original lets you flip between optimized and non-optimized views
      • Once you have optimized, choose file / save machine file to save an optimized layout suitable for passing to mpiexec.
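The notes for slide 38 describe the optimizer only as a greedy clustering that weights message counts against message sizes. The sketch below is one possible reading of that description, not the tool's actual algorithm: the function names, the normalization of counts and sizes, the node-by-node filling strategy and the synthetic traffic table are all assumptions.

```python
def greedy_layout(traffic, ranks, cores_per_node, latency_weight=0.5):
    """Group ranks onto nodes so heavily communicating ranks share a node.

    traffic maps (rank_a, rank_b) -> (message_count, total_bytes).
    latency_weight=1.0 looks only at counts, 0.0 only at bytes.
    """
    total_count = sum(c for c, _ in traffic.values()) or 1
    total_bytes = sum(b for _, b in traffic.values()) or 1

    # Symmetric pair weight blending normalized counts and sizes.
    weight = {}
    for (a, b), (count, size) in traffic.items():
        w = (latency_weight * count / total_count
             + (1.0 - latency_weight) * size / total_bytes)
        weight[(a, b)] = weight.get((a, b), 0.0) + w
        weight[(b, a)] = weight.get((b, a), 0.0) + w

    unplaced = set(ranks)
    layout = []
    while unplaced:
        # Seed each node with the busiest remaining rank.
        seed = max(sorted(unplaced),
                   key=lambda r: sum(weight.get((r, o), 0.0) for o in unplaced))
        node = [seed]
        unplaced.remove(seed)
        # Greedily add the rank that talks most to ranks already on the node.
        while len(node) < cores_per_node and unplaced:
            best = max(sorted(unplaced),
                       key=lambda r: sum(weight.get((r, m), 0.0) for m in node))
            node.append(best)
            unplaced.remove(best)
        layout.append(sorted(node))
    return layout


def cross_node_bytes(traffic, layout):
    """Bytes that must cross the network (sender and receiver on different nodes)."""
    node_of = {rank: i for i, node in enumerate(layout) for rank in node}
    return sum(size for (a, b), (_, size) in traffic.items()
               if node_of[a] != node_of[b])


# Synthetic traffic: ranks of equal parity exchange far more data.
traffic = {}
for a in range(8):
    for b in range(a + 1, 8):
        same_parity = (a % 2) == (b % 2)
        traffic[(a, b)] = (100, 1_000_000) if same_parity else (1, 1_000)

naive = [[0, 1, 2, 3], [4, 5, 6, 7]]
optimized = greedy_layout(traffic, range(8), cores_per_node=4)

print("optimized layout:", optimized)
print("cross-node bytes:", cross_node_bytes(traffic, naive),
      "->", cross_node_bytes(traffic, optimized))
```

A layout produced this way could then be written out as a machine file, one node per group of ranks, for use with the -machinefile option mentioned in the notes. The SHM / Network ratio from the notes is not modeled here.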

  39. And Optimization Results…

  40. HPC Open Grid Forum Interoperability
      [Diagram: Cloud Services, Other OS's, Thin Clients, Application ISVs and Scheduling ISVs reach the Windows HPC Server 2008 Headnode through the HPC client API or the HPC Basic Profile Web Service]

  41. Resources
      • Windowshpc.net
      • www.microsoft.com/hpc
      • Channel9.msdn.com/shows/the+hpc+show
      • Edge.technet.com/tags/HPC
      • www.microsoft.com/science
      • research.microsoft.com/fsharp
      • www.osl.iu.edu/research/mpi.net
      • www.microsoft.com/msdn
      • www.microsoft.com/technet

  42. Thank You!

  43. © 2007 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
