
SSS Test Results

Scalability, Durability, Anomalies. Todd Kordenbrock, Technology Consultant, Scalable Computing Division, thkorde@sandia.gov



Presentation Transcript


  1. Scalability, Durability, Anomalies Todd Kordenbrock Technology Consultant Scalable Computing Division thkorde@sandia.gov Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SSS Test Results

  2. Overview • Effective System Performance Benchmark • Scalability • Service Node • Cluster Size • Durability • Anomalies

  3. The Setup • The physical machine • dual processor 3GHz Xeon • 2 GB RAM • FC3 and VMWare 5 • The 4 node VMWare cluster • 1 service node • 4 compute nodes • OSCAR 1.0 on Redhat 9 • The 64 virtual node cluster • 16 WarehouseNodeMonitors running on each compute node

  4. [Diagram: Dual Processor Xeon running VMWare, hosting service1 (SystemMonitor) and compute1 through compute4; the WarehouseNodeMonitors are spread round-robin across the compute nodes (compute1: NodeMon 1, 5, 9, 13, ... n-3; compute2: NodeMon 2, 6, 10, 14, ... n-2; compute3: NodeMon 3, 7, 11, 15, ... n-1; compute4: NodeMon 4, 8, 12, 16, ... n).]

  5. Effective System Performance Benchmark • Developed by the National Energy Research Scientific Computing Center • System utilization test, NOT a throughput test • Focused on O/S attributes • launch time, accounting, job scheduling • Constructed to be processor-speed independent • Low resource usage (besides network) • Two variants: Throughput and Multimode • The final result is the ESP Efficiency Ratio

  6. ESP Efficiency Ratio • Calculating the ESP Efficiency Ratio • CPUsecs = sum(jobsize * runtime * job count) • AMT = CPUsecs/syssize • ESP Efficiency Ratio = AMT/observed runtime
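Restating the three bullets above in a single expression (my notation, not from the slides):

```latex
\[
\text{CPUsecs} = \sum_{j \in \text{job types}} \text{size}_j \cdot \text{runtime}_j \cdot \text{count}_j,
\qquad
\text{AMT} = \frac{\text{CPUsecs}}{\text{system size}},
\qquad
\text{ESP Efficiency Ratio} = \frac{\text{AMT}}{T_{\text{observed}}}
\]
```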

  7. ESP2 Efficiency (64 nodes) • CPUsecs = 680251.75 • AMT = 680251.75/64 = 10628.93 • Observed Runtime = 11586.7169 • ESP Efficiency Ratio = 0.9173
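As a quick check of the slide's arithmetic, a minimal sketch using the values quoted above:

```python
# Reproduce the ESP-2 efficiency calculation for the 64-node run.
cpu_secs = 680251.75            # sum(job size * runtime * job count), from the slide
sys_size = 64                   # virtual nodes in the test cluster
observed_runtime = 11586.7169   # seconds

amt = cpu_secs / sys_size               # absolute minimum time: 10628.93
efficiency = amt / observed_runtime     # 0.9173
print(f"AMT = {amt:.2f} s, ESP efficiency = {efficiency:.4f}")
```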

  8. Scalability • Service Node Scalability (Load Testing) • Bamboo (Queue Manager) • Gold (Accounting) • Cluster Size • Warehouse scalability (Status Monitor) • Maui scalability (Scheduler)

  9. [Diagram: SSS component architecture: Access Control Security Manager (interacts with all components), Meta Scheduler, Meta Monitor, Meta Manager, Node System Monitor, Accounting, Scheduler, Configuration & Build Manager, Resource Allocation Management, Queue Manager, Job Manager & Monitor, User DB, Data Migration, Usage Reports, User Utilities, Checkpoint/Restart, File System, High Performance Communication & I/O, Application Environment.]

  10. [Diagram: SSS component architecture, repeated from slide 9.]

  11. Bamboo Job Submission

  12. [Diagram: SSS component architecture, repeated from slide 9.]

  13. Gold Operations

  14. [Diagram: SSS component architecture, repeated from slide 9.]

  15. Warehouse Scalability • Initial concerns • per process file descriptor (socket) limits • time required to gather status from 1000s of nodes • Discussed with Craig Steffen • had the same concerns • experienced file descriptor limits • resolved with a hierarchical configuration • no tests on large clusters, just simulations
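For reference, a minimal sketch (not from the Warehouse code base) of how the per-process file descriptor limit behind this concern can be inspected and raised on Linux:

```python
import resource

# A flat (non-hierarchical) status monitor keeps one socket open per
# monitored node, so the soft RLIMIT_NOFILE caps how many nodes a single
# process can watch. A hierarchical configuration sidesteps the limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"file descriptor limits: soft={soft}, hard={hard}")

# Raise the soft limit to the hard limit; raising the hard limit itself
# requires privileges or an ulimit/limits.conf change.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```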

  16. [Diagram: SSS component architecture, repeated from slide 9.]

  17. Maui Scalability

  18. Scalability Conclusions • Bamboo • Gold • Warehouse • Maui

  19. Durability • What is durability? • A few terms regarding starting and stopping • Easy Tests • Hard Tests

  20. Durability and Other Terms • Durability Testing – examines a software system's ability to react to and recover from failures and conditions external to the system itself. • Warm Start/Stop – an orderly startup/shutdown of the SSS services on a particular node • Cold Start/Stop – a warm start/stop paired with a system boot/shutdown on a particular node

  21. Easy Tests • Compute Node Warm Stop • 30 sec delay between stop and Maui notification • race condition • Compute Node Warm Start • 10 sec delay between start and Maui notification • jobs in the queue do not get scheduled, new jobs do • Compute Node Cold Stop • 30 sec delay between stop and Maui notification • race condition
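The delays above can be measured by stopping the node's services and then polling the scheduler until it notices; a rough sketch, assuming a hypothetical `node_is_down()` helper that parses Maui's `checknode` output:

```python
import subprocess
import time

def node_is_down(node: str) -> bool:
    # Hypothetical helper: ask Maui about the node and look for a
    # down/drained state in the output (the exact text depends on the
    # Maui version in use).
    out = subprocess.run(["checknode", node], capture_output=True, text=True)
    return "Down" in out.stdout

def notification_delay(node: str, timeout: float = 120.0) -> float:
    # Call immediately after a warm/cold stop of the node's SSS services;
    # returns how long Maui took to mark the node down.
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if node_is_down(node):
            return time.monotonic() - start
        time.sleep(1)
    raise TimeoutError(f"Maui never marked {node} down within {timeout}s")
```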

  22. More Easy Tests • Single Node Job Failure • mpd to queue manager communication • Resource Hog - stress • disk • memory • network
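A sketch of how the disk and memory parts of this test could be driven with the stock stress(1) tool; the worker counts and sizes here are illustrative, not the values used in the runs, and network load needs a separate tool:

```python
import subprocess

# Load a compute node with disk and memory hogs while SSS jobs run,
# to see whether the suite keeps scheduling and reporting correctly.
subprocess.run([
    "stress",
    "--hdd", "2",          # 2 workers writing/unlinking temp files
    "--vm", "2",           # 2 workers allocating and touching memory
    "--vm-bytes", "256M",  # allocation size per --vm worker
    "--timeout", "600s",   # stop the hogs after 10 minutes
], check=True)
```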

  23. More Easy Tests • Resource Exhaustion • compute node • disk – no failures • service node • disk – gold fails in logging package
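A minimal sketch of the disk-exhaustion case (hypothetical helper, not the actual test script): fill the filesystem until the write fails, exercise the services, then delete the file to recover:

```python
import os

def fill_disk(path: str = "/tmp/exhaust.bin", chunk: int = 64 * 1024 * 1024) -> None:
    # Write zero-filled chunks until the filesystem returns
    # "No space left on device"; that full-disk state is what the
    # compute-node and service-node tests exercise (gold's logging
    # package failed in the service-node case).
    with open(path, "wb") as f:
        try:
            while True:
                f.write(b"\0" * chunk)
                f.flush()
                os.fsync(f.fileno())
        except OSError:
            pass  # expected: ENOSPC once the disk is full

# fill_disk(); run the workload; then os.remove("/tmp/exhaust.bin") to recover.
```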

  24. Hard Tests • Compute Node Failure/Restore • Current release of warehouse fails to reconnect • Service Node Failure/Restore • Requires a restart of mpd on all compute nodes • Compute Node Network Failure/Restore • 30 sec delay between failure and Maui notification • race condition • 20 sec delay between restore and Maui notification

  25. More Hard Tests • Service Node Network Failure/Restore • 30 sec delay between failure and Maui notification • race condition • 20 sec delay between restore and Maui notification • If outage >10 sec, mpd can't reconnect to computes

  26. Durability Conclusions • Bamboo • Gold • Warehouse • Maui

  27. Anomalies Discovered • Maui • Jobs in the queue do not get scheduled after service node warm restart • If max runtime expires on the last job in the queue, repeated attempts are made to delete it; the account is charged actual runtime + max runtime • Otherwise, the last job in the queue doesn’t get charged until another job is submitted • Maui loses connections to other services

  28. More Anomalies • Warehouse • warehouse_SysMon exits after ~8 hrs (current release) • warehouse_SysMon doesn't reconnect to power cycled compute nodes (current release) • Gold • “Quotation Create” pair fails with missing column error • gquote succeeds, glsquote fails with similar error • Spikes CPU usage when gold.db file gets large (>64MB). sqlite problem?

  29. More Anomalies • happynsm • /etc/init.d/nsmup needs a delay to allow the server time to initialize • Is NSM in use at this time? • emng.py throws errors • After a few hundred jobs, errors begin showing up in /var/log/messages • Jobs continue to execute, but slowly and without events

  30. Conclusions • Overall scalability is good. Warehouse needs to be tested on a large cluster. • Overall durability is good. Some problems with warehouse have been resolved in the latest development release.

  31. ToDo List • Develop and execute tests for the BLCR module • Retest on a larger cluster • Get the latest release of all the software and retest • Formalize this information into a report
