
What’s Working in HPC


Presentation Transcript


  1. What’s Working in HPC Nicole Wolter, Mike McCracken, Allan Snavely, Lorin Hochstein, Taiga Nakamura, Vic Basili DARPA HPCS October 2006

  2. HPC at a glance • Allocations • Often multi-site • Moving data (and porting code) between sites is often necessary • Wide variety of systems • 8 sites, 20 systems: 32–3,060 CPUs & 0.16–16 TFlops • ~10 architectures, 7 non-Linux OSs, 7 Linux variants • Other systems exist (this is just teragrid.org) • Shared resources • 1,000s of users sharing systems • Systems aim for >90% utilization • Systems are batch-scheduled, have time limits, have priority policies • Example: 1,024-processor job, ~24 hr average wait time on DS

  3. Highly multidisciplinary • What kinds of programs are running? • Simulation • Visualization • Validation • Huge data requirements • Terabyte files are common • Permanent fast storage is scarce

  4. FORTRAN. Really. • Also C, C++, and others • Code lasts for decades • Programming models • MPI • Also OpenMP, PGAS languages • Tool support • Dedicated support personnel • Regular system maintenance

  5. Workflow • Copy data in • wait ? hrs for data transfer • Submit • wait 8–24 hrs in queue, ≤18 hrs run • Copy data out / archive data • wait ? hrs for data transfer • Check results • Visualize • wait ? • Analyze

  6. Goals • Understand HPC development strategies • Discern common performance evaluation and improvement tactics • Assess performance-enhancing tools • Assess developers’ adeptness at predicting performance • Evaluate developers’ knowledge of the domain science versus the computer science • Evaluate hybrid versus “purebred” codes • Developers’ views on improving hardware versus improving software • Understand system usage

  7. Process • Evaluate HPC system logs and help tickets • Developer interviews (consultants, users, developers) • Consultant: focused design effort; not the original developer or scientist • User: scientist, makes limited modifications • Developer: designed the original code • SDSC Summer Institute survey

  8. Conjectures • HPC users all have similar concerns and difficulties with productivity. • Users with the largest allocations and the most expertise tend to be the most productive. • A computer science background is crucial to success in performance optimization. • Visualization is not on the critical path to productivity in HPC in most cases. • HPC programmers would require dramatic performance improvements to consider making major structural changes to their code. • Lack of publicity and education is the main roadblock to adoption of performance and parallel debugging tools. • Computational performance is usually the limiting factor for productivity on HPC systems.

  9. Conjecture 1: HPC users all have similar concerns and difficulties with productivity • Assumption • HPC users are a homogeneous community • Background • HPC users all come to HPC centers to capitalize on extended resources • Evaluation • Classes of users • Resource demands • System usage trends (flowchart) • Top perceived bottlenecks • Conclusion • Not True

  10. Conjecture 2: Users with the largest allocations and most experience are the most productive • Assumption • The more you know… • Background • Large allocations get preferential treatment on large systems • Queue priority • Knowledge • Evaluation • Queue wait time • Reliability • Porting • Conclusion • Not always True

  11. Conjecture 3: A computer science background is crucial to success in performance optimization • Assumption • Code developers are computer scientists • Background • Many HPC users are physical scientists • Evaluation • Project funding • SAC (Strategic Application Collaboration) • Conclusion • Not True

  12. Conjecture 4: Time to solution is the limiting factor for productivity on HPC systems • Assumptions • Users are motivated to improve performance • Background • Code maintenance is done by the physical scientist funded to produce scientific results • Evaluation • Satisficing (satisfied with “good enough” performance) • Users request help with running longer jobs, not performance evaluation • Job logs (job size versus runtime) • Conclusion • Not True

  13. Conjecture 5: Visualization is not on the critical path to productivity in HPC • Assumptions • Visualization is utilized at the end of the production cycle • Background • Visualization usage • Validation • Publication • Evaluation • Utilization frequency (flowchart) • Conclusion • Not True

  14. Conjecture 6: HPC programmers would demand dramatic performance improvements to consider major structural changes to their code • Assumptions • People shy away from change • Background • Community codes last for decades • Predominant languages used on HPC systems are Fortran and C • Evaluation • Compensation in return for a code rewrite (responses vary) • Conclusion • Not True

  15. Conjecture 7: Lack of publicity is the main roadblock to adoption of performance and parallel debugging tools • Assumptions • People use tools to help debug and profile codes to improve performance • Background data • Number of performance tools available • Number of debuggers available • Evaluation • Debuggers • Hard to use • Possibly impossible to scale • Performance evaluation • Not on the critical path • Top tools used: print statements and timers • Conclusion • Not True

  16. Conclusion • Productivity != Development Time + Runtime Performance • HPC users are heterogeneous • Performance is considered a constraint, not a goal • Paper can be found at: http://www.sdsc.edu/PMaC/HPCS/hpcs_productivity.html

  17. EXTRAS • User Classifications • Chart DS Wait times • Chart DS Run times

  18. User Classifications • Classifying users: • Marquee user: • Large allocation, greater than or equal to 1,000,000 service units (SUs) • Maximum job size can be the full system • VIP because of the large allocation (at SDSC there are currently only 2 users who run on the full system; this will most likely scale up with the push for petascale) • Normal user: • Allocation approximately 100,000 SUs • Job sizes normally range from 64–256 processors • Small user: • Small allocation, few user accounts; accounts through the AAP (Academic Associates Program) • These accounts are usually less than or equal to 10,000 SUs (usually 300–3,000), often university course accounts, short in duration • Benchmarkers and computer scientists: • Dynamic system usage; can run from 1 CPU to the full system • Usually not long in duration, either per individual run or in attention to one project • Usually fewer than a dozen runs per application

  19. Wait time Distribution, Grouped by Job Size (DS job logs from January 2003 to April 2006)

  20. Run time Distribution, Grouped by Job Size (DS job logs from January 2003 to April 2006)
