
Summer Internship



Presentation Transcript


  1. Summer Internship Douglas Drobny Idaho National Laboratory High Performance Computing

  2. Who I worked for • Idaho National Laboratory • Idaho Falls • High Performance Computing group • Manages ~4 different clusters • Supports and maintains software for big research projects • User Support group

  3. Clusters • Fission • 12,512 processors • 25 TBytes of memory • Icestorm • 2048 processors • 4 TBytes of memory • Quark • Eos

  4. Compute Manager • Current job submissions are command line • Goals • Web interface for PBS Scheduler • Easy to use • Behaves the same as current job submissions • Improved error message handling

  5. Setup • Application Services • On the server head nodes • Receive web requests • Submits Jobs • Compute Manager • On the web server • Creates web forms • Sends results to App. Services • Displays Results
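The split between the web form and the job submission can be pictured with a small sketch. This is not Compute Manager's actual code: the `JobForm` fields and `render_pbs_script` are hypothetical names, and the only real pieces are the standard `#PBS` directives. The idea is that the form ends up as the same job script a user would otherwise write by hand and pass to `qsub`.

```python
from dataclasses import dataclass


@dataclass
class JobForm:
    """Hypothetical web-form fields; not the real Compute Manager schema."""
    name: str
    ncpus: int
    walltime: str  # HH:MM:SS
    command: str


def render_pbs_script(form: JobForm) -> str:
    """Render the form values as the text of a PBS job script."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {form.name}",                  # job name
        f"#PBS -l select=1:ncpus={form.ncpus}",  # one chunk with ncpus cores
        f"#PBS -l walltime={form.walltime}",     # wall-clock limit
        "cd $PBS_O_WORKDIR",                     # start in the submission directory
        form.command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_pbs_script(JobForm("demo", 4, "00:10:00", "./run_sim input.dat")))
```

Keeping the output identical to a hand-written script is one way to meet the "behaves the same as current job submissions" goal, since the scheduler sees no difference.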

  6. What I did • Installed Compute Manager and AIF on Eos • Created test cases for PBS features • Created test cases for user inputs • Submitted feedback / bug reports for PBS • Documented the process for future implementations / troubleshooting

  7. Results • Good • Easy to create different application forms • Instant job monitoring • Restrict input values • Default input values • Secure file transferring
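Restricted and default input values are easy to illustrate. The sketch below is an assumption about how such checks could look in general, not how the product implements them; the queue names, the CPU limit, and `clean_inputs` are all made up for the example.

```python
ALLOWED_QUEUES = {"short", "long"}          # hypothetical queue names
DEFAULTS = {"queue": "short", "ncpus": 1}   # hypothetical default values
MAX_NCPUS = 64                              # hypothetical upper limit


def clean_inputs(raw: dict) -> dict:
    """Fill in defaults for blank fields, then reject out-of-range values."""
    supplied = {k: v for k, v in raw.items() if v not in ("", None)}
    values = {**DEFAULTS, **supplied}

    if values["queue"] not in ALLOWED_QUEUES:
        raise ValueError(f"queue must be one of {sorted(ALLOWED_QUEUES)}")

    ncpus = int(values["ncpus"])  # raises ValueError on non-numeric text
    if not 1 <= ncpus <= MAX_NCPUS:
        raise ValueError(f"ncpus must be between 1 and {MAX_NCPUS}")
    values["ncpus"] = ncpus
    return values


if __name__ == "__main__":
    print(clean_inputs({"ncpus": "8", "queue": ""}))
```

Rejecting bad values at the form means the user sees a readable message before anything reaches the scheduler.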

  8. Results • Bad • Easy to put results in insecure location • Always copies the input files • Missing a form entry can result in lost output files • Spams the sudo log • “Fixed in next version (Week after I leave)”

  9. Updating HPC Wiki • MoinMoin wiki (Python) • Upgraded from 1.8.8 to 1.9.4 • Used a temporary virtual machine to test the update and fix issues • Added support for viewing reports • Deployed on hpcweb • Note: Learn what type of service monitoring is being used before taking down a system.

  10. Wiki Reports • Automatically generate a visual report of an XML document each month • Created the XSL • Putting data into charts • Automation ('Right' way vs. Working way) • Editing to reduce transcription errors • <script/>
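As a rough picture of "putting data into charts", the sketch below reads a small XML report and draws a text bar chart. The slides used XSL for this step; Python's standard library is swapped in here only to keep the example self-contained. The XML layout, the element and attribute names, and the numbers are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the real report format is not shown in the slides.
SAMPLE = """<report month="example">
  <cluster name="Fission" jobs="120"/>
  <cluster name="Icestorm" jobs="45"/>
</report>"""


def job_counts(xml_text: str) -> dict:
    """Map each cluster name to its job count."""
    root = ET.fromstring(xml_text)
    return {c.get("name"): int(c.get("jobs")) for c in root.iter("cluster")}


def text_chart(counts: dict, width: int = 40) -> str:
    """Draw one bar per entry, scaled so the largest value fills `width`."""
    peak = max(counts.values(), default=0) or 1
    rows = []
    for name, value in counts.items():
        bar = "#" * round(value / peak * width)
        rows.append(f"{name:<10} {bar} {value}")
    return "\n".join(rows)


if __name__ == "__main__":
    print(text_chart(job_counts(SAMPLE)))
```

Generating the chart straight from the XML is also what removes the manual copying step, which is where transcription errors tend to come from.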

  11. XSL/XML • Goal: Display XSL/XML pages inside of a wiki page • Problems • MoinMoin uses an outdated XSL library • XSL can contain JavaScript (XSS) • Solution • Created a wiki macro to convert XML with a specific XSL stylesheet on the server
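To make the XSS concern concrete, here is a crude check for script content in a stylesheet. It is only an illustration of the problem, not the wiki macro described above and not a complete defence; `script_risks` is a hypothetical helper, and the check will flag any attribute whose name merely starts with `on`.

```python
import xml.etree.ElementTree as ET


def script_risks(xsl_text: str) -> list:
    """List elements and attributes in a stylesheet that could carry script."""
    risks = []
    for el in ET.fromstring(xsl_text).iter():
        local = el.tag.rsplit("}", 1)[-1].lower()  # drop any {namespace} prefix
        if local == "script":
            risks.append("<script> element")
        for attr, value in el.attrib.items():
            attr_local = attr.rsplit("}", 1)[-1].lower()
            is_handler = attr_local.startswith("on")  # e.g. onclick, onload
            is_js_url = value.strip().lower().startswith("javascript:")
            if is_handler or is_js_url:
                risks.append(f"{local}@{attr_local}")
    return risks


XSL_NS = 'xmlns:xsl="http://www.w3.org/1999/XSL/Transform"'
SAFE = (f'<xsl:stylesheet version="1.0" {XSL_NS}>'
        '<xsl:template match="/"><p>ok</p></xsl:template></xsl:stylesheet>')
UNSAFE = (f'<xsl:stylesheet version="1.0" {XSL_NS}>'
          '<xsl:template match="/"><script>alert(1)</script>'
          '<a href="javascript:x()">link</a></xsl:template></xsl:stylesheet>')

if __name__ == "__main__":
    print(script_risks(SAFE))
    print(script_risks(UNSAFE))
```

Pinning the macro to one known stylesheet on the server, as the slide describes, avoids needing a filter like this at all, because users never supply the XSL.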

  12. Intel Compiler Issue (ICC) • Issue • Compile times on Quark are much longer than on Fission (head nodes) • Quark should be faster (hardware-wise) • 17 minutes on Quark • 8 minutes on Fission
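A comparison like 17 minutes versus 8 minutes is easier to trust when the same command is timed the same way on each machine. The helper below is a minimal sketch of that idea; `time_command` is a hypothetical name, and the demo times a trivial Python process because the real build command is not given in the slides.

```python
import subprocess
import sys
import time


def time_command(argv: list, repeats: int = 3) -> float:
    """Run a command several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Stand-in command; on a cluster this would be the actual compile step.
    seconds = time_command([sys.executable, "-c", "pass"])
    print(f"fastest run: {seconds:.3f} s")
```

Taking the fastest of several runs reduces noise from other users on a shared head node.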

  13. Intel Compiler Steps • Create test cases • Determine affected systems • Enable debugging • strace • Wireshark • Hardware Test Environment
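For the strace step, running with `strace -T` appends the time spent in each system call in angle brackets, which makes slow calls easy to rank. The sketch below parses that format; the sample lines, paths, and addresses are invented and are not taken from the actual investigation.

```python
import re

# Matches lines such as: open("path", O_RDONLY) = 3 <0.000041>
LINE = re.compile(r'^(\w+)\((.*)\)\s*=\s*(\S+).*<(\d+\.\d+)>\s*$')

# Invented sample in the shape of `strace -T` output.
SAMPLE = '''\
open("/example/licenses/a.lic", O_RDONLY) = 3 <0.000041>
open("/example/home/licenses", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000021>
connect(4, {sa_family=AF_INET, sin_port=htons(12345), sin_addr=inet_addr("192.0.2.10")}, 16) = 0 <1.502311>
'''


def slowest_calls(trace: str, top: int = 3) -> list:
    """Return (seconds, syscall, result, args) tuples, slowest first."""
    calls = []
    for line in trace.splitlines():
        match = LINE.match(line.strip())
        if match:
            name, args, result, seconds = match.groups()
            calls.append((float(seconds), name, result, args))
    return sorted(calls, reverse=True)[:top]


if __name__ == "__main__":
    for seconds, name, result, args in slowest_calls(SAMPLE):
        print(f"{seconds:9.6f}s  {name} -> {result}  ({args[:50]})")
```

In a trace shaped like this, a long `connect` next to fast local `open` calls would point at the network rather than the disk, which is where a packet capture becomes the natural next tool.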

  14. ICC Solution • License files were resolved in this order • License manager • User's home directory • /opt/intel • /apps/intel/..../license • 'Errors' in the license file cause the system to check all of the sources

  15. ICC Solution • The /opt/intel license files pointed to the license manager • This caused additional requests to the license manager (takes time) • Quark's /opt/intel license files pointed to the license servers the most • *Removed /opt/intel/license folder to fix the problem.
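The effect of the fix can be shown with a toy model. It assumes, as the slides say, that every license source ends up being checked, and that a source pointing at the license manager costs a network round trip while a plain local file is nearly free. The two timing constants are invented; only the source order comes from the slides.

```python
NETWORK_S = 1.0   # invented cost of one request to the license manager
LOCAL_S = 0.01    # invented cost of reading a local license file

# (source, does it lead to a license-manager request?)
BEFORE = [
    ("license manager", True),
    ("user's home directory", False),
    ("/opt/intel", True),                  # these files pointed back at the manager
    ("/apps/intel/..../license", False),
]


def lookup_time(sources: list) -> float:
    """Total time when every source in the list is checked."""
    return sum(NETWORK_S if remote else LOCAL_S for _, remote in sources)


# The fix from the slides: take /opt/intel out of the search.
AFTER = [entry for entry in BEFORE if entry[0] != "/opt/intel"]

if __name__ == "__main__":
    print(f"before: {lookup_time(BEFORE):.2f} s per lookup")
    print(f"after:  {lookup_time(AFTER):.2f} s per lookup")
```

A compile invokes the license check many times, so even a modest saving per lookup adds up over a full build.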

  16. Things Learned • Python • XSL • Creating and Signing SSL Keys • Unix permissions • Strace • Testing • Refactoring • Monitoring • Vim!
