
University of Calgary EcoGrid


Presentation Transcript


  1. University of Calgary EcoGrid
     Presenter: Dave Schulz (dschulz@ucalgary.ca), Research Computing Services, University of Calgary IT
     CANHEIT | On the EDGE | June 15-18, 2008 | University of Calgary

  2. Example Job Types
     • General purpose: arbitrary Linux applications
     • Rendering video and still images
     • CHARMM
     • MATLAB
     • Maple
     • Parameter sweeps
     • etc.

  3. Rendered Picture Examples
     • http://hpc.ucalgary.ca/EcoGrid/pics
     • Note: the images need high bandwidth, even from on campus

  4. What is EcoGrid?
     • A cycle-scavenging system: it uses otherwise idle CPU cycles to perform useful work
     • Most lab computers are powered on but idle for most of the night

  5. Consider:
     • Assumptions:
       • Idle from 6pm to 6am on weekdays: 12 h/day
       • Idle all weekend: 48 h/week
       • 2000 EcoGrid nodes
     • Calculation:
       • Idle time = 12 h × 5 + 24 h × 2 = 108 h/week
       • 108 h/week = 6480 CPU minutes per node per week
       • 2000 nodes × 6480 minutes/week = 12 960 000 CPU minutes/week
       • Over 52 weeks, that is about 674 000 000 CPU minutes/year
     • By contrast, the WestGrid Matrix cluster (128 nodes) running at 100% for one year would provide only 135 000 000 CPU minutes.
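The arithmetic on this slide can be checked with a short Python sketch. It assumes one CPU per node and a 52-week year; the constant and function names are illustrative and do not come from the presentation. Under those assumptions, multiplying the weekly total by 52 gives roughly 674 million CPU minutes per year.

```python
HOURS_IDLE_WEEKNIGHT = 12    # 6pm to 6am, per the slide's assumption
HOURS_IDLE_WEEKEND_DAY = 24  # idle all day Saturday and Sunday
WEEKS_PER_YEAR = 52          # assumption: every week of the year counts


def idle_cpu_minutes_per_week(nodes: int) -> int:
    """Idle CPU minutes per week, assuming one CPU per node."""
    idle_hours = HOURS_IDLE_WEEKNIGHT * 5 + HOURS_IDLE_WEEKEND_DAY * 2
    return nodes * idle_hours * 60


def idle_cpu_minutes_per_year(nodes: int) -> int:
    """Weekly idle CPU minutes scaled to a 52-week year."""
    return idle_cpu_minutes_per_week(nodes) * WEEKS_PER_YEAR


if __name__ == "__main__":
    print(f"{idle_cpu_minutes_per_week(2000):,} CPU minutes/week")
    print(f"{idle_cpu_minutes_per_year(2000):,} CPU minutes/year")
```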

  6. Goals (by July '09)
     • 1000+ nodes
     • Enough demand to consume 100% of the cluster
     • Full web-based reporting and statistics
     • Other clusters connected:
       • Origin
       • Terminus
       • Matrix
       • Lattice

  7. Benefits
     • Huge untapped computing resource
     • Compute cycles available to the campus without purchasing more equipment
     • The cluster will always have some fairly current hardware
     • Efficient use of power already wasted by idle computers
     • Little, if any, impact on lab users

  8. Drawbacks
     • More network utilization
     • Possible strain on the heat capacity of lab environmental systems
     • Somewhat increased electrical power draw
     • The lab power system should be able to supply this power, but may not under normal lab conditions

  9. How is it done?
     • Using Condor and Innotek VirtualBox (Windows platforms)
       • Next build will use QEMU
       • Checkpointing the machine
       • Jobs survive nightly reboots
     • Using Condor natively (Mac/Linux platforms)

  10. What is Condor?
     • Developed by the University of Wisconsin-Madison
     • Runs on many common operating systems, but jobs must be built for the operating system they run on
     • Windows is supported, but there is little demand for HPC applications on it
     • The Condor project started in 1988

  11. Why Condor?
     • Users retain full control over their computers
     • Provides job checkpointing, migration, and restart (with certain restrictions)
     • DAGMan (Directed Acyclic Graph Manager)
       • Takes care of job dependencies
       • Even allows portions of jobs to be run on completely dissimilar clusters
       • Very easy to express job dependencies
     • Very resilient to network problems
       • Jobs finish and wait until the network is restored to complete
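To illustrate what "taking care of job dependencies" means, here is a minimal Python sketch that orders jobs in a directed acyclic graph so that each job runs only after its parents finish. It shows the general idea only and is not DAGMan's input format or implementation. The job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each key maps to the set of jobs it depends on.
dependencies = {
    "prepare": set(),
    "render_a": {"prepare"},
    "render_b": {"prepare"},
    "combine": {"render_a", "render_b"},
}


def run_order(deps):
    """Return batches of jobs; each batch depends only on earlier batches."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose parents are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for batch in run_order(dependencies):
        print(batch)
```

Jobs within one batch do not depend on each other, so a scheduler could hand them to different machines at the same time.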

  12. What is QEMU?
     • Processor emulator with extensions to quickly run code built for the host processor
     • Open source
     • Runs a Linux guest on Windows™
     • Virtually undetectable to the Windows user
     • Runs as a service; only visible in Task Manager as a running task

  13. Image Size
     • Small node filesystem image (~20 MB) that kickstarts a full system on first boot
     • Reinstalls can be triggered from the head node, so software updates and fixes can be pulled in RPM form at the next reboot

  14. Networking
     • The central manager cannot initiate direct TCP/IP connections to the nodes, so something else is required
     • Options:
       • VPN
       • IP tunneling
       • Connection brokering

  15. Networking Cont'd
     • We have chosen GCB (Generic Connection Brokering), which is part of the Condor distribution
     • The compute nodes establish and maintain a connection to the GCB machine at startup
     • When the central manager needs to open a connection to a node, it contacts the node via the GCB machine
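The brokering idea can be sketched in a few lines of Python. This is a toy, in-memory model of the pattern described on this slide: nodes register outbound with a broker, and the manager reaches them through it. It is not the GCB protocol or any Condor API. The class names, node name, and message are hypothetical.

```python
class Broker:
    """Toy stand-in for a connection broker (not the real GCB protocol)."""

    def __init__(self):
        # node name -> callback standing in for the node's open connection
        self._registered = {}

    def register(self, node_name, deliver):
        # Called by the node itself: the connection is opened outbound,
        # from the node to the broker.
        self._registered[node_name] = deliver

    def relay(self, node_name, message):
        # Called by the central manager, which cannot reach the node directly.
        deliver = self._registered.get(node_name)
        if deliver is None:
            raise LookupError(f"{node_name} has not registered with the broker")
        return deliver(message)


class Node:
    """Compute node that registers with the broker when it starts."""

    def __init__(self, name, broker):
        self.name = name
        self.inbox = []
        broker.register(name, self._receive)

    def _receive(self, message):
        self.inbox.append(message)
        return f"{self.name} received: {message}"


if __name__ == "__main__":
    broker = Broker()
    node = Node("lab-pc-01", broker)
    print(broker.relay("lab-pc-01", "start job"))
```

In this model the manager only ever talks to the broker, so a node behind a firewall or NAT needs to allow only its own outbound connection.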

  16. Networking Cont'd
     • Vulture (a type of Condor) is the central manager. It coordinates all of the machines and the jobs they run.
     • Ecogrid is the submit machine, where users log in and submit their jobs.

  17. Scalability
     • GCB nodes can be created as network load requires

  18. Where do we plan on using this?
     • Lab computers (via VirtualBox/QEMU)
     • DTP desktop computers (via VirtualBox/QEMU)
     • Linux labs (natively)
     • Other clusters (via the Globus interface)
       • Will provide one common interface to many clusters

  19. Ideal Workload
     • Serial jobs, or possibly two-processor jobs depending on the available hosts
     • Jobs that can be broken into smaller jobs
     • Parameter sweeps
     • Self-compiled jobs (to take advantage of checkpointing and restart)
     • COMING SOON: MATLAB jobs!
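A parameter sweep fits this workload because every combination of inputs becomes its own independent serial job. The Python sketch below expands a set of parameter ranges into one argument string per job. The parameter names, values, and `--name=value` argument style are hypothetical and do not come from the presentation.

```python
from itertools import product


def sweep(parameters):
    """Yield one dict per combination of parameter values."""
    names = list(parameters)
    for values in product(*(parameters[name] for name in names)):
        yield dict(zip(names, values))


def to_arguments(point):
    """Format one sweep point as a command-line argument string."""
    return " ".join(f"--{name}={value}" for name, value in point.items())


if __name__ == "__main__":
    # Hypothetical ranges: 2 temperatures x 3 pressures = 6 independent jobs.
    grid = {"temperature": [280, 300], "pressure": [1, 2, 5]}
    for job_number, point in enumerate(sweep(grid)):
        print(job_number, to_arguments(point))
```

Each printed line could become the arguments of one job submission. Because the jobs share no state, they can run on whichever idle nodes happen to be available.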

  20. Timeline / Currently
     • Currently:
       • 80 IT Labs machines in the Elbow Room
       • Hoping to roll out a number of Linux labs with MATLAB installed before the end of summer

  21. Going Forward
     • Phase II: expansion of the project to non-UCIT labs

  22. Web Portal Demo

  23. Team Members
     • Stephen Cartwright
     • Robert Fridman
     • Eric Merth
     • David Schulz
     • Carol Sin

  24. Information Resources
     • Condor website: www.cs.wisc.edu/condor
     • QEMU website: bellard.org/qemu
     • VirtualBox website: www.virtualbox.org
     • Local website: hpc.ucalgary.ca/EcoGrid
