Ian C. Smith

Experiences with running MATLAB jobs on a power-saving Condor Pool. Ian C. Smith. University of Liverpool Condor Pool. Contains around 300 machines running the University’s Managed Windows (XP) Service.

Presentation Transcript


  1. Experiences with running MATLAB jobs on a power-saving Condor Pool Ian C. Smith

  2. University of Liverpool Condor Pool • Contains around 300 machines running the University’s Managed Windows (XP) Service. • Most have 2.33 GHz Intel Core 2 processors with 2 GB RAM, 80 GB disk, configured with two job slots / machine. • Software updates via a weekly re-imaging process. • Single combined submit host / central manager running on Sun V440 SMP server. • Restricted access to submit host for registered Condor users. • Currently running Condor 7.0.2 (moving to 7.2.x soon). • Policy is to run jobs only after at least 10 minutes of inactivity and with a low load average during office hours, and at any time outside office hours.

  3. MATLAB advantages • Originally developed for linear algebra algorithm development but now contains many built-in functions geared to different disciplines, divided into toolboxes. • Intuitive interactive environment allows rapid code development. • Simple but powerful file I/O: save <filename>, load <filename> (useful for checkpointing). • Allows users to create their own functions stored as M-files. • “Standalone” applications can be built from M-files: • can run on platforms without MATLAB installed • do not need a licence to be able to run • can include all toolbox functions • APIs available for FORTRAN and C codes (“MEX files”)
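
The checkpointing idea mentioned above can be sketched in Python as an analogue of MATLAB's save/load (this is not MATLAB code and not the author's implementation). The function names, the state layout and the toy summation workload are all hypothetical; the point is only the pattern of reloading saved state at start-up so that a job restarted after eviction continues where it left off.

```python
import os
import pickle


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place,
    so an eviction in the middle of a write cannot corrupt the checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)          # plays the role of MATLAB `save <filename>`
    os.replace(tmp, path)


def run_with_checkpoint(path, total_steps, checkpoint_every=100):
    """Toy workload: sum the integers 0..total_steps-1, resuming from
    `path` if a checkpoint from an earlier (evicted) run exists."""
    state = {"step": 0, "total": 0}
    if os.path.exists(path):
        with open(path, "rb") as f:
            state = pickle.load(f)     # plays the role of MATLAB `load <filename>`
    while state["step"] < total_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

For the checkpoint to survive a move to another machine, the checkpoint file would also have to be transferred with the job, which this sketch does not address.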

  4. MATLAB disadvantages • Even standalone applications can run slower than equivalent C or FORTRAN implementations. • Standalone applications aren’t quite what they may seem: • more than just an .exe – several files need to be packaged and deployed • need access to MATLAB run-time libraries usually via MATLAB Component Runtime (150 MB self-extracting .exe) • luckily we have MATLAB pre-installed on all PCs in Condor pool (originally used a network drive) • Run-time errors can be difficult to trace when MATLAB jobs are run under Condor: • need to run under Condor on local PC • configure with USE_VISIBLE_DESKTOP=True to see pop-up messages • Jobs submitted in a UNIX environment but code developed under Windows.

  5. Minor MATLAB irritations • Output files occasionally go missing: • specify all required files using transfer_output_files • identify problem jobs with condor_q -held • resubmit with condor_release -all • Jobs sometimes run “forever”: • use condor_vacate to move job to another machine • less of a problem during term time as jobs usually get evicted by logins • Difficult to reproduce these problems: • happen quite rarely ( < 1 in ~1000 jobs) • many jobs based on stochastic methods
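
As an illustration of listing every required output file explicitly, the following Python sketch generates a Condor submit description. The executable name, output file names and log file name are hypothetical placeholders, and the set of submit commands is a minimal assumption rather than the configuration used at Liverpool.

```python
def make_submit_description(executable, output_files, n_jobs=1):
    """Return the text of a Condor submit description that names every
    output file explicitly instead of relying on automatic detection."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        # Comma-separated list of the files to bring back from the execute host.
        "transfer_output_files = " + ", ".join(output_files),
        "log = job.log",               # hypothetical log file name
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical standalone MATLAB executable and result files.
    print(make_submit_description("model.exe", ["results.mat", "state.mat"], 10))
```

Jobs whose listed files cannot be transferred back are the ones that then show up under condor_q -held and can be retried with condor_release -all, as the slide describes.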

  6. MATLAB Research Applications • Predicting the spread of avian influenza outbreaks in poultry flocks (Veterinary Clinical Science). • Modelling of E. coli propagation in dairy cattle (Veterinary Clinical Science). • Testing of parallel genetic algorithms in a complex classification system (Electrical Engineering and Electronics). • Simulation of the infection of a bacterial cell by a virus (Mathematical Sciences). • Modelling the effects of radiotherapy on normal tissue using 3D voxel arrays (Medical Imaging and Radiotherapy).

  7. Power-saving at Liverpool • Have around 2 000 centrally managed PCs across campus which were previously left powered up overnight, at weekends and during vacations. • Original power-saving policy was to power off machines after 30 minutes of inactivity; now hibernate them after 10 minutes of inactivity. • Policy has reduced wasteful inactivity time by ~ 200 000 – 250 000 hours per week (equivalent to 20-25 MWh) leading to an estimated saving of approx. £125 000 p.a. • Makes extensive use of PowerMAN system from Data Synergy comprising: • service which forces machines into a low-power state and reports machine activity to Management Reporting Platform • Management Reporting Platform - central server from where usage stats can be retrieved and viewed via a web browser
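
The figures quoted above can be cross-checked with a little arithmetic. The sketch below derives the per-machine power draw implied by the hours and MWh figures (100 W at both ends of the range), and the electricity price implied by the annual saving. The price calculation assumes that the saving is purely electricity cost and that the weekly figure applies for 52 weeks; neither assumption is stated in the slides, so the result is only indicative.

```python
def implied_power_watts(energy_mwh, machine_hours):
    """Average power per idle machine implied by an energy total and a
    number of machine-hours (1 MWh = 1 000 000 Wh)."""
    return energy_mwh * 1_000_000 / machine_hours


def implied_price_per_kwh(annual_saving_gbp, weekly_mwh, weeks_per_year=52):
    """Electricity price implied by an annual saving, ASSUMING the saving is
    all electricity and the weekly energy figure holds all year."""
    return annual_saving_gbp / (weekly_mwh * 1000 * weeks_per_year)


if __name__ == "__main__":
    for hours, mwh in [(200_000, 20), (250_000, 25)]:
        watts = implied_power_watts(mwh, hours)
        price = implied_price_per_kwh(125_000, mwh)
        print(f"{hours} h/week, {mwh} MWh/week: "
              f"{watts:.0f} W per machine, about GBP {price:.3f} per kWh")
```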

  8. Adapting Condor for use with power-saving PCs • Two main problems: • how to ensure Condor jobs are not evicted by hibernating/powered-off PCs • how to wake up dormant PCs to run Condor jobs on-demand • Originally used Microsoft system service to power-down PCs after 30 min inactivity: • runs .bat file which checks if a user is logged in and shuts machine down if not • doesn’t detect owner of Condor job as a logged-in user • need to check for presence of condor_exe.bat • PowerMAN service now prevents job eviction: • can provide PowerMAN with a list of “protected programs” • ensures that system remains active if a protected program is running • include condor_starter process as a protected program (only present while a Condor job is running).

  9. Adapting Condor for use with power-saving PCs • Wake-on-LAN (“WoL”) used to bring hibernating machines back to full power: • NICs must remain powered up during hibernation/power-off • NICs must be capable of waking machines on receipt of a “magic packet” • network must be able to route “magic packets” • cron job runs on the submit host and examines the state of the queue (condor_q) and pool (condor_status): • if there are more idle jobs in the queue than Unclaimed machines then hibernating machines need to be woken up • find number of powered-up machines in each “teaching centre” (classroom) • estimate the number of hibernating machines in each teaching centre from total number of machines in each • sort centres from highest number of available machines to lowest • wake up centres in turn until sufficient machines woken to meet the demand (or all centres woken up) • MAC addresses of machines are stored in files sorted according to teaching centre (needed for Wake-on-LAN)
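
The wake-up logic described above could look roughly like the Python sketch below. It is not the actual cron script: the function names and the data layout are hypothetical, “available machines” is interpreted here as the estimated number of hibernating machines in a centre, and the two job slots per machine are ignored for simplicity. The magic packet layout (six 0xFF bytes followed by the MAC address repeated 16 times) is the standard Wake-on-LAN format; UDP port 9 and the broadcast address are common defaults, not values taken from the slides.

```python
import socket


def magic_packet(mac):
    """Build a Wake-on-LAN magic packet: 6 bytes of 0xFF followed by the
    6-byte MAC address repeated 16 times (102 bytes in total)."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return b"\xff" * 6 + raw * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Send the magic packet as a UDP broadcast (address and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, port))


def centres_to_wake(idle_jobs, unclaimed, centres):
    """Choose which teaching centres to wake.

    `centres` maps a centre name to (total_machines, powered_up_machines).
    Hibernating machines are estimated as total minus powered-up, and
    centres are woken from the largest estimate downwards until the
    shortfall of machines is covered or no centres remain.
    """
    shortfall = idle_jobs - unclaimed
    if shortfall <= 0:
        return []                      # enough Unclaimed machines already
    hibernating = {name: max(total - up, 0)
                   for name, (total, up) in centres.items()}
    chosen, woken = [], 0
    for name in sorted(hibernating, key=hibernating.get, reverse=True):
        if woken >= shortfall:
            break
        if hibernating[name] == 0:
            continue                   # nothing to wake in this centre
        chosen.append(name)
        woken += hibernating[name]
    return chosen
```

A driver script would then read the per-centre MAC address files for each chosen centre and call send_magic_packet for every address. Because whole centres are woken at once, this approach can wake more machines than strictly needed.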

  10. Automatic wake up issues • Assumes that any job can run on any machine: • users cannot choose particular teaching centres or machines in their job Requirements • ideally, pool needs to be homogeneous • errors in Requirements specification can cause severe problems (machines repeatedly wake up then hibernate) • cron job now includes a “sanity check” for this • Large clusters of jobs can cause the Condor scheduler to become overloaded: • condor_q times out so cron job cannot determine queue state • only a transient problem – load eventually drops off and condor_q responds again • Can only estimate number of hibernating machines in each centre • May wake up more machines than needed

  11. Automatic wake up in action – Condor pool machine statistics

  12. Automatic wake up in action – PowerMAN statistics

  13. Recent and Future Developments • Recently moved to a policy of hibernating machines after 10 minutes of inactivity • submit host / central manager needs to work harder to get jobs running before recently woken machines go back to hibernation • move execute hosts from Owner to Unclaimed state after just 5 minutes idle • update activity timer every 1 minute (default is 5 minutes) • increase number of scheduler and negotiator cycles using SCHEDD_INTERVAL=60, NEGOTIATOR_INTERVAL=60 • around 25% of machines still hibernate after first wakeup • see a ramp up in machines running Condor jobs over about an hour • little impact on Condor users • energy wastage offset by savings with user logouts

  14. Recent and Future Developments • Migrating to Condor 7.2 shortly • Has some interesting power-management features • Automatic power-down on execute hosts could provide a useful “safety net” but PowerMAN likely to remain primary power management tool • Can retain records of ClassAds of machines in low-power state • could be useful in matchmaking jobs to powered-down machines • matchmaking logic already in Condor • nice if Condor could use this to provide a list of machines to wake-up on demand • ... and wake them up with condor_wakeup ? • would like to ensure that powered-down machines are still out there (not broken, permanently turned off, not listening etc) • also useful to see powered-off machines represented in condor_status output • Couple of extra “wishes” • allow jobs to claim all slots on a machine (useful if they have large memory requirements) • provide a “logged-in user” machine ClassAd attribute

  15. Further Information http://www.liv.ac.uk/e-science/condor i.c.smith@liverpool.ac.uk
