Running CCSM


  1. Running CCSM Tony Craig CCSM Software Engineering Group ccsm@ucar.edu

  2. Outline • General review of CCSM • Setting up and running a simple case • Datasets • Production • Modifying source code • Errors • Tools • Performance

  3. Review of CCSM • Five components / Ten models • Atmosphere(3) : atm, datm, latm • Ocean(2) : ocn, docn • Land(2) : lnd, dlnd • Ice(2+) : ice, ice (prescribed mode), ice (mixed layer ocean mode), dice • Coupler(1) : cpl • Communication via MPI between components and coupler only • Each component runs on multiple processors via MPI, OpenMP, or MPI/OpenMP

  4. Component parallelization • atm : MPI, OpenMP, or MPI/OpenMP • lnd : MPI, OpenMP, or MPI/OpenMP • ice : MPI only • ocn : MPI only • cpl : OpenMP only • The data models, datm, docn, dice, dlnd, and latm : serial only, 1 processor

  5. Configurations • A = datm, dlnd, docn, dice, cpl • B = atm, lnd, ocn, ice, cpl • C = datm, dlnd, ocn, dice, cpl • D = datm, dlnd, docn, ice, cpl • F = atm, lnd, docn, ice (prescribed mode), cpl • G = latm, dlnd, ocn, ice, cpl • H = atm, dlnd, docn, dice, cpl • I = datm, lnd, docn, dice, cpl • K = atm, lnd, docn, dice, cpl • M = latm, dlnd, docn, ice (ml ocn mode), cpl

  6. Resolutions • atm/lnd/datm/dlnd = T42, T31 • ocn/ice/docn/dice = gx1v3, gx3, gx3v4 • latm = T62 • Scientifically validated combinations • B, T42_gx1v3 = b20.007 control run (test.a1 case) • B, T31_gx3v4 = paleo control run (test.a2 case)

  7. “Available” configurations • [table of configuration/resolution combinations; * = supported (subject to change); additional marks indicate the b20.007 control and the paleo control]

  8. Platforms • IBM • SGI • Compaq*

  9. Review of scripts • Main script (test.a1.run) • Sets primary ccsm environment variables • Calls $model.setup.csh • Gets input datasets • Builds components • Runs model • Archives • Harvests

  10. Setting up a simple case • Use the GUI !! • The GUI modifies the scripts and creates a new case for you • Input $CASE, $CSMROOT, $CSMDATA, $EXEROOT • Input resolution • Input configuration (A-M) • Sets processor layout based on configuration (first guess) • Sets some batch environment variables • Works well in the NCAR environment; other sites require tuning after the scripts are generated

  11. Setting up a simple case, without GUI • Create new case directory under scripts, copy over test.a1 files • Rename file test.a1.run to $CASE.run • Edit $CASE, $CSMROOT, $CSMDATA, $EXEROOT, $ARCROOT • Edit batch environment parameters • Edit $GRID • Edit $SETUPS • Edit $NTASKS, $NTHRDS
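
  A minimal csh sketch of those steps, assuming the test.a1 case lives in a scripts/test.a1 directory and the new case is named mycase (all names and paths are placeholders):

      cd $CSMROOT/scripts
      mkdir mycase                  # new case directory under scripts
      cp test.a1/* mycase/          # copy over the test.a1 files
      cd mycase
      mv test.a1.run mycase.run     # rename the run script to $CASE.run
      # then edit mycase.run: $CASE, $CSMROOT, $CSMDATA, $EXEROOT, $ARCROOT,
      # the batch environment parameters, $GRID, $SETUPS, $NTASKS, $NTHRDS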

  12. $NTASKS, $NTHRDS, batch • $NTASKS is the number of MPI tasks for each component • $NTHRDS is the number of OpenMP threads per MPI task • $NTASKS*$NTHRDS = total number of processors for each component • Tuning is required to get an optimal load balance • Batch parameters should match the processors used; consistency is important, and task_geometry (LoadLeveler) is very powerful
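
  To make the arithmetic concrete, here is a sketch of how these appear in the main script as csh arrays, using the same B-case numbers shown on slide 14 (illustrative, not a tuned layout):

      # one entry per component:  atm  lnd  ocn  ice  cpl
      set NTASKS = (  8   2  40   8   1 )
      set NTHRDS = (  4   4   1   1   4 )
      # processors per component = NTASKS * NTHRDS:
      #   atm 8*4=32, lnd 2*4=8, ocn 40*1=40, ice 8*1=8, cpl 1*4=4
      # total = 92 processors; the batch request should match this total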

  13. Component parallelization • atm : MPI, OpenMP, or MPI/OpenMP • lnd : MPI, OpenMP, or MPI/OpenMP • ice : MPI only, NTHRDS=1 • ocn : MPI only, NTHRDS=1 • cpl : OpenMP only, NTASKS=1 • The data models, datm, docn, dice, dlnd, and latm : serial only, 1 processor, NTASKS=1, NTHRDS=1

  14. Main script configuration summary
  • B case:
        MODELS  (  atm   lnd   ocn   ice   cpl )
        SETUPS  (  atm   lnd   ocn   ice   cpl )
        NTASKS  (    8     2    40     8     1 )
        NTHRDS  (    4     4     1     1     4 )
  • datm/dlnd/ocn/ice case:
        MODELS  (  atm   lnd   ocn   ice   cpl )
        SETUPS  ( datm  dlnd   ocn   ice   cpl )
        NTASKS  (    1     1    64    16     1 )
        NTHRDS  (    1     1     1     1     4 )

  15. $RUNTYPE • Startup - initial startup of the model using arbitrary initialization • set $CASE, $BASEDATE • Continue - continuation of a case, bit-for-bit guaranteed, uses model restart files • set $CASE • Branch - start a new case as a bit-for-bit continuation of another case, uses model restart files, requires a continuous date • set $CASE, $REFCASE, $REFDATE • Hybrid - start a new case, not a bit-for-bit continuation, uses model initial files in the atm and lnd models, can change the starting date • set $CASE, $BASEDATE, $REFCASE, $REFDATE
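
  For example, a hedged sketch of the main-script settings for a branch run off the b20.007 control; the new case name and the date format are assumptions, so check your own scripts:

      setenv RUNTYPE  branch
      setenv CASE     mynewcase      # placeholder for the new case name
      setenv REFCASE  b20.007
      setenv REFDATE  0300-01-01     # must continue the reference case's dates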

  16. Coupler namelist • stop_option : ndays, nmonths, newmonth, halfyear, newyear, newdecade • stop_n : integer (used with ndays, nmonths) • rest_freq : ndays, monthly, quarterly, halfyear, yearly • rest_n : integer (used with ndays) • diag_freq : daily, weekly, biweekly, monthly, quarterly, yearly, ndays • diag_n : integer (used with ndays) • info_bcheck : integer
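
  A sketch of how cpl.setup.csh might write these variables into the coupler namelist; the namelist group name (inparm) and the file name (cpl.stdin) are assumptions for illustration:

      cat >! cpl.stdin << EOF
       &inparm
        stop_option = 'nmonths'
        stop_n      = 12
        rest_freq   = 'monthly'
        diag_freq   = 'yearly'
        info_bcheck = 0
       /
      EOF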

  17. Data Sets • Types • Grid files, binary • Namelist input, ascii • Initial datasets, binary/netcdf • Restart datasets, binary • History datasets, netcdf • Log files, ascii • inputdata directory • This is usually pointed to by $CSMDATA

  18. Data Flow, Input • [diagram: setup scripts in scripts/$CASE stage data from $CSMDATA (the inputdata directory), $ARCROOT/restart, and the Mass Store into $EXEROOT] • Everything is copied to $EXEROOT • Tools and scripts attempt to automate most of the “get input files” step • Main script variables include $CSMDATA, $LFSINP, $LMSINP, $MACINP, $RFSINP, $RMSINP

  19. Data Flow, Output • Output files are moved out of $EXEROOT • Harvesting is a separate process • Writing of restart files is coordinated by the coupler • Writing of history files is not coordinated between components; monthly average is the default • Main script variables include $LMSOUT, $MACOUT, $RFSOUT • [diagram: scripts move output from $EXEROOT to $ARCROOT (archiving) and then to the Mass Store (harvesting)]

  20. Log Files • Each component produces a log file, $model.log.$LID • $LID is a system date stamp • Date stamps are the same on all log files for a run • Log files are written into the $EXEROOT/$model directories during execution • Log files are copied to $SCRIPTS/logs at the end of a run • There are also separate stdout and stderr files that sometimes contain useful output

  21. Archiving, ccsm_archive • Means moving model output to a separate area on local disk via the ccsm_archive script • The local disk area is set by $ARCROOT in the main script • Benefits • Allows separation of running and harvesting • Mass store outages do not prevent continued execution of the model • Allows users to run in volatile temporary space • Supports simple harvesting in a clustered machine environment (like nirvana)

  22. Harvesting, $CASE.har • Means copying model output to the local mass store • Separate script in scripts/$CASE, $CASE.har • Typically submitted in batch, can also be run interactively • Submitted by main script after model run, off by default • Sources ccsm_joe for important environment variables • Harvests all files in $ARCROOT/{atm,lnd,ocn,ice,cpl} • Verifies accurate copy on mass store before removing • Can scp files to remote machines
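
  For example (the batch command is site dependent; LoadLeveler is shown, and mycase is a placeholder):

      cd $CSMROOT/scripts/mycase
      llsubmit mycase.har      # submit the harvester in batch, or
      ./mycase.har             # run it interactively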

  23. Exact Restart • CCSM can stop and restart exactly • The coupler controls the frequency of restart file writes • Restart files guarantee bit-for-bit continuity at a checkpoint boundary • rpointer files are updated in the scripts/$CASE directory after each run

  24. Restart file management (1) • ccsm_archive • In scripts/$CASE • Called from the main script after the model run is complete, commented out by default • $ARCROOT/restart contains the latest full set of restart files • ccsm_archive copies a full set of restart datasets into $ARCROOT/restart after each run • ccsm_archive then tars up that restart set into the $ARCROOT/restart.tars directory • These tar files can be large; regular cleanup is required

  25. Restart file management (2) • ccsm_getrestart • In scripts/tools • Called from the main script before the model run starts, commented out by default • Copies the latest set of restart files from $ARCROOT/restart to the appropriate directories • To “back up” a model run to a previous model date • Assumes both ccsm_archive and ccsm_getrestart have been active in the main script • Delete all files in $ARCROOT/restart • Untar an $ARCROOT/restart.tars file into $ARCROOT/restart • Resubmit
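
  A csh sketch of that back-up procedure; the tar file name is a placeholder, so pick the restart set for the date you want to return to:

      rm $ARCROOT/restart/*                      # delete the current restart set
      cd $ARCROOT/restart
      tar xf ../restart.tars/mycase.restart.tar  # untar an earlier set (hypothetical name)
      # then resubmit the main run script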

  26. Auto-Resubmit • RESUBMIT file in the scripts/$CASE directory • contains a single integer • If the integer is >0, the main script resubmits itself and decrements the integer • Runaway jobs • FIRST: set the value in the RESUBMIT file to 0 • Then attempt to kill the running jobs
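
  For example, to queue five follow-on runs, or to stop a runaway chain:

      echo 5 >! RESUBMIT    # the main script will resubmit itself 5 more times
      echo 0 >! RESUBMIT    # the FIRST step in stopping runaway jobs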

  27. Production • Modify the coupler namelist in cpl.setup.csh: set run length and restart frequency, turn down diagnostic frequency, set info_bcheck to 0 • Run a startup, hybrid, or branch $RUNTYPE case • Transition to the continue $RUNTYPE • Turn on archiving, harvesting, and ccsm_getrestart • Edit the RESUBMIT file to initiate auto-resubmission

  28. Monitoring a run • Monitor the batch jobs using llq, bjobs, or qstat • Verify that runs complete successfully; check for timing information at the end of a log file • tail -f $EXEROOT/cpl/cpl.log* • If runs are not succeeding • tail each log file • grep for ENDRUN in the atm and lnd log files • Check stdout and stderr files for component or system messages • Look for core files in $EXEROOT/$model • Look for zero length files in $EXEROOT/$model • Check email
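
  Illustrative commands for these checks (batch query commands vary by site; the LoadLeveler form is shown):

      llq -u $LOGNAME                  # batch status; or bjobs, qstat
      tail -f $EXEROOT/cpl/cpl.log*    # watch the coupler log during a run
      grep ENDRUN $EXEROOT/atm/atm.log.* $EXEROOT/lnd/lnd.log.*
      ls -l $EXEROOT/*/core*           # core files left by a crashed component
      find $EXEROOT -size 0 -print     # zero length files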

  29. Modifying source code • Modifying files in the ccsm models directory is not recommended • Create directories under scripts/$CASE • src.atm, src.lnd, src.ocn, src.ice, src.cpl • Copy a subset of model source code to these directories and modify it there • These copies have the highest priority in the build • Benefits include • Release source code remains unmodified and available • Allows implementation of case dependent code modifications
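
  A sketch of overriding a single atm routine for one case; the case name, source path, and file name are hypothetical:

      cd $CSMROOT/scripts/mycase
      mkdir src.atm src.lnd src.ocn src.ice src.cpl
      # copy only the routine you intend to change from the release tree:
      cp $CSMROOT/models/atm/somedir/somefile.F90 src.atm/
      # edit src.atm/somefile.F90; the build uses it instead of the release copy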

  30. Multiple Machine Support • Should run on blackforest, babyblue, and ute “out of the box” • “Other” machines include seaborg, nirvana, eagle, falcon, cheetah • Supported platforms are indicated in $OS, $SITE, $MACH, $ARCH environment variables in the main script • See also scripts/tools/test.a1.mods.$MACH for suggested changes to test.a1.run for “other” machines.

  31. Running on a “New” Machine • Main script • Set batch queue commands • Add new $OS, $SITE, $MACH, $ARCH options • Set standard CCSM path names, $CSMROOT, … • Harvester submission issues • Set data movement variables, $LMSINP, … • Harvester script • May require modification • Tools • May need to modify ccsm_msread, ccsm_mswrite • Build • Modify models/bld/Macros.$OS file

  32. ccsm_joe • Created by main script • Updated every time the main script runs • Case dependent • Records important ccsm environment variables • Can be “sourced” by other scripts to inherit ccsm environment variables
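
  A hedged sketch of a site script inheriting the case environment, assuming ccsm_joe lives in the scripts/$CASE directory (mycase is a placeholder):

      #!/bin/csh -f
      source $CSMROOT/scripts/mycase/ccsm_joe
      echo "case $CASE runs in $EXEROOT and archives to $ARCROOT"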

  33. Interactive/Batch Issues • Can run the main script interactively • Typically used to build and pre-stage initial data • Uncomment the “exit” command in the main script to stop it before ccsm execution starts • Batch environment is highly site dependent • NQS • LoadLeveler • LSF • PBS

  34. Common Errors (1) • Model won’t build • Try rebuilding clean • Remove all obj directories ($OBJROOT/$model/obj, normally equivalent to $EXEROOT/$model/obj) • When rebuilding, make sure $SETBLD is true in the main script • Model won’t continue due to a restart problem • Determine the cause of the problem: quota, hardware, script, zero length files, rpointer problems • Fix if possible • Back up to the latest “good” restart dataset • Rerun
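
  A sketch of the clean-rebuild sequence described in the first bullet; the $SETBLD comment shows intent rather than the exact mechanism in your script:

      rm -rf $OBJROOT/*/obj      # remove all component obj directories
      # in the main script, make sure the build is turned on:
      #   setenv SETBLD true
      # then resubmit the main run script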

  35. Common Errors (2) • Ice model stops due to mp transport error • Double ndte in the ice model namelist in ice.setup.csh • Back up to the latest “good” restart dataset • Run past the previous stop date • Reset the ndte value • Ocean model non-convergence • Add about 10% to the number of model timesteps per hour (DT_COUNT in ocn.setup.csh) • Back up to the latest “good” restart dataset • Run past the previous stop date • Reset DT_COUNT • Non-convergence on the first timestep is a special case

  36. Tools • Under scripts/tools • ccsm_getfile : hierarchical search for a file • ccsm_getinput : hierarchical search for an input file • ccsm_msread : copies a file from the local mass store • ccsm_mswrite : copies a file to the local mass store • ccsm_checkenvs : echoes ccsm environment variables, used to create ccsm_joe • ccsm_getrestart : copies restart files from $ARCROOT/restart to the appropriate $EXEROOT and scripts/$CASE directories

  37. Performance • This is complicated! • Issues • Performance of components and the system as a function of resolution and configuration • Scalability and scaling efficiency of individual components • Task/thread counts • Components sharing nodes; overloading nodes with multiple components, overloading threads, overloading tasks • Load balance of the coupled system

  38. Component Timings

  39. CCSM Load Balancing • Processor layout: 40 ocn, 32 atm, 16 ice, 12 lnd, 4 cpl (104 total processors) • [bar chart of component timings in seconds per model day; the individual values did not survive transcription]

  40. Component/Hardware layout • Machine: a set of nodes • Nodes: groups of processors that share memory • Processors: individual computing elements • General rules • Do not oversubscribe processors; place only 1 MPI task or 1 thread on each processor • Minimize the number of nodes used for a given component and processor requirement • Multiple components can share a node as long as there is no oversubscription of processors • Test several decompositions, layouts, and task/thread combinations to optimize performance (see the sketch below)
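
  As one concrete handle on layout, LoadLeveler's task_geometry keyword (noted on slide 12) pins specific MPI tasks to specific nodes; the task numbering here is purely illustrative:

      #@ task_geometry = {(0,1,2,3)(4,5,6,7)(8,9)}
      # each (...) group is one node: tasks 0-3 share a node, 4-7 share a
      # second node, and tasks 8-9 get a third; leave room for OpenMP threads
      # so that tasks x threads never exceeds a node's processor count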

  41. Summary • CCSM is a complicated multi-executable climate model, expect there to be “spin-up” time • CCSM is a scientific research code • There are many possible components, configurations, platforms, and resolutions; we are unable to test everything • Users are responsible for validating their science • NCAR can help with software/configuration problems, ccsm@ucar.edu • Please report bugs, fixes, improvements, and ports to new hardware, so we can incorporate those changes! ccsm@ucar.edu
