Training session 2 : Advanced training course on modipsl and libIGCM
November 14th 2013, MdS
Outline
• IPSL climate modelling centre (ICMC) presentation
• IPSLCM history and perspective
• Mini how to use modipsl/libIGCM
• Post-processing with libIGCM
• Monitoring a simulation
• Hands-on
ICMC organisation
PI : J-L Dufresne
Office : L. Bopp, MA Foujols, J. Mignot
Steering committee
• Modeling platform (IPSL-ESM) : Arnaud Caubel (LSCE), Marie-Alice Foujols (IPSL)
• Current and future climate changes : Jean-Louis Dufresne (LMD), Olivier Boucher (LMD)
• Atmospheric and surface physics and dynamics (LMDZ) : Frédéric Hourdin (LMD), Laurent Fairhead (LMD)
• Paleoclimate and last millennium : Pascale Braconnot, Masa Kageyama (LSCE)
• Ocean and sea ice physics and dynamics (NEMO, LIM) : C. Ethé (IPSL), Claire Lévy, Gurvan Madec (LOCEAN)
• "Near-term" prediction (seasonal to decadal) : Eric Guilyardi (LOCEAN), Juliette Mignot (LOCEAN)
• Atmosphere and ocean interactions (IPSL-CM, different resolutions) : Sébastien Masson (LOCEAN), Olivier Marti (LSCE)
• Regional climates : Robert Vautard (LSCE), Laurent Li (LMD)
• Atmospheric chemistry and aerosols (INCA, INCA_aer, Reprobus) : Anne Cozic (LSCE), M. Marchand (LATMOS)
• Biogeochemical cycles (PISCES) : Laurent Bopp (LSCE), Patricia Cadule (IPSL)
• Continental processes (ORCHIDEE) : Philippe Peylin (LSCE), Josefine Ghattas (IPSL)
• Evaluation of the models, present-day and future climate change analysis : Sandrine Bony (LMD), Patricia Cadule (IPSL), Marion Marchand (LATMOS), Juliette Mignot (LOCEAN), Jérôme Servonnat (LSCE)
• Data Archive and Access Requirements : Sébastien Denvil (IPSL), Karim Ramage (IPSL)
IPSLCM history and scientific articles (timeline figure)
• IPCC reports : FAR (1990), SAR (1995), TAR (2001), AR4 (2007), AR5 (2013)
• CMIP projects : CMIP 1 & 2, CMIP3, CMIP5
• Models : IPSL-CM1 (few articles), IPSL-CM2 (some articles), IPSL-CM4 (10+ articles), IPSL-CM5 (30+ articles), IPSL-CM6
LMDZ : atmospheric component http://lmdz.lmd.jussieu.fr/?set_language=en
Next LMDZ training session : 9-11 December 2013 ; registration before 15th November : http://studs.unistra.fr/studs.php?sondage=1wgk8t9v44nsml27
Short history of IPSL model : http://icmc.ipsl.fr/index.php/icmc-models
1979 : 1st Linpack performance list 80 Mflops
Supercomputers timeline (top500.org) : ×10 every 4 years
Complexity and resolution of models IPCC, AR4, WG1, Chap. 1, fig 1.2 and 1.4
top500.org : number of CPUs/cores, 1993 to 2013 (figure ; vertical axis from 10 to 100 000)
Technical challenges : HPC
• More parallelism within each component :
  • MPI : message passing
  • hybrid, i.e. MPI/OpenMP : directives and shared memory
• More parallelism in the coupled model :
  • at least 3 executables
  • each with MPI or MPI/OpenMP
  • more executables with XIOS : IO servers
• Huge amount of data produced, to be analysed
On the road to IPSL-CM6
• New physical packages : LMDZ, NEMO, ORCHIDEE
• Increased horizontal and vertical resolutions
• Ensembles of simulations
• Longer simulations : paleo
• More complexity : INCA chemistry added
• More processors used in parallel
• New dynamical core : DYNAMICO
• Optimisation of IO
• Improvement and reliability of libIGCM
Retrieve, compile and launch a _v5 configuration
• Get MODIPSL
    svn co http://forge.ipsl.jussieu.fr/igcmg/svn/modipsl/trunk modipsl
• Get IPSLCM5_v5
    cd modipsl/util ; ./model IPSLCM5_v5
• Install the Makefiles
    cd modipsl/util ; ./ins_make
• Compile
    cd modipsl/config/IPSLCM5_v5 ; gmake + chosen resolution
• Install the reference experiment (and post-processing)
    cp EXPERIMENT/IPSLCM5/piControl/config.card .
    vi config.card        ### JobName=MYEXP
    ../../util/ins_job    ### copies the piControl directory into MYEXP with COMP, DRIVER, PARAM
• Submit the launch job
    cd modipsl/config/IPSLCM5_v5/MYEXP
    ccc_msub Job_MYEXP    (TGCC)
    llsubmit Job_MYEXP    (IDRIS)
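Workflow overview (diagram)
• IPSL sources of components : cvs/svn servers
• Steps, in order : connection ; specific configuration downloading ; compilation ; simulation set up ; physical package choice and set up ; job set up and submission
• Tools : Modipsl (downloading, compilation), libIGCM (simulation set up, job set up and submission)
• Machines : front end, computing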
Generic job : AA_Job, PeriodLength
libIGCM library : schematic description (diagram)
• EXP00/DRIVER : EXP00 driver
• EXP00/COMP : card
Job chain (diagram)
• Computing job : Job_EXP00, resubmitted in sequence
• Post-processing jobs :
  • RebuildFrequency : rebuild
  • PackFrequency : pack_debug, pack_restart, pack_output
  • TimeSeriesFrequency : create_ts, then monitoring
  • SeasonalFrequency : create_se, then atlas
TGCC computers and file system in a nutshell (October 2013)
Computers
• login : curie front-end, airain front-end
• compute : curie thin nodes (-q standard), curie large nodes (-q xlarge), curie hybrid nodes (-q hybrid), airain nodes
File system
• $HOME : small precious files
• $CCCWORKDIR : sources, small results ; IGCM_OUT : MONITORING/ATLAS ; copied to dods/work with dods_cp
• $SCRATCHDIR : temporary REBUILD ; IGCM_OUT : files to be packed, outputs of post-proc jobs
• $CCCSTOREDIR : IGCM_OUT : packed results (Output, Analyse : SE and TS) ; copied to dods/store with dods_cp ; HPSS robotic tapes, ccc_hsm get
• Other labels on the diagram : cp, quotas
Legend : temporary space, non saved space, saved space, space on tapes, visible from www
Job chain at TGCC (diagram)
• Compute (curie) : Job_EXP00, advancing by PeriodLength ; files to rebuild go to $SCRATCHDIR/IGCM_OUT/.../REBUILD
• Post (curie), every RebuildFrequency : rebuild ; results in $SCRATCHDIR/IGCM_OUT/XXX/Output, next to $SCRATCHDIR/IGCM_OUT/XXX/Restart and Debug
• Post (curie), every PackFrequency : pack_restart, pack_debug and pack_output (ncrcat, tar) ; results in $CCCSTOREDIR/IGCM_OUT/.../RESTART, DEBUG and $CCCSTOREDIR/IGCM_OUT/XXX/Output
• Post (curie), every TimeSeriesFrequency : create_ts, then monitoring
• Post (curie), every SeasonalFrequency : create_se, then atlas
• TS and SE : $CCCSTOREDIR/IGCM_OUT/… and dods/store
• MONITORING and ATLAS : $CCCWORKDIR and dods/work
• DodsCopy=TRUE/FALSE
IDRIS computers and file system in a nutshell (October 2013)
Computers
• login : turing front-end, adapp front-end
• compute : turing compute, adapp compute, ada compute
File system
• $HOME : small precious files ; sources, small results
• $WORKDIR : temporary REBUILD ; IGCM_OUT : files to be packed, outputs of post-proc jobs
• $TMPDIR
• gaya : IGCM_OUT : Output, Analyse ; robotic tapes ; mfput/mfget, dmput/dmget
• dods : MONITORING/ATLAS, copied with dods_cp
Legend : temporary space, non saved space, saved space, space on tapes, visible from www
Job chain at IDRIS (diagram)
• Compute (ada) : Job_EXP00, advancing by PeriodLength ; files to rebuild go to $WORKDIR/IGCM_OUT/.../REBUILD
• Post (adapp), every RebuildFrequency : rebuild ; results in $WORKDIR/IGCM_OUT/XXX/Output, next to $WORKDIR/IGCM_OUT/XXX/Restart and Debug
• Post (adapp), every PackFrequency : pack_restart, pack_debug and pack_output (ncrcat, tar) ; results in gaya:IGCM_OUT/.../RESTART, DEBUG and gaya:IGCM_OUT/XXX/Output
• Post (adapp), every TimeSeriesFrequency : create_ts, then monitoring
• Post (adapp), every SeasonalFrequency : create_se, then atlas
• Results : gaya:IGCM_OUT/… and dods.idris.fr
• DodsCopy=TRUE/FALSE
Time Series : create_ts.job
• A Time Series is a file which contains a single variable over the whole simulation period (ChunckJob2D = NONE) or over a shorter period for 2D (ChunckJob2D = 100Y) or 3D (ChunckJob3D = 50Y) variables.
• The write frequency is defined in the config.card file : TimeSeriesFrequency=10Y means that the time series are written every 10 years, for 10-year periods.
• The Time Series are set in the COMP/*.card files with the TimeSeriesVars2D and TimeSeriesVars3D options.
• The Time Series built from monthly (or daily) output files are stored on the file server in the IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName/Component/Analyse/TS_MO and TS_DA directories.
• Bonus : TS_MO_YE (annual mean time series) are produced for all TS_MO variables.
• You can add variables to, or remove them from, the TimeSeries lists according to your needs.
config.card :
    [Post]
    ...
    #D- If you want to produce time series, this flag determines
    #D- frequency of post-processing submission (NONE if you don't want)
    TimeSeriesFrequency=10Y
COMP/lmdz.card :
    [OutputFiles]
    List= (histmth.nc, ${R_OUT_ATM_O_M}/${PREFIX}_1M_histmth.nc, Post_1M_histmth),\
    ...
    [Post_1M_histmth]
    Patches= ()
    GatherWithInternal = (lon, lat, presnivs, time_counter, time_counter_bnds, aire)
    TimeSeriesVars2D = (bils, cldh, ...
    ...
    ChunckJob2D = NONE
    TimeSeriesVars3D = (upwd, lwcon, ...
    ...
    ChunckJob3D = OFF
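To picture how TimeSeriesFrequency=10Y slices a run, here is a minimal Python sketch that splits a range of simulated years into consecutive windows. It is only an illustration : the function name ts_windows, the sample years and the handling of a shorter last window are assumptions, not libIGCM code.

```python
def ts_windows(first_year, last_year, frequency_years):
    """Split [first_year, last_year] into consecutive windows of
    `frequency_years` years. The last window may be shorter
    (an assumption of this sketch, not a documented libIGCM rule)."""
    if frequency_years <= 0:
        raise ValueError("frequency_years must be positive")
    windows = []
    start = first_year
    while start <= last_year:
        end = min(start + frequency_years - 1, last_year)
        windows.append((start, end))
        start = end + 1
    return windows


if __name__ == "__main__":
    # Hypothetical 25-year run with a 10-year frequency
    for begin, end in ts_windows(2000, 2024, 10):
        print(f"window {begin}-{end}")
```

With a 10-year frequency, a run over 1850-2005 would be cut into 16 windows under these assumptions, the last one covering 2000-2005 only.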
Intermonitoring : http://webservices.ipsl.jussieu.fr/monitoring/
How to add a new variable in MONITORING
• You can add or change the variables to be monitored by editing the monitoring configuration files. Default files are provided for each component.
• The monitoring is defined here : ~compte_commun/atlas
  For example for LMDZ on curie : ~p86ipsl/monitoring01_lmdz_LMD9695.cfg
  For example for LMDZ on adapp : ~rpsl035/monitoring01_lmdz_LMD9695.cfg
• You can change the monitoring by creating a POST directory inside your configuration. Copy a .cfg file into it and change it the way you want.
• The operations are written in the ferret language.
• You can monitor any variable produced as a time series and stored in TS_MO.
POST/monitoring01_lmdz_LMD9695.cfg :
    #--------------------------------------------------------------------------------------------------------
    # field | files patterns | files additionnal | operations | title | units | calcul of area
    #--------------------------------------------------------------------------------------------------------
    nettop_global | "tops topl" | LMDZ4.0_9695_grid.nc | "(tops[d=1]-topl[d=2])" | "TOA. total heat flux (GLOBAL)" | "W/m^2" | "aire[d=3]"
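The layout of a monitoring line can be checked with a short Python sketch that splits the seven pipe-separated columns shown in the header. The function name and the dictionary keys are hypothetical ; only the column order comes from the .cfg header above.

```python
def parse_monitoring_line(line):
    """Split one pipe-separated monitoring .cfg line into named fields.
    The key names are illustrative, chosen after the header columns."""
    names = ("field", "files_patterns", "files_additional",
             "operations", "title", "units", "area")
    parts = [part.strip().strip('"') for part in line.split("|")]
    if len(parts) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(parts)}")
    entry = dict(zip(names, parts))
    # "tops topl" holds one pattern per variable, separated by spaces
    entry["files_patterns"] = entry["files_patterns"].split()
    return entry


LINE = ('nettop_global | "tops topl" | LMDZ4.0_9695_grid.nc | '
        '"(tops[d=1]-topl[d=2])" | "TOA. total heat flux (GLOBAL)" | '
        '"W/m^2" | "aire[d=3]"')

if __name__ == "__main__":
    for key, value in parse_monitoring_line(LINE).items():
        print(f"{key:17s}: {value}")
```

In ferret, [d=n] selects the n-th opened dataset, so in this line d=1 and d=2 presumably point to the tops and topl time series and d=3 to the additional grid file.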
Seasonal mean : create_se.job
• Seasonal mean files (SE) contain averages for each month of the year (jan, feb, ...) over a period defined in the config.card file.
• SeasonalFrequency=10Y : the seasonal means are computed every 10 years.
• SeasonalFrequencyOffset=0 : the number of years to be skipped before calculating seasonal means.
• All files for which the post-processing is requested (Seasonal=ON in COMP/*.card) are averaged by the ncra script before being stored in the directory IGCM_OUT/IPSLCM5A/DEVT/pdControl/MyExp/ATM/Analyse/SE. There is one file per SeasonalFrequency=10Y.
• ATLAS are launched by create_se. ATLAS sources are : ~rpsl035 ~p86ipsl/atlas
config.card :
    #========================================================================
    #D-- Post -
    [Post]
    ...
    #D- If you want to produce seasonal average, this flag determines
    #D- the period of this average (NONE if you don't want)
    SeasonalFrequency=10Y
    #D- Offset for seasonal average first start dates ; same unit as SeasonalFrequency
    #D- Useful if you do not want to consider the first X simulation's years
    SeasonalFrequencyOffset=0
COMP/lmdz.card :
    [OutputFiles]
    List=(histmth.nc, ${R_OUT_ATM_O_M}/${PREFIX}_1M_histmth.nc, Post_1M_histmth),\
    ...
    [Post_1M_histmth]
    ...
    Seasonal=ON
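What "averages for each month of the year" means can be shown on a plain list of monthly values. This Python sketch is a conceptual illustration only : the real job works on NetCDF files with ncra, and the function name and the offset_years argument (mimicking SeasonalFrequencyOffset) are assumptions.

```python
def seasonal_means(monthly_values, offset_years=0):
    """Return 12 means, one per calendar month, from a series of monthly
    values starting in January and covering whole years.
    `offset_years` first years are skipped before averaging."""
    kept = monthly_values[12 * offset_years:]
    if not kept or len(kept) % 12 != 0:
        raise ValueError("need a whole, non-zero number of years")
    n_years = len(kept) // 12
    return [
        sum(kept[month + 12 * year] for year in range(n_years)) / n_years
        for month in range(12)
    ]


if __name__ == "__main__":
    # Two hypothetical years of monthly values : 0..11 then 12..23
    print(seasonal_means(list(range(24))))
```

Each of the 12 output values mixes the same calendar month of every kept year, which is why a single SE file summarises a whole SeasonalFrequency period.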
Monitoring the simulation Verification and Correction
Monitoring a simulation
• We strongly encourage you to check your simulation frequently while it runs. First of all, check the job status : ccc_mstat (TGCC) or llq (IDRIS)
• Real time limit exceeded : on ada, jobs are killed without any message
• RunChecker.job : this tool, provided with libIGCM, lets you find out the status of your simulations.
• One historical simulation of 156 years (1850-2005) is composed of 50 computing jobs and 1000 post-processing jobs
Documentation : http://forge.ipsl.jussieu.fr/igcmg/wiki/platform/en/documentation
Monitoring a simulation : mail
• You receive a message at the end of the simulation.
• The simulation may have completed or failed.
Mail :
    From : rpsl003@idris.fr
    Subject : COURSNIV2 completed
    Date : 22 October 2013 18:29:24 UTC+02:00
    To : rpsl003@idris.fr
    Dear rpsl003,
    Simulation COURSNIV2 completed on supercomputer ada027
    Simulation started : 20000101
    Simulation ended : 20000102
    Output files are available in /u/rech/psl/rpsl003/IGCM_OUT/IPSLCM5A/DEVT/pdControl/COURSNIV2
    Files to be rebuild are temporarily available in /workgpfs/rech/psl/rpsl003/IGCM_OUT/IPSLCM5A/DEVT/pdControl/COURSNIV2/REBUILD
    Pre-packed files are temporarily available in /workgpfs/rech/psl/rpsl003/IGCM_OUT/IPSLCM5A/DEVT/pdControl/COURSNIV2
    Script files, Script Outputs and Debug files (if necessary) are available in /gpfs5r/workgpfs/rech/psl/rpsl003/ADA/COURS/NIV2/IPSLCM5_v5/modipsl/config/IPSLCM5_v5/COURSNIV2
    Greetings!
    Check this out for more information : https://forge.ipsl.jussieu.fr/igcmg/wiki/platform/documentation
Mail (begin forwarded message) :
    From : rpsl003@idris.fr
    Subject : MyJobTest failed
    Date : 22 October 2013 17:17:41 UTC+02:00
    To : rpsl003@idris.fr
    Dear rpsl003,
Monitoring a simulation : run.card
• When the simulation starts, the file run.card is created by libIGCM from the template run.card.init.
• run.card contains information on the current run period and on the previous periods already finished.
• This file is updated by libIGCM at each run period.
• It also gives the time consumption of each period.
• The status of the job is set to OnQueue, Running, Completed or Fatal.
run.card :
    [Configuration]
    #last PREFIX
    OldPrefix= COURSNIV2_20000103
    #Compute date of loop
    PeriodDateBegin= 2000-01-04
    PeriodDateEnd= 2000-01-04
    CumulPeriod= 4
    # State of Job "Start", "Running", "OnQueue", "Completed"
    PeriodState= Completed
    SubmitPath= /gpfs5r/workgpfs/rech/psl/rpsl003/ADA/COURS/NIV2/IPSLCM5_v5/modipsl/config/IPSLCM5_v5/COURSNIV2
    #========================================================================
    [PostProcessing]
    TimeSeriesRunning=n
    TimeSeriesCompleted=
    #========================================================================
    [Log]
    # Executables Size
    LastExeSize= ( 88011086, 0, 0, 19956686, 0, 0, 1523952 )
    #-----------------------------------------------------------------------------------------------------------------------------------
    # CumulPeriod | PeriodDateBegin | PeriodDateEnd | RunDateBegin | RunDateEnd | RealCpuTime | UserCpuTime |
    #-----------------------------------------------------------------------------------------------------------------------------------
    # 1 | 20000101 | 20000101 | 2013-10-22T17:53:48 | 2013-10-22T17:55:10 | 82.01000 | 4.21000 |
    # 2 | 20000102 | 20000102 | 2013-10-22T18:28:03 | 2013-10-22T18:29:17 | 74.19000 | 4.09000 |
    # 3 | 20000103 | 20000103 | 2013-10-23T17:28:50 | 2013-10-23T17:30:26 | 95.21000 | 4.30000 |
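Since run.card follows an INI-like layout, its status and timing log can be read with a few lines of Python. This is a sketch, not a libIGCM tool : the function name read_run_card is hypothetical, and RUN_CARD is a shortened copy of the example above.

```python
import configparser
import re

RUN_CARD = """\
[Configuration]
OldPrefix= COURSNIV2_20000103
PeriodDateBegin= 2000-01-04
PeriodDateEnd= 2000-01-04
CumulPeriod= 4
PeriodState= Completed
[PostProcessing]
TimeSeriesRunning=n
TimeSeriesCompleted=
[Log]
# CumulPeriod | PeriodDateBegin | PeriodDateEnd | RunDateBegin | RunDateEnd | RealCpuTime | UserCpuTime |
# 1 | 20000101 | 20000101 | 2013-10-22T17:53:48 | 2013-10-22T17:55:10 | 82.01000 | 4.21000 |
# 2 | 20000102 | 20000102 | 2013-10-22T18:28:03 | 2013-10-22T18:29:17 | 74.19000 | 4.09000 |
# 3 | 20000103 | 20000103 | 2013-10-23T17:28:50 | 2013-10-23T17:30:26 | 95.21000 | 4.30000 |
"""

# Log rows start with '#', then the period number, then '|'-separated columns
LOG_ROW = re.compile(r"^#\s*(\d+)\s*\|(.*)\|\s*$")


def read_run_card(text):
    """Return (PeriodState, list of per-period log entries)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    state = parser["Configuration"]["PeriodState"]
    log = []
    # configparser skips '#' lines as comments, so the log is parsed by hand
    for line in text.splitlines():
        match = LOG_ROW.match(line.strip())
        if match:
            cols = [col.strip() for col in match.group(2).split("|")]
            log.append({
                "period": int(match.group(1)),
                "date_begin": cols[0],
                "date_end": cols[1],
                "real_cpu_time": float(cols[4]),
                "user_cpu_time": float(cols[5]),
            })
    return state, log


if __name__ == "__main__":
    state, log = read_run_card(RUN_CARD)
    print("PeriodState :", state)
    print("RealCpuTime total :", sum(row["real_cpu_time"] for row in log))
```

Parsing the log by hand is needed because the table rows are commented out in the file ; a tool relying on the INI sections alone would only see PeriodState and the other keys.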
Verification and correction 1/6
• Where did the problem occur ?
• 1 "failed" email : main computation job
  => Did gaya stop at IDRIS ? Hardware problem ? Check Script_output_xxxx.
  => Once gaya has restarted, or if there is no clear error message, try relaunching (after a clean_month) :
    path/to/libIGCM/clean_month.job
    ccc_msub (llsubmit) Job_...
Verification and correction 2/6
• Where did the problem occur ?
• 1 "failed" email : main computation job : analyse Script_output_xxxx, which is split into three parts by banners :
    #######################################
    #      ANOTHER GREAT SIMULATION       #
    #######################################
    Part 1 (copying the input files)
    #######################################
    #      DIR BEFORE RUN EXECUTION       #
    #######################################
    Part 2 (running the model)
    #######################################
    #      DIR AFTER RUN EXECUTION        #
    #######################################
    Part 3 (post-processing)
    #######################################
http://forge.ipsl.jussieu.fr/igcmg/wiki/platform/en/documentation/G_suivi#AnalyzingtheJoboutput:Script_Output
Verification and correction 3/6
--> analyse Script_output_xxxx : in general, if your simulation stops, you can look for the keyword "IGCM_debug_CallStack" in this file. This keyword comes after a line explaining the error you are experiencing.
    =====================================================================
    EXECUTION of : mpirun -f ./run_file > out_run_file 2>&1
    Return code of executable : 1
    IGCM_debug_Exit : EXECUTABLE
    !!!!!!!!!!!!!!!!!!!!!!!!!!
    !! IGCM_debug_CallStack !!
    !------------------------!
    !------------------------!
    IGCM_sys_Cp : out_run_file xxxxxxxxxxxx_out_run_file_error
    =====================================================================
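The keyword search described above can be scripted. The Python sketch below returns the few lines that precede each occurrence of IGCM_debug_CallStack, where the error explanation is expected ; the function name and the default of 3 context lines are assumptions, and SAMPLE reuses the extract shown on this slide.

```python
SAMPLE = """\
EXECUTION of : mpirun -f ./run_file > out_run_file 2>&1
Return code of executable : 1
IGCM_debug_Exit : EXECUTABLE
!!!!!!!!!!!!!!!!!!!!!!!!!!
!! IGCM_debug_CallStack !!
!------------------------!
"""


def callstack_context(lines, keyword="IGCM_debug_CallStack", before=3):
    """For each line containing `keyword`, return the `before` lines
    that precede it (fewer if the keyword is near the top)."""
    hits = []
    for index, line in enumerate(lines):
        if keyword in line:
            hits.append(lines[max(0, index - before):index])
    return hits


if __name__ == "__main__":
    for context in callstack_context(SAMPLE.splitlines()):
        print("\n".join(context))
        print("---")
```

On a real Script_output_xxxx file, the lines would come from reading the file with splitlines() instead of the SAMPLE string.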
Verification and correction 4/6
--> Check closely the sub-directory Debug (if it exists)
Check the file xxxxx_error in Debug/ :
• it contains the LMDZ standard output. LMDZ often fails in hgardfou, with the message : Stopping in hgardfou
• it contains the abends (abnormal termination / exception) of each and every component.
Check the standard outputs of NEMO, ORCHIDEE, INCA, OASIS :
• Debug/xxxx_ocean.output
• Debug/xxxx_output_orchidee
• Debug/xxxx_inca.out
• Debug/xxxx_cplout
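When the Debug directory holds many files, a small lookup helps to decide which component a file belongs to. The Python sketch below assumes that the four outputs are listed in the same order as NEMO, ORCHIDEE, INCA and OASIS above ; the function name and the sample file name are hypothetical.

```python
# Suffix -> origin, following the file names listed on this slide.
# The component order is assumed to match the order of the bullet list.
DEBUG_FILE_SUFFIXES = (
    ("_ocean.output", "NEMO"),
    ("_output_orchidee", "ORCHIDEE"),
    ("_inca.out", "INCA"),
    ("_cplout", "OASIS"),
    ("_error", "LMDZ standard output and abends of all components"),
)


def describe_debug_file(filename):
    """Return the origin of a Debug/ file, or None if the name is unknown."""
    for suffix, origin in DEBUG_FILE_SUFFIXES:
        if filename.endswith(suffix):
            return origin
    return None


if __name__ == "__main__":
    # Hypothetical file name, built on the xxxx_ocean.output pattern
    print(describe_debug_file("MYEXP_20000101_ocean.output"))
```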
RunChecker.job • RunChecker.job helps you to monitor all the jobs produced by libIGCM for a simulation
RunChecker.job : usage and options
This script can be launched from anywhere.
Usage :
    path/to/libIGCM/RunChecker.job [-u user] [-q] [-j n] [-s] job_name
    path/to/libIGCM/RunChecker.job [-u user] [-q] [-j n] -p config.card_path
    path/to/libIGCM/RunChecker.job [-u user] [-q] [-j n] -r
Options :
    -h : print this help and exit
    -u user : owner of the job
    -q : quiet
    -j n : print n post-processing jobs (default is 20)
    -s : search for a new job in $WORKDIR and fill in the catalog before printing information
    -p path : give the absolute path to the directory containing the config.card instead of the job name (needed only once)
    -r : check all running simulations
Examples :
    1) path/to/libIGCM/RunChecker.job -p $CCCWORKDIR/CURIE/CMIP5/R1414/IPSLCM5A_20120731/modipsl/config/IPSLCM5A/v5.rcp45CMR2
    2) path/to/libIGCM/RunChecker.job v5.rcp45CMR2
Verification and correction 5/6
• You have received 2 "failed" emails, or the RunChecker status is abnormal (i.e. red)
• Analyse the situation :
  • Simple case :
    • Re-submit rebuild, pack_debug or pack_restart jobs
    • Re-submit pack_output
  • Less simple case :
    • Use clean_year to go back to a healthy situation
    • Holes in the data : path/to/libIGCM/clean_year.job [SSAA]
    • All data from the current year back to SSAA (included) will be deleted.
    • Restart the simulation
https://forge.ipsl.jussieu.fr/igcmg/wiki/platform/en/documentation/G_suivi#Startorrestartpostprocessingjobs1
TimeSeries_Checker.job
• Install a dedicated directory
• Copy the required files and directories : config.card, run.card, COMP, POST
• Copy the script TimeSeries_Checker.job from libIGCM
• Modify the job : libIGCM, name of the simulation, ...
• Look at the documentation : https://forge.ipsl.jussieu.fr/igcmg/wiki/platform/en/documentation/G_suivi#TimeSeries_checker.job-Recommendedmethod
Commands (ksh) :
    > mkdir POST_REDO
    > cd POST_REDO
    > cp -pr COMP POST config.card run.card .
    > cp ../../../../libIGCM/TimeSeries_Checker.job .
    > vi TimeSeries_Checker.job
      # Check/Modify :
      libIGCM=
      SpaceName=
      ExperimentName=
      JobName=
      CARD_DIR=
      BRIDGE_MSUB_PROJECT=gen2211
    > ./TimeSeries_Checker.job
      Answer y to submit create_ts.job
    > ./TimeSeries_Checker.job 2>&1|tee TSC_OUT_TO_KEEP
Verification and correction 6/6
• Everything went ok :
  • End of simulation email
  • No anomaly detected by RunChecker
  • TimeSeries_Checker (and SE_checker) : checks the existing time series and submits create_ts jobs to build the missing ones
• Keep in mind :
  • Rebuild jobs automatically submit pack jobs, as well as the corresponding TS and SE jobs.
  • Pack, TS and SE jobs may be re-submitted independently of a rebuild job.
The END! (so soon?)
Mailing lists to ask for help and to share information with other users :
• champagne-users@ipsl.jussieu.fr
• platform-users@ipsl.jussieu.fr