Upgrade D0 farm
Reasons for upgrade
• RedHat 7 needed for D0 software
• New versions of
  • ups/upd v4_6
  • fbsng v1_3f+p2_1
  • sam
• Use of farm for MC and analysis
• Integration in farm network
MC production on farm
• Input: requests
• Each request is translated into an mc_runjob macro
• Stages:
  • mc_runjob on batch server (hoeve)
  • MC job on node
  • SAM store on file server (schuur)
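[Diagram: MC production data flow, with sam as the third section of the fbs job. fbs job sections: 1 mcc, 2 rcp, 3 sam, submitted as fbs(mcc) and fbs(rcp,sam). Labels: mcc request, farm server, file server, SAM DB, datastore (FNAL, SARA), node, mcc input, mcc output, 1.2 TB, 40 GB, 100 CPUs. Legend: control, data, metadata.]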
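[Diagram: MC production data flow, with sam run from cron. fbs job sections: 1 mcc, 2 rcp, submitted as fbs(mcc) and fbs(rcp[,sam]). Labels: cron: sam, mcc request, farm server, file server, SAM DB, datastore (FNAL, SARA), node, mcc input, mcc output, 1.2 TB, 40 GB, 100 CPUs. Legend: control, data, metadata.]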
[Diagram: accounts and steps on hoeve, node and schuur. Labels: fbsuser: mc_runjob, cron, fbs submit, fbsuser: cp, fbsuser: mcc, fbs submit, fbsuser: rcp, willem: sam. Legend: data, control.]
SECTION mcc
    EXEC=/d0gstar/curr/minbias-02073214824/batch
    NUMPROC=1
    QUEUE=FastQ
    STDOUT=/d0gstar/curr/minbias-02073214824/stdout
    STDERR=/d0gstar/curr/minbias-02073214824/stdout
SECTION rcp
    EXEC=/d0gstar/curr/minbias-02073214824/batch_rcp
    NUMPROC=1
    QUEUE=IOQ
    DEPEND=done(mcc)
    STDOUT=/d0gstar/curr/minbias-02073214824/stdout_rcp
    STDERR=/d0gstar/curr/minbias-02073214824/stdout_rcp
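A job description like the one above could be generated per job directory. The following is a minimal sketch, assuming only the KEY=value layout visible on the slide; `render_jdf` and `mc_sections` are hypothetical helper names, not part of fbsng or mc_runjob.

```python
def render_jdf(sections):
    """Render job description text from a list of section dicts.

    Each dict needs a 'name' key; all other keys are written as KEY=value
    lines in insertion order, mirroring the layout shown on the slide.
    """
    lines = []
    for section in sections:
        lines.append("SECTION " + section["name"])
        for key, value in section.items():
            if key != "name":
                lines.append("%s=%s" % (key, value))
    return "\n".join(lines) + "\n"


def mc_sections(job_dir):
    """Build the mcc and rcp sections for one MC job directory (hypothetical helper)."""
    return [
        {"name": "mcc", "EXEC": job_dir + "/batch", "NUMPROC": 1,
         "QUEUE": "FastQ", "STDOUT": job_dir + "/stdout",
         "STDERR": job_dir + "/stdout"},
        # The rcp section only starts once the mcc section is done.
        {"name": "rcp", "EXEC": job_dir + "/batch_rcp", "NUMPROC": 1,
         "QUEUE": "IOQ", "DEPEND": "done(mcc)",
         "STDOUT": job_dir + "/stdout_rcp",
         "STDERR": job_dir + "/stdout_rcp"},
    ]


if __name__ == "__main__":
    print(render_jdf(mc_sections("/d0gstar/curr/minbias-02073214824")))
```

The DEPEND=done(mcc) line is what keeps the copy step from running before the MC job has finished.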
batch (runs on node)
    #!/bin/sh
    . /usr/products/etc/setups.sh
    cd /d0gstar/mcc/mcc-dist
    . mcc_dist_setup.sh
    mkdir -p /data/curr/minbias-02073214824
    cd /data/curr/minbias-02073214824
    cp -r /d0gstar/curr/minbias-02073214824/* .
    touch /d0gstar/curr/minbias-02073214824/.`uname -n`
    sh minbias-02073214824.sh `pwd` > log
    touch /d0gstar/curr/minbias-02073214824/`uname -n`
    /d0gstar/bin/check minbias-02073214824

batch_rcp (runs on schuur)
    #!/bin/sh
    i=minbias-02073214824
    if [ -f /d0gstar/curr/$i/OK ];then
      mkdir -p /data/disk2/sam_cache/$i
      cd /data/disk2/sam_cache/$i
      node=`ls /d0gstar/curr/$i/node*`
      node=`basename $node`
      job=`echo $i | awk '{print substr($0,length-8,9)}'`
      rcp -pr $node:/data/dest/d0reco/reco*${job}* .
      rcp -pr $node:/data/dest/reco_analyze/rAtpl*${job}* .
      rcp -pr $node:/data/curr/$i/Metadata/*.params .
      rcp -pr $node:/data/curr/$i/Metadata/*.py .
      rsh -n $node rm -rf /data/curr/$i
      rsh -n $node rm -rf /data/dest/*/*${job}*
      touch /d0gstar/curr/$i/RCP
    fi
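The awk expression in batch_rcp takes the last nine characters of the job name and uses them to match the output files. A short Python sketch of the same slicing, for illustration only; `job_suffix` and `output_patterns` are hypothetical names.

```python
def job_suffix(job_name, width=9):
    """Return the last `width` characters of a job name.

    For names of at least `width` characters this matches awk's
    substr($0, length-8, 9), since awk strings are 1-indexed.
    """
    return job_name[-width:]


def output_patterns(job_name):
    """Glob patterns used to pick up this job's output on the node."""
    suffix = job_suffix(job_name)
    return [
        "/data/dest/d0reco/reco*%s*" % suffix,
        "/data/dest/reco_analyze/rAtpl*%s*" % suffix,
    ]


if __name__ == "__main__":
    print(job_suffix("minbias-02073214824"))  # prints 073214824
```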
Runs on schuur, called by fbs or cron
    #!/bin/sh
    locate(){
      file=`grep "import =" import_${1}_${job}.py | awk -F \" '{print $2}'`
      sam locate $file | fgrep -q [
      return $?
    }
    . /usr/products/etc/setups.sh
    setup sam
    SAM_STATION=hoeve
    export SAM_STATION
    tosam=$1
    LIST=`cat $tosam`
    for job in $LIST
    do
      cd /data/disk2/sam_cache/${job}
      # declare gen, d0g, sim
      list='gen d0g sim'
      for i in $list
      do
        until locate $i || (sam declare import_${i}_${job}.py && locate ${i})
        do sleep 60; done
      done
      # store reco, recoanalyze
      list='reco recoanalyze'
      for i in $list
      do
        sam store --descrip=import_${i}_${job}.py --source=`pwd`
        return=$?
        echo Return code sam store $return
      done
    done
    echo Job finished ...
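The until loop in the script retries the declaration every 60 seconds until the file can be located. A minimal Python sketch of that control flow, assuming the locate and declare steps are passed in as callables; `ensure_declared` and the `max_tries` limit are illustrative additions and are not in the original script, which retries forever.

```python
import time


def ensure_declared(tier, locate, declare, delay=60, sleep=time.sleep,
                    max_tries=None):
    """Loop until `tier` can be located, declaring it when it is not yet known.

    Mirrors: until locate X || (declare X && locate X); do sleep 60; done
    Returns the number of attempts that were needed.
    """
    tries = 0
    while True:
        tries += 1
        # Short-circuit: declare is only attempted when locate fails.
        if locate(tier) or (declare(tier) and locate(tier)):
            return tries
        if max_tries is not None and tries >= max_tries:
            raise RuntimeError("could not declare %s after %d tries"
                               % (tier, tries))
        sleep(delay)
```

Passing `sleep` as a parameter keeps the loop testable without waiting a minute per attempt.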
Filestream
• Fetch input from sam
• Read input file from schuur
• Process data on node
• Copy output to schuur
[Diagram: filestream job flow across hoeve, node and schuur. Labels: mc_runjob, cron, fbs submit, attach filestream, rcp, d0exe, rcp, fbs submit, sam. Legend: data, control.]
Analysis on farm
• Stages:
  • Read files from sam
  • Copy files to node(s)
  • Perform analysis on node
  • Copy files to file server
  • Store files in sam
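[Diagram: analysis on the farm, first variant. fbs sections as listed on the slide: sam + rcp, analyze, rcp + sam. fbs(1) and fbs(3) are labelled together, fbs(2) separately. Labels: farm server, file server, SAM DB, datastore (FNAL, SARA), node, 1.2 TB, 40 GB, 100 CPUs. Legend: control (fbs), data, metadata.]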
[Diagram: accounts and steps on triviaal and node-2. Labels: willem: sam, input, fbsuser: rcp, fbsuser: analysis program, output, fbsuser: rcp, willem: sam.]
batch.jdf
    SECTION sam
    EXEC=/home/willem/batch_sam
    NUMPROC=1
    QUEUE=IOQ
    STDOUT=/home/willem/stdout
    STDERR=/home/willem/stdout

batch_sam
    #!/bin/sh
    . /usr/products/etc/setups.sh
    setup sam
    SAM_STATION=triviaal
    export SAM_STATION
    sam run project get_file.py --interactive > log
    /usr/bin/rsh -n -l fbsuser triviaal rcp -r /stage/triviaal/sam_cache/boo node-2:/data/test >> log
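[Diagram: analysis on the farm, second variant. fbs sections as listed on the slide: sam, rcp + analyze + rcp, rcp + sam. fbs(1) and fbs(3) are labelled together, fbs(2) separately. Labels: farm server, file server, SAM DB, datastore (FNAL, SARA), node, 1.2 TB, 40 GB, 100 CPUs. Legend: control (fbs), data, metadata.]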
[Diagram: accounts and steps on triviaal and node-2. Labels: willem: sam, fbsuser: fbs submit, fbsuser: rcp, analysis program, rcp, input, output, willem: sam.]
Submit command
    rsh -l fbsuser triviaal fbs submit ~willem/batch_node.jdf

batch_node.jdf
    SECTION sam
    EXEC=/d0gstar/batch_node
    NUMPROC=1
    QUEUE=FastQ
    STDOUT=/d0gstar/stdout
    STDERR=/d0gstar/stdout

batch_node
    #!/bin/sh
    uname -a
    date
Job descriptions
    SECTION ana
    EXEC=/d0gstar/batch_node
    NUMPROC=1
    QUEUE=FastQ
    STDOUT=/d0gstar/stdout
    STDERR=/d0gstar/stdout

    SECTION sam
    EXEC=/home/willem/batch
    NUMPROC=1
    QUEUE=IOQ
    STDOUT=/home/willem/stdout
    STDERR=/home/willem/stdout

/home/willem/batch
    #!/bin/sh
    . /usr/products/etc/setups.sh
    setup fbsng
    setup sam
    SAM_STATION=triviaal
    export SAM_STATION
    sam run project get_file.py --interactive > log
    /usr/bin/rsh -n -l fbsuser triviaal fbs submit /home/willem/batch_node.jdf

/d0gstar/batch_node
    #!/bin/sh
    rcp -pr server:/stage/triviaal/sam_cache/boo /data/test
    . /d0/fnal/ups/etc/setups.sh
    setup root -q KCC_4_0:exception:opt:thread
    setup kailib
    root -b -q /d0gstar/test.C

/d0gstar/test.C
    {
      gSystem->cd("/data/test/boo");
      gSystem->Exec("pwd");
      gSystem->Exec("ls -l");
    }
get_file.py
    #
    # This file sets up and runs a SAM project.
    #
    import os, sys, string, time, signal
    from re import *
    from globals import *
    import run_project
    from commands import *
    #########################################
    #
    # Set the following variables to appropriate values
    # Consult database for valid choices
    sam_station = "triviaal"
    # Consult Database for valid choices
    project_definition = "op_moriond_p1014"
    # A particular snapshot version, last or new
    snapshot_version = 'new'
    # Consult database for valid choices
    appname = "test"
    version = "1"
    group = "test"
    # The maximum number of files to get from sam
    max_file_amt = 5
    # for additional debug info use "--verbose"
    #verbosity = "--verbose"
    verbosity = ""
    # Give up on all exceptions
    give_up = 1
    def file_ready(filename):
        # Replace this python subroutine with whatever
        # you want to do
        # to process the file that was retrieved.
        # This function will only be called in the event of
        # a successful delivery.
        print "File ",filename," has been delivered!"
        # os.system('cp '+filename+' /stage/triviaal/sam')
        return
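The file_ready hook is where per-file processing goes. As an illustration, here is a Python 3 sketch of a hook that copies each delivered file to a staging directory, in the spirit of the commented-out cp line. `make_file_ready` is a hypothetical factory, the staging path is whatever the caller passes in, and returning the target path is an addition for convenience; the original hook returns nothing.

```python
import os
import shutil


def make_file_ready(stage_dir):
    """Return a file_ready callback that copies each delivered file to stage_dir."""
    def file_ready(filename):
        # Create the staging directory on first use.
        os.makedirs(stage_dir, exist_ok=True)
        # shutil.copy returns the path of the newly created copy.
        target = shutil.copy(filename, stage_dir)
        print("File", filename, "has been delivered and copied to", target)
        return target
    return file_ready
```

Usage would be along the lines of `file_ready = make_file_ready("/stage/triviaal/sam")`, keeping the callback name that the SAM project script expects.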
Disk partitioning hoeve
[Directory tree diagram. Labels: /d0, /fnal, /d0usr, /d0dist, /ups, /db, /etc, /prd, /fbsng, /mcc, /mcc-dist, /mc_runjob, /curr]
Symbolic links:
    /fnal -> /d0/fnal
    /d0usr -> /fnal/d0usr
    /d0dist -> /fnal/d0dist
    /usr/products -> /fnal/ups
ana_runjob
• Is analogous to mc_runjob
• Creates and submits analysis jobs
• Input
  • get_file.py with SAM project name
  • Project defines files to be processed
  • analysis script
Integration with grid (1)
• At present separate clusters:
  • D0, LHCb, Alice, DAS cluster
• hoeve and schuur in farm network
Present network layout
[Network diagram. Labels: hefnet, surfnet, router, switch, ajax, hoeve, schuur, node (three shown), NFS.]
New network layout
[Network diagram. Labels: hefnet, lambda, farmrouter, ajax, booder, switch (three shown), hoeve, schuur, LHCb, D0, alice, NFS.]
New network layout (with das-2)
[Network diagram. Labels: hefnet, lambda, farmrouter, das-2, ajax, booder, switch (three shown), hoeve, schuur, LHCb, D0, alice, NFS.]
Server tasks
• hoeve
  • software server
  • farm server
• schuur
  • fileserver
  • sam node
• booder
  • home directory server
  • in backup scheme
Integration with grid (2)
• Replace fbs with pbs or condor
  • pbs on Alice and LHCb nodes
  • condor on das cluster
• Use EDG installation tool LCFG
  • Install d0 software with rpm
  • Problem with sam (uses ups/upd)
Integration with grid (3)
• Package mcc in rpm
• Separate programs from working space
• Use cfg commands to steer mc_runjob
• Find better place for card files
• Input structure now created on node
Grid job

submit
    #!/bin/sh
    macro=$1
    pwd=`pwd`
    cd /opt/fnal/d0/mcc/mcc-dist
    . mcc_dist_setup.sh
    cd $pwd
    dir=/opt/fnal/d0/mcc/mc_runjob/py_script
    python $dir/Linker.py script=$macro

PBS job
    [willem@tbn09 willem]$ cat test.pbs
    # PBS batch job script
    #PBS -o /home/willem/out
    #PBS -e /home/willem/err
    #PBS -l nodes=1
    # Changing to directory as requested by user
    cd /home/willem
    # Executing job as requested by user
    ./submit minbias.macro
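A PBS script of this shape could be produced per macro. The sketch below only reuses the directives that appear in test.pbs (-o, -e, -l nodes=N); `render_pbs` is a hypothetical helper, not part of mc_runjob or PBS.

```python
def render_pbs(workdir, command, stdout, stderr, nodes=1):
    """Render a PBS batch script with the directives used in test.pbs."""
    lines = [
        "# PBS batch job script",
        "#PBS -o " + stdout,
        "#PBS -e " + stderr,
        "#PBS -l nodes=%d" % nodes,
        "# Changing to directory as requested by user",
        "cd " + workdir,
        "# Executing job as requested by user",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_pbs("/home/willem", "./submit minbias.macro",
                     "/home/willem/out", "/home/willem/err"))
```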
RunJob class for grid
    class RunJob_farm(RunJob_batch) :
        def __init__(self,name=None) :
            RunJob_batch.__init__(self,name)
            self.myType="runjob_farm"
        def Run(self) :
            self.jobname = self.linker.CurrentJob()
            self.jobnaam = string.splitfields(self.jobname,'/')[-1]
            comm = 'chmod +x ' + self.jobname
            commands.getoutput(comm)
            if self.tdconf['RunOption'] == 'RunInBackground' :
                RunJob_batch.Run(self)
            else :
                bq = self.tdconf['BatchQueue']
                dirn = os.path.dirname(self.jobname)
                print dirn
                comm = 'cd ' + dirn + '; sh ' + self.jobnaam + ' `pwd` >& stdout'
                print comm
                runcommand(comm)
To be decided
• Location of minimum bias files
• Location of MC output
Job status
• Job status is recorded in
  • fbs
  • /d0/mcc/curr/<job_name>
  • /data/mcc/curr/<job_name>
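The batch and batch_rcp scripts shown earlier leave marker files in the job directory: a hidden file named after the node when the job starts, a file named after the node when it ends, and RCP after the copy. A sketch of reading a job's progress from those markers follows. The state names are invented for illustration, node names are assumed to start with "node" (as the `node*` glob in batch_rcp suggests), and treating OK as the result of the check script is an assumption.

```python
import os


def job_state(job_dir):
    """Classify a job from the marker files found in its directory."""
    names = set(os.listdir(job_dir))
    if "RCP" in names:
        return "copied"      # batch_rcp has copied the output
    if "OK" in names:
        return "checked"     # assumed to be written by the check script
    if any(n.startswith("node") for n in names):
        return "finished"    # batch touched <node name> after the MC job
    if any(n.startswith(".node") for n in names):
        return "running"     # batch touched .<node name> before the MC job
    return "submitted"
```

The job directory is passed in as a parameter because the scripts use /d0gstar/curr while this slide lists /d0/mcc/curr and /data/mcc/curr.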
SAM servers
• On master node:
  • station
  • fss
• On master and worker nodes:
  • stager
  • bbftp