This resource provides a complete guide to using the NPACI Grid and NPACKage for advanced computing. It covers the hardware and software resources at the participating institutions and the benefits of using the NPACI Grid, including simplified job submission and tools for distributed data management. Users will find information on setting up their environment, managing accounts, and authenticating to the grid. Additional tutorial material covers running jobs, managing certificates, and accessing services, ensuring a smooth experience on the grid.
Getting Started With The NPACI Grid & NPACKage
Shannon Whitmore
swhitmor@sdsc.edu
http://npacigrid.npaci.edu
http://npackage.npaci.edu
Overview • Introduction • Getting Started on the NPACI Grid • Tutorial
Defined • NPACI Grid • Hardware, software, network, and data resources at • San Diego Supercomputer Center • Texas Advanced Computing Center • University of Michigan, Ann Arbor • California Institute of Technology – coming soon • NPACKage • An integrated set of grid middleware and advanced NPACI applications
Grid Resources • Blue Horizon (SDSC) • IBM POWER3-based clustered SMP system • 1,152 processors, 576 GB main memory • Longhorn (TACC) • IBM POWER4 system • 224 processors and 512 GB aggregate memory • Hypnos & Morpheus (UMichigan) • AMD-based Linux clusters • Hypnos: 128 nodes; Morpheus: 50 nodes • Each SMP node: two CPUs & one GB memory per processor
Why use the NPACI Grid? • Simplifies job submission • Globus: common scripting language for job submission • Condor-G: launch and monitor jobs from one site • Combines local resources with SC resources • Run small jobs locally, large jobs remotely • Enables portal development • Single point of access for tools • Simplifies complex interfaces for users
Why use the NPACI Grid? (cont’d) • Provides tools for distributed data management and analysis • SRB • DataCutter • Provides single sign-on capabilities
Caveats • Resources are intended for large jobs • Try not to run small jobs on the batch queues • Must plan in advance for large runs • Request machine allocations • Cannot run distributed jobs on batch resources concurrently
Why use NPACKage? • Easier to port applications • Components tested before release • Consulting support available • Consistent packaging • Simplified installation/configuration process • Single web site for all documentation • Install NPACKage on your system today!
Accessing The Grid
http://npacigrid.npaci.edu/user_getting_started.html
Accounts • Need an NPACI account? http://npacigrid.npaci.edu/expedited_account.html • Need an account extension? http://npacigrid.npaci.edu/account_extension_request.html • Username does not start with “ux”? consult@npaci.edu
Login Nodes • SDSC (Blue Horizon) • tf004i.sdsc.edu & tf005i.sdsc.edu (batch) • b80n01.sdsc.edu - b80n13.sdsc.edu (interactive) • TACC • longhorn.tacc.utexas.edu (batch & interactive) • Michigan • hypnos.engin.umich.edu (batch & interactive) • morpheus.engin.umich.edu (batch & interactive)
Set up your environment • Add the following to your shell initialization file on all NPACI Grid hosts • For csh-based shells:
  if ( ! $?NPACI_GRID_CURRENT ) then
      alias . source          # csh has no "." built-in, so alias it to source
      setenv NPACI_GRID_CURRENT /usr/npaci-grid-1.1
      . $NPACI_GRID_CURRENT/setup.csh
  endif
• For Bourne-based shells:
  if [ "x$NPACI_GRID_CURRENT" = "x" ]; then
      export NPACI_GRID_CURRENT=/usr/npaci-grid-1.1
      . $NPACI_GRID_CURRENT/setup.sh
  fi
Certificates • Required to use the NPACI Grid • Used for authentication and encryption • Enables single sign-on capabilities • On cert.npaci.edu • Run /usr/local/apps/pki_apps/cacl • Generates your X.509 certificate • Creates your Distinguished Name (DN) – a globally unique ID identifying you as an individual
Certificates (cont’d) • Copy your .globus directory to all sites • Script: http://npacigrid.npaci.edu/Examples/copycert.sh • Wait for your DN to propagate into each site’s grid-mapfile • Maps your DN to your local username on that host
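If the copycert.sh script is not available, the copy can also be done by hand. A minimal sketch, assuming the login nodes listed on the Login Nodes slide and the illustrative username ux444444 used elsewhere in these slides:
  # Copy your certificate directory to each NPACI Grid login node.
  # Replace ux444444 with your own NPACI username.
  for host in tf004i.sdsc.edu longhorn.tacc.utexas.edu \
              hypnos.engin.umich.edu morpheus.engin.umich.edu
  do
      scp -r ~/.globus ux444444@$host:.
  done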
Verify Grid Access • Connect to any login node • ssh longhorn.tacc.utexas.edu -l <username> • grid-proxy-init • generates a proxy certificate • provides single sign-on capability • proxies are valid for one day • grid-proxy-destroy • Removes the proxy
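Taken together, a typical proxy session might look like the following (a minimal sketch; grid-proxy-info is a standard Globus companion command shown here only for illustration):
  ssh longhorn.tacc.utexas.edu -l ux444444   # connect to any login node
  grid-proxy-init                            # create a proxy; prompts for your certificate passphrase
  grid-proxy-info -timeleft                  # seconds remaining on the current proxy
  grid-proxy-destroy                         # remove the proxy when you are done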
Verify Grid Access (cont’d) • Authenticate your certificate at each site • globusrun -a -r hypnos.engin.umich.edu • globusrun -a -r morpheus.engin.umich.edu • globusrun -a -r longhorn.tacc.utexas.edu • globusrun -a -r tf004i.sdsc.edu • Problems? Contact us: • http://npacigrid.npaci.edu/contacts.html
Tutorial: Clients and Services
http://npacigrid.npaci.edu/tutorial.html
Overview • Running Jobs • Using Globus • Using Condor-G • Transferring Files • Resource and Monitoring Services • MDS/Ganglia • NWS
Gatekeeper • Handles Globus job requests at a remote site • Manages authentication and security • Routes job requests to a jobmanager • Exists on all login nodes
Jobmanager • Manages job launching • Two jobmanagers on each gatekeeper host • Interactive • jobmanager-fork – default • Batch – interface to local schedulers • jobmanager-loadleveler - longhorn & horizon • jobmanager-pbs - hypnos & morpheus
Globus clients Three commands for remote job submission • globus-job-submit • globus-job-run • globusrun
globus-job-submit • Runs in background • Returns a contact string • Output from each job stored locally • $HOME/.globus/.gass_cache/… • Example: • globus-job-submit morpheus.engin.umich.edu /bin/date
globus-job-submit (cont’d) Supporting commands • globus-job-status <contact-string> • globus-job-getoutput <contact-string> • globus-job-clean <contact-string>
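Putting these commands together, a background submission and retrieval might look like this (a minimal sketch; the target host and /bin/date come from the slides, and capturing the contact string in a shell variable is just one convenient idiom):
  CONTACT=$(globus-job-submit morpheus.engin.umich.edu /bin/date)   # submit; prints the contact string
  globus-job-status    "$CONTACT"    # poll until the job reports DONE
  globus-job-getoutput "$CONTACT"    # print the job's stdout
  globus-job-clean     "$CONTACT"    # remove cached output when finished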
globus-job-run • Runs in foreground • Provides executable staging • Output delivered directly • Example • globus-job-run hypnos.engin.umich.edu/jobmanager-pbs /bin/hostname
globusrun • Main command for submitting globus jobs • Uses the Resource Specification Language for specifying job options • Examples: • globusrun -f b80.rsl • globusrun -r hypnos.engin.umich.edu -f myjob.rsl
Sample RSL File
  + ( &(resourceManagerContact="longhorn.tacc.utexas.edu/jobmanager-loadleveler")
       (max_wall_time=45)
       (queue=normal)
       (max_memory=10)
       (directory=/paci/sdsc/ux444444/JobOutput)
       (executable=/bin/date)
       (stdout=longhorn-output)
       (stderr=longhorn-error)
    )
Required RSL Parameters
• LoadLeveler at TACC (longhorn): (queue=normal) (max_wall_time=45) (max_memory=10)
• LoadLeveler at SDSC
  • b80's: (queue=interactive) (max_wall_time=45) (environment=(MP_EUIDEVICE en0))
  • tf004i/tf005i: (queue=normal) (max_wall_time=45)
• PBS at Michigan
  • hypnos: (queue=route) (max_wall_time=45) (email_address=your@email)
  • morpheus: (queue=npaci) (max_wall_time=45) (email_address=your@email)
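For example, combining the sample RSL layout with the required PBS parameters for hypnos, a submission could be scripted as follows (a minimal sketch; the RSL file name, output file names, and email address are placeholders):
  cat > hypnos.rsl <<'EOF'
  + ( &(resourceManagerContact="hypnos.engin.umich.edu/jobmanager-pbs")
       (queue=route)
       (max_wall_time=45)
       (email_address=your@email)
       (executable=/bin/date)
       (stdout=hypnos-output)
       (stderr=hypnos-error)
    )
  EOF
  globusrun -f hypnos.rsl     # the contact is taken from the RSL file itself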
Condor-G • Provides job submission & monitoring at a single site: griddle.sdsc.edu • Handles file transfers & job I/O • Uses Globus to launch jobs • Provides a tool (DAGMan) for handling job dependencies
Condor Submit Description File
  # path to executable on remote host
  executable = /paci/sdsc/ux444444/hello.sh
  # do not stage executable from local to remote host
  transfer_executable = false
  # host and jobmanager where job is to be submitted
  globusscheduler = longhorn.tacc.utexas.edu/jobmanager-fork
  # condor-g always uses the globus universe
  universe = globus
  # local files where output, error, and logs will be placed
  output = hello.out
  error = hello.error
  log = hello.log
  # submit the job
  Queue
Condor-G Commands • condor_submit to launch a job • condor_submit <description_file> • condor_q to monitor jobs • condor_rm to remove jobs • condor_rm <id> • condor_rm -all
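A complete Condor-G round trip on griddle.sdsc.edu might look like this (a minimal sketch; hello.condor is an assumed name for the description file shown above):
  grid-proxy-init                 # Condor-G needs a valid proxy to talk to Globus
  condor_submit hello.condor      # prints the cluster id of the new job
  condor_q                        # watch the job until it leaves the queue
  cat hello.out                   # output arrives in the local files named in the description file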
DAGMan • Metascheduler for Condor-G jobs • Directed acyclic graph (DAG) represents jobs & dependencies • Nodes (vertices) are jobs in the graph • Edges (arcs) identify dependencies • Commands • condor_submit_dag <DAGMan Input File> • condor_q -dag
DAGMan Input File • Required • Job names and their corresponding Condor submit description files for each node in the DAG • Dependency description • Optional • Preprocessing and postprocessing before or after job submission • Number of times to retry if a node within the DAG fails
Example DAGMan Input File
  Job A longhorn.condor
  Job B morpheus.condor
  Job C hypnos.condor
  Job D horizon.condor
  PARENT A CHILD B C
  PARENT B C CHILD D
  Retry C 3
Description Files
• longhorn.condor
  universe = globus
  executable = /bin/hostname
  transfer_executable = false
  globusscheduler = longhorn.tacc.utexas.edu/jobmanager-fork
  output = job.$(cluster).out
  error = job.$(cluster).err
  log = job.$(cluster).log
  Queue
• For the hypnos, morpheus, and tf004i files, replace the globusscheduler value appropriately
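Assuming the DAG above is saved as diamond.dag (a hypothetical file name) alongside its four description files, the whole workflow is launched and monitored with:
  condor_submit_dag diamond.dag   # DAGMan submits A first, then B and C, then D
  condor_q -dag                   # show jobs grouped under their DAG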
File Transfer GridFTP • Defines a protocol • Provides GSI authentication, partial file and parallel transfers, etc • Programs • Server: gsiftp - extends FTP with GSI authentication • Client: globus-url-copy
globus-url-copy • globus-url-copy <fromURL> <toURL> • Accepted URLs • For local files: file:<full path> • For remote files: gsiftp://<hostname>/<path> • Also accepts http://, ftp://, https:// • Example:
  globus-url-copy file:/paci/sdsc/ux444444/myfile \
      gsiftp://longhorn.tacc.utexas.edu/~/newfile
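Because both the source and destination may be gsiftp URLs, globus-url-copy can also move data directly between two grid sites without staging it locally (a minimal sketch; the file paths are illustrative):
  globus-url-copy \
      gsiftp://longhorn.tacc.utexas.edu/~/results.dat \
      gsiftp://hypnos.engin.umich.edu/~/results.dat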
gsiscp • Not GridFTP-based • Uses GSI authentication • Specify the GSISSH server port for single sign-on capabilities • Example:
  gsiscp -P 1022 setup* \
      ux444444@morpheus.engin.umich.edu:.
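The same GSI-enabled OpenSSH installation typically provides an interactive client as well; a minimal sketch, assuming gsissh is installed alongside gsiscp and listening on the same port:
  gsissh -p 1022 ux444444@morpheus.engin.umich.edu   # interactive login using your grid proxy instead of a password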
Resource & Discovery Services • Publishes system and application data • Components • Globus MDS - Monitoring and Discovery Services • Ganglia – For clusters • NWS – Network monitoring • Useful for grid middleware • Resource discovery & selection • Useful for grid applications Configuration and real-time adaptation
Graphical MDS Views • On the web: • https://hotpage.npaci.edu/ • https://hotpage.npaci.edu/cgi-bin/grid_view.cgi • http://npackage.cs.ucsb.edu/ldapbrowser/login.php • Download and run your own LDAP browser • e.g. http://www.iit.edu/~gawojar/ldap/ • NPACI Grid MDS Info • LDAP Host: giis.npaci.edu • Port: 2132 • Base DN: Mds-Vo-name=npaci,o=Grid
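Any standard LDAP client can also query the index service directly using the connection details above; a minimal sketch using OpenLDAP's ldapsearch, assuming anonymous binds are permitted:
  ldapsearch -x -h giis.npaci.edu -p 2132 \
      -b "Mds-Vo-name=npaci,o=Grid" \
      "(objectclass=*)"          # dump the published resource entries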
Future Work • New NPACKage components • GridPort (ready for next release) • Netsolve (in progress) • NPACI Alpha Project Integration • MCELL, Telesciences, Geosciences, Protein Folding are all in progress • Scalable Viz, PDB Data Resource, Computational Electromicroscopy coming soon
Grid Consulting • Services • Assist with troubleshooting • Evaluate your application for use on the grid • Assist with porting • We are actively looking for applications! Contact us: http://npacigrid.npaci.edu/contacts.html