
Lecture 4 Grid Data Management

Jaime Frey, UW-Madison Condor Group, jfrey@cs.wisc.edu. Slides prepared in part by Scott Koranda, UW-Milwaukee & NCSA, skoranda@uwm.edu. Grid Summer Workshop, June 21-25, 2004.


Presentation Transcript


  1. Lecture 4: Grid Data Management
     Jaime Frey, UW-Madison Condor Group, jfrey@cs.wisc.edu
     Slides prepared in part by Scott Koranda, UW-Milwaukee & NCSA, skoranda@uwm.edu
     Grid Summer Workshop, June 21-25, 2004

  2. Motivation?
     Why is the Grid community concerned with data/file management?
     Why might you be concerned with data/file management?

  3. Motivation: The Data Problem
     • Motivate our discussion with the large physics experiments (part of GriPhyN and Grid2003)
     • Laser Interferometer Gravitational Wave Observatory (LIGO)
       • Detect spacetime ripples from black holes & other sources
       • Generates data at 10 MB per second, just under 1 TB per day
     • Sloan Digital Sky Survey
       • Catalog more stars and galaxies than ever before
       • More than 15 TB of data catalogs
     • Compact Muon Solenoid and ATLAS
       • Detect the Higgs boson (a fundamental particle)
       • 100 MB per second, about 1 petabyte per year (per detector)
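The rates quoted on this slide can be checked with quick arithmetic. The sketch below assumes decimal units (1 TB = 10^6 MB) and, for the LHC detectors, a live time of roughly 10^7 seconds per year. That live-time figure is an assumption, not something the slide states; it is simply the value that reproduces "about 1 petabyte per year".

```python
# Back-of-the-envelope check of the data rates quoted on the slide.
# Assumption: decimal units (1 TB = 1e6 MB, 1 PB = 1e9 MB).
SECONDS_PER_DAY = 24 * 60 * 60


def volume_mb(rate_mb_per_s, seconds):
    """Total data volume in MB produced at a constant rate."""
    return rate_mb_per_s * seconds


# LIGO: 10 MB/s, running around the clock.
ligo_tb_per_day = volume_mb(10, SECONDS_PER_DAY) / 1e6

# Assumption (not from the slide): a detector records data for about
# 1e7 seconds per year rather than the full ~3.15e7 seconds.
ASSUMED_LIVE_SECONDS_PER_YEAR = 10**7
lhc_pb_per_year = volume_mb(100, ASSUMED_LIVE_SECONDS_PER_YEAR) / 1e9

print(f"LIGO: {ligo_tb_per_day:.3f} TB/day")
print(f"LHC detector: {lhc_pb_per_year:.1f} PB/year")
```

At 10 MB per second the daily volume comes out a little under 1 TB, consistent with the slide.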

  4. Really Two Data Problems
     • The amount of data
       • High-performance tools needed to manage the huge raw volume of data
         • Store it
         • Move it
       • Measure in terabytes, petabytes, and ???
     • The number of data files
       • High-performance tools needed to manage the huge number of filenames
       • 10^12 filenames are expected soon
       • A collection of 10^12 of anything is a lot to handle efficiently

  5. Three Data Questions on the Grid
     Essentially three (3) questions that you want Grid tools to address
     • What data/files exist?
     • What data/files are where?
     • How do I move data/files from A to B?

  6. Three Data Questions on the Grid
     Examine these questions last to first, because even if you don’t have TBs of data you will want to move files. So start with #3.
     • What data/files exist?
     • What data/files are where?
     • How do I move data/files from A to B?

  7. How to move data/files?
     • Requirements
       • Fast: as fast as networks and protocols allow
         • Internet2 (I2) sites should expect at least 10 MB/s sustained
       • Secure
         • Server must only share files with strongly authenticated clients
         • No passwords in the clear or similar
       • Robust
         • Fault tolerant, time-tested protocol

  8. GridFTP
     • Extension to the well-known File Transfer Protocol (FTP)
       • http://www.globus.org/datagrid/deliverables/C2WPdraft3.pdf
     • Extensions include
       • Strong authentication, encryption via Globus GSI
       • Multiple, parallel data channels
       • Third-party transfers
       • Tunable network & I/O parameters
       • Server side processing, command pipelining

  9. Necessary Semantics…
     • GridFTP is the protocol
     • A server or client that implements the GridFTP protocol is GridFTP-enabled or Grid-enabled
     • You will often hear “the GridFTP server…” or “the GridFTP client…”
     • Correct is “the GridFTP-enabled server from the Globus team” or the particular client being used
     • Let it slide…easier to use the slang…but
     • The distinction becomes more important soon, as groups outside of Globus release GridFTP-enabled clients & servers

  10. GridFTP Server
      • Built on top of wuftpd, our old friend
        • A brand new server written from scratch is in beta now…
      • Most configuration details same as wuftpd
      • Runs as an inetd (xinetd) service
        • Connection is attempted on port 2811
        • Xinetd looks up port in /etc/services and finds responsible service
        • Xinetd starts service according to configuration, with data from the connection sent on stdin

  11. GridFTP Server
      • From /etc/services
          [services]$ tail /etc/services
          gsiftp             2811/tcp   #Grid-FTP Server
          globus-gatekeeper  2119/tcp   #Globus Gatekeeper
      • From /etc/xinetd.d/
          [xinetd.d]$ cat gsiftp
          service gsiftp
          {
              socket_type     = stream
              protocol        = tcp
              env             = LD_LIBRARY_PATH=/opt/ldg-2.0/globus/lib
              wait            = no
              user            = root
              server          = /opt/ldg-2.0/globus/sbin/in.ftpd
              server_args     = -l -a -G /opt/ldg-2.0/globus
              log_on_success  += DURATION USERID
              log_on_failure  += USERID
              nice            = 10
              disable         = no
          }

  12. GridFTP Server
      • Environment variables
        • LD_LIBRARY_PATH
          • Point to $GLOBUS_LOCATION/lib
        • GRIDMAP
          • Path to grid-mapfile for authentication
          • Generic GSI environment variable
        • X509_CERT_DIR
          • Directory in which CA signing certificates are held
          • Generic GSI environment variable

  13. GridFTP Server
      • Logging to system log
        • On most Linux systems, /var/log/messages
          Jun 10 10:46:59 basil gridftpd[21857]: GSSAPI user /DC=org/DC=doegrids/OU=People/CN=Scott Koranda 43845 is authorized as skoranda
          Jun 10 10:46:59 basil gridftpd[21857]: FTP LOGIN FROM oregano.phys.uwm.edu [129.89.57.55], skoranda
      • Uses host certificate for mutual authentication
          [root@basil root]# grid-cert-info -file /etc/grid-security/hostcert.pem -subject
          /DC=org/DC=doegrids/OU=Services/CN=basil.phys.uwm.edu

  14. GridFTP Server: Third-party transfers
      • Client directs transfers between two servers
      [Diagram: a GridFTP client (ygraine.aei.mpg.de) and two GridFTP servers (ldas-cit.ligo.caltech.edu and basil.phys.uwm.edu); the client issues “move file1 to ldas-cit.ligo.caltech.edu” and file1 travels between the servers.]

  15. GridFTP clients
      • globus-url-copy
        • GridFTP-compliant client from the Globus team
        • Copy files from one URL to another URL
          • One URL is usually a gsiftp:// URL
          • The other URL is usually a file:/ URL
        • To move a file from a remote GridFTP-enabled server to the local machine
            globus-url-copy gsiftp://dataserver.phys.uwm.edu/data/file1 file:/home/skoranda/file1

  16. Globus-url-copy
      • Alternative forms for file:/ URLs
          globus-url-copy gsiftp://dataserver.phys.uwm.edu/data/file1 file://localhost/home/skoranda/file1
          globus-url-copy gsiftp://dataserver.phys.uwm.edu/data/file1 file://basil.phys.uwm.edu/home/skoranda/file1
      • If the GridFTP server runs on a non-standard port?
          globus-url-copy gsiftp://dataserver.phys.uwm.edu:15000/data/file1 file:/home/skoranda/file1
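To make the URL forms concrete, here is a minimal sketch that splits a gsiftp:// URL into host, port, and path using Python's standard `urllib.parse`. The helper name `split_gsiftp_url` is hypothetical, and treating 2811 as the default is an assumption based on the /etc/services entry shown earlier in the slides.

```python
from urllib.parse import urlparse

# Assumption: 2811 is the standard port, as listed for gsiftp in /etc/services.
GSIFTP_DEFAULT_PORT = 2811


def split_gsiftp_url(url):
    """Return (host, port, path) for a gsiftp:// URL (hypothetical helper)."""
    parts = urlparse(url)
    if parts.scheme != "gsiftp":
        raise ValueError(f"not a gsiftp URL: {url}")
    port = parts.port if parts.port is not None else GSIFTP_DEFAULT_PORT
    return parts.hostname, port, parts.path


print(split_gsiftp_url("gsiftp://dataserver.phys.uwm.edu/data/file1"))
print(split_gsiftp_url("gsiftp://dataserver.phys.uwm.edu:15000/data/file1"))
```

The second call picks up the non-standard port 15000 from the URL itself, which is all that globus-url-copy needs to be told.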

  17. Globus-url-copy
      • To put a file onto the server, reverse the URLs
          globus-url-copy file:/home/skoranda/file1 gsiftp://dataserver.phys.uwm.edu/data/file1
      • By default 1 data channel is used
        • average performance
        • monitor performance using the -vb flag
            $ globus-url-copy -vb gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/smallfile file:/tmp/smallfile
            9437184 bytes 658.09 KB/sec avg 512.95 KB/sec inst

  18. Going fast
      • Multiple channels dramatically boost the transfer rate
          $ globus-url-copy -vb -p 4 gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile file:/tmp/largefile
          523960320 bytes 5814.25 KB/sec avg 5568.27 KB/sec inst
      • Still faster by using large TCP windows
          $ globus-url-copy -vb -p 4 -tcp-bs 1048576 gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile file:/tmp/largefile
          514392064 bytes 6609.67 KB/sec avg 8639.71 KB/sec inst
      • Still faster by using large memory buffers
          $ globus-url-copy -vb -p 4 -bs 1048576 -tcp-bs 1048576 gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile file:/tmp/largefile
          523304960 bytes 7300.56 KB/sec avg 9311.99 KB/sec inst
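The tuning flags on this slide compose in a fixed pattern, so a script that launches transfers can assemble them programmatically. The sketch below builds the argument list using only the flags that appear in the transcript (`-vb`, `-p`, `-bs`, `-tcp-bs`); the function name and its parameters are hypothetical.

```python
def build_transfer_command(src, dst, parallel=1, block_size=None, tcp_buffer=None):
    """Assemble a globus-url-copy argument list (hypothetical helper).

    Only flags shown on the slides are used:
      -vb      report transfer performance
      -p N     number of parallel data channels
      -bs N    memory buffer size in bytes
      -tcp-bs  TCP buffer (window) size in bytes
    """
    args = ["globus-url-copy", "-vb"]
    if parallel > 1:
        args += ["-p", str(parallel)]
    if block_size is not None:
        args += ["-bs", str(block_size)]
    if tcp_buffer is not None:
        args += ["-tcp-bs", str(tcp_buffer)]
    return args + [src, dst]


cmd = build_transfer_command(
    "gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile",
    "file:/tmp/largefile",
    parallel=4,
    block_size=1048576,
    tcp_buffer=1048576,
)
print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine with the Globus client tools and a valid GSI proxy; building a list rather than a string avoids shell quoting issues.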

  19. Faster!
      • Depending on network & weather you can go very fast!
          $ globus-url-copy -vb -p 8 -bs 1048576 -tcp-bs 1048576 gsiftp://ldas-cit.ligo.caltech.edu:15000/usr1/grid/largefile file:/tmp/largefile
          185270272 bytes 18092.57 KB/sec avg 25153.96 KB/sec inst

  20. Third-party transfers
      • Transfers from server to server directed by client
      • Use gsiftp:// URLs for both
        • requires that both servers be configured to allow third-party transfers
          $ hostname
          basil.phys.uwm.edu
          $ globus-url-copy gsiftp://hydra.phys.uwm.edu/tmp/file1 gsiftp://contra.phys.uwm.edu/tmp/file1

  21. Debugging
      Use -dbg to see control channel communication
          $ globus-url-copy -dbg gsiftp://hydra.phys.uwm.edu/tmp/file1 file:/tmp/file1
          debug: starting to get gsiftp://hydra.phys.uwm.edu/tmp/file1
          debug: connecting to gsiftp://hydra.phys.uwm.edu/tmp/file1
          debug: response from gsiftp://hydra.phys.uwm.edu/tmp/file1:
          220 hydra.phys.uwm.edu GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2 (gcc32dbg, 1069715860-42) ready.
          debug: authenticating with gsiftp://hydra.phys.uwm.edu/tmp/file1
          debug: response from gsiftp://hydra.phys.uwm.edu/tmp/file1:
          230 User skoranda logged in.
          debug: sending command: FEAT
          debug: response from gsiftp://hydra.phys.uwm.edu/tmp/file1:
          211-Extensions supported: REST STREAM ESTO ERET MDTM SIZE PARALLEL DCAU
          211 END
          <snip>

  22. Globus-url-copy
      Actually a general-purpose URL copying tool
      • No GSI authentication used for http:// and ftp:// URLs
      • Parallel channels and the like won’t work
          $ globus-url-copy http://www.yahoo.com file:/tmp/yahoo
          $ globus-url-copy ftp://ftp.globus.org/banner.msg file:/tmp/banner.msg

  23. GridFTP clients
      • UberFTP
        • developed and supported at the National Center for Supercomputing Applications (NCSA)
        • interactive like our old (insecure) friend ‘ftp’
        • use -a GSI for GSI authentication
        • supports multiple channels using the -c flag
          $ uberftp -H hydra.phys.uwm.edu -a GSI
          220 hydra.phys.uwm.edu GridFTP Server 1.12 GSSAPI type Globus/GSI wu-2.6.2 (gcc32dbg, 1069715860-42) ready.
          230 User skoranda logged in.
          uberftp>

  24. GridFTP clients
      • “Roll your own”
        • Add functionality directly to your applications
          • Should your application find and download its own data?
          • Should your application deliver output data files when finished computing?
        • Globus Toolkit offers APIs to code against
          • C
          • Java
          • Python

  25. GridFTP and Firewalls
      • Nice document by Globus team at http://www.globus.org/security/firewalls/Globus Firewall Requirements-5.pdf
      • Tip: when debugging GridFTP and firewalls
        • remember which way connections are established
          • 1 single data channel
            • data connection established from client to server
          • 2 or more data channels
            • data connection established in the direction data will flow
          • control connection always from client to server
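The connection-direction rule from this slide is easy to get backwards when reading firewall logs, so it can help to write it down as a function. This is a sketch of the rule exactly as stated on the slide, not of any Globus API; the function name and argument names are hypothetical.

```python
def data_connection_direction(channels, client, server, sender, receiver):
    """Return (initiator, acceptor) for the data connection(s).

    Encodes the slide's rule of thumb:
      - 1 data channel:   the client connects to the server
      - 2+ data channels: the connection is opened in the direction
                          the data flows (sender connects to receiver)
    """
    if channels < 1:
        raise ValueError("need at least one data channel")
    if channels == 1:
        return (client, server)
    return (sender, receiver)


def control_connection_direction(client, server):
    """The control connection is always opened by the client."""
    return (client, server)


# Downloading with 4 parallel channels: the server is the sender, so the
# server opens connections toward the client, and the client-side
# firewall must accept them.
print(data_connection_direction(4, client="basil", server="hydra",
                                sender="hydra", receiver="basil"))
```

For a parallel download, then, the inbound rules to check are on the client's side, which is often the surprising part.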

  26. Hints for Experts
      To make GridFTP go really fast
      • use fast disks/filesystems
        • filesystem should read/write > 30 MB/second
      • configure TCP for performance
        • See TCP Tuning Guide at http://www-didc.lbl.gov/TCP-tuning/
      • patch your Linux kernel with the web100 patch
        • See http://www.web100.org
        • Important work-around for Linux TCP “feature”
      • understand your network path
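A common rule of thumb in TCP tuning, which the slide does not spell out, is that the TCP buffer should be at least the bandwidth-delay product of the path: target bandwidth multiplied by round-trip time. The sketch below works this out for the 10 MB/s target mentioned earlier in the lecture; the 50 ms round-trip time is a made-up example value, not a measurement.

```python
def bandwidth_delay_product_bytes(bandwidth_bytes_per_s, rtt_ms):
    """Bytes that must be in flight to keep a path full.

    Integer arithmetic is used on purpose to avoid floating-point noise.
    """
    return bandwidth_bytes_per_s * rtt_ms // 1000


# Assumption: 10 MB/s target rate and a hypothetical 50 ms round trip.
target_rate = 10 * 1000 * 1000
assumed_rtt_ms = 50

bdp = bandwidth_delay_product_bytes(target_rate, assumed_rtt_ms)
print(f"buffer needed for one stream: {bdp} bytes")
```

Under these assumptions a single stream needs about half a megabyte of buffer, which suggests why a 1 MB TCP buffer setting can help on long paths.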

  27. Three Data Questions on the Grid
      • What data/files exist?
      • What data/files are where?
      • How do I move data/files from A to B?

  28. What data/files are where?
      • Requirements
        • Catalog 10^8 files and their locations
          • What files are where (possibly at more than one place)
          • Across multiple sites within a Grid
        • Mappings from logical filenames (LFNs) to physical filenames (PFNs) or URLs
        • No single point of failure
          • No central catalog/server to be a single point of failure

  29. Globus Replica Location Service
      • Globus RLS
      • Each RLS server usually runs two catalogs
        • LRC
          • Local replica catalog
          • Catalog of what files you have (LFNs) and mappings to URL(s) or PFNs
        • RLI
          • Replica location index
          • Catalog of which files (LFNs) the other LRCs in your data grid know about

  30. Globus RLS
      • Network of RLS servers that inform each other
      • Each site has an LRC with mappings of LFNs to PFNs
        • usually contains the “local” mappings
        • where files are located at the site
      • Site at Milwaukee might have this mapping in its LRC
          H-R-792845521-16.gwf → gsiftp://dataserver.phys.uwm.edu/LIGO/H-R-792845521-16.gwf
      • LRC catalog at each site tells remote RLIs what LFNs it has mappings for
        • Milwaukee tells Caltech it has a mapping for H-R-792845521-16.gwf
        • So Caltech RLI has mapping H-R-792845521-16.gwf → LRC at Milwaukee

  31. Globus RLS
      [Diagram: two sites, each running an RLS server with an LRC and an RLI, exchanging LFN lists.]
      • site A (rls://serverA:39281)
        • LRC: file1 → gsiftp://serverA/file1, file2 → gsiftp://serverA/file2
        • RLI: file3 → rls://serverB/file3, file4 → rls://serverB/file4
      • site B (rls://serverB:39281)
        • LRC: file3 → gsiftp://serverB/file3, file4 → gsiftp://serverB/file4
        • RLI: file1 → rls://serverA/file1, file2 → rls://serverA/file2

  32. Globus RLS
      Typical way to query the RLS network and find files in your Grid
      • Ask your local LRC “do you know about the file H-R-793274271.gwf?”
      • If yes…
        • Ask your local LRC for the corresponding URL(s)
        • It answers “H-R-793274271.gwf is at URL gsiftp://basil.phys.uwm.edu/LIGO/H-R-793274271.gwf”
      • If no…
        • Ask your local RLI “who does know about this file?”
        • It answers “The RLS server at MIT knows about this file.”
        • Go ask the MIT RLS server
          • “I am told you know about the file H-R-793274271.gwf…please tell me the URL for it.”
          • It answers “H-R-793274271.gwf is at URL gsiftp://ldas.mit.edu/LIGO/H-R-793274271.gwf”
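The lookup procedure on this slide can be sketched with plain dictionaries standing in for the catalogs. This is a toy in-memory model, not the RLS client API: the `servers` structure and the `locate` function are hypothetical, and the server names are reused from the slides for illustration only.

```python
# Toy model: each RLS server has an LRC (LFN -> list of PFNs) and an
# RLI (LFN -> list of remote servers whose LRC claims to know the LFN).
servers = {
    "rls://dataserver.phys.uwm.edu": {
        "lrc": {},
        "rli": {"H-R-793274271.gwf": ["rls://ldas.mit.edu"]},
    },
    "rls://ldas.mit.edu": {
        "lrc": {
            "H-R-793274271.gwf": [
                "gsiftp://ldas.mit.edu/LIGO/H-R-793274271.gwf"
            ]
        },
        "rli": {},
    },
}


def locate(lfn, local_server, catalog):
    """Return the URLs for an LFN: local LRC first, then RLI, then remote LRCs."""
    local = catalog[local_server]
    # Step 1: does the local LRC know the file?
    if lfn in local["lrc"]:
        return list(local["lrc"][lfn])
    # Step 2: ask the local RLI which remote servers know it,
    # then query each of those servers' LRCs.
    urls = []
    for remote in local["rli"].get(lfn, []):
        # .get() tolerates an RLI answer that turns out to be wrong,
        # which can happen with Bloom filter updates (false positives).
        urls.extend(catalog[remote]["lrc"].get(lfn, []))
    return urls


print(locate("H-R-793274271.gwf", "rls://dataserver.phys.uwm.edu", servers))
```

An empty result means neither the local LRC nor any LRC named by the RLI had a mapping for the file.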

  33. Globus RLS
      • Quick Review
        • LFN → logical filename (think of as simple filename)
        • PFN → physical filename (think of as a URL)
        • LRC → your local catalog of maps from LFNs to PFNs
          • H-R-792845521-16.gwf → gsiftp://dataserver.phys.uwm.edu/LIGO/H-R-792845521-16.gwf
        • RLI → your local catalog of maps from LFNs to LRCs
          • H-R-792845521-16.gwf → LRCs at MIT, PSU, Caltech, and UW-M
        • LRCs inform RLIs about the mappings they know
        • Find files on your Grid by querying RLI(s) to get LRC(s), then querying LRC(s) to get URL(s)

  34. Globus RLS: Server Perspective
      • Listens on port 39281 (default) for clients
      • Responds to client queries
        • what LFNs are in the local catalog, the LRC?
        • what other LRCs know about LFNs?
        • checks against access control list for each client
      • Accepts publishing of new LFNs into LRC
        • add files to local catalog
      • Sends updates of LRC to other servers
        • tell remote RLI catalogs what LFNs you have mappings for locally

  35. Globus RLS: Server Perspective
      • Listens on port 39281 (default) for clients
      • Server address is a URL
        • rls://dataserver.phys.uwm.edu
        • rls://dataserver.phys.uwm.edu:39281
        • rls://dataserver
        • rls://localhost
      • Uses a host certificate to identify itself
        • must run as root if host cert is owned by root
        • often the host cert/key is copied to a non-root, limited-privilege account and the server is configured to use that copy

  36. Globus RLS: Server Perspective
      • Mappings LFNs → PFNs kept in a database
        • Uses generic ODBC interface to talk to any (good) RDBMS
          • MySQL, PostgreSQL, Oracle, DB2,...
      • All RDBMS details hidden from administrator and user
        • well, not quite
        • RDBMS may need to be “tuned” for performance
        • but one can start off knowing very little about RDBMSs

  37. Globus RLS: Server Perspective
      Mappings LFNs → LRCs stored in 1 of 2 ways
      • table in database
        • full, complete listing from LRCs that update your RLI
        • requires each LRC to send your RLI a full, complete list
        • as the number of LFNs in the catalog grows, this becomes substantial
          • 10^8 filenames at 64 bytes per filename ~ 6 GB
      • in memory in a special hash called a Bloom filter
        • 10^8 filenames stored in as little as 256 MB
        • easy for LRC to create Bloom filter and send over network to RLIs
        • can cause RLI to lie when asked if it knows about an LFN
          • only false positives
          • tunable error rate
          • acceptable in many contexts
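The two storage figures on this slide can be reproduced with a few lines of arithmetic. The sketch also estimates the false-positive rate a textbook Bloom filter would reach at that size using the standard optimal-hash-count formula; how many hash functions RLS actually uses is not stated on the slide, so that last number is an idealized estimate rather than a property of RLS.

```python
import math

NUM_LFNS = 10**8

# Full list: 64 bytes per filename, as on the slide.
list_bytes = NUM_LFNS * 64

# Bloom filter: 256 MB, read here as 256 * 2**20 bytes (an assumption).
bloom_bytes = 256 * 2**20
bits_per_lfn = bloom_bytes * 8 / NUM_LFNS

# Textbook estimate with the optimal number of hash functions:
#   p = exp(-(m/n) * (ln 2)^2)
# Assumption: idealized filter; the real RLS parameters may differ.
ideal_false_positive_rate = math.exp(-bits_per_lfn * math.log(2) ** 2)

print(f"full list:    {list_bytes / 1e9:.1f} GB")
print(f"Bloom filter: {bits_per_lfn:.1f} bits per LFN")
print(f"idealized false-positive rate: {ideal_false_positive_rate:.1e}")
```

The list comes to several gigabytes while the filter spends only about 21 bits per filename, which is why the Bloom filter is so much cheaper to ship between sites.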

  38. Globus RLS: Configuring the Server
      • Single configuration file
        • usually $GLOBUS_LOCATION/etc/globus-rls-server.conf
      • Send server a HUP signal to refresh configuration
        • kill -SIGHUP <pid>
      • Access control
        • each “client” given one or more of
          • lrc_read : permission to query the LRC for mappings
          • lrc_update : permission to add new mappings in LRC
          • rli_read : permission to query RLI for mappings
          • rli_update : permission to inform RLI of remote LRC mappings
          • stats : permission to query server for statistics
          • admin : permission to change configuration on the fly

  39. Globus RLS: Configuring the Server
      • Access control
        • access given to certificate subject
            acl /DC=org/DC=doegrids/OU=People/CN=Scott Koranda: lrc_read
        • access given to UID mapped in grid-mapfile
          • which grid-mapfile is examined is controlled by the GRIDMAP environment variable
            acl skoranda: lrc_read
        • must give remote LRCs permission to update your RLI
          • remote RLS server uses host certificate to identify itself
            acl /DC=org/DC=doegrids/OU=Services/CN=ldas.mit.edu: rli_update

  40. Globus RLS: Configuring the Server
      • globus-rls-admin tool for configuration
        • need GSI credential to talk to server
        • must have acl with admin privileges for your credential
        • manual page is available
            NAME
                globus-rls-admin - Replica Location Service Administration
            SYNOPSIS
                globus-rls-admin -A|-a|-C option value|-c option|-D|-d|-e|-p|-q|-r|-S|-s|-t timeout|-u|-v [ rli ] [ pattern ] [ server ]
            DESCRIPTION
                The program globus-rls-admin performs administrative operations on a RLS server (see globus-rls-server(8)).
      • ping the server to see if it is alive
          $ globus-rls-admin -p rls://localhost
          ping rls://localhost: 0 seconds

  41. Globus RLS: Configuring the Server
      • Query server for statistics
          $ globus-rls-admin -S rls://localhost
          Version: 2.1.5
          Uptime: 02:46:19
          LRC stats
            update method: lfnlist
            update method: bloomfilter
            updates bloomfilter: rls://mini.astro.cf.ac.uk:39281 last 06/15/04 11:39:12
            updates bloomfilter: rls://ygraine.aei.mpg.de:39281 last 12/31/69 18:00:00
            updates bloomfilter: rls://ldas-cit.ligo.caltech.edu:39281 last 12/31/69 18:00:00
            lfnlist update interval: 86400
            bloomfilter update interval: 900
            numlfn: 4110878
            numpfn: 12328767
            nummap: 12328775
          RLI stats
            updated by: rls://mini.astro.cf.ac.uk:39281 last 06/15/04 11:47:56
            updated by: rls://ygraine.aei.mpg.de:39281 last 06/15/04 11:25:23
            updated by: rls://ldas-cit.ligo.caltech.edu:39281 last 06/15/04 11:43:31
            updated via bloomfilters
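Two of the "last" timestamps in this output read 12/31/69 18:00:00. That is what Unix time zero looks like when printed in a UTC-6 time zone, as the sketch below shows. Reading it as "no update has been sent to that RLI yet" is an inference, not something the slide states, and the choice of US Central Standard Time is an assumption based on the server being in Milwaukee.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the server's local time zone is US Central Standard (UTC-6).
US_CENTRAL_STANDARD = timezone(timedelta(hours=-6))

# Unix time 0 (the epoch) rendered in that zone, in the same
# month/day/year format that the statistics output uses.
epoch_local = datetime.fromtimestamp(0, tz=US_CENTRAL_STANDARD)
formatted = epoch_local.strftime("%m/%d/%y %H:%M:%S")

print(formatted)
```

A zero timestamp is a common placeholder for "never", which fits a freshly configured update target.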

  42. Globus RLS: Configuring the Server
      • Tell LRC what remote RLIs to update
        • local LRC should update the RLI at MIT using Bloom filter
            $ globus-rls-admin -A rls://ldas.mit.edu rls://localhost
        • use -a if updating via lists rather than Bloom filter

  43. Globus RLS: Client Perspective
      Two ways for clients to interact with RLS Server
      • globus-rls-cli simple command-line tool
        • query
        • create new mappings
      • “roll your own” client by coding against API
        • Java
        • C
        • Python

  44. Globus-rls-cli
      Simple query to LRC to find a PFN for an LFN
      • Note more than 1 PFN may be returned
          $ globus-rls-cli query lrc lfn H-R-714024224-16.gwf rls://dataserver:39281
          H-R-714024224-16.gwf: file://localhost/netdata/s001/S1/R/H/714023808-714029599/H-R-714024224-16.gwf
          H-R-714024224-16.gwf: file://medusa-slave001.medusa.phys.uwm.edu/data/S1/R/H/714023808-714029599/H-R-714024224-16.gwf
          H-R-714024224-16.gwf: gsiftp://dataserver.phys.uwm.edu:15000/data/gsiftp_root/cluster_storage/data/s001/S1/R/H/714023808-714029599/H-R-714024224-16.gwf
      • Server and client behave sanely if LFN not found
          $ globus-rls-cli query lrc lfn "foo" rls://dataserver
          LFN doesn't exist: foo
          $ echo $?
          1
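Because one LFN can map to several PFNs, a script consuming this output will usually want a mapping from each LFN to a list of PFNs. The sketch below parses the "LFN: PFN" line format shown on the slide; the function name is hypothetical, and it assumes the caller has already checked the command's exit status, since the error message "LFN doesn't exist: foo" has the same shape as a result line.

```python
from collections import defaultdict


def parse_lrc_output(text):
    """Parse 'LFN: PFN' lines into {lfn: [pfn, ...]} (hypothetical helper).

    Assumes the exit status was already checked, so no error lines are present.
    """
    mappings = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at the first ': ' only; the PFN itself contains colons.
        lfn, sep, pfn = line.partition(": ")
        if not sep:
            continue  # not a mapping line
        mappings[lfn].append(pfn)
    return dict(mappings)


sample = """\
H-R-714024224-16.gwf: file://localhost/netdata/s001/S1/R/H/714023808-714029599/H-R-714024224-16.gwf
H-R-714024224-16.gwf: gsiftp://dataserver.phys.uwm.edu:15000/data/gsiftp_root/cluster_storage/data/s001/S1/R/H/714023808-714029599/H-R-714024224-16.gwf
"""

result = parse_lrc_output(sample)
for lfn, pfns in result.items():
    print(lfn, len(pfns), "replica(s)")
```

Keeping all PFNs lets the caller pick the cheapest replica, for example a local file:// URL before a remote gsiftp:// one.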

  45. Globus-rls-cli
      • Be sure to quote the LFN if it has funny characters
          $ globus-rls-cli query lrc lfn file& rls://dataserver
          [1] 16346
          bash: rls://dataserver: No such file or directory
          [datarobot@dataserver datarobot]$ connect(file): Bad URL: globus_url_parse(file): Error code -3
          [1]+  Exit 1    globus-rls-cli query lrc lfn file
          [datarobot@dataserver datarobot]$ globus-rls-cli query lrc lfn "file&" rls://dataserver
          LFN doesn't exist: file&
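When a command line like this is generated by a script, the quoting can be delegated to Python's standard `shlex.quote`, which wraps any argument containing shell metacharacters in single quotes. The helper name below is hypothetical; the command and its arguments are the ones from the slide.

```python
import shlex


def rls_query_command_line(lfn, server):
    """Build a shell-safe command line for an LRC query (hypothetical helper)."""
    args = ["globus-rls-cli", "query", "lrc", "lfn", lfn, server]
    # shlex.quote leaves plain words alone and quotes anything risky.
    return " ".join(shlex.quote(arg) for arg in args)


command_line = rls_query_command_line("file&", "rls://dataserver")
print(command_line)
```

If the command is run from Python rather than pasted into a shell, passing the argument list directly to `subprocess.run` without `shell=True` sidesteps the quoting problem altogether.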

  46. Globus-rls-cli
      Wildcard searches of LRC supported
      • probably a good idea to quote the LFN wildcard expression
          $ globus-rls-cli query wildcard lrc lfn "H-R-7140242*-16.gwf" rls://dataserver:39281
          H-R-714024208-16.gwf: gsiftp://dataserver.phys.uwm.edu:15000/data/gsiftp_root/cluster_storage/data/s001/S1/R/H/714023808-714029599/H-R-714024208-16.gwf
          H-R-714024224-16.gwf: gsiftp://dataserver.phys.uwm.edu:15000/data/gsiftp_root/cluster_storage/data/s001/S1/R/H/714023808-714029599/H-R-714024224-16.gwf
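To see why this pattern returns those two files and not their neighbours, the match can be imitated with shell-style globbing from Python's standard `fnmatch` module. This only illustrates how `*` behaves in the example; the exact wildcard syntax RLS supports is not given on the slide and may differ. The third LFN is taken from the bulk-query slide.

```python
from fnmatch import fnmatchcase


def match_lfns(pattern, lfns):
    """Return the LFNs matching a shell-style pattern (illustration only)."""
    return [lfn for lfn in lfns if fnmatchcase(lfn, pattern)]


known_lfns = [
    "H-R-714024208-16.gwf",
    "H-R-714024224-16.gwf",
    "H-R-714024320-16.gwf",
]

matches = match_lfns("H-R-7140242*-16.gwf", known_lfns)
print(matches)
```

The third name is excluded because its numeric part begins with 7140243 rather than 7140242.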

  47. Globus-rls-cli
      Bulk queries also supported
      • obtain PFNs for more than one LFN at a time
          $ globus-rls-cli bulk query lrc lfn H-R-714024224-16.gwf H-R-714024320-16.gwf rls://dataserver
          H-R-714024320-16.gwf: gsiftp://dataserver.phys.uwm.edu:15000/data/gsiftp_root/cluster_storage/data/s001/S1/R/H/714023808-714029599/H-R-714024320-16.gwf
          H-R-714024224-16.gwf: gsiftp://dataserver.phys.uwm.edu:15000/data/gsiftp_root/cluster_storage/data/s001/S1/R/H/714023808-714029599/H-R-714024224-16.gwf

  48. Globus-rls-cli
      Simple query to RLI to locate an LFN-to-LRC mapping
      • then query that LRC for the PFN
          $ globus-rls-cli query rli lfn H-R-714024224-16.gwf rls://dataserver
          H-R-714024224-16.gwf: rls://ldas-cit.ligo.caltech.edu:39281
          $ globus-rls-cli query lrc lfn H-R-714024224-16.gwf rls://ldas-cit.ligo.caltech.edu:39281
          H-R-714024224-16.gwf: gsiftp://ldas-cit.ligo.caltech.edu:15000/archive/S1/L0/LHO/H-R-7140/H-R-714024224-16.gwf

  49. Globus-rls-cli
      • Bulk queries to RLI also supported
          $ globus-rls-cli bulk query rli lfn H-R-714024224-16.gwf H-R-714024320-16.gwf rls://dataserver
          H-R-714024320-16.gwf: rls://ldas-cit.ligo.caltech.edu:39281
          H-R-714024224-16.gwf: rls://ldas-cit.ligo.caltech.edu:39281
      • Wildcard queries to RLI may not be supported!
        • no wildcards when using Bloom filter updates
          $ globus-rls-cli query wildcard rli lfn "H-R-7140242*-16.gwf" rls://dataserver
          Operation is unsupported: Wildcard searches with Bloom filters

  50. Globus-rls-cli
      RLS with Bloom filter updates to RLI
      • fast and efficient
      • Bloom filter is a hash of the information in an LRC
      • remote LRC creates the Bloom filter and sends it to the RLI
      • RLI can test whether a particular LFN is in the LRC’s Bloom filter
        • can’t do a wildcard search
        • will sometimes lie!
          • only false positives
          • if you can’t have any false positives, use full list updates
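The properties listed on this slide (compact, membership tests only, occasional false positives, never false negatives) follow directly from how a Bloom filter works. The sketch below is a generic textbook Bloom filter built on SHA-256, not the implementation used inside RLS; the class name, sizes, and hashing scheme are all illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Minimal textbook Bloom filter (illustration, not the RLS implementation)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if every bit is set: "probably present".
        # False if any bit is clear: "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


lrc_filter = BloomFilter(num_bits=1024, num_hashes=3)
for lfn in ["H-R-714024208-16.gwf", "H-R-714024224-16.gwf"]:
    lrc_filter.add(lfn)

print("H-R-714024224-16.gwf" in lrc_filter)
```

Only hashed bit positions are stored, never the names themselves, which is why the filter cannot enumerate its contents and a wildcard search has nothing to match against.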
