HEPCAL view on File Access Jeff Templon NIKHEF email@example.com
Outline
• HEPCAL file access
• What I don’t like about SRM
• How SRM got into EDG SE (WP5) (personal view)
HEPCAL on file access
• A dataset (DS) can be any sort of collection of information
• Datasets are Write-Once-Read-Many
• The DMS (data management system) must be able to associate a default remote access protocol with each dataset; the DMS is expected to make sure a dataset lands only on SEs that can support that protocol
  • Root daemon
  • AMS
• Files belonging to a dataset should be made available for opening via a POSIX call or an application-specific remote access protocol. The Grid should provide a mechanism whereby a user can present an LDN and receive in return a list of physical file names (and possibly the protocols by which they can be opened) that can be mapped to the original files that were uploaded to the Grid.
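
The placement rule above (a dataset lands only on SEs that support its default protocol) can be sketched as a simple filter. This is only an illustration: the `StorageElement` class, the `eligible_ses` function, the host names, and the protocol strings are all hypothetical and are not part of HEPCAL or any EDG interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageElement:
    name: str             # hypothetical SE host name
    protocols: frozenset  # access protocols this SE claims to serve


def eligible_ses(default_protocol, storage_elements):
    """Return the SEs able to serve a dataset's default access protocol."""
    return [se for se in storage_elements if default_protocol in se.protocols]


# Made-up SEs, for illustration only.
SES = [
    StorageElement("se-a.example.org", frozenset({"posix", "rootd"})),
    StorageElement("se-b.example.org", frozenset({"posix"})),
    StorageElement("se-c.example.org", frozenset({"ams"})),
]

# A dataset whose default protocol is "rootd" may only be placed on se-a.
print([se.name for se in eligible_ses("rootd", SES)])
```
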
More HEPCAL on file access
• Regardless of which method (optimization in making the file available to the user) the Grid chooses, the user accesses the DS by providing an LDN and passing the returned file identifier to an open call.
• … to write a dataset directly onto an SE. We consider this a special case of the “DS upload” use case. In this case the Grid provides a dataset staging area where files can be created via standard POSIX calls. This will be either a suitable area on the local machine or on the SE, or even a different area as long as it optimises the subsequent upload of the dataset to the SE.
• The user opens the files for reading with a POSIX open or using the syntax of the specified access protocol (from the uc#dsaccess use case)
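
A minimal sketch of the access model described above: present an LDN, get back physical file names with their protocols, and pass one to a plain POSIX open. The in-memory `CATALOGUE`, the `register`, `resolve`, and `read_first_posix_replica` helpers, and the `"posix"` protocol label are assumptions made for this example; a real Grid catalogue would be a remote service.

```python
import os
import tempfile

# Hypothetical in-memory catalogue: LDN -> list of (physical file name, protocol).
CATALOGUE = {}


def register(ldn, pfn, protocol="posix"):
    """Record that `pfn` holds (part of) the dataset named `ldn`."""
    CATALOGUE.setdefault(ldn, []).append((pfn, protocol))


def resolve(ldn):
    """Present an LDN, receive the physical file names and their protocols."""
    return list(CATALOGUE[ldn])


def read_first_posix_replica(ldn):
    """Pass the returned file name straight to a POSIX open and get the bytes."""
    for pfn, protocol in resolve(ldn):
        if protocol != "posix":
            continue
        fd = os.open(pfn, os.O_RDONLY)
        try:
            chunks = []
            while True:
                chunk = os.read(fd, 65536)
                if not chunk:
                    break
                chunks.append(chunk)
            return b"".join(chunks)
        finally:
            os.close(fd)
    raise LookupError(f"no POSIX-accessible replica for {ldn!r}")


# Demo: a temporary directory stands in for the storage behind the catalogue.
with tempfile.TemporaryDirectory() as area:
    path = os.path.join(area, "events.dat")
    with open(path, "wb") as f:
        f.write(b"some event data")
    register("demo-dataset", path)
    print(read_first_posix_replica("demo-dataset"))
```

The point of the sketch is that the caller makes one catalogue lookup and one ordinary open; nothing else sits between the LDN and the bytes.
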
Conclusions, HEPCAL file access
• Present LDN, pass returned object to “open”, get the bytes
• Multi-file datasets are possibly in conflict with this model:
  • Files belonging to a dataset should be made available for opening via a POSIX call or an application-specific remote access protocol. The Grid should provide a mechanism whereby a user can present an LDN and receive in return a list of physical file names (and possibly the protocols by which they can be opened) that can be mapped to the original files that were uploaded to the Grid.
• POSIX access discussed yesterday in EDG ATF
  • Obvious that it falls within either WP2 or WP5
  • Neither feels they have time to do it
Multi-File Datasets: Unresolved Issue
Events for a given run might be partially resident in several files. If a physicist requests the dataset Omega-20070312 and wants to read vertex events, he wants to be sure to open the file “Vertex” and not something else. This means it must be possible to “label” the various components for identification later. One could take a unix-like approach where the DS name is like a “directory”, and the component files like the “files” in that directory. We weren’t able to decide if this approach was good. The problem of files/directories is not exactly equivalent to DS/components.
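
The unix-like approach mentioned above could look roughly like the following, with the dataset name acting as a directory and each labelled component as a file inside it. The helper names `make_dataset` and `open_component`, and the "Calorimeter" label, are invented for this sketch; it does not settle whether the directory analogy is the right one.

```python
import os
import tempfile


def make_dataset(root, ds_name, components):
    """Lay out a dataset as a directory with one file per labelled component."""
    ds_dir = os.path.join(root, ds_name)
    os.makedirs(ds_dir, exist_ok=True)
    for label, payload in components.items():
        with open(os.path.join(ds_dir, label), "wb") as f:
            f.write(payload)
    return ds_dir


def open_component(root, ds_name, label):
    """Open exactly the component asked for, or fail loudly if it is missing."""
    path = os.path.join(root, ds_name, label)
    if not os.path.isfile(path):
        raise KeyError(f"dataset {ds_name!r} has no component {label!r}")
    return open(path, "rb")


# Demo: the dataset name comes from the slide; the payloads are placeholders.
with tempfile.TemporaryDirectory() as root:
    make_dataset(root, "Omega-20070312",
                 {"Vertex": b"vertex events", "Calorimeter": b"calo events"})
    with open_component(root, "Omega-20070312", "Vertex") as f:
        print(f.read())
```
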
What I don’t like about SRM
• Files don’t stay put.
• We get an SFN which really isn’t a file name. There is no guarantee that, if I ssh onto the SE, there will be a file with that name
• The actual files on disk may have a different name each time they show up on disk
• This maybe isn’t so bad, but one cannot just open the file! Opening the file becomes a two-call sequence
• Maybe after seeing HEPCAL again I should not worry … just don’t ever look at an SE again
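
The two-call sequence complained about above can be illustrated with a toy model. This is not the SRM interface: `ToyStorageElement`, its `put` and `get` methods, the staging-file naming scheme, and the SFN string are all made up to show the shape of the problem, namely that the SFN is only a handle and the on-disk name changes every time the file is staged.

```python
import os
import tempfile


class ToyStorageElement:
    """Toy SE: files live in a back end and are staged to disk on request."""

    def __init__(self, disk_dir):
        self.disk_dir = disk_dir
        self._backend = {}  # SFN -> bytes; stands in for tape or other storage
        self._staged = 0    # counter used to give each staged copy a new name

    def put(self, sfn, payload):
        self._backend[sfn] = payload

    def get(self, sfn):
        """Call 1: stage the file and return where it lives right now.

        No bytes are returned, and the path differs on every call.
        """
        self._staged += 1
        path = os.path.join(self.disk_dir, f"stage-{self._staged:06d}")
        with open(path, "wb") as f:
            f.write(self._backend[sfn])
        return path


def read_via_two_calls(se, sfn):
    path = se.get(sfn)           # call 1: ask the SE where the file is
    with open(path, "rb") as f:  # call 2: the ordinary open
        return f.read()


# Demo: the SFN never appears as a file name on the SE's disk.
with tempfile.TemporaryDirectory() as disk:
    se = ToyStorageElement(disk)
    se.put("run42/vertex.dat", b"vertex events")
    print(read_via_two_calls(se, "run42/vertex.dat"))
    print(os.path.exists(os.path.join(disk, "run42/vertex.dat")))
```
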
How SRM got into EDG
• Not a party line view
• SRM was presented at the December 2002 ATF meeting
• It was not generally realized that the SE would be based on SRM
• People “woke up” during the Feb 2003 ATF meeting when WP5 expressed surprise that we thought “get” actually got the bytes
• May all be irrelevant: unless the SE converges within the next N hours, it may not be in LCG-1.