
Scalla/xrootd


Presentation Transcript


  1. Scalla/xrootd Andrew Hanushevsky, SLAC National Accelerator Laboratory, Stanford University, 08-June-10, ANL Tier3 Meeting

  2. What is Scalla? • Structured Cluster Architecture for Low Latency Access • Low Latency Access to data via xrootd servers • POSIX-style byte-level random access • By default, arbitrary data organized as files • Hierarchical directory-like name space • Protocol includes high performance features • Structured Clustering provided by cmsd servers • Exponentially scalable and self organizing

  3. What People Like About xrootd • It’s really simple and easy to administer • Requires basic file system administration knowledge • And becoming familiar with xrootd nomenclature • No 3rd party software maintenance (i.e., self-contained) • Handles heavy loads • E.g., >3,000 connections and >10,000 open files • Resilient and forgiving • Failures handled in a natural way • Configuration changes can be done in real time • E.g., adding or removing servers

  4. The Basic Building Blocks • [Diagram: an application on a Linux client machine talks through an NFS client to an NFS server and its data files on a Linux server machine; alternatively, the application talks through an xroot client to an xroot server in the same arrangement] • xrootd is nothing more than an application-level file server & client using another protocol

  5. Why Not Just Use NFS? • NFS V2 & V3 inadequate • Scaling problems with large batch farms • Unwieldy when more than one server needed • Doesn’t NFS V4 support multiple servers? • Yes, but… • Difficult to maintain a scalable uniform name space • Still has single point of failure problems • Performance on par with NFS V3

  6. NFS & Multiple File Servers • [Diagram: an application on a Linux client machine runs cp /foo /tmp, issuing open(“/foo”) through the NFS client; with data files spread across NFS server machines A and B, which server should it go to?] • NFS V4 uses manual referrals to redirect the client. This still ties part of a name space to a particular server, requiring that you also deploy pNFS to spread data across servers.

  7. The Scalla Approach • [Diagram: the application runs xrdcp root://R//foo /tmp; the xroot client opens “/foo” at the redirector R, R asks its data servers “Who has /foo?”, server B answers “I do!”, R tells the client “Try B”, and the client opens “/foo” directly at server B, where the data files live] • The xroot client does all of these steps automatically without application (user) intervention!

  8. File Discovery Considerations • The redirector does not have a catalog of files • It always asks each server, and • Caches the answers in memory for a “while” • So, it won’t ask again when asked about a past lookup • Allows real-time configuration changes • Clients never see the disruption • This is optimal for reading (the 80% use-case) • The lookup takes << 100 microseconds when files exist • Much longer when a requested file does not exist! • This is the write case (covered in more detail in the xrootd documentation)

  9. Nomenclature Review • xrootd • Server that provides access to data • cmsd • Server that glues xrootds together • Redirector (a special xrootd-cmsd pair) • The head node that clients always contact • Looks just like a data server but is • Responsible for directing clients to the correct server • [Diagram: an xrootd and a cmsd, always paired]

  10. Clustering Mechanics • Clustering provided by cmsd processes • Oversees the health and name space on each xrootd server • Maps file names to the servers that have the file • Informs the client, via an xrootd server, about the file’s location • All done in real time without using any databases • Each xrootd server process talks to a local cmsd process • They communicate over a Unix named (i.e., file system) socket • Local cmsds communicate with a manager cmsd elsewhere • They communicate over a TCP socket • Each process has a specific role in the cluster

  11. Role Nomenclature • Manager Role • Assigned to the head node xrootd-cmsd pair (i.e., the redirector) • Keeps track of the path to the file • Guides clients to the actual file • Decides what server is to be used for a request • Server Role • Assigned to the data node xrootd-cmsd pairs • Keeps track of xrootd utilization and health • Reports statistics to the manager • Provides the actual data service

  12. Example of a Simple Cluster • Manager Node: x.slac.stanford.edu • Data Server Nodes: a.slac.stanford.edu and b.slac.stanford.edu • Each node runs an xrootd-cmsd pair • Configuration file (the VDT config tool makes it): all.role server; all.role manager if x.slac.stanford.edu; all.manager x.slac.stanford.edu 1213 • Note: All processes can be started in any order! Which one do clients connect to?
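
  As a concrete illustration, the generated file might look roughly like this (a sketch; the comments are added here and are not from the slides). To answer the slide’s question: clients always connect to the manager (redirector) node, x.slac.stanford.edu in this example, e.g. with URLs of the form root://x.slac.stanford.edu//<path>.

    # same configuration file on every node (sketch)
    all.role server                              # default: act as a data server
    all.role manager if x.slac.stanford.edu      # the head node acts as the manager
    all.manager x.slac.stanford.edu 1213         # where the local cmsds find the manager cmsd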

  13. Preparing To Build A Cluster • Select an unprivileged user to run xrootd and cmsd • This user will own the mount points and administrative paths • Decide the exported file-system mount-point(s) • Much easier if it’s the same for all servers • Decide where admin files will be written • Log and core files (for example, /var/adm/xrootd/logs) • Other special files (for example, /var/adm/xrootd/admin) • Decide where the software will be installed • Should be the same everywhere (for example, /opt/xrootd)
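
  A minimal shell sketch of that preparation, assuming a dedicated account named xrootd and the example paths from this slide (all names are illustrative):

    # run once per node as root (sketch)
    useradd -r -m xrootd                              # unprivileged user that runs xrootd and cmsd
    mkdir -p /var/adm/xrootd/logs /var/adm/xrootd/admin
    mkdir -p /opt/xrootd                              # software install location
    chown -R xrootd:xrootd /var/adm/xrootd /opt/xrootd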

  14. Digression on Mount Points • Easy if you have a single file system per node • More involved when you have more than one • The solution is to use the oss linked file system (more later) • You can simplify your life by… • Creating a single directory in ‘/’, say /xrootd • Mounting all xrootd-handled file systems there • E.g., /xrootd/fs01, /xrootd/fs02, etc. • You can then make them all available with a single directive • oss.cache public /xrootd/* xa
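
  For illustration, that layout could be prepared as follows; the device names are assumptions, and the directive at the end is the one quoted in the slide:

    # sketch: gather all xrootd-managed partitions under /xrootd
    mkdir -p /xrootd/fs01 /xrootd/fs02
    mount /dev/sdb1 /xrootd/fs01        # example devices only
    mount /dev/sdc1 /xrootd/fs02
    chown xrootd:xrootd /xrootd/fs01 /xrootd/fs02
    # then, in the xrootd configuration file:
    # oss.cache public /xrootd/* xa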

  15. Multiple File Systems • The oss allows you to aggregate partitions • Each partition is mounted as a separate file system • An exported path can refer to all the partitions • The oss automatically handles it by creating symlinks • A file name in /atlas is a symlink to a real file in /xrootd/fs01 or /xrootd/fs02 • [Diagram: the oss linked file system; /atlas holds the exported file paths, while the mounted partitions /xrootd/fs01 and /xrootd/fs02 hold the file data] • Directives: oss.cache public /xrootd/* xa and all.export /atlas
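
  Taken together, the two directives on this slide would appear in the configuration file roughly as follows (the /atlas export path is the slide’s example; adjust it to your site):

    all.export /atlas                   # the path users see; it holds the symlinks
    oss.cache public /xrootd/* xa       # partitions that hold the actual file data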

  16. Building A Simple Cluster • Identify one* manager node and up to 64 data servers • Verify that the prep work has been properly done • Directories created, file systems mounted, etc. • All of these directories must be owned by the xrootd user • This setup normally requires root privileges, but… • Root privileges are really not needed after this point • Create a configuration file describing the setup above • Always the same one for each node • The VDT tools automate the configuration build • Install the same software & config file on each node • *Can have any number.

  17. Running A Simple Cluster • The system runs as an unprivileged user • The owner of the exported file system & special directories • Note: xrootd and cmsd refuse to run as user root • Start an xrootd-cmsd pair on each node • Can use the Startup/Shutdown scripts provided • StartXRD, StopXRD, StartCMS, StopCMS • The StartXRD.cf provides the location of various programs • Good to add a cron job “scout” to restart the pair if it dies (see the sketch below) • May want to replicate the head node • While simple, it requires some thought
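
  A minimal sketch of such a cron scout, assuming the startup scripts are installed under /opt/xrootd/bin and can safely be re-run while the daemons are already up (both are assumptions):

    # crontab entries for the xrootd user (sketch)
    */10 * * * * /opt/xrootd/bin/StartXRD >/dev/null 2>&1
    */10 * * * * /opt/xrootd/bin/StartCMS >/dev/null 2>&1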

  18. Full Cluster Overview • [Diagram: clients reach the redirector and the xrootd cluster with the xroot protocol for random I/O; GridFTP clients use the grid protocol for sequential bulk I/O through Globus ftpd, with or without xrootdFS (FUSE); each server machine runs an xrootd-cmsd pair, and a redirector plus data servers is labeled the minimum for a cluster] • Supports >200K data servers

  19. Getting to xrootd hosted data • Via the root framework • Automatic when files are named root://…. • Manually, use the TXNetFile() object • Note: an identical TFile() object will not work with xrootd! • xrdcp • The native copy command • POSIX preload library • Allows POSIX-compliant applications to use xrootd • xprep (speeding access to files) • gridFTP • FUSE & xrootdFS • Linux only: an xrootd cluster as a mounted file system • [Diagram groups these options from the native set, through simple additions, to the intensive and full grid set]
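
  Two of these paths sketched as commands; the redirector host and file path are placeholders, and the preload library name and location vary between builds, so treat both as assumptions:

    # copy a file out of the cluster with the native client
    xrdcp root://x.slac.stanford.edu//atlas/somefile.root /tmp/somefile.root

    # let an ordinary POSIX program read via xrootd (library path assumed)
    LD_PRELOAD=/opt/xrootd/lib/libXrdPosixPreload.so \
      cat root://x.slac.stanford.edu//atlas/somefile.root > /dev/null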

  20. The Optional Additions • FUSE • Allows a file system view of a cluster • Good for managing the name space • Not so good for actual data access (overhead) • People usually mount it on an admin node • GridFTP • Bulk transfer program • Can use it on top of FUSE or with the preload library

  21. Using FUSE • The redirector is mounted as a file system • Say, /xrootd on an interactive node • Use this for typical operations • dq2-get, dq2-put, dq2-ls • mv, rm, etc. • Can also use it as a base for GridFTP • Though the POSIX preload library works as well

  22. Using Grid FTP • [Diagram: the client contacts a standard GSI FTP server, the 1st point of contact; behind the firewall, the GridFTP server uses the POSIX preload library to open(/a/b/c) against a standard xrootd cluster for the subsequent data access] • FTP servers can be firewalled and replicated for scaling • The preload library can be replaced by FUSE

  23. Typical Setup • Basic xrootd cluster: a manager node plus data nodes, each an xrootd-cmsd pair • Add a dq2get node for simple grid access: either gridFTP + FUSE (xrootdFS) or gridFTP + the POSIX preload library • Even more effective if using a VMSS

  24. Other Good Things To Consider • Security • Automatic server inventory • Summary Monitoring • In addition to the VDT Gratia Storage Probe • Opportunistic Clustering • Expansive Clustering

  25. Security • Normally r/o access needs no security • Allowing writes, however, does • The VDT configuration defines a basic security model • Unix uid based (essentially like NFS) • Everyone has r/o access • Special users have r/w access • The actual model can be arbitrarily complicated • Best to stick with the simplest one that works • Privilege definitions reside in a special file

  26. Security Example • Security needs to be configured • xrootd.seclib /opt/xrootd/lib/libXrdSec.so • sec.protocol unix • ofs.authlib /opt/xrootd/authfile • ofs.authorize • Authorizations need to be declared • Placed in /opt/xrootd/authfile (as stated above) • u * /atlas r • u myid /atlas a • Much of this is done by the VDT configuration
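
  Collected into one place, the slide’s example amounts to something like this (the paths and the myid user are the slide’s placeholders; a VDT-generated configuration may differ in detail):

    # in the xrootd configuration file
    xrootd.seclib /opt/xrootd/lib/libXrdSec.so
    sec.protocol unix
    ofs.authlib /opt/xrootd/authfile
    ofs.authorize

    # in /opt/xrootd/authfile
    u *    /atlas r      # everyone gets read-only access under /atlas
    u myid /atlas a      # the special user myid gets all (read/write) access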

  27. Simple Server Inventory (SSI) • A central file inventory of each data server • Good for basic sites needing a server inventory • The inventory is normally maintained on each redirector • Automatically recreated when lost • Updated using rolling log files • Effectively no performance impact • “cns_ssi update” merges all of the inventories • Flat text file format • LFN, Mode, Physical partition, Size, Space token • The “cns_ssi list” command provides formatted output

  28. Summary Monitoring • Needed information in almost any setting • Xrootd can auto-report summary statistics • Specify the xrd.report configuration directive • Data sent to one or two locations • Use the provided mpxstats as the feeder program • Multiplexes the streams and parses the xml into key-value pairs • Pair it with any existing monitoring framework • Ganglia, GRIS, Nagios, MonALISA, and perhaps more

  29. Summary Monitoring Setup • [Diagram: the data servers report to mpxstats on the monitoring host (monhost:1999), which feeds ganglia] • Directive on each data server: xrd.report monhost:1999 all every 15s

  30. Opportunistic Clustering • Xrootd is extremely efficient with machine resources • Ultra-low CPU usage with a memory footprint of roughly 20-80MB • Ideal to cluster just about anything (e.g., PROOF clusters) • [Diagram: a redirector clusters dedicated file servers together with batch nodes whose xrootd-cmsd pairs run alongside jobs, a clustered storage system leveraging batch node disks]

  31. Opportunistic Clustering Caveats • Using batch worker node storage is problematic • Storage services must compete with actual batch jobs • At best, may lead to highly variable response time • At worst, may lead to erroneous redirector responses • Additional tuning will be required • Normally need to renice the cmsd and xrootd (see the sketch below) • As root: renice -n -10 -p cmsd_pid • As root: renice -n -5 -p xroot_pid • You must not overload the batch worker node • Especially true if exporting local work space
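
  A sketch of those renice commands using pgrep instead of hard-coded PIDs (assumes a single cmsd and a single xrootd process per node):

    # run as root on each batch worker node (sketch)
    renice -n -10 -p $(pgrep -x cmsd)
    renice -n -5  -p $(pgrep -x xrootd)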

  32. Expansive Clustering • Xrootd can create ad hoc cross-domain clusters • Good for easily federating multiple sites • This is the ALICE model of data management • Provides a mechanism for “regional” data sharing • Get missing data from close by before resorting to dq2get • The architecture allows this to be automated & demand driven • This implements a Virtual Mass Storage System

  33. Virtual Mass Storage System • root://atlas.bnl.gov/ includes the SLAC, UOM, and UTA xroot clusters • The BNL meta manager node is configured with: all.role meta manager • Each site manager (SLAC, UOM, UTA) is configured with: all.role manager and all.manager meta atlas.bnl.gov:1312 • Meta Managers can be geographically replicated!

  34. What’s Good About This? • Fetch missing files in a timely manner • Revert to dq2get when the file is not in the regional cluster • Sites can participate in an ad hoc manner • The cluster manager sorts out what’s available • Can use R/T WAN access when appropriate • Can significantly increase the WAN xfer rate • Using multiple-source torrent-style copying

  35. Expansive Clustering Caveats • Federation & Globalization are easy if . . . . • Federated servers are not blocked by a firewall • No ALICE xroot servers are behind a firewall • There are alternatives . . . . • Implement firewall exceptions • Need to fix all server ports • Use proxy mechanisms • Easy for some services, more difficult for others • All of these have been tried in various forms • A site’s specific situation dictates the appropriate approach

  36. In Conclusion. . . • Xrootd is a lightweight data access system • Suitable for resource-constrained environments • Human as well as hardware • Geared specifically for efficient data analysis • Supports various clustering models • E.g., PROOF, batch node clustering and WAN clustering • Has the potential to greatly simplify Tier 3 deployments • Distributed as part of the OSG VDT • Also part of the CERN root distribution • Visit http://xrootd.slac.stanford.edu/

  37. Acknowledgements • Software Contributors • Alice: Derek Feichtinger • CERN: Fabrizio Furano, Andreas Peters • Fermi: Tony Johnson (Java) • Root: Gerri Ganis, Bertrand Bellenet, Fons Rademakers • STAR/BNL: Pavel Jackl • SLAC: Jacek Becla, Tofigh Azemoon, Wilko Kroeger • LBNL: Alex Sim, Junmin Gu, Vijaya Natarajan (BeStMan team) • Operational Collaborators • BNL, FZK, IN2P3, RAL, UVIC, UTA • Partial Funding • US Department of Energy • Contract DE-AC02-76SF00515 with Stanford University
