
Facility Implications for WAN Data Access and Caching






Presentation Transcript


  1. Facility Implications for WAN Data Access and Caching. Torre Wenaus, BNL. US ATLAS Facilities Workshop, Santa Cruz, CA, November 13, 2012

  2. WAN Data Access and Caching: Motivations and Benefits • Simplification of data and workflow management; no more choreography in moving data to processing or vice versa • Data movement is inherent in WAN access/caching, not driven by the DDM system • More efficient use of storage: reduce replica counts, share replicas directly across sites. Important as storage becomes an increasingly scarce commodity • More efficient use of the network: fine granularity in moving only the data needed; data moved on demand, only when it is needed • Caching avoids redundant copying of the same data to the same destination within the cache lifetime • No replication latency after the brokerage decision • PD2P’s ‘first jobset to a Tier 1’ policy no longer needed • Low support and maintenance load; can be attractive for Tier 3s • Attractive for cloud utilization; minimal persistent in-cloud storage, inbound data transfer kept to the essential • Escape the protocol/middleware jungle! Data access via standard, efficient, direct protocols

  3. WAN Data Access and Caching Viability • I/O optimization work in ROOT and the experiments in recent years has made WAN data access efficient enough to be viable • ATLAS: “TTreeCache shown to be of substantial benefit to direct I/O performance, essential for WAN” • Caching can further improve efficiency • An addressable shared cache at the destination reduces latency for jobs sharing/reusing data and reduces load on the source • Supported by ROOT • Requires cache awareness in workflow brokerage to drive re-use • Requires cache support infrastructure at processing sites • Asynchronous pre-fetch at the client further improves efficiency • Reduces the effect of irreducible WAN latencies on processing throughput • Supported by ROOT (debugged over the summer, ready for serious testing)

  4. WAN Data Access & Caching Activity • First survey of relevant and needed work was at the ATLAS Distributed Computing Technical Interchange Meeting, Dubna, June 2011 • Decided to establish this as an ATLAS R&D area at the June 2012 S&C Week • Work has been going on for some time; establish an R&D activity to define objectives and priorities and track progress • Held a kickoff meeting for the R&D activity Sep 10, 2012 at CERN • Surveyed the state of relevant activities • Reviewed status of the initial objectives established in June • Discussed implications of WAN data access and caching • Activities benefiting from, affected by, or dependent on WAN data access • e.g. facility implications of WAN data access & caching • Wishlists/requirements for other systems • I/O framework, xrootd federation, DDM system, PanDA, monitoring (system and user), DA/PAT, ... • Objectives and priorities for the next ~6 months • Discussed at the October S&C Week, together with event server ideas • Side discussion on technical event I/O implications and implementation approaches for event servers

  5. ROOT TTreeCache • ROOT’s TTreeCache optimizes the read performance of a TTree by minimizing the number of disk reads and ordering them • In a ‘learning phase’ it discovers which TBranches are needed • After that, a read on any cached TBranch also fetches the other cached TBranches with a single read • TTreeCache can have a huge impact on read performance • Reduces the number of disk reads by several orders of magnitude • Impact depends on the system setup and use case • However, there have been restrictions in the usability of TTreeCache • Only one automatic TTreeCache per TFile • ATLAS and other experiments use several trees per file (event data, references, auxiliary transient/persistent converter extensions, …) • Slow learning phase, no caching while learning • An impediment to activating TTreeCache by default; slow start-up (a usage sketch follows below)
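
A minimal PyROOT sketch of enabling TTreeCache on a tree opened over the WAN. The redirector URL, file path, cache size and learning-phase length are illustrative assumptions, not ATLAS defaults.

```python
import ROOT

# Open a file over the WAN via xrootd (illustrative URL, not a real endpoint)
f = ROOT.TFile.Open("root://redirector.example.org//atlas/data/sample.AOD.pool.root")
tree = f.Get("CollectionTree")

# Enable TTreeCache: after the learning phase, one network read fills the cache
# for all learned branches instead of one read per branch basket
tree.SetCacheSize(30 * 1024 * 1024)   # 30 MB cache (tuning assumption)
tree.SetCacheLearnEntries(100)        # length of the learning phase, in entries

for i in range(tree.GetEntries()):
    tree.GetEntry(i)   # baskets are served from the cache once learning is done
```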

  6. ROOT Caching and Pre-Fetching Developments • Support for more than one TTreeCache per file added • Needed by and implemented by ATLAS • Working on turning TTreeCache on by default • Don’t leave it to chance whether users reap the benefit • Investigation and optimization of TTreeCache start-up done by an ANL summer student • Next: testing, evaluating readiness for on-by-default • Ongoing work to improve the learning phase by pre-reading all branches (where possible) rather than relying on individual branch reads • Asynchronous pre-fetching now usable; provides an efficient way to improve CPU/runtime efficiency over the WAN (sketched below) • Stress testing over the summer exposed bugs, now resolved in v5.34 • Caching of TTreeCache blocks with reusable block addresses is available • Enables large latency reductions on reuse (where we can achieve the reuse)
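
A sketch of turning on the asynchronous pre-fetch described above. "TFile.AsyncPrefetching" is the ROOT 5.34-era rootrc setting; the local block-cache directory key is an assumption and should be checked against the ROOT release in use.

```python
import ROOT

# Enable asynchronous pre-fetching of TTreeCache blocks (ROOT 5.34-era rootrc key)
ROOT.gEnv.SetValue("TFile.AsyncPrefetching", 1)

# Assumed rootrc key for caching pre-fetched blocks on local disk for reuse;
# verify the exact key against your ROOT release before relying on it
ROOT.gEnv.SetValue("Cache.Directory", "file:/tmp/root-block-cache")

f = ROOT.TFile.Open("root://redirector.example.org//atlas/data/sample.AOD.pool.root")
tree = f.Get("CollectionTree")
tree.SetCacheSize(30 * 1024 * 1024)   # pre-fetch works on top of the TTreeCache
```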

  7. ROOT Synchronous Tree Processing Fons Rademakers

  8. ROOT Asynchronous Pre-Fetch • Works for any access protocol • Works in addition to local caching • And in addition to a site proxy server Fons Rademakers

  9. Caching of TTreeCache Blocks Fons Rademakers

  10. Fons Rademakers, Dirk Duellmann

  11. ATLAS Event I/O • Key enabler for WAN data access: efficient event data I/O with minimal transactions between application and storage • ROOT TTreeCache provides the necessary foundation for achieving this • ATLAS event I/O optimizations have allowed us to reap the benefits • Release 17 (the current production release) introduced a newly optimized ROOT storage layout for RDO, ESD and AOD to better match the transient event store and support single-event reads more efficiently • Events are now grouped (up to ~10) contiguously in baskets, minimizing I/O transactions for event retrieval (previously events were split among baskets); a write-side sketch follows below • ATLAS conclusion from working with the new format and TTreeCache: “TTreeCache shown to be of substantial benefit to direct I/O performance, essential for WAN” • Wall-time efficiency measurements to date show 50-80% efficiency, increasing with increasing TTreeCache buffer size
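
The "events grouped contiguously in baskets" point relies on ROOT's basket-clustering machinery; below is a generic sketch of the write-side knob, not the actual ATLAS persistency code, and the cluster size of 10 is only the slide's approximate figure.

```python
import array
import ROOT

fout = ROOT.TFile.Open("optimized_layout.root", "RECREATE")
tree = ROOT.TTree("CollectionTree", "WAN-friendly layout sketch")

pt = array.array("f", [0.0])
tree.Branch("pt", pt, "pt/F")

# Flush baskets every ~10 entries so the data for a group of events sits
# contiguously on disk and a single-event read touches few baskets
tree.SetAutoFlush(10)

for i in range(1000):
    pt[0] = float(i)
    tree.Fill()

tree.Write()
fout.Close()
```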

  12. AOD Read Performance with New Layout Peter van Gemmeren

  13. WAN Access vs. Copy: Fraction of Files Read • WAN access pays off relative to copying locally when only a fraction of a file is read • What fraction of files is actually read? • Depends on many details (the analysis, the format) but studies are now being made; here is one (from dCache logs) • 85% of all reads of group D3PDs read less than 20% of the file • 76% of all reads of user files read less than 20% of the file • There is a lot of unread data that is not transferred with WAN access Doug Benjamin

  14. WAN data access and FAX • WAN data access may be (is probably?) the killer app for the federation • Xroot is a well-optimized, hardened, scalable data access tool built for HEP • Naturally extended to distributed operation through the global redirector layer • Provides uniformity of data access (global catalog, uniform access protocol including direct access) across the distributed store • For WAN data access across a plethora of sites with different flavors of storage services, the federation encapsulates the heterogeneity and provides the transparency of data location that matches the transparency of access in WAN data access • Eases the implementation of support for WAN access at the higher levels (e.g. PanDA) • It provides a natural and capable development sandbox for WAN access, and a production deployment context as well • Benefits from the active development going on with FAX, its monitoring, and its expansion across ATLAS

  15. xroot federation or http federation? • WAN data access is a fit with… any federation? which federation(s)? • The xroot federation draws on long experience with a hardened, scalable, trusted system that supports many storage flavors and is naturally extendable to a distributed federation through the global redirector mechanism • But what about http, the foundation of the most hardened, scalable distributed computing platform in existence? • Belated attention is now going to http as a data transport protocol for LHC computing • Outcome from the TEGs: all data services are directed to support http in the future, and we see this emerging (e.g. DPM, FTS3) • With http, it is easy to incorporate hardened, scalable, open-source web cache proxies into the architecture, augmenting/complementing other caching approaches • A year ago xroot proxy caches were on the table, being prototyped in the EOS work, but things went quiet (that effort ended; work is underway in CMS) • At the application layer, at least, we should be technology agnostic wherever possible (see the sketch below)
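
One reason to stay technology agnostic at the application layer: ROOT resolves the transport from the URL scheme, so the same open call works against an xroot federation or an http endpoint. Both URLs below are placeholders, and http(s) support depends on the TWebFile/Davix plugin being available in the ROOT build.

```python
import ROOT

# Same application code, different federation/protocol underneath
urls = [
    "root://fax-redirector.example.org//atlas/rucio/user/sample.root",  # xrootd
    "https://webdav.example.org/atlas/rucio/user/sample.root",          # http(s)
]

for url in urls:
    f = ROOT.TFile.Open(url)   # ROOT picks the I/O plugin from the URL scheme
    if f and not f.IsZombie():
        tree = f.Get("CollectionTree")
        print(url, tree.GetEntries() if tree else "no tree")
        f.Close()
```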

  16. A Critical Need: Monitoring • WAN access in analysis depends crucially on I/O performance, and analysis codes and behaviors are highly variable • It is possible if not probable that users with antique/poorly optimized I/O using WAN access will kill their performance and hold queue slots with very poor CPU/walltime efficiency • Also, WAN access will be a tricky optimization and tuning problem for some time; it is important for sites, operations and developers to have a clear view into system performance • Consequently, detailed monitoring of individual and aggregate job performance is critical • e.g. expose to users in PanDA monitoring, and to sites/operators in aggregate performance monitoring, the I/O, throughput and efficiency performance relative to expectations • Drive users to improve poor performance, motivated by self-interest • Allow sites to spot problems, diagnose, optimize • Building blocks are mostly in place but must be assembled into effective monitoring enhancements • Monitoring of the underlying infrastructure and site-to-site performance (which feeds into e.g. brokerage decisions) is also vital and in pretty good shape; see Ilija’s talk

  17. Initial Objectives [and status], Defined at the June S&C Week • 1) Make use of the xrootd federation to recover missing production files, at both direct-access sites and copy-to-WN sites • ‘Nearest redirector’ for a site [still] to be added to schedconfig • Minor pilot changes required [see Paul’s talk] • 2) Incorporate a site-site cost matrix derived from HC WAN I/O performance tests to extend analysis job brokerage to WAN access sites (a toy sketch follows below) • Oracle-resident matrix with an API for access by PanDA • Use DQ2 to determine the sites holding input data, as at present • Use the cost matrix to add to the brokerage pool sites not holding the data but with good WAN access to the data-holding site(s), weighted lower than sites with data, and up to a maximum (~10?) of added sites • Proceed with standard brokerage that also factors in site load • If sites with data are busy, WAN access sites that aren’t may get the job • Later extension: add network bandwidth/performance to the cost calculation
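
A toy sketch (not PanDA code) of objective 2: extend the candidate-site pool with WAN-access sites drawn from a site-to-site cost matrix, weight them below the data-holding sites, and cap how many are added. The site names, weights and the cap of 10 are illustrative assumptions.

```python
# Illustrative cost matrix from HammerCloud WAN I/O tests: cost[src][dst] in (0, 1],
# higher meaning better measured WAN read performance from src storage to dst CPUs.
cost = {
    "SITE_A": {"SITE_B": 0.8, "SITE_C": 0.3},
}

MAX_WAN_SITES = 10   # cap on WAN-access sites added to the pool (the slide's ~10)
WAN_PENALTY = 0.5    # weight WAN sites below sites that already hold the data

def brokerage_pool(data_sites, all_sites):
    """Return (site, weight) candidates for a job whose input lives at data_sites."""
    pool = [(site, 1.0) for site in data_sites]
    wan_candidates = []
    for dst in all_sites:
        if dst in data_sites:
            continue
        # best WAN cost from any site holding the data to this destination
        best = max(cost.get(src, {}).get(dst, 0.0) for src in data_sites)
        if best > 0.0:
            wan_candidates.append((dst, WAN_PENALTY * best))
    wan_candidates.sort(key=lambda entry: entry[1], reverse=True)
    return pool + wan_candidates[:MAX_WAN_SITES]

# Standard brokerage (site load, releases, ...) would then run over this pool.
print(brokerage_pool(["SITE_A"], ["SITE_A", "SITE_B", "SITE_C"]))
```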

  18. Objectives continued • 3) Try using the xrootd federation rather than DDM subscription for dispatch block replication in production • File (file set) copy completion notification to PanDA via ActiveMQ (rather than an http callback); needs xrootd extensions [done] and an ActiveMQ service for PanDA use [done] (a consumer sketch follows below) • Can serve as a stress test for the xrootd federation • Timescale for these 3 initial objectives: a few months • Then: tune, refine, adapt to usage experience • Longer term: incorporate cache awareness in brokerage • Longer term: event servers
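
Objective 3's copy-completion notification could look roughly like the consumer below, sketched with the stomp.py STOMP client; the broker host, queue name and message format are assumptions, not the actual PanDA/FAX interfaces.

```python
import json
import time

import stomp  # stomp.py STOMP client (assumed transport; 4.x-style API)

class CopyCompletionListener(stomp.ConnectionListener):
    def on_message(self, headers, body):
        # Assumed payload: which dispatch block / files finished copying
        msg = json.loads(body)
        print("copy complete:", msg.get("dataset"), msg.get("files"))
        # here PanDA would mark the dispatch block as transferred

conn = stomp.Connection([("mq-broker.example.org", 61613)])
conn.set_listener("copy-complete", CopyCompletionListener())
conn.connect("user", "pass", wait=True)
conn.subscribe(destination="/queue/panda.copy.complete", id=1, ack="auto")

time.sleep(60)   # keep the consumer alive to receive notifications
conn.disconnect()
```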

  19. Cache-Aware Brokerage • Initial plan: adapt existing capabilities in PanDA to catalog ‘cached data hints’ • A memcached file memory supporting WN file cataloging for pcache-based WN-level brokerage • Or the ARC cache index used by NorduGrid • New plan: use Rucio • Rucio was always the preferred plan, but our request for a ‘cache hint’ attribute in the location catalog wasn’t initially included • It now has been, in the latest Rucio design • So we’ll be able to use Rucio to record ‘this data was cached at that site in the past’ so we can preferentially broker there • At a site with local caching, assume that files recently used there are cached there • Then age away the cache hint over time (sketched below) • Later step: knowledge of event-level caching…? • Utilize the event table to be added anyway for JEDI
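
A small sketch of the "age away the cache hint" idea: weight a site by how recently the data was seen cached there, decaying to zero after an assumed cache lifetime. The decay shape and the one-week lifetime are illustrative, not a Rucio or PanDA policy.

```python
import time

CACHE_LIFETIME = 7 * 24 * 3600   # assume hints are worthless after ~a week

def cache_hint_weight(last_cached_at, now=None):
    """Linearly decaying brokerage bonus for data recently cached at a site."""
    now = now or time.time()
    age = now - last_cached_at
    if age < 0 or age > CACHE_LIFETIME:
        return 0.0
    return 1.0 - age / CACHE_LIFETIME

# Example: a dataset last seen cached at a site two days ago
print(cache_hint_weight(time.time() - 2 * 24 * 3600))   # ~0.71
```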

  20. Event Servers (1) • With viable WAN data delivery at the event level, we can examine the utility and practicality of going to finer granularity in our processing • Currently jobs are assigned file(s) to process, whether local or remote • The real objective is to process tasks: large ensembles of events • Assigning the work at the file level brings some disadvantages • Optimal job partitioning driven by input file size and/or processing time doesn’t necessarily (often doesn’t) result in optimal file sizes on output • Responsibility for file processing can trap a processing thread at a looper event; e.g. one athenaMP process holds up completion and stalls 8 cores • Loss of a site or its storage kills the job, with the potential loss of a lot of CPU

  21. Event Servers (2) • Alternative: event servers dispatch events to consumers on WNs • Possible event delivery mechanisms: streaming the actual events, or (less demanding of new infrastructure and leveraging what we have) sending event addresses/tokens, with the WN client doing the event retrieval itself (see the toy sketch below) • Asynchronously retrieving/buffering input events could remove WAN latency • On the output side, fine-grained output handling: send event clusters asynchronously in real time to an aggregation site for merging • Presents a substantial event-level bookkeeping challenge! • Event consumers are well suited to time-limited, temporary resources • Cheap cloud time on the spot market (up to a factor of ~10 below baseline cost on EC2) • Checkpointable opportunistic resources • Borrowed resources requiring exit on short notice • Lower latency in liberating production resources to absorb analysis spikes • On the table for the next-generation ATLAS production system, JEDI
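
A toy sketch of the "send event addresses/tokens" variant: the server hands out small ranges of event indices from a task, and a worker-node consumer fetches those events itself (e.g. over the federation) and ships its outputs off for merging. The range size and token format are illustrations of the idea, not JEDI's design.

```python
from collections import deque

class EventServer:
    """Hands out ranges of event indices for a task instead of whole files."""
    def __init__(self, total_events, chunk=100):
        self.todo = deque(
            (start, min(start + chunk, total_events))
            for start in range(0, total_events, chunk)
        )

    def next_range(self):
        return self.todo.popleft() if self.todo else None

    def requeue(self, evrange):
        # a consumer died or was evicted: put its range back for someone else
        self.todo.append(evrange)

server = EventServer(total_events=1000, chunk=100)

# A consumer on a worker node would loop like this, reading only the events it
# was assigned and sending its outputs asynchronously to an aggregation site.
while True:
    evrange = server.next_range()
    if evrange is None:
        break
    first, last = evrange
    print("processing events [%d, %d)" % (first, last))
```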

  22. Summarizing Some Facility Issues for Discussion • ROOT TTreeCache caching at the site • (Much) greater utility if the cache is on shared storage, not WNs… support? Performance? • Other transparent (at the app level) caching at the site • Caching proxies for xroot, http, … • Presumably at the site perimeter, transparently used and shared by all WNs • Both uncataloged at the app level, but PanDA usage info => Rucio cache hints can provide the record • Cache management and deletion for all mechanisms • Tied to the aging away of cache hints • ‘Distributed SE/cache’ == the xrootd federation? Leveraged by PanDA brokerage to SE pools with good data availability based on the cost matrix • No doubt much learning/optimization/tuning here

  23. Facility Issues (2) • Monitoring, monitoring, monitoring • Including transmitting performance information to end users to drive self-motivated optimization • Data serving issues with WAN access (i.e. random read) replacing (some) file replication (i.e. serial read) activity • Random-optimized SSD front-end cache to protect the storage from (local and remote) random access • Difficult optimization to create an affordable, effective cache with significant reuse (SLAC, DESY) • WAN access is utterly dependent on network performance and latency; PanDA was originally designed specifically to avoid these dependencies • Workflows with this dependency are new territory both for facilities and for PanDA; it will be a learning experience! • Wider scope of facilities we can potentially utilize, particularly in the longer term if/when we have event serving • Opportunistic resources with a light footprint (minimal data flow, flexibly short usage duration)… farms where we have to shrink usage fast, with checkpointing, with variable availability, inherently ephemeral (e.g. BOINC), …

  24. Summary • WAN access can bring greater efficiencies to resource utilization • Improved storage efficiency through greater direct data sharing and fewer replicas • Improved processing efficiency by broadening the pool of eligible processing sites for a given workload • ROOT and experiment level I/O improvements and optimizations have demonstrably made WAN data access viable • All the more so when augmented by caching schemes if they can be made effective through sufficient reuse • FAX provides a natural, capable context for WAN access • FAX over ROOT I/O provides WAN-capable optimized access • Uniformity of access protocol across sites simplifies the implementation at the application and WMS (PanDA) levels • Time to examine the facility implications over and above those already being addressed in the context of FAX

  25. Thank You • Thanks to Fons Rademakers, Dirk Duellmann, Peter van Gemmeren, Doug Benjamin, Kaushik De, Tadashi Maeno, Paul Nilsson, Rob Gardner, Ilija Vukotic, Guenter Duckeck, Borut Kersevan, Andrej Filipcic, Ricardo Rocha and others for materials and contributions

  26. More Information • 9/10/12 WAN data access & caching meeting, CERN • https://indico.cern.ch/conferenceDisplay.py?confId=201691 • 10/17/12 ATLAS S&C Week, WAN access & federated store session • https://indico.cern.ch/conferenceDisplay.py?confId=169697
