
Scientific Computing Resources



  1. Scientific Computing Resources Ian Bird – Computer Center Hall A Analysis Workshop December 11, 2001 Ian.Bird@jlab.org

  2. Overview • Current Resources • Recent evolution • Mass storage – HW & SW • Farm • Remote data access • Staffing levels • Future Plans • Expansion/upgrades of current resources • Other computing – LQCD • Grid Computing • What is it? – Should you care?

  3. Jefferson Lab Scientific Computing Environment, November 2001 (diagram) – components connected by a Gigabit Ethernet switching fabric:
  • 10 TB work areas – SCSI disk, RAID 5
  • 16 TB cache disk – SCSI + EIDE disk, RAID 0 on Linux servers
  • 2 TB farm cache – SCSI, RAID 0 on Linux servers
  • Unix, Linux, and Windows desktops
  • Batch farm cluster – 350 Linux nodes (400 MHz – 1 GHz), 10,000 SPECint95, managed by LSF + Java layer + web interface
  • Interactive analysis – 2 Sun 450 (4-processor), 2 4-processor Intel/Linux
  • JASMine-managed mass storage systems – 2 STK silos; 10 9940, 10 9840, and 8 Redwood drives; 10 Solaris/Linux data movers with 300 GB stage
  • Lattice QCD cluster – 40 Alpha/Linux (667 MHz), 256 Pentium 4 (Q2 FY02?), managed by PBS + web portal
  • Grid gateway and bbftp service; CUE general services; JLAB network backbone; Internet (ESnet: OC-3)

  4. JLAB Farm and Mass Storage Systems, November 2001 (diagram)
  • Batch farm – 350 processors: 175 dual nodes, each connected at 100 Mb to a 24-port switch with Gb uplink (8 switches)
  • Mass storage – 2 STK silos; 10 9940, 10 9840, and 8 Redwood drives; 10 Solaris/Linux data movers, each with 300 GB stage and Gb uplink; Fibre Channel direct from CLAS
  • Foundry BigIron 8000 switch – 256 Gb backplane, ~45 of 60 Gb ports in use
  • Site router – CUE and general services
  • CH-router – incoming data from Halls A & C
  • Work disks – 4 MetaStor systems, each with 100 Mb uplink; total 5 TB SCSI, RAID 5
  • Work disk farm – 4 Linux servers, each with Gb uplink; total 4 TB SCSI, RAID 5
  • Cache disk farm – 20 Linux servers, each with Gb uplink; total 16 TB SCSI/IDE, RAID 0

  5. CPU Resources • Farm • Upgraded this summer with 60 dual 1 GHz P III (4 CPUs per 1U rackmount) • Retired the original 10 dual 300 MHz nodes • Now 350 CPUs (400, 450, 500, 750, 1000 MHz) • ~11,000 SPECint95 • Deliver > 500,000 SI95-hours / week • Equivalent to 75 1 GHz CPUs • Interactive • Solaris: 2 E450 (4-processor) • Linux: 2 quad systems (4 x 450 MHz, 4 x 750 MHz) • If required, the batch systems can be used (via LSF) to add interactive CPU to these Linux front ends

  6. Intel Linux Farm (photos) • First purchases: 9 duals per 24" rack • Last summer: 16 duals (2U) + 500 GB cache (8U) per 19" rack • Recently: 5 TB IDE cache disk (5 x 8U) per 19" rack

  7. Tape storage • Added 2nd silo this summer • Required moving a room full of equipment • Added 10 9940 drives (5 as part of the new silo) • Current: 8 Redwood, 10 9840, 10 9940 • Redwood: 50 GB @ 10 MB/s (helical scan, single reel) • 9840: 20 GB @ 10 MB/s (linear, mid-load cassette, fast) • 9940: 60 GB @ 10 MB/s (linear, single reel) • 9840 & 9940 are very reliable • 9840 & 9940 have upgrade paths that use the same media • 9940 2nd generation – 100 GB @ 20 MB/s ?? • Add 10 more 9940 drives this FY (budget?) • Replace Redwoods (reduce to 1-2) • Requires copying 4500 tapes – started – budget for tape? • Reliability, end of support(!)

  8. Disk storage • Added cache space • For frequently used silo files, to reduce tape accesses • Now have 22 cache servers • 4 dedicated to the farm (~2 TB) • ~16 TB of cache space allocated to experiments • Some bought and owned by groups • Dual Linux systems, Gb network, ~1 TB disk, RAID 0 • 9 SCSI systems, 13 IDE systems – performance approximately equivalent • Good balance of CPU, network throughput, and disk space • This is a model that will scale by a few factors, but probably not by 10 (though there is as yet no solution to that) • Looking at distributed file systems for the future – to avoid NFS complications – GFS, etc., but no production-level system yet • N.B. accessing data with jcache does not need NFS, and is fault tolerant • Added work space • Added 4 systems to reduce load on fs3,4,5,6 (the original /work) • Dual Linux systems, Gb network, ~1 TB disk, SCSI RAID 5 • Performance on all systems is now good • Problems: some issues with IBM 75 GB ATA drives, 3ware IDE RAID cards, Linux kernels • System is reasonably stable, but not yet perfect – but alternatives are not cost-effective

  9. JASMine • JASMine – Mass Storage System software • Rationale – why write another MSS? • Had been using OSM • Not scalable, not supported, reached the limits of the software; had to run 2 instances to get sufficient drive capacity • Hidden from users by the "Tapeserver" • Java layer that • Hid the complexities of the OSM installations • Implemented tape disk buffers (stage) • Provided get, put, and managed-cache (read copies of archived data) capabilities • Migration from OSM • Production environment… • Timescales driven by experiment schedules, need to add drive capacity • Retain the user interface • Replace the "osmcp" function – tape to disk, drive and library management • Choices investigated • Enstore, Castor, (HPSS) • Timescales, support, adaptability (missing functionality/philosophy – cache/stage) • Provide the missing functions within the Tapeserver environment, clean up and rework • JASMine (JLAB Asynchronous Storage Manager)

  10. Architecture • JASMine • Written in Java • For data movement, as fast as C code • JDBC makes using and changing databases easy • Distributed data movers and cache managers • Scalable to the foreseeable needs of the experiments • Provides scheduling: • Optimizing file access requests • User and group (and location-dependent) priorities • Off-site cache or ftp servers for data export • JASMine cache software • Stand-alone component – can act as a local or remote client, allows remote access to JASMine • Can be deployed to a collaborator to manage a small disk system, and as the basis for coordinated data management between sites • A cache manager runs on each cache server • Hardware is not an issue – just need a JVM, a network, and a disk to store files

  11. Software cont. • MySQL database used by all servers. • Fast and reliable. • SQL • Data Format • ANSI standard labels with extra information • Binary data • Support to read legacy OSM tapes • cpio, no file labels • Protocol for file transfers • Writes to cache are never NFS • Reads from cache may be NFS
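To make the database bullets concrete, here is a minimal sketch of a JDBC metadata lookup of the kind JASMine performs against MySQL; the connection URL, table and column names, and stub path are illustrative assumptions, not the actual JASMine schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MetadataLookup {
    public static void main(String[] args) throws Exception {
        // Swapping MySQL for another database only changes this URL and the
        // JDBC driver on the classpath; the query code stays the same.
        String url = "jdbc:mysql://dbhost.jlab.org/jasmine";   // assumed host and schema
        try (Connection conn = DriverManager.getConnection(url, "reader", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT volume, position, size FROM files WHERE filename = ?")) {
            stmt.setString(1, "/mss/halla/e01001/raw/run1234.dat");   // hypothetical stub path
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("tape=%s position=%d size=%d%n",
                            rs.getString("volume"), rs.getLong("position"), rs.getLong("size"));
                }
            }
        }
    }
}
```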

  12. JASMine architecture (diagram) • A client contacts the Request Manager, which uses the Database, the Scheduler, and the Log Manager • Each Data Mover runs a Dispatcher, a Cache Manager (managing its disk), a Volume Manager, and a Drive Manager (managing its tape drive) • Library Managers handle tape mounts • Components communicate over service, database, and log connections

  13. JASMine Services • Database • Stores metadata • Also presented to users on an NFS filesystem as "stubfiles" • But could equally be presented as e.g. a web service, LDAP, … • You do not need to access the stubfiles – you just need to know the filenames • Tracks status and locations of all requests, files, volumes, drives, etc. • Request Manager • Handles user requests and queries • Scheduler • Prioritizes user requests for tape access • priority = share / (0.01 + (num_a * ACTIVE_WEIGHT) + (num_c * COMPLETED_WEIGHT)) • Host vs. user shares, farm priorities • Log Manager • Writes out log and error files and databases • Sends out notices of failures • Library Manager • Mounts and dismounts tapes and performs other library-related tasks
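A small sketch of the scheduler's priority formula above; the weight constants and the reading of num_a/num_c as active and completed request counts are assumptions for illustration, not JASMine's actual values.

```java
// Share-based priority: a user or host with many active or recently completed
// tape requests sinks in the queue; the 0.01 term avoids division by zero.
public class TapePriority {
    static final double ACTIVE_WEIGHT = 1.0;     // assumed weight per active request
    static final double COMPLETED_WEIGHT = 0.1;  // assumed weight per recently completed request

    static double priority(double share, int numActive, int numCompleted) {
        return share / (0.01 + numActive * ACTIVE_WEIGHT + numCompleted * COMPLETED_WEIGHT);
    }

    public static void main(String[] args) {
        System.out.println(priority(10.0, 0, 0));   // idle user: highest priority
        System.out.println(priority(10.0, 5, 20));  // busy user: priority drops
    }
}
```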

  14. JASMine Services – 2 • Data Mover • Dispatcher • Keeps track of available local resources and starts requests the local system can work on • Cache Manager • Manages a disk or disks for pre-staging data to and from tape • Sends and receives data to and from clients • Volume Manager • Manages tape availability • Drive Manager • Manages tape drive usage

  15. User Access • jput • Put one or more files on tape • jget • Get one or more files from tape • jcache • Copy one or more files from tape to cache • jls • Get metadata for one or more files • jtstat • Status of the request queue • Web interface • Query status and statistics for the entire system

  16. Web interface (screenshot)

  17. (figure)

  18. Data Access to cache • NFS • Directory of links points the way. • Mounted read-only by the farm. • Users can mount read-only on their desktop. • Jcache • Java client. • Checks to see if files are on cache disks. • Will get/put files from/to cache disks. • More efficient than NFS, avoids NFS hangs if server dies, etc., but users like NFS

  19. Disk Cache Management • Disk Pools are divided into groups • Tape staging. • Experiments. • Pre-staging for the batch farm. • Management policy set per group • Cache – LRU files removed as needed. • Stage – Reference counting. • Explicit – manual addition and deletion. • Policies are pluggable – easy to add
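The pluggable-policy idea can be pictured as a small interface with one implementation per group type; the interface, class, and field names below are illustrative assumptions, not JASMine's actual API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** A policy decides which cached files may be evicted to free space. */
interface CachePolicy {
    List<CachedFile> selectVictims(List<CachedFile> files, long bytesNeeded);
}

class CachedFile {
    String path;
    long sizeBytes;
    long lastAccessMillis;
    int referenceCount;   // used by the stage policy
}

/** "Cache" groups: remove least-recently-used files as needed. */
class LruPolicy implements CachePolicy {
    public List<CachedFile> selectVictims(List<CachedFile> files, long bytesNeeded) {
        List<CachedFile> byAge = new ArrayList<>(files);
        byAge.sort(Comparator.comparingLong(f -> f.lastAccessMillis));
        List<CachedFile> victims = new ArrayList<>();
        long freed = 0;
        for (CachedFile f : byAge) {
            if (freed >= bytesNeeded) break;
            victims.add(f);
            freed += f.sizeBytes;
        }
        return victims;
    }
}

/** "Stage" groups: reference counting; only files no request still needs are removable. */
class StagePolicy implements CachePolicy {
    public List<CachedFile> selectVictims(List<CachedFile> files, long bytesNeeded) {
        List<CachedFile> victims = new ArrayList<>();
        for (CachedFile f : files) {
            if (f.referenceCount == 0) victims.add(f);
        }
        return victims;
    }
}

/** "Explicit" groups: nothing is removed automatically; deletion is manual. */
class ExplicitPolicy implements CachePolicy {
    public List<CachedFile> selectVictims(List<CachedFile> files, long bytesNeeded) {
        return List.of();
    }
}
```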

  20. Protocol for file moving • Simple, extensible protocol for file copies • Messages are Java serialized objects passed over streams • Bulk data transfer uses raw data transfer over TCP • Protocol is synchronous – all calls block • Asynchrony & multiple requests via threading • CRC32 checksums at every transfer • Fairer than NFS • A session may make many connections
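A minimal sketch of this message-then-bulk-data pattern, assuming a hypothetical FileHeader message class; the real JASMine message types and wire details differ, but the serialized-object header, raw byte body, and CRC32 check follow the bullets above.

```java
import java.io.*;
import java.net.Socket;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

/** Hypothetical header message: a Java serialized object sent before the bulk data. */
class FileHeader implements Serializable {
    String path;
    long length;
    long crc32;   // checksum computed by the sender
}

public class ProtocolSketch {
    /** Sender: write the serialized header, then stream the file body as raw bytes. */
    static void send(Socket socket, File file, FileHeader header) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream());
        out.writeObject(header);                 // synchronous: the receiver blocks on this
        try (InputStream in = new FileInputStream(file)) {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);            // bulk data over the same TCP stream
            }
        }
        out.flush();
    }

    /** Receiver: read the header, then the body, verifying the CRC32 checksum. */
    static void receive(Socket socket, File target) throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(socket.getInputStream());
        FileHeader header = (FileHeader) in.readObject();
        CRC32 crc = new CRC32();
        try (OutputStream out = new CheckedOutputStream(new FileOutputStream(target), crc)) {
            byte[] buf = new byte[64 * 1024];
            long remaining = header.length;
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) throw new EOFException("connection broke mid-transfer");
                out.write(buf, 0, n);
                remaining -= n;
            }
        }
        if (crc.getValue() != header.crc32) {
            throw new IOException("CRC32 mismatch – transfer corrupted");
        }
    }
}
```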

  21. Protocol for file moving – 2 • Cache server extends the basic protocol • Adds database hooks for the cache • Adds hooks for cache policies • Additional message types were added • High-throughput disk pool • Database shared by many servers • Any server in the pool can look up a file's location • But data transfer is always direct between the client and the node holding the file • Adding servers and disks to the pool increases throughput with no overhead • Provides fault tolerance

  22. Example: get from cache (diagram) • cacheClient.getFile("/foo", "halla"); • The client (e.g. a farm node) sends a locate request to any cache server ("Where is /foo?") • The shared database shows that cache3 has /foo; the client receives the locate reply • The client contacts the appropriate server and initiates a direct transfer ("Sending /foo") • Returns true on success

  23. Example: simple put to cache (diagram) • putFile("/quux", "halla", 123456789); • The client (e.g. a data mover) asks any cache server "Where can I put /quux?" • The shared database finds a server with room (cache4) • The data is then transferred directly to that server
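The two examples above can be sketched as a client that first asks any server in the pool where a file lives, then opens a second connection to the node that actually holds it. The message classes, port, and method shape here are assumptions for illustration; the real jcache client rides on the serialized-object protocol of slides 20-21.

```java
import java.io.*;
import java.net.Socket;

class LocateRequest implements Serializable { String path; String group; }
class LocateReply   implements Serializable { String hostHoldingFile; }

public class GetFromCache {
    static final int CACHE_PORT = 9123;   // assumed port

    static boolean getFile(String anyServer, String path, String group) {
        try (Socket s = new Socket(anyServer, CACHE_PORT)) {
            // 1. Ask any server in the pool where the file lives; it consults
            //    the shared database ("Where is /foo?").
            ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
            LocateRequest req = new LocateRequest();
            req.path = path;
            req.group = group;
            out.writeObject(req);
            out.flush();
            ObjectInputStream in = new ObjectInputStream(s.getInputStream());
            LocateReply reply = (LocateReply) in.readObject();

            // 2. Contact the node that holds the file and transfer the data
            //    directly from it (no relaying through the first server).
            try (Socket data = new Socket(reply.hostHoldingFile, CACHE_PORT)) {
                // ... initiate direct transfer, verify CRC32, write to local disk ...
            }
            return true;   // slide: "Returns true on success"
        } catch (Exception e) {
            return false;
        }
    }
}
```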

  24. Fault Tolerance • Dead machines do not stop the system • Data movers work independently • Unfinished jobs will restart on another mover • Cache server failures only impact NFS clients • The system recognizes a dead server and will re-cache the file from tape • If users did not use NFS they would never see a failure – just an extended access time • Exception handling for • Receive timeouts • Refused connections • Broken connections • Complete garbage on connections
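A sketch of the failover behaviour these bullets describe: when a transfer hits a timeout, a refused connection, or a broken connection, the client retries against another mover or cache server, so the user sees a longer access time rather than an error. The interface and helper below are illustrative assumptions.

```java
import java.io.IOException;
import java.util.List;

public class FailoverSketch {
    /** A transfer attempt against one named server. */
    interface Transfer {
        void run(String server) throws IOException;
    }

    /** Try each server in turn; only fail if every one of them is unreachable. */
    static void withFailover(List<String> servers, Transfer transfer) throws IOException {
        IOException last = null;
        for (String server : servers) {
            try {
                transfer.run(server);
                return;                      // success – earlier failures stay invisible to the caller
            } catch (IOException e) {        // timeout, refused, broken, or garbled connection
                last = e;                    // note it and move on to the next mover / cache server
            }
        }
        throw (last != null) ? last : new IOException("no servers available");
    }
}
```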

  25. Authorization and Authentication • Shared secret for each file transfer session • Session authorization by policy objects • Example: receive 5 files from user@bar • Plug-in authenticators • Establish shared secret between client and server • No clear text passwords • Extend to be compatible with GSI
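One common way to realize a shared-secret session without clear-text passwords is a keyed-hash challenge/response, sketched below; HMAC-SHA1 and the message layout are assumptions here, not necessarily what JASMine's authenticator plug-ins actually use.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;
import java.util.Arrays;

public class SharedSecretAuth {
    static byte[] hmac(byte[] secret, byte[] challenge) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(secret, "HmacSHA1"));
        return mac.doFinal(challenge);
    }

    public static void main(String[] args) throws Exception {
        byte[] sessionSecret = "per-session-secret".getBytes();   // established per file-transfer session

        // Server sends a random challenge; client answers with the keyed hash.
        byte[] challenge = new byte[16];
        new SecureRandom().nextBytes(challenge);
        byte[] clientResponse = hmac(sessionSecret, challenge);

        // Server recomputes and compares; the secret itself never crosses the wire.
        boolean authenticated = Arrays.equals(clientResponse, hmac(sessionSecret, challenge));
        System.out.println("authenticated: " + authenticated);
    }
}
```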

  26. JASMine Bulk Data Transfers • Model supports parallel transfers • Many files at once, but not bbftp style • But could replace the stream class with a parallel stream • For bulk data transfer over WANs • Firewall issues • Client initiates all connections

  27. Architecture: disk pool hardware
  • SCSI disk servers – dual Pentium III 650 (later 933) MHz CPUs; 512 MB 100 MHz SDRAM ECC; ASUS P2B-D motherboard; NetGear GA620 Gigabit Ethernet PCI NIC; Mylex eXtremeRAID 1100 with 32 MB cache; Seagate ST150176LW (qty. 8) – 50 GB Ultra2 SCSI in hot-swap disk carriers; CalPC 8U rack-mount case with redundant 400 W power supplies
  • IDE disk servers – dual Pentium III 933 MHz CPUs; 512 MB 133 MHz SDRAM ECC; Intel STL2 or ASUS CUR-DLS motherboard; NetGear GA620 or Intel PRO/1000 T Server Gigabit Ethernet PCI NIC; 3ware Escalade 6800; IBM DTLA-307075 (qty. 12) – 75 GB Ultra ATA/100 in hot-swap disk carriers; CalPC 8U rack-mount case with redundant 400 W power supplies

  28. Cache Performance • Matches network, disk I/O, and CPU performance with the size of the disk pool: • ~800 GB • 2 x 850 MHz • Gb Ethernet

  29. Cache status (screenshot)

  30. Performance – SCSI vs IDE (disk array / file system: ext2)
  • SCSI disk server – 8 x 50 GB disks in a RAID-0 stripe over 2 SCSI controllers: 68 MB/s single disk write (79 MB/s burst), 52 MB/s single disk read (56 MB/s burst)
  • IDE disk server – 6 x 75 GB disks in a RAID-0 stripe: 64 MB/s single disk write (77 MB/s burst), 48 MB/s single disk read (49 MB/s burst)

  31. Performance – NFS vs Jcache
  • NFS v2 over UDP, 16 clients, rsize=8192 and wsize=8192, reads:
  • SCSI disk servers – 7700 NFS ops/sec at 80% CPU utilization; 11000 NFS ops/sec burst at 83% CPU; 32 MB/s at 83% CPU
  • IDE disk servers – 7700 NFS ops/sec at 72% CPU utilization; 11000 NFS ops/sec burst at 92% CPU; 32 MB/s at 72% CPU
  • Jcache, 16 clients, reads:
  • SCSI disk servers – 32 MB/s at 100% CPU utilization
  • IDE disk servers – 32 MB/s at 100% CPU utilization

  32. JASMine system performance • End-to-end performance • i.e. tape load, copy to stage, network copy to client • Aggregate sustained performance of 50 MB/s is regularly observed in production • During stress tests, up to 120 MB/s was sustained for several hours • A data mover with 2 drives can handle ~15 MB/s (disk contention is the limit) • Expect the current system to handle 150 MB/s; it is scalable by adding data movers & drives • N.B. this is performance to a network client! • Data handling • Currently the system regularly moves 2-3 TB per day in total • ~6000 files per day, ~2000 requests

  33. (figure)

  34. (figure)

  35. JASMine performance (figure)

  36. Tape migration • Begin migration of 5000 Redwood tapes to 9940 • Procedure written • Uses any/all available drives • Uses staging to allow re-packing of tapes • Expected to last 9-12 months

  37. Typical Data Flows (diagram)
  • Raw data from Halls A & C: < 10 MB/s over Gigabit Ethernet
  • Raw data from Hall B: > 20 MB/s over Fibre Channel
  • 25-30 MB/s flows each way between mass storage and the batch farm cluster (350 Linux nodes, 400 MHz – 1 GHz, 10,000 SPECint95; managed by LSF + Java layer + web interface)
  • 10 TB work areas – SCSI disk, RAID 5; 16 TB cache disk – SCSI + EIDE disk, RAID 0 on Linux servers

  38. How to make optimal use of the resources • Plan ahead! • As a group: • Organize data sets in advance (~a week) and use the cache disks for their intended purpose • Hold frequently used data to reduce tape access • In a high data rate environment no other strategy works • When running farm productions • Use jsub to submit many jobs in one command – as it was designed • Optimizes tape accesses • Gather output files together on the work disks and make a single jput for a complete tape's worth of data

  39. Remote data access • Tape copying is deprecated • Expensive, time consuming (for you and us), and inefficient • We have an OC-3 (155 Mbps) connection that is under-utilized; filling it will get us upgraded to OC-12 (622 Mbps) • At the moment we do often have to coordinate with ESnet and peers to ensure a high-bandwidth path, but this is improving as Grid development continues • Use network copies • bbftp service • Parallel, secure ftp – optimizes use of WAN bandwidth • Future • Remote jcache • The cache manager can be deployed remotely – demonstration Feb 02 • Remote silo access, policy-based (unattended) data migration • GridFTP, bbftp, bbcp • Parallel, secure ftp (or ftp-like) • As part of a Grid infrastructure • PKI authentication mechanism

  40. (Data-) Grid Computing

  41. Particle Physics Data Grid Collaboratory Pilot • Who we are: four leading Grid computer science projects and six international high energy and nuclear physics collaborations • The problem at hand today: petabytes of storage, teraops/s of computing, thousands of users, hundreds of institutions, 10+ years of analysis ahead • What we do: develop and deploy Grid services for our experiment collaborators, and promote and provide common Grid software and standards

  42. PPDG Experiments
  • ATLAS – A Toroidal LHC ApparatuS at CERN. Runs 2006 on. Goals: TeV physics – the Higgs and the origin of mass… http://atlasinfo.cern.ch/Atlas/Welcome.html
  • BaBar – at the Stanford Linear Accelerator Center. Running now. Goals: study CP violation and more. http://www.slac.stanford.edu/BFROOT/
  • CMS – the Compact Muon Solenoid detector at CERN. Runs 2006 on. Goals: TeV physics – the Higgs and the origin of mass… http://cmsinfo.cern.ch/Welcome.html/
  • D0 – at the D0 colliding beam interaction region at Fermilab. Runs soon. Goals: learn more about the top quark, supersymmetry, and the Higgs. http://www-d0.fnal.gov/
  • STAR – Solenoidal Tracker At RHIC at BNL. Running now. Goals: quark-gluon plasma… http://www.star.bnl.gov/
  • Thomas Jefferson National Laboratory. Running now. Goals: understanding the nucleus using electron beams… http://www.jlab.org/

  43. PPDG Computer Science Groups
  • Condor – develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing on large collections of computing resources with distributed ownership. http://www.cs.wisc.edu/condor/
  • Globus – developing fundamental technologies needed to build persistent environments that enable software applications to integrate instruments, displays, and computational and information resources that are managed by diverse organizations in widespread locations. http://www.globus.org/
  • SDM – Scientific Data Management Research Group – optimized and standardized access to storage systems. http://gizmo.lbl.gov/DM.html
  • Storage Resource Broker – client-server middleware that provides a uniform interface for connecting to heterogeneous data resources over a network and cataloging/accessing replicated data sets. http://www.npaci.edu/DICE/SRB/index.html

  44. Delivery of End-to-End Applications & Integrated Production Systems – to allow thousands of physicists to share data & computing resources for scientific processing and analyses • PPDG focus: • Robust data replication • Intelligent job placement and scheduling • Management of storage resources • Monitoring and information of global services • Relies on Grid infrastructure: • Security & policy • High speed data transfer • Network management • (Diagram: operators & users on top of resources – computers, storage, networks)

  45. Project Activities, End-to-End Applications and Cross-Cut Pilots • Project Activities are focused Experiment – Computer Science collaborative developments • Replicated data sets for science analysis – BaBar, CMS, STAR • Distributed Monte Carlo production services – ATLAS, D0, CMS • Common storage management and interfaces – STAR, JLAB • End-to-End Applications are used in experiment data handling systems to give real-world requirements, testing and feedback • Error reporting and response • Fault tolerant integration of complex components • Cross-Cut Pilots for common services and policies • Certificate Authority policy and authentication • File transfer standards and protocols • Resource monitoring – networks, computers, storage

  46. Year 0.5-1 Milestones (1) Align milestones to Experiment data challenges: • ATLAS – production distributed data service – 6/1/02 • BaBar – analysis across partitioned dataset storage – 5/1/02 • CMS – Distributed simulation production – 1/1/02 • D0 – distributed analyses across multiple workgroup clusters – 4/1/02 • STAR – automated dataset replication – 12/1/01 • JLAB – policy driven file migration – 2/1/02

  47. Year 0.5-1 Milestones (2) • Common milestones with EDG: • GDMP – robust file replication layer – joint project with EDG Work Package (WP) 2 (Data Access) • Support of Project Month (PM) 9 WP6 TestBed milestone; will participate in integration fest at CERN – 10/1/01 • Collaborate on PM21 design for WP2 – 1/1/02 • Proposed WP8 application tests using the PM9 testbed – 3/1/02 • Collaboration with GriPhyN: • SC2001 demos will use common resources, infrastructure and presentations – 11/16/01 • Common, GriPhyN-led grid architecture • Joint work on monitoring proposed

  48. Year ~0.5-1 "Cross-cuts" • Grid file replication services used by >2 experiments: • GridFTP – production releases • Integrate with D0-SAM, STAR replication • Interfaced through SRB for BaBar, JLAB • Layered use by GDMP for CMS, ATLAS • SRB and Globus replication services • Include robustness features • Common catalog features and API • GDMP/Data Access layer continues to be shared between EDG and PPDG • Distributed job scheduling and management used by >1 experiment: • Condor-G, DAGMan, Grid-Scheduler for D0-SAM, CMS • Job specification language interfaces to distributed schedulers – D0-SAM, CMS, JLAB • Storage resource interface and management • Consensus on API between EDG, SRM, and PPDG • Disk cache management integrated with data replication services

  49. Year ~1 other goals • Transatlantic application demonstrators: • BaBar data replication between SLAC and IN2P3 • D0 Monte Carlo job execution between Fermilab and NIKHEF • CMS & ATLAS simulation production between Europe and the US • Certificate exchange and authorization • DOE Science Grid as CA? • Robust data replication • Fault tolerant • Between heterogeneous storage resources • Monitoring services • MDS2 (Metacomputing Directory Service)? • Common framework • Network, compute, and storage information made available to scheduling and resource management

  50. PPDG activities as part of the global Grid community • Coordination with other Grid projects in our field: • GriPhyN – Grid Physics Network • European DataGrid • Storage Resource Management collaboratory • HENP Data Grid Coordination Committee • Participation in experiment and Grid deployments in our field: • ATLAS, BaBar, CMS, D0, STAR, JLAB experiment data handling systems • iVDGL/DataTAG – International Virtual Data Grid Laboratory • Use DTF computational facilities? • Active in standards committees: • Internet2 HENP Working Group • Global Grid Forum
