Run IIb DAQ / Online status
Stu Fuess, Fermilab
Introduction
• In order to meet the DAQ and Online computing requirements for Run IIb we plan:
  • Level 3 farm node increase
    • Brown, Univ. of Washington, Fermilab
  • Host system replacements / upgrades
    • Hardware: Fermilab; Software: various
  • Control system node upgrade
    • Fermilab
• The requirements, plans, status, and future activities are discussed below
Level 3 farm nodes
• Need: greater L3 processing capability for higher luminosities
• Current and planned farm composition:

  Dual nodes | "GHz" | Plan
  48         | 1.0   | to be removed
  34         | 1.6   | existing
  32         | 2.0   | existing
  96         | 2.2   | to be added

  332 GHz-equivalent of CPUs now
  659 GHz-equivalent of CPUs for the start of Run IIb
• For example: 1 kHz @ 500 ms-GHz per event requires 500 GHz of CPUs (a quick check follows below)
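These totals can be verified with a few lines (a minimal sketch; the only assumption beyond the table is 2 CPUs per dual node, and the quoted totals are evidently truncated from 332.8 and 659.2):

```python
# Sketch: L3 farm capacity in GHz-equivalents, assuming 2 CPUs per dual node.
farm = {  # label: (nodes, GHz per CPU)
    "to be removed": (48, 1.0),
    "existing 1.6":  (34, 1.6),
    "existing 2.0":  (32, 2.0),
    "to be added":   (96, 2.2),
}

def ghz(nodes, ghz_per_cpu, cpus_per_node=2):
    return nodes * cpus_per_node * ghz_per_cpu

now = sum(ghz(n, g) for k, (n, g) in farm.items() if k != "to be added")
run2b = now - ghz(48, 1.0) + ghz(96, 2.2)   # drop old racks, add new ones
print(now, run2b)  # 332.8 and 659.2, quoted on the slide as 332 / 659

# Required capacity: 1 kHz of events at 500 ms-GHz per event
rate_hz, ms_ghz_per_event = 1000, 500
print(rate_hz * ms_ghz_per_event / 1000, "GHz needed")  # 500.0 GHz
```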
Level 3 farm nodes, cont'd.
• Plan: single purchase, summer 2005, of $210K* of nodes
  • 3 racks of 32 = 96 nodes, plus infrastructure
• Strategy:
  • This is an "off the shelf" purchase, but a major one
  • Similar to Computing Division farms purchases
  • Used a Run IIa purchase (a 32-node addition) to refine the procedure
• History:
  • Req preparation begun 1/04/04
  • Req submitted 1/29/04
  • PO created 3/23/04 ($51.5K)
  • Prototype system delivery 4/21/04
  • Full order delivery 6/21/04
  • Operational in Level 3 on 6/23/04
• A 5-month process! Thanks to the Computing Division for help!
* Unburdened FY02 $
Level 3 farm nodes, cont'd.
• Other preparations
  • Will replace 3 racks / 48 nodes of older processors with 3 racks / 96 nodes
  • Existing electrical circuits and cooling are sufficient for the new racks
  • Will need an additional 48 network ports on the Level 3 and Online switches
• Impact
  • Installation somewhat disruptive, as 3 racks (48 nodes) of older processors must be removed to make room for the new ones
  • The remaining 66 nodes stay operational during installation
• Schedule
  • Plan for arrival of nodes at the start of the 2005 shutdown
  • Start purchase process ~3/05
  • Continued replacement with upgraded nodes will be necessary over the duration of Run IIb (Operating funds)
Host systems
• Need
  • Replace the 3-node Alpha cluster, which has these functions:
    • Event data logger, buffer disk, transfer to FCC
    • Oracle database
    • NFS file server
    • User database
• Plan
  • Replace with Linux servers
  • Install a number (~4) of clusters which supply "services"
  • Shared Fibre Channel (FC) storage and failover software provide flexibility and high availability
  • $247K* for processor and storage upgrades
* Unburdened FY02 $
DØ Online Linux clusters
[Diagram: clients reach four clusters (DAQ Services, Database, File Server, Online Services) through a network switch; the clusters connect through two Fibre Channel switches to a SAN holding legacy and new RAID and JBOD arrays]
Cluster Configuration
• Cluster
  • Service: name, domain, check interval, script
  • Cluster member: name, device, device special file, mount point, file system, mount options
  • Power controller IP address
  • NFS export: export directory
  • NFS clients: client names / addresses, export options
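The pieces of information the failover manager needs can be sketched as follows (a minimal illustration of the fields listed above; all names, paths, and addresses are hypothetical placeholders, not the real DØ configuration, and in Red Hat Cluster Suite itself they live in an XML configuration file):

```python
# Sketch: cluster service configuration fields from the slide above.
service = {
    "name": "nfs-home",             # service name (hypothetical)
    "domain": "failover-domain-1",  # which members may run the service
    "check_interval": 30,           # health-check period, seconds
    "script": "/etc/init.d/nfs",    # start/stop/status script
}

member = {
    "name": "d0ol-node1",                  # hypothetical host name
    "device": "/dev/vg_san/lv_home",       # LVM volume on shared FC storage
    "mount_point": "/export/home",
    "file_system": "ext3",
    "mount_options": "rw,sync",
    "power_controller_ip": "192.168.1.10", # used to fence a failed node
}

nfs_export = {
    "export_directory": "/export/home",
    "clients": ["d0ol-client1", "d0ol-client2"],  # hypothetical clients
    "export_options": "rw,no_root_squash",
}
```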
Cluster Services
• Details of the configuration of cluster services …
• Using Run IIa experience of how things actually work!
Host systems, cont'd.
• System tests
  • Performed tests of Fibre Channel, network, and storage rates
  • Network: capable of wire rate (1 Gb/s)
  • Storage: target is 25 MB/s for the event path
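A storage-rate test of this kind can be approximated with a short script (a sketch only, not the actual test harness used; the file path and sizes are hypothetical):

```python
import os, time

# Sketch: sequential write throughput vs. the 25 MB/s event-path target.
TARGET_MB_S = 25
CHUNK = 1 << 20          # 1 MiB per write
TOTAL = 512              # 512 MiB in total

buf = b"\0" * CHUNK
start = time.time()
with open("/scratch/rate_test.dat", "wb") as f:   # hypothetical path
    for _ in range(TOTAL):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())  # time data to disk, not just to the page cache
elapsed = time.time() - start

rate = TOTAL / elapsed
print(f"{rate:.1f} MB/s ({'OK' if rate >= TARGET_MB_S else 'below target'})")
```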
Host systems, cont'd.
• System tests, cont'd.
  • Checked relative performance of dual- vs quad-processor systems
  • Conclude: dual-processor nodes, at 20% of the cost, are sufficient for all but possibly the highest-I/O DAQ data-logging nodes
• Potential issues / concerns
  • The Linux 2.4 kernel has problems with multiple high-rate buffered I/O streams; much better in the 2.6 kernel; alleviated somewhat with the use of direct I/O (see the sketch below)
    • Expect to see 2.6 next spring/summer in Fermi Linux
    • The design avoids this situation
  • Fibre Channel redundant paths are somewhat complicated
    • Expect to use a "manual" solution, but it is solvable ($$) with commercial Secure Path software
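The direct-I/O workaround looks roughly like this (a minimal, Linux-only sketch using O_DIRECT to bypass the page cache; the file path is hypothetical, and this is an illustration rather than the DØ data-logger code):

```python
import mmap, os

# Sketch: writing with O_DIRECT, which bypasses the 2.4 buffered-I/O path.
# O_DIRECT requires block-aligned buffers and sizes; an anonymous mmap is
# page-aligned, which satisfies the alignment requirement.
BLOCK = 4096
buf = mmap.mmap(-1, BLOCK)           # page-aligned, BLOCK-sized buffer
buf.write(b"x" * BLOCK)

fd = os.open("/scratch/direct.dat",  # hypothetical path
             os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
try:
    os.write(fd, buf)                # goes to disk, not the page cache
finally:
    os.close(fd)
buf.close()
```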
Host systems, cont'd.
• Cluster implementation
  • Red Hat Cluster Suite
    • Available open source, distributed in Fermi Linux
    • But also a supported ($) Red Hat Application Suite product
  • No kernel modifications required
  • Can use non-homogeneous distributions
  • Can be made to work with non-homogeneous hardware
  • Use LVM as the virtual storage layer
• Cluster tests
  • Storage device access
  • NFS failover
    • File reads/writes transparently complete when the active node is turned off and the service transitions to the backup node (see the probe sketch below)
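A client-side probe for watching such a failover might look like this (a sketch under assumed conditions, not the actual test used; the NFS mount point is hypothetical):

```python
import time

# Sketch: append a timestamped line to an NFS file once per second.
# During a failover the writes should stall briefly, then complete,
# with no sequence number lost.
LOGFILE = "/mnt/nfs_test/failover_probe.log"  # hypothetical mount

for seq in range(3600):                 # probe for up to an hour
    t0 = time.time()
    with open(LOGFILE, "a") as f:
        f.write(f"{seq} {time.ctime(t0)}\n")
    stall = time.time() - t0
    if stall > 1.0:                     # a long write hints at a failover
        print(f"write {seq} stalled {stall:.1f}s")
    time.sleep(1)
```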
Host systems, cont'd.
• Status
  • A 2-node cluster has been created
    • Single-path FC SAN
    • Service failover demonstrated
  • 6 new servers delivered 6/21/04
  • Will construct 4 clusters during the summer/fall 2004 shutdown
• Schedule
  • Fall 04: attempt to move everything!
    • DAQ, Oracle, NFS, etc.
    • Need involvement of software system experts
    • Dual-path SAN still a challenge
    • DAB2 rack space juggling a challenge
    • Disruptive (possibly a day or two)! Essential functions will have to be relocated and debugged
  • Summer 05: enhance with the best processors
Control System
• Need:
  • The current control system processors (~100 of them)
    • are becoming obsolete and unmaintainable
      • Lost 2 nodes, repaired 5 during Run IIa
    • are limiting functionality in some areas
      • Tracker readout crates are CPU limited
• Plan:
  • Upgrade ~1/3 of the control system processors
    • either with the latest generation of processors (PowerPC) which run the current software (VxWorks), or transition to a different architecture (e.g. Intel) with a new OS (e.g. Linux)
    • Inclination is to simply purchase an appropriate number of the current processor family and minimize software changes
• Strategy:
  • $140K* to upgrade processors
  • Scheme for replacement on the next slide
* Unburdened FY02 $
Control System, cont'd.
• Impact:
  • Potential short disruptions in control system functions as processors are replaced
• Schedule:
  • Recently purchased the latest PowerPC processor for testing
    • Testing EPICS and D0 controls software
  • Follow evolutionary developments of the OS (VxWorks) and the control system framework (EPICS)
  • Purchases in advance of summer 05, then incremental installation of nodes
Conclusion
• Three activities:
  • Level 3
  • Host systems
  • Control system
• Level 3 is an "addition of nodes"
• Host system changes are the most revolutionary
  • Attempting to perform the upgrade this summer/fall
  • Improvements in functionality
• Control system is a "replacement of nodes"
  • With evolutionary progress of the VxWorks and EPICS software
• Expect a nearly seamless transition, ready for Run IIb