The RAL Tier 1/A CPU and disk farms underwent significant hardware upgrades in Spring 2003 to meet increasing demand. The CPU farm gained 80 new dual-processor Xeon servers, and the disk farm added dual P4 Xeon servers with large RAID 5 disk arrays. New performance monitoring and accounting tools are being introduced, alongside a modernised helpdesk system. Security remains a growing concern, and better management tools are needed as farm operations become more complex.
RAL Tier 1/A Status
HEPiX-HEPNT, NIKHEF, May 2003
CPU Farm – Existing Hardware
• 108 dual-processor machines (450MHz, 600MHz and 1GHz)
• Up to 1GB RAM
• Desktop towers on warehouse shelves
• 156 dual-processor 1400MHz PIII machines
  • 133MHz FSB, 1GB RAM each
  • 1U rackmount, remote power switching
  • RedHat 7.2
New Hardware – Spring 2003+
• 80 dual-processor 1U rackmount units
  • 2 x 2.66GHz P4 Xeons @ 533MHz FSB
  • Hyper-Threading
  • 2048MB memory
  • 2 x 1Gb/s NICs (on-board)
  • RedHat 7.3
• 3 racks, remote power switching
• Next delivery expected Summer 2003
Operating Systems
• RedHat 6.2 service will close at the end of May
• RedHat 7.2 service has been in production for BaBar for 6 months
• New RedHat 7.3 service now available for LHC and other experiments
• Testing/benchmarking on the new Xeon systems
• The increasing demand for security updates is becoming problematic
Disk Farm – Existing Hardware
• 2002: 26 servers, each with 2 external RAID arrays, giving 1.7TB of RAID 5 disk per server
• Excellent performance, well-balanced system
• Problems with a bad batch of Maxtor drives: many failures and a high error rate; all 620 drives have now been replaced by Maxtor
• Still outstanding problems with the Accusys controller failing to eject bad drives from the RAID set
Disk Farm – Spring 2003+
• Recent upgrade to the disk farm:
  • 11 dual P4 Xeon servers (2.4GHz, 1024MB RAM, PCI-X), each with 2 Infortrend IFT-6300 arrays attached via Ultra160 SCSI
  • 12 Maxtor 200GB DiamondMax Plus 9 drives per array, RAID 5
• Not yet in production – a few snags:
  • The originally tendered Maxtor MaXLine Plus II drive was found not to exist!
  • The Infortrend array has a 2TB limit per RAID set – pushing for a firmware update
  • 11 + 1 spare better than 2 x 6 – 5Gb over 11 systems (capacity arithmetic sketched below)
• Contact Nick White (N.G.H.White@rl.ac.uk) for more info
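The 2TB-per-RAID-set limit is what drives the layout choice. A rough comparison of usable capacity per array, assuming RAID 5 costs one drive's worth of space per set (drive size from the slide; the layout labels are illustrative, not necessarily the configuration chosen at RAL):

#!/usr/bin/perl -w
# Back-of-envelope usable capacity for one IFT-6300 array with 12 x 200GB drives.
use strict;

my $drive_gb = 200;
my $one_set_plus_spare = (11 - 1) * $drive_gb;    # 11-drive RAID 5 + hot spare: 2000GB
my $two_sets           = 2 * (6 - 1) * $drive_gb; # two 6-drive RAID 5 sets:     2000GB
my $full_set           = (12 - 1) * $drive_gb;    # single 12-drive RAID 5:      2200GB, over the 2TB limit

printf "11+1 spare: %dGB   2x6: %dGB   12-drive: %dGB\n",
       $one_set_plus_spare, $two_sets, $full_set;

Usable space is the same for the first two layouts, but the single 11-drive set keeps one filesystem per array and a hot spare, while the full 12-drive set needs the firmware fix for RAID sets above 2TB.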
New Projects
• Basic fabric performance monitoring (Ganglia)
• Resource CPU accounting (based on PBS accounting records/MySQL)
• New CA in production
• New batch scheduler (Maui)
• Deploy new helpdesk (May)
Ganglia
• Urgently needed live performance and utilisation monitoring:
• RAL Ganglia Monitoring: http://ganglia.gridpp.rl.ac.uk/
• Scalable solution based on multicast (a configuration sketch follows)
• Very rapidly deployable; reasonable support on all Tier1A hardware
• See: http://ganglia.sourceforge.net/
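For reference, this is roughly what the multicast configuration looks like in the ganglia 2.x gmond.conf; the channel and port below are the ganglia defaults, and the cluster/owner names are illustrative rather than the actual RAL settings.

# /etc/gmond.conf (ganglia 2.x style, illustrative values)
name          "Tier1A CPU Farm"   # cluster name shown in the web front-end
owner         "RAL"               # responsible party
mcast_channel 239.2.11.71         # default ganglia multicast group
mcast_port    8649                # default ganglia multicast port
mcast_ttl     1                   # keep announcements on the local network

Every node runs the same gmond and both announces its own metrics and listens on the multicast group, which is what makes the deployment so quick: there is no central list of hosts to maintain.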
PBS Accounting Software
• Need to keep track of system CPU and disk usage
• Home-grown PBS accounting package (Derek Ross):
  • Upload PBS and disk stats into MySQL (a sketch of this step is shown below)
  • Process with a Perl DBI script
  • Serve via Apache
• http://www.gridpp.rl.ac.uk/stats
• Contact Derek (D.Ross@rl.ac.uk) for more info
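The slide names the ingredients (PBS accounting records, MySQL, Perl DBI, Apache) but not the code itself. A minimal sketch of the upload step is given below; the database, table and column names are assumptions for illustration, not the actual RAL package.

#!/usr/bin/perl -w
# Minimal sketch: load PBS job-end accounting records into MySQL.
use strict;
use DBI;

my $dbh = DBI->connect("DBI:mysql:database=pbsacct;host=localhost",
                       "accounting", "secret", { RaiseError => 1 });
my $sth = $dbh->prepare(
    "INSERT INTO jobs (jobid, user, queue, cput, walltime) VALUES (?,?,?,?,?)");

# PBS accounting logs hold one record per line:
#   date time;record_type;job_id;key=value key=value ...
open(LOG, "</var/spool/pbs/server_priv/accounting/20030520")
    or die "cannot open accounting file: $!";
while (my $line = <LOG>) {
    chomp $line;
    my ($stamp, $type, $jobid, $attrs) = split /;/, $line, 4;
    next unless defined $type && $type eq 'E';      # 'E' = job finished
    my %a = map { split /=/, $_, 2 } split ' ', $attrs;
    $sth->execute($jobid, $a{user}, $a{queue},
                  $a{'resources_used.cput'}, $a{'resources_used.walltime'});
}
close LOG;
$dbh->disconnect;

A script like this can be run nightly from cron against the previous day's accounting file, with the Apache front-end then reporting per-user and per-experiment totals straight from the MySQL tables.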
Maui / PBS
• The Maui scheduler has been in production for the last 4 months
• Allows extremely flexible scheduling with many features. But...
• Not all of it works; we have done much work with the developers on fixes
• Major problem: Maui schedules on wall-clock time, not CPU time. Had to bodge it! (the distinction is illustrated below)
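The slide does not say what the workaround was. For context, PBS itself distinguishes the two limits, and it is the first of these that the farm accounts against while Maui reserves against the second (values illustrative):

qsub -l cput=24:00:00     job.sh   # limit on total CPU time consumed by the job
qsub -l walltime=12:00:00 job.sh   # limit on elapsed wall-clock time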
New Helpdesk Software
• The old helpdesk is email-based and unfriendly
• With additional staff, we urgently need to deploy a new solution
• Expect the new system to be based on free software, probably Request Tracker
• Hope the deployed system will also meet the needs of the Testbed and may also satisfy Tier 2 sites
• Expect deployment by end of May
• http://requestracker.gridpp.rl.ac.uk
Outstanding issues / worries
• We have to run many distinct services:
  • Fermi Linux
  • RH 6.2/7.2/7.3…
  • EDG testbeds, LCG…
• Farm management is getting very complex; we need better tools and automation
• Security is becoming a big concern again