This report outlines the current status and future plans for hardware and infrastructure within the Grid computing environment. It discusses various issues in the machine room, including power, air conditioning, and space constraints. Hardware sections cover current server configurations, storage systems, and cluster management. Ongoing efforts involve improving backup systems, implementing configuration management with cfengine, and ensuring system redundancy. Monitoring strategies using Ganglia and Nagios are reviewed, alongside plans to merge user and grid cluster accounts, offering a holistic view of the computing environment.
RAL PPD Site Report
Chris Brew, SciTech/PPD
Outline
• Hardware
  • Current: Grid, User
  • New
• Machine Room Issues
  • Power, Air Conditioning & Space
• Plans
  • Tier 3
  • Configuration Management
  • Common Backup
• Issues
  • Log Processing
  • Windows
Current Grid Cluster
• CPU:
  • 52 x dual Opteron 270 dual-core CPUs, 4 GB RAM
  • 40 x dual PIV Xeon 2.8 GHz, 2 GB RAM
  • All running SL3 glite-WN
• Disk:
  • 8 x 24-slot dCache pool servers
    • Areca ARC-1170 24-port RAID card
    • 22 x WD5000YS in RAID 6 (storage) – 10 TB
    • 2 x WD1600YD in RAID 1 (system)
    • 64-bit SL4, single large XFS file system
• Misc:
  • GridPP front ends running Torque, LFC/NFS, R-GMA and the dCache head node
  • Ex-WNs running the CE and a DHCPD/TFTP PXE-boot server
  • Network now at 10 Gb/s, but the external link is still limited by the firewall
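As a quick arithmetic check on the quoted pool capacity: 22 x 500 GB WD5000YS drives in RAID 6 lose two disks' worth of space to parity, which matches the ~10 TB figure above.

```shell
disks=22
disk_gb=500        # WD5000YS is a 500 GB drive
parity_disks=2     # RAID 6 sacrifices two disks' capacity to parity
usable_gb=$(( (disks - parity_disks) * disk_gb ))
echo "${usable_gb} GB usable"   # prints "10000 GB usable", i.e. the ~10 TB quoted
```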
Current User Cluster
• User Interfaces
  • 7 ex-WNs, from dual 1.4 GHz PIII to dual 2.8 GHz PIV
  • 6 x SL3 (1 test, 2 general, 3 experiment)
  • 1 SL4 test UI
• 2 x Dell PowerEdge 1850 disk servers
  • Dell PERC 4/DC RAID card
  • 6 x 300 GB disks in a Dell PowerVault 220 SCSI shelf
  • Serve home and experiment areas via NFS
    • Master copy held on one server
    • rsync'd to the backup server 1–4 times daily
    • Home area backed up to the ADS daily
  • Same hardware as the Windows solution, so common spares
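The master-to-backup mirror described above could be sketched as a short rsync script run from cron; the paths and backup host name here are invented for illustration, not the site's real layout.

```shell
#!/bin/sh
# Hypothetical sketch of the master -> backup mirror of the NFS areas.
# SRC and DEST are assumptions; substitute the real export paths.
SRC="/export/home/"                 # master copy on the primary server
DEST="backupserver:/export/home/"   # second PowerEdge 1850

# --archive preserves permissions, ownership and timestamps;
# --delete makes DEST an exact mirror of SRC.
rsync --archive --delete "$SRC" "$DEST"
```

Running it 1–4 times daily is then a single crontab line, e.g. `0 */6 * * * /usr/local/sbin/mirror-home.sh`.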
Other Miscellaneous Boxen
• Extra boxes
  • Install/scratch/internal web server
  • Monitoring server
  • External web server
  • Minos CVS server
  • NIS master
  • Security box (central logger and Tripwire)
• New kit (undergoing burn-in now)
  • 32 x dual Intel Woodcrest 5130 dual-core CPUs, 8 GB RAM (Streamline)
  • 13 Viglen HS160a disk servers
Machine Room Issues
• Too much equipment for our small departmental computer room
• Taken over the adjacent "Display" area
  • Historically part of the computer room
  • Already has a raised floor and three-phase power, though a new distribution panel is needed for the latter
  • Shares air conditioning with the computer room
• Refurbished the power distribution, installed kit and powered on:
  • Temperature in the new area rose to 26 °C; temperature in the old area fell by 1 °C
  • A consulting engineer was called in by Estates to rebalance the air conditioning. Very successful: old/new areas are now at 21.5/22.7 °C
  • He also calculated the total capacity of the plant at 50 kW of cooling; we are currently using ~30 kW
• Next step is to refurbish the power in the old machine room to reinstate the three-phase supply
Monitoring
• 2 different monitoring systems:
  • Ganglia: monitors per-host metrics and records histories to produce graphs; good for trending and for viewing current and historic status
  • Nagios: monitors "services" and issues alerts; good for raising alerts and seeing "what's currently bad". See other talk
• Given the current lack of effort, the programme is to get as much monitoring as possible into Nagios so that we are alerted automatically
• Recently added alerts for SAM tests and Yumit/Pakiti updates
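A minimal sketch of what one of the added Nagios checks might look like as a service definition; the hostgroup, template and check command names here are invented for illustration, not the site's actual configuration.

```
# Hypothetical Nagios service definition for alerting on SAM test results.
define service {
    use                  generic-service   ; assumed local service template
    hostgroup_name       grid-servers      ; invented hostgroup
    service_description  SAM tests
    check_command        check_sam_results ; invented plugin wrapping the SAM status page
    notification_options w,c,u             ; alert on warning, critical, unknown
}
```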
Plans 1: Tier 3
• Physicists seem to want access to batch resources other than via the Grid, so we need to provide local access
• Rather than run 2 batch systems, we want to give local users access to the Grid batch workers
• Need to:
  • Merge the grid and user cluster account databases
  • Modify YAIM to use NIS pool accounts
  • Change the Maui settings to fairshare Grid/non-Grid, then VO, then users
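The Grid/non-Grid-before-VO-before-user fairshare could look something like the following maui.cfg fragment; the group names, weights and targets are invented to illustrate the precedence, not tested values.

```
# Hypothetical maui.cfg fragment: fairshare weighted at group level
# (grid vs. local) ahead of VO accounts, ahead of individual users.
FSPOLICY         DEDICATEDPS
FSGROUPWEIGHT    100   # grid vs. non-grid dominates
FSACCOUNTWEIGHT  10    # then per-VO share
FSUSERWEIGHT     1     # then per-user share
GROUPCFG[grid]   FSTARGET=75
GROUPCFG[local]  FSTARGET=25
```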
Plans 2: cfengine
• There are getting to be too many worker nodes to manage with the current ad hoc system, so we need to move towards a full configuration management system
• After asking around, we decided upon cfengine
• Test deployment is promising
• Working on re-implementing the worker node install in cfengine
• Still need to find a good solution for secure key distribution to newly installed nodes
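As a flavour of what the worker-node re-implementation might involve, here is a minimal cfengine 2-style cfagent.conf fragment; the class, host and file names are invented, and this is a sketch of the general shape rather than the site's actual policy.

```
# Hypothetical cfagent.conf fragment (cfengine 2 era).
control:
   actionsequence = ( copy shellcommands )

groups:
   workernodes = ( wn001 wn002 )          # invented node names

copy:
   workernodes::
      # Pull a managed config file from the (invented) policy host.
      /masterfiles/etc/ntp.conf dest=/etc/ntp.conf
         server=policyhost mode=644 owner=root

shellcommands:
   workernodes::
      "/sbin/service ntpd restart"
```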
Plans 3: Common Backup
• Current backup of important Unix files is to the Atlas Data Store (ADS)
• Not sure how much longer the ADS is going to be around, so we need to look for another solution
• Was intending to look at Amanda, but…
  • The department bought a new 30-slot tape robot for Windows backup
  • The Veritas backup software in use on Windows supports Linux clients
• Just starting tests on a single node. Will keep you posted.
Plans 4: Reliable Hardware
• Plan to purchase a new class of "more reliable" worker-node-type machines:
  • Dual system disks in hot-swap caddies
  • Possibly redundant hot-swap power supplies
• Use these machines for running Grid services, local services (databases, web servers, etc.) and User Interfaces
Issues 1: Log Processing
• Already running a central syslog server (soon to be expanded to 2 hosts for redundancy)
• As with our Tripwire, it is a fairly passive system
  • We hope to get enough information off it to be useful after the event
• Would like a system to monitor these logs and flag "interesting" events
  • Would prefer one that needs little or no training
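In the absence of a trained system, even a crude pattern scan over the central logs can flag "interesting" events; the log path and patterns below are assumptions for illustration, not the site's actual rules.

```shell
#!/bin/sh
# Hypothetical sketch: pull suspicious lines out of the central syslog.
LOG="${1:-/var/log/messages}"   # assumed log location

# Flag failed logins, sudo use and oom-killer activity.
grep -E 'Failed password|sudo:|Out of memory' "$LOG"
```

Run from cron and mailed to the admins, this gives a first cut at active log monitoring before a proper tool is chosen.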
Windows, etc.
• Still using Windows XP, with Office 2003 and Hummingbird Exceed
• Looking at Vista and Office 2007, but not yet seriously, and we have no rollout plans yet
• Now managed at business unit level rather than by the department
• Looking for synergies between Unix and Windows support:
  • Common file server hardware
  • Common backup solution
• Recently equipped the PPD meeting room with a Polycom rollabout videoconferencing system