
Tier1 Status Report


Presentation Transcript


  1. Tier1 Status Report – Andrew Sansum, GRIDPP12, 1 February 2005

  2. Overview
  • Hardware configuration/utilisation
  • That old SRM story
  • dCache deployment
  • Network developments
  • Security stuff

  3. Hardware
  • CPU
    • 500 dual-processor Intel PIII and Xeon servers, mainly rack mounts (about 884 kSI2K), plus about 100 older systems.
  • Disk Service – mainly "standard" configuration
    • Commodity disk service based on a core of 57 Linux servers
    • External IDE & SATA/SCSI RAID arrays (Accusys and Infortrend)
    • ATA and SATA drives (1500 spinning drives, plus another 500 old SCSI drives)
    • About 220 TB of disk (last tranche deployed in October)
    • Cheap and (fairly) cheerful
  • Tape Service
    • STK Powderhorn 9310 silo with 8 9940B drives. Max capacity 1 PB at present, but capable of 3 PB by 2007.

  4. LCG in September

  5. LCG Load

  6. Babar Tier-A

  7. Last 7 Days – ACTIVE: Babar, D0, LHCb, SNO, H1, ATLAS, ZEUS

  8. dCache
  • Motivation:
    • Needed SRM access to disk (and tape) – see the client-side sketch after this slide
    • We have 60+ disk servers (140+ filesystems) – needed disk pool management
    • dCache was the only plausible candidate
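  For context, the sketch below shows roughly what "SRM access to disk" means from a client's point of view: copying a local file into a dCache pool through its SRM door using the srmcp client that ships with the dCache SRM client tools. This is a minimal sketch under assumptions – the hostname, port and /pnfs path are hypothetical placeholders rather than the actual RAL endpoint, and exact client options can differ between SRM client versions.

      import subprocess

      # Hypothetical endpoint and paths -- illustrative only, not the actual RAL configuration.
      SRM_DOOR = "srm://dcache.example.rl.ac.uk:8443"
      REMOTE_PATH = "/pnfs/example.rl.ac.uk/data/dteam/test/file001.dat"
      LOCAL_FILE = "file:////tmp/file001.dat"


      def srm_copy(src: str, dst: str) -> None:
          """Copy a file via the SRM interface using the dCache srmcp client.

          srmcp negotiates a transfer URL (typically gsiftp) with the SRM door and
          then moves the data over GridFTP; a valid grid proxy is assumed.
          """
          subprocess.run(["srmcp", src, dst], check=True)


      if __name__ == "__main__":
          # Upload a local file into the dCache disk pools through the SRM door.
          srm_copy(LOCAL_FILE, SRM_DOOR + REMOTE_PATH)

  The point of the pool manager is that the client never needs to know which of the 60+ disk servers actually holds the file; the SRM door and dCache decide that.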

  9. History (Norse saga) of dCache at RAL
  • Mid 2003: We deployed a non-grid version for CMS. It was never used in production.
  • End of 2003/start of 2004: RAL offered to package a production-quality dCache. Stalled due to bugs and holidays; went back to the dCache and LCG developers.
  • September 2004: Redeployed dCache into the LCG system for the CMS and DTeam VOs. dCache also deployed within the JRA1 testing infrastructure for gLite I/O daemon testing.
  • Jan 2005: Still working with CMS to resolve interoperation issues, partly due to "hybrid grid/non-grid use".
  • Jan 2005: Prototype back end to tape written.

  10. dCache Deployment
  • dCache deployed as a production service (also test instance, JRA1, developer1 and developer2?)
  • Now available in production for ATLAS, CMS, LHCb and DTEAM (17 TB now configured – 4 TB used). Reliability good – but load is light.
  • Will use dCache (preferably the production instance) as the interface to Service Challenge 2.
  • Work underway to provide a tape backend; a prototype is already operational. This will be the production SRM to tape at least until after the July service challenge.

  11. Current Deployment at RAL

  12. Original Plan for Service Challenges – [diagram labels: Service Challenges, Experiments, UKLIGHT, SJ4, production dCache, test dCache(?) technology]

  13. dCache for Service Challenges – [diagram: experiment and Service Challenge traffic over SJ4 and UKLIGHT enters through a gridftp head node, backed by the disk servers and dCache]

  14. Network
  • Recent upgrade to the Tier1 network
  • Begins to put in place a new generation of network infrastructure
  • Low-cost solution based on commodity hardware
  • 10 Gigabit "ready"
  • Able to meet the needs of:
    • forthcoming service challenges
    • increasing production data flows

  15. September – [diagram: site router linked at 1 Gbit to three Summit 7i switches serving disk and CPU nodes]

  16. Now (Production) – [diagram: site router linked at N×1 Gbit to a Summit 7i, which connects at N×1 Gbit to a Nortel 5510 stack (80 Gbit), with 1 Gbit/link to Disk+CPU nodes]

  17. Soon (Lightpath) – [diagram labels: Site Router, 1 Gbit, Summit 7i, UKLIGHT, N×1 Gbit, 2×1 Gbit, Nortel 5510 stack (80 Gbit), 10 Gb dual attach, 1 Gbit/link, Disk+CPU]

  18. Next (Production) – [diagram: new site router (RAL site) at 10 Gb to a 10 Gigabit switch, N×10 Gbit to the Nortel 5510 stack (80 Gbit), 1 Gbit/link to Disk+CPU nodes]

  19. Machine Room Upgrade
  • Large mainframe cooling (cooking) infrastructure: ~540 kW
  • Substantial overhaul now completed with good performance gains, but was close to maximum by mid summer; temporary chillers over August (funded by CCLRC)
  • Substantial additional capacity ramp-up planned for the Tier-1 (and other e-Science) service
  • November (annual) air-conditioning RPM shutdown
  • Major upgrade – new (independent) cooling system (400 kW+) – funded by CCLRC
  • Also: profiling power distribution, new access control system
  • Worrying about power stability (brownout and blackout in the last quarter)

  20. MSS Stress Testing
  • Preparation for SC3 (and beyond) underway since August (Tim Folkes).
  • Motivation – service load has historically been rather low. Look for "gotchas".
  • Review known limitations.
  • Stress test – part of the way through the process – just a taster here (a throughput-measurement sketch follows this slide):
    • Measure performance
    • Fix trivial limitations
    • Repeat
    • Buy more hardware
    • Repeat
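  As a flavour of the "measure performance" step, here is a minimal throughput sketch: it streams a few large files into a staging directory and reports the aggregate write rate in MB/s. The directory, file size and file count are assumptions for illustration only; the real flfsys/SRB-based test harness is not shown.

      import os
      import time

      # Hypothetical staging area and sizes -- not the actual RAL test harness.
      STAGE_DIR = "/stage/mss_test"   # assumed path on a disk or tape-backed staging area
      FILE_SIZE_MB = 1024             # size of each test file in MB
      NUM_FILES = 8                   # files streamed per pass
      CHUNK = b"\0" * (1 << 20)       # 1 MiB write unit


      def write_pass() -> float:
          """Write NUM_FILES files of FILE_SIZE_MB each; return aggregate MB/s."""
          os.makedirs(STAGE_DIR, exist_ok=True)
          start = time.time()
          for i in range(NUM_FILES):
              path = os.path.join(STAGE_DIR, f"stress_{i:03d}.dat")
              with open(path, "wb") as f:
                  for _ in range(FILE_SIZE_MB):
                      f.write(CHUNK)
                  f.flush()
                  os.fsync(f.fileno())  # make sure the data really left the page cache
          return (NUM_FILES * FILE_SIZE_MB) / (time.time() - start)


      if __name__ == "__main__":
          print(f"Aggregate write rate: {write_pass():.1f} MB/s")

  Running such a pass before and after each fix is what turns "fix trivial limitations, repeat" into a measurable loop.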

  21. Test system / Production system – [diagram: STK 9310 silo with 8 × 9940 tape drives under ACSLS control on buxton (SunOS); drives are fibre-channel attached, 4 per Brocade switch (ADS_switch_1, ADS_Switch_2), to the AIX dataservers dougal, florence, ermintrude and zebedee, plus the test dataserver basil and test flfsys host mchenry1; brian (AIX) runs the flfsys catalogue and admin commands (create, query), dylan (AIX) handles import/export; Redhat front ends ADS0CNTR (counter), ADS0PT01 (pathtape) and ADS0SB01 (SRB interface, with cache) take user pathtape and SRB commands; the sysreq, VTP and ACSLS connections shown for dougal also apply to the other dataserver machines]

  22. Catalogue Manipulation

  23. Write Performance – Single Server Test

  24. Conclusions
  • Have found a number of easily fixable bugs
  • Have found some less easily fixable architecture issues
  • Have a much better understanding of the limitations of the architecture
  • Estimate suggests 60-80 MB/s to tape now. Buy more/faster disk and try again.
  • Current drives are good for 240 MB/s peak – actual performance is likely to be limited by the ratio of drive (read+write) time to (load+unload+seek) time (see the worked example below)
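  To make the last bullet concrete, the arithmetic below shows how mount overheads discount the peak rate. The data moved per mount and the load/unload/seek overhead are illustrative assumptions, not measured RAL figures (the per-drive rate is the 9940B's nominal ~30 MB/s, which gives the quoted 240 MB/s aggregate across 8 drives); changing them changes the answer, which is why more and faster disk in front of the drives matters.

      # Illustrative duty-cycle estimate -- assumed numbers, not RAL measurements.
      N_DRIVES = 8                 # 9940B drives in the silo (from the hardware slide)
      DRIVE_PEAK_MBS = 30.0        # nominal per-drive streaming rate, 8 x 30 = 240 MB/s peak
      FILE_MB_PER_MOUNT = 1000.0   # data streamed per mount (assumed)
      OVERHEAD_S = 120.0           # load + unload + seek per mount (assumed)

      transfer_s = FILE_MB_PER_MOUNT / DRIVE_PEAK_MBS       # time spent actually streaming
      duty_cycle = transfer_s / (transfer_s + OVERHEAD_S)   # fraction of a cycle moving data
      aggregate_mbs = N_DRIVES * DRIVE_PEAK_MBS * duty_cycle

      print(f"Duty cycle per drive: {duty_cycle:.2f}")
      print(f"Effective aggregate:  {aggregate_mbs:.0f} MB/s (vs 240 MB/s peak)")
      # With these assumptions the effective aggregate is roughly 50 MB/s: the drives
      # spend most of each cycle mounting and seeking rather than reading or writing.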

  25. Security Incident
  • 26 August: X11 scan acquires a userid/password at an upstream site
  • Hacker logs on to the upstream site – snoops known_hosts
  • Ssh to a Tier-1 front-end host using an unencrypted private key from the upstream site
  • Upstream site responds – but does not notify RAL
  • Hacker loads an IRC bot on the Tier1 and registers at a remote IRC server (for command/control/monitoring)
  • Tries a root exploit (fails), attempts logins to downstream sites (fails – we think)
  • 7 October: Tier-1 is notified of the incident by the IRC service and begins its response
  • 2000-3000 sites involved globally

  26. Objectives
  • Comply with site security policy – disconnect … etc.
  • Will disconnect hosts promptly once an active intrusion is detected
  • Comply with external security policies (e.g. LCG)
  • Notification
  • Protect downstream sites by notification and pro-active disconnection
  • Identify involved sites
  • Establish direct contacts with upstream/downstream sites
  • Minimise service outage
  • Eradicate infestation

  27. Roles

  28. Chronology
  • At 09:30 on 7 October the RAL network group forwarded a complaint from Undernet suggesting unauthorized connections from Tier1 hosts to Undernet.
  • At 10:00 initial investigation suggests unauthorised activity on csfmove02. csfmove02 is physically disconnected from the network.
  • By now 5 Tier1 staff plus the E-Science security officer are 100% engaged on the incident, with additional support from the CCLRC network group and the site security officer. Effort remained at this level for several days. Babar support staff at RAL were also active tracking down unexplained activity.
  • At 10:07 a request is made to the site firewall admin for firewall logs of all contacts with suspected hostile IRC servers.
  • At 10:37 the firewall admin provides an initial report, confirming unexplained current outbound activity from csfmove02, but no other nodes involved.
  • At 11:29 Babar report that the bfactory account password was common to the following additional IDs: bbdatsrv and babartst.
  • At 11:31 Steve completes a rootkit check – no compromised hosts found – although possible false positives on Redhat 7.2 which we are uneasy about.
  • By 11:40 preliminary investigations at RAL had concluded that an unauthorized access had taken place on host csfmove02 (a data mover node), which in turn was connected outbound to an IRC service. At this point we notified the security mailing lists (hepix-security, lcg-security, hepsysman).

  29. Security Events

  30. Security Summary
  • The intrusion took 1-2 staff months of CCLRC effort to investigate: 6 staff full time for 3 days (5 Tier-1 plus the E-Science security officer), working long hours. Also involved:
    • Networking group
    • CCLRC site security
    • Babar support
    • Other sites
  • Prompt notification of the incident by the upstream site would have substantially reduced the size and complexity of the investigation.
  • The good standard of patching on the Tier1 minimised the internal spread of the incident (but we were lucky).
  • Can no longer trust who logged-on users are: many userids (globally) are probably compromised.

  31. Conclusions
  • A period of consolidation
  • User demand continues to fluctuate, but an increasing number of experiments are able to use LCG.
  • Good progress on SRM to disk (dCache)
  • Making progress with SRM to tape
  • Having an SRM isn't enough – it has to meet the needs of the experiments
  • Expect focus to shift (somewhat) towards the service challenges
