
CHEPREO Cyberinfrastructure Highlights


Presentation Transcript


  1. CHEPREO Cyberinfrastructure Highlights National Science Foundation Reverse Site Visit April 27, 2007

  2. AGENDA: CHEPREO Cyberinfrastructure Highlights • Status • CI Upgrades made by CHEPREO • IT Support Personnel • Active Equipment • UltraLight Participation • Actual Usage • International Circuit to Sao Paulo • Brazil’s Distributed Tier-2, UF & Caltech Tier-2 • CHEPREO Tier-3 at FIU • Projected Usage & Future Planning

  3. IT Support Personnel

  4. Active Equipment • 2 Cisco ONS 15454 optical muxes in full production • Located at AMPATH/Miami and the ANSP POP/Sao Paulo • Leveraged by the WHREN-LILA project, an International Research Network Connection (IRNC) supporting the international connection from the U.S. to South America • NSF Award #OCI-0441095 • FAPESP Award #2003/13708-0 (FAPESP, the State of Sao Paulo counterpart of the NSF)

  5. Active Equipment • CHEPREO optical muxes terminate the international link • Support traffic flows to and from HEPGrid (Rio) and SPRACE (Sao Paulo)

  6. Active Equipment (network): Recommendations • No funds requested for new active equipment in Year 5, as long as traffic flows do not exceed circuit capacity • Continued funding support for: current active equipment maintenance, personnel, and other direct costs

  7. UltraLight is • A four-year, $2M NSF ITR project funded by MPS • Application-driven network R&D • A collaboration of BNL, Caltech, CERN, Florida, FIU, FNAL, Internet2, Michigan, MIT, SLAC • Significant international participation: Brazil, Japan, Korea, among many others • Goal: enable the network as a managed resource • Meta-goal: enable physics analysis and discoveries which could not otherwise be achieved • 2007 winner of the annual Internet2 IDEA Awards, which recognize innovative and influential advanced network applications

  8. Four-Continent Testbed: Building a global, network-aware, end-to-end managed real-time Grid

  9. Actual Usage • International Circuit to Sao Paulo • Brazil’s Distributed Tier-2, UF & Caltech Tier-2 • CHEPREO Tier-3 at FIU

  10. CHEPREO Bandwidth Upgrade • WHREN-LILA IRNC Award OCI-0441095 • CHEPREO bandwidth funding augmented this award • 5-year Cooperative Agreement, 2005–2009 • Partners include: Florida International University (PI Julio Ibarra), CENIC in California, the Academic Network of Sao Paulo (Award #2003/13708-0), CLARA (Latin America), CUDI (Mexico), RNP (Brazil), REUNA (Chile)

  11. WHREN-LILA Connections • Improving connectivity in the Americas through the establishment of new inter-regional links • 2.5 Gbps circuit + dark fiber segment • U.S. landings in Miami and San Diego • Latin America landings in Sao Paulo and Tijuana

  12. Actual Bandwidth Utilization • CHEPREO funded the LILA link capacity upgrade to 2.5 Gbps in November 2006 • Enables Brazil’s distributed Tier-2 facility to participate in the CMS Tier-2 Milestones Plan • The next few slides show traffic flows between: ANSP-SPRACE and Fermilab; HEPGRID and Fermilab

  13. ANSP-SPRACE Bandwidth Utilization, last 30 days • Each point on the traffic graph generated by Cricket represents a 2-hour bandwidth usage average • The top outbound destination (to the US) from Sao Paulo, and the top inbound source (from the US), is Fermilab • Green represents traffic outgoing from ANSP to the US • Blue represents traffic destined to ANSP from the US/other destinations

  14. Sao Paulo to Fermilab flows • Each point on the traffic graph generated by Cricket represents a 2-hour bandwidth usage average • During this period we experienced traffic peaks as high as 900 Mbps from SPRACE (see the conversion sketch below) • Green represents traffic outgoing from SPRACE to the US • Blue represents traffic destined to SPRACE from the US/other destinations
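To put these 2-hour averages in perspective, here is a minimal sketch (illustrative values only; the 900 Mbps figure matches the peak quoted above) that converts one averaged sample point into an approximate data volume:

```python
# Rough conversion of a Cricket-style 2-hour bandwidth average into data volume.
# Sample rates below are illustrative only.

def volume_gb(avg_mbps: float, interval_hours: float = 2.0) -> float:
    """Approximate gigabytes moved during one averaging interval."""
    seconds = interval_hours * 3600
    bits = avg_mbps * 1e6 * seconds
    return bits / 8 / 1e9  # bits -> bytes -> GB

if __name__ == "__main__":
    print(f"900 Mbps over 2 h ~ {volume_gb(900):,.0f} GB")  # ~810 GB
    print(f"300 Mbps over 2 h ~ {volume_gb(300):,.0f} GB")  # ~270 GB
```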

  15. Sao Paulo to Fermilab flows • Each point on the traffic graph generated by Cricket represents a 2-hour bandwidth usage average • During this period we experienced traffic peaks as high as 900 Mbps from SPRACE • Blue represents traffic outgoing from SPRACE to the US • Green represents traffic destined to SPRACE from the US/other destinations

  16. UNESP x SPRACE Traffic, 20 March 2007 (traffic graph of SPRACE and UNESP flows)

  17. HEPGRID Bandwidth Utilization, last 30 days • Each point on the traffic graph generated by Cricket represents a 2-hour bandwidth usage average • The top outbound destination (to the US) from HEPGRID, and the top inbound source (from the US), is Fermilab • Green represents traffic outgoing from HEPGRID to the US • Blue represents traffic destined to HEPGRID from the US/other destinations

  18. Traffic Flows Between HEPGRID & Fermilab over 1 ½ Days • Each point on the traffic graph generated by Cricket represents a 5-minute interval • Green represents traffic outgoing from HEPGRID to the US • Blue represents traffic destined to HEPGRID from the US/other destinations

  19. Caltech/CERN & HEP at SC2006: Petascale Transfers for Physics at ~100+ Gbps = 1 PB/Day • ~200 CPUs, 56 10GE switch ports, 50 10GE NICs, 100 TB disk • Research partners: FNAL, BNL, UF, UM, ESnet, NLR, FLR, Internet2, AWave, SCInet, Qwest, UERJ, UNESP, KNU, KISTI • Corporate partners: Cisco, HP, Neterion, Myricom, DataDirect, BlueArc, NIMBUS • New disk-speed WAN transport applications for science (FDT, LStore)
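As a sanity check on the “~100+ Gbps = 1 PB/day” headline, the conversion (in decimal units) works out as follows:

```latex
100~\mathrm{Gb/s} \times 86{,}400~\mathrm{s/day}
  = 8.64\times 10^{6}~\mathrm{Gb/day}
  = 1.08\times 10^{6}~\mathrm{GB/day}
  \approx 1.1~\mathrm{PB/day}
```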

  20. Fast Data Transport Across the WAN: Solid 10.0 Gbps → “12G + 12G”

  21. A New Era of Networks and Grids for Data Intensive Science • Caltech/CERN/Vanderbilt et al. HEP team: enabling scientific discoveries by developing state-of-the-art network tools and systems for widespread use (among hundreds of university groups) • SC06 BWC entry: solid 10 Gbps x 2 (bidirectional) data flow between low-end 1U servers with new, easy-to-deploy software, between Tampa and Caltech via NLR • Fast Data Transport (FDT): wide-area data transport, memory to memory and storage to storage, limited only by the disk speeds, for the first time; a highly portable application for all platforms • LStore: a file system interface to global data storage resources; highly scalable to serve many university sites and store data at multi-GByte/sec speeds; capability of several GBytes/sec per rack • MonALISA: a global monitoring and management tool • ROOTlets: a new grid-based application for discovery science • SC06 Hyper-Challenge: 100+ Gbps reliable data transport storage-to-storage (bidirectional) with FDT, LStore and Parallel NFS • Shaping the next generation of scientific research
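FDT itself is a Java application from Caltech; purely to illustrate what a memory-to-memory WAN throughput test of this kind does, here is a minimal Python sketch (this is not FDT, and the host, port and buffer size are hypothetical):

```python
# Minimal memory-to-memory throughput probe (illustration only, not FDT).
import socket
import time

CHUNK = 4 * 1024 * 1024  # 4 MB buffer of zeros, analogous to /dev/zero input

def serve(port: int = 5001) -> None:
    """Accept one connection, discard all received data, report the rate."""
    with socket.create_server(("", port)) as srv:
        conn, addr = srv.accept()
        with conn:
            total, start = 0, time.time()
            while True:
                data = conn.recv(CHUNK)
                if not data:
                    break
                total += len(data)
            elapsed = time.time() - start
            print(f"received {total / 1e9:.2f} GB at {total * 8 / elapsed / 1e9:.2f} Gbps from {addr}")

def send(host: str, port: int = 5001, seconds: float = 10.0) -> None:
    """Stream zero-filled buffers (memory to memory) for a fixed duration."""
    payload = bytes(CHUNK)
    with socket.create_connection((host, port)) as conn:
        total, start = 0, time.time()
        while time.time() - start < seconds:
            conn.sendall(payload)
            total += len(payload)
        elapsed = time.time() - start
        print(f"sent {total / 1e9:.2f} GB at {total * 8 / elapsed / 1e9:.2f} Gbps")

# Usage (hypothetical hosts): run serve() on the receiver,
# then send("receiver.example.org") on the sender.
```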

  22. Sao Paulo flows to SC06

  23. HEPGrid flows to SC06

  24. Actual & Previously Projected Bandwidth Usage on the CHEPREO Link over LILA

  25. Tier-2 & Tier-3 CHEPREO Sites • Brazilian Distributed Tier-2 • Caltech and UF Tier-2s • CHEPREO Tier-3 at FIU

  26. CMS Tier 2 site in Sao Paulo - SP-RACE

  27. CMS Tier 2 Site in Brazil • Plans for 2008: replace these with rack-mounted servers, reaching ~1M SpecInt2000 (SI2K) = US Tier-2 scale

  28. Brazil’s distributed Tier-2 facility

  29. Participation in LHC Global Grid • 2000 physicists, 60 countries • 10s of Petabytes by 2009 • CERN / Outside = 1 / 10 • Tiered diagram: CMS Experiment Online System → Tier 0 (CERN Computer Center) at 200-1500 MB/s → Tier 1 centers (FermiLab, Korea, Russia, UK) at 10-40 Gb/s → Tier 2 centers (UERJ & SP, U Florida, Caltech, UCSD; OSG) at >10 Gb/s → Tier 3 physics data caches (FIT, FIU, FSU) at 2.5-10 Gb/s → Tier 4 (PCs)

  30. CMS and Grid Activities at UF • iVDGL: PI, leadership • GriPhyN: PI, leadership • UltraLight: co-PI, project coordinator • Open Science Grid: co-PI, resources • CMS • Muon System: Muon Endcap, Trigger, Coordination • Computing: US-CMS Tier-2, DISUN, HPC, fGOC … • Physics Analysis: • HEE faculty: Acosta, Avery, Korytov, Mitselmakher, Yelton • HET faculty: Field, Matchev, Ramond • Plus many Scientists, Post-docs, Grad Students, Engineers

  31. The Tier-2 Center at UF • Support local and regional HEP experimenters: FIU, FIT, FSU, Vanderbilt, others • Monte Carlo production and data analysis • CMS, CDF official NamCAF site • Support Grid computing R&D: GriPhyN, iVDGL, UltraLight, CHEPREO, DISUN, Open Science Grid • Principal Investigator: P. Avery • Scientists: D. Bourilkov, R. Cavanaugh, Y. Fu, B. Kim, J. Rodriguez → FIU, HPC liaison C. Prescott • 1-2 student hires

  32. UF Tier-2 Computational Resources • Computational hardware: 210 dual-socket, dual-core nodes (840 cores), ~1M SpecInt2000 • Computation support and configuration: 4 Grid-enabled clusters; 4 nodes for interactive analysis (user logins, compiling, Grid UI, …); service and development nodes (gatekeepers, frontends, webservers, dCache management, development clusters, …) • Storage hardware: 37 TB of RAID5 fileservers (24 TB in dCache pools, 13 TB in NFS servers); 83 TB of local disk on nodes (~63 TB allocated to dCache); 18 TB of HP FC-attached storage (~800 MB/s sequential reads) • Hybrid dCache-based storage: dCache is required by CMS worldwide; 126 FTP servers for high-speed data transfer; Grid access via the SRM interface
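As a quick cross-check of the storage numbers quoted above, this sketch simply tallies the capacities as stated on the slide (the grouping is mine):

```python
# Tally of UF Tier-2 storage capacities as quoted on the slide (TB).
raid5_fileserver = {"dCache pools": 24, "NFS servers": 13}   # 37 TB total
local_disk       = {"dCache": 63, "other local": 83 - 63}    # 83 TB total
fc_attached      = {"HP FC storage": 18}                     # 18 TB total

total = sum(raid5_fileserver.values()) + sum(local_disk.values()) + sum(fc_attached.values())
dcache_total = raid5_fileserver["dCache pools"] + local_disk["dCache"]

print(f"total storage quoted: {total} TB")        # 138 TB
print(f"of which in dCache:   {dcache_total} TB") # 87 TB
```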

  33. UF Tier-2 Networking • Mostly a Cisco 6509, all 9 slots populated: 2 x 4-port 10 GigE blades, 6 x 48-port 10/100/1000BASE-TX blades, Supervisor Engine Sup720 • Cable management a thing of beauty!

  34. Data Transfer Tests (UF-Caltech) with FDT • Initially using UltraLight infrastructure (Bourilkov); partly done in the context of SC05 and SC06 • UF Tier2 – Caltech Tier2 transfers in 2006 at 4.5 Gbps rates (see the conversion below) • Tier2 – Tier1 transfers (UF-FNAL, Caltech-FNAL): PhEDEx data movements (exposed problems); new tests next week using the new UF infrastructure • Tier2 – Tier3 transfers: UF-FIU is the next step, using FLR infrastructure (Bourilkov + Yu at UF, Rodriguez at FIU)
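For context, the 4.5 Gbps rate quoted above corresponds to roughly 2 TB per hour:

```latex
4.5~\mathrm{Gb/s} \div 8 \approx 0.56~\mathrm{GB/s};
\qquad 0.56~\mathrm{GB/s} \times 3600~\mathrm{s/hr} \approx 2.0~\mathrm{TB/hr}
```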

  35. REDDnet: National Networked Storage • NSF funded, led by Vanderbilt (UltraLight) • 8 initial sites, including UF, FIU, … (Brazil?) • Multiple disciplines: satellite imagery, HEP, Terascale Supernova Initiative, structural biology, bioinformatics • Storage: 500 TB disk, 200 TB tape

  36. Open Science Grid: July 20, 2005 • Consortium of many organizations (multiple disciplines) • Production grid cyberinfrastructure • 75+ sites, 24,000+ CPUs: US, UK, Brazil, Taiwan • Funded at $30M for 2006–2011 • CHEPREO participates in OSG through the Tier-3 at FIU, for E&O, and for data sharing, for example

  37. OSG Jobs Per Site over 6 months (Sep–Mar): 5000 simultaneous jobs at multiple sites

  38. Towards a Florida Research Grid (FLR: 10 Gbps) • Florida facilities involved: UF Tier-2 & iHEPA clusters, UF HPC Center, FIU Tier-3, FSU Tier-3, FIT Tier-3 • Exploits FLR optical network connections at 10 Gb/s • All but FIT are production sites in OSG; FIT soon (including UF-donated nodes)

  39. Florida Grid Operations Center • Operational Scope • Help support fabric level services, hardware procurement and cluster management • Fully integrated with OSG operations infrastructure • Informal VRVS meetings, archived email • fGOC now moved to FIU (Jorge Rodriguez) • Training workshops for FIU/FIT/FSU sysadmins • Provide dedicated E/O resources • CPU, disk • Resources for student projects

  40. CHEPREO Tier3 Cluster: Current configuration • Gigabit copper backbone with GE connectivity to the AMPATH network • OSG Computing Element: an approximately 23-node computational cluster that supports local user logins and serves as the CE’s gatekeeper, plus all other cluster services • This year added a “Storage Element”: a very modest fileserver (due to lack of equipment funds), cobbled together from existing equipment, spare and new parts, with 1.5 TB of disk added as a network-attached NFS server (see the capacity-check sketch below) • Upgraded the OSG middleware stack
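A minimal sketch of how the ~1.5 TB NAS-backed NFS area might be sanity-checked from a cluster node; the mount point is hypothetical and not taken from the slides:

```python
# Capacity check for the Tier-3 NAS-backed NFS area (reporting only).
import os

MOUNT = "/mnt/nas-01"   # hypothetical mount point for the NFS fileserver
EXPECTED_TB = 1.5       # capacity added this year, per the slide

def report(path: str) -> None:
    st = os.statvfs(path)
    total_tb = st.f_frsize * st.f_blocks / 1e12
    free_tb = st.f_frsize * st.f_bavail / 1e12
    print(f"{path}: {total_tb:.2f} TB total, {free_tb:.2f} TB free "
          f"(expected ~{EXPECTED_TB} TB)")

if __name__ == "__main__":
    report(MOUNT)
```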

  41. Tier 3 Data Center: Current Usage • Usage statistics (Condor): since Nov. 2006, a total of ~30K hours used • Usage by a few of the supported VOs • Minor usage by CMS: CMS can’t currently make much use of opportunistic grid resources, since it requires several heavyweight services (PhEDEx, SRM/dCache, …), and very few non-Tier-2 sites are actually used by CMS • Other HEP VOs just don’t know about FIU-PG…
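A hedged sketch of how a per-VO usage summary like the one above could be produced; the CSV file name and column names are hypothetical, standing in for whatever Condor accounting export the site actually uses:

```python
# Aggregate wall-clock hours per VO from a (hypothetical) accounting export.
import csv
from collections import defaultdict

def hours_by_vo(path: str) -> dict:
    usage = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):   # expects columns: vo, wall_seconds
            usage[row["vo"]] += float(row["wall_seconds"]) / 3600.0
    return dict(usage)

if __name__ == "__main__":
    totals = hours_by_vo("fiu_pg_usage.csv")   # hypothetical export file
    for vo, hrs in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{vo:12s} {hrs:10.0f} h")
    print(f"{'total':12s} {sum(totals.values()):10.0f} h")  # slide quotes ~30K h
```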

  42. Projections & Recommendations • Projected Bandwidth Usage • Projected Tier-3 Growth Plan • Challenges with international bandwidth management

  43. Bandwidth Utilization Projections • RedCLARA projection due to direct peering with US R&E networks over AtlanticWave • Distributed Tier-2 projection due to further testing and preparations for LHC experiments • Astronomy projection due to images from the Dark Energy Camera

  44. International Network Support Recommendation • Continue CHEPREO funding support in Y5 to sustain the 2.5 Gbps Miami–Sao Paulo (MIA-SP) link capacity • Essential to support research and production traffic flows between the U.S. and Latin America • Additional $426,000 over PEP CI funding; WHREN-LILA assumes this cost in FY08 & FY09 • Continue CI (networking) personnel support at FIU & Caltech

  45. Tier 3 Projected Growth Plan for FY07 (Y5 of CHEPREO) • Replace aging hardware: the Xeon-class servers are now over 3 years old, and almost everything is already out of warranty • Replace the 1.5 TB fileserver with modern hardware with more disk, more RAM and more CPU; a modest investment would yield 10x the storage • With a modern storage server, we can consider deploying a real grid-enabled Storage Element • Add an interactive analysis server • Cost: approximately $70,000

  46. Tier3 Facility Reconfiguration, Phase I and Phase II (diagram) • Components shown: OSG CE “cluster” FIU-PG (nodes fiu01 … fiu20/fiuM), OSG Integration cluster FIU-IGT (nodes fiuM+1 … fiuM+N), service nodes (web server, PhEDEx, squid), CMS interactive server “miami” (8 CPUs, 8+ GB RAM), NFS servers (nas-01), L-Store Depot

  47. Challenges with International Bandwidth Management • Optimize gigabit flows over the CHEPREO-WHREN-LILA East link to and from HEPGRID and SPRACE • Deployment of FAST TCP and other next-generation network protocols • Use FDT to get better disk-to-disk throughput • Implement and test using control-plane tools and services • Deploy servers equipped with Caltech's FDT (Fast Data Transport) at FIU, UF, UERJ (Rio) and UNESP (São Paulo) • Deploy the UltraLight Linux kernel to put high-speed data transfers into persistent use among these and other CHEPREO and UltraLight sites • Deploy the dCache/FDT adaptor developed at Caltech, so that the high-speed data transport methods can be used for CMS physics in production • Deploy the Clarens Grid Web Services portal and other GAE components at FIU, UF, UERJ (Rio) and UNESP (São Paulo) • Install the ROOTlet Clarens server component • Ensure ROOTlet clients running FWLite can make use of the ROOTlet service for CMS data analysis, and allow access from the ROOTlets to CMS analysis datasets located at FIU, UF, UERJ and UNESP

  48. Y5 and Y5+: Circuit-Oriented Networking and Large-Scale Data Flows • Deploy FDT for Tier2 – Tier3 flows • Help with hardware choices, network interfaces, kernel and TCP stack optimization (see the sketch below), MonALISA monitoring and agents • Install a persistent FDT service for active monitoring (short bursts) as well as data transfers • Integrated FDT/dCache service for production inter-cluster grid data transport (T1-T2, T2-T2); first “adaptor” version finished this month: Ilya Narsky and Iosif Legrand (Caltech) with support from the Fermilab team; beginning the direct interface (more complex): Faisal Khan (Caltech and NUST) and the Caltech software engineering team • Create a virtual “Layer 2+3” path, e.g. Brazil-Miami-Starlight-(IRNC)-AMS-Geneva (and others), using MonALISA/VINCI agents • Test switching the path to and from the US LHCNet links/IRNC links • Understand addressing issues when using/switching among Layer 1 and Layer 2 or 3 segments • Atlantic Wave and Internet2/NLR across the US • Automated dynamic configuration and reconfiguration of paths as needed
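As a hedged companion to the "kernel and TCP stack optimization" item above, this sketch only reads the Linux kernel settings that such tuning typically touches; no particular values are implied, and actual tuning would be site- and kernel-specific:

```python
# Report the kernel TCP settings commonly adjusted for high bandwidth-delay-product
# WAN transfers. Read-only; nothing is changed here.
from pathlib import Path

SETTINGS = [
    "/proc/sys/net/ipv4/tcp_congestion_control",  # e.g. reno, cubic, or another high-speed variant
    "/proc/sys/net/core/rmem_max",
    "/proc/sys/net/core/wmem_max",
    "/proc/sys/net/ipv4/tcp_rmem",
    "/proc/sys/net/ipv4/tcp_wmem",
]

for path in SETTINGS:
    p = Path(path)
    value = p.read_text().strip() if p.exists() else "(not available)"
    print(f"{path}: {value}")
```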

  49. FDT Test Results, 11/14-11/15 • Memory to memory (/dev/zero to /dev/null), using two 1U systems with Myrinet 10GbE PCI Express NIC cards • Tampa-Caltech (RTT 103 msec): 10.0 Gbps, stable indefinitely • Long-range WAN path (CERN – Chicago – New York – Chicago – CERN VLAN, RTT 240 msec): ~8.5 Gbps → 10.0 Gbps overnight • Disk to disk: performs very close to the limit of the disk or network speed; a 1U disk server at CERN sending data to a 1U server at Caltech (each with 4 SATA disks) achieves ~0.85 TB/hr per rack unit = ~9 GBytes/sec per rack
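The per-rack figure follows from the per-rack-unit rate if one assumes on the order of 40 1U servers per rack (an assumption, not stated on the slide):

```latex
\frac{0.85~\mathrm{TB/hr}}{3600~\mathrm{s/hr}} \approx 0.24~\mathrm{GB/s\ per\ 1U};
\qquad 0.24~\mathrm{GB/s} \times 40 \approx 9.4~\mathrm{GB/s\ per\ rack}
```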

  50. FDT Test Results (2), 11/14-11/15 • Stable disk-to-disk flows Tampa-Caltech: stepping up to 10-to-10 and 8-to-8 1U server pairs, 9 + 7 = 16 Gbps, then solid overnight • Cisco 6509E counters: 16 Gbps of disk traffic and 13+ Gbps of FLR memory traffic • Maxing out the 20 Gbps Etherchannel (802.3ad) between our two Cisco switches?
