The LCG Project Ian Bird IT Department, CERN ISGC 2004, Taipei, 27-28 July 2004 LCG-Asia Workshop – 28 July 2004 - 1
The Large Hadron Collider Project – 4 detectors (ATLAS, CMS, LHCb, ALICE) • Requirements for world-wide data analysis • Storage – raw recording rate 0.1 – 1 GBytes/sec • Accumulating at 15 PetaBytes/year • 40 PetaBytes of disk • Processing – 100,000 of today’s fastest PCs LCG-Asia Workshop – 28 July 2004 - 2
Large distributed community • “Offline” software effort: 1000 person-years per experiment • Software life span: 20 years • ~ 5000 physicists around the world (ATLAS, CMS, LHCb, …) LCG-Asia Workshop – 28 July 2004 - 3
LCG – Goals • The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments • Two phases: Phase 1: 2002 – 2005 • Build a service prototype, based on existing grid middleware • Gain experience in running a production grid service • Produce the TDR for the final system Phase 2: 2006 – 2008 • Build and commission the initial LHC computing environment • LCG is not a development project – it relies on other grid projects for grid middleware development and support LCG-Asia Workshop – 28 July 2004 - 4
Introduction – the LCG Project • LHC Computing Grid (LCG) is a grid deployment project • Prototype computing environment for LHC • Focus on building a production-quality service • Learn how to maintain and operate a global scale production grid • Gain experience in close collaboration between regional (resource) centres • Understand how to integrate fully with existing computing services • Building on the results of earlier research projects; Learn how to move from test-beds to production services • Address policy-like issues needing agreement between collaborating sites LCG-Asia Workshop – 28 July 2004 - 5
LHC Computing Grid Project - a Collaboration Building and operating the LHC Grid involves a collaboration of • The physicists and computing specialists from the LHC experiments • The projects in the US and Europe that have been developing Grid middleware • The regional and national computing centres that provide resources for LHC • and usually also for other physics experiments and other sciences Researchers – Software Engineers – Service Providers LCG-Asia Workshop – 28 July 2004 - 6
The Project Organisation • Applications Area – Development environment, joint projects, data management, distributed analysis • Middleware Area – Provision of a base set of grid middleware – acquisition, development, integration, testing, support • CERN Fabric Area – Large cluster management, data recording, cluster technology, networking, computing service at CERN • Grid Deployment Area – Establishing and managing the Grid Service - middleware certification, security, operations, registration, authorisation, accounting • ARDA – Prototyping and testing grid middleware for experiment analysis LCG-Asia Workshop – 28 July 2004 - 7
Applications Area Projects • Software Process and Infrastructure (SPI) (A.Aimar) • Librarian, QA, testing, developer tools, documentation, training, … • Persistency Framework & Database Applications (POOL) (D.Duellmann) • Relational persistent data store, conditions database, collections • Core Tools and Services (SEAL) (P.Mato) • Foundation and utility libraries, basic framework services, object dictionary and whiteboard, maths libraries • Physicist Interface (PI) (V.Innocente) • Interfaces and tools by which physicists directly use the software. Interactive analysis, visualization • Simulation (T.Wenaus) • Generic framework, Geant4, FLUKA integration, physics validation, generator services • ROOT (R.Brun) • ROOT I/O event store; analysis package LCG-Asia Workshop – 28 July 2004 - 8
POOL – Object Persistency • Bulk event data storage – an object store based on ROOT I/O • Full support for persistent references automatically resolved to objects anywhere on the grid • Recently extended to support updateable metadata as well (with some limitations) • File cataloging – Three implementations using – • Grid middleware (EDG version of RLS) • Relational DB (MySQL) • Local Files (XML) • Event metadata – • Event collections with query-able metadata (physics tags etc.) • Transient data cache – • Optional component by which POOL can manage transient instances of persistent objects • POOL project scope now extended to include the Conditions Database LCG-Asia Workshop – 28 July 2004 - 9
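The pluggable-catalog idea above – the same logical-to-physical file mapping served by RLS, MySQL or a local XML file – can be illustrated with a toy catalog. The sketch below is a minimal Python illustration using an invented `XMLFileCatalog` class and file layout; it is not the real POOL (C++) interface, only the concept that registration and lookup stay the same whatever backend sits behind them.

```python
# Illustrative sketch only: a minimal file catalog mapping logical file names
# (LFNs) to GUIDs and physical file names (PFNs), with an XML backend in the
# spirit of POOL's "local XML catalog" option. Class and element names are
# invented for illustration and are NOT the real POOL API.
import uuid
import xml.etree.ElementTree as ET


class XMLFileCatalog:
    """Toy catalog: one <file guid=...> element per entry, holding LFN and PFNs."""

    def __init__(self, path):
        self.path = path
        try:
            self.root = ET.parse(path).getroot()
        except (FileNotFoundError, ET.ParseError):
            self.root = ET.Element("catalog")

    def register(self, lfn, pfn):
        entry = ET.SubElement(self.root, "file", guid=str(uuid.uuid4()))
        ET.SubElement(entry, "logical", name=lfn)
        ET.SubElement(entry, "physical", name=pfn)
        return entry.get("guid")

    def lookup_pfns(self, lfn):
        return [
            p.get("name")
            for f in self.root.findall("file")
            if any(l.get("name") == lfn for l in f.findall("logical"))
            for p in f.findall("physical")
        ]

    def save(self):
        ET.ElementTree(self.root).write(self.path)


if __name__ == "__main__":
    cat = XMLFileCatalog("catalog.xml")
    cat.register("lfn:/grid/cms/run123/events.root",
                 "srm://storage.example.org/cms/run123/events.root")
    print(cat.lookup_pfns("lfn:/grid/cms/run123/events.root"))
    cat.save()
```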
POOL Component Breakdown LCG-Asia Workshop – 28 July 2004 - 10
Simulation Project Organisation [organigram]: Project Leader; subprojects: Framework, Geant4, FLUKA integration, Physics Validation, Generator Services LCG-Asia Workshop – 28 July 2004 - 11
Fabrics • Getting the data from the detector to the grid requires sustained data collection and distribution -- keeping up with the accelerator • To achieve the required levels of performance, reliability, resilience -- at minimal cost (people, equipment) -- we also have to work on scalability and performance of some of the basic computing technologies – • cluster management • mass storage management • high performance networking LCG-Asia Workshop – 28 July 2004 - 12
Tens of thousands of disks • Thousands of processors • Hundreds of tape drives • Continuous evolution • Sustained throughput • Resilient to problems LCG-Asia Workshop – 28 July 2004 - 13
Fabric Automation at CERN [diagram] • Fault & hardware management: HMS, SMS • Configuration & installation: CDB (configuration database), SWRep (software repository), with node agents NCM and SPMA working from local configuration and software caches • Monitoring: LEMON, with the MSA node agent and the OraMon repository • Includes technology developed by DataGrid LCG-Asia Workshop – 28 July 2004 - 14
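The essence of this chain – desired state held centrally (CDB/SWRep), node agents converging towards it – can be shown with a tiny sketch. The package names and data layout below are invented, and this is a conceptual illustration, not the actual CERN fabric management tools.

```python
# Illustrative sketch of the configuration-driven idea behind the CDB/SPMA
# chain above: a node agent compares the desired package list (as a
# configuration database would publish it) with what is installed and derives
# the actions needed to converge. All data and names are invented examples.
def compute_actions(desired, installed):
    """Return (to_install, to_upgrade, to_remove) needed to converge the node."""
    to_install = {p: v for p, v in desired.items() if p not in installed}
    to_upgrade = {p: v for p, v in desired.items()
                  if p in installed and installed[p] != v}
    to_remove = [p for p in installed if p not in desired]
    return to_install, to_upgrade, to_remove


if __name__ == "__main__":
    desired = {"openssl": "0.9.7d", "lcg-ce": "2.1.1"}          # as published by the config DB (example)
    installed = {"openssl": "0.9.6b", "obsolete-tool": "1.0"}   # as reported by the local package manager
    print(compute_actions(desired, installed))
```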
WAN connectivity: 5.44 Gbps – 1.1 TB in 30 mins; 6.63 Gbps on 25 June 2004. We now have to get from an R&D project (DATATAG) to a sustained, reliable service – Asia, Europe, US LCG-Asia Workshop – 28 July 2004 - 15
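As a quick sanity check (assuming binary terabytes), 1.1 TB transferred in 30 minutes corresponds to roughly the quoted multi-Gbps rate:

```python
# Back-of-the-envelope check: 1.1 TB in 30 minutes vs. the ~5.4 Gbps figure.
# Assumes binary terabytes (1 TB = 2**40 bytes); with decimal TB the result
# is ~4.9 Gbps, so treat this only as an order-of-magnitude consistency check.
volume_bits = 1.1 * 2**40 * 8          # 1.1 TiB expressed in bits
duration_s = 30 * 60
print(f"{volume_bits / duration_s / 1e9:.2f} Gbps")   # ~5.4
```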
ALICE DC – MSS Bandwidth LCG-Asia Workshop – 28 July 2004 - 16
Sites in LCG-2/EGEE-0 : July 14 2004 LCG-Asia Workshop – 28 July 2004 - 17
Sites in LCG-2/EGEE-0 : July 14 2004 LCG Tier 0/1 site • 22 Countries • 64 Sites • 50 Europe, • 2 US, • 5 Canada, • 6 Asia, • 1 HP • Coming: • New Zealand, • China, • other HP (Brazil, Singapore) • > 6000 cpu LCG-Asia Workshop – 28 July 2004 - 18
LHC Computing Model (simplified!!)
[diagram: Tier-0 (the accelerator centre) surrounded by Tier-1 and Tier-2 centres, e.g. RAL, IN2P3, FNAL, CNAF, FZK, PIC, BNL, ICEPP, TRIUMF, Taipei, NIKHEF, MSU, IC, IFCA, UB, Cambridge, Budapest, Prague, Legnaro, CSCS, IFIC, Rome, CIEMAT, USC, Krakow, plus small centres, desktops and portables]
• Tier-0 – the accelerator centre: filter raw data; reconstruction of summary data (ESD); record raw data and ESD; distribute raw and ESD to the Tier-1s
• Tier-1 – managed mass storage: permanent storage of raw, ESD, calibration data, meta-data, analysis data and databases as a grid-enabled data service; data-heavy analysis; re-processing raw → ESD; national and regional support; “online” to the data acquisition process – high availability, long-term commitment
• Tier-2 – well-managed, grid-enabled disk storage; simulation; end-user analysis – batch and interactive; high performance parallel analysis (PROOF)
LCG-Asia Workshop – 28 July 2004 - 19
The LCG Service • LCG-1 service started on September 15 2003 • With 12 sites • LCG-2 with upgraded middleware began to be deployed at the beginning of 2004 • Currently 64 sites with > 6000 cpu • During 2003 significant effort was expended to: • Integrate VDT, EDG, and other tools • Debug, patch, test, and certify the middleware • LCG-2 currently in use for the LCG experiments’ data challenges • LCG-2 forms the basis for the EGEE production service LCG-Asia Workshop – 28 July 2004 - 20
LCG Certification • Significant investment in certification and testing process and team • Skilled people capable of system-level debugging, tightly coupled to VDT, Globus, and EDG teams • Needs significant hardware resources • This was essential in achieving a robust service • Making production quality software is • Expensive • Time consuming • Not glamorous! • … and takes very skilled people with a lot of experience LCG-Asia Workshop – 28 July 2004 - 21
Experiences in deployment • LCG covers many sites (>60) now – both large and small • Large sites – existing infrastructures – need to add-on grid interfaces etc. • Small sites want a completely packaged, push-button, out-of-the-box installation (including batch system, etc) • Satisfying both simultaneously is hard – requires very flexible packaging, installation, and configuration tools and procedures • A lot of effort had to be invested in this area • There are many problems – but in the end we are quite successful • System is stable and reliable • System is used in production • System is reasonably easy to install now – 60 sites • Now have a basis on which to incrementally build essential functionality • This infrastructure forms the basis of the initial EGEE production service LCG-Asia Workshop – 28 July 2004 - 22
The LCG Deployment Board • Grid Deployment Board (GDB) set up to address policy issues requiring agreement and negotiation between resource centres • Members: country representatives, applications, and project managers • Sets up working groups • Short term or ongoing • Bring in technical experts to focus on specific issues • GDB approves recommendations from working groups • Groups: • Several that outlined initial project directions (operations, security, resources, support) • Security – standing group – covers many policy issues • Grid Operations Centre task force • User Support group • Storage management and other focused issues • Service challenges LCG-Asia Workshop – 28 July 2004 - 23
Operations services for LCG • Operational support • Hierarchical model • CERN acts as 1st level support for the Tier 1 centres • Tier 1 centres provide 1st level support for associated Tier 2s • “Tier 1 sites” → “primary sites” • Grid Operations Centres (GOC) • Provide operational monitoring, troubleshooting, coordination of incident response, etc. • RAL (UK) led sub-project to prototype a GOC • 2nd GOC in Taipei now in operation • Together providing 16hr coverage • Expect 3rd centre in Canada/US to help achieve 24hr coverage • User support • Central model • FZK provides user support portal • Problem tracking system web-based and available to all LCG participants • Experiments provide triage of problems • CERN team provides in-depth support and support for integration of experiment software with grid middleware LCG-Asia Workshop – 28 July 2004 - 24
GGUS - Concept • Target: 24×7 support via time difference and 3 support teams • Currently: GGUS at FZK and GGUS at ASCC • Desired: GGUS in the USA LCG-Asia Workshop – 28 July 2004 - 25
Support Teams within LCG • Grid Operations Center (GOC) – operations problems • Resource Centers (RC) – hardware problems • CERN Deployment Support (CDS) – middleware problems • Global Grid User Support (GGUS) – single point of contact, coordination of user support • Experiment Specific User Support (ESUS) – software problems • Other communities (VOs): 4 LHC experiments (ALICE, ATLAS, CMS, LHCb), 4 non-LHC experiments (BaBar, CDF, Compass, D0) LCG-Asia Workshop – 28 July 2004 - 26
Security • LCG Security Group • LCG usage rules – proposed as general Grid usage guidelines • Registration procedures and VO management • Agreement to collect only minimal amount of personal data • Registration has limited validity • Initial audit requirements are defined • Initial incident response procedures • Site security contacts etc. are defined • Set of trusted CAs (including Fermilab online KCA) • Security policy • This group is now a Joint Security group covering several grid projects/infrastructure LCG-Asia Workshop – 28 July 2004 - 27
LCG Security environment – the players [diagram]: Users (personal data, roles, usage patterns, …), VOs (experiment data, access patterns, membership, …), Sites (resources, availability, accountability, …), all interacting through the Grid LCG-Asia Workshop – 28 July 2004 - 28
The Risks • Top risks from the Security Risk Analysis • http://proj-lcg-security.web.cern.ch/proj-lcg-security/RiskAnalysis/risk.html • Launch attacks on other sites (large distributed farms of machines) • Illegal or inappropriate distribution or sharing of data (massive distributed storage capacity) • Disruption by exploit of security holes (complex, heterogeneous and dynamic environment) • Damage caused by viruses, worms etc. (highly connected and novel infrastructure) LCG-Asia Workshop – 28 July 2004 - 29
Policy – the LCG Security Group [diagram of policy documents]: Security & Availability Policy; Usage Rules; User Registration; Incident Response; Certification Authorities; Audit Requirements; Application Development & Network Admin Guide; joint GOC guides. http://cern.ch/proj-lcg-security/documents.html LCG-Asia Workshop – 28 July 2004 - 30
Authentication Infrastructure • Users and Services own long-lived (1yr) credentials • Digital certificates (X.509 PKI) • European Grid Policy Management Authority • “… is a body to establish requirements and best practices for grid identity providers to enable a common trust domain applicable to authentication of end-entities in inter-organisational access to distributed resources. …” • www.eugridpma.org covers EU (+ USA + Asia) • Jobs submitted with Grid Proxy Certificates • Short-lived (<24hr) credential which “travels” with job • Delegation allows service to act on behalf of user • Proxy renewal service for long-running & queued jobs • Some Issues… • Do trust mechanisms scale up ? • “On-line” certification authorities & Certificate Stores • Kerberized CA • Virtual SmartCard • Limited delegation LCG-Asia Workshop – 28 July 2004 - 31
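To make the short-lived proxy idea concrete, here is a minimal sketch of the check a proxy-renewal service performs: how much lifetime a credential has left before a long-running or queued job would lose it. It assumes the third-party `cryptography` Python package; the proxy file path follows the conventional /tmp/x509up_u<uid> location but is only an example, and this is not part of any LCG middleware.

```python
# Illustrative sketch: inspect a (proxy) certificate's remaining lifetime, the
# kind of check a proxy renewal service makes before a short-lived (<24h)
# credential expires. Requires the third-party "cryptography" package.
from datetime import datetime
from cryptography import x509


def remaining_lifetime(pem_path):
    """Time left before the certificate in pem_path expires."""
    with open(pem_path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    # not_valid_after is a naive UTC datetime in the cryptography package
    return cert.not_valid_after - datetime.utcnow()


if __name__ == "__main__":
    left = remaining_lifetime("/tmp/x509up_u1000")   # conventional proxy path (example uid)
    print("Proxy lifetime remaining:", left)
    if left.total_seconds() < 3600:
        print("Less than one hour left - renew or re-delegate before submitting long jobs.")
```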
User Registration (2003-4) [flow diagram: User – lcg-registrar.cern.ch – XYZ VO Manager – Resources/Sites] 1. User, with a Grid certificate, to lcg-registrar.cern.ch: “I agree to the Usage Rules, please register me, my VO is XYZ”. 2. Confirm email. 3. User details forwarded to the XYZ VO Manager. 4. Register. 5. Notify. 6. User details propagated to resources and sites, where CA certificates and authorization checks apply when the user submits a job. LCG-Asia Workshop – 28 July 2004 - 32
User Registration (? 2004 - ) [diagram: user with certificate, roles ?, XYZ VO Manager] • Some Issues • Static user mappings will not scale up • Multiple VO membership • Complex authorization & policy handling • VO manager needs to validate user data • How ? • Solutions (see the sketch below) • VO Management Service - attribute proxy certificates • Groups and roles - not just static user mapping • Attributes bound to proxy cert., signed by VO service • Credential mapping and authorization • Flexible policy intersection and mapping tools • Integrate with organizational databases, but … • What about exceptions ? (the 2-week summer student) • What about other VO models: lightweight, deployment, testing LCG-Asia Workshop – 28 July 2004 - 33
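A minimal sketch of what “groups and roles rather than static user mapping” can mean in practice: map the VO/group/role attributes carried in a VOMS-style proxy to local or pool accounts via a small policy table. The attribute format is VOMS-like, but the policy table, account names and function names are invented for illustration.

```python
# Illustrative sketch: role/group-based credential mapping instead of a static
# per-user grid-mapfile. The policy entries and account names are invented.
MAPPING_POLICY = [
    # (VO, group, role)  ->  local account or pool-account prefix
    {"vo": "atlas", "group": "/atlas/production", "role": "production", "account": "atlasprd"},
    {"vo": "atlas", "group": "/atlas",            "role": None,         "account": "atlas_pool"},
    {"vo": "cms",   "group": "/cms",              "role": None,         "account": "cms_pool"},
]


def map_to_local_account(vo, group, role=None):
    """Return the local account for the most specific matching policy entry."""
    candidates = [
        p for p in MAPPING_POLICY
        if p["vo"] == vo
        and group.startswith(p["group"])
        and (p["role"] is None or p["role"] == role)
    ]
    if not candidates:
        raise PermissionError(f"no mapping for VO={vo} group={group} role={role}")
    # Prefer entries with an explicit role, then the longest group prefix
    best = max(candidates, key=lambda p: (p["role"] is not None, len(p["group"])))
    return best["account"]


if __name__ == "__main__":
    print(map_to_local_account("atlas", "/atlas/production", "production"))  # atlasprd
    print(map_to_local_account("atlas", "/atlas/higgs"))                     # atlas_pool
```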
Audit & Incident Response • Audit Requirements • Mandates retention of logs by sites • Incident Response • Security contact data gathered when site registers • Establish communication channels • maillists maintained by Deployment Team • List of CSIRT lists • Channel for reporting • Security contacts at site • Channel for discussion & resolution • Escalation path • 2004 Security Service Challenges • Check the data is there, complete and communications are open LCG-Asia Workshop – 28 July 2004 - 34
Security Collaboration • Projects sharing resources & have close links • Need for inter-grid global security collaboration • Common accepted Usage Rules • Common authentication and authorization requirements • Common incident response channels • LCG – EGEE – OSG • LCG Security Group is now Joint Security Group • JSG for LCG & EGEE & OSG • Provide requirements for middleware development • Some members from OSG already in JSG LCG-Asia Workshop – 28 July 2004 - 35
What is EGEE ? (I) • EGEE (Enabling Grids for e-Science in Europe) is a seamless Grid infrastructure for the support of scientific research, which: • Integrates current national, regional and thematic Grid efforts • Provides researchers in academia and industry with round-the-clock access to major computing resources, independent of geographic location [diagram: Applications – Grid infrastructure – Geant network] LCG-Asia Workshop – 28 July 2004 - 36
What is EGEE ? (II) • 70 institutions in 28 countries, federated in regional Grids • 32 M Euros EU funding (2004-5), O(100 M) total budget • Aiming for a combined capacity of over 8000 CPUs (the largest international Grid infrastructure ever assembled) • ~ 300 persons LCG-Asia Workshop – 28 July 2004 - 37
EGEE Activities • Emphasis on operating a production grid and supporting the end-users • 48 % service activities • Grid Operations, Support and Management, Network Resource Provision • 24 % middleware re-engineering • Quality Assurance, Security, Network Services Development • 28 % networking • Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation LCG-Asia Workshop – 28 July 2004 - 38
EGEE infrastructure • Access to networking services provided by GEANT and the NRENs • Production Service: • in place (based on LCG-2) • for production applications • runs only proven stable, debugged middleware and services • Will continue adding new sites in EGEE federations • Pre-production Service: • For middleware re-engineering • Certification and Training/Demo testbeds LCG-Asia Workshop – 28 July 2004 - 39
EGEE Middleware Activity • Activity concentrated in few major centers • Middleware selection based on requirements of Applications and Operations • Harden and re-engineer existing middleware functionality, leveraging the experience of partners • Provide robust, supportable components • Track standards evolution (WS-RF) LCG-Asia Workshop – 28 July 2004 - 40
GLite Middleware Implementation [evolution diagram: VDT, EDG, … → LCG-1 → LCG-2 (Globus 2 based); AliEn, LCG, … → EGEE-1 → EGEE-2 (web services based)] • From day 1 (1st April 2004): production grid service based on the LCG infrastructure running LCG-2 grid middleware • In parallel, develop a “next generation” grid facility: produce a new set of grid services according to evolving standards (web services); run a pre-production service providing early access for evaluation purposes; will replace LCG-2 on the production facility in 2005 LCG-Asia Workshop – 28 July 2004 - 41
EGEE Middleware: gLite • Starts with components from AliEn, EDG, VDT etc. • Aim at addressing advanced requirements from applications • Prototyping: short development cycles for fast user feedback • Initial web-services based prototype being tested internally with representatives from the application groups LCG-Asia Workshop – 28 July 2004 - 42
EGEE Middleware Implementation • LCG-2 • Current base for production services • Evolved with certified new or improved services from the pre-production • Pre-production Service • Early application access for new developments • Certification of selected components from gLite • Starts with LCG-2 • Migrate to new middleware in 2005 • Organising smooth/gradual transition from LCG-2 to gLite for production operations [timeline: 2004 – LCG-2 (=EGEE-0) as product, gLite prototyping; 2005 – prototype becomes product, LCG-3 (=EGEE-x?)] LCG-Asia Workshop – 28 July 2004 - 43
Distributed Physics Analysis – The ARDA Project [diagram: distributed analysis pilots for ALICE, ATLAS, CMS and LHCb, coordinated through the ARDA project (collaboration, coordination, integration, specifications, priorities, planning) together with the EGEE Middleware Activity] • ARDA – distributed physics analysis, batch to interactive, end-user emphasis • 4 pilots by the LHC experiments (core of the HEP activity in EGEE NA4) • Rapid prototyping pilot service • Providing focus for the first products of the EGEE middleware • Kept realistic by what the EGEE middleware can deliver LCG-Asia Workshop – 28 July 2004 - 44
LCG EGEE in Europe • User support: • Becomes hierarchical • Through the Regional Operations Centres (ROC) • Act as front-line support for user and operations issues • Provide local knowledge and adaptations • Coordination: • At CERN (Operations Management Centre) and CIC for HEP • Operational support: • The LCG GOC is the model for the EGEE CICs • CICs replace the European GOC at RAL • Also run essential infrastructure services • Provide support for other (non-LHC) applications • Provide 2nd level support to ROCs LCG-Asia Workshop – 28 July 2004 - 45
Interoperability – convergence? Can we converge and agree on common interfaces? protocols? implementations? middleware? • Information systems – all MDS-based; bring the schemas together • Storage management – common ideas: SRM • File catalogs – not yet clear • Security – joint security group; policy; VO management, … LCG-Asia Workshop – 26 July 2004 - 46
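Since the information systems are all MDS-based, interoperability at that level largely means agreeing on the published LDAP schema (GLUE). A minimal sketch of querying such an information index is shown below; the host name is hypothetical, and the port, base DN and GLUE attribute names are typical LCG-2-era values quoted from memory, so treat them as assumptions to check against a real BDII. Uses the third-party `ldap3` package.

```python
# Illustrative sketch: query a Grid information system (MDS/BDII) over LDAP.
# Port 2170, base DN "mds-vo-name=local,o=grid" and the GLUE attribute names
# are era-typical assumptions, not guaranteed values.
from ldap3 import Server, Connection, ALL

server = Server("lcg-bdii.example.org", port=2170, get_info=ALL)  # hypothetical host
conn = Connection(server, auto_bind=True)                         # anonymous bind

# List computing elements and their published CPU counts (GLUE 1.x schema).
conn.search(
    search_base="mds-vo-name=local,o=grid",
    search_filter="(objectClass=GlueCE)",
    attributes=["GlueCEUniqueID", "GlueCEInfoTotalCPUs"],
)
for entry in conn.entries:
    print(entry.GlueCEUniqueID, entry.GlueCEInfoTotalCPUs)
```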
Linking HEPGrid to LCG (M.C. Vetterli, SFU/TRIUMF)
[diagram: LCG BDII/RB/scheduler – TRIUMF negotiator/scheduler – Grid-Canada (GC) negotiator/scheduler with resources GC Res.1 … GC Res.n – WG (UBC/TRIUMF) RB/scheduler – TRIUMF cpu & storage; resource class ads flow upwards via MDS, job class ads flow downwards]
Publishing resources: 1) Each GC resource publishes a class ad to the GC collector. 2) The GC CE aggregates this info and publishes it to TRIUMF as a single resource. 3) The same is done for WG. 4) TRIUMF aggregates GC & WG and publishes this to LCG as one resource. 5) TRIUMF also publishes its own resources separately. 6) The process is repeated on GC if necessary.
Running a job: 1) The LCG RB decides where to send the job (GC/WG or the TRIUMF farm). 2) The job either goes to the TRIUMF farm, or: 3) the Condor-G job manager at TRIUMF builds a submission script for the TRIUMF Grid; 4) the TRIUMF negotiator matches the job to GC or WG via their class ads; 5) the job is submitted to the proper resource. A toy sketch of this class-ad matchmaking follows below.
LCG-Asia Workshop – 26 July 2004 - 47
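The class-ad matchmaking that drives this flow can be illustrated with a toy matchmaker: resources advertise their properties, jobs state requirements and a rank, and a negotiator pairs them. This is a minimal sketch of the idea, not Condor's actual ClassAd language or API; the resource names and attributes are invented examples.

```python
# Toy class-ad matchmaking in the spirit of the Condor(-G) negotiator above:
# resources publish simple attribute dictionaries ("class ads"), jobs express
# requirements as predicates, and the negotiator picks the best-ranked match.
resource_ads = [
    {"name": "TRIUMF-farm",   "os": "linux", "free_cpus": 12, "max_walltime_h": 48},
    {"name": "GC-resource-1", "os": "linux", "free_cpus": 0,  "max_walltime_h": 24},
    {"name": "WG-UBC",        "os": "linux", "free_cpus": 64, "max_walltime_h": 12},
]

job_ad = {
    "owner": "atlas001",
    "requirements": lambda r: r["os"] == "linux" and r["free_cpus"] > 0 and r["max_walltime_h"] >= 24,
    "rank": lambda r: r["free_cpus"],   # prefer the resource with the most free CPUs
}


def negotiate(job, resources):
    """Return the best-ranked resource satisfying the job's requirements, or None."""
    matches = [r for r in resources if job["requirements"](r)]
    return max(matches, key=job["rank"]) if matches else None


if __name__ == "__main__":
    chosen = negotiate(job_ad, resource_ads)
    print("Job matched to:", chosen["name"] if chosen else "no suitable resource")
```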
What next? Service challenges • Proposed to be used in addition to ongoing data challenges and production use: • Goal is to ensure baseline services can be demonstrated • Demonstrate the resolution of problems mentioned above • Demonstrate that operational and emergency procedures are in place • 4 areas proposed: • Reliable data transfer (a minimal sketch follows below) • Demonstrate fundamental service for Tier 0 → Tier 1 by end 2004 • Job flooding/exerciser • Understand the limitations and baseline performance of the system • Incident response • Ensure the procedures are in place and work – before real life tests them • Interoperability • How can we bring together the different grid infrastructures? LCG-Asia Workshop – 28 July 2004 - 48
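What “reliable data transfer” implies at the service level can be sketched in a few lines: retry with backoff and verify an end-to-end checksum. The transfer command used here (globus-url-copy) is only an example stand-in, the checksum source is assumed to be known, and none of this is a specific LCG data-movement tool.

```python
# Illustrative sketch of "reliable data transfer" as a service building block:
# retry with exponential backoff and verify an end-to-end checksum afterwards.
import hashlib
import subprocess
import time


def checksum(path, algo="md5"):
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def reliable_copy(src_url, dst_url, dst_path, expected_md5, max_attempts=3):
    """Copy src_url to dst_url, retrying until the local copy matches expected_md5."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(["globus-url-copy", src_url, dst_url])
        if result.returncode == 0 and checksum(dst_path) == expected_md5:
            return True
        time.sleep(2 ** attempt)      # back off before retrying
    return False
```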
Conclusions • LCG has successfully deployed a service to 60 sites • Significant effort to get some reasonable stability and reliability • Still a long way to go to ensure adequate functionality in place for LHC startup • Focus on building up base level services such as reliable data movement • Continuing data and service challenges, analysis • LCG service will evolve rapidly • to respond to problems found in the current system • To integrate/migrate to new middleware • Effort also focussed on infrastructure and support services • Operations support, user support, security etc. • Bring together the various grid infrastructures for LCG users LCG-Asia Workshop – 26 July 2004 - 49