
STAR Computing Status, Out-source Plans, Residual Needs


Presentation Transcript


  1. STAR Computing Status, Out-source Plans, Residual Needs
  Torre Wenaus, STAR Computing Leader, BNL
  RHIC Computing Advisory Committee Meeting, BNL, October 11, 1999

  2. Outline
  • STAR Computing Status
  • Out-source plans
  • Residual needs
  • Conclusions

  3. Manpower
  • Very important development in the last 6 months: a big new influx of postdocs and students into computing and related activities
  • Increased participation and pace of activity in:
    - QA
    - online computing
    - production tools and operations
    - databases
    - reconstruction software
  • The planned dedicated database person was never hired (funding); databases are consequently late, but we are now transitioning from an interim to our final database
  • Still missing an online/general computing systems support person
    - Open position cancelled due to lack of funding
    - Shortfall continues to be made up by the local computing group

  4. Some of our Youthful Manpower
  A partial list of young students and postdocs now active in aspects of software:
  • Dave Alvarez, Wayne, SVT
  • Lee Barnby, Kent, QA and production
  • Jerome Baudot, Strasbourg, SSD
  • Selemon Bekele, OSU, SVT
  • Marguerite Belt Tonjes, Michigan, EMC
  • Helen Caines, Ohio State, SVT
  • Manuel Calderon, Yale, StMcEvent
  • Gary Cheung, UT, QA
  • Laurent Conin, Nantes, database
  • Wensheng Deng, Kent, production
  • Jamie Dunlop, Yale, RICH
  • Patricia Fachini, Sao Paolo/Wayne, SVT
  • Dominik Flierl, Frankfurt, L3 DST
  • Marcelo Gameiro, Sao Paolo, SVT
  • Jon Gangs, Yale, online
  • Dave Hardtke, LBNL, calibrations, DB
  • Mike Heffner, Davis, FTPC
  • Eric Hjort, Purdue, TPC
  • Amy Hummel, Creighton, TPC, production
  • Holm Hummler, MPG, FTPC
  • Matt Horsley, Yale, RICH
  • Jennifer Klay, Davis, PID
  • Matt Lamont, Birmingham, QA
  • Curtis Lansdell, UT, QA
  • Brian Lasiuk, Yale, TPC, RICH
  • Frank Laue, OSU, online
  • Lilian Martin, Subatech, SSD
  • Marcelo Munhoz, Sao Paolo/Wayne, online
  • Aya Ishihara, UT, QA
  • Adam Kisiel, Warsaw, online, Linux
  • Frank Laue, OSU, calibration
  • Hui Long, UCLA, TPC
  • Vladimir Morozov, LBNL, simulation
  • Alex Nevski, RICH
  • Sergei Panitkin, Kent, online
  • Caroline Peter, Geneva, RICH
  • Li Qun, LBNL, TPC
  • Jeff Reid, UW, QA
  • Fabrice Retiere, calibrations
  • Christelle Roy, Subatech, SSD
  • Dan Russ, CMU, trigger, production
  • Raimond Snellings, LBNL, TPC, QA
  • Jun Takahashi, Sao Paolo, SVT
  • Aihong Tang, Kent
  • Greg Thompson, Wayne, SVT
  • Fuquian Wang, LBNL, calibrations
  • Robert Willson, OSU, SVT
  • Richard Witt, Kent
  • Gene Van Buren, UCLA, documentation, tools, QA
  • Eugene Yamamoto, UCLA, calibrations, cosmics
  • David Zimmerman, LBNL, Grand Challenge

  5. Status of Computing Requirements
  • Internal review (particularly simulation) in process in connection with evaluating PDSF upgrade needs
  • No major changes with respect to earlier reviews
  • RCF resources should meet STAR reconstruction and central analysis needs (recognizing that the 1.5x re-reconstruction factor allows little margin for the unexpected)
  • Existing (primarily Cray T3E) offsite simulation facilities inadequate for simulation needs
  • Simulation needs addressed by PDSF ramp-up plans

  6. Current STAR Software Environment
  • Current software base is a mix of C++ (55%) and Fortran (45%)
    - Rapid evolution from ~20%/80% in September '98
    - New development, and all physics analysis, in C++
  • ROOT adopted 11/98 as analysis tool and foundation for the framework
    - Legacy Fortran codes and data structures supported without change
    - Deployed in offline production and analysis in Mock Data Challenge 2 (MDC2), Feb-Mar '99
  • ROOT adopted for the event data store after MDC2
    - Complemented by the MySQL relational DB: no more Objectivity
  • Post-reconstruction: C++/OO data model 'StEvent' implemented (a sketch of the idea follows this slide)
    - Initially purely transient; design unconstrained by I/O (ROOT or Objectivity)
    - Later implemented in persistent form using ROOT without changing the interface
    - Basis of all analysis software development
  • Next step: migrate the OO data model upstream to reconstruction
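To make the transient/persistent split concrete, here is a minimal, hypothetical sketch of an I/O-neutral C++ event model in the StEvent spirit. The class and member names (MyEvent, MyTrack) are invented for illustration and are not the actual STAR interfaces; the point is that analysis code depends only on the plain C++ interface, while a ROOT (or other) persistency layer can be added later without changing it.

```cpp
// Hypothetical sketch of an I/O-neutral event model (not the real StEvent).
#include <vector>
#include <cstddef>
#include <cstdio>

// Transient track: plain C++, no knowledge of ROOT or any storage technology.
class MyTrack {
public:
    MyTrack(float pt, float eta, int charge)
        : mPt(pt), mEta(eta), mCharge(charge) {}
    float pt()     const { return mPt; }
    float eta()    const { return mEta; }
    int   charge() const { return mCharge; }
private:
    float mPt;
    float mEta;
    int   mCharge;
};

// Transient event: owns its tracks; analysis code sees only this interface.
class MyEvent {
public:
    void addTrack(const MyTrack& t) { mTracks.push_back(t); }
    std::size_t numberOfTracks() const { return mTracks.size(); }
    const MyTrack& track(std::size_t i) const { return mTracks[i]; }
private:
    std::vector<MyTrack> mTracks;
};

int main() {
    // A persistency layer (e.g. ROOT streaming) can be layered on top later
    // without touching the classes above or the analysis code that uses them.
    MyEvent event;
    event.addTrack(MyTrack(1.2f,  0.5f, +1));
    event.addTrack(MyTrack(0.8f, -1.1f, -1));
    std::printf("tracks in event: %zu\n", event.numberOfTracks());
    return 0;
}
```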

  7. MDC2 and Post-MDC2
  • STAR MDC2:
    - Full production deployment of the ROOT-based offline chain and I/O; all MDC2 production based on ROOT
    - Statistics suffered from software and hardware problems and the short MDC2 duration; about 1/3 of the 'best case scenario'
    - Very active physics analysis and QA program
    - StEvent (OO/C++ data model) in place and in use
  • During and after MDC2: addressing the problems
    - Program size: up to 850MB; reduced to <500MB in a broad cleanup
    - Robustness of multi-branch I/O (multiple file streams) improved
    - XDF-based I/O maintained as a stably functional alternative
    - Improvements to the 'Maker' organization of component packages
    - Completed by late May; infrastructure stabilized

  8. Software Status for the Engineering Run
  • Offline environment and infrastructure stabilized
  • Shift of focus to consolidation: usability improvements, documentation, user-driven enhancements, developing and responding to QA
  • DAQ-format data supported in offline from raw files through analysis
  • Stably functional data storage
    - 'Universal' I/O interface transparently supports all STAR file types
    - DAQ raw data, XDF, ROOT (Grand Challenge and online pool to come)
    - ROOT I/O debugging proceeded through June; now stable
  • StEvent in wide use for physics analysis and QA software
    - Persistent version of StEvent implemented and deployed
  • Very active analysis and QA program
  • Calibration/parameter DB not ready (now, 10/99, being deployed)

  9. Real Data Processing
  • The currently live detector is the TPC
    - 75% of the TPC read out (beam data and cosmics)
  • Can read and analyze zero-suppressed TPC data all the way to DST
    - Real-data DSTs read and used in StEvent post-reconstruction analysis
  • Bad channel suppression implemented and tested
  • First-order alignment worked out (~1mm); the rest to come from residuals analysis
  • 10,000 cosmics with no field, and several runs with field on
  • All interesting real data from the engineering run passed through regular production reconstruction and QA
    - Now preparing for a second iteration incorporating improvements in reconstruction codes and calibrations

  10. Event Store and Data Management
  • The success of ROOT-based event data storage from MDC2 on relegated Objectivity to a metadata management role, if any
    - ROOT provides storage for the data itself (see the sketch after this slide)
    - We can use a simpler, safer tool in the metadata role without compromising our data model, and avoid the complexities and risks of Objectivity
  • MySQL adopted: a relational DB, open software, widely used, very fast, but not a full-featured heavyweight like ORACLE
    - Wonderful experience so far: excellent tools, very robust, extremely fast
    - Scalability OK so far (e.g. 2M rows of 100 bytes); multiple servers can be used as needed to address scalability needs
    - Not taxing the tool, because metadata, not large-volume data, is stored
  • Objectivity is gone from STAR
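As a toy illustration of ROOT as the event store, the macro-style sketch below writes a small TTree to a file and reads it back. The file name, branch names, and values are invented for the example and do not correspond to STAR's actual event files; in a ROOT session it could be loaded with `.L toyEvents.C` and run by calling `writeEvents()` then `readEvents()`.

```cpp
// Toy sketch of ROOT as the event data store: write a TTree, read it back.
// File and branch names are illustrative only.
#include "TFile.h"
#include "TTree.h"

void writeEvents() {
    TFile* out  = new TFile("toy_events.root", "RECREATE");
    TTree* tree = new TTree("T", "toy event data");
    Int_t   nTracks = 0;
    Float_t vz      = 0;
    tree->Branch("nTracks", &nTracks, "nTracks/I");
    tree->Branch("vz",      &vz,      "vz/F");
    for (int ev = 0; ev < 1000; ++ev) {
        nTracks = ev % 50;                    // stand-ins for reconstruction output
        vz      = 0.1f * (ev % 200) - 10.0f;
        tree->Fill();
    }
    out->Write();   // the tree belongs to the file and is written with it
    out->Close();
    delete out;
}

void readEvents() {
    TFile* in   = new TFile("toy_events.root");
    TTree* tree = (TTree*)in->Get("T");
    Int_t nTracks = 0;
    tree->SetBranchAddress("nTracks", &nTracks);
    for (int i = 0; i < tree->GetEntries(); ++i) {
        tree->GetEntry(i);
        // ... analysis on nTracks would go here ...
    }
    in->Close();
    delete in;
}
```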

  11. Requirements: STAR 8/99 View (My Version)

  12. RHIC Data Management: Factors For Evaluation
  • My perception of how the STAR view has shifted from '97 to now, factor by factor, between Objectivity and ROOT+MySQL:
    - Cost
    - Performance and capability as a data access solution
    - Quality of technical support
    - Ease of use, quality of documentation
    - Ease of integration with analysis
    - Ease of maintenance, risk
    - Commonality among experiments
    - Extent, leverage of outside usage
    - Affordable/manageable outside RCF
    - Quality of data distribution mechanisms
    - Integrity of replica copies
    - Availability of browser tools
    - Flexibility in controlling permanent storage location
    - Level of relevant standards compliance, e.g. ODMG
    - Java access
    - Partitioning DB and resources among groups

  13. STAR Production Database
  • MySQL-based production database (for want of a better term) in place, with the following components (a catalog query sketch follows this slide):
    - File catalogs
      - Simulation data catalog: populated with all simulation-derived data in HPSS and on disk
      - Real data catalog: populated with all real raw and reconstructed data
    - Run log and online log: fully populated and interfaced to online run-log entry
    - Event tag databases: database of DAQ-level event tags, populated by an offline scanner; needs to be interfaced to the buffer box and extended with downstream tags
    - Production operations database: production job status and QA info
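For illustration, here is a hedged sketch of how a client might look up the files of one run in such a MySQL catalog, using the standard MySQL C API. The table and column names (RealDataCatalog, run, path) and the connection parameters are hypothetical, not the actual STAR schema.

```cpp
// Sketch: query a hypothetical MySQL file catalog for the files of one run.
// Table, column, and connection parameters are invented for illustration.
#include <mysql.h>
#include <cstdio>

int main() {
    MYSQL* db = mysql_init(NULL);
    if (!mysql_real_connect(db, "dbhost", "staruser", "secret",
                            "filecatalog", 0, NULL, 0)) {
        std::fprintf(stderr, "connect failed: %s\n", mysql_error(db));
        return 1;
    }

    // Find every catalogued file belonging to run 289005.
    const char* sql = "SELECT path FROM RealDataCatalog WHERE run = 289005";
    if (mysql_query(db, sql) != 0) {
        std::fprintf(stderr, "query failed: %s\n", mysql_error(db));
        mysql_close(db);
        return 1;
    }

    MYSQL_RES* res = mysql_store_result(db);
    MYSQL_ROW  row;
    while ((row = mysql_fetch_row(res)) != NULL)
        std::printf("file: %s\n", row[0]);

    mysql_free_result(res);
    mysql_close(db);
    return 0;
}
```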

  14. ROOT Status in STAR
  • ROOT is with us to stay!
    - No major deficiencies or obstacles found; no post-ROOT visions contemplated
  • ROOT community growing: Fermilab Run II, ALICE, MINOS
    - We are leveraging community developments
  • First US ROOT workshop at FNAL in March
    - Broad participation: >50 people from all major US labs and experiments
    - ROOT team present; heeded our priority requests
      - I/O improvements: robust multi-stream I/O and schema evolution
      - Standard Template Library support
      - Both emerging in subsequent ROOT releases
    - FNAL participation in development and documentation
      - ROOT guide and training materials recently released
  • Our framework is based on ROOT, but application codes need not depend on ROOT (nor is it forbidden to use ROOT in application codes)

  15. Software Releases and Documentation
  • Release policy and mechanisms stable and working fairly smoothly
  • Extensive testing and QA: nightly (latest version) and weekly (higher-statistics testing before the 'dev' version is released to 'new')
  • Software build tools switched from gmake to cons (perl): more flexible, easier to maintain, faster
  • Major push in recent months to improve the scope and quality of documentation
    - Documentation coordinator (coercer!) appointed
    - New documentation and code navigation tools developed
    - Needs prioritized; pressure being applied; new documentation has started to appear
    - Ongoing monthly tutorial program
  • With cons, doc/code tools, database tools, … perl has become a major STAR tool
  • Software by type (all / modified in the last 2 months):
    - C: 18,938 / 1,264
    - C++: 115,966 / 52,491
    - FORTRAN: 93,506 / 54,383
    - IDL: 8,261 / 162
    - KUMAC: 5,578 / 0
    - MORTRAN: 7,122 / 3,043
    - Makefile: 3,009 / 2,323
    - scripts: 36,188 / 26,402

  16. QA
  • Major effort during and since MDC2
  • Organized effort under 'QA Czar' Peter Jacobs; weekly meetings and QA reports
  • 'QA signoff' integrated with software release procedures
  • Suite of histograms and other QA measures in continuous use and development (a toy histogram check is sketched after this slide)
  • Automated tools managing production and extraction of QA measures from test and production running recently deployed
  • Acts as a very effective driver for debugging and development of the software, engaging a lot of people
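To give a flavor of what an automated, histogram-based QA check can look like, here is a minimal hypothetical sketch in ROOT: it fills a per-run distribution and flags the run if its mean drifts outside a tolerance band. The histogram, reference value, and tolerance are invented for illustration and are not part of STAR's actual QA suite.

```cpp
// Toy sketch of an automated histogram QA check (not the STAR QA suite).
// Fills a distribution for one "run" and flags it if the mean drifts.
#include "TH1F.h"
#include "TRandom.h"
#include <cstdio>
#include <cmath>

bool passQA(double mean, double reference, double tolerance) {
    // Flag the run if the observed mean is outside the tolerance band.
    return std::fabs(mean - reference) <= tolerance;
}

int main() {
    // Stand-in for a per-run QA histogram, e.g. hits per TPC track.
    TH1F hHits("hHits", "hits per track", 50, 0, 50);
    TRandom rng(12345);
    for (int i = 0; i < 10000; ++i)
        hHits.Fill(rng.Gaus(30.0, 4.0));   // toy data for one run

    const double reference = 30.0;   // expected mean (invented)
    const double tolerance = 1.0;    // allowed drift (invented)

    if (passQA(hHits.GetMean(), reference, tolerance))
        std::printf("QA PASS: mean = %.2f\n", hHits.GetMean());
    else
        std::printf("QA FAIL: mean = %.2f (expected %.1f +- %.1f)\n",
                    hHits.GetMean(), reference, tolerance);
    return 0;
}
```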

  17. Current Software Status
  • Infrastructure for year one pretty much there
  • Simulation stable
    - ~7TB of production simulation data generated
  • Reconstruction software for year one mostly there
    - Lots of current work on quality, calibrations, global reconstruction
    - TPC in the best shape; EMC in the worst (two new FTEs should help EMC catch up; 10% installation in year 1)
    - Well exercised in production; ~2.5TB of reconstruction output generated in production
  • Physics analysis software now actively underway in all working groups
    - Contributing strongly to reconstruction and QA
  • Major shift of focus in recent months away from infrastructure and towards reconstruction and analysis
    - Reflected in the program of last week's STAR Computing Week: predominantly reconstruction/analysis

  18. Priority Work for Year One Readiness
  • In progress...
    - Extending data management tools (MySQL DB + disk file management + HPSS file management + multi-component ROOT files)
    - Complete schema evolution, in collaboration with the ROOT team
    - Completion of the DB: integration of slow control as a data source, completion of online integration, extension to all detectors
    - Extend and apply the OO data model (StEvent) to reconstruction
    - Continued QA development
    - Reconstruction and analysis code development: responding to QA results and addressing year 1 code completeness
    - Improving and better integrating visualization tools
    - Management of CAS processing and data distribution, both for mining and individual-physicist-level analysis
    - Integration and deployment of the Grand Challenge

  19. STAR Analysis: CAS Usage Plan
  • CAS processing with DST input based on managed production by the physics working groups (PWGs) using the Grand Challenge Architecture
  • Later-stage processing on micro-DSTs (standardized at the PWG level) and 'nano-DSTs' (defined by individuals or small groups) occurs under the control of individual physicists and small groups
  • Mix of LSF-based batch and interactive work
    - On both Linux and Sun, but with far greater emphasis on Linux
  • For I/O-intensive processing, local Linux disks (14GB usable) and Suns available
    - Usage of local disks and availability of data to be managed through the file catalog
  • Web-based interface for management, submission and monitoring of analysis jobs in development

  20. Grand Challenge
  • What does the Grand Challenge do for the user?
    - Optimizes access to the HPSS-based data store
  • Improves data access for individual users
    - Allows event access by query:
      - Present a query string to the GCA (e.g. NumberLambdas>1)
      - Receive an iterator over the events which satisfy the query, as files are extracted from HPSS (this usage pattern is sketched after this slide)
    - Pre-fetches files so that "the next" file is requested from HPSS while you are analyzing the data in your first file
  • Coordinates data access among multiple users
    - Coordinates ftp requests so that a tape is staged only once per set of queries which request files on that tape
  • General user-level HPSS retrieval tool
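The user-facing pattern (submit a query, then iterate over events as their files arrive from HPSS) might look roughly like the hypothetical C++ sketch below. The GCQuery/GCIterator classes and their methods are invented to illustrate the pattern and are not the actual Grand Challenge API; the staging is faked with a fixed file list so the sketch stands alone.

```cpp
// Hypothetical sketch of the Grand Challenge usage pattern: submit a query,
// then iterate over matching events while their files are (notionally)
// prefetched from HPSS. All names are invented; staging is faked here.
#include <string>
#include <vector>
#include <cstddef>
#include <iostream>

struct GCEvent {
    std::string file;   // file that holds the event (staged from HPSS)
    int         id;     // event identifier
};

class GCIterator {
public:
    explicit GCIterator(const std::vector<GCEvent>& events)
        : mEvents(events), mIndex(0) {}
    bool nextEvent() {
        // In the real system this would block until the next file is staged,
        // while later files are already being prefetched in the background.
        return mIndex++ < mEvents.size();
    }
    const GCEvent& current() const { return mEvents[mIndex - 1]; }
private:
    std::vector<GCEvent> mEvents;
    std::size_t mIndex;
};

class GCQuery {
public:
    GCQuery(std::string components, std::string dataset, std::string predicate)
        : mComponents(components), mDataset(dataset), mPredicate(predicate) {}
    GCIterator submit() const {
        // Stand-in for the real index lookup and HPSS staging plan.
        std::vector<GCEvent> toy = { {"run289005_file01.root", 1},
                                     {"run289005_file01.root", 7},
                                     {"run289005_file02.root", 3} };
        return GCIterator(toy);
    }
private:
    std::string mComponents, mDataset, mPredicate;
};

int main() {
    // Ask for the dst and hits components of events with 1-9 global tracks.
    GCQuery query("dst,hits", "Run00289005", "glb_trk_tot>0 & glb_trk_tot<10");
    GCIterator it = query.submit();
    while (it.nextEvent())
        std::cout << "event " << it.current().id
                  << " from " << it.current().file << std::endl;
    return 0;
}
```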

  21. Grand Challenge Queries
  • Queries based on physics tag selections:
    - SELECT (component1, component2, …)
    - FROM dataset_name
    - WHERE (predicate_conditions_on_properties)
  • Example:
    - SELECT dst, hits
    - FROM Run00289005
    - WHERE glb_trk_tot>0 & glb_trk_tot<10
  • Event components: fzd, raw, dst-xdf, dst-root, hits, StrangeTag, FlowTag, StrangeMuDst, …
  • Mapping from run/event/component to file via the database
    - The GC index assembles tags + component file locations for each event
    - A tag-based query match yields the files requiring retrieval to serve up that event
  • Event-list-based queries allow using the GCA for general-purpose coordinated HPSS retrieval
    - Event-list-based retrieval:
      SELECT dst, hits
      Run 00289005 Event 1
      Run 00293002 Event 24
      Run 00299001 Event 3
      ...

  22. Grand Challenge in STAR

  23. STAR GC Implementation Plan
  • Interface GC client code to the STAR framework
    - Already runs on Solaris, Linux
    - Needs integration into framework I/O management
    - Needs connections to the STAR MySQL DB
  • Apply the GC index builder to STAR event tags
    - Interface is defined
    - Has been used with non-STAR ROOT files
    - Needs connection to STAR ROOT and MySQL DB
  • (New) manpower for the implementation now available
    - Experienced in STAR databases
    - Needs to come up to speed on the GCA

  24. Current STAR Status at RCF
  • Computing operations during the engineering run fairly smooth, apart from very severe security disruptions
  • Data volumes small, and the direct DAQ->RCF data path not yet commissioned
  • Effectively using the newly expanded Linux farm
  • Steady reconstruction production on CRS; transition to year 1 operation should be smooth
    - New CRS job management software deployed in MDC2 works well and meets our needs
  • Analysis software development and production underway on CAS
    - Tools for managing analysis operations under development
  • Integration of Grand Challenge data management tools into production and physics analysis operations to take place over the next few months
    - Not needed for early running (low data volumes)

  25. Concerns: RCF Manpower
  • Understaffing directly impacts:
    - Depth of support/knowledge base in crucial technologies, e.g. AFS, HPSS
    - Level and quality of user and experiment-specific support
    - Scope of RCF participation in software; much less central support/development effort in common software than at other labs (FNAL, SLAC)
      - e.g. ROOT is used by all four experiments, but there is no RCF involvement
  • Exacerbated by very tight manpower within the experiment software efforts
    - Some generic software development supported by LDRD (NOVA project of the STAR/ATLAS group)
  • The existing overextended staff is getting the essentials done, but the data flood is still to come
  • Concerns over RCF understaffing recently increased with the departure of Tim Sailer

  26. Concerns: Computer/Network Security
  • A careful balance is required between ensuring security and providing a productive and capable development and production environment
  • Not yet clear whether we are in balance or have already strayed into an unproductive environment
    - Unstable offsite connections, broken farm functionality, database configuration gymnastics, the farm (even the interactive part) cut off from the world, limited access to our data disks
  • Experiencing difficulties, and expecting new ones, particularly from the 'private subnet' configuration unilaterally implemented by RCF
    - The need should be (re)evaluated in light of the new lab firewall
  • RCF security is closely coupled to overall lab computer/network security; a coherent site-wide plan, as non-intrusive as possible, is needed
  • We are still recovering from the knee-jerk 'slam the doors' response of the lab to the August incident
    - Punching holes in the firewall to enable work to get done
    - I now regularly use PDSF@NERSC when offsite to avoid being tripped up by BNL security

  27. Other Concerns
  • HPSS transfer failures
    - During MDC2, in certain periods up to 20% of file transfers to HPSS failed dangerously
      - Transfers seem to succeed: no errors, and the file is seemingly visible in HPSS with the right size
      - But on reading we find the file is not readable
      - John Riordan has the list of errors seen during reading
    - In reconstruction we can guard against this (a simple guard of this kind is sketched after this slide), but it would be a much more serious problem for DAQ data: we cannot afford to read back from HPSS to check its integrity
  • Continuing networking disruptions
    - A regular problem in recent months: the network dropping out or very slow for unknown/unannounced reasons
    - If unintentional: bad network management. If intentional: bad network management.
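One way a reconstruction job can guard against such silent failures is to re-read each retrieved file end to end and compare the byte count against the size recorded in the catalog. The sketch below illustrates that idea only; it is an assumption about the approach, not the actual STAR production check, and the file name and expected size are invented.

```cpp
// Sketch: verify a file retrieved from mass storage by re-reading it and
// comparing its byte count against the size recorded in the file catalog.
// This is an illustrative guard, not the actual STAR production check.
#include <fstream>
#include <cstdio>

// Returns true if the file can be read end-to-end and matches expectedBytes.
bool verifyRetrievedFile(const char* path, long expectedBytes) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;                       // file missing or unreadable

    char buffer[65536];
    long bytesRead = 0;
    while (in.read(buffer, sizeof(buffer)) || in.gcount() > 0) {
        bytesRead += in.gcount();
        if (in.eof()) break;
    }
    return bytesRead == expectedBytes;           // truncated files fail here
}

int main() {
    // The expected size would come from the production/file catalog.
    const long expected = 1234567;               // hypothetical catalog value
    if (!verifyRetrievedFile("retrieved_file.root", expected))
        std::fprintf(stderr, "integrity check failed; re-stage the file\n");
    return 0;
}
```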

  28. Public Information and Documentation Needed
  • A clear list of the services RCF provides, the level of support of these services, the resources allocated to each experiment, and the personnel responsible for supporting each:
    - rcas: LSF, ...
    - rcrs: CRS software, ...
    - AFS: (stability, home directories, ...)
    - Disks: (inside/outside access)
    - HPSS
    - experiment DAQ/online interface
  • Web-based information is very incomplete
    - e.g. information on planned facilities for year one and after
    - Largely a restatement of the first point
  • General communication
    - RCF still needs improvement in general user communication and responsiveness

  29. Outsourcing Computing in STAR
  • Broad local RCF, BNL-STAR and remote usage of STAR software; STAR environment setup counts since early August:
    - RCF: 118,571
    - BNL Sun: 33,508
    - BNL Linux: 13,418
    - Desktop: 6,038
    - HP: 801
    - LBL: 29,308
    - Rice: 12,707
    - Indiana: 19,852
  • Non-RCF usage currently comparable to RCF usage: good distributed computing support is essential in STAR
  • Enabled by the AFS-based environment; AFS is an indispensable tool
    - But inappropriate for data access usage
  • Agreement reached with RCF for read-only access to RCF NFS data disks from STAR BNL computers; seems to be working well
  • New BNL-STAR facilities
    - 6 dual 500MHz/18GB machines (2 arrived), 120GB disk
    - For software development, software and OS/compiler testing, online monitoring, services (web, DB, …)
    - Supported and managed by STAR personnel
    - Supports the STAR environment for Linux desktop boxes

  30. STAR Offsite Computing: PDSF
  • PDSF at LBNL/NERSC
    - Virtually identical configuration to RCF: Intel/Linux farm, limited Sun/Solaris, HPSS-based data archiving
  • Current (10/99) scale relative to STAR at RCF: CPU ~50% (1200 Si95), disk ~85% (2.5TB)
  • Long-term goal: resources ~equal to STAR's share of RCF
    - Consistent with the long-standing plan that RCF hosts ~50% of the experiments' computing facilities: simulation and some analysis offsite
    - Ramp-up currently being planned
  • Other NERSC resources: T3Es a major source of simulation cycles
    - 210,000 hours allocated in FY00: one of the larger allocations in terms of CPU and storage
    - Focus in future will be on PDSF; no porting to the next MPP generation

  31. STAR Offsite Computing: PSC
  • Cray T3Es at the Pittsburgh Supercomputing Center
    - STAR Geant3-based simulation used at PSC to generate ~4TB of simulated data in support of the Mock Data Challenges and software development
    - Supported by the local CMU group
  • Recently retired when our allocation ran out and could not be renewed
  • Increasing reliance on PSC

  32. STAR Offsite Computing: Universities
  • Physics analysis computing at home institutions
    - Processing of 'micro-DSTs' and DST subsets
    - Software development
    - Primarily based on small Linux clusters
  • Relatively small data volumes; aggregate total of ~10TB/yr
    - Data transfer needs of US institutes should be met by the network
    - Overseas institutes will rely on tape-based transfers; the existing self-service scheme will probably suffice
  • Some simulation production at universities
    - Rice, Dubna

  33. Residual Needs
  • Data transfer to PDSF and other offsite institutes
    - Existing self-service DLT probably satisfactory for non-PDSF tape needs, but little experience to date
    - The 100GB/day network transfer rate available today is adequate for PDSF/NERSC data transfer
    - Future PDSF transfer needs (network, tape) to be quantified once the PDSF scale-up is better understood

  34. Conclusions: STAR at RCF
  • Overall: RCF is an effective facility for STAR data processing and management
  • Sound choices in overall architecture, hardware, software
    - Well aligned with the HENP community; community tools and expertise easily exploited
    - Valuable synergies with non-RHIC programs, notably ATLAS
  • Production stress tests have been successful and instructive
    - On schedule; facilities have been there when needed
  • RCF interacts effectively and consults appropriately for the most part, and is generally responsive to input
    - Weak points are security issues and interactions with general users (as opposed to experiment liaisons and principals)
  • Mock Data Challenges have been highly effective in exercising, debugging and optimizing RCF production facilities as well as our software
  • Based on status to date, we expect STAR and RCF to be ready for whatever RHIC Year 1 throws at us.
