This overview discusses the POOL framework, a component-based system that provides persistent object storage for LHC experiments. It outlines the architecture, integration with existing technologies, and the role of POOL in the LCG applications area.
POOL Project Overview Dirk Düllmann CERN Openlab storage workshop 17th March 2003
What is POOL? • POOL is the LCG Persistency Framework • Pool of persistent objects for LHC • Started by LCG-SC2 in April ’02 • Common effort in which the experiments take over a major share of the responsibility • for defining the system architecture • for development of POOL components • ramping up over the last year from 1.5 to ~10 FTE
POOL and the LCG Architecture Blueprint • POOL is a component based system • A technology neutral API • Abstract C++ interfaces • Implemented reusing existing technology • ROOT I/O for object streaming • complex data, simple consistency model (write once) • RDBMS for consistent meta data handling • simple data, transactional consistency • POOL does not replace any of its component technologies • It integrates them to provide higher level services • Insulates physics applications from implementation details of components and technologies used today
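As an illustration of what "technology neutral API" and "abstract C++ interfaces" can mean in practice, here is a minimal sketch. Every name in it (`IStorageBackend`, `StreamingBackend`, `RelationalBackend`, `roundTrip`) is hypothetical and chosen for this example only; none of it is taken from the POOL code base. Both back ends are in-memory stand-ins that imitate just the consistency models named on the slide.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical names throughout; this is a sketch, not the POOL API.
namespace sketch {

// Technology-neutral interface: client code depends only on this class.
class IStorageBackend {
public:
    virtual ~IStorageBackend() {}
    virtual void write(const std::string& key, const std::string& bytes) = 0;
    virtual std::string read(const std::string& key) const = 0;
    virtual std::string technology() const = 0;
};

// In-memory stand-in for an object-streaming back end (write-once model).
class StreamingBackend : public IStorageBackend {
public:
    void write(const std::string& key, const std::string& bytes) override {
        // Write once: an existing key is never overwritten.
        if (store_.count(key) != 0) throw std::logic_error("write-once violation");
        store_[key] = bytes;
    }
    std::string read(const std::string& key) const override { return store_.at(key); }
    std::string technology() const override { return "streaming"; }
private:
    std::map<std::string, std::string> store_;
};

// In-memory stand-in for a relational back end (updates are allowed).
class RelationalBackend : public IStorageBackend {
public:
    void write(const std::string& key, const std::string& bytes) override {
        rows_[key] = bytes;  // insert or update
    }
    std::string read(const std::string& key) const override { return rows_.at(key); }
    std::string technology() const override { return "relational"; }
private:
    std::map<std::string, std::string> rows_;
};

// Client code is written once against the interface and works with either back end.
inline std::string roundTrip(IStorageBackend& backend,
                             const std::string& key,
                             const std::string& value) {
    assert(!key.empty());  // precondition of this sketch
    backend.write(key, value);
    return backend.read(key);
}

}  // namespace sketch
```

Swapping the back end changes nothing in `roundTrip`, which is the insulation property the slide describes.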
POOL as an LCG component • Persistency is just one of several projects in the LCG Applications Area • Sharing a common architecture and s/w process • as described in the Blueprint and Persistency RTAG documents • Persistency is important… • …but not important enough to allow for uncontrolled direct dependencies, eg of experiment code on its implementation • Common effort in which the experiments take over a major share of the responsibility • for defining the overall and detailed architecture • for development of POOL components
POOL Work Package breakdown • Based on outcome of SC2 persistency RTAG • File Catalog • keep track of files (and their physical and logical names) and their description • resolve a logical file reference (FileID) into a physical file • pool::IFileCatalog • Collections • keep track of (large) object collections and their description • pool::Collection<T> • Storage Service • stream transient C++ objects into/from storage • resolve a logical object reference into a physical object • Object Cache (DataService) • keep track of already read objects to speed up repeated access to the same data • pool::IDataSvc and pool::Ref<T>
POOL and the GRID • GRID mostly deals with data of file level granularity • File Catalog connects POOL to Grid Resources • eg via our EDG-RLS backend • POOL Storage Service deals with intra file structure • need connection via standard Grid File access • Both File and Object based Collections are seen as important End User concepts • POOL offers a consistent interface to both types • Need to understand to what extent these can be provided in a Grid environment
How does POOL fit into the environment? • POOL will be mainly used from experiment frameworks • mostly as client library loaded from user application • Production Manager • Creates and maintains shared file catalogs and (event) collections • eg add the catalog fragment for the new simulation data to the published analysis catalog • End User • Uses shared collections • eg iterate over collection X
[Diagram labels: Exp. DB Services (Book Keeping, Production Workflow); Grid (File) Services (File Description, Replica Location, Remote File I/O?); POOL client on a CPU Node (User Application, Experiment Framework, POOL); RDBMS Services; Collection Description?; Collection Location?; Collection Access; remote access via ROOT I/O]
POOL File Catalog • POOL uses a GUID implementation for FileID • unique and immutable identifier for a file (generated at create time) • allows sets of files with internal references to be produced without requiring a central ID allocation service • catalog fragments created independently can later be merged without modification to data files • Object lookup is based only on the right side box! • Logical filenames are supported but not required
[Diagram labels, in order: Logical Naming; Object Lookup]
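The catalog behaviour described above (lookup keyed on an immutable FileID, optional logical names, and merging of independently created fragments) can be sketched as follows. This is an in-memory illustration under stated assumptions: `FileCatalog`, `CatalogEntry`, `registerFile`, `lookup` and `merge` are hypothetical names, not the `pool::IFileCatalog` interface, and GUIDs are represented as plain strings whose generation is out of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical names throughout; this is a sketch, not the POOL API.
namespace sketch {

typedef std::string FileID;  // a GUID rendered as text (assumed to be unique)

struct CatalogEntry {
    std::vector<std::string> physicalNames;  // one or more physical file names
    std::string logicalName;                 // optional; empty means "none"
};

class FileCatalog {
public:
    void registerFile(const FileID& id, const std::string& physicalName,
                      const std::string& logicalName = "") {
        assert(!id.empty());
        CatalogEntry& entry = entries_[id];
        entry.physicalNames.push_back(physicalName);
        if (!logicalName.empty()) entry.logicalName = logicalName;
    }

    // Resolve a FileID to its first known physical name.
    // Returns false (and leaves 'out' untouched) if the ID is unknown.
    bool lookup(const FileID& id, std::string& out) const {
        std::map<FileID, CatalogEntry>::const_iterator it = entries_.find(id);
        if (it == entries_.end() || it->second.physicalNames.empty()) return false;
        out = it->second.physicalNames.front();
        return true;
    }

    // Merge a fragment created elsewhere. Because IDs are globally unique,
    // no entry needs renumbering and no data file has to be touched.
    // The fragment must be a different catalog object than *this.
    void merge(const FileCatalog& fragment) {
        assert(&fragment != this);
        std::map<FileID, CatalogEntry>::const_iterator it;
        for (it = fragment.entries_.begin(); it != fragment.entries_.end(); ++it) {
            CatalogEntry& mine = entries_[it->first];
            mine.physicalNames.insert(mine.physicalNames.end(),
                                      it->second.physicalNames.begin(),
                                      it->second.physicalNames.end());
            if (mine.logicalName.empty()) mine.logicalName = it->second.logicalName;
        }
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::map<FileID, CatalogEntry> entries_;
};

}  // namespace sketch
```

Note that `lookup` never consults the logical name, mirroring the point that logical filenames are supported but not required for object lookup.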
Use Case: Working in Isolation • The user extracts a set of interesting files and a catalog fragment describing them from a (central) grid based catalog into a local (eg XML based) catalog. • Selection is performed based on file or collection descriptions • After disconnecting from the grid the user executes some standard jobs navigating through the extracted data. • New output files are registered into the local catalog • Once the new data is ready for publishing and the user is connected the new catalog fragment is submitted to the grid based catalog.
[Diagram labels: File Catalog & Descr; Extraction; Grid File Storage; Local File Catalog; Local Files; Local Processing; New Files; New Catalog & Descr; Result Publishing]
Use Case: Farm Production • Production manager may pre-register output files with the catalog (eg a “local” MySQL or XML catalog) • File ID, physical filename, job ID and optionally also logical filenames • A production job runs and creates files and their catalog entries locally. • During the production the catalog can be used to clean up files (and their registration) from unsuccessful jobs based on their associated job ID. • Once the data quality checks have been passed the production manager decides to publish the production catalog fragment to the grid based catalog.
[Diagram labels: Production Node 1, Production Node 2, … Production Node n (each with Local File Catalog and Local Files); Post Processing; New Files; New Catalog & Descr; Result Publishing; Grid Catalogue; File Catalog & Descr; Grid File Storage]
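The cleanup step in this use case relies on each catalog entry carrying the ID of the job that produced it. A minimal sketch of that idea follows; `ProductionCatalog`, `preRegister` and `removeJob` are hypothetical names for illustration, and the sketch removes only the registrations (deleting the physical files themselves is left out).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical names throughout; this is a sketch, not the POOL API.
namespace sketch {

struct ProductionEntry {
    std::string physicalName;
    std::string jobId;  // the job that produced (or will produce) the file
};

class ProductionCatalog {
public:
    // Pre-register an output file together with its producing job.
    void preRegister(const std::string& fileId, const std::string& physicalName,
                     const std::string& jobId) {
        assert(!fileId.empty() && !jobId.empty());
        ProductionEntry entry;
        entry.physicalName = physicalName;
        entry.jobId = jobId;
        entries_[fileId] = entry;
    }

    // Drop every registration belonging to an unsuccessful job.
    // Returns the number of entries removed.
    std::size_t removeJob(const std::string& jobId) {
        std::size_t removed = 0;
        std::map<std::string, ProductionEntry>::iterator it = entries_.begin();
        while (it != entries_.end()) {
            if (it->second.jobId == jobId) {
                it = entries_.erase(it);  // erase returns the next valid iterator
                ++removed;
            } else {
                ++it;
            }
        }
        return removed;
    }

    bool contains(const std::string& fileId) const { return entries_.count(fileId) != 0; }
    std::size_t size() const { return entries_.size(); }

private:
    std::map<std::string, ProductionEntry> entries_;  // keyed by file ID
};

}  // namespace sketch
```

Whatever remains in the catalog after the failed jobs are removed is the fragment a production manager would publish once quality checks pass.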
POOL Storage Hierarchy • An application may access databases (eg ROOT files) from a set of catalogs • Each database has containers of one specific technology (eg ROOT trees) • Smart Pointers are used • to transparently load objects into a client side cache • to define object associations across file or technology boundaries
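The smart-pointer idea above (load on first dereference, then serve repeated accesses from a client-side cache) can be sketched as below. This is not `pool::Ref<T>`: `Token`, `Cache`, `LazyRef` and `Track` are hypothetical names, and the loader callback stands in for whatever storage technology actually reads the object.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical names throughout; this is a sketch, not the POOL API.
namespace sketch {

// Assumed object address: which database (file) and which entry inside it.
struct Token {
    std::string fileId;
    long entry;
    bool operator<(const Token& other) const {
        return fileId < other.fileId || (fileId == other.fileId && entry < other.entry);
    }
};

// Example payload type used only for illustration.
struct Track {
    double pt;
};

// Client-side cache: each object is loaded at most once per token.
template <typename T>
class Cache {
public:
    typedef std::function<std::shared_ptr<T>(const Token&)> Loader;

    explicit Cache(Loader loader) : loader_(loader), loads_(0) {}

    std::shared_ptr<T> get(const Token& token) {
        typename std::map<Token, std::shared_ptr<T> >::iterator it = objects_.find(token);
        if (it != objects_.end()) return it->second;  // cache hit
        std::shared_ptr<T> object = loader_(token);   // cache miss: read from storage
        ++loads_;
        objects_[token] = object;
        return object;
    }

    int loads() const { return loads_; }

private:
    Loader loader_;
    std::map<Token, std::shared_ptr<T> > objects_;
    int loads_;
};

// Smart pointer that resolves its token only when first dereferenced.
template <typename T>
class LazyRef {
public:
    LazyRef(Cache<T>& cache, const Token& token) : cache_(&cache), token_(token) {}

    T& operator*() const { return *resolve(); }
    T* operator->() const { return resolve().get(); }

private:
    std::shared_ptr<T> resolve() const {
        if (!object_) object_ = cache_->get(token_);
        assert(object_ && "loader returned no object");
        return object_;
    }

    Cache<T>* cache_;
    Token token_;
    mutable std::shared_ptr<T> object_;  // filled in on first access
};

}  // namespace sketch
```

Because the token names a file as well as an entry, two references can point into different files (or different back ends behind the loader) without the client code noticing, which is the cross-file navigation the slide refers to.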
Client Data Access
[Diagram labels: Client; Ref<T>; Data Service; Data Cache]
Dictionary: Population/Conversion
[Diagram labels: .h; .xml; ROOTCINT; GCC-XML; Code Generator; Dictionary Generation; CINT dictionary code; LCG dictionary code; Gateway; I/O; CINT dictionary; LCG dictionary; Other Clients; Data I/O; Reflection]
Project Status & Plans • First four POOL releases delivered planned functionality on time • Aggressive schedule so far focusing on adding functionality • no consistent attempt at performance optimisation yet • Functionally complete (LCG-1 feature set) POOL V1.0 release scheduled for April • several functional extensions compared to V0.4 • automated system tests are being • Bug fix and performance release POOL V1.1 in June • Aim to be ready for first deployment together with LCG-1 environment • Will release • Work on proof of concept storage service re-implementation based on an RDBMS back end starting
Summary • The LCG POOL project provides a hybrid store integrating object streaming (eg ROOT I/O) with RDBMS technology (eg MySQL) for consistent meta data handling • Strong emphasis on component decoupling and well defined communication/dependencies • Transparent cross-file and cross-technology object navigation via C++ smart pointers • Integration with Grid technology (via EDG-RLS) • but preserving networked and grid-decoupled working modes • Next two releases (V1.0-functionality and V1.1-reliability & performance) will be crucial for POOL acceptance • Need tight coupling with experiment development and production teams to validate the feature set • Assume tight integration with LCG deployment activities
How to find out more about POOL? • POOL Home Page http://lcgapp.cern.ch/project/persist/ • POOL savannah portal http://lcgappdev.cern.ch/savannah/projects/pool