Grid Data Integration In the CMS Experiment

Saima Iqbal, Tony Solomonides & Ian Willers
CERN & University of the West of England, Bristol

Outline
• Project requirements
• Use of Data Warehouse and Data Marts
• Architectural design
• Use of POOL
• Prototype critical review
• Conclusion & Future Work

CMS data flow (figure)
• Tier 0: CERN
• Tier 1: IN2P3, RAL, FNAL
• Tier 2: universities and labs (Uni n, Lab a, Uni b, Lab c)
• Below Tier 2: department, desktop
• Link speeds labelled in the figure: 2.5 Gbps, 622 Mbps, 155 Mbps

Project Requirements
• Provide and maintain a read-only view of the data for the analysis applications.
• Provide a performant persistency mechanism that supports data retrieval from Distributed Heterogeneous Relational Databases (DHRD) across a Grid environment.
• Provide a flexible architecture that supports changes in the persistency requirements (such as schema evolution) and in the backend database technologies.

Use of Data Warehouse and Data Marts
• Data Warehouse
  • A database, often remote, that contains snapshots of data integrated from (distributed) heterogeneous data sources and offers a performant persistency mechanism.
  • A technology-independent repository.
  • Provides a read-only view of the data (i.e. no transactions allowed).
  • Built with a denormalised database schema (i.e. maximum indices and minimum relations) to support fast data access.
  • Populated through the ETL (Extraction, Transformation, Loading) process; provides a flexible persistency architecture.
  • Best supported by relational database management technologies.
• Extraction, Transportation, Transformation and Loading
  • Data is extracted from heterogeneous data sources, transformed according to the schema supported by the warehouse, then transported and loaded into the data warehouse.
• Data Marts
  • Databases that store replicated or distributed data from the centralized data warehouse.
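The ETL bullets above can be sketched in a few lines. This is only an illustration, not the prototype's code: in-memory SQLite databases stand in for the ORACLE and MySQL sources, and every table name, column name and unit (`ntuple`, `ntuple_rows`, `energy_gev`, `e_mev`) is a hypothetical choice made for the example.

```python
import sqlite3

def make_source_a():
    """Hypothetical source A: columns (run, event, energy_gev)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ntuple (run INTEGER, event INTEGER, energy_gev REAL)")
    db.executemany("INSERT INTO ntuple VALUES (?, ?, ?)",
                   [(1, 1, 12.5), (1, 2, 40.0)])
    return db

def make_source_b():
    """Hypothetical source B: same kind of data, other names, energy in MeV."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ntuple_rows (run_no INTEGER, evt_no INTEGER, e_mev REAL)")
    db.executemany("INSERT INTO ntuple_rows VALUES (?, ?, ?)", [(2, 1, 7000.0)])
    return db

def extract(db, query):
    """Extraction: pull raw rows out of one source database."""
    return db.execute(query).fetchall()

def transform_b(rows):
    """Transformation: convert source B rows to the warehouse units (GeV)."""
    return [(run, evt, e_mev / 1000.0) for run, evt, e_mev in rows]

def load(warehouse, source_name, rows):
    """Loading: append rows to the single denormalised warehouse table."""
    warehouse.executemany(
        "INSERT INTO events (source, run, event, energy_gev) VALUES (?, ?, ?, ?)",
        [(source_name, *row) for row in rows],
    )

def build_warehouse():
    wh = sqlite3.connect(":memory:")
    # One flat table plus an index: "maximum indices and minimum relations".
    wh.execute("CREATE TABLE events (source TEXT, run INTEGER, event INTEGER, energy_gev REAL)")
    wh.execute("CREATE INDEX idx_events_run ON events (run)")
    load(wh, "A", extract(make_source_a(), "SELECT run, event, energy_gev FROM ntuple"))
    load(wh, "B", transform_b(extract(make_source_b(),
                                      "SELECT run_no, evt_no, e_mev FROM ntuple_rows")))
    wh.commit()
    return wh

if __name__ == "__main__":
    for row in build_warehouse().execute("SELECT * FROM events ORDER BY run, event"):
        print(row)
```

Analysis code then only ever sees the `events` table, whichever source a row came from.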

Use of POOL
Figure: a C++ class goes through the POOL Relational Abstraction Layer, which uses the Relational Access Component to reach ORACLE (Tier-0) and the ODBC Component to reach MySQL (Tier-2). The POOL RelationalFileCatalog provides the connection string:
1- LFN (Logical Database Name)
2- PFN (Physical Database Name)
3- Connection String (database URL, User Name and Password)
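The lookup chain on this slide (logical name, then physical name, then connection string) can be illustrated with a toy catalog. This is not the POOL RelationalFileCatalog API; `ToyCatalog`, `CatalogEntry` and all registered values below are hypothetical stand-ins that only show the shape of the mapping.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CatalogEntry:
    """One physical database registered under a logical name (hypothetical)."""
    pfn: str       # physical database name
    url: str       # database URL part of the connection string
    user: str      # user name part of the connection string
    password: str  # password part of the connection string

class ToyCatalog:
    """Maps a logical database name (LFN) to its registered physical copies."""

    def __init__(self):
        self._entries = {}

    def register(self, lfn, entry):
        # A logical name may point at several replicas, so keep a list.
        self._entries.setdefault(lfn, []).append(entry)

    def lookup(self, lfn):
        if lfn not in self._entries:
            raise KeyError(f"no database registered under LFN {lfn!r}")
        return list(self._entries[lfn])

if __name__ == "__main__":
    catalog = ToyCatalog()
    catalog.register("ntuples", CatalogEntry(
        pfn="mart_tier1", url="oracle://example.invalid/mart",
        user="reader", password="placeholder"))
    for entry in catalog.lookup("ntuples"):
        print(entry.pfn, entry.url, entry.user)
```

The client only needs the logical name; which database technology sits behind it is decided by the entry the catalog returns.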

Architectural Design
• Source databases: ORACLE @ CERN (row-wise ntuples) and MySQL @ CALTECH (row-wise ntuples).
• POOL RAL (ODBC Access Component) is used to extract data from the MySQL (source) database.
• Data from the source databases is integrated into the data warehouse (ORACLE).
• Views are created on the data stored in the warehouse.
• Views from the data warehouse are materialised in the data mart (ORACLE) @ Tier-1.
• POOL's RelationalFileCatalog is used to register databases; it is queried to retrieve the database URL for the requested data-set.
• POOL RAL (Relational Access Component) is used to extract data from the data mart.
• Client side: C++ class / data access via POOL RAL.
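The step "views created in the warehouse, then materialised in the data mart" can be sketched as follows. It assumes in-memory SQLite databases in place of the two ORACLE instances; the view name `high_energy`, the sample rows and the 10 GeV cut are hypothetical.

```python
import sqlite3

ALLOWED_VIEWS = {"high_energy"}  # names are interpolated into SQL, so whitelist them

def make_demo_warehouse():
    """Stand-in warehouse holding one denormalised table and one view on it."""
    wh = sqlite3.connect(":memory:")
    wh.execute("CREATE TABLE events (source TEXT, run INTEGER, event INTEGER, energy_gev REAL)")
    wh.executemany("INSERT INTO events VALUES (?, ?, ?, ?)",
                   [("A", 1, 1, 12.5), ("A", 1, 2, 40.0), ("B", 2, 1, 7.0)])
    wh.execute("CREATE VIEW high_energy AS "
               "SELECT run, event, energy_gev FROM events WHERE energy_gev > 10.0")
    return wh

def materialise(warehouse, view_name, mart):
    """Copy the current result of a warehouse view into a plain table in the mart."""
    if view_name not in ALLOWED_VIEWS:
        raise ValueError(f"unknown view: {view_name}")
    cursor = warehouse.execute(f"SELECT * FROM {view_name}")
    columns = [col[0] for col in cursor.description]
    mart.execute(f"CREATE TABLE {view_name} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    mart.executemany(f"INSERT INTO {view_name} VALUES ({placeholders})", cursor.fetchall())
    mart.commit()

if __name__ == "__main__":
    mart = sqlite3.connect(":memory:")
    materialise(make_demo_warehouse(), "high_energy", mart)
    print(mart.execute("SELECT * FROM high_energy ORDER BY run, event").fetchall())
```

The mart table is a snapshot: it has to be refreshed when the warehouse changes, which matches the read-only, snapshot-based view of the data described earlier.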

Prototype Critical Review
• The proposed use of a data warehouse provides a lightweight approach for the analysis applications.
• Applications access data locally without worrying about the individual relational database technologies and their respective database schemas.
• If ‘D’ DHRD technologies with ‘S’ distinct schemas need to be made available in the Grid environment, up to ‘D×S’ database implementations could be required.
• In contrast, the data warehouse approach provides a single denormalised schema (which could be replicated and distributed) for accessing the data stored in the ‘D×S’ DHRD.
• However, a separate ETL process is required for each newly added database technology.
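A small worked count makes the ‘D×S’ argument concrete. The technology and schema names are hypothetical, and the ETL count simply follows the last bullet (one ETL process per database technology).

```python
from itertools import product

def per_source_implementations(technologies, schemas):
    """Every (technology, schema) pair a client would otherwise have to handle."""
    return list(product(technologies, schemas))

def warehouse_costs(technologies):
    """With a warehouse: one client-facing schema, one ETL process per technology."""
    return {"client_schemas": 1, "etl_processes": len(technologies)}

if __name__ == "__main__":
    technologies = ["ORACLE", "MySQL"]               # D = 2
    schemas = ["schema_1", "schema_2", "schema_3"]   # S = 3, hypothetical names
    print(len(per_source_implementations(technologies, schemas)))  # 6
    print(warehouse_costs(technologies))
```

The integration effort moves from the analysis clients into the ETL layer, where it grows with the number of technologies rather than with the product D×S.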

Conclusion
• The software prototype successfully handled the project requirements according to the architectural design.
• Use of the POOL RelationalFileCatalog makes it possible to use this data warehouse in the Grid environment.
• The prototype provides an integrated approach for registering distributed heterogeneous relational databases and for accessing them in a globally distributed environment (Grid).

Questions

Future Directions
• Databases could be searched by the type of data they store instead of by logical database name.
• Monitoring of databases, especially those that store replicated data.
• Use of data warehouse meta-data.
• Could be made Grid-Services compliant by using POOL file catalog features.
• Data mining instead of hard-coded SQL statements.
• A single ETL process (research question).
