Anaphe OO Libraries for Data Analysis using C++ and Python
Andreas Pfeiffer, CERN/IT-API, andreas.pfeiffer@cern.ch
Outline
• Motivation
• AIDA - Abstract Interfaces for Data Analysis
• Anaphe Components
  • C++
• Lizard: Interactive Data Analysis
  • Python
• Software quality control
• Summary
LHC Computing Challenge
• 4 experiments will create a huge amount of data
  • >1 PetaByte/year for each experiment!
  • 10^15 Bytes
  • 1,000 TeraBytes
  • 20,000 Redwood tapes
  • 100,000 dual-sided DVD-RAM disks
  • 1,500,000 sets of the Encyclopaedia Britannica (w/o photos)
• Need lots of CPU power to reconstruct/analyse
  • about 1000 PC boxes per experiment (2005 ones!)
  • 40,000 of today’s boxes (dual P-III 800 MHz)
• complex data models
  • reconstruction s/w is also used for online filtering
  • needs high quality s/w in order not to waste beam time
Lifetime of LHC software = 25 yrs
[Timeline figure; labels: SPS 1969, W and Z 1983, LEP 1989, LEP ends 2000, K&R C 1978, C++ 1985, Linux V 0.01 1991, Java 1995, Ethernet standard 1983, Intel Pentium 1992, Unix V6 first public version 1975, XML 1.0 1997, IBM PC 1981, WWW]
Technology (R)Evolution
• 10 yrs major cycle length (HW, SW, OS)
• ~12 evolutionary changes in the market
• 1 revolutionary change
• towards greater diversity
• don’t forget changes of requirements
• Consequences
  • s/w written today most probably will be rewritten tomorrow
  • we must anticipate changes
Anaphe: what it is
• Modular (OO/C++) replacement of CERNLIB functionality for use in HEP experiments
  • memory management
  • I/O
  • foundation classes
  • histogramming
  • minimizing/fitting
  • visualization
  • interactive data analysis
• Trying to use standards wherever possible
• Trying to re-use existing class libraries
• This talk will not cover detector simulation (GEANT-4)
Anaphe Components
AIDA - Abstract Interfaces for Data Analysis
The AIDA project
• The AIDA project (Abstract Interfaces for Data Analysis) was initiated at the HepVis’99 workshop in Orsay
• Presently active: mainly developers from existing packages
  • Tony Johnson (JAS)
  • Andreas Pfeiffer (Lizard/Anaphe)
  • Guy Barrand (OpenScientist)
  • Mark Dönszelmann (Wired)
  • Developers from LHCb/Gaudi
Abstract Interfaces
• Abstract Interfaces (A.I.)
  • only pure virtual methods, inheritance only from other A.I.
  • components use other components only through their A.I.
  • each A.I. defines a kind of “protocol” for a component
• Maximize flexibility and re-use of packages
  • allow each component to develop independently
  • re-use of existing packages to implement components reduces start-up time significantly
• De-couple implementation of a component from its use
Architectural issue: Components (I)
• Identify components by functionality
• Define “protocol” using Abstract Interfaces
• Emphasize separation of different aspects for each component
• Example: Histogram
  • statistical entity (density distribution of a physics quantity)
  • view of a “collection of data points” (which can be a density distribution but also a detector efficiency curve)
  • command to manipulate/store/plot/fit/...
• “User’s view” is different from “implementor’s (developer’s) view”
  • separate Abstract Interfaces for both aspects
Use of Components with Abstract Interfaces
[Diagram: User Code uses Histo-IF and Fitter-IF; implementations behind them are Histo-Impl. 1, Histo-Impl. 2, Fitter-Impl. X, Fitter-Impl. Y]
• User Code uses only Interface classes
  • IHistogram1D * hist = histoFactory->create1D("track quality", 100, 0., 10.)
• Actual implementations are selected at run-time
  • loading of shared libraries
• No change at all to user code, but the freedom to choose the implementation is kept
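The idea on this slide can be sketched in Python, the second language of this talk. This is a minimal illustration only, not the AIDA or Anaphe API: the interface name IHistogram1D and the create1D call come from the slide, while the methods fill and entries, the two implementation classes, the string keys and the function count_tracks are assumptions made up for the example. A dictionary lookup stands in for the loading of shared libraries.

```python
from abc import ABC, abstractmethod


class IHistogram1D(ABC):
    """Abstract interface: declarations only (method names are assumed)."""

    @abstractmethod
    def fill(self, x, weight=1.0):
        ...

    @abstractmethod
    def entries(self):
        ...


class ListHistogram1D(IHistogram1D):
    """Hypothetical implementation 1: dense bin storage in a list."""

    def __init__(self, title, nbins, lo, hi):
        self.title, self.nbins, self.lo, self.hi = title, nbins, lo, hi
        self._bins = [0.0] * nbins
        self._entries = 0

    def fill(self, x, weight=1.0):
        self._entries += 1
        if self.lo <= x < self.hi:
            index = int((x - self.lo) / (self.hi - self.lo) * self.nbins)
            self._bins[min(index, self.nbins - 1)] += weight

    def entries(self):
        return self._entries


class DictHistogram1D(ListHistogram1D):
    """Hypothetical implementation 2: sparse bin storage in a dict."""

    def __init__(self, title, nbins, lo, hi):
        super().__init__(title, nbins, lo, hi)
        self._bins = {}

    def fill(self, x, weight=1.0):
        self._entries += 1
        if self.lo <= x < self.hi:
            index = int((x - self.lo) / (self.hi - self.lo) * self.nbins)
            index = min(index, self.nbins - 1)
            self._bins[index] = self._bins.get(index, 0.0) + weight


# Stand-in for run-time loading of shared libraries: pick a class by name.
IMPLEMENTATIONS = {"list": ListHistogram1D, "dict": DictHistogram1D}


class HistogramFactory:
    def __init__(self, implementation):
        self._cls = IMPLEMENTATIONS[implementation]

    def create1D(self, title, nbins, lo, hi):
        return self._cls(title, nbins, lo, hi)


def count_tracks(histoFactory):
    """User code: knows only the factory and the IHistogram1D interface."""
    hist = histoFactory.create1D("track quality", 100, 0.0, 10.0)
    for quality in (0.5, 2.5, 2.7, 9.9):
        hist.fill(quality)
    return hist.entries()
```

Calling count_tracks(HistogramFactory("list")) or count_tracks(HistogramFactory("dict")) runs the same user code against either implementation; only the string given to the factory changes.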
Across the languages
• JAida: C++ access to Java libs
  • using C++ proxies implementing the C++ Abstract Interfaces to the Java interfaces
[Diagram: C++ User Code -> AIDA-IF (C++) -> JAida -> AIDA-IF (Java) -> Java Lib]
XML standards
• Started with 1D and 2D Histograms
  • aim: easy transfer between applications
• Will extend to other data types
  • other histos, fits, ntuples, …
• Comments/contributions welcome!
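To make "easy transfer between applications" concrete, here is a small round-trip sketch using Python's standard xml.etree.ElementTree module. The element and attribute names (histogram1d, axis, data, bin, height, numberOfBins) are invented for this illustration and are not the AIDA XML format; the function names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET


def histogram_to_xml(title, lo, hi, bin_contents):
    """Serialize a 1D histogram to an XML string (made-up schema)."""
    root = ET.Element("histogram1d", title=title)
    ET.SubElement(root, "axis", min=repr(lo), max=repr(hi),
                  numberOfBins=str(len(bin_contents)))
    data = ET.SubElement(root, "data")
    for index, height in enumerate(bin_contents):
        ET.SubElement(data, "bin", index=str(index), height=repr(height))
    return ET.tostring(root, encoding="unicode")


def histogram_from_xml(text):
    """Read back what histogram_to_xml wrote."""
    root = ET.fromstring(text)
    axis = root.find("axis")
    heights = [float(b.get("height")) for b in root.find("data")]
    return (root.get("title"), float(axis.get("min")),
            float(axis.get("max")), heights)
```

One application would call histogram_to_xml and write the string to a file; another, possibly in a different language, would parse the same text. Only the agreed schema is shared, not the histogram classes.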
Anaphe components
‘Layered’ Approach
• Basic functionalities (histograms, fitting, etc.) are available as individual C++ class libraries
• Easy to replace one part without throwing away everything
• Objectivity/DB to provide persistence
• HepODBMS library (“insulating layer”, “tags”)
• Histogram library (HTL)
• Fitting libraries (Gemini, HepFitting)
• Graphics libraries (Qt, Qplotter)
• Insulate components through Abstract Interfaces
  • “wrapper” layer to implement Interfaces in terms of existing libs
• Apply s/w quality control tools
  • code checking, testing
Anaphe Components: Overview
Basic 3D Graphic Libraries
• OpenGL (basic graphics)
  • De-facto industry standard for basic 3D graphics
  • Used in CAD/CAE, games, VR, medical imaging
• OpenInventor (scene mgmt.)
  • OO 3D toolkit for graphics
  • Cubes, polygons, text, materials
  • Cameras, lights, picking
  • 3D viewers/editors, animation
  • Based on OpenGL/MesaGL
2D Graphics libraries
• Qt
  • multi-platform C++ GUI toolkit
  • C++ class library, not wrapper around C libs
  • superset of Motif and MFC
  • available on Unix and MS Windows
  • no change for developer
  • commercial but with public domain version
  • www.troll.no
• Qplotter
  • “add-on” functionality for HEP
  • “HIGZ/HPLOT”
Mathematical Libraries
• NAG (Numerical Algorithms Group) C Library
• Covers a broad range of functionality
  • Linear algebra
  • differential equations
  • quadrature, etc.
• Special functions of CERNLIB added to Mark-6 release
  • mostly for theory and accelerator
• Quality assurance
  • extensive testing done by NAG
• www.nag.com
CLHEP - foundation classes
• HEP foundation class library
  • Random number generators
  • Physics vectors (3- and 4-vectors)
  • Geometry
  • Linear algebra
  • System of units
  • more packages recently added
• will continue to evolve
• wwwinfo.cern.ch/asd/lhc++/clhep/
Histograms: the HTL package
• Histograms are the basic tool for physics analysis
  • Statistical information of density distributions
• Histogram Template Library (HTL)
  • design based on C++ templates
  • Modular: separation between sampling and display
  • Extensible: open for user defined binning systems
  • Flexible: support transient/persistent at the same time
  • Open: extensive use of abstract interfaces
  • recent addition: 3D histograms
Fitting and Minimization
• Fitting and Minimization Library (FML)
  • common OO interface
  • NAG-C, MINUIT
  • based on Abstract Interfaces
  • IVector, IModelFunction, …
  • fitting as a special case of minimization
  • minimize “distance” between data and model
  • replacement for HepFitting (and Gemini)
• Gemini
  • common interface to minimizer engine
  • very thin layer
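"Fitting as a special case of minimization" can be illustrated with a toy example: the fit only builds a distance function (here a chi-square) and hands it to a generic minimizer. The sketch below is not FML, NAG-C or MINUIT; it swaps in a simple golden-section search as the minimizer engine, fits a one-parameter model y = slope * x, and all function names are assumptions for illustration.

```python
import math


def golden_section_minimize(func, lo, hi, tol=1e-10):
    """Generic 1D minimizer engine; it knows nothing about data or models."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if func(c) < func(d):
            b = d
        else:
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2.0


def fit_slope(xs, ys, sigmas, lo=-10.0, hi=10.0):
    """Fit y = slope * x by minimizing the chi-square 'distance'."""

    def distance(slope):
        return sum(((y - slope * x) / s) ** 2
                   for x, y, s in zip(xs, ys, sigmas))

    return golden_section_minimize(distance, lo, hi)
```

Replacing golden_section_minimize by another engine would leave fit_slope untouched, which is the de-coupling the slide describes.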
Opening bracket: Persistency
Object persistency
Two concepts: serial and page I/O
• “Sequential access to objects” (streaming)
  • good in networking context or serial writes to file(s)
  • much like “good old Fortran”
  • often perceived to be “simpler” to implement (“<<”, “>>”)
• “Navigational access to objects” (buffered)
  • I/O on demand for complex data models
  • location transparent (for user) access to object
  • typically by de-referencing of a smart pointer
  • optimized for (random) disk access (disks deliver pages)
  • sequential write to file(s) still ok
• Both concepts need to take care of changes to the internal structure of the objects (schema evolution)
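The difference between the two access styles can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Objectivity/DB or HepODBMS: ObjectStore is a hypothetical in-memory stand-in for a persistent store that counts its reads, and Ref plays the role of the smart pointer that triggers I/O on first de-reference.

```python
class ObjectStore:
    """Hypothetical stand-in for a persistent store; counts read operations."""

    def __init__(self, records):
        self._records = dict(records)
        self.reads = 0

    def read(self, object_id):
        self.reads += 1
        return self._records[object_id]

    def stream(self):
        """Sequential access: deliver every object in storage order."""
        for object_id in self._records:
            yield self.read(object_id)


class Ref:
    """Smart-pointer stand-in: loads its target on first de-reference only."""

    def __init__(self, store, object_id):
        self._store = store
        self._object_id = object_id
        self._target = None
        self._loaded = False

    def get(self):
        if not self._loaded:
            self._target = self._store.read(self._object_id)
            self._loaded = True
        return self._target
```

With streaming, reaching one object means reading everything stored before it; with a Ref, the user code simply calls get() and only that object is read, and only once.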
Architectural Issue: Persistency (“Object-I/O”)
• Brings a completely new quality into the design
• Objects now have a lifetime
  • don’t “delete” until you really are sure you want to
  • persistency is a kind of “intended memory leak”
  • would like to see no difference between memory and disk
• “Layout” of objects may change during (extended) life
  • “schema evolution”
  • additions/deletions of attributes
  • changes of inheritance relations
Architectural Issue: Persistency (“Object-I/O”) (II)
• Objects can be placed (“clustering”)
  • de-coupling of logical and physical view of data
• Special care needed to ensure consistency in data set
  • avoid reading group of objects (tracks, events,...) for which writing/updating is not (yet) complete
  • clean up if only part of the objects are written
  • typically taken care of by using transactions
• Complications possible in distributed computing
  • need to protect disk access now like memory access in past (“Segmentation violation”)
Physical Model and Logical Model
• Physical model may be changed to optimise performance
• Existing applications continue to work transparently!
Object Model
Thanks to Vincenzo Innocente (CMS)
Physical clustering
Thanks to Vincenzo Innocente (CMS)
Closing bracket: Persistency
“Tags”, Ntuples and Events
• Tags - a special kind of Ntuple
  • Always associated with an underlying persistent store
• Tags may be used to store “ntuple-like” data
  • extracted from all over the event
  • minPt, maxEmiss, nJets, nMuon, trigger, …
• Main use: speed up data selection for analysis …
• Tag simplifies selection without losing complexity
• Events more complex than a tree structure (“CWN”)
  • lots of cross-references between classes, containers
• Association from the Tag to the Event may be used to navigate to any other part of the Event
  • even from an interactive visualization program
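A short Python sketch of the tag idea: the cut is evaluated on the small tag records only, and the association to the full event is followed just for the tags that pass. The attribute names minPt, nJets and nMuon are taken from the slide; the Tag class itself, the event_id field and the load_event callback are assumptions for this illustration and do not reproduce the HepODBMS tag classes.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    minPt: float
    nJets: int
    nMuon: int
    event_id: int  # assumed form of the association back to the full event


def select_events(tags, cut, load_event):
    """Apply the cut to tags only; load full events just for the survivors."""
    return [load_event(tag.event_id) for tag in tags if cut(tag)]
```

Here load_event stands for whatever navigates from a tag into the persistent event store; events whose tags fail the cut are never touched.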
Anaphe components
Anaphe Internals: (Abstract) Interfaces
AIDA compliance of Anaphe
• Presently (Anaphe 3.x) only AIDA 1.0 compliant
• Plan to implement AIDA 2.2 Interfaces by end 2001 (Anaphe 4.x)
  • initially as wrappers to existing interfaces/packages
• Will maintain 3.x for some time
  • ensures stability for users
• Development will concentrate on 4.x
  • while AIDA will evolve further
• Similar time schedule to JAS (Tony Johnson)
• OpenScientist (Guy Barrand) already there
Lizard: a tool for Interactive Data Analysis
Interactive Data Analysis
• Aim: “OO replacement for PAW” (at least)
  • analysis of “ntuple-like data” (“Tags”, “Ntuples”, …)
  • visualisation of data (Histograms, scatter-plot, “Vectors”)
  • fitting of histograms (and other data)
  • access to experiment specific data/code
• Maximize flexibility and re-use
• Foresee customization/integration
  • allow use from within experiment’s s/w
• Plan for extensions
  • “code for now, design for the future”
• Ensure maintainability
  • use of s/w quality control tools
Scripting - why
• Typical use of scripting is quite different from programming (reconstruction, analysis, ...)
  • history: “go back to where I was before”
  • repetition/looping - with “modifiable parameters”
• avoid “one size fits all” or “using power-tool as hammer”
• rapid prototyping in “scripting language”
  • quick turn-around times
• performance critical code in “core language”
  • exploit richer set of features/functionality (e.g. templates in C++)
• scripting languages usually less susceptible to changes than “mainstream languages”
  • potentially longer lifetimes
Python - why
• Python - OO (scripting) language
  • no “strange $!%-variables”
  • sensitive to indentation
• Easier for users
  • as Java
• Lots of user supplied modules available and ready for use
  • scientific, numerics, graphics, GUI, network, OS, games, DBs, …
  • example: http://www.vex.net/parnassus/
  • Parnassus Totals: 1173 items in 49 categories.
• Also usable in Java (Jython)
  • used in JAS for scripting
  • minimize changes needed within AIDA compliant environments
Python - how
• SWIG to (semi-)automatically create connection to chosen scripting language
  • allows flexibility to choose amongst several scripting languages
  • Python, Perl, Tcl, Guile, Ruby, (Java) …
• Very easy to use
  • swig -c++ -python -shadow -c myClass.h
  • create shared lib from myClass.cpp and myClass_wrap.c
  • start python and import myClass to use it
• Very easy to extend
  • simply inherit from “swiggified” class in python
  • modifications can later be fed back into C++
  • performance, type safety, special language features (templates), …
PAW -> Lizard translation
• Ntuple projection Lizard
  • lizard --useHBook
  • :-) nt = ntm.findNtuple("higgscand.hbk::cands")
  • :-) nplot1D(nt, "mass", "quality==5 && cut > 198")
  • the cut string can be any valid C++ expression
• Ntuple projection PAW
  • pawX11
  • paw> h/file 1 higgscand.hbk
  • paw> nt/pl 10.mass quality=5.and.cut>198
• Assuming file higgscand.hbk contains ntuple with number 10 and title cands
Lizard: History and Present Status
• Started after CHEP-2000
• Full version out since June 2001
• “PAW like” analysis functionality plus:
  • on-demand loading of compiled code using shared libraries
  • gives full access to experiment’s analysis code and data
  • based on Abstract Interfaces
  • flexible and extensible
• “License free” version since Sep. 2001
  • HBook for RWNtuples and Histogram storage
  • Minuit as minimizer engine
Users and Collaborations
• AIDA spoken here!
  • IGUANA (CMS visualization)
  • GAUDI (LHCb/HARP) framework
  • ATHENA (Atlas) framework
  • Analyzer modules in Geant 4
  • JAS
  • Open Scientist
  • …you?
Software quality control
Software quality control
• Using tools for testing/checking has started
  • Insure++, CodeWizard
• Package dependencies: Ignominy
  • Set of perl and shell scripts by Lassi Tuura (CMS)
• Ignominy scans…
  • Make dependency data produced by the compilers (*.d files)
  • Source code for #includes (resolved against the ones actually seen)
  • Shared library dependencies (“ldd” output)
  • Defined and required symbols (“nm” output)
• And maps…
  • Source code and binaries into packages
  • #include dependencies into package dependencies
  • Unresolved/defined symbols into package dependencies
ignominy: dishonour, disgrace, shame; infamy; the condition of being in disgrace, etc. (Oxford English Dictionary)
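One of the steps listed above, mapping #include dependencies into package dependencies, can be sketched in Python. This is not Ignominy (which is a set of Perl and shell scripts) and does not reproduce its rules; the regular expression, the function names and the assumption that the first directory of a path names the package are all chosen for illustration.

```python
import re

# Matches lines such as:  #include "HTL/Histo1D.h"  or  #include <vector>
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def package_of(path):
    """Assumption: the first directory component names the package."""
    return path.split("/", 1)[0]


def package_dependencies(sources):
    """Map {path: file contents} to {package: set of packages it includes}."""
    known = set(sources)
    deps = {package_of(path): set() for path in sources}
    for path, text in sources.items():
        for header in INCLUDE_RE.findall(text):
            # Keep only includes that resolve to files actually seen.
            if header in known and package_of(header) != package_of(path):
                deps[package_of(path)].add(package_of(header))
    return deps
```

The resulting dictionary is a package-level dependency graph of the kind that dependency diagrams and coupling metrics are built from.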
Ignominy Analysis of Anaphe
• Distribution of tools and utilities for LHC era physics
  • Combination of commercial, free and HEP software
• Claims to be a toolkit
  • Seems to live up to its toolkit claims
• Good work on modularity
  • Clean design is evident in many places
  • Dependency diagrams often split naturally into functional units
Thanks to Lassi Tuura (CMS)
Package Metrics
• Size = total amount of source code (not normalised across projects!)
• ACD = average component dependency (~ libraries linked in)
• CCD = sum of single-package component dependencies over whole release
  • Indicates testing/integration cost
• NCCD = Measure of CCD compared to a balanced binary tree
  • A good toolkit’s NCCD will be close to 1.0
  • < 1.0: structure is flatter than a binary tree (= independent packages)
  • > 1.0: structure is more strongly coupled (vertical or cyclic)
• Aim: NCCD ~ 1 for given software/functionality
Thanks to Lassi Tuura (CMS)
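The metrics above can be computed from a package dependency graph in a few lines. The sketch below assumes the convention from Lakos' "Large-Scale C++ Software Design", in which each package counts itself among its dependencies and the CCD of a balanced binary tree of N packages is (N + 1) * log2(N + 1) - N; whether Ignominy uses exactly this convention is not stated on the slide. The function names are hypothetical, and every package is assumed to appear as a key of the graph.

```python
import math


def component_dependencies(graph):
    """For each package, count the packages it needs, itself included."""
    counts = {}
    for package in graph:
        seen = set()
        stack = [package]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(graph.get(current, ()))
        counts[package] = len(seen)
    return counts


def package_metrics(graph):
    """Return (CCD, ACD, NCCD) for a {package: [direct dependencies]} graph."""
    counts = component_dependencies(graph)
    n = len(counts)
    ccd = sum(counts.values())
    acd = ccd / n
    ccd_balanced_tree = (n + 1) * math.log2(n + 1) - n
    return ccd, acd, ccd / ccd_balanced_tree
```

For seven packages arranged as a balanced binary tree this gives NCCD = 1.0; seven fully independent packages give a value below 1, and a chain in which each package depends on the next gives a value above 1, matching the reading of NCCD on the slide.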
Metrics: NCCD vs Cycles
[Plot: NCCD vs cycles for toolkits & frameworks; point labels: ATLAS, ROOT, ORCA, G4, COBRA, Anaphe, IGUANA; annotation: “Includes Fortran”]
• NCCD (“spaghetti index”)
  • 1.0: good toolkit
  • < 1.0: indep. packages
  • > 1.0: strongly-coupled
Thanks to Lassi Tuura (CMS)
Future enhancements
• Access to other implementations of components
  • HBOOK CWNtuples
  • Reading of ROOT (> V3.0) files
  • similar to Tony Johnson’s (Java) RootIO package
• AIDA Ntuple/Histo store
  • optimized for Ntuples, Histograms as (compressed) XML
• Communication with Java tools/packages (JAS, Wired)
  • via AIDA
• Adding other “scripting” languages
  • Perl, Tcl, cint?