
Evolution of Parallel Programming in HEP


Presentation Transcript


  1. Evolution of Parallel Programming in HEP • F. Rademakers – CERN • International Workshop on Large Scale Computing, VECC, Kolkata

  2. Outline • Why use parallel computing • Parallel computing concepts • Typical parallel problems • Amdahl’s law • Parallelism in HEP • Parallel data analysis in HEP • PIAF • PROOF • Conclusions IWLSC, 9 Feb 2006

  3. Why Parallelism • Two primary reasons: • Save time – wall-clock time • Solve larger problems • Other reasons: • Taking advantage of non-local resources – Grid • Cost saving – using multiple “cheap” machines instead of paying for a supercomputer • Overcoming memory constraints – single computers have finite memory resources; use many machines to create a very large memory • Limits to serial computing • Transmission speeds – the speed of a serial computer is directly dependent on how much data can move through the hardware • Limits: the speed of light (30 cm/ns) and the transmission limit of copper wire (9 cm/ns) • Limits to miniaturization • Economic limitations • Ultimately, parallel computing is an attempt to maximize the infinite but seemingly scarce commodity called time IWLSC, 9 Feb 2006

  4. Parallel Computing Concepts • Parallel hardware • A single computer with multiple (possibly multi-core) processors • An arbitrary number of computers connected by a network (LAN/WAN) • A combination of both • Parallelizable computational problems • Can be broken apart into discrete pieces of work that can be solved simultaneously • Can execute multiple program instructions at any moment in time • Can be solved in less time with multiple compute resources than with a single compute resource IWLSC, 9 Feb 2006
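
  The key property named on this slide – one problem broken into discrete, independent pieces of work that run at the same time and are combined at the end – can be sketched in a few lines of standard C++. This is a minimal illustrative sketch; the worker count, data size and summing task are assumptions, not something from the talk:

    // Minimal sketch: split one problem (summing a large vector) into
    // independent pieces, solve the pieces simultaneously, then merge.
    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main()
    {
       const std::size_t nWorkers = 4;
       std::vector<double> data(1000000, 1.0);      // the "problem": sum a large vector
       std::vector<double> partial(nWorkers, 0.0);  // one partial result per worker
       std::vector<std::thread> workers;

       const std::size_t chunk = data.size() / nWorkers;
       for (std::size_t w = 0; w < nWorkers; ++w) {
          std::size_t begin = w * chunk;
          std::size_t end = (w == nWorkers - 1) ? data.size() : begin + chunk;
          // each thread works on its own, independent piece of the data
          workers.emplace_back([&, w, begin, end] {
             partial[w] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
          });
       }
       for (auto &t : workers) t.join();            // wait until every piece is done

       // combine ("merge") the partial results into the final answer
       std::cout << "sum = " << std::accumulate(partial.begin(), partial.end(), 0.0) << std::endl;
       return 0;
    }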

  5. Parallel Computing Concepts • There are different ways to classify parallel computers (Flynn’s Taxonomy): IWLSC, 9 Feb 2006

  6. SISD • A serial (non-parallel) computer • Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle • Single data: only one data stream is being used as input during any one clock cycle • Deterministic execution • Examples: most classical PC’s, single CPU workstations and mainframes IWLSC, 9 Feb 2006

  7. SIMD • A type of parallel computer • Single instruction: all processing units execute the same instruction at any given clock cycle • Multiple data: each processing unit can operate on a different data element • This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network and a very large array of very small-capacity CPU’s • Best suited for specialized problems with a high degree of regularity, such as image processing • Synchronous and deterministic execution • Two varieties: processor arrays and vector pipelines • Examples (some extinct): • Processor arrays: Connection Machine, Maspar MP-1, MP-2 • Vector pipelines: CDC 205, IBM 9000, Cray C90, Fujitsu, NEC SX-2 IWLSC, 9 Feb 2006
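
  To make the "single instruction, multiple data" idea concrete, here is a small, purely illustrative C++ loop in the spirit of this slide: the same operation is applied element by element to large arrays, which is exactly the pattern a vectorizing compiler maps onto SIMD hardware. The array sizes and names are assumptions:

    // Illustrative SIMD-friendly loop: one instruction (an add) over many data
    // elements. Compilers typically vectorize such a loop at -O2/-O3.
    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main()
    {
       const std::size_t n = 1 << 20;
       std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

       for (std::size_t i = 0; i < n; ++i)   // single instruction, multiple data
          c[i] = a[i] + b[i];

       std::cout << "c[0] = " << c[0] << ", c[n-1] = " << c[n - 1] << std::endl;
       return 0;
    }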

  8. MISD • Few actual examples of this class of parallel computer have ever existed • Some conceivable examples might be: • Multiple frequency filters operating on a single signal stream • Multiple cryptography algorithms attempting to crack a single coded message IWLSC, 9 Feb 2006

  9. MIMD • Currently the most common type of parallel computer • Multiple instruction: every processor may be executing a different instruction stream • Multiple data: every processor may be working with a different data stream • Execution can be synchronous or asynchronous, deterministic or non-deterministic • Examples: most current supercomputers, networked parallel computer “grids” and multi-processor SMP computers – including multi-CPU and multi-core PC’s IWLSC, 9 Feb 2006

  10. Relevant Terminology • Observed speedup • wall-clock time of serial execution / wall-clock time of parallel execution • Granularity • Coarse: relatively large amounts of computational work are done between communication events • Fine: relatively small amounts of computational work are done between communication events • Parallel overhead • The amount of time required to coordinate parallel tasks, as opposed to doing useful work, typically: • Task start-up time • Synchronizations • Data communications • Software overhead imposed by parallel compilers, libraries, tools, OS, etc. • Task termination time • Scalability • Refers to a parallel system’s ability to demonstrate a proportional increase in parallel speedup with the addition of more processors • Embarrassingly parallel • Many similar but independent tasks that can be solved simultaneously with little or no coordination IWLSC, 9 Feb 2006
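
  The first term on this slide, observed speedup, is simply the measured wall-clock time of the serial run divided by that of the parallel run. A rough, self-contained sketch of such a measurement (the workload, thread count and names are assumptions):

    // Measure observed speedup = serial wall-clock time / parallel wall-clock time.
    #include <chrono>
    #include <cmath>
    #include <cstddef>
    #include <iostream>
    #include <thread>
    #include <vector>

    static double work(std::size_t begin, std::size_t end)
    {
       double s = 0.0;
       for (std::size_t i = begin; i < end; ++i) s += std::sqrt(static_cast<double>(i));
       return s;
    }

    int main()
    {
       const std::size_t n = 50000000, nThreads = 4;
       using clock = std::chrono::steady_clock;

       auto t0 = clock::now();
       volatile double serial = work(0, n);               // serial execution
       (void)serial;
       auto t1 = clock::now();

       std::vector<std::thread> pool;
       std::vector<double> part(nThreads, 0.0);
       for (std::size_t w = 0; w < nThreads; ++w)
          pool.emplace_back([&, w] { part[w] = work(w * n / nThreads, (w + 1) * n / nThreads); });
       for (auto &t : pool) t.join();                     // parallel execution
       auto t2 = clock::now();

       std::chrono::duration<double> ts = t1 - t0, tp = t2 - t1;
       std::cout << "observed speedup = " << ts.count() / tp.count() << std::endl;
       return 0;
    }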

  11. Typical Parallel Problems • Traditionally, parallel computing has been considered to be “the high-end of computing”: • Weather and climate • Chemical and nuclear reactions • Biological, human genome • Geological, seismic activity • Electronic circuits • Today commercial applications are the driving force: • Parallel databases, data mining • Oil exploration • Web search engines • Computer-aided diagnosis in medicine • Advanced graphics and virtual reality • The future: the trends of the past 10 years – ever faster networks, distributed systems, and multi-processor, and now multi-core, computer architectures – suggest that parallelism is the future IWLSC, 9 Feb 2006

  12. Amdahl’s Law • Amdahl’s law states that the potential speedup is defined by the fraction of code (P) that can be parallelized: speedup = 1 / (1 – P) • If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup). If all the code is parallelized, P = 1 and the speedup is infinite (in theory) • If 50% of the code can be parallelized, the maximum speedup = 2, meaning the code will run twice as fast • Introducing the number of processors performing the parallel fraction of work, the relationship can be written as: speedup = 1 / (P/N + S) • Where P = parallel fraction, N = number of processors and S = serial fraction IWLSC, 9 Feb 2006
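
  A small, illustrative calculation with the formula quoted above, speedup = 1 / (P/N + S); the parallel fraction P = 0.95 is an assumed value, chosen only to show how the serial fraction caps the speedup:

    // Amdahl's law: even with many processors the speedup is bounded by 1/S.
    #include <cstdio>

    int main()
    {
       const double P = 0.95;            // assumed parallel fraction
       const double S = 1.0 - P;         // serial fraction
       const int Ns[] = {1, 2, 4, 8, 16, 64, 1024};
       for (int N : Ns)
          std::printf("N = %5d  speedup = %6.2f\n", N, 1.0 / (P / N + S));
       // With P = 0.95 the speedup never exceeds 1/S = 20, no matter how large N is.
       return 0;
    }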

  13. (Figure-only slide: no transcript text) IWLSC, 9 Feb 2006

  14. Parallelism in HEP • Main areas of processing in HEP • DAQ • Typically highly parallel • Process in parallel a large number of detector modules or sub-detectors • Simulation • No need for fine-grained track-level parallelism, a single event is not the end product • Some attempts were made to introduce track-level parallelism in G3 • Typically job-level parallelism, resulting in a large number of files • Reconstruction • Same as for simulation • Analysis • Run over many events in parallel to obtain the final analysis results quickly • Embarrassingly parallel, event-level parallelism • Preferably interactive, for better control of, and feedback from, the analysis • Main challenge: efficient data access IWLSC, 9 Feb 2006

  15. Parallel Data Analysis in HEP • Most parallel data analysis systems designed in the past and present are based on job-splitting scripts and batch queues • When the queue is full there is no parallelism • Explicit parallelism • Turnaround time dictated by the batch system scheduler and resource availability • Remarkably few attempts at truly interactive, implicitly parallel systems • PIAF • PROOF IWLSC, 9 Feb 2006

  16. Classical Parallel Data Analysis (diagram: data files listed in a catalog and held in storage are split by hand; jobs running myAna.C are submitted through the queue manager to the batch farm, and the job outputs are merged into the final analysis outputs) • “Static” use of resources • Jobs frozen, 1 job / CPU • “Manual” splitting, merging • Limited monitoring (end of single job) IWLSC, 9 Feb 2006
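
  The "manual" merging step of this classical scheme is typically a small ROOT macro run by hand after all batch jobs have finished. A rough sketch of such a macro follows; the file naming pattern and the histogram name "hpt" are assumptions for illustration, not taken from the slides:

    // Manually merge the per-job output histograms into one final histogram.
    #include "TFile.h"
    #include "TH1.h"
    #include "TString.h"

    void mergeOutputs(int nJobs = 10)
    {
       TH1 *sum = 0;
       for (int i = 0; i < nJobs; ++i) {
          TFile *f = TFile::Open(TString::Format("job_%d_output.root", i));
          if (!f || f->IsZombie()) continue;              // skip missing job outputs
          TH1 *h = dynamic_cast<TH1*>(f->Get("hpt"));     // per-job histogram
          if (h) {
             if (!sum) {
                sum = static_cast<TH1*>(h->Clone("hpt_merged"));
                sum->SetDirectory(0);                     // keep it after the file closes
             } else {
                sum->Add(h);
             }
          }
          f->Close();
       }
       if (sum) {
          TFile out("merged.root", "RECREATE");
          sum->Write();                                   // final analysis output
       }
    }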

  17. Interactive Parallel Data Analysis (diagram: the client sends a query – a data file list plus myAna.C – to the MASTER of an interactive farm; the master uses the catalog, the storage and a scheduler to distribute the files, returns merged feedback while the query runs, and merged final outputs at the end) • Farm perceived as extension of local PC • More dynamic use of resources • Automated splitting and merging • Real time feedback IWLSC, 9 Feb 2006

  18. PIAF • The Parallel Interactive Analysis Facility • First attempt at an interactive parallel analysis system • Extension of and based on the PAW system • Joint project between CERN/IT and Hewlett-Packard • Development started in 1992 • Small production service opened for LEP users in 1993 • Up to 30 concurrent users • CERN PIAF cluster consisted of 8 HP PA-RISC machines • FDDI interconnect • 512 MB RAM • Few hundred GB disk • First observation of hyper-speedup using column-wise n-tuples IWLSC, 9 Feb 2006

  19. PIAF Architecture • Two-tier push architecture • Client → Master → Workers • Master divides the total number of events by the number of workers and assigns each worker 1/n of the events to process • Pros • Transparent • Cons • Slowest node determines the time of completion • Not adaptable to varying node loads • No optimized data access strategies • Required homogeneous cluster • Not scalable IWLSC, 9 Feb 2006

  20. PIAF Push Architecture (message sequence between the Master and Slaves 1..N):
  Master → each slave: Process(“ana.C”) – initialization on the master and on every slave
  Master → each slave: SendEvents() – each slave is handed 1/n of the events and processes them
  Each slave → master: SendObject(histo) – the slaves return their histograms and wait for the next command
  Master: adds the histograms, displays them and waits for the next command IWLSC, 9 Feb 2006
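
  The essence of the push model is that the split is computed once, up front. A tiny, illustrative sketch of that static split (event and worker counts are made-up numbers):

    // Static "push" split: the master divides the events evenly over the workers
    // in advance, so the slowest worker determines the completion time.
    #include <cstdio>

    int main()
    {
       const long long nEvents = 1000000;   // total events (assumed)
       const int nWorkers = 8;              // workers in the farm (assumed)

       const long long perWorker = nEvents / nWorkers;
       for (int w = 0; w < nWorkers; ++w) {
          long long first = w * perWorker;
          // the last worker also takes the remainder of the integer division
          long long last = (w == nWorkers - 1) ? nEvents - 1 : first + perWorker - 1;
          std::printf("worker %d: events %lld .. %lld\n", w, first, last);
       }
       return 0;
    }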

  21. PROOF • Parallel ROOT Facility • Second generation interactive parallel analysis system • Extension of and based on the ROOT system • Joint project between ROOT, LCG, ALICE and MIT • Proof of concept in 1997 • Development picked up in 2002 • PROOF in production in Phobos/BNL (with up to 150 CPU’s) since 2003 • Second wave of developments started in 2005 following interest by LHC experiments IWLSC, 9 Feb 2006

  22. PROOF Original Design Goals • Interactive parallel analysis on heterogeneous cluster • Transparency • Same selectors, same chain Draw(), etc. on PROOF as in local session • Scalability • Good and well understood (1000 nodes most extreme case) • Extensive monitoring capabilities • MLM (Multi-Level-Master) improves scalability on wide area clusters • Adaptability • Partly achieved, system handles varying load on cluster nodes • MLM allows much better latencies on wide area clusters • No support yet for coming and going of worker nodes IWLSC, 9 Feb 2006

  23. PROOF Multi-Tier Architecture • Adapts to a cluster of clusters or to wide-area virtual clusters spanning physically separated domains (diagram: client → master → sub-masters → workers) • A good connection is less important between the client and the master, and VERY important between the sub-masters and their workers • Optimize for data locality or for efficient data server access IWLSC, 9 Feb 2006

  24. PROOF Pull Architecture (message sequence between the Master and Slaves 1..N):
  Master → each slave: Process(“ana.C”) – initialization; the master runs a packet generator
  Each slave repeatedly calls GetNextPacket() and is handed a packet (first event, number of events) – e.g. slave 1 gets (0,100), (200,100), (340,100), (490,100) while slave N gets (100,100), (300,40), (440,50), (590,60); the varying packet sizes show how the master adapts to the speed of each slave
  Each slave → master: SendObject(histo) – the slaves return their histograms and wait for the next command
  Master: adds the histograms, displays them and waits for the next command IWLSC, 9 Feb 2006
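
  The packet generator at the heart of the pull model can be sketched in a few lines: the master hands out small event ranges on demand, and faster workers simply come back for more. This is an illustrative sketch in standard C++ (threads stand in for the slaves), not actual PROOF code; all names and numbers are assumptions:

    // Pull model: workers request packets until the packet generator runs dry.
    #include <algorithm>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    class PacketGenerator {
    public:
       PacketGenerator(long long nEvents, long long packetSize)
          : fNext(0), fTotal(nEvents), fSize(packetSize) {}

       // Hand out the next (first event, number of events) packet; false when done.
       bool GetNextPacket(long long &first, long long &num)
       {
          std::lock_guard<std::mutex> lock(fMutex);
          if (fNext >= fTotal) return false;
          first = fNext;
          num   = std::min(fSize, fTotal - fNext);
          fNext += num;
          return true;
       }

    private:
       std::mutex fMutex;
       long long  fNext, fTotal, fSize;
    };

    int main()
    {
       PacketGenerator gen(650, 100);              // 650 events, packets of up to 100
       std::vector<std::thread> workers;
       for (int w = 0; w < 4; ++w) {
          workers.emplace_back([&gen, w] {
             long long first, num;
             while (gen.GetNextPacket(first, num))  // each worker pulls until done
                std::printf("worker %d: packet (%lld, %lld)\n", w, first, num);
          });
       }
       for (auto &t : workers) t.join();
       return 0;
    }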

  25. PROOF New Features • Support for “interactive batch” mode • Allow submission of long running queries • Allow client/master disconnect and reconnect • Powerful, friendly and complete GUI • Work in grid environments • Startup of agents via Grid job scheduler • Agents calling out to master (firewalls, NAT) • Dynamic master-worker setup IWLSC, 9 Feb 2006

  26. Goal: bring these to the same level of perception (diagram comparing three ways of working) • Commands, scripts (stateful): interactive analysis using local resources, e.g. end-analysis calculations, visualization • GUI, interactive/batch queries (stateful or stateless): medium-term jobs, e.g. analysis design and development, also using non-local resources • Batch (stateless): analysis jobs with well-defined algorithms (e.g. production of personal trees) IWLSC, 9 Feb 2006

  27. Analysis Session Example • Monday at 10h15, ROOT session on my desktop: • AQ1: a 1 s query produces a local histogram • AQ2: a 10 min query submitted to PROOF1 • AQ3→AQ7: short queries • AQ8: a 10 h query submitted to PROOF2 • Monday at 16h25, ROOT session on my laptop: • BQ1: browse results of AQ2 • BQ2: browse temporary results of AQ8 • BQ3→BQ6: submit four 10 min queries to PROOF1 • Wednesday at 8h40, ROOT session on my laptop in Kolkata: • CQ1: browse results of AQ8 and BQ3→BQ6 IWLSC, 9 Feb 2006

  28. New PROOF GUI IWLSC, 9 Feb 2006

  29. New PROOF GUI IWLSC, 9 Feb 2006

  30. New PROOF GUI IWLSC, 9 Feb 2006

  31. New PROOF GUI IWLSC, 9 Feb 2006

  32. TGrid – Abstract Grid Interface
  class TGrid : public TObject {
  public:
     // logical-to-physical file catalogue operations
     virtual Int_t        AddFile(const char *lfn, const char *pfn) = 0;
     virtual Int_t        DeleteFile(const char *lfn) = 0;
     virtual TGridResult *GetPhysicalFileNames(const char *lfn) = 0;

     // metadata catalogue operations
     virtual Int_t        AddAttribute(const char *lfn, const char *attrname,
                                       const char *attrval) = 0;
     virtual Int_t        DeleteAttribute(const char *lfn, const char *attrname) = 0;
     virtual TGridResult *GetAttributes(const char *lfn) = 0;

     virtual void         Close(Option_t *option = "") = 0;
     virtual TGridResult *Query(const char *query) = 0;

     // factory method returning a concrete implementation for the given grid, e.g. "alien"
     static TGrid *Connect(const char *grid, const char *uid = 0, const char *pw = 0);

     ClassDef(TGrid,0)   // ABC defining interface to GRID services
  };
  IWLSC, 9 Feb 2006
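
  To show how the abstract interface above is meant to be exercised, here is a hypothetical usage sketch written only against the methods declared on this slide; the LFN/PFN strings and attribute values are invented for illustration:

    // Hypothetical use of the TGrid interface shown above.
    void useGrid()
    {
       TGrid *grid = TGrid::Connect("alien");                           // pick a concrete back-end
       if (!grid) return;

       grid->AddFile("lfn:///alice/test/run1.root",                     // register a logical ->
                     "rfio://castor.cern.ch//castor/alice/run1.root");  // physical mapping
       grid->AddAttribute("lfn:///alice/test/run1.root", "run", "1");   // attach metadata

       TGridResult *files = grid->Query("lfn:///alice/test/*.root");    // catalogue query
       TGridResult *pfns  = grid->GetPhysicalFileNames("lfn:///alice/test/run1.root");
       // ... use files and pfns, e.g. to build a TChain as on the next slide ...
       grid->Close();
    }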

  33. PROOF on the Grid (diagram) • USER SESSION: the client retrieves the list of logical files (LFN + MSN) from the Grid File/Metadata Catalogue via the TGrid UI/Queue UI • Grid service interfaces: Grid Access Control Service, Grid/ROOT Authentication, Grid File/Metadata Catalogue, Proofd startup • The PROOF MASTER SERVER coordinates PROOF SUB-MASTER SERVERS at the sites, each with its own PROOF SLAVE SERVERS • Guaranteed site access through PROOF Sub-Masters calling out to the Master (agent technology) • Slave servers access data via xrootd from local disk pools IWLSC, 9 Feb 2006

  34. Running PROOF
  TGrid *alien = TGrid::Connect("alien");                               // connect to the AliEn grid
  TGridResult *res;
  res = alien->Query("lfn:///alice/simulation/2001-04/V0.6*.root");     // file catalogue query
  TChain *chain = new TChain("AOD");
  chain->Add(res);                                                      // chain built from the query result
  gROOT->Proof("master");                                               // start a PROOF session
  chain->Process("myselector.C");                                       // run the selector in parallel
  // plot/save objects produced in myselector.C
  . . .
  IWLSC, 9 Feb 2006
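
  For completeness, a rough skeleton of what a PROOF selector such as myselector.C typically looks like; the class name, the histogram and its booking are illustrative assumptions (a real selector is usually generated from the tree with TTree::MakeSelector):

    // Skeleton selector: PROOF runs SlaveBegin()/Process() on every worker and
    // Terminate() on the client after the per-worker outputs have been merged.
    #include "TSelector.h"
    #include "TH1F.h"

    class MySelector : public TSelector {
    public:
       TH1F *fHpt;                                   // example output histogram

       MySelector() : fHpt(0) {}
       void   SlaveBegin(TTree *)                    // runs on each worker
       {
          fHpt = new TH1F("hpt", "p_{T}", 100, 0., 10.);
          fOutput->Add(fHpt);                        // register it for merging
       }
       Bool_t Process(Long64_t entry)                // called for every event
       {
          // read the event here and fill the histograms; physics code omitted
          return kTRUE;
       }
       void   Terminate()                            // runs on the client at the end
       {
          fHpt = dynamic_cast<TH1F*>(fOutput->FindObject("hpt"));
          if (fHpt) fHpt->Draw();
       }
       ClassDef(MySelector,0)   // example selector skeleton
    };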

  35. Conclusions • Amdahl’s law shows that making really scalable parallel applications is very hard • Parallelism in HEP off-line computing is still lagging • To solve the LHC data analysis problems, parallelism is the only solution • To make good use of the current and future generations of multi-core CPU’s, parallel applications are required IWLSC, 9 Feb 2006
