1. The Parallel Toolkit
Top-to-bottom software tools and techniques for the FHPCA Supercomputer
2. 09/05/2012 www.fhpca.org
The FHPCA Supercomputer ‘Maxwell’ will comprise N similar Nodes
A Node is a software process running on a host machine, together with some FPGA acceleration hardware
3. The Maxwell user environment
Standard Linux OS
Standard MPI libraries and Gnu compilers
Standard Sun Grid Engine batch system front end
queues configured for the whole machine, or for vendor halves
Parallel Toolkit FPGA configuration tools
Pre-built FPGA accelerated applications
Maxwell must “feel” like an HPC system
FPGA details must only be for developers to worry about, not users
4. What is the Parallel Toolkit?
A library of C++ code that forms a bridge from the software process (the application) to the FPGA hardware
A set of system level tools to support parallel execution on the Supercomputer
5. Parallel Toolkit aims (1)
Multi-vendor
Allow the same application code to work with FPGA boards from multiple vendors
Access to FPGA hardware resources
Provide a clean mechanism for client code to access hardware resources without introducing undesirable dependencies, or necessitating client code changes whenever the hardware is revised
Copying efficiency
Provide scope to avoid unnecessary data-copying between host and FPGA memory
e.g. if two hardware functions are to be applied to the same data, there is no need to copy the data back until the end of the second function
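The copying-efficiency aim can be sketched in plain C++ (all names here are invented for illustration, and "FPGA memory" is modelled with an ordinary host vector; the real PTK wraps actual device buffers):

```cpp
#include <vector>

// Hypothetical stand-in for data resident in FPGA memory.
// Modelled in software here; the real PTK would wrap on-board RAM.
struct HardVector {
    std::vector<double> data;   // stands in for accelerator memory
};

// Copy host data into FPGA memory once, up front.
HardVector toHard(const std::vector<double>& host) { return {host}; }

// Two 'accelerated' functions operating in place on FPGA-resident data:
// no host<->FPGA transfer happens between them.
void scaleOnFpga(HardVector& v, double s) {
    for (double& x : v.data) x *= s;
}
void offsetOnFpga(HardVector& v, double o) {
    for (double& x : v.data) x += o;
}

// Copy results back only after the final function completes.
std::vector<double> toHost(const HardVector& v) { return v.data; }
```

Because both functions consume and produce the same hard data structure, the intermediate result never crosses the host/FPGA boundary.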
6. Parallel Toolkit aims (2)
Parallel accelerated functions
Provide scope to execute multiple accelerated functions in parallel
Memory flexibility
Allow an FPGA memory to be split into multiple address spaces (more relevant when there are multiple accelerated functions)
Potential for ‘intelligent’ tools
Provide scope for high-level toolkit functions that can adapt to the quantity of hardware resources available.
These requirements pointed to an OO abstraction model
7. Basic PTK acceleration strategy
Identify code ‘hotspot’ function F
Design corresponding hardware function F’
F’ bitstream programmed into accelerator
F is accelerated by replacing its innards with a call out to F’
At runtime, F copies relevant input data to memory component on the accelerator before invoking F’
F’ processes data directly from local memory
Once F’ signals completion, F copies output data back to the host
Execution of F’ may involve external communication with neighbouring nodes to implement the message-passing model of parallel computation commonly used in HPC applications
this happens using FPGA-to-FPGA dataflow over RocketIO
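The F / F’ pattern above can be sketched as follows (a software stand-in, with invented names; on Maxwell, F’ runs in the FPGA fabric and the copies cross the host/accelerator boundary):

```cpp
#include <numeric>
#include <vector>

// Stand-in for the accelerator's local memory component.
static std::vector<double> deviceMem;

// F': the hardware function, processing data directly from local memory.
// (Here an ordinary function; on Maxwell this is the FPGA design.)
double fPrime() {
    return std::accumulate(deviceMem.begin(), deviceMem.end(), 0.0);
}

// F: the original hotspot, its innards replaced with a call out to F'.
double f(const std::vector<double>& input) {
    deviceMem = input;          // 1. copy relevant input data to the accelerator
    double result = fPrime();   // 2. invoke F' and wait for completion
    return result;              // 3. copy output data back to the host
}
```

The point of the pattern is that steps 1 and 3 are the only host/FPGA transfers, whatever F’ does internally.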
8. Parallel Toolkit architecture
There is a loose layering within the PTK
'Client code' means the layer above (ultimately the application)
9. Basic PTK objects
Components
essentially computing devices performing some calculation
Hard Data Structures
data residing in FPGA hardware (as opposed to host memory)
Allocators
provide controlled access to both these types of resources
Accelerators
model all the relevant acceleration hardware for a given implementation of a given algorithm for a given node
their main role is configuration
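A minimal sketch of these four object kinds as C++ abstract classes (illustrative only; the method names and signatures are invented here, not the real PTK interfaces):

```cpp
#include <cstddef>
#include <memory>

// Component: a computing device performing some calculation.
class Component {
public:
    virtual ~Component() = default;
    virtual void run() = 0;
};

// Hard Data Structure: data residing in FPGA hardware, not host memory.
class HardDataStructure {
public:
    virtual ~HardDataStructure() = default;
    virtual std::size_t sizeBytes() const = 0;
};

// Allocator: controlled access to Components and Hard Data Structures,
// so client code never constructs hardware resources directly.
class Allocator {
public:
    virtual ~Allocator() = default;
    virtual std::unique_ptr<HardDataStructure> request(std::size_t bytes) = 0;
};

// Accelerator: models all relevant acceleration hardware for a given
// implementation of a given algorithm on a given node.
class Accelerator {
public:
    virtual ~Accelerator() = default;
    virtual void configure() = 0;   // main role: configuration, e.g. bitstream load
};
```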
10. PTK object implementations
PTK defines Components etc. for a given application as abstract superclasses in C++
e.g. class MatrixMultiplier {};
This defines the PTK API for that application
Accelerated versions are implemented as vendor-specific concrete subclasses
e.g. class MatrixMultiplierVend1 : public MatrixMultiplier {};
Get one subclass per flavour of vendor hardware
e.g. MatrixMultiplierNt1, MatrixMultiplierAd1, etc.
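The superclass/subclass pattern might look like this (class names follow the slides, but the method signature and factory are invented for illustration; the pure-software subclass here plays the role of the "Dummy" implementations mentioned later):

```cpp
#include <memory>
#include <string>
#include <vector>

// Abstract PTK API class: one per accelerated application function.
class MatrixMultiplier {
public:
    virtual ~MatrixMultiplier() = default;
    // c = a * b for square n x n matrices stored row-major.
    virtual void multiply(const std::vector<double>& a,
                          const std::vector<double>& b,
                          std::vector<double>& c, int n) = 0;
};

// One concrete subclass per flavour of vendor hardware. This one is a
// pure-software stand-in; MatrixMultiplierNt1 / MatrixMultiplierAd1
// would drive the actual FPGA boards instead.
class MatrixMultiplierDummy : public MatrixMultiplier {
public:
    void multiply(const std::vector<double>& a,
                  const std::vector<double>& b,
                  std::vector<double>& c, int n) override {
        c.assign(static_cast<std::size_t>(n) * n, 0.0);
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
    }
};

// Hypothetical factory: client code names a flavour (e.g. taken from the
// PTK config file) and never mentions a vendor type directly.
std::unique_ptr<MatrixMultiplier> makeMultiplier(const std::string& flavour) {
    if (flavour == "Dummy") return std::make_unique<MatrixMultiplierDummy>();
    return nullptr;   // Nt1, Ad1, ... subclasses would be registered here
}
```

Because client code only holds a `MatrixMultiplier*`, revising the hardware means adding a subclass, not touching callers.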
11. PTK system configuration
The PTK uses a config file to set up system-level properties
number of nodes, which accelerators to use, etc.
# PTK config file for running the DU software implementation
# of App1 on 2 processes on Maxwell.
app.nodes = 2
app.node.0.accelerator = AcceleratorApp1Du1
app.node.1.accelerator = AcceleratorApp1Du1
app.node.*.host = maxwell.epcc.ed.ac.uk
app.node.*.exe = "$PTK_HOME/app/app1/Executable/app1solve_info_du.exe"
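Parsing this key = value format can be sketched in a few lines of C++ (illustrative only; the PTK's real loader must also expand the `node.*` wildcard across all configured nodes, which is omitted here):

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Minimal parser for '#'-commented key = value lines, as in the
// PTK config file shown above. Not the actual PTK implementation.
std::map<std::string, std::string> parseConfig(std::istream& in) {
    std::map<std::string, std::string> props;
    auto trim = [](std::string s) {
        s.erase(0, s.find_first_not_of(" \t"));
        s.erase(s.find_last_not_of(" \t") + 1);
        return s;
    };
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == '#') continue;   // skip comments
        std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;          // not a property
        props[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return props;
}
```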
12. Developing for the PTK: a worked example
Original Application
Refactored Application
Rewritten Solver
rewritten against PTK generic & app specific APIs
no FPGA code
Rewritten matrixMultiply
rewritten against PTK generic & app specific APIs
no FPGA code
FPGA-specific MatrixMultiplier
sub-PTK, vendor-specific API calls
13. Original application
Multi-process MPI code
SPMD model
nodes run the same code but on different data
Identify hotspot areas
by profiling
The hotspots could occur at any depth in an application’s call tree
14. Refactored application
Refactored so that hotspots are in self-contained functions
this is really just good software engineering
15. Refactored solver
Original code, tidied up
Key functions identified
these will be implemented in hardware
16. PTK rewritten solver
17. Rewritten solver (2)
Code at this level is
mixed generic PTK calls
Ptk::getInstance()->start();
and application specific calls
// Copy data to Hard Data Structs
MatrixHardAllocator* mha = MatrixHardAllocator::getInstance();
MatrixHard* m = mha->requestMatrixHard(nx, ny, nz);
but still non-FPGA specific, clean OO code
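The allocator calls in the snippet above can be mimicked with pure-software stand-ins (everything below is illustrative: the class shapes echo the slide's snippet, but the members and the simulated "FPGA memory" are invented here):

```cpp
#include <memory>
#include <vector>

// A matrix resident in (simulated) FPGA memory; on Maxwell the storage
// would live on the accelerator board, not in a host vector.
struct MatrixHard {
    int nx, ny, nz;
    std::vector<double> data;
    MatrixHard(int x, int y, int z)
        : nx(x), ny(y), nz(z),
          data(static_cast<std::size_t>(x) * y * z) {}
};

// Singleton allocator giving controlled access to hard matrices,
// matching the MatrixHardAllocator::getInstance() style of the slide.
class MatrixHardAllocator {
public:
    static MatrixHardAllocator* getInstance() {
        static MatrixHardAllocator instance;   // one per process
        return &instance;
    }
    std::unique_ptr<MatrixHard> requestMatrixHard(int nx, int ny, int nz) {
        return std::make_unique<MatrixHard>(nx, ny, nz);
    }
private:
    MatrixHardAllocator() = default;           // clients must use getInstance()
};
```

Routing every request through the singleton is what lets the toolkit keep count of, and eventually ration, scarce FPGA memory.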
18. PTK rewritten matrixMultiply
All code at this level is still
non-FPGA specific
vendor neutral
but becoming app specific
19. FPGA-specific MatrixMultiplier
Concrete subclass of PTK abstract MatrixMultiplier
This interfaces to the FPGA
Code at this level is vendor specific
but can also be written in pure software
(we call these the “Dummy” implementations)
20. How much of the PTK is generic?
A modest amount
PTK is not a collection of generic cores
it’s not an implementation of MPI or BLAS
Because
getting data onto the FPGA is expensive
so the FPGA function has to be worth it
it has to be one or two routines where most of the execution time is spent
and these are typically not generic
21. PTK vs generic libraries
PTK provides a methodology and utility classes
aim is to capture and facilitate “accelerability” at a high level
while maintaining a clean software approach throughout
A generic function library (eg. BLAS) is too low level
a single callout to an FPGA matrix-vector multiply involves a lot of data transfer from host memory to FPGA memory
quite possibly over PCI or some other wet string
this will kill any benefit from the FPGA acceleration
“MPI for FPGAs” makes no sense
for the same reasons