1. The Parallel Toolkit
Top-to-bottom software tools and techniques for the FHPCA Supercomputer
2. 09/05/2012 www.fhpca.org
The FHPCA Supercomputer ‘Maxwell’ will comprise N similar Nodes
A Node is a software process running on a host machine, together with some FPGA acceleration hardware
3. The Maxwell user environment
Standard Linux OS
Standard MPI libraries and Gnu compilers
Standard Sun Grid Engine batch system front end
queues configured for the whole machine, or for vendor halves
Parallel Toolkit FPGA configuration tools
Pre-built FPGA accelerated applications
Maxwell must “feel” like an HPC system
FPGA details must only be for developers to worry about, not users
4. What is the Parallel Toolkit?
A library of C++ code that forms a bridge from the software process (the application) to the FPGA hardware
A set of system level tools to support parallel execution on the Supercomputer
5. Parallel Toolkit aims (1)
Multi-vendor
Allow the same application code to work with FPGA boards from multiple vendors
Access to FPGA hardware resources
Provide a clean mechanism for client code to access hardware resources without introducing undesirable dependencies, or necessitating client code changes whenever the hardware is revised
Copying efficiency
Provide scope to avoid unnecessary data-copying between host and FPGA memory
e.g. if two hardware functions are to be applied to the same data, there is no need to copy the data back until the end of the second function
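The copying-efficiency aim can be sketched in plain C++ (all names here are invented for illustration, and "FPGA memory" is modelled with an ordinary host vector; the real PTK wraps actual device buffers):

```cpp
#include <vector>

// Hypothetical stand-in for data resident in FPGA memory.
// Modelled in software here; the real PTK would wrap on-board RAM.
struct HardVector {
    std::vector<double> data;   // stands in for accelerator memory
};

// Copy host data into FPGA memory once, up front.
HardVector toHard(const std::vector<double>& host) { return {host}; }

// Two 'accelerated' functions operating in place on FPGA-resident data:
// no host<->FPGA transfer happens between them.
void scaleOnFpga(HardVector& v, double s) {
    for (double& x : v.data) x *= s;
}
void offsetOnFpga(HardVector& v, double o) {
    for (double& x : v.data) x += o;
}

// Copy results back only after the final function completes.
std::vector<double> toHost(const HardVector& v) { return v.data; }
```

Because both functions consume and produce the same hard data structure, the intermediate result never crosses the host/FPGA boundary.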
6. Parallel Toolkit aims (2)
Parallel accelerated functions
Provide scope to execute multiple accelerated functions in parallel
Memory flexibility
Allow an FPGA memory to be split into multiple address spaces (more relevant when there are multiple accelerated functions)
Potential for ‘intelligent’ tools
Provide scope for high-level toolkit functions that can adapt to the quantity of hardware resources available.
These requirements pointed to an OO abstraction model
7. Basic PTK acceleration strategy
Identify code ‘hotspot’ function F
Design corresponding hardware function F’
F’ bitstream programmed into accelerator
F is accelerated by replacing its innards with a call out to F’
At runtime, F copies relevant input data to memory component on the accelerator before invoking F’
F’ processes data directly from local memory
Once F’ signals completion, F copies output data back to the host
Execution of F’ may involve external communication with neighbouring nodes to implement the message-passing model of parallel computation commonly used in HPC applications
this happens using FPGA-to-FPGA dataflow over RocketIO
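The F / F’ pattern above can be sketched as follows (a software stand-in, with invented names; on Maxwell, F’ runs in the FPGA fabric and the copies cross the host/accelerator boundary):

```cpp
#include <numeric>
#include <vector>

// Stand-in for the accelerator's local memory component.
static std::vector<double> deviceMem;

// F': the hardware function, processing data directly from local memory.
// (Here an ordinary function; on Maxwell this is the FPGA design.)
double fPrime() {
    return std::accumulate(deviceMem.begin(), deviceMem.end(), 0.0);
}

// F: the original hotspot, its innards replaced with a call out to F'.
double f(const std::vector<double>& input) {
    deviceMem = input;          // 1. copy relevant input data to the accelerator
    double result = fPrime();   // 2. invoke F' and wait for completion
    return result;              // 3. copy output data back to the host
}
```

The point of the pattern is that steps 1 and 3 are the only host/FPGA transfers, whatever F’ does internally.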
8. Parallel Toolkit architecture
There is a loose layering within the PTK
'Client code' means the layer above (ultimately the application)
9. Basic PTK objects
Components
essentially computing devices performing some calculation
Hard Data Structures
data residing in FPGA hardware (as opposed to host memory)
Allocators
provide controlled access to both these types of resources
Accelerators
model all the relevant acceleration hardware for a given implementation of a given algorithm for a given node
their main role is configuration
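A minimal sketch of these four object kinds as C++ abstract classes (illustrative only; the method names and signatures are invented here, not the real PTK interfaces):

```cpp
#include <cstddef>
#include <memory>

// Component: a computing device performing some calculation.
class Component {
public:
    virtual ~Component() = default;
    virtual void run() = 0;
};

// Hard Data Structure: data residing in FPGA hardware, not host memory.
class HardDataStructure {
public:
    virtual ~HardDataStructure() = default;
    virtual std::size_t sizeBytes() const = 0;
};

// Allocator: controlled access to Components and Hard Data Structures,
// so client code never constructs hardware resources directly.
class Allocator {
public:
    virtual ~Allocator() = default;
    virtual std::unique_ptr<HardDataStructure> request(std::size_t bytes) = 0;
};

// Accelerator: models all relevant acceleration hardware for a given
// implementation of a given algorithm on a given node.
class Accelerator {
public:
    virtual ~Accelerator() = default;
    virtual void configure() = 0;   // main role: configuration, e.g. bitstream load
};
```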
10. PTK object implementations
PTK defines Components etc. for a given application as abstract superclasses in C++
e.g. class MatrixMultiplier {};
This defines the PTK API for that application
Accelerated versions are implemented as vendor-specific concrete subclasses
e.g. class MatrixMultiplierVend1 : public MatrixMultiplier {};
Get one subclass per flavour of vendor hardware
e.g. MatrixMultiplierNt1, MatrixMultiplierAd1, etc.
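The superclass/subclass pattern might look like this (class names follow the slides, but the method signature and factory are invented for illustration; the pure-software subclass here plays the role of the "Dummy" implementations mentioned later):

```cpp
#include <memory>
#include <string>
#include <vector>

// Abstract PTK API class: one per accelerated application function.
class MatrixMultiplier {
public:
    virtual ~MatrixMultiplier() = default;
    // c = a * b for square n x n matrices stored row-major.
    virtual void multiply(const std::vector<double>& a,
                          const std::vector<double>& b,
                          std::vector<double>& c, int n) = 0;
};

// One concrete subclass per flavour of vendor hardware. This one is a
// pure-software stand-in; MatrixMultiplierNt1 / MatrixMultiplierAd1
// would drive the actual FPGA boards instead.
class MatrixMultiplierDummy : public MatrixMultiplier {
public:
    void multiply(const std::vector<double>& a,
                  const std::vector<double>& b,
                  std::vector<double>& c, int n) override {
        c.assign(static_cast<std::size_t>(n) * n, 0.0);
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
    }
};

// Hypothetical factory: client code names a flavour (e.g. taken from the
// PTK config file) and never mentions a vendor type directly.
std::unique_ptr<MatrixMultiplier> makeMultiplier(const std::string& flavour) {
    if (flavour == "Dummy") return std::make_unique<MatrixMultiplierDummy>();
    return nullptr;   // Nt1, Ad1, ... subclasses would be registered here
}
```

Because client code only holds a `MatrixMultiplier*`, revising the hardware means adding a subclass, not touching callers.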
11. PTK system configuration
The PTK uses a config file to set up system-level properties
number of nodes, which accelerators to use, etc.
# PTK config file for running the DU software implementation
# of App1 on 2 processes on Maxwell.
app.nodes = 2
app.node.0.accelerator = AcceleratorApp1Du1
app.node.1.accelerator = AcceleratorApp1Du1
app.node.*.host = maxwell.epcc.ed.ac.uk
app.node.*.exe = "$PTK_HOME/app/app1/Executable/app1solve_info_du.exe"
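Parsing this key = value format can be sketched in a few lines of C++ (illustrative only; the PTK's real loader must also expand the `node.*` wildcard across all configured nodes, which is omitted here):

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Minimal parser for '#'-commented key = value lines, as in the
// PTK config file shown above. Not the actual PTK implementation.
std::map<std::string, std::string> parseConfig(std::istream& in) {
    std::map<std::string, std::string> props;
    auto trim = [](std::string s) {
        s.erase(0, s.find_first_not_of(" \t"));
        s.erase(s.find_last_not_of(" \t") + 1);
        return s;
    };
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == '#') continue;   // skip comments
        std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;          // not a property
        props[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return props;
}
```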
12. Developing for the PTK: a worked example
Original Application
Refactored Application
Rewritten Solver
rewritten against PTK generic & app specific APIs
no FPGA code
Rewritten matrixMultiply
rewritten against PTK generic & app specific APIs
no FPGA code
FPGA-specific MatrixMultiplier
sub-PTK, vendor-specific API calls
13. Original application
Multi-process MPI code
SPMD model
nodes run the same code but on different data
Identify hotspot areas
by profiling
The hotspots could occur at any depth in an application’s call tree
14. Refactored application
Refactored so that hotspots are in self-contained functions
this is really just good software engineering
15. Refactored solver
Original code, tidied up
Key functions identified
these will be implemented in hardware
16. PTK rewritten solver
17. Rewritten solver (2)
Code at this level is
mixed generic PTK calls
Ptk::getInstance()->start();
and application specific calls
// Copy data to Hard Data Structs
MatrixHardAllocator* mha = MatrixHardAllocator::getInstance();
MatrixHard* m = mha->requestMatrixHard(nx, ny, nz);
but still non-FPGA specific, clean OO code
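The allocator calls in the snippet above can be mimicked with pure-software stand-ins (everything below is illustrative: the class shapes echo the slide's snippet, but the members and the simulated "FPGA memory" are invented here):

```cpp
#include <memory>
#include <vector>

// A matrix resident in (simulated) FPGA memory; on Maxwell the storage
// would live on the accelerator board, not in a host vector.
struct MatrixHard {
    int nx, ny, nz;
    std::vector<double> data;
    MatrixHard(int x, int y, int z)
        : nx(x), ny(y), nz(z),
          data(static_cast<std::size_t>(x) * y * z) {}
};

// Singleton allocator giving controlled access to hard matrices,
// matching the MatrixHardAllocator::getInstance() style of the slide.
class MatrixHardAllocator {
public:
    static MatrixHardAllocator* getInstance() {
        static MatrixHardAllocator instance;   // one per process
        return &instance;
    }
    std::unique_ptr<MatrixHard> requestMatrixHard(int nx, int ny, int nz) {
        return std::make_unique<MatrixHard>(nx, ny, nz);
    }
private:
    MatrixHardAllocator() = default;           // clients must use getInstance()
};
```

Routing every request through the singleton is what lets the toolkit keep count of, and eventually ration, scarce FPGA memory.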
18. PTK rewritten matrixMultiply
All code at this level is still
non-FPGA specific
vendor neutral
but becoming app specific
19. FPGA-specific MatrixMultiplier
Concrete subclass of PTK abstract MatrixMultiplier
This interfaces to the FPGA
Code at this level is vendor specific
but can also be written in pure software
(we call these the “Dummy” implementations)
20. How much of the PTK is generic?
A modest amount
PTK is not a collection of generic cores
it’s not an implementation of MPI or BLAS
Because
getting data onto the FPGA is expensive
so the FPGA function has to be worth it
it has to be one or two routines where most of the execution time is spent
and these are typically not generic
21. PTK vs generic libraries
PTK provides a methodology and utility classes
aim is to capture and facilitate “accelerability” at a high level
while maintaining a clean software approach throughout
A generic function library (eg. BLAS) is too low level
a single callout to an FPGA matrix-vector multiply involves a lot of data transfer from host memory to FPGA memory
quite possibly over PCI or some other wet string
this will kill any benefit from the FPGA acceleration
“MPI for FPGAs” makes no sense
for the same reasons