Future Languages for Multi-Core Systems Optimization

New Architectures Need New Languages
A triumph of optimism over experience!
Ian Watson, 3rd July 2009

‘Obvious Truths’
• Single processors will not get faster, so we need to go multi-core
• There will be a need for processors with many (> 32?) cores
• These will need to support general-purpose applications
• Application performance will need to scale with the number of cores

‘Obvious Truths’ (2)
• General-purpose parallel computing needs shared memory
• Current shared memory requires cache coherence
• Cache coherence doesn’t scale beyond 32 cores
• Updateable state makes general-purpose parallel programming difficult

‘Obvious Untruths’
• HPC already has all the answers to parallel programming
• Message passing is the answer (hardware, software or both)
• Conventional languages already have adequate threading and locking facilities
• We can program without state

So What Next?
• Simplifying the programming model must be the answer – removing facilities is desirable, e.g.:
  • Arbitrary control transfer
  • Pointer arithmetic
  • Explicit memory reclamation
• Arbitrary state manipulation is the enemy of parallelism – we must restrict it!

Half Truths?
• Functional languages are the answer to parallelism; all we need is to add state (in a controlled way)
• Transactional memory can replace locking to simplify the handling of parallel state
• Transactional memory can remove the need for cache coherence

Functions + Transactions
• The Haskell work at Microsoft Research Cambridge has shown how transactions can be included in a functional language via monads
• Is this a style of programming that can be sold to the world as the way ahead for future multi-core (many-core) systems?

Selling a New Language
• It must be capable of expressing everything that people want
• It isn’t just a case of producing something that is a good technical solution
• It mustn’t be too complex
• It probably needs to look familiar
• It needs to be efficient to implement

The Problems
• FP is unfamiliar to many existing programmers
• Many people find it hard to understand
• Even more find monads difficult
• In spite of excellent FP compiler technology, imperative programming will probably always be more efficient

Can We Compromise?
• Pure functional programs can easily be executed in parallel because they don’t update global state
• But if we only exploit parallelism at the function level, local manipulation of state within a function causes no problems
• Can we work with such a model?

What Would We Gain?
• ‘Easy’ parallelism at the function level
  • This could be either explicit or implicit
• Familiarity of low-level code
  • Can use iteration, assignment, updateable arrays etc.
• Potential increase in efficiency
  • Direct mapping to machine code
  • Explicit memory re-use
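A minimal sketch of the function-level model, using Python purely as a neutral notation (the slides do not prescribe a language). The hypothetical function `prefix_sums` is pure as far as its caller can tell, since it reads only its argument and returns a new list, yet internally it uses iteration, assignment and an updateable local array. Because nothing mutable escapes, independent calls can be handed to a worker pool from the standard `concurrent.futures` module. In CPython the global interpreter lock stops pure-Python threads from running truly in parallel, so this shows the structure of the model rather than a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def prefix_sums(values):
    """Externally pure: returns a new list and never touches shared state.

    Internally imperative: a loop, a running total updated by assignment,
    and a locally allocated list that is updated in place.
    """
    out = [0] * len(values)   # local, thread-private buffer
    total = 0
    for i, v in enumerate(values):
        total += v
        out[i] = total        # safe: `out` is not visible to anyone until returned
    return out

def parallel_map(fn, inputs, workers=4):
    """Function-level parallelism: the calls share no mutable state."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, inputs))   # results come back in input order

if __name__ == "__main__":
    rows = [[1, 2, 3], [4, 5], [10, 20, 30, 40]]
    print(parallel_map(prefix_sums, rows))  # prints [[1, 3, 6], [4, 9], [10, 30, 60, 100]]
```

Whether the parallelism is explicit (as with `parallel_map` here) or inferred by a compiler is exactly the open choice the slide mentions; the local mutation inside `prefix_sums` is harmless in either case.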

What Would We Lose?
• Clearly we lose referential transparency within any imperative code
• But this is inevitable if we want to manipulate state – even with monads
• Clearly, as described so far, we haven’t got the ability to manipulate global state – we need more

Adding Transactions
• We should only use shared state when it is really necessary
• It should be clear in the language when this is happening
• It should be detectable statically
• Ideally, it should be possible to check the need for atomic sections automatically

Memory Architecture
• With the right underlying programming model we should be able to determine memory regions:
  • Read only
  • Thread local
  • Global write once
  • Global shared (transactional)
• Can lead to a simplified, scalable memory architecture
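The four regions can be made concrete with a small sketch. It assumes a hypothetical `Cell` type tagged with a `Region` and enforces each region's access rule at run time. A language built on this model would presumably establish regions statically; the checks here only make the rules explicit. The hardware payoff suggested by the slide follows from the rules: read-only and written-once data can be replicated in any cache without later invalidation, thread-local data is never seen by another core, and only the transactional region needs updates propagated at commit.

```python
import threading
from enum import Enum, auto

class Region(Enum):
    READ_ONLY = auto()      # initialised once, never written afterwards
    THREAD_LOCAL = auto()   # only the owning thread may read or write
    WRITE_ONCE = auto()     # global, written exactly once, then readable by all
    SHARED_TM = auto()      # global and mutable, written only inside a transaction

class RegionError(Exception):
    pass

class Cell:
    """A single memory cell tagged with its region (hypothetical, for illustration)."""

    def __init__(self, region, value=None):
        self.region = region
        self._value = value
        self._written = value is not None
        self._owner = threading.get_ident()   # thread that created the cell

    def read(self):
        if self.region is Region.THREAD_LOCAL and threading.get_ident() != self._owner:
            raise RegionError("thread-local cell read by another thread")
        if self.region is Region.WRITE_ONCE and not self._written:
            raise RegionError("write-once cell read before being written")
        return self._value

    def write(self, value, in_transaction=False):
        if self.region is Region.READ_ONLY:
            raise RegionError("read-only cell cannot be written")
        if self.region is Region.THREAD_LOCAL and threading.get_ident() != self._owner:
            raise RegionError("thread-local cell written by another thread")
        if self.region is Region.WRITE_ONCE and self._written:
            raise RegionError("write-once cell already written")
        if self.region is Region.SHARED_TM and not in_transaction:
            raise RegionError("shared cell written outside a transaction")
        self._value = value
        self._written = True
```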

Experiments
• Using Scala to investigate programming styles
  • Is open source
  • Has both imperative & functional features
  • Not currently transactional
• Using a Simics-based hardware simulator to experiment with memory architectures

Outstanding Questions
• Data Parallelism
  • How to express
  • How to handle in-place update of parallel data (array) structures
• Streaming applications
  • Purely functional?
  • Need message passing constructs?
  • Need additions to the memory model?
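One possible answer to the in-place update question, sketched under the assumption that the language can guarantee each worker a disjoint index range: split the array into non-overlapping slices and let each worker update only its own slice. No element is written by two workers, so neither locks nor transactions are needed. The names `chunks` and `parallel_update_in_place` are hypothetical, and in CPython the threads illustrate the partitioning, not a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(n, parts):
    """Split range(n) into `parts` contiguous, non-overlapping (start, stop) ranges."""
    size, extra = divmod(n, parts)
    start = 0
    for p in range(parts):
        stop = start + size + (1 if p < extra else 0)
        yield start, stop
        start = stop

def parallel_update_in_place(data, fn, workers=4):
    """Apply fn to every element of `data` in place, one disjoint slice per worker."""
    def work(bounds):
        lo, hi = bounds
        for i in range(lo, hi):
            data[i] = fn(data[i])   # each worker writes only its own indices

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every worker and re-raises any exception it hit
        list(pool.map(work, chunks(len(data), workers)))
    return data
```

For example, `parallel_update_in_place(list(range(10)), lambda x: x * x)` squares every element of the list in place, with the four workers covering indices 0–2, 3–5, 6–7 and 8–9.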

Conclusions
• None really so far!
• But am convinced, from a technical viewpoint, that we need new programming approaches
• Am fairly convinced that we need to be pragmatic in order to sell a new approach, even if this requires compromising on ideals

Questions?

Transactional Memory
• Programming model to simplify manipulation of shared state
• Speculative model
  • Sections of program declared ‘atomic’
  • They must complete without conflict or die and restart
  • Must not alter global state until complete
• Needs system support – software or hardware
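The speculative model can be illustrated with a minimal software TM, assuming a simple optimistic scheme that is not the object-based hardware design discussed here nor any particular published STM. Each shared variable carries a version number; a transaction records the versions it reads and buffers its writes privately, so global state is untouched until commit. At commit it checks, under one global lock, that nothing it read has changed; if something has, it dies and restarts. The names `TVar` and `atomically` echo Haskell's STM library, but this is an independent sketch, not that API.

```python
import threading

_commit_lock = threading.Lock()   # serialises validation and publication of commits

class TVar:
    """A shared transactional variable: a value plus a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0

class Conflict(Exception):
    """Raised when a running transaction must abort and restart."""

class Transaction:
    def __init__(self):
        self.reads = {}    # TVar -> version seen at first read
        self.writes = {}   # TVar -> tentative value, private until commit

    def read(self, tvar):
        if tvar in self.writes:                    # read-your-own-writes
            return self.writes[tvar]
        with _commit_lock:
            # Re-validate earlier reads so the body never sees a mix of old and new values.
            for seen, version in self.reads.items():
                if seen.version != version:
                    raise Conflict()
            value, version = tvar.value, tvar.version
        self.reads.setdefault(tvar, version)
        return value

    def write(self, tvar, value):
        self.writes[tvar] = value                  # buffered: global state unchanged

    def commit(self):
        with _commit_lock:
            for tvar, version in self.reads.items():
                if tvar.version != version:
                    return False                   # another transaction committed first
            for tvar, value in self.writes.items():
                tvar.value = value                 # publish all updates together
                tvar.version += 1
            return True

def atomically(body):
    """Run body(tx) until it commits without conflict, then return its result."""
    while True:
        tx = Transaction()
        try:
            result = body(tx)
        except Conflict:
            continue                               # die and restart
        if tx.commit():
            return result
```

For example, `atomically(lambda tx: tx.write(counter, tx.read(counter) + 1))` increments a shared `TVar` named `counter` safely from any number of threads. The single global commit lock keeps the sketch short; practical STMs typically use finer-grained locking or ownership records.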

Object-Based Transactional Memory Hardware
• Based on ‘object-aware’ caches
• Exploits object structure to simplify transactional memory operations
• Advantages over other hardware TM proposals
  • Handles cache overflow elegantly
  • Enables multiple memory banks with distributed commit

TM & Cache Coherence
• Fine-grained cache coherence is the major impediment to extensible multi-cores
• Updates to shared memory only occur when a transaction commits
• Caches only need to be updated at commit points (which tend to be coarser grained)
• If all shared memory is made transactional, the requirement for fine-grained coherence is removed

TM Programming
• TM constructs can be added to conventional programming languages
• But they require careful use to ensure correctness
• If transactional & non-transactional operations are allowed on the same data, the result can become complex to understand
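The last point can be made concrete with a deterministic example of what is usually called the weak-atomicity problem. A toy transaction (hypothetical `BufferedTx`) buffers its writes until commit and has no way of noticing plain writes made outside any transaction. A non-transactional withdrawal that lands between the transaction's read and its commit is silently overwritten, giving a result that no serial ordering of the two operations could produce.

```python
class Shared:
    """A plain shared cell that can be written with or without a transaction."""
    def __init__(self, value):
        self.value = value

class BufferedTx:
    """Toy transaction: buffers writes until commit; blind to plain writes."""
    def __init__(self):
        self.writes = {}

    def read(self, cell):
        return self.writes.get(cell, cell.value)

    def write(self, cell, value):
        self.writes[cell] = value

    def commit(self):
        for cell, value in self.writes.items():
            cell.value = value

balance = Shared(100)

tx = BufferedTx()
tx.write(balance, tx.read(balance) + 10)   # transactional deposit of 10 (buffered)

balance.value -= 30                         # non-transactional withdrawal interleaves

tx.commit()                                 # commit publishes 110, erasing the withdrawal
print(balance.value)                        # prints 110; either serial order would give 80
```

A strongly atomic system would have to detect or forbid the plain write; making all shared data transactional, as the previous slide suggests, removes the mixed case altogether.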

New Programming Models?
• Problems can often be simplified by restricting (unnecessary) programming facilities, e.g.:
  • Arbitrary control transfer
  • Pointer arithmetic
  • Explicit memory reclamation
• A new approach is needed to simplify parallel programming & hardware

We Need Usable & Efficient Models
• Shared memory is essential for general-purpose programming
• Message passing alone (e.g. MPI, Occam) is not sufficient
• We need shared updateable state, so pure functional programming is not the answer
• The languages need to be simple and easily implementable

A Synthesis?
• Functional programming has something to offer – don’t use state unnecessarily
• But don’t be too ‘religious’ – local, single-threaded state is simple & efficient
• Can all global shared state be handled transactionally?

Experiments
• Using the language Scala – has both functional and imperative features
• Experimenting with applications
• Studying how techniques similar to ‘escape analysis’ can identify shared mutable state
• Looking at hardware implications, particularly memory architecture
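The slides do not describe the analysis used in these experiments, so the following is only a toy illustration of the escape-analysis idea. It assumes a hypothetical mini-IR for one function body: `("new", x)` allocates, `("copy", x, y)` aliases, `("store", obj, y)` puts `y` in a field of `obj`, and `("global", y)` / `("spawn", y)` publish `y` to a global or to a new thread. A variable is reported as potentially shared if its object can be reached from a published object; everything else stays thread-local and could be mutated freely without transactions. Real escape analyses are usually flow-sensitive and interprocedural; this one is neither.

```python
def shared_variables(stmts):
    """Flow-insensitive escape analysis over a toy IR (hypothetical).

    Returns the variables whose objects may be reachable by more than one thread.
    """
    edges = {}     # var -> vars whose objects become reachable through it
    roots = set()  # vars published directly to a global or another thread

    def link(a, b):
        edges.setdefault(a, set()).add(b)

    for op, *args in stmts:
        if op == "copy":                  # x = y: both names refer to one object
            x, y = args
            link(x, y)
            link(y, x)
        elif op == "store":               # obj.f = y: y reachable from obj
            obj, y = args
            link(obj, y)
        elif op in ("global", "spawn"):   # published: escapes the thread
            roots.add(args[0])
        # "new" creates a fresh, private object: nothing to record

    shared, work = set(), list(roots)
    while work:                           # propagate reachability to a fixpoint
        v = work.pop()
        if v not in shared:
            shared.add(v)
            work.extend(edges.get(v, ()))
    return shared

program = [
    ("new", "buf"), ("new", "node"), ("new", "tmp"),
    ("store", "node", "buf"),   # node.data = buf
    ("spawn", "node"),          # node is handed to another thread
    ("copy", "alias", "tmp"),   # alias = tmp, which never escapes
]
print(sorted(shared_variables(program)))  # prints ['buf', 'node']
```

Variables in the result would be candidates for the transactional (global shared) region; `tmp` and `alias` could stay in thread-local memory.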
