
Many-Core Software

Presentation Transcript


  1. Many-Core Software
     Burton Smith, Microsoft

  2. Computing is at a Crossroads
     • Continual performance improvement is our field’s lifeblood
       • It encourages people to buy new hardware
       • It opens up new software possibilities
     • Single-thread performance is nearing the end of the line
       • But Moore’s Law will continue for some time to come
       • What can we do with all those transistors?
     • Computation needs to become as parallel as possible
       • Henceforth, serial means slow
     • Systems must support general purpose parallel computing
       • The alternative is commoditization
     • New many-core chips will need new software
       • Our programming models will have to change
       • The von Neumann premise is broken

  3. The von Neumann Premise
     • Simply put, “instruction instances are totally ordered”
     • This notion has created artifacts:
       • Variables
       • Interrupts
       • Demand paging
     • And caused major problems:
       • The ILP wall
       • The power wall
       • The memory wall
     • What software changes will we need for many-core?
       • New languages?
       • New approaches for compilers, runtimes, tools?
       • New (or perhaps old) operating system ideas?

  4. Do We Really Need New Languages?
     • Mainstream languages schedule values into variables
       • To orchestrate the flow of values in the program
       • To incrementally but consistently update state
     • Introducing parallelism exposes weaknesses in:
       • Passing values between unordered instructions
       • Updating state consistently (see the sketch below)
     • Our “adhesive bandage” attempts have proven insufficient
       • Not general enough
       • Not productive enough
     • So my answer is “Absolutely!”
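A minimal sketch of the state-update problem, not from the talk and in Haskell only because it is compact: two tasks each perform a non-atomic read-modify-write on a shared variable, so increments are lost. Build with ghc -threaded and run with +RTS -N2 to observe it.

     import Control.Concurrent (forkIO)
     import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
     import Control.Monad (replicateM_)
     import Data.IORef (newIORef, readIORef, writeIORef)

     main :: IO ()
     main = do
       counter <- newIORef (0 :: Int)
       done    <- newEmptyMVar
       let bump = do
             v <- readIORef counter        -- read the shared variable ...
             writeIORef counter (v + 1)    -- ... then write it back: not atomic
           worker = replicateM_ 100000 bump >> putMVar done ()
       _ <- forkIO worker
       _ <- forkIO worker
       takeMVar done
       takeMVar done
       readIORef counter >>= print         -- usually prints less than 200000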

  5. Parallel Programming Languages
     • There are (at least) two promising approaches:
       • Functional programming
       • Atomic memory transactions (see the sketch below)
     • Neither is completely satisfactory by itself
       • Functional programs don’t allow mutable state
       • Transactional programs implement data flows awkwardly
     • Database applications show the synergy of these two ideas
       • SQL is a “mostly functional” language
       • Transactions allow Consistency via Atomicity and Isolation
     • Many people think functional languages must be inefficient
       • Sisal and NESL are excellent counterexamples
       • Both competed strongly with Fortran on Cray systems
     • Others think memory transactions must be inefficient also
       • This remains to be seen; we have only just begun to optimize
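The same counter as before, reworked in the transactional style using GHC's stm library (my illustration; the talk does not prescribe a language or library): wrapping the read-modify-write in atomically makes the two writers compose correctly.

     import Control.Concurrent (forkIO)
     import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
     import Control.Concurrent.STM (atomically, newTVarIO, readTVar, readTVarIO, writeTVar)
     import Control.Monad (replicateM_)

     main :: IO ()
     main = do
       counter <- newTVarIO (0 :: Int)
       done    <- newEmptyMVar
       let bump = atomically $ do
             v <- readTVar counter          -- the read ...
             writeTVar counter (v + 1)      -- ... and the write commit together
           worker = replicateM_ 100000 bump >> putMVar done ()
       _ <- forkIO worker
       _ <- forkIO worker
       takeMVar done
       takeMVar done
       readTVarIO counter >>= print         -- always prints 200000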

  6. Transactions and Invariants
     • Invariants are a program’s conservation laws
       • Relationships among values in iteration and recursion
       • Rules of data structure (state) integrity
     • If statements p and q preserve the invariant I and they do not “interfere”, their parallel composition { p || q } also preserves I†
     • If p and q are performed atomically, i.e. as transactions, then they will not interfere‡
     • Although operations seldom commute with respect to state, transactions give us commutativity with respect to the invariant (see the sketch below)
     • It would help if the invariants were available to the compiler
       • Can we ask programmers to supply them?

     † Susan Owicki and David Gries. Verifying properties of parallel programs: An axiomatic approach. CACM 19(5):279−285, May 1976.
     ‡ Leslie Lamport and Fred Schneider. The “Hoare Logic” of CSP, And All That. ACM TOPLAS 6(2):281−296, Apr. 1984.
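A small sketch of the invariant argument, again with GHC's stm library rather than anything from the talk: the invariant is that the two balances always sum to 2000; each transfer preserves it, and because transfers run as transactions they cannot interfere, so the parallel composition { p || q } preserves it too.

     import Control.Concurrent (forkIO)
     import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
     import Control.Concurrent.STM
     import Control.Monad (replicateM_)

     -- Each transfer moves money but preserves the invariant "a + b is constant".
     transfer :: TVar Int -> TVar Int -> Int -> STM ()
     transfer from to amount = do
       x <- readTVar from
       y <- readTVar to
       writeTVar from (x - amount)
       writeTVar to   (y + amount)

     main :: IO ()
     main = do
       a    <- newTVarIO (1000 :: Int)
       b    <- newTVarIO (1000 :: Int)
       done <- newEmptyMVar
       let p = replicateM_ 10000 (atomically (transfer a b 1)) >> putMVar done ()
           q = replicateM_ 10000 (atomically (transfer b a 1)) >> putMVar done ()
       _ <- forkIO p
       _ <- forkIO q
       takeMVar done
       takeMVar done
       total <- (+) <$> readTVarIO a <*> readTVarIO b
       print total     -- always 2000: { p || q } preserves the invariant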

  7. Styles of Parallelism
     • We probably need to support multiple programming styles
       • Both functional and transactional
       • Both data parallel and task parallel (see the sketch below)
       • Both message passing and shared memory
       • Both declarative and imperative
       • Both implicit and explicit
     • We may need several languages to accomplish this
       • After all, we do use multiple languages today
       • Language interoperability (e.g. .NET) will help greatly
     • It is essential that parallelism be exposed to the compiler
       • So that the compiler can adapt it to the target system
     • It is also essential that locality be exposed to the compiler
       • For the same reason
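A compact contrast of two of these styles, in Haskell with the parallel package (my choice of language and library, not the talk's): a declarative, data-parallel map whose scheduling is left to the runtime, and an explicit, imperative task fork and join. Build with ghc -threaded and run with +RTS -N.

     import Control.Concurrent (forkIO)
     import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
     import Control.Parallel.Strategies (parMap, rdeepseq)

     -- A stand-in for real work.
     expensive :: Int -> Int
     expensive n = sum [1 .. n]

     main :: IO ()
     main = do
       -- Data parallel, declarative: say what to compute over the whole
       -- collection and let the runtime schedule the pieces.
       print (sum (parMap rdeepseq expensive [100000 .. 100063]))

       -- Task parallel, explicit: the programmer forks a task and joins it.
       box <- newEmptyMVar
       _ <- forkIO (putMVar box $! expensive 2000000)   -- $! forces the work in the child task
       let here = expensive 3000000
       there <- takeMVar box
       print (here + there)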

  8. Compiler Optimization for Parallelism
     • Some say automatic parallelization is a demonstrated failure
     • Vectorizing and parallelizing compilers (especially for the right architecture) have been a tremendous success
       • They have enabled machine-independent languages
       • What they do can be termed parallelism packaging
       • Even manifestly parallel programs need it
     • What failed is parallelism discovery, especially in-the-large
       • Dependence analysis is chiefly a local success
     • Locality discovery in-the-large has also been a non-starter
       • Locality analysis is another word for dependence analysis
     • The jury is still out on in-the-large locality packaging
       • Local locality packaging works pretty well

  9. Fine-grain Parallelism
     • Exploitable parallelism grows as task granularity shrinks
       • But dependences among tasks become more numerous
     • Inter-task dependence enforcement demands scheduling
       • A task needing a value from elsewhere must wait for it (see the sketch below)
     • User-level work scheduling is needed
       • No privilege change to stop or restart a task
       • Locality (e.g. cache content) can be better preserved
     • Today’s OSes and hardware don’t encourage waiting
       • OS thread preemption makes blocking dangerous
       • Instruction sets encourage non-blocking approaches
       • Busy-waiting wastes instruction issue opportunities
     • We need better support for blocking synchronization
       • In both instruction set and operating system
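One existing example of blocking synchronization under user-level scheduling, offered as an illustration rather than the talk's proposal: in GHC, a consumer that blocks inside a transaction via retry is parked by the runtime's lightweight-thread scheduler instead of busy-waiting, and is woken when a producer writes the variable it read.

     import Control.Concurrent (forkIO, threadDelay)
     import Control.Concurrent.STM

     main :: IO ()
     main = do
       slot <- newTVarIO (Nothing :: Maybe Int)
       _ <- forkIO $ do
              threadDelay 100000                      -- producer works for a while
              atomically (writeTVar slot (Just 42))   -- then publishes its value
       v <- atomically $ do                           -- consumer waits for the value
              mv <- readTVar slot
              case mv of
                Nothing -> retry                      -- block; the scheduler parks this task
                Just x  -> return x
       print v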

  10. Resource Management Consequences
     • Since the user runtime is scheduling work on processors, the OS should not attempt to do the same
       • An asynchronous OS API is a necessary corollary
       • The user-exposed API should be synchronous
     • Scheduling memory via demand paging is also problematic
     • Instead, the application and OS should negotiate (see the sketch below)
       • The application tells the OS its resource needs & desires
       • The OS makes decisions based on the big picture:
         • Requirements for quality of service
         • Availability of resources
         • Appropriateness of power level
     • The OS can preempt resources to reclaim them
       • But with notification, so the application can rearrange work
     • Resources should be time- and space-shared in chunks
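A purely hypothetical sketch of such a negotiation; none of these types or functions exist in any real OS API, they only make the shape of the exchange concrete.

     -- Every name below is hypothetical; no real OS exposes this interface.
     data ResourceRequest = ResourceRequest
       { coresNeeded   :: Int          -- minimum needed to make progress
       , coresDesired  :: Int          -- more would be used productively
       , memoryBytes   :: Integer      -- working-set estimate
       , qosDeadlineMs :: Maybe Int    -- Just d for quality-of-service work
       } deriving Show

     data Grant = Grant
       { coresGranted  :: Int
       , memoryGranted :: Integer
       } deriving Show

     -- Stub for the OS side: a real kernel would weigh quality of service,
     -- resource availability, and power across every running application.
     negotiate :: ResourceRequest -> IO Grant
     negotiate req = return (Grant (coresNeeded req) (memoryBytes req))

     -- Preemption arrives as a notification carrying the revised grant, so the
     -- user-level runtime can repack its work instead of being descheduled blindly.
     onPreempt :: (Grant -> IO ()) -> IO ()
     onPreempt _rearrange = return ()    -- stub

     main :: IO ()
     main = do
       g <- negotiate (ResourceRequest 4 16 (256 * 1024 * 1024) Nothing)
       print g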

  11. Bin Packing
     • The more resources allocated, the more swapping overhead
       • It would be nice to amortize it
       • The more resources you get, the longer you may keep them
     • Roughly, this means scheduling = packing squarish blocks
       • QOS applications might need long rectangles instead
     • When the blocks don’t fit, the OS can morph them a little
       • Or cut corners when absolutely necessary

     [Figure: allocated blocks packed on axes of quantity of resource versus time]

  12. Parallel Debugging and Tuning
     • Today, debugging relies on single-stepping and printf()
       • Single-stepping a parallel program is a bit less effective
     • Conditional program and data breakpoints are helpful
       • To stop when an invariant fails to be true
     • Support for ad-hoc data perusal is also very important
       • Debugging is data mining
     • Serial program tuning tries to discover where the program counter spends its time
       • The answer is usually found by sampling the PC
     • In contrast, parallel program tuning tries to discover where there is insufficient parallelism
       • A good way is to log perf counters and a timestamp at events (see the sketch below)
     • Visualization is a big deal for both debugging and tuning
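A minimal sketch of event logging with timestamps, in Haskell for consistency with the earlier sketches. GHC also has a built-in eventlog via Debug.Trace.traceEventIO and the +RTS -l flag; this hand-rolled version just shows the idea, and omits the hardware performance counters a real tool would record alongside each event.

     import Data.IORef (IORef, newIORef, modifyIORef', readIORef)
     import Data.Time.Clock (UTCTime, getCurrentTime)

     -- An in-memory event log: (timestamp, label) pairs, newest first.
     type EventLog = IORef [(UTCTime, String)]

     logEvent :: EventLog -> String -> IO ()
     logEvent logRef label = do
       t <- getCurrentTime
       modifyIORef' logRef ((t, label) :)   -- a real logger would append atomically

     main :: IO ()
     main = do
       logRef <- newIORef []
       logEvent logRef "phase1:start"
       print (sum [1 .. 1000000 :: Int])    -- stand-in for one parallel phase
       logEvent logRef "phase1:end"
       readIORef logRef >>= mapM_ print . reverse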

  13. Conclusions
     • It is time to rethink some of the basics
     • There is lots of work for everyone to do
       • I’ve left out lots of things, e.g. applications
     • We need basic research as well as industrial development
       • Research in computer systems is deprecated these days
       • In the USA, NSF and DOD need to take the initiative
