

  1. CS510 Concurrent Systems, Class 2a: Kernel-Kernel Communication in a Shared-Memory Multiprocessor

  2. Review of Multiprocessor Architectures
     • UMA (uniform memory access)
     • NUMA (non-uniform memory access)
     • NORMA (no remote memory access)
     • Which type are today's shared-memory MP architectures?

  3. Architecture Trends
     • More CPUs, and more CPUs on the same chip
     • On- and off-chip communication
     • Shared memory abstraction
       • Implemented via multiple levels of caches
       • Some levels shared by some CPUs (MP nodes)
     • Increasing distance from "memory"
       • Relative cost of communication to computation is increasing
       • Weak memory consistency models to compensate
     • Moving away from the shared memory abstraction?

  4. Review of Design Choices for MP OS
     • The problem: how to structure the kernel for invocation from multiple CPUs
     • Local vs. global computation
       • Global computation involves more than one CPU
     • How to synchronize access to data used in global computation?
     • How to structure access to remote data?

  5. Review of Design Choices for MP OS
     • The master/slave approach
       • Designate one CPU as special and use it to run the OS
       • The others run application code and call it to make system calls
       • Master/slave uses remote invocation for system calls
     • Advantage: low synchronization complexity
     • Disadvantage: poor scalability; the master CPU becomes a bottleneck!

  6. Review of Design Choices for MP OS
     • Symmetric multiprocessing (SMP) approach
       • All CPUs are equal
       • Remote memory access for system calls
     • Problem: how to synchronize concurrent accesses?
       • Typically locking (fine or coarse grain)
     • Advantages
       • Uniformity: any CPU can issue a system call
       • Porting a uniprocessor OS only requires adding synchronization
     • Disadvantage: high contention, poor scalability

  7. Review of Design Choices for MP OS
     • Intermediate approach: clusters
       • Replicate the master across a few nodes
       • A few privileged CPUs can execute OS code
       • The remaining nodes are specialized application compute nodes
     • How to coordinate among the OS cluster nodes?
       • SMP approach among the special nodes?
       • Message-passing kernel among structured nodes?

  8. Node Locality vs. Locality of Reference
     • Node locality is essential for the scalability of OS code on large MP systems
       • Implies most computation/memory access is local to a node
       • Little inter-node communication or idling of CPUs
     • How can we get this easily?
       • Studies of uniprocessor OSs show poor address locality
       • A uniprocessor OS will not naturally extend to SMP on large MPs via naïve use of remote access
       • False sharing and contention for cache lines result in high network traffic and much blocked/stalled computation

  9. OS Design for Large-Scale NUMA MPs
     • This paper studies large-scale MPs based on NUMA
       • NUMA with coherent caches is called ccNUMA
     • NUMA systems have properties common to both
       • Bus-based UMA systems, which use remote access to shared memory
       • Distributed-memory multicomputers (NORMA), which use remote invocation via messages
     • On NUMA, kernel designers have a choice between remote access and remote invocation for non-local computation
     • The trade-offs between remote access and remote invocation remain largely unknown (circa 1993)
       • ... and they continually change even today

  10. NUMA OS Design Questions
     • When should remote memory access be used vs. remote invocation?
     • If remote invocation is used, should it be restricted to interrupt context only?
     • If remote memory access is used, how is it affected by memory consistency models (not in this paper)?
       • Strong consistency models will incur contention
       • Weak consistency models widen the cost gap between normal instructions and synchronization instructions, and require memory barriers

  11. Three Communication Choices on NUMA
     • Remote memory access
       • Access remote memory where it is
       • These memory accesses are slower than local ones!
     • Bulk data transfer
       • Bring remote memory to you, then access it locally
       • High initial cost, then quick memory accesses
     • Remote invocation
       • Ask the remote processor to do the work for you (via an inter-processor interrupt, IPI)
       • High initial cost, then quick memory accesses again
       • Should the work be done in the top or bottom half of the kernel?

  12. Bulk Data Transfer – a closer look
     • Message passing
       • Moves data ahead of time in messages
     • Distributed shared memory
       • Demand-faults data at page granularity
       • Pre-fetches and caches data for future access at page granularity
     • Cache hardware (coherence mechanism)
       • Demand-faults data at cache-line granularity
       • Pre-fetches and caches data for future access at cache-line granularity

  13. Remote Invocation – a closer look
     • Bottom half (the paper does not use Linux terminology)
       • Do the work in interrupt context
       • Fast, but limited to work that can be done safely in an interrupt handler
       • Can't block, so can't lock
       • Disabling interrupts won't suffice for some data in an SMP kernel
     • Top half
       • Do the work in process context
       • Expensive, because it requires a context switch and a path through the scheduler
       • May require more synchronization with other processes

  14. How to Deliver Interrupt-Level RIs?
     • Problem: what if interrupts are disabled?
       • If interrupts are disabled, the node can't receive an RI
       • The RI is stored and marked pending
       • Maintain a count of active critical sections in the kernel
       • When the count reaches zero, check for pending RIs and generate an interrupt to process them (pull them in); see the sketch below
     • Problem: what if the handler needs to block?
       • It's not running in a process context, so what blocks?
       • If it needs to make an RI, it must spin (and restore invariants, and keep incoming RIs enabled)
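
A minimal C sketch of the pending-RI scheme above, assuming hypothetical platform primitives (disable_interrupts, handle_pending_ris, and so on); the paper gives no code, so all names here are illustrative:

    /* Deferred delivery of interrupt-level RIs: the kernel counts active
     * critical sections; RIs that arrive while interrupts are masked are
     * only marked pending, and are pulled in when the last critical
     * section exits.  All names are hypothetical. */

    extern void disable_interrupts(void);   /* assumed platform hooks */
    extern void enable_interrupts(void);
    extern void handle_pending_ris(void);

    static int cs_count;      /* active critical sections on this node */
    static int ri_pending;    /* an RI arrived while we were masked    */

    void enter_critical(void)
    {
        if (cs_count++ == 0)
            disable_interrupts();   /* incoming RIs now only queue up */
    }

    void exit_critical(void)
    {
        if (--cs_count == 0) {
            if (ri_pending) {
                ri_pending = 0;
                handle_pending_ris();   /* pull in the queued RIs */
            }
            enable_interrupts();
        }
    }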

  15. How to Deliver Interrupt-Level RIs?
     • Deadlock prevention: prohibit outgoing RIs while incoming RIs are blocked!
     • Why?

  16. Process-Level RIs
     • If you need:
       • Locks on other processors (i.e., interrupt disabling won't do)
       • Condition synchronization
       • To do a lot of work (and hence want to block) ...
     • ... use process-level RIs!

  17. Process-Level RIs – how they work
     • If the receiving CPU is executing user code, the RI runs immediately
     • If it's executing kernel code, the RI is queued
       • Executed the next time control returns to user level
       • Simplifies synchronization via non-preemption (see the sketch below)
     • To prevent deadlock, the invoking process blocks (it doesn't busy-wait)
       • Busy-waiting in the kernel would lock out incoming process-level RIs
       • Blocking is better anyway: process-level RIs may take a long time to execute
     • Rule: don't perform a process-level RI while holding a spinlock
       • Because you shouldn't block while holding a spinlock!
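
A sketch of that dispatch logic in the same hypothetical C style: an arriving process-level RI runs at once if the target CPU is in user mode, and is otherwise queued until the kernel next returns to user level, so queued RIs never preempt kernel code. Again, all names are illustrative:

    /* Process-level RI dispatch.  Non-preemption within the kernel is
     * what makes the queue safe to touch without further locking on
     * this CPU. */

    struct pri {                     /* one process-level RI */
        void (*op)(void *);
        void *arg;
        struct pri *next;
    };

    static struct pri *pri_queue;    /* RIs that arrived in kernel mode */

    void pri_arrive(struct pri *r, int cpu_in_user_mode)
    {
        if (cpu_in_user_mode) {
            r->op(r->arg);           /* run immediately, process context */
        } else {
            r->next = pri_queue;     /* defer until kernel exit */
            pri_queue = r;
        }
    }

    void on_return_to_user(void)     /* hook on every kernel exit path */
    {
        while (pri_queue) {
            struct pri *r = pri_queue;
            pri_queue = r->next;
            r->op(r->arg);
        }
    }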

  18. Using the Techniques Together
     • A data-centric view of the problem
       • Which levels of code touch this data (interrupt, process)?
       • Where can that code be executed (local, remote)?
     • Choices
       • Local interrupt level
       • Local process level
       • Remote interrupt level
       • Remote process level
     • How do these different invocations synchronize with each other (and with themselves)?

  19. Using RA with Process-Level RI
     • The easiest approach is to use them both in the same system, but for different data!
     • If they are used together on the same data structure:
       • Spinlocks can be used, but don't make a process-level RI while holding one, because you will block and hurt the performance of the spinlock
       • Semaphore scheduling primitives must be implemented with atomic instructions or with interrupt-level RIs
       • You can't rely on interrupt disabling

  20. Using RA with Interrupt-Level RI
     • Requires a hybrid lock to protect data accessed by both RA and interrupt-level RI
       • Uses both interrupt disabling and spinning to acquire the lock
       • Prioritizes interrupt-level access so it can't spin indefinitely

  21. Hybrid Lock
     • The lock.urgently_needed primitive gives priority to interrupt-level access! A sketch follows.
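
A sketch of how such a hybrid lock might look using C11 atomics. The urgently_needed idea is from the paper, but this code is an illustration, not the paper's implementation:

    #include <stdatomic.h>

    /* Hybrid lock: ordinary (remote-access) acquirers spin, but back off
     * while urgently_needed is set, so an interrupt-level RI handler is
     * never starved.  On the node that owns the data, a normal acquirer
     * would also disable interrupts first, so the local handler cannot
     * run while the lock is held. */
    struct hybrid_lock {
        atomic_flag held;             /* init with ATOMIC_FLAG_INIT */
        atomic_int  urgently_needed;  /* waiting urgent acquirers   */
    };

    void acquire_normal(struct hybrid_lock *l)
    {
        for (;;) {
            while (atomic_load(&l->urgently_needed) != 0)
                ;                     /* yield to interrupt-level access */
            if (!atomic_flag_test_and_set(&l->held))
                return;               /* acquired */
        }
    }

    void acquire_urgent(struct hybrid_lock *l)   /* interrupt level */
    {
        atomic_fetch_add(&l->urgently_needed, 1);
        while (atomic_flag_test_and_set(&l->held))
            ;                         /* normal acquirers are backing off */
        atomic_fetch_sub(&l->urgently_needed, 1);
    }

    void release(struct hybrid_lock *l)
    {
        atomic_flag_clear(&l->held);
    }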

  22. Using Interrupt- and Process-Level RI
     • They can be used together on the same data structure
       • Provided it is always possible to receive incoming invocations while waiting for outgoing ones
     • Deadlock prevention
       • Example: you can't make a process-level RI while interrupt-level RIs are blocked in order to access data shared by normal and interrupt-level code

  23. Comparing the Relative Costs
     • This paper compares along four dimensions:
       • Latency of the operation in isolation
       • Impact on normal operations due to the restrictions necessary for remote ones to work
       • Contention and throughput
       • Degree of clash/complement with the conceptual organization of the kernel

  24. Latency of Operation – Motivation
     • If (R-1)*N < C, then it will be cheaper to perform an operation via RA, where
       • R = ratio of remote to local memory access time
       • N = number of memory accesses in the operation
       • C = overhead of the remote operation
     • All other things being equal (are they?), the fixed overhead of RI implies that operations that take a large amount of time should benefit from RI
     • It may be simple to estimate whether a given operation is past the threshold for a given architecture; a worked example follows
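
For instance, plugging in the Butterfly Plus numbers reported on slide 34 (R ≈ 12, t_local ≈ 0.518µs per read, and C ≈ 56µs for an interrupt-level RI), RI pays off once the extra cost of remote access exceeds the RI overhead:

    (R - 1) * N * t_local > C
    N > 56µs / (11 * 0.518µs) ≈ 10

So roughly 10-11 remote references justify an interrupt-level RI, consistent with the "11 remote refs" figure on slide 41.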

  25. Indirect Costs of RI – Motivation
     • If you can avoid explicit synchronization when accessing data (for example, by making all access to it go via interrupt-level RI), local accesses speed up (a lot)!
       • You are essentially using non-preemption instead of locks
       • But you still need to lock any data you want to hold on to across outgoing RIs!
       • You could use NBS ... but we'll discuss that later!
     • The relative costs of locking and of disabling interrupt-level RIs depend on the architecture: on the Butterfly, disabling interrupt-level RIs was often a better approach than locking

  26. Indirect Costs of RA – Motivation
     • RA requires that every node's memory be mapped into the same address space
       • May limit scalability on large machines with a 32-bit address space
     • Should we map on demand or use multiple address spaces?
       • Mapping on demand may be slower than RI
       • Multiple address spaces may have high context-switch overheads

  27. Throughput – Motivation
     • Limits on parallelism: where do operations serialize?
       • RA: at the memory; coarser-grain serialization with related work via locks
       • RI: at the target processor; non-preemption serializes unrelated work too
     • RA overlaps computation on both nodes (more parallelism); RI blocks the caller while the callee is active (synchronous RI)
       • Implies long-running operations should use RA (doesn't this conflict with the previous conclusion?)
     • Interrupt-level RI has low latency but poor throughput, due to no parallelism (process-level RI has better throughput)
     • RA completion time should also be more predictable, due to finer-grain synchronization

  28. Fit with the Kernel Architectural Model
     • Kernel operations will be structured differently depending on which approach is used
       • For example, context may need to be packaged with an RI
     • How much address locality is required for caches to speed up RA?
     • How difficult will these structural changes be, given existing kernel models?

  29. Message- vs. Procedure-Based Models
     • Procedure-based kernels
       • A single kernel accessible from all processes in a uniform way
       • Processes invoke kernel routines to access kernel data
       • Concurrent access controlled via synchronization primitives (locks)
       • Conceptually similar to the UMA hardware model
     • Message-based kernels
       • A compartmentalized kernel with special processes to manage distinct kernel resources
       • Queued messages synchronize concurrent requests
       • Conceptually similar to the NORMA hardware model

  30. Message- vs. Procedure-Based Models
     • [Figure] Shaded boxes are processes; unshaded ones are data abstractions

  31. Which Model Fits NUMA with RA or RI?
     • RA seems a better fit for procedure-based kernels
       • A single kernel address space accessible from anywhere
       • Procedure-based kernels with RA require synchronization
       • Procedure-based kernels on uniprocessors can use non-preemption to manage concurrency
     • RI seems a better fit for message-based kernels
       • An RI is like a message: send it to the right place and it will be processed when its turn comes (non-preemptively scheduled)
       • Complex restructuring, but a simple synchronization model
       • Must reason about communication protocols

  32. This Paper's Experiments
     • Compare remote memory access with two kinds of remote invocation (I-RI and P-RI)
     • Measure the direct and indirect costs of each approach
     • Find the domain of applicability for each approach

  33. Experiment Infrastructure
     • On Psyche, a system with good node locality
       • Remote access is rarely needed
       • Psyche is structured so that when multiple remote accesses are needed, they can be grouped into a single RI
     • On the BBN Butterfly Plus (NUMA)
       • One CPU per node
       • No caches

  34. BBN Butterfly Plus Performance Parameters
     • 12:1      remote-to-local memory access time ratio
     • 6.88µs    read a 32-bit word from a remote memory location
     • 0.518µs   read a 32-bit word from a local memory location
     • 4.27µs    write a 32-bit word to a remote memory location
     • 0.398µs   write a 32-bit word to a local memory location
     • 56µs      average latency of an interrupt-level RI
     • 421µs     average latency of a process-level RI

  35. Details of RI on Psyche
     • RI uses RA (sketched below)
       • Write a local buffer (with call-complete and operation-received flags)
       • Write a pointer to the local buffer into the remote node
       • Cause an interrupt on the other processor (IPI)
       • Either spin on the call-complete flag (I-RI) or block (P-RI)
       • The called processor writes the call-complete flag and/or unblocks the caller on completion
     • There is a race among multiple RIs
       • Only one slot for the buffer pointer on the receiving node
       • A caller may time out on the operation-received flag
       • It then tries again (open to starvation, by design choice/trade-off)
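
A hypothetical C sketch of that protocol: one well-known pointer slot per node, flags polled via remote access, and an IPI to kick the target. The names, the NNODES constant, and the timeout helper are illustrative, not Psyche's actual interfaces:

    #define NNODES 16                /* illustrative node count */

    extern void send_ipi(int node);  /* assumed platform hook */
    extern int  timed_out(void);     /* assumed timeout check */

    struct ri_request {
        void (*op)(void *);
        void *arg;
        volatile int op_received;    /* set by callee on pickup     */
        volatile int call_complete;  /* set by callee on completion */
    };

    /* One well-known slot per node: the source of the race above. */
    struct ri_request *volatile ri_slot[NNODES];

    int remote_invoke(int target, struct ri_request *req)
    {
        req->op_received = req->call_complete = 0;

        ri_slot[target] = req;       /* remote write of buffer pointer */
        send_ipi(target);            /* interrupt the target CPU       */

        while (!req->op_received)    /* lost the race to the slot?     */
            if (timed_out())
                return -1;           /* caller retries; may starve     */

        while (!req->call_complete)
            ;                        /* I-RI spins; P-RI would block   */
        return 0;
    }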

  36. Details of RA on Psyche
     • Psyche uses spin locks for synchronization
       • A test-and-test-and-set version (sketched below)
       • Fine-grain locking: good concurrency, but may need lots of lock acquisitions
       • Depends on cheap locks (they are not cheap today!)
       • Lock acquisition costs 5-10µs in Psyche
     • NBS is an alternative approach, but would have similar costs, since it uses the same underlying primitives
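
A standard test-and-test-and-set lock in C11 atomics, as a sketch of the primitive Psyche relies on. The read-only inner loop avoids issuing the expensive atomic operation (and, on machines with caches, the coherence traffic) until the lock looks free:

    #include <stdatomic.h>

    typedef atomic_int tatas_lock;   /* 0 = free, 1 = held */

    void tatas_acquire(tatas_lock *l)
    {
        for (;;) {
            while (atomic_load_explicit(l, memory_order_relaxed) != 0)
                ;                            /* test: plain reads only */
            if (atomic_exchange(l, 1) == 0)  /* then test-and-set      */
                return;
        }
    }

    void tatas_release(tatas_lock *l)
    {
        atomic_store(l, 0);
    }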

  37. Experiments
     • Costs of RA, I-RI, and P-RI, each with and without explicit synchronization
     • Results summary:
       • The cost of locking is high
       • The cost of a remote reference is high
       • Avoiding synchronization costs can offset remote invocation costs
         • Favors P-RI, which can use non-preemption
       • I-RI cost can be justified by 11 or more remote memory accesses
       • P-RI cost can be justified by 71 − 7k remote accesses, where k is the number of lock acquisitions that can be avoided by exclusive use of P-RI

  38. Results – Latency of Kernel Operations
     • [Table of measured latencies omitted from the transcript; see the notes on the next slide]

  39. Notes for the Previous Table
     • Times in the columns represent the cost of operations:
       • Columns 1 & 3: measurements for the original Psyche kernel
       • Column 2: synchronization without context switching
       • Column 4: subsumed in another operation with coarse-grain locking
       • Column 5: cost of remote invocations in a hybrid kernel
       • Column 6: cost if the data were always accessed via remote invocation (no explicit synchronization required)

  40. Results: % of Latency from Locking
     • Cost of synchronization
     • Measured as locking OFF vs. ON

  41. Results: Marginal Remote Access Cost
     • With and without locking (L/NL)
       • NL = (RA(off) - local(off)) / RA(off)
       • L = (RA(off) - local(off)) / RA(on)
     • Example
       • Interrupt-level RI can be justified to avoid 11 remote references
       • ~6µs marginal memory access cost and ~60µs overhead for an interrupt-level RI

  42. Results: RI Time as % of RA Time
     • RA: RA with locking vs. RI without (column 6 as % of column 3)
     • B: hybrid kernel that uses RI together with RA and locking (column 5 as % of column 3)
     • N: no locking, assuming you get locking for free from enclosing locks (column 6 as % of column 4)

  43. The Implications are Obvious???
     • A win for remote invocation, but not so clear-cut
       • The experiments represent extreme cases
         • Simple enough to perform via interrupt-level RI
         • Complex enough to absorb the overhead of process-level RI
       • Medium-sized operations might not tolerate the limitations of interrupt-level RI or the overhead of process-level RI
     • Throughput for tiny operations
       • Remote memory accesses steal busy cycles from the processor whose memory is being used
       • Remote invocations steal the entire processor for the entire operation
       • The experiments show that interrupt-level RI hurts the throughput of the remote processor, while remote access does not

  44. Conclusions
     • Good kernel design must consider
       • The cost of the remote invocation mechanism
       • The cost of atomic operations and synchronization
       • The ratio of remote to local access time
     • Breakeven points depend on the number of remote references in any given operation
     • Interrupt-level RI has a small number of uses (e.g., TLB shootdown)
     • The results could apply to ccNUMA machines as well
       • What about ccUMA with many levels of cache?
     • The risks of putting data structures at the interrupt level of the kernel may be a reason to prefer remote access for small operations
     • Node locality is really important!

  45. SPARE SLIDES

  46. Trade-Offs (1)
     • Indirect costs for local operations
       • RI may need the context of the invoker as a parameter
       • Operations arranged to increase node locality??
       • Access between invoker/target processor not interleaved??
     • Process-level RI for all remote accesses may allow data structures to be implemented without explicit synchronization
       • Uses the lack of preemption within the kernel to provide implicit synchronization
     • Avoiding explicit synchronization
       • Can improve the speed of both remote and local operations, because explicit synchronization is expensive
     • Remote memory access may require that large portions of the kernel data space on other processors be mapped into each instance of the kernel
       • May not scale well to large machines
       • Mapping on demand is not a good option

  47. Trade-Offs (2)
     • Competition for processor and memory cycles: where do operations serialize?
       • Remote invocation: on the processor that executes them; they serialize whether the data is shared or not
       • Remote memory access: at the memory; more chance of parallel computation, because operations do more than just access data
       • If competing for locked data, operations may still serialize
     • Tolerance for contention
       • Remote memory access has slightly higher throughput because of the possibility of parallel operation
       • However, this can introduce greater variance in completion time
