LECTURE 5 A Brief History of TM

Precursors of Computing: ENIAC • 5000 ops/second • $486K in 1946 • 19K vacuum tubes • 200K watts • 67 cubic meters

Latest trends: Intel Nehalem • 1.9 billion transistors • 12 billion ops per second • 4 microprocessors • 8 MB of on-chip memory • 100 W • 246 square millimeters

The Way: Not just Chip Frequency! • 1970s: Programmable controllers, single chip microprocessors • 1980s: Instruction pipelines, cache hierarchies • 1990s: Speculative execution, Superscalar processors • 2000s: Multicore chips, embedded computing

Pipelining • Split the processing of an instruction into a series of independent steps • Classic pipeline • Instruction Fetch (IF) • Instruction Decode (ID) • Execute (EX) • Memory Access (MEM) • Register Write Back (WB)

Pipelining • Different parts of the CPU are used for different stages of the pipeline

Pipelining • Throughput: Set by the speed of the slowest stage rather than by the latency of the whole instruction • More expensive design • Performance of a pipelined processor depends on the executing program, and is harder to predict than that of a non-pipelined processor
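
The throughput claim above can be checked with simple cycle counting. The sketch below is a minimal illustration, assuming an ideal pipeline in which every stage takes exactly one cycle and there are no stalls or hazards. The class and method names are hypothetical.

```java
/** Idealized cycle counts: one cycle per stage, no stalls or hazards (an assumption). */
final class PipelineTiming {
    /** Without pipelining, each instruction passes through all stages before the next starts. */
    static long unpipelinedCycles(long instructions, int stages) {
        return instructions * stages;
    }

    /** With pipelining, the first instruction takes `stages` cycles, then one completes per cycle. */
    static long pipelinedCycles(long instructions, int stages) {
        if (instructions == 0) {
            return 0;
        }
        return stages + (instructions - 1);
    }

    public static void main(String[] args) {
        int stages = 5; // IF, ID, EX, MEM, WB
        long n = 100;
        System.out.println("unpipelined: " + unpipelinedCycles(n, stages) + " cycles");
        System.out.println("pipelined:   " + pipelinedCycles(n, stages) + " cycles");
    }
}
```

Under these assumptions, 100 instructions on the classic 5-stage pipeline need 104 cycles instead of 500. Real programs fall short of this because of dependencies and branches.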

Superscalar • Executes multiple instructions per clock cycle by simultaneously dispatching to redundant functional units • Think of it as multiple parallel pipelines, each processing instructions from a single stream • Limitation: Degree of intrinsic parallelism in the stream

Out of Order Execution (OOE) • Multiple instructions fetched • Instructions dispatched to an instruction queue (also called instruction buffer or reservation stations) • Instruction waits in the queue until the input operands are available • Note that the instruction may leave the queue before earlier instructions • Results are queued

Speculation in ILP • Pipelining, OOE, and superscalar execution all involve some form of “speculation”, e.g. branch prediction • There has always been some speculation “circuitry” in processors

Forms of parallelism • Functional: Perform tasks that are functionally different in parallel, e.g. building a house – plumber, carpenter, electrician • Pipeline: Perform tasks that are different in a particular order, e.g. lunch buffet • Data: Perform the same task on different data, e.g. grading exams, MapReduce

Limitations of ILP • Finite amount of ILP in any sequence of instructions • Another possibility: Thread Level Parallelism (Functional parallelism) • How to get multiple threads? • Write parallel programs • Thread level speculation • Code parallelization

Thread Level Speculation • Takes a sequence of instructions • Arbitrarily breaks it into a sequenced group of threads that may run in parallel • Allows for oblivious parallelization of sequential programs • Parallelization by speculation dynamically finds parallelism at runtime, and thus is not conservative

Code parallelization • Implemented in compilers, e.g. SUIF • Problems: Hard to identify dependencies between pieces of code and data at compile time

CMP (Chip Multiprocessors) • Forward data between parallel threads • Detect when reads occur too early • Safely discard speculative state after violations • Retire speculative writes in correct order • Examples: Stanford HYDRA, Wisconsin Multiscalar, CMU Stampede (1995-2000)

Cache Coherence • Consistency of data stored in local caches of a shared resource (Wiki definition) • Protocols • MESI • MOESI • MOSI • MSI

[Figure: processors P1 to P4, each with its own cache, connected through an interconnection network to main memory]

2-state Invalidation Cache Protocol • Write through, no allocation • Valid indicates cache presence • State VALID: PrRd / -- (stay VALID); PrWr / BusWr (stay VALID); observed BusWr (go to INVALID) • State INVALID: PrRd / BusRd (go to VALID); PrWr / BusWr (stay INVALID) • Legend: X/Y means action X / reaction Y • PrRd: Processor Read • PrWr: Processor Write • BusRd: Fetch a cache block • BusWr: Write through one word • --: No action

2-State Protocol • Simple hardware and protocol • Requires high bandwidth (every write goes on bus!)
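
The two-state protocol is small enough to write out as a state machine. The following is a minimal sketch of a single cache line, assuming the write-through, no-allocation policy named on the previous slide. The class, enum, and method names are hypothetical.

```java
/** One cache line under the 2-state write-through invalidation protocol (illustrative sketch). */
final class TwoStateCacheLine {
    enum State { VALID, INVALID }
    enum BusAction { NONE, BUS_RD, BUS_WR }

    State state = State.INVALID;

    /** Processor read: a hit needs no bus traffic, a miss fetches the block. */
    BusAction processorRead() {
        if (state == State.VALID) {
            return BusAction.NONE;        // PrRd / --
        }
        state = State.VALID;              // PrRd / BusRd
        return BusAction.BUS_RD;
    }

    /** Processor write: always written through; no allocation, so the state is unchanged. */
    BusAction processorWrite() {
        return BusAction.BUS_WR;          // PrWr / BusWr
    }

    /** Another cache wrote this block: our copy becomes stale. */
    void snoopBusWrite() {
        state = State.INVALID;
    }
}
```

Every call to `processorWrite` returns `BUS_WR`, which is the bandwidth cost the slide points out.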

3-state Protocol (MSI) • Modified • Shared • Invalid

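MSI State Diagram • State M: PrRd / -- (stay M); PrWr / -- (stay M); BusRd / BusWB (go to S); BusRdX / BusWB (go to I) • State S: PrRd / -- (stay S); BusRd / -- (stay S); PrWr / BusRdX (go to M); BusRdX / -- (go to I) • State I: PrRd / BusRd (go to S); PrWr / BusRdX (go to M)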
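
The MSI transitions can be sketched as a state machine for one cache line. This is an illustrative model only: it tracks a single line in a single cache, and the class, enum, and method names are hypothetical.

```java
/** One cache line under the MSI protocol (illustrative sketch, single cache's view). */
final class MsiCacheLine {
    enum State { MODIFIED, SHARED, INVALID }
    enum Bus { NONE, BUS_RD, BUS_RDX, BUS_WB }

    State state = State.INVALID;

    /** Processor read: only a miss in INVALID generates bus traffic. */
    Bus processorRead() {
        if (state == State.INVALID) {
            state = State.SHARED;         // PrRd / BusRd
            return Bus.BUS_RD;
        }
        return Bus.NONE;                  // PrRd / -- in M or S
    }

    /** Processor write: needs exclusive ownership unless already MODIFIED. */
    Bus processorWrite() {
        if (state == State.MODIFIED) {
            return Bus.NONE;              // PrWr / --
        }
        state = State.MODIFIED;           // PrWr / BusRdX from S or I
        return Bus.BUS_RDX;
    }

    /** Another cache issued BusRd: a MODIFIED line writes back and downgrades to SHARED. */
    Bus snoopBusRd() {
        if (state == State.MODIFIED) {
            state = State.SHARED;         // BusRd / BusWB
            return Bus.BUS_WB;
        }
        return Bus.NONE;                  // BusRd / -- in S (nothing to do in I)
    }

    /** Another cache issued BusRdX: our copy is invalidated, written back first if dirty. */
    Bus snoopBusRdX() {
        Bus reaction = (state == State.MODIFIED) ? Bus.BUS_WB : Bus.NONE;
        state = State.INVALID;
        return reaction;
    }
}
```

Unlike the two-state protocol, repeated writes to a line that is already MODIFIED generate no bus traffic.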

Further Improvements • MESI: Illinois protocol • MOESI

FIRST TRANSACTIONAL MEMORIES

Precursors: Knight (1986) • Idea of TLS • Two caches per processor • The first proposal to use caches and cache coherence to maintain and enforce ordering among speculatively parallelized regions of sequential code in the presence of unknown memory dependencies

The term “Transactional Memory” • Introduced by Herlihy and Moss in 1991 • Idea: Adapt the cache coherence protocol so that transactional accesses are monitored

ISCA 93 • Six new instructions • Load-transactional • Load-transactional-exclusive • Store-transactional • Commit • Abort • Validate • New processor flags • TACTIVE: Is a transaction currently active? • TSTATUS: Is the active transaction in progress, or aborted?

Transactional Cache • States: MESI • Additional transactional tags: EMPTY, NORMAL, XCOMMIT, XABORT • Transactional operations create two entries: one with XCOMMIT and one with XABORT • Modifications made to XABORT on Store

Three extra bus cycles • T_READ: On a transactional load • T_RFO: On a transactional load exclusive, or a store • BUSY: Full cache or other reasons (prevent deadlocks or mutual aborts)

Load_transactional • LT: • Search TxCache for an XABORT entry. If one exists, return its value • If there is no XABORT entry, search for a NORMAL entry. Change it to XABORT. Allocate a second entry with tag XCOMMIT and the same data • Else, issue a T_READ cycle. Behaves as Goodman’s read. Two entries are created: one tagged XABORT and one tagged XCOMMIT.

Load_transactional_exclusive • Similar to LT • Instead of T_READ, T_RFO used on a miss

Store • Similar to LTX • Changes the XABORT entry’s data too

Validate • Returns the TSTATUS flag • If the TSTATUS flag is FALSE, also sets TSTATUS to TRUE and TACTIVE to FALSE

Abort • Discards the XABORT entries (sets their tags as EMPTY) • Sets the tags of XCOMMIT entries as NORMAL • Sets the TSTATUS to TRUE • Sets the TACTIVE to FALSE

Commit • Discards the XCOMMIT entries (sets their tags to EMPTY) • Sets the tags of XABORT entries to NORMAL • Sets TSTATUS to TRUE • Sets TACTIVE to FALSE
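
The tag handling described on the preceding slides can be sketched in software. This is only an illustration of the XCOMMIT/XABORT bookkeeping, not the hardware design. It assumes a single processor, models the T_READ and T_RFO bus cycles as a lookup in a `memory` map, and omits the TSTATUS and TACTIVE flags and the coherence states. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Software model of the transactional cache tags (illustrative sketch only). */
final class TxCacheSketch {
    enum Tag { EMPTY, NORMAL, XCOMMIT, XABORT }

    static final class Entry {
        Tag tag;
        final int addr;
        int value;
        Entry(Tag tag, int addr, int value) { this.tag = tag; this.addr = addr; this.value = value; }
    }

    final Map<Integer, Integer> memory = new HashMap<>(); // stand-in for the bus and main memory
    final List<Entry> entries = new ArrayList<>();

    private Entry find(int addr, Tag tag) {
        for (Entry e : entries) {
            if (e.tag == tag && e.addr == addr) return e;
        }
        return null;
    }

    /** Returns the XABORT entry for addr, creating the XABORT/XCOMMIT pair if needed. */
    private Entry tentative(int addr) {
        Entry x = find(addr, Tag.XABORT);
        if (x != null) return x;
        Entry n = find(addr, Tag.NORMAL);
        if (n != null) {
            n.tag = Tag.XABORT;                                  // NORMAL becomes the tentative copy
            entries.add(new Entry(Tag.XCOMMIT, addr, n.value));  // second entry keeps the old data
            return n;
        }
        int v = memory.getOrDefault(addr, 0);                    // miss: models T_READ / T_RFO
        Entry created = new Entry(Tag.XABORT, addr, v);
        entries.add(created);
        entries.add(new Entry(Tag.XCOMMIT, addr, v));
        return created;
    }

    int loadTransactional(int addr) { return tentative(addr).value; }

    void storeTransactional(int addr, int v) { tentative(addr).value = v; } // updates the XABORT entry

    /** Commit: discard XCOMMIT entries, XABORT entries become NORMAL. */
    void commit() { retag(Tag.XCOMMIT, Tag.XABORT); }

    /** Abort: discard XABORT entries, XCOMMIT entries become NORMAL. */
    void abort() { retag(Tag.XABORT, Tag.XCOMMIT); }

    private void retag(Tag discard, Tag keep) {
        for (Entry e : entries) {
            if (e.tag == discard) e.tag = Tag.EMPTY;
            else if (e.tag == keep) e.tag = Tag.NORMAL;
        }
    }

    /** Value of the NORMAL entry for addr, or null if there is none. */
    Integer normalValue(int addr) {
        Entry e = find(addr, Tag.NORMAL);
        return e == null ? null : e.value;
    }
}
```

The naming can look backwards at first: the XABORT entry holds the tentative data that is thrown away on abort, while the XCOMMIT entry holds the old data that is thrown away on commit.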

Digression • Why transactional memories instead of locks? • Locks create several problems and rely on programmers to use them correctly • Priority inversion: A lower-priority process that holds a lock is preempted while a higher-priority process needs that lock • Convoying: A process holding a lock is descheduled, and the other processes waiting for that lock cannot progress • Deadlock: Two or more processes attempt to lock the same set of objects in different orders

Digression • Transactional memory was invented as a faster means of performing lock-free synchronization • That is why the earliest TM implementations have no misspeculation: their aborts are due to capacity constraints (HTM) or lock contention

Speculative Lock Elision (SLE) • Another reason to use TM! • Speculatively execute critical sections guarded by locks • Use cache coherence and rollback for recovery from misspeculation

Hardware TMs in general • Great idea, efficient implementations • Limitations • High cost of implementation • Small transactional buffer sizes • Context switches • Solutions: Unbounded HTM

SOFTWARE TM

Advantages • More flexible than hardware; allows experimenting with a variety of algorithms • Fewer limitations imposed by fixed-size hardware, like caches

Access Granularity • Detects conflicting accesses on objects / words / regions • Object: Easy implementation, but many false conflicts • Word: Fewer false conflicts • Region: Less overhead than words

Update • How the global memory is updated: Direct / deferred • Direct: The transaction directly modifies the object itself, logs the original value in order to restore in case of abort • Deferred: The transaction makes local modifications, and changes global memory only on commit
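
The two update policies differ mainly in what is logged and when global memory changes. The sketch below contrasts them, assuming a single thread and no conflict detection. The class names, and the use of an `int[]` as global memory, are hypothetical choices for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Direct vs. deferred update on an int[] "global memory" (single-threaded sketch). */
final class UpdatePolicies {
    /** Direct update: write in place and keep an undo log of the original values. */
    static final class DirectTx {
        private final int[] memory;
        private final Map<Integer, Integer> undoLog = new HashMap<>();

        DirectTx(int[] memory) { this.memory = memory; }

        void write(int loc, int v) {
            undoLog.putIfAbsent(loc, memory[loc]); // log the original value only once
            memory[loc] = v;
        }

        int read(int loc) { return memory[loc]; }

        void commit() { undoLog.clear(); }         // memory is already up to date

        void abort() {
            undoLog.forEach((loc, old) -> memory[loc] = old); // restore original values
            undoLog.clear();
        }
    }

    /** Deferred update: buffer writes locally and publish them only on commit. */
    static final class DeferredTx {
        private final int[] memory;
        private final Map<Integer, Integer> writeBuffer = new HashMap<>();

        DeferredTx(int[] memory) { this.memory = memory; }

        void write(int loc, int v) { writeBuffer.put(loc, v); }

        int read(int loc) { return writeBuffer.getOrDefault(loc, memory[loc]); } // see own writes

        void commit() {
            writeBuffer.forEach((loc, v) -> memory[loc] = v);
            writeBuffer.clear();
        }

        void abort() { writeBuffer.clear(); }      // global memory was never touched
    }
}
```

Direct update makes commit cheap and abort expensive, while deferred update does the opposite.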

Conflict Detection • When are the conflicts detected: Eager / lazy / mixed • What is a conflict: Multiple accesses to the same location, at least one of which is a write • To commit, a transaction must acquire every location it updates. Eager if acquired at the first update operation, lazy if done at the time of commit. • Mixed: Eagerly detects write/write conflicts, and lazily detects read/write conflicts
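
The definition of a conflict can be written as a check on read and write sets. This is a minimal sketch, assuming each transaction's accesses are summarized as sets of integer locations. The class and method names are hypothetical.

```java
import java.util.Set;

/** Conflict test between two transactions, given their read and write sets (sketch). */
final class ConflictCheck {
    /** True if both transactions access a common location and at least one access is a write. */
    static boolean conflicts(Set<Integer> readsA, Set<Integer> writesA,
                             Set<Integer> readsB, Set<Integer> writesB) {
        return intersects(writesA, writesB)   // write/write
            || intersects(writesA, readsB)    // write/read
            || intersects(readsA, writesB);   // read/write
    }

    private static boolean intersects(Set<Integer> x, Set<Integer> y) {
        for (Integer loc : x) {
            if (y.contains(loc)) return true;
        }
        return false;
    }
}
```

An eager scheme would run such a check at each access, while a lazy scheme would run it once at commit time. Two transactions that only read the same location never conflict.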

STM: 1995 • Memory to be accessed in a transaction is known in advance • Lock-free: Transactions help each other • Motivation: Replace N-word CAS, implement lock-free data structures, etc.

The System Model • We assume that every shared memory location supports these 4 operations: • Write_i(L,v) - thread i writes v to L • Read_i(L,v) - thread i reads v from L • LL_i(L,v) - thread i reads v from L and marks that L was read by i • SC_i(L,v) - thread i writes v to L and returns success if L is marked as read by i. Otherwise it returns failure.
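
The four operations can be modeled in software for experimentation. The sketch below is one possible model, not a hardware description. It assumes that any write, including a successful SC, clears all read marks on the location, which is the usual LL/SC behavior but is not stated on the slide. The class and method names are hypothetical, and threads are identified by plain integers.

```java
import java.util.HashSet;
import java.util.Set;

/** Software model of one shared location supporting Read, Write, LL and SC (sketch). */
final class LlScCell {
    private int value;
    private final Set<Integer> readers = new HashSet<>(); // threads whose LL mark is still valid

    LlScCell(int initial) { value = initial; }

    /** Read_i: plain read, leaves no mark. */
    synchronized int read() { return value; }

    /** Write_i: plain write; assumed to clear every outstanding LL mark. */
    synchronized void write(int v) {
        value = v;
        readers.clear();
    }

    /** LL_i: read the value and mark the location as read by this thread. */
    synchronized int loadLinked(int thread) {
        readers.add(thread);
        return value;
    }

    /** SC_i: write only if this thread's mark is still present; report success or failure. */
    synchronized boolean storeConditional(int thread, int v) {
        if (!readers.contains(thread)) {
            return false;
        }
        value = v;
        readers.clear(); // the write invalidates all marks, including other threads'
        return true;
    }
}
```

If two threads both LL the same location, only the first SC succeeds. The second fails because the first write cleared its mark.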

Thread
class Rec {
    boolean stable = false;
    boolean, int status = (false, 0); // can have two values…
    boolean allWritten = false;
    int version = 0;
    int size = 0;
    int locs[] = {null};
    int oldValues[] = {null};
}
Each thread is defined by an instance of a Rec class (short for record). The Rec instance defines the current transaction the thread is executing (only one transaction at a time).

The STM Object • [Figure: the shared Memory array next to an Ownerships array; the Ownerships entries are pointers to the thread records Rec1, Rec2, ..., Recn, each holding status, version, size, locs[] and oldValues[]]

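Flow of a transaction • [Figure: flowchart of a transaction run by thread i against the STM] • startTransaction calls initialize, then transaction • transaction calls acquireOwnerships • On (Null, 0): agreeOldValues, calcNewValues, updateMemory, releaseOwnerships, then Success • On (Failure, failed loc): releaseOwnerships, then check isInitiator • isInitiator = T: initiate a helping transaction for the failed loc (with isInitiator := F), then Failure • isInitiator = F: no helping transaction is initiated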

The STM Object
public class STM {
    int memory[];
    Rec ownerships[];
    public boolean, int[] startTransaction(Rec rec, int[] dataSet) {...};
    private void initialize(Rec rec, int[] dataSet) {...};
    private void transaction(Rec rec, int version, boolean isInitiator) {...};
    private void acquireOwnerships(Rec rec, int version) {...};
    private void releaseOwnerships(Rec rec, int version) {...};
    private void agreeOldValues(Rec rec, int version) {...};
    private void updateMemory(Rec rec, int version, int[] newvalues) {...};
}

Implementation
rec: The thread that executes this transaction. dataSet: The locations in memory it needs to own.
public boolean, int[] startTransaction(Rec rec, int[] dataSet) {
    initialize(rec, dataSet);
    rec.stable = true; // this notifies other threads that I can be helped
    transaction(rec, rec.version, true);
    rec.stable = false;
    rec.version++;
    if (rec.status) return (true, rec.oldValues);
    else return false;
}