
Using Prediction to Accelerate Coherence Protocols


Presentation Transcript


  1. Using Prediction to Accelerate Coherence Protocols Shubhendu S. Mukherjee and Mark D. Hill, University of Wisconsin-Madison

  2. The topic once again • Using Prediction to Accelerate Coherence Protocols • Discuss the concept of using prediction in a coherence protocol • See how it can be used to accelerate the protocol

  3. Organization • Introduction • Background • Directory Protocol • Two-level Branch Predictor • Cosmos • Basic Structure • Obtaining Predictions • Implementation Issues • Integration with a Coherence Protocol • How and When to act on the predictions • Handling Mis-predictions • Performance • Evaluation • Benchmarks • Results • Summary and Conclusions

  4. Introduction • Large shared-memory multiprocessors suffer long latencies on misses to remotely cached blocks • Proposals to lessen these latencies: • Multithreading • Non-blocking caches • Application-specific coherence protocols • Predict future sharing patterns and overlap execution with coherence work • Drawbacks of these proposals: • More complex programming model • Require sophisticated compilers • Existing predictors are directed at specific sharing patterns known a priori • Need for a general predictor, hence this paper!

  5. Introduction • If a general predictor is not in the army, then what is it? A general predictor would sit beside a standard directory or cache module, monitor coherence activity, and take appropriate actions • See the design of the Cosmos coherence message predictor • Evaluate Cosmos on some scientific applications • All's well that ends well? Summarize and conclude

  6. Background: 6810 strikes back! • Structure of a Directory Protocol • Distributed-memory multiprocessor • Hardware-based cache coherence • Directory and memory distributed among processors • The physical address determines which node's memory holds a block • Nodes connected to each other via a scalable interconnect • Messages routed from sender to receiver • The directory keeps track of sharing states, which are?

  7. Directory Structure [Figure: four nodes, each with a processor and caches, memory, I/O, and a directory slice, connected by an interconnection network]

  8. Example: Coherence Protocol Actions [Figure: Processor 1 writes block A, which is cached at Processor 2; five numbered messages cross the interconnection network, revealed on the next slide]

  9. Example: Coherence Protocol Actions [Figure: Processor 1 writes block A, cached at Processor 2] • 1. P1 sends a Wr request to Dir 1 • 2. Dir 1 sends an Inval request to Dir 2 • 3. Dir 2 invalidates the cached copy at P2 • 4. Dir 2 sends an Inval response to Dir 1 • 5. Dir 1 sends a Wr response to P1

  10. Example: Coherence Protocol Actions [Figure: the same five messages as on the previous slide] • Point to ponder: the miss requires multiple long-latency operations, performed sequentially

  11. Background: 6810 strikes back! • Branch predictor • Need: execute probable instructions without waiting for the branch to resolve, thus improving performance • Two-level predictor • Basically a local predictor • Use the PC of the branch to index into the (local) Branch History Table • Use this BHT entry to index into a per-branch Pattern History Table to obtain a branch prediction

  12. Two-Level Predictor [Figure: 6 bits of the branch PC index a branch history table of 64 entries, each a 14-bit history for a single branch (e.g., 10110111011001); the 14-bit history indexes a pattern history table of 16K entries of 2-bit saturating counters]
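
As a concrete illustration, here is a minimal Python sketch of the two-level scheme on this slide. The table sizes follow the slide (64 per-branch 14-bit histories, 16K two-bit counters); the PC hashing and initial counter values are illustrative assumptions.

    # Minimal sketch of a two-level (local-history) branch predictor.
    # Sizes follow the slide; PC hashing and initialization are assumptions.
    HIST_BITS = 14

    class TwoLevelPredictor:
        def __init__(self):
            self.bht = [0] * 64                  # 64 per-branch 14-bit histories
            self.pht = [2] * (1 << HIST_BITS)    # 16K 2-bit counters, init weakly taken

        def predict(self, pc):
            hist = self.bht[(pc >> 2) & 0x3F]    # 6 PC bits index the history table
            return self.pht[hist] >= 2           # counter of 2 or 3 predicts taken

        def update(self, pc, taken):
            idx = (pc >> 2) & 0x3F
            hist = self.bht[idx]
            ctr = self.pht[hist]
            self.pht[hist] = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
            self.bht[idx] = ((hist << 1) | int(taken)) & ((1 << HIST_BITS) - 1)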

  13. What in the Universe is COSMOS? • Cosmos is a Coherence Message Predictor • It predicts the sender and type of the next incoming message for a particular block • Structure: similar to a two-level branch predictor

  14. Structure of Cosmos [Figure: a Message History Table (MHT) of Message History Registers (MHRs) feeding Pattern History Tables, one per block address]

  15. Structure of Cosmos [Figure: a Message History Register (MHR) in the Message History Table (MHT) holds a sequence of <sender, type> tuples; the number of tuples per MHR constitutes its depth]

  16. Structure of Cosmos • The first-level table is called the Message History Table (MHT) • An MHT consists of a series of Message History Registers (MHRs), one per cache block address • An MHR contains a sequence of <sender, type> tuples; their number is the MHR's depth • The second-level table is called the Pattern History Table (PHT) • There is one PHT for each MHR • A PHT is indexed by the contents of its MHR • Each PHT contains prediction tuples corresponding to MHR entries

  17. An Example: Producer - Consumer

    repeat
      ...
      if (producer)
        private_counter++
        shared_counter = private_counter
        barrier
      else if (consumer)
        barrier
        private_counter = shared_counter
      else
        barrier
      endif
      ...
    until done
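
For concreteness, here is a runnable Python analogue of the pseudocode above, assuming one producer and one consumer; threading.Barrier stands in for the slide's barrier, and the second wait per iteration (an addition, not in the pseudocode) keeps the iterations in lockstep.

    # Runnable analogue of the slide's producer-consumer pseudocode.
    import threading

    ITERS = 5
    barrier = threading.Barrier(2)     # one producer + one consumer
    shared_counter = 0

    def producer():
        global shared_counter
        private_counter = 0
        for _ in range(ITERS):
            private_counter += 1
            shared_counter = private_counter   # publish the new value
            barrier.wait()                     # let the consumer read it
            barrier.wait()                     # wait before overwriting it

    def consumer():
        for _ in range(ITERS):
            barrier.wait()                     # wait for the producer's write
            private_counter = shared_counter
            print("consumer read", private_counter)
            barrier.wait()                     # release the producer

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()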

  18. An Example: Producer - Consumer [Figure: producer node (Processor 1 & caches) and consumer node (Processor 2 & caches), each with memory, I/O, and a directory, connected by an interconnection network]

  19. An Example: Producer - Consumer [Figure: two numbered messages from the directory to the producer cache, revealed on the next slide] Messages seen by the Producer Cache (from the directory)

  20. An Example: Producer - Consumer Messages seen by the Producer Cache (from the directory): • 1. Get Wr response • 2. Invalidate Wr request

  21. An Example: Producer - Consumer [Figure: two numbered messages from the directory to the consumer cache, revealed on the next slide] Messages seen by the Consumer Cache (from the directory)

  22. An Example: Producer - Consumer Messages seen by the Consumer Cache (from the directory): • 1. Get Rd response • 2. Invalidate Rd request

  23. An Example: Producer - Consumer [Figure: four numbered messages at the directory, revealed on the next slide] Messages seen by the Directory

  24. An Example: Producer - Consumer Messages seen by the Directory: • 1. Get Wr request from the producer • 2. Invalidate Rd response from the consumer • 3. Get Rd request from the consumer • 4. Invalidate Wr response from the producer

  25. An Example: Producer - Consumer • Sharing Pattern Signature: predictable message patterns • Producer: send Get Wr request to directory; receive Get Wr response from directory; receive Invalidate Wr request from directory; send Invalidate Wr response to directory • Consumer: send Get Rd request to directory; receive Get Rd response from directory; receive Invalidate Rd request from directory; send Invalidate Rd response to directory

  26. Back to Cosmos • The directory receives a get Rd request from the consumer (P1: producer, P2: consumer) [Figure: the global address of shared_counter indexes the Message History Table; its MHR now holds <P2, get Rd request>, which indexes the Pattern History Table for shared_counter; the prediction there is still unknown (?)]

  27. Back to Cosmos • The directory receives a get Rd request from the consumer (P1: producer, P2: consumer) [Figure: the MHR entry <P2, get Rd request> indexes the Pattern History Table for shared_counter, which now predicts the next message: <P1, Inval Wr response>]

  28. Back to Cosmos • Obtaining Predictions (sketched below) • Index into the MHT with the address of the cache block • Use the MHR entry to index into the corresponding PHT • Return the prediction (if one exists) from the PHT; this prediction is of the form <sender, message-type> • Updating Cosmos • Index into the MHT with the address of the cache block • Use the MHR entry to index into the corresponding PHT • Write the new <sender, message-type> tuple as the prediction for the index corresponding to the MHR entry • Insert the <sender, message-type> tuple into the MHR for the cache block
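
The lookup and update steps above can be sketched in a few lines of Python. Dictionaries and a deque stand in for the hardware MHT, MHR, and PHT; the class name, depth default, and address keys are illustrative assumptions, not the paper's implementation.

    # Software sketch of Cosmos: per-block message history (MHR) indexing
    # a per-block pattern history table (PHT) of <sender, type> predictions.
    from collections import deque

    class Cosmos:
        def __init__(self, depth=1):
            self.depth = depth
            self.mht = {}   # block address -> MHR (deque of <sender, type> tuples)
            self.pht = {}   # block address -> {MHR contents -> predicted tuple}

        def predict(self, addr):
            mhr = self.mht.get(addr)
            if mhr is None or len(mhr) < self.depth:
                return None                          # not enough history yet
            return self.pht[addr].get(tuple(mhr))    # <sender, type> or None

        def update(self, addr, sender, mtype):
            mhr = self.mht.setdefault(addr, deque(maxlen=self.depth))
            pht = self.pht.setdefault(addr, {})
            if len(mhr) == self.depth:
                pht[tuple(mhr)] = (sender, mtype)    # train: history -> next message
            mhr.append((sender, mtype))              # shift the new tuple into the MHR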

  29. How Cosmos adapts to complex signatures • Consider one producer and two consumers, P1 and P2. Two get Rd requests arrive out of order. The PHT will then be as shown below:

    Index                 -> Prediction
    <P1, get Rd request>  -> <P2, get Rd request>
    <P2, get Rd request>  -> <P1, get Rd request>

  30. How Cosmos adapts to complex signatures • With an MHR of depth greater than 1:

    Index (two most recent tuples)             -> Prediction
    <P1, get Rd request> <P3, get Rd request>  -> <P2, get Rd request>
    <P2, get Rd request> <P1, get Rd request>  -> <P3, get Rd request>
    <P3, get Rd request> <P2, get Rd request>  -> <P1, get Rd request>
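
Feeding the repeating request stream from this table into the Cosmos sketch from slide 28 shows how a depth-2 MHR disambiguates the pattern (the processor names and block address are illustrative):

    # Train the depth-2 sketch on the repeating request stream above.
    cosmos = Cosmos(depth=2)
    addr = 0x1000                                # hypothetical block address
    pattern = [("P1", "get Rd request"),
               ("P3", "get Rd request"),
               ("P2", "get Rd request")]
    for sender, mtype in pattern * 3:            # a few repetitions to train
        cosmos.update(addr, sender, mtype)
    print(cosmos.predict(addr))                  # ('P1', 'get Rd request')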

  31. Implementation issues • Storage Issues • Is it possible to merge the first-level table with the cache block state at the cache and the directory? • The second-level table will need more memory to capture the pattern histories for each cache block • If the number of pattern histories per cache block is found to be low, pre-allocate memory for the pattern histories • If more pattern histories are needed, allocate them from a common pool of dynamically allocated memory • Higher prediction accuracies require greater MHR depths: this may cost large amounts of memory

  32. Integration with a Coherence protocol • Predictors sit beside the cache and directory modules and accelerate coherence activity in two steps: • Step 1: Monitor message activity and make a prediction • Step 2: Invoke an action based on the prediction • Key challenges: • Knowing how and when to act on the predictions • Handling mis-predictions • Performance

  33. How to act on predictions • Some Examples

  34. Detecting and Handling Mis-predictions • The usual problem with predictions • Mis-predictions may leave the processor state or protocol state inconsistent • Actions taken after predictions can be classified into three categories: • Actions that move the protocol between two legal states • Actions that move the protocol to a future state, but do not expose this state to the processor • Actions that allow both the processor and the protocol to move to future states

  35. Handling Mis-Predictions • Actions that move the protocol between two legal states • Example: replacement of a cache block that moves the block from the “exclusive” to the “invalid” state • No explicit recovery is needed in this case [Figure: timeline of P1 cache, directory, and P2 cache exchanging a Get Wr request, an Inval Wr response, and a Get Wr response]

  36. Handling Mis-Predictions • Actions that move the protocol to a future state, but do not expose this state to the processor • On a mis-prediction, simply discard the future state • If the prediction is correct, commit the future state and expose it to the processor [Figure: timeline in which the directory predicts, updates protocol state, and generates the Inval Wr request early; after P1's Get Wr request arrives and P2's Inval Wr response returns, the directory sends the Get Wr response]

  37. Handling Mis-Predictions [Figure: timeline in which the directory predicts, updates protocol state, and generates a message; on detecting the mis-prediction, it sends the correct response]

  38. Handling Mis-Predictions • Actions that allow both the processor and the protocol to move to future states • Need greater support for recovery • Before speculation, both the processor and the protocol can checkpoint their states • On detecting a mis-prediction, they roll back to the checkpointed states • On a correct prediction, the current protocol and processor states must be committed
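
One way to picture this third category is a checkpoint-and-rollback wrapper around the protocol state. The dictionary representation below is an illustrative assumption, not the paper's mechanism.

    # Sketch of checkpointed speculation: save state before acting on a
    # prediction, then either commit or roll back.
    import copy

    class SpeculativeProtocolState:
        def __init__(self, state):
            self.state = state              # e.g., {"block_A": "exclusive"}
            self.checkpoint = None

        def speculate(self, block, new_state):
            self.checkpoint = copy.deepcopy(self.state)  # checkpoint first
            self.state[block] = new_state                # then move to the future state

        def commit(self):
            self.checkpoint = None          # prediction correct: keep the future state

        def rollback(self):
            self.state = self.checkpoint    # mis-prediction: restore the checkpoint
            self.checkpoint = None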

  39. Performance • How prediction affects runtime • A simplistic execution model is as follows. Let: p be the prediction accuracy for each message, f be the fraction of delay incurred on messages predicted correctly (e.g., f = 0 means that the delay of a correctly predicted message is completely overlapped with other delays), and r be the penalty due to a mis-predicted message (e.g., r = 0.5 implies a mis-predicted message takes 1.5 times the delay of a message without prediction).

  40. Performance • How prediction affects runtime • p: prediction accuracy for each message • f: fraction of delay incurred on messages predicted correctly • r: penalty due to a mis-predicted message • If performance is completely determined by the number of messages on the critical path of a parallel program, then the speedup due to prediction is:

    time(w/o prediction) / time(with prediction) = 1 / (p*f + (1-p)*(1+r))

  41. Performance • E.g.: for a prediction accuracy of 80% (p = 0.8), a mis-prediction penalty of 100% (r = 1), and correctly predicted messages incurring only 30% of their delay (f = 0.3), the speedup is 1 / (0.8*0.3 + 0.2*2) = 1/0.64 ≈ 1.56, i.e., 56%
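
The model is easy to check numerically; the snippet below reproduces this slide's figure.

    # Speedup model from slide 40, evaluated with slide 41's parameters.
    def speedup(p, f, r):
        return 1.0 / (p * f + (1 - p) * (1 + r))

    print(speedup(p=0.8, f=0.3, r=1.0))   # 1.5625, i.e., a 56% speedup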

  42. Evaluation • Cosmos's prediction accuracy is evaluated using traces of coherence messages obtained from the Wisconsin Stache protocol running five parallel scientific applications • Wisconsin Stache protocol: Stache is a software, full-map, write-invalidate directory protocol that uses part of local memory as a cache for remote data • Benchmarks: five parallel scientific applications: appbt, barnes, dsmc, moldyn, unstructured

  43. Benchmarks • Appbt Appbt is a parallel three-dimensional computational fluid dynamics application. • Barnes Barnes simulates the interaction of a system of bodies in three dimensions using the Barnes-Hut hierarchical N-body method. • Dsmc Dsmc studies the properties of a gas by simulating the movement and collision of a large number of particles in a three-dimensional domain, using the discrete-simulation Monte Carlo method.

  44. Benchmarks • Moldyn Moldyn is a molecular dynamics application. • Unstructured Unstructured is a computational fluid dynamics application that uses an unstructured mesh to model a physical structure, such as an airplane wing or body.

  45. Results [Chart: prediction accuracy per benchmark; C: cache prediction rate, D: directory prediction rate, O: overall prediction rate]

  46. Results [Chart, continued: prediction accuracy per benchmark; C: cache prediction rate, D: directory prediction rate, O: overall prediction rate]

  47. Results: Observations • Overall prediction accuracy: 62 to 86% • Higher accuracy for caches than for the directory: why? • Prediction accuracy increases with MHR depth • However, there is little further gain beyond an MHR depth of 3 • Appbt: high prediction accuracy; producer-consumer sharing pattern (producer reads and writes, consumer reads) • Barnes: lower accuracy than the other applications; nodes of the octree are assigned different shared-memory addresses in different iterations

  48. Results: Observations • Dsmc: highest accuracy among all applications; producer-consumer sharing patterns (producer writes, consumer reads); why higher than Appbt? • Moldyn: high accuracy; migratory and producer-consumer sharing patterns • Unstructured: different dominant signatures for the same data structures in different phases of the application; migratory and producer-consumer sharing patterns

  49. Effects of noise filters • Remember them? • Cosmos noise filter: a saturating counter that counts from 0 up to MAXCOUNT (here MAXCOUNT = 2) • For MHR depths > 2, the filters do not help much: why? • Predictors with MHR depth > 1 can adapt to noise on their own, giving greater accuracy for repeating noise
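
One plausible reading of the counter-based filter (an assumption; the slide does not spell out the policy) is that a stored prediction is replaced only after the counter decays to zero, so a one-off noise message cannot evict an established prediction:

    # Hypothetical saturating-counter noise filter for one PHT entry.
    MAXCOUNT = 2

    class FilteredEntry:
        def __init__(self, prediction):
            self.prediction = prediction
            self.count = MAXCOUNT

        def train(self, actual):
            if actual == self.prediction:
                self.count = min(self.count + 1, MAXCOUNT)   # reinforce
            elif self.count > 0:
                self.count -= 1               # absorb noise without replacing
            else:
                self.prediction = actual      # persistent change: adopt it
                self.count = MAXCOUNT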

  50. Summary and Conclusions • Comparison with directed optimizations • Worse: less cost-effective, since more hardware is required • Better: • Composing the predictors of several directed optimizations in a single protocol would be more complex than a single Cosmos • Cosmos can discover application-specific sharing patterns not known a priori
