Token Coherence: A Framework for Implementing Multiple-CMP Systems

Presentation Transcript


1. Token Coherence: A Framework for Implementing Multiple-CMP Systems
Mike Marty (1), Jesse Bingham (2), Mark Hill (1), Alan Hu (2), Milo Martin (3), and David Wood (1)
(1) University of Wisconsin-Madison, (2) University of British Columbia, (3) University of Pennsylvania
February 17th, 2005

2. Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Directory (Complex & Slow)
• New Solution: Apply Token Coherence
  • Developed for glueless multiprocessors [2003]
  • Keep: Flat for Correctness
  • Exploit: Hierarchical for Performance
  • Less Complex & Faster than a Hierarchical Directory

3. Outline
• Motivation and Background
  • Coherence in Multiple-CMP Systems
  • Example: DirectoryCMP
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation

4. Coherence in Multiple-CMP Systems
• Chip Multiprocessors (CMPs) emerging
• Larger systems will be built with Multiple CMPs
[Diagram: one CMP with four processors, split L1 I & D caches, and L2 banks on an on-chip interconnect; four such chips (CMP 1-4) joined by a system interconnect]

5. Problem: Hierarchical Coherence
• Intra-CMP protocol for coherence within a CMP
• Inter-CMP protocol for coherence between CMPs
• Interactions between the two protocols increase complexity
  • Explodes the state space
[Diagram: CMPs 1-4 on an interconnect, with intra-CMP coherence inside each chip and inter-CMP coherence between chips]

6. Improving Multiple-CMP Systems with Token Coherence
• Token Coherence allows Multiple-CMP systems to be...
  • Flat for correctness, but
  • Hierarchical for performance
[Diagram: a low-complexity correctness substrate beneath a fast performance protocol, spanning CMPs 1-4 on the interconnect]

7. Example: DirectoryCMP
• 2-level MOESI directory protocol
• Race conditions!
[Diagram: two CMPs, each with processors P0-P7 (L1 I & D), a shared L2/directory, and memory/directory; two concurrent Store B requests race as getx, inv/ack, fwd, WB, and data/ack messages interleave while block B transitions, e.g., M→I and S→O]

8. Token Coherence Summary
• Token Coherence separates performance from correctness
• Correctness Substrate: enforces the coherence invariant and prevents starvation
  • Safety with Token Counting
  • Starvation Avoidance with Persistent Requests
• Performance Policy: makes the common case fast
  • Transient requests to seek tokens
  • Unordered, untracked, unacknowledged
  • Enables prediction, multicast, filters, etc.

9. Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
  • Safety
  • Starvation Avoidance
• Token Coherence: Hierarchical for Performance
• Evaluation

10. Example: Token Coherence [ISCA 2003]
• Each memory block initialized with T tokens
• Tokens stored in memory, caches, & messages
• At least one token to read a block
• All tokens to write a block
[Diagram: four processors (P0-P3, each with L1 I & D and L2) and memories on an interconnect; loads and stores to block B gather tokens]
(A minimal sketch of these counting rules appears below.)
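The two rules above can be captured in a few lines. This is a minimal sketch under assumed names (TokenCopy, can_read, and can_write are illustrative, not from the paper):

```python
T = 8  # total tokens per block; sized to the number of caches (slide 14)

class TokenCopy:
    """A cache's view of one block under token counting (illustrative)."""
    def __init__(self, tokens=0):
        self.tokens = tokens        # tokens this cache currently holds

    def can_read(self):
        return self.tokens >= 1     # at least one token to read

    def can_write(self):
        # All T tokens: no other cache can hold even a read token,
        # so the single-writer invariant holds by construction.
        return self.tokens == T
```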

11. Extending to a Multiple-CMP System
[Diagram: two chips, CMP 0 and CMP 1, each with processors (L1 I & D, L2 banks) and a shared L2 on an on-chip interconnect, attached to memories mem 0 and mem 1 over a system interconnect]

12. Extending to a Multiple-CMP System
• Token counting remains flat
  • Tokens go to caches
• Handles shared caches and other complex hierarchies
[Diagram: concurrent Store B requests on both chips gather tokens from L1s, shared L2s, and memory]

13. Safety Recap
• Safety: maintain the coherence invariant
  • Only one writer, or multiple readers
• Tokens for Safety
  • T tokens associated with each memory block
  • Token count encoded in 1 + log2 T bits
  • A processor acquires all tokens to write, a single token to read
• Tokens passed to nodes in the glueless multiprocessor scheme
  • But CMPs have private and shared caches
• Tokens passed to caches in a Multiple-CMP system
  • Arbitrary cache hierarchies are easily handled
  • Flat for correctness

14. Some Token Counting Implications
• Memory must store tokens
  • Separate RAM
  • Extra ECC bits
  • Token cache
• T sized to the number of caches, to allow read-only copies in all caches
• Replacements cannot be silent
  • Tokens must not be lost or dropped
• Targeted at invalidate-based protocols
  • Not a solution for write-through or update protocols
• Tokens must be identified by block address
  • The address must be in all token-carrying messages
(A sketch of a token-carrying message appears below.)
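The last two implications suggest a message shape like the following. This is a hypothetical format for illustration; TokenMsg and evict are assumed names, not the paper's:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenMsg:
    block_addr: int               # every token-carrying message names its block
    tokens: int                   # token count being transferred (>= 1)
    has_owner: bool               # whether the owner token rides along
    data: Optional[bytes] = None  # data accompanies the owner token (slide 27)

def evict(entry):
    """Replacements cannot be silent: forward tokens, never drop them."""
    return TokenMsg(entry.addr, entry.tokens, entry.has_owner,
                    entry.data if entry.has_owner else None)
```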

15. Starvation Avoidance
• Request messages can miss tokens
  • In-flight tokens
  • Transient requests are not tracked throughout the system
  • Incorrect filtering, multicast, destination-set prediction, etc.
• Possible Solution: Retries
  • Retry with optional randomized backoff is effective for races
• Guaranteed Solution: Persistent Requests
  • Heavyweight request guaranteed to succeed
  • Should be rare (uses more bandwidth)
  • Locates all tokens in the system
  • Orders competing requests
(A sketch of the retry-then-escalate flow appears below.)
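The two solutions compose naturally: retry a bounded number of times, then escalate. A minimal sketch assuming placeholder network helpers (issue_transient, issue_persistent, and tokens_acquired are stand-ins, not real APIs):

```python
import random
import time

def issue_transient(block):      # placeholder: send an unordered request
    pass

def issue_persistent(block):     # placeholder: broadcast a persistent request
    pass

def tokens_acquired(block):      # placeholder: check the local token count
    return False

def acquire_all_tokens(block, max_retries=4, timeout_s=1e-6):
    for attempt in range(max_retries):
        issue_transient(block)
        # Randomized backoff reduces repeated races between requesters.
        time.sleep(timeout_s * (1 + attempt * random.random()))
        if tokens_acquired(block):
            return
    issue_persistent(block)      # guaranteed to succeed eventually
```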

16. Starvation Avoidance
• Tokens move freely in the system
• Transient requests can miss in-flight tokens
  • Incorrect speculation, filters, prediction, etc.
[Diagram: three processors issue GETX for Store B across CMP 0 and CMP 1; the transient requests miss tokens that are in flight]

17. Starvation Avoidance
• Solution: issue a Persistent Request
  • Heavyweight request guaranteed to succeed
• Methods: Centralized [2003] and Distributed (new)
[Diagram: the three competing Store B requests escalate to persistent requests]

18. Old Scheme: Central Arbiter [2003]
• Processors issue persistent requests after a timeout
[Diagram: P0, P1, and P2 time out on Store B and send persistent requests to arbiter 0, which queues B: P0, B: P1, B: P2]

19. Old Scheme: Central Arbiter [2003]
• Processors issue persistent requests
• The arbiter orders them and broadcasts an activate
[Diagram: arbiter 0 activates B: P0; every cache and memory records B: P0 while B: P1 and B: P2 wait in the arbiter's queue]

20. Old Scheme: Central Arbiter [2003]
• The processor sends a deactivate to the arbiter
• The arbiter broadcasts the deactivate (and the next activate)
• Bottom line: handoff is 3 message latencies
[Diagram: (1) P0 sends a deactivate to arbiter 0, (2) the arbiter broadcasts the deactivate, (3) tables switch from B: P0 to the next activate, B: P2]
(A sketch of this arbiter appears below.)
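A per-block FIFO captures the arbiter's behavior; the comments mark where the three message latencies of a handoff accrue. A minimal sketch (CentralArbiter and the broadcast helpers are illustrative names):

```python
from collections import deque

class CentralArbiter:
    def __init__(self):
        self.queues = {}  # block address -> FIFO of requesting processors

    def persistent_request(self, block, proc):
        q = self.queues.setdefault(block, deque())
        q.append(proc)
        if len(q) == 1:
            self.broadcast_activate(block, proc)

    def deactivate(self, block, proc):
        # Handoff cost: (1) the deactivate travels to the arbiter,
        # (2) the arbiter broadcasts the deactivate, and (3) the next
        # activate is broadcast before the new requester can proceed.
        q = self.queues[block]
        assert q[0] == proc
        q.popleft()
        self.broadcast_deactivate(block, proc)
        if q:
            self.broadcast_activate(block, q[0])

    def broadcast_activate(self, block, proc):    # placeholder network op
        pass

    def broadcast_deactivate(self, block, proc):  # placeholder network op
        pass
```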

21. Improved Scheme: Distributed Arbitration [NEW]
• Processors broadcast persistent requests
[Diagram: P0, P1, and P2 broadcast persistent requests for B; every cache and memory records all three entries (P0: B, P1: B, P2: B) in a local table]

22. Improved Scheme: Distributed Arbitration [NEW]
• Processors broadcast persistent requests
• Fixed priority (processor number)
[Diagram: each table independently marks P0: B as the active request]

23. Improved Scheme: Distributed Arbitration [NEW]
• Processors broadcast persistent requests
• Fixed priority (processor number)
• Processors broadcast deactivate
[Diagram: (1) P0 finishes Store B and broadcasts a deactivate; each table drops P0: B and locally activates P1: B]

24. Improved Scheme: Distributed Arbitration [NEW]
• Bottom line: handoff is a single message latency
• Subtle point: P0 and P1 must wait until the next "wave"
[Diagram: tables now hold P1: B (active) and P2: B; P1's Store B proceeds]

25. Implementing Distributed Persistent Requests
• A table at each cache
  • Sized to N entries per processor (we use N = 1)
  • Indexed by processor ID
  • Content-addressable by address
• Each incoming message must access the table
  • Not on the critical path, so the CAM can be slow
• Activate/deactivate reordering cannot be allowed
  • The persistent-request virtual channel must be point-to-point ordered
  • Or use another solution, such as sequence numbers or acks
(A sketch of such a table appears below.)
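The table is small: one entry per processor, searched by address. A minimal sketch under the slide's assumptions (N = 1, fixed priority by processor number; names are illustrative):

```python
class PersistentTable:
    def __init__(self, num_procs):
        self.entry = [None] * num_procs   # indexed by processor ID (N = 1)

    def activate(self, proc, block):
        self.entry[proc] = block          # record the persistent request

    def deactivate(self, proc):
        self.entry[proc] = None

    def winner(self, block):
        """CAM-style search by address; lowest processor number wins."""
        for proc, b in enumerate(self.entry):
            if b == block:
                return proc
        return None
```

In this sketch, every incoming token-carrying message would consult winner() to decide whether local tokens must be forwarded to the currently active requester.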

26. Implementing Distributed Persistent Requests
• Should reads be distinguished from writes?
  • Not necessary, but a persistent read request is helpful
• Implications of flat distributed arbitration
  • Simple → flat for correctness
  • Global broadcast when used
    • Fortunately, persistent requests are rare in typical workloads (0.3%)
    • A bad workload (very high contention) would burn bandwidth
  • The maximum number of processors must be architected
• What about a hierarchical persistent request scheme?
  • Possible, but then correctness is no longer flat
  • Make the common case fast instead

27. Reducing Unnecessary Traffic
• Problem: which token-holding cache responds with data?
• Solution: distinguish one token as the owner token
  • The owner includes data with its token response
• A clean vs. dirty owner distinction is also useful for writebacks
(A sketch of the owner-token response rule appears below.)
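Only the owner sends data; other holders send bare tokens. A minimal sketch (the respond helper and message dict are illustrative, not the paper's interface):

```python
def respond(entry):
    """Reply to a transient request; only the owner attaches data."""
    if entry.tokens == 0:
        return None                                   # nothing to send
    if entry.has_owner:
        return {"tokens": entry.tokens, "owner": True,
                "dirty": entry.dirty,                 # clean owner: writeback
                "data": entry.data}                   # can skip sending data
    return {"tokens": entry.tokens, "owner": False}   # tokens only, no data
```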

28. Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
  • TokenCMP
  • Another look at performance policies
• Evaluation

29. Hierarchical for Performance: TokenCMP
• Target System:
  • 2-8 CMPs
  • Private L1s, shared L2 per CMP
  • Any interconnect, but high-bandwidth
• Performance Policy Goals:
  • Aggressively acquire tokens
  • Exploit on-chip locality and bandwidth
  • Respect the cache hierarchy
  • Detect and handle missed tokens

30. Hierarchical for Performance: TokenCMP
• Approach:
  • On an L1 miss, broadcast within the requester's own CMP
    • A local cache responds if possible
  • On an L2 miss, broadcast to the other CMPs
    • The appropriate L2 bank responds or broadcasts within its CMP
    • Optionally filter
  • Responses between CMPs carry extra tokens for future locality
• Handling missed tokens:
  • Time out after the average memory latency
  • Invoke a persistent request (no retries)
• Larger systems can use filters, multicast, or soft-state directories
(A sketch of this two-level miss flow appears below.)
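The miss flow escalates in two steps before falling back to a persistent request. A minimal sketch with placeholder helpers and an illustrative timeout (none of these names or numbers come from the paper):

```python
AVG_MEM_LATENCY_S = 400e-9        # illustrative timeout, not measured

def broadcast_intra_cmp(block):   # placeholder: ask caches on this chip
    pass

def broadcast_inter_cmp(block):   # placeholder: ask the other CMPs
    pass

def issue_persistent(block):      # placeholder: escalate (slide 15)
    pass

def wait_for_tokens(block, timeout_s):  # placeholder: poll local tokens
    return False

def handle_miss(block):
    broadcast_intra_cmp(block)                    # exploit on-chip locality
    if wait_for_tokens(block, AVG_MEM_LATENCY_S):
        return
    broadcast_inter_cmp(block)                    # true L2 miss: go off-chip
    if wait_for_tokens(block, AVG_MEM_LATENCY_S):
        return
    issue_persistent(block)                       # no retries in TokenCMP
```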

31. Other Optimizations in TokenCMP
• Implementing the E state
  • Memory responds with all tokens on a read request
  • The clean/dirty owner distinction eliminates writing back unwritten data
• Implementing Migratory Sharing
  • What is it? A processor's read request results in exclusive permission if the responder had exclusive permission and wrote the block
  • In TokenCMP, simply return all tokens
  • Non-speculative delay: hold the block for some number of cycles so permission isn't stolen prematurely
(A sketch of the migratory response rule appears below.)
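The migratory heuristic changes only how many tokens a responder returns. A minimal sketch (tokens_to_send and wrote_block are illustrative names; T is the per-block total from the earlier sketches):

```python
T = 8  # total tokens per block, as in the earlier sketches

def tokens_to_send(entry, is_read_request):
    # Migratory pattern: the responder had exclusive permission (all
    # tokens) and actually wrote the block, so the reader will likely
    # write it too; hand over everything in a single transfer.
    migratory = entry.tokens == T and entry.wrote_block
    if not is_read_request or migratory:
        return entry.tokens
    return 1  # ordinary read: a single token suffices (slide 10)
```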

32. Another Look at Performance Policies
• How to find tokens?
  • Broadcast
  • Broadcast with filters
  • Multicast (destination-set prediction)
  • Directories (soft or hard)
• Who responds with data?
  • Owner token
    • TokenCMP uses the owner token for inter-CMP responses
  • Other heuristics
    • For intra-CMP responses, a TokenCMP cache responds if it has extra tokens

33. Transient Requests May Reduce Complexity
• The processor holds the only required state about its request
• TokenCMP's L2 controller is very simple:
  • Re-broadcasts L1 request messages on a miss
  • Re-broadcasts or filters external request messages
  • Possible states: no tokens (I), all tokens (M), some tokens (S)
  • Bounces unexpected tokens to memory
• DirectoryCMP's L2 controller is complex:
  • Allocates an MSHR on a miss and forwards the request
  • Issues invalidates and receives acks
  • Orders all intra-CMP requests and writebacks
  • 57 states in our L2 implementation!
(A sketch of the three-state controller appears below.)
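Because transient requests are untracked, the L2's stable state is fully determined by its token count: no MSHRs, ack counting, or transient states are needed. A minimal sketch (names are illustrative):

```python
from enum import Enum

class L2State(Enum):
    I = "no tokens"
    S = "some tokens"
    M = "all tokens"

def l2_state(tokens, total):
    """The token count alone determines the controller's state."""
    if tokens == 0:
        return L2State.I
    return L2State.M if tokens == total else L2State.S
```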

34. Writebacks
• DirectoryCMP uses "3-phase writebacks":
  • L1 issues a writeback request
  • L2 enters a transient state or blocks the request
  • L2 responds with a writeback ack
  • L1 sends the data
• TokenCMP uses "fire-and-forget" writebacks:
  • Immediately send tokens and data
  • Heuristic: only send data if the token count > 1
(A sketch contrasting the two styles appears below.)
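Token counting makes the handshake unnecessary: tokens can never be dropped, so the L1 simply sends them and forgets. A minimal sketch (message shapes and helper names are illustrative, not the simulator's interface):

```python
def tokencmp_writeback(entry, send):
    """Fire and forget: one message, no ack, no transient L1 state."""
    msg = {"addr": entry.addr, "tokens": entry.tokens,
           "owner": entry.has_owner}
    if entry.tokens > 1:              # heuristic from the slide
        msg["data"] = entry.data
    send(msg)

def directorycmp_writeback(entry, send, recv):
    """3-phase: request, wait for the L2's ack, then send data."""
    send({"type": "WB_REQ", "addr": entry.addr})       # phase 1
    ack = recv()                                       # phase 2 (L2 may block)
    assert ack["type"] == "WB_ACK"
    send({"type": "WB_DATA", "addr": entry.addr,
          "data": entry.data})                         # phase 3
```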

35. Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation
  • Model checking
  • Performance with commercial workloads
  • Robustness

36. TokenCMP Evaluation
• Simple?
  • Some anecdotal examples and comparisons
  • Model checking
• Fast?
  • Full-system simulation with commercial workloads
• Robust?
  • Micro-benchmarks to simulate high contention

37. Complexity Evaluation with Model Checking
(This work was performed by Jesse Bingham and Alan Hu of the University of British Columbia.)
• Methods:
  • TLA+ and the TLC model checker
  • The DirectoryCMP model omits all intra-CMP details
  • TokenCMP's correctness substrate was modeled
• Results:
  • Model complexity is similar between TokenCMP and the non-hierarchical DirectoryCMP
  • The correctness substrate is verified to be correct and deadlock-free
  • Therefore all possible performance protocols are correct

38. Performance Evaluation
• Target System:
  • 4 CMPs, 4 processors per CMP
  • 2 GHz OoO SPARC cores, 8 MB shared L2 per chip
  • Directly connected interconnect
• Methods: Multifacet GEMS simulator
  • Simics augmented with timing models
  • Released soon: http://www.cs.wisc.edu/gems
• Benchmarks:
  • Performance: Apache, SPEC, OLTP
  • Robustness: locking micro-benchmark

39. Full-system Simulation: Runtime
• TokenCMP performs 9-50% faster than DirectoryCMP

40. Full-system Simulation: Runtime
• TokenCMP performs 9-50% faster than DirectoryCMP
[Chart: runtime per workload; the legend includes "DRAM Directory" and "Perfect L2"]

41. Full-system Simulation: Inter-CMP Traffic
• TokenCMP's traffic is reasonable (or better)
• DirectoryCMP's control overhead is greater than broadcast for a small system

42. Full-system Simulation: Intra-CMP Traffic

43. Performance Robustness
• Locking micro-benchmark (correctness substrate only)
[Chart: runtime as contention increases, from less contention to more contention]

44. Performance Robustness
• Locking micro-benchmark (correctness substrate only)
[Chart: runtime as contention increases, from less contention to more contention]

45. Performance Robustness
• Locking micro-benchmark
[Chart: runtime as contention increases, from less contention to more contention]

46. Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Directory (Complex & Slow)
• New Solution: Apply Token Coherence
  • Developed for glueless multiprocessors [2003]
  • Keep: Flat for Correctness
  • Exploit: Hierarchical for Performance
  • Less Complex & Faster than a Hierarchical Directory
