
Computer Systems Principles: Architecture






Presentation Transcript


  1. Computer Systems Principles: Architecture. Emery Berger and Mark Corner, University of Massachusetts Amherst

  2. Architecture

  3. Von Neumann

  4. “von Neumann architecture”

  5. Fetch, Decode, Execute

  6. The Memory Hierarchy • Registers • Caches • Associativity • Misses • “Locality” • [Diagram: registers → L1 → L2 → RAM]

  7. Registers • Register = dedicated name for one word of memory managed by CPU • General-purpose: “AX”, “BX”, “CX” on x86 • Special-purpose: • “SP” = stack pointer • “FP” = frame pointer • “PC” = program counter • [Diagram: stack frame with SP and FP pointing at arguments arg0–arg2]

  8. Registers • Register = dedicated name for one word of memory managed by CPU • General-purpose: “AX”, “BX”, “CX” on x86 • Special-purpose: • “SP” = stack pointer • “FP” = frame pointer • “PC” = program counter • Change processes: save current registers & load saved registers = context switch • [Diagram: stack frame with SP and FP]
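A rough sketch of what “save current registers & load saved registers” means, in C++. The Context struct and its fields are invented for illustration; a real kernel does this in assembly, since portable C++ cannot name hardware registers.

    #include <cstdint>
    #include <cstdio>

    // Invented stand-in for the registers named above.
    struct Context {
        uint64_t ax, bx, cx;   // general-purpose registers
        uint64_t sp, fp, pc;   // stack pointer, frame pointer, program counter
    };

    Context cpu = {1, 2, 3, 0x7fff0000, 0x7fff0040, 0x4000};  // "hardware" state

    // Context switch in miniature: save the running process's registers
    // into its control block, then load the next process's saved state.
    void context_switch(Context* save_to, const Context* load_from) {
        *save_to = cpu;    // save current registers
        cpu = *load_from;  // load saved registers
    }

    int main() {
        Context procA = {};  // outgoing state lands here
        Context procB = {9, 8, 7, 0x7ffe0000, 0x7ffe0040, 0x5000};
        context_switch(&procA, &procB);
        std::printf("pc after switch: %#llx\n", (unsigned long long)cpu.pc);
    }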

  9. Caches • Access to main memory: “expensive” • ~ 100 cycles (slow, but relatively cheap ($)) • Caches: small, fast, expensive memory • Hold recently-accessed data (D$) or instructions (I$) • Different sizes & locations • Level 1 (L1) – on-chip, smallish • Level 2 (L2) – on or next to chip, larger • Level 3 (L3) – pretty large, on bus • Manages lines of memory (32-128 bytes)

  10. Memory Hierarchy • Higher = small, fast, more $, lower latency • Lower = large, slow, less $, higher latency
  • registers: 1-cycle latency
  • L1 (separate D$ and I$; loads from and evicts to L2): 2-cycle latency
  • L2 (unified D$ and I$): 7-cycle latency
  • RAM: 100-cycle latency
  • Disk: 40,000,000-cycle latency
  • Network: 200,000,000+ cycle latency
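The latency gaps above can be observed directly. Below is a minimal pointer-chasing sketch; the working-set sizes, iteration count, and seed are illustrative, and the exact numbers are machine-specific. Each load depends on the previous one, so the loop measures latency rather than bandwidth, and the time per access jumps as the array outgrows L1, then L2, then the last-level cache.

    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        std::mt19937 rng(42);
        for (size_t n : {size_t(1) << 12, size_t(1) << 15, size_t(1) << 18, size_t(1) << 21}) {
            std::vector<size_t> next(n);
            std::iota(next.begin(), next.end(), size_t(0));
            // Sattolo's algorithm: a single cycle, so the chase visits every slot.
            for (size_t i = n - 1; i > 0; --i) {
                std::uniform_int_distribution<size_t> pick(0, i - 1);
                std::swap(next[i], next[pick(rng)]);
            }
            size_t i = 0;
            auto t0 = std::chrono::steady_clock::now();
            for (long k = 0; k < 20000000; ++k) i = next[i];  // dependent loads
            auto t1 = std::chrono::steady_clock::now();
            double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / 2e7;
            std::printf("%8zu KB: %5.1f ns/access (i=%zu)\n",
                        n * sizeof(size_t) / 1024, ns, i);
        }
    }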

  11. “Locality”

  12. “Level 0 Cache”

  13. “Level 1 Cache”

  14. “RAM”

  15.–18. “Disk” [the same image slide repeated as animation frames]

  19. “Book Hierarchy”

  20. Orders of Magnitude • 10^0 cycles: registers, L1

  21. Orders of Magnitude • 10^1 cycles: L2

  22. Orders of Magnitude • 10^2 cycles: RAM

  23. Orders of Magnitude • 10^3 cycles

  24. Orders of Magnitude • 10^4 cycles

  25. Orders of Magnitude • 10^5 cycles

  26. Orders of Magnitude • 10^6 cycles

  27. Orders of Magnitude • 10^7 cycles: Disk

  28. Orders of Magnitude • 10^8 cycles: Network

  29. Orders of Magnitude • 10^9 cycles: Network

  30. Cache Jargon • Cache initially cold • Accessing data initially misses • Fetch from lower level in hierarchy • Bring line into cache (populate cache) • Next access: hit • Warmed up • cache holds most-frequently used data • Context switch implications? • LRU: Least Recently Used • Use the past as a predictor of the future

  31. Cache Details • Ideal cache would be fully associative • That is, one LRU (least-recently used) queue over the whole cache • Generally too expensive • Instead, partition memory addresses into separate bins, each divided into ways • 1-way = direct-mapped • 2-way = 2 entries per bin • 4-way = 4 entries per bin, etc.

  32. Associativity Example • Hash memory addresses to different indices in the cache
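A small sketch of that hashing, with invented parameters (32-byte lines, 4 bins). The line number selects a bin by modulus; the remaining bits form the tag stored alongside the data to identify which line occupies a way.

    #include <cstdint>
    #include <cstdio>

    constexpr uint64_t LINE_BYTES = 32;  // illustrative line size
    constexpr uint64_t NUM_BINS   = 4;   // illustrative number of bins

    int main() {
        // Note 0x0000 and 0x0080 land in the same bin: a potential conflict.
        for (uint64_t addr : {0x0000ull, 0x0020ull, 0x0080ull, 0x00a0ull}) {
            uint64_t line = addr / LINE_BYTES;  // which cache line holds addr
            uint64_t bin  = line % NUM_BINS;    // the hash: line -> bin index
            uint64_t tag  = line / NUM_BINS;    // identifies the line in its bin
            std::printf("addr 0x%04llx -> bin %llu, tag %llu\n",
                        (unsigned long long)addr, (unsigned long long)bin,
                        (unsigned long long)tag);
        }
    }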

  33. Miss Classification • First access = compulsory miss • Unavoidable without prefetching • Too many items mapping to the same bin = conflict miss • Avoidable if we had higher associativity • No space in cache = capacity miss • Avoidable if cache were larger • Invalidated = coherence miss • Avoidable if cache were unshared

  34. Quick Activity • Cache with 8 slots, 2-way associativity • Assume hash(x) = x % 4 (modulus) • How many misses? • # compulsory misses? • # conflict misses? • # capacity misses? • [Answer values shown on the slide: 10, 2, 0]
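The slide's access trace did not survive extraction, but the setup can still be exercised in code. A sketch with a made-up trace, using the slide's parameters (8 slots, 2-way, hash(x) = x % 4) and LRU order within each bin:

    #include <algorithm>
    #include <cstdio>
    #include <list>
    #include <vector>

    int main() {
        constexpr int BINS = 4, WAYS = 2;        // 8 slots total
        std::vector<std::list<int>> bins(BINS);  // front = most recently used
        int trace[] = {0, 4, 8, 0, 4, 8, 1, 2, 3, 1};  // made-up trace:
        int hits = 0, misses = 0;                      // 0, 4, 8 all conflict in bin 0
        for (int x : trace) {
            std::list<int>& b = bins[x % BINS];  // hash(x) = x % 4
            std::list<int>::iterator it = std::find(b.begin(), b.end(), x);
            if (it != b.end()) {                 // hit: move to MRU position
                b.erase(it);
                ++hits;
            } else {                             // miss: evict LRU if bin is full
                if ((int)b.size() == WAYS) b.pop_back();
                ++misses;
            }
            b.push_front(x);
        }
        std::printf("hits=%d misses=%d\n", hits, misses);
    }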

  35. Locality • Locality = re-use of recently-used items • Temporal locality: re-use in time • Spatial locality: use of nearby items • In same cache line, same page (4K chunk) • Intuitively: greater locality = fewer misses • # misses depends on cache layout, # of levels, associativity… • Machine-specific
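Spatial locality is easy to demonstrate: the two loops below touch exactly the same elements, but the row-major walk uses every word of each cache line it loads, while the column-major walk strides a full row between accesses. Array size and the resulting slowdown are machine-specific; this is a sketch, not a benchmark.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 4096;
        std::vector<int> a(size_t(N) * N, 1);
        long long sum = 0;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i)        // row-major: consecutive addresses,
            for (int j = 0; j < N; ++j)    // good spatial locality
                sum += a[size_t(i) * N + j];
        auto t1 = std::chrono::steady_clock::now();
        for (int j = 0; j < N; ++j)        // column-major: N-int stride,
            for (int i = 0; i < N; ++i)    // a new cache line nearly every access
                sum += a[size_t(i) * N + j];
        auto t2 = std::chrono::steady_clock::now();

        using ms = std::chrono::milliseconds;
        std::printf("row-major:    %lld ms\n",
                    (long long)std::chrono::duration_cast<ms>(t1 - t0).count());
        std::printf("column-major: %lld ms (sum=%lld)\n",
                    (long long)std::chrono::duration_cast<ms>(t2 - t1).count(), sum);
    }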

  36.–41. Quantifying Locality • Instead of counting misses, compute hit curve from LRU histogram • Assume perfect LRU cache • Ignore compulsory misses • [These slides animate building the LRU reuse-distance histogram, bar by bar, over distances 1–6]

  42. Quantifying Locality • Instead of counting misses, compute hit curve from LRU histogram • Start with total misses on the right-hand side • Subtract histogram values • [Figure: intermediate values 1, 1, 3, 3, 3, 3 over distances 1–6]

  43. Quantifying Locality • Instead of counting misses, compute hit curve from LRU histogram • Start with total misses on the right-hand side • Subtract histogram values • Normalize • [Figure: normalized hit curve, values as extracted: .3, .3, 1, 1, 1, 1 over distances 1–6]
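The computation on these slides is short enough to write out. A sketch, assuming hist[d-1] holds the number of accesses with LRU reuse distance d and ignoring compulsory misses: an access hits in a cache of size C exactly when its reuse distance is at most C, so the hit curve is the normalized running sum of the histogram.

    #include <cstdio>
    #include <vector>

    std::vector<double> hitCurve(const std::vector<long>& hist) {
        long total = 0;
        for (size_t d = 0; d < hist.size(); ++d) total += hist[d];
        std::vector<double> curve(hist.size());
        long cum = 0;
        for (size_t c = 0; c < hist.size(); ++c) {
            cum += hist[c];                        // distances <= c+1 all hit
            curve[c] = total ? double(cum) / total : 0.0;
        }
        return curve;
    }

    int main() {
        std::vector<long> hist = {3, 0, 0, 0, 0, 7};  // illustrative histogram
        std::vector<double> curve = hitCurve(hist);
        for (size_t c = 0; c < curve.size(); ++c)
            std::printf("cache size %zu: hit rate %.2f\n", c + 1, curve[c]);
    }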

  44.–46. Hit Curve Exercise • Derive the hit curve for the following trace: 1 2 2 2 3 3 4 5 6 • [Figure: the hit curve, plotted step by step over cache sizes 1–9]

  47. What can we do with this? • What would be the hit rate with a cache size of 4 or 9?

  48. Simple cache simulator • Only argument is N, the length of the LRU queue • Read in addresses (ints) from cin • Output hits & misses to cout • deque<int> • push_front(v) = put v on front of deque • pop_back() = remove back element • erase(i) = erase element at iterator i • size() = number of elements • for (deque<int>::iterator i = q.begin(); i != q.end(); ++i) cout << *i << endl;
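A minimal, runnable version of the simulator this slide describes; variable names are my own. The deque is kept in most-recently-used-first order, so a hit moves the address to the front and a miss may evict from the back.

    #include <cstdlib>
    #include <deque>
    #include <iostream>

    int main(int argc, char* argv[]) {
        if (argc != 2 || std::atoi(argv[1]) < 1) {
            std::cerr << "usage: " << argv[0] << " N" << std::endl;
            return 1;
        }
        const size_t N = std::atoi(argv[1]);   // length of the LRU queue
        std::deque<int> q;                     // front = most recently used
        long hits = 0, misses = 0;
        int addr;
        while (std::cin >> addr) {
            bool hit = false;
            for (std::deque<int>::iterator i = q.begin(); i != q.end(); ++i) {
                if (*i == addr) {              // hit: remove, re-insert at front
                    q.erase(i);
                    hit = true;
                    break;
                }
            }
            if (hit) ++hits;
            else {
                ++misses;
                if (q.size() == N) q.pop_back();  // evict least recently used
            }
            q.push_front(addr);
        }
        std::cout << "hits: " << hits << "\nmisses: " << misses << std::endl;
    }

Fed the trace from the exercise (e.g., echo 1 2 2 2 3 3 4 5 6 | ./sim 4, where ./sim is whatever the compiled binary is named), it reports hits and misses directly, which answers slide 47 for any cache size.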

  49. Important CPU Internals • Other issues that affect performance • Pipelining • Branches & prediction • System calls (kernel crossings)

  50. Scalar architecture • Straight-up sequential execution • Fetch instruction • Decode it • Execute it • Problem: I-cache or D-cache miss • Result: stall, everything stops • How long to wait for a miss? • A long time, compared to a CPU cycle
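The fetch-decode-execute cycle itself can be sketched as a toy interpreter; the three-field instruction format below is invented for illustration. The comment marks the fetch, where a real scalar CPU would stall on an I-cache miss.

    #include <cstdio>
    #include <vector>

    enum Op { LOAD, ADD, HALT };
    struct Insn { Op op; int dst, src; };   // invented instruction format

    int main() {
        std::vector<Insn> program = {
            {LOAD, 0, 7},   // r0 = 7
            {LOAD, 1, 5},   // r1 = 5
            {ADD,  0, 1},   // r0 += r1
            {HALT, 0, 0},
        };
        int reg[4] = {0};
        for (int pc = 0; ; ) {
            Insn insn = program[pc++];   // fetch: stalls here on an I$ miss
            switch (insn.op) {           // decode
            case LOAD: reg[insn.dst] = insn.src;         break;  // execute
            case ADD:  reg[insn.dst] += reg[insn.src];   break;
            case HALT: std::printf("r0 = %d\n", reg[0]); return 0;
            }
        }
    }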
