
Buffers



  1. Buffers
  • Buffers minimize memory delays caused by variation in throughput between the pipeline and memory
  • Two types of buffer design criteria
    • Maximum-rate buffers, for units that have high request rates
      • The buffer is sized to mask the service latency
      • Generally read buffers that you want to keep full
    • Mean-rate buffers, for units that have a lower expected request rate
      • The buffer is sized to minimize the probability of overflow
      • Generally write buffers that you want to keep from filling

  2. Maximum-Rate Buffer Design
  • Buffer is sized to avoid “runout”, where the processor stalls while the buffer is empty awaiting service
  • Need buffer input rate > buffer output rate
  • Then size to cover the latency at maximum demand
  • For an instruction buffer, the buffer size BF should be
    BF = 1 + [(instrs decoded per clock) x (IF latency in clocks)] / (instrs per IF)

  3. Maximum-Rate Buffer Example
  • Assumptions:
    • Decode consumes at most 1 instruction/clock
    • I-cache supplies 2 instructions/clock of bandwidth at 6 clocks latency
  • Applying the slide-2 rule: BF = 1 + (1 x 6)/2 = 4 entries
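A quick sketch (not from the original deck) of the slide-2 sizing rule applied to these assumptions, in Python:

    import math

    def max_rate_buffer_size(decode_rate, fetch_latency, instrs_per_fetch):
        # BF = 1 + [(instrs decoded/clock) x (IF latency in clocks)] / (instrs per IF)
        return 1 + math.ceil((decode_rate * fetch_latency) / instrs_per_fetch)

    # Decode consumes at most 1 instruction/clock; the I-cache supplies
    # 2 instructions per fetch with a 6-clock latency.
    print(max_rate_buffer_size(1, 6, 2))  # -> 4 entries

During the 6-clock fetch latency, decode drains 6 instructions, which takes 3 fetches to replace; one more entry covers the slot currently being filled.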

  4. Mean-Rate Buffer Design
  • Buffer is sized to avoid “overflow”, where the processor stalls while the full buffer awaits service to free an entry
  • Need buffer input rate < buffer output rate
  • Then size to reduce the frequency of overflow to an acceptably low level with respect to overall performance

  5. Mean-Rate Buffer Design
  • Use inequalities from statistics to target the buffer size
  • For an infinite buffer, assume the distribution of buffer occupancy is q, with mean occupancy Q and variance σ²
  • Using Markov’s inequality for a buffer of size BF:
    Prob of overflow = P(q >= BF) <= Q/BF
  • Using Chebyshev’s inequality for a buffer of size BF:
    Prob of overflow = P(q >= BF) <= σ²/(BF − Q)²
  • For a given probability of overflow p, conservatively select BF:
    BF = min(Q/p, Q + σ/√p)
  • Analyze the design carefully to pick the correct BF, since overflow causes stalls

  6. Mean-Rate Buffer Example
  [Diagram: memory references from the pipeline split into reads, which go to the data cache, and writes, which go to the store buffer.]
  • Assumptions:
    • store rate = 0.15 instructions/cycle
    • store latency to data cache = 2 clocks
  • Mean occupancy Q = 0.15 x 2 = 0.3, with σ² = 0.3
  • BF = 2
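A minimal Python sketch (not from the original deck) of the slide-5 sizing bounds applied to this example; the overflow target p does not appear on the slide, so p = 0.15 below is a hypothetical choice that reproduces the slide’s result:

    import math

    def mean_rate_buffer_size(q_mean, q_var, p_overflow):
        # BF = min(Markov bound Q/p, Chebyshev bound Q + sigma/sqrt(p))
        bf_markov = q_mean / p_overflow
        bf_chebyshev = q_mean + math.sqrt(q_var / p_overflow)
        return math.ceil(min(bf_markov, bf_chebyshev))

    # Q = 0.15 stores/cycle x 2 clocks = 0.3; variance 0.3 as on the slide;
    # p = 0.15 is a hypothetical overflow target, not from the slide.
    print(mean_rate_buffer_size(0.3, 0.3, 0.15))  # -> 2 entries, as on the slide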

  7. Cache
  • Design Target Miss Ratio (DTMR)
    • DTMR is the design target miss rate assuming a fully associative, unified cache with LRU replacement
    • Includes compulsory and capacity misses for user-only traces

  8. Cache
  • Apply adjustments to the DTMR for:
    • Associativity
    • Split instruction/data caches
    • Write policy: Write-Through (WT) or Copy-Back (CB)
    • Write-Allocate (WA) vs. No-Write-Allocate (NWA)
    • Replacement policy
    • OS, multiprogramming, and I/O effects

  9. Target Miss Rate Trend
  • Target miss rates tend to increase over time
    • programming environments become more complex
    • application functionality increases
    • problem size grows
    • memory capacity increases

  10. Target Miss Rate Trend

  11. System Effects
  • Operating system
  • Multiprogramming
    • Q = no. of instructions between task switches
  • Cold-start vs. warm-start
    • Cold-start: short transactions are created frequently and run quickly to completion
    • Warm-start: long processes are executed in time slices

  12. Multi-Level Caches
  • Very useful for matching the processor to memory
  • Generally 2-level
  • For microprocessors, L1 is on-chip at the frequency of the pipeline, and L2 is off-chip with longer latency and lower bandwidth

  13. Multi-Level Caches
  • Analysis by (statistical) inclusion
    • If an L2 cache is more than four times the size of the L1 cache, we assume statistically that the contents of L1 lie in L2
  • Relevant L2 miss rates
    • Local miss rate: no. L2 misses / no. L2 references
    • Global miss rate: no. L2 misses / no. processor references
    • Solo miss rate: no. misses the L2 would incur without an L1 / no. processor references
    • Inclusion => solo miss rate = global miss rate
  • Miss penalty calculation
    • L1 miss rate x (miss-in-L1, hit-in-L2 penalty), plus
    • L2 global miss rate x (miss-in-L1, miss-in-L2 penalty − the L1-to-L2 penalty)

  14. Multi-Level Cache Example
  • Miss rates: L1 = 4%, L2 (global) = 1%
  • Delays: miss in L1, hit in L2 = 2 clocks; miss in L1, miss in L2 = 7 clocks
  • CPI loss due to L1 = 1.5 refs/instr x 0.04 x 2 clocks/miss = 0.12
  • CPI loss due to L2 = 1.5 refs/instr x 0.01 x (7 − 2) clocks/miss = 0.075
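The slide-13 penalty rule applied to these numbers, as a small Python sketch (not from the original deck):

    def multilevel_cpi_loss(refs_per_instr, l1_miss, l2_global_miss,
                            l2_hit_delay, l2_miss_delay):
        # L1 cost: every L1 miss pays the L2 access delay.
        l1_loss = refs_per_instr * l1_miss * l2_hit_delay
        # L2 cost: every global L2 miss pays the memory delay minus
        # the L1-to-L2 delay already charged above.
        l2_loss = refs_per_instr * l2_global_miss * (l2_miss_delay - l2_hit_delay)
        return l1_loss, l2_loss

    print(multilevel_cpi_loss(1.5, 0.04, 0.01, 2, 7))  # -> (0.12, 0.075)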

  15. Logical Inclusion
  • In multiprocessors it is important to know for certain that the L1 cache does NOT contain a line simply by determining that the L2 cache does not contain it
    • Reduces the latency and bandwidth consumed by snooping
  • For this purpose we need to ensure that all the contents of L1 are always in L2
  • We call this property logical inclusion

  16. Logical Inclusion
  • Techniques
    • Control cache size, organization, and policies
      • No. L2 sets >= no. L1 sets
      • L2 set size >= L1 set size
      • Compatible replacement algorithms
      • But this is highly restrictive and difficult to guarantee
    • Back-invalidation
      • Whenever a line is replaced or invalidated in L2, ensure that it is not present in L1, or evict it from L1

  17. Outline: Memory and Queuing Models
  • Physical memory
  • Memory technology
  • Simple memory performance models
    • Hellerman
    • Strecker

  18. Outline: Memory and Queuing Models
  • Basic queuing models
    • Terminology
    • Approximations
    • Key results
  • Application of queuing models to memory performance
    • Interleaved memory
    • Cache
    • Bus

  19. Memory
  • Processors are increasingly limited by memory rather than by processor organization or cycle time
  • Memory is characterized by 3 parameters:
    • size
    • access time (latency)
    • cycle time (bandwidth)

  20. Physical Memory System

  21. Achieved vs. Offered Bandwidth
  • Offered request rate: the rate at which the processor(s) would make requests if memory had unlimited bandwidth and no contention

  22. DRAM Technology (text section 6.2)
  • DRAM cell
    • Capacitor used to store charge for the 0/1 state
    • Transistor used to switch the capacitor onto the bit line
    • Charge decays over time => refresh required

  23. DRAM Technology (text section 6.2)
  • DRAM array
    • Stores 2^n bits in a square array
    • 2^(n/2) row lines connect to transistor gates
    • 2^(n/2) column bit lines, each with a sense amp
  • DRAM chip
    • Row and column addresses are muxed onto the same pins
    • Row/Column Strobes (RAS/CAS) provide timing

  24. Technology and Market Trends
  • Market trends
    • Demand for minimum/incremental memory growing only ~30% per year for PCs
    • Bandwidth requirements growing rapidly
      • driven by multimedia (esp. video and graphics) => caches do not help
  • Fill frequency captures this trend
    • the rate at which it is possible to read/write all of memory
    • Fill frequency = bandwidth [MB/sec] / memory size [MB]
    • Example PC: 500 MB/sec for 32 MB => 15.6 Hz fill frequency

  25. Technology and Market Trends
  • Technology trends
    • Capacity increases 60% per year
    • Access time decreases 7% per year
    • Cost/bit decreases 26% per year
  • As DRAM density increases, we need more bandwidth from fewer memory chips with few pins

  26. Techniques to Increase DRAM Bandwidth
  • Fast Page Mode
  • Synchronous DRAM
  • Rambus (RDRAM)

  27. Fast Page Mode (FPM)
  • Page mode => save the most recently accessed row (“page”)
    • Only need the column address and CAS to access within the page
  • Fast page mode
    • Counter built into the RAM for sequential accesses
    • Only need CAS for sequential accesses
    • With Extended Data-Out (EDO) can cycle at 33–50 MHz

  28. Synchronous DRAM (SDRAM)
  • Clocked (synchronous) interface enables pipelined access
  • Internally implemented with dual banks for higher throughput
  • Clock rates of 100–150 MHz possible
  • Double Data Rate (DDR) SDRAM
    • proposal to transfer 2 data bits per clock
    • target 200 Mb/s at 100 MHz
    • technically feasible, but demanding system requirements

  29. RAMBus DRAM (RDRAM)
  • Channel interface bus replaces row/column addressing
    • 9 data signals, 4 clock and control signals
    • Protocol for transferring address and data bursts
  • Clock frequency currently 266 MHz with 533 MB/s
  • Up to 32 RDRAMs per channel
  • Die area ~10% overhead for a 16 Mb DRAM
  • SLDRAM is an open standard with similar technology
  (From IEEE Micro, 11/97)

  32. Memory Module
  • The module consists of the DRAM chips that make up the physical memory word, plus memory controller, timing, and bus driver chips
  • If the DRAM is organized as 2^n words x b bits and the memory has p bits per physical word, then the module has p/b DRAM chips

  33. Memory Module
  • The module has 2^n words x p bits
  • Parity or an Error-Correcting Code (ECC) is generally required for error detection and availability
  • Text section 6.2.2 describes a code for Single-Error Correction, Double-Error Detection (SECDED)
    • Requires ~log2(p) + 1 extra bits
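The ~log2(p) + 1 figure can be checked against the exact count: SECDED needs the smallest k with 2^k >= p + k + 1 (the Hamming condition for single-error correction) plus one overall parity bit for double-error detection. A small Python sketch (not from the original deck):

    def secded_check_bits(data_bits):
        # Smallest k with 2**k >= data_bits + k + 1 (Hamming condition),
        # plus one overall parity bit for double-error detection.
        k = 1
        while 2 ** k < data_bits + k + 1:
            k += 1
        return k + 1

    for p in (8, 16, 32, 64):
        print(p, secded_check_bits(p))  # -> 5, 6, 7, 8 check bits

For p = 64 this gives the familiar (72, 64) memory word.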

  34. Memory System
  • Consists of multiple modules that are interleaved on low-order or high-order address bits (or both)
  • Low-order interleaving improves memory bandwidth
  • High-order interleaving improves availability
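A minimal Python sketch (not from the original deck) of the two address-to-module mappings; the 4-module, 1024-words-per-module configuration is hypothetical:

    def low_order_module(addr, m):
        # Low-order interleaving: consecutive addresses land in different
        # modules, so sequential streams use all m modules (bandwidth).
        return addr % m

    def high_order_module(addr, words_per_module):
        # High-order interleaving: each module holds one contiguous block,
        # so a failed module removes only one address range (availability).
        return addr // words_per_module

    print([low_order_module(a, 4) for a in range(6)])  # -> [0, 1, 2, 3, 0, 1]
    print([high_order_module(a, 1024) for a in (0, 1023, 1024, 4095)])  # -> [0, 0, 1, 3]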

  35. Processor Memory Model
  • Assume n processors each make one request per Tc to one of m memory modules; B(m,n) is the number of successes
  • Tc is the memory cycle time, Ta is the memory access time
  • To the memory, one processor making n requests per Tc behaves like n processors each making 1 request per Tc

  36. Basic Terms
  • B = B(m,n) or B(m) is the number of requests that succeed each Tc; it is the bandwidth normalized to Tc
  • Ts is a more general term for service time; here Tc = Ts (used in I/O models)
  • BW is the achieved bandwidth in requests serviced per second: BW = B/Ts = B(m,n)/Tc

  37. Modeling and Evaluation Methodology
  • Identify the relevant physical parameters
    • for memory: word size, module size, no. of modules, Tc, Ta
  • Find the offered bandwidth
    • n/Tc
  • Find the bottleneck
    • performance is limited by the most restrictive service point

  38. Modeling and Evaluation Methodology
  • Determine the type of reference pattern
    • sequential, stride, random
  • Select an appropriate model
    • use the simplest possible
  • Evaluate the achieved bandwidth
    • B(m,n) for memory

  39. Models for Computing B(m,n); text 6.3
  • Hellerman’s: B(m) = √m
    • Limited, unrealistic model
    • a single processor generates random references in order until a bank conflict
    • only of historical interest: “the square-root rule”
  • Strecker’s (the null binomial)
  • Queue models
    • open
    • closed
    • mixed

  40. Strecker’s Model
  • Model description
    • Each processor generates 1 reference per cycle
    • Requests are random and uniformly distributed across the modules
    • Any busy module serves 1 request
    • All unserviced requests are dropped each cycle; there are no queues
  • B(m,n) = m[1 − (1 − 1/m)^n]
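Strecker’s formula is easy to evaluate directly; a minimal Python sketch (not from the original deck), using hypothetical values m = 8 modules and n = 4 processors:

    def strecker_bandwidth(m, n):
        # Expected number of busy modules when n processors each issue
        # one uniformly random request per cycle and losers are dropped.
        return m * (1 - (1 - 1 / m) ** n)

    print(strecker_bandwidth(8, 4))  # -> ~3.31 requests serviced per Tc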

  41. Queuing Models, text 6.4
  [Figure: a queue with waiting time tw, service time ts, and total time t; from The Art of Computer Systems Performance Analysis, Raj Jain, Fig. 30.2]
  • Arrival process
  • Server
  • Occupancy/utilization
  • Time
  • Number of items

  42. Key Model Characteristics
  • Queuing models are characterized by a 3-tuple
    • arrival distribution
    • service time distribution
    • no. of servers
    • e.g., G/G/1 for general arrival and service distributions with one server
  • Arrival distributions we will use
    • MB, the binomial distribution
    • M, the Poisson distribution
      • inter-arrival times are exponentially distributed
      • limiting case of the binomial

  43. Key Model Characteristics
  • Service distributions we will use
    • M, exponential service time
    • D, deterministic (constant) service time
  • We will generally be looking for mean values
    • time in system, utilization

  44. Binomial Arrivals, text 6.4.1
  • Description
    • n items enter a system that has m modules
    • requests are randomly distributed across modules with equal probability 1/m (a Bernoulli trial)
  • Probability that k out of n requests go to a specific module:
    P_n(k) = C(n,k) p^k (1 − p)^(n−k)
  • m[1 − P_n(0)] = B(m,n) from Strecker’s model

  45. Poisson Distribution, text 6.4.1
  • Assume n and m are very large; then p is small, but np, the expected no. of arrivals at the server during T, stays finite
  • λ = np/T, so p = λT/n
  • Then P(k) = [(λT)^k / k!] e^(−λT)
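The limiting relationship is easy to see numerically; a small Python sketch (not from the original deck) holds np = λT = 0.8 fixed while n grows, and the binomial probability of an idle module approaches the Poisson value:

    import math

    def binomial_p0(n, p):
        # Probability a module receives none of the n requests.
        return (1 - p) ** n

    def poisson_p0(lam_T):
        # Poisson limit of the same probability: P(0) = e**(-lam*T).
        return math.exp(-lam_T)

    for n in (4, 16, 256):
        print(n, binomial_p0(n, 0.8 / n))  # -> 0.4096, 0.4401, 0.4488
    print(poisson_p0(0.8))                 # -> 0.4493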

  46. Queuing Properties, text 6.4.5
  • No. of items in system = items in queue + items in service
    • Mean values: N = Q + ρ
  • Little’s result: N = λT and Q = λTw
  • Service distributions
    • Coefficient of variation c = σs/Ts
    • For the exponential distribution, c = 1
    • For the deterministic (constant) distribution, c = 0

  47. Pollaczek-Khinchin (P-K) Theorem
  • For M/G/1:
    Mean waiting time Tw = (1/λ)[ρ²(1 + c²)/(2(1 − ρ))]
    Mean items in queue Q = λTw = ρ²(1 + c²)/(2(1 − ρ))
  • Cases of interest
    • For M/M/1, c² = 1:
      Tw = (1/λ)[ρ²/(1 − ρ)], Q = ρ²/(1 − ρ)
    • For M/D/1, c² = 0:
      Tw = (1/λ)[ρ²/(2(1 − ρ))], Q = ρ²/(2(1 − ρ))
    • For MB/D/1, c² = 0; define p = λ/m, where λ is the prob. that a source makes a request:
      Tw = (1/λ)[ρ(ρ − p)/(2(1 − ρ))], Q = ρ(ρ − p)/(2(1 − ρ))
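A minimal Python sketch (not from the original deck) of the M/G/1 mean values; the utilization ρ = 0.5 and arrival rate λ = 0.5 requests/cycle are hypothetical:

    def pk_queue(rho, c2, lam):
        # P-K: Tw = (1/lam) * rho**2 * (1 + c2) / (2 * (1 - rho)); Q = lam * Tw
        tw = (1 / lam) * rho ** 2 * (1 + c2) / (2 * (1 - rho))
        return tw, lam * tw

    print(pk_queue(0.5, 1, 0.5))  # M/M/1: Tw = 1.0 cycle, Q = 0.5
    print(pk_queue(0.5, 0, 0.5))  # M/D/1: Tw = 0.5 cycle, Q = 0.25

Note how removing service-time variance (M/D/1) halves the queue relative to M/M/1 at the same utilization.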

  48. Open vs. Closed Queues, text 6.5
  • Open queue models
    • Arrival process is independent of queue size
    • Processor is not slowed (or otherwise affected) by contention
    • Queue size is unbounded
  • Closed queue models
    • Arrival rate slows as the queue grows
    • If n items are offered and the system initially accepts only B, then the queue size is Q = n − B

  49. Open vs. Closed Queues, text 6.5
  • Mixed queue models
    • Buffers allow the arrival rate to continue without slowdown (like an open queue) until the queue size reaches a threshold, then the arrival rate starts slowing down (like a closed queue)
    • Used later for I/O

  50. Closed Queues, text 6.5.2
  • From the P-K theorem, the total queue size is
    N = ρa + [ρa²(1 + c²)/(2(1 − ρa))]
  • The offered occupancy is ρ = N = n/m, since n/m requests per Tc are offered to each module; ρa is the achieved occupancy
  • Tw = Ts[(ρ − ρa)/ρa]
  • B(m,n) = mρa; solving for B(m,n) in the asymptotic case:
    • for M/D/1:
      ρa = (ρ + 1) − √(ρ² + 1)
      B(m,n) = m + n − √(n² + m²)
    • for MB/D/1:
      B(m,n) = m + n − 1/2 − √((m + n − 1/2)² − 2mn)
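A minimal Python sketch (not from the original deck) of the two asymptotic closed-queue results, using the same hypothetical m = 8, n = 4 as the earlier open-model example for comparison:

    import math

    def closed_bandwidth_md1(m, n):
        # B(m,n) = m + n - sqrt(n**2 + m**2)
        return m + n - math.sqrt(n ** 2 + m ** 2)

    def closed_bandwidth_mbd1(m, n):
        # B(m,n) = m + n - 1/2 - sqrt((m + n - 1/2)**2 - 2*m*n)
        return m + n - 0.5 - math.sqrt((m + n - 0.5) ** 2 - 2 * m * n)

    print(closed_bandwidth_md1(8, 4))   # -> ~3.06
    print(closed_bandwidth_mbd1(8, 4))  # -> ~3.24

Both sit below Strecker’s open-model value of ~3.31 for the same m and n, since the closed models account for processors slowing down as their requests queue.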
