Trading Cache Hit Rate for Memory Performance



Presentation Transcript


  1. Trading Cache Hit Rate for Memory Performance Wei Ding, Mahmut Kandemir, Diana Guttman, Adwait Jog, Chita R. Das, Praveen Yedlapalli The Pennsylvania State University

  2. Summary • Problem: Most data locality optimizations exclusively target cache locality, but "row-buffer locality" is also important; the problem is especially challenging in the case of irregular programs (sparse data) • Proposal: A compiler-runtime cooperative data layout optimization that improves row-buffer locality in irregular programs, yielding ~17% improvement in overall application performance

  3. Outline • Background • Motivation • Conservative Layout • Fine-grain Layout • Related Work • Evaluation • Conclusion

  4. DRAM Organization [Figure: memory hierarchy from the processor and memory controllers (MC) through channels to DIMMs, ranks, DRAM chips, and banks; each bank has a row buffer, the source of row-buffer locality]
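  To make row-buffer locality concrete, here is a minimal C sketch of per-bank row-buffer hit/miss detection. The address mapping (column bits low, then bank, then row) and the sizes are assumptions for illustration; real memory controllers interleave addresses in more elaborate ways.

      #include <stdio.h>
      #include <stdint.h>

      #define NUM_BANKS 8
      #define COL_BITS  13                 /* assumed 8 KB row buffer */

      static int64_t open_row[NUM_BANKS];  /* currently open row per bank */

      /* Returns 1 on a row-buffer hit, 0 on a miss (precharge + activate). */
      static int access_dram(uint64_t paddr)
      {
          uint64_t bank = (paddr >> COL_BITS) & (NUM_BANKS - 1);
          uint64_t row  = (paddr >> COL_BITS) / NUM_BANKS;
          if (open_row[bank] == (int64_t)row)
              return 1;                     /* same row already open: fast */
          open_row[bank] = (int64_t)row;    /* close old row, open new one */
          return 0;
      }

      int main(void)
      {
          for (int b = 0; b < NUM_BANKS; b++) open_row[b] = -1;
          /* Two accesses in the same row: miss, then hit.
             A far-away access opens a different row: miss again. */
          printf("%d %d %d\n", access_dram(0x0000),
                 access_dram(0x0040), access_dram(0x1000000));
          return 0;
      }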

  5. Irregular Programs

      Real X(num_nodes), Y(num_edges);
      Integer IA(num_edges, 2);
      for (t = 1; t < T; t++) {
        /* if it is time to update the interaction list ... */
        for (i = 0; i < num_edges; i++) {
          X(IA(i, 1)) = X(IA(i, 1)) + Y(i);
          X(IA(i, 2)) = X(IA(i, 2)) - Y(i);
        }
      }

  The contents of the index array IA, and hence which elements of X each iteration touches, are known only at runtime; this is what makes the program irregular.

  6. Inspector/Executor model

      /* Executor */
      Real X(num_nodes), Y(num_edges);
      Real X'(num_nodes), Y'(num_edges);
      Integer IA(num_edges, 2);
      for (t = 1; t < T; t++) {
        X', Y' = Trans(X, Y);
        for (i = 0; i < num_edges; i++) {
          X'(IA(i, 1)) = X'(IA(i, 1)) + Y'(i);
          X'(IA(i, 2)) = X'(IA(i, 2)) - Y'(i);
        }
      }

      /* Inspector */
      Trans(X, Y):
        for (i = 0; i < num_edges; i++) {
          /* data reordering algorithms */
        }
        return (X', Y')

  The model is typically used for identifying parallelism or improving cache locality.
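  As one illustration of what Trans(X, Y) might do, below is a minimal C sketch of a first-touch inspector, a common runtime reordering used with this model. The function names, the first-touch policy, and the 0-based indexing are illustrative; this is not the layout algorithm the talk proposes.

      /* Renumber nodes in the order the edge list first touches them. */
      void inspect(int num_nodes, int num_edges,
                   const int IA[][2], int new_id[])
      {
          int next = 0;
          for (int v = 0; v < num_nodes; v++) new_id[v] = -1;
          for (int i = 0; i < num_edges; i++)
              for (int e = 0; e < 2; e++)
                  if (new_id[IA[i][e]] < 0)
                      new_id[IA[i][e]] = next++;
          for (int v = 0; v < num_nodes; v++)    /* untouched nodes go last */
              if (new_id[v] < 0) new_id[v] = next++;
      }

      /* Apply the permutation: produce X' and rewrite the index array. */
      void transform(int num_nodes, int num_edges,
                     const double X[], double Xprime[],
                     int IA[][2], const int new_id[])
      {
          for (int v = 0; v < num_nodes; v++) Xprime[new_id[v]] = X[v];
          for (int i = 0; i < num_edges; i++) {
              IA[i][0] = new_id[IA[i][0]];
              IA[i][1] = new_id[IA[i][1]];
          }
      }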

  7. Outline • Background • Motivation • Conservative Layout • Fine-grain Layout • Related Work • Evaluation • Conclusion

  8. Row-buffer Locality • Prior works that target irregular applications focus exclusively on improving cache locality • No efforts to improve row-buffer locality • Typical latencies (based on an AMD architecture): Last Level Cache (LLC) hit = 28 cycles; row-buffer hit = 90 cycles; row-buffer miss = 350 cycles • Application performance is dictated not only by the cache hit rate, but also by the row-buffer hit rate
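  A back-of-the-envelope calculation shows why. Using the latencies above, and assuming for illustration a 90% LLC hit rate, raising the row-buffer hit rate from 20% to 80% cuts average memory access time (AMAT) substantially even though the cache behavior is unchanged:

      #include <stdio.h>

      int main(void)
      {
          const double llc_hit = 28, rb_hit = 90, rb_miss = 350; /* cycles */
          const double cache_hr = 0.90;            /* assumed LLC hit rate */
          const double rb_hrs[] = { 0.20, 0.80 };  /* assumed RB hit rates */
          for (int i = 0; i < 2; i++) {
              double mem  = rb_hrs[i] * rb_hit + (1 - rb_hrs[i]) * rb_miss;
              double amat = cache_hr * llc_hit + (1 - cache_hr) * mem;
              printf("row-buffer hit rate %2.0f%% -> AMAT %.1f cycles\n",
                     100 * rb_hrs[i], amat);       /* 55.0 vs. 39.4 cycles */
          }
          return 0;
      }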

  9. Example [Figure: three candidate data layouts (a), (b), and (c)] • Layout (b) eliminates the row-buffer miss caused by accessing ‘y’, assuming this move will not cause any additional cache misses • Layout (c) eliminates the row-buffer misses caused by accessing ‘v’, even at the cost of an additional cache miss

  10. Outline • Background • Motivation • Conservative Layout • Fine-grain Layout • Related Work • Evaluation • Conclusion

  11. Notations • Seq: the sequence of data elements obtained by traversing the index array • αx: the access to a particular data element x in Seq • time(αx): the "logical time stamp" of αx in Seq • βx: the memory block where data element x resides • α'x: the "most recent access" to βx before αx • Caches(βx): the set of cache blocks to which βx can be mapped in a k-way set-associative cache

  12. Definition • Block Distance: Given Caches(βx) = Caches(βy), the block distance between αx and αy, denoted Δ(αy, αx), is the number of "distinct" memory blocks that are mapped to Caches(βx) and accessed during the time period between time(αx) and time(αy)
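  A direct (quadratic, purely illustrative) C sketch of this definition follows. The 64-byte block size, the set-index mapping, and the choice to exclude x's own block from the count are assumptions:

      #define BLOCK_SHIFT 6    /* assumed 64-byte memory blocks */
      #define NUM_SETS    64   /* assumed number of cache sets  */

      /* Delta(a_y, a_x): distinct memory blocks, other than x's own,
         that map to x's cache set and are accessed at times strictly
         between tx and ty in Seq (seq[] holds element addresses). */
      int block_distance(const unsigned long seq[], int tx, int ty)
      {
          unsigned long bx = seq[tx] >> BLOCK_SHIFT;
          unsigned long seen[1024];
          int n = 0;
          for (int t = tx + 1; t < ty; t++) {
              unsigned long b = seq[t] >> BLOCK_SHIFT;
              if (b == bx || b % NUM_SETS != bx % NUM_SETS)
                  continue;                  /* same block or different set */
              int dup = 0;
              for (int j = 0; j < n && !dup; j++)
                  if (seen[j] == b) dup = 1;
              if (!dup && n < 1024) seen[n++] = b;
          }
          return n;
      }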

  13. Lemma • Locality Set: A set of data elements, denoted Ω, forms a locality set if and only if: (1) ∀x, y ∈ Ω: βx = βy (all elements of Ω reside in the same memory block); (2) ∀x, y ∈ Ω, ∀αx, αy: Δ(αy, αx) ≤ k (the block distance between any pair of elements in the set is at most k); (3) ∀x ∈ Ω, ∀y ∉ Ω, ∀αx, αy: Δ(αy, αx) > k (the block distance between an element and a non-element exceeds k) • Non-increased Cache Misses: Moving Ω from βx to βy will not increase the total number of cache misses in Seq if Caches(βy) = Caches(βx)

  14. Conservative Layout • Objective: increase row-buffer hit rate without affecting cache performance • Algorithm: (1) identifying the locality sets, (2) constructing the interference graph, (3) assigning rows in memory

  15. 1. Identifying the Locality Sets • Traversing the index array, for each cache set we maintain a list of the most recent accesses to k different memory blocks • The block distance between the current access and any other access on the list is never greater than k • During this traversal, x and y are placed into the same locality set only when βx = βy
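  One way to realize this traversal, sketched in C below: keep, per cache set, an LRU stack of the k most recently accessed distinct memory blocks, so an access whose block is still on the stack has block distance at most k. The parameters and the stand-alone `touch` helper are illustrative.

      #define K           8    /* assumed cache associativity k */
      #define NUM_SETS    64
      #define BLOCK_SHIFT 6

      static unsigned long lru[NUM_SETS][K]; /* per-set LRU stack of blocks */
      static int depth[NUM_SETS];

      /* Returns 1 if addr's block was within the top k of its set's stack
         (block distance <= k since its last access), 0 otherwise. */
      int touch(unsigned long addr)
      {
          unsigned long b = addr >> BLOCK_SHIFT;
          int s = (int)(b % NUM_SETS);
          int pos = -1;
          for (int i = 0; i < depth[s]; i++)
              if (lru[s][i] == b) { pos = i; break; }
          int hit = (pos >= 0);
          if (pos < 0) pos = (depth[s] < K) ? depth[s]++ : K - 1;
          for (int i = pos; i > 0; i--)      /* move b to the top */
              lru[s][i] = lru[s][i - 1];
          lru[s][0] = b;
          return hit;
      }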

  16. 2. Constructing the Interference Graph • Each node represents a locality set • If αx and αy are two accesses that incur successive cache misses in Seq, and x and y are located in different rows, then an edge is added between the locality sets of x and y • The weight on this edge represents the total number of such (αx, αy) pairs
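  A minimal C sketch of this pass. The dense weight matrix, the precomputed miss flags, and the set_of/row_of maps are illustrative stand-ins for whatever structures the real implementation uses:

      #define NUM_LSETS 1024   /* assumed number of locality sets */

      static long weight[NUM_LSETS][NUM_LSETS];  /* interference edges */

      /* seq[]: element ids in access order; miss[t] != 0 if access t is
         a cache miss; set_of[]/row_of[] map an element to its locality
         set and its current memory row. */
      void build_interference(const int seq[], const char miss[], int n,
                              const int set_of[], const int row_of[])
      {
          int prev = -1;                 /* element of the previous miss */
          for (int t = 0; t < n; t++) {
              if (!miss[t]) continue;
              int cur = seq[t];
              if (prev >= 0 && row_of[prev] != row_of[cur])
                  weight[set_of[prev]][set_of[cur]]++;  /* add/bump edge */
              prev = cur;
          }
      }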

  17. 3. Assigning Rows in Memory • Sort the edges in the interference graph in decreasing order of weight • Assign the same row to the locality sets connected by the edge with the largest weight
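  A greedy C sketch of this step, assuming a row capacity in elements and first-fit placement into an endpoint's existing row; the slides do not spell out tie-breaking or capacity handling, so those details are assumptions.

      #include <stdlib.h>

      typedef struct { int u, v; long w; } Edge;

      static int by_weight_desc(const void *a, const void *b)
      {
          long d = ((const Edge *)b)->w - ((const Edge *)a)->w;
          return (d > 0) - (d < 0);
      }

      /* Greedily co-locate heavily interfering locality sets in one row. */
      void assign_rows(Edge e[], int ne, int nsets, int row_cap,
                       const int size[], int row_of[])
      {
          int used[8192] = {0}, next = 0;      /* elements used per row */
          for (int s = 0; s < nsets; s++) row_of[s] = -1;
          qsort(e, ne, sizeof(Edge), by_weight_desc);
          for (int i = 0; i < ne; i++) {
              int u = e[i].u, v = e[i].v;
              if (row_of[u] < 0 && row_of[v] < 0 &&
                  size[u] + size[v] <= row_cap) {
                  row_of[u] = row_of[v] = next;
                  used[next++] = size[u] + size[v];
              } else if (row_of[u] >= 0 && row_of[v] < 0 &&
                         used[row_of[u]] + size[v] <= row_cap) {
                  row_of[v] = row_of[u]; used[row_of[u]] += size[v];
              } else if (row_of[v] >= 0 && row_of[u] < 0 &&
                         used[row_of[v]] + size[u] <= row_cap) {
                  row_of[u] = row_of[v]; used[row_of[v]] += size[u];
              }
          }
          for (int s = 0; s < nsets; s++)      /* leftovers: fresh rows */
              if (row_of[s] < 0) { row_of[s] = next; used[next++] = size[s]; }
      }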

  18. Outline • Background • Motivation • Conservative Layout • Fine-grain Layout • Related Work • Evaluation • Conclusion

  19. Fine-grain Layout • Partition: Given x ∈ Ω, a partition for x is defined as a subset of Ω, denoted Px, where x ∈ Px • Basic Idea: Whenever the accesses to two data elements x and y incur successive cache misses and x and y reside in different rows in memory, try to find two partitions Px and Py such that, when Px and Py are placed into the same row, the increased cache miss latency is less than the reduced row-buffer miss latency

  20. Algorithm • Constructing the Interference Graph • Constructing the Locality Graphs • Finding Partitions • Assigning Rows in Memory

  21. 1. Constructing the Interference Graph • Each node in the interference graph represents a data element • If αx and αy are two accesses that incur successive cache misses in Seq, and x and y are located in different rows, then we set up an edge between x and y • The weight on the edge represents the number of such (αx, αy) pairs

  22. 2. Constructing the Locality Graphs • Each locality set Ω has a locality graph, where each node is a data element in Ω • For any access αu whose block reuse distance is exactly k, if there exist αx and αy within the time slot [time(α'u), time(αu)] such that x, y, and u belong to the same locality set, then we increase the weight of the edge between x and y by 1 • If we move all the elements in a partition for x to another memory block β', such that Caches(β') = Caches(βx), then the number of increased cache misses is at most the sum of the weights of the edges connected to x in the locality graph
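  A C sketch of the weight update. It assumes window[] already holds the elements of u's locality set accessed in [time(α'u), time(αu)], and uses a dense weight matrix; both are illustrative.

      #define MAX_ELEMS 512   /* assumed bound on data elements */

      static long lg_weight[MAX_ELEMS][MAX_ELEMS];  /* locality-graph edges */

      /* Called when access a_u has block reuse distance exactly k:
         every pair (x, y) in the window gets its edge bumped, since
         separating x or y from the block could turn this borderline
         reuse of u into one extra cache miss. */
      void bump_locality_edges(const int window[], int n)
      {
          for (int i = 0; i < n; i++)
              for (int j = i + 1; j < n; j++)
                  lg_weight[window[i]][window[j]]++;
      }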

  23. 3. Finding Partitions • Sort the edges in the interference graph in decreasing order of weight • We first consider isolating x and y from their locality sets, i.e., placing only x into Px and only y into Py • We then add data elements connected to x into Px and elements connected to y into Py as long as (N − Nrb) × trb > Nch × tch holds, i.e., as long as the row-buffer miss latency saved still exceeds the cache miss latency added
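  A minimal C encoding of that profitability test, with the latency constants taken from slide 8 and the cost of an extra cache miss pessimistically charged as a full row-buffer miss; the exact constants and accounting the authors use are not given on the slide.

      enum { T_RB_HIT = 90, T_RB_MISS = 350 };   /* cycles, from slide 8 */

      /* N:    successive-miss pairs covered by the edge (its weight)
         n_rb: pairs that would still miss the row buffer after the move
         n_ch: bound on extra cache misses (locality-graph edge weights)
         Grow Px and Py only while the row-buffer cycles saved exceed
         the cache-miss cycles added. */
      int profitable(long N, long n_rb, long n_ch)
      {
          long saved = (N - n_rb) * (T_RB_MISS - T_RB_HIT);
          long added = n_ch * T_RB_MISS;   /* pessimistic miss cost */
          return saved > added;
      }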

  24. 4. Assigning Rows in Memory • Each partition is assigned to a memory block in a row

  25. Example

  26. Related Work • Inspector/Executor model: typically used for parallelism (Lawrence Rauchwerger [1]) and cache locality (Chen Ding [2]); we use it to improve row-buffer locality, and our approach is complementary to those efforts • Row-buffer locality: compiler approach (Mary W. Hall [3]); hardware approach (Al Davis [4]); our work specifically targets irregular applications

  27. Outline • Background • Motivation • Conservative Layout • Fine-grain Layout • Related Work • Evaluation • Conclusion

  28. Evaluation [Tables: platform configuration (modeled in gem5) and benchmarks]

  29. Simulation Results [Chart: performance improvements of 6%, 15%, 12%, 27%, and 17%]

  30. Conclusion • Exploiting row-buffer locality is critical for application performance • We proposed two compiler-directed data layout organizations aimed at improving row-buffer locality in irregular applications: the conservative layout, which leaves cache performance unaffected, and the fine-grain layout, which trades some cache performance for row-buffer locality

  31. Thank You • Questions?

  32. References • [1] "Sensitivity Analysis for Automatic Parallelization on Multi-Cores", ICS 2007 • [2] "Improving Cache Performance in Dynamic Applications through Data and Computation Reorganization at Run Time", PLDI 1999 • [3] "A Compiler Algorithm for Exploiting Page-Mode Memory Access in Embedded DRAM Devices", MSP 2002 • [4] "Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement", ASPLOS 2010

  33. Backup Slides

  34. Results with AMD-based system

  35. Memory Scheduling
