A Position-Insensitive Finished Store Buffer

Presentation Transcript


  1. A Position-Insensitive Finished Store Buffer Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin—Madison http://www.ece.wisc.edu/~pharm

  2. Motivation • As microprocessors get wider and deeper: • More stores are in flight • A larger store queue (SQ) is needed • A larger SQ increases access time and power consumption • SQ access time must be <= D$ access time • This avoids replays when a load is satisfied by store-to-load forwarding

  3. A Brief Store Queue Overview • Serves two main purposes: • Maintains the order of in-flight stores • Forwards store data to later loads • Commonly designed as a circular buffer • Entry allocated at dispatch • Entry deallocated at retirement • Equipped with forwarding logic • CAM structure for address matching • Select logic to pick the youngest matching store that is older than the load
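
The circular-buffer organization described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' model: the class and method names (`CircularStoreQueue`, `allocate`, `execute`, `retire`) are hypothetical, and the sketch only shows how entry position encodes program order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SQEntry:
    valid: bool = False
    addr: Optional[int] = None   # filled in when the store executes
    data: Optional[int] = None


class CircularStoreQueue:
    """Conventional SQ sketch: entry position encodes program order."""

    def __init__(self, size: int) -> None:
        self.entries: List[SQEntry] = [SQEntry() for _ in range(size)]
        self.head = 0    # index of the oldest in-flight store
        self.tail = 0    # index of the next entry to allocate
        self.count = 0

    def allocate(self) -> Optional[int]:
        """Called at dispatch. Returns the entry index, or None if full."""
        if self.count == len(self.entries):
            return None  # dispatch would have to stall
        idx = self.tail
        self.entries[idx] = SQEntry(valid=True)
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return idx

    def execute(self, idx: int, addr: int, data: int) -> None:
        """Called when the store's address and data become available."""
        self.entries[idx].addr = addr
        self.entries[idx].data = data

    def retire(self) -> SQEntry:
        """Called at retirement. Frees and returns the oldest entry."""
        assert self.count > 0, "no store to retire"
        entry = self.entries[self.head]
        self.entries[self.head] = SQEntry()
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return entry
```

Because entries are allocated at dispatch, every in-flight store occupies a slot even before it has executed, which is why the queue must grow with the instruction window.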

  4. Store-to-Load Forwarding • Each load needs to search the store queue for any matching older stores • Forwarding logic consists of 3 components: • Store Address CAM • Select Logic • Store Data RAM
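
The three forwarding components can be illustrated with a small behavioral sketch. It assumes a conventional, position-ordered queue in which list index 0 is the oldest store; the function name `forward` and the `older_count` parameter (how many stores precede the load in program order) are hypothetical names for illustration only.

```python
from typing import List, Optional, Tuple

# Each store is (address, data); list position is program order, 0 = oldest.
Store = Tuple[int, int]


def forward(stores: List[Store], load_addr: int, older_count: int) -> Optional[int]:
    """Return forwarded data for a load, or None if no older store matches."""
    # 1. Store Address CAM: compare the load address with every older store.
    match = [
        i < older_count and addr == load_addr
        for i, (addr, _) in enumerate(stores)
    ]

    # 2. Select logic: pick the youngest matching store (highest position).
    selected = None
    for i in range(len(stores) - 1, -1, -1):
        if match[i]:
            selected = i
            break

    # 3. Store Data RAM: read the data of the selected entry.
    return None if selected is None else stores[selected][1]
```

In hardware the CAM comparison is done for all entries at once and the select step is a priority encoder over the match vector; the loop above only models the result.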

  5. SQ Access Latency • Major components of latency: CAM and Select • CAM is scalable, Select is not

  6. SQ Energy per Access • Major component of energy: CAM

  7. Outline • Motivation and Background • Finished Store Buffer (FSB) • Initial Study • Details of Design • Methodology • Results • Conclusion

  8. SQ Occupancy Study • Most of the time, <= 50% of in-flight stores are finished and waiting to retire • The number of waiting-to-retire stores does not scale linearly with the size of the OoO window • FSB sizes of 12, 20, 32, and 52 entries are used for window sizes of 128, 256, 512, and 1024, respectively

  9. Finished Store Buffer • The forwarding logic only cares about waiting-to-retire stores • As shown, these are less than 50% of in-flight stores • The ROB can be used to track store order • Finished Store Buffer • Much smaller than a conventional store queue • Does not maintain positional store ordering

  10. FSB Diagram [Figure: pipeline stages Fetch, Dec, Rnm, Disp, Queue, Sched, Read, Exe, WB, Ret, with spans labeled "FSB" and "Conventional SQ"] • Allocate FSB entry at schedule • Deallocate FSB entry at retirement • FSB is maintained using a free-list • A store is issued only if there is an available entry
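
The allocation policy on this slide can be sketched as a small free-list model. This is an assumption-laden illustration rather than the authors' design: the class name `FinishedStoreBuffer` and its methods are hypothetical, and entries hold only an address and data value.

```python
from typing import Dict, List, Optional, Tuple


class FinishedStoreBuffer:
    """FSB sketch: entries come from a free list, so position carries no age."""

    def __init__(self, size: int) -> None:
        self.free_list: List[int] = list(range(size))
        self.entries: Dict[int, Tuple[int, int]] = {}  # index -> (addr, data)

    def can_issue_store(self) -> bool:
        """A store may be scheduled only if an entry is available."""
        return bool(self.free_list)

    def allocate(self, addr: int, data: int) -> Optional[int]:
        """Called at schedule. Returns the FSB index, or None if full."""
        if not self.free_list:
            return None
        idx = self.free_list.pop()
        self.entries[idx] = (addr, data)
        return idx

    def retire(self, idx: int) -> Tuple[int, int]:
        """Called at retirement. Frees the entry and returns its contents."""
        entry = self.entries.pop(idx)
        self.free_list.append(idx)
        return entry
```

Any free index may be handed out, so two stores can sit in the buffer in the opposite order to their program order. That is the reason the forwarding logic cannot rely on position to find the youngest older store.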

  11. Forwarding Logic • Load checks the FSB for a matching store • FSB position does not reflect relative age • Requires non-positional select logic • The same problem arises in a non-compacting scheduler • Solutions: Buyuktosunoglu [SOC 2002], Robery [US Patent], and Sassone [ISCA 2007] • A solution similar to Buyuktosunoglu's is used, since it requires the fewest bits

  12. Youngest Select Logic [Figure: worked example with entries labeled st A 000, st A 001, st A 100, st A 101 and ld A 101; per-bit signals A2[3:0], A1[3:0], A0[3:0] and S[3:0] are combined into a one-hot select signal] • 4-entry FSB, 3-bit color (111: youngest, 000: oldest) • Modifications: • Add one more bit and simple reverse logic to handle wrap-around • Restructure the algorithm hierarchically so that checking happens in parallel
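
One way to picture age selection by color bits is a bit-serial scan from the most significant color bit down, keeping only candidates that have a 1 whenever any candidate does. The sketch below is an assumption about the general technique, not a transcription of the slide's circuit: it assumes the match vector already excludes stores younger than the load, that colors are unique, and it ignores the wrap-around bit. The name `youngest_select` is hypothetical.

```python
from typing import List


def youngest_select(colors: List[int], match: List[bool], color_bits: int = 3) -> List[bool]:
    """Return a one-hot vector marking the matching entry with the largest color.

    colors[i] is the color of FSB entry i (larger = younger); match[i] is the
    CAM result for entry i. If nothing matches, the result is all False.
    """
    candidates = list(match)
    for bit in range(color_bits - 1, -1, -1):
        # Candidates whose color has a 1 in this bit position.
        has_bit = [c and bool((col >> bit) & 1) for c, col in zip(candidates, colors)]
        # If any candidate has a 1 here, those with a 0 cannot be the youngest.
        if any(has_bit):
            candidates = has_bit
    return candidates


# Entries are deliberately out of age order, as in a free-list-managed buffer.
colors = [0b000, 0b101, 0b100, 0b001]
print(youngest_select(colors, [True, True, True, True]))   # entry 1 (color 101)
print(youngest_select(colors, [True, False, True, True]))  # entry 2 (color 100)
```

Each step depends only on one bit column of the colors, which suggests why the approach needs few bits per entry compared with storing a full age matrix.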

  13. FSB Corner Cases • Deadlock avoidance • Deadlock happens when the store to issue is the oldest in the window and the FSB is full • Reserves an entry in the FSB for the oldest store • In-order retirement • Keeps the FSB index in the ROB entry and uses it to index the FSB at retirement • Branch misprediction • Assigns a store color to each branch • Uses it to determine which FSB entries to invalidate
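
Two of these corner cases can be sketched as simple predicates. The sketch makes assumptions the slide does not state: exactly one entry is reserved by default, a branch's color equals the color of the last store before it, and colors compare as plain integers with no wrap-around. The function names are hypothetical.

```python
from typing import Dict, List


def may_issue_store(free_entries: int, is_oldest: bool, reserved: int = 1) -> bool:
    """Deadlock avoidance: hold back `reserved` entries for the oldest store."""
    if is_oldest:
        return free_entries > 0
    # Any other store must leave the reserved entries untouched.
    return free_entries > reserved


def entries_to_invalidate(entry_colors: Dict[int, int], branch_color: int) -> List[int]:
    """Branch misprediction: squash FSB entries younger than the branch.

    entry_colors maps FSB index -> store color (larger = younger).
    """
    return sorted(idx for idx, color in entry_colors.items() if color > branch_color)


print(may_issue_store(free_entries=1, is_oldest=False))   # False: last entry is reserved
print(may_issue_store(free_entries=1, is_oldest=True))    # True: oldest store may use it
print(entries_to_invalidate({0: 3, 1: 5, 2: 6}, branch_color=4))  # [1, 2]
```

The reservation guarantees that the oldest store can always make progress, so retirement eventually frees entries for the younger stores that were held back.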

  14. Methodology • SimpleScalar/Alpha 3.0 tool set • Machine configuration • 12-stage pipeline, 4-wide machine • 128-entry ROB, 96-entry PRF • 32-entry LQ, 24-entry SQ, 32-entry scheduler • 2 integer ALUs, 1 mult/div, 1 memory port • I-Cache: 64KB, direct-mapped, 64B lines, 2-cycle • D-Cache: 64KB, 4-way, 64B lines, 3-cycle • L2: 2MB, 8-way, 128B lines, 8-cycle • Memory: 150-cycle

  15. Modeling • To estimate timing and power for the select logic • Implemented in Verilog • Synthesized using Synopsys Design Compiler and LSI Logic’s gflxp 0.11 micron CMOS standard cell library • To estimate timing and power for RAM and CAM structures -> CACTI

  16. Access Latency Comparison • Due to fewer entries, select logic for FSB is faster • CAM latency is similar

  17. Energy per Access Comparison • Fewer entries -> less CAM power • Subarrays do not reduce energy, only latency

  18. IPC Comparison (SPEC INT) • FSB: 12, 20, 32, and 52 entries for the different window sizes • FSB-min: the most aggressive size limit • To avoid stalls, needs only 20% * machine width * number of issue-to-retire stages entries • 5, 10, 20, and 40 entries for the different window sizes • Both FSB and FSB-min show less than 1% average slowdown
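
The FSB-min sizing rule is a simple product, sketched below. The slide does not give the machine width or the number of issue-to-retire stages used for each window size, nor how fractional results are rounded, so the inputs in the usage lines are made-up values and rounding up is an assumption. The function and parameter names are hypothetical.

```python
def fsb_min_entries(machine_width: int, issue_to_retire_stages: int, percent: int = 20) -> int:
    """Entries needed: percent% * machine width * issue-to-retire stages, rounded up."""
    product = percent * machine_width * issue_to_retire_stages
    # Integer ceiling division avoids floating-point rounding surprises.
    return (product + 99) // 100


# Illustrative inputs only; these are not the configurations from the slides.
print(fsb_min_entries(machine_width=4, issue_to_retire_stages=5))  # 4
print(fsb_min_entries(machine_width=8, issue_to_retire_stages=5))  # 8
```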

  19. IPC Comparison (SPEC FP) • Sixtrack with a 1024-entry ROB experiences a 5% slowdown • Caused by retirement stalls on unfinished stores • Slowdown is less than 1% with 2 reservation slots • In some cases, FSB slightly outperforms the baseline IPC • Happens when the store queue size limits instruction dispatch in the baseline

  20. Prior Work • SQIP [Sha, 2005] • Remove the associative search of SQ • Loads use store-set to predict the index of a forwarding SQ entry • Misprediction is detected by precommit re-execution, results in pipeline flush • ULB-LSQ [Sethumadhavan, 2007] • Unordered SQ, allocated at issue time • Similar to our approach • Differs in forwarding policy and overflow handling

  21. Prior Work • [Franklin, 1996]: ARB in Multiscalar • [Sethumadhavan, 2003], [Park, 2003]: Filtering mechanism (bloom filter and store set) to reduce store queue access • [Baugh, 2004]: Decomposed store queue functionality, only stores in forwarding group need to be put into the forwarding buffer • [Torres, 2005]: 2-level SQ, predicted forwarding stores in L1, validation is done in L2 • [Roth, 2005]: SVW, breaking SQ functionality into RSQ and FSQ, validation is done using load re-execution • [Sha, 2005], [Stone, 2005]: SQIP and AIMD, removing the associative search capability from SQ • [Subramanian, 2006], [Sha, 2006]: FnF and NoSQ, eliminate the whole SQ, load re-execution for validation • [Sethumadhavan, 2007]: ULB-LSQ, unordered store queue that is allocated at issue time

  22. Conclusion • FSB, an alternative way to build the SQ • Only contains finished stores • Much smaller • More scalable • Minimal IPC impact, < 1% • Lower power • Possibly higher frequency • FSB-min, a more aggressive approach • Also has minimal IPC impact • Future work • Load queue • Better deadlock handling

  23. Thank you Questions?
