A Position-Insensitive Finished Store Buffer

Erika Gunadi and Mikko H. Lipasti • Department of Electrical and Computer Engineering • University of Wisconsin-Madison • http://www.ece.wisc.edu/~pharm

Motivation • As microprocessors get wider and deeper: • More in-flight stores • Need a larger store queue (SQ) • A larger SQ increases access time and power consumption • SQ access time must be <= D$ access time • Avoids replay in case of store-to-load forwarding

A Brief Store Queue Overview • Serves two main purposes: • To maintain the order of in-flight stores • To forward store data to later loads • Commonly designed as a circular buffer • Allocate entry on dispatch • Deallocate entry on retirement • Equipped with forwarding logic • CAM structure for address match • Select logic to pick the youngest older matching store

Store to Load Forwarding • Each load needs to search the store queue for any matching older stores • Forwarding logic consists of 3 components: • Store Address CAM • Select Logic • Store Data RAM [Figure: Store Address CAM, Select Logic, and Store Data RAM]
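The forwarding search described above can be sketched in software. This is a minimal behavioral model, not the hardware design from the slides: the class and method names are hypothetical, sequence numbers are left unbounded for simplicity (hardware would wrap them), and a backward walk stands in for the parallel CAM match followed by select logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SQEntry:
    addr: Optional[int] = None  # None until the store has executed
    data: Optional[int] = None


class StoreQueue:
    """Circular store queue in which position encodes age (head = oldest)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.entries: List[SQEntry] = [SQEntry() for _ in range(size)]
        self.head = 0  # sequence number of the oldest in-flight store
        self.tail = 0  # sequence number the next dispatched store receives

    def allocate(self) -> int:
        """Dispatch: take the next entry in program order."""
        if self.tail - self.head == self.size:
            raise RuntimeError("store queue full")
        seq = self.tail
        self.entries[seq % self.size] = SQEntry()
        self.tail += 1
        return seq

    def execute(self, seq: int, addr: int, data: int) -> None:
        """Record the address and data once the store has executed."""
        self.entries[seq % self.size] = SQEntry(addr, data)

    def retire(self) -> None:
        """Retirement: free the oldest entry."""
        if self.head == self.tail:
            raise RuntimeError("store queue empty")
        self.head += 1

    def forward(self, load_addr: int, load_tail: int) -> Optional[int]:
        """Return data from the youngest matching store older than the load.

        load_tail is the value of `tail` when the load was dispatched, so
        stores with sequence numbers in [head, load_tail) are older.
        """
        # Youngest-to-oldest walk models CAM match plus select logic.
        for seq in range(load_tail - 1, self.head - 1, -1):
            entry = self.entries[seq % self.size]
            if entry.addr == load_addr:
                return entry.data  # models the store data RAM read
        return None
```

In this model the select step is trivial because position equals age; that is the property the rest of the talk gives up in exchange for a smaller buffer.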

SQ Access Latency • Major components of latency: CAM and Select • CAM is scalable, Select is not

SQ Energy per Access • Major component of energy : CAM

Outline • Motivation and Background • Finished Store Buffer (FSB) • Initial Study • Details of Design • Methodology • Results • Conclusion

SQ Occupancy Study • Most of the time, <= 50% of in-flight stores are finished and waiting to retire • The number of waiting-to-retire stores does not scale linearly with the size of the OoO window • FSB sizes of 12, 20, 32, and 52 entries are used for window sizes of 128, 256, 512, and 1024

Finished Store Buffer • The forwarding logic only cares about waiting-to-retire stores • As shown, these make up less than 50% of in-flight stores • ROB can be used to track store order • Finished Store Buffer • Much smaller than conventional store queue • Does not maintain positional store ordering

FSB Diagram [Figure: pipeline stages Fetch, Dec, Rnm, Disp, Queue, Sched, Read, Exe, WB, Ret, with the FSB and the conventional SQ] • Allocate FSB entry at schedule • Deallocate FSB entry at retirement • FSB is maintained using a free-list • A store is issued only if there is an available entry
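The allocation policy on this slide can be illustrated with a small free-list model. The names below (`FinishedStoreBuffer`, `try_issue`, `retire`) are hypothetical and chosen for illustration; the sketch only shows that entries are handed out in arbitrary positions and that a store is held back when no entry is free.

```python
from typing import Dict, List, Optional, Tuple


class FinishedStoreBuffer:
    """Unordered buffer whose entries are managed with a free list."""

    def __init__(self, size: int) -> None:
        self.free: List[int] = list(range(size))       # free entry indices
        self.entries: Dict[int, Tuple[int, int]] = {}  # index -> (addr, data)

    def try_issue(self, addr: int, data: int) -> Optional[int]:
        """Allocate at schedule time; return None if the store must wait."""
        if not self.free:
            return None  # no entry available, so the store is not issued
        idx = self.free.pop()
        self.entries[idx] = (addr, data)
        return idx

    def retire(self, idx: int) -> None:
        """Deallocate at retirement and return the index to the free list."""
        del self.entries[idx]
        self.free.append(idx)
```

Because any free index may be reused, an entry's position says nothing about the store's age, which is why the forwarding logic needs a non-positional select.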

Forwarding Logic • Load checks the FSB for a matching store • FSB position does not reflect relative age • Non-positional select logic • Same problem as in a non-compacting scheduler • Solutions: Buyuktosunoglu [SOC 2002], Robery [US Patent], and Sassone [ISCA 2007] • A solution similar to Buyuktosunoglu's is used since it requires the fewest bits

Youngest Select Logic [Figure: worked example of the select logic for four stores and one load to address A, producing a one-hot select signal] • 4-entry FSB, 3-bit color (111: youngest, 000: oldest) • Modification • Add one more bit and simple reverse logic to handle wrap-around • Restructure the algorithm hierarchically so that checking happens in parallel
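One way to picture a color-based youngest select is a bit-serial maximum search over the color field, scanning from the most significant bit. The sketch below is an assumption-laden illustration rather than the circuit from the slides: it assumes store colors are unique, assumes a load carries a color strictly greater than every older store, and leaves out the extra wrap-around bit and the hierarchical restructuring. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

# One FSB entry: (valid, address, color, data)
Entry = Tuple[bool, int, int, int]


def youngest_select(colors: List[int], candidates: List[bool],
                    bits: int) -> List[bool]:
    """Return a select vector marking the candidate with the largest color."""
    alive = list(candidates)
    for b in range(bits - 1, -1, -1):  # scan from MSB to LSB
        has_one = [a and bool((c >> b) & 1) for a, c in zip(alive, colors)]
        if any(has_one):
            # Some surviving candidate has a 1 here; drop those with a 0.
            alive = has_one
    return alive  # one-hot if colors are unique, all False if no candidate


def fsb_forward(entries: List[Entry], load_addr: int, load_color: int,
                bits: int = 3) -> Optional[int]:
    """Forward data from the youngest older store matching load_addr."""
    colors = [color for _, _, color, _ in entries]
    # Address match (CAM) combined with an "older than the load" filter.
    candidates = [valid and addr == load_addr and color < load_color
                  for valid, addr, color, _ in entries]
    select = youngest_select(colors, candidates, bits)
    for chosen, (_, _, _, data) in zip(select, entries):
        if chosen:
            return data
    return None
```

For a 4-entry buffer holding stores to address A with colors 000, 001, 100, and 101, a load with color 101 sees three older candidates and the scan keeps only the entry with color 100.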

FSB Corner Cases • Deadlock avoidance • Happens when a store to issue is the oldest in the window and the FSB is full • Reserves an entry in the FSB for the oldest store • In-order retirement • Keeps the FSB index in the ROB entry, uses it to index into the FSB at retire • Branch misprediction • Assigns store color to each branch • Uses it to determine which FSB entries to invalidate
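The three corner cases can be sketched together in a small model. Several details here are assumptions made for illustration and are not stated on the slide: the caller supplies an `is_oldest` flag, a branch's color is taken to be the color the next store after it would receive (so entries with an equal or larger color are on the wrong path), and color wrap-around is ignored. The class name `ReservingFSB` is hypothetical.

```python
from typing import Dict, List, Optional


class ReservingFSB:
    """FSB model with one entry held in reserve for the oldest store."""

    def __init__(self, size: int) -> None:
        self.free: List[int] = list(range(size))
        self.color_of: Dict[int, int] = {}  # FSB index -> store color

    def try_issue(self, color: int, is_oldest: bool) -> Optional[int]:
        """Allocate an entry; the last free one goes only to the oldest store."""
        if not self.free:
            return None
        if len(self.free) == 1 and not is_oldest:
            return None  # keep the reserved entry to avoid deadlock
        idx = self.free.pop()
        self.color_of[idx] = color
        return idx  # the ROB entry would record this index

    def retire(self, idx: int) -> None:
        """In-order retirement: idx is read back from the ROB entry."""
        del self.color_of[idx]
        self.free.append(idx)

    def squash(self, branch_color: int) -> None:
        """Invalidate entries younger than a mispredicted branch."""
        wrong_path = [i for i, c in self.color_of.items() if c >= branch_color]
        for idx in wrong_path:
            self.retire(idx)  # same bookkeeping as freeing at retirement
```

With a 2-entry buffer, a younger store takes the first entry, a second younger store is refused, and the oldest store can still claim the reserved entry.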

Methodology • SimpleScalar/Alpha 3.0 tool set • Machine configuration • 12-stage pipeline, 4-wide machine • 128 ROB, 96 PRF • 32 LQ, 24 SQ, 32 scheduler • 2 integer ALUs, 1 mult/div, 1 memory port • I-Cache: 64KB, DM, 64B, 2-cycle • D-Cache: 64KB, 4-way, 64B, 3-cycle • L2: 2MB, 8-way, 128B, 8-cycle • Memory: 150-cycle

Modeling • To estimate timing and power for the select logic • Implemented in Verilog • Synthesized using Synopsys Design Compiler and LSI Logic’s gflxp 0.11 micron CMOS standard cell library • To estimate timing and power for RAM and CAM structures -> CACTI

Access Latency Comparison • Due to fewer entries, select logic for FSB is faster • CAM latency is similar

Energy per Access Comparison • Fewer entries -> less CAM power • Subarrays do not reduce energy, only latency

IPC Comparison (SPEC INT) • FSB: 12, 20, 32, and 52 entries for the four window sizes • FSB-min: the most aggressive size limit • To avoid stalls, needs only 20% * machine width * number of issue-to-retire stages • 5, 10, 20, and 40 entries for the four window sizes • Both FSB and FSB-min show less than 1% average slowdown
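The FSB-min sizing rule is simple arithmetic and can be written as a helper. The function name, the rounding up to a whole entry, and the stage count used in the usage note are assumptions for illustration; the slides give only the 20% factor and the resulting sizes, not the per-configuration widths or stage counts.

```python
def fsb_min_entries(machine_width: int, issue_to_retire_stages: int,
                    percent: int = 20) -> int:
    """Entries needed under the rule: percent * width * stages, rounded up."""
    total = percent * machine_width * issue_to_retire_stages
    return -(-total // 100)  # integer ceiling division
```

As a hypothetical usage, a 4-wide machine with 6 issue-to-retire stages gives `fsb_min_entries(4, 6)`, which evaluates to 5 entries; the stage count of 6 is an illustrative value, not one taken from the presentation.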

IPC Comparison (SPEC FP) • Sixtrack with 1024 ROB experiences 5% slowdown • Caused by retirement stalls on unfinished stores • Slowdown is less than 1% with 2 reservation slots • In some cases, FSB slightly outperforms the baseline IPC • Happens when the store queue size limits instruction dispatch in the baseline

Prior Work • SQIP [Sha, 2005] • Remove the associative search of SQ • Loads use store-set to predict the index of a forwarding SQ entry • Misprediction is detected by precommit re-execution, results in pipeline flush • ULB-LSQ [Sethumadhavan, 2007] • Unordered SQ, allocated at issue time • Similar to our approach • Differs in forwarding policy and overflow handling

Prior Work • [Franklin, 1996]: ARB in Multiscalar • [Sethumadhavan, 2003], [Park, 2003]: Filtering mechanism (bloom filter and store set) to reduce store queue access • [Baugh, 2004]: Decomposed store queue functionality, only stores in forwarding group need to be put into the forwarding buffer • [Torres, 2005]: 2-level SQ, predicted forwarding stores in L1, validation is done in L2 • [Roth, 2005]: SVW, breaking SQ functionality into RSQ and FSQ, validation is done using load re-execution • [Sha, 2005], [Stone, 2005]: SQIP and AIMD, removing the associative search capability from SQ • [Subramanian, 2006], [Sha, 2006]: FnF and NoSQ, eliminate the whole SQ, load re-execution for validation • [Sethumadhavan, 2007]: ULB-LSQ, unordered store queue that is allocated at issue time

Conclusion • FSB, an alternative way to build the SQ • Only contains finished stores • Much smaller • More scalable • Minimal IPC impact, < 1% • Lower power • Possibly higher frequency • FSB-min, a more aggressive approach • Also has minimal IPC impact • Future work • Load Queue • Better deadlock handling

Thank you Questions?
