
Dynamically Trading Frequency for Complexity in a GALS Microprocessor


Presentation Transcript


  1. Dynamically Trading Frequency for Complexity in a GALS Microprocessor Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott University of Rochester

  2. The gist of the paper… Radical idea: Trade off frequency and hardware complexity dynamically at runtime rather than statically at design time. The new twist: A Globally-Asynchronous, Locally-Synchronous (GALS) microarchitecture is key to making this worthwhile.

  3. Application phase behavior • Varying behavior over time • Can exploit to save power • [Figure: per-interval L1I, L1D, and L2 misses, branch mispredictions, IPC, energy, and adaptive issue queue size for gcc] [Sherwood, Sair, Calder, ISCA 2003] [Buyuktosunoglu, et al., GLSVLSI 2001]

  4. What about performance?

  RAM delay:   entries          32     24     16     8
               relative delay   1.00   0.77   0.52   0.31

  CAM delay:   entries          32     24     16     8
               relative delay   1.00   0.77   0.55   0.34

  Lower power and faster access time! [Buyuktosunoglu, GLSVLSI 2001]

  5. What about performance? How do we exploit the faster speed? • Variable latency • Increase frequency when downsizing • Decrease frequency when upsizing
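
A rough way to picture the frequency/size trade-off from this slide (an illustrative model, not the paper's mechanism) is to scale a domain's clock inversely with the relative access delay of its downsized structure, using the CAM delays from slide 4. The function and base frequency below are assumptions; the real design also pays frequency and pipeline penalties (slide 16).

```python
# Illustrative first-order model: a smaller issue queue has a shorter critical
# path, so the domain clock can rise roughly in inverse proportion to the
# relative CAM delay. The delays are from slide 4; everything else is assumed.

CAM_RELATIVE_DELAY = {32: 1.00, 24: 0.77, 16: 0.55, 8: 0.34}

def domain_frequency(base_freq_ghz, queue_entries):
    """Estimate the attainable domain frequency for a given queue size."""
    return base_freq_ghz / CAM_RELATIVE_DELAY[queue_entries]

# Example: in this simple model, shrinking from 32 to 16 entries would allow
# domain_frequency(1.0, 16), i.e. roughly 1.8x the base clock.
```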

  6. What about performance? [Figure: fully synchronous pipeline driven by a single clock – fetch unit, branch predictor, and L1 I-cache; dispatch/rename/ROB; integer and FP issue queues with ALUs & RF; load/store unit, L1 D-cache, L2 cache, and main memory] [Albonesi, ISCA 1998]

  7. What about performance? [Albonesi, ISCA 1998]

  8. Enter GALS… [Figure: the same pipeline partitioned into clock domains – Front-end Domain (fetch unit, branch predictor, L1 I-cache), Integer Domain (issue queue, ALUs & RF), FP Domain (issue queue, ALUs & RF), Memory Domain (load/store unit, L1 D-cache, L2 cache), and External Domain (main memory)] [Semeraro et al., HPCA 2002] [Iyer and Marculescu, ISCA 2002]

  9. Outline • Motivation and background • Adaptive GALS microarchitecture • Control mechanisms • Evaluation methodology • Results • Conclusions and future work

  10. Adaptive GALS microarchitecture [Figure: the MCD clock domains with resizable structures – I-cache and branch predictor in the Front-end Domain, issue queues in the Integer and FP Domains, D-cache and L2 cache in the Memory Domain, and main memory in the External Domain]

  11. Adaptive GALS operation [Figure: the adaptive MCD pipeline in operation, with each domain's resizable structures shown at different sizes]

  12. Resizable cache organization • Access the A part first, then the B part on a miss • Swap the A and B blocks on an A miss / B hit • Select the A/B split according to application phase behavior
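
A minimal sketch of this A/B lookup-and-swap policy for a single cache set follows; the class, the tag-list representation, and the choice of swap victim are illustrative assumptions rather than the paper's exact hardware.

```python
# Minimal sketch of the A/B lookup-and-swap policy for one cache set.
# Ways [0, a_ways) form the fast A part; the remaining ways form the slower B part.

class ResizableSet:
    def __init__(self, ways, a_ways):
        self.blocks = [None] * ways   # one tag per way
        self.a_ways = a_ways          # current A/B split, chosen per program phase

    def access(self, tag):
        # Probe the fast A part first.
        if tag in self.blocks[:self.a_ways]:
            return "A hit"
        # On an A miss, probe the slower B part.
        if tag in self.blocks[self.a_ways:]:
            # A miss / B hit: swap the block into the A part so hot data
            # migrates toward the fast partition (victim choice is assumed).
            i = self.blocks.index(tag, self.a_ways)
            victim = self.a_ways - 1
            self.blocks[i], self.blocks[victim] = self.blocks[victim], self.blocks[i]
            return "B hit, swapped into A"
        # Miss in both parts: fill into the A part (replacement policy not shown).
        self.blocks[self.a_ways - 1] = tag
        return "miss"
```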

  13. Resizable cache control
  MRU state counters (position 0 = MRU … position 3 = LRU) record, for every hit, how recently the block was used; each possible A/B split can then be evaluated from the same counters:
  • Config A1/B3: hitsA = MRU[0]; hitsB = MRU[1] + MRU[2] + MRU[3]
  • Config A2/B2: hitsA = MRU[0] + MRU[1]; hitsB = MRU[2] + MRU[3]
  • Config A3/B1: hitsA = MRU[0] + MRU[1] + MRU[2]; hitsB = MRU[3]
  • Config A4/B0: hitsA = MRU[0] + MRU[1] + MRU[2] + MRU[3]; hitsB = 0
  [Figure: example access sequence showing which MRU counter increments on each hit]
  • Calculate the cost for each possible configuration:
  A access cost = (hitsA + hitsB + misses) * CostA
  B access cost = (hitsB + misses) * CostB
  Miss access cost = misses * CostMiss
  Total access cost = A + B + Miss (normalized to frequency)
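
As a concrete illustration of the interval-end calculation above, the sketch below picks the cheapest A/B split from the four MRU hit counters. The function name and cost values are placeholders; in the real design the per-configuration costs come from the access delays and the domain frequency.

```python
# Sketch of the interval-end configuration choice from the MRU counters above.
# cost_a / cost_b map the number of A ways to the per-access cost of probing
# that partition (normalized to the domain frequency); values are placeholders.

def choose_cache_config(mru, misses, cost_a, cost_b, cost_miss):
    best_split, best_cost = None, float("inf")
    for a_ways in range(1, len(mru) + 1):                     # configs A1/B3 .. A4/B0
        hits_a = sum(mru[:a_ways])                            # hits the A part would capture
        hits_b = sum(mru[a_ways:])                            # hits that fall through to B
        cost = ((hits_a + hits_b + misses) * cost_a[a_ways]   # every access probes A
                + (hits_b + misses) * cost_b[a_ways]          # B hits and misses probe B
                + misses * cost_miss)                         # misses go to the next level
        if cost < best_cost:
            best_split, best_cost = a_ways, cost
    return best_split

# Example with made-up costs: a dominant MRU[0] favours a small, fast A part.
best = choose_cache_config(mru=[900, 40, 30, 30], misses=50,
                           cost_a={1: 1.0, 2: 1.2, 3: 1.4, 4: 1.6},
                           cost_b={1: 2.0, 2: 1.8, 3: 1.6, 4: 0.0},
                           cost_miss=20.0)
```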

  14. Resizable issue queue control • Measures the exploitable ILP for each queue size • A timestamp counter is reset at the start of an interval and incremented each cycle • During rename, a destination register is given a timestamp based on the timestamp of its slowest source operand plus the execution latency • The maximum timestamp MAXN is maintained over the first N fetched instructions for each of the four possible queue sizes (N = 16, 32, 48, 64) • ILP is estimated as N / MAXN • The queue size with the highest ILP (normalized to frequency) is selected • Read the paper for the details
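
A software sketch of this timestamp-based ILP estimate follows; the instruction representation and function name are assumptions for illustration, and in hardware the bookkeeping happens at rename with simple counters.

```python
# Software sketch of the timestamp ILP estimate. Each instruction is modelled
# as (dest_reg, src_regs, latency); this representation is an assumption.

def estimate_ilp(instructions, queue_sizes=(16, 32, 48, 64)):
    ready_at = {}                 # register -> timestamp at which its value is ready
    max_ts = 0
    ilp = {}
    for n, (dest, srcs, latency) in enumerate(instructions, start=1):
        # Destination timestamp = timestamp of the slowest source + execution latency.
        ts = max((ready_at.get(s, 0) for s in srcs), default=0) + latency
        ready_at[dest] = ts
        max_ts = max(max_ts, ts)
        if n in queue_sizes:
            ilp[n] = n / max_ts   # ILP over the first N instructions = N / MAX_N
    return ilp                    # pick the size with the highest frequency-normalized ILP
```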

  15. Resizable hardware – some details
  • Front-end domain
  Icache “A”: 16KB 1-way, 32KB 2-way, 48KB 3-way, 64KB 4-way
  Branch predictor sized with Icache: gshare PHT 16KB–64KB, local BHT 2KB–8KB, local PHT 1024 entries, meta predictor 16KB–64KB
  • Load/store domain
  Dcache “A”: 32KB 1-way, 64KB 2-way, 128KB 4-way, 256KB 8-way
  L2 cache “A” sized with Dcache: 256KB 1-way, 512KB 2-way, 1MB 4-way, 2MB 8-way
  • Integer and floating-point domains
  Issue queue: 16, 32, 48, or 64 entries

  16. Evaluation methodology • SimpleScalar and Cacti • 40 benchmarks from SPEC, MediaBench, and Olden • Baseline: best overall performing fully synchronous 21264-like design found among 1,024 simulated options • Adaptive MCD costs imposed: additional branch penalty of 2 integer-domain cycles and 1 front-end-domain cycle (overpipelining); frequency penalty of as much as 31%; mean PLL locking time of 15 µsec • Program-Adaptive: profile the application and pick the best adaptive configuration for the whole program • Phase-Adaptive: use the online cache and issue queue control mechanisms

  17. Performance improvement [Figure: per-benchmark performance improvement across the MediaBench, Olden, and SPEC suites]

  18. Phase behavior – art [Figure: chosen issue queue entries over a 100 million instruction window]

  19. Phase behavior – apsi [Figure: chosen Dcache “A” size (32KB, 64KB, 128KB, 256KB) over a 100 million instruction window]

  20. Performance summary
  • Program Adaptive: 17% performance improvement
  • Phase Adaptive: 20% performance improvement
  • Automatic
  • Never degrades performance for 40 applications
  • Few phases in chosen application windows – could perhaps do better
  • Distribution of chosen configurations for Program Adaptive:

  Integer IQ        FP IQ             D/L2 Cache          Icache
  16 entries  85%   16 entries  73%   32KB/256KB    50%   16KB  55%
  32 entries   5%   32 entries  15%   64KB/512KB    18%   32KB  18%
  48 entries   5%   48 entries   8%   128KB/1MB     23%   48KB   8%
  64 entries   5%   64 entries   5%   256KB/2MB     10%   64KB  20%

  21. Domain frequency versus IQ size

  22. Conclusions • Application phase behavior can be exploited to improve performance in addition to power savings • GALS approach is key to localizing the impact of slowing the clock • Cache and queue control mechanisms can evaluate all possible configurations within a single interval • Phase adaptive approach improves performance by as much as 48% and by an average of 20%

  23. Future work • Explore multiple adaptive structures in each domain • Better take into account the branch predictor • Resize the instruction cache by sets rather than ways • Explore better issue queue design alternatives • Build circuits • Dynamically customized heterogeneous multi-core architectures using phase-adaptive GALS cores

  24. Dynamically Trading Frequency for Complexity in a GALS Microprocessor Steven Dropsho, Greg Semeraro, David H. Albonesi, Grigorios Magklis, Michael L. Scott University of Rochester
