Experiences with Two FabScalar-based Chips

Elliott Forbes, Rangeen Basu Roy Chowdhury, Brandon Dwiel, Anil Kannepalli, Vinesh Srinivasan, Zhenqian Zhang, Randy Widialaksono, Thomas Belanger, Steve Lipa, Eric Rotenberg, W. Rhett Davis, Paul D. Franzon

Introduction
FabScalar
• Synthesizable and parameterized RTL for OOO superscalar cores
• Fetch and issue widths, structure sizes
• No memory hierarchy or “uncore” (until next release)
Two FabScalar-based research projects
• H3 (“Heterogeneity in 3D”)
  • Two cores with different microarchitectures
  • Hardware support for thread migration
• AnyCore
  • One core with reconfigurable microarchitecture

Goals
Technical
• Explore adaptivity: ability to adjust microarchitecture to current instruction-level behavior
  • Migrate program execution to more suitable core (H3)
  • Reconfigure core (AnyCore)
Non-technical
• Fulfill original vision of FabScalar: streamline development of single-ISA heterogeneous multi-core processors
• Experience realities of fabricating designs
• Have fun building stuff

H3 Overview
• Two stacked asymmetric cores
• Fast Thread Migration (FTM): bulk swap of arch. register state
• Cache-Core Decoupling (CCD): cores may switch L1 caches at thread migrations
• Face-to-face 3D bonding provides dense low-latency interconnect
Two phases
• Phase 1: 2D IC (completed testing in June 2015). Test cores, caches, compiled memories, migration logic, etc.
• Phase 2: 3D IC (August 2015 tapeout). Demonstrate stacked out-of-order cores, benefits of heterogeneity, etc.

H3 Design
FabScalar saved effort. Most effort went into caches, buses, and new features. 9 months until RTL freeze.
• Two cores generated with FabScalar
• I-cache: in-house
• Modify Fetch-1 for synch. R/W compiled memories
• Chip I/O (mem. buses, serializer/deserializer)
• D-cache: retool and integrate OpenSparc-T2 L1 D$. Original plan (canceled): leverage T2’s 8 core x 8 bank crossbar and L2 $ implemented in stacked DRAM.
• New features: CCD, perf. counters, FTM

H3 Design: Physical design
• 6 months of physical design

H3 Design: Mitigating risk
• Dedicated memory buses for the two cores
  • Avoid a potential single point of failure
• Parameterized memory bus width
  • Reduce schedule risk: early pad planning is important but fluid
• Full scan in 1-wide core
  • Observability/controllability of at least one core with caches
  • 2-wide core doesn’t have scan overhead
• Debug Core
  • Rationale:
    • Test a “pure” FabScalar core
    • Plan B, in case two-core-stack doesn’t work
    • Eliminate risky aspects
    • Enhance testability/debuggability
  • Core configuration:
    • Same configuration as 2-wide core
  • Key features:
    • I-cache and D-cache replaced with synthesized I and D scratchpads
      • No compiled memories
      • No complex caches
    • Full scan
      • Observability/controllability for debug
    • No memory buses: scratchpads preloaded/examined via scan chains
[Figure: die photo + floorplan]

H3 Verification
RTL verification
• RTL verified using SPEC2K SimPoints
• In retrospect, should have also used microbenchmarks
Lesson: Budget enough time for netlist verification
• Major effort to set up netlist simulation
• Testbench and debug are more complicated (everything is blasted into individual nets)
• SDF annotation requires experience
• Most issues were caused by testbench and SDF problems
Found a serious, but not fatal, bug just after tapeout
• A difference between RTL and netlist caused by a misplaced `ifdef
• `ifdef SIM guards instrumentation in the RTL. Thus, SIM is defined in the testbench but not in the synthesis script. The problem is that a small fragment of real code was also mistakenly guarded by it.
• Lessons: (1) Consolidate all instrumentation in the testbench (none in RTL). (2) Do netlist verification, because the netlist may not equal the RTL.
Netlist simulation also would have alerted us to hold-time violations in the D$
• OpenSparc T2 D$ is a heavily latch-based industry design
• Problem encountered in chip bring-up, diagnosed with netlist simulation
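To make the `ifdef lesson concrete, the following is a minimal Python sketch of a lint pass that flags statements inside an `ifdef SIM region that do not look like instrumentation. It is not the flow used in the project: the function name `suspicious_guarded_lines`, the list of system tasks treated as instrumentation, and the example RTL fragment are all assumptions made for illustration.

```python
import re

# System tasks treated as pure instrumentation in this sketch (an assumption).
INSTRUMENTATION = re.compile(r"^\s*\$(display|fwrite|fopen|fclose|finish)\b")

def suspicious_guarded_lines(verilog_text, macro="SIM"):
    """Return (line_number, text) pairs found inside `ifdef <macro> that do
    not look like instrumentation, i.e. code that synthesis would drop."""
    findings = []
    depth = 0            # nesting depth of `ifdef/`ifndef blocks
    guard_depth = None   # depth of the open `ifdef <macro>, if any
    in_guard = False     # True while inside the macro-defined branch
    for number, line in enumerate(verilog_text.splitlines(), start=1):
        tokens = line.split()
        first = tokens[0] if tokens else ""
        if first in ("`ifdef", "`ifndef"):
            depth += 1
            if guard_depth is None and first == "`ifdef" and tokens[1:2] == [macro]:
                guard_depth, in_guard = depth, True
        elif first == "`else":
            if guard_depth == depth:
                in_guard = False   # the `else branch is what synthesis keeps
        elif first == "`endif":
            if guard_depth == depth:
                guard_depth, in_guard = None, False
            depth -= 1
        elif (in_guard and tokens and not first.startswith("//")
              and not INSTRUMENTATION.match(line)):
            findings.append((number, line.strip()))
    return findings

# Hypothetical RTL fragment: a real assignment was guarded by mistake.
EXAMPLE_RTL = """\
always @(posedge clk) begin
`ifdef SIM
  $display("commit pc=%h", pc);
  valid_q <= valid_d;
`endif
end
"""

print(suspicious_guarded_lines(EXAMPLE_RTL))
```

A check like this only narrows down candidates. It does not replace the netlist simulation recommended on the slide, which is what actually exposes an RTL/netlist mismatch.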

H3 Packaging, PCB, & Bring-up
• Overall chip has 400 pads divided into four 100-pad experiments
  • Wirebond chip differently for each experiment
  • Allows for use of a 128-pin QFP package
• Debug core uses only a dozen signal pins
  • Allowed us to wirebond a die directly to an existing board to check debug core liveness
  • Test Vdd/Gnd and V+/Gnd for shorts
  • Scan-in == Scan-out
• Success of debug core liveness tests was the green light to assemble the four configurations. For each configuration:
  • Package the chip (wirebonding and lead-forming)
  • Design and fab a 4-layer PCB
  • Assemble the PCB
[Figure: chip-on-board debug core liveness]
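The "Scan-in == Scan-out" liveness check can be pictured with a small behavioral model. The Python sketch below treats the scan chain as a plain shift register and checks that a pattern shifted in reappears at scan-out after one chain length of delay. The function names, the chain length, and the stuck-at-zero fault option are illustrative assumptions, not details of the H3 debug core.

```python
def shift_through_chain(pattern, chain_length, stuck_at_zero=None):
    """Shift `pattern` (a list of 0/1) into a scan chain that starts at all
    zeros and return the bit seen at scan-out on every clock.

    `stuck_at_zero` optionally names one flop index that is forced to 0,
    modeling a broken chain.
    """
    chain = [0] * chain_length
    observed = []
    # Clock in the pattern, then flush with zeros so every bit reaches scan-out.
    for bit in list(pattern) + [0] * chain_length:
        observed.append(chain[-1])     # scan-out is the last flop in the chain
        chain = [bit] + chain[:-1]     # each flop captures its predecessor
        if stuck_at_zero is not None:
            chain[stuck_at_zero] = 0
    return observed

def scan_in_equals_scan_out(pattern, chain_length, stuck_at_zero=None):
    """True if the pattern comes out unchanged, delayed by the chain length."""
    observed = shift_through_chain(pattern, chain_length, stuck_at_zero)
    return observed[chain_length:] == list(pattern)

print(scan_in_equals_scan_out([1, 0, 1, 1], chain_length=3))
print(scan_in_equals_scan_out([1, 0, 1, 1], chain_length=3, stuck_at_zero=1))
```

A healthy chain passes the pattern through intact, while any break between scan-in and scan-out corrupts it. That is why the check works as a quick liveness test once power shorts have been ruled out.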

H3 Packaging, PCB, & Bring-up
Test setup
• PCB connects to LPC mezzanine of Xilinx ML605
• FPGA handles memory requests from cores
  • Block RAMs for L2 cache
• Host PC sends commands to FPGA via serial interface using a custom GUI
• Custom compiler for writing microbenchmarks
  • For good control of instruction selection and order, without assembly programming
Fully assembled H3 PCB (Phase 1)
• All signals go to both:
  • Headers (to oscilloscope)
  • LPC connector (to FPGA), located underneath
• Layer 1 (shown): package, headers
• Layers 2,3: Vdd/V+, Gnd
• Layer 4 (underneath): LPC, DCAPs

H3 Packaging, PCB, & Bring-up
Results of chip bring-up
• Identified 9 total issues: 3 setup, 1 “feature”, 4 bugs, 1 possible bug (see Table 3, #x)
• 3 setup issues: #2, #3, #6 (fixed)
• 1 “feature” of extra I$ bus traffic: #4 (no ill effects, but may want to fix in Phase 2)
• 4 bugs (will fix in Phase 2)
  • 1 bug detected post-tapeout, pre-silicon: #1 (serious, but has workaround)
  • 1 class of hold-time bugs in D$: #5 (serious, fortunately top core ok with certain tags)
  • 2 bugs exercised by thread migrations: #7, #8 (just annoying, have workarounds)
• 1 possible bug when migrating with CCD enabled: #9 (debug in progress)

H3 Packaging, PCB, & Bring-up
† 1-wide core has higher current (and power) than 2-wide core because it has full scan.

H3 Phase 2
• Scheduled tapeout in August 2015
• Just the two-core stack
  • No debug core
  • No scan chains
Design tasks
• Partition RTL for two tiers
• Implement thread migration enhancements
• Fix bugs
• Replace T2 D-caches with in-house D-caches
  • T2 not workable: explicit latch instances, everywhere
  • T2 no longer needed: shelved crossbar and stacked DRAM L2$
  • We now have in-house D-cache from AnyCore effort
• Custom-design M1 pads and F2F bondpoints

AnyCore Overview
FabScalar evolution
• Released FabScalar: build up cores from a library of stage designs of different widths
• Next-gen FabScalar: “Superset Core”, a single Verilog description with parameterized widths (structure sizes already parameterized)
AnyCore derived from “superset core”
• Keep static configurability of superset core to allow for synthesis of different max-sized AnyCore processors
• Add dynamic configurability within max size
AnyCore instance that was fabricated: [figure]

AnyCore Design and Verification
Applied Lessons from H3 Project
• Automatic liveness test (BIST)
  • Upon applying power, clock, and reset, the chip should toggle a dedicated pin. This signals that it is correctly executing the preset test program.
• Got netlist simulation working early
  • Both post-synthesis (ideal clocks, estimated SDF) and post-layout (everything realistic)
Notable Design Features
• In-house L1 cache designs with three modes
  • Cache
  • Scratchpad
  • BIST (mode after processor reset): scratchpad mode with the first N rows preset to a test program, including a new instruction that toggles a dedicated pin
• Debug interface for directly reading/writing key pipeline structures, scratchpads, core configuration registers, and performance counters

AnyCore Packaging, PCB, & Bring-up
• Applied more lessons from H3 project
• Used a socket instead of soldering the package to the PCB
  • Possible benefit that the PCB may be repurposed for other projects
  • Replace defective chips
  • Study variations
• Used a dedicated bypassable shunt resistor to measure current
• Narrowed PCB to LPC profile. Compatible with arbitrary Xilinx boards (ML605 and Zynq boards).

Discussion and Questions
Thank you. Any comments or questions?
[Figures: fully assembled H3 PCB; fully assembled AnyCore PCB]
The H3 project is supported by a grant from Intel. The AnyCore project is supported by NSF grant CCF-1018517. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Backup

References
N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi, and E. Rotenberg. FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template. Proceedings of the 38th IEEE/ACM International Symposium on Computer Architecture (ISCA-38), pp. 11-22, June 2011.
E. Rotenberg, B. Dwiel, E. Forbes, Z. Zhang, R. Widialaksono, R. Basu Roy Chowdhury, N. Tshibangu, S. Lipa, W. R. Davis, and P. D. Franzon. Rationale for a 3D Heterogeneous Multi-core Processor. Proceedings of the 31st IEEE International Conference on Computer Design (ICCD-31), pp. 154-168, October 2013.

H3 Detailed Pre-tapeout Timeline

H3 Detailed Post-tapeout Timeline

Cool H3 Pictures
[Figures: final layout; full test setup (host PC, power supply, ML605, etc.)]

H3 Errata

Ammeter Shunt Resistor Switching
• A high-precision meter still uses a shunt resistor
• Our meter switches shunt resistors using a relay
• The relay causes a lag between changes in core current draw and the ability of the ammeter to react
Sequence (steps 1 to 4 in the figure):
1. FPGA reset pressed (resets core, which then draws near zero current).
2. Ammeter switches to higher resistance shunt resistor
3. FPGA reset released (core starts executing, increasing current draw).
4. Ammeter switches back to lower resistance shunt resistor

Ammeter Burden Voltage
• Using a discrete, fixed-value resistor to measure current also has trade-offs
• Increasing resistance gives more accurate measurements of voltage drop (and hence current)
• However, that same voltage drop also lowers the voltage as seen by the core
• This is why the ammeter switches shunt resistances in the first place
[Figure annotation: voltage regulated here]

Power Supply Sense Input
• You could bump the voltage of the power supply up so that the voltage as seen by the core is within spec
• But then during reset, the core current drops to near zero, and the voltage as seen by the core is roughly the same as the bumped-up output of the power supply
• A more fully featured power supply, with a sense input, can help this situation: regulation is at the sense node, not at the output node of the power supply
[Figure annotation: voltage regulated here]
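The burden-voltage trade-off on these two slides follows directly from Ohm's law: the shunt drops I × R, so the core sees the supply voltage minus that drop. The Python sketch below works this through with made-up numbers; the 1.0 V target, the 100 mA active current, and the two shunt values are hypothetical and are not measurements from the H3 or AnyCore setups.

```python
def core_voltage(supply_volts, shunt_ohms, core_amps):
    """Voltage seen by the core after the drop across the series shunt."""
    return supply_volts - core_amps * shunt_ohms

def bumped_supply(target_core_volts, shunt_ohms, core_amps):
    """Supply setting that puts the core at its target while it draws current."""
    return target_core_volts + core_amps * shunt_ohms

# Hypothetical operating point, for illustration only.
TARGET_V = 1.0    # desired core voltage
ACTIVE_A = 0.100  # current while the core is executing
RESET_A = 0.0     # current while the core is held in reset

for shunt in (0.1, 1.0):
    supply = bumped_supply(TARGET_V, shunt, ACTIVE_A)
    print(f"shunt={shunt} ohm: "
          f"sense drop={ACTIVE_A * shunt:.3f} V, "
          f"core without bump={core_voltage(TARGET_V, shunt, ACTIVE_A):.3f} V, "
          f"bumped supply={supply:.3f} V, "
          f"core during reset={core_voltage(supply, shunt, RESET_A):.3f} V")
```

A larger shunt gives a larger, easier-to-measure drop, but it pulls the core further below target. If the supply is bumped to compensate, the core sees the full bumped voltage whenever its current falls to zero. Regulating at a sense node placed after the shunt avoids both effects.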

Custom Compiler
• Needed a way to explicitly control all aspects of emitted instructions
• Language syntax is assembly with a few higher-level features
  • Registers and memory locations can be named
  • if statements and while loops
  • Arithmetic operators, assignments, address-of
• Syntax allows code/data to be placed at arbitrary memory locations
  • Including non-contiguous locations
• The compiler also emits to our checkpoint file format
• Written in flex/bison

Custom Compiler Example

    % example program that
    % sums the values of an array
    mem (0x00400000) {
        ii:    $r1
        addr:  $r2
        cond:  $r4
        val:   $r5
        total: $r6

        addr = @data
        total = #0
        ii = #0
        cond = ii < #4
        while (cond) {
            lw val, #0[addr]
            addr = addr + #4
            total = total + val
            ii = ii + #1
            cond = ii < #4
        }
        addr = @result
        sw total, #0[addr]
    }

    % example program array data
    mem (0x00100000) {
        data:
            !0x0f0f0f0f
            !0xabcd1234
            !0x00000001
            !0xdeadbeef
        result:
            !0x00000000
    }
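For readers unfamiliar with the custom syntax, here is a rough Python equivalent of what the example program computes: it walks the four data words and stores their sum to `result`. This is a sketch only. The function name `sum_array` is made up, and the 32-bit wraparound of `total` is an assumption about the register width, not something the slide states.

```python
# The four words placed at `data` in the example program.
DATA = [0x0F0F0F0F, 0xABCD1234, 0x00000001, 0xDEADBEEF]

def sum_array(words, count=4, width_bits=32):
    """Mirror the while loop: total += words[ii] for ii in 0..count-1.

    The mask models a register of `width_bits` bits (assumed to be 32 here).
    """
    mask = (1 << width_bits) - 1
    total = 0
    ii = 0
    while ii < count:          # cond = ii < #4
        val = words[ii]        # lw val, #0[addr]; addr = addr + #4
        total = (total + val) & mask
        ii += 1
    return total

result = sum_array(DATA)       # sw total, #0[addr] would store this to `result`
print(hex(result))
```

Under the 32-bit assumption the sum overflows once, so the stored word is the low 32 bits of the true sum.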

Intel 80386 Errata (it’s not just us!)
• Hardware Peculiarity: Unless pin F13 of the 80386 is connected to the +5V power supply, the 80386 never terminates a memory cycle, halting the processor. Datasheet indicates pin F13 is NC with a note that “Pins identified as ‘NC’ should remain completely unconnected.”
• Successive Floating-Point Instructions: If two floating-point instructions are executed close together, the 80386 may force the coprocessor to start the second one too soon if the first one did not require any memory operands.
• Misaligned Floating-Point Instructions: If 80287 and/or 80387 instructions are not word-aligned, the 80386 passes the wrong instruction to the coprocessor, causing unpredictable behavior.
• Self-test: The self-test feature does not work on the A1 stepping of the 80386.
• Double Page Faults: The bug that appeared in the B0 stepping regarding page faults that occur during page faults has been made a permanent feature of the 80386…
Source: Turley, James L., Advanced 80386 Programming Techniques, McGraw-Hill, Berkeley, CA, 1988.

AnyCore Detailed Pre-tapeout Timeline

AnyCore Physical Design Details

AnyCore Detailed Post-tapeout Timeline

Cool AnyCore Pictures
[Figures: final layout; floorplan]

AnyCore Pipeline

BIST Program

    0x00 addi r1, r0, #0
    0x08 addi r2, r0, #0
    0x10 addi r3, r0, #0
    0x18 addi r4, r0, #0
    0x20 toggle
    0x28 nop
    0x30 nop
    0x38 nop
    0x40 st r3(#0), r4
    0x48 addi r2, r2, #10
    0x50 addi r1, r1, #5
    0x58 addi r2, r2, #10
    0x60 toggle
    0x68 addi r1, r1, #5
    0x70 addi r2, r2, #10
    0x78 addi r1, r1, #5
    0x80 addi r4, r0, #0
    0x88 ld r4, r3(#0)
    0x90 addi r1, r1, #5
    0x98 addi r2, r2, #10
    0xa0 toggle
    0xa8 addi r2, r2, #10
    0xb0 addi r4, r4, #2
    0xb8 addi r3, r3, #4
    0xc0 addi r2, r2, #10
    0xc8 addi r1, r1, #5
    0xd0 addi r1, r1, #5
    0xd8 jmp 0x40
    0xe0 nop
    0xe8 nop
    0xf0 nop
    0xf8 nop
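To see what the BIST program is expected to do at the pin, the listing can be run through a tiny interpreter. The Python sketch below is a behavioral model under several assumptions the slides do not confirm: `st rb(#off), rs` stores `rs` to `memory[rb + off]`, `ld rd, rb(#off)` loads it back, `jmp` takes effect immediately with no delay slot, and `r0` always reads zero because nothing writes it. The name `run_bist` is made up for this example.

```python
import re

# The listing from the slide, one "address instruction" pair per line.
BIST_PROGRAM = """\
0x00 addi r1, r0, #0
0x08 addi r2, r0, #0
0x10 addi r3, r0, #0
0x18 addi r4, r0, #0
0x20 toggle
0x28 nop
0x30 nop
0x38 nop
0x40 st r3(#0), r4
0x48 addi r2, r2, #10
0x50 addi r1, r1, #5
0x58 addi r2, r2, #10
0x60 toggle
0x68 addi r1, r1, #5
0x70 addi r2, r2, #10
0x78 addi r1, r1, #5
0x80 addi r4, r0, #0
0x88 ld r4, r3(#0)
0x90 addi r1, r1, #5
0x98 addi r2, r2, #10
0xa0 toggle
0xa8 addi r2, r2, #10
0xb0 addi r4, r4, #2
0xb8 addi r3, r3, #4
0xc0 addi r2, r2, #10
0xc8 addi r1, r1, #5
0xd0 addi r1, r1, #5
0xd8 jmp 0x40
0xe0 nop
0xe8 nop
0xf0 nop
0xf8 nop
"""

def run_bist(program_text, max_instructions):
    """Interpret the listing for a bounded number of instructions and
    return the toggle count, register file, and final PC."""
    program = {}
    for line in program_text.splitlines():
        address, _, instruction = line.partition(" ")
        program[int(address, 16)] = instruction

    regs = [0] * 8
    memory = {}
    toggles = 0
    pc = 0
    for _ in range(max_instructions):
        opcode, _, operands = program[pc].partition(" ")
        if opcode == "jmp":
            pc = int(operands, 16)   # assumed: no branch delay slot
            continue
        numbers = [int(n) for n in re.findall(r"\d+", operands)]
        if opcode == "addi":
            rd, rs, imm = numbers
            regs[rd] = regs[rs] + imm
        elif opcode == "st":
            base, offset, rs = numbers
            memory[regs[base] + offset] = regs[rs]
        elif opcode == "ld":
            rd, base, offset = numbers
            regs[rd] = memory[regs[base] + offset]
        elif opcode == "toggle":
            toggles += 1             # models one edge on the dedicated pin
        elif opcode != "nop":
            raise ValueError(f"unknown opcode: {opcode}")
        pc += 8                      # addresses in the listing step by 8
    return {"toggles": toggles, "regs": regs, "pc": pc}

state = run_bist(BIST_PROGRAM, max_instructions=48)
print(state["toggles"], state["regs"][1:5])
```

Under these assumptions the program toggles once during initialization and twice per pass through the loop at 0x40, so a live chip shows a steady stream of edges on the dedicated pin. The loop also exercises a store, a load, and a dependent add.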

Debug Bus
[Block diagram] Structures on the bus: Caches / AMT / PRF, Configuration Registers, Performance Counters, Debug Registers (PC, queue pointers etc.). Signals: Read/Write Addr, Write Data / Wr En, Read Data.
