Reconfigurable Computing - Verifying Circuit Performance!

John Morris
Chung-Ang University
The University of Auckland

Measuring Circuit Performance
• Don’t believe the simulators!
  • Although some experience has shown that predictions can be reasonably accurate …
  • Potential for gross error is very large
    • A large number of small values need to be summed
    • Possibility of large statistical errors
• Professional engineers always check
  • That’s what makes them professional!
• Scientists always want to be able to repeat an experiment
  • That’s a principle of scientific theory
  • Don’t accept anything as fact unless you can repeat it!
• Whatever your background or reason …
  • Measurement on an actual device needed
  • You can use the simulator’s numbers for guidance though!

Measuring Circuit Performance
• Use the simulator’s results as a guide
• But what does it tell you?
  • It calculates propagation delays from inputs to outputs along various circuit paths
  • Simulators try to identify the longest (in time) path for you
• In a simple combinatorial block that’s fine, eg
  • a one-stage (no registers) adder
    • Should identify the carry chain in a ripple carry adder
    • or its equivalent in a more complex adder
  • a single-stage parallel array multiplier
    • Again, in all types of multipliers, there’s a carry chain that limits performance
• In a pipelined circuit, you want the longest path between two clocked flip-flops
  • In principle, easy for the simulator to find!
  • In practice, you may need to spend more time checking that it selected the right path!

Measuring Circuit Performance
• Checking the simulator’s predictions
  • Do a sanity check!
• Using the manufacturer’s published propagation delays for individual circuit elements
  • Estimate the path delay yourself
    • Count the number of logic blocks needed for the computation
    • Will additional multiplexers be needed for steering or selection logic?
    • Are I/O buffers needed?
      • These typically have a considerable delay (relative to other circuit elements)

Measuring Circuit Performance
• Using the manufacturer’s published propagation delays for individual circuit elements
  • Estimate the path delay yourself
  • …
• You can use the synthesizer to help you here
  • Its count of the total number of logic blocks will be 100% accurate
  • From this, you infer the number of logic blocks in a path, eg
    • For a 32-bit adder, you can obviously start by dividing the total number of logic blocks by 32
    • Then try to estimate how many logic blocks are needed for overheads, eg
      • Multiplexers needed in a carry select adder
• For FPGAs, remember …

Measuring Circuit Performance
• Using the manufacturer’s published propagation delays for individual circuit elements
  • Estimate the path delay yourself
• For FPGAs, remember …
  • Look up tables (LUTs) are usually used for boolean logic
  • This means that
    • Using Xilinx’s 9-input CLBs
      • y <= a AND b probably takes about the same time as
      • y <= a AND b AND c AND d AND … (up to 9 inputs)
      • Beyond 9 inputs, add a considerable delay to connect to a neighbouring CLB
    • Using Altera’s 4-input logic elements
      • y <= a AND b probably takes about the same time as
      • y <= a AND b AND c AND d (up to 4 inputs)
      • Beyond 4 inputs, add a small delay to use the fast cascade chain logic

Measuring Circuit Performance
• Using the manufacturer’s published propagation delays for individual circuit elements
  • Estimate the path delay yourself
• For FPGAs, remember …
  • Paths between logic blocks may have large numbers of transmission gates on them!
  • As noted before, there’s a considerable advantage to being able to keep critical logic on one logic block
  • But Altera’s cascade chains attempt to mitigate the penalty for not fitting critical logic into a single logic element
  • And all manufacturers now provide for fast adder carry chains!
• This makes estimation of path delays difficult
• Nevertheless, you should make a rough estimate!!
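The rough estimate can be written as a simple sum. This is only a sketch: the symbols and all the numbers below are illustrative assumptions, not values from any datasheet. Take n_LB logic blocks in the path with delay t_LB each, n_R inter-block connections with delay t_R each, and n_IO I/O buffers with delay t_IO each.

```latex
t_{\text{path}} \approx n_{LB}\, t_{LB} + n_{R}\, t_{R} + n_{IO}\, t_{IO}

% Hypothetical numbers: 8 logic blocks at 1 ns, 7 connections at 2 ns,
% 2 I/O buffers at 3 ns
t_{\text{path}} \approx 8 \times 1 + 7 \times 2 + 2 \times 3 = 28\ \text{ns}
```

In this made-up case the connections contribute as much as the logic and the I/O buffers together, which is why leaving out the routing usually gives an estimate that is too low.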

Measuring Circuit Performance
• Estimate the path delay yourself
• If your estimate matches that from the synthesizer, then you’re in good shape
  • ‘Matches’ here can be interpreted liberally
  • If the synthesizer reports 50 ns and you calculate 30 ns, then this is a reasonable match
    • You probably didn’t count enough transmission gates, etc, on the connections between logic blocks!
  • You don’t need to do a very precise calculation
    • The synthesizer has done that for you!
    • Your aim is to ensure that you are reading the correct number from the synthesizer’s report!
• With a reasonable match (say within 50%, either way), believe the synthesizer and continue …
• With a serious mismatch
  • Read the synthesizer’s report more carefully
    • You may be looking at the wrong figure!
  • Check your estimate more carefully

Now we believe we know how fast the circuit is …
• What does this speed mean in practice?
  • You have a longest delay of x ns
  • A synchronous (clocked) circuit can run at 1/x GHz?
  • Almost!
• Don’t forget to allow for
  • Propagation delay in the registers
  • Temperature
    • Circuits run slower at high T
    • Make sure that your estimate of tpd is a good one for the highest temperature your circuit will need to withstand
    • Don’t think that this will be low!
      • Try touching a modern high performance processor! (Make sure you have some burn cream nearby!) or
      • simply work out that all those fans hiding that chip aren’t there for decoration!
  • Chip-to-chip variations in fabrication
  • …

• 32-bit adder: inputs a, b, c
• Naïve approach: test all possibilities
  • a: 4 × 10^9 (all possible 32-bit numbers)
  • b: 4 × 10^9 (ditto)
  • c: 2 (0 or 1)
  • Total 4 × 4 × 2 × 10^18 = 3.2 × 10^19
• 4 GHz machine: 10^9 cases / sec (optimistic!)
  • 3.2 × 10^10 seconds, or about 1000 years!
• What about the rest of the machine?
  • -, x, /, ^, v, <<, >>, …
  • Each of these needs the same treatment, so the total is several times longer still
  • Hmmmm … our 4 GHz machine will be obsolete long before the tests finish!
• Clearly we need to be more efficient about testing!

Now we believe we know how fast the circuit is …
• What does this speed mean in practice?
  • You have a longest delay of x ns
  • A synchronous (clocked) circuit can run at 1/x GHz?
  • Almost!
• Don’t forget to allow for
  • Propagation delay in the registers
    • More on pipelines later!
  • Temperature
  • Chip-to-chip variations in fabrication
    • The gates will only be nominally 0.18 µm!
    • Some may actually be 0.15 µm and others 0.25 µm …
• A maximum clock frequency of 1/(x+D) GHz
  • D may be quite large!
• Now you’re ready to design an experiment to verify that the circuit does actually run as predicted!
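As a worked instance of the formula above, with x and D both in nanoseconds so that the result comes out in GHz. The value x = 5 ns and the margin D = 3 ns are assumptions chosen only to show the size of the effect; a real D has to come from the register, temperature and process figures for your device.

```latex
f_{\max} = \frac{1}{x + D}\ \text{GHz} \qquad (x,\ D \text{ in ns})

% Hypothetical values: x = 5 ns, D = 3 ns
f_{\max} = \frac{1}{5 + 3}\ \text{GHz} = 0.125\ \text{GHz} = 125\ \text{MHz}
\qquad\text{compared with}\qquad \frac{1}{5}\ \text{GHz} = 200\ \text{MHz}
```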

A word of warning!
• Experimental design!
• If you don’t make an estimate of what you expect to measure before starting
  • You will waste a lot of time doing the experiment!
• Working out the expected delay time is formally equivalent to setting out a hypothesis for the experiment
  • The simulator says the delay will be x ns, so I hypothesise (predict) that we will measure a delay of about x ns
• This (simple) hypothesis guides your experimental design and set up!
  • For example, assume you have a 150 MHz oscilloscope available …

Experimental hypothesis
• The simulator says the delay will be x ns, so I hypothesise (predict) that we will measure a delay of about x ns
• This (simple) hypothesis guides your experimental design and set up!
• For example, assume you have a 150 MHz oscilloscope available
  • You try to make measurements of the delay, but are surprised to find that there appears to be no delay at all!
  • Somebody then remembers to go back and read the synthesis report …
  • Which tells you to expect a 5 ns delay: one that will be difficult to measure on a slow ’scope!
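A quick calculation suggests why a 5 ns delay is hard to see on a 150 MHz instrument. It uses the common rule of thumb relating an oscilloscope's 10% to 90% rise time to its bandwidth; the constant 0.35 assumes a roughly Gaussian front-end response, so treat the result as an estimate only.

```latex
t_r \approx \frac{0.35}{BW} = \frac{0.35}{150 \times 10^{6}\ \text{Hz}} \approx 2.3\ \text{ns}
```

Each edge is therefore smeared over about 2.3 ns, nearly half of the 5 ns interval being measured.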

Experimental Hypothesis
• The simulator says the delay will be x ns, so I hypothesise (predict) that we will measure a delay of about x ns
• This (simple) hypothesis guides your experimental design and set up!
• You now know that you have to design your experiment differently, eg
  • Build a wider adder
    • So that the delay is long enough to measure easily
  • Work out how to measure n repeats of the calculation
    • So that 5n ns > 20 ns (or some time that you can be certain to measure accurately!)
  • Devise an entirely new technique
    • Which doesn’t require direct measurement of such a small delay

Measuring the delay
• Usual strategy
  • Design a test bench that will drive the component you are testing, the Component Under Test (CUT)
  • This test bench will be different from the one used to verify the component’s correctness!
• First task:
  • What will the test bench do?
  • What is the worst case input(s)?
    • Ie the ones that will take the longest time to produce a result!
    • For an adder, there are several possibilities:
  • Set up the test bench to produce one of these inputs
    • Make a small state machine

Measuring the delay
• You’ve identified a suitable worst case …
• Set up the test bench to produce these inputs
  • Make a small state machine
  • 2 states may be adequate:
    • State 1: Clear the outputs
    • State 2: Apply the test case
• How will you know that the worst case has completed computation?
  • In the case of an adder, it’s easy
  • For other circuits, you may need to add some ‘probe’ circuitry
    • For example, the worst case is when TWO outputs go high
    • Add an AND gate to your driver and route the output of this gate to an external pin
• Set up your scope to measure the delay
  • from the start of the clock cycle
  • to the output signalling completion

Measuring the delay
• You’ve identified a suitable worst case …
• Set up the test bench to produce these inputs
  • Make a small state machine
  • 2 states may be adequate:
    • State 1: Clear the outputs
    • State 2: Apply the test case

    PROCESS( clk )
    BEGIN
      IF clk'EVENT AND clk = '1' THEN
        CASE state IS
          WHEN state1 =>            -- clear the outputs
            a <= zero; b <= zero;
            start <= '0';
            state <= state2;
          WHEN state2 =>            -- apply the worst case: 1 + (-1) ripples a carry through every bit
            a <= one; b <= minus_one;
            start <= '1';
            state <= state1;
        END CASE;
      END IF;
    END PROCESS;

• Set up your scope to measure the delay
  • from the start of the clock cycle (or the start signal)
  • to the output signalling completion (carry out for an adder)

Measuring a delay
• Second strategy
  • Use the FPGA to do everything!
  • Drive a fast counter with the fastest clock available
  • Stop the counter when the operation is complete
• Requires fast input clock
  • Resolution of this clock determines timing accuracy
  • Can use FPGA PLLs to multiply clock

    PROCESS( clk )
    BEGIN
      IF clk'EVENT AND clk = '1' THEN
        CASE state IS
          WHEN state1 =>            -- clear the counter and the adder inputs
            counter <= zero;
            a <= zero; b <= zero;
            IF c_out = '0' THEN     -- wait until the previous carry has cleared
              state <= state2;
            END IF;
          WHEN state2 =>            -- apply the worst case and start counting
            a <= one; b <= minus_one;
            counter_enable <= '1';  -- assumed active high: '1' lets the counter run
            state <= state3;
          WHEN state3 =>
            IF c_out = '1' THEN     -- operation complete: stop the counter
              counter_enable <= '0';
              -- transfer counter output to
              -- LED display, etc
              state <= state1;
            END IF;
        END CASE;
      END IF;
    END PROCESS;
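The fast counter itself is not shown on the slide. A minimal sketch of one possible counter follows. The entity name `delay_counter`, its ports (`fast_clk`, `clear`, `enable`, `count`), the active-high polarity of `clear` and `enable`, and the 16-bit default width are all assumptions made for illustration; adapt them to the state machine that controls the counter.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper: counts fast_clk cycles while enable = '1'.
entity delay_counter is
  generic (
    WIDTH : positive := 16                       -- assumed counter width
  );
  port (
    fast_clk : in  std_logic;                    -- fastest clock available (eg from a PLL)
    clear    : in  std_logic;                    -- synchronous clear, assumed active high
    enable   : in  std_logic;                    -- count while '1', hold while '0'
    count    : out unsigned(WIDTH - 1 downto 0)
  );
end entity delay_counter;

architecture rtl of delay_counter is
  signal count_i : unsigned(WIDTH - 1 downto 0) := (others => '0');
begin
  process (fast_clk)
  begin
    if rising_edge(fast_clk) then
      if clear = '1' then
        count_i <= (others => '0');              -- restart the measurement
      elsif enable = '1' then
        count_i <= count_i + 1;                  -- one tick per fast_clk period
      end if;
    end if;
  end process;

  count <= count_i;
end architecture rtl;
```

The measured delay is then roughly the final count multiplied by the period of `fast_clk`, with an uncertainty of about one period because the completion signal is not synchronised to that clock.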

Measuring Delay
• Strategy three
  • If the circuit is clocked too fast, it won’t operate correctly
    • ie it won’t complete the computation before the next clock edge arrives
    • eg in the adder example, the next clock edge arrives before the carry has rippled through to carry out, so it never becomes ‘1’
  • Set up the test circuit as in the first case, but gradually increase the clock speed until carry_out never becomes ‘1’
  • Use a secondary clock derived from a (fast) master clock, whose frequency is gradually increased until the circuit stops operating correctly (ie never produces a completion signal)
    • Count pulses of the master clock with a loadable counter
    • Secondary clock is derived from the counter completion signal
    • Reduce the loaded count value to reduce the secondary clock cycle time (reciprocal scale)
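One way the loadable counter could be sketched is shown below. This is an illustration under stated assumptions, not the circuit used in the course: the entity name `test_clock_gen`, its ports and the 8-bit default width are hypothetical. The counter reloads `load_value` each time it reaches zero and emits a one-cycle pulse, so the secondary clock period is (load_value + 1) master clock periods.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical secondary clock generator for strategy three.
entity test_clock_gen is
  generic (
    WIDTH : positive := 8                          -- assumed counter width
  );
  port (
    master_clk : in  std_logic;                    -- fast master clock
    load_value : in  unsigned(WIDTH - 1 downto 0); -- must be at least 1
    test_clk   : out std_logic                     -- secondary clock for the test circuit
  );
end entity test_clock_gen;

architecture rtl of test_clock_gen is
  signal count : unsigned(WIDTH - 1 downto 0) := (others => '0');
  signal pulse : std_logic := '0';
begin
  process (master_clk)
  begin
    if rising_edge(master_clk) then
      if count = 0 then
        count <= load_value;                       -- reload: start the next period
        pulse <= '1';                              -- counter completion signal
      else
        count <= count - 1;
        pulse <= '0';
      end if;
    end if;
  end process;

  test_clk <= pulse;                               -- registered, so free of decode glitches
end architecture rtl;
```

Reducing `load_value` by one shortens the secondary clock period by one master clock period, so the steps in frequency become coarser as the count gets smaller (the reciprocal scale noted above). With `load_value` = 0 the pulse stays high and no clock is produced.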

Measuring the delay
• Timing accuracy
  • Don’t forget the I/O buffer delay!
  • A signal from either your driver or the CUT has to go through a pin, which implies passing through an I/O buffer
• Don’t
  • Put one probe on the clock input and one on an output
  • Your result will be increased by the I/O buffer delay!
• Instead
  • Feed the clock back off the FPGA through an I/O buffer
  • Now both the clock and the completion signal will be delayed by one I/O buffer
• Still some possibility of error
  • The two I/O buffers will not have exactly the same delay
  • But this error is likely to be of the same magnitude as other unavoidable errors, so …
• In general, make sure that both your ‘start’ and ‘stop’ signals have very similar delays

Measuring the delay
• Timing accuracy
  • Don’t forget the I/O buffer delay!
• In general, make sure that both your ‘start’ and ‘stop’ signals have very similar delays
  • If you had to combine your ‘stop’ signals in some logic,
    • eg one logic block to AND two signals together
  • Then pass the start signal through an artificial similar delay
    • Your completion signal is the AND of two outputs
      • complete <= a_out AND b_out
    • So your start signal should be similarly delayed
      • start_delayed <= start AND start
    • ‘start’ will often be the clock driving your test circuit
• Be careful:
  • Some compilers are clever enough to realize that this is a ‘do nothing’ piece of logic and will remove it!!

[Figure: an FPGA containing the CUT, with signals a, b, c_in, sum and c_out, and a Clock]
