
David Culler Electrical Engineering and Computer Sciences University of California, Berkeley

EECS 150 - Components and Design Techniques for Digital Systems Lec 23 – Optimizing State Machines. David Culler Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~culler http://inst.eecs.berkeley.edu/~cs150





Presentation Transcript


  1. EECS 150 - Components and Design Techniques for Digital Systems Lec 23 – Optimizing State Machines David Culler Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~culler http://inst.eecs.berkeley.edu/~cs150

  2. Datapath vs Control • Datapath: storage, function units (FUs), and interconnect sufficient to perform the desired functions • Inputs are control points • Outputs are signals • Controller: state machine to orchestrate operation of the datapath • Based on the desired function and the signals [Figure: datapath and controller blocks, connected by control points and signals] EECS 150, Fa07, Lec 23-optimize

  3. Control Points Discussion
  • Control points on a bus?
  • Control points on a register?
  • Control points on a function unit?
  • Signal?
  • Relationship to STG? STT?
  [Figure: state/output block with inputs A and B]

  4. Sequential Logic Optimization
  • State minimization
  • Algorithms for state minimization
  • State, input, and output encodings
  • Minimize the next-state and output logic
  • Delay optimizations
  • Retiming
  • Parallelism and pipelining (time permitting)

  5. FSM Optimization in Context
  • Understand the word specification
  • Draw a picture
  • Derive a state diagram and symbolic state table
  • Determine an implementation approach (e.g., gate logic, ROM, FPGA, etc.)
  • Perform STATE MINIMIZATION
  • Perform STATE ASSIGNMENT
  • Map the symbolic state table to encoded state tables for implementation (INPUT and OUTPUT encodings)
  • You can specify a specific state assignment in your Verilog code through parameter settings

  6. Finite State Machine Optimization
  • State minimization
  • Fewer states require fewer state bits
  • Fewer bits require fewer logic equations
  • Encodings: state, inputs, outputs
  • A state encoding with fewer bits has fewer equations to implement
  • However, each may be more complex
  • A state encoding with more bits (e.g., one-hot) has simpler equations
  • Complexity is directly related to the complexity of the state diagram
  • Input/output encoding may or may not be under designer control

  7. FSM Optimization
  • State reduction motivation: lower cost
  • Fewer flip-flops in one-hot implementations
  • Possibly fewer flip-flops in encoded implementations
  • More don't cares in next-state logic
  • Fewer gates in next-state logic
  • It is simpler to design with extra states, then reduce later
  • Example: odd parity checker - two machines with identical behavior

  8. Algorithmic Approach to State Minimization
  • Goal – identify and combine states that have equivalent behavior
  • Equivalent states:
  • Same output
  • For all input combinations, transition to the same or equivalent states
  • Algorithm sketch:
  • 1. Place all states in one set
  • 2. Initially partition the set based on output behavior
  • 3. Successively partition the resulting subsets based on next-state transitions
  • 4. Repeat (3) until no further partitioning is required
  • States left in the same set are equivalent
  • Polynomial-time procedure
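The partition algorithm above is easy to prototype. Below is a small Python sketch (not course code; the function and variable names are my own) that applies successive partitioning to the 010/110 sequence detector from the next two slides. Since this is a Mealy machine, the initial partition keys on the output for each input.

```python
def minimize(states, inputs, next_state, output):
    """Partition `states` into sets of equivalent states (successive partitions)."""
    # Steps 1-2: initial partition based on output behavior for every input.
    blocks = {}
    for s in states:
        key = tuple(output[s][x] for x in inputs)
        blocks.setdefault(key, []).append(s)
    partition = list(blocks.values())

    # Steps 3-4: split blocks until all next states land in matching blocks.
    changed = True
    while changed:
        changed = False
        block_of = {s: i for i, b in enumerate(partition) for s in b}
        new_partition = []
        for b in partition:
            groups = {}
            for s in b:
                key = tuple(block_of[next_state[s][x]] for x in inputs)
                groups.setdefault(key, []).append(s)
            new_partition.extend(groups.values())
            if len(groups) > 1:
                changed = True
        partition = new_partition
    return partition

# State table of the 010/110 detector from the following slides.
nxt = {'S0': {0: 'S1', 1: 'S2'}, 'S1': {0: 'S3', 1: 'S4'},
       'S2': {0: 'S5', 1: 'S6'}, 'S3': {0: 'S0', 1: 'S0'},
       'S4': {0: 'S0', 1: 'S0'}, 'S5': {0: 'S0', 1: 'S0'},
       'S6': {0: 'S0', 1: 'S0'}}
out = {'S0': {0: 0, 1: 0}, 'S1': {0: 0, 1: 0}, 'S2': {0: 0, 1: 0},
       'S3': {0: 0, 1: 0}, 'S4': {0: 1, 1: 0}, 'S5': {0: 0, 1: 0},
       'S6': {0: 1, 1: 0}}

classes = minimize(list(nxt), [0, 1], nxt, out)
print(sorted(sorted(b) for b in classes))
# -> [['S0'], ['S1', 'S2'], ['S3', 'S5'], ['S4', 'S6']]
```

The result matches the equivalences derived on the Method of Successive Partitions slide: S1 ≡ S2, S3 ≡ S5, S4 ≡ S6.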

  9. State Minimization Example
  • Sequence detector for 010 or 110 (note: Mealy machine)

  Input     Present   Next State    Output
  Sequence  State     X=0    X=1    X=0  X=1
  Reset     S0        S1     S2     0    0
  0         S1        S3     S4     0    0
  1         S2        S5     S6     0    0
  00        S3        S0     S0     0    0
  01        S4        S0     S0     1    0
  10        S5        S0     S0     0    0
  11        S6        S0     S0     1    0

  [State transition graph: a tree from S0 through S1/S2 down to S3–S6, with all of S3–S6 returning to S0; S4 and S6 output 1 on input 0. An alternative STT layout is also shown.]

  10. Method of Successive Partitions
  Applying the algorithm to the state table on the previous slide:
  ( S0 S1 S2 S3 S4 S5 S6 )
  ( S0 S1 S2 S3 S5 ) ( S4 S6 )
  ( S0 S1 S2 ) ( S3 S5 ) ( S4 S6 )
  ( S0 ) ( S1 S2 ) ( S3 S5 ) ( S4 S6 )
  • S1 is equivalent to S2
  • S3 is equivalent to S5
  • S4 is equivalent to S6

  11. Minimized FSM
  • State-minimized sequence detector for 010 or 110
  • 7 states reduced to 4 states
  • 3-bit encoding replaced by 2-bit encoding

  Input     Present   Next State    Output
  Sequence  State     X=0    X=1    X=0  X=1
  Reset     S0        S1'    S1'    0    0
  0 + 1     S1'       S3'    S4'    0    0
  X0        S3'       S0     S0     0    0
  X1        S4'       S0     S0     1    0

  [State graph: S0 –X/0→ S1'; S1' –0/0→ S3', –1/0→ S4'; S3' –X/0→ S0; S4' –0/1→ S0, –1/0→ S0]

  12. Another Example – Row Matching Method
  • 4-bit sequence detector: output 1 after each 4-bit input sequence that matches one of the binary strings 0110 or 1010

  13. State Transition Table
  • Group states with the same next states and the same outputs (the merge shown on the slide produces state S'10)

  14. Iterate the Row Matching Algorithm (the next merge produces state S'7)

  15. Iterate One Last Time (the final merges produce states S'3 and S'4)

  16. Final Reduced State Machine
  • 15 states (minimum 4 FFs) reduced to 7 states (minimum 3 FFs)

  17. More Complex State Minimization
  • Multiple-input example (2-bit input)
  • Symbolic state transition table:

  present   next state               output
  state     00    01    10    11
  S0        S0    S1    S2    S3    1
  S1        S0    S3    S1    S4    0
  S2        S1    S3    S2    S4    1
  S3        S1    S0    S4    S5    0
  S4        S0    S1    S2    S5    1
  S5        S1    S4    S0    S5    0

  [State diagram: six states S0[1], S1[0], S2[1], S3[0], S4[1], S5[0], with the 2-bit input values labeling the edges and outputs in brackets]

  18. State Reduction Limits
  • The "row matching" method is not guaranteed to find the optimal solution in all cases, because it only looks at pairs of states
  • (The slide shows an example where row matching falls short)
  • Another (more complicated) method guarantees the optimal solution:
  • the "implication table" method: cf. Mano, chapter 9
  • What "rule of thumb" heuristics apply?

  19. Minimized FSM
  • Implication chart method:
  • Build a table of all pairs of states
  • First eliminate incompatible pairs based on outputs
  • Fill each entry with the implied equivalent pairs based on next states
  • Cross out cells whose indexed chart entries are crossed out
  • (The original six-state table from slide 17 is repeated on the slide.)

  Minimized state table (S0 == S4, S3 == S5):

  present   next state               output
  state     00    01    10    11
  S0'       S0'   S1    S2    S3'   1
  S1        S0'   S3'   S1    S0'   0
  S2        S1    S3'   S2    S0'   1
  S3'       S1    S0'   S0'   S3'   0

  [Implication chart: one cell per state pair, each listing the implied next-state pairs (e.g., the S3–S5 cell lists S0–S4), with incompatible cells crossed out]
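The implication-chart procedure can likewise be sketched in a few lines of Python (an illustration, not course code; names are my own). Applied to the six-state table above, it leaves exactly the pairs (S0, S4) and (S3, S5) uncrossed:

```python
from itertools import combinations

# Six-state, 2-bit-input machine from the slide; the output here is
# per present state, so a pair is incompatible when the outputs differ.
nxt = {'S0': ['S0', 'S1', 'S2', 'S3'], 'S1': ['S0', 'S3', 'S1', 'S4'],
       'S2': ['S1', 'S3', 'S2', 'S4'], 'S3': ['S1', 'S0', 'S4', 'S5'],
       'S4': ['S0', 'S1', 'S2', 'S5'], 'S5': ['S1', 'S4', 'S0', 'S5']}
out = {'S0': 1, 'S1': 0, 'S2': 1, 'S3': 0, 'S4': 1, 'S5': 0}

# 1) Start with all pairs; cross out pairs whose outputs differ.
crossed = {frozenset(p) for p in combinations(nxt, 2)
           if out[p[0]] != out[p[1]]}

# 2) Repeatedly cross out a pair if any of its implied next-state
#    pairs is already crossed out.
changed = True
while changed:
    changed = False
    for a, b in combinations(nxt, 2):
        pair = frozenset((a, b))
        if pair in crossed:
            continue
        implied = {frozenset((x, y)) for x, y in zip(nxt[a], nxt[b]) if x != y}
        if implied & crossed:
            crossed.add(pair)
            changed = True

# Surviving pairs are equivalent states.
equiv = sorted(tuple(sorted(p)) for p in
               (frozenset(q) for q in combinations(nxt, 2)) if p not in crossed)
print(equiv)  # -> [('S0', 'S4'), ('S3', 'S5')]
```

This matches the slide's result: S0 == S4 and S3 == S5, reducing six states to four.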

  20. Minimizing Incompletely Specified FSMs
  • Equivalence of states is transitive when the machine is fully specified
  • But it is not transitive when don't cares are present, e.g.:

  state   output
  S0      –  0
  S1      1  –
  S2      –  1

  S1 is compatible with both S0 and S2, but S0 and S2 are incompatible
  • No polynomial-time algorithm exists for determining the best grouping of states into equivalent sets that will yield the smallest number of final states

  21. Minimizing States May Not Yield the Best Circuit
  • Example: edge detector – outputs 1 when the input changes from 0 to 1

  X   Q1  Q0    Q1+  Q0+
  0   0   0     0    0
  0   0   1     0    0
  0   1   1     0    0
  1   0   0     0    1
  1   0   1     1    1
  1   1   1     1    1
  –   1   0     0    0

  Q1+ = X (Q1 xor Q0)
  Q0+ = X Q1' Q0'

  [State diagram: states 00[0], 01[1], 11[0]; X advances 00 → 01 → 11 (self-loop on 11), X' returns to 00]

  22. Another Implementation of the Edge Detector
  • "Ad hoc" solution – not minimal, but cheap and fast

  [State diagram: four states 00[0], 01[1], 10[0], 11[0], shift-register style; X-edges lead toward 01/11, X'-edges toward 00/10]

  23. Announcements
  • Reading: K&B 8.1-2
  • HW 9 due Wednesday
  • Last HW will go out next week
  • TAs in lab this week as much as possible, rather than official lab meetings
  • Nov 29: bring your question on a sheet of paper
  • Down to the final stretch

  24. State Assignment
  • Choose bit vectors to assign to each "symbolic" state
  • With n state bits for m states there are (2^n)! / (2^n – m)! possible assignments [log2 m <= n <= m]
  • 2^n codes possible for the 1st state, 2^n – 1 for the 2nd, 2^n – 2 for the 3rd, …
  • Huge number even for small values of n and m
  • Intractable for state machines of any size
  • Heuristics are necessary for practical solutions
  • Optimize some metric for the combinational logic
  • Size (amount of logic and number of FFs)
  • Speed (depth of logic and fanout)
  • Dependencies (decomposition)
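As a quick check on how fast that count grows, a short Python snippet (illustrative only; the helper name is my own):

```python
from math import perm

# Number of ways to assign m symbolic states to distinct n-bit codes:
# 2^n choices for the 1st state, 2^n - 1 for the 2nd, ...
# i.e. (2^n)! / (2^n - m)!
def assignments(n_bits, m_states):
    return perm(2 ** n_bits, m_states)

print(assignments(3, 5))    # 5 states, 3 bits: 8*7*6*5*4 = 6720
print(assignments(4, 15))   # 15 states, 4 bits: 16!/1! = 20922789888000
```

Even a 15-state machine with a 4-bit encoding already has about 2 × 10^13 candidate assignments, which is why exhaustive search is hopeless and heuristics are used instead.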

  25. State Assignment Strategies
  • Possible strategies:
  • Sequential – just number states as they appear in the state table
  • Random – pick random codes
  • One-hot – use as many state bits as there are states (bit=1 –> state)
  • Output – use outputs to help encode states
  • Heuristic – rules of thumb that seem to work in most cases
  • No guarantee of optimality – another intractable problem

  26. State Maps
  • "K-maps" are used to help visualize good encodings
  • Adjacent states in the STD should be made adjacent in the map

  Assignment 1           Assignment 2
  State  q2 q1 q0        State  q2 q1 q0
  S0     0  0  0         S0     0  0  0
  S1     0  0  1         S1     1  0  1
  S2     0  1  0         S2     1  1  1
  S3     0  1  1         S3     0  1  0
  S4     1  1  1         S4     0  1  1

  27. State Maps and Counting Bit Changes
  • Bit-change heuristic: count the state-bit changes on each transition under each candidate assignment

  Transition   Bit changes
  S0 -> S1     2    1
  S0 -> S2     3    1
  S1 -> S3     3    1
  S2 -> S3     2    1
  S3 -> S4     1    1
  S4 -> S1     2    2
  Total        13   7

  (The total of 7 corresponds to the sequential assignment S0=000, S1=001, S2=010, S3=011, S4=111 from the previous slide; the other assignment totals 13.)
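The bit-change totals can be recomputed directly (a Python sketch, not course material; the two dictionaries encode the two candidate assignments from the State Maps slide):

```python
# Transitions listed on the slide.
transitions = [('S0', 'S1'), ('S0', 'S2'), ('S1', 'S3'),
               ('S2', 'S3'), ('S3', 'S4'), ('S4', 'S1')]

# The two candidate assignments (q2 q1 q0) from the previous slide.
alternative = {'S0': 0b000, 'S1': 0b101, 'S2': 0b111, 'S3': 0b010, 'S4': 0b011}
sequential  = {'S0': 0b000, 'S1': 0b001, 'S2': 0b010, 'S3': 0b011, 'S4': 0b111}

def total_bit_changes(code):
    # Sum of Hamming distances between the codes at each transition's endpoints.
    return sum(bin(code[a] ^ code[b]).count('1') for a, b in transitions)

print(total_bit_changes(alternative), total_bit_changes(sequential))  # 13 7
```

The sequential assignment wins here (7 vs. 13 total bit changes), matching the slide's totals.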

  28. State Assignment
  • Alternative heuristics based on input and output behavior as well as transitions
  • Give adjacent assignments to:
  • states that share a common next state (group 1's in the next-state map)
  • states that share a common ancestor state (group 1's in the next-state map)
  • states that have common output behavior (group 1's in the output map)

  29. Heuristics for State Assignment
  • Successor/predecessor heuristics:
  • High priority: S'3 and S'4 share a common successor state (S0)
  • Medium priority: S'3 and S'4 share a common predecessor state (S'1)
  • Low priority (common output behavior):
  • 0/0: S0, S'1, S'3
  • 1/0: S0, S'1, S'3, S'4

  30. Heuristics for State Assignment (continued) [slide consists of a figure]

  31. Another Example
  • High priority: (S'3, S'4), (S'7, S'10)
  • Medium priority: (S1, S2), 2 × (S'3, S'4), (S'7, S'10)
  • Low priority (common output behavior):
  • 0/0: S0, S1, S2, S'3, S'4, S'7
  • 1/0: S0, S1, S2, S'3, S'4, S'7, S'10

  32. Example Continued
  • Choose the assignment S0 = 000
  • Place the high-priority adjacency state pairs into the state map
  • Repeat for the medium-priority adjacency pairs
  • Repeat for any leftover states, using the low-priority scheme
  • Two alternative assignments are shown at the left

  33. Why Do These Heuristics Work?
  • They attempt to maximize adjacent groupings of 1's in the next-state and output functions

  34. General Approach to Heuristic State Assignment
  • All current methods are variants of this:
  • 1) Determine which states "attract" each other (weighted pairs)
  • 2) Generate constraints on codes (which should be in the same cube)
  • 3) Place codes on the Boolean cube so as to maximize the constraints satisfied (weighted sum)
  • Different weights make sense depending on whether we are optimizing for two-level or multi-level forms
  • Can't consider all possible embeddings of state clusters in the Boolean cube
  • Heuristics for ordering the embedding
  • To prune the search for the best embedding
  • Expand the cube (more state bits) to satisfy more constraints

  35. One-Hot State Assignment
  • Simple
  • Easy to encode and debug
  • Small logic functions
  • Each state function requires only the predecessor state bits as input
  • Good for programmable devices
  • Lots of flip-flops readily available
  • Simple functions with small support (the signals it depends upon)
  • Impractical for large machines
  • Too many states require too many flip-flops
  • Decompose FSMs into smaller pieces that can be one-hot encoded
  • Many slight variations to one-hot
  • One-hot + all-0

  36. Output-Based Encoding
  • Reuse outputs as state bits – use outputs to help distinguish states
  • Why create new functions for state bits when the outputs can serve as well?
  • Fits in nicely with synchronous Mealy implementations

  Inputs        Present  Next    Outputs
  C  TL  TS     State    State   ST  H   F
  0  –   –      HG       HG      0   00  10
  –  0   –      HG       HG      0   00  10
  1  1   –      HG       HY      1   00  10
  –  –   0      HY       HY      0   01  10
  –  –   1      HY       FG      1   01  10
  1  0   –      FG       FG      0   10  00
  0  –   –      FG       FY      1   10  00
  –  1   –      FG       FY      1   10  00
  –  –   0      FY       FY      0   10  01
  –  –   1      FY       HG      1   10  01

  HG = ST' H1' H0' F1 F0' + ST H1 H0' F1' F0
  HY = ST H1' H0' F1 F0' + ST' H1' H0 F1 F0'
  FG = ST H1' H0 F1 F0' + ST' H1 H0' F1' F0'
  FY = ST H1 H0' F1' F0' + ST' H1 H0' F1' F0

  • Output patterns are unique to states, so we do not need ANY state bits – implement 5 functions (one for each output) instead of 7 (outputs plus 2 state bits)

  37. Current State Assignment Approaches
  • For tight encodings using close to the minimum number of state bits
  • Best of 10 random seems to be adequate (averages as well as heuristics)
  • Heuristic approaches are not even close to optimality
  • Used in custom chip design
  • One-hot encoding
  • Easy for small state machines
  • Generates small equations with easy-to-estimate complexity
  • Common in FPGAs and other programmable logic
  • Output-based encoding
  • Ad hoc – no tools
  • Most common approach taken by human designers
  • Yields very small circuits for most FSMs

  38. Sequential Logic Implementation Summary
  • Implementation of sequential logic
  • State minimization
  • State assignment
  • Implications for programmable logic devices
  • When logic is expensive and FFs are scarce, optimization is highly desirable (e.g., gate logic, PLAs, etc.)
  • In Xilinx devices, logic is bountiful (4- and 5-variable TTs) and FFs are plentiful (2 per CLB), so optimization is not as crucial an issue as in other forms of programmable logic
  • This makes sparse encodings like one-hot worth considering

  39. Improving Cycle Time
  • Retiming
  • Parallelism
  • Pipelining

  40. Example: Vending Machine State Machine
  • Moore machine: outputs associated with states
  • Mealy machine: outputs associated with transitions

  [Two state diagrams over states 0¢, 5¢, 10¢, 15¢[1], with edges labeled by nickel (N), dime (D), N' D' self-loops, and Reset. The Moore version puts the 1 output on state 15¢; the Mealy version puts it on the transitions into 15¢ (D/1, (N+D)/1) and on the Reset'/1 self-loop at 15¢.]

  41. State Machine Retiming
  • Moore vs. (async) Mealy machine
  • Vending machine example:
  • Moore: Open asserted only when in state 15¢
  • Mealy: Open asserted when the last coin is inserted, leading to state 15¢

  42. State Machine Retiming
  • Retiming the Moore machine: faster generation of outputs
  • Push the AND gate through the state FFs and synchronize with an output FF
  • Synchronizing the Mealy machine: add a FF, delaying the output
  • Like computing Open in the prior state and delaying it one state time
  • These two implementations have identical timing behavior

  43. State Machine Retiming
  • Effect on the timing of the Open signal (Moore case)

  [Timing diagram: Clk; FF propagation delay; state; output propagation delay. The retimed Open becomes valid one output delay after the clock edge instead of a state-output delay later. NOTE: the Open calculation (plus setup) overlaps with the next-state calculation.]

  44. State Machine Retiming
  • Timing behavior is the same, but are the implementations really identical?
  • The only difference is in the don't-care case of a nickel and a dime at the same time

  [K-maps comparing the FF input in the retimed Moore implementation with the FF input in the synchronous Mealy implementation]

  45. Parallelism
  • Example: student final grade calculation:
  read mt1, mt2, mt3, project;
  grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project;
  write grade;
  • High-performance hardware implementation: as many operations as possible are done in parallel
  • Doing more than one thing at a time: optimization in hardware often involves using parallelism to trade between cost and performance

  46. Parallelism
  • Is there a lower-cost hardware implementation? A different tree organization?
  • Can factor out the multiply by 0.2: grade = 0.2 × (mt1 + mt2 + mt3) + 0.4 × project
  • How about sharing operators (multipliers and adders)?
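The factoring can be sanity-checked in software (an illustrative Python sketch; the sample scores are made up). The factored form needs two multipliers instead of four, at the cost of serializing the additions:

```python
# Direct form: four multiplies, three adds.
def grade_direct(mt1, mt2, mt3, project):
    return 0.2 * mt1 + 0.2 * mt2 + 0.2 * mt3 + 0.4 * project

# Factored form: two multiplies, three adds -- same result.
def grade_factored(mt1, mt2, mt3, project):
    return 0.2 * (mt1 + mt2 + mt3) + 0.4 * project

scores = (80, 90, 70, 85)  # hypothetical mt1, mt2, mt3, project scores
print(grade_direct(*scores), grade_factored(*scores))
```

Both forms compute the same grade (up to floating-point rounding); in hardware, the choice trades multiplier count against the depth and shape of the adder tree.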

  47. Pipelining Principle (review from CS61C)
  • Analogy to washing clothes:
  step 1: wash (20 minutes)
  step 2: dry (20 minutes)
  step 3: fold (20 minutes)
  • Sequential: 60 minutes × 4 loads = 4 hours
  • Overlapped in 20-minute steps: 2 hours
  wash  load1 load2 load3 load4
  dry         load1 load2 load3 load4
  fold              load1 load2 load3 load4

  48. Pipelining
  wash  load1 load2 load3 load4
  dry         load1 load2 load3 load4
  fold              load1 load2 load3 load4
  • As the number of loads increases, the average time per load approaches 20 minutes
  • Latency (time from start to end) for one load = 60 min
  • Throughput = 3 loads/hour
  • Pipelined throughput ≈ (# of pipe stages) × (un-pipelined throughput)

  49. Pipelining
  • General principle: cut the CL block into pieces (stages) and separate them with registers
  • Assume T = 8 ns and TFF (setup + clk-to-Q) = 1 ns: F = 1/(8 ns + 1 ns) = 1/9 ns = 111 MHz
  • Assume T1 = T2 = 4 ns: T' = 4 ns + 1 ns + 4 ns + 1 ns = 10 ns, but F = 1/(4 ns + 1 ns) = 200 MHz
  • The CL block produces a new result every 5 ns instead of every 9 ns
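The clock-rate arithmetic on this slide, as a short Python check (illustrative only):

```python
# An 8 ns combinational-logic block with 1 ns of flip-flop overhead
# (setup + clk-to-Q), then the same block cut into two equal 4 ns stages
# with a pipeline register between them.
t_logic, t_ff = 8.0, 1.0                    # ns

f_unpipelined = 1e3 / (t_logic + t_ff)      # MHz (1e3 converts 1/ns to MHz)
f_pipelined   = 1e3 / (t_logic / 2 + t_ff)  # MHz, clock set by one 4 ns stage

print(round(f_unpipelined), round(f_pipelined))  # 111 200
```

Note that throughput improves from 111 MHz to 200 MHz (not 2 × 111), because the 1 ns flip-flop overhead is paid in every stage; this is exactly the FF-overhead limit discussed on the next slide.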

  50. Limits on Pipelining
  • Without FF overhead, throughput improvement is proportional to the number of stages
  • After many stages are added, FF overhead begins to dominate (the FF "overhead" is the setup and clk-to-Q times)
  • Other limiters of effective pipelining:
  • clock skew contributes to clock overhead
  • unequal stages
  • FFs dominate cost
  • clock distribution power consumption
  • feedback (dependencies between loop iterations)
