L17 :Logic Level Design

L17: Logic Level Design. Prof. Jun-Dong Cho, Sungkyunkwan University http://vlsicad.skku.ac.kr

Peak Power Reduction • Peak power is related to EMI. • Reducing concurrent switching reduces peak power. • Adjust the delays in the bus/port drivers, staying within the system clock period. • Account for the power consumed by the delay elements. • With total power consumption unchanged, reducing peak power improves EMI. [Figure: before peak power reduction; after peak power reduction]

Factoring Example • Function: f = ab + bc + cd • The function f is not on the critical path. • The signals a, b, c and d all have the same bit width. • Signal b is a high-activity net. • The two factorings are equivalent in both timing and area. • Net result: network toggling and power are reduced.

Block diagram of a low-voltage, high-speed LSI • The Power Management Processor (PMP) controls the low-Vt circuit using the sleep signal. • Extend the sleep period as much as possible, because leakage power is reduced during this time.

Operations of low-Vt LSI • The LSI receives a request signal from an I/O device, outputs the results, and waits for the next request signal. • During the waiting period, the low-Vt circuit can sleep.

Waking/Sleeping operation [Figure: waking operation; sleeping operation]

Creating sleep period: Operation during calculation • Heavy operations such as voice CODEC and light operations such as data collection can be distributed between the low-Vt circuit and the PMP, so the low-Vt circuit can sleep while the PMP executes the light operations. • Reduces power by 10%.

Interconnection power optimization • Coding for reduced switching activity: introduce sample-to-sample correlation such that the total number of transitions is reduced. • Coding schemes (n-bit data transmitted over m wires) • Non-redundant: m = n; with knowledge of the data statistics, other non-redundant schemes are possible. • Redundant: m > n; one-to-one or one-to-many mapping can be used.

One-Hot Coding • The two chips are interconnected by m = 2^n wires. • Place 'H' on the i-th wire, where 0 <= i <= 2^n - 1; all other wires are 'L'. • Both the encoder and decoder are memoryless. • Power reduction • assumes the n-bit data words are independent • and uniformly randomly distributed
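The one-hot scheme above can be sketched in a few lines of Python. The function names are illustrative, and the two average-toggle expressions are a derivation under the slide's stated assumptions (independent, uniformly distributed n-bit words), not a formula taken from the slide.

```python
def one_hot(value: int, n: int) -> int:
    """Return the 2**n-wire one-hot word for an n-bit value (only wire `value` is high)."""
    if not 0 <= value < 2 ** n:
        raise ValueError("value does not fit in n bits")
    return 1 << value


def toggles(a: int, b: int) -> int:
    """Number of wires that change between two consecutive bus words."""
    return bin(a ^ b).count("1")


def avg_toggles_binary(n: int) -> float:
    # Each of the n wires toggles with probability 1/2 for independent uniform words.
    return n / 2


def avg_toggles_one_hot(n: int) -> float:
    # Exactly two wires toggle whenever consecutive words differ, which
    # happens with probability 1 - 2**-n for independent uniform words.
    return 2 * (1 - 2 ** -n)
```

Under these assumptions one-hot coding never toggles more than two wires per transfer, so it pays off for wider words (for n = 8, fewer than 2 toggles on average against 4 for plain binary), at the cost of 2^n wires.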

Gray-Coding • Adjacent numbers differ in only one bit. • Useful when the transmitted data is sequential and highly correlated. • Conversion • B = <b_{n-1}, b_{n-2}, …, b_1, b_0> (binary number), • G = <g_{n-1}, g_{n-2}, …, g_1, g_0> (Gray-coded number) • binary to Gray-code conversion • g_{n-1} = b_{n-1}, g_i = b_{i+1} ⊕ b_i (i = n-2, …, 0) • Gray-code to binary conversion • b_{n-1} = g_{n-1}, b_i = b_{i+1} ⊕ g_i (i = n-2, …, 0) • example • B = <1,1,0,1> ⇒ <b_3, b_3 ⊕ b_2, b_2 ⊕ b_1, b_1 ⊕ b_0> = <1,0,1,1> • G = <1,0,1,1> ⇒ <g_3, g_3 ⊕ g_2, g_3 ⊕ g_2 ⊕ g_1, g_3 ⊕ g_2 ⊕ g_1 ⊕ g_0> = <1,1,0,1>
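The two conversion rules above map directly onto shifts and XORs. A minimal Python sketch (function names are illustrative) that reproduces the slide's example:

```python
def binary_to_gray(b: int) -> int:
    # g_{n-1} = b_{n-1}, g_i = b_{i+1} XOR b_i: XOR the word with itself shifted right by one.
    return b ^ (b >> 1)


def gray_to_binary(g: int) -> int:
    # b_{n-1} = g_{n-1}, b_i = b_{i+1} XOR g_i: each binary bit is the XOR
    # of all Gray bits at or above its position.
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


print(format(binary_to_gray(0b1101), "04b"))  # slide example: B = <1,1,0,1>
print(format(gray_to_binary(0b1011), "04b"))  # slide example: G = <1,0,1,1>
```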

Gray-Coding • Application • Code the address lines for instruction access: the accesses are sequential, so switching activity is reduced. • If used for data address lines, the number of transitions is about the same as with binary representation. [Figure: bar charts of BPI (bit transitions per instruction, axis 1.0 to 3.0) for binary and Gray-code addressing on benchmarks including qsort, reducer, circuit, and semigroup]

Bus-Invert Coding for Low Power I/O [Figure] An eight-bit bus on which all eight lines toggle at the same time and which has a high peak (worst-case) power dissipation. There are 16 transitions over 16 clock cycles (an average of 1 transition per clock cycle).

Peak Power Dissipation [Figure] An eight-bit bus on which the eight lines toggle at different moments and which has a low peak power dissipation. There are the same 16 transitions over 16 clock cycles and thus the same average power dissipation.
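The contrast between the two buses can be checked with a small toggle count. The two schedules below are hypothetical stand-ins for the figures (the actual waveforms are not recoverable from the slides); they only illustrate that the same 16 transitions can give very different peaks.

```python
def per_cycle_toggles(schedule, cycles):
    """schedule maps each bus line to the list of cycles in which that line toggles."""
    counts = [0] * cycles
    for toggle_cycles in schedule.values():
        for c in toggle_cycles:
            counts[c] += 1
    return counts


CYCLES, LINES = 16, 8
# Hypothetical schedule 1: all eight lines toggle together, in cycles 0 and 8.
simultaneous = {line: [0, 8] for line in range(LINES)}
# Hypothetical schedule 2: line k toggles alone, in cycles k and k + 8.
staggered = {line: [line, line + 8] for line in range(LINES)}

sim = per_cycle_toggles(simultaneous, CYCLES)
stag = per_cycle_toggles(staggered, CYCLES)
print("total:", sum(sim), sum(stag))  # same total, hence same average power
print("peak: ", max(sim), max(stag))  # very different worst-case cycle
```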

Bus-Invert - Coding for low power • The Bus-Invert method proposed here uses one extra control bit called invert. By convention, when invert = 0 the bus value equals the data value; when invert = 1 the bus value is the inverted data value. The peak power dissipation can then be decreased by half by coding the I/O as follows: • 1. Compute the Hamming distance (the number of bits in which they differ) between the present bus value (also counting the present invert line) and the next data value. • 2. If the Hamming distance is larger than n/2, set invert = 1 (and thus make the next bus value equal to the inverted next data value). • 3. Otherwise, let invert = 0 (and let the next bus value equal the next data value). • 4. At the receiver side, the contents of the bus must be conditionally inverted according to the invert line, unless the data is stored encoded as it is (e.g. in a RAM). In any case the value of invert must be transmitted over the bus (the method increases the number of bus lines from n to n + 1).
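The four steps above can be sketched as a small encoder/decoder pair. This is an illustrative Python model, not the original implementation: the function names, the all-zero initial bus state, and the reading of "counting the present invert line" as adding the current invert bit to the distance are assumptions.

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def bus_invert_encode(words, n=8):
    """Return a list of (bus_value, invert) pairs for a sequence of n-bit data words."""
    mask = (1 << n) - 1
    bus, inv = 0, 0  # assumed initial state of the bus and the invert line
    coded = []
    for d in words:
        d &= mask
        # Step 1: distance over the n data lines plus the invert line
        # (the candidate next state has invert = 0).
        dist = hamming(bus, d) + inv
        if dist > n / 2:      # Step 2: send the inverted word
            bus, inv = ~d & mask, 1
        else:                 # Step 3: send the word unchanged
            bus, inv = d, 0
        coded.append((bus, inv))
    return coded


def bus_invert_decode(coded, n=8):
    # Step 4: conditionally invert at the receiver.
    mask = (1 << n) - 1
    return [(~bus & mask) if inv else bus for bus, inv in coded]


def transitions_per_slot(coded):
    """Toggles per time slot, counting the invert line, from the assumed all-zero start."""
    prev_bus, prev_inv = 0, 0
    counts = []
    for bus, inv in coded:
        counts.append(hamming(prev_bus, bus) + (prev_inv ^ inv))
        prev_bus, prev_inv = bus, inv
    return counts
```

With n = 8, no time slot ever sees more than 4 toggles across the nine lines, which is the halving of peak activity the slide describes.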

Bus-Inversion Coding • Redundant coding with m = n + 1 • If data word S is to be transmitted, either S or S', the bit-wise inversion of S, can be transmitted. • An extra wire P indicates the polarity. • The decoder is memoryless, and the encoder uses only the current state of the wires. • Power saving

Example [Figure] A typical eight-bit synchronous data bus. The transitions between two consecutive time slots are "clean". There are 64 transitions over a period of 16 time slots. This represents an average of 4 transitions per time slot, or 0.5 transitions per bus line per time slot.

Bus encoding The same sequence of data coded using the Bus Invert method. There are now only 53 transitions over a period of 16 time slots. This represents an average of 3.3 transitions per time slot, or 0.41 transitions per bus line per time slot. The maximum number of transitions for any time slot is now 4.

Bus-Inversion Coding • For large values of n, the efficacy of the coding technique disappears as the ratio converges to 1. • Dividing large bit groups into smaller groups is better. [Figure: ratio (bit transitions with encoding) / (bit transitions without encoding), axis 0.7 to 1.0, plotted against data word width from 4 to 32 bits]
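The trend in the plot can be reproduced with a short calculation. The sketch below assumes uniformly distributed, independent data words and an invert line that is currently 0, so that the Hamming distance to the next word is binomially distributed; the function names are illustrative and the numbers are computed here, not read off the slide.

```python
from math import comb


def expected_transitions_bus_invert(n: int) -> float:
    """Expected toggles per time slot over the n data lines plus the invert line."""
    total = 0
    for h in range(n + 1):
        # h <= n/2: send as-is (h toggles).
        # h >  n/2: send inverted (n - h data toggles plus 1 invert-line toggle).
        cost = h if h <= n / 2 else n - h + 1
        total += comb(n, h) * cost
    return total / 2 ** n


def transition_ratio(n: int) -> float:
    # An unencoded n-bit bus toggles n/2 lines per slot on average.
    return expected_transitions_bus_invert(n) / (n / 2)


for width in (4, 8, 16, 32):
    print(width, round(transition_ratio(width), 3))
```

The ratio rises toward 1 as the word width grows, which is why splitting a wide bus into smaller groups, each with its own invert line, is suggested.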

Comparisons Comparison of unencoded I/O and coded I/O with one or more invert lines. The comparison looks at the average and maximum number of transitions per time-slot, per bus-line per time-slot, and I/O power dissipation for different bus-widths.

Remarks • The increase in the delay of the data path: by looking at the power-delay product, which removes the effect of frequency (delay) on power dissipation, a clear improvement is obtained in the form of an absolutely lower number of transitions. It is also relatively easy to pipeline the bus activity. The extra pipeline stage and the extra latency must then be considered. • The increased number of I/O pins: as mentioned before, ground bounce is a big problem for simultaneous switching in high-speed designs. That is why modern microprocessors use a large number of Vdd and GND pins. The Bus-Invert method has the side effect of decreasing the maximum ground bounce by approximately 50%. Thus circuits using the Bus-Invert method can use fewer Vdd and GND pins, and the total number of pins might even decrease. • The Bus-Invert method decreases the total power dissipation even though the total number of transitions increases (counting the extra internal transitions) and the total capacitance increases (because of the extra circuitry). This is possible because the transitions are redistributed very nonuniformly: more on the low-capacitance side and fewer on the high-capacitance side.

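Lower Power Data Encoding • S.S.Chun and J.D.Cho ’97 • A method that restructures the Huffman code to reduce the number of switching events while preserving the compression ratio produced by the Huffman coding algorithm. • Substreams that share many common subsequences are given a low-switching encoding such as Gray code. • For RISC instruction addressing, Gray-code addressing yields up to a 50% power reduction compared with binary-code addressing.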

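Gray Code • Define the Hamming distance between two n-dimensional (n-bit) vectors U = (u_1, u_2, …, u_n) and V = (v_1, v_2, …, v_n) as h(U,V) = Σ_{i=1}^{n} (u_i ⊕ v_i), where (u_i ⊕ v_i) is 1 if the bits u_i and v_i differ and 0 otherwise. This can also be expressed as the distance traveled along the edges of the n-dimensional hypercube G. Gray code = shortest path in G • Because Huffman codewords can differ in length and must remain prefix-free, converting them into an exact Gray code is impossible; a compression encoding that minimizes the number of bit changes is needed instead.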
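The Hamming-distance definition above can be checked against a Gray sequence in a few lines of Python. The helper names are illustrative; the check confirms that consecutive Gray codewords are at distance 1, i.e. neighbours along one edge of the hypercube.

```python
def hamming_distance(u, v):
    """h(U, V) = sum over i of (u_i XOR v_i) for equal-length bit vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(ui ^ vi for ui, vi in zip(u, v))


def gray_sequence(n):
    """All n-bit Gray codewords in counting order, as tuples of bits (MSB first)."""
    words = []
    for k in range(2 ** n):
        g = k ^ (k >> 1)
        words.append(tuple((g >> i) & 1 for i in reversed(range(n))))
    return words


seq = gray_sequence(4)
steps = [hamming_distance(a, b) for a, b in zip(seq, seq[1:])]
print(set(steps))  # every step moves along exactly one hypercube edge
```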

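2-D Traveling Salesman Problem • The proposed problem assigns code pairs with a small Hamming distance to character pairs that are frequently adjacent, so it is formulated as a new problem that handles two or more TSPs simultaneously. • Using a heuristic: 10% reduction in switching activity for random uncorrelated data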

Data Representation • 2’s complement • Most signal processing uses 2’s complement. • Significant switching activity when the signal changes from negative to positive (the MSBs toggle) • Sign-magnitude • Only one bit toggles when the signal switches sign, provided the dynamic range of the signal does not span the entire bit width.
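The difference between the two representations shows up clearly for a small signal crossing zero. A minimal Python sketch, assuming an 8-bit word and illustrative helper names:

```python
def twos_complement(x: int, bits: int) -> int:
    """Bit pattern of x in two's complement."""
    if not -(1 << (bits - 1)) <= x < (1 << (bits - 1)):
        raise ValueError("value out of range")
    return x & ((1 << bits) - 1)


def sign_magnitude(x: int, bits: int) -> int:
    """Bit pattern of x with the MSB as sign and the remaining bits as magnitude."""
    if abs(x) >= (1 << (bits - 1)):
        raise ValueError("value out of range")
    sign = (1 << (bits - 1)) if x < 0 else 0
    return sign | abs(x)


def toggles(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


BITS = 8
# A small signal changing from +1 to -1:
print(toggles(twos_complement(1, BITS), twos_complement(-1, BITS)))  # sign-extension bits flip
print(toggles(sign_magnitude(1, BITS), sign_magnitude(-1, BITS)))    # only the sign bit flips
```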

Data Representation • ρ is the correlation coefficient of the data • Large ρ implies that the signal changes slowly and switches sign very infrequently • Negative ρ implies that the signal changes frequently from positive to negative [Figure: two panels plotting transition probability (0.0 to 1.0) against bit number (0 to 14) for ρ = -0.99, -0.50, 0, 0.50, and 0.99]

DesignPower: inputs & outputs • Inputs: switching activity information (from VHDL or Verilog RTL simulation, or from VHDL or Verilog gate-level simulation), library, gate-level netlist • Output: power report • Total design • Modules • Individual nets • Individual cells

Switching Activity Generation - RTL • Activity of the synthesis invariant nodes is captured during RTL simulation • sequential outputs, hierarchical boundaries, black-box pins • Utilizes a zero-delay cycle-based propagation engine • Same activity is used for both analysis and optimization • New switching activity is required when the synthesis invariant behavior is changed

Switching Activity: RTL vs. Gate-Level • RTL Switching Activity: • Available early in the design process • Fast • Accurate • Does not account for glitches • Does not fully support state- and path-dependency • Gate-Level Switching Activity: • Very accurate • Accounts for glitches • State- and path-dependency support • Requires lengthy gate-level simulation • Usually done at the later stages of the design process [Figure: example system with A/D, D/A, memory, µP, µC, mega cells, S/P, P/S, control logic, and DMA blocks]

PowerGate for Detailed Power • Power verification at the later stages of the design cycle • Ensure that the power budget and constraints are satisfied • Time-based: peak power and time-average power at user-defined intervals • Identify power-hungry vectors / instructions • Isolate power problems in time [Figure: flow from RTL design through Power Compiler (RTL clock gating), DesignPower, Design Compiler, Power Compiler, PowerGate, and place & route to a power-optimized design]

Identify Excessive Power In Time • The average power consumption looks OK, yet is there a problem with the memory? • Is the memory cycle valid? (address collision) • Is there data contention? (are both ports in the read mode?) [Figure: dual-port RAM driven by Control Logic 1 (Address 1) and Control Logic 2 (Address 2) over a common data bus; plot of power versus time with the average marked]

Power Compiler @ RTL • Push-button reduction in power at the RT level • RTL clock gating • No changes required to the RTL code • Can deliver significant reduction in power • Power reduction is design dependent • We have seen 30% - 60% power reduction in some designs • Downstream dependencies • Logic synthesis • Testability • Clock tree synthesis [Figure: RTL source feeds Power Compiler clock gating (elaborate -gate_clock), which produces an unmapped netlist plus constraints for Design Compiler]

Automatic Clock-Gating @ RTL • RTL source: always @(posedge CLK) if (EN) D_out <= D_in; • elaborate: synchronous-load-enable implementation [Figure: FSM drives EN of a register bank with D_in, D_out, and CLK] • elaborate -gate_clock: gated-clock implementation [Figure: FSM drives EN into a latch that gates CLK to produce G_CLK for the register bank with D_in and D_out]

Clock-Gating @ RTL - Power Savings • Power savings by clock gating • Reduced internal power consumption at the clock-gated flip-flops • No need for muxes to re-circulate the data for these flip-flops (saves power and area) • Reduced power consumption by the clock network • Power saving depends on • # of load-enable registers • % of disabled cycles [Figure: gated-clock register bank (FSM, EN, latch, CLK, G_CLK, D_in, D_out) with callouts 1 to 3]

Clock-Gating Styles • Extensive user control • Latch-based or latch-free gating style • Which register banks to gate or exclude from gating • Positive (AND) or negative (OR) gating logic • Minimal bit width of gated registers [Figure: three gating circuits taking EN and CLK and producing GCLK: latch-based {NAND INV}, latch-free {OR}, and latch-free {INV NAND BUF}]

Clock-Gating @ RTL - Dependencies • Logic Synthesis • Power Compiler automatically generates set-up and hold constraints on the gating element • Combinatorial set-up and hold checks are performed by DC • Testability • Medium and high testability options for controllability & observability of the enable signal • Test Compiler and DC XP can handle the gating circuitry during rule-checking and ATPG • Clock-Tree-Synthesis • Supported by many ASIC vendors and tools providers • Contact your vendor for details

Clock-Gating - Medium Testability • TEST_MODE enables override of clock gating during scan-in and scan-out • Asserting TEST_MODE during the parallel mode will make FSM faults untestable [Figure: FSM, EN, TEST_MODE, latch, CLK, G_CLK, and register bank with D_in and D_out]

Clock-Gating - High Testability • All FSM faults are testable • Testability logic does not consume power • Higher area cost [Figure: FSM, EN, TEST_MODE, latch, CLK, G_CLK, and register bank with D_in and D_out, plus an observability register fed by EN and other observability nodes]

Power Compiler @ Gate-Level • Inputs to Design Compiler with Power Compiler: gate-level netlist, switching activity, constraints (timing, power, area), parasitics (capacitance), technology library • Command: dc_shell> compile -incremental • Output: power-optimized gate-level netlist

Power Compiler @ Gate-Level • Optimizes power simultaneously with area and timing • New optimization technologies added for power • Activity-based optimizations minimize power subject to power constraints • Power added to the synthesis optimization cost function • 10% - 20% push-button reduction in power • Works within timing constraints • no increase in negative slack • Requires synthesis libraries updated for power • Completely integrated with Links-to-Layout methodology

Optimization Priorities • Cost types and their constraints, in the order shown along the slide's priority axis: 1. Design Rule: Max Trans, Max Fanout 2. Delay: Clock Period, Max_delay, Min_delay 3. Dynamic Power: Max Dynamic Power 4. Leakage Power: Max Leakage Power 5. Area: Max Area • The optimization priorities are hard-coded • Try tightening/loosening the constraints to get the required speed/power/area trade-offs • Power Compiler works within the specified timing constraints

Cell Sizing Example [Figure: two implementations of the same network with inputs a, b, c, d, internal nets n1 and n2, and output f, built from an2a and an2c cells; annotations: critical path, low activity net, sized up, sized down] • First implementation • Delay (a,f): reqd = 4, actual = 3.3 • Cload: f = 4; n1, n2 = 2 • TR: a, b = .25; c, d = .5 => n1 = .125, n2 = .25, f = .56 • Power = 4.125 • Second implementation • Delay (a,f): reqd = 4, actual = 3.5 • Cload: f = 3; n1 = 2.5, n2 = 1.5 • TR: a, b = .25; c, d = .5 => n1 = .125, n2 = .25, f = .56 • Power = 3.69 • Note: internal power effects (i.e. edge rate) are also considered

Factoring Example • Function: f = ab + bc + cd • The function f is not on the critical path • The signals a, b, c and d all have the same bit width • Signal b is a high-activity net • The two implementations are equivalent in both timing and area • Net result: network toggling and power are reduced • Implementation 1: f = b(a + c) + cd • Implementation 2: f = ab + c(b + d) [Figure: gate networks for the two implementations with inputs a, b, c, d and output f]
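A quick truth-table check confirms that both factorings implement f = ab + bc + cd, and counting literals shows how many gate inputs each signal drives. The Python sketch below is illustrative: the function names are hypothetical, and the fan-in observation is an inference from the expressions, since the slide does not state which factoring is preferred.

```python
from itertools import product


def f_sop(a, b, c, d):
    return (a & b) | (b & c) | (c & d)      # f = ab + bc + cd


def f_factor_b(a, b, c, d):
    return (b & (a | c)) | (c & d)          # f = b(a + c) + cd


def f_factor_c(a, b, c, d):
    return (a & b) | (c & (b | d))          # f = ab + c(b + d)


def equivalent(f, g):
    """Exhaustively compare two 4-input Boolean functions."""
    return all(f(*bits) == g(*bits) for bits in product((0, 1), repeat=4))


# Number of gate inputs driven by the high-activity signal b in each form.
b_loads = {expr: expr.count("b") for expr in ("b(a+c)+cd", "ab+c(b+d)")}

print(equivalent(f_sop, f_factor_b), equivalent(f_sop, f_factor_c))
print(b_loads)
```

In b(a + c) + cd the high-activity net b drives a single gate input, while in ab + c(b + d) it drives two, so the first form presents less switched capacitance to b.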

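Pin Swapping Example • Move high-toggle nets to lower-capacitance pins [Figure: the same gate with inputs a, b, c, d and output f, before and after swapping. Before: the net with toggle rate = .4 is on the pin with Cpin = C1 and the net with toggle rate = .8 is on the pin with Cpin = 1.5C1. After: the net with toggle rate = .8 is on the pin with Cpin = C1 and the net with toggle rate = .4 is on the pin with Cpin = 1.5C1]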
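The benefit of the swap can be estimated as the sum of pin capacitance times toggle rate, to which dynamic power is proportional. The sketch below is a rough Python calculation; which panel of the figure is "before" and which is "after" is an assumption read from the slide labels, and C1 is normalized to 1.

```python
def switched_capacitance(assignments):
    """Sum of (pin capacitance x toggle rate) over the swapped pins."""
    return sum(cap * rate for cap, rate in assignments)


C1 = 1.0  # normalized pin capacitance (assumption)

# (pin capacitance, toggle rate of the net connected to that pin)
before = [(C1, 0.4), (1.5 * C1, 0.8)]  # high-toggle net on the larger pin
after = [(C1, 0.8), (1.5 * C1, 0.4)]   # high-toggle net moved to the smaller pin

sc_before = switched_capacitance(before)
sc_after = switched_capacitance(after)
print(sc_before, sc_after, 1 - sc_after / sc_before)
```

Under these assumptions the swap lowers the switched capacitance on the two pins from 1.6 C1 to 1.4 C1, a reduction of 12.5%.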

Phase Assignment Example • Implementation tradeoff criteria: • toggle rates of inputs and outputs • pin capacitance of library cell • Solution requires: • dynamic power cost function • actual toggle rates • accurate cell libraries [Figure: two implementations of a 2:1 mux with inputs A and B, annotated with toggle rates TR = .7 and TR = .3; one implementation has area = 7 and the other area = 6]

Push-Button Power Reduction • Intel success (presented by Intel at SNUG 1998): a graphics chip for which both power and area are critical, synthesized to a 0.35 µm library at 3.3 V. Achieved 12%, 21%, and 24% reduction in power on 3 blocks with 2% or less area increase. • Lucent success: an ISDN transceiver ASIC, 40K-gate block, synthesized to a 0.35 µm library. Achieved 12% push-button power reduction with 3.3% area increase.

ASIC Methodology [Figure: power flow across the design stages, with speed and accuracy/diagnosis axes. Design exploration: RTL design, RTL simulation producing RTL switching activity (SA), Power Compiler (RTL clock gating), DesignPower. Design implementation: Design Compiler, Power Compiler, gate simulation producing SA, SNPS .db, DesignPower, PowerGate. Physical design: place & route with capacitance back-annotation, yielding a power-optimized design]

Links-to-Layout for Power • The lowest-power silicon within your timing constraints [Figure: loop between Power Compiler and physical design (Floorplan Manager) exchanging PDEF, SDF, and set_load data. If the constraints are not met, the loop repeats; if they are met, the result is the lowest-power implementation. Before: timing constraints not met. After: timing constraints met]
