Power-aware RAM Processing for FPGA Embedded Memory Blocks
Russell Tessier, University of Massachusetts
Vaughn Betz, David Neto, and Thiagaraja Gopalsamy, Altera Corporation

Overview
• Operation of FPGA embedded memory blocks (EMBs)
• Power consumption in EMBs
• Opportunities for power saving
  • Shut down clocks to memory core
• Three automated power saving techniques
  • Unused memory port shutdown
  • Memory control signal transform
  • Memory mapping to multiple blocks
• Experimental results

FPGA Embedded Memory Blocks
• Embedded memory blocks (EMBs) are important parts of FPGAs
• Consume roughly 14% of Altera Stratix II dynamic power *
• Increasing in recent designs
* Stratix II Low Power Applications Note, 2005

Stratix II Embedded Memory Block: External View
[Figure: memory core with Port A and Port B; each port has Data In, Address, R/W Enable, clock enables, and Data Out]
• Input ports (data, address, control) are synchronous
• Mode 1: Single port (ignore Port B)
• Mode 2: True dual port

Stratix II Embedded Memory Block: External View (continued)
[Figure: memory core with Port A (Data In, Address, Write Enable, clock enables) and Port B (Data In, Address, Read Enable, clock enables, Data Out)]
• Mode 3: Simple dual-port
  • Large majority of RAM implementations

Embedded Memory Block Port Internal View
[Figure: Clk gated by Clk Enable to produce MClk; bit line pre-charge; RAM cell on BIT lines; row decode, column decode, column mux; write buffers; sense amps; pulse generator; write enable; read enable; latch; Address, Read Data, and Write Data lines]

Embedded Memory Block Port Read: Step 1
[Figure: port internal view with Clk Enable = 1; BIT lines precharged to VCC]
• Substantial power required to charge bit lines

Embedded Memory Block Port Read: Step 2
[Figure: port internal view with Clk Enable = 1; data read out of RAM cells through row decode, column decode, column mux, and sense amps]

Embedded Memory Block Port Read: Step 3
[Figure: port internal view with Clk Enable = 1 and Read Enable = 1; data passes through latch to Read Data lines]

Embedded Memory Block Port Read Summary
[Figure: port internal view of the read path: bit line pre-charge, RAM cell, row decode, column decode, column mux, sense amps, latch, Address, Read Data]
• If read clock enable = 0, steps 1 and 2 suppressed
• If read enable = 0, step 3 suppressed

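The two gating rules in the read summary can be written as a small behavioral model. This is a minimal Python sketch, not the device's actual logic: the function name `read_port_steps` and the step labels are illustrative, and it assumes step 3 depends only on read enable, exactly as the summary states.

```python
def read_port_steps(clk_enable: bool, read_enable: bool) -> list[str]:
    """List the read steps that run in one cycle (illustrative model only)."""
    steps = []
    if clk_enable:
        # Steps 1 and 2 need the gated memory clock (MClk).
        steps.append("precharge")
        steps.append("read cells")
    if read_enable:
        # Step 3: the latch passes data to the Read Data lines.
        steps.append("latch to output")
    return steps


if __name__ == "__main__":
    for ce in (False, True):
        for rd in (False, True):
            print(f"clk_enable={ce!s:5} read_enable={rd!s:5} -> {read_port_steps(ce, rd)}")
```

In this model the costly precharge runs whenever the clock enable is high, even if read enable is low.
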
Embedded Memory Block Port Write: Step 1
[Figure: port internal view with Clk Enable = 1; BIT lines precharged to VCC]
• Substantial power required to charge bit lines

Embedded Memory Block Port Write: Step 2
[Figure: port internal view with Clk Enable = 1; data loaded into write buffers based on write enable]

Embedded Memory Block Port Write: Step 3
[Figure: port internal view; data loaded into RAM cells through row decode, column decode, column mux, and write buffers]

Embedded Memory Block Port Write Summary
[Figure: port internal view of the write path: bit line pre-charge, RAM cell, row decode, column decode, column mux, write buffers, sense amps, pulse generator, Write Enable, Write Data]
• If write clock enable = 0, steps 1, 2, and 3 suppressed
• If write enable = 0, step 2 suppressed

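The write summary can be modeled the same way, and the model makes it easy to count cycles where bit lines are precharged although nothing is written. This is a hedged Python sketch: `write_port_steps` and `wasted_precharge_cycles` are hypothetical names, and the model follows the summary literally (only step 2 is gated by write enable).

```python
def write_port_steps(clk_enable: bool, write_enable: bool) -> list[str]:
    """List the write steps that run in one cycle (illustrative model only)."""
    if not clk_enable:
        return []  # no MClk: steps 1, 2, and 3 are all suppressed
    steps = ["precharge"]  # step 1 runs regardless of write enable
    if write_enable:
        steps.append("load write buffers")  # step 2
    steps.append("drive RAM cells")  # step 3, per the summary not gated by write enable
    return steps


def wasted_precharge_cycles(trace: list[tuple[bool, bool]]) -> int:
    """Count cycles with (clk_enable, write_enable) = (1, 0)."""
    return sum(1 for clk_enable, write_enable in trace if clk_enable and not write_enable)


if __name__ == "__main__":
    writes = [True, False, False, True]
    free_running = [(True, we) for we in writes]  # clock enable tied high
    enabled = [(we, we) for we in writes]         # clock enabled only on writes
    print(wasted_precharge_cycles(free_running), wasted_precharge_cycles(enabled))
```
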
Reducing RAM Power Consumption
[Figure: Clk gated by Clk Enable to produce MClk]
• Each RAM element can use an enabled or free running clock
• Use enabled clocks rather than free running clocks to prevent bit-line pre-charge
• Only enable RAMs when access is necessary
• Read enable not always specified by designer
• Write enable created for functionality

Power Optimization #1
[Figure: single-port memory using Port A (Data In, Address, R/W Enable, clock enables, Data Out); Port B clock shut off, Port B Write Enable = 0]
• For single-port memories
• Tie Port B clock enable to GND
• Previously tied high, with write enable disabled

Single Port Optimization Experiments
• Determine power effect of shutting off clock to unused Port B
• Only impacts single-port RAM and ROM designs
• 43 Stratix II designs
  • Large customer designs with memory
  • Targeted to smallest achievable FPGA
  • Hand-generated input vectors
• Quartus 5.0
  • Target maximum frequency

Memory Power: Port Optimization
• 9.2% average power reduction for designs with memories (only impacts ROMs and single port memories)
[Figure: Memory Dynamic Power; y-axis: % Power Reduction (0 to 60); x-axis: Designs]

Dynamic Power: Port Optimization
• 2.4% average power reduction for designs with memories (only impacts ROMs and single port memories)
[Figure: Dynamic Power; y-axis: % Power Reduction (0 to 35); x-axis: Designs]

FPGA RAM Processing
[Figure: flow from FIFO, Shift Register, RAM specification -> Create Logical Memory -> Logical RAMs -> Logical-to-physical RAM processing -> RAM blocks/logic -> Memory/logic placement -> Placed Memory]
• FIFOs and shift registers converted into logical RAMs
• Logical RAMs broken into RAM blocks of sizes appropriate for physical implementation
• Each RAM block assigned to a physical embedded memory block

FIFO Elaboration to Logical RAM
[Figure: Before: FIFO with Data, Q, Wrreq, Rdreq, Clock. After: logical RAM with Data, Q, write and read clock enables tied to Vcc, Wrreq on write enable, Rdreq on read enable, and write/read address counters driven by Wrreq and Rdreq, implemented in LUTs/FFs]
• Convert to logic and synchronous RAM with signal pattern found on EMB

Power Optimization #2
[Figure: Before: Wren and Rden drive Write enable and Read enable; Wr clk enable and Rd clk enable tied to Vcc. After: Wren and Rden drive Wr clk enable and Rd clk enable; Write enable and Read enable tied to Vcc]
• Convert EMB read enable/write enable signals to associated read/write clock enable signals
• Limitations
  • Each port must have a dedicated read or write enable signal (simple dual-port)
  • Embedded memory block must have a read enable

Read Port Control Signal Equivalence
[Figure: read path of port internal view with one control input tied to VCC]
• Memory core inactive if read clock enable inactive
• Read operation will occur if both read enable and read clock enable are high
• One signal could be tied to VCC

Read Port Control Signal Equivalence (continued)
[Figure: read path with Read enable driving the clock gate and the latch enable tied to VCC]
• If read clock enable = 0 and read enable = 1, read suppressed
• If read clock enable = 1 and read enable = 0, read suppressed

Write Port Control Signal Equivalence
[Figure: write path of port internal view with one control input tied to VCC]
• Memory core inactive if write clock enable is inactive
• Write operation will occur if both write enable and write clock enable are high
• One signal could be tied to VCC

Write Port Control Signal Equivalence (continued)
[Figure: write path with Write enable driving the clock gate and the write buffer enable tied to VCC]
• If write clock enable = 0 and write enable = 1, write suppressed
• If write enable = 0 and write clock enable = 1, write suppressed

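The equivalence argument above can be checked exhaustively for one port. This is a minimal Python sketch under the slides' stated rules (an operation occurs only when both signals are high; the core is active whenever the clock enable is high). The function names are illustrative.

```python
def operation_occurs(clk_enable: bool, enable: bool) -> bool:
    """A read or write happens only if both control signals are high."""
    return clk_enable and enable


def core_active(clk_enable: bool) -> bool:
    """The memory core (including bit-line precharge) runs whenever MClk is enabled."""
    return clk_enable


def compare(user_signal: bool) -> dict:
    """Compare wiring the user's enable to the enable pin vs. the clock enable pin."""
    before = {"clk_enable": True, "enable": user_signal}  # clock enable tied to VCC
    after = {"clk_enable": user_signal, "enable": True}   # enable tied to VCC
    return {
        "same_function": operation_occurs(**before) == operation_occurs(**after),
        "core_active_before": core_active(before["clk_enable"]),
        "core_active_after": core_active(after["clk_enable"]),
    }


if __name__ == "__main__":
    for value in (False, True):
        print(value, compare(value))
```

Both wirings perform the operation in the same cycles; only the second keeps the core idle when the user's enable is low.
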
Quartus II Implementation
[Figure: two clock-gating circuits, one with only a user-defined write clock enable and one combining Write Enable with the user-defined write clock enable]
• Conversion mode
  • Quartus II default
  • Ties R/W enable to RAM clock enable
  • Doesn't make transform if CE already present on port
• Combining mode
  • AND user RAM clock enables with derived R/W clock enable
  • Could impact performance

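The two modes can be sketched as a netlist-level rewrite on one port. This is an illustrative Python sketch, not Quartus II code: `transform_port`, the string signal names, and the mode names are hypothetical. It also assumes that combining mode leaves the original enable connected, which the slides do not specify.

```python
from typing import Optional, Tuple

VCC = "VCC"


def transform_port(clock_enable: Optional[str], rw_enable: Optional[str],
                   mode: str = "convert") -> Tuple[str, str]:
    """Return (clock_enable_expr, rw_enable_expr) for one port after the transform.

    None means the pin is unused (tied to VCC) in the user's design.
    """
    if rw_enable is None:
        # No dedicated read/write enable on this port: nothing to move.
        return clock_enable or VCC, VCC
    if clock_enable is None:
        # Conversion: the enable drives the clock enable, the enable pin is tied off.
        return rw_enable, VCC
    if mode == "combine":
        # Combining: AND the user's clock enable with the enable.
        # Assumption: the enable pin itself is left as the user wired it.
        return f"({clock_enable} & {rw_enable})", rw_enable
    # Conversion mode skips ports that already have a user clock enable.
    return clock_enable, rw_enable


if __name__ == "__main__":
    print(transform_port(None, "wren"))
    print(transform_port("user_ce", "wren"))
    print(transform_port("user_ce", "wren", mode="combine"))
```
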
Clock Enable Conversion Experiments
• 40 Stratix II RAM-based designs
• Quartus 5.1
  • Target max frequency
• Quartus II simulation with test vectors
• Dynamic power evaluated with Quartus II PowerPlay power analyzer
• Covers the following optimizations
  • Automatic conversion of R/W enable to R/W clock enable
  • Combining of R/W enable with existing R/W clock enable

Memory Power: Clock Enable Optimization
• 9.7% average power reduction for convert and combine for all designs (6.3% for convert only)
[Figure: Memory Dynamic Power; y-axis: % Power Reduction (-10 to 70); x-axis: Designs (1 to 39); series: Enable convert, Enable convert/combine]

Core Dynamic Power: Clock Enable Optimization
• 2.6% average power reduction for convert and combine for all designs (1.8% for convert only)
[Figure: Core Dynamic Power; y-axis: % Power Reduction (-5 to 30); x-axis: Designs (1 to 39); series: Enable convert, Enable convert/combine]

Mapping RAM to Multiple EMBs
[Figure: user-defined (logical) memory of 16K bits, 4k deep x 4 wide, mapped to physical (EMB) memory: four M4K blocks of 4K bits each]
• User-defined memory often too large to fit in one EMB
• Must use RAM in multiple EMBs to implement logical RAM
• Implementation choice can impact design area, performance, and power

Memory Organization
[Figure: example configurations: 4K words deep x 1 bit wide; 512 words deep x 8 bits wide; 128 words deep x 32 bits wide]
• Each EMB can be configured to have different depth and width (e.g. Stratix II M4K)
• All hold 4K bits
• Slightly lower power consumption for wider EMB configurations (not including routing)

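Because every configuration holds the same 4K bits, depth and width trade off directly. The sketch below enumerates the depth x width pairs. It assumes power-of-two steps between the 4K x 1 and 128 x 32 configurations shown on the slide; `emb_configurations` is an illustrative name.

```python
EMB_BITS = 4096  # "All hold 4K bits"


def emb_configurations(bits: int = EMB_BITS, min_depth: int = 128) -> list[tuple[int, int]]:
    """Return (depth, width) pairs, halving depth and doubling width each step."""
    configs = []
    depth = bits
    while depth >= min_depth:
        configs.append((depth, bits // depth))
        depth //= 2
    return configs


if __name__ == "__main__":
    for depth, width in emb_configurations():
        print(f"{depth:5d} words deep x {width:2d} bits wide")
```
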
Area and Delay Optimal Mapping
[Figure: Vertical Slicing: logical memory 4k words deep and 4 bits wide built from four EMBs, each 4k words deep and 1 bit wide; Addr[0:11] goes to every EMB; Data[0:3]; 4 EMBs active during access]
• Configure each EMB to be as deep as possible
• Number of address bits on each EMB same as on logical memory
• Area and performance efficient: no external logic needed
• Power inefficient: all EMBs must be active during each logical RAM access

Alternative Mapping
[Figure: Horizontal Slicing: logical memory 4k words deep and 4 bits wide built from four EMBs, each 1K deep x 4 wide; Addr[10:11] drives an address decoder and the output selection; Addr[0:9] goes to each EMB; Data[0:3]; 1 EMB active during access; more power efficient]
• Configure EMB to have width of logical RAM (e.g. 1Kx4)
• Allows shutdown of some RAMs each cycle
• But adds some logic
• Saves RAM power, adds combinational logic and register power

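The difference between the two mappings comes down to how many EMBs are active per access and how much decode/mux logic is added. The following Python sketch computes those counts for any EMB configuration. `slice_mapping` and its field names are illustrative, and it assumes exactly one row of EMBs is enabled per access.

```python
from math import ceil


def slice_mapping(logical_depth: int, logical_width: int,
                  emb_depth: int, emb_width: int) -> dict:
    """Count EMBs and added logic for one depth x width EMB configuration."""
    rows = ceil(logical_depth / emb_depth)  # EMB groups selected by the address decoder
    cols = ceil(logical_width / emb_width)  # EMBs side by side to build the data width
    return {
        "total_embs": rows * cols,
        "active_embs": cols,                     # one row is enabled per access
        "mux_ways": rows,                        # output mux inputs per data bit
        "decoder_bits": (rows - 1).bit_length()  # upper address bits sent to the decoder
    }


if __name__ == "__main__":
    print("vertical  ", slice_mapping(4096, 4, emb_depth=4096, emb_width=1))
    print("horizontal", slice_mapping(4096, 4, emb_depth=1024, emb_width=4))
```

For the 4k x 4 example, both mappings use four EMBs, but the horizontal one activates a single EMB and needs a two-bit decoder, matching Addr[10:11] in the figure.
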
RAM Slicing: Example
[Figure: Dynamic Power (mW, 0 to 140) versus Maximum Depth (128 to 4k) for a 4kx32 memory; annotations: Multiplexer Power Increasing, EMB Power Increasing, Best range]
• Power reduction available with different slicing

Power Optimization #3: Power-aware RAM Partitioning
[Figure: flow from FIFO, Shift Register -> Create Logical Memory -> Logical RAMs -> Logical to Physical RAM processing (Power-aware RAM Partitioner, Insert Decode and Mux Logic) -> RAM blocks/Logic -> Memory/Logic Placement -> Completed placement]
• Power optimal EMB configuration often between “horizontal” and “vertical”
• Need algorithm to consider possible logical to physical RAM mappings

Power-aware RAM Partitioning Algorithm
• For each EMB type
  • For each EMB depth versus width configuration
    • Determine number of required EMBs, decoder, and output mux circuits
    • Estimate power of RAM access (active EMBs, decoder, and output mux)
    • Limit to four-way muxing at most
  • Save lowest power configuration
• Rank possible EMB implementations by power
• Select lowest-power, feasible choice
  • Check if EMB usage overflowed by choice
  • If yes, select next choice

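The selection loop above can be sketched in a few lines of Python. This is a simplified illustration, not the Quartus II implementation: it handles a single EMB type of 4K bits, all function names are hypothetical, and `toy_power` is a placeholder cost function with made-up weights, not measured data.

```python
from math import ceil
from typing import Callable, Optional


def enumerate_mappings(depth: int, width: int, emb_bits: int = 4096,
                       max_emb_width: int = 32, max_mux_ways: int = 4) -> list[dict]:
    """List candidate mappings of a logical RAM onto one EMB type."""
    mappings = []
    emb_width = 1
    while emb_width <= max_emb_width:
        emb_depth = emb_bits // emb_width
        rows = ceil(depth / emb_depth)  # decoder outputs and mux ways
        cols = ceil(width / emb_width)  # EMBs active per access
        if rows <= max_mux_ways:        # limit to four-way muxing at most
            mappings.append({"emb_depth": emb_depth, "emb_width": emb_width,
                             "total_embs": rows * cols, "active_embs": cols,
                             "mux_ways": rows})
        emb_width *= 2
    return mappings


def choose_mapping(depth: int, width: int, embs_available: int,
                   estimate_power: Callable[[dict], float]) -> Optional[dict]:
    """Rank candidates by estimated power and return the first one that fits."""
    ranked = sorted(enumerate_mappings(depth, width), key=estimate_power)
    for mapping in ranked:
        if mapping["total_embs"] <= embs_available:  # EMB usage not overflowed
            return mapping
    return None  # no feasible mapping on this EMB type


def toy_power(mapping: dict) -> float:
    """Placeholder cost: active EMBs plus a small penalty per extra mux way."""
    return mapping["active_embs"] + 0.25 * (mapping["mux_ways"] - 1)


if __name__ == "__main__":
    print(choose_mapping(4096, 4, embs_available=4, estimate_power=toy_power))
```

A real estimator would replace `toy_power` with characterized power numbers for the EMBs, decoder, and output multiplexers, and the outer loop would repeat over each EMB type.
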
Experimental Approach
• Simulation and power estimation performed
  • Multi-bit input multiplexers
  • Decoders
  • EMB blocks in different configurations
• 40 designs evaluated
  • Quartus 5.1
  • Mapped to smallest possible device and target max frequency
  • Simulation with test vectors, power analysis with PowerPlay
• Approach used in combination with clock enable conversion and combining

Memory Power
• 21.0% average power reduction for all techniques for memory designs (9.7% with only enable convert/combine)

Overall Core Dynamic Power
• 6.8% average power reduction for all techniques for memory designs (2.6% with convert/combine)
[Figure: y-axis: % Dyn. Power Reduction (-5 to 35); x-axis: Designs (1 to 39); series: Enable convert/combine, Enable convert/combine + mem partition]

Design Performance
• 1.0% average performance loss for all techniques (0.1% for enable convert/combine)
[Figure: Average Design Clock Frequency; y-axis: % Frequency Improvement (-30 to 10); x-axis: Designs; series: Enable Convert/Combine, Enable Convert/Combine + Mem Partition]

Results Summary
• Almost 7% core dynamic power reduction across all designs
• Some designs benefit more than others
• Minimal clock frequency hit for most designs

Impact of Multiple Embedded Memory Blocks
• Rerun 40 designs but only allow one type of target EMB for each mapping
• All designs targeted to Stratix II EP2S180
• Significant power impact for most designs versus EP2S180 target with no restrictions

Summary
• Key to reducing RAM power is keeping clocks disabled
• Single-port RAMs are a straightforward optimization
• Movement of read/write enables to clock enables limits dynamic activity
• Power-aware RAM partitioner attempts to select power-optimal mapping, combined with clock enable enhancement
• Overall
  • About 30% average memory power reduction
    • 9% single port optimization
    • 21% enable convert/combine and memory partitioning
  • About 9% average dynamic power reduction
    • 2% single port optimization
    • 7% enable convert/combine and memory partitioning