This document presents techniques for optimizing AES implementations for maximum throughput on FPGA and ASIC platforms. It covers pipelining between and inside rounds, area optimization through resource sharing and re-timing, and the area-delay trade-off of the S-box. Five design variations are compared, with performance estimates for each.
High Throughput AES
Alireza Hodjat, IVGroup
The AES Algorithm
[Figure: AES round structure. Data path blocks: Key Addition, Substitution, Shift Row, Mix Column, Key Addition, with the final round omitting Mix Column. Key schedule blocks: Key Sch_Sub, Key Sch_rt, Key Sch_xor. Key inputs are labeled kn and ki.]
The Highest Possible Throughput
• The choice of 128-bit key only
• Completely unrolled loop
• Pipelined
  • Between each round (Outer-round)
  • Inside each round (Inner-round)
• This causes huge area consumption.
Area Optimization
• Area optimization inside each round
• Two different techniques:
  • Resource sharing
  • Re-timing
• Break the critical path and perform the algorithm in multiple clock cycles
• Critical path: Substitution
• Area-delay trade-off
Sbox Area-Delay Trade-off
• Direct Implementation: Look-up table
• Indirect Implementation: GF(2^4)
• Wolkerstorfer Design
• Patrick’s codes

Sbox area-delay trade-off for FPGA
Design Type | Critical path | Area | Re-timing
Direct, No-Pipeline | 4.05 ns | 136 LUTs | No
Indirect, No-Pipeline | 10.41 ns | 94 LUTs | No
Direct, One stage pipeline | 3.91 ns | 136 LUTs | Yes, 2 pipe stages
Indirect, Three stage pipeline | 5.95 ns | 90 LUTs | Yes, 3 pipe stages
Direct, No-pipeline, Using Block RAM | 4.87 ns | 0 LUTs | No

Sbox area-delay trade-off for ASIC
Design Type | Critical path | Area | Re-timing
Direct, No-Pipeline | 1.19 ns | 2.086 Kgates | No
Indirect, No-Pipeline | 3.67 ns | 1.167 Kgates | No
Direct, One stage pipeline | 0.78 ns | 3.51 Kgates | Yes, 2 pipe stages
Indirect, Three stage pipeline | 1.11 ns | 1.65 Kgates | Yes, 3 pipe stages
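The difference between the direct and indirect S-box styles can be illustrated in software. The sketch below is only an illustration under stated assumptions: it computes the S-box arithmetically (field inversion followed by the AES affine transform) and then tabulates it for the look-up-table version. It inverts directly in GF(2^8) by exponentiation, not with the GF(2^4) composite-field decomposition named on the slide, and the function names (`gf_mul`, `gf_inv`, `sbox_computed`, `sbox_direct`) are hypothetical, not taken from the original design.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:          # reduce when the degree reaches 8
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8); 0 maps to 0 as in AES."""
    if a == 0:
        return 0
    # a^254 equals a^-1 because the multiplicative group has order 255.
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_computed(a):
    """'Indirect' style: inversion plus the AES affine transform."""
    inv = gf_inv(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


# 'Direct' style: a 256-entry look-up table, built once from the computed version.
SBOX_TABLE = [sbox_computed(x) for x in range(256)]


def sbox_direct(a):
    """'Direct' style: a single table look-up."""
    return SBOX_TABLE[a]
```

In hardware the table version costs more area but has a short path, while the arithmetic version is smaller but deeper, which is the trade-off the two tables above quantify.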
AES Encrypt Datapath
[Figure: datapath diagram with the state drawn as columns labeled 1 to 4, 16 S-boxes (S), 4 Mix Column units (M), and 16 XOR units (+).]
Key Scheduling Datapath
[Figure: key scheduling diagram with the key drawn as columns labeled 1 to 4, 4 S-boxes (S), and XOR units (+).]
Design 1: Straight Forward
[Figure: one round with 16 S-boxes (S), 4 Mix Column units (M), and 16 XOR units (+); each of the four stages is marked 1 cycle.]
Design 2: Use re-timing for Sbox
[Figure: one round with two rows of 16 S-box blocks (S), 4 Mix Column units (M), and 16 XOR units (+); each of the four stages is marked 1 cycle.]
Design 3: Use resource sharing
[Figure: one round with 4 shared S-box units (S-A, S-B, S-C, S-D), one Mix Column unit (M), and 4 XOR units (+); each of the three stages is marked 4 cycles.]
Design 4: Use resource sharing and re-timing for Sbox
[Figure: one round with 4 shared S-box units, each split in two parts (S-A-1 to S-D-1 and S-A-2 to S-D-2), one Mix Column unit (M), and 4 XOR units (+); the stages are marked 5 cycles.]
Design 5: Resource sharing and pipelining and re-timing for Sbox
[Figure: one round with four pipeline stages, each marked 1 cycle: first S-box part (S-A-1 to S-D-1), second S-box part (S-A-2 to S-D-2), Mix Column, and 4 XOR units (+).]
Inner-Round Pipeline for Design 5
[Figure: pipeline timing diagram along a time axis, covering Round 1 and Round 2. Each round has stages labeled S1, S2, M, and A; data items 1 to 4 enter on successive time steps and advance one stage per step.]
Performance Estimation
Design | #1 | #2 | #3 | #4 | #5
Clock per Sample | 1 | 1 | 4 | 5 | 4
Pipe stages per round | 4 stages | 4 stages | 3 stages | 4 stages | 4 stages
Total pipe stages | 4 × 10 stages | 4 × 10 stages | 3 × 10 stages | 4 × 10 stages | 4 × 10 stages
Latency | 4 × 10 cycles | 4 × 10 cycles | 4 × 3 × 10 cycles | 5 × 3 × 10 cycles | (4 × 10) + 4 cycles
FPGA Throughput (200 MHz) | 25.6 Gbit/s | 25.6 Gbit/s | 6.4 Gbit/s | 6.4 Gbit/s | 6.4 Gbit/s
ASIC Critical path | 1.5 ns (650 MHz) | 1 ns (1 GHz) | 1.5 ns (650 MHz) | 1 ns (1 GHz) | 1 ns (1 GHz)
Estimated Area | Less than 500 Kgates | Less than 900 Kgates | Less than 150 Kgates | Less than 300 Kgates | Less than 250 Kgates
ASIC Throughput | (128*650) 83.2 Gbit/s | (128*1) 128 Gbit/s | (128*650/4) 20.8 Gbit/s | (128*1/5) 25.6 Gbit/s | (128*1/4) 32 Gbit/s
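The ASIC throughput figures follow from one formula: block size (128 bits) times clock frequency, divided by the clocks needed per sample. The short sketch below reproduces that arithmetic from the clock rates and clocks-per-sample values listed above; the names `throughput_gbps` and `ASIC_DESIGNS` are made up for this illustration.

```python
BLOCK_BITS = 128  # AES block size in bits


def throughput_gbps(clock_mhz, clocks_per_sample):
    """Throughput in Gbit/s for one 128-bit block every `clocks_per_sample` cycles."""
    return BLOCK_BITS * clock_mhz / clocks_per_sample / 1000.0


# (clock in MHz, clocks per sample) for the ASIC estimates of Designs 1 to 5,
# taken from the performance figures above.
ASIC_DESIGNS = {
    1: (650, 1),
    2: (1000, 1),
    3: (650, 4),
    4: (1000, 5),
    5: (1000, 4),
}

for design, (clock_mhz, cps) in ASIC_DESIGNS.items():
    print(f"Design {design}: {throughput_gbps(clock_mhz, cps):.1f} Gbit/s")
```

The same function applies to any other clock rate, for example `throughput_gbps(200, 1)` for a fully pipelined design at 200 MHz.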