Optimization Techniques for High-Throughput AES Implementation in FPGA and ASIC Designs

High Throughput AES
Alireza Hodjat
IVGroup

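The AES Algorithm
[Figure: AES round structure. Data path: Key Addition, then Substitution, Shift Row, Mix Column and Key Addition in each round; the last round shown has Substitution, Shift Row and Key Addition. Key schedule blocks alongside each round: Key Sch_Sub, Key Sch_rt, Key Sch_xor. Round keys are labeled ki and kn.]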

Outer-round Pipelining

Inner- and Outer-round Pipelining

The Highest Possible Throughput
• The choice of 128-bit key only
• Completely unrolled loop
• Pipelined
  • Between each round (Outer-round)
  • Inside each round (Inner-round)
• This causes huge area consumption.
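The effect of a completely unrolled, pipelined datapath can be sketched with a simple cycle count. This is a minimal model and not taken from the slides: the function name and parameters are hypothetical, and the defaults (10 rounds, 4 pipeline stages per round) are assumptions for the 128-bit-key case.

```python
def cycles_to_encrypt(num_blocks, rounds=10, stages_per_round=4):
    """Cycles for a fully unrolled, fully pipelined datapath to finish num_blocks.

    Assumption (illustrative only): one new block enters every clock cycle
    and every pipeline register adds exactly one cycle of latency.
    """
    if num_blocks <= 0:
        return 0
    depth = rounds * stages_per_round   # cycles needed to fill the pipeline
    return depth + (num_blocks - 1)     # afterwards one block completes per cycle


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, "blocks ->", cycles_to_encrypt(n), "cycles")
```

Under these assumptions the pipeline depth only sets the latency of the first block; for long streams the cost approaches one cycle per block, which is why deeper pipelining raises throughput at the price of more registers and area.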

Area Optimization
• Area optimization inside each round
• Two different techniques:
  • Resource sharing
  • Re-timing
• Break the critical path and perform the algorithm in multiple clock cycles
• Critical path: Substitution
• Area-delay trade-off
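Resource sharing can be illustrated by scheduling the 16 byte substitutions of one round onto a smaller number of S-box units. The sketch below is an assumption-laden illustration, not the presenter's design: `sbox_schedule` and its parameters are hypothetical names, and the bytes are simply taken in index order.

```python
import math


def sbox_schedule(num_bytes=16, sbox_units=4):
    """Return, per clock cycle, the state-byte indices handled by the shared S-boxes.

    Fewer S-box units means less area but more cycles per round
    (hypothetical in-order assignment, for illustration only).
    """
    if sbox_units <= 0:
        raise ValueError("sbox_units must be positive")
    cycles = math.ceil(num_bytes / sbox_units)
    return [
        list(range(c * sbox_units, min((c + 1) * sbox_units, num_bytes)))
        for c in range(cycles)
    ]


if __name__ == "__main__":
    for units in (16, 4):
        plan = sbox_schedule(16, units)
        print(units, "S-boxes ->", len(plan), "cycle(s):", plan)
```

With 16 units the whole state is substituted in one cycle; with 4 units the same work takes 4 cycles, which is the area-delay trade-off in its simplest form.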

Sbox area-delay trade-off for FPGA
Design Type                             Critical path   Area       Re-timing
Direct, No-Pipeline                     4.05 ns         136 LUTs   No
Indirect, No-Pipeline                   10.41 ns        94 LUTs    No
Direct, One stage pipeline              3.91 ns         136 LUTs   Yes, 2 pipe stages
Indirect, Three stage pipeline          5.95 ns         90 LUTs    Yes, 3 pipe stages
Direct, No-pipeline, Using Block RAM    4.87 ns         0 LUTs     No

Sbox area-delay trade-off for ASIC
Design Type                             Critical path   Area           Re-timing
Direct, No-Pipeline                     1.19 ns         2.086 Kgates   No
Indirect, No-Pipeline                   3.67 ns         1.167 Kgates   No
Direct, One stage pipeline              0.78 ns         3.51 Kgates    Yes, 2 pipe stages
Indirect, Three stage pipeline          1.11 ns         1.65 Kgates    Yes, 3 pipe stages

Sbox Area-Delay Trade-off
• Direct Implementation: Look-up table
• Indirect Implementation: GF(2^4)
  • Wolkerstorfer Design
  • Patrick’s codes
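The difference between a direct and an indirect S-box can be sketched in software. In the sketch below, the indirect version computes each output as a multiplicative inverse followed by the AES affine transform, and the direct version is a 256-entry look-up table. Note the assumption: for brevity the inverse is computed directly in GF(2^8), not with the composite-field GF(2^4) decomposition the slides refer to, and all function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def gf_inv(a):
    """Multiplicative inverse via a^254 (square-and-multiply); 0 maps to 0."""
    result, base, exp = 1, a, 254
    while exp:
        if exp & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exp >>= 1
    return result if a else 0


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_indirect(x):
    """Compute the S-box output: field inverse, then the AES affine transform."""
    inv = gf_inv(x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


# Direct implementation: precompute once, then every lookup is a table read.
SBOX_TABLE = [sbox_indirect(x) for x in range(256)]


def sbox_direct(x):
    return SBOX_TABLE[x]


if __name__ == "__main__":
    print(hex(sbox_direct(0x00)), hex(sbox_direct(0x53)))
```

In hardware the table corresponds to LUTs, Block RAM or a gate-level ROM with a short path, while the computed form uses less storage but a longer chain of logic, which is where pipelining by re-timing becomes attractive.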

AES Encrypt Datapath
[Figure: encryption datapath over the 128-bit state, drawn as byte lanes in four columns (4, 3, 2, 1): 16 S-boxes (S), four Mix Column units (M), and 16 key-addition XORs (+).]

Key Scheduling Datapath
[Figure: key scheduling datapath over the 128-bit key, drawn as byte lanes in four columns (4, 3, 2, 1): four S-boxes (S) followed by XORs (+).]

Design 1: Straight Forward
[Figure: one round with 16 S-boxes (S), four Mix Column units (M), and 16 key-addition XORs (+); each stage is labeled 1 cycle.]

Design 2: Use re-timing for Sbox
[Figure: one round with the S-box layer drawn as two rows of 16 S blocks, four Mix Column units (M), and 16 key-addition XORs (+); each stage is labeled 1 cycle.]

Design 3: Use resource sharing
[Figure: one round with four shared S-boxes (S-A, S-B, S-C, S-D), one Mix Column unit (M), and four key-addition XORs (+); each stage is labeled 4 cycles.]

Design 4: Use resource sharing and re-timing for Sbox
[Figure: one round with four shared S-boxes, each split into two parts (S-A-1 to S-D-1 and S-A-2 to S-D-2), one Mix Column unit (M), and four key-addition XORs (+); each stage is labeled 5 cycles.]

Design 5: Resource sharing and pipelining and re-timing for Sbox
[Figure: one round with four shared S-boxes split into two pipeline stages (S-A-1 to S-D-1, then S-A-2 to S-D-2), a Mix Column stage, and four key-addition XORs (+); each of the four stages is labeled 1 cycle.]

Inner-Round Pipeline for Design 5
[Figure: timing diagram. Stages S1, S2, M, A of Round 1 are followed by the same four stages of Round 2; the four parts of a block, numbered 1 to 4, enter one per cycle and advance one stage per cycle along the time axis.]
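The inner-round pipeline behavior can be sketched as an occupancy chart. This is an illustrative model under stated assumptions, not a reproduction of the slide: it assumes a continuous input stream in which the four parts of each block (labeled 1 to 4) enter stage S1 one per cycle and every stage takes exactly one cycle. The names `STAGES` and `occupancy` are hypothetical.

```python
STAGES = ["S1", "S2", "M", "A"]  # two S-box stages, Mix Column, key Addition


def occupancy(cycles=8, rounds=2, parts_per_block=4):
    """Return chart[t][k]: label of the part in pipeline stage k at cycle t, or None.

    Assumes one part enters per cycle and moves forward one stage per cycle.
    """
    depth = len(STAGES) * rounds
    chart = []
    for t in range(cycles):
        row = []
        for k in range(depth):
            entered = t - k  # index of the part that reached stage k at cycle t
            row.append(entered % parts_per_block + 1 if entered >= 0 else None)
        chart.append(row)
    return chart


if __name__ == "__main__":
    header = [f"R{r + 1}.{s}" for r in range(2) for s in STAGES]
    print("cycle", *header, sep="\t")
    for t, row in enumerate(occupancy()):
        print(t, *("-" if p is None else p for p in row), sep="\t")
```

Once the first four cycles have passed, every stage of Round 1 is busy in every cycle, and Round 2 starts filling while Round 1 is already working on the next block.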

Performance Estimation
Design                      # 1                      # 2                      # 3                      # 4                      # 5
Clock per Sample            1                        1                        4                        5                        4
Pipe stages per round       4 stages                 4 stages                 3 stages                 4 stages                 4 stages
Total pipe stages           4 × 10 stages            4 × 10 stages            3 × 10 stages            4 × 10 stages            4 × 10 stages
Latency                     4 × 10 cycles            4 × 10 cycles            4 × 3 × 10 cycles        5 × 3 × 10 cycles        (4 × 10) + 4 cycles
FPGA Throughput (200 MHz)   25.6 Gbit/s              25.6 Gbit/s              6.4 Gbit/s               6.4 Gbit/s               6.4 Gbit/s
ASIC Critical path          1.5 ns, 650 MHz          1 ns, 1 GHz              1.5 ns, 650 MHz          1 ns, 1 GHz              1 ns, 1 GHz
Estimated Area              Less than 500 Kgates     Less than 900 Kgates     Less than 150 Kgates     Less than 300 Kgates     Less than 250 Kgates
ASIC Throughput             (128*650) 83.2 Gbit/s    (128*1) 128 Gbit/s       (128*650/4) 20.8 Gbit/s  (128*1/5) 25.6 Gbit/s    (128*1/4) 32 Gbit/s
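The throughput figures in the performance estimation follow from block size, clock frequency and clocks per sample. The sketch below reproduces that arithmetic; the function name is hypothetical, and the model assumes a 128-bit block and that one block completes every `clocks_per_sample` cycles.

```python
def throughput_gbps(clock_hz, clocks_per_sample, block_bits=128):
    """Throughput in Gbit/s when one block completes every clocks_per_sample cycles."""
    return block_bits * clock_hz / clocks_per_sample / 1e9


if __name__ == "__main__":
    # (design label, ASIC clock in Hz, clocks per sample), values as listed in the estimate
    asic = [("#1", 650e6, 1), ("#2", 1e9, 1), ("#3", 650e6, 4), ("#4", 1e9, 5), ("#5", 1e9, 4)]
    for name, clock_hz, cps in asic:
        print(name, "ASIC:", round(throughput_gbps(clock_hz, cps), 2), "Gbit/s")
    for name, _, cps in asic:
        print(name, "FPGA @ 200 MHz:", round(throughput_gbps(200e6, cps), 2), "Gbit/s")
```

This simple model matches the listed ASIC figures. For Design 4 on FPGA it gives 5.12 Gbit/s at 5 clocks per sample, whereas the estimate lists 6.4 Gbit/s; the slides do not state the reason for the difference.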
