Enhancing Embedded Cryptographic Performance with Xtensa+ and TIE Extensions

TIE Extensions for Cryptographic Acceleration Charles-Henri Gros Alan Keefer Ankur Singla

Agenda • Introduction • Survey of Existing Architectures • Xtensa+ Crypto Processor • Rijndael Algorithm (AES final selection) • RC6, IDEA, and DES • Performance • Trade-off Analysis • Conclusion

Introduction • Commercial Networking Applications require flexible, high-throughput secure connectivity • Encryption/Decryption algorithms are computation-intensive • Multi-session applications place a significant load on embedded processors • Embedded systems need performance while optimizing power and area • Our study – a survey of existing architectures, an analysis of Xtensa as an alternative, and a performance and trade-off analysis for embedded use

Survey of Existing Architectures • Three categories • Specialized Crypto Processors • Reconfigurable Architectures • Full Hardware Implementation (ASICs/FPGAs) • High Variation in architecture complexity • Performance vs Area tradeoff • Suitability for Embedded Applications

Specialized Crypto Processors • Few VLIW architectures - CryptoManiac • Instruction Combining – Instruction Word combining to exploit ILP • Crypto Arithmetic Unit(s) – multiple XORs, GF multiplication/addition, lookup table substitution, and permutation • Coarse configurability of datapath • Mostly lacking SIMD support • Performance is typically 2x to 6x that of general processors

Reconfigurable Architectures • Numerous reconfigurable processor architectures – PipeRench, MorphoSys, COBRA, and GARP • Functional Units that provide all crypto arithmetic - multiple XORs, GF multiplication/addition, modulo multiplication • Reconfigurable Interconnection Network to provide dynamic change to functional unit connectivity • VLIW Instructions • Reconfiguration Registers • Suitable for Block Ciphers • High Variability in Performance increase w.r.t Processors

Full Hardware Implementation • High performance implementations targeted to ASICs/FPGAs • DES – 12 Gbps on Virtex-E XCV300E • AES – 18 Gbps on ASIC using TSMC 0.18 µm process • Lacking flexibility and crypto-modes • Memory and Area efficient • Typical latency only in DMA of data to Hardware unit • Need additional processor for control path

Xtensa+ Crypto Architecture • Custom Extensions to Xtensa Processor using the TIE framework • Addition of Generic Key Schedule Register File and Instructions to support all Crypto Algorithms studied • Addition of multiple on-chip SRAMs (in addition to 4 Data-RAMs) to the Xtensa processor • Currently Implemented using Table construct in TIE • Hacked TIE Compiler generated Verilog Code to instantiate multiple RAM models (implemented using multi-dimensional array) for viability analysis • Addition of 4 State Registers and 4 Next State Registers generic to all algorithms studied • Possible future extensions to include multi-session key storage and fast retrieval support

AES Overview • AES (Advanced Encryption Standard) is the standard set to replace DES for both government and private-sector encryption • Uses a fixed block size of 128 bits, with key sizes of 128, 192, or 256 bits • Designed to be efficient in both hardware and software across a variety of platforms • 10, 12, or 14 rounds depending on key size • 128-bit round key used for each round • Round keys can be pre-computed and cached for future encryptions
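The relationship between key size, round count, and cached round-key storage can be sketched as follows. The function names are hypothetical and only for illustration; the formulas (Nr = Nk + 6, with Nr + 1 round keys of 128 bits each, counting the initial key XOR) follow the AES specification.

```python
def aes_rounds(key_bits):
    """Return the AES round count Nr for a 128-, 192-, or 256-bit key."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key size must be 128, 192, or 256 bits")
    nk = key_bits // 32          # key length in 32-bit words
    return nk + 6                # Nr = Nk + 6


def round_key_bytes(key_bits):
    """Bytes needed to cache the full expanded key schedule."""
    # One 16-byte round key per round, plus one for the initial key XOR.
    return 16 * (aes_rounds(key_bits) + 1)
```

For a 128-bit key this gives 10 rounds and 176 bytes of round-key storage, which is the amount a key schedule register file would need to hold per session under these assumptions.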

AES Implementation Abstraction • Each round consists of a lookup, byte-level permutation, finite field multiplication, and key XOR • Lookup and multiplication can be combined into four separate 8x32 lookup tables, so each round is 16 lookups and 16 XORs • Decryption is essentially the same, but with different tables and a different key schedule

TIE Implementation • Our implementation does all 16 lookups in parallel, requiring 16 SRAMs • x0, x1, x2, x3 represent the round state (each 32 bits), k0, k1, k2, k3 are the current round key, and Tij are the T-boxes, where i is a duplication index and j is the T-box index • Each table index is the low 8 bits of the shifted word • Each round is then:
x0 = T00[x0] ^ T01[x1>>8] ^ T02[x2>>16] ^ T03[x3>>24] ^ k0
x1 = T10[x1] ^ T11[x2>>8] ^ T12[x3>>16] ^ T13[x0>>24] ^ k1
x2 = T20[x2] ^ T21[x3>>8] ^ T22[x0>>16] ^ T23[x1>>24] ^ k2
x3 = T30[x3] ^ T31[x0>>8] ^ T32[x1>>16] ^ T33[x2>>24] ^ k3
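A minimal software sketch of the T-box round described above, written in Python rather than TIE. It assumes the state words hold column bytes in little-endian order (byte 0 in the low 8 bits) and that round-key word k[i] is XORed into x[i]. A single set of four tables stands in for the four duplicated copies that the hardware uses to perform all 16 lookups at once. All names here are illustrative and are not taken from the original implementation.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def build_sbox():
    """Build the AES S-box: multiplicative inverse, then affine transform."""
    sbox = []
    for x in range(256):
        # Brute-force inverse; 0 has no inverse and maps to 0.
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for i in range(1, 5):
            s ^= ((inv << i) | (inv >> (8 - i))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def rotl32(w, n):
    """Rotate a 32-bit word left by n bits."""
    return ((w << n) | (w >> (32 - n))) & 0xFFFFFFFF


SBOX = build_sbox()
# T[0] folds SubBytes and the first MixColumns column (2, 1, 1, 3) into one
# 8x32 table; T[1..3] are byte rotations of T[0] for the other three rows.
T0 = [gf_mul(s, 2) | (s << 8) | (s << 16) | (gf_mul(s, 3) << 24) for s in SBOX]
T = [T0] + [[rotl32(w, 8 * j) for w in T0] for j in (1, 2, 3)]


def aes_round(x, k):
    """One full AES round: 16 table lookups and 16 XORs on four 32-bit words."""
    out = []
    for i in range(4):
        w = k[i]
        for j in range(4):
            # Byte j of state word (i + j) mod 4 selects the entry in T-box j.
            w ^= T[j][(x[(i + j) % 4] >> (8 * j)) & 0xFF]
        out.append(w)
    return out
```

In hardware, the inner loop becomes four parallel SRAM reads per output word, which is why a whole round can be issued as a single instruction. The final AES round omits MixColumns and would need plain S-box tables; that case is not shown here.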

Other Ciphers Implemented • DES (Data Encryption Standard) • 64-bit block, 56-bit key, 16 rounds, Feistel network • 8 6x4 S-Boxes, XORs, and bit-level permutations • Can’t really be done efficiently in software • TIE Implementation required 1 Instruction per round • IDEA (International Data Encryption Algorithm) • 64-bit block, 128-bit key, 8 rounds, iterated, operates on 16-bit numbers • 4 Multiplications mod 2^16 + 1, 4 adds mod 2^16, 6 XORs • Each round is highly sequential, so difficult to parallelize • TIE Implementation required 7 Instructions per round • RC6 • Same block and key modes as AES, 20 rounds, iterated • Multiplication mod 2^32, XORs, rotations, addition mod 2^32 • TIE Implementation required 2 Instructions per round
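The arithmetic primitives listed for IDEA and RC6 can be sketched in a few lines. This is an illustrative Python model of the operations a custom instruction would have to implement, not the TIE code itself; the function names are hypothetical. The IDEA multiply treats the 16-bit value 0 as 2^16, and the RC6 round follows the published cipher definition with two round-key words per round.

```python
MASK32 = 0xFFFFFFFF


def idea_mul(a, b):
    """IDEA multiplication modulo 2^16 + 1; the 16-bit value 0 stands for 2^16."""
    a = a or 0x10000
    b = b or 0x10000
    # A result of 2^16 is stored as 0 in the 16-bit output.
    return (a * b) % 0x10001 & 0xFFFF


def rotl32(w, n):
    """Rotate a 32-bit word left by the low 5 bits of n."""
    n &= 31
    return ((w << n) | (w >> (32 - n))) & MASK32


def rc6_round(a, b, c, d, s0, s1):
    """One RC6 encryption round on four 32-bit words with round keys s0, s1."""
    t = rotl32((b * (2 * b + 1)) & MASK32, 5)   # multiplication mod 2^32
    u = rotl32((d * (2 * d + 1)) & MASK32, 5)
    a = (rotl32(a ^ t, u) + s0) & MASK32        # data-dependent rotation
    c = (rotl32(c ^ u, t) + s1) & MASK32
    return b, c, d, a                           # word rotation of the block
```

The two independent halves of the RC6 round (computing t and u, then updating a and c) suggest why a two-instruction-per-round split is plausible, while the chained multiplies and adds in IDEA leave less to parallelize.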

AES Performance in Xtensa+ • Performance of TIE extensions approaches performance of non-pipelined ASICs • Total of 31 run-time instructions per data block • 1 initial XOR instruction • 1 instruction per round computation (10 total) • 20 cycles for load and store of 128-bit data blocks • Generally an order of magnitude better than pure software • Also faster than reconfigurable hardware or a specialized VLIW processor
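The per-block cost above translates into throughput once a clock rate is fixed. The sketch below assumes one cycle per instruction and uses a hypothetical 200 MHz clock purely as an example; the slides do not state the clock frequency, so the resulting figure is illustrative only.

```python
def cycles_per_block(initial_xor=1, rounds=10, load_store=20):
    """Sum the per-block costs listed on the slide (assumes 1 cycle each)."""
    return initial_xor + rounds + load_store


def throughput_mbps(cycles, clock_mhz, block_bits=128):
    """Throughput in Mbps for a given cycle count per block and clock in MHz."""
    return block_bits * clock_mhz / cycles


cycles = cycles_per_block()                  # 1 + 10 + 20
example = throughput_mbps(cycles, 200.0)     # 200 MHz is an assumed clock
```

Under these assumptions the load and store traffic accounts for 20 of the 31 cycles, so data movement rather than the round computation dominates the per-block cost.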

Mbps of Throughput

Cycles Per Block

Design Tradeoffs • Flexibility • Algorithm changes • New algorithms • New encryption modes • Implementation bugs • Time to Market • Closer to software development time • Can choose which parts to accelerate

Power vs. Performance: Mbps/mW

Conclusion • Xtensa instructions provide flexibility, performance, and Mbps/mW all somewhere between an ASIC and a VLIW or Software-based solution • Suitable for most Embedded Applications like 802.11i, etc. • Using Xtensa for cryptography is a good choice if: • You don’t need absolute throughput • You don’t need absolute flexibility • You need a control processor anyway • The algorithms needed are known ahead of time
