This paper explores advanced solutions for cryptographic acceleration in embedded systems, focusing on the Xtensa+ architecture. We survey existing architectures, including specialized crypto processors and reconfigurable architectures, highlighting performance trade-offs and suitability. Key algorithms such as AES, DES, IDEA, and RC6 are analyzed, showcasing TIE framework extensions to enhance cipher execution on Xtensa processors. The study reveals the importance of flexibility, power optimization, and high throughput for secure connectivity in commercial networking applications.
TIE Extensions for Cryptographic Acceleration • Charles-Henri Gros, Alan Keefer, Ankur Singla
Agenda • Introduction • Survey of Existing Architectures • Xtensa+ Crypto Processor • Rijndael Algorithm (AES final selection) • RC6, IDEA, and DES • Performance • Trade-off Analysis • Conclusion
Introduction • Commercial Networking Applications require flexible & high throughput secure connectivity • Encryption/Decryption algorithm computation intensive • Multi-session applications present significant load on embedded processors • Embedded systems need performance while optimizing power and area • Our study – existing architectures, analysis of Xtensa as an alternative, performance analysis and trade-offs for embedded
Survey of Existing Architectures • Three categories • Specialized Crypto Processors • Reconfigurable Architectures • Full Hardware Implementation (ASICs/FPGAs) • High Variation in architecture complexity • Performance vs Area tradeoff • Suitability for Embedded Applications
Specialized Crypto Processors • Few VLIW architectures - CryptoManiac • Instruction Combining – Instruction Word combining to exploit ILP • Crypto Arithmetic Unit(s) – multiple XORs, GF multiplication/addition, lookup table substitution, and permutation • Coarse configurability of datapath • Mostly lacking SIMD support • Performance is typically 2x to 6x that of general processors
Reconfigurable Architectures • Numerous reconfigurable processor architectures – PipeRench, MorphoSys, COBRA, and GARP • Functional Units that provide all crypto arithmetic - multiple XORs, GF multiplication/addition, modulo multiplication • Reconfigurable Interconnection Network to provide dynamic change to functional unit connectivity • VLIW Instructions • Reconfiguration Registers • Suitable for Block Ciphers • High Variability in Performance increase w.r.t Processors
Full Hardware Implementation • High performance implementations targeted to ASICs/FPGAs • DES – 12 Gbps on Virtex-E XCV300E • AES – 18 Gbps on ASIC using TSMC 0.18 µm process • Lacking flexibility and crypto-modes • Memory and Area efficient • Typical latency only in DMA of data to Hardware unit • Need additional processor for control path
Xtensa+ Crypto Architecture • Custom Extensions to Xtensa Processor using the TIE framework • Addition of Generic Key Schedule Register File and Instructions to support all Crypto Algorithms studied • Addition of multiple on-chip SRAMs (in addition to 4 Data-RAMs) to the Xtensa processor • Currently Implemented using Table construct in TIE • Hacked TIE Compiler generated Verilog Code to instantiate multiple RAM models (implemented using multi-dimensional array) for viability analysis • Addition of 4 State Registers and 4 Next State Registers generic to all algorithms studied • Possible future extensions to include multi-session key storage and fast retrieval support
AES Overview • AES (Advanced Encryption Standard) is the standard set to replace DES for both government and private-sector encryption • Uses a fixed block size of 128 bits, with key sizes of 128, 192, or 256 bits • Designed to be efficient in both hardware and software across a variety of platforms • 10, 12, or 14 rounds depending on key size • 128-bit round key used for each round • Round keys can be pre-computed and cached for future encryptions
AES Implementation Abstraction • Each round consists of a lookup, byte-level permutation, finite field multiplication, and key XOR • Lookup and multiplication can be combined into four separate 8x32 lookup tables, so each round is 16 lookups and 16 XORs • Decryption is essentially the same, but with different tables and a different key schedule
TIE Implementation • Our implementation does all 16 lookups in parallel, requiring 16 SRAMs • x0, x1, x2, x3 represent the round state (32 bits each), k0, k1, k2, k3 are the current round-key words, and Tij are the T-boxes, where i is a duplication index and j is the T-box index • Each table index is the low 8 bits of the shifted word, and every right-hand side reads the state from before the round • Each round is then:
x0 = T00[x0] ^ T01[x1>>8] ^ T02[x2>>16] ^ T03[x3>>24] ^ k0
x1 = T10[x1] ^ T11[x2>>8] ^ T12[x3>>16] ^ T13[x0>>24] ^ k1
x2 = T20[x2] ^ T21[x3>>8] ^ T22[x0>>16] ^ T23[x1>>24] ^ k2
x3 = T30[x3] ^ T31[x0>>8] ^ T32[x1>>16] ^ T33[x2>>24] ^ k3
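As a rough software model of the T-box round described on this slide, the following Python sketch builds the four 8x32 tables and applies one full round. It is not the TIE code from the slides: the function names (`gf_mul`, `build_sbox`, `build_tboxes`, `aes_round`, `to_words`, `from_words`) are illustrative, the little-endian packing of each state column into a 32-bit word is an assumption chosen to match the shift pattern shown, and a single copy of each table stands in for the four duplicated SRAM copies.

```python
# Software sketch of one T-box AES round (illustrative names, not the TIE source).

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def build_sbox():
    """Derive the AES S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):          # x^254 is the inverse of x
                inv = gf_mul(inv, x)
        s = inv
        for shift in range(1, 5):         # XOR of four byte rotations
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

def rotl32(w, n):
    return ((w << n) | (w >> (32 - n))) & 0xFFFFFFFF

def build_tboxes():
    """T0 folds S-box and MixColumns; T1..T3 are byte rotations of T0."""
    t0 = []
    for s in build_sbox():
        s2, s3 = gf_mul(s, 2), gf_mul(s, 3)
        t0.append(s2 | (s << 8) | (s << 16) | (s3 << 24))
    return [t0] + [[rotl32(w, 8 * j) for w in t0] for j in (1, 2, 3)]

T = build_tboxes()

def aes_round(x, k):
    """One full round: 16 lookups and 16 XORs, one round-key word per column."""
    def byte(w, n):
        return (w >> (8 * n)) & 0xFF
    return [
        T[0][byte(x[c], 0)] ^ T[1][byte(x[(c + 1) % 4], 1)]
        ^ T[2][byte(x[(c + 2) % 4], 2)] ^ T[3][byte(x[(c + 3) % 4], 3)]
        ^ k[c]
        for c in range(4)
    ]

def to_words(data):
    """Pack 16 bytes into four little-endian 32-bit column words (assumed layout)."""
    return [int.from_bytes(data[4 * c:4 * c + 4], "little") for c in range(4)]

def from_words(words):
    return b"".join(w.to_bytes(4, "little") for w in words)
```

Calling `aes_round(to_words(state), to_words(round_key))` on a 16-byte state and round key yields the next state, which can be compared with the round-by-round example in FIPS-197 Appendix B. The final AES round omits MixColumns and would need plain S-box lookups instead of these tables; that case is not modelled here.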
Other Ciphers Implemented • DES (Data Encryption Standard) • 64-bit block, 56-bit key, 16 rounds, Feistel network • Eight 6x4 S-boxes, XORs, and bit-level permutations • Hard to implement efficiently in software • TIE implementation required 1 instruction per round • IDEA (International Data Encryption Algorithm) • 64-bit block, 128-bit key, 8 rounds, iterated, operates on 16-bit numbers • 4 multiplications mod 2^16 + 1, 4 additions mod 2^16, 6 XORs • Each round is highly sequential, so difficult to parallelize • TIE implementation required 7 instructions per round • RC6 • Same block and key sizes as AES, 20 rounds, iterated • Multiplication mod 2^32, XORs, rotations, addition mod 2^32 • TIE implementation required 2 instructions per round
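The arithmetic primitives listed for IDEA and RC6 can be sketched in a few lines of Python. This is a reference model of the standard operations, not the TIE instructions from the slides; the names `idea_mul`, `rotl32`, and `rc6_round` are illustrative, and the two round-key words passed to `rc6_round` are assumed to come from a key schedule that is not shown.

```python
MASK32 = 0xFFFFFFFF

def idea_mul(a, b):
    """IDEA multiplication mod 2^16 + 1; the 16-bit value 0 stands for 2^16."""
    a = a or 0x10000
    b = b or 0x10000
    return (a * b % 0x10001) & 0xFFFF   # a result of 2^16 is stored as 0

def rotl32(x, n):
    """Rotate a 32-bit word left by the low 5 bits of n."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32

def rc6_round(a, b, c, d, s0, s1):
    """One RC6 round on four 32-bit words with round-key words s0 and s1."""
    t = rotl32(b * (2 * b + 1) & MASK32, 5)   # multiplication mod 2^32
    u = rotl32(d * (2 * d + 1) & MASK32, 5)
    a = (rotl32(a ^ t, u) + s0) & MASK32      # data-dependent rotation
    c = (rotl32(c ^ u, t) + s1) & MASK32
    return b, c, d, a                          # rotate the register file
```

In IDEA the output of one multiplication feeds the next operation within the round, which is one way to see why the slide calls each round highly sequential. The two quadratic terms in an RC6 round, by contrast, are independent of each other.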
AES Performance in Xtensa+ • Performance of TIE extensions approaches performance of non-pipelined ASICs • Total of 31 run-time instructions per data block • 1 initial XOR instruction • 1 instruction per round computation (10 total) • 20 cycles for load and store of the 128-bit data block • Generally an order of magnitude better than pure software • Also faster than reconfigurable hardware or a specialized VLIW processor
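The per-block count above can be turned into a back-of-the-envelope throughput estimate. The sketch below assumes each of the 31 instructions costs one cycle and that nothing else stalls the pipeline; the slides give no clock frequency, so the 200 MHz figure in the usage line is purely hypothetical, and the function names are illustrative.

```python
def aes128_cycles_per_block(rounds=10, initial_xor=1, load_store=20):
    """Sum the per-block costs listed on the slide (assumes 1 cycle each)."""
    return initial_xor + rounds + load_store

def throughput_mbps(clock_mhz, cycles_per_block, block_bits=128):
    """Bits per block times blocks per second, in Mbps, for a given clock in MHz."""
    return block_bits * clock_mhz / cycles_per_block

cycles = aes128_cycles_per_block()
estimate = throughput_mbps(200, cycles)   # 200 MHz is a hypothetical clock
```

Under these assumptions the load and store of the data block accounts for 20 of the 31 cycles, so data movement rather than the round computation dominates the estimate.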
Design Tradeoffs • Flexibility • Algorithm changes • New algorithms • New encryption modes • Implementation bugs • Time to Market • Closer to software development time • Can choose which parts to accelerate
Conclusion • Xtensa instructions provide flexibility, performance, and Mbps/mW all somewhere between an ASIC and a VLIW or Software-based solution • Suitable for most Embedded Applications like 802.11i, etc. • Using Xtensa for cryptography is a good choice if: • You don’t need absolute throughput • You don’t need absolute flexibility • You need a control processor anyway • The algorithms needed are known ahead of time