Unified Architectures for Efficient and Compact Crypto-Processing

Erkay Savaş, Sabancı University

Outline • Research Motivation • Public Key Cryptography • Unified Arithmetic • High-Radix Multiplication • Dual-Radix Multiplication • Support for GF(3^n) Arithmetic • Implementation Results • Future Research

Motivation • Compatibility • support for fast arithmetic in different finite fields and groups • Savings in area • improved {time × area} metric • Algorithm agility • NTRU, ECC

Public Key Cryptography (PKC) • Each user has a pair of keys: • Private Key: known only to the owner • Public Key: known to everyone in the system, with assurance of its authenticity • Encryption: • performed with the Public Key of the receiver • Decryption: • only the receiver can decrypt the message, using her/his Private Key

Public Key Cryptography in Use • RSA, Rabin's scheme • Integer factorization, square roots modulo a composite number • Discrete Logarithm Based Algorithms • Diffie-Hellman Key Exchange, ElGamal • Elliptic curve DH Key Exchange, ECDSA • Discrete logarithm over elliptic curves • IBE • pairings over elliptic curve points

RSA • Most popular PKC • Invented by Rivest, Shamir and Adleman in 1977 at MIT • Its patent expired in 2000 • Based on the integer factorization problem • Each user has a public and private key pair

RSA Encryption & Decryption • Encryption is done with the public key: y ≡ x^e mod n, where x, y < n • Decryption is done with the private key: x ≡ y^d mod n
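The two formulas above can be tried out with a toy key. This is a minimal sketch, assuming illustrative parameters (p = 61, q = 53, e = 17) that are not from the talk and are far too small for real use; the function names are hypothetical.

```python
# Toy RSA key (assumed, illustrative values only; never use sizes like this).
p, q = 61, 53
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)        # Euler's totient of n
e = 17                         # public exponent, coprime to phi
d = pow(e, -1, phi)            # private exponent (modular inverse, Python 3.8+)


def rsa_encrypt(x, e, n):
    """Compute y = x^e mod n for a message 0 <= x < n."""
    return pow(x, e, n)


def rsa_decrypt(y, d, n):
    """Compute x = y^d mod n."""
    return pow(y, d, n)


x = 65
y = rsa_encrypt(x, e, n)
assert rsa_decrypt(y, d, n) == x   # decryption recovers the message
```

Both directions are a single modular exponentiation, which in turn is a long chain of modular multiplications.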

DL Based Cryptosystems • Fundamental operation: g^x mod p, where x, g < p and g is a primitive element
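The exponentiation g^x mod p is normally computed by square-and-multiply. The sketch below is a generic textbook version (the name `mod_exp` is hypothetical, not from the talk); it shows that the whole operation reduces to repeated modular multiplications.

```python
def mod_exp(g, x, p):
    """Right-to-left square-and-multiply: returns g^x mod p for x >= 0."""
    result = 1
    base = g % p
    while x > 0:
        if x & 1:                      # current exponent bit is 1: multiply
            result = (result * base) % p
        base = (base * base) % p       # square for the next bit
        x >>= 1
    return result


# Agrees with Python's built-in modular exponentiation.
assert mod_exp(5, 117, 23) == pow(5, 117, 23)
```

For a k-bit exponent this costs about k squarings and, on average, k/2 multiplications, which is why the speed of the modular multiplier dominates.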

Elliptic Curve Cryptography 1/2 • Emerging public key cryptography standard for constrained devices • A 160-bit key is equivalent in cryptographic strength to 1024-bit RSA • 313-bit ECC is equivalent to 4096-bit RSA • Elliptic curves, as algebraic/geometric entities, have been studied extensively for the past 150 years • Rich and deep theory suitable for cryptography • First proposed for cryptographic use in 1985, independently by Neal Koblitz and Victor Miller

Elliptic Curve Cryptography 2/2 • Dominant fundamental operation • Multiplication in GF(q), where q = p^k and p is prime • Alternatives • GF(p): k = 1 • GF(2^k): p = 2 • GF(p^k) • GF(3^k): p = 3

Identity Based Encryption (IBE) • Public key can be any string • e-mail address, name, etc. • No need for certificates • Anonymity achieved • users can choose any public key without revealing their ID • users can easily change it

IBE: Bilinear Mapping • e(xP, yQ) = e(P, Q)^(xy) = e(yP, xQ) = g • g is an element of (an extension of) the underlying field • Bilinear mappings over elliptic curves • Weil pairing • Tate pairing • Resource consuming • Most efficient bilinear mappings • defined on curves over GF(3^k)

An Introduction to Unified Arithmetic • Three types of finite fields are heavily used • Prime fields, GF(p) • Binary extension fields, GF(2^k) • Ternary extension fields, GF(3^k) (recently, due to IBE schemes) • These finite fields have dissimilar properties • Hence different implementations on specialized hardware

Unified Arithmetic • Unified hardware design methodology requires • A single (unified) datapath • A single (unified) control • Insignificant overhead in area • Insignificant overhead in time complexity (e.g. critical path delay) • Good {time × area} metric

Unified Arithmetic (GF(p) + GF(2^k)) • A unified hardware design methodology for both fields is possible since: • the elements of either field are represented using almost the same data structures in digital systems • the algorithms for basic arithmetic operations in both fields have structural similarities (i.e. the steps of the algorithms are almost identical) • Hence, unified arithmetic is possible

Finite Field Operations in ECC • Addition in GF(p) and GF(2^k) • Relatively inexpensive in area and time complexity • Multiplicative inversion in GF(p) and GF(2^k) • Prohibitively expensive in terms of time • Possible to avoid some inversions • Multiplication in GF(p) and GF(2^k) • Expensive in terms of time and area • Usually the most important operation • Our focus

Montgomery Multiplication • Very efficient way of doing multiplication in GF(p) and GF(2^k) (now also in GF(3^k)) • Faster (replaces division by shifts) • Suitable for unified design • Suitable for scalable design • Highly parallel • Suitable for pipelining

Montgomery Multiplication • Definition: • Given a, b ∈ GF(p), MonMul(a, b) = a·b·R^(-1) mod p, where R = 2^k mod p and k = ⌈log2 p⌉ • Algorithm • c := 0 • for i = 0 to k-1 • c := c + a_i · b • c := (c + c_0 · p)/2 • if c ≥ p then c := c - p (final subtraction)
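The bit-serial algorithm on this slide translates almost line for line into software. The following is a minimal sketch, assuming an odd modulus p with a, b < p < 2^k; the function name `mon_mul` and the small test modulus 239 are illustrative choices, not taken from the talk.

```python
def mon_mul(a, b, p, k):
    """Radix-2 Montgomery multiplication: returns a*b*2^(-k) mod p.

    Assumes p is odd and 0 <= a, b < p < 2^k.
    """
    c = 0
    for i in range(k):
        a_i = (a >> i) & 1            # scan the multiplier one bit at a time
        c += a_i * b
        c = (c + (c & 1) * p) >> 1    # add p if c is odd, then halve exactly
    if c >= p:                        # c < 2p here, so one subtraction suffices
        c -= p
    return c


# Usage sketch: multiply 100 * 200 mod 239 through the Montgomery domain.
p, k = 239, 8
R = (1 << k) % p
a_bar, b_bar = (100 * R) % p, (200 * R) % p   # operands scaled by R
prod_bar = mon_mul(a_bar, b_bar, p, k)        # equals (100*200)*R mod p
assert mon_mul(prod_bar, 1, p, k) == (100 * 200) % p
```

The only operations in the loop are additions and a one-bit right shift, which is what makes the algorithm attractive for hardware.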

Algorithm for GF(2^k) • Input: a(x), b(x) ∈ GF(2^k), p(x) and k • Output: c(x) = a(x)·b(x)·x^(-k) mod p(x), c(x) ∈ GF(2^k) • c(x) := 0 • for i = 0 to k-1 • c(x) := c(x) ⊕ a_i · b(x) • c(x) := (c(x) ⊕ c_0 · p(x))/x • No final subtraction • Note that • c/2 and c(x)/x are implemented in an identical way in SW and HW
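The GF(2^k) variant has the same loop with XOR in place of addition. Below is a minimal sketch that stores a polynomial as an integer bit vector (bit i is the coefficient of x^i); the function names and the choice of the AES polynomial x^8 + x^4 + x^3 + x + 1 as test modulus are assumptions for illustration.

```python
def mon_mul_gf2(a, b, p, k):
    """Montgomery multiplication in GF(2^k): a(x)*b(x)*x^(-k) mod p(x).

    p is the degree-k irreducible polynomial (constant term 1); deg a, b < k.
    """
    c = 0
    for i in range(k):
        if (a >> i) & 1:
            c ^= b                # polynomial addition is XOR (no carries)
        if c & 1:
            c ^= p                # clear the constant term
        c >>= 1                   # exact division by x
    return c                      # degree is already < k: no final subtraction


def gf2_mulmod(a, b, p, k):
    """Reference shift-and-add multiplication a(x)*b(x) mod p(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> k) & 1:
            a ^= p                # reduce as soon as degree reaches k
    return r


# Consistency check: undoing the x^(-k) factor gives the ordinary product.
p, k = 0x11B, 8
x_k = (1 << k) ^ p                # x^k mod p(x)
m = mon_mul_gf2(0x57, 0x83, p, k)
assert gf2_mulmod(m, x_k, p, k) == gf2_mulmod(0x57, 0x83, p, k)
```

Comparing this loop with the GF(p) loop shows the structural similarity the talk relies on: only the adder behaviour changes.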

Representation • Addition • Atomic operation: multiplication is performed as repeated addition • Unified addition • most efficient when carry-save representation is used for elements of GF(p) • Carry-save representation • an integer is represented as the sum of two other integers • x := x_s + x_c (sum and carry parts, resp.)
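A carry-save addition step can be modelled with bitwise operations on integers. This is a minimal sketch (the name `csa_add` is hypothetical) of a 3:2 compressor that adds an operand y to a value held as the pair (x_s, x_c) without propagating carries across bit positions.

```python
def csa_add(xs, xc, y):
    """Carry-save addition: returns (s, c) with s + c == xs + xc + y.

    Every output bit depends only on the three input bits at its own
    position, so the delay is independent of the operand length.
    """
    s = xs ^ xc ^ y                                # per-bit sum
    c = ((xs & xc) | (xs & y) | (xc & y)) << 1     # per-bit majority, weight 2
    return s, c


s, c = csa_add(0b1011, 0b0110, 0b1101)
assert s + c == 0b1011 + 0b0110 + 0b1101
```

A full carry-propagating addition is only needed when the final result is converted back to ordinary binary form.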

Scalability • The original Montgomery multiplication algorithm performs full-precision integer additions • Not scalable • Instead, • long integers are divided into words • Additions of words are handled separately by word adders • Choice of word length depends on the precision, area and speed requirements
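The word-based idea can be sketched in software as follows. This is a simplified, sequential model under stated assumptions: odd p, a and b below p, and e = ⌈(k+1)/w⌉ words so that intermediate values below 2p fit. It is not the pipelined organization of the talk, and the name `mon_mul_words` is hypothetical.

```python
def mon_mul_words(a, b, p, k, w):
    """Radix-2 Montgomery multiplication using only w-bit word additions."""
    e = (k + 1 + w - 1) // w                  # number of words per operand
    mask = (1 << w) - 1
    B = [(b >> (w * j)) & mask for j in range(e)]
    P = [(p >> (w * j)) & mask for j in range(e)]
    C = [0] * e                               # running result, word by word
    for i in range(k):
        a_i = (a >> i) & 1
        q = (C[0] + a_i * B[0]) & 1           # parity decides whether p is added
        carry = 0
        T = [0] * e
        for j in range(e):                    # one word addition per step
            t = C[j] + a_i * B[j] + q * P[j] + carry
            T[j] = t & mask
            carry = t >> w
        for j in range(e - 1):                # shift right by one bit across words
            C[j] = (T[j] >> 1) | ((T[j + 1] & 1) << (w - 1))
        C[e - 1] = (T[e - 1] >> 1) | (carry << (w - 1))
    c = sum(C[j] << (w * j) for j in range(e))
    return c - p if c >= p else c


p = (1 << 61) - 1
a, b = 123456789012345, 987654321098765
assert mon_mul_words(a, b, p, 61, 16) == (a * b * pow(2, -61, p)) % p
```

Changing `w` changes only the number of inner-loop steps, not the datapath width, which is the sense in which the design scales to any operand precision.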

Word-Based Multiplication • [Figure: processing units PU_i and PU_(i+1) with multiplier bits a_i and a_(i+1), operating on words b(j), b(j+1), p(j), p(j+1), c(j), c(j+1)]

Dependency Graph • [Figure]

Processing Unit (PU) with w = 2 • [Figure: four dual-field adders controlled by FSEL, producing C1(j) and C0(j)]

Dual-Field Adder (DFA) 1/2 • Almost identical to a full adder (FA) • Difference • it has an additional control input (FSEL), which suppresses the carry output of the adder when it is set to logic 0 • Namely, when FSEL = 0 the adder operates in GF(2^k); otherwise it behaves as a regular FA
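The behaviour described above fits in a few lines. This is a behavioural sketch of a one-bit dual-field adder, not a gate-level netlist; the function name and argument order are assumptions.

```python
def dual_field_adder(a, b, c, fsel):
    """One-bit dual-field adder. All arguments are 0 or 1.

    fsel = 1: regular full adder, returns (cout, s) with 2*cout + s == a+b+c.
    fsel = 0: GF(2) mode, carry output forced to 0 and s = a XOR b XOR c.
    """
    s = a ^ b ^ c
    cout = fsel & ((a & b) | (a & c) | (b & c))   # carry gated by FSEL
    return cout, s


# Exhaustive check of both modes.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            cout, s = dual_field_adder(a, b, c, 1)
            assert 2 * cout + s == a + b + c
            assert dual_field_adder(a, b, c, 0) == (0, a ^ b ^ c)
```

The sum output is the same in both modes, so the only extra logic compared with a full adder is the gating of the carry.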

DFA 2/2 • [Figure: dual-field adder with inputs A, B, C and FSEL, outputs S and Cout]

Pipeline Organization with two PUs • [Figure: blocks SR-a, RAM-a, RAM-b, RAM-p, PU-1, PU-2, SR-C] • s: the number of PUs

Total Computation Time (in clock cycles) • [Formula not preserved in the transcript] • w: word size, k: precision, e := k/w, s: the number of PUs

Example Execution Times • Example: k = 1024, w = 32 • s = 17 → T = 2105 • s = 15 → T = 2305 • s = 10 → T = 3415 • s = 1 → T = 33792 • Example: k = 2048, w = 32 • s = 33 → T = 4221 • s = 30 → T = 4543 • s = 10 → T = 13343 • s = 1 → T = 133120

Comparison to the single-field (GF(p)) design • [Figure] • w: word size • 1.2 μm CMOS technology

Design Alternatives • Higher Radix • Original design is radix 2 • Namely, the multiplier is scanned one bit per clock cycle • Possible to scan two or more bits of the multiplier a per cycle • Radix-4: two bits • Radix-8: three bits • More complex design: lower clock frequency, larger area • Lower clock cycle count → faster execution of multiplication
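Scanning r multiplier bits per iteration can be sketched as a radix-2^r Montgomery loop. The digit selection below uses the standard textbook constant -p^(-1) mod 2^r; whether the hardware in the talk selects its quotient digit this way is an assumption, and the function name is hypothetical.

```python
def mon_mul_radix(a, b, p, k, r):
    """Radix-2^r Montgomery multiplication: a*b*2^(-r*n) mod p, n = ceil(k/r).

    Assumes p is odd and 0 <= a, b < p < 2^k.
    """
    R = 1 << r
    n = -(-k // r)                          # iterations: ceil(k / r)
    p_neg_inv = (-pow(p, -1, R)) % R        # -p^(-1) mod 2^r (Python 3.8+)
    c = 0
    for i in range(n):
        digit = (a >> (r * i)) & (R - 1)    # r multiplier bits per iteration
        c += digit * b
        q = ((c & (R - 1)) * p_neg_inv) & (R - 1)
        c = (c + q * p) >> r                # low r bits are zero: exact shift
    return c - p if c >= p else c


p, k = 239, 8
# Radix-4 (r = 2) needs 4 iterations instead of 8 for an 8-bit multiplier.
assert mon_mul_radix(100, 200, p, k, 2) == (100 * 200 * pow(2, -8, p)) % p
```

The loop count drops from k to ⌈k/r⌉, while each iteration needs a digit-by-operand product and a quotient-digit computation, which is where the extra area and longer critical path come from.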

Comparison • Higher radix vs. single radix • Metric • area × time • For small total area (i.e. fewer than 10,000 equivalent NAND gates) the performance of radix-2 and radix-8 is comparable • The radix-8 multiplier outperforms the radix-2 multiplier by more than a factor of 3 when the total area is around 25,000 NAND gates

Dual-Radix Multiplier • [Figure: MUX-1, MUX-2, selection logic, 3x2 dual-field adder] • Radix-2 for GF(p) and radix-4 for GF(2^k)

Dual-Radix Multiplier • Three multipliers • A1: GF(p)-only multiplier • A2: single-radix unified multiplier (with precomputation) • A3: dual-radix multiplier • Performance (area × time) • A3 performs slightly worse than A1 and A2 (by 7% to 19%) in GF(p) mode • A3 outperforms A2 by 38% to 46% in GF(2^k) mode

Unified Arithmetic? • Unified multiplier • carry-save adders are used in the multiplier • Other arithmetic operations, such as subtraction and comparison (essential in inversion), are not easy to perform in carry-save representation

New Redundant Representation • Recall: • Carry-save representation • X = x_s + x_c • New redundant representation • Redundant signed digit (RSD) representation • X = x_p - x_n • Subtraction is equivalent to addition • X - Y = (x_p - x_n) - (y_p - y_n) = (x_p - x_n) + (y_n - y_p) • Comparison is relatively easy
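The identity on this slide can be illustrated with a value held as a pair (x_p, x_n). In this minimal sketch the components are added with ordinary integer addition purely for illustration; a hardware RSD adder would instead work digit by digit without long carry chains. All function names are hypothetical.

```python
def rsd_value(x):
    """Integer represented by the pair (xp, xn): X = xp - xn."""
    xp, xn = x
    return xp - xn


def rsd_neg(x):
    """Negation is free: swap the positive and negative parts."""
    xp, xn = x
    return (xn, xp)


def rsd_add(x, y):
    """Add component-wise (simplified model of an RSD adder)."""
    return (x[0] + y[0], x[1] + y[1])


def rsd_sub(x, y):
    """X - Y = (xp - xn) + (yn - yp): an addition with Y's parts swapped."""
    return rsd_add(x, rsd_neg(y))


X, Y = (13, 6), (4, 9)          # X = 7, Y = -5
assert rsd_value(rsd_sub(X, Y)) == rsd_value(X) - rsd_value(Y)
```

Because negation costs nothing, the same adder serves for both addition and subtraction.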

RSD • All previous multipliers require a reverse transformation to non-redundant form after each multiplication • There are thousands of multiplications in ECC • With RSD, all computation can be done in RSD form without any reverse transformation • a single transformation is necessary only if the result is needed in non-redundant form

Support for GF(3^n) Arithmetic • RSD lends itself to a unified arithmetic architecture that efficiently supports GF(3^n) arithmetic
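One way to see the connection: an RSD digit takes the values -1, 0 and 1, which can also stand for the three elements of GF(3). The sketch below assumes an encoding chosen for illustration (digit 1 sets a bit in x_p, digit -1, i.e. 2 mod 3, sets a bit in x_n) and adds two GF(3^n) elements coefficient-wise with bitwise logic only. The encoding and function names are assumptions, not the talk's actual circuit.

```python
def gf3_encode(digits):
    """Pack coefficients in {-1, 0, 1} (index = power of x) into (xp, xn)."""
    xp = xn = 0
    for i, d in enumerate(digits):
        if d == 1:
            xp |= 1 << i
        elif d == -1:
            xn |= 1 << i
    return xp, xn


def gf3_decode(x, n):
    """Unpack (xp, xn) into a list of n coefficients in {-1, 0, 1}."""
    xp, xn = x
    return [((xp >> i) & 1) - ((xn >> i) & 1) for i in range(n)]


def gf3_add(a, b, n):
    """Coefficient-wise addition mod 3 of two n-coefficient polynomials."""
    mask = (1 << n) - 1
    ap, an = a
    bp, bn = b
    a_zero = mask & ~(ap | an)          # positions where a's digit is 0
    b_zero = mask & ~(bp | bn)          # positions where b's digit is 0
    # Result digit is +1 for 1+0, 0+1 and (-1)+(-1); -1 for the mirror cases.
    sp = (ap & b_zero) | (bp & a_zero) | (an & bn)
    sn = (an & b_zero) | (bn & a_zero) | (ap & bp)
    return sp, sn


a = gf3_encode([1, -1, 0, 1])
b = gf3_encode([1, -1, 1, -1])
assert gf3_decode(gf3_add(a, b, 4), 4) == [-1, 1, 1, 0]
```

As in the GF(2^k) case, no carries move between digit positions, so the structure resembles a carry-suppressed adder operating on two bit vectors.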

Analysis • A1: GF(p)-only architecture • A2: GF(2^k)-only architecture • A3: GF(3^n)-only architecture • A4: Unified architecture (GF(p) + GF(2^k)) • A5: Unified architecture (GF(p) + GF(2^k) + GF(3^n)) • A1 + A2: Hypothetical architecture that has separate datapaths for GF(p) and GF(2^k)

Analysis • Metric: area × time • A4 over A1 + A2: 7.94% • A5 over A1 + A2 + A3: 33.54% • A5 over A4 + A3: 28.36%

Implementation Results • 2.38 GHz, 0.13 μm CMOS • 4 PUs → ~11,000 NAND gates, 8 PUs → ~15,000 NAND gates

Research Directions • Embed the unified architectures into common general-purpose processors • Unified inversion using RSD • Unified architectures for other PKC

Ending… • Questions • Contact • Erkay Savaş • erkays@sabanciuniv.edu • http://people.sabanciuniv.edu/~erkays
