## Sashisu Bajracharya MS CpE Candidate Master’s Thesis Defense Advisor: Dr. Kris Gaj


**Reconfigurable Hardware Implementation and Analysis of Mesh Routing for the Matrix Step of the Number Field Sieve Factorization**

Sashisu Bajracharya, MS CpE Candidate. Master’s Thesis Defense. Advisor: Dr. Kris Gaj, George Mason University.

**Outline**

• RSA and Factoring with Number Field Sieve (NFS)

• Matrix Step of NFS

• Basic Mesh Routing

• Improved Mesh Routing

• Results and Conclusions

• Summary & Conclusions

• SRC-6e Reconfigurable Computer

• FPGA Array

**RSA**

Major public-key cryptosystem.

• Public key: {e, N}

• Private key: {d, P, Q}

• N = PQ, where P and Q are large prime factors

[Figure: Alice and Bob communicating over a network, with encryption on one side and decryption on the other]

**RSA developed by**

Ron Rivest, Adi Shamir & Leonard Adleman in 1977

**Applications of RSA**

• Secure WWW, SSL, 95% of e-commerce (browser to web server over a network)

• S/MIME, PGP (between Alice and Bob)

**How hard is it to break RSA?**

Largest number factored: RSA-576, 576 bits (Dec 2003):

1881 9881292060 7963838697 2394616504 3980716356 3379417382 7007633564 2298885971 5234665485 3190606065 0474304531 7388011303 3967161996 9232120573 4031879550 6569962213 0516875930 7650257059

Resources & effort: 100 workstations from 8 different sites around the world, 3 months.

**Recommended key sizes for RSA**

Old standard:

• Individual users: 512 bits (155 decimal digits), broken in 1999

New standard:

• Individual users: 768 bits

• Organizations (short term): 1024 bits

• Organizations (long term): 2048 bits

**Estimated difficulty of factoring a 1024-bit number, by RSA Security, Inc.**
• 342 million PCs, 500 MHz

• 170 GB RAM

• 1 year

**Our Task**

Determine how hard it is to break RSA by factoring large key sizes using reconfigurable hardware:

• SRC-6e Reconfigurable Computer

• Generic array of FPGAs

**Complexity: Sub-exponential time and memory**

N = number to factor, k = number of bits of N.

• Exponential function: $e^{k}$

• Sub-exponential function: $e^{k^{1/3} (\ln k)^{2/3}}$

• Polynomial function: $a \cdot k^{m}$

Best algorithm to factor: Number Field Sieve.

**Computationally intensive steps**

Number Field Sieve (NFS) steps:

• Polynomial Selection

• Sieving

• Matrix (Linear Algebra)

• Square Root

**Hardware Architecture of NFS proposed to date**

• Daniel Bernstein (Univ. of Illinois, Chicago), Fall 2001: mesh approach for matrix and sieving; Mesh Sorting for the matrix step

• Adi Shamir, Eran Tromer (Weizmann Institute, Israel): Mesh Routing for the matrix step and TWIRL for sieving (AsiaCrypt 2002, Crypto 2003, AsiaCrypt 2003)

The mesh method improves the asymptotic complexity of NFS. So far there are only analytical estimations: no real implementation, no concrete numbers.

**Matrix (Linear Algebra)**

Focus of research. My objective: bring this mesh algorithm to a practical hardware implementation and concrete numbers.

**My Objective**

• Detailed design in RTL code of the mesh algorithm

• Synthesis and implementation results for an array of Virtex FPGAs and the SRC-6e Reconfigurable Computer

**Function of Matrix Step**

Find the linear dependency in the large sparse matrix obtained after the sieving step.

• D = number of matrix columns or rows: $10^{6}$ for 512-bit, $10^{7}$ for 1024-bit

• Dependency among columns: $c_{i_1} \oplus c_{i_2} \oplus \cdots \oplus c_{i_L} = 0$

[Figure: large sparse D x D matrix of 0/1 entries]

**Block Wiedemann Algorithm for the Matrix Step**

1) Uses multiple matrix-vector multiplications of the sparse matrix A with K random vectors: $A v_i, A^{2} v_i, \dots, A^{k} v_i$.
$k = 2D/K$

2) Post-computation leading to the determination of a linear dependence among the columns of matrix A.

Most time-consuming operation: $A_{[D \times D]} \cdot v_{[D \times 1]}$. Mesh-based hardware circuits, proposed by Bernstein and Shamir-Tromer, decrease the time and cost of matrix-vector multiplications.

**Two Architectures for Matrix-Vector Multiplication**

Mesh Sorting (Bernstein):

• Based on recursive sorting

• Does one multiplication at a time

• Large area

Mesh Routing (Shamir-Tromer):

• Based on routing

• Does K multiplications at a time

• Compact area (handles large matrix size)

**Matrix-Vector Multiplication**

[Figure: sparse matrix A, vector v, and the product A·v]

**Mesh Routing**

m x m mesh where $m = \sqrt{D}$.

d = maximum number of non-zero entries in any column.

[Figure: sparse matrix A (D x D), vector v, and the m x m mesh; each cell holds the row indices of the non-zero entries of its column]

**Routing in the Mesh**

Each time a packet arrives at its target cell, the packet's vector bit is XORed with the partial result bit in the target cell.

[Figure: packets routed through the mesh, with the fourth cell highlighted]

**Mesh Routing**

When routing completes, the mesh contains the result of the multiplication.

[Figure: mesh cells holding the result bits]

**Mesh Routing with K parallel multiplications**

Example for K = 2.

[Figure: vectors v1 and v2, sparse matrix A, and the mesh]

**Clockwise Transposition Routing**

In each step a cell holds one packet and receives one packet from a neighbor for compare-exchange. The exchange is done only if it reduces the distance to target of the farthest traveling
packet.

**Clockwise Transposition Routing**

[Figure: cells and compare-exchange directions; four iterations, repeated]

**Types of Packets**

1) Valid packet

2) Invalid packet: a packet becomes invalid when it reaches its destination

**Compare-Exchange Cases**

Four cases for a cell:

a) Both packets are valid (may need to exchange)

b) Current packet invalid, incoming new packet valid (may need to exchange)

c) Current packet valid, incoming new packet invalid (may need to annihilate)

d) Current packet invalid, incoming new packet invalid (no action)

**Basic Mesh Routing Design**

• Each cell of the mesh handles one column of matrix A

• K = 1 or K 32, K = number of vectors multiplied by matrix A concurrently

• Total routing takes d·4·m compare-exchange operations

**Basic Loading and Unloading Design**

[Figure: loading of non-zero matrix entries and vector, unloading of result vector]

**Parallel Loading & Unloading Design**

Restricted by the number of I/O pins available.

[Figure: parallel loading of non-zero matrix entries and vector, unloading of result vector]

**Basic Cell Design for Basic Mesh Routing**

[Block diagram: LUT-RAM, registers R[i], P[i] and P’[i], control unit (CU), CR, Check Dest, Comparator and status bits; signals include address, decode, en, en_cur, en_equal, equal, eq_packet, exchange, annihilate, oper, row/col and the cell coordinate (r, c)]

**Comparator Design**

[Block diagram: inputs are the cell’s coordinate, the current packet (row, col, s1) and the new packet (row, col, s2); row/col selection feeds a comparator (>, =) controlled by oper; control signal logic uses s1 and s2 to produce en_equal, exchange, eq_packet and annihilate]

**Improved Mesh Routing Design**

• Each cell of the mesh handles p columns of matrix A

• Compact area => handles larger matrix size

• Total routing takes p·d·4·m compare-exchange steps

• Proposed for cost reduction

**Mesh Cell Design for Improved Mesh Routing**

[Block diagram: as the basic cell, with additional address signals addr, addr1 and addr2 for selecting among the p columns stored in the LUT-RAM]

**Target FPGA Devices: Xilinx Virtex II**

XC2V8000:

• 46,592 CLB slices

• 93,184 LUTs (look-up tables)

• 93,184 FFs (flip-flops)

XC2V6000:

• 33,792 CLB slices

• 67,584 LUTs (look-up tables)

• 67,584 FFs (flip-flops)

[Figure: device layout showing CLB slices, I/O blocks, block RAMs and 18 x 18 multipliers]

**Synthesis Results for one Virtex II XC2V8000 using Basic Mesh Routing Design**

K = number of concurrent matrix-vector multiplications.

Time for K multiplications = d · 4 · m · clock period

**Speedup vs. Software Implementation**

Reference optimized software implementation: PC, Pentium IV, 2.768 GHz, 1 GB RAM.

**Distributed Computation (Geiselmann, Steinwandt)**

$$
A \cdot v =
\begin{pmatrix}
A_{1,1} & A_{1,2} & A_{1,3} \\
A_{2,1} & A_{2,2} & A_{2,3} \\
A_{3,1} & A_{3,2} & A_{3,3}
\end{pmatrix}
\begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix}
=
\begin{pmatrix} A'_1 \\ A'_2 \\ A'_3 \end{pmatrix},
\qquad
A_{1,1} v_1 + A_{1,2} v_2 + A_{1,3} v_3 = A'_1
$$

1) The FPGA array performs a single sub-matrix by sub-vector multiplication.

2) The FPGA array is reused for the next sub-computation.

512-bit and 1024-bit performance is given for square arrays of different numbers of FPGAs connected in two dimensions.

**512-bit Performance with one chip & multiple chips connected in mesh for Basic Mesh Routing**

• D = number of columns in matrix A

• m = mesh dimension

• n = number of times to repeat multiplications, $n = D^{2}/m^{4}$

• $T_K$ = routing time for K multiplications in the mesh, $T_K = d \cdot 4 \cdot m \cdot$ clock period

• $T_{Load}$ = time for loading and unloading for K multiplications

• $T_{Total}$ = total time for the matrix step, $T_{Total} = \frac{3D}{K} \cdot n \cdot (T_K + T_{Load})$

**1024-bit Performance with one chip & multiple chips connected in mesh for Basic Mesh Routing**

Symbols and formulas are the same as for the 512-bit case.

**Analysis & Conclusion**

$T_{Total}$ is expressed in terms of $m_1$, the mesh size in one Virtex II chip [formula lost in extraction].

• Polynomial speedup with the number of FPGAs

• Speedup approximately proportional to $(\#\text{FPGA})^{3/2}$

**Synthesis Results on one Virtex II XC2V8000 for Improved Mesh Routing Design**
Time for K multiplications = p · d · 4 · m · clock period

**512-bit Performance with one chip & multiple chips connected in mesh for Improved Mesh Routing**

• D = number of columns in matrix A

• p = number of columns handled in one cell = 16

• n = number of times to repeat sub-multiplications, $n = D^{2}/(m^{2} p)^{2}$

• $T_K$ = routing time for K multiplications in the mesh, $T_K = p \cdot d \cdot 4 \cdot m \cdot$ clock period

• $T_{Load}$ = time for loading and unloading for K multiplications

• $T_{Total}$ = total time for the matrix step, $T_{Total} = \frac{3D}{K} \cdot n \cdot (T_K + T_{Load})$

**1024-bit Performance with one chip & multiple chips connected in mesh for Improved Mesh Routing**

Symbols and formulas are the same as for the 512-bit case.

**Comparison of Basic & Improved Mesh Routing performance with the number of FPGAs**

[Figure: two panels, 512-bit and 1024-bit]

**Speedup of Improved to Basic Mesh Routing vs. Number of Virtex II FPGAs**

[Figure: two panels, 512-bit and 1024-bit]
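The timing formulas quoted on the performance slides can be sketched as a small calculator. This is only an illustration: the class and function names are hypothetical, and the numbers in the demo are toy values chosen to make the arithmetic easy to follow, not results from the thesis. Setting p = 1 reduces the Improved Mesh Routing expressions to the Basic Mesh Routing ones, since $D^{2}/(m^{2} \cdot 1)^{2} = D^{2}/m^{4}$.

```python
from dataclasses import dataclass


@dataclass
class MeshConfig:
    """Hypothetical container for the symbols used on the slides."""
    D: int               # number of columns in matrix A
    m: int               # mesh dimension (m x m cells)
    d: int               # maximum non-zero entries in any column
    K: int               # concurrent matrix-vector multiplications
    clock_period: float  # time of one compare-exchange step
    t_load: float        # load/unload time for K multiplications
    p: int = 1           # columns per cell; p = 1 is Basic Mesh Routing


def repeats(cfg: MeshConfig) -> float:
    """n = D^2 / (m^2 * p)^2, the number of sub-multiplications."""
    return cfg.D ** 2 / (cfg.m ** 2 * cfg.p) ** 2


def routing_time(cfg: MeshConfig) -> float:
    """T_K = p * d * 4 * m * clock period."""
    return cfg.p * cfg.d * 4 * cfg.m * cfg.clock_period


def total_time(cfg: MeshConfig) -> float:
    """T_Total = 3 * D / K * n * (T_K + T_Load)."""
    return 3 * cfg.D / cfg.K * repeats(cfg) * (routing_time(cfg) + cfg.t_load)


if __name__ == "__main__":
    # Toy values only (assumed for illustration, not measured data).
    basic = MeshConfig(D=1024, m=4, d=2, K=32, clock_period=0.5, t_load=0.0)
    improved = MeshConfig(D=1024, m=4, d=2, K=32, clock_period=0.5,
                          t_load=0.0, p=16)
    for name, cfg in (("basic", basic), ("improved", improved)):
        print(name, repeats(cfg), routing_time(cfg), total_time(cfg))
```

With the same mesh dimension, a larger p lengthens each routing pass by a factor of p but cuts the number of repeated sub-multiplications by a factor of p squared, which is the trade-off the improved design relies on in this model.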