slide1 n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Sashisu Bajracharya MS CpE Candidate Master’s Thesis Defense Advisor: Dr. Kris Gaj PowerPoint Presentation
Download Presentation
Sashisu Bajracharya MS CpE Candidate Master’s Thesis Defense Advisor: Dr. Kris Gaj

play fullscreen
1 / 78

Sashisu Bajracharya MS CpE Candidate Master’s Thesis Defense Advisor: Dr. Kris Gaj

203 Views Download Presentation
Download Presentation

Sashisu Bajracharya MS CpE Candidate Master’s Thesis Defense Advisor: Dr. Kris Gaj

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Reconfigurable Hardware Implementation and Analysis of Mesh Routing for the Matrix Step of the Number Field Sieve Factorization Sashisu Bajracharya MS CpE Candidate Master’s Thesis Defense Advisor: Dr. Kris Gaj George Mason University

  2. Outline 1. RSA and Factoring with Number Field Sieve(NFS) 2. Matrix Step of NFS Basic Mesh Routing 6. 3. Improved MeshRouting 4. 7. 5. 8. Results and Conclusions 9. Summary &Conclusions SRC-6e Reconfigurable Computer FPGA Array

  3. RSA Major Public Key Cryptosystem { e, N } { d, P, Q } N = PQ P, Q - large prime factors Private key (d, P,Q) Public key (e, N) Network Alice Bob Decryption Encryption

  4. RSA developed by Ron Rivest, Adi Shamir & Leonard Adlemann in 1977

  5. Secure WWW, SSL, 95% of e-commerce Network Browser WebServer S/MIME, PGP Bob Alice Applications of RSA

  6. How hard is to break RSA? Largest Number Factored: 576 bits RSA-576 (Dec 2003) 1881 9881292060 7963838697 2394616504 3980716356 3379417382 7007633564 2298885971 5234665485 3190606065 0474304531 7388011303 3967161996 9232120573 4031879550 6569962213 0516875930 7650257059 Resources & efforts 100 workstations from 8 different sites around the world 3 months

  7. Recommended key sizes for RSA Old standard: 512 bits (155 decimal digits) Individual users Broken in 1999 New standard: Individual users 768 bits Organizations (short term) 1024 bits Organizations (long term) 2048 bits

  8. Estimated Difficulty of factoring 1024-bit number by RSA Security, Inc. 342 million PCs, 500 MHz 170 GB RAM 1 year

  9. Our Task Determine how hard is to break RSA for factoring large key sizes using reconfigurable hardware SRC-6e Reconfigurable Computer Generic Array of FPGAs

  10. Complexity: Sub-exponential time and memory N = Number to factor, k = Number of bits of N exponential function, ek Sub-exponential function, e k1/3 (ln k)2/3 Polynomial function, a•km Best Algorithm to Factor NUMBER FIELD SIEVE

  11. Computationally intensive steps Number Field Sieve(NFS) Steps Polynomial Selection Sieving Matrix(Linear Algebra) Square Root

  12. Hardware Architecture of NFS proposed to date Daniel Bernstein Adi Shamir, Eran Tromer Univ. of Illinois, Chicago Weizemann Institute, Israel Mesh Routing Mesh Approach Matrix Matrix and Sieving AsiaCrypt 2002 Mesh Sorting: TWIRL Matrix Sieving Fall 2001 Crypto 2003, AsiaCrypt 2003 Mesh method improves asymptotic complexity for NFS performance Just analytical estimations, no real implementation, no concrete numbers

  13. Matrix(Linear Algebra) Focus of Research My Objective Bring this mesh algorithm to practical hardware implementation and concrete numbers

  14. My Objective • Detailed design in RTL code of the mesh algorithm • Synthesis and Implementation Results for an array of Virtex FPGAs and SRC-6e Reconfigurable computer

  15. Function of Matrix Step Find the linear dependency in the large sparse matrix obtained after sieving step. D = number of matrix columns or rows. . . . . . . . . .. 0 0 0 0 . . . 0 0 1 . . . . . . . . . 0 1 0 0 . . . 0 0 0 . . . . . . . . .. 0 0 0 0 . . . 0 1 0 . . . . . . . . .. 0 0 0 0 . . . 0 1 0 1 0 0 0 . . . 0 0 0 . . . . . . . . .. 106 for 512-bit 107 for 1024-bit D  ci1 ci2 ciL ci1 ci2 . . .ciL = 0

  16. Block Weidemann Algorithm for the Matrix Step 1) Uses multiple matrix-vector multiplications of the sparse matrix A with K random vectorsAvi, A2vi, …. , Akvi . k = 2D/K 2) Post computation leading to the determination of linear dependence on columns of matrix A Most Time consuming operation: A[DxD]v[Dx1] Mesh based hardware circuits, proposed by Bernstein and Shamir-Tromer, decrease the time and cost of matrix-vector multiplications

  17. Mesh Sorting (Bernstein) Mesh Routing (Shamir-Tromer) • Based on Recursive Sorting • Based on Routing • Does one multiplication at a time • large area • Does K multiplications at a time • compact area - (handles large matrix size) Two Architectures for Matrix-vector multiplication

  18. Mesh Routing

  19. 1 0 1 1 0 1 0 1 0 Matrix-Vector Multiplication v 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 A·v A Sparse Matrix

  20. Mesh Routing m x m mesh where m= D v 0 0 1 1 0 0 1 1 1 1 0 0 1 1 0 0 1 1 cell( S ) 0 1 2 3 4 5 6 7 8 9 1 1 2 2 4 4 3 3 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 5 5 7 7 9 9 8 8 1 1 1 1 1 1 3 3 2 2 5 5 1 1 0 A m D 1 1 1 1 1 6 6 4 4 7 7 1 1 1 1 1 1 3 3 2 2 1 0 1 1 1 1 1 4 4 8 8 6 6 8 8 9 9 1 1 1 1 1 1 m 1 1 1 1 D d = maximum non-zero entries for all column

  21. 3 2 4 0 1 0 8 9 5 7 3 2 1 1 5 1 0 7 6 4 2 1 3 1 0 1 4 8 6 8 9 Routing in the Mesh Fourth cell Each time a packet arrives at the target cell, the packet’s vector’s bit is xored with the partial result bit on the target cell

  22. Mesh Routing Mesh contains the result of the multiplication 3 2 4 0 0 1 1 1 0 8 9 5 7 3 2 5 0 1 1 0 1 1 7 6 4 2 1 3 1 0 0 1 1 0 4 8 6 8 9

  23. 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0 1 0 1 Mesh Routing with K parallel multiplications Example for K=2 v2 0 1 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 v1 0 1 1 0 0 0 1 1 0 1 1 0 0 1 0 0 1 0 1 1 1 1 1 1 1 1 2 2 4 4 3 3 1 1 1 1 1 1 5 5 7 7 9 9 8 8 A 1 1 1 1 1 1 3 3 2 2 5 5 1 1 1 1 1 6 6 4 4 7 7 1 1 1 1 1 1 3 3 2 2 1 1 1 1 4 4 8 8 6 6 8 8 9 9 1 1 1 1 1 1 1 1 1 1 mesh

  24. Clockwise Transposition Routing Each step a cell holds one packet, and receives one packet from neighbor for compare-exchange Exchange is done only if it reduces the distance to target of the farthest traveling packet

  25. Cells Compare-exchange direction Clockwise Transposition Routing Four iterations repeated

  26. Types of Packets 1) Valid packet 2) Invalid packet - packet becomes invalid when reached to destination

  27. N 1 2 1 N 2 1 1 b) Current packet invalid, incoming new packet valid (may need to exchange) a) Both packets are valid (may need to exchange) N N 2 N N 2 N N c) Current packet valid, incoming new packet invalid (may need to annihilate) c) Current packet invalid, incoming new packet invalid (no action) Compare-Exchange Cases Four cases for a cell Left cell

  28. Basic and Improved Mesh Routing Designs

  29. Basic Mesh Routing Design • Each Cell of mesh handles one column of matrix A • K = 1 or K  32, K = number of vectors multiplied by matrix A concurrently • Total routing takes d·4·m compare-exchange operations

  30. Basic Loading and Unloading Design Non Zero Matrix Entries Result Vector Vector

  31. Parallel Loading & Unloading Design Non Zero Matrix Entries Result Vector Vector Restricted by Number of IO pins available

  32. Basic Cell Design for Basic Mesh Routing LUT-RAM R[i] address P[i] CU decode en equal annihilate eq_pack en_cur CR eq_packet exchange P’[i] row/col Check Dest exchange en_equal oper Comparator r c annihilate equal coordinate en_equal eq_packet Status bits

  33. Comparator Design cell’s coordinate current packet new packet row col s1 row col s2 row col row/col s1, s2 > = a1 b1 oper Control Signal Logic s1s2 en_equal exchange eq_packet annihilate

  34. Improved Mesh Routing Design • Each Cell of mesh handles p columns of the matrix A • Compact area => handles larger matrix size • Total routing takes p·d·4·m compare-exchange steps • proposed for cost reduction

  35. Mesh Cell Design for Improved Mesh Routing R[i] address CU LUT-RAM decode en addr en P[i] equal eq_pack annihilate en_cur CR exchange eq_packet P’[i] addr1 en_equal row/col Check_Dest exchange addr2 r c oper annihilate Comparator coordinate equal en_equal eq_packet Status bits

  36. XC2V8000 • 46,592 CLB slices • 93,184 LUT ( LookUpTable ) Block RAMs Block RAMs Block RAMs Block RAMs • 93,184 FF (Flip-Flop) XC2V6000 • 33,792 CLB slices Multipliers 18 x 18 Multipliers 18 x 18 Multipliers 18 x 18 Multipliers 18 x 18 • 67,584 LUT ( LookUpTable ) • 67,584 FF (Flip-Flop) CLB slice I/O Block Target FPGA Devices Xilinx Virtex II

  37. Results and Analysis

  38. Synthesis Results for one Virtex II XC2V8000 using Basic Mesh Routing Design K = number of concurrent matrix-vector multiplicationsTime for K mult = d * 4 * m * Clock period

  39. Speedup vs. Software Implementation Reference Optimized SW Implementation: PC, Pentium IV, 2.768 GHz, 1 GB RAM

  40. Distributed Computation(Geiselmann, Steinwandt) A v Av A1,1 A1,2 A1,3 v1 A’1 A2,1 A2,2 A2,3  A’2 v2 = A3,1 A3,2 A3,3 v3 A’3 A1,1  v1 +  v2 + A1,3  v3 = A’1 A1,2 A·v =

  41. 1) FPGA array performs single sub-matrix by sub-vector multiplication 2) Reuse FPGA array for next sub-computation   512-bit & 1024-bit performance with different number of square array of FPGAs connected in two dimensions

  42. 512-bit Performance with one chip & multiple chips connected in mesh for Basic Mesh Routing D = number of columns in matrix A m = mesh dimension n = number of times to repeat multiplications = D2/(m4) TK = routing time for K multiplications in the mesh = d*4*m* Clock Period TLoad = time for loading and unloading for K multiplications TTotal = total time for a Matrix step = 3*D/K * n *( TK + TLoad)

  43. 1024-bit Performance with one chip & multiple chips connected in mesh for Basic Mesh Routing D = number of columns in matrix A m = mesh dimension n = number of times to repeat multiplications = D2/(m4) TK = routing time for K multiplications in the mesh = d*4*m* Clock Period TLoad = time for loading and unloading for K multiplications TTotal = total time for a Matrix step = 3*D/K * n *( TK + TLoad)

  44. TTotal = m1 = mesh size in one Virtex II chip Analysis & Conclusion • Polynomial Speedup with number of FPGAs • Speedup approximately proportional to (#FPGA)3/2

  45. Speedup vs. number of FPGA chips

  46. Synthesis Results on one Virtex II XC2V8000 for Improved Mesh Routing Design Time for K mult = p * d * 4 * m * Clock period

  47. 512-bit Performance with one chip & multiple chips connected in mesh for Improved Mesh Routing D = number of columns in matrix A p = number of columns handled in one cell=16 n = number of times to repeat sub-multiplications = D2/(m2 p)2TK = routing time for K multiplications in the mesh = p*d*4*m*Clock period TLoad = time for loading and unloading for K multiplications TTotal = total time for a Matrix step = 3*D/K* n *( TK + TLoad)

  48. 1024-bit Performance with one chip & multiple chips connected in mesh for Improved Mesh Routing D = number of columns in matrix A p = number of columns handled in one cell=16 n = number of times to repeat sub-multiplications = D2/(m2 p)2TK = routing time for K multiplications in the mesh = p*d*4*m*Clock period TLoad = time for loading and unloading for K multiplications TTotal = total time for a Matrix step = 3*D/K* n *( TK + TLoad)

  49. Comparison of Basic & Improved Mesh Routing performance with the number of FPGAs 512-bit 1024-bit

  50. Speedup of Improved to Basic Mesh Routing vs. Number of Virtex II FPGAs 512-bit 1024-bit