Optimized Hardware Architecture for Montgomery Multiplication Algorithm

An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm Miaoqing Huang Nov. 5, 2010

Outline • Background • Optimized hardware architecture • Avoid the extra clock cycle delay • The overall architecture • Each PE focuses on the computation of one word of S • The data dependency graph of the proposed architecture • Comparison with other published architecture • Demonstration of computation • Resource utilization and performance comparison • High-radix architecture • Conclusion • Reference

Background • Montgomery Multiplication Algorithm is used in modular exponentiation to avoid the division by modulus, M. • Following is one implementation of Montgomery Multiplication, Radix-2 Montgomery Multiplication Algorithm, assuming we want to calculate S = X • Y mod M in which S, X, Y and M are all n-bit long.

Background (cont.) • Some definitions • n : the bit-length of original operands • w : the word-length used in real computation • e=(n+1)/w : the quantity of words to store S • S(j): one word in S • Multiple-Word Radix-2 Montgomery Multiplication Algorithm • Scan the X bit-by-bit and scan Y and M word-by-word • Calculate S word-by-word • Easy for hardware implementation because of small propagation

Background (cont.) • Data dependency in the original architecture [4] of MWR2MM algorithm • Task A consists of three steps: • Test the parity of least significant bit of S • Addition of words from S, xi•Y, and M if applicable • One-bit right shift of a S word • Task B corresponds to the last two steps of Task A [4] Tenca, A.F. and Koç, Ç. K.: A scalable architecture for Montgomery multiplication, CHES 99, LNCS 1717:94--108, 1999

Background (cont.) • One PE is in charge of the computation of one column that corresponds to the updating of S with respect to one single bit Xi. • The delay between two contiguous PEs is 2 clock cycles. • The minimum computation time in terms of clock cycle is 2•n+e given (e+1)/2 PEs are implemented to work in parallel.

Avoid the extra clock cycle delay • The origin of the extra clock cycle delay The computation of S(j-1) (of next round) requires one extra bit from S(j) (of current round), S(j)0 • Solution • Compute the two possible results of S(j) (of next round) in the same clock cycle as computing the S(j+1) (of current round); make a decision at the end of clock cycle

Avoid the extra clock cycle delay (cont.) • One singe PE is responsible to update one fixed word in S • It has two branches corresponding to two possibilities of S(i+1)0 • The correct results, the carry and the S(i)w-1, is selected from two sets of possible results by S(i+1)0, both available and registered at the same moment

The overall architecture • Every PE focuses on the computation of one single word of S The computation pattern of the architecture in [4] The computation pattern of the proposed architecture

The overall architecture (cont.) • The data dependency graph of the proposed architecture • Task D consists of three steps • Generate qi • Pre-compute two sets of data • Select one set from two • Task E corresponds to the last steps of Task D • Task F (invisible in the graph) is responsible to compute S(e-1) • Only has one branch

The overall architecture (cont.) • e PEs are required to compute the e words in S respectively. • Two shift registers, one providing single bits in X and one providing the parities of S(0)0, parallel these PEs. • (n+e-1) clock cycles are required to process the Montgomery multiplication of two n-bit operands.

Demonstration of computation • Sequential S(e-1) S(2) S(1) S(0) ←X0 • Tenca & Koç’s proposal PE#0 ←X0 S(4) S(3) S(1) S(0) S(2) PE#1 ←X1 S(0) S(1) S(2) PE#2 ←X2 S(0)

Demonstration of computation (cont.) • The proposed optimized architecture PE#0 S(0) S(0) S(0) S(0) S(0) ←X2 ←X0 ←X3 ←Xe-1 ←X1 ←X1 ←Xe-2 ←X0 ←X2 PE#1 S(1) S(1) S(1) S(1) ←X0 ←X1 ←Xe-3 PE#2 S(2) S(2) S(2) PE#3 S(3) S(3) ←X0 ←Xe-4 PE#(e-1) S(e-1) ←X0

Resource utilization and performance comparison Test platform: Xilinx Virtex-II 6000 FF1517-4

High-radix Architecture • Same optimization concept can be applied to high-radix implementation • The number of pre-computation branches is 2k • The hardware implementation beyond radix-4 becomes less viable Comparison between radix-2 and radix-4 of proposed architecture (n=1024, w=16)

Conclusion • An optimized hardware architecture to implement MWR2MM algorithm is proposed • The radix-2 version of this architecture takes (n+e-1) clock cycles to process the Montgomery multiplication of two n-bit operands • Compared to original architecture by Tenca & Koç, the new approach takes half time for processing and introduces less than 10% area penalty • The same optimization technique can be applied onto the original architecture by Tenca & Koç, keeping the scalability while reducing the processing latency to half

Reference • [1] Rivest, R. L., Shamir, A. and Adleman, L.: A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, vol.21, no.2, pp.120--126, 1978 • [2] Montgomery, P. L.: Modular multiplication without trial division. Mathematics of Computation, vol.78, pp.315--333, 1985 • [3] Gaj, K., et al.: Implementing the Elliptic Curve Method of Factoring in Reconfigurable Hardware. In CHES 2006, LNCS, vol.4249, pp.119--133, 2006 • [4] Tenca, A.F. and Koç, Ç.K.: A scalable architecture for Montgomery multiplication. In CHES 99, LNCS, vol.1717, pp.94--108, 1999 • [5] Tenca, A.F. and Koç, Ç.K.: A scalable architecture for modular multiplication based on Montgomery's algorithm, IEEE Trans. Computers, vol.52, no.9, pp.1215--1221, 2003 • [6] Tenca, A.F., Todorov, G., and Koç, Ç.K.: High-radix design of a scalable modular multiplier, In CHES 2001, LNCS, vol.2162, pp.185--201, 2001

Reference • [7] Harris, D., Krishnamurthy, R., Anders, M., Mathew, S. and Hsu, S.: An Improved Unified Scalable Radix-2 Montgomery Multiplier. In Proc. ARITH 17, pp.172--178, 2005 • [8] Michalski, E. A. and Buell, D. A.: A scalable architecture for RSA cryptography on large FPGAs. In Proc. FPL 2006, pp.145--152, 2006 • [9] Koç, Ç.K., Acar, T. and Kaliski Jr., B. S.: Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro, vol.16, no.3, pp.26--33, 1996 • [10] McIvor, C., McLoone, M. and McCanny, J.V.: High-Radix Systolic Modular Multiplication on Reconfigurable Hardware. In Proc. FPT 2005, pp.13--18, 2005 • [11] McIvor, C., McLoone, M. and McCanny, J.V.: Modified Montgomery Modular Multiplication and RSA Exponentiation Techniques. IEE Proceedings -- Computers & Digital Techniques, vol.151, no.6, pp.402--408, 2004

Optimized Hardware Architecture for Montgomery Multiplication Algorithm

Optimized Hardware Architecture for Montgomery Multiplication Algorithm

Presentation Transcript

On Karatsuba Multiplication Algorithm

Partial Products Algorithm for Multiplication

Developing For The Windows Hardware Error Architecture

Karatsuba’s Algorithm for Integer Multiplication

Partial Products Algorithm for Multiplication

Montgomery Multiplication

Enhanced matrix multiplication algorithm for FPGA

Strassen Matrix Multiplication Algorithm

An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm

Database Architecture Optimized for the New Bottleneck: Memory Access

Booth's Algorithm for 2's Complement Multiplication

Database Architecture Optimized for the new Bottleneck: Memory Access

Montgomery multiplication Algorithm

Partial Products Algorithm for Multiplication

The Architecture Of Photosynthesis is Optimized to:

Montgomery Modular Multiplication

The ARM7TDMI Hardware Architecture

Very High Radix Montgomery Multiplication

DDAQ Hardware Architecture

Partial Product (Multiplication Algorithm)

Partial Product (Multiplication Algorithm)

Area-Time-Efficient Montgomery Modular Multiplication