
An Efficient FPGA Implementation of IEEE 802.16e LDPC Encoder



  1. An Efficient FPGA Implementation of IEEE 802.16e LDPC Encoder Speaker: Chau-Yuan-Yu Advisor: Mong-Kai Ku

  2. Outline • Introduction • Low-Density Parity-Check Codes • Related work • General encoding for LDPC codes • Efficient encoding for Dual-Diagonal matrix • Better Encoding scheme • LDPC Encoder Architecture • Parallel Encoder • Serial Encoder • Result • Conclusion

  3. Outline • Introduction • Low-Density Parity-Check Codes • Related work • General encoding for LDPC codes • Efficient encoding for Dual-Diagonal matrix • Better Encoding scheme • LDPC Encoder Architecture • Parallel Encoder • Serial Encoder • Result • Conclusion

  4. Low-Density Parity-Check Code • Benefits of LDPC codes • Performance approaching the Shannon limit • Low error floor • LDPC codes are adopted by various standards (e.g. DVB-S2, 802.11n, 802.16e)

  5. Low-Density Parity-Check Code • Parity check matrix H is sparse • Very few 1’s in each row and column • The null space of H is the codeword space: H · c^T = 0 for every valid codeword c

  6. Low-Density Parity-Check Code • In an (n, k) block code, k information bits are encoded into an n-bit codeword. • In systematic block codes, the information bits appear directly in the codeword c = [s | p], with systematic part s and parity part p.

  7. Low-Density Parity-Check Code • General encoding of systematic linear block codes • Find the generator matrix G from H. • c = sG = [s | p] (a sketch of this direct encoding follows below) • Issues with LDPC codes • The size of G is very large. • G is generally not sparse. • Encoding complexity is therefore very high.
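As a minimal illustration of why direct encoding with G is costly, the sketch below computes c = sG over GF(2) as a purely combinational circuit. The module name, port names and the small K and N defaults are assumptions for illustration, not part of the presented design: every codeword bit needs an XOR reduction over up to k information bits, so a dense k × n matrix G makes the logic grow with k · n.

    // A sketch of direct systematic encoding c = s * G over GF(2): each
    // codeword bit is the XOR of the information bits selected by one
    // column of G. Module/port names and sizes are illustrative only.
    module direct_encoder #(parameter K = 8, parameter N = 16) (
        input  [K-1:0]   s,       // information bits
        input  [K*N-1:0] g_flat,  // generator matrix G, stored column-major
        output [N-1:0]   c        // codeword bits
    );
        genvar j;
        generate
            for (j = 0; j < N; j = j + 1) begin : cols
                // AND masks the information bits with column j of G,
                // then the reduction XOR adds them over GF(2).
                assign c[j] = ^(s & g_flat[j*K +: K]);
            end
        endgenerate
    endmodule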

  8. Structured LDPC Codes • Quasi-Cyclic LDPC Codes • In QC-LDPC, H can be partitioned into square sub-blocks of size z × z. • Each sub-block is either a z × z zero matrix or a cyclically shifted identity matrix, so multiplying a z-bit block by a non-zero sub-block is just a cyclic rotation (see the sketch below).
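A minimal sketch of this consequence of the QC structure, assuming a module name (cyclic_shifter), port names and Z = 96 that are not from the slides: a non-zero sub-block never has to be stored, because applying it to a z-bit data block reduces to a rotation.

    // Rotates a Z-bit block by 'shift' positions; this is all that a
    // non-zero z x z sub-block of a QC-LDPC matrix does to a data block.
    // (The rotation direction depends on the chosen circulant convention.)
    module cyclic_shifter #(parameter Z = 96) (
        input  [Z-1:0]         data_in,
        input  [$clog2(Z)-1:0] shift,
        output [Z-1:0]         data_out
    );
        // Duplicate the vector and take a Z-bit window starting at 'shift'.
        wire [2*Z-1:0] doubled = {data_in, data_in};
        assign data_out = doubled[shift +: Z];
    endmodule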

  9. Structured LDPC Codes • QC Codes With Dual-Diagonal Structure • In the IEEE standards, QC-LDPC codes have a dual-diagonal parity structure. • We take the 802.16e rate-1/2 matrix as an example; an entry of 0 represents the unshifted identity matrix.

  10. Outline • Introduction • Low-Density Parity-Check Codes • Related work • General encoding for LDPC codes • Efficient encoding for Dual-Diagonal matrix • Better Encoding scheme • LDPC Encoder Architecture • Parallel Encoder • Serial Encoder • Result • Conclusion

  11. General Encoding for LDPC Codes • Richardson and Urbanke (RU) algorithm • Partition the H matrix into several sub-matrices. • In H, the part T is a lower triangular matrix.

  12. General Encoding for LDPC Codes • Richardson and Urbanke (RU) algorithm • The parity parts p0 and p1 are computed with complexity O(n + g^2), where g is the gap of the approximate lower-triangular form of H.

  13. Efficient Encoding for Dual-Diagonal LDPC Codes • A valid codeword c = [s | p] must satisfy H · c^T = 0. • Split H into an information part Hs and a dual-diagonal parity part Hp, so that Hs · s^T + Hp · p^T = 0 (information bits s, parity bits p). • Define the lambda values as the z-bit blocks of Hs · s^T, i.e. λi = Σj Hs(i, j) · sj. • From this equation, the parity bits are obtained block by block through the dual-diagonal structure of Hp.

  14. Related Work (1) Sequential Encoding • Encoding scheme (one-way derivation) • Step 1: Compute the lambda values by the matrix operation x = Hs · s • Step 2: Determine the parity vector P0 by adding (XOR) all the lambda values (a sketch of this step follows below) • Step 3: The rest of the parity vector is obtained by exploiting the dual-diagonal matrix T
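A minimal sketch of Step 2 for the rate-1/2 base matrix, which has 12 block rows; the module name, port names and the flattened input format are assumptions. In GF(2), "adding all the lambda values" is simply a bitwise XOR reduction of the twelve z-bit blocks.

    // XOR-reduce the twelve z-bit lambda blocks to obtain the first
    // parity block P0 (rate-1/2 example; names are illustrative).
    module p0_from_lambdas #(parameter Z = 96) (
        input  [12*Z-1:0] lambda_flat,  // lambda_0 .. lambda_11, concatenated
        output [Z-1:0]    p0
    );
        wire [Z-1:0] partial [0:12];    // running XOR of the lambda blocks
        genvar i;
        assign partial[0] = {Z{1'b0}};
        generate
            for (i = 0; i < 12; i = i + 1) begin : xor_reduce
                assign partial[i+1] = partial[i] ^ lambda_flat[i*Z +: Z];
            end
        endgenerate
        assign p0 = partial[12];
    endmodule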

  15. Related Work (2) Arbitrary Bit-generation and Correction Encoding • In [1], an alternative encoding for the standard matrix was presented. • The matrix is modified in the parity portion of the weight-3 column set: the affected entries are replaced with zero cyclic shifts. • H can be partitioned into three sub-matrices: • The information bit region A • The parity bit region Q for the bit-flipping operation • The parity bit region U for non bit-flipping [1] C. Yoon, E. Choi, M. Cheong, and S.-K. Lee, "Arbitrary bit generation and correction technique for encoding QC-LDPC codes with dual-diagonal parity structure," IEEE Wireless Communications and Networking Conference (WCNC 2007), pp. 662-666, March 2007.

  16. Related Work (2) Arbitrary Bit-generation and Correction Encoding • Encoding scheme (one-way derivation) • Step 1: Compute the lambda values by the matrix operation x = As • Step 2: Set P0 to an arbitrary binary value and solve for the unknown parity bits • Step 3: Compute the correction vector f from P0 • Step 4: Add the correction vector to the parity bits in region Q to correct them

  17. Related Work (2) Arbitrary Bit-generation and Correction Encoding • Advantages • Low-complexity encoding • The number of additions required is less than in the RU scheme • Drawbacks • Not directly applicable to the standard code • Modifying the matrix degrades code performance

  18. Outline • Introduction • Low-Density Parity-Check Codes • Related work • General encoding for LDPC codes • Efficient encoding for Dual-Diagonal matrix • Better Encoding scheme • LDPC Encoder Architecture • Parallel Encoder • Serial Encoder • Result • Conclusion

  19. Better Encoding Scheme • Advantages of the encoding scheme proposed in [2] • Low-complexity encoding • Directly applicable to the matrices defined in the IEEE standards without any modification • Achieves a higher level of parallelism [2] C.-Y. Lin, C.-C. Wei, and M.-K. Ku, "Efficient Encoding for Dual-Diagonal Structured LDPC Code Based on Parity Bits Prediction and Correction," IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), pp. 1648-1651, Dec. 2008.

  20. Better Encoding Scheme • Step 1: Set P0’ to an arbitrary binary vector • Step 2: Compute the lambda values by the matrix operation Hs · s • Step 3: [Forward derivation] predict the parity blocks from the first half of the lambda values • Step 4: [Backward derivation] predict the parity blocks from the second half of the lambda values • Step 5: Compute P0 by adding the prediction parity vectors • Step 6: Compute the correction vector f • Step 7: Correct the predicted parity blocks by adding f

  21. Better Encoding Scheme • (Steps 1–7 as on the previous slide.) • The forward and backward derivations run in parallel (two-way derivation), which reduces the encoding delay. A behavioral sketch of the two-way prediction and correction follows below.
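The sketch below gathers the two-way derivation into one behavioral module for the rate-1/2 case (12 block rows). It assumes the arbitrary starting prediction P0’ is all-zero, and the module/port names, the flattened vector format, and the mapping of the corrected vectors onto the codeword's parity blocks are all assumptions for illustration; the forward/backward XOR chains mirror the acc_out expressions on slide 31, and the P0 = P5 ^ P6 and Pi = Pi’ ^ P0 relations follow slide 33.

    // Two-way prediction and correction (rate-1/2 sketch, names illustrative).
    module predict_and_correct #(parameter Z = 96) (
        input  [12*Z-1:0] lambda_flat,     // lambda_0 .. lambda_11
        output [Z-1:0]    p0,              // first parity block
        output [12*Z-1:0] corrected_flat   // corrected prediction vectors
    );
        wire [Z-1:0] lambda [0:11];
        wire [Z-1:0] pred   [0:11];        // prediction vectors Pi'
        genvar i;
        generate
            for (i = 0; i < 12; i = i + 1) begin : unpack
                assign lambda[i] = lambda_flat[i*Z +: Z];
            end
        endgenerate
        // Forward derivation over the lower half, backward over the upper
        // half (both chains can be evaluated in parallel).
        assign pred[0]  = lambda[0];
        assign pred[11] = lambda[11];
        generate
            for (i = 1; i < 6; i = i + 1) begin : fwd
                assign pred[i] = pred[i-1] ^ lambda[i];
            end
            for (i = 6; i < 11; i = i + 1) begin : bwd
                assign pred[i] = pred[i+1] ^ lambda[i];
            end
        endgenerate
        // The chains meet in the middle: P0 = P5' ^ P6', which equals the
        // XOR of all twelve lambda values.
        assign p0 = pred[5] ^ pred[6];
        // Correct the prediction vectors: Pi = Pi' ^ P0.
        generate
            for (i = 0; i < 12; i = i + 1) begin : corr
                assign corrected_flat[i*Z +: Z] = pred[i] ^ p0;
            end
        endgenerate
    endmodule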

  22. Outline • Introduction • Low-Density Parity-Check Codes • Related work • General encoding for LDPC codes • Efficient encoding for Dual-Diagonal matrix • Better Encoding scheme • LDPC Encoder Architecture • Parallel Encoder • Serial Encoder • Result • Conclusion

  23. LDPC Encoder Architecture • Based on the encoding scheme described before, we design both parallel and serial architectures. • Parallel architecture • Achieves a higher level of parallelism • High speed • Serial architecture

  24. Parallel architecture [Block diagram: matrix module, input data register, divider, barrel shifters #1–#6, lambda position, accumulator, prediction, correction, parity memory.]

  25. Parallel architecture (Stage 1) • In this stage, the matrix module selects the shift values and scales them according to the code length. • Benefits: • When input data arrive, this stage can start working immediately, without waiting for all of the input data. • It reduces the number of barrel shifters.

  26. Shift Value Computation • The base matrices are defined for the largest expansion factor z0 = 96; for other code lengths the shift values are scaled (a small sketch follows below): • Normal code rates: p(f, i, j) = floor(p(i, j) · zf / 96) • Code rate 2/3 A code: p(f, i, j) = p(i, j) mod zf • With these two rules, one matrix implementation supports multiple code rates and lengths.
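A small sketch of the scaling rules above, with an assumed module name and a plain 32-bit interface; entries of -1 (all-zero sub-blocks) are not scaled by the standard and are not handled here.

    // Scale a base-matrix shift value (defined for z0 = 96) to the
    // expansion factor zf of the selected code length.
    module shift_scaler (
        input  [31:0] p_base,        // base shift value, 0..95
        input  [31:0] zf,            // expansion factor for the code length
        input         is_rate_2_3A,  // 1 for the rate-2/3 A base matrix
        output [31:0] p_scaled
    );
        // Rate 2/3 A: modulo scaling; all other rates: floor(p * zf / 96).
        assign p_scaled = is_rate_2_3A ? (p_base % zf)
                                       : ((p_base * zf) / 96);
    endmodule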

  27. Parallel architecture (Stage 2) • The divider splits the data coming from the matrix module. • The input data register stores the incoming data; these data are then fed to the barrel shifters.

  28. Parallel architecture (Stage 3) • The barrel shifters cyclically shift the input data according to the shift values. • The lambda-position module records the row positions of the shift values (e.g. lambda positions 3, 8 and 11 in the example).

  29. Parallel architecture (Stage 4) • According to the lambda positions, in this clock cycle λ1, λ2, λ5, λ8, λ9 and λ11 need to be accumulated. • The lambda values are computed by accumulating the shifted data over Kb clock cycles (see the sketch below).
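A minimal sketch of one accumulator lane, with assumed module and port names: over the Kb clock cycles it XOR-accumulates the barrel-shifted data whenever the lambda-position information marks the current column as contributing to this block row, producing one lambda value.

    // One lambda accumulator lane (names and control signals illustrative).
    module lambda_accumulator #(parameter Z = 96) (
        input              clk,
        input              rst,           // clear at the start of a codeword
        input              valid,         // current column hits this block row
        input  [Z-1:0]     shifted_data,  // output of the barrel shifter
        output reg [Z-1:0] lambda
    );
        always @(posedge clk) begin
            if (rst)
                lambda <= {Z{1'b0}};
            else if (valid)
                lambda <= lambda ^ shifted_data;  // GF(2) accumulation
        end
    endmodule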

  30. Parallel architecture (Stage 5) • Compute the prediction vectors Pi’ by the equations shown on the next slide.

  31. Parallel architecture (Stage 5)
P_0 <= acc_out0;
P_1 <= acc_out0 ^ acc_out1;
P_2 <= acc_out0 ^ acc_out1 ^ acc_out2;
P_3 <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3;
P_4 <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3 ^ acc_out4;
P_5 <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3 ^ acc_out4 ^ acc_out5;
P_6 <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8 ^ acc_out7 ^ acc_out6;
P_7 <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8 ^ acc_out7;
P_8 <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8;
P_9 <= acc_out11 ^ acc_out10 ^ acc_out9;
P_10 <= acc_out11 ^ acc_out10;
P_11 <= acc_out11;
To save hardware area, one architecture computes the prediction values for all four code rates. In code rate 1/2, P_0 ~ P_11 are the prediction vectors; in code rate 2/3, P_0 ~ P_3 and P_8 ~ P_11 are the prediction vectors.

  32. Parallel architecture (Stage 5) • (Same prediction equations as on the previous slide.) • In code rate 3/4, P_0 ~ P_2 and P_9 ~ P_11 are the prediction vectors; in code rate 5/6, P_0 ~ P_1 and P_10 ~ P_11 are the prediction vectors. • To save hardware area, one architecture computes the prediction values for all four code rates.

  33. Parallel architecture (Stage 6) • Step 1: Compute P0. In code rate 1/2, P0 = P5 ^ P6. • Step 2: Correct the other Pi using the equation Pi = Pi’ ^ P0.

  34. Serial architecture (Stage 1) • Same as Stage 1 of the parallel architecture. • In the first Kb clock cycles, the encoding order runs from the top toward the middle and from the bottom toward the middle, column by column.

  35. Serial architecture (Stage 1) • In the last clock cycles, the encoding order runs from left to right, row by row. • Reasons: • Prepare the input data • Reduce the slice count

  36. Serial architecture (Stage 2) • The divider splits the data from the matrix module and chooses the corresponding input value for each barrel shifter (take clock cycle #2 for example).

  37. Serial architecture (Stage 3) • Shift the input data according to the shift value chosen from the multiplexer.

  38. Serial architecture (Stage 4) • This module performs three tasks: • Compute λi • Compute Pi’ • Compute P0 • Normally the module accumulates the shifted data to compute λi; when the data is the last value in the row, it also computes Pi’.

  39. Serial architecture (Stage 4) • When all Pi’ have been computed, compute P0 by XORing Px’ and Px+1’, the two middle prediction vectors in the matrix.

  40. Serial architecture (Stage 5) • Correct the other Pi using the equation Pi = Pi’ ^ P0.

  41. Outline • Introduction • Low-Density Parity-Check Codes • Related work • General encoding for LDPC codes • Efficient encoding for Dual-Diagonal matrix • Better Encoding scheme • LDPC Encoder Architecture • Parallel Encoder • Serial Encoder • Result • Conclusion

  42. Implementation Results • The proposed encoder for the IEEE 802.16e LDPC codes supports code rates 1/2, 2/3, 3/4 and 5/6 and code lengths ranging from 576 to 2304 bits. • The hardware implementation was performed and verified on Xilinx Virtex-4 and Altera Stratix Field Programmable Gate Array (FPGA) devices.

  43. Implementation Results • Parallel architecture • Information throughput ranges from 2.262 to 10.441 Gbps. • The encoder area is constant across all code rates and code lengths. • For a given code rate, increasing the code length increases the throughput.

  44. Implementation Results • Serial architecture • Information throughput ranges from 0.867 to 4.019 Gbps. • For a given code rate, increasing the code length increases the throughput.

  45. Implementation Results [Area comparison between the proposed design and a row-by-row parallel architecture.]

  46. Implementation Results [Information throughput (IT) comparison and IT/area comparison.]

  47. Comparison with Related Work • We compare our implementation with [3] (synthesis results of [3] at code rate 1/2). • Better throughput for longer code lengths • Less area is needed to support multiple code lengths and code rates • The clock period is shorter than in [3] [3] S. Kopparthi and D. M. Gruenbacher, "Implementation of a flexible encoder for structured low-density parity-check codes," IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PacRim 2007), pp. 438-441, Aug. 2007.

  48. Comparison with Related Work • The comparison of throughput • The proposed encoder outperforms the work in [3] in terms of throughput when the code length is longer than 1200. • The proposed encoder architecture provides better throughput for longer code lengths, while the work in [3] does not have this kind of speed-up.

  49. Comparison with Related Work • The comparison of throughput/area ratio • The proposed encoder outperforms the work in [3] in terms of throughput/area ratio by 1.216 to 3.757 times. • The proposed encoder utilizes hardware resources more efficiently.

  50. Comparison with Related Work • We compare our implementation with [2].
