Overview September 2004

FPGA Implementation of Reduced Bit Plane Motion Estimation Shrutisagar Chandrasekaran, Abbes Amira and Faycal Bensaali

Outline • Research Objectives • Introduction • Reduced Bit-Plane Motion Estimation • Proposed Architecture • FPGA Implementations and Results • Conclusions • Future Work and Acknowledgments

Research Objectives • To efficiently implement a reduced bit-plane motion estimation algorithm on FPGA using Handel-C for onboard video compression • To develop efficient low-power architectures for image processing techniques such as Motion Estimation (ME) • To evaluate and model the power consumption of FPGA-based designs at various levels of abstraction, and to develop and implement strategies for low-power, energy-efficient design

Introduction • Block Matching (BM) is a widely used Motion Estimation (ME) technique that calculates motion vectors by minimising a cost function • Optimal prediction is obtained when a Full Search (FS) algorithm is performed • The FS algorithm is computationally intensive and requires a large number of I/O pins and high bandwidth for real-time ME • An effective method for reducing the complexity of an ME architecture is to reduce the number of bit planes used for computing the motion vector

Introduction • Most of the motion information is in the 6th bit plane, and a significant amount of the motion information is also available in the 7th bit plane • The lower bit planes contain significantly less motion information, as they represent the smooth areas of the image • Reduced bit-plane methods for ME, using a range of arithmetic units and simple Boolean operations, lead to power- and area-efficient architectures
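To make the notion of a bit plane concrete, the following is a minimal sketch of extracting one plane from an 8-bit greyscale frame. It assumes pixels are stored as nested lists and that planes are numbered from 0 (least significant bit) to 7 (most significant bit); the slides do not state their numbering convention, and the function name `bit_plane` is hypothetical.

```python
def bit_plane(frame, k):
    """Return the k-th bit plane of an 8-bit greyscale frame as 0/1 values.

    Assumption: k = 0 is the least significant bit, k = 7 the most significant.
    """
    if not 0 <= k <= 7:
        raise ValueError("k must be in 0..7 for 8-bit pixels")
    return [[(pixel >> k) & 1 for pixel in row] for row in frame]


# Tiny 2x4 example frame (hypothetical pixel values).
frame = [
    [0, 64, 128, 255],
    [63, 65, 191, 192],
]
plane6 = bit_plane(frame, 6)  # each pixel is reduced to a single bit
```

Each pixel shrinks from 8 bits to 1 bit, which is where the savings in memory and I/O bandwidth come from.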

Reduced Bit-Plane ME Pseudo Code
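As a rough software illustration of full-search block matching on a single bit plane, here is a minimal sketch. It assumes the cost is the number of mismatching bits (an XOR followed by a count), which for 1-bit pixels equals the sum of absolute differences. The function names and the tie-breaking rule are assumptions, not taken from the slides.

```python
def mismatch_count(block, window, dy, dx):
    """Count differing bits between the block and the window region at (dy, dx)."""
    n = len(block)
    return sum(
        block[y][x] ^ window[dy + y][dx + x]
        for y in range(n)
        for x in range(n)
    )


def full_search(block, window):
    """Exhaustively test every candidate position and return (dy, dx, cost).

    block  : n x n list of 0/1 values from the current frame's bit plane
    window : m x m list of 0/1 values from the previous frame's bit plane (m >= n)
    Ties go to the first candidate in raster order (an assumption; the
    slides do not specify tie-breaking).
    """
    n, m = len(block), len(window)
    best = None
    for dy in range(m - n + 1):
        for dx in range(m - n + 1):
            cost = mismatch_count(block, window, dy, dx)
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best


# Usage: hide a 2x2 pattern inside a 4x4 window and recover its position.
window = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
block = [
    [1, 0],
    [1, 1],
]
result = full_search(block, window)  # (row offset, column offset, cost)
```

The returned offsets are measured from the top-left corner of the search window; converting them to a signed motion vector depends on where the block sits inside the window.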

Proposed Architecture

Proposed Architecture • The architecture exploits the massive parallelism available in hardware to reduce the computation time • The search window is stored on-chip in an array of 32-bit-wide registers, the width of each register being equal to the size of the search window • The block size is taken to be 16x16 pixels (1 bit per pixel), and the block is stored on-chip in an array of registers • Each Processing Sub-Unit (PSU) contains 256 Processor Elements (PEs), built from 256 XOR gates and 16 5-bit adders, which perform the block matching in parallel and estimate the SAD (Sum of Absolute Differences)
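The PSU arithmetic can be sketched in software as follows. For 1-bit pixels the absolute difference |a - b| equals a XOR b, so the SAD is the count of set bits after the XOR. The sketch assumes each 16-pixel row is packed into one integer and reads the "16 5-bit adders" as one adder per row (a row sum lies in 0..16, which needs 5 bits); that reading and the name `psu_sad` are assumptions.

```python
BLOCK = 16  # block is 16x16 at 1 bit per pixel


def psu_sad(block_rows, candidate_rows):
    """Binary SAD of one 16x16 candidate, modelled on one PSU.

    Each argument is a list of 16 integers, one 16-bit packed row each.
    """
    if len(block_rows) != BLOCK or len(candidate_rows) != BLOCK:
        raise ValueError("expected 16 packed rows")
    row_sums = []
    for b, c in zip(block_rows, candidate_rows):
        diff = (b ^ c) & 0xFFFF          # 16 XOR gates per row, 256 in total
        row_sum = bin(diff).count("1")   # one row adder: result in 0..16
        assert row_sum < 2 ** 5          # fits the assumed 5-bit adder width
        row_sums.append(row_sum)
    return sum(row_sums)                 # final accumulation: result in 0..256


# Usage: identical blocks cost 0, fully inverted blocks cost 256.
same = psu_sad([0xAAAA] * BLOCK, [0xAAAA] * BLOCK)
inverted = psu_sad([0x0000] * BLOCK, [0xFFFF] * BLOCK)
```

In hardware all 256 XORs and the row additions happen concurrently; the loop here only models the arithmetic, not the timing.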

Proposed Architecture • Two PSUs are used to cover the entire search window by means of bitwise shifts of the contents of the search window in the horizontal and vertical directions • The intermediate values of the motion vectors are stored in an on-chip array, with one location for each PSU • At the end, the global values of the motion vectors are obtained using the intermediate values and the output of the comparators
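The interplay of shifting, per-PSU intermediate results and the final comparison can be sketched as below. The slides do not say how candidates are divided between the two PSUs, so the interleaving of vertical offsets used here is purely hypothetical, as are the function names and the convention that bit 31 of each register holds the leftmost pixel. The PSUs run sequentially in this model, whereas the hardware runs them in parallel.

```python
BLOCK, WINDOW = 16, 32
OFFSETS = WINDOW - BLOCK + 1  # 17 candidate offsets per axis


def binary_sad(block_rows, window_rows, dy, dx):
    """XOR-and-count cost of the candidate whose top-left corner is (dy, dx)."""
    shift = WINDOW - BLOCK - dx  # assumes bit 31 of each register is the leftmost pixel
    return sum(
        bin(block_rows[y] ^ ((window_rows[dy + y] >> shift) & 0xFFFF)).count("1")
        for y in range(BLOCK)
    )


def search_with_two_psus(block_rows, window_rows, num_psus=2):
    """Return (dy, dx, cost) of the best match found by the PSUs together."""
    # One intermediate result per PSU, mirroring the on-chip array.
    intermediate = [None] * num_psus
    for dy in range(OFFSETS):
        psu = dy % num_psus  # hypothetical split: interleave vertical offsets
        for dx in range(OFFSETS):
            cost = binary_sad(block_rows, window_rows, dy, dx)
            if intermediate[psu] is None or cost < intermediate[psu][2]:
                intermediate[psu] = (dy, dx, cost)
    # Final comparator stage: pick the global minimum over the PSU results.
    return min((r for r in intermediate if r is not None), key=lambda r: r[2])


# Usage: place a known block at row 5, column 9 of an otherwise empty window.
block_rows = [0x8001 | (y << 4) for y in range(BLOCK)]
window_rows = [0] * WINDOW
for y in range(BLOCK):
    window_rows[5 + y] = block_rows[y] << (WINDOW - BLOCK - 9)
best = search_with_two_psus(block_rows, window_rows)
```

Keeping one running minimum per PSU means the only cross-PSU communication is the single comparison at the end.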

Proposed Architecture • The proposed architecture yields improved performance metrics when compared to other existing work [1, 2, 3]
[1] Y-H. Yeh and C-Y. Lee, IEEE Trans. VLSI Syst. 7, 345 (1999)
[2] T. Komarek and P. Pirsch, IEEE Trans. Circuits Syst. 36, 1301 (1989)
[3] C-H. Hsieh and T-P. Lin, IEEE Trans. Circuits Syst. Video Technol. 2, 169 (1992)

FPGA Implementations and Results • To verify the performance of the proposed architectures, designs have been prototyped on the Celoxica RC1000 board containing the Xilinx XCV2000E FPGA • Available on-chip logic resources include: Slices: 19,200; CLB array: 80 x 120; Block RAM: 655,360 bits; Distributed RAM: 614,400 bits • The RC1000 has 4 memory banks, which communicate with the host by means of DMA transfers

FPGA Implementations and Results Design Flow

FPGA Implementations and Results • Handel-C adds constructs to ANSI-C to enable DK to directly implement hardware • Fully synthesizable hardware programming language based on ANSI-C • Implements C algorithms directly in optimized FPGA hardware or outputs RTL from C • Handel-C additions for hardware: Parallelism, Timing, Interfaces, Clocks, Macro pre-processor, RAM/ROM, Shared expression, Communications, Handel-C libraries, FP library, Bit manipulation • Majority of ANSI-C constructs supported by DK: Control statements (if, switch, case, etc.), Integer arithmetic, Functions, Pointers, Basic types (structures, arrays, etc.), #define, #include • Software-only ANSI-C constructs: Recursion, Side effects, Standard libraries, Malloc

FPGA Implementations and Results

FPGA Implementations and Results • The bit-plane values from the current frame are sent from the host to the SRAM Bank 0, and those from the previous frame are sent as 16 bit values to the SRAM Bank 1 • The motion vectors are computed by the ME core and stored in the SRAM Bank 3 • The host application reads the motion vectors and generates the predicted image in real time
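The host-side step of generating the predicted image from the motion vectors can be sketched as follows. This is a generic motion-compensation sketch, not the authors' host application: the data layout, the function name `predict_frame` and the vector convention (a displacement into the previous frame) are all assumptions.

```python
def predict_frame(previous, vectors, block=16):
    """Build the predicted frame by copying displaced blocks from `previous`.

    previous : H x W list of pixel values, H and W multiples of `block`
    vectors  : dict mapping (block_row, block_col) -> (dy, dx) displacement
               pointing into `previous`; must keep the source block in bounds
    """
    height, width = len(previous), len(previous[0])
    predicted = [[0] * width for _ in range(height)]
    for (brow, bcol), (dy, dx) in vectors.items():
        top, left = brow * block, bcol * block
        for y in range(block):
            for x in range(block):
                predicted[top + y][left + x] = previous[top + y + dy][left + x + dx]
    return predicted


# Usage with a toy 4x4 frame and 2x2 blocks (hypothetical values).
previous = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
vectors = {(0, 0): (1, 1), (0, 1): (0, 0), (1, 0): (0, 0), (1, 1): (-1, -1)}
predicted = predict_frame(previous, vectors, block=2)
```

Although the vectors are estimated on a single bit plane, the prediction copies full-resolution pixels, so the host still needs the complete previous frame.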

FPGA Implementations and Results • The proposed architecture is area efficient, as the motion estimation is performed on a single bit plane, requiring compact logic and a greatly reduced on-chip memory size • The architecture is efficient and compact, and can be massively parallelised, as each PE contains only simple 1-bit XOR gates • Memory access is greatly reduced because only a single bit plane is used, saving a considerable amount of I/O power

FPGA Implementations and Results • This, along with architectural-level optimisations including parallelism and pipelining, yields a power-efficient implementation • Implementation is carried out on the Celoxica RC1000 board equipped with a Xilinx XCV2000E FPGA, and the design is also synthesised for the Xilinx QPro Virtex-II FPGA • Results in terms of power, area and maximum frequency show that using reduced bit planes instead of full-resolution images drastically reduces the FPGA resources used

FPGA Implementations and Results • Table: Various performance metrics of the RBFSBM algorithm implemented on the Virtex-E and the QPro Virtex-II FPGAs

Conclusions • A reduced bit-plane architecture for full search block matching has been proposed • The proposed architecture is low power, area efficient and suitable for VLSI/FPGA implementation • The developed architecture can be used for applications such as onboard video compression in space systems and video conferencing

Future Work and Acknowledgments • Develop a complete on-chip compression engine for real-time video compression, with applications ranging from onboard satellite compression to video conferencing • Explore the effect of algorithmic, architectural and RTL-level optimisations to minimise power consumption • Acknowledgments: Celoxica (Mr. Roger Gook) and EPSRC for supporting this work
