
Implementation and Optimization of SIFT on an OpenCL GPU

6.869-6.338 Final Project, 5/5/2010, Guy-Richard Kayombya.


Presentation Transcript


  1. 6.869-6.338 Final Project • 5/5/2010 • Guy-Richard Kayombya • Implementation and Optimization of SIFT on an OpenCL GPU

  2. Overview • Motivation • Quick Intro to OpenCL • Implementation • Results

  3. Motivation • Learn OpenCL • Adapt the SIFT algorithm to yet another parallel architecture • Maybe achieve some speedup

  4. Quick Intro to OpenCL • New Standard from Khronos for Heterogeneous Parallel Computing (v1.0 Released Dec 2008) • Initiated by Apple • Open and royalty free • Cross-Vendor and Cross-Platform • Make use of all available processing entities • CPUs, GPUs and other Processors • Scales from Embedded to HPC solutions

  5. Quick Intro to OpenCL(2) • One Host, Multiple Devices • Each Device has multiple Compute Units • Each Compute Unit has multiple Processing Elements • E.g.: GT200 has 30 Compute Units (streaming multiprocessors), each with 8 Processing Elements (scalar processors): 30 x 8 = 240 Processing Elements

  6. Quick Intro to OpenCL(3) • NDRange = size of the problem to solve, expressed as a 1D, 2D or 3D index space • Work-Group = block of work-items • Work-Item ~ lightweight thread
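
The relationship between NDRange, work-groups and work-items can be sketched on the CPU. The following is a minimal Python stand-in, not OpenCL code: `round_up` and `work_item_ids` are hypothetical helper names, the sizes (500 elements, work-groups of 64) are made up for illustration, and a zero global offset is assumed. It mirrors what `get_group_id(0)`, `get_local_id(0)` and `get_global_id(0)` would return for a 1D NDRange.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def work_item_ids(global_size, local_size):
    """Yield (group_id, local_id, global_id) for every work-item of a 1D NDRange.

    Assumes a zero global offset, so global_id = group_id * local_size + local_id.
    """
    assert global_size % local_size == 0, "global size must be a multiple of local size"
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            yield group_id, local_id, group_id * local_size + local_id


# Hypothetical problem: 500 elements covered by work-groups of 64 work-items.
n_elements = 500
local_size = 64
global_size = round_up(n_elements, local_size)  # pad the NDRange to a full work-group
ids = list(work_item_ids(global_size, local_size))

# Work-items whose global id falls past the data would have to return early.
active = [gid for (_, _, gid) in ids if gid < n_elements]
```

Padding the global size is a common host-side habit because OpenCL 1.0 requires the global size to be evenly divisible by the work-group size; the kernel then needs a bounds check on the global id.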

  7. Quick Intro to OpenCL(4) • Memory address spaces: • Global: per Device • Local: per Work-Group • Private: per Work-Item

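  8. Quick Intro to OpenCL(5) • Example kernel; each work-item adds the scalar b to one element of a (b is passed by value, so it carries no __global qualifier): __kernel void vec_inc(__global float *a, const float b) { int gid = get_global_id(0); a[gid] = a[gid] + b; }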

  9. Implementation (percent complete) • Abstraction Layer (85 %) • Gaussian/DoG Pyramids (100 %, semi-optimized) • Keypoint Detection (95 %, naive) • Keypoint Refinement (90 %, naive) • Orientation Assignment (10 %) • Descriptor Generation (0 %)

  10. Abstraction Layer • Problem: host-side device code is cumbersome • Requires dozens of repetitive lines to set up device contexts, kernels, buffers, etc. • Solution: OpenCL wrapper • Simplifies creation and management of hybrid Host/Client buffers and execution of kernels • Facilitates transition from serial to parallel execution • Handles Host/Client synchronization • Memory management issues still need to be fixed

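  11. Gaussian Pyramid • Separable convolution: the 2D Gaussian is applied as two 1D filters (rows, then columns) • Indirect (incremental) filtering to reduce kernel size: each level is obtained by blurring the previous level rather than the base image • sigma_diff = sqrt(sigma_dst^2 - sigma_src^2) • Uses convolutionSeperable() provided by Nvidia for efficient 2D separable convolution on the GPU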
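
The two ideas on this slide, incremental blur and separable filtering, can be sketched on the CPU as follows. This is a plain Python illustration, not the Nvidia routine: all function names are hypothetical, the kernel radius of ceil(3 * sigma) is an assumption, and borders are clamped, which may differ from what the GPU code does.

```python
import math


def sigma_diff(sigma_src, sigma_dst):
    """Blur to apply to an image already at sigma_src so that it reaches sigma_dst."""
    return math.sqrt(sigma_dst ** 2 - sigma_src ** 2)


def gaussian_kernel_1d(sigma, radius=None):
    """Normalized 1D Gaussian taps; radius defaults to ceil(3 * sigma) (an assumption)."""
    if radius is None:
        radius = int(math.ceil(3.0 * sigma))
    taps = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]


def convolve_rows(image, kernel):
    """Filter every row with `kernel`, clamping indices at the left/right borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [
            sum(k * row[min(max(x + j - radius, 0), width - 1)] for j, k in enumerate(kernel))
            for x in range(width)
        ]
        for row in image
    ]


def transpose(image):
    """Swap rows and columns of a list-of-lists image."""
    return [list(column) for column in zip(*image)]


def gaussian_blur(image, sigma):
    """Separable blur: one horizontal 1D pass followed by one vertical 1D pass."""
    kernel = gaussian_kernel_1d(sigma)
    horizontal = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(horizontal), kernel))
```

Under these assumptions, going from sigma 1.6 to sigma 2.0 needs an incremental blur of about 1.2, which takes a 9-tap kernel, whereas blurring from scratch at sigma 2.0 would take 13 taps.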

  12. Keypoint Detection • Each pixel is processed independently by one work-item • No state sharing • Worst case: 26 comparisons per work-item
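
The per-pixel test can be sketched as below. This is a CPU stand-in in Python rather than the actual kernel: the function names are hypothetical, `dog` is assumed to be a nested list indexed as `dog[scale][y][x]`, and strict inequalities are an assumption. Each call to `is_extremum` corresponds to the work done by one work-item.

```python
def is_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly greater than, or strictly smaller than,
    all 26 neighbours in the 3x3x3 block spanning the two adjacent scales."""
    center = dog[s][y][x]
    is_max = True
    is_min = True
    for ds in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if ds == 0 and dy == 0 and dx == 0:
                    continue  # skip the sample itself
                neighbour = dog[s + ds][y + dy][x + dx]
                if neighbour >= center:
                    is_max = False
                if neighbour <= center:
                    is_min = False
                if not (is_max or is_min):
                    return False  # early exit before visiting all 26 neighbours
    return True


def detect_extrema(dog):
    """Scan all interior samples; each (s, y, x) plays the role of one work-item."""
    scales, height, width = len(dog), len(dog[0]), len(dog[0][0])
    return [
        (s, y, x)
        for s in range(1, scales - 1)
        for y in range(1, height - 1)
        for x in range(1, width - 1)
        if is_extremum(dog, s, y, x)
    ]
```

Because no work-item writes anything another work-item reads, the loop in `detect_extrema` can be replaced by a 2D NDRange with one work-item per pixel.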

  13. Keypoint Refinement • Each keypoint is processed independently by one work-item • Kernel is a slightly modified version of the keypoint refinement Matlab MEX module by Vedaldi
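
For readers unfamiliar with the refinement step, here is a sketch of the usual SIFT approach: fit a quadratic to the DoG values around the detected sample and solve for the sub-sample offset. It is not Vedaldi's module or the kernel from this project; the function names are hypothetical, only a single step is shown (no iteration and no contrast or edge rejection), and `dog` is assumed to be indexed as `dog[scale][y][x]`.

```python
def solve_3x3(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Returns None if the matrix is numerically singular.
    """
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x


def refine_keypoint(dog, s, y, x):
    """One quadratic-fit step around the sample (s, y, x).

    Returns ((dx, dy, ds), value): the sub-sample offset and the interpolated
    DoG value at that offset, or None if the Hessian is singular.
    """
    def d(ds, dy, dx):
        return dog[s + ds][y + dy][x + dx]

    c = d(0, 0, 0)
    # Central-difference gradient, ordered (x, y, s).
    g = [
        0.5 * (d(0, 0, 1) - d(0, 0, -1)),
        0.5 * (d(0, 1, 0) - d(0, -1, 0)),
        0.5 * (d(1, 0, 0) - d(-1, 0, 0)),
    ]
    # Finite-difference Hessian.
    dxx = d(0, 0, 1) + d(0, 0, -1) - 2.0 * c
    dyy = d(0, 1, 0) + d(0, -1, 0) - 2.0 * c
    dss = d(1, 0, 0) + d(-1, 0, 0) - 2.0 * c
    dxy = 0.25 * (d(0, 1, 1) - d(0, 1, -1) - d(0, -1, 1) + d(0, -1, -1))
    dxs = 0.25 * (d(1, 0, 1) - d(1, 0, -1) - d(-1, 0, 1) + d(-1, 0, -1))
    dys = 0.25 * (d(1, 1, 0) - d(1, -1, 0) - d(-1, 1, 0) + d(-1, -1, 0))
    hessian = [[dxx, dxy, dxs], [dxy, dyy, dys], [dxs, dys, dss]]
    offset = solve_3x3(hessian, [-gi for gi in g])  # solve H * offset = -g
    if offset is None:
        return None
    value = c + 0.5 * sum(gi * oi for gi, oi in zip(g, offset))
    return tuple(offset), value
```

Since every keypoint only reads its own 3x3x3 neighbourhood, the function maps naturally to one work-item per keypoint, as the slide describes.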

  14. Preliminary Results (Time) • All measurements are performed on an input image of size 512x512. Percentages give speed relative to the Vedaldi Matlab CPU baseline (baseline time / measured time), so values above 100 % are faster than the baseline.
      Gaussian Filtering (sigma 4.1): • Vedaldi Matlab CPU = 0.19 s (100 %) • Naïve C++ CPU = 0.33 s (57 %) • GPU = 0.0094 s (2000 %) • GPU with data transfer = 0.0133 s (1400 %)
      Extrema Detection (octave 0 of pyramid): • Vedaldi Matlab CPU = 0.179 s (100 %) • GPU = 0.035725 s (500 %)
      Keypoint Refinement (octave 0 of pyramid): • Vedaldi Matlab CPU = 0.004 s (100 %) • GPU = 0.0689 s (6 %)
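
The percentages on this slide appear to be the baseline time divided by the measured time. The short Python check below recomputes them from the timings quoted above; `relative_speed_percent` is a hypothetical helper name, and the interpretation of the percentages is inferred from the numbers rather than stated in the slides.

```python
def relative_speed_percent(baseline_seconds, seconds):
    """Speed relative to the baseline, in percent (100 % means equal speed)."""
    return 100.0 * baseline_seconds / seconds


# Gaussian filtering timings in seconds, copied from the slide (512x512 input).
gaussian = {
    "Vedaldi Matlab CPU": 0.19,
    "Naive C++ CPU": 0.33,
    "GPU": 0.0094,
    "GPU with data transfer": 0.0133,
}

baseline = gaussian["Vedaldi Matlab CPU"]
for name, seconds in gaussian.items():
    print(f"{name}: {relative_speed_percent(baseline, seconds):.0f} %")
```

Applied to the keypoint refinement timings (0.004 s on the CPU, 0.0689 s on the GPU), the same formula gives roughly 6 %, meaning the naive GPU kernel is about 17 times slower than the baseline for that stage.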

  15. Preliminary Results (Performance) • Refined keypoints for octave 0 • Blue: Matlab implementation only • Red: OpenCL only • Green: common to both • 85 % correspondence
