Computation of Mutual Information Metric for Image Registration on Multiple GPUs


Presentation Transcript


  1. Computation of Mutual Information Metric for Image Registration on Multiple GPUs. Andrew V. Adinetz1, Markus Axer2, Marcel Huysegoms2, Stefan Köhnen2, Jiri Kraus3, Dirk Pleiter1. 26.08.2013. 1 JSC, Forschungszentrum Jülich; 2 INM-1, Forschungszentrum Jülich; 3 NVIDIA GmbH

  2. Outline: Brain Image Registration; Multi-GPU Implementation (system memory, list update); Performance Evaluation; Conclusion

  3. Preparation of the brain

  4. Pushing the limits for a cellular brain model • BigBrain: first high-resolution brain model at microscopical scale • 7404 histological sections stained for cell bodies • scanned with a flatbed scanner • original resolution 10 × 10 × 20 μm³ (11,000 × 13,000 pixels) • downscaled to 20 μm isotropic • removal of artifacts • about 1 Terabyte of data • in cooperation with Alan Evans, McGill, Montreal • Amunts et al. (2013) Science

  5. Image Registration The process of aligning images is called registration. ITK Workflow

  6. Mutual Information Metric i, j – pixel values (0..255); successful for multi-modal registration (see the reference formula below)
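The metric formula itself appears only as an image on the original slide; for reference, the standard definition of mutual information over the normalized joint histogram $p(i,j)$ of the fixed and moving images is:

$$ MI = \sum_{i,j} p(i,j)\,\log \frac{p(i,j)}{p_F(i)\,p_M(j)} $$

where $p_F(i) = \sum_j p(i,j)$ and $p_M(j) = \sum_i p(i,j)$ are the marginal intensity distributions.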

  7. Two Image Cross-Histogram

for (int y = 0; y < fixed_sz_y; y++)
    for (int x = 0; x < fixed_sz_x; x++) {
        int i = bin(fixed[y * fixed_sz_x + x]);    // bin of the fixed pixel
        float x1 = transform_x(x, y);              // map (x, y) into the moving image
        float y1 = transform_y(x, y);
        int j = bin(interpolate(moving, x1, y1));  // bin of the moving pixel
        histogram[i * NUM_BINS + j]++;             // atomic on GPU
    }

This is the main computational kernel; the transform can be complex (1000+ parameters). GPU implementation: 1 pixel/thread, atomics (a CUDA sketch follows).
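A minimal CUDA sketch of the 1-pixel-per-thread variant, assuming a 256-bin histogram of 8-bit pixels; bin(), the transform (reduced here to a toy translation) and the nearest-neighbor interpolate() are simplified stand-ins, not the authors' code:

#include <cuda_runtime.h>

#define NUM_BINS 256

__device__ int bin(unsigned char v) { return v; }   // 256 bins, one per value

__device__ unsigned char interpolate(const unsigned char *moving,
                                     int sz_x, int sz_y, float x, float y)
{
    int xi = (int)(x + 0.5f), yi = (int)(y + 0.5f);  // nearest neighbor
    if (xi < 0 || yi < 0 || xi >= sz_x || yi >= sz_y) return 0;
    return moving[yi * sz_x + xi];
}

__global__ void cross_histogram(const unsigned char *fixed_img,
                                const unsigned char *moving,
                                unsigned int *histogram,
                                int fixed_sz_x, int fixed_sz_y,
                                int moving_sz_x, int moving_sz_y,
                                float tx, float ty)  // toy translation transform
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= fixed_sz_x || y >= fixed_sz_y) return;

    int i = bin(fixed_img[y * fixed_sz_x + x]);
    float x1 = x + tx, y1 = y + ty;                  // transform_x / transform_y
    int j = bin(interpolate(moving, moving_sz_x, moving_sz_y, x1, y1));
    atomicAdd(&histogram[i * NUM_BINS + j], 1u);     // joint histogram update
}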

  8. Large Data Size
Large-area Polarimeter: size 3,000 × 3,000 px; pixel size 60 × 60 μm; file size 30 MB
Polarizing Microscope: size 100,000 × 100,000 px; pixel size 1.6 × 1.6 μm; file size 40 GB

  9. Multi-GPU Mutual Information Domain decomposition: distribute fixed and moving images; histogram contributions are summed up (see the sketch below). Moving image: how to handle the irregular access pattern? Approaches: system memory replication (sysmem) and list update (listupdate)
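A hedged sketch of summing up the per-GPU histogram contributions, assuming one MPI rank per GPU and a 256 × 256 joint histogram already copied back from the GPU; function and variable names are illustrative:

#include <mpi.h>

#define NUM_BINS 256

void reduce_histogram(const unsigned int *local_hist, unsigned int *global_hist)
{
    // element-wise sum of all ranks' contributions; the result lands on rank 0
    MPI_Reduce(local_hist, global_hist, NUM_BINS * NUM_BINS,
               MPI_UNSIGNED, MPI_SUM, 0, MPI_COMM_WORLD);
}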

  10. System Memory Replication
Replicate the entire moving image in pinned host RAM, accessible to the GPU (see the sketch below).
+ easy to implement
– system memory accesses are slower
– cannot use texture interpolation
Optimizations: moving image halo in GPU RAM
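A minimal sketch of the sysmem approach, assuming the moving image is a flat 8-bit buffer: the image is allocated in pinned, mapped host memory so kernels can read it directly over PCI-E (which also rules out texture interpolation); the function name is hypothetical:

#include <cuda_runtime.h>

unsigned char *alloc_mapped_moving(size_t moving_sz_x, size_t moving_sz_y,
                                   unsigned char **dev_ptr)
{
    unsigned char *host_ptr;
    size_t bytes = moving_sz_x * moving_sz_y;
    cudaSetDeviceFlags(cudaDeviceMapHost);            // enable mapped host memory
    cudaHostAlloc((void **)&host_ptr, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)dev_ptr, host_ptr, 0);
    return host_ptr;   // fill with image data; *dev_ptr is usable in kernels
}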

  11. Listupdate

typedef struct {
    float movingCoords[2];  // target coordinates in the moving image
    short destRank;         // rank owning that part of the moving image
    char  fixedBin;         // histogram bin of the fixed pixel
} message_t;

Processing: buffer remote accesses; exchange buffers; compute contributions remotely.
+ computation-communication overlap
– hard to implement
– chunk processing (or it won't fit into the buffer)
Optimizations: buffers: AoS vs. SoA (see the SoA sketch below), atomics vs. grouping; using multiple streams
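A hedged sketch of the SoA buffer layout compared against the AoS message_t above (see slide 22); the field names mirror message_t but are assumptions:

typedef struct {
    float *movingCoordsX;   // one array per field instead of one array of
    float *movingCoordsY;   // message_t records, so that each field is
    short *destRank;        // written with coalesced accesses
    char  *fixedBin;
    unsigned int count;     // filled via atomicAdd or via grouping
} message_buffer_soa_t;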

  12. Chunk Processing and Overlap
[Figure: the fixed image is split into chunks; each chunk goes through Process chunk → Group → Exchange → Handle messages, and chunk 2 is processed while chunk 1's buffers are exchanged and its messages handled; a streams sketch follows]
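A hedged sketch of the chunk pipeline with CUDA streams (4 streams looked best, see slide 21). process_chunk stands in for the real kernel that fills a message buffer for one chunk; all names here are assumptions, chunks[] are device pointers and pinned_bufs[] are pinned host buffers:

#include <cuda_runtime.h>

#define NUM_STREAMS 4

__global__ void process_chunk(const unsigned char *chunk, size_t n,
                              unsigned char *msg_buf)
{
    size_t t = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) msg_buf[t] = chunk[t];   // placeholder for real message creation
}

void run_pipeline(unsigned char **chunks, unsigned char **dev_bufs,
                  unsigned char **pinned_bufs, size_t *sizes, int num_chunks)
{
    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; s++) cudaStreamCreate(&streams[s]);

    for (int c = 0; c < num_chunks; c++) {
        cudaStream_t st = streams[c % NUM_STREAMS];
        size_t n = sizes[c];
        process_chunk<<<(unsigned)((n + 255) / 256), 256, 0, st>>>(
            chunks[c], n, dev_bufs[c]);
        // stage this chunk's messages in pinned host memory for the exchange
        cudaMemcpyAsync(pinned_bufs[c], dev_bufs[c], n,
                        cudaMemcpyDeviceToHost, st);
        // once the copy completes, exchange pinned_bufs[c] with remote ranks
        // and handle incoming messages while later chunks run in other streams
    }
    for (int s = 0; s < NUM_STREAMS; s++) cudaStreamSynchronize(streams[s]);
}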

  13. Buffer Writeout: Atomics vs. Group
atomics: each writing thread increments an atomic counter
+ simpler
– atomics can be a bottleneck
– one buffer per receiver required
grouping: each thread writes to a fixed location; buffers are grouped before sending (see the kernel sketch below)
+ single buffer, less memory
+ optimized grouping (shared-memory atomics, prefix sum)
– more complicated (separate kernel required)
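A hedged sketch of the grouping writeout, assuming NUM_RANKS possible receivers and blockDim.x >= NUM_RANKS; shared-memory atomics count per-destination messages within a block, so only one global atomic per (block, rank) is needed instead of one per message:

#include <cuda_runtime.h>

#define NUM_RANKS 16

__global__ void group_slots(const short *destRank, int n,
                            unsigned int *globalCounts,  // per-rank totals
                            unsigned int *slot)          // per-message slot
{
    __shared__ unsigned int blockCounts[NUM_RANKS];
    __shared__ unsigned int blockBase[NUM_RANKS];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x < NUM_RANKS) blockCounts[threadIdx.x] = 0;
    __syncthreads();

    // 1) count within the block using cheap shared-memory atomics
    unsigned int local = 0;
    short r = 0;
    if (tid < n) {
        r = destRank[tid];
        local = atomicAdd(&blockCounts[r], 1u);
    }
    __syncthreads();

    // 2) one global atomic per (block, rank) reserves a contiguous range
    if (threadIdx.x < NUM_RANKS)
        blockBase[threadIdx.x] = atomicAdd(&globalCounts[threadIdx.x],
                                           blockCounts[threadIdx.x]);
    __syncthreads();

    // 3) final position inside rank r's region; a prefix sum over
    // globalCounts then gives each rank's base in the single send buffer
    if (tid < n) slot[tid] = blockBase[r] + local;
}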

  14. Benchmark Setup
[Figure: a fixed image with a mask; masked pixels map outside the local part of the moving image, forcing remote accesses]

  15. Test Hardware
JUDGE: 256-node GPU cluster; each M2070 node has 2× M2070 (Fermi) GPUs with 6 GB RAM each and a 12-core X5650 CPU @ 2.67 GHz with 96 GB RAM
JuHydra: single-node Kepler machine with 2× K20X (Kepler) GPUs with 6 GB RAM each and a 16-core E5-2650 CPU @ 2 GHz with 64 GB RAM

  16. Baseline: Full Replication (M2070) ideal scalability

  17. Sysmem on Fermi

  18. Sysmem on Fermi: Explanation
[Figure: four access regimes across the fixed image: no sysmem accesses with good coalescing; few sysmem accesses with bad coalescing; many sysmem accesses with bad coalescing; most sysmem accesses with good coalescing]

  19. Sysmem on Fermi: PCI-E Queries

  20. Sysmem: Halo Sizes: halo size makes a mostly quantitative, not qualitative, difference

  21. Listupdate: Multiple Streams: 4 streams perform best

  22. Listupdate: AoS vs. SoA, Atomics vs. Group: SoA + atomics looks best

  23. Sysmem vs. Listupdate: Fermi on Fermi, sysmem is better

  24. Sysmem vs. Listupdate: Kepler (Closeup) on Kepler, listupdate is better

  25. Conclusions
Fermi: performance limited by atomics; system memory replication is better
Kepler: order of magnitude faster than Fermi; no longer dominated by atomics; listupdate (atomics, SoA, 4 streams) is better
Future work: compression; trials on real images

  26. Questions?
INM-1 at FZJ: http://www.fz-juelich.de/inm/inm-1/EN/Home/home_node.html
NVIDIA Application Lab at FZJ: http://www.fz-juelich.de/ias/jsc/nvlab
Andrew V. Adinetz: adinetz@gmail.com
Jiri Kraus: jkraus@nvidia.com
Dirk Pleiter: d.pleiter@fz-juelich.de
