
Multimedia Content Analysis on Clusters and Grids



  1. Multimedia Content Analysis on Clusters and Grids Frank J. Seinstra (fjseins@cs.vu.nl) Vrije Universiteit Faculty of Sciences Amsterdam

  2. Overview (1) • Part 1: What is Multimedia Content Analysis (MMCA) ? • Part 2: Why parallel computing in MMCA - and how? • Part 3: Software Platform: Parallel-Horus • Part 4: Example – Parallel Image Processing on Clusters

  3. Overview (2) • Part 5: Grids and their specific problems • Part 6: Towards a Software Platform for MMCA on Grids • Part 7: Large-scale MMCA applications on Grids • Part 8: Future research directions (new projects?)

  4. Introduction: A Few Realistic Problem Scenarios…

  5. A Real Problem… • News Broadcast – September 21, 2005: • Police investigation: over 80,000 CCTV recordings (by hand) • First match found only 2.5 months after the attacks • Automatic analysis?

  6. Another Real Problem… • Web Video Search: • Search based on video content • User annotations are known to be notoriously bad (e.g. YouTube) • (example query: “Hillary Clinton candidate”)

  7. Are these problems realistic? • NFI (Netherlands Forensic Institute, Den Haag): Surveillance Camera Analysis / Crime Scene Reconstruction • Beeld&Geluid (Dutch Institute for Sound and Vision, Hilversum): Interactive access to Dutch national TV history

  8. Part 1: What is Multimedia Content Analysis (MMCA)?

  9. Multimedia • Multimedia = Text + Sound + Image + Video + … • Video = image + image + image + … • In many (not all) multimedia applications: • calculations are executed on each separate video frame independently • So: we focus on Image Processing (+ Computer Vision)

  10. What is a Digital Image? • An image is a continuous function that has been discretized in spatial coordinates, brightness and color frequencies • Most often: 2-D with ‘pixels’ as scalar or vector value • However: • Image dimensionality can range from 1-D to n-D • Example (medical imaging): 5-D = x, y, z, time, emission wavelength • Pixel dimensionality can range from 1-D to n-D • Generally: 1D = binary/grayscale; 3D = color (e.g. RGB) • n-D = hyper-spectral (e.g. remote sensing by satellites; every 10-20nm)
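
As a concrete illustration of these definitions, the minimal C++ sketch below models a 2-D image whose pixel type is a template parameter, so the same structure can hold scalar (grayscale) or vector-valued (RGB) pixels. The type and member names are illustrative assumptions, not Horus code.

   #include <array>
   #include <vector>

   // Illustrative sketch only: a 2-D image discretized into pixels, where the
   // pixel type determines the pixel dimensionality (scalar vs. RGB vector).
   template<typename PixelT>
   struct Image2D {
       int width;
       int height;
       std::vector<PixelT> pixels;          // row-major: pixels[y * width + x]

       Image2D(int w, int h) : width(w), height(h), pixels(w * h) {}
       PixelT&       at(int x, int y)       { return pixels[y * width + x]; }
       const PixelT& at(int x, int y) const { return pixels[y * width + x]; }
   };

   using GrayImage = Image2D<float>;                  // 1-D pixel value
   using RgbImage  = Image2D<std::array<float, 3>>;   // 3-D pixel value (RGB)

   int main()
   {
       GrayImage g(640, 480);
       RgbImage  c(640, 480);
       g.at(10, 20) = 0.5f;
       c.at(10, 20) = std::array<float, 3>{ 1.0f, 0.0f, 0.0f };  // a red pixel
       return 0;
   }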

  11. Complete A-Z Multimedia Applications (pipeline overview; in: image, out: ‘meaning’ful result) • Low level operations: Image ---> (sub-)Image; Image ---> Scalar / Vector Value; Image ---> Array of S/V Values • Intermediate level operations: Value ---> Feature Vector; Feature Vector(s) ---> Similarity Matrix ---> Ranked set of images; Feature Vector(s) ---> Clustering (K-Means, …) • High level operations: Feature Vector(s) ---> Classification (SVM, …) ---> Recognized object • Supporting libraries: Impala / (Parallel-)Horus

  12. MMCA: software library design issues • Two major implementation problems: • 1. The number of commonly applied operations is very large • 2. The number of data types is large • The resulting combinatorial explosion has caused many software projects to be discontinued • Existing library (Horus): • Sets of operations with similar behavior & data accesses are implemented as a single generic algorithm • So, we have a small number of algorithmic patterns • Each is used by instantiating it with the proper parameters, incl. the operation to be applied to the individual pixels (see the sketch below)
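
A minimal C++ sketch of this pattern-based design: one generic traversal, instantiated with the per-pixel operation. The names genericUnaryPixOp and AbsValue are illustrative assumptions, not the actual Horus API.

   #include <cstddef>
   #include <vector>

   // Illustrative sketch (assumed names, not the actual Horus API): one generic
   // 'unary pixel operation' pattern, instantiated with the operation that is
   // applied to each individual pixel.
   template<typename PixelT, typename UnaryOp>
   void genericUnaryPixOp(std::vector<PixelT>& dst,
                          const std::vector<PixelT>& src,
                          UnaryOp op)
   {
       dst.resize(src.size());
       for (std::size_t i = 0; i < src.size(); ++i)
           dst[i] = op(src[i]);             // same traversal for every unary op
   }

   struct AbsValue {                        // one possible instantiation
       int operator()(int v) const { return v < 0 ? -v : v; }
   };

   int main()
   {
       std::vector<int> in  = { -3, 5, -7, 2 };
       std::vector<int> out;
       genericUnaryPixOp(out, in, AbsValue());
       return 0;
   }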

  13. Low Level Image Processing ‘Patterns’ (1) • Unary Pixel Operation (example: absolute value) • Binary Pixel Operation (example: addition) • N-ary Pixel Operation… • Template / Kernel / Filter / Neighborhood Operation (example: Gauss filter)

  14. Low Level Image Processing ‘Patterns’ (2) • Reduction Operation (example: sum) • N-Reduction Operation (example: histogram) • Geometric Transformation with transformation matrix M (example: rotation)

  15. Example Application: Template Matching • (figure: template, input image, result image)

  16. Example Application: Template Matching

   for all images {
       inputIm = readFile ( … );
       unaryPixOpI ( sqrdInIm, inputIm, “set” );
       binaryPixOpI ( sqrdInIm, inputIm, “mul” );
       for all symbol images {
           symbol = readFile ( … );
           weight = readFile ( … );
           unaryPixOpI ( filtIm1, sqrdInIm, “set” );
           unaryPixOpI ( filtIm2, inputIm, “set” );
           genNeighborhoodOp ( filtIm1, borderMirror, weight, “mul”, “sum” );
           binaryPixOpI ( symbol, weight, “mul” );
           genNeighborhoodOp ( filtIm2, borderMirror, symbol, “mul”, “sum” );
           binaryPixOpI ( filtIm1, filtIm2, “sub” );
           binaryPixOpI ( maxIm, filtIm1, “max” );
       }
       writeFile ( …, maxIm, … );
   }

   See: http://www.science.uva.nl/~fjseins/ParHorusCode/

  17. Part 2: Why Parallel Computing in MMCA (and how)?

  18. The ‘Need for Speed’ in MMCA Research & Applications • Growing interest in international ‘benchmark evaluations’ • Task: find ‘semantic concepts’ automatically • PASCAL VOC Challenge (10,000++ images) • NIST TRECVID (200+ hours of video) • A problem of scale: • At least 30-50 hours of processing time per hour of video • Beeld&Geluid: 20,000 hours of TV broadcasts per year • NASA: over 1 TB of hyper-spectral image data per day • London Underground: over 120,000 years of processing…!!!

  19. High-Performance Computing • Question: What type of high-performance hardware is most suitable? (General Purpose CPUs, GPUs, Accelerators, Clusters, Grids) • Solution: Parallel & distributed computing at a very large scale • Our initial choice: Clusters of general purpose CPUs (e.g. the DAS cluster) • For many pragmatic reasons…

  20. User Transparent Parallelization Tools • But… how to let ‘non-experts’ program clusters easily & efficiently? • Parallelization tools: compilers, languages, parallelization libraries (general purpose & domain specific) • Spectrum, trading programmer effort against efficiency: Message Passing Libraries (e.g., MPI, PVM) • Shared Memory Specifications (e.g., OpenMP) • Parallel Languages (e.g., Occam, Orca) • Extended High Level Languages (e.g., HPF) • Parallel Image Processing Languages (e.g., Apply, IAL) • Automatic Parallelizing Compilers • Parallel Image Processing Libraries

  21. Existing Parallel Image Processing Libraries • Suffer from many problems: • No ‘familiar’ programming model: identifying parallelism still the responsibility of the programmer (e.g. data partitioning [Taniguchi97], loop parallelism [Niculescu02, Olk95]) • Reduced maintainability / portability: multiple implementations for each operation [Jamieson94]; restricted to a particular machine [Moore97, Webb93] • Non-optimal efficiency of parallel execution: ignore machine characteristics for optimization [Juhasz98, Lee97]; ignore optimization across library calls [all] • Reference: Bräunl et al. (2001)

  22. Our Approach • Sustainable library-based software architecture for user-transparent parallel image processing • (1) Sustainability: • Maintainability, extensibility, portability (i.e. from existing sequential ‘Horus’ library) • Applicability to commodity clusters • (2) User transparency: • Strictly sequential API (identical to Horus) • Intra-operation efficiency & inter-operation efficiency (2003)

  23. Part 3: Software Platform: Parallel-Horus

  24. What Type(s) of Parallelism to Support? • Data parallelism: • “exploitation of concurrency that derives from the application of the same operation to multiple elements of a data structure” [Foster, 1995] • Task parallelism: • “a model of parallel computing in which many different operations may be executed concurrently” [Wilson, 1995]

  25. Why Data Parallelism (only)? (template matching code from slide 16 shown again as illustration) • Natural approach for low level image processing • Scalability (in general: #pixels >> #different tasks) • Load balancing is easy • Finding independent tasks automatically is hard • In other words: it’s just the best starting point… (but not necessarily optimal at all times)

  26. Many Low Level Imaging Algorithms are Embarrassingly Parallel • On 2 CPUs:

   Parallel Operation on Image {
       Scatter Image                           (1)
       Sequential Operation on Partial Image   (2)
       Gather Result Data                      (3)
   }

   • Works (with minor issues) for unary, binary, n-ary operations & (n-)reduction operations
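
To make the scatter / compute / gather structure concrete, here is a minimal C++/MPI sketch. It is not Parallel-Horus code; the image size, pixel type, and the absolute-value operation are illustrative assumptions, and the image size is assumed to be divisible by the number of processes.

   #include <mpi.h>
   #include <vector>

   // Minimal sketch: a unary pixel operation executed data-parallel via
   // scatter / sequential operation on partial image / gather.
   int main(int argc, char** argv)
   {
       MPI_Init(&argc, &argv);
       int rank, nprocs;
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

       const int width = 512, height = 512;
       const int total = width * height;
       const int chunk = total / nprocs;        // pixels per process (assumed exact)

       std::vector<float> image;                // only meaningful on the root
       if (rank == 0) image.assign(total, -1.0f);
       std::vector<float> part(chunk);

       // (1) Scatter image
       MPI_Scatter(rank == 0 ? image.data() : nullptr, chunk, MPI_FLOAT,
                   part.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

       // (2) Sequential operation on partial image (example: absolute value)
       for (int i = 0; i < chunk; ++i)
           part[i] = part[i] < 0 ? -part[i] : part[i];

       // (3) Gather result data
       MPI_Gather(part.data(), chunk, MPI_FLOAT,
                  rank == 0 ? image.data() : nullptr, chunk, MPI_FLOAT,
                  0, MPI_COMM_WORLD);

       MPI_Finalize();
       return 0;
   }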

  27. Other Imaging Algorithms Are Only Marginally More Complex (1) • On 2 CPUs (scatter / gather not shown in the figure):

   Parallel Filter Operation on Image {
       Scatter Image                            (1)
       Allocate Scratch                         (2)
       Copy Image into Scratch                  (3)
       Handle / Communicate Borders             (4)
       Sequential Filter Operation on Scratch   (5)
       Gather Image                             (6)
   }

   • Also possible: ‘overlapping’ scatter • But not very useful in iterative filtering
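
A minimal C++/MPI sketch of step (4), the border (halo) exchange, assuming a row-wise image partitioning with one extra halo row above and below the local data. The function name and memory layout are illustrative assumptions, not Parallel-Horus code.

   #include <mpi.h>
   #include <vector>

   // Exchange one-row borders with the upper and lower neighbours so that a
   // 3x3 filter can run on the local scratch buffer without missing data.
   void exchangeBorders(std::vector<float>& scratch, int width, int localRows,
                        int rank, int nprocs)
   {
       const int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
       const int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

       // scratch holds (localRows + 2) rows: row 0 and row localRows+1 are halos
       float* firstOwned = scratch.data() + width;               // row 1
       float* lastOwned  = scratch.data() + localRows * width;   // row localRows
       float* topHalo    = scratch.data();                       // row 0
       float* botHalo    = scratch.data() + (localRows + 1) * width;

       // send my first owned row up, receive my bottom halo from below
       MPI_Sendrecv(firstOwned, width, MPI_FLOAT, up,   0,
                    botHalo,    width, MPI_FLOAT, down, 0,
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE);
       // send my last owned row down, receive my top halo from above
       MPI_Sendrecv(lastOwned,  width, MPI_FLOAT, down, 1,
                    topHalo,    width, MPI_FLOAT, up,   1,
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE);
   }

   int main(int argc, char** argv)
   {
       MPI_Init(&argc, &argv);
       int rank, nprocs;
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
       const int width = 8, localRows = 4;
       std::vector<float> scratch((localRows + 2) * width, float(rank));
       exchangeBorders(scratch, width, localRows, rank, nprocs);
       MPI_Finalize();
       return 0;
   }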

  28. Other Imaging Algorithms Are Only Marginally More Complex (2) • On 2 CPUs (rotation; broadcast / gather not shown in the figure):

   Parallel Geometric Transformation on Image {
       Broadcast Image                          (1)
       Create Partial Image                     (2)
       Sequential Transform on Partial Image    (3)
       Gather Result Image                      (4)
   }

   • Potentially faster implementations exist for special cases

  29. More Challenging: Separable Recursive Filtering (2 x 1-D) • Separable filters (1 x 2-D becomes 2 x 1-D, one pass followed by the other): drastically reduces sequential computation time • Recursive filtering: result of each filter step (a pixel value) stored back into input image • So: a recursive filter re-uses (part of) its output as input • (figure: 2-D Template / Kernel / Filter / Neighborhood Operation, example: Gauss filter, decomposed into two 1-D passes)
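
A tiny C++ sketch of what “re-uses its output as input” means: a first-order recursive filter applied in place along one image row. This is an illustrative smoothing filter, not the anisotropic Gaussian filters used later in the talk.

   #include <cstdio>
   #include <vector>

   // y[i] = a*x[i] + (1-a)*y[i-1]: each result depends on the previously
   // written result, which is the dependence that complicates parallelization.
   void recursiveFilterRow(std::vector<float>& row, float a)
   {
       for (std::size_t i = 1; i < row.size(); ++i)
           row[i] = a * row[i] + (1.0f - a) * row[i - 1];   // reads own output
   }

   int main()
   {
       std::vector<float> row = { 0, 0, 10, 0, 0, 0 };
       recursiveFilterRow(row, 0.5f);
       for (float v : row) std::printf("%.3f ", v);
       std::printf("\n");
       return 0;
   }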

  30. Parallel Recursive Filtering: Solution 1 • Steps: (SCATTER) (FILTER X-dir) (TRANSPOSE) (FILTER Y-dir) (GATHER) • Drawback: transpose operation is very expensive (esp. when nr. CPUs is large)

  31. Parallel Recursive Filtering: Solution 2 • Loop-carried dependence at final stage (sub-image level): minimal communication overhead, but full serialization • Loop-carried dependence at innermost stage (pixel-column level): high communication overhead, fine-grained wave-front parallelism • Tiled loop-carried dependence at intermediate stage (image-tile level): moderate communication overhead, coarse-grained wave-front parallelism

  32. Parallel Recursive Filtering: Wave-front Parallelism • Drawbacks: • partial serialization • non-optimal use of available CPUs • (figure: wave-front execution across Processors 0-3)

  33. Parallel Recursive Filtering: Solution 3 • Multipartitioning: • Skewed cyclic block partitioning • Each CPU owns at least one tile in each of the distributed dimensions • All neighboring tiles in a particular direction are owned by the same CPU • (figure: tile assignment for Processors 0-3)

  34. Multipartitioning • Full Parallelism: • First in one direction… • And then in the other… • Border exchange at end of each sweep • Communication at end of sweep always with the same node (see the ownership sketch below) • (figure: sweeps across the tiles of Processors 0-3)
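
A small C++ sketch of why the end-of-sweep communication partner never changes. The diagonal assignment formula owner(i, j) = (i + j) mod P is an illustrative assumption of what a skewed cyclic block partitioning can look like, not code taken from Parallel-Horus.

   #include <cstdio>

   // With P CPUs and a P x P tile grid, this assignment gives every CPU one
   // tile per tile-row and per tile-column, and the right-hand neighbour of
   // every tile owned by CPU p is owned by CPU (p + 1) mod P, so the border
   // exchange at the end of a sweep is always with the same node.
   int owner(int tileRow, int tileCol, int P) { return (tileRow + tileCol) % P; }

   int main()
   {
       const int P = 4;
       for (int i = 0; i < P; ++i) {            // print the tile-ownership map
           for (int j = 0; j < P; ++j)
               std::printf("P%d ", owner(i, j, P));
           std::printf("\n");
       }
       std::printf("owner(1,2)=%d, right neighbour owner(1,3)=%d\n",
                   owner(1, 2, P), owner(1, 3, P));
       return 0;
   }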

  35. Parallel-Horus: Parallelizable Patterns • Minimal intrusion • Much of the original sequential Horus library has been left intact • Parallelization localized in the code • Easy to implement extensions • (architecture layers: Horus Sequential API, Parallelizable Patterns, Parallel Extensions, MPI)

  36. Parallel-Horus: Pattern Implementations (old vs. new)

   Old (sequential only):

   template<class …, class …, class …>
   inline DstArrayT* CxPatUnaryPixOp(… dst, … src, … upo)
   {
       if (dst == 0)
           dst = CxArrayClone<DstArrayT>(src);
       CxFuncUpoDispatch(dst, src, upo);
       return dst;
   }

   New (sequential or parallel):

   template<class …, class …, class …>
   inline DstArrayT* CxPatUnaryPixOp(… dst, … src, … upo)
   {
       if (dst == 0)
           dst = CxArrayClone<DstArrayT>(src);
       if (!PxRunParallel()) {              // run sequential
           CxFuncUpoDispatch(dst, src, upo);
       } else {                             // run parallel
           PxArrayPreStateTransition(src, …, …);
           PxArrayPreStateTransition(dst, …, …);
           CxFuncUpoDispatch(dst, src, upo);
           PxArrayPostStateTransition(dst);
       }
       return dst;
   }

  37. Parallel-Horus: Inter-operation Optimization • Lazy Parallelization: • Avoid communication • Applied on the fly! • (figure contrasts a ‘Don’t do this’ with a ‘Do this’ sequence of operations)

  38. Parallel-Horus: Distributed Image Data Structures • Distributed image data structure abstraction • 3-tuple: < state of global, state of local, distribution type > • state of global = { none, created, valid, invalid } • state of local = { none, valid, invalid } • distribution type = { none, partial, full, not-reduced } • 9 combinations are ‘legal’ states (e.g.: < valid, valid, partial >) • (figure: a global structure on the host CPU 0 and local structures on CPUs 0-3)
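
A minimal C++ sketch of this 3-tuple abstraction; the enum and struct names are illustrative assumptions, not the actual Parallel-Horus types.

   // State of a distributed image as the 3-tuple described above.
   enum class GlobalState      { None, Created, Valid, Invalid };
   enum class LocalState       { None, Valid, Invalid };
   enum class DistributionType { None, Partial, Full, NotReduced };

   struct DistributedImageState {
       GlobalState      global;   // state of the global (host-side) structure
       LocalState       local;    // state of the per-CPU local structures
       DistributionType dist;     // how the data is spread over the CPUs
   };

   // One of the 'legal' states mentioned on the slide: < valid, valid, partial >,
   // e.g. a valid global image whose scattered partitions are also valid.
   const DistributedImageState afterScatter{
       GlobalState::Valid, LocalState::Valid, DistributionType::Partial };

   int main() { (void)afterScatter; return 0; }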

  39. Lazy Parallelization: Finite State Machine • Communication operations serve as state transition functions between distributed data structure states • State transitions performed only when absolutely necessary • State transition functions allow correct conversion of legal sequential code to legal parallel code at all times • Nice features: • Requires no a priori knowledge of loops and branches • Can be done on the fly at run-time (with no measurable overhead)
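
The following minimal C++ sketch illustrates the idea of communication as a state transition performed only when necessary. The names and the reduced state set (distribution type only) are illustrative assumptions, not the actual Parallel-Horus state machine.

   #include <cstdio>

   enum class Dist { None, Partial, Full };
   struct Image { Dist dist = Dist::None; };

   // Before an operation runs, compare the argument's current distribution
   // state with the state the operation needs; communicate only if they differ.
   void ensureState(Image& im, Dist required)
   {
       if (im.dist == required) return;         // nothing to do: stay lazy
       if (required == Dist::Partial) std::puts("scatter image");
       if (required == Dist::Full)    std::puts("broadcast image");
       if (required == Dist::None)    std::puts("gather image");
       im.dist = required;                      // record the new state
   }

   int main()
   {
       Image im;
       ensureState(im, Dist::Partial);   // first pixel op: scatter happens
       ensureState(im, Dist::Partial);   // next pixel op: no communication
       ensureState(im, Dist::None);      // result needed on the host: gather
       return 0;
   }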

  40. Part 4: Example – Parallel Image Processing on Clusters

  41. Application: Detection of Curvilinear Structures • Apply anisotropic Gaussian filter bank to input image • Maximum response when filter is tuned to line direction • Here 3 different implementations: • (1) fixed filters applied to a rotating image • (2) rotating filters applied to fixed input image, separable (UV) • (3) rotating filters applied to fixed input image, non-separable (2D) • Depending on parameter space: a few minutes to several hours

  42. Sequential = Parallel

   IMPLEMENTATION 1:

   for all orientations theta {
       geometricOp ( inputIm, &rotatIm, -theta, LINEAR, 0, p, “rotate” );
       for all smoothing scales sy {
           for all differentiation scales sx {
               genConvolution ( filtIm1, mirrorBorder, “gauss”, sx, sy, 2, 0 );
               genConvolution ( filtIm2, mirrorBorder, “gauss”, sx, sy, 0, 0 );
               binaryPixOpI ( filtIm1, filtIm2, “negdiv” );
               binaryPixOpC ( filtIm1, sx*sy, “mul” );
               binaryPixOpI ( contrIm, filtIm1, “max” );
           }
       }
       geometricOp ( contrIm, &backIm, theta, LINEAR, 0, p, “rotate” );
       binaryPixOpI ( resltIm, backIm, “max” );
   }

   IMPLEMENTATIONS 2 and 3:

   for all orientations theta {
       for all smoothing scales sy {
           for all differentiation scales sx {
               genConvolution ( filtIm1, mirrorBorder, “func”, sx, sy, 2, 0 );
               genConvolution ( filtIm2, mirrorBorder, “func”, sx, sy, 0, 0 );
               binaryPixOpI ( filtIm1, filtIm2, “negdiv” );
               binaryPixOpC ( filtIm1, sx*sy, “mul” );
               binaryPixOpI ( resltIm, filtIm1, “max” );
           }
       }
   }

  43. Measurements on DAS-1 (Vrije Universiteit) • 512x512 image • 36 orientations • 8 anisotropic filters • So: part of the efficiency of parallel execution always remains in the hands of the application programmer!

  44. Measurements on DAS-2 (Vrije Universiteit) • 512x512 image • 36 orientations • 8 anisotropic filters • So: lazy parallelization (or: optimization across library calls) is very important for high efficiency!

  45. Part 5: Grids and Their Specific Problems

  46. The Grid • The “Promise of the Grid”: • 1997 and beyond: efficient and transparent (i.e. easy-to-use) wall-socket computing over a distributed set of resources • Compare: the electrical power grid

  47. Grid Problems (1) • Getting an account on remote compute clusters is hard! • Find the right person to contact… • Hope he/she does not completely ignore your request… • Provide proof of (among other things) relevance, ethics, ‘trusted’ nationality… • Fill in and sign NDAs, Foreign National Information sheets, official usage documents, etc… • Wait for the account to be created, and the username to be sent to you… • Hope to obtain an initial password as well… • Getting access to an existing international Grid testbed is easier • But only marginally so…

  48. Grid Problems (2) • Getting your C++/MPI code to compile and run is hard! • Copying your code to the remote cluster (‘scp’ often not allowed)… • Setting up your environment & finding the right MPI compiler (mpicc, mpiCC, … ???)… • Making the necessary include libraries available… • Finding the correct way to use the cluster reservation system (if there is any)… • Finding the correct way to start your program (mpiexec, mpirun, … and on which nodes ???)… • Getting your compute nodes to communicate with other machines (generally not allowed)… • So: • Nothing is standardized yet (not even Globus) • A working application in one Grid domain will generally fail in all others

  49. Grid Problems (3) • Keeping an application running (efficiently) is hard! • Grids are inherently dynamic: • Networks and CPUs are shared with others, causing fluctuations in resource availability • Grids are inherently faulty: • compute nodes & clusters may crash at any time • Grids are inherently heterogeneous: • optimization for run-time execution efficiency is by-and-large unknown territory • So: • An application that runs (efficiently) at one moment should be expected to fail a moment later

  50. Realizing the ‘Promise of the Grid’ for MMCA • Grid programming and execution is hard! • Even for experts in the field • A set of fundamental methodologies is required • Each solving part of the Grid’s complexities • For most of these methodologies solutions exist today: • JavaGAT / SAGA, Ibis / Satin, SmartSockets, Parallel-Horus / SuperServers ?
