Efficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism
Peter Krusche and Alexander Tiskin
Department of Computer Science, University of Warwick
May 09/2006
P.Krusche / A. Tiskin - Efficient LLCS Computation using Bulk-Synchronous Parallelism
Outline
• Introduction
  • LLCS Computation
  • The BSP Model
• Problem Definition and Algorithms
  • Standard Algorithm
  • Parallel Algorithm
• Experiments
  • Experiment Setup
  • Predictions
  • Speedup
Motivation
Computing the (length of the) longest common subsequence is representative of a class of dynamic programming algorithms. Hence, we want to
• Examine the suitability of high-level BSP programming for such problems
• Compare different BSP libraries on different systems
• See what happens when there is good sequential performance
• Examine performance predictability
Related Work
• Sequential dynamic programming algorithm (Hirschberg, '75)
• Crochemore, Iliopoulos, Pinzon, Reid: A fast and practical bit-vector algorithm for the Longest Common Subsequence problem (2001)
• Alves, Cáceres, Dehne: Parallel dynamic programming for solving the string editing problem on a CGM/BSP (2002)
• Garcia, Myoupo, Semé: A coarse-grained multicomputer algorithm for the longest common subsequence problem (2003)
Our Work
• Combination of bit-parallel algorithms and fast BSP-style communication
• A BSP performance model and predictions
• Comparison using different libraries on different systems
• Estimation of the block size parameter before the calculation, for better speedup
The BSP Model
• p identical processor/memory pairs (computing nodes)
• Computation speed f on every node
• Arbitrary interconnection network with latency l and bandwidth gap g
BSP Programs
• SPMD execution, takes place in supersteps
• Communication may be delayed until the end of the superstep
• Time/cost formula: T = f · W + g · H + l · S
Bytes will be used as the base unit for communication size.
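The cost formula can be evaluated directly once the machine parameters are known. The following is a minimal sketch, assuming the usual BSP reading of the symbols (W: local computation in elementary operations, H: communication volume in bytes, S: number of supersteps); the slide itself does not spell these out. The names `BSPMachine` and `bsp_cost` are hypothetical and belong to no BSP library, and the numbers in the usage note are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class BSPMachine:
    """Machine parameters of the BSP model (hypothetical helper type)."""
    f: float  # time per elementary operation
    g: float  # bandwidth gap: time per byte communicated
    l: float  # latency: time charged per superstep


def bsp_cost(machine: BSPMachine, work: float, comm_bytes: float,
             supersteps: int) -> float:
    """Evaluate T = f*W + g*H + l*S for the given W, H and S."""
    return (machine.f * work
            + machine.g * comm_bytes
            + machine.l * supersteps)
```

For example, `bsp_cost(BSPMachine(f=2.0, g=3.0, l=5.0), work=10, comm_bytes=4, supersteps=2)` adds 20 for computation, 12 for communication and 10 for synchronisation.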
Problem Definition
Let $X = x_1 x_2 \ldots x_m$ and $Y = y_1 y_2 \ldots y_n$ be two strings over a finite alphabet.
• Subsequence $U$ of string $X$: $U$ can be obtained by deleting zero or more elements from $X$, i.e. $U = x_{i_1} x_{i_2} \ldots x_{i_k}$ with $i_q < i_{q+1}$ for all $q$ with $1 \le q < k$.
• $\mathrm{LCS}(X, Y)$: any string that is a subsequence of both $X$ and $Y$ and has maximum possible length.
• $\mathrm{LLCS}(X, Y)$: the length of such a string.
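The subsequence definition can be checked mechanically. This is a small illustrative sketch (the function name `is_subsequence` is an assumption, not something from the slides): it scans X once and consumes characters of U in order, which mirrors the condition that the selected indices are strictly increasing.

```python
def is_subsequence(u: str, x: str) -> bool:
    """Return True if u can be obtained from x by deleting zero or more
    characters, keeping the remaining ones in their original order."""
    remaining = iter(x)
    # "ch in remaining" advances the iterator past the first match, so
    # every character of u must be found strictly after the previous one.
    return all(ch in remaining for ch in u)
```

With this helper, `is_subsequence("ace", "abcde")` holds, while `is_subsequence("aec", "abcde")` does not, because the order of characters must be preserved.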
Sequential Algorithm
• Dynamic programming matrix $L_{0..m,\,0..n}$
• $L_{i,j} = \mathrm{LLCS}(x_1 x_2 \ldots x_i,\; y_1 y_2 \ldots y_j)$
• The values in this matrix can be computed in $O(mn)$ time and space
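A minimal sketch of the standard dynamic programming computation follows. It uses the textbook recurrence (row 0 and column 0 are zero; a match extends the diagonal entry by one, otherwise the larger of the upper and left neighbours is taken). The function name `llcs_dp` is chosen for illustration only.

```python
def llcs_dp(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y,
    using the full (m+1) x (n+1) matrix: O(mn) time and space."""
    m, n = len(x), len(y)
    # L[i][j] = LLCS(x[:i], y[:j]); the first row and column stay 0.
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]
```

For the strings "ABCBDAB" and "BDCABA" the function returns 4 (for instance, "BCBA" is a common subsequence of that length).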
Parallel Algorithm
• Based on a simple parallel algorithm for grid DAG computation
• Dynamic programming matrix L is partitioned into a grid of rectangular blocks of size (m/G) × (n/G) (G: grid size)
• Blocks in a wavefront can be processed in parallel
• Assumptions:
  • Strings of equal length, m = n
  • Ratio a = G/p is an integer
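The wavefront order can be illustrated with a sequential simulation. This is only a sketch under simplifying assumptions: all blocks share one matrix in memory, whereas a real BSP program would keep block-columns on separate processors and exchange block boundaries between supersteps. The names `wavefronts` and `llcs_blocked` are hypothetical. Blocks on the same anti-diagonal depend only on blocks from earlier anti-diagonals, so the inner loop over a wavefront is the part that could run in parallel.

```python
def wavefronts(G: int) -> list:
    """Blocks (bi, bj) of a G x G grid, grouped by anti-diagonal bi + bj."""
    return [[(bi, d - bi) for bi in range(G) if 0 <= d - bi < G]
            for d in range(2 * G - 1)]


def llcs_blocked(x: str, y: str, G: int) -> int:
    """LLCS computed block by block in wavefront order (sequential sketch)."""
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    # Block boundaries; blocks have size about (m/G) x (n/G).
    rows = [m * k // G for k in range(G + 1)]
    cols = [n * k // G for k in range(G + 1)]
    for front in wavefronts(G):
        # Blocks within one wavefront are independent of each other.
        for bi, bj in front:
            for i in range(rows[bi] + 1, rows[bi + 1] + 1):
                for j in range(cols[bj] + 1, cols[bj + 1] + 1):
                    if x[i - 1] == y[j - 1]:
                        L[i][j] = L[i - 1][j - 1] + 1
                    else:
                        L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]
```

A G × G grid yields 2G − 1 wavefronts; `wavefronts(2)` gives `[[(0, 0)], [(0, 1), (1, 0)], [(1, 1)]]`. Unlike the slide, the sketch does not require m = n.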
Parallel Cost Model
• Input/output data distribution is block-cyclic
• Data for block-columns can be kept locally
• Running time: [formula not preserved in this transcript]
• Parameter a can be used to tune performance
Bit-Parallel Algorithms
• Bit-parallel computation processes w entries of L in parallel (w: machine word size)
• This leads to a substantial speedup for the sequential computation phase and slightly lower communication cost per superstep
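To show what a bit-parallel update looks like, here is a sketch of one well-known bit-vector formulation of the LLCS computation. Whether the authors' implementation uses exactly this update rule is an assumption, and the function name `llcs_bitparallel` is hypothetical. Python integers have arbitrary size, so the whole column is held in one integer; in C the same update would be applied word by word with carry propagation between words of size w.

```python
def llcs_bitparallel(x: str, y: str) -> int:
    """LLCS via a bit-vector recurrence: one column of the DP matrix is
    encoded in the bits of v and updated with a few word operations."""
    m = len(x)
    mask = (1 << m) - 1
    # match[c] has bit i set exactly when x[i] == c.
    match = {}
    for i, c in enumerate(x):
        match[c] = match.get(c, 0) | (1 << i)
    v = mask  # all ones: no increments in the column yet
    for c in y:
        u = v & match.get(c, 0)
        # u is a subset of v, so v - u clears those bits without borrowing.
        v = ((v + u) | (v - u)) & mask
    # Each zero bit of v marks a row where the column value increases by 1.
    return m - bin(v).count("1")
```

On "ABCBDAB" and "BDCABA" this returns 4, the same value as the plain dynamic programming computation.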
Systems Used
Measurements on parallel machines at the Centre for Scientific Computing:
• aracari: IBM cluster, 64 × 2-way SMP Pentium3 1.4 GHz / 128 GB of memory (interconnection network: Myrinet 2000, MPI: mpich-gm)
• argus: Linux cluster, 31 × 2-way SMP Pentium4 Xeon 2.6 GHz processors / 62 GB of memory (interconnection network: 100 Mbit Ethernet, MPI: mpich-p4)
• skua: SGI Altix shared memory machine, 56 × Itanium-2 1.6 GHz processors / 112 GB of memory (MPI: SGI native)
BSP Libraries Used
• The Oxford BSP Toolset on top of MPI (www.bsp-worldwide.org/implmnts/oxtool/)
• PUB on top of MPI (except on the SGI) (wwwcs.uni-paderborn.de/~bsp/)
• A simple BSPlib implementation based on MPI(-2)
Input and Parameters
• Input strings of equal length, generated randomly
• Predictability examined for string lengths between 8192 and 65536 and grid size parameter a between 1 and 5
• Values of l and g measured by timing random permutations
Experimental Values of f and f′
• Simple algorithm (f)
  skua      0.008 µs/op     130 M op/s
  argus     0.016 µs/op      61 M op/s
  aracari   0.012 µs/op      86 M op/s
• Bit-parallel algorithm (f′)
  skua      0.00022 µs/op   4.5 G op/s
  argus     0.00034 µs/op   2.9 G op/s
  aracari   0.00055 µs/op   1.8 G op/s
Predictions
Good results on distributed memory systems
[Figure: aracari/MPI, 32 processors]
Predictions
Slightly worse results on shared memory
[Figure: skua, MPI, p = 32]
Problems when Predicting Performance
• Results for PUB are less accurate on shared memory
• Setup costs are covered only by parameter l (difficult to measure)
• Problems on the shared memory machine when communication size is small
• PUB shows a performance drop when communication size reaches a certain value
• A busy communication network can create 'spikes'
Predictions for the Bit-Parallel Version
• Good results on distributed memory systems
• Results on the SGI have a larger prediction error because local computations use block sizes for which f′ is not stable
Speedup Results (LLCS, aracari)
Speedup for the Bit-Parallel Version
• Speedup is slightly lower than for the standard version
• However, overall running times for the same problem sizes are shorter
• Parallel speedup can be expected for larger problem sizes
Speedup for the Bit-Parallel Version
[Figures: argus, p = 10; skua, p = 32]
Result Summary
Summary and Outlook
• Summary
  • High-level BSP programming is efficient for the dynamic programming problem we considered
  • Implementations benefit from a low-latency BSP library (the Oxford BSP Toolset/PUB)
  • Very good predictability
• Outlook
  • Different modeling of bandwidth allows better predictions
  • Lower latency is possible by using subgroup synchronization
  • Extraction of the LCS is possible, using a post-processing step or another algorithm ...