Parallel Sorting: An Analysis

Madison Solarana & Kevin Zheng

Outline
• Sorting:
  • Introspective Sort
  • Odd-Even Transposition Sort
  • Shear Sort
  • Rank/Enumeration Sort
  • Merge Sort
  • Hyperquicksort
  • Bitonic Sort
  • Radix Sort
  • Sample Sort
• Reconfigurable Architecture

Introspective Sort
Quicksort + Heapsort = Introsort
Switches from Quicksort to Heapsort based on recursion depth.
C++ implementation in <algorithm>: std::sort(firstElement, lastElement)
Worst Case: O(n log(n))
Average Case: O(n log(n))
Not parallel, but replaces C's qsort for sequential sorting.
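
The depth-based switch described above can be sketched in Python. This is a minimal illustration, not the std::sort implementation: the names introsort and depth_limit are hypothetical, and the default limit of 2 * floor(log2(n)) is an assumed (though common) choice.

```python
import heapq
import math


def introsort(values, depth_limit=None):
    """Return a sorted copy: quicksort that falls back to heapsort."""
    items = list(values)
    if depth_limit is None:
        # Assumed default: 2 * floor(log2(n)) levels of quicksort recursion.
        depth_limit = 2 * int(math.log2(len(items))) if items else 0
    _introsort(items, 0, len(items), depth_limit)
    return items


def _heapsort_range(items, lo, hi):
    # Heapsort fallback for items[lo:hi], using the standard-library heap.
    heap = items[lo:hi]
    heapq.heapify(heap)
    for k in range(lo, hi):
        items[k] = heapq.heappop(heap)


def _introsort(items, lo, hi, depth):
    while hi - lo > 1:
        if depth == 0:
            # Recursion budget exhausted: switch to heapsort.
            _heapsort_range(items, lo, hi)
            return
        depth -= 1
        pivot = items[(lo + hi) // 2]
        i, j = lo, hi - 1
        # Hoare-style partition around the pivot value.
        while i <= j:
            while items[i] < pivot:
                i += 1
            while items[j] > pivot:
                j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1
                j -= 1
        # Recurse on the left part, keep looping on the right part.
        _introsort(items, lo, j + 1, depth)
        lo = i
```

Passing depth_limit=0 forces the heapsort path immediately, which is a convenient way to exercise the fallback on small inputs.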

Odd-Even Transposition Sort
pOddEven(n)
  id := rank
  for i := 1 to n do
    if i is odd then
      if id is odd then
        compare-exchange-min(id + 1);
      else
        compare-exchange-max(id - 1);
    if i is even then
      if id is even then
        compare-exchange-min(id + 1);
      else
        compare-exchange-max(id - 1);
  end for
end pOddEven

Odd-Even Transposition Sort
Compares all (odd, even) pairs of adjacent elements in a list and swaps the elements of any pair that is in the wrong order. The same is then done for the (even, odd) pairs, and the two steps alternate until the list is sorted.
N iterations with one compare-exchange operation per iteration.
Tpar = O(n) if and only if P = n.
If P < n, each processor holds n/P elements and Tpar = O((n/P) log(n/P)) for the local sort plus O(n) due to the costs of merge-splits and exchanges.
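
The alternating phases can be simulated sequentially in Python. This is a sketch that assumes one value per simulated process; the function name is hypothetical, and the inner loop stands in for compare-exchanges that would run concurrently on real hardware.

```python
def odd_even_transposition_sort(values):
    """Simulate n processes, one value each, over n compare-exchange phases."""
    a = list(values)
    n = len(a)
    for phase in range(1, n + 1):
        # Odd phases pair 1-based ids (1,2), (3,4), ...;
        # even phases pair 1-based ids (2,3), (4,5), ...
        start = 0 if phase % 2 == 1 else 1
        for left in range(start, n - 1, 2):
            # The left process keeps the minimum, the right keeps the maximum.
            if a[left] > a[left + 1]:
                a[left], a[left + 1] = a[left + 1], a[left]
    return a
```

All pairs within one phase are disjoint, which is why each phase counts as a single parallel step.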

Shear Sort
A sorting algorithm designed specifically for a mesh architecture where P = n^2.
Sorting a row of length n with odd-even transposition takes n steps (P = n), and there are log(n) iterations, so Shear Sort takes O(n log(n)).
Speedup = Tseq / Tpar = O(n^2 log(n)) / O(n log(n)) = O(n)
Efficiency = Speedup / P = O(n) / n^2 = 1/n


Shear Sort
ShearSort(n)
  for i := 1 to log(n) do
    if i is odd then
      call Odd_Even_Row_Sort(n)      (rows are sorted in alternating directions, snake-like)
    else
      call Odd_Even_Column_Sort(n)
  end for
end ShearSort
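
A sequential Python simulation of the row and column phases on an n x n mesh might look like the sketch below. It makes several assumptions beyond the slide: the built-in sort stands in for the odd-even row and column sorts, each round performs a row phase followed by a column phase, ceil(log2(n)) + 1 rounds plus a final row phase are used as a safe bound, and the result is read in snake order. The names shear_sort and snake_order are hypothetical.

```python
import math


def shear_sort(mesh):
    """Sort an n x n mesh into snake order using row and column phases."""
    n = len(mesh)
    if n == 0:
        return []
    grid = [list(row) for row in mesh]

    def sort_rows():
        # Even-indexed rows ascend, odd-indexed rows descend (snake order).
        for r in range(n):
            grid[r].sort(reverse=(r % 2 == 1))

    def sort_columns():
        # Every column is sorted top to bottom.
        for c in range(n):
            column = sorted(grid[r][c] for r in range(n))
            for r in range(n):
                grid[r][c] = column[r]

    for _ in range(math.ceil(math.log2(n)) + 1):
        sort_rows()
        sort_columns()
    sort_rows()  # a final row phase cleans up the last unsorted row
    return grid


def snake_order(grid):
    """Read the mesh row by row, reversing every odd-indexed row."""
    out = []
    for r, row in enumerate(grid):
        out.extend(row if r % 2 == 0 else reversed(row))
    return out
```

Reading the returned mesh with snake_order yields the values in ascending order.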

Rank/Enumeration Sort
forall (i=0; i<n; ++i) {
  numRank=0;
  for (j=0; j<n; ++j) {
    if (rawNums[i] > rawNums[j]) {
      ++numRank;
    }
  }
  sortedNums[numRank] = rawNums[i];
}

Rank/Enumeration Sort
For each number, counts how many numbers are smaller than it. This count determines its rank (its position in the sorted order).
Tpar = O(n) if P = n
Tpar = O(log(n)) if P = n^2
Tpar = O(1) if P = n^2 and if concurrent memory writes are allowed.
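
The counting step can be sketched in Python as follows. The outer loop iterations are independent, so each could run on its own processor. One detail is an assumption beyond the slide: ties are broken by index so that equal keys receive distinct ranks instead of overwriting one another. The function name is hypothetical.

```python
def rank_sort(raw_nums):
    """Place each element at the index given by its rank."""
    n = len(raw_nums)
    sorted_nums = [None] * n
    for i in range(n):  # independent iterations: one per processor
        rank = 0
        for j in range(n):
            smaller = raw_nums[j] < raw_nums[i]
            # Assumed tie-break: an equal key at a lower index ranks first.
            tie_before = raw_nums[j] == raw_nums[i] and j < i
            if smaller or tie_before:
                rank += 1
        sorted_nums[rank] = raw_nums[i]
    return sorted_nums
```

Without the tie-break, two equal keys would compute the same rank and one output slot would stay empty.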

Merge Sort
pMergesort(myid, d, data, newdata)
  data = mergesort(data)
  for dim = 1 to d
    data = pMerge(myid, dim, data)
  endfor
  newdata = data
end

Merge Sort
Tseq = O(n log(n))
Tpar = O(4n) = O(n) if P = n
S = Tseq / Tpar = O(log(n))
E = S / P = O(log(n) / n)
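
One way to read pMergesort is as a merge tree over 2^d nodes: every node sorts its own chunk, and in each dimension half of the remaining nodes send their data to a partner that merges it. The Python sketch below simulates that interpretation sequentially. The pairing rule (node and node | bit) and the function name are assumptions, since the slides do not define pMerge.

```python
import heapq


def parallel_merge_sort(values, d):
    """Simulate a merge tree over 2**d nodes; node 0 ends with all data."""
    p = 1 << d
    chunk = -(-len(values) // p)  # ceiling division
    # Local phase: each node sorts its own chunk.
    data = {node: sorted(values[node * chunk:(node + 1) * chunk])
            for node in range(p)}
    for dim in range(d):
        bit = 1 << dim
        # Nodes whose low dim+1 bits are zero receive from partner node | bit.
        for node in range(0, p, 2 * bit):
            received = data.pop(node | bit)
            data[node] = list(heapq.merge(data[node], received))
    return data[0]
```

The last merge handles all n elements on a single node, which is why the parallel time stays linear in n.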

Hyperquicksort
1. Distribute the data evenly to all nodes of the d-dimensional cube.
2. Each node sorts the data it has using sequential quicksort.
3. Node 0 broadcasts its median key K to the rest of the cube.
4. Each node separates its data into two groups: keys <= K and keys > K.
5. Break up the cube into two subcubes: the lower subcube (nodes 0 through 2^(d-1) - 1) and the upper subcube (nodes 2^(d-1) through 2^d - 1). Each node in the lower subcube sends its items whose keys are > K to its adjacent node in the upper subcube. Each node in the upper subcube sends its items whose keys are <= K to its adjacent node in the lower subcube. Now all data whose keys are <= K are in the lower subcube, while all those whose keys are > K are in the upper subcube.
6. Each node merges the group it just received with the one it kept so that its data is sorted.
7. Repeat steps 3 through 6 on each of the two subcubes. This time node 0 corresponds to the lowest-numbered node in the subcube, and the value of d is one less.
8. Repeat steps 3 through 7 until the subcubes contain only a single node.
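
The steps above can be simulated sequentially in Python. This is a sketch under stated assumptions: each list in nodes stands for one cube node, the built-in sort replaces sequential quicksort, and if the lowest node of a subcube holds no data the pivot is taken from the first non-empty node instead (a fallback the slides do not describe). The function name is hypothetical.

```python
import bisect
import heapq


def hyperquicksort(values, d):
    """Simulate hyperquicksort on 2**d nodes; returns the per-node lists."""
    p = 1 << d
    chunk = -(-len(values) // p)  # ceiling division
    # Steps 1-2: distribute the data and sort locally.
    nodes = [sorted(values[i * chunk:(i + 1) * chunk]) for i in range(p)]
    size = p
    while size > 1:
        half = size // 2
        for base in range(0, p, size):
            holders = [nodes[k] for k in range(base, base + size) if nodes[k]]
            if not holders:
                continue  # nothing to split in this subcube
            # Step 3: broadcast the median key of the lowest non-empty node.
            pivot = holders[0][len(holders[0]) // 2]
            for lo in range(base, base + half):
                hi = lo + half
                # Step 4: split each sorted list at the pivot.
                cut_lo = bisect.bisect_right(nodes[lo], pivot)
                cut_hi = bisect.bisect_right(nodes[hi], pivot)
                keep_lo, send_up = nodes[lo][:cut_lo], nodes[lo][cut_lo:]
                send_down, keep_hi = nodes[hi][:cut_hi], nodes[hi][cut_hi:]
                # Steps 5-6: exchange the halves and merge what was received.
                nodes[lo] = list(heapq.merge(keep_lo, send_down))
                nodes[hi] = list(heapq.merge(keep_hi, send_up))
        size = half  # Steps 7-8: recurse on the two subcubes.
    return nodes
```

Concatenating the returned lists in node order gives the sorted sequence. The lists can differ in length, because the pivot comes from a single node and may split the data unevenly.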


Hyperquicksort
Sequential Quicksort: [formula not recoverable]
Simple Parallel Quicksort: [formula not recoverable]
Hyperquicksort: [formula not recoverable]; its terms include the broadcast cost and the merging cost.

Bitonic Sort
IF master processor
  Retrieve data to sort
  Scatter it among all processors
ELSE
  Receive portion to sort
Sort local data using std::sort
FOR (level = 1; level <= lg(P); level++)
  FOR (j = 0; j < level; j++)
    partner = rank ^ (1 << (level - j - 1));
    Exchange data with partner
    IF ((rank < partner) == ((rank & (1 << level)) == 0))
      extract low values from local and received data (mergeLow)
    ELSE
      extract high values from local and received data (mergeHigh)
Collect Sorted Data
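
The partner and direction logic of the pseudocode can be checked with a sequential Python simulation. The sketch assumes P is a power of two and that the input length is a multiple of P so every process holds the same number of elements. Sorting the concatenated blocks stands in for mergeLow and mergeHigh, and the function name is hypothetical.

```python
def bitonic_sort(values, p):
    """Simulate block bitonic sort on p processes (p a power of two)."""
    assert p >= 1 and p & (p - 1) == 0, "p must be a power of two"
    assert len(values) % p == 0, "length must be a multiple of p"
    m = len(values) // p
    # Scatter the data and sort each local block.
    local = [sorted(values[r * m:(r + 1) * m]) for r in range(p)]
    levels = p.bit_length() - 1  # lg(P)
    for level in range(1, levels + 1):
        for j in range(level):
            bit = 1 << (level - j - 1)
            updated = []
            for rank in range(p):
                partner = rank ^ bit
                # Exchange: both partners see the union of their blocks.
                merged = sorted(local[rank] + local[partner])
                if (rank < partner) == ((rank & (1 << level)) == 0):
                    updated.append(merged[:m])   # mergeLow
                else:
                    updated.append(merged[m:])   # mergeHigh
            local = updated
    # Collect the sorted data in rank order.
    return [x for block in local for x in block]
```

At the last level the test (rank & (1 << level)) == 0 is true for every rank, so the final merge runs in ascending direction across all processes.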

Bitonic Sort
[complexity formulas not recoverable]

Radix Sort
Uses Bucket Sort and is similar to Histogram Sort.
The array is partitioned into buckets, and then each bucket is sorted individually.
Two variants: least-significant-digit (LSD) and most-significant-digit (MSD) Radix Sort.

Radix Sort
Parallelize counting sort: the array is split into p partitions and each processor gets N/p elements.
All processors work together to compute the global prefix sum.
Then each processor copies its assigned values to the shared output array.
Tseq = O(kN), where k is the number of digits per key.
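
A sequential Python simulation of the parallel counting pass is sketched below, wrapped in an LSD radix sort. It assumes non-negative integer keys below 2**key_bits and 4-bit digits; the function and parameter names are hypothetical. The prefix sum is ordered by (digit, processor), which is one way to keep each pass stable.

```python
def parallel_counting_pass(keys, p, shift, bits=4):
    """One stable counting-sort pass on the digit at bit offset `shift`."""
    radix = 1 << bits
    mask = radix - 1
    chunk = -(-len(keys) // p)  # ceiling division
    parts = [keys[i * chunk:(i + 1) * chunk] for i in range(p)]
    # Each processor builds a histogram of its own partition.
    hist = [[0] * radix for _ in range(p)]
    for proc, part in enumerate(parts):
        for key in part:
            hist[proc][(key >> shift) & mask] += 1
    # Global exclusive prefix sum, ordered by digit and then by processor.
    offsets = [[0] * radix for _ in range(p)]
    total = 0
    for digit in range(radix):
        for proc in range(p):
            offsets[proc][digit] = total
            total += hist[proc][digit]
    # Each processor copies its values into the shared output array.
    out = [None] * len(keys)
    for proc, part in enumerate(parts):
        for key in part:
            digit = (key >> shift) & mask
            out[offsets[proc][digit]] = key
            offsets[proc][digit] += 1
    return out


def parallel_radix_sort(keys, p, key_bits=16, bits=4):
    """LSD radix sort built from repeated counting passes."""
    out = list(keys)
    for shift in range(0, key_bits, bits):
        out = parallel_counting_pass(out, p, shift, bits)
    return out
```

Because every (digit, processor) pair owns a disjoint range of the output array, the copy step needs no synchronization between processors.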

Sample Sort
• n/p elements per processor.
• Each processor sorts its local elements (std::sort or bitonic sort).
• Each processor selects p-1 equally spaced elements from its own list.
• The combined set of p(p-1) elements is sorted, and p-1 equally spaced elements are selected from that list as splitters.
• Each processor splits its own list according to these splitters into p buckets.
• Each processor sends its i-th bucket to the i-th processor.
• Each processor merges the elements that it receives.
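
The steps above can be simulated sequentially in Python. This is a sketch with hypothetical names: the inbox lists model the all-to-all bucket exchange, the index arithmetic used to pick equally spaced samples is one assumed convention, and empty blocks simply contribute no samples.

```python
import bisect
import heapq


def sample_sort(values, p):
    """Simulate sample sort on p processors; returns the per-processor lists."""
    if not values:
        return [[] for _ in range(p)]
    chunk = -(-len(values) // p)  # ceiling division
    # Each processor sorts its local elements.
    local = [sorted(values[i * chunk:(i + 1) * chunk]) for i in range(p)]
    # Each processor selects p-1 equally spaced samples from its own list.
    samples = []
    for block in local:
        if block:
            samples.extend(block[(k * len(block)) // p] for k in range(1, p))
    samples.sort()
    # p-1 equally spaced splitters are chosen from the combined samples.
    splitters = [samples[(k * len(samples)) // p] for k in range(1, p)]
    # Each processor cuts its list into p buckets and sends bucket i to i.
    inbox = [[] for _ in range(p)]
    for block in local:
        cuts = [0] + [bisect.bisect_right(block, s) for s in splitters]
        cuts.append(len(block))
        for i in range(p):
            inbox[i].append(block[cuts[i]:cuts[i + 1]])
    # Each processor merges the sorted runs it received.
    return [list(heapq.merge(*runs)) for runs in inbox]
```

Concatenating the returned lists in processor order gives the sorted sequence, since every key on processor i is at most the i-th splitter and every key on processor i+1 is greater than it.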


Reconfigurable Architecture
• RMESH: reconfigurable mesh with buses
• PARBUS
• MRN
• Polymorphic torus


Benefits of Reconfigurable Architecture
In any given time unit, one PE (processing element) in a collection of PEs connected to the same bus can broadcast a message, and all the other PEs in that collection can read it in the same time unit.

Column Sort on an RMESH
Step 0: [Input] Q is available in column-major order, one element per PE, in row 1 of the RMESH.
Step 1: [Sort Transpose] Obtain the Q matrix by sorting column-wise and then transposing the matrix.
Step 2: [Sort Untranspose] Obtain the Q matrix by sorting column-wise and then untransposing the matrix.
Step 3: [Sort Shift] Obtain the Q matrix by sorting column-wise and then shifting the matrix.
Step 4: [Sort Unshift] Obtain the result by sorting column-wise and then unshifting the matrix.
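
The four sort-and-permute steps can be traced with a sequential Python sketch of column sort on an r x s matrix stored in column-major order. It ignores the RMESH bus operations and uses the built-in sort for the column sorts. The shape conditions (r divisible by s, r even, r >= 2(s-1)^2) are the usual ones for column sort rather than something stated on the slides, and the function names are hypothetical.

```python
from math import inf


def sort_columns(flat, r):
    """Sort every length-r column of a column-major flat list."""
    return [x for c in range(0, len(flat), r) for x in sorted(flat[c:c + r])]


def column_sort(values, r, s):
    """Column sort of r*s values; returns them in sorted order."""
    assert len(values) == r * s
    assert r % s == 0 and r % 2 == 0 and r >= 2 * (s - 1) ** 2
    n = r * s
    # Step 1: sort columns, then transpose (read column-major, write row-major).
    a = sort_columns(list(values), r)
    t = [None] * n
    for k in range(n):
        t[(k % s) * r + (k // s)] = a[k]
    # Step 2: sort columns, then untranspose (inverse of the permutation above).
    t = sort_columns(t, r)
    a = [t[(k % s) * r + (k // s)] for k in range(n)]
    # Step 3: sort columns, then shift by r/2 into s+1 columns with padding.
    a = sort_columns(a, r)
    half = r // 2
    shifted = [-inf] * half + a + [inf] * (r - half)
    # Step 4: sort columns, then unshift by dropping the padding.
    shifted = sort_columns(shifted, r)
    return shifted[half:half + n]
```

Reading the result in column-major order gives the sorted sequence, matching the column-major input layout of Step 0.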


Column Row Transposition

Column Sort on an RMESH
• O(1)
• Column Sort
  • Total number of broadcasts is 139.
• Rotate Sort
  • Total number of broadcasts is 120.
• Sort n elements in O(1) using n^2 processors.
• In general, n = N^k numbers can be sorted in O(1) time using N^(k+1) = n^(1+1/k) processors in a (k+1)-dimensional configuration.

Questions?

Main Sources
• M. Nigam and S. Sahni, "Sorting n Numbers on n x n Reconfigurable Meshes with Buses"
• J. Jang and V. Prasanna, "An Optimal Sorting Algorithm on Reconfigurable Meshes"
• R. Lin, S. Olariu, J. L. Schwing, and J. Zhang, "Sorting in O(1) Time on an n x n Reconfigurable Mesh"
