Towards an Optimal Bit-Reversal Permutation Program



  1. Towards an Optimal Bit-Reversal Permutation Program Larry Carter and Kang Su Gatlin Presented by Ari Shotland

  2. What is Bit-Reversal • We index arrays by binary strings: the pseudocode "for i = 0 to N − 1" means that i iterates through all binary strings of length log2(N). • Given a binary string s, r(s) denotes the reversal of s. Example: r(01101) = 10110.

  3. What is Bit-Reversal (cont.) • The following program, where N is a power of 2, makes B a Bit-Reversal Permutation of A (a runnable C version follows below):

    BitReverse(A, B):
        for i = 0 to N − 1
            B[r(i)] = A[i]
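A minimal C sketch of this program (my illustration, not the paper's code; the element type double and the helper name rev are assumptions):

    #include <stdio.h>

    /* r(s): reverse the low `bits` bits of i. */
    static unsigned rev(unsigned i, int bits) {
        unsigned r = 0;
        for (int k = 0; k < bits; k++) {
            r = (r << 1) | (i & 1);   /* shift the lowest bit of i into r */
            i >>= 1;
        }
        return r;
    }

    /* BitReverse(A, B): B[r(i)] = A[i] for all i, with N = 2^lgN. */
    void bit_reverse(const double *A, double *B, int lgN) {
        unsigned N = 1u << lgN;
        for (unsigned i = 0; i < N; i++)
            B[rev(i, lgN)] = A[i];
    }

    int main(void) {
        double A[8] = {0, 1, 2, 3, 4, 5, 6, 7}, B[8];
        bit_reverse(A, B, 3);
        for (int i = 0; i < 8; i++)
            printf("%g ", B[i]);      /* prints: 0 4 2 6 1 5 3 7 */
        printf("\n");
        return 0;
    }

This naive loop is exactly the program whose memory behavior the rest of the talk analyzes; it makes no attempt at cache efficiency.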

  4. What is it good for? • Our interest in very accurate analyses of BitReverse comes from our interest in high-performance library programs for the Fast Fourier Transform (FFT). • The next slide shows a DAG representation of a 16-point FFT, assuming all edges go from left to right. The rightmost stage of the DAG is a BitReverse. This stage is necessary in practice so that repeated applications of the FFT can swap between the "time domain" and the "frequency domain".

  5. What is it good for? (cont.) • In practice, most FFT implementations avoid bit reversal because of the cache associativity problems it presents. • One of the more popular algorithms, the "four-step" FFT, runs in Ө(N lg(N) lglg(N)), avoiding the BitReverse permutation by means of recursive transposes. • Eliminating the lglg(N) term by replacing the recursive transposes with a single BitReverse achieves Ө(N lg(N)) complexity.

  6. The RoCol™ pebble game • Equipment: two buckets, labeled A and B; N pebbles (initially in A); an (infinitely large) "Go" board; an integer K. • Object of the game: move all pebbles from A to B in as few moves as possible.

  7. The RoCol™ pebble game (cont.) • Rules: 1. Initially, all the pebbles are in A. 2. At most K pebbles can be on the Go board at any time. 3. There are two types of moves: • Row move: choose a row of the Go board and place as many pebbles from A as desired (subject to rule 2) on that row, in any positions. • Column move: choose a column of the Go board and move as many pebbles as desired from that column to bucket B.

  8. RoCol strategies • A poor strategy would be to repeatedly make one Row move placing K pebbles on one row of the board, and then K Column moves to pick them up one at a time (the pebbles lie in K different columns). This strategy requires N + N/K = N(1 + 1/K) moves. Note that for any K the average transfer rate is less than one pebble per move. • Assuming K = H², a much better strategy (the "square" strategy) is to first make H Row moves to create an H × H square of pebbles, and then H Column moves to empty the board into bucket B. With 2H = 2√K moves we transfer K pebbles, so the "bandwidth" from A to B is √K/2 pebbles per move. Assuming N is a multiple of K, this strategy requires 2N/√K = N(2/√K) moves.

  9. RoCol strategies (cont.) • Assuming K is a triangular number, i.e. K = H²/2 + H/2 = H(H+1)/2, there is an even better strategy (the "triangle" strategy). An initial H Row moves create a right triangle with legs of length H. Thereafter, we alternate one Column move that empties a full column of H pebbles with one Row move of H pebbles that restores the triangle to its full size. • Neglecting the first H Row moves and the last H Column moves, the asymptotic "bandwidth" of this strategy is H/2 = (√(1+8K) − 1)/4 pebbles per move, which is nearly √2 better than the square strategy's bandwidth (the two are compared numerically in the sketch below).
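A small C program (my own illustration; the K values are arbitrary) evaluating both bandwidth formulas for the same board capacity K:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double Ks[] = {16, 64, 256, 4096};
        for (int i = 0; i < 4; i++) {
            double K = Ks[i];
            double square   = sqrt(K) / 2.0;                 /* square strategy: √K/2   */
            double triangle = (sqrt(1 + 8 * K) - 1) / 4.0;   /* triangle: T⁻¹(K)/2      */
            printf("K=%5.0f  square=%6.2f  triangle=%6.2f  ratio=%.3f\n",
                   K, square, triangle, triangle / square);
        }
        return 0;   /* the ratio approaches √2 ≈ 1.414 as K grows */
    }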

  10. The Triangle Strategy Is Optimal • Define T(h) = h(h+1)/2 = k. Thus T⁻¹(k) = (√(1+8k) − 1)/2 = h. • Let G_i denote the "board position" after i moves have been made. • Given a "board position" G of pebbles on the Go board, let r_j denote the number of pebbles on the jth row and c_j the number on the jth column. • Define the potential P of position G as P(G) = ∑_j T(c_j) − ∑_j T(r_j). P can be thought of as a measure of the "verticalness" of G: a lot of pebbles in a small number of columns results in a large P, and then many pebbles can be removed with few Column moves.
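The potential is easy to compute directly; a hedged C sketch (the per-column and per-row count arrays are my own representation, not the paper's):

    #include <stdio.h>

    /* T(h) = h(h+1)/2, the h-th triangular number. */
    static long T(long h) { return h * (h + 1) / 2; }

    /* P(G) = sum_j T(c_j) - sum_j T(r_j), given the pebble counts of the
       n columns and n rows spanned by the position. */
    static long potential(const long *c, const long *r, int n) {
        long P = 0;
        for (int j = 0; j < n; j++)
            P += T(c[j]) - T(r[j]);
        return P;   /* large P: pebbles concentrated in a few tall columns */
    }

    int main(void) {
        /* The H = 3 triangle: columns hold 1, 2, 3 pebbles; rows 3, 2, 1. */
        long c[] = {1, 2, 3}, r[] = {3, 2, 1};
        printf("P = %ld\n", potential(c, r, 3));   /* prints: P = 0 */
        return 0;
    }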

  11. The Triangle Strategy Is Optimal (cont.) • Lemma 1: P(G_i) − P(G_(i−1)) ≤ K − T(k_i), where k_i is the number of pebbles involved in move i.
Proof: Assume move i is a Row move on row j (a Column move is symmetric). The second summation element is affected only for that j; writing r_j^i for the count after move i, the change is T(r_j^i) − T(r_j^(i−1)) = T(r_j^(i−1) + k_i) − T(r_j^(i−1)) = r_j^(i−1)·k_i + T(k_i) ≥ T(k_i), and since this term is subtracted, it decreases P by at least T(k_i). The first summation element is affected for every j whose c_j was increased by 1 by the Row move; since T(c_j^i) − T(c_j^i − 1) = c_j^i, the change in this element is the sum of c_j^i over the k_i affected columns, which is at most K (there are never more than K pebbles on the board). Together, P(G_i) − P(G_(i−1)) ≤ K − T(k_i). ∎

  12. The Triangle Strategy Is Optimal (cont.) • Lemma 2: In a game of m moves, ∑_(i=1..m) T(k_i) ≤ mK.
Proof: From Lemma 1 we derive T(k_i) ≤ K − P(G_i) + P(G_(i−1)). Summing over i = 1..m, the potentials telescope, giving ∑ T(k_i) ≤ mK − P(G_m) + P(G_0). The potentials of the first and last positions of the game are equal (the board is empty in both), so ∑ T(k_i) ≤ mK. ∎

  13. The Triangle Strategy Is Optimal (cont.) • Lemma 3: In a game of m moves, ∑ k_i ≤ m·T⁻¹(K).
Proof: Let S = ∑ T(k_i).
Lemma 3.1: ∑ k_i is maximized, subject to the constraint S = ∑ T(k_i), when all the k_i equal the same value, call it k_0.
Proof: Suppose ∑ k_i is maximized but, say, k_1 ≠ k_2. Let t_1 = t_2 = (k_1 + k_2)/2, and t_i = k_i for i > 2. Then ∑ t_i = ∑ k_i, but since T is convex, 2T((k_1 + k_2)/2) ≤ T(k_1) + T(k_2) (indeed T(k_1) + T(k_2) − 2T((k_1 + k_2)/2) = (k_1 − k_2)²/4 ≥ 0), so ∑ T(t_i) ≤ ∑ T(k_i) = S. We can therefore increase t_1 until ∑ T(t_i) = S, making ∑ t_i > ∑ k_i and contradicting the assumption that ∑ k_i was maximal. ∎

  14. The Triangle Strategy Is Optimal (cont.) • Lemma 3, continued: We can calculate k_0 by observing that S = ∑ T(k_i) = m·T(k_0), so k_0 = T⁻¹(S/m). Thus ∑ k_i ≤ m·k_0 = m·T⁻¹(S/m). We also know from Lemma 2 that S ≤ mK, and since T (and hence T⁻¹) is an increasing function, it follows that ∑ k_i ≤ m·T⁻¹(K). • The above leads to…

  15. The Triangle Strategy Is Optimal (cont.) • Theorem 1: A game of RoCol that allows at most K pebbles on the board at a time requires at least 2N/H moves to transfer N pebbles from A to B, where H = T⁻¹(K).
Proof: A complete game of m moves involves moving all N pebbles out of bucket A and all N pebbles into B, so the total number of pebble-moves ∑ k_i is 2N; Lemma 3 then gives m ≥ 2N/H. ∎ • Recalling that the Triangle Strategy achieves the game's goal with 2N/H moves, we deduce that… • The Triangle Strategy Is Optimal!

  16. Lower Bounds On Permutations • We will focus on programs that implement Transpose, since the exposition is easier, but the theorems apply equally to BitReverse. • We always assume that the dimensions of the arrays are multiples of the size of a page, and therefore multiples of the size of a cache line, and that data is aligned, i.e. every page lies entirely inside the arrays. [Figure: arrays A and B and a 2-line, 2-way associative cache.] • Denote by b the size of a cache line. Partitioning A and B into b×b sub-arrays, we notice a performance problem: the b cache lines that hold a sub-array of A are all in the same associativity class, yet these lines hold elements to be moved to a single cache line of B (a numeric sketch follows below).
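The following sketch shows the effect numerically (all cache parameters here are illustrative assumptions, not the geometry of any machine in the paper): when the array's row length is a multiple of line × sets, every element of one column of a b×b block maps to the same associativity class.

    #include <stdio.h>

    int main(void) {
        int line    = 32;            /* cache line size, in elements (assumed)    */
        int sets    = 256;           /* number of associativity classes (assumed) */
        int row_len = line * sets;   /* row length of A, in elements: 8192        */
        int b       = line;

        /* Column 0 of one b x b block of A, stored row-major. */
        for (int i = 0; i < b; i++) {
            long elem = (long)i * row_len;            /* offset of A[i][0]  */
            long set  = (elem / line) % sets;         /* its cache set      */
            printf("A[%2d][0] -> set %ld\n", i, set); /* always set 0       */
        }
        return 0;
    }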

  17. Lower Bounds On Permutations (cont.) • This situation leads to a trade-off between how often data is moved into cache and how often it is moved into registers. • Suppose we restrict our implementation to move each element into a register only once during execution. • The following theorem applies where A and B each represent a sub-array whose elements are all of the same associativity class (though A's class can be different from B's). • After applying the theorem to the sub-arrays, we carry the result forward to the original (large) arrays. • The parameter Z is the cache associativity.

  18. Lower Bounds On Permutations (cont.) • Theorem 2: Given a computer with K registers and a cache that allows at most Z cache lines of A and Z cache lines of B at a time, let N be the size of A. Then any program that computes Transpose(A, B) by bringing each element into a register only once during execution must perform at least N/(Z + T⁻¹(K)/2) cache line moves during execution. • Proof…

  19. Lower Bounds On Permutations (cont.) • We think of a given program as a sequence of data operations of the following types: • Encache(X): brings into cache the cache line containing the element X, where X is either A[i, j] or B[j, i] for some i and j. • Evict(X): moves the cache line containing X out of cache, freeing up space for a different cache line. • Load(A[i, j], Rk): moves A[i, j] into register Rk; the cache line containing A[i, j] must currently reside in cache. • Store(Rk, B[j, i]): moves the element in Rk into its final position in the B array; the cache line containing B[j, i] must currently reside in cache. • Copy(A[i, j], B[j, i]): has the effect of a Load immediately followed by a Store, but doesn't require naming registers; both A[i, j]'s and B[j, i]'s cache lines must currently reside in cache.

  20. Lower Bounds On Permutations (cont.) • We assume that the sequence is free of Copy's. • We rearrange the sequence into a canonical form: we search the sequence for any instant in time when, for some i and j, both A[i, j]'s and B[j, i]'s cache lines are in cache. For each such occurrence, we remove the Load(A[i, j], Rk) and the Store(Rk, B[j, i]) from the sequence and instead, at the earliest point in the schedule where both lines are in cache, insert a Copy(A[i, j], B[j, i]) operation. Note that this point is immediately after an Encache operation; for accounting purposes, we "charge" the Copy to the Encache operation it immediately follows. • After introducing as many Copy's as possible, each Encache operation has at most Z Copy's charged to it: at most Z lines of the other array are in cache when the Encache occurs, and each such line holds only one element paired with the newly encached line. [Figure: example showing (i, j) entries of A and the matching (j, i) entries of B.]

  21. Lower Bounds On Permutations (cont.) • We now show how the normalized sequence (ignoring the Copy's) corresponds to a game of RoCol. • The (i, j) pairs are the positions on the board. • The number of pebbles on the board is bounded by the number of registers in the machine. • Each Encache operation is a RoCol move: consider an Encache(A[i, j]) and the set of following Load's from the cache line containing A[i, j] that occur before Evict(A[i, j]). For each such Load, the corresponding Store must occur after the Evict; otherwise we would have replaced the Load/Store pair with a Copy. Thus it is legal to move all these Load's to just before the Evict, and we bundle this set of Load's together into a Row move. Similarly, for each Encache(B[j, i]) operation we move all the Store's to just after the Encache and designate them as a Column move. (Remember that a Row move places pebbles from bucket A on the board, and a Column move removes pebbles from the board to bucket B.)

  22. Lower Bounds On Permutations (cont.) • Let c be the number of Encache operations. By our accounting method, at most cZ elements are transposed by Copy's, leaving N − cZ elements to be moved in the RoCol game. • Since each Encache is a RoCol move, Theorem 1 gives c ≥ 2(N − cZ)/T⁻¹(K); rearranging, c(2Z + T⁻¹(K)) ≥ 2N, thus c ≥ N/(Z + T⁻¹(K)/2). • Corollary: Given a computer with K registers, a cache line of length L, and a cache associativity of Z, Transpose with one register roundtrip per element requires at least L/(2Z + T⁻¹(K)) cache roundtrips per element. (A cache roundtrip involves two cache misses!)

  23. Roundtrips to Registers vs. Roundtrips to Cache • The 66-MHz IBM Power2 processor has Z = 4 and L = 32 (measured in 4-byte elements). The corollary says that even if we could use all K = 32 registers, an algorithm with one register roundtrip per element must average at least 2.06 cache roundtrips per element. • An important observation: exactly the same analysis applies to the TLB. For the Power2's TLB, Z = 2 and L = 1024, hence at least 89 TLB roundtrips per element are required. • Is it worth having this much cache and TLB traffic just so an algorithm can make optimum use of register traffic? • The answer depends on the relative costs, and on the possible alternative algorithms.
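As a check, the 2.06 and 89 figures follow directly from the corollary (a worked computation, not part of the slides):

    \[
    T^{-1}(K) = \frac{\sqrt{1+8K}-1}{2}, \qquad
    T^{-1}(32) = \frac{\sqrt{257}-1}{2} \approx 7.52
    \]
    \[
    \text{cache: } \frac{L}{2Z+T^{-1}(K)} = \frac{32}{2\cdot 4 + 7.52} \approx 2.06,
    \qquad
    \text{TLB: } \frac{1024}{2\cdot 2 + 7.52} \approx 89
    \]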

  24. Roundtrips to Registers vs. Roundtrips to Cache (cont.) [Figure: comparison of the theoretical lower-bound cost, in cycles/element, of a BitReverse constrained to one register roundtrip per element against the data-movement cost of COBRA.] The figure suggests that the COBRA algorithm has a significant advantage. In practice it is impossible to design a register-efficient algorithm with only 2.06 cache roundtrips per element (4 is more realistic). However, the projected costs of COBRA are close to those of the actual experiment. Thus, it is better to have a cache-efficient algorithm than a register-efficient one.

  25. Lower bounds on Cache Efficient Algorithms • Theorem 3: Suppose our computer has K registers and a cache that allows at most Z cache lines of A and Z cache lines of B in cache at a time, where each cache line is of length L. Then any program that computes Transpose(A, B) with one cache roundtrip per element must have an average of at least 2 − (2Z + T⁻¹(K))/L register roundtrips per element. • Proof: Since the algorithm is cache-efficient, it has exactly 2M/L cache line moves, where M is the size of each array…

  26. Lower bounds on Cache Efficient Algorithms (cont.) Consider the sequence of data operations from the proof of Theorem 2. If we remove from this sequence all Load's and Store's of elements that make more than one roundtrip into registers, we are left with a sequence that transposes a sparse array. Theorem 2 asserts that this sequence has at least N/(Z + T⁻¹(K)/2) cache line moves, where N is the number of elements in the sparse array. Thus N/(Z + T⁻¹(K)/2) ≤ 2M/L, and so N/M ≤ (2Z + T⁻¹(K))/L. Note that N/M is also the fraction of elements that require only one register roundtrip; the remaining 1 − N/M require at least two. Thus the average number of register roundtrips per element is at least 2(1 − N/M) + N/M = 2 − N/M ≥ 2 − (2Z + T⁻¹(K))/L. ∎

  27. Lower bounds on Cache Efficient Algorithms (cont.) • Theorem 5: Suppose our computer can hold K elements in cache and registers combined, and it has TLB associativity Z and a page length of L elements. Assuming that at most Z pages of A and Z pages of B can be in the TLB at a time, any program that computes Transpose(A, B) with only one cache roundtrip per element must have an average of at least L/(2Z + T⁻¹(K)) TLB roundtrips per element. • Proof: essentially the same as the proof of Theorem 3 and its corollary.

  28. An efficient Bit-Reversal • The Cache-Optimal Bit-Reversal Algorithm (COBRA) is cache-efficient and also has good TLB efficiency. It uses the "square strategy", which is easier to implement. • We represent the indices of the array we wish to bit-reverse, A, with binary strings of the form abc, where a and c are each of length q, and q is chosen to be at least the log of the size of a cache line. Thus if |abc| = lg(N) then |b| = lg(N) − 2q.

  29. An efficient Bit-Reversal (cont.) [Figure: the arrays A, T, and B for N = 64, |a| = |b| = |c| = 2, b = '01'.]

    for b = 0 to 2^(lgN − 2q) − 1
        b' = r(b)
        for a = 0 to 2^q − 1
            a' = r(a)
            for c = 0 to 2^q − 1
                T[a'c] = A[abc]
        for c = 0 to 2^q − 1
            c' = r(c)
            for a' = 0 to 2^q − 1
                B[c'b'a'] = T[a'c]
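A runnable C rendering of this pseudocode (a sketch under assumptions: double elements, a malloc'ed buffer T, and the packing of the bit-string abc into an integer index are my choices, not the paper's exact code):

    #include <stdlib.h>

    /* Reverse the low `bits` bits of x. */
    static unsigned rev(unsigned x, int bits) {
        unsigned r = 0;
        for (int i = 0; i < bits; i++) {
            r = (r << 1) | (x & 1);
            x >>= 1;
        }
        return r;
    }

    /* COBRA sketch: N = 2^lgN, index abc with |a| = |c| = q, |b| = lgN - 2q. */
    void cobra(const double *A, double *B, int lgN, int q) {
        unsigned Q = 1u << q;
        double *T = malloc((size_t)Q * Q * sizeof *T);   /* 2^q x 2^q buffer */
        for (unsigned b = 0; b < (1u << (lgN - 2 * q)); b++) {
            unsigned br = rev(b, lgN - 2 * q);
            /* Gather: T[a'c] = A[abc] -- reads contiguous runs of 2^q elements */
            for (unsigned a = 0; a < Q; a++) {
                unsigned ar = rev(a, q);
                for (unsigned c = 0; c < Q; c++)
                    T[(ar << q) | c] = A[((unsigned long)a << (lgN - q)) | (b << q) | c];
            }
            /* Scatter: B[c'b'a'] = T[a'c] -- writes contiguous runs of 2^q elements */
            for (unsigned c = 0; c < Q; c++) {
                unsigned cr = rev(c, q);
                for (unsigned ar = 0; ar < Q; ar++)
                    B[((unsigned long)cr << (lgN - q)) | (br << q) | ar] = T[(ar << q) | c];
            }
        }
        free(T);
    }

Because q is at least the log of the cache line size, each inner run of 2^q consecutive elements touches whole cache lines, which is exactly the cache-friendliness COBRA is after.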

  30. Experimental Results

  31. Thank You!
