Special Course on Computer Architecture 2008

Toolkit ver. 1.0
Special Course on Computer Architecture
This toolkit is based on the “Cell Speed Challenge 2008” from the IPSJ workshop SACSIS (Symposium on Advanced Computing Systems and Infrastructures).

About this document
• Brief information about toolkit ver. 1.0
  • Tools for solving simultaneous linear equations with multiple SPEs
• Please refer to the implementation guide for your homework.
• This document explains the algorithm used to solve simultaneous linear equations.

Summary of homework
• Your task is to write a parallel program for solving simultaneous linear equations.
• Programs compete on performance: how fast they obtain the solution matrix x from the constant matrix A and the right-hand-side matrix b.
  • A is a matrix of N×N elements (each element is a float: 4 bytes).
  • x and b are matrices of M×N elements.
• Contact: yosimi@am.ics.keio.ac.jp

Toolkit ver. 1.0
• Solves simultaneous linear equations with multiple SPEs.
  • The number of SPEs can be modified. The default is 6 (the maximum).
• Limitation
  • The size of the matrices MUST BE a multiple of 32 (N=32n).
• Implement your program by modifying the function spe_soleqs() in spe1.c.
  • Modifications anywhere else will be ignored in the evaluation.
  • You may change any of the code inside spe_soleqs().

Initial data distribution 1/2
• Matrices A, b, and x are laid out in main memory as follows:
• buf
  • The head address of the working memory available to users.
  • It must be aligned to 128 bytes.
  • The constant matrix A (N×N×4 bytes) is stored starting at buf.
• b = buf+N×N×sizeof(float)
  • The head address of the region that stores the right-hand vectors b (M×N×4 bytes).
  • Note: elements are ordered in the column direction (data are stored in the order (0,0),…,(0,N-1),(1,0),…,(1,N-1),…,(M-1,N-1)).
• x = buf+(N×N+M×N)×sizeof(float)
  • The head address of the solution vectors x (M×N×4 bytes).
  • The ordering of data is the same as for b.
[Figure: main memory holding A, b, x, and the blank region, next to the SPEs.]
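The address arithmetic above can be sketched in C as follows. This is only an illustration: the helper names (offset_b, offset_x, offset_blank, vec_index) are hypothetical and are not part of the toolkit, and the sketch assumes sizeof(float) is 4.

```c
#include <stddef.h>

/* Hypothetical helpers, not toolkit functions.
   n = order of A, m = number of right-hand vectors.
   All offsets are in bytes, measured from buf. */
static size_t offset_b(size_t n)
{
    return n * n * sizeof(float);            /* b starts right after A */
}

static size_t offset_x(size_t n, size_t m)
{
    return (n * n + m * n) * sizeof(float);  /* x starts right after b */
}

static size_t offset_blank(size_t n, size_t m)
{
    return n * (n + 2 * m) * sizeof(float);  /* blank region follows x */
}

/* Index (in floats) of element k of vector v inside b or x, following
   the stated order (0,0),...,(0,N-1),(1,0),...,(M-1,N-1). */
static size_t vec_index(size_t n, size_t v, size_t k)
{
    return v * n + k;
}
```

For N=32 and M=2, for example, b starts 4096 bytes after buf and x starts 4352 bytes after buf.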

Initial data distribution 2/2
• Blank region
  • Head address: buf+N×(N+2M)×sizeof(float)
  • It is allocated by the PPE program, and you can use this region.
  • Its size is the same as the total size of the matrices, N×(N+2M)×sizeof(float).
• Mapped addresses for transfers between SPEs: ls_addr[0] to ls_addr[5]
  • No physical memory is allocated for them.
  • Each of them is 256 KB.
  • You can transfer data to the local store of each SPE by accessing these regions directly.
• Memory allocation is kept below 80 MB.
  • The total size of matrices A, b, and x is therefore guaranteed to be less than half of the allocated memory.
• N is a multiple of 32.
[Figure: main memory holding A, b, x, and the blank region, with ls_addr[0] to ls_addr[5] mapped to the SPEs.]

Ordering of elements in matrices (new)
• Note: the ordering of elements in matrix A is not the same as in the other matrices.
[Figure: layout of A and b in main memory, with address labels buf, buf+4, and buf+N×4 inside A, and buf+N×N×sizeof(float) at the head of b.]

The algorithm adopted by the toolkit
• LU decomposition
  • Pivoting
• Forward substitution
• Backward substitution
• Pivoting is always performed, regardless of the form and size of the matrices.

Subroutines for DMA transfer (1/2) (updated)
• Functions for DMA transfer
  • dmaget, dmaput: subroutines for DMA transfer in the toolkit
• void dmaget_burst(unsigned int ppe_addr, unsigned int spe_addr, unsigned int row, unsigned int col, unsigned int n)
  • Reads 128 bytes from main memory at the element (col, row) of the matrix whose head address is ppe_addr (each element is a float) and stores the data in the LocalStore at head address spe_addr.
  • The requested element can then be fetched with *(float*)(spe_addr+row%32*sizeof(float)).
  • Take care to identify the position of the element within the matrix correctly.
[Figure: a 128-byte-aligned block around element (col, row) of the n×n matrix at ppe_addr in PPE main memory is copied to spe_addr in the SPE LocalStore.]
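The index arithmetic behind a 128-byte burst can be sketched as follows. This is a guess at the addressing, not the toolkit's code: it ASSUMES that element (col, row) sits at float index col×n+row (so that row is the fastest-varying index, which is consistent with the row%32 expression above) and that the burst is the 128-byte-aligned block containing the element. The names burst_start and burst_slot are hypothetical.

```c
#include <stddef.h>

#define BURST_FLOATS 32u  /* 128 bytes / 4-byte float */

/* Float index, relative to the matrix head, of the first element of the
   aligned 32-float block that contains element (col, row).
   ASSUMPTION: element (col, row) is stored at index col * n + row. */
static size_t burst_start(unsigned row, unsigned col, unsigned n)
{
    size_t idx = (size_t)col * n + row;
    return idx - (row % BURST_FLOATS);
}

/* Position of the requested element inside the fetched block. */
static unsigned burst_slot(unsigned row)
{
    return row % BURST_FLOATS;
}
```

Because n is a multiple of 32, the block start computed this way is always a multiple of 32 floats, so it stays 128-byte aligned whenever the matrix head is.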

Subroutines for DMA transfer (2/2) (updated)
• float dmaget_value(unsigned int addr, unsigned int row, unsigned int col, unsigned int n)
  • Reads one element, at coordinates (col, row), from an n×n matrix. The matrix is stored in main memory, and its head address is addr.
• void dmaput_value(unsigned int addr, unsigned int row, unsigned int col, unsigned int n, float value)
  • Writes value to the element at coordinates (col, row) of an n×n matrix. The matrix is stored in main memory, and its head address is addr.
  • Note that the value is NOT synchronized among SPEs.
[Figure: dmaget_value and dmaput_value move a single element (col, row) between the n×n matrix at addr in PPE main memory and the SPE.]

Synchronization
• Subroutines for synchronization
  • SPE0 is in charge of DMA synchronization among the SPEs.
• void sync(UINT32 id, UINT32* ppe_ls, volatile struct spe_sync* sd, UINT32 key)
  • id: ID number of the SPE
  • ppe_ls: array of the main-memory addresses to which the LocalStore of each SPE is mapped
  • sd: array with local addresses in the SPE
  • key: a key used for synchronization
• In function sync:
  • SPE0 writes the value key to the variable start_flag of the other SPEs, whose address is given by sp. The SPEs other than SPE0 start their calculation once start_flag == key holds.
  • SPE1 to SPE5 write the value key to SPE0's variables (sd[id].end_flag). SPE0 treats the calculation of SPE1 to SPE5 as finished once each end_flag == key holds.
• Users can set key to any value, but beware of conflicts with other calls to the sync functions.
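The flag protocol described above can be modeled on a single host thread as follows. This is a simplified sketch, not the toolkit's sync(): struct sync_model and the four function names are hypothetical, and the real implementation writes the flags into other SPEs' local stores by DMA and then polls them, which this model does not reproduce.

```c
#include <stdint.h>

#define NUM_SPE 6

/* Hypothetical stand-in for the toolkit's struct spe_sync. */
struct sync_model {
    volatile uint32_t start_flag;  /* written by SPE0, read by the worker */
    volatile uint32_t end_flag;    /* written by the worker, read by SPE0 */
};

/* SPE0 side: release SPE1..5 by writing key to their start flags. */
static void master_release(struct sync_model sd[NUM_SPE], uint32_t key)
{
    for (int id = 1; id < NUM_SPE; id++)
        sd[id].start_flag = key;
}

/* Worker side: true once SPE0 has released this SPE with this key. */
static int worker_may_start(const struct sync_model *me, uint32_t key)
{
    return me->start_flag == key;
}

/* Worker side: report completion to SPE0. */
static void worker_report_done(struct sync_model sd[NUM_SPE], int id,
                               uint32_t key)
{
    sd[id].end_flag = key;
}

/* SPE0 side: true once SPE1..5 have all reported this key. */
static int master_all_done(const struct sync_model sd[NUM_SPE], uint32_t key)
{
    for (int id = 1; id < NUM_SPE; id++)
        if (sd[id].end_flag != key)
            return 0;
    return 1;
}
```

Using a different key for each synchronization point keeps a flag left over from an earlier round from being mistaken for the current one, which is the conflict the last bullet warns about.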

LU decomposition
• The following procedures are repeated N times (i=0 to N-1):
  • Pivot selection (selection of the row with the largest element)
  • Row swapping
  • LU decomposition (right-looking method)
• Partial matrix handled at each step (the full matrix is n×n):
  • i=0: N×N
  • i=1: (N-1)×(N-1)
  • i=2: (N-2)×(N-2)
  • i=3: (N-3)×(N-3)
  • i=N-1: 1×1

1. Pivot selection
• Pivoting function: searches for the row with the maximum i-th value.
• Parallel task using 6 SPEs:
  • Each SPE reads the i-th value of each of its rows (using the dmaget_value function). An SPE handles a row if (row number)%6 is equal to its own ID.
  • It reports the row number maxj with the maximum value to SPE0 (using the sync_collect functions).
  • SPE0 selects the row with the maximum value among all the SPEs.
[Figure: SPE0 to SPE5 scan the rows of the (n-i)×(n-i) partial matrix and report their row numbers to SPE0, which finds the row with the maximum i-th value.]
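The row-striped search can be sketched sequentially in C as follows. This is a minimal illustration under stated assumptions, not the toolkit's pivoting function: the matrix is held in an ordinary row-major array instead of being read with dmaget_value, "maximum" is taken to mean largest absolute value (the usual choice in partial pivoting, which the slide does not state), and the names local_pivot and select_pivot are hypothetical.

```c
#include <stddef.h>

#define NUM_SPE 6

static float abs_val(float v) { return v < 0.0f ? -v : v; }

/* Work of one SPE: scan rows j in [i, n) with j % NUM_SPE == id and return
   the row whose i-th value is largest in magnitude, or -1 if this SPE owns
   no row in that range. a is row-major (ASSUMED), n x n. */
static int local_pivot(const float *a, int n, int i, int id)
{
    int best = -1;
    float best_abs = -1.0f;
    for (int j = i; j < n; j++) {
        if (j % NUM_SPE != id)
            continue;
        float v = abs_val(a[(size_t)j * n + i]);
        if (v > best_abs) {
            best_abs = v;
            best = j;
        }
    }
    return best;
}

/* Work of SPE0: pick the best row among the candidates of all SPEs. */
static int select_pivot(const float *a, int n, int i)
{
    int maxj = i;
    for (int id = 0; id < NUM_SPE; id++) {
        int cand = local_pivot(a, n, i, id);
        if (cand >= 0 &&
            abs_val(a[(size_t)cand * n + i]) > abs_val(a[(size_t)maxj * n + i]))
            maxj = cand;
    }
    return maxj;
}
```

In the toolkit the six local searches would run concurrently, one per SPE, with only the final reduction done on SPE0.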

2. Swapping of rows & columns
• swap_row function
  • Each SPE swaps the rows indicated by the arguments.
  • Swaps the i-th row of matrix A with row maxj.
  • 32 elements are swapped at once (dmaget, dmaput).
• swap_col function
  • Swaps the i-th column of matrix b with column maxj.
[Figure: SPE0 to SPE5 exchanging the i-th row and row maxj of the n×n matrix.]
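Swapping two rows 32 elements at a time can be sketched as follows. This is an illustration only: memcpy stands in for the toolkit's dmaget and dmaput, the matrix is ASSUMED to be a row-major array in ordinary memory, the work is done by one thread instead of being shared among the SPEs, and the name swap_rows_chunked is hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 32  /* floats per 128-byte transfer */

/* Swap rows r1 and r2 of a row-major n x n matrix, one 32-float chunk at a
   time. n must be a multiple of CHUNK, as the toolkit requires. */
static void swap_rows_chunked(float *a, size_t n, size_t r1, size_t r2)
{
    float t1[CHUNK], t2[CHUNK];  /* two local buffers, as in a local store */
    for (size_t c = 0; c < n; c += CHUNK) {
        float *p1 = a + r1 * n + c;
        float *p2 = a + r2 * n + c;
        memcpy(t1, p1, sizeof t1);  /* stands in for dmaget */
        memcpy(t2, p2, sizeof t2);
        memcpy(p1, t2, sizeof t2);  /* stands in for dmaput */
        memcpy(p2, t1, sizeof t1);
    }
}
```

Each chunk is independent of the others, so the chunks of a row pair can be divided among several SPEs without further coordination.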

3. LU decomposition with the right-looking method (updated)
• lu_decomposition
  • Allots the partial matrices to multiple SPEs in units of rows.
  • The allotment follows the same procedure as in pivot selection.
• The following procedures 1-3 must be repeated N times to decompose matrix A:
  1. The element at (R1, R2) is stored in the variable diag (dmaget_value).
  2. The elements of row R1 are stored in buf2, beginning from the second element of row R1 (for SPE0 to SPE5).
  3. The value t1, the quotient diag/element (i, row), is written back.
  4. The elements of the i-th row are stored in buf2, beginning from the second element of the i-th row.
  5. buf1 - buf2×t1 is calculated for each element and written back to the i-th row.
  6. Procedures 2, 4, and 5 are repeated until the last row is reached. (Use buf3 when needed.)
[Figure: SPE0 to SPE5 each process their own rows using the local buffers buf1, buf2, and buf3.]
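For reference, one elimination step of the right-looking method can be sketched in its standard textbook form as follows. This is not the toolkit's lu_decomposition and does not reproduce its buf1/buf2/buf3 scheme: it ASSUMES a row-major array in ordinary memory, omits pivoting (so the diagonal element must be nonzero), and runs the per-SPE row assignment sequentially. The names lu_step and lu_decompose_nopivot are hypothetical.

```c
#include <stddef.h>

#define NUM_SPE 6

/* Step i of right-looking LU on a row-major n x n matrix, in place.
   Rows below the pivot row are assigned to SPE (row % NUM_SPE); the outer
   loop over id is a sequential stand-in for the SPEs working in parallel. */
static void lu_step(float *a, size_t n, size_t i)
{
    float diag = a[i * n + i];                 /* pivot element */
    for (size_t id = 0; id < NUM_SPE; id++) {
        for (size_t r = i + 1; r < n; r++) {
            if (r % NUM_SPE != id)
                continue;
            float t1 = a[r * n + i] / diag;    /* multiplier, stored as L */
            a[r * n + i] = t1;
            for (size_t c = i + 1; c < n; c++) /* update the trailing row */
                a[r * n + c] -= t1 * a[i * n + c];
        }
    }
}

/* Full factorization without pivoting: L (unit diagonal, stored below the
   diagonal) and U (on and above the diagonal) overwrite a. */
static void lu_decompose_nopivot(float *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        lu_step(a, n, i);
}
```

Within one step the row updates are independent of each other, which is why the rows can be handed out to the SPEs in the same round-robin way as in pivot selection.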

Forward and backward substitution
• forward_substitution and backward_substitution functions
  • Refer to the source code for details.
• The work is divided among the SPEs by solution vector.
  • When the number of solution vectors is less than 6, some SPEs may have no work in these functions.
• forward_substitution uses the “blank region” to store intermediate data.
• The result of backward substitution is written to x in main memory.
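The two substitution passes for a single right-hand vector can be sketched as follows. This is the standard textbook form, not the toolkit's source: it ASSUMES the factors are stored in place in one row-major array (L below the diagonal with an implicit unit diagonal, U on and above it), that any row swaps from pivoting have already been applied to the vector, and that the vector is updated in place instead of going through the blank region. The names forward_subst and backward_subst are hypothetical.

```c
#include <stddef.h>

/* Solve L y = b in place: on entry v holds b, on exit it holds y.
   L is the strictly lower part of lu, with an implicit unit diagonal. */
static void forward_subst(const float *lu, size_t n, float *v)
{
    for (size_t r = 1; r < n; r++)
        for (size_t c = 0; c < r; c++)
            v[r] -= lu[r * n + c] * v[c];
}

/* Solve U x = y in place: on entry v holds y, on exit it holds x.
   U is the upper part of lu, including the diagonal. */
static void backward_subst(const float *lu, size_t n, float *v)
{
    for (size_t r = n; r-- > 0; ) {
        for (size_t c = r + 1; c < n; c++)
            v[r] -= lu[r * n + c] * v[c];
        v[r] /= lu[r * n + r];
    }
}
```

One call pair handles one vector, which matches the division of work by solution vector: different vectors do not depend on each other, so each SPE can process its own vectors independently.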

