Transitioning from Algorithms to Software. Thomas Kue Southern Arkansas University Dr. Ernst Leiss University of Houston REU Summer 2011. Outline. VMM and the Memory Hierarchy Problem The First Experiment The Improved Algorithm Results Retesting the Algorithm
Transitioning from Algorithms to Software
Thomas Kue, Southern Arkansas University
Dr. Ernst Leiss, University of Houston
REU Summer 2011
Outline • VMM and the Memory Hierarchy Problem • The First Experiment • The Improved Algorithm • Results • Retesting the Algorithm • Revisiting Example: Adding Matrices • A Basic Algorithm • Adding Matrices: Summary • Providing Implementation • Results • Conclusion
VMM and the Memory Hierarchy Problem
• Scientific computing often requires massive data sets
• Virtual Memory Manager (VMM): divides the program into ‘pages’
• Out-of-core program: a program that transfers data between main memory and hard disk while it runs
• The goal: reduce the frequency of data transfer to and from the hard disk
The First Experiment
• We start with Algorithm 1:
    while more input do {
        read a triple [i,j,x];
        M[i,j] := M[i,j] + x;
    }
• Premise:
  • M is a zero n×n matrix larger than main memory.
  • A data item is a triple [i,j,x], where i and j (1≤i,j≤n) are row and column indices and x is a real value to be added to M[i,j].
  • Each data item is randomly assigned.
  • Because of this randomness, locality is poor.
• Algorithm analysis shows the number of page swaps for Algorithm 1 is 9m/10, where m is the number of data items [Leiss, 2007].
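Algorithm 1 can be sketched in C as follows. This is a minimal illustration, not the code used in the experiment: it assumes the triples are handed over as an in-memory array rather than read from input, uses 0-based indices, and stores M as a flat row-major array of `double`. The names `Triple` and `apply_triples` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One data item [i, j, x]; indices are 0-based here (the slides use 1-based). */
typedef struct {
    size_t i;
    size_t j;
    double x;
} Triple;

/* Algorithm 1: add each triple's value into M in arrival order.
   M is an n-by-n matrix stored as a flat row-major array (an assumption).
   Consecutive triples may touch unrelated pages, hence the poor locality. */
void apply_triples(double *M, size_t n, const Triple *items, size_t m)
{
    for (size_t k = 0; k < m; ++k) {
        assert(items[k].i < n && items[k].j < n);
        M[items[k].i * n + items[k].j] += items[k].x;
    }
}
```

The loop body is a single update, so all of the cost in the out-of-core case comes from where `M[i*n + j]` happens to live, not from the arithmetic.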
An Improved Algorithm
• An improved algorithm, Algorithm 1’, is proposed
• M is divided into 10 even sections (each section is size n/10)
• Subsequences S_t hold all data items corresponding to section t = 2,3,…,10.
• Algorithm analysis shows that the number of block transfers is 9m/(5B), where B is the block size [Leiss, 2007].
    allocate M_1 in the available main memory and initialize it to 0;
    set the sequence S_t to empty, for all t = 2,…,10;
    while more input do {
        read a triple [i,j,x];
        if [i,j] is in M_1 then
            M[i,j] := M[i,j] + x;
        else {
            determine t such that [i,j] is in M_t;
            append [i,j,x] to the sequence S_t;
        }
    }
    for t := 2 to 10 do {
        write M_(t-1) to disk;
        allocate M_t in the available main memory and initialize it to 0;
        while more input in S_t do {
            read a triple [i,j,x] from S_t;
            M[i,j] := M[i,j] + x;
        }
    }
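The deferral idea behind Algorithm 1’ can be sketched in C as below. Several things here are assumptions made for illustration: the sections are taken to be bands of consecutive rows (the slides do not say how M is divided), the sequences S_t are kept as growable in-memory arrays where a real out-of-core version would keep them on disk, and M is a single flat row-major array rather than one section allocated at a time. The names `Seq`, `seq_append` and `apply_triples_sectioned` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { size_t i, j; double x; } Triple;

/* Growable sequence S_t; a real out-of-core version would keep this on disk. */
typedef struct { Triple *data; size_t len, cap; } Seq;

static int seq_append(Seq *s, Triple t)
{
    if (s->len == s->cap) {
        size_t cap = s->cap ? 2 * s->cap : 16;
        Triple *p = realloc(s->data, cap * sizeof *p);
        if (!p) return -1;
        s->data = p;
        s->cap = cap;
    }
    s->data[s->len++] = t;
    return 0;
}

/* Algorithm 1' sketch: M is split into `sections` bands of consecutive rows.
   Triples for band 0 are applied at once; the rest are deferred to S_t and
   applied band by band, so only one band is being updated at any time.
   Returns 0 on success, -1 if memory for the sequences runs out
   (in that case M is left partially updated). */
int apply_triples_sectioned(double *M, size_t n, size_t sections,
                            const Triple *items, size_t m)
{
    assert(sections > 0 && n > 0);
    size_t rows = (n + sections - 1) / sections;   /* rows per band */
    Seq *S = calloc(sections, sizeof *S);
    int rc = 0;
    if (!S) return -1;

    for (size_t k = 0; k < m && rc == 0; ++k) {
        assert(items[k].i < n && items[k].j < n);
        size_t t = items[k].i / rows;              /* band holding row i */
        if (t == 0)
            M[items[k].i * n + items[k].j] += items[k].x;
        else
            rc = seq_append(&S[t], items[k]);
    }
    for (size_t t = 1; t < sections && rc == 0; ++t)
        for (size_t k = 0; k < S[t].len; ++k)
            M[S[t].data[k].i * n + S[t].data[k].j] += S[t].data[k].x;

    for (size_t t = 0; t < sections; ++t)
        free(S[t].data);
    free(S);
    return rc;
}
```

Note that the deferred sequences together can hold up to m triples, so the sketch trades random access to M for extra storage of the input.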
Results
• Both algorithms were written in C++ and compiled with GCC. Table 1 shows the execution times of Algorithm 1 and Algorithm 1’.
• The program for Algorithm 1’ crashed for n≥1600 due to excessive memory use.
• Translating the C++ code into Java and C yielded similar results.
• Because n was small enough that the VMM was not invoked (i.e. the program ran in-core), we could neither confirm nor refute the claimed improvement of Algorithm 1’.
• The data from this experiment could not be used.
Algorithm 1 Retested
• After the previous experiment failed to run properly, we set out to show the real problem of the original algorithm when transitioning from algorithms to software.
• We set m = 16000 × 16000 × 100
• With m held constant, every run processes the same number of data items and should produce similar timings.
• Seemingly good algorithms may not yield implementations as efficient as anticipated
• But do the proposed improved algorithms actually show improvements in implementation?
Revisiting Example: Adding Matrices
• We have two matrices, A and B, each of size n²
• n is large enough that the matrices cannot fit within main memory
• VMM is invoked and paging occurs
A Basic Algorithm
    for i := 1 to n do
        for j := 1 to n do
            C[i,j] = A[i,j] + B[i,j]
Transitioning Into Software
• Because memory is linear, these 2-dimensional matrices must be mapped into the 1-dimensional memory
    A11 A12 … A1n
    A21 A22 … A2n
     ⋮   ⋮  …  ⋮
    An1 An2 … Ann
• Row Major Mapping. Memory: A11 A12 … A1n …
• Column Major Mapping. Memory: A11 A21 … An1 …
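The two mappings amount to two offset formulas. The sketch below assumes 0-based indices and a square n×n matrix; the function names `row_major_offset` and `col_major_offset` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j) of an n-by-n matrix, 0-based, row-major:
   rows are laid out one after another. */
size_t row_major_offset(size_t i, size_t j, size_t n)
{
    assert(i < n && j < n);
    return i * n + j;
}

/* Same element under column-major mapping:
   columns are laid out one after another. */
size_t col_major_offset(size_t i, size_t j, size_t n)
{
    assert(i < n && j < n);
    return j * n + i;
}
```

For n = 4, the element the slides call A12 (i = 0, j = 1 here) sits at offset 1 under row-major mapping and at offset 4 under column-major mapping.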
Problems Transitioning Into Software
• Assume column major mapping
• Assume one column = one page
• Assume memory can hold three pages (1 from each matrix)
Our basic algorithm:
    for i := 1 to n do
        for j := 1 to n do
            C[i,j] = A[i,j] + B[i,j]
    A11 A12 … A1n
    A21 A22 … A2n
     ⋮   ⋮  …  ⋮
    An1 An2 … Ann
Memory: A11 A21 … An1 A12 A22 … An2
• Total # of page swaps: 3n²
Problems Transitioning into Software
• The interaction between the algorithm and the VMM plays an important role in software performance
Modifying the algorithm:
    for j := 1 to n do
        for i := 1 to n do
            C[i,j] = A[i,j] + B[i,j]
    A11 A12 … A1n
    A21 A22 … A2n
     ⋮   ⋮  …  ⋮
    An1 An2 … Ann
Memory: A11 A21 … An1
• Total # of page swaps: 3n
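The 3n² and 3n totals can be checked with a small counting model in C. This is only a simulation of the stated assumptions (column-major storage, one column per page, room for one resident page per matrix), not a measurement of a real VMM, and it counts the first load of each page as a swap. The function name `count_page_loads` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count page loads for C[i,j] = A[i,j] + B[i,j] under the slide's model:
   column-major storage, one column per page, and room for exactly one
   resident page per matrix. row_outer != 0 selects the i-outer (row by row)
   loop order; 0 selects the j-outer (column by column) order. */
unsigned long long count_page_loads(size_t n, int row_outer)
{
    size_t resident = (size_t)-1;          /* no column loaded yet */
    unsigned long long loads = 0;

    for (size_t outer = 0; outer < n; ++outer) {
        for (size_t inner = 0; inner < n; ++inner) {
            size_t j = row_outer ? inner : outer;   /* column being touched */
            if (j != resident) {
                resident = j;
                loads += 3;                /* one page each for A, B and C */
            }
        }
    }
    return loads;
}
```

With n = 100 the model gives 30000 loads for the i-outer order and 300 for the j-outer order, matching 3n² and 3n.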
Adding Matrices: Summary
• Using the first algorithm produces bad software with I/O complexity 3n²
• Using the second algorithm produces good software with I/O complexity 3n, a factor of n fewer page swaps than the first
• Achieving the goal:
  • Restructuring the program to reduce disk I/O
Our basic algorithm:
    for i := 1 to n do
        for j := 1 to n do
            C[i,j] = A[i,j] + B[i,j]
Modifying the algorithm:
    for j := 1 to n do
        for i := 1 to n do
            C[i,j] = A[i,j] + B[i,j]
Providing Implementation
• Both algorithms were implemented in C and compiled with GCC
• We refer to the column-traversing algorithm as Algorithm 2, and the row-traversing algorithm as Algorithm 2’.
• The C language uses row-major mapping, so the slower algorithm will be the one that traverses the matrix via columns (i.e. Algorithm 2’ will perform better than Algorithm 2)
Algorithm 2’:
    for i := 1 to n do
        for j := 1 to n do
            C[i,j] = A[i,j] + B[i,j]
Algorithm 2:
    for j := 1 to n do
        for i := 1 to n do
            C[i,j] = A[i,j] + B[i,j]
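In C, the two loop orders might look like the sketch below. It is an illustration under assumptions, not the code from the experiment: the element type `double`, the flat row-major arrays, the 0-based indices and the function names `add_row_wise` and `add_col_wise` are all choices made here.

```c
#include <assert.h>
#include <stddef.h>

/* Algorithm 2': row-wise traversal. The inner loop walks consecutive
   addresses of the row-major arrays. */
void add_row_wise(const double *A, const double *B, double *C, size_t n)
{
    assert(A && B && C);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            C[i * n + j] = A[i * n + j] + B[i * n + j];
}

/* Algorithm 2: column-wise traversal. The inner loop strides by n elements,
   so consecutive accesses land n * sizeof(double) bytes apart. */
void add_col_wise(const double *A, const double *B, double *C, size_t n)
{
    assert(A && B && C);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i)
            C[i * n + j] = A[i * n + j] + B[i * n + j];
}
```

Both functions compute the same result; they differ only in the order in which memory is touched, which is the property the experiment measures.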
Adding Matrices: Results
• The results show that Algorithm 2’ provides better performance for all n≥500.
• For Algorithm 2, the time does not grow linearly for any n (i.e. it does not perform as it should).
• We improved the performance of Algorithm 2 by a factor of 10 for n=16000 by applying a loop interchange.
• The results show that a good algorithm can produce poorly performing out-of-core programs
• The results confirm that our improved algorithm also performs better in execution
Conclusion
• Seemingly good algorithms can produce poorly performing out-of-core programs.
• The performance of out-of-core programs can be improved via loop transformations using information from algorithm and dependence analysis.
• The intent is to use this information to develop a tool that automatically applies the methods discussed in this project to improve the performance of out-of-core applications.
References • Leiss, E. (2007). A Programmer’s Companion to Algorithm Analysis. Boca Raton, FL: Chapman & Hall/CRC.