This paper explores the integration of pipelining and data parallelism utilizing SIMD Reconfigurable Architecture (SIMD RA) for enhanced performance and energy efficiency. It compares traditional approaches like ASIC and GPU with new methodologies focusing on coarse-grained reconfigurability. We discuss the mapping techniques, execution models, and various optimization strategies to mitigate issues such as bank conflicts and large compilation times, aiming for superior scalability and utilization of resources in computing architectures.
Exploiting Both Pipelining and Data Parallelism with SIMD RA
Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo* and Yunheung Paek*
*Seoul National University
**UNIST (Ulsan National Institute of Science & Technology)
ARC, March 21, 2012, Hong Kong
Reconfigurable Architecture
• Reconfigurable architecture
  • High performance
  • Flexible (cf. ASIC)
  • Energy efficient (cf. GPU)
[Figure source: ChipDesignMag.com]
Coarse-Grained Reconfigurable Architecture
• Coarse-grained RA
  • Word-level granularity
  • Dynamic reconfigurability
  • Simpler to compile
• Execution model
[Figures: ADRES and MorphoSys; block diagram with main processor, CGRA, main memory, and DMA controller]
Application Mapping
• Place and route the DFG on the PE array mapping space
• The mapping must satisfy several constraints
  • Each node must be mapped to a PE that has the right functionality
  • Data transfer between nodes must be guaranteed
  • Resource consumption should be minimized for performance
[Figure: compilation flow with the labels Application, Front-end, IR, Partitioner, Arch. Param., Seq. Code (conventional C compilation, Assembly), Loops (Mapping for CGRA: DFG generation, Place & Route, Configuration), Extended assembler, Exec. + Config.]
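The first two constraints can be sketched as a validity check on a candidate placement. This is a minimal illustration, not the authors' mapper: the operation classes, the `Node`/`Edge` types, and the rule that data moves only between the same PE or direct mesh neighbours are all assumptions made here, and the check ignores time slots and routing through intermediate PEs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical operation classes, for illustration only. */
enum { OP_ALU = 1, OP_MUL = 2, OP_MEM = 4 };

typedef struct { int op; int pe; } Node;    /* DFG node: required op class, assigned PE */
typedef struct { int src; int dst; } Edge;  /* DFG edge between node indices */

/* True if PEs a and b are the same PE or direct mesh neighbours
   on an array that is `cols` PEs wide (assumed interconnect). */
static bool pe_adjacent(int a, int b, int cols) {
    int dr = abs(a / cols - b / cols);
    int dc = abs(a % cols - b % cols);
    return dr + dc <= 1;
}

/* Checks two constraints for a placement:
   1) each node sits on a PE whose capability bitmask includes its operation,
   2) every edge connects PEs that can exchange data directly. */
bool mapping_is_valid(const Node *nodes, int n_nodes,
                      const Edge *edges, int n_edges,
                      const int *pe_caps, int cols) {
    assert(cols > 0);
    for (int i = 0; i < n_nodes; i++)
        if ((pe_caps[nodes[i].pe] & nodes[i].op) == 0) return false;
    for (int i = 0; i < n_edges; i++)
        if (!pe_adjacent(nodes[edges[i].src].pe, nodes[edges[i].dst].pe, cols))
            return false;
    return true;
}
```

A real mapper would also track which cycle each node occupies, which is where modulo scheduling comes in.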
Software Pipelining
• Modulo scheduling-based mapping
• II: initiation interval (II = 2 cycles in the example)
[Figure: a five-node DFG (nodes 0 to 4, accessing A[i], B[i], and C[i]) scheduled onto PE0 to PE3 over time]
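As a rough illustration of modulo scheduling, the sketch below computes the resource-constrained lower bound on II and checks a schedule against a modulo reservation table. The function names and the fixed table limits are assumptions for this example. With five operations on four PEs the bound is ceil(5/4) = 2, which is consistent with the II = 2 shown on the slide, although the slide does not say how that value was derived.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PES 64
#define MAX_II  64

/* Resource-constrained lower bound on II: ceil(n_ops / n_pes). */
int res_mii(int n_ops, int n_pes) {
    assert(n_ops >= 0 && n_pes > 0);
    return (n_ops + n_pes - 1) / n_pes;
}

/* Checks that no two operations occupy the same PE in the same slot of the
   modulo reservation table; an op issued at cycle t uses slot t % ii. */
bool modulo_schedule_ok(const int *pe, const int *cycle,
                        int n_ops, int n_pes, int ii) {
    assert(n_pes > 0 && n_pes <= MAX_PES && ii > 0 && ii <= MAX_II);
    bool used[MAX_PES][MAX_II] = {{false}};
    for (int i = 0; i < n_ops; i++) {
        assert(pe[i] >= 0 && pe[i] < n_pes && cycle[i] >= 0);
        int slot = cycle[i] % ii;
        if (used[pe[i]][slot]) return false;  /* PE already busy in this slot */
        used[pe[i]][slot] = true;
    }
    return true;
}
```

Two operations on the same PE are fine as long as their issue cycles differ modulo II; this is what lets successive iterations overlap every II cycles.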
Problem - Scalability
• Large-scale CGRAs suffer from several problems
  • Lack of parallelism
    • Limited ILP in general applications
    • Configuration size (in the unrolling case)
  • A very large mapping space must be searched for placement and routing
    • Skyrocketing compilation time
• CGRAs remain at 4x4, or 8x8 at most.
Overview
• Background
• SIMD Reconfigurable Architecture (SIMD RA)
• Mapping on SIMD RA
• Evaluation
SIMD Reconfigurable Architecture
• Consists of multiple identical parts, called cores
  • Identical so that configurations can be reused
  • At least one load-store PE in each core
[Figure: Core 1 to Core 4 connected through a crossbar switch to Bank1 to Bank4]
Advantages of SIMD-RA
• More iterations executed in parallel
  • Scales with the PE array size
• Short compilation time thanks to the small mapping space
• Achieves a more densely scheduled configuration
  • Higher utilization and performance
• The loop must not have loop-carried dependences.
[Figure: iteration timelines for a single large core (Iteration 0 to 5) and for four cores, Core 1 to Core 4 (Iter. 0 to 11)]
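The requirement on loop-carried dependences can be illustrated with a simplified test for a single array accessed at constant offsets from the loop index. This is a sketch under that narrow assumption, not a general dependence analysis, and the function name is made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* For a loop body of the form  X[i + write_off] = f(X[i + read_off])  with
   trip_count iterations, the element written by one iteration is read by the
   iteration (write_off - read_off) positions away. A nonzero distance that is
   smaller than the trip count therefore links two different iterations. */
bool has_loop_carried_dependence(int write_off, int read_off, int trip_count) {
    assert(trip_count >= 0);
    int distance = abs(write_off - read_off);
    return distance != 0 && distance < trip_count;
}
```

For example, `B[i] = A[i] + B[i]` has distance 0 and can be spread across cores, while `X[i] = X[i-1] + 1` has distance 1 and cannot.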
Overview
• Background
• SIMD Reconfigurable Architecture (SIMD RA)
• Bank Conflict Minimization in SIMD RA
• Evaluation
Problems of SIMD RA mapping
• New mapping problem: iteration-to-core mapping
• Iteration mapping affects performance
  • It is related to the data mapping
  • It affects the number of bank conflicts
Example: a loop of 15 iterations to be distributed over Core 1 to Core 4
for (i = 0; i < 15; i++) { B[i] = A[i] + B[i]; }
Mapping schemes
Example loop: for (i = 0; i < 15; i++) { B[i] = A[i] + B[i]; }
• Iteration-to-core mapping
  • Sequential: the four cores run iter. 0-3, iter. 4-7, iter. 8-11, and iter. 12-14
  • Interleaving: the four cores run iter. 0,4,8,12; iter. 1,5,9,13; iter. 2,6,10,14; and iter. 3,7,11
• Data mapping
  • Sequential: A[0] ... A[14] are stored contiguously, followed by B[0] ... B[14]
  • Interleaving: the four banks hold
    • A[0] A[4] A[8] A[12] B[1] B[5] B[9] B[13]
    • A[1] A[5] A[9] A[13] B[2] B[6] B[10] B[14]
    • A[2] A[6] A[10] A[14] B[3] B[7] B[11]
    • A[3] A[7] A[11] B[0] B[4] B[8] B[12]
[Figure: cores and banks connected through the crossbar switch for each scheme]
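The two iteration assignments and the interleaved data mapping can be written as small index functions. This is a sketch with assumed names and zero-based core and bank numbers; it also assumes word-granular interleaving with A starting at word address 0 and B stored immediately after it at word address 15, which reproduces the bank contents listed on the slide.

```c
#include <assert.h>

/* Sequential assignment: each core gets one contiguous chunk of iterations. */
int core_sequential(int iter, int n_iters, int n_cores) {
    assert(n_cores > 0 && iter >= 0 && iter < n_iters);
    int chunk = (n_iters + n_cores - 1) / n_cores;  /* ceil(n_iters / n_cores) */
    return iter / chunk;
}

/* Interleaved assignment: iterations are dealt to the cores round-robin. */
int core_interleaved(int iter, int n_cores) {
    assert(n_cores > 0 && iter >= 0);
    return iter % n_cores;
}

/* Interleaved data mapping: consecutive words go to consecutive banks. */
int bank_interleaved(int word_addr, int n_banks) {
    assert(n_banks > 0 && word_addr >= 0);
    return word_addr % n_banks;
}
```

With these assumptions, A[3] (address 3) and B[0] (address 15) both land in bank 3, and A[0] and B[1] both land in bank 0, matching the rows of the figure.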
Interleaving data placement
• With interleaving data placement, interleaved iteration assignment is better than sequential iteration assignment.
• Weak for strided accesses, which
  • reduce the number of utilized banks
  • increase bank conflicts
[Figure: a configuration with Load A[i] and Load A[2i] on interleaved data, shown for sequential iteration assignment (iter. 0-3, 4-7, 8-11, 12-14) and interleaved iteration assignment (iter. 0,4,8,12; 1,5,9,13; 2,6,10,14; 3,7,11) across the crossbar switch]
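Both claims can be checked with a small counting model. The sketch below is an illustration under stated assumptions, not the authors' evaluation method: the cores run in lockstep, each issues one load of `A[stride * i]` per step, A starts at word address 0, banks are word-interleaved, and every access beyond the first to a bank within a step counts as one conflict.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BANKS 16

/* Counts extra (serialized) accesses to A[stride * i] for a loop of n_iters
   iterations on n_cores lockstep cores with word-interleaved banks.
   interleaved_iters selects round-robin (true) or chunked (false) assignment. */
int count_bank_conflicts(int n_iters, int n_cores, int n_banks,
                         int stride, bool interleaved_iters) {
    assert(n_iters >= 0 && n_cores > 0 && stride >= 0);
    assert(n_banks > 0 && n_banks <= MAX_BANKS);
    int chunk = (n_iters + n_cores - 1) / n_cores;  /* iterations per core */
    int conflicts = 0;
    for (int step = 0; step < chunk; step++) {
        int hits[MAX_BANKS] = {0};
        for (int core = 0; core < n_cores; core++) {
            int iter = interleaved_iters ? step * n_cores + core
                                         : core * chunk + step;
            if (iter >= n_iters) continue;          /* idle core in the last step */
            int bank = (stride * iter) % n_banks;
            if (hits[bank]++ > 0) conflicts++;      /* bank already busy this step */
        }
    }
    return conflicts;
}
```

For the 15-iteration loop on four cores and four banks, this model gives 11 conflicts for `A[i]` with sequential iteration assignment and 0 with interleaved assignment. With `A[2i]` and interleaved assignment only two of the four banks are used, and the count rises to 7.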
Sequential data placement
• Does not work well with SIMD mapping as is
  • Causes frequent bank conflicts
• Data tiling
  • i) array base address modification
  • ii) rearranging data in the local memory
• Sequential iteration assignment with data tiling suits SIMD mapping
[Figure: a configuration with Load A[i]. Without tiling, A[0] ... A[14] and B[0] ... B[14] are stored contiguously. With tiling, the four banks hold A[0..3] B[0..3]; A[4..7] B[4..7]; A[8..11] B[8..11]; and A[12..14] B[12..14]. Both iteration assignments are shown across the crossbar switch.]
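Data tiling can be sketched as an address transformation that puts each core's slice of every array into that core's bank. The `BankAddr` type, the function name, and the in-bank layout (A's tile followed by B's tile) are assumptions chosen to match the figure; the sketch covers unit-stride accesses only.

```c
#include <assert.h>

typedef struct { int bank; int offset; } BankAddr;

/* Data tiling sketch: element `index` of the array with ordinal `array_id`
   (0 for A, 1 for B, ...) is placed in the bank of the core that uses it
   under sequential iteration assignment. `tile` is the number of iterations
   per core. */
BankAddr tiled_address(int array_id, int index, int tile) {
    assert(array_id >= 0 && index >= 0 && tile > 0);
    BankAddr a;
    a.bank = index / tile;                      /* equals the sequential core id */
    a.offset = array_id * tile + index % tile;  /* position inside that bank */
    return a;
}
```

With a tile of 4, a core that runs iterations 4c to 4c+3 finds all of its A and B elements in bank c, so lockstep accesses from different cores never meet in the same bank. The price is the rearranging overhead noted on the slide.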
Summary of Mapping Combinations Analysis
• Two of the four combinations have strong advantages
• Interleaved iteration, interleaved data mapping
  • Weak for strided accesses
  • Simple data management
• Sequential iteration, sequential data mapping (with data tiling)
  • More robust against bank conflicts
  • Data rearranging overhead
Experimental Setup
• Sets of loop kernels from OpenCV, multimedia, and SPEC2000 benchmarks
• Target system
  • Two CGRA sizes: 4x4 and 8x4
  • 2x2 cores, each with one load-store PE and one multiplier PE
  • Mesh + diagonal connections between PEs
  • Full crossbar switch between PEs and local memory banks
• Compared with non-SIMD mapping
  • Original: previous non-SIMD mapping
  • SIMD: our approach (interleaving-interleaving mapping)
Configuration Size
• Reduced by 61% on the 4x4 CGRA and by 79% on the 8x4 CGRA
Runtime
[Chart: runtime comparison; annotated values 29% and 32%]
Conclusion
• Presented the SIMD reconfigurable architecture
  • Exploits data parallelism and instruction-level parallelism at the same time
• Advantages of the SIMD reconfigurable architecture
  • Scales well to a large number of PEs
  • Alleviates the increase in compilation time
  • Increases performance and reduces configuration size
Core size
• For a large loop, a small core might not be a good match
• Merge multiple cores ⇒ macrocore
• No HW modification required
[Figure: Core 1 to Core 4 grouped into Macrocore 1 and Macrocore 2, connected through the crossbar switch to Bank1 to Bank4]
SIMD RA mapping flow
• Check SIMD requirement; on failure, use traditional mapping
• Select core size
• Iteration mapping: Int-Int or Seq-Tiling
• Array placement (implicit) or data tiling
• Operation mapping by modulo scheduling
  • If scheduling fails, increase II and repeat.
  • If scheduling fails and MaxII < II, increase core size.
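The retry rule at the end of the flow can be sketched as two nested loops. Everything named here is hypothetical: `TrySchedule` stands in for a real modulo scheduler, `demo_try_schedule` is a toy substitute used only to exercise the loop, and starting II at the resource lower bound is an assumption, since the slide does not say where II starts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scheduler hook: returns true if the loop can be modulo
   scheduled on a core of core_pes PEs with the given II. */
typedef bool (*TrySchedule)(int core_pes, int ii, void *ctx);

typedef struct { bool ok; int core_pes; int ii; } MapResult;

/* Follows the retry rule on the slide: raise II until it exceeds max_ii,
   then move to the next larger core size. core_sizes is sorted ascending. */
MapResult map_loop(const int *core_sizes, int n_sizes, int n_ops, int max_ii,
                   TrySchedule try_schedule, void *ctx) {
    assert(n_sizes >= 0 && n_ops >= 0 && try_schedule != NULL);
    MapResult r = { false, 0, 0 };
    for (int s = 0; s < n_sizes; s++) {
        int pes = core_sizes[s];
        assert(pes > 0);
        int min_ii = (n_ops + pes - 1) / pes;   /* resource lower bound (assumed start) */
        if (min_ii < 1) min_ii = 1;
        for (int ii = min_ii; ii <= max_ii; ii++) {
            if (try_schedule(pes, ii, ctx)) {
                r.ok = true; r.core_pes = pes; r.ii = ii;
                return r;
            }
        }
    }
    return r;  /* no core size worked; the slide does not specify what follows */
}

/* Toy stand-in for a real scheduler: succeeds when the core offers at least
   *(int *)ctx issue slots per II window. */
bool demo_try_schedule(int core_pes, int ii, void *ctx) {
    int slots_needed = *(int *)ctx;
    return core_pes * ii >= slots_needed;
}
```

Trying the smallest core first keeps the mapping space small and leaves more cores available to run iterations in parallel; a larger core (or macrocore) is used only when the loop does not fit within MaxII.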