MacroSS: Macro-SIMDization of Streaming Applications. Amir Hormati*, Yoonseo Choi‡, Mark Woh*, Manjunath Kudlur†, Rodric Rabbah‡, Trevor Mudge*, Scott Mahlke*. * Advanced Computer Arch. Lab., University of Michigan. ‡ IBM T.J. Watson Research Center. † Nvidia Corp.
Importance of SIMD • Energy- and area-efficient way to exploit data-level parallelism • Performance in multimedia and communication apps • Ubiquitous in modern processors • Intel: SSE, Larrabee • IBM: Altivec, Cell SPE • ARM: Neon [Figure: block diagrams labeled Control Unit, Functional Units, Cache]
Stream Computing • Prevalent in embedded, desktop and server systems • Many optimizations for mapping and scheduling applications to parallel architectures • Retargetability is a big plus in streaming languages • Task, pipeline, and data-level parallelism are mapped into core-level parallelism • Data-level parallelism on SIMD engines is not utilized
Why are SIMD engines under-utilized? • Finding data-level parallelism suitable for SIMD engines • Proper data alignment • Complicated compiler optimizations and transformations • Wide variety of SIMD standards
In this work… • Macro-level SIMDization techniques for streaming languages. • MacroSS compiler for StreamIt language • Hardware-based buffer optimizations for packing/unpacking operations • Evaluation of MacroSS on Intel Core i7
StreamIt • Main constructs: • Filter: encapsulates computation (stateful or stateless) • Pipeline: expresses pipeline parallelism • Splitjoin: expresses task/data-level parallelism • Exposes different types of parallelism • Scheduling and rate-matching are needed [Figure: example filter, pipeline, and splitjoin]
Macro SIMDization • SIMDization at graph level • Tunes the graph based on the target system • SIMD standards • Wide/Narrow SIMD • Actor SIMDization: • Single-Actor • Vertical • Horizontal
Single-Actor SIMDization Overview [Figure: four panels showing firings of actor E: Serial Execution, Execution Reordering, Realistic Vectorization, Ideal Vectorization]
Single Actor SIMDization • Only stateless actors • Scalar buffer accesses • Strided pushes and pops
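The three bullets above can be illustrated with a small model. This is a minimal sketch, not the MacroSS implementation: the actor (pop 2, push 2), the lane count of 4, and all function names are assumptions chosen for illustration, and Python lists stand in for vector registers. Lane i executes firing i, so the k-th pop of every firing is gathered with a stride equal to the pop rate, and results are scattered back with a stride equal to the push rate.

```python
SIMD_WIDTH = 4   # assumed lane count (e.g. four 32-bit lanes in 128 bits)
POP_RATE = 2     # hypothetical actor: consumes 2 items per firing
PUSH_RATE = 2    # ... and produces 2 items per firing


def work_scalar(inp, out):
    """One firing of a stateless actor: pop a and b, push a+b and a-b."""
    a = inp.pop(0)
    b = inp.pop(0)
    out.append(a + b)
    out.append(a - b)


def work_simd(inp, out):
    """SIMD_WIDTH firings at once; lane i performs firing i."""
    block = [inp.pop(0) for _ in range(SIMD_WIDTH * POP_RATE)]
    # Strided pops: the k-th pop of every firing forms one vector.
    a = block[0::POP_RATE]
    b = block[1::POP_RATE]
    sums = [x + y for x, y in zip(a, b)]    # models one vector add
    diffs = [x - y for x, y in zip(a, b)]   # models one vector subtract
    # Strided pushes: lane i's k-th result goes to slot i * PUSH_RATE + k.
    packed = [0] * (SIMD_WIDTH * PUSH_RATE)
    packed[0::PUSH_RATE] = sums
    packed[1::PUSH_RATE] = diffs
    out.extend(packed)


def run_scalar(data):
    inp, out = list(data), []
    while len(inp) >= POP_RATE:
        work_scalar(inp, out)
    return out


def run_simd(data):
    inp, out = list(data), []
    while len(inp) >= SIMD_WIDTH * POP_RATE:
        work_simd(inp, out)
    return out
```

Because the actor is stateless, the four firings are independent and the vectorized run produces the same stream as the scalar run. The buffers themselves stay scalar; only the accesses are strided.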
Why Scalar Buffers? [Figure: buffer holding items 0 to 23, laid out in rows of four items (128 bits per row)]
Horizontal SIMDization • Find isomorphic actors in split/join structures • The isomorphic actors are merged into one vectorized actor • Actors can be either stateful or stateless [Figure: Source, Splitter, n parallel branches (A1 ... An, B1 ... Bn, C1 ... Cn), Joiner, Sink]
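A rough model of the merge step, under stated assumptions: the splitter and joiner are taken to be round-robin with one item per branch, each branch is a hypothetical stateful actor (a running sum scaled by a per-branch gain), and Python lists stand in for vector registers. None of these names come from MacroSS. Lane i of the merged actor carries the constant and the state of branch i, which is why stateful actors can also be handled here.

```python
def make_scalar_actor(gain):
    """One splitjoin branch: a stateful running sum scaled by `gain`."""
    state = {"acc": 0}

    def work(x):
        state["acc"] += x
        return gain * state["acc"]

    return work


class HorizontalActor:
    """Merged actor: lane i holds the constant and state of branch i."""

    def __init__(self, gains):
        self.gains = list(gains)
        self.acc = [0] * len(self.gains)

    def work(self, xs):
        # One vector add and one vector multiply replace n scalar firings.
        self.acc = [a + x for a, x in zip(self.acc, xs)]
        return [g * a for g, a in zip(self.gains, self.acc)]


def run_splitjoin(data, gains):
    """Reference: round-robin split, n scalar branches, round-robin join."""
    actors = [make_scalar_actor(g) for g in gains]
    n = len(actors)
    return [actors[i % n](x) for i, x in enumerate(data)]


def run_horizontal(data, gains):
    """Same graph with the n isomorphic branches merged into one actor."""
    merged = HorizontalActor(gains)
    n = len(gains)
    out = []
    for i in range(0, len(data) - n + 1, n):
        out.extend(merged.work(data[i:i + n]))
    return out
```

With a round-robin splitter, n consecutive input items already form the input vector, so in this simplified model the splitter and joiner reduce to plain vector loads and stores.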
Streaming Address Generation • Area overhead less than 1% on Core i7. • Critical path: two 16-bit adds and one 64-bit add. [Figure: Scalar Buffer with rows 0 1 2 3 | 4 5 6 7 | 8 9 10 11 | 12 13 14 15 | 16 17 18 19 | 20 21 22 23; Vector Buffer with rows 0 3 6 9 | 1 4 7 10 | 2 5 8 11 | 12 15 18 21 | 13 16 19 22 | 14 17 20 23]
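The scalar-to-vector layout in the figure can be reproduced with a short index calculation. This is a sketch inferred from the figure only: the rate of 3 items per firing and the width of 4 lanes are read off the figure's rows, the function names are hypothetical, and the code does not model the hardware's actual adder structure. Within each block of rate × width items, item i belongs to firing (lane) `i // rate` and is that firing's `i % rate`-th access (row), so the block is stored transposed.

```python
SIMD_WIDTH = 4  # lanes per vector, matching the 4-item rows in the figure
RATE = 3        # items per firing; assumed from the stride of 3 in the figure


def vector_position(i, rate=RATE, width=SIMD_WIDTH):
    """Map a scalar-buffer index to its slot in the vector-buffer layout."""
    block = rate * width                # items covered by `width` firings
    group, within = divmod(i, block)    # which block, and offset inside it
    lane, row = divmod(within, rate)    # firing number, access number
    return group * block + row * width + lane


def to_vector_layout(scalar_buf, rate=RATE, width=SIMD_WIDTH):
    """Reorder a scalar buffer so each vector row holds one access per lane.

    Assumes len(scalar_buf) is a multiple of rate * width.
    """
    out = [None] * len(scalar_buf)
    for i, item in enumerate(scalar_buf):
        out[vector_position(i, rate, width)] = item
    return out
```

Calling `to_vector_layout(list(range(24)))` yields the vector-buffer rows shown in the figure, starting with `0 3 6 9`.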
Experimental Setup • Frontend: MIT StreamIt compiler • Backend: MacroSS • ICC 11.1 compiles the generated C/C++ code • Core i7 with SSE4 [Figure: toolchain flow: Streaming Program, Frontend Compiler, Backend Compiler, C Code, Host Compiler, Intel Core i7]
Conclusion • Streaming is prevalent in all computing domains. • Applying traditional SIMDization to streaming applications fails to utilize SIMD engines. • Macro-SIMDization is done at a higher level. • MacroSS outperforms traditional SIMDization techniques by 54%.
SAGU Implementation • Area overhead less than 1% on Core i7. • Critical path: two 16-bit adds and one 64-bit add. • Minor ISA modifications are needed.
SIMD + Multi-core Scheduling • How to schedule for a heterogeneous SIMD system? • SIMDization reduces memory/bus traffic • Exploit SIMD parallelism before Core-level parallelism. • Is this the best we can do?