Flexible Filters for High Performance Embedded Computing
Explore flexible stream programming models using filters and channels for applications like signal and image processing. Learn about CFAR benchmarking and efficient stream processing strategies.
Flexible Filters for High Performance Embedded Computing
Rebecca Collins and Luca Carloni
Department of Computer Science, Columbia University
Stream Programming
• Stream programming model
  • filter: a piece of sequential code
  • channel: how filters communicate
  • token: an indivisible unit of data for a filter
• Examples: signal processing, image processing, embedded applications
[Figure: a pipeline of three filters, A, B, C]
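The filter, channel, and token vocabulary above can be sketched in a few lines of Python. This is only an illustration: the `Channel`, `Filter`, and `run_pipeline` names are hypothetical and do not come from the presentation or from any stream language.

```python
from collections import deque


class Channel:
    """FIFO channel that carries tokens from one filter to the next."""

    def __init__(self):
        self.queue = deque()

    def push(self, token):
        self.queue.append(token)

    def pop(self):
        return self.queue.popleft()

    def __len__(self):
        return len(self.queue)


class Filter:
    """A sequential function applied to one token at a time."""

    def __init__(self, fn, inp, out):
        self.fn, self.inp, self.out = fn, inp, out

    def fire(self):
        """Consume one token if one is waiting; report whether work was done."""
        if len(self.inp) == 0:
            return False
        self.out.push(self.fn(self.inp.pop()))
        return True


def run_pipeline(fns, tokens):
    """Chain the functions into a pipeline A -> B -> C and drain the stream."""
    channels = [Channel() for _ in range(len(fns) + 1)]
    filters = [Filter(fn, channels[i], channels[i + 1]) for i, fn in enumerate(fns)]
    for token in tokens:
        channels[0].push(token)
    # Fire every filter once per round until no filter has input left.
    while any([f.fire() for f in filters]):
        pass
    return list(channels[-1].queue)


print(run_pipeline([lambda x: x + 1, lambda x: x * 2, lambda x: x - 3], [1, 2, 3]))
```

Here the three lambdas stand in for filters A, B, and C; on a real multi-core platform each filter would run on its own core rather than in a single loop.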
Example: Constant False-Alarm Rate (CFAR) Detection
[Figure: a row of cells marking the cell under test and the guard cells]
• The cell under test is compared to the surrounding cells, with additional factors
  • number of gates
  • threshold (μ)
  • number of guard cells
  • rows and other dimensional data
• HPEC Challenge: http://www.ll.mit.edu/HPECchallenge/
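A minimal sketch of cell-averaging CFAR along one dimension may help make the comparison concrete. It assumes the input is already a list of squared magnitudes and that windows are simply truncated at the edges of the data; the function name, parameter names, and edge handling are illustrative assumptions, not the HPEC Challenge specification.

```python
def cfar_detect(power, n_cfar, n_guard, mu):
    """Return indices of cells whose power exceeds mu times the local noise estimate.

    power   -- squared magnitude of each range cell
    n_cfar  -- number of window cells on each side (assumed parameter name)
    n_guard -- number of guard cells on each side, excluded from the estimate
    mu      -- detection threshold
    """
    targets = []
    for i in range(len(power)):
        # Left window: n_cfar cells ending just before the left guard cells.
        left = power[max(0, i - n_guard - n_cfar):max(0, i - n_guard)]
        # Right window: n_cfar cells starting just after the right guard cells.
        right = power[i + n_guard + 1:i + n_guard + 1 + n_cfar]
        window = left + right
        if not window:
            continue  # no neighbors available to estimate the noise
        noise = sum(window) / len(window)
        if power[i] > mu * noise:
            targets.append(i)
    return targets


cells = [1.0] * 15
cells[7] = 100.0  # one strong return among uniform noise
print(cfar_detect(cells, n_cfar=3, n_guard=1, mu=5.0))
```

Raising `mu` makes the detector stricter, so fewer cells are reported as targets.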
CFAR Benchmark
[Figure, animated over five slides: the data stream 1 through 15 moves through the filters labeled "uInt to float", "square", "left window", "right window", "add", "align data", and "find targets". The final frame marks the cell under test t with its left window and right window.]
Mapping
[Figure, shown over two slides: filters A, B, C mapped onto core 1, core 2, core 3 of a multi-core chip]
• Throughput: rate of processing data tokens
Unbalanced Flow
• Bottlenecks reduce throughput
• Caused by backpressure
  • inherent algorithmic imbalances
  • data-dependent computational spikes
[Figure: A, B, C on core 1, core 2, core 3, with a “wait” label]
Data Dependent Execution Time: CFAR
• Using Set 1 from the HPEC CFAR Kernel Benchmark
• Over a block of about 100 cells
• Extra workload of 32 microseconds per target
Data Dependent Execution Time
Some other examples:
• Bloom filters (financial, spam detection)
• Compression (image processing)
Our Solution: Flexible Filters
• Unused cycles on the core running B are filled by working ahead on filter C with the data already present on that core
• Push stream flow upstream of a bottleneck
• Semantic preservation
[Figure: A, B, C on core 1, core 2, core 3, plus a second copy of C connected through flex-split and flex-merge]
Related Work
• Static compiler optimizations
  • StreamIt [Gordon et al., 2006]
• Dynamic runtime load balancing
  • Work stealing / filter migration
    • [Kakulavarapu et al., 2001]
    • Cilk [Frigo et al., 1998]
    • Flux [Shah et al., 2003]
    • Borealis [Xing et al., 2005]
  • Queue-based load balancing
    • Diamond [Huston et al., 2005]: distributed search, queue-based load balancing, filter re-ordering
• Combination static + dynamic
  • FlexStream [Hormati et al., 2009]: multiple competing programs
Outline
• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
• CFAR Case Study
Design Flow of a Stream Program with Flexible Filters
• Design stream algorithm
  • Mapping
  • Filters
  • Memory Profile
• Add Flexibility to Bottlenecks
[Figure: the design flow above, supported by compiler/design tools. The next two slides, "Design Considerations" and "Adding Flexibility", repeat this diagram.]
Mapping Stream Programs to Multi-Core Platforms
[Figure: three ways to map filters A, B, C onto cores: pipeline mapping, sharing a core, and SPMD mapping]
Throughput: SPMD Mapping
• Suppose EA = 2, EB = 2, EC = 3
• Each of core 1, core 2, core 3 runs A, B, C on its own token (tokens 1, 2, 3)
• 3 tokens processed in 7 timesteps, ideal throughput = 3/7 = 0.429
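The SPMD figure can be checked with a short calculation. The sketch below assumes each core runs every filter back to back on its own token and that there is no communication overhead; `spmd_throughput` is a hypothetical helper name.

```python
from fractions import Fraction


def spmd_throughput(exec_times, n_cores):
    """Ideal SPMD throughput: n_cores tokens finish every sum(exec_times) timesteps."""
    return Fraction(n_cores, sum(exec_times))


t = spmd_throughput([2, 2, 3], n_cores=3)  # EA = 2, EB = 2, EC = 3
print(t, round(float(t), 3))
```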
Throughput: Pipeline Mapping
[Figure: timeline of A on core 1, B on core 2, C on core 3 processing tokens 1 through 4, with latency marked]
• throughput = 1/3 = 0.333 < 0.429
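To see why the pipeline settles at one token every 3 timesteps, the schedule can be simulated. This sketch assumes EA = 2, EB = 2, EC = 3, unbounded channels between stages, and all tokens available at time 0; `pipeline_finish_times` is a hypothetical helper name.

```python
def pipeline_finish_times(exec_times, n_tokens):
    """Finish time of each token when every stage runs on its own core."""
    stage_free = [0] * len(exec_times)  # time at which each stage becomes idle
    finishes = []
    for _ in range(n_tokens):
        ready = 0  # time the current token leaves the previous stage
        for s, e in enumerate(exec_times):
            start = max(ready, stage_free[s])  # wait for the token and for the stage
            ready = start + e
            stage_free[s] = ready
        finishes.append(ready)
    return finishes


times = pipeline_finish_times([2, 2, 3], n_tokens=4)
print(times)
# Gap between consecutive finish times is the steady-state period.
print([b - a for a, b in zip(times, times[1:])])
```

After the pipeline fills, the gap between finished tokens equals the execution time of the slowest filter, which is why throughput is 1/3 here.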
Data Blocks
• data block: group of data tokens
[Figure: A, B, C on core 1, core 2, core 3, with a second copy of C]
Throughput: Pipeline Augmented with Flexibility
[Figure: timeline of A on core 1, B on core 2, C on core 3 processing tokens 1 through 4, with a second copy of C taking on part of the work]
• throughput = 2/5 = 0.4 < 0.429 (but > 0.333)
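One way to arrive at 2/5 is an idealized bound in which the work of B and C is shared evenly between their two cores. That reasoning is an assumption of this sketch, not a formula stated in the slides, and it ignores the cost of flex-split and flex-merge; `flex_throughput` is a hypothetical helper name.

```python
from fractions import Fraction


def flex_throughput(e_a, e_b, e_c):
    """Idealized throughput when the core running B can also absorb part of C.

    Assumes C's work divides freely between core 2 and core 3, so the two
    cores together spend e_b + e_c per token. B itself stays on core 2.
    """
    period = max(Fraction(e_a), Fraction(e_b), Fraction(e_b + e_c, 2))
    return 1 / period


print(flex_throughput(2, 2, 3))         # EA = 2, EB = 2, EC = 3
print(float(flex_throughput(2, 2, 3)))
```

With EA = 2, EB = 2, EC = 3 the shared load is 5 timesteps per 2 tokens, which matches the throughput quoted on the slide.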
Flex-Split
flex-split:
  pop data block b from in
  n0 = available space on out0
  n1 = |b| - n0
  send n0 tokens to out0, n1 tokens to out1
  send n0 0's, then n1 1's to select
• maintain ordering
• based on run-time state of queues
[Figure: B and a copy of C on core 2, C on core 3. Flex split (in, out0, out1) feeds the two copies of C; flex merge (in0, in1, out) recombines their outputs using the select channel.]
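The flex-split steps above can be sketched with Python deques standing in for channels. The function signature, the explicit `out0_capacity` parameter, and the clamping of n0 to the block size are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from collections import deque


def flex_split(block, out0, out1, select, out0_capacity):
    """Send as much of the block as fits to out0 and spill the rest to out1."""
    free = max(0, out0_capacity - len(out0))  # available space on out0
    n0 = min(len(block), free)                # clamp: never more than the block holds
    n1 = len(block) - n0
    out0.extend(block[:n0])
    out1.extend(block[n0:])
    # Record the routing so that flex-merge can restore the original order.
    select.extend([0] * n0 + [1] * n1)
    return n0, n1


out0, out1, select = deque([9]), deque(), deque()
print(flex_split([1, 2, 3, 4], out0, out1, select, out0_capacity=3))
print(list(out0), list(out1), list(select))
```

Because the split point depends only on how full `out0` is at run time, the same filter graph adapts when the downstream filter slows down.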
Flex-Merge
flex-merge:
  pop i from select
  if i is 0, pop token from in0
  if i is 1, pop token from in1
  push token to out
• Overhead of flexibility?
[Figure: the same flex split / flex merge diagram, with B and a copy of C on core 2 and C on core 3]
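The flex-merge steps can be sketched the same way, again with deques standing in for channels. Stopping when the selected input is still empty is an assumption of this sketch; a real runtime would presumably block until the token arrives.

```python
from collections import deque


def flex_merge(select, in0, in1, out):
    """Replay the routing recorded on select to restore the original token order."""
    while select:
        src = in0 if select[0] == 0 else in1
        if not src:
            break  # the selected token has not been produced yet
        select.popleft()
        out.append(src.popleft())


select = deque([0, 0, 1, 1])
in0, in1, out = deque(["a", "b"]), deque(["c", "d"]), deque()
flex_merge(select, in0, in1, out)
print(list(out))
```

The select channel is what preserves ordering: tokens leave in exactly the sequence in which flex-split routed them, regardless of which copy of the filter finished first.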
Multi-Channel Flex-Split and Flex-Merge
[Figure, built up over four slides: a filter with input channels 1 through n and output channels 1 through n, its flexible copy (filterflex), flex split, one flex merge per output channel, and a select channel. Two variants are labeled Centralized and Distributed; the Distributed variant also shows blocks labeled "βflex split".]
Cell BE Processor
• Distributed memory
• Heterogeneous
  • 8 SIMD (SPU) cores
  • 1 PowerPC (PPU)
• Element Interconnect Bus
  • 4 rings
  • 205 GB/s
[Figure labels: Gedae Programming Language, Communication Layer]
Gedae
• Commercial data-flow language and programming GUI
• Performance analysis tools
CFAR Benchmark
[Figure: the CFAR filter graph with filters "uInt to float", "square", "left window", "right window", "add", "align data", and "find targets"]
Data Dependency
• % Targets: by changing the threshold, change the % of targets (1.3 %, 7.3 %)
• Additional Workload: additional workload per target (16 µs, 32 µs, 64 µs)
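As a rough guide to the scale of these settings, the expected extra work per data block can be estimated by multiplying them out. The sketch assumes a block of about 100 cells, as on the earlier CFAR execution-time slide, and that the extra work grows linearly with the number of targets; it is plain arithmetic and not a measured result.

```python
def extra_workload_us(cells_per_block, target_fraction, us_per_target):
    """Expected additional workload per block, in microseconds."""
    return cells_per_block * target_fraction * us_per_target


for pct in (1.3, 7.3):
    for us in (16, 32, 64):
        extra = extra_workload_us(100, pct / 100, us)
        print(f"{pct}% targets, {us} us per target: about {extra:.1f} us extra per block")
```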
Conclusions
• Flexible filters
  • adapt to data dependent bottlenecks
  • distributed load balancing
  • provide speedup without modification to original filters
  • can be implemented on top of general stream languages