Flexible Filters for High Performance Embedded Computing

Rebecca Collins and Luca Carloni
Department of Computer Science, Columbia University

Motivation

Stream Programming
• Stream programming model
  • filter: a piece of sequential code
  • channels: how filters communicate
  • token: an indivisible unit of data for a filter
• Examples: signal processing, image processing, embedded applications
[Figure: three filters A, B, and C connected in sequence by channels]
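To make the vocabulary concrete, here is a minimal Python sketch of the model, not the authors' implementation. The class `Filter`, the helper `run_pipeline`, and the use of `deque` objects as channels are all illustrative assumptions.

```python
from collections import deque


class Filter:
    """A sequential piece of code that consumes and produces tokens (hypothetical helper)."""

    def __init__(self, work, inp, out):
        self.work = work  # function applied to one token
        self.inp = inp    # input channel (FIFO)
        self.out = out    # output channel (FIFO)

    def fire(self):
        """Process one token if one is available; report whether work was done."""
        if not self.inp:
            return False
        self.out.append(self.work(self.inp.popleft()))
        return True


def run_pipeline(tokens, works):
    """Chain one filter per function in `works` and run until no filter can fire."""
    channels = [deque(tokens)] + [deque() for _ in works]
    filters = [Filter(w, channels[i], channels[i + 1]) for i, w in enumerate(works)]
    progressed = True
    while progressed:
        progressed = False
        for f in filters:
            progressed = f.fire() or progressed
    return list(channels[-1])
```

Each filter only sees its own input and output channels, which is what allows the filters to be placed on different cores later on.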

Example: Constant False-Alarm Rate (CFAR) Detection
[Figure: the cell under test is compared to its neighboring cells, skipping the adjacent guard cells]
• With additional factors
  • number of gates
  • threshold (μ)
  • number of guard cells
  • rows and other dimensional data
HPEC Challenge: http://www.ll.mit.edu/HPECchallenge/
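As a rough illustration of the detection rule, the following Python sketch works on a single row of cells. It is a simplification under stated assumptions, not the HPEC Challenge reference code: the function name `cfar_detect`, its parameters, and the handling of cells near the edges (averaging over whichever window cells exist) are all choices made for this example.

```python
def cfar_detect(cells, n_cfar, n_guard, mu):
    """Return indices of cells whose power exceeds mu times the local noise estimate.

    cells   : raw samples for one row
    n_cfar  : number of cells in each of the left and right windows
    n_guard : number of guard cells skipped on each side of the cell under test
    mu      : detection threshold
    """
    power = [float(c) ** 2 for c in cells]  # convert to float, then square
    n = len(power)
    targets = []
    for i in range(n):
        window = []
        # Collect the left and right windows, skipping the guard cells.
        for offset in range(n_guard + 1, n_guard + n_cfar + 1):
            if i - offset >= 0:
                window.append(power[i - offset])
            if i + offset < n:
                window.append(power[i + offset])
        if not window:
            continue
        noise = sum(window) / len(window)  # local noise estimate
        if noise > 0 and power[i] / noise > mu:
            targets.append(i)
    return targets
```

The amount of work done after a detection depends on how many targets are found, which is the source of the data-dependent execution time discussed later.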

CFAR Benchmark
[Figure, animated over several slides: filter graph with the filters "uInt to float", "right window", "left window", "align data", "find targets", "square", and "add". Tokens of the data stream 1 2 3 ... 15 enter one at a time; the final frame marks the cell under test t together with its left window and right window.]

Mapping
[Figure: filters A, B, and C mapped onto core 1, core 2, and core 3 of a multi-core chip; a second frame repeats the diagram with a different assignment of filters to cores]
• Throughput: rate of processing data tokens

Unbalanced Flow
• Bottlenecks reduce throughput
• Caused by
  • inherent algorithmic imbalances
  • data-dependent computational spikes
[Figure: A, B, and C on core 1, core 2, and core 3; backpressure from the bottleneck forces upstream filters to "wait"]

Data Dependent Execution Time: CFAR
• Using Set 1 from the HPEC CFAR Kernel Benchmark
• Over a block of about 100 cells
• Extra workload of 32 microseconds per target

Data Dependent Execution Time
• Some other examples:
  • Bloom filters (financial, spam detection)
  • Compression (image processing)

Our Solution: Flexible Filters
• Unused cycles on the core running B are filled by working ahead on filter C with the data already present on that core
• Push stream flow upstream of a bottleneck
• Semantic preservation
[Figure: A, B, and C on core 1, core 2, and core 3; a flex-split and a flex-merge let a copy of C run next to B on core 2]

Related Works
• Static compiler optimizations
  • StreamIt [Gordon et al., 2006]
• Dynamic runtime load balancing
  • Work stealing/filter migration
    • [Kakulavarapu et al., 2001]
    • Cilk [Frigo et al., 1998]
    • Flux [Shah et al., 2003]
    • Borealis [Xing et al., 2005]
  • Queue-based load balancing
    • Diamond [Huston et al., 2005]: distributed search, queue-based load balancing, filter re-ordering
• Combination static + dynamic
  • FlexStream [Hormati et al., 2009]: multiple competing programs

Outline
• Introduction
• Design Flow of a Stream Program with Flexibility
• Performance
• Implementation of Flexible Filters
• Experiments
• CFAR Case Study

Design Flow of a Stream Program with Flexible Filters
[Figure: design-flow diagram with the labels "Design stream algorithm", "Mapping", "Filters", "Memory", "Profile", "Add Flexibility to Bottlenecks", and "compiler/design tools"]

Design Considerations
[Figure: the design-flow diagram, repeated]

Adding Flexibility
[Figure: the design-flow diagram, repeated]

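Mapping Stream Programs to Multi-Core Platforms
[Figure: three ways to map filters A, B, and C onto cores]
• Pipeline mapping: one filter per core (A on core 1, B on core 2, C on core 3)
• Sharing a core: the three filters are placed on two cores
• SPMD mapping: each of core 1, core 2, and core 3 runs A, B, and C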

Throughput: SPMD Mapping
• Suppose E_A = 2, E_B = 2, E_C = 3 (per-token latencies of A, B, and C, in timesteps)
• Core 1, core 2, and core 3 each run A, B, and C on their own token (tokens 1, 2, and 3)
• 3 tokens processed in 7 timesteps, ideal throughput = 3/7 = 0.429
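The arithmetic on this slide can be checked with a few lines of Python. The function name `spmd_throughput` is made up for this example, and the formula assumes the ideal case from the slide: no communication cost and every core busy all the time.

```python
from fractions import Fraction


def spmd_throughput(latencies, num_cores):
    """Ideal SPMD throughput: each core runs every filter on its own token,
    so num_cores tokens finish every sum(latencies) timesteps."""
    return Fraction(num_cores, sum(latencies))


# E_A = 2, E_B = 2, E_C = 3 on three cores, as on the slide.
ideal = spmd_throughput([2, 2, 3], num_cores=3)
print(ideal, round(float(ideal), 3))
```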

Throughput: Pipeline Mapping
[Figure: schedule with A on core 1 (latency 2), B on core 2 (latency 2), and C on core 3 (latency 3), showing A1 to A4, B1 to B3, and C1 to C2]
• throughput = 1/3 = 0.333 < 0.429
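The 1/3 figure follows from the slowest stage, C, which needs 3 timesteps per token. The sketch below reproduces it by computing token completion times stage by stage. It assumes unbounded queues between stages and no communication cost; the function name `pipeline_throughput` is invented for this example.

```python
from fractions import Fraction


def pipeline_throughput(latencies, num_tokens=50):
    """Steady-state throughput of a pipeline with one filter per core.

    A stage starts token k once the previous stage has delivered it and the
    stage itself has finished token k - 1.
    """
    arrivals = [0] * num_tokens  # all tokens are available to the first stage at time 0
    for latency in latencies:
        finished = []
        t = 0  # time at which this stage becomes free
        for k in range(num_tokens):
            t = max(t, arrivals[k]) + latency
            finished.append(t)
        arrivals = finished  # outputs of this stage feed the next one
    # Tokens completed per timestep, measured at the last stage.
    return Fraction(num_tokens - 1, arrivals[-1] - arrivals[0])


print(pipeline_throughput([2, 2, 3]))
```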

Data Blocks
• data block: group of data tokens
[Figure: A, B, and C on core 1, core 2, and core 3, with a copy of C next to B]

Throughput: Pipeline Augmented with Flexibility
[Figure: schedule with A on core 1, B on core 2, and C on core 3; core 2 also runs C on some tokens]
• throughput = 2/5 = 0.4 < 0.429 (but > 0.333)
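One way to arrive at 2/5 is to treat cores 2 and 3 as sharing the combined work of B and C. The sketch below encodes that reading as an upper bound. This is an interpretation, not a formula taken from the slides: it ignores the overhead of flex-split and flex-merge and assumes C's work can be divided freely between the two cores. The function name `flex_throughput_bound` is hypothetical.

```python
from fractions import Fraction


def flex_throughput_bound(e_a, e_b, e_c):
    """Upper bound on throughput when a copy of C can also run on B's core.

    Core 1 alone limits the rate to 1 / e_a; cores 2 and 3 together must
    spend e_b + e_c timesteps on every token, so they finish at most
    2 tokens every e_b + e_c timesteps.
    """
    return min(Fraction(1, e_a), Fraction(2, e_b + e_c))


bound = flex_throughput_bound(2, 2, 3)
print(bound, float(bound))
```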

Flex-Split
• Maintains ordering
• Based on run-time state of queues
Flex-split:
  pop data block b from in
  n0 = available space on out0
  n1 = |b| - n0
  send n0 tokens to out0, n1 tokens to out1
  send n0 0's, then n1 1's to select
[Figure: B and a copy of C on core 2, C on core 3. Flex split reads in and writes out0, out1, and select; flex merge reads in0, in1, and select and writes out.]
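The flex-split steps can be sketched in Python as follows. The queues are modeled as `deque` objects, and the capacity argument and the clamping of n0 to the block size are assumptions added so the example is well defined when out0 has more free space than the block needs. The name `flex_split` and its signature are illustrative.

```python
from collections import deque


def flex_split(block, out0, out1, select, out0_capacity):
    """Route one data block: fill out0 as far as it has room, send the rest to out1.

    The select channel records, token by token, which output was used so that
    a later merge can restore the original order.
    """
    free = max(0, out0_capacity - len(out0))
    n0 = min(len(block), free)  # assumption: never send more than the block holds
    n1 = len(block) - n0
    out0.extend(block[:n0])
    out1.extend(block[n0:])
    select.extend([0] * n0 + [1] * n1)
    return n0, n1
```

Because the decision uses only the current occupancy of out0, the split adapts at run time: when the downstream core falls behind, more tokens go to out1.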

Flex-Merge
Flex-merge:
  pop i from select
  if i is 0, pop token from in0
  if i is 1, pop token from in1
  push token to out
• Overhead of flexibility?
[Figure: B and a copy of C on core 2, C on core 3. Flex split reads in and writes out0, out1, and select; flex merge reads in0, in1, and select and writes out.]
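A matching Python sketch of flex-merge is given below. The name `flex_merge` is illustrative, and stalling when the selected input is still empty is an assumption about how a real implementation would wait for the slower copy of the filter.

```python
from collections import deque


def flex_merge(select, in0, in1, out):
    """Move tokens to out in the order recorded on select.

    Stops as soon as the input named by the next select entry is empty,
    leaving that entry on select. Returns the number of tokens moved.
    """
    moved = 0
    while select:
        source = in0 if select[0] == 0 else in1
        if not source:
            break  # the selected token has not arrived yet
        select.popleft()
        out.append(source.popleft())
        moved += 1
    return moved
```

Reading select strictly in order is what preserves the semantics of the original stream: a token that went through the flexible copy can never overtake an earlier token on the other path.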

Multi-Channel Flex-Split and Flex-Merge
[Figure, built up over several slides: a filter with input channels 1 ... n and output channels 1 ... n, its flexible copy (filter flex), flex split, flex merge blocks on the output channels, and select. Two variants are shown, labelled Centralized and Distributed; the Distributed variant is drawn with additional β flex split blocks.]

Cell BE Processor
• Distributed memory
• Heterogeneous
  • 8 SIMD (SPU) cores
  • 1 PowerPC (PPU)
• Element Interconnect Bus
  • 4 rings
  • 205 GB/s
• Gedae programming language
• Communication layer

Gedae
• Commercial data-flow language and programming GUI
• Performance analysis tools

Data Dependency
• % Targets: by changing the threshold, change the % of targets: 1.3 %, 7.3 %
• Additional Workload: additional workload per target: 16 µs, 32 µs, 64 µs
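To get a feel for the size of the data-dependent spike, the expected extra work per block can be estimated from these parameters. The block size of 100 cells is borrowed from the earlier CFAR slide and is only an assumption here, as is treating the target percentage as an average rate; the function name `expected_extra_us` is made up for this example.

```python
def expected_extra_us(block_cells, target_fraction, extra_us_per_target):
    """Expected additional work, in microseconds, for one block of cells."""
    return block_cells * target_fraction * extra_us_per_target


# Parameter combinations listed on the slide, for an assumed block of 100 cells.
for target_fraction in (0.013, 0.073):
    for extra_us in (16, 32, 64):
        total = expected_extra_us(100, target_fraction, extra_us)
        print(f"{target_fraction:.1%} targets, {extra_us} us each: {total:.1f} us per block")
```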

More Benchmarks

Conclusions
• Flexible filters
  • adapt to data dependent bottlenecks
  • distributed load balancing
  • provide speedup without modification to original filters
  • can be implemented on top of general stream languages
