Sample Integrated Fourier Transform (SIFT) With High-Performance ASIC Implementation

##### Presentation Transcript

1. Sample Integrated Fourier Transform (SIFT) With High-Performance ASIC Implementation

2. Contents • Fast Fourier Transform Overview • SIFT: An Alternative Approach • SIFT: Architecture and Implementation • Future Design: 1024-Point SIFT • Conclusion

3. Fast Fourier Transform Overview

4. Fast Fourier Transform (FFT) • Computes the Discrete Fourier Transform (DFT) • Requires that N = 2^m • The DFT definition equations are:
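For reference, the DFT in its standard textbook form is given below, together with its real and imaginary parts for a real-valued input. The symbols x[n], X[k], n, and k are assumptions chosen for illustration and are not taken from the slide.

$$
X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \dots, N-1
$$

$$
\operatorname{Re} X[k] = \sum_{n=0}^{N-1} x[n] \cos\!\left(\frac{2\pi k n}{N}\right), \qquad
\operatorname{Im} X[k] = -\sum_{n=0}^{N-1} x[n] \sin\!\left(\frac{2\pi k n}{N}\right)
$$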

5. FFT Butterfly • The butterfly decomposes one N-point signal into N single-point signals • Interlaced decomposition: separation into odd and even groups • Decomposition is a reordering of the samples • It is accomplished by a bit-reversal algorithm • Ex: sample (0011) is exchanged with sample (1100) • Requires log2(N) stages for the decomposition • Ex: a 16-point signal requires 4 stages
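The bit-reversal reordering described on this slide can be sketched in a few lines of Python. The function names `bit_reverse` and `bit_reversed_order` are hypothetical, chosen only for this illustration.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        # Shift the result left and append the current lowest bit of index.
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(samples):
    """Return the samples reordered by bit-reversed index.

    The length must be a power of two (N = 2^m); m = log2(N) is also the
    number of decomposition stages.
    """
    n = len(samples)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    return [samples[bit_reverse(i, bits)] for i in range(n)]


if __name__ == "__main__":
    # 16-point example from the slide: index 0011 swaps with index 1100.
    print(format(bit_reverse(0b0011, 4), "04b"))
    print(bit_reversed_order(list(range(8))))
```

For a 16-point signal the index width is 4 bits, which matches the 4 decomposition stages quoted above.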

6. FFT BUTTERFLY • [Figure: the basic element of the FFT butterfly, with a 2-point input, a multiplier (x S), two adders (+), and a 2-point output]

7. PROPERTIES OF THE BUTTERFLY • General • Samples contribute to the result interdependently • All samples must be available before execution begins • The number of multiplies is reduced to the order of N x log2(N) • The FFT butterfly is a batch process (all samples are loaded before execution)

8. FFT BUTTERFLY PIPELINE • [Figure: timeline marked 0, T, 2T, 3T showing load S1, exe S1, load S2, exe S2; S1 and S2 are N-point samples] • FFT Butterfly: • Has a latency of T • Requires memory to store samples • Requires a complex addressing scheme

9. SIFT: An Alternative Approach

10. PROPERTIES OF SIFT • Sample contributions are computed independently • Each sample is processed transactionally • All coefficients are updated from each sample
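These properties can be illustrated with a small software model in which every incoming sample immediately updates one accumulator per coefficient and is then discarded. This is a minimal sketch under that reading of the slide, not the ASIC design itself; the class `StreamingDFT`, its method `push`, and the helper `direct_dft` are hypothetical names.

```python
import cmath


class StreamingDFT:
    """Sample-at-a-time DFT model: one accumulator per coefficient."""

    def __init__(self, n_points: int):
        self.n = n_points
        self.acc = [0j] * n_points  # coefficient accumulators
        self.count = 0              # index of the next sample in the block

    def push(self, sample):
        """Consume one sample; return the N coefficients once the block is
        complete (emit, then clear), otherwise return None."""
        n = self.count
        for k in range(self.n):
            # The factor for sample n and coefficient k depends only on (k*n) mod N.
            angle = -2.0 * cmath.pi * ((k * n) % self.n) / self.n
            self.acc[k] += sample * cmath.exp(1j * angle)
        self.count += 1
        if self.count == self.n:
            out = self.acc              # emit
            self.acc = [0j] * self.n    # clear
            self.count = 0
            return out
        return None


def direct_dft(x):
    """Reference DFT computed from a stored block, for comparison only."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


if __name__ == "__main__":
    sift = StreamingDFT(8)
    result = None
    for s in [1, 2, 3, 4, 0, -1, -2, -3]:
        result = sift.push(s)  # the sample itself is never stored
    print([round(abs(c), 6) for c in result])
```

Note that the model performs N multiply-accumulates per sample, so it trades the FFT's reduced multiply count for the absence of sample storage and reordering.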

11. ADVANTAGES OF SIFT • Less memory: • After a sample is used, it is not stored • Processing is transactional • Less hardware for: • Storing • Addressing • Processing

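12. SIFT PIPELINE • [Figure: timeline marked 0, T, T+1, 2T, 3T; samples S1 to SN and then S(N+1) to S(2N) are each executed as they arrive, and each block ends with Emit and Clear] • Continuous execution • Low latency • No stored samples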

13. SIFT Pipeline vs. FFT Butterfly Pipeline • [Figure: timeline marked 0, T, 2T, 3T, 4T] • FFT Butterfly pipeline (two interleaved rows): load S1, exe S1, load S3, exe S3 / load S2, exe S2, load S4 • SIFT pipeline: exe S1, exe S2, exe S3, exe S4 • S1, S2, S3, S4 are N-point samples

14. EXAMPLE OF 8-SAMPLE SIFT MATHEMATICS

15. The sine and cosine factors take N/4 - 1 distinct absolute values (other than 0 and 1).
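The count of N/4 - 1 can be checked numerically by enumerating the sine and cosine factors of an N-point DFT. The sketch below assumes the "absolute values" on the slide are the magnitudes of cos(2πk/N) and sin(2πk/N); the function name `distinct_abs_factors` is hypothetical.

```python
import math


def distinct_abs_factors(n_points: int, digits: int = 9):
    """Return the sorted distinct magnitudes of cos(2*pi*k/N) and
    sin(2*pi*k/N), excluding 0 and 1.

    Values are rounded to `digits` decimals so that floating-point noise
    does not split mathematically equal magnitudes.
    """
    values = set()
    for k in range(n_points):
        angle = 2.0 * math.pi * k / n_points
        for factor in (math.cos(angle), math.sin(angle)):
            values.add(round(abs(factor), digits))
    values.discard(0.0)
    values.discard(1.0)
    return sorted(values)


if __name__ == "__main__":
    for n in (8, 16, 64):
        factors = distinct_abs_factors(n)
        print(n, len(factors), n // 4 - 1)
```

For N = 8 the only remaining magnitude is about 0.7071, so a single stored constant covers every non-trivial multiplication.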

16. The DFT Equations for 16 Samples

17. SIFT: Architecture and Implementation

18. Basic SIFT Architecture • [Block diagram: signals CLOCK, STATUS, SAMPLE, RESET, and COEFFICIENT (ACC / OUT); datapath labels (A), (A x S), (M), and (M + A x S)]

19. 64-Point SIFT • Key factors achieved in SIFT implementation • Silicon Savings • Power Savings • High Speed

20. Silicon Savings • [Figure: block diagrams of the Butterfly FFT and SIFT, with blocks labeled Buffer, P, M, and C] • Butterfly FFT: cache, buffer, and coefficient memory with addressing • SIFT: only coefficient memory is required

21. Silicon Savings • Aspect Generator Structure Example • [Figure: signals Clock, Reset, Sample #, Sample [ ], Sequence [ ], and Forward/Inverse; two Encoder blocks and an adder (+) produce the Aspect]

22. High Speed • Four-stage pipelined architecture: Set Aspect, Multiply, Add, Store • Continuous execution • High-speed SRAM • Simple gated aspect generation • Low latency

23. Low Power • Fully Static and Synchronous Design • Coefficient Memory Only • No Bus or Addressing System

24. Performance

    | # of channels | Area (mm²) | Clock (MHz) | Power (mW) | Execution Time (µs) |
    |---|---|---|---|---|
    | 1 | 0.21 | 330 | 8 | 12.28 |
    | 2 | 0.32 | 330 | 8 | 24.57 |
    | 4 | 0.55 | 300 | 9.6 | 54.07 |
    | 8 | 1.01 | 300 | 9.6 | 108.14 |
    | 16 | 1.93 | 300 | 9.6 | 216.28 |

    Technology: 0.18 µm CMOS, Vdd = 1.8 V, Temp = 25 C

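25. 64-Point Benchmarks

    | Part Name | CMOS process (µm) | Algorithm | Area (mm²) | Frequency (MHz) | Exec Time (ms) | Latency (unit) | Power (mW) |
    |---|---|---|---|---|---|---|---|
    | TI C54X | 0.18 | Butterfly | 144 | 160 | 0.01045 | 1 | 96 |
    | SIFT | 0.18 | SIFT | 0.21 | 330 | 0.01228 | 0 | 8 |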

26. Future Design: A 1024-Point SIFT

27. 1024-Point SIFT • 32-point SIFT used as building block • 1 each - 15-bit system counter • 64 multiply/add elements in two sets of 32 • 2 each - 32-segment, 32-word by 24-bit coefficient memory locations, duplex segments • 1 each - logic to interleave input/output locations

28. A 1024-Point SIFT Block Diagram • [Block diagram: Clock, Input Stream, and Output Stream; steering logic to interleave inputs; a single common counter; FRONT END: 32 cores by 32 deep interleaved transform units with swap-set coefficients; BACK END: 32 cores by 32 deep interleaved transform units with swap-set coefficients]
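One standard way to build a large transform from two banks of smaller ones is the two-stage (row and column) decomposition of the DFT. The slides do not confirm that the front end and back end work exactly this way, so the sketch below is only an assumption about the general idea: a front-end bank of short transforms, a per-element phase factor, then a back-end bank. The names `dft` and `two_stage_dft` are hypothetical, and a 4 by 4 example stands in for the 32 by 32 arrangement.

```python
import cmath


def dft(x):
    """Direct DFT of a short block (the building-block transform)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def two_stage_dft(x, n1: int, n2: int):
    """Compute an (n1*n2)-point DFT from n1-point and n2-point transforms.

    Input index is split as n = n2*m + r, output index as k = k1 + n1*k2.
    """
    n = n1 * n2
    if len(x) != n:
        raise ValueError("input length must equal n1 * n2")
    # Front end: one n1-point transform per interleaved input stream r.
    front = [dft([x[n2 * m + r] for m in range(n1)]) for r in range(n2)]
    out = [0j] * n
    for k1 in range(n1):
        # Phase factor applied between the two stages.
        column = [front[r][k1] * cmath.exp(-2j * cmath.pi * r * k1 / n)
                  for r in range(n2)]
        # Back end: one n2-point transform per front-end output index k1.
        back = dft(column)
        for k2 in range(n2):
            out[k1 + n1 * k2] = back[k2]
    return out


if __name__ == "__main__":
    samples = [float((3 * i) % 7) for i in range(16)]
    staged = two_stage_dft(samples, 4, 4)
    direct = dft(samples)
    print(max(abs(a - b) for a, b in zip(staged, direct)))
```

With n1 = n2 = 32 this structure needs two banks of 32-point transforms, which is at least consistent with the "64 multiply/add elements in two sets of 32" listed for the design.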

29. A 1024-Point SIFT Target Performance • Execute time: 3.2 microseconds • Latency: 3.2 microseconds • Parallelism: 64 MAC per cycle (20.48 GMAC/s sustained rate) • Size: 25 mm² for 8 bits; 49 mm² for 16 bits • Power: 250 mW; 1 W • Efficiency: 89 GMAC/s/W; 20 GMAC/s/W • Pins: 28 pins; 44 pins
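The sustained rate and execute time above follow from simple arithmetic, sketched here. The 320 MHz clock is taken from the DFT1024 column of the benchmark slide that follows; the assumption that one input sample is accepted per clock cycle is mine and is not stated on the slide. The function names are hypothetical.

```python
MACS_PER_CYCLE = 64   # from the slide: 64 MAC per cycle
CLOCK_HZ = 320e6      # from the benchmark slide: 320 MHz
N_POINTS = 1024       # transform length


def sustained_gmacs(macs_per_cycle: int, clock_hz: float) -> float:
    """Sustained multiply-accumulate rate in GMAC/s."""
    return macs_per_cycle * clock_hz / 1e9


def execute_time_us(n_samples: int, clock_hz: float) -> float:
    """Time to stream n_samples, assuming one sample per clock cycle."""
    return n_samples / clock_hz * 1e6


if __name__ == "__main__":
    print(sustained_gmacs(MACS_PER_CYCLE, CLOCK_HZ))  # GMAC/s
    print(execute_time_us(N_POINTS, CLOCK_HZ))        # microseconds
```

Under these assumptions the two results agree with the 20.48 GMAC/s and 3.2 microsecond figures quoted on the slide, up to floating-point rounding.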

30. Benchmarks: 1024-Point Complex Input FFT

    | Part Name | TI C54x | ADSP-2192 | DSP24 | DFT1024 |
    |---|---|---|---|---|
    | Algorithm | Butterfly | Butterfly | Butterfly | SIFT |
    | CMOS process (µm) | 0.18 | NA | 0.5 | 0.18 |
    | Frequency (MHz) | 160 | 160 | 100 | 320 |
    | Exec. Time (µs) | 263 | 151 | 22 | 3.2 |
    | Latency (µs) | 263 | 151 | 22 | 3.2 |
    | Power (mW) | 96 | NA | 3000 | 500 |
    | Area (mm²) | 144 | 400 | 289 | 49 |

31. Conclusions • SIFT is introduced as an alternative to FFT • SIFT designs presented with simulation results • Low Latency • Low Power • High Speed