
Lecture 10b: Implementing DSP Functionality: Alternatives

Lecture 10b: Implementing DSP Functionality: Alternatives. Prepared by: Professor Kurt Keutzer Computer Science 252, Spring 2000 With contributions from: Prof. Heinrich Meyr, University of Aachen Philip Chong, David Chinnery, Rhett Davis, Paul Husted,


Presentation Transcript


  1. Lecture 10b: Implementing DSP Functionality: Alternatives Prepared by: Professor Kurt Keutzer Computer Science 252, Spring 2000 With contributions from: Prof. Heinrich Meyr, University of Aachen Philip Chong, David Chinnery, Rhett Davis, Paul Husted, Niraj Shah, Chris Taylor, Scott Weber, Ning Zhang Kurt Keutzer

  2. System Implementation Choices [Figure: system functionality mapped onto alternative implementations: off-the-shelf µP/DSP, embedded core µP/DSP, application-specific µP (ASIP), and ASIC. The DSP core and ASIP core blocks are each drawn with a program ROM, a coefficient ROM, and control.]

  3. Making a Successful Comparison - 1 • Find an interesting application kernel • Viterbi decoding for speech processing (not a full modem!) • Find realistic constraints native to the application • n=2, K=7, QPSK, 100 kb/s, BER = 10^-4 • Find architectures/implementations that are promising for the application • TI TMS320C54, Tensilica Xtensa • What are the relevant features of this architecture that support this application? • Fix application constraints across all implementations (above) • Fix key parameters for implementation comparison • performance (constraint) • area • power

  4. Making a Successful Comparison - 2 • Identify how key parameters will be measured • performance - instruction set simulator, eval board • area - data sheets, gate estimates • power - eval board, TI application note • Implement your application kernel • Examine different algorithms • Start with code downloaded from the web - multimedia benchmarks etc. • Build your software development/evaluation environment: • http://www.ti.com/sc/docs/tools/dsp/6ccsfreetool.htm

  5. Making a Successful Comparison - 3 • Implement your application kernel (cont) • Phase 0: Research • Find application notes, research reports for your own or comparable architectures • Phase 1: Estimation • Develop a quick estimate based on initial code • Integrate research findings • Do a quick back-of-envelope reality check • Phase 2: Real implementation/Tuning • Tailor algorithm, implementation to architecture • Do your very best! Have a contest with your partner • Phase 3: Evaluation • Apply evaluation tools to key parameters • Evaluate and compare results - return to 2 • If your life depended on choosing the right part - what would you do?

  6. Making a Successful Comparison - 4 • Final evaluation and comparison - compare all implementations • To evaluate for a product - everything is fair game • To evaluate principally the architectures - need to consider: • Fab differences - TSMC vs. IBM (10-20% faster) • Process differences - 0.35 micron vs. 0.25 micron (50% faster) • Power supply differences - 3.0V vs. 1.5V • ASIC vs. custom implementations (2x faster) • Now evaluate - if I were the architect of this processor/implementor of this system on a chip, what would I do differently? • cache sizes • register availability • additional instructions • on-chip memory
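
One way to read the normalization step above is as a set of multiplicative speed factors applied before architectures are compared. The sketch below is only an illustration: the function name, the flag names, and the choice of 1.15 as a midpoint of the quoted 10-20% range are assumptions, and the factors themselves are the slide's rough rules of thumb rather than measured data.

```python
# Rough speed factors quoted on the slide (rules of thumb, not measurements).
FAB_SPEEDUP = 1.15      # assumed midpoint of the quoted 10-20% fab difference
PROCESS_SPEEDUP = 1.5   # 0.35 micron -> 0.25 micron
CUSTOM_SPEEDUP = 2.0    # ASIC flow -> custom implementation


def normalize_speed(measured, *, slower_fab=False, older_process=False,
                    asic_flow=False):
    """Scale a measured speed to a common reference point.

    The reference is hypothetical: faster fab, 0.25 micron, custom design.
    Each flag marks a handicap of the measured part relative to it.
    """
    factor = 1.0
    if slower_fab:
        factor *= FAB_SPEEDUP
    if older_process:
        factor *= PROCESS_SPEEDUP
    if asic_flow:
        factor *= CUSTOM_SPEEDUP
    return measured * factor
```

For example, a part measured at 100 MHz in a 0.35 micron ASIC flow would be credited with roughly 300 MHz under these assumptions, which is the kind of adjustment needed before blaming or crediting the architecture itself.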

  7. Making a Successful Comparison - 5 • Just for fun … • In addition to primary constraints (speed, cost, power) • final real world considerations • business relationships (joint partnership with Lucent) • Time-to-market issues • time to configure? • software development environment • library/application software support • application engineering support

  8. Viterbi Algorithm Prof. Heinrich Meyr, University of Aachen

  9. Viterbi Decoders in digital communication systems

  10. Convolutional Coder and Trellis diagram

  11. ACS recursion for M = 2

  12. Viterbi Decoder block diagram

  13. Characteristic of a 2-bit step-at-zero quantizer

  14. Architecture

  15. Node parallel ACS architecture

  16. Alternative Implementations

  17. Butterfly trellis structure and resource sharing for the K = 3, rate 1/2 code

  18. Survivor Memory Unit

  19. REA hardware architecture

  20. Decoded Sequence: 0 0 ... 0 1 0

  21. Viterbi Project Constraints • uncoded word length = 1 • coded word length (n) = 2; this means that it is rate 1/2 • constraint length (K aka. L) = 7; this means that the number of states in the trellis is 2^(K-1), or 64 states • branch metric calculation is QPSK • soft decision wordlength (q) = 6 • chain-backing depth (D) = 96 • generator polynomials: p0 = 171, p1 = 133 (octal); this means that p0 = 1111001, p1 = 1011011 (binary) • data rate 100 kb/s • goal: bit error rate (BER) = 10^-4 • signal to noise ratio (SNR) degradation 0.05 dB
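
The constraints above can be checked with a small rate-1/2 encoder sketch. It uses the slide's values (K = 7, p0 = 171 and p1 = 133 octal); the function names and the shift-register bit ordering (newest bit in the most significant position) are assumptions made for illustration, not something the slides specify.

```python
K = 7                       # constraint length
G0, G1 = 0o171, 0o133       # generator polynomials p0, p1 (octal)
NUM_STATES = 1 << (K - 1)   # 2^(K-1) trellis states


def parity(x: int) -> int:
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def encode(bits):
    """Rate-1/2 convolutional encoder: two coded bits per input bit."""
    state = 0                               # the K-1 most recent input bits
    coded = []
    for b in bits:
        window = (b << (K - 1)) | state     # newest bit in the MSB (assumed)
        coded.append((parity(window & G0), parity(window & G1)))
        state = window >> 1                 # drop the oldest bit
    return coded
```

Printing `format(G0, "07b")` and `format(G1, "07b")` reproduces the binary forms listed on the slide, and `NUM_STATES` gives the 64 trellis states.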

  22. Viterbi Decoder Implementation on an ARM EE 290S Final Project May 4, 1999 Phillip Chong

  23. ARM Overview • 32-bit RISC microprocessor • Five stage pipeline • Features fast ALU operations (barrel shifter) • Scalar integer unit, no FPU

  24. Algorithm Tweaking • Performing the metric computation through table lookup (load = 1 delay slot) is faster than using ALU (multiplication = up to 3 delay slots) • Parity computation (Viterbi code) can also be done through table lookup
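
The table-lookup idea for parity can be sketched as follows. The 8-bit table width and the function names are assumptions; the slides do not describe the actual table layout used in the project.

```python
TABLE_BITS = 8  # assumed table width; the slides do not give the layout

# Precomputed once: parity of every possible 8-bit value.
PARITY_TABLE = bytes(bin(i).count("1") & 1 for i in range(1 << TABLE_BITS))


def parity_lookup(x: int) -> int:
    """Parity of the low 8 bits of x with a single table read."""
    return PARITY_TABLE[x & 0xFF]


def parity_alu(x: int) -> int:
    """Reference version: fold the low 8 bits together with shifts and XORs."""
    x &= 0xFF
    x ^= x >> 4
    x ^= x >> 2
    x ^= x >> 1
    return x & 1
```

The lookup replaces three shift-XOR pairs with one load, which is the trade the slide describes: a memory access in exchange for ALU work.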

  25. Reducing Memory Footprint • Cache misses can be very costly due to pipeline stalls • We are willing to give up some algorithmic efficiency to eliminate cache misses • To minimize the memory footprint, we pack 32 bits of traceback into a single word; we can easily unpack this data thanks to the barrel shifter (1-cycle operation) • For 128-level traceback, memory requirements are 512 bytes (metrics table) + 1024 bytes (traceback) + 768 bytes (parity lookup tables) = 2304 bytes
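
A minimal sketch of the packing scheme, assuming one decision bit per trellis state and state 0 in the least significant bit (both are assumptions, as are the function names). Under that reading, 64 states x 128 levels x 1 bit is 1024 bytes, which matches the traceback figure on the slide.

```python
STATES = 64             # 2^(K-1) for K = 7
TRACEBACK_DEPTH = 128   # traceback levels quoted on the slide
BITS_PER_WORD = 32


def pack_decisions(bits):
    """Pack one decision bit per state into 32-bit words, state 0 in the LSB."""
    words = [0] * ((len(bits) + BITS_PER_WORD - 1) // BITS_PER_WORD)
    for i, b in enumerate(bits):
        words[i // BITS_PER_WORD] |= (b & 1) << (i % BITS_PER_WORD)
    return words


def get_decision(words, state):
    """Unpack a single decision bit with one shift and one mask."""
    return (words[state // BITS_PER_WORD] >> (state % BITS_PER_WORD)) & 1


def traceback_bytes(states=STATES, depth=TRACEBACK_DEPTH):
    """Bytes needed when every state stores one bit per traceback level."""
    return states * depth // 8
```

The shift-and-mask in `get_decision` is the operation the barrel shifter makes cheap on the ARM.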

  26. Simulation Results • Simulated decoding of 4096 bits on a 125 MHz 3.3V model • Execution requires 11.72M ARM instruction cycles, giving a 44 kb/s data rate • Power consumption was estimated at 52.47 mW • Scaling simulation results up to a 275 MHz 2.0V ARM (fastest commercially available) gives 96 kb/s at 42.40 mW
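
The figures above can be reproduced with a short calculation. The throughput follows directly from bits, cycles, and clock rate. For power, the sketch assumes the usual dynamic-power model in which power scales with frequency times voltage squared; the slides do not state the model, but this assumption reproduces the quoted 42.40 mW. Function names are illustrative.

```python
def throughput_bps(bits, cycles, clock_hz):
    """Decoded bits per second for a run of `bits` taking `cycles` cycles."""
    return bits * clock_hz / cycles


def scale_power_mw(p_mw, f_old, f_new, v_old, v_new):
    """Scale power assuming P is proportional to f * V^2 (assumed model)."""
    return p_mw * (f_new / f_old) * (v_new / v_old) ** 2


if __name__ == "__main__":
    print(throughput_bps(4096, 11.72e6, 125e6) / 1e3)        # kb/s at 125 MHz
    print(throughput_bps(4096, 11.72e6, 275e6) / 1e3)        # kb/s at 275 MHz
    print(scale_power_mw(52.47, 125e6, 275e6, 3.3, 2.0))     # mW at 275 MHz, 2.0V
```

The two throughput values come out near 43.7 kb/s and 96.1 kb/s, and the scaled power near 42.4 mW, consistent with the rounded numbers on the slide.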

  27. Summary • Clock speed: 275 MHz • Execution Performance: 96 kb/s • Power Dissipation: 42.40 mW (5.68 mW/mm^2) • Area: 7.47 mm^2 in 0.25 µm • Design Effort: 4 days • Portability very high: code is ANSI C; architecture-dependent tweaks may need reworking

  28. Conclusion/Thanks • One-bit quantization gives opportunities for performance improvements, at a huge cost in QOR • Viterbi algorithm would benefit greatly from having hardware parallelism (vector ops) available • Many thanks to Marlene Wan for providing power estimation

  29. Viterbi Decoder Implementation on a TI C54x EE 290S Final Project May 4, 1999 Paul Husted

  30. Introduction • Implemented Viterbi Decoder on a TI TMS320VC5402 DSP • Examine: • Performance (bits/sec) • Power (mW/bit) • Cost ($/unit,area) • Design effort (engineer-months)

  31. Viterbi Decoder Specifications • Implementation Specifications: • Constraint Length (K aka. L) = 7 • Branch Metric Calculation is QPSK • Soft Decision Wordlength (q) = 6 • Chain-backing Depth (D) = 96 • Gen. Polynomials: p0 = 171, p1 = 133 (octal) • Data Rate 100 kb/s • Goal: Bit Error Rate (BER) = 10^-4

  32. C54x Capabilities • Capabilities of all C54x DSP Cores: • Three 16-bit Data, One 16-bit program bus • 40 bit ACC with 40 bit barrel shifter • Two independent accumulators • A single cycle non-pipelined MAC • Single-instruction repeat and block-repeat • Six channel DMA controller • Arithmetic instructions with parallel store and parallel load

  33. Helpful Instructions for the Viterbi Decoder • The C54x Has Specialized Instruction Set • Dual Add/Subtract in 1 Cycle • Compare, Select, and Store Unit (CSSU) • Compare Branch Metrics • Store Larger Value, Store Decision Bit • Increment Address Registers in Circular Buffer • 1 Cycle • Allows Butterfly (2 States) in 5 cycles

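  34. Butterfly Implementation • T Register = Local Distance • [Figure: butterfly from Old(2*j) and Old(2*j+1) to New(j) and New(j+2^(K-2)); one new state is computed with DADST followed by CMPS, the other with DSADT followed by CMPS]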
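
The butterfly can be sketched in software as an add-compare-select over pairs of old states. This is an illustrative model only: the sign pairing of the two dual add/subtract steps and the decision-bit encoding are assumptions, and the code does not model the C54x accumulator or TRN register. The cycle estimate simply multiplies the 5 cycles per butterfly quoted on the previous slide by the number of butterflies.

```python
K = 7                      # constraint length from the project constraints
HALF = 1 << (K - 2)        # 2^(K-2) = 32 butterflies per trellis stage
CYCLES_PER_BUTTERFLY = 5   # figure quoted on the previous slide


def butterfly(old, new, decisions, j, t):
    """Update New(j) and New(j + 2^(K-2)) from Old(2j) and Old(2j+1)."""
    # Dual add/subtract (DADST-like): two candidates for New(j).
    a, b = old[2 * j] + t, old[2 * j + 1] - t
    # Compare-select-store (CMPS-like): keep the larger, record which won.
    new[j], decisions[j] = (a, 0) if a >= b else (b, 1)
    # Dual subtract/add (DSADT-like): two candidates for New(j + 2^(K-2)).
    c, d = old[2 * j] - t, old[2 * j + 1] + t
    new[j + HALF], decisions[j + HALF] = (c, 0) if c >= d else (d, 1)


def acs_stage(old, local_distance):
    """Run all butterflies for one received symbol."""
    new = [0] * (2 * HALF)
    decisions = [0] * (2 * HALF)
    for j in range(HALF):
        butterfly(old, new, decisions, j, local_distance[j])
    return new, decisions


def acs_cycles_per_bit():
    """Cycle estimate for the ACS loop alone: butterflies x cycles each."""
    return HALF * CYCLES_PER_BUTTERFLY
```

Under these assumptions the ACS loop costs 160 cycles per decoded bit, or about 16 million cycles per second at the 100 kb/s target, before traceback and I/O are counted.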

  35. TI TMS320VC5402 DSP • Specific Chip Characteristics: • Operates at 100 MIPS • Core Voltage of 1.8V • I/O Pins Operate at 3.3V • 16K Word x 16 Bits of Dual-Access RAM • 4K Word x 16 Bits of ROM • Internal DMA • Created in 0.18 Micron Technology

  36. Dataflow • Data I/O • Input Values Assumed to be Placed at Specified Memory Location by Internal DMA • Output Values Assumed to be removed from another Memory Location by Internal DMA • Alternatively, Data Could be Placed in this Memory Location After Other On-Chip Receiver Processing

  37. Implementation Analysis • Viterbi Decoder Code Created in Assembly • Linked to Processor Specific Memory Map • Simulated on Cycle-Accurate Simulator • Used Correct Memory Model for VC5402

  38. Implementation Results

  39. Power Calculation • Compared with TI Figures: • TI uses 1/2 MACs, 1/2 NOPs For Power Figure • .25 Micron Estimate is .45 mA/MIPS • Fully Static Design can be Clocked at Any Rate • Viterbi Code Uses 1.08 Times More Current than TI Estimate • At 22 MIPS, 19.25 mW are Consumed in the Core
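
The core power figure can be reconstructed from the numbers on this slide together with the 1.8V core voltage listed in the chip characteristics. Combining them this way is an assumption about how the figure was derived, but it reproduces the quoted value to within rounding. Names are illustrative.

```python
MA_PER_MIPS = 0.45       # TI current estimate quoted on the slide
VITERBI_FACTOR = 1.08    # Viterbi code draws 1.08x the TI estimate
CORE_VOLTAGE = 1.8       # core voltage from the chip characteristics slide


def core_current_ma(mips):
    """Core current in mA at the given instruction rate."""
    return MA_PER_MIPS * VITERBI_FACTOR * mips


def core_power_mw(mips, volts=CORE_VOLTAGE):
    """Core power in mW: current (mA) times core voltage (V)."""
    return core_current_ma(mips) * volts


if __name__ == "__main__":
    print(core_power_mw(22))   # close to the 19.25 mW quoted on the slide
```

Because the design is fully static, the same function gives the power at any lower clock rate by passing a smaller MIPS value.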

  40. Area Estimate • TI Will Not Release Die Sizes • .25 Micron Chips Fit Inside 3.2 mm x 3.2 mm Area on a 144 pin BGA • Maximum Die Size is thus 10.24 mm^2

  41. Development Cost • Engineering Time • Estimate - 3 days • Assumes Engineer Has Experience with Assembly Language and TI Tools • Tool Cost - $13262.45 • Includes Emulator, Simulator, Compiler, Assembler, Linker, Debugger • Cost of Chip - $8.52

  42. Conclusion • Optimized Instructions Make Algorithm Efficient • Static Design Allows Clock Rate to be Set As Needed to Reduce Power • Flexibility Exists to Perform Other Processing of Data • Very Little Development Time/Cost

  43. ACS TIE Extension with State [Figure: ACS datapath with state. Register inputs Rs and Rt carry packed path metrics (pm) and four 8-bit branch metrics (bm3..bm0); add/subtract units and MSB (=1?) tests, steered by the control instruction, write two path metrics and two decision bits to Rr.]

  44. Tensilica Viterbi Implementation Niraj Shah, Scott Weber 290A Final Presentation

  45. Tensilica Flow [Figure: tool flow linking the designer's TIE and uArch inputs, the Tensilica Processor Generator, the generated xt-gcc and xt-run tools, the .c sources, and the .o output]

  46. Xtensa Architecture • TIE Extensions: • single cycle • state free • no new exceptions • no stalls • typeless data • Rs, Rt, Rr are 32 bit regs • I is the instruction controlling the TIE unit • Xtensa Core is a 32 bit configurable RISC processor • [Figure: Xtensa core and TIE unit, with signals Rs, Rt, I, and Rr]

  47. Viterbi Architecture [Figure: blocks ADC, I/O Device, Init, RAM, ACS, and TraceBack; annotation "Measured Performance Here"]

  48. TIE SetupBMreg [Figure: SetupBMreg datapath. I and Q are taken from the low bytes (bits 7:0) of Rs and Rt; add/subtract units, with a 0x7F constant, form the branch metrics bm0..bm3, which are packed into the four bytes of Rr under control of the instruction.]

  49. ACS TIE Extension [Figure: ACS datapath. Inputs carry path metrics (pm) in Rs and the four 8-bit branch metrics bm3..bm0 in Rt; an add, an MSB (=1?) test, and a subtract select the surviving path metric, written to Rr with a decision bit and zero padding. Instruction variants: ACS03 || ACS12 || ACS30 || ACS21.]

  50. ACS TIE Extension with State [Figure: ACS datapath with state. Register inputs Rs and Rt carry packed path metrics (pm) and four 8-bit branch metrics (bm3..bm0); add/subtract units and MSB (=1?) tests, steered by the control instruction, write two path metrics and two decision bits to Rr.]
