Lecture 4

Lecture 4: Introduction to Digital Signal Processors (DSPs)
Dr. Konstantinos Tatas

Outline/objectives
• Identify the most important DSP processor architecture features and how they relate to DSP applications
• Understand the types of code appropriate for DSP implementation
ACOE343 - Embedded Real-Time Processor Systems - Frederick University

What is a DSP?
• A specialized microprocessor for real-time DSP applications
  • Digital filtering (FIR and IIR)
  • FFT
  • Convolution, matrix multiplication, etc.

Hardware used in DSP

Common DSP features
• Harvard architecture
• Dedicated single-cycle Multiply-Accumulate (MAC) instruction (hardware MAC units)
• Single Instruction, Multiple Data (SIMD) execution
• Very Long Instruction Word (VLIW) architecture
• Pipelining
• Saturation arithmetic
• Zero overhead looping
• Hardware circular addressing
• Cache
• DMA

Harvard Architecture
• Physically separate memories and paths for instructions and data

Single-Cycle MAC unit
• Can compute a sum of n products in n cycles
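The sum of n products mentioned above can be sketched in C as follows. This is only an illustration, not vendor code: the function name `dot_product` and the 16-bit operand / 32-bit accumulator widths are assumptions. On a DSP with a single-cycle MAC unit, a compiler could map each pass through the loop body to one MAC instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of n products, one multiply-accumulate per iteration.
   dot_product is a hypothetical name used for illustration only. */
int32_t dot_product(const int16_t *x, const int16_t *h, size_t n)
{
    assert(n == 0 || (x != NULL && h != NULL));
    int32_t acc = 0;                            /* accumulator, wider than the operands */
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)x[i] * (int32_t)h[i];   /* multiply, then accumulate */
    }
    return acc;
}
```

For example, with x = {1, 2, 3, 4} and h = {5, 6, 7, 8} the result is 5 + 12 + 21 + 32 = 70. The sketch does not guard the accumulator against overflow for very long inputs.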

Single Instruction - Multiple Data (SIMD)
• A technique for data-level parallelism that employs a number of processing elements working in parallel
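A rough way to picture data-level parallelism in plain C is shown below. The inner loop over `LANES` stands in for what a SIMD unit would do with one instruction; the lane count of 4 and the name `vector_add` are assumptions made for the sketch, and real hardware would use its own vector instructions or intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* hypothetical number of processing elements */

/* Adds b and c element by element. Each group of LANES additions is the
   same operation applied to independent data, which is the pattern a
   SIMD unit executes in parallel. */
void vector_add(int16_t *a, const int16_t *b, const int16_t *c, size_t n)
{
    assert(n % LANES == 0);  /* sketch only: leftover elements are not handled */
    for (size_t i = 0; i < n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++) {
            a[i + lane] = (int16_t)(b[i + lane] + c[i + lane]);
        }
    }
}
```

The key property is that every lane performs the same operation and no lane needs the result of another.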

Very Long Instruction Word (VLIW)
• A technique for instruction-level parallelism that executes instructions without dependencies (known at compile time) in parallel
• Example of a single VLIW instruction: F=a+b; c=e/g; d=x&y; w=z*h;
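The "no dependencies" condition can be checked mechanically. The sketch below is a simplified illustration: the `Op` type, the single-character variable names, and the function `can_bundle` are all hypothetical, and the check covers only data dependencies, not whether enough functional units are available.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-operand operation: dst = src1 <op> src2.
   Variables are identified by single characters for brevity. */
typedef struct {
    char dst, src1, src2;
} Op;

/* Returns 1 if no operation reads or writes the destination of another
   operation in the group (no data dependencies), 0 otherwise. */
int can_bundle(const Op *ops, size_t n)
{
    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (i == j) continue;
            if (ops[j].src1 == ops[i].dst ||   /* j reads what i writes */
                ops[j].src2 == ops[i].dst ||
                ops[j].dst  == ops[i].dst) {   /* both write the same variable */
                return 0;
            }
        }
    }
    return 1;
}
```

Applied to the slide's example (F=a+b; c=e/g; d=x&y; w=z*h) the function returns 1, since no operation uses another's result. For a pair such as a=b+c; d=a*e it returns 0, because the second operation reads `a`.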

CISC vs. RISC vs. VLIW

Pipelining
• DSPs commonly feature deep pipelines
• TMS320C6x processors have 3 pipeline stages, each with a number of phases (cycles):
  • Fetch
    • Program Address Generate (PG)
    • Program Address Send (PS)
    • Program Ready Wait (PW)
    • Program Receive (PR)
  • Decode
    • Dispatch (DP)
    • Decode (DC)
  • Execute
    • 6 to 10 phases
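A common back-of-the-envelope estimate for pipelined execution is sketched below. It assumes an ideal pipeline: one instruction (or VLIW word) issued per cycle and no stalls, so the first result appears after `depth` cycles and each later one follows a cycle apart. The function name `pipeline_cycles` is hypothetical.

```c
#include <assert.h>

/* Ideal pipeline estimate: depth cycles to fill the pipeline for the first
   instruction, then one more cycle for each of the remaining n - 1.
   Assumes one issue per cycle and no stalls (a simplification). */
unsigned pipeline_cycles(unsigned depth, unsigned n)
{
    assert(depth > 0);
    if (n == 0) {
        return 0;            /* nothing to execute */
    }
    return depth + (n - 1);
}
```

For instance, 8 instructions through a 5-phase pipeline take 5 + 7 = 12 cycles under these assumptions; a single instruction takes exactly `depth` cycles.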

Saturation Arithmetic
• Operations such as addition and multiplication are confined to a fixed range
• Results that would overflow or underflow are clamped to the maximum or minimum allowed value, respectively
• Associativity and distributivity no longer hold
• Examples with 1-byte signed saturation arithmetic (range -128 to 127):
  • 64 + 69 = 127
  • -127 - 5 = -128
  • (64 + 70) - 25 = 102 ≠ 64 + (70 - 25) = 109
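The clamping behaviour can be emulated in software, which is roughly what a processor without saturating instructions has to do. The following is a minimal sketch for signed 8-bit values; the helper names (`saturate8`, `sat_add8`, `sat_sub8`, `sat_mul8`, `saturation_examples`) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a wide intermediate result to the signed 8-bit range. */
static int8_t saturate8(int32_t v)
{
    if (v > INT8_MAX) return INT8_MAX;   /* overflow  -> +127 */
    if (v < INT8_MIN) return INT8_MIN;   /* underflow -> -128 */
    return (int8_t)v;
}

/* Compute in 32 bits so the true result is available, then clamp. */
int8_t sat_add8(int8_t a, int8_t b) { return saturate8((int32_t)a + b); }
int8_t sat_sub8(int8_t a, int8_t b) { return saturate8((int32_t)a - b); }
int8_t sat_mul8(int8_t a, int8_t b) { return saturate8((int32_t)a * b); }

/* Self-checking usage example. */
void saturation_examples(void)
{
    assert(sat_add8(64, 69) == 127);                 /* 133 clamps to 127 */
    assert(sat_sub8(-127, 5) == -128);               /* -132 clamps to -128 */
    assert(sat_sub8(sat_add8(64, 70), 25) == 102);   /* (64 + 70) clamps first */
    assert(sat_add8(64, sat_sub8(70, 25)) == 109);   /* no clamping needed */
}
```

The last two checks show the loss of associativity: grouping the addition first clamps 134 to 127 before subtracting, giving 102, while the other grouping never leaves the range and gives 109.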

Examples
• Perform the following operations using one-byte saturation arithmetic:
  • 0x77 + 0x99 =
  • 0x4 * 0x42 =
  • 0x3 * 0x51 =

Zero Overhead Looping
• Hardware support for loops with a constant number of iterations, using hardware loop counters and loop buffers
• No branching
• No loop overhead
• No pipeline stalls or branch prediction
• No need for loop unrolling

Hardware Circular Addressing
• Hardware support for circular buffers: a data structure implementing a fixed-length queue of fixed-size objects, where objects are added at the head of the queue and removed from the tail
• Requires at least 2 pointers (head and tail)
• Extensively used in digital filtering: y[n] = a_0 x[n] + a_1 x[n-1] + … + a_k x[n-k]
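The filter equation above maps naturally onto a circular buffer holding the most recent input samples. The sketch below is a software illustration under stated assumptions: `TAPS`, `FirState`, `fir_init` and `fir_step` are hypothetical names, the filter length is fixed at 4, and because the buffer is always full a single index suffices instead of separate head and tail pointers. The modulo operations mark the wrap-around that hardware circular addressing performs without extra instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAPS 4  /* hypothetical filter length (k + 1 coefficients) */

typedef struct {
    int16_t samples[TAPS];  /* the last TAPS inputs, stored circularly */
    size_t  head;           /* index of the most recent sample */
} FirState;

/* Initialise the delay line to zeros. */
void fir_init(FirState *s)
{
    assert(s != NULL);
    for (size_t i = 0; i < TAPS; i++) s->samples[i] = 0;
    s->head = 0;
}

/* Push one input sample and return y[n] = sum over i of coeff[i] * x[n-i]. */
int32_t fir_step(FirState *s, const int16_t coeff[TAPS], int16_t x)
{
    assert(s != NULL && coeff != NULL);
    s->head = (s->head + 1) % TAPS;     /* advance, wrapping at the end */
    s->samples[s->head] = x;            /* overwrite the oldest sample */

    int32_t acc = 0;
    size_t idx = s->head;
    for (size_t i = 0; i < TAPS; i++) {
        acc += (int32_t)coeff[i] * s->samples[idx];  /* a_i * x[n-i] */
        idx = (idx + TAPS - 1) % TAPS;  /* step back one sample, wrapping */
    }
    return acc;
}
```

Feeding a unit impulse (1, 0, 0, 0, 0) through a filter with coefficients {1, 2, 3, 4} returns 1, 2, 3, 4, 0, which is the coefficient sequence followed by zero, as expected for an FIR filter.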

Direct Memory Access (DMA)
• A feature that allows peripherals to access main memory without the intervention of the CPU
• Typically, the CPU initiates a DMA transfer, performs other operations while the transfer is in progress, and receives an interrupt from the DMA controller once the transfer is complete
• Can create cache coherency problems (after a DMA transfer, the data in the cache may differ from the data in external memory)
• Requires a DMA controller

Cache memory
• Separate instruction and data L1 caches (Harvard architecture)
• Cache coherence protocols are required, since most systems use DMA

DSP vs. Microcontroller
• DSP
  • Harvard architecture
  • VLIW/SIMD (parallel execution units)
  • No bit-level operations
  • Hardware MACs
  • DSP applications
• Microcontroller
  • Mostly von Neumann architecture
  • Single execution unit
  • Flexible bit-level operations
  • No hardware MACs
  • Control applications

Examples
• Estimate how long the following code fragment will take to execute on:
  • A general-purpose processor with a 1 GHz operating frequency, five-stage pipelining, 5 cycles required for multiplication and 1 cycle for addition
  • A DSP running at 500 MHz, with zero overhead looping, 6 independent ALUs and 2 independent single-cycle MAC units

    for (i=0; i<8; i++) {
        a[i] = 2*i + 3;
        b[i] = 3*i + 5;
    }

Review Questions
• Which of the following code fragments is appropriate for SIMD implementation?
  Fragment A:
    a[0]=b[0]+c[0];
    a[2]=b[2]+c[2];
    a[4]=b[4]+c[4];
    a[6]=b[6]+c[6];
  Fragment B:
    a[0]=b[0]&c[0];
    a[0]=b[0]%c[0];
    a[0]=b[0]+c[0];
    a[0]=b[0]/c[0];
• Can the following instructions be merged into one VLIW instruction? If not, how many VLIW instructions are needed?
  • a=b+c;
  • d=c/e;
  • f=d&a;
  • g=b%c;

Review Questions
• Which of the following is not a typical DSP feature?
  • Dedicated multiplier/MAC
  • Von Neumann memory architecture
  • Pipelining
  • Saturation arithmetic
• Which implementation would you choose for lowest power consumption?
  • ASIC
  • FPGA
  • General-purpose processor
  • DSP

Examples
• How many VLIW instructions does the following program fragment require if there are two independent data paths (a, b), with 3 ALUs and 1 MAC available in each, and 8 instructions per word? How many cycles will it take to execute if these are the first instructions in the program and every instruction requires 1 cycle, assuming the pipeline architecture of slide 10 with 6 phases of execution?

    ADD a1,a2,a3 ;a3 = a1+a2
    SUB b1,b3,b4 ;b4 = b1-b3
    MUL a2,a3,a5 ;a5 = a2*a3
    MUL b3,b4,b2 ;b2 = b3*b4
    AND a7,a0,a1 ;a1 = a7 AND a0
    MUL a3,a4,a5 ;a5 = a3*a4
    OR a6,a3,a2 ;a2 = a6 OR a3

References
• R. Chassaing, "DSP Applications Using C and the TMS320C6x DSK", Wiley, 2002
• Texas Instruments, TMS320C64x datasheets
• Analog Devices, ADSP-21xx Processors
