Architecture and Instruction Set of the C6x Processor

Architecture and Instruction Set of the C6x Processor Module 1

Reference • R. Chassaing, DSP Applications Using C and the TMS320C6x DSK, Wiley, 2002

DSP • TMS320 Introduction • Architecture • Functional Unit • Fetch & Execute Packet • Pipelining • Registers • Addressing Modes

DSP • Digital signal processing: the application of mathematical operations to digitally represented signals. • Signals are represented digitally as sequences of samples. • Digital signal processor: an electronic system that processes digital signals.

DSP System

DSP tasks • Most DSP tasks require • Repetitive numeric computation • Real-time processing • High memory • System flexibility • A DSP must perform these tasks efficiently while minimizing • Cost • Power • Memory use • Development time

TMS DSP IC • Part number: TMS 320 C6x • TMX – experimental device • TMP – prototype device • TMS – qualified device • 320 – TI DSP family • C – CMOS with ROM • E – CMOS with EPROM • 6 – generation • x – version number

TMS320 Introduction • Texas Instruments introduced the first-generation TMS32010 digital signal processor in 1982, the TMS320C25 in 1986, and the TMS320C50 in 1991. • These 16-bit processors are all fixed-point processors and are code-compatible.

Von Neumann vs. Harvard • The fixed-point processors C1x, C2x, and C5x are based on a modified Harvard architecture with separate memory spaces for data and instructions that allow concurrent accesses. • Quantization error or round-off noise from an ADC is a concern with a fixed-point processor.

The TMS320C30 floating-point processor was introduced in the late 1980s. • The TMS320C6201 (C62x) was announced in 1997. • The C62x is based on a very-long-instruction-word (VLIW) architecture, still using separate memory spaces for instructions and data, as with the Harvard architecture. • The C62x is not code-compatible with the previous generations of fixed-point processors.

TMS320C6x ARCHITECTURE • The TMS320C6711 is a floating-point processor based on the VLIW architecture. • Internal memory includes a two-level cache architecture with 4 kB of level 1 program cache (L1P), 4 kB of level 1 data cache (L1D), and 64 kB of RAM or level 2 cache for data/program allocation (L2). • It has a direct interface to both synchronous and asynchronous memories.

On-chip peripherals include two multichannel buffered serial ports (McBSPs), two timers, a 16-bit host port interface (HPI), and a 32-bit external memory interface (EMIF). • It requires 3.3 V for I/O and 1.8 V for the core (internal). • Internal buses • 32-bit program address bus • 256-bit program data bus (eight 32-bit instructions) • two 32-bit data address buses • two 64-bit load data buses • two 64-bit store data buses • With a 32-bit address bus, the total memory space is 2^32 bytes = 4 GB, including four external memory spaces: CE0, CE1, CE2, and CE3.

Three access levels in the memory map • 1. L1 memory: cache-based architecture with a program cache (4 kB) and a data cache (4 kB) • 2. L2 memory: 64 kB, shared by program and data • 3. L3 memory: external memory

Internal Memory

Independent memory banks on the C6x allow for two memory accesses within one instruction cycle. • Two independent memory banks can be accessed using two independent buses. • Two load or two store instructions can be performed in parallel. • No conflict results if the data accessed are in different memory banks. • Separate buses for program, data, and direct memory access (DMA) allow the C6x to perform concurrent program fetches, data reads and writes, and DMA operations.

C6x has a byte-addressable memory space. • Internal memory is organized as separate program and data memory spaces, with two 32-bit internal ports (two 64-bit ports with the C64x) to access internal memory. • With a clock of 150MHz onboard the DSK, one can ideally achieve two multiplies and accumulates per cycle, for a total of 300 million multiplies and accumulates (MACs) per second.

With six of the eight functional units capable of handling floating-point operations, it is possible to perform 900 million floating-point operations per second (MFLOPS) at 150 MHz. • With all eight functional units issuing one instruction per cycle, the peak rate is 1200 million instructions per second (MIPS).
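The peak figures quoted above all follow from multiplying operations per cycle by the clock rate. The short Python sketch below reproduces that arithmetic; the constant and function names are illustrative only, and the numbers are ideal peaks under the 150 MHz DSK clock stated in the text, not measured throughput.

```python
# Peak-rate arithmetic for the figures quoted above (a sketch, not a benchmark).
CLOCK_MHZ = 150        # DSK clock from the text
MACS_PER_CYCLE = 2     # ideal multiply-accumulates per cycle
FLOAT_UNITS = 6        # functional units that handle floating point
ALL_UNITS = 8          # all functional units, one instruction each per cycle


def peak_millions_per_second(ops_per_cycle, clock_mhz=CLOCK_MHZ):
    """Peak rate in millions of operations per second: ops/cycle x MHz."""
    return ops_per_cycle * clock_mhz


if __name__ == "__main__":
    print("Peak million MACs/s:", peak_millions_per_second(MACS_PER_CYCLE))
    print("Peak MFLOPS        :", peak_millions_per_second(FLOAT_UNITS))
    print("Peak MIPS          :", peak_millions_per_second(ALL_UNITS))
```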

FUNCTIONAL UNITS • The CPU consists of eight independent functional units divided into two data paths • Each path has a unit for • multiply operations (.M), • logical and arithmetic operations (.L), • branch, bit manipulation, and arithmetic operations (.S), • loading/storing and arithmetic operations (.D). • The .S and .L units are for arithmetic, logical, and branch instructions. • All data transfers make use of the .D units.

The arithmetic operations, such as subtract or add (SUB or ADD), can be performed by all the units except the .M units. • The eight functional units consist of four floating/fixed-point ALUs (two .L and two .S), two fixed-point ALUs (.D units), and two floating/fixed-point multipliers (.M units).

Each path includes a set of sixteen 32-bit registers, A0 through A15 and B0 through B15. • Two cross-paths (1x and 2x) allow functional units from one data path to access a 32-bit operand from the register file on the opposite side. • Each functional unit side can access data from the registers on the opposite side using a cross-path. • There are 32 general purpose registers, but some of them are reserved for specific addressing or are used for conditional instructions.

VelociTI™ • TI's modification of VLIW is called VelociTI • Reduces code size • Increases performance when instructions reside off-chip • The C6x architecture is based on the high-performance, advanced VelociTI very-long-instruction-word (VLIW) architecture developed by Texas Instruments (TI) • An excellent choice for multichannel and multifunction applications (several instructions are captured and processed simultaneously)

FETCH AND EXECUTE PACKETS • The VelociTI architecture, introduced by TI, is derived from the VLIW architecture. • An execute packet (EP) consists of a group of instructions that can be executed in parallel within the same cycle time. • The number of EPs within a fetch packet (FP) can vary from one to eight. • The VLIW architecture was modified to allow more than one EP to be included within an FP.

The least significant bit of every 32-bit instruction determines whether the next instruction belongs to the same EP (if 1) or is part of the next EP (if 0).

EP1 contains the two parallel instructions A and B; EP2 contains the three parallel instructions C, D, and E; and EP3 contains the three parallel instructions F, G, and H. • Bit 0 (LSB) of each 32-bit instruction contains a “p” bit that signals whether it is in parallel with a subsequent instruction. • The “p” bit of instruction B is zero, denoting that it is not within the same EP as the subsequent instruction C. • Similarly, instruction E is not within the same EP as instruction F.
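The grouping rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each instruction is given as a (name, p-bit) pair in program order; the function names are hypothetical, and the example values mirror instructions A through H from the slide.

```python
def p_bit_of(word):
    """Extract the p bit (bit 0, the LSB) from a 32-bit instruction word."""
    return word & 1


def split_into_execute_packets(fetch_packet):
    """Group a fetch packet into execute packets using the p bits.

    fetch_packet: list of (name, p_bit) pairs in program order.
    A p bit of 1 means the next instruction runs in parallel with this one;
    a p bit of 0 closes the current execute packet.
    """
    packets, current = [], []
    for name, p_bit in fetch_packet:
        current.append(name)
        if p_bit == 0:
            packets.append(current)
            current = []
    if current:  # tolerate a trailing p bit of 1 in this sketch
        packets.append(current)
    return packets


if __name__ == "__main__":
    # p bits for the example: B, E, and H each end an execute packet.
    fp = [("A", 1), ("B", 0), ("C", 1), ("D", 1), ("E", 0),
          ("F", 1), ("G", 1), ("H", 0)]
    for i, ep in enumerate(split_into_execute_packets(fp), start=1):
        print(f"EP{i}: {' || '.join(ep)}")
```

With the p bits shown, the fetch packet splits into the three execute packets named in the text: A and B, then C, D, and E, then F, G, and H.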

Pipelining • Pipelining is a key feature in a digital signal processor to get parallel instructions working properly. • There are three stages of pipelining: • program fetch, decode, and execute.

Non-pipelined scalar architecture • A processor that executes every instruction one after the other • May use processor resources inefficiently, potentially leading to poor performance • Pipelining • Executing different sub-steps of sequential instructions simultaneously • Superscalar architectures • Executing multiple instructions entirely simultaneously

Pipelining does not decrease the time for individual instruction execution. Instead, it increases instruction throughput. • The throughput of the instruction pipeline is determined by how often an instruction exits the pipeline. • If the stages are perfectly balanced, then the time per instruction on the pipelined machine is equal to • (time per instruction on nonpipelined machine) / (number of pipe stages)
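The throughput argument can be checked with a small cycle count. The sketch below assumes an idealized pipeline (one cycle per stage, perfectly balanced stages, no stalls); the function names are illustrative. It shows that a single instruction still takes all the stages, while the average time per instruction approaches one cycle as the instruction count grows.

```python
def cycles_nonpipelined(n_instructions, n_stages):
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    """Ideal pipeline: fill once, then one instruction completes per cycle."""
    if n_instructions < 1:
        raise ValueError("need at least one instruction")
    return n_stages + (n_instructions - 1)


def speedup(n_instructions, n_stages):
    """Ratio of nonpipelined to pipelined cycle counts."""
    return (cycles_nonpipelined(n_instructions, n_stages)
            / cycles_pipelined(n_instructions, n_stages))


if __name__ == "__main__":
    stages = 7  # PG, PS, PW, PR, DP, DC, E1 for a single-cycle instruction
    for n in (1, 10, 1000):
        print(n, cycles_nonpipelined(n, stages),
              cycles_pipelined(n, stages), round(speedup(n, stages), 3))
```

For one instruction the speedup is exactly 1 (latency is unchanged); for long instruction streams it approaches the number of stages, which is the ratio stated in the slide.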

Program Fetch • The program fetch stage is composed of four phases: • (a) PG: program address generate (in the CPU) to fetch an address • (b) PS: program address send (to memory) to send the address • (c) PW: program address ready wait (memory read) to wait for data • (d) PR: program fetch packet receive (at the CPU) to read opcode from memory

Decode Stage • The decode stage is composed of two phases: • (a) DP: to dispatch all the instructions within an FP to the appropriate functional units • (b) DC: instruction decode

Execute Stage • The execute stage is composed of six phases (with fixed point) to 10 phases (with floating point), due to delays (latencies) associated with the following instructions: • (a) Multiply instruction, which consists of two phases due to one delay • (b) Load instruction, which consists of five phases due to four delays • (c) Branch instruction, which consists of six phases due to five delays

Pipeline phases • Program fetch: PG PS PW PR • Decode: DP DC • Execute: E1-E6 (E1-E10 for double precision)
Pipelining effects • Clock cycles 1 to 12 • Each row below is one FP and starts one clock cycle after the row above it
FP 1: PG PS PW PR DP DC E1 E2 E3 E4 E5 E6
FP 2: PG PS PW PR DP DC E1 E2 E3 E4 E5 E6
FP 3: PG PS PW PR DP DC E1 E2 E3 E4
FP 4: PG PS PW PR DP DC E1 E2 E3 E4
FP 5: PG PS PW PR DP DC E1 E2 E3 E4
FP 6: PG PS PW PR DP DC E1 E2 E3 E4
FP 7: PG PS PW PR DP DC E1 E2 E3 E4

Each row represents an FP • PG of the first FP starts in cycle 1, PG of the second FP starts in cycle 2, and so on • Each FP has four phases for fetch and two phases for decode; execution can take from 1 to 10 phases • At cycle 7: • the instructions in the first FP are in the first execution phase (E1), • the instructions in the second FP are in the decoding phase (DC), • the instructions in the third FP are in the dispatching phase (DP), • and so on • All the instructions are proceeding through the various phases • Therefore the pipeline is full
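The cycle-7 snapshot described above can be reproduced with a small lookup. This is a sketch under simplifying assumptions: each FP enters PG one cycle after the previous one, there are no stalls, and only a single execute phase (E1) is modeled. The names PHASES and phase_at are hypothetical.

```python
# Fetch (PG PS PW PR), decode (DP DC), and one execute phase (E1).
PHASES = ["PG", "PS", "PW", "PR", "DP", "DC", "E1"]


def phase_at(fp_index, cycle, phases=PHASES):
    """Return the phase of fetch packet fp_index (1-based) at a clock cycle.

    FP k enters PG in cycle k. Returns None if the FP has not yet entered
    the pipeline or has already left the modeled phases.
    """
    offset = cycle - fp_index
    if 0 <= offset < len(phases):
        return phases[offset]
    return None


if __name__ == "__main__":
    cycle = 7
    for fp in range(1, 8):
        print(f"cycle {cycle}: FP {fp} is in {phase_at(fp, cycle)}")
```

At cycle 7 every one of the seven phases is occupied by a different FP, which is what the slide means by the pipeline being full.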

Most instructions have 1 execute phase • Multiply (MPY) has 2 • Load (LDH/LDW) has 5 • Branch (B) has 6 • Additional execute phases are associated with floating-point and double-precision instructions (up to 10 phases) • e.g., MPYDP has 9 delay slots and a total of 10 phases • Functional unit latency: • The number of cycles for which an instruction ties up a functional unit • It is 1 for all instructions except double-precision instructions • During that time no other instruction can use the functional unit • It is different from the number of delay slots • e.g., MPYDP has a functional unit latency of 4 but 9 delay slots • Delay slot: some instructions that physically follow the instruction are executed as if they were located before it. • Classic examples are branch and call instructions, which often execute the following instruction before the branch or call is performed.
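The numbers above follow one pattern: the count of execute phases is the number of delay slots plus one. The sketch below tabulates only the delay-slot values given in the slides; the DELAY_SLOTS table and the function name are illustrative, and ADD is used here as an assumed representative of the single-phase instructions.

```python
# Delay slots as quoted in the slides (ADD stands in for single-phase instructions).
DELAY_SLOTS = {
    "ADD": 0,     # single execute phase
    "MPY": 1,     # multiply
    "LDW": 4,     # load
    "B": 5,       # branch
    "MPYDP": 9,   # double-precision multiply
}


def execute_phases(mnemonic):
    """Execute phases = delay slots + 1 for the instructions in the table."""
    return DELAY_SLOTS[mnemonic] + 1


if __name__ == "__main__":
    for name in DELAY_SLOTS:
        print(f"{name:6s} delay slots={DELAY_SLOTS[name]} "
              f"execute phases={execute_phases(name)}")
```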

Registers • The two register files each contain sixteen 32-bit registers, for a total of 32 general-purpose registers (A0-A15, B0-B15) • Interaction with the CPU must be done through these registers • The four functional units on each side of the CPU can freely share the 16 registers belonging to that side • Two cross-paths, 1x and 2x, connect the functional units on each side to the register file on the opposite side, so they can access data from the other side's registers • If register access is by functional units on the same side of the CPU, the register file can service all the units in a single clock cycle

Registers A0, A1, B0, B1 are used as conditional registers. • Registers A4 through A7 and B4 through B7 are used for circular addressing. • Registers A0 through A9 and B0 through B9 (except B3) are temporary registers. • Any of the registers A10 through A15 and B10 through B15 that are used are saved and later restored before returning from a subroutine.

A 40-bit data value can be contained across a register pair. • The 32 least significant bits (LSBs) are stored in the even register (e.g., A2) and the remaining 8 bits are stored in the 8 LSBs of the next-upper (odd) register (A3). • A similar scheme is used to hold a 64-bit double-precision value within a pair of registers (even and odd).
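The register-pair layout can be illustrated with plain bit masking. This is a minimal sketch that treats the 40-bit value as an unsigned bit pattern; the function names are hypothetical, and the register names in the comments (A3:A2) follow the example in the text.

```python
MASK32 = 0xFFFFFFFF
MASK8 = 0xFF


def split_40bit(value):
    """Split a 40-bit pattern into (odd_register, even_register).

    The even register (e.g., A2) holds the 32 LSBs; the odd register
    (e.g., A3) holds the remaining 8 bits in its 8 LSBs.
    """
    if not 0 <= value < (1 << 40):
        raise ValueError("value does not fit in 40 bits")
    even = value & MASK32
    odd = (value >> 32) & MASK8
    return odd, even


def join_40bit(odd, even):
    """Rebuild the 40-bit pattern from the register pair."""
    return ((odd & MASK8) << 32) | (even & MASK32)


if __name__ == "__main__":
    a3, a2 = split_40bit(0x123456789A)
    print(f"A3 = 0x{a3:08X}, A2 = 0x{a2:08X}")
    print(f"rejoined = 0x{join_40bit(a3, a2):010X}")
```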

Addressing modes • Determine how memory is accessed • Addressing refers to the means of specifying the location of operands for instructions • The types of addressing are called addressing modes • Operands may be input operands for the operation as well as results of the operation • Addressing modes supported by the TMS320C67x include register-indirect, indexed register-indirect, and modulo addressing (circular addressing). Immediate data is also supported. • The TMS320C67x does not support modulo addressing for 64-bit data.

Immediate • The operand is part of the instruction • Example: ADD .L1 -13,A1,A6 • Register (implied) • The operand is specified in a register • Example: ADD .L1 A7,A6,A7 • Direct • The address of the operand is part of the instruction (added to an implied memory page) • Not supported • Indirect • The address of the operand is stored in a register • Example: LDW .D1 *A5++[8],A1

Register-Indirect Addressing • The operand is located at the memory address stored in a register • A special group of registers can be used to store addresses (address registers) • The most important addressing mode in DSPs • Efficient from the instruction-set point of view • Few bits are needed to indicate the address of the operand • The 32 registers (A0-A15, B0-B15) can be used as pointers • Indirect addressing uses '*' in conjunction with one of the 32 registers

1. *R • Register R contains the address of a memory location where a data value is stored • 2. *R++(d) • Register R contains the memory address • After the memory address is used, R is postincremented by the displacement d, so the new address is R+1 if d=1 • A double minus (--) instead postdecrements the address to R-d • 3. *++R(d) • The address is preincremented (or offset) by d, so the current address is R+d • A double minus (--) predecrements, so the current address is R-d • 4. *+R(d) • The address is offset by d, such that the current address is R+d • Unlike the previous case, R itself is not updated or modified
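The four forms differ only in which address is used for the access and whether R is written back. The Python sketch below models that behavior; the mode strings and the function name are illustrative, and addresses are counted in abstract locations (the scaling of d by the data size on real hardware is deliberately not modeled).

```python
def effective_address(mode, r, d=1):
    """Return (address_used, new_value_of_R) for an indirect addressing form.

    mode is one of "*R", "*R++", "*R--", "*++R", "*--R", "*+R", "*-R".
    r is the current pointer value and d is the displacement.
    """
    if mode == "*R":
        return r, r              # use R, leave it unchanged
    if mode == "*R++":
        return r, r + d          # use R, then postincrement
    if mode == "*R--":
        return r, r - d          # use R, then postdecrement
    if mode == "*++R":
        return r + d, r + d      # preincrement, R is updated
    if mode == "*--R":
        return r - d, r - d      # predecrement, R is updated
    if mode == "*+R":
        return r + d, r          # offset only, R is not modified
    if mode == "*-R":
        return r - d, r          # negative offset only, R is not modified
    raise ValueError(f"unknown addressing form: {mode}")


if __name__ == "__main__":
    for mode in ("*R", "*R++", "*R--", "*++R", "*--R", "*+R", "*-R"):
        used, new_r = effective_address(mode, r=100, d=8)
        print(f"{mode:5s} address used={used:3d}  R afterwards={new_r:3d}")
```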

Delay Line implemented with shifting of sample

Delay Line pointer manipulation using Circular Addressing
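The two delay-line figures contrast moving every sample on each update with moving only a pointer. The Python sketch below is one way to illustrate the difference; the class names are hypothetical, and the modulo operation stands in for the hardware circular addressing, which is not modeled here.

```python
class ShiftDelayLine:
    """Delay line that shifts every stored sample on each new input."""

    def __init__(self, length):
        self.buf = [0] * length

    def push(self, sample):
        # Move each sample one place toward the end, then store the newest.
        self.buf[1:] = self.buf[:-1]
        self.buf[0] = sample

    def taps(self):
        """Samples ordered newest first."""
        return list(self.buf)


class CircularDelayLine:
    """Delay line that leaves samples in place and moves a pointer."""

    def __init__(self, length):
        self.buf = [0] * length
        self.head = 0  # index of the newest sample

    def push(self, sample):
        # Step the pointer back with wraparound and overwrite the oldest sample.
        self.head = (self.head - 1) % len(self.buf)
        self.buf[self.head] = sample

    def taps(self):
        """Samples ordered newest first, read through the wrapped pointer."""
        n = len(self.buf)
        return [self.buf[(self.head + i) % n] for i in range(n)]


if __name__ == "__main__":
    shifted, circular = ShiftDelayLine(3), CircularDelayLine(3)
    for x in (1, 2, 3, 4, 5):
        shifted.push(x)
        circular.push(x)
    print("shift   :", shifted.taps())
    print("circular:", circular.taps())
```

Both versions expose the same newest-first samples; the circular version performs one write per input regardless of the delay-line length.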
