Unit 3 ARM ARCHITECTURE

Reference material: ARM System Developer's Guide, A. N. Sloss. Class notes: limayesir.wordpress.com. Arjun Gour, Shyamkant Limaye, ETC Department, SVPCET

RISC Principle ARM is a RISC

RISC Principle HISTORY OF RISC PHILOSOPHY • Initially, the dominant architecture in the PC market was the Intel IA-32, a CISC design. • Designers then started looking at simpler ISAs with fewer instructions. The first RISC projects came from IBM (IBM 801), Stanford (MIPS), and UC Berkeley (Berkeley RISC 1 and 2) in the late 1970s and early 1980s. They were all designed with a similar philosophy, which has become known as RISC. • The IBM 801 later evolved into the PowerPC architecture. • MIPS Computer Systems was purchased by Silicon Graphics, Inc. in 1992 and was spun off as MIPS Technologies, Inc. in 1998. Today, MIPS powers many consumer electronics and other devices. • Berkeley RISC 1 and RISC 2 evolved into SPARC.

RISC Principle • RISC uses a complex compiler and a simple processor. • CISC uses a simple compiler and a complex processor.

RISC Principle • Introduction of microprogramming (CISC) • Microcode is code that runs on a small computer inside the main computer and serves as the main computer's control unit. Each assembly instruction invokes a microprogram, which may take several clock cycles to execute. In each step, control signals are generated that control the data paths and the ALU operation. • Microcode made it cost-effective to implement complex instructions. • Microprogrammed control units use small, fast memories to hold the microcode, which reduces memory access latency. • Low-cost members of a processor family can be developed by simply changing the microcode. • Instructions can be customized to high-level language constructs, depending on the application.

RISC Principle RISC has: • Fewer instructions • Fewer addressing modes • Load-store architecture • More CPU registers


RISC Principle • Comparison of RISC and CISC programs for the following task: • Read data from memory • Add the contents of a register to it • Write the result back to memory • Increment the memory pointer. [Figure: the same program written for a CISC machine (VAX) and for a RISC machine]
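The program listings from this slide are not recoverable, so the following is only a rough sketch of the contrast, written as a toy Python model. The function names, the register names and the word-indexed memory are all assumptions made for illustration; this is not real VAX or ARM semantics.

```python
# Toy model: memory is a list of 32-bit words, registers are a dict.
# All names are hypothetical and chosen only for this illustration.
MASK32 = 0xFFFFFFFF

def cisc_add_autoinc(mem, regs, ptr, src):
    """One CISC-style instruction: mem[ptr] += src, then advance ptr."""
    mem[regs[ptr]] = (mem[regs[ptr]] + regs[src]) & MASK32
    regs[ptr] += 1
    return 1  # number of instructions issued

def risc_sequence(mem, regs, ptr, src, tmp):
    """The same task as a load-store sequence of three simple instructions."""
    regs[tmp] = mem[regs[ptr]]                   # load:  tmp <- [ptr]
    regs[tmp] = (regs[tmp] + regs[src]) & MASK32  # add:   tmp <- tmp + src
    mem[regs[ptr]] = regs[tmp]                   # store: [ptr] <- tmp ...
    regs[ptr] += 1                               # ... with post-increment of ptr
    return 3  # number of instructions issued
```

Both versions leave memory and the pointer in the same state; the RISC version needs more instructions and a scratch register, but each instruction is simple enough to be hardwired.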

RISC Principle • RISC Design Principles • Simple Operations • Each operation can execute in one cycle. • Advantage: no microcode is needed, and operations can be hardwired. • The cache subsystem must be designed properly to capture instructions. • Overall execution efficiency can be as good as that of a microcoded CISC machine. • Register-to-Register Operations • Only special load and store operations access memory. • This simplifies instruction set design, as it allows instructions to execute at a rate of one instruction per cycle. • Restricting operands to registers also simplifies the control unit.

RISC Principle • Large Register Set • Needed because RISC processors use register-to-register operations. • Provides ample opportunities for the compiler to optimize register usage. • Minimizes the overhead associated with procedure calls and returns. ARM, for example, keeps the return address in the link register (LR) instead of on the stack. • Simple Addressing Modes • Provide very few addressing modes: often just one or two. • Allow fast computation of operand addresses.

RISC Principle • Fixed-Length, Simple Instruction Format • RISC designs use fixed-length instructions. • Variable-length instructions can cause implementation and execution inefficiencies. • The boundaries of various fields in an instruction such as opcode and source operands are fixed. • Allows for efficient decoding and scheduling of instructions.

RISC Vs CISC summary • RISC • Fewer instructions • Instruction pipelining • One instruction per clock cycle • ALU operands are registers only; memory operations are load/store • CISC • More instructions • Generally no instruction pipelining • Many clock cycles for one instruction • ALU operands can be either registers or memory

RISC • Large number of registers are available • More Code for a task • Fixed length instructions • Less silicon • Hardwired control unit • Generally with Harvard memory Architecture • CISC • Limited number of registers are available • Less Code for Task • Variable length instructions • More silicon • Microprogrammed control • Can be Harvard or Von-Neumann Architecture

History of ARM • The ARM was developed in 1985 by Acorn Computer Group in the United Kingdom. • Acorn introduced its first RISC-based computer in 1987, targeting low-cost PCs. • In 1990, Acorn formed Advanced RISC Machines. • ARM, which initially stood for Acorn RISC Machine and was later changed to Advanced RISC Machine, defines a 32-bit RISC architecture.

ARM RISC features • Small instruction set • Fixed instruction length (32 bits) • Load store architecture • Large number of registers (16) • Pipelined • One instruction every clock

ARM enhancements over RISC • An in-line barrel shifter allows a shift to be combined with another operation in one instruction (MOV + shift, arithmetic/logic + shift). • Variable execution time, e.g. the Load Multiple instruction takes several cycles. • A condition code in every instruction, which avoids branches. • More addressing modes, including auto-increment.
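To make the "arithmetic + shift" idea concrete, here is a minimal Python sketch of the four basic shift types applied to the second operand before an addition. The function names are assumptions, and the model is plain 32-bit arithmetic for shift amounts 0 to 31; it ignores the special encodings and carry-out behaviour of a real ARM shifter.

```python
MASK32 = 0xFFFFFFFF

def barrel_shift(value, op, amount):
    """Apply one shift type to a 32-bit value (amount assumed to be 0..31)."""
    value &= MASK32
    if op == "LSL":   # logical shift left
        return (value << amount) & MASK32
    if op == "LSR":   # logical shift right, zero fill
        return value >> amount
    if op == "ASR":   # arithmetic shift right, sign fill
        signed = value - (1 << 32) if value & 0x80000000 else value
        return (signed >> amount) & MASK32
    if op == "ROR":   # rotate right
        return ((value >> amount) | (value << (32 - amount))) & MASK32
    raise ValueError(f"unknown shift type: {op}")

def add_shifted(rn, rm, op, amount):
    """Model of an add whose second operand passes through the shifter first."""
    return (rn + barrel_shift(rm, op, amount)) & MASK32
```

For example, `add_shifted(100, 3, "LSL", 2)` computes 100 + 3*4 in one step, which is the kind of operation that would otherwise need a separate shift instruction.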

Basic core architecture Components: • Register bank, 16 x 32 bits • Barrel shifter • Multiplier • ALU • A bus, B bus, ALU bus • Address register • Incrementer • Instruction decoder

ARM core components • Register bank: 16 registers of 32 bits each, plus CPSR and SPSR. Two read ports drive the A bus and the B bus; there is a separate read port for PC (r15). • The barrel shifter performs shift/rotate operations on the B bus. • The ALU takes the A bus and B bus as inputs and drives the ALU bus as output. It performs arithmetic and logical operations. • The incrementer increments the address register without using the ALU. • The instruction decoder decodes ARM, Thumb and Java instructions and produces control signals. • Data registers hold data going to and coming from memory.

Basic • Register Bank • All the registers shown are 32 bits in size. • Up to 18 registers are active at a time: 16 data registers (r0 to r15) and 2 processor status registers. All 18 are visible in the five exception modes; 17 are visible in User and System modes, which have no spsr. (There are 7 modes in total.) • cpsr: current program status register, common to all modes. • spsr: saved program status register, not visible in User mode. • Some registers are duplicated (banked) in different modes.

Basic Complete ARM register set
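Since the register-set figure is not reproduced here, the following Python sketch models the banking scheme of the classic ARM core: FIQ mode has its own r8 to r14, the other exception modes have their own r13 and r14, and each exception mode has its own spsr. The mode abbreviations and function names are illustrative assumptions.

```python
# Registers that each exception mode replaces with a private (banked) copy.
BANKED = {
    "fiq": ["r8", "r9", "r10", "r11", "r12", "r13", "r14", "spsr"],
    "irq": ["r13", "r14", "spsr"],
    "svc": ["r13", "r14", "spsr"],
    "abt": ["r13", "r14", "spsr"],
    "und": ["r13", "r14", "spsr"],
}
# Registers shared by User and System modes (and by any mode that does not bank them).
SHARED = [f"r{i}" for i in range(16)] + ["cpsr"]

def physical_register(mode, name):
    """Return the physical register that a register name maps to in a mode."""
    if name in BANKED.get(mode, []):
        return f"{name}_{mode}"
    if name in SHARED:
        return name
    raise KeyError(f"{name} is not visible in {mode} mode")

def visible_registers(mode):
    """List the physical registers a program can see in the given mode."""
    names = SHARED + (["spsr"] if mode in BANKED else [])
    return [physical_register(mode, n) for n in names]

# 17 shared registers plus all banked copies.
TOTAL_PHYSICAL = len(SHARED) + sum(len(v) for v in BANKED.values())
```

With this model, `visible_registers("usr")` has 17 entries, `visible_registers("irq")` has 18, and `TOTAL_PHYSICAL` comes to 37.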


Basic • Three registers are assigned to a particular task or special function: • r13: Stack Pointer (sp). Stores the top of the stack in the current processor mode. • r14: Link Register (lr). The core puts the return address here whenever it calls a subroutine, so no stack access is needed for the call itself. • r15: Program Counter (pc). Contains the address of the next instruction to be fetched by the processor. As ARM instructions are word aligned, bits 0 and 1 of the pc are unused.
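The roles of lr and pc in a subroutine call can be sketched in a few lines of Python. This is a simplified model with assumed names: `pc` here stands for the address of the calling instruction itself, whereas a real ARM core reads the pc ahead of the executing instruction because of the pipeline, which is ignored here.

```python
WORD = 4  # ARM-state instructions are 4 bytes long

def branch_with_link(regs, target):
    """Call: save the return address in lr, then jump to the subroutine."""
    regs["lr"] = regs["pc"] + WORD   # instruction after the call
    regs["pc"] = target & ~0b11      # word aligned: bits 0 and 1 are cleared

def return_from_subroutine(regs):
    """Return: copy lr back into pc. No memory (stack) access is involved."""
    regs["pc"] = regs["lr"]
```

A nested call would overwrite lr, so a non-leaf subroutine has to save lr (typically on the stack via sp) before calling further.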

Basic Current Program Status Register (CPSR) [Figure: CPSR bit layout] New architecture additions: • Bit 24: J (Jazelle state) • Bit 9: E (endianness) • Bit 8: A (abort disable)

FLAGS • N (Negative): the ALU produced a negative result (MSB = 1). • Z (Zero): the ALU produced a zero result (all bits 0). • C (Carry): a carry-out from the MSB was generated, either by an arithmetic operation in the ALU or by the shifter. • V (oVerflow): signed overflow into the sign bit. • Q (Saturation): set when the result of an enhanced DSP instruction causes an overflow and/or saturation.
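As a worked illustration of the N, Z, C and V definitions, the following Python sketch computes the four flags for a 32-bit addition. The function name `adds` is borrowed from the flag-setting add mnemonic, but the code is only an arithmetic model, not an instruction simulator.

```python
MASK32 = 0xFFFFFFFF

def adds(a, b):
    """Return (result, flags) for a 32-bit addition that updates N, Z, C, V."""
    a &= MASK32
    b &= MASK32
    full = a + b                 # up to 33 bits wide
    result = full & MASK32
    flags = {
        "N": result >> 31,                               # MSB of the result
        "Z": int(result == 0),                           # all bits zero
        "C": int(full > MASK32),                         # carry-out of bit 31
        # Signed overflow: both operands differ in sign from the result.
        "V": ((a ^ result) & (b ^ result)) >> 31,
    }
    return result, flags
```

Adding 1 to 0x7FFFFFFF sets N and V (the signed result wrapped negative), while adding 1 to 0xFFFFFFFF sets Z and C (an unsigned carry with a zero result).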

Basic • I: interrupt request (IRQ) mask bit. • F: fast interrupt request (FIQ) mask bit. • Together, I and F control the masking of IRQ and FIQ. • T: Thumb state bit. In Thumb state the processor executes purely 16-bit Thumb instructions.

More on Q Flag • The Q flag only exists in E (DSP) variants of ARM architecture v5 and above. It is used to detect saturation in special saturating arithmetic instructions (QADD, QSUB, QDADD, and QDSUB), or overflow in certain multiply instructions (SMLAxy and SMLAWy). • The Q flag is a sticky flag. Although these instructions can set the flag, they cannot clear it. You can execute a series of such instructions, and then test the flag to find out whether saturation or overflow occurred at any point in the series, without needing to check the flag after each instruction. • To clear the Q flag, use an MSR instruction.

Saturating arithmetic • In a normal adder, if the result is greater than the maximum value that can be represented, it wraps around. E.g. for 4-bit unsigned arithmetic, the maximum number is 1111 = 15. • 12 + 9 = 1100 + 1001 = 10101, which is represented as 0101 = 5. • In a 4-bit saturating adder, the result will be 1111. • In 32-bit signed saturating instructions, if overflow occurs: • the Q flag is set • if the full result would be less than -2^31, the result returned is -2^31 • if the full result would be greater than 2^31 - 1, the result returned is 2^31 - 1.
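The two cases above can be checked with a short Python sketch. The function names are assumptions; `qadd` models only the arithmetic and the sticky Q behaviour described in these notes, not the full instruction.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)

def sat_add_unsigned4(a, b):
    """4-bit unsigned saturating add: clamp at 1111 instead of wrapping."""
    return min(a + b, 0b1111)

def qadd(a, b, q_flag=False):
    """Signed 32-bit saturating add. Returns (result, q_flag).

    The Q flag is sticky: it can be set here but is never cleared.
    """
    full = a + b
    if full > INT32_MAX:
        return INT32_MAX, True
    if full < INT32_MIN:
        return INT32_MIN, True
    return full, q_flag
```

Passing the returned flag into the next call mimics running a series of saturating instructions and testing Q once at the end.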

ARM core extensions • They are standard components placed next to the ARM core. They improve performance, manage resources, and provide extra functionality and are designed to provide flexibility in handling particular applications. • Each ARM family has different extensions available. • There are three hardware extensions • cache and tightly coupled memory • memory management • coprocessor interface.

Unified Cache • A unified cache (instructions and data together) is used with a von Neumann core.

Harvard cache and TCM • Separate D and I caches for Harvard core. Low average access time but no guarantee. Access time depends on cache hit. • D and I Tightly coupled memory (TCM) for deterministic fast access.
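The difference between "low average access time" and "guaranteed access time" can be made concrete with a small calculation. The Python sketch below uses the usual weighted-average formula; the cycle counts and hit rate in the example are made-up numbers, not measurements of any ARM core.

```python
def average_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Expected access time of a cache: weighted average of hit and miss cost."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles

def worst_case_cycles(hit_cycles, miss_cycles):
    """What a real-time design must budget for when using a cache."""
    return max(hit_cycles, miss_cycles)
```

With an assumed 90% hit rate, 1 cycle per hit and 10 cycles per miss, the average is about 1.9 cycles but the worst case is still 10. A TCM behaves like the hit case on every access, which is why it is deterministic.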

MPU (Memory Protection Unit) • Nonprotected memory is fixed and provides very little flexibility. It is normally used for small, simple embedded systems that require no protection from rogue applications. • MPUs employ a simple system that uses a limited number of memory regions. These regions are controlled with a set of special coprocessor registers, and each region is defined with specific access permissions (RWX). • This type of memory management is used for systems that require memory protection but don’t have a complex memory map.
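A region-based permission check of this kind can be sketched as follows. The `Region` type, the example addresses and the rule that later regions override earlier ones where they overlap are assumptions of this toy model; a real MPU is programmed through coprocessor registers rather than a Python list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    base: int    # start address of the region
    size: int    # size in bytes
    perms: str   # any combination of "r", "w", "x"

def access_allowed(regions, address, kind):
    """Check one access ("r", "w" or "x") against a small list of regions.

    Assumption: when regions overlap, the one listed last takes priority.
    """
    for region in reversed(regions):
        if region.base <= address < region.base + region.size:
            return kind in region.perms
    return False  # no region covers the address, so the access faults

# Hypothetical memory map: read/execute flash, read/write RAM,
# and a read-only window inside the RAM.
EXAMPLE_REGIONS = [
    Region(0x00000000, 0x00010000, "rx"),
    Region(0x20000000, 0x00008000, "rw"),
    Region(0x20000000, 0x00001000, "r"),
]
```

Because the number of regions is small and fixed, the check is simple and fast, which suits systems that need protection but not a complex memory map.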

MMU (Memory Management Unit) • MMU uses a set of translation tables to provide fine-grained control over memory. These tables are stored in main memory and provide a virtual-to-physical address map as well as access permissions. • MMUs are designed for more sophisticated platform operating systems (Linux) that support multitasking.
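The idea of a table-driven virtual-to-physical map with permissions can be sketched in Python. This is a deliberately simplified, single-level model that maps memory in 1 MB sections; the dictionary-based table, the exception name and the example addresses are all assumptions, and real ARM translation tables have more levels, entry formats and attributes than shown here.

```python
SECTION_SHIFT = 20                       # 1 MB sections
SECTION_MASK = (1 << SECTION_SHIFT) - 1  # offset within a section

class TranslationFault(Exception):
    """Raised when no table entry maps the virtual address."""

def translate(table, vaddr, kind):
    """Translate a virtual address and check the access kind ("r", "w", "x").

    `table` maps a section index to (physical_base, permissions);
    physical_base is assumed to be 1 MB aligned.
    """
    entry = table.get(vaddr >> SECTION_SHIFT)
    if entry is None:
        raise TranslationFault(hex(vaddr))
    phys_base, perms = entry
    if kind not in perms:
        raise PermissionError(f"{kind} access denied at {hex(vaddr)}")
    return phys_base | (vaddr & SECTION_MASK)

# Hypothetical table for one task: code section and data section.
EXAMPLE_TABLE = {
    0x000: (0x80000000, "rx"),
    0x001: (0x80500000, "rw"),
}
```

Giving each task its own table is what lets a multitasking operating system present every task with the same virtual addresses while keeping their physical memory separate.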

Coprocessor • A coprocessor extends the processing features of a core by extending the instruction set or by providing configuration registers. • More than one coprocessor can be added to the ARM core via the coprocessor interface. • A coprocessor is accessed through a group of dedicated ARM instructions. • For example, coprocessor 15 is used to control the cache, TCMs, and memory management. • A coprocessor such as VFP adds floating-point instructions.

Pipeline • Pipelining ensures that a new instruction is taken in every clock cycle, even though each instruction may take several clock cycles to complete. • ARM7 has a 3-stage pipeline: • Fetch loads an instruction from memory. • Decode identifies the instruction to be executed. • Execute processes the instruction and writes the result back to a register.

Pipeline [Figure: ARM7 three-stage pipeline executing 3 consecutive instructions: ADD, SUB, CMP]
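The pipeline diagram can be approximated by computing which instruction occupies which stage in each cycle. The Python sketch below assumes an ideal pipeline with no stalls or multi-cycle instructions; the function name and the stage labels passed in are illustrative.

```python
def pipeline_schedule(instructions, stages=("Fetch", "Decode", "Execute")):
    """Return one dict per clock cycle mapping stage name -> instruction.

    Ideal pipeline: one instruction enters per cycle and nothing stalls.
    A stage with no instruction in it is reported as None.
    """
    depth = len(stages)
    total_cycles = len(instructions) + depth - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for position, stage in enumerate(stages):
            index = cycle - position  # instruction currently in this stage
            row[stage] = instructions[index] if 0 <= index < len(instructions) else None
        schedule.append(row)
    return schedule

for cycle, row in enumerate(pipeline_schedule(["ADD", "SUB", "CMP"]), start=1):
    print(cycle, row)
```

In the third cycle the pipeline is full: CMP is being fetched, SUB decoded and ADD executed. Passing a longer tuple of stage names models a deeper pipeline, where the total cycle count grows by one per extra stage.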

ARM 9 and 10 pipelines • As the pipeline length increases, the amount of work done at each stage is reduced, which allows the processor to attain a higher operating frequency. This in turn increases the performance. The system latency also increases because it takes more cycles to fill the pipeline before the core can execute an instruction. • ARM 9 architecture splits execute into execute, memory and write. Total 5 stages. • ARM 10 splits decode into issue and decode. Total 6 stages.

Pipeline ARM9 five-stage pipeline ARM10 six-stage pipeline.


Basic • Processor Modes • Seven processor modes in total: • Six privileged modes • Abort • Fast interrupt request • Interrupt request • Supervisor • System • Undefined • One non privileged mode • User

Basic Processor Modes

Basic • Abort mode: entered after a failed attempt to access memory (used for virtual memory). • Fast interrupt request and Interrupt request modes: correspond to the two interrupt levels available on the ARM processor. • Supervisor mode: the mode the processor is in after reset; generally the mode that an operating system kernel operates in. • Undefined mode: entered when the processor encounters an instruction that is undefined or not supported by the implementation.

Basic • System mode: a special version of user mode that allows full read-write access to the program status registers. • User mode: used for programs and applications.

Basic Mode bits: M4 M3 M2 M1 M0 (CPSR bits 4 to 0) select the processor mode.
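Putting the CPSR fields together, a status word can be decoded as in the following Python sketch. The bit positions of J, I, F, T and the flags follow the CPSR layout described in these notes; the mode encodings are the standard ones for the seven classic modes. The function name and the returned dictionary format are assumptions.

```python
# Encodings of the mode bits M[4:0] for the seven processor modes.
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}

def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into flags, mask bits, state and mode."""
    def bit(n):
        return (cpsr >> n) & 1

    j, t = bit(24), bit(5)
    state = {(0, 0): "ARM", (0, 1): "Thumb", (1, 0): "Jazelle"}.get((j, t), "other")
    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28), "Q": bit(27),
        "I": bit(7),   # 1 = IRQ masked
        "F": bit(6),   # 1 = FIQ masked
        "state": state,
        "mode": MODES.get(cpsr & 0x1F, "invalid"),
    }
```

For example, `decode_cpsr(0x600000D3)` reports Z and C set, both interrupts masked, ARM state and Supervisor mode.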

Basic • State and Instruction Sets • The state of the core determines which instruction set is being executed. • There are three instruction sets: • ARM • Thumb • Jazelle • Jazelle executes 8-bit instructions. It is a hybrid mix of software and hardware designed to speed up the execution of Java bytecodes.

Basic ARM instruction set is active when the processor is in ARM state. Thumb instruction set is active when the processor is in Thumb state.

Architecture Revisions • ARM Architecture Revisions • The ARM instruction set started with version v1, and the latest revision covered here is v8. Different families use different instruction set revisions. • ARM family numbering is different from instruction set numbering. Each ARM family has a different register organization. The classic ARM families form an independent series running from ARM1 to ARM11. ARM7TDMI, for example, uses instruction set v4T. • The ARM Cortex family has three profiles, A, R and M, based on the v6 and v7 architectures.

Nomenclature ARM instruction set revisions • v1: First architecture, 26-bit address • v2: Added multiplier, on-chip cache • v3: 32-bit address; added undefined instruction mode and abort mode • v4: Added signed and unsigned half-word and signed byte load and store instructions; Thumb mode (ARM7TDMI) • v5: Added signal processing, Jazelle • v6: More instructions for graphics processing • v7: Thumb-2; the Cortex families v7-A, v7-R and v7-M use v7 • v8: 64-bit registers; A64 instruction set; compatible with v7-A
