ARM Architecture

ARM Architecture Charles Bock

ARM Flavors • Cortex-A – Application (Fully Featured) • Android/iPhone • Windows RT Tablets • Cortex-R – Real Time (RTOS) • Cars • Routers • Infrastructure • Cortex-M – Embedded (Minimal) • Automation • Appliances • ULP (Ultra-Low-Power) Devices • I will focus on Cortex-A15 • Most Featured • Most Complex • Most Interesting

Cortex-A15 Overview • ARM processor architecture supports the 32-bit ARM ISA and the 16-bit and 32-bit Thumb ISAs • Superscalar, variable-length, out-of-order pipeline. • Dynamic branch prediction with Branch Target Buffer (BTB) and Global History Buffer (GHB) • Two separate 32-entry fully-associative Level 1 (L1) Translation Look-aside Buffers • 4-way set-associative 512-entry Level 2 (L2) TLB in each processor • Fixed 32KB L1 instruction and data caches. • Shared L2 cache of up to 4MB • 40-bit physical addressing (1TB)

Instruction Set • RISC (ARM – Advanced RISC Machine) • Fixed instruction width of 32 bits for easy decoding and pipelining, at the cost of decreased code density. • Additional modes or states allow additional instruction sets • Thumb (16 bit) • Thumb-2 (16 and 32 bit) • Jazelle (Java bytecode) • Trade-off: 32-bit ARM vs 16-bit Thumb

Thumb • Thumb is a 16-bit instruction set • Improved code density (and performance on narrow memory systems), with more implied operands. • Subset of the functionality of the ARM instruction set

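Instruction Encoding [Figure: two ADD instruction encodings compared. Labels on the first: Always, ADD, Op 1, Destination, Wasted!, Op 2. Labels on the second: ADD, Operands, Destination.]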

Jazelle • Jazelle DBX (Direct Bytecode eXecution) technology for direct Java bytecode execution • Direct interpretation of bytecode into machine code

General Layout • Fetch • Decode • Dispatch • Execute • Load/Store • WriteBack

Block Diagram 1 [Figure: pipeline block diagram] • Integer pipeline depth: 15 stages (same as recent Intel cores) • FP / SIMD pipeline depth: 18-24 stages • Instructions are broken down into sub-operations at decode (this is genius) • Register renaming • SIMD

Block Diagram 2

Fetch • Up to 128 bits per fetch depending on alignment • ARM Set: 4 Instructions (32 bit) • Thumb Set: 8 Instructions (16 bit) • Only 3 can be dispatched per cycle. • Support for unaligned fetch address. • Branch prediction begins in parallel with fetch.
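The fetch and dispatch numbers above can be checked with a little arithmetic. The sketch below is only an illustration: the function names are made up, and it assumes a fully aligned 128-bit fetch window.

```python
FETCH_WINDOW_BITS = 128   # from the slide: up to 128 bits per fetch
DISPATCH_WIDTH = 3        # from the slide: only 3 dispatched per cycle


def instructions_per_fetch(instr_bits, window_bits=FETCH_WINDOW_BITS):
    """How many fixed-width instructions fit in one aligned fetch window."""
    return window_bits // instr_bits


def cycles_to_dispatch(n_instructions, width=DISPATCH_WIDTH):
    """Minimum cycles to dispatch n instructions (ceiling division)."""
    return -(-n_instructions // width)


arm_per_fetch = instructions_per_fetch(32)    # 32-bit ARM instructions
thumb_per_fetch = instructions_per_fetch(16)  # 16-bit Thumb instructions
```

One fetch of ARM code needs at least two dispatch cycles and one fetch of 16-bit Thumb code needs at least three, so under these assumptions fetch can run ahead of dispatch.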

Branch prediction - Global History Buffer • Global History Buffer • 3 arrays: Taken array, Not taken array, and Selector

Branch prediction - microBTB • microBTB • Reduces bubble on taken branches • 64-entry, fully associative, for fast-turnaround prediction • Caches taken branches only • Overruled by main predictor if they disagree

Branch Prediction - Indirect • Indirect Predictor • 256-entry BTB indexed by XOR of branch history and branch address • XOR allows indexing of multiple target addresses per branch
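To see why an XOR index lets one indirect branch keep several targets, here is a minimal sketch. It assumes the index mixes the branch address with recent branch history; how many history bits are used and how the address is hashed are assumptions, and the class is hypothetical.

```python
BTB_ENTRIES = 256  # from the slide


def indirect_index(branch_addr, history):
    """Fold the branch address and history bits into a 256-entry index."""
    return ((branch_addr >> 2) ^ history) & (BTB_ENTRIES - 1)


class IndirectPredictor:
    """Hypothetical direct-mapped table of predicted targets."""

    def __init__(self):
        self.targets = [None] * BTB_ENTRIES

    def predict(self, branch_addr, history):
        return self.targets[indirect_index(branch_addr, history)]

    def update(self, branch_addr, history, target):
        self.targets[indirect_index(branch_addr, history)] = target


pred = IndirectPredictor()
# The same branch reached along two different paths gets two entries.
pred.update(0x8000, 0b0101, 0x9000)
pred.update(0x8000, 0b1010, 0xA000)
```

Because the path taken to reach the branch changes the index, a switch statement or virtual call can have a different remembered target for each path.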

Branch Prediction – Return Stack • Return Address Stack • 8-32 entries deep • Most indirect jumps (about 85%) are returns from functions • Push on call • Pop on return
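The push-on-call, pop-on-return behaviour can be sketched as a small fixed-depth stack. The depth of 8 and the policy of dropping the oldest entry on overflow are assumptions made for the example.

```python
from collections import deque


class ReturnAddressStack:
    """Hypothetical fixed-depth return address predictor."""

    def __init__(self, depth=8):
        # A bounded deque silently drops the oldest entry when full.
        self._stack = deque(maxlen=depth)

    def on_call(self, return_addr):
        """Push the address of the instruction after the call."""
        self._stack.append(return_addr)

    def on_return(self):
        """Pop the predicted return target, or None if the stack is empty."""
        return self._stack.pop() if self._stack else None


ras = ReturnAddressStack()
ras.on_call(0x104)   # outer call
ras.on_call(0x208)   # nested call
inner = ras.on_return()
outer = ras.on_return()
```

Nested calls return in reverse order, which is exactly what a stack predicts; only call chains deeper than the stack lose their oldest return addresses.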

Branch Prediction - Misc • Deeper pipeline = larger mispredict penalty • Static predictor: predicts taken when no history is known for the branch

Decode / Out of order Issue • Instructions are Decoded into discrete sub operations • Multiple Issue Queues (8) • Instructions dispatched 3 per cycle to the appropriate issue queue • The instruction dispatch unit controls when the decoded instructions can be dispatched to the execution pipelines and when the returned results can be retired

Register Renaming • RRT (Register Rename Table) • Maps each register used by an instruction to an available physical register • Rename Loop • Queue which stores available registers for use • Registers are removed from the queue when in use • Registers are re-added when retired from use • 13 general-purpose registers, R0-R12 • R13 = Stack Pointer • R14 = Return Address (function calls) • R15 = Program Counter
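The rename table and the queue of available registers work together roughly as in the sketch below. The number of physical registers (32), the class name, and the method names are assumptions for illustration; the slide does not give the real sizes.

```python
from collections import deque


class Renamer:
    """Hypothetical rename table plus free-register queue."""

    def __init__(self, num_arch=16, num_phys=32):
        # Architectural register i starts out mapped to physical register i.
        self.table = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (new_dest, phys_srcs, old_dest)."""
        phys_srcs = [self.table[s] for s in srcs]  # read sources first
        if not self.free:
            raise RuntimeError("no free physical register; dispatch would stall")
        new_dest = self.free.popleft()             # removed while in use
        old_dest = self.table[dest]
        self.table[dest] = new_dest
        return new_dest, phys_srcs, old_dest

    def retire(self, old_dest):
        """The previous mapping is dead once the instruction retires."""
        self.free.append(old_dest)                 # re-added for reuse


r = Renamer()
first = r.rename(dest=1, srcs=[2, 3])   # e.g. ADD R1, R2, R3
second = r.rename(dest=1, srcs=[1])     # e.g. ADD R1, R1, #1
```

The second instruction reads the physical register written by the first but writes a fresh one, so a later writer of R1 never has to wait for earlier readers.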

Loop Buffer / Loop Cache • 32 Entries Long • Can contain up to two “forward” and one “backward” branch • Completely shuts down fetch and large parts of decode stages. • Why? Saves power, Saves time. • Smart!

Execution Lanes • Integer Lane • Single cycle integer operations • 2 ALUs, 2 shifters • FPU / SIMD (NEON) Lane • Asymmetric, varying length 2-10 cycles • Branch Lane • Any operation that targets the PC for writeback, usually 1 cycle • Mult / Div Lane • All Mult/Div operations, 4 cycles. • Load / Store Lane • Cache / Mem access 4 cycles. • Cache maintenance • 1 load and 1 store per cycle • Load cannot bypass store, store cannot bypass store

Load Store Pipeline • Issue queue 16 deep • Loads issue out of order but cannot bypass stores (safe) • Stores issue in order but only require the address to issue • Pipeline • AGU (Address Generation Unit) / TLB Lookup • Address and Tag Setup • Data / Tag Access • Data selection and forwarding

L1 Instruction / Data Caches • 32KB 2-way set-associative cache. • 64-byte blocks, so 256 sets * 2 ways * 64 bytes = 32KB • Physically-Indexed and Physically-Tagged (PIPT). • Strictly enforced write-through (Important for cache consistency!)
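The cache size arithmetic can be worked through in a few lines, along with how a physical address splits into tag, set index, and block offset for that geometry. The helper names are made up; the sizes come from the slide.

```python
def cache_geometry(size_bytes, ways, block_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * block_bytes)
    offset_bits = block_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split a physical address into (tag, set index, block offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


sets, offset_bits, index_bits = cache_geometry(32 * 1024, ways=2, block_bytes=64)
tag, index, offset = split_address(0x12345, offset_bits, index_bits)
```

With 6 offset bits and 8 index bits, everything above bit 13 of the 40-bit physical address is tag.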

L2 Shared Cache • 16 Way Set Assoc, 4MB • 4 tag banks to handle parallel requests • All Snooping is done at this level to keep caches consistent. • If a core is powered down its L1 cache can be restored from L2. • Any “Read Clean” Requests on the bus can be serviced by L2. • Supports Automatic Prefetching for Streaming Data Loads

Dual Layer TLB Structure • Layer One: • Two separate 32-entry fully associative L1 TLBs for data load and store pipelines. • Layer Two: • 4-way set-associative 512-entry L2 TLB in each processor • In General: • The TLB entries contain a global indicator or an Address Space Identifier (ASID) to permit context switches without TLB flushes. • The TLB entries contain a Virtual Machine Identifier (VMID) to permit virtual machine switches without TLB flushes. • Miss: • Trade-off: add more hardware for faster TLB miss handling, or let the OS handle it in software? • CPU includes a full hardware table-walk machine in case of a TLB miss; no OS involvement required.
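The way ASID and VMID tags avoid flushes can be sketched with a toy lookup. This model ignores capacity, associativity, and the two TLB levels; the 4KB page size, class name, and method names are assumptions made for the example.

```python
PAGE_SHIFT = 12  # assumption: 4KB pages


class Tlb:
    """Hypothetical TLB whose entries are tagged with VMID and ASID."""

    def __init__(self):
        # (vmid, asid or None for global, virtual page) -> physical page
        self.entries = {}

    def insert(self, vmid, asid, vaddr, paddr, is_global=False):
        key = (vmid, None if is_global else asid, vaddr >> PAGE_SHIFT)
        self.entries[key] = paddr >> PAGE_SHIFT

    def lookup(self, vmid, asid, vaddr):
        """Return the physical address, or None on a miss (table walk needed)."""
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        for key in ((vmid, asid, vpn), (vmid, None, vpn)):
            if key in self.entries:
                return (self.entries[key] << PAGE_SHIFT) | offset
        return None


tlb = Tlb()
tlb.insert(vmid=0, asid=1, vaddr=0x4000, paddr=0x80000)  # process 1
tlb.insert(vmid=0, asid=2, vaddr=0x4000, paddr=0x90000)  # process 2, same VA
tlb.insert(vmid=0, asid=0, vaddr=0xC000, paddr=0xA0000, is_global=True)
```

Both processes keep their translation for the same virtual address at once, so switching between them only changes the current ASID; a lookup from a different VMID simply misses.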

big.LITTLE • Combines the A15 with the A7. • Interconnect sits below the shared L2 caches

