
CDA 5155 and 4150

Computer Architecture Week 2: 2 September 2014. CDA 5155 and 4150. Goals of the course: advanced coverage of computer architecture – general purpose processors, embedded processors, historically significant processors, design tools; instruction set architecture; processor microarchitecture.



Presentation Transcript


  1. Computer Architecture Week 2: 2 September 2014 CDA 5155 and 4150

  2. Goals of the course Advanced coverage of computer architecture – general purpose processors, embedded processors, historically significant processors, design tools. Instruction set architecture; processor microarchitecture; systems architecture; memory systems; I/O systems

  3. Teaching Staff Professor Gary Tyson PhD: University of California – Davis Faculty jobs: California State University Sacramento: 1987 - 1990 University of California – Davis: 1993 - 1994 University of California – Riverside: 1995 - 1996 University of Michigan: 1997 - 2003 Florida State University: 2003 – present

  4. Grading in 5155 (Fall’14) Programming assignments In-order pipeline simulation (10%) Out-of-order pipeline simulation (10%) Exams (2 @ 25% each) In class, 75 minutes Team Project (20%) 3 or 4 students per team Class Participation (10%)

  5. Time Management 3 hours/week lecture This is probably the most important time 2 hours/week reading Hennessy/Patterson: Computer Architecture: A Quantitative Approach 3-5 hours/week exam prep 5+ hours/week Project (1/3 semester) Total: ~10-15 hours per week.

  6. Tentative Course Timeline
Week 1 (Aug 28): Performance, ISA, Pipelining
Week 2 (Sept 2): Pipelining, Branch Prediction
Week 3 (Sept 9): Superscalar, Exceptions
Week 4 (Sept 16): Compilers, VLIW
Week 5 (Sept 23): Dynamic Scheduling
Week 6 (Sept 30): Dynamic Scheduling
Week 7 (Oct 7): Advanced pipelines
Week 8 (Oct 14): Advanced pipelines
Week 9 (Oct 21): Cache design
Week 10 (Oct 28): Cache design, VM
Week 11 (Nov 4): Multiprocessor, Multithreading
Week 12 (Nov 11): Embedded processors (Exam)
Week 13 (Nov 18): Embedded processors
Week 14 (Nov 25): Research Topics
Week 15 (Dec 2): Research Topics (project due; Exam)

  7. Web Resources Course Web Page: http://www.cs.fsu.edu/~tyson/courses/CDA5155 • Wikipedia: http://en.wikipedia.org/wiki/Microprocessor • Wisconsin Architecture Page: http://arch-www.cs.wisc.edu/home

  8. Levels of Abstraction • Problem/Idea (English?) • Algorithm (pseudo-code) • High-Level languages (C, Verilog) • Assembly instructions (OS calls) • Machine instructions (I/O interfaces) • Microarchitecture/organization (block diagrams) • Logic level: gates, flip-flops (schematic, HDL) • Circuit level: transistors, sizing (schematic, HDL) • Physical: VLSI layout, feature size, cabling, PC boards. What are the abstractions at each level?

  9. Levels of Abstraction • Problem/Idea (English?) • Algorithm (pseudo-code) • High-Level languages (C, Verilog) • Assembly instructions (OS calls) • Machine instructions (I/O interfaces) • Microarchitecture/organization (block diagrams) • Logic level: gates, flip-flops (schematic, HDL) • Circuit level: transistors, sizing (schematic, HDL) • Physical: VLSI layout, feature size, cabling, PC boards. At what level do I perform a square root? Recursion?

  10. Levels of Abstraction • Problem/Idea (English?) • Algorithm (pseudo-code) • High-Level languages (C, Verilog) • Assembly instructions (OS calls) • Machine instructions (I/O interfaces) • Microarchitecture/organization (block diagrams) • Logic level: gates, flip-flops (schematic, HDL) • Circuit level: transistors, sizing (schematic, HDL) • Physical: VLSI layout, feature size, cabling, PC boards. Who/what translates from one level to the next?

  11. Role of Architecture • Responsible for hardware specification: • Instruction set design • Also responsible for structuring the overall implementation • Microarchitectural design. • Interacts with everyone • mainly compiler and logic level designers. • Cannot do a good job without knowledge of both sides

  12. Design Issues: Performance • Get acceptable performance out of system. • Scientific: floating point throughput, memory&disk intensive, predictable • Commercial: string handling, disk (databases), predictable • Multimedia: specific data types (pixels), network? Predictable? • Embedded: what do you mean by performance? • Workstation: Maybe all of the above, maybe not

  13. Calculating Performance • Execution time is often the best metric • Throughput (tasks/sec) vs. latency (sec/task) • Benchmarks: what are the tasks? • What I care about! • Representative programs (SPEC, Linpack) • Kernels: representative code fragments • Toy programs: useful for testing end-conditions • Synthetic programs: do no useful work but have a representative instruction mix.

  14. Design Issues: Cost • Processor • Die size, packaging, heat sink? Gold connectors? • Support: fan, connectors, motherboard specifications, etc. • Calculating processor cost: • Cost of device = (die + package + testing) / yield • Die cost = wafer cost / good die yield • Good die yield related to die size and defect density • Support costs: direct costs (components, labor), indirect costs (sales, service, R&D) • Total costs amortized over number of systems sold (PC vs NASA)
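The cost arithmetic on this slide can be sketched numerically. All dollar figures, die counts, and yields below are made-up illustrative values, not real data; the two formulas are the ones from the slide.

```python
# Sketch of the slide's processor cost model with invented numbers.

def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Die cost = wafer cost spread over the good dies on the wafer."""
    return wafer_cost / (dies_per_wafer * die_yield)

def device_cost(die, package, testing, final_yield):
    """Cost of device = (die + package + testing) / yield."""
    return (die + package + testing) / final_yield

# Hypothetical: a $5000 wafer, 100 die sites, 50% good-die yield.
d = die_cost(wafer_cost=5000.0, dies_per_wafer=100, die_yield=0.5)  # $100 per good die
print(device_cost(die=d, package=10.0, testing=5.0, final_yield=0.9))
```

Note how yield appears twice: once for dies lost on the wafer, and once for packaged parts that fail final test.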

  15. Other design issues • Some applications care about other design issues. • NASA deep space mission • Reliability: software and hardware (radiation hardening) • AMD • Code compatibility • ARM • Power

  16. A Quantitative Approach Hardware systems performance is generally easy to quantify Machine A is 10% faster than Machine B Of course Machine B’s advertising will show the opposite conclusion Many software systems tend to have much more subjective performance evaluations.

  17. Measuring Performance Total Execution Time: compare (1/n) Σ_{i=1}^{n} Time_i for each machine; e.g., A is 3 times faster than B for programs P1, P2. Issue: emphasizes long-running programs.

  18. Measuring Performance Weighted Execution Time: Σ_{i=1}^{n} Weight_i × Time_i. What if P1 is executed far more frequently?

  19. Measuring Performance Normalized Execution Time: Compare machine performance to a reference machine and report a ratio. SPEC ratings measure relative performance to a reference machine.
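The three summarization schemes from slides 17-19 can be sketched in a few lines; the times and weights below are hypothetical values for machines A and B on programs P1, P2.

```python
# Hypothetical program times (seconds) on machines A and B.
times_A = [1.0, 10.0]
times_B = [3.0, 30.0]
weights = [0.9, 0.1]    # P1 runs far more often than P2

def total_mean(times):
    """Total execution time: (1/n) * sum of Time_i."""
    return sum(times) / len(times)

def weighted_mean(times, weights):
    """Weighted execution time: sum of Weight_i * Time_i."""
    return sum(w * t for w, t in zip(times, weights))

def normalized(times, ref_times):
    """SPEC-style ratios relative to a reference machine."""
    return [r / t for t, r in zip(times, ref_times)]

print(total_mean(times_B) / total_mean(times_A))  # 3.0: A is 3x faster here
```

With B as the reference machine, `normalized(times_A, times_B)` gives A a ratio of 3.0 on both programs, matching the total-time comparison in this case.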

  20. Amdahl’s Law Rule of Thumb: Make the common case faster http://en.wikipedia.org/wiki/Amdahl's_law Repeat: attack the longest-running part until it is no longer the longest.
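Amdahl's Law itself is a one-liner: if a fraction f of execution time is sped up by a factor s, overall speedup is 1 / ((1 - f) + f/s). The f and s values below are example inputs.

```python
def amdahl_speedup(f, s):
    """Overall speedup when fraction f of execution is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Speeding up 80% of a program by 4x yields about 2.5x overall.
print(amdahl_speedup(0.8, 4))
# Even an enormous speedup of that 80% is capped near 1/(1-0.8) = 5x.
print(amdahl_speedup(0.8, 1e12))
```

This is why the slide says to attack the longest-running part: the untouched (1 - f) fraction bounds the payoff.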

  21. Instruction Set Design Software Systems: named variables; complex semantics. Hardware systems: tight timing requirements; small storage structures; simple semantics Instruction set: the interface between very different software and hardware systems

  22. Design decisions How much “state” is in the microarchitecture? Registers; Flags; IP/PC How is that state accessed/manipulated? Operand encoding What commands are supported? Opcode; opcode encoding

  23. Design Challenges: or why is architecture still relevant? Clock frequency is increasing. This changes the number of levels of gates that can be completed each cycle, so old designs don’t work. It also tends to increase the fraction of time spent on wires (fixed speed of light). Power: faster chips are hotter; bigger chips are hotter.

  24. Design Challenges (cont) Design Complexity More complex designs to fix frequency/power issues lead to increased development/testing costs. Failures (design or transient) can be difficult to understand (and fix). We seem far less willing to live with hardware errors (e.g. FDIV) than software errors, which are often dealt with through upgrades (that we pay for!).

  25. Techniques for Encoding Operands Explicit operands: Includes a field to specify which state data is referenced Example: register specifier Implicit operands: All state data can be inferred from the opcode Example: function return (CISC-style)

  26. Accumulator Architectures with one implicit register Acts as source and/or destination One other source explicit Example: C = A + B Load A // (Acc)umulator = A Add B // Acc = Acc + B Store C // C = Acc Ref: “Instruction Level Distributed Processing: Adapting to Shifting Technology”

  27. Stack Architectures with implicit “stack” Acts as source(s) and/or destination Push and Pop operations have 1 explicit operand Example: C = A + B Push A // Stack = {A} Push B // Stack = {A, B} Add // Stack = {A+B} Pop C // C = A+B ; Stack = {} Compact encoding; may require more instructions though
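The Push/Push/Add/Pop sequence on this slide can be mimicked with a tiny stack-machine sketch. The memory contents (A=2, B=3) are invented; only the four operations mirror the slide.

```python
# Minimal stack-machine sketch for C = A + B (slide 27).
memory = {"A": 2, "B": 3}   # hypothetical variable values
stack = []

def push(var):
    stack.append(memory[var])      # 1 explicit operand

def add():
    stack.append(stack.pop() + stack.pop())  # operands implicit: top of stack

def pop(var):
    memory[var] = stack.pop()      # 1 explicit operand

push("A")   # Stack = {A}
push("B")   # Stack = {A, B}
add()       # Stack = {A+B}
pop("C")    # C = A+B ; Stack = {}
print(memory["C"], stack)   # 5 []
```

Note that `add` names no operands at all, which is what makes stack encodings so compact, at the price of extra push/pop traffic.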

  28. Registers Most general (and common) approach Small array of storage Explicit operands (register file index) Example: C = A + B
Register-memory: Load R1, A / Add R3, R1, B / Store R3, C
Load/store: Load R1, A / Load R2, B / Add R3, R1, R2 / Store R3, C

  29. Memory Big array of storage More complex ways of indexing than registers Build addressing modes to support efficient translation of software abstractions Uses less space in instruction than 32-bit immediate field A[i]; use base (i) + displacement (A) (scaled?) a.ptr; use base (a) + displacement (ptr)

  30. Addressing modes
Register: Add R4, R3
Immediate: Add R4, #3
Base/Displacement: Add R4, 100(R1)
Register Indirect: Add R4, (R1)
Indexed: Add R4, (R1+R2)
Direct: Add R4, (1001)
Memory Indirect: Add R4, @(R3)
Autoincrement: Add R4, (R2)+
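Several of these modes reduce to a small effective-address calculation. The register and memory contents below are invented for illustration; each function corresponds to one mode from the slide.

```python
# Effective-address sketches for a few addressing modes (slide 30).
regs = {"R1": 100, "R2": 8, "R3": 0x200}   # hypothetical register values
mem  = {108: 42, 0x200: 0x300}             # hypothetical memory contents

def base_disp(base, disp):     # 100(R1): register + displacement
    return regs[base] + disp

def reg_indirect(base):        # (R1): address is in the register
    return regs[base]

def indexed(b1, b2):           # (R1+R2): sum of two registers
    return regs[b1] + regs[b2]

def mem_indirect(base):        # @(R3): one extra memory read to get the address
    return mem[regs[base]]

print(base_disp("R1", 100))      # 200
print(mem[indexed("R1", "R2")])  # 42, the operand at address 108
```

Memory indirect is the expensive one: it needs a memory access just to form the address, before the operand itself is fetched.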

  31. Other Memory Issues What is the size of each element in memory? Byte (0-255), half word (0-65535), or word (0 to ~4 billion).

  32. Other Memory Issues Big-endian or little-endian? Store 0x114488FF starting at address 0x000. Big-endian stores 11 44 88 FF (the address points to the most significant byte); little-endian stores FF 88 44 11 (the address points to the least significant byte).
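Python's struct module reproduces both byte orderings for the store of 0x114488FF from this slide:

```python
import struct

# Pack the word 0x114488FF both ways; index 0 is the lowest address.
big    = struct.pack(">I", 0x114488FF)  # big-endian: MSB at address 0
little = struct.pack("<I", 0x114488FF)  # little-endian: LSB at address 0

print(big.hex())     # 114488ff
print(little.hex())  # ff884411
```

The value is the same either way; only the mapping of bytes to addresses differs, which matters when two machines exchange raw memory images.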

  33. Other Memory Issues Non-word loads? With memory holding 11 44 88 FF starting at 0x000, ldb R3, (000) loads R3 = 0x00000011.

  34. Other Memory Issues Non-word loads? ldb R3, (003) loads the byte 0xFF and sign-extends it: R3 = 0xFFFFFFFF.

  35. Other Memory Issues Non-word loads? ldbu R3, (003) loads the byte 0xFF zero-filled: R3 = 0x000000FF.
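The ldb/ldbu distinction from slides 33-35 (sign-extend vs. zero-fill) can be sketched over the same 4-byte memory image:

```python
# Byte loads over the memory 11 44 88 FF from the slides.
mem = bytes([0x11, 0x44, 0x88, 0xFF])

def ldbu(addr):
    """Unsigned byte load: upper bits zero filled."""
    return mem[addr]

def ldb(addr):
    """Signed byte load: replicate bit 7 into the upper bits."""
    b = mem[addr]
    return b - 0x100 if b & 0x80 else b

print(hex(ldb(0)))   # 0x11  (positive byte, unchanged)
print(ldb(3))        # -1    (0xFF sign-extended, i.e. 0xFFFFFFFF as 32 bits)
print(hex(ldbu(3)))  # 0xff  (0xFF zero-filled)
```

The hardware does the same thing with wires: the sign-extended load fans out bit 7 of the byte to bits 8-31 of the register.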

  36. Other Memory Issues Alignment? Word accesses only at addresses ending in 00; half-word accesses only at addresses ending in 0; byte accesses at any address. So ldw R3, (002) is illegal! Why is it important to be aligned? How can it be enforced?
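One answer to "how can it be enforced": check the low address bits and trap on a mismatch, which is a software sketch of the alignment exception a processor would raise.

```python
# Alignment check: an access of `size` bytes must start at a multiple of size.
def check_aligned(addr, size):
    if addr % size != 0:
        raise ValueError(f"misaligned {size}-byte access at {addr:#x}")
    return addr

check_aligned(0x004, 4)        # word access at ...00: fine
try:
    check_aligned(0x002, 4)    # ldw R3, (002): illegal
except ValueError as e:
    print(e)
```

Alignment matters because an aligned word falls entirely within one memory row; a misaligned one would need two accesses plus a shift-and-merge.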

  37. Techniques for Encoding Operators Opcode is translated to control signals that direct data (MUX control), select the operation for the ALU, and set read/write selects for register/memory/PC. Tradeoff between how flexible the control is and how compact the opcode encoding is. Microcode – direct control of signals (Improv) Opcode – compact representation of a set of control signals. You can make decode easier with careful opcode selection

  38. Handling Control Flow Conditional branches (short range) Unconditional branches (jumps) Function calls Returns Traps (OS calls and exceptions) Predicates (conditional retirement)

  39. Encoding branch targets PC-relative addressing Makes linking code easier Indirect addressing Jumps into shared libraries, virtual functions, case/switch statements Some unusual modes to simplify target address calculation (segment offset) or (trap number)

  40. Condition codes Flags Implicit: flag(s) specified in opcode (bgt) Flag(s) set by earlier instructions (compare, add, etc.) Register Uses a register; requires explicit specifier Comparison operation Two registers with compare operation specified in opcode.

  41. Higher Level Semantics: Functions Function call semantics Save PC + 1 instruction for return Manage parameters Allocate space on stack Jump to function Simple approach: Use a jump instruction + other instructions Complex approach: Build implicit operations into new “call” instruction

  42. Role of the Compiler Compilers make the complexity of the ISA (from the programmer’s point of view) less relevant. Non-orthogonal ISAs are more challenging. State allocation (register allocation) is better left to compiler heuristics. Complex semantics lead to more global optimization – easier for a machine to do. People are good at optimizing 10 lines of code. Compilers are good at optimizing 10M lines.

  43. LC processor Little Computer Fall 2011 For programming projects Instruction Set Design opcode regA regB destReg

  44. LC processor R-type instructions: opcode (bits 24-22), regA (21-19), regB (18-16), unused (15-3), destReg (2-0) add: destReg = regA + regB nand: destReg = ~(regA & regB)

  45. LC processor I-type instructions: opcode (bits 24-22), regA (21-19), regB (18-16), offsetField (15-0) lw: regB = Memory[regA + offsetField] sw: Memory[regA + offsetField] = regB beq: if (regA == regB) PC = PC + 1 + offsetField

  46. LC processor O-type instructions: opcode (bits 24-22), unused (21-0) noop: do nothing halt: halt the simulation

  47. LC assembly example
        lw 0 1 five        load reg1 with 5 (uses symbolic address)
        lw 1 2 3           load reg2 with -1 (uses numeric address)
start   add 1 2 1          decrement reg1
        beq 0 1 2          goto end of program when reg1==0
        beq 0 0 start      go back to the beginning of the loop
        noop
done    halt               end of program
five    .fill 5
neg1    .fill -1
stAddr  .fill start        will contain the address of start (2)

  48. LC machine code example
(address 0): 8454151 (hex 0x810007)
(address 1): 9043971 (hex 0x8a0003)
(address 2): 655361 (hex 0xa0001)
(address 3): 16842754 (hex 0x1010002)
(address 4): 16842749 (hex 0x100fffd)
(address 5): 29360128 (hex 0x1c00000)
(address 6): 25165824 (hex 0x1800000)
(address 7): 5 (hex 0x5)
(address 8): -1 (hex 0xffffffff)
(address 9): 2 (hex 0x2)
Input for simulator: 8454151 9043971 655361 16842754 16842749 29360128 25165824 5 -1 2
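These words can be reproduced with a small encoder for the three instruction formats (slides 44-46). The opcode numbering used below (add=0, nand=1, lw=2, sw=3, beq=4, halt=6, noop=7) is an assumption not stated on the slides, but it reproduces every word in this listing.

```python
# Sketch of an LC instruction encoder; opcode values are assumed, not given.
OP = {"add": 0, "nand": 1, "lw": 2, "sw": 3, "beq": 4, "halt": 6, "noop": 7}

def encode(op, regA=0, regB=0, dest_or_offset=0):
    word = OP[op] << 22 | regA << 19 | regB << 16
    if op in ("add", "nand"):
        return word | dest_or_offset             # R-type: destReg in bits 2-0
    if op in ("lw", "sw", "beq"):
        return word | (dest_or_offset & 0xFFFF)  # I-type: 16-bit offsetField
    return word                                  # O-type: noop/halt

print(encode("lw", 0, 1, 7))    # 8454151: address 0, 'lw 0 1 five' (five is at 7)
print(encode("beq", 0, 0, -3))  # 16842749: address 4, branch back to start
```

The -3 offset in the second call comes from the beq semantics PC = PC + 1 + offsetField: from address 4, reaching start at address 2 requires 2 - (4 + 1) = -3, stored as 0xFFFD in the 16-bit field.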
