Introduction to Embedded Systems Rabie A. Ramadan rabieramadan@gmail.com http://www.rabieramadan.org/classes/2014/embedded/ 2
Topics: embedded microprocessor market; categories of CPUs; RISC, DSP, and multimedia processors; CPU mechanisms.
Demand for Embedded Processors: Embedded processors account for over 97% of total processors sold. Sales are expected to increase by roughly 15% each year.
Evaluating Processors: Performance. Latency is the time required to execute an instruction from start to finish; throughput is the rate at which instructions are finished.
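To make the difference between latency and throughput concrete, here is a minimal sketch for an idealized pipeline. The function name `pipeline_timing`, the 5-stage depth, and the 2 ns cycle time are assumptions chosen for illustration; the model also assumes one instruction enters per cycle with no stalls.

```python
def pipeline_timing(num_instructions, num_stages, cycle_time_ns):
    """Return (latency_ns, total_time_ns, throughput_per_ns) for an ideal pipeline.

    Assumption: one instruction enters the pipeline per cycle and nothing stalls.
    """
    # Latency: one instruction must pass through every stage.
    latency_ns = num_stages * cycle_time_ns
    # The first instruction needs num_stages cycles; each later one finishes
    # one cycle after its predecessor.
    total_cycles = num_stages + num_instructions - 1
    total_time_ns = total_cycles * cycle_time_ns
    # Throughput: instructions finished per unit of time.
    throughput = num_instructions / total_time_ns
    return latency_ns, total_time_ns, throughput


latency, total, throughput = pipeline_timing(
    num_instructions=100, num_stages=5, cycle_time_ns=2.0
)
print(f"latency per instruction: {latency} ns")
print(f"time for 100 instructions: {total} ns")
print(f"throughput: {throughput:.3f} instructions/ns")
```

Under these assumptions each instruction still takes 10 ns from start to finish, yet the pipeline completes close to one instruction every 2 ns, which is why the two metrics are reported separately.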
Evaluating Processors: At the program level, computer architects also speak of average performance or peak performance. Peak performance is often calculated assuming that instruction throughput proceeds at its maximum rate and all processor resources are fully utilized.
Evaluating Processors: Embedded system designers often talk about program performance in terms of worst-case (or sometimes best-case) performance. This is not simply a characteristic of the processor; it is determined for a particular program running on a given processor.
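As a small illustration of why best-case and worst-case performance belong to a program and its input rather than to the processor alone, the sketch below counts the comparisons made by a linear search. The function `search_steps` and the sample data are hypothetical, and the comparison count is only a stand-in for execution time.

```python
def search_steps(values, target):
    """Count comparisons made by a linear search (a proxy for execution time)."""
    steps = 0
    for value in values:
        steps += 1
        if value == target:
            break
    return steps


data = [3, 8, 1, 9, 4]
best_case = search_steps(data, 3)   # target is the first element
worst_case = search_steps(data, 7)  # target is absent, so every element is checked
print(best_case, worst_case)
```

The same code on the same processor takes between one and five steps here, depending only on the data.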
Evaluating Processors: Cost. The purchase price of the processor. In VLSI design, cost is often measured in terms of the silicon area required to implement a processor, which is closely related to chip cost.
Evaluating Processors: Energy and power. In modern processors, energy and power consumption must be measured for a particular program and data for accurate results.
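One way to see why energy has to be measured per program: energy is power integrated over the run, so two programs with different power profiles and run times consume different energy on the same chip. The sketch below assumes evenly spaced power samples; the function name and the sample values are made up for illustration.

```python
def energy_joules(power_samples_watts, sample_period_s):
    """Approximate energy as the sum of power samples times the sample period."""
    return sum(power_samples_watts) * sample_period_s


# Hypothetical power traces for two programs, sampled every millisecond.
program_a = [0.5, 1.0, 0.5]            # short run with one burst
program_b = [0.4, 0.4, 0.4, 0.4, 0.4]  # longer run at lower power

energy_a = energy_joules(program_a, 0.001)
energy_b = energy_joules(program_b, 0.001)
print(energy_a, energy_b)
```

In this made-up case the two programs use about the same energy even though their peak power differs, which a single power figure for the processor would not reveal.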
Evaluating Processors: Predictability. An important characteristic for embedded systems: when designing real-time systems, we want to be able to predict execution time. It is also more difficult to measure.
Evaluating Processors: Security. An important characteristic of all processors, including embedded processors. Security is inherently hard to measure: the fact that we do not know of a successful attack on a system does not mean that such an attack cannot exist.
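Basic Computer Architecture: Von Neumann architecture. [Figure: a processor containing a control unit (CU), ALU, and registers, connected to a memory that holds both instructions and data, with an input unit and an output unit.]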
Levels of Parallelism: bit-level parallelism (within arithmetic logic circuits); instruction-level parallelism (multiple instructions execute per clock cycle); memory-system parallelism (overlap of memory operations with computation); operating-system parallelism (more than one processor, multiple jobs run in parallel, at loop level or procedure level).
Levels of Parallelism: Bit-level parallelism occurs within arithmetic logic circuits.
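A rough way to picture bit-level parallelism: an ALU that handles more bits at once needs fewer steps to add two operands. The sketch below adds 32-bit values using an ALU of a given width, chaining the carry between slices. The function name and the widths are assumptions for illustration, and the inputs are assumed to fit in `operand_bits`.

```python
def add_with_alu(a, b, operand_bits, alu_bits):
    """Add two operands using an ALU that processes alu_bits per step.

    Returns (sum modulo 2**operand_bits, number of ALU steps used).
    """
    mask = (1 << alu_bits) - 1
    result, carry, steps = 0, 0, 0
    for shift in range(0, operand_bits, alu_bits):
        # Add one slice of each operand plus the carry from the previous slice.
        chunk = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (chunk & mask) << shift
        carry = chunk >> alu_bits
        steps += 1
    return result & ((1 << operand_bits) - 1), steps


narrow_sum, narrow_steps = add_with_alu(0x12345678, 0x0FEDCBA8, 32, 8)
wide_sum, wide_steps = add_with_alu(0x12345678, 0x0FEDCBA8, 32, 32)
print(hex(narrow_sum), narrow_steps)
print(hex(wide_sum), wide_steps)
```

Both calls produce the same sum; the 8-bit ALU takes four steps where the 32-bit ALU takes one.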
Levels of Parallelism: Instruction-level parallelism (ILP). Multiple instructions execute per clock cycle, through pipelining (instruction and data) or multiple issue, for example very long instruction word (VLIW).
Levels of Parallelism: Memory-system parallelism. Memory operations overlap with computation.
Levels of Parallelism: Operating-system parallelism. There is more than one processor, and multiple jobs run in parallel, at loop level or procedure level.
Flynn’s Taxonomy: Single Instruction stream - Single Data stream (SISD); Single Instruction stream - Multiple Data stream (SIMD); Multiple Instruction stream - Single Data stream (MISD); Multiple Instruction stream - Multiple Data stream (MIMD).
Single Instruction stream - Single Data stream (SISD): the Von Neumann architecture. [Figure: a processor with a control unit (CU) and ALU fetching instructions and data from one memory.]
Single Instruction stream - Multiple Data stream (SIMD): Instructions of the program are broadcast to more than one processor. Each processor executes the same instruction synchronously, but using different data. SIMD is used for applications that operate upon arrays of data. [Figure: one control unit (CU) broadcasting each instruction to several processing elements (PEs), each exchanging its own data with memory.]
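The SIMD idea can be sketched in a few lines: one instruction is applied to every lane of data in the same step. This is a plain-Python model, not real SIMD hardware or a real instruction set; the function `simd_execute` and the two opcodes are hypothetical.

```python
def simd_execute(instruction, lanes_a, lanes_b):
    """Apply one instruction to every pair of lane operands.

    Models a control unit broadcasting a single instruction to all
    processing elements, each of which holds different data.
    """
    operations = {
        "add": lambda x, y: x + y,
        "mul": lambda x, y: x * y,
    }
    operation = operations[instruction]
    return [operation(x, y) for x, y in zip(lanes_a, lanes_b)]


sums = simd_execute("add", [1, 2, 3, 4], [10, 20, 30, 40])
products = simd_execute("mul", [1, 2, 3, 4], [10, 20, 30, 40])
print(sums)
print(products)
```

Each call stands for one broadcast instruction; the number of lanes corresponds to the number of processing elements.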
Multiple Instruction stream - Multiple Data stream (MIMD): Each processor has a separate program. An instruction stream is generated for each program on each processor. Each instruction operates upon different data.
Multiple Instruction stream - Multiple Data stream (MIMD): two memory organizations, shared memory and distributed memory.
Shared vs Distributed Memory: With distributed memory, each processor has its own local memory, and message passing is used to exchange data between processors. With shared memory, there is a single address space, and all processes have access to the pool of shared memory. [Figure: left, four processors (P) on a bus to one memory; right, four processors, each with its own memory (M), connected by a network.]
Distributed Memory: Processors cannot directly access another processor’s memory. Each node has a network interface (NI) for communication and synchronization. [Figure: four nodes, each a processor (P) with its own memory (M) and network interface (NI), attached to a network.]
Distributed Memory: Each processor executes different instructions asynchronously, using different data. [Figure: four nodes, each with a memory (M), control unit (CU), and processing element (PE), exchanging data over a network.]
Shared Memory: Each processor executes different instructions asynchronously, using different data. [Figure: four CU/PE pairs fetching instructions and data from one shared memory.]
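The two MIMD memory organizations lead to two programming styles. The sketch below models both with Python threads: in the shared version, every worker updates one location in a common address space under a lock; in the message-passing version, each worker touches only its local data and sends a result over a queue that stands in for the network. This is only a model (Python threads are not separate processors), and all function names are illustrative.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Workers add their partial sums into one shared location."""
    total = [0]              # visible to every thread (single address space)
    lock = threading.Lock()  # synchronizes access to the shared location

    def worker(chunk):
        partial = sum(chunk)
        with lock:
            total[0] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total[0]


def message_passing_sum(chunks):
    """Workers keep their data local and send results as messages."""
    channel = queue.Queue()  # stands in for the network between nodes

    def worker(local_data):
        channel.put(sum(local_data))

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # The collector receives exactly one message per worker.
    return sum(channel.get() for _ in chunks)


work = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
print(shared_memory_sum(work), message_passing_sum(work))
```

Both functions return the same total; what differs is whether workers communicate through a common memory location or through explicit messages.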
Shared Memory: Uniform memory access (UMA): each processor has uniform access to memory (symmetric multiprocessor, SMP). Non-uniform memory access (NUMA): the time for a memory access depends on the location of the data; local access is faster than non-local access; NUMA systems are easier to scale than SMPs. [Figure: UMA, four processors (P) on a bus to one memory; NUMA, two such bus-and-memory groups connected by a network.]
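The cost of non-local accesses on a NUMA machine can be estimated with a weighted average of local and remote access times. The latencies and fractions below are invented for illustration and do not describe any particular system.

```python
def average_access_ns(local_fraction, local_ns, remote_ns):
    """Average memory access time when a fraction of accesses hit local memory."""
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies: 100 ns for local memory, 300 ns for remote memory.
for fraction in (1.0, 0.9, 0.5):
    print(fraction, average_access_ns(fraction, 100.0, 300.0))
```

With these assumed numbers, the average access time doubles when only half of the accesses are local, which is why data placement matters on NUMA systems.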
Distributed Shared Memory: The main memory of a cluster of computers is made to look like a single memory with a single address space, so shared memory programming techniques can be used.
Hybrid Multicore Systems: many general-purpose processors; GPU (graphics processing unit); GPGPU (general-purpose GPU). The trend is: a board composed of multiple many-core chips sharing memory; a rack composed of multiple boards; a room full of these racks.
Other axes of comparison: RISC vs. CISC (instruction set style); instruction issue width; static vs. dynamic scheduling for multiple-issue machines; scalar vs. vector processing; single-threaded vs. multithreaded. A single CPU can fit into multiple categories.
RISC vs. CISC: Complex Instruction Set Computer (CISC). A "high-level" instruction set in which a single instruction executes several low-level operations (for example a load, an arithmetic operation, and a memory store). Examples: VAX, Intel x86, IBM 360/370.
Features of CISC: a small number of general-purpose registers; instructions take multiple clock cycles to execute; few lines of code per operation.
RISC vs. CISC: Reduced Instruction Set Computer (RISC). A CPU design that recognizes only a limited number of simple instructions, which are executed quickly. Examples: MIPS, DEC Alpha, Sun SPARC, IBM 801.
Features of RISC: a "reduced" instruction set; executes a series of simple instructions instead of one complex instruction; most instructions are designed to execute in one clock cycle; incorporates a large number of general-purpose registers for arithmetic operations, to avoid storing variables on a stack in memory; simple instructions pipeline well, which is the source of the speed.
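The contrast between the two styles can be sketched with a toy interpreter: a CISC-style memory-to-memory add does in one instruction what a RISC-style load/store machine does in four simple ones. The opcodes (`ADDM`, `LOAD`, `ADD`, `STORE`) and the `run` function are invented for this illustration and do not correspond to any real instruction set.

```python
def run(program, memory):
    """Execute a toy program and return the resulting memory."""
    registers = {}
    for opcode, *args in program:
        if opcode == "LOAD":      # LOAD rd, addr: memory -> register
            rd, addr = args
            registers[rd] = memory[addr]
        elif opcode == "STORE":   # STORE rs, addr: register -> memory
            rs, addr = args
            memory[addr] = registers[rs]
        elif opcode == "ADD":     # ADD rd, rs1, rs2: registers only
            rd, rs1, rs2 = args
            registers[rd] = registers[rs1] + registers[rs2]
        elif opcode == "ADDM":    # ADDM dst, a, b: load, add, and store in one
            dst, a, b = args
            memory[dst] = memory[a] + memory[b]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return memory


cisc_program = [("ADDM", "c", "a", "b")]
risc_program = [
    ("LOAD", "r1", "a"),
    ("LOAD", "r2", "b"),
    ("ADD", "r3", "r1", "r2"),
    ("STORE", "r3", "c"),
]

cisc_result = run(cisc_program, {"a": 2, "b": 3, "c": 0})
risc_result = run(risc_program, {"a": 2, "b": 3, "c": 0})
print(cisc_result["c"], risc_result["c"])
```

Both programs leave the same value in `c`; the RISC version is longer but uses only simple, uniform operations, and it keeps intermediate values in registers.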
Single issue versus multiple issue: Instruction issue width is an important aspect of processor performance. Processors that can issue more than one instruction per cycle generally execute programs faster, but at the price of increased power consumption and higher cost.
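The effect of issue width can be explored with a very small in-order issue model. It makes strong simplifying assumptions: every instruction has a one-cycle latency, and an instruction cannot issue in the same cycle as an earlier instruction that writes one of its sources or its destination. The function and the four-instruction program are hypothetical.

```python
def cycles_to_issue(program, width):
    """Count cycles needed to issue a program in order, up to `width` per cycle.

    Each instruction is (destination_register, source_registers).
    """
    cycles = 0
    index = 0
    while index < len(program):
        written_this_cycle = set()
        issued = 0
        while index < len(program) and issued < width:
            dest, sources = program[index]
            conflict = dest in written_this_cycle or any(
                source in written_this_cycle for source in sources
            )
            if conflict:
                break  # in-order issue: later instructions must wait too
            written_this_cycle.add(dest)
            issued += 1
            index += 1
        cycles += 1
    return cycles


program = [
    ("r1", ("r0",)),
    ("r2", ("r0",)),
    ("r3", ("r1", "r2")),  # depends on the two instructions above
    ("r4", ("r3",)),       # depends on the previous instruction
]
for width in (1, 2, 4):
    print(width, cycles_to_issue(program, width))
```

In this model, going from one to two issue slots saves a cycle, while going from two to four saves nothing because the data dependencies limit how many instructions can issue together. Wider issue only helps when the program exposes enough independent instructions.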
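Static versus dynamic scheduling: In static scheduling, the order in which instructions are issued is fixed before the program runs (when it is written or compiled). In dynamic scheduling, hardware determines which instructions are issued at run time. Superscalar execution is a common technique for dynamic instruction issue; Tomasulo's algorithm is a well-known dynamic scheduling scheme.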
Embedded vs. general-purpose processors: Embedded processors may be customized for a category of applications; customization may be narrow or broad. We may judge embedded processors using different metrics: code size, energy efficiency, memory system performance, and predictability.
Embedded RISC processors: RISC processors often have simple, highly pipelinable instructions. Pipelines of embedded RISC processors have grown over time: ARM7 has a 3-stage pipeline, ARM9 has a 5-stage pipeline, and ARM11 has an 8-stage pipeline.
RISC processor families: ARM: ARM7 has in-order execution and no memory management or branch prediction; ARM9 and ARM11 family members add memory management, and ARM11 adds branch prediction and limited out-of-order completion. MIPS: MIPS32 4K has a 5-stage pipeline; the 4KE family has a DSP extension; 4KS is designed for security. PowerPC: the PowerPC 400 series includes several embedded processors; Motorola and IBM offer superscalar versions of the PowerPC.
Embedded DSP Processors: Embedded DSP processors are optimized to perform DSP algorithms such as speech coding, filtering, convolution, fast Fourier transforms, and discrete cosine transforms.
Embedded DSP Processors, example: The AT&T DSP-16, one of the early DSPs, had an onboard multiplier and provided a multiply-accumulate instruction, dest = src1*src2 + src3, a common operation in digital signal processing. It was based on a Harvard architecture with separate data and instruction memories, so data accesses could rely on consistent bandwidth from the memory, which is particularly important for sampled-data systems.
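To show why multiply-accumulate is so common in signal processing, the sketch below computes one output sample of a FIR filter as a chain of `dest = src1*src2 + src3` operations. The function names and the sample values are assumptions for illustration; this models the arithmetic only, not the DSP-16 itself.

```python
def mac(src1, src2, src3):
    """Multiply-accumulate: dest = src1 * src2 + src3."""
    return src1 * src2 + src3


def fir_output(coefficients, samples):
    """One FIR filter output: the sum of coefficient * sample products."""
    accumulator = 0
    for coefficient, sample in zip(coefficients, samples):
        # Each filter tap costs exactly one multiply-accumulate.
        accumulator = mac(coefficient, sample, accumulator)
    return accumulator


print(fir_output([1, 2, 3], [4, 5, 6]))
```

A filter with N taps needs N multiply-accumulates per output sample, so a single-instruction MAC directly reduces the work in the inner loop.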
Parallelism extraction: Static: use the compiler to analyze the program; simpler CPU; cannot depend on data values; example: very long instruction word (VLIW). Dynamic: use hardware to identify opportunities; more complex CPU; can make use of data values; example: superscalar.
Very Long Instruction Word (VLIW): VLIW processors are widely used in embedded systems because they provide instruction-level parallelism with relatively low hardware overhead. The execution unit includes a pool of function units connected to a large register file. The execution unit reads a packet of instructions; each instruction in the packet can control one of the function units in the machine.
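Simple VLIW architecture: A large register file feeds multiple function units. [Figure: an execution box (E box) whose instruction packet "Add r1,r2,r3; Sub r4,r5,r6; Ld r7,foo; St r8,baz; NOP" drives two ALUs, two load/store units, and one further function unit (FU), all connected to the register file.]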
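A static scheduler for such a machine can be sketched as a greedy packer that fills fixed slots and pads unused slots with NOPs. The slot layout below (two ALUs, two load/store units, one other function unit) is read off the figure; the packing rule, which starts a new packet when no matching slot is free or when an instruction depends on a register written earlier in the same packet, is a simplifying assumption rather than how any particular compiler works.

```python
SLOTS = ["ALU", "ALU", "LS", "LS", "FU"]  # slot layout assumed from the figure


def pack_bundles(program):
    """Greedily pack instructions into VLIW packets, in program order.

    Each instruction is (unit, destination_or_None, sources, text),
    where unit must be one of the types in SLOTS.
    """
    bundles = []
    current = [None] * len(SLOTS)
    written = set()  # registers written by the packet being built
    for unit, dest, sources, text in program:
        free = [i for i, slot in enumerate(SLOTS)
                if slot == unit and current[i] is None]
        depends = dest in written or any(s in written for s in sources)
        if not free or depends:
            # Close the current packet and start a new, empty one.
            bundles.append(current)
            current = [None] * len(SLOTS)
            written = set()
            free = [i for i, slot in enumerate(SLOTS) if slot == unit]
        current[free[0]] = text
        if dest is not None:
            written.add(dest)
    if any(slot is not None for slot in current):
        bundles.append(current)
    # Empty slots become explicit NOPs, as in the figure.
    return [[slot or "NOP" for slot in bundle] for bundle in bundles]


program = [
    ("ALU", "r1", ("r2", "r3"), "Add r1,r2,r3"),
    ("ALU", "r4", ("r5", "r6"), "Sub r4,r5,r6"),
    ("LS", "r7", (), "Ld r7,foo"),
    ("LS", None, ("r8",), "St r8,baz"),
    ("ALU", "r9", ("r1", "r4"), "Add r9,r1,r4"),
]
bundles = pack_bundles(program)
for bundle in bundles:
    print("; ".join(bundle))
```

The first four instructions fill one packet that matches the figure; the fifth needs an ALU slot and depends on earlier results, so it starts a second packet that is mostly NOPs. All of these decisions are made before the program runs, which is what keeps the hardware simple.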
Clustered VLIW architecture: The register file and function units are divided into clusters. [Figure: two clusters, each with its own execution unit and register file, connected by a cluster bus.]
Very Long Instruction Word (VLIW), Example 1: The Trimedia family of processors is designed for use in video systems. Video algorithms often perform similar operations on several pixels at a time.
Very Long Instruction Word (VLIW), Example 2: Texas Instruments C6x VLIW DSP.
Very Long Instruction Word (VLIW), Example 2: Texas Instruments C6x VLIW DSP. The chip has onboard program and data RAM as well as standard devices and DMA. The processor core includes two clusters, each with the same configuration. Each register file holds 16 words. The two data paths together have eight function units, along with two load paths, two store paths, two data address paths, and two register file cross paths.
Superscalar Processors: Superscalar processors issue more than one instruction per clock cycle. Unlike VLIW processors, they check for resource conflicts on the fly to determine which combinations of instructions can be issued at each step. They are less common in the embedded world but are used to some extent: the embedded Pentium is a two-issue in-order processor, and some PowerPCs are superscalar.