Memory Oriented System-level Optimizations for Scripting Enabled Embedded Systems. Jiwon Hahn, PhD Qualifying Exam, University of California, Irvine, March 2006.
Motivation▶ Embedded system development
• Growing challenges
  • Rising end-user expectations
    • More functionality
    • Higher performance
    • Cheaper
    • Smaller
  • Very short time-to-market
• Wide gap between available techniques and user satisfaction
• Need new tools and methodology!
[Figure: example applications: motion sensing, structural health monitoring, preterm infant monitoring, physiological sensing; Eco node]
Strategies
• Speed up the development!
  • Need better programming/debugging methodology and tools
• Improve the current system’s bottleneck!
  • Memory is one of the most costly components and affects the system’s performance, power, and overall application range
• Maximize the system’s capability!
  • Because embedded systems are resource constrained, it helps to offload part of the system workload to the host
About My Research • Framework • Enhanced programming/debugging methodology • Host-assisting runtime environment • Optimization • Reducing data memory requirements and increasing memory utilization • Power and performance co-optimization
Outline • Scripting Framework • Memory-oriented Optimization • Implementation • Experimental Platforms • Summary & Research Plan
Outline • Scripting Framework • Scripting Engine Synthesis • Runtime Environment • Preliminary Results • Memory-oriented Optimization • Implementation • Experimental Platforms • Summary & Research Plan
Motivating Example▶ Building a small embedded system
Application: temperature sensor
• Sense temperature, send to the host every 5 min.
Platform: TecO particle
• 17 x 35 mm
• PIC18LF452 at 20 MHz
• 32KB program Flash
• 1.5KB RAM
• 32KB external EEPROM
• Temperature sensor
• RF interface
• Etc.
• Hardware
  • Solder RF module
• Software (or Firmware)
  • no OS support!
  • no interactivity
  • no partial testing
Development cycle (repeat):
1. Write the FW (C/assembly)
2. Compile
3. Connect board to the host
4. Enter the bootloading mode
5. Erase/Load/Verify Program
6. Restart the board
7. Run
Motivation▶ Alternative approach: Scripting!
Environment Setup (Scripting Engine Synthesis):
1. Generate the FW (Scripting engine synthesis)
2. Compile
3. Connect board to the host
4. Enter the bootloading mode
5. Erase/Load/Verify Program
6. Restart the board
7. Run
Scripting (Runtime; repeat):
1. Write the script
2. Connect board to the host
3. Load & Run
Our Framework: Rappit▶ Overview
• A framework that gives the user an integrated scripting environment spanning the host and target systems
• Host side: the user types a command and sees the result
    >> readTemperature()
    52
• Target side:
  1. Receive packets
  2. Interpret the command
  3. Execute primitives (e.g., ADC read)
  4. Return the result
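The four target-side steps on this slide (receive, interpret, execute, return) can be sketched as a small dispatch routine. This is only an illustration in Python: the opcode value, the `adc_read` stub with its fixed reading of 52, and the 0xFF error byte are assumptions made for the example, not Rappit's actual engine, which the slides show as C firmware.

```python
def adc_read() -> int:
    """Stand-in for a hardware ADC read (hypothetical fixed reading)."""
    return 52

# Hypothetical opcode table; a real table would be generated by engine synthesis.
PRIMITIVES = {0x00: adc_read}

ERROR_BYTE = 0xFF  # assumed marker for an unknown opcode


def handle_packet(packet: bytes) -> bytes:
    """Interpret one received packet and return the response payload."""
    opcode = packet[0]                   # interpret the command
    primitive = PRIMITIVES.get(opcode)
    if primitive is None:
        return bytes([ERROR_BYTE])
    result = primitive()                 # execute the primitive
    return bytes([result & 0xFF])        # return the result


# The host-side ">> readTemperature()" would arrive as a packet carrying opcode 0x00.
print(handle_packet(bytes([0x00]))[0])   # prints 52
```

Keeping the table separate from the loop mirrors the slide's idea that only the primitives change from one platform to the next.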
Rappit▶ Scripting engine synthesis
Flow: System Description (Architecture, Application, Communication) + Component Library → Code Synthesis → Target F/W (Scripting Engine, Primitives, …) as a binary executable for the target system, and Host S/W (Parser, MsgGen, GUI, …) as an interactive language on the host. Both sides use a compatible message format.
System Description:
    # example: pin mapping for an RF module
    mcu = MCU(ATmega169)      # instantiate an ATmega169 MCU
    import RF                 # load a transceiver module
    rf = RF(nRF2401)          # instantiate nRF2401
    rf.CS = mcu.PORTB[0]      # connect the chip select pin
    rf.CE = mcu.PORTB[1]      # connect the chip enable pin
    rf.DR1 = mcu.PORTB[2]     # connect the data ready pin
    rf.CLK1 = mcu.PORTF[1]    # connect the clock pin
    rf.DOUT1 = mcu.PORTF[2]   # connect the data pin
    # example: packet format
    c_format = src(1),dst(1),msgID(1),opcode(1),arg(3),crc(1)
    r_format = src(1),dst(1),msgID(1),mtype(1),dtype(1),\
               data(v),crc(1),eop(1)
Component Library:
    // part of primitives
    char ADC_read(void) { … }
    void RF_send(char pck) { … }
Synthesized code:
    // part of Scripting engine
    switch (opcode) {
    case 0x00: val = ADC_read(); break;
    case 0x01: RF_send(val); break;
    case 0x02: RF_packetize(val); break;
    …
    }
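To make the command packet layout concrete, here is a minimal Python sketch that packs and parses a packet with the field widths given by `c_format` (src, dst, msgID, opcode, a 3-byte arg, crc). The slides do not say how the `crc` field is computed, so the XOR checksum below is an assumption, as are the function names and the zero-padding of `arg`.

```python
import struct
from functools import reduce

# Field widths follow c_format on the slide:
# src(1), dst(1), msgID(1), opcode(1), arg(3), crc(1) -> 8 bytes in total.
C_FORMAT = "BBBB3sB"


def xor_checksum(data: bytes) -> int:
    """Assumed stand-in for the crc field: XOR of all preceding bytes."""
    return reduce(lambda acc, byte: acc ^ byte, data, 0)


def build_command(src: int, dst: int, msg_id: int, opcode: int,
                  arg: bytes = b"") -> bytes:
    """Pack one command packet; arg is truncated or zero-padded to 3 bytes."""
    arg = arg[:3].ljust(3, b"\x00")
    body = struct.pack("BBBB3s", src, dst, msg_id, opcode, arg)
    return body + bytes([xor_checksum(body)])


def parse_command(packet: bytes) -> dict:
    """Unpack a command packet and verify the (assumed) checksum."""
    src, dst, msg_id, opcode, arg, crc = struct.unpack(C_FORMAT, packet)
    if crc != xor_checksum(packet[:-1]):
        raise ValueError("checksum mismatch")
    return {"src": src, "dst": dst, "msgID": msg_id,
            "opcode": opcode, "arg": arg}


packet = build_command(src=1, dst=2, msg_id=7, opcode=0x00)
print(len(packet), parse_command(packet)["opcode"])  # prints: 8 0
```

A fixed 8-byte command keeps target-side parsing trivial, which matches the slide's goal of a thin scripting engine.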
Rappit▶ Runtime environment
[Figure: block diagram of the host and the target system. Host-assisting modules: Parser, Optimizer. Other blocks shown: GUI, Msg Generator, Pcktzer/Dispatcher, Scripting Engine, Pcktzer/Depcktzer, Pck Buffer, Native Routines, Admission Controller, Component Library, Packet Manager. Arrows between host and target are labeled "command" and "response".]
Rappit▶ Host assistance
• Script Parsing (Parser)
  • Script, e.g. “readTemp()”
    • User-friendly syntax
  • Host (Parser, Msg. generator) translates it to “0x4A0x01” and sends it to the target node
    • Easy to parse at node
    • Compact and efficient representation
• Memory Management (Optimizer)
  • Raw script
    • Written by user
  • Host (Scheduler, Buffer Mapper) produces the optimized script and sends it to the target node
    • Minimal script size
    • Minimized memory usage
    • Minimized runtime overhead (fixed schedule and buffer usage)
Rappit▶ Scripting examples
Interactive port-setting
    >> PORTA[2] = 1    # toggle clock
    >> PORTA[2] = 0
    >> PORTA[1] = 1    # set port A pin 1
    >> PORTA[0]        # read input pin 0
    >> PORTA[2] = 1
    >> PORTA[2] = 0    # toggle clock
    >> PORTA[0]        # read input pin 1
System configuration
    >> mcu.sysclock = 1 MHz
    >> uart.baudrate = 9600 bps
    >> rf.power = -5 db
    >> rf.speed = 1 Mbps
    >> rf.config       # query
    {'payload': 1, 'power': -5, 'speed': 1000000, 'channel': 100, 'mode': 'TX'}
Periodic-task scheduling
    >> s = (every 50 ms: sample())
    >> s.start()
    >> s.stop()
Rappit▶ Experimental platform
• AVR Butterfly Board
  • Atmel ATmega169
    • 8-bit MCU @ 8MHz, 512B EEPROM, 1KB SRAM, 16KB program flash
  • Includes dataflash, speaker, sensors, joystick, LCD
  • USART serial link at 9600 baud
[Figure: AVR Butterfly; AVR Butterfly w/ wireless module]
Rappit▶ Experimental metrics and modality • Observation Metrics • Execution Modality
Rappit▶ Preliminary results
• Code size reduction
  • 61.8 – 66.3% reduction
  • The scripting engine is only a thin layer
  • Most of the reduction is in application code size
• Performance overhead
  • Batch-mode scripting can be faster than native!
  • Observed up to 25.7% speed-up
Outline • Scripting Framework • Memory-oriented Optimization • Memory Optimization • Multi-metric Optimization • Implementation • Experimental Platforms • Summary & Research Plan
Motivating Example▶ Installing Rappit primitives on Butterfly
Problem arises:
1. Choose primitives: ADC_read, RF_send, RF_read, SD_write, SD_read, …
2. Compile & install
3. Runtime error! Why? Exceeded 1KB RAM usage
Problem analysis (1KB SRAM):
    static unsigned char sd_buffer[512];    // 512B
    static unsigned char rf_buffer[30];
    static unsigned char ADC_buffer[30];
    …
    char error_msg1[] = "No SD Card detected!";
    char error_msg2[] = "Card Read Error!";
    …
• Solution
  • Sharing memory space
  • Mapping static data to dataflash
• Result
  • Increased board capability
  • Increased application range
[Figure: SRAM usage diagrams; labels shown: 512B, 1KB, 600B]
Data Memory Minimization▶ Assumptions and Approach • Assumptions • Optimizing scripts • script size buffer size • Optimizing at runtime • Need low complexity algorithm • Approach • High-level optimization • Using scheduling and buffer mapping techniques • Priority on data memory minimization • Based on model of computation (MoC)
Models of Computation (MoC) • Synchronous Dataflow (SDF) [E. Lee ’87] • Extensively used as specification for block-diagram based programming environments for signal processing • Special case of dataflow • No notion of time • The number of tokens (=data) consumed and produced by each actor (=node) during each firing (=invocation) cycle is statically fixed. • Fractional Rate Dataflow (FRDF) [H. Oh, S. Ha ’02] • Extension of SDF that allows fractional flow of I/O samples of the original SDF
Why SDF? • Formal representation for optimization, simulation and analysis • System-level optimization • Application flow of various primitives • Static scheduling • Minimize runtime overhead for resource constrained embedded systems • Deadlock detection • Bounding the memory requirements • Good match for sensor applications • collect data, process, transmit
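SDF▶ Notations
• SDF graph G = (V, E, p, c)
  • V = {v1, v2, …, v|V|}: nodes
  • E = {e1, e2, …, e|E|}: edges
  • src(e): source node
  • snk(e): sink node
  • p(e): produce rate
  • c(e): consume rate
• T(e,v): topology matrix, one row per edge and one column per node
\[ T(e,v) = \begin{cases} p(e) & \text{if } v = \mathrm{src}(e) \\ -c(e) & \text{if } v = \mathrm{snk}(e) \\ 0 & \text{otherwise} \end{cases} \]
• Example (rows e1 … e|E|, columns v1 … v|V|):
\[ T = \begin{pmatrix} 1 & -2 & 0 & \cdots & 0 \\ 0 & 2 & -1 & \cdots & 0 \\ 0 & 0 & 3 & \cdots & \vdots \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & -5 \end{pmatrix} \]
[Figure: example graph v1 → v2 → v3 → … → v|V| with edges e1, e2, e3, …, e|E|; rate labels 1, 2, 2, 1, 3, …, 5; edge e1 is annotated with src(e1), p(e1), c(e1), snk(e1)]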
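A standard use of the topology matrix in SDF is the balance equation T·q = 0, whose smallest positive integer solution q says how many times each actor fires per schedule iteration. The sketch below builds T from an edge list and solves for q by propagating rate ratios. It assumes a connected graph, and the function names and the tuple layout `(src, snk, p, c)` are choices made for this example. The sample graph is the three-actor chain used on the later scheduling slides (A produces 2, B consumes 1; B produces 1, C consumes 1).

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def topology_matrix(nodes, edges):
    """One row per edge: +p(e) in the source column, -c(e) in the sink column."""
    matrix = []
    for src, snk, p, c in edges:
        row = [0] * len(nodes)
        row[nodes.index(src)] += p
        row[nodes.index(snk)] -= c
        matrix.append(row)
    return matrix


def repetition_vector(nodes, edges):
    """Smallest positive integer q with T q = 0 (connected graphs only)."""
    q = {nodes[0]: Fraction(1)}
    changed = True
    while changed:                      # propagate firing ratios along edges
        changed = False
        for src, snk, p, c in edges:
            if src in q and snk not in q:
                q[snk] = q[src] * p / c
                changed = True
            elif snk in q and src not in q:
                q[src] = q[snk] * c / p
                changed = True
    if len(q) != len(nodes):
        raise ValueError("graph is not connected")
    for src, snk, p, c in edges:        # every edge must balance
        if q[src] * p != q[snk] * c:
            raise ValueError("inconsistent rates: no valid static schedule")
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (f.denominator for f in q.values()), 1)
    counts = [int(q[v] * scale) for v in nodes]
    common = reduce(gcd, counts)
    return [n // common for n in counts]


nodes = ["A", "B", "C"]
edges = [("A", "B", 2, 1), ("B", "C", 1, 1)]
print(topology_matrix(nodes, edges))    # [[2, -1, 0], [0, 1, -1]]
print(repetition_vector(nodes, edges))  # [1, 2, 2]
```

The result [1, 2, 2] matches the schedules A B B C C and A B C B C, in which A fires once and B and C fire twice.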
SDF▶ Example
• Surge Application
  • Actors: A, B, C
  • Buffers: x, y
  • Schedule: ABC
[Figure: SDF graph A (ADC read) → x → B (RF pack) → y → C (RF send); all produce and consume rates are 1]
• Rappit Script (4L):
    every 2048:
        x = ADC.read()
        y = RF.pack(x)
        RF.send(y)
SDF▶ Example (cont’d)
• Same code in Java (20L) [J. Koshy ’05]:
    SurgePacket sgPkt;
    char eList, eVector;
    byte sHandle;
    sgPkt = new SurgePacket();
    eList = Select.setEventId( eList, Events.TIMEOUT | Events.RADIO_RECV );
    sHandle = Select.requestSelectHandle();
    char val;
    Clock.startTimeout( 2048 );
    while (true) {
        eVector = Select.select(sHandle, eList);
        if (Select.eventOccurred( eVector, Events.TIMEOUT )) {
            val = PhotoSensor.sense();
            sgPkt.setReading( val );
            Surge.sendPacket( sgPkt );
            Clock.startTimeout( 2048 );
        } else if (Select.eventOccurred( eVector, Events.RADIO_RECV )) {
            handleRadioEvent( sgPkt ); // if base, forward to uart
        }
    }
Problem Statements
• Find the best schedule and buffer mapping that minimizes the buffer size requirement
  • Goal-oriented
  • Previous work
• Find the best schedule and buffer mapping that fits into a given memory size and maximizes its utilization
  • Constraint-driven
  • Novel
  • Practical
Buffer Mapping Problem▶ Spatial representation
• Token-lifetime chart (t-chart)
  • row: token’s lifetime (produced, placed, consumed)
  • column: fixed number of token changes caused by a firing event
[Figure: t-chart for schedule A B B C C; vertical axis: local buffers x and y; horizontal axis: time]
Buffer Mapping Problem▶ Spatial representation (cont’d)
• Memory-usage profile (m-profile)
• Metrics
  • Msize = 4, Mtotal = 20, Mused = 11, Mwasted = 9, Mutil = 55%
  • T = 5
[Figure: m-profile for schedule A B B C C; vertical axis: memory; horizontal axis: time]
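The slide lists the metric values without their definitions. The relations below are inferred from those numbers (they reproduce every value shown), so treat them as a plausible reading, not the author's stated formulas: total memory is the allocated size times the number of time steps, wasted memory is what remains unused, and utilization is the used fraction.

```latex
\begin{align*}
M_{total}  &= M_{size} \times T = 4 \times 5 = 20 \\
M_{wasted} &= M_{total} - M_{used} = 20 - 11 = 9 \\
M_{util}   &= \frac{M_{used}}{M_{total}} = \frac{11}{20} = 55\%
\end{align*}
```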
Memory Optimization Techniques
1) *Scheduling w/ Unshared Buffer
2) *Buffer Sharing
3) *I/O Buffer Merging
4a) **Fractionizing
4b) Rate Selection (new)
5) Pipelining (new)
* Well-established previous work
** Recently proposed
Memory Optimization Techniques▶ 1) Scheduling with unshared buffer
• Each edge is directly mapped to its dedicated buffer space
• By efficient ordering of actors, buffer requirement is reduced!
[Figure: SDF graph A → x → B → y → C; A produces 2 tokens on x and B consumes 1; B produces 1 token on y and C consumes 1]
Schedule 1: A B B C C
    x = A()
    repeat 2: y = B(x)
    repeat 2: C(y)
  Unrolled:
    x[0..1] = A()
    y[0] = B(x[0])
    y[1] = B(x[1])
    C(y[0])
    C(y[1])
  Buffer requirement: |x| + |y| = 2 + 2 = 4
Schedule 2: A B C B C
    x = A()
    repeat 2:
        y = B(x)
        C(y)
  Unrolled:
    x[0..1] = A()
    y[0] = B(x[0])
    C(y[0])
    y[0] = B(x[1])
    C(y[0])
  Buffer requirement: |x| + |y| = 2 + 1 = 3
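The two buffer requirements can be checked by simulating each schedule and recording the peak token count on every edge. This is a minimal sketch under the assumption that a firing consumes its inputs before producing its outputs; the edge dictionary layout and the function name are choices made for the example.

```python
def unshared_buffer_sizes(edges, schedule):
    """Peak token count per edge when every edge has its own dedicated buffer.

    edges maps a buffer name to (source actor, sink actor, produce rate, consume rate).
    """
    tokens = {name: 0 for name in edges}
    peak = {name: 0 for name in edges}
    for actor in schedule:
        # Consume input tokens first; fail if the schedule is not admissible.
        for name, (src, snk, p, c) in edges.items():
            if snk == actor:
                if tokens[name] < c:
                    raise ValueError(f"{actor} fired without enough tokens on {name}")
                tokens[name] -= c
        # Then produce output tokens and track the peak occupancy.
        for name, (src, snk, p, c) in edges.items():
            if src == actor:
                tokens[name] += p
                peak[name] = max(peak[name], tokens[name])
    return peak


EDGES = {"x": ("A", "B", 2, 1), "y": ("B", "C", 1, 1)}

for schedule in ("ABBCC", "ABCBC"):
    sizes = unshared_buffer_sizes(EDGES, schedule)
    print(schedule, sizes, sum(sizes.values()))
# ABBCC {'x': 2, 'y': 2} 4
# ABCBC {'x': 2, 'y': 1} 3
```

Interleaving B and C lets each y token be consumed before the next one is produced, which is why the second schedule needs only one slot for y.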
Memory Optimization Techniques▶ Comparing 1), 2), 3)
[Figure: SDF graph A → x → B → y → C; A produces 2 tokens on x and B consumes 1; B produces 1 token on y and C consumes 1]
Schedule: A B B C C
    x = A()
    repeat 2: y = B(x)
    repeat 2: C(y)
1) Unshared Buffer
    x[0..1] = A()
    y[0] = B(x[0])
    y[1] = B(x[1])
    C(y[0])
    C(y[1])
  Buffer requirement: |x| + |y| = 2 + 2 = 4
2) Shared Buffer
  Once data is consumed, reuse the available space!
    x[0..1] = A()
    y[0] = B(x[0])
    x[0] = B(x[1])
    C(y[0])
    C(x[0])
  Buffer requirement: |x| + |y| = 2 + 1 = 3
3) Merged I/O Buffer
  Assuming the token is consumed before the output is produced, use the same space for the input and output tokens.
    x[0..1] = A()
    x[0] = B(x[0])
    x[1] = B(x[1])
    C(x[0])
    C(x[1])
  Buffer requirement: |x| + |y| = 2 + 0 = 2
Memory Optimization Techniques▶ Comparing 1), 2), 3) (cont’d)
            1) Unshared Buffer   2) Shared Buffer   3) Merged I/O Buffer
|x|+|y|     4                    3                  2
Mtotal      20                   15                 10
Mused       11                   11                 9
Mwasted     9                    4                  1
Mutil       55%                  73%                90%
[Figure: t-chart for schedule A B B C C; local buffers x and y holding tokens t1 to t4; horizontal axis: time]
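The table can be reproduced from token lifetimes. The sketch below assumes each token occupies one slot from the step in which it is produced through the step in which it is consumed, over the 5-step schedule A B B C C; with merged I/O, an output overwrites its input in the same step, so the input's lifetime ends one step earlier. These lifetimes are inferred so that the totals match the slide. They are not read off the original chart, and the function names are illustrative.

```python
STEPS = 5  # schedule A B B C C, one firing per step

# (first step, last step) for each token, both inclusive.
SEPARATE_IO = [(0, 1), (0, 2), (1, 3), (2, 4)]   # x0, x1, y0, y1
MERGED_IO = [(0, 0), (0, 1), (1, 3), (2, 4)]     # outputs overwrite their inputs


def peak_live_tokens(lifetimes, steps):
    """Largest number of tokens alive in any single step."""
    return max(sum(1 for start, end in lifetimes if start <= t <= end)
               for t in range(steps))


def memory_metrics(lifetimes, m_size, steps):
    """Metrics in the slide's terms; Mutil is rounded to a whole percent."""
    m_used = sum(end - start + 1 for start, end in lifetimes)
    m_total = m_size * steps
    return {"Msize": m_size, "Mtotal": m_total, "Mused": m_used,
            "Mwasted": m_total - m_used,
            "Mutil": round(100 * m_used / m_total)}


table = {
    # Dedicated buffers: |x| + |y| = 2 + 2 slots regardless of occupancy.
    "unshared": memory_metrics(SEPARATE_IO, 4, STEPS),
    # Sharing: allocate only as many slots as are ever live at once.
    "shared": memory_metrics(SEPARATE_IO, peak_live_tokens(SEPARATE_IO, STEPS), STEPS),
    "merged": memory_metrics(MERGED_IO, peak_live_tokens(MERGED_IO, STEPS), STEPS),
}
for name, metrics in table.items():
    print(name, metrics)
```

Under these assumptions the three rows come out as 4/20/11/9/55, 3/15/11/4/73 and 2/10/9/1/90, the same values as the table.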
Memory Optimization Techniques▶ 4a) Fractionizing
• Idea:
  • Don’t wait until A produces a big chunk of data
  • Modify actor A to process only a fractional amount of the original data at a time
• Trade-off
  • Local effect
    • Possible time and energy overhead
    • e.g., resource’s access time, packet overhead
  • Global effect
    • Reduced bottleneck: shorter processing interval of A
    • Reduced buffer size: min|x| goes from 2 to 1
[Figure: original graph A → B and fractionized graph A’ → B, each with buffers w and x; rate labels 1, 3, 1 and 1/3, 1, 1; schedules shown: A 3(B) and 2(AB)]
Memory Optimization Techniques▶ 4b) Rate Selection
• Idea
  • Generalize fractionizing
  • Allow not only fractions but also multiples
  • A rate is defined as a range, but is fixed before the schedule is finalized
  • Each actor is modeled with timing and power functions with respect to its I/O range
• Benefits
  • Combines the power of flexibility and static determinism
  • Increases buffer reduction opportunities
• Challenge
  • Need an efficient way to handle the considerably larger exploration space at runtime
[Figure: graph A → B with buffers w and x; rate labels (4,4), (2,6), (1,3); Schedule1: 2(A)B, Schedule2: AB, Schedule3: 2(A)3(B)]
Memory Optimization Techniques▶ 5) Pipelining
• Idea
  • Allow multiple actors to fire at once
• Benefits
  • Reduced buffer requirement
  • Higher memory utilization
  • Increased throughput
• Challenges
  • Need multiple processors
  • Need to resolve resource conflicts
  • Need to consider synchronization
Memory Optimization Techniques▶ Comparing 1), 4), 5)
[Figure: graphs and t-charts over buffers x and y for three cases: 1) Unshared Buffer (A → B → C), 4) Fractionized / Rate Selected (A’ → B → C, with a fractional rate of 1/2), 5) Pipelined (A, B, C)]
Memory Optimization Techniques▶ Summary
0: None (baseline)
1: Unshared Scheduling
2: Shared Buffer
3: Merged I/O
4: Fractionized
5: Pipelined
[Figure: summary chart; recoverable labels: global, A, B, C]
Multi-metric Optimization
• Trade-offs
  • From the actor’s (local) point of view, processing a large amount of data at once tends to reduce time and energy overhead
  • From the SDF flow’s (global) point of view, processing a small amount of data at once reduces the buffer requirement
• Goal
  • Find a Pareto-optimal point within the set of solutions that satisfy the constraints
[Figure: energy, data memory, and execution time plotted against data-flow rate]
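The goal stated on this slide can be illustrated with a small Pareto filter: drop the candidates that violate a constraint, then keep those that no other feasible candidate beats on every metric. The candidate rates and their (energy, memory, time) values below are made-up numbers for illustration only, and the function names are hypothetical. Lower is better for all three metrics.

```python
def dominates(a, b):
    """True if a is no worse than b on every metric and strictly better on one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates, feasible=lambda metrics: True):
    """Feasible candidates that no other feasible candidate dominates."""
    allowed = {rate: m for rate, m in candidates.items() if feasible(m)}
    return {rate: m for rate, m in allowed.items()
            if not any(dominates(other, m) for other in allowed.values())}


# Hypothetical data-flow rates mapped to (energy, data memory, execution time).
CANDIDATES = {
    1: (9, 1, 9),   # small chunks: little memory, high overhead
    2: (6, 2, 6),
    4: (4, 4, 4),
    8: (5, 8, 5),   # worse than rate 4 on every metric
}

print(sorted(pareto_front(CANDIDATES)))                       # [1, 2, 4]
print(sorted(pareto_front(CANDIDATES, lambda m: m[1] <= 3)))  # [1, 2]
```

The second call adds a memory budget of 3 units, which is the constraint-driven form of the problem: the answer must be both feasible and non-dominated.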
Applying it to Rappit▶ Quasi-static optimization
Rappit Flow         Performed Tasks
Compile             Kernel and primitives compiled and installed
Load script         SDF defined
Preprocess          Optimization: actor-to-processor assignment, actor ordering (scheduling), buffer mapping
Load script code    Static schedule loaded
Execute             Deterministic execution w/o runtime overhead
[Figure: the flow is annotated with the labels Compile-time, Host, Run-time, Target]
Outline • Scripting Framework • Memory-oriented Optimization • Implementation • Synthesis Tool • Simulator • Runtime Host-assisting Tool (GUI) • Experimental Platforms • Summary & Research Plan
Implementation▶ Scripting engine synthesis tool • System Template • GUI-based check-box approach • easily capture existing systems • model new systems for simulation and design space exploration • includes communication description • Component Library • binds according to template configuration • consists of MCU, on-chip devices, off-chip peripherals • each component has I/O pins and driver modules
Implementation▶ Tool integration
[Figure: tool integration diagram; host-side blocks: GUI, Parser, Dispatcher, Scheduler, Node Manager, Memory Optimizer; target nodes: Node 1, Node 2, Node 3, …, Node N]
Outline • Scripting Framework • Memory-oriented Optimization • Implementation • Experimental Platforms • Summary & Research Plan
HW Platforms and Real-world Applications • Eco • ultra-compact sensor node • pre-term infant monitoring • dancing motion detection • Mini-FDPM • active laser sensing device • breast cancer detection • DuraNode • real-time data acquisition system • structural health monitoring • Butterfly • low-power, i/o rich development board • prototyping (SD-card, speaker, sensors, RF)
Outline • Scripting Framework • Memory-oriented Optimization • Implementation • Experimental Platforms • Summary & Research Plan
Summary • A novel scripting framework for embedded systems • Scripting engine synthesis • Host assisting runtime environment • Memory optimization techniques • Comparison of techniques • Integration and multi-objective problem • Tool Implementations • Rappit GUI, memory simulator