
Memory Oriented System-level Optimizations for Scripting Enabled Embedded Systems

Memory Oriented System-level Optimizations for Scripting Enabled Embedded Systems. Jiwon Hahn PhD Qualifying Exam University of California, Irvine March 2006. Motivation ▶ Embedded system development. Growing challenges Increasing end-user’s expectation More functionality





Presentation Transcript


  1. Memory Oriented System-level Optimizations for Scripting Enabled Embedded Systems Jiwon Hahn PhD Qualifying Exam University of California, Irvine March 2006

  2. Motivation▶ Embedded system development
     • Growing challenges
       • Rising end-user expectations: more functionality, higher performance, cheaper, smaller
       • Very short time-to-market
     • Wide gap between available techniques and user satisfaction
     • Need new tools and methodology!
     [Figure labels: motion sensing, structural health monitoring, preterm infant monitoring, physiological sensing, Eco node]

  3. Strategies • Speed up development! • Need better programming/debugging methodologies and tools • Relieve the current system’s bottleneck! • The memory unit is one of the most costly components and affects the system’s performance, power, and overall application range • Maximize the system’s capability! • Since an embedded system is resource constrained, it helps to offload part of the system workload to the host

  4. About My Research • Framework • Enhanced programming/debugging methodology • Host-assisting runtime environment • Optimization • Reducing data memory requirements and increasing memory utilization • Power and performance co-optimization

  5. Outline • Scripting Framework • Memory-oriented Optimization • Implementation • Experimental Platforms • Summary & Research Plan

  6. Outline • Scripting Framework • Scripting Engine Synthesis • Runtime Environment • Preliminary Results • Memory-oriented Optimization • Implementation • Experimental Platforms • Summary & Research Plan

  7. Motivating Example▶ Building a small embedded system
     • Application: temperature sensor
       • Sense temperature, send to the host every 5 min.
     • Platform: TecO particle
       • 17 x 35 mm
       • PIC18LF452 at 20 MHz
       • 32KB program Flash, 1.5KB RAM, 32KB external EEPROM
       • Temperature sensor, RF interface, etc.
     • Hardware
       • Solder RF module
     • Software (or Firmware)
       • No OS support!
       • No interactivity
       • No partial testing
     • Development cycle (repeat):
       1. Write the FW (C/assembly)
       2. Compile
       3. Connect board to the host
       4. Enter the bootloading mode
       5. Erase/Load/Verify Program
       6. Restart the board
       7. Run

  8. Motivation▶ Alternative approach: Scripting!
     • Environment Setup (Scripting Engine Synthesis):
       1. Generate the FW (Scripting engine synthesis)
       2. Compile
       3. Connect board to the host
       4. Enter the bootloading mode
       5. Erase/Load/Verify Program
       6. Restart the board
       7. Run
     • Scripting (Runtime), repeat:
       1. Write the script
       2. Connect board to the host
       3. Load & Run

  9. Motivation▶ Scripting vs. Traditional Programming

  10. Related Work▶ Frameworks for runtime support

  11. Our Framework: Rappit▶ Overview
      • Framework to provide the user an integrated scripting environment of the host and target systems
      • Example interaction: >> readTemperature() returns 52
      • Command handling: Receive packets → Interpret the command → Execute primitives (e.g., ADC read) → Return the result

  12. Rappit▶ Scripting engine synthesis
      • Inputs to Code Synthesis: System Description (Architecture, Application, Communication) and Component Library
      • Outputs of Code Synthesis:
        • Target F/W (Scripting Engine, Primitives, …) as a Binary Executable for the Target System
        • Host S/W (Parser, MsgGen, GUI, …) providing the Interactive Language on the Host
        • Both sides share a Compatible Message format
      • System Description examples:
          # example: pin mapping for an RF module
          mcu = MCU(ATmega169)     # instantiate an ATmega169 MCU
          import RF                # load a transceiver module
          rf = RF(nRF2401)         # instantiate nRF2401
          rf.CS = mcu.PORTB[0]     # connect the chip select pin
          rf.CE = mcu.PORTB[1]     # connect the chip enable pin
          rf.DR1 = mcu.PORTB[2]    # connect the data ready pin
          rf.CLK1 = mcu.PORTF[1]   # connect the clock pin
          rf.DOUT1 = mcu.PORTF[2]  # connect the data pin
          # example: packet format
          c_format = src(1),dst(1),msgID(1),opcode(1),arg(3),crc(1)
          r_format = src(1),dst(1),msgID(1),mtype(1),dtype(1),\
                     data(v), crc(1),eop(1)
      • Component Library example:
          // part of primitives
          char ADC_read(void) { … }
          void RF_send(char pck) { … }
      • Synthesized code example:
          // part of Scripting engine
          switch (opcode) {
          case 0x00: val = ADC_read(); break;
          case 0x01: RF_send(val); break;
          case 0x02: RF_packetize(val); break;
          …
          }
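The command format on this slide, c_format = src(1),dst(1),msgID(1),opcode(1),arg(3),crc(1), describes an 8-byte packet. The following is a minimal Python sketch of how a host could pack and check such a packet. It is only an illustration under stated assumptions: the slides do not specify the CRC algorithm or the byte order of arg, so a plain XOR checksum and big-endian order are used as placeholders, and the function names are hypothetical rather than part of Rappit.

```python
def xor_checksum(payload: bytes) -> int:
    """Placeholder integrity check (assumption): XOR of all payload bytes."""
    value = 0
    for byte in payload:
        value ^= byte
    return value


def pack_command(src: int, dst: int, msg_id: int, opcode: int, arg: int = 0) -> bytes:
    """Build src(1), dst(1), msgID(1), opcode(1), arg(3), crc(1)."""
    # Big-endian order for the 3-byte argument is an assumption.
    body = bytes([src, dst, msg_id, opcode]) + arg.to_bytes(3, "big")
    return body + bytes([xor_checksum(body)])


def unpack_command(packet: bytes) -> dict:
    """Validate length and checksum, then split the packet into its fields."""
    if len(packet) != 8:
        raise ValueError("command packets are 8 bytes long")
    body, crc = packet[:-1], packet[-1]
    if xor_checksum(body) != crc:
        raise ValueError("checksum mismatch")
    return {
        "src": body[0],
        "dst": body[1],
        "msgID": body[2],
        "opcode": body[3],
        "arg": int.from_bytes(body[4:7], "big"),
    }


# Opcode 0x00 is the ADC read case in the synthesized switch statement.
packet = pack_command(src=1, dst=2, msg_id=7, opcode=0x00, arg=5)
fields = unpack_command(packet)
```

A fixed-width layout like this keeps the target-side decoder to a few array reads, which fits the stated goal of a message that is easy to parse at the node.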

  13. Rappit▶ Runtime environment
      [Diagram: Host and Target System exchanging command and response packets. Block labels: Host Assisting modules, Parser, Optimizer, Msg Generator, Pcktzer/Dispatcher, GUI, Scripting Engine, Pcktzer/Depcktzer, Pck Buffer, Native Routines, Admission Controller, Component Library, Packet Manager]

  14. Rappit▶ Host assistance
      • Script Parsing (Parser)
        • Input: syntax script, e.g. “readTemp()” (user friendly)
        • On the host: Parser, Msg. generator
        • Output to target node: e.g. “0x4A0x01” (easy to parse at node; compact and efficient representation)
      • Memory Management (Optimizer)
        • Input: raw script (written by user)
        • On the host: Scheduler, Buffer Mapper
        • Output to target node: optimized script (minimal script size; minimized memory usage; minimized runtime overhead through fixed schedule and buffer usage)

  15. Rappit▶ Scripting examples
      • Interactive port-setting
          >> PORTA[2] = 1   # toggle clock
          >> PORTA[2] = 0
          >> PORTA[1] = 1   # set port A pin 1
          >> PORTA[0]       # read input pin 0
          >> PORTA[2] = 1
          >> PORTA[2] = 0   # toggle clock
          >> PORTA[0]       # read input pin 1
      • System configuration
          >> mcu.sysclock = 1 MHz
          >> uart.baudrate = 9600 bps
          >> rf.power = -5 db
          >> rf.speed = 1 Mbps
          >> rf.config      # query
          {'payload': 1, 'power': -5, 'speed': 1000000, 'channel': 100, 'mode': 'TX'}
      • Periodic-task scheduling
          >> s = (every 50 ms: sample())
          >> s.start()
          >> s.stop()

  16. Rappit▶ Experimental platform • AVR Butterfly Board • Atmel ATmega169 • 8-bit MCU @ 8MHz, 512B EEPROM, 1KB SRAM, 16KB program flash • Includes dataflash, speaker, sensors, joystick, LCD • USART serial link at 9600 baud [Photos: AVR Butterfly; AVR Butterfly w/ wireless module]

  17. Rappit▶ Experimenting metrics and modality • Observation Metrics • Execution Modality

  18. Rappit▶ Preliminary results
      • Code size reduction
        • 61.8 – 66.3% reduction
        • Scripting engine consists of a thin layer
        • Most reduction in application code size
      • Performance overhead
        • Batch mode scripting can be faster than native!
        • Observed up to 25.7% speed-up

  19. Outline • Scripting Framework • Memory-oriented Optimization • Memory Optimization • Multi-metric Optimization • Implementation • Experimental Platforms • Summary & Research Plan

  20. Motivating Example▶ Installing Rappit primitives on Butterfly
      • Problem
        • Choose primitives: ADC_read, RF_send, RF_read, SD_write, SD_read, …
        • Compile & Install
        • Runtime Error! Why? Exceeded 1KB RAM usage
      • Problem Analysis
          static unsigned char sd_buffer[512];
          static unsigned char rf_buffer[30];
          static unsigned char ADC_buffer[30];
          …
          char error_msg1[] = "No SD Card detected!";
          char error_msg2[] = "Card Read Error!";
          …
      • Solution
        • Sharing memory space
        • Mapping static data to dataflash
      • Result
        • Increased board capability
        • Increased application range
      [Figure: SRAM usage diagrams with labels 512B and 1KB (problem analysis) and 600B and 1KB (solution)]

  21. Data Memory Minimization▶ Assumptions and Approach • Assumptions • Optimizing scripts • script size  buffer size • Optimizing at runtime • Need low complexity algorithm • Approach • High-level optimization • Using scheduling and buffer mapping techniques • Priority on data memory minimization • Based on model of computation (MoC)

  22. Models of Computation (MoC) • Synchronous Dataflow (SDF) [E. Lee ’87] • Extensively used as specification for block-diagram based programming environments for signal processing • Special case of dataflow • No notion of time • The number of tokens (=data) consumed and produced by each actor (=node) during each firing (=invocation) cycle is statically fixed. • Fractional Rate Dataflow (FRDF) [H. Oh, S. Ha ’02] • Extension of SDF that allows fractional flow of I/O samples of the original SDF

  23. Why SDF? • Formal representation for optimization, simulation and analysis • System-level optimization • Application flow of various primitives • Static scheduling • Minimize runtime overhead for resource constrained embedded systems • Deadlock detection • Bounding the memory requirements • Good match for sensor applications • collect data, process, transmit

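  24. SDF▶ Notations
      • SDF graph G = (V, E, p, c)
        • V: {v1, v2, … v|V|}
        • E: {e1, e2, … e|E|}
        • src(e): source node
        • snk(e): sink node
        • p(e): produce rate
        • c(e): consume rate
      • T(e,v): topology matrix
        $$T(e,v) = \begin{cases} p(e) & \text{if } v = \mathrm{src}(e) \\ -c(e) & \text{if } v = \mathrm{snk}(e) \\ 0 & \text{otherwise} \end{cases}$$
      • Example: chain v1 → v2 → v3 → … → v|V| with p(e1) = 1, c(e1) = 2, p(e2) = 2, c(e2) = 1, p(e3) = 3, …, c(e|E|) = 5 (rows e1 … e|E|, columns v1 … v|V|):
        $$T = \begin{pmatrix} 1 & -2 & 0 & \cdots & 0 \\ 0 & 2 & -1 & \cdots & 0 \\ 0 & 0 & 3 & \cdots & \cdots \\ 0 & 0 & 0 & \cdots & -5 \end{pmatrix}$$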
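In standard SDF theory, a periodic schedule exists only if there is a positive integer vector q, one entry per actor, with T q = 0. Row by row this is the balance condition p(e)·q[src(e)] = c(e)·q[snk(e)]. The sketch below computes the smallest such q for a connected graph. The name repetition_vector and the edge tuple layout are assumptions made for illustration and do not come from the slides.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(num_nodes, edges):
    """edges: list of (src, snk, p, c) with node indices 0..num_nodes-1.

    Returns the smallest positive integer q with p * q[src] == c * q[snk]
    on every edge, or None if the rates are inconsistent.
    """
    q = [None] * num_nodes
    q[0] = Fraction(1)
    changed = True
    while changed:  # propagate firing ratios along edges until nothing changes
        changed = False
        for src, snk, p, c in edges:
            if q[src] is not None and q[snk] is None:
                q[snk] = q[src] * p / c
                changed = True
            elif q[snk] is not None and q[src] is None:
                q[src] = q[snk] * c / p
                changed = True
    if any(x is None for x in q):
        raise ValueError("graph is not connected")
    for src, snk, p, c in edges:  # every balance equation must hold
        if q[src] * p != q[snk] * c:
            return None
    # Scale the fractions to integers, then divide out any common factor.
    scale = reduce(lambda a, b: a * b // gcd(a, b), (x.denominator for x in q), 1)
    ints = [int(x * scale) for x in q]
    common = reduce(gcd, ints)
    return [x // common for x in ints]


# First two edges of the chain on this slide: (p, c) = (1, 2) and (2, 1).
chain = repetition_vector(3, [(0, 1, 1, 2), (1, 2, 2, 1)])
# Graph A -> B -> C where A produces 2 tokens per firing and all other rates are 1.
abc = repetition_vector(3, [(0, 1, 2, 1), (1, 2, 1, 1)])
```

For the A, B, C graph the result [1, 2, 2] says that one period fires A once and B and C twice each, which matches schedules such as A B B C C.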

  25. SDF▶ Example
      • Surge Application
        • Actors: A (ADC read), B (RF pack), C (RF send)
        • Buffers: x, y
        • Schedule: ABC
        • Graph: A → x → B → y → C, all produce and consume rates are 1
      • Rappit Script (4L):
          every 2048:
              x = ADC.read()
              y = RF.pack(x)
              RF.send(y)

  26. SDF▶ Example (cont’d)
      • Same code in Java (20L) [J. Koshy ’05]:
          SurgePacket sgPkt;
          char eList, eVector;
          byte sHandle;
          sgPkt = new SurgePacket();
          eList = Select.setEventId(eList, Events.TIMEOUT | Events.RADIO_RECV);
          sHandle = Select.requestSelectHandle();
          char val;
          Clock.startTimeout(2048);
          while (true) {
              eVector = Select.select(sHandle, eList);
              if (Select.eventOccurred(eVector, Events.TIMEOUT)) {
                  val = PhotoSensor.sense();
                  sgPkt.setReading(val);
                  Surge.sendPacket(sgPkt);
                  Clock.startTimeout(2048);
              } else if (Select.eventOccurred(eVector, Events.RADIO_RECV)) {
                  handleRadioEvent(sgPkt); // if base, forward to uart
              }
          }

  27. Problem Statements • Find the best schedule and buffer mapping that minimizes the buffer size requirement • Goal-oriented • Previous work • Find the best schedule and buffer mapping that fits into, and maximizes the utilization of a given memory size • Constraint-driven • Novel • Practical

  28. Buffer Mapping Problem▶ Spatial representation
      • Token-lifetime chart (t-chart)
        • row: token’s lifetime, produced → placed → consumed
        • column: fixed number of token changes caused by firing event
      [Figure: t-chart with local buffers x and y on the vertical axis and time on the horizontal axis, for firings A B B C C]

  29. Buffer Mapping Problem▶ Spatial representation (cont’d)
      • Memory-usage profile (m-profile)
      • Metrics
        • Msize = 4, Mtotal = 20, Mused = 11, Mwasted = 9, Mutil = 55%
        • T = 5
      [Figure: m-profile with memory on the vertical axis and time on the horizontal axis, for firings A B B C C]
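The slide gives the metric values without their definitions. The numbers are consistent with Mtotal = Msize × T, Mwasted = Mtotal − Mused, and Mutil = Mused / Mtotal, and the sketch below reproduces them under that reading. These relations are inferred from the figures rather than stated by the author, and the function name is hypothetical.

```python
def memory_metrics(m_size, t, m_used):
    """Derive m-profile metrics from buffer size, schedule length, and occupied cells.

    Assumed relations (inferred from the slide's numbers):
      Mtotal  = Msize * T        memory cells available over the whole schedule
      Mwasted = Mtotal - Mused   cells that were allocated but held no live token
      Mutil   = Mused / Mtotal   fraction of the allocation that was occupied
    """
    m_total = m_size * t
    m_wasted = m_total - m_used
    m_util = m_used / m_total
    return {"Mtotal": m_total, "Mwasted": m_wasted, "Mutil": m_util}


# Values from this slide: Msize = 4, T = 5, Mused = 11.
unshared = memory_metrics(m_size=4, t=5, m_used=11)
print(unshared["Mtotal"], unshared["Mwasted"], f"{unshared['Mutil']:.0%}")
```

With the slide's inputs this yields Mtotal = 20, Mwasted = 9, and a utilization of 55%.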

  30. Related Work▶ Data memory optimization based on MoC

  31. Memory Optimization Techniques
      1) *Scheduling w/ Unshared Buffer
      2) *Buffer Sharing
      3) *I/O Buffer Merging
      4a) **Fractionizing
      4b) Rate Selection (new)
      5) Pipelining (new)
      * Well established previous work
      ** Recently proposed

  32. Memory Optimization Techniques▶ 1) Scheduling with unshared buffer
      • By efficient ordering of actors, the buffer requirement is reduced!
      • Each edge is directly mapped to its dedicated buffer space
      • Graph: A → x → B → y → C; A produces 2 tokens on x per firing, all other rates are 1
      • Schedule 1: A B B C C
          Script:
              x = A()
              repeat 2:
                  y = B(x)
              repeat 2:
                  C(y)
          Unrolled:
              x[0..1] = A()
              y[0] = B(x[0])
              y[1] = B(x[1])
              C(y[0])
              C(y[1])
          Buffer requirement: |x| + |y| = 2 + 2 = 4
      • Schedule 2: A B C B C
          Script:
              x = A()
              repeat 2:
                  y = B(x)
                  C(y)
          Unrolled:
              x[0..1] = A()
              y[0] = B(x[0])
              C(y[0])
              y[0] = B(x[1])
              C(y[0])
          Buffer requirement: |x| + |y| = 2 + 1 = 3
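The two buffer requirements on this slide can be checked by replaying each schedule and recording the peak number of tokens on every edge. The following is a small sketch of that bookkeeping, assuming each edge owns a dedicated buffer as in the unshared case. The function name and the edge dictionary layout are illustrative choices, not part of the presented tool.

```python
def peak_buffer_usage(edges, schedule):
    """Replay a firing sequence and return the peak token count per edge.

    edges: dict mapping buffer name -> (src actor, snk actor, produce rate, consume rate)
    schedule: iterable of actor names in firing order
    """
    tokens = {name: 0 for name in edges}
    peak = {name: 0 for name in edges}
    for actor in schedule:
        # An actor first consumes its input tokens ...
        for name, (src, snk, p, c) in edges.items():
            if snk == actor:
                if tokens[name] < c:
                    raise ValueError(f"{actor} fired without enough tokens on {name}")
                tokens[name] -= c
        # ... and then produces its output tokens.
        for name, (src, snk, p, c) in edges.items():
            if src == actor:
                tokens[name] += p
                peak[name] = max(peak[name], tokens[name])
    return peak


# Graph from the slide: A produces 2 tokens on x, every other rate is 1.
graph = {"x": ("A", "B", 2, 1), "y": ("B", "C", 1, 1)}

peak_1 = peak_buffer_usage(graph, "ABBCC")  # Schedule 1
peak_2 = peak_buffer_usage(graph, "ABCBC")  # Schedule 2
print(sum(peak_1.values()), sum(peak_2.values()))
```

Interleaving B and C in Schedule 2 drains y after every firing of B, so y never holds more than one token and the total drops from 4 to 3.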

  33. Memory Optimization Techniques▶ Comparing 1), 2), 3)
      • Graph: A → x → B → y → C with rates 2, 1, 1, 1; Schedule: A B B C C
      • Script:
          x = A()
          repeat 2:
              y = B(x)
          repeat 2:
              C(y)
      • 1) Unshared Buffer
          x[0..1] = A()
          y[0] = B(x[0])
          y[1] = B(x[1])
          C(y[0])
          C(y[1])
          Buffer requirement: |x| + |y| = 2 + 2 = 4
      • 2) Shared Buffer: data consumed… reuse the available space! (x[0] is free after B(x[0]))
          x[0..1] = A()
          y[0] = B(x[0])
          x[0] = B(x[1])
          C(y[0])
          C(x[0])
          Buffer requirement: |x| + |y| = 2 + 1 = 3
      • 3) Merged I/O Buffer: assuming the token is consumed before output is produced… use the same space for the input/output tokens
          x[0..1] = A()
          x[0] = B(x[0])
          x[1] = B(x[1])
          C(x[0])
          C(x[1])
          Buffer requirement: |x| + |y| = 2 + 0 = 2

  34. Memory Optimization Techniques▶ Comparing 1), 2), 3) (cont’d)
                   1) Unshared Buffer   2) Shared Buffer   3) Merged I/O Buffer
      |x|+|y|      4                    3                  2
      Mtotal       20                   15                 10
      Mused        11                   11                 9
      Mwasted      9                    4                  1
      Mutil        55%                  73%                90%
      [Figure: t-chart with local buffers x and y over time for firings A B B C C, tokens t1 to t4]

  35. Memory Optimization Techniques▶ 4a) Fractionizing
      • Idea:
        • Don’t wait until A produces a big chunk of data
        • Modify actor A to process only a fractional amount of the original data at a time
      • Trade-off
        • Local effect
          • Possible time and energy overhead
          • e.g., resource’s access time, packet overhead
        • Global effect
          • Reduced bottleneck: shorter processing interval of A
          • Reduced buffer size: min|x|: 2 → 1
      [Figure: original graph w → A → x → B with rates 1, 3, 1, Schedule: A 3(B); fractionized graph w → A’ → x → B with rates 1/3, 1, 1, Schedule: 2(AB)]

  36. Memory Optimization Techniques▶ 4b) Rate Selection
      • Idea
        • Generalize fractionizing
        • Not only allow fractions but also multiples
        • Rate is defined as a range, but fixed before the schedule is finalized
        • Each actor is modeled with timing and power functions with respect to the I/O range
      • Benefits
        • Combines the power of flexibility and static determinism
        • Increases buffer reduction opportunity
      • Challenge
        • Need an efficient way to handle the considerably increased exploration space at runtime
      [Figure: graph w → A → x → B with rate labels (4,4), (2,6), (1,3); Schedule1: 2(A)B, Schedule2: AB, Schedule3: 2(A)3(B)]

  37. Memory Optimization Techniques▶ 5) Pipelining • Idea • Allow multiple actors to fire at once • Benefits • Reduced buffer requirement • Higher memory utilization • Increased throughput • Challenges • Need multiprocessors • Need to resolve resource conflicts • Need to consider synchronization problems

  38. Memory Optimization Techniques▶ Comparing 1), 4), 5)
      [Figure: three panels for the graph A → x → B → y → C. 1) Unshared Buffer: rates 2, 1, 1, 1 with firings A B C B C. 4) Fractionized / Rate Selected: A’ with rates 1/2, 1, 1, 1, 1. 5) Pipelined: firings A B C overlapped with C A B C]

  39. Memory Optimization Techniques▶ Summary
      • 0: None (baseline)
      • 1: Unshared Scheduling
      • 2: Shared Buffer
      • 3: Merged I/O
      • 4: Fractionized
      • 5: Pipelined
      [Figure: comparison of techniques 0 to 5 for actors A, B, C, with a global buffer label]

  40. Multi-metric Optimization
      • Trade-offs
        • From the actor’s point of view (local), processing a large amount of data at once tends to reduce time and energy overhead
        • From the SDF-flow point of view (global), processing a small amount of data at once reduces the buffer requirement
      • Goal
        • Find a Pareto-optimal point within the set of solutions that satisfy the constraints
      [Figure: Energy, Data Memory, and Execution Time plotted against data-flow rate]
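One way to read the goal on this slide: first discard candidates that violate the constraints, then keep only those that no other feasible candidate beats on every metric. The sketch below illustrates that filter. The candidate tuples and limits are made-up numbers for illustration only, not measurements from the presentation, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on every metric and strictly better on one.

    Lower is better for all metrics (energy, data memory, execution time).
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates, limits):
    """Return the feasible candidates that no other feasible candidate dominates."""
    feasible = [c for c in candidates if all(x <= lim for x, lim in zip(c, limits))]
    return [c for c in feasible if not any(dominates(o, c) for o in feasible)]


# Hypothetical (energy, data memory, execution time) per data-flow rate setting.
candidates = [(9, 2, 8), (6, 4, 5), (7, 4, 6), (4, 8, 3), (3, 16, 2)]
limits = (10, 8, 10)  # hypothetical constraints, e.g. a data memory cap of 8

front = pareto_front(candidates, limits)
print(front)
```

In this toy data the last candidate exceeds the memory limit and (7, 4, 6) is beaten by (6, 4, 5), so three trade-off points remain. Choosing among them still requires a preference between memory and time or energy.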

  41. Applying it to Rappit▶ Quasi-static optimization
      Rappit Flow         Performed Tasks
      Compile             Kernel and primitives compiled and installed
      Load script         SDF defined
      Preprocess          Optimization: actor-to-processor assignment, actor ordering (scheduling), buffer mapping
      Load script code    Static schedule loaded
      Execute             Deterministic execution w/o runtime overhead
      [Diagram labels: Compile-time, Run-time, Host, Target]

  42. Outline • Scripting Framework • Memory-oriented Optimization • Implementation • Synthesis Tool • Simulator • Runtime Host-assisting Tool (GUI) • Experimental Platforms • Summary & Research Plan

  43. Implementation▶ Scripting engine synthesis tool • System Template • GUI-based check-box approach • easily capture existing systems • model new systems for simulation and design space exploration • includes communication description • Component Library • binds according to template configuration • consists of MCU, on-chip devices, off-chip peripherals • each component has I/O pins and driver modules

  44. Implementation▶ Memory simulator

  45. Implementation▶ Interactive runtime tool

  46. Implementation▶ Tool integration
      [Diagram: host tool blocks GUI, Parser, Dispatcher, Scheduler, Node Manager, Memory Optimizer, connected to Node 1, Node 2, Node 3, … Node N]

  47. Outline • Scripting Framework • Memory-oriented Optimization • Implementation • Experimental Platforms • Summary & Research Plan

  48. HW Platforms and Real-world Applications • Eco • ultra-compact sensor node • pre-term infant monitoring • dancing motion detection • Mini-FDPM • active laser sensing device • breast cancer detection • DuraNode • real-time data acquisition system • structural health monitoring • Butterfly • low-power, i/o rich development board • prototyping (SD-card, speaker, sensors, RF)

  49. Outline • Scripting Framework • Memory-oriented Optimization • Implementation • Experimental Platforms • Summary & Research Plan

  50. Summary • A novel scripting framework for embedded systems • Scripting engine synthesis • Host assisting runtime environment • Memory optimization techniques • Comparison of techniques • Integration and multi-objective problem • Tool Implementations • Rappit GUI, memory simulator
