Demand and Penalty-Based Resource Allocation for Reconfigurable Systems with Runtime Partitioning
John Ardini
Motivation
• For a given HW architecture that includes reconfigurable components
  • Optimize performance, taking into account long reconfiguration times and current processing demands
• Application in systems with unknown runtime processing demands
  • Cognitive systems
  • Multisensor systems
  • Systems with unknown data lengths
• Take advantage of the ability to express hardware implementations in a high-level language (C) common to processors and programmable devices
Related Work
• Li, Compton, Hauck [00], based on Young [94]
  • “Credit” for an RC unit is proportional to the size of the unit
  • Penalty algorithm for defragmentation
  • A scoring approach is also used here, but “credit” is proportional to the amount of “acceleration” achieved, with a decision threshold based on size
• Vuletić, Pozzi, Ienne [04]
  • HW/SW abstraction layer proposed for an RC transparent programming model
Goals
• Examine possible RTL generators allowing one set of source code for an algorithm
  • Binds to processor or programmable device (FPGA)
  • Minimal changes (I/O only) required to source
  • However, the scheduling approach does not depend on the capability of C-to-RTL generators
• Show easy creation of processor and FPGA implementations of logic
• Assume task scheduling is unknown at build time and is based on service requests
• Allow each task to support SW-only and hardware-accelerated versions
• Define simple logic to make “best” use of hardware resources, assigning ownership dynamically
• Show benefit of RC via DMA in algorithms that can be bound to HW or SW
• Define API for application threads
• Demonstrate concept in real hardware
Experimental Environment
• Worker thread and coprocessor DMA model set up in Windows as a multithreaded VC++ application
• Coprocessor is an FPGA on an Alpha-Data PCI card
• Implemented algorithm execution with and without the coprocessor
• Used DMA to help hide the overhead of reconfiguration: SW-only threads can execute during configuration
• Service requests initiated by adjustable timers to exercise the RC logic
• Event logging for analysis
[Figure: manager thread, worker threads 1 and 2, and coprocessor; labels: worker thread registration, service request, dataset, DMA config, savings]
Hardware Environment
• Alpha-Data Virtex-II Pro card on PCI bus
• Simple bus wrapper gets coprocessor IP onto the Alpha-Data local bus
• PC chosen for easy development and focus on unique logic
[Figure: labels: PC; local bus to PCI bridge; FPGA; wrapper; IP]
RTL Generator
• ImpulseC chosen for this study
• ANSI C-like
• Simple modifications to algorithm to compile for processor
  • Data I/O path
  • Word types as simple #defines
• High level of abstraction
  • Small learning curve
  • Give up low-level control of registers/signals
  • Some control over max gate delay using #pragma
• Desktop simulation for fast algorithm debug
Software
• Manager and application in VC++
  • Easily implemented in C as well
• For the demo, the Windows “worker thread” model was used, but other static thread + messaging methods could be used as well
Test Algorithms
• Two tasks implemented
  • FIR
  • FFT
• HW implementation flow
  • Code in C
  • ImpulseC RTL generator
  • Synplify
  • Xilinx implementation tools
• SW flow
  • Change I/O in HW algorithm to use shared memory buffer
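
One way to keep a single FIR source for both flows is to hide the data I/O behind two macros, so that only the macro definitions change between the HW and SW builds. The sketch below shows the SW side only, reading and writing a shared memory buffer. The names SAMPLE_T, NUM_TAPS, READ_SAMPLE, WRITE_SAMPLE and fir_run are hypothetical and do not come from the slides, and the RTL generator's own stream I/O calls are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Word type as a simple #define (hypothetical name). */
#define SAMPLE_T int32_t
#define NUM_TAPS 4

/* SW build: I/O goes through a shared memory buffer. A HW build would
   redefine these two macros to use the RTL generator's I/O (not shown). */
#define READ_SAMPLE(buf, i)      ((buf)[(i)])
#define WRITE_SAMPLE(buf, i, v)  ((buf)[(i)] = (v))

/* Direct-form FIR: out[i] = sum over k of coeff[k] * in[i - k],
   treating samples before the start of the buffer as zero. */
void fir_run(const SAMPLE_T *in, SAMPLE_T *out, size_t n,
             const SAMPLE_T coeff[NUM_TAPS])
{
    for (size_t i = 0; i < n; ++i) {
        SAMPLE_T acc = 0;
        for (size_t k = 0; k < NUM_TAPS && k <= i; ++k) {
            acc += coeff[k] * READ_SAMPLE(in, i - k);
        }
        WRITE_SAMPLE(out, i, acc);
    }
}
```

Feeding a unit impulse through fir_run returns the coefficients in order, which is a quick sanity check that both builds compute the same filter.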
IP Development Outline
• Write task coprocessor for HW using ImpulseC
• Modify I/O for processor implementation
• Quantify savings in clock cycles for HW-accelerated version
• Wrap both implementations into a “worker thread” that uses one of the implementations based on coprocessor ownership
  • Need to check coprocessor ownership on thread start
• Worker thread registration not considered here
  • Could be defined on power-up, or
  • Dynamically registered
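
The wrapping step can be pictured as a small dispatch function that checks ownership when the worker starts and then calls one of the two implementations. This is only a sketch: worker_t, task_impl_fn, worker_run and the two demo_ functions are hypothetical names, and the cycle counts returned by the demo functions are made-up stand-ins for real FIR or FFT code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task signature: process a data set, return cycles spent. */
typedef uint32_t (*task_impl_fn)(const int32_t *data, uint32_t len);

typedef struct {
    task_impl_fn run_sw;       /* software-only implementation */
    task_impl_fn run_hw;       /* coprocessor-accelerated implementation */
    bool         owns_coproc;  /* set by the manager, checked on thread start */
} worker_t;

/* Stand-in implementations with invented costs, for illustration only. */
uint32_t demo_sw(const int32_t *data, uint32_t len) { (void)data; return len * 10u; }
uint32_t demo_hw(const int32_t *data, uint32_t len) { (void)data; return len * 2u; }

/* Run the task once; used_hw reports which path was taken. */
uint32_t worker_run(const worker_t *w, const int32_t *data, uint32_t len,
                    bool *used_hw)
{
    *used_hw = w->owns_coproc && w->run_hw != 0;
    return *used_hw ? w->run_hw(data, len) : w->run_sw(data, len);
}
```

Because the SW path is always available, a worker that loses ownership keeps servicing requests at the slower rate.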
Worker Thread Control Block
• One instantiated per worker thread
• Contains information about the coprocessor bit stream
• Points to the HW resource it currently owns
  • Would be used in multiple-coprocessor systems for faster manager logic
• Contains base address of its coprocessor
• Maintained by the manager and used as a semaphore for coprocessor use
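
A control block with these fields might look like the following in C. The slides do not give the actual layout, so every field and function name here (worker_tcb, bitstream, owned_rc, coproc_base, wtcb_owns_coproc) is an assumption, as is the choice to treat a non-NULL resource pointer as the ownership flag.

```c
#include <stddef.h>
#include <stdint.h>

struct rc_tcb;  /* control block for a HW resource, defined elsewhere */

/* Hypothetical layout of the per-worker control block. */
typedef struct {
    const uint8_t *bitstream;      /* coprocessor bit stream image */
    size_t         bitstream_len;  /* size of the image in bytes */
    struct rc_tcb *owned_rc;       /* HW resource currently owned, or NULL */
    uintptr_t      coproc_base;    /* base address of its coprocessor */
} worker_tcb;

/* Assumption: the manager updates owned_rc, and a non-NULL value means
   the worker may use the coprocessor. */
int wtcb_owns_coproc(const worker_tcb *w)
{
    return w->owned_rc != NULL;
}
```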
RC Thread Control Block
• Control block for HW resource
• Holds information about the resource, e.g. the ID of the resource
• Member function to kick off bit stream load process via DMA
  • Target thread can continue to run SW-only until configuration is complete
• Member function to gain coprocessor access on behalf of a worker thread, based on ownership and state (has the bit stream finished loading?)
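
The two member functions can be sketched as plain C functions over a small state machine. The names rc_tcb, rc_state, rc_start_load, rc_load_done and rc_try_acquire are hypothetical, and the DMA transfer itself is reduced to a state change so that the access rule is easy to see.

```c
#include <stdbool.h>

typedef enum { RC_IDLE, RC_LOADING, RC_READY } rc_state;

/* Hypothetical control block for one HW resource. */
typedef struct {
    int      resource_id;  /* ID of the hardware resource */
    int      owner_id;     /* worker that owns it, or -1 for none */
    rc_state state;        /* progress of the bit stream load */
} rc_tcb;

/* Start loading a bit stream for new_owner. A real system would start
   the DMA transfer here; this sketch only records the state change. */
void rc_start_load(rc_tcb *rc, int new_owner)
{
    rc->owner_id = new_owner;
    rc->state = RC_LOADING;
}

/* Called when the DMA done event arrives. */
void rc_load_done(rc_tcb *rc)
{
    rc->state = RC_READY;
}

/* Grant access only to the owner, and only once the load has finished.
   A false result means the caller should run its SW-only version. */
bool rc_try_acquire(const rc_tcb *rc, int worker_id)
{
    return rc->owner_id == worker_id && rc->state == RC_READY;
}
```

While the state is RC_LOADING, rc_try_acquire returns false even for the new owner, which is what lets that thread keep running in software during configuration.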
Coprocessor Ownership
• All service requests pass through the thread manager
• Manager uses “scoring” logic
  • Upon completion, worker threads report the “savings” that were achieved, or could have been achieved, using a coprocessor
  • Manager increments score for that thread
  • Highest-scoring threads receive a coprocessor
• Reassignment not done until a threshold is passed
  • Set based on relative time penalty of performing a reconfiguration, e.g. reconfigure when the score delta exceeds 10x the reconfiguration time
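
A minimal sketch of the score increment and the threshold test follows. The 10x factor is the example given on the slide; the reconfiguration time of 50000 microseconds is an invented placeholder, and manager_t, manager_report_savings and manager_should_reconfig are hypothetical names.

```c
#include <stdint.h>

#define NUM_TASKS 2
#define RECONFIG_TIME_US 50000u                /* assumed value, not from the slides */
#define RC_THRESHOLD (10u * RECONFIG_TIME_US)  /* 10x rule from the slide */

typedef struct {
    uint32_t score[NUM_TASKS];  /* accumulated savings per task, in microseconds */
} manager_t;

/* Called when a worker completes and reports the savings it achieved,
   or could have achieved, with a coprocessor. */
void manager_report_savings(manager_t *m, int task, uint32_t savings_us)
{
    m->score[task] += savings_us;
}

/* Reassign only when the challenger leads the owner by more than the threshold. */
int manager_should_reconfig(const manager_t *m, int owner, int challenger)
{
    return m->score[challenger] > m->score[owner] &&
           m->score[challenger] - m->score[owner] > RC_THRESHOLD;
}
```

Expressing scores and the threshold in the same time unit is what makes the comparison meaningful: the challenger has to have lost many reconfiguration times' worth of savings before a swap pays off.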
Scoring Logic
• Need to bound scores
  • Bound should be greater than RC threshold
  • 2x RC threshold used in these tests
• Need to maintain “relative” performance of competing tasks, i.e. can’t have most scores saturating
• Therefore, when updating scores at thread completion, subtract the current lowest score from all registered threads
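
The update described on this slide can be sketched as: add the reported savings, subtract the lowest score from every task, then clamp to the bound. The threshold value of 100 and the names scores_update, RC_THRESHOLD and SCORE_MAX are assumptions for illustration; only the 2x relationship comes from the slide.

```c
#include <stdint.h>

#define NUM_TASKS 3
#define RC_THRESHOLD 100u              /* assumed value, arbitrary units */
#define SCORE_MAX (2u * RC_THRESHOLD)  /* bound of 2x the RC threshold */

/* Add savings to one task, then rebase all scores by the lowest score
   and clamp each one to SCORE_MAX. */
void scores_update(uint32_t score[NUM_TASKS], int task, uint32_t savings)
{
    score[task] += savings;

    uint32_t lowest = score[0];
    for (int i = 1; i < NUM_TASKS; ++i) {
        if (score[i] < lowest) lowest = score[i];
    }

    for (int i = 0; i < NUM_TASKS; ++i) {
        score[i] -= lowest;
        if (score[i] > SCORE_MAX) score[i] = SCORE_MAX;
    }
}
```

For example, scores of {50, 60, 70} with 40 added to task 0 become {30, 0, 10} after rebasing, so the relative ordering is kept while the values stay small.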
Scoring Details
• Simple subtraction of the lowest score is not enough
  • One inactive thread would allow “integrator windup” on the remaining threads
    • Slow response when the inactive thread comes back online
  • Saturation logic would prevent the selection of coprocessor owners, i.e. they would all “collect” at the top of the score list
  • Prevents initial accumulation of scores
• Therefore, subtract score x from each task, where x is the lowest nonzero score among all tasks other than the top-scoring m threads, and m is the number of available coprocessors
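
The refined rule can be sketched as follows: mark the top m tasks, take x as the lowest nonzero score among the rest, and subtract x from every task. Flooring the result at zero, breaking ties by lower index, and using x = 0 when no other task has a nonzero score are assumptions of this sketch, as are the constants and the name scores_rebase.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 4
#define NUM_COPROCS 1   /* m, the number of available coprocessors */
#define SCORE_MAX 200u  /* assumed bound, arbitrary units */

void scores_rebase(uint32_t score[NUM_TASKS])
{
    bool in_top[NUM_TASKS] = { false };

    /* Mark the top m scoring tasks (ties go to the lower index). */
    for (int k = 0; k < NUM_COPROCS && k < NUM_TASKS; ++k) {
        int best = -1;
        for (int i = 0; i < NUM_TASKS; ++i) {
            if (!in_top[i] && (best < 0 || score[i] > score[best])) best = i;
        }
        in_top[best] = true;
    }

    /* x = lowest nonzero score among the remaining tasks (0 if none). */
    uint32_t x = 0;
    for (int i = 0; i < NUM_TASKS; ++i) {
        if (!in_top[i] && score[i] != 0 && (x == 0 || score[i] < x)) x = score[i];
    }

    /* Subtract x from every task, flooring at 0 and clamping to the bound. */
    for (int i = 0; i < NUM_TASKS; ++i) {
        score[i] = (score[i] > x) ? score[i] - x : 0;
        if (score[i] > SCORE_MAX) score[i] = SCORE_MAX;
    }
}
```

With scores {150, 40, 0, 90} and m = 1, x is 40 and the result is {110, 0, 0, 50}: the inactive task at zero no longer pins the baseline, so the active tasks do not wind up toward the bound together.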
Coproc Assignment
• Get highest-scoring non-owner in top m tasks
• Compare its score to that of the lowest-ranking owner
• If the difference is greater than the threshold, RC
  • If the current owner is using the resource, skip RC
  • If RC is still the right decision after the current owner finishes, RC will happen at that time
• More logic could be used to continue comparing against current coproc owners
[Figure: ranked task scores t1 to t5 (* = current owner: t1 and t4); the top m tasks are eligible for coprocessor ownership, lower-ranking tasks run in SW; a "Δ > thresh?" comparison is marked]
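
The decision steps above can be sketched as a single function that either names a challenger and the owner to displace, or reports that no reconfiguration is due. The type task_t, the functions rank_of and pick_reconfig, and the constants are hypothetical. The sketch does not cover the initial case where no task owns a coprocessor yet.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 5
#define NUM_COPROCS 2       /* m */
#define RC_THRESHOLD 100u   /* assumed value, arbitrary units */

typedef struct {
    uint32_t score;
    bool     owner;  /* currently owns a coprocessor */
    bool     busy;   /* currently using the coprocessor */
} task_t;

/* Number of tasks that outrank task i (ties broken by lower index). */
static int rank_of(const task_t t[NUM_TASKS], int i)
{
    int rank = 0;
    for (int j = 0; j < NUM_TASKS; ++j) {
        if (t[j].score > t[i].score ||
            (t[j].score == t[i].score && j < i)) ++rank;
    }
    return rank;
}

/* Returns true and fills in the pair to swap when a reconfiguration is due. */
bool pick_reconfig(const task_t t[NUM_TASKS], int *challenger, int *victim)
{
    int best = -1, worst_owner = -1;
    for (int i = 0; i < NUM_TASKS; ++i) {
        /* Highest-scoring non-owner among the top m tasks. */
        if (!t[i].owner && rank_of(t, i) < NUM_COPROCS &&
            (best < 0 || t[i].score > t[best].score)) best = i;
        /* Lowest-ranking current owner. */
        if (t[i].owner &&
            (worst_owner < 0 || t[i].score < t[worst_owner].score)) worst_owner = i;
    }
    if (best < 0 || worst_owner < 0) return false;
    if (t[worst_owner].busy) return false;  /* skip; re-evaluate when it finishes */
    if (t[best].score <= t[worst_owner].score ||
        t[best].score - t[worst_owner].score <= RC_THRESHOLD) return false;

    *challenger = best;
    *victim = worst_owner;
    return true;
}
```

Calling pick_reconfig again when the busy owner finishes gives the deferred behavior described on the slide, since the comparison is simply repeated with the scores current at that time.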
Reconfiguration Thread
• Created by manager
• Kicks off DMA process
• Waits for done event
• Sends reconfiguration-complete message back to manager
• Manager can then give access to the worker thread owner
Test Configuration
• Single HW resource available
• Two competing threads: FFT and FIR processing
• Fixed HW block sizes
• Fixed data set sizes = fixed savings
  • Adjust for mismatch in microprocessor vs. FPGA clock rates
• Service request rates for each thread adjustable to exercise RC logic
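
Because the processor and the FPGA run at different clock rates, cycle counts have to be converted to a common time unit before the savings for a fixed data set can be computed. The sketch below shows one way to do that. The function name savings_ns is hypothetical, and the slides do not give the actual clock rates or cycle counts used in the tests.

```c
#include <stdint.h>

/* Time saved by the accelerated version, in nanoseconds, or 0 if the
   HW version is not faster. Each cycle count is converted using its
   own clock rate. Note: cycles * 1e9 must fit in 64 bits, so this
   sketch is only valid for counts below roughly 1.8e10 cycles. */
uint64_t savings_ns(uint64_t sw_cycles, uint64_t cpu_hz,
                    uint64_t hw_cycles, uint64_t fpga_hz)
{
    uint64_t sw_ns = sw_cycles * 1000000000ull / cpu_hz;
    uint64_t hw_ns = hw_cycles * 1000000000ull / fpga_hz;
    return sw_ns > hw_ns ? sw_ns - hw_ns : 0;
}
```

As an invented example, 2,000,000 cycles on a 2 GHz processor take 1,000,000 ns, while 50,000 cycles on a 100 MHz FPGA take 500,000 ns, giving a saving of 500,000 ns per service request.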
Results
[Figure: thread scores over time as the service request rates for the two threads vary; annotations: score saturation, RC threshold, hysteresis, RC event; ownership regions: no owner, thread 1 owns, thread 2 owns]
Reconfiguration Detail
[Figure: annotation: RC DMA period]
RC DMA with Higher Demand Rate
[Figure: annotation: RC DMA period]
Conclusions
• Coprocessor ownership given based on best sustained use of the resource
• Provides hysteresis to prevent frequent reconfigurations
• Low-overhead RC decision logic
• Hardware and software implementations allow DMA to hide reconfiguration overhead
• IP description in C allows it to be created once, then compiled for microprocessor and FPGA targets