Maximizing GPU Performance for Real-Time Systems: Platform-Based Design Principles
Explore the use of GPUs and PicoChip in real-time systems, covering parallel resource control, shading techniques, deadline guarantees, and fixed-priority scheduling. Understand why GPUs are preferred for dynamic resource allocation and how to achieve timeliness in embedded systems. Learn about task scheduling models and analysis for optimized performance.
Platform-based Design (5KK70 MPSoC): Controlling the Parallel Resources
Contents • GPUs • PicoChip • Real-Time Scheduling basics • Resource Management
GPU basics • Synthetic objects are represented as a set of 3D triangles plus textures, in a language/library such as OpenGL or DirectX • A triangle is represented by 3 vertices • A vertex is represented by 4 coordinates in floating-point precision • Objects are transformed between coordinate representations • These transformations are matrix-vector multiplications
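The transform described above can be sketched as a 4×4 matrix-vector multiplication on a homogeneous vertex. This is a minimal pure-Python illustration of the single operation a GPU performs massively in parallel; the translation matrix and vertex values are made up for the example.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vertex."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

# Homogeneous translation by (2, 3, 4): a typical coordinate transform.
translate = [
    [1, 0, 0, 2],
    [0, 1, 0, 3],
    [0, 0, 1, 4],
    [0, 0, 0, 1],
]
vertex = [1.0, 1.0, 1.0, 1.0]  # x, y, z, w
print(mat_vec(translate, vertex))  # [3.0, 4.0, 5.0, 1.0]
```

A GPU applies this same operation to millions of vertices per frame, which is why the shading workload parallelizes so well.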
GeForce 8800 GPU: 330 Gflops, 128 processors with 4-way SIMD
GPU: Why more general-purpose and programmable? • All transformations are shading operations • Shading boils down to matrix-vector multiplications • The computational load varies heavily between different kinds of shading • Programmable shaders allow dynamic resource allocation between shading stages Result: • Modern GPUs are a serious competitor to general-purpose processors!
Real-time systems (Reinder Bril) • Correct result at the right time: timeliness • Many products contain embedded computers, e.g. cars, planes, medical and consumer electronics equipment, industrial control. • In such systems, it’s important to deliver correct functionality on time. • Example: inflation of an air bag
Example: Multimedia Consumer Terminals (by courtesy of Maria Gabrani) – block diagram (figure): cable modem, DVB tuner, IEEE 1394 interface, RF tuner, CVBS interface, YC interface, VGA, DVD, CDx front end
Example: High-quality video & real time. TV companies invest heavily in video enhancement, e.g. temporal up-scaling: a 24 Hz input stream (movie) is rendered as a 60 Hz stream (TV screen). (Figure: original, up-scaled, and displayed frames.) • A deadline miss leads to a “wrong” picture. • Deadline misses tend to come in bursts (heavy load). • Valuable work may be lost.
Real-time systems • Guaranteeing timeliness requirements: • real-time tasks with timing constraints • scheduling of tasks • Fixed-priority scheduling (FPS) is the de-facto standard for scheduling in real-time systems. • FPS is supported by • commercially available RTOSs; • analysis and synthesis methods.
Recap of FPS • Fixed Priority Pre-emptive Scheduling (FPPS) • A basic scheduling model • Analysis • Example • Optimality of RMS and DMS
FPPS: A basic scheduling model • Single processor • Set of n independent, periodic tasks τ1, …, τn • Tasks are assigned fixed priorities, and can be pre-empted instantaneously. • Scheduling: at any moment in time, the processor executes the highest-priority task that has work pending.
FPPS: A basic scheduling model • Task characteristics: • period T, • (worst-case) computation time C, • (relative) deadline D. • Assumptions: • non-idling; • context-switch and scheduling overhead is ignored; • releases execute in order of arrival; • deadlines are hard, and D ≤ T; • τ1 has the highest and τn the lowest priority; • no data dependencies between tasks.
FPPS: Analysis • Schedulable iff WRi ≤ Di for 1 ≤ i ≤ n • Necessary condition: U = Σ Ci/Ti ≤ 1 • Sufficient condition for RMS: U ≤ LL(n) = n(2^(1/n) − 1), where RMS means ri > rj iff Ti < Tj, and Di = Ti.
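The Liu & Layland bound above is easy to compute directly. A small sketch of the sufficient utilization test for RMS (the function and task utilizations here are illustrative, not from the slides):

```python
def ll_bound(n):
    """Liu & Layland utilization bound for n tasks: n * (2^(1/n) - 1)."""
    return n * (2 ** (1.0 / n) - 1)

def rms_sufficient(utilizations):
    """Sufficient (not necessary) RMS schedulability test: U <= LL(n)."""
    return sum(utilizations) <= ll_bound(len(utilizations))

print(round(ll_bound(2), 3))  # 0.828
print(round(ll_bound(3), 3))  # 0.780
print(rms_sufficient([0.3, 0.4]))  # True  (U = 0.7 <= 0.828)
print(rms_sufficient([0.5, 0.4]))  # False (U = 0.9 >  0.828)
```

Note that the bound decreases with n (towards ln 2 ≈ 0.693), so failing this test does not mean the task set is unschedulable; it only means an exact test is needed.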
FPPS: Analysis • Otherwise (i.e. U ≤ 1 and not RMS, or n(2^(1/n) − 1) < U ≤ 1 and RMS), use the exact condition. • Critical instant: simultaneous release of τi with all higher-priority tasks. • WRi is the smallest positive solution of WRi = Ci + Σ_{j&lt;i} ⌈WRi/Tj⌉ Cj.
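The recurrence above is solved by fixed-point iteration starting from WRi = Ci. A sketch, using a hypothetical task set (not the one on the example slide); tasks are indexed by priority, index 0 being the highest:

```python
import math

def response_time(i, C, T):
    """Worst-case response time WR_i: smallest positive fixed point of
    WR_i = C_i + sum_{j<i} ceil(WR_i / T_j) * C_j.
    Returns None if the iteration diverges (task set overloaded)."""
    wr = C[i]
    while True:
        nxt = C[i] + sum(math.ceil(wr / T[j]) * C[j] for j in range(i))
        if nxt == wr:
            return wr          # fixed point reached
        if nxt > 100 * max(T):  # arbitrary cut-off for divergence
            return None
        wr = nxt

# Hypothetical task set: C = (1, 2, 3), T = (4, 8, 16), D = T.
C, T = [1, 2, 3], [4, 8, 16]
print([response_time(i, C, T) for i in range(3)])  # [1, 3, 7]
```

Here all three response times are below the respective periods, so the set is schedulable under RMS even though such a check would not be needed for it (U ≈ 0.69 is already under the LL bound).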
FPPS: Example • Task set Γ consisting of 3 tasks • Notes: • RM priority assignment and Di = Ti (RMS); • U1 + U2 + U3 = 0.97 ≤ 1, hence Γ could be schedulable; • utilization-bound test U ≤ LL(n) = n(2^(1/n) − 1): • U1 + U2 = 0.88 > LL(2) ≈ 0.83, • therefore U = 0.97 > LL(3), hence another (exact) test is required.
FPPS: Example • Time line (figure): tasks 1–3 scheduled over [0, 60]; WR1 = 3, WR2 = 17, WR3 = 56.
FPPS: Optimality of RMS and DMS • Priority assignment policies: • Rate Monotonic (RM): ri > rj iff Ti < Tj • Deadline Monotonic (DM): ri > rj iff Di < Dj • Under arbitrary phasing: • RMS is optimal among FPS when Di = Ti; • DMS is optimal among FPS when Di ≤ Ti, • where optimal means: if an FPS algorithm can schedule the task set, so can RMS/DMS.
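Both policies amount to sorting tasks by one parameter (smaller period or deadline gives higher priority). A minimal sketch with hypothetical tasks:

```python
def rm_order(tasks):
    """Rate Monotonic: sort by period T, ascending (shortest T first)."""
    return sorted(tasks, key=lambda t: t["T"])

def dm_order(tasks):
    """Deadline Monotonic: sort by relative deadline D, ascending."""
    return sorted(tasks, key=lambda t: t["D"])

tasks = [
    {"name": "a", "T": 50, "D": 20},
    {"name": "b", "T": 30, "D": 30},
    {"name": "c", "T": 40, "D": 40},
]
print([t["name"] for t in rm_order(tasks)])  # ['b', 'c', 'a']
print([t["name"] for t in dm_order(tasks)])  # ['a', 'b', 'c']
```

Note that the two orderings differ as soon as some Di < Ti, which is exactly the case DM is designed for.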
Non-Preemptive Systems (Akash Kumar) • The required state space is smaller • Lower implementation cost • Less overhead at run-time (cache pollution, memory size)
Why FPS doesn’t work for “future” high-performance platforms • Heavy-duty DSPs: pre-emption is not supported • Even if it were, context-switching overhead would be significant • Data dependencies are not taken into account • Multi-processor platforms
Related Research – Feasibility Analysis (figure, four quadrants): • preemptive uniprocessor [Liu, Layland, 1973] • non-preemptive uniprocessor [Jeffay, 1991] • homogeneous MPSoC [Baruah, 2006] • heterogeneous MPSoC (P1–P6): still open [ , 2020??]
Unpredictability – Variation in Execution Time (figure): execution times of tasks A and B vary between 49 and 50 cycles across processors P1–P3.
Problem: no good techniques exist to analyze and schedule applications on non-preemptive heterogeneous systems. A Resource Manager is proposed to schedule applications such that they meet their performance requirements on such systems.
Our Assumptions (figure: tasks A2–D2 of a job) • Heterogeneous MPSoC • Applications modeled as SDF (synchronous dataflow) graphs • Non-preemptive system – tasks cannot be stopped • Jobs can be suspended • A lot of dynamism in the system • Jobs arrive and leave at run-time • Variation in execution time • Very simple arbiter at each core
Resource-management hierarchy (figure): • QoS Manager – application level, time scale of a few seconds; reconfigures to meet the required quality • Resource Manager – time scale of milliseconds • Local processor arbiter – task level, time scale of microseconds
Architecture Description (figure: Resource Manager controlling local arbiters on P1–P3) • The available computation resources are described • Each processor can have a different arbiter • In this model a First-Come-First-Serve mechanism is used • The resource manager can configure/control the local arbiters to regulate the progress of applications if needed
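The per-core First-Come-First-Serve arbiter assumed here can be sketched in a few lines: tasks run to completion in arrival order, which is what makes the system non-preemptive. Class and task names are illustrative.

```python
from collections import deque

class FCFSArbiter:
    """Non-preemptive FCFS arbiter at one core: oldest task runs first."""

    def __init__(self):
        self.queue = deque()

    def arrive(self, task):
        """A task becomes ready on this core."""
        self.queue.append(task)

    def run_next(self):
        """Run the oldest waiting task to completion; None if idle."""
        return self.queue.popleft() if self.queue else None

arb = FCFSArbiter()
arb.arrive("A1"); arb.arrive("B1"); arb.arrive("A2")
print([arb.run_next() for _ in range(3)])  # ['A1', 'B1', 'A2']
```

Because the arbiter never reorders or interrupts tasks, all regulation of application progress has to happen one level up, in the resource manager.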
Resource Manager • Responsible for two main things • Admission control: • an incoming application specifies its throughput requirement and the execution time and mapping of each actor • the repetition vector is used to compute the expected utilization • the RM checks whether enough resources are present • and allocates resources to the application if it is admitted
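The admission test described above can be sketched as a per-processor utilization check: each actor contributes (execution time × repetitions) per iteration period. All names, numbers, and the dictionary layout are illustrative assumptions, not taken from the slides.

```python
def admit(app, load, capacity=1.0):
    """Admit app iff its estimated utilization fits on every processor.
    app["actors"]: (processor, actor execution time, repetition count);
    app["period"]: required iteration period (1 / required throughput);
    load: current utilization per processor."""
    demand = {}
    for proc, exec_time, reps in app["actors"]:
        demand[proc] = demand.get(proc, 0.0) + exec_time * reps / app["period"]
    return all(load.get(p, 0.0) + u <= capacity for p, u in demand.items())

load = {"P1": 0.5, "P2": 0.2}
video = {"period": 100.0, "actors": [("P1", 10.0, 4), ("P2", 20.0, 2)]}
print(admit(video, load))  # True: P1 goes to 0.9, P2 to 0.6
```

A real admission test for a non-preemptive system would also have to account for blocking by running actors, which a pure utilization check ignores.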
Video Conf Play MP3 Typing Sms P1 Admission Control Resource Reqmt Exceeded! P2 P3
Resource Manager • Admission control • Budget enforcement: • while running, each application signals the RM when it completes an iteration • the RM keeps track of each application’s progress • operation modes: ‘polling’ mode and ‘interrupt’ mode • the RM suspends an application if needed
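The polling-mode enforcement above can be sketched as follows: at each polling interval the RM compares completed iterations against the application's budget and suspends applications that run ahead (freeing the processors for the others). The class, budgets, and application names are illustrative.

```python
class ResourceManager:
    def __init__(self, budgets):
        # budgets: required iterations per application per polling interval
        self.budgets = budgets
        self.done = {name: 0 for name in budgets}
        self.suspended = set()

    def iteration_completed(self, name):
        """Applications signal the RM when they complete an iteration."""
        self.done[name] += 1

    def poll(self):
        """Suspend apps ahead of budget, resume the rest; reset counters."""
        for name, budget in self.budgets.items():
            if self.done[name] > budget:
                self.suspended.add(name)       # ahead: free the processor
            else:
                self.suspended.discard(name)   # on time or behind: run
            self.done[name] = 0                # start a new interval

rm = ResourceManager({"h263": 2, "jpeg": 1})
for _ in range(3):
    rm.iteration_completed("h263")
rm.iteration_completed("jpeg")
rm.poll()
print(rm.suspended)  # {'h263'}: ran 3 iterations against a budget of 2
```

The polling interval is the key tuning knob here: shorter intervals regulate progress more tightly but cost more RM overhead, which is exactly the trade-off studied in the experiments below.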
Budget Enforcement (Polling) (figure): a new job enters; the Resource Manager suspends a job performing better than required and resumes it when its performance goes down.
Experiments • A high-level simulation model was developed in POOSL, a parallel simulation language • A protocol for communication was defined • The system was verified with a number of application SDF models • A case study was done with H.263 and JPEG application models • The impact of varying the ‘polling’ interval was studied