
Architectural Support for Fine-Grained Parallelism on Multi-core Architectures

Sanjeev Kumar, Christopher J. Hughes, and Anthony Nguyen, Corporate Technology Group, Intel Corporation. Presented by Duygu AKMAN.


Presentation Transcript


  1. Architectural Support for Fine-Grained Parallelism on Multi-core Architectures • Sanjeev Kumar, Corporate Technology Group, Intel Corporation • Christopher J. Hughes, Corporate Technology Group, Intel Corporation • Anthony Nguyen, Corporate Technology Group, Intel Corporation • By Duygu AKMAN

  2. Keywords • Multi-core Architectures • RMS applications • Large-Grain Parallelism / Fine-Grain Parallelism • Architectural Support

  3. Multi-core Architectures • MCAs offer higher performance than uniprocessor systems • They reduce communication latency and increase bandwidth between cores. • Applications need thread-level parallelism to benefit.

  4. Thread Level Parallelism • One common approach is partitioning a program into parallel tasks and letting software schedule the tasks onto different threads. • This is useful only if tasks are large enough that the software overhead is negligible (e.g., scientific applications). • RMS (Recognition, Mining, and Synthesis) applications mostly have small tasks.
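A minimal sketch of this software approach, assuming a single shared queue protected by a lock; the names (`TaskQueue`, `worker`) are illustrative and not from the paper. The per-task locking and queue manipulation is exactly the overhead that becomes significant when tasks are small.

```cpp
// Sketch of software dynamic task scheduling: a shared, lock-protected queue
// of tasks drained by one worker thread per core. Names are illustrative.
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using Task = std::function<void()>;

class TaskQueue {
public:
    void enqueue(Task t) {
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push_back(std::move(t));
    }
    // Returns false when no work is left.
    bool dequeue(Task& t) {
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return false;
        t = std::move(tasks_.back());   // LIFO within the queue
        tasks_.pop_back();
        return true;
    }
private:
    std::mutex m_;
    std::deque<Task> tasks_;
};

void worker(TaskQueue& q) {
    Task t;
    // Every dequeue pays lock + queue overhead; when a task's work is only a
    // few hundred cycles, this overhead is a large fraction of execution time.
    while (q.dequeue(t)) t();
}

int main() {
    TaskQueue q;
    for (int i = 0; i < 32; ++i)
        q.enqueue([i] { (void)i; /* small unit of work on element i */ });
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;
    std::vector<std::thread> threads;
    for (unsigned c = 0; c < n; ++c) threads.emplace_back(worker, std::ref(q));
    for (auto& th : threads) th.join();
}
```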

  5. Reasons for Fine-Grain Parallelism • MCAs can now be found even in home machines, which also indicates that fine-grained parallelism is necessary. • Some applications need good performance on different platforms with varying numbers of cores -> fine-grain tasks • In multiprogramming, the number of cores assigned to an application can change during execution. Need to maximize available parallelism -> fine-grain parallelism

  6. Example (8-core MCA)

  7. Example (8-core MCA) • Two cases: • Application partitioned into 8 equal-sized tasks • Application partitioned into 32 equal-sized tasks • In a parallel section, when a core finishes its tasks, it waits for the other cores -> waste of resources

  8. Example (8-core MCA) • With 4 or 8 cores assigned to the application, all cores are fully utilized. • With 6 cores, the first case wastes more resources (same performance as with 4 cores) than the second case; the second case is finer-grained.

  9. The problem is even worse with a larger number of cores • With only 64 tasks, there is no performance improvement between 32 and 63 cores! • Need more tasks -> fine-grain parallelism
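The utilization argument in this example follows from the span of a parallel section: with T equal-sized tasks on C cores and perfect load balancing, the section takes ceil(T / C) task-times. A small sketch of that arithmetic (the helper name `span` is illustrative):

```cpp
// Span of a parallel section with T equal-sized tasks on C cores:
// ceil(T / C) task-times. Reproduces the numbers used in the example.
#include <cstdio>

int span(int tasks, int cores) { return (tasks + cores - 1) / cores; }

int main() {
    // 8 tasks: 6 cores give the same span as 4 cores (2 task-times each).
    std::printf("8 tasks:  4 cores -> %d, 6 cores -> %d\n", span(8, 4), span(8, 6));
    // 32 tasks: 6 cores now help (6 task-times instead of 8).
    std::printf("32 tasks: 4 cores -> %d, 6 cores -> %d\n", span(32, 4), span(32, 6));
    // 64 tasks: no improvement from 33 to 63 cores; only 64 cores help.
    std::printf("64 tasks: 32 cores -> %d, 63 cores -> %d, 64 cores -> %d\n",
                span(64, 32), span(64, 63), span(64, 64));
}
```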

  10. Contribution • Propose a hardware technique to accelerate dynamic task scheduling on MCAs. • Hardware queues that cache tasks and implement scheduling policies • Task prefetchers on each core to hide the latency of accessing the queues.

  11. Workloads • Parallelized and analyzed RMS applications from areas including • Simulation for computer games • Financial analytics • Image processing • Computer vision, etc. • Some modules of these applications have large-grained parallelism -> insensitive to task queuing overhead • But a significant number of modules have to be parallelized at a fine granularity to achieve better performance

  12. Architectural Support For Fine-Grained Parallelism • There is overhead when task queuing is handled by software • If tasks are small, this overhead can be a significant fraction of total execution time. • The contribution is adding hardware to MCAs for accelerating task queues. • Provides very fast access to the storage for tasks • Performs fast task scheduling

  13. Proposed Hardware • An MCA chip where the cores are connected to a cache hierarchy by an on-die network. • Two separate hardware components: • a Local Task Unit (LTU) per core • a single Global Task Unit (GTU)

  14. Proposed Hardware

  15. Global Task Unit • The GTU contains the logic implementing the scheduling algorithm • The GTU holds enqueued tasks in hardware queues; there is a hardware queue for each core • Since the queues are physically close to each other, scheduling is fast • The GTU is physically centralized, and it is connected to the cores via the same on-die interconnect as the caches.

  16. Global Task Unit • The disadvantage of the GTU is that as the number of cores increases, the average communication latency between a core and the GTU also increases. • This latency is hidden by the prefetchers in the LTUs.
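A minimal software model of the GTU structure described above, assuming one queue per core, LIFO dequeue from a core's own queue, and taking the oldest task from another core's queue when its own is empty. That load-balancing policy and all type names are assumptions for illustration; the slides only state that the GTU implements the scheduling algorithm and holds one hardware queue per core.

```cpp
// Software model of the GTU: one task queue per core plus centralized
// scheduling logic. The "take from another queue when empty" policy is an
// assumption for illustration; the slides do not give the exact policy.
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

using Task = std::uint64_t;  // stand-in for a task descriptor

class GlobalTaskUnit {
public:
    explicit GlobalTaskUnit(int cores) : queues_(cores) {}

    void enqueue(int core, Task t) { queues_[core].push_back(t); }

    std::optional<Task> dequeue(int core) {
        // Prefer the core's own queue, newest task first (LIFO).
        if (!queues_[core].empty()) {
            Task t = queues_[core].back();
            queues_[core].pop_back();
            return t;
        }
        // Otherwise take the oldest task from some other core's queue.
        for (auto& q : queues_) {
            if (!q.empty()) { Task t = q.front(); q.pop_front(); return t; }
        }
        return std::nullopt;  // the parallel section is out of work
    }

private:
    std::vector<std::deque<Task>> queues_;  // one queue per core
};
```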

  17. Local Task Unit • Each core has a small piece of hardware that communicates with the GTU. • If a core waited to contact the GTU until the thread running on it finished its current task, the thread would have to stall for the GTU access latency. • The LTU therefore also has a task prefetcher and a small buffer to hide the latency of accessing the GTU.

  18. Local Task Unit • On a dequeue, if there is a task in the LTU's buffer, that task is returned to the thread, and a prefetch for the next available task is sent to the GTU. • On an enqueue, the task is placed in the LTU's buffer. Since the proposed hardware uses a LIFO ordering of tasks for a given thread, if the buffer is already full, the oldest task in the buffer is sent to the GTU.
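A sketch of the LTU behavior on these two operations, modeled in software. The buffer size, the synchronous "prefetch" call, and the template over the GTU type (e.g. the GlobalTaskUnit model from the previous sketch) are assumptions made for illustration; in the proposal these are hardware structures and messages.

```cpp
// Software model of a Local Task Unit: a small per-core buffer hides the GTU
// access latency. Buffer size and the direct "prefetch" call are assumptions;
// GTU is any type providing enqueue(core, Task) and dequeue(core).
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

using Task = std::uint64_t;

template <typename GTU>
class LocalTaskUnit {
public:
    LocalTaskUnit(GTU& gtu, int core) : gtu_(gtu), core_(core) {}

    // Dequeue: return a buffered task if there is one and prefetch the next
    // available task from the GTU; otherwise fall back to the GTU directly.
    std::optional<Task> dequeue() {
        if (!buffer_.empty()) {
            Task t = buffer_.back();                 // LIFO within a thread
            buffer_.pop_back();
            if (auto next = gtu_.dequeue(core_)) buffer_.push_back(*next);  // "prefetch"
            return t;
        }
        return gtu_.dequeue(core_);                  // buffer empty: pay GTU latency
    }

    // Enqueue: keep the task locally; if the buffer is already full, spill the
    // oldest buffered task to the GTU so the newest tasks stay near the core.
    void enqueue(Task t) {
        if (buffer_.size() == kBufferSize) {
            gtu_.enqueue(core_, buffer_.front());
            buffer_.pop_front();
        }
        buffer_.push_back(t);
    }

private:
    static constexpr std::size_t kBufferSize = 2;    // assumed size, not from the slides
    GTU& gtu_;
    int core_;
    std::deque<Task> buffer_;
};
```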

  19. Experimental Evaluation • Benchmarks are from the RMS application domain • RMS = Recognition, Mining and Synthesis • Wide range of different areas • All benchmarks are parallelized

  20. The loop-level benchmarks are straightforward to parallelize: each parallel loop simply specifies a range of indices and the granularity of tasks
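A sketch of how such a parallel loop can be expressed over a task queue, using a hypothetical `parallel_for` helper (not an API from the paper): the programmer supplies only the index range, the grain size, and the loop body.

```cpp
// Loop-level parallelism: a parallel loop is an index range chopped into
// fixed-granularity tasks. `parallel_for` and the queue type are illustrative
// (e.g. the TaskQueue sketch from earlier), not an API from the paper.
#include <algorithm>

template <typename Queue, typename Body>
void parallel_for(Queue& q, int begin, int end, int grain, Body body) {
    for (int lo = begin; lo < end; lo += grain) {
        int hi = std::min(lo + grain, end);
        // Each enqueued task covers the index sub-range [lo, hi).
        q.enqueue([=] { for (int i = lo; i < hi; ++i) body(i); });
    }
}
```

For instance, 1,024 iterations with a grain of 32 produce 32 tasks, like the finer-grained partitioning in the earlier example.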

  21. Task-level parallelism is more general than loop-level parallelism: each parallel section starts with a set of initial tasks, and any task may enqueue other tasks.
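A sketch of this more general pattern, again over a hypothetical task queue: a range task splits itself and enqueues one half as a new task until it reaches the grain size, so new work appears while the parallel section runs.

```cpp
// Task-level parallelism: any task may enqueue further tasks. A range task
// recursively splits itself, enqueuing one half and continuing with the other,
// until the range is no larger than the grain size. The queue type is the
// earlier illustrative sketch, not the paper's API.
template <typename Queue, typename Body>
void range_task(Queue& q, int lo, int hi, int grain, Body body) {
    if (hi - lo <= grain) {
        for (int i = lo; i < hi; ++i) body(i);   // small enough: do the work now
        return;
    }
    int mid = lo + (hi - lo) / 2;
    q.enqueue([=, &q] { range_task(q, mid, hi, grain, body); });  // new task appears
    range_task(q, lo, mid, grain, body);                          // keep working locally
}
```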

  22. Benchmarks • In some of these benchmarks, the task size is small, so the task queue overhead must be small to effectively exploit the available parallelism. • In some, parallelism is limited. • In some, task sizes are highly variable, so very efficient task management is needed for good load balancing.

  23. Results • Results show the performance benefit of the proposed hardware for the loop-level and task-level benchmarks when running with 64 cores. • The hardware proposal is compared with • the best optimized software implementations • an idealized implementation (Ideal), in which tasks bypass the LTUs and are sent directly to/from the GTU with zero interconnect latency, and the GTU processes these tasks instantly without any latency.

  24. Results

  25. Results

  26. Results • The graphs represent the speedup over one-thread execution using the Ideal implementation. • For each benchmark there are multiple bars; each bar corresponds to a different data set shown in the benchmark tables

  27. Results • For the loop-level benchmarks, the proposed hardware executes 88% faster on average than the optimized software implementation and only 3% slower than Ideal. • For the task-level benchmarks, on average the proposed hardware is 98% faster compared to the best software version and is within 2.7% of Ideal.

  28. Conclusion • In order to benefit from the growing compute resources of MCAs, applications must expose their thread-level parallelism to hardware. • Previous work has proposed software implementations of dynamic task schedulers. But applications with small tasks, such as RMS applications, achieve poor parallel speedups using software dynamic task scheduling, because the overhead of the scheduler is large relative to the task size.

  29. Conclusion • To enable good parallel scaling even for applications with very small tasks, a hardware scheme is proposed. • It consists of relatively simple hardware and is tolerant of growing on-die latencies; therefore, it is a good solution for scalable MCAs. • When the proposed hardware, the optimized software task schedulers, and an idealized hardware task scheduler are compared, we see that, for the RMS benchmarks, the hardware gives large performance benefits over the software schedulers and comes very close to the idealized hardware scheduler.
