
Energy And Power Characterization of Parallel Programs Running on the Intel Xeon Phi


Presentation Transcript


  1. Energy And Power Characterization of Parallel Programs Running on the Intel Xeon Phi Joal Wood, Ziliang Zong, Qijun Gu, Rong Ge Email: {jw1772, ziliang, qijun}@txstate.edu, rong.ge@marquette.edu

  2. The Xeon Phi coprocessor Equipped with 60 x86-based cores, each capable of running 4 threads simultaneously. Designed for high computation density. Used in both Tianhe-2 and Stampede supercomputers.

  3. Overview of our work • We profile the power and energy of multiple algorithms with contrasting workloads. • We concentrate on the performance and energy impact of increasing the number of threads, of running code in native versus offloaded mode, and of co-running selected algorithms on the Xeon Phi. • We describe how to correctly profile the instantaneous power of the Xeon Phi using its built-in power sensors.

  4. Xeon Phi Power Data • Power data is collected using the MICAccessAPI - a C/C++ library that allows users to monitor and configure several metrics (including power) of the coprocessor. • The power results that we present are measured and recorded by issuing the MicGetPowerUsage() call to the MICAccessAPI during execution of each experiment.

  5. Selected algorithms • Barnes-Hut simulation – O(n log n) n-body approximation. • Shellsort – comparison-based exchange/insertion sort. • SSSP – Single Source Shortest Path (Dijkstra’s algorithm) graph search. • Fibonacci – calculates 45 Fibonacci numbers.

  6. Power tracing • Graphing the instantaneous power of these algorithms allows us to confirm much of what can be inferred about the performance and energy from the implementation. • It can help us identify features of different applications that aren’t otherwise obvious, and facilitate new findings.

  7. Barnes-Hut • Designed to solve the n-body simulation problem by approximating the forces acting on each body. • Uses an octree data structure to achieve a time complexity of O(n log n). • Memory access and control flow patterns are irregular, since different parts of the octree must be traversed to compute the forces on each body. • Balanced workload, as each thread performs the same amount of force calculation per iteration.

  8. Shellsort • Comparison-based in-place sorting algorithm. • Starts by sorting elements far apart from each other, progressively reducing the gap between them. • The workload gradually shrinks because fewer swaps occur as the data set becomes nearly sorted.

  9. SSSP • Returns the distance between two chosen nodes of the input graph. • The amount of available parallelism changes throughout execution. • Unbalanced workload, as each thread is given a different number of neighbor nodes for which to compute distances.

  10. Fibonacci • Calculates 45 Fibonacci sequence numbers. • Each sequence position is assigned to a thread, which calculates the corresponding number. • Highly unbalanced workload, as the threads assigned to the largest Fibonacci numbers (positions 45 and 46) require much more work. • Changing the OMP_WAIT_POLICY environment variable had no observable influence on the power trace of Fibonacci.

  11. Correctly Plotting the Instantaneous Power Data • Incorrect power trace – x-axis incrementing by sample number. • Correct power trace – x-axis as timestamp.

  12. Native vs. Offloaded execution • The Xeon Phi offers native and offloaded execution modes. During native execution, the program runs entirely on the coprocessor. • Building a native application is a fast way to get existing software running with minimal code changes. • Offloaded mode is a heterogeneous programming model where developers can designate specific code sections to run on the Xeon Phi. • For our experiments, we offload the entire execution onto the Xeon Phi.

  13. Offloaded SSSP The energy consumption in offloaded mode is slightly higher than in native mode at every thread count, because its performance is consistently slightly worse. However, the performance and energy deficit shrinks as more threads are used. Intuitively, offloading to the Xeon Phi with a high thread count (120, 240) implies energy savings, assuming the host CPU is utilized.

  14. Offloaded SHELLSORT Offloaded shellsort shows a much larger performance and energy deficit relative to its native version. Native shellsort consistently runs 3-4X faster than the offloaded version. These results show a substantial performance and energy benefit to running such codes in native mode. Generally speaking, codes that do not perform extensive I/O operations and require only a modest memory footprint should be executed in native mode.

  15. Co-Running programs • The Xeon Phi contains 60 physical cores and is capable of high computation density. We explore the viability of co-running complementary workloads on it. • We chose the Fibonacci calculation code as the ideal co-runner, as it performs best at lower thread counts. • We are mostly interested in whether co-running these codes incurs significant performance and energy losses.

  16. Barnes-Hut & Fibonacci co-run These codes co-run well because Barnes-Hut is a very balanced workload that benefits from using more threads, while Fibonacci actually declines in performance when a large number of threads is used. This allows us to give as many threads as possible to Barnes-Hut while leaving a small thread pool to execute Fibonacci.

  17. SSSP & Fibonacci co-run Fibonacci is an example of a workload that co-runs well when paired with programs that have a high degree of parallelism. SSSP is also a good candidate to co-run with Fibonacci, yielding results similar to those of co-running Barnes-Hut. Assuming memory contention is low, each of these co-running programs returns with little performance cost.

  18. Conclusions • The power trace generated from the built-in power sensors of the Xeon Phi can accurately capture run-time program behavior. • Running code in native mode yields better performance and consumes less energy than offloaded mode. • Co-running programs with complementary workloads has the potential to conserve energy with negligible performance degradation.

  19. Future work • Investigate the heterogeneous power and energy implications of offloading work to the Xeon Phi from the host CPU. (Currently, we look exclusively at data from the Xeon Phi.) • Compare the performance and energy of these algorithms with corresponding CPU and GPU implementations.

  20. Acknowledgement • The work reported in this paper is supported by the U.S. National Science Foundation under Grants CNS-1305359, CNS-1305382, CNS-1212535, and a grant from the Texas State University Research Enhancement Program.
