
Extending PAPI to Multiple Measurement Domains

Jack Dongarra, Kevin London, Shirley Moore, Philip Mucci, Daniel Terpstra, and Haihang You. University of Tennessee and Oak Ridge National Laboratory.


Presentation Transcript


  1. Extending PAPI to Multiple Measurement Domains. Jack Dongarra, Kevin London, Shirley Moore, Philip Mucci, Daniel Terpstra, and Haihang You. University of Tennessee and Oak Ridge National Laboratory.

  2. Motivation
     • Increasing CPU speeds and densities place greater importance on:
       • Thermal health and management
       • Power consumption
     • Higher processor counts make communications metrics more critical:
       • Bandwidth
       • Latency
       • Dropped packets
       • Bytes transferred
     • Industry-standard interfaces don’t exist to measure these metrics.
     • Hybrid machines require simultaneous access to multiple processor counter substrates.

  3. PAPI 3.0 Design (layer diagram, top to bottom)
     • Portable Layer (Hardware Independent Layer): PAPI High Level, PAPI Low Level
     • Machine Specific Layer: PAPI Machine Dependent Substrate, Kernel Extension, Operating System, Hardware Performance Counters

  4. PAPI 4.0 Multiple Substrate Design (layer diagram, top to bottom)
     • Portable Layer (Hardware Independent Layer): PAPI High Level, PAPI Low Level
     • Machine Specific Layer, with one stack per substrate:
       • Substrates: PAPI CPU Dependent Substrate, PAPI Machine Dependent Substrate, PAPI Network Dependent Substrate
       • Each stack sits on a Kernel Extension and the Operating System
       • Counter sources: Hardware Performance Counters, Off-Processor Hardware Counters

  5. Multiple Measurements
     • HPCC HPL benchmark on Opteron with 3 performance metrics:
       • FLOPS, Temperature, Network Sends/Receives
     • Temperature is from an on-chip thermal diode

  6. Multiple Measurements
     • HPCC HPL benchmark on Opteron with 3 performance metrics:
       • FLOPS, Temperature, Network Sends/Receives
     • Temperature is from an on-chip thermal diode

  7. For More Information
     • http://icl.cs.utk.edu/papi/
       • Software and documentation
       • Reference materials
       • Papers and presentations
       • Third-party tools
       • Mailing lists
     • Team members:
       • Jack Dongarra, Kevin London, Shirley Moore, Philip Mucci, Daniel Terpstra, Haihang You

  8. Correlating Temperature and PAPI Events
     • Can multi-substrate PAPI be used to correlate temperature with PAPI presets?
     • Measure temperature and all 42 PAPI presets on an Opteron cluster across the HPCC suite.
     • Statistically examine the results for correlations using cluster analysis and principal component analysis.

  9. Dendrogram of temperature and PAPI events
     • Cluster analysis shows 8 PAPI preset events with similar behavior to the temperature.
     • Half are L2 cache related.
     • Also:
       • Resource stalls
       • Hardware interrupts
       • TLB misses
       • Total cycles
     • Dendrogram leaf labels: ACPI_TEMP, PAPI_TLB_TL, PAPI_TOT_CYC, PAPI_HW_INT, PAPI_RES_STL, PAPI_L2_TCM, PAPI_L2_STM, PAPI_L2_DCM, PAPI_L2_DCR
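One simple way to look for events that track temperature is to rank them by their correlation with the temperature series. The sketch below is a simplified stand-in for the cluster analysis on the slide, not the method the authors used: it computes a Pearson coefficient per event and keeps those above a cutoff. The event names (`EVENT_A`, `EVENT_B`, `EVENT_NOISE`), the cutoff of 0.8, and all sample values are synthetic assumptions made up for illustration, not measured data.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Synthetic data: a hidden "load" drives temperature and two of the events.
rng = random.Random(1)
load = [rng.random() for _ in range(300)]
temperature = [40 + 20 * l + rng.gauss(0, 1) for l in load]
samples = {
    "EVENT_A": [1000 * l + rng.gauss(0, 50) for l in load],          # rises with load
    "EVENT_B": [6000 - 5000 * l + rng.gauss(0, 300) for l in load],  # falls with load
    "EVENT_NOISE": [rng.gauss(10, 2) for _ in load],                 # unrelated by construction
}

CUTOFF = 0.8  # assumed threshold, chosen only for this illustration
r = {name: pearson(temperature, series) for name, series in samples.items()}
related = sorted(name for name, value in r.items() if abs(value) >= CUTOFF)

for name in sorted(r):
    print(f"{name}: r = {r[name]:+.3f}")
print("tracks temperature:", related)
```

Taking the absolute value keeps inversely proportional events in the result, which matters here because the later slides group events as both proportional and inversely proportional to temperature.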

  10. Normalized Graph of Clustered Events

  11. Principal Component Analysis
      • Simplifies a dataset by transforming it to a new coordinate system.
      • The first principal component captures the greatest variance.
      • In this example, the first two components contain the bulk of the temperature variance.
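To make the idea concrete, here is a minimal sketch of finding the first principal component with the Python standard library alone. It builds the covariance matrix of a small dataset and uses power iteration to find the direction of greatest variance. The three-column dataset is synthetic and the helper names (`covariance_matrix`, `first_component`) are hypothetical; nothing here reproduces the authors' analysis or data.

```python
import random

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length observations."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]

def first_component(cov, steps=200):
    """Power iteration: returns (largest eigenvalue, unit eigenvector)."""
    dims = len(cov)
    v = [1.0] * dims
    for _ in range(steps):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
    eigenvalue = sum(a * b for a, b in zip(v, cv))
    return eigenvalue, v

# Synthetic observations driven by one hidden "load" variable.
rng = random.Random(0)
rows = []
for _ in range(200):
    load = rng.random()
    rows.append([
        load + rng.gauss(0, 0.05),      # column 0: temperature-like metric
        2 * load + rng.gauss(0, 0.05),  # column 1: metric that rises with load
        -load + rng.gauss(0, 0.05),     # column 2: metric that falls with load
    ])

cov = covariance_matrix(rows)
eigenvalue, direction = first_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

print(f"explained variance ratio: {explained:.3f}")
print("loadings:", [round(x, 3) for x in direction])
```

Loadings with the same sign as the temperature-like column correspond to "proportional" metrics, and loadings with the opposite sign to "inversely proportional" ones, which is how the groupings on the following slides can be read.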

  12. First Principal Component
      • Inversely Proportional:
        • PAPI_TLB_TL
        • PAPI_L2_STM
        • PAPI_RES_STL
      • Inversely Proportional:
        • PAPI_TLB_DM
        • PAPI_L2_STM
        • PAPI_FPU_IDL
      • Proportional:
        • PAPI_L1_TCA
        • PAPI_L1_TCH
        • PAPI_L1_ICR
        • PAPI_L1_ICA
        • PAPI_L1_DCH
        • PAPI_FML_INS
        • PAPI_L1_DCA
        • PAPI_FAD_INS
        • PAPI_FP_OPS
        • PAPI_L1_ICH
      • Proportional:
        • ACPI_THERM
        • PAPI_TOT_INS
        • PAPI_FP_INS

  13. First vs. Second Principal Component
      • Plot region labels: Proportional, Inversely Proportional
      • Events shown: PAPI_L1_ICH, PAPI_L1_ICR, PAPI_L1_DCH, PAPI_L1_DCA, PAPI_TOT_INS, PAPI_VEC_INS, PAPI_FML_INS, PAPI_FP_INS, PAPI_FAD_INS, PAPI_FPU_IDL, PAPI_TLB_DM, PAPI_TLB_TL, PAPI_HW_INT, PAPI_RES_STL, PAPI_L2_TCM, PAPI_L2_DCM, PAPI_L1_TCM, PAPI_L1_DCM

  14. Temperature Correlation
      • Multi-substrate PAPI made it easy to collect the data needed to analyze and reduce the number of performance metrics required.
      • Found approximately 10 events that are either directly or inversely proportional to temperature.
      • Redundancy suggests using as few as 4-5 events to estimate temperature.
      • Potential for automated search for relevant performance metrics on new hardware.

  15. PAPI 4.0 Status
      • Multi-substrate development complete
      • Some CPU platforms not yet ported
      • Substrates available for:
        • ACPI (Advanced Configuration and Power Interface)
        • Myrinet MX
      • Substrates under development for:
        • Infiniband
        • GigE
      • Friendly User release available now for CVS checkout
      • Release target: Q3, 2006
      Acknowledgement: This work was supported by the U.S. Department of Energy Los Alamos Computer Science Institute under subcontract R7A827-79200 through Rice University.

  16. PAPI 4.0
      • Multi-substrate work complete
      • Substrates available for:
        • ACPI (Advanced Configuration and Power Interface)
        • Myrinet MX
      • Substrates under development for:
        • Infiniband
        • GigE
      • Friendly User release available now for CVS checkout
      • PAPI 4.0 Beta release expected Q3, 2006

  17. Support Slide: Setting up the counters
      • Test is run on a 1.4 GHz AMD Opteron
        • Supports 42 PAPI preset events
        • 4 hardware counters
      • HPCC calls a function to set up PAPI events and uses a timer
      • Run on 1 processor, interested in temperature of 1 processor
      • Multiway multiplexing
        • Need 11 eventsets to monitor all events
        • Each eventset gets a 20 ms timeslice
        • Randomized order of eventsets
        • After 5 iterations log results
      • Resulted in 1631 logged results of 43 different performance metrics (42 PAPI presets & 1 temperature)
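The eventset count follows from the numbers on the slide: 42 presets split across 4 counters need 11 eventsets, with the last one holding only 2 events. The sketch below works through that arithmetic and builds one randomized pass. It does not call PAPI; the event names are placeholders, and the timing figures assume that one "iteration" means one full pass over all eventsets, which the slide does not state explicitly.

```python
import random

NUM_PRESETS = 42        # PAPI preset events on the test machine (from the slide)
NUM_COUNTERS = 4        # hardware counters usable at once (from the slide)
TIMESLICE_MS = 20       # timeslice per eventset (from the slide)
ITERATIONS_PER_LOG = 5  # iterations before logging (from the slide)

def build_eventsets(events, counters):
    """Split the event list into chunks of at most `counters` events."""
    return [events[i:i + counters] for i in range(0, len(events), counters)]

def randomized_pass(eventsets, rng):
    """Return the eventsets in a fresh random order for one multiplexing pass."""
    order = list(eventsets)
    rng.shuffle(order)
    return order

# Placeholder names; a real run would use the platform's PAPI preset names.
events = [f"PRESET_{i:02d}" for i in range(NUM_PRESETS)]
eventsets = build_eventsets(events, NUM_COUNTERS)

# Assumption: one iteration is one pass over every eventset.
pass_ms = len(eventsets) * TIMESLICE_MS
log_interval_ms = pass_ms * ITERATIONS_PER_LOG

rng = random.Random(0)
schedule = randomized_pass(eventsets, rng)

print("eventsets needed:", len(eventsets))
print("time per pass (ms):", pass_ms)
print("time per logged result (ms):", log_interval_ms)
print("first eventset in this pass:", schedule[0])
```

Shuffling a copy of the eventset list on every pass keeps any one group of events from always being sampled at the same phase of the benchmark.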
