
CS295: Modern Systems Lab 1 Review

This review compares several SIMD implementations of matrix multiplication: a baseline using a naïve transpose and fused multiply-add (FMA) operations, a tiled variant, and a multithreaded variant. Results are analyzed against the specifications of two machines, with performance normalized against the naïve implementation.


Presentation Transcript


  1. CS295: Modern Systems • Lab 1 Review • Sang-Woo Jun • Spring 2019

  2. Baseline SIMD Implementation • Naïve transpose + Fused Multiply-Add (FMA) • [Diagram: A × Bᵀ = C, with labels "FMA" and "Non-SIMD Add"]

  3. Baseline SIMD Implementation • Naïve transpose + Fused Multiply-Add

  4. Baseline SIMD Implementation • Naïve transpose + Fused Multiply-Add
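
The loop structure behind the baseline slides can be sketched roughly as follows. This is an illustration only, not the lab's actual code: it is plain Python that emulates an 8-lane vector register with a list, and it assumes square matrices whose size is a multiple of 8. The names `LANES`, `transpose`, and `matmul_baseline` are hypothetical. Reading "Non-SIMD Add" as the final summation of the lanes into one scalar is also an assumption.

```python
LANES = 8  # assumed SIMD width: eight 32-bit floats in one 256-bit register


def transpose(b):
    """Naive transpose: row j of the result is column j of b."""
    n = len(b)
    return [[b[k][j] for k in range(n)] for j in range(n)]


def matmul_baseline(a, b):
    """Multiply square matrices a and b using transposed b and lane-wise accumulation."""
    n = len(a)
    assert n % LANES == 0, "sketch assumes n is a multiple of the lane count"
    bt = transpose(b)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = [0.0] * LANES  # emulated vector accumulator
            for k in range(0, n, LANES):
                # One emulated FMA: acc += a[i][k:k+8] * bt[j][k:k+8], lane by lane.
                for lane in range(LANES):
                    acc[lane] += a[i][k + lane] * bt[j][k + lane]
            # Non-SIMD add (assumed meaning): reduce the lanes to a single value.
            c[i][j] = sum(acc)
    return c
```

Because B is transposed first, both operands of each emulated FMA are read from contiguous row elements, which is what makes wide vector loads possible in a real SIMD version.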

  5. Baseline Tiled SIMD Implementation • Naïve transpose + temporary C of size 8N to delay non-SIMD addition • Fixed tile size (64 elements) • Not optimized for cache size or for the existence of L2+ cache • [Diagram: A × Bᵀ = CT, with labels "FMA" and "Non-SIMD Add"]

  6. Baseline Tiled SIMD Implementation
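
One possible reading of the tiled variant is sketched below, again as plain Python with an emulated 8-lane register rather than real intrinsics. The assumptions are: the tile covers 64 consecutive elements along the shared dimension, the temporary C holds 8 partial sums for each of the N outputs in a row (hence 8N), and the lane reduction is delayed until every tile has been processed. The names `matmul_tiled`, `TILE`, and `LANES` are hypothetical, and the lab's real tiling scheme may differ.

```python
LANES = 8   # assumed SIMD width (eight 32-bit floats)
TILE = 64   # fixed tile size from the slide, taken here as elements along k


def matmul_tiled(a, b, tile=TILE):
    """Tiled multiply with an 8N temporary so lane reduction happens once per output."""
    n = len(a)
    assert tile % LANES == 0 and n % tile == 0, "sketch assumes sizes divide evenly"
    bt = [list(col) for col in zip(*b)]  # naive transpose of b
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Temporary C for row i: LANES partial sums per output element (8N values).
        tmp = [[0.0] * LANES for _ in range(n)]
        for k0 in range(0, n, tile):
            # The same tile of row i of a is reused against every row of bt.
            for j in range(n):
                acc = tmp[j]
                for k in range(k0, k0 + tile, LANES):
                    for lane in range(LANES):  # one emulated FMA
                        acc[lane] += a[i][k + lane] * bt[j][k + lane]
        # Delayed non-SIMD add: reduce each group of lanes once, after all tiles.
        c[i] = [sum(acc) for acc in tmp]
    return c
```

For quick experiments a smaller tile can be passed, for example `matmul_tiled(a, b, tile=8)` on 16 × 16 inputs.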

  7. Multithreaded • Naïve, single-threaded transpose + temporary C of size 8N to delay non-SIMD addition • Round-robin row block thread assignment • [Diagram, for 2 threads: A × Bᵀ = C, with successive row blocks labeled Thread 0, Thread 1, Thread 0, …]
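
Round-robin row block assignment can be illustrated with a small helper. This is a sketch under assumptions: the function name `round_robin_blocks` and its parameters are hypothetical, and the block height is left as a parameter because the slide does not state it.

```python
def round_robin_blocks(n_rows, block_rows, n_threads):
    """Return, per thread, the list of (start, end) row ranges it would process."""
    n_blocks = (n_rows + block_rows - 1) // block_rows  # ceiling division
    assignment = [[] for _ in range(n_threads)]
    for blk in range(n_blocks):
        start = blk * block_rows
        end = min(start + block_rows, n_rows)  # last block may be shorter
        # Block 0 goes to thread 0, block 1 to thread 1, and so on, wrapping around.
        assignment[blk % n_threads].append((start, end))
    return assignment
```

With 24 rows, blocks of 8 rows, and 2 threads, thread 0 receives rows 0 to 7 and 16 to 23 while thread 1 receives rows 8 to 15, which mirrors the Thread 0, Thread 1, Thread 0 pattern on the slide.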

  8. Machine Specs • Machine 1 • Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz • 6 Cores, 12 Threads • 32 GB DRAM • 4 DIMMs DDR4 • Machine 2 • Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz • 2 Cores, 4 Threads • 8 GB DRAM • 1 DIMM DDR4

  9. Results on Machine 1 • Performance Normalized Against Naïve

  10. Results on Machine 2 • Performance Normalized Against Naïve • What happened here?

  11. Two Different Ways to Do Blocking • [Diagram: A × Bᵀ = C drawn for Option 1 and for Option 2, with annotations "N*N/Threads… Doesn’t fit in cache!" and "N/Threads"]
