
Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures



Presentation Transcript


  1. Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures Michela Becchi and Patrick Crowley Applied Research Lab Washington University in St. Louis

  2. Context • Chip Multiprocessor (CMP): several processors on the same chip • Support high degree of Thread Level Parallelism • Overcome limitations of wide-issue superscalar uniprocessor systems • Applications w/ limited Instruction Level Parallelism • Complexity, area occupancy and manufacturing costs • Heterogeneous CMP • Coexistence of differing cores and caches

  3. Motivations • Area-complexity tradeoff in CMP • Many “simple” processors/caches • High thread level parallelism (TLP) • Few “sophisticated” processors/caches • High instruction level parallelism (ILP) • Multi-programmed computing environment • Computing needs vary • Across threads • Over time

  4. Problem • Can Heterogeneous CMP be better? • Under which conditions/workload? • How to exploit hardware heterogeneity?

  5. Goal • Heterogeneous CMP more flexible than homogeneous CMP • Thread diversity • Applications w/ “multiphase” behavior • Varying degree of thread level parallelism • Dynamic core assignment and thread migration

  6. Approach • Simulation • Mix of event based and trace based simulation • Analysis of heterogeneous set of benchmarks on two different processors (Alpha 21164 and Alpha 21264) • Simulation of CMP configurations: • Homogeneous vs. heterogeneous • Static vs. two dynamic assignment policies

  7. Hardware setup • Same Instruction Set Architecture • Mono-threaded • Unified L2 cache (4MB/4-way/128B blocks) • Main memory – L2: 2GB/s bus • 2.1 GHz clock

  8. Workload definition • 11 programs from SPEC2000, ref input set • INT: gzip, gcc, crafty, parser, bzip2 • FP: wupwise, swim, mgrid, galgel, equake, lucas • # of running threads: from 1 to 40 • Data points: average across 100 simulations on random workload selections

  9. Benchmarks behavior: EV6 vs. EV5 • M5 uni-processor simulations • 2.5 B instructions executed • Relative statistics on windows of 1M clock cycles • Results: • IPC (instruction per clock cycle): EV5 from 0.35 to 1.15, EV6 from 0.45 to 1.8 • Branch predictor accuracy: no remarkable variation • L1 cache misses: varying impact across programs and not directly correlated to IPC

  10. IPC ratios: EV6 vs. EV5 [chart: EV6/EV5 IPC ratios over execution; benchmarks shown include gzip, crafty, and lucas]

  11. CMP systems • Core configurations (100mm2 area) • homogeneous: 4 EV6, 20 EV5 • heterogeneous: 1 EV6 & 15 EV5, 2 EV6 & 10 EV5, 3 EV6 & 5 EV5 • Assignment policies: • Random and pseudo-best static • Round-Robin • IPC driven • Thread migration modeled as a context switch • Upper bound on # of clock cycles required to transfer architectural state and refill caches • Comparison metric: speedup with respect to a single EV6

  12. Homogeneous Configurations [charts: speedup under low vs. high thread parallelism]

  13. Static assignment • Tasks statically assigned to cores • 2 flavors: • Random • Pseudo-best • Assumes runtime characteristics of tasks known in advance • Used as a term of comparison • Heuristic: • Sort tasks by their IPC on the two core types • Assign to each EV6 twice as many threads as to each EV5 • Drawbacks: • An EV6 that goes idle stays idle (unless unassigned threads remain) • Slow threads on an EV5 penalize overall performance
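The pseudo-best heuristic on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the per-task IPC dictionaries (assumed to be measured offline on each core type) are hypothetical.

```python
def pseudo_best_assignment(tasks, ipc_ev6, ipc_ev5, n_ev6, n_ev5):
    """Static pseudo-best heuristic (sketch): sort tasks by their EV6/EV5
    IPC ratio and give each EV6 core twice the share of each EV5 core."""
    # Tasks that benefit most from the EV6 come first
    ranked = sorted(tasks, key=lambda t: ipc_ev6[t] / ipc_ev5[t], reverse=True)
    # Each EV6 receives twice as many threads as each EV5
    shares = [2] * n_ev6 + [1] * n_ev5
    total = sum(shares)
    assignment = {}
    i = 0
    for core, share in enumerate(shares):
        n = round(share * len(ranked) / total)
        for t in ranked[i:i + n]:
            assignment[t] = ("EV6", core) if core < n_ev6 else ("EV5", core - n_ev6)
        i += n
    # Any tasks left over from rounding go to the last EV5
    for t in ranked[i:]:
        assignment[t] = ("EV5", n_ev5 - 1)
    return assignment
```

Note how the sketch exhibits the slide's drawback: the assignment is fixed up front, so an EV6 whose threads finish early stays idle.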

  14. Quality of static assignments Random: homogeneous better than heterogeneous

  15. Quality of static assignments Random: homogeneous better than heterogeneous Best: a priori knowledge of benchmark characteristics

  16. Round Robin assignment • Dynamic assignment policy • Periodic rotation of threads across cores, once per swap_period • # EV6s < # EV5s => several swap periods needed for a complete rotation • Pros: • EV6s never idle • Better load balancing • Cons: • Runtime behavior of threads ignored
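One swap period of the rotation can be sketched as below (an illustrative sketch, not the paper's simulator code; the queue-based representation is an assumption):

```python
from collections import deque

def round_robin_step(queue, n_ev6, n_ev5):
    """One swap-period step (sketch): rotate the thread queue by one slot.
    Threads at the front of the queue occupy the EV6 slots, so every thread
    eventually visits an EV6; with fewer EV6s than EV5s, a complete rotation
    takes several swap periods."""
    queue.rotate(1)  # shift every thread one slot forward
    ev6_threads = list(queue)[:n_ev6]
    ev5_threads = list(queue)[n_ev6:n_ev6 + n_ev5]
    return ev6_threads, ev5_threads
```

For example, with `deque(["t0", "t1", "t2", "t3"])`, one EV6 and three EV5s, the first step places `t3` on the EV6 and the rest on the EV5s; runtime behavior of the threads plays no role, as the slide's "Cons" bullet notes.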

  17. Round Robin vs. static assignment • RR outperforms static assignment across all degrees of TLP • RR w/ 10 EV5s performs comparably to homogeneous w/ 20 EV5s at high degrees of TLP

  18. IPC Driven assignment • Dynamic assignment policy • Goal: assign to the EV6s the jobs that achieve the greatest speedup on them • EV6/EV5 IPC ratio as control metric • Three causes of migration: • Learning (forced migration) • An EV6 core becoming idle • Variation in IPC ratios (IPC-driven migration)
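The IPC-ratio control loop can be sketched as follows. This covers only the third migration cause (variation in IPC ratios); the function name is hypothetical and the per-thread IPC samples are assumed to come from the learning phase described on the slide.

```python
def ipc_driven_reassign(ipc_ev6, ipc_ev5, n_ev6, current_on_ev6):
    """IPC-driven policy (sketch): keep on the EV6s the threads with the
    highest EV6/EV5 IPC ratio. Returns the set that should occupy the EV6s
    next period; any difference from the current set triggers migrations."""
    ratio = {t: ipc_ev6[t] / ipc_ev5[t] for t in ipc_ev6}
    # The n_ev6 threads with the largest speedup on an EV6
    want_on_ev6 = set(sorted(ratio, key=ratio.get, reverse=True)[:n_ev6])
    # Threads that must move in either direction
    migrations = want_on_ev6.symmetric_difference(current_on_ev6)
    return want_on_ev6, migrations
```

A design point this makes visible: the policy only migrates when the ranking by IPC ratio actually changes, so a stable workload incurs no migration cost.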

  19. Dynamic assignments IPC-driven assignment is better at high TLP. The limited performance increase may not justify more complicated schemes

  20. Components of Speedup • # threads ≤ # cores: effect of load balancing independent of the dynamic policy • # threads ≤ # EV6s: NO reassignment • # threads ~ # cores: NO load balancing • # threads > # EV6s: load balancing

  21. Conclusions • Analysis • Multi-programmed computing environment (workloads from SPEC2000) • Two homogeneous and three heterogeneous CMP configurations (two core types) • Two static and two dynamic assignment policies • Dynamic assignment on a heterogeneous CMP configuration • accommodates a broad range of degrees of thread parallelism • outperforms static assignment by 20% to 40% on average (80% in extreme cases) • a simple Round Robin policy can suffice, especially with a limited degree of thread level parallelism

  22. Questions Thanks • Dr. Patrick Crowley • Applied Research Lab and Storage Based Supercomputing Group at Washington University in St. Louis • Anonymous Reviewers • YOU ALL!

  23. Forced migrations • Variation of IPC as the triggering factor • Initially • According to “program phases” • Different programs have different phase durations • Phase changes may be observed on different cores at the same time

  24. Homogeneous vs. Heterogeneous [chart: random static assignment]
