1 / 39

EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining. University of Michigan November 9, 2011. Announcements + Reading Material. 2 nd paper review due today Should have submitted to andrew.eecs.umich.edu:/y/submit Next Monday – Midterm exam in class Today’s class reading

rollin
Télécharger la présentation

EECS 583 – Class 17 Research Topic 1 Decoupled Software Pipelining

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. EECS 583 – Class 17Research Topic 1Decoupled Software Pipelining University of Michigan November 9, 2011

  2. Announcements + Reading Material • 2nd paper review due today • Should have submitted to andrew.eecs.umich.edu:/y/submit • Next Monday – Midterm exam in class • Today’s class reading • “Automatic Thread Extraction with Decoupled Software Pipelining,” G. Ottoni, R. Rangan, A. Stoler, and D. I. August, Proceedings of the 38th IEEE/ACM International Symposium on Microarchitecture, Nov. 2005. • Next class reading (Wednes Nov 16) • “Spice: Speculative Parallel Iteration Chunk Execution,” E. Raman, N. Vachharajani, R. Rangan, and D. I. August, Proc 2008 Intl. Symposium on Code Generation and Optimization, April 2008.

  3. Midterm Exam • When: Monday, Nov 14, 2011, 10:40-12:30 • Where • 1005 EECS • Uniquenames starting with A-H go here • 3150 Dow (our classroom) • Uniquenames starting with I-Z go here • What to expect • Open book/notes, no laptops • Apply techniques we discussed in class on examples • Reason about solving compiler problems – why things are done • A couple of thinking problems • No LLVM code • Reasonably long but you should finish • Last 2 years exams are posted on the course website • Note – Past exams may not accurately predict future exams!!

  4. Midterm Exam • Office hours between now and Monday if you have questions • Daya: Thurs and Fri 3-5pm • Scott: Wednes 4:30-5:30, Fri 4:30-5:30 • Studying • Yes, you should study even though its open notes • Lots of material that you have likely forgotten • Refresh your memories • No memorization required, but you need to be familiar with the material to finish the exam • Go through lecture notes, especially the examples! • If you are confused on a topic, go through the reading • If still confused, come talk to me or Daya • Go through the practice exams as the final step

  5. Exam Topics • Control flow analysis • Control flow graphs, Dom/pdom, Loop detection • Trace selection, superblocks • Predicated execution • Control dependence analysis, if-conversion, hyperblocks • Can ignore control height reduction • Dataflow analysis • Liveness, reaching defs, DU/UD chains, available defs/exprs • Static single assignment • Optimizations • Classical: Dead code elim, constant/copy prop, CSE, LICM, induction variable strength reduction • ILP optimizations - unrolling, renaming, tree height reduction, induction/accumulator expansion • Speculative optimization – like HW 1

  6. Exam Topics - Continued • Acyclic scheduling • Dependence graphs, Estart/Lstart/Slack, list scheduling • Code motion across branches, speculation, exceptions • Software pipelining • DSA form, ResMII, RecMII, modulo scheduling • Make sure you can modulo schedule a loop! • Execution control with LC, ESC • Register allocation • Live ranges, graph coloring • Research topics • Can ignore these

  7. Last Class • Scientific codes – Successful parallelization • KAP, SUIF, Parascope, gcc w/ Graphite • Affine array dependence analysis • DOALL parallelization • C programs • Not dominated by array accesses – classic parallelization fails • Speculative parallelization – Hydra, Stampede, Speculative multithreading • Profiling to identify statistical DOALL loops • But not all loops DOALL, outer loops typically not!! • This class – Parallelizing loops with dependences

  8. What About Non-Scientific Codes??? 1 5 3 2 0 4 2 4 5 1 0 3 Core 1 Core 2 C:1 C:2 X:1 X:2 C:4 C:3 X:4 X:3 C:5 C:6 X:5 X:6 Scientific Codes (FORTRAN-like) General-purpose Codes (legacy C/C++) for(i=1; i<=N; i++) // C a[i] = a[i] + 1; // X while(ptr = ptr->next) // LD ptr->val = ptr->val + 1; // X Core 1 Core 2 Independent Multithreading (IMT) Example: DOALL parallelization Cyclic Multithreading (CMT) Example: DOACROSS [Cytron, ICPP 86] LD:1 X:1 LD:2 X:2 LD:3 LD:4 X:3 X:4 LD:5 X:5 LD:6

  9. Alternative Parallelization Approaches 0 1 2 4 5 3 4 3 2 1 0 5 Core 1 Core 2 Core 1 Core 2 LD:1 LD:1 LD:2 X:1 X:1 LD:2 X:2 X:2 LD:3 LD:3 X:3 LD:4 LD:4 X:3 X:4 X:4 LD:5 LD:5 LD:6 X:5 X:5 LD:6 while(ptr = ptr->next) // LD ptr->val = ptr->val + 1; // X Pipelined Multithreading (PMT) Example: DSWP [PACT 2004] Cyclic Multithreading (CMT)

  10. Comparison: IMT, PMT, CMT 0 4 0 1 3 4 5 5 2 1 2 1 0 3 5 4 3 2 CMT PMT Core 1 Core 2 LD:1 Core 1 Core 2 Core 1 Core 2 C:1 C:2 LD:1 LD:2 X:1 X:1 X:2 X:1 LD:2 X:2 LD:3 X:2 C:4 C:3 LD:3 X:3 LD:4 X:4 LD:4 X:3 X:3 X:4 LD:5 X:4 C:5 C:6 LD:5 LD:6 X:5 X:5 X:6 X:5 LD:6 IMT

  11. Comparison: IMT, PMT, CMT 4 3 2 0 5 3 5 4 1 0 1 2 3 4 2 5 1 0 IMT CMT PMT Core 1 Core 2 Core 1 Core 2 Core 1 Core 2 C:1 C:2 LD:1 LD:1 X:1 X:2 LD:2 X:1 X:1 LD:2 X:2 X:2 C:4 C:3 LD:3 LD:3 X:4 X:3 LD:4 X:3 LD:4 X:3 X:4 X:4 C:5 C:6 LD:5 LD:5 1 iter/cycle 1 iter/cycle X:5 X:6 LD:6 X:5 X:5 LD:6 1 iter/cycle lat(comm) = 1:

  12. Comparison: IMT, PMT, CMT 0 0 1 3 4 5 5 4 2 2 1 0 5 4 3 3 2 1 IMT CMT PMT Core 1 Core 2 Core 1 Core 2 LD:1 LD:1 Core 1 Core 2 C:1 C:2 X:1 LD:2 X:1 X:2 LD:2 X:1 LD:3 C:4 C:3 X:2 X:2 LD:4 X:4 X:3 X:3 LD:3 LD:5 C:5 C:6 X:4 X:3 LD:6 X:5 X:6 1 iter/cycle 1 iter/cycle lat(comm) = 1: 1 iter/cycle 1 iter/cycle 1 iter/cycle lat(comm) = 2: 0.5 iter/cycle

  13. Comparison: IMT, PMT, CMT 5 4 3 2 1 0 1 5 0 3 4 5 4 3 2 1 0 2 C:1 C:2 X:1 X:2 C:4 C:3 X:4 X:3 C:5 C:6 X:5 X:6 Thread-local Recurrences  Fast Execution IMT PMT CMT Core 1 Core 2 Core 1 Core 2 Core 1 Core 2 LD:1 LD:1 LD:2 X:1 X:1 LD:2 LD:3 X:2 X:2 LD:4 X:3 LD:5 LD:3 X:4 LD:6 X:3 Cross-thread Dependences  Wide Applicability

  14. Our Objective: Automatic Extraction of Pipeline Parallelism using DSWP 197.parser Find English Sentences Parse Sentences (95%) Emit Results Decoupled Software Pipelining PS-DSWP (Spec DOALL Middle Stage)

  15. Decoupled Software Pipelining

  16. [MICRO 2005] Decoupled Software Pipelining (DSWP) register control intra-iteration loop-carried communication queue A: while(node) B: ncost = doit(node); C: cost += ncost; D: node = node->next; Dependence Graph Thread 1 Thread 2 A D DAGSCC D A B B A D B C C C Inter-thread communication latency is a one-time cost

  17. register control memory L1: intra-iteration loop-carried Aux: Implementing DSWP DFG

  18. register control memory intra-iteration loop-carried Optimization: Node SplittingTo Eliminate Cross Thread Control L1 L2

  19. register control memory intra-iteration loop-carried Optimization: Node Splitting To Reduce Communication L1 L2

  20. register control memory intra-iteration loop-carried Constraint: Strongly Connected Components Consider: Solution: DAGSCC Eliminates pipelined/decoupled property

  21. 2 Extensions to the Basic Transformation • Speculation • Break statistically unlikely dependences • Form better-balanced pipelines • Parallel Stages • Execute multiple copies of certain “large” stages • Stages that contain inner loops perfect candidates

  22. Why Speculation? register control intra-iteration loop-carried communication queue A: while(node) B: ncost = doit(node); C: cost += ncost; D: node = node->next; A D Dependence Graph DAGSCC D B B A C C

  23. Why Speculation? register control intra-iteration loop-carried communication queue A: while(cost < T && node) B: ncost = doit(node); C: cost += ncost; D: node = node->next; A D Dependence Graph DAGSCC D B A B C D B A C C Predictable Dependences

  24. Why Speculation? register control intra-iteration loop-carried communication queue A: while(cost < T && node) B: ncost = doit(node); C: cost += ncost; D: node = node->next; D Dependence Graph DAGSCC D B B A C C A Predictable Dependences

  25. Execution Paradigm register control intra-iteration loop-carried communication queue D DAGSCC Misspeculation detected B C Misspeculation Recovery Rerun Iteration 4 A

  26. Understanding PMT Performance 4 3 5 0 2 1 1 2 3 4 5 0 Core 1 Core 2 Core 1 Core 2 A:1 B:1 A:1 A:2 B:1 A:2 B:2 C:1 B:2 A:3 Rate tiis at least as large as the longest dependence recurrence. NP-hard to find longest recurrence. Large loops make problem difficult in practice. Idle Time B:3 A:4 A:3 B:3 C:3 B:4 A:5 A:6 B:5 Slowest thread: 1 cycle/iter 2 cycle/iter Iteration Rate: 1 iter/cycle 0.5 iter/cycle

  27. Selecting Dependences To Speculate register control intra-iteration loop-carried communication queue A: while(cost < T && node) B: ncost = doit(node); C: cost += ncost; D: node = node->next; D Dependence Graph DAGSCC Thread 1 D B Thread 2 B A C Thread 3 C A Thread 4

  28. Detecting Misspeculation A1: while(consume(4)) D: node = node->next produce({0,1},node); Thread 1 A2: while(consume(5)) B: ncost = doit(node); produce(2,ncost); D2: node = consume(0); D DAGSCC Thread 2 B A3: while(consume(6)) B3: ncost = consume(2); C: cost += ncost; produce(3,cost); Thread 3 C A: while(cost < T && node) B4: cost = consume(3); C4: node = consume(1); produce({4,5,6},cost < T && node); A Thread 4

  29. Detecting Misspeculation A1: while(TRUE) D: node = node->next produce({0,1},node); Thread 1 D DAGSCC A2: while(TRUE) B: ncost = doit(node); produce(2,ncost); D2: node = consume(0); Thread 2 B A3: while(TRUE) B3: ncost = consume(2); C: cost += ncost; produce(3,cost); Thread 3 C A: while(cost < T && node) B4: cost = consume(3); C4: node = consume(1); produce({4,5,6},cost < T && node); A Thread 4

  30. Detecting Misspeculation A1: while(TRUE) D: node = node->next produce({0,1},node); Thread 1 D DAGSCC A2: while(TRUE) B: ncost = doit(node); produce(2,ncost); D2: node = consume(0); Thread 2 B A3: while(TRUE) B3: ncost = consume(2); C: cost += ncost; produce(3,cost); Thread 3 C A: while(cost < T && node) B4: cost = consume(3); C4: node = consume(1); if(!(cost < T && node)) FLAG_MISSPEC(); A Thread 4

  31. Breaking False Memory Dependences register control intra-iteration loop-carried communication queue Oldest Version Committed by Recovery Thread Dependence Graph D B A Memory Version 3 C false memory

  32. Adding Parallel Stages to DSWP 5 3 2 4 6 Core 1 Core 2 Core 3 0 LD:1 1 LD:2 while(ptr = ptr->next) // LD ptr->val = ptr->val + 1; // X LD:3 X:1 LD:4 LD = 1 cycle X = 2 cycles X:2 LD:5 X:3 Comm. Latency = 2 cycles LD:6 X:4 Throughput DSWP: 1/2 iteration/cycle DOACROSS: 1/2 iteration/cycle PS-DSWP: 1 iteration/cycle LD:7 X:5 7 LD:8

  33. Thread Partitioning p = list; sum = 0; A: while (p != NULL) { B: id = p->id; E: q = p->inner_list; C: if (!visited[id]) { D: visited[id] = true; F: while (foo(q))‏ G: q = q->next; H: if (q != NULL)‏ I: sum += p->value; } J: p = p->next; } 10 10 10 10 5 5 50 50 5 Reduction 3

  34. Thread Partitioning: DAGSCC

  35. Thread Partitioning 20 • Merging Invariants • No cycles • No loop-carried dependence inside a doall node 10 15 5 100 5 3

  36. Thread Partitioning 20 Treated as sequential 10 15 113

  37. Thread Partitioning 45 113 Modified MTCG[Ottoni, MICRO’05] to generate code from partition

  38. Discussion Point 1 – Speculation • How do you decide what dependences to speculate? • Look solely at profile data? • What about code structure? • How do you manage speculation in a pipeline? • Traditional definition of a transaction is broken • Transaction execution spread out across multiple cores

  39. Discussion Point 2 – Pipeline Structure • When is a pipeline a good/bad choice for parallelization? • Is pipelining good or bad for cache performance? • Is DOALL better/worse for cache? • Can a pipeline be adjusted when the number of available cores increases/decreases?

More Related