Presentation Transcript


  1. Stream Scheduling. MaxAcademy Lecture Series – V1.0, September 2011

  2. Overview • Latencies in stream computing • Scheduling algorithms • Stream offsets

  3. Latencies in Stream Computing • Consider a simple arithmetic pipeline, e.g. (A + B) + C • Each operation has a latency: the number of cycles from input to output (may be zero) • Throughput is still 1 value per cycle, and L values can be in flight in the pipeline
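The latency model on slide 3 can be sketched in software. Below is a minimal Python model (ours, not MaxCompiler code): an operator with latency L is a FIFO of depth L, so one result emerges per cycle while L operations are in flight.

```python
from collections import deque

def pipelined_add(a_stream, b_stream, latency):
    """Model a fixed-latency adder: one new operation enters per cycle,
    the oldest result leaves; `latency` values are in flight at once.
    None marks cycles where no valid result has emerged yet."""
    pipe = deque([None] * latency)
    out = []
    for a, b in zip(a_stream, b_stream):
        pipe.append(a + b)          # operation enters the pipeline
        out.append(pipe.popleft())  # oldest result leaves
    return out

# With latency 2, results appear two cycles after their inputs:
print(pipelined_add([1, 2, 3, 4], [10, 20, 30, 40], 2))
# -> [None, None, 11, 22]
```

Throughput stays at one value per cycle; only the initial delay changes with the latency.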

  4. [Diagram: Input A, Input B, Input C feeding two adders into Output] Basic hardware implementation

  5. [Diagram: values 1, 2, 3 at the inputs] Data propagates through the circuit in "lock step"

  6. [Diagram: 1 and 2 reach the first adder while 3 advances]

  7. [Diagram: 3 reaches the second adder before the sum of 1 and 2 does] Data arrives at the wrong time due to pipeline latency

  8. [Diagram: a buffer inserted on Input C's path] Insert buffering to correct

  9. [Diagram: values 1, 2, 3 at the inputs of the buffered circuit] Now with buffering

  10. [Diagram: 1 and 2 reach the first adder while 3 waits in the buffer]

  11. [Diagram: the sum 3 and the buffered 3 approach the second adder]

  12. [Diagram: both 3s arrive at the second adder together]

  13. [Diagram: 6 emerges from the second adder]

  14. [Diagram: 6 at the output] Success!
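The walkthrough in slides 4-14 can be reproduced with a small cycle-by-cycle sketch. This is a Python illustration of ours (names are hypothetical), modelling each adder with a 1-cycle latency: without buffering, the second adder pairs the sum with the wrong element of C; a 1-cycle buffer on C restores the pairing.

```python
def schedule_demo(a, b, c, buffer_c):
    """(a + b) + c with a 1-cycle adder latency. If buffer_c is True,
    c is delayed one cycle to match the first adder's latency.
    None marks cycles carrying no valid data."""
    s1 = [None] + [x + y for x, y in zip(a, b)]        # a + b, one cycle late
    cd = [None] + list(c) if buffer_c else list(c) + [None]
    # The second adder pairs whatever arrives on each cycle. (Its own
    # 1-cycle latency only delays the output, so it is omitted here.)
    return [None if x is None or y is None else x + y
            for x, y in zip(s1, cd)]

print(schedule_demo([1, 10], [2, 20], [3, 30], buffer_c=False))
# -> [None, 33, None]  (mispaired: 1+2 meets 30)
print(schedule_demo([1, 10], [2, 20], [3, 30], buffer_c=True))
# -> [None, 6, 60]     (correct, as in slide 14)
```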

  15. Stream Scheduling Algorithms • A stream scheduling algorithm transforms an abstract dataflow graph into one that produces the correct results given the latencies of the operations • Can be applied automatically to a large dataflow graph (many thousands of nodes) • Can try to optimize for various metrics • Latency from inputs to outputs • Amount of buffering inserted → generally the most interesting • Area (resource sharing)

  16. ASAP: As Soon As Possible

  17. [Diagram: Input A, Input B, Input C, each at latency 0] Build up the circuit incrementally, keeping track of latencies

  18. [Diagram: the first adder's output at latency 1]

  19. [Diagram: the second adder fed by latencies 1 and 0] Input latencies are mismatched

  20. [Diagram: a 1-cycle buffer on Input C brings both of the second adder's inputs to latency 1; its output is at latency 2] Insert buffering

  21. [Diagram: the complete ASAP-scheduled circuit, Output at latency 2]
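The ASAP pass in slides 17-21 can be written down directly: walk the graph in topological order, fire each node as soon as its slowest input arrives, and count the buffering needed to stall the faster inputs. A Python sketch with our own naming (not a MaxCompiler API):

```python
def asap_schedule(graph, latency):
    """ASAP: fire each node as soon as its slowest input is ready.
    graph: node -> list of predecessors, in topological order.
    Returns (arrival time per node, total buffering inserted)."""
    arrival, buffering = {}, 0
    for node, preds in graph.items():
        if not preds:
            arrival[node] = 0               # inputs are ready at time 0
            continue
        ready = max(arrival[p] for p in preds)
        # earlier inputs wait in buffers until the latest one arrives
        buffering += sum(ready - arrival[p] for p in preds)
        arrival[node] = ready + latency.get(node, 0)
    return arrival, buffering

# The (A + B) + C example with 1-cycle adders:
g = {"A": [], "B": [], "C": [],
     "add1": ["A", "B"], "add2": ["add1", "C"], "out": ["add2"]}
lat = {"add1": 1, "add2": 1}
print(asap_schedule(g, lat))
# -> one cycle of buffering, on Input C's path, as in slide 20
```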

  22. ALAP: As Late As Possible

  23. [Diagram: Output at latency 0] Start at the output

  24. [Diagram: the second adder's inputs at latency -1] Latencies are negative relative to the end of the circuit

  25. [Diagram: the first adder's inputs at latency -2, Input C at -1]

  26. [Diagram: Input A and Input B at latency -2, Input C at -1, Output at 0]

  27. [Diagram: the complete ALAP-scheduled circuit] Buffering is saved

  28. [Diagram: the same circuit with an extra Output 1 tapped from the first adder, Output 2 at the end] Sometimes this is suboptimal. What if we add an extra output?

  29. [Diagram: with both outputs fixed at latency 0, a buffer appears on the path to Output 1] Unnecessary buffering is added. Neither ASAP nor ALAP can schedule this design optimally
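The ALAP pass of slides 23-27 runs the same idea backwards from the output: times are negative, and a node feeding several consumers must be ready for the earliest of them. A Python sketch under the same assumptions as the ASAP sketch (topologically ordered graph, single output listed last, every node on a path to it):

```python
def alap_schedule(graph, latency):
    """ALAP: schedule backwards from the output, which fires at time 0,
    so all other times are negative. graph: node -> predecessors,
    topologically ordered, with the single output node last."""
    nodes = list(graph)
    required = {nodes[-1]: 0}               # the output fires at time 0
    for node in reversed(nodes):
        for p in graph[node]:
            t = required[node] - latency.get(node, 0)
            # a node feeding several consumers must satisfy the earliest
            if p not in required or t < required[p]:
                required[p] = t
    return required

g = {"A": [], "B": [], "C": [],
     "add1": ["A", "B"], "add2": ["add1", "C"], "out": ["add2"]}
lat = {"add1": 1, "add2": 1}
print(alap_schedule(g, lat))
# -> A and B needed at -2, C at -1: no buffer on C's path (slide 27)
```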

  30. Optimal Scheduling • ASAP and ALAP both fix either inputs or outputs in place • More complex scheduling algorithms may find a better schedule, e.g. using integer linear programming (ILP)
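As a stand-in for an ILP solver, the dual-output example of slides 28-29 is small enough to schedule optimally by brute force: one integer time per node, one slack (buffering) variable per edge, minimize total slack. The Python sketch below (our own formulation and names) finds a zero-buffering schedule, which neither ASAP (inputs pinned at time 0) nor ALAP (outputs pinned at time 0) achieves:

```python
from itertools import product

# Dual-output example: Output 1 taps the first adder, Output 2 is the
# full (A + B) + C. Adders have 1-cycle latency.
edges = [("A", "add1"), ("B", "add1"), ("add1", "out1"),
         ("add1", "add2"), ("C", "add2"), ("add2", "out2")]
lat = {"add1": 1, "add2": 1}
nodes = ["A", "B", "C", "add1", "add2", "out1", "out2"]

# Edge p -> n requires n to fire no earlier than p's data arrives:
# t[n] - lat[n] - t[p] >= 0. Any positive slack is inserted buffering.
best = None
for times in product(range(4), repeat=len(nodes)):
    t = dict(zip(nodes, times))
    slack = [t[n] - lat.get(n, 0) - t[p] for p, n in edges]
    if any(s < 0 for s in slack):
        continue                            # data would arrive too late
    if best is None or sum(slack) < best[0]:
        best = (sum(slack), t)

print(best[0])
# -> 0: with inputs and outputs free to move, no buffering is needed
```

An ILP solver expresses exactly these inequalities and objective, but scales to the many-thousand-node graphs mentioned on slide 15.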

  31. Buffering data on-chip • Consider: a[i] = a[i] + (a[i - 1] + b[i - 1]) • We can see that we might need some explicit buffering to hold more than one data element on-chip • We could do this explicitly, with buffering elements: a = a + (buffer(a, 1) + buffer(b, 1))

  32. [Diagram: Buffer(1) elements on Input A and Input B feeding the adders] The buffer has zero latency in the schedule

  33. [Diagram: the scheduled circuit; inputs and buffers at latency 0, the adders at 1 and 2] This will schedule thus: Buffering = 3

  34. Buffers and Latency • Accessing previous values with buffers is looking backwards in the stream • This is equivalent to having a wire with negative latency • It cannot be implemented directly, but it can affect the schedule

  35. [Diagram: Offset(-1) elements at latency -1 on Input A and Input B] Offset wires can have negative latency

  36. [Diagram: the scheduled circuit; inputs at latency 0, offsets at -1, the adders at 0 and 1] This is scheduled with Buffering = 0

  37. Stream Offsets • A stream offset is just a wire with a positive or negative latency • Negative latencies look backwards in the stream • Positive latencies look forwards in the stream • The entire dataflow graph is re-scheduled to make sure the right data value is present when needed • Buffering can be placed anywhere, or pushed into inputs or outputs → better than manual instantiation
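Offset semantics can be mimicked in software as a simple index shift, which makes the "looking backwards/forwards" wording concrete. This is a Python model of ours, not the MaxCompiler stream.offset API:

```python
def offset(stream, k, pad=None):
    """Software model of a stream offset: element i of the result is
    stream[i + k]. Negative k looks backwards, positive k forwards;
    out-of-range positions yield `pad`."""
    n = len(stream)
    return [stream[i + k] if 0 <= i + k < n else pad for i in range(n)]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
# Slide 31's computation, a + (offset(a, -1) + offset(b, -1)):
out = [x + p + q if p is not None else None
       for x, p, q in zip(a, offset(a, -1), offset(b, -1))]
print(out)
# -> [None, 13, 25, 37]
```

In hardware there is no random access, so the scheduler instead shifts the whole graph and inserts real buffering wherever the offsets require it.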

  38. [Diagram: Input A and Offset(1) feeding an adder into Output] a[i] = a[i] + a[i + 1], written as a = a + stream.offset(a, +1)

  39. [Diagram: the scheduled circuit; Offset(1) at latency 1, the adder output at latency 2] Scheduling produces a circuit with 1 buffer

  40. Exercises. For the questions below, assume that an addition has a latency of 10 cycles, a multiply takes 5 cycles, and inputs/outputs take 0 cycles.
  • Write pseudo-code algorithms for ASAP and ALAP scheduling of a dataflow graph
  • Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and draw the buffering introduced by ASAP scheduling for: (a) c = ((a1 + a2) + a3) + a4 (b) c = (a1 + a2) + (a3 + a4)
  • Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and write out the inequalities that must be satisfied to schedule: (a) c = ((a1 * a2) + (a3 * a4)) + a1 (b) c = stream.offset(a1, -10)*a2 + stream.offset(a1, -5)*a3 + stream.offset(a1, +15)*a4. How many values of stream a1 will be buffered on-chip for (b)?
