Understanding VLIW and Software-Driven ILP Techniques
This presentation, delivered by Anshul Kumar from CSE IITD, explores the fundamentals of Very Long Instruction Word (VLIW) architectures and software-driven Instruction-Level Parallelism (ILP). It covers topics such as pipeline scheduling, loop unrolling, branch prediction, and techniques for detecting and enhancing loop-level parallelism. The session also discusses practical examples and strategies for multi-issue processors, focusing on improving performance through advanced scheduling techniques and removing false dependencies.
Presentation Transcript
CS718 : VLIW - Software Driven ILP Introduction 23rd Mar, 2006 Anshul Kumar, CSE IITD
Outline
• Pipeline scheduling and loop unrolling
• Branch prediction with static scheduling
• Basic VLIW approach
• Detecting and enhancing loop level parallelism
• Software pipelining
• Global scheduling
• Hardware support
• Real examples
Approaches for multi-issue processors
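Pipeline scheduling example
    for (i=1000; i>0; i--)
        x[i] = x[i] + s;

    Loop: L.D    F0, 0(R1)       ; F0 = x[i]
          ADD.D  F4, F0, F2      ; F4 = x[i] + s (s is held in F2)
          S.D    F4, 0(R1)       ; x[i] = F4
          DADDUI R1, R1, #-8     ; step the pointer back one 8-byte element
          BNE    R1, R2, Loop    ; repeat until R1 equals R2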
Latency due to data hazards
Assume no structural hazards
Straightforward scheduling
    Loop: L.D    F0, 0(R1)        1
          stall                   2
          ADD.D  F4, F0, F2       3
          stall                   4
          stall                   5
          S.D    F4, 0(R1)        6
          DADDUI R1, R1, #-8      7
          stall                   8
          BNE    R1, R2, Loop     9
          stall                  10
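The stall positions in this schedule can be reproduced with a small in-order issue model. This is a minimal sketch, not the presenter's tool: the `STALLS` table and `BRANCH_DELAY` are assumptions read off the stall counts shown in the schedule, and the tuple encoding of instructions is made up for illustration.

```python
# Stall cycles needed between a producer and a dependent consumer.
# Assumed values, read off the stall counts in the schedule above.
STALLS = {
    ("L.D", "ADD.D"): 1,
    ("ADD.D", "S.D"): 2,
    ("DADDUI", "BNE"): 1,
}
BRANCH_DELAY = 1  # assumed: one unfilled cycle after the branch

# (opcode, destination register or None, source registers)
STRAIGHTFORWARD = [
    ("L.D", "F0", ["R1"]),
    ("ADD.D", "F4", ["F0", "F2"]),
    ("S.D", None, ["F4", "R1"]),
    ("DADDUI", "R1", ["R1"]),
    ("BNE", None, ["R1", "R2"]),
]


def issue_cycles(program):
    """Return the 1-based cycle in which each instruction issues."""
    last_writer = {}  # register -> (opcode, issue cycle) of its latest producer
    cycles = []
    cycle = 0
    for op, dest, srcs in program:
        earliest = cycle + 1  # in-order, one instruction per cycle at best
        for reg in srcs:
            if reg in last_writer:
                prod_op, prod_cycle = last_writer[reg]
                stall = STALLS.get((prod_op, op), 0)
                earliest = max(earliest, prod_cycle + 1 + stall)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            last_writer[dest] = (op, cycle)
    return cycles


def loop_length(program):
    """Cycles per iteration for a body that ends in an unfilled branch."""
    return issue_cycles(program)[-1] + BRANCH_DELAY


print(issue_cycles(STRAIGHTFORWARD))  # [1, 3, 6, 7, 9]
print(loop_length(STRAIGHTFORWARD))   # 10
```

The gaps between consecutive issue cycles are the stalls: only 5 of the 10 cycles do useful work.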
A better schedule
    Loop: L.D    F0, 0(R1)        1
          DADDUI R1, R1, #-8      2
          ADD.D  F4, F0, F2       3
          stall                   4
          BNE    R1, R2, Loop     5
          S.D    F4, 0(R1)        6    ; now runs after DADDUI, so this offset still needs adjusting
A better schedule (S.D offset adjusted)
    Loop: L.D    F0, 0(R1)        1
          DADDUI R1, R1, #-8      2
          ADD.D  F4, F0, F2       3
          stall                   4
          BNE    R1, R2, Loop     5
          S.D    F4, 8(R1)        6    ; 8 = 0 + 8, compensating for the earlier DADDUI
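The offset change from 0(R1) to 8(R1) follows from simple address arithmetic: once the store executes after the pointer update, the update has to be subtracted back out of the offset. A minimal sketch of that rule, where `adjusted_offset` and the starting pointer value are hypothetical names and numbers chosen for illustration:

```python
def adjusted_offset(offset, pending_increments):
    """Offset a load/store needs after it is moved below pointer updates.

    pending_increments lists the immediates of the DADDUI instructions
    that now execute before the memory access.
    """
    return offset - sum(pending_increments)


# Check that the store still hits the same address.
r1 = 8000                       # arbitrary illustrative pointer value
address_before = r1 + 0         # S.D F4, 0(R1) issued before DADDUI
r1_after = r1 + (-8)            # DADDUI R1, R1, #-8
new_offset = adjusted_offset(0, [-8])
address_after = r1_after + new_offset

print(new_offset)                       # 8
print(address_before == address_after)  # True
```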
Loop unrolling
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)        6

          L.D    F0, -8(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -8(R1)      12

          L.D    F0, -16(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -16(R1)     18

          L.D    F0, -24(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -24(R1)     24

          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop    28

    28/4 = 7 cycles per element
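The running cycle counts on this slide follow a simple pattern: each unscheduled copy of the body costs 6 cycles (L.D, stall, ADD.D, stall, stall, S.D) and the loop overhead costs 4 (DADDUI, stall, BNE, stall). A short sketch of the resulting cost per element, assuming those two constants and no rescheduling; the function name is illustrative only:

```python
from fractions import Fraction

BODY_CYCLES = 6      # L.D, stall, ADD.D, stall, stall, S.D
OVERHEAD_CYCLES = 4  # DADDUI, stall, BNE, stall


def cycles_per_element(unroll_factor):
    """Cycles per array element for an unrolled but unscheduled loop."""
    total = BODY_CYCLES * unroll_factor + OVERHEAD_CYCLES
    return Fraction(total, unroll_factor)


for k in (1, 2, 4):
    print(k, cycles_per_element(k))
# 1 10
# 2 8
# 4 7
```

Unrolling alone only amortizes the 4 overhead cycles; the 6 cycles per copy remain until the copies are interleaved.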
Removing false dependences
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)        6

          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, -8(R1)      12

          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    F12, -16(R1)    18

          L.D    F14, -24(R1)
          ADD.D  F16, F14, F2
          S.D    F16, -24(R1)    24

          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop    28

    28/4 = 7 cycles per element
Re-scheduling
    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)     4
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2     8
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)      10
          DADDUI R1, R1, #-32
          S.D    F12, 16(R1)     12    ; after DADDUI: 16 = -16 + 32
          BNE    R1, R2, Loop
          S.D    F16, 8(R1)      14    ; after DADDUI: 8 = -24 + 32

    14/4 = 3.5 cycles per element
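A quick way to see why this ordering reaches 14 cycles is to check that every consumer sits far enough behind its producer when one instruction issues per cycle. The checker below is a sketch under assumed latencies (1 stall from L.D to ADD.D, 2 from ADD.D to S.D, 1 from DADDUI to BNE); the instruction encoding is invented for the example and memory offsets are omitted because they do not affect timing.

```python
# Assumed stall requirements between producer and consumer opcodes.
REQUIRED_STALLS = {
    ("L.D", "ADD.D"): 1,
    ("ADD.D", "S.D"): 2,
    ("DADDUI", "BNE"): 1,
}

# (opcode, destination register or None, source registers), in issue order.
RESCHEDULED = [
    ("L.D", "F0", ["R1"]), ("L.D", "F6", ["R1"]),
    ("L.D", "F10", ["R1"]), ("L.D", "F14", ["R1"]),
    ("ADD.D", "F4", ["F0", "F2"]), ("ADD.D", "F8", ["F6", "F2"]),
    ("ADD.D", "F12", ["F10", "F2"]), ("ADD.D", "F16", ["F14", "F2"]),
    ("S.D", None, ["F4", "R1"]), ("S.D", None, ["F8", "R1"]),
    ("DADDUI", "R1", ["R1"]),
    ("S.D", None, ["F12", "R1"]),
    ("BNE", None, ["R1", "R2"]),
    ("S.D", None, ["F16", "R1"]),   # fills the branch delay slot
]


def stall_free(schedule):
    """True if issuing one instruction per cycle never violates a latency."""
    last_writer = {}  # register -> (opcode, cycle) of its latest producer
    for cycle, (op, dest, srcs) in enumerate(schedule, start=1):
        for reg in srcs:
            if reg in last_writer:
                prod_op, prod_cycle = last_writer[reg]
                gap = cycle - prod_cycle - 1  # cycles between the two
                if gap < REQUIRED_STALLS.get((prod_op, op), 0):
                    return False
        if dest is not None:
            last_writer[dest] = (op, cycle)
    return True


print(stall_free(RESCHEDULED), len(RESCHEDULED))  # True 14
```

Issuing the first L.D and its ADD.D back to back, as in the unscheduled loop, makes the same check return False.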
Decisions and transformations
• Can S.D move after DADDUI and BNE?
• Adjust S.D offset.
• Are loop iterations independent?
• Do register renaming.
• Remove extra loop termination tests, adjust the code.
• Analyze addresses. Can loads/stores be reordered?
• Schedule the code, preserving dependences.
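At source level, the steps listed above amount to rewriting the loop so that one termination test covers four independent element updates. The sketch below mirrors the example loop in Python; it assumes the trip count is a multiple of 4 (true for 1000), and the function names are illustrative.

```python
def add_scalar(x, s):
    """Reference loop: for (i = 1000; i > 0; i--) x[i] = x[i] + s."""
    for i in range(len(x) - 1, 0, -1):
        x[i] = x[i] + s


def add_scalar_unrolled(x, s):
    """Same loop unrolled by 4, with one termination test per 4 elements.

    Assumes the trip count (len(x) - 1) is a multiple of 4; otherwise a
    cleanup loop would be needed for the leftover iterations.
    """
    n = len(x) - 1
    assert n % 4 == 0, "trip count must be a multiple of the unroll factor"
    i = n
    while i > 0:
        # The four updates touch different elements, so they are independent.
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4


a = [float(i) for i in range(1001)]  # x[0] is unused, as in the C loop
b = list(a)
add_scalar(a, 2.5)
add_scalar_unrolled(b, 2.5)
print(a == b)  # True
```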
Dependences in unrolled loop
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8     ; drop BNE

          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8     ; drop BNE

          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8     ; drop BNE

          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
Remove extra DADDUI
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)       ; drop DADDUI and BNE

          L.D    F0, -8(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -8(R1)      ; drop DADDUI and BNE

          L.D    F0, -16(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -16(R1)     ; drop DADDUI and BNE

          L.D    F0, -24(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -24(R1)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop

    Offsets in loads/stores adjusted.
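Dropping the intermediate DADDUI instructions works because each one's effect can be folded into the offsets of the memory accesses that follow it. A minimal sketch of that folding, using a simplified (opcode, operand) encoding invented for this example:

```python
def fold_pointer_updates(body):
    """Merge every DADDUI in a straight-line body into one final update.

    body is a list of (opcode, operand) pairs: the operand is the memory
    offset for L.D/S.D, the immediate for DADDUI, and None otherwise.
    """
    delta = 0    # pointer change accumulated from the dropped DADDUIs
    folded = []
    for opcode, operand in body:
        if opcode == "DADDUI":
            delta += operand  # drop the instruction, remember its effect
        elif opcode in ("L.D", "S.D"):
            folded.append((opcode, operand + delta))
        else:
            folded.append((opcode, operand))
    folded.append(("DADDUI", delta))  # single merged update
    return folded


one_copy = [("L.D", 0), ("ADD.D", None), ("S.D", 0), ("DADDUI", -8)]
result = fold_pointer_updates(one_copy * 4)

print([operand for opcode, operand in result if opcode == "L.D"])
# [0, -8, -16, -24]
print(result[-1])
# ('DADDUI', -32)
```

The BNE is left out of the body here; it would follow the merged DADDUI unchanged.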
False dependences
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)       ; drop DADDUI and BNE

          L.D    F0, -8(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -8(R1)      ; drop DADDUI and BNE

          L.D    F0, -16(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -16(R1)     ; drop DADDUI and BNE

          L.D    F0, -24(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -24(R1)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop
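The false (name) dependences here come from every copy reusing F0 and F4. They can be listed mechanically by looking for a later write to a register that an earlier instruction wrote (WAW) or read (WAR). The detector below is a simplified sketch with an invented (destination, sources) encoding, run on the first two copies only:

```python
def name_dependences(program):
    """List WAW and WAR pairs as (kind, earlier index, later index, register).

    program is a list of (destination or None, source registers).
    Every earlier/later pair is reported, with no pruning.
    """
    found = []
    for j, (dest_j, _) in enumerate(program):
        if dest_j is None:
            continue
        for i in range(j):
            dest_i, srcs_i = program[i]
            if dest_i == dest_j:
                found.append(("WAW", i, j, dest_j))
            if dest_j in srcs_i:
                found.append(("WAR", i, j, dest_j))
    return found


two_copies = [
    ("F0", ["R1"]),         # 0: L.D   F0, 0(R1)
    ("F4", ["F0", "F2"]),   # 1: ADD.D F4, F0, F2
    (None, ["F4", "R1"]),   # 2: S.D   F4, 0(R1)
    ("F0", ["R1"]),         # 3: L.D   F0, -8(R1)
    ("F4", ["F0", "F2"]),   # 4: ADD.D F4, F0, F2
    (None, ["F4", "R1"]),   # 5: S.D   F4, -8(R1)
]

for dep in name_dependences(two_copies):
    print(dep)
# ('WAW', 0, 3, 'F0')
# ('WAR', 1, 3, 'F0')
# ('WAW', 1, 4, 'F4')
# ('WAR', 2, 4, 'F4')
```

None of these pairs passes a value from one copy to the next, which is why giving each copy its own registers is enough to remove them.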
Removing false dependences
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)       ; drop DADDUI and BNE

          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, -8(R1)      ; drop DADDUI and BNE

          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    F12, -16(R1)    ; drop DADDUI and BNE

          L.D    F14, -24(R1)
          ADD.D  F16, F14, F2
          S.D    F16, -24(R1)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop
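The renaming above gives each unrolled copy its own load and sum registers while leaving F2 (the scalar) and R1 alone. A small sketch that reproduces it by substituting register names in a body template; the numbering rule (F0/F4, F6/F8, F10/F12, F14/F16) is inferred from the slide, and the helper name is hypothetical:

```python
import re

# One copy of the loop body; {off} is the memory offset for that copy.
BODY = ["L.D F0, {off}(R1)", "ADD.D F4, F0, F2", "S.D F4, {off}(R1)"]


def rename_copy(k, stride=8):
    """Return copy k of the body with its own load and sum registers."""
    mapping = {
        "F0": "F0" if k == 0 else f"F{4 * k + 2}",  # F0, F6, F10, F14
        "F4": f"F{4 * k + 4}",                      # F4, F8, F12, F16
    }
    renamed = []
    for template in BODY:
        text = template.format(off=-stride * k)
        # Substitute all registers in one pass so a new name is never
        # renamed a second time; unmapped registers such as F2 are kept.
        renamed.append(
            re.sub(r"\bF\d+\b", lambda m: mapping.get(m.group(), m.group()), text)
        )
    return renamed


for k in range(4):
    print(rename_copy(k))
# ['L.D F0, 0(R1)', 'ADD.D F4, F0, F2', 'S.D F4, 0(R1)']
# ['L.D F6, -8(R1)', 'ADD.D F8, F6, F2', 'S.D F8, -8(R1)']
# ['L.D F10, -16(R1)', 'ADD.D F12, F10, F2', 'S.D F12, -16(R1)']
# ['L.D F14, -24(R1)', 'ADD.D F16, F14, F2', 'S.D F16, -24(R1)']
```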
True dependences
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)       ; drop DADDUI and BNE

          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, -8(R1)      ; drop DADDUI and BNE

          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    F12, -16(R1)    ; drop DADDUI and BNE

          L.D    F14, -24(R1)
          ADD.D  F16, F14, F2
          S.D    F16, -24(R1)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop
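After renaming, what remains are the true (read-after-write) dependences: each ADD.D needs its L.D, each S.D needs its ADD.D, and BNE needs DADDUI. A sketch that lists them by tracking the most recent writer of each register, using an invented (destination, sources) encoding and only the first two copies plus the loop overhead:

```python
def true_dependences(program):
    """List RAW dependences as (producer index, consumer index, register).

    program is a list of (destination or None, source registers); each
    source is matched to the most recent earlier write of that register.
    """
    last_writer = {}
    found = []
    for j, (dest, srcs) in enumerate(program):
        for reg in srcs:
            if reg in last_writer:
                found.append((last_writer[reg], j, reg))
        if dest is not None:
            last_writer[dest] = j
    return found


renamed_loop = [
    ("F0", ["R1"]),         # 0: L.D    F0, 0(R1)
    ("F4", ["F0", "F2"]),   # 1: ADD.D  F4, F0, F2
    (None, ["F4", "R1"]),   # 2: S.D    F4, 0(R1)
    ("F6", ["R1"]),         # 3: L.D    F6, -8(R1)
    ("F8", ["F6", "F2"]),   # 4: ADD.D  F8, F6, F2
    (None, ["F8", "R1"]),   # 5: S.D    F8, -8(R1)
    ("R1", ["R1"]),         # 6: DADDUI R1, R1, #-32
    (None, ["R1", "R2"]),   # 7: BNE    R1, R2, Loop
]

print(true_dependences(renamed_loop))
# [(0, 1, 'F0'), (1, 2, 'F4'), (3, 4, 'F6'), (4, 5, 'F8'), (6, 7, 'R1')]
```

Each chain stays inside one copy, so the scheduler is free to interleave the copies as long as the order within each chain is kept.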