Multithreading and Parallelism: Maximizing Processor Usage and Resource Efficiency
Presentation Transcript
Process vs Thread • Thread : Instruction sequence • Own registers/stack • Share memory with other threads in a process (program)
Threaded Code • Demo…
Multithreading • Multithreading • Alternate or combine threads to maximize use of processor • Hardware required • Multiple register sets • Track "owner" of pipeline instructions
Resource Usage • Code running in a superscalar pipeline • Can't always fill all 4 issue slots • Have bubbles from memory access, page faults, etc…
Threading Examples • Assumptions: • Three threads of work • In-order execution • Must obey stalls (i.e. A3 is 3+ cycles after A2)
Threading Examples • Assumptions: • Three threads of work • In-order execution • Must obey stalls (i.e. A3 is 3+ cycles after A2) • Two-wide pipeline (two instructions per cycle)
Multithreading • Coarse Grained Multithreading • Threads run until stall • Cache miss, page fault • Other long event • On stall, drain pipeline, and start next thread
Coarse Example • Coarse grained multithreading • Avoids waiting for long periods • Wastes time on context switches • 16/30 possible units of work
Multithreading • Coarse Grained • Assumption: 1 cycle to retire after stall Threads to run Single Pipeline Time
Multithreading • Coarse Grained • Assumption: 1 cycle to retire after stall Threads to run Dual Pipeline Time
Multithreading • Coarse Grained Multithreading • Avoids wasting time on large stalls • Context switches waste time • Ex: Does work in 16/30 possible slots
Multithreading • Fine Grained Multithreading • Every cycle, switch threads
Multithreading • Fine Grained • Switch each cycle to next ready thread Threads to run Single Pipeline Time
Multithreading • Fine Grained • Switch each cycle to next ready thread Threads to run Dual Pipeline Time • A6 can't run until 4 cycles after A5, so it gets skipped at time 10
Multithreading • Fine Grained Multithreading • More responsive for each thread • Significant hardware required • Multiple register sets • Track "owner" of pipeline instructions • Ex: Finishes in 15 steps; 24 out of 30 possible units of work
Latency vs Throughput • Multithreading favors throughput over latency • Longer to do any one task • Shorter overall to do all
Multithreading • SMT : Simultaneous Multithreading • AKA Hyperthreading • Can issue instructions from multiple threads in one cycle
SMT • SMT : Simultaneous Multithreading • AKA Hyperthreading • Execution units can each work on different threads
Multithreading SMT • Switch like fine grained • Do work from multiple threads if needed to fill pipelines Threads to run • B4 not ready, but C3 is Time
Multithreading SMT • Switch like fine grained • Do work from multiple threads if needed to fill pipelines Threads to run • C5 not ready but A5 is Time
Multithreading SMT • Switch like fine grained • Do work from multiple threads if needed to fill pipelines Threads to run • B4, C5, A6 all waiting Time
Multithreading • Simultaneous Multithreading • Better potential to use all hardware execution units • Depends on complementary workloads • More bookkeeping required
SMT Challenges • Resources must be duplicated or split • Split too thin hurts performance… • Duplicate everything and you aren't maximizing use of hardware…
Intel vs AMD • Variations on SMT
Intel vs AMD • AMD Zen architecture
Development • Single Core
Development • Single Core with Multithreading • 2002 Pentium 4 / Xeon
Development • Multi Processor • Multiple processors coexisting in system • PC space in ~1995
Development • Multi Core • Multiple CPUs on one chip • PC space in ~2005
Development • Modern Complexity • Many cores • Private / Shared cache levels
Development • Massively Parallel Systems
UMA • Uniform Memory Access • Every processor sees every memory using same addresses • Same access time for any CPU to any memory word
NUMA • Non Uniform Memory Access • Single memory address space visible to all CPUs • Some memory local • Fast • Some memory remote • Accessed in same way, but slower
Sunway Architecture • One chip: 256 cores, ~1.5 GHz • Computer: 40,000+ chips
Multiprocessing & Memory • Memory demo…
Memory Access • Race conditions : unpredictable effects of sharing memory • Ex: two unsynchronized updates may add 10, 1, or 11 to x
Memory Access • Synchronization – using locks to prevent others from accessing shared memory
Memory Access • Synchronization issues: • No longer parallel • Deadlock
Cache Coherence • Cache Coherence : Making sure all cached copies of a memory location stay in sync
Cache Coherence • Cache Coherence : • Need ability to snoop on activity and/or broadcast changes
Cache Coherence • Cache Coherence : • Need ability to snoop on activity and/or broadcast changes • A broadcasts a write on X; B knows it no longer has a valid value
Cache Coherence • Cache Coherence : • Need ability to snoop on activity and/or broadcast changes • A snoops on B asking for X, provides the new value, and updates memory