These slides present Hardware Transactional Memory (HTM) for GPU architectures and the challenges it raises: lockstep SIMD execution, transaction rollback, conflict detection, and write buffering across thousands of concurrent threads. The authors, Wilson Fung, Inderpreet Singh, Andrew Brownsword, and Tor Aamodt of the University of British Columbia, present KILO TM, which uses value-based conflict detection and does not depend on a cache coherence protocol. On applications such as Barnes-Hut N-body simulation, KILO TM captures 59% of fine-grained locking performance at about 0.5% area overhead.
Hardware Transactional Memory for GPU Architectures
Wilson W. L. Fung, Inderpreet Singh, Andrew Brownsword, Tor M. Aamodt
University of British Columbia
In Proc. 2011 ACM/IEEE Int’l Symp. Microarchitecture (MICRO-44)
Motivation
• Lifetime of GPU application development
[Figure: performance and functionality plotted against development time, comparing fine-grained locking with transactional memory]
• E.g. N-body with 5M bodies
  • CUDA SDK: O(n²), 1640 s (barrier)
  • Barnes-Hut: O(n log n), 5.2 s (locks)
Wilson Fung, Inderpreet Singh, Andrew Brownsword, Tor Aamodt | Hardware TM for GPU Architectures
Are TM and GPUs Incompatible?
GPUs differ from multi-core CPUs
• 1000s of concurrent scalar threads
• Challenges (from a TM perspective)
Our solution: KILO TM
• Hardware TM for GPUs
Hardware TM for GPUs Challenge #1: SIMD Hardware
• On GPUs, scalar threads in a warp/wavefront execute in lockstep
A warp with 4 scalar threads (T0 to T3):
    ...
    TxBegin
    LD  r2,[B]
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit
    ...
[Figure: at TxCommit, some of T0 to T3 are committed and the others aborted]
Branch divergence!
KILO TM – Solution to Challenge #1: SIMD Hardware
• Transaction abort is treated like a loop
• Extend the SIMT stack
    ...
    TxBegin
    LD  r2,[B]
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit
    ...
[Figure: "Abort" arrow alongside the TxBegin ... TxCommit region]
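The "abort like a loop" idea can be sketched as a host-side simulation. This is only an illustration of the masking behaviour, not the SIMT stack extension itself, whose details the slide does not give. The names `run_warp_transaction`, `commit_ok`, and `example_commit_ok` are hypothetical, and the per-thread commit outcome is supplied by the caller rather than modelled.

```python
def run_warp_transaction(num_threads, commit_ok):
    """Re-execute the transaction for aborted threads only, like a loop.

    commit_ok(tid, attempt) is a stand-in (assumption) for the hardware's
    per-thread commit outcome. Returns attempts needed per thread.
    """
    # Active mask at TxBegin: every thread of the warp enters the transaction.
    active = set(range(num_threads))
    attempts = {tid: 0 for tid in active}
    attempt = 0
    while active:  # the "loop" formed by TxBegin ... TxCommit
        attempt += 1
        retry = set()
        for tid in sorted(active):  # active threads run the body in lockstep
            attempts[tid] += 1
            if not commit_ok(tid, attempt):
                retry.add(tid)  # aborted thread joins the retry mask
        # Committed threads wait; only aborted threads re-run the body.
        active = retry
    return attempts


# Assumed outcome for illustration: threads 1 and 3 abort on the first attempt.
def example_commit_ok(tid, attempt):
    return not (attempt == 1 and tid in (1, 3))


attempts = run_warp_transaction(4, example_commit_ok)
```

Treating the retry set as a new active mask is what lets committed and aborted threads of the same warp diverge and later reconverge, the same way a loop with a data-dependent trip count does.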
Hardware TM for GPUs Challenge #2: Transaction Rollback
• CPU core: 10s of registers; checkpoint register file used @ TX entry and @ TX abort
• GPU core (SM): 32k registers in the register file, shared by many warps
• 2MB total on-chip storage
• Checkpoint?
KILO TM – Solution to Challenge #2: Transaction Rollback
• SW register checkpoint
  • Most TX: registers overwritten at first use
  • TX in Barnes Hut: checkpoint 2 registers
    TxBegin
    LD  r2,[B]
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit
[Figure labels: "Overwritten", "Abort"]
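Why most transactions need no register checkpoint can be illustrated with a small analysis over straight-line code. This is a sketch under stated assumptions, not the compiler pass used for KILO TM: the function name `registers_to_checkpoint` and the `(dst, srcs)` instruction encoding are hypothetical, and control flow is ignored.

```python
def registers_to_checkpoint(instructions):
    """Return registers whose TX-entry value is read and later overwritten.

    Each instruction is (dst, srcs): dst is a register name or None,
    srcs lists the registers the instruction reads.
    Straight-line code only (assumption of this sketch).
    """
    written = set()           # registers already written inside the TX
    live_in = set()           # registers read before any write inside the TX
    needs_checkpoint = set()
    for dst, srcs in instructions:
        for reg in srcs:
            if reg not in written:
                live_in.add(reg)
        if dst is not None:
            if dst in live_in:
                # The TX-entry value was used and is now destroyed,
                # so a re-execution after abort would see the wrong value.
                needs_checkpoint.add(dst)
            written.add(dst)
    return needs_checkpoint


# The transaction from the slide: r2 is overwritten at its first use.
tx_slide = [("r2", []),        # LD  r2,[B]
            ("r2", ["r2"]),    # ADD r2,r2,2
            (None, ["r2"])]    # ST  r2,[A]

# A hypothetical transaction that consumes a live-in register.
tx_increment = [("r1", ["r1"]),  # ADD r1,r1,1
                (None, ["r1"])]  # ST  r1,[A]

slide_result = registers_to_checkpoint(tx_slide)
increment_result = registers_to_checkpoint(tx_increment)
```

For the slide's transaction the result is empty, which matches the observation that registers overwritten at first use can simply be recomputed when the transaction re-executes.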
Hardware TM for GPUs Challenge #3: Conflict Detection
Existing HTMs use the cache coherence protocol
• Not available on GPUs
• No private data cache per thread
Signatures?
• 1024-bit / thread
• 3.8MB / 30k threads
Hardware TM for GPUs Challenge #4: Write Buffer
• GPU core (SM): one L1 data cache shared by many warps, 1024-1536 threads
• Fermi’s L1 data cache (48kB) = 384 x 128B lines
• Problem: 384 lines / 1536 threads < 1 line per thread!
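The sizing figures quoted for Challenges #3 and #4 can be checked with a few lines of arithmetic. The only assumption is that "30k threads" means 30,000 and that megabytes are decimal.

```python
# Challenge #3: per-thread signatures for conflict detection.
SIGNATURE_BITS_PER_THREAD = 1024
THREADS = 30_000  # "30k threads" read as 30,000 (assumption)
signature_bytes = SIGNATURE_BITS_PER_THREAD // 8 * THREADS
signature_mb = signature_bytes / 1e6  # decimal megabytes (assumption)

# Challenge #4: using the L1 data cache as a transactional write buffer.
L1_BYTES = 48 * 1024
LINE_BYTES = 128
MAX_THREADS_PER_SM = 1536
l1_lines = L1_BYTES // LINE_BYTES
lines_per_thread = l1_lines / MAX_THREADS_PER_SM

print(f"signatures: {signature_mb:.2f} MB")
print(f"L1 lines: {l1_lines}, lines per thread: {lines_per_thread}")
```

Signature storage comes to roughly 3.8 MB, more than the 2MB of on-chip storage quoted for Challenge #2, and each thread gets a quarter of a cache line for buffered writes.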
KILO TM: Value-Based Conflict Detection
TX1: atomic {B=A+1}
    TxBegin
    LD  r1,[A]
    ADD r1,r1,1
    ST  r1,[B]
    TxCommit
TX2: atomic {A=B+2}
    TxBegin
    LD  r2,[B]
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit
[Figure: global memory starts with A=1, B=0. TX1 private memory: read-log A=1, write-log B=2. TX2 private memory: read-log B=0, write-log A=2.]
• Self-validation + abort
  • Only detects existence of conflict (not identity)
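A minimal software sketch of value-based conflict detection, using the two transactions from the slide. The `Transaction` class and its method names are hypothetical; the real logs live in private memory and validation is done in hardware, neither of which is modelled here.

```python
class Transaction:
    """Buffers reads and writes in private logs (a sketch, not the hardware)."""

    def __init__(self, memory):
        self.memory = memory
        self.read_log = {}   # address -> value observed during execution
        self.write_log = {}  # address -> value to publish at commit

    def load(self, addr):
        if addr in self.write_log:        # read own buffered write
            return self.write_log[addr]
        if addr not in self.read_log:     # log the first observed value
            self.read_log[addr] = self.memory[addr]
        return self.read_log[addr]

    def store(self, addr, value):
        self.write_log[addr] = value      # global memory is not touched yet

    def validate(self):
        # Value-based check: every logged read must still match memory.
        return all(self.memory[a] == v for a, v in self.read_log.items())

    def commit(self):
        self.memory.update(self.write_log)


memory = {"A": 1, "B": 0}

tx1 = Transaction(memory)             # atomic { B = A + 1 }
tx1.store("B", tx1.load("A") + 1)

tx2 = Transaction(memory)             # atomic { A = B + 2 }
tx2.store("A", tx2.load("B") + 2)

tx1_valid = tx1.validate()            # nothing has changed yet
tx1.commit()                          # publishes B=2
tx2_valid = tx2.validate()            # TX2 logged B=0, memory now holds B=2
```

TX2 learns that a conflict exists because its logged value of B no longer matches memory, but it does not learn which transaction caused it. That is the "existence, not identity" property noted on the slide.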
Parallel Validation? Data Race!?!
TX1: atomic {B=A+1}, TX2: atomic {A=B+2}
• Valid serial outcomes: Tx1 then Tx2: A=4, B=2 OR Tx2 then Tx1: A=2, B=3
[Figure: global memory A=1, B=0. TX1: read-log A=1, write-log B=2. TX2: read-log B=0, write-log A=2. Both validate in parallel against the same memory, both commit, and memory ends with A=2, B=2, which matches neither serial outcome.]
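The race can be reproduced in a few lines. This sketch hard-codes the read and write logs shown in the figure and lets both transactions validate before either commits; the `validate` helper and the dictionary layout are illustrative assumptions.

```python
def validate(memory, read_log):
    """Value-based validation: logged reads must still match memory."""
    return all(memory[a] == v for a, v in read_log.items())


memory = {"A": 1, "B": 0}
tx1 = {"read_log": {"A": 1}, "write_log": {"B": 2}}  # atomic { B = A + 1 }
tx2 = {"read_log": {"B": 0}, "write_log": {"A": 2}}  # atomic { A = B + 2 }

# Parallel validation: both checks run before either commit is visible.
ok1 = validate(memory, tx1["read_log"])
ok2 = validate(memory, tx2["read_log"])

if ok1:
    memory.update(tx1["write_log"])
if ok2:
    memory.update(tx2["write_log"])

# The two outcomes a serial execution could produce.
serial_outcomes = [{"A": 4, "B": 2}, {"A": 2, "B": 3}]
is_serializable = memory in serial_outcomes
```

Both validations pass because neither transaction sees the other's write, so the final state corresponds to no serial order.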
Serialize Validation?
[Figure: timeline of TX1 and TX2 passing through the commit unit in front of global memory; one transaction stalls while the other performs V + C]
V = Validation, C = Commit
• Benefit #1: No data race
• Benefit #2: No livelock
• Drawback: Serializes non-conflicting transactions (“collateral damage”)
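Serialized validation and commit can be sketched as follows. The helper names (`execute`, `commit_serially`, `tx1_body`, `tx2_body`) are hypothetical, and the immediate re-execution inside the commit loop is a simplification: an aborted transaction would really re-run on the core and rejoin the commit order.

```python
def tx1_body(load):
    """atomic { B = A + 1 }; returns the write log."""
    return {"B": load("A") + 1}


def tx2_body(load):
    """atomic { A = B + 2 }; returns the write log."""
    return {"A": load("B") + 2}


def execute(memory, body):
    """Run a transaction body speculatively, recording the values it reads."""
    read_log = {}

    def load(addr):
        read_log.setdefault(addr, memory[addr])
        return read_log[addr]

    write_log = body(load)
    return read_log, write_log


def commit_serially(memory, bodies):
    """Validate and commit one transaction at a time; re-execute on abort."""
    # Every transaction first executes against the same initial memory.
    pending = [(body, *execute(memory, body)) for body in bodies]
    aborts = 0
    for body, read_log, write_log in pending:
        while not all(memory[a] == v for a, v in read_log.items()):
            aborts += 1
            read_log, write_log = execute(memory, body)  # retry (simplified)
        memory.update(write_log)  # V + C complete before the next TX starts
    return aborts


memory = {"A": 1, "B": 0}
aborts = commit_serially(memory, [tx1_body, tx2_body])
```

With validation and commit serialized, TX2 fails validation once, re-executes against TX1's result, and the final state is the "Tx1 then Tx2" outcome. The cost is that even transactions touching unrelated data wait for each other.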
Solution: Speculative Validation
Key idea: split conflict detection into two parts
• Recently committed TX: in parallel
• Concurrently committing TX: in commit order
• Approximate
[Figure: timeline of TX1, TX2 and TX3 with overlapping V+C phases in the commit unit in front of global memory; labels "RS" and "Stall"]
V = Validation, C = Commit
Conflicts are rare, so commit parallelism is good
KILO TM Implementation
• Minimal modification to existing GPU arch.
[Figure: GPU block diagram highlighting SIMT Stacks, Commit Unit, TX Log Unit]
Evaluation Methodology
• GPGPU-Sim 3.0 (BSD license)
  • Detailed: IPC correlation of 0.93 vs GT200
  • KILO TM (timing-driven memory accesses)
• GPU TM applications
  • Hash Table (HT-H, HT-L)
  • Bank Account (ATM)
  • Cloth Physics (CL)
  • Barnes Hut (BH)
  • CudaCuts (CC)
  • Data Mining (AP)
Performance (vs. Serializing TX)
[Figure: performance relative to serializing TX; higher is better]
• Serializing TX ≈ coarse-grained locks
Performance (Exec. Time)
[Figure: normalized execution time (axis 0 to 3) of Ideal TM, KILO TM and FG Lock for HT-H, HT-L, ATM, CL, BH, CC, AP; lower is better]
• Captures 59% of FG lock performance
Implementation Complexity
• Logs in private memory @ L1 data cache
• Commit unit
  • 5kB Last Writer History Unit
  • 19kB Transaction Status
  • 32kB Read-Set and Write-Set Buffer
• CACTI 5.3 @ 40nm
  • 0.40mm² x 6 memory partitions
  • 0.5% of 520mm²
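The area figure follows directly from the numbers on the slide. The sketch below also sums the three listed storage structures, on the assumption that all three belong to a single commit unit as the slide's indentation suggests.

```python
# Storage structures listed under "Commit unit" (kB); per-unit total is an
# assumption based on how the slide groups them.
COMMIT_UNIT_STORAGE_KB = {
    "last_writer_history_unit": 5,
    "transaction_status": 19,
    "read_write_set_buffer": 32,
}
storage_per_unit_kb = sum(COMMIT_UNIT_STORAGE_KB.values())

# Area estimate from the slide: 0.40 mm^2 per unit, 6 memory partitions.
AREA_PER_UNIT_MM2 = 0.40
MEMORY_PARTITIONS = 6
DIE_AREA_MM2 = 520

total_area_mm2 = AREA_PER_UNIT_MM2 * MEMORY_PARTITIONS
area_fraction = total_area_mm2 / DIE_AREA_MM2

print(f"{storage_per_unit_kb} kB per commit unit")
print(f"{total_area_mm2:.2f} mm^2 total, {area_fraction:.2%} of the die")
```

Six units at 0.40mm² come to 2.4mm², a little under half a percent of a 520mm² die, which the slide rounds to 0.5%.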
Summary
• KILO TM: Hardware TM for GPUs
  • 1000s of concurrent scalar TXs
  • Handles scalar TX abort
  • No cache coherence protocol dependency
  • Word-level conflict detection
  • Unbounded transactions
• 59% of fine-grained locking performance
• 128X faster than serializing TX execution
• 0.5% area overhead
Questions?
Backup Slides
ABA Problem?
• Classic example: linked-list-based stack
[Figure: the stack starts as top → A → B → C → Null. Thread 0 reads t = A and next = B. Thread 2 then pops A, pops B and pushes A, leaving top → A → C → Null. Thread 0's CAS still sees top == A and sets top to B, which is no longer in the stack.]
• Thread 0 – pop():
    while (true) {
      t = top;
      next = t->Next;
      // thread 2: pop A, pop B, push A
      if (atomicCAS(&top, t, next) == t)
        break; // succeeds!
    }
ABA Problem?
• atomicCAS protects only a single word
  • Only part of the data structure
• Value-based conflict detection protects all relevant parts of the data structure
    while (true) {
      t = top;
      next = t->Next;
      if (atomicCAS(&top, t, next) == t)
        break; // succeeds!
    }
[Figure: stack top → A → B → C → Null]
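The difference between the two approaches can be shown with a small single-threaded simulation in which the interfering thread's operations are injected at the critical point. Memory is modelled as a dictionary of words; the names `make_stack`, `cas`, and `interfere` are hypothetical, and `cas` only mimics the return-old-value behaviour of atomicCAS.

```python
def make_stack():
    """Stack top -> A -> B -> C -> None, one dictionary entry per word."""
    return {"top": "A", "A.next": "B", "B.next": "C", "C.next": None}


def cas(memory, addr, expected, new):
    """Single-word compare-and-swap; returns the old value."""
    old = memory[addr]
    if old == expected:
        memory[addr] = new
    return old


def interfere(memory):
    """The other thread: pop A, pop B, push A."""
    memory["top"] = "B"        # pop A
    memory["top"] = "C"        # pop B
    memory["A.next"] = "C"     # push A: link A in front of C
    memory["top"] = "A"


# CAS-based pop: only the word `top` is checked.
mem_cas = make_stack()
t = mem_cas["top"]
nxt = mem_cas[t + ".next"]     # reads B
interfere(mem_cas)             # top is A again, but A.next is now C
cas_succeeded = cas(mem_cas, "top", t, nxt) == t
cas_top = mem_cas["top"]       # B, which was already popped

# Transactional pop with value-based validation: every read is checked.
mem_tx = make_stack()
read_log = {"top": mem_tx["top"]}
read_log["A.next"] = mem_tx["A.next"]
write_log = {"top": read_log["A.next"]}
interfere(mem_tx)
tx_valid = all(mem_tx[a] == v for a, v in read_log.items())
if tx_valid:
    mem_tx.update(write_log)
```

The CAS succeeds and leaves `top` pointing at a node that is no longer in the stack, while the transaction fails validation because its logged value of `A.next` changed, so it would abort and retry against the current list.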