This presentation explores concurrent data structures designed for architectures with limited shared memory support, focusing on many-core systems such as Intel's Single-chip Cloud Computer (SCC). It discusses the challenges of, and strategies for, implementing efficient synchronization through both coarse- and fine-grained locking techniques and non-blocking mechanisms. Several queue implementations, including a traditional 2-lock queue and message-passing-based queues, are examined in detail, covering their performance, scalability, and progress guarantees. The presentation also highlights the role of hardware advances in multi-core systems in overcoming these sharing limitations.
Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden
Concurrent Data Structures in Architectures with Limited Shared Memory Support
Ivan Walulya, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas
Concurrent Data Structures • Parallel/Concurrent programming: • Share data among threads/processes over a uniform address space (shared memory) • Inter-process/thread communication and synchronization • Both a tool and a goal
Concurrent Data Structures: Implementations • Coarse-grained locking • Easy but slow... • Fine-grained locking • Fast/scalable but: error-prone, deadlocks • Non-blocking • Atomic hardware primitives (e.g. TAS, CAS) • Good progress guarantees (lock-/wait-freedom) • Scalable
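To make the trade-off concrete, here is a minimal sketch (not from the slides): one shared counter updated behind a single coarse-grained lock, and alternatively with a lock-free CAS retry loop, using pthreads and GCC atomic builtins.

```c
/*
 * Minimal sketch (not from the slides): one shared counter updated
 * (a) behind a single coarse-grained lock and (b) with a lock-free
 * CAS retry loop, using pthreads and GCC atomic builtins.
 */
#include <pthread.h>
#include <stdint.h>

static uint64_t counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Coarse-grained locking: simple, but every update is serialized. */
void locked_increment(void) {
    pthread_mutex_lock(&counter_lock);
    counter++;
    pthread_mutex_unlock(&counter_lock);
}

/* Non-blocking: retry with CAS until the update succeeds (lock-free). */
void cas_increment(void) {
    uint64_t old = __atomic_load_n(&counter, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n(&counter, &old, old + 1,
                                        0 /* strong CAS */,
                                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED)) {
        /* On failure 'old' now holds the current value; just retry. */
    }
}
```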
What’s happening in hardware? • Multi-cores → many-cores • “Cache coherency wall” [Kumar et al 2011] • Shared address space will not scale • Universal atomic primitives (CAS, LL/SC) harder to implement • Shared memory → message passing
Networks on Chip (NoC) • Short distance between cores • Message passing model support • Shared memory support • Eliminated cache coherency • Limited support for synchronization primitives • Can we have data structures that are: Fast, Scalable, with Good progress guarantees?
Outline • Concurrent Data Structures • Many-core architectures • Intel’s SCC • Concurrent FIFO Queues • Evaluation • Conclusion
Single-chip Cloud Computer (SCC) • Experimental processor by Intel • 48 independent x86 cores arranged on 24 tiles • NoC connects all tiles • TestAndSet register per core
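A rough sketch of how such a per-core register can back a simple spinlock. The tas_read/tas_reset names are hypothetical accessors standing in for the memory-mapped register, and the read-to-acquire/write-to-release semantics are an assumption modeled on RCCE-style locks, not taken from the slides.

```c
/*
 * Sketch of a spinlock built on an SCC-style per-core test-and-set register.
 * tas_read()/tas_reset() are hypothetical accessors; the assumed semantics
 * (a read atomically returns the register value and clears it, a write sets
 * it back to 1) are modeled on RCCE-style locks.
 */
typedef struct { int core; } scc_lock_t;  /* lock = the TAS register of one core */

extern unsigned tas_read(int core);   /* hypothetical: atomic read-and-clear */
extern void     tas_reset(int core);  /* hypothetical: write 1 = release     */

void scc_lock(scc_lock_t *l) {
    /* Spin until the read returns non-zero, i.e. we grabbed the register. */
    while (tas_read(l->core) == 0)
        ;
}

void scc_unlock(scc_lock_t *l) {
    tas_reset(l->core);
}
```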
SCC: Architecture Overview • Message Passing Buffer (MPB): 16 KB per tile • Memory controllers: to private & shared main memory
Programming Challenges on the SCC • Message passing, but… • MPB is too small for large data transfers • Data replication is difficult • No universal atomic primitives (CAS); no wait-free implementations [Herlihy91]
Outline • Concurrent Data Structures • Many-core architectures • Intel’s SCC • Concurrent FIFO Queues • Evaluation • Conclusion
Concurrent FIFO Queues • Main idea: • Data are stored in shared off-chip memory • Message passing for communication/coordination • 2 design methodologies: • Lock-based synchronization (2-lock Queue) • Message passing-based synchronization (MP-Queue, MP-Acks)
2-lock Queue • Array-based, in shared off-chip memory (SHM) • Head/Tail pointers in MPBs • One lock for each pointer [Michael&Scott96] • TAS-based locks on 2 cores
2-lock Queue: “Traditional” Enqueue Algorithm • Acquire lock • Read & update Tail pointer (MPB) • Add data (SHM) • Release lock
2-lock Queue: Optimized Enqueue Algorithm • Acquire lock • Read & update Tail pointer (MPB) • Release lock • Add data to node in SHM • Set memory flag to dirty • Why? No cache coherency!
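A sketch of this optimized enqueue in C. The helper names (tail_lock, mpb_read_tail, slot_is_full, next, ...) are hypothetical placeholders for the MPB accesses and the TAS-based lock described above, not an API from the paper; scc_lock/scc_unlock come from the spinlock sketch earlier.

```c
/*
 * Sketch of the optimized enqueue. All mpb_* helpers, tail_lock,
 * slot_is_full() and next() are hypothetical placeholders.
 */
typedef struct {
    volatile int flag;    /* EMPTY until the enqueuer has written its data */
    int          value;
} node_t;

#define EMPTY 0
#define DIRTY 1           /* data written and visible in off-chip SHM */

extern node_t     queue[];            /* array-based queue in off-chip SHM */
extern scc_lock_t tail_lock;          /* TAS-based lock guarding the tail  */
extern int  mpb_read_tail(void);      /* hypothetical MPB accessors        */
extern void mpb_write_tail(int slot);
extern int  next(int slot);           /* advance an index, wrapping around */
extern int  slot_is_full(int slot);

int enqueue(int value) {
    scc_lock(&tail_lock);             /* 1. acquire lock                   */
    int slot = mpb_read_tail();       /* 2. read Tail pointer (MPB)        */
    if (slot_is_full(slot)) {         /*    queue full: give up            */
        scc_unlock(&tail_lock);
        return 0;
    }
    mpb_write_tail(next(slot));       /*    update Tail pointer (MPB)      */
    scc_unlock(&tail_lock);           /* 3. release lock early             */

    queue[slot].value = value;        /* 4. add data to the node in SHM    */
    queue[slot].flag  = DIRTY;        /* 5. set the memory flag to dirty   */
    return 1;
}
```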
2-lock Queue: Dequeue Algorithm • Acquire lock • Read & update Head pointer • Release lock • Check flag • Read node data • What about progress?
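The matching dequeue sketch, continuing the hypothetical helpers from the enqueue example. Step 4 is where progress can suffer: if the enqueuer that reserved the slot stalls before setting the flag, the dequeuer spins.

```c
/*
 * Matching dequeue sketch, continuing the hypothetical helpers from the
 * enqueue example above.
 */
extern scc_lock_t head_lock;
extern int  mpb_read_head(void);
extern void mpb_write_head(int slot);

int dequeue(int *out) {
    scc_lock(&head_lock);             /* 1. acquire lock                   */
    int slot = mpb_read_head();       /* 2. read Head pointer (MPB)        */
    if (slot == mpb_read_tail()) {    /*    queue looks empty              */
        scc_unlock(&head_lock);
        return 0;
    }
    mpb_write_head(next(slot));       /*    update Head pointer (MPB)      */
    scc_unlock(&head_lock);           /* 3. release lock early             */

    while (queue[slot].flag != DIRTY) /* 4. check flag: wait until the     */
        ;                             /*    enqueuer's data reaches SHM    */
    *out = queue[slot].value;         /* 5. read the node data             */
    queue[slot].flag = EMPTY;         /*    recycle the slot               */
    return 1;
}
```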
2-lock Queue: Implementation • Locks? On which tile(s)? • Head/Tail pointers (MPB) • Data nodes
Message Passing-based Queue • Data nodes in SHM • Access coordinated by a Server node that keeps the Head/Tail pointers • Enqueuers/Dequeuers request access through dedicated slots in the MPB • Successfully enqueued data are flagged with a dirty bit
MP-Queue • [Diagram: DEQ and ENQ requests, HEAD/TAIL pointers, ADD DATA, SPIN] • What if this fails and is never flagged? • “Pairwise blocking”: only 1 dequeue blocks
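A rough sketch of what the server's polling loop could look like under this scheme. The mpb_* request/reply helpers are hypothetical placeholders for the dedicated per-core MPB slots; next() is the index-advance helper from the earlier sketches.

```c
/*
 * Rough sketch of the server's polling loop for the MP-Queue. The mpb_*
 * helpers are hypothetical placeholders for the per-core MPB slots.
 */
enum { REQ_NONE = 0, REQ_ENQ, REQ_DEQ };

extern int  mpb_poll_request(int core);       /* read this core's request slot */
extern void mpb_reply_slot(int core, int s);  /* hand a queue index back       */
extern void mpb_reply_empty(int core);        /* tell a dequeuer: queue empty  */

void server_loop(int ncores) {
    int head = 0, tail = 0;                   /* server owns Head/Tail         */
    for (;;) {
        for (int c = 0; c < ncores; c++) {
            switch (mpb_poll_request(c)) {
            case REQ_ENQ:
                mpb_reply_slot(c, tail);      /* reserve a slot; the enqueuer  */
                tail = next(tail);            /* adds data + dirty flag in SHM */
                break;
            case REQ_DEQ:
                if (head == tail) { mpb_reply_empty(c); break; }
                mpb_reply_slot(c, head);      /* dequeuer spins on the dirty   */
                head = next(head);            /* flag, then reads from SHM     */
                break;
            default:                          /* no request from this core     */
                break;
            }
        }
    }
}
```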
Adding Acknowledgements • No more flags! The enqueuer sends an ACK when done • Server maintains a private queue of pointers in SHM • On ACK: Server adds the data location to its private queue • On Dequeue: Server returns only ACKed locations
MP-Acks • [Diagram: DEQ and ENQ requests, HEAD/TAIL pointers, ACK] • No blocking between enqueues/dequeues
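A sketch of how the server side could change in the MP-Acks variant, reusing the placeholder names from the previous sketch. The REQ_ACK message type and the ack_queue_* helpers (the server's private queue of ACKed locations in SHM) are assumptions for illustration.

```c
/*
 * Sketch of the server-side change for MP-Acks. REQ_ACK and the
 * ack_queue_* helpers are hypothetical, added only for illustration.
 */
enum { REQ_ACK = 3 };

extern int  mpb_read_acked_slot(int core);  /* which slot this ACK refers to */
extern void ack_queue_push(int slot);       /* server-private queue in SHM   */
extern int  ack_queue_pop(int *slot);       /* 0 if nothing has been ACKed   */

/* In the server loop, ACK and DEQ requests are now handled like this: */
void handle_ack(int core) {
    ack_queue_push(mpb_read_acked_slot(core)); /* location is now safe to hand out */
}

void handle_deq(int core) {
    int slot;
    if (ack_queue_pop(&slot))          /* only completed (ACKed) enqueues    */
        mpb_reply_slot(core, slot);    /* are ever returned to a dequeuer    */
    else
        mpb_reply_empty(core);
}
```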
Outline • Concurrent Data Structures • Many-core architectures • Intel’s SCC • Concurrent FIFO Queues • Evaluation • Conclusion
Evaluation Benchmark: • Each core performs Enq/Deq operations at random • High/Low contention • Performance? Scalability? • Is it the same for all cores?
Measures • Throughput: data structure operations completed per time unit [Cederman et al 2013] • Fairness: relates the operations completed by core i to the average operations per core
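For reference, a hedged reconstruction of the fairness measure the slide fragments seem to describe; this exact formula is an assumption, not taken verbatim from the slides or from Cederman et al. 2013.

```latex
% Hedged reconstruction (an assumption): fairness compares each core's
% completed operations n_i against the average over all P cores.
\[
  \text{fairness} \;=\; \min_{i}\,\frac{n_i}{\bar{n}},
  \qquad
  \bar{n} \;=\; \frac{1}{P}\sum_{j=1}^{P} n_j
\]
```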
Throughput – High Contention
Fairness – High Contention
Throughput VS Lock Location
Throughput VS Lock Location
Conclusion • Lock-based queue • High throughput • Less fair • Sensitive to lock locations, NoC performance • MP-based queues • Lower throughput • Fairer • Better liveness properties • Promising scalability
Thank you! ivanw@chalmers.se ioaniko@chalmers.se
Backup slides
Experimental Setup • 533 MHz cores, 800 MHz mesh, 800 MHz DDR3 • Randomized Enq/Deq operations • High/Low contention • One thread per core • 600 ms per execution • Averaged over 12 runs
Concurrent FIFO Queues • Typical 2-lock queue [Michael&Scott96]