
Presentation Transcript


  1. Kernel-Kernel Communication in a Shared-memory Multiprocessor • Eliseu Chaves et al., May 1993 • Presented by Tina Swenson, May 27, 2010

  2. Agenda • Introduction • Remote Invocation • Remote Memory Access • RI/RA Combinations • Case Study • Conclusion

  3. Introduction

  4. Introduction • There’s more than one way to handle large shared memory systems • Remote Memory • we’ve studied this a lot! • Remote Invocation • message passing • Trade-offs are discussed • Theories tested with a case study

  5. Motivation • UMA designs won’t scale • NUMA was seen as the future • It is implemented in commercial CPUs • NUMA allows programmers to choose shared memory or remote invocation • The authors discuss the trade-offs

  6. Kernel-kernel Communication • Each processor has: • Full range of kernel services • Reasonable performance • Access to all memory on the machine • Locality – key to RI success • Previous kernel experience shows that most memory accesses tend to be local to the “node” “...most memory accesses will be local even when using remote memory accesses for interkernel communication, and that the total amount of time spent waiting for replies from other processors when using remote invocation will be small...”

  7. NUMA • NUMA without cache-coherence • 3 methods of kernel-kernel communication • Remote Memory Access • Operation executes on node i, accessing node j’s memory as needed. • Remote Invocation • Node i processor sends a message to node j processor asking j to perform i’s operations. • Bulk data transfer • Kernel moves data from node to node.
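To make the first two options concrete, here is a minimal C sketch of the same kernel operation done both ways. remote_ptr() and remote_invoke() are hypothetical primitives standing in for whatever mapping and messaging mechanisms the kernel provides; this is not code from the paper.

    /* Hypothetical primitives, for illustration only. */
    typedef struct { long hits; } stats_t;
    extern stats_t *remote_ptr(int node);        /* node's memory, mapped locally */
    extern void remote_invoke(int node, void (*op)(stats_t *)); /* ship op to node */

    /* Remote memory access: runs on node i; every load/store to node j's
     * memory pays the remote/local access-time ratio. */
    void bump_ra(int node_j)
    {
        stats_t *s = remote_ptr(node_j);
        s->hits++;                   /* one remote read + one remote write */
    }

    /* Remote invocation: one fixed messaging cost; the operation then runs
     * on node j's own processor with purely local references. */
    static void bump_op(stats_t *s) { s->hits++; }   /* executed on node j */

    void bump_ri(int node_j)
    {
        remote_invoke(node_j, bump_op);
    }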

  8. Remote Invocation

  9. Remote Invocation (RI) • Instead of moving data around the architecture, move the operations to the data! • Message Passing

  10. Interrupt-Level RI (ILRI) • Fast • For operations that can be safely executed in an interrupt handler • Limitations: • Non-blocking (thus no locks) operations only • interrupt handlers lack process context • Deadlock prevention severely limits when we can use ILRI
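A minimal sketch of how an ILRI might be plumbed, assuming a single-slot per-node mailbox in shared memory and a hypothetical send_ipi() interprocessor-interrupt primitive (neither is from the paper). The real constraint it illustrates is the one above: the handler runs without process context and must not block.

    #include <stdatomic.h>

    #define NODES 64

    typedef void (*ilri_op)(void *arg);

    struct mailbox {
        _Atomic(ilri_op) op;             /* operation requested by a remote node */
        void *arg;
    };

    extern struct mailbox mbox[NODES];   /* one per node, in shared memory */
    extern void send_ipi(int node);      /* hypothetical interprocessor interrupt */

    /* Node i asks node j to perform op. A real kernel would queue
     * requests; the single slot keeps the sketch short. */
    void ilri_request(int j, ilri_op op, void *arg)
    {
        mbox[j].arg = arg;
        atomic_store(&mbox[j].op, op);   /* publish op only after arg is written */
        send_ipi(j);
    }

    /* Node j's interrupt handler: runs the operation to completion.
     * No process context, so it may not block or take blocking locks. */
    void ilri_interrupt_handler(int j)
    {
        ilri_op op = atomic_exchange(&mbox[j].op, (ilri_op)0);
        if (op)
            op(mbox[j].arg);
    }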

  11. Process-Level RI (PLRI) • Slower • Requires a context switch and possible synchronization with other running processes • Used for longer operations • Deadlock is avoided because the invoking process can block
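By contrast, a PLRI request is handed to a kernel process on the target node. A minimal sketch, assuming a hypothetical blocking plri_dequeue() primitive rather than anything from the paper:

    struct plri_req {
        void (*op)(void *arg);
        void *arg;
    };

    /* Hypothetical: blocks the daemon until a remote node queues work. */
    extern struct plri_req *plri_dequeue(int node);

    /* Per-node kernel daemon. Because it has full process context, the
     * operation may block and take ordinary locks; the price is the
     * context switch that makes PLRI slower than ILRI. */
    void plri_daemon(int node)
    {
        for (;;) {
            struct plri_req *r = plri_dequeue(node);
            r->op(r->arg);
        }
    }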

  12. Remote Memory Access

  13. Memory Considerations • If remote memory access is used, how is it affected by memory consistency models (not addressed in this paper)? • Strong consistency models will incur contention • Weak consistency models widen the cost gap between normal instructions and synchronization instructions • And require the use of memory barriers From Professor Walpole’s slides.
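As a concrete illustration of the last point, here is a standard C11 release/acquire pairing (illustrative names; not tied to the paper, which targets a machine without cache coherence). On a weakly ordered machine, the barriers are what guarantee the consumer sees the payload:

    #include <stdatomic.h>
    #include <stdbool.h>

    static int payload;            /* data produced by one processor */
    static atomic_bool ready;      /* synchronization flag           */

    void producer(void)
    {
        payload = 42;
        /* Release barrier: the payload store is visible before the flag. */
        atomic_store_explicit(&ready, true, memory_order_release);
    }

    int consumer(void)
    {
        /* Acquire barrier: pairs with the release above. */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                      /* spin */
        return payload;            /* guaranteed to read 42 */
    }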

  14. RI/RA Combinations

  15. Mixing RI/RA • ILRI, PLRI and shared memory are compatible, as long as guidelines are followed. “It is easy to use different mechanisms for unrelated data structures.”

  16. Using RA with PLRI • Remote Access and Process-level Remote Invocation can be used on the same data structure if: • synchronization methods are compatible

  17. Using RA with ILRI • Remote Access and Interrupt-level Remote Invocation can be used on the same data structure if: • A Hybrid lock is used • interrupt masking AND spin locks

  18. Using RA with ILRI – Hybrid Lock
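The original slide showed the hybrid lock as a figure. A minimal sketch of the idea in C, assuming hypothetical local_irq_save()/local_irq_restore() masking primitives: interrupts are masked first, so interrupt-level code on this node cannot preempt the holder, and then a spin lock orders access among processors using remote access.

    #include <stdatomic.h>

    extern unsigned long local_irq_save(void);       /* mask interrupts, return old state */
    extern void local_irq_restore(unsigned long f);  /* restore previous state */

    typedef struct { atomic_flag locked; } hybrid_lock;   /* init with ATOMIC_FLAG_INIT */

    unsigned long hybrid_acquire(hybrid_lock *l)
    {
        unsigned long flags = local_irq_save();  /* step 1: shut out local interrupt-level code */
        while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
            ;                                    /* step 2: spin against remote accessors */
        return flags;
    }

    void hybrid_release(hybrid_lock *l, unsigned long flags)
    {
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
        local_irq_restore(flags);
    }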

  19. Using PLRI and ILRI • PLRI & ILRI can be used on the same data structure if: • Deadlock is avoided • The node must always be able to serve incoming invocations while waiting for an outgoing invocation • Example: code cannot issue a PLRI while ILRIs are masked (as they are when accessing data shared by normal and interrupt-level code), because an incoming ILRI could not be served while waiting (from Professor Walpole’s slides)

  20. The Costs • Latency • Impact on local operations • Contention and Throughput • Complement or clash conceptually with the kernel’s organization

  21. Latency • When is RA cheaper than RI? • If (R-1)n < C, implement the operation using RA • n = number of memory references the operation makes, R = ratio of remote to local memory access time, C = fixed overhead of a remote invocation (in local-access times) • Operations that make many references (large n) are better implemented using RI
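A worked example, using the Butterfly Plus ratio R = 12 reported later (slide 27) and an illustrative, assumed invocation cost of C = 1000 local-access times: an operation making n = 50 memory references pays (12 - 1) × 50 = 550 extra access times under RA, so RA wins; at n = 100 the extra cost is 1100 > 1000 and RI wins. The break-even point is n = C / (R - 1) ≈ 91 references.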

  22. Impact on Local Operations • Implicit synchronization: if PLRI is used for all remote accesses, the data structure can be accessed locally without explicit locks • This depends on the kernel being non-preemptive • Explicit synchronization: even on bus-based nodes, atomic synchronization instructions add cost to every local operation

  23. Contention and Throughput • Operations are serialized at some point! • RI: serialized at the processor executing those operations • even if they have no data in common • RA: serialized at the memory • only if accesses compete for the same lock

  24. Complement or Clash • Types of kernels • procedure-based • no distinction between user & kernel space • user program enters kernel via traps • fits RA • message-based • each major kernel resource is its own kernel process • ops require communication among these kernel processes • fits RI

  25. Complement or Clash (comparison figure in the original slides; not reproduced in the transcript)

  26. Case Study

  27. Psyche on Butterfly Plus • Procedure-based OS • Uses shared memory as the primary kernel communication mechanism • Authors built in message-based ops • RI – reorganized code; grouped accesses together, allowing a single RI call • non-CC-NUMA • 1 CPU/node • R = 12:1 (remote-to-local access time)

  28. Psyche on Butterfly Plus • High degree of node locality • RI implemented optimistically • Spin locks used • Test-and-test-and-set used to minimize latency in the absence of contention (sketched below); otherwise an atomic instruction is used • This can be decided on the fly
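A minimal sketch of the test-and-test-and-set idea in portable C11 (the paper’s implementation used the Butterfly’s own atomic primitives; the names here are illustrative): spin on a plain read, which stays cheap while the lock is held, and attempt the expensive atomic operation only when the lock appears free.

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool locked; } ttas_lock;

    void ttas_acquire(ttas_lock *l)
    {
        for (;;) {
            /* "test": an ordinary read; generates no atomic traffic
             * while some other processor holds the lock. */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* "test-and-set": pay for the atomic only when it may succeed. */
            if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
                return;
        }
    }

    void ttas_release(ttas_lock *l)
    {
        atomic_store_explicit(&l->locked, false, memory_order_release);
    }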

  29.–32. Results (performance comparison charts in the original slides; not reproduced in the transcript)

  33. Conclusion

  34. Factors Affecting the Choice of RI/RA • Cost of the RI mechanism • Cost of atomic operations for synchronization • Ratio of remote to local memory access time • For cache-coherent machines: • cache line size • false sharing • caching effects reducing the total cost of kernel ops

  35. Using PLRI, ILRI, and RA • PLRI • Use it for operations long enough that its higher fixed cost (vs. ILRI) is amortized • Must consider latency, throughput, and the appeal of eliminating explicit synchronization • ILRI • Node locality is hugely important • Use it for low-latency ops when you can’t use RA • Use it when the remote node is idle • Authors used ILRI for console I/O, kernel debugging, and TLB shootdown

  36. Observations • On the Butterfly Plus: • ILRI was fast • Explicit synchronization was costly • Remote references were much more expensive than local references • Except for short operations, RI had lower latency • RI might have lower throughput

  37. Conclusions? • Careful design is required for OSs to scale on modern hardware! • That means understanding the effects of your underlying hardware • Keep communication to a minimum no matter what solution is used • Where has the mixing of RI/RA gone? • Monday’s paper, for one • What else? • ccNUMA is in widespread use • How is RI/RA affected?

  38. Thank You
