Enhancing Data Sharing Across Heterogeneous Compute Platforms
This presentation addresses the challenges and goals of data sharing between heterogeneous devices within compute platforms. It covers the questions programmers face, such as whether a device can access application memory directly or needs its own copy of the data, and the roles of the MMU and IOMMU in this context. The discussion highlights the importance of hardware/software interfaces that simplify programming and abstract away device-specific details. It also examines the consequences of lacking I/O page faults and proposes changes to I/O memory management that unify process and I/O address spaces, reduce overheads, and improve performance for both applications and I/O devices.
Presentation Transcript
Heterogeneous Compute Platforms: Data management
Dan Tsafrir, May 2013, ICRI-CI Retreat
Data Sharing
Data sharing – the problem
• Sharing data between heterogeneous devices
  • Oftentimes cumbersome & device-specific
  • In OS, apps, or both
• Programmers need to address questions like
  • Can the device work directly on app memory? Or must it have its own copy of the data?
  • Can the device deal with app virtual addresses? Or must the memory be mapped in some other way?
  • Should the memory be pinned before passing it to the device? Or can the device withstand I/O page faults, thereby allowing memory overcommitment?
Data sharing – goal
• Big goal
  • Data sharing between heterogeneous PEs should "just work"
  • HW/SW interfaces should keep app programmers mostly ignorant of the details
  • Need to develop interfaces & a runtime layer that
    • Abstract away the details of each device
    • Present to apps a simplified, efficient programming model
• Concrete goal
  • Focusing on MMU and IOMMU
Unifying MMU and IOMMU spaces
Ilya Lesokhin, Muli Ben-Yehuda, Assaf Schuster, Dan Tsafrir
IOMMU in a nutshell
• IOMMU vs. MMU
  • IOMMU serves I/O devices that perform DMAs
  • Just as the MMU serves processes that access virtual memory
• But
  • No I/O page faults (IOPFs)
  • If memory isn’t there => crash
No IOPFs – consequences
• IOMMU management crippled compared to MMU
  • Virtual memory must be pre-allocated & pinned to physical memory
  • Can’t do memory overcommitment
  • Consider a set of uncooperative VMs with assigned NICs (SR-IOV)
    • Must pin their entire memory images!
• Kernel’s MMU & IOMMU management subsystems
  • Developed separately & used differently
  • Causes numerous headaches and performance penalties
  • E.g., can’t use an app’s virtual memory space to do I/O
• Thus, to be able to unify (and get rid of the above drawbacks)
  • Must have IOPFs
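The pinning consequence on this slide can be illustrated with a toy model. This is a minimal sketch, not anything from the talk: the `Memory` class, its method names, and the frame counts are all hypothetical. It contrasts CPU-style lazy allocation (possible because the MMU can fault) with the up-front pinning that a device needs when it cannot fault.

```python
class Memory:
    """Toy pool of physical frames (hypothetical model, for illustration only)."""

    def __init__(self, physical_frames):
        self.free = physical_frames
        self.resident = set()

    def _allocate(self, page):
        if self.free == 0:
            raise MemoryError("out of physical frames")
        self.free -= 1
        self.resident.add(page)

    def touch_cpu(self, page):
        # CPU access: a page fault lets the OS back the page lazily, on first use.
        if page not in self.resident:
            self._allocate(page)

    def pin_for_dma(self, pages):
        # No IOPFs: every page the device might touch must be resident up front.
        for page in pages:
            if page not in self.resident:
                self._allocate(page)


if __name__ == "__main__":
    lazy = Memory(physical_frames=100)
    for page in range(10):          # working set of 10 pages out of a 1000-page image
        lazy.touch_cpu(page)
    print("frames left with demand paging:", lazy.free)

    pinned = Memory(physical_frames=100)
    try:
        pinned.pin_for_dma(range(1000))   # whole image must be pinned
    except MemoryError as err:
        print("pinning the whole image fails:", err)
```

With demand paging only the working set consumes frames, so the host can overcommit; with mandatory pinning the whole image must fit, which is the SR-IOV case the slide describes.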
IOPF support – current state of affairs
• Recently defined industry spec for supporting IOPFs:
  • In “PRI” (Page Request Interface)
  • Part of the PCI-SIG ATS (Address Translation Services) specification
• Bleeding-edge I/O devices do (experimentally) support IOPFs
  • We are working on such experimental NICs
Research
• Status
  • Have a working environment
  • Handling send-IOPFs (currently the NIC drops receive-IOPFs)
  • Measured IOPF handling (breakdown into HW and SW components)
• Next steps
  • Attempt to reduce overhead
  • Develop a strategy to handle receive-IOPFs (10 Gb/sec => 1.25 MB/ms)
  • Characterizing IOPFs
    • How often? Performance penalty? Dropped packets?
  • Show I/O memory space overcommitment is possible & advantageous
• Longer term
  • Unify process & I/O address spaces
    • Processes use their VA buffers, I/O subsystem works directly on them
  • Does the PRI spec make sense? Optimal? Could be improved? How?
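The receive-side figure on this slide (10 Gb/sec => 1.25 MB/ms) is a unit conversion. The sketch below reproduces it, assuming decimal units (1 Gb = 10^9 bits, 1 MB = 10^6 bytes). The function names and the 2 ms fault latency in the usage lines are hypothetical, chosen only to show how the amount of data that must be buffered or dropped scales with fault-handling time.

```python
def bytes_per_ms(link_gbps: float) -> float:
    """Bytes arriving per millisecond on a link of the given rate (decimal units)."""
    bits_per_second = link_gbps * 1e9
    bytes_per_second = bits_per_second / 8
    return bytes_per_second / 1000


def inflight_mb(link_gbps: float, fault_latency_ms: float) -> float:
    """Megabytes that arrive at line rate while one receive-IOPF is being resolved."""
    return bytes_per_ms(link_gbps) * fault_latency_ms / 1e6


if __name__ == "__main__":
    print(bytes_per_ms(10) / 1e6, "MB arrive per ms at 10 Gb/s")
    # Hypothetical latency: 2 ms to resolve a fault (not a measured value).
    print(inflight_mb(10, 2.0), "MB arrive during a 2 ms fault")
```

This is why receive-IOPFs are harder than send-IOPFs: the sender can simply wait for the page, while incoming data keeps arriving at line rate during the fault.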
Rethink the IOMMU
Moshe Malka, Nadav Amit, Dan Tsafrir
IOMMU architected similarly to MMU
[Figure: virtual address breakdown, with CR3 as the page-table root]
• Has IOTLB
• Upon IOTLB miss => HW walks the table
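The MMU-like structure described on this slide (an IOTLB in front of a hierarchical table that hardware walks on a miss) can be sketched as a toy software model. Everything here is an assumption made for illustration: the `ToyIOMMU` class, the 4-level layout with 9 index bits per level, and 4 KiB pages mirror x86-64-style paging and are not taken from the slides or from any specific IOMMU.

```python
PAGE_SHIFT = 12   # 4 KiB pages (assumption)
LEVEL_BITS = 9    # index bits per table level (assumption)
LEVELS = 4        # depth of the hierarchy (assumption)


class TranslationFault(Exception):
    """No mapping exists; without IOPF support the DMA cannot recover from this."""


class ToyIOMMU:
    def __init__(self):
        self.root = {}    # nested dicts stand in for the in-memory table hierarchy
        self.iotlb = {}   # virtual page number -> physical frame number
        self.walks = 0    # number of full table walks performed

    def _indices(self, vpn):
        # Split the virtual page number into one index per level, top level first.
        mask = (1 << LEVEL_BITS) - 1
        return [(vpn >> (LEVEL_BITS * lvl)) & mask for lvl in reversed(range(LEVELS))]

    def map(self, iova, frame):
        node = self.root
        *upper, leaf = self._indices(iova >> PAGE_SHIFT)
        for idx in upper:
            node = node.setdefault(idx, {})
        node[leaf] = frame

    def translate(self, iova):
        vpn = iova >> PAGE_SHIFT
        offset = iova & ((1 << PAGE_SHIFT) - 1)
        if vpn not in self.iotlb:          # IOTLB miss: walk the hierarchy
            self.walks += 1
            node = self.root
            for idx in self._indices(vpn):
                if idx not in node:
                    raise TranslationFault(hex(iova))
                node = node[idx]
            self.iotlb[vpn] = node         # after the last level, node is the frame
        return (self.iotlb[vpn] << PAGE_SHIFT) | offset


if __name__ == "__main__":
    iommu = ToyIOMMU()
    iommu.map(0x7F0000123000, frame=0x42)
    print(hex(iommu.translate(0x7F0000123ABC)), "walks:", iommu.walks)
    print(hex(iommu.translate(0x7F0000123010)), "walks:", iommu.walks)
```

The second translation hits the IOTLB, so the walk counter does not change; a miss costs one memory access per level, which is the overhead the following slide questions.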
Does this make sense?
• We submit that it does not…
• Specifically, it seems that
  • Since NICs work with rings, IOTLB accesses are completely predictable (more important than for the TLB because page tables are un-cached)
  • Since NICs map each DMA descriptor just before using it, and unmap it just after, there is no need for a page-table hierarchy
  • Performance can be greatly improved by redesigning the IOMMU to take advantage of the above
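To make the two observations on this slide concrete, here is one hypothetical shape such a redesign could take. The slides do not specify a design, so the `RingTranslationTable` class and its methods are invented purely for illustration. The idea: if mappings are created and destroyed in ring order, a flat array indexed by ring slot can replace the tree, and the slot the device needs next is always known in advance.

```python
class RingTranslationTable:
    """Hypothetical flat table: one entry per ring slot instead of a page-table tree."""

    def __init__(self, ring_size):
        self.entries = [None] * ring_size   # slot -> physical frame; None = unmapped
        self.head = 0                       # next slot the driver will map
        self.tail = 0                       # next slot the device will use

    def map_next(self, frame):
        # Driver maps a buffer just before posting its descriptor.
        slot = self.head
        if self.entries[slot] is not None:
            raise RuntimeError("ring full")
        self.entries[slot] = frame
        self.head = (slot + 1) % len(self.entries)
        return slot

    def device_translate(self):
        # The device consumes descriptors in ring order, so the lookup is a
        # single array access at a predictable index (no multi-level walk).
        slot = self.tail
        frame = self.entries[slot]
        if frame is None:
            raise LookupError("descriptor not mapped")
        return slot, frame

    def complete(self):
        # Driver unmaps the buffer right after the DMA finishes.
        slot = self.tail
        self.entries[slot] = None
        self.tail = (slot + 1) % len(self.entries)


if __name__ == "__main__":
    table = RingTranslationTable(ring_size=4)
    table.map_next(0x10)
    table.map_next(0x20)
    print(table.device_translate())
    table.complete()
    print(table.device_translate())
```

Because the next index is known ahead of time, a real design could also prefetch the upcoming entry, which is where the predictability argument would pay off.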
Research
• Status
  • Working hard towards proving all claims from the previous slide
  • Environment: KVM/QEMU setup (10 Gb/s NICs) that logs all IOMMU accesses
• Future
  • Not just NICs (have reason to believe other I/O devices too)
  • Reducing overheads for virtualization (vIOMMU)
  • What would be the impact of unifying I/O and process spaces? (previous project)