Our paper on doing RDMA from GPUs

Our paper “Toward GPU-centric Networking on Commodity Hardware” has been accepted to appear at EdgeSys 2024 in Athens. The paper summarizes our efforts to implement an RDMA stack directly on a CUDA GPU, and it is joint work with Mariano Scazzariello, Gerald Q. Maguire Jr., and Dejan Kostić.

This paper fits into the grand scheme of GPU-accelerated systems, tackling the problem of expensive CPU cycles being wasted just to move data in and out of GPUs, often while polling for completion. We believe this is a key enabler for the new era of GPU-accelerated computing, which is not limited to AI applications. Using standard RDMA semantics with minimal modifications to the stack (e.g., without porting the whole code to the CUDA architecture) would allow easier and more flexible adoption by current applications, without the need for proprietary, single-vendor stacks (e.g., NCCL). Although this work was developed on NVIDIA hardware (and Mellanox NICs), I believe the future is heterogeneous and multi-vendor, hence the need for a more standard and easy-to-deploy solution (i.e., one that does not require that magic combination of hardware components).

The paper is available on the ACM Digital Library with Open Access, from DiVA, or directly here. The article is released as Open Access under the Creative Commons BY-NC-SA license, in the hope that this will help scientific progress.

The code used in the paper is available on GitHub under the GNU GPL v3 license.

More details about the implementation and the motivation for this work are described in my Licentiate thesis. If you are interested in knowing more (including what is not written there), get in touch! My institutional email is mylastname@kth.se.