RDMA(Remote Direct Memory Access) is a technique that lets a computer access memory on a remote machine without interrupting the processing of that machine's CPU(s).
- OpenFabrics Alliance
A non-profit organization that promotes RDMA technologies for server and storage connectivity.
It aims to develop open-source software that supports the three major RDMA fabric technologies: IB(InfiniBand), RoCE(RDMA over Converged Ethernet) and iWARP(Internet Wide Area RDMA Protocol).
The software includes two packages, one that runs on Linux and FreeBSD and one that runs on Microsoft Windows.
- OFED(OpenFabrics Enterprise Distribution)
It was released by the OpenFabrics Alliance. The OFED stack includes software drivers, core kernel code, middleware, and user-level interfaces, and it offers a range of standard protocols.
- IB(InfiniBand)
A new-generation network protocol that has supported RDMA natively from the beginning.
- RoCE(RDMA over Converged Ethernet)
It allows performing RDMA over an Ethernet network. Its lower network headers are Ethernet headers and its upper network headers (including the data) are InfiniBand headers. This allows using RDMA over standard Ethernet infrastructure (switches).
- iWARP(Internet Wide Area RDMA Protocol)
It allows performing RDMA over TCP which means RDMA can be used over standard Ethernet infrastructure.
However, some features that exist in IB and RoCE may not be supported in iWARP, and some of the RDMA performance advantages are lost.
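The RoCE layering described above (Ethernet headers underneath, InfiniBand headers and data on top) can be sketched as a simple framing function. This is a minimal illustration, not a working RoCE stack: the InfiniBand portion is treated as opaque bytes, the function name is hypothetical, and the EtherType shown is the one assigned to the original (v1) form of RoCE.

```python
import struct

# EtherType used by RoCE v1 frames (assumption: v1 framing, as described in the text).
ROCE_V1_ETHERTYPE = 0x8915


def build_roce_v1_frame(dst_mac: bytes, src_mac: bytes, ib_packet: bytes) -> bytes:
    """Prepend an Ethernet header to an already-built InfiniBand packet.

    `ib_packet` stands for the upper part of the frame (InfiniBand headers
    plus data); it is passed through untouched.
    """
    if len(dst_mac) != 6 or len(src_mac) != 6:
        raise ValueError("MAC addresses must be exactly 6 bytes")
    # Network byte order: 6-byte destination, 6-byte source, 2-byte EtherType.
    eth_header = struct.pack("!6s6sH", dst_mac, src_mac, ROCE_V1_ETHERTYPE)
    return eth_header + ib_packet
```

Because only the lowest header is Ethernet, ordinary Ethernet switches can forward such a frame without understanding the InfiniBand part.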
- Direct user-level access to HW
Zero-copy: Applications can transfer data without involving the network software stack, and data is sent and received directly to and from the application buffers without being copied between the network layers.
<img src="http://7xjh3j.com1.z0.glb.clouddn.com/2016-06-27-rdma-1.jpg" title="Zero Copy"/>
Kernel bypass in the fast path: Applications can perform data transfer directly from userspace without the need to perform context switches.
No CPU involvement: Applications can access remote memory without consuming any CPU in the remote machine. The caches in the remote CPU(s) won’t be filled with the accessed memory content.
- Asynchronous Communication
Computation and communication overlap.
- Hardware managed transport
Software deals with buffers, not packets.
<img src="http://7xjh3j.com1.z0.glb.clouddn.com/2016-06-27-rdma-2.jpg" title="Full Protocol Offloading"/>
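The idea of overlapping computation and communication can be sketched with a toy simulation. This is not an RDMA API: a worker thread stands in for the adapter, `simulated_transfer` and `overlap_demo` are hypothetical names, and the sleep only imitates transfer latency. The point is the shape of the code: post a whole buffer, keep computing, then collect the completion.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_transfer(src: bytes, dst: bytearray) -> int:
    """Stand-in for the adapter moving a buffer (not a real RDMA call)."""
    time.sleep(0.05)          # pretend the transfer takes a while
    dst[:len(src)] = src      # the whole buffer arrives; no per-packet handling
    return len(src)


def overlap_demo(payload: bytes):
    dst = bytearray(len(payload))
    with ThreadPoolExecutor(max_workers=1) as adapter:
        # "Post" the transfer and return immediately.
        completion = adapter.submit(simulated_transfer, payload, dst)
        # Useful work proceeds while the transfer is in flight.
        checksum = sum(range(100_000))
        # "Poll" for the completion once the result is needed.
        transferred = completion.result()
    return checksum, transferred, bytes(dst)
```

The application code only ever touches buffers (`payload`, `dst`); everything packet-related would be the adapter's job.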
InfiniBand (IB) is a pervasive, low-latency, high-bandwidth interconnect that requires low processing overhead and is well suited to carrying multiple traffic types (clustering, communications, storage, management, etc.) over a single connection.
IB mainly consists of 5 layers.
|1st||Software Transport Verbs and Upper Layer Protocols||Interface between applications and hardware; defines the methodology for management functions|
|2nd||Transport||Delivers packets to the appropriate QP (Queue Pair)|
|3rd||Network||Routes packets between different partitions/subnets|
|4th||Data Link(Symbols and framing)||Routes packets within the same partition/subnet|
|5th||Physical||Electrical and mechanical characteristics of the links (signaling, media, connectors)|
<img src="http://7xjh3j.com1.z0.glb.clouddn.com/2016-06-27-rdma-4.jpg" title="Layers of IB Network"/>
An IB network contains the following devices:
HCA(Host Channel Adapter): Like an Ethernet NIC. It connects the InfiniBand cable to the PCI Express bus. It is the end node of the InfiniBand network, executing transport-level functions and supporting the InfiniBand verbs interface.
TCA(Target Channel Adapter): Similar to HCA,
<img src="http://7xjh3j.com1.z0.glb.clouddn.com/2016-06-27-rdma-3.jpg" title="Infiniband Architecture"/>
lsmod to show the loaded kernel modules. Check for names containing 'rdma' and 'ib'.
In particular, ib_uverbs and the low-level driver for the hardware should be loaded.
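The module check above can be automated with a small script. The sketch below parses text in the format of /proc/modules (the file that lsmod reads), where the module name is the first field on each line. The function names are hypothetical, and the name filter (contains "rdma", starts with "ib_", or ends with "_ib") is only a heuristic chosen for this example; it is not an official list of RDMA modules.

```python
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set:
    """Return the module names from /proc/modules-style text."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def rdma_modules(proc_modules_text: str) -> list:
    """Heuristically pick out modules that look RDMA/InfiniBand related."""
    return sorted(
        name
        for name in loaded_modules(proc_modules_text)
        if "rdma" in name or name.startswith("ib_") or name.endswith("_ib")
    )


def has_uverbs(proc_modules_text: str) -> bool:
    """True if ib_uverbs, needed for userspace verbs access, is loaded."""
    return "ib_uverbs" in loaded_modules(proc_modules_text)
```

On a Linux machine this could be called as `rdma_modules(Path("/proc/modules").read_text())`; an empty result suggests the RDMA stack is not loaded.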
USERSPACE VERBS ACCESS
The ib_uverbs module, built by enabling CONFIG_INFINIBAND_USER_VERBS, enables direct userspace access to IB hardware via “verbs”.
ibv_devices to show the available RDMA devices in the local machine.
ibv_devinfo to check for more information about the IB device.
Make sure at least one port is in PORT_ACTIVE state, which means that the port is available for working.
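The PORT_ACTIVE check can also be scripted. The sketch below scans ibv_devinfo-style text for `hca_id:`, `port:` and `state:` lines and returns the ports that report PORT_ACTIVE. The function name is hypothetical, and the field layout is assumed from typical ibv_devinfo output, so the patterns may need adjusting for a particular version of the tool.

```python
import re


def active_ports(devinfo_text: str) -> list:
    """Return (device, port) pairs whose state line reports PORT_ACTIVE."""
    active = []
    device = None
    port = None
    for line in devinfo_text.splitlines():
        match = re.match(r"\s*hca_id:\s*(\S+)", line)
        if match:
            # A new device section starts; forget the previous port.
            device, port = match.group(1), None
            continue
        # Anchored match, so fields such as phys_port_cnt are ignored.
        match = re.match(r"\s*port:\s*(\d+)", line)
        if match:
            port = int(match.group(1))
            continue
        match = re.match(r"\s*state:\s*(\S+)", line)
        if match and port is not None and match.group(1) == "PORT_ACTIVE":
            active.append((device, port))
    return active
```

The text can be captured with `subprocess.run(["ibv_devinfo"], capture_output=True, text=True).stdout`; an empty list means no port is ready for use.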
ibv_*_pingpong to test the connection between machines. One machine runs the server process and another runs the client process, which is given the server's address.
rping can be used to test the connection too.
To be continued at Simple usage of RDMA ibv