RDMA(Remote Direct Memory Access) is a technique that lets a computer access memory on a remote machine without interrupting the processing of that machine's CPU(s).
The OpenFabrics Alliance is a non-profit organization that promotes RDMA technologies for server and storage connectivity.
It aims to develop open-source software that supports the three major RDMA fabric technologies: IB(InfiniBand), RoCE(RDMA over Converged Ethernet) and iWARP(Internet Wide Area RDMA Protocol).
The software includes two packages, one that runs on Linux and FreeBSD and one that runs on Microsoft Windows.
OFED(OpenFabrics Enterprise Distribution)
It was released by the OpenFabrics Alliance. The OFED stack includes software drivers, core kernel code, middleware, and user-level interfaces. It offers a range of standard protocols.
IB(InfiniBand)
A new-generation network protocol that has supported RDMA natively from the beginning.
RoCE(RDMA over Converged Ethernet)
It allows RDMA to be performed over an Ethernet network. Its lower-layer headers are Ethernet headers and its upper-layer headers (including the data) are InfiniBand headers. This allows RDMA to be used over standard Ethernet infrastructure (switches).
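To make the header layering concrete, here is a minimal Python sketch that packs a simplified RoCE v1 frame: an Ethernet header first, then InfiniBand headers and the payload. The function names, MAC addresses, queue pair number, and payload are made up for illustration. The GRH is a zero-filled placeholder and the trailing checksum is omitted, so this shows the layout only, not a frame a real adapter would accept.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # EtherType used by RoCE v1
GRH_LENGTH = 40             # InfiniBand Global Route Header size in bytes
SEND_ONLY_OPCODE = 0x04     # reliable-connection "SEND Only" opcode


def ethernet_header(dst_mac: bytes, src_mac: bytes) -> bytes:
    """14-byte Ethernet II header: destination MAC, source MAC, EtherType."""
    return struct.pack("!6s6sH", dst_mac, src_mac, ROCE_V1_ETHERTYPE)


def base_transport_header(opcode: int, dest_qp: int, psn: int,
                          pkey: int = 0xFFFF) -> bytes:
    """12-byte InfiniBand Base Transport Header with all flag bits left at 0."""
    flags = 0  # solicited event, migration state, pad count, header version
    # The destination QP and the PSN are 24-bit fields, each preceded by
    # 8 reserved/flag bits, so each is packed into a 32-bit word.
    return struct.pack("!BBHII", opcode, flags, pkey,
                       dest_qp & 0xFFFFFF, psn & 0xFFFFFF)


def roce_v1_frame(dst_mac: bytes, src_mac: bytes, dest_qp: int,
                  psn: int, payload: bytes) -> bytes:
    """Ethernet header below, InfiniBand headers and data above (no checksum)."""
    grh = bytes(GRH_LENGTH)  # placeholder: a real GRH carries GIDs and lengths
    bth = base_transport_header(SEND_ONLY_OPCODE, dest_qp, psn)
    return ethernet_header(dst_mac, src_mac) + grh + bth + payload


frame = roce_v1_frame(bytes.fromhex("aabbccddeeff"),
                      bytes.fromhex("112233445566"),
                      dest_qp=0x12, psn=7, payload=b"data")
print(len(frame), frame[12:14].hex())
```

An Ethernet switch only needs to read the first 14 bytes, which is why ordinary switches can forward this traffic.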
iWARP(Internet Wide Area RDMA Protocol)
It allows performing RDMA over TCP which means RDMA can be used over standard Ethernet infrastructure.
However, some features that exist in IB and RoCE may not be supported in iWARP, and some of the RDMA performance advantages may be lost.
Advantages
Direct user-level access to HW
Zero-copy: Applications can transfer data without involving the network software stack; data is sent and received directly to and from application buffers without being copied between the network layers.
Kernel bypass in the fast path: Applications can perform data transfer directly from userspace without the need to perform context switches.
No CPU involvement: Applications can access remote memory without consuming any CPU in the remote machine. The caches in the remote CPU(s) won’t be filled with the accessed memory content.
Asynchronous Communication
Computation and communication overlap.
Hardware managed transport
Software deals with buffers, not packets.
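The asynchronous model can be illustrated with a toy simulation: the application posts a work request describing a buffer, keeps computing, and later polls for a completion. Real applications do this through the verbs API (for example `ibv_post_send` and `ibv_poll_cq` in libibverbs). The `ToyQueuePair` class below is a hypothetical stand-in written in plain Python, with a background thread playing the role of the adapter, so it shows the programming pattern only and none of the performance properties.

```python
import queue
import threading


class ToyQueuePair:
    """Hypothetical stand-in for a queue pair; a thread plays the adapter."""

    def __init__(self):
        self._send_queue = queue.Queue()
        self._completion_queue = queue.Queue()
        worker = threading.Thread(target=self._run_adapter, daemon=True)
        worker.start()

    def post_send(self, wr_id: int, buffer: bytes, remote: bytearray) -> None:
        """Queue a work request and return immediately."""
        self._send_queue.put((wr_id, buffer, remote))

    def poll_cq(self) -> list:
        """Return the IDs of finished work requests without blocking."""
        done = []
        while True:
            try:
                done.append(self._completion_queue.get_nowait())
            except queue.Empty:
                return done

    def _run_adapter(self) -> None:
        # The "hardware" moves whole buffers; the caller never sees packets.
        while True:
            wr_id, buffer, remote = self._send_queue.get()
            remote[:len(buffer)] = buffer
            self._completion_queue.put(wr_id)


def demo():
    qp = ToyQueuePair()
    remote_memory = bytearray(16)  # pretend this lives on another machine
    qp.post_send(wr_id=1, buffer=b"hello", remote=remote_memory)
    partial_sum = sum(range(1000))  # computation overlaps the transfer
    completed = []
    while not completed:            # poll until the completion shows up
        completed = qp.poll_cq()
    return partial_sum, completed, bytes(remote_memory[:5])


print(demo())
```

The caller hands over a buffer and a request ID and gets the ID back on completion, which is the sense in which software deals with buffers rather than packets.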
InfiniBand
A pervasive, low-latency, high-bandwidth interconnect which requires low processing overhead and is ideal for carrying multiple traffic types (clustering, communications, storage, management, etc.) over a single connection.
IB mainly consists of 5 layers.
|No|Layers|Functions|
|-|-|-|
|1st|Software Transport Verbs and Upper Layer Protocols|Interface between applications and hardware; defines the methodology for management functions|
|2nd|Transport|Delivers packets to the appropriate QP (queue pair); message assembly/de-assembly; access rights|
|3rd|Network|Routes packets between different partitions/subnets|
|4th|Data Link (symbols and framing)|Routes packets within the same partition/subnet|
|5th|Physical|Signal levels, media, connections|
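The transport layer's message assembly and de-assembly can be sketched as follows: a message is cut into MTU-sized packets, each tagged with its position (first, middle, last, or only) and a packet sequence number (PSN), and the receiver concatenates the payloads while checking that the PSNs are consecutive. The function names and the tiny MTU are illustrative assumptions; on real hardware the adapter does this work, not application code.

```python
from typing import List, Tuple

PSN_MODULUS = 1 << 24  # the PSN is a 24-bit counter and wraps around

Packet = Tuple[int, str, bytes]  # (psn, position, payload)


def segment(message: bytes, mtu: int, first_psn: int = 0) -> List[Packet]:
    """Split a message into packets of at most `mtu` payload bytes."""
    chunks = [message[i:i + mtu] for i in range(0, len(message), mtu)] or [b""]
    packets = []
    for index, chunk in enumerate(chunks):
        if len(chunks) == 1:
            position = "only"
        elif index == 0:
            position = "first"
        elif index == len(chunks) - 1:
            position = "last"
        else:
            position = "middle"
        packets.append(((first_psn + index) % PSN_MODULUS, position, chunk))
    return packets


def reassemble(packets: List[Packet], first_psn: int = 0) -> bytes:
    """Join payloads, rejecting any packet that arrives out of sequence."""
    expected = first_psn
    parts = []
    for psn, _position, payload in packets:
        if psn != expected:
            raise ValueError(f"unexpected PSN {psn}, wanted {expected}")
        parts.append(payload)
        expected = (expected + 1) % PSN_MODULUS
    return b"".join(parts)


packets = segment(b"abcdefghij", mtu=4)
print(packets)
print(reassemble(packets))
```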
An IB network contains the following devices:
HCA(Host Channel Adapter): Like an Ethernet NIC. Connects the InfiniBand cable to the PCI Express bus. It is the end node of the InfiniBand network, executing transport-level functions as well as supporting the InfiniBand verbs interface.
TCA(Target Channel Adapter): Similar to HCA,
Test
Use `lsmod` to show the loaded kernel modules. Check for `rdma` and `ib`.
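The same check can be scripted. The sketch below filters `lsmod`-style text for module names that contain an `rdma...` or `ib` component. The function name `find_rdma_modules` and the sample text are illustrative, and the module names in the sample are only examples of what a system might list.

```python
def find_rdma_modules(lsmod_text: str) -> list:
    """Return module names with an underscore-separated 'rdma*' or 'ib' part."""
    modules = []
    for line in lsmod_text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "Module":  # skip blanks and the header
            continue
        name = fields[0]
        parts = name.split("_")
        if any(part == "ib" or part.startswith("rdma") for part in parts):
            modules.append(name)
    return modules


# Illustrative sample in lsmod's "Module / Size / Used by" layout.
SAMPLE = """\
Module                  Size  Used by
rdma_ucm               28672  0
rdma_cm                61440  1 rdma_ucm
ib_uverbs             131072  1 rdma_ucm
ib_core               311296  3 rdma_cm,ib_uverbs,rdma_ucm
ext4                  741376  1
libcrc32c              16384  1 ext4
"""

print(find_rdma_modules(SAMPLE))
```

On a Linux host, the function can be fed the real list via `subprocess.run(["lsmod"], capture_output=True, text=True).stdout`. An empty result suggests the RDMA modules are not loaded.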
    jcf@node1:~> rping -s -v -C 20
    server ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
    server ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
    server ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
    server ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
    server ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
    server ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
    server ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
    server ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
    server ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
    server ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
    server ping data: rdma-ping-10: KLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
    server ping data: rdma-ping-11: LMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzAB
    server ping data: rdma-ping-12: MNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABC
    server ping data: rdma-ping-13: NOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCD
    server ping data: rdma-ping-14: OPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDE
    server ping data: rdma-ping-15: PQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEF
    server ping data: rdma-ping-16: QRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFG
    server ping data: rdma-ping-17: RSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGH
    server ping data: rdma-ping-18: STUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHI
    server ping data: rdma-ping-19: TUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJ
    server DISCONNECT EVENT...
    wait for RDMA_READ_ADV state 10
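The payload pattern in this output can be reproduced in a few lines, which helps when checking by eye that the data arrived intact. The sketch below is inferred from the output shown rather than taken from the rping source. It assumes a 64-byte buffer whose last byte is a terminator, and a character sequence that cycles through ASCII 65 (`A`) to 122 (`z`), starting one character later on each ping. The function name `ping_line` is hypothetical.

```python
FIRST_CHAR = 65    # 'A'
CYCLE_LENGTH = 58  # characters 'A' (65) through 'z' (122)


def ping_line(ping: int, size: int = 64) -> str:
    """Rebuild the text of one ping message (assumed 64-byte buffer)."""
    prefix = f"rdma-ping-{ping}: "
    # One byte is reserved for the terminator, so a longer prefix
    # (two-digit ping numbers) leaves one character less for the pattern.
    body_length = size - len(prefix) - 1
    body = "".join(
        chr(FIRST_CHAR + (ping + offset) % CYCLE_LENGTH)
        for offset in range(body_length)
    )
    return prefix + body


for number in (0, 9, 10):
    print("server ping data:", ping_line(number))
```

Under these assumptions, the one-character-shorter pattern from `rdma-ping-10` onward is explained by the longer prefix, not by data loss.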