
RDMA Warming Up

RDMA(Remote Direct Memory Access) is a technique that lets a computer access memory on a remote machine directly, without interrupting the processing of the remote machine's CPU(s).

Basic

  • OpenFabrics Alliance

A non-profit organization that promotes RDMA technologies for server and storage connectivity.

It aims to develop open-source software that supports the three major RDMA fabric technologies: IB(InfiniBand), RoCE(RDMA over Converged Ethernet) and iWARP(Internet Wide Area RDMA Protocol).

The software includes two packages, one that runs on Linux and FreeBSD and one that runs on Microsoft Windows.

  • OFED(OpenFabrics Enterprise Distribution)

It is released by the OpenFabrics Alliance. The OFED stack includes software drivers, core kernel code, middleware, and user-level interfaces, and it offers a range of standard protocols.

  • IB(InfiniBand)

A new-generation network protocol that has supported RDMA natively from the beginning.

  • RoCE(RDMA over Converged Ethernet)

It allows performing RDMA over an Ethernet network. Its lower network headers are Ethernet headers and its upper headers (including the data) are InfiniBand headers, which allows RDMA to be used over standard Ethernet infrastructure (switches).

  • iWARP(Internet Wide Area RDMA Protocol)

It allows performing RDMA over TCP which means RDMA can be used over standard Ethernet infrastructure.

However, some features that exist in IB and RoCE are not supported in iWARP, and some of the RDMA performance advantages are lost.

Advantages

  • Direct user-level access to HW

Zero-copy: Applications can perform data transfers without involving the network software stack; data is sent and received directly to and from application buffers without being copied between network layers.

Kernel bypass in the fast path: Applications can perform data transfers directly from userspace, without context switches into the kernel.

No CPU involvement: Applications can access remote memory without consuming any CPU cycles on the remote machine, and the caches of the remote CPU(s) won't be filled with the accessed memory content.

  • Asynchronous Communication

Computation and communication overlap: an application posts work requests and continues computing, then reaps completions later (see the sketch after this list).

  • Hardware managed transport

Software deals with buffers, not packets.
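
To make the asynchronous, buffer-oriented model concrete, below is a minimal sketch of the post/poll pattern using the libibverbs API. It assumes a connected queue pair, a completion queue and a registered memory region already exist (their creation is covered in the follow-up post); only the ibv_* calls are real API, everything else is illustrative.

```c
/* Minimal sketch of the asynchronous post/poll pattern with libibverbs.
 * Assumes a connected queue pair (qp), a completion queue (cq) and a
 * memory region (mr) registered over buf already exist; error handling
 * is reduced to the bare minimum. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

static int post_and_poll(struct ibv_qp *qp, struct ibv_cq *cq,
                         struct ibv_mr *mr, char *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,   /* HW reads straight from this buffer */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED, /* ask for a completion entry */
    };
    struct ibv_send_wr *bad_wr = NULL;

    /* Post the work request and return immediately: the HCA transfers
     * the buffer while the CPU is free to do something else. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* ... computation can overlap with the transfer here ... */

    /* Later, reap the completion: software deals with buffers and work
     * completions, never with individual packets. */
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "send failed: %s\n", ibv_wc_status_str(wc.status));
        return -1;
    }
    return 0;
}
```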

InfiniBand

A pervasive, low-latency, high-bandwidth interconnect that requires low processing overhead and is ideal for carrying multiple traffic types (clustering, communications, storage, management, etc.) over a single connection.

IB mainly consists of 5 layers.

| No | Layer | Functions |
|----|-------|-----------|
| 1st | Software Transport Verbs and Upper Layer Protocols | Interface between applications and the hardware; defines the methodology for management functions |
| 2nd | Transport | Delivers packets to the appropriate QP; message assembly/disassembly; access rights |
| 3rd | Network | Routes packets between different partitions/subnets |
| 4th | Data Link (symbols and framing) | Routes packets within the same partition/subnet |
| 5th | Physical | Signal levels/media/connections |

An IB network contains the following devices:

HCA(Host Channel Adapter): Like an Ethernet NIC, it connects the InfiniBand cable to the PCI Express bus. It is the end node of the InfiniBand network, executing transport-level functions and supporting the InfiniBand verbs interface.

TCA(Target Channel Adapter): Similar to an HCA, but intended for I/O target devices such as storage subsystems; it implements a subset of the HCA functionality and is not required to expose the verbs interface.

Test

Use lsmod to show the loaded kernel modules. Check for the 'rdma' and 'ib' modules.

jcf@node1:~> lsmod | grep ib
ib_ucm 18507 0
ib_ipoib 140788 0
ib_cm 47822 3 ib_ucm,rdma_cm,ib_ipoib
ib_uverbs 75527 2 rdma_ucm,ib_ucm
ib_umad 22476 6
mlx5_ib 188056 0
mlx5_core 397618 1 mlx5_ib
mlx4_ib 209139 0
ib_sa 33470 5 rdma_ucm,rdma_cm,ib_ipoib,ib_cm,mlx4_ib
ib_mad 56507 4 ib_cm,ib_umad,mlx4_ib,ib_sa
ib_core 138343 12 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx4_ib,ib_sa,ib_mad
ib_addr 18889 3 rdma_ucm,rdma_cm,ib_core
mlx4_core 370626 2 mlx4_en,mlx4_ib
mlx_compat 30364 17 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx5_core,mlx4_en,mlx4_ib,ib_sa,ib_mad,ib_core,ib_addr,mlx4_core
libcrc32c 12644 0
ipv6_lib 344914 187 ib_ipoib,ib_core,ib_addr,ipv6
libsas 88001 1 isci
scsi_transport_sas 40887 2 isci,libsas
libahci 35044 1 ahci
libata 230626 5 libsas,ata_generic,ata_piix,ahci,libahci
scsi_mod 235785 13 usb_storage,sr_mod,sg,sd_mod,isci,libsas,scsi_transport_sas,scsi_dh_emc,scsi_dh_alua,scsi_dh_hp_sw,scsi_dh_rdac,scsi_dh,libata

Check that ib_uverbs and the low-level driver of the hardware (here the mlx4/mlx5 modules) are loaded.

USERSPACE VERBS ACCESS

The ib_uverbs module, built by enabling CONFIG_INFINIBAND_USER_VERBS, enables direct userspace access to IB hardware via “verbs”.

See Kernel Documentation - InfiniBand

jcf@node1:~> lsmod | grep rdma
rdma_ucm 22630 0
rdma_cm 54692 1 rdma_ucm
iw_cm 36675 1 rdma_cm
configfs 35817 2 rdma_cm
ib_cm 47822 3 ib_ucm,rdma_cm,ib_ipoib
ib_uverbs 75527 2 rdma_ucm,ib_ucm
ib_sa 33470 5 rdma_ucm,rdma_cm,ib_ipoib,ib_cm,mlx4_ib
ib_core 138343 12 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx4_ib,ib_sa,ib_mad
ib_addr 18889 3 rdma_ucm,rdma_cm,ib_core
mlx_compat 30364 17 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx5_core,mlx4_en,mlx4_ib,ib_sa,ib_mad,ib_core,ib_addr,mlx4_core

Then use ibv_devices to show the available RDMA devices on the local machine.

jcf@node1:~> ibv_devices
    device                 node GUID
    ------              ----------------
    mlx4_0              10d2c91000006bd0
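
Programmatically, the same enumeration goes through the ib_uverbs interface via libibverbs. Below is a minimal sketch (link with -libverbs) that lists the devices and opens a context on the first one; this is roughly what ibv_devices does, although the GUID formatting is simplified and may not match the tool's output exactly.

```c
/* Sketch: enumerate the RDMA devices exposed through ib_uverbs and
 * open a device context on the first one. */
#include <infiniband/verbs.h>
#include <endian.h>
#include <stdio.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num; i++)
        printf("%-10s %016llx\n", ibv_get_device_name(list[i]),
               (unsigned long long)be64toh(ibv_get_device_guid(list[i])));

    /* Opening the device goes through the uverbs character device. */
    struct ibv_context *ctx = ibv_open_device(list[0]);
    if (!ctx) {
        fprintf(stderr, "failed to open %s\n", ibv_get_device_name(list[0]));
        ibv_free_device_list(list);
        return 1;
    }

    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return 0;
}
```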

Use ibv_devinfo to check for more information about the IB device.

jcf@node1:~> ibv_devinfo -d mlx4_0
hca_id: mlx4_0
    transport:          InfiniBand (0)
    fw_ver:             2.9.1000
    node_guid:          10d2:c910:0000:6bd0
    sys_image_guid:     10d2:c910:0000:6bd3
    vendor_id:          0x02c9
    vendor_part_id:     26428
    hw_ver:             0xB0
    board_id:           MT_0D90110009
    phys_port_cnt:      1
    Device ports:
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     4096 (5)
            sm_lid:         1
            port_lid:       1
            port_lmc:       0x00
            link_layer:     InfiniBand

Make sure that at least one port is in the PORT_ACTIVE state, which means the port is ready for use.
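
The same check can be done programmatically. Here is a minimal sketch (assuming an already opened device context, as in the listing example above) that queries the port attributes with ibv_query_port:

```c
/* Sketch: check whether a given port of an opened device is in the
 * PORT_ACTIVE state, the same condition ibv_devinfo reports. */
#include <infiniband/verbs.h>
#include <stdint.h>

static int port_is_active(struct ibv_context *ctx, uint8_t port_num)
{
    struct ibv_port_attr attr;

    if (ibv_query_port(ctx, port_num, &attr))
        return 0;               /* query failed, treat as not active */

    return attr.state == IBV_PORT_ACTIVE;
}
```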

Next, use ibv_*_pingpong to test the connection. There must be a server process and a client process; the first pair below runs both on the same node as a loopback test, and the second pair runs between two nodes.

Server:

jcf@node1:~> ibv_rc_pingpong -g 0 -d mlx4_0 -i 1
local address: LID 0x0001, QPN 0x000241, PSN 0x20e9d5, GID fe80::10d2:c910:0:6bd1
remote address: LID 0x0001, QPN 0x000242, PSN 0x3bef44, GID fe80::10d2:c910:0:6bd1
8192000 bytes in 0.01 seconds = 8998.49 Mbit/sec
1000 iters in 0.01 seconds = 7.28 usec/iter

Client:

jcf@node1:~> ibv_rc_pingpong -g 0 -d mlx4_0 -i 1 localhost
local address: LID 0x0001, QPN 0x000242, PSN 0x3bef44, GID fe80::10d2:c910:0:6bd1
remote address: LID 0x0001, QPN 0x000241, PSN 0x20e9d5, GID fe80::10d2:c910:0:6bd1
8192000 bytes in 0.01 seconds = 9322.33 Mbit/sec
1000 iters in 0.01 seconds = 7.03 usec/iter

Between two nodes:

Server:

jcf@node1:~> ibv_rc_pingpong -g 0 -d mlx4_0 -i 1
local address: LID 0x0001, QPN 0x000240, PSN 0x490d6c, GID fe80::10d2:c910:0:6bd1
remote address: LID 0x0009, QPN 0x00006a, PSN 0xf5086d, GID fe80::10d2:c910:0:6c61
8192000 bytes in 0.01 seconds = 7793.55 Mbit/sec
1000 iters in 0.01 seconds = 8.41 usec/iter

Client:

jcf@node4:~> ibv_rc_pingpong -g 0 -d mlx4_0 -i 1 node1
local address: LID 0x0009, QPN 0x00006a, PSN 0xf5086d, GID fe80::10d2:c910:0:6c61
remote address: LID 0x0001, QPN 0x000240, PSN 0x490d6c, GID fe80::10d2:c910:0:6bd1
8192000 bytes in 0.01 seconds = 7809.34 Mbit/sec
1000 iters in 0.01 seconds = 8.39 usec/iter

Besides ibv_*_pingpong, rping can also be used to test the connection. It goes through the RDMA connection manager (rdma_cm), so the client connects to an IP address (here 10.0.0.1, presumably an IPoIB address on node1).

Server:

jcf@node1:~> rping -s -v -C 20
server ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
server ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
server ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
server ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
server ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
server ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
server ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
server ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
server ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
server ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
server ping data: rdma-ping-10: KLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
server ping data: rdma-ping-11: LMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzAB
server ping data: rdma-ping-12: MNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABC
server ping data: rdma-ping-13: NOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCD
server ping data: rdma-ping-14: OPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDE
server ping data: rdma-ping-15: PQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEF
server ping data: rdma-ping-16: QRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFG
server ping data: rdma-ping-17: RSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGH
server ping data: rdma-ping-18: STUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHI
server ping data: rdma-ping-19: TUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJ
server DISCONNECT EVENT...
wait for RDMA_READ_ADV state 10

Client:

jcf@node4:~> rping -c -v -a 10.0.0.1 -C 20
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
ping data: rdma-ping-10: KLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
ping data: rdma-ping-11: LMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzAB
ping data: rdma-ping-12: MNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABC
ping data: rdma-ping-13: NOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCD
ping data: rdma-ping-14: OPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDE
ping data: rdma-ping-15: PQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEF
ping data: rdma-ping-16: QRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFG
ping data: rdma-ping-17: RSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGH
ping data: rdma-ping-18: STUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHI
ping data: rdma-ping-19: TUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJ
client DISCONNECT EVENT...

To be continued at Simple usage of RDMA ibv