A800 NCCL Multi-Host Communication
GPU Environment Preparation
1. Drivers and CUDA
- GPU cloud hosts created using industry images come pre-installed with GPU drivers and CUDA.
- For manual installation, refer to: https://developer.nvidia.com/cuda-toolkit-archive
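- To confirm the driver and CUDA toolkit are in place before proceeding, a quick check such as the following can be used (the nvcc path assumes the default /usr/local/cuda install):
     # Driver check: should list all GPUs on the host
     nvidia-smi
     # CUDA toolkit check
     /usr/local/cuda/bin/nvcc --version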
2. Load nvidia_peermem Kernel Module
- CUDA 11.4 and newer versions include this kernel module by default.
- For versions below 11.4, manually install the nv_peer_mem module and configure it to load on startup:
     sudo modprobe nvidia_peermem
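- To load the module automatically on startup, one option is a systemd modules-load entry; the file name below is only an example:
     # Make nvidia_peermem load on every boot (file name is an example)
     echo nvidia_peermem | sudo tee /etc/modules-load.d/nvidia-peermem.conf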
3. Install nvidia-fabricmanager
   # Ubuntu
   # NVIDIA driver version
   version=535.54.03  
   main_version=$(echo $version | awk -F '.' '{print $1}')
   sudo apt update
   sudo apt -y install nvidia-fabricmanager-${main_version}=${version}-*
   sudo systemctl start nvidia-fabricmanager
   sudo systemctl status nvidia-fabricmanager
   sudo systemctl enable nvidia-fabricmanager
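A quick way to confirm the fabric manager service is running and the NVLink/NVSwitch topology is visible (output details vary by driver version):
   # Should print "active" once the service is up
   systemctl is-active nvidia-fabricmanager
   # Shows NVLink connectivity between the GPUs
   nvidia-smi topo -m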
A800 Virtual Network Interface Configuration
After creation, an A800 cloud host has a single default eth0 network interface, typically used for management and general traffic forwarding. For multi-host GPU communication, 4 additional virtual network interfaces need to be created on each cloud host to carry the data traffic. The best practice for configuring these interfaces is described below:
1. Create Subnets
- In the VPC console, create a new VPC with CIDR 192.168.0.0/16 and an initial subnet subnet-eth1: 192.168.1.0/24.
- Under the same VPC, add the remaining subnets on its subnet tab:
- subnet-eth2: 192.168.2.0/24
- subnet-eth3: 192.168.3.0/24
- subnet-eth4: 192.168.4.0/24
2. Create and Attach Virtual NICs
- For each A800 host, switch to the virtual NIC tab and create 4 virtual NICs, one from each subnet:
- VNIC1: subnet-eth1
- VNIC2: subnet-eth2
- VNIC3: subnet-eth3
- VNIC4: subnet-eth4
- Attach the VNICs to each host in the same order (e.g., VNIC1 is created from subnet-eth1 on all hosts).
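- After attaching, the new interfaces should be visible inside each guest; a quick check (interface names may differ depending on the image):
     # eth1-eth4 should appear alongside eth0
     ip -br link show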
3. Initialize Virtual NICs
After the virtual NICs are attached, their network configuration must be initialized inside the cloud host on every boot. It is recommended to add this initialization to a startup script, for example with the systemd unit shown after the script below.
# ubuntu
# Bring up the data-plane interfaces with a larger MTU and obtain addresses via DHCP
mtu=4200
devs="eth1 eth2 eth3 eth4"
for dev in $devs
do
    sudo ip link set $dev mtu $mtu
    sudo ip link set $dev up
    sudo dhclient $dev
done
# Configure RoCE ToS, RoCE mode, and traffic class on the Mellanox HCAs
devs="mlx5_0 mlx5_1 mlx5_2 mlx5_3"
for dev in $devs
do
    sudo cma_roce_tos -d $dev -t 184
    sudo cma_roce_mode -d $dev -m 2
    echo 184 | sudo tee /sys/class/infiniband/$dev/tc/1/traffic_class
done
sudo pkill dhclient
sudo modprobe rdma_cm
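One way to run this initialization on every boot is a simple systemd oneshot unit. The script path and unit name below are assumptions; adjust them to your environment:
# Save the script above as /usr/local/sbin/a800-vnic-init.sh (path is an example), then:
sudo chmod +x /usr/local/sbin/a800-vnic-init.sh
sudo tee /etc/systemd/system/a800-vnic-init.service > /dev/null <<'EOF'
[Unit]
Description=Initialize A800 virtual NICs and RoCE settings
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/a800-vnic-init.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable a800-vnic-init.service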
NCCL Multi-Host Communication Verification
1. Download and Compile nccl-tests
   git clone https://github.com/NVIDIA/nccl-tests.git
   cd nccl-tests
   make MPI=1 MPI_HOME=/usr/mpi/gcc/openmpi-4.1.5a1 CUDA_HOME=/usr/local/cuda -j
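Before moving to the multi-host run, it can help to sanity-check the build with a single-host test (the binary path is the default nccl-tests build output):
   # Single-host all-reduce across the 8 local GPUs
   ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8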
2. Multi-Host NCCL Test Command
   mpirun --allow-run-as-root --oversubscribe -np {number_of_GPUs} --bind-to numa -H {internal_IPs} -mca plm_rsh_args "-p 22 -q -o StrictHostKeyChecking=no" -mca coll_hcoll_enable 0 \
   -mca pml ob1 -mca btl_tcp_if_include eth0 -mca btl ^openib -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca btl_openib_cpc_include rdmacm -mca btl_openib_rroce_enable 1 -x NCCL_IB_DISABLE=0 \
   -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=184 -x NCCL_IB_TIMEOUT=23 -x NCCL_IB_RETRY_CNT=7 -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
   -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x NCCL_TOPO_FILE={path_to_nccl_topo_file} -x NCCL_NET_GDR_LEVEL=1 \
   -x CUDA_DEVICE_ORDER=PCI_BUS_ID -x NCCL_ALGO=Ring -x LD_LIBRARY_PATH -x PATH {path_to_all_reduce_perf} -b 8 -e 8G -f 2 -g 1
 
 
   # Parameters:
   # {number_of_GPUs}: The number of cloud hosts × 8 (8 GPUs per host)
   # {internal_IPs}: eth0 IPs of all hosts, comma-separated (e.g., 192.168.1.2,192.168.1.3)
   # {path_to_nccl_topo_file}: Absolute path to the NCCL topology file, e.g., nccl_topo.xml
   # {path_to_all_reduce_perf}: Absolute path to all_reduce_perf file, e.g., $HOME/nccl-tests/build/all_reduce_perf
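For reference, here is what the command might look like for two hosts (16 GPUs) with eth0 addresses 192.168.1.2 and 192.168.1.3; the topology file and nccl-tests paths are placeholders for this sketch:
   mpirun --allow-run-as-root --oversubscribe -np 16 --bind-to numa -H 192.168.1.2,192.168.1.3 \
       -mca plm_rsh_args "-p 22 -q -o StrictHostKeyChecking=no" -mca coll_hcoll_enable 0 -mca pml ob1 \
       -mca btl_tcp_if_include eth0 -mca btl ^openib -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 \
       -mca btl_openib_cpc_include rdmacm -mca btl_openib_rroce_enable 1 \
       -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=184 \
       -x NCCL_IB_TIMEOUT=23 -x NCCL_IB_RETRY_CNT=7 -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
       -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
       -x NCCL_TOPO_FILE=$HOME/nccl_topo.xml -x NCCL_NET_GDR_LEVEL=1 -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
       -x NCCL_ALGO=Ring -x LD_LIBRARY_PATH -x PATH $HOME/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1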