A800 NCCL Multi-Host Communication

GPU Environment Preparation

1. Drivers and CUDA
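Install the NVIDIA data center driver and the CUDA toolkit on every host before proceeding. As a quick sanity check (standard commands; the versions reported depend on your image), both of the following should succeed on each host:

nvidia-smi
nvcc --version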

2. Load nvidia_peermem Kernel Module

  • CUDA 11.4 and newer versions include this kernel module by default.
  • For versions below 11.4, manually install the nv_peer_mem module and configure it to load on startup (a persistence sketch follows the command below):
sudo modprobe nvidia_peermem
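
To make the module load automatically on every boot, one option (a sketch assuming a systemd-based image such as Ubuntu; the file name is arbitrary) is a modules-load.d entry:

echo nvidia_peermem | sudo tee /etc/modules-load.d/nvidia-peermem.conf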

3. Install nvidia-fabricmanager

# Ubuntu
# NVIDIA driver version
version=535.54.03
main_version=$(echo $version | awk -F '.' '{print $1}')
sudo apt update
sudo apt -y install nvidia-fabricmanager-${main_version}=${version}-*
sudo systemctl start nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager
sudo systemctl enable nvidia-fabricmanager
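
The fabric manager package must match the installed driver version exactly. A quick sanity check (assuming the dpkg-based Ubuntu install above) is to compare the package version with the driver version reported by nvidia-smi:

dpkg -l | grep nvidia-fabricmanager
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1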

A800 Virtual Network Interface Configuration

After an A800 cloud host is created, it comes with a default eth0 network interface, which is typically used for management and traffic forwarding. For multi-host GPU communication, 4 virtual network interfaces need to be created on each cloud host to carry the data traffic. The following is the recommended network interface configuration:

1. Create Subnets

  • In the VPC console:
    • Create a new VPC with the CIDR block 192.168.0.0/16.
    • Create the initial subnet subnet-eth1: 192.168.1.0/24.
  • Under the same VPC, switch to the subnet tab and add the following subnets:
    • subnet-eth2: 192.168.2.0/24

    • subnet-eth3: 192.168.3.0/24

    • subnet-eth4: 192.168.4.0/24

2. Create and Attach Virtual NICs

  • For each A800 host:
    • Switch to the virtual NIC tab and create 4 virtual NICs, one from each subnet:
      • VNIC1: subnet-eth1
      • VNIC2: subnet-eth2
      • VNIC3: subnet-eth3
      • VNIC4: subnet-eth4
  • Attach the VNICs to every host in the same order (e.g., VNIC1 comes from subnet-eth1 on all hosts); a quick in-host check follows this list.
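
After attaching, the four VNICs should be visible inside each cloud host (as eth1 through eth4, assuming the default interface naming), even before they are configured:

ip -br link show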

3. Initialize Virtual NICs

With the virtual NICs attached, the network configuration must be initialized inside the cloud host on every boot. It is recommended to add the following initialization to the host's startup scripts.

# ubuntu
mtu=4200

# Bring up the data-plane NICs with jumbo MTU and obtain addresses via DHCP
devs="eth1 eth2 eth3 eth4"
for dev in $devs
do
    sudo ip link set $dev mtu $mtu
    sudo ip link set $dev up
    sudo dhclient $dev
done

# Configure RoCE v2 and the traffic class on the Mellanox devices
devs="mlx5_0 mlx5_1 mlx5_2 mlx5_3"
for dev in $devs
do
    sudo cma_roce_tos -d $dev -t 184
    sudo cma_roce_mode -d $dev -m 2
    # "sudo echo 184 > ..." would perform the redirect without root, so use tee instead
    echo 184 | sudo tee /sys/class/infiniband/$dev/tc/1/traffic_class
done

sudo pkill dhclient
sudo modprobe rdma_cm
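
A minimal post-boot check, assuming the device names used above: confirm the MTU and addresses on the data NICs, and that the RoCE traffic class was applied (each file should read 184):

for dev in eth1 eth2 eth3 eth4; do ip -br addr show dev $dev; done
for dev in mlx5_0 mlx5_1 mlx5_2 mlx5_3; do cat /sys/class/infiniband/$dev/tc/1/traffic_class; done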

NCCL Multi-Host Communication Verification

1. Download and Compile nccl-tests

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/mpi/gcc/openmpi-4.1.5a1 CUDA_HOME=/usr/local/cuda -j
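
Before moving to multiple hosts, it can help to verify the build with a single-host run (a sketch assuming 8 local GPUs; adjust -g to your GPU count):

# All-reduce across the 8 local GPUs, message sizes from 8 B to 128 MB
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8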

2. Multi-Host NCCL Test Command

mpirun --allow-run-as-root --oversubscribe -np {number_of_GPUs} --bind-to numa -H {internal_IPs} -mca plm_rsh_args "-p 22 -q -o StrictHostKeyChecking=no" -mca coll_hcoll_enable 0 \
    -mca pml ob1 -mca btl_tcp_if_include eth0 -mca btl ^openib -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca btl_openib_cpc_include rdmacm -mca btl_openib_rroce_enable 1 -x NCCL_IB_DISABLE=0 \
    -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=184 -x NCCL_IB_TIMEOUT=23 -x NCCL_IB_RETRY_CNT=7 -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x NCCL_TOPO_FILE={path_to_nccl_topo_file} -x NCCL_NET_GDR_LEVEL=1 \
    -x CUDA_DEVICE_ORDER=PCI_BUS_ID -x NCCL_ALGO=Ring -x LD_LIBRARY_PATH -x PATH {path_to_all_reduce_perf} -b 8 -e 8G -f 2 -g 1

# Parameters:
# {number_of_GPUs}: number of cloud hosts × 8 (8 GPUs per host)
# {internal_IPs}: eth0 IPs of all hosts, comma-separated (e.g., 192.168.1.2,192.168.1.3)
# {path_to_nccl_topo_file}: absolute path to the NCCL topology file, nccl_topo.xml
# {path_to_all_reduce_perf}: absolute path to the all_reduce_perf binary, e.g., $HOME/nccl-tests/build/all_reduce_perf
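
For illustration, with two hosts whose eth0 addresses are 192.168.1.2 and 192.168.1.3 (so -np is 2 hosts × 8 GPUs = 16), and assuming the topology file is at $HOME/nccl_topo.xml and the binary at $HOME/nccl-tests/build/all_reduce_perf, the filled-in command would be:

mpirun --allow-run-as-root --oversubscribe -np 16 --bind-to numa -H 192.168.1.2,192.168.1.3 -mca plm_rsh_args "-p 22 -q -o StrictHostKeyChecking=no" -mca coll_hcoll_enable 0 \
    -mca pml ob1 -mca btl_tcp_if_include eth0 -mca btl ^openib -mca btl_openib_if_include mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1 -mca btl_openib_cpc_include rdmacm -mca btl_openib_rroce_enable 1 -x NCCL_IB_DISABLE=0 \
    -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=184 -x NCCL_IB_TIMEOUT=23 -x NCCL_IB_RETRY_CNT=7 -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x NCCL_TOPO_FILE=$HOME/nccl_topo.xml -x NCCL_NET_GDR_LEVEL=1 \
    -x CUDA_DEVICE_ORDER=PCI_BUS_ID -x NCCL_ALGO=Ring -x LD_LIBRARY_PATH -x PATH $HOME/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1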