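NCCL Tests is a test suite provided by NVIDIA for verifying and benchmarking the communication performance and stability of NCCL across multiple GPUs and multiple nodes.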
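NVIDIA [NCCL Tests]
Building NCCL Tests as described here (with MPI=1) depends on the CUDA Toolkit, NCCL, and Open MPI, so install them first. Commands shown without sudo assume a root shell.
Base environment:
- Ubuntu: 22.04.5
- Kernel: 5.15.0-119-generic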
- CUDA Toolkit
- NCCL
- Open MPI
1. Base Environment
1.1 Configure APT sources
- Back up the existing sources.list (cp -n leaves an existing backup untouched)
[ -f /etc/apt/sources.list ] && cp -n /etc/apt/sources.list /etc/apt/sources.list.bak
- Write the Aliyun mirror entries for Ubuntu 22.04 (jammy)
cat <<'EOF' > /etc/apt/sources.list
deb https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
deb https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
deb https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
# deb https://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse
# deb-src https://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse
deb https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
EOF
- Refresh the package index
sudo apt update
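Because the mirror entries above hard-code the jammy suite, a quick look at the release codename can catch a mismatch on hosts that are not Ubuntu 22.04. The following is a minimal sketch; `os_release_codename` is a hypothetical helper introduced only for illustration.

```shell
#!/bin/bash
# Sketch: confirm the release codename matches the jammy mirror entries.
# os_release_codename is a hypothetical helper, not part of the original guide.
os_release_codename() {
    # Read VERSION_CODENAME from an os-release style file inside a subshell
    # so the sourced variables do not leak into the caller.
    local file="${1:-/etc/os-release}"
    ( . "$file" 2>/dev/null && printf '%s\n' "${VERSION_CODENAME:-unknown}" )
}

codename="$(os_release_codename /etc/os-release || true)"
if [ "$codename" = "jammy" ]; then
    echo "OK: Ubuntu 22.04 (jammy), the mirror entries apply as written"
else
    echo "Warning: codename is '${codename:-unknown}'; adjust the suite names in sources.list"
fi
```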
1.2 Kernel packages
apt install linux-image-5.15.0-119-generic linux-headers-5.15.0-119-generic linux-tools-5.15.0-119-generic linux-cloud-tools-5.15.0-119-generic
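Installing the kernel packages does not switch the running kernel; that only happens after a reboot into the new image. A small sketch for comparing the running release against the one pinned in this guide is shown below. `kernel_status` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: compare the running kernel with the release pinned in this guide.
expected_kernel="5.15.0-119-generic"

kernel_status() {
    # $1 = expected release, $2 = running release (as printed by `uname -r`)
    if [ "$1" = "$2" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

running_kernel="$(uname -r)"
if [ "$(kernel_status "$expected_kernel" "$running_kernel")" = "match" ]; then
    echo "Running kernel is $running_kernel"
else
    echo "Running $running_kernel, expected $expected_kernel; reboot into the new kernel if it was just installed"
fi
```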
2. Dependencies
CUDA Toolkit
Reference article: CUDA Toolkit installation and deployment
NCCL
- Install script nccl-install.sh
#!/bin/bash
set -e
# -------------------------
# Configuration
# -------------------------
# NCCL version (can be overridden by first script argument)
NCCL_VER="${1:-2.26.2-1+cuda12.8}"
CUDA_KEYRING_PKG="cuda-keyring_1.1-1_all.deb"
CUDA_KEYRING_URL="https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/${CUDA_KEYRING_PKG}"
# -------------------------
# Download CUDA keyring
# -------------------------
echo "Downloading CUDA keyring..."
if [ ! -f "$CUDA_KEYRING_PKG" ]; then
    wget -c "$CUDA_KEYRING_URL"
fi
# -------------------------
# Install CUDA keyring
# -------------------------
echo "Installing CUDA keyring..."
sudo dpkg -i "$CUDA_KEYRING_PKG"
# -------------------------
# Update apt repository
# -------------------------
echo "Updating package list..."
sudo apt-get update
# -------------------------
# Install NCCL
# -------------------------
echo "Installing NCCL version $NCCL_VER..."
sudo apt-get install -y --no-install-recommends \
    libnccl2="$NCCL_VER" \
    libnccl-dev="$NCCL_VER"
echo "NCCL $NCCL_VER installation completed successfully."
- Run the script (a different package version can be passed as the first argument)
bash nccl-install.sh
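After the script finishes, the installed version can be read back from the header, which defines the `NCCL_MAJOR`, `NCCL_MINOR`, and `NCCL_PATCH` macros. The sketch below assumes the apt packages place the header at /usr/include/nccl.h, matching the NCCL_HOME used later in this guide; `nccl_header_version` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: read the NCCL version from the NCCL_MAJOR/MINOR/PATCH macros in nccl.h.
# nccl_header_version is a hypothetical helper, not part of NCCL itself.
nccl_header_version() {
    # $1 = path to nccl.h; prints MAJOR.MINOR.PATCH, or nothing if not found
    awk '
        $1 == "#define" && $2 == "NCCL_MAJOR" { major = $3 }
        $1 == "#define" && $2 == "NCCL_MINOR" { minor = $3 }
        $1 == "#define" && $2 == "NCCL_PATCH" { patch = $3 }
        END { if (major != "") print major "." minor "." patch }
    ' "$1"
}

header="/usr/include/nccl.h"   # assumed location for the apt packages
if [ -f "$header" ]; then
    echo "NCCL header version: $(nccl_header_version "$header")"
else
    echo "nccl.h not found at $header"
fi
```

If the pinned version should survive later upgrades, `sudo apt-mark hold libnccl2 libnccl-dev` is one option.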
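Open MPI
Open MPI is an open-source, high-performance implementation of the MPI message-passing standard, widely used in HPC and distributed parallel computing.
Build and install from source
- The source is cloned from GitHub and compiled locally
- To build a different version, change OMPI_TAG in the script
- Install script openmpi-install.sh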
#!/bin/bash
set -e
# OpenMPI version and install path
OMPI_TAG="v4.1.8"
PREFIX="/usr/local/openmpi"
PROFILE_FILE="/etc/profile.d/openmpi.sh"
# Required build dependencies (minimal)
apt-get update
# Note the trailing backslash after -y: without it the package list below
# would be run as a separate (failing) command
apt-get install -y \
    --no-install-recommends \
    autoconf automake libtool \
    gcc g++ make \
    libhwloc-dev libevent-dev \
    flex bison git
# Clone OpenMPI source code
if [ ! -d "ompi" ]; then
    git clone --branch "${OMPI_TAG}" --depth 1 https://github.com/open-mpi/ompi.git
fi
cd ompi
# Generate configure script
./autogen.pl
# Configure
./configure --prefix="${PREFIX}"
# Build and install
make -j"$(nproc)"
make install
# Create system-wide environment configuration
cat >"${PROFILE_FILE}" <<EOF
# OpenMPI environment
export PATH=${PREFIX}/bin:\$PATH
export LD_LIBRARY_PATH=${PREFIX}/lib:\$LD_LIBRARY_PATH
EOF
chmod 644 "${PROFILE_FILE}"
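# Load the environment for the remainder of this script only; the calling
# shell is unaffected when the script is run with `bash openmpi-install.sh`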
source "${PROFILE_FILE}"
# Verify installation
mpirun --version
echo "OpenMPI ${OMPI_TAG} installed successfully at ${PREFIX}"
- Run the script
bash openmpi-install.sh
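Because the script runs in its own bash process, the PATH change written to /etc/profile.d/openmpi.sh does not reach the shell that launched it. The sketch below checks whether the Open MPI bin directory is already on PATH; `path_has_dir` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: check whether the Open MPI bin directory is on PATH.
# path_has_dir is a hypothetical helper introduced for illustration.
path_has_dir() {
    # $1 = directory, $2 = colon-separated PATH-style list
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

prefix="/usr/local/openmpi"
if path_has_dir "$prefix/bin" "$PATH"; then
    echo "$prefix/bin is on PATH"
else
    echo "$prefix/bin is not on PATH; run: source /etc/profile.d/openmpi.sh"
fi
```

Once the directory is on PATH, `mpirun -np 2 hostname` serves as a quick local smoke test. Open MPI refuses to launch as root unless `--allow-run-as-root` is added.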
3. NCCL Tests
- Install script nccl-tests-install.sh
#!/bin/bash
set -e
# -------------------------
# Configuration
# -------------------------
NCCT_REPO="https://github.com/NVIDIA/nccl-tests.git"
NCCT_DIR="/usr/local/src/nccl-tests"
NCCT_TAG="${1:-v2.17.6}"
MPI_HOME="/usr/local/openmpi"
CUDA_HOME="/usr/local/cuda"
NCCL_HOME="/usr"
# -------------------------
# Dependency check
# -------------------------
echo "Checking dependencies..."
# -------------------------
# Check OpenMPI
# -------------------------
if [ ! -d "$MPI_HOME" ] || [ ! -x "$MPI_HOME/bin/mpirun" ]; then
    echo "Error: OpenMPI not found in $MPI_HOME"
    exit 1
else
    echo "=================================================="
    echo "OpenMPI found at $MPI_HOME"
    "$MPI_HOME/bin/mpirun" --version | head -n 1
    echo "=================================================="
fi
# -------------------------
# Check CUDA
# -------------------------
if [ ! -d "$CUDA_HOME" ] || [ ! -x "$CUDA_HOME/bin/nvcc" ]; then
    echo "Error: CUDA not found in $CUDA_HOME"
    exit 1
else
    echo "=================================================="
    echo "CUDA found at $CUDA_HOME"
    "$CUDA_HOME/bin/nvcc" --version | grep "release"
    echo "=================================================="
fi
# -------------------------
# Check NCCL
# -------------------------
if [ ! -d "$NCCL_HOME/include" ] || [ ! -f "$NCCL_HOME/include/nccl.h" ]; then
    echo "Error: NCCL not found in $NCCL_HOME"
    exit 1
else
    echo "=================================================="
    echo "NCCL found at $NCCL_HOME"
    if command -v dpkg >/dev/null 2>&1; then
        dpkg-query -W -f='${Package} ${Version}\n' libnccl2 libnccl-dev 2>/dev/null || echo "NCCL version info not found"
    else
        echo "NCCL header exists but cannot query version without dpkg"
    fi
    echo "=================================================="
fi
echo "All dependencies found. Proceeding to clone and compile."
# -------------------------
# Clone repository with tag
# -------------------------
if [ ! -d "$NCCT_DIR" ]; then
    git clone --branch "$NCCT_TAG" --depth 1 "$NCCT_REPO" "$NCCT_DIR"
else
    echo "Repository already exists at $NCCT_DIR, fetching tags..."
    cd "$NCCT_DIR"
    git fetch --all --tags
    git checkout "$NCCT_TAG"
fi
cd "$NCCT_DIR"
# -------------------------
# Compile nccl-tests
# -------------------------
echo "Compiling nccl-tests..."
make clean
make MPI=1 MPI_HOME="$MPI_HOME" CUDA_HOME="$CUDA_HOME" NCCL_HOME="$NCCL_HOME"
echo "Compilation of nccl-tests ($NCCT_TAG) completed successfully."
echo "You can run tests using: $NCCT_DIR/build/all_reduce_perf"
- Run the script (a different nccl-tests tag can be passed as the first argument)
bash nccl-tests-install.sh
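With the binaries built, a typical first run sweeps message sizes with all_reduce_perf. The sketch below only assembles and prints candidate command lines; the size range (8 bytes to 128 MB, doubling each step) and the `build_allreduce_cmd` helper are illustrative assumptions, and the GPU count is taken from `nvidia-smi -L` when available.

```shell
#!/bin/bash
# Sketch: assemble all_reduce_perf command lines for a single node.
# build_allreduce_cmd is a hypothetical helper; the size range is an example.
build_allreduce_cmd() {
    # $1 = directory holding the nccl-tests binaries, $2 = GPUs per process
    # -b/-e: minimum/maximum message size, -f: size multiplication factor,
    # -g: number of GPUs driven by each process
    printf '%s/all_reduce_perf -b 8 -e 128M -f 2 -g %s\n' "$1" "$2"
}

build_dir="/usr/local/src/nccl-tests/build"
gpus=0
if command -v nvidia-smi >/dev/null 2>&1; then
    # Count only top-level "GPU N: ..." lines
    gpus="$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ')"
fi
[ "$gpus" -ge 1 ] || gpus=1   # fall back to 1 for illustration only

echo "Single process:   $(build_allreduce_cmd "$build_dir" "$gpus")"
echo "One rank per GPU: mpirun -np $gpus $(build_allreduce_cmd "$build_dir" 1)"
```

The single-process form drives all local GPUs from one process, while the mpirun form starts one rank per GPU, which is also the shape used for multi-node runs once a host list is supplied.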
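Conclusion
Using typical collective operations such as AllReduce, Broadcast, and AllGather, NCCL Tests lets users directly measure the bandwidth and latency of a GPU cluster over NVLink, PCIe, InfiniBand, or RoCE. It is an important reference tool for deploying NCCL, diagnosing communication bottlenecks, and validating the performance of HPC and AI training environments.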
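When reading the bandwidth columns, note that nccl-tests reports both algorithm bandwidth (algbw) and bus bandwidth (busbw). Per the nccl-tests performance notes, busbw scales algbw by a factor that depends on the collective: 2(n-1)/n for AllReduce, (n-1)/n for AllGather and ReduceScatter, and 1 for Broadcast and Reduce, where n is the number of ranks. A minimal sketch of that conversion follows; the `busbw` helper and its operation names are hypothetical.

```shell
#!/bin/bash
# Sketch: convert algorithm bandwidth to bus bandwidth for common collectives.
# busbw is a hypothetical helper; operation names are chosen for illustration.
busbw() {
    # $1 = operation, $2 = algorithm bandwidth in GB/s, $3 = number of ranks
    awk -v op="$1" -v algbw="$2" -v n="$3" 'BEGIN {
        if (op == "all_reduce")
            factor = 2 * (n - 1) / n
        else if (op == "all_gather" || op == "reduce_scatter")
            factor = (n - 1) / n
        else
            factor = 1   # broadcast, reduce (and any unrecognized name)
        printf "%.2f\n", algbw * factor
    }'
}

busbw all_reduce 100 8   # prints 175.00
busbw all_gather 100 8   # prints 87.50
```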