NVIDIA【NCCL Tests】

This article is part 6 of the nvidia series.

NCCL Tests is a test suite from NVIDIA for verifying and measuring the communication performance and stability of NCCL across multiple GPUs and multiple nodes.

Building NCCL Tests as described here requires CUDA Toolkit, NCCL, and Open MPI. Install these dependencies first.

Base environment:

  • Ubuntu: 22.04.5
  • Kernel: 5.15.0-119-generic
  • CUDA Toolkit
  • NCCL
  • Open MPI

1. Base environment
#

1.1 Configure the APT sources
#

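  1. Back up the existing sources.list (cp -n will not overwrite a backup that already exists)
[ -f /etc/apt/sources.list ] && sudo cp -n /etc/apt/sources.list /etc/apt/sources.list.bak
  2. Write the Aliyun mirror entries for Ubuntu 22.04 (jammy)
sudo tee /etc/apt/sources.list >/dev/null <<'EOF'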
deb https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse

# deb https://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse
# deb-src https://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
EOF
  3. Refresh the package index
sudo apt update
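
The mirror entries above hard-code the `jammy` codename. As a minimal sketch, assuming the same layout is wanted on another Ubuntu release, the list can be generated from a codename instead. `render_sources` is a hypothetical helper written for this example, not part of any tool; it only prints the entries and does not touch /etc/apt/sources.list.

```shell
#!/bin/bash
# Hypothetical helper: print Aliyun mirror entries for a given Ubuntu codename.
render_sources() {
    local codename="$1"
    local mirror="https://mirrors.aliyun.com/ubuntu/"
    local suite
    # The empty suffix is the base suite; the others match the list above.
    for suite in "" "-security" "-updates" "-backports"; do
        echo "deb ${mirror} ${codename}${suite} main restricted universe multiverse"
        echo "deb-src ${mirror} ${codename}${suite} main restricted universe multiverse"
    done
}

# Use the codename of the running system when available, otherwise jammy.
codename="jammy"
if [ -r /etc/os-release ]; then
    codename="$(. /etc/os-release && echo "${VERSION_CODENAME:-jammy}")"
fi

render_sources "$codename"
```

Review the printed entries before redirecting them into /etc/apt/sources.list.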

1.2 Kernel packages
#

sudo apt install linux-image-5.15.0-119-generic linux-headers-5.15.0-119-generic linux-tools-5.15.0-119-generic linux-cloud-tools-5.15.0-119-generic
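
Installing the kernel packages does not switch the running kernel; that happens on the next boot. The sketch below compares the running kernel with the target release. `TARGET_KERNEL` and `kernel_matches` are names made up for this example.

```shell
#!/bin/bash
# Target release, taken from the package names installed above.
TARGET_KERNEL="5.15.0-119-generic"

# Hypothetical helper: succeed when the two kernel release strings are identical.
kernel_matches() {
    [ "$1" = "$2" ]
}

running="$(uname -r)"
if kernel_matches "$running" "$TARGET_KERNEL"; then
    echo "Running the target kernel: $running"
else
    echo "Running $running, expected $TARGET_KERNEL; a reboot is needed to switch kernels"
fi
```

If a newer kernel is also installed, the target may have to be selected in the boot menu.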

2. Dependencies
#

CUDA Toolkit
#

See the related article: CUDA Toolkit installation and deployment

NCCL
#

  1. Installation script nccl-install.sh
#!/bin/bash
set -e

# -------------------------
# Configuration
# -------------------------
# NCCL version (can be overridden by first script argument)
NCCL_VER="${1:-2.26.2-1+cuda12.8}"

CUDA_KEYRING_PKG="cuda-keyring_1.1-1_all.deb"
CUDA_KEYRING_URL="https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/${CUDA_KEYRING_PKG}"

# -------------------------
# Download CUDA keyring
# -------------------------
echo "Downloading CUDA keyring..."
if [ ! -f "$CUDA_KEYRING_PKG" ]; then
    wget -c "$CUDA_KEYRING_URL"
fi

# -------------------------
# Install CUDA keyring
# -------------------------
echo "Installing CUDA keyring..."
sudo dpkg -i "$CUDA_KEYRING_PKG"

# -------------------------
# Update apt repository
# -------------------------
echo "Updating package list..."
sudo apt-get update

# -------------------------
# Install NCCL
# -------------------------
echo "Installing NCCL version $NCCL_VER..."
sudo apt-get install -y --no-install-recommends \
    libnccl2="$NCCL_VER" \
    libnccl-dev="$NCCL_VER"

echo "NCCL $NCCL_VER installation completed successfully."
  2. Run the script
bash nccl-install.sh
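
The version string passed to the script packs two facts together: the NCCL release and the CUDA version the package was built against. As a rough sketch, the two parts can be split with shell parameter expansion and compared with the local toolkit. `nccl_release` and `nccl_cuda` are hypothetical helper names.

```shell
#!/bin/bash
# Same default as nccl-install.sh above.
NCCL_VER="${NCCL_VER:-2.26.2-1+cuda12.8}"

# Hypothetical helpers built on plain parameter expansion.
nccl_release() { echo "${1%%-*}"; }      # text before the first "-"
nccl_cuda()    { echo "${1##*+cuda}"; }  # text after "+cuda"

echo "NCCL release:   $(nccl_release "$NCCL_VER")"
echo "Built for CUDA: $(nccl_cuda "$NCCL_VER")"

# Show the local toolkit release for comparison when nvcc is on PATH.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version | grep "release"
fi
```

To see which version strings the repository actually offers, `apt-cache madison libnccl2` lists them once the CUDA keyring is installed and the package index is refreshed.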

Open MPI
#

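Open MPI is an open-source, high-performance message-passing library widely used in HPC and distributed parallel computing.

Build and install from source

  • The source is cloned from GitHub and compiled locally
  • To build a different version, change OMPI_TAG in the script
  1. Installation script openmpi-install.sh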
#!/bin/bash
set -e

# OpenMPI version and install path
OMPI_TAG="v4.1.8"
PREFIX="/usr/local/openmpi"
PROFILE_FILE="/etc/profile.d/openmpi.sh"

# Required build dependencies (minimal)
apt-get update
apt-get install -y \
  --no-install-recommends \
  autoconf automake libtool \
  gcc g++ make \
  libhwloc-dev libevent-dev \
  flex bison git

# Clone OpenMPI source code
if [ ! -d "ompi" ]; then
  git clone --branch "${OMPI_TAG}" --depth 1 https://github.com/open-mpi/ompi.git
fi

cd ompi

# Generate configure script
./autogen.pl

# Configure
./configure --prefix="${PREFIX}"

# Build and install
make -j"$(nproc)"
make install

# Create system-wide environment configuration
cat >"${PROFILE_FILE}" <<EOF
# OpenMPI environment

export PATH=${PREFIX}/bin:\$PATH
export LD_LIBRARY_PATH=${PREFIX}/lib:\$LD_LIBRARY_PATH
EOF

chmod 644 "${PROFILE_FILE}"

# Load environment for current shell
source "${PROFILE_FILE}"

# Verify installation
mpirun --version

echo "OpenMPI ${OMPI_TAG} installed successfully at ${PREFIX}"
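  2. Run the script as root, then start a new login shell (or source /etc/profile.d/openmpi.sh) so that PATH and LD_LIBRARY_PATH take effect
sudo bash openmpi-install.sh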
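
For multi-node runs, Open MPI is usually given a hostfile with one `host slots=N` line per node. The sketch below generates such a file. `make_hostfile` is a hypothetical helper, `node1` and `node2` are placeholder host names, and 8 slots assumes eight GPUs per node.

```shell
#!/bin/bash
# Hypothetical helper: print one "host slots=N" line per host.
# Usage: make_hostfile <slots_per_host> <host>...
make_hostfile() {
    local slots="$1"
    shift
    local host
    for host in "$@"; do
        echo "${host} slots=${slots}"
    done
}

# Placeholder host names; replace them with the real nodes of the cluster.
make_hostfile 8 node1 node2
```

Redirect the output to a file and pass it to `mpirun --hostfile <file>`. This assumes every node can be reached over SSH without a password and has Open MPI installed under the same prefix.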

3. NCCL Tests
#

  1. Installation script nccl-tests-install.sh
#!/bin/bash
set -e

# -------------------------
# Configuration
# -------------------------
NCCT_REPO="https://github.com/NVIDIA/nccl-tests.git"
NCCT_DIR="/usr/local/src/nccl-tests"
NCCT_TAG="${1:-v2.17.6}"
MPI_HOME="/usr/local/openmpi"
CUDA_HOME="/usr/local/cuda"
NCCL_HOME="/usr"

# -------------------------
# Dependency check
# -------------------------
echo "Checking dependencies..."

# -------------------------
# Check OpenMPI
# -------------------------
if [ ! -d "$MPI_HOME" ] || [ ! -x "$MPI_HOME/bin/mpirun" ]; then
    echo "Error: OpenMPI not found in $MPI_HOME"
    exit 1
else
    echo "=================================================="
    echo "OpenMPI found at $MPI_HOME"
    "$MPI_HOME/bin/mpirun" --version | head -n 1
    echo "=================================================="
fi

# -------------------------
# Check CUDA
# -------------------------
if [ ! -d "$CUDA_HOME" ] || [ ! -x "$CUDA_HOME/bin/nvcc" ]; then
    echo "Error: CUDA not found in $CUDA_HOME"
    exit 1
else
    echo "=================================================="
    echo "CUDA found at $CUDA_HOME"
    "$CUDA_HOME/bin/nvcc" --version | grep "release"
    echo "=================================================="
fi

# -------------------------
# Check NCCL
# -------------------------
if [ ! -d "$NCCL_HOME/include" ] || [ ! -f "$NCCL_HOME/include/nccl.h" ]; then
    echo "Error: NCCL not found in $NCCL_HOME"
    exit 1
else
    echo "=================================================="
    echo "NCCL found at $NCCL_HOME"
    if command -v dpkg >/dev/null 2>&1; then
        dpkg-query -W -f='${Package} ${Version}\n' libnccl2 libnccl-dev 2>/dev/null || echo "NCCL version info not found"
    else
        echo "NCCL header exists but cannot query version without dpkg"
    fi
    echo "=================================================="
fi

echo "All dependencies found. Proceeding to clone and compile."

# -------------------------
# Clone repository with tag
# -------------------------
if [ ! -d "$NCCT_DIR" ]; then
    git clone --branch "$NCCT_TAG" --depth 1 "$NCCT_REPO" "$NCCT_DIR"
else
    echo "Repository already exists at $NCCT_DIR, fetching tags..."
    cd "$NCCT_DIR"
    git fetch --all --tags
    git checkout "$NCCT_TAG"
fi

cd "$NCCT_DIR"

# -------------------------
# Compile nccl-tests
# -------------------------
echo "Compiling nccl-tests..."
make clean
make MPI=1 MPI_HOME="$MPI_HOME" CUDA_HOME="$CUDA_HOME" NCCL_HOME="$NCCL_HOME"

echo "Compilation of nccl-tests ($NCCT_TAG) completed successfully."
echo "You can run tests using: $NCCT_DIR/build/all_reduce_perf"
  2. Run the script as root (it clones into /usr/local/src)
sudo bash nccl-tests-install.sh
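
Once the build finishes, a first run might look like the sketch below. It only prints the command line so it can be reviewed before anything is launched. `all_reduce_cmd` is a hypothetical helper, and the message-size range is an example rather than a recommendation.

```shell
#!/bin/bash
# Same install location as nccl-tests-install.sh above.
NCCT_DIR="/usr/local/src/nccl-tests"

# Hypothetical helper: print (not run) an mpirun command line for all_reduce_perf.
# Usage: all_reduce_cmd <total_processes>
all_reduce_cmd() {
    local np="$1"
    # -b / -e: smallest and largest message size
    # -f:      multiplication factor between successive sizes
    # -g:      GPUs per thread (one thread per process by default)
    echo "mpirun -np ${np} ${NCCT_DIR}/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1"
}

all_reduce_cmd 8
```

With `-g 1`, each MPI process drives one GPU, so `-np` should equal the total number of GPUs in the run. When running as root, Open MPI also requires `--allow-run-as-root`.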

Conclusion
#

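NCCL Tests exercises typical collective operations such as AllReduce, Broadcast, and AllGather, giving a direct measure of a GPU cluster's bandwidth and latency over NVLink, PCIe, InfiniBand, or RoCE. It is a key reference tool for deploying NCCL, tracking down communication bottlenecks, and validating the performance of HPC and AI training environments.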
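
When reading the results, note that the tests report two bandwidth figures: algorithm bandwidth (algbw) and bus bandwidth (busbw). For AllReduce, the nccl-tests performance notes define busbw as algbw multiplied by 2(n-1)/n, where n is the number of ranks. The sketch below applies that conversion; `allreduce_busbw` is a hypothetical helper and the input values are made up.

```shell
#!/bin/bash
# Hypothetical helper: convert AllReduce algorithm bandwidth to bus bandwidth.
# Usage: allreduce_busbw <algbw_in_GB_per_s> <number_of_ranks>
allreduce_busbw() {
    awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.2f\n", algbw * 2 * (n - 1) / n }'
}

# Made-up example: 100 GB/s algorithm bandwidth across 8 ranks.
allreduce_busbw 100 8   # prints 175.00
```

Bus bandwidth is the figure to compare against hardware link speeds, because the correction factor removes the dependence on the number of ranks.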
