NVIDIA HPC-X

Base environment:

Ubuntu 22.04, kernel 5.15.0-119-generic
NVIDIA GPU
Mellanox

1. Base Environment

1.1 Configure the APT Sources

Back up the existing sources.list (cp -n will not overwrite a backup that already exists):

[ -f /etc/apt/sources.list ] && cp -n /etc/apt/sources.list /etc/apt/sources.list.bak

Write the Aliyun mirror sources for Ubuntu 22.04 (jammy):

cat <<'EOF' > /etc/apt/sources.list
deb https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse

# deb https://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse
# deb-src https://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
EOF

Update the package index:

sudo apt update
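
The mirror entries above are hard-coded to jammy, so it can be worth confirming the release codename before overwriting sources.list. The following is a minimal sketch, not part of the original procedure: the helper names os_codename and require_codename are hypothetical, and it assumes the os-release file defines VERSION_CODENAME, as Ubuntu's does.

```shell
# Hypothetical helper: print the release codename recorded in an os-release file.
# The file path is a parameter so the function can be tried against a copy.
os_codename() {
  local file="${1:-/etc/os-release}"
  # Source the file in a subshell so its variables do not leak into the caller.
  ( . "$file" && printf '%s\n' "${VERSION_CODENAME:-unknown}" )
}

# Hypothetical helper: succeed only when the codename matches the expected one.
require_codename() {
  local expected="$1" file="${2:-/etc/os-release}"
  local actual
  actual="$(os_codename "$file")"
  if [ "$actual" != "$expected" ]; then
    echo "expected codename '${expected}', found '${actual}'" >&2
    return 1
  fi
}
```

Running `require_codename jammy` before the `cat <<'EOF'` step stops early on a system that is not Ubuntu 22.04, instead of leaving it with mismatched sources.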

2. HPC-X

Download: get the package that matches your operating system and version from the NVIDIA HPC-X page.
Click Download, then select the version information.

cuda12.x

Installation

Edit the package information in the script to match your download, then run the script.

#!/bin/bash
set -e

# HPC-X version and package information
HPCX_VER="v2.18.1"
HPCX_FILE="hpcx-${HPCX_VER}-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64.tbz"
HPCX_URL="https://content.mellanox.com/hpc/hpc-x/${HPCX_VER}/${HPCX_FILE}"

# Installation paths
INSTALL_DIR="/usr/local/src"
SRC_DIR="${INSTALL_DIR}/hpcx-${HPCX_VER}-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64"
HPCX_DIR="${INSTALL_DIR}/hpcx"
PROFILE_FILE="/etc/profile.d/hpcx.sh"

# Check required commands
command -v wget >/dev/null || { echo "wget not found"; exit 1; }
command -v tar  >/dev/null || { echo "tar not found"; exit 1; }

# Download the package if it does not already exist
if [ ! -f "${HPCX_FILE}" ]; then
  wget -c "${HPCX_URL}"
fi

# Extract the package
tar -xvf "${HPCX_FILE}" -C "${INSTALL_DIR}"

# Rename the extracted directory to a unified path
if [ -d "${SRC_DIR}" ] && [ ! -d "${HPCX_DIR}" ]; then
  mv "${SRC_DIR}" "${HPCX_DIR}"
fi

# Create system-wide environment configuration
cat >"${PROFILE_FILE}" <<'EOF'
# NVIDIA HPC-X system-wide environment settings

export HPCX_HOME=/usr/local/src/hpcx

if [ -f "$HPCX_HOME/hpcx-init.sh" ]; then
  source "$HPCX_HOME/hpcx-init.sh"
  hpcx_load
fi
EOF

# Set proper permissions
chmod 644 "${PROFILE_FILE}"

# Load the environment inside this script to confirm that hpcx-init.sh works.
# This does not change the calling shell unless the script itself is sourced.
if [ -f "${HPCX_DIR}/hpcx-init.sh" ]; then
  source "${HPCX_DIR}/hpcx-init.sh"
  hpcx_load
fi

echo "HPC-X ${HPCX_VER} installation completed successfully."
echo "Run 'source ${PROFILE_FILE}' to load the environment in the current shell."
echo "New login sessions will load it automatically."
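
The package information at the top of the script follows a fixed naming pattern, and the rename step relies on the archive unpacking into a directory with the same name. As a hedged sketch, the two hypothetical helpers below compose an archive name from its parts and print the top-level directory of an archive before it is extracted. The pattern is only inferred from the file name used in the script; other releases may use different tags (the driver-stack tag in particular), so compare the result with the file actually offered on the download page.

```shell
# Hypothetical helper: compose an HPC-X archive name from its components.
# Assumption: the pattern mirrors the file name used in the install script.
hpcx_package_name() {
  local ver="$1"                      # release tag, e.g. v2.18.1
  local distro="${2:-ubuntu22.04}"    # OS tag in the file name
  local cuda="${3:-cuda12}"           # CUDA major-version tag
  local ofed="${4:-mlnx_ofed}"        # driver-stack tag
  local arch="${5:-x86_64}"           # CPU architecture
  printf 'hpcx-%s-gcc-%s-%s-%s-%s.tbz\n' "$ver" "$ofed" "$distro" "$cuda" "$arch"
}

# Hypothetical helper: print the top-level directory stored in an archive.
# GNU tar detects the compression format on its own when reading.
hpcx_archive_topdir() {
  tar -tf "$1" | head -n 1 | cut -d/ -f1
}
```

For example, `hpcx_package_name v2.18.1` reproduces the HPCX_FILE value used in the script, and `hpcx_archive_topdir "${HPCX_FILE}"` shows the directory name that SRC_DIR has to match.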

root@ubuntu:/usr/local/src# tree -L 1 hpcx/
hpcx/
├── archive
├── clusterkit
├── hcoll
├── hpcx-debug-init-ompi.sh
├── hpcx-debug-init.sh -> hpcx-debug-init-ompi.sh
├── hpcx-init-ompi.sh
├── hpcx-init.sh -> hpcx-init-ompi.sh
├── hpcx-mt-init-ompi.sh
├── hpcx-mt-init.sh -> hpcx-mt-init-ompi.sh
├── hpcx-prof-init-ompi.sh
├── hpcx-prof-init.sh -> hpcx-prof-init-ompi.sh
├── hpcx-stack-init-ompi.sh
├── hpcx-stack-init.sh -> hpcx-stack-init-ompi.sh
├── modulefiles
├── nccl_rdma_sharp_plugin
├── ompi
├── OSS_Licenses.pdf
├── OSS_Notices.pdf
├── README.txt
├── sharp
├── sources
├── ucc
├── ucx
├── utils
└── VERSION

12 directories, 14 files

3. Verification
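
One possible way to verify the installation is to check that the expected entries exist under the install directory and that commands such as mpirun resolve to the HPC-X tree once the environment is loaded. The sketch below is an assumption, not a procedure from the original: check_hpcx_layout and command_under_prefix are hypothetical helpers, and the entry names are taken from the directory listing shown earlier.

```shell
# Hypothetical helper: report which expected HPC-X entries exist under a directory.
# Returns non-zero if any entry is missing.
check_hpcx_layout() {
  local dir="${1:-/usr/local/src/hpcx}"
  local missing=0 entry
  for entry in hpcx-init.sh ompi ucx VERSION; do
    if [ -e "${dir}/${entry}" ]; then
      echo "ok      ${entry}"
    else
      echo "missing ${entry}"
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: succeed when a command resolves to a path under a prefix.
command_under_prefix() {
  local cmd="$1" prefix="$2" path
  path="$(command -v "$cmd")" || return 1
  case "$path" in
    "${prefix}"/*) return 0 ;;
    *) return 1 ;;
  esac
}
```

After `source /etc/profile.d/hpcx.sh`, calling `check_hpcx_layout "$HPCX_HOME"` and `command_under_prefix mpirun "$HPCX_HOME"` gives a quick pass or fail signal; `mpirun --version` and `ucx_info -v` then print the versions actually in use.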

Conclusion

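NVIDIA HPC-X is a comprehensive software package that includes Message Passing Interface (MPI), Symmetric Hierarchical Memory (SHMEM), and Partitioned Global Address Space (PGAS) communication libraries, along with a range of acceleration components. This full-featured, tested, and packaged toolkit lets the MPI and SHMEM/PGAS programming models achieve high performance, scalability, and efficiency, and ensures that the communication libraries are fully optimized for NVIDIA Quantum InfiniBand networking solutions.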
