NVIDIA [Mellanox OFED]

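MLNX_OFED is NVIDIA's high-performance network driver and communication stack. It supports InfiniBand and RoCE and is widely used for low-latency, high-speed communication in HPC, AI training, and data center environments.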

Mellanox OFED

Base environment:

  • Ubuntu: 22.04.5
  • Kernel: 5.15.0-119-generic
  • Mellanox: MLNX_OFED_LINUX-23.10-1.1.9.0-ubuntu22.04-x86_64

1. Base Environment

1.1 Configure APT Sources

  1. Back up the existing sources.list (cp -n does not overwrite a backup that already exists)
[ -f /etc/apt/sources.list ] && cp -n /etc/apt/sources.list /etc/apt/sources.list.bak
  2. Write the Aliyun mirror sources for Ubuntu 22.04 (jammy)
cat <<'EOF' > /etc/apt/sources.list
deb https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse

# deb https://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse
# deb-src https://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse

deb https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src https://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
EOF
  3. Update the package index
sudo apt update
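
The source list above repeats the same pattern for each suite, so it can also be generated. The following is a minimal sketch, assuming the same mirror URL and component set as above; `render_sources` is a hypothetical helper name, not part of apt. It leaves out the commented `jammy-proposed` entries.

```shell
# render_sources (hypothetical helper): print sources.list lines for a
# given mirror URL and release codename, without touching any file.
render_sources() {
    mirror=$1      # e.g. https://mirrors.aliyun.com/ubuntu/
    codename=$2    # e.g. jammy
    components="main restricted universe multiverse"
    for suite in "$codename" "$codename-security" "$codename-updates" "$codename-backports"; do
        echo "deb $mirror $suite $components"
        echo "deb-src $mirror $suite $components"
    done
}

# Preview the result on stdout before writing it anywhere.
render_sources https://mirrors.aliyun.com/ubuntu/ jammy
```

Once the preview looks right, the output can be redirected to /etc/apt/sources.list after the backup step. On Ubuntu the codename can usually be read with `lsb_release -cs` instead of being typed by hand.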

1.2 Kernel Packages

Note: the kernel packages must match the version of the running kernel exactly (check it with uname -r).

apt install linux-image-5.15.0-119-generic linux-headers-5.15.0-119-generic linux-tools-5.15.0-119-generic linux-cloud-tools-5.15.0-119-generic
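
To avoid typing the kernel version by hand, the package names can be derived from `uname -r`. This is a small sketch; `kernel_pkgs` is a hypothetical helper name, and the four package prefixes are simply the ones used in the command above.

```shell
# kernel_pkgs (hypothetical helper): print the kernel package names
# for a given kernel release string, one per line.
kernel_pkgs() {
    release=$1
    for prefix in linux-image linux-headers linux-tools linux-cloud-tools; do
        echo "$prefix-$release"
    done
}

# Show what would be installed for the running kernel.
kernel_pkgs "$(uname -r)"
```

The output can then be passed to apt, for example `apt install $(kernel_pkgs "$(uname -r)")`.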

2. Mellanox NIC

  1. Download (choose the driver package that matches your OS and version)

MLNX_OFED: MLNX_OFED Download Center
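
Because the archive name encodes the driver version, distribution, and architecture, it can be checked before installing. The sketch below assumes the naming pattern MLNX_OFED_LINUX-<version>-<distro>-<arch>.tgz seen in this article; `parse_ofed_name` is a hypothetical helper name.

```shell
# parse_ofed_name (hypothetical helper): split an archive name of the form
# MLNX_OFED_LINUX-<version>-<distro>-<arch>.tgz into its three parts.
parse_ofed_name() {
    name=${1##*/}                   # drop any directory prefix
    name=${name%.tgz}               # drop the extension
    rest=${name#MLNX_OFED_LINUX-}   # drop the fixed prefix
    arch=${rest##*-}                # text after the last dash
    rest=${rest%-*}
    distro=${rest##*-}              # text after the new last dash
    version=${rest%-*}              # everything before it
    echo "$version $distro $arch"
}

parse_ofed_name MLNX_OFED_LINUX-23.10-1.1.9.0-ubuntu22.04-x86_64.tgz
# prints: 23.10-1.1.9.0 ubuntu22.04 x86_64
```

On the target host, the distro part can be compared with the ID and VERSION_ID fields in /etc/os-release, and the architecture with `uname -m`.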

  2. Install
# Extract the archive and enter the directory
tar xf MLNX_OFED_LINUX-23.10-1.1.9.0-ubuntu22.04-x86_64.tgz && cd MLNX_OFED_LINUX-23.10-1.1.9.0-ubuntu22.04-x86_64
# Interactive install; answer Y to continue
./mlnxofedinstall --all
# Forced install
# ./mlnxofedinstall --all --force
  3. Output
root@ubuntu:~/MLNX_OFED_LINUX-23.10-1.1.9.0-ubuntu22.04-x86_64# ./mlnxofedinstall --all --force
Logs dir: /tmp/MLNX_OFED_LINUX.1730.logs
General log file: /tmp/MLNX_OFED_LINUX.1730.logs/general.log

Below is the list of MLNX_OFED_LINUX packages that you have chosen
(some may have been added by the installer due to package dependencies):

ofed-scripts
mlnx-tools
mlnx-ofed-kernel-utils
mlnx-ofed-kernel-dkms
iser-dkms
isert-dkms
srp-dkms
rdma-core
libibverbs1
ibverbs-utils
ibverbs-providers
libibverbs-dev
libibverbs1-dbg
libibumad3
libibumad-dev
ibacm
librdmacm1
rdmacm-utils
librdmacm-dev
mstflint
ibdump
libibmad5
libibmad-dev
libopensm
opensm
opensm-doc
libopensm-devel
libibnetdisc5
infiniband-diags
mft
kernel-mft-dkms
perftest
ibutils2
ibsim
ibsim-doc
ucx
sharp
hcoll
knem-dkms
knem
openmpi
mpitests
dpcp
srptools
mlnx-ethtool
mlnx-iproute2
rshim
ibarr

This program will install the MLNX_OFED_LINUX package on your machine.
Note that all other Mellanox, OEM, OFED, RDMA or Distribution IB packages will be removed.
Those packages are removed due to conflicts with MLNX_OFED_LINUX, do not reinstall them.

Checking SW Requirements...
One or more required packages for installing MLNX_OFED_LINUX are missing.
Attempting to install the following missing packages:
swig libgfortran5 gcc automake libnl-3-dev libltdl-dev pkg-config flex graphviz libnl-route-3-dev tk m4 libnl-route-3-200 autoconf make bison debhelper dkms quilt autotools-dev libc6-dev libfuse2 gfortran chrpath
Removing old packages...
Installing new packages
Installing ofed-scripts-23.10.OFED.23.10.1.1.9...
Installing mlnx-tools-23.10.0...
Installing mlnx-ofed-kernel-utils-23.10.OFED.23.10.1.1.9.1...
Installing mlnx-ofed-kernel-dkms-23.10.OFED.23.10.1.1.9.1...
Installing iser-dkms-23.10.OFED.23.10.1.1.9.1...
Installing isert-dkms-23.10.OFED.23.10.1.1.9.1...
Installing srp-dkms-23.10.OFED.23.10.1.1.9.1...
Installing rdma-core-2307mlnx47...
Installing libibverbs1-2307mlnx47...
Installing ibverbs-utils-2307mlnx47...
Installing ibverbs-providers-2307mlnx47...
Installing libibverbs-dev-2307mlnx47...
Installing libibverbs1-dbg-2307mlnx47...
Installing libibumad3-2307mlnx47...
Installing libibumad-dev-2307mlnx47...
Installing ibacm-2307mlnx47...
Installing librdmacm1-2307mlnx47...
Installing rdmacm-utils-2307mlnx47...
Installing librdmacm-dev-2307mlnx47...
Installing mstflint-4.16.1...
Installing ibdump-6.0.0...
Installing libibmad5-2307mlnx47...
Installing libibmad-dev-2307mlnx47...
Installing libopensm-5.17.0.MLNX20231105.d437ae0a...
Installing opensm-5.17.0.MLNX20231105.d437ae0a...
Installing opensm-doc-5.17.0.MLNX20231105.d437ae0a...
Installing libopensm-devel-5.17.0.MLNX20231105.d437ae0a...
Installing libibnetdisc5-2307mlnx47...
Installing infiniband-diags-2307mlnx47...
Installing mft-4.26.1...
Installing kernel-mft-dkms-4.26.1.3...
Installing perftest-23.10.0...
Installing ibutils2-2.1.1...
Installing ibsim-0.12...
Installing ibsim-doc-0.12...
Installing ucx-1.16.0...
Installing sharp-3.5.1.MLNX20231116.7fcef5af...
Installing hcoll-4.8.3223...
Installing knem-dkms-1.1.4.90mlnx3...
Installing knem-1.1.4.90mlnx3...
Installing openmpi-4.1.7a1...
Installing mpitests-3.2.21...
Installing dpcp-1.1.43...
Installing srptools-2307mlnx47...
Installing mlnx-ethtool-6.4...
Installing mlnx-iproute2-6.4.0...
Installing rshim-2.0.17...
Installing ibarr-0.1.3...
Selecting previously unselected package mlnx-fw-updater.
(Reading database ... 90967 files and directories currently installed.)
Preparing to unpack .../mlnx-fw-updater_23.10-1.1.9.0_amd64.deb ...
Unpacking mlnx-fw-updater (23.10-1.1.9.0) ...
Setting up mlnx-fw-updater (23.10-1.1.9.0) ...

Added 'RUN_FW_UPDATER_ONBOOT=no to /etc/infiniband/openib.conf

Initializing...
Attempting to perform Firmware update...
No devices found!

Installation passed successfully
To load the new driver, run:
/etc/init.d/openibd restart
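
The success line at the end of this output can be checked automatically in scripted installs. The following is a rough sketch, assuming the installer output was saved to a file (for example with `tee`). `check_install_output` and the file name install.out are hypothetical.

```shell
# check_install_output (hypothetical helper): read saved installer output,
# count the "Installing <package>..." lines, and look for the success line.
check_install_output() {
    log=$1
    installed=$(grep -c '^Installing .*\.\.\.$' "$log")
    if grep -q '^Installation passed successfully' "$log"; then
        echo "ok: $installed packages installed"
    else
        echo "failed: no success line in $log" >&2
        return 1
    fi
}

# Example usage (not run here):
#   ./mlnxofedinstall --all --force 2>&1 | tee install.out
#   check_install_output install.out
```

The pattern only counts lines that end in three dots, so the "Installing new packages" banner is not counted as a package.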
  4. Start the service
systemctl enable openibd.service
systemctl start openibd.service
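
After the service starts, it can be worth confirming that the installed stack reports the expected version. This sketch assumes `ofed_info -s` prints a string such as "MLNX_OFED_LINUX-23.10-1.1.9.0:" (with a trailing colon). `ofed_version_ok` is a hypothetical helper name.

```shell
# ofed_version_ok (hypothetical helper): succeed when the string reported
# by `ofed_info -s` carries the expected version.
ofed_version_ok() {
    expected=$1
    reported=${2%:}                          # strip the trailing colon, if any
    reported=${reported#MLNX_OFED_LINUX-}    # strip the fixed prefix
    [ "$reported" = "$expected" ]
}

# Only query ofed_info when it is actually installed.
if command -v ofed_info >/dev/null 2>&1; then
    if ofed_version_ok 23.10-1.1.9.0 "$(ofed_info -s)"; then
        echo "version matches"
    else
        echo "version mismatch"
    fi
else
    echo "ofed_info not found; is MLNX_OFED installed?"
fi
```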
  5. Install locations
  • perftest: /usr/bin
  • openmpi: /usr/mpi/gcc/openmpi-4.1.7a1
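
Open MPI is installed outside the default search path, so its tools may need to be added to PATH. The sketch below assumes the binaries sit in a bin subdirectory of the Open MPI prefix listed above. `path_prepend` is a hypothetical helper name.

```shell
# path_prepend (hypothetical helper): print a PATH value with dir placed
# first, unless dir is already one of the entries.
path_prepend() {
    dir=$1
    current=$2
    case ":$current:" in
        *":$dir:"*) echo "$current" ;;
        *)          echo "$dir:$current" ;;
    esac
}

# Assumption: the bin subdirectory holds mpirun and related tools.
PATH=$(path_prepend /usr/mpi/gcc/openmpi-4.1.7a1/bin "$PATH")
export PATH
```

Running it twice leaves PATH unchanged the second time, so it is safe to place in a shell profile.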
