R750 DSS: NVIDIA Mellanox BlueField-2 DPU Card DPN-GRNMC PCIe link training failure
Summary: PowerEdge R750 Datacenter Scalable Solutions (DSS) may experience failures when running older Data Center-on-a-Chip Architecture (DOCA) versions with the NVIDIA Mellanox BlueField-2 Data Processing Unit (DPU) Card. ...
Symptoms
The NVIDIA Mellanox MT42822 BlueField-2 100G DPU channel card (DPN# GRNMC) is a DSS-qualified DPU adapter that might be installed in some DSS-configured PowerEdge servers to meet DSS/RCI user-specific requirements.
The Dell DSS/RCI engineering team has qualified and supports this adapter with NVIDIA DOCA 1.5.1 or later.
If the DOCA image on this adapter is changed to a version earlier than 1.5.1, several failure symptoms might be observed on the server.
For example:
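1. PCIe link training failure event UEFI0067 is logged in the iDRAC/Lifecycle log.
2. The host operating system (OS) fails to initialize the DPU adapter. For example, the kernel log shows the mlx5_core driver timing out while waiting for firmware initialization: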
[ 133.575847] kernel: mlx5_core 0000:ca:00.1: firmware version: 24.35.2000
[ 133.576304] kernel: mlx5_core 0000:ca:00.1: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[ 153.576974] kernel: mlx5_core 0000:ca:00.1: wait_fw_init:195:(pid 821): Waiting for FW initialization, timeout abort in 100s
[ 173.584974] kernel: mlx5_core 0000:ca:00.1: wait_fw_init:195:(pid 821): Waiting for FW initialization, timeout abort in 79s
[ 193.592974] kernel: mlx5_core 0000:ca:00.1: wait_fw_init:195:(pid 821): Waiting for FW initialization, timeout abort in 59s
[ 213.600975] kernel: mlx5_core 0000:ca:00.1: wait_fw_init:195:(pid 821): Waiting for FW initialization, timeout abort in 39s
[ 233.608975] kernel: mlx5_core 0000:ca:00.1: wait_fw_init:195:(pid 821): Waiting for FW initialization, timeout abort in 19s
[ 253.584980] kernel: mlx5_core 0000:ca:00.1: mlx5_function_setup:960:(pid 821): Firmware over 120000 MS in pre-initializing state, aborting
[ 253.586029] kernel: mlx5_core 0000:ca:00.1: init_one:1366:(pid 821): mlx5_load_one failed with error code -16
[ 253.587272] kernel: mlx5_core: probe of 0000:ca:00.1 failed with error -16
3. PCIe bus fatal error events are logged in the iDRAC/Lifecycle log, pointing to the slot where the DPU adapter is installed.
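The probe failure described in symptom 2 can be spotted by filtering the host kernel log. The following is a minimal sketch, not a Dell or NVIDIA tool: `mlx5_failed_probes` is a hypothetical helper name, and the match pattern is an assumption based only on the final log line quoted in this article.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints the PCI
# address of every mlx5_core device whose probe failed, one per line.
# The pattern is assumed from the log excerpt in this article.
mlx5_failed_probes() {
    sed -n 's/.*mlx5_core: probe of \([0-9a-fA-F:.]*\) failed with error.*/\1/p' | sort -u
}

# Usage on a live host (reading the kernel log may require root):
#   dmesg | mlx5_failed_probes
```

If the function prints an address that matches the slot of the DPU adapter, the host is likely showing the symptom described here; the iDRAC/Lifecycle log should still be checked for the UEFI0067 and PCIe fatal error events.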

Cause
DSS/RCI engineering qualified two models of the Mellanox BlueField-2 DPU channel adapter.
- 32G NVIDIA Mellanox BlueField-2 DPU card (DPN#CH5RM, Model# MBF2H516A-CEEOT)
- 128G NVIDIA Mellanox BlueField-2 DPU card (DPN#GRNMC, Model# MBF2H516C-CECOT)
Starting with the DOCA 1.5.1 LTS release, both models (DPN#CH5RM and DPN#GRNMC) are supported.
NVIDIA Mellanox recommends the DOCA LTS package 1.5.7 or newer.
Resolution
If the Dell PowerEdge server shows the failure symptoms described above with the DSS-qualified NVIDIA Mellanox BlueField-2 DPU adapter (DPN#GRNMC), ensure that DOCA 1.5.1 LTS or a later version is correctly installed.
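A version string can be compared against the 1.5.1 minimum with a small shell function. This is a sketch under assumptions: `version_at_least` is a hypothetical helper, it relies on the `-V` version sort of GNU `sort`, and the DOCA version installed on the DPU must be obtained separately and passed in by the operator.

```shell
# Hypothetical helper: succeeds (exit 0) if version $1 is greater than
# or equal to the minimum version $2. Assumes GNU sort with -V support.
version_at_least() {
    # The smaller of the two versions sorts first; if that is the
    # minimum, then $1 meets the requirement.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage, with the installed version supplied by the operator:
#   version_at_least "1.5.7" "1.5.1" && echo "DOCA version is supported"
```

Version sort is used instead of a plain string comparison so that a release such as 1.10.0 is correctly treated as newer than 1.5.1.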
If the DOCA image on this 128G DPU adapter has been changed to an older, unsupported version, use the following procedure to recover the DPU:
1. Install the DOCA host drivers found on https://developer.nvidia.com/networking/doca
Example for an Ubuntu 20.04 host OS:
wget https://content.mellanox.com/DOCA/DOCA_v2.7.0/host/doca-host_2.7.0-204000-24.04-ubuntu2004_amd64.deb
dpkg -i doca-host_2.7.0-204000-24.04-ubuntu2004_amd64.deb
apt-get update
apt install doca-all
2. Download and install the latest BlueField-2 (BF2) DOCA package:
wget https://content.mellanox.com/BlueField/BFBs/Ubuntu22.04/bf-bundle-2.7.0-33_24.04_ubuntu-22.04_prod.bfb
bfb-install --bfb bf-bundle-2.7.0-33_24.04_ubuntu-22.04_prod.bfb --rshim rshim0
3. Once the DOCA installation on the BF2 is complete, reset the BF2:
echo "SW_RESET 1" > /dev/rshim0/misc
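Before running the install and reset commands, it can help to confirm that the rshim device is actually present on the host. The following is a hedged sketch: `rshim_ready` is a hypothetical helper, and it assumes that the rshim driver exposes `boot` and `misc` files under the device directory (`/dev/rshim0` in this article).

```shell
# Hypothetical helper: succeeds if the rshim device directory $1
# (default /dev/rshim0) contains the boot and misc files.
# Assumption: the rshim driver exposes these two files.
rshim_ready() {
    dev="${1:-/dev/rshim0}"
    [ -e "$dev/boot" ] && [ -e "$dev/misc" ]
}

# Usage:
#   rshim_ready /dev/rshim0 || echo "rshim0 not found; check the rshim service"
```

If the check fails, the BFB installation and the software reset cannot reach the DPU, so the rshim driver on the host should be verified first.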
Affected Products
Datacenter Scalable Solutions, Mellanox Family of Adapters
Article Properties
Article Number: 000228342
Article Type: Solution
Last Modified: 03 Oct 2024
Version: 2