Mellanox ConnectX-6 Dx: High Corrected Bits on PAM4 NICs
Summary: The Dell Mellanox ConnectX-6 Dx 100 Gb Network Interface Card (NIC) reports high rx_corrected_bits_phy counts because of the PAM4 data transmission technique. This is normal and expected.
Symptoms
There are no issues being experienced, but upon reviewing the statistics in the environment, all hosts were found to be reporting bit errors.
- High number of rx_corrected_bits_phy
- High number of rx_err_lane_0_phy
- High number of rx_err_lane_1_phy
- High number of rx_err_lane_2_phy
- High number of rx_err_lane_3_phy
user@hostname:~$ sudo ethtool -S enp139s0f1np1 | grep -E "correc|rx_err"
rx_corrected_bits_phy: 153303800
rx_err_lane_0_phy: 74171021
rx_err_lane_1_phy: 79132779
rx_err_lane_2_phy: 0
rx_err_lane_3_phy: 0
user@hostname:~$ sudo ethtool -S lan0
rx_corrected_bits_phy: 191025837
rx_err_lane_0_phy: 759699
rx_err_lane_1_phy: 190266147
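The counters above can be compared programmatically. The following is a minimal sketch, not a vendor tool: it parses `name: value` lines copied from the first sample and sums the per-lane counters. The helper name `parse_ethtool_stats` is hypothetical. In the first sample the lane counters add up exactly to rx_corrected_bits_phy; whether that always holds is an assumption, and the second sample differs slightly.

```python
import re


def parse_ethtool_stats(text):
    """Parse 'name: value' lines from `ethtool -S` output into a dict of ints.

    Lines that do not look like a counter (for example shell prompts)
    are ignored.
    """
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Za-z0-9_]+):\s*(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


# Values copied from the first sample in this article.
SAMPLE = """\
rx_corrected_bits_phy: 153303800
rx_err_lane_0_phy: 74171021
rx_err_lane_1_phy: 79132779
rx_err_lane_2_phy: 0
rx_err_lane_3_phy: 0
"""

stats = parse_ethtool_stats(SAMPLE)

# Sum every per-lane counter, however many lanes the port exposes.
lane_total = sum(
    value
    for name, value in stats.items()
    if re.fullmatch(r"rx_err_lane_\d+_phy", name)
)

print("lanes:", lane_total, "corrected:", stats["rx_corrected_bits_phy"])
```

Lanes 2 and 3 read zero in this sample, so only two lanes are accumulating corrected errors on that port.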
Cause
The issue is related to the PAM4 data transmission technique used by the Mellanox ConnectX-6 Dx NIC.
- The PAM4 technique uses four signal levels (00, 01, 10, 11), each representing two bits, so it can transmit twice the data in the same bandwidth as the previously used non-return-to-zero (NRZ) technique.
- However, PAM4 is more complex and more susceptible to noise and errors, so it requires stronger error correction.
- PAM4 electrically modulated signals require the RS544 FEC technique to be running in order to detect and correct errors in the data transmission.
- The IEEE standards require all links involving 50G/100G PAM4 to achieve a pre-FEC Bit Error Rate (BER) of 2.4E-04 or better.
- With RS544 FEC enabled and running, a link is expected to achieve a BER of 1E-12 or better.
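To illustrate the first point in the list above, here is a small sketch of why four levels carry twice the data of NRZ per symbol. The level numbering 0 to 3 follows the order in which the bit pairs are listed in this article and is an assumption for illustration only; real transceivers may map bit pairs to levels differently (for example with Gray coding). The function names are hypothetical.

```python
# Illustrative mapping of bit pairs to four amplitude levels.
# The numeric levels are an assumption, not the NIC's actual line coding.
PAM4_LEVELS = {"00": 0, "01": 1, "10": 2, "11": 3}


def pam4_encode(bits):
    """Encode a bit string as PAM4 symbols, two bits per symbol."""
    if len(bits) % 2 != 0:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS[bits[i:i + 2]] for i in range(0, len(bits), 2)]


def nrz_encode(bits):
    """Encode a bit string as NRZ symbols, one bit per symbol."""
    return [int(b) for b in bits]


payload = "00011011"
pam4_symbols = pam4_encode(payload)
nrz_symbols = nrz_encode(payload)

# PAM4 needs half as many symbols as NRZ for the same payload.
print(len(nrz_symbols), len(pam4_symbols))
```

Because the four levels sit closer together than the two NRZ levels, a given amount of noise is more likely to push a symbol into a neighboring level, which is why FEC is required.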
Error Correction Mechanism
The RS544 FEC technique introduces 16 bins for error counting. In this system, bin-0 counts received codewords with zero errors, bin-1 counts received codewords with one symbol error, and so on.
Symbol Errors Per Codeword  Codewords   Changes  Last Change
Bin0                        5540265380  11       0:00:04 ago
Bin1                        4420085     11       0:00:04 ago
Bin2                        578175      11       0:00:04 ago
Bin3                        11808       11       0:00:04 ago
Bin4                        1071        11       0:00:04 ago
Bin5                        63          11       0:00:04 ago
Bin6                        6           6        0:00:04 ago
Bin7                        3           2        0:01:02 ago
Bin8                        1           1        0:00:04 ago
Bin9                        0           0        never
Bin10                       0           0        never
Bin11                       0           0        never
Bin12                       0           0        never
Bin13                       0           0        never
Bin14                       0           0        never
Bin15                       0           0        never
Bin16+                      0           0        never
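The bin counts above can be checked with a short script. This is a sketch under stated assumptions: the row layout (bin name, codeword count, change count, last change) is inferred from the display in this article, `parse_bins` is a hypothetical helper, and the threshold of bin-8 is the one given in the verification steps of this article.

```python
import re

# Rows copied from the bin display in this article.
BIN_TABLE = """\
Bin0 5540265380 11 0:00:04 ago
Bin1 4420085 11 0:00:04 ago
Bin2 578175 11 0:00:04 ago
Bin3 11808 11 0:00:04 ago
Bin4 1071 11 0:00:04 ago
Bin5 63 11 0:00:04 ago
Bin6 6 6 0:00:04 ago
Bin7 3 2 0:01:02 ago
Bin8 1 1 0:00:04 ago
Bin9 0 0 never
Bin10 0 0 never
Bin11 0 0 never
Bin12 0 0 never
Bin13 0 0 never
Bin14 0 0 never
Bin15 0 0 never
Bin16+ 0 0 never
"""

MAX_EXPECTED_BIN = 8  # threshold taken from this article's verification steps


def parse_bins(text):
    """Return {bin_index: codeword_count} for every 'BinN ...' row."""
    bins = {}
    for line in text.splitlines():
        match = re.match(r"\s*Bin(\d+)\+?\s+(\d+)\s+(\d+)\s+(.+)$", line)
        if match:
            bins[int(match.group(1))] = int(match.group(2))
    return bins


bins = parse_bins(BIN_TABLE)
highest_nonzero = max((b for b, n in bins.items() if n > 0), default=None)
total_codewords = sum(bins.values())
errored_codewords = total_codewords - bins.get(0, 0)

print("highest non-zero bin:", highest_nonzero)
print("codewords needing correction:", errored_codewords, "of", total_codewords)
print("within threshold:", highest_nonzero is None or highest_nonzero <= MAX_EXPECTED_BIN)
```

In this sample the counts fall off steeply from one bin to the next, and nothing is recorded beyond bin-8.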
BER Requirements
The effective physical BER shows how well the FEC is working to correct errors and ensure reliable data transmission.
The link is expected to achieve a BER of 1E-12 or better with RS544 FEC enabled and running.
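A rough pre-FEC BER can be estimated by dividing the number of corrected bits seen during an observation window by the number of bits received in that window. The sketch below is an approximation only: it assumes each corrected bit corresponds to one bit error, that the link was receiving at its nominal rate for the whole window, and it ignores encoding overhead. The function name, the 10-second window, and the delta of 1,000,000 corrected bits are hypothetical values chosen for illustration, not measurements.

```python
PRE_FEC_BER_LIMIT = 2.4e-4  # pre-FEC limit quoted in this article


def estimate_pre_fec_ber(corrected_bits_delta, seconds, line_rate_bps=100e9):
    """Approximate pre-FEC BER over an observation window.

    corrected_bits_delta: increase in rx_corrected_bits_phy over the window
    seconds:              length of the window
    line_rate_bps:        nominal link rate (assumed 100 Gb/s here)
    """
    if seconds <= 0 or line_rate_bps <= 0:
        raise ValueError("seconds and line_rate_bps must be positive")
    bits_received = seconds * line_rate_bps
    return corrected_bits_delta / bits_received


# Hypothetical example: the counter grew by 1,000,000 in 10 seconds.
ber = estimate_pre_fec_ber(1_000_000, 10)
within_limit = ber <= PRE_FEC_BER_LIMIT

print(f"estimated pre-FEC BER: {ber:.2e}, within limit: {within_limit}")
```

Using a counter delta between two readings, rather than the absolute counter value, avoids needing to know when the counter was last reset.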
Resolution
The observed rx_corrected_bits_phy counts are normal and expected on a link that uses the PAM4 data transmission technique. The FEC used on the link corrects the errored bits, resulting in a reliable link.
Verification Steps
To verify that the link is operating reliably, follow these steps:
- Check the rx_corrected_bits_phy counter value using the command sudo ethtool -S enp139s0f1np1 | grep -E "correc|rx_err" or sudo ethtool -S lan0.
- Verify that the counter value is within the expected range for a reliable link.
- Check the bin count display (columns: Symbol Errors Per Codeword, Codewords, Changes, Last Change) to ensure that the bin count does not reach beyond bin-8.
Tools and Resources
The following tools and resources can aid in resolving the issue:
- ethtool command-line utility
- sudo command for running commands with elevated privileges
Precautions and Warnings
Ensure that the rx_corrected_bits_phy counter value is within the expected range for a reliable link to avoid potential issues.