R640 and Broadcom - Poor performance on SMB Direct (RDMA) configuration
Hello all.
So, I've got a couple of new Dell R640 servers with Broadcom NetXtreme E-series adapters (BCM57416) for a new Hyper-V cluster, and I've been trying really hard to get SMB Direct working between them so I can get the best possible performance during Live Migration.
The thing is: all tests go green across the board, both Test-RDMA and Validate-DCB, as per screenshots below:
But when trying to copy something between the servers I get REALLY slow throughput:
The result for Get-SmbMultichannelConnection shows that RDMA is being used:
As soon as I disable RDMA on the interface, transfer speed goes back to normal.
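For anyone wanting to repeat this comparison, here is a minimal sketch of the checks described above. The adapter name `NIC1` is a placeholder and not taken from the original post; substitute the name reported by `Get-NetAdapter`. The `RDMA Activity` counter path is the one Windows Server normally exposes, but treat it as an assumption and confirm it with `Get-Counter -ListSet 'RDMA Activity'` on your system.

```powershell
# Placeholder adapter name (assumption); replace with your own.
$nic = 'NIC1'

# Which SMB Multichannel connections are selected, and are they RDMA capable?
Get-SmbMultichannelConnection |
    Select-Object ServerName, Selected, ClientIpAddress, ServerIpAddress, ClientRdmaCapable

# Is RDMA enabled on the adapter itself?
Get-NetAdapterRdma -Name $nic

# While a copy is running, confirm that bytes are actually moving over RDMA.
Get-Counter -Counter '\RDMA Activity(*)\RDMA Inbound Bytes/sec' -SampleInterval 1 -MaxSamples 5

# Take a non-RDMA baseline: disable RDMA, repeat the same copy, then re-enable.
Disable-NetAdapterRdma -Name $nic
# ... repeat the file copy here and note the throughput ...
Enable-NetAdapterRdma -Name $nic
```

Comparing the same copy with RDMA enabled and disabled on the same link separates an RDMA-specific problem from a general network or storage bottleneck.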
I've tried bypassing the switch and linking both interfaces directly, but to no avail. The results were the same.
So I guess my question is: has anyone gotten SMB Direct working successfully over Broadcom interfaces? I cannot for the life of me find anything wrong with my setup, but it just won't work.
Best regards,
Giovani
DELL-Daniel My
Moderator
September 20th, 2019 11:00
Hello
Here is an article on setting up RDMA. It may shed light on any configuration issues.
https://www.dell.com/support/article/how16693/
Thanks
giomoda
1 Rookie
September 23rd, 2019 08:00
Hey Daniel.
I have followed that guide and many others. As I pointed out, every test I run says RDMA is working, and in fact it is, just not at the expected speed. But I have some more info.
On these servers I have a daughter card and a PCIe card, both with BCM57416 chips. When testing RDMA on the daughter card I get really slow speed, but for the PCIe card I have no issues, which confirms that the configuration is sound. Here are the results:
Transfer speed with the daughter card disabled. Throughput jumps from 4 MB/s to 860 MB/s as soon as it is disabled:
[Screenshot: Transfer Speed Daughter Card Disabled]
Transfer speed with the daughter card enabled:
[Screenshot: Transfer Speed Daughter Card Enabled]
Physical NICs, one PCIe card and one daughter card, same make and model:
[Screenshot: Physical Interfaces]
Virtual switch mapping. LiveMigration-1325 is mapped to SET1 (PCIe card) and LiveMigration-1326 is mapped to SET2 (daughter card):
[Screenshot: Virtual Interface to Physical Interface Mapping]
RDMA test results for LiveMigration-1326, bound to the daughter card adapter. Throughput is very low:
[Screenshot: RDMA Test Daughter Card]
RDMA test results for LiveMigration-1325, bound to the PCIe adapter. Throughput is much higher:
[Screenshot: RDMA Test PCIe Card]
The configuration is exactly the same for both servers and interfaces, so I suspect there's either a hardware issue or some hidden setting I'm overlooking.
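One way to look for such a hidden difference is to diff the driver-level settings of the two adapters. This is only a sketch: the adapter names `SLOT 1 Port 1` and `NIC1` are placeholders, not the names used in this setup.

```powershell
# Placeholder adapter names (assumptions); replace with your own.
$pcie     = 'SLOT 1 Port 1'   # PCIe card port
$daughter = 'NIC1'            # daughter card port

# Driver version and link speed side by side.
Get-NetAdapter -Name $pcie, $daughter |
    Format-Table Name, InterfaceDescription, DriverVersion, LinkSpeed -AutoSize

# Slot and negotiated PCIe link for each port.
Get-NetAdapterHardwareInfo -Name $pcie, $daughter |
    Format-Table Name, Slot, PcieLinkSpeed, PcieLinkWidth -AutoSize

# List only the advanced properties whose values differ between the two ports.
$a = Get-NetAdapterAdvancedProperty -Name $pcie
$b = Get-NetAdapterAdvancedProperty -Name $daughter
Compare-Object -ReferenceObject $a -DifferenceObject $b -Property DisplayName, DisplayValue
```

An empty result from `Compare-Object` means the driver-visible settings match, which points away from a configuration difference and toward firmware or hardware.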
Regards,
Giovani
DELL-Daniel My
Moderator
September 24th, 2019 10:00
I'm not seeing any obvious issues. I am also not finding any known issues that are exactly the same as what you are experiencing. There are known issues with virtual functions that may be related. Those issues should be resolved with the latest driver and firmware, so I suggest making sure you are using the latest driver and firmware.
giomoda
1 Rookie
September 25th, 2019 13:00
Hey Daniel. Thanks for replying.
Well, firmware and driver updates were released last week and I have just applied them to all servers, but there is still no change. I'm baffled by this; I cannot seem to find a way to make it work.
It's worth saying that I have another pair of R640s with BCM57412 (SFP+) adapters showing the exact same behavior.
Regards,
Giovani
DELL-Daniel My
Moderator
September 27th, 2019 09:00
An enterprise engineer emailed me and said they had experienced this issue and found a work-around. They were able to set RoCE to v1 and traffic was passed normally.
If that is the case, then it sounds like a bug. The configuration would need to be verified as valid, and then the issue would need to be reproduced in order to escalate it for a bug fix. I do not plan on trying to reproduce this issue; it would be very time consuming for me to build out a test environment for it.
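For readers who want to try the RoCE v1 workaround, the setting is normally exposed as a driver advanced property. The exact property name and its values vary by driver version and are not given in this thread, so the sketch below first discovers them rather than assuming them. `NIC1` is a placeholder adapter name.

```powershell
# Placeholder adapter name (assumption); replace with your own.
$nic = 'NIC1'

# Find the RDMA/RoCE-related advanced properties and the values the driver accepts.
Get-NetAdapterAdvancedProperty -Name $nic |
    Where-Object { $_.DisplayName -match 'RoCE|RDMA|NetworkDirect' } |
    Format-List DisplayName, DisplayValue, RegistryKeyword, ValidDisplayValues

# Once the property name and the RoCE v1 value are known from the output above,
# fill in the placeholders and uncomment:
# Set-NetAdapterAdvancedProperty -Name $nic -DisplayName '<property name>' -DisplayValue '<RoCE v1 value>'
```

The same mode generally has to be set on both ends of the link, since a v1 endpoint and a v2 endpoint will not establish an RDMA connection with each other.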
giomoda
1 Rookie
September 27th, 2019 13:00
Hey Daniel.
I guess we hit a bug then. With RoCEv1 everything works. Let me know if you need some evidence from me to get this up the chain. I'll be happy to provide anything you need.
Regards,
Giovani
Kristofer Layman
March 4th, 2020 15:00
Any tips on getting PFC working with my Broadcom 57412 daughter card? I have the same issue with RoCEv1 vs v2, but I cannot figure out why my PFC is false. I am using the latest 21.60 firmware and driver.
*Edit: I found that changing the DCBX Mode setting to CEE (Only) lets test-rdma.ps1 succeed with IsRoCE $True using v1, without the PFC error I was getting when DCBX Mode was set to IEEE (Only), and speed is normal. However, Get-NetAdapterRdma still shows PFC as false.
Thanks!
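When PFC shows as false, it can help to look at each layer of the DCB configuration separately. The following is a read-only sketch using the standard Windows DCB cmdlets; `NIC1` is a placeholder adapter name, and which priority should carry SMB Direct depends on your own QoS policy.

```powershell
# Placeholder adapter name (assumption); replace with your own.
$nic = 'NIC1'

# RDMA state as the adapter reports it, including the PFC and ETS fields.
Get-NetAdapterRdma -Name $nic | Format-List *

# Host-side PFC: which 802.1p priorities have flow control enabled?
Get-NetQosFlowControl

# Adapter-side DCB: operational flow control and traffic classes on the NIC.
Get-NetAdapterQos -Name $nic

# DCBX "Willing" setting: whether the host accepts DCB settings from the switch.
Get-NetQosDcbxSetting

# QoS policies: is SMB Direct traffic tagged with the priority that has PFC enabled?
Get-NetQosPolicy
```

If `Get-NetQosFlowControl` shows the priority enabled but `Get-NetAdapterQos` does not report it as operational, the mismatch sits between the host settings and the adapter or switch rather than in the QoS policy itself.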
kungfujediss
October 6th, 2020 11:00
Has anyone made any progress on this issue?
I have the Broadcom 57416 and get much worse performance with RDMA than without it. Switching to RoCEv1 did not fix it for me. My PFC also shows false, despite being configured.
DELL-Chris H
Moderator
October 6th, 2020 13:00
Kungfujediss,
I would suggest calling in for support, as there are multiple things that could be causing it (configuration issues, etc.) and we would need to review TSRs to narrow it down.
CDiskan
May 17th, 2022 14:00
For anyone who comes across this thread down the road:
We ran into this issue and spent a significant amount of time troubleshooting with both Dell and Microsoft. The root cause is that the Broadcom 57 series NICs do NOT properly support RDMA or RoCE. There is no fix. The cards will let you turn the feature on, but they cannot properly handle the RDMA traffic, and the result is throughput measured in single-digit Mbps.
To make matters more complicated for troubleshooting, the Microsoft Test-RDMA PowerShell cmdlets and Validate-DCB cmdlets will not show ANY errors for these cards, because RDMA traffic does pass correctly at low speed. However, the usable speed is so low that they are completely unusable for production traffic.
This impacts all of the following cards (and likely others from Broadcom):