So, I've got a couple of new Dell R640 servers with Broadcom NetXtreme E-Series adapters (BCM57416) for a new Hyper-V cluster, and I've been trying hard to get SMB Direct working between them so I can get the best possible performance during Live Migration.
The thing is: all tests go green across the board, both Test-RDMA and Validate-DCB, as per screenshots below:
But when trying to copy something between the servers I get REALLY slow throughput:
The result for Get-SmbMultichannelConnection shows that RDMA is being used:
As soon as I disable RDMA on the interface, transfer speed goes back to normal.
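For anyone who wants to reproduce the comparison, here is a minimal PowerShell sketch of the on/off test. The adapter name is a placeholder, not something from my setup; use whatever name Get-NetAdapterRdma lists on your host, and run it from an elevated prompt.

```powershell
# Placeholder: replace with an adapter name listed by Get-NetAdapterRdma.
$nic = "<adapter name>"

# Show whether RDMA is currently enabled on the adapter.
Get-NetAdapterRdma -Name $nic

# While a file copy is running, list the SMB connections and their RDMA capability.
Get-SmbMultichannelConnection

# Turn RDMA off on this adapter only, then repeat the same copy and compare speeds.
Disable-NetAdapterRdma -Name $nic

# Turn RDMA back on once the comparison is done.
Enable-NetAdapterRdma -Name $nic
```

Running the same copy before and after the Disable-NetAdapterRdma step is what shows the difference described above.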
I've tried bypassing the switch and linking both interfaces directly, but to no avail. The results were the same.
So I guess my question is: has anyone gotten SMB Direct working successfully over Broadcom interfaces? I cannot for the life of me find anything wrong with my setup, but it just won't work.
I have followed that guide and so many others. As I pointed out, every test I run says RDMA is working, and in fact it is, just not at the desired speed. But I have some more info.
On these servers I have a daughter card and a PCIe card, both with BCM57416 chips. When testing RDMA on the daughter card I get really slow speed, but for the PCIe card I have no issues, which confirms that the configuration is sound. Here are the results:
Transfer Speed with Daughter Card Disabled - you can notice that it just skyrockets from 4MB/s to 860MB/s after disabling it.
Transfer Speed with Daughter Card Enabled
Physical NICs, one PCIe and one Daughter Card, same make and model
Virtual Switch Mapping, LiveMigration-1325 mapped to SET1 - PCIe, and LiveMigration-1326 mapped to SET2 - Daughter Card.
RDMA test results for LiveMigration-1326, bound to the Daughter Card adapter. You can notice very low throughput.
RDMA test results for LiveMigration-1325, bound to the PCIe adapter. You can notice much higher throughput.
The configuration is exactly the same for both servers and interfaces, so I suspect there's a hardware issue, or maybe some hidden setting I'm overlooking?
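One way to hunt for a hidden difference between the two ports is to diff their advanced properties instead of eyeballing them. This is only a sketch: both adapter names are placeholders for the physical PCIe and daughter card ports, and it assumes the Hyper-V module is available for the team-mapping check.

```powershell
# Placeholders: replace with the physical adapter names from Get-NetAdapter.
$pcieNic     = "<PCIe adapter name>"
$daughterNic = "<daughter card adapter name>"

# Driver details and link speed for both ports, side by side.
Get-NetAdapter -Name $pcieNic, $daughterNic |
    Format-List Name, InterfaceDescription, DriverVersion, DriverDate, LinkSpeed

# Diff every advanced property; only settings that differ are printed.
$pcieProps     = Get-NetAdapterAdvancedProperty -Name $pcieNic
$daughterProps = Get-NetAdapterAdvancedProperty -Name $daughterNic
Compare-Object $pcieProps $daughterProps -Property DisplayName, DisplayValue

# Confirm which host vNIC is pinned to which physical adapter in the SET switch.
Get-VMNetworkAdapterTeamMapping -ManagementOS
```

If Compare-Object prints nothing, the two ports expose identical advanced settings, which would point away from a configuration mismatch.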
I'm not seeing any obvious issues. I am also not finding any known issues that are exactly the same as what you are experiencing. There are known issues with virtual functions that may be related. Those issues should be resolved with the latest driver and firmware, so I suggest making sure you are using the latest driver and firmware.
Dell EMC, Enterprise Engineer
Hey Daniel. Thanks for replying.
Well, there were firmware and driver updates released last week that I have just applied to all servers, but still no change. I'm baffled by this; I cannot seem to find a way to make it work.
It's worth saying that I have another couple of R640s with BCM57412 (SFP+) showing the exact same behavior.
An enterprise engineer emailed me and said they had experienced this issue and found a work-around. They were able to set RoCE to v1 and traffic was passed normally.
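For anyone wanting to try that work-around, the RoCE version is usually exposed as a driver advanced property, but the exact property name and values depend on the driver version, so the sketch below only discovers them. The adapter name and the values in the commented Set line are placeholders, not confirmed Broadcom strings.

```powershell
# Placeholder: replace with the physical adapter name from Get-NetAdapter.
$nic = "<adapter name>"

# List RDMA-related advanced properties along with the values the driver accepts.
Get-NetAdapterAdvancedProperty -Name $nic |
    Where-Object { $_.DisplayName -match "RoCE|RDMA|NetworkDirect" } |
    Format-Table DisplayName, DisplayValue, ValidDisplayValues -AutoSize

# Once the right property is identified, it could be changed like this.
# Both quoted strings are hypothetical; copy them from the output above.
# Set-NetAdapterAdvancedProperty -Name $nic -DisplayName "<RoCE mode setting>" -DisplayValue "<v1 value>"
```

The setting would need to match on both servers, and changing an advanced property typically restarts the adapter.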
If that is the case, then it sounds like a bug. The configuration would need to be verified as valid, and then the issue would need to be reproduced before it could be escalated for a bug fix. I do not plan on trying to reproduce this issue; it would be very time consuming for me to build out a test environment for this.
Dell EMC, Enterprise Engineer
I guess we hit a bug then. With RoCEv1 everything works. Let me know if you need some evidence from me to get this up the chain. I'll be happy to provide anything you need.
Any tips on getting PFC working with my Broadcom 57412 daughter card? I have the same issue with RoCEv1 vs v2, but I cannot figure out why my PFC is false. I am using the latest 21.60 firmware and driver.
*Edit: I found that changing the DCBX Mode setting to CEE (Only) lets test-rdma.ps1 succeed with IsRoCE $True using v1, without the PFC error I was receiving when DCBX Mode was set to IEEE (Only), and speed is normal. However, Get-NetAdapterRdma still shows PFC as false.
Anyone make any progress on this issue?
I have the Broadcom 57416 and get much worse performance with RDMA than without it. Switching to RoCEv1 did not fix it for me. My PFC also shows false, despite being configured.
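In case it helps others compare notes on the PFC false result, these are the read-only checks that show what the host thinks is configured versus what the adapter reports as operational. It is a sketch only; the adapter name is a placeholder, and nothing here changes any settings.

```powershell
# Placeholder: replace with the physical adapter name from Get-NetAdapter.
$nic = "<adapter name>"

# RDMA state as the adapter reports it, including the PFC and ETS fields.
Get-NetAdapterRdma -Name $nic | Format-List *

# Which 802.1p priorities have priority flow control enabled on the host.
Get-NetQosFlowControl

# QoS policies; the SMB Direct policy should tag traffic with the PFC priority.
Get-NetQosPolicy

# Per-adapter DCB view: operational flow control and traffic classes.
Get-NetAdapterQos -Name $nic

# Whether the host is willing to accept DCB settings from the switch via DCBX.
Get-NetQosDcbxSetting
```

A mismatch between the priority enabled in Get-NetQosFlowControl and the operational flow control shown by Get-NetAdapterQos would be the first thing to look at.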
I would suggest calling in for support, as there are multiple things that could be causing it, and support can review TSRs, the configuration, etc.