We have seen a huge performance drop after some updates to our server, an R7525. Because it is a production server, it is very difficult for us to experiment and find out which of the updates is making the system slow.
The CPUs seem slower and the PCIe 4.0 NVMe drives have gone from 7500 MB/s to 1200 MB/s.
Has anyone experienced something similar? Any suggestions?
A couple of questions first, please. What model are the NVMe drives? Were they sold with the server or added after the server was purchased? What about the firmware versions: does the screen you are sharing show your current versions? Can you please check the current firmware versions of the iDRAC, BIOS, PERC controller and drives? The screenshot is missing the NVMe drive versions (you can check all of this in the iDRAC).
As far as I understand from your message, the performance problems appeared after the Windows updates. Is that so? I would suggest checking whether your system has an OEM Windows license (that is, whether the server was sold with the same version of Windows installed). If so, you could consider opening a ticket with support and asking for an OS Engineer escalation. This escalation is only possible if the server has an OEM OS license.
If you suspect a hardware error, there are two things you can do: check the hardware logs, or run a hardware diagnostic from the Lifecycle Controller. That will show whether the NVMe drives are reporting any errors. You can also check the iDRAC log for errors.
There are some BIOS settings you can adjust, but first I would like to confirm the information above.
Thanks Diego for your detailed answer.
Without a doubt, the problem also involves the processors (their cores, queues, etc.). We have seen that all our virtual machines (Hyper-V), whatever their OS version, now use more CPU as well.
We have detected the performance loss on the Intel Optane P4800X, Samsung PM9A3 and Samsung PM1735 drives; these did not come from Dell. However, the SSD RAID 0 arrays on the PERC H755 controller are also slower.
I have done several tests. Since the processors are EPYC 7302 (Rome), the first was to uninstall the "AMD SP3 MILAN Series Chipset drive" drivers https://www.dell.com/support/home/es-es/drivers/driversdetails?driverid=rxgwv&oscode=ws19l&productco...
and install the old ones: "AMD SP3 series ROME chipset driver" https://www.dell.com/support/home/es-es/drivers/driversdetails?driverid=k343k&oscode=ws19l&productco...
There was no difference. I am not sure which ones I should install. I believe the SP3 Milan drivers are the newer ones and are also compatible with Rome, but this is not stated anywhere. It would be good to know which are the most suitable.
I also downgraded the BIOS from 2.3.6 to 2.2.5; there was no difference either.
Finally, I ran a diagnostic test from the Lifecycle Controller and everything appears correct.
I think the next option will be to reinstall Windows Server 2019 from the latest ISO, or install Windows Server 2022, and measure the performance again. The OS licenses are our own SPLA licenses. Maybe something is wrong in the host OS.
Current versions are:
What tools are you using for the benchmark tests? Can you share some screenshots of the results? Do you have more servers with the same hardware configuration? For example, is this affecting some hosts but not others? That would allow us to compare and check configurations.
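As a rough cross-check alongside whatever benchmark tool is in use, a sequential read can be timed with a few lines of Python. This is only a minimal sketch: the function names, block size and test file are assumptions made for illustration, and because the read goes through the OS page cache, a file that was just written will report an inflated figure. A dedicated benchmark tool remains the better measurement.

```python
import os
import tempfile
import time


def sequential_read_mb_s(path, block_size=1024 * 1024):
    """Read the file at `path` front to back and return throughput in MB/s."""
    total_bytes = 0
    start = time.perf_counter()
    # buffering=0 skips Python's own buffer; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    # Guard against a zero interval on very small files.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (total_bytes / 1_000_000) / elapsed


def percent_drop(before, after):
    """Relative loss between two throughput figures, as a percentage."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    # Hypothetical test file: 64 MB of random data in the temp directory.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(64 * 1024 * 1024))
        test_path = tmp.name
    try:
        print(f"{sequential_read_mb_s(test_path):.0f} MB/s")
    finally:
        os.remove(test_path)
    # Figures quoted in this thread: 7500 MB/s before, 1200 MB/s after.
    print(f"{percent_drop(7500, 1200):.0f}% drop")
```

Running the same script on an affected host and on a healthy one, against a file on the NVMe volume, gives two comparable numbers.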
There are a few settings you can check in the BIOS. Below you will see the values as they should be set to avoid performance issues. You can change these in the iDRAC GUI under the Configuration > BIOS Settings tab, or use a KVM or similar and boot into the BIOS with F2.
Memory interleaving: disabled
NUMA nodes per socket: 4
L3 cache as a NUMA domain: disabled
Auto discovery bifurcation: auto
Dynamic link width management (DLWM): unforced
Please check those values and perform the test again. Then let me know if that changes anything.
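If you note down the current values from the iDRAC, the comparison against the list above can be sketched as follows. The dictionary keys are descriptive labels chosen for this sketch, not actual iDRAC or BIOS attribute names, and the "current" values in the example are made up.

```python
# Recommended values as listed in this thread. The keys are descriptive
# labels chosen for this sketch, not iDRAC attribute names.
RECOMMENDED = {
    "memory interleaving": "disabled",
    "numa nodes per socket": "4",
    "l3 cache as a numa domain": "disabled",
    "auto discovery bifurcation": "auto",
    "dynamic link width management": "unforced",
}


def find_mismatches(current, recommended=RECOMMENDED):
    """Return {setting: (current_value, recommended_value)} for each difference."""
    mismatches = {}
    for name, wanted in recommended.items():
        have = current.get(name)
        # A missing setting counts as a mismatch; comparison ignores case
        # and surrounding whitespace.
        if have is None or have.strip().lower() != wanted:
            mismatches[name] = (have, wanted)
    return mismatches


if __name__ == "__main__":
    # Made-up example of values copied by hand from the BIOS settings page.
    current = dict(RECOMMENDED)
    current["numa nodes per socket"] = "1"
    for name, (have, wanted) in find_mismatches(current).items():
        print(f"{name}: is {have!r}, recommended {wanted!r}")
```

Doing the same for a host that performs well makes it easy to see whether the two machines differ in any of these settings.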
If none of this makes a meaningful difference in performance, I suggest opening a ticket with phone support to get an SST involved. They could then recreate the issue in the lab and may be able to offer you some advice.
We finally found a way to solve the problem, although the fix is somewhat hard to explain:
1) Shut down the server
2) Disconnect the cables from the two power supplies
3) Press the power button for 10 seconds
4) Connect the cables of the two power supplies
5) Wait a minute
6) Start the server with a single press of the power button
After that, everything worked perfectly, with very high performance.