Here is the firmware inventory list from one of the servers having the issue:
* OS COLLECTOR, v6.0, A00: 6.0
* Internal Dual SD Module: 1.13
* Backplane 1: 2.52
* Dell EMC iDRAC Service Module Embedded Package v3.5.1, A00: 3.5.1
* Power Supply.Slot.1: 00.26.35
* Power Supply.Slot.2: 00.26.35
* PERC H740P Adapter: 51.13.2-3714
* BIOS: 2.10.0
* Dell OS Driver Pack, 20.08.09, A00: 20.08.09
* Integrated Dell Remote Access Controller: 4.40.00.00
* Dell 64 Bit uEFI Diagnostics, version 4301, 4301A50: 4301.51
DiegoLopez
4 Operator
•
2.7K Posts
0
March 30th, 2021 07:00
Hello @cbs_technical,
A couple of additional questions, please: is your VMware license Dell OEM supported? That is, was the OS license bought with the server, or separately later? I want to check whether the OS is covered by the warranty because, if so, the case could be escalated to the OS Support team.
I am pretty sure the Kingston DC500R drives are not supported. As far as I know, those are not Enterprise-certified drives. Can you please remove them from one of the systems and check whether the problem still appears?
What model are the SD cards? Were they shipped with the server, or did you buy them separately?
Please can you confirm that you followed these steps for the OS installation: Installing ESXi on flash media: https://dell.to/2Pj52vg
Can you please post a firmware inventory screenshot so we can check all firmware versions? (You can check this from the iDRAC.)
Thank you in advance.
Regards.
cbs_technical
1 Rookie
•
13 Posts
0
March 30th, 2021 08:00
Hi Diego,
Thanks for the response:
A couple of additional questions, please: is your VMware license Dell OEM supported? That is, was the OS license bought with the server, or separately later? I want to check whether the OS is covered by the warranty because, if so, the case could be escalated to the OS Support team.
I downloaded the latest DellEMC image from VMware:
VMware-VMvisor-Installer-7.0.0.update02-17630552.x86_64-DellEMC_Customized-A00.iso
I am using the free license.
I am pretty sure the Kingston DC500R drives are not supported. As far as I know, those are not Enterprise-certified drives. Can you please remove them from one of the systems and check whether the problem still appears?
The DC500R are part of Kingston's enterprise data centre series:
https://www.kingston.com/unitedkingdom/en/ssd/dc500-data-center-solid-state-drive
We already have 10 x T620 servers using Kingston enterprise drives and we haven't had any issues with them in the past. I did check with sales when we purchased these drives and they said they were supported.
I am not at work today but I can go in tomorrow and swap out the drives in one of the problem servers with a couple of the original Dell drives to test.
What model are the SD cards? Were they shipped with the server, or did you buy them separately?
They were supplied by Dell and pre-installed on the servers.
Please can you confirm that you followed these steps for the OS installation: Installing ESXi on flash media: https://dell.to/2Pj52vg
I installed ESXi using these steps. The only difference is that I used the iDRAC Virtual CD/DVD/ISO device with an ISO mounted on my PC, rather than removable media. The PC was connected to the server over the same wired LAN.
Can you please post a firmware inventory screenshot so we can check all firmware versions? (You can check this from the iDRAC.)
Here is the firmware inventory list from one of the servers having the issue:
regards,
Aidan
Dell-DylanJ
4 Operator
•
2.9K Posts
0
March 30th, 2021 09:00
Hello,
With the OS license not having been purchased from us, your support coverage wouldn't extend to the software side of things. That said, I can still grab an R740 and install our version of ESXi on it to try to confirm whether there is an issue with that image. Looking through your post, I think our processes and configuration will be pretty much the same, except for the Kingston drives; I'm willing to bet that the system I pull will have different makes and models of storage.
We can also look at hardware with you, but as you indicated, I wouldn't jump to the conclusion that this is a hardware issue quite yet. I'll bring my test system up to the latest firmware as well, which may save you some time. If it seems to make a difference on my side, I'll certainly let you know.
The files and folders that you're losing access to - are those mounted on the SD storage right along with the rest of ESXi, or are they on your hard drives? With the fast path and HBA messages, I'd expect the issue to be communication with external storage, like a SAN. If this is all internal storage, the best play might be to export the PERC log and see if it gives any further indication of what is happening. Exporting the PERC log can be done through the OS, but that would also require a reboot to complete the install of OpenManage. You can also acquire this log through the iDRAC using a SupportAssist collection, without installing any additional software. If there is indeed a communication problem with those Kingston drives, I'd expect it to show up there.
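If you have remote RACADM available, requesting that collection can be scripted along these lines. This is only a rough sketch: the IP address and the root/calvin credentials are placeholders, and you should double-check the supportassist subcommand options against the RACADM reference for your iDRAC firmware before relying on it.

```shell
#!/bin/sh
# Sketch: trigger a SupportAssist collection that includes the PERC TTY log
# via remote RACADM, so nothing has to be installed inside the OS.
# IDRAC_IP and the root/calvin credentials below are placeholders.

IDRAC_IP="192.0.2.10"

if command -v racadm >/dev/null 2>&1; then
    # TTYLog is the controller (PERC) log; SysInfo adds the hardware inventory
    racadm -r "$IDRAC_IP" -u root -p calvin supportassist collect -t SysInfo,TTYLog
    RESULT="requested"
else
    echo "racadm not found; use the SupportAssist page in the iDRAC web UI instead"
    RESULT="skipped"
fi
```

The same collection can be started from the iDRAC web UI if you'd rather not script it.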
If you'd like to send me the service tags of your systems, I'd be happy to check your support coverage for you. At the very least, this can help identify some options to assist you.
HONGLMN
6 Posts
0
March 30th, 2021 11:00
Hello.
I am having a very, very similar - if not identical - experience with our Dell R740xd servers.
They shipped from the factory with pre-installed SD cards and SSD drives for local storage.
I am seeing the same messages as the original poster. I started having problems after the 7.0 U2 update.
cbs_technical
1 Rookie
•
13 Posts
0
March 30th, 2021 12:00
Interesting @HONGLMN. As these are new machines, I have gone straight to 7.0 U2. What version were you running previously? I may be able to test the previous version you were running on my hardware to see if the problem goes away.
regards,
Aidan
cbs_technical
1 Rookie
•
13 Posts
0
March 30th, 2021 12:00
Hi @HONGLMN
I've looked at the Dell releases: the last one before 7.0 U2 came out on 15 January and relates to 7.0 1c (VMware build 17325551).
I'll get one of my servers installed with that version and report any issues I get.
As a matter of interest, are you losing access to any of the VMs on the ESXi hosts when the problem occurs or just access to ESXi and the SD card? I'm not running vCenter so I'm accessing ESXi locally on each host.
regards,
Aidan
cbs_technical
1 Rookie
•
13 Posts
0
March 30th, 2021 12:00
Hi Dylan,
Many thanks for your help.
The files and folders that you're losing access to - are those mounted on the SD storage right along with the rest of ESXi, or are they on your hard drives?
The problem is just with access to files on the SD card itself; I can still access files on the hard drives. My test VM sits on a datastore on the hard drives (RAID array), and when the problem arises the VMs carry on as normal. The problem happened a couple of times today, and I managed to remount the SD card (vmhba32) without affecting the VMs in any way. The PERC, which is seen as vmhba2 by ESXi, always remains mounted and has not had a problem.
The .locker folder is mounted on the first datastore, and the logs are still being updated while access to the SD card is lost.
On one of the problem machines, I re-installed ESXi on the first array, completely bypassing the SD card, and this machine has been running for 4 days now without any issues.
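For what it's worth, the check-and-remount I've been doing boils down to something like the sketch below. It assumes the SD reader shows up as vmhba32, as it does on my hosts, and it deliberately does nothing when run anywhere other than an ESXi shell.

```shell
#!/bin/sh
# Sketch: check whether /bootbank is still reachable and, if not,
# rescan the USB adapter that backs the SD card (vmhba32 on these hosts).

if ! command -v esxcli >/dev/null 2>&1; then
    echo "esxcli not found: not an ESXi host, nothing to do"
    STATUS="not-esxi"
elif ls /bootbank >/dev/null 2>&1; then
    echo "/bootbank is accessible"
    STATUS="ok"
else
    echo "/bootbank unreachable, rescanning vmhba32"
    esxcli storage core adapter rescan --adapter vmhba32
    STATUS="rescanned"
fi
```

The rescan only re-probes the adapter; if the card has genuinely dropped off the bus, a reboot may still be the only fix.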
I have not set up Support Assist on any of the servers yet but I will do this tomorrow on the problem machines to see if that highlights anything.
I will PM the service tags to you along with details of which units have had issues.
I will also follow Diego's advice and put some Dell disks in one of the problem machines to take the Kingston disks out of the equation.
regards,
Aidan
HONGLMN
6 Posts
0
March 30th, 2021 12:00
It was the Dell-specific 7.0 1d distribution. I don't have the specific build number - I'm not in the office today.
Also, my ESXi servers are connected to vCenter, and there was (still is?) a bug when using vCenter to upgrade them from 7.0 1d to 7.0 U2. So, after they were borked by that bug, I recovered them by booting from a CD with the Dell-specific 7.0 U2 image and upgrading them manually.
But, these are production servers, and I am having to reboot them to recover them every few days. I installed Skyline Health Diagnostics and collected the log files. They show almost identical errors to what you originally posted.
HONGLMN
6 Posts
0
March 31st, 2021 08:00
Okay, I must have used 7.0 1c and then upgraded within vCenter to 7.0 1d before going to 7.0 U2.
In addition to the
Bootbank cannot be found at path '/bootbank'
messages, I start getting warnings about hostd performance issues and a "possible storage bottleneck" which, from what I can tell from Skyline Health Diagnostics, might be related to "storage issues including HBA."
So, the real world impact is that VMs start to appear unresponsive. (I don't think they're really "dead" though, they appear unresponsive because any I/O they do is really, really, really slow.) The ESXi server's CPUs are nearly 100% idle during this time.
In order to recover, I have to "Force restart VMs" from the ESXi server console and reboot the server. And that takes 30 minutes to happen due to I/O issues too.
cbs_technical
1 Rookie
•
13 Posts
0
March 31st, 2021 14:00
Hi @Dell-DylanJ,
I have installed one of the problem servers with 7.0U2 on Dell disks as per @DiegoLopez's advice.
@HONGLMN I have set up another of the problem servers on 7.0U1 (the stock Dell 7.0 1c version) to see if the problem still occurs.
I now have a total of 7 servers configured with VMs and will keep checking to see if the SD card locks up in any of them. The server that I re-installed on the RAID array (bypassing the SD card) is still running fine without any issues.
I will get SupportAssist set up on the servers tomorrow and will post as soon as any of the servers hit a problem.
regards,
Aidan
NLO_elisa
6 Posts
0
April 2nd, 2021 04:00
@cbs_technical
Any news?
On our brand new PowerEdge R440 we have the same issue.
We ran 7.0 U1 Dell Version: A01, Build# 17325551 for 2 months without problems; on 2021/03/11 we upgraded to 7.0 U2 Dell Version: A00, Build# 17630552.
The host became unresponsive twice in 2 weeks.
After analyzing VMware Skyline Health Diagnostics for vSphere:
2021-04-01T20:07:40.294Z cpu0:2129784)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-04-01T20:07:40.294Z cpu0:2129784)ScsiDeviceIO: 4315: Cmd(0x45b90c7ee1c0) 0x28, cmdId.initiator=0x4304f04dd140 CmdSN 0x1 from world 2130054 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 Cancelled from path layer. Cmd count Active:1
followed by
2021-04-01T19:07:41.056Z cpu16:2129742)ALERT: Bootbank cannot be found at path '/bootbank'
2021-04-01T20:07:40.349Z cpu20:2130064)ALERT: Bootbank cannot be found at path '/bootbank'
We can still log on to ESXi and vCenter, but we are unable to issue commands; the only solution is to reboot the host.
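For anyone who wants to confirm they are hitting the same signatures, counting the two messages in the vmkernel log is enough. The sketch below builds a small sample file from the lines above so it can be tried anywhere; on an affected host you would instead point LOG at the real /var/run/log/vmkernel.log.

```shell
#!/bin/sh
# Sketch: count the two vmkernel signatures discussed in this thread.
# Uses an illustrative sample by default; set LOG to a real vmkernel.log on a host.

LOG="${LOG:-/tmp/vmkernel.sample}"

# Illustrative sample taken from the excerpts above
cat > /tmp/vmkernel.sample <<'EOF'
2021-04-01T20:07:40.294Z cpu0:2129784)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-04-01T19:07:41.056Z cpu16:2129742)ALERT: Bootbank cannot be found at path '/bootbank'
2021-04-01T20:07:40.349Z cpu20:2130064)ALERT: Bootbank cannot be found at path '/bootbank'
EOF

DOUBT=$(grep -c 'state in doubt' "$LOG")
BOOTBANK=$(grep -c 'Bootbank cannot be found' "$LOG")
echo "state-in-doubt events: $DOUBT"
echo "bootbank alerts: $BOOTBANK"
```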
Yesterday we did a rollback to 7.0 U1 Dell Version: A01, Build# 17325551, hopefully this will solve this issue.
DELL-Chris H
Moderator
•
9.7K Posts
0
April 2nd, 2021 05:00
It is not recommended to attempt upgrades to ESXi 7.0 with the A00 or A01 version customized ISO image. The A00 / A01 Dell EMC customized ISO is suitable for new installations of ESXi 7.0.
Are any of you seeing these issues on fresh installs of 7.0?
cbs_technical
1 Rookie
•
13 Posts
0
April 2nd, 2021 13:00
@NLO_elisa It looks like you are seeing the same issues as @HONGLMN, and only after upgrading from 7.0 U1 to 7.0 U2. As these are new servers, I have only ever tried 7.0 U2 and have encountered the problem on 5 of the 7 servers I am testing.
Of the 7 I currently have on test:
* 5 of them are running on the SD card with 7.0 U2
* 1 of them is running on the SD card with 7.0 U1
* 1 is running on the local SSD drives (no SD in use) with 7.0 U2.
So far I have only had issues when running on the SD card with 7.0 U2.
I have installed SupportAssist on all 7 servers and am just waiting for one to fail so that I can send a support collection to Dell for analysis. Nothing has failed yet since yesterday afternoon.
@DELL-Chris H I have only been using fresh installs of 7.0 U2 and 7.0 U1. No downgrades or upgrades.
regards,
Aidan
mattjudson
8 Posts
0
April 4th, 2021 18:00
We are experiencing the same issue. We have a sev 1 / sev 2 VMware case open on this and haven't got this much info from it yet, so I will be looping in our Dell tech team on Monday after seeing all this.
We have 111 R540s that were upgraded from 7.0 U1c to 7.0 U2 last week. Almost immediately we started experiencing ~10 different hosts disconnecting from vCenter per day, ultimately due to hostd process deterioration and /bootbank going inaccessible. Meanwhile, we have 40 other R540s that use mirrored SAS drives for the ESXi OS install instead of the mirrored SD cards, and they have not had any issue whatsoever.
There is definitely something funky with 7.0 U2 and the vmhba32 USB driver with the mirrored SD card modules on that specific config. Seeing this really makes me think we need to revert back to U1c.
DELL-Joey C
Moderator
•
4.1K Posts
0
April 4th, 2021 23:00
Hi @mattjudson and @cbs_technical,
Thanks for updating us on the issue. It may be a problem with the driver's communication between the OS and the IDSDM.
When the issue occurs, is the iDRAC able to detect the SD cards? I also found this VMware KB article: https://dell.to/3rPh6Sd. Does it seem related?