Unsolved
February 29th, 2016 07:00
PowerEdge R720 VMs cannot ping VM host
I have 3 R720 servers in my production cluster in a VMware 5.5 environment (vm6, vm7 and vm8). All 3 servers are identical: same hardware, firmware versions, BIOS, etc. The switches are the same (Dell N3024, stacked). I have the onboard BCM5720 with 4 ports for my LAN, 2 ports going to one switch and 2 going to another.

This is my issue: every VM I create on or migrate to vm8 cannot ping the host. When I do a ping, the first packet sent gets a reply, but the rest time out. When I open a console for the VM, if I do not use it for a minute, the screen goes blank and I lose the connection.

I spent several hours on the phone with VMware last week and they verified all the software settings, checked the logs, checked drivers, etc., and they are currently at a loss for the cause. They verified the tg3 vmnic driver version, since there are versions that cause packet loss. I upgraded to the latest version back in September because I was experiencing high-availability issues, which the upgrade resolved. But since I am having issues now, I cannot use the host for any servers.
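In case it helps anyone checking the same thing, the driver and firmware versions can be read from the ESXi shell. This is only a rough sketch: vmnic0 is a placeholder for whichever uplink you are testing, and the fallback message just covers running it somewhere other than the host.

```shell
# Sketch: confirm the tg3 driver/firmware pairing from the ESXi shell.
# vmnic0 is a placeholder; substitute the uplink under test.
NIC=${NIC:-vmnic0}
if command -v esxcli >/dev/null 2>&1; then
  esxcli network nic get -n "$NIC"          # reports driver name/version and firmware
  esxcli software vib list | grep -i tg3    # the installed tg3 driver VIB, if any
else
  echo "esxcli not found: run this on the ESXi host itself"
fi
```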
Now here is another interesting fact that is making this harder to resolve. If I ping vm8 from my workstation or any other physical server, the first packet times out and I get a reply on every one after that. If I start another ping immediately, there are no timeouts. If I wait 60 seconds or longer and start another ping, the first one times out again.
Here's some info on the servers. These settings are the same for all 3 servers; only 1 has been having issues, and the issues started on the day it was installed.
BIOS 2.2.2
BCM5720 Firmware version 7.8.16
BCM5719 Slot 5 Firmware version 7.8.16
The 4 ports on BCM5720 go to the LAN
First 2 ports on BCM5719 go to DMZ and the remaining 2 go to my Compellent SAN.
I have tested every port individually on the 5720 to the LAN and still had issues.
Has anyone ever seen an issue like this? I have run out of things to try. This also happened when we used Cisco switches. I have been working on the issue for months and need to get it resolved, since we are moving our Citrix farm to VMware.



DELL-Josh Cr
Moderator
February 29th, 2016 10:00
Hi,
The 4 ports that go to the LAN, 2 to each switch: are the switches on the same subnet? Do they connect to each other? Did you replace the cables when you went from Cisco to Dell? Was this all working when you used Citrix?
rsarran1
February 29th, 2016 11:00
This server has always given us an issue. The old Citrix is on an HP blade server that we are getting rid of. The new Citrix servers were created in VMware and are currently on the other 2 hosts. I only have one VM on vm8, for testing purposes. Everything is on the same subnet and we have the 4 Dell switches stacked. I have not replaced the cables.
DELL-Josh Cr
Moderator
February 29th, 2016 12:00
It could be a cable issue or a NIC issue. With a lot of the other hardware and software having already been changed or checked, the cause probably is not one of those other components.
UnstoppableDrew_9818db
March 8th, 2016 07:00
One thing you might want to try is taking the network redundancy out of the equation. Try it with everything connected only to switchA, then only to switchB. If it works on one side but not the other, you can narrow your focus to things like cables or switch ports on that side. If both single-switch setups work, but the redundant setup does not, it's more likely a switching/routing issue where not all paths between sets of adapters are valid.
rsarran1
March 14th, 2016 07:00
Since there are 4 vnics on this server and we have 3 switches, I started moving the connections around. We have 2 vnics on switchC and 2 on switchB. I moved the 2 from switchC to switchB (all 4 on switchB). Then I took the original 2 on switchB and moved them to switchA. Then I took the 2 on switchB and moved them to switchC, and the 2 on switchA to switchC. That was just about every possible combination, and I still had issues. I was even checking the system when only 2 vnics were online and still had issues; testing with only 2 vnics rules out a cable issue unless I have 2 or more bad cables. I also tried different ports on the switches. It does not appear to be a switch or cable problem.

When I checked the settings in vSphere on all 3 hosts, every setting is exactly the same. I used the iDRAC to log into each server, and every setting is the same (BIOS, firmware, drivers, etc.).
Now, as I was typing this, I tried to log into iDRAC on the server in question. I entered my username/password and it spun on the verifying-credentials screen for several minutes. I used iDRAC to check a different server and got right in. I retried the first server again, and this time I get an "is not available" message. And in vSphere, that host has an isolation error reaching the server.

Since I cannot get into iDRAC and am having issues even getting into the server, could it be a hardware issue with the server? And how do I even test the server?
DELL-Josh Cr
Moderator
March 14th, 2016 08:00
This is happening across multiple servers, right? It is unlikely that all of them have the same NIC issue. You could try swapping to a different NIC and see if that works better.
rsarran1
March 15th, 2016 09:00
No, this is isolated to one R720 server. I removed the Broadcom 5719 NIC adapter and only used the onboard 5720, with the same results. Then I put the 5719 back in and used only the onboard NIC (nothing plugged into the 5719 NIC ports), with the same results.

Yesterday, I tried using iDRAC to get into the server and it hung on the Verifying Credentials screen after I entered the username/password. I closed the window, tried to go back in, and got a browser message stating that the URL was unavailable. I tried iDRAC on my other 2 R720's and got right in. About 10 minutes later, I tried again with iDRAC and got in. I am thinking that there is a hardware issue, but I am having a hard time diagnosing it.

I verified every setting between my 3 R720's and everything is the same (same BIOS, firmware, and drivers). Even VMware support has been checking my servers and they are totally puzzled. They determined that it cannot be a cable or physical switch issue, since a virtual machine is unable to ping the host it is residing on; when a VM pings its host, the traffic only crosses the vSwitch and never goes out to the physical switch. There is a known issue with the tg3 driver (Dell and VMware recognize this). Back in October, I upgraded to the tg3 driver that does not have the packet-loss issue. The other workaround is to disable NetQueue, which I did last night; this also did not help. The server is under warranty until July 2017.
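For reference, the NetQueue workaround looked roughly like this. This is only a sketch of what I ran: the setting name follows the standard ESXi 5.x procedure, a host reboot is required for it to take effect, and the fallback message just covers running it somewhere other than the host.

```shell
# Sketch: disable NetQueue on ESXi 5.x (a reboot is required afterward).
if command -v esxcli >/dev/null 2>&1; then
  esxcli system settings kernel set --setting="netNetqueueEnabled" --value="FALSE"
  esxcli system settings kernel list -o netNetqueueEnabled   # confirm the new value
else
  echo "esxcli not found: run this on the ESXi host itself"
fi
```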
DELL-Josh Cr
Moderator
March 15th, 2016 10:00
Can you boot to our Live image and see if it has issues pinging in a different OS? http://www.dell.com/support/contents/us/en/19/article/Product-Support/Self-support-Knowledgebase/enterprise-resource-center/Enterprise-Tools/support-live-image
Can you get a DSET report?