duncj

June 2nd, 2010 22:00

vSphere 4 U1 - NMP generating pathing errors

Hi guys,

We're seeing some NMP errors on our vSphere 4 hosts and I was wondering if anyone else had seen similar errors. Every few days we see this group of three errors reported in the ESX logs:

May 30 02:50:00 nzchchlmpsesx5 vmkernel: 1:15:34:16.470 cpu6:4255)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x41000201c880) to NMP device "naa.60060480000290102936533030314433" failed on physical path "vmhba1:C0:T0:L12" H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

May 30 02:50:00 nzchchlmpsesx5 vmkernel: 1:15:34:16.470 cpu6:4255)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.60060480000290102936533030314433" state in doubt; requested fast path state update...

May 30 02:50:00 nzchchlmpsesx5 vmkernel: 1:15:34:16.470 cpu6:4255)ScsiDeviceIO: 747: Command 0x28 to device "naa.60060480000290102936533030314433" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

From my (limited) understanding, vSphere is attempting a read (SCSI command 0x28, READ(10)) down a path, and the command is failing with a host status of H:0x7, an HBA internal error. I have calls open with VMware, HP and EMC, but haven't had much success getting the issue resolved.
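For anyone wanting to pull the status fields out of these lines, here is a minimal Python sketch. It assumes the "failed on physical path" line format quoted above; the function name `parse_nmp_failure` and the small `HOST_STATUS` table are illustrative only, and the table lists just H:0x0 (no error) and the H:0x7 meaning given in this post rather than a full code list.

```python
import re

# Matches the nmp_CompleteCommandForPath failure line format quoted in this post.
NMP_FAILURE = re.compile(
    r'NMP device "(?P<device>[^"]+)" failed on physical path "(?P<path>[^"]+)" '
    r'H:(?P<host>0x[0-9a-fA-F]+) D:(?P<dev>0x[0-9a-fA-F]+) P:(?P<plugin>0x[0-9a-fA-F]+)'
)

# Deliberately partial (assumption): only the codes discussed here are decoded.
HOST_STATUS = {0x0: "ok", 0x7: "internal error"}


def parse_nmp_failure(line):
    """Return the device, path and H/D/P status codes, or None if no match."""
    m = NMP_FAILURE.search(line)
    if m is None:
        return None
    host = int(m.group("host"), 16)
    return {
        "device": m.group("device"),
        "path": m.group("path"),
        "host_status": host,
        "device_status": int(m.group("dev"), 16),
        "plugin_status": int(m.group("plugin"), 16),
        "host_meaning": HOST_STATUS.get(host, "unknown"),
    }
```

Lines that do not match the pattern return `None`, so the function can be run over a whole vmkernel log without pre-filtering.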

The environment consists of:

OS: vSphere 4 Update 1 plus patches (using local boot)

Host: BL460 G1 with LP-1105 HBAs on firmware 2.82A4

C7000 Chassis with Brocade 4/24 SAN switch (AE372A) on FOS 6.1.1b

Main Fabric is Brocade 4900 switches running FOS 6.2.0c

Symmetrix DMX4 storage array on 5773 microcode

We've got this setup in two separate data centers and are seeing the issue on hosts in both environments, which to my mind reduces the chance of this being strictly a hardware fault. Also, we're currently migrating from ESX 3.5 to vSphere 4, so we have a mixed environment of both types of hosts at the moment.

Any suggestions would be most welcome.

Cheers,

Duncan

Responses(6)

duncj

June 2nd, 2010 22:00

Forgot to mention that we're using Round Robin pathing policy, as recommended by the EMC solutions guide for vSphere.

codyhosterman

June 3rd, 2010 09:00

Duncan,

Many customers have reported this situation, and there is a VMware Communities thread that might have some helpful discussion for you:

http://communities.vmware.com/message/1321985#1321985

VMware has released a couple of patches for issues similar to this one, such as:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1019492

and

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1016291

One of these patches might be of assistance. I would contact the person(s) handling your VMware support request, ask about these patches, and see whether they are right for you.

I have also heard that installing PowerPath/VE will resolve this issue, as it replaces most of the native VMware NMP stack.

Hope this helps.

Thanks!

Cody

Cody Hosterman

Symmetrix Partner Engineering
Integrated Customer Operations
EMC Corporation

duncj

June 8th, 2010 16:00

Hi Cody,

Thanks for your reply. I've checked and we've got those patches already applied to the vSphere hosts.

Interestingly, as part of trying to resolve this issue, I'm currently having a discussion with VMware support as to whether we should be using Round Robin as the pathing policy for the vSphere hosts connected to the Symmetrix. The VMware compatibility guide currently lists only PowerPath/VE and FIXED as the supported path policies, and VMware support is pushing for our hosts to be configured that way. Of course the EMC guide for Symmetrix and VMware (I forget the official name) recommends Round Robin as the correct pathing policy - which is what we've got on our vSphere hosts at the moment. I assume that Round Robin is definitely the preferred pathing policy from EMC's point of view?

Since we're currently migrating from ESX 3.5 to vSphere 4 U1 in our environment, we've got a number of ESX 3.5 hosts accessing the shared LUNs using FIXED pathing, whilst the vSphere hosts are using Round Robin pathing to the same LUNs. I'm wondering whether this could be contributing to the errors we're seeing?

Cheers,

Duncan

codyhosterman

June 8th, 2010 17:00

Duncan,

I was one of the authors of the Symmetrix/VMware guide, and using Round Robin is definitely our recommendation for ESX 4.0 hosts connected to the Symmetrix. Since the Symmetrix is an Active/Active array, there should be no issue with using Round Robin as the pathing policy. The only reason we do not recommend Round Robin for ESX 3.5 as well is that VMware gives it only experimental support there, and therefore we will not support it either.

It is possible that the ESX 3.5 hosts are causing the issue, but I would be surprised if that were the case. If so, I would expect the error to pop up more often. You said it only happens every few days, correct? Is there some type of correlation between heavy IO and these errors? A VCB backup or something to that effect? It's possible that certain paths are getting hammered at certain times and NMP is just mindlessly throwing IO at them even though they really can't take any more. PP/VE is smart enough to know that a path is overloaded and will not send anything more down that path until it calms down. I would be curious to see if installing PP/VE on the ESX 4.0 servers would clear up those errors. My money would be on yes.
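One rough way to check for a load correlation is to count the failures per physical path and hour of day, then compare the busy hours against the backup schedule. The sketch below assumes syslog-style lines like the ones quoted at the top of the thread; the function name `failures_by_path_and_hour` is made up for illustration.

```python
import re
from collections import Counter

# Syslog prefix ("May 30 02:50:00 ...") followed by the NMP path failure text.
LINE = re.compile(
    r'^\w{3}\s+\d{1,2}\s+(?P<hour>\d{2}):\d{2}:\d{2}\s.*'
    r'failed on physical path "(?P<path>[^"]+)"'
)


def failures_by_path_and_hour(lines):
    """Count NMP path failures keyed by (physical path, hour of day)."""
    counts = Counter()
    for line in lines:
        m = LINE.search(line)
        if m:
            counts[(m.group("path"), int(m.group("hour")))] += 1
    return counts
```

If most of the counts land in the same hour as a backup window, that would point toward load rather than a hardware fault.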

Thanks!

Cody


duncj

June 15th, 2010 05:00

Hi,

I've done some more investigation of the logs and found that we're seeing the NMP error under two circumstances:

1) It pops up on our vSphere hosts when there is a corresponding SCSI reservation error on one or more of the ESX 3.5 hosts

2) One of our vSphere hosts is seeing the NMP error much more often than the other vSphere hosts, without the corresponding SCSI reservation issue on the ESX 3.5 hosts, and always on the same HBA accessing the same datastore. Other hosts accessing that datastore don't have an issue, and the fibre zoning and storage array configs look fine.
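The split between the two cases can be sketched as a simple time-window match. This is a minimal illustration, not the exact procedure used here: it assumes the NMP error times and the ESX 3.5 SCSI reservation error times have already been extracted as `datetime` values, and both the function name and the 60-second window are arbitrary choices.

```python
from datetime import datetime, timedelta


def split_by_reservation_overlap(nmp_times, reservation_times,
                                 window=timedelta(seconds=60)):
    """Split NMP error times into those near a reservation error and the rest.

    The 60-second default window is an illustrative assumption.
    """
    matched, unmatched = [], []
    for t in nmp_times:
        if any(abs(t - r) <= window for r in reservation_times):
            matched.append(t)    # case 1: coincides with a reservation error
        else:
            unmatched.append(t)  # case 2: no nearby reservation error
    return matched, unmatched
```

Errors that end up in the unmatched list, especially when they cluster on one host and one HBA, are the ones that fit the second case.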

We're going to rebalance the virtuals across the datastores to address occurrence 1 - we've currently got a couple of datastores with a large number of virtuals and other datastores that are basically empty. I think this has been on someone's to-do list and they never quite got there :-)

I'm treating occurrence 2 as a faulty HBA and we'll replace it shortly.

Cheers,

Duncan

codyhosterman

June 16th, 2010 08:00

Ah, interesting, and that would make sense. vSphere has been improved in many ways over VI3 when it comes to dealing with SCSI reservations, so it makes sense that the reservation errors are occurring on the 3.5 hosts only. Those errors could still propagate to the 4.0 hosts in a couple of ways, which, from what you are saying, sounds like the case.

Regards,

Cody

