
5 Practitioner • 274.2K Posts

July 2nd, 2012 12:00

Datastore rescan issue

IHAC (I have a customer) who is experiencing a slow rescan issue. A rescan takes 3-4 minutes on UCS with vSphere 5, PowerPath/VE, and a VMAX 20K (Enginuity 5875). By comparison, on a VNX NS-480 with vSphere 5 and without PowerPath/VE, a rescan takes 5 seconds. Anyone familiar with this issue?

92 Posts

July 2nd, 2012 12:00

Adding to that, can you please provide more info? Cluster size, number of LUNs, etc.

This may not be an issue

5 Practitioner • 274.2K Posts

July 2nd, 2012 12:00

They have about 12*1TB R2 volumes presented at all times for disaster recovery.

There are 9 ESXi clusters attached, with 63 devices mapped and masked to these clusters (8 Gatekeepers, 52*1TB, 1*2TB, 1*3TB).

--

John Shelest • Advisory Technical Consultant • VMware-Cisco-EMC Technology Solutions


286 Posts

July 2nd, 2012 12:00

Do they have a lot of write-disabled devices presented to the cluster? For example, SRDF R2 volumes?

286 Posts

July 2nd, 2012 13:00

That's probably the reason then. During a rescan, ESX queries all the discovered devices and finds VMFS volumes on the write-disabled devices. The SCSI open() succeeds because the device is ready, but when ESX then tries to update the VMFS metadata, the write fails, and ESX retries a few thousand times before giving up. The update is required for several things, the most important being registering the kernel for heartbeating; that is a write operation. Of course, the kernel has no idea that the SCSI device is read-only until it issues that first write. When the write happens, the Symmetrix returns sense code 07/27/00, which indicates a read-only medium. This was a problem with earlier versions of ESX, and in ESX 4.0 U3 VMware introduced a fix (see below) that enables it to recognize the device as purposefully write-disabled from that sense code. With the fix, ESX no longer retries the metadata update operation several thousand times before giving up on the LUN. Basically, the initiator should back off and not keep retrying the operation. Without this fix, a customer who has several of these WD LUNs presented, as is commonly the case, can see rescans take a long time to finish.
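For reference, the sense code mentioned above (07/27/00) is a standard SCSI sense key / ASC / ASCQ triple. A minimal Python sketch to decode it (the table and function names are illustrative, not from any VMware or EMC API):

```python
# Decode the SCSI sense triple (sense key / ASC / ASCQ) that the
# Symmetrix returns on the first write to a write-disabled device.
# Illustrative sketch only; names are not from any real API.

SENSE_KEYS = {0x07: "DATA PROTECT"}
ASC_ASCQ = {(0x27, 0x00): "WRITE PROTECTED"}

def decode_sense(key, asc, ascq):
    """Return human-readable names for a sense key/ASC/ASCQ triple."""
    return (SENSE_KEYS.get(key, "UNKNOWN"),
            ASC_ASCQ.get((asc, ascq), "UNKNOWN"))

def is_purposefully_write_disabled(key, asc, ascq):
    # 07/27/00 = DATA PROTECT / WRITE PROTECTED: the device is
    # intentionally read-only (e.g. an SRDF R2), so the initiator
    # should back off instead of retrying the write.
    return key == 0x07 and asc == 0x27 and ascq == 0x00

print(decode_sense(0x7, 0x27, 0x0))   # ('DATA PROTECT', 'WRITE PROTECTED')
```

This is the same triple that appears in the vmkernel log excerpts quoted later in the thread ("Valid sense data: 0x7 0x27 0x0").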

A workaround is to set the write-disabled devices to Not Ready on the Symmetrix. This causes the LUN to fail the SCSI open(), so ESX never even tries to update the metadata. If they do this and the rescan problem goes away, that would indicate this is indeed their issue.
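On the Symmetrix side, the Not Ready workaround could be applied with SYMCLI roughly as follows. This is only a sketch: the Symmetrix ID and device number are placeholders, and exact syntax can vary by Solutions Enabler version, so check the symdev man page before running anything.

```shell
# Set a write-disabled SRDF R2 device to Not Ready so that ESX's SCSI
# open() fails up front, instead of failing later on the metadata write.
# 1234 = Symmetrix ID (placeholder), 0ABC = Symmetrix device (placeholder).
symdev -sid 1234 not_ready 0ABC -noprompt

# Verify the device state afterwards.
symdev -sid 1234 show 0ABC
```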

I've been seeing this more and more, and I am beginning to suspect this fix did not make it into newer versions of ESX/ESXi. I would suggest opening a support case with VMware to see if that is the case.

The fix from the ESX 4.0 U3 release notes:

  • Rescan operations take a long time or time out with read-only VMFS volume

    Rescan or add-storage operations that you run from the vSphere Client might take a long time to complete or fail due to a timeout, and messages similar to the following are written to /var/log/vmkernel:

    Jul 15 07:09:30 [vmkernel_name]: 29:18:55:59.297 ScsiDeviceToken: 293: Sync IO 0x2a to device "naa.60060480000190101672533030334542" failed: I/O error H:0x0 D:0x2 P:0x0 Valid sense data: 0x7 0x27 0x0.
    Jul 15 07:09:30 [vmkernel_name]: 29:18:55:59.298 cpu29:4356)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x4100b20eb140) to NMP device "naa.60060480000190101672533030334542" failed on physical path "vmhba1:C0:T0:L100" H:0x0 D:0x2 P:0x0 Valid sense data: 0x7 0x27 0x0.
    Jul 15 07:09:30 [vmkernel_name]: 29:18:55:59.298 cpu29:4356)ScsiDeviceIO: 747: Command 0x2a to device "naa.60060480000190101672533030334542" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x7 0x27 0x0.

    VMFS continues trying to mount the volume even if the LUN is read-only.

    This issue is resolved in this release. From this release, VMFS does not attempt to mount the volume when it receives the read-only status.
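To spot affected devices in /var/log/vmkernel, something like the following sketch could be used to collect the NAA IDs of devices reporting that sense triple. The regex is an assumption inferred from the sample messages above, not from any documented log format:

```python
import re

# Match vmkernel lines that carry the DATA PROTECT / WRITE PROTECTED
# sense triple (0x7 0x27 0x0) and capture the NAA device identifier.
# Pattern inferred from the sample messages above, not from a spec.
PATTERN = re.compile(r'device "(naa\.[0-9a-f]+)".*sense data: 0x7 0x27 0x0')

def write_disabled_devices(log_lines):
    """Return the set of NAA IDs that reported a write-protected medium."""
    return {m.group(1) for line in log_lines
            if (m := PATTERN.search(line))}

sample = ('Jul 15 07:09:30 vmkernel: ScsiDeviceIO: 747: Command 0x2a to '
          'device "naa.60060480000190101672533030334542" failed H:0x0 '
          'D:0x2 P:0x0 Valid sense data: 0x7 0x27 0x0.')
print(write_disabled_devices([sample]))
```

Each NAA ID it returns could then be cross-checked against the write-disabled R2 devices in the masking view.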

5 Practitioner • 274.2K Posts

July 2nd, 2012 15:00

Thank you for the direction and the suggested workaround. We'll try both VMware support and the workaround.

--

John Shelest • Advisory Technical Consultant • VMware-Cisco-EMC Technology Solutions


286 Posts

July 2nd, 2012 15:00

Sure thing! Let me know how it turns out.

286 Posts

September 25th, 2012 12:00

By the way the rescan issue with WD devices that contain unresolved VMFS volumes has been fixed in ESXi 4.1 U3 and ESXi 5.1. I confirmed this personally in my lab.

26 Posts

September 27th, 2012 00:00

Hi,

As an add-on to Cody's statement, there's a Primus solution available:

How to avoid long rescan times with Symmetrix RDF in a VMware SRM environment with PowerPath/VE 5.4 SP2

Regards,

Ralf

5 Practitioner • 274.2K Posts

September 27th, 2012 03:00

Does it matter that the customer is on ESX 5? This Primus is for ESX 4.1.

--

John Shelest • Advisory Technical Consultant • VMware-Cisco-EMC Technology Solutions


26 Posts

September 27th, 2012 03:00

Hi John,

ESXi 5 acts similarly; both versions are affected.

It should be fixed with ESXi 5.1, but right now I'm unable to verify this.

Regards,

Ralf

September 28th, 2012 17:00

Thanks Cody for the info and the update.

286 Posts

September 28th, 2012 17:00

Unfortunately no. For the rescan issue, the elongation is actually in the "scan for new storage devices" operation, not in the VMFS scan portion. The problem description I wrote earlier is technically for the issue that still persists with the Add Storage wizard, where it takes forever to load devices when these unresolved WD LUNs are present. The rescan issue is slightly different and has been fixed. Both are symptoms of the same problem, but they occur for slightly different reasons and in slightly different ways. The rescan elongation doesn't have to do with metadata updates, just improper handling of the WD SCSI sense codes. The rescan was actually affected less, as the presence of one of these devices would "only" add a few seconds each to the rescan time. The Add Storage wizard device load is delayed about 25-30 seconds per device, and that issue is still being worked on. The Not Ready option is still the best workaround for the Add Storage issue.

September 28th, 2012 17:00

Cody,

Just curious: I'm assuming that if you were only rescanning to discover newly presented LUNs (to then add a VMFS volume to), and therefore unchecked "Scan for new VMFS Volumes" and left only "Scan for New Storage Devices" checked, you wouldn't experience this behavior? Maybe I am just restating the obvious.

Of course, if the intention is to discover new VMFS volumes since the last scan then this isn't an option, but just curious.

286 Posts

September 28th, 2012 18:00

Sure thing!

1 Rookie • 20.4K Posts

September 29th, 2012 06:00

Do the R2 devices have to be presented to the DR cluster? I am not familiar with SRM; for other open systems I typically keep R2 devices in a separate device group and add them to the masking view when needed.
