vmware errors after Flare 30 509 patch upgrade

Question

On Monday EMC updated my CX4-480 from an earlier rev of Flare 30 (I think 508) to the 4.30.0.5.509 build to fix a known potential data loss issue. Since then, our vSphere server keeps reporting "Non-VI workload detected on the datastore" on the busier LUNs. The generic information I can find online says the error occurs when something other than vSphere itself is performing activity on the LUN: older VMware servers connected to the same LUN, backups running against the LUN independently of vSphere, a storage migration occurring on the LUN, etc. We don't have any of that, just the CX4 and ten managed vSphere hosts all connected to it over iSCSI.

From what I've read, that alert comes up on a LUN with Storage I/O Control enabled, as ours are. If SIOC detects latency in the I/O and backs everything off but the problem doesn't get better, it assumes something else is performing I/O on the same LUN. That worries me because I don't have any of those other things. The only cause I can think of is that the CX4, with the new code, is now having momentary pauses or slowdowns in I/O that it did not have before, since this error never occurred before the upgrade.
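A minimal Python sketch of that reasoning, purely for illustration. It is not VMware's actual algorithm. The `Sample` type, the function name and the `min_consecutive` parameter are hypothetical, and the 30 ms figure is the default SIOC congestion threshold, which may not match the setting on these datastores.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Sample:
    latency_ms: float  # datastore latency seen by the hosts (hypothetical field)
    throttled: bool    # True if the hosts had already backed off their I/O


def looks_like_external_workload(
    samples: Iterable[Sample],
    threshold_ms: float = 30.0,   # assumed: default SIOC congestion threshold
    min_consecutive: int = 3,     # assumed: how long latency must persist
) -> bool:
    """Return True if latency stays above the threshold for
    `min_consecutive` samples in a row while throttling is active."""
    streak = 0
    for s in samples:
        if s.throttled and s.latency_ms > threshold_ms:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            # Latency recovered, or no throttling yet: start counting again.
            streak = 0
    return False
```

Under this simplified model, brief array-side stalls would look the same to the hosts as I/O from an outside source, which is why the alert alone cannot tell the two apart.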

tkjoffs · Answer

Are the LUNs affected in a Storage Pool or RG?   Also, are you using FAST, FAST Cache, or QoS?

colohost · Answer

They're in raid groups using raid 5. Our entire CX4 is the same drives; no tiering, power management, fast cache, qos, thin luns or anything like that, just always-on thick luns.

tkjoffs · Answer

There are only a few ways to get that type of error that I know of, which leads me to think you may have:

More than one ESX LUN in a Raid Group - Not always bad. If you do this though make sure the I/O is okay and that the Cache is properly enabled.
A transitioning LUN; is it only one LUN affected? If so, you could try migrating that LUN to a new RG and back.
Are you using PowerPathVE or native pathing on the ESX? If you are using VE, check the paths are all correct.

FYI - I am running .509 on two arrays, both with multiple ESX clusters attached. All clean, no errors, using ESX 4.0.0. What version of ESX are you using?

colohost · Answer

We're using 146 gig 15krpm drives in 9+1 raid 5 sets for each raid group, that becomes a single EMC LUN, and then each one of those is added to our single storage group to present to vsphere as a unique LUN. So it's a 1:1:1 mapping of raid group to emc lun to vsphere lun. We only have four luns in use so far, and the two that are producing the alert are the only ones that have significant usage, but they haven't changed prior to 509.

I did notice both of the luns are native to storage processor B, not that that should matter. We do use PowerPath and it's showing all four paths (4 x 10gig iscsi) are up to each of them. We're using ESXi 4.1 enterprise plus. All the ESXi hosts were updated last week to newer code but that was at least 7 days ago and the alerts didn't start until Monday late night after the 509 code went on.

tkjoffs · Answer

Can you check the commit status of the FLARE on SPB? Ted

colohost · Answer

Can you tell me where to look for that? If I go into SPB's properties I see:

FLARE-Operating-Environment 04.30.000.5.509 Active

on the software tab. I know the support guy that did the upgrade had to reboot both SP's before they would take the new code, and then I think he may have done it again after each was upgraded as well because one was reporting the other as isolated, but after that everything was looking good.

tkjoffs · Answer

If the code was not committed you would see a commit button under the area where Active is listed. During the FLARE upgrade I performed, I ran into an issue where the upgrade did not properly commit the code and I had to reboot the SP again. I don't think that is your issue. As far as I can tell you have an optimal design for the SAN side of this and there does not seem to be anything strange. I would suggest you go to chat support and see if they can get anything from SP collects. Sorry I could not help more. I guess the other option is to reboot SPB again...

christopher_ime · Answer

Hostasaurus,

During your research did you stumble on Duncan Epping's post?

http://www.yellow-bricks.com/2011/01/20/enable-storage-io-control-on-all-datastores/

If not, as is often the case, the 'Comments' have responses from others sharing their experience with the error; you'll possibly find a similarity to your environment. For instance, Scott Lowe describes array-based replication (such as MirrorView) contributing to it. Another poster identified Veeam backup accessing the same LUNs, and Duncan acknowledged that as a possibility. Reinforcing this, there is a link to Scott Drummonds' blog, which reminds us that replication and backup solutions are, of course, workloads. As noted there, both VMware and EMC are looking into it:

'We all recognize that it is kind of silly that an event is raised for an extraordinarily common condition like the presence of SIOC and replication. EMC and VMware are looking into improving this and a fix will come at some future (unknown) date.'

The following VMware KB article is also referenced more than once, but I'm thinking you've already gone through the checklist:

'External I/O workload detected on shared datastore running Storage I/O Control (SIOC) for congestion management'
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1020651

tkjoffs,

Storage I/O Control was introduced in 4.1 so you are correct that you wouldn't be seeing the error running 4.0.

colohost · Answer

Ah, yes traditional luns in this case, just wanted to clarify that they weren't thin or on storage pools.

GearoidG · Answer

Just a quick query for clarification

I note here you say thick luns. Just to clarify, are these on storage pools or in an RG?

We in EMC tend to say traditional luns for normal RG luns, as a thick lun is still a pool lun.

Gearoid

colohost · Answer

Nope, no add-on services on the CX4 doing anything in the background; just straight raid groups of all the same drives, no tiering or fast cache, one lun per group, one vmware lun per emc lun, and then vmware is the only thing talking to the CX4, using mostly thin-provisioned vmdks. Backups are currently done via a client agent in each guest, since we haven't yet upgraded to backup software that can do file-level restores in the guests when using the vsphere agent.

I haven't rebooted the SPB yet to see if it makes any difference but will probably do that this weekend since the LUNs raising the alerts are all homed to that SP.

kelleg · Answer

Are you using anything like Snapshots or MirrorView or RecoverPoint or any other utilities that automate functions on the LUNs? Glen

kelleg · Answer

What I would recommend is to open a case with VMware and see whether the error message means what we think it does: some other process, or a latency issue. You should probably also start Analyzer (set the Archive Interval to 120 seconds, enable Periodic Archiving, and set Stop After to a couple of days) in case you want to open a case with EMC. They'll need spcollects and NAR/NAZ files to see whether there are any latency issues with the LUNs.

glen
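For a rough sense of how much data the Analyzer settings above would collect, here is a small Python calculation. It assumes "a couple of days" means exactly two days; the function name is made up for illustration.

```python
def archive_sample_count(interval_s: int, duration_days: float) -> int:
    """Number of polling intervals that fit in the capture window."""
    seconds_per_day = 24 * 60 * 60
    return int(duration_days * seconds_per_day // interval_s)


# 120-second Archive Interval, assumed two-day capture window
print(archive_sample_count(120, 2))  # 1440
```

That works out to one data point every two minutes, so an I/O pause lasting only a few seconds may be averaged away within a sample rather than showing up as a distinct spike.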
