May 30th, 2016 09:00

EQL PS6500 High IOWAIT spikes during page moves (re-balance)

We have a five-member PS6500 pool using 6248 switches.

The PS6500s are at firmware 7.1.9.

The symptoms are high iowait spikes for VMs while page moves are going on, during which datastores lose connectivity for a second or two.  Note that the page moves active at that moment (I have pm mon constantly monitoring in a window) are not for the datastore(s) affected at that time.  There is some correlation in the SAN HQ logs with connection reconnects or volume moves, but not 100% of the time.

So for example, when the current pm plan is done, everything is extremely stable. As soon as a new plan starts, the issues rear their ugly head again.

We do have a case open with a great ProSupport engineer, but it seems like my choices are to break the pool into three and two members, which is not something I want to do for a number of reasons, or to ask Dell to simply stop the re-balancing entirely, as it is causing far more harm than good.

Anyone else experience the same issue and if so how did you solve it?

56 Posts

May 30th, 2016 19:00

I have a case open right now for similar issues.  It's been open for over a year.  In brief, what happens in my environment is we experience random ~20 second hangs/pauses on all volumes on our PS6510/6500 arrays.  Normally this only happens once or so a day (at random times), but when there is a page movement job running, the frequency increases to more like 1-2 times an hour.  We have done countless iterations of diags, SAN HQ archives, switch logs, VM support bundles, etc., and have made minimal progress in determining the issue.  One thing I have noticed is that the issue only happens on the PS6510/6500 series arrays; we have other models (PS6210) in the same group and they do not exhibit the issue.

I'm curious, how frequently do the datastores lose connectivity when a page mover job is running? 

I wonder if you're seeing the same thing I am.

8 Posts

May 30th, 2016 20:00

Glad to hear we are not alone with this issue, though you have my sympathy for your plight as well.

I keep an EQL support shell pm mon window open on the group lead, as well as three PuTTY sessions to three mission-critical Linux VMs running sar with output every 5 seconds, and in some cases I have that logged to a PuTTY output file. I went this route when I was trying to track down the cause of all the user complaints, because a straight sar otherwise shows results in 10-minute increments; that always looked fine, but as it turns out it is extremely misleading when there are these short, high iowait spikes, as you know.
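
In case it helps anyone set up similar fine-grained monitoring, here is a rough sketch of the idea rather than our exact setup (we literally just run sar in PuTTY windows): it samples iowait from /proc/stat on a Linux guest every few seconds and flags the short spikes that a 10-minute average would smooth away. The interval and threshold values below are only illustrative.

#!/usr/bin/env python3
# Rough sketch only: sample CPU iowait from /proc/stat on a Linux guest every
# few seconds and flag short spikes that a 10-minute average would hide.
# INTERVAL and SPIKE_THRESHOLD are illustrative values, not real settings.
import time
from datetime import datetime

INTERVAL = 5          # seconds between samples
SPIKE_THRESHOLD = 30  # flag iowait above this percentage

def read_cpu():
    # First line of /proc/stat is the aggregate "cpu" row; field 5 is iowait.
    with open("/proc/stat") as f:
        values = [int(v) for v in f.readline().split()[1:]]
    return values[4], sum(values)

prev_iowait, prev_total = read_cpu()
while True:
    time.sleep(INTERVAL)
    iowait, total = read_cpu()
    delta = total - prev_total
    pct = 100.0 * (iowait - prev_iowait) / delta if delta else 0.0
    marker = "  <-- spike" if pct >= SPIKE_THRESHOLD else ""
    print(f"{datetime.now():%H:%M:%S} iowait {pct:5.1f}%{marker}", flush=True)
    prev_iowait, prev_total = iowait, total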

So I monitor things very closely and can correlate the high iowait spikes on those VMs with the VMware host event log showing that volume disconnecting and reconnecting over a second or two.
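
The correlation itself is something I do by eye between the sar windows and the host event log, but if you wanted to script it, something along these lines would do it, assuming you had first exported the spike timestamps and the host events to CSV. The file names, column names, and the 10-second matching window here are made up for the example.

#!/usr/bin/env python3
# Rough sketch only: match iowait spike timestamps against datastore
# disconnect/reconnect events exported from the VMware host event log.
# The CSV file names, column names, and 10-second window are hypothetical.
import csv
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=10)
FMT = "%Y-%m-%d %H:%M:%S"

def load(path, time_column="time"):
    # Returns a list of (timestamp, row) tuples from a CSV export.
    with open(path, newline="") as f:
        return [(datetime.strptime(row[time_column], FMT), row)
                for row in csv.DictReader(f)]

spikes = load("iowait_spikes.csv")   # e.g. columns: time,vm,iowait_pct
events = load("host_events.csv")     # e.g. columns: time,datastore,event

for spike_time, spike in spikes:
    nearby = [ev for ev_time, ev in events if abs(ev_time - spike_time) <= WINDOW]
    if nearby:
        print(f"{spike_time}  iowait spike on {spike.get('vm', '?')}:")
        for ev in nearby:
            print(f"    {ev.get('event', '?')} on {ev.get('datastore', '?')}")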

As for frequency, there is a blizzard of them when a page move plan starts and a whole lot of moves are initiated; since there are typically a lot of small moves starting and completing at that point, it can be painful for quite a few minutes. As the current plan settles and the start and completion of each volume move is spread out, it slows down to an average of 4 or 5 volumes having this occur in an hour (sometimes the same one more than once). Since we have about 29 VMware datastores (and 71 Windows volumes for four Windows 2008 R2 servers), it seems to jump around to different volumes, though it hits the same ones on each of the five VMware hosts at the same time.

Sometimes those VMs show an iowait spike of only 40, 50, 60, or 70 percent instead.  For some reason it never seems to happen when the volume in question is itself having slices moved, at least that I can see.  It is not server-load related either. Sometimes it correlates with a logout/reconnect or a volume relocate in SAN HQ, but most times it does not.

I am not sure exactly when the problem started, but it is possible it began when we upgraded to firmware 7.1.4 about 14 months ago; that accumulated into a nightmare this past December through February, with controller failures, corrupt VMs, and high latency issues as well.  We have made a lot of updates to software, drivers, firmware, and configurations, and sent a ton of those same logs as well.  In our case we are a lot better than we were, thanks to our ProSupport engineer, with only these high iowait spikes happening when a page move plan is in operation.

That being said, I find this very frustrating and am toying with asking Dell to stop all page move plans on our pool to stop the user pain.  For example, the last page move plan completed on Sunday afternoon and the high iowait spikes all but disappeared for 20 hours, until the next plan kicked in and the pain and sorrow was back once again.  What I also find frustrating is that the new plan had over 90 tasks to start and likely 15 TB of moves planned before it adds even more as it goes, and we can't possibly need that much: when there isn't any rebalance going on, our network utilization is less than 1% (and when a rebalance is happening it is at 5 to 7%).  I almost wonder if there are old plans it is completing, as it said we had been out of balance for 397,309 hours before we updated to 7.1.9 and now says 405,421 hours; both of those work out to over 45 years, so they are obviously some default or otherwise nonsensical counter.

I keep hoping to see a message saying the pool is balanced, to see if that at least helps and slows down the frequency of the plans running, but even though plans finish it has yet to say it is balanced.  Also frustrating.

Sorry for the novel but wanted to give you more of our story in case there are other similarities as well.

8 Posts

May 31st, 2016 09:00

Hi Don,

You are right; splitting the members into two pools is what our support engineer is suggesting at this point, and I have been resisting that, much to his frustration.

Looks like this is my only choice to try to address this, and I have notified Sam (our engineer) accordingly.

It makes me very nervous not to be able to do rolling updates, but the penny has dropped and we will split the pools.

Scott

8 Posts

May 31st, 2016 13:00

Hi Bealdrid,

I am not sure if my response from last night made it to you, as I don't see it for some reason.

Here is an extremely shortened version.

When no page move plans are running, the issue is rare.  When a plan is executing it can happen up to 5 or 6 times an hour, usually with different datastores, though sometimes with duplicates.  It is funny that the datastores that lose connection are not the ones with the high iowaits at that particular time.

As we have pretty well exhausted all other avenues, such as firmware, software, and other updates and changes (except for the 6248 switches themselves), we are going to break up our five-member pool into a three- and a two-member pool and hope that going to the optimal pool size will help.

I was toying with asking Dell to just stop all load balancing but that went over like a lead balloon :-)

Scott

8 Posts

May 31st, 2016 14:00

And I need to clarify an errant sentence - it should read: "It is funny that the datastores that lose connection are not the ones being moved at that particular time".

8 Posts

May 31st, 2016 14:00

The 6248s are at firmware 3.3.8.2.

8 Posts

May 31st, 2016 14:00

And the 6248s are stacked, so updating them unfortunately means an outage, which is a barrier for us.

56 Posts

May 31st, 2016 17:00

Scott,


Thanks for that info.  First let me say there's a good chance your issue is not related to mine, and we are fortunate to have Don here to provide insight for us.  However, I've been desperately looking for someone else in the EQL community who is seeing behavior similar to what we're seeing.  Would you be open to me asking you a few questions offline?  Is there some way I can contact you?


Thanks,
Bryan

8 Posts

May 31st, 2016 17:00

Hi Bryan,

Sure, I am at <ADMIN NOTE: Email id removed per privacy policy>

Scott
