1 Rookie

 • 

56 Posts

April 12th, 2015 18:00

PS6510E latency spikes

Equipment:
1 PS group: 2 x PS6510E, 1 x PS6210S, 4 x PS6210E, all at v7.1.4
PowerEdge M610 and M910 blades w/ 10G Intel NICs
M8024-k blade switches, 8024F SAN switches

Software:
ESXi 5.1 U3
Windows 2008 R2
Red Hat Enterprise Linux 6.6

For a while now, we have been seeing latency spikes (mainly write) on our PS6510E arrays.  Everything runs fine, then all of a sudden we get a latency spike on just one 6510E member.  (The 6510Es are each in their own pool.)  Most of the time it's just write latency, and read latency remains low.  This is observed from SANHQ, but the VMs that have datastores on these arrays also show a high 1-minute load average that coincides with the spike.

ESXi hosts log messages similar to "lost access to volume... due to connectivity issues" followed a couple of seconds later by "Successfully restored access to volume..  following connectivity issues".  This is logged for nearly all the datastores on the affected member (and only that member).  It recovers almost immediately, but the interruption is enough to cause a brief load spike on the VMs.  The initiators do not seem to be logging out and back in during this time.
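To put numbers on how long each datastore is unreachable, the paired "lost access" / "restored access" messages can be matched up per volume. The following is only a rough Python sketch: the sample lines, the timestamp format, and the datastore names are made up for illustration, and real ESXi log lines carry extra fields, so the regular expression would need adjusting.

```python
import re
from datetime import datetime

# Hypothetical sample lines. Real ESXi log formats differ by version;
# the timestamps and datastore names here are invented for illustration.
SAMPLE_LOG = """\
2015-04-12T17:30:02Z Lost access to volume datastore-A due to connectivity issues
2015-04-12T17:30:03Z Lost access to volume datastore-B due to connectivity issues
2015-04-12T17:30:05Z Successfully restored access to volume datastore-A following connectivity issues
2015-04-12T17:30:05Z Successfully restored access to volume datastore-B following connectivity issues
"""

# Assumed layout: "<timestamp> <event text> <volume name> ..."
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+"
    r"(?P<event>Lost access to volume|Successfully restored access to volume)\s+"
    r"(?P<vol>\S+)",
    re.IGNORECASE,
)


def outage_durations(log_text):
    """Return a list of (volume, seconds) for each lost/restored pair."""
    lost_at = {}   # volume -> time access was lost
    outages = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%SZ")
        vol = match.group("vol")
        if match.group("event").lower().startswith("lost"):
            lost_at[vol] = ts
        elif vol in lost_at:
            # Pair the restore with the most recent loss for this volume.
            outages.append((vol, (ts - lost_at.pop(vol)).total_seconds()))
    return outages


if __name__ == "__main__":
    for vol, seconds in outage_durations(SAMPLE_LOG):
        print(f"{vol}: access lost for {seconds:.0f} s")
```

Lining these durations up against the SANHQ latency graph for the same member would show whether every spike produces a loss of access or only the larger ones.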


Steps taken:  I have done my best to configure everything according to best practice (delayed ACK and LRO disabled, MEM 1.2 in use) and verified that this is all set correctly.  Jumbo frames are on, we are not using the default VLAN, etc.

I have not noticed this on our SSD 6210 or any of the other 6210s, but those are not being used heavily for VMware right now.  I have also offloaded a lot of our "heavier" IOPS volumes to our newer 6210 arrays to reduce load on the 6510s, but this has not helped much.  You can see below that the array doesn't seem to be that busy.  I have the line hovered over the spike:


Thanks for ideas/suggestions.

1 Rookie

 • 

117 Posts

April 13th, 2015 09:00

Hi Don, thanks.  The 6510s are RAID 6 with 3TB 7.2K drives.  I will consider opening the case, but one thing I've thought about doing is moving our entire VMware datastore environment over to the 6210E members, which have way better controllers as you know, and leaving the 6510Es for backups, archive, and mainly read workloads.  Do you think it's possible that the older controllers with only 2G cache, combined with slow 7.2K disks and RAID 6 on top of that, are contributing to this?

1 Rookie

 • 

56 Posts

April 13th, 2015 19:00

Thanks.  I'll look at getting the 7.1.5 upgrade done on our next maintenance.

I found a cron job that does a heavy sequential-write backup every 30 minutes, and my load spikes sometimes coincide with it.  I also found out that this volume has the "discard" mount option set; I don't know if that can affect performance.
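One way to check whether "sometimes" is more than coincidence is to measure how far each spike falls from the nearest scheduled run. Below is a rough Python sketch, assuming the job fires on the hour and half hour (as a `*/30` cron schedule would); the spike timestamps are invented placeholders, not real SANHQ data, and the 2-minute window is an arbitrary choice.

```python
from datetime import datetime


def seconds_from_schedule(ts, period_min=30):
    """Seconds between ts and the nearest multiple of period_min
    counted from midnight (assumes the job runs on those boundaries)."""
    period = period_min * 60
    sec_of_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    offset = sec_of_day % period
    return min(offset, period - offset)


def fraction_near_schedule(spikes, period_min=30, window_s=120):
    """Fraction of spikes that fall within window_s of a scheduled run."""
    if not spikes:
        return 0.0
    near = [t for t in spikes
            if seconds_from_schedule(t, period_min) <= window_s]
    return len(near) / len(spikes)


if __name__ == "__main__":
    # Hypothetical spike times, for illustration only.
    spikes = [
        datetime(2015, 4, 12, 18, 0, 40),
        datetime(2015, 4, 12, 18, 30, 15),
        datetime(2015, 4, 12, 18, 47, 0),
        datetime(2015, 4, 12, 19, 1, 10),
    ]
    print(f"{fraction_near_schedule(spikes):.0%} of spikes within 2 min of a run")
```

With a 2-minute window and a 30-minute period, spikes unrelated to the job would land inside the window only about 13% of the time, so a much higher fraction would point at the backup.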

A side question: can you clarify how EqualLogic defines write latency on its arrays?  Does it include latency to the host or network in any way?  Or is it purely the time it takes the SAN to process the request?

1 Rookie

 • 

117 Posts

April 15th, 2015 14:00

Yes, they were not replicated.  I turned discard off anyway to see if it helps, but I'm not expecting it to.  I opened a support case this afternoon and submitted the array diags, SANHQ archive, and ESXi diags.  I hope they are able to find something.
