June 1st, 2018 06:00

Unity 400F code 4.3.1.1525703027 and performance issues

Hello, all Unity flash users.

I wanted to check with the user base whether anyone else has seen performance issues after upgrading to code 4.3.1.1525703027.

A Dell technical advisory came out that applied to the drives in my Unity 400F, so I upgraded to 4.3.1.1525703027.

The previous code I was running, 4.3.0.1522077968, did not have this performance issue.


Latency immediately jumped on all ESXi host datastores after the code update. I do have an SR open, 11153806.

The left of the graph, where it is flat, is before the code update; the right, with the spikes, is after the code update.

If you do not have any monitoring tools, the following also indicates that something changed.

There were no ESXi host disk heartbeat timeouts before the code update. After the code update, ESXi host disk heartbeats now regularly time out.


I was wondering if anyone else is running 4.3.1.1525703027?

Has anyone else seen performance issues after upgrading to 4.3.1.1525703027?

I have a single traditional pool configured, which does include the vault drives.

Compression is turned on.

The below graph is from a monitoring product we use.

The key is that it really shows the change from before the code update to after the code update finished at around 8:00 AM.



Unity8104_VMware_Latency (002).png


I also want to know if anyone else is seeing periodic drive queue length spikes.

To the best of my knowledge the drive queue length spike was not caused by a change in workload, and it lasted 3-5 minutes.

I was wondering if an internal process on the Unity could cause the spike shown below.

The graph is of the first 10 drives in the DPE.

unity8104_drive_queue_lenght_spike.png

All, thank you,

June 26th, 2018 07:00

All,

I wanted to provide an update.

The issue went away when the production workload was moved off the Unity while still using the same servers and switches, just with different storage. That was key in reducing the number of components that could be introducing the problem.

With the above information we focused on the connections between the 5596UP switch and the Unity.

One at a time we disabled switch ports and waited to see if the issue continued or stopped.

That took some time; figures we did not find anything until we got to the last port on SPA.

When that port was disabled, the issue stopped. The fiber patch cables were cleaned and re-seated, and the port was turned back on.

We continue to monitor. Without the production load the issue had become more intermittent, so I will know more by the end of the week. So far the cleaning and re-seating of the fiber looks like it might have fixed the issue. I will update again after we have run a week without issue.

4.5K Posts

June 4th, 2018 08:00

When you say that latency on all the hosts jumped, do you mean that the hosts reported latency (host applications impacted), or do you mean that you saw evidence of latency in the performance monitoring tool? What version of ESXi are the hosts running? The heartbeat issue is in ESXi 6.x, but I think it may have been fixed in the latest release. It would help to disable it just to be sure.
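If the heartbeat problem here is the VMFS ATS heartbeat issue, the workaround VMware documents (I believe it is KB 2113956; verify against your ESXi build) is to fall back to plain SCSI heartbeats on each host, along the lines of:

# 0 disables ATS heartbeating on VMFS5 datastores so the host uses plain SCSI heartbeats
esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5

That is only a sketch; check the KB for the exact option and value for your ESXi version before changing anything.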

Compression will add overhead to the array, and when the array is running the compression check (which runs for four hours every 12 hours) you can experience some additional background overhead. Compression will evacuate slices in the background, but it slows down when host IO is present. When host IO is not present the process speeds up, so you may see higher utilization on the SPs and drives while this is running, with no increase in host IO.

The rules for enabling compression on LUNs are based on Table 2 on page 9 of the Unity Best Practices Guide. If you have too many LUNs with compression enabled, you should look at which LUNs have the lowest compression ratio and probably disable compression on those LUNs. The way compression works is that if the data would see less than 25% compression, it is not compressed, but SP resources are still used to make that determination (for example, a slice that would only shrink from 10 GB to 8 GB, a 20% savings, is left uncompressed even though the check still costs SP cycles). With a low ratio you may be wasting SP resources checking data that cannot be compressed by more than 25%. Disabling compression will reduce the overhead of checking new writes, but until any compressed data is overwritten, the SP will still need to decompress that data on reads.

I'd look at the LUNs with compression enabled and determine if you can disable it on some of them.

https://support.emc.com/docu69891_Dell_EMC_Unity:_Best_Practices_Guide.pdf?language=en_US

June 4th, 2018 11:00

Hi Glen,

To answer a few of your questions.

The third-party monitoring tool reported an increase in datastore I/O latency right after the upgrade.

Performance monitoring tool evidence of latency? CloudIQ is reporting a Block Latency HIGH anomaly.

A migration from ESXi 5.5 to 6.5 is in progress. It is my understanding that the KB workaround has been applied to address the heartbeat issue.
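For anyone who wants to double-check the same thing, and assuming the workaround in question is the ATS heartbeat one, the current setting can be listed per host with:

esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5

An Int Value of 0 should mean ATS heartbeating is disabled; I have not confirmed which KB our VMware team actually followed.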

SP performance is good based on CloudIQ, most of the time below 25%.

The Dell/EMC Sales team sized the array with compression turned on for everything.

I will revisit the Best Practices Guide, but CPU does not seem to be an issue when the latency hits.

 


4.5K Posts

June 6th, 2018 08:00

On CLARiiON and VNX, when you set up a host using two FC HBAs, EMC recommends zoning four paths:

HBA1 <-> SPA-0

HBA1 <-> SPB-1

HBA2 <-> SPA-1

HBA2 <-> SPB-0

The recommendation was based on limiting the number of array SP ports per HBA to two: one from SPA and one from SPB. Using more SP ports was not normally recommended, as it was possible you could overload the host HBA.

If the array has a total of four SP ports per SP, for eight ports in total, we also recommended using four paths (two per SP) for each host and dividing the eight ports between the hosts. If you have, for example, 32 hosts, then zone 16 of them to one group of four SP ports and the other 16 to the other group of four SP ports. For example:

Group 1

SPA port 0

SPA port 1

SPB port 0

SPB port 1

Group 2

SPA port 2

SPA port 3

SPB port 2

SPB port 3

Also, on Unity and VNX we recommend changing the IO Operation Limit from the default of 1000 to 1 for better load balancing.

From page 17 in the Best Practice Guide:

Dell EMC recommends configuring ESXi with the Round Robin Path Selection Plug-in with an IOPS limit of 1. See VMware Knowledge Base article 2069356 -- https://kb.vmware.com/kb/2069356
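For reference, the commands involved are roughly as follows (a sketch only: the naa ID is a placeholder, and the SATP name should be verified against the KB and your own hosts before adding a rule):

# Set an existing Unity LUN to Round Robin with an IO operation limit of 1
esxcli storage nmp psp roundrobin deviceconfig set --device=naa.xxxxxxxxxxxxxxxx --type=iops --iops=1

# Or add a claim rule so DGC (Unity/VNX) devices claimed after a reboot/rescan default to Round Robin with iops=1
esxcli storage nmp satp rule add -s "VMW_SATP_ALUA_CX" -V "DGC" -P "VMW_PSP_RR" -O "iops=1"

# Verify what a device ends up using
esxcli storage nmp device list --device=naa.xxxxxxxxxxxxxxxx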

glen

June 6th, 2018 09:00

Hi Glen,

Thank you for the input. The Unity 400F had been up and running without issue for over a year before upgrading to the 4.3.1 code. Only after the code update did we start seeing issues. The install of the Unity was done by EMC PS. Round Robin Path Selection with a limit of 1 is set. Each HBA is zoned to four ports. I can eliminate zoning as an issue because we did not have any heartbeat or latency issues while running the previous code. The Unity 400F is at our DR site and it just is not that busy. I was hoping to hear back from some end users that are running Unity OE code version 4.3.1.

I am not sure how many end users are running the 4.3.1 code, since it was just released.

Are you running Unity OE code version 4.3.1?

June 7th, 2018 08:00

Currently EMC Support has me looking for a slow-drain issue on the SAN. I am not finding a slow-drain issue, but I have asked Cisco support to look for one.

Just wanted to give an update.

4.5K Posts

June 7th, 2018 12:00

Thanks Tim, I'm monitoring your case.

glen

June 13th, 2018 09:00

Update.

Continuing to work with Dell EMC support. To help eliminate the switches as a possible source of the problem, the production-impacted workloads are being moved to another storage device. If the performance issues continue while using different storage, that would indicate we have a switch issue. If the issues go away for the workloads we take off the Unity, then we will focus more heavily on troubleshooting the Unity.

4.5K Posts

June 26th, 2018 09:00

I have a catch phrase for patch panels (gained over the years of working with fiber optics), but as this is a family-friendly forum I'll refrain.

glen

July 9th, 2018 10:00

All, it has been over a week without issues on the Unity 400F.

I cannot explain why the issue popped up after the code update, but at this point, after over a week of monitoring, we have not had any issues since cleaning and re-seating one of the fiber runs to SPA.

The EMC support case was closed with the summary below.

Below is a summary of the key points of the service request for your records:

Reported Symptom(s): Large latency being reported after unity code upgrade to 4.3.1.1525703027.  And getting VMFS heartbeat timeout.

Conclusion(s): Customer cleaned and re-seated fiber on one of the connections to SPA to resolve issue.

8.6K Posts

July 10th, 2018 02:00

Thanks for the feedback.

Yes, issues due to bad or dirty fibre cables are not uncommon and are difficult to find.
