May 4th, 2017 08:00

Any issues with XtremIO upgrades from 4.0.15-20 to 4.0.15-24?

Has anyone else experienced a disruptive upgrade specifically from 4.0.15-20 to 4.0.15-24?

I am trying to decide whether to proceed with further upgrades, so I am asking the community first.

I have a support case open, and I am sure Dell EMC will soon have answers on what caused the issue, but I would like to avoid anything that affects production.

June 27th, 2017 07:00

The XtremIO upgrade went well last week. All AIX servers involved stayed up. Successful NDU! The fix is below.

In the Customer Upgrade Preparation Guide - XtremIO, EMC added the following line:
* If using AIX, please review KB491002 prior to NDU.

KB491002 references the IBM fix:
* Apply the IBM Authorized Program Analysis Report (APAR) mentioned in IBM IV84862 - Improve Handling of Aborted Commands on the host side.

Note that the fix has been rolled into a service pack. My AIX systems that made it through the successful NDU were running the AIX OS level below.

# oslevel -s
7100-04-03-1642
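For anyone who wants to compare several hosts against the level reported above, here is a minimal sketch in Python. It only parses the string printed by `oslevel -s` and compares it with 7100-04-03-1642. Using that level as a baseline is an assumption taken from this post, not an IBM or Dell EMC requirement, and the function names are made up for illustration. The authoritative check is still whether the APAR itself is installed (on AIX, `instfix -ik <APAR>` is the usual way to query that).

```python
from typing import NamedTuple


class AixLevel(NamedTuple):
    """Parsed `oslevel -s` string: release, technology level, service pack, build."""
    release: int
    tl: int
    sp: int
    build: int


def parse_oslevel(text: str) -> AixLevel:
    """Parse a string such as '7100-04-03-1642' into comparable integers."""
    parts = text.strip().split("-")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected oslevel format: {text!r}")
    return AixLevel(*(int(p) for p in parts))


# Baseline taken from the post above. Treating it as a threshold is an
# assumption for illustration, not a vendor-published minimum.
REFERENCE = parse_oslevel("7100-04-03-1642")


def at_or_above_reference(host_level: str, reference: AixLevel = REFERENCE) -> bool:
    """True when the host is on the same release and TL, at the reference SP or later."""
    level = parse_oslevel(host_level)
    if (level.release, level.tl) != (reference.release, reference.tl):
        # A different release or TL needs its own APAR check; do not guess.
        return False
    return (level.sp, level.build) >= (reference.sp, reference.build)
```

A host on a different release or technology level returns False here on purpose, because this thread says nothing about which service pack carries the fix for those levels.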

727 Posts

May 8th, 2017 08:00

4.0.15-20 to 4.0.15-24 is an NDU (non disruptive upgrade) process. Dell EMC Support will run pre-upgrade checks in your environment to make sure we are aware of any potential issues before the NDU process starts.

May 8th, 2017 09:00

Hi,

We have had this upgrade done (same versions), as there was an advisory recommending it. We didn't face any issues. Maybe EMC can suggest more on this.

May 8th, 2017 09:00

Thank you, the above is great news, glad to hear your upgrade went smoothly.

May 8th, 2017 09:00


Avi, thank you for your comment. Yes, my earlier updates were NDUs, but the 4.0.15-20 to 4.0.15-24 upgrade was NOT an NDU.

EMC did all the pre-checking and everything passed, but a production database crashed during the update.

I will update this post when I get the RCA results, because many things can cause issues during an update.

In my case the previous update was done less than 90 days earlier without issues.

I believe my hosts were all configured correctly, which is why I was asking the community if anyone else has experienced issues with the specific version jump from 4.0.15-20 to 4.0.15-24.

I was very surprised the upgrade had issues because the previous upgrade had no issues at all.

64 Posts

May 12th, 2017 00:00

4.0.15-20 to 4.0.15-24 is about as simple an upgrade as they come.  There are no "firmware" changes in this version, so there's no need to reboot any of the storage controllers - just a quick blip as we reload the new XIOS code, and that's it.

There is one additional step that the person carrying out the upgrade will do because there wasn't a reboot, but that's completely transparent.

There's certainly no expectation of any problems for this (or any other) upgrade. Most of the time, when we see issues during upgrades, it's down to things like multipathing or timeouts not being set correctly on the host, but I'm sure support will be working with you to try and work out exactly what went wrong in this case and get it fixed.  We're actually working on a set of scripts that will validate the host-side configuration before an upgrade (or at any other time) to help avoid such issues - the one for VMware is in final testing, and (physical) Windows and Linux will follow shortly.
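Until those scripts are available, the general idea of a host-side check can be sketched in a few lines of Python. This is not the Dell EMC tooling mentioned above. The setting names and expected values below are hypothetical placeholders, and the real ones should come from the vendor's host configuration guide for your operating system.

```python
def find_mismatches(actual: dict, expected: dict) -> dict:
    """Return {setting: (actual_value, expected_value)} for each setting that differs or is missing."""
    return {
        name: (actual.get(name), want)
        for name, want in expected.items()
        if actual.get(name) != want
    }


# Hypothetical setting names and values, for illustration only.
EXPECTED = {"active_paths": 4, "io_timeout_seconds": 30}
```

Calling `find_mismatches(host_settings, EXPECTED)` for each host before the upgrade window returns an empty dictionary when everything matches, and otherwise lists what to fix.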

5 Practitioner • 274.2K Posts

May 12th, 2017 08:00

From a supportability standpoint, NDU from 4.0.15-20 -> 4.0.15-24 is incredibly stable. This is a minor release and does not include a Kernel update. This is good because it means that no reboots are required.

The only issues we've encountered thus far originated from "noisy" SAN environments. If you are concerned about the possibility of hitting an issue resulting in a disruption of service, I would start by vetting your fabric infrastructure.

May 12th, 2017 09:00

Hi Mdeitrick  and Scotthoward,  thank you for your input. 

EMC Support suggested I open a support case with the switch vendor, which I have done, and that investigation is under way. No root cause has been identified yet, but we are still digging.

May 17th, 2017 07:00

Just an update: the switch vendor did not find any FC switch issues shortly before, during, or after the upgrade.


5 Practitioner • 274.2K Posts

May 17th, 2017 12:00

Did the fabric analysis mention anything about credit starvation or slow-draining devices? Are there any ISLs present, and were all switches reviewed (not just those with links to the XtremIO)? I had one recently where a reset of unrelated switch ports on a switch that was connected via ISL resulted in excessive noise. If you would like a second pair of eyes to review the case, please provide the service request number and I'll have a look.

May 17th, 2017 12:00

mdeitrick,

EMC support just requested that I have the vendor check for flapping on the port. No flapping was detected.

Regarding the SR number:

Service Request Number: 07017682
Former Service Request Number: 85818360

May 24th, 2017 06:00

I am still waiting for the official RCA. The preliminary information that was provided referenced two possible reasons for the outage.

The first is the AIX fix, which is public knowledge; I just did not know about it. Make sure your AIX systems have IBM IV87492: IMPROVE HANDLING OF ABORTED COMMANDS APPLIES TO AIX 7100-04 installed.

The other possibility that EMC referenced was an XtremIO bug, but since the preliminary info has "EMC confidential" all over it, I will let someone from EMC disclose the bug and its details.


5 Practitioner • 274.2K Posts

May 29th, 2017 15:00

The bug you speak of is related to a reset of the Platform Manager (PM) due to a bug in the QLogic FC firmware. To understand the problem better, you need to understand a little about how the NDU process works. During the NDU, the PM enables the HBA FW Heartbeat mechanism (typically used to monitor the progress of controller reboots). In "noisy" SAN environments the HBA FW Heartbeat may unexpectedly time out, causing a reset of the controller's FC drivers. If this occurs on more than one controller at the same time, this could result in the System Manager (SYM) temporarily closing I/O gates. It takes a lot to hit these conditions, and the environment is only at risk if the HBA FW Heartbeat remains active during moments where the SAN experiences excessive "noise" - this could be related to many RSCN events, CRC errors, etc. I'll review the service request and consult with the owner in the hope of driving a quick response.
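To make the "more than one controller at the same time" condition concrete, here is a toy model in Python. It is only an illustration of the reasoning above, not XtremIO code: the `DriverReset` record, its fields, and the controller names are all hypothetical, and the real SYM logic is not described in this thread.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DriverReset:
    """One FC driver reset on a storage controller (hypothetical record)."""
    controller: str
    start: float  # seconds since the NDU began
    end: float    # seconds since the NDU began


def overlapping_resets(resets: list[DriverReset]) -> list[tuple[DriverReset, DriverReset]]:
    """Return pairs of resets on different controllers whose time windows overlap."""
    pairs = []
    ordered = sorted(resets, key=lambda r: r.start)
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start >= first.end:
                break  # sorted by start, so no later reset can overlap `first`
            if second.controller != first.controller:
                pairs.append((first, second))
    return pairs


def io_at_risk(resets: list[DriverReset]) -> bool:
    """True when at least two different controllers were resetting at the same time."""
    return bool(overlapping_resets(resets))
```

In this model a single controller resetting on its own, even repeatedly, never flags a risk. Only resets that overlap in time on different controllers do, which matches the condition described above.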

June 20th, 2017 07:00

EMC's final RCA indicates the issue was the missing AIX patch. I have an XtremIO upgrade scheduled this week, and it is a bigger jump in code: 4.0.2-80 to 4.0.15-24. I have applied the AIX patch to the AIX servers.

5 Practitioner • 274.2K Posts

June 26th, 2017 14:00

Thanks for the follow up. Good luck with your future upgrade. I hope the experience is positive and goes a little smoother. Feel free to reach back out to us if you have any further questions or concerns. Thanks again!
