
Unsolved



88006

October 9th, 2012 07:00

Firmware Update (v6.0.1) Error Messages

Hi,

While upgrading the firmware from v5.2.5/v5.2.2 to v6.0.1, I received numerous error messages from the group.  They were:

  • Connection login timed out
  • Connection failed because target offline
  • Free space in pool default is low. Performance on thin provisioned volumes, if any, might be temporarily decreased.
  • The reported size of the volumes currently belonging to pool default exceeds the pool capacity.
  • Volume state transition is in progress
  • Initiator disconnected from target during login.
  • The maximum in-use space limit for the volumes currently belonging to pool default exceeds the pool capacity.

Some background information: We have two PS6000s (in production for about two years) and a PS6100 (just purchased and installed).  The PS6100 was still going through RAID verification during the firmware upgrade, and the group automatically started moving data to the new array after setup (I did not know about the delay-data-move command option during RAID initialization). 

Also, the PS6000 arrays are:

  • RAID 10 - 3.66 TB
  • RAID 50 - 5.23 TB

And the PS6100 array is: RAID 50 - 48.3 TB

The order in which I upgraded the firmware was the PS6100 first, then the RAID 50 PS6000, and finally the RAID 10 PS6000.  During the upgrade, SAN HQ kept disconnecting me from the group, and the Group Manager GUI was extremely slow to show information about the group, the members, just about everything.

While I have completed quite a few firmware upgrades in the past, this was the first time the EqualLogic group sent this many error messages.  I thought the arrays would stay fine during the restart, since the process upgrades the secondary controller first, fails over to it, and then updates the ex-primary controller, which should not have the arrays "screaming" at me. 

Can someone please tell me what I'm doing wrong, so I can keep the EqualLogic group from yelling at me?

Thank you

7 Technologist

729 Posts

October 9th, 2012 08:00

Q1: Connection login timed out

Login goes through many stages before successful completion. If the login stays in any state for more than 15 seconds, this timeout occurs.

Recommended Action

None. The initiator will usually retry the login and succeed.

Q2: Connection failed because target offline

- The connection failed because the target (volume) is offline. First, verify that this is actually the case. If so, set the volume online and try to connect again. If the problem persists, contact support.

Q3: Free space in pool default is low. Performance on thin provisioned volumes, if any, might be temporarily decreased.

- The “default” pool (you may have renamed it, but it is the original pool created automatically when the group is first set up) is low on space.

Q4: The reported size of the volumes currently belonging to pool default exceeds the pool capacity.

- The total size of the volumes in the specified pool exceeds the pool capacity. In any pool, the total of the volume sizes cannot exceed the available space. There is not enough space in the pool to support the potential growth of all the thin-provisioned volumes to their total reported size.
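A minimal sketch of the Q4 condition (numbers are hypothetical; Python used only for illustration): the warning fires when the sum of the reported volume sizes exceeds the pool's capacity, which thin provisioning makes easy to do.

```python
# Illustrative sketch (hypothetical numbers): the Q4 warning condition is
# simply "sum of reported volume sizes > pool capacity".

def oversubscribed(pool_capacity_tb, volume_sizes_tb):
    """Return (total reported size, whether the Q4 warning would fire)."""
    total = sum(volume_sizes_tb)
    return total, total > pool_capacity_tb

# Example: a pool backed only by the two PS6000s (3.66 TB + 5.23 TB = 8.89 TB)
# hosting thin volumes whose *reported* sizes add up to more than that.
total, warn = oversubscribed(8.89, [4.0, 3.5, 2.0])
# total is 9.5 TB, warn is True: the pool cannot back full growth of all volumes.
```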

Q5: Volume state transition is in progress

- 7.3.18: Volume state transition in progress

The volume received a management request to shut down while a new login was coming in.

Recommended Action

Retry login.

Q6: Initiator disconnected from target during login.

- The login process failed because the initiator broke the TCP connection to the array before logging in to a volume; a connection closure was received from the initiator before the login process completed.

- You should open a support case so we can determine what happened; it might be a network issue.

Q7: The maximum in-use space limit for the volumes currently belonging to pool default exceeds the pool capacity.

- You can do any of the following:

• Reduce the size of one or more thin-provisioned volumes until this warning no longer occurs

• Add more space to the pool (for example, by adding another member to the group).

• Move thin-provisioned volumes to a pool with more capacity.

• Reduce the maximum in-use space warning limit for one or more thin-provisioned volumes in the pool.

-joe

294 Posts

October 9th, 2012 08:00

Thank you Joe.

I thought that since the controllers fail over (one controller is always online), the pool space would never decrease and the connections would never drop.  To me, it looks like an entire member goes offline during the upgrade, which is why the pool space decreased and the connections were dropping.  

As for the volume being reported offline: the volumes were not offline.  I had to verify this by going to the server using that volume, since the Group Manager wasn't loading (which is another issue).  

5 Practitioner

274.2K Posts

October 9th, 2012 11:00

No.  Data is striped across the members, with more on the 6100 because it's larger.  What's changed is that you now have a multimember pool, so both members are required to provide the larger pool of space.  To get around that, you would need to move a member to a new pool, then move some volumes from the old pool to the new one.  Then it will work like it did in the past.  The benefit of multimember pools is that I/O is handled by all the members, so they work cooperatively on every I/O request.  If you install the Windows HIT kit, that will also provide an enhanced MPIO algorithm that helps you better leverage I/O from both members compared to the standard Windows MPIO code.
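A minimal sketch of why a multimember pool behaves this way (member and volume names are hypothetical): a volume striped across members is only online while every member holding its pages is online.

```python
# Sketch of striped-volume availability in a multimember pool
# (member and volume names are hypothetical).

def volume_online(volume_members, online_members):
    """A striped volume is online only if every member holding its data is up."""
    return all(m in online_members for m in volume_members)

vol_stripe = {"PS6100", "PS6000-R50"}  # volume striped across both members

# Both members up: the volume is available.
both_up = volume_online(vol_stripe, {"PS6100", "PS6000-R50"})

# One member restarting for a firmware update: the striped volume goes offline.
one_down = volume_online(vol_stripe, {"PS6100"})
```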

re: slowness.  See how it goes once the build has completed.  The GUI now needs information from two members, not just one.

re: Windows.  The OS considerations guide covers how to extend the disk timeout value to 60 seconds.  That's a registry change, so it will require a reboot to become effective.
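For reference, that registry change can be captured in a .reg file like this (a sketch; the path is the standard Windows disk-class timeout key, and dword 0x3c is 60 decimal):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk]
"TimeOutValue"=dword:0000003c
```

A reboot is still required after importing it for the new timeout to take effect.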

294 Posts

October 9th, 2012 11:00

Thank you, Don.

This is a Windows environment.  

So since one member (PS6100) is substantially bigger than the other two members (PS6000s), I have to tweak the snapshot reserve space (since that's the only change I've made other than adding the new member) so that it reserves less space and fits within the two members.  Is this correct?

Is this upgrade (v6.0.1) a little different from past upgrades?  I just find it odd that this upgrade was much messier than past ones.  Also, the Group Manager GUI was extremely slow to load information (I thought the entire group went down)...

5 Practitioner

274.2K Posts

October 9th, 2012 11:00

When you restarted one of the members in the pool, that member no longer provided its space to the pool.  That's what the "free space in pool" messages are referring to.  Also, since volume data is striped across the members, when one is restarting, all the volumes the members have in common will go offline while that member is failing over.

Yes, the upgrade process updates the secondary controller, then restarts it.  However, the actual failover time before the passive controller becomes active and services I/O and login requests varies, depending on load and model.

Is this a VMware ESX environment?   ESXi v5.x has a very short login timeout of 5 seconds by default.  That should be adjusted to 60 seconds, which will typically prevent the "timeout during login" error messages.

Here's a KB article from VMware explaining how to change that.

kb.vmware.com/.../search.do
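For reference, on ESXi 5.1 and later (where the iSCSI login timeout is adjustable) the change can be made with esxcli; the adapter name `vmhba35` below is a placeholder for your software iSCSI adapter:

```
esxcli iscsi adapter param set --adapter=vmhba35 --key=LoginTimeout --value=60
```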

Regards,

294 Posts

October 9th, 2012 12:00

Thank you, Don.  

I have the iSCSI Initiator and Operating System Considerations document open.  Just to be certain that I'm looking at the same thing you're talking about, I will go into the registry and:

Increase the value of the TimeOutValue parameter (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue) to at least 60 seconds (the default is 10 seconds), making sure the type is DWORD and entering the decimal value (60).  Then reboot the server (and VMs).

Is that the only setting?

5 Practitioner

274.2K Posts

October 9th, 2012 12:00

Yes, that's the important one.  It has to be done on the host server AND inside any VMs on your Hyper-V servers, so both the host and the VMs require a restart.

294 Posts

October 9th, 2012 12:00

Thank you, Don.

We have SQL Server on the PS6000XV (RAID 10) and PS6000E/PS6100E (RAID 50), which we fail over to the backup site before updating the EqualLogic firmware.  We also have VMs on the PS6000/6100, as well as file shares and Windows clustering volumes.  

Is there a process where I can upgrade the firmware without taking volumes offline?  

5 Practitioner

274.2K Posts

October 9th, 2012 12:00

If you set the disk timeout values on the servers and the VMs, they should "ride out" the restart process.  It needs to be done not only for upgrades, but also in case a hardware problem causes the CM (controller module) to fail over.  

Otherwise, the only other option is to have another array in the group that's empty.   Add it to the pool and move out the members one at a time to a temporary pool.   Upgrade and restart that member, then move it back into the pool.   Move out the next one and repeat.    That prevents any downtime, but requires another array and takes a lot longer.

5 Practitioner

274.2K Posts

October 9th, 2012 13:00

Yes.

294 Posts

October 9th, 2012 13:00

Oh!  What if I remove thin provisioning on those RAID 50 volumes?  They should stay online then, correct?

5 Practitioner

274.2K Posts

October 9th, 2012 13:00

It won't make any difference.   Thin provisioning only changes when pages are allocated.

Think of it like two people holding a board level: if one lets go, the board falls.   The two members are dependent on each other, since part of the data is on the other member, whether all the pages for that volume are pre-allocated or not.

Are all the switch ports connecting the servers and arrays set for spanning-tree PortFast (or PVST)?   If they are not, the switch delays bringing up the ports.  We've seen that cause many upgrade/failover problems.
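On a Cisco IOS switch, for example (the interface name is a placeholder; other vendors use different syntax), the check and the fix look like this:

```
! Check the current spanning-tree state on a server/array-facing port
show spanning-tree interface GigabitEthernet0/1 portfast

! Enable PortFast on an edge port (host/array ports only, never inter-switch links)
interface GigabitEthernet0/1
 spanning-tree portfast
```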

I would suggest you open a case on the failover problem.   Diags from both members will allow support to better understand what happened; they can see exactly how long each phase of the process took.

294 Posts

October 9th, 2012 13:00

Thank you, Don.

I'm actually going through our servers, and all of the ones I've looked at so far have the value:

Name: TimeOutValue

Type: REG_DWORD

Data: 0x0000003c (60)

Does this mean that it's already configured to have a 60 second time out value?
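(For reference, 0x0000003c hex is 60 decimal, so the data shown does correspond to a 60-second timeout; a quick conversion check:)

```python
# Confirming the registry data shown above: 0x0000003c hex equals 60 decimal,
# i.e. the disk timeout on these servers is already set to 60 seconds.
value = int("0000003c", 16)
assert value == 60
```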

294 Posts

October 9th, 2012 13:00

Ahh, ok.  Thank you for the clarifications and help.  

How can I check the switch ports?

294 Posts

October 9th, 2012 13:00

So I guess when I upgraded the firmware, it didn't exactly "ride out" the restart process...  Ugh.
