
October 21st, 2015 09:00

Random Replication Errors

I have several PS6000 / PS6100 arrays running FW 5.2.11 and 6.0.x, configured for async replication, and I have the following issue:

Roughly every 30 minutes my backup EQL reports the following error in its main log:

ERROR    21.10.15 17:06:33  node1   Partner cluster3-slow2: iSCSI: login timed out. Make sure the partner IP address is correct and reachable.  

At that time no replication job is scheduled, and all replication jobs run fine. This only happens with one of my many replication partners, i.e. backup never complains about cluster3-slow or cluster3-slow4, with which it is also replicating.

Nothing else is logged, and I can ping all interfaces from every EQL in question, both internally (from the respective partner) and externally.
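For completeness: since ping only proves ICMP, a check that the partner also answers on TCP 3260 (the iSCSI port) would look something like the following sketch; the addresses are just placeholders for the partner's replication IPs.

# Confirms that a TCP connection to the iSCSI port (3260) can be opened on
# the partner addresses - plain ping only exercises ICMP.
import socket

PARTNER_IPS = ["192.0.2.10", "192.0.2.11"]  # placeholders, not real addresses
ISCSI_PORT = 3260

for ip in PARTNER_IPS:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(5)  # seconds
    try:
        sock.connect((ip, ISCSI_PORT))
        print(f"{ip}:{ISCSI_PORT} reachable")
    except OSError as err:
        print(f"{ip}:{ISCSI_PORT} NOT reachable: {err}")
    finally:
        sock.close()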

Now I'm out of ideas as to what might be happening here...

5 Practitioner • 274.2K Posts

October 21st, 2015 13:00

Hello,

I believe replication partners do periodically check each other's availability. The first thing that pops up is that running mixed F/W for replication is supported but not recommended; new features in the newer version won't be available until both partners are at the same revision. Also, 5.2.x/6.x are older firmware revisions, and there have been many improvements since then.

I would suggest you consider upgrading to a current revision and, if the problem remains, opening a support case with Dell.

Regards,

Don

7 Posts

October 21st, 2015 13:00

Hi,

Upgrading the FW is my last resort, since we have a zero-downtime target and have to migrate all volumes prior to upgrades. Also, the error happens both between EQLs of the same version and of different versions, so mixed firmware can be ruled out as the main cause.

Regards

Tim

5 Practitioner • 274.2K Posts

October 22nd, 2015 11:00

Hi Tim,

My comment about firmware was also about the age of the firmware you are running.  There have been many fixes since then.

Re: downtime - understood. Have you set your servers' disk timeout values to our recommendations? Being able to deal with a restart for firmware upgrades also makes sure you are prepared for a controller failure/failover event.
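On Linux hosts the timeout lives in sysfs; something along these lines will report it (and could raise it). This is just a sketch, and the 60-second figure below is only an assumption - please confirm the recommended value against the current EQL host configuration guide.

# Reports the SCSI disk command timeout per block device via sysfs.
# RECOMMENDED is an assumed value - verify it against the EQL host docs.
import glob

RECOMMENDED = 60  # seconds (assumption)

for path in glob.glob("/sys/block/sd*/device/timeout"):
    with open(path) as f:
        current = int(f.read().strip())
    status = "ok" if current >= RECOMMENDED else f"below assumed {RECOMMENDED}s"
    print(f"{path}: {current}s ({status})")
    # To change it (as root), write the new value back to the same file;
    # a udev rule is needed to make the setting persistent across reboots.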

Are you running VMware?   If so, it's critical for all EQL arrays to be at 6.0.7 or greater.  Versions prior to that are vulnerable to VMFS metadata corruption.

Regards,

Don

7 Posts

October 22nd, 2015 13:00

Hi Don,

Yeah, I know about the age of the firmware, but I believe an error of the sort I am seeing right now would have surfaced already if it were simply a firmware bug. Besides that, I see that old versions stay on their last iteration for a long time before they are retired, so it seems no major bugs were found in that period.

In short: I don't believe in updates "just in case" on utility devices like storage arrays.

And while I recognize the need for support providers to keep their clients on the narrowest possible base, I have almost never seen a problem this obvious being fixed just by upgrading software.

That said: I have just upgraded my backup EQL to the latest 7.1 and the error is still there - only on cluster3-slow2.

And just a short note on your points regarding timeout settings: back in the good old days of firmware 4.2, a controller restart on a PS6000 took around 20 seconds until the initiators were online again. Today a restart needs at least 60 seconds, and while my initiators survive the timeout (I'm using Xen/Debian, by the way), a 60-100 second "hang" does not match our zero-downtime target. I am aware that we have to live with that in case of a failover, but that's a different story...

Thanks for your time. I will report back any findings for documentation purposes.

Regards

Tim

5 Practitioner • 274.2K Posts

October 22nd, 2015 13:00

Hi Tim,

Thanks for the update.  Are you going to open a support case?  

Off-Topic:

Interesting; your experience with FW upgrades is very different from what I have seen in my 10 years at Dell/EQL.

In the older firmware, the entire array (both CMs) had to be rebooted.

Today Dell uses CM failover, and newer FW (7.x/8.x) works even better in that regard. An ongoing project with each new firmware generation has been to reduce the failover/restart time by streamlining the processes involved.

However, when you have many active iSCSI connections, failover does take longer while the array catches up with all the incoming IO requests, to ensure all writes are handled and the cache is synced between the controllers.

This is especially true after 5.x, since support for persistent SCSI-3 reservations was introduced to meet MS clustering requirements.

I grabbed a couple of random diag reports to show what I mean.

This was a PS6000 running 7.x firmware:

***********************************
Sun Mar 29 08:00:10 2015
***********************************
Disk            : 6.7
Cache           : 0.24
IOM+LV          : 2.36
MgtExc          : 10.15
Total           : 19.45

This one was a PS6100 running 8.0.x firmware:

Disk            : 1.39
Cache           : 1.17
IOM+LV          : 2.6
MgtExc          : 3.47
Total           : 8.63

Another PS6100 running 7.1.x:

***********************************
Sun Mar 22 08:29:43 2015
***********************************
Disk            : 1.23
Cache           : 1.23
IOM+LV          : 2.9
MgtExc          : 11.38
Total           : 16.74

Newer H/W does help a little of course.
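Just to make the comparison concrete, here is a quick tally of those stage numbers (a rough sketch; the figures are copied straight from the reports above, units as logged by the diag output):

# Stage timings copied from the three diag reports above, summed per array.
reports = {
    "PS6000, 7.x":   {"Disk": 6.7,  "Cache": 0.24, "IOM+LV": 2.36, "MgtExc": 10.15},
    "PS6100, 8.0.x": {"Disk": 1.39, "Cache": 1.17, "IOM+LV": 2.6,  "MgtExc": 3.47},
    "PS6100, 7.1.x": {"Disk": 1.23, "Cache": 1.23, "IOM+LV": 2.9,  "MgtExc": 11.38},
}

for name, stages in reports.items():
    total = sum(stages.values())
    worst = max(stages, key=stages.get)
    print(f"{name}: total {total:.2f}, largest contributor {worst} ({stages[worst]})")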

I'm curious to know what version of XenServer you are running. Also, have you enabled MPIO with EQL?
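If you are not sure whether MPIO is actually in effect on the Xen hosts, one quick way to check is to count the iSCSI sessions per target on the dom0. This is a rough sketch and assumes the open-iscsi sysfs layout:

# Counts iSCSI sessions per target via the open-iscsi sysfs entries.
# A target with only one session has only one path, i.e. no MPIO benefit.
import collections
import glob

sessions = collections.Counter()
for path in glob.glob("/sys/class/iscsi_session/session*/targetname"):
    with open(path) as f:
        sessions[f.read().strip()] += 1

for target, count in sorted(sessions.items()):
    note = "" if count > 1 else "  <- single path, MPIO not in effect?"
    print(f"{count} session(s) to {target}{note}")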

Thanks.

Don

7 Posts

November 4th, 2015 07:00

Hi,

I'm back with an update: I brought the outgoing member to 6.0.11 and also tried a new PS6000 unit - the error persists. So I'm now checking for issues outside the EqualLogics. I've gathered traffic dumps via a mirror port but cannot see anything suspicious at the moment.

BTW: upgrading from 5.2.11 to 6.0.11 needs to be done through the CLI, which restarts the whole array without doing a failover. That leads to around 5 minutes of downtime. Good thing I did not try this on a live array...

Regards,

Tim

7 Posts

November 4th, 2015 08:00

Hi Don,

I think my issue is documented in the 6.0.7 Release Notes:

Under certain conditions, when updating an existing array to firmware v6.0.7 using the Group Manager GUI, you might be prompted to complete the update process by issuing a restart using the CLI instead of the GUI. This is due to RAID check enhancements that were added to firmware v6.0.7. Once you are running v6.0.7 or later, updates to firmware will not require a restart using the CLI.

The corresponding message during the update is:

Array firmware update from version V5.2.9 to V6.0.11 failed. Reason: The array cannot be restarted using the GUI because of a RAID issue. Use the CLI 'restart' command to continue.

Regards,


Tim

5 Practitioner • 274.2K Posts

November 4th, 2015 08:00

Hello Tim.

The only time I've seen the requirement for a CLI update and a full reboot of the array is in a single-controller environment.

From the F/W upgrade doc:

Use the CLI to Update Firmware in a Single Controller Environment

The Firmware Upgrade Wizard displays the "Single controller updates are not allowed from the GUI" error in the Group Manager GUI when the array module being upgraded has only a single control module. In a single controller environment, you must perform the upgrade using the CLI.

Don
