August 16th, 2013 09:00

Backup jobs (FULL or DIFF) disconnect LUNs and bring SQL Cluster down

Here are the typical OS errors logged before SQL Server fails over to the passive node:

-Connection to the target was lost. The initiator will attempt to retry the connection.
-The initiator could not send an iSCSI PDU. Error status is given in the dump data.
-\Device\MPIODisk25 is currently in a degraded state. One or more paths have failed, though the process is now complete.

Now, I am not a SAN expert; I am an MS-SQL DBA, and I am just frustrated by the whole situation.

Has anyone seen this issue before?

I can't confirm the exact patch, but our SAN expert applied a firmware upgrade (suggested by Dell), and since then things are even worse. I can't run backups at all anymore. Before, the LUNs disconnected only sporadically and the disk resources came back online on their own. Not anymore. He also removed duplicated MPIO entries, or something like that, also at Dell's suggestion.

This is a two-node SQL Server 2012 SP1 cluster running on Windows Server 2008 R2 SP1.

Any suggestion is highly appreciated; I can bring it to our SAN expert. I am not ruling out a Windows or HP ProLiant driver problem, but everything seems to point to the SAN.

5 Practitioner • 274.2K Posts

August 16th, 2013 10:00

Disconnects on iSCSI are often network/switch related. Backups tend to be sustained reads, and when that happens the incoming buffers on the server NICs get exhausted quickly, so the cards try to issue a flow-control PAUSE frame. If the switch is not configured for flow control, it keeps sending data and the server simply starts dropping packets as the buffer overflows. That causes massive retransmits and other timeouts, which lead to dropped volumes.
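
A quick, minimal way to check whether the host really is suffering retransmits during a backup is the built-in TCP statistics (note: if the NICs are doing iSCSI offload, that traffic bypasses the Windows TCP stack and won't show up here):

C:\> netstat -s -p tcp

Compare the "Segments Retransmitted" counter before and after a backup run.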

What kind of switches are you using?  

Regards,

7 Posts

August 16th, 2013 10:00

Don,

Thanks for reply.

That's exactly what I mentioned to the SAN guy, but he says everything is fine. Honestly, though, I have not checked the logs myself.

I can ask the guy and check what type of switches we have.

I know for sure (I even told him in advance) that the SAN had poor performance. Brand new, it was giving me 40MB/sec when, with the 1Gb adapters I think we have, it should give around 80 or 90. Long story short, the problem finally exploded a month ago and he is upgrading to more and faster disks. He is also upgrading the iSCSI uplink to 10Gb, I believe.

But my main problem is this error when running backups. The LUNs fail. I have not run backups in more than a week, and that's scary!

5 Practitioner • 274.2K Posts

August 16th, 2013 11:00

Yeah, 40MB/sec is only normal for a simple, single-threaded process like a copy/paste in Explorer.

Especially on reads, the CML should do better than that, particularly if MPIO is enabled.

Depending on the switch, you can verify whether or not flowcontrol is enabled.  

I would also check the properties of the NICs and make sure that flow control is even enabled on those ports. Even if the switch is OK, the NIC may default to No. I would also try setting it to TX or TX-only rather than Auto.
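
If it helps, here is a minimal sketch of checking and setting that from PowerShell. This assumes the NetAdapter cmdlets from Windows Server 2012 and later (on 2008 R2 the same setting lives in Device Manager > NIC > Properties > Advanced), and the adapter name "iSCSI1" and the exact display-value strings are driver-dependent placeholders:

# List the flow-control setting on every NIC that exposes one
Get-NetAdapterAdvancedProperty -DisplayName "Flow Control"

# Set transmit-only flow control on one adapter (name and value are placeholders)
Set-NetAdapterAdvancedProperty -Name "iSCSI1" -DisplayName "Flow Control" -DisplayValue "Tx Enabled"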

5 Practitioner • 274.2K Posts

August 16th, 2013 11:00

Also, what kind of NICs are you using? If Broadcom, are you using iSCSI offload? Those are very sensitive to the Broadcom firmware version and the driver installed on the host; without a very recent firmware/driver you can't use Jumbo Frames with those adapters. (Jumbo isn't supported in iSCSI offload mode, only in NIC-only mode.)
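
Incidentally, a quick end-to-end sanity check for jumbo frames from the Windows host (this assumes a 9000-byte MTU; 8972 is 9000 minus 28 bytes of IP/ICMP header, and the array address below is a placeholder):

C:\> ping -f -l 8972 192.168.100.10

If that fails while a normal ping works, something in the path is not passing jumbo frames.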

7 Posts

August 16th, 2013 12:00

Don,

Appreciate your replies! Even if this does not fix the problem, it is a good start.

It looks like the switch is a Cisco 3750X. The guy or guys in charge of our network infrastructure are going to check this, as per my request.

Yep! The NIC speeds are set to auto! :-( ... Gosh, they should be a fixed value! I remember I fixed a cluster issue back in 2004 because of this. Not sure if it has something to do with the current issue, though.

5 Practitioner • 274.2K Posts

August 16th, 2013 13:00

The 3750X is a good mid-level switch. There are a couple of known issues causing retransmits that are fixed in the very latest IOS firmware; it's greater than 15.2, I don't remember the exact version. The iSCSI ports need to be auto-negotiate, not fixed speed or duplex. Spanning tree should be PVST with Portfast. Make sure the iSCSI ports are NOT on the default VLAN (typically VLAN 1) and the system MTU size is 9000 (requires a reboot if it isn't). Flow control should be set to "flowcontrol receive desired".
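
As a rough sketch, the port-level pieces of that look something like the following; the interface and VLAN numbers are placeholders for your environment, and the MTU change takes effect only after a reload:

switch(config)#system mtu jumbo 9000

switch(config)#interface GigabitEthernet1/0/10

switch(config-if)#switchport access vlan 100

switch(config-if)#spanning-tree portfast

switch(config-if)#flowcontrol receive desired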

You can check for flowcontrol negotiation with #show flowcontrol  on the switch.

Re: "auto".  It's not critical, but if flowcontrol is enabled on the switch with "recieve desired" and not negotitating, then try the "TX only" option on the host.  But I have seen that force speed and duplex on the switch prevents flowcontrol negotiation.

Finally, there's a buffer tuning option that we have found reduces retransmission rates with Cisco switches like the 3750 line.

Solution Title: PERF: Optimizing Cisco 3750X buffers for iSCSI performance

Solution Details: Important: Cisco 3750X switches should not be used for a SAN solution that will expand beyond 3-4 arrays, or that involves large-block or sequential data IO. If there are more than 4 arrays, consider the Cisco 4948 or a Nexus 5548 instead.

In order to provide optimal iSCSI performance, configure QoS and optimize the buffers for EQL iSCSI use. Once the "queue-set" configuration is applied on all the switches in a stack, no additional fine tuning is required. Generally, queue-set 1 should be dedicated ONLY to iSCSI traffic.

Here is the config we currently recommend:

switch(config)#mls qos queue-set output 1 buffers 4 88 4 4

switch(config)#mls qos queue-set output 1 threshold 1 100 100 100 400

switch(config)#mls qos queue-set output 1 threshold 2 3200 100 10 3200

switch(config)#mls qos queue-set output 1 threshold 3 100 100 100 400

switch(config)#mls qos queue-set output 1 threshold 4 100 100 100 400

Configuring a threshold value of 3200 allows a maximum memory allocation of 3200% for iSCSI packets.
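
You can confirm how the buffers and thresholds ended up allocated with:

switch#show mls qos queue-set 1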

After tuning the buffers as recommended above, packet drops may continue to be seen under certain traffic conditions.

The changes made to 'queue-set 1' may result in a blockage of TX queues on some ports and halt communication on them, especially if Jumbo Frames are in use. This is a platform restriction: the 3750X offers line-rate performance, and hence it does not offer a large amount of buffer space to handle bursty traffic. iSCSI traffic is usually bursty in nature and may demand large buffer resources to prevent drops on the switch interfaces. If the available buffers run out, we may continue to see packet drops as a result.

It is also important to understand that pause frames cause an interface to hold traffic in its transmit queue, which eventually chokes it; that is the error you saw in the logs while making the qos queue-set configuration changes.

Keep in mind that the initial queue-set configuration may need to be fine-tuned; there are revised buffer tweaks that can be implemented. Instead of using a threshold value of 3200, keep it under 2000, as noted below:

mls qos queue-set output 1 threshold 1 2000 2000 1 2000

mls qos queue-set output 1 threshold 2 2000 2000 100 2000

mls qos queue-set output 1 threshold 3 2000 2000 1 2000

mls qos queue-set output 1 threshold 4 2000 2000 1 2000

mls qos queue-set output 1 buffers 10 70 10 10

If dropped packets continue to be seen with the new tuning recommendation, and there is no voice/video traffic or any other traffic passing through this switch stack that needs special treatment, consider disabling QoS on the switch completely, or try to load-share each switch in the stack equally. Keep in mind that the switch's backplane capacity won't be more than 64Gbps at any given point. More switches in the stack mean more interfaces in use and the possibility of oversubscribing the stack ring.
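
To see whether a given port is actually dropping, the per-queue counters are the place to look (the interface number is a placeholder):

switch#show mls qos interface gigabitEthernet 1/0/10 statistics

The "output queues dropped" table at the end of that output shows drops per queue and threshold.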

There's also a document from Dell that talks about Cisco 3750X configuration.

That's here:

en.community.dell.com/.../download.aspx

Regards,

5 Practitioner • 274.2K Posts

August 16th, 2013 13:00

I really hope this helps. Since I work with EqualLogic gear, I'm very familiar with the networking side of iSCSI. The SiS doc will really help you and the network admin out.

Please do let us know how things go.

5 Practitioner • 274.2K Posts

August 16th, 2013 13:00

I re-read your post: on the server and array ports, you don't want to force-set the speed and duplex. Leave those on auto.

On the server there's an option specifically for flow control, which is typically set to Auto. I have found in some cases that setting it to "TX only", "Send only", or "TX" gets flow control negotiated properly.

7 Posts

August 16th, 2013 13:00

Fair enough.

This is the first time I do not set the speed to a fixed value, but I guess this is different.

Very good post about the switch, though! Lots of good info. I will re-read it, make sure I understand it, and forward it to my management team and our IT/SAN resource.

If we do the flow control thing, I may come back and update my thread and let you and other forum members know about the results.

Again, really appreciate your feedback.

5 Practitioner • 274.2K Posts

August 20th, 2013 08:00

Did you also make the threshold changes?

If you don't already have a case open, please open one.   That's a very generic message when an initiator can't reach the storage.  

I'd also suggest the network admin look at the stack status. I have seen bad stack cables or a bad port cause this kind of error as well.
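
Two quick checks for that on a 3750 stack:

switch#show switch

switch#show switch stack-ports

The stack ports should all show "Ok"; anything else points at a cable or stack-port problem.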

7 Posts

August 20th, 2013 08:00

Hello Don,

Sadly, we deployed the suggested flow control change (switch and iSCSI NICs) and the LUN got disconnected again while running backups on a big database.

Here are the errors:

-Connection to the target was lost. The initiator will attempt to retry the connection.

-The initiator could not send an iSCSI PDU. Error status is given in the dump data.

