dajonx · 2 Intern · 294 Posts · June 16th, 2014 07:00
I have also noticed that when I am executing a full backup on a large database, it takes six hours to complete whereas it used to take only two hours on the PS6000XV.
dajonx · 2 Intern · 294 Posts · June 16th, 2014 08:00
No, different environment. We are using 6048s, and they are dedicated to iSCSI use. I have opened a support case, and they have looked through the diags, switch output, and SAN HQ. The last thing they told me was that the switch and SAN looked good and that I should disable delayed ACK and Nagle's algorithm. After adding those entries to the iSCSI interfaces in the registry (and rebooting), I still see the same I/O messages. I haven't heard back since...
dajonx · 2 Intern · 294 Posts · June 16th, 2014 08:00
When looking at Group Manager volumes, I noticed that there are only two connections being utilized instead of four. Is this normal?
(Well, actually there are four since the volumes are part of a cluster, but the server where the volumes reside only uses two connections, and the other two belong to the other server.)
dajonx · 2 Intern · 294 Posts · June 16th, 2014 09:00
Do you think the volumes are not using MPIO? There should be four connections from one server instead of two, correct?
dajonx · 2 Intern · 294 Posts · June 16th, 2014 10:00
Dell Networking N3048 Switch
www.dell.com/.../fs
I'm sorry... I added an "S" by mistake...
Yes, I did install HIT/ME on both servers.
dajonx · 2 Intern · 294 Posts · June 16th, 2014 10:00
Oops, I meant to say two NS3048.
I disabled the delayed ACK/Nagle Algorithm via the registry and rebooted. This is what I added:
To disable delayed ACK and Nagle's algorithm, create the following entries under each SAN interface subkey in the Windows Server 2012 registry:

Subkey location:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\<interface GUID>

Entries: TcpAckFrequency, TcpNoDelay
Value type: REG_DWORD
Value to disable: 1
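In case it helps anyone else hitting this thread, those per-interface values can also be scripted rather than typed in by hand. Here's a minimal sketch that generates a .reg file with the two entries for each SAN interface; the interface GUIDs in the example are hypothetical placeholders, so substitute the GUIDs of your actual iSCSI NICs from the Interfaces registry key:

```python
# Sketch: generate a .reg file that sets TcpAckFrequency and TcpNoDelay to 1
# (disabling delayed ACK and Nagle's algorithm) for each SAN interface.
# The GUIDs passed in below are hypothetical placeholders.

BASE = (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services"
        r"\Tcpip\Parameters\Interfaces")

def build_reg(iface_guids):
    """Return .reg file text disabling delayed ACK/Nagle for each interface."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for guid in iface_guids:
        lines.append(f"[{BASE}\\{guid}]")
        lines.append('"TcpAckFrequency"=dword:00000001')  # 1 = ACK every segment
        lines.append('"TcpNoDelay"=dword:00000001')       # 1 = disable Nagle
        lines.append("")
    return "\n".join(lines)

if __name__ == "__main__":
    # Hypothetical GUIDs for two iSCSI interfaces
    print(build_reg(["{11111111-2222-3333-4444-555555555555}",
                     "{66666666-7777-8888-9999-000000000000}"]))
```

Importing the resulting file takes effect only after the TCP/IP stack reloads, so a reboot is still needed either way.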
We are not using ESXi. This is strictly a SQL Server 2008 R2 database server (in a cluster). The servers are connecting to the SAN via HIT Kit 4.6. Is the information pertaining to ESXi still relevant to a SQL Server database server?
Thank you.
dajonx · 2 Intern · 294 Posts · June 16th, 2014 10:00
Yes, I believe the firmware is up-to-date as my coworker installed it about two weeks ago.
In iSCSI Initiator, if I don't check MPIO when connecting to the volume, would there only be two connections instead of four? I just find it odd that there are only two connections on most of the volumes.
I've emailed him several times, but have not heard back. He also would not give me his phone number. So I'm pretty much dead in the water... And frustrated.
dajonx · 2 Intern · 294 Posts · June 16th, 2014 10:00
We use stacking modules. Is that what you mean?
dajonx · 2 Intern · 294 Posts · June 19th, 2014 07:00
Thank you. I found out he works from 1:30 to 9:30 PM CST, whereas I'm here from 8 to 5 EST, so our schedules don't exactly match...
Do you think this could resolve itself just by updating the NIC drivers? 7.8.16 to 7.8.53...
dajonx · 2 Intern · 294 Posts · June 26th, 2014 06:00
But that still doesn't explain why I never had this issue when the database was on the PS6000XV, yet I have it now that it's on a supposedly faster PS6100XS...
dajonx · 2 Intern · 294 Posts · June 26th, 2014 06:00
Thank you.
So it would help to separate tempdb into two LUNs... I wish there was a way to test it before implementing, as it would require a good amount of reconfiguration...
dajonx · 2 Intern · 294 Posts · June 26th, 2014 06:00
Thank you, Don.
Their recommendation is to separate the tempdb data files onto another LUN, which is on the same SAN. After thinking about it, wouldn't it still be under I/O pressure, since both LUNs are on the same SAN?
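For anyone weighing that recommendation: moving tempdb data files onto a separate LUN boils down to ALTER DATABASE ... ADD FILE statements pointed at the new drive. A small sketch that generates that T-SQL follows; the drive letter, file count, sizes, and file names are hypothetical placeholders, and the new LUN would need to be mounted and formatted first:

```python
# Sketch: generate T-SQL to add tempdb data files on a separate LUN.
# Drive letter, file names, count, and sizes are hypothetical placeholders.

def tempdb_add_files_sql(drive, n_files, size_mb=4096, growth_mb=512):
    """Return ALTER DATABASE statements adding n_files tempdb data files,
    all sized identically to avoid allocation skew between files."""
    stmts = []
    for i in range(2, 2 + n_files):  # tempdev is file 1, so start at 2
        stmts.append(
            "ALTER DATABASE tempdb ADD FILE (\n"
            f"    NAME = tempdev{i},\n"
            f"    FILENAME = '{drive}:\\tempdb\\tempdev{i}.ndf',\n"
            f"    SIZE = {size_mb}MB, FILEGROWTH = {growth_mb}MB);"
        )
    return "\n".join(stmts)

if __name__ == "__main__":
    # Two extra data files on a hypothetical T: drive backed by the new LUN
    print(tempdb_add_files_sql("T", 2))
```

Keeping all tempdb data files the same size is the usual practice so SQL Server's proportional-fill algorithm spreads allocations evenly across them.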
PFTK · 3 Posts · July 3rd, 2014 09:00
We have been experiencing the EXACT same thing. It has happened on both virtual and physical systems. What I don't understand is that the dedicated LUN for tempdb is hardly ever used (very low IOPS), but when it is, we get sporadic messages about I/O taking longer than 15 seconds.
I have given support SAN HQ logs several times, and they always say there does not appear to be a LUN/SAN issue. So when the problem started on a virtual machine, I called VMware support. They took a look at their logs and concluded that the issue was on the SAN side, at the controllers. Of course EQL disagreed. Then the problem started on an old 2003-based SQL Server. EQL again looked at the SAN HQ logs and verified best practices, but still stated that they did not see anything out of the ordinary; in fact, the engineer said the SAN was performing well. So I have pretty much given up on support with respect to this problem, and I hope someone in the EqualLogic community has seen and solved this issue and wouldn't mind sharing the resolution.
I have pretty much re-engineered my vSphere environment, VMs, and physical servers to try and address the issue. Other SAN solutions are starting to look very appealing.
Thanks.
dajonx · 2 Intern · 294 Posts · July 3rd, 2014 09:00
Right now, I have several people working on this case at Dell. The last thing they (SAN Engineer) recommended was to disable Delayed ACK and Nagle's Algorithm in the registry for the SAN interfaces which I have done and sent them SAN HQ logs since the change. I haven't heard back.
I have not split tempdb into separate volumes, as that was a recommendation from a regular support person, not the SAN Engineer. So I'm just waiting for them to get back to me.
I'm also convinced that the issue lies with the SAN. I'm not sure if it's because we are using a hybrid PS6100XS whose only RAID configuration is RAID 6 Accelerated, or if it's a hardware issue, but I have configured the database and server the same way as I had them before the migration to the new SANs.
Are you also using a hybrid SAN?
PFTK · 3 Posts · July 3rd, 2014 10:00
I had previously disabled delayed ACK on the ESXi hosts when I was using RDMs, and I have just (today) disabled delayed ACK and Nagle's through the registry on one of the troublesome VMs (using in-guest iSCSI). I'm waiting for off hours to commit the changes with a reboot; I'm not sure whether cycling the interfaces would be sufficient to reload the TCP/IP stack with the changes.
I am using a hybrid PS6500. I have two members in this pool, the other being a PS6100 that is not a hybrid. I think that, due to the small size of the tempdb LUN, the PS6500 is the sole owner of the volume. So the other thing I'm trying is pinning the LUN/volume to the PS6100. I also have a SATA-based pool that I may consider trying as well.