SQL Performance Troubleshooting

Question

Hi,

We are facing a performance issue with our SQL devices configured from VMAX Arrays. I did some initial analysis, but need some help to get to the bottom of it.

We have been receiving the following type of errors for sometime now:

Event Description: SQL Server has encountered 24804 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file in database [tempdb] (2). The OS file handle is 0x0000000000000730. The offset of the latest long I/O is: 0x00000066190000

This happens everyday at around 21:00 hours. I guess there is some batch job running at this time. The disks are all Striped Meta of TDEVs with a minimum of 4 meta members. The database and log volumes have been separated. These errors are on the database volumes I believe(looking at the file name tempdb.mdf). The devices are bound to a thin pool configured with RAID 6 (6+2) of Fibre Channel Devices. SRDF/Synchronous is setup for these devices to a near by site.

What I have been trying to find out is whether the issue is actually on the array side. So I had been going through Unisphere Performance Manager graphs. The first graph I looked at is the Storage Group Response Time graph. Response times peak at 10 ms during these periods. I believe that is acceptable for database volumes in SQL.

Graph sg respt.jpg

The SRDF relationship was a concern, so I looked at the RDF directors and found an average service time of less than 5 ms and utilization less than 35%.

I also checked the thin pool backend requests per second. No noticeable spikes, but I didnt know what to make out from the values.

I then looked at Front End port utilization, and it was never greater than 10%.

I read that if the response time on the Front end adapters is high, then it may be an issue on the array end. The Front end adapter response time was around 26 ms, so I am guessing there is a bottleneck somewhere in the array. But not sure were exactly.

Anything I am missing? Any ideas on drilling this down? Any help is much appreciated!

arjunsingh1 · Answer

I would suggest you to open a support ticket with EMC, the remote support team would do the host side config and other checks and will provide recommendations. If those don’t solve the issue then performance team would need to be engaged after collecting the required data from the Symm.

Saw this similar issue with similar error recently on tempdb with SRDF/S , when the SRDF/S for this tempdb device was halted the issue stopped occurring.

Again I would suggest to have the performance team involved.

- Arjun

sauravrohilla · Answer

Since you have mentioned that the devices are bound to raid 6 so I assume its the SATA tier? I would start by looking at these two things first.

1) write response time of that particular device (not for the entire SG).

2) Also, is it possible to test with SRDF being suspended or in acp mode?

regards,

Saurabh

PedalHarder · Answer

SQL is very good a doing large bursts of writes. They can easily queue up on the FA processors and elongate response times. How many FA CPU's are you using?

SRDF/S

The number in-flight writes is limited in SRDF/S mode to (up to ) one per symdev per FA port. So a 4-way meta presented to 4 Fa's can have a maximum of 16 in-flight writes. Are you using Powerpath? You want your writes over all the FA's.

TEMPDB

The data in TempDB is not required for system recovery. If it is the TEMPDB lun with the response time issues, set it to adaptive copy instead. You still get the structure of the lun at the R2 end with adaptive copy, so Disaster restart is not compromised.

How many writes per second in the problem period?

Run Perfmon in one-second interval during the problem period to asses the size of the write bursts. Use this information along with the SRDF/S rules to determine if your meta needs more members or you need more FA ports.

Fenglin1 · Answer

Have you asked your DBA to check if any job running at that time? There might be some abnormal space allocation(New Writes) by temp table during reindex or big join opeartion etc which caused peak disk response time. The response time increasing might caused by your SRDF/S. However, why your tempdb setup the SRDF/S? SQL Server TEMPDB normally don't need any backup or restore operation.

In additional, RAID 6 is not a proper raid protection for write sensitive application like SQL Server TEMPDB.

I agree Saurabh's advice, you can try to swithc SRDF to ACP Mode to see if it's SRDF/S issue

mumshad · Answer

These are not SATA. They are FC disks. I have mentioned that in the question.

Quincy561 · Answer

Why do you have RAID6 with synchronous RDF?  Seems like overkill.  RAID1 would reduce the load on the FC disks significantly.

VMAX

SQL Performance Troubleshooting

Was this post helpful?