I'm trying to find the cause of a performance problem on a SQL database.
Everything seems to be fine except read throughput and bandwidth, which show big sawtooth graphs.
I'm not sure, but I think I read somewhere that this could be a consequence of a bad configuration, or of a fragmented or unindexed database.
Can anyone tell me exactly what this behavior means, and whether there is a white paper discussing sawtooth performance metrics?
Thanks in advance.
I had exactly the same issue. There was nothing wrong with the array configuration; it was a bad stored procedure. Do you know when it started? Could it be related to a programming change? Once they fixed it, I did not see this behavior anymore.
Hope this helps,
Hi Pablo and thanks for your response.
It has always happened since we got here (two years ago). They use many stored procedures, and it happens only when using them. I suspect the same thing, but you know developers always blame the array, and it's the administrators who have to prove there are no array problems.
Read Cache Hit Ratio shows exactly the same sawtooth pattern continuously when running the SELECT, bouncing up and down between 30% and 85%.
What version of flare are you running and what is the array model?
What Archive Interval is Analyzer set for?
If you look at the LUN - look at Total IOPS and Total Read IOPS - is the Total mostly reads?
On the same LUN, what is the read cache hits/s (or Read Cache Hit Ratio) when you get the spikes? Is the ratio very low when the IO is very high?
Are the Reads missing Read Cache more than they are hitting?
Look at the disks for the LUN - are the IOPS exceeding the Best Practices recommendations?
15K FC = 160-180 IOPS
10K FC = 120-140 IOPS
7200 RPM SATA = 80 IOPS
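As a rough illustration of those per-drive guidelines, you can estimate the IOPS ceiling of a whole RAID group by multiplying out the disk count. The helper below is my own sketch, not an EMC formula; the per-drive numbers are the Best Practices figures above.

```python
# Per-drive IOPS guidelines from the Best Practices numbers above.
# This helper is illustrative, not an official EMC calculation.
DRIVE_IOPS = {
    "15K_FC": 180,
    "10K_FC": 140,
    "7200_SATA": 80,
}

def raid_group_iops_ceiling(drive_type: str, disk_count: int) -> int:
    """Approximate IOPS the RAID group's disks can absorb, ignoring RAID
    write penalties (reasonable here since this workload is mostly reads)."""
    return DRIVE_IOPS[drive_type] * disk_count

# A 10-disk 15K FC RAID 1/0 group, like the one in this thread:
print(raid_group_iops_ceiling("15K_FC", 10))  # 1800
```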
Look at the disk Average Seek Distance - is it very high or very low? The closer it gets to zero, the more sequential the IO is. The more random the IO, the longer it takes to service each request. If most of the IO is missing read cache, the requests are going to the disks, and if that IO is very random, the service times will be higher.
FLARE is version 24 on a CX3-40, mirrored with a CX4-80 running FLARE 29 (we will soon upgrade the CX3 to a new CX4-480).
- Archive Interval is set to 600 seconds.
- IOPS are indeed mostly reads at the LUN.
- When the IO is very high, the Read Cache Hit Ratio is very low.
- Read cache misses are much more frequent than read cache hits.
- IOPS on each of the 10 disks (15K rpm FC) of that RAID group (RAID 1/0) do not exceed Best Practices; they range between 30 and 110.
Wow! Average Seek Distance is about the same (roughly 5GB) on all the RAID group disks except disk 100 and disk 101, which show between 22 and 32GB, much bigger.
What size disks - 146GB? Are LUNs 100 and 101 the problem LUNs? If the ASD is 5GB and the disk is 146GB, that's about 3% seeking (5/146).
If the ASD is less than about 1GB, that generally means the IO is more sequential - all seeks are pretty close.
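To make that ratio concrete, here is a quick sketch (the function name is mine) that expresses Average Seek Distance as a fraction of disk capacity:

```python
def seek_fraction(avg_seek_distance_gb: float, disk_size_gb: float) -> float:
    """Average seek distance as a fraction of disk capacity.
    Closer to zero means more sequential IO; larger means more random."""
    return avg_seek_distance_gb / disk_size_gb

# The 5GB-on-146GB example from above: roughly 3% of the platter per seek.
print(round(seek_fraction(5, 146), 3))   # 0.034
# A disk seeking 27GB on a 300GB drive is far more random:
print(round(seek_fraction(27, 300), 3))  # 0.09
```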
For reads to be considered sequential they have to come one after another. When the array sees two sequential reads, the prefetch will fetch 4 additional IOs, hoping that the next IO will be in the prefetched group; if it's not, you get a cache miss. Look at "Used Prefetches (%)" - this shows how much of the prefetched data you're actually using. If it's very low, you are wasting resources prefetching. Is "Prefetch Bandwidth (MB/s)" high? Then you're prefetching large amounts of data that you're not using, and it might be better to disable prefetch on those LUNs.
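As a sketch of that decision, you could combine the two counters like this. The thresholds here are purely illustrative (my own, not EMC recommendations); the idea is simply "little of the prefetched data is used, yet prefetch is moving a lot of data":

```python
def consider_disabling_prefetch(used_prefetch_pct: float,
                                prefetch_bw_mbs: float,
                                used_threshold: float = 20.0,
                                bw_threshold: float = 50.0) -> bool:
    """Flag a LUN where prefetch looks wasteful: few prefetched blocks are
    actually used while prefetch consumes a lot of back-end bandwidth.
    Thresholds are illustrative, not official guidance."""
    return used_prefetch_pct < used_threshold and prefetch_bw_mbs > bw_threshold

print(consider_disabling_prefetch(10, 120))  # True: mostly wasted prefetch
print(consider_disabling_prefetch(65, 10))   # False: prefetch is earning its keep
```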
Look at both "Queue Length" and "Average Busy Queue Length (ABQL)" for one of the LUNs. Are the two numbers close to each other (within a couple of points)? If not, the data is coming in bursts - the IO is bursty - and that is very hard on read cache: data in read cache only stays there for a couple of seconds, so if the gaps between reads are longer than the idle delay, the cache is empty when the next burst of reads comes in. There's not much you can do to solve this except see if you can force the application to organize its requests sequentially, or maybe defrag the host filesystem.
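The "within a couple of points" comparison can be sketched as a tiny check over exported Analyzer samples (the function name and tolerance default are mine):

```python
def is_bursty(queue_length: float, abql: float, tolerance: float = 2.0) -> bool:
    """If Average Busy Queue Length sits well above the plain Queue Length,
    the IO is arriving in bursts rather than as a steady stream."""
    return abs(abql - queue_length) > tolerance

# Example: QL around 5 with ABQL around 20 is clearly bursty.
print(is_bursty(queue_length=5, abql=20))  # True
print(is_bursty(queue_length=5, abql=6))   # False
```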
The disks are 300GB. 100 and 101 are two of the ten disks forming the RAID group that contains LUN 7, which holds the database.
Used Prefetches is high, always between 60 and 70%, but Prefetch Bandwidth only rises from 3 to 15MB/s during the workload procedures.
Queue Length shows saw teeth from almost 0 up to 9, while ABQL stays between 10 and 30. That seems to match what you describe: they are not close.
Service time on LUN 7 is very low (1-3 ms) but response time is very high (25-50 ms).
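The gap between those two numbers is time spent waiting in the queue. A minimal sketch of that decomposition (standard queueing arithmetic, not an Analyzer formula; the function name is mine):

```python
def queue_wait_ms(response_time_ms: float, service_time_ms: float) -> float:
    """Response time = queue wait + service time, so the wait is the difference.
    A large wait with a small service time points at queuing/burstiness,
    not slow disks."""
    return response_time_ms - service_time_ms

# LUN 7 as described above: roughly 2 ms service, 40 ms response.
print(queue_wait_ms(40, 2))  # 38
```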
Unless you tell me otherwise, I will recommend forcing the application to organize its requests sequentially by correcting the stored procedures in SQL, and defragmenting the Windows filesystem.
Thanks a lot for your suggestions!
One last tip: select a LUN and then select Total IOPS and Response Time, and look for high response times. Are Total IOPS under 100 (or on the down slope) at the same time that response times are high?
If IOs are low and queue length is low, response times should also be low. There is a condition where IO is low but the reported response times are high - this is a false reading and can be ignored.
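If you script over exported Analyzer samples, that false condition can be filtered out before hunting for real latency spikes. A sketch, with illustrative thresholds and field names of my own choosing:

```python
def real_latency_spikes(samples, min_iops=100, max_response_ms=20):
    """Keep only samples where response time is high AND the LUN is doing
    meaningful IO; low-IOPS/high-response samples are the false condition
    described above and are dropped."""
    return [s for s in samples
            if s["iops"] >= min_iops and s["response_ms"] > max_response_ms]

samples = [
    {"iops": 40,  "response_ms": 80},   # low IOPS: likely false, ignore
    {"iops": 900, "response_ms": 45},   # real spike worth investigating
    {"iops": 800, "response_ms": 5},    # busy but fast: fine
]
print(real_latency_spikes(samples))  # [{'iops': 900, 'response_ms': 45}]
```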