High Disk write latencies

Question

Hi,

We have been seeing slowness on our Exchange servers, and at the Exchange admins' request I started monitoring my CLARiiONs because the Windows server was logging "Disk write latency > 20ms" via MOM.

I could see that most of the problematic disks were coming from SPA, which had a utilization of between 80% and 98%. So I trespassed these LUNs onto SPB, which had a utilization of around 38%. This seems to have resolved the issue on most of the disks, except for one which is still logging "Disk write latency > 20ms" in MOM.

This LUN comes from a RAID 5 RG made up of five 146 GB disks at 15K RPM. The RAID group holds just 2 LUNs, and the second LUN is never busy.

After running Analyzer I can see the stats below. Some other LUNs have worse stats than the problematic one, but they are not logging any errors. So I am in a spot now, as I do not know what is affecting the disk latency on the Windows server, especially since trespassing some of the LUNs fixed the latency errors logged for them.

Lun

===

service Time(ms) = 3.38
Avg busy queue length = 4
Throughput (IO/sec) = 106

The 5 disks which make up the RG

=========================

service Time(ms) = 3.3 to 4.5
Avg busy queue length = 4.5
Throughput (IO/sec) ,Avg=595839 /Max=10709894

SPs

===

SPA avg utilisation=78 and for SPB it is 42.7

SPA avg response time = 10.6 (max=55.5) and for SPB avg = 3.3 (max=6)

SPA Total throughput (IO/sec) = 3616 and SPB = 719

SPA dirty pages=83 and SPB=54

SPA Service Time(ms) = 0.221 and SPB=0.627

SPA avg busy queue length = 62 and SPB=5.6

I must admit the above values do not make much sense to me, but the service time seen on the disks and the response time for the SPs are the only things which seem to be a bit of a bother. It would be very helpful if you could share your experiences in similar situations, or point me to what I might be missing and what other stats could be helpful.

Thanks in advance,

Ash

kenn2347 · Answer

It really looks like SPA is getting hammered. Are the LUNs split evenly across both SPs? If they are, it might come down to the type of LUNs on each SP. Maybe you have too many high-I/O R1/0 LUNs on SPA or something.

I got this from the Analyzer Guide:

The higher the queue length for the SP, the more
requests are waiting in its queue, thus increasing the
average response time of a single request. For a given
workload, queue length and response time are directly
proportional.

It looks like your response time and queue length are high, which would make your utilization high as well.

Check the logs for any background verifies as well, as they will place a load on the SP.
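As a rough cross-check of the guide's statement, Little's Law (L = λW) ties throughput, response time, and the number of outstanding requests together. The sketch below applies it to the SP averages quoted in the first post. The function name is only illustrative, and the result should not be expected to match Analyzer's "average busy queue length" exactly, since (as far as I understand it) that metric is only averaged over the time the SP is busy.

```python
def littles_law_queue_length(throughput_iops: float, response_time_ms: float) -> float:
    """Average number of requests in the system: L = lambda * W."""
    return throughput_iops * (response_time_ms / 1000.0)


# Averages quoted in the original post.
sp_stats = {
    "SPA": {"iops": 3616, "response_ms": 10.6},
    "SPB": {"iops": 719, "response_ms": 3.3},
}

for name, stats in sp_stats.items():
    outstanding = littles_law_queue_length(stats["iops"], stats["response_ms"])
    print(f"{name}: ~{outstanding:.1f} requests outstanding on average")
```

With these inputs SPA carries far more outstanding requests than SPB, which is consistent with SPA being the busier controller.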

ashish2009 · Answer

Hi,

You are right, I feel SPA is being thrashed, so I am checking the most highly utilised LUNs and trespassing them to SPB. It is a painfully slow process, though, since I can't find a way to easily identify the most highly utilised LUNs from Analyzer.

By response times, did you mean that the response times shown by both SPs are a cause for concern? If not, then since I trespassed the problematic LUNs to SPB, should the SPA response time still affect them?

Also, is there a baseline value above which we say the response time is bad?

Thanks

kenn2347 · Answer

One way to find the most highly utilized LUNs on an SP is to right-click the SP > Performance Detail and check all the LUNs. The Utilization box is checked by default. This will give you an idea as to which ones are the highest.

No, I was saying that the SPA response times are high compared to SPB. Moving that LUN from SPA to SPB should have made the response time on B go up and on A go down.

I haven't found a document stating what the "OK" values are for any of the polled events, so I can't be sure.

AranH1 · Answer

The SP response times and queue lengths look fine to me; generally, less than 10 on average is what I like to see on an array. This is a number that I have also seen referenced by EMC personnel, but I cannot find the reference at the moment.

As kenn2347 said, you can use Analyzer to spot your busiest LUNs. Use the Performance Detail view and select all the LUNs in the array: on the LUNs tab, right-click on LUNs > Select All > LUNs. Then look at the Total Throughput (I/Os), Queue Length, and Response Time metrics. You will have to look at the metrics individually, as they have different scales and it will be hard to tell which LUNs are the busiest if all the metrics are viewed together.

ashish2009 · Answer

Thanks for your replies. I'm relieved to know the SP response time is not much of an issue here. Yes, I followed the same method to identify and trespass the LUNs. Since I have what EMC calls a large configuration, the Navisphere Analyzer window takes a while to load the details, and going through them is slow as well. I am thinking of looking into the Analyzer CLI to find something useful, so that I can probably script it.

Looking at the LUN and RG values, does anything seem out of place that could lead to the disk latency errors seen on the Windows server hosting Exchange? Is there something I am overlooking which could give me a better understanding of these latencies?
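A minimal sketch of the kind of script this could turn into, assuming the Analyzer data has already been exported to CSV with one row per LUN per poll. The column names `Object Name` and `Utilization (%)` and the sample rows are assumptions made for illustration, so they need adjusting to whatever header the real export has.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; adjust to match the header of your own export.
NAME_COL = "Object Name"
UTIL_COL = "Utilization (%)"


def rank_luns_by_utilization(csv_text: str, top_n: int = 10):
    """Return [(lun_name, mean_utilization)] sorted busiest first."""
    samples = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(UTIL_COL) or "").strip()
        if not value:
            continue  # skip polls with no utilization reading
        samples[row[NAME_COL]].append(float(value))
    means = {name: sum(vals) / len(vals) for name, vals in samples.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Made-up sample data, only to show the expected shape of the input.
sample = """Object Name,Poll Time,Utilization (%)
LUN 10,10:00,80
LUN 10,10:05,90
LUN 11,10:00,20
LUN 11,10:05,30
"""

for name, util in rank_luns_by_utilization(sample):
    print(f"{name}: {util:.1f}%")
```

The same ranking could be repeated for throughput, queue length, or response time by swapping the column name, which avoids the scale problem of viewing all metrics on one graph.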

Reg,

Ash

dynamox · Answer

The 5 disks which make up the RG

=========================

service Time(ms) = 3.3 to 4.5
Avg busy queue length = 4.5
Throughput (IO/sec) ,Avg=595839 /Max=10709894

595k IOPS combined for 5 drives? (5 x 180 IOPS = 900 IOPS)

AranH1 · Answer

dynamox wrote:

The 5 disks which make up the RG

=========================

service Time(ms) = 3.3 to 4.5
Avg busy queue length = 4.5
Throughput (IO/sec) ,Avg=595839 /Max=10709894

595k IOPS combined for 5 drives? (5 x 180 IOPS = 900 IOPS)

I don't know what to think about that. 595k IOPS would be tough for the entire array to handle, let alone a single R5 4+1 RAID group. And the max of 10 million IOPS? I think we are missing a decimal point somewhere.

AranH1 · Answer

Thanks Ash, those numbers make sense.

So, looking at the numbers, it looks like you shouldn't have any performance issues with the LUN. A single 15k FC disk can handle a sustained 180 IOPS and bursts even higher. So the fact that your host is reporting a latency of greater than 20 ms is odd.
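To put that in numbers, here is a small sketch that compares the corrected per-disk figures Ash posted (avg 25, max 80 I/O/sec) against the 180 IOPS rule of thumb. The constant and function names are only illustrative, and 180 IOPS is the guideline quoted in this thread rather than a measured limit for these particular drives.

```python
DISK_SUSTAINED_IOPS = 180  # rule of thumb for one 15k FC disk, as quoted in this thread
DISKS_IN_RG = 5

# Corrected per-disk figures posted for this RAID group.
per_disk_iops = {"avg": 25, "max": 80}


def load_fraction(iops: float, ceiling: float = DISK_SUSTAINED_IOPS) -> float:
    """Fraction of the rule-of-thumb ceiling that a given IOPS figure uses."""
    return iops / ceiling


print(f"RG ceiling: {DISKS_IN_RG * DISK_SUSTAINED_IOPS} IOPS")
for label, iops in per_disk_iops.items():
    print(f"per-disk {label}: {iops} IOPS = {load_fraction(iops):.0%} of ceiling")
```

Even the per-disk peak stays under half of the guideline, which is why the disks themselves do not look like the bottleneck.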

What exactly is being hosted on this LUN? Is it just one Exchange store? Multiple Stores or Stores and Logs?

ashish2009 · Answer

Apologies for those grossly misstated numbers. Please find the correct values below.

RG

==

I/O/sec : Max = 398.4 / Min = 30.5 / avg = 124
MB/s : Max = 5 / avg = 2.8

Per Disk

========

I/O/sec : Max = 80 / Min = 6.3 / avg = 25
MB/s : Max = 1 / avg = .57

Lun

===

I/O/sec : Max = 317 / Min = 8.7
MB/s : Max = .3 / avg = 1.4

ashish2009 · Answer

You are right, the I/O on this LUN is minimal, so I suspected the SP, but the trespass did not help. And yes, the read and write cache are enabled for the LUN.

I am not exactly aware of the Exchange design and the number of stores on this LUN, but I will try to get the host config and application details from the respective teams.

Thanks for all the feedback,

Ash

AranH1 · Answer

I assume you checked to make sure that read and write cache are enabled for the LUN?

kelleg · Answer

Be sure to check the SP Event Log for both SPA and SPB and see whether there are any excessive 'trespass' messages. Excessive is more than 20-30 per minute.

glen
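A rough sketch of how that check could be automated once the SP Event Log has been saved as text. The line format here (a leading `MM/DD/YYYY HH:MM:SS` timestamp), the sample lines, and the function names are all assumptions for illustration; the regular expression would need adapting to the actual export.

```python
import re
from collections import Counter

# Assumption: each exported log line starts with "MM/DD/YYYY HH:MM:SS".
TIMESTAMP = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}):\d{2}")


def trespasses_per_minute(lines):
    """Count lines mentioning 'trespass', keyed by minute."""
    counts = Counter()
    for line in lines:
        if "trespass" not in line.lower():
            continue
        match = TIMESTAMP.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def busy_minutes(counts, threshold=20):
    """Minutes whose trespass count exceeds the threshold."""
    return {minute: n for minute, n in counts.items() if n > threshold}


# Made-up log lines, only to exercise the functions.
sample_lines = [f"01/02/2010 10:15:{s:02d} LUN trespassed" for s in range(25)]
sample_lines.append("01/02/2010 10:16:00 unrelated message")

print(busy_minutes(trespasses_per_minute(sample_lines)))
```

The default threshold of 20 follows the lower end of the "20-30 per minute" guideline above.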
