ECN-APJ
3 Argentum

Iostat has detected storage performance issue

Symptom

Recently a customer running an archiving database found, using the iostat command, that the database disks had serious performance problems. From the storage perspective, however, the overall IOPS and bandwidth were not that high.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.43    0.00    5.37   12.99    0.00   68.21

Device:   rrqm/s   wrqm/s     r/s     w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await   svctm   %util
sda         0.00    16.00    0.00    1.00      0.00    136.00   136.00     0.01    5.00    5.00    0.50
sdb        22.00  1468.00  140.00  141.00  30352.00  11824.00   150.09     4.63   16.41    3.56   99.90
sdc         0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00
sdd         0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00
dm-0        0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00
dm-1        0.00     0.00    0.00   17.00      0.00    136.00     8.00     0.08    4.59    0.29    0.50
dm-2        0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00
dm-3        0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00
dm-4        0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00

The magnetic disk sdb is used by FAST Cache.


Cause

First, the meaning of each field:

rrqm/s: the number of read requests merged per second, that is, delta(rmerge)/s.

wrqm/s: the number of write requests merged per second, that is, delta(wmerge)/s.

r/s: the number of reads completed per second, that is, delta(rio)/s.

w/s: the number of writes completed per second, that is, delta(wio)/s.

rsec/s: the number of sectors read per second, that is, delta(rsect)/s.

wsec/s: the number of sectors written per second, that is, delta(wsect)/s.

rkB/s: the KB read per second, which is half of rsec/s because one sector is 512 bytes.

wkB/s: the KB written per second, which is half of wsec/s.

avgrq-sz: the average size (in sectors) of each I/O operation, that is, delta(rsect + wsect)/delta(rio + wio).

avgqu-sz: the average I/O queue length, that is, delta(aveq)/s/1000 (because aveq is counted in milliseconds).

await: the average time (in milliseconds) each I/O operation takes, including the time it spends waiting in the queue, that is, delta(ruse + wuse)/delta(rio + wio).

svctm: the average service time (in milliseconds) of each I/O operation, that is, delta(use)/delta(rio + wio).

%util: the I/O utilization, meaning the share of each second during which the device was busy with I/O, that is, delta(use)/s/1000.
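
As a rough illustration of the definitions above, the following Python sketch derives the extended fields from two snapshots of the per-device counters. The counter names follow the ones used in this post. The DiskCounters class, the extended_stats function and the sample numbers are assumptions made for this example: the deltas are chosen so that a 1-second interval reproduces the sdb line, reading rsec/s as 30352.00.

```python
from dataclasses import dataclass, fields


@dataclass
class DiskCounters:
    """Cumulative per-device counters, named as in the field list above."""
    rmerge: int  # read requests merged
    wmerge: int  # write requests merged
    rio: int     # reads completed
    wio: int     # writes completed
    rsect: int   # sectors read
    wsect: int   # sectors written
    ruse: int    # milliseconds spent on reads
    wuse: int    # milliseconds spent on writes
    use: int     # milliseconds the device was busy
    aveq: int    # weighted milliseconds spent on queued I/O


def extended_stats(prev: DiskCounters, curr: DiskCounters, interval_s: float) -> dict:
    """Derive iostat-style extended fields from two counter snapshots."""
    d = {f.name: getattr(curr, f.name) - getattr(prev, f.name)
         for f in fields(DiskCounters)}
    ios = d["rio"] + d["wio"]
    return {
        "rrqm/s": d["rmerge"] / interval_s,
        "wrqm/s": d["wmerge"] / interval_s,
        "r/s": d["rio"] / interval_s,
        "w/s": d["wio"] / interval_s,
        "rsec/s": d["rsect"] / interval_s,
        "wsec/s": d["wsect"] / interval_s,
        "avgrq-sz": (d["rsect"] + d["wsect"]) / ios if ios else 0.0,
        "avgqu-sz": d["aveq"] / interval_s / 1000,
        "await": (d["ruse"] + d["wuse"]) / ios if ios else 0.0,
        "svctm": d["use"] / ios if ios else 0.0,
        # delta(use)/s/1000 is a fraction; multiply by 100 to show a percentage.
        "%util": d["use"] / interval_s / 1000 * 100,
    }


# Hypothetical snapshots taken 1 second apart (not real customer counters).
before = DiskCounters(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
after = DiskCounters(22, 1468, 140, 141, 30352, 11824, 2500, 2111, 999, 4630)
stats = extended_stats(before, after, interval_s=1.0)
for name, value in stats.items():
    print(f"{name:>9}: {value:.2f}")
```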


Resolution

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.43    0.00    5.37   12.99    0.00   68.21

Device:   rrqm/s   wrqm/s     r/s     w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await   svctm   %util
sdb        22.00  1468.00  140.00  141.00  30352.00  11824.00   150.09     4.63   16.41    3.56   99.90

%util = 99.90 indicates that the device is busy almost all of the time, so the disk is likely to be a bottleneck.

CPU idle is below 70% and iowait is 12.99%, so the I/O pressure is noticeable and the read/write wait times deserve attention.

Compare await and svctm. await (16.41 ms) is much larger than svctm (3.56 ms), so each I/O request spent most of its time waiting in the queue rather than being serviced, which points to an I/O problem.

Then look at the queue: avgqu-sz = 4.63, so the I/O queue is in fact relatively short. If the queue is not long, why are requests waiting?
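
The two observations above can be checked with a little arithmetic on the sdb line. The Python sketch below subtracts svctm from await to estimate the time spent queued, and uses Little's law (queue length ≈ arrival rate × time in the system) as a rough cross-check of avgqu-sz. The variable names are made up for this example, and the cross-check is only approximate because iostat rounds each field.

```python
# Values taken from the sdb line of the iostat output.
r_per_s, w_per_s = 140.00, 141.00
await_ms, svctm_ms = 16.41, 3.56
avgqu_sz_reported = 4.63

iops = r_per_s + w_per_s

# Time each request spends queued before the device starts servicing it.
queue_wait_ms = await_ms - svctm_ms

# Little's law: average number in the system = arrival rate * time in system.
avgqu_sz_estimate = iops * await_ms / 1000

# Share of each second the device is busy, estimated as IOPS * service time.
busy_fraction = iops * svctm_ms / 1000

print(f"IOPS: {iops:.0f}")
print(f"queue wait: {queue_wait_ms:.2f} ms of {await_ms:.2f} ms")
print(f"avgqu-sz estimate: {avgqu_sz_estimate:.2f} (reported {avgqu_sz_reported:.2f})")
print(f"busy fraction: {busy_fraction:.2%}")
```

With these numbers, about 12.85 ms of every 16.41 ms is spent queued, and the estimated queue length of about 4.61 is close to the reported 4.63. The busy fraction comes out at roughly 100%, which matches the reported %util.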

Check avgrq-sz, the average amount of data handled per I/O. Together with r/s and w/s, it shows that each I/O operation moves a large amount of data and that such operations are issued very frequently. So is the load caused mainly by reads or by writes?

Check rsec/s and wsec/s: the number of sectors read (30352.00) is much larger than the number written (11824.00). The values of rrqm/s and wrqm/s (22.00 versus 1468.00) show that few read requests are merged, while many write requests are. In summary, the data being read is scattered across different sectors in different blocks, so the reads cannot be merged.
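
To put numbers on the read/write comparison, the following Python sketch computes the throughput, the average request size and the share of merged requests for the sdb line, separately for reads and writes. It reads the rsec/s value as 30352.00, which is consistent with the reported avgrq-sz. The summarize helper is hypothetical, and defining the merged share as merged / (merged + issued) is an assumption about how to interpret rrqm/s and wrqm/s.

```python
SECTOR_BYTES = 512  # sector size used by rsec/s and wsec/s


def summarize(merged_per_s: float, ios_per_s: float, sectors_per_s: float) -> dict:
    """Summarize one direction (reads or writes) of an iostat line."""
    kb_per_s = sectors_per_s * SECTOR_BYTES / 1024
    return {
        "kB/s": kb_per_s,
        "kB per request": kb_per_s / ios_per_s,
        # Assumed definition: merged requests over all requests submitted.
        "merged share": merged_per_s / (merged_per_s + ios_per_s),
    }


# sdb line: rrqm/s, r/s, rsec/s for reads; wrqm/s, w/s, wsec/s for writes.
reads = summarize(merged_per_s=22.00, ios_per_s=140.00, sectors_per_s=30352.00)
writes = summarize(merged_per_s=1468.00, ios_per_s=141.00, sectors_per_s=11824.00)

for label, summary in (("reads", reads), ("writes", writes)):
    print(f"{label:>6}: {summary['kB/s']:.0f} kB/s, "
          f"{summary['kB per request']:.1f} kB per request, "
          f"{summary['merged share']:.1%} merged")
```

Under these assumptions, reads average about 108 kB per request with roughly 14% merged, compared with about 42 kB per request and roughly 91% merged for writes.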

So although the queue is short, the system is not spending its time working through a long queue. Because each I/O is large, the time goes into seeking the widely scattered data on the physical disk. This is the present situation.

So on the one hand, the result depends on the performance of the physical disk itself, in particular its read seek time. On the other hand, you need to adjust the data distribution, and finally review the database table structure and the SQL statements to adjust the way the data is queried.

I suggest you use Unisphere Analyzer to capture some FAST Cache statistics for performance analysis:

dirty page%

flushed MB

read hit/s

read miss/s

read ratio/s

write hit/s

write miss/s

write ratio/s

Again, whether you use Analyzer or iostat, what you see is only the resulting data, so the application side must be involved in the performance analysis. Otherwise the analysis is of little use.

Author:Kevin

iEMC APJ

2 Replies
dynamox
7 Thorium

Re: Iostat has detected storage performance issue

One comment/request

First of all, these are very good discussions with good tips. I have only one request: is it possible to use images instead of text for the output of commands that does not align properly? It's very hard to read output from commands like iostat or top when the column headers are not aligned with the actual data.

Thank you

ECN-APJ
3 Argentum

Re: Iostat has detected storage performance issue

Very good suggestion, Dynamox. Unfortunately, the author confirmed that this output came from a customer's thread, so we don't have a screenshot of it. However, I have updated another thread, which covers a very similar situation, with a screenshot. We will pay attention to this next time.
