June 12th, 2017 07:00

R730xd with H730P periodic system stalls

Hi,

We have a number of PowerEdge R730xds with 16 HDDs and 2 SSDs each. They are under high load, and every once in a while we see a complete system stall of around 10 seconds across all drives.

All drives are connected to an H730P RAID controller (firmware 25.5.0.0018) and configured as RAID-1 pairs.

We suspect a controller issue, but there is nothing in the controller event log to indicate one (checked with "megacli -adpeventlog -getevents -f /tmp/t.t -a0 -nolog").
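For reference, the same tool can dump more than the event log; a rough sketch of other quick health checks on the same adapter (adapter 0 assumed, output omitted):

    megacli -AdpAllInfo -a0 -NoLog                 # overall controller state, firmware and error counters
    megacli -LDInfo -Lall -a0 -NoLog               # state and cache policy of each RAID-1 virtual disk
    megacli -AdpBbuCmd -GetBbuStatus -a0 -NoLog    # battery / cache protection status

These are only snapshots, not a definitive diagnosis, but none of them showed anything unusual either.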

The servers all run Ubuntu 16.04.2 LTS with kernel 4.4.0-64-generic.

We caught one of the stalls with iostat:

06/12/2017 02:34:13 PM
avg-cpu: %user %nice %system %iowait %steal %idle
15.31 0.00 2.42 73.16 0.00 9.11

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 2.20 252.80 0.40 9.74 0.01 78.84 16.34 64.38 64.48 0.00 3.63 92.00
sdc 0.20 1.60 232.20 6.40 7.74 0.03 66.74 13.71 57.41 58.99 0.00 3.70 88.32
sdd 0.00 1.60 326.80 0.40 24.02 0.01 150.41 11.66 33.67 33.71 0.00 2.86 93.44
sde 0.00 2.40 407.40 0.40 18.98 0.01 95.35 75.72 184.47 184.65 0.00 2.45 100.08
sdf 0.00 7.00 336.00 64.60 15.66 14.49 154.10 37.86 87.55 104.10 1.49 2.37 95.12
sdg 0.00 2.60 227.00 0.40 8.29 0.01 74.77 11.90 51.47 51.56 0.00 3.72 84.48
sdh 0.00 3.40 310.60 6.00 20.41 0.04 132.27 20.45 63.16 64.38 0.00 3.13 99.12
sdi 0.00 0.40 86.80 0.40 3.21 0.00 75.43 1.59 18.26 18.34 0.00 4.29 37.44
sda 0.00 1.20 0.00 0.40 0.00 0.01 32.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 1.60 0.00 0.01 8.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

06/12/2017 02:34:18 PM
avg-cpu: %user %nice %system %iowait %steal %idle
10.43 0.00 1.66 83.03 0.00 4.88

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.20 5.00 93.00 0.40 2.90 0.02 64.10 22.77 34.97 35.12 0.00 9.00 84.08
sdc 0.00 5.80 83.20 0.00 2.94 0.00 72.48 17.32 30.38 30.38 0.00 9.46 78.72
sdd 0.00 9.40 144.60 69.00 9.13 16.12 242.16 71.17 228.31 148.42 395.72 4.63 98.80
sde 0.00 7.00 243.60 41.20 13.00 9.42 161.21 86.77 201.54 235.28 2.10 3.51 100.08
sdf 0.00 7.40 211.00 25.40 8.54 6.23 127.93 52.82 150.41 165.63 23.94 4.23 100.08
sdg 0.00 4.80 99.00 1.00 2.87 0.09 60.61 36.18 46.96 47.35 8.80 8.11 81.12
sdh 0.00 0.00 190.00 0.20 14.41 0.03 155.44 12.41 48.50 48.56 0.00 5.22 99.20
sdi 0.00 1.80 39.00 58.00 1.27 13.97 321.65 28.56 131.43 41.66 191.79 7.09 68.80
sda 0.00 1.60 15.20 0.20 0.60 0.00 79.90 0.72 0.21 0.21 0.00 25.04 38.56
dm-0 0.00 0.00 15.40 2.00 0.60 0.01 71.91 3.78 0.18 0.21 0.00 22.16 38.56
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

06/12/2017 02:34:23 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.04 0.00 0.03 99.93 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.80 0.00 0.00 0.00 0.00 0.00 55.28 0.00 0.00 0.00 0.00 100.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 42.00 0.00 0.00 0.00 0.00 100.00
sdd 0.00 1.80 0.00 0.00 0.00 0.00 0.00 71.85 0.00 0.00 0.00 0.00 100.00
sde 0.00 1.00 0.00 0.00 0.00 0.00 0.00 97.33 0.00 0.00 0.00 0.00 100.00
sdf 0.00 0.80 0.00 0.00 0.00 0.00 0.00 53.25 0.00 0.00 0.00 0.00 100.00
sdg 0.00 3.00 0.00 0.00 0.00 0.00 0.00 103.50 0.00 0.00 0.00 0.00 100.00
sdh 0.00 3.60 0.00 0.00 0.00 0.00 0.00 10.92 0.00 0.00 0.00 0.00 100.00
sdi 0.00 3.60 0.00 0.00 0.00 0.00 0.00 46.77 0.00 0.00 0.00 0.00 100.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.94 0.00 0.00 0.00 0.00 100.00
dm-0 0.00 0.00 0.40 0.00 0.00 0.00 8.00 10.94 0.00 0.00 0.00 2500.00 100.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

06/12/2017 02:34:28 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.02 99.96 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 56.00 0.00 0.00 0.00 0.00 100.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 42.00 0.00 0.00 0.00 0.00 100.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 78.00 0.00 0.00 0.00 0.00 100.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.00 0.00 0.00 0.00 0.00 100.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 54.00 0.00 0.00 0.00 0.00 100.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 120.00 0.00 0.00 0.00 0.00 100.00
sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 11.00 0.00 0.00 0.00 0.00 100.00
sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 47.00 0.00 0.00 0.00 0.00 100.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.80 0.00 0.00 0.00 0.00 100.00
dm-0 0.00 0.00 0.20 0.00 0.03 0.00 256.00 12.80 0.00 0.00 0.00 5000.00 100.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

06/12/2017 02:34:33 PM
avg-cpu: %user %nice %system %iowait %steal %idle
14.70 0.00 5.13 71.30 0.00 8.87

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 281.20 0.40 18.23 0.00 132.62 31.67 573.45 569.33 3466.00 3.23 90.88
sdc 0.00 0.40 202.40 1.00 9.19 0.04 92.93 26.25 608.42 599.80 2352.80 3.53 71.76
sdd 0.00 1.40 341.00 101.00 24.50 24.45 226.83 37.54 474.49 564.10 171.94 1.90 84.16
sde 0.00 4.00 367.40 12.00 18.84 1.77 111.23 78.13 795.99 818.08 119.73 2.63 99.92
sdf 0.00 4.20 346.80 12.80 15.29 1.22 94.04 51.46 484.42 498.39 106.06 2.74 98.40
sdg 0.40 0.80 197.00 6.20 7.46 0.04 75.61 37.12 1437.53 1267.24 6848.52 3.74 76.08
sdh 0.40 0.80 299.00 1.00 22.80 0.03 155.85 23.58 162.22 155.99 2023.20 3.28 98.32
sdi 0.00 0.40 81.40 61.00 3.09 14.67 255.36 9.19 832.75 1432.49 32.45 2.76 39.36
sda 4.60 4.20 47.80 3.40 2.43 0.12 102.09 0.64 177.61 138.11 732.94 2.73 14.00
dm-0 0.00 0.00 51.60 7.00 2.40 0.11 87.89 1.51 495.28 128.56 3198.51 2.39 14.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.20 0.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00

06/12/2017 02:34:38 PM
avg-cpu: %user %nice %system %iowait %steal %idle
18.72 0.00 4.24 73.82 0.00 3.23

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 1.40 248.60 0.80 8.56 0.06 70.76 18.26 76.29 76.53 0.00 3.72 92.88
sdc 0.00 1.60 354.60 0.80 22.86 0.05 132.00 28.41 82.93 83.12 0.00 2.79 99.12
sdd 0.00 5.40 340.60 0.40 20.75 0.02 124.76 15.52 47.39 47.45 0.00 2.68 91.28
sde 0.00 2.00 434.60 0.40 19.70 0.01 92.77 79.31 182.94 183.11 0.00 2.30 100.00
sdf 0.00 1.00 417.40 0.40 20.27 0.01 99.37 65.95 163.27 163.43 0.00 2.39 100.00
sdg 0.00 0.80 245.40 0.40 9.19 0.00 76.59 12.21 50.01 50.10 0.00 3.72 91.44
sdh 0.00 2.20 368.20 0.80 21.89 0.04 121.75 38.24 101.81 102.03 0.00 2.71 100.00
sdi 0.00 3.80 104.60 0.40 3.46 0.02 67.82 2.26 21.59 21.67 0.00 4.16 43.68
sda 0.00 7.00 1.20 1.20 0.02 0.03 45.33 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 1.20 8.00 0.02 0.03 11.65 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.20 0.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
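For context, the capture above is extended iostat output in megabytes with timestamps, sampled every 5 seconds; assuming the sysstat package, the invocation is roughly:

    iostat -xmt 5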

Is this something you have seen before, or do you have any ideas about what we could try in order to find the cause?

Best regards,
Brian

Moderator • 8.5K Posts

June 12th, 2017 11:00

Hi,

Are the drives Dell drives? What model drives are they? Is the drive firmware up to date? It is possible that the load is too high for the controller and it is doing a reset. 

24 Posts

June 12th, 2017 12:00

Hello Josh,

The drives were all purchased with the server, so yes, they are all "Dell" drives:

16 x TOSHIBA MG03SCA400 rev DG09 (4TB) part number PH012GYY7557153RAVRQA03
 2 x INTEL SSDSC2BB12 rev D201DL13 (120GB) part number TW0MVTNM7859255G012TA00
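A sketch of how model and firmware strings like these can be read from the OS side, assuming MegaCli and adapter 0 (output trimmed here):

    megacli -PDList -a0 -NoLog | grep -E 'Inquiry Data|Device Firmware Level|Media Type'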

How can we see if the controller is performing a reset?

One thing that is a bit worrying is that a Patrol Read operation was started on June 3rd, and there is no log entry showing that it completed yet.

    2017-06-03T19:01:48+0200 CTL37
A Patrol Read operation started for Integrated RAID Controller 1.
 
    2017-05-28T04:05:14+0200 CTL38
The Patrol Read operation completed for Integrated RAID Controller 1.
 
    2017-05-27T19:01:44+0200 CTL37
A Patrol Read operation started for Integrated RAID Controller 1.
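For what it's worth, the patrol read state can also be queried directly from the OS; a minimal sketch, again assuming MegaCli and adapter 0:

    megacli -AdpPR -Info -a0 -NoLog    # patrol read mode, iterations completed and next start time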

Best regards,
Brian

Moderator • 8.5K Posts

June 12th, 2017 14:00

A patrol read runs during idle time, so if the server is always busy it will take a long time to complete. The drive firmware is up to date, so that isn't causing the timeouts.

24 Posts

June 13th, 2017 03:00

Thanks Josh.

Is there any way we can get information about controller resets? I don't see anything in the logs or iDRAC.

Best regards,
Brian

Moderator • 8.5K Posts

June 13th, 2017 10:00

From the iDRAC, create a tech support report and choose the option to include the RAID controller log; that log will have more information: http://www.dell.com/support/article/us/en/04/SLN295784/how-to-export-a-supportassist-collection-and-the-raid-controller-log-via-idrac-7-or-8-?lang=EN
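If you prefer the command line, roughly the same collection can be pulled with racadm; treat this as a sketch only, since the exact option names and data types vary a little between iDRAC firmware levels:

    racadm techsupreport collect -t SysInfo,TTYLog      # TTYLog is the RAID controller log
    racadm techsupreport export -f tsr_with_ttylog.zip  # run after the collection job has completed

The filename here is just an example.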

24 Posts

June 15th, 2017 08:00

Hi Josh,

I exported the RAID controller log through iDRAC, but it doesn't show anything interesting. Over the last few days it has only logged lines like this:

06/14/17 22:04:31: C0:Bad Block Count for LD 8 is 0

Any other ideas on how to get to the bottom of this?

Best regards,
Brian

Moderator • 8.5K Posts

June 15th, 2017 11:00

What kind of role does the server have? 

24 Posts

June 15th, 2017 15:00

The server is doing web crawling for a search engine. It is I/O bound with both reads and writes.

Moderator • 8.5K Posts

June 15th, 2017 16:00

Are the hard drives SAS or SATA? 

24 Posts

June 16th, 2017 04:00

The 2 SSDs are SATA and the 16 HDDs are SAS.

Moderator • 8.5K Posts

June 16th, 2017 10:00

What speed are the SAS drives? I think the load is higher than the drives can handle, so the controller waits for the writes to catch up before continuing. I don't know how to test this without affecting your production, though.

24 Posts

June 19th, 2017 02:00

Perhaps it could be something as simple as that.

The drives are 7.2K RPM Near-Line SAS 6Gbps drives, all connected to an H730P controller.

24 Posts

June 19th, 2017 06:00

However, we also see the stalls on the two system disks, which are SSDs, so why would heavy I/O on the HDDs affect the SSDs that only run the OS? All drives stall for 10-15 seconds.

As far as we can tell, the controller has a queue depth of 895, and iostat shows we are not even close to that.
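A sketch of where those two numbers can be read on the Linux side (the host and device names are just examples):

    cat /sys/class/scsi_host/host0/can_queue    # queue depth the megaraid_sas driver reports for the controller
    cat /sys/block/sdb/device/queue_depth       # per-device queue depth, compare against avgqu-sz in iostat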

Even though these are production servers, we can take one of them offline to run tests if you have suggestions of what to try.

Moderator • 8.5K Posts

June 19th, 2017 11:00

Can you try running a drive self test from the Lifecycle Controller? You could also try booting to our live image and stressing the drives there to see if the same behavior occurs. http://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=CWF92&fileId=3550743303&osCode=W12R2&productCode=poweredge-r730xd&languageCode=en&categoryId=DI
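If you do take one of the servers offline, a minimal fio sketch of a mixed read/write load against one of the RAID-1 pairs could look like the following; the device name, block size and runtime are placeholders, and writing to a raw device is destructive, so only run it on a box you can rebuild:

    fio --name=stalltest --filename=/dev/sdb --direct=1 --ioengine=libaio \
        --rw=randrw --rwmixread=70 --bs=64k --iodepth=32 \
        --runtime=600 --time_based --group_reporting

Watching iostat -xmt 5 in a second terminal should show whether the same pattern of 100% utilization with zero throughput reappears.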
