JH

November 19th, 2015 10:00

md3000i Degraded Physical Disk Channel

Hi,

I have an md3000i with attached md1000.  We had a HDD fail which the hot spare took over with no down time or issues.

I have now replaced the failed drive, and it was put back into the array. Once that completed, the array started showing Degraded Physical Disk Channel on channels 0 and 1 (I'm assuming that's the LUN number, since LUNs 0 and 1 are part of the RAID group with the failed disk).

It also seems to have created a Virtual Disk Not On Preferred Path issue. This md3000i has dual controllers and the ESX hosts are all multipathed, so I'm not sure why all of this is coming up after replacing a failed disk.

Thanks,
Josh.

November 19th, 2015 15:00

Hello, Josh.

So, there are a number of things going on here. We'll start with the message "Degraded Physical Disk Channel 0,1". It's telling us that channels 0 and 1 are marked as degraded by the controllers. This is probably because of the failed disk. The disk more than likely failed because it hit its error limit, and those errors also caused chatter down the channels. It's an easy fix though.

You'll need to run some commands to clear the errors from the channels (it's like resetting a counter back to zero).

Here are the commands to do so:

show allPhysicalDiskChannels stats;

clear allPhysicalDiskChannels stats;

set physicalDiskChannel [0] status=optimal;

set physicalDiskChannel [1] status=optimal;

set physicalDiskChannel [2] status=optimal;

set physicalDiskChannel [3] status=optimal;

show allPhysicalDiskChannels stats;

To get access to the command line, open a command prompt in Windows and navigate to the SMcli folder:  C:\Program Files \Dell\MD Storage Manager\client or C:\Program Files\Dell\MD Storage Manager\client

(Which path applies depends on whether the installed version is 32-bit or 64-bit.)

Then, start your commands with:

>smcli -n "NameOfArray" -c "set physicalDiskChannel [1] status=optimal;"

As for the Virtual Disk Not On Preferred Path issue, use:

SMcli -n "NameOfArray" -c "reset storageArray virtualdisk distribution;"

Run this command AFTER you've cleared the channel error counters, and let me know if it stays good.
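If you'd rather review the whole sequence before typing it, here is a minimal Python sketch that only builds and prints the SMcli command lines in the order described above; it does not run anything against the array. The function names are made up for illustration, "NameOfArray" is the same placeholder used above, and it assumes only the -n and -c options already shown.

```python
# Sketch only: prints the command lines, never executes SMcli.
ARRAY_NAME = "NameOfArray"  # placeholder, replace with the real array name


def smcli_line(array_name, script_command):
    """Format one SMcli invocation using the -n / -c form from the post."""
    return f'SMcli -n "{array_name}" -c "{script_command}"'


def channel_reset_sequence(channels=(0, 1, 2, 3)):
    """Script commands in the order given: show, clear, set optimal, show,
    then redistribute virtual disks last."""
    commands = [
        "show allPhysicalDiskChannels stats;",
        "clear allPhysicalDiskChannels stats;",
    ]
    commands += [
        f"set physicalDiskChannel [{ch}] status=optimal;" for ch in channels
    ]
    commands.append("show allPhysicalDiskChannels stats;")
    # Run only AFTER the channel error counters have been cleared.
    commands.append("reset storageArray virtualdisk distribution;")
    return commands


if __name__ == "__main__":
    for command in channel_reset_sequence():
        print(smcli_line(ARRAY_NAME, command))
```

Copy the printed lines into the command prompt one at a time, and check the output of each before moving on.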

I know I've given a lot to do here, so let me know if you have any questions.

Have a great rest of the week!

November 23rd, 2015 09:00

Thanks for the informative answer. This is a live SAN; are there any issues with running those commands? It looks like you're just clearing counters and resetting the sensors? Is this safe to do?

Thanks,

Josh.

November 23rd, 2015 10:00

What will this command do to live data?

SMcli -n "NameOfArray" -c "reset storageArray virtualdisk distribution;"

November 23rd, 2015 10:00

Absolutely safe to run these, yes. That's exactly what you're doing. I should also tell you that, once in a great while, this won't clear it. Sometimes the 'message' is just traded between the controllers and "sticks" in the GUI. If these commands don't clear the message, you'll need to reboot the SAN (not fun, I know).

But, the chances are good that these commands are all you need.

Let me know!

November 23rd, 2015 15:00

It doesn't touch data. It "redistributes" the ownership of virtual disks. If you have multipath drivers installed (MDSM GUI and Host access tools) and both RAID controllers are cabled, then you *shouldn't* see any sort of disconnect. The transfer of ownership from one controller to the other *shouldn't* take longer than the configured timeouts.

Still, if you'd feel more comfortable waiting for an open maintenance window, then do that.

November 25th, 2015 12:00

Daniel,

I'm seeing lots of RAID Controller Module errors. Please see the output of the stats command below:

DRIVE CHANNELS----------------------------

  SUMMARY

     CHANNEL  PORT              STATUS

     1        In,Out,Expansion  Degraded

     2        In,Out,Expansion  Degraded

  DETAILS

     DRIVE CHANNEL 1

        Port: In, Out, Expansion

           Status: Degraded

              Reason: Error threshold exceeded

           Max. Rate: 3 Gbps

           Current Rate: 3 Gbps

           Rate Control: Switched

           DRIVE COUNTS

              Total # of attached physical disks: 29

              Connected to: A (left), Port In

                 Attached physical disks: 14

                    Expansion enclosure: 1 (14 physical disks)

              Connected to: 0, Port Expansion

                 Attached physical disks: 15

                    Expansion enclosure: 0 (15 physical disks)

           CUMULATIVE ERROR COUNTS

              RAID Controller Module 0

                 Baseline time set:                       11/18/14 5:32:52 PM

                 Sample period (days, hh:mm:ss):          371 days, 20:04:01

                 RAID Controller Module detected errors:  0

                 Physical Disk detected errors:           3485767

                 Timeout errors:                          0

                 Total I/O count:                         757848036

              RAID Controller Module 1

                 Baseline time set:                       11/18/14 5:32:52 PM

                 Sample period (days, hh:mm:ss):          598 days, 12:09:58

                 RAID Controller Module detected errors:  948

                 Physical Disk detected errors:           5993184

                 Timeout errors:                          73

                 Total I/O count:                         2629457758

           CAPTURED INTERVAL ERROR COUNTS

           RAID Controller Module 1

              Start time: {0}                          11/18/14 10:23:41 PM

              End time: {0}                            6/5/16 6:42:17 AM

              RAID Controller Module detected errors:  916

              Physical Disk detected errors:           5642849

              Timeout errors:                          25

              Total I/O count:                         1835496439

     DRIVE CHANNEL 2

        Port: In, Out, Expansion

           Status: Degraded

              Reason: Error threshold exceeded

           Max. Rate: 3 Gbps

           Current Rate: 3 Gbps

           Rate Control: Switched

           DRIVE COUNTS

              Total # of attached physical disks: 29

              Connected to: B (right), Port In

                 Attached physical disks: 14

                    Expansion enclosure: 1 (14 physical disks)

              Connected to: 1, Port Expansion

                 Attached physical disks: 15

                    Expansion enclosure: 0 (15 physical disks)

           CUMULATIVE ERROR COUNTS

              RAID Controller Module 0

                 Baseline time set:                       11/18/14 5:32:52 PM

                 Sample period (days, hh:mm:ss):          371 days, 20:04:01

                 RAID Controller Module detected errors:  129

                 Physical Disk detected errors:           3810970

                 Timeout errors:                          2

                 Total I/O count:                         344810553

              RAID Controller Module 1

                 Baseline time set:                       11/18/14 5:32:52 PM

                 Sample period (days, hh:mm:ss):          598 days, 12:09:58

                 RAID Controller Module detected errors:  1655

                 Physical Disk detected errors:           5509740

                 Timeout errors:                          33

                 Total I/O count:                         82661965

           CAPTURED INTERVAL ERROR COUNTS

              RAID Controller Module 0

                 Start time: {0}                          11/18/14 5:32:52 PM

                 End time: {0}                            11/8/15 3:57:26 PM

                 RAID Controller Module detected errors:  65

                 Physical Disk detected errors:           3643402

                 Timeout errors:                          2

                 Total I/O count:                         189870495

Script execution complete.

SMcli completed successfully.
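For comparing the counters before and after a clear, the output above can be pulled into a small table. This is a rough Python sketch that assumes the layout shown in this post (a "DRIVE CHANNEL n" header, then "CUMULATIVE ERROR COUNTS", then one block per RAID Controller Module); the function name is made up, and other firmware versions may print the stats differently.

```python
import re

COUNTER_NAMES = (
    "RAID Controller Module detected errors",
    "Physical Disk detected errors",
    "Timeout errors",
    "Total I/O count",
)


def parse_cumulative_counts(text):
    """Return {(channel, module): {counter name: value}} for the
    CUMULATIVE ERROR COUNTS sections only; captured intervals are skipped."""
    results = {}
    channel = module = None
    in_cumulative = False
    for raw in text.splitlines():
        line = raw.strip()
        m = re.fullmatch(r"DRIVE CHANNEL (\d+)", line)
        if m:
            channel, module, in_cumulative = int(m.group(1)), None, False
            continue
        if line.endswith("ERROR COUNTS"):
            in_cumulative = line.startswith("CUMULATIVE")
            module = None
            continue
        m = re.fullmatch(r"RAID Controller Module (\d+)", line)
        if m:
            module = int(m.group(1))
            continue
        name, sep, value = line.partition(":")
        if (in_cumulative and channel is not None and module is not None
                and sep and name in COUNTER_NAMES and value.strip().isdigit()):
            results.setdefault((channel, module), {})[name] = int(value)
    return results


# Excerpt of the channel 2 output from this post.
SAMPLE = """
     DRIVE CHANNEL 2
           CUMULATIVE ERROR COUNTS
              RAID Controller Module 0
                 RAID Controller Module detected errors:  129
                 Physical Disk detected errors:           3810970
                 Timeout errors:                          2
                 Total I/O count:                         344810553
              RAID Controller Module 1
                 RAID Controller Module detected errors:  1655
                 Physical Disk detected errors:           5509740
                 Timeout errors:                          33
                 Total I/O count:                         82661965
           CAPTURED INTERVAL ERROR COUNTS
              RAID Controller Module 0
                 RAID Controller Module detected errors:  65
"""

if __name__ == "__main__":
    for key, counters in sorted(parse_cumulative_counts(SAMPLE).items()):
        print(key, counters)
```

Saving the parsed result before running the clear command and again afterwards makes it easy to see whether new errors are still accumulating.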

To the untrained eye that looks bad, with over 1600 errors on module 1. Granted, that's over 2 years. I'm trying to implement a 2nd md3000i/md1000 but am having problems with the speed. I'll create another post for that one, I think.

Anyways, are all those errors something I need to worry about?  I haven't reset the stats yet.

Thanks again for all your help!

November 25th, 2015 13:00

Hey, Josh.

Yes, these are historical errors (accumulated over roughly the life of the array). Nothing to worry about in the present. Definitely open a new post on that one, so we have a case for each issue. :)

Have a happy holiday!
