Josh Henry

8 Posts

75705

November 19th, 2015 10:00

md3000i Degraded Physical Disk Channel

Hi,

I have an md3000i with an attached md1000. We had an HDD fail, and the hot spare took over with no downtime or issues.

I have now replaced the failed drive and it was put back into the array. Once that completed, I started seeing Degraded Physical Disk Channel on channel 0,1 (I'm assuming that's the LUN number, since LUNs 0 and 1 are part of the RAID group with the failed disk).

It also seems to have created a Virtual Disk Not On Preferred Path issue. This md3000i has dual controllers and the ESX hosts are all multipathed, so I'm not sure why all of this is coming up after replacing a failed disk.

Thanks,
Josh.

Responses(7)

DELL-Daniel Ca

243 Posts

0

November 19th, 2015 15:00

Hello, Josh.

So, there are a number of things going on here. We'll start with the message "Degraded Physical Disk Channel 0,1". It's telling us that channels 0 and 1 are marked as degraded by the controllers. This is probably because of the failed disk. The disk more than likely failed because it exceeded an error threshold, which in turn caused chatter down the channels. It's an easy fix though.

You'll need to run some commands to clear the errors from the channels. (It's like resetting a counter back to zero.)

Here are the commands to do so:

show allPhysicalDiskChannels stats;

clear allPhysicalDiskChannels stats;

set physicalDiskChannel [0] status=optimal;

set physicalDiskChannel [1] status=optimal;

set physicalDiskChannel [2] status=optimal;

set physicalDiskChannel [3] status=optimal;

show allPhysicalDiskChannels stats;

To get access to the command line, you'll need to open a Command Prompt window in Windows and navigate to the SMcli folder: C:\Program Files \Dell\MD Storage Manager\client or C:\Program Files\Dell\MD Storage Manager\client

(This depends on whether the version is 32-bit or 64-bit.)

Then, start your commands with:

>smcli -n "NameOfArray" -c "set physicalDiskChannel [1] status=optimal;"
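If you'd rather script the sequence than type each line, here is a minimal Python sketch of one way to do it. It only wraps the channel commands listed above in the same SMcli -n ... -c ... form. The helper names (build_smcli_args, run_all) are made up for illustration, "NameOfArray" is a placeholder, and it assumes SMcli is on the PATH or that the script is started from the client folder. By default it only prints the command lines and runs nothing.

```python
import subprocess

# Script commands quoted from the post above, in the order given.
CHANNEL_COMMANDS = [
    "show allPhysicalDiskChannels stats;",
    "clear allPhysicalDiskChannels stats;",
    "set physicalDiskChannel [0] status=optimal;",
    "set physicalDiskChannel [1] status=optimal;",
    "set physicalDiskChannel [2] status=optimal;",
    "set physicalDiskChannel [3] status=optimal;",
    "show allPhysicalDiskChannels stats;",
]


def build_smcli_args(array_name, command, smcli="SMcli"):
    """Return the argument list for one SMcli call (hypothetical helper)."""
    return [smcli, "-n", array_name, "-c", command]


def run_all(array_name, dry_run=True):
    """Print each command line; execute it only when dry_run is False."""
    for command in CHANNEL_COMMANDS:
        args = build_smcli_args(array_name, command)
        if dry_run:
            # Show the command line roughly as it would be typed on Windows.
            print(subprocess.list2cmdline(args))
        else:
            # Assumption: SMcli is reachable from the current directory or PATH.
            subprocess.run(args, check=True)


if __name__ == "__main__":
    run_all("NameOfArray")  # placeholder array name, dry run only
```

Leaving dry_run=True lets you review the exact command lines before anything touches the array.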

As for the Virtual Disk Not On Preferred Path issue, use:

SMcli -n "NameOfArray" -c "reset storageArray virtualDiskDistribution;"

Run this command AFTER you've cleared the channel error counters, and let me know if it stays good.

I know I've given a lot to do here, so let me know if you have any questions.

Have a great rest of the week!

Josh Henry

8 Posts

1

November 23rd, 2015 09:00

Thanks for the informative answer. This is a live SAN. Are there any issues with running those commands? It looks like you're just clearing counters and resetting the sensors. Is this safe to do?

Thanks,

Josh.

Josh Henry

8 Posts

0

November 23rd, 2015 10:00

What will this command do to live data?

SMcli -n "NameOfArray" -c "reset storageArray virtualDiskDistribution;"

DELL-Daniel Ca

243 Posts

0

November 23rd, 2015 10:00

Absolutely safe to run these, yes. That's exactly what you're doing. I should tell you as well that, once in a great while, this won't clear it. Sometimes the "message" is just traded between the controllers and "sticks" in the GUI. If these commands don't clear the message, you'll need to reboot the SAN. (Not fun, I know.)

But, the chances are good that these commands are all you need.

Let me know!

DELL-Daniel Ca

243 Posts

0

November 23rd, 2015 15:00

It doesn't touch data. It "redistributes" the ownership of virtual disks. If you have multipath drivers installed (MDSM GUI and Host access tools) and both RAID controllers are cabled, then you *shouldn't* see any sort of disconnect. The transfer of ownership from one controller to the other *shouldn't* take longer than the configured timeouts.

Still, if you'd feel more comfortable waiting for an open maintenance window, then do that.

Josh Henry

8 Posts

0

November 25th, 2015 12:00

Daniel,

I'm seeing lots of RAID Controller Module errors. Please see the output of the stats command below:

DRIVE CHANNELS----------------------------

SUMMARY

CHANNEL PORT STATUS

1 In,Out,Expansion Degraded

2 In,Out,Expansion Degraded

DETAILS

DRIVE CHANNEL 1

Port: In, Out, Expansion

Status: Degraded

Reason: Error threshold exceeded

Max. Rate: 3 Gbps

Current Rate: 3 Gbps

Rate Control: Switched

DRIVE COUNTS

Total # of attached physical disks: 29

Connected to: A (left), Port In

Attached physical disks: 14

Expansion enclosure: 1 (14 physical disks)

Connected to: 0, Port Expansion

Attached physical disks: 15

Expansion enclosure: 0 (15 physical disks)

CUMULATIVE ERROR COUNTS

RAID Controller Module 0

Baseline time set: 11/18/14 5:32:52 PM

Sample period (days, hh:mm:ss): 371 days, 20:04:01

RAID Controller Module detected errors: 0

Physical Disk detected errors: 3485767

Timeout errors: 0

Total I/O count: 757848036

RAID Controller Module 1

Baseline time set: 11/18/14 5:32:52 PM

Sample period (days, hh:mm:ss): 598 days, 12:09:58

RAID Controller Module detected errors: 948

Physical Disk detected errors: 5993184

Timeout errors: 73

Total I/O count: 2629457758

CAPTURED INTERVAL ERROR COUNTS

RAID Controller Module 1

Start time: {0} 11/18/14 10:23:41 PM

End time: {0} 6/5/16 6:42:17 AM

RAID Controller Module detected errors: 916

Physical Disk detected errors: 5642849

Timeout errors: 25

Total I/O count: 1835496439

DRIVE CHANNEL 2

Port: In, Out, Expansion

Status: Degraded

Reason: Error threshold exceeded

Max. Rate: 3 Gbps

Current Rate: 3 Gbps

Rate Control: Switched

DRIVE COUNTS

Total # of attached physical disks: 29

Connected to: B (right), Port In

Attached physical disks: 14

Expansion enclosure: 1 (14 physical disks)

Connected to: 1, Port Expansion

Attached physical disks: 15

Expansion enclosure: 0 (15 physical disks)

CUMULATIVE ERROR COUNTS

RAID Controller Module 0

Baseline time set: 11/18/14 5:32:52 PM

Sample period (days, hh:mm:ss): 371 days, 20:04:01

RAID Controller Module detected errors: 129

Physical Disk detected errors: 3810970

Timeout errors: 2

Total I/O count: 344810553

RAID Controller Module 1

Baseline time set: 11/18/14 5:32:52 PM

Sample period (days, hh:mm:ss): 598 days, 12:09:58

RAID Controller Module detected errors: 1655

Physical Disk detected errors: 5509740

Timeout errors: 33

Total I/O count: 82661965

CAPTURED INTERVAL ERROR COUNTS

RAID Controller Module 0

Start time: {0} 11/18/14 5:32:52 PM

End time: {0} 11/8/15 3:57:26 PM

RAID Controller Module detected errors: 65

Physical Disk detected errors: 3643402

Timeout errors: 2

Total I/O count: 189870495

Script execution complete.

SMcli completed successfully.
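For checking output like this by script, for example before and after clearing the counters, here is a rough Python sketch that pulls the status of each drive channel out of the DETAILS section. It assumes the report keeps the layout pasted above (a "DRIVE CHANNEL n" line followed later by a "Status:" line). The function name channel_statuses is made up, and the "Optimal" string in the sample is an assumption about what a healthy channel reports.

```python
import re
from typing import Dict


def channel_statuses(report: str) -> Dict[int, str]:
    """Map each drive channel number to the first Status value under it."""
    statuses: Dict[int, str] = {}
    current = None
    for raw_line in report.splitlines():
        line = raw_line.strip()
        header = re.fullmatch(r"DRIVE CHANNEL (\d+)", line)
        if header:
            current = int(header.group(1))
            continue
        status = re.fullmatch(r"Status:\s*(\S.*)", line)
        if status and current is not None and current not in statuses:
            statuses[current] = status.group(1)
    return statuses


if __name__ == "__main__":
    # Shortened sample in the same layout as the pasted report.
    sample = """DRIVE CHANNELS----------------------------
DRIVE CHANNEL 1
Port: In, Out, Expansion
Status: Degraded
DRIVE CHANNEL 2
Port: In, Out, Expansion
Status: Optimal
"""
    for channel, status in sorted(channel_statuses(sample).items()):
        print(channel, status)
```

Only the first "Status:" line after each channel header is used, so other sections of the report do not overwrite it.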

To the untrained eye that looks bad, with over 1600 errors on module 1. Granted, that's over 2 years. I'm trying to implement a 2nd md3000i/md1000 but having problems with the speed. I'll create another post for that one, I think.

Anyways, are all those errors something I need to worry about? I haven't reset the stats yet.
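To put the raw counters in perspective, here is a small Python sketch that turns the RAID Controller Module 1 figures pasted above into rates: controller-detected errors per day and disk-detected errors per I/O. The numbers are copied from the output. The class and method names are made up for illustration, and this is arithmetic only: the threshold the array uses to mark a channel degraded is not stated in the output, so the ratios are not a health verdict.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ChannelCounters:
    """Cumulative counters for one controller on one drive channel."""
    label: str
    sample_period: timedelta
    controller_errors: int
    disk_errors: int
    timeouts: int
    total_io: int

    def controller_errors_per_day(self) -> float:
        days = self.sample_period.total_seconds() / 86400
        return self.controller_errors / days

    def disk_errors_per_io(self) -> float:
        return self.disk_errors / self.total_io


# Values copied from the cumulative error counts in the pasted report.
CH1_MODULE1 = ChannelCounters(
    label="Drive channel 1, RAID Controller Module 1",
    sample_period=timedelta(days=598, hours=12, minutes=9, seconds=58),
    controller_errors=948, disk_errors=5993184, timeouts=73,
    total_io=2629457758,
)
CH2_MODULE1 = ChannelCounters(
    label="Drive channel 2, RAID Controller Module 1",
    sample_period=timedelta(days=598, hours=12, minutes=9, seconds=58),
    controller_errors=1655, disk_errors=5509740, timeouts=33,
    total_io=82661965,
)

if __name__ == "__main__":
    for counters in (CH1_MODULE1, CH2_MODULE1):
        print(counters.label)
        print("  controller errors/day:",
              round(counters.controller_errors_per_day(), 2))
        print("  disk errors per I/O:  ",
              round(counters.disk_errors_per_io(), 5))
```

Comparing the same rates after the counters have been cleared would show whether new errors are still accumulating or whether the totals are only history.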

Thanks again for all your help!

DELL-Daniel Ca

243 Posts

0

November 25th, 2015 13:00

Hey, Josh.

These are historical errors, accumulated over roughly the life of the array. Nothing to worry about in the present. Definitely open a new post on the speed problem, so we have a case for each issue. :)

Have a happy holiday!
