8 Posts
0
75705
md3000i Degraded Physical Disk Channel
Hi,
I have an md3000i with an attached md1000. We had an HDD fail, and the hot spare took over with no downtime or issues.
I have since replaced the failed drive, and it was put back into the array. Once that completed, I started seeing Degraded Physical Disk Channel on channels 0,1 (I'm assuming that's the LUN number, since LUNs 0 and 1 are part of the RAID group with the failed disk).
It also seems to have created a Virtual Disk Not On Preferred Path issue. This md3000i has dual controllers and the ESX hosts are all multipathed, but I'm not sure why all of this is showing up after replacing a failed disk.
Thanks,
Josh.
DELL-Daniel Ca
243 Posts
0
November 19th, 2015 15:00
Hello, Josh.
So, there are a number of things going on here. We'll start with the message "Degraded Physical Disk Channel 0,1". It's telling us that channels 0 and 1 are marked as degraded by the controllers. This is probably because of the failed disk. The disk more than likely failed because it hit its error limit, which also caused extra chatter down the channels. It's an easy fix, though.
You'll need to run some commands to clear the errors from the channels (it's like resetting a counter back to zero).
Here are the commands to do so:
show allPhysicalDiskChannels stats;
clear allPhysicalDiskChannels stats;
set physicalDiskChannel [0] status=optimal;
set physicalDiskChannel [1] status=optimal;
set physicalDiskChannel [2] status=optimal;
set physicalDiskChannel [3] status=optimal;
show allPhysicalDiskChannels stats;
To get access to the command line, open a command prompt in Windows and navigate to the SMcli folder: C:\Program Files \Dell\MD Storage Manager\client or C:\Program Files\Dell\MD Storage Manager\client
(This depends on whether the version is 32-bit or 64-bit.)
Then, start each command with:
>smcli -n "NameOfArray" -c "set physicalDiskChannel [1] status=optimal;"
As for the Virtual Disk Not On Preferred Path issue, run:
SMCli -n "NameOfArray" -c "reset storageArray virtualDiskDistribution;"
Run this command AFTER you've cleared the channel error counters, and let me know if it stays good.
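If it helps to see the order laid out, here is a minimal Python sketch that only builds the SMcli invocations listed above and prints them; it does not run anything against the array. "NameOfArray" is the placeholder from this post, the channel numbers are the ones listed above, and the function name build_smcli_commands is hypothetical. The spelling virtualDiskDistribution is my understanding of the MD CLI syntax, so treat it as an assumption and check it against your CLI guide.

```python
# Sketch only: assembles the SMcli calls in the order described in the post.
# Nothing is executed; the caller decides whether and how to run them.

def build_smcli_commands(
    array_name,
    channels=(0, 1, 2, 3),  # channel numbers as listed in the post
    executable="SMcli",
    # Assumed spelling of the redistribution command; verify before use.
    redistribute_command="reset storageArray virtualDiskDistribution;",
):
    """Return one argument list per SMcli call, in run order."""
    script_commands = [
        "show allPhysicalDiskChannels stats;",   # capture counters before clearing
        "clear allPhysicalDiskChannels stats;",  # reset the error counters
    ]
    script_commands += [
        f"set physicalDiskChannel [{ch}] status=optimal;" for ch in channels
    ]
    script_commands.append("show allPhysicalDiskChannels stats;")  # confirm result
    # Redistribution goes last, after the channel counters are cleared.
    script_commands.append(redistribute_command)
    return [[executable, "-n", array_name, "-c", cmd] for cmd in script_commands]


if __name__ == "__main__":
    for argv in build_smcli_commands("NameOfArray"):
        # Quote arguments containing spaces so each line can be pasted into a prompt.
        print(" ".join(f'"{a}"' if " " in a else a for a in argv))
```

Printing the commands first makes it easy to review them before pasting them into the command prompt one at a time.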
I know I've given a lot to do here, so let me know if you have any questions.
Have a great rest of the week!
Josh Henry
8 Posts
1
November 23rd, 2015 09:00
Thanks for the informative answer. This is a live SAN; are there any issues with running those commands? It looks like you're just clearing counters and resetting the sensors? Is this safe to do?
Thanks,
Josh.
Josh Henry
8 Posts
0
November 23rd, 2015 10:00
What will this command do to live data?
SMCli -n "NameOfArray" -c "reset storageArray virtualDiskDistribution;"
DELL-Daniel Ca
243 Posts
0
November 23rd, 2015 10:00
Absolutely safe to run these, yes. That's exactly what you're doing. I should tell you as well: once in a great while, this won't clear it. Sometimes the 'message' is just traded between the controllers and "sticks" in the GUI. If these commands don't clear the message, you'll need to reboot the SAN (not fun, I know).
But, the chances are good that these commands are all you need.
Let me know!
DELL-Daniel Ca
243 Posts
0
November 23rd, 2015 15:00
It doesn't touch data. It "redistributes" the ownership of virtual disks. If you have multipath drivers installed (MDSM GUI and Host access tools) and both RAID controllers are cabled, then you *shouldn't* see any sort of disconnect. The transfer of ownership from one controller to the other *shouldn't* take longer than the configured timeouts.
Still, if you'd feel more comfortable waiting for an open maintenance window, then do that.
Josh Henry
8 Posts
0
November 25th, 2015 12:00
Daniel,
I'm seeing lots of RAID Controller Module errors. Please see the output of the command below:
DRIVE CHANNELS----------------------------
SUMMARY
CHANNEL PORT STATUS
1 In,Out,Expansion Degraded
2 In,Out,Expansion Degraded
DETAILS
DRIVE CHANNEL 1
Port: In, Out, Expansion
Status: Degraded
Reason: Error threshold exceeded
Max. Rate: 3 Gbps
Current Rate: 3 Gbps
Rate Control: Switched
DRIVE COUNTS
Total # of attached physical disks: 29
Connected to: A (left), Port In
Attached physical disks: 14
Expansion enclosure: 1 (14 physical disks)
Connected to: 0, Port Expansion
Attached physical disks: 15
Expansion enclosure: 0 (15 physical disks)
CUMULATIVE ERROR COUNTS
RAID Controller Module 0
Baseline time set: 11/18/14 5:32:52 PM
Sample period (days, hh:mm:ss): 371 days, 20:04:01
RAID Controller Module detected errors: 0
Physical Disk detected errors: 3485767
Timeout errors: 0
Total I/O count: 757848036
RAID Controller Module 1
Baseline time set: 11/18/14 5:32:52 PM
Sample period (days, hh:mm:ss): 598 days, 12:09:58
RAID Controller Module detected errors: 948
Physical Disk detected errors: 5993184
Timeout errors: 73
Total I/O count: 2629457758
CAPTURED INTERVAL ERROR COUNTS
RAID Controller Module 1
Start time: {0} 11/18/14 10:23:41 PM
End time: {0} 6/5/16 6:42:17 AM
RAID Controller Module detected errors: 916
Physical Disk detected errors: 5642849
Timeout errors: 25
Total I/O count: 1835496439
DRIVE CHANNEL 2
Port: In, Out, Expansion
Status: Degraded
Reason: Error threshold exceeded
Max. Rate: 3 Gbps
Current Rate: 3 Gbps
Rate Control: Switched
DRIVE COUNTS
Total # of attached physical disks: 29
Connected to: B (right), Port In
Attached physical disks: 14
Expansion enclosure: 1 (14 physical disks)
Connected to: 1, Port Expansion
Attached physical disks: 15
Expansion enclosure: 0 (15 physical disks)
CUMULATIVE ERROR COUNTS
RAID Controller Module 0
Baseline time set: 11/18/14 5:32:52 PM
Sample period (days, hh:mm:ss): 371 days, 20:04:01
RAID Controller Module detected errors: 129
Physical Disk detected errors: 3810970
Timeout errors: 2
Total I/O count: 344810553
RAID Controller Module 1
Baseline time set: 11/18/14 5:32:52 PM
Sample period (days, hh:mm:ss): 598 days, 12:09:58
RAID Controller Module detected errors: 1655
Physical Disk detected errors: 5509740
Timeout errors: 33
Total I/O count: 82661965
CAPTURED INTERVAL ERROR COUNTS
RAID Controller Module 0
Start time: {0} 11/18/14 5:32:52 PM
End time: {0} 11/8/15 3:57:26 PM
RAID Controller Module detected errors: 65
Physical Disk detected errors: 3643402
Timeout errors: 2
Total I/O count: 189870495
Script execution complete.
SMcli completed successfully.
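For anyone who wants to check this kind of output without reading it by eye, here is a minimal Python sketch that pulls the per-channel status out of the SUMMARY table. It assumes the layout matches the paste above (channel number, comma-separated ports, one-word status, with DETAILS ending the table); the function name parse_channel_summary is hypothetical, and other firmware versions may format the table differently.

```python
import re

# One summary row: channel number, port list, single-word status.
SUMMARY_ROW = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\S+)\s*$")

# Sample copied from the SUMMARY section of the output above.
SAMPLE = """SUMMARY
CHANNEL PORT STATUS
1 In,Out,Expansion Degraded
2 In,Out,Expansion Degraded
DETAILS
"""


def parse_channel_summary(text):
    """Return {channel_number: status} from the SUMMARY section of the stats output."""
    statuses = {}
    in_summary = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "SUMMARY":
            in_summary = True
            continue
        if stripped == "DETAILS":
            break  # the summary table ends where the per-channel details begin
        if in_summary:
            match = SUMMARY_ROW.match(line)
            if match:  # the header row has no leading number, so it is skipped
                statuses[int(match.group(1))] = match.group(3)
    return statuses


if __name__ == "__main__":
    for channel, status in sorted(parse_channel_summary(SAMPLE).items()):
        print(f"channel {channel}: {status}")
```

Running the same parse before and after clearing the counters gives a quick way to see whether a channel went back to optimal.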
To the untrained eye that looks bad, with over 1600 errors on module 1. Granted, that's over 2 years. I'm trying to implement a 2nd md3000i/md1000 but am having problems with the speed. I think I'll create another post for that one.
Anyways, are all those errors something I need to worry about? I haven't reset the stats yet.
Thanks again for all your help!
DELL-Daniel Ca
243 Posts
0
November 25th, 2015 13:00
Hey, Josh.
Yes, these are historical errors (acquired roughly over the life of the array). Nothing to worry about in the present. Definitely open a new post on that one, so we have a case for each issue. :)
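To put cumulative counters like these in proportion, a rough Python sketch that scales them by I/O count and by sample period may help. The numbers are taken from the Drive Channel 2, RAID Controller Module 1 block in the output above; the helper names are hypothetical, and the sketch says nothing about where the array's own error threshold sits.

```python
def error_rate(errors, total_io):
    """Errors as a fraction of total I/Os; 0.0 when no I/O was counted."""
    return errors / total_io if total_io else 0.0


def errors_per_day(errors, days, hours=0, minutes=0, seconds=0):
    """Average errors per day over a sample period given as days plus hh:mm:ss."""
    period_days = days + (hours * 3600 + minutes * 60 + seconds) / 86400
    return errors / period_days


if __name__ == "__main__":
    # Drive Channel 2, RAID Controller Module 1, from the posted output.
    controller_errors = 1655
    total_io = 82661965
    # Sample period: 598 days, 12:09:58
    rate = error_rate(controller_errors, total_io)
    per_day = errors_per_day(controller_errors, 598, 12, 9, 58)
    print(f"controller-detected errors per I/O: {rate:.2e}")
    print(f"controller-detected errors per day: {per_day:.2f}")
```

Once the stats are cleared, the same calculation on a fresh sample shows whether errors are still accumulating or were purely historical.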
Have a happy holiday!