MD3000i Controller Issues - Non Optimal Array
Recently my dual-controller MD3000i became non-optimal, due to what seems to be a battery issue in the controller in Slot 1. I took the controller offline, pulled it, and replaced the battery. The battery fault light went out and I could bring the controller online (and the virtual disks assigned to Controller 1 seemed to fall back to their owner); however, the MD3000i is still not optimal (the battery is still logging an issue).
I have tried another battery and another controller, but cannot get the MD3000i back to an optimal state. Furthermore, if I leave Controller 1 online, the MD3000i periodically turns the cooling fans up very high for a short time and then resumes normal operation (I normally never hear this except when the device is first starting). Also, if I leave Controller 1 online, within a relatively short time I run into issues with my host server that is connected via iSCSI to the MD3000i (this also happened after I moved all disk ownership to Controller 0). I am currently running with Controller 1 offline and, although I have no redundancy, I am operational.
However, I am not sure what to do next to resolve this issue. It seems like a hardware issue, but I doubt that both the controller and my "spare" controller are defective, so could it be the chassis? I have not yet tried stopping all I/O and powering off the MD3000i, nor have I tried swapping the controller in Slot 0 with the one in Slot 1 to see if the problem moves.
Any help or guidance would be appreciated! I do have a support bundle I could share...
Thanks!
jezmathers
14 Posts
0
December 11th, 2017 09:00
I forgot to include these errors, which appear frequently when the controller in slot 1 is online...
Date/Time: 12/9/17 8:01:39 AM
Sequence number: 54719
Event type: 2837
Event category: Internal
Priority: Informational
Description: Discrete lines diagnostic failure resolved
Event specific codes: 0/0/0
Component type: Interconnect-battery module pack
Component location: RAID Controller Module enclosure
Logged by: RAID Controller Module in slot 1
Raw data:
4d 45 4c 48 03 00 00 00 bf d5 00 00 00 00 00 00
37 28 40 02 b3 de 2b 5a 08 00 00 00 00 00 00 00
00 00 00 00 04 00 00 00 10 00 00 00 10 00 00 00
ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
01 00 00 00 02 00 01 01 08 00 00 00 04 00 02 00
00 00 00 00
Date/Time: 12/9/17 8:01:07 AM
Sequence number: 54718
Event type: 2836
Event category: Internal
Priority: Critical
Description: Discrete lines diagnostic failure
Event specific codes: 0/0/0
Component type: RAID Controller Module
Component location: RAID Controller Module in slot 1
Logged by: RAID Controller Module in slot 1
Raw data:
4d 45 4c 48 03 00 00 00 be d5 00 00 00 00 00 00
36 28 48 01 93 de 2b 5a 08 00 00 00 00 00 00 00
00 00 00 00 04 00 00 00 08 00 00 00 08 00 00 00
ff ff ff ff 01 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
01 00 00 00 02 00 01 01 08 00 00 00 04 00 02 00
01 00 00 00
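For anyone comparing the raw data with the decoded fields, here is a minimal Python sketch that reads the first two rows of the first event above. The offsets are only inferred from the values in this post (they are not taken from Dell documentation): bytes 0-3 look like an ASCII signature, bytes 8-11 a little-endian sequence number, bytes 16-17 the event type, and bytes 20-23 a Unix timestamp.

```python
import struct
from datetime import datetime, timezone

# First two rows of raw data from event 54719, copied from the log above.
RAW = """
4d 45 4c 48 03 00 00 00 bf d5 00 00 00 00 00 00
37 28 40 02 b3 de 2b 5a 08 00 00 00 00 00 00 00
"""


def decode_header(raw_hex: str) -> dict:
    """Decode a few header fields; the offsets are guesses based on this thread."""
    data = bytes.fromhex("".join(raw_hex.split()))
    signature = data[0:4].decode("ascii")
    (sequence,) = struct.unpack_from("<I", data, 8)     # little-endian 32-bit
    (event_type,) = struct.unpack_from("<H", data, 16)  # little-endian 16-bit
    (timestamp,) = struct.unpack_from("<I", data, 20)   # seconds since 1970 (UTC)
    return {
        "signature": signature,
        "sequence": sequence,
        "event_type": f"{event_type:04x}",
        "time_utc": datetime.fromtimestamp(timestamp, tz=timezone.utc),
    }


if __name__ == "__main__":
    for key, value in decode_header(RAW).items():
        print(f"{key}: {value}")
```

With these assumed offsets the fields come out as sequence 54719, event type 2837, and 13:01:39 UTC on 12/9/17, which lines up with the 8:01:39 AM shown in the log if the management station is at UTC-5.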
DELL-Sam L
Moderator
6.8K Posts
0
December 12th, 2017 08:00
Hello jezmathers,
When you replaced the battery, did you let it charge for 48 hours? I ask because a newly replaced battery can take that long to charge. In addition, once the battery is charged it will normally want to run a battery test. You can modify the battery settings by clicking the Tools tab → Change Battery Settings.
If you are still getting the error after the battery test has completed then we would need to look at a support bundle to see what is going on.
Please let us know if you have any other questions.
jezmathers
14 Posts
0
December 12th, 2017 13:00
Yes, I believe I did charge it for at least 48 hours. I do not have the "Change Battery Settings" option in my Modular Disk Storage Manager.
I have attached the storage bundle... Your help is appreciated!
1 Attachment
SupportBundle-12-11-2017.zip
DELL-Sam L
Moderator
6.8K Posts
0
December 13th, 2017 12:00
Hello jezmathers,
Thanks for the support bundle, as it helps to see what is going on. When you replaced the battery in controller 1, it did not reset. You will need to run the following command via SMcli.
reset storageArray batteryInstallDate [controller=(1)]
That should force the controller to reset the learn cycle and start a battery test. If it doesn't start the battery test, you can start it manually via SMcli.
set storageArray learnCycleDate (daysToNextLearnCycle=integer-literal | day=string-literal) time=HH:MM
Here is also a link to the SMcli guide in case you need it: http://downloads.dell.com/manuals/all-products/esuprt_ser_stor_net/esuprt_powervault/powervault-md3000i_reference%20guide2_en-us.pdf
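If you would rather script these calls than type them, a small Python sketch along the following lines could assemble the SMcli invocation. It assumes the usual form with `-n` for the storage array name, `-p` for the password, and `-c` for an inline script command; the array name and password below are placeholders, and the helper name `build_smcli_args` is made up for illustration.

```python
import subprocess  # only needed if you uncomment the run() call below


def build_smcli_args(array_name, password, script_command, executable="smcli"):
    """Return an argument list for one SMcli script command.

    Script commands are terminated with a semicolon, so one is appended
    if it is missing.
    """
    command = script_command.strip()
    if not command.endswith(";"):
        command += ";"
    return [executable, "-n", array_name, "-p", password, "-c", command]


if __name__ == "__main__":
    args = build_smcli_args(
        "MyArray",       # placeholder array name
        "not-a-real-pw",  # placeholder password
        "reset storageArray batteryInstallDate controller=1",
    )
    print(args)
    # Uncomment to actually run it on a management station with SMcli installed:
    # subprocess.run(args, check=True)
```

Passing the arguments as a list avoids shell quoting problems around the script command.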
Please let us know if you have any other questions.
jezmathers
14 Posts
0
December 15th, 2017 10:00
I put controller 1 back online and successfully ran the SMcli command to reset the battery install date.
C:\Program Files (x86)\Dell\MD Storage Manager\client>smcli -n Empire -p ****** -c "reset storageArray batteryinstallDate controller=1;"
Performing syntax check...
Syntax check complete.
Executing script...
Script execution complete.
SMcli completed successfully.
I did not see any evidence of the learn cycle starting, so I ran the second SMCLI command you suggested, but I got the following error:
C:\Program Files (x86)\Dell\MD Storage Manager\client>smcli -n Empire -p ****** -c "set storageArray learnCycleDate daystoNextLearnCycle=0;"
Performing syntax check...
Syntax check complete.
Executing script...
This storage array at line 1 is not SBD (Smart-Battery Data) capable and will not support the setting of learn cycles.
The command at line 1 that caused the error is:
set storageArray learnCycleDate daystoNextLearnCycle=0;
Script execution halted due to error.
SMcli failed.
The array is still non-optimal and I am seeing this error message in the array manager:
Storage array: Empire
Component reporting problem: Battery
Status: Unknown
Location: RAID Controller Module enclosure 0, RAID Controller Module in Slot 1
Component requiring service: RAID Controller Module 1
Service action (removal) allowed: No
Service action LED on component: No
And in the log, this error is repeating:
Date/Time: 12/15/17 11:45:02 AM
Sequence number: 54979
Event type: 2836
Description: Discrete lines diagnostic failure
Event specific codes: 0/0/0
Event category: Internal
Component type: RAID Controller Module
Component location: RAID Controller Module in slot 1
Logged by: RAID Controller Module in slot 1
Raw data:
4d 45 4c 48 03 00 00 00 c3 d6 00 00 00 00 00 00
36 28 48 01 0e fc 33 5a 08 00 00 00 00 00 00 00
00 00 00 00 04 00 00 00 08 00 00 00 08 00 00 00
ff ff ff ff 01 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
01 00 00 00 02 00 01 01 08 00 00 00 04 00 02 00
01 00 00 00
I have attached a new support bundle. Any other suggestions/advice would be most welcome.
Thank you!
1 Attachment
Support Bundle 12-15-17.zip
DELL-Sam L
Moderator
6.8K Posts
0
December 19th, 2017 11:00
Hello jezmathers,
Looking at the support bundle you took after running the commands, it looks like there is still an issue with your battery. When you replaced the battery, was it a new battery or a used one from another controller? I ask because when we see this issue we replace the battery, as it is not functioning as it is supposed to. According to the logs, your controller is not able to get any information on the battery's age, life span, or current charge status.
Please let us know if you have any other questions.
jezmathers
14 Posts
0
December 19th, 2017 11:00
I had a spare (refurbished) controller with a battery installed. I tried that controller with its installed battery; when that didn't work, I tried the battery from the spare controller in the old controller. When that didn't work either, I purchased a new battery (which is what I have installed in Controller 1 now). The battery is not Dell OEM, but is a new Zthy Tech battery (I have used these in the past with success; in fact, Controller 0 may have one of these installed). I could replace the controller/battery and try the SMcli commands again, if you think that is worth a try?
As no combination of controller and battery is solving my issue, I am concerned it is an issue with the enclosure! I have not had a chance to completely shut down I/O and power cycle the MD3000i; not sure if that could help?
What is my next move?
jezmathers
14 Posts
0
December 20th, 2017 13:00
I brought controller 1 online and ran the ping for 7 minutes without a single timeout or failure.
Thank you for your help... What is my next move?
Anonymous
274.2K Posts
0
December 20th, 2017 13:00
That controller could be stuck in a boot loop. Here is how we can test this. Please run a continuous ping against the management port of RAID controller 1.
It should look something like this:
ping 172.16.17.110 -t
If the controller is stuck in a loop, you will see responses to the ping for about 3 minutes, and then responses will fail for about a minute, and it will just repeat this pattern.
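If you log the ping results instead of watching them, the pattern described above can be checked with a short Python sketch like this one. It assumes you have already turned the ping output into a list of True/False values (one per ping, roughly one per second); the function names and the 30-second threshold are arbitrary choices for illustration, not anything from Dell tooling.

```python
from itertools import groupby


def outage_runs(replies, interval_s=1.0):
    """Return the length in seconds of each consecutive run of failed pings."""
    return [
        sum(1 for _ in group) * interval_s
        for ok, group in groupby(replies)
        if not ok
    ]


def looks_like_boot_loop(replies, interval_s=1.0, min_outage_s=30.0, min_outages=2):
    """True if there are repeated long outages, as a rebooting controller would cause."""
    long_outages = [r for r in outage_runs(replies, interval_s) if r >= min_outage_s]
    return len(long_outages) >= min_outages


if __name__ == "__main__":
    # Synthetic data: about 3 minutes of replies, then 1 minute of timeouts, repeated.
    looping = ([True] * 180 + [False] * 60) * 3
    healthy = [True] * 420  # 7 minutes of replies with no timeouts
    print(outage_runs(looping))
    print(looks_like_boot_loop(looping), looks_like_boot_loop(healthy))
```

A single dropped ping will not trip the check; only repeated outages of at least `min_outage_s` seconds count.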
Can you test this and report back to us?
Thanks
DELL-Daniel My
Moderator
6.2K Posts
0
December 21st, 2017 10:00
Hello
When you pulled the support bundle, was the spare controller inserted or the original? The battery is not detected properly by whichever controller was installed. Although possible, I think it is unlikely that this is a chassis or slot issue. This is likely an issue with either the battery or the controller.
According to the logs, the issues started on 11/17 when a normal learn cycle started on both controllers. Controller 1 failed the learn cycle and has been producing errors since. Without a functional battery, caching on the controller will be disabled. If controller 1 is brought into a redundant configuration with caching disabled, then caching will be disabled on both controllers. This will likely decrease performance.
I suggest testing with another battery or controller if possible. Whenever you insert a new battery, make sure to allow several hours (8+) for the battery to charge. It must meet a minimum charge threshold before it will start functioning correctly.
When you make changes you can pull a new support bundle and look at the statecapturedata.txt file to view the controller battery status. This is what the battery status looks like on controller 1 of your bundle. The status indicates that the controller is unable to provide information on the battery. The status is unknown.
Thanks
jezmathers
14 Posts
0
December 27th, 2017 06:00
I exchanged Controller 1 for my spare controller (& battery) on 12/22 around 1:30pm and brought the controller online. I grabbed a support bundle and this is what I see in statecapturedata.txt:
Battery [1]
Parent Ctlr = CTLR_B
Is Local = false
Parent CRU = CRU_2
BID Index = 0
CapabilityChking = true
Over Temp Count = 0
Install Time = 0x5A3CC927 12/22/2017 08:58:15
Warning Time = 0xFFFFFFFF
Expired Time = 0xFFFFFFFF
Current Status
Overall Sts = (0x0011) I2C Bus Err
Common Sts = (0x0001) Okay
Working Sts = (0x0042) I2CBusErr
Config Sts = (0x0060) AgeExpOff AgeWrnOff
Smart Sts = (0x0007) VrsnErr ChargeOk Smart
Learn Sts = (0x0002) NotReady
Bkup Mode = (0x0002) Disabled
I left the controller online (FYI, I had moved all my virtual disks to prefer Controller 0). At around 3:56am on 12/24, one of my two host servers (both Server 2008 R2, running as a Hyper-V cluster) crashed and rebooted (bugcheck 1001). It also appears that the other server lost connectivity with the SAN (I see an iScsiPrt error: Connection to Target lost).
I grabbed another support bundle (attached) and this is what I see in statecapturedata.txt:
Battery [1]
Parent Ctlr = CTLR_B
Is Local = false
Parent CRU = CRU_2
BID Index = 0
CapabilityChking = true
Over Temp Count = 0
Install Time = 0x5A3CC927 12/22/2017 08:58:15
Warning Time = 0xFFFFFFFF
Expired Time = 0xFFFFFFFF
Current Status
Overall Sts = (0x0011) I2C Bus Err
Common Sts = (0x0001) Okay
Working Sts = (0x0042) I2CBusErr
Config Sts = (0x0060) AgeExpOff AgeWrnOff
Smart Sts = (0x0007) VrsnErr ChargeOk Smart
Learn Sts = (0x0002) NotReady
Bkup Mode = (0x0002) Disabled
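To compare battery status between support bundles without reading every line by eye, the status lines above could be parsed with a small Python sketch like this. It assumes the `Name = (0xNNNN) text` layout shown in this post; the function names are made up for illustration, and the I2C check is just a text match on the flag names seen here.

```python
import re

# Status lines copied from the statecapturedata.txt excerpt above.
SAMPLE = """\
Overall Sts = (0x0011) I2C Bus Err
Common Sts = (0x0001) Okay
Working Sts = (0x0042) I2CBusErr
Config Sts = (0x0060) AgeExpOff AgeWrnOff
Smart Sts = (0x0007) VrsnErr ChargeOk Smart
Learn Sts = (0x0002) NotReady
Bkup Mode = (0x0002) Disabled
"""

STATUS_LINE = re.compile(
    r"^\s*(?P<name>[^=]+?)\s*=\s*\((?P<code>0x[0-9A-Fa-f]+)\)\s*(?P<text>.*)$"
)


def parse_status(block):
    """Map each status name to a (numeric code, flag text) tuple."""
    status = {}
    for line in block.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            status[match["name"]] = (int(match["code"], 16), match["text"].strip())
    return status


def has_i2c_error(status):
    """True if any status field mentions an I2C error."""
    return any("i2c" in text.lower() for _, text in status.values())


if __name__ == "__main__":
    parsed = parse_status(SAMPLE)
    for name, (code, text) in parsed.items():
        print(f"{name:12} 0x{code:04X} {text}")
    print("I2C error reported:", has_i2c_error(parsed))
```

Running `parse_status` on the block from each bundle and comparing the two dictionaries shows quickly whether anything changed between captures.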
I had my cluster crash on a previous occasion when I left Controller 1 Online (never had either server crash prior to this issue, or when running with Controller 1 Offline).
Is there anything else I can try? Thanks.
1 Attachment
Support Bundle 12-24-17.zip
DELL-Sam L
Moderator
6.8K Posts
0
January 2nd, 2018 10:00
Hello jezmathers,
Looking at the last support bundle that you supplied, I can see that the battery has been replaced. However, I am also seeing that the controller is still not able to provide any information about the battery charge status. I also looked in the state capture data, and the status still does not show the replacement battery.
When you bring RAID controller 1 online, does it fully boot up or not? Also, if you use the serial shell cable and watch the controller boot, does it present the start-of-day message?
Please let us know if you have any other questions.
jezmathers
14 Posts
0
January 23rd, 2018 05:00
Apologies for the delayed response (I had to find a service cable). I now have a terminal log from bringing Controller 1 online, but I no longer see the option to attach a file. Can I email it to you?
DELL-Sam L
Moderator
6.8K Posts
0
January 23rd, 2018 08:00
Hello jezmathers,
I will send you an email that you can reply to with your serial shell output so that we can review it to see what it says.
Please let us know if you have any other questions.