Troubleshooting Dell blade server power-on issues using Dell CMC and RACADM



Modular servers have many gating factors that determine whether a server can power on. When a server does not power on, the reason is not always easy to find. The following steps provide a methodical way to troubleshoot a Dell blade server that will not turn on.

  1. First, check the basics: make sure the AC power cords and other cables are properly connected.
  2. Make sure that the chassis is powered on.

    Command to check the chassis power status remotely via CMC RACADM:

    $ racadm getsysinfo -c

    Chassis Information:
    System Model = PowerEdge M1000e
    System AssetTag = 00000
    Service Tag = HLSG7R1
    Chassis Name = CMC-HLSG7R1
    Chassis Location = PT Lab Power Test
    Chassis Midplane Version = 1.1
    Power Status = OFF

    Command to power on the chassis remotely:

    $ racadm chassisaction powerup
    Module power operation successful.

    It can take about a minute for the chassis to power up after this command.

  3. Once the chassis is ON, check the status of iDRAC.

    Command to check the status of iDRAC:

    $ racadm getversion

    server-1 1.35.35 (Build 03) PowerEdgeM620 iDRAC7 Y
    server-3 iDRAC not ready
    switch-1 Dell PowerConnect M6348 A02
    switch-2 Dell PowerConnect M6220 A12
    cmc-1 4.30.X15.201210050401 Y
    cmc-2 4.30.X15.201210050401 Y

    $

    If the iDRAC is still not ready more than 3 minutes after the chassis powers on, try the virtual reseat in step 12.

  4. If the chassis is connected to a 110 VAC source, make sure the "Allow 110 VAC Operation" option is enabled.

    Command to check status of "Allow 110 VAC Operation" option:

    $ racadm getconfig -g cfgChassisPower -o cfgChassisAllow110VACOperation
    0
    $

    Command to enable "Allow 110 VAC Operation" option:

    $ racadm config -g cfgChassisPower -o cfgChassisAllow110VACOperation 1
    Object value modified successfully.
    $

  5. Make sure you have the latest compatible combination of firmware (CMC, iDRAC, BIOS, CPLD, and Lifecycle Controller) for the blade server model. To find the latest firmware versions, go to Dell's support page (support.dell.com/support) and enter the product's service tag. The Drivers and Downloads tab shows the latest firmware level for each component.

    Command to get the current firmware versions installed in the system:

    $ racadm getversion

    For BIOS version: $ racadm getversion -b
    For CPLD version: $ racadm getversion -c
    For USC version: $ racadm getversion -l

  6. If you recently changed the fabric and the server has not powered on since then, verify that there is no fabric mismatch in the blade server.

    If the DC1 and DC2 states of a server are both "OK" or "N/A", there is no fabric mismatch for that server. If either state is "Not OK" (invalid), there is a fabric mismatch. To fix it, remove or replace the mismatched mezzanine card or IOM.

    In the example below, server-1, server-3, and server-11 all have valid fabrics. Server-4 has a mismatched fabric B: its DC1 mezzanine card is Fibre Channel 16, while Group B is 10 GbE KR.

    Command to check if there is fabric mismatch:

    $ racadm getdcinfo
    Group A I/O Type : Gigabit Ethernet
    Group B I/O Type : 10 GbE KR
    Group C I/O Type : Fibre Channel 16

    switch-1 Gigabit Ethernet OK Master
    switch-2 Gigabit Ethernet OK Master
    switch-3 None N/A N/A
    switch-4 None N/A N/A
    switch-5 Fibre Channel 16 OK Master
    switch-6 None N/A N/A
    server-1 Present 10 GbE KR OK None N/A
    server-2 Not Present None N/A None N/A
    server-3 Present None N/A None N/A
    server-4 Present Fibre Channel 16 Not OK None N/A
    server-5 Not Present None N/A None N/A
    server-6 Not Present None N/A None N/A
    server-7 Not Present None N/A None N/A
    server-8 Not Present None N/A None N/A
    server-9 Not Present None N/A None N/A
    server-10 Not Present None N/A None N/A
    server-11 Present None N/A Fibre Channel 16 OK
    server-12 Not Present None N/A None N/A
    server-13 Not Present None N/A None N/A
    server-14 Not Present None N/A None N/A
    server-15 Not Present None N/A None N/A
    server-16 Not Present None N/A None N/A

    $

  7. After a brownout, if a blade server that was powered on before does not power on again automatically, the auto-recovery state is probably set to Off. The auto-recovery state can be set to On, Off, or Last in the BIOS setup (F2).
  8. To control blade server power remotely, you need the Chassis Control Administrator (Power Commands) privilege, the Server Administrator privilege, or an iDRAC administrator account.
  9. Make sure Maximum Power Conservation Mode (MPCM) is disabled.

    Command to check the status of MPCM:

    $ racadm getconfig -g cfgchassispower -o cfgchassismaxpowerconservationmode
    0
    $

    If it returns 0, MPCM is disabled. If it returns a timestamp, MPCM has been enabled since that time.

  10. Check the RAC log for messages about insufficient power. Replacing any failed PSUs may help.

    Command to check raclog:

    $ racadm getraclog

    Part of the output:

    --------------------------------------------------------------------------------
    SeqNumber = 73
    Message ID = USR8511
    Category = Audit
    AgentID = CMC
    Severity = Information
    Timestamp = 2013-01-11 23:18:34
    Message Arg 1 =
    Message Arg 2 = 192.168.0.100
    Message Arg 3 = root
    Message Arg 4 = GUI
    Message Arg 5 = 29179
    Message = Login success from 192.168.0.100 (username=root, type=GUI, sid=29179)
    --------------------------------------------------------------------------------
    SeqNumber = 72
    Message ID = USR8510
    Category = Audit
    AgentID = CMC
    Severity = Information
    Timestamp = 2013-01-11 23:17:17
    Message Arg 1 =
    Message Arg 2 = root
    Message Arg 3 = Serial
    Message Arg 4 = 6133
    Message = Login success (username=root, type=Serial, sid=6133)
    --------------------------------------------------------------------------------
    SeqNumber = 71
    Message ID = USR8506
    Category = Audit
    AgentID = CMC
    Severity = Information
    Timestamp = 2013-01-11 23:17:07
    Message Arg 1 = 41269
    Message = Session close succeeds: sid=41269
    --------------------------------------------------------------------------------

  11. If the status indicator on the front of the blade is amber, check the SEL and the Lifecycle Controller (LC) log for critical messages. In this scenario, the server turned on but was then turned off because of a hardware failure. Fix the issue by following the recommended action provided with the logged event.

    Command to check the log:

    $ racadm getsel

    Part of the output:

    Mon Feb 25 2013 13:04:04 Critical The power input for power supply 2 is lost.
    Mon Feb 25 2013 13:04:04 Critical Power supply redundancy is lost.

  12. If you cannot ping the iDRAC, perform a virtual reseat.

    Command to virtually reseat a blade:

    $ racadm serveraction -m server-n reseat -f
    Object value modified successfully
    $

    where n is the slot number of the blade.

  13. If a power redundancy policy such as AC redundancy is set, you may not be able to power on all highly configured servers. The Server Performance Over Redundancy feature sacrifices redundancy so that all servers can turn on, so try enabling it.

    Command to enable Server Performance Over Redundancy:

    $ racadm config -g cfgChassisPower -o cfgChassisPerformanceOverRedundancy 1
    Object value modified successfully
    $
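
The Power Status check in step 2 can also be scripted. The following is a minimal Python sketch, assuming the `racadm getsysinfo -c` output uses the `Key = Value` layout shown in step 2; the function names are hypothetical helpers, not part of RACADM.

```python
from typing import Optional


def chassis_power_status(getsysinfo_output: str) -> Optional[str]:
    """Return the 'Power Status' value (for example 'ON' or 'OFF'), or None if absent.

    Assumes lines of the form 'Power Status = OFF', as in the sample output.
    """
    for line in getsysinfo_output.splitlines():
        # Split on the first '=' only, so values containing '=' stay intact.
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "power status":
            return value.strip()
    return None


def chassis_needs_powerup(getsysinfo_output: str) -> bool:
    """True if the chassis reports that it is powered off."""
    status = chassis_power_status(getsysinfo_output)
    return status is not None and status.upper() == "OFF"
```

To feed it live data, the command output can be captured with, for example, `subprocess.run(["racadm", "getsysinfo", "-c"], capture_output=True, text=True).stdout`. If `chassis_needs_powerup` returns True, power on the chassis with `racadm chassisaction powerup` as described in step 2.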
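
Step 3 looks for slots that report "iDRAC not ready" in the `racadm getversion` output. A minimal Python sketch of that check, assuming each module line starts with its slot name (such as `server-3`) as in the sample; `idrac_not_ready` is a hypothetical helper name.

```python
from typing import List


def idrac_not_ready(getversion_output: str) -> List[str]:
    """Return the server slots whose getversion line reports 'iDRAC not ready'."""
    slots = []
    for line in getversion_output.splitlines():
        fields = line.split()
        # Only server lines are of interest; switch and CMC lines are skipped.
        if fields and fields[0].startswith("server-") and "not ready" in line.lower():
            slots.append(fields[0])
    return slots
```

Running this check periodically for a few minutes after chassis power-on, and investigating any slot that stays in the list, mirrors the 3-minute guidance in step 3.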
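
The fabric check in step 6 can be automated by scanning the server lines of `racadm getdcinfo` for DC states other than "OK" or "N/A". This is a minimal Python sketch, assuming the state vocabulary seen in the sample (`OK`, `Not OK`, `N/A`) plus "invalid" from the text; the regular expression and the `fabric_mismatches` name are illustrative, not a documented output format.

```python
import re
from typing import Dict, List

# Assumed state vocabulary; "Not OK" is listed before "OK" so it matches first.
_STATE = re.compile(r"\b(Not OK|Invalid|OK|N/A)\b", re.IGNORECASE)


def fabric_mismatches(getdcinfo_output: str) -> Dict[str, List[str]]:
    """Map each server slot with a bad DC1/DC2 state to the states found on its line."""
    bad: Dict[str, List[str]] = {}
    for line in getdcinfo_output.splitlines():
        fields = line.split()
        # Skip group headers, switch lines, and blank lines.
        if not fields or not fields[0].startswith("server-"):
            continue
        states = [m.group(1) for m in _STATE.finditer(line)]
        if any(s.upper() not in ("OK", "N/A") for s in states):
            bad[fields[0]] = states
    return bad
```

Any slot it returns points to a mezzanine card whose type does not match the I/O type of the corresponding fabric group.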
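
Step 9 reads a single value: 0 means MPCM is disabled, and a timestamp means it is enabled. A minimal Python sketch that builds the `getconfig` command and interprets the returned value; the helper names are hypothetical, and the sketch assumes the command prints only the value, as in the sample.

```python
from typing import List


def getconfig_args(group: str, obj: str) -> List[str]:
    """Build the argument list for 'racadm getconfig -g <group> -o <object>'."""
    return ["racadm", "getconfig", "-g", group, "-o", obj]


def mpcm_status(value: str) -> str:
    """Interpret the cfgChassisMaxPowerConservationMode value described in step 9."""
    v = value.strip()
    if not v:
        return "unknown"
    if v == "0":
        return "disabled"
    # Any other value is treated as the timestamp at which MPCM was enabled.
    return f"enabled since {v}"
```

For example, `mpcm_status(subprocess.run(getconfig_args("cfgChassisPower", "cfgChassisMaxPowerConservationMode"), capture_output=True, text=True).stdout)`. The same `getconfig_args` helper also fits the `cfgChassisAllow110VACOperation` check from step 4.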
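
The RAC log in step 10 is a series of `Key = Value` records separated by dashed lines. A minimal Python sketch that parses those records and keeps the ones whose `Message` mentions a keyword; the default keyword "power" is an assumption, because the exact wording of insufficient-power messages is not shown here.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def parse_raclog(output: str) -> List[Record]:
    """Split 'racadm getraclog' output into one dict per record."""
    records: List[Record] = []
    current: Record = {}
    for line in output.splitlines():
        if line.strip().startswith("---"):
            # A dashed separator line ends the current record.
            if current:
                records.append(current)
            current = {}
            continue
        key, sep, value = line.partition("=")
        if sep:
            # partition splits on the first '=', so messages containing '=' survive.
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def matching_records(records: List[Record], keywords: Iterable[str] = ("power",)) -> List[Record]:
    """Keep records whose Message contains any keyword (case-insensitive)."""
    words = [k.lower() for k in keywords]
    return [r for r in records if any(w in r.get("Message", "").lower() for w in words)]
```

Passing a different `keywords` tuple narrows or widens the search without changing the parser.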
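
For step 11, the critical entries can be pulled out of the `racadm getsel` output. A minimal Python sketch, assuming each entry sits on one line in the layout of the sample (weekday, month, day, year, time, severity, message); real SEL output may differ, so treat the field positions as an assumption. `critical_sel_entries` is a hypothetical helper name.

```python
from typing import List, Tuple


def critical_sel_entries(getsel_output: str) -> List[Tuple[str, str]]:
    """Return (timestamp, message) pairs for SEL lines whose severity is Critical."""
    entries = []
    for line in getsel_output.splitlines():
        # Five timestamp fields, one severity field, then the free-text message.
        parts = line.split(None, 6)
        if len(parts) == 7 and parts[5].lower() == "critical":
            entries.append((" ".join(parts[:5]), parts[6]))
    return entries
```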
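
The reseat command in step 12 takes the slot as `server-n`. A small Python sketch that builds the argument list with plain ASCII hyphens and rejects out-of-range slots; the 1 to 16 range is an assumption based on the 16 server slots listed in the `getdcinfo` sample, and `reseat_args` is a hypothetical helper.

```python
from typing import List

MAX_SLOT = 16  # assumption: 16 server slots, as listed in the getdcinfo sample


def reseat_args(slot: int) -> List[str]:
    """Build 'racadm serveraction -m server-<slot> reseat -f' as an argument list."""
    if not 1 <= slot <= MAX_SLOT:
        raise ValueError(f"slot must be between 1 and {MAX_SLOT}, got {slot}")
    return ["racadm", "serveraction", "-m", f"server-{slot}", "reseat", "-f"]
```

Pass the result to `subprocess.run(..., check=True)` only after double-checking the slot number, because a virtual reseat interrupts power to that blade.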




Article ID: SLN309871

Last Date Modified: 08/14/2018 04:22 AM

