Dell Unity: Storage Processor Shows as Rebooting in the User Interface but Not in the CLI, and No Fault LED Is Lit (User Correctable)

Summary: This article explains why the Unisphere user interface may show an SP as degraded and rebooting while the SP is actually in Normal Mode.


Symptoms

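A small form-factor pluggable (SFP) module is logged as missing once and then as good.
SFPs are known to go undetected if they have a small amount of dust on them or are not fully seated in the port.
This is a common contributor to degraded performance, because it causes persistent disconnects, and the port may even be treated as a slow-drain device.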

In the Unisphere user interface, under "SYSTEM" > "Service" > "Service Tasks", the storage processor is shown as rebooting and in a degraded state.


However, in an SSH terminal using the CLI, both SPs are in Normal Mode.
So far, this issue has been seen in Unity OE version 4.5.1.0.5.001.

Example:

service@CKMxxxxxxxx spa:~/user# svc_diag
======== Now executing basic state ========
* System Serial Number is: CKMxxxxxxx
* System Model Number is: Unity 500
* System Friendly Host Name is: CKMxxxxxxxx
* Current Software version: c4dev_PIE_3786R-4.5.1.0.5.001.1552025209-GNOSIS_RETAIL
* Unisphere IP address(es): xx.xxx.xxx.xx xxxx::xxx:xxxx:xxxx:xxxx
* SSH Enabled: true
* FIPS mode: Disabled
* Boot Mode: Normal Mode
* Post Faults:  0x0000
* Backend Faults:       0x0000
* Boot Faults:  0x0000
* Rescue Reason:        0x0000
* Rescue reason for code 0x0000 - No faults detected.
* SP Service Hint Code: <None>
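
The Boot Mode line is the part of this output that contradicts the user interface. As a minimal sketch, the check can be scripted by filtering the svc_diag text; the function name boot_mode_is_normal is hypothetical, and the match string is taken from the sample output above.

```shell
# Hypothetical helper: succeeds only when the svc_diag basic-state
# text supplied on stdin reports "Boot Mode: Normal Mode".
boot_mode_is_normal() {
    grep -q 'Boot Mode: Normal Mode'
}
```

Assuming svc_diag prints the basic state as shown above, it could be used as: svc_diag | boot_mode_is_normal && echo "SP is in Normal Mode". Run it on each SP, since the output describes only the SP you are logged in to.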

Cause

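This condition occurs when a new I/O module is installed.
The commit did not complete because of a non-optimal SFP, so health-related operations were temporarily disabled (similar to what happens during an upgrade).
Because health polling is disabled, the system cannot determine the correct state of the storage processor and reports the last known state, "Rebooting".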

To confirm that this is the same issue, check the following logs:
/var/tmp/ptm/ptm.log
/EMC/C4Core/log/c4_safe_ktrace.log

This information can be viewed live by running the commands in an SSH terminal, or found in the logs of a triaged service data collection:

Command/Log #1:
cat /var/tmp/ptm/ptm.log

Expected output:
=====================================Tasks=====================================
10:56 [ 16/22 ]  Core reboot sp if required (local)                  10 minutes
Start at: Thu May 23 10:56:19 2019
Complete at: Thu May 23 10:56:19 2019
===============================================================================
10:56 [ 17/22 ]  Core start c4 (local)                                5 minutes
Start at: Thu May 23 10:56:19 2019
Task Manager was terminated unexpectedly with signal <TERM>
.... <there might be a few extra lines here > ....
Previous failure detected. Not auto-restarting.
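
The two lines that identify the interrupted commit can also be searched for directly instead of reading the whole log. The following is a sketch only: the function name ptm_commit_interrupted is hypothetical, and the two message strings are copied from the sample output above, so other OE versions may word them differently.

```shell
# Hypothetical helper: prints the lines showing that the task manager
# was interrupted and returns success if at least one was found.
# Usage: ptm_commit_interrupted [path-to-ptm.log]
ptm_commit_interrupted() {
    log="${1:-/var/tmp/ptm/ptm.log}"
    grep -E 'Task Manager was terminated unexpectedly|Previous failure detected' "$log"
}
```

A non-zero exit status means neither message is present, in which case the symptoms may have a different cause.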

Command/Log #2:
less /EMC/C4Core/log/c4_safe_ktrace.log

Then look for events related to the SFP or the mezzanine card.
We can see that some errors occurred when the new I/O module was installed:
c4_safe_ktrace   INFO OBJ 3 RP:MEZZ(SP: 0, Slot: 0): fbe_base_env_send_resume_prom_read_async_cmd entry.
c4_safe_ktrace   INFO OBJ 3 RP:MEZZ(SP: 0, Slot: 0): Read async completed, workItem 0x7f2486432760, resumeStatus DEVICE_NOT_VALID_FOR_PLATFO
c4_safe_ktrace   INFO OBJ 3 100C0 : ModMgmt: CLEAR enclFaultLedReason Mezzanine RP Fault. <<<====== Mezzanine Resume PROM (RP) fault reason
..........
c4_safe_ktrace   INFO OBJ 3 100C0 : fbe_module_mgmt_check_module_state, SPB Mezzanine 0, state:ENABLED, substate:GOOD
c4_safe_ktrace   INFO OBJ 3 100C0 : fbe_module_mgmt_check_port_state Setting SPB Mezzanine 0, Port 0, state ENABLED, substate GOOD
c4_safe_ktrace   INFO OBJ 3 100C0 : fbe_module_mgmt_check_port_state Setting SPB Mezzanine 0, Port 1, state MISSING, substate MISS_SFP <<<=== SFP not detected 
c4_safe_ktrace   INFO OBJ 3 100C0 : fbe_module_mgmt_check_port_state Setting SPB Mezzanine 0, Port 2, state ENABLED, substate GOOD
c4_safe_ktrace   INFO OBJ 3 100C0 : fbe_module_mgmt_check_port_state Setting SPB Mezzanine 0, Port 3, state ENABLED, substate GOOD
c4_safe_ktrace   INFO OBJ 3 100C0 : fbe_module_mgmt_check_port_state Setting SPB Mezzanine 0, Port 4, state ENABLED, substate GOOD
c4_safe_ktrace   INFO OBJ 3 100C0 : fbe_module_mgmt_check_port_state Setting SPB Mezzanine 0, Port 5, state ENABLED, substate GOOD
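
Because c4_safe_ktrace.log is large, it can help to list only the ports that were ever reported with a missing SFP. This is a sketch under the assumption that the port-state lines keep the format shown above; the function name list_missing_sfp is hypothetical.

```shell
# Hypothetical helper: lists each SP/module/port that was logged with
# substate MISS_SFP, one entry per line, without duplicates.
# Usage: list_missing_sfp [path-to-c4_safe_ktrace.log]
list_missing_sfp() {
    log="${1:-/EMC/C4Core/log/c4_safe_ktrace.log}"
    grep 'substate MISS_SFP' "$log" |
        sed -E 's/.*Setting (SP[AB] [^,]*, Port [0-9]+),.*/\1/' |
        sort -u
}
```

For the sample lines above, this prints "SPB Mezzanine 0, Port 1". Any port listed is a candidate for cleaning or reseating its SFP before the recommit.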

Resolution

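To resolve this issue, recommit the I/O module that originally failed, using the commands listed below.
Note: These commands do not require root, but they do require a healthy array. Before running them, confirm that the array is fully healthy, as follows: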

Command #1:

uemcli -no /sys/general healthcheck -output csv -detail

Sample output:
#1 (Not optimal) - Do not proceed until the errors shown are resolved.
"Error code"
"Warning: One or more asynchronous replication sessions, or one or more NAS Server or file system synchronous replication sessions, exist. This could cause problems during upgrade. Pause the replication sessions on the production array prior to starting the upgrade and resume them after completing the upgrade. [Warning Code: platform::check_replication_health_4]"
"Warning: One or more NAS servers may not be in a healthy state. You can continue with the upgrade, but it is recommended that you record the error code and contact your service provider. [Warning Code: dm::check_nas_servers_health_3]"
Operation completed successfully.

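Note: These messages mention "upgrade" because this is the command that is run before a non-disruptive upgrade (NDU). They are relevant here because the array (both SPs) must reboot.
Command #2 may also require reboots, which is why it is important that this health check passes without any [Error Code:].
A [Warning] can be ignored, but Command #2 then prompts with the message "Do you still want to continue," to which you can answer "yes". However, Dell Support recommends resolving all warnings and errors from the health check before continuing.
 
To reboot a storage processor, follow the knowledge base article Dell Unity: How to Reboot a Storage Processor (User Correctable).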

#2 (Optimal) - You can proceed to Command #2.
"Error code"
Operation completed successfully.
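
The decision between the two sample outputs can be sketched as a small filter that counts error and warning codes in the health check output. The function name healthcheck_gate is hypothetical. The "[Warning Code:" tag appears in the sample above; the "[Error Code:" tag is assumed to follow the same pattern.

```shell
# Hypothetical helper: reads health check output on stdin, prints the
# number of error and warning codes, and succeeds only with zero errors.
healthcheck_gate() {
    out=$(cat)
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the count without failing the function.
    errors=$(printf '%s\n' "$out" | grep -c '\[Error Code:' || true)
    warnings=$(printf '%s\n' "$out" | grep -c '\[Warning Code:' || true)
    echo "errors=$errors warnings=$warnings"
    [ "$errors" -eq 0 ]
}
```

It could be used as: uemcli -no /sys/general healthcheck -output csv -detail | healthcheck_gate. A successful exit with warnings=0 corresponds to sample #2; any warnings should still be reviewed, as recommended above.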


Command #2:

svc_change_hw_config -e

Expected output:
service@CKMxxxxxxxx spa:~/user# svc_change_hw_config -e
Checking if both SPs are in Normal mode...OK
INFO:    Beginning eSLIC or CNA Hardware Upgrade...
WARNING: This operation will cause several reboots to occur on the Storage Processors.
WARNING: Do NOT proceed further if the user is unaware of this downtime!
==============================System Information===============================
Task Manager Command:            /opt/ptm/task_mgr.pl
Starts at:                       Sat Oct  5 10:03:47 2019
Dual SP:                         Yes
SP:                              b
Platform:                        OBERON
Original Primary:                Yes
Model:                           Unity xxx
Serial Number:                   xxxxxxxxxxxxx
Total number of attempts:        0
===============================================================================

==========================Time Estimate for All Tasks==========================
Task name [ 22 tasks in total ]                      Estimated     Status
                                                           Time(Minutes)
  1         Slic wait for system ready slic (local)         3
  2      Core run pre upgrade health checks (local)         2
  3         ESLIC check eslic configuration (local)         1
  4                  Core enable auto start (local)         0
  5                Core clear boot counters (local)         0
  6               Core clear boot counters (remote)         0
  7                 Core force vdms off sp (remote)         2
  8                  ESLIC set esp boolean (remote)         1
  9                 Core disable quickboot (remote)         1
 10         Core reboot peer sp if required (local)        10
 11                          Core start c4 (remote)         5
 12              Core wait for system ready on peer         3
 13                  Core force vdms off sp (local)         2
 14                   ESLIC set esp boolean (local)         1
 15                  Core disable quickboot (local)         1
 16              Core reboot sp if required (local)        10
 17                           Core start c4 (local)         5
 18              Core wait for system ready (local)         3
 19         ESLIC final configuration check (local)         1
 20                           Core clean up (local)         0
 21                      Core clean up peer (local)         0
 22                 Core disable auto start (local)         0

===============================================================================

=========================Estimated Time for Services ==========================
Current Time:                                      10:03
Estimated Time when eSLIC will be complete:      10:52
===============================================================================
Do you wish to continue [ yes or no ]? >
After you type "yes" and press Return, you should see the following output:
=====================================Tasks=====================================
20:41 [ 17/22 ]  Core start c4 (local)                                5 minutes
===============================================================================
20:41 [ 18/22 ]  Core wait for system ready (local)                   3 minutes
===============================================================================
20:41 [ 19/22 ]  ESLIC final configuration check (local)             30 seconds
===============================================================================
20:41 [ 20/22 ]  Core clean up (local)                                5 seconds
===============================================================================
20:41 [ 21/22 ]  Core clean up peer (local)                           5 seconds
===============================================================================
20:41 [ 22/22 ]  Core disable auto start (local)                      5 seconds
===============================================================================
===================================SUMMARY=====================================
Status:                   Success
Actual Time Spent:        16452 minutes
Total Number of attempts: 1
Log File:                 /var/tmp/ptm/ptm.log
=====================================END=======================================
These events are also logged in /EMC/backend/log_shared/EMCSystemLogFile.log
 
Platform_Basic      30018 [NOTICE] Audit: Service user executed the following service script command: svc_change_hw_config -e
IOModule            30010 [INFO] User: Starting the hardware configuration commit operation
Platform_Basic      30018 [NOTICE] Audit: Service user executed the following service script command: svc_dc -pbc udoctor
IOModule            30014 [INFO] User: Completed task <17> of <22> (Restarting services)
IOModule            30014 [INFO] User: Completed task <18> of <22> (Waiting for system ready state)
IOModule            30014 [INFO] User: Completed task <19> of <22> (Checking if upgrade complete)
IOModule            30014 [INFO] User: Completed task <20> of <22> (Cleaning up)
IOModule            30014 [INFO] User: Completed task <21> of <22> (Cleaning up)
IOModule            30014 [INFO] User: Completed task <22> of <22> (Disabling automatic restart)
IOModule            30011 [NOTICE] User: The hardware configuration has been successfully committed
Health              6044f [INFO] User: Storage Processor SP A is operating normally
Health              6044f [INFO] User: Storage Processor SP B is operating normally
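
To confirm the commit without reading the whole system log, the success message can be searched for directly. This is a sketch only: the function name commit_confirmed is hypothetical, and the message text is copied from the sample entries above.

```shell
# Hypothetical helper: succeeds if the system log contains the message
# that the hardware configuration commit completed.
# Usage: commit_confirmed [path-to-EMCSystemLogFile.log]
commit_confirmed() {
    log="${1:-/EMC/backend/log_shared/EMCSystemLogFile.log}"
    grep -q 'The hardware configuration has been successfully committed' "$log"
}
```

For example: commit_confirmed && echo "Commit recorded in system log". Because the log may contain entries from earlier commits, check the timestamp of the matching entry as well.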

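Once the output above is displayed, refresh the Unisphere user interface and check whether the status has changed back to normal (expected).
If it has not, contact Dell Technical Support and reference this article.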


Note: For more information about this command, see the document Dell EMC Unity™ Family Service Commands Technical Notes, available at https://www.dell.com/support/home/en-us

 

Additional Information

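Note: A bad coin cell battery can also cause the UI to falsely report that an SP is rebooting.
See Dell Unity knowledge base article 000069296: Coin Cell Battery on the Storage Processor (Dell Correctable).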

Affected Products

Dell Unity 300, Dell EMC Unity Family

Products

Dell EMC Unity 300F, Dell EMC Unity 350F, Dell EMC Unity 400, Dell EMC Unity 400F, Dell EMC Unity 450F, Dell EMC Unity 500, Dell EMC Unity 500F, Dell EMC Unity 550F, Dell EMC Unity 600, Dell EMC Unity 600F, Dell EMC Unity 650F , Dell EMC Unity Family |Dell EMC Unity All Flash, Dell EMC Unity Family, Dell EMC Unity Hybrid ...
Article Properties
Article Number: 000056107
Article Type: Solution
Last Modified: 04 Dec 2025
Version:  6