o17Uu33DCF12520
4 Tellurium

[Share] How to determine the cause of an unexpected Data Domain power-off

How to determine the cause of an unexpected power-off


Purpose
To help determine the cause of an unexpected power loss while recovering a Data Domain system

Applies to

All Data Domain systems
All software versions

Symptoms
The system reboots unexpectedly
The system powers off
The system is unavailable

Solution
Review the following logs for messages related to a power-off or reboot:
bios.txt
kern.info
sms.info
messages.engineering

An operator-initiated shutdown is recorded in sms.info, kern.info, or messages.engineering:
Jan 23 11:52:04 dd243a -ddsh: NOTICE: MSG-DDSH-00009: (tty=pts/0, session=21415)
sysadmin: command "system poweroff"
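
To pick operator shutdowns out of a large log, a regular expression over the record shown above is one option. This is only a minimal sketch: the pattern is inferred from that single sample, the function name `find_operator_shutdowns` is made up for illustration, and matching "reboot" in addition to "poweroff" is an assumption rather than something the sample shows.

```python
import re

# Assumed layout, based only on the sample record above: session details in
# parentheses, then the user name, then the quoted command.
DDSH_COMMAND = re.compile(
    r'MSG-DDSH-\d+:\s*\(tty=(?P<tty>[^,]+),\s*session=(?P<session>\d+)\)\s*'
    r'(?P<user>\S+):\s*command\s+"(?P<command>[^"]*)"'
)


def find_operator_shutdowns(log_text):
    """Return (user, command) pairs for poweroff/reboot commands in log_text."""
    # Collapse whitespace so a record wrapped across two lines still matches.
    flat = " ".join(log_text.split())
    hits = []
    for match in DDSH_COMMAND.finditer(flat):
        command = match.group("command")
        # "reboot" is an assumed keyword; the sample only shows "poweroff".
        if "poweroff" in command or "reboot" in command:
            hits.append((match.group("user"), command))
    return hits
```

Feed it the contents of sms.info, kern.info, or messages.engineering as a single string; an empty list means no operator command matched the assumed pattern.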


A watchdog reboot appears in kern.info and bios.txt:
Mon Jan 28 06:00:01 PST 2013 : Am alive message to bios.txt
   1 | 01/29/2013 | 07:59:08 | Watchdog 2 Watchdog | Timer interrupt | Asserted
   2 | 01/29/2013 | 07:59:37 | OS Stop/Shutdown #0x41 | Run-time critical stop | Asserted
   3 | Linux kernel panic: Aiee, killi
   4 | Linux kernel panic: ng interrup
   5 | Linux kernel panic: t handler!
   6 | 01/29/2013 | 08:05:54 | Watchdog 2 Watchdog | Hard reset | Asserted
Tue Jan 29 06:00:01 PST 2013 : Am alive message to bios.txt
Jan 29 00:48:02 datadomain1 autosupport: AUTOSUPPORT: System rebooted
Jan 29 00:50:06 datadomain1 IPMI watchdog: ===== System was rebooted by watchdog =====


Processor CAT error or Thermal Trip messages appear in messages.engineering or bios.txt:
Jan 29 17:50:59 rbantv005 sysmon: INFO: Event posted: 30: EVT-ENVIRONMENT-00011: CATERR:
Event Assert EVT-OBJ::Enclosure=1:Sensor Id(0x)=68
Jan 21 13:20:45 lsvdd670 sysmon: WARNING: Warning: Alert Id=3 Sensor 0x7a (CAT Err) Thermal Trip


Mon Jan 21 06:00:01 PST 2013 : Am alive message to bios.txt
1 | 01/21/2013 | 20:43:40 | Processor CAT ERR | State Asserted
2 | 01/21/2013 | 20:57:50 | Watchdog 2 WatchDog2 | Timer interrupt | Asserted
3 | 01/21/2013 | 21:02:06 | Watchdog 2 WatchDog2 | Hard reset | Asserted
4 | 01/21/2013 | 21:02:11 | Processor CAT ERR | State Asserted
5 | 01/21/2013 | 21:02:17 | Unknown #0x81 |
6 | 01/21/2013 | 21:13:22 | System Event #0x85 | OEM System boot event | Asserted
1 | 01/15/2013 | 01:07:48 | Processor Processor | Thermal Trip | Asserted
2 | 01/15/2013 | 01:07:54 | Power Unit Power Unit | Power off/down | Asserted
3 | 01/15/2013 | 05:30:23 | Button Button | Power Button pressed | Asserted
4 | 01/15/2013 | 05:30:23 | Button Button | Power Button pressed | Deasserted
1 | 09/13/2012 | 06:16:31 | Power Unit Power Unit | Power off/down | Asserted
2 | 01/16/2013 | 04:33:36 | Processor Processor | Presence detected | Asserted
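
The pipe-delimited bios.txt entries above are easy to split into fields when there are many of them to go through. The following is a rough sketch only: the six-column layout (index, date, time, sensor, event, state) is inferred from the samples in this post, and the names `parse_sel_line`, `interesting_events`, and the `KEYWORDS` list are made up for illustration.

```python
# Keywords taken from the sample entries in this post; extend as needed.
KEYWORDS = ("Thermal Trip", "CAT ERR", "Hard reset", "Timer interrupt",
            "Power off/down", "AC lost", "Power Button pressed")


def parse_sel_line(line):
    """Split one pipe-delimited bios.txt event line into a dict, or return None."""
    parts = [p.strip() for p in line.split("|")]
    # Skip lines without an index, date, time and sensor (for example the
    # "Am alive" lines and the wrapped kernel panic fragments).
    if len(parts) < 4 or not parts[0].isdigit():
        return None
    # Some entries have fewer than six columns; pad the missing ones.
    parts += [""] * (6 - len(parts))
    index, date, time, sensor, event, state = parts[:6]
    return {"index": int(index), "date": date, "time": time,
            "sensor": sensor, "event": event, "state": state}


def interesting_events(text):
    """Return parsed entries whose sensor or event mentions one of KEYWORDS."""
    events = []
    for line in text.splitlines():
        entry = parse_sel_line(line)
        if entry is None:
            continue
        description = f'{entry["sensor"]} {entry["event"]}'.lower()
        if any(keyword.lower() in description for keyword in KEYWORDS):
            events.append(entry)
    return events
```

Sorting the returned entries by date and time makes the sequence easier to read, for example a Thermal Trip followed a few seconds later by a Power off/down.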

messages.engineering shows disk or chassis over-temperature:
Jul  8 11:12:47 dd860-54 emsmon: WARNING: EMS: ****System will shutdown due to event-id =
EVT-STORAGE-00004, event-name = DiskTemperatureShutdown, event-msg =
Disk temperatures are dangerously high. System is shutting down.EVT-INFO::Threshold(C)=0:
Temperature(C)=36****
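
To compare the reported temperature against the threshold without reading the message by eye, the record above can be parsed roughly as follows. This is a sketch under assumptions: the field order is taken from that one sample, and the names `EMS_SHUTDOWN` and `parse_ems_shutdown` are hypothetical.

```python
import re

# Pattern inferred from the single sample record above; other events may
# format their EVT-INFO fields differently.
EMS_SHUTDOWN = re.compile(
    r"event-id\s*=\s*(?P<event_id>[A-Z0-9-]+).*?"
    r"event-name\s*=\s*(?P<event_name>\w+).*?"
    r"Threshold\(C\)=(?P<threshold>-?\d+):\s*"
    r"Temperature\(C\)=(?P<temperature>-?\d+)"
)


def parse_ems_shutdown(record):
    """Extract event id, event name, threshold and temperature, or return None."""
    # Collapse whitespace so a record wrapped across lines still matches.
    flat = " ".join(record.split())
    match = EMS_SHUTDOWN.search(flat)
    if match is None:
        return None
    return {
        "event_id": match.group("event_id"),
        "event_name": match.group("event_name"),
        "threshold_c": int(match.group("threshold")),
        "temperature_c": int(match.group("temperature")),
    }
```

For the sample above this gives a threshold of 0 and a temperature of 36, which is worth passing along when the environment is checked.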


A system panic appears in kern.info, ddfs.info, or messages.engineering:

Jan 28 09:49:09 usnyczdedup06 ddfs[6792]: ERROR: MSG-INTRNL-00001: PANIC: rpc/dd_rpc_split.c:
svc_process_thread: 526: Async rpc taking too long (619 secs).


bios.txt shows AC power loss and power button presses:

2 | 01/15/2013 | 01:07:54 | Power Unit Power Unit | Power off/down | Asserted
3 | 01/15/2013 | 05:30:23 | Button Button | Power Button pressed | Asserted
5 | 05/04/2011 | 17:30:06 | Power Unit Power Unit | Power off/down | Asserted
1 | 05/16/2011 | 01:02:51 | Power Unit Power Unit | AC lost | Asserted


ASUP alerts or messages.engineering show a shutdown caused by RAID degradation:
Volume dg0 has been degraded for over 1 hour.
Jan 24 00:19:50 abs0dd03 sysmon: WARNING: MSG-EMS-00001: EMS: Volume dg0 has been degraded for over



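For AC power loss, a power button press, a temperature shutdown, or an operator shutdown, analyze and verify the environment.
For Thermal Trip, CAT errors, watchdog reboots, RAID errors, or other DDR-related problems, upload a support bundle (SUB) and open a support case.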
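
The two rules above can be written down as a small lookup table, which helps when triaging many log lines at once. This is only a sketch: the substrings in `RULES` are taken from the sample messages in this post, plain case-insensitive substring matching is an assumption, and `classify` is a made-up helper name.

```python
VERIFY_ENVIRONMENT = "analyze and verify the environment"
OPEN_CASE = "upload a support bundle and open a support case"

# (substring, cause, next step). Substrings come from the sample messages in
# this post and are assumptions, not an exhaustive list.
RULES = [
    ("AC lost", "AC power loss", VERIFY_ENVIRONMENT),
    ("Power Button pressed", "power button press", VERIFY_ENVIRONMENT),
    ("DiskTemperatureShutdown", "temperature shutdown", VERIFY_ENVIRONMENT),
    ('command "system poweroff"', "operator shutdown", VERIFY_ENVIRONMENT),
    ("Thermal Trip", "Thermal Trip", OPEN_CASE),
    ("CAT ERR", "CAT error", OPEN_CASE),
    ("CATERR", "CAT error", OPEN_CASE),
    ("watchdog", "watchdog reboot", OPEN_CASE),
    ("has been degraded", "RAID degraded", OPEN_CASE),
]


def classify(line):
    """Return every (cause, next step) pair whose substring appears in line."""
    lowered = line.lower()
    return [(cause, step) for pattern, cause, step in RULES
            if pattern.lower() in lowered]
```

A line can match more than one rule (the "CAT Err ... Thermal Trip" warning does), so the function returns a list instead of a single answer.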

1 Reply
liulei_it
5 Tungsten

Re: [Share] How to determine the cause of an unexpected Data Domain power-off

Folks, when you set up power, be sure to use two feeds: one from regular utility power and one from the UPS in the machine room.
