Isilon: OneFS: How to Interpret Software Watchdog Errors

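Summary: The software watchdog is a process that monitors the kernel and either prints a stack or reboots the node when the node becomes unresponsive. This protects the cluster from the symptoms of severe CPU starvation and helps Dell Technical Support identify and correct the underlying problem.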


Instructions

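Introduction

This knowledge base article describes how to read and interpret the stacks created by the swatchdog process. The software watchdog is also referred to as the watchdog or swatchdog.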

 

Details

Occasionally, a node writes a stack to the /var/log/messages file or reboots itself and reports an error similar to the following:

**********************************************
Software Watchdog failed (userspace is starved!)
**********************************************

**********************************************
Software Watchdog failed on CPU 0 (6353: kt: gmp-split [-])
0x80bda7b9 -> 0x80bda5dc (fp=0xf734bb78): lk_fail_create_entry_and_owner
0x80bbe950 -> 0x80bbe7e0 (fp=0xf734bbf0): lkf_group_change_save_locks
0x80aa251c -> 0x80aa2268 (fp=0xf734bc2c): rtxn_sync_locks_prepare
0x80aa447d -> 0x80aa4304 (fp=0xf734bcdc): rtxn_split
0x80aac9cf -> 0x80aac8ec (fp=0xf734bcfc): kt_main
0x802a9d43 -> 0x802a9ca8 (fp=0xf734bd14): fork_exit

intr counts:
irq3: 1382 irq4: 1164845 irq14: 19331 irq17: 10672321 irq18: 11 stray: 1 irq24: 22011026 irq48: 46902637
**********************************************

panic @ time 1257444527.664: Software watchdog timed out

Stack: -------------------------------------------------

0x802e24f0 -> 0x802e24e4 (fp=0xf734ba78): isi_swatchdog_panic
0x802e27d7 -> 0x802e26ac (fp=0xf734ba8c): isi_swatchdog_hardclock
0x80295187 -> 0x80295068 (fp=0xf734bab0): hardclock_process
0x802951ba -> 0x802951a8 (fp=0xf734bac4): hardclock
0x8041d608 -> 0x8041d5b8 (fp=0xf734bad4): lapic_handle_timer
0x804281c3 -> 0x804281a4 (fp=0xf734bb78): bcmp
0x80bbe950 -> 0x80bbe7e0 (fp=0xf734bbf0): lkf_group_change_save_locks
0x80aa251c -> 0x80aa2268 (fp=0xf734bc2c): rtxn_sync_locks_prepare
0x80aa447d -> 0x80aa4304 (fp=0xf734bcdc): rtxn_split
0x80aac9cf -> 0x80aac8ec (fp=0xf734bcfc): kt_main
0x802a9d43 -> 0x802a9ca8 (fp=0xf734bd14): fork_exit

---------------------------------------------------------
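
When reviewing output like the example above, it can help to pull the frames out of the log programmatically. The following is a minimal Python sketch, not a Dell tool. It assumes the line layouts shown in the example, and the function name `parse_watchdog_lines` and the field names `addr_a`, `addr_b`, and `context` are hypothetical labels chosen for illustration. It also assumes the value after `panic @ time` is a Unix epoch timestamp in seconds, which the article does not state.

```python
import re
from datetime import datetime, timezone

# Patterns modeled on the sample output in this article (assumed layout).
HEADER_RE = re.compile(r"Software Watchdog failed on CPU (\d+) \((.*)\)$")
FRAME_RE = re.compile(
    r"(0x[0-9a-fA-F]+) -> (0x[0-9a-fA-F]+) \(fp=(0x[0-9a-fA-F]+)\): (\S+)$"
)
PANIC_RE = re.compile(r"panic @ time (\d+(?:\.\d+)?): (.*)$")


def parse_watchdog_lines(lines):
    """Sort log lines into CPU headers, stack frames, and panic records."""
    result = {"headers": [], "frames": [], "panics": []}
    for raw in lines:
        line = raw.strip()
        m = HEADER_RE.search(line)
        if m:
            # The parenthesized text is kept verbatim; its meaning is not
            # defined in this article.
            result["headers"].append({"cpu": int(m[1]), "context": m[2]})
            continue
        m = FRAME_RE.search(line)
        if m:
            # addr_a / addr_b are neutral names for the two addresses.
            result["frames"].append(
                {"addr_a": m[1], "addr_b": m[2], "fp": m[3], "symbol": m[4]}
            )
            continue
        m = PANIC_RE.search(line)
        if m:
            # Assumption: the number is seconds since the Unix epoch.
            when = datetime.fromtimestamp(float(m[1]), tz=timezone.utc)
            result["panics"].append({"utc": when, "message": m[2]})
    return result


if __name__ == "__main__":
    sample = [
        "Software Watchdog failed on CPU 0 (6353: kt: gmp-split [-])",
        "0x80aa447d -> 0x80aa4304 (fp=0xf734bcdc): rtxn_split",
        "0x80aac9cf -> 0x80aac8ec (fp=0xf734bcfc): kt_main",
        "panic @ time 1257444527.664: Software watchdog timed out",
    ]
    parsed = parse_watchdog_lines(sample)
    print([frame["symbol"] for frame in parsed["frames"]])
    print(parsed["panics"][0]["utc"].isoformat())
```

The patterns use `search` rather than `match` so that any prefix a logging daemon adds in front of each line is tolerated. Listing the `symbol` values in order gives the call chain that was active when the watchdog fired, which is usually the first thing to share with support.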

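The swatchdog is built from two parts:

  • A low-level timer interrupt fires every 10 seconds.
  • High-level userspace code attempts to leave a mailbox note for the timer interrupt every 5 seconds.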

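When the low-level timer interrupt cannot find a mailbox note from userspace, it takes action and then dumps a stack. After four consecutive failures, the node reboots.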
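
The mailbox mechanism described above can be illustrated with a small simulation. This is a toy model in Python under stated assumptions, not the OneFS implementation: the function `simulate` and its `starved_from` parameter are hypothetical, time advances in one-second steps, and the model assumes that a successful check consumes the note and resets the failure count, and that the fourth failure produces both a stack dump and the reboot. Only the 10-second check, the 5-second note, and the four-failure limit come from this article.

```python
CHECK_INTERVAL = 10  # seconds between timer-interrupt checks (from the article)
POST_INTERVAL = 5    # seconds between userspace mailbox notes (from the article)
MAX_MISSES = 4       # consecutive failures before a reboot (from the article)


def simulate(duration, starved_from=None):
    """Return (second, event) tuples for one simulated node.

    starved_from is the second at which userspace stops getting CPU time;
    None means userspace is never starved. Both are modeling assumptions.
    """
    events = []
    mailbox = False
    misses = 0
    for t in range(1, duration + 1):
        starved = starved_from is not None and t >= starved_from
        if t % POST_INTERVAL == 0 and not starved:
            mailbox = True  # userspace leaves a note
        if t % CHECK_INTERVAL == 0:
            if mailbox:
                mailbox = False  # note found: consume it and reset the count
                misses = 0
            else:
                misses += 1
                events.append((t, "stack dump"))
                if misses == MAX_MISSES:
                    events.append((t, "reboot"))
                    break
    return events


if __name__ == "__main__":
    print(simulate(60))                   # healthy node
    print(simulate(60, starved_from=12))  # userspace starved from second 12
```

Because userspace posts twice per check interval, a single delayed note does not trigger the watchdog in this model; only a full check interval without any note counts as a failure.
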
For help interpreting an error stack or a reboot triggered by swatchdog, contact Dell Technical Support.

Affected Products

Isilon

Products

Isilon, PowerScale OneFS
Article Properties
Article Number: 000018976
Article Type: How To
Last Modified: 10 Jun 2025
Version:  6