PowerScale (Isilon): Child or Parent isi_hangdump process not running on a single or multiple nodes. (Gen5, Gen6, Gen6.5)

Summary: This article describes how to resolve isi_hangdump messages flooding /var/log/messages, which occurs when the child or parent isi_hangdump process is not running on one or more nodes. For isi_hangdump to work properly, both the parent and the child process must be running. ...

Symptoms

Multiple nodes report ping timeouts, possibly to one specific node.
NOTE: This article does not apply to RBM ping timeouts.

The problematic node shows symptoms of a continual isi_hangdump loop.
Major isi_hangdumps occur at roughly the same time every hour.

This could also be causing performance issues.

Messages similar to the following appear in /var/log/messages:

2021-04-04T01:30:50-04:00 <1.5> CLUSTER-24 isi_hangdump: Triggering clusterwide hangdump
2021-04-04T01:30:50-04:00 <1.5> CLUSTER-24 isi_hangdump: LOCK TIMEOUT AT 1617514250 UTC
2021-04-04T01:30:50-04:00 <1.5> CLUSTER-24 isi_hangdump: Hangdump after 752602 seconds: Ping timeout
2021-04-04T01:31:00-04:00 <1.5> CLUSTER-24 isi_hangdump: END OF DUMP AT 1617514250 UTC
2021-04-04T01:31:00-04:00 <1.5> CLUSTER-24 isi_hangdump: Initiating hangdump on 26 nodes...
2021-04-04T01:31:09-04:00 <1.5> CLUSTER-24 isi_hangdump: Skipping requested dump(Ping timeout)
2021-04-04T01:32:09-04:00 <1.5> CLUSTER-24 isi_hangdump: Skipping requested dump(Ping timeout)
2021-04-04T01:35:12-04:00 <1.5> CLUSTER-24 isi_hangdump: Skipping requested dump(Ping timeout)
2021-04-04T01:36:13-04:00 <1.5> CLUSTER-24 isi_hangdump: Skipping requested dump(Ping timeout)
2021-04-04T01:52:27-04:00 <1.5> CLUSTER-24 isi_hangdump: Skipping requested dump(Ping timeout)
2021-04-04T01:53:28-04:00 <1.5> CLUSTER-24 isi_hangdump: Skipping requested dump(Ping timeout)
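
To gauge how heavy the loop is, the isi_hangdump entries can be tallied per hour. The following is a minimal sketch, not a Dell-supplied tool: the function name hangdump_hourly is hypothetical, and it assumes the log lines carry ISO-8601 timestamps in the format shown in the excerpt above.

```shell
# Tally "Triggering" and "Skipping" isi_hangdump messages per hour.
# Reads syslog lines on stdin; hangdump_hourly is a hypothetical helper name.
hangdump_hourly() {
    awk '
    / isi_hangdump: / {
        # Take the date and hour (YYYY-MM-DDTHH) from wherever the timestamp sits.
        if (!match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]/)) next
        hour = substr($0, RSTART, RLENGTH)
        seen[hour] = 1
        if ($0 ~ /Triggering clusterwide hangdump/) trig[hour]++
        else if ($0 ~ /Skipping requested dump/) skip[hour]++
    }
    END {
        for (h in seen) {
            printf "%s triggered=%d skipped=%d\n", h, trig[h], skip[h]
        }
    }' | sort
}
```

A possible invocation on a node is: hangdump_hourly < /var/log/messages. One trigger per hour followed by several skipped dumps matches the pattern described in this article.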


Node 2 is triggering the hangdumps, and they are one hour apart:
2020-08-20T00:53:48-07:00 <1.5> CLUSTER-2 isi_hangdump: Triggering clusterwide hangdump
2020-08-20T01:53:49-07:00 <1.5> CLUSTER-2 isi_hangdump: Triggering clusterwide hangdump   <-- 1 hour difference between the hangdumps: 1:53 and 0:53
2020-08-20T02:53:49-07:00 <1.5> CLUSTER-2 isi_hangdump: Triggering clusterwide hangdump


or

Only node 24 is triggering the hangdumps, and they occur once per hour:

CLUSTER-24# isi_for_array "grep -i triggering /var/log/messages | grep 2021-04"
CLUSTER-24:2021-04-01T00:30:12-04:00 <1.5> CLUSTER-24 isi_hangdump: Triggering clusterwide hangdump
CLUSTER-24:2021-04-01T01:30:12-04:00 <1.5> CLUSTER-24 isi_hangdump: Triggering clusterwide hangdump      <-- 01:30:12 and 00:30:12 : one hour difference from the previous instance
CLUSTER-24:2021-04-01T02:30:12-04:00 <1.5> CLUSTER-24 isi_hangdump: Triggering clusterwide hangdump
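
The one-hour spacing can be confirmed without reading timestamps by eye. The sketch below is illustrative only: hangdump_intervals is a hypothetical function name, and it assumes every line uses the timestamp format shown above with the same UTC offset (a daylight-saving change between two entries would skew that one gap).

```shell
# Print each "Triggering clusterwide hangdump" timestamp with the gap, in
# seconds, since the previous one. Reads syslog lines on stdin.
# hangdump_intervals is a hypothetical helper name.
hangdump_intervals() {
    awk '
    /isi_hangdump: Triggering clusterwide hangdump/ {
        # Find the timestamp anywhere in the line, so a leading "NODE:" prefix
        # added by isi_for_array does not matter.
        if (!match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/)) next
        ts = substr($0, RSTART, RLENGTH)
        y = substr(ts, 1, 4) + 0
        m = substr(ts, 6, 2) + 0
        d = substr(ts, 9, 2) + 0
        # Treat January and February as months 13 and 14 of the previous year
        # so that a leap day lands at the end of the counting year.
        if (m <= 2) { y--; m += 12 }
        days = 365 * y + int(y / 4) - int(y / 100) + int(y / 400) + int((153 * (m - 3) + 2) / 5) + d
        now = days * 86400 + substr(ts, 12, 2) * 3600 + substr(ts, 15, 2) * 60 + substr(ts, 18, 2)
        if (have_prev) print ts, now - prev
        else print ts, "-"
        prev = now
        have_prev = 1
    }'
}
```

A possible invocation on the suspect node is: hangdump_intervals < /var/log/messages. Gaps that stay close to 3600 seconds indicate the hourly loop described above.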

On an affected node, the number of isi_hangdump processes may be 4 or 1. The expected number is 2 (one parent and one child). To see how many isi_hangdump processes are running on each node:

# isi_for_array -s "ps awux | grep '[h]angdump'"
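
On a large cluster, the per-node counts are easier to read when summarized. The following sketch assumes isi_for_array prefixes every output line with the node name followed by a colon, as in the excerpts in this article; hangdump_count_by_node is a hypothetical name, not an OneFS command.

```shell
# Count isi_hangdump processes per node from isi_for_array output on stdin
# and flag every node whose count is not the expected 2.
# Assumes each line starts with "NODENAME:"; the function name is hypothetical.
hangdump_count_by_node() {
    awk -F: '
    NF > 1 { count[$1]++ }
    END {
        for (node in count) {
            printf "%s %d %s\n", node, count[node], (count[node] == 2 ? "OK" : "CHECK")
        }
    }' | sort
}
```

A possible invocation is: isi_for_array -s "ps awux | grep '[h]angdump'" | hangdump_count_by_node. A node with no isi_hangdump process at all produces no output line, so a node missing from the summary also needs attention.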


The resolution is to restart the isi_hangdump service and then check the number of isi_hangdump processes again.
If the count is still not 2, restart the node itself.

Cause

The parent or child isi_hangdump process is not running. If the child (ping) process is not running, that node does not send the internal ping messages, and the resulting ping timeouts trigger hangdumps. The continuous generation of hangdumps can in turn lead to performance issues.

Resolution

Currently the resolution is to run "isi_hangdump restart" (as shown in the example below).

If that fails, panic reboot the node. This captures the cores and restarts the isi_hangdump process.

CLUSTER-1# ps -auwx | grep -i isi_hangdump
root    1015   0.0  0.6 437876  38928  -  S    25Mar21      0:57.01 /usr/libexec/isilon/isi_hangdump /usr/bin/isi_hangdump start
root    1016   0.0  0.5 398676  32200  -  S    25Mar21     20:05.60 /usr/libexec/isilon/isi_hangdump /usr/bin/isi_hangdump start
root   32228   0.0  0.0  12344   2616  0  S+   20:41        0:00.00 grep -i isi_hangdump

CLUSTER-1# isi_hangdump restart
CLUSTER-1# ps -auwx | grep -i isi_hangdump
root   32253   3.9  0.6 398808  35976  -  S    20:41        0:00.01 /usr/libexec/isilon/isi_hangdump /usr/bin/isi_hangdump restart
root    1016   0.0  0.5 398676  32200  -  S    25Mar21     20:05.61 /usr/libexec/isilon/isi_hangdump /usr/bin/isi_hangdump start
root   32260   0.0  0.0  12344   2616  0  S+   20:41        0:00.00 grep -i isi_hangdump
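
The restart and the follow-up process check can be combined into one step. This is a minimal sketch rather than an official procedure: the function names are hypothetical, the 10-second default wait is an assumption and not a documented value, and the binary path is taken from the ps output above.

```shell
# Count isi_hangdump daemon processes in "ps" output read from stdin.
# The [i] keeps a grep over live ps output from matching its own command line.
hangdump_count() {
    grep -c '/usr/libexec/isilon/[i]si_hangdump' || true
}

# Restart isi_hangdump on the local node, wait, then confirm that exactly two
# processes are running. The optional argument is the wait in seconds; the
# default of 10 is an assumption, not a documented value.
restart_and_check() {
    isi_hangdump restart || return 1
    sleep "${1:-10}"
    n=$(ps awux | hangdump_count)
    echo "isi_hangdump processes: $n"
    [ "$n" -eq 2 ]
}
```

Running restart_and_check on the affected node returns a non-zero exit status when the count is not 2, which is the case where this article calls for a reboot of the node.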


In the meantime, engineering is working on a permanent resolution.

Affected Products

PowerScale OneFS
Article Properties
Article Number: 000185607
Article Type: Solution
Last Modified: 12 Jan 2023
Version: 6