PowerScale (Isilon): Child or Parent isi_hangdump process not running on a single or multiple nodes. (Gen5, Gen6, Gen6.5)
Summary: This article provides an overview of how to resolve isi_hangdump messages flooding /var/log/messages when the child or parent isi_hangdump process is not running on one or more nodes. For isi_hangdump to work properly, both the parent and child processes must be running. ...
Symptoms
Multiple nodes report ping timeouts, possibly to one specific node.
NOTE: This article does not apply to RBM ping timeouts.
The problematic node shows symptoms of a continual isi_hangdump loop.
Major isi_hangdump events occur at roughly the same time every hour.
This behavior can also cause performance issues.
Messages similar to the following appear in /var/log/messages:
2021-04-04T01:30:50-04:00 <1.5> CLUSTER-24 isi_hangdump: Triggering clusterwide hangdump
2021-04-04T01:30:50-04:00 <1.5> CLUSTER-24 isi_hangdump: LOCK TIMEOUT AT 1617514250 UTC
2021-04-04T01:30:50-04:00 <1.5> CLUSTER-24 isi_hangdump: Hangdump after 752602 seconds: Ping timeout
2021-04-04T01:31:00-04:00 <1.5> CLUSTER-24 isi_hangdump: END OF DUMP AT 1617514250 UTC
2021-04-04T01:31:00-04:00 <1.5> CLUSTER-24 isi_hangdump: Initiating hangdump on 26 nodes...
2021-04-04T01:31:09-04:00 <1.5> CLUSTER-24 isi_hangdump: Skipping requested dump(Ping timeout)
2021-04-04T01:32:09-04:00 <1.5> CLUSTER-24 isi_hangdump: Skipping requested dump(Ping timeout)
2021-04-04T01:35:12-04:00 <1.5> CLUSTER-24 isi_hangdump: Skipping requested dump(Ping timeout)
2021-04-04T01:36:13-04:00 <1.5> CLUSTER-24 isi_hangdump: Skipping requested dump(Ping timeout)
2021-04-04T01:52:27-04:00 <1.5> CLUSTER-24 isi_hangdump: Skipping requested dump(Ping timeout)
2021-04-04T01:53:28-04:00 <1.5> CLUSTER-24 isi_hangdump: Skipping requested dump(Ping timeout)
In this example, node 2 is triggering the hangdumps, and they occur one hour apart:
2020-08-20T00:53:48-07:00 <1.5> CLUSTER-2 isi_hangdump: Triggering clusterwide hangdump
2020-08-20T01:53:49-07:00 <1.5> CLUSTER-2 isi_hangdump: Triggering clusterwide hangdump <-- 1 hour difference between the hangdumps: 1:53 and 0:53
2020-08-20T02:53:49-07:00 <1.5> CLUSTER-2 isi_hangdump: Triggering clusterwide hangdump
or
Only node 24 is triggering the hangdumps, and they occur once every hour:
CLUSTER-24# isi_for_array "grep -i triggering /var/log/messages | grep 2021-04"
CLUSTER-24:2021-04-01T00:30:12-04:00 <1.5> CLUSTER-24 isi_hangdump: Triggering clusterwide hangdump
CLUSTER-24:2021-04-01T01:30:12-04:00 <1.5> CLUSTER-24 isi_hangdump: Triggering clusterwide hangdump <-- 01:30:12 and 00:30:12 : one hour difference from the previous instan
CLUSTER-24:2021-04-01T02:30:12-04:00 <1.5> CLUSTER-24 isi_hangdump: Triggering clusterwide hangdump
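The hourly pattern in the output above can be confirmed without reading timestamps by eye. The following is a minimal sketch, not a supported tool: trigger_intervals is a hypothetical helper name, and it assumes the input holds the trigger lines of a single node in chronological order, with less than 24 hours between consecutive triggers.

```shell
# Sketch: print the gap in seconds between consecutive
# "Triggering clusterwide hangdump" messages read from stdin.
# Assumptions: one node's lines, oldest first, gaps shorter than 24 hours.
trigger_intervals() {
    grep 'Triggering clusterwide hangdump' | awk '
        # Locate the first timestamp of the form YYYY-MM-DDTHH:MM:SS on the line.
        match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/) {
            ts = substr($0, RSTART, 19)
            # Convert the time of day to seconds since midnight.
            secs = substr(ts, 12, 2) * 3600 + substr(ts, 15, 2) * 60 + substr(ts, 18, 2)
            if (seen) {
                gap = secs - prev
                if (gap < 0) gap += 86400   # trigger crossed midnight
                print ts, gap
            }
            prev = secs
            seen = 1
        }'
}
```

For example, `grep -i triggering /var/log/messages | trigger_intervals` on the suspect node prints one line per trigger after the first. Gaps close to 3600 seconds match the hourly loop described in this article.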
On an affected node, the number of isi_hangdump processes can be 4 or 1. The expected number of isi_hangdump processes is 2. To see how many isi_hangdump processes are running on each node:
# isi_for_array -s "ps awux | grep '[h]angdump'"
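On a large cluster it may be easier to summarize that output per node. The following is a minimal sketch: count_hangdump_procs is a hypothetical helper name, and it assumes each output line starts with the node name followed by a colon, as in the isi_for_array output shown in this article. A node with no isi_hangdump process at all produces no lines and therefore does not appear in the summary, so compare the list against the expected node count.

```shell
# Sketch: summarize "isi_for_array" process listings read from stdin.
# Prints "<node> <count> OK" when a node has the expected 2 processes,
# and "<node> <count> CHECK" otherwise.
# Assumption: each line is prefixed with "<node name>:".
count_hangdump_procs() {
    awk -F: '
        /hangdump/ { n[$1]++ }      # count matching lines per node prefix
        END {
            for (node in n)
                printf "%s %d %s\n", node, n[node], (n[node] == 2 ? "OK" : "CHECK")
        }
    ' | sort
}
```

It could be used as `isi_for_array -s "ps awux | grep '[h]angdump'" | count_hangdump_procs`. Any node marked CHECK is a candidate for the restart described in this article.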
The resolution is to restart the isi_hangdump service and then check the number of isi_hangdump processes.
If the number is not 2, restart the node itself.
Cause
Resolution
Currently, the resolution is to run "isi_hangdump restart" on the affected node (as shown in the example below).
If that fails, panic reboot the node to capture the cores and restart the isi_hangdump process.
CLUSTER-1# ps -auwx | grep -i isi_hangdump
root 1015 0.0 0.6 437876 38928 - S 25Mar21 0:57.01 /usr/libexec/isilon/isi_hangdump /usr/bin/isi_hangdump start
root 1016 0.0 0.5 398676 32200 - S 25Mar21 20:05.60 /usr/libexec/isilon/isi_hangdump /usr/bin/isi_hangdump start
root 32228 0.0 0.0 12344 2616 0 S+ 20:41 0:00.00 grep -i isi_hangdump
CLUSTER-1# isi_hangdump restart
CLUSTER-1# ps -auwx | grep -i isi_hangdump
root 32253 3.9 0.6 398808 35976 - S 20:41 0:00.01 /usr/libexec/isilon/isi_hangdump /usr/bin/isi_hangdump restart
root 1016 0.0 0.5 398676 32200 - S 25Mar21 20:05.61 /usr/libexec/isilon/isi_hangdump /usr/bin/isi_hangdump start
root 32260 0.0 0.0 12344 2616 0 S+ 20:41 0:00.00 grep -i isi_hangdump
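The check after the restart can be scripted. The following is a minimal sketch: verify_hangdump is a hypothetical helper name, and the expected count of 2 is taken from the Symptoms section of this article.

```shell
# Sketch: compare an observed isi_hangdump process count with the expected 2.
# $1 is the observed count. Returns 0 when it is 2, 1 otherwise.
verify_hangdump() {
    if [ "$1" -eq 2 ]; then
        echo "OK: 2 isi_hangdump processes"
        return 0
    fi
    echo "CHECK: $1 isi_hangdump processes, expected 2"
    return 1
}
```

On the affected node, this might be used as follows. The 5-second pause is an assumption to give the processes time to start, not a documented value:

isi_hangdump restart
sleep 5
verify_hangdump "$(ps awux | grep -c '[h]angdump')"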
In the meantime, engineering is working on a permanent resolution.