1 Rookie
•
107 Posts
0
7315
June 5th, 2017 05:00
isi_job_d problems...
Has anyone run into any isi_job_d issues? Jobs are hung on our cluster and I cannot cancel them. I've disabled and re-enabled the job service to no avail. EMC recommended increasing the thread count to 240, which I did, but since I cannot restart the isi_job_d service, the change has no effect. The only option I can see is a rolling reboot (which I hate to do during production hours).
Example:
isi job jobs pause --job 4423
Connection refused


Eric_W1
33 Posts
1
June 5th, 2017 06:00
Connection refused? Could this be a PAPI issue at this point?
Phil.Lam
4 Apprentice
•
638 Posts
•
3 Points
0
June 5th, 2017 09:00
Is the job engine still OFF? I know I get that error when job engine is OFF.
Brian_Coulombe_
1 Rookie
•
107 Posts
0
June 5th, 2017 10:00
Nah, I checked. The message says the job stopped, but every time I check it has already been restarted. I do want to check the PAPI option to see if that is the problem.
EMC said the thread count for FSAnalyze needed to be bumped up to 240 but as I check this, it appears the problem went away.
I HATE issues that fix themselves and I have no clue WHY!
Phil.Lam
4 Apprentice
•
638 Posts
•
3 Points
0
June 5th, 2017 11:00
Has an SR been created for the isi_job_d issue?
Brian_Coulombe_
1 Rookie
•
107 Posts
0
June 5th, 2017 11:00
Yeah, that's where EMC told me to stop the isi_job_d service, change the thread count to 240, then restart the service. I did that and still got the error, but somewhere between making the change, opening this question, and responding to you, it fixed itself.
I'm dumbfounded. EMC has the logs, though, and that is where they told me that FSAnalyze will need the thread count increased on OneFS 8.0.0.4. That should be part of the post-upgrade config, no?
Brian_Coulombe_
1 Rookie
•
107 Posts
0
June 7th, 2017 05:00
Hmmm, I am still seeing that I cannot stop or restart any jobs. It appears maybe isi_job_d is stuck?
CLUSTER-1# isi job jobs list
ID   Type               State   Impact Pri Phase Running Time
-----------------------------------------------------------------
4422 FSAnalyze          Running Low    6   9/10  1h 27m
4423 ShadowStoreProtect Waiting Low    6   1/1   -
4424 WormQueue          Waiting Low    6   1/1   -
4428 MultiScan          Running Low    4   1/4   31s
4432 SnapshotDelete     Running Medium 2   1/2   2s
-----------------------------------------------------------------
Total: 5
CLUSTER-1# isi job jobs cancel --job=4422
Connection refused
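For anyone who wants to pull job IDs out of a listing like the one above before trying to pause or cancel them, here is a minimal sketch. The helper name jobs_in_state is made up for illustration, and it assumes the column order shown in the listing (ID, Type, State, ...); other OneFS versions may print different columns.

```shell
# jobs_in_state STATE: read `isi job jobs list` output on stdin and print
# the IDs of jobs whose State column equals STATE.
# Assumption: columns are ordered ID, Type, State, ... as in the post.
jobs_in_state() {
    # Only rows whose first field is a numeric job ID are considered, which
    # skips the header, the dashed separators, and the "Total:" line.
    awk -v state="$1" '$1 ~ /^[0-9]+$/ && $3 == state { print $1 }'
}

# Hypothetical usage on a cluster node:
#   isi job jobs list | jobs_in_state Running
```

This only reads the listing. It will not help while the job engine itself answers "Connection refused".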
Brian_Coulombe_
1 Rookie
•
107 Posts
2
June 8th, 2017 04:00
Alright, if you get this issue again, it's the MCP process that is the problem. Here's how you fix it:
isi_for_array -sX ps auwwx | grep -v grep | grep isi_job_d
isi_for_array -sX ps auwwx | grep -v grep | grep isi_mcp
isi services -a isi_job_d disable
isi_for_array -sX ps auwwx | grep -v grep | grep isi_job_d
If processes are still running for isi_job_d after 60 seconds, run:
isi_for_array -sX killall -9 isi_job_d
isi_for_array -sX killall -6 isi_mcp
isi_for_array -sX ps auwwx | grep -v grep | grep isi_mcp
If processes are still running for isi_mcp after 60 seconds, run:
isi_for_array -sX killall -9 isi_mcp
Then, once isi_mcp is truly dead, restart it with:
isi_for_array -sX isi_mcp
Then isi_job_d enable:
isi services -a isi_job_d enable
Check ps auwwx again:
isi_for_array -sX ps auwwx | grep -v grep | grep isi_job_d
isi_for_array -sX ps auwwx | grep -v grep | grep isi_mcp
Also check the thread count for FSAnalyze.
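If you would rather not watch the clock for the "after 60 seconds" checks in the steps above, the wait can be scripted. This is only a sketch, not an EMC-supplied script: the function names and the CHECK_CMD override are hypothetical, and the default check simply reuses the ps pipeline from the steps above.

```shell
# still_running NAME: succeed if the process listing still shows NAME.
# CHECK_CMD is a hypothetical override so the helper can be tried off-cluster;
# by default it runs the same ps command used in the steps above.
still_running() {
    ${CHECK_CMD:-isi_for_array -sX ps auwwx} 2>/dev/null | grep -v grep | grep -q "$1"
}

# wait_for_exit NAME [TIMEOUT] [INTERVAL]: return 0 once NAME is gone,
# or 1 if it is still present after TIMEOUT seconds (default 60).
wait_for_exit() {
    name=$1
    timeout=${2:-60}
    interval=${3:-5}
    waited=0
    while still_running "$name"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Hypothetical usage, mirroring the manual steps:
#   isi services -a isi_job_d disable
#   wait_for_exit isi_job_d 60 || isi_for_array -sX killall -9 isi_job_d
```

The helper only reports whether the process is gone. The decision to send a kill signal stays an explicit, separate command.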
JimK513
1 Rookie
•
89 Posts
0
June 8th, 2017 09:00
Thanks.
We are completing our upgrades to 8.0.0.4 this weekend. I'll definitely want to review this.
Jim
JimK513
1 Rookie
•
89 Posts
0
June 8th, 2017 09:00
How do you check the current threadcount?
Brian_Coulombe_
1 Rookie
•
107 Posts
0
June 8th, 2017 09:00
Here's how to check the threadcount:
isi_gconfig -t job-config core.load_balance_interval_sec
Here's how to change it to EMC's recommendation of 240:
isi_gconfig -t job-config core.load_balance_interval_sec=240
chjatwork
2 Intern
•
356 Posts
0
June 8th, 2017 10:00
Run top on the nodes to see if the PAPI process is running over 80%. If it is, you may want to let the nodes finish that work first.
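As a non-interactive alternative to eyeballing top, something like the sketch below could filter ps output on a node. The helper name high_cpu is made up, and isi_papi_d is assumed to be the PAPI daemon's process name; adjust both to what your nodes actually show.

```shell
# high_cpu NAME [THRESHOLD]: read "pcpu command" lines on stdin and print
# those whose command contains NAME and whose CPU percentage is above
# THRESHOLD (default 80). The ps header line is skipped because its second
# field ("COMMAND") never contains a process name.
high_cpu() {
    awk -v name="$1" -v limit="${2:-80}" \
        'index($2, name) && $1 + 0 > limit + 0 { print }'
}

# Hypothetical usage on a single node (process name is an assumption):
#   ps -axo pcpu,comm | high_cpu isi_papi_d 80
```

Run it locally on each node rather than through isi_for_array, since a node-name prefix on each line would shift the columns the filter expects.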
Yan_Faubert
117 Posts
0
June 8th, 2017 10:00
Just to clarify, this setting is not a thread count per se; it is the interval at which the job engine makes load-balancing decisions. With 240, the job engine wakes up every 4 minutes (instead of the default of 60 seconds) and evaluates the load it places on the cluster against client load, then adjusts accordingly (i.e., scales the number of job engine workers up or down).
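To put numbers on that interval, here is a quick bit of shell arithmetic. The function name is hypothetical and nothing here touches the cluster; it only converts an interval in seconds into evaluations per hour.

```shell
# wakeups_per_hour INTERVAL_SEC: how many load-balancing evaluations fit
# into one hour at the given interval (integer division).
wakeups_per_hour() {
    echo $((3600 / $1))
}

# Default interval of 60 seconds  -> 60 evaluations per hour.
# Recommended value of 240 seconds -> 15 evaluations per hour.
#   wakeups_per_hour 60
#   wakeups_per_hour 240
```

So moving from 60 to 240 means the job engine re-evaluates worker counts a quarter as often.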
Brian_Coulombe_
1 Rookie
•
107 Posts
0
June 8th, 2017 10:00
True, I am using "threadcount" improperly. Thanks for clarifying, Yan!