isi_job_d problems...

Question

Has anyone run into any isi_job_d issues? I'm seeing jobs hung on our cluster and I cannot cancel them. I've disabled and re-enabled the job service to no avail. EMC mentioned an issue that calls for increasing the thread count to 240, which I did, but since I cannot restart the isi_job_d service, this has no effect. The only option I can see is a rolling reboot (which I hate to do during production hours).

Example:

isi job jobs pause --job 4423

Connection refused

Eric_W1 · Answer

Connection refused?  Could this be a PAPI issue at this point?

Phil.Lam · Answer

Is the job engine still OFF? I know I get that error when job engine is OFF.

Brian_Coulombe_ · Answer

Nah, I checked. The message is saying the job stopped, but every time I check it's already been restarted. I do want to check the PAPI option to see if that is the problem. EMC said the thread count for FSAnalyze needed to be bumped up to 240, but as I check this, it appears the problem went away. I HATE issues that fix themselves and I have no clue WHY!

Phil.Lam · Answer

Has an SR been created for the isi_job_d issue?

Brian_Coulombe_ · Answer

Yeah, that's where EMC told me to stop the isi_job_d service, change the thread count to 240, then restart the service. I did that and still got the error, but somewhere between doing that, opening this question, and responding to you, it fixed itself.

I'm dumbfounded. EMC has the logs, though, and that is when they told me that FSAnalyze needs the thread count increased on OneFS 8.0.0.4. That should be part of the post-upgrade config, no?

Brian_Coulombe_ · Answer

Hmmm, I am still seeing that I cannot stop or restart any jobs. It appears maybe isi_job_d is stuck?

CLUSTER-1# isi job jobs list

ID   Type               State   Impact Pri Phase Running Time
-----------------------------------------------------------------
4422 FSAnalyze          Running Low    6   9/10  1h 27m
4423 ShadowStoreProtect Waiting Low    6   1/1   -
4424 WormQueue          Waiting Low    6   1/1   -
4428 MultiScan          Running Low    4   1/4   31s
4432 SnapshotDelete     Running Medium 2   1/2   2s
-----------------------------------------------------------------
Total: 5

CLUSTER-1# isi job jobs cancel --job=4422

Connection refused

Brian_Coulombe_ · Answer

Alright, if you hit this issue again, it's the MCP process that is the problem. Here's how you fix it:

isi_for_array -sX ps auwwx | grep -v grep | grep isi_job_d

isi_for_array -sX ps auwwx | grep -v grep | grep isi_mcp

isi services -a isi_job_d disable

isi_for_array -sX ps auwwx | grep -v grep | grep isi_job_d

If processes are still running for isi_job_d after 60 seconds, run:

isi_for_array -sX killall -9 isi_job_d

isi_for_array -sX killall -6 isi_mcp

isi_for_array -sX ps auwwx | grep -v grep | grep isi_mcp

If processes are still running for isi_mcp after 60 seconds, run:

isi_for_array -sX killall -9 isi_mcp

Then, once it is truly dead, restart isi_mcp with:

isi_for_array -sX isi_mcp

Then enable isi_job_d:

isi services -a isi_job_d enable

Check ps auwwx again:

isi_for_array -sX ps auwwx | grep -v grep | grep isi_job_d

isi_for_array -sX ps auwwx | grep -v grep | grep isi_mcp

Also check the thread count for FSAnalyze.
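The "wait 60 seconds, then force-kill" steps above can be sketched as a small shell helper. This is only a minimal sketch, not an EMC-supplied script: `wait_until_gone` and its arguments are hypothetical names, and the 5-second poll interval is an assumption. Only the isi_for_array, ps, and killall commands in the usage comment come from the steps above.

```shell
#!/bin/sh
# wait_until_gone CHECK_CMD [TIMEOUT_SEC] [POLL_SEC]   (hypothetical helper)
# Runs CHECK_CMD repeatedly. Returns 0 as soon as CHECK_CMD prints nothing
# (no matching processes left), or 1 if it still prints something once
# TIMEOUT_SEC seconds have elapsed.
wait_until_gone() {
    check_cmd=$1
    timeout=${2:-60}   # the 60 seconds mentioned in the steps above
    poll=${3:-5}       # assumption: poll every 5 seconds
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if [ -z "$(sh -c "$check_cmd" 2>/dev/null)" ]; then
            return 0
        fi
        sleep "$poll"
        elapsed=$((elapsed + poll))
    done
    # One last look after the timeout has expired.
    [ -z "$(sh -c "$check_cmd" 2>/dev/null)" ]
}

# Usage on a cluster, after disabling isi_job_d (not executed here):
#   check='isi_for_array -sX ps auwwx | grep -v grep | grep isi_job_d'
#   wait_until_gone "$check" 60 5 || isi_for_array -sX killall -9 isi_job_d
```

The same helper could be reused for the isi_mcp step by swapping the process name in the check command, so the force-kill only happens when the processes really are still there after the wait.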

JimK513 · Answer

Thanks. We are completing our upgrades to 8.0.0.4 this weekend. I'll definitely want to review this. Jim

JimK513 · Answer

How do you check the current threadcount?

Brian_Coulombe_ · Answer

Here's how to check the threadcount:

isi_gconfig -t job-config core.load_balance_interval_sec

Here's how to change it to EMC's recommendation of 240.

isi_gconfig -t job-config core.load_balance_interval_sec=240

chjatwork · Answer

Run top on the nodes to see if the PAPI process is running over 80%. You may want to allow the nodes to finish that process.
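If you would rather not watch top interactively, something like the filter below could flag the process for you. It is a rough sketch under a few assumptions: that the 80% figure refers to CPU, that the PAPI daemon shows up as `isi_papi_d`, and that `busy_procs` (a hypothetical helper name) is fed plain `ps -axo pcpu,comm` output on a single node.

```shell
#!/bin/sh
# busy_procs NAME LIMIT   (hypothetical helper)
# Reads "pcpu command" lines on stdin and prints those whose command
# contains NAME and whose CPU percentage is above LIMIT.
# The ps header line is dropped because its command column never matches NAME.
busy_procs() {
    awk -v name="$1" -v limit="$2" \
        'index($2, name) && $1 + 0 > limit + 0 { print $1, $2 }'
}

# Usage on one node (process name isi_papi_d is an assumption):
#   ps -axo pcpu,comm | busy_procs isi_papi_d 80
```

No output means nothing matching is above the limit at that moment; since ps reports a snapshot, it is worth running it a few times before drawing conclusions.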

Yan_Faubert · Answer

Just to clarify, this setting is not a threadcount per se; it is the interval at which job engine makes load balancing decisions. So with 240, job engine wakes up every 4 minutes (instead of the default of 60 seconds, i.e. 1 minute), evaluates the load placed on the cluster by job engine vs. client load, and adjusts accordingly (i.e., scales the number of job engine workers up or down).
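To put numbers on that, here is a tiny sketch of the arithmetic. The function names are made up for illustration; the only inputs are the 60-second default and the 240-second value discussed in this thread.

```shell
#!/bin/sh
# interval_minutes SEC: the load balance interval expressed in whole minutes.
interval_minutes() {
    echo $(($1 / 60))
}

# evals_per_hour SEC: how many load balancing evaluations fit into one hour.
evals_per_hour() {
    echo $((3600 / $1))
}

# Usage:
#   interval_minutes 240   # 4 minutes between evaluations
#   evals_per_hour 60      # 60 evaluations per hour at the default
#   evals_per_hour 240     # 15 evaluations per hour at 240
```

In other words, going from 60 to 240 makes job engine re-evaluate the load a quarter as often; it does not by itself change how many workers are running.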

Brian_Coulombe_ · Answer

True, I am using 'threadcount' improperly.  Thanks for clarifying, Yan!
