1 Rookie

 • 

107 Posts

June 5th, 2017 05:00

isi_job_d problems...

Has anyone run into any isi_job_d issues?  Jobs are hung on our cluster and I cannot cancel them. I've disabled and re-enabled the job service to no avail.  EMC suggested increasing the threadcount to 240, which I did, but since I cannot restart the isi_job_d service, the change has no effect.  The only option I can see is a rolling reboot (which I hate to do during production hours).

Example:

isi job jobs pause --job 4423

Connection refused

33 Posts

June 5th, 2017 06:00

Connection refused?  Could this be a PAPI issue at this point?

4 Apprentice

 • 

638 Posts

 • 

3 Points

June 5th, 2017 09:00

Is the job engine still OFF? I know I get that error when job engine is OFF.

1 Rookie

 • 

107 Posts

June 5th, 2017 10:00

Nah, I checked.  The message says the job stopped, but every time I check it has already been restarted.  I do want to check the PAPI option to see if that is the problem.


EMC said the thread count for FSAnalyze needed to be bumped up to 240 but as I check this, it appears the problem went away.

I HATE issues that fix themselves and I have no clue WHY!

4 Apprentice

 • 

638 Posts

 • 

3 Points

June 5th, 2017 11:00

Has an SR been created for the isi_job_d issue?

1 Rookie

 • 

107 Posts

June 5th, 2017 11:00

Yeah, that's where EMC told me to stop the isi_job_d service, change the thread count to 240, then restart the service.  I did that and still got the error, but somewhere between making the change, opening this question, and responding to you, it fixed itself.


I'm dumbfounded.  EMC has the logs, though, and that is where they told me that FSAnalyze needs the threadcount increased on OneFS 8.0.0.4.  That should be part of the post-upgrade config, no?

1 Rookie

 • 

107 Posts

June 7th, 2017 05:00

Hmmm, I am still seeing that I cannot stop or restart any jobs.  It appears maybe isi_job_d is stuck?

CLUSTER-1# isi job jobs list

ID   Type               State   Impact  Pri  Phase  Running Time

-----------------------------------------------------------------

4422 FSAnalyze          Running Low     6    9/10   1h 27m

4423 ShadowStoreProtect Waiting Low     6    1/1    -

4424 WormQueue          Waiting Low     6    1/1    -

4428 MultiScan          Running Low     4    1/4    31s

4432 SnapshotDelete     Running Medium  2    1/2    2s

-----------------------------------------------------------------

Total: 5

CLUSTER-1# isi job jobs cancel --job=4422

Connection refused

1 Rookie

 • 

107 Posts

June 8th, 2017 04:00

Alright, if you hit this issue again, the MCP process is the problem.  Here's how to fix it:


isi_for_array -sX ps auwwx | grep -v grep | grep isi_job_d

isi_for_array -sX ps auwwx | grep -v grep | grep isi_mcp

isi services -a isi_job_d disable

isi_for_array -sX ps auwwx | grep -v grep | grep isi_job_d

If processes are still running for isi_job_d after 60 seconds, run:

isi_for_array -sX killall -9 isi_job_d

isi_for_array -sX killall -6 isi_mcp

isi_for_array -sX ps auwwx | grep -v grep | grep isi_mcp

If processes are still running for isi_mcp after 60 seconds, run:

isi_for_array -sX killall -9 isi_mcp

Once isi_mcp is truly dead, restart it with:

isi_for_array -sX isi_mcp

Then re-enable isi_job_d:

isi services -a isi_job_d enable

Check ps auwwx again:

isi_for_array -sX ps auwwx | grep -v grep | grep isi_job_d

isi_for_array -sX ps auwwx | grep -v grep | grep isi_mcp

Also check the thread count for FSAnalyze.
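The "wait 60 seconds, then escalate to killall -9" steps above can be sketched as a small polling helper. This is only an illustration, not an EMC-supplied script: `wait_for_exit` and `job_d_running` are hypothetical names, and the only cluster commands used are the ones already shown in this post.

```shell
#!/bin/sh
# Hypothetical helper (not part of OneFS): poll until a check command stops
# reporting matching processes, or give up once the timeout has elapsed.
# Usage: wait_for_exit TIMEOUT_SEC INTERVAL_SEC CHECK_COMMAND...
# CHECK_COMMAND must exit 0 while the processes are still present.
wait_for_exit() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # still running after the timeout: time to escalate
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0            # the check found nothing: processes are gone
}

# Check taken from the post above; only meaningful on a cluster node.
job_d_running() {
    isi_for_array -sX ps auwwx | grep -v grep | grep isi_job_d
}

# Example use on a cluster (left commented out so nothing is killed by accident):
# if ! wait_for_exit 60 5 job_d_running; then
#     isi_for_array -sX killall -9 isi_job_d
# fi
```

The same helper would work for the isi_mcp step by swapping in a check that greps for isi_mcp.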

1 Rookie

 • 

89 Posts

June 8th, 2017 09:00

Thanks.

We are completing our upgrades to 8.0.0.4 this weekend. I'll definitely want to review this.

Jim

1 Rookie

 • 

89 Posts

June 8th, 2017 09:00

How do you check the current threadcount?

1 Rookie

 • 

107 Posts

June 8th, 2017 09:00

Here's how to check the threadcount:

isi_gconfig -t job-config core.load_balance_interval_sec


Here's how to change it to EMC's recommendation of 240.

isi_gconfig -t job-config core.load_balance_interval_sec=240

2 Intern

 • 

356 Posts

June 8th, 2017 10:00

Run top on the nodes to see whether the PAPI process is running at over 80% CPU.  If it is, you may want to let the nodes finish that work first.

117 Posts

June 8th, 2017 10:00

Just to clarify, this setting is not a threadcount per se; it is the interval at which job engine makes load balancing decisions.  With 240, job engine wakes up every 4 minutes (instead of the default of 60 seconds), evaluates the load placed on the cluster by job engine versus client load, and adjusts accordingly (i.e., scales the number of job engine workers up or down).
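To make the arithmetic concrete, here is a tiny sketch that converts an interval in seconds into the wake-up frequency described above. `describe_interval` is a made-up helper for illustration and does not query the cluster.

```shell
#!/bin/sh
# Hypothetical helper: show what a load_balance_interval_sec value implies.
describe_interval() {
    interval_sec=$1
    echo "every $((interval_sec / 60)) min, $((3600 / interval_sec)) evaluations per hour"
}

describe_interval 60     # default: every 1 min, 60 evaluations per hour
describe_interval 240    # EMC's value: every 4 min, 15 evaluations per hour
```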

1 Rookie

 • 

107 Posts

June 8th, 2017 10:00

True, I am using "threadcount" improperly.  Thanks for clarifying, Yan!
