thaedeus

4 Posts

March 21st, 2020 12:00

[Isilon Simulator] Failed to connect to isi_job_d daemon: Connection refused

Hey all!

In our dev environment we are running the Isilon Simulator. It is running OneFS 8.2.0, but some of the virtual nodes are "older" and are throwing dozens of different errors, such as failing jobs, an inconsistent filesystem, etc. So I've decided to rebuild our virtual Isilon environment.

Nodes 1 to 7 are the "old, bugged" ones.

vNode-1 was SmartFailed, so I've shut it down (for some reason it was still running).
vNode-2: I SmartFailed it myself and turned it off.
vNode-3 to vNode-7 are running with the "A" (Attention) flag due to the many errors occurring.
vNode-8 is in OK status (this is the new one, built and joined to the cluster 2 days ago).

Unfortunately, NONE of the data was migrated to the new node (Used: 600k of 46,5G).
I thought of running the "FlexProtect" job to repair the existing vNodes, but the job fails with "Connection refused".

OK, so here are my questions:

Do you have any ideas on how I can fix the broken nodes?
It might not be worth the effort to fix them, so I've also considered replacing the broken nodes with new ones, but as vNode-8 shows, none of the data is being migrated to the new node. The existing nodes are ~80% full; vNode-8 is less than 1% full. Is there any way to "force" data balancing/migration?
SmartFailing for some reason does not migrate data off the nodes being failed, since the node turns off a few seconds after the SmartFail starts (too fast to migrate 30+ GB of data...).

Any other suggestions on how I can remediate the environment would be much appreciated.
There is some Hadoop-thingy I need to implement in the PROD environment next week, but I need to check on dev first whether the solution even works...

Thanks all and regards,

Ted

Responses(5)

thaedeus

March 23rd, 2020 11:00

Today I've done further checks:

SmartFail still doesn't work
Cannot start any of the following jobs:
1. FSAnalyze
2. FlexProtect
3. IntegrityScan
4. AutoBalance

Everything keeps failing with error:
"Failed to connect to isi_job_d daemon: Connection refused"

This happens no matter which vnode I try from: new, old, Job Controller, or regular node.

When checking isi job status in detail, I get:

vnode-8# isi job status --verbose | grep Coordinator
The Coordinator is not connected to all other up nodes yet. Jobs will not start (but can be queued).
Coordinator LNN: 6
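
For anyone who wants to script this check across nodes, here is a minimal sketch in Python that parses the text shown above. It assumes the output has exactly the two lines quoted in this post (the warning sentence and a "Coordinator LNN: N" line); the function name `parse_coordinator_status` is made up for illustration and is not part of OneFS.

```python
import re

# Warning sentence copied from the `isi job status --verbose` output above.
WARNING = "The Coordinator is not connected to all other up nodes yet"


def parse_coordinator_status(text):
    """Return (coordinator_lnn, fully_connected) for the given output text.

    coordinator_lnn is None when no "Coordinator LNN:" line is present.
    fully_connected is False when the warning sentence appears.
    """
    match = re.search(r"^Coordinator LNN:\s*(\d+)\s*$", text, re.MULTILINE)
    lnn = int(match.group(1)) if match else None
    fully_connected = WARNING not in text
    return lnn, fully_connected


# Sample taken from the output quoted in this post.
sample = (
    "The Coordinator is not connected to all other up nodes yet. "
    "Jobs will not start (but can be queued).\n"
    "Coordinator LNN: 6\n"
)

lnn, connected = parse_coordinator_status(sample)
print(f"coordinator LNN: {lnn}, connected to all up nodes: {connected}")
```

Feeding it the output captured on each node would show whether every node agrees on the coordinator and whether the warning ever clears.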

Forgot to add: the nodes are allowed to go into the read-only state, but read-only is not enabled on any of them.

DELL-Sam L

Moderator

6.9K Posts

March 23rd, 2020 15:00

Hello thaedeus,

Here is a link to a knowledge base article that may resolve your issue: https://dell.to/3biGiIQ (you will need to log in to your EMC account to view it).

Please let us know if you have any other questions.

thaedeus

March 24th, 2020 01:00

Hey @DELL-Sam L!

Thank you for the reply, but the link doesn't work, even after I log in. It redirects me to the main page of Dell US Support.

Regards,
Ted

Phil.Lam

1 Rookie

567 Posts

April 7th, 2020 21:00

KB 425478: OneFS 6.5 and 7.0: The isi_job_d process fails to start on recently added nodes https://support.emc.com/kb/425478

thaedeus

April 9th, 2020 01:00

@PhilLam - thank you so much for the post. Although this article probably helps with many difficult issues, it didn't help us. I'll keep looking and will post once I find a solution.
