"system halted" jobs?

This one's just out of curiosity. Some nodes were recently added to a cluster, and the cluster is then supposed to rebalance the data across the new nodes. When I checked after a few minutes to make sure the job was running, I saw a "system halted" status for the job. When I restarted the job, it started running as it should. I was just wondering if anyone knows what criteria the Isilon uses to make this kind of judgement call. Why completely halted instead of just paused?

There were no other jobs of higher or equal priority waiting to run at that time either. I figured I would ask the community before I ask support.

Thanks in advance!

Responses(1)

Peter_Sero

November 22nd, 2013 22:00

The Collect and the MultiScan (= AutoBalance + Collect) jobs choke on disk stalls and group changes. They do so silently, no failure, no re-launch nor notification... and they still do in 7.1, issue ID 97024.

With MultiScan it's actually the Collect part that halts the whole job, so you can run a pure AutoBalance manually to get the data rebalanced to the new nodes.

Then run Collect at a low priority in the "background". (With 7.1 Collect can run truly in parallel with other jobs.)

And of course, check /var/log/messages for stalls and group changes.
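
For the log check, something along these lines might save some scrolling. It is only a rough sketch: the exact wording of the stall and group change messages isn't quoted in this thread, so the patterns below are loose keyword matches that I am assuming, and the names `find_job_disruptions` and `scan_log` are made up for illustration. Adjust the patterns to whatever your nodes actually log.

```python
import re
from typing import Iterable, List, Tuple

# Assumed patterns: loose, case-insensitive keyword matches. The real
# message text on a given OneFS release may differ, so tune these.
PATTERNS = {
    "stall": re.compile(r"stall", re.IGNORECASE),
    "group change": re.compile(r"group\s*change", re.IGNORECASE),
}


def find_job_disruptions(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, label, text) for every line matching a pattern."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label, line.rstrip("\n")))
    return hits


def scan_log(path: str = "/var/log/messages") -> List[Tuple[int, str, str]]:
    """Scan one log file; undecodable bytes are replaced rather than fatal."""
    with open(path, "r", errors="replace") as handle:
        return find_job_disruptions(handle)
```

Calling `scan_log()` on a node and printing the result gives a quick list of candidate lines to compare against the time the job went to "system halted".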

The lesson is that one should understand /and/ monitor the OneFS jobs very closely.

Cheers

-- Peter
