450 Posts

January 29th, 2015 01:00

The other item to be aware of is that OneFS 7.1 introduced a totally new job engine that permits multiple jobs to run simultaneously, based upon exclusion sets (you can read more about that in the OneFS 7.1 and later admin guides).  Upgrading to 7.1 or later can most likely help you solve this problem, as we've seen with other customers: a low-priority job like SnapshotDelete, for example, might have problems completing because higher-priority jobs keep coming along and interrupting it.
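For example, on a cluster with the newer job engine you can see jobs from different exclusion sets running side by side. A minimal check (the first command is the newer-job-engine syntax as I recall it; verify with isi job --help on your release):

    # list running and queued jobs (newer job engine CLI)
    isi job jobs list
    # the classic summary view also works on 7.0.x
    isi job status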

~Chris Klosterman

Senior SA EMC Isilon Offer & Enablement Team

chris.klosterman@emc.com

twitter: @croaking

4 Operator • 1.2K Posts

March 7th, 2014 07:00

> As soon as the job starts we are seeing disk IO rise from around 30K to 250K.

Not very surprised, but still curious: does MultiScan run at a low, medium, or high impact level?

How many IOPS on SSD, on X400 SATA, on NL400 SATA drives?
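One way to gather those numbers is isi statistics; exact flags vary between OneFS releases, so double-check with isi statistics --help:

    # show per-drive operation rates across all nodes
    isi statistics drive --nodes=all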

> We have a 30-node cluster (20 x X400, 10 x NL). Now the other issue appears to be that the job never has time to complete, due to the snap delete job having a higher priority. 3 x SSD per X400.

What percentage of the time is SnapDelete running?

Can you afford (capacity-wise) to postpone the SnapDelete for some days?

> Could someone please provide some info on what this job does (rebalancing nodes?)

> And I heard that we can break the phases down into their individual jobs? Is this correct?

Not actually two phases, but two tasks in parallel: balancing nodes/disks, and reclaiming space in case files were deleted while a node was offline.

And yes, you can run an AutoBalanceLin job first, which is faster, and run Collect later.
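A minimal sketch of that sequence (job type names as I remember them on 7.0.x; verify with isi job --help):

    # run the faster balancing pass first
    isi job start AutoBalanceLin
    # then reclaim orphaned space once it completes
    isi job start Collect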

> And is this a known issue with MultiScan?

The "issue" is that the time MultiScan takes depends not only on the amount of data (and the cluster dimensions, of course), but also on the number of files, and that dependency is hard to express quantitatively.

SnapshotDelete likewise takes longer the more files expire from the snapshots, but that's more obvious.

Do you happen to know how many files there are in the cluster (and roughly how many on the X and how many on the NL pools)? And how many files have their snaps expired and deleted (say, per day)?

One more thing: have you also applied SSD metadata acceleration to the NL nodes (via GNA, Global Namespace Acceleration)? That can speed things up considerably.

Oh, one last thing: have you configured snapshots of X-pool data to be stored on the NL pool (in the absence of GNA)? That would go really slowly, in particular for many-small-files scenarios.
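If you want to check those two settings, something like the following should work on newer releases (the command paths are from memory, so verify on your version):

    # global storage pool settings, including whether GNA is enabled
    isi storagepool settings view
    # snapshot storage targets are set per file pool policy
    isi filepool policies list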

-- Peter

4 Operator • 1.2K Posts

March 17th, 2014 02:00

Hello Paul,

how is it going? Have you gained any new insights?

-- Peter

4 Posts

March 19th, 2014 12:00

FWIW

Ours:

3 x 72NL

7 x X200

OneFS 7.0.2.4

MultiScan has been running for 9+ days, still on phase 1 of 4.

EMC says NOT to cancel the job and to allow it to run.  This has been blocking 3 other jobs with lower priority.

Not seeing corresponding high IO.

-Shane

4 Operator • 1.2K Posts

March 20th, 2014 08:00

Shane:

in OneFS 7.1, MultiScan can run simultaneously with up to two other jobs (of certain, but not all, types).

Which jobs are blocked for you at the moment?

-- Peter

20 Posts

January 28th, 2015 07:00

We are not seeing the high I/O; however, we are seeing the same behaviour, with our MultiScan job not finishing and preventing other jobs from running.  We are on 7.0.2.4 on a 5-node NL cluster, and the job has been running for 53 days.

4 Operator • 1.2K Posts

January 28th, 2015 23:00

MultiScan has a very low priority by default and thus cannot block other jobs...

Use isi job status -v and check the "Pri" values.

Feel free to share your findings here or discuss them with support.

hth

-- Peter

20 Posts

January 29th, 2015 07:00

This is the output of isi job status -v:

Running jobs:
Job                        Impact  Pri  Policy  Phase  Run Time
-------------------------- ------  ---  ------  -----  ---------
MultiScan[40822]           Low     4    LOW     1/4    54d 23:22
        (Actions: Collect, AutoBalance)
        Progress: Started

Paused and waiting jobs:
Job                        Impact  Pri  Policy  Phase  Run Time  State
-------------------------- ------  ---  ------  -----  --------  -------
MediaScan[41991]           Low     8    LOW     1/7    0:00:00   Waiting
        Progress: n/a
SetProtectPlus[40823]      Low     6    LOW     1/2    0:00:00   Waiting
        Progress: n/a

No failed jobs.

Recent job results:
Time            Job                        Event
--------------- -------------------------- ------------------------------
01/29 09:26:00  SnapshotDelete[46820]      Succeeded (MEDIUM)
01/29 09:15:23  SnapshotDelete[46819]      Succeeded (MEDIUM)
01/29 09:04:15  SnapshotDelete[46818]      Succeeded (MEDIUM)
01/29 08:54:30  SnapshotDelete[46817]      Succeeded (MEDIUM)
01/29 08:33:59  SnapshotDelete[46816]      Succeeded (MEDIUM)
01/29 08:24:28  SnapshotDelete[46815]      Succeeded (MEDIUM)
01/29 08:03:26  SnapshotDelete[46814]      Succeeded (MEDIUM)
01/29 07:53:40  SnapshotDelete[46813]      Succeeded (MEDIUM)

It appears as if SnapshotDelete runs about every 11 minutes. I think I will probably have to open a support case.

4 Operator • 1.2K Posts

January 29th, 2015 08:00

You will lose all the "Collect" effect in an instant -- why throw it away after nearly 2 months...

12 Posts

January 29th, 2015 08:00

Is there a capacity imbalance between nodes (in excess of +/- 5%)?

If not, you can safely stop it.

Luc Simard - 415-793-0989

Senior Technical Account Manager.

Isilon Systems - Simple is Smart™

12 Posts

January 29th, 2015 08:00

You can stop it without negative impact; kick off AutoBalanceLin later if you really need it.

Luc Simard - 415-793-0989

Senior Technical Account Manager.

Isilon Systems - Simple is Smart™


4 Operator • 1.2K Posts

January 29th, 2015 08:00

OK, your MultiScan has priority 4 ("pretty high"), which explains it, because the default is 10 ("lowest").

(AutoBalance by default has prio 4, which is kind of weird IMHO. We pushed it down and made it identical to MultiScan's prio.)

SnapshotDelete by default has prio 2, which lets it run even against your unusually high MultiScan prio.

Can you track down why and when the MultiScan job priority was changed?

There might have been a reason at some time, so yes, better involve support to get things right.
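If support agrees the priority should go back to the default, the change itself is a one-liner on the newer job engine (values mirror the defaults mentioned above; on 7.0.x the syntax differs, so check isi job --help):

    # inspect the configured priority and impact policy for MultiScan
    isi job types view MultiScan
    # restore the default priority
    isi job types modify MultiScan --priority 10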

Good luck

-- Peter

4 Operator • 1.2K Posts

January 29th, 2015 08:00

Don't cancel the MultiScan job; it is really about to finish now (final Collect phase).

SnapshotDelete kicking in every few minutes has been slowing it down for sure. Can you just manually pause SnapshotDelete for now, and then work out a solution to run it less frequently?
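A sketch of that pause/resume step, using 7.0.x-style syntax (newer job engines pause the specific job ID via isi job jobs pause instead):

    # pause the running SnapshotDelete instance
    isi job pause SnapshotDelete
    # resume it once MultiScan has finished
    isi job resume SnapshotDelete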

20 Posts

January 29th, 2015 08:00

Yes, all nodes are within 5%.  We had a reboot to install a patch in November, which caused the MultiScan job to kick off. About 5 or 6 days ago the job status did show 99% on both parts of the job, so I am thinking that the job is essentially complete. I am not aware that the MultiScan job priority was elevated.  I will have to open a support case.  Thanks for the insight.
