
Community Manager

 • 

7.4K Posts


June 9th, 2015 23:00

SmartFail time estimate

hi

Is there any guideline, or even any experience, on how long it will take to complete a SmartFail after a disk failure?

thanks

aya



4 Operator

 • 

1.2K Posts

June 10th, 2015 09:00

> The best way to estimate future FlexProtect runtime is to use old repair runtimes (if you have them) as a guide.  If you don't have such history, then you'll have to guess.

Kip, I couldn't agree more -- and it should be much easier to get at historical job data without much scripting.

It is currently too clumsy:

isi job statistics -- first guess, but no historical data

isi job events -- per job phase, so you need to filter/assemble the per-job data yourself

isi job reports -- need to "list" first, then "view" the data per single job

What would be useful for seeing and understanding trends without much hassle:

isi job history --job-type TYPE

Job-ID     Exit status   Impact     Start Time   Running time    LINs processed  TB processed

(output a single line per job...)
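
In the meantime, a bit of scripting can stitch that history together by hand. Here is a minimal sketch, assuming isi job reports list / isi job reports view ID work the "list first, then view" way described above; the output parsing is guessed rather than taken from any documented format, so adjust the regex to whatever your cluster actually prints:

import re
import subprocess

def run(cmd):
    # Run an isi CLI command and return its stdout as text.
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def job_history(job_type="FlexProtect"):
    # The "list first, then view" dance: list report IDs, then view each one.
    listing = run(["isi", "job", "reports", "list"])
    history = []
    for line in listing.splitlines():
        m = re.match(r"\s*(\d+)\s+(\S+)", line)  # assumed layout: "<id> <job type> ..."
        if m and m.group(2) == job_type:
            history.append((m.group(1), run(["isi", "job", "reports", "view", m.group(1)])))
    return history

for job_id, report in job_history():
    print("=== FlexProtect job", job_id, "===")
    print(report)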


Cheers


-- Peter



2 Intern

 • 

300 Posts

June 9th, 2015 23:00

depends on:

- disk type

- disk size

- free cluster resources

Community Manager

 • 

7.4K Posts

June 10th, 2015 00:00

hi

thank you for the quick reply ... I agree with you on that, but if you have any numbers you know of, that would be great ..

Info I can provide ....

s210

400GB SSD

74% usage

Community Manager

 • 

7.4K Posts

June 10th, 2015 01:00

hi

thank you so much for the reply ..

I may need to get more info from the customer, but so far your advice is great ..

The example I have was:

13 nodes, 400 GB SSD drives --> 3 days to complete,

The example you showed me is more attractive for the customer ... would my example be too long?

thanks

aya

2 Intern

 • 

300 Posts

June 10th, 2015 01:00

Is your node / cluster full of SSDs, or do you have both SSD and HDD? Do you use your SSDs for data, for metadata, or as L3 cache?

As a reference, I'll just make some assumptions to calculate an approximate time:

Assumptions:

- The whole cluster is SSD (so write speed should not matter)

- You use the SSDs for data

- Read speed of one SSD is 400 MB/s (an "okay" value in the consumer market)

- The SSD has the same used capacity as your whole cluster (74%)

Calculation:

Used capacity on the drive: 296,000 MB (400 GB × 0.74 × 1,000)

Time to read all the data: 12.3 minutes (296,000 MB / 400 MB/s / 60)

So under these circumstances I guess a SmartFail should not take longer than 15 minutes.

This value is a GUESS and can vary a lot. You could open an SR or just try it in a test environment.
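
The same back-of-the-envelope arithmetic as a tiny script, so the numbers can be re-run for other drive sizes or fill levels (the 400 MB/s read speed is still just an assumed figure, not a measured one):

def smartfail_read_estimate_minutes(drive_gb, used_fraction, read_mb_per_s=400):
    # Used capacity in MB, divided by the assumed read speed, converted to minutes.
    used_mb = drive_gb * used_fraction * 1000
    return used_mb / read_mb_per_s / 60

# 400 GB SSD at 74% used -> about 12.3 minutes
print(round(smartfail_read_estimate_minutes(400, 0.74), 1))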

4 Operator

 • 

1.2K Posts

June 10th, 2015 02:00

3 to 6 nodes NL400 cluster, 3TB drives, 50-70% full, avg file size 2MB, <50% CPU load

-> smartfailing a single drive usually takes "about one day" on our cluster

Would expect the S210 to be "somewhat" faster...

but still unclear whether yours is an all-SSD config or mixed HDD/SSD.

hth

-- Peter

June 10th, 2015 06:00

I've seen it take a few hours (600GB SAS) to a couple of days (4TB SATA).

This definitely falls into the "it depends" answer.

125 Posts

June 10th, 2015 08:00

The time to repair a drive depends on a lot of variables:

- OneFS release (determines Job Engine version and how efficiently it operates)

- System hardware (determines drive types, amount of CPU and RAM, etc)

- Filesystem (amount of data, makeup of data [lots of small vs large files], protection, tunables, etc)

- Load on the cluster during the drive failure

The best way to estimate future FlexProtect runtime is to use old repair runtimes (if you have them) as a guide.  If you don't have such history, then you'll have to guess.

We've been running some drive failure tests in my lab to help with this a bit, using a known "real" data set made up of a mix of small and large files.  All of the testing to date has been with the cluster idle (i.e. no external load), so I consider these runtimes "best case" at least for the defined setup.

The S210 cluster we tested had the following node configuration:

- 22 x 1.2 TB SAS drives (26 TB/node)

- 2 x 800 GB SSD drives (1.6 TB/node)

- 256 GiB RAM

- 3 nodes running OneFS 7.2.0.0 @ +2d:1n protection, metadata-read SSD storage strategy

- Cluster filled evenly to ~81% capacity

The data makeup was the following:


Bin (B)    Count (n)   %Count   Physical (B)   %Phy
128.0K     87.8m       83.3     1.6T           3.1
256.0K     5.6m        5.3      1.2T           2.4
384.0K     719.1k      0.7      188.5G         0.4
512.0K     3.8m        3.6      1.8T           3.5
640.0K     460.0k      0.4      232.4G         0.4
768.0K     11.1k       0.0      7.3G           0.0
896.0K     3.4k        0.0      2.5G           0.0
1.0M       2.6m        2.5      2.5T           4.8
1.1M       316.3k      0.3      314.3G         0.6
1.2M       8.0k        0.0      9.1G           0.0
1.4M       2.4k        0.0      3.0G           0.0
1.5M       52.0        0.0      73.6M          0.0
2.0M       1.7m        1.6      3.3T           6.2
2.1M       206.0k      0.2      405.8G         0.8
2.2M       4.7k        0.0      10.0G          0.0
2.4M       1.5k        0.0      3.3G           0.0
2.5M       39.0        0.0      96.2M          0.0
2.6M       13.0        0.0      32.5M          0.0
>5M        2.1m        2.0      40.6T          77.8

File count: 105.3m
Physical size: 52.2T (57408570224161)

For this system, with this amount and type of data, a single drive SmartFail took 23774 seconds, or about 6.6 hours, to complete.
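
If it helps for rough planning, that data point can be turned into an effective repair rate and scaled to other drives. This is only a sketch, and it rests on the (easily broken) assumption that repair time scales roughly linearly with the used capacity on the failed drive; all the variables listed above can push it either way:

def estimated_smartfail_hours(drive_tb, used_fraction,
                              ref_seconds=23774, ref_used_tb=1.2 * 0.81):
    # Effective repair rate from the S210 test above: ~0.97 TB in 23774 s.
    rate_tb_per_s = ref_used_tb / ref_seconds
    return drive_tb * used_fraction / rate_tb_per_s / 3600

# Example: a 400 GB SSD at 74% used, scaled from the 1.2 TB SAS reference point.
print(round(estimated_smartfail_hours(0.4, 0.74), 1))  # roughly 2 hours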

125 Posts

June 10th, 2015 11:00

> What would be useful to see and understand trends without much hassle:

>

> isi job history --job-type TYPE

Agreed.  "isi job report" is *almost* there, but not quite.  I'll file a bug.

Community Manager

 • 

7.4K Posts

June 10th, 2015 17:00

hi experts ! ..

Thanks heaps for the great info!

As a beginner with Isilon, I've found it has so much depth in data protection ..

But you guys gave me such a good lesson .... thank you so much ...

thanks

aya

2 Intern

 • 

300 Posts

June 10th, 2015 23:00

wow. never thought it would be so slow.

4 Operator

 • 

1.2K Posts

June 10th, 2015 23:00

Terrific -- thank you!

1 Message

December 10th, 2016 03:00

You can estimate with the FlexProtect job progress:

Type          State     Impact   Pri   Phase   Running Time
------------------------------------------------------------
FlexProtect   Running   Medium   1     2/6     1d 5h 24m



If roughly 1/3 of the FlexProtect phases (2 of 6) took 1 day 5 hours 24 minutes, then it may take about another 2 days to complete the process.
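
The same linear extrapolation, written out (it assumes all phases take about the same time, which is not guaranteed, so treat the result as a ballpark):

def remaining_flexprotect_hours(phases_done, phases_total, elapsed_hours):
    # Linear extrapolation: estimated total = elapsed / fraction complete.
    fraction_done = phases_done / phases_total
    return elapsed_hours / fraction_done - elapsed_hours

# 2 of 6 phases in 1d 5h 24m (~29.4 h) -> roughly 59 more hours, about 2.5 days
elapsed = 24 + 5 + 24 / 60
print(round(remaining_flexprotect_hours(2, 6, elapsed), 1))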





