
Community Manager

 • 

7.4K Posts


June 9th, 2015 23:00

SmartFail time estimate

hi

Is there any guideline, or even any experience, on how long it will take to complete a SmartFail after a disk failure?

thanks

aya



4 Operator

 • 

1.2K Posts

June 10th, 2015 09:00

> The best way to estimate future FlexProtect runtime is to use old repair runtimes (if you have them) as a guide.  If you don't have such history, then you'll have to guess.

Kip, I couldn't agree more -- and it should be much easier to get at historical job data without much scripting.

It is currently too clumsy:

isi job statistics -- first guess, but no historical data

isi job events -- per job phase, so you need to filter/assemble the per-job data yourself

isi job reports -- need to "list" first, then "view" the data per single job

What would be useful for seeing and understanding trends without much hassle:

isi job history --job-type TYPE

Job-ID     Exit status   Impact     Start Time   Running time    LINs processed  TB processed

(output a single line per job...)
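
In the meantime, a bit of scripting can stitch that history together by hand. Here is a minimal sketch, assuming isi job reports list / isi job reports view ID work the "list first, then view" way described above; the output parsing is guessed rather than taken from any documented format, so adjust the regex to whatever your cluster actually prints:

import re
import subprocess

def run(cmd):
    # Run an isi CLI command and return its stdout as text.
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def job_history(job_type="FlexProtect"):
    # The "list first, then view" dance: list report IDs, then view each one.
    listing = run(["isi", "job", "reports", "list"])
    history = []
    for line in listing.splitlines():
        m = re.match(r"\s*(\d+)\s+(\S+)", line)  # assumed layout: "<id> <job type> ..."
        if m and m.group(2) == job_type:
            history.append((m.group(1), run(["isi", "job", "reports", "view", m.group(1)])))
    return history

for job_id, report in job_history():
    print("=== FlexProtect job", job_id, "===")
    print(report)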


Cheers


-- Peter



2 Intern

 • 

300 Posts

June 9th, 2015 23:00

depends on:

- disk type

- disk size

- free cluster resources

Community Manager

 • 

7.4K Posts

June 10th, 2015 00:00

hi

thank you for the quick reply ... I agree with you on that, but if you have any numbers you know of, that would be great ..

Info I can provide ....

s210

400GB SSD

74% usage

Community Manager

 • 

7.4K Posts

June 10th, 2015 01:00

hi

thank you so much for the reply ..

I may need to get more info from the customer, but so far your advice is great ..

The example I have was:

13 nodes, 400 GB SSD drives --> 3 days to complete,

The example you showed me is more attractive for the customer ... would my example be too long?

thanks

aya

2 Intern

 • 

300 Posts

June 10th, 2015 01:00

Is your node / cluster full of SSDs, or do you have both SSD and HDD? Do you use your SSDs for data, for metadata, or as L3 cache?

As a reference, I'll just make some assumptions to calculate an approximate time:

Assumptions:

- The whole cluster is SSD (so write speed should not matter)

- You use the SSDs for data

- Read speed of one SSD is 400 MB/s (an "okay" value in the consumer market)

- The SSD has the same used capacity as your whole cluster (74%)

Calculation:

Used capacity on the drive: 296,000 MB (400 GB × 0.74 × 1,000)

Time to read all the data: 12.3 minutes (296,000 MB / 400 MB/s / 60)

So under these circumstances I guess a SmartFail should not take longer than 15 minutes.

This value is a GUESS and can vary a lot. You could open an SR or just try it in a test environment.
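
The same back-of-the-envelope arithmetic as a tiny script, so the numbers can be re-run for other drive sizes or fill levels (the 400 MB/s read speed is still just an assumed figure, not a measured one):

def smartfail_read_estimate_minutes(drive_gb, used_fraction, read_mb_per_s=400):
    # Used capacity in MB, divided by the assumed read speed, converted to minutes.
    used_mb = drive_gb * used_fraction * 1000
    return used_mb / read_mb_per_s / 60

# 400 GB SSD at 74% used -> about 12.3 minutes
print(round(smartfail_read_estimate_minutes(400, 0.74), 1))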

4 Operator

 • 

1.2K Posts

June 10th, 2015 02:00

3 to 6 nodes NL400 cluster, 3TB drives, 50-70% full, avg file size 2MB, <50% CPU load

-> smartfailing a single drive usually takes "about one day" on our cluster

Would expect the S210 to be "somewhat" faster...

but still unclear whether yours is an all-SSD config or mixed HDD/SSD.

hth

-- Peter

June 10th, 2015 06:00

I've seen it take a few hours (600GB SAS) to a couple of days (4TB SATA).

This definitely falls into the "it depends" answer.

125 Posts

June 10th, 2015 08:00

The time to repair a drive depends on a lot of variables:

- OneFS release (determines Job Engine version and how efficiently it operates)

- System hardware (determines drive types, amount of CPU and RAM, etc)

- Filesystem (amount of data, makeup of data [lots of small vs large files], protection, tunables, etc)

- Load on the cluster during the drive failure

The best way to estimate future FlexProtect runtime is to use old repair runtimes (if you have them) as a guide.  If you don't have such history, then you'll have to guess.

We've been running some drive failure tests in my lab to help with this a bit, using a known "real" data set made up of a mix of small and large files.  All of the testing to date has been with the cluster idle (i.e. no external load), so I consider these runtimes "best case" at least for the defined setup.

The S210 cluster we tested had the following node configuration:

- 22 x 1.2 TB SAS drives (26 TB/node)

- 2 x 800 GB SSD drives (1.6 TB/node)

- 256 GiB RAM

- 3 nodes running OneFS 7.2.0.0 @ +2d:1n protection, metadata-read SSD storage strategy

- Cluster filled evenly to ~81% capacity

The data makeup was the following:


Bin (B)    Count (n)   %Count   Physical (B)   %Phy
128.0K     87.8m       83.3     1.6T           3.1
256.0K     5.6m        5.3      1.2T           2.4
384.0K     719.1k      0.7      188.5G         0.4
512.0K     3.8m        3.6      1.8T           3.5
640.0K     460.0k      0.4      232.4G         0.4
768.0K     11.1k       0.0      7.3G           0.0
896.0K     3.4k        0.0      2.5G           0.0
1.0M       2.6m        2.5      2.5T           4.8
1.1M       316.3k      0.3      314.3G         0.6
1.2M       8.0k        0.0      9.1G           0.0
1.4M       2.4k        0.0      3.0G           0.0
1.5M       52.0        0.0      73.6M          0.0
2.0M       1.7m        1.6      3.3T           6.2
2.1M       206.0k      0.2      405.8G         0.8
2.2M       4.7k        0.0      10.0G          0.0
2.4M       1.5k        0.0      3.3G           0.0
2.5M       39.0        0.0      96.2M          0.0
2.6M       13.0        0.0      32.5M          0.0
>5M        2.1m        2.0      40.6T          77.8

File count: 105.3m
Physical size: 52.2T (57408570224161)

For this system, with this amount and type of data, a single drive SmartFail took 23774 seconds, or about 6.6 hours, to complete.
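
If it helps for rough planning, that data point can be turned into an effective repair rate and scaled to other drives. This is only a sketch, and it rests on the (easily broken) assumption that repair time scales roughly linearly with the used capacity on the failed drive; all the variables listed above can push it either way:

def estimated_smartfail_hours(drive_tb, used_fraction,
                              ref_seconds=23774, ref_used_tb=1.2 * 0.81):
    # Effective repair rate from the S210 test above: ~0.97 TB in 23774 s.
    rate_tb_per_s = ref_used_tb / ref_seconds
    return drive_tb * used_fraction / rate_tb_per_s / 3600

# Example: a 400 GB SSD at 74% used, scaled from the 1.2 TB SAS reference point.
print(round(estimated_smartfail_hours(0.4, 0.74), 1))  # roughly 2 hours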

125 Posts

June 10th, 2015 11:00

> What would be useful to see and understand trends without much hassle:

>

> isi job history --job-type TYPE

Agreed.  "isi job report" is *almost* there, but not quite.  I'll file a bug.

Community Manager

 • 

7.4K Posts

June 10th, 2015 17:00

hi experts ! ..

Thanks heaps for the great info!

As a beginner with Isilon, I've found it has so much depth in data protection ..

But you guys gave me such a good lesson .... thank you so much ...

thanks

aya

2 Intern

 • 

300 Posts

June 10th, 2015 23:00

wow. never thought it would be so slow.

4 Operator

 • 

1.2K Posts

June 10th, 2015 23:00

Terrific -- thank you!

1 Message

December 10th, 2016 03:00

You can estimate with the FlexProtect job progress:

Type          State     Impact   Pri   Phase   Running Time
------------------------------------------------------------
FlexProtect   Running   Medium   1     2/6     1d 5h 24m



If roughly 1/3 of the FlexProtect phases (2 of 6) took 1 day 5 hours 24 minutes, then it may take about another 2 days to complete the process.
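
The same linear extrapolation, written out (it assumes all phases take about the same time, which is not guaranteed, so treat the result as a ballpark):

def remaining_flexprotect_hours(phases_done, phases_total, elapsed_hours):
    # Linear extrapolation: estimated total = elapsed / fraction complete.
    fraction_done = phases_done / phases_total
    return elapsed_hours / fraction_done - elapsed_hours

# 2 of 6 phases in 1d 5h 24m (~29.4 h) -> roughly 59 more hours, about 2.5 days
elapsed = 24 + 5 + 24 / 60
print(round(remaining_flexprotect_hours(2, 6, elapsed), 1))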





