This post is more than 5 years old
Community Manager
•
7.4K Posts
0
9132
June 9th, 2015 23:00
SmartFail Estimate time
hi
Is there any guideline, or even experience, for how long it will take to complete a SmartFail upon disk failure?
thanks
aya
Peter_Sero
4 Operator
•
1.2K Posts
0
June 10th, 2015 09:00
> The best way to estimate future FlexProtect runtime is to use old repair runtimes (if you have them) as a guide. If you don't have such history, then you'll have to guess.
Kip, I couldn't agree more -- and it should be much easier to get historical job data without much scripting then.
It is too clumsy currently:
isi job statistics -- first guess, but no historical data
isi job events -- per job phase, need to filter/assemble per job data
isi job reports -- need to "list" first, then "view" data per single job
What would be useful to see and understand trends without much hassle:
isi job history --job-type TYPE
Job-ID  Exit status  Impact  Start Time  Running time  LINs processed  TB processed
(output a single line per job...)
Cheers
-- Peter
sluetze
2 Intern
•
300 Posts
0
June 9th, 2015 23:00
depends on:
- disk type
- disk size
- free cluster resources
ayas
Community Manager
•
7.4K Posts
0
June 10th, 2015 00:00
hi
thank you for the quick reply ... I agree with you on that, but if you have any numbers you know of, that would be great ..
Info I can provide ....
s210
400GB SSD
74% usage
ayas
Community Manager
•
7.4K Posts
0
June 10th, 2015 01:00
hi
thank you so much for the reply ..
I may need to get more info from the customer, but so far your advice is great ..
An example I have was:
13 nodes, 400GB SSD drives --> 3 days to complete
The example you showed me is more attractive for the customer ,,, would my example be too long?
thanks
aya
sluetze
2 Intern
•
300 Posts
0
June 10th, 2015 01:00
Is your node / cluster full of SSDs or do you have SSD and HDD? Do you use your SSDs for Data or for Metadata or as L3-Cache?
As a reference I'll just make some assumptions to calculate an approximate time:
Assumptions:
The cluster is all-SSD (so write speed should not matter)
You use the SSDs for data
Read speed of one SSD is 400MB/s (an "okay" value in the consumer market)
The SSD has the same used capacity as your whole cluster (74%)
Calculation:
Used capacity on the drive: 296,000MB (400GB × 0.74 × 1000)
Time to read all the data: 12.3 minutes (296,000MB / 400MB/s / 60)
So under these circumstances I guess a SmartFail should not take longer than 15 minutes.
This value is a GUESS which can vary a lot. You could open an SR or just try it in a test environment.
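The back-of-the-envelope estimate above can be sketched in a few lines of Python. The drive size, utilization, and read speed are the assumed values from this post, not measured figures, and the result is a naive lower bound:

```python
def estimate_smartfail_minutes(drive_gb, used_fraction, read_mb_per_s):
    """Naive lower-bound estimate: time to read the used data once.

    Real SmartFail times are usually much longer, since FlexProtect
    also rewrites data, walks metadata, and competes with cluster load.
    """
    used_mb = drive_gb * used_fraction * 1000  # GB -> MB (decimal)
    return used_mb / read_mb_per_s / 60        # seconds -> minutes

# Values assumed in this post: 400GB SSD, 74% used, 400MB/s read speed
print(round(estimate_smartfail_minutes(400, 0.74, 400), 1))  # -> 12.3
```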
Peter_Sero
4 Operator
•
1.2K Posts
0
June 10th, 2015 02:00
3 to 6 nodes NL400 cluster, 3TB drives, 50-70% full, avg file size 2MB, <50% CPU load
-> smartfailing a single drive usually takes "about one day" on our cluster
Would expect the S210 to be "somewhat" faster...
but still unclear whether yours is an all-SSD config or mixed HDD/SSD.
hth
-- Peter
Anonymous User
170 Posts
0
June 10th, 2015 06:00
I've seen it take a few hours (600GB SAS) to a couple of days (4TB SATA).
This definitely falls into the "it depends" answer.
kipcranford
125 Posts
2
June 10th, 2015 08:00
The time to repair a drive depends on a lot of variables:
- OneFS release (determines Job Engine version and how efficiently it operates)
- System hardware (determines drive types, amount of CPU and RAM, etc)
- Filesystem (amount of data, makeup of data [lots of small vs large files], protection, tunables, etc)
- Load on the cluster during the drive failure
The best way to estimate future FlexProtect runtime is to use old repair runtimes (if you have them) as a guide. If you don't have such history, then you'll have to guess.
We've been running some drive failure tests in my lab to help with this a bit, using a known "real" data set made up of a mix of small and large files. All of the testing to date has been with the cluster idle (i.e. no external load), so I consider these runtimes "best case" at least for the defined setup.
The S210 cluster we tested had the following node configuration:
- 22 x 1.2 TB SAS drives (26 TB/node)
- 2 x 800 GB SSD drives (1.6 TB/node)
- 256 GiB RAM
- 3 nodes running OneFS 7.2.0.0 @ +2d:1n protection, metadata-read SSD storage strategy
- Cluster filled evenly to ~81% capacity
The data makeup was the following:
For this system, with this amount and type of data, a single drive SmartFail took 23774 seconds, or about 6.6 hours, to complete.
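The quoted figure checks out arithmetically (the 23774-second runtime is the lab measurement reported above):

```python
# Runtime measured in the lab test above: 23774 seconds for a
# single-drive SmartFail on the 3-node S210 cluster described.
runtime_seconds = 23774
runtime_hours = runtime_seconds / 3600
print(f"{runtime_hours:.1f} hours")  # -> 6.6 hours
```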
kipcranford
125 Posts
2
June 10th, 2015 11:00
> What would be useful to see and understand trends without much hassle:
>
> isi job history --job-type TYPE
Agreed. "isi job report" is *almost* there, but not quite. I'll file a bug.
ayas
Community Manager
•
7.4K Posts
0
June 10th, 2015 17:00
hi experts! ..
Thanks heaps for the great info!
As a beginner with Isilon, I found that Isilon has so much depth in data protection ..
But you guys gave me such a good lecture .... thank you so much ...
thanks
aya
sluetze
2 Intern
•
300 Posts
0
June 10th, 2015 23:00
wow. never thought it would be so slow.
Peter_Sero
4 Operator
•
1.2K Posts
0
June 10th, 2015 23:00
Terrific -- thank you!
cowlee007
1 Message
0
December 10th, 2016 03:00
You can estimate with the FlexProtect job progress:
Type State Impact Pri Phase Running Time
------------------------------------------------------------------------
FlexProtect Running Medium 1 2/6 1d 5h 24m
If 1/3 of the FlexProtect progress (2 of 6 phases) took 1 day 5 hours 24 minutes, then it may take roughly another 2 days to complete the process.
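The rough extrapolation above can be sketched like this. It assumes all six phases take equal time, which FlexProtect phases generally do not, so treat it as a coarse order-of-magnitude estimate:

```python
def estimate_remaining_seconds(phases_done, phases_total, elapsed_seconds):
    """Linear extrapolation from job-phase progress.

    Assumes every phase takes the same time, which is rarely true
    for FlexProtect; use only as a rough estimate.
    """
    fraction_done = phases_done / phases_total
    total_estimate = elapsed_seconds / fraction_done
    return total_estimate - elapsed_seconds

# "1d 5h 24m" elapsed with 2 of 6 phases complete, as in the post
elapsed = 1 * 86400 + 5 * 3600 + 24 * 60   # 105840 seconds
remaining = estimate_remaining_seconds(2, 6, elapsed)
print(round(remaining / 86400), "more days")  # -> about 2 more days
```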