sjones51
252 Posts
2
December 22nd, 2016 07:00
Hi wyszynski,
The journal check should not take 2 hours. It should be a pretty instant process, so there is something wrong. Any time you see any kind of journal error, contact Isilon Support immediately and raise an S1 case.
The BMC&CMC error is a bug and has a fix. There is more information here:
466373 : S210, X210, X410, NL410 or HD400 shows event: 'Node's Baseboard Management Controller (BMC) and/or Chassis Management Controller (CMC) are unresponsive.' https://support.emc.com/kb/466373
LeoWski
34 Posts
0
December 22nd, 2016 13:00
G'Day sjones5,
Ahhhhhh, that's a major problem. It's still offline now at the 18-hour mark and not communicating with the cluster. After 3 SRs, and after *attempting* to stress the urgency to support staff (who apparently don't normally answer calls), my ULTRA urgent call is an S3. That's after 2 separate chat sessions, both ending in "oh, so you need someone there now?", and it's now been over 15 hours since my initial "we need urgent help on a dead node" chat.
NOT HAPPY DELL/EMC.
So, if the node works today, it will have to resync around 90% of its data between the 3 other HD400s (i.e. they are full).
I don't envy any EMC person on call this Christmas weekend - because I'd bet a million $$$$ that someone will get called in due to another failure.
sjones51
252 Posts
0
December 23rd, 2016 07:00
Do you have a case number to reference? I can make sure that Isilon Support management is aware of what is going on if this hasn't been addressed by now.
LeoWski
34 Posts
0
December 25th, 2016 12:00
Hi sjones5 - yes, we finally managed to get a little *awareness* at DELL/EMC that this SR was a real problem.
Ultimately it appears that the node failed on shutdown after my attempts at resetting the BMC & CMC - it just didn't shut down cleanly.
On reboot it then found the journal in error - worse, it couldn't really be recovered.
Support had a WebEx and took control, but no luck. I'm now spending the next week (hopefully less) *shuffling* my storage pool between the nodes and reducing the data percentage on the HD400s.
Then we are going to SmartFail an HD400 and re-add it. I just hope the BMC & CMC errors are a one-off - even though they are a low-risk fault/quirk, this time it was nasty!
*CROSSES FINGERS*
We also managed to start deleting around 200TB - but so far, after locking out all clients and shutting down all access, it has only removed around 20TB in 2 days....hmmmmmm...
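For a rough sense of how long that delete will take at the pace above (~20TB gone in 2 days out of ~200TB queued), a back-of-the-envelope sketch; actual throughput on a degraded cluster will of course vary:

```python
# Rough ETA for the bulk delete, based on the observed rate:
# ~20 TB removed in the first 2 days, ~200 TB queued in total.
def delete_eta_days(total_tb, deleted_tb, elapsed_days):
    """Days remaining if deletion continues at the observed rate."""
    rate_tb_per_day = deleted_tb / elapsed_days   # ~10 TB/day here
    return (total_tb - deleted_tb) / rate_tb_per_day

print(delete_eta_days(200, 20, 2))  # ~18 more days at this pace
```

At ~10TB/day, the remaining ~180TB is nearly three more weeks of deleting, which matches the "hmmmmmm" above.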
FUN FUN FUN! *Snigger.*
Paconet1
1 Rookie
•
68 Posts
0
December 29th, 2016 17:00
Hi Wyszynski,
Are the jobs paused by the system when you run isi job status on another node?
I'll wait for your answer.
Thanks
LeoWski
34 Posts
0
December 29th, 2016 18:00
G'Day Francisco,
Yes - but with the cluster being degraded, none are working. So I modified the job engine settings today and then re-enabled the snapshot deletes - it's almost finished now after 2 hours.
I'll then run a new SmartPools job today to migrate large files (2TB and greater) to the other nodes - hopefully that will free up around 100TB from the failed storage pool tier so we can fully SmartFail the HD400 node and re-add it to the cluster next week.
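To preview which files a "2TB and greater" policy would catch before running the migration, a generic scan sketch - this is plain filesystem walking, not the OneFS SmartPools engine, and the /ifs/data path is a placeholder:

```python
import os

def files_over(root, threshold_bytes):
    """Yield (path, size) for regular files at or above threshold_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or unreadable; skip it
            if size >= threshold_bytes:
                yield path, size

TWO_TB = 2 * 1024**4
for path, size in files_over("/ifs/data", TWO_TB):  # placeholder path
    print(f"{size / 1024**4:.2f} TiB  {path}")
```

Summing the sizes it reports gives an estimate of how close the move comes to the ~100TB target before committing to the file pool policy.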
*Crosses fingers*
Paconet1
1 Rookie
•
68 Posts
0
December 30th, 2016 09:00
Hi
How do you modify the job engine settings so the jobs will run?
Thanks
Professional Services Engineer
Eng. Francisco Reyes Bautista
freyes@net-brains.com
Cel (52)1 5534664851
Skype nb.francisco.reyes
LeoWski
34 Posts
0
January 2nd, 2017 19:00
Hi Francisco, it's a change to the job engine settings that allows the jobs to run in a degraded state - I would imagine this is very much a *shouldn't really touch* type of option.
Here it is for reference - but I would think that support would feel that this is a last resort option:
isi_classic job config -p core.run_degraded=True
After this I was able to start/stop and add/change jobs with the cluster being degraded.
Paconet1
1 Rookie
•
68 Posts
0
January 2nd, 2018 14:00
Thanks for your answer
I will run the command and I will tell you the results
Thanks
I tried to remove a disk, so I executed a smartfail and the FlexProtect started, but it failed.
MVSISILON-1# isi devices
Node 1, [ OK ]
Bay 1 Lnum 12 [HEALTHY] SN:JPW9K0N12EHRKL /dev/da1
Bay 2 Lnum 10 [HEALTHY] SN:JPW9J0N10YW36V /dev/da2
Bay 3 Lnum 9 [HEALTHY] SN:JPW9J0N10Z109V /dev/da3
Bay 4 Lnum 8 [HEALTHY] SN:JPW9J0N10Z4HRV /dev/da4
Bay 5 Lnum 14 [HEALTHY] SN:JPW9K0N10J962L /dev/da5
Bay 6 Lnum 6 [HEALTHY] SN:JPW9J0N10YWXXV /dev/da6
Bay 7 Lnum 5 [HEALTHY] SN:JPW9J0N10Z4HEV /dev/da7
Bay 8 Lnum 4 [HEALTHY] SN:JPW9J0N10YDDWV /dev/da8
Bay 9 Lnum 3 [HEALTHY] SN:JPW9J0N10X2KBV /dev/da9
Bay 10 Lnum 2 [HEALTHY] SN:JPW9J0N10W491V /dev/da10
Bay 11 Lnum 1 [HEALTHY] SN:JPW9J0N10TMYRV /dev/da11
Bay 12 Lnum 0 [HEALTHY] SN:JPW9J0N10YX7XV /dev/da12
Unavailable drives:
Lnum 7 [SUSPENDED] Last Known Bay N/A
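Output like the isi devices listing above can be screened programmatically for anything that is not HEALTHY; a minimal parsing sketch, assuming the bracketed-status layout shown (this is just text filtering, not an OneFS API):

```python
def unhealthy_drives(isi_devices_output):
    """Return lines from `isi devices`-style output whose bracketed
    status is anything other than [HEALTHY]."""
    flagged = []
    for line in isi_devices_output.splitlines():
        line = line.strip()
        if line.startswith("Node"):
            continue  # skip the node header line, e.g. "Node 1, [ OK ]"
        if "[" in line and "]" in line and "[HEALTHY]" not in line:
            flagged.append(line)
    return flagged

sample = """\
Node 1, [ OK ]
Bay 1 Lnum 12 [HEALTHY] SN:JPW9K0N12EHRKL /dev/da1
Lnum 7 [SUSPENDED] Last Known Bay N/A"""
print(unhealthy_drives(sample))  # only the SUSPENDED Lnum 7 line
```

Run against the full listing above, the only line it flags is the suspended Lnum 7 drive, which is the one being smartfailed.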
I put the cluster in degraded mode, but the FlexProtect is failing.
Recent finished jobs:
ID Type State Time
------------------------------------------------------------
3251 WormQueue Succeeded 2018-01-02T02:00:32
3250 FlexProtect Failed 2018-01-02T02:16:29
3252 FlexProtect Failed 2018-01-02T02:28:47
3253 ShadowStoreProtect Succeeded 2018-01-02T04:00:19
3243 MediaScan Succeeded 2018-01-02T08:19:33
3254 FlexProtect Failed 2018-01-02T08:52:45
3255 FlexProtect System Cancelled 2018-01-02T08:57:52
3256 FlexProtect Failed 2018-01-02T09:10:08
3257 FlexProtect Failed 2018-01-02T09:22:45
3258 FlexProtect Failed 2018-01-02T16:10:12
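Repeated failures like the FlexProtect runs above can be tallied straight from the job listing; a small sketch assuming the whitespace-separated "ID Type State Time" layout shown (the state field may be multi-word, e.g. "System Cancelled"):

```python
from collections import Counter

def failed_job_counts(job_listing):
    """Count Failed runs per job type in `isi job`-style output."""
    counts = Counter()
    for line in job_listing.splitlines():
        fields = line.split()
        # Expect: ID, Type, State..., Timestamp; only count exact "Failed"
        if len(fields) >= 4 and fields[0].isdigit() and fields[2] == "Failed":
            counts[fields[1]] += 1
    return counts

listing = """\
3251 WormQueue Succeeded 2018-01-02T02:00:32
3250 FlexProtect Failed 2018-01-02T02:16:29
3255 FlexProtect System Cancelled 2018-01-02T08:57:52
3258 FlexProtect Failed 2018-01-02T16:10:12"""
print(failed_job_counts(listing))  # Counter({'FlexProtect': 2})
```

On the full table above this counts six failed FlexProtect runs in a single day, a clear signal to loop support back in rather than keep retrying.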
Additionally, the LCD on the other node showed the same message: "Test Journal exited with error - Checking Isilon Journal integrity...". The action there was to reset the node to factory settings with the command isi_reformat_node, while I executed a smartfail of the node in the cluster: isi devices -a smartfail -d (num of node)