
December 22nd, 2016 07:00

Hi wyszynski,

The journal check should not take 2 hours. It should be a pretty instant process, so there is something wrong. Any time you see any kind of journal error, contact Isilon Support immediately and raise an S1 case.

The BMC/CMC error is a known bug and has a fix. There is more information here:

466373 : S210, X210, X410, NL410 or HD400 shows event: 'Node's Baseboard Management Controller (BMC) and/or Chassis Management Controller (CMC) are unresponsive.' https://support.emc.com/kb/466373


December 22nd, 2016 13:00

G'Day sjones5,

Ahhhhhh, that's a major problem. It's still offline at the 18-hour mark and not communicating with the cluster - and that's after 3 SRs and *attempting* to stress the urgency to support staff, who apparently don't normally answer calls. My ULTRA-urgent call is an S3 - that's after 2 separate chat sessions, both ending in "oh, so you need someone there now?" - and it's now been over 15 hours since my initial "we need urgent help on a dead node" chat.

NOT HAPPY DELL/EMC.

So, if the node works today, it will have to resync around 90% of its data between the 3 other HD400s (i.e. they are full).

I don't envy any EMC person who's on call this Christmas weekend - because I'd bet a million $$$$ that someone will get called in for another failure.


December 23rd, 2016 07:00

Do you have a case number to reference? I can make sure that Isilon Support management is aware of what is going on if this hasn't been addressed by now.


December 25th, 2016 12:00

Hi sjones5 - yes, we finally managed to get a little *awareness* at Dell/EMC that this SR was a real problem.

Ultimately, it appears that the node failed on shutdown after my attempts at resetting the BMC & CMC - it just didn't shut down cleanly.

On reboot it then found the journal in error - and worse, it couldn't really be recovered:

Support ran a WebEx and took control, but no luck. So for the next week (hopefully less) I'm *shuffling* my storage pool between the nodes and reducing the data percentage on the HD400s.

Then we are going to smartfail an HD400 and re-add it. I just hope the BMC & CMC errors are a one-off - even though they're a low-risk fault/quirk, this time it was nasty!

*CROSSES FINGERS*

We also managed to start deleting around 200TB - but after locking out all clients and shutting down all access, 2 days in it has only removed around 20TB... hmmmmmm...

FUN FUN FUN! *Snigger*


December 29th, 2016 17:00

Hi Wyszynski,

When you run isi job status on another node, do you see the jobs paused by the system?

I'll wait for your answer.

Thanks


December 29th, 2016 18:00

G'Day Francisco,

Yes - but with the cluster degraded none of them were running - so I modified the job engine settings today and then re-enabled the snapshot delete - it's almost finished now, after 2 hours.

I'll then run a new SmartPools job today to migrate large files (2TB and greater) to the other nodes - hopefully that will free up around 100TB from the failed smartpool tier so we can fully smartfail the HD400 node and re-add it to the cluster next week.

*Crosses fingers*


December 30th, 2016 09:00

Hi

How do you modify the job engine settings so the jobs will run?

Thanks

Ing de Servicios Profesionales

Ing. Francisco Reyes Bautista

freyes@net-brains.com

Cel (52)1 5534664851

Skype nb.francisco.reyes


January 2nd, 2017 19:00

Hi Francisco, it's a change to the job engine settings that allows jobs to run in a degraded state - I would imagine this is very much a *shouldn't-really-touch* type of option.

Here it is for reference - but I would think that support would consider this a last-resort option:

isi_classic job config -p core.run_degraded=True

After this I was able to start/stop and add/change jobs with the cluster being degraded.
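Putting it together, the sequence I used looks roughly like this. Note this is a sketch from memory: the revert with core.run_degraded=False is my assumption of the symmetric syntax, so check with support before relying on it.

```shell
# Last-resort setting: let the job engine run jobs while the cluster is degraded.
isi_classic job config -p core.run_degraded=True

# Confirm jobs can now be started/resumed.
isi job status

# Assumed symmetric syntax: revert once the cluster is healthy again.
isi_classic job config -p core.run_degraded=False
```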


January 2nd, 2018 14:00

Thanks for your answer

I will run the command and let you know the results.

Thanks

I tried to remove a disk, so I executed a smartfail and FlexProtect started, but it failed.

MVSISILON-1# isi devices

Node 1, [ OK ]

  Bay 1        Lnum 12      [HEALTHY]      SN:JPW9K0N12EHRKL      /dev/da1

  Bay 2        Lnum 10      [HEALTHY]      SN:JPW9J0N10YW36V      /dev/da2

  Bay 3        Lnum 9       [HEALTHY]      SN:JPW9J0N10Z109V      /dev/da3

  Bay 4        Lnum 8       [HEALTHY]      SN:JPW9J0N10Z4HRV      /dev/da4

  Bay 5        Lnum 14      [HEALTHY]      SN:JPW9K0N10J962L      /dev/da5

  Bay 6        Lnum 6       [HEALTHY]      SN:JPW9J0N10YWXXV      /dev/da6

  Bay 7        Lnum 5       [HEALTHY]      SN:JPW9J0N10Z4HEV      /dev/da7

  Bay 8        Lnum 4       [HEALTHY]      SN:JPW9J0N10YDDWV      /dev/da8

  Bay 9        Lnum 3       [HEALTHY]      SN:JPW9J0N10X2KBV      /dev/da9

  Bay 10       Lnum 2       [HEALTHY]      SN:JPW9J0N10W491V      /dev/da10

  Bay 11       Lnum 1       [HEALTHY]      SN:JPW9J0N10TMYRV      /dev/da11

  Bay 12       Lnum 0       [HEALTHY]      SN:JPW9J0N10YX7XV      /dev/da12

Unavailable drives:

  Lnum 7    [SUSPENDED]     Last Known Bay N/A

I put the cluster in degraded mode, but FlexProtect is still failing.

Recent finished jobs:

ID   Type               State            Time           

------------------------------------------------------------

3251 WormQueue          Succeeded        2018-01-02T02:00:32

3250 FlexProtect        Failed           2018-01-02T02:16:29

3252 FlexProtect        Failed           2018-01-02T02:28:47

3253 ShadowStoreProtect Succeeded        2018-01-02T04:00:19

3243 MediaScan          Succeeded        2018-01-02T08:19:33

3254 FlexProtect        Failed           2018-01-02T08:52:45

3255 FlexProtect        System Cancelled 2018-01-02T08:57:52

3256 FlexProtect        Failed           2018-01-02T09:10:08

3257 FlexProtect        Failed           2018-01-02T09:22:45

3258 FlexProtect        Failed           2018-01-02T16:10:12

Additionally, I saw the same message on the LCD of the other node: "Test Journal exited with error - Checking Isilon Journal integrity...". The action there was to reset the node to factory settings with the command isi_reformat_node, while smartfailing the node out of the cluster with: isi devices -a smartfail -d (num of node)

No Events found!
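For reference, the recovery path on that node was the following two steps - a sketch only, with the node number left as a placeholder for your failed node:

```shell
# From a healthy node: smartfail the failed node out of the cluster.
isi devices -a smartfail -d <node-number>

# On the failed node itself: reset it to factory settings.
isi_reformat_node
```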
