
December 22nd, 2016 07:00

Hi wyszynski,

The journal check should not take 2 hours. It should be a pretty instant process, so there is something wrong. Any time you see any kind of journal error, contact Isilon Support immediately and raise an S1 case.

The BMC/CMC error is a known bug and has a fix. There is more information here:

466373 : S210, X210, X410, NL410 or HD400 shows event: 'Node's Baseboard Management Controller (BMC) and/or Chassis Management Controller (CMC) are unresponsive.' https://support.emc.com/kb/466373


December 22nd, 2016 13:00

G'Day sjones5,

Ahhhhhh, that's a major problem. It's still offline at the 18-hour mark and not communicating with the cluster - and that's after 3 SRs and *attempting* to stress the urgency to support staff, who apparently don't normally answer calls. My ULTRA-urgent call is an S3 - that's after 2 separate chat sessions, both ending in "oh, so you need someone there now?" - and it's now been over 15 hours since my initial "we need urgent help on a dead node" chat.

NOT HAPPY DELL/EMC.

So, if the node works today, it will have to resync around 90% of its data between the 3 other HD400s (i.e. they are full).

I don't envy any EMC person who's on call this Christmas weekend - because I'd bet a million $$$$ that someone will get called in for another failure.


December 23rd, 2016 07:00

Do you have a case number to reference? I can make sure that Isilon Support management is aware of what is going on if this hasn't been addressed by now.


December 25th, 2016 12:00

Hi sjones5 - yes, we finally managed to get a little *awareness* at Dell/EMC that this SR was a real problem.

Ultimately, it appears that the node failed on shutdown after my attempts at resetting the BMC & CMC - it just didn't shut down cleanly.

On reboot it then found the journal in error - and worse, it couldn't really be recovered:

Support ran a WebEx and took control, but no luck. So for the next week (hopefully less) I'm *shuffling* my storage pool between the nodes and reducing the data percentage on the HD400s.

Then we are going to smartfail an HD400 and re-add it. I just hope the BMC & CMC errors are a one-off - even though they're a low-risk fault/quirk, this time it was nasty!

*CROSSES FINGERS*

We also managed to start deleting around 200TB - but after locking out all clients and shutting down all access, 2 days in it has only removed around 20TB... hmmmmmm...

FUN FUN FUN! *Snigger*


December 29th, 2016 17:00

Hi Wyszynski,

When you run isi job status on another node, do you see the jobs paused by the system?

I'll wait for your answer.

Thanks


December 29th, 2016 18:00

G'Day Francisco,

Yes - but with the cluster degraded none of them were running - so I modified the job engine settings today and then re-enabled the snapshot delete - it's almost finished now, after 2 hours.

I'll then run a new SmartPools job today to migrate large files (2TB and greater) to the other nodes - hopefully that will free up around 100TB from the failed smartpool tier so we can fully smartfail the HD400 node and re-add it to the cluster next week.

*Crosses fingers*


December 30th, 2016 09:00

Hi

How do you modify the job engine settings so the jobs will run?

Thanks

Ing de Servicios Profesionales

Ing. Francisco Reyes Bautista

freyes@net-brains.com

Cel (52)1 5534664851

Skype nb.francisco.reyes


January 2nd, 2017 19:00

Hi Francisco, it's a change to the job engine settings that allows jobs to run in a degraded state - I would imagine this is very much a *shouldn't-really-touch* type of option.

Here it is for reference - but I would think that support would consider this a last-resort option:

isi_classic job config -p core.run_degraded=True

After this I was able to start/stop and add/change jobs with the cluster being degraded.
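Putting it together, the sequence I used looks roughly like this. Note this is a sketch from memory: the revert with core.run_degraded=False is my assumption of the symmetric syntax, so check with support before relying on it.

```shell
# Last-resort setting: let the job engine run jobs while the cluster is degraded.
isi_classic job config -p core.run_degraded=True

# Confirm jobs can now be started/resumed.
isi job status

# Assumed symmetric syntax: revert once the cluster is healthy again.
isi_classic job config -p core.run_degraded=False
```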


January 2nd, 2018 14:00

Thanks for your answer

I will run the command and let you know the results.

Thanks

I tried to remove a disk, so I executed a smartfail and FlexProtect started, but it failed.

MVSISILON-1# isi devices

Node 1, [ OK ]

  Bay 1        Lnum 12      [HEALTHY]      SN:JPW9K0N12EHRKL      /dev/da1

  Bay 2        Lnum 10      [HEALTHY]      SN:JPW9J0N10YW36V      /dev/da2

  Bay 3        Lnum 9       [HEALTHY]      SN:JPW9J0N10Z109V      /dev/da3

  Bay 4        Lnum 8       [HEALTHY]      SN:JPW9J0N10Z4HRV      /dev/da4

  Bay 5        Lnum 14      [HEALTHY]      SN:JPW9K0N10J962L      /dev/da5

  Bay 6        Lnum 6       [HEALTHY]      SN:JPW9J0N10YWXXV      /dev/da6

  Bay 7        Lnum 5       [HEALTHY]      SN:JPW9J0N10Z4HEV      /dev/da7

  Bay 8        Lnum 4       [HEALTHY]      SN:JPW9J0N10YDDWV      /dev/da8

  Bay 9        Lnum 3       [HEALTHY]      SN:JPW9J0N10X2KBV      /dev/da9

  Bay 10       Lnum 2       [HEALTHY]      SN:JPW9J0N10W491V      /dev/da10

  Bay 11       Lnum 1       [HEALTHY]      SN:JPW9J0N10TMYRV      /dev/da11

  Bay 12       Lnum 0       [HEALTHY]      SN:JPW9J0N10YX7XV      /dev/da12

Unavailable drives:

  Lnum 7    [SUSPENDED]     Last Known Bay N/A

I put the cluster in degraded mode, but FlexProtect is still failing.

Recent finished jobs:

ID   Type               State            Time           

------------------------------------------------------------

3251 WormQueue          Succeeded        2018-01-02T02:00:32

3250 FlexProtect        Failed           2018-01-02T02:16:29

3252 FlexProtect        Failed           2018-01-02T02:28:47

3253 ShadowStoreProtect Succeeded        2018-01-02T04:00:19

3243 MediaScan          Succeeded        2018-01-02T08:19:33

3254 FlexProtect        Failed           2018-01-02T08:52:45

3255 FlexProtect        System Cancelled 2018-01-02T08:57:52

3256 FlexProtect        Failed           2018-01-02T09:10:08

3257 FlexProtect        Failed           2018-01-02T09:22:45

3258 FlexProtect        Failed           2018-01-02T16:10:12

Additionally, I saw the same message on the LCD of the other node: "Test Journal exited with error - Checking Isilon Journal integrity...". The action there was to reset the node to factory settings with the command isi_reformat_node, while smartfailing the node out of the cluster with: isi devices -a smartfail -d (num of node)

No Events found!
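For reference, the recovery path on that node was the following two steps - a sketch only, with the node number left as a placeholder for your failed node:

```shell
# From a healthy node: smartfail the failed node out of the cluster.
isi devices -a smartfail -d <node-number>

# On the failed node itself: reset it to factory settings.
isi_reformat_node
```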
