How can I know when a disk has broken?

Hi guys,

I understand that when Isilon identifies a broken disk, a FlexProtect job runs automatically.

I checked the 'disk repair initiated' check box in Edit Notification Rule, and set recipients for both email and SNMP notifications.


When I ran smartfail to remove one disk from the cluster, I received an SNMP notification, but no email notification was sent. Of course, I checked the test event, and the SNMP notification was sent successfully.

Have you ever received a 'disk repair initiated' notification when FlexProtect starts? Or do you have any other ideas?

I am using the latest OS, OneFS v6.5.5.26.


Thank you,

Masahiro

Accepted Solutions

Re: how can I know when disk was broken?

Masahiro,

To be clear, OneFS 6.5.5.26 may be the latest GA code in the 6.5 family, but it is certainly not the latest on the platform (7.0.1.10, 7.0.2.5, and 7.1.0 are the latest). The latest GA code for the platform can always be found on support.emc.com; search for "Current Isilon Software Releases".

Or use this link:

https://support.emc.com/docu46145_Current_Isilon_Software_Releases.pdf?language=en_US

As for your failed disk: a smartfail that you initiate from the CLI or GUI is far less likely to generate an event than a disk that actually fails. Yes, a FlexProtect job kicks off automatically, but running a FlexProtect job is not a notable issue; the disk failure is. Personally, I usually create a notification rule for email, select all emergency and all critical events, and have them emailed to a distribution group for the storage team. A similar rule for SNMP is appropriate. If you think you've found a particular event that sends SNMP traps but not emails when the thresholds are the same for both rules, please open a Service Request with EMC support (support.emc.com). That way, if it is a bug in the code, it can be corrected.
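A minimal Python sketch of the reasoning above, assuming a simplified model in which each notification rule is just a delivery channel plus a set of severities. `NotificationRule`, `channels_for`, and the "warning" severity are hypothetical names for illustration, not the OneFS configuration API. With identical severity sets, any emergency or critical event should map to both email and SNMP, so an event that reaches only one channel is the case worth reporting to support.

```python
from dataclasses import dataclass


# Hypothetical model of a notification rule; these names are illustrative
# and are NOT the OneFS configuration API.
@dataclass(frozen=True)
class NotificationRule:
    channel: str           # delivery channel, e.g. "email" or "snmp"
    severities: frozenset  # event severities this rule matches


def channels_for(severity, rules):
    """Return the sorted list of channels whose rule matches this severity."""
    return sorted(rule.channel for rule in rules if severity in rule.severities)


# Same thresholds for both channels: all emergency and all critical events.
ALERT_LEVELS = frozenset({"emergency", "critical"})
RULES = [
    NotificationRule("email", ALERT_LEVELS),
    NotificationRule("snmp", ALERT_LEVELS),
]

if __name__ == "__main__":
    print(channels_for("critical", RULES))  # ['email', 'snmp']
    print(channels_for("warning", RULES))   # [] (hypothetical lower severity)
```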

Hope this helps,

Chris Klosterman

Twitter: @croaking

Senior Solution Architect

Offer and Enablement Team

EMC Isilon

gshah1

Re: how can I know when disk was broken?

An event should be generated telling you that a disk has failed. OneFS will also show an amber or red dot beside the node number on the dashboard. If you click the node and scroll down to the drive bay area, it will show the failed drive bay in red.

I myself have not seen the 'disk repair initiated' notification.

Hope this helps.

Re: how can I know when disk was broken?

Also, if you have SupportIQ or ESRS (OneFS 7.1 only) enabled, then when a disk actually fails, the cluster should call home and automatically open a support case for the issue. Back when I was managing Celerras day to day, it was quite common for me to hear from EMC support about a failed disk before I had even found it myself.

Chris Klosterman, ICSP, ICIE, CCNA, VCP

Senior Solution Architect

Offer and Enablement Team

EMC² | Isilon Storage Division

dynamox

Re: how can I know when disk was broken?

It will not send out an email until the SmartFail or FlexProtect job (I can't remember which exactly) completes.

TanyaLB

Re: how can I know when disk was broken?

Does the dial-home happen automatically, or do you still have to configure notification rules?

dynamox

Re: how can I know when disk was broken?

Rules need to be configured.

chughh

Re: how can I know when disk was broken?

Hello,

Yes, you can check /var/log/messages.

For example, every drive state change is logged as a group change. In the entry below, you can see that a drive was soft-failed on 21 December at 23:45 (node 20, drive 34 changed to down):

2014-12-21T23:45:25-05:00 <0.4> DASNAS01-20(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1706](pid 60250="kt: gmp-drive-updat")(tid=101054) group change: <20,10359> [up: 29 nodes, down: 1 drive, soft_failed: 1 drive] (node 20 drive 34 changed to down)
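If you want to pull these entries out of a long log, a small Python filter is enough. This is a sketch that assumes every drive change follows the same pattern as the single example line above (timestamp first, ending in "(node N drive D changed to STATE)"); other OneFS versions may word the message differently. `DRIVE_CHANGE` and `drive_changes` are hypothetical names.

```python
import re

# Assumed format, inferred from the single example line above: the timestamp
# is the first field, and the message ends with
# "(node N drive D changed to STATE)".
DRIVE_CHANGE = re.compile(
    r"^(?P<ts>\S+) .*group change: .*"
    r"\(node (?P<node>\d+) drive (?P<drive>\d+) changed to (?P<state>\w+)\)"
)


def drive_changes(lines):
    """Yield (timestamp, node, drive, new_state) for each drive group change."""
    for line in lines:
        match = DRIVE_CHANGE.search(line)
        if match:
            yield (
                match.group("ts"),
                int(match.group("node")),
                int(match.group("drive")),
                match.group("state"),
            )


if __name__ == "__main__":
    sample = (
        "2014-12-21T23:45:25-05:00 <0.4> DASNAS01-20(id20) "
        '/boot/kernel.amd64/kernel: [gmp_info.c:1706](pid 60250="kt: gmp-drive-updat")'
        "(tid=101054) group change: <20,10359> [up: 29 nodes, down: 1 drive, "
        "soft_failed: 1 drive] (node 20 drive 34 changed to down)"
    )
    print(list(drive_changes([sample])))
    # [('2014-12-21T23:45:25-05:00', 20, 34, 'down')]
```

To scan a real log, pass an open file instead, e.g. `with open("/var/log/messages") as log: print(list(drive_changes(log)))`.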

From the CLI, you can run this command on a node to check the current group state, including down and soft-failed devices:

# sysctl efs.gmp.group

efs.gmp.group: <2,1765>: { 1:0-6,8-11, 2:0-11, 4:0-1,3-10,13, down: 3, 5, soft_failed: 1:0, 2:8,11, 3 }
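The sysctl value can be split into its sections with a short Python sketch. The layout it assumes is inferred only from the one example output above, not from documentation: entries are separated by ", ", entries before any label are treated as up, a label such as "down:" or "soft_failed:" starts a new section, "N:a-b,c" means node N with drives a through b plus c, and a bare "N" means the whole node. `parse_gmp_group` and `expand` are hypothetical helpers.

```python
import re


def parse_gmp_group(text):
    """Split an efs.gmp.group value into {section: [entries]}.

    Assumed layout (inferred from one example, not documented): entries are
    separated by ", "; entries before any label are "up"; a label such as
    "down:" or "soft_failed:" starts a new section.
    """
    body = text[text.index("{") + 1 : text.rindex("}")]
    sections = {"up": []}
    current = "up"
    for entry in (part.strip() for part in body.split(", ")):
        label = re.match(r"([a-z_]+):\s*(.*)", entry)
        if label:
            # A lowercase label starts a new section; keep any entry after it.
            current, entry = label.group(1), label.group(2)
            sections.setdefault(current, [])
        if entry:
            sections[current].append(entry)
    return sections


def expand(entry):
    """Turn "2:8,11" into [(2, 8), (2, 11)]; a bare node "3" becomes [(3, None)]."""
    node, _, drives = entry.partition(":")
    if not drives:
        return [(int(node), None)]  # assumed to mean the whole node
    pairs = []
    for span in drives.split(","):
        first, _, last = span.partition("-")
        for drive in range(int(first), int(last or first) + 1):
            pairs.append((int(node), drive))
    return pairs


if __name__ == "__main__":
    out = ("efs.gmp.group: <2,1765>: { 1:0-6,8-11, 2:0-11, 4:0-1,3-10,13, "
           "down: 3, 5, soft_failed: 1:0, 2:8,11, 3 }")
    groups = parse_gmp_group(out)
    print([pair for e in groups["soft_failed"] for pair in expand(e)])
    # [(1, 0), (2, 8), (2, 11), (3, None)]
```

On a node, you could feed it the live command output with `subprocess.run(["sysctl", "efs.gmp.group"], capture_output=True, text=True).stdout` (Python 3.7 or later).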