21 Posts
0
6744
How can I know when a disk has failed?
Hi guys,
I understand that when Isilon identifies a broken disk, the FlexProtect job runs automatically.
I checked the 'disk repair initiated' check-box in Edit Notification Rule, and recipients for both email and
SNMP are set to receive the notification.
When I executed a smartfail to remove one disk from the cluster, I received an SNMP notification,
but no email notification was sent. Of course, I checked with a test event, and the SNMP
notification was sent successfully.
Have you ever received a 'disk repair initiated' notification when FlexProtect starts? Or do you have any ideas?
I am using the latest OS, OneFS v6.5.5.26.
Thank you,
Masahiro
crklosterman
450 Posts
0
December 6th, 2013 20:00
Masahiro,
To be clear, OneFS 6.5.5.26 may be the latest GA code in the 6.5 family, but it is certainly not the latest on the platform (7.0.1.10, 7.0.2.5, and 7.1.0 are the latest). The latest GA code on the platform can always be found by going to support.emc.com and searching for "Current Isilon Software Releases".
Or use this link:
https://support.emc.com/docu46145_Current_Isilon_Software_Releases.pdf?language=en_US
As to your failed disk: a smartfail that you initiate at the CLI or GUI is far less likely to generate an event than a disk actually failing. Yes, a FlexProtect job will kick off automatically; however, running a FlexProtect job is not a notable issue, the disk failure is.
Personally, I usually create a notification rule for emails, simply select all emergency and all critical events, and have them emailed to a distribution group for the storage team. A similar setting for SNMP is appropriate.
If you think you've discovered that a particular event sends SNMP traps but not emails when the thresholds are set the same for both rules, please open a Service Request with EMC support at support.emc.com. That way, if it is an error in the code, it can be corrected.
Hope this helps,
Chris Klosterman
Twitter: @croaking
Senior Solution Architect
Offer and Enablement Team
EMC Isilon
gshah1
27 Posts
0
December 4th, 2013 06:00
An event should be generated telling you that a disk has failed, and OneFS will show an amber or red dot beside the node number on the dashboard. If you click the node and scroll down to the bay area, it will show the drive bay in red.
I myself have not seen the 'disk repair initiated' thing.
Hope this helps.
crklosterman
450 Posts
1
December 6th, 2013 20:00
Also, if you have SupportIQ or ESRS (OneFS 7.1 only) enabled, when a disk actually fails the cluster should call home and auto-generate a support case for the issue. Back when I was managing Celerras on a day-to-day basis, it was not at all uncommon for me to hear from EMC support about a failed disk before I even found it myself.
Chris Klosterman, ICSP, ICIE, CCNA, VCP
Senior Solution Architect
Offer and Enablement Team
EMC²| Isilon Storage Division
dynamox
2 Intern
20.4K Posts
0
December 6th, 2013 21:00
It will not send out an email until the smartfail or FlexProtect job (can't remember exactly which) completes.
TanyaLB
7 Posts
0
December 24th, 2014 07:00
Does the dial home occur automatically or do you still have to configure notification rules?
dynamox
2 Intern
20.4K Posts
0
December 24th, 2014 13:00
Rules need to be configured.
chughh
122 Posts
1
December 25th, 2014 18:00
Hello,
Yes, you can check /var/log/messages.
For example, every drive state change is logged as a group change. Below you can see that a drive was soft_failed on 21 December at 23:45.
2014-12-21T23:45:25-05:00 <0.4> DASNAS01-20(id20) /boot/kernel.amd64/kernel: [gmp_info.c:1706](pid 60250="kt: gmp-drive-updat")(tid=101054) group change: <20,10359> [up: 29 nodes, down: 1 drive, soft_failed: 1 drive] (node 20 drive 34 changed to down)
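If you want to pull these entries out of the log automatically, here is a minimal Python sketch. The regular expression is an assumption based only on the single sample line above, so other OneFS versions may format the message differently; the function names `parse_group_change` and `find_drive_events` are made up for illustration.

```python
import re

# Assumption: the layout is inferred from the one sample line in this thread.
GROUP_CHANGE = re.compile(
    r"^(?P<timestamp>\S+)\s+.*?group change:\s+<(?P<group>[^>]+)>\s+"
    r"\[(?P<summary>[^\]]*)\]\s+\((?P<detail>[^)]*)\)"
)


def parse_group_change(line):
    """Return timestamp, group id, summary and detail, or None if no match."""
    match = GROUP_CHANGE.search(line)
    if match is None:
        return None
    return match.groupdict()


def find_drive_events(lines):
    """Keep only group changes whose detail text mentions a drive."""
    events = (parse_group_change(line) for line in lines)
    return [e for e in events if e is not None and "drive" in e["detail"]]


SAMPLE = (
    '2014-12-21T23:45:25-05:00 <0.4> DASNAS01-20(id20) '
    '/boot/kernel.amd64/kernel: [gmp_info.c:1706]'
    '(pid 60250="kt: gmp-drive-updat")(tid=101054) group change: '
    '<20,10359> [up: 29 nodes, down: 1 drive, soft_failed: 1 drive] '
    '(node 20 drive 34 changed to down)'
)

if __name__ == "__main__":
    for event in find_drive_events([SAMPLE]):
        print(event["timestamp"], event["detail"])
```

To scan a real log, you could pass an open file object for /var/log/messages to `find_drive_events` instead of the sample list.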
From the CLI you can run the following command to check the current group state as seen by a node, including down and soft_failed drives.
# sysctl efs.gmp.group
efs.gmp.group: <2,1765>: { 1:0-6,8-11, 2:0-11, 4:0-1,3-10,13, down: 3, 5, soft_failed: 1:0, 2:8,11, 3 }
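To check this output from a script, here is a rough Python sketch that splits the text between the braces into its sections. It assumes the output always looks like the sample above, with optional `down:` and `soft_failed:` labels after the list of members that are up. The helper name `split_group_sections` is hypothetical, and the entries are left as raw strings rather than interpreted.

```python
import re


def split_group_sections(output):
    """Split the braces part of the sysctl output into raw text sections.

    Assumption: the unlabeled leading part lists what is up, followed by
    optional 'down:' and 'soft_failed:' sections, as in the sample output.
    """
    body = output[output.index("{") + 1 : output.rindex("}")]
    parts = re.split(r",?\s*(down|soft_failed):\s*", body)
    sections = {"up": parts[0].strip().rstrip(",").strip()}
    # re.split with a capture group alternates label, value, label, value...
    for label, value in zip(parts[1::2], parts[2::2]):
        sections[label] = value.strip().rstrip(",").strip()
    return sections


SAMPLE_OUTPUT = (
    "efs.gmp.group: <2,1765>: { 1:0-6,8-11, 2:0-11, 4:0-1,3-10,13, "
    "down: 3, 5, soft_failed: 1:0, 2:8,11, 3 }"
)

if __name__ == "__main__":
    sections = split_group_sections(SAMPLE_OUTPUT)
    if sections.get("soft_failed"):
        print("soft_failed entries:", sections["soft_failed"])
```

A non-empty `soft_failed` section could then be used to trigger whatever alerting you already have in place.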