Patch-191603 on OneFS 8.0.0.4 Experience

Question

Looking for feedback on what we experienced yesterday, which caused a disruption on certain hosts in our production and development VM environment. I currently have an SR open but haven't been able to get anyone from the Isilon support team to start working on it (24 hours and counting so far, 3 follow-up calls... sounds like they are slammed with some severity 1 issues right now and, in fairness, I stabilized our environment by suspending the node we had an issue with). Ultimately I need assistance with a root-cause analysis and with determining whether I can place the troubled node back in SmartConnect duty.

Our current x410 4-node cluster is running OneFS 8.0.0.4 (target code) with highly recommended patch-188239. Because we are exclusively NFSv3 right now and have experienced some of the issues described as fixed by the patch, I opted to install patch-191603 (a roll-up to address multiple NFS and SmartConnect issues). Over the last year we've done rolling reboots to the cluster countless times (patches, OneFS upgrades, etc.) and never experienced a disruption... until now it has been completely transparent in all cases.

We have a very light workload (relative to the power of the cluster) so as luck would have it, almost all our NFS clients were on the 4th node. The patch itself appears to have installed correctly and did the expected rolling reboot. When it got to the final 4th node, SmartConnect did what it was supposed to do and distributed the clients to the other three nodes, and when the 4th node came back up it took some of the clients back. In fact, it was perfectly balanced by the same number of IP addresses per node when all was said and done. Very symmetrical.

The problem was that clients assigned to node 2 immediately lost all connectivity to storage; it just disappeared. There was no indication whatsoever of any problems on the Isilon side. My first hint of trouble was the VM guys going into a panic about VMs suddenly going dark from certain hosts. A quick check found all those hosts on node 2, and I saw that the throughput for node 2 was exactly zero, nothing in or out. I "fixed" the issue by suspending node 2 from SmartConnect duty and rebooting the node. This caused SmartConnect to immediately send those clients to the other 3 nodes, and connectivity was restored instantly to those hosts (they were able to see the storage again).

So I'm looking for feedback from anyone else who might have experienced a similar phenomenon either with this patch or in general. I will still expect the Isilon support team to assist with root-cause analysis but if anyone has ideas or advice I'd love to hear it. I still currently have node 2 suspended from SmartConnect duty waiting to discuss with support on whether I can enable it again. Thanks.

JoeCap2 · Answer

Ryan, were you able to figure out what the issue was after you installed the patch? Are you able to add node 2 back? Joe

Ryan_CSULB · Answer

Sort of. The issue ultimately was a panic from the 10g adapters, not necessarily anything to do with the patch itself or with SmartConnect, but support couldn't say exactly why they went into a panic.

Yes, after getting the all clear from support I simply allowed LNN2 back into the SmartConnect circle. For whatever reason, it immediately grabbed the majority of the clients and we had no issues with NFS (or any) traffic being disrupted.

priyal420 · Answer

Hi Ryan

We just upgraded our Isilon cluster from 7.2.1.2 to 8.0.0.4. One of our nodes, an X410, has just 3 NFS clients connected to it, and SMB clients are not failing over to that node. I can connect to that node manually and see 1 SMB connection, which is mine, but SMB connections are otherwise not failing over to it.

How do you resolve it?

Phil.Lam · Answer

PriyalP7, SMB2 clients will get disconnected because SMB2 is a stateful protocol, so they would use a Static Pool. NFSv3 is a stateless protocol, and Dynamic IPs allow the IP to be taken over by another NIC.

SKT2 · Answer

Try using the Continuous Availability option with the SMB share. CA-supported client OSes include Windows 2012 and Windows 8.

Phil.Lam · Answer

SKT, you still need a Static Pool for SMB CA to work correctly. Phil

priyal420 · Answer

I fixed the issue. When I did a dig, SmartConnect was not returning the affected node's IP for either the NFS or the SMB pool. From the logs I saw that the HDFS service had failed to restart on that particular node. Restarting the HDFS service worked: SmartConnect was able to return the node IP again, and clients started connecting to that node.
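
For anyone who wants to repeat that kind of check without running dig by hand, here is a rough sketch in Python. It is only an illustration, not anything from Isilon: the function name `sample_zone`, the zone name, and the IP addresses are all made up, and it assumes each lookup actually reaches SmartConnect (a caching resolver on the client would return the same answer every time and hide the problem).

```python
import socket
from collections import Counter


def sample_zone(zone_name, expected_ips, attempts=20, resolve=socket.gethostbyname):
    """Resolve a SmartConnect zone name repeatedly and tally the answers.

    zone_name    -- hypothetical zone name, e.g. "nfs.isilon.example.com"
    expected_ips -- the pool IPs you expect SmartConnect to hand out
    resolve      -- lookup function; defaults to the system resolver

    Returns (seen, missing, failures): a Counter of returned IPs, the
    sorted expected IPs that were never returned, and the number of
    lookups that failed outright.
    """
    seen = Counter()
    failures = 0
    for _ in range(attempts):
        try:
            seen[resolve(zone_name)] += 1
        except OSError:
            # socket.gaierror is a subclass of OSError
            failures += 1
    missing = sorted(set(expected_ips) - set(seen))
    return seen, missing, failures
```

Calling something like `sample_zone("nfs.isilon.example.com", ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])` and finding one address always in `missing` would match the symptom described above. Keep in mind that an address can also be missing simply because too few lookups were made, so a larger `attempts` value gives a more trustworthy answer.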

Ryan_CSULB · Answer

PriyalP7, regarding your HDFS issue, did you apply patch-194268?  It addresses the HDFS service not restarting automatically after failing.  Sounds like it might be something you want to consider for your environment.

angelom1 · Answer

This looks like a SmartConnect connection issue that is fixed in 8.0.0.6. See pages 18-19 in the 8.0.0.6 release notes.

Ryan_CSULB · Answer

Yes, I think you are right. In fact, the notes for patch 206322 (which replaced 191603) mention some very similar bugs that it fixes. I don't know if this was an 8.0.0.4 issue or something with the 191603 patch itself, but when I did install the new patch I had no issues.
