Numerous Issues, no active support contract

Question

I started at this position 6 months ago, and inherited an older Isilon array. At the time of its purchase, the company could invest in the CapEx/OpEx needed to maintain such a beast of a system, but these days, no warranty is in sight, and as such, I'm facing errors now that I've never had to deal with before.

On the surface it's mostly minor things, but they're annoying nonetheless, and I'm out of ideas.

The two major problems that I really want to find a solution for:

I keep receiving notices that an interface on the internal network goes up, then down, then up again. I've come to learn this is called "flapping", but other than some obscure references online, I've yet to find any concrete way to solve the problem.

The second one is a little more annoying. Today, after replacing a hard drive in one of our nodes, we started receiving non-stop emails saying that a node is now online. I've already set the email alerts to stop sending me both informational AND first-level alert emails, but nothing helps. We've had a little more than 2 emails a minute all day today, and I'm about to scream.

Cluster GUID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Sender OneFS Version: Isilon OneFS v6.5.4.17 B_6_5_4_162(RELEASE) Sender Serial Number:xxxxxxxxxxxx

Node 62 Events

------------------------------------------------------------------------

OneFS Version: Isilon OneFS v6.5.4.17 B_6_5_4_162(RELEASE) Serial Number: xxxxxxxxxx

------------------------------------------------------------------------

ID Started Sev Message

------------------------------------------------------------------------

1.6730 10/14 10:31 I Node 62 is online (offline event 18.682, Dec 14 2014 07:58 +0000 to Oct 14 2016 10:31 +0000)

Attachment Manifest:

Attached:

events-001517e7c48a7fdbb44c4808f0bd2f662462-1476480044.xml

Isilon Support Toll-Free: 1-866-276-0723 | Email: support@isilon.com | Toll: 1-206-777-7970 Customer Contact ID: xxxxxxx xxxx#

As you can see, the alert is from an event that apparently occurred almost 2 years ago.
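To put a number on that, here is a minimal Python sketch that parses the message line from the alert above and computes how long the offline window supposedly lasted. The regular expression and the `offline_window` name are assumptions inferred from this single sample line, not a documented OneFS format.

```python
import re
from datetime import datetime

# Pattern inferred from the one alert line quoted above (an assumption,
# not a documented OneFS event format).
EVENT_RE = re.compile(
    r"Node (?P<node>\d+) is online \(offline event (?P<event>[\d.]+), "
    r"(?P<start>\w{3} \d{1,2} \d{4} \d{2}:\d{2}) \+0000 to "
    r"(?P<end>\w{3} \d{1,2} \d{4} \d{2}:\d{2}) \+0000\)"
)
TS_FORMAT = "%b %d %Y %H:%M"  # e.g. "Dec 14 2014 07:58"


def offline_window(line):
    """Return (node, offline_event_id, duration), or None if no match."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    start = datetime.strptime(m.group("start"), TS_FORMAT)
    end = datetime.strptime(m.group("end"), TS_FORMAT)
    return int(m.group("node")), m.group("event"), end - start


sample = (
    "1.6730 10/14 10:31 I Node 62 is online (offline event 18.682, "
    "Dec 14 2014 07:58 +0000 to Oct 14 2016 10:31 +0000)"
)
node, event, window = offline_window(sample)
print(f"node {node}, offline event {event}: {window.days} days")
```

For the line above this reports a window of 670 days, which matches the "almost 2 years" reading.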

I don't have any documentation here to help me at all - I figured out how to reboot nodes on my own, but have yet to do so since the company relies on this thing for everything it does these days.

Short of stopping the alerts, anyone have any suggestions? (P.S. no, a support contract won't work - the quote for the support is about 1.5 times the cost of a whole new storage array... and the CEO won't do it for that reason)

Oh, and the problem with the Emails - it's the same nodes every time, near as I can tell - but it's a substantial number of them, which makes me believe that there's just a log file stuck in a mail queue or something somewhere.
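One way to check that hunch would be to save the alert bodies as text and tally them per node and per offline event ID. The following is only a sketch: the `count_online_alerts` name is hypothetical, the pattern is inferred from the one alert quoted above, and every sample line except the Node 62 one is made up for illustration.

```python
import re
from collections import Counter

# Inferred from the single alert quoted above; an assumption, not a
# documented OneFS format.
NODE_ONLINE_RE = re.compile(r"Node (\d+) is online \(offline event ([\d.]+)")


def count_online_alerts(lines):
    """Count 'node is online' alerts per (node, offline event ID)."""
    counts = Counter()
    for line in lines:
        m = NODE_ONLINE_RE.search(line)
        if m:
            counts[(int(m.group(1)), m.group(2))] += 1
    return counts


# Sample input: the first line is the real alert, the rest are made up.
alerts = [
    "Node 62 is online (offline event 18.682, Dec 14 2014 07:58 +0000 to Oct 14 2016 10:31 +0000)",
    "Node 62 is online (offline event 18.682, Dec 14 2014 07:58 +0000 to Oct 14 2016 10:33 +0000)",
    "Node 7 is online (offline event 3.100, Jan 01 2015 00:00 +0000 to Oct 14 2016 10:32 +0000)",
    "some unrelated line",
]
counts = count_online_alerts(alerts)
for (node, event), n in counts.most_common():
    print(f"node {node}, offline event {event}: {n} alert(s)")
```

If the same (node, event) pairs keep repeating with large counts, that would be consistent with old events being replayed rather than nodes genuinely going offline and coming back.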

carlilek · Answer

See my blog here:

Simple script for restarting the CELOG on Isilon | Unscrupulous Modifier

(note that this will only work for OneFS 7 and earlier)

A flapping interface on the internal network is a scary problem. How many nodes do you have, and what kind and size of switches are you using?

sjones51 · Answer

Hi cogeek,

In terms of the flapping IB interface, you could be looking at software or hardware. There were a number of software fixes implemented in later versions of OneFS to add more stability there. In terms of hardware though, there are a couple of easy things you could try.

1. Move the IB cable to a different port on the switch (different leaf if applicable).

2. Replace the IB cable.

The next hardware step is replacing the IB/NVRAM card. I would strongly recommend having a certified engineer do that, since the card stores the journal information and you could be looking at data loss if anything goes wrong. You don't need a full support contract for this: you can get assistance from support with a Time and Materials quote instead.
