
Unsolved

This post is more than 5 years old


5 Practitioner

 • 

274.2K Posts

3828

July 10th, 2013 02:00

Infiniband switch monitoring

We told a customer that all our Infiniband switches are unmanaged, but he nevertheless asked how to monitor them for a power supply or port failure. I did not find any information about monitoring options for our Infiniband switches, whether they are built into Isilon or are just a few SNMP traps sent to the customer's internal management LAN. Blinking LEDs are not an option.

Any kind of best practices, docs, etc. would be appreciated.

Rgds

Tobias

1.2K Posts

July 10th, 2013 03:00

If an Infiniband link goes down on a node, that node will issue an event.

In the (6.5) WebGUI, make sure notifications are enabled under

Status -> Events -> Edit Notifications -> Node status events

for "Internal network link ... down" and "Multiple internal network problems detected" (and also for the various "offline" events).

It seems those incidents are only registered from the node's point of view, and nothing further is retrieved from the switch through SNMP or the like. So you can't detect a single power supply failure on the switch (in the case of redundant power supplies), for example.
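Since the cluster only sees link state from each node's side, a switch problem has to be inferred from the pattern of node events. Here is a minimal sketch of that reasoning, not an Isilon tool: the `(node_id, network)` event tuples and the function name are hypothetical, and it assumes you have already collected the "Internal network link ... down" events in some way of your own.

```python
from collections import defaultdict


def suspect_switches(link_down_events, node_count):
    """Return internal networks on which every node reports its link down.

    link_down_events: iterable of (node_id, network) pairs (hypothetical
                      format), e.g. (3, "int-a").
    node_count:       total number of nodes in the cluster.
    """
    nodes_by_network = defaultdict(set)
    for node_id, network in link_down_events:
        nodes_by_network[network].add(node_id)
    # If all nodes lost the same internal network, the switch behind it
    # (or its power) is a more likely culprit than individual cables.
    return sorted(
        network
        for network, nodes in nodes_by_network.items()
        if len(nodes) == node_count
    )


events = [(1, "int-a"), (2, "int-a"), (3, "int-a"), (2, "int-b")]
print(suspect_switches(events, node_count=3))  # ['int-a']
```

This still cannot see a single failed power supply in a switch with redundant supplies, because no node link goes down in that case.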

It's recommended to have two redundant Infiniband switches anyway...

Cheers

Peter

July 10th, 2013 07:00

The customer has a legitimate concern, and saying that the answer is to buy redundant switches, although a good idea, is rather short-sighted. Why wait until TWO power supplies fail and the IB links have failed before you report that the first one failed?

You may not want a customer to update firmware, and probably really don't, but these switches have a lot of capabilities that are currently turned off.

I see no reason why a customer shouldn't be able to turn on syslog or SNMP. It's just like what EMC does for its Fibre Channel switches: a customer should and will buy two, but you still want to monitor each one.

1.2K Posts

July 10th, 2013 09:00

Nothing to object to...

It's just not "in the box" currently.

Btw, how about customers who'd like to see lights-out-management such as IPMI on the Isilon nodes...?

The point is, the Isilon design is meant to do without; instead, all monitoring and redundancy/failover is done at the OS level. There are SNMP and the platform API, but they are served from the nodes' OS, not from a platform processor or BMC.

Cheers

Peter

4 Posts

July 16th, 2013 11:00

Remember that redundant IB switches are not just to protect against a PSU failure. If you had a single IB switch with two power supplies, the entire switch could still fail and your cluster would go down.

I wish the switches had two PSUs because during live IB switch swaps, at one point the entire cluster is running on a single switch, and if that power cable gets jiggled... eeek.

1.2K Posts

July 17th, 2013 01:00

Would that help you, drawing enough attention to the single point (cable) of failure?

Glowing power cable | fosfor gadgets

;-)

5 Practitioner

 • 

274.2K Posts

July 17th, 2013 04:00

Thanks a lot. Everything was helpful. The customer is not excited but in a mode of comprehension...  ;-)

110 Posts

July 19th, 2013 12:00

Most of the higher-end (bigger) IB switches have an Ethernet port that allows you to configure an IP address and send SNMP traps covering things like a power supply failure in the IB switch. The smaller switches have only one power supply, and the whole switch is replaced when it fails.
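For a switch that does have such a management port, polling is an option alongside traps. The sketch below is only an illustration under assumptions: it assumes the switch answers standard IF-MIB queries and that you capture the text output of a Net-SNMP command such as `snmpwalk -v2c -c <community> <switch-ip> IF-MIB::ifOperStatus` yourself. The function name is made up for this example, and power supply status is not covered because that usually lives in vendor-specific MIBs.

```python
import re

# Matches Net-SNMP style lines such as:
#   IF-MIB::ifOperStatus.2 = INTEGER: down(2)
LINE_RE = re.compile(r"ifOperStatus\.(\d+)\s*=\s*INTEGER:\s*(\w+)\((\d+)\)")


def ports_not_up(snmpwalk_output):
    """Return {ifIndex: status_name} for interfaces whose status is not up(1)."""
    result = {}
    for line in snmpwalk_output.splitlines():
        match = LINE_RE.search(line)
        if match and match.group(3) != "1":
            result[int(match.group(1))] = match.group(2)
    return result


sample = """\
IF-MIB::ifOperStatus.1 = INTEGER: up(1)
IF-MIB::ifOperStatus.2 = INTEGER: down(2)
IF-MIB::ifOperStatus.3 = INTEGER: up(1)
"""
print(ports_not_up(sample))  # {2: 'down'}
```

Run from cron on a management host, something like this could raise an alert before a node-side link-down event is the first sign of trouble.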

As a rule, we always sell 2 IB switches unless it is going into a test/dev area that doesn't have uptime criticality for some reason. Without 2 IB switches, you create a single point of failure in an otherwise very resilient architecture.
