December 7th, 2012 12:00

Can a program cause a network disaster in my blade?

Hello everybody,

I'm a young chemistry PhD from Naples and I'm working on a Dell blade supercomputer. It worked very well until two days ago, when I ran some calculations with the new version of Turbomole.

By default, this program runs its scripts with a different architecture than the one I set (which is already something worth thinking about). After I discovered this strange behaviour I started setting the architecture of the program manually, and it worked well... until I ran larger calculations (i.e. ones that need a lot of memory, a lot of CPUs, a lot of CPU time, ...). When I run these heavy calculations, a lot of catastrophic things happen:

1 - 30 minutes after I submit a job, the server becomes inaccessible remotely. I can reconnect to it only if I unplug the LAN cable from one of the two M6220 units (always the same unit);

2 - After some hours the internal connections of the blade go down, so I can't connect to any unit from the server. The only thing I can do is shut everything down manually (i.e. using the power button) and then switch it back on;

3 - If I reconnect the cable before restarting all the units, everything works well until I submit another calculation (I took a lunch break of about an hour and a half between restarting the system and running jobs). After that, the server becomes inaccessible again (I have to unplug the cable like before) and some units (2 of the 3 on which I submitted my jobs) are unreachable (if I type 'ssh nodename' I receive the error 'No route to host').

Can a program cause all of this?

During this period the weather has been bad, with a lot of storms. Could lightning have damaged the blade? All the power connections go through a UPS unit, so I wouldn't expect it.

Thanks for any suggestions,

Emanuele Breuza
