Datamovers and Fail Back

Question

Hi All,

I have had a problem for a week now that I just cannot resolve. I have a VNX 5300, and the Primary datamover failed over to the Secondary. The issue that caused the failover has been resolved, but now I need to "fail back" to the Primary.

How do I do this? The pressing issue for me is that no new folders can be created via our automated tool onto our storage. I am doing this manually at the moment.

I do not have AFM installed.

ANY suggestions would be greatly appreciated.

Thanks

darrell

cincystorage · Answer

First I'd run 'nas_server -info server_2' to verify that it has a standby server, which it should. To restore server_2, you simply run the following:

server_standby server_2 -restore mover

That starts the process that makes the Primary active again. Keep in mind the impact this may have on connected users, depending on your environment.
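The "check for a standby first" step above can be sketched as a small parser. This is only an illustration, not an EMC tool: the function names are hypothetical, and the sample text mirrors the nas_server -info output posted later in this thread. It assumes the output is a series of "key = value" lines.

```python
# Sample text copied from the nas_server -info output posted in this thread.
SAMPLE_INFO = """\
id = 1
name = server_2
acl = 0
type = nas
slot = 2
member_of =
standby = server_3, policy=auto
status :
defined = enabled
actual = online, active
"""


def parse_nas_server_info(text):
    """Collect 'key = value' lines into a dict (hypothetical helper)."""
    info = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and headers such as 'status :'
        # Split on the first '=' only, so 'policy=auto' stays in the value.
        key, _, value = line.partition("=")
        info[key.strip()] = value.strip()
    return info


def standby_name(info):
    """Return the standby's name, or None if no standby is listed."""
    # The field looks like 'server_3, policy=auto'; keep the part before the comma.
    name = info.get("standby", "").split(",")[0].strip()
    return name or None


info = parse_nas_server_info(SAMPLE_INFO)
print(standby_name(info))
```

If standby_name returns None, there is no standby configured and a restore would have nothing to act on. On a real system the text would come from running the command on the Control Station, which is not shown here.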

dynamox · Answer

server_standby server_2 -restore mover

christopher_ime · Answer

darrellx wrote:

The pressing issue for me is that no new folders can be created via our automated tool onto our storage. I am doing this manually at the moment.

Thanks

darrell

darrellx,

You have piqued my curiosity. How did failing over to the standby data mover prevent the tool from creating folders? How exactly is the tool scripted to reference the data mover? As you know, when the standby data mover takes over, the entire "personality" of the original (filesystem mount configuration, VDM/CIFS server, etc.) is transferred over. The standby is even renamed with the original server_# so that the system's own internal scripts that rely on that name don't break, and of course the physical and logical network configuration is identical. (Well, at least on the data mover itself. The assumption is that the switch configuration is also identical, as is required, and was tested before going into production. Is that possibly the issue?)

I can understand backups no longer working (temporarily) in a 2-way NDMP setup, but even that can be rectified if the standby data mover is expected to be active for an extended period of time. However, I'm trying to understand why your environment no longer works properly when it is failed over. I think that should be reviewed so it doesn't happen again in the future, if possible. While I won't agree or disagree that failing back should be a priority (the data movers are, after all, identical hardware, and neither should be considered inferior to its peer), you shouldn't necessarily feel pressure to fail back ASAP unless you have more than the 2 data movers and the standby is protecting other peers.

christopher_ime · Answer

Even if the script used server_2, the standby when it takes over is reassigned that name; then as you've probably observed, the failed over data mover is reassigned: server_2.faulted.server_3. The system has its own internal scripts that would break if nothing responded to the name of the original active data mover.

Can you run:

nas_server -l

Let us ignore the details (-info) for now. I am most interested in:

1) type (1=nas, 4=standby)

2) slot (physical location that never changes regardless of state)

3) state (0=enabled, 2=failed over)

4) name (current name)
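The four columns listed above could be pulled out of the nas_server -l listing with a short parser. This is a sketch under stated assumptions, not an EMC utility: the function names are hypothetical, the groupID column is assumed to be blank when a row has only six fields, and the type code for an active mover is taken as 1 because that is what the output posted later in this thread shows for server_2.

```python
# Code tables as discussed in this thread; treat them as assumptions.
TYPE_NAMES = {1: "nas", 4: "standby"}
STATE_NAMES = {0: "enabled", 2: "failed over"}
COLUMNS = ["id", "type", "acl", "slot", "groupID", "state", "name"]

# Sample copied from the nas_server -l output posted in this thread.
SAMPLE_LIST = """\
id type acl slot groupID state name
1 1 0 2 0 server_2
2 4 0 3 0 server_3
"""


def parse_nas_server_list(text):
    """Turn each data row of the listing into a dict (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip the header line and blank lines
        if len(fields) == 6:
            fields.insert(4, "")  # assumption: the groupID column was blank
        if len(fields) != 7:
            continue  # unexpected layout; ignore rather than guess
        row = dict(zip(COLUMNS, fields))
        for key in ("id", "type", "acl", "slot", "state"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


def describe(row):
    """Summarise the four fields of interest: type, slot, state, name."""
    kind = TYPE_NAMES.get(row["type"], f"type {row['type']}")
    state = STATE_NAMES.get(row["state"], f"state {row['state']}")
    return f"{row['name']}: slot {row['slot']}, {kind}, {state}"


for row in parse_nas_server_list(SAMPLE_LIST):
    print(describe(row))
```

The slot is the field to trust when working out which physical data mover is which, since the name moves with the role during a failover.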

darrellx · Answer

UNCLASSIFIED

Christopher

Well, this is a long story. We had a guy who wrote a script to automate our folder creation on our previous storage - MD1000 and MD1200 devices.

When we made the move to EMC a year ago, he re-wrote his script for the new hardware. I believe he has done some hard-coding of device names and referenced the server_2 by name. He is not here at the moment, and I have tried to communicate via text messaging. So I am a little confused as to exactly why, but I think this is it in a nutshell.

So when we run the script, it appears that the folder (sub-folder) structure is created, and the "okay" message is displayed. But when I look on the volume, it is not there.

When he returns (the script writer), I will be asking about this.

I am not that nervous about the fail-over. I suspect if something happened to server_3 (DataMover 3), it would just fail over to server_2 (DataMover2). It is really this folder creation issue that is making life hard.

Having read what you have posted, I do hope this isn't a coincidental issue and the problem remains once I do the fail back routine.

Thanks

Darrell

DARRELL BETTS

TEAM LEADER

FORENSIC AND DATA CENTRES

Tel +61(0) 7 52221222 Ext 172428 Mob +61(0) 419484720

www.afp.gov.au

UNCLASSIFIED

christopher_ime · Answer

Assuming you have just the two data movers, the nas_server -info server_2 output you provided at Mark's request suggests everything is normal, believe it or not. I'll wait, though, for the output of nas_server -l.

cincystorage · Answer

You do not need to get the users to logoff - however the level of interruption they receive will vary from none to a lot - depending on a lot of things.. In theory, they should lose connectivity for a brief moment and it should restore itself.. in reality it might not work quite so perfectly..

But yes, it is that simple.. Some other useful commands are:

nas_server -l

/nas/sbin/getreason

christopher_ime · Answer

Scrolling back to the top, if you have a VNX5300 as you mentioned already, then you could only have 2 data movers at most, so I answered my own question there. Go ahead and still run nas_server -l for me, though.

darrellx · Answer

UNOFFICIAL

Mark

Thanks very much for the quick response - I should have done this last week.

Anyway, I have run the command -

nas_server -info server_2

I got the response -

id = 1

name = server_2

acl = 0

type = nas

slot = 2

member_of =

standby = server_3, policy=auto

status :

defined = enabled

actual = online, active

So now, once I get the users to logoff, I can run the command -

server_standby server_2 -restore mover

???

Seems too easy.

Darrell


christopher_ime · Answer

darrellx wrote:

I am not that nervous about the fail-over. I suspect if something happened to server_3 (DataMover 3), it would just fail over to server_2 (DataMover2).

So just to clarify, in its failed over state, the system would *not* failback automatically; that is always a manual process via the command that Mark had provided.

For instance, using the actual system naming:

1) server_2/slot 2 (server_3/slot 3 is its standby)

2) server_2/slot 2 itself has a fault that triggers an automatic failover (configurable)

3) System renames data movers as follows:

a) server_2: renamed server_2.faulted.server_3 (slot 2)

b) server_3: renamed server_2 (slot 3)

Then let's say you resolve the issue with the original server_2 (slot 2/currently named: server_2.faulted.server_3). However, in a very unlucky but possible scenario, the original server_3 (slot 3/currently named: server_2) now faults before you have a chance to fail back.

IMPORTANT: The system would not failback automatically for you. It is a manual process and unlike the failover process, can't be configured to be done automatically by the system.

So you should consider failing back when you have the chance. I'm thinking though if it weren't for the folder issue (but as discussed, I suspect that is something completely unrelated) and since you only have the two data movers and this standby isn't protecting more than one, I would assume you wouldn't be as eager to failback, since as noted above, you do have to plan for the brief downtime.
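The renaming described above can be modelled in a few lines. This is a toy model for following the names, not a description of how the system does it internally: the function names are hypothetical, and the failback step simply assumes the original names are restored.

```python
def failover(movers, primary, standby):
    """Return a new slot -> name mapping after `primary` fails over to `standby`."""
    renamed = {}
    for slot, name in movers.items():
        if name == primary:
            # The faulted mover keeps its slot but gets the long name.
            renamed[slot] = f"{primary}.faulted.{standby}"
        elif name == standby:
            # The standby takes over the primary's name.
            renamed[slot] = primary
        else:
            renamed[slot] = name
    return renamed


def failback(movers):
    """Assumed inverse of failover(): restore the original names."""
    restored = dict(movers)
    for slot, name in movers.items():
        if ".faulted." in name:
            primary, standby = name.split(".faulted.")
            for other_slot, other_name in movers.items():
                if other_name == primary:
                    restored[other_slot] = standby  # standby gets its name back
            restored[slot] = primary  # repaired mover becomes primary again
    return restored


normal = {2: "server_2", 3: "server_3"}
failed_over = failover(normal, "server_2", "server_3")
print(failed_over)
print(failback(failed_over))
```

Note that the slot numbers never change in either direction; only the names move.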

EDIT:

I just wanted to quickly note that there is a whitepaper about standby data movers: "Configuring Standbys on VNX"

https://mydocs.emc.com/VNXDocs/Standbys.pdf

You can also search for it by name on support.emc.com, but I'm looking to mix it up a bit and take you instead to the "My Documents" section (the whitepaper is available via the "Related documents" section).

https://mydocs.emc.com/VNX/requestMyDoc.jsp

darrellx · Answer

Sorry, I meant to include this.

Component Name: Control Station 0

Type: Control Station

Status: OK

Variant:

Version: N/A

Serial Number: N/A

History: N/A

Component Name: Control Station 1

Type: Control Station

Status: OK

Variant:

Version: 7.0.52-1

Serial Number: FCN00122200001

History: CPU_VENDOR_ID:GenuineIntel

CPU_FAMILY:6

CPU_MODEL_NUMBER:22

CPU_MODEL_NAME:Intel(R) Celeron(R) CPU 440 @ 2.00GHz

CPU_SPEED_MHZ:2000 MHz

CPU_CACHE_SIZE:512 KB

From: Betts, Darrell

Sent: Wednesday, 26 June 2013 6:27 AM

To: 'jive-991801997-a3gr-2-fvch@emc-ecn.hosted.jivesoftware.com'

Subject: RE: - Datamovers and Fail Back

Mark

Sorry to keep bothering you, but I think I have an issue.

I ran the command -

server_standby server_2 -restore mover

And I got the response -

server_2 :

Error 4004: server_2 : standby is not available, is active

So I assume that somehow server_2 has failed back and server_3 is now my secondary again. I am a bit confused because I don't understand how that could happen, but I will accept it for the minute.

My concern is that normally I log onto the Control Stations via the browser-based Unisphere and go to IP 10.66.68.90 - which is Control Station 0. Since the fail over, I have not been able to get to this IP and have been using 10.66.68.91. I thought the fail over must have had something to do with this, but maybe not.

Any suggestions?

Thanks

Darrell


dynamox · Answer

can you post output from nas_server -l

dynamox · Answer

They are back to their normal state, so they were either failed over manually or they panicked again.
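One way to read the Error 4004 message alongside the nas_server -l rows posted in this thread is to check whether any data mover is still in a failed-over state. The sketch below is an assumption-laden illustration: the function name is hypothetical, it treats either state 2 or a ".faulted." name as the sign of a failover, and the second set of rows is invented to show what a failed-over system might look like.

```python
def failed_over_movers(rows):
    """Return the names of movers that still look failed over."""
    return [
        row["name"]
        for row in rows
        if row["state"] == 2 or ".faulted." in row["name"]
    ]


# Rows as reported in this thread: both movers enabled, original names.
THREAD_ROWS = [
    {"name": "server_2", "slot": 2, "state": 0},
    {"name": "server_3", "slot": 3, "state": 0},
]

# Hypothetical rows for a system that is still failed over.
HYPOTHETICAL_ROWS = [
    {"name": "server_2.faulted.server_3", "slot": 2, "state": 2},
    {"name": "server_2", "slot": 3, "state": 0},
]

print(failed_over_movers(THREAD_ROWS))
print(failed_over_movers(HYPOTHETICAL_ROWS))
```

An empty result for the thread's rows is consistent with the restore command having nothing to restore.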

darrellx · Answer

Christopher

I have run the command -

nas_server -l

And I get the response -

id   type  acl  slot  groupID  state  name
1    1     0    2              0      server_2
2    4     0    3              0      server_3

As I have already replied to Mark -

I ran the command -

server_standby server_2 -restore mover

And I got the response -

server_2 :

Error 4004: server_2 : standby is not available, is active

So I assume that somehow server_2 has failed back and server_3 is now my secondary again.

Thanks

Darrell
