possible bug with nsrclone -m

Question

Dear Colleagues,

We have a long discussion with EMC because it seems that we lost data while we were manually cloning savesets from DD to tapes.

I have no nsrclone logs at the time of data loss and we can't reproduce the original problem, but I'm sure that something went wrong with the cloning..

Original issue: I used to migrate long-retention backups to tapes with a command similar to: nsrclone -m ... -S -f ssid_list

The ssid_list was generated with mminfo -q savetime<., ssretent>... -r ssid command.

This method has a "problem": cover savesets may be included in this list (because it has a special long retention (01/19/2038), but such savesets are automatically deleted by networker when the "covered" savesets expire. (Now I'm using ",! cover" in the mminfo query),

I had some long-running migration sessions where such cover ss (included in the ssid_list) was deleted by the server while the migration session was still running.

There was an error message in the daemon logs related to this cover ss and we also had many messages like: "nsrd NSR info Media Info: did not retrieve 12 MB from cloning session:2215 of 3146 save set(s)".

I think these are separate problems, however at the end many (non-cover) savesets were deleted from DD which were not cloned, so we lost data!

(I have a daily full mminfo report, so I know the ss list we had the day before the migration and what left after that.)

We ran many-many migration sessions without any problems, but when we noticed that we are missing some data I could clearly identify 3 sessions where something went wrong..

We can't reproduce the same situation (I have no idea when "did not retrieve" messages may appear).

I had an idea how to "confuse" nsrclone, but this method is not accepted by the support, I'll explain it soon.

Basics:

I think nsrclone -m should delete savesets only if the migration was successful. It doesn't matters what's the input to this program, there should be no delete attempt to any ss which is not successfully copied to an another volume. I think nsrclone should always delete the source ss when the cloning/migration was successful.

NetWorker is running in a multi-user, multitask environment, so each component must handle that parallel operations may happen to any ss in the system.

So here is my test scenario:

- create 3 backups

- start nsrclone -m .. -S ssid1 ssid2 ssid3.

- wait until the tape gets mounted and the 1st ss is being cloned

- delete ssid2 with nsrmm -y -d -S ssid2 (the ssid list is already checked by cloning, but before the any process could read it!)

- let nsrclone finish the job.

result:

- ssid1: successfully migrated (cloned and removed from the source volume)

- ssid2: nsr notices that the ss does not exists, but later nsrclone says that the cloning was successful. there is also an attempt to

remove it by nsrclone!!!!

- ssid3 is cloned, but left on the source volume!!

Here is the output of nsrclone -vvvvv:

(the ssid of the manually deleted ss is: 1238639719).

08/19/15 10:22:49.666321 'nsrclone' is invoked by 'none', co_pjobid=none

Job 'nsrstage' has jobid 4513199.

generate_snodes_from_ss:

parms val: 1238639719

parms val: 1255416899

parms val: 2446598104

libDDBoost version: major: 2, minor: 6, patch: 5, engineering: 0, build: 449492

add_to_snodes:

loc_volid:

volname: FILESYSTEM.019

mounted on: hu_dev008

volume_location: [NETWORKER_SERVER].

hostname [NETWORKER_SERVER] added

volid 962489863 added

vallist_add_ssid:

ssid: b55fa954-00000006-49d42467-55d42467-46c25000-ec749456

cloneid: 1439966311

add_to_snodes:

volid 962489863 in [NETWORKER_SERVER]

vallist_add_ssid:

ssid: f7e2e1a4-00000006-4ad42443-55d42443-46c15000-ec749456

cloneid: 1439966275

add_to_snodes:

volid 962489863 in [NETWORKER_SERVER]

vallist_add_ssid:

ssid: 2ffd502b-00000006-91d41fd8-55d41fd8-467a5000-ec749456

cloneid: 1439965144

80470:nsrclone: Following volumes are needed for cloning

80471:nsrclone: FILESYSTEM.019 (Regular)

79631:nsrclone: Migrating the following Regular save sets (ssid/cloneid):

2446598104/1439965144

1238639719/1439966311

1255416899/1439966275

5874:nsrclone: Automatically copying save sets(s) to other volume(s)

79633:nsrclone:

Starting migration operation for Regular save sets...

6217:nsrclone: ...from storage node: [NETWORKER_SN]

inactive ssid 1238639719

77181:nsrclone: Cannot open nsrclone session with [NETWORKER_SERVER]. Error is 'inactive ssid 1238639719'

clone_ids before:

1439965144

clone_ids after:

1439973728

1439965144

cloneid in new list = 1439973728

cloneid 1439965144 in old list did not match

cloneid in new list = 1439965144

cloneid 1439965144 in old list matched

i=0, new cloneid= 1439973728

2446598104/1439973728: new cloneid count = 1, multiple clone warning status=false

my cloneids = 1439973728

clone_ids before:

1439966275

clone_ids after:

1439973728

1439966275

cloneid in new list = 1439973728

cloneid 1439966275 in old list did not match

cloneid in new list = 1439966275

cloneid 1439966275 in old list matched

i=0, new cloneid= 1439973728

1238639719/1439973728: new cloneid count = 1, multiple clone warning status=false

my cloneids = 1439973728

generate_clone_result(): 2 savesets successfully cloned.

81542:nsrclone: Successfully cloned all requested Regular save sets (with new cloneid)

2446598104/1439973728

1238639719/1439973728

79629:nsrclone: Clones were written to the following volume(s) for Regular save sets:

FRA504JC

6359:nsrclone: Deleting the successfully cloned save set 2446598104

6359:nsrclone: Deleting the successfully cloned save set 1238639719

88977:nsrd: Unable to find save-set for SSID '1238639719'.

61488:nsrclone: Failed to process the followings clone instances:

b55fa954-00000006-49d42467-55d42467-46c25000-ec749456/1439966311

90825:nsrclone: Selecting device for volume operation on volume ID '962489863'

90828:nsrclone: Found device 'hu_dev008' for volume ID '962489863'.

90828:nsrclone: Found device 'rd=[NETWORKER_SN].[DOMAIN]:hu_dev008_lib' for volume ID '962489863'.

90831:nsrclone: Preferring server device 'hu_dev008' for volume ID '962489863'.

90836:nsrclone: Selected device 'hu_dev008' for volume operation on volume ID '962489863'.

77181:nsrclone: Cannot open nsrclone session with [NETWORKER_SERVER]. Error is 'inactive ssid 1238639719'

Some of the clones failed

snl_src_snode: [NETWORKER_SERVER]

loc_volid:

Obtaining media database information on server [NETWORKER_SERVER]

Deleting save set 2446598104 clone 1439965144 from the media data base.

Deleting save set 1238639719 clone 1439966311 from the media data base.

Recovering space from volume 962489863 after successfully cloning save sets.

do_volume_op(): deleting 2 savesets from volume 962489863.

hu_dev008:

hu_dev008: space recovered from volume FILESYSTEM.019

Completed recover space operation for volume 962489863

Refer to the NetWorker log for any failures.

----------------

With this test I can't prove that the ss would be really deleted, because I've deleted it already.., but according to the log messages nsrclone wants to delete it! It's also important that the 3rd ss exists both on the source and migration target volumes.

I belive there is something wrong with nsrclone's result list checking algorithm and in case of any other cloning problems it may delete

unsuccessfully copied savesets.

The current recommendation is to upgrade to the latest nw versions. We had (the original) issues with both 8.0.2.6 and 8.1.1.x systems.

The previous test scenario has the same results with the latest 8.1.3 version too. May I use the -m option with the latest versions?

My questions are:

- Do you agree that there is a problem with nsrclone (-m)?

- How is it possible to reach the "code team" to analyze this or how can I push EMC to continue the investigation with this?

Thank you in advance,

Istvan

ble1 · Answer

I agree that nsrclone should not delete what hasn't been cloned - actually staged. I used quite often -m and in several instances I had this interrupted, but it would not delete ssids using input file (referring to 8.0.3.x and 8.0.4.x on HPUX and AIX). What you test shows is that there is a message for deleting ssid which had issue (didn't exist in your case) so one could eventually question how this works. It seems like, given that 3rd ssid was not removed, that NW will be paranoid and interrupt any removal for ssids after it gets signal something went wrong. I can also confirm that as I have seen it in instances where my nsrclone would fail that it would leave failed and those after so I had to check with mminfo list of ssids again and remove them manually. For example, in my own case, I would have ssid.lst with:

1111

2222

3333

4444

5555

6666

7777

I would also run nsrclone -m and if we assume sequential order as listed above and that error happened during 4444, I would see that 1111, 2222, 3333 and 4444 instances had copies on tape but were not removed. I could identify faulty ssid with ssflags,sumflags,clflags and I would just remove instance of ssid on tape for ssid 4444 and instance on disk for 1111,2222 and 3333. In my case, 5555, 6666 and 7777 never happened as nsrclone was interrupted (so it never came to the end of the run). I understand this is different from your case, but it is hard to grasp what exactly caused your issue in the first place and how that condition was reflected on what nsrclone was doing.

Did you also use -F option? (I didn't).

So far, I didn't notice any issue where nsrclone -m would remove ssid which was not migrated. I know that for sure as where it failed it would leave mark in mdb and I could see that through number of copies and flags associated with those. What is different in your case, is that it seems that something happened during nsrclone run and it is not clear to me if that was at the end or middle and if you used -F or not. So far, I didn't have instances with -m where nsrclone would come to the end and report some ssids not cloned so I can't exactly verify what you have. I'm not sure even how we could test it (perhaps having ssids on multiple volumes and unmount one before nsrclone -m gets to it - really no idea if that would work - it might be that NW would most likely wait for volume in such case). When you did this, were all ssids on the same or multiple volumes? I think you would have a golden ticket if you could make NW think something is wrong with ssid and having it not migrated and then having NW to delete it at the end. While one may suspect something like this due to message you showed in your test, it might also be bogus parsing which lame coders sometimes do as well.

I know there was one bug fixed with description of "Nsrclone command does not assign correct values to ssid's when -m, -w and -y flags specified from the command line", but I really do not know the details (I suspect values here are browse and retention policy when doing migration based on switches mentioned). Perhaps, and that's long shot, you had that one (that doesn't explain why some ssids where not migrated unless they were related to cover ssid which was migrated first and assigned wrong expiration and then during migration run they got removed as well upon next nsrim run - again long shot).

oldhercules · Answer

Hi,

First of all, thank you for your comments, I really hoped that you will respond to this topic

I'll do this test again with more ssids - maybe to pick only 3 ss was not a good idea, so we will see if nsrclone -m is getting paranoid after issues and is not removing the following savesets even if they were successfully staged or he continues to delete them from the source volume except the one which is mixed up with the failed one..an another test idea is to use adv_file device and do something wrong during cloning with the ss itself..

I've never used -F option.

Back to the original issue: I guess I was probably cloning from multiple volumes. I don't have the nsrclone process logs or the ssid list, but from the daemon.log it is clear that there was a cloning session running which was reading from the volume where my savesets were.

I know how I built my ssid_list file: I list the savesets by their pool, savetime & retention and sum their sizes (I had to migrate many-many TBs).

If the size is too much then I decrease the savetime until it's OK (networker is unable to clean the source volume when cloning is running.. so generally I migrated max. 10-20 TB per session).

So, the cloning was running, nw is not logging anything else to daemon.log during cloning, but when the migration finishes it performs a volume cleaning operation (like nsrstage -C -V [volume]) which has some logs ("Deleted xxxx GB from save set xxxx on volume xxxx"). My missing ssids are there...

After that the tape gets unmounted - so it is also visible that the migration operation is finished. This happened at ~2am, so I'm sure that noone was running nsrmm to delete those savesets.. The "neighbours" of the missing savesets are on tapes with that clonetime..

This issue happened with 3 sessions on 2 systems (nw 8.0 & 8.1).

I have many facts & evidents which clearly shows me that those savesets were gone during cloning, but there are also many things what is not based on logs... eg.:

- I have a daily mminfo report. This is the only place where we have the ssid,name,savetime,ssretent of the missing ss.

- the cloning session can be tracked from the daemon.log, but not like "cloning session #1234 started.. did so.. #1234 finished"

- I can say how I cloned, what I've cloned & the neighbours are on tapes, but this is based only on my words as there are no logs about it

It is also true that if nothing is wrong with nsrclone then I could also do that thing manually (delete the ss with nsrmm) and then I would have the same logs..

However I belive if the server has not enough logs then the vendor should trust the customer much more.. at the beginning I think our SR was handled well, but later when they were not able to reproduce the issue or recover data it seems to me that they are focusing on the incident closure + we never reached the "code team".. as they didn't found the root cause - and I guess they won't find it this way - I think there is a huge risk that the issue will happen again (to us or to other users).

We will upgrade to the latest 8.1 soon, use nsrclone -m again, but we will always use verbose logging + we will keep that logs.. I hope we wont hit this issue again.

I'll let you know the results of the migration test+ss deletion with more savesets.

ble1 · Answer

I think normally to get to the code team they need to be able to reproduce the issue - otherwise code team has nothing to deal with really so I can see this being tough on both ends. Since NSR doesn't log manual deletion of ssids there is always possibility of customer mistake and then this kind of tickets - when they can't be reproduced - go on and on.

You certainly enhance the script and add couple of things like mminfo before run and mminfo after nsrclone run along with whole output captured. Someone may still argue you could edit log file, but then where do such discussions end. I guess one thing to do perhaps would be to collects mminfo -q ssid=$ssid -S for each ssid you are moving before and after operation if you wish to have really complete ssid knowledge.

I assume you are moving from disk to tape? In such case, what I normally do, is to create global (total) list of ssids to be migrated and then I query mminfo to create lists per volume and then for each source volume list I run nsrclone (of course, you do not wish to run more of those than you have tape drives and normally you wish to keep one tape drive not affected). This gives much better performance when dealing with clones or staging from multiple source volumes. What kind of your disk type device was source? File, AFTD or DDBoost? Or to expand, did you also had clones and did you use ssid in input file or ssid/cloneid pair?

oldhercules · Answer

I agree that they have to 'protect' the developer team, but as we had a data loss issue and at the beginning the case was handled very well: they automatically escalated the issue, we were working with 'solution recovery team' etc., they told me that a data loss is very rare and they take it seriously.. so I expected that we will reach the highest available levels.. I still think that this issue can occur to other customers as well... but now it seems that the case will be closed soon. I'm moving data from ddboost devices to tapes. Maybe I could run separate cloning sessions based on the source volume, but nsrclone is also working that way: the ssid list is reordered by the src volume. This is also something wrong to me.. we have eg. 3 ddboost volumes in the DATABASE pool. Target sessions is 12, so if more clients are running the backups the sessions for a specific db backup may spread between the volumes... this is OK until we are dealing with disk based volumes, but unfortunately there is no way to force nsrclone to use any preferred ss order.. so eg. the RMAN pieces will be in random order on tapes - and this causes heavy positioning during restores from tapes Running nsrclone separately for each ssid is not an option since the tape is rewinded and the label is read for each clone session... ssid/cloneid: there were no clones, so I didn't specified the cloneid for the migration

ble1 · Answer

oldhercules wrote:

I'm moving data from ddboost devices to tapes. Maybe I could run separate cloning sessions based on the source volume, but nsrclone is also working that way: the ssid list is reordered by the src volume.

In this case, I was more referring to number of streams as it sounded as you used single file, while what I suggested would be multiple files with multiple streams at the time (which would also complete overall operation quicker).

oldhercules · Answer

ok, I see. I have only 2 physical drives for each server. Unfortunately mixed stream cloning is not working (yet).

Sometimes I ran two sessions, and you are right, I'm not 100% sure that all the time but if I used the same source volume for both sessions then the 2nd session was waiting until the other session released the volume.

It is also very important to not to clone/migrate continuously because NetWorker can't do volume cleaning during this time.

ble1 · Answer

I did see this clash (nsrim vs read operations) happening in 8.0.x.  With 8.2.x (admitelly, with different OS this time), I do not see that any longer (but might be helped a bit by the change in whole setup after I moved from big servers to couple of VM servers instead and spread the whole load).

NetWorker

possible bug with nsrclone -m

Was this post helpful?