4 Operator

 • 

14.4K Posts

August 31st, 2015 14:00

I agree that nsrclone should not delete what hasn't been cloned (actually staged). I used -m quite often, and in several instances the run was interrupted, but it would not delete ssids from the input file (referring to 8.0.3.x and 8.0.4.x on HP-UX and AIX). What your test shows is that there is a message about deleting the ssid that had the issue (which didn't exist in your case), so one could question how this actually works. Given that the 3rd ssid was not removed, it seems NW gets paranoid and aborts removal for all subsequent ssids once it gets a signal that something went wrong. I can confirm that: in instances where my nsrclone failed, it would leave the failed ssid and all those after it, so I had to check the list of ssids with mminfo again and remove them manually. For example, in my own case, I would have ssid.lst with:

1111

2222

3333

4444

5555

6666

7777

I would run nsrclone -m, and if we assume sequential processing in the order listed above, with the error happening during 4444, I would see that 1111, 2222, 3333 and 4444 had copies on tape but their disk instances were not removed. I could identify the faulty ssid via ssflags, sumflags and clflags, and I would then remove the tape instance of ssid 4444 and the disk instances of 1111, 2222 and 3333. In my case, 5555, 6666 and 7777 were never processed, as nsrclone was interrupted (it never reached the end of the run). I understand this is different from your case, but it is hard to grasp what exactly caused your issue in the first place, and how that condition was reflected in what nsrclone was doing.
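For reference, this is roughly the kind of flags check I mean. The real query would be an mminfo call (shown in the comment); here captured output is parsed instead, so the logic runs offline. The flag letter is from memory: I recall an "a" in sumflags marking an aborted copy, but check your mminfo man page.

```shell
#!/bin/sh
# Sketch: after an interrupted "nsrclone -m", check each instance of the
# ssids from the input list. The real query would be something like:
#   mminfo -q "ssid=$ssid" -r "ssid,volume,ssflags,sumflags,clflags"

# Simulated captured output: ssid, volume, sumflags
cat > mminfo.out <<'EOF'
1111 DISK001 cb
1111 TAPE042 cb
2222 DISK001 cb
2222 TAPE042 cb
4444 TAPE042 ca
EOF

# Report every instance whose sumflags contain "a" (aborted, if I remember
# the letters right): that is the copy to delete manually, e.g. with
# nsrmm -d -S ssid/cloneid.
awk '$3 ~ /a/ { print $1, "on", $2 }' mminfo.out
```

Here that prints only the tape instance of 4444; the complete disk copies of 1111-3333 you then remove by hand, as described above.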

Did you also use the -F option? (I didn't.)

So far, I have not noticed any case where nsrclone -m would remove an ssid which was not migrated. I know that for sure, because where it failed it would leave a mark in the media database, and I could see that through the number of copies and the flags associated with them. What is different in your case is that something apparently happened during the nsrclone run, and it is not clear to me whether that was at the end or in the middle, or whether you used -F. I have not had instances with -m where nsrclone ran to the end and reported some ssids as not cloned, so I can't exactly verify what you are seeing. I'm not even sure how we could test it (perhaps by having ssids on multiple volumes and unmounting one before nsrclone -m gets to it; really no idea if that would work, as NW would most likely just wait for the volume). When you did this, were all ssids on the same volume or on multiple volumes? I think you would have a golden ticket if you could make NW think something is wrong with an ssid, so that it is not migrated, and then have NW delete it at the end. While one may suspect something like this based on the message you showed in your test, it might also just be bogus message parsing, which lame coders sometimes produce.

I know there was one bug fixed with the description "Nsrclone command does not assign correct values to ssid's when -m, -w and -y flags specified from the command line", but I really do not know the details (I suspect the values are the browse and retention policies applied when migrating with those switches). Perhaps, and that's a long shot, you hit that one. That still doesn't explain why some ssids were not migrated, unless they were related to a cover ssid which was migrated first, assigned the wrong expiration, and then got removed during the migration run by the next nsrim pass (again, a long shot).

1 Rookie

 • 

116 Posts

August 31st, 2015 23:00

Hi,

First of all, thank you for your comments. I really hoped that you would respond to this topic.

I'll do this test again with more ssids; picking only 3 savesets was maybe not a good idea. We will see whether nsrclone -m gets paranoid after an error and stops removing the following savesets even if they were successfully staged, or whether it continues to delete them from the source volume, skipping only the one mixed up with the failed one. Another test idea is to use an adv_file device and break something with the saveset itself during cloning.

I've never used the -F option.

Back to the original issue: I was probably cloning from multiple volumes. I don't have the nsrclone process logs or the ssid list, but from daemon.log it is clear that there was a cloning session reading from the volume where my savesets were.

I know how I built my ssid_list file: I listed the savesets by pool, savetime and retention, and summed their sizes (I had to migrate many, many TBs).

If the total size is too big, I decrease the savetime window until it fits (NetWorker is unable to clean the source volume while cloning is running, so I generally migrated at most 10-20 TB per session).
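The batch selection I describe can be sketched like this. The field layout and numbers are made up for illustration; in real life the candidate list would come from an mminfo report (shown in the comment).

```shell
#!/bin/sh
# Sketch: pick a migration batch under a size cap, oldest savesets first.
# Real input would come from something like:
#   mminfo -q "pool=DATABASE" -r "ssid,savetime,totalsize"

# Simulated candidates: ssid, savetime, size (pretend GB)
cat > candidates.txt <<'EOF'
1111 2015-08-01 5000
2222 2015-08-02 8000
3333 2015-08-03 9000
4444 2015-08-04 7000
EOF

CAP=15000   # per-session cap, e.g. 15 TB expressed in GB

# Take savesets oldest-first until the next one would exceed the cap.
sort -k2 candidates.txt | awk -v cap="$CAP" '
    { if (sum + $3 > cap) exit; sum += $3; print $1 }
' > ssid_list.txt

cat ssid_list.txt
```

With this sample data the batch ends up as 1111 and 2222 (13000 of 15000); 3333 would push it over the cap and waits for the next session.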

So: the cloning was running, NetWorker logs nothing else to daemon.log during cloning, but when the migration finishes it performs a volume cleaning operation (like nsrstage -C -V [volume]) which does produce log entries ("Deleted xxxx GB from save set xxxx on volume xxxx"). My missing ssids are there...
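This is how I cross-check those deletion lines against the migration input. The log wording below mimics what I saw in daemon.log; treat it as an example, not an exact format guarantee.

```shell
#!/bin/sh
# Sketch: extract the ssids from "Deleted ... from save set ..." lines in
# daemon.log, to compare against the ssid list fed to nsrclone -m.

# Simulated daemon.log fragment
cat > daemon.log <<'EOF'
nsrd: Deleted 120 GB from save set 1111 on volume DISK001
nsrd: some unrelated message
nsrd: Deleted 300 GB from save set 9999 on volume DISK001
EOF

grep 'Deleted .* from save set' daemon.log |
    sed 's/.*from save set \([0-9][0-9]*\) .*/\1/'
```

Any ssid printed here that was never in the migration list (9999 in this sample) is exactly the kind of evidence I am missing for the lost savesets.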

After that the tape gets unmounted, so it is also visible that the migration operation finished. This happened at around 2am, so I'm sure that no one was running nsrmm to delete those savesets. The "neighbours" of the missing savesets are on tapes with that clonetime.

This issue happened in 3 sessions on 2 systems (NetWorker 8.0 & 8.1).

I have many facts and pieces of evidence which clearly show me that those savesets disappeared during cloning, but much of it is not backed by logs, e.g.:

- I have a daily mminfo report. This is the only place where we still have the ssid, name, savetime and ssretent of the missing savesets.

- the cloning session can be tracked from daemon.log, but not in the form "cloning session #1234 started... did X... #1234 finished"

- I can say how I cloned, what I cloned, and that the neighbours are on tapes, but this is based only on my word, as there are no logs about it

It is also true that if nothing is wrong with nsrclone, then I could have done the same thing manually (deleted the savesets with nsrmm) and I would have exactly the same logs.

However, I believe that if the server doesn't keep enough logs, then the vendor should trust the customer more. At the beginning I think our SR was handled well, but later, when they were unable to reproduce the issue or recover the data, it seemed to me that they were focusing on closing the incident, and we never reached the "code team". Since they didn't find the root cause (and I guess they won't find it this way), I think there is a huge risk that the issue will happen again, to us or to other users.

We will upgrade to the latest 8.1 soon and use nsrclone -m again, but we will always use verbose logging and keep those logs. I hope we won't hit this issue again.

I'll let you know the results of the migration test and saveset deletion with more savesets.

4 Operator

 • 

14.4K Posts

September 1st, 2015 08:00

I think that normally, to get to the code team, support needs to be able to reproduce the issue; otherwise the code team has nothing to work with, so I can see this being tough on both ends. Since NSR doesn't log manual deletion of ssids, there is always the possibility of a customer mistake, and then this kind of ticket, when it can't be reproduced, goes on and on.

You could certainly enhance the script and add a couple of things, like running mminfo before and after the nsrclone run, with the whole output captured. Someone may still argue you could edit the log file, but then where do such discussions end? One thing to do, perhaps, would be to collect mminfo -q ssid=$ssid -S for each ssid you are moving, before and after the operation, if you wish to have really complete ssid knowledge.
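A minimal sketch of that before/after capture. The snapshot function here just stands in for the mminfo -q ssid=$ssid -S call, so the flow is testable without a NetWorker server; the actual nsrclone invocation is left as a comment.

```shell
#!/bin/sh
# Sketch: record a full record per ssid before the migration, repeat after,
# and diff the two captures as evidence of what nsrclone actually changed.

printf '1111\n2222\n' > ssid.lst

snapshot() {
    # real version: mminfo -q "ssid=$1" -S
    echo "record for ssid $1"
}

while read ssid; do snapshot "$ssid"; done < ssid.lst > before.out

# ... run the nsrclone -m migration here, with its output captured ...

while read ssid; do snapshot "$ssid"; done < ssid.lst > after.out

diff before.out after.out && echo "no change in ssid records"
```

Any ssid that vanished during the run then shows up directly in the diff, instead of having to be reconstructed from a daily report afterwards.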

I assume you are moving from disk to tape? In that case, what I normally do is create a global (total) list of ssids to be migrated, then query mminfo to create lists per source volume, and then run one nsrclone per source-volume list (of course, you don't want to run more of those than you have tape drives, and normally you want to keep one tape drive unaffected). This gives much better performance when cloning or staging from multiple source volumes. What kind of disk device was the source? File, AFTD or DD Boost? And to expand: did you also have clones, and did you use the ssid alone in the input file or the ssid/cloneid pair?

1 Rookie

 • 

116 Posts

September 2nd, 2015 00:00

I agree that they have to "protect" the developer team, but we had a data loss issue, and at the beginning the case was handled very well: they automatically escalated it, we were working with the "solution recovery team" etc., and they told me that data loss is very rare and that they take it seriously. So I expected that we would reach the highest available levels. I still think this issue can hit other customers as well... but now it seems the case will be closed soon.

I'm moving data from DD Boost devices to tapes. Maybe I could run separate cloning sessions per source volume, but nsrclone already works that way: the ssid list is reordered by source volume.

This is also something that seems wrong to me. We have e.g. 3 DD Boost volumes in the DATABASE pool. Target sessions is 12, so if more clients are running backups, the sessions for a specific DB backup may be spread across the volumes. This is fine as long as we are dealing with disk-based volumes, but unfortunately there is no way to force nsrclone to use a preferred saveset order, so e.g. the RMAN pieces end up in random order on the tapes, and that causes heavy positioning during restores from tape.

Running nsrclone separately for each ssid is not an option, since the tape is rewound and the label is read for each clone session...

ssid/cloneid: there were no clones, so I didn't specify the cloneid for the migration.

4 Operator

 • 

14.4K Posts

September 3rd, 2015 04:00

oldhercules wrote:

I'm moving data from ddboost devices to tapes. Maybe I could run separate cloning sessions based on the source volume, but nsrclone is also working that way: the ssid list is reordered by the src volume.

In this case, I was referring more to the number of streams: it sounded as if you used a single input file, while what I suggested would be multiple files with multiple streams at a time (which would also complete the overall operation more quickly).

1 Rookie

 • 

116 Posts

September 3rd, 2015 13:00

OK, I see. I have only 2 physical drives per server. Unfortunately, mixed-stream cloning is not working (yet).

Sometimes I ran two sessions, and you are right; I'm not 100% sure it happened every time, but when I used the same source volume for both sessions, the 2nd session waited until the other session released the volume.

It is also very important not to clone/migrate continuously, because NetWorker can't do volume cleaning during that time.

4 Operator

 • 

14.4K Posts

September 3rd, 2015 14:00

I did see this clash (nsrim vs. read operations) happening in 8.0.x. With 8.2.x (admittedly, on a different OS this time), I no longer see it (though that might be helped a bit by the change in the whole setup, after I moved from big servers to a couple of VM servers and spread the load).
