Networker not Clearing savesets from a DDBoost Device

Question

I hijacked another thread (by accident) I'll put up the pertinent information in one place and hopefully someone else will have an idea. AIX 7.1 Networker 8.2.0.4 (upgraded from 8.1.1.1) DDDOS 5.5.0.8 Discovered 3 weeks ago savesets that should have expired 2 years ago, were recoverable and still on the Data Domain. There is a Sev2 case that has been open for 3 weeks. There is a bug out there that fits these symptoms. 187916 (NW158511) - nsrim -X does not delete expired saveset on DDBOOST devices. The procedure for that bug is to upgrade to a version after 8.1.1.5 then run nsrim -X -v [volumename] and then the process will begin running normally on saveset completion. It does not. Yes I am able to manually delete savesets that have expired. The vast majority of my backup retention policies are a full on the weekend, and incrementals every week night. So I wrote a script that runs every friday during my maintenance window. It uses mminfo to look at ssretent and clretent times for more then 15 days ago, (gives me 2 full chances for a successful full, since they run every weekend.) grabs those ssid/clone id's runs nsrmm on the ssid and then runs nsrstage -C on each volume and then nsrim -x on each ddboost volume. This is a really dangerous procedure, they may well be a good reason some of these SaveSets are still hanging out, but not THOUSANDS of them. Removing the NSRMM line means none of the savesets get cleaned form DDBoost volumes. I am not using Synthetic Fulls. The devices are not marked 'Scan Needed'. 'Recycle' is set to 'Auto' on the volume. The Data Domain Cleaning procedure does run every Tuesday at 6:00am. The 4 hot-fixes released since 8.2.0.4 do not include any related fixes in their release notes.  Since their upgrade procedure for AIX is 'delete it all, and then install again', I'm hesitant to upgrade. I think removing expired savesets is a core functionality of the backup solution; so I intend to keep pushing the case. However, does anyone else have any other 'click this checkbox you moron' ideas?

ble1 · Answer

Since best match to description is fix which has been committed to 8.1.x, I would ask support to check with engineering if this issue is specific to 8.1.x or it needs patch for 8.2.x as well. Second is to use debug to collect more data. I think all other classics you already tried.

As for remove everything/install everything approach, it is pretty much standard on all UNIX boxes (Solaris, HPUX, AIX). I think on Linux update works now, but I also remove it there as well instead as last time I did update it left orphan entry in repository database. I use installp for both install and uninstall on AIX and didn't had any issues.

Davidtgnome · Answer

The first time I upgraded AIX from 8.1.1.1 to 8.2.0.4 (removing and reinstalling) it wouldn't even start. I restored the entire thing from backup as a sev 1 case, did the procedure again and it worked.

Support just came back with "your media database might be corrupt". nsrim-x is SUPPOSED to resolve that... Debug Logs have not proved fruitful either. They claim the flags are not being changed by Networker on old savesets.

I know the scanning and recycling is supposed to happen on save group completion. However, I don't know WHEN that happens. We clone a lot of data to tape, and despite the fact that our clone window is Friday at 15:00 to Friday at Noon they usually don't finish. I'm wondering if the cleaning process isn't running because the last backups reading to and writing from he volume are being canceled at noon on friday, and aren't finishing gracefully.

ble1 · Answer

Could be related (referring to cloning/cleanup - it is something that is supposed to be fixed in 8.1SP3, but I do not know if 8.2 has same issue).

I just found something silly myself. I'm doing migration of HPUX server to Linux one in rather straight forward manner; disable clients on old server, create on new one, and once they are expired on old remove them.

What I found moments ago is a bit strange. I use my script to remove clients which in essence does (top of my head):

- mminfo -avot -q -r 'ssid(64)'

- feeds that to nsrmm to delete it

- does nsrck -RY for client

- does nsrim -X -c

I noticed that while there were save sets to remove, in daemon.raw there was nothing noted that was removed. Which is strange of course

So, it occurred to me that nsrim might have some kind on anchor or relation on status of client in NSR_client config so for test I enabled one client and run script against that one - and it worked just fine. At this point I decided to recreate previously removed clients and try nsrim -X again. That didn't clean anything so I went to rap.log to compare if clientid would be same (it should be still cached by NW) and found that:

- rap.log won't contain clientid information if client was disabled while removed

- rap.log will contain clientid information if client was enabled while removed

Now, CID is record which is not part of NSR_client, but mdb, but regardless it should listed IMHO and I wonder why and how this is related to nsrim doings. And of course, I wonder what will happen to data I removed from mdb, nsrim didn't report to have removed anything which virtually means it still seats somewhere on DD (in theory at least).

ble1 · Answer

Try mminfo -q ssid=39288098895 -S

As for cleanup, I think is clear (at least from what I have seen). In my case as soon as NW reports (XX) removed from device, I also see change on DD on estimated part. Of course, value is not the same and in general it does match difference made by global and local compression (as NW and DD report different values; NW reports native source value while DD reports one after processed). And of course, it is estimate, as this chunk can be still kept if new data will link to it if it happens to land on DD prior to DD cleanup.

Davidtgnome · Answer

The documentation on how Data Domain cleans is vague as hell. It strongly indicates some sort of 2 way communication that the DD cleaning tool uses to determine which files are no longer part of savesets in the MMDB. However the cleaning will run on it's own if your networker happens to be down on Tuesday morning, and clean up a similar amount of space. Not only that, but if you manually clean savesets, the Data Domain doesn't know about the free space for several hours, and then over estimates the amount of space it will clean.

When you run a saveset query on the Media screen, the saveset is still there, after nsrstage, nsrim and even nsrmm. It feels almost like ddboost devices save their information in another table, and they are missing a line of code to update the main saveset table when a change is made, or missing the line to update the ddboost table when the main saveset table is changed.

I just a query that is just dates, on Media>Savesets and you get a different list of savesets on a volume, then you do when you run a query with the same time frame with a specific volume

39288098895 is an ssid that should have expire in 2014, it shows up as recyclable on dd1backups.001. When I change the query to make it specific to that volume, suddenly it's not there.

yet when I run.

mminfo -avot -q "ssid=39288098895"

6095:mminfo: no matches found for the query.

Something inside the database isn't cleaning up/updating.

ble1 · Answer

It seems like mminfo can't find that ssid.  Media>Savesets is supposed to build mminfo command too so no idea from where that ssid came in the first place.  But if they did a setup, it's time for them to check it and see what is going on.

Davidtgnome · Answer

root@sedebaxbkup00 # mminfo -q ssid=39288098895 -S

6095:mminfo: no matches found for the query

Perhaps then the problem I've been struggling to find is in how EMC configured the DDboost devices when they added them? If you are seeing different behavior, and yours is working, that points towards something else...

Davidtgnome · Answer

As an update to this issue:

Support thought the MMDB might be corrupted, that is not the case and they are not seeing the same issue in their lab. They believe it is a configuration issue, and have FINALLY decided to perhaps take a look at my configuration.

Davidtgnome · Answer

Further update, they've changed their minds again!

They are seeing the same symptoms I am, however now the think perhaps it's a problem where networker is unable to remove the savesets from the DDBoost device, Networker engineering is engaging Data Domain Engineering.

CarlosRojas · Answer

Hello David,

Just wondering...if the saveset is no longer in the MDB, where are you finding that saveset not being deleted?

Bear in mind that nsrim will start, but doesn't clean up the savesets immediately, it has to perform further checks before going ahead and actually remove it from the MDB.

Then as soon as nsrim completes these checks and is OK to delete, DD will also delete the saveset from the container.

Then File System clean-up is required on DD to physically eliminate that data.

Thank you,

Carlos

Davidtgnome · Answer

You can find the savesets using MMinfo, but right clicking on the disk volume and clicking savesets, or by looking in any of the reports in Networker, or a saveset query in the gui. They are there.

These savesets, when I first started this fiasco, should have expired in 2013. Now, every week I run a script to manually compile a list of and remove savesets that should have expired more then 15 days previously. What I mean by that is, these should have been set as recyclable and removed on the first of the month. On the 15th of that month I delete them myself because they are still there. These are savesets that should have had at least 2 weeks (and 2 full cleaning cycles on the data domain) to clear on their own. My backup schedules are such that none should have any dependencies more then one week, 2 weeks in the event a full fails. (Anything kept longer then 1 month is stored on a different ddboost device).

On average I manually remove 150 or so savesets, that should have had 2 weeks to clear themselves, each week.

All of your points are correct, and have been accounted for. There is still a failure.

rtravellin · Answer

Hey David, could you share the script you created?

Davidtgnome · Answer

EXPIRE=$(/opt/freeware/bin/date --date='15 days ago' +%m/%d/%y) TODAY=$(/opt/freeware/bin/date +%m/%d/%y) SOURCE=/tmp/cleanup.txt LOG=/tmp/cleanuplog.txt #CLEANLOG=/tmp/cleanlog.txt MAILX=/usr/bin/mailx COUNTER=0 MAILTO='allsortsofadmin@doman.tld' `/usr/bin/mminfo -avot -q 'ssretent < $EXPIRE' -q 'clretent < $EXPIRE' -r ssid,cloneid >$SOURCE` `/usr/bin/mminfo -avot -q 'ssretent < $EXPIRE' -q 'clretent < $EXPIRE' -r 'client(25),name(40),savetime(8),ssretent(8),clretent(8),ssflags(8),level(5),ssid(10),cloneid(10)' >$LOG` #This was an extra command requested by support. #`/usr/bin/mminfo -avot -q 'ssretent < $EXPIRE' -q 'clretent < $EXPIRE' -r 'name,group,client,clientid,savetime(20),ssflags,sumflags,ssid,cloneid,level,ssbrowse(20),ssretent(20),clretent(20),ssflags,sumflags,clflags,volume' >$LOG` while read line; do let COUNTER=COUNTER+1 ssid=$(echo $line | tr ' ' /) #Can be used to force it to change expire time instead. # nsrmm -S $ssid -e '10/10/2015' -y # nsrmm -dS $ssid -y done <$SOURCE `/usr/bin/nsrstage -C -V ArchiveLogs.001 >$LOG` `/usr/bin/nsrstage -C -V CECArchiveLogsClone.001 >$LOG` `/usr/bin/nsrstage -C -V CECBootStrap.001 >$LOG` `/usr/bin/nsrstage -C -V DBExports3YDDClone.001 >$LOG` `/usr/bin/nsrstage -C -V dd1backups.001 >$LOG` `/usr/bin/nsrstage -C -V Mac.001 >$LOG` `/usr/bin/nsrstage -C -V NDMP.001 >$LOG` `/usr/bin/nsrstage -C -V OracleDB.001 >$LOG` `/usr/bin/nsrim -X >$CLEANLOG` cat $LOG | ${MAILX} -s '[${COUNTER}] Savesets removed for [${TODAY}] from prior to [${EXPIRE}]' ${MAILTO}

Davidtgnome · Answer

I think nsrim isn't actually running successfully, so far as I can tell NSRIM only runs if the volume isn't in use AND when a backup completes successfully. We have clones running right to the end of the backup window reading from the DD volumes in question. I have to cancel 3 - 4 backups every week to eject tapes in time for the vendor to take them, and to begin the cycle again.

(It's not what I want it's what management wants, I know it's dumb)

I have a feeling when you cancel a backup, nsrim isn't called, because it doesn't complete normally. Which means the savests aren't being set to recyclable, so data domain doesn't actually know it CAN delete them. Believe it or not, this was fine in older versions of networker IE 7.2/7.4

Latest Update on my almost 3 month old sev 2 case is they plan a major bug fix in 8.2.2, which is still in development, but i'm not the only one seeing a problem.

However to answer your question:

/usr/bin/mminfo -avot -q "ssretent < $EXPIRE" -q "clretent < $EXPIRE" -r "client(25),name(40),savetime(8),ssretent(8),clretent(8),ssflags(8),level(5),ssid(10),cloneid(10)"

Where $EXPIRE is the correct format for a date 15 days ago, jsut ran it and it produced 9,000 safesets, It was last run 2 weeks ago. Part of the script uses nsrmm to remove the savesets by force... A recommendation of the engineer I don't particularly agree with.

rtravellin · Answer

Thanks David. About how long after running your script does Data Domain see new cleanable data? Is it instant?

NetWorker

Was this post helpful?