ble1
4 Operator
•
14.4K Posts
0
June 4th, 2015 03:00
Since the best match to your description is a fix that has been committed to 8.1.x, I would ask support to check with engineering whether this issue is specific to 8.1.x or whether it needs a patch for 8.2.x as well. Second, use debug mode to collect more data. I think you have already tried all the other classics.
As for the remove-everything/install-everything approach, it is pretty much standard on all UNIX boxes (Solaris, HPUX, AIX). I think updating works on Linux now, but I remove and reinstall there as well, since the last time I did an update it left an orphan entry in the repository database. I use installp for both install and uninstall on AIX and haven't had any issues.
Davidtgnome
1 Rookie
•
66 Posts
0
June 4th, 2015 04:00
The first time I upgraded AIX from 8.1.1.1 to 8.2.0.4 (removing and reinstalling) it wouldn't even start. I restored the entire thing from backup as a sev 1 case, did the procedure again, and it worked.
Support just came back with "your media database might be corrupt". nsrim -X is SUPPOSED to resolve that... Debug logs have not proved fruitful either. They claim the flags are not being changed by NetWorker on old savesets.
I know the scanning and recycling is supposed to happen on save group completion. However, I don't know WHEN that happens. We clone a lot of data to tape, and despite the fact that our clone window runs from Friday at 15:00 to the following Friday at noon, the clones usually don't finish. I'm wondering if the cleaning process isn't running because the last backups reading from and writing to the volume are being canceled at noon on Friday and aren't finishing gracefully.
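One way to test that suspicion is to look for nsrim activity in daemon.raw around the time the group ends. A sketch: the real log lives at /nsr/logs/daemon.raw, the exact nsrim message text varies by NetWorker version, and the sample lines below are fabricated so the check can be demonstrated offline.

```shell
# Sketch: check daemon.raw for evidence that nsrim actually ran after the
# group finished. The log path and the "nsrim run started" text are
# placeholders; substitute what your daemon.raw really contains.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
06/12/15 12:01:03 nsrd NSR info: nsrim run started
06/12/15 12:05:44 nsrd NSR info: savegroup Weekly_Full completed
EOF

# Count log lines mentioning nsrim around the window's end:
NSRIM_RUNS=$(grep -c nsrim "$LOG")
echo "nsrim entries found: $NSRIM_RUNS"
rm -f "$LOG"
```

If canceled groups never produce an nsrim entry while clean completions do, that would support the theory that cancellation skips the cleanup pass.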
ble1
4 Operator
•
14.4K Posts
0
June 4th, 2015 04:00
Could be related (referring to cloning/cleanup); it is something that is supposed to be fixed in 8.1 SP3, but I do not know if 8.2 has the same issue.
I just found something silly myself. I'm migrating an HPUX server to a Linux one in a rather straightforward manner: disable the clients on the old server, create them on the new one, and once their save sets have expired on the old server, remove them.
What I found moments ago is a bit strange. I use my script to remove clients, which in essence does (off the top of my head):
- mminfo -avot -q client=<name> -r 'ssid(64)'
- feeds that to nsrmm to delete them
- runs nsrck -R -Y for the client
- runs nsrim -X -c <client>
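The steps above can be sketched as a dry-run shell loop. Everything here is illustrative: CLIENT and the ssid/cloneid pairs are made up, and list_ssids stands in for the real mminfo query; dropping the echo prefixes would make the nsrmm/nsrck/nsrim calls live.

```shell
# Dry-run sketch of the client-removal sequence. CLIENT is a placeholder;
# list_ssids simulates:  mminfo -avot -q "client=$CLIENT" -r 'ssid(64)'
CLIENT="oldhost.example.com"

list_ssids() {
  printf '%s\n' "4221954337/1430000001" "4221954338/1430000002"
}

CMDS=$(
  list_ssids | while read -r ssid; do
    echo "nsrmm -dS $ssid -y"          # delete the save set clone from the mdb
  done
  echo "nsrck -R -Y $CLIENT"           # purge the client file index
  echo "nsrim -X -c $CLIENT"           # cross-check the media database
)
echo "$CMDS"
```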
I noticed that while there were save sets to remove, nothing was noted in daemon.raw as removed. Which is strange, of course.
So, it occurred to me that nsrim might have some kind of anchor or relation to the status of the client in the NSR_client config, so as a test I enabled one client and ran the script against that one - and it worked just fine. At this point I decided to recreate the previously removed clients and try nsrim -X again. That didn't clean anything, so I went to rap.log to compare whether the clientid would be the same (it should still be cached by NW) and found that:
- rap.log does not contain clientid information if the client was disabled when it was removed
- rap.log does contain clientid information if the client was enabled when it was removed
Now, the CID is a record which is not part of NSR_client but of the mdb; regardless, it should be listed IMHO, and I wonder why and how this is related to nsrim's doings. And of course, I wonder what will happen to the data I removed from the mdb: nsrim didn't report having removed anything, which virtually means it still sits somewhere on the DD (in theory at least).
ble1
4 Operator
•
14.4K Posts
0
June 4th, 2015 05:00
Try mminfo -q ssid=39288098895 -S
As for cleanup, I think it is clear (at least from what I have seen). In my case, as soon as NW reports (XX) removed from the device, I also see a change in the estimated figure on the DD. Of course, the value is not the same; in general the difference matches what global and local compression account for (NW and DD report different values: NW reports the native source value while DD reports the value after processing). And of course, it is an estimate, as a chunk can still be kept if new data links to it, if that data happens to land on the DD prior to DD cleanup.
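A toy illustration of why the two numbers differ; the 10x ratio is an arbitrary assumption for the sake of the arithmetic, not a measured value.

```shell
# Toy arithmetic only: NW reports the native (pre-dedupe) size of what it
# removed; DD estimates physical space after global+local compression.
# DEDUPE_RATIO=10 is an assumed overall compression factor.
NATIVE_GIB=500
DEDUPE_RATIO=10
PHYSICAL_GIB=$((NATIVE_GIB / DEDUPE_RATIO))
echo "NW reports ${NATIVE_GIB} GiB removed; DD may reclaim ~${PHYSICAL_GIB} GiB"
# ...and even that is an upper bound: segments still referenced by other
# save sets (or by new data landing before the clean) are not freed.
```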
Davidtgnome
1 Rookie
•
66 Posts
0
June 4th, 2015 05:00
The documentation on how Data Domain cleaning works is vague as hell. It strongly indicates some sort of two-way communication that the DD cleaning tool uses to determine which files are no longer part of savesets in the MMDB. However, the cleaning will run on its own if your NetWorker happens to be down on Tuesday morning, and clean up a similar amount of space. Not only that, but if you manually clean savesets, the Data Domain doesn't know about the free space for several hours, and then overestimates the amount of space it will clean.
When you run a saveset query on the Media screen, the saveset is still there after nsrstage, nsrim, and even nsrmm. It feels almost like DDBoost devices save their information in another table, and they are missing a line of code to update the main saveset table when a change is made, or missing the line to update the DDBoost table when the main saveset table is changed.
If I run a query that is just dates on Media > Save Sets, I get a different list of savesets on a volume than I do when I run a query with the same time frame plus a specific volume.
39288098895 is an ssid that should have expired in 2014; it shows up as recyclable on dd1backups.001. When I change the query to make it specific to that volume, suddenly it's not there.
Yet when I run:
mminfo -avot -q "ssid=39288098895"
6095:mminfo: no matches found for the query.
Something inside the database isn't cleaning up/updating.
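One mechanical way to enumerate such "phantom" ssids is to diff a GUI/report export against mminfo output for the same volume. A sketch with made-up list contents; in practice the first list would come from an NMC save set query export and the second from something like mminfo -avot -q volume=dd1backups.001 -r ssid.

```shell
# Sketch: find ssids that a GUI/report export lists but mminfo no longer
# returns. The two lists below are fabricated for illustration.
GUI_LIST=$(mktemp); MM_LIST=$(mktemp)
printf '%s\n' 39288098895 40000000001 40000000002 | sort > "$GUI_LIST"
printf '%s\n' 40000000001 40000000002 | sort > "$MM_LIST"

# comm -23 prints lines present in the first file but absent from the second:
PHANTOMS=$(comm -23 "$GUI_LIST" "$MM_LIST")
echo "phantom ssids: $PHANTOMS"
rm -f "$GUI_LIST" "$MM_LIST"
```

A weekly diff like this would at least quantify how far the two views have drifted apart.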
ble1
4 Operator
•
14.4K Posts
1
June 4th, 2015 06:00
It seems like mminfo can't find that ssid. Media > Save Sets is supposed to build an mminfo command too, so I have no idea where that ssid came from in the first place. But if support did set something up, it's time for them to check it and see what is going on.
Davidtgnome
1 Rookie
•
66 Posts
0
June 4th, 2015 06:00
root@sedebaxbkup00 # mminfo -q ssid=39288098895 -S
6095:mminfo: no matches found for the query
Perhaps, then, the problem I've been struggling to find is in how EMC configured the DDBoost devices when they added them? If you are seeing different behavior and yours is working, that points towards something else...
Davidtgnome
1 Rookie
•
66 Posts
0
June 11th, 2015 06:00
As an update to this issue:
Support thought the MMDB might be corrupted; that is not the case, and they are not seeing the same issue in their lab. They believe it is a configuration issue, and have FINALLY decided to perhaps take a look at my configuration.
Davidtgnome
1 Rookie
•
66 Posts
0
June 11th, 2015 12:00
Further update: they've changed their minds again!
They are seeing the same symptoms I am; however, now they think perhaps it's a problem where NetWorker is unable to remove the savesets from the DDBoost device. NetWorker engineering is engaging Data Domain engineering.
CarlosRojas
1.7K Posts
0
June 14th, 2015 23:00
Hello David,
Just wondering... if the saveset is no longer in the MDB, where are you finding that saveset not being deleted?
Bear in mind that nsrim will start but doesn't clean up the savesets immediately; it has to perform further checks before going ahead and actually removing them from the MDB.
Then, as soon as nsrim completes these checks and is OK to delete, DD will also delete the saveset from its container.
Then file system clean-up is required on the DD to physically eliminate that data.
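The stages of that chain can be checked independently. A hedged sketch: DD_HOST and the admin user are placeholders, the DD OS CLI syntax should be verified against your DD OS version, and every command is echoed rather than executed.

```shell
# Dry-run sketch of checking each stage of the nsrim -> DD delete ->
# filesys clean chain. Nothing here runs against a live server; the
# commands are only printed.
DD_HOST=dd1.example.com
SSID=39288098895

CHECKS=$(
  # 1) Is the save set really gone from the media database?
  echo "mminfo -q ssid=$SSID -S"
  # 2) Watch DD space accounting before and after nsrim:
  echo "ssh sysadmin@$DD_HOST filesys show space"
  # 3) Trigger/monitor the physical clean that finally frees the blocks:
  echo "ssh sysadmin@$DD_HOST filesys clean start"
  echo "ssh sysadmin@$DD_HOST filesys clean status"
)
echo "$CHECKS"
```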
Thank you,
Carlos
Davidtgnome
1 Rookie
•
66 Posts
0
June 15th, 2015 04:00
You can find the savesets using mminfo, by right-clicking on the disk volume and clicking savesets, by looking in any of the reports in NetWorker, or via a saveset query in the GUI. They are there.
These savesets, when I first started this fiasco, should have expired in 2013. Now, every week I run a script to manually compile a list of, and remove, savesets that should have expired more than 15 days previously. What I mean by that is: these should have been set as recyclable and removed on the first of the month. On the 15th of that month I delete them myself because they are still there. These are savesets that should have had at least 2 weeks (and 2 full cleaning cycles on the Data Domain) to clear on their own. My backup schedules are such that none should have any dependencies more than one week old, 2 weeks in the event a full fails. (Anything kept longer than 1 month is stored on a different DDBoost device.)
On average I manually remove 150 or so savesets each week, savesets that should have had 2 weeks to clear themselves.
All of your points are correct, and have been accounted for. There is still a failure.
rtravellin
2 Posts
0
July 16th, 2015 20:00
Hey David, could you share the script you created?
Davidtgnome
1 Rookie
•
66 Posts
0
July 17th, 2015 05:00
EXPIRE=$(/opt/freeware/bin/date --date="15 days ago" +%m/%d/%y)
TODAY=$(/opt/freeware/bin/date +%m/%d/%y)
SOURCE=/tmp/cleanup.txt
LOG=/tmp/cleanuplog.txt
CLEANLOG=/tmp/cleanlog.txt
MAILX=/usr/bin/mailx
COUNTER=0
MAILTO="allsortsofadmin@doman.tld"
/usr/bin/mminfo -avot -q "ssretent < $EXPIRE" -q "clretent < $EXPIRE" -r ssid,cloneid >$SOURCE
/usr/bin/mminfo -avot -q "ssretent < $EXPIRE" -q "clretent < $EXPIRE" -r "client(25),name(40),savetime(8),ssretent(8),clretent(8),ssflags(8),level(5),ssid(10),cloneid(10)" >$LOG
#This was an extra command requested by support.
#/usr/bin/mminfo -avot -q "ssretent < $EXPIRE" -q "clretent < $EXPIRE" -r "name,group,client,clientid,savetime(20),ssflags,sumflags,ssid,cloneid,level,ssbrowse(20),ssretent(20),clretent(20),ssflags,sumflags,clflags,volume" >$LOG
while read line
do
  let COUNTER=COUNTER+1
  # Convert "ssid cloneid" to the ssid/cloneid form nsrmm expects.
  ssid=$(echo $line | tr " " /)
  #Can be used to force a new expire time instead:
  # nsrmm -S $ssid -e "10/10/2015" -y
  nsrmm -dS $ssid -y
done <$SOURCE
/usr/bin/nsrstage -C -V ArchiveLogs.001 >>$LOG
/usr/bin/nsrstage -C -V CECArchiveLogsClone.001 >>$LOG
/usr/bin/nsrstage -C -V CECBootStrap.001 >>$LOG
/usr/bin/nsrstage -C -V DBExports3YDDClone.001 >>$LOG
/usr/bin/nsrstage -C -V dd1backups.001 >>$LOG
/usr/bin/nsrstage -C -V Mac.001 >>$LOG
/usr/bin/nsrstage -C -V NDMP.001 >>$LOG
/usr/bin/nsrstage -C -V OracleDB.001 >>$LOG
/usr/bin/nsrim -X >$CLEANLOG
cat $LOG | ${MAILX} -s "[${COUNTER}] Savesets removed for [${TODAY}] from prior to [${EXPIRE}]" ${MAILTO}
Davidtgnome
1 Rookie
•
66 Posts
0
July 17th, 2015 05:00
I think nsrim isn't actually running successfully; as far as I can tell, nsrim only runs if the volume isn't in use AND a backup completes successfully. We have clones running right to the end of the backup window, reading from the DD volumes in question. I have to cancel 3 or 4 backups every week to eject tapes in time for the vendor to take them and to begin the cycle again.
(It's not what I want, it's what management wants; I know it's dumb.)
I have a feeling that when you cancel a backup, nsrim isn't called, because the backup doesn't complete normally. Which means the savesets aren't being set to recyclable, so Data Domain doesn't actually know it CAN delete them. Believe it or not, this was fine in older versions of NetWorker, e.g. 7.2/7.4.
The latest update on my almost 3-month-old sev 2 case is that they plan a major bug fix in 8.2.2, which is still in development; at least I'm not the only one seeing the problem.
However, to answer your question:
/usr/bin/mminfo -avot -q "ssretent < $EXPIRE" -q "clretent < $EXPIRE" -r "client(25),name(40),savetime(8),ssretent(8),clretent(8),ssflags(8),level(5),ssid(10),cloneid(10)"
Where $EXPIRE is a correctly formatted date 15 days ago. I just ran it and it produced 9,000 savesets; it was last run 2 weeks ago. Part of the script uses nsrmm to remove the savesets by force... a recommendation of the engineer that I don't particularly agree with.
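For anyone reproducing this, a minimal sketch of the cutoff computation plus a dry run of the query. It assumes GNU date ("--date" is a GNU extension, which is why the script calls /opt/freeware/bin/date on AIX); the mminfo invocation is echoed, not executed.

```shell
# Compute the 15-days-ago cutoff in the mm/dd/yy form mminfo accepts,
# then print (rather than run) the query that lists overdue save sets.
EXPIRE=$(date --date="15 days ago" +%m/%d/%y)
QUERY="/usr/bin/mminfo -avot -q \"ssretent < $EXPIRE\" -q \"clretent < $EXPIRE\" -r ssid,cloneid"
echo "$QUERY"
```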
rtravellin
2 Posts
0
July 17th, 2015 17:00
Thanks David. About how long after running your script does Data Domain see new cleanable data? Is it instant?