jthvedt
November 6th, 2019 12:00
Update: I believe dedupe has stopped completely, and I'm only getting local compression. My daily autosupport emails now say my "Global-Comp Factor" is 1.0x.
Any idea what would cause that?
mahmoud m
November 7th, 2019 11:00
I would recommend the following steps; they may help you narrow down the investigation:
----------------------------------------------
1. Check the snapshots on the system:
# snapshot list mtree *
If you see old snapshots that shouldn't be there, expire them. (Be careful with the system-created snapshots: don't expire them unless you are 100% sure of what you are doing.) Also make sure you don't have replication lag or disabled replication contexts.
2. Ask the RMAN administrators and the Networker team whether they made any changes to the backup scripts, deletions, retention, etc. Also ask this question: did anyone enable encryption or pre-compression?
3. Check the output of the following command to see the daily ingest:
# filesys show compression tier active daily-detailed last 120 days
4. From the autosupports, compare older reports (13 days ago, 7 days ago, today). Compare the "mtree list" outputs to see whether any mtree shows a significant increase or decrease in size, and which mtrees have lower or higher ingest. Also check the "file distribution" output and make sure there are no old files that have passed their retention period.
5. Check from the Networker side whether new clients have been added or old clients deleted.
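If it helps, the checks above can be run in one pass from the DD CLI. These are the commands already mentioned in this thread, collected together; exact syntax can vary a little between DDOS releases, and the mtree path is a placeholder:

```
# snapshot list mtree *
# mtree list
# filesys show compression daily-detailed last 120 days
# mtree show compression /data/col1/your_mtree daily-detailed
```

Comparing the per-mtree daily-detailed output against the filesys-level output is a quick way to see whether one mtree is responsible for the drop in global-comp factor.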
jthvedt
November 8th, 2019 12:00
1. No snapshots for any mtree
2. No changes to RMAN (for sure -- I *am* the RMAN admin). No known changes to Networker, but I haven't entirely ruled this out.
3. My DD didn't like "tier" or "active", but that command minus those two parameters shows very little variation over time, except that 10 days ago my global-comp factor went from roughly 2.5x to exactly 1.0x. And of course my post-comp size more than doubled.
4. Unfortunately I don't have my auto-support emails from before the trouble started.
5. No changes in Networker clients.
A bit of additional information: we don't use Data Domain replication. RMAN does a backup to Networker, which uses one DD as a target, then Networker clones the backup to another DD. Both DDs started showing the same behavior at the same time. So it seems like it might be something either RMAN or Networker is doing. But I sure don't know what that might be.
Since my global-comp ratio is 1.0x, and I can't get it to improve, I tried improving the local-comp ratio by moving from lz to gzfast. It worked, but my Oracle backup was much too slow, so I'm switching back to lz.
jthvedt
November 18th, 2019 07:00
I'm seeing a little de-dupe now, but not much. As reported by the "mtree show compression" command, the last seven daily global comp factor numbers were 1.1, 1.0, 1.1, 1.0, 1.1, 1.4, and 1.1.
I still have no idea what triggered the change 20 days ago, nor why it got slightly better. I've had to reduce my backup retention duration a little so I don't completely run out of space.
I'm in the middle of making another change that might have an effect: along with regular Oracle RMAN backups, this mtree also holds backups of compressed Oracle export files. That isn't anything new; those compressed files have been backed up for years, since long before this problem started. But the host that holds them has enough space to write uncompressed exports, so I'm switching to that. Four databases (two large, two small) produce these export files; I'll start with one of the large ones and see how it goes. If everything were working correctly, I'd expect better results: the compression on the DD side should be roughly similar to the host-side compression, and de-dupe should be much better. But obviously not everything is working correctly. I'll check the numbers tomorrow after the backup.
jthvedt
November 27th, 2019 11:00
Since I last updated this thread, I've been through a roller-coaster of a little success and a lot of failure.
For the first time, I tried the Data Domain recommended RMAN setting of FILESPERSET = 1. I had never set this explicitly, and the default is 64. Setting it to 1 prevents RMAN from multiplexing Oracle database files into larger files before sending them to the backup target. If the files are multiplexed, Data Domain gets a vastly different data stream each time, and can't do its de-duplication magic.
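For anyone following along: FILESPERSET is a BACKUP-command option rather than a persistent CONFIGURE setting, so it has to go in the backup script itself. A minimal sketch (the channel name and device type are illustrative; your Networker channel allocation will differ):

```
RUN {
  ALLOCATE CHANNEL ch1 DEVICE TYPE 'SBT_TAPE';  # media-manager (Networker) channel
  BACKUP DATABASE FILESPERSET 1;                # one datafile per backup set: no multiplexing
  RELEASE CHANNEL ch1;
}
```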
So I made that change, and sure enough, starting with the second backup with FILESPERSET = 1, I was getting the best de-duplication I'd ever had. Global-Comp Factor was around 9x, and Total-Comp Factor was around 20x. Great, right?
Not so great. I couldn't restore any of these backups. Every time I'd get the following error:
RMAN-03002: failure of restore command at 11/25/2019 09:14:26
ORA-27192: skgfcls: sbtclose2 returned error - failed to close file
ORA-19511: Error received from media manager layer, error text:
The call to mm_rend() failed with the message: Server can't decode arguments (1:5:11)
The odd thing is that it would fail at a different spot each time. I could run the same RMAN restore command twice in a row, and the first time it would read 10 files successfully, then fail. The next time it would read 4 files, then fail. So I don't think I have corrupt backup files, but for some reason I can't get them to restore.
Next I tried another RMAN setting, MAXOPENFILES = 1, and I changed FILESPERSET to 16. The MAXOPENFILES setting was an attempt to eliminate the interleaving of Oracle files: instead of file1 part 1, file2 part 1, file3 part 1, file1 part 2, file2 part 2, and so on, I'd be sending all of file1, then all of file2, then all of file3. The theory was that I'd be sending long chunks of data that look mostly the same every day. But that didn't work either; Data Domain couldn't de-dupe this any better than when I had 64 interleaved files. I've since found out that RMAN inserts file and block ID headers and footers into each block, so to Data Domain everything looks new, and it can't de-dupe very well.
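In RMAN terms, that experiment looked roughly like this (channel name illustrative). MAXOPENFILES is a channel option, while FILESPERSET goes on the BACKUP command:

```
RUN {
  ALLOCATE CHANNEL ch1 DEVICE TYPE 'SBT_TAPE' MAXOPENFILES 1;  # read one datafile at a time
  BACKUP DATABASE FILESPERSET 16;                              # up to 16 files per backup set
  RELEASE CHANNEL ch1;
}
```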
I found a Data Domain setting that looked promising, until I saw its restrictions. One can set an mtree option, app-optimized-compression = oracle1, which uses knowledge of those block headers and footers to enable better de-dupe. But it only works with Oracle block sizes of 8K and 16K, and my database has 4K blocks. From what I've read it does no harm with 4K blocks, even if it does no good, so I'm inclined to give it a try. I'd like to convert my database to 8K blocks, but that is a very substantial undertaking, involving a lot of downtime.
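For anyone else looking for this setting, it's applied per mtree from the DDOS CLI. The exact syntax can vary slightly by DDOS release, and the mtree path below is a placeholder:

```
# mtree option set app-optimized-compression oracle1 mtree /data/col1/oracle_backups
# mtree option show mtree /data/col1/oracle_backups
```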
jthvedt
December 3rd, 2019 09:00
Setting the mtree option app-optimized-compression = oracle1 resulted in the same behavior -- de-dupe was better (though not quite as good as FILESPERSET = 1), but I couldn't do a restore. I'd get the same mm_rend() error.
jthvedt
December 11th, 2019 08:00
I haven't figured out why my de-dupe acted the way it did, but after trying many different combinations of RMAN, Networker, and Data Domain settings, I'm finally getting better results than even before this problem started.
I'm back to the RMAN setting FILESPERSET = 1, which I should have had all along, and I'm getting good de-dupe. For the last two days my total comp factor has been 19.5x, and my backups are completing a couple of hours faster. Better yet, I'm able to run an RMAN "restore validate" command to completion without errors. The key was finally getting a different error from the one I had been seeing. I had numerous errors like this, which I was never able to diagnose:
ORA-19511: Error received from media manager layer, error text: The call to mm_rend() failed with the message: Server can't decode arguments (1:5:11)
But one restore attempt resulted in this error:
ORA-19511: Error received from media manager layer, error text: RPC receive operation failed; errno = Connection reset by peer (0:5:73)
For this error I was able to find a fix, and it seems to have fixed the other error as well: in Networker, I changed the group inactivity timeout to "0", which means no timeout at all. The setting is under Configure > Group > [group name] > Group Properties > Advanced > inactivity timeout.
Since I changed that setting, every restore I've tried has completed successfully.