
May 29th, 2015 08:00

Free space is decreasing constantly

Something is eating away the free space on our Data Domain and I can't figure out what. It's decreasing by about 40 GB per hour; within two days (48 hours) 2 TB was gone. There was some old data I could delete, and after a filesys cleaning cycle there were 11.60 TB free again, but it started to go down once more. I noticed there is a debug/tcp log on the Data Domain and I can see the servers that connect to it there, but can I work out from that how much data moves between the servers and the DD? Some rows contain info like "send 23.4Mbps". How do I interpret the info in tcp.log, or how else can I hunt down the space eater? I don't see any big growth in the mtree sizes either.
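For reference, this is roughly what I have been checking from the DD CLI so far (the output format may differ a little between DD OS versions):

filesys show space
mtree list

The first shows the overall used/available/cleanable figures and the second the pre-comp size of each mtree, and so far neither points at anything obvious.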

14.3K Posts

June 1st, 2015 02:00

Are you using the DD directly or via a backup application (or a mix of both)? It could of course be that someone is dumping data which does not dedupe well (like compressed dumps), but without an application managing it centrally it is difficult to see what it is.

33 Posts

June 1st, 2015 04:00

We have mixed usage - NetWorker, VEEAM and some other backups which write in their own ways. I asked around whether there had been any changes recently and was told there weren't. After some time I looked at "filesys show space" and it seems that "Cleanable GiB" increases by roughly the same amount as free space decreases. Our filesystem cleaning schedule was twice a week and I have changed it to every day for now. Still, I haven't noticed such a pattern before: decreasing constantly in small amounts, about 1 TB per 24 hours. I am going to monitor things and see what happens.
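For the record, I keep an eye on the cleaning side with these commands on the box (output differs between DD OS versions; the schedule itself is changed with the matching "filesys clean set schedule" command):

filesys clean show schedule
filesys clean status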

14.3K Posts

June 1st, 2015 05:00

Having it every day is not so bright, but if you need to survive...  Cleanable GiB will grow each time the backup application has finished its own cleanup (for example, in NW that would be the nsrim cleanup).  What you can do - as I assume each application uses its own mtree - is check mtree usage and try to see where the bulk of the growth comes from.  Once you know that, you can use the native tool of the application that is using the DD to get more detail on what data is there (at least with NW you could).
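For example, with NW something along these lines would show how much each client has pushed to the DD device in the last day (the volume name here is just a placeholder for your DD Boost or AFTD volume):

mminfo -avot -q "savetime>=yesterday" -q "volume=DD_VOL.001" -r "client,totalsize"

If one client suddenly jumps, that is where I would dig further with the application's own reporting.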

33 Posts

June 18th, 2015 23:00

I can't say I've figured out exactly what it is, but it has more or less stabilized. I set cleaning to three times a week and free space now moves between 12 and 10 TB between cleanings. I suspect NetWorker, but I can't put my finger on why this started suddenly when for quite a long time there was no such behaviour.

208 Posts

June 19th, 2015 03:00

Cleaning 3 times a week is still too much.

Once a week is the most frequent you should go with, or the filesystem will become fragmented and you will see performance issues.

I appreciate that it can be hard to look at what seems like space you could get back and use, but that figure is only a "best guess" until cleaning actually walks the filesystem and "knows" for sure.

If you have enough free space to cope despite the "cleanable" space figure shown (taunting you to take it NOW), resist the temptation to run it sooner and just let the once-a-week schedule run.

Check the usual suspects: multiplexing turned off, no encrypted or pre-compressed backup data being sent, etc.

If you are using VTL, make sure your tapes are recycling when they expire.

Make sure any replication is in sync, or the data/space can't be reclaimed by cleaning because it is still needed to complete pending replication (a quick way to check is sketched below).

High change rate will obviously require more space.

There is so much to look at when you have data coming in from all over the place without an application to help you figure out what was supposed to have been sent in.
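For the replication point above, a quick sanity check from the CLI (column names vary a little between DD OS releases):

replication status

Any context that shows as lagging or disconnected there is holding on to data that cleaning cannot reclaim yet.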

Regards, Jonathan

66 Posts

June 23rd, 2015 06:00

I ran into the same issue; it turned out that NetWorker wasn't actually cleaning up savesets on its own. I still have a sev 2 case open on that. One of the things I did find out is that the cleaning process re-indexes the entire file system on the Data Domain, so they do not recommend running the clean more than twice per week. The reasoning is that it can actually use more space.

mminfo -avot -q "ssretent<06/08/15" -q "clretent<06/08/15" -r "client,name,sscreate,ssbrowse,ssretent,clretent,ssflags,volume,level,sumflags" | wc -l

This will list the savesets that SHOULD have fallen off the system by those dates. There does seem to be an extra 1-2 week delay, so I try to set the date to 15 days ago.
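A variation of the same query that I also use swaps the row count for a per-saveset report with sizes, so you can see which of the overdue savesets are actually holding the space:

mminfo -avot -q "ssretent<06/08/15" -q "clretent<06/08/15" -r "client,name,ssretent,totalsize"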

33 Posts

November 24th, 2015 02:00

I think I found the culprit at last: Oracle archive logs on one mtree. The pattern matches - about 1 TB of archive logs per day, the data is not dedupe friendly (only compressible), and only a certain number of days' worth is kept, which means old data is continuously deleted while new data appears every hour. That explains the continuous decrease in free space and the matching increase in Cleanable GiB. And once I deleted some older DD snapshots for the mtree and some NetWorker save sets, I recovered quite enough space. I guess it's case closed for me.
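What finally made the pattern visible was looking at how much data lands in the mtree per day rather than at the mtree size itself, roughly like this (the mtree path is just an example, and the exact sub-options differ between DD OS versions):

mtree show compression /data/col1/oracle_arch
snapshot list mtree /data/col1/oracle_arch

The first shows how much pre-comp data was written to the mtree over the last day and week and how poorly it dedupes; the second is how I found the older snapshots that were still pinning space.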

66 Posts

November 24th, 2015 04:00

They finally found a bug for me. It was fixed in earlier versions and then un-fixed in later versions: nsrim will only attempt to run once a day, and if there are any clones or restores reading from any volumes, the process will skip running for that day. My six-and-a-half-day clone window meant the process never actually ran and cleaned the systems... It was fixed AGAIN in 8.2.2.2.

It only took EMC 6 months to figure it out. I had to add recover_space_anytime as an empty flat file in /nsr/debug to enable it, after upgrading to SP2 of course.
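In case it helps anyone else, it really is just an empty file created on the NetWorker server, e.g.:

touch /nsr/debug/recover_space_anytime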

14.3K Posts

November 25th, 2015 05:00

Isn't that a new SP2 feature in general (see "Recover space operation and concurrent read operations")?

66 Posts

November 30th, 2015 04:00

Yes, it was also patched in 8.0.4.1, and again in 8.0.3.5. I guess their cumulative patching isn't really cumulative?

The bug is 187916

14.3K Posts

December 1st, 2015 03:00

I have the impression that 187916 and what they did in SP2 are different things.

66 Posts

December 1st, 2015 10:00

If they are, then the engineer in charge of my support case, his boss, and the regional service manager were incorrect.

Given that they fail to document the changes they make, and how those changes might impact other systems, I feel either is entirely possible.
