Slow filesystem on Celerra - network and backend blamed

Question

Hi All, We have Celerra NS20 with CX3-10 as a backend with 15x1TB 7200 SATA disks in RAID5, presented via CIFS. It has been set up by EMC support guys few years ago and we followed their advice on filesystem setup.At the beginning all data resided in one big filesystme fs01 , 5 TB in size.At a later stage we created three smaller filesystems, from 500 to 1 TB each.Primary filesystem is a production one and fs2-3 are used for archiving and temporary data.There has never been big impact on fs0 and we never had users complaining. Recently however we have noticed that NDMP backups for fs01 have slowed down considerably.We have measured file transfer time ( 700Mb write/read, three runs per each filesystem) and results were shocking - fs01 took 6 minutes on average to write file as opposite to 1:20 minutes for other ones. EMC third line support ran few tests and their conclusion was that it is either network or backend. According to support they found frame-loss is occurring in the direction from the DM to the host.   The backend statistics are inconclusive but showing average Read/Write Server response times of greater than 11ms. As a results we were advised to check network and backend. This does not sound logical to me.First of all , how network could be a problem if other filesystems have good performance and they are still served via same interface.Secondly NDMP backups are not even touching ethernet and they slowed down. As for responce time, it is not clear either.I was under impression that anything under 14ms is more than acceptable in most environemnts. Do   I miss something here? Kind Regards, Sergei

DanJost · Answer

Have you upgraded or enabled dedupe? After I upgraded to Dart6 and the super-de-dupe ran my NDMP backups increased several hours - but the data on the tape is a lot smaller...when I say several hours, it actually went from 40 to 52 and 90% of it is FC.

Dan

uninitializing · Answer

Thank you Dan,

I did check deduplication as we tested it some time ago but for various reasons it was not enabled anyway.I also deleted checkpoints just in case.

We have auditing enabled on the filesystem but for a time being I did not disable it.

EMC support agreed that network has got nothing to do with the problem.They ran performance collection on Clariion backend and I can see heavy load on Celerra raid groups.

What I don't understand is why other filesystems residing on the same raidgroup are not affected.Some LUNs used by fs01 are heavily used but surely those LUNs are part of the RAID array used by other LUNs allocated to other filesystems.I would expect to see some correlation in degradation on other filesystems as platters are used across entire array, seems like my assumptions are wrong.

Next thing was to identify what slows down the filer.From Celerra logs I could see constant CIFS Ops/sec, above 2000 , day and night - this did not look normal to me.Throughput data graph did not really correlate with Ops/sec, however IOPS/sec on Clariion array did.

First thing I did is ran server_tcpdump on the filer to see what IP creates most packets as this would almost probably correlate to Celerra CIFS Ops/sec.Seems that Backup Exec server doing NDMP backup was generating quite a lot of packets.This was a surprise for me as I would expect NDMP ethernet traffic to be very light, but again, it seems like my assumptions were wrong.

Next thing I will try to identify whether bottleneck is iops, throughput, or both.Criteria would be a transfer time for the big file to and from array.It should be as close as possible on all filesystems and reasonably short.

Transfer times definetly improve when backup server is off and CIFS Ops/sec drops to low value.I still need to see how they change when throuhput is high.I have been looking for the tool that would do file transfers on a scheduled basis and graph them but could not find anything that suits me. Most benchmarking tools produce a lot of data without simple graph output.Wha I need is a graph showing transfer times of file size X done every Y minutes.

DanJost · Answer

I didn't think NDMP could generate CIFS traffic - are you sure you are using NDMP and the backups aren't running over CIFS? CIFS-based backups are pretty slow. You can do over-the-wire NDMP backups if you aren't set up for this - there are both EMC and Symantec documents that cover this setup.

Dan

uninitializing · Answer

Celerra is backed up using NDMP backup only.I can see data stream going to the tape during the backup job and CIFS would not be able to sustaing such high speeds.CIFS jobs done through a separate backup server and they are few times sloweAt the same time I can see a lot of traffic on the ethernet side.The only way I can explain it is that Backup exec traverses filesystem for catalogue.

However I would need to consult documentation about it.

Rainer_EMC · Answer

NDMP will never use CIFS It shouldn't be difficult to find out which client is causing the CIFS usage - try server_stats or server_cifs -o audit

Celerra

Was this post helpful?