190 Posts

January 26th, 2011 06:00

Just a thought - or perhaps a shot in the dark...

From your output it looks like you are using jumbo frames (MTU of 9000) - is this enabled on your network infrastructure? Are you doing iSCSI (typically where you see jumbo frames)? If your CGEs aren't in use except for testing, you might want to set the MTU to 1500 and see if that changes anything. I'm not a jumbo frame guru, but I have always been under the assumption that to use jumbo frames correctly they need to be enabled end-to-end (this would include your CIFS clients).
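One quick way to sanity-check the end-to-end part from a Linux client is a don't-fragment ping at jumbo size (the hostname below is a placeholder, and flag spellings vary by OS - this is the Linux `ping` syntax):

```shell
# 8972 = 9000-byte MTU minus 28 bytes of IP + ICMP headers.
# -M do sets the don't-fragment bit so no hop can silently fragment.
# "filer.example.com" is a placeholder for your CIFS server.
ping -M do -s 8972 -c 3 filer.example.com
# Failures here indicate a hop that is not passing 9000-byte frames.
```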

Dan

40 Posts

January 26th, 2011 08:00

I did, for kicks, set up an IP address interface that used an MTU of 9000. I had already set this up a while ago to test backups over jumbo frames. We are only using CIFS for production.

Anyway, I set up the test so the test CIFS server was using this jumbo frame connection. It is on a separate VLAN on my ProCurve switch, which is configured for jumbo frames. The network card is also configured for jumbo frames. There is no difference in the test.

40 Posts

January 26th, 2011 08:00

Interesting point. We don't use jumbo frames and none of our interfaces are set to 9000, so I don't know why this command shows them at 9000. But it is something I may test.

46 Posts

January 28th, 2011 15:00

You can calculate your retransmission percentage as retrans / #sent packets * 100 = retrans%.

EMC wants that to be < 0.01%; between 0.01% and 0.1% is on the verge; > 0.1% is a problem.

You are well within a reasonable value: 66962 / 2022297234 * 100 = 0.0033%, which is < 0.01%.
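Plugging in the numbers from the netstat output earlier in the thread, the same arithmetic as a one-liner (awk is just a convenient calculator here):

```shell
# Retransmission percentage: retransmits / packets sent * 100
awk -v retrans=66962 -v sent=2022297234 \
    'BEGIN { printf "%.4f%%\n", retrans / sent * 100 }'
# prints 0.0033%
```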

I'd suggest opening a case (or escalating the one you still have open?), but if you want to poke around yourself, try running a number of the server_stats commands on the Celerra.

server_stats server_ -i -table

"-table cifs"  Look at the uSec/call column; it's in microseconds, so divide by 1000 to get milliseconds. This should tell you how long it takes the Celerra to perform particular CIFS operations.

"-table dvol"  These are disk stats; look at the number of read/write ops going to particular drives and see if you are hammering any LUN heavily.

"-table fsvol"  You can use this to see the filesystem I/O; there might be a lot of I/O going to a filesystem that should be quiet, competing for resources.

I tend to start with an interval of 1 second first to look for spikes or bursts, and then increase it to 10, 30 or 60 seconds.
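Put together, the invocations above look roughly like this - a sketch only, using the flags quoted in this thread; I'm assuming a Data Mover named server_2, and exact option spellings can vary by DART version:

```shell
# Run from the Control Station. server_2 is an assumed Data Mover name.
# CIFS operation latency (uSec/call; divide by 1000 for milliseconds):
server_stats server_2 -table cifs -i 1
# Per-disk-volume read/write ops, 10-second samples:
server_stats server_2 -table dvol -i 10
# Per-filesystem I/O, 30-second samples:
server_stats server_2 -table fsvol -i 30
```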

Another one: in Unisphere, go into Celerra Monitor and get the Clariion stats. Look for queueing, cache flushes, etc. (writes should go straight to cache on the Clariion; unless the write cache is filling up, they "should" be faster than reads).

Something I've just run into on mine is ufslog issues (hopefully it is fixed; the jury is out as to whether it is or not). Run "server_log server_ " to get a log. I saw that I was hitting the soft threshold multiple times a second; if you are seeing lots of these (not just a handful) you might want to contact support. I had to do a number of run-arounds with them, network traces, etc. I run millions of little files over NFS; with your big files I wouldn't expect this.
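A quick way to gauge how often those messages are showing up - the grep pattern here is an assumption, so match it to whatever the actual log line says on your system, and server_2 is an assumed Data Mover name:

```shell
# Count ufslog soft-threshold messages in the Data Mover log.
# A handful is fine; thousands means talk to support.
server_log server_2 | grep -ci "threshold"
```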

If you want to pull/turn on collection of the backend disk perf stats, check Primus emc177167 on how to do it. If you don't have an Analyzer license, you'll get encrypted .naz files that can only be opened by support. Having that in hand with at least 3 hours already collected avoided some of the latency of "let's turn on stats to collect at least 3 hours of data... it's late in the day so I'll send it to engineering tomorrow and you'll get a detailed analysis the day after".

40 Posts

January 29th, 2011 16:00

Thanks for this information. I have opened a ticket but it does not seem to be going anywhere. In fact, we have opened tickets in more than one office because we are each independently run. Each Celerra was configured independently and they all have the same problem.

I looked at most of the stats you provided but will go through them again. It seems that the backend is not even being hit hard; it seems like the NAS is what is slowing things down. Don't know for sure.

40 Posts

February 3rd, 2011 11:00

We are homing in on the problem of write performance. I can't believe other people have not reported this problem. We almost exclusively use our Celerra servers for CIFS. We use lots of checkpoints on our file systems, setting up jobs to run every 2-3 hours during the day and also weekly. We use most of the maximum number of checkpoints (90 max? or whatever it is).

We are fairly certain that these checkpoints, along with the checkpoints for our Replicator V2 (we also replicate each of our file systems offsite), are the culprit behind the slow writes.

We have set up a file system on the performance pool created from the vault drives for testing. No replication or checkpoints are on this pool. For this pool we get about 115 MB/s read and about 60 MB/s write. Writes are still much slower, but that is more acceptable performance.

Oh, and the other office purchased FAST Cache on their Celerra and it does nothing to help their write performance. A bunch of wasted money if you ask me. I don't have FAST Cache and see equal performance for the type of work we do.

4 Operator

 • 

8.6K Posts

February 4th, 2011 09:00

Given the way FAST Cache works, it's no surprise that it won't make a difference when writing new files.

In order for FAST Cache to make a difference, you have to have a number of I/Os to the same 64 KB cluster of blocks so that it gets promoted into the FAST Cache.

From then on you are working with SSD speed and latency for these blocks for both reads and writes.

That does make a tremendous difference for applications that reuse the same blocks, like Exchange, databases, ....

For these apps we have seen reduced latency and 3+ times the I/O performance.

The FAST Cache white paper explains that in more detail.

Rainer

40 Posts

February 25th, 2011 11:00

Just wanted to provide an update. We have configured our Celerras every which way based on EMC support. It appears that Celerra Replicator V2 causes major slowdowns when writing files to CIFS shares. Either way, we do not get the write performance of our Windows servers (with Clariion or local SAS storage). The best write performance on the Celerra for our large files is around 60-80 MB/s. With Replicator running, speed drops to about 20 MB/s.

I will post updates if we ever make any more progress on this. 

2 Intern

 • 

157 Posts

February 17th, 2012 08:00

I have been doing benchmarking on our new VNX5300, exclusively with NFS, and am seeing similar results. So I went to our NS960, which does a lot of CIFS and NFS sharing but generally not much in the way of serious load. What I have concluded is that reads out of the Celerra are as fast as the backend disk (or network) will allow, but writes to it suck regardless of the number or type of spindles at the destination. This has to be occurring in more environments, but unless people are pushing beyond 60 MB/sec, they would never care that it can't go any faster than this. Rainer or anyone, have any suggestions? I am at a loss as to how it is possible that the Celerra with its cache, combined with all the cache and performance of a 960, is not able to munch data to any FS without choking at around 60-70 MB/sec.

dart 6.0.41-4

We do not have replicator running but there are some checkpoints for some FS. Also, this is not a CIFS vs NFS issue, the performance is exactly the same regardless of the protocol.

thanks

Dave

2 Intern

 • 

157 Posts

February 17th, 2012 08:00

Well, as I said, I'm currently testing against a VNX5300 which has no checkpoints at all, so that is not related to my problem. But I think there is a bug in there somewhere.

40 Posts

February 17th, 2012 08:00

We only get about 10 MB/s write performance, but we use a lot of checkpoints. It has something to do with the copy-on-write method. Supposedly there is a fix in the latest 6.0.51.6 NAS code to help write performance. Would love to hear from someone whether it helps.


More info here  https://community.emc.com/thread/124864

1 Rookie

 • 

121 Posts

July 12th, 2012 07:00

Which command did you use to get the output below?

Name     Mtu   Ibytes        Ierror  Obytes        Oerror   PhysAddr
****************************************************************************
fxg0     9000  3016360536    11      2709827918    0        0:60:16:32:56:46
fxg1     9000  1237764640    0       0             0        0:60:16:32:56:47
mge0     9000  2762780894    0       4205729750    0        0:60:16:40:ee:1
mge1     9000  331775952     0       52356319      0        0:60:16:40:ed:ed
cge0     9000  1964674952    0       1408110665    0        0:60:16:2b:5c:96
cge1     9000  610079448     0       2747930285    0        0:60:16:2b:5c:97

40 Posts

July 12th, 2012 08:00

What kind of performance issues? Do you have checkpoints on the file systems in question? If so, expect SSSSSSlow write performance.

Paul Shane | Systems Administrator | paul.shane@milliman.com

Milliman | 1550 Liberty Ridge Drive, Suite 200 | Wayne, PA 19087-5572 | USA

Tel +1 610 975 8012 | Fax +1 610 687 4236 | Mobile +1 610 389 5088 | milliman.com


1 Rookie

 • 

121 Posts

July 12th, 2012 08:00

I am sorry, I am really new to Celerra stuff, as I am managing only normal operations on my Celerra.

But we are facing a lot of performance issues on our Celerra and I would like to know how I can get the info to troubleshoot.

Can you please let me know how I can find out what the performance issues are?

Can I use Celerra Monitor to learn something about performance? If yes, can you please let me know the performance parameters to watch.

4 Operator

 • 

8.6K Posts

July 12th, 2012 08:00

You’re kidding – right ? Never seen a netstat output ?
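For the record, the interface table above is the Data Mover's netstat-style interface statistics; on a Celerra that would come from something like this (server_2 is an example Data Mover name - substitute your own):

```shell
# Per-interface MTU, byte counters, errors and MAC addresses:
server_netstat server_2 -i
```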
