Write Cache or not ?

Question

Hi all, I'd like to have your point of view or experience regarding write cache activation (or not) in a SSD context. In our case : 2 SC8000 wtth 3 tiers (SSD SLC -  SAS 15K - SAS 7K) In a recommended profile, the Storage Center documentation advises to disable write cache, argumenting that write IOs will occur on SSD and latency on SSD is comparable to NVRAM latency. Ok... so we made a simple test : 1 Windows 2008 R2 server with 2 LUNs in recommended profile.  We simply copy/paste a big file (8Go) from the first LUN to the second LUN (the second is write cache disabled). We notice that the copy operation is just 30% to 40% longer than with cache enable....   So what ? We tried with another server (Windows 2003) --> same thing. We tried with other NTFS cluster block size --> same thing. We tried to modify HBA Queue depth --> same thing... So, for the moment, I don't disable write cache, even for SSD LUN... Any idea or similar experience ? Thanks.

HMA59 · Answer

Yes, these are physical boxes with FC. And no Anti-virus software...

In fact, I've noticed this problem on a production server hosting a big SQLServer database (2To). We have restored this database and we was suprised by the low write rates (4Go/Min) where we used to be at 6Go/min.... when I've enabled the write cache.. oh surprise... the write rates was ok...

I've also made some test with VMware ESXi Host, some VMs clone operation, and, with VMware, all is normal... (better without write cache)...

So this issue is "windows dependent"... but where is the tips ?

mtanen · Answer

I am assuming these are physical boxes - there are some interesting things about the Windows i/O system that can cause wierd issues. In general the SSD should write as fast or faster without the write cache turned on. It will also reduce the latency (which can be big with certain applications).

You could try something with a Linux server I guess - or get Condusiv's Velocity demo and see if that makes a difference. I would also check for virus scanners and other things that can drag down I/O. I sometimes recommend people make a RAM disk on the Windows server in question and see if performance increases between the RAM disk and disk. If not there is generally something effecting the whole I/O system.

The IOmeter not being slow could mean you have all sorts of trapped potential in those servers :)

HMA59 · Answer

The SC8000 Firmware is 6.3.10.

And Yes, more than 4To of RAID10 SSD Free Space (this storage area has just entered the production state...).

To complete my first post, I can say that I don't notice this problem with IOMeter.

With IOMeter, Intensive Write IO on SSD is better without cache than with cache.

mtanen · Answer

I forgot to ask - is this FC or iSCSI and what speed?

mtanen · Answer

What version of the Compellent firmware are you running?

mtanen · Answer

Also - what does the space look like on your SSD? Do you have free space available in the RAID10 extent?

mtanen · Answer

What are your settings on the Windows device? Using the windows cache or direct to device? Whats the MPIO profile set to?

HMA59 · Answer

I use a Roud Robin profile over 4 FC paths. What do you mean by 'windows cache or direct to device' ? I guess I use the default one...

mtanen · Answer

Go into the Windows Device manager and check the properties of each of the LUNs. Make sure you check the advanced settings in particular to see if you are caching all your writes through the OS.

Also - I would probably pin the MPIO setting to a single port for this type of test. That way you can make it easier to see if there is a FC related issue. Put each lun on a different port.

HMA59 · Answer

Disk Write Caching is disabled. So I guess the OS doesn't cache anything...

mtanen · Answer

Summary of the situation: (just to keep it straight in my head)

You have two physical Windows machines - one Windows 2003 and the other Windows 2008 R2. Both are connected via Fibre Channel to a compellent system which includes a tier of SSD. You turn off the write cache to the volume (that uses the recommended profile) so that writes would not be cached to the SSD.

You run IOMeter - its faster with the write cache off

You do file copies (between compellent LUNS) - its slower with the write cache off

You have tried tuning parameters on the HBA and in the OS - nothing seems to change the outcome. You have verified that no Antivirus or other storage effecting product is installed on the hosts. The issue does NOT occur with Linux or VMware - only windows. Compellent is at 6.3.10 (which should be fine) - note that I dont believe you get flash wear tracking until 6.4.3, so you might want to think on that.

Have you tried a copy from local storage to LUN? Does that make a difference?

Have you tried using something like FastCopy to do your copy test? Does that make a difference?

Are you seeing any errors on the FC switches? I would guess not if the IOmeter comes out fine.

Obviously not the expected behavior - unless someone else can think of something I have missed, you might want to raise a ticket with Copilot just so they can run through their checklist of common things.

HMA59 · Answer

'You turn off the write cache to the volume (that uses the recommended profile) so that writes would not be cached to the SSD.' --> I turn off the write cache to the destination volume (that uses the recommended profile) so that writes would not be passed to the NVRAM/Cache controller and would not be replicated to the second controller, but would be writen directly to SSD. --> Not tried with LINUX. 'Have you tried a copy from local storage to LUN? Does that make a difference?' --> Yes,this is it : I copy the big file a Compellent LUN and paste the file to another Compellent LUN. 'Have you tried using something like FastCopy to do your copy test? Does that make a difference?' --> Not tried with FastCopy. But, as mentioned, a Full SQL Restore though a Symantec Backup Exex SQLServer Agent reaches to same conclusion : the restore rates is 40% slower without cache. 'Are you seeing any errors on the FC switches? I would guess not if the IOmeter comes out fine.' --> No errors on FC switches. Yes, I will put the copilot support in charge of this problem. Thanks for your help.

JBrunt · Answer

Did you ever get an explanation for this problem from co-pilot?  Could you post it here as we are beginning to deploy more and more flash, and I would be interested to see the explanation.

HMA59 · Answer

No sorry, and unfortunately, I didn't had any time left to open a case on this subject. But this is still relevant today. And as a workaround, I keep write cache active on all my Windows Servers (not on my ESXi servers that doesn't seem to be concerned).

I will open a case on Copilot and let you know the result. But I don't know when...

carsten anker · Answer

Hi, did you find out? we see the same as you descripe. with ISCSI. another thing we see, is that 2008r2 is faster than 2012 r2. we have also tried everything around nic tuning. Thanks carsten

Compellent

Was this post helpful?