June 10th, 2014 06:00

Performance issue / poor performance: VRTX with Fault Tolerant / Dual Shared PERC 8

Hi,

I have been struggling with a storage performance issue for a few months. I found that the culprit is the Fault Tolerant Shared PERC 8 configuration in the VRTX. In this configuration you are bound to the write-through caching policy, which currently performs very badly.

Because of this big performance hit on the storage (caused by the caching policy) I had the following problems:

  • Poor performance of (Hyper-V) VMs
  • Poor copy performance (copying big files within the same volume: 70 MB/s max)
  • Unable to make a backup of the VMs using either Veeam B&R 7 or Backup Exec 2014 RTM (backup and VSS are related)
  • VSS time-out errors (0x8000ffff - Catastrophic failure)
  • Checkpoint creation of a VM took about 2 minutes

The release notes of the latest firmware (23.8.12-0061) for the SPERC 8 state:

Firmware supports VRTX running with single SPERC8 controller;
- Write-back cache is enabled for VDs.
- Does not have high availability capability

Firmware supports VRTX running with dual SPERC8 controller;
- High availability capability present (Both SPERC8 needs to be connected to expanders)
- One of the controllers is active and the other is passive
- Write-back cache is disabled for VDs; only option for write cache is Write Through.


To downgrade from a dual controller to a single controller, I removed the second SPERC8 controller, removed the second expander, re-cabled the SAS connections and downgraded the chassis firmware to 1.25. Just removing the controller and expander wasn't enough, because I was still unable to set "write-back" as the caching policy. As soon as I downgraded to 1.25, I was able to set write-back as the caching policy.



With the single controller configuration, the storage performance is as it should be. All the earlier mentioned problems are gone:

  • (Hyper-V) VMs perform excellently
  • Copy performance is excellent (copying big files within the same volume: around 650 MB/s max)
  • Backup actually works (backup and VSS are related)
  • No more VSS time-out errors (0x8000ffff - Catastrophic failure)
  • Checkpoint creation of a VM takes about 4 seconds
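
As a rough sanity check, the before/after figures from the two lists in this post can be put side by side. This is only back-of-the-envelope arithmetic on the reported numbers; the variable names are made up for illustration.

```python
# Figures reported in this post; variable names are illustrative only.
copy_write_through_mb_s = 70      # max copy speed, dual SPERC8 (write-through)
copy_write_back_mb_s = 650        # max copy speed, single SPERC8 (write-back)
checkpoint_write_through_s = 120  # "about 2 minutes"
checkpoint_write_back_s = 4       # "about 4 seconds"

copy_speedup = copy_write_back_mb_s / copy_write_through_mb_s
checkpoint_speedup = checkpoint_write_through_s / checkpoint_write_back_s

print(f"Copy throughput: {copy_speedup:.1f}x faster with write-back")
print(f"Checkpoint creation: {checkpoint_speedup:.0f}x faster with write-back")
```

So the copy test is roughly nine times faster and checkpoint creation roughly thirty times faster in the single controller setup.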

Based on this experience, I currently strongly advise against the dual SPERC8 configuration; stick with the single configuration. Hopefully a firmware update will solve this performance issue, because at the moment the dual / fault tolerant SPERC 8 configuration makes no sense at all with this very poor storage performance. Yes, you are fault tolerant, but having no backups and badly performing VMs is not a good option in a production environment.

Or might there be an essential setting I am missing for the write through setup? (I have tried different settings already...)

Below are the benchmarks made using ATTO. The difference is tremendous! Fun fact: during the write-back benchmark the virtual disk was still in its background initialization process; during the write-through benchmark it was not.

NOTE: Both VDs were configured as RAID6

August 25th, 2014 16:00

Finally Dell Support budged. This is the latest update from support:

--
Write Back
Cache Enablement

We are actively driving a plan to deliver a Write-Back option (via firmware) for Dual PERC configurations. We are working this plan with great urgency and will have it implemented as soon as we are able complete our development and Enterprise validation of the firmware.
--

November 7th, 2014 02:00

As it seems, it was a question of a firmware update after all! The release notes of the latest firmware state:

Fixes & Enhancements
Fixes:
- Fixed an issue in which Virtual Disks may become inaccessible when migrated to another Power Edge VRTX.
- Fixed an issue in which Virtual Disk to blade server mapping may be deleted after a failover.
- Fixed an issue in which a cable failure in large configurations could cause the Shared PERC8 to fault.
- Fixed an issue in which a redundant VD may be unable to use the "Copy Back" feature after disabling the second Shared PERC8
- Fixed an issue in which I/O traffic wasn't always sent using the "Fast Path" feature.
- Improved Battery Back Up (BBU) messaging in the storage controller logs.
- Fixed an issue in which the rebuild of a redundant virtual disk starts over after a controller failover.

Enhancements:
- VDs can now be configured with write back caching in dual Shared PERC8 configurations
- Increased hard drive predictive failure polling interval to 5 minutes.
- Firmware TTY logs persist across controller and chassis resets.

I will implement the new firmware and test it right away.

Moderator

June 10th, 2014 08:00

Hello Erik

Thanks for the great write-up!

Here are a few things to consider with this:

#1 Write cache has a HUGE performance impact on RAID arrays that use parity, HUGE difference. You should not run a parity array without write caching. Your speeds of 70MB/s on a RAID 6 without write caching are extremely good.

#2 Write caching is disabled when running dual PERCs because the functionality is not present to match the write cache. Our storage appliances that run dual PERC failover configurations have the capability to match the cache. The passive PERC runs a mirror image of the active PERCs cache so that it can take over without data loss in the event of controller failure. The VRTX is not a full-featured storage appliance.

The use of dual PERCs on a VRTX will be dependent on the needs of the individual customer. I think it is great that you made this write-up to point out that write caching is not available in this configuration. Someone running a RAID 10 or other non-parity array will not suffer such a HUGE performance impact due to write caching being disabled, so some configurations run very well with dual PERCs.
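
One rough way to see why parity arrays suffer more without write caching is the commonly quoted RAID "write penalty" rule of thumb: a small random write costs about 2 back-end I/Os on RAID 10, 4 on RAID 5 and 6 on RAID 6. The sketch below applies that rule; the disk count and per-disk IOPS are assumptions chosen for illustration, not values measured on a VRTX.

```python
# Commonly quoted back-end I/Os per small random host write (rule of thumb).
WRITE_PENALTY = {"RAID 10": 2, "RAID 5": 4, "RAID 6": 6}


def random_write_iops(disks: int, iops_per_disk: float, level: str) -> float:
    """Rule-of-thumb host write IOPS when every write goes straight to disk."""
    return disks * iops_per_disk / WRITE_PENALTY[level]


# Assumed values for illustration only, not measured on a VRTX.
DISKS = 12
IOPS_PER_DISK = 75.0

for level in WRITE_PENALTY:
    iops = random_write_iops(DISKS, IOPS_PER_DISK, level)
    print(f"{level}: ~{iops:.0f} random write IOPS")
```

Under these assumptions RAID 6 sustains about a third of the uncached random write rate of RAID 10. Real results also depend on I/O size, sequential versus random access and the drives themselves.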

Thanks

June 10th, 2014 08:00

Hi Daniel.

Thank you for the clarification! It leads me to the following questions:

Q: In point #2 you say "because the functionality is not present to match the write cache". Will this functionality ever be available, or is it impossible given the way caching works?

I will test and benchmark a RAID10 setup sometime this week. The big downside of RAID10 is that we lose a lot of storage capacity: 21TB in RAID10 versus 36TB in RAID6 (a 15TB loss). Note: Our VRTX is completely filled with 3.64TB disks.
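
The capacity figures above can be reproduced with simple arithmetic. The number of disks is not stated in this post; 12 is an assumption that happens to match the quoted totals.

```python
DISK_TB = 3.64  # usable size per disk, from the post
DISKS = 12      # assumption: the disk count is not stated in the post

raid10_tb = (DISKS // 2) * DISK_TB  # half of the disks hold mirror copies
raid6_tb = (DISKS - 2) * DISK_TB    # two disks' worth of capacity goes to parity

print(f"RAID 10: {raid10_tb:.2f} TB")
print(f"RAID 6:  {raid6_tb:.2f} TB")
print(f"Difference: {raid6_tb - raid10_tb:.2f} TB")
```

With these inputs RAID 10 yields just under 22 TB and RAID 6 just over 36 TB, in line with the rounded figures quoted above.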

If RAID6 is not the best option (or actually the worst) to use with the dual SPERC8 controller, it would be very nice if Dell published a recommendation or best practice for the dual SPERC8 setup. I have been struggling with this issue for months, time that could have been saved if this had been written down somewhere. (I realize that the dual SPERC8 setup is still pretty new, so giving new customers a heads-up on this would be very helpful.)

So for anyone that comes across this post:
* If you have a VRTX with a dual SPERC8 and want to keep the fault tolerance, do not use RAID6 for the storage; go for RAID10 instead (take the storage capacity loss into account!).
* If fault tolerance with the SPERC8 is not that important to you and you want to use RAID6, downgrade the VRTX to a single controller SPERC8 config and use the write-back caching policy.

Q: Currently I have found a way to downgrade to a single SPERC8 controller. Is this the best way or do you know of an easier way to downgrade?

Regards,

Erik

Moderator

June 10th, 2014 09:00

Q: In point #2 you say "because the functionality is not present to match the write cache". Will this functionality ever be available, or is it impossible given the way caching works?

In short, I don't know. I doubt this type of functionality could be added with a firmware update, but I'm not a developer so I'm not sure of the engineering barriers to make it work. I'm not aware of any plans, and I'm not sure if it is possible.

Q: Currently I have found a way to downgrade to a single SPERC8 controller. Is this the best way or do you know of an easier way to downgrade?

You should be able to run the latest firmware with a single Shared PERC8 and have write caching enabled. Your method works, but I'm curious whether back-flashing the firmware on the PERC is required. Some changes made by firmware updates are to the default configuration file. If you don't reset the PERC to defaults, those changes will not be implemented. I would suggest resetting the PERC to defaults before back-flashing, to see if that makes the write-cache option available. You can do this in the CMC: Storage>Controllers>Troubleshooting>Actions drop-down.

Thanks

June 10th, 2014 11:00

I am facing exactly the same issue.
New Dell VRTX + 3 M620 running vSphere ESXi 5.5.
VD as RAID-5 on dual SPERC8 controllers. Write Policy is Write Through (the only option).
I made 2 images of the same VM, one on the local HDD (of the M620) and one on VD storage.

Write performance of the VM on VD storage is very low.

Backing up or cloning VMs is now very slow.

Do you have the test results with RAID-10 yet?

Should I contact local Dell support? I am not sure they have the answer. I feel so bad now.

Thanks

June 11th, 2014 01:00

Hi Daniel (and ThiPham),

In your first reply you said the following: "Someone running a RAID 10 or other non-parity array will not suffer such a HUGE performance impact due to write caching being disabled, so some configurations run very well with dual PERCs."

Note: by my understanding "write through" is not equal to write caching being disabled, but it is just another (much slower) way of write caching.


Of course I wanted to test your claim and, I hate to say it, the results are just as bad (in my opinion, and measured against the expectations set by the Dell sales department). So I am really wondering what kind of configuration actually is suitable for a dual PERC setup. Currently it is a very bad choice for any virtual environment, as that is a very disk-intensive workload.

The results:

Taking all the results into account, I am rephrasing my advice:
If you are planning to run any virtual environment on a VRTX, do not go for the dual SPERC8 setup, stick with the Single SPERC8 instead. The performance hit caused by the currently mandatory caching policy (Write Through), is way too big to smoothly run a virtual environment. No matter which RAID config you choose.

@ThiPham
To get a real performance gain I would advise you to downgrade your VRTX to a Single PERC.

I am still figuring out the best way to downgrade from dual to single. The procedure described in my first post does work, but I am wondering if Daniel is right, meaning that I only have to remove the controller + expander, re-cable and then reset the single controller, without downgrading the firmware.

And yes, please do contact Dell support! I have done this as well. I think the more people "complain" about this, the more seriously they will take it.

Moderator

June 11th, 2014 11:00

Erik

Note: by my understanding "write through" is not equal to write caching being disabled, but it is just another (much slower) way of write caching.

That is not correct. Write through and write caching being disabled are the same. The controller will always run data through the cache, but with caching disabled or write through, which are the same thing, it will not queue anything in cache. It basically has a queue depth of 0 with no cache.
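
The difference between the two policies can be sketched with a toy model. This is a deliberate simplification and not how the PERC firmware works internally; the latencies and cache size are assumed values, and background flushing is ignored.

```python
def total_write_time_ms(num_writes: int, policy: str,
                        disk_ms: float = 8.0, cache_ms: float = 0.1,
                        cache_slots: int = 1000) -> float:
    """Toy model of the time a host spends waiting for write acknowledgements.

    write-through: every write is acknowledged only after it reaches disk.
    write-back: a write is acknowledged once it sits in controller cache;
    once the cache is full, each further write waits for one disk flush first.
    All default values are assumptions for illustration.
    """
    if policy == "write-through":
        return num_writes * disk_ms
    if policy == "write-back":
        cached = min(num_writes, cache_slots)
        overflow = num_writes - cached
        return cached * cache_ms + overflow * (disk_ms + cache_ms)
    raise ValueError(f"unknown policy: {policy}")


for policy in ("write-through", "write-back"):
    print(f"{policy}: {total_write_time_ms(500, policy):.0f} ms for 500 writes")
```

In this model a burst that fits in cache is acknowledged far sooner with write-back, while a burst much larger than the cache drifts back toward disk speed.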

And yes, please do contact Dell support! I have done this as well. I think the more people "complain" about this, the more seriously they will take it.

There is not an issue with the hardware. This is just an issue of understanding how pivotal a role cache plays in RAID. Your write speeds with cache enabled are almost 2500 MB/sec. Your HDDs are not capable of writing that fast. Cache allows you to achieve speeds that the drives are not capable of. All you are showing in the above tests is how much effect cache has on RAID performance.

I have run your benchmark on an R720XD with an H710P controller to show you that this is not a hardware or firmware issue. This is simply how RAID works. My test results are not as good as yours, so I suspect you are running better HDDs than me. I have 7200 RPM SATA drives that I am testing with. This is my result with all caching disabled on a different controller in a different system on a 4 drive RAID 10:



June 11th, 2014 23:00

[quote user="Erik Nettekoven"]
Taking all the results into account, I am rephrasing my advice:
If you are planning to run any virtual environment on a VRTX, do not go for the dual SPERC8 setup, stick with the Single SPERC8 instead. The performance hit caused by the currently mandatory caching policy (Write Through), is way too big to smoothly run a virtual environment. No matter which RAID config you choose.

@ThiPham
To get a real performance gain I would advise you to downgrade your VRTX to a Single PERC.

I am still figuring out the best way to downgrade from dual to single. The procedure described in my first post does work, but I am wondering if Daniel is right, meaning that I only have to remove the controller + expander, re-cable and then reset the single controller, without downgrading the firmware.

And yes, please do contact Dell support! I have done this as well. I think the more people "complain" about this, the more seriously they will take it.
[/quote]

Hi Erik,

Many thanks for your advice.

I have just contacted Dell support and am waiting for their reply.

Some test results from our VRTX system for reference:

June 12th, 2014 01:00

Erik

[quote user="Erik Nettekoven"]Note: by my understanding "write through" is not equal to write caching being disabled, but it is just another (much slower) way of write caching.

That is not correct. Write through and write caching being disabled are the same. The controller will always run data through the cache, but with caching disabled or write through, which are the same thing, it will not queue anything in cache. It basically has a queue depth of 0 with no cache.
[/quote]

Thanks for clearing that up :)


[quote user="Erik Nettekoven"]And yes, please do contact Dell support! I have done this as well. I think the more people "complain" about this, the more seriously they will take it.

There is not an issue with the hardware. This is just an issue of understanding how pivotal a role cache plays in RAID. Your write speeds with cache enabled are almost 2500 MB/sec. Your HDDs are not capable of writing that fast. Cache allows you to achieve speeds that the drives are not capable of. All you are showing in the above tests is how much effect cache has on RAID performance.

I have run your benchmark on an R720XD with an H710P controller to show you that this is not a hardware or firmware issue. This is simply how RAID works.

[/quote]

Regarding "there is not an issue with the hardware", yes and no.

Yes: In the sense of "that is how RAID works", I agree it is indeed not an issue with the hardware.

No: In the dual SPERC8 setup in the VRTX, which is also a hardware setup, you are bound to "write through" or "caching disabled". In my opinion this is a very poor choice, because of the huge impact on write performance. It is in no way a suitable setup for a virtual environment, although Dell sold it to us as "a very suitable solution" for a virtual environment. The problems mentioned in my starting post and the benchmark results prove my "poor choice" statement.

If this "write through" caching policy is the only technically possible option with Dual SPERC controllers, making it unfeasible to run a virtual environment, then I feel like I have been more or less scammed by Dell. Or at least Dell didn't live up to the expectations they gave me.

So for now I stick with my advice/statement:
If you plan to run any virtual environment on a VRTX and its internal storage, stick with a Single SPERC setup, stay far away from the Dual SPERC setup.

Furthermore I am still wondering: what kind of environment is actually suitable for this Dual SPERC setup?

June 12th, 2014 03:00

I have contacted Dell Support and they recognize the performance problem with the Dual PERC controller setup. According to Dell support, their specialists are currently looking into this performance issue.

Furthermore, support will send me an email with troubleshooting steps to improve the performance of the Dual PERC setup and/or to find out if "write through" is the actual cause of this problem.

So everybody experiencing this problem, please DO contact support!

June 12th, 2014 08:00

I received a reply from Dell support, and I am very dissatisfied with their response:

Hi Erik,

Thanks for sending the  information.

Single controller configuration is for performance
Dual for fault tolerance.

The solution is to lower fault tolerance of the dual configuration.
1. Power off the chassis
2. Remove the 2nd PERC card
3. Optionally remove the 2nd expander and re-cable (MB SAS1A <--> UP EXP A, MB SAS1B <--> UP EXP B)
4. Run the following command on CMC console: racadm raid revertononfaulttolerant:RAID.ChassisIntegrated.1-1
5. Power on the chassis

Let me know if this solved your problem.
You have to evaluate what is more important performance or fault tolerance.

Kind regards / Met vriendelijke groeten,

[removed personal info]

Pro Support Engineer
Dell | Enterprise Support Services
Dutch PowerEdge Servers

I can't believe they actually said this: "You have to evaluate what is more important performance or fault tolerance." If I choose fault tolerance, the virtual environment does not perform at all and I am not able to create any backup. I think I prefer a backup over fault tolerance. And thank you, Dell, for selling something that is useless at the moment...

I am still talking to Dell about this ridiculous solution. I really hope that they come up with something better than this!

June 12th, 2014 10:00

I received a reply from Dell support, I am very dissatisfied with their response:

[quote user="Dell Support Technician"]

Hi Erik,

Thanks for sending the  information.

Single controller configuration is for performance
Dual for fault tolerance.

The solution is to lower fault tolerance of the dual configuration.
1. Power off the chassis
2. Remove the 2nd PERC card
3. Optionally remove the 2nd expander and re-cable (MB SAS1A <--> UP EXP A, MB SAS1B <--> UP EXP B)
4. Run the following command on CMC console: racadm raid revertononfaulttolerant:RAID.ChassisIntegrated.1-1
5. Power on the chassis

[...]

I am still talking to Dell about this ridiculous solution, I really hope that they come with something better than this!

[/quote]
 
Do you happen to know if this downgrade procedure that Dell Support outlines will keep data intact, or will I have to move all my data off and re-create the virtual drives?

June 13th, 2014 01:00

Do you happen to know if this downgrade procedure that Dell Support outlines will keep data intact, or will I have to move all my data off and re-create the virtual drives?

I have done several upgrades and downgrades already, and so far all the data has remained intact. It never hurts to make a backup, just to be sure.
So I did not need to recreate the virtual drives (VDs). The only thing you might need to do is change the write caching policy on the VDs from "write through" to "write back".

June 13th, 2014 01:00

Do you happen to know if this downgrade procedure that Dell Support outlines will keep data intact, or will I have to move all my data off and re-create the virtual drives?

With all the downgrades (and upgrades) I have done, my experience is that all the data and VDs stay intact. But it sure doesn't hurt to make a backup first, just to be sure. I made a copy of the VMs and other data to different storage (an MD3200).
