DELL-Kenny K
March 25th, 2013 14:00
After reading your questions, there are a few things we will need to look at to get a better idea of where the bottleneck is. We would need to pull a DSET from one of the hosts that accesses the array, and we will also need some support logs from the MD3200. I will send you a private message with my email address so you can send me that information.
FrostyAtCBM
March 26th, 2013 15:00
I would be very interested to learn the eventual outcome of this issue/query.
We run a 3-host setup similar to yours (we also have an additional MD1220).
If there is a config issue affecting throughput that can be corrected, I would want to cross-check against my setup.
jeffhoward014
March 26th, 2013 15:00
This is entirely a guess at this point, but after some additional analysis I have a theory that I'm hoping the Dell technicians can confirm or deny. If you deep-dive into the MD3200 Technical Guidebook, you'll see comments like:
The MD3200 series of arrays were designed with performance in mind. Each controller is equipped with four 6Gb SAS ports providing total aggregated bandwidth of 4,000MB/s of throughput for a dual controller system which is 4X the throughput of the MD3000 and the most competitive products in the external RAID SAS array market.
A common question is: how did they arrive at 4000MB/sec? They're taking the 6Gbps line rate, converting it to MB/sec minus a little for signaling overhead, and then doing some very basic math... Other SAS references say you can get roughly 500MB/sec per lane, so take 4 ports per controller, then 2x controllers, and voila: 500*4*2 = 4000MB/sec.
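To make that marketing math concrete, here's a quick back-of-envelope sketch (the numbers and the "treat each port as one lane" simplification are my own, mirroring how the 4,000MB/s figure seems to have been derived; nothing here comes from Dell):

```python
# Back-of-envelope version of the marketing math (my assumptions, not Dell's).
line_rate_gbps = 6.0                                       # SAS 2.0 raw line rate per lane
after_encoding_mb_s = line_rate_gbps * 1000 * 8 / 10 / 8   # 8b/10b encoding -> ~600MB/s
usable_per_lane_mb_s = 500                                 # the commonly quoted "usable" figure

ports_per_controller = 4
controllers = 2

advertised = usable_per_lane_mb_s * ports_per_controller * controllers
print(f"~{after_encoding_mb_s:.0f}MB/s after encoding, ~{usable_per_lane_mb_s}MB/s usable per lane")
print(f"Advertised aggregate: {advertised}MB/s")           # 500 * 4 * 2 = 4000
```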
That being said, there are two major points left out of that math, and so far I've only found one engineer at Dell who could confirm this. First, SAS speeds are rated unidirectionally, so 500MB/sec would only apply if you were measuring a 100%/0% read/write mix.
The second major point, which hardly anyone seems to be aware of, is that the SAS 2.0 spec provides 4x lanes per physical link! Again, returning to the MD3200 Technical Guidebook, if you look at the Controller Architecture diagram, you'll quickly begin to see the relationship between "links" (which are ports) and "lanes".
So, after an extremely long-winded intro, my working theory at this point is that something in the configuration or implementation is preventing additional requests from "spilling over" into the additional SAS lanes, possibly even at the controller level.
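To put rough numbers on that theory (my own arithmetic, using the commonly quoted ~600MB/sec you get from a 6Gbps lane after 8b/10b encoding; none of this comes from the Guidebook):

```python
# Rough arithmetic behind the theory (my assumptions, not from the Guidebook):
# if each physical "link"/port really is a x4 wide port, one port alone should
# be able to move far more than the ~600MB/s we're observing per controller.
per_lane_mb_s = 600      # ~6Gbps after 8b/10b encoding, before protocol overhead
lanes_per_link = 4       # SAS 2.0 wide port, per the Controller Architecture diagram

per_port_mb_s = per_lane_mb_s * lanes_per_link
print(f"Theoretical per wide port: ~{per_port_mb_s}MB/s")   # ~2400MB/s vs ~600MB/s observed
```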
We've confirmed through extensive testing that the 600MB/sec limit seems to extend beyond a single host or HBA. We have two physical hosts connected to our MD3200 with dual HBAs running MPIO, so there are a total of 4 physical links available, each with its own 6Gbps HBA.
If anyone has a deep understanding of SAS 2.0 and the inner workings of the MD3200, I would love to chat about this. I'm willing to provide whatever time is needed for testing on our systems, as this is a major deal-breaker for us continuing to roll out the MD3200s for our DSS application.
Thanks,
- Jeff
jeffhoward014
March 29th, 2013 17:00
So my testing continues, and I believe I've narrowed this down a little more... The 600MB/sec ceiling is apparently tied to each controller, not to the array as a whole. For example, using SQLIO with 64K IOs, I'm able to push 600MB/sec against a single RAID volume on a single controller, and about 1000-1200MB/sec if I purposely target two different RAID volumes owned by two different controllers.
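For anyone who wants to reproduce this, my runs look roughly like the following (illustrative only: the drive letters, test file, thread count, queue depth and duration are my own choices, so adjust for your setup):

```
:: Illustrative SQLIO runs (paths and parameters are placeholders for my setup).
:: Single RAID volume on one controller - tops out around 600MB/sec for us:
sqlio -kR -fsequential -b64 -t4 -o8 -s60 -BN -LS E:\sqlio_test.dat

:: Two volumes owned by different controllers, run in parallel - ~1000-1200MB/sec combined:
start sqlio -kR -fsequential -b64 -t4 -o8 -s60 -BN -LS E:\sqlio_test.dat
start sqlio -kR -fsequential -b64 -t4 -o8 -s60 -BN -LS F:\sqlio_test.dat
```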
Keep in mind, my end goal is to ferret out the true max throughput of this device before we pour more money into additional expansion enclosures. The advertised max throughput on a dual-controller setup is 4000MB/sec for perfectly sequential reads, so I'm simply trying to understand how that plays out in a real-world DSS app. At this point in my testing, I have no clue how they got to even 2000MB/sec.
My frustration is this: we currently have three enclosures for a total of 48 disks, and my testing shows that I'm able to max out the sequential throughput with as few as twelve physical disks. Meaning, when I run the dual-controller test noted above that maxes out both controllers, the aggregate throughput is only 1200MB/sec. While 1200MB/sec may not sound too bad, it makes me nervous to invest any more cash in additional expansion enclosures to hang off this device.
Here's an example:
If 12 disks are able to max it out, here's the logical progression (for a DSS app doing 64K sequential reads). For the sake of the example, let's say that a single client's data consumes 6 physical disks (I've included a quick sketch of this math right after the list):
02 client(s) - 12 disks - 600MB/sec/client
04 client(s) - 24 disks - 300MB/sec/client (what we're seeing today with all four RAID volumes under load)
08 client(s) - 48 disks - 150MB/sec/client
16 client(s) - 96 disks - 75MB/sec/client
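The numbers above are nothing more than the observed aggregate ceiling divided by the client count; here's the trivial sketch (the 1200MB/sec ceiling and 6 disks per client come from my own testing and example, not from any Dell specification):

```python
# Per-client throughput if the array tops out at a fixed aggregate ceiling.
# The 1200MB/s ceiling and 6 disks/client are my own observed/example figures.
aggregate_ceiling_mb_s = 1200   # observed dual-controller sequential max
disks_per_client = 6

for clients in (2, 4, 8, 16):
    disks = clients * disks_per_client
    per_client = aggregate_ceiling_mb_s / clients
    print(f"{clients:2d} clients - {disks:3d} disks - {per_client:.0f}MB/s per client")
```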
So by adding just one additional MD1220, our effective throughput drops to 150MB/sec per client? Can an engineer from Dell please confirm this is the expected performance of your SAS 2.0 SAN? If this math is correct, it means we can max out the throughput of this device with basically one MD3220 and one MD1220 expansion enclosure. That seems a little light given the way this device is advertised; we were able to do that with our SAS 1.0 devices 5+ years ago.
Is it possible for an engineer from Dell to chime in on this!?!? We're on hold for our next round of purchasing until we can confirm if this is a device limitation or a configuration issue.
FrostyAtCBM
April 10th, 2013 15:00
Any further news on this? I'm still very interested in the subject.
jeffhoward014
April 10th, 2013 16:00
Yes, actually I do, thank you for the reminder. The mystery is solved! A couple of calls with the Dell storage engineers turned up dry... They really were trying to be helpful; I think I was just too deep into the details of the SAS pathing by the time I brought them in. I decided to revisit the High Performance Tier guide with a fine-tooth comb, since that seemed to have the most "granular" details and documented performance figures I was looking for. While I found the solution in the HPT Guide, you don't necessarily need the HPT "Premium" feature to get past the 600MB/sec boundary.
http://www.dell.com/downloads/global/products/pvaul/en/powervault-md3200-high-performance-tier-implementation.pdf
The root cause was the controller-level cache block size! Changing it from the default of 4K to the max of 32K allowed us to push 3000MB/sec across as few as 24 15k spindles! Remember, this is in a 100/0 read/write scenario with completely sequential reads, but sequential writes were much better as well.
So, thinking about this, it makes sense. For us, a 4K block size at any level of the system is pretty useless... Our stripe size on disk, our cluster size in Windows, and our row widths in SQL are all 8K or larger (larger as you move farther down the technology stack, by design). So basically, the controllers were having to go through multiple cycles to load the cache, since each cycle could only write 4K at a time.
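Here's the mental model in miniature (my own reasoning, not anything stated in the Dell documentation): a 64K host IO has to be staged into controller cache in cache-block-sized chunks, so a larger cache block means fewer cache-fill cycles per IO.

```python
# Cache-fill cycles per 64K IO for different controller cache block sizes
# (illustrative only; my mental model, not from the Dell docs).
io_size_kb = 64
for cache_block_kb in (4, 16, 32):
    cycles = io_size_kb // cache_block_kb
    print(f"{cache_block_kb:2d}K cache block -> {cycles:2d} cache-fill cycles per {io_size_kb}K IO")
```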
This small change completely changes the extensibility of this device for us. We're even seeing improved performance on mixed workloads that blend sequential and random operations. I have to believe this is because our data blocks are simply better aligned throughout every layer of the stack (SQL -> Windows -> VMware -> Controller Cache -> RAID Striping).
FrostyAtCBM
April 10th, 2013 16:00
That is an absolutely BRILLIANT find ... thank you!!! ... I can confirm that my own MD3200 is also configured with the default 4KB cache block size ... I'll look into that some more and schedule a time to make the change ... it should massively improve our backup performance, I would think.
One more question: can this be changed "on the fly" or do I need to schedule an outage?
FrostyAtCBM
April 10th, 2013 17:00
Thanks again. I will try a 16KB cache block size and also disable cache pre-fetch on the virtual disks. But I will wait until the weekend, just in case, rather than changing it during the week. I'll let you know how it goes next week.
jeffhoward014
April 10th, 2013 17:00
Quadruple-checking the PDF from Dell in my previous post, a verbatim cut/paste from the doc says:

Proper array and server tuning is essential for realizing the advantages of High Performance Tier. If a storage system is not optimally tuned the benefits from High Performance Tier may not be fully realized.

For sequential workloads, a controller cache block size of 32KB should be used (the default setting is 4KB). The cache setting may be changed by selecting Change -> Cache Settings from the Storage Array main menu item.

For random workloads, a controller cache block size of 16KB should be used, and cache pre-fetch should be disabled for each virtual disk. Virtual disk cache settings may be changed by selecting Change -> Cache Settings from the Virtual Disk main menu item. If the workloads are a mix of sequential and random IO, we initially recommend using the 16KB cache block with cache pre-fetch disabled, but encourage the system administrator to adjust cache block size as necessary to achieve the optimal results. Cache block size may be changed 'on the fly' without requiring a controller reset, and live system tuning may be required in some cases.
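If you'd rather script it than click through MDSM, something along these lines should work with SMcli; I'm writing this from memory, so treat the exact syntax as an assumption and verify it against the MD3200 CLI guide before running anything (the array and virtual disk names are placeholders):

```
:: Hypothetical SMcli commands (syntax from memory - verify against the MD3200
:: CLI guide). "MyArray" and "Data_VD1" are placeholder names.
SMcli -n MyArray -c "set storageArray cacheBlockSize=16;"
SMcli -n MyArray -c "set virtualDisk [\"Data_VD1\"] cacheReadPrefetch=FALSE;"
```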
FrostyAtCBM
April 14th, 2013 16:00
Switched from 4KB to 16KB cache block size in the controllers on Saturday.
No reboot was necessary. It will take me maybe a week to determine the impact.
I haven't fiddled with the virtual disk cache pre-fetch settings. One change at a time!
FrostyAtCBM
April 25th, 2013 20:00
The change to a 16KB cache block size went fine; however, the impact was almost imperceptible on the usual workloads running on our MD3200. I did keep stats from our backup jobs, but the improvement was marginal (<5%) and possibly statistically indistinguishable from zero.
Since then I have also switched off cache pre-fetch on most of my virtual disks (those not storing backup data). Again, the difference was imperceptible, possibly zero.
But regardless, I now have more knowledge about the configuration and therefore more confidence for the future. So thanks again for your work in digging out this information.