DELL-Kenny K
March 25th, 2013 14:00
After reading your questions, there are a few things we will need to look at to get a better idea of where the bottleneck is. We would need to pull a DSET from one of the hosts that accesses the array, and we will also need some support logs from the MD3200. I will send you a private message with my email address so you can send me that information.
FrostyAtCBM
March 26th, 2013 15:00
I would be very interested to learn the eventual outcome of this issue/query.
We run a 3-host setup similar to yours (we also have an additional MD1220).
If there is a config issue affecting throughput that can be corrected, I would want to cross-check against my setup.
jeffhoward014
March 26th, 2013 15:00
This is entirely a guess at this point, but after some additional analysis I have a theory that I'm hoping the Dell technicians can confirm or deny. If you deep-dive into the MD3200 Technical Guidebook, you'll see comments like:
The MD3200 series of arrays were designed with performance in mind. Each controller is equipped with four 6Gb SAS ports providing total aggregated bandwidth of 4,000MB/s of throughput for a dual controller system which is 4X the throughput of the MD3000 and the most competitive products in the external RAID SAS array market.
A common question is: how did they arrive at 4000MB/sec? They're taking the 6Gbps line rate, converting it to MB/sec minus a little for signaling overhead, and then doing some very basic math... Other SAS references say you can get roughly 500MB/sec per lane, so take 4 ports per controller, then 2x controllers, and voila: 500*4*2 = 4000MB/sec.
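To make that marketing math concrete, here's a quick back-of-envelope sketch (the numbers and the "treat each port as one lane" simplification are my own, mirroring how the 4,000MB/s figure seems to have been derived; nothing here comes from Dell):

```python
# Back-of-envelope version of the marketing math (my assumptions, not Dell's).
line_rate_gbps = 6.0                                       # SAS 2.0 raw line rate per lane
after_encoding_mb_s = line_rate_gbps * 1000 * 8 / 10 / 8   # 8b/10b encoding -> ~600MB/s
usable_per_lane_mb_s = 500                                 # the commonly quoted "usable" figure

ports_per_controller = 4
controllers = 2

advertised = usable_per_lane_mb_s * ports_per_controller * controllers
print(f"~{after_encoding_mb_s:.0f}MB/s after encoding, ~{usable_per_lane_mb_s}MB/s usable per lane")
print(f"Advertised aggregate: {advertised}MB/s")           # 500 * 4 * 2 = 4000
```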
That being said, there are two major points left out of that math, and so far I've only found one engineer at Dell who could confirm this. First, SAS speeds are rated unidirectionally, so 500MB/sec would only apply if you were measuring a 100%/0% read/write mix.
The second major point, which hardly anyone seems to be aware of, is that the SAS 2.0 spec provides 4x lanes per physical link! Again, returning to the MD3200 Technical Guidebook, if you look at the Controller Architecture diagram, you'll quickly begin to see the relationship between "links" (which are ports) and "lanes".
So, after an extremely long-winded intro, my working theory at this point is that something in the configuration or implementation is preventing additional requests from "spilling over" into the additional SAS lanes, possibly even at the controller level.
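To put rough numbers on that theory (my own arithmetic, using the commonly quoted ~600MB/sec you get from a 6Gbps lane after 8b/10b encoding; none of this comes from the Guidebook):

```python
# Rough arithmetic behind the theory (my assumptions, not from the Guidebook):
# if each physical "link"/port really is a x4 wide port, one port alone should
# be able to move far more than the ~600MB/s we're observing per controller.
per_lane_mb_s = 600      # ~6Gbps after 8b/10b encoding, before protocol overhead
lanes_per_link = 4       # SAS 2.0 wide port, per the Controller Architecture diagram

per_port_mb_s = per_lane_mb_s * lanes_per_link
print(f"Theoretical per wide port: ~{per_port_mb_s}MB/s")   # ~2400MB/s vs ~600MB/s observed
```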
We've confirmed through extensive testing that the 600MB/sec limit seems to extend beyond a single host or HBA. We have two physical hosts connected to our MD3200 with dual HBAs running MPIO, so there are a total of 4 physical links available, each with its own 6Gbps HBA.
If anyone has a deep understanding of SAS 2.0 and the inner workings of the MD3200, I would love to chat about this. I'm willing to provide whatever time is needed for testing on our systems, as this is a major deal-breaker for us continuing to roll out the MD3200s for our DSS application.
Thanks,
- Jeff
jeffhoward014
March 29th, 2013 17:00
So my testing continues, and I believe I've narrowed this down a little more... The 600MB/sec ceiling is apparently tied to each controller, not to the array as a whole. For example, using SQLIO with 64K IOs, I'm able to push 600MB/sec against a single RAID volume on a single controller, and about 1000-1200MB/sec if I purposely target two different RAID volumes owned by two different controllers.
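For anyone who wants to reproduce this, my runs look roughly like the following (illustrative only: the drive letters, test file, thread count, queue depth and duration are my own choices, so adjust for your setup):

```
:: Illustrative SQLIO runs (paths and parameters are placeholders for my setup).
:: Single RAID volume on one controller - tops out around 600MB/sec for us:
sqlio -kR -fsequential -b64 -t4 -o8 -s60 -BN -LS E:\sqlio_test.dat

:: Two volumes owned by different controllers, run in parallel - ~1000-1200MB/sec combined:
start sqlio -kR -fsequential -b64 -t4 -o8 -s60 -BN -LS E:\sqlio_test.dat
start sqlio -kR -fsequential -b64 -t4 -o8 -s60 -BN -LS F:\sqlio_test.dat
```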
Keep in mind, my end goal is to ferret out the true max throughput of this device before we pour more money into additional expansion enclosures. The advertised max throughput on a dual-controller setup is 4000MB/sec for perfectly sequential reads, so I'm simply trying to understand how that plays out in a real-world DSS app. At this point in my testing, I have no clue how they got to even 2000MB/sec.
My frustration is this: we currently have three enclosures for a total of 48 disks, and my testing shows that I'm able to max out the sequential throughput with as few as twelve physical disks. Meaning, when I run the dual-controller test noted above that maxes out both controllers, the aggregate throughput is only 1200MB/sec. While 1200MB/sec may not sound too bad, it makes me nervous to invest any more cash in additional expansion enclosures to hang off this device.
Here's an example:
If 12 disks are able to max it out, here's the logical progression (for a DSS app doing 64K sequential reads). For the sake of the example, let's say that a single client's data consumes 6 physical disks (I've included a quick sketch of this math right after the list):
02 client(s) - 12 disks - 600MB/sec/client
04 client(s) - 24 disks - 300MB/sec/client (what we're seeing today with all four RAID volumes under load)
08 client(s) - 48 disks - 150MB/sec/client
16 client(s) - 96 disks - 75MB/sec/client
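The numbers above are nothing more than the observed aggregate ceiling divided by the client count; here's the trivial sketch (the 1200MB/sec ceiling and 6 disks per client come from my own testing and example, not from any Dell specification):

```python
# Per-client throughput if the array tops out at a fixed aggregate ceiling.
# The 1200MB/s ceiling and 6 disks/client are my own observed/example figures.
aggregate_ceiling_mb_s = 1200   # observed dual-controller sequential max
disks_per_client = 6

for clients in (2, 4, 8, 16):
    disks = clients * disks_per_client
    per_client = aggregate_ceiling_mb_s / clients
    print(f"{clients:2d} clients - {disks:3d} disks - {per_client:.0f}MB/s per client")
```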
So by adding just one additional MD1220, our effective throughput drops to 150MB/sec per client? Can an engineer from Dell please confirm this is the expected performance of your SAS 2.0 SAN? If this math is correct, it means we can max out the throughput of this device with basically one MD3220 and one MD1220 expansion enclosure. That seems a little light given the way this device is advertised; we were able to do that with our SAS 1.0 devices 5+ years ago.
Is it possible for an engineer from Dell to chime in on this!?!? We're on hold for our next round of purchasing until we can confirm if this is a device limitation or a configuration issue.
FrostyAtCBM
April 10th, 2013 15:00
Any further news on this? I'm still very interested in the subject.
jeffhoward014
April 10th, 2013 16:00
Yes, actually I do, thank you for the reminder. The mystery is solved! A couple of calls with the Dell storage engineers turned up dry... They really were trying to be helpful; I think I was just too deep into the details of the SAS pathing by the time I brought them in. I decided to revisit the High Performance Tier guide with a fine-tooth comb, since that seemed to have the most "granular" details and documented performance figures I was looking for. While I found the solution in the HPT Guide, you don't necessarily need the HPT "Premium" feature to get past the 600MB/sec boundary.
http://www.dell.com/downloads/global/products/pvaul/en/powervault-md3200-high-performance-tier-implementation.pdf
The root cause was the controller-level cache block size! Changing it from the default of 4K to the max of 32K allowed us to push 3000MB/sec across as few as 24 15k spindles! Remember, this is in a 100/0 read/write scenario with completely sequential reads, but sequential writes were much better as well.
So, thinking about this, it makes sense. For us, a 4K block size at any level of the system is pretty useless... Our stripe size on disk, our cluster size in Windows, and our row widths in SQL are all 8K or larger (larger as you move farther down the technology stack, by design). So basically, the controllers were having to go through multiple cycles to load the cache, since each cycle could only write 4K at a time.
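Here's the mental model in miniature (my own reasoning, not anything stated in the Dell documentation): a 64K host IO has to be staged into controller cache in cache-block-sized chunks, so a larger cache block means fewer cache-fill cycles per IO.

```python
# Cache-fill cycles per 64K IO for different controller cache block sizes
# (illustrative only; my mental model, not from the Dell docs).
io_size_kb = 64
for cache_block_kb in (4, 16, 32):
    cycles = io_size_kb // cache_block_kb
    print(f"{cache_block_kb:2d}K cache block -> {cycles:2d} cache-fill cycles per {io_size_kb}K IO")
```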
This small change completely changes the extensibility of this device for us. We're even seeing improved performance on mixed workloads that blend sequential and random operations. I have to believe this is because our data blocks are simply better aligned throughout every layer of the stack (SQL -> Windows -> VMware -> Controller Cache -> RAID Striping).
FrostyAtCBM
April 10th, 2013 16:00
That is an absolutely BRILLIANT find ... thank you!!! ... I can confirm that my own MD3200 is also configured with the default 4KB cache block size ... I'll look into that some more and schedule a time to make the change ... it should massively improve our backup performance, I would think.
One more question: can this be changed "on the fly" or do I need to schedule an outage?
FrostyAtCBM
April 10th, 2013 17:00
Thanks again. I will try a 16KB cache block size and also disable cache pre-fetch on the virtual disks. But I will wait until the weekend, just in case, rather than changing it during the week. I'll let you know how it goes next week.
jeffhoward014
April 10th, 2013 17:00
Quadruple-checking the PDF from Dell in my previous post, a verbatim cut/paste from the doc says:

Proper array and server tuning is essential for realizing the advantages of High Performance Tier. If a storage system is not optimally tuned the benefits from High Performance Tier may not be fully realized.

For sequential workloads, a controller cache block size of 32KB should be used (the default setting is 4KB). The cache setting may be changed by selecting Change -> Cache Settings from the Storage Array main menu item.

For random workloads, a controller cache block size of 16KB should be used, and cache pre-fetch should be disabled for each virtual disk. Virtual disk cache settings may be changed by selecting Change -> Cache Settings from the Virtual Disk main menu item. If the workloads are a mix of sequential and random IO, we initially recommend using the 16KB cache block with cache pre-fetch disabled, but encourage the system administrator to adjust cache block size as necessary to achieve the optimal results. Cache block size may be changed 'on the fly' without requiring a controller reset, and live system tuning may be required in some cases.
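If you'd rather script it than click through MDSM, something along these lines should work with SMcli; I'm writing this from memory, so treat the exact syntax as an assumption and verify it against the MD3200 CLI guide before running anything (the array and virtual disk names are placeholders):

```
:: Hypothetical SMcli commands (syntax from memory - verify against the MD3200
:: CLI guide). "MyArray" and "Data_VD1" are placeholder names.
SMcli -n MyArray -c "set storageArray cacheBlockSize=16;"
SMcli -n MyArray -c "set virtualDisk [\"Data_VD1\"] cacheReadPrefetch=FALSE;"
```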
FrostyAtCBM
April 14th, 2013 16:00
Switched from 4KB to 16KB cache block size in the controllers on Saturday.
No reboot was necessary. It will take me maybe a week to determine the impact.
I haven't fiddled with the virtual disk cache pre-fetch settings. One change at a time!
FrostyAtCBM
April 25th, 2013 20:00
The change to a 16KB cache block size went fine; however, the impact was almost imperceptible on the usual workloads running on our MD3200. I did keep stats from our backup jobs, but the improvement was marginal (<5%) and possibly statistically indistinguishable from zero.
Since then I have also switched off cache pre-fetch on most of my virtual disks (those not storing backup data). Again, the difference was imperceptible, possibly zero.
But regardless, I now have more knowledge about the configuration and therefore more confidence for the future. So thanks again for your work in digging out this information.