
Re: Ask the Expert: Performance Calculations on Clariion/VNX

Let’s build on Rob’s post with some performance troubleshooting. For example, a server administrator has asked us to create a LUN that can handle 1000 IOps at an R:W ratio of 70:30. We calculated that he would end up with 700 read IOps and 300 write IOps from the server’s perspective. Since we only have RAID 5 at our disposal, each random write triggers four back-end operations (read data, read parity, write new data, write new parity), so the back-end write load would not be 300 IOps but in fact 300 x 4 = 1200 IOps. Remember, this is an “all random I/O”, worst-case calculation. Real life will probably include some sequential write I/O, lowering the write penalty a bit. But since making assumptions can cost you dearly: if you don’t have the facts, assume the worst.
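
In code, that back-of-the-envelope translation looks something like this. A minimal Python sketch; the write penalty of 4 is the worst-case all-random-I/O figure for RAID 5 discussed above:

```python
# Minimal sketch of the front-end to back-end IOps translation.
# write_penalty=4 is the worst-case (all random I/O) RAID 5 figure.
def backend_iops(frontend_iops, read_fraction, write_penalty=4):
    reads = frontend_iops * read_fraction
    writes = frontend_iops * (1 - read_fraction)
    return reads + writes * write_penalty

print(backend_iops(1000, 0.70))  # 700 reads + 4 x 300 writes = 1900.0
```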

So Rob designed a LUN that needs to handle 700 + 4 x 300 = 1900 back-end IOps. Calculating with 15k FC/SAS drives at a rule-of-thumb 180 IOps per drive, he ended up with 10.55 drives. Since EMC doesn’t sell parts of a drive, he built an 11-disk RAID 5 group, created the LUN and allocated it to the server. Job done!
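
The sizing step is the same arithmetic divided by a per-drive rule of thumb. The ~180 IOps per 15k drive figure below is an assumption that reproduces Rob’s 10.55; actual vendor guidance varies per drive type:

```python
import math

# Sizing sketch: back-end IOps divided by a per-drive rule of thumb.
IOPS_PER_15K_DRIVE = 180  # assumed rule of thumb; vendor guidance varies

drives = 1900 / IOPS_PER_15K_DRIVE   # 10.55...
print(math.ceil(drives))             # 11 -> an 11-disk RAID 5 group
```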

Or is it? A couple of weeks later, the server admin comes back:

“my customers are complaining about the performance of the application. It’s the storage, fix it please!”

So how do you continue?

Customers (or server admins, for that matter) usually don’t care about utilization, throughput, R:W ratios, block sizes and the like. They want only one thing: low response times. Think of yourself opening Google: you just want the page to be there quickly. You don’t care how many CPU cycles it takes, whether it’s served from cache, or whether it’s efficient on the back-end.

So I always start by looking at response times. If I can, I prefer to start on the server end, using perfmon or an equivalent. This gives me the best view from the customer’s perspective, since it covers the storage path as a whole (SAN switches included). And, not unimportantly, it allows me to check the assumption the server admin made: if I see <10 ms response times to the storage at all times, my money is on a different server component being the bottleneck.
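
If you’d rather grab those numbers than click through the perfmon GUI, something like the following Python sketch works on Windows by shelling out to the standard typeperf tool. The counter paths and the 10 ms threshold are my choices for illustration, not gospel:

```python
import csv
import subprocess

# Sample the perfmon disk latency counters via typeperf and flag anything
# above the 10 ms guideline. Counter paths/threshold are assumptions.
COUNTERS = [
    r"\LogicalDisk(*)\Avg. Disk sec/Read",
    r"\LogicalDisk(*)\Avg. Disk sec/Write",
]

out = subprocess.run(
    ["typeperf", *COUNTERS, "-sc", "10"],  # 10 samples, 1 s apart
    capture_output=True, text=True, check=True,
).stdout

rows = list(csv.reader(out.strip().splitlines()))
header = rows[0]
for row in rows[1:]:
    if len(row) != len(header):
        continue                      # skip typeperf status lines
    for name, value in zip(header[1:], row[1:]):
        try:
            latency = float(value)    # counters report seconds
        except ValueError:
            continue
        if latency > 0.010:
            print(f"{row[0]} {name}: {latency * 1000:.1f} ms")
```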

So let’s assume the server does show response times of 100 ms. OK, it’s a storage bottleneck. Make a note of the other perfmon data you have at your disposal: the server will also show you how many reads and/or writes it’s sending. This may come in handy during the following steps, or might give you some clues where to look. For example, if you know your RAID group can deliver 1900 IOps, you know it’s dedicated to this server, you’ve verified that the R:W ratio is in the neighborhood of 70:30 and you only see 1000 IOps going towards the storage, my money is not on the disks or the RAID group. The bottleneck is probably somewhere else: storage processor utilization, cache configuration, an overloaded SAN ISL, etc. On the other hand, if you see the LUN requesting 5000 IOps all the time, then with your knowledge of the disk setup you can already be fairly certain the disks aren’t keeping up.
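
That reasoning fits in a toy triage function. A sketch only, using the numbers from our example; the design figure and the verdict strings are mine:

```python
# Translate observed front-end IOps into worst-case back-end IOps and
# compare against what the RAID group was sized for (our example's 1900).
DESIGNED_BACKEND_IOPS = 1900  # the 11-disk RAID 5 group from above

def first_suspect(frontend_iops, read_fraction=0.70, write_penalty=4):
    reads = frontend_iops * read_fraction
    writes = frontend_iops * (1 - read_fraction)
    backend = reads + writes * write_penalty
    if backend <= DESIGNED_BACKEND_IOPS:
        return "not the disks: check SP utilization, cache config, ISLs, ..."
    return "disks suspect: the RAID group can't keep up with this load"

print(first_suspect(1000))  # 1900 back-end IOps, within design -> look elsewhere
print(first_suspect(5000))  # 9500 back-end IOps -> disks can't keep up
```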

So after perfmon on the server side, I usually jump straight into Unisphere Analyzer, skipping the SAN switches. I haven’t run into many SAN bottlenecks yet, so I save myself the time. Again, I start with response times, this time for the LUN. If the server is seeing 100 ms response times and the storage is reporting 20 ms, I know I have to double back to the SAN for the 80 ms gap. If the storage is also reporting 100 ms response times, I know I can focus there. From there on, I start at the storage processors and drill my way down to the disks, checking metrics such as utilization, throughput and queue lengths. Don’t forget to look at the LUN/RAID group configuration as well: maybe at some point someone disabled the write cache!
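
The subtraction itself is trivial, but it tells you which half of the path deserves your time. A tiny sketch; the 5 ms tolerance is an assumed allowance for measurement noise:

```python
# Whatever latency the array doesn't account for must be spent on the
# path in between (HBA, SAN switches, ISLs).
def unaccounted_ms(host_ms, array_ms, noise_ms=5):  # noise_ms is assumed
    gap = host_ms - array_ms
    return gap if gap > noise_ms else 0

print(unaccounted_ms(100, 20))   # 80 ms missing -> double back to the SAN
print(unaccounted_ms(100, 100))  # 0 -> the latency lives inside the array
```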

At this point I think there’s no single troubleshooting flow: it depends entirely on what you find, and even more on how you “connect the dots”. For example, in the case above where the server was experiencing high response times but wasn’t pushing that many IOps, roughly 50% of the environment was slowing down and the problem was visible on two storage systems at the same time. Checking some servers, we quickly concluded that the problem hit all LUNs attached to storage processor A, and only those. We used the VMware Infrastructure Client for this, which, combined with a consistent VMFS datastore naming convention, made connecting the dots easy.

It turned out that a DBA was restoring a database to a virtual machine, at an impressive 300+ MB/s. The downside was that this insane amount of write throughput completely overloaded the storage processor, causing all other LUNs on that SP to slow down. The other storage system slowed down too because the two systems were mirroring to each other, which dragged down the writes on that system as well.

So what troubleshooting do you find yourself doing most often? What do you find easy or hard, and what would you recommend to other storage troubleshooters? Which tools do you use? Let us know!