
February 4th, 2013 13:00

VNX7500 Array - Citrix PVS 6.5 sluggish performance

Hi

We have a VNX7500 array with 900 GB of FAST Cache and 8 GB of write cache on each SP, with 600 drives running: 8 pools of 30 drives each in RAID 5, connected to ESX 4.0 hosts via Cisco MDS 9500 series switches. Citrix PVS was deployed during the initial phase to support approximately 600-700 users. All LUNs on the array are thick LUNs assigned to the ESX hosts, with a standard size of 500 GB each. There are 90 XenApp servers that users connect through, plus a clustered PVS server. The user experience is not good and the problem is intermittent, without any pattern. Some users see slowness while logging in but others don't; some have issues accessing their files but others don't. There is no single definitive problem, just inconsistent, random issues, and the users all complain of sluggish performance. I captured NAR files over two days and this is what I found:

   

A total of 31 LUNs are in question here, all RAID 5 thick LUNs. Total IOPS across these LUNs is approximately 12,000, and the R:W ratio is 4,000 : 8,000 IOPS, i.e. 1:2. SP utilization is slightly below 40% on both SPs. Write cache utilization ranges between 60% and 90% (against the LWM/HWM, which is set to 60/80); it does occasionally hit 100%, but only once or twice a day for a couple of seconds. Pool LUN response times range from 5 ms to 15 ms, apart from a few spikes that go up to 35 ms. ABQL for all the LUNs is below 20, mostly between 5 and 10. SP response time is 1-2 ms (both SPs). The VNX array is running FLARE 32 p11. Analyzer doesn't report forced flushes for pool LUNs, but from what I have read, pool LUNs are considered fine if they stay below 20 ms.
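As a rough back-of-the-envelope sanity check on those numbers, here is a quick Python sketch (a sketch only; the 8 x 30-drive layout is from our config above, and the 180 IOPS/drive figure is the usual 15K SAS rule of thumb rather than anything measured from the NARs):

# Back-of-envelope disk load, assuming the ~12,000 front-end IOPS
# (4,000 reads : 8,000 writes) land on the 8 x 30-drive RAID 5 pools.
read_iops = 4000
write_iops = 8000
raid5_write_penalty = 4          # each host write = 4 disk I/Os on RAID 5
drives = 8 * 30                  # eight 30-drive pools

backend_iops = read_iops + write_iops * raid5_write_penalty
per_disk = backend_iops / drives

print(f"Back-end disk IOPS : {backend_iops}")   # 36,000
print(f"IOPS per drive     : {per_disk:.0f}")   # ~150 against a ~180 IOPS budget

On that arithmetic the spindles are already fairly close to their small-block random budget, which is part of why the RAID 10 recommendation below came up.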

A couple of LUNs with forced flushes were found, but those are RG LUNs, and the spikes are not constant, just a couple of times during the day.

   

Some drives in the pools have an ABQL over 1, but again those are only spikes.

   

The recommendations made so far are:

   

1) Migrate all the XenApp VMs to RAID 10, since writes are 66% and reads are 33% (see the back-end IOPS sketch after this list)

2) Stop MirrorView replication of these LUNs
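For reference, a minimal Python sketch of the write-penalty arithmetic behind recommendation 1 (assuming the standard small-block random write penalties of 4 for RAID 5 and 2 for RAID 1/0):

# Same host workload on RAID 5 vs RAID 1/0, standard write penalties assumed.
read_iops, write_iops = 4000, 8000

raid5_backend = read_iops + write_iops * 4    # 36,000 disk IOPS
raid10_backend = read_iops + write_iops * 2   # 20,000 disk IOPS

print(f"RAID 5 back-end IOPS  : {raid5_backend}")
print(f"RAID 1/0 back-end IOPS: {raid10_backend}")
print(f"Reduction             : {1 - raid10_backend / raid5_backend:.0%}")   # ~44%

In other words, with a 2:1 write:read mix the same host load costs roughly 44% fewer disk I/Os on RAID 1/0, which is the rationale behind that recommendation.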

   

The LUN numbers in question are 10 to 33 and 250 to 251.

   

The questions I have are:

The stats above don't really show any major problem on the array, but the number of users affected is huge. I'm not sure if this really is a SAN issue or whether there are other factors, and I don't know whether moving the VMs from RAID 5 to RAID 10 will really improve the user experience.

 

I have attached the NAR files for 31st Jan '13 and 1st Feb '13 to this post. I would be obliged if anyone could have a look and let me know whether the recommendations will help.

     

Working business hours are 08:00 to 18:00 GMT only.

2 Attachments

2 Intern • 20.4K Posts

February 5th, 2013 06:00

I tried to open the NAR file and it said "incorrect archive version", which is odd... I tried it from a VNX running 32 p11.

1 Rookie • 85 Posts

February 5th, 2013 07:00

I checked for you, and I can open both files normally on a VNX5700 running 32.011. A quick look doesn't really indicate a problem on the box. For example, LUN 16 is relatively busy, but with low response times (<6 ms).

But I only spent a couple of minutes and did not dig deep...

February 5th, 2013 07:00

Have you tried opening the file locally on your laptop/desktop with the Unisphere Client?

February 6th, 2013 02:00

Can you specify any time frame that we need to check to find any anomalies?

However, there is one general observation.

Pool 2 has 27 disks

Pool 8 has 40 disks

All other pools have 30 disks

All the pools are RAID 5, and Pool 2 has 27 disks, which is not divisible by 5. This means that one or two of the private RGs created in the pool will have fewer than 5 disks. Those particular private RGs will have lower performance characteristics, and so will the LUN extents residing on them; the other LUN extents in the same pool will perform fine. This may explain the unpredictable/random nature of the performance issue.
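As a simple illustration of the arithmetic (the actual carving is decided by FLARE and may differ, so treat this Python sketch as an illustration only):

# 27 disks carved into the preferred 4+1 (5-disk) private RGs leaves a
# remainder, which ends up in a smaller private RG with fewer spindles.
pool_disks = 27
preferred_rg_size = 5   # 4+1

full_rgs, leftover = divmod(pool_disks, preferred_rg_size)
print(f"{full_rgs} private RGs of 4+1, with {leftover} disks left over")
# -> 5 private RGs of 4+1, with 2 disks left over; LUN extents that land on
#    the smaller leftover group get fewer spindles behind them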

2 Intern • 20.4K Posts

February 6th, 2013 03:00

Did you look at the spindles in Pool 2 and see IOPS that exceed the "documented" values?

February 6th, 2013 22:00

Hi Dynamox, I checked the NAR files, and the disks in Pool 2 seem to be running below their hardware specification values in both. They are 300 GB SAS drives, but I'm not sure whether they are 15K or 10K RPM. I may be mistaken; can you please help me confirm?

2 Intern • 20.4K Posts

February 7th, 2013 04:00

I'm not sure why, but I still can't get the NAR file to open, even though I am connected to a VNX running 32 p11. I will need to update my local Unisphere client/server installation like you did and try again locally.

41 Posts

February 7th, 2013 14:00

Hi Dynamox.

Do you want me to arrange a file transfer for these NAR files? Please email me at firoz_ali@jltgroup.com in case you can do a WebEx or TeamViewer transfer.

regards

firoz

41 Posts

February 7th, 2013 14:00

Are you seeing the error "The system could not verify the certificate"?

41 Posts

February 7th, 2013 14:00

Hi Sushant

Can you also shed some light on the RLP LUNs, which are in Pool 4 (RAID 10)? When I use the search feature in Analyzer to find the busiest LUNs, most of the top LUNs are from Pool 4 and are RLPs. I can't tell whether my RLPs are also contributing to SP forced flushing, or whether they are causing issues for the Citrix XenApp LUNs I noted in my original description.

regards

41 Posts

February 7th, 2013 14:00

Hi Sushant

All the drives used are 15K RPM (both the 300 GB and 600 GB ones).

2 Intern • 20.4K Posts

February 7th, 2013 14:00

I can get to your file okay, but for some reason I can't open the NAR file. I even updated to the latest Unisphere Client/Server version today and still could not open it. Others have opened it, so it must be something on my side.

41 Posts

February 7th, 2013 14:00

Hi All

Apologies for not responding over the last two days; I was busy at the DC doing fibre cabling.

I have moved some RAID 10 LUNs from RGs 22, 23 and 24, which were causing SP forced flushing, to a 32-drive RG (RAID 10).

I did this tonight (7th Feb) and hope to see better memory utilization tomorrow (8th Feb).

------------------------------------------------------------------------------------------------------------------------------------

Update.

We did a POC by creating two fresh 16-drive RAID 10 RGs, striped a 300 GB metaLUN across both newly created RGs, and vMotioned a XenApp VM onto it. We added only 10 users to see whether they would hit performance problems, and as I expected they did face the same problem, so I am fairly confident the problem lies elsewhere. Sushant noted the Pool 2 drive sets, and LUN 16 is quite busy, but that cannot cause problems for a large estate of 800 users. Some memory spikes are also seen, but they occur only once or twice a day and can be ruled out.

Let me know if there are any other surprises in the NAR files that I didn't notice.

I appreciate Sushant's efforts.

regards

41 Posts

February 7th, 2013 14:00

Hi Sushant

The time frame you can view is from 8:00 hrs to 17:59 hrs GMT.

Thanks for the Pool 2 details you provided, but I have a question for you. The private RGs use 4+1 groups, i.e. 5 drives each, so I assume that up to drive 25 I have 5 private RGs. For the last 2 drives, can I make them into a 4+1 by adding another 3 drives?

February 9th, 2013 19:00

Curiously, I can't open the first attachment, but no worries, as the merged one loads just fine with the latest Unisphere client and server (Dynamox, is this the NAR file you reported on earlier?).

While I'm sitting here waiting on copies to finish, I thought I'd do a quick review and simply dump the items that stood out without necessarily correlating or judging the individual items.  Maybe this list would provide ideas for others that might be able to spend more time on it.  I will also mention that in cases like this, we also ask for SP Collects.  Can I assume that you also have a ticket open with support to review?

Again, the following is just a dump of those items that stood out like the proverbial sore thumb:

1) Pool 2 FAST Cache is disabled

2) Pool 2 has a private RAID group with a 2+1 construct (as well as a 5+1)

- I'm less concerned about the fact there is a 5+1 (or in other words anything that doesn't grossly deviate from the rest or the recommended multiples)

To answer your question, once the private RGs are created they are immutable. Drives added when expanding a pool are used to build new private RAID groups only (they do not expand existing ones). The only way to realign to consistent/recommended sizing is to rebuild the pool from scratch, which means LUN migrations, assuming you even have the space.

The following post talks about the logic behind the private RAID groups and provides a command to show how they were constructed.

https://community.emc.com/message/696115#696115

3) Pool 3 has a private RAID group with a large deviation from its peers (to be honest, most of the pools have PRGs that deviate from the others in the pool, but I'm only noting the two with the largest deviations)

a) 4+4

b) 2+2

4) LUNs 91 and 92 together account for a large share of the IOPS (they are practically all writes)

- Both owned by SP B (candidates for separating LUN ownership?)

- Also accounts for higher overall throughput on SP B compared to SP A

5) Forced flushing, when it occurs, is mostly on SP B

- LUNs 117 and 116 are registering the most activity related to this

- LUN 117 is trespassed (current owner = SP B/default = SP A)

- When it occurs it of course affects all LUNs and will lead to high response times

- Review emc186107 for more information

6) Undersized RAID Groups

- Focus most attention on RAID Group 27 (the period between 10am and 12pm), then 24 and then 25, all of which are undersized for the load being presented

- Review the individual disk IOPS and MB/s that registered in excess of the documented rule of thumb: IOPS for a 15K SAS drive = 180 (small block random) or 12 MB/s (large block sequential); see the sketch at the end of this post

- When write IOPs are higher than the documented saturation points, write cache can be heavily utilized

- You are underestimating the requirements for the LUNs currently on those RG's (and pools)

7) High queue lengths on LUNs (anything over 10-12 suggests it may have a noticeable impact)

a) 99, 66, 72, 74, 51 are hovering over this consistently

b) Others have registered spikes over 10-12

8) Overall SPB3 is (relatively speaking) underutilized (no queue full documented on any of the ports)

a) You mention this is all ESX 4; what PSP are you utilizing? The default is Fixed with Array Preference, which shouldn't be used (notably, it was removed from ESXi 5)

b) I can tell you aren't using PowerPath as you'd generally see an even distribution across the front-end ports

9) Are your Mirrors fractured?

No activity registered during this period on ports SP A0 or SP B0, which are of course the MirrorView ports

10) LUN response times >20ms

You should look into the periods where any LUNs exceed this (generally speaking) undesirable threshold

MISCELLANEOUS

==============

1) Check for trespassed user LUNs (notably pool LUNs, because a pool LUN's allocation owner is immutable; if trespassed, it will utilize resources from both SPs)

naviseccli -h <SP IP> getlun -trespass

Command above reports on both pool and FLARE LUNs (as if you clicked on "Trespassed LUNs report" within Unisphere)
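If it helps, here is a quick Python sketch of the arithmetic behind items 6 and 7 above. The drive counts and IOPS in the example are placeholders, not values from your NARs; substitute the real figures for RAID Groups 27, 24 and 25.

# Item 6: per-spindle back-end IOPS for a RAID 5 group (write penalty 4),
# compared against the ~180 IOPS rule of thumb for 15K SAS drives.
def per_disk_iops(read_iops, write_iops, drives, write_penalty=4):
    return (read_iops + write_iops * write_penalty) / drives

# Placeholder example: a 5-disk RAID 5 RG doing 300 reads + 400 writes per second
print(f"{per_disk_iops(300, 400, drives=5):.0f} IOPS per disk vs a ~180 IOPS budget")   # 380

# Item 7: average queue length is roughly IOPS x response time (Little's Law),
# so e.g. 1,000 IOPS at 12 ms already implies a queue length of about 12.
print(f"Estimated queue length: {1000 * 0.012:.0f}")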
