6.5 Upgrade, now poor performance

Hi,
Let me preface this by saying I am also opening a support case, but I'm looking for suggestions here as well. Prior to the vFoglight 6.5 upgrade, pages were loading in about 5 seconds. Now logging in can take upwards of 180 seconds and various pages time out regularly. We have the following setup:
A VM with 4 vCPUs (2.6 GHz six-core AMD processors underneath) and 16 GB RAM (no swapping, ballooning, or the like), with an MSSQL 2005 backend. The current tuning parameters in foglight.config are:
foglight.vm.option0="-Xmx12g";
foglight.vm.option1="-Xms12g";
foglight.vm.option2="-XX:+DisableExplicitGC";
foglight.vm.option3="-Dfoglight.persistence.cache.duration=30000000";
We are monitoring 5 vCenters and about 2000 VMs. This is not a simple environment; however, according to the sizing guidance, our VM should be adequate. Any suggestions on where to look to track down the problem, or additional tuning thoughts?
Thanks in advance.

Responses(7)

mcondy

18 Posts

December 9th, 2010 11:00

Hi
I don't know if you have an open support case for this, but from the way you describe it I suspect that you may. We have spotted a couple of issues that affect larger implementations and should be able to help. Please email me the case number(s) (mike.condy@quest.com) so that I can have the R&D team direct the support engineer accordingly.
Regards
-Mike C

lmurphy1

57 Posts

December 6th, 2010 15:00

There are a couple of things that you need to do, and I am sure that support will tell you the same.
Your heap size is too high. 12GB is a lot to assign to the heap, and the JVM is probably spending a lot of time in garbage collection.
I would not go higher than 6GB on the heap size unless instructed by support.
You can also take out options 2 and 3, as these are already set by default and are just redundant.
What pages are timing out? Support will need to know which specific dashboards are timing out.
Does it take 180 seconds for all users to log in, or just certain users? If certain users, what is their homepage set to?
-Larry

thejimmy1

7 Posts

December 6th, 2010 16:00

Thanks for the response. The 6GB limit was something we knew from FMS prior to 5.5.4. However, they removed that recommendation and actually had us increase to the current size after the FMS upgrade (both in this vFoglight environment and in a plain Foglight environment we also have). They stated that they had changed some of the flags used for garbage collection and that up to 75% of the installed memory should be fine now. Options 2 and 3 were left over from that same pre-5.5.4 troubleshooting, just to assure the support engineer they were there.
As far as pages timing out, it really can be any of them. On the VMware environment dashboard, for example, the past 4 hours view will load, but change it to 72 hours and it times out. Every user is using the new "perspective" welcome page as far as I'm aware. We've tried 3 users, including the super user foglight, and all either time out or get in eventually. We have not been able to find a user with a changed homepage to test with. Since our foglight user is also suffering, we can't get in to change someone's homepage either.
We are going to try removing all the config options and see if it helps. It's also curious that they don't use large pages on 64-bit, since the new JVM supports it.
Thanks.

thejimmy1

7 Posts

December 6th, 2010 21:00

Here is an update. We removed all the extra options from the config, which got the FMS restarted. After the restart I was able to get in with the foglight super user, though it took some time. Note that it has been like this before: we can usually log in a couple of times immediately after a restart. I promptly changed the foglight super user's home page to the administration home because that page loads fast on login. Support thinks we have a lot of alarms. I used the delete-stale-objects.groovy script that I got from this forum and ran it against every object type in the VMware model (yes, this took some time to do). Basically it entailed copying the entries from the data retention table into a text file, then running a DOS for loop over the item names to feed the groovy script all the object types. After this our FMS is running better (not great, but it's usable at least). We are waiting for support to send over some scripts to help regulate the alarm data, since we use an MSSQL backend. The perspectives page still takes the longest to load, but removing the stale data seems to have at least made it loadable. More than likely we'll continue trimming data and then eventually tune the FMS config. Fingers crossed.
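For anyone wanting to repeat the per-object-type loop, here is a rough sketch in Python rather than a DOS batch file. It is only an illustration of the idea, not the commands that were actually run: the `runner` command is a hypothetical placeholder for however delete-stale-objects.groovy gets invoked on your FMS, and the type names in the usage line are made up.

```python
import shlex


def parse_object_types(text):
    """Return the non-empty, de-duplicated type names from pasted text, one per line."""
    names = []
    for line in text.splitlines():
        name = line.strip()
        if name and name not in names:
            names.append(name)
    return names


def build_commands(type_names, runner=("run_cleanup.cmd", "delete-stale-objects.groovy")):
    """Build one argument list per object type.

    `runner` is a hypothetical placeholder; replace it with the real
    command that runs the Groovy script in your environment.
    """
    return [list(runner) + [name] for name in type_names]


def format_commands(commands):
    """Render the commands as shell-quoted lines so they can be reviewed before running."""
    return [" ".join(shlex.quote(part) for part in cmd) for cmd in commands]
```

Calling `format_commands(build_commands(parse_object_types("ExampleTypeA\nExampleTypeB")))` yields one command line per type. Printing the commands first, instead of executing them straight away, leaves room to check the list before anything is deleted.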

thejimmy1

7 Posts

December 9th, 2010 04:00

Well, here is where we are at. There were over 3 million rows in the alarms table. We've gone from 6.0 -> 6.1 -> 6.5 with this install. So we ran the cleanup scripts to get the alarms table under control. We are currently below 10,000 rows after I decided to just keep December's alarms and move forward. After that, most pages return except for the 2 newest: the Welcome/Perspective page and automation. Timeouts in the UI still occur intermittently and our logs are spattered with derivation and rule type errors. This essentially means we can't enjoy any of the reasons for upgrading to 6.5. Awesome. Even better, I suspect the next move will be to clean out the DB if we can't get it working. So any help is still appreciated. The FMS diagnostic views are now loading. We see a vcontrol script in there that takes 800,000 ms to run (I can't make this stuff up).
Anyone have similar issues? At this point our FMS has become the bane of my existence since I've spent 2.5 days troubleshooting.

mrvirtual

3 Posts

December 21st, 2010 01:00

.

thejimmy1

7 Posts

January 24th, 2011 21:00

The Friday update. I have been put in touch with support and escalation thanks to Mike. So far everyone is a little stumped. We added some options to the config file (listed at the end of this post). While I saw an incremental performance improvement on some pages, I still see timeouts on critical pages like welcome, automation, alarms, and the VMware environment drilldowns. Overall I believe the issue is rooted in the VMware side of monitoring, although there could be some kooky post-upgrade residual bug that we uncovered thanks to our large environment (we are monitoring 2000 VMs across 5 vCenters and 7 data centers). Because of my own impatience I attempted to initialize a parallel empty DB, without much luck (the original problem system is still intact).
Here are some sample errors that seem to appear when hitting certain pages:
2010-12-05 12:56:07.529 WARN [Data-3-thread-109] com.quest.nitro.service.rule.rule.Rule - Failed to evaluate expression qObjs = #capacityAvailable from $scope.hostLogicalDisk#;
availbility = qObjs?.values(qObjs?.topologyObjects[0])?.get(0)?.getValue().getAvg();
return availbility; in rule: VMW Virtual Machine Logical Drive Utilization. Reason: com.quest.nitro.service.sl.interfaces.scripting.ScriptAbortException: script1000147: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
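The "Index: 0, Size: 0" at the end of that entry suggests the expression indexed into an empty list, i.e. the query returned no data for that object. To see which rules fail most often, a small script can tally the failures. The sketch below is only a starting point: it assumes every rule failure ends with the same "in rule: <name>. Reason: <exception>" text as the sample above, which is inferred from this one entry and not from any documented log format.

```python
import re
from collections import Counter

# Matches the tail of a rule-failure entry as it appears in the sample above:
# "... in rule: <rule name>. Reason: <exception class>: ..."
RULE_FAILURE = re.compile(r"in rule: (?P<rule>.+?)\. Reason: (?P<reason>[\w.$]+)")


def count_rule_failures(log_text):
    """Count failed evaluations per (rule name, exception class) pair."""
    counts = Counter()
    for match in RULE_FAILURE.finditer(log_text):
        counts[(match.group("rule"), match.group("reason"))] += 1
    return counts
```

Feeding it the contents of the FMS log and calling `.most_common(10)` on the result gives a short list of the noisiest rules to hand to support.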
Here are the options we added to the config:
foglight.vm.option0="-Xmx6g";
foglight.vm.option1="-Xms6g";
foglight.vm.option2 = "-XX:PermSize=256m";
foglight.vm.option3 = "-XX:MaxPermSize=256m";
foglight.vm.option4 = "-XX:NewRatio=5";
foglight.vm.option5 = "-Dsun.rmi.dgc.client.gcInterval=3600000";
foglight.vm.option6 = "-Dsun.rmi.dgc.server.gcInterval=3600000";
foglight.vm.option7 = "-XX:LargePageSizeInBytes=256m";
foglight.vm.option8 = "-XX:+DisableExplicitGC";
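A misplaced quote or a duplicated option number in a block like this is easy to miss, so it can be worth checking the lines mechanically before restarting the FMS. The sketch below assumes each option line has the form `foglight.vm.optionN = "-flag";` with optional spaces around `=`. That rule is inferred from the lines in this thread, not taken from Foglight documentation, so treat it as a rough sanity check only.

```python
import re

# One option line: foglight.vm.optionN = "-flag";  (spaces around "=" optional)
OPTION_LINE = re.compile(
    r'^foglight\.vm\.option(?P<index>\d+)\s*=\s*"(?P<value>-[^"]*)";$'
)


def check_vm_options(config_text):
    """Return a list of (line_number, message) problems found in the option lines."""
    problems = []
    seen = {}  # option index -> line number where it was first set
    for number, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        if not line.startswith("foglight.vm.option"):
            continue  # ignore everything that is not a VM option line
        match = OPTION_LINE.match(line)
        if match is None:
            problems.append((number, "malformed option line: " + line))
            continue
        index = int(match.group("index"))
        if index in seen:
            problems.append(
                (number, "option%d already set on line %d" % (index, seen[index]))
            )
        else:
            seen[index] = number
    return problems
```

An empty result means every option line matched the assumed pattern and no option number was used twice. It does not say anything about whether the JVM flags themselves are sensible.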
