Insight IQ data retrieval delayed

Question

We are monitoring 4 clusters with Insight IQ. It has not retrieved data from any of them for the past 37 days. How do I get this working again? It has worked in the past. None of the connection ID or passwords have changed. It is version 3.0.0.0036. The datastore is only 42% full.

Anonymous User · Answer

There are multiple reasons for this. Sometimes you'll see the answer in /var/log/insightiq.log

Steps to try:

Restart the insightiq service on your appliance
suspend a collection, wait a few minutes, and resume it
restart the papi service on one of your clusters and see if that helps (assuming you're on 7.x)
Upgrade to IIQ 3.0.1. I've seen references to 3.1 but haven't seen a formal announcement for it. 3.0.1 is fairly stable.

dynamox · Answer

any data or just file system statistics, if it's just file system statistics make sure FSAnalyze job is running/completing.

david_knapp · Answer

Just statistics. FSanalyze is running correctly at 22:00 every day. The basic cluster info shows up, but no connectivity data, no file info, nothing that would be sort of real time.

david_knapp · Answer

I don't have the root password to get into the VM that is running Insight IQ so cannot see what services are running. I will try to have the whole thing rebooted.

chjatwork · Answer

Hey there been a resolution for this?  I have v3.1 and I still get this issue with some of my clusters.

Anonymous User · Answer

It's still an issue with 3.1. I have an open SR on it but I'm not seeing us getting any closer to resolution yet.

IIQ 3.1 is significantly worse than 3.0 for me. The upgrade alone took almost 12 hours from beginning to end by the time the 4 datastores were upgraded.

david_knapp · Answer

I had to reboot the Insight server. David Knapp Quest Diagnostics Nichols Institute Sent from Samsung Note II

chjatwork · Answer

Davek, are you using the VM appliance?

david_knapp · Answer

Yes. David Knapp

mattashton1 · Answer

Other VM's competing for compute and network resources? I suspect a networking bandwidth issue; i.e. several VM's competing for network traffic over a single interface.

For multiple clusters, I have seen better performance from dedicated hardware running InsightIQ.

How many nodes per cluster?

IIQ 3.1 is out.

Cheers,
Matt

david_knapp · Answer

We have four clusters, all have less than 8 nodes. I doubt that VM resource contention is a problem, but I don’t have rights to see that sort of thing. David

mattashton1 · Answer

WAN latency issues?

I would login to the IIQ VM and verify connectivity to each of the clusters as well as the NFS mount for the FSA data.

If you can login to each of the clusters from the VM, that would be a good test as well...

I assume you restarted the VM?

Under Settings, do all of the clusters show as green for Monitored Clusters? (Assuming you are using at least 3.0)

If so, suspend and resume them...

Have you opened an SR?

Cheers,
Matt

osaddict · Answer

Is it worse in how often you see this? If so, how often is that?

The upgrade being slow had to do with a need to look through the entire DB looking for pre-3.0 data, which stored data with a very different time samples. If detected, the upgrade script then gives the user the choice to upgrade that older collected data, drop the old data or cancel the upgrade.

Anonymous User · Answer

I have rights to my VMware infrastructure (I have my VCP and RHCE certs). It's not a VM issue - we've got lots of CPU, memory, and network. CPU and memory are under 20% utilization. Similarly, the hosts and network segments have LOTS of headroom.

This was working FINE with IIQ 3.0 and immediately following the 3.1 upgrade, it's been misbehaving badly. I do have an SR, and the analyst so far is stumped.

I've restarted the app many times but I've had hundreds of instances of it suspending (Supended [sic] monitoring) in my logs. This is not WAN-based monitoring - I'm in the same datacenter and pings average 0.145ms.

$ grep -ic upend cluster.log*

cluster.log:68

cluster.log.1:309

cluster.log.2:310

cluster.log.3:308

This is for just one cluster. The peak on the other 3 clusters is 11 across all versions of the cluster.log files. That's a far cry from a thousand.

The main issues I had with the upgrade taking so long are:

1. The doc says it should take about an hour

2. The upgraded failed and required manual intervention a half-dozen times. If it would have been automatic from beginning to end, I wouldn't have to sit there and monitor it. Not only that, but when it failed, it would restart from the beginning and could take a half-hour before it even got to the same point as it was before it failed. Some intelligent checkpointing should have been implemented.

Isilon

Insight IQ data retrieval delayed

Was this post helpful?