
April 18th, 2014 10:00

InsightIQ monitoring

Hi

What is the best way for a cluster to be monitored by InsightIQ?

Per the EMC recommendation, use the SmartConnect zone name if it includes CPU load balancing; if not, use the IP address of the cluster or of a specific node. If we use the IP address of a specific node, how much load does IIQ generate on that single node? Any other suggestions?

Thanks,

Damal

April 18th, 2014 14:00

Monitoring a specific node is not a problem in terms of load.  However, there is a nasty bug in IIQ that will corrupt the database if the node goes away while IIQ is running - even if it's for scheduled routine maintenance like a node firmware or OneFS upgrade.

I have SmartConnect on all of my clusters and have both static and dynamic zones.  I then point IIQ to the first address of the dynamic pool.

Don't forget to create the /net data structure for your FSA data if you change IP addresses on a monitored cluster.


April 18th, 2014 15:00

Not sure what you're referring to as /net.

April 19th, 2014 06:00

In order to process the FSA data, the cluster's database needs to be mounted on the InsightIQ server.

On the appliance/virtual machine, you'll have a /net directory.  Under that, there is one directory for each of your clusters, and those directories are named after the IP address you're using to monitor them.  Under each of those, you'll see ifs/.ifsvar/modules/fsa, and this path needs to be exported by the cluster.

For example, say your IP address is 1.2.3.4.  You'll have an entry  in /etc/fstab that reads:

1.2.3.4:/ifs/.ifsvar/modules/fsa /net/1.2.3.4/ifs/.ifsvar/modules/fsa nfs rw 0 0

If this data is not mounted, you won't be able to bring up the File System Reporting information for that cluster.
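If you want to sanity-check that mount by hand, a minimal sketch (re-using the 1.2.3.4 example address above, and assuming the cluster is already exporting the FSA directory) would be:

     # mkdir -p /net/1.2.3.4/ifs/.ifsvar/modules/fsa
     # mount -t nfs 1.2.3.4:/ifs/.ifsvar/modules/fsa /net/1.2.3.4/ifs/.ifsvar/modules/fsa
     # ls /net/1.2.3.4/ifs/.ifsvar/modules/fsa

If the ls shows the FSA result directories, File System Reporting should be able to find the data for that cluster.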


April 21st, 2014 13:00

ed.wilts - could you please provide the bug number?

I ask because this behavior contradicts how the system is designed, and I feel that you might have been misinformed by a level 1 support tech.

InsightIQ doesn't generate a single database for performance data - you can see this if you list everything under the datastore.

Ex.

     # ls /datastore

     4_GUID_EPOC - where 4 represents the data structure generation (4 = IIQ 3.0), GUID is the cluster's unique ID, and EPOC is the date it was created.

     If you list what's in the InsightIQ data store, you'll see loads of sqlite databases

     # ls /datastore/4_GUID_EPOC

     STAT.raw.EPOC1-EPOC2.sqlite3 - where STAT is the statistic contained within that database, raw is the non-downsampled data form, EPOC1 is the start date, and EPOC2 is the end date.
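Just to make the naming concrete, here's a made-up example (the GUID, epoch values, and stat name below are purely illustrative, not from a real datastore):

     # ls /datastore
     4_000abc123def4567_1397804400

     # ls /datastore/4_000abc123def4567_1397804400
     cpu.raw.1397804400-1397890800.sqlite3
     cpu.raw.1397890800-1397977200.sqlite3
     ...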

Furthermore - the system for gathering performance data is very fault tolerant. This is because of the persistence.db, which always lives on the cluster. There are two daemons that work to obtain statistical information from the cluster: isi_stats_d and isi_stats_history_d. The stats daemon collects the info, while the stats_history daemon is in charge of the persistence.db. If InsightIQ is unable to obtain data for a period of time, then when InsightIQ is able to communicate with the cluster again, it'll retrieve the missing data for its local dataset from the persistence.db.

The persistence.db only contains statistical data from the last few hours, which is why it's not nearly as big as the IIQ dataset can be.

So the chances of the entire dataset becoming corrupt are extremely low.

Addressing the /net question -

InsightIQ uses autofs to create the mount points. In IIQ v2.5 and older, we used an older version of autofs that had several issues, one of which would cause broken-pipe errors if the node was taken offline (or NFS services were HUPed on the cluster) before IIQ had unmounted the export. Restarting the VM (which would restart autofs) didn't resolve the problem because autofs would reuse the old socket info instead of creating a new one. An old workaround to this problem was to comment out the use of /net with autofs, then edit fstab to manually mount the exports, but this is just a workaround - upgrading autofs is the solution. The affected version of autofs is 5.0.5-54 (pulling that from memory, so I might be a bit off on the version number, but it's the default we shipped with IIQ 2.5.x).
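For anyone stuck on an affected build, a rough sketch of that old workaround (assuming the stock auto.master layout and re-using the 1.2.3.4 example address from earlier): comment out the /net map in /etc/auto.master so the line reads

     #/net   -hosts

and add a static mount for the FSA export to /etc/fstab instead:

     1.2.3.4:/ifs/.ifsvar/modules/fsa /net/1.2.3.4/ifs/.ifsvar/modules/fsa nfs rw 0 0

Again, that's only a stopgap - on a fixed autofs version there's nothing to do.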

TLDR - if you're running a non-affected version of autofs, you don't need to create an entry in fstab or add /net: this is all handled automatically.

Anyways, to address Rdamal's question about how to configure IIQ to monitor a cluster -

If you have a reliable DNS system that's properly configured for Isilon SmartConnect, use an FQDN.

If you'd rather use an IP, point IIQ at a pool of dynamic IPs that has automatic failover. This way, if a node does go down, the IP will be moved to a different node.
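A quick sanity check before handing either one to IIQ (cluster.example.com is just a placeholder SmartConnect zone name):

     # nslookup cluster.example.com
     # nslookup cluster.example.com

Run the lookup a few times - each answer should be an address from the pool, which confirms the zone is resolving properly before IIQ ever connects to it.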


April 22nd, 2014 06:00

I came across this in the user guide: "it is recommended that you specify the monitored cluster by the IP address or host name of a specific node in the cluster; avoid specifying an IP address that can be transferred from node to node."

April 22nd, 2014 07:00

KB 88644 documents a fix for one such corruption scenario, and I believe I was the trigger for it to be rewritten because it didn't work for me.  I ran into this multiple times in 2.5.2, and the issue was duplicated by Support and also by Engineering.  I was told by Support that this was NOT fixed in 3.0 because 3.0 was too far along in the release cycle.  Rebooting a node that IIQ was pointed at will cause IIQ to silently corrupt the event.sqlite3 file and prevent it from collecting any more information.  Furthermore, when ONE cluster has corrupt data, IIQ won't even start (SR 57047464).

> Furthermore - the system for gathering performance data is very fault tolerant

That may be the case with 3.0 - I haven't verified this - but it is VERY definitely NOT the case with 2.5.2.  And if it's so fault tolerant, why do you need to write a KB to recover from the corruption?  I can't begin to tell you how many times I corrupted IIQ through routine recommended maintenance on my Isilon clusters.  We're not a small shop - we have a half dozen clusters and over 4PB of storage with thousands of clients.

2.5.2 caused me no end of grief, but my new configuration on 3.0 appears to be much more stable so far.  I do manually mount my cluster FSA data on /net.  I'm not running the IIQ appliance but installed the RPM onto our own server.  Heck, IIQ 2.5.2 couldn't even properly handle the datastore being NFS-mounted on an Isilon cluster...
