Isilon Telemetry for the Hadoop Admin


Note: This topic is part of the Using Hadoop with OneFS - Isilon Info Hub.

This article describes how a Hadoop administrator can use Grafana for monitoring the resources in a Hadoop cluster.

Introduction

Grafana is widely used for monitoring large-scale architectures, and Hortonworks chose it to augment Ambari's capabilities. Most of this article focuses on Ambari, but the setup also works for Cloudera distributions.

This project is based on the Isilon Data Insights Connector package, available at https://github.com/Isilon/isilon_data_insights_connector. Review this package and its installation requirements, notably Grafana 3.1.1 or higher. This article is based on testing with OneFS version 8.0.1.1, but the procedures should work with all Isilon clusters that run OneFS 8.0 or later.


Ambari Metrics

It is important to begin with a proper configuration of Ambari metrics for Isilon. There are other resources that describe that configuration.

Ambari Metrics is already configured to work with Grafana. This article builds on that model to make Isilon Statistics available to Grafana as well.

This solution does not offer the single pane of glass and high-level view that the Ambari Metrics service provides. However, it does provide statistics such as NameNode (NN) atomic operations and Isilon cache performance. If those metrics interest you, keep reading.

Installation

The only prerequisite for installation is a CentOS VM with network access to the Isilon System zone. Installation follows this high-level plan.

  1. Creation of a role and user on Isilon to read the statistics.
  2. Upgrading Grafana that is installed with Ambari.
  3. Importing new Ambari Dashboards.
  4. Installation of InfluxDB on the same node as the Ambari Metrics collector.
  5. Installation of the Isilon Data Insights Connector package.
  6. Importing the dashboards for Isilon.
  7. Reviewing the dashboards.


RBAC

You need a user account for the Isilon Platform API (PAPI), the RESTful programming interface that is used to pull statistics off the Isilon cluster. To bridge the gap between the Hadoop admins and the Isilon admins, you can leverage Isilon's role-based access control (RBAC). Create a read-only role on the Isilon cluster that has access to PAPI and the statistics subsystem, and then assign this role to a particular user. This article uses a local Isilon user from the System zone. You may instead choose an existing user from a different authentication provider (such as LDAP or AD). The decision should follow your corporate policies.

The Isilon admin can use the following commands to create an account for you.

isi auth roles create perfmon
isi auth roles modify perfmon --add-priv-ro ISI_PRIV_LOGIN_PAPI
isi auth roles modify perfmon --add-priv-ro ISI_PRIV_STATISTICS
isi auth users create pmon --home-directory=/ifs/home/pmon --enabled=true --set-password

The --set-password flag prompts for the password during user creation, so the password is neither displayed nor saved in the shell history. You need this password later, so remember it.

isi auth users modify pmon --password-expires false
isi auth roles modify perfmon --add-user pmon

To review:

isi auth roles view perfmon                                                            
       Name: perfmon
Description: -
    Members: pmon
Privileges
             ID: ISI_PRIV_LOGIN_PAPI
      Read Only: True

             ID: ISI_PRIV_STATISTICS
      Read Only: True

This user is the one to use in the setup of the Isilon Data Insights connector. You only need one instance of the connector, which updates the InfluxDB database. Other users only need Grafana logins, which you can set up separately.
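
Before you configure the connector, it can help to confirm that the new account can actually read statistics through PAPI. The following is a minimal sketch and not part of the original procedure. It assumes that PAPI listens on port 8080, that the statistics endpoint is /platform/1/statistics/current, and that ifs.bytes.total is a valid statistics key on your OneFS release. The function names are hypothetical, and the address is the example cluster address used later in this article.

```shell
# Build the PAPI URL for one statistics key.
# Assumptions: port 8080 and the /platform/1/statistics/current path.
papi_stat_url() {
    local host="$1" key="$2"
    printf 'https://%s:8080/platform/1/statistics/current?key=%s\n' "$host" "$key"
}

# Query one key as the read-only user ($1 = user, $2 = host, $3 = key).
# curl prompts for the password, so it stays out of the shell history.
# -k skips certificate verification, which is only appropriate for a
# cluster that still uses a self-signed certificate.
papi_stat_check() {
    curl -sk -u "$1" "$(papi_stat_url "$2" "$3")"
}

# Example (not run here):
#   papi_stat_check pmon 10.111.158.70 ifs.bytes.total
```

If the role is set up correctly, the call should return a JSON document rather than an authorization error.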


Upgrading Grafana

Grafana 2.6 is installed by default when you deploy HDP 2.5. You need to replace it with a current Grafana release. Perform this task on the node where the Ambari Metrics Collector is installed.

You may instead install it as a standalone server and run two Grafana instances. However, that setup is more confusing than upgrading. Note that the Grafana that was installed with HDP is not removed. However, the new Grafana server should start ahead of the one installed with HDP.

The easiest way to do the install is to stop the Grafana service in Ambari and install the new Grafana RPM. We recommend that you first check the local installation and back up the current configuration files and other files.

[root@nr-hdp-c3 ambari-metrics-grafana]# pwd
/var/log/ambari-metrics-grafana
[root@nr-hdp-c3 ambari-metrics-grafana]# tail grafana.log     
  [1]: default.paths.logs=/var/log/ambari-metrics-grafana
Paths:
  home: /usr/lib/ambari-metrics-grafana
  data: /var/lib/ambari-metrics-grafana
  logs: /var/log/ambari-metrics-grafana
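
The backup recommended above can be sketched as follows. This is a minimal example under stated assumptions: the function name and the backup destination are hypothetical, and the three directories are the ones reported in the log output above. If your configuration files live in another directory, add it to the list.

```shell
# Archive every existing directory named after the destination ($1) into
# one timestamped tarball, and print the path of that tarball.
backup_grafana_dirs() {
    local dest="$1" archive d
    shift
    mkdir -p "$dest" || return 1
    archive="$dest/ambari-grafana-$(date +%Y%m%d-%H%M%S).tar.gz"
    # Rotate through the arguments and keep only directories that exist.
    for d in "$@"; do
        shift
        [ -d "$d" ] && set -- "$@" "$d"
    done
    [ "$#" -gt 0 ] || { echo "nothing to back up" >&2; return 1; }
    tar -czf "$archive" "$@" || return 1
    echo "$archive"
}

# Example (not run here):
#   backup_grafana_dirs /root/grafana-backup \
#       /usr/lib/ambari-metrics-grafana \
#       /var/lib/ambari-metrics-grafana \
#       /var/log/ambari-metrics-grafana
```

Directories that do not exist are skipped rather than treated as errors, so the same call works on nodes with a slightly different layout.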

You might have to install a few other packages.

yum install initscripts fontconfig

Download the RPM and install it. The file name in the install command must match the file that you downloaded.

wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-4.3.2-1.x86_64.rpm
yum localinstall grafana-4.3.2-1.x86_64.rpm

/bin/systemctl daemon-reload
/bin/systemctl enable grafana-server.service
/bin/systemctl start grafana-server.service
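
After the upgrade, you may want to confirm that the installed Grafana meets the 3.1.1 minimum that the Data Insights Connector requires. The following sketch compares two version strings. It assumes GNU sort with the -V option, which CentOS provides, and it assumes that the RPM package is named grafana. The function name is hypothetical.

```shell
# Succeed when version $1 is greater than or equal to version $2.
# Relies on the version-sort option (-V) of GNU sort: if the smaller of
# the two versions is $2, then $1 is at least $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (not run here):
#   installed=$(rpm -q --queryformat '%{VERSION}' grafana)
#   version_ge "$installed" 3.1.1 && echo "Grafana $installed is new enough"
```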

Ambari Dashboards

Because you have a new version of Grafana, you also need compatible versions of the standard dashboards. See https://grafana.com/plugins/praj-ams-datasource.

InfluxDB

InfluxDB is a time series database and quite useful in its own right. This solution uses InfluxDB to store the Isilon statistics that the Data Insights package extracts. Installation consists of creating the repo file, installing the package, checking the config file, and starting the service.

1. Create the repo and install.

[root@nr-hdp-c3]# cat > /etc/yum.repos.d/influxdb.repo <<EOF
[influxdb]
name = InfluxDB Repository - RHEL \$releasever
baseurl = https://repos.influxdata.com/rhel/\$releasever/\$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key
EOF
yum update
yum install influxdb

2. Check that the default port is available. InfluxDB uses port 8086 by default. If that port is already in use, the following command lists it.

netstat -a -n | grep 8086

If needed, choose another port and update the config file at /etc/influxdb/influxdb.conf.

3. To start InfluxDB:

service influxdb start
service influxdb status
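
Steps 2 and 3 above can be tightened with two small helpers. This is a sketch only, and both function names are hypothetical. The first helper is stricter than grep 8086, which also matches ports such as 18086. It assumes the Linux netstat column layout, with the local address in column 4 and the state in column 6. The second helper retries a command until it succeeds. The example use assumes that InfluxDB answers on its /ping endpoint at the default port.

```shell
# Read `netstat -a -n` style output on stdin and succeed only when a
# socket in the LISTEN state uses exactly the given local port ($1).
port_is_listening() {
    awk -v port="$1" '
        $6 == "LISTEN" {
            n = split($4, parts, ":")      # the port follows the last colon
            if (parts[n] == port) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Run a command until it succeeds: at most $1 attempts, with $2 seconds
# between attempts. Fails when the final attempt fails.
retry() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while ! "$@"; do
        [ "$i" -ge "$attempts" ] && return 1
        i=$((i + 1))
        sleep "$delay"
    done
    return 0
}

# Examples (not run here):
#   netstat -a -n | port_is_listening 8086 && echo "port 8086 is already taken"
#   retry 10 2 curl -sf -o /dev/null http://localhost:8086/ping && echo "InfluxDB is up"
```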


Installing the Data Insights Connector

1. Review the installation instructions for the Data Insights package on GitHub, and follow them as closely as possible: https://github.com/Isilon/isilon_data_insights_connector

You will notice two new files for Hadoop.

  • grafana_hadoop_home.json -- the Dashboard that replaces the HDFS Home dashboard from Ambari metrics.
  • grafana_hadoop_datanodes.json -- the Dashboard that replaces the Data node dashboard from Ambari metrics.

Also, the example_isi_data_insights_d.cfg file is an updated version that includes node-based network and disk statistics.

2. In the clusters section of the config file, enter the RBAC user that you created previously as the user name. The following example uses pmon, the user created in the RBAC section earlier.

clusters:
    pmon:emc@10.111.158.70:False

3. Check the active_stat_groups entry. If any of the following groups are missing, add them.

active_stat_groups: cluster_cpu_stats
    cluster_network_traffic_stats
    cluster_client_activity_stats
    cluster_health_stats
    ifs_space_stats
    ifs_rate_stats
    node_load_stats
    node_disk_stats
    node_net_stats
    cluster_disk_rate_stats
    cluster_proto_stats
    cache_stats
    heat_total_stats


4. Add the following two sections:

[node_disk_stats]
update_interval: *
stats: node.disk.bytes.out.rate.avg
    node.disk.bytes.in.rate.avg
    node.disk.busy.avg
    node.disk.xfers.out.rate.avg
    node.disk.xfers.in.rate.avg
    node.disk.xfer.size.out.avg
    node.disk.xfer.size.in.avg
    node.disk.access.latency.avg
    node.disk.access.slow.avg
    node.disk.iosched.queue.avg
    node.disk.iosched.latency.avg

[node_net_stats]
update_interval: *
stats: node.net.int.bytes.in.rate
    node.net.int.bytes.out.rate
    node.net.ext.bytes.in.rate
    node.net.ext.bytes.out.rate
    node.net.int.errors.in.rate
    node.net.int.errors.out.rate
    node.net.ext.errors.in.rate
    node.net.ext.errors.out.rate

5. Run it:

./isi_data_insights_d.py start -c example_isi_data_insights_d.cfg


6. After startup is complete, you can customize Grafana. Follow the documentation on GitHub referenced in step 1. During the import, make sure to also add the two additional .json files.
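
The config checks in steps 3 and 4 above can be automated before you start the connector. The following is a minimal sketch and not part of the connector package. Both function names are hypothetical. It assumes that every group listed in active_stat_groups needs a section header of the same name in the same file, and that continuation lines of the list are indented.

```shell
# Print the group names listed under active_stat_groups, one per line.
# The value may continue on following indented lines.
list_active_groups() {
    awk '
        /^[ \t]*active_stat_groups[ \t]*[:=]/ {
            sub(/^[ \t]*active_stat_groups[ \t]*[:=][ \t]*/, "")
            in_list = 1
            if ($1 != "") print $1
            next
        }
        in_list && /^[ \t]*#/ { next }                 # skip comment lines
        in_list && /^[ \t]+[^ \t]/ { print $1; next }  # indented continuation
        { in_list = 0 }                                # anything else ends the list
    ' "$1"
}

# Report every active group that has no matching [section] header.
# Succeeds only when no section is missing.
check_stat_group_sections() {
    local cfg="$1" group missing=0
    for group in $(list_active_groups "$cfg"); do
        if ! grep -q "^\[$group\]" "$cfg"; then
            echo "missing section: [$group]"
            missing=1
        fi
    done
    return "$missing"
}

# Example (not run here):
#   check_stat_group_sections example_isi_data_insights_d.cfg
```

A missing section is reported by name, which makes it easier to see whether node_disk_stats or node_net_stats still needs to be added.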



After you import the dashboards, they appear along with the other dashboards. There are four original dashboards that provide a very broad range of capabilities. These dashboards are targeted to the Isilon administrator. They report across the whole Isilon cluster and ALL protocols. They provide true data lake telemetry.

In the previous figure, the two dashboards that have (Isilon) after their names are custom dashboards.


HDFS - Home Isilon

The top part of this dashboard shows the network throughput (read and write) and the HDFS throughput, which is a subset of the overall Isilon throughput. There are two filters: the Isilon cluster name, for customers lucky enough to have more than one cluster, and the node number. Not all statistics change when you filter by node number (capacity, for example, does not), but the ones that do change, such as load and cache, can be interesting. There are also some links to the Isilon documentation on the right side.

This section shows the open files and HDFS-only connections per node. Below that is the OneFS file system throughput, which is across *all* protocols. Likewise, the cluster capacity is across all access zones.



These two graphs show the HDFS Protocol broken out by class as a percentage of the whole, and by the atomic operations, as a number of Ops.



These are the L1/L2/L3 cache statistics for the cluster, which are very important to overall performance. L3 cache sits on SSD, so only nodes with SSDs can use it. Virtual nodes do not have L3 cache, so you see the value 0 for them. The second graph is the metadata version, broken out by node. Use this graph to see how well balanced your workflow is.




HDFS Datanodes (Isilon)

This customized dashboard represents the behavior on Isilon that you might want to review about your Datanodes, such as disks and network.


This section shows some disk specifics such as basic read/write rates and latency. These values help you characterize disk-level activity.



Network traffic is important when evaluating performance. Isilon performance issues can often be caused by network issues.


Wrap-up

Grafana dashboards can help with daily reviewing and monitoring of your Isilon cluster. The magic of Grafana brings it all together nicely.



Article ID: SLN319156

Last Date Modified: 03/12/2020 06:46 PM
