October 4th, 2016 14:00

ScaleIO Historical Performance Monitoring

I did some research on this and got back some info from my EMC SE.

Monitoring EMC ScaleIO with Grafana using Intel SDI’s Snap | {code} by Dell EMC

Monitoring EMC Elastic Cloud Storage with Grafana | {code} by Dell EMC

or GitHub - swisscom/collectd-scaleio: A collectd plugin for scaleio

I chose the latter but found the instructions very sparse.  If anyone needs more detailed instructions, I have written them up for internal documentation as my manager didn't want me to have to reinvent the wheel in a year if I needed to configure this again. 

68 Posts

November 11th, 2016 10:00

Hello carterbury1,

I was planning to implement some historical metric monitoring for our production ScaleIO environment and I remembered your post.

If you don't mind sharing your instructions I'd like to take a look.

Kind regards,

Davide

25 Posts

November 11th, 2016 13:00

68 Posts

November 12th, 2016 17:00

Thanks a lot carterbury1,

I made a fix to the wrapper shell script written by Swisscom because I had a problem with certain special characters in the password.

Now I'm planning to redesign the wrapper method: I'd like to use the ScaleIO REST API to query the gateway directly for ScaleIO data instead of parsing the scli output with the wrapper script. I'll keep you updated.

Thanks again,

Davide

25 Posts

November 13th, 2016 09:00

I had the same problem with the password.  I figured the issue was in how the password was being passed.  My workaround was to store the password in the other script file, which was not ideal.  Let me know.

68 Posts

November 13th, 2016 20:00

Hi carterbury1,

To fix the password issue, edit the file "/usr/share/collectd/python/scli_wrap.sh" and replace line 11:

login_out=$(scli --login --username ${USER} --password ${PASS} 2>&1)

with these lines:

cmd="scli --login --username '"$USER"' --password '"$PASS"'"

login_out=$(eval $cmd)

Then edit the file "/etc/collectd/collectd.conf.d/python.conf" and double quote the User and the Password values this way:

User "your_user_here"         # ScaleIO user for getting metrics (creating a read-only user makes sense), default: admin
Password "your_password_here" # Password of the ScaleIO user, default: admin

Then restart the collectd daemon and check whether the fix also works for you.
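
A note on the quoting used above: wrapping the values in plain single quotes copes with most special characters, but a password that itself contains a single quote would still break the command. The Python sketch below illustrates this. It uses shlex.split as a rough stand-in for shell parsing (an approximation, not the real shell), and the helper names are hypothetical, not part of the Swisscom wrapper.

```python
import shlex


def build_login_argv(user, password):
    """Argument list for the scli login call; no shell parsing involved."""
    return ["scli", "--login", "--username", user, "--password", password]


def build_login_command(user, password):
    """The same call as one shell-safe string, quoted with shlex.quote."""
    return " ".join(shlex.quote(arg) for arg in build_login_argv(user, password))


def naive_single_quoted(user, password):
    """Mimics wrapping both values in plain single quotes."""
    return "scli --login --username '{}' --password '{}'".format(user, password)


if __name__ == "__main__":
    # Spaces, '$' and '&' survive plain single quotes.
    print(shlex.split(naive_single_quoted("admin", "pa$s w&rd")))
    # A single quote inside the password needs real escaping.
    print(shlex.split(build_login_command("admin", "it's")))
```

If the login were ever driven from Python rather than from the shell, passing the list from build_login_argv to subprocess.run would avoid quoting altogether.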

In my spare time I wrote a Python script that gets the metric values directly from the ScaleIO gateway, using the ScaleIO REST API instead of wrapping and parsing scli output. I still have to run some further checks, but it seems to work. In the next few days I will integrate my work into the SwissCom collectd Python script. It will take a while because I'm doing this in my spare time.

In my opinion, querying the gateway (using the provided API) is a better approach for two reasons:

- All the metric-related components can be installed on a separate VM that queries only the gateway (which isn't an active player in the SIO environment). Running scli continuously on an MDM node to collect and parse data is not the best solution.

- The gateway knows the topology of the SIO environment at any given time. The SwissCom wrapper, by contrast, has to be installed on all MDMs because, if it is installed on only one MDM, no data is collected when that specific node fails.
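
To give an idea of what querying the gateway looks like, here is a minimal Python sketch, not the plugin code itself. It assumes the gateway REST API behaves as I understand it for ScaleIO 2.0: a GET on /api/login with HTTP Basic credentials returns a token as a JSON string, the token is then used as the Basic password for later calls, and pool statistics are available under /api/instances/StoragePool::<id>/relationships/Statistics. The function names and the "id" and "name" fields are assumptions to check against the REST API reference for your version.

```python
import base64
import json
import ssl
import urllib.request


def basic_auth_header(user, secret):
    """Build the value of an HTTP Basic Authorization header."""
    raw = "{}:{}".format(user, secret).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def gateway_get(gateway, path, user, secret, verify_tls=True):
    """GET https://<gateway><path> with Basic auth and decode the JSON reply."""
    ctx = ssl.create_default_context()
    if not verify_tls:
        # Gateways often use a self-signed certificate; disable checks in a lab only.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    req = urllib.request.Request(
        "https://{}{}".format(gateway, path),
        headers={"Authorization": basic_auth_header(user, secret)},
    )
    with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def pool_statistics(gateway, user, password, verify_tls=True):
    """Log in once, then fetch the statistics of every storage pool."""
    # Assumed: the login call returns the session token as a JSON string.
    token = gateway_get(gateway, "/api/login", user, password, verify_tls)
    pools = gateway_get(gateway, "/api/types/StoragePool/instances",
                        user, token, verify_tls)
    stats = {}
    for pool in pools:
        path = "/api/instances/StoragePool::{}/relationships/Statistics".format(pool["id"])
        stats[pool["name"]] = gateway_get(gateway, path, user, token, verify_tls)
    return stats
```

A call such as pool_statistics("gateway-address:443", "monitor_user", "secret", verify_tls=False) can run from any VM that reaches the gateway, which is the point of the first reason above. Because the password travels in a header rather than on a command line, shell quoting is no longer a concern.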

I will write to you as soon as I have a working and stable version of my implementation.

Thanks,

Davide

25 Posts

November 14th, 2016 15:00

That fix did work for the login being passed.  Thanks!

Yes, let me know when you come up with this alternative solution.  I would be willing to trial it out.  Removing the additional components is not a bad thing at all.  I recently upgraded to 2.0, so I have 3 MDMs, one of which has only the MDM installed, on a standalone machine outside my hypervisor cluster.  This is the MDM I keep as master unless something goes bump on the network.  I installed the solution only on this one MDM, which does not participate as an SDC or SDS.

A quick look at top doesn't show a whole lot of activity for what this machine is doing but I get the concept and agree with it.

68 Posts

November 17th, 2016 20:00

Hi carterbury1,

Today I merged my code with the SwissCom scaleio.py plugin. I'm now testing it on my infrastructure and improving the code by adding exception handling and logging for the most common problems.

After testing I will release it on GitHub, I think before Monday. I will keep you updated!

Kind Regards,

Davide

68 Posts

November 29th, 2016 19:00

Hello carterbury1,

I just released the version of the collectd plugin that relies on the REST API to get metrics from the ScaleIO infrastructure. I would be very happy if you could test it.

- The dashboard was left untouched.

- The scli_wrap.sh is no longer needed.

- I added some directives to the configuration file and renamed some others to be more descriptive: for example, "User" is now "MDMUser" and "Password" is now "MDMPassword".

- As you can see in the README, all the StoragePools to be monitored must now be declared in the configuration file using the "Pools" parameter, separated by spaces.

- It is better to double-quote all the configuration values, as in the example in the README file.
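
To show why a quoted, space-separated "Pools" value and an unquoted one can both be handled, here is a small Python sketch of the idea, not the plugin's actual code. It assumes that collectd hands a config callback nodes with a key and a tuple of values, so a quoted "pool1 pool2" arrives as one string while unquoted names arrive as separate strings. ConfigNode, parse_pools and read_settings are hypothetical names used only for illustration.

```python
from collections import namedtuple

# Stand-in for the config nodes collectd passes to a Python config callback.
ConfigNode = namedtuple("ConfigNode", ["key", "values"])


def parse_pools(values):
    """Flatten the values and split on whitespace to get one name per pool."""
    pools = []
    for value in values:
        pools.extend(str(value).split())
    return pools


def read_settings(children):
    """Collect the directives of one module block into a plain dict."""
    settings = {}
    for node in children:
        if node.key == "Pools":
            settings["Pools"] = parse_pools(node.values)
        elif node.values:
            # Single-valued directives such as Gateway, MDMUser, MDMPassword.
            settings[node.key] = node.values[0]
    return settings


if __name__ == "__main__":
    nodes = [
        ConfigNode("Pools", ("pool1 pool2",)),
        ConfigNode("MDMUser", ("monitor_user",)),
    ]
    print(read_settings(nodes))
```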

If you allow me to revise the instructions you sent me (to match the new design), I'd like to include them as official documentation.

This is the link to our GitHub code repository where you can find the collectd plugin: https://github.com/epicnetworks/collectd-scaleio

Thanks in Advance,

Davide

25 Posts

November 30th, 2016 10:00

Davide,

I will be happy to test this out.  I should have time to put this in place next week.  I will let you know how it goes.

Thanks,

Chad

25 Posts

December 20th, 2016 08:00

Davide,

I have moved to your plugin without issue.  I finished re-configuring late in the day yesterday.  I came in today and it is looking good.  We have had a few core switch issues recently and I had lost stats during those periods when my MDM failed over to another.  It will be nice that the stats just keep coming in.  This is a great enhancement to the monitor.

Chad

68 Posts

December 20th, 2016 20:00

Hi Chad,

Thanks a lot for your feedback. Now that you are querying the gateway, you'll get the data even if the MDM role switches; that was one of the limitations I identified in the SwissCom implementation.

I'm planning to release a new version of the plugin that makes it possible to monitor SDS network latencies and disk latencies. I'm thinking about how to implement this: I will probably add a configuration parameter to declare the list of SDSs and disks to be monitored. In my small ScaleIO infrastructure (3 nodes) I can monitor all of them, but in bigger deployments the graphs could become unreadable, so I want to let the user choose which components to monitor.

I will update this forum as soon as I release a newer version. Suggestions are welcome!

Thanks again,

Davide

25 Posts

December 21st, 2016 07:00

That would be a great enhancement. I look forward to seeing it and appreciate your time.

23 Posts

March 2nd, 2018 16:00

Hi Davide, we found the Python / collectd implementation that uses the ScaleIO GW and we are in the process of testing it.


We are using version ScaleIO 2.0.13000.211


We can log in to the GW interactively using the browser and our credentials; that works correctly.


Here is the collectd.conf file:


<Plugin python>
    ModulePath "/usr/share/collectd/python"
    Import scaleio

    <Module scaleio>
        Debug true                        # default: false
        Verbose true                      # default: false
        Gateway "##.##.##.##:443"         # ScaleIO Gateway IP Address and listening port. (Mandatory)
        Cluster "##"                      # Cluster name will be reported as the collectd hostname, default: myCluster
        Pools "###"                       # list of pools to be reported (Mandatory)
        MDMUser "#####"                   # ScaleIO MDM user for getting metrics (Mandatory)
        MDMPassword "####"                # Password of the ScaleIO MDM user (Mandatory)
    </Module>
</Plugin>

All looks good, but unfortunately we are getting errors in the collectd status.


# systemctl status collectd.service
● collectd.service - Collectd statistics daemon
   Loaded: loaded (/usr/lib/systemd/system/collectd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2018-03-02 16:19:10 PST; 7s ago
     Docs: man:collectd(1)
           man:collectd.conf(5)
 Main PID: 10070 (collectd)
   CGroup: /system.slice/collectd.service
           └─10070 /usr/sbin/collectd

Mar 02 16:19:10 collectd[10070]: plugin_load: plugin "network" successfully loaded.
Mar 02 16:19:10 collectd[10070]: plugin_load: plugin "python" successfully loaded.
Mar 02 16:19:10 collectd[10070]: Systemd detected, trying to signal readyness.
Mar 02 16:19:10 collectd[10070]: ScaleIO: init callback
Mar 02 16:19:10 collectd[10070]: [2018-03-02 16:19:11] Systemd detected, trying to signal readyness.
Mar 02 16:19:10 collectd[10070]: [2018-03-02 16:19:11] ScaleIO: init callback
Mar 02 16:19:10 collectd[10070]: Initialization complete, entering read-loop.
Mar 02 16:19:10 collectd[10070]: [2018-03-02 16:19:11] Initialization complete, entering read-loop.
Mar 02 16:19:10 collectd[10070]: ScaleIO: Error establishing connection to the ScaleIO Gateway. Check your collectd module configuration. Exiting.
Mar 02 16:19:10 collectd[10070]: [2018-03-02 16:19:11] ScaleIO: Error establishing connection to the ScaleIO Gateway. Check your collectd module configuration. Exiting.

Please could you help us to troubleshoot the issue?


Thank you


Saul

274.2K Posts

April 5th, 2018 09:00

I think it's the wild west right now when it comes to monitoring ScaleIO. Even with the ready nodes and AMS, you really don't have historical visibility.

I ended up having to modify the original collectd Python script to work with telegraf (we use a telegraf/influx/grafana stack). It works great, but I always run into the question of how often to poll. So far I have left it at 30 seconds, but I could easily go down to every 5. I guess it really depends on how many SDSs you have configured.

I'll do a write-up and share my script once it's cleaned up.

scaleio-metrics.png

17 Posts

August 9th, 2018 19:00

Hey,

Did you get a chance to complete the write-up?
