Unsolved
25 Posts
0
13695
ScaleIO Historical Performance Monitoring
I did some research on this and got back some info from my EMC SE.
Monitoring EMC ScaleIO with Grafana using Intel SDI’s Snap | {code} by Dell EMC
Monitoring EMC Elastic Cloud Storage with Grafana | {code} by Dell EMC
or GitHub - swisscom/collectd-scaleio: A collectd plugin for scaleio
I chose the latter but found the instructions very sparse. If anyone needs more detailed instructions, I have written them up for our internal documentation; my manager didn't want me to have to reinvent the wheel if I need to configure this again in a year.
c0redump
68 Posts
0
November 11th, 2016 10:00
Hello carterbury1,
I was planning to implement some historical metric monitoring for our production ScaleIO environment and I remembered your post.
If you don't mind sharing your instructions I'd like to take a look.
Kind regards,
Davide
SysEng777
25 Posts
1
November 11th, 2016 13:00
Here ya go. Posted it over at github.
https://github.com/swisscom/collectd-scaleio/files/511533/Scaleioperformancemonitor_v1.txt
c0redump
68 Posts
0
November 12th, 2016 17:00
Thanks a lot carterbury1,
I made a fix to the wrapper shell script written by Swisscom because I had problems with certain special characters in the password.
Now I'm planning to redesign the wrapper method: I'd like to use the ScaleIO REST API, querying the gateway directly for ScaleIO data instead of parsing the scli output with the wrapper script. I'll keep you updated.
Thanks again,
Davide
SysEng777
25 Posts
0
November 13th, 2016 09:00
I had the same problem with the password. I figured the issue was in how the password was being passed. My workaround was storing the password in the other script file, which was not ideal. Let me know.
c0redump
68 Posts
0
November 13th, 2016 20:00
Hi carterbury1,
to fix the password issue you can edit the file "/usr/share/collectd/python/scli_wrap.sh", replacing line 11:
login_out=$(scli --login --username ${USER} --password ${PASS} 2>&1)
with these lines:
cmd="scli --login --username '"$USER"' --password '"$PASS"'"
login_out=$(eval $cmd 2>&1)
Then edit the file "/etc/collectd/collectd.conf.d/python.conf" and double-quote the User and Password values, like this (placeholder values):
User "admin"
Password "p@ssw0rd!"
Then restart collectd daemon and check if the fix works also for you.
In my spare time I wrote a Python script that gets the metric values directly from the ScaleIO gateway via the ScaleIO REST API, instead of wrapping and parsing scli output. I still have to do some further checks, but it seems to work. In the next few days I will integrate my work into the Swisscom collectd Python plugin; it will take a while since I'm doing this in my spare time.
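The core idea can be sketched in a few lines of Python. This is only a sketch: the gateway address and credentials are placeholders, and the /api/login flow and the "Bwc" counter layout are my understanding of the v2.x API, to be double-checked against your gateway version.

```python
import base64
import json
import urllib.request

GATEWAY = "https://192.0.2.10:443"  # placeholder gateway address

def gw_login(user, password):
    """GET /api/login with HTTP Basic auth; the gateway answers with a
    session token, which is then used as the password on later requests."""
    req = urllib.request.Request(GATEWAY + "/api/login")
    cred = base64.b64encode(f"{user}:{password}".encode()).decode("ascii")
    req.add_header("Authorization", "Basic " + cred)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # the token string

def bwc_to_kb_per_sec(bwc):
    """The gateway reports bandwidth as a 'Bwc' object: total KB moved
    during a sampling window. Convert that to an average KB/s rate."""
    if bwc["numSeconds"] == 0:
        return 0.0
    return bwc["totalWeightInKb"] / bwc["numSeconds"]

# Shape of a statistics response fragment, as I understand it:
sample = {"totalReadBwc": {"numOccured": 12, "numSeconds": 4, "totalWeightInKb": 2048}}
print(bwc_to_kb_per_sec(sample["totalReadBwc"]))  # 512.0
```

The real plugin would then use the returned token to query the statistics endpoints; this snippet only shows the login handshake and the counter conversion.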
In my opinion, querying the gateway (using the provided API) is a better approach for a few reasons:
- all the metric-related components can be installed on a separate VM that queries only the gateway (which isn't an active player in the ScaleIO environment). Running scli continuously on an MDM node to collect and parse data is not the best solution.
- the gateway knows the topology of the ScaleIO environment at any given time; by contrast, the Swisscom wrapper has to be installed on all MDMs, because if it is installed on only one MDM, no data is collected when that specific node fails.
I will write you as soon as I have a working and stable version of my implementation.
Thanks,
Davide
SysEng777
25 Posts
0
November 14th, 2016 15:00
That fix worked for passing the login. Thanks!
Yes, let me know when you come up with the alternative solution; I'd be willing to trial it. Removing the additional components is not a bad thing at all. I recently upgraded to 2.0, so I have three MDMs, one of which has only the MDM component installed, as a standalone machine outside my hypervisor cluster. That is the MDM I keep as master unless something goes bump on the network. I only installed the solution on this one MDM, which does not participate as an SDC or SDS.
A quick look at top doesn't show a whole lot of activity for what this machine is doing but I get the concept and agree with it.
c0redump
68 Posts
0
November 17th, 2016 20:00
Hi carterbury1,
today I merged my code into the Swisscom scaleio.py plugin. Now I'm testing it on my infrastructure and improving the code, adding some exception handling and logging for the most common problems.
After testing I will release it on GitHub, I think before Monday. I will keep you updated!
Kind Regards,
Davide
c0redump
68 Posts
0
November 29th, 2016 19:00
Hello carterbury1,
I just released the version of the collectd plugin that relies on the REST API to get metrics from the ScaleIO infrastructure. I would be very happy if you could test it.
- The dashboard was left untouched.
- The scli_wrap.sh is no longer needed.
- I added some directives to the configuration file and renamed some others (choosing more descriptive names): for example, "User" is now "MDMUser" and "Password" is now "MDMPassword".
- As you can see in the README, all the StoragePools to be monitored must now be declared in the configuration file using the "Pools" parameter, separated by spaces.
- It is better to double-quote all the configuration values, as in the example in the README file.
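A minimal Module section could look like this (pool names and credentials are placeholders; the example in the README is authoritative):

```
<Plugin python>
  ModulePath "/usr/share/collectd/python"
  Import "scaleio"
  <Module scaleio>
    Gateway "192.0.2.10:443"
    Cluster "myCluster"
    Pools "pool01 pool02"      # StoragePools separated by spaces
    MDMUser "admin"
    MDMPassword "secret"
  </Module>
</Plugin>
```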
If you authorize me to revise the instructions you sent me (adapting them to the new design), I'd like to include them as official documentation.
This is the link to our GitHub code repository where you can find the collectd plugin: https://github.com/epicnetworks/collectd-scaleio
Thanks in Advance,
Davide
SysEng777
25 Posts
0
November 30th, 2016 10:00
Davide,
I will be happy to test this out. I should have time to put this in place next week. I will let you know how it goes.
Thanks,
Chad
SysEng777
25 Posts
0
December 20th, 2016 08:00
Davide,
I have moved to your plugin without issue. I finished reconfiguring late in the day yesterday. Came in today and it is looking good. We have had a few core switch issues recently, and I had lost stats during those periods when my MDM failed over to another. It will be nice that the stats just keep coming in. This is a great enhancement to the monitor.
Chad
c0redump
68 Posts
0
December 20th, 2016 20:00
Hi Chad,
thanks a lot for your feedback. Now that you are querying the gateway, you'll get the data even if the MDM role switches; that was one of the limitations I identified in the Swisscom implementation.
I'm planning to release a new version of the plugin that can also monitor SDS network latencies and disk latencies. I'm thinking about how to implement this: probably I will add a configuration parameter to declare the list of SDSs and disks to be monitored. In my small ScaleIO infrastructure (3 nodes) I can monitor all of them, but in bigger deployments the graphs can become unreadable, so I want to give the user the possibility to choose which components to monitor.
I will update this forum as soon as I release a newer version. If you have suggestions, you're welcome to share them!
Thanks again,
Davide
SysEng777
25 Posts
0
December 21st, 2016 07:00
That would be a great enhancement. I look forward to seeing it and appreciate your time.
SignatureIT
23 Posts
0
March 2nd, 2018 16:00
Hi Davide, we found the Python / collectd implementation using the ScaleIO GW and we are in the process of testing it out.
We are using ScaleIO version 2.0.13000.211.
We can log in to the GW interactively using a browser and the credentials; that works correctly.
Here is the collectd.conf file:
<Plugin python>
  ModulePath "/usr/share/collectd/python"
  Import "scaleio"
  <Module scaleio>
    Debug true                # default: false
    Verbose true              # default: false
    Gateway "##.##.##.##:443" # ScaleIO Gateway IP address and listening port (mandatory)
    Cluster "##"              # cluster name, reported as the collectd hostname; default: myCluster
    Pools "###"               # list of pools to be reported (mandatory)
    MDMUser "#####"           # ScaleIO MDM user for getting metrics (mandatory)
    MDMPassword "####"        # password of the ScaleIO MDM user (mandatory)
  </Module>
</Plugin>
All looks good, but unfortunately we are getting errors in the collectd status.
# systemctl status collectd.service
● collectd.service - Collectd statistics daemon
Loaded: loaded (/usr/lib/systemd/system/collectd.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2018-03-02 16:19:10 PST; 7s ago
Docs: man:collectd(1)
man:collectd.conf(5)
Main PID: 10070 (collectd)
CGroup: /system.slice/collectd.service
└─10070 /usr/sbin/collectd
Mar 02 16:19:10 collectd[10070]: plugin_load: plugin "network" successfully loaded.
Mar 02 16:19:10 collectd[10070]: plugin_load: plugin "python" successfully loaded.
Mar 02 16:19:10 collectd[10070]: Systemd detected, trying to signal readyness.
Mar 02 16:19:10 collectd[10070]: ScaleIO: init callback
Mar 02 16:19:10 collectd[10070]: [2018-03-02 16:19:11] Systemd detected, trying to signal readyness.
Mar 02 16:19:10 collectd[10070]: [2018-03-02 16:19:11] ScaleIO: init callback
Mar 02 16:19:10 collectd[10070]: Initialization complete, entering read-loop.
Mar 02 16:19:10 collectd[10070]: [2018-03-02 16:19:11] Initialization complete, entering read-loop.
Mar 02 16:19:10 collectd[10070]: ScaleIO: Error establishing connection to the ScaleIO Gateway. Check your collectd module configuration. Exiting.
Mar 02 16:19:10 collectd[10070]: [2018-03-02 16:19:11] ScaleIO: Error establishing connection to the ScaleIO Gateway. Check your collectd module configuration. Exiting.
Could you please help us troubleshoot the issue?
Thank you
Saul
Anonymous
5 Practitioner
274.2K Posts
0
April 5th, 2018 09:00
I think it's the wild west right now when it comes to monitoring ScaleIO. Even with the Ready Nodes and AMS, you really don't have historical visibility.
I ended up having to modify the original collectd Python script to work with telegraf (we use a telegraf/InfluxDB/Grafana stack). It works great, but I always run into the question of how often to poll. So far I've left it at 30 seconds, but I could easily go down to every 5. I guess it really depends on how many SDSs you have configured.
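For anyone wiring up something similar: assuming the script is run through telegraf's exec input (the script path below is a placeholder), the polling interval can be overridden per input rather than globally:

```toml
[[inputs.exec]]
  ## script printing ScaleIO metrics in influx line protocol (placeholder path)
  commands = ["/usr/local/bin/scaleio_stats.py"]
  timeout = "15s"
  interval = "30s"   # per-input override of the agent-level interval
  data_format = "influx"
```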
I'll do a write up and share my script once it's cleaned up.
charan25
17 Posts
0
August 9th, 2018 19:00
Hey,
Did you get a chance to complete the write-up?