Problem with CommVault collector and volume data

January 26th, 2011 08:00

I am logging this issue with EMC support, but I am also posting it here....

We have a problem with our installation of EMC Data Protection Advisor 5.5.1 on a Windows Server 2003 SP2 server, with collectors on CommVault Galaxy 8.0.0 backup servers running Windows Server 2003 R2 SP2 (we also have EMC Legato NetWorker backup servers, on which the collectors are working fine). The collector version on these servers is 5.5.1.5272.

We have the following requests configured for CommVault clients:

[screenshot: the requests configured for the CommVault clients]

The first three requests work fine and produce the data that would be expected (by extension, I can confirm the DPA Collector service is working fine on the CommVault server). The CommVault Volume Status request reports no information back at all. If I run a Test on that collector I get the following:

[screenshot: the Test result for the CommVault Volume Status request]

However, checking the history for this collector it is evident that no data has been gathered:

[screenshot: the request history, showing no data gathered]

Note that the other requests return a full history here (and reports bring back relevant data, which confirms data is being returned).

February 1st, 2011 06:00

Some further information: running the collector command locally and piping its output to a text file shows that the correct data is being generated.

February 1st, 2011 06:00

We had the same problem with the collector running on CommVault 8.0 SP5 and DPA 5.6 on Linux.

Most data was coming through without problems, but five requests never returned any data, although they were deemed successful within DPA.

It turned out that a firewall was blocking some of the traffic. The error handling of the collector was very buggy and always considered the data transfer to be successful, even when in reality it was not.

Turn on debugging on the collector on the CommVault machine, and look for the following messages:

fwk.base.netsock - nsSocketWait(): timeout receiving data

Around 5 seconds later, the same message again:

fwk.base.netsock - nsSocketWait(): timeout receiving data

followed by the message:

clctr.report - aapiSendReport(): successfully sent report to listener hostname:4001

In our case the data was not sent because of a bug in our Palo Alto firewall, but I suppose the problem could be caused by any filtering device.
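
(If the debug log is large, a quick scan can flag the suspicious sequences. The Python sketch below is purely illustrative: it only matches on the message text quoted above, and the log path is simply whatever file your collector writes its debug output to.)

import sys

TIMEOUT = "nsSocketWait(): timeout receiving data"
SUCCESS = "aapiSendReport(): successfully sent report"

def scan(path):
    # Count timeout messages and flag any "successfully sent" line that
    # follows one or more of them; all other log lines are ignored.
    pending_timeouts = 0
    for lineno, line in enumerate(open(path, encoding="utf-8", errors="replace"), 1):
        if TIMEOUT in line:
            pending_timeouts += 1
        elif SUCCESS in line:
            if pending_timeouts:
                print(f"line {lineno}: 'successfully sent' after "
                      f"{pending_timeouts} timeout(s) -- transfer may have failed")
            pending_timeouts = 0

if __name__ == "__main__":
    scan(sys.argv[1])   # e.g. python scan_collector_log.py <collector debug log>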

February 1st, 2011 06:00

My mistake, I was looking at the wrong log!  Identical error messages!  I will follow this one up....

February 1st, 2011 06:00

That's interesting, as we do have several firewalls here that may be preventing the data from getting through, though I could not find any errors like the ones you mentioned (we had turned on debug logging). I have just chased this up with EMC, so it will be interesting to see what they say. We have four CommVault servers and three of them have these symptoms, but the fourth seems to be reporting volume data back okay; all appear to be identically configured.

One further thing I have discovered while investigating this issue is that the amount of data being gathered from the CommVault servers is much less than that being gathered from the NetWorker servers. Despite this, the impact of the Collector on the CommVault servers is so high that administrators sometimes have to stop the collector to allow the server to continue to function.

Another issue is that data is inconsistent between backup applications - for example, LTO4 tapes are reported as "Ultrium V4" by CommVault and "Ultrium 4" by NetWorker (this can be worked around, but it would be better if DPA stored consistent information across applications).

The state field for media has the value "0" for CommVault tapes; however, when querying the state, only the fixed set of NetWorker media states is offered as options.

It is unclear where the volume_ID comes from; it does not match the barcode, which is a little confusing.

February 1st, 2011 07:00

>> One further thing I have discovered while investigating this issue is that the amount of data being gathered from the CommVault servers is much less than that being gathered from the NetWorker servers. Despite this, the impact of the Collector on the CommVault servers is so high that administrators sometimes have to stop the collector to allow the server to continue to function.

Not sure about that. Some requests contain lots of data, others don't.

To lower the resource hit on the server, you could fine-tune the periodicity of the requests. Some of them run every 5 minutes, some of them once a day. A good standard DPA report here is "DPA -> Status -> Request Summary". It shows how many times each request has run in the specified time window, its run time on the server, and the average size. If a request has a high number of run-time seconds, you can set it to run less often, thereby lowering the resource impact on the server.
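
(As a rough illustration of the arithmetic: the daily cost of a request is simply runs per day multiplied by its average run time, so halving the frequency halves the load. The figures in the Python sketch below are invented and the request names are only placeholders; the real numbers come from the Request Summary report.)

# Invented example figures; substitute the values from the Request Summary report.
requests = [
    # (request name, interval in minutes, average run time in seconds)
    ("CommVault Job Monitor",    5,    4),
    ("CommVault Volume Status",  60,  90),
    ("CommVault Configuration", 1440, 20),
]

for name, interval_min, runtime_s in requests:
    runs_per_day = 24 * 60 / interval_min
    busy_minutes = runs_per_day * runtime_s / 60
    print(f"{name:26s} {runs_per_day:6.0f} runs/day, "
          f"~{busy_minutes:5.1f} min of collector time/day")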

I cannot comment on the other topics, as I'm not familiar with CommVault as such.

>> Some further information: running the collector command locally and piping its output to a text file shows that the correct data is being generated.

Yes, the data is being gathered; the problem is with transferring that data to the DPA server itself.

One case where the data is not gathered correctly is the data size of backups: it mostly returns negative numbers (although the sizes are correct in the CommVault database), but I understand this is fixed in DPA 5.7. I have not upgraded to 5.7 yet, so I can't verify this.

>> My mistake, I was looking at the wrong log!  Identical error messages!  I will follow this one up....

Nice. Curious what the outcome is going to be.

I had an EMC case for this problem as well, but as soon as I found that the firewall was the problem, the bugs in the error handling of the collector were apparently not deemed important enough to keep the case open and get them fixed.

February 1st, 2011 08:00

I will experiment with reducing some of the frequencies to see if that improves performance.

We have seen some dubious-looking negative figures in reporting recently, so if that bug is fixed in 5.7 it is a good reason to update the software; I believe I saw the bug referenced in the 5.5 release notes, but it was listed neither as a known bug nor as a resolved bug in the 5.6 and 5.7 manuals.

Looking at the netstat -an output, it appears we have some dodgy connections (I have stopped the services but established a telnet session during these tests):

good client

  TCP     172.21.1.34:1979       194.238.48.38:4001      ESTABLISHED

bad client

  TCP     10.95.254.36:2134      194.238.48.38:4001      FIN_WAIT_1

  TCP     10.95.254.36:2393      194.238.48.38:4001      FIN_WAIT_1

  TCP     10.95.254.36:3004      194.238.48.38:4001      FIN_WAIT_1

  TCP     10.95.254.36:3241      194.238.48.38:4001      FIN_WAIT_1

  TCP     10.95.254.36:3855      194.238.48.38:4001      FIN_WAIT_1

  TCP     10.95.254.36:4019      194.238.48.38:4001      ESTABLISHED

dpa server

TCP     194.238.48.38:4001     10.255.31.34:1979      ESTABLISHED

TCP     194.238.48.38:4001     10.255.33.36:2134       ESTABLISHED

TCP     194.238.48.38:4001     10.255.33.36:2393       ESTABLISHED

TCP     194.238.48.38:4001     10.255.33.36:3004       ESTABLISHED

TCP     194.238.48.38:4001     10.255.33.36:3241       ESTABLISHED

TCP     194.238.48.38:4001     10.255.33.36:3855       ESTABLISHED

TCP     194.238.48.38:4001     10.255.33.36:4016       TIME_WAIT

TCP     194.238.48.38:4001     10.255.33.36:4019       ESTABLISHED

From what I understand, connections in FIN_WAIT_1 have sent a FIN but have not received an ACK in response; presumably, as the listener port is still showing as ESTABLISHED, the FIN must be being dropped somewhere in between.
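
(For reference, rather than eyeballing the output, something like the following Python sketch could count the connection states towards the listener. It is only an illustration: it assumes the Windows-style netstat -an layout shown above and that 4001 is the listener port, as in this thread.)

import subprocess
from collections import Counter

LISTENER_PORT = "4001"   # DPA listener port used in this thread

def connection_states():
    # Parse "netstat -an" output; expect lines like:
    #   TCP    10.95.254.36:2134    194.238.48.38:4001    FIN_WAIT_1
    out = subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout
    states = Counter()
    for line in out.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "TCP" and parts[2].endswith(":" + LISTENER_PORT):
            states[parts[3]] += 1
    return states

if __name__ == "__main__":
    for state, count in sorted(connection_states().items()):
        print(f"{state:12s} {count}")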

February 1st, 2011 08:00

And here is the error message from the collector log:

DBG1     3064.5708     20110127:161619 fwk.app.polledevents - polledEventsHandler(): running theevent 'collector_config_checker' in background thread
DBG1     3064.1496     20110127:161619 fwk.app.polledevents - runEvent(): theevent 'collector_config_checker' has completed
DBG1     3064.1980     20110127:161621     fwk.base.netsock - nsSocketWait(): timeout receiving data
DBG1     3064.1980     20110127:161626     fwk.base.netsock - nsSocketWait(): timeout receiving data
DBG1     3064.1980     20110127:161626         clctr.report - aapiSendReport(): successfully sent report to listener ba.ifl.net:4001

February 1st, 2011 10:00

>> We have seen some dubious-looking negative figures in reporting recently, so if that bug is fixed in 5.7 it is a good reason to update the software;

>> I believe I saw the bug referenced in the 5.5 release notes, but it was listed neither as a known bug nor as a resolved bug in the 5.6 and 5.7 manuals.

Indeed, I could not find it in any of the release notes either. But the notes for bug ID DPA-8499 state that it should be fixed in 5.7. Let's see and hope.

>> From what I understand, connections in FIN_WAIT_1 have sent a FIN but have not received an ACK in response;

>> presumably, as the listener port is still showing as ESTABLISHED, the FIN must be being dropped somewhere in between.

Or the firewall is "denying" the request (instead of dropping it) and doesn't send (or is not allowed to send) the ACK response to the FIN.

In normal situations, FIN_WAIT_1 connections don't prevent the stack from initiating new connections. Of course, the data that was sent on the connections in FIN_WAIT_1 probably never reached the DPA server, hence the missing data.

>> DBG1    3064.5708    20110127:161619 fwk.app.polledevents - polledEventsHandler(): running theevent 'collector_config_checker' in background thread

>> DBG1    3064.1496    20110127:161619 fwk.app.polledevents - runEvent(): theevent 'collector_config_checker' has completed

>> DBG1    3064.1980    20110127:161621    fwk.base.netsock - nsSocketWait(): timeout receiving data

>> DBG1    3064.1980    20110127:161626    fwk.base.netsock - nsSocketWait(): timeout receiving data

>> DBG1    3064.1980    20110127:161626        clctr.report - aapiSendReport(): successfully sent report to listener ba.ifl.net:4001

Yep, same problem we had. It shows the problems with the error handling of the collector very clearly: timeout, timeout, successfully sent.

Actually, the first two lines are from a different thread in the collector. I have noticed that the two numbers after the DBG1 part seem to identify the thread. So if you grep on "3064.1980" in this case, you should be able to find the request that is trying to send the data. I would be very surprised if it is not the request you are missing the data from.
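
(If grep is not to hand on the Windows box, a couple of lines of Python would do the same filtering. This is only a sketch: it assumes the DBG1 line layout quoted above, and the log path and thread ID are just examples.)

import sys

def lines_for_thread(path, thread_id):
    # Keep only "DBG1 <number.number> ..." lines whose second field matches
    # the requested thread ID; other (or shorter) lines are skipped.
    with open(path, encoding="utf-8", errors="replace") as log:
        return [line.rstrip("\n") for line in log
                if line.startswith("DBG1") and line.split()[1:2] == [thread_id]]

if __name__ == "__main__":
    # e.g. python thread_trace.py <collector debug log> 3064.1980
    for line in lines_for_thread(sys.argv[1], sys.argv[2]):
        print(line)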

Another (quite important) side effect of this issue is that we only received about 60% of the save jobs in the DPA database. All the others were lost. As the collector assumed they were sent successfully, they were of course never resent.

After we cleared the firewall issue, I spent a large number of hours comparing CommVault jobs with the ones in the DPA database, and we now get them all. Of course, whenever there is a hiccup on the network (firewall or not), some jobs will be lost again, which makes it all the more important that the error handling gets fixed in future collector versions!

Good luck.

February 2nd, 2011 00:00

We can see thread 1980 in the list of firewall ports, so that matches up with the connections in the FIN_WAIT_1 state. We will see what our firewall guy comes back with.

As you say, this is particularly bad handling of an error, and I am now a little concerned that we may be dropping other data as a result. I will keep you updated with EMC's response.

February 2nd, 2011 09:00

EMC have confirmed that DPA-8499 is resolved in 5.7

February 2nd, 2011 09:00

David Hampson wrote:

EMC have confirmed that DPA-8499 is resolved in 5.7

Thank you David, that is good news.

Guess it's time to start upgrading everything...

February 14th, 2011 07:00

With regard to the original problem, we have been told it is a known issue fixed in 5.7 (or by a patch currently being written); however, I have yet to receive details of any escalation it might relate to.

February 16th, 2011 03:00

The patch was received yesterday and installed on two collectors; however, this had no impact on the problem: no volume data appears to be available.
