Problem with CommVault collector and volume data

January 26th, 2011 08:00

I am logging this issue with EMC support, but I am also posting it here....

We have a problem with our installation of EMC Data Protection Advisor 5.5.1 on a Windows Server 2003 SP2 server, with collectors on CommVault Galaxy 8.0.0 backup servers running Windows Server 2003 R2 SP2 (we also have EMC Legato NetWorker backup servers, on which the collectors are working fine). The collector version on these servers is 5.5.1.5272.

We have the following requests configured for CommVault clients:

[screenshot: the requests configured for the CommVault clients]

The first three requests work fine and produce the data that would be expected (by extension, I can confirm the DPA Collector service is working fine on the CommVault server). The CommVault Volume Status request reports no information back at all. If I run a Test on that collector I get the following:

[screenshot: the Test result for the CommVault Volume Status request]

However, checking the history for this collector it is evident that no data has been gathered:

[screenshot: the request history, showing no data gathered]

Note that the other requests return a full history here (and reports bring back relevant data, which confirms data is being returned).

February 1st, 2011 06:00

Some further information: running the collector command locally and piping its output to a text file shows that the correct data is being generated.

February 1st, 2011 06:00

We had the same problem with the collector running on CommVault 8.0 SP5 and DPA 5.6 on Linux.

Most data was coming through without problems, but five requests never returned any data, although they were deemed successful within DPA.

It turned out that a firewall was blocking some of the traffic. The error handling of the collector was very buggy and always considered the data transfer to be successful, even when in reality it was not.

Turn on debugging on the collector on the CommVault machine, and look for the following messages:

fwk.base.netsock - nsSocketWait(): timeout receiving data

Around 5 seconds later, the same message again:

fwk.base.netsock - nsSocketWait(): timeout receiving data

followed by the message:

clctr.report - aapiSendReport(): successfully sent report to listener hostname:4001

In our case the data was not sent because of a bug in our Palo Alto firewall, but I suppose the problem could be caused by any filtering device.
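
(If the debug log is large, a quick scan can flag the suspicious sequences. The Python sketch below is purely illustrative: it only matches on the message text quoted above, and the log path is simply whatever file your collector writes its debug output to.)

import sys

TIMEOUT = "nsSocketWait(): timeout receiving data"
SUCCESS = "aapiSendReport(): successfully sent report"

def scan(path):
    # Count timeout messages and flag any "successfully sent" line that
    # follows one or more of them; all other log lines are ignored.
    pending_timeouts = 0
    for lineno, line in enumerate(open(path, encoding="utf-8", errors="replace"), 1):
        if TIMEOUT in line:
            pending_timeouts += 1
        elif SUCCESS in line:
            if pending_timeouts:
                print(f"line {lineno}: 'successfully sent' after "
                      f"{pending_timeouts} timeout(s) -- transfer may have failed")
            pending_timeouts = 0

if __name__ == "__main__":
    scan(sys.argv[1])   # e.g. python scan_collector_log.py <collector debug log>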

February 1st, 2011 06:00

My mistake, I was looking at the wrong log!  Identical error messages!  I will follow this one up....

February 1st, 2011 06:00

That's interesting, as we do have several firewalls here that may be preventing the data from getting through, though I could not find any errors like the ones you mentioned (we had turned on debug logging). I have just chased this up with EMC, so it will be interesting to see what they say. We have four CommVault servers and three of them have these symptoms, but the fourth seems to be reporting volume data back okay; all appear to be identically configured.

One further thing I have discovered while investigating this issue is that the amount of data being gathered from the CommVault servers is much less than that being gathered from the NetWorker servers. Despite this, the impact of the Collector on the CommVault servers is so high that administrators sometimes have to stop the collector to allow the server to continue to function.

Another issue is that data is inconsistent between backup applications - for example, LTO4 tapes are reported as "Ultrium V4" by CommVault and "Ultrium 4" by NetWorker (this can be worked around, but it would be better if DPA stored consistent information across applications).

The state field for media has the value "0" for CommVault tapes; however, when querying the state, only the fixed set of NetWorker media states is offered as options.

It is unclear where the volume_ID comes from; it does not match the barcode, which is a little confusing.

February 1st, 2011 07:00

>> One further thing I have discovered while investigating this issue is that the amount of data being gathered from the CommVault servers is much less than that being gathered from the NetWorker servers. Despite this, the impact of the Collector on the CommVault servers is so high that administrators sometimes have to stop the collector to allow the server to continue to function.

Not sure about that. Some requests contain lots of data, others don't.

To lower the resource hit on the server, you could fine-tune the periodicity of the requests. Some of them run every 5 minutes, some of them once a day. A good standard DPA report here is "DPA -> Status -> Request Summary". It shows how many times each request has run in the specified time window, its run time on the server, and the average size. If a request has a high number of run-time seconds, you can set it to run less often, thereby lowering the resource impact on the server.
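
(As a rough illustration of the arithmetic: the daily cost of a request is simply runs per day multiplied by its average run time, so halving the frequency halves the load. The figures in the Python sketch below are invented and the request names are only placeholders; the real numbers come from the Request Summary report.)

# Invented example figures; substitute the values from the Request Summary report.
requests = [
    # (request name, interval in minutes, average run time in seconds)
    ("CommVault Job Monitor",    5,    4),
    ("CommVault Volume Status",  60,  90),
    ("CommVault Configuration", 1440, 20),
]

for name, interval_min, runtime_s in requests:
    runs_per_day = 24 * 60 / interval_min
    busy_minutes = runs_per_day * runtime_s / 60
    print(f"{name:26s} {runs_per_day:6.0f} runs/day, "
          f"~{busy_minutes:5.1f} min of collector time/day")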

I cannot comment on the other topics, as I'm not familiar with CommVault as such.

>> Some further information: running the collector command locally and piping its output to a text file shows that the correct data is being generated.

Yes, the data is being gathered; the problem is with transferring that data to the DPA server itself.

One case where the data is not gathered correctly is the data size of backups: it mostly returns negative numbers (although the sizes are correct in the CommVault database), but I understand this is fixed in DPA 5.7. I have not upgraded to 5.7 yet, so I can't verify this.

>> My mistake, I was looking at the wrong log!  Identical error messages!  I will follow this one up....

Nice. Curious what the outcome is going to be.

I had an EMC case for this problem as well, but as soon as I found that the firewall was the problem, the bugs in the error handling of the collector were apparently not deemed important enough to keep the case open and get them fixed.

February 1st, 2011 08:00

I will experiment with reducing some of the frequencies to see if that improves performance.

We have seen some dubious-looking negative figures in reporting recently, so if that bug is fixed in 5.7 it is a good reason to update the software; I believe I saw the bug referenced in the 5.5 release notes, but it was listed neither as a known bug nor as a resolved bug in the 5.6 and 5.7 manuals.

Looking at the netstat -an output, it appears we have some dodgy connections (I have stopped the services but established a telnet session during these tests):

good client

  TCP     172.21.1.34:1979       194.238.48.38:4001      ESTABLISHED

bad client

  TCP     10.95.254.36:2134      194.238.48.38:4001      FIN_WAIT_1

  TCP     10.95.254.36:2393      194.238.48.38:4001      FIN_WAIT_1

  TCP     10.95.254.36:3004      194.238.48.38:4001      FIN_WAIT_1

  TCP     10.95.254.36:3241      194.238.48.38:4001      FIN_WAIT_1

  TCP     10.95.254.36:3855      194.238.48.38:4001      FIN_WAIT_1

  TCP     10.95.254.36:4019      194.238.48.38:4001      ESTABLISHED

dpa server

TCP     194.238.48.38:4001     10.255.31.34:1979      ESTABLISHED

TCP     194.238.48.38:4001     10.255.33.36:2134       ESTABLISHED

TCP     194.238.48.38:4001     10.255.33.36:2393       ESTABLISHED

TCP     194.238.48.38:4001     10.255.33.36:3004       ESTABLISHED

TCP     194.238.48.38:4001     10.255.33.36:3241       ESTABLISHED

TCP     194.238.48.38:4001     10.255.33.36:3855       ESTABLISHED

TCP     194.238.48.38:4001     10.255.33.36:4016       TIME_WAIT

TCP     194.238.48.38:4001     10.255.33.36:4019       ESTABLISHED

From what I understand, connections in FIN_WAIT_1 have sent a FIN but have not received an ACK in response; presumably, as the listener port is still showing as ESTABLISHED, the FIN must be being dropped somewhere in between.
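
(For reference, rather than eyeballing the output, something like the following Python sketch could count the connection states towards the listener. It is only an illustration: it assumes the Windows-style netstat -an layout shown above and that 4001 is the listener port, as in this thread.)

import subprocess
from collections import Counter

LISTENER_PORT = "4001"   # DPA listener port used in this thread

def connection_states():
    # Parse "netstat -an" output; expect lines like:
    #   TCP    10.95.254.36:2134    194.238.48.38:4001    FIN_WAIT_1
    out = subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout
    states = Counter()
    for line in out.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "TCP" and parts[2].endswith(":" + LISTENER_PORT):
            states[parts[3]] += 1
    return states

if __name__ == "__main__":
    for state, count in sorted(connection_states().items()):
        print(f"{state:12s} {count}")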

February 1st, 2011 08:00

And here is the error message from the collector log:

DBG1     3064.5708     20110127:161619 fwk.app.polledevents - polledEventsHandler(): running theevent 'collector_config_checker' in background thread
DBG1     3064.1496     20110127:161619 fwk.app.polledevents - runEvent(): theevent 'collector_config_checker' has completed
DBG1     3064.1980     20110127:161621     fwk.base.netsock - nsSocketWait(): timeout receiving data
DBG1     3064.1980     20110127:161626     fwk.base.netsock - nsSocketWait(): timeout receiving data
DBG1     3064.1980     20110127:161626         clctr.report - aapiSendReport(): successfully sent report to listener ba.ifl.net:4001

February 1st, 2011 10:00

>> We have seen some dubious-looking negative figures in reporting recently, so if that bug is fixed in 5.7 it is a good reason to update the software;

>> I believe I saw the bug referenced in the 5.5 release notes, but it was listed neither as a known bug nor as a resolved bug in the 5.6 and 5.7 manuals.

Indeed, I could not find it in any of the release notes either. But the notes for bug ID DPA-8499 state that it should be fixed in 5.7. Let's see and hope.

>> From what I understand, connections in FIN_WAIT_1 have sent a FIN but have not received an ACK in response;

>> presumably, as the listener port is still showing as ESTABLISHED, the FIN must be being dropped somewhere in between.

Or the firewall is "denying" the request (instead of dropping it) and doesn't send (or is not allowed to send) the ACK response to the FIN.

In normal situations, FIN_WAIT_1 connections don't prevent the stack from initiating new connections. Of course, the data that was sent on the connections in FIN_WAIT_1 probably never reached the DPA server, hence the missing data.

>> DBG1    3064.5708    20110127:161619 fwk.app.polledevents - polledEventsHandler(): running theevent 'collector_config_checker' in background thread

>> DBG1    3064.1496    20110127:161619 fwk.app.polledevents - runEvent(): theevent 'collector_config_checker' has completed

>> DBG1    3064.1980    20110127:161621    fwk.base.netsock - nsSocketWait(): timeout receiving data

>> DBG1    3064.1980    20110127:161626    fwk.base.netsock - nsSocketWait(): timeout receiving data

>> DBG1    3064.1980    20110127:161626        clctr.report - aapiSendReport(): successfully sent report to listener ba.ifl.net:4001

Yep, same problem we had. It shows the problems with the error handling of the collector very clearly: timeout, timeout, successfully sent.

Actually, the first two lines are from a different thread in the collector. I have noticed that the two numbers after the DBG1 part seem to identify the thread. So if you grep on "3064.1980" in this case, you should be able to find the request that is trying to send the data. I would be very surprised if it is not the request you are missing the data from.
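
(If grep is not to hand on the Windows box, a couple of lines of Python would do the same filtering. This is only a sketch: it assumes the DBG1 line layout quoted above, and the log path and thread ID are just examples.)

import sys

def lines_for_thread(path, thread_id):
    # Keep only "DBG1 <number.number> ..." lines whose second field matches
    # the requested thread ID; other (or shorter) lines are skipped.
    with open(path, encoding="utf-8", errors="replace") as log:
        return [line.rstrip("\n") for line in log
                if line.startswith("DBG1") and line.split()[1:2] == [thread_id]]

if __name__ == "__main__":
    # e.g. python thread_trace.py <collector debug log> 3064.1980
    for line in lines_for_thread(sys.argv[1], sys.argv[2]):
        print(line)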

Another (quite important) side effect of this issue is that we only received about 60% of the save jobs in the DPA database. All the others were lost. As the collector assumed they were sent successfully, they were of course never resent.

After we cleared the firewall issue, I spent a large number of hours comparing CommVault jobs with the ones in the DPA database, and we now get them all. Of course, whenever there is a hiccup on the network (firewall or not), some jobs will be lost again, which makes it all the more important that the error handling gets fixed in future collector versions!

Good luck.

February 2nd, 2011 00:00

We can see thread 1980 in the list of firewall ports, so that matches up with the connections in the FIN_WAIT_1 state. We will see what our firewall guy comes back with.

As you say, this is particularly bad handling of an error, and I am now a little concerned that we may be dropping other data as a result. I will keep you updated with EMC's response.

February 2nd, 2011 09:00

EMC have confirmed that DPA-8499 is resolved in 5.7

February 2nd, 2011 09:00

David Hampson wrote:

EMC have confirmed that DPA-8499 is resolved in 5.7

Thank you David, that is good news.

Guess it's time to start upgrading everything...

February 14th, 2011 07:00

With regard to the original problem, we have been told it is a known issue fixed in 5.7 (or by a patch currently being written); however, I have yet to receive details of any escalation it might relate to.

February 16th, 2011 03:00

The patch was received yesterday and installed on two collectors; however, this had no impact on the problem: no volume data appears to be available.
