RPC time outs during RMAN/Oracle backup
During RMAN/Oracle backups we receive RPC timeouts. 99% of the time, backups started by the client (client-initiated backups) run fine. The other 1% of the time, Oracle RMAN reports that the backup has failed due to an RPC timeout.
channel ORA_SBT_TAPE_2: starting piece 1 at 12-JUL-07
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on ORA_SBT_TAPE_1 channel at 07/12/2007 01:35:03
ORA-27191: sbtinfo2 returned error
Additional information: 2
ORA-19511: Error received from media manager layer, error text:
A connection to NW server 'ICTDSTA34.melkweg.tld' could not be established beca
use 'RPC receive operation failed. A network connection could not be establish
ed with the host.'. (2:10:0)
I have recorded when these timeouts occur, and on which database and server, but there doesn't seem to be a pattern to the occurrences.
date    server       time         channel         database
14-apr  oberon2.sfb   9:36:37 PM  ORA_SBT_TAPE_1  PRDUWV05
3-may   oberon1.sfb  10:24:32 PM  ORA_SBT_TAPE_1  PRDCDS01
3-jun   oberon2.sfb  11:18:29 PM  ORA_SBT_TAPE_2  PRDUWV05
10-jul  oberon1.sfb   9:19:06 PM  ORA_SBT_TAPE_1  PRDISS01
12-jul  oberon1.sfb   1:35:03 AM  ORA_SBT_TAPE_1  PRDCDS01
12-jul  oberon2.sfb   1:27:48 AM  ORA_SBT_TAPE_2  PRDUWV05
12-jul  oberon1.sfb   1:33:56 AM  ORA_SBT_TAPE_1  archivelogfiles
12-jul  leda1.sfb     1:34:56 AM  ORA_SBT_TAPE_4  TSTISS01
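For anyone who wants to check the table for clustering, here is a minimal Python sketch that tallies the failures per hour of day and per date. The rows are copied from the table above. The year 2007 is an assumption taken from the RMAN output, and the names (`FAILURES`, `parse`, `summarize`) are only illustrative.

```python
from collections import Counter
from datetime import datetime

# Month abbreviations that appear in the table (parsed by hand to avoid locale issues).
MONTHS = {"apr": 4, "may": 5, "jun": 6, "jul": 7}

# Rows copied from the table: date, server, time, channel, database.
FAILURES = [
    ("14-apr", "oberon2.sfb", "9:36:37 PM", "ORA_SBT_TAPE_1", "PRDUWV05"),
    ("3-may", "oberon1.sfb", "10:24:32 PM", "ORA_SBT_TAPE_1", "PRDCDS01"),
    ("3-jun", "oberon2.sfb", "11:18:29 PM", "ORA_SBT_TAPE_2", "PRDUWV05"),
    ("10-jul", "oberon1.sfb", "9:19:06 PM", "ORA_SBT_TAPE_1", "PRDISS01"),
    ("12-jul", "oberon1.sfb", "1:35:03 AM", "ORA_SBT_TAPE_1", "PRDCDS01"),
    ("12-jul", "oberon2.sfb", "1:27:48 AM", "ORA_SBT_TAPE_2", "PRDUWV05"),
    ("12-jul", "oberon1.sfb", "1:33:56 AM", "ORA_SBT_TAPE_1", "archivelogfiles"),
    ("12-jul", "leda1.sfb", "1:34:56 AM", "ORA_SBT_TAPE_4", "TSTISS01"),
]


def parse(date_str, time_str, year=2007):
    """Turn '12-jul' and '1:35:03 AM' into a datetime (year is an assumption)."""
    day, mon = date_str.split("-")
    clock, ampm = time_str.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    hour = hour % 12 + (12 if ampm.upper() == "PM" else 0)
    return datetime(year, MONTHS[mon.lower()], int(day), hour, minute, second)


def summarize(rows):
    """Return (failures per hour of day, failures per calendar date, all timestamps)."""
    times = [parse(row[0], row[2]) for row in rows]
    by_hour = Counter(t.hour for t in times)
    by_day = Counter(t.date() for t in times)
    return by_hour, by_day, times


by_hour, by_day, times = summarize(FAILURES)
for hour, count in sorted(by_hour.items()):
    print(f"{hour:02d}:00  {count} failure(s)")
```

With these rows, every failure falls between 21:00 and 02:00, and the four failures on 12-jul lie within a few minutes of each other on three different clients.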
For these occurrences I looked in the daemon.log on the backup server and found the following:
12-jul oberon2.sfb 1:27:48 AM ORA_SBT_TAPE_2 PRDUWV05
07/12/07 01:27:47 nsrindexd: select: Socket operation on non-socket
07/12/07 01:27:48 nsrd: oberon2:RMAN:/OBERON2_INCR_PRDUWV05_8fimjnb0 done saving to pool 'RWKCDSunix' (CDLR003W) 777 MB
12-jul oberon1.sfb 1:33:56 AM ORA_SBT_TAPE_1 archivelogfiles
07/12/07 01:33:45 nsrindexd: select: Socket operation on non-socket
07/12/07 01:33:45 nsrd: oberon1:RMAN:/OBERON1_ARCH_PRDCDS01_5limjtrh saving to pool 'RWKCDSunix' (CDLR004N)
07/12/07 01:33:56 nsrd: oberon1:RMAN:/OBERON1_ARCH_PRDUWV05_8iimjtig done saving to pool 'RWKCDSunix' (CDLR004N) 2815 MB
12-jul leda1.sfb 1:34:56 AM ORA_SBT_TAPE_4 TSTISS01
07/12/07 01:34:55 nsrindexd: select: Socket operation on non-socket
07/12/07 01:35:00 nsrd: leda1:RMAN:4nimjtte_1_1 done saving to pool 'RWKCDSunix' (CDLR1027) 118 MB
12-jul oberon1.sfb 1:35:03 AM ORA_SBT_TAPE_1 PRDCDS01
07/12/07 01:34:55 nsrindexd: select: Socket operation on non-socket
07/12/07 01:35:00 nsrd: leda1:RMAN:4nimjtte_1_1 done saving to pool 'RWKCDSunix' (CDLR1027) 118 MB
07/12/07 01:35:03 nsrd: leda1:RMAN:4oimjtte_1_1 done saving to pool 'RWKCDSunix' (CDLR1027) 153 MB
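The matching above was done by eye. A rough Python sketch of how the same correlation could be scripted is below: it splits daemon.log text into timestamped entries (even when two entries are run together on one line) and keeps those containing a given message within a time window of an RMAN failure. The function names and the 60-second window are assumptions made for illustration, not anything NetWorker provides.

```python
import re
from datetime import datetime, timedelta

# daemon.log entries start with a stamp such as "07/12/07 01:27:47".
STAMP = re.compile(r"\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}")


def split_entries(text):
    """Yield (timestamp, message) pairs, even if entries share a physical line."""
    matches = list(STAMP.finditer(text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        when = datetime.strptime(match.group(), "%m/%d/%y %H:%M:%S")
        yield when, text[match.end():end].strip()


def entries_near(text, failure_time, needle, window=timedelta(seconds=60)):
    """Return entries containing `needle` logged within `window` of `failure_time`."""
    return [
        (when, message)
        for when, message in split_entries(text)
        if needle in message and abs(when - failure_time) <= window
    ]


# Example using one of the log lines quoted above.
sample = (
    "07/12/07 01:27:47 nsrindexd: select: Socket operation on non-socket"
    "07/12/07 01:27:48 nsrd: oberon2:RMAN:/OBERON2_INCR_PRDUWV05_8fimjnb0 "
    "done saving to pool 'RWKCDSunix' (CDLR003W) 777 MB"
)
hits = entries_near(
    sample,
    failure_time=datetime(2007, 7, 12, 1, 27, 48),
    needle="Socket operation on non-socket",
)
for when, message in hits:
    print(when, message)
```

Running the same search without a failure time filter over a whole night would show whether the nsrindexd message also appears when no RMAN backup fails, which would weaken the link between the two.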
I haven't been able to find out what "Socket operation on non-socket" means. The daemon.log on the HP-UX clients doesn't seem to show any faults either:
leda1 12 july
07/12/07 00:32:55 nsrmmd #376: Start nsrmmd #376, with PID 22176, at HOST leda1
07/12/07 01:32:46 nsrmmd #385: Start nsrmmd #385, with PID 165, at HOST leda1
Oberon1 10 july
07/10/07 01:56:58 nsrmmd #194: Start nsrmmd #194, with PID 7917, at HOST oberon1
07/10/07 06:00:26 nsrmmd #211: Start nsrmmd #211, with PID 9928, at HOST oberon1
07/10/07 11:54:00 nsrmmd #184: Start nsrmmd #184, with PID 28424, at HOST oberon1
07/10/07 14:06:17 nsrmmd #184: Start nsrmmd #184, with PID 7624, at HOST oberon1
07/10/07 20:01:22 nsrmmd #214: Start nsrmmd #214, with PID 5229, at HOST oberon1
07/10/07 20:01:56 nsrmmd #215: Start nsrmmd #215, with PID 6005, at HOST oberon1
oberon1 3 may
05/03/07 01:18:21 nsrmmd #1225: Start nsrmmd #1225, with PID 5219, at HOST oberon1
05/03/07 06:00:28 nsrmmd #1255: Start nsrmmd #1255, with PID 21608, at HOST oberon1
05/03/07 20:00:56 nsrmmd #1268: Start nsrmmd #1268, with PID 21264, at HOST oberon1
05/03/07 20:01:24 nsrmmd #1274: Start nsrmmd #1274, with PID 21398, at HOST oberon1
05/03/07 20:05:42 nsrmmd #1276: Start nsrmmd #1276, with PID 21910, at HOST oberon1
oberon1 12 july
07/12/07 06:00:25 nsrmmd #410: Start nsrmmd #410, with PID 2425, at HOST oberon1
oberon2 14 april
04/17/07 06:00:22 nsrmmd #4115: Start nsrmmd #4115, with PID 13358, at HOST oberon2
04/17/07 20:00:51 nsrmmd #4148: Start nsrmmd #4148, with PID 23924, at HOST oberon2
oberon2 3 june
06/03/07 00:02:43 nsrmmd #1563: Start nsrmmd #1563, with PID 1235, at HOST oberon2
06/03/07 00:31:41 nsrmmd #1563: Start nsrmmd #1563, with PID 4699, at HOST oberon2
06/03/07 06:00:23 nsrmmd #1605: Start nsrmmd #1605, with PID 20577, at HOST oberon2
06/03/07 20:07:19 nsrmmd #1624: Start nsrmmd #1624, with PID 25703, at HOST oberon2
06/03/07 20:47:59 nsrmmd #1625: Start nsrmmd #1625, with PID 12389, at HOST oberon2
oberon2 12 july
07/12/07 01:35:59 nsrmmd #386: Start nsrmmd #386, with PID 21905, at HOST oberon2
07/12/07 06:00:26 nsrmmd #411: Start nsrmmd #411, with PID 27508, at HOST oberon2
I have registered a service request with EMC Support and gave them a truckload of information, but I don't seem to be getting anywhere. The only explanation I can think of is that, because all these timeouts occur during the night, the backup server is too busy and RMAN times out as a result. When the database admins restart the backup during the following day, there is never a problem. Also, when a timeout occurs the connection is reset. After that, the other RMAN backups in the queue finish without any problems.
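One way to gather evidence for or against the "server too busy at night" idea would be to log, from one of the clients, how long a plain TCP connection to the backup server takes throughout the night. The Python sketch below is only a starting point under stated assumptions: the port number in the usage comment (7938) is an assumption that must be checked against the local NetWorker setup, and a successful TCP connect does not prove that the RPC service itself answered in time.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return None


def log_probe(host, port, timeout=5.0):
    """Run one probe and return a timestamped line suitable for a log file."""
    elapsed = probe(host, port, timeout)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    if elapsed is None:
        return f"{stamp} {host}:{port} FAILED"
    return f"{stamp} {host}:{port} connected in {elapsed:.3f}s"


# Hypothetical usage, e.g. once a minute from cron on a client
# (server name from the RMAN error; port 7938 is an assumption):
#   print(log_probe("ICTDSTA34.melkweg.tld", 7938))
```

If the failed or slow probes line up with the RMAN failure times in the table, that would support the overload theory; if the probes stay fast while RMAN fails, the cause is more likely elsewhere.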
DennisP2
21 Posts
1
July 12th, 2007 07:00
Thanks,
Dennis
Martijn4
26 Posts
0
July 18th, 2007 04:00
1. Can we use this environment variable NSR_RPC_TIMEOUT?
2. To what value do we set this variable?
ble1
14.3K Posts
1
July 18th, 2007 05:00
I use that with SAP when doing PS backups for one db, but I had to get a patched backint to be able to use it. I do not think the problem that made us use it is related to what you have.
As for the value: whichever one you need (it is given in minutes).