Solutions Enabler: vWitness daemon becomes unresponsive every few days
Summary: The vWitness daemon becomes unresponsive every few days.
Symptoms
The vWitness daemon is shutting down every few days. From the storvwlsd logs:
<Error> [1089 vwlsListen] pdsThrdCreate #3008 : Error Creating Thread: [Errno: 11 - Resource temporarily unavailable]
<Error> [1089 vwlsListen] : [vwlsListen()] pdsThrdCreate() error. rc 700002011 ([PDS/Thrd] Error while trying to create a thread [Errno: 11 - Resource temporarily unavailable])
<Error> [1089 0005979xxxxx_MGMT-0] pdsSSLRead #3382 : SSL_read() failed: [Errno: 104 - Connection reset by peer]
<Error> [1089 0005979xxxxx_MGMT-0] pdsIpcRecvMsg #3269 : Error during receive, [fd=507], handle=0x7fadfc4e9930, nRc=700004012
<Error> [1089 0005979xxxxx_MGMT-0] : [vwlsConn()] pdsIpcRecvMsg() failed rc 700004012 ([PDS/Sock] recv/recvfrom() failed [Errno: 104 - Connection reset by peer])
[1089 0005979xxxxx_MGMT-0] : [vwlsConn()] internal error pdsIpcRecvMsg, rc 700004012 ([PDS/Sock] recv/recvfrom() failed [Errno: 104 - Connection reset by peer])
[1089 0005979xxxxx_MGMT-0] : [vwlsConn()] connection to vWMD at ::ffff:xxx.xxx.xxx.xxx (::ffff:xxx.xxx.xxx.xxx) terminating
[1089 Shutdown] : [daemonInst_shutdownCB()] storvwlsd shutting down
Similar errors are seen in the storvwmd log on the MGMT containers every hour:
<Error> [851 vWitnessName] pdsSSLRead #3382 : SSL_read() failed: [Errno: 104 - Connection reset by peer]
<Error> [851 vWitnessName] pdsIpcRecvMsg #3269 : Error during receive, [fd=9], handle=0x7f2e6c0008c0, nRc=700004012
<Error> [851 vWitnessName] : [pingVWLS()] Failed to receive PING response from vWitness vWitnessName: [PDS/Sock] recv/recvfrom() failed [Errno: 104 - Connection reset by peer]
<Warn > [851 vWitnessName] : [pingVWLS()] Connection to vWitness vWitnessName is now closed
<Dbg > [851 vWitnessName] : [attemptConnect()] Connecting to vWitness vWitnessName @ xxx.xxx.xxx.xxx:10123
[851 vWitnessName] : [attemptConnect()] Successfully reconnected to vWitness vWitnessName, UID: TEQbDim5a0Rh
/var/messages shows:
localhost kernel: [457290.134557] cgroup: fork rejected by pids controller in /system.slice/emc_storvwlsd.service
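The kernel message is the clearest sign that the pids cgroup limit for the service has been exhausted. As a quick check (a minimal sketch; depending on the vApp release the kernel messages may be in the journal or in a flat log file), search the kernel log for the rejection:
# journalctl -k | grep "fork rejected by pids controller"
A growing number of these messages confirms that storvwlsd is repeatedly failing to create new threads.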
Cause
The customer's environment is configured to drop network connections after they have been open for an hour. The resulting connection resets are what trigger this issue. Similar drops are seen after one hour on an SSH connection to the vApp, and connections to the arrays through Secure Remote Services (SRS) also drop at the one-hour mark.
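One way to confirm the hourly resets from the vApp itself is to watch the vWitness TCP sessions. This is a sketch only: port 10123 is taken from the storvwmd log above.
# ss -tno '( sport = :10123 or dport = :10123 )'
Running this periodically (for example once a minute) shows the established sessions being torn down and re-established at roughly one-hour intervals.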
These hourly drops cause the storvwlsd daemon to accumulate tasks, because each reconnection creates a new task. The task count can be seen gradually increasing in the service status:
# systemctl status emc_storvwlsd
● emc_storvwlsd.service - LSB: EMC Solutions Enabler Witness Lock Service Daemon
Loaded: loaded (/etc/init.d/emc_storvwlsd; bad; vendor preset: disabled)
Active: active (running) since Fri 2022-04-01 12:02:33 -03; 2 days ago
Docs: man:systemd-sysv-generator(8)
Process: 873 ExecStart=/etc/init.d/emc_storvwlsd start (code=exited, status=0/SUCCESS)
Tasks: 251 (limit: 512)
CGroup: /system.slice/emc_storvwlsd.service
└─1076 storvwlsd start -name storvwlsd
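To watch the leak develop, the service's task count can be sampled over time. This is a minimal sketch; the five-minute interval is arbitrary:
# while true; do date; systemctl show emc_storvwlsd -p TasksCurrent -p TasksMax; sleep 300; done
TasksCurrent climbing steadily toward TasksMax (512 here) matches the behavior described below.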
Once the number of tasks reaches the 512 limit, the service crashes. The task count is also reported by the daemon itself:
# stordaemon action storvwlsd -cmd list -stats
storvwlsd Statistics:
# running threads : 321
# thread pools : 2
# active Mutex vars : 101
# active CondVars : 11
# active RW-locks : 1
# open IPC channels : 322
# active sockets ipV4 (total) : 320
# active sockets (secure) : 316
# files open : 3
# Page Faults : 64
Proc Size (KB) : 908500
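As a cross-check, the running-thread count reported by the daemon can be compared with the kernel's view of the process. This is a sketch and assumes a single storvwlsd process matching the CGroup entry shown above:
# ls /proc/$(pgrep -f "storvwlsd start -name storvwlsd")/task | wc -l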
When the 512-task limit is breached, the daemon shuts down because it can no longer acquire the resources required to create a thread for a new connection.
Resolution
The customer must work with their network team to identify what is dropping the connections every hour. In this case, vApps on different subnets were also affected, which suggests that something in the routing path is triggering the disconnects.
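To help the network team locate the device that is resetting the sessions, the RST packets on the vWitness port can be captured on the vApp. This is a sketch only: eth0 and the output file are placeholders, and port 10123 is taken from the log output above.
# tcpdump -i eth0 -w /tmp/vwitness_rst.pcap 'tcp port 10123 and (tcp[tcpflags] & tcp-rst != 0)'
The capture can then be correlated with firewall or session-timeout logs to identify where the one-hour disconnect is enforced.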