PowerFlex: Gateway High Availability setup causes 401 errors on REST clients
Summary: REST API client receives "401: Unauthorized" error when both Gateway/Apache services are running.
Symptoms
REST API client receives "401: Unauthorized" error when both Gateway/Apache services are running.
One example of a REST API client is OpenStack Cinder. This issue may cause certain ScaleIO volume operations (map, unmap, and so forth) in OpenStack to fail.
About 1 in every 11 REST API requests fails (10 succeed for each one that fails). For example, the Primary Apache service's mod_jk.log shows:
tail -f /var/log/apache2/mod_jk.log | grep ") status"
[6496:139877463439104] [debug] ajp_unmarshal_response::jk_ajp_common.c (739): (machine2) status = 200
[6497:139877295294208] [debug] ajp_unmarshal_response::jk_ajp_common.c (739): (machine2) status = 200
[6497:139877270116096] [debug] ajp_unmarshal_response::jk_ajp_common.c (739): (machine2) status = 200
[6496:139877429868288] [debug] ajp_unmarshal_response::jk_ajp_common.c (739): (machine1) status = 401 <---
[6497:139877219759872] [debug] ajp_unmarshal_response::jk_ajp_common.c (739): (machine2) status = 200
[6496:139877303686912] [debug] ajp_unmarshal_response::jk_ajp_common.c (739): (machine2) status = 200
[6497:139877228152576] [debug] ajp_unmarshal_response::jk_ajp_common.c (739): (machine2) status = 200
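To quantify the pattern in a longer log, the status lines can be tallied per worker. The following is a minimal sketch, assuming the debug line layout shown above; the function name tally_statuses and the regular expression are illustrative and not part of any PowerFlex or mod_jk tooling.

```python
import re
from collections import Counter

# Matches the "(worker) status = NNN" tail of an ajp_unmarshal_response
# debug line, as seen in the mod_jk.log excerpt above (assumed layout).
STATUS_RE = re.compile(r"\((\w+)\) status = (\d{3})")


def tally_statuses(lines):
    """Count (worker, status) pairs found in mod_jk debug log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


sample = [
    "[6496:139877463439104] [debug] ajp_unmarshal_response::jk_ajp_common.c (739): (machine2) status = 200",
    "[6497:139877295294208] [debug] ajp_unmarshal_response::jk_ajp_common.c (739): (machine2) status = 200",
    "[6496:139877429868288] [debug] ajp_unmarshal_response::jk_ajp_common.c (739): (machine1) status = 401 <---",
]

for (worker, status), count in sorted(tally_statuses(sample).items()):
    print(worker, status, count)
```

A tally in which one worker returns only 401 and the other only 200 is consistent with a token that was issued by a single Gateway.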
/var/log/nova/nova-compute.log shows:
2017-04-05 11:20:36.090 38186 ERROR nova.compute.manager [instance: 20e1036d-daf0-49b9-a228-07a1c48b882d] File "/usr/lib/python2.7/site-packages/os_brick/initiator/connector.py", line 1980, in connect_volume
2017-04-05 11:20:36.090 38186 ERROR nova.compute.manager [instance: 20e1036d-daf0-49b9-a228-07a1c48b882d] self.volume_id = self._get_volume_id()
2017-04-05 11:20:36.090 38186 ERROR nova.compute.manager [instance: 20e1036d-daf0-49b9-a228-07a1c48b882d] File "/usr/lib/python2.7/site-packages/os_brick/initiator/connector.py", line 1879, in _get_volume_id
2017-04-05 11:20:36.090 38186 ERROR nova.compute.manager [instance: 20e1036d-daf0-49b9-a228-07a1c48b882d] raise exception.BrickException(message=msg)
2017-04-05 11:20:36.090 38186 ERROR nova.compute.manager [instance: 20e1036d-daf0-49b9-a228-07a1c48b882d] BrickException: Error getting volume id from name oGXMByctQWesXL8PPKiyBQ==: Unauthorized
2017-04-05 11:20:36.090 38186 ERROR nova.compute.manager [instance: 20e1036d-daf0-49b9-a228-07a1c48b882d]
Cause
This issue is caused by an error in the documentation. The workers.properties configuration in the document defines a load-balancing setup between two Gateway (Tomcat) instances, with lbfactor set to 10 for one and 1 for the other. As a result, the Apache service directs incoming requests to the two Gateways at a 10:1 ratio. The REST API client acquires a token through one Gateway, and tokens are not shared between Gateways, so a request that carries this token but is sent to the other Gateway fails with 401.
Note: If a client acquires a token from the Gateway with lbfactor 1, about 10 of every 11 requests go to the other Gateway, so the failure rate is about 91%.
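The two failure rates follow directly from the lbfactor ratio. The following is a minimal sketch of the arithmetic, assuming mod_jk distributes requests exactly in proportion to lbfactor; the worker names gateway_a and gateway_b are hypothetical placeholders.

```python
from fractions import Fraction


def failure_rate(lbfactors, token_worker):
    """Fraction of requests routed to a worker other than the token issuer.

    Assumes an idealized split that is exactly proportional to lbfactor.
    """
    total = sum(lbfactors.values())
    return 1 - Fraction(lbfactors[token_worker], total)


# Hypothetical worker names; the 10 and 1 weights come from the document.
lbfactors = {"gateway_a": 10, "gateway_b": 1}

for worker in lbfactors:
    rate = failure_rate(lbfactors, worker)
    print(f"token from {worker}: {float(rate):.1%} of requests fail")
# token from gateway_a: 9.1% of requests fail
# token from gateway_b: 90.9% of requests fail
```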
Resolution
Workaround
Use the following workers.properties file instead of the file in the document. This sets up the two Gateways in active-standby mode:
*** /etc/apache2/workers.properties ***
worker.list=balance1
worker.machine1.type=ajp13
worker.machine1.host=<ip of GW 1>
worker.machine1.port=8009
worker.machine1.lbfactor=1
worker.machine1.activation=disabled
worker.machine2.type=ajp13
worker.machine2.host=<ip of GW 2>
worker.machine2.port=8009
worker.machine2.lbfactor=1
worker.machine2.redirect=machine1
worker.balance1.type=lb
worker.balance1.balance_workers=machine1,machine2
This configuration sets up machine2 as the primary and machine1 as the standby. The key differences between this configuration and the one in the document are:
- worker.machine1.activation=disabled
  worker.machine2.redirect=machine1
- worker.machine#.lbfactor=1
  lbfactor weighting is not required for either worker, so both are set to 1.
With this configuration:
- When both Gateways are up, all requests are directed to machine2, and no 401 errors should occur.
- When machine2 goes down, requests are directed to machine1. The REST client receives a 401, and can log in again to the REST API service and continue.
- When machine2 comes back and the mod_jk module detects it, requests are directed to machine2 again. The REST client receives another 401, but can log in again to the REST API service and continue.
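The log-in-again behavior on a failover or failback can be sketched on the client side as follows. This is an illustration only, not the logic of any specific client: login and send are hypothetical callables supplied by the caller, Unauthorized is a hypothetical exception that send raises on an HTTP 401, and max_logins is an assumed safety limit.

```python
class Unauthorized(Exception):
    """Raised by the hypothetical send() callable on an HTTP 401 response."""


def call_with_relogin(login, send, token=None, max_logins=2):
    """Run send(token); on Unauthorized, obtain a fresh token and retry.

    Returns (result, token) so that the caller can reuse the latest token.
    Gives up and re-raises once max_logins login attempts have been used.
    """
    logins = 0
    if token is None:
        token = login()
        logins += 1
    while True:
        try:
            return send(token), token
        except Unauthorized:
            if logins >= max_logins:
                raise
            # The Gateway behind Apache changed; its tokens are not shared,
            # so request a new token from whichever Gateway is now active.
            token = login()
            logins += 1
```

With this pattern, a switch between Gateways costs the client one extra login instead of a failed volume operation.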
Note: Both Apache services must have the same configuration in their workers.properties files. The Apache services are also set up as an active-standby cluster by keepalived, and the mod_jk module in each Apache service is responsible for directing REST API requests to the Gateway services, based on the above configuration.
This is a documentation error. This KB can be used until the document is corrected.
Additional Information
The keepalived configuration can also be improved, as it may not monitor the apache2/httpd service correctly.
The keepalived.conf uses "killall -0 apache2" for the "script". This returns 0 (success) if there is any process with "apache2" in the name, such as "tail -f /var/log/apache2/mod_jk.log".
To correctly monitor the apache2 service, use "systemctl --no-pager status apache2" (Ubuntu), or "systemctl status httpd" (CentOS/RedHat).
The command used as "script" must return 0 if the apache2/httpd service is running, and nonzero if it has stopped.
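The exit-code contract described above can be sketched as a small check helper. This is an assumption-laden illustration rather than a supported script: check_service and its run parameter are hypothetical names, and the command is the "systemctl --no-pager status" form given above.

```python
import subprocess


def check_service(unit, run=subprocess.run):
    """Return 0 if systemctl reports the unit as running, otherwise 1.

    The run parameter exists only so the helper can be exercised without
    systemd; by default it is subprocess.run.
    """
    result = run(
        ["systemctl", "--no-pager", "status", unit],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # systemctl status exits with 0 for a running unit and nonzero otherwise.
    return 0 if result.returncode == 0 else 1


# In a script used by keepalived, the result would become the exit code:
#     import sys
#     sys.exit(check_service("apache2"))   # "httpd" on CentOS/RedHat
```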