Dell Unity: SPs may go into Service Mode due to log bloating (/nbsnas partition becomes 100% full)
Summary: An array may go into Service Mode (Data Unavailable) due to log bloating (Dell Correctable)
Symptoms
For dual SP arrays, one SP of the storage system goes into service mode and the whole system cannot be operated through management interfaces, including CLI, UI, REST API, and SMI-S. This may also manifest as SPs rebooting alternately until both SPs end up in Service Mode.
A Unity array with both SPs in service mode will not service I/O, so this would be a Data Unavailable (DU) situation.
For VSA, the single SP may reboot into service mode or just stay in normal mode, losing management in either case.
The whole system cannot be operated through management interfaces, including CLI, UI, REST API, and SMI-S.
SSH or IPMI access should still work. IPMI is always available; SSH may only work after the array has stabilized.
This problem is found on OE version 4.0.0.x and is fixed in OE version 4.0.1.x.
Cause
The log file /nbsnas/http/logs/mod_jk.log, which records every request from the UI and REST API, resides in a file system mounted on /nbsnas of the primary SP. Because there is no log rotation mechanism, this file keeps growing and consuming the available space of the file system. Once no space is left, other internal consumers of the file system start to fail. One of the SPs goes into service mode when it detects repeated failures of those components.
It was observed in the lab that when this happens and services try to fail over to the secondary SP, that SP experiences the same symptoms. The SPs reboot alternately a few times, and eventually both go into service mode.
Customers are likely to see this problem if they always use the UI or REST API to configure the storage system, or if they open the UI in a browser and leave it open. With UI access only, it normally takes a few months for the problem to appear. If customers frequently use the REST API to query data from the storage system, the issue occurs sooner.
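The relationship between request volume and time to failure can be illustrated with simple arithmetic. The sketch below is only an illustration: the function name and the growth rates are hypothetical and not measured values, and the 908 MiB of free space is taken from the healthy df sample in the Additional Information section.

```python
def days_until_full(avail_mib: float, growth_mib_per_day: float) -> float:
    """Estimate how many days remain before a partition runs out of space.

    growth_mib_per_day is a hypothetical, caller-supplied rate; the article
    does not state how fast mod_jk.log actually grows.
    """
    if growth_mib_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return avail_mib / growth_mib_per_day


# Hypothetical rates: light UI-only use versus frequent REST API polling.
for label, rate in (("UI only", 10.0), ("frequent REST queries", 100.0)):
    print(label, round(days_until_full(908, rate), 1), "days")
```

A tenfold increase in the logging rate shortens the time to a full partition by the same factor, which is consistent with REST-heavy environments hitting the problem sooner.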
A second issue was found in which upgrading to Unity OE 4.0.1.8320161 may exacerbate the issue as it may duplicate the log file in question during the NDU, therefore accelerating the process.
You can confirm whether this occurred by checking the space consumption on /nbsnas. If space consumption is minimal or low, you did NOT experience this issue during the NDU and nothing else is required.
4.0.1.x code already contains the fix for the main problem, so the log rotation itself works correctly.
If the partition shows a very high used percentage, the responsible log files may have to be deleted (requires Dell support).
Examples of how to check space usage and which logs to delete can be found in the Additional Information section.
Dell has decided to remove Unity OE 4.0.1.8320161 for Unity and UnityVSA from support.emc.com. A revised Unity OE release (4.0.1.8404134) was published in September 2016.
Resolution
To resolve this issue, it is necessary for Technical Support to gain root access to the array.
Contact Unity Technical Support and mention this KB article: 489057
Additional Information
Example of how to check space usage:
spX:~> df -h /nbsnas
Filesystem      Size  Used Avail Use% Mounted on
/dev/c4nasdba1 1013M   55M  908M   6% /nbsnas
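When df output has been captured and pasted into a case note, a small offline helper can extract the usage figure for /nbsnas. The following is a minimal sketch, not a Dell tool: the function names and the 90% threshold are illustrative assumptions, and the script is meant to run on a workstation against saved output, not on the array.

```python
def parse_df_row(output: str, mount_point: str = "/nbsnas") -> dict:
    """Return the df fields for the row whose mount point matches."""
    for line in output.splitlines():
        fields = line.split()
        # A data row has six columns:
        # device, size, used, avail, use%, mount point.
        if len(fields) == 6 and fields[5] == mount_point:
            return {
                "device": fields[0],
                "size": fields[1],
                "used": fields[2],
                "avail": fields[3],
                "use_percent": int(fields[4].rstrip("%")),
            }
    raise ValueError(f"no df row found for {mount_point}")


def is_nearly_full(use_percent: int, threshold: int = 90) -> bool:
    # The 90% default is an assumption for illustration,
    # not a documented Dell limit.
    return use_percent >= threshold


# Sample taken from the healthy output shown above.
SAMPLE_DF = """Filesystem Size Used Avail Use% Mounted on
/dev/c4nasdba1 1013M 55M 908M 6% /nbsnas"""

row = parse_df_row(SAMPLE_DF)
print(row["use_percent"], is_nearly_full(row["use_percent"]))  # prints: 6 False
```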
The log or logs causing this can be found in /nbsnas/http/logs:
spx:~> cd /nbsnas/http/logs
spx:/nbsnas/http/logs> ll -h
total 975M
-rw-r--r-- 1 root root  12K Sep 8 13:32 access_log
-rw-r--r-- 1 root root 165K Sep 8 08:45 access_log.1.gz
-rw-r--r-- 1 root root 239K Sep 8 06:59 access_log.2.gz
-rw-r--r-- 1 root root 1.6M Sep 8 13:32 error_log
-rw-r--r-- 1 root root 167K Sep 3 04:56 error_log.1.gz
-rw-r--r-- 1 root root 495M Sep 8 13:32 mod_jk.log <<<<<<<<<<
-rw-r--r-- 1 root root 475M Sep 8 08:45 mod_jk.log.1 <<<<<<<<<<
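To see at a glance which files dominate the directory, a saved listing can be ranked by size offline. This is a rough sketch under stated assumptions: the helper names are hypothetical, sizes are taken as the rounded human-readable values that ll -h prints, and the listing is the example from this article.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def size_to_bytes(text: str) -> int:
    """Convert a human-readable size such as '495M' or '1.6M' to bytes."""
    if text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(text)


def parse_listing(listing: str) -> list:
    """Return (name, bytes) pairs for regular files, largest first."""
    files = []
    for line in listing.splitlines():
        fields = line.split()
        # Columns: perms, links, owner, group, size, month, day, time, name.
        if len(fields) >= 9 and fields[0].startswith("-"):
            files.append((fields[8], size_to_bytes(fields[4])))
    return sorted(files, key=lambda item: item[1], reverse=True)


def mod_jk_share(files: list) -> float:
    """Fraction of the listed bytes held by mod_jk.log and its copies."""
    total = sum(size for _, size in files)
    mod_jk = sum(size for name, size in files if name.startswith("mod_jk.log"))
    return mod_jk / total


SAMPLE_LISTING = """total 975M
-rw-r--r-- 1 root root 12K Sep 8 13:32 access_log
-rw-r--r-- 1 root root 165K Sep 8 08:45 access_log.1.gz
-rw-r--r-- 1 root root 239K Sep 8 06:59 access_log.2.gz
-rw-r--r-- 1 root root 1.6M Sep 8 13:32 error_log
-rw-r--r-- 1 root root 167K Sep 3 04:56 error_log.1.gz
-rw-r--r-- 1 root root 495M Sep 8 13:32 mod_jk.log <<<<<<<<<<
-rw-r--r-- 1 root root 475M Sep 8 08:45 mod_jk.log.1 <<<<<<<<<<"""

files = parse_listing(SAMPLE_LISTING)
print([name for name, _ in files[:2]])  # prints: ['mod_jk.log', 'mod_jk.log.1']
print(mod_jk_share(files) > 0.99)       # prints: True
```

In this example the two mod_jk files account for nearly all of the space in the directory, which matches the lines flagged with arrows in the listing.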
svc_dc -lcd (list core dumps) may also show a few dumps with the "_mgmtd" suffix.
These are created when the SPs panic because some services cannot start (due to /nbsnas being full).
spx:/> svc_dc -lcd
========================
[DC copier]: Available on backend:
CP_dump_spb_CKM00161701xxx_2016-09-08_13_29_47_17275_ECOM
core-dump_dump_spb_CKM00161701xxx_2016-09-08_08_46_23_778_mgmtd
core-dump_dump_spb_CKM00161701xxx_2016-09-08_09_18_19_11994_mgmtd
core-dump_dump_spb_CKM00161701xxx_2016-09-08_09_18_53_21524_mgmtd
core-dump_dump_spb_CKM00161701xxx_2016-09-08_09_41_05_11446_mgmtd
core-dump_dump_spb_CKM00161701xxx_2016-09-08_09_41_45_24620_mgmtd
core-dump_dump_spb_CKM00161701xxx_2016-09-08_13_28_30_3067_mgmtd
core-dump_dump_spb_CKM00161701xxx_2016-09-08_13_29_08_15086_mgmtd
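A saved copy of this output can be summarized by the process name at the end of each dump file name. The sketch below is an assumption-laden illustration: it infers the naming pattern (a trailing underscore followed by the process name) only from the example above, and the function name is hypothetical.

```python
from collections import Counter


def count_dumps_by_suffix(output: str) -> Counter:
    """Count dump entries grouped by the text after the final underscore."""
    counts = Counter()
    for token in output.split():
        # Dump names in the example all contain '_dump_'; banner and
        # header words do not, so they are skipped.
        if "_dump_" in token:
            counts[token.rsplit("_", 1)[-1]] += 1
    return counts


SAMPLE_DUMPS = """========================
[DC copier]: Available on backend:
CP_dump_spb_CKM00161701xxx_2016-09-08_13_29_47_17275_ECOM
core-dump_dump_spb_CKM00161701xxx_2016-09-08_08_46_23_778_mgmtd
core-dump_dump_spb_CKM00161701xxx_2016-09-08_09_18_19_11994_mgmtd
core-dump_dump_spb_CKM00161701xxx_2016-09-08_09_18_53_21524_mgmtd
core-dump_dump_spb_CKM00161701xxx_2016-09-08_09_41_05_11446_mgmtd
core-dump_dump_spb_CKM00161701xxx_2016-09-08_09_41_45_24620_mgmtd
core-dump_dump_spb_CKM00161701xxx_2016-09-08_13_28_30_3067_mgmtd
core-dump_dump_spb_CKM00161701xxx_2016-09-08_13_29_08_15086_mgmtd"""

counts = count_dumps_by_suffix(SAMPLE_DUMPS)
print(counts["mgmtd"], counts["ECOM"])  # prints: 7 1
```

Several "_mgmtd" dumps clustered within a few hours, as in this example, fit the pattern of SPs rebooting alternately described in the Cause section.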