CloudLink encrypted PowerFlex SDS device errors after reboots due to SDS service starting before CloudLink agent unlocks drives
Summary: When PowerFlex SDS devices are encrypted by CloudLink, the device mapper names can change after a reboot. This reorders the SDS devices, and they appear as failed in the PowerFlex UI.
Symptoms
Affected products: This issue affects the following combination:
- PowerFlex 3.6
- CloudLink 7.1
- RHEL 8.x
- SDS devices encrypted by CloudLink
After a reboot, the CloudLink encrypted SDS devices can appear as failed in the PowerFlex UI due to device mapper reordering.
The boot device logical mapping can swap between the first and last device letter.
The SDS errors can also occur after reboots if the SDS service starts before CloudLink has unlocked the drives. This will be fixed in CloudLink release 7.0.2.
Cause
- The CloudLink encrypted device mapper name uses the drive letter as an identifier (for example, /dev/mapper/svm_sdb).
- When the drive letter changes, the mapper name changes with it.
- During a reboot, the drive letters can change depending on whether the boot drive or the SDS drives are detected first (a quick check is sketched below).
- The errors can also occur if the SDS service starts before the CloudLink agent has unlocked the drives.
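A minimal way to see whether the mapper names on a node have shifted is to compare the current CloudLink mapper entries against the underlying block devices and their persistent identifiers. This is only a sketch using standard Linux tools; the svm_ prefix match assumes the naming shown above.
# List the CloudLink device mapper entries currently present on the node.
ls -l /dev/mapper/ | grep svm_
# Show the block devices behind them to see which disk each name points at.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
# Persistent identifiers for the device mapper devices, independent of drive letters.
ls -l /dev/disk/by-id/ | grep dm-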
Resolution
Workaround for the drive letter change:
Option 1:
- Stop the SDS service on the PowerFlex node (/opt/emc/scaleio/sds/bin/delete_service.sh).
- Clear all the SDS alerts (found under Presentation Server > Devices). This should trigger PowerFlex to rescan the system for the new device names and start using them.
- Restart the SDS service afterward (/opt/emc/scaleio/sds/bin/create_service.sh). The command-line part of these steps is sketched below.
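A minimal sketch of the command-line part of Option 1, using the service scripts named above; clearing the alerts still has to be done in the PowerFlex UI.
# Remove the SDS service on the affected PowerFlex node.
/opt/emc/scaleio/sds/bin/delete_service.sh
# In the PowerFlex UI, clear the SDS alerts (Presentation Server > Devices)
# so that PowerFlex rescans for the new device names.
# Recreate and start the SDS service.
/opt/emc/scaleio/sds/bin/create_service.sh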
Option 2: Remove the SDS drives from PowerFlex and re-add them (found under Presentation Server > Devices).
Option 3: Reboot the PowerFlex node; the disk order may change back to the original mapping.
If subsequent reboots keep using the new drive letters, the SDS errors continue to occur on each reboot. To update the device paths that PowerFlex stores so that the SDS errors no longer occur, run the following scli commands:
Find the sds_id of the SDS that reported the errors:
scli --query_all_sds
Update the SDS with the new device paths:
scli --update_sds_original_paths --sds_id <id>
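Put together, the sequence looks like the following sketch. It assumes an scli session that is already authenticated to the MDM, and the SDS ID shown is only a placeholder taken from the query output.
# List all SDSs and note the ID of the SDS that reported the device errors.
scli --query_all_sds
# Rewrite the stored original device paths for that SDS so they match the
# current device names (placeholder ID; use the one from the query above).
scli --update_sds_original_paths --sds_id 6f2c110500000000
# Re-check the SDS afterward; its devices should no longer show as failed.
scli --query_sds --sds_id 6f2c110500000000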
Delay PowerFlex start until CloudLink unlocks the drives:
Upgrading the CloudLink agent to 7.0.x, 7.1, 7.1.1, or 7.1.2 removes a 60-second sleep timer from /opt/emc/extra/pre_run.sh on SDS nodes. This causes SDS errors after a reboot because PowerFlex starts using the drives before CloudLink has unlocked them. To prevent SDS errors on reboots, add the 60-second delay back into /opt/emc/extra/pre_run.sh on the SDS nodes after a CloudLink agent upgrade. This delays the start of the SDS service until the encrypted drives have been unlocked by CloudLink.
An example showing the 60-second delay added to pre_run.sh:
#!/bin/bash -f
if [ -f /sbin/svm ]; then
  echo svm is installed $(date) >> /var/log/svm-sds
  # Start the CloudLink agent daemon.
  /sbin/svmd -l /var/log/svmd.log -p /var/run/svmd.pid &
  # Poll for up to 300 seconds until CloudLink reports the devices as unlocked.
  end=$((SECONDS+300))
  while [ $SECONDS -lt $end ]; do
    /sbin/svm unlocked > /dev/null && break
    sleep 5
  done
fi
# 60-second delay so that the SDS service does not start before the
# encrypted drives are available.
sleep 60
echo pre_run returned...$(date) >> /var/log/svm-sds
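After a CloudLink agent upgrade, a quick check confirms whether the delay is still present on an SDS node; this assumes the delay was added as a literal sleep 60 line, as in the example above.
# Report whether the 60-second delay is present in pre_run.sh.
grep -n 'sleep 60' /opt/emc/extra/pre_run.sh || echo "sleep 60 delay is missing from pre_run.sh"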
Permanent Fix:
- New deployments: Install CloudLink 7.1.2 and PowerFlex 3.6.0.2.
- Existing systems: Upgrade to CloudLink 7.1.2 and PowerFlex 3.6.0.2.
CloudLink 7.1.2 and PowerFlex 3.6.0.2 use the /dev/disk/by-id/... device references, and therefore the persistent /dev/mapper/svm_wwn-XXXX names instead of the logical /dev/mapper/svm_sdX names. Because the WWN-based names do not depend on the drive letter order, they do not change across reboots.
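After the upgrade, the new naming can be confirmed on an SDS node. This is a minimal check assuming the svm_wwn naming shown above.
# The CloudLink mapper devices should now use WWN-based names.
ls -l /dev/mapper/ | grep svm_wwn
# The letter-based svm_sd* names should no longer be present.
ls -l /dev/mapper/ | grep -E 'svm_sd[a-z]+' || echo "no letter-based svm mapper names found"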