PowerFlex: Failed disk with wrong Device ID
Summary: ScaleIO system disk shows as failed when it was used on another SDS node.
Symptoms
Scenario
When a customer is using same disk enclosure for two SDS nodes or more, they must configure offline/online for each disk and the node it belongs to.
Sometimes the customer can be mistaken and online the same disk on both SDS nodes and then we see one disk as failed.
Possible mistakes:
- The customer online the same disk on both SDS nodes and then we see one disk as failed.
- The customer swap disks, meaning each disk is online on the wrong node, in that case we see two failed drives one, on each SDS node.
Symptoms
The wrong disk device ID found by the SDS node, and the SDS process set the disk to FAILED state.
On SDS process start after mosConf part SDS process moves to physical devices discovery, when disk is not for scaleio use (like system OS disk or free disk) "Invalid device header signature" error will be shown (first line in the output). When the disk is in use by ScaleIO a device is found and the device ID is shown next to it.
On the first output below (trc file from server 1) we can see that 12 devices were found, but looking closely we can see two devices are different (L,M) - the 12th char in device ID is 3 and not 0 as all the other devices IDs.
On the second output below (trc file from server 2) 12 disks were found and again two disks are different (K,L) - the 12th char in device ID is 0 and not 3 as all the other devices IDs.
After discovery process SDS process moves to adding devices back to SDS, when SDS won't find the disk rc result will be NOT_FOUND (trc file from server 1), as we can see in the examples below, on each SDS, we have disks with device ID not belonging to its SDS, SDS will show those disks as FAILED because they are NOT_FOUND (trc file from server 1).
trc file from server 1
30/04 09:48:16.328000 000000A170629EA0:phyDev_ReadDevId:02679: Invalid device header signature : path=C, devVersion=2807280628052804, sigStart=2803280228012800, sigEnd=283b283a28392838
30/04 09:48:16.328000 000000A170629EA0:phyDevMap_ReloadSpecific:00128: Failed to read DeviceId of C. rc=351
30/04 09:48:16.329000 000000A170629EA0:phyDevMap_ReloadSpecific:00136: Found device F ,a2901dcd00000000
30/04 09:48:16.330000 000000A170629EA0:phyDevMap_ReloadSpecific:00136: Found device G ,a2901dce00000001
30/04 09:48:16.331000 000000A170629EA0:phyDevMap_ReloadSpecific:00136: Found device H ,a2901dcf00000002
30/04 09:48:16.332000 000000A170629EA0:phyDevMap_ReloadSpecific:00136: Found device I ,a2901dd000000003
30/04 09:48:16.333000 000000A170629EA0:phyDevMap_ReloadSpecific:00136: Found device J ,a2901dd100000004
30/04 09:48:16.333000 000000A170629EA0:phyDevMap_ReloadSpecific:00136: Found device K ,a29044bf00000005
30/04 09:48:16.337000 000000A170629EA0:phyDevMap_ReloadSpecific:00136: Found device L ,a29044c400030006
30/04 09:48:16.342000 000000A170629EA0:phyDevMap_ReloadSpecific:00136: Found device M ,a29044c000030005
30/04 09:48:16.343000 000000A170629EA0:phyDevMap_ReloadSpecific:00136: Found device N ,a29044cb00000008
30/04 09:48:16.344000 000000A170629EA0:phyDevMap_ReloadSpecific:00136: Found device O ,a2906bcf00000009
30/04 09:48:16.345000 000000A170629EA0:phyDevMap_ReloadSpecific:00136: Found device P ,a2906bd30000000a
30/04 09:48:16.345000 000000A170629EA0:phyDevMap_ReloadSpecific:00136: Found device Q ,fbd792df0000000b
...
30/04 09:48:16.345000 000000A1730BCEA0:contCmd_AddDev:01204: DevId a2901dce00000001 - Start rc = SUCCESS
30/04 09:48:16.346000 000000A173086EA0:contCmd_AddDev:01204: DevId a29044c700000007 - Start rc = SUCCESS
30/04 09:48:16.346000 000000A173098EA0:contCmd_AddDev:01204: DevId a2906bd30000000a - Start rc = SUCCESS
30/04 09:48:16.346000 000000A1730E0EA0:contCmd_AddDev:01204: DevId fbd792e50000000c - Start rc = SUCCESS
30/04 09:48:16.346000 000000A1730B3EA0:contCmd_AddDev:01204: DevId a2901dcf00000002 - Start rc = SUCCESS
30/04 09:48:16.346000 000000A17310DEA0:contCmd_AddDev:01204: DevId a2901dcd00000000 - Start rc = SUCCESS
30/04 09:48:16.346000 000000A173062EA0:contCmd_AddDev:01204: DevId a29044cb00000008 - Start rc = SUCCESS
30/04 09:48:16.346000 000000A1730C5EA0:contCmd_AddDev:01204: DevId a2901dd100000004 - Start rc = SUCCESS
30/04 09:48:16.346000 000000A1730E0EA0:contCmd_AddDev:01391: DevId fbd792e50000000c - Done rc = NOT_FOUND
30/04 09:48:16.348000 000000A1730A1EA0:contCmd_AddDev:01204: DevId fbd792ee0000000e - Start rc = SUCCESS
30/04 09:48:16.348000 000000A1730A1EA0:contCmd_AddDev:01391: DevId fbd792ee0000000e - Done rc = NOT_FOUND
30/04 09:48:16.349000 000000A1730F2EA0:contCmd_AddDev:01204: DevId fbd792e90000000d - Start rc = SUCCESS
30/04 09:48:16.349000 000000A17306BEA0:contCmd_AddDev:01204: DevId a2901dd000000003 - Start rc = SUCCESS
30/04 09:48:16.349000 000000A17307DEA0:contCmd_AddDev:01204: DevId a2906bcf00000009 - Start rc = SUCCESS
30/04 09:48:16.349000 000000A173074EA0:contCmd_AddDev:01204: DevId a29044bf00000005 - Start rc = SUCCESS
30/04 09:48:16.349000 000000A173086EA0:contCmd_AddDev:01391: DevId a29044c700000007 - Done rc = NOT_FOUND
30/04 09:48:16.349000 000000A1730F2EA0:contCmd_AddDev:01391: DevId fbd792e90000000d - Done rc = NOT_FOUND
30/04 09:48:16.351000 000000A1730FBEA0:contCmd_AddDev:01204: DevId fbd792ef0000000f - Start rc = SUCCESS
30/04 09:48:16.352000 000000A1730FBEA0:contCmd_AddDev:01391: DevId fbd792ef0000000f - Done rc = NOT_FOUND
30/04 09:48:16.352000 000000A173104EA0:contCmd_AddDev:01391: DevId a29044c300000006 - Done rc = NOT_FOUND
trc file from server 2
30/04 11:37:57.065000 000000EE1DC2AEA0:phyDevMap_ReloadSpecific:00136: Found device F ,a2901dc800030000
30/04 11:37:57.065000 000000EE1DC2AEA0:phyDevMap_ReloadSpecific:00136: Found device G ,a2901dc900030001
30/04 11:37:57.065000 000000EE1DC2AEA0:phyDevMap_ReloadSpecific:00136: Found device H ,a2901dca00030002
30/04 11:37:57.065000 000000EE1DC2AEA0:phyDevMap_ReloadSpecific:00136: Found device I ,a2901dcb00030003
30/04 11:37:57.065000 000000EE1DC2AEA0:phyDevMap_ReloadSpecific:00136: Found device J ,a2901dcc00030004
30/04 11:37:57.081000 000000EE1DC2AEA0:phyDevMap_ReloadSpecific:00136: Found device K ,a29044c300000006
30/04 11:37:57.081000 000000EE1DC2AEA0:phyDevMap_ReloadSpecific:00136: Found device L ,a29044c700000007
30/04 11:37:57.081000 000000EE1DC2AEA0:phyDevMap_ReloadSpecific:00136: Found device M ,a29044c800030007
30/04 11:37:57.081000 000000EE1DC2AEA0:phyDevMap_ReloadSpecific:00136: Found device N ,a29044cc00030008
30/04 11:37:57.081000 000000EE1DC2AEA0:phyDevMap_ReloadSpecific:00136: Found device O ,a2906bd000030009
30/04 11:37:57.081000 000000EE1DC2AEA0:phyDevMap_ReloadSpecific:00136: Found device P ,a2906bd40003000a
30/04 11:37:57.081000 000000EE1DC2AEA0:phyDevMap_ReloadSpecific:00136: Found device Q ,fbda92e00003000b
SDS Device ID Explained
Each SDS device has a header saved on its 64th LB.
The header has the following structure:
64-Bit signature
64-Bit device version
64-Bit SDS ID
64-Bit SDS Device ID ß what you are looking for.
The SDS Device ID also known as TgtDevId is consisted of the following:
Unique ID 32 bits
TGT index 16 bits
Device index 16 bits
For example: an SDS with the ID 2df4737600000002, would have two devices with the IDs: 7fff29ea00020000, 7fff29eb00020001
Anyhow, if a device belonging to SDS x was swapped into SDS y, then upon reattaching the device into SDS y, it finds out it belongs to a different SDS, by checking the SDS ID saved on the header.
You can perhaps see it in the SDS logs, if you search for "Wrong device"
Impact
System rebuild and rebalance as disk is in FAILED state.
Cause
Disk device ID belongs to another SDS node, therefore ScaleIO will never use it.
Resolution
Adding the disk to the correct SDS node.
Impacted versions
All PowerFlex versions
Fixed in version
Work as design.