I have run into SQL Server database corruption on two LUNs on an SC5020 running version 126.96.36.199. Both LUNs had data reduction (compression and dedup) active. Have any of you run into something similar?
We have been troubleshooting the issue with Dell, but their support has not been able to find any problem with the system.
What we see is 'older' data (a day or so old) getting corrupted and SQL Server generating errors like:
Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 805577908, index ID 1, partition ID 72057594047954944, alloc unit ID 72057594080395264 (type LOB data). The off-row data node at page (1:1677118), slot 3, text ID 3729863409664 is not referenced.
Msg 8964, Level 16, State 1, Line 1
We see the errors at a rate of 216 KB per 160 GB of data.
I know that there was an issue with Dell compression.
I suspect that there are still issues with dedup. Have you seen anything like this?
Below is a bit more detail on the issue we have:
After upgrading SCOS from version 188.8.131.52 to 184.108.40.206 and activating data reduction on volumes, we were alerted within a few weeks about database corruption on one MS SQL database. The issue affected older data (reports); new inserts into the database worked fine.
All the volumes we use are live volumes between two SC 5020 units. I had only updated one of the units to 220.127.116.11 while the other was running 18.104.22.168 without data reduction.
During troubleshooting we found that moving the primary volume to the node with 22.214.171.124 and limiting paths to that node resolved the database corruption issue. When we moved the volume back, the database broke again. All of this happened while the VM was up and running.
I did some further testing and shut down the problematic VMs. I then took snapshots of the live volume on both units and mounted them in VMware. I then did a binary comparison of the problem vmdk-flat files. They should be identical, since the VM was shut down when the snapshots were taken. The binary comparison showed that 216KB differed in the 260GB flat-vmdk file holding the database. That explained why IO served from 126.96.36.199 behaved differently than IO served from the 188.8.131.52 unit. Recovering is simple, as we have good data on the 184.108.40.206. The issue is not MS SQL specific, but if the incorrect bytes hit your DB you will know about it.
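For anyone who wants to repeat this kind of check, here is a minimal sketch of a block-wise binary comparison in Python. It is not the tool used above; the function name `compare_files`, the block size, and the example paths are all assumptions for illustration. It reports how many bytes differ and the offsets of the blocks that contain differences.

```python
def compare_files(path_a, path_b, block_size=1024 * 1024):
    """Compare two files block by block.

    Returns (differing_byte_count, offsets_of_differing_blocks).
    Hypothetical helper for illustration; block_size is an arbitrary choice.
    """
    differing_bytes = 0
    differing_blocks = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both files fully read
            if a != b:
                differing_blocks.append(offset)
                # Count bytes that differ in the overlapping part of the block.
                differing_bytes += sum(1 for x, y in zip(a, b) if x != y)
                # If one file is longer, the extra bytes count as differences.
                differing_bytes += abs(len(a) - len(b))
            offset += max(len(a), len(b))
    return differing_bytes, differing_blocks


# Example usage (paths are placeholders, not the real datastore paths):
# count, blocks = compare_files("/mnt/snap_a/db-flat.vmdk", "/mnt/snap_b/db-flat.vmdk")
# print(count, "bytes differ in", len(blocks), "blocks")
```

The block offsets can help show whether the differing regions are clustered or scattered across the file, which may be useful information for a support case.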
It would seem that the issue is extremely rare, since we seem to be alone with this. Dell techs are doing a good job trying to figure out what happened; I just wanted to share what we found and check whether any of you have seen anything similar.
Is it the case that corruption only occurs on blocks that were written to the storage before the SCOS upgrade? I am very interested in the outcome of the support case.
Hoping that it is not yet another issue that is about to be fixed in SCOS 7.3.6 ...