sami.laihorinne

9 Posts

2319

January 22nd, 2019 15:00

SC5020 data corruption

Hi,

I have run into SQL Server database corruption on two LUNs on an SC5020 running version 7.3.5.8. Both LUNs had data reduction (compression and dedup) active. Have any of you run into something similar?

We have been troubleshooting the issue with Dell, but their support cannot find any problem with the system.

What we see is 'older' data (a day or so old) getting corrupted and SQL Server generating errors like:

Msg 8964, Level 16, State 1, Line 1
Table error: Object ID 805577908, index ID 1, partition ID 72057594047954944, alloc unit ID 72057594080395264 (type LOB data). The off-row data node at page (1:1677118), slot 3, text ID 3729863409664 is not referenced.
Msg 8964, Level 16, State 1, Line 1
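
For anyone who wants to line up an error like this with a raw file comparison, here is a minimal Python sketch that pulls the page reference out of the message and turns it into a byte offset inside the data file. It assumes the standard 8 KB SQL Server page size and that the page number counts from the start of the data file with that file ID; the name `page_offset` is made up for illustration and is not part of any tool.

```python
import re

PAGE_SIZE = 8192  # assumption: standard 8 KB SQL Server data page

# Matches the "page (file_id:page_number)" part of a consistency-check message.
PAGE_RE = re.compile(r"page \((\d+):(\d+)\)")


def page_offset(message):
    """Return (file_id, page_number, byte_offset), or None if no page is named."""
    match = PAGE_RE.search(message)
    if match is None:
        return None
    file_id = int(match.group(1))
    page_no = int(match.group(2))
    # Byte offset of the page within the data file that has this file ID.
    return file_id, page_no, page_no * PAGE_SIZE


msg = ("The off-row data node at page (1:1677118), slot 3, "
       "text ID 3729863409664 is not referenced.")
print(page_offset(msg))
```

The offset is relative to the database data file, not to the LUN or the VMDK, so it only narrows things down once you know where that file sits on the disk.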

We see the corruption at a rate of 216 KB per 160 GB of data.

I know that there was a known issue with Dell compression:
https://www.dell.com/support/article/us/en/04/sln314446/sc-storage-customer-notification-deduplication-operation-can-result-in-unexpected-system-behavior?lang=en

I suspect that there are still issues with dedup. Have you seen anything like this?

Responses(5)

DELL-Bob Mi

230 Posts

0

January 25th, 2019 15:00

First, the issue you point out for Dell Compression was resolved in SCOS 07.03.04, so it was not the cause of your issue. Second, what type of SQL are you using? Microsoft, Sybase, MySQL, Oracle... Third, are you running database compression in addition to Data Reduction on the SAN?

sami.laihorinne

9 Posts

0

February 1st, 2019 15:00

Yes, this is something different from the known data reduction issue. I am just curious whether anyone else has had issues with corrupted data.

sami.laihorinne

9 Posts

0

February 1st, 2019 16:00

Below is a bit more detail on the issue we have:

After upgrading SCOS from version 7.3.2.48 to 7.3.5.8 and activating data reduction on volumes, we were alerted within a few weeks to database corruption on one MS SQL database. The issue affected older data (reports); new data inserts into the database worked OK.

All the volumes we use are live volumes between two SC5020 units. I had only updated one of the units to 7.3.5.8, while the other was still running 7.3.2.48 without data reduction.

During troubleshooting we found that moving the primary volume to the node with 7.3.2.48 and limiting paths to that node resolved the database corruption issue. When we moved the volume back, the database broke again. All of this happened while the VM was up and running.

I did some further testing and shut down the problematic VMs. I then took snapshots of the live volume on both units and mounted them in VMware. Next I did a binary comparison of the problem -flat.vmdk files. They should be identical, since the VM was shut down during the snapshot. The binary comparison showed that 216 KB differed in the 260 GB -flat.vmdk file holding the database. That explains why I/O served from the 7.3.5.8 unit behaved differently from data served from the 7.3.2.48 unit. Recovery is simple, as we have good data on the 7.3.2.48 unit. The issue is not MS SQL specific, but if the incorrect bytes hit your database you will know about it.
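
In case anyone wants to repeat this kind of check, below is a rough Python sketch of a chunked binary comparison. It is not the tool I used, just one way to do it; the function name `count_differing_bytes` and the 1 MiB chunk size are my own choices for illustration.

```python
def count_differing_bytes(path_a, path_b, chunk_size=1024 * 1024):
    """Compare two files chunk by chunk.

    Returns (differing_bytes, first_offset); first_offset is None when
    the files are identical.
    """
    differing = 0
    first_offset = None
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            if a != b:
                # Only walk byte by byte when the chunks actually differ.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        differing += 1
                        if first_offset is None:
                            first_offset = offset + i
                # A length mismatch counts as differing bytes too.
                extra = abs(len(a) - len(b))
                if extra:
                    differing += extra
                    if first_offset is None:
                        first_offset = offset + min(len(a), len(b))
            offset += max(len(a), len(b))
    return differing, first_offset
```

You would call it with the two flat files from the mounted snapshots, for example `count_differing_bytes("unit_a-flat.vmdk", "unit_b-flat.vmdk")` (file names are placeholders). On files of a few hundred GB a full pass takes a while, but it reads sequentially and never holds more than two chunks in memory.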

It would seem that the issue is extremely rare, since we appear to be alone with this. Dell techs are doing a good job trying to figure out what happened; I just wanted to share what we found and check whether any of you have seen anything similar.

harri.hanninen

1 Message

1

February 10th, 2019 05:00

Sad news!

Is it the case that corruption only occurs on blocks that were written to the storage before the SCOS upgrade? I am very interested in the outcome of the support case.

Hoping that it is not yet another issue that is about to be fixed in SCOS 7.3.6 ...

-Harri

sami.laihorinne

9 Posts

0

February 18th, 2019 10:00

Hi, there has been no progress on the issue, and the latest suggestion from Dell support is anything but helpful: 'There is nothing more we can do here'. Since I did find corruption on some LUNs, my game plan is to delete all the LUNs on the problematic system, recreate the live volumes from the healthy node, and then activate data reduction. After that I will do some database checks and some random binary VMDK comparisons. Perhaps I will then upgrade the other system, but I will probably never dare to activate data reduction on that one.
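
For the random comparisons, something along these lines might do. It is only a sketch in Python: the name `sample_compare`, the 64 KiB block size and the default of 1000 samples are assumptions of mine, not anything Dell or VMware provides.

```python
import os
import random


def sample_compare(path_a, path_b, samples=1000, block_size=64 * 1024, seed=None):
    """Compare randomly chosen aligned blocks of two equally sized files.

    Returns the sorted byte offsets of the sampled blocks that differ.
    """
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        raise ValueError("files differ in size")
    total_blocks = (size + block_size - 1) // block_size
    rng = random.Random(seed)
    # Pick distinct block indexes; never more than the file actually has.
    picks = rng.sample(range(total_blocks), min(samples, total_blocks))
    mismatches = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        for block in sorted(picks):
            offset = block * block_size
            fa.seek(offset)
            fb.seek(offset)
            if fa.read(block_size) != fb.read(block_size):
                mismatches.append(offset)
    return mismatches
```

Keep in mind that 216 KB of bad data in a 260 GB file is under one part in a million, so a sparse random sample can easily come back clean on a corrupted volume. A clean sample is weak evidence; a full comparison of at least a few volumes is the safer check.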
