ScaleIO Ready Node: Repeated rebuilds on ScaleIO Ready Node cluster
Summary: A Dell ScaleIO Ready Node cluster can trigger repeated rebuilds when DAS Cache is configured incorrectly.
Symptoms
When the DAS Cache/SSD space is full, DAS Cache starts flushing data from the SSD to the hard drive. It does this by sending many IOs to a small region of the drive, which minimizes seek time and maximizes throughput on the hard drive. If other IOs to the same hard drive (for example, large reads that bypassed the cache) are issued to a different location on the drive, the RAID controller and the drive prioritize the IOs with the short seek to get maximum throughput, which can sometimes cause the other IOs to see high latency.
Cause
Resolution
Apply the following configuration settings, step by step, as needed to enable DAS Cache (per ScaleIO-ready Dell node):
1. Put the relevant SDS into Maintenance Mode.
2. Change DAS cache configuration:
a. Set DAS cache parameters:
fscli --set-param AggressiveCachePopulation=0
fscli --set-param BypassLengthKB=128
fscli --set-param RcMaxLengthKB=32
fscli --set-param LowSpaceBypassKb=0
b. Modify DAS cache configuration file ("/etc/fio/config"):
FlusherCmdsNormalToBeStarted = 1
FlusherMaxCmdsToBeStarted = 2
c. Reset the node to reload the DAS Cache driver and apply the settings (only needed for step 'b').
3. Change the server RAID write cache setting to write-through (effective immediately):
/opt/MegaRAID/perccli/perccli64 /c0/vall set wrcache=wt
4. Modify the ScaleIO performance parameters as follows (management only - effective immediately):
scli --set_performance_parameters --sdc_max_inflight_requests 200 --all_sdc --tech
scli --set_performance_parameters --sdc_max_inflight_data 20 --all_sdc --tech
5. Exit the relevant SDS from Maintenance Mode.
We recommend applying the above settings to only one SDS at first, then verifying that everything works properly for a few days before proceeding to the next SDS, and so on.
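The procedure above could be wrapped in a small script so that the exact commands can be reviewed before anything is changed. The following is only a sketch under stated assumptions, not a supported tool. The helper names (run, set_cfg_key, apply_das_cache_settings, apply_mdm_settings) are hypothetical, and it assumes that /etc/fio/config holds one "Key = value" entry per line, which the article does not confirm. Entering and exiting Maintenance Mode and resetting the node are left as manual steps.

```shell
#!/bin/sh
# Sketch only: replays the commands from the resolution steps.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# run (hypothetical helper): print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# set_cfg_key FILE KEY VALUE (hypothetical helper).
# ASSUMPTION: FILE holds one "Key = value" entry per line.
# Rewrites an existing KEY line in place, or appends one if KEY is absent.
set_cfg_key() {
    file=$1
    key=$2
    value=$3
    if grep -q "^[[:space:]]*$key[[:space:]]*=" "$file"; then
        sed "s|^[[:space:]]*$key[[:space:]]*=.*|$key = $value|" "$file" > "$file.tmp" &&
            cat "$file.tmp" > "$file" &&
            rm -f "$file.tmp"
    else
        echo "$key = $value" >> "$file"
    fi
}

# Node-local settings: DAS Cache parameters, flusher settings, RAID write cache.
# Run on the SDS node while the SDS is in Maintenance Mode.
apply_das_cache_settings() {
    cfg=${1:-/etc/fio/config}
    run fscli --set-param AggressiveCachePopulation=0
    run fscli --set-param BypassLengthKB=128
    run fscli --set-param RcMaxLengthKB=32
    run fscli --set-param LowSpaceBypassKb=0
    # These two only take effect after the node is reset (manual step).
    run set_cfg_key "$cfg" FlusherCmdsNormalToBeStarted 1
    run set_cfg_key "$cfg" FlusherMaxCmdsToBeStarted 2
    run /opt/MegaRAID/perccli/perccli64 /c0/vall set wrcache=wt
}

# Cluster-wide ScaleIO settings, issued through scli (management side).
apply_mdm_settings() {
    run scli --set_performance_parameters --sdc_max_inflight_requests 200 --all_sdc --tech
    run scli --set_performance_parameters --sdc_max_inflight_data 20 --all_sdc --tech
}
```

Sourcing the file and calling apply_das_cache_settings with the default DRY_RUN=1 only lists the commands. Setting DRY_RUN=0 before the call executes them, so that should be done on one SDS at a time, in line with the recommendation above.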