Avamar: Suspended Partitions, Stripes, and Hfscheck Failures on Avamar

Summary: This article describes suspended partitions, stripes, and hfscheck failures on Avamar (Symptom Code 22632).

This article is not tied to any specific product. Not all product versions are identified in this article.

Symptoms

1. The following error may appear in the Avamar Administrator Server UI. The message may generate a Dial Home Service Request (SR):

Symptom Code: 22632, Desc: A server disk has become suspended.
 

2. WARN messages related to perfbeat thread are reported on the data storage nodes in the /data01/cur/gsan.log:

WARN: <0968> perfbeat::outoftolerance mbpersec=0.31 average=5.66
WARN: <1051> tperfstatechanger::execute server_exception(MSG_ERR_UNNECESSARY) diskid=0 newstate=suspended
WARN: <1084> changing disk 0 on node 0.3 to suspended state
 

3. The status.dpn output shows that a disk has stripes suspended:
(This output is only produced when "WARN <1084>" occurs.)

For Example:

0.8 10.10.10.10 7.3.1-125 ONLINE fullaccess mhpu+0hpu+0hpu 1 false 7.36 16350564 3401334 56.0% 66%(onl:1,SUS:2374) 50%(onl:2439) 50%(onl:2433) 

This output shows that there are 2374 suspended stripes.
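As a quick triage aid, the suspended-stripe count can be pulled out of a status.dpn node line programmatically. The sketch below is an illustration only (not an official parser); the field layout is taken from the example above and can vary between Avamar versions.

```python
import re

def suspended_stripes(status_line):
    """Sum the SUS (suspended) stripe counts found in a status.dpn node line."""
    return sum(int(n) for n in re.findall(r"SUS:(\d+)", status_line))

line = ("0.8 10.10.10.10 7.3.1-125 ONLINE fullaccess mhpu+0hpu+0hpu 1 false "
        "7.36 16350564 3401334 56.0% 66%(onl:1,SUS:2374) 50%(onl:2439) 50%(onl:2433)")
print(suspended_stripes(line))  # 2374
```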

4. The hfscheck fails if a partition becomes suspended while the hfscheck is running. An example of an error from /data01/hfscheck/err.log or /data01/cur/err.log is:

ERROR: <0001> indexstripe::hfschecksweepbody stripe=0.0-1209 proxy=0.0-1209 indexelem([hash=ee9b2fe66b4bd472e28c4f41c5097dbeaba7131a stripe=0.1-DF8 offset=1285]) goodowner=true goodelem=false

 

Cause

Periodically, every five minutes by default, the gsan "tests" the I/O subsystem by performing small reads from the data partitions.

It verifies whether the read performance is at least 10% of normal performance.

 

In the example below, the message indicates that, on the node which generated the warning, the average read performance over an extended number of trials while hfscheck was running was approximately 54.03 MB/second. On this particular test, however, the actual performance was 0.57 MB/second, which is below the "limit" of 10% of the average value, or 5.4029 MB/second.

Event Summary = perfbeat::outoftolerance mask=[hfscheck] average=54.03 limit=5.4029 mbpersec=0.57
 

The original purpose of this test was to provide a warning that an issue with the I/O subsystem was causing read performance to be excessively slow.

In this case, slower than 10% of the "average" disk I/O performance.
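The tolerance check described above can be expressed in a few lines. This is a sketch of the assumed behavior, not the gsan source; the numbers come from the event summary in this article.

```python
TOLERANCE = 0.10  # a trial is out of tolerance below 10% of the historical average

def out_of_tolerance(mbpersec, average):
    """Return (flagged, limit) for one perfbeat-style read test."""
    limit = average * TOLERANCE
    return mbpersec < limit, limit

# Numbers from the event summary above:
flagged, limit = out_of_tolerance(mbpersec=0.57, average=54.03)
print(flagged, round(limit, 4))  # True 5.403
```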

The perftriallimit specifies the number of consecutive disk-read tests that must be out of tolerance before perfbeat suspects that a disk may be degraded.

The perfinterval (default 300 seconds, or 5 minutes) specifies how long to wait between each of these disk-read tests.

 

When perfbeat suspects a disk is degraded, it tells the gsan to reach a cold state (stop all disk-related activity). 

It waits at most 20 minutes (hardwired) for the gsan to reach this state before timing out and not suspending the disk.

If the cold state is reached, then perfbeat performs perfcoldtriallimit (default 4) more read tests spaced perfcoldinterval (default 30) seconds apart.

Only if all these tests indicate that the disk is still degraded, will the disk be suspended.
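The decision flow above can be sketched as follows. This is illustrative pseudologic using the tunables named in this article (perftriallimit, perfcoldtriallimit); the timing parameters (perfinterval, perfcoldinterval, the hardwired 20-minute cold-state timeout) are omitted for brevity, and this is not the gsan implementation.

```python
def should_suspend(warm_trials_slow, reached_cold_state, cold_reads,
                   perftriallimit=4, perfcoldtriallimit=4, average=54.03):
    """Sketch of the suspend decision for one data partition."""
    # Setting perftriallimit=0 disables suspension entirely (see Resolution).
    if perftriallimit == 0 or warm_trials_slow < perftriallimit:
        return False
    # The gsan must reach a cold state before the timeout, or the disk
    # is not suspended.
    if not reached_cold_state:
        return False
    # Every cold-state read test must still be below 10% of the average.
    limit = 0.10 * average
    cold = cold_reads[:perfcoldtriallimit]
    return len(cold) == perfcoldtriallimit and all(r < limit for r in cold)

print(should_suspend(5, True, [0.5, 0.6, 0.4, 0.7]))   # True
print(should_suspend(5, True, [0.5, 0.6, 12.0, 0.7]))  # False: one cold read recovered
```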

 

Possible reasons for suspended disks:

  • When trying to reach a cold state, the gsan always waits at least one minute (hardwired). It also waits for all pending gsan disk I/O related activities to complete or suspend their operation. However, after a cold state has been reached, the operating system may still be performing disk I/O, such as flushing out its cache. This flushing activity is one possible explanation for why disks get suspended unnecessarily. With the larger amounts of memory, there can be a lot more cache data to flush.

  • Another possible explanation is that the performance history information is not accurately predicting what the expected disk read performance should be during various gsan activities because the gsan's behavior has changed too quickly for the history to reflect (the history is an average of the last 10 days worth of performance measurements).

  • Another possible explanation is that there could be an issue, such as not waiting for all gsan disk I/O activities to complete or suspend their operation before reaching a cold state.

Furthermore, research showed that during the hfscheck "indexsweep" phase (when all the hashes in the index stripes are read and massive random writes are performed to many Data Referenced Log (DRL) files), the tested I/O performance drops off for a significant period of time.

On Avamar Data Store Gen4, Gen4s, and Gen4T, write operations have been prioritized over read operations, so the significance of testing the read performance of the I/O subsystem is much lower. Also, some drives (such as Seagate Megalodon drives) use techniques that may confuse the tests performed by the perfbeat thread.

Resolution

Background:

There are typically three different warning messages seen in the gsan logs:

WARN: <0968> perfbeat::outoftolerance mbpersec=0.31 average=5.66

Warning <0968> indicates that there was an individual gsan I/O test that was slow.

This message can be safely ignored.

 
WARN: <1051> tperfstatechanger::execute server_exception(MSG_ERR_UNNECESSARY) diskid=0 newstate=suspended

Warning <1051> indicates that there were enough slow reads that the gsan considered putting the data partition into the suspended state, but decided not to do so. That is what MSG_ERR_UNNECESSARY indicates.

This message can be safely ignored.

 
WARN: <1084> changing disk 0 on node 0.3 to suspended state

Warning <1084> indicates that the gsan put the data partition into a "suspended state."

This message must not be ignored.
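The triage rules above (ignore <0968> and <1051>, investigate <1084>) can be captured in a small helper. This is a convenience sketch based on the log examples in this article, not an official parser.

```python
import re

# Mapping of warning codes to the guidance in this article (assumption:
# only these three codes are of interest here).
ACTION = {"0968": "ignore", "1051": "ignore", "1084": "investigate"}

def triage(log_line):
    """Classify a gsan WARN line by its four-digit warning code."""
    m = re.search(r"WARN: <(\d{4})>", log_line)
    return ACTION.get(m.group(1), "unknown") if m else "unknown"

print(triage("WARN: <1084> changing disk 0 on node 0.3 to suspended state"))
# investigate
```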

 
 

Resolution:

If the stripes are put into a suspended state, use the following guidelines to investigate and correct each scenario:

Perform the following to identify the location of the suspended partition:

1. Log in to the Avamar Utility Node as admin.

2. Elevate to root privilege.

3. Load the root keys per Avamar: How to Log in to an Avamar Server and Load Various Keys.

4. Run the following command to identify the location of the suspended partition:

mapall --noerror 'grep -i "suspended" /data01/cur/err.log'
 

5. Review the scenarios as they pertain to the results above:

Scenario# 1: Random portions on different storage nodes put into a suspended state:
    • No action is required. Stripes return online automatically. It is highly likely that hfscheck was running.
 
Scenario# 2: The same partition on the same storage node put into a suspended state:
    • If stripes return online automatically, it is highly likely that garbage collection or hfscheck was running.
    • IMPORTANT: This could be an indication of a disk problem or some underlying problem.
    • Although the drive has not yet failed, it should still be checked using the steps below:

1. Determine which physical disks are associated with the disk that Avamar has suspended. A problem with a physical disk within a virtual disk would be a root cause for the suspend:

avsysreport pdisk vdisk=x 

Where x is the number of the virtual disk (data partition) that has been suspended. For example, if the first data partition shows suspended stripes, query vdisk=0.

Note: See Avamar: The location of a physical disk and which RAID group it belongs to in an Avamar node for more information about virtual and physical disk assignments.
 

2. Verify that there are no disk failures, predicted failures, or other errors at the physical disk level.

3. Confirm that there are no SCSI errors on physical disks that represent the virtual disk on the node in question (determined in Step 1). 

grep -i "MRMON\|scsi\|Adaptec" /var/log/messages
 

4. Virtual Disks in Write Through Mode can cause disk suspends due to low I/O. Check the write policy on the controller:

mapall --noerror --all+ 'avsysreport vdisk | grep "Write Policy"'  
 

If any problems are detected in steps 2-4, open an SR with Dell Technologies Avamar Support for further investigation.

 

Scenario# 3: Review the default perftriallimit settings:

1. Verify that the perftriallimit is set to 0:

avmaint config --ava | grep perftriallimit 
perftriallimit="0"
 

2. If the perftriallimit is anything other than zero:

a. Update it by running the command:

avmaint config --ava perftriallimit=0

b. Confirm the change:

avmaint config --ava | grep perftriallimit 
perftriallimit="0"
 

 

 

Affected Products

Avamar

Products

Avamar, Avamar Server
Article Properties
Article Number: 000061342
Article Type: Solution
Last Modified: 17 Jun 2025
Version: 10