faltindal
1 Nickel

1 Stripe Error

Hello everybody,

When I checked the server information, I saw the following picture. I checked the status and saw that the checkpoint could not be created. Also, hfscheck could not be completed. One of the stripes has a problem.

yellow.jpg

How can I fix the stripe?

Thanks a lot.

rpervan
2 Iron

Re: 1 Stripe Error

I would suggest performing a FULL HFS data consistency check on the latest CP. You can do this from the GUI.

It will perform a data integrity check on the Avamar stripes, and Avamar should repair the problem automatically.

Otherwise, please open an SR with the EMC Avamar team and they will be glad to help you.

Regards,

.r

rpervan
2 Iron

Re: 1 Stripe Error

Another option is to invoke "avmaint testintegrity <STRIPE_NAME> --ava" to check the consistency of a particular stripe.

faltindal
1 Nickel

Re: 1 Stripe Error

Hi Rej,

When I look at the Avamar status, I see that the hfscheck process has been terminating since Friday. When I try to create a new checkpoint, it is not possible.

You can see the result of avmaint testintegrity below.

root@origin1:~/#: avmaint testintegrity 0.2-B2A --ava

root@origin1:~/#:

Here is the status.dpn result:

root@origin1:~/#: status.dpn

Mon Aug 15 15:37:20 EEST 2011 Mon Aug 15 12:37:19 2011 UTC (Initialized Tue Jan 25 12:15:25 2011 UTC)

Node  IP Address    Version   State   Runlevel    SrvrRootUser  Dis  Suspend  Load  UsedMB    Errlen   %Full  Percent Full and Stripe Status by Disk
0.0   10.83.55.163  5.0.3-29  ONLINE  fullaccess  mhpu0hpu0hpu  5    false    2.37  16657472  1166380  34.3%  35%(onl:1301) 34%(onl:1314) 34%(onl:1326)
0.1   10.83.55.164  5.0.3-29  ONLINE  fullaccess  mhpu0hpu0hpu  4    false    5.60  16572136  1817362  34.4%  35%(onl:1315) 34%(onl:1316) 34%(onl:1311)
0.2   10.83.55.165  5.0.3-29  ONLINE  fullaccess  mhpu0hpu0hpu  2    false    0.39  18102840  1033958  34.1%  34%(onl:1304) 33%(onl:1330,ERR:1) 33%(onl:1301)

SrvrRootUser Modes = migrate + hfswriteable + persistwriteable + useraccntwriteable

All reported states=(ONLINE), runlevels=(fullaccess), modes=(mhpu0hpu0hpu)

System-Status: ok

Access-Status: full

ERROR 1 stripes OFFLINE_MEDIA_ERROR

Checkpoint failed with result MSG_ERR_OFFLINE : cp.20110815123632 started Mon Aug 15 15:36:32 2011 ended Mon Aug 15 15:37:12 2011, completed 1009 of 11819 stripes

Last GC: finished Fri Aug 12 08:14:53 2011 after 02m 53s >> recovered 5.07 MB (OK)

Hfscheck in progress: started Fri Aug 12 17:05:26 2011 (terminating)

Maintenance windows scheduler capacity profile is active.

The maintenance window is currently running.

Next backup window start time: Mon Aug 15 23:00:00 2011 EEST

Next blackout window start time: Tue Aug 16 08:00:00 2011 EEST

Next maintenance window start time: Tue Aug 16 11:00:00 2011 EEST

Do you know what I should do?

Thanks
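For readers following along, the per-disk columns in the status.dpn output above can be scanned for any stripe state other than onl. The following is a minimal Python sketch and not an Avamar tool. The function name find_problem_stripes is hypothetical, and the assumption that every disk column has the form NN%(state:count,...) is inferred only from the output pasted in this post.

```python
import re

# A node row starts with an ID such as "0.2" followed by whitespace.
NODE_LINE = re.compile(r"^(\d+\.\d+)\s")
# A disk column looks like "33%(onl:1330,ERR:1)" (format assumed from this thread).
DISK_TOKEN = re.compile(r"(\d+)%\(([^)]*)\)")


def find_problem_stripes(status_text):
    """Return (node, disk_index, {state: count}) for every non-'onl' stripe state."""
    problems = []
    for raw in status_text.splitlines():
        line = raw.strip()
        node = NODE_LINE.match(line)
        if not node:
            continue  # header, summary, or blank line
        for disk_index, (_pct, states) in enumerate(DISK_TOKEN.findall(line)):
            bad = {}
            for pair in states.split(","):
                name, _, count = pair.partition(":")
                if name.strip() != "onl" and count.strip().isdigit():
                    bad[name.strip()] = int(count)
            if bad:
                problems.append((node.group(1), disk_index, bad))
    return problems


# The three node rows copied from the status.dpn output in this post.
SAMPLE = """\
0.0 10.83.55.163 5.0.3-29 ONLINE fullaccess mhpu0hpu0hpu 5 false 2.37 16657472 1166380 34.3% 35%(onl:1301) 34%(onl:1314) 34%(onl:1326)
0.1 10.83.55.164 5.0.3-29 ONLINE fullaccess mhpu0hpu0hpu 4 false 5.60 16572136 1817362 34.4% 35%(onl:1315) 34%(onl:1316) 34%(onl:1311)
0.2 10.83.55.165 5.0.3-29 ONLINE fullaccess mhpu0hpu0hpu 2 false 0.39 18102840 1033958 34.1% 34%(onl:1304) 33%(onl:1330,ERR:1) 33%(onl:1301)
"""

for node, disk_index, bad in find_problem_stripes(SAMPLE):
    print(node, disk_index, bad)
```

The disk index is only the zero-based position of the column within the row; the sketch makes no assumption about which mount point that column corresponds to.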

rpervan
2 Iron

Re: 1 Stripe Error

Hello,

Could you supply the output of the following commands:

# dpnctl status

# cplist

Then switch to the "admin" account, load the keys, and run:

#  mapall 'ps -ef | grep gsan'

Please also check the gsan log on all storage nodes with the command below. There must be something relevant in the gsan.log output.

#  mapall 'grep ERR /data01/cur/gsan.log'

One OFFLINE stripe is not, let's say, critical (we can deal with up to 8), but if you are facing this kind of thing for the first time, maybe you should contact EMC.

In the worst case, you can roll back to the latest validated checkpoint.

Cheers,

.r

faltindal
1 Nickel

Re: 1 Stripe Error

Hi Rej,

Here are the results:

root@origin1:~/#: dpnctl status

Identity added: /home/dpn/.ssh/dpnid (/home/dpn/.ssh/dpnid)

dpnctl: INFO: gsan status: ready

dpnctl: INFO: MCS status: up.

dpnctl: INFO: EMS status: up.

dpnctl: INFO: Backup scheduler status: up.

dpnctl: INFO: dtlt status: up.

dpnctl: INFO: axionfs status: up.

dpnctl: INFO: Maintenance windows scheduler status: enabled.

dpnctl: INFO: Maintenance cron jobs status: enabled.

dpnctl: INFO: Unattended startup status: disabled.

root@origin1:~/#:

root@origin1:~/#: cplist

cp.20110614050327 Tue Jun 14 08:03:27 2011 valid rol --- nodes 3/3 stripes 9518

root@origin1:~/#:

admin@origin1:~/>: mapall 'ps -ef | grep gsan'

Using /usr/local/avamar/var/probe.xml

(0.0) ssh -x admin@10.83.55.163 'ps -ef | grep gsan'

admin 5801 5800 0 13:02 ? 00:00:00 bash -c ps -ef | grep gsan

admin 5817 5801 0 13:02 ? 00:00:00 grep gsan

admin 29274 1 0 Mar29 ? 00:00:00 ./gsan restart --runlevel=fullaccess --ramfsroot= --clientssl=false --altlogdir= --mainhost=10.83.55.163 --mainport=20000 --gatewayaddr=10.83.55.161

admin 29275 29274 3 Mar29 ? 4-11:40:31 ./gsan restart --runlevel=fullaccess --ramfsroot= --clientssl=false --altlogdir= --mainhost=10.83.55.163 --mainport=20000 --gatewayaddr=10.83.55.161

(0.1) ssh -x admin@10.83.55.164 'ps -ef | grep gsan'

admin 17575 1 0 Mar29 ? 00:00:00 ./gsan restart --runlevel=fullaccess --ramfsroot= --clientssl=false --altlogdir= --mainhost=10.83.55.163 --mainport=20000 --gatewayaddr=10.83.55.161

admin 17576 17575 3 Mar29 ? 4-09:38:49 ./gsan restart --runlevel=fullaccess --ramfsroot= --clientssl=false --altlogdir= --mainhost=10.83.55.163 --mainport=20000 --gatewayaddr=10.83.55.161

admin 24433 24432 0 13:02 ? 00:00:00 bash -c ps -ef | grep gsan

admin 24449 24433 0 13:02 ? 00:00:00 grep gsan

(0.2) ssh -x admin@10.83.55.165 'ps -ef | grep gsan'

admin 16905 1 0 Mar29 ? 00:00:00 ./gsan restart --runlevel=fullaccess --ramfsroot= --clientssl=false --altlogdir= --mainhost=10.83.55.163 --mainport=20000 --gatewayaddr=10.83.55.161

admin 16906 16905 3 Mar29 ? 4-04:56:17 ./gsan restart --runlevel=fullaccess --ramfsroot= --clientssl=false --altlogdir= --mainhost=10.83.55.163 --mainport=20000 --gatewayaddr=10.83.55.161

admin 23780 23779 0 13:02 ? 00:00:00 bash -c ps -ef | grep gsan

admin 23796 23780 0 13:02 ? 00:00:00 grep gsan

admin@origin1:~/>:

You can find the output of the mapall 'grep ERR /data01/cur/gsan.log' command in the attached log file.

I thought about rolling back to a checkpoint, but the latest successful CP is dated 06.14.2011. So, if possible, I would prefer not to roll back to that CP.

Thanks a lot
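As an illustration of how the mapall 'ps -ef | grep gsan' output above can be read, here is a small Python sketch that counts gsan processes per node and lists nodes that have none. It is only a sketch: the function name gsan_processes_by_node is hypothetical, the "(0.N) ssh ..." header format is taken from the output in this post, the gsan arguments are trimmed for brevity, and the 0.1 block in the sample is a hypothetical case of a node without gsan rather than what this system actually reported.

```python
import re

# mapall prints a header such as "(0.0) ssh -x admin@10.83.55.163 '...'" per node.
NODE_HEADER = re.compile(r"^\((\d+\.\d+)\)\s+ssh\b")


def gsan_processes_by_node(mapall_output):
    """Count real gsan processes per node, ignoring the grep and bash helper lines."""
    counts = {}
    node = None
    for raw in mapall_output.splitlines():
        line = raw.strip()
        header = NODE_HEADER.match(line)
        if header:
            node = header.group(1)
            counts[node] = 0
            continue
        if node is None or not line:
            continue
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[7].startswith("./gsan "):
            counts[node] += 1
    return counts


SAMPLE = """\
Using /usr/local/avamar/var/probe.xml
(0.0) ssh -x admin@10.83.55.163 'ps -ef | grep gsan'
admin 5801 5800 0 13:02 ? 00:00:00 bash -c ps -ef | grep gsan
admin 5817 5801 0 13:02 ? 00:00:00 grep gsan
admin 29274 1 0 Mar29 ? 00:00:00 ./gsan restart --runlevel=fullaccess
admin 29275 29274 3 Mar29 ? 4-11:40:31 ./gsan restart --runlevel=fullaccess
(0.1) ssh -x admin@10.83.55.164 'ps -ef | grep gsan'
admin 24433 24432 0 13:02 ? 00:00:00 bash -c ps -ef | grep gsan
admin 24449 24433 0 13:02 ? 00:00:00 grep gsan
"""

counts = gsan_processes_by_node(SAMPLE)
missing = [node for node, n in counts.items() if n == 0]
print(counts, missing)
```

Matching on the command column rather than on the whole line avoids counting the grep process itself, which also contains the word gsan.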

rpervan
2 Iron

Re: 1 Stripe Error

Thanks for the update.

Yes, the latest successful CP was made on 14.06.2011! Strange! Two CPs should be created per day.

OK, gsan is in fullaccess and that is fine.

Please switch to the "admin" account, load the keys, and create a CP with:

# cp_cron --duplog

Then use MCS to perform a "FULL" HFS check on this CP,

or you can perform both operations from the GUI if you want.

Please try this and let us know the status.
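To make the checkpoint gap concrete, the age of the only valid checkpoint can be worked out from its tag. The Python sketch below is illustrative only. The helper names and the 24-hour MAX_AGE threshold are hypothetical, and reading the tag as a UTC timestamp is an assumption based on the cplist line in this thread, where the tag reads 05:03:27 while the printed local time on an EEST host is 08:03:27.

```python
from datetime import datetime, timedelta, timezone


def checkpoint_time(tag):
    """Convert a tag such as 'cp.20110614050327' to an aware datetime (UTC assumed)."""
    return datetime.strptime(tag, "cp.%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def checkpoint_age(cplist_line, now):
    """Age of the checkpoint named in the first column of a cplist line."""
    tag = cplist_line.split()[0]
    return now - checkpoint_time(tag)


# The cplist line and the UTC time of the status.dpn run, both from this thread.
CPLIST_LINE = "cp.20110614050327 Tue Jun 14 08:03:27 2011 valid rol --- nodes 3/3 stripes 9518"
NOW = datetime(2011, 8, 15, 12, 37, 19, tzinfo=timezone.utc)

# Hypothetical alert threshold; the thread only says two CPs per day are expected.
MAX_AGE = timedelta(hours=24)

age = checkpoint_age(CPLIST_LINE, NOW)
print(age.days, age > MAX_AGE)
```

With the values from this thread, the checkpoint is 62 days old, so rolling back to it would discard roughly two months of backups.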

faltindal
1 Nickel

Re: 1 Stripe Error

Hi Rej

I tried the commands but was not successful. Yesterday morning, the second storage node went offline. We can ping the node, but gsan does not work. When I look at status.dpn, I see that the hfscheck process has still been terminating since Friday. The hfscheck_kill command could not stop the process. Actually, I could not clearly understand what is happening on the system.

Today, I will create a service request.

Thanks for your help

rpervan
2 Iron

Re: 1 Stripe Error


> Yesterday morning, second storage node has been offline.
> We can ping the node but gsan doesn't work. When I look to status.dpn I see the hfscheck process still terminating since friday.
You can try "restart.dpn --nodes=0.X" to restart the offline storage node and sync it back.

> Hfscheck_kill command could not stop the process. Actualy I couldn't understand clearly what is happening on the system
Probably some process is hung, and it is hard to decide exactly which one without remote assistance.

Please open an SR with the EMC team!
Regards,

.r

nielecn
1 Nickel

Re: 1 Stripe Error

Did you resolve the problem?
