Mike_Phelps

12 Posts

3411

March 20th, 2017 13:00

ECS 3 CE single node services unavailable / root file system full

I have seen mentions of out-of-space issues as well as an unresponsive web UI and services. I exec'd into the container and found that the root file system was full.

ecs0:/tmp # ls -lsa

total 52

0 drwxrwxrwt 5 root      root        192 Mar 20 19:45 .

4 drwxr-xr-x 28 root      root       4096 Feb 16 19:14 ..

0 -rw-r--r-- 1 storageos storageos     0 Mar 20 19:17 FPVMhealthcheck.heartbeat

0 -rw-r--r-- 1 root      root          0 Mar 20 19:17 certtool.lock

0 drwxr-xr-x 2 root      root         19 Mar 20 19:52 hsperfdata_root

0 drwxr-xr-x 2 storageos storageos    32 Mar 20 19:52 hsperfdata_storageos

0 drwx------ 2 root      root         94 Feb 16 19:30 run-crons.6yRf5Z

48 -rwxr-xr-x 1 storageos storageos 48432 Feb 16 19:55 snappy-1.0.5-libsnappyjava.so

0 -rw-r--r-- 1 root      root          0 Mar 20 19:17 systool.lock

ecs0:/tmp # cd run-crons.6yRf5Z/

ecs0:/tmp/run-crons.6yRf5Z # ls -la

total 8898796

drwx------ 2 root root         94 Feb 16 19:30 .

drwxrwxrwt 5 root root        192 Mar 20 19:45 ..

-rw-r--r-- 1 root root 9112358912 Mar 20 19:17 run-crons.hourly.16288

-rw-r--r-- 1 root root         32 Feb 16 19:30 run-crons_mail.16288

-rw-r--r-- 1 root root          0 Feb 16 19:30 run-crons_output.16288

ecs0:/tmp/run-crons.6yRf5Z # file run-crons.hourly.16288

run-crons.hourly.16288: ASCII text

ecs0:/tmp/run-crons.6yRf5Z # head run-crons.hourly.16288

timeout: invalid time interval 'find'

Try 'timeout --help' for more information.

timeout: invalid time interval 'find'

Try 'timeout --help' for more information.

timeout: invalid time interval 'find'

Try 'timeout --help' for more information.

timeout: invalid time interval 'find'

Try 'timeout --help' for more information.

timeout: invalid time interval 'find'

Try 'timeout --help' for more information.

ecs0:/tmp/run-crons.6yRf5Z # tail run-crons.hourly.16288

timeout: invalid time interval 'find'

Try 'timeout --help' for more information.

timeout: invalid time interval 'find'

Try 'timeout --help' for more information.

timeout: invalid time interval 'find'

Try 'timeout --help' for more information.

timeout: invalid time interval 'find'

Try 'timeout --help' for more information.

timeout: invalid time interval 'find'

Try 'tim

ecs0:/tmp/run-crons.6yRf5Z # rm run-crons.hourly.16288

Right or wrong, I removed the file rather than truncating it, but I wonder whether this is a bug to address, expected behavior, or a combination of the two. After exiting the container, I ran docker restart on it, and all the services began to function normally.
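On the remove-versus-truncate question: on Linux, deleting a file that a process still holds open only removes its name, and the blocks are not released until the last descriptor is closed. That may be part of why the restart was needed here, although nothing in this thread confirms it. Truncating in place releases the space right away. A minimal sketch on a scratch file, where the temp path is only a stand-in for run-crons.hourly.16288:

```shell
#!/bin/bash
# Scratch file standing in for the runaway cron output (hypothetical demo path).
LOG=$(mktemp)
for i in 1 2 3 4 5; do
    echo "timeout: invalid time interval 'find'"
done > "$LOG"

BEFORE=$(wc -c < "$LOG")

# Truncate in place: the file keeps its name and inode, but its size drops to zero.
: > "$LOG"

AFTER=$(wc -c < "$LOG")
echo "bytes before: $BEFORE, bytes after: $AFTER"

rm -f "$LOG"
```

Either approach only treats the symptom; the script that keeps writing the error still has to be fixed or stopped.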

Thoughts?

Thanks!! Mike

Responses(8)

Anonymous

5 Practitioner

274.2K Posts

0

April 7th, 2017 21:00

I ran into this also.

Inside the docker container, in file /etc/cron.hourly/btree_gc, here are lines relating to the error with line numbers:

26 ECS_ROOT=/opt/storageos

27 VNEST_PROP_FILE=$ECS_ROOT/conf/vnest.object.properties

51 SCRIPT_TIMEOUT_SECS=`grep 'BTreeGCScriptTimeoutSecs' $VNEST_PROP_FILE | awk -F\= '{print $2}'`

133 timeout ${SCRIPT_TIMEOUT_SECS} find -L ${RECYCLE_BIN} -type f -not -name '*.gz' -exec gzip -q -f {} \;

The problem is that the file referenced by the variable, /opt/storageos/conf/vnest.object.properties, has no entry named 'BTreeGCScriptTimeoutSecs', so SCRIPT_TIMEOUT_SECS ends up empty and timeout takes 'find' as its time interval.
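The failure can be reproduced outside ECS with a scratch properties file. This is only a sketch: the temp file, the object.SomeOtherSetting key, and the fallback to 7200 are illustrative assumptions and are not taken from the shipped btree_gc script.

```shell
#!/bin/bash
# Hypothetical stand-in for vnest.object.properties with the key missing.
PROP_FILE=$(mktemp)
echo 'object.SomeOtherSetting=1' > "$PROP_FILE"

# Same style of lookup as line 51 of btree_gc: empty when the key is absent.
SCRIPT_TIMEOUT_SECS=$(grep 'BTreeGCScriptTimeoutSecs' "$PROP_FILE" | awk -F= '{print $2}')

# Unquoted and empty, the variable disappears from the command line,
# so timeout receives 'find' where it expects a duration and fails.
ERR=$(timeout ${SCRIPT_TIMEOUT_SECS} find /tmp -maxdepth 0 2>&1) && RC=0 || RC=$?
echo "exit status: $RC"
echo "$ERR"

# One defensive option (an assumption, not the vendor's fix):
# fall back to a default when the lookup returns nothing.
SAFE_TIMEOUT=${SCRIPT_TIMEOUT_SECS:-7200}
timeout "$SAFE_TIMEOUT" find /tmp -maxdepth 0

rm -f "$PROP_FILE"
```

With GNU coreutils the captured message should match the one filling the cron output file; other timeout implementations may word it differently.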

I added it as below but have no idea what a good value should be.

echo 'object.BTreeGCScriptTimeoutSecs=5' >> /opt/storageos/conf/vnest.object.properties
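Running the echo above more than once appends duplicate entries, and the grep/awk lookup in btree_gc would then return several values. A small sketch of an idempotent alternative follows; set_prop is a hypothetical helper, and the demo runs against a scratch file rather than the real /opt/storageos/conf/vnest.object.properties.

```shell
#!/bin/bash
# set_prop FILE KEY VALUE: make sure FILE holds exactly one "KEY=VALUE" line.
# Hypothetical helper for illustration; not part of ECS.
set_prop() {
    local file=$1 key=$2 value=$3 tmp
    tmp=$(mktemp)
    # Keep every line whose key differs, then append the new entry.
    awk -F= -v k="$key" '$1 != k' "$file" > "$tmp"
    echo "${key}=${value}" >> "$tmp"
    # Overwrite the contents in place so the owner and mode of FILE are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demo on a scratch file (assumption: stands in for vnest.object.properties).
DEMO_FILE=$(mktemp)
echo 'object.SomeOtherSetting=1' > "$DEMO_FILE"

set_prop "$DEMO_FILE" object.BTreeGCScriptTimeoutSecs 5
set_prop "$DEMO_FILE" object.BTreeGCScriptTimeoutSecs 7200   # replaces, does not duplicate
cat "$DEMO_FILE"
```

The values 5 and 7200 are the ones mentioned in this thread; which value suits a given system is a separate question.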

I killed the following process, which had been running for a long time (probably since the first time it ran after building the system), and restarted crond via systemctl.

root 2759 24.8 0.0 12168 2212 ? RN Apr02 1786:40 /bin/bash /etc/cron.hourly/btree_gc

Afterwards I killed all cron-related processes, but I am not sure whether that was necessary.

This should prevent this issue.

Again, I am not sure what a good value should be - I was just looking to avoid filling up / in the container.

JasonCwik

281 Posts

0

March 21st, 2017 08:00

That's an interesting one. How long was your system up? What base OS are you installed on?

Mike_Phelps

12 Posts

0

March 21st, 2017 12:00

CentOS 7.3.1611 created on 2017 Feb 16

Docker Engine 1.13.1

emccorp/ecs-software-3.0.0:latest image id e3022d56bf25

4 vCPU, 32GB RAM, 80GB OS

4 1TB LUNs for ECS

Anonymous

5 Practitioner

274.2K Posts

0

April 11th, 2017 08:00

I agree that this is a bug... referencing a non-existent variable seems to be a silly oversight. I just implemented the same "fix" as you did, Chris, but I went crazy and set it to 10 seconds.

Anonymous

5 Practitioner

274.2K Posts

0

April 11th, 2017 09:00

https://github.com/EMCECS/ECS-CommunityEdition/tree/master/patches/3.0.0.1

It turns out there is a patch at the above URL - the setting no longer exists in real ECS.

Thanks to Jason for the pointer.

JasonCwik

281 Posts

1

April 11th, 2017 10:00

Chris, actually the opposite. The setting does exist in real ECS (7200) and our patch inadvertently removed it.

Anonymous

5 Practitioner

274.2K Posts

0

April 11th, 2017 12:00

Hi Jason, are you saying the timeout in real ECS is two hours? (7200s)

travis_wichert

16 Posts

0

July 21st, 2017 12:00

Yes, that time in real ECS is two hours (7200s).

Also, this issue should be resolved in ECS CE since we are no longer overwriting vnest.object.properties when building the CE image from the upstream ECS release artifact.
