Can't create Virtual Data Center in ECS Community Edition

Question

Hi all,

I'm having a problem with my recently built ECS Single-Node Community Edition.

I'm running on CentOS 7.1.1503, Docker version 1.4.1, and ECS version 2.1.0.0

When I try to create a Virtual Data Center via the GUI, the error is:

"Error 7000 (http: 500): An error occurred in the API Service. An error occurred in the API Service. Cause: error insertVdcInfo. Virtual Data Center creation failure may occur when Data Services has not completed initialization. Please retry after a few minutes."

When I try this, the virtual machine's CPU usage hits 100%, and stays there.

I have also tried via the command line, using the "step2" python script, with a similar result (I didn't record the error, but will recreate if it's important).

Has anyone else had the same difficulty? More importantly, does anyone know how to fix this, or at least where I can look to see what the actual problem is?

Cheers,

Ben

HackySack · Accepted Answer

Solved!

This is mainly for the Googlers that hit this page with the same errors.

How I solved my problem with ECS 2.1 single node deployment:

Use a hostname that doesn't have any special characters in it (think -, _ etc). Just plain old alpha-numeric.
Disable the firewall on the host, or pay VERY careful attention to the ports needed as described here.
BEFORE YOU RUN THE PYTHON SCRIPTS: change "modify_container_conf_func()" in the step1_ecs_singlenode_install.py file (lines 459-500) to fix up the file name issues. Look at the attached to see what I did, and here for the inspiration. There are three files that need to be modified (in the container file system):
1. /opt/storageos/conf/cm.object.properties
2. /opt/storageos/conf/common.object.properties
3. /opt/storageos/ecsportal/conf/application.conf
As of 18th November 2015, you will also need to change this line in main() from
docker_image_name = "{}:{}".format(imagename, imagetag)
to
docker_image_name = "{}:{}".format(args.imagename, args.imagetag)
After running the Step1 script, add the partition on your data disk to the /etc/fstab (or equivalent) file so it will auto-mount when you reboot the host.
If you're using both scripts, then in the step2_object_provisioning.py file, change the time.sleep call (around line 296) from
time.sleep(20 * 60)
to
time.sleep(180 * 60)
or,
If you're doing step 2 manually (GUI or CLI, doesn't matter), wait a long time between creating the Storage Pool (CreateObjectVArray & CreateDataStore & ) and creating the VDC (InsertVDC). A really long time. My data disk was 200GB, and this step took over 60 minutes. That's with 8 vCPUs & 128GB RAM. A good way to tell is to time the API login from the host console: "curl -i -k https:// :4443/login -u root:ChangeMe". During the DataStore create, that login took >4 minutes to return a token. Once it was done, <2 seconds.
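For anyone who would rather not sit there re-running curl by hand, here is a rough Python sketch of the same idea: time the login request and keep retrying until it comes back quickly. The function names, the 2 second threshold and the 60 second polling interval are my own assumptions for illustration, not anything taken from the ECS scripts, and the host in the usage comment is a placeholder.

```python
import base64
import ssl
import time
import urllib.request


def time_login(url, user, password, timeout=600):
    """Time one login request, roughly what `curl -i -k <url> -u user:pass` does."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # skip certificate checks, like curl -k
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
        resp.read()
    return time.monotonic() - start


def wait_until_fast(probe, threshold=2.0, interval=60, max_attempts=240,
                    sleep=time.sleep):
    """Call probe() until it reports a time under `threshold` seconds.

    Returns the attempt number that succeeded, or None if it never did.
    Connection errors and timeouts count as "not ready yet".
    """
    for attempt in range(1, max_attempts + 1):
        try:
            elapsed = probe()
        except OSError:  # URLError, HTTPError and socket timeouts are OSErrors
            elapsed = None
        if elapsed is not None and elapsed < threshold:
            return attempt
        sleep(interval)
    return None


# Usage (ECS_HOST is a placeholder for your own host name or IP):
# wait_until_fast(lambda: time_login("https://ECS_HOST:4443/login",
#                                    "root", "ChangeMe"))
```

The probe is passed in as a function so the waiting logic can be tried out without a live ECS node.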

That's it. Now, no doubt the good folk at EMC will sort out the problems with the step1 script soon, so you may not need any of this, other than the wait bit in step2. That is related closely to the size of your data disk.

Another tip that I found along the way. If you're playing around/suffering with failed deployments and find you are removing & re-deploying the container multiple times, you can achieve this without having to continually re-download the Docker package & ECS image. To do this, in the Step1 script, simply comment out the following lines:

In main()
- yum_func()
- package_install_func()
- docker_install_func()
- prep_file_func()
- docker_pull_func(docker_image_name)
In docker_cleanup_old_images()
- os.system("docker rmi -f $(docker images -q) 2>/dev/null")

Good luck

Ben

HackySack · Answer

OK, fixed this problem temporarily, but now I'm back to square one.

The VDC error went away once I fixed a DNS issue for this host. Once the host was able to resolve itself from the upstream DNS, I was able to create the VDC. At least, I think the two issues were related, or I may have just gotten lucky with the timing. I think this may have been caused by me using a hyphen in the hostname, as the Step1 script was adding some weird entries into the hosts file. Hyphen removed, no more weird entries.
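A quick way to sanity-check both of those things before running the Step1 script might look like the sketch below. The helper names are made up for illustration. I'm also assuming only the short host name (the part before the first dot) needs to be plain alphanumeric, and note that the resolution check goes through the system resolver, so an /etc/hosts entry will satisfy it just as well as upstream DNS.

```python
import socket


def is_plain_hostname(name):
    """True if the short host name contains only ASCII letters and digits."""
    short = name.split(".")[0]
    return short.isascii() and short.isalnum()


def resolves(name):
    """True if the system resolver can turn the name into an address."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


# Usage on the ECS host itself:
# host = socket.gethostname()
# print(host, is_plain_hostname(host), resolves(host))
```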

Anyway, I got my ECS up & running, but then foolishly rebooted the host. After that, I could no longer log into the GUI. API login seems to work OK (curl -i -k https://blahblah:4443 -u root:ChangeMe), but takes ~5 minutes to return. That doesn't feel right.

If I try to create the VDC from the Command Line, I get this: "The Data Services system is being initialized. Please try again later." I've waited over an hour since the Storage Pool create, which is 3x longer than suggested in the doco.

So far, I'm not having much luck with ECS.

Aaron_Peterson · Answer

We've seen this issue in the single-node as a result of a directory table setting not being correctly scaled for fewer than 3 nodes. Try these steps to see if it resolves the authentication problem:

1. Enter the container using nsenter -i -m -p -u -t `pidof blobsvc`

2. Clear all previous VNest files using sudo rm -rf /data/vnest/vnest-main/*

3. Change the values of "object.NumDirectoriesPerCoSForSystemDT" and "object.NumDirectoriesPerCoSForUserDT" from 128 to 32 in the file

/opt/storageos/conf/common.object.properties

4. Restart the portalsvc service or just exit the container and reboot the node entirely.

5. When the node restarts, log in and start the docker services:

service docker start

docker start ecsstandalone

6. Wait a couple of minutes for the container to initialize, then attempt to log in with curl again to ensure the authsvc service is started. If authentication completes successfully, you should be able to log into the GUI and run the provisioning script (step2_object_provisioning.py).
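If you'd rather script the edit in step 3 than do it by hand, a minimal Python sketch follows. It assumes the file uses plain key=value lines, and the function name is hypothetical. It works on the file contents as a string and reports how many lines it changed, so you can tell if the properties were not found with the expected value.

```python
import re

KEYS = (
    "object.NumDirectoriesPerCoSForSystemDT",
    "object.NumDirectoriesPerCoSForUserDT",
)


def scale_directory_tables(text, old="128", new="32"):
    """Rewrite the two directory table properties from `old` to `new`.

    Returns (new_text, number_of_lines_changed). Lines that do not match
    exactly `key=old` (spaces around '=' allowed) are left untouched.
    """
    changed = 0
    out = []
    for line in text.splitlines(keepends=True):
        for key in KEYS:
            pattern = (r"^(\s*" + re.escape(key) + r"\s*=\s*)"
                       + re.escape(old) + r"(\s*)$")
            m = re.match(pattern, line)
            if m:
                line = m.group(1) + new + m.group(2)
                changed += 1
                break
        out.append(line)
    return "".join(out), changed


# Usage inside the container (path from step 3 above):
# path = "/opt/storageos/conf/common.object.properties"
# with open(path) as f:
#     text, n = scale_directory_tables(f.read())
# if n == 2:
#     with open(path, "w") as f:
#         f.write(text)
```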

HackySack · Answer

Hi Aaron,

Thanks for looking at this for me. I've just tried that & received the following response:

HTTP/1.1 503 Service Temporarily Unavailable
Date: Wed, 18 Nov 2015 08:22:55 GMT
Content-Type: text/xml
Content-Length: 373
Connection: keep-alive
ETag: '5637b05b-175'

6503 Unable to connect to the service. The service is unavailable, try again later. The service is currently unavailable because a connection failed to a core component. Please contact an administrator or try again later. true

I've not seen that error before, but I'll do as it asks & try again later. As a positive, it responds much quicker. I should note that authentication works OK immediately following the deployment of the container. Once I create the Storage Pool though, it then gets really, really slow. The Curl call still returns a token, but it takes ~5 minutes for that to happen.

I did see that change to the NumDirectories property in the step1 python deployment script, however I think I've spotted a problem. The current version on GitHub has this line to copy the existing file from the container (line 470):

logger.info('Copy object properties files to host')
os.system( 'docker exec -t ecsstandalone cp /opt/storageos/conf/cm.object.properties /host/cm.object.properties1')

The script then tries to alter the file, but it's using a different name (Line 482):

logger.info('Modify Directory Table config for single node')
os.system( 'sed --expression='s/object.NumDirectoriesPerCoSForSystemDT=128/object.NumDirectoriesPerCoSForSystemDT=32/' --expression='s/object.NumDirectoriesPerCoSForUserDT=128/object.NumDirectoriesPerCoSForUserDT=32/'  /host/common.object.properties1  /host/common.object.properties')

Does that look like intended behaviour?

One more thing. The step1 script does a good job of partitioning the data disk, but doesn't add it to fstab. If you reboot the host, that's going to cause some issues for the container (I found this out the hard way). It might seem obvious to seasoned Linux people, but not so much for the rest of us. Maybe something to mention in the instructions, if it's not going to be included in the script?

Cheers,

Ben
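For the fstab point, here is a small sketch of a check you could run before rebooting: does the data disk's mount point appear in /etc/fstab? The function names are made up, and the mount point in the usage comment is only a placeholder, since the real one depends on what the step1 script set up on your host.

```python
def fstab_mount_points(text):
    """Return the mount point (second field) of each non-comment fstab line."""
    points = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            points.append(fields[1])
    return points


def is_in_fstab(mount_point, text):
    """True if mount_point is listed, ignoring any trailing slash."""
    wanted = mount_point.rstrip("/")
    return wanted in [p.rstrip("/") for p in fstab_mount_points(text)]


# Usage (MOUNT_POINT is a placeholder for your data disk's mount point):
# with open("/etc/fstab") as f:
#     print(is_in_fstab("MOUNT_POINT", f.read()))
```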

HackySack · Answer

After hitting the same problems over & over again, I tried something a little different. I rebuilt the CentOS host from scratch, and then re-installed the ECS stuff. Same problem after the re-build.

This entry, from the ssm.log, looks relevant:

2015-11-18 10:19:46,471 [TaskScheduler-SSManager-BFW-ParallelExecutor-000] ERROR ChunkClientImpl.java (line 167) chunk creation failed 2092b6bb-2a84-4892-be0a-06219b9b9dbc, write context: [writerId: 10.0.0.12-1095-DIRECTORY_TABLE-BFW-0-writer-0, cos: urn:storageos:VirtualArray:4ceee845-9a9f-45a9-8866-13d0bb875557, level: 1, type: DIRECTORY_TABLE, skipChunkSequenceCounter: false], error com.emc.storageos.data.object.exception.ObjectControllerException: create chunk 2092b6bb-2a84-4892-be0a-06219b9b9dbc failed, request id 0a00000c:15119e687d1:18f:458, serverIp 10.0.0.12, status ERROR_NO_STORAGE_DEVICE_FOUND sleep 5000 millis before retry

That error shows up every 5 seconds or so, which is what you'd expect from the 5000 millis back-off.
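To confirm that spacing rather than eyeballing it, something like the following sketch could pull the timestamps of the matching lines out of ssm.log and report the gaps between them. It assumes every relevant line starts with a timestamp in the same format as the entry above. The function names are hypothetical, and the log path in the usage comment is a placeholder.

```python
from datetime import datetime

MARKER = "ERROR_NO_STORAGE_DEVICE_FOUND"


def marker_timestamps(lines):
    """Return the leading timestamp of every line that contains MARKER."""
    stamps = []
    for line in lines:
        if MARKER not in line:
            continue
        try:
            # e.g. "2015-11-18 10:19:46,471" is the first 23 characters
            stamps.append(datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f"))
        except ValueError:
            continue  # line does not start with a timestamp; skip it
    return stamps


def gaps_seconds(stamps):
    """Seconds between each consecutive pair of timestamps."""
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Usage (SSM_LOG_PATH is a placeholder for wherever ssm.log lives):
# with open("SSM_LOG_PATH") as f:
#     print(gaps_seconds(marker_timestamps(f)))
```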

I've checked the mount, and my data disk is partitioned, formatted & mounted. How do I check to see if the container can find it?

JasonCwik · Answer

Thanks Ben.  We'll make sure to update the scripts with your changes.
