Debugging Elastic Beanstalk Docker run failures? - docker

I'm new to EB and AWS, and my docker images build fine but fail to run on Elastic Beanstalk. My suspicion is that they are not connecting to the database correctly, however, I'm not getting anything useful when I run "eb logs" from the commandline. Here are the errors:
{
"status": "FAILURE",
"api_version": "1.0",
"results": [
{
"status": "FAILURE",
"msg": "(TRUNCATED)...rrun.aws.json: No such file or directory
73927c49adff622a1a229d9369bdd80674d96d20f3eb99a9cdea786f4411a368
Docker container quit unexpectedly after launch: Docker container quit unexpectedly on Wed May 20 17:15:02 UTC 2015:.
Check snapshot logs for details.
Hook /opt/elasticbeanstalk/hooks/appdeploy/pre/04run.sh failed.
For more detail, check /var/log/eb-activity.log using console or EB CLI",
"returncode": 1,
"events": [
{
"msg": "Successfully pulled node:0.12.2-slim",
"severity": "TRACE",
"timestamp": 1432142064
},
{
"msg": "Successfully built aws_beanstalk/staging-app",
"severity": "TRACE",
"timestamp": 1432142094
},
{
"msg": "Docker container quit unexpectedly after launch: Docker container quit unexpectedly on Wed May 20 17:15:02 UTC 2015:. Check snapshot logs for details.",
"severity": "ERROR",
"timestamp": 1432142102
}
]
}
],
"truncated": "true"
}
And after the build completes:
[2015-05-20T17:15:02.694Z] INFO [8603] - [CMD-AppDeploy/AppDeployStage0/AppDeployPreHook/04run.sh] : Activity execution failed, because: cat: /var/app/current/Dockerrun.aws.json: No such file or directory
cat: /var/app/current/Dockerrun.aws.json: No such file or directory
73927c49adff622a1a229d9369bdd80674d96d20f3eb99a9cdea786f4411a368
Docker container quit unexpectedly after launch: Docker container quit unexpectedly on Wed May 20 17:15:02 UTC 2015:. Check snapshot logs for details. (ElasticBeanstalk::ExternalInvocationError)
caused by: cat: /var/app/current/Dockerrun.aws.json: No such file or directory
cat: /var/app/current/Dockerrun.aws.json: No such file or directory
73927c49adff622a1a229d9369bdd80674d96d20f3eb99a9cdea786f4411a368
Docker container quit unexpectedly after launch: Docker container quit unexpectedly on Wed May 20 17:15:02 UTC 2015:. Check snapshot logs for details. (Executor::NonZeroExitStatus)
The docker containers work locally, so what else can I do to figure out what's going wrong? I keep hearing about "snapshot logs" but where do I check these snapshot logs? Are they the output of what I'm already running "eb logs"?

I had this issue for a day or two. I managed to see the logs by going AWS Console > Elastic Beanstalk > Environment > ${YOUR_APPLICATION_ENV}
On the left pane;
Log > Request Logs > Download > Open in any text editor.
/var/log/eb-docker/containers/eb-current-app/
Follow the path and you will see the what causing the error and can fix it.

Assuming you have SSH access to the EC2 instance running your container, these are a few log files useful for debugging single container Docker instances in Beanstalk:
/tmp/docker_build.log
/tmp/docker_pull.log
/tmp/docker_run.log
In order to look at the error logs for the running process, first read the
/tmp/docker_run.log file. This file contains the Docker process id. Something like this:
c6ae58e4ad77e926f6a8230237acf95771c6b5d80d48fb1bc20591f964fd690c
The first few characters should match the process listed from the command docker ps. Use this value to find the corresponding log file in the following directory:
/var/log/eb-docker/containers/eb-current-app/
The format of the file name is eb-docker-ps-id-stdouterr.log

I had this issue when my containers were crashing because there was no traffic allowed between EBS and RDS. If you use any database try curling it. Also, you might want to try sudo docker logs CONTAINER_ID a try catching something useful. What might help also is trying to launch container manually from the instance. There's slight possibility something will come up.

Related

Run buildah within gitlab-ci

I want to use buildah from gitlab-ci, in order to build an image, run a container from it and do some tests against it.
My current gitlab-ci is:
tests:
tags:
- docker
image: quay.io/buildah/stable
stage: test
variables:
STORAGE_DRIVER: "vfs"
BUILDAH_FORMAT: "docker"
BUILDAH_ISOLATION: "rootless"
only:
refs:
- merge_requests
changes:
- **/*
script:
- buildah info --debug
- buildah unshare docker/test/run.sh
My runner is private gitlab runner, I don't want to change its configuration (to not break other CI).
The content of run.sh is:
#!/usr/bin/env bash
set -euo pipefail
container=$(buildah --ulimit nofile=8192 --name my-container from phusion/baseimage:bionic-1.0.0-amd64)
The error is:
level=warning msg="error reading allowed ID mappings: error reading subuid mappings for user \"root\" and subgid mappings for group \"root\": No subuid ranges found for user \"root\" in /etc/subuid" level=warning msg="Found no UID ranges set aside for user \"root\" in /etc/subuid." level=warning msg="Found no GID ranges set aside for user \"root\" in /etc/subgid." No buildah sali-container already exists... Package Sali Creating sali-container Completed short name "phusion/baseimage" with unqualified-search registries (origin: /etc/containers/registries.conf) Getting image source signatures Copying blob
sha256:36505266dcc64eeb1010bd2112e6f73981e1a8246e4f6d4e287763b57f101b0b Copying blob
sha256:1907967438a7f3c5ff54c8002847fe52ed596a9cc250c0987f1e2205a7005ff9 Copying blob
sha256:23884877105a7ff84a910895cd044061a4561385ff6c36480ee080b76ec0e771 Copying blob
sha256:2910811b6c4227c2f42aaea9a3dd5f53b1d469f67e2cf7e601f631b119b61ff7 Copying blob
sha256:bc38caa0f5b94141276220daaf428892096e4afd24b05668cd188311e00a635f Copying blob
sha256:53c90fd859186b7b770d65adcb6ae577d4c61133f033e628530b1fd8dc0af643 Copying blob
sha256:d039079bb3a9bf1acf69e7c00db0e6559a86148c906ba5dab06b67c694bbe87c Copying config
sha256:32c929dd2961004079c1e35f8eb5ef25b9dd23f32bc58ac7eccd72b4aa19f262 Writing manifest to image destination Storing signatures level=error msg="Error while applying layer: ApplyLayer
exit status 1 stdout: stderr: potentially insufficient UIDs or GIDs available in user namespace (requested 0:42 for /etc/gshadow): Check /etc/subuid and /etc/subgid: lchown /etc/gshadow: invalid argument" 4 errors occurred while pulling:
* Error initializing source docker://registry.fedoraproject.org/phusion/baseimage:bionic-1.0.0-amd64: Error reading manifest bionic-1.0.0-amd64 in registry.fedoraproject.org/phusion/baseimage: manifest unknown: manifest unknown
* Error initializing source docker://registry.access.redhat.com/phusion/baseimage:bionic-1.0.0-amd64: Error reading manifest bionic-1.0.0-amd64 in registry.access.redhat.com/phusion/baseimage: name unknown: Repo not found
* Error initializing source docker://registry.centos.org/phusion/baseimage:bionic-1.0.0-amd64: Error reading manifest bionic-1.0.0-amd64 in registry.centos.org/phusion/baseimage: manifest unknown: manifest unknown
* Error committing the finished image: error adding layer with blob "sha256:23884877105a7ff84a910895cd044061a4561385ff6c36480ee080b76ec0e771": ApplyLayer exit status 1 stdout: stderr: potentially insufficient UIDs or GIDs available in user namespace (requested 0:42 for /etc/gshadow): Check /etc/subuid and /etc/subgid: lchown /etc/gshadow: invalid argument level=error msg="exit status 125" level=error msg="exit status 125"
The result of buildah info --debug:
{
"debug": {
"buildah version": "1.18.0",
"compiler": "gc",
"git commit": "",
"go version": "go1.15.2"
},
"host": {
"CgroupVersion": "v1",
"Distribution": {
"distribution": "fedora",
"version": "33"
},
"MemFree": 9021378560,
"MemTotal": 15768850432,
"OCIRuntime": "runc",
"SwapFree": 0,
"SwapTotal": 0,
"arch": "amd64",
"cpus": 4,
"hostname": "runner-cvBUQadt-project-2197143-concurrent-0",
"kernel": "4.14.83+",
"os": "linux",
"rootless": false,
"uptime": "6391h 28m 15.45s (Approximately 266.29 days)"
},
"store": {
"ContainerStore": {
"number": 0
},
"GraphDriverName": "vfs",
"GraphOptions": [
"vfs.imagestore=/var/lib/shared"
],
"GraphRoot": "/var/lib/containers/storage",
"GraphStatus": {},
"ImageStore": {
"number": 0
},
"RunRoot": "/var/run/containers/storage"
}
}
I read other posts about the errors I had and came to this configuration, which is not enough. I choose buildah by thinking it would be easy to use from a CI as it is supposed to run rootless, but this is a real nightmare... I am poor lonesome developer and not a sysadmin, I don't understand how to setup linux for buildah... Can somebody help me?
Buildah is going to need to run as root or within a user namespace with sufficent UIDs to install files with different UID.
This looks like for some reason buildah thought it should run within a user namespace and then did not find root listed within the user namespace. This usually happens when you did not run with enough privileges.

Facing issue while deploying Docker images through AWS-Greengrass Connector Service

BACKGROUND:
We are trying to deploy App as a docker container through AWS-Greengrass Connector Service to the edge device (Running Greengrass core as container in Linux env).
We are configuring the greengrass group connector in cloud for docker app deployment.
ISSUES:
While deploying from AWS greengrass group (AWS cloud), we are able to see successful deployment message, but application is not getting deployed to the edge device (running greengrass core as container).
LOGS:
DockerApplicationDeploymentLog:
[2020-11-05T10:35:42.632Z][FATAL]-lambda_runtime.py:381,Failed to initialize Lambda runtime due to exception: "getgrnam(): name not found: 'docker'"
[2020-11-05T10:35:44.789Z][WARN]-ipc_client.py:162,deprecated arg port=8000 will be ignored
[2020-11-05T10:35:45.012Z][WARN]-ipc_client.py:162,deprecated arg port=8000 will be ignored
[2020-11-05T10:35:45.012Z][INFO]-docker_deployer.py:41,docker deployer starting up
[2020-11-05T10:35:45.012Z][INFO]-docker_deployer.py:45,checking inputs
[2020-11-05T10:35:45.012Z][INFO]-docker_deployer.py:52,docker group permissions
[2020-11-05T10:35:45.02Z][FATAL]-lambda_runtime.py:141,Failed to import handler function "handlers.function_handler" due to exception: "getgrnam(): name not found: 'docker'"
RuntimeSystemLog:
[2020-11-05T10:31:49.78Z][DEBUG]-Restart worker because it was killed. {"workerId": "8b0ee21d-e481-4d27-5e30-cb4d912547f5", "funcArn": "arn:aws:lambda:ap-south-1:aws:function:DockerApplicationDeployment:6"}
[2020-11-05T10:31:49.78Z][DEBUG]-Reserve worker. {"workerId": "8b0ee21d-e481-4d27-5e30-cb4d912547f5", "funcArn": "arn:aws:lambda:ap-south-1:aws:function:DockerApplicationDeployment:6"}
[2020-11-05T10:31:49.78Z][DEBUG]-Doing start attempt: {"Attempt count": 0, "workerId": "8b0ee21d-e481-4d27-5e30-cb4d912547f5", "funcArn": "arn:aws:lambda:ap-south-1:aws:function:DockerApplicationDeployment:6"}
[2020-11-05T10:31:49.78Z][DEBUG]-Creating directory. {"dir": "/greengrass/ggc/packages/1.11.0/var/lambda/8b0ee21d-e481-4d27-5e30-cb4d912547f5"}
[2020-11-05T10:31:49.78Z][DEBUG]-changed ownership {"path": "/greengrass/ggc/packages/1.11.0/var/lambda/8b0ee21d-e481-4d27-5e30-cb4d912547f5", "new uid": 121, "new gid": 121}
[2020-11-05T10:31:49.782Z][DEBUG]-Resolving environment variable {"Variable": "PYTHONPATH=/greengrass/ggc/deployment/lambda/arn.aws.lambda.ap-south-1.aws.function.DockerApplicationDeployment.6"}
[2020-11-05T10:31:49.79Z][DEBUG]-Resolving environment variable {"Variable": "PATH=/usr/bin:/usr/local/bin"}
[2020-11-05T10:31:49.799Z][DEBUG]-Resolving environment variable {"Variable": "DOCKER_DEPLOYER_DOCKER_COMPOSE_DESTINATION_FILE_PATH=/home/ggc_user"}
[2020-11-05T10:31:49.82Z][DEBUG]-Creating new worker. {"functionArn": "arn:aws:lambda:ap-south-1:aws:function:DockerApplicationDeployment:6", "workerId": "8b0ee21d-e481-4d27-5e30-cb4d912547f5"}
[2020-11-05T10:31:49.82Z][DEBUG]-Starting worker process. {"workerId": "8b0ee21d-e481-4d27-5e30-cb4d912547f5"}
[2020-11-05T10:31:49.829Z][DEBUG]-Worker process started. {"workerId": "8b0ee21d-e481-4d27-5e30-cb4d912547f5", "pid": 20471}
[2020-11-05T10:31:49.83Z][DEBUG]-Start work result: {"workerId": "8b0ee21d-e481-4d27-5e30-cb4d912547f5", "funcArn": "arn:aws:lambda:ap-south-1:aws:function:DockerApplicationDeployment:6", "state": "Starting", "initDurationSeconds": 0.012234454}
[2020-11-05T10:31:49.831Z][INFO]-Created worker. {"functionArn": "arn:aws:lambda:ap-south-1:aws:function:DockerApplicationDeployment:6", "workerId": "8b0ee21d-e481-4d27-5e30-cb4d912547f5", "pid": 20471}
[2020-11-05T10:31:53.155Z][DEBUG]-Received a credential provider request {"serverLambdaArn": "arn:aws:lambda:::function:GGTES", "clientId": "8b0ee21d-e481-4d27-5e30-cb4d912547f5"}
[2020-11-05T10:31:53.156Z][DEBUG]-WorkManager getting work {"workerId": "148f7a1a-168f-40a5-682d-92e00d56a5df", "funcArn": "arn:aws:lambda:::function:GGTES", "invocationId": "955c2c43-1187-4001-7988-4213b95eb584"}
[2020-11-05T10:31:53.156Z][DEBUG]-Successfully GET work. {"invocationId": "955c2c43-1187-4001-7988-4213b95eb584", "fromWorkerId": "148f7a1a-168f-40a5-682d-92e00d56a5df", "ofFunction": "arn:aws:lambda:::function:GGTES"}
[2020-11-05T10:31:53.156Z][DEBUG]-POST work result. {"invocationId": "955c2c43-1187-4001-7988-4213b95eb584", "ofFunction": "arn:aws:lambda:::function:GGTES"}
[2020-11-05T10:31:53.156Z][DEBUG]-WorkManager putting work result. {"workerId": "148f7a1a-168f-40a5-682d-92e00d56a5df", "invocationId": "955c2c43-1187-4001-7988-4213b95eb584"}
[2020-11-05T10:31:53.156Z][DEBUG]-WorkManager put work result successfully. {"workerId": "148f7a1a-168f-40a5-682d-92e00d56a5df", "invocationId": "955c2c43-1187-4001-7988-4213b95eb584"}
[2020-11-05T10:31:53.156Z][DEBUG]-Successfully POST work result. {"invocationId": "955c2c43-1187-4001-7988-4213b95eb584", "ofFunction": "arn:aws:lambda:::function:GGTES"}
[2020-11-05T10:31:53.157Z][DEBUG]-Handled a credential provider request {"clientId": "8b0ee21d-e481-4d27-5e30-cb4d912547f5"}
[2020-11-05T10:31:53.158Z][DEBUG]-GET work item. {"fromWorkerId": "148f7a1a-168f-40a5-682d-92e00d56a5df", "ofFunction": "arn:aws:lambda:::function:GGTES"}
[2020-11-05T10:31:53.158Z][DEBUG]-Worker timer doesn't exist. {"workerId": "148f7a1a-168f-40a5-682d-92e00d56a5df"}
Did you doublecheck to meet the requirments listed in
https://docs.aws.amazon.com/greengrass/latest/developerguide/docker-app-connector.html
https://docs.aws.amazon.com/greengrass/latest/developerguide/docker-app-connector.html#docker-app-connector-linux-user
I dont know this particular error, but it complains about some missing basic user/group settings:
[2020-11-05T10:35:42.632Z][FATAL]-lambda_runtime.py:381,Failed to initialize Lambda runtime due to exception: "getgrnam(): name not found: 'docker'"

Sharing files with ECS and EFS

Could you help me, please?
I'm trying to configure an ECS cluster to share files using EFS but I'm facing the following issue:
level=info time=2020-03-02T17:30:27Z msg="TaskHandler: Sending task change: TaskChange:
[arn:aws:ecs:us-east-1:959242800104:task/74086a36-c405-4248-8475-3234b011bee8 -> STOPPED, Known
Sent: NONE, PullStartedAt: 2020-03-02 17:30:27.661062367 +0000 UTC m=+3131.201879282,
PullStoppedAt: 2020-03-02 17:30:27.744492758 +0000 UTC m=+3131.285309673, ExecutionStoppedAt:
2020-03-02 17:30:27.913073824 +0000 UTC m=+3131.453890739,
arn:aws:ecs:us-east-1:959242800104:task/74086a36-c405-4248-8475-3234b011bee8 redmine -> STOPPED, Reason
CannotCreateContainerError: Error response from daemon: failed to mount local volume: mount
:/mnt/efs/redmine:/var/lib/docker/volumes/ecs-redmine-22-attachments-cee2f0e7e0ebc5f55000/_data,
data: addr=10.0.0.127,nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport:
no such file or directory, Known Sent: NONE] sent: false" module=task_handler_types.go
If I only declare a volume inside my ECS task, the container started normally but if I try to map the outside volume with the container folder the issue happens.
I followed this tutorial: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_efs.html and it seems the problem isn't in security groups but the container itself.
I'm using the alpine version of Redmine.
Follow the config snippets:
...
"mountPoints": [
{
"readOnly": null,
"containerPath": "/usr/src/redmine/files",
"sourceVolume": "attachments"
}
],
...
"volumes": [
{
"efsVolumeConfiguration": {
"fileSystemId": "fs-xxxxx",
"rootDirectory": "/mnt/efs/redmine"
},
"name": "attachments",
"host": null,
"dockerVolumeConfiguration": null
}
],
Thanks in advance.
The log says: "no such file or directory": Make sure the directory on efs exists before using it.
Other considerations:
You cannot use "efsVolumeConfiguration" with ECS-Fargate. Currently only for ECS-on-EC2 (Fargate support is in the making).
I followed those links in order to solve my problem. I thought that EFS is not ready to use in ECS.
I had to map EFS inside EC2 and after that I had access from docker container.
https://gist.github.com/duluca/ebcf98923f733a1fdb6682f111b1a832#update-your-cloud-formation-template
https://xiaoyunyang.github.io/post/a-complete-guide-to-deploying-your-web-app-to-amazon-web-service/#set-up-efs-with-your-containers

docker: Error creating container: 400 Client Error: Bad Request (\"invalid reference format\")"

While trying to build an awx image (Ansible works) for ppc64le, the following comes up:
TASK [image_build : Build AWX distribution using container] ***************************************************************************************************************************************************
fatal: [localhost -> localhost]: FAILED! => {"changed": false, "msg": "Error creating container: 400 Client Error: Bad Request (\"invalid reference format\")"}
to retry, use: --limit #/root/awx/installer/install.retry
PLAY RECAP ****************************************************************************************************************************************************************************************************
localhost : ok=10 changed=3 unreachable=0 failed=1
How can I see what really happens in the background? Any verbose docker logs that I can look at? The message itself is somewhat useless to me. I already set Ansible to verbose but this also was of no help.
Docker image names can only consist of lowercase (a-z) characters.
Either you are giving a un-supported image name or the variable(or paths) passed to the buid(or the container) cannot be resolved.
To enable debug logs, add "--debug" to docker daemon (/etc/systemd/system/multi-user.target.wants/docker.service for systemd based linux env)
For reference: https://docs.docker.com/config/daemon/#configure-the-docker-daemon

Docker container with status "Dead" after consul healthcheck runs

I am using consul's healthcheck feature, and I keep getting these these "dead" containers:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
20fd397ba638 progrium/consul:latest "\"/bin/bash -c 'cur 15 minutes ago Dead
What is exactly a "Dead" container? When does a stopped container become "Dead"?
For the record, I run progrium/consul + gliderlabs/registrator images + SERVICE_XXXX_CHECK env variables to do health checking. It runs a healthcheck script running an image every X secs, something like docker run --rm my/img healthcheck.sh
I'm interested in general to what "dead" means and how to prevent it from happening. Another peculiar thing is that my dead containers have no name.
this is some info from the container inspection:
"State": {
"Dead": true,
"Error": "",
"ExitCode": 1,
"FinishedAt": "2015-05-30T19:00:01.814291614Z",
"OOMKilled": false,
"Paused": false,
"Pid": 0,
"Restarting": false,
"Running": false,
"StartedAt": "2015-05-30T18:59:51.739464262Z"
},
The strange thing is that only every now and then a container becomes dead and isn't removed.
Thank you
Edit:
Looking at the logs, I found what makes the container stop fail:
Handler for DELETE /containers/{name:.*} returned error: Cannot destroy container 003876e41429013e46187ebcf6acce1486bc5011435c610bd163b159ba550fbc:
Driver aufs failed to remove root filesystem 003876e41429013e46187ebcf6acce1486bc5011435c610bd163b159ba550fbc:
rename /var/lib/docker/aufs/diff/003876e41429013e46187ebcf6acce1486bc5011435c610bd163b159ba550fbc
/var/lib/docker/aufs/ diff/003876e41429013e46187ebcf6acce1486bc5011435c610bd163b159ba550fbc-removing:
device or resource busy
Why does this happen?
edit2:
found this: https://github.com/docker/docker/issues/9665
Update March 2016: issue 9665 has just been closed by PR 21107 (for docker 1.11 possibly)
That should help avoid the "Driver aufs failed to remove root filesystem", "device or resource busy" problem.
Original answer May 2015
Dead is one if the container states, which is tested by Container.Start()
if container.removalInProgress || container.Dead {
return fmt.Errorf("Container is marked for removal and cannot be started.")
}
It is set Dead when stopping fails, in order to prevent that container to be restarting.
Amongst the possible cause of failure, see container.Kill().
It means kill -15 and kill -9 are both failing.
// 1. Send a SIGTERM
if err := container.killPossiblyDeadProcess(15); err != nil {
logrus.Infof("Failed to send SIGTERM to the process, force killing")
if err := container.killPossiblyDeadProcess(9); err != nil {
That usually mean, as the OP mention, a busy device or resource, preventing the process to be killed.
There are a lot of bugs caused by EBUSY, in particular when devicemapper is used.
There is a tracker bug for all of the EBUSY related issues.
see https://github.com/docker/docker/issues/5684#issuecomment-69052334

Resources