Mesos Marathon(ctl) Debugging - "Abnormal executor termination: unknown container" - docker

I'm still new to Mesos and am trying to figure out the best way to debug a Mesos application I'm developing. I'm getting the error "Abnormal executor termination: unknown container" in the web UI, and I'm unsure how to get a more descriptive error message to figure out what's going on. The error would seem to indicate that it can't find the Docker image, but I know for a fact it references the correct image, which is present on the host.
{
  "id": "pgprimary",
  "cmd": null,
  "cpus": 1,
  "mem": 128,
  "disk": 0,
  "instances": 1,
  "container": {
    "docker": {
      "image": "example/postgres:centos7-10.0-1.6.0",
      "network": "BRIDGE",
      "parameters": [{
        "key": "hostname",
        "value": "pgprimary"
      }],
      "portMappings": []
    },
    "type": "DOCKER",
    "volumes": [
      {
        "hostPath": "/mnt/nfsfileshare/pgdata",
        "containerPath": "/pgdata",
        "mode": "RW"
      }
    ]
  },
  "env": {
    "PG_MODE": "primary",
    "PG_USER": "testuser",
    "PG_PASSWORD": "testuser",
    "PG_DATABASE": "userdb",
    "PG_ROOT_PASSWORD": "password",
    "PG_PRIMARY_USER": "primaryuser",
    "PG_PRIMARY_PASSWORD": "password",
    "PG_PRIMARY_PORT": "5432"
  },
  "labels": {},
  "healthChecks": [
    {
      "protocol": "COMMAND",
      "command": {
        "value": "/usr/pgsql-10/bin/pg_isready --host=pgprimary.marathon.mesos"
      },
      "gracePeriodSeconds": 300,
      "intervalSeconds": 60,
      "timeoutSeconds": 20,
      "maxConsecutiveFailures": 3,
      "ignoreHttp1xx": false
    }
  ]
}
The command I'm using to deploy the Marathon app:
marathonctl -h http://10.0.2.15:8080 app create postgres.json

It's not the image that can't be found, but Docker itself: the Mesos agents have to be configured to use the Docker containerizer.
Specify the use of the Docker containerizer on each agent:
echo 'docker,mesos' > /etc/mesos-slave/containerizers
See "Provisioning Containers with the Docker Containerizer":
https://mesosphere.github.io/marathon/docs/native-docker.html
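A minimal sketch of that agent-side change, assuming the stock mesos-slave packaging with a systemd unit (unit and file names may differ on your setup); the Marathon docs also recommend raising the executor registration timeout so large images have time to pull:

# run on every Mesos agent
echo 'docker,mesos' | sudo tee /etc/mesos-slave/containerizers
echo '5mins' | sudo tee /etc/mesos-slave/executor_registration_timeout
sudo systemctl restart mesos-slave

# for a more descriptive error than the Marathon UI gives, check the failed
# task's sandbox stderr in the Mesos master UI (usually http://<master>:5050)
# and the agent log, e.g. journalctl -u mesos-slave -e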

Related

Unable to spin dockerized cassandra cluster on Mesosphere DC/OS

Does anyone have an idea how to create a Cassandra cluster on Mesosphere DC/OS using Docker?
The issue is that the Cassandra containers keep getting started again every few seconds.
It seems that Marathon is failing to get the health status of the newly created containers, because it keeps creating new ones continuously. In the DC/OS GUI service debug view it shows:
State: TASK_FAILED
Message: Container terminated with signal Broken pipe
Checking on the machine, the containers are up and running, yet new containers keep getting created every minute or two.
Why doesn't Marathon get the correct response from a container it has started successfully, so that it stops creating new ones?
I am sharing my current JSON configuration for the service.
Cassandra.json
{
  "id": "/cassandra",
  "acceptedResourceRoles": [
    "*"
  ],
  "backoffFactor": 1.15,
  "backoffSeconds": 1,
  "container": {
    "portMappings": [
      {
        "containerPort": 8000,
        "hostPort": 0,
        "protocol": "tcp",
        "servicePort": 10003,
        "name": "main"
      }
    ],
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "cassandra:3.9",
      "forcePullImage": false,
      "privileged": false,
      "parameters": []
    }
  },
  "cpus": 3,
  "disk": 10000,
  "instances": 1,
  "maxLaunchDelaySeconds": 300,
  "mem": 6000,
  "gpus": 0,
  "networks": [
    {
      "mode": "container/bridge"
    }
  ],
  "requirePorts": false,
  "upgradeStrategy": {
    "maximumOverCapacity": 1,
    "minimumHealthCapacity": 1
  },
  "killSelection": "YOUNGEST_FIRST",
  "unreachableStrategy": {
    "inactiveAfterSeconds": 0,
    "expungeAfterSeconds": 0
  },
  "fetch": [],
  "constraints": []
}
DC/OS open source version 1.13
Marathon Version 1.8.194
Please help if anyone has an idea what's going on. I can share further details if needed.
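Worth noting: the definition above has no healthChecks stanza at all, so Marathon has no health status to report for these tasks. If a health check were added, a minimal TCP check against the first mapped port might look like this sketch (standard Marathon health-check fields; the values are illustrative only):

"healthChecks": [
  {
    "protocol": "MESOS_TCP",
    "portIndex": 0,
    "gracePeriodSeconds": 300,
    "intervalSeconds": 60,
    "timeoutSeconds": 20,
    "maxConsecutiveFailures": 3
  }
]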

Orphaned Tasks in Docker Swarm after removal of failed node

Last week I had to remove a failed node from my Docker Swarm cluster, leaving some tasks that ran on that node in the desired state "Remove".
Even after deleting the stack and recreating it with the same name, docker stack ps stackname still shows them.
Interestingly enough, after recreating the stack, the tasks are still there, but with no node assigned.
Here's what I tried so far to "cleanup" the stack:
Recreating the stack with the same name
docker container prune
docker volume prune
docker system prune
Is there a way to remove a specific task?
Here's the output for docker inspect fkgz0oihexzs, the first task in the list:
[
  {
    "ID": "fkgz0oihexzsjqwv4ju0szorh",
    "Version": {
      "Index": 14422171
    },
    "CreatedAt": "2018-11-05T16:15:31.528933998Z",
    "UpdatedAt": "2018-11-05T16:27:07.422368364Z",
    "Labels": {},
    "Spec": {
      "ContainerSpec": {
        "Image": "redacted",
        "Labels": {
          "com.docker.stack.namespace": "redacted"
        },
        "Env": [
          "redacted"
        ],
        "Privileges": {
          "CredentialSpec": null,
          "SELinuxContext": null
        },
        "Isolation": "default"
      },
      "Resources": {},
      "Placement": {
        "Platforms": [
          {
            "Architecture": "amd64",
            "OS": "linux"
          }
        ]
      },
      "Networks": [
        {
          "Target": "3i998stqemnevzgiqw3ndik4f",
          "Aliases": [
            "redacted"
          ]
        }
      ],
      "ForceUpdate": 0
    },
    "ServiceID": "g3vk9tgfibmcigmf67ik7uhj6",
    "Slot": 1,
    "Status": {
      "Timestamp": "2018-11-05T16:15:31.528892467Z",
      "State": "new",
      "Message": "created",
      "PortStatus": {}
    },
    "DesiredState": "remove"
  }
]
I had the same problem. I resolved it by following these instructions:
docker run --rm -v /var/run/docker/swarm/control.sock:/var/run/swarmd.sock dperny/tasknuke <taskid>
Be sure to use the full long task id or it will not work (fkgz0oihexzsjqwv4ju0szorh in your case).
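If the truncated ID is all you have, the full one can be recovered with a quick sketch like this (stackname is a placeholder):

# print the full task ID for a single task
docker inspect --format '{{.ID}}' fkgz0oihexzs
# or list all tasks of the stack with untruncated IDs
docker stack ps --no-trunc --format '{{.ID}}  {{.Name}}  {{.DesiredState}}' stackname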

How to directly mount an external NFS share/volume in Kubernetes (1.10.3)

I am using Kubernetes v1.10.3. I have an external NFS server which I am able to mount anywhere (on any physical machine). I want to mount this NFS share directly into a pod/container. I tried, but every time I get an error. I don't want to use privileged mode; please help me fix this.
ERROR: MountVolume.SetUp failed for volume "nfs" : mount failed: exit
status 32 Mounting command: systemd-run Mounting arguments:
--description=Kubernetes transient mount for /var/lib/kubelet/pods/d65eb963-68be-11e8-8181-00163eeb9788/volumes/kubernetes.io~nfs/nfs
--scope -- mount -t nfs 10.225.241.137:/stagingfs/alt/ /var/lib/kubelet/pods/d65eb963-68be-11e8-8181-00163eeb9788/volumes/kubernetes.io~nfs/nfs
Output: Running scope as unit run-43393.scope. mount: wrong fs type,
bad option, bad superblock on 10.225.241.137:/stagingfs/alt/, missing
codepage or helper program, or other error (for several filesystems
(e.g. nfs, cifs) you might need a /sbin/mount. helper program)
In some cases useful info is found in syslog - try dmesg | tail or so.
Mounting the share manually works: mount -t nfs 10.X.X.137:/stagingfs/alt /alt
I added the following two sections for the volume, but I get the error every time.
First:
"volumeMounts": [
  {
    "name": "nfs",
    "mountPath": "/alt"
  }
],
Second:
"volumes": [
  {
    "name": "nfs",
    "nfs": {
      "server": "10.X.X.137",
      "path": "/stagingfs/alt/"
    }
  }
],
--------------------- complete deployment JSON --------------------------------
{
  "kind": "Deployment",
  "apiVersion": "extensions/v1beta1",
  "metadata": {
    "name": "jboss",
    "namespace": "staging",
    "selfLink": "/apis/extensions/v1beta1/namespaces/staging/deployments/jboss",
    "uid": "6a85e235-68b4-11e8-8181-00163eeb9788",
    "resourceVersion": "609891",
    "generation": 2,
    "creationTimestamp": "2018-06-05T11:34:32Z",
    "labels": {
      "k8s-app": "jboss"
    },
    "annotations": {
      "deployment.kubernetes.io/revision": "2"
    }
  },
  "spec": {
    "replicas": 1,
    "selector": {
      "matchLabels": {
        "k8s-app": "jboss"
      }
    },
    "template": {
      "metadata": {
        "name": "jboss",
        "creationTimestamp": null,
        "labels": {
          "k8s-app": "jboss"
        }
      },
      "spec": {
        "volumes": [
          {
            "name": "nfs",
            "nfs": {
              "server": "10.X.X.137",
              "path": "/stagingfs/alt/"
            }
          }
        ],
        "containers": [
          {
            "name": "jboss",
            "image": "my.abc.com/alt:7.1_1.1",
            "resources": {},
            "volumeMounts": [
              {
                "name": "nfs",
                "mountPath": "/alt"
              }
            ],
            "terminationMessagePath": "/dev/termination-log",
            "terminationMessagePolicy": "File",
            "imagePullPolicy": "IfNotPresent",
            "securityContext": {
              "privileged": true
            }
          }
        ],
        "restartPolicy": "Always",
        "terminationGracePeriodSeconds": 30,
        "dnsPolicy": "ClusterFirst",
        "securityContext": {},
        "schedulerName": "default-scheduler"
      }
    },
    "strategy": {
      "type": "RollingUpdate",
      "rollingUpdate": {
        "maxUnavailable": "25%",
        "maxSurge": "25%"
      }
    },
    "revisionHistoryLimit": 10,
    "progressDeadlineSeconds": 600
  },
  "status": {
    "observedGeneration": 2,
    "replicas": 1,
    "updatedReplicas": 1,
    "readyReplicas": 1,
    "availableReplicas": 1,
    "conditions": [
      {
        "type": "Available",
        "status": "True",
        "lastUpdateTime": "2018-06-05T11:35:45Z",
        "lastTransitionTime": "2018-06-05T11:35:45Z",
        "reason": "MinimumReplicasAvailable",
        "message": "Deployment has minimum availability."
      },
      {
        "type": "Progressing",
        "status": "True",
        "lastUpdateTime": "2018-06-05T11:35:46Z",
        "lastTransitionTime": "2018-06-05T11:34:32Z",
        "reason": "NewReplicaSetAvailable",
        "message": "ReplicaSet \"jboss-8674444985\" has successfully progressed."
      }
    ]
  }
}
As stated in the error log:
for several filesystems (e.g. nfs, cifs) you might need a /sbin/mount. helper program
According to this question, you might be missing the nfs-common package, which you can install using sudo apt install nfs-common
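Side note: the package name differs by distribution, and it needs to be present on every node that can run the pod, since kubelet performs the NFS mount on the host itself:

# Debian/Ubuntu nodes
sudo apt install nfs-common
# RHEL/CentOS nodes
sudo yum install nfs-utils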

How to pull a Docker image that requires authorization with Marathon

I want to deploy a Docker container with Marathon. If the image doesn't require authorization, it can be pulled normally, but when I try to pull an image from a repository that requires authorization, the task deployment fails with the response:
Failed to launch container: Failed to run 'docker -H unix:///var/run/docker.sock pull example.com/web:laest': exited with status 1; stderr='Error response from daemon: repository example.com/web not found: does not exist or no pull access '
I changed the permissions of the /var/run/docker.sock file to 777 on the node and the master, but the issue still appears, so permissions don't seem to be the root cause. If I run "docker login" on the node and pull the image manually, the Marathon task then runs correctly. My Marathon JSON is below:
{
  "id": "/web",
  "cmd": "docker login --username='sam' --passwoer='123456' example.com/web:latest",
  "cpus": 0.3,
  "mem": 32,
  "disk": 0,
  "instances": 1,
  "env": {
    "EMAIL_USE_TLS": "False",
    "DATABASE_URI": "mysql://user:123456#RDS:3306/test"
  },
  "container": {
    "type": "DOCKER",
    "volumes": [
      {
        "containerPath": "/data/supervisor/",
        "hostPath": "/data/workspace/logs/supervisor/",
        "mode": "RW"
      }
    ],
    "docker": {
      "image": "daocloud.io/gizwits2015/gwaccounts:1.6.0",
      "network": "BRIDGE",
      "portMappings": [
        {
          "containerPort": 0,
          "hostPort": 0,
          "servicePort": 10000,
          "protocol": "tcp",
          "labels": {}
        }
      ],
      "privileged": false,
      "parameters": [
        {
          "key": "add-host",
          "value": "RDS:10.66.125.161"
        }
      ],
      "forcePullImage": false
    }
  },
  "portDefinitions": [
    {
      "port": 10000,
      "protocol": "tcp",
      "name": "default",
      "labels": {}
    }
  ]
}
How can I pull an image that requires authorization with Marathon?
You should read: https://mesosphere.github.io/marathon/docs/native-docker-private-registry.html
Follow step 1, and in step 2 replace the uris section with
"fetch" : [
{
"uri" : "https://path.to/file",
"extract" : true,
"outputFile" : "dockerConfig.tar.gz"
}
]
I've written a more detailed explanation here: http://blog.itaysk.com/2017/05/22/using-a-custom-private-docker-registry-with-marathon
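Step 1 of the linked doc boils down to packaging working Docker credentials so the Mesos fetcher can place them in the task sandbox; roughly, as a sketch (the path and the upload location are placeholders):

# on a machine where `docker login example.com` has already succeeded
cd ~
tar czf docker.tar.gz .docker
# upload docker.tar.gz somewhere the agents can reach it; that URL is what
# goes into the "uri" field of the fetch stanza above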

Kubernetes pod not binding volumes to container

I've got the following ReplicationController JSON defined:
{
  "id": "PHPController",
  "kind": "ReplicationController",
  "apiVersion": "v1beta1",
  "desiredState": {
    "replicas": 2,
    "replicaSelector": {"name": "php"},
    "podTemplate": {
      "desiredState": {
        "manifest": {
          "version": "v1beta1",
          "id": "PHPController",
          "volumes": [
            {"name": "wordpress", "path": "/mnt/nfs/wordpress_a", "hostDir": "/mnt/nfs/wordpress_a"}
          ],
          "containers": [{
            "name": "php",
            "image": "internaluser/php53",
            "ports": [{"containerPort": 80, "hostPort": 9021}],
            "volumeMounts": [{"name": "wordpress", "mountPath": "/mnt/nfs/wordpress_a"}]
          }]
        }
      },
      "labels": {"name": "php"}
    }
  },
  "labels": {"name": "php"}
}
The container starts correctly when run with "docker run -t -i -p 0.0.0.0:9021:80 -v /mnt/nfs/wordpress_a:/mnt/nfs/wordpress_a:rw internaluser/php53".
/mnt/nfs/wordpress_a is an NFS share, mounted on all of the minions. Each minion has full RW access and I have verified that the share is present.
After creating the pod containers with the Replication Controller, I can see that the volume was never actually bound, and/or incorrectly mounted:
"Volumes": {
"/mnt/nfs/wordpress_a": "/var/lib/docker/vfs/dir/8b5dc8477958f5c1b894e68ab9412b41e81a34ef16dac81f0f9d4884352a90b7"
},
"VolumesRW": {
"/mnt/nfs/wordpress_a": true
}
"HostConfig": {
"Binds": null,
"ContainerIDFile": "",
"LxcConf": null,
"Privileged": false,
"PortBindings": {
"80/tcp": [
{
"HostIp": "",
"HostPort": "9021"
}
]
},
I find it strange that the container believes /mnt/nfs/wordpress_a is mapped to "/var/lib/docker/vfs/dir/8b5dc8477958f5c1b894e68ab9412b41e81a34ef16dac81f0f9d4884352a90b7".
From the kubelet log:
Desired [10.101.4.15]: [{Namespace:etcd Name:c823da9e-4437-11e4-a3b1-0050568421eb Manifest:{Version:v1beta1 ID:c823da9e-4437-11e4-a3b1-0050568421eb UUID:c823da9e-4437-11e4-a3b1-0050568421eb Volumes:[{Name:wordpress Source:}] Containers:[{Name:php Image:internaluser/php53 Command:[] WorkingDir: Ports:[{Name: HostPort:9021 ContainerPort:80 Protocol:TCP HostIP:}] Env:[{Name:SERVICE_HOST Value:10.1.1.1}] Memory:0 CPU:0 VolumeMounts:[{Name:wordpress ReadOnly:false MountPath:/mnt/nfs/wordpress_a}] LivenessProbe: Lifecycle: Privileged:false}] RestartPolicy:{Always:0xa99a20 OnFailure: Never:}}}]
Does anyone have experience with this sort of thing? I've been driving myself crazy troubleshooting this. Thanks!
Solved. The volumes syntax was incorrect.
https://github.com/GoogleCloudPlatform/kubernetes/issues/1446
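For readers who hit the same wall, the fix was to nest the host directory under a source field instead of putting path/hostDir directly on the volume; reconstructed from memory of the long-deprecated v1beta1 API, so treat this as illustrative only:

"volumes": [{
  "name": "wordpress",
  "source": {
    "hostDir": {
      "path": "/mnt/nfs/wordpress_a"
    }
  }
}]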

Resources