Orphaned Tasks in Docker Swarm after removal of failed node

Last week I had to remove a failed node from my Docker Swarm Cluster, leaving some tasks that ran on that node in desired state "Remove".
Even after deleting the stack and recreating it with the same name, docker stack ps stackname still shows them.
Interestingly enough, after recreating the stack, the tasks are still there, but with no node assigned.
Here's what I've tried so far to clean up the stack:
Recreating the stack with the same name
docker container prune
docker volume prune
docker system prune
Is there a way to remove a specific task?
Here's the output for docker inspect fkgz0oihexzs, the first task in the list:
[
{
"ID": "fkgz0oihexzsjqwv4ju0szorh",
"Version": {
"Index": 14422171
},
"CreatedAt": "2018-11-05T16:15:31.528933998Z",
"UpdatedAt": "2018-11-05T16:27:07.422368364Z",
"Labels": {},
"Spec": {
"ContainerSpec": {
"Image": "redacted",
"Labels": {
"com.docker.stack.namespace": "redacted"
},
"Env": [
"redacted"
],
"Privileges": {
"CredentialSpec": null,
"SELinuxContext": null
},
"Isolation": "default"
},
"Resources": {},
"Placement": {
"Platforms": [
{
"Architecture": "amd64",
"OS": "linux"
}
]
},
"Networks": [
{
"Target": "3i998stqemnevzgiqw3ndik4f",
"Aliases": [
"redacted"
]
}
],
"ForceUpdate": 0
},
"ServiceID": "g3vk9tgfibmcigmf67ik7uhj6",
"Slot": 1,
"Status": {
"Timestamp": "2018-11-05T16:15:31.528892467Z",
"State": "new",
"Message": "created",
"PortStatus": {}
},
"DesiredState": "remove"
}
]

I had the same problem. I resolved it by following these instructions:
docker run --rm -v /var/run/docker/swarm/control.sock:/var/run/swarmd.sock dperny/tasknuke <taskid>
Be sure to use the full long task id or it will not work (fkgz0oihexzsjqwv4ju0szorh in your case).
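If only the short ID from docker stack ps is at hand, the full ID can be read from the "ID" field of the docker inspect output. A minimal Python sketch, assuming the JSON shape shown in the question; the helper names full_task_id and inspect_task are hypothetical and not part of Docker or tasknuke:

```python
import json
import subprocess


def full_task_id(inspect_json: str, prefix: str) -> str:
    """Return the one full task ID in `docker inspect` output that starts with prefix."""
    tasks = json.loads(inspect_json)
    matches = [t["ID"] for t in tasks if t.get("ID", "").startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one task matching {prefix!r}, found {len(matches)}"
        )
    return matches[0]


def inspect_task(short_id: str) -> str:
    """Run `docker inspect <short_id>` (on a manager node) and return the raw JSON."""
    result = subprocess.run(
        ["docker", "inspect", short_id],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Sample trimmed from the question's output; on a real manager node you
    # would pass inspect_task("fkgz0oihexzs") instead of this literal.
    sample = '[{"ID": "fkgz0oihexzsjqwv4ju0szorh", "DesiredState": "remove"}]'
    print(full_task_id(sample, "fkgz0oihexzs"))
```

The function refuses ambiguous or missing matches on purpose, since passing the wrong ID to a tool that deletes tasks is not something you want to do by accident.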

Docker service isn't starting with "node is missing network attachments, ip addresses may be exhausted" with --generic-resource

I have a Docker swarm of two nodes: a manager node (an AWS instance) and a worker node (a multi-GPU rig on a desk next to me), both running Ubuntu 18.04 and Docker.io 19.03.6, build 369ce74a3c. On the worker node I set up the nvidia-docker runtime and tested it (it works). On the manager node I set up an overlay network, and now I'm trying to start a service with GPU access and join it to my overlay network, but no luck: the service fails to start with "assigned node no longer meets constraints". This is how I start the service:
docker service create --name=hw --constraint=node.id==xyriecy63n8995enp2mro0nvx --network=d9gqsljvmpy7 --generic-resource "gpu=1" busybox:latest sh -c "while true; do echo Hello; sleep 2; done"
And what status it has:
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
ur0uut7xq8qyjafejwt3xlbv4 hw.1 busybox:latest@sha256:d366a4665ab44f0648d7a00ae3fae139d55e32f9712c67accd604bb55df9d05a node-4 Ready Rejected less than a second ago "assigned node no longer meets constraints"
w83690e7dzcc56ahysp8s5xi9 \_ hw.1 busybox:latest@sha256:d366a4665ab44f0648d7a00ae3fae139d55e32f9712c67accd604bb55df9d05a node-4 Shutdown Rejected less than a second ago "node is missing network attachments, ip addresses may be exhausted"
Task details:
docker inspect ur0uut7xq8qyjafejwt3xlbv4
[
{
"ID": "ur0uut7xq8qyjafejwt3xlbv4",
"Version": {
"Index": 156466
},
"CreatedAt": "2020-10-13T06:53:54.822993602Z",
"UpdatedAt": "2020-10-13T06:54:00.063967596Z",
"Labels": {},
"Spec": {
"ContainerSpec": {
"Image": "busybox:latest@sha256:d366a4665ab44f0648d7a00ae3fae139d55e32f9712c67accd604bb55df9d05a",
"Args": [
"sh",
"-c",
"while true; do echo Hello; sleep 2; done"
],
"Init": false,
"DNSConfig": {},
"Isolation": "default"
},
"Resources": {
"Limits": {},
"Reservations": {
"GenericResources": [
{
"DiscreteResourceSpec": {
"Kind": "gpu",
"Value": 1
}
}
]
}
},
"Placement": {
"Constraints": [
"node.id==xyriecy63n8995enp2mro0nvx"
],
"Platforms": [
{
"Architecture": "amd64",
"OS": "linux"
},
{
"OS": "linux"
},
{
"OS": "linux"
},
{
"OS": "linux"
},
{
"Architecture": "arm64",
"OS": "linux"
},
{
"Architecture": "386",
"OS": "linux"
},
{
"Architecture": "mips64le",
"OS": "linux"
},
{
"Architecture": "ppc64le",
"OS": "linux"
},
{
"Architecture": "s390x",
"OS": "linux"
}
]
},
"Networks": [
{
"Target": "d9gqsljvmpy7wjrxa5q09bgtb"
}
],
"ForceUpdate": 0
},
"ServiceID": "mef68axo6ztmu7ojkiwcxxj0a",
"Slot": 1,
"NodeID": "xyriecy63n8995enp2mro0nvx",
"Status": {
"Timestamp": "2020-10-13T06:53:59.979035656Z",
"State": "rejected",
"Message": "preparing",
"Err": "node is missing network attachments, ip addresses may be exhausted",
"ContainerStatus": {
"ContainerID": "",
"PID": 0,
"ExitCode": 0
},
"PortStatus": {}
},
"DesiredState": "shutdown",
"NetworksAttachments": [
{
"Network": {
"ID": "d9gqsljvmpy7wjrxa5q09bgtb",
"Version": {
"Index": 32157
},
"CreatedAt": "2020-10-12T13:39:55.061260869Z",
"UpdatedAt": "2020-10-12T13:39:55.062498427Z",
"Spec": {
"Name": "testnet",
"Labels": {},
"DriverConfiguration": {
"Name": "overlay"
},
"Attachable": true,
"IPAMOptions": {
"Driver": {
"Name": "default"
},
"Configs": [
{
"Subnet": "172.25.0.0/16",
"Gateway": "172.25.0.1"
}
]
},
"Scope": "swarm"
},
"DriverState": {
"Name": "overlay",
"Options": {
"com.docker.network.driver.overlay.vxlanid_list": "4097"
}
},
"IPAMOptions": {
"Driver": {
"Name": "default"
},
"Configs": [
{
"Subnet": "172.25.0.0/16",
"Gateway": "172.25.0.1"
}
]
}
},
"Addresses": [
"172.25.96.221/16"
]
}
],
"GenericResources": [
{
"NamedResourceSpec": {
"Kind": "gpu",
"Value": "GPU-50fd60c4"
}
}
]
}
]
My overlay network:
docker inspect d9gqsljvmpy7
[
{
"Name": "testnet",
"Id": "d9gqsljvmpy7wjrxa5q09bgtb",
"Created": "2020-10-12T13:39:55.061260869Z",
"Scope": "swarm",
"Driver": "overlay",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "172.25.0.0/16",
"Gateway": "172.25.0.1"
}
]
},
"Internal": false,
"Attachable": true,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": null,
"Options": {
"com.docker.network.driver.overlay.vxlanid_list": "4097"
},
"Labels": null
}
]
The service starts normally if either --network or --generic-resource is omitted. Starting it without --network and attaching the network afterwards also doesn't work.
I enabled debug logs on both nodes but didn't see anything suspicious other than the same error message:
Oct 12 13:40:45 node-4 dockerd[1166]: time="2020-10-12T13:40:45.975574449Z" level=error msg="fatal task error" error="node is missing network attachments, ip addresses may be exhausted" module=node/agent/taskmanager node.id=xyriecy63n8995enp2mro0nvx service.id=mef68axo6ztmu7ojkiwcxxj0a task.id=twcbj9emeopm2qfq0i7lwftbe
I also checked for network exhaustion with docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock docker/ip-util-check and, as expected, it finds nothing:
Overlay IP Utilization Report
----
Network testnet/d9gqsljvmpy7 has an IP address capacity of 65533 and uses 0 addresses spanning over 0 nodes
Network OK: network will have 49149 available IPs before passing the 75% subnet use
So, how can one start a GPU-tied service and attach it to an overlay network?
Apparently, there is no need to specify --generic-resource in my case. Without it, the service has access to all GPUs that were advertised to Docker via --node-generic-resource gpu=xxx. The downside is that you can't control the GPU count per service, but I can live with that.
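The figures in the ip-util-check report above are consistent with plain subnet arithmetic on 172.25.0.0/16, which supports the conclusion that the address pool was not really exhausted. A small Python sketch of that arithmetic; that exactly three addresses (network, broadcast, gateway) are excluded is an assumption inferred from the reported 65533, not taken from the tool's source, and the function names are hypothetical:

```python
import ipaddress


def overlay_capacity(subnet: str, reserved: int = 3) -> int:
    """Usable addresses in subnet after subtracting the reserved ones.

    reserved=3 assumes the network, broadcast and gateway addresses are
    excluded (an assumption chosen to match the report, not a documented rule).
    """
    return ipaddress.ip_network(subnet).num_addresses - reserved


def headroom(capacity: int, used: int, threshold: float = 0.75) -> int:
    """Addresses still available before usage passes threshold of capacity."""
    return int(capacity * threshold) - used


if __name__ == "__main__":
    cap = overlay_capacity("172.25.0.0/16")
    print(cap)               # 65533, the capacity shown in the report
    print(headroom(cap, 0))  # 49149, the headroom shown in the report
```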

How to directly mount external NFS share/volume in kubernetes(1.10.3)

I am using Kubernetes v1.10.3. I have an external NFS server that I can mount from any physical machine. I want to mount this NFS share directly into a pod/container, but every attempt fails with the error below. I don't want to use privileged mode. Kindly help me fix this.
ERROR: MountVolume.SetUp failed for volume "nfs" : mount failed: exit
status 32 Mounting command: systemd-run Mounting arguments:
--description=Kubernetes transient mount for /var/lib/kubelet/pods/d65eb963-68be-11e8-8181-00163eeb9788/volumes/kubernetes.io~nfs/nfs
--scope -- mount -t nfs 10.225.241.137:/stagingfs/alt/ /var/lib/kubelet/pods/d65eb963-68be-11e8-8181-00163eeb9788/volumes/kubernetes.io~nfs/nfs
Output: Running scope as unit run-43393.scope. mount: wrong fs type,
bad option, bad superblock on 10.225.241.137:/stagingfs/alt/, missing
codepage or helper program, or other error (for several filesystems
(e.g. nfs, cifs) you might need a /sbin/mount.<type> helper program)
In some cases useful info is found in syslog - try dmesg | tail or so.
NFS server : mount -t nfs 10.X.X.137:/stagingfs/alt /alt
I added two things for the volume, but I get the error every time.
First:
"volumeMounts": [
{
"name": "nfs",
"mountPath": "/alt"
}
],
Second:
"volumes": [
{
"name": "nfs",
"nfs": {
"server": "10.X.X.137",
"path": "/stagingfs/alt/"
}
}
],
Complete manifest (JSON):
{
"kind": "Deployment",
"apiVersion": "extensions/v1beta1",
"metadata": {
"name": "jboss",
"namespace": "staging",
"selfLink": "/apis/extensions/v1beta1/namespaces/staging/deployments/jboss",
"uid": "6a85e235-68b4-11e8-8181-00163eeb9788",
"resourceVersion": "609891",
"generation": 2,
"creationTimestamp": "2018-06-05T11:34:32Z",
"labels": {
"k8s-app": "jboss"
},
"annotations": {
"deployment.kubernetes.io/revision": "2"
}
},
"spec": {
"replicas": 1,
"selector": {
"matchLabels": {
"k8s-app": "jboss"
}
},
"template": {
"metadata": {
"name": "jboss",
"creationTimestamp": null,
"labels": {
"k8s-app": "jboss"
}
},
"spec": {
"volumes": [
{
"name": "nfs",
"nfs": {
"server": "10.X.X.137",
"path": "/stagingfs/alt/"
}
}
],
"containers": [
{
"name": "jboss",
"image": "my.abc.com/alt:7.1_1.1",
"resources": {},
"volumeMounts": [
{
"name": "nfs",
"mountPath": "/alt"
}
],
"terminationMessagePath": "/dev/termination-log",
"terminationMessagePolicy": "File",
"imagePullPolicy": "IfNotPresent",
"securityContext": {
"privileged": true
}
}
],
"restartPolicy": "Always",
"terminationGracePeriodSeconds": 30,
"dnsPolicy": "ClusterFirst",
"securityContext": {},
"schedulerName": "default-scheduler"
}
},
"strategy": {
"type": "RollingUpdate",
"rollingUpdate": {
"maxUnavailable": "25%",
"maxSurge": "25%"
}
},
"revisionHistoryLimit": 10,
"progressDeadlineSeconds": 600
},
"status": {
"observedGeneration": 2,
"replicas": 1,
"updatedReplicas": 1,
"readyReplicas": 1,
"availableReplicas": 1,
"conditions": [
{
"type": "Available",
"status": "True",
"lastUpdateTime": "2018-06-05T11:35:45Z",
"lastTransitionTime": "2018-06-05T11:35:45Z",
"reason": "MinimumReplicasAvailable",
"message": "Deployment has minimum availability."
},
{
"type": "Progressing",
"status": "True",
"lastUpdateTime": "2018-06-05T11:35:46Z",
"lastTransitionTime": "2018-06-05T11:34:32Z",
"reason": "NewReplicaSetAvailable",
"message": "ReplicaSet \"jboss-8674444985\" has successfully progressed."
}
]
}
}
Regards
Anupam Narayan
As stated in the error log:
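for several filesystems (e.g. nfs, cifs) you might need a /sbin/mount.<type> helper program
According to this question, you might be missing the nfs-common package, which you can install using sudo apt install nfs-common. The mount is performed on the node by the kubelet, so the package has to be present on every node that can run the pod.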
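To confirm the diagnosis on a node before installing anything, you can check whether the helper the error message refers to exists. A minimal Python sketch, assuming the helper lives at /sbin/mount.nfs as the message suggests; the function name missing_mount_helpers is hypothetical:

```python
import os
import tempfile


def missing_mount_helpers(fs_types, sbin_dir="/sbin"):
    """Return the filesystem types whose mount.<type> helper is absent from sbin_dir."""
    return [
        fs for fs in fs_types
        if not os.path.isfile(os.path.join(sbin_dir, f"mount.{fs}"))
    ]


if __name__ == "__main__":
    # Demonstrated on a throwaway directory so the example runs anywhere;
    # on a real node, call missing_mount_helpers(["nfs"]) with the default /sbin.
    with tempfile.TemporaryDirectory() as fake_sbin:
        open(os.path.join(fake_sbin, "mount.nfs"), "w").close()
        print(missing_mount_helpers(["nfs", "cifs"], fake_sbin))
```

An empty result for ["nfs"] on a node means the helper is in place and the cause of the mount failure lies elsewhere.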

Mesos Marathon(ctl) Debugging - "Abnormal executor termination: unknown container"

I'm still new to Mesos, but am trying to figure out the best way to debug a Mesos application I'm attempting to develop. I'm getting the error message "Abnormal executor termination: unknown container" through the web application, and am unsure how to get more descriptive error messages to figure out what's going on. The error message would seem to indicate it can't find the Docker image, but I know for a fact it's referencing the correct image that is installed and running.
{
"id": "pgprimary",
"cmd": null,
"cpus": 1,
"mem": 128,
"disk": 0,
"instances": 1,
"container": {
"docker": {
"image": "example/postgres:centos7-10.0-1.6.0",
"network": "BRIDGE",
"parameters": [{
"key": "hostname",
"value": "pgprimary"
}],
"portMappings": [
]
},
"type": "DOCKER",
"volumes": [
{
"hostPath": "/mnt/nfsfileshare/pgdata",
"containerPath": "/pgdata",
"mode": "RW"
}
]
},
"env": {
"PG_MODE": "primary",
"PG_USER": "testuser",
"PG_PASSWORD": "testuser",
"PG_DATABASE": "userdb",
"PG_ROOT_PASSWORD": "password",
"PG_PRIMARY_USER": "primaryuser",
"PG_PRIMARY_PASSWORD": "password",
"PG_PRIMARY_PORT": "5432"
},
"labels": {},
"healthChecks": [
{
"protocol": "COMMAND",
"command": {
"value": "/usr/pgsql-10/bin/pg_isready --host=pgprimary.marathon.mesos"
},
"gracePeriodSeconds": 300,
"intervalSeconds": 60,
"timeoutSeconds": 20,
"maxConsecutiveFailures": 3,
"ignoreHttp1xx": false
}
]
}
The command I'm using to deploy the Marathon app:
marathonctl -h http://10.0.2.15:8080 app create postgres.json
It is Docker itself, not the image, that Marathon cannot find.
Configure the Mesos agent to use the Docker containerizer:
echo 'docker,mesos' > /etc/mesos-slave/containerizers
Provisioning Containers with the Docker Containerizer
https://mesosphere.github.io/marathon/docs/native-docker.html
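The relationship between the app definition and the agent setting can be sketched as a simple check: an app whose container "type" is "DOCKER" needs "docker" in the agent's containerizers list. A hedged Python sketch; both function names are hypothetical, and treating an app without a container section as needing the "mesos" containerizer is an assumption made for illustration:

```python
def enabled_containerizers(file_contents: str) -> list:
    """Parse the comma-separated value written to the containerizers file."""
    return [name.strip() for name in file_contents.split(",") if name.strip()]


def can_launch(app: dict, containerizers: list) -> bool:
    """True if the app's container type has a matching containerizer enabled.

    Apps without a "container" section are assumed to need "mesos"
    (an illustrative assumption, not taken from the Marathon docs).
    """
    container = app.get("container") or {}
    wanted = container.get("type", "MESOS").lower()
    return wanted in containerizers


if __name__ == "__main__":
    app = {"id": "pgprimary", "container": {"type": "DOCKER"}}
    print(can_launch(app, enabled_containerizers("mesos\n")))
    print(can_launch(app, enabled_containerizers("docker,mesos\n")))
```

With the file contents from the echo command above, the second check passes while the first (an agent with only the default containerizer) does not.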

How to pull a Docker image that requires authorization with Marathon

I want to deploy a Docker container with Marathon. If the image does not require authorization, it is pulled normally, but when I try to pull an image from a repository that requires authorization, the task deployment fails with this response:
Failed to launch container: Failed to run 'docker -H unix:///var/run/docker.sock pull example.com/web:laest': exited with status 1; stderr='Error response from daemon: repository example.com/web not found: does not exist or no pull access '
I changed the permissions of /var/run/docker.sock to 777 on both the node and the master, but the issue persists, so file permissions do not seem to be the root cause. If I run "docker login" on the node and pull the image manually, the Marathon task then runs correctly. My Marathon JSON is below:
{
"id": "/web",
"cmd": "docker login --username='sam' --passwoer='123456' example.com/web:latest",
"cpus": 0.3,
"mem": 32,
"disk": 0,
"instances": 1,
"env": {
"EMAIL_USE_TLS": "False",
"DATABASE_URI": "mysql://user:123456@RDS:3306/test"
},
"container": {
"type": "DOCKER",
"volumes": [
{
"containerPath": "/data/supervisor/",
"hostPath": "/data/workspace/logs/supervisor/",
"mode": "RW"
}
],
"docker": {
"image": "daocloud.io/gizwits2015/gwaccounts:1.6.0",
"network": "BRIDGE",
"portMappings": [
{
"containerPort": 0,
"hostPort": 0,
"servicePort": 10000,
"protocol": "tcp",
"labels": {}
}
],
"privileged": false,
"parameters": [
{
"key": "add-host",
"value": "RDS:10.66.125.161"
}
],
"forcePullImage": false
}
},
"portDefinitions": [
{
"port": 10000,
"protocol": "tcp",
"name": "default",
"labels": {}
}
]
}
How can I pull an image that requires authorization with Marathon?
You should read: https://mesosphere.github.io/marathon/docs/native-docker-private-registry.html
Follow step 1, and in step 2 replace the uris section with
"fetch" : [
{
"uri" : "https://path.to/file",
"extract" : true,
"outputFile" : "dockerConfig.tar.gz"
}
]
I've written a more detailed explanation here: http://blog.itaysk.com/2017/05/22/using-a-custom-private-docker-registry-with-marathon
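Note that a "docker login" placed in "cmd" cannot help, because "cmd" only runs inside the container after the image has already been pulled. The change described above can be sketched as a small transformation of the app definition. This is a minimal Python sketch, not Marathon tooling: the function name with_registry_credentials is hypothetical, and https://path.to/file is the placeholder from the answer that must point at your own credentials archive:

```python
import copy
import json


def with_registry_credentials(app: dict, archive_uri: str) -> dict:
    """Return a copy of app whose "fetch" list includes the credentials archive."""
    updated = copy.deepcopy(app)
    updated.pop("uris", None)  # the answer replaces the older "uris" section
    fetch = updated.setdefault("fetch", [])
    # Avoid adding the same archive twice if the function is applied again.
    if not any(entry.get("uri") == archive_uri for entry in fetch):
        fetch.append({
            "uri": archive_uri,
            "extract": True,
            "outputFile": "dockerConfig.tar.gz",
        })
    return updated


if __name__ == "__main__":
    app = {"id": "/web", "container": {"type": "DOCKER"}}
    print(json.dumps(with_registry_credentials(app, "https://path.to/file"), indent=2))
```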

Kubernetes pod not binding volumes to container

I've got the following ReplicationController JSON defined:
{
"id": "PHPController",
"kind": "ReplicationController",
"apiVersion": "v1beta1",
"desiredState": {
"replicas": 2,
"replicaSelector": {"name": "php"},
"podTemplate": {
"desiredState": {
"manifest": {
"version": "v1beta1",
"id": "PHPController",
"volumes": [{ "name": "wordpress", "path": "/mnt/nfs/wordpress_a", "hostDir": "/mnt/nfs/wordpress_a"}],
"containers": [{
"name": "php",
"image": "internaluser/php53",
"ports": [{"containerPort": 80, "hostPort": 9021}],
"volumeMounts": [{"name": "wordpress", "mountPath": "/mnt/nfs/wordpress_a"}]
}]
}
},
"labels": {"name": "php"}
}},
"labels": {"name": "php"}
}
The container starts correctly when run with "docker run -t -i -p 0.0.0.0:9021:80 -v /mnt/nfs/wordpress_a:/mnt/nfs/wordpress_a:rw internaluser/php53".
/mnt/nfs/wordpress_a is an NFS share, mounted on all of the minions. Each minion has full RW access and I have verified that the share is present.
After creating the pod containers with the Replication Controller, I can see that the volume was never actually bound, and/or incorrectly mounted:
"Volumes": {
"/mnt/nfs/wordpress_a": "/var/lib/docker/vfs/dir/8b5dc8477958f5c1b894e68ab9412b41e81a34ef16dac81f0f9d4884352a90b7"
},
"VolumesRW": {
"/mnt/nfs/wordpress_a": true
}
"HostConfig": {
"Binds": null,
"ContainerIDFile": "",
"LxcConf": null,
"Privileged": false,
"PortBindings": {
"80/tcp": [
{
"HostIp": "",
"HostPort": "9021"
}
]
},
I find it strange that the container believes /mnt/nfs/wordpress_a is mapped to "/var/lib/docker/vfs/dir/8b5dc8477958f5c1b894e68ab9412b41e81a34ef16dac81f0f9d4884352a90b7".
From the kubelet log:
Desired [10.101.4.15]: [{Namespace:etcd Name:c823da9e-4437-11e4-a3b1-0050568421eb Manifest:{Version:v1beta1 ID:c823da9e-4437-11e4-a3b1-0050568421eb UUID:c823da9e-4437-11e4-a3b1-0050568421eb Volumes:[{Name:wordpress Source:}] Containers:[{Name:php Image:internaluser/php53 Command:[] WorkingDir: Ports:[{Name: HostPort:9021 ContainerPort:80 Protocol:TCP HostIP:}] Env:[{Name:SERVICE_HOST Value:10.1.1.1}] Memory:0 CPU:0 VolumeMounts:[{Name:wordpress ReadOnly:false MountPath:/mnt/nfs/wordpress_a}] LivenessProbe: Lifecycle: Privileged:false}] RestartPolicy:{Always:0xa99a20 OnFailure: Never:}}}]
Does anyone have experience with this sort of thing? I've been driving myself crazy troubleshooting this. Thanks!
Solved. The volumes syntax was incorrect.
https://github.com/GoogleCloudPlatform/kubernetes/issues/1446
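The kubelet log above shows Volumes:[{Name:wordpress Source:}], which suggests the flat "path"/"hostDir" keys were ignored and the volume ended up with no source, so Docker fell back to an anonymous volume. A hedged Python sketch of the corrected shape, assuming the v1beta1 schema of that era nested the host directory under "source" and "hostDir" with a "path" field; verify the exact field names against the linked issue, and note that nest_host_dir is a hypothetical helper:

```python
import json


def nest_host_dir(volume: dict) -> dict:
    """Convert a flat {"name", "hostDir": "<path>"} volume into the nested source form.

    The nested layout is an assumption about the v1beta1 API, not something
    stated in the thread itself.
    """
    return {
        "name": volume["name"],
        "source": {"hostDir": {"path": volume["hostDir"]}},
    }


if __name__ == "__main__":
    flat = {
        "name": "wordpress",
        "path": "/mnt/nfs/wordpress_a",
        "hostDir": "/mnt/nfs/wordpress_a",
    }
    print(json.dumps(nest_host_dir(flat), indent=2))
```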
