Mesos: Failed to get/update resource statistics for executor - docker

We are having issues with our mesos-agent logs filling up with messages like:
2018-06-19T07:31:05.247394+00:00 mesos-slave16 mesos-slave[10243]: W0619 07:31:05.244067 10249 slave.cpp:6750] Failed to get resource statistics for executor 'research_new-benchmarks_production_testbox-58-1529393461975-1-mesos_slave16' of framework Singularity-PROD: Failed to run 'docker -H unix:///var/run/docker.sock inspect mesos-7560fb72-28d3-4cce-8cb0-de889248cf93': exited with status 1; stderr='Error: No such object: mesos-7560fb72-28d3-4cce-8cb0-de889248cf93
or
2018-06-19T07:31:09.904414+00:00 mesos-slave16 mesos-slave[10243]: E0619 07:31:09.903687 10251 slave.cpp:4721] Failed to update resources for container b9a9f7f9-938b-4ec4-a245-331122471769 of executor 'hera_listening-api_production_checkAlert-93-1529393402085-1-mesos_slave16-us_west_2a' running task hera_listening-api_production_checkAlert-93-1529393402085-1-mesos_slave16 on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/14447/cgroup: Failed to open file: No such file or directory
We are running 3x HA mesos-masters with the Marathon and Singularity frameworks, and this happens with tasks from both frameworks. Tasks are running, and crons (from Singularity) are running fine too, but I am confused by these messages. We have more than 600 long-running Marathon tasks and more than 30 crons starting every few minutes.
Docker version: 18.03.0-ce
Mesos version: 1.4.0-2.0.1
Marathon version: 1.4.2-1.0.647.ubuntu1604
Singularity version: 0.15.1
Masters and slaves running on Ubuntu 16.04 with AWS kernel - 4.4.0-1060-aws
I think the mesos executor on the slave is deleted after the task finishes, but Mesos still tries to get info from Docker, where the container is no longer visible.
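For reference, the failing call can be reproduced by hand. A minimal sketch, reusing the container name from the warning above; the grep for the agent flag is just one way I know of to see whether --docker_remove_delay was changed from its default:
# run the same inspect the agent runs; a container that has already been
# removed produces the same "No such object" error as in the warning
docker -H unix:///var/run/docker.sock inspect mesos-7560fb72-28d3-4cce-8cb0-de889248cf93
# list mesos-launched containers that docker still knows about
docker ps -a --filter name=mesos- --format '{{.Names}} {{.Status}}'
# check whether the agent was started with a custom --docker_remove_delay
ps aux | grep mesos-slave | grep -o 'docker_remove_delay=[^ ]*'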
Any ideas? Thanks

Marathon is a scheduler framework for long-running tasks; even when a task exits successfully, it will keep re-scheduling it. Health checks are one of its important features. Maybe try Chronos, another framework that runs on Apache Mesos and is aimed at scheduled (cron-like) jobs.
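If you want to see what a Chronos job looks like, here is a minimal sketch posted to its ISO8601 endpoint (the host, port, job name, command, and schedule below are all made-up placeholders):
# hypothetical example job submitted to Chronos' scheduler API
curl -X POST -H 'Content-Type: application/json' \
  http://chronos-host:4400/scheduler/iso8601 -d '{
    "name": "example-cron",
    "command": "echo hello",
    "schedule": "R/2018-06-20T00:00:00Z/PT1H",
    "epsilon": "PT30M",
    "owner": "ops@example.com"
  }'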

Related

Docker service showing no such image when trying to upgrade service

First of all, sorry if my English is bad.
We have a service that we were able to upgrade until 26 September 2022, via Portainer or via the terminal on Docker. The image is in the GitLab registry.
We did not make any changes, but we are not able to upgrade it anymore!
How can we debug why this message is appearing?
No such image: registry.gitlab.com/xxxx/xxx/api:1.1.18#sha256:xxxx
Some additional information:
- We are using docker login before trying to do the service update.
- We can do docker pull registry.gitlab.com/etc/etc (the version).
- The problem only occurs when we try to upgrade it as a service.
There is some kind of debug on the service upgrade that can provide some additional information like firewall is blocking or something like this?
docker service update nameofservice
nameofservice
overall progress: 0 out of 1 tasks
1/1: preparing [=================================> ]
until it returns the error 'no such image'!
I am pretty sure the image exists.
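For anyone wanting more detail than the progress bar gives, these are the commands I know of to dig deeper (a sketch; --with-registry-auth only matters for a private registry like ours):
# show the per-task error message that the progress bar hides
docker service ps nameofservice --no-trunc
# show which image reference (including digest) the service is configured with
docker service inspect nameofservice --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}'
# re-send registry credentials to the swarm nodes along with the update
docker service update --with-registry-auth nameofservice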
If you are experiencing the same problem, check whether you have more nodes, physical machines or VMs, connected to your Docker swarm (docker node ls).
If that is your case, run docker pull gitlabaddressetcetc on the other nodes and check whether everything is fine.
I found the message 'No space left on device', so I ran 'df -h', but plenty of space was available to the VM. Anyway, I decided to run 'docker system prune -f' to see what would happen:
Running 'docker system prune -f' seems to have solved my problem, and everything is fine now.
After that I just needed to change the version in Portainer to an invalid one before trying again.
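For anyone checking disk usage before pruning, a sketch of the commands involved (assuming the default /var/lib/docker data root):
# disk usage of the filesystem backing the docker data root
df -h /var/lib/docker
# what docker itself is holding on to (images, containers, volumes, build cache)
docker system df
# remove stopped containers, dangling images, unused networks and build cache
docker system prune -f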

Getting error while running multiple docker containers of SonarQube

I'm trying to run 2 tasks of the same SonarQube container using an AWS ECS service (EC2 instances, not Fargate). Only 1 ECS instance is running, and I'm using EBS volumes for storing the sonar data and sonar extensions, mounted like this:
/opt/sonarqube/data
/opt/sonarqube/extensions
If I run just 1 ECS task (1 Docker container of SonarQube), the SonarQube application runs perfectly and I can access it. However, if I scale the service to an additional task (i.e. 2 Docker containers of SonarQube on the same ECS instance), I get the locking error below and one of the tasks never reaches the 'RUNNING' state:
2021-03-18 10:39:19 at org.elasticsearch.node.Node.<init>(Node.java:289) ~[elasticsearch-7.10.2.jar:7.10.2]
2021-03-18 10:39:19 2021.03.18 05:09:19 ERROR es[][o.e.b.ElasticsearchUncaughtExceptionHandler] uncaught exception in thread [main]
2021-03-18 10:39:19 org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/opt/sonarqube/data/es7]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
2021-03-18 10:39:19 at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:174) ~[elasticsearch-7.10.2.jar:7.10.2]
How can I make sure this issue does not come up and I can scale the service as and when required using the autoscaling feature?
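As far as I understand the error, the embedded Elasticsearch takes a node lock under /opt/sonarqube/data, so two containers sharing the same host path fight over the same lock. A plain-docker sketch just to illustrate the collision (the container names and the "-2" host paths are hypothetical, not my actual ECS task definition):
# two containers sharing the same host data dir: the second one fails with
# "failed to obtain node locks", exactly like the log above
docker run -d --name sonar-1 -v /opt/sonarqube/data:/opt/sonarqube/data -v /opt/sonarqube/extensions:/opt/sonarqube/extensions sonarqube
docker run -d --name sonar-2 -v /opt/sonarqube/data:/opt/sonarqube/data -v /opt/sonarqube/extensions:/opt/sonarqube/extensions sonarqube
# restarting the second container with its own directories avoids the lock collision
docker rm -f sonar-2
docker run -d --name sonar-2 -v /opt/sonarqube/data-2:/opt/sonarqube/data -v /opt/sonarqube/extensions-2:/opt/sonarqube/extensions sonarqube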
Cheers,

How can I launch the Kafka scheduler using Marathon in minimesos?

I'm trying to launch the kafka-mesos framework scheduler using the docker container as prescribed at https://github.com/mesos/kafka/tree/master/src/docker#running-image-in-marathon using the Marathon implementation running in minimesos (I would like to add a minimesos tag, but don't have the points). The app is registered and can be seen in the Marathon console but it remains in Waiting state and the Deployment GUI says that it is trying to ScaleApplication.
I've tried looking for /var/log files in the marathon and mesos-master containers that might show why this is happening. Initially I thought it might be because the image was not pulled, so I added "forcePullImage": true to the JSON app configuration, but it still waits. I've also changed the networking from HOST to BRIDGE, on the assumption that this is consistent with the minimesos caveats at http://minimesos.readthedocs.org/en/latest/ .
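For reference, the shape of the app JSON I am submitting is roughly the following (a sketch only; the app id, image name, resource numbers, and marathon-host are placeholders, not the actual kafka-mesos values):
# hypothetical kafka-scheduler.json; forcePullImage and the network mode are
# the two fields I changed
cat > kafka-scheduler.json <<'EOF'
{
  "id": "kafka-mesos-scheduler",
  "cpus": 0.5,
  "mem": 512,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "some-registry/kafka-mesos",
      "network": "BRIDGE",
      "forcePullImage": true
    }
  }
}
EOF
# submit the app through the Marathon REST API
curl -X POST -H 'Content-Type: application/json' http://marathon-host:8080/v2/apps -d @kafka-scheduler.json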
In the mesos log i do see:
I0106 20:07:15.259790 15 master.cpp:4967] Sending 1 offers to framework 5e1508a8-0024-4626-9e0e-5c063f3c78a9-0000 (marathon) at scheduler-575c233a-8bc3-413f-b070-505fcf138ece#172.17.0.6:39111
I0106 20:07:15.266100 9 master.cpp:3300] Processing DECLINE call for offers: [ 5e1508a8-0024-4626-9e0e-5c063f3c78a9-O77 ] for framework 5e1508a8-0024-4626-9e0e-5c063f3c78a9-0000 (marathon) at scheduler-575c233a-8bc3-413f-b070-505fcf138ece#172.17.0.6:39111
I0106 20:07:15.266633 9 hierarchical.hpp:1103] Recovered ports(*):[33000-34000]; cpus(*):1; mem(*):1001; disk(*):13483 (total: ports(*):[33000-34000]; cpus(*):1; mem(*):1001; disk(*):13483, allocated: ) on slave 5e1508a8-0024-4626-9e0e-5c063f3c78a9-S0 from framework 5e1508a8-0024-4626-9e0e-5c063f3c78a9-0000
I0106 20:07:15.266770 9 hierarchical.hpp:1140] Framework 5e1508a8-0024-4626-9e0e-5c063f3c78a9-0000 filtered slave 5e1508a8-0024-4626-9e0e-5c063f3c78a9-S0 for 2mins
I0106 20:07:16.261010 11 hierarchical.hpp:1521] Filtered offer with ports(*):[33000-34000]; cpus(*):1; mem(*):1001; disk(*):13483 on slave 5e1508a8-0024-4626-9e0e-5c063f3c78a9-S0 for framework 5e1508a8-0024-4626-9e0e-5c063f3c78a9-0000
I0106 20:07:16.261245 11 hierarchical.hpp:1326] No resources available to allocate!
I0106 20:07:16.261335 11 hierarchical.hpp:1421] No inverse offers to send out!
but I'm not sure whether this is relevant, since it does not correlate with the resource settings in the Kafka app config. The GUI shows that no tasks have been created.
I do have ten mesosphere/inky docker tasks running alongside the attempted Kafka deployment. This may be a configuration issue specific to the Kafka docker image; I just don't know the best way to debug it. Perhaps it is a case of increasing the log levels in a config file, or it may be an environment variable or network setting. I'm digging into it and will update my progress, but any suggestions would be appreciated.
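One thing I still plan to check is Marathon's launch queue and the app status over the REST API, which should say why it is still waiting (a sketch; the marathon host/port is whatever minimesos exposes, and <app-id> is a placeholder):
# apps waiting to be launched, with the delay information Marathon has for them
curl -s http://marathon-host:8080/v2/queue
# current status, tasks and deployments for a single app
curl -s http://marathon-host:8080/v2/apps/<app-id>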
thanks!
Thanks for trying this out! I am looking into this and you can follow progress on this issue at https://github.com/ContainerSolutions/minimesos/issues/188 and https://github.com/mesos/kafka/issues/172
FYI I got Mesos Kafka installed on minimesos via a quickstart shell script. See this PR on Mesos Kafka https://github.com/mesos/kafka/pull/183
It does not use Marathon and the minimesos install command yet. That is the next step.

Mesos killing tasks. Failed to determine cgroup for the 'cpu' subsystem

I'm running a bunch of services in Docker containers on Mesos (v0.22.1) via Marathon (v0.9.0), and sometimes Mesos kills tasks. Usually it happens to multiple services at once.
The log line related to this issue, from the mesos-slave.ERROR log:
Failed to update resources for container 949b1491-2677-43c6-bfcf-bae6b40534fc
of executor production-app-emails.15437359-a95e-11e5-a046-e24e30c7374f running task production-app-emails.15437359-a95e-11e5-a046-e24e30c7374f
on status update for terminal task,
destroying container: Failed to determine cgroup for the 'cpu' subsystem:
Failed to read /proc/21292/cgroup:
Failed to open file '/proc/21292/cgroup': No such file or directory
I'd strongly suggest updating your stack. Mesos 0.22.1 and Marathon 0.9.0 are quite outdated as of today; Mesos 0.26.0 and Marathon 0.13.0 are out.
Concerning your problem, have a look at
https://issues.apache.org/jira/browse/MESOS-1837
https://github.com/mesosphere/marathon/issues/994
The first one suggests fixes on the Mesos side (post 0.22.1), and the second indicates a lack of resources for the started containers.
Maybe try increasing the RAM for the specific containers, and if that doesn't help, update the Mesos stack, IMHO.
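If it helps, bumping the memory for one app is just a partial update through the Marathon REST API, after which Marathon redeploys its tasks; a sketch (the host is a placeholder, the app id is guessed from the task name in your log, and 1024 is an arbitrary number):
# raise the memory allocation for the app; Marathon restarts its tasks
curl -X PUT -H 'Content-Type: application/json' http://marathon-host:8080/v2/apps/production-app-emails -d '{"mem": 1024}'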

Unable to run rabbitmq using marathon mesos

I am unable to run RabbitMQ using the Marathon/Mesos framework. I have tried it with RabbitMQ images available on Docker Hub as well as a custom-built RabbitMQ Docker image. In the mesos-slave log I see the following error:
E0222 12:38:37.225500 15984 slave.cpp:2344] Failed to update resources for container c02b0067-89c1-4fc1-80b0-0f653b909777 of executor rabbitmq.9ebfc76f-ba61-11e4-85c9-56847afe9799 running task rabbitmq.9ebfc76f-ba61-11e4-85c9-56847afe9799 on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/13197/cgroup: Failed to open file '/proc/13197/cgroup': No such file or directory
Googling turned up one hit:
https://github.com/mesosphere/marathon/issues/632
I'm not sure whether this is the same issue I'm facing. Has anyone tried running RabbitMQ using Marathon/Mesos/Docker?
Looks like the process went away (likely crashed) before the container was set up. You should check stdout and stderr to see what happened, and fix the root issue.
"cmd": "", is the like'y culprit. I'd look at couchbase docker containers for a few clues on how to get it working.
