Dask doesn't clean up context in Docker container

We have a Dask pipeline in which we basically use a LocalCluster as a process pool, i.e. we start the cluster with LocalCluster(processes=True, threads_per_worker=1), like so:
dask_cluster = LocalCluster(processes=True, threads_per_worker=1)
with Client(dask_cluster) as dask_client:
    exit_code = run_processing(input_file, dask_client, db_state).value
Our workflow and task parallelization work great when run locally. However, when we copy the code into a Docker container (CentOS-based), the processing completes but we sometimes get the following error as the container exits:
Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Furthermore, we get multiple instances of this error, which makes me think it is coming from abandoned worker processes. Our current working theory is that this is somehow related to the "Docker zombie reaping problem", but we don't know how to fix it without switching to a completely different base image, and we don't want to do that.
Is there a way to fix this using only Dask cluster/client cleanup methods?

You should create the cluster as a context manager as well; it is the cluster that actually launches the worker processes, not the Client.
with LocalCluster(...):
    ...
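A minimal sketch of how that could look with the snippet from the question, assuming run_processing, input_file, and db_state are defined as above; nesting both context managers ensures the worker processes are shut down cleanly before the container exits:
from dask.distributed import Client, LocalCluster

# Both the cluster and the client are context managers, so the worker
# processes are terminated and cleaned up when the block exits.
with LocalCluster(processes=True, threads_per_worker=1) as dask_cluster:
    with Client(dask_cluster) as dask_client:
        exit_code = run_processing(input_file, dask_client, db_state).value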

Related

Why do errors about a missing playwright module occur when the script is run from cron?

The script runs on a VPS. If I run the code without cron, everything is fine and the sites are parsed. As soon as I add a cron job, it fails with errors. Here is what my log shows:
`Traceback (most recent call last):
File "/apars/nobr.py", line 39, in <module>
povtor()
File "/apars/nobr.py", line 9, in povtor
x = obr_cnt()
File "/apars/obrzka_count.py", line 8, in obr_cnt
id_mobile = parser()
File "/apars/parser_new.py", line 4, in parser
from playwright.sync_api import sync_playwright
ModuleNotFoundError: No module named 'playwright'`
Although everything is installed on the Ubuntu host, and this is what pip3 list shows:
[screenshot of pip3 list output]
If I remove the entry for the script from cron, everything works again. Perhaps someone has run into the same problem and can help me.
cron is probably running the job as a different user, so it pulls installed modules from a different site-packages directory.
You can either install playwright for the root user, or schedule the cron job to run as your current user. The second is probably the better option.
The other possibility is that you are running your script in a virtual environment that cron doesn't have access to. Either way, the solution is to make sure that cron has access to all of the Python modules needed to run the script.
To schedule a job for a specific user, use crontab -u username -e.
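If it is unclear which interpreter and site-packages cron is actually using, one quick diagnostic (just a sketch; the script name and log path are examples, not from the question) is to schedule a tiny script that records that information:
# cron_env_check.py - log which Python environment the cron job runs under
import getpass
import sys

with open("/tmp/cron_env.log", "w") as f:
    f.write(f"user: {getpass.getuser()}\n")
    f.write(f"executable: {sys.executable}\n")
    f.write("sys.path:\n")
    for p in sys.path:
        f.write(f"  {p}\n")
If the executable or sys.path differ from what your interactive shell reports, point the crontab entry at the full path of the correct interpreter (for example the one inside your virtualenv).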

Gremlin docker server connection not working

I'm running a gremlin server using the official docker container:
docker run --rm -it -p 8182:8182 --name gremlin tinkerpop/gremlin-server
I then try to run the following script from the host machine:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
if __name__ == "__main__":
    g = traversal().withRemote(DriverRemoteConnection('ws://localhost:8182', 'g'))
    g.V().drop()
    g.V().addV('person')
    l = g.V().hasLabel('person')
    print(l.toList())
The connection seems to work (no errors), but the queries don't seem to be actually executed (the gremlin server statistics show no calls whatsoever).
Even more bizarre, the toList() call blocks execution and returns nothing. If I then stop the Docker container, the connection on the Python side drops.
I'm using the default settings for the gremlin server.
Could someone help me understand what's going on?
EDIT: I also tried changing the gremlin configuration host to 0.0.0.0.
EDIT: the reason it appears that only toList() waits for an answer is that the other queries are never actually executed; they need a terminal step such as .next().
It turns out there were two errors:
the address must end with /gremlin, so in my case 'ws://localhost:8182/gremlin'
when trying this, an exception appears which at first looks like a connection error:
RuntimeError: Event loop is closed
Exception ignored in: <function ClientResponse.__del__ at 0x7fb532031af0>
Traceback (most recent call last):
[..]
File "/usr/lib/python3.8/asyncio/selector_events.py", line 692, in close
File "/usr/lib/python3.8/asyncio/base_events.py", line 719, in call_soon
File "/usr/lib/python3.8/asyncio/base_events.py", line 508, in _check_closed
RuntimeError: Event loop is closed
this is actually not a connection error, but a warning that the connection was not properly closed. If you investigate, you will notice that the queries were in fact executed. The correct way to handle this is to write something along the lines of:
conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(conn)
[do your graph operations]
conn.close()
and with this there are no exceptions, life is good. I am quite surprised this does not appear in any documentation.
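Putting both fixes together, here is a minimal sketch of the corrected script (note that I also changed g.V().addV('person') to g.addV('person') so the vertex is created even when the graph is empty; the terminal steps .iterate() and .toList() are what actually submit the traversals):
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

if __name__ == "__main__":
    # The websocket URL must include the /gremlin path.
    conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
    g = traversal().withRemote(conn)

    g.V().drop().iterate()        # terminal step, actually executes the drop
    g.addV('person').iterate()    # create the vertex
    print(g.V().hasLabel('person').toList())

    conn.close()                  # close the connection so no warning is raised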

Stopping OrientDB service fails, ETL import not possible

My goal is to import data from CSV-files into OrientDB.
I use the OrientDB 2.2.22 Docker image.
When I try to execute the /orientdb/bin/oetl.sh config.json script within Docker, I get the error: "Can not open storage it is acquired by other process".
I guess this is because the OrientDB service is still running. But if I try to stop it, I get the next error:
./orientdb.sh stop
./orientdb.sh: return: line 70: Illegal number: root
or
./orientdb.sh status
./orientdb.sh: return: line 89: Illegal number: root
The only way I can use the ./oetl.sh script is to stop the Docker container and restart it in interactive mode running a shell, but this is awkward because to use OrientDB Studio I then have to stop Docker again and start it in normal mode.
As Roberto Franchini mentioned above, setting the dbURL parameter in the loader to use a remote URL fixed the first issue ("Can not open storage it is acquired by other process").
The issue with ./orientdb.sh still exists, but with the remote-URL approach I don't need to shut down and restart the service anymore.
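For reference, a rough sketch of what the loader section of the ETL config.json could look like with a remote URL; the database name and credentials below are placeholders, not values from the question:
"loader": {
    "orientdb": {
        "dbURL": "remote:localhost/mydb",
        "dbUser": "root",
        "dbPassword": "<your root password>",
        "dbType": "graph"
    }
}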

Image for Google Cloud Dataflow instances

When I run a Dataflow job, it takes my small package (setup.py or requirements.txt) and uploads it to run on the Dataflow instances.
But what is actually running on the Dataflow instance? I got a stacktrace recently:
File "/usr/lib/python2.7/httplib.py", line 1073, in _send_request
self.endheaders(body)
File "/usr/lib/python2.7/httplib.py", line 1035, in endheaders
self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 877, in _send_output
msg += message_body
TypeError: must be str, not unicode
[while running 'write to datastore/Convert to Mutation']
But in theory, if str += unicode is failing, doesn't that imply I might not be running this Python patch? Can you point me to the Docker image these jobs run on, so I can know what version of Python I'm working with and make sure I'm not barking up the wrong tree here?
The cloud console shows me the instance template, which seems to point to dataflow-dataflow-owned-resource-20170308-rc02, but it seems I don't have permission to look at it. Is the source for it online anywhere?
Haven't tested (and maybe there is an easier way), but something like this might do the trick:
ssh into one of the Dataflow workers from the console
run docker ps to get the container id
run docker inspect <container_id>
grab the image id from the field Image
run docker history --no-trunc <image>
Then you should find what you are after.
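As a rough, untested sketch, once you have SSHed into a worker the steps above chain together roughly like this (container and image IDs will of course differ on your worker):
# list running containers and note the ID of the SDK/worker container
docker ps

# inspect it and grab the value of the "Image" field
docker inspect <container_id> | grep '"Image"'

# show the layers of that image, which should reveal how it was built
docker history --no-trunc <image_id>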

Docker - Handling multiple services in a single container

I would like to start two different services in my Docker container and exit the container as soon as one of them exits. I looked at supervisor, but I can't find how to get it to quit as soon as one of the managed applications exits. It tries to restart them up to three times, as is the default setting, and then just sits there doing nothing. Is supervisor able to do this, or is there another tool for it? A bonus would be a way to let both managed programs write to stdout, tagged with their application name, e.g.:
[Program 1] Some output
[Program 2] Some other output
[Program 1] Output again
Since you asked if there was another tool: we designed and wrote a powerful replacement for supervisord that is built specifically for Docker. It automatically terminates when all applications quit, has special service settings to control this behavior, and redirects stdout with tagged, syslog-compatible output lines. It's open source and being used in production.
Here is a quick start for Docker: http://garywiz.github.io/chaperone/guide/chap-docker-simple.html
There is also a complete set of tested base-images which are a good example at: https://github.com/garywiz/chaperone-docker, but these might be overkill and the earlier quickstart may do the trick.
I found solutions to both of my requirements by reading through the docs some more.
Exit supervisord on application exit
This can be achieved by using a custom eventlistener. I had to add the following segment into my supervisord configuration file:
[eventlistener:shutdownevent]
command=/shutdownhandler.sh
events=PROCESS_STATE_EXITED
supervisord will start the referenced script and, when the given event is triggered (PROCESS_STATE_EXITED fires after one of the managed programs exits and is not automatically restarted), it will send a line containing data about the event on the script's stdin.
The referenced shutdownhandler-script contains:
#!/bin/bash
while :
do
echo -en "READY\n"
read line
kill $(cat /supervisord.pid)
echo -en "RESULT 2\nOK"
done
The script has to indicate that it is ready by sending "READY\n" on its stdout, after which it may receive an event data line on its stdin. In my use case, upon receipt of a line (meaning one of the managed programs has exited), a SIGTERM is sent to the supervisord process, which is found via the pid it leaves in its pid file (located in the root directory by default). For technical completeness, I also included a positive result reply for the eventlistener, though that one should never matter.
Tagged output on stdout
I did this by simply starting a tail process in the background before starting supervisord, tailing each program's output log and piping the lines through ts (from the moreutils package) to prepend a tag. This way the output shows up via docker logs with an easy way to see which program actually wrote each line.
tail -fn0 /var/log/supervisor/program1.log | ts '[Program 1]' &
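As a sketch of how this fits together, assuming both programs log to files under /var/log/supervisor/ and that ts from moreutils is installed, the container entrypoint could look roughly like this (the paths and the supervisord config location are assumptions, not taken from the question):
#!/bin/bash
# Sketch of an entrypoint: tag each program's log output, then run supervisord.
tail -fn0 /var/log/supervisor/program1.log | ts '[Program 1]' &
tail -fn0 /var/log/supervisor/program2.log | ts '[Program 2]' &

# Run supervisord in the foreground so the container lives until it exits
# (e.g. when the eventlistener above kills it).
exec supervisord -n -c /etc/supervisord.conf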
