Running Scrapy in a Docker container

I am setting up a new application which I would like to package with docker-compose. Currently, one container runs a Flask-Admin application that also exposes an API for interacting with the database. I will then have lots of scrapers that need to run once a day. These scrapers should scrape the data, reformat it, and send it to the API. I expect I should have another Docker container running for the scrapers.
Currently, on my local machine, I run scrapy runspider myspider.py to run each spider.
What would be the best way to have multiple scrapers in one container and have them scheduled to run at various points during the day?

You could configure the Docker container that has the scrapers to use cron to fire off the spiders at appropriate times. Here's an example: "Run a cron job with Docker".
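To sketch how that might look (the wrapper script, spider filenames, and schedule below are hypothetical, not from the question), cron inside the scraper container could call a small wrapper once a day, and the wrapper runs each spider in turn:

```python
# run_spiders.py -- hypothetical wrapper that the container's crontab calls.
import subprocess
import sys

SPIDERS = ["myspider.py", "otherspider.py"]  # standalone spider scripts (assumed names)


def main() -> int:
    exit_code = 0
    for spider in SPIDERS:
        # Equivalent of running "scrapy runspider myspider.py" by hand.
        result = subprocess.run(["scrapy", "runspider", spider])
        if result.returncode != 0:
            print(f"{spider} failed with exit code {result.returncode}", file=sys.stderr)
            exit_code = 1
    return exit_code


if __name__ == "__main__":
    sys.exit(main())
```

A crontab entry along the lines of 0 6 * * * python /app/run_spiders.py would then fire it at 06:00 each day (the path and time are just examples).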

Related

Can multiple Scrapy projects be run simultaneously with Docker?

So basically, on my remote server I have several Scrapy projects in different folders. Normally I run them consecutively, once a day at a specific time, as cron jobs.
Now I am considering increasing my server's capacity and want to run 3 scrapers simultaneously.
Can this be done by running them in different Docker containers? If so, how, and how can they be added to crontabs as well?
Thank you

How to spawn a Docker Container during a k6 cloud load test?

I have a load test that runs weekly on k6 Cloud. For some reason, I now need to spawn a Docker container that exposes an endpoint my load test must call in the setup phase. I'm stuck on this point. I've searched almost everywhere on the internet, but the only solution I found is hosting the Docker container somewhere so that it's always up and running and publicly reachable. I don't want to do that; I just want to spawn the container when I need it.

How to return logs to Airflow container from remotely called, Dockerized celery workers

I am working on a Dockerized Python/Django project that includes a container for Celery workers, into which I have been integrating the off-the-shelf Airflow Docker containers.
I have Airflow successfully running Celery tasks in the pre-existing container by instantiating a Celery app with the Redis broker and backend specified and making a remote call via send_task; however, none of the logging carried out by the Celery task makes it back to the Airflow logs.
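For reference, the kind of remote call described above looks roughly like this (a sketch; the broker/backend URLs and task name are placeholders, not the project's actual values):

```python
from celery import Celery

# Celery app pointed at the same Redis broker/result backend the workers use
# (URLs are placeholders).
app = Celery(
    "remote_caller",
    broker="redis://redis:6379/0",
    backend="redis://redis:6379/1",
)

# send_task dispatches by task name, so the task code does not need to be
# importable from the Airflow container.
result = app.send_task("myproject.tasks.process_data", args=[42])
print(result.get(timeout=300))  # optionally wait for the outcome via the backend
```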
Initially, as a proof of concept (I am completely new to Airflow), I set it up to run the same code by exposing it to the Airflow containers and creating Airflow tasks to run it on the Airflow Celery worker container. This did capture all the logging, but it's definitely not the way we want it architected, as it makes the Airflow containers very fat due to the duplication of dependencies from the Django project.
The documentation says "Most task handlers send logs upon completion of a task" but I wasn't able to find more detail that might give me a clue how to enable the same in my situation.
Is there any way to get these logs back to airflow when running the celery tasks remotely?
Instead of "returning the logs to Airflow", an easy-to-implement alternative (because Airflow natively supports it) is to activate remote logging. This way, all logs from all workers would end up e.g. on S3, and the webserver would automatically fetch them.
The following illustrates how to configure remote logging using an S3 backend. Other options (e.g. Google Cloud Storage, Elastic) can be implemented similarly.
Set remote_logging to True in airflow.cfg
Build an Airflow connection URI. This example from the official docs is particularly useful IMO; a programmatic version is sketched after these steps. One should end up with something like:
aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI%2FK7MDENG%2FbPxRfiCYEXAMPLEKEY@/?endpoint_url=http%3A%2F%2Fs3%3A4566%2F
It is also possible to create the connection through the webserver GUI, if needed.
Make the connection URI available to Airflow. One way of doing so is to ensure that the environment variable AIRFLOW_CONN_{YOUR_CONNECTION_NAME} is set. Example for a connection named REMOTE_LOGS_S3:
export AIRFLOW_CONN_REMOTE_LOGS_S3=aws://AKIAIOSFODNN7EXAMPLE:wJalrXUtnFEMI%2FK7MDENG%2FbPxRfiCYEXAMPLEKEY@/?endpoint_url=http%3A%2F%2Fs3%3A4566%2F
Set remote_log_conn_id to the connection name (e.g. REMOTE_LOGS_S3) in airflow.cfg
Set remote_base_log_folder in airflow.cfg to the desired bucket/prefix. Example:
remote_base_log_folder = s3://my_bucket_name/my/prefix
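As a sketch of steps 2 and 3 combined (the access key and secret are the well-known example values from the docs; the connection name, endpoint, and bucket are just examples), the URI can be generated programmatically and the remaining airflow.cfg settings supplied as environment variables:

```python
import json

from airflow.models.connection import Connection

# Build the connection and let Airflow render the URI (including URL-encoding).
conn = Connection(
    conn_id="remote_logs_s3",
    conn_type="aws",
    login="AKIAIOSFODNN7EXAMPLE",                         # example access key from the docs
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",  # example secret key from the docs
    extra=json.dumps({"endpoint_url": "http://s3:4566"}),
)
print(f"AIRFLOW_CONN_{conn.conn_id.upper()}='{conn.get_uri()}'")

# The airflow.cfg settings from the other steps can also be supplied as
# environment variables (the section is [logging] on Airflow 2.x, [core] on 1.10):
#   AIRFLOW__LOGGING__REMOTE_LOGGING=True
#   AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=REMOTE_LOGS_S3
#   AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=s3://my_bucket_name/my/prefix
```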
This related SO post goes deeper into remote logging.
If debugging is needed, looking at the worker logs locally (i.e., inside the worker container) should help.

Best Practices for Cron on Docker

I've been using Docker with cron for some time, but I'm not sure my setup is optimal. I have one cron container that runs about 12 different scripts. I can edit the schedule of the scripts, but in order to deploy a new version of the software (some scripts run for about half a day) I have to create a new container to run some of the scripts while the others finish.
I'm considering running one container per script (the containers would share everything in the image except the crontab), but this would still make it hard to coordinate updates across multiple containers that share some of the same code.
The other alternative I'm considering is running cron on the host machine, with each command being a docker run command. Doing this would let me choose the image for the next run via an environment variable in the crontab.
Does anybody have any experience with either of these two solutions? Are there any other solutions that could help?
If you are just running Docker standalone (a single host) and need to run a bunch of cron jobs without thinking too much about their impact on the host, then keeping it simple and running them on the host works just fine.
It makes sense to run them in Docker if you benefit from Docker features like limiting memory and CPU usage (so they can't do anything disruptive). If you also use a log driver that ships container logs to an external logging service, so you can easily monitor the jobs, that's another good reason. The last (but obvious) advantage is that deploying new software as a Docker image, instead of messing around on the host, is often a winner.
It's a lot cleaner to build one single image containing all the code you need. You then trigger docker run commands from the host's cron daemon and override the command/entrypoint. The container dies and deletes itself after the job is done (you might need to capture the container output to logs on the host, depending on which logging driver is configured). Try not to pass in config values or parameters you change often, so your cron setup stays as static as possible; it gets messy if a new image also means you have to edit your cron data on the host.
When you use docker run like this, you don't have to worry about updating images while jobs are running. Just make sure you tag them with, for example, latest, so that the next job picks up the new image.
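As an illustration of that pattern, here is a hedged sketch of a wrapper the host crontab could invoke; it uses the Docker SDK for Python (docker-py, mentioned further down this page) instead of the docker CLI, and the image name and command are assumptions:

```python
#!/usr/bin/env python3
# cron_job.py -- hypothetical wrapper called from the host's crontab,
# e.g. "0 2 * * * /usr/local/bin/cron_job.py nightly_report".
import sys

import docker
from docker.errors import ContainerError


def main() -> int:
    job = sys.argv[1]  # which script inside the image to run
    client = docker.from_env()
    try:
        output = client.containers.run(
            "mycompany/jobs:latest",          # "latest" so the next run picks up a new image
            command=["python", f"/app/{job}.py"],
            mem_limit="512m",                 # cap memory so a job can't disrupt the host
            remove=True,                      # the container deletes itself when done
        )
        print(output.decode(), end="")        # container stdout, for host-side logging
        return 0
    except ContainerError as exc:
        print(f"{job} failed: {exc}", file=sys.stderr)
        return 1


if __name__ == "__main__":
    sys.exit(main())
```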
Having 12 containers running in the background, each with its own cron daemon, also wastes some memory, but the worst part is that cron doesn't inherit environment variables from the parent process, so if you are injecting config with env vars you'll have to hack around that (write them to disk when the container starts, and so on).
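For what it's worth, that workaround usually looks something like the following entrypoint sketch (the file path and the cron daemon command are assumptions that depend on the base image):

```python
# entrypoint.py -- hypothetical entrypoint for a container that runs its own cron daemon.
# It dumps the container's environment to a file the cron jobs can load, since
# cron itself won't inherit these variables.
import json
import os

ENV_FILE = "/etc/container_env.json"  # assumed path

if __name__ == "__main__":
    with open(ENV_FILE, "w") as f:
        json.dump(dict(os.environ), f)
    # Hand over to the cron daemon in the foreground; the exact command depends
    # on the base image (e.g. "cron -f" on Debian, "crond -f" on Alpine).
    os.execvp("cron", ["cron", "-f"])
```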
If you're worried about jobs running in parallel, there are plenty of task scheduling services out there you can use, but that might be overkill for a single standalone Docker host.

Docker for Jobs

I am trying to understand if this is a valid use case for Docker.
Suppose I am creating a Python application, and the user submits a job of a specific type. I would then like to use docker run programmatically to start a Docker image that is designed to process the job. The image would need to receive the job and then exit, sending me a message that the job succeeded.
Does this make sense? How could this be done using Python? How can I observe Docker containers and their status?
Or would it make more sense for the Docker image to simply run in a loop, looking for jobs on a shared volume?
It's not a totally crazy idea, but you would most likely be better off having a single container that runs a server which listens for requests to run a "job" and then executes the relevant code. There's no need to start a container for each request if the image you'd run is the same every time; instead you just send an HTTP request.
If you really want to drive Docker from Python, you can use docker-py; Docker Compose is built on this library.
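If you do go the one-container-per-job route, a minimal sketch with docker-py might look like this (the image name, command, and job id are made up):

```python
import docker

client = docker.from_env()

# Start the job container detached; the job details are passed via env/args.
container = client.containers.run(
    "my-job-image:latest",
    command=["python", "process_job.py", "--job-id", "123"],
    environment={"JOB_ID": "123"},
    detach=True,
)

# Block until the job finishes, then inspect the outcome.
result = container.wait()         # e.g. {"StatusCode": 0, "Error": None}
logs = container.logs().decode()  # captured stdout/stderr
container.remove()

if result["StatusCode"] == 0:
    print("job succeeded:", logs)
else:
    print("job failed:", logs)
```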
