Can multiple Scrapy projects be run simultaneously by Docker - docker

So basically, on my remote server I have several Scrapy projects in different folders. Normally I run them consecutively at a specific time once a day as cron jobs.
Now I am considering increasing my server's capacity and want to run 3 scrapers simultaneously.
Can this be done by running them in different Docker containers? If so, how, and how can they be added to the crontab as well?
Thank you

Related

Trying to understand the difference between using Docker (Scheduler, Queue, Workers) vs. Docker (Airflow)

Please pardon me if this is a very amateur question, but after reading multiple threads, posts, references, etc., I still do not understand the differences.
My current understanding:
1st method)
A traditional Docker setup is composed of 3 containers:
Scheduler that manages the schedule of the jobs
Queue that manages the queue of the multiple jobs
Worker that does the work for each queue
I read this in: https://laravel-news.com/laravel-scheduler-queue-docker
2nd method)
Docker + Apache Airflow is composed of a single container that does the same as the 3 containers above:
Worker (Airflow: since in Airflow we can set both the scheduler and the queue)
I watched this tutorial: https://www.youtube.com/watch?v=vvr_WNzEXBE&t=575s
I first learnt from these two sources, among others, but I am confused about the following:
Since I can use docker-compose to build all the services, all I need is just 1 container (2nd method), and then I set up the scheduler that is already in Airflow to control the workflow, right? That would mean I do not need to create multiple containers as in the 1st method, which separates the tasks into different containers.
If the two are different, then what are the differences? I have tried to figure it out for days but still could not. I am sorry, I am new to this subject and still studying it.
Thank you.
I think you are pointing at a multi-node vs. single-node Airflow comparison. A multi-node setup gives you more computing power and higher availability for your Apache Airflow instance. You can run everything (webserver, scheduler and worker) on one machine/Docker instance, but if your project grows, you can build a cluster and scale your pipeline.
Actually, an Airflow instance can have a number of daemons (webserver, scheduler, workers) that work together to provide the full functionality of Airflow.
With multiple gunicorn workers you can take and execute more tasks from the queue in parallel/concurrently. On a single machine (depending on your use case and the number of cores) you can define this in {AIRFLOW_HOME}/airflow.cfg (for example workers=6).
Now, since the daemons are independent of each other, people distribute them across multiple nodes/instances (Docker containers in your case). So this is probably what you have seen.
Update:
About the tutorial links you shared
As you asked in the comment section: the YouTube tutorial you pointed to also uses one Docker container where everything runs; you aren't using multi-node there.
As for your first link (about Laravel scheduling), I am not sure, but it seems to use only one container as well.
How you link multiple nodes of Airflow in a multi-node setup
As an example, all your nodes use the same external database instance (MySQL, Postgres) and interact with it; similarly, they take tasks from the same queue (maybe an external/shared RabbitMQ cluster).
What the scheduler is for, and how to execute it
The scheduler is the thing that actually schedules the DAGs, for example running them weekly/daily/monthly etc. as you have declared. In essence, you have only one scheduler, and it is the workers that you need more of. That doesn't mean you can't have more than one of the other components: you may have two webservers, for example, but you need to handle the port difference and share metadata between them.
To run the scheduler, you just run airflow scheduler; once it is running successfully, it will pick up your DAGs and start executing them. For the first run it looks at start_date, and for subsequent runs it uses the schedule_interval you have defined in the DAG (that's why it is called the scheduler).
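To make the start_date / schedule_interval relationship a bit more concrete, here is a rough sketch in plain Python. It is only a simplified model of interval-based scheduling as I understand it (each run covers one interval and is triggered once that interval has ended), not Airflow's actual code, and the function name first_runs is made up for illustration.

```python
from datetime import datetime, timedelta


def first_runs(start_date, schedule_interval, count):
    """Return (logical_date, earliest_trigger_time) pairs for the first runs.

    Simplified model (an assumption, not Airflow internals): run k covers the
    interval that begins at start_date + k * schedule_interval and is
    triggered once that interval has ended.
    """
    runs = []
    for k in range(count):
        logical_date = start_date + k * schedule_interval
        runs.append((logical_date, logical_date + schedule_interval))
    return runs


if __name__ == "__main__":
    # A daily schedule starting on 1 January 2024.
    for logical, trigger in first_runs(datetime(2024, 1, 1), timedelta(days=1), 3):
        print(f"run for {logical:%Y-%m-%d} triggers at or after {trigger:%Y-%m-%d %H:%M}")
```

So under this model a daily DAG with a start_date of 1 January would first fire around the start of 2 January, which is something that tends to surprise people the first time.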

Port allocation when running build job in Jenkins

My project is structured in such a way that the build job in Jenkins is triggered by a push to Git. As part of my application logic, I spin up Kafka and Elasticsearch instances to be used in my test cases downstream.
The issue I have right now is that when a developer pushes their changes to Git, it triggers a build in Jenkins, which in turn runs our code and spawns a Kafka broker on localhost:9092 and Elasticsearch on localhost:9200.
When another developer, working on some other change at the same time, pushes their code, it triggers the build job again and tries to spin up another instance of Kafka/Elasticsearch, but fails with the exception “Port already in use”.
I am looking at options for handling this scenario.
Will running these instances inside Docker containers help to some extent? How do I handle the port issue in that case?
Yes, dockerizing these instances can indeed help, as you can spawn them multiple times.
You could create a Docker container per component, including your application, and then let them talk to each other by linking them or using docker-compose.
That way you would not have to expose the ports to the "outside" world but could keep them internal to the Docker environment, so you would not get the “Port already in use” error. The only problem in that case is memory: for example, if 100 pushes are made to the Git repo, you might run out of memory.
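One way to keep concurrent builds apart with docker-compose is to give each build its own project name via the -p option, so each build gets its own containers and default network. The sketch below only builds the command lines; the build-<id> naming scheme is an assumption of mine, and it presumes the compose file does not publish 9092/9200 on the host.

```python
import os


def compose_command(build_id, action):
    """Build a `docker compose` command scoped to one CI build.

    Each project name gets its own containers and default network, so two
    builds can both run Kafka on 9092 internally without colliding, as long
    as the compose file does not publish those ports on the host.
    """
    project = f"build-{build_id}"  # hypothetical naming scheme
    if action == "up":
        return ["docker", "compose", "-p", project, "up", "-d"]
    if action == "down":
        # -v also removes the volumes created for this build.
        return ["docker", "compose", "-p", project, "down", "-v"]
    raise ValueError(f"unknown action: {action}")


if __name__ == "__main__":
    # BUILD_NUMBER is set by Jenkins; "local" is a fallback for manual runs.
    build_id = os.environ.get("BUILD_NUMBER", "local")
    print(" ".join(compose_command(build_id, "up")))
    print(" ".join(compose_command(build_id, "down")))
```

The build would run the "up" command before the tests and the "down" command afterwards, which also helps with the memory concern since finished builds clean up after themselves.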

Best Practices for Cron on Docker

I transitioned to using Docker with cron some time ago, but I'm not sure my setup is optimal. I have one cron container that runs about 12 different scripts. I can edit the schedule of the scripts, but in order to deploy a new version of the software (some of the scripts run for about half a day) I have to create a new container to run some of the scripts while others finish.
One option I'm considering is running one container per script (the containers would share everything in the image but the crontab). But this would still make it hard to coordinate updates to multiple containers sharing some of the same code.
The other alternative I'm considering is running cron on the host machine and each command would be a docker run command. Doing this would let me update the next run image by using an environment variable in the crontab.
Does anybody have any experience with either of these two solutions? Are there any other solutions that could help?
If you are just running Docker standalone (single host) and need to run a bunch of cron jobs without thinking too much about their impact on the host, then keeping it simple and running them on the host works just fine.
It would make sense to run them in Docker if you benefit from Docker features like limiting memory and CPU usage (so they don't do anything disruptive). If you also use a log driver that writes container logs to some external logging service so you can easily monitor the jobs, then that's another good reason to do it. The last (but obvious) advantage is that deploying new software using a Docker image instead of messing around on the host is often a winner.
It's a lot cleaner to make one single image containing all the code you need. Then you trigger docker run commands from the host's cron daemon and override the command/entrypoint. The container will then die and delete itself after the job is done (you might need to capture the container output to logs on the host depending on what logging driver is configured). Try not to send in config values or parameters you change often so you keep your cron setup as static as possible. It can get messy if a new image also means you have to edit your cron data on the host.
When you use docker run like this you don't have to worry when updating images while jobs are running. Just make sure you tag them with for example latest so that the next job will use the new image.
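As a rough illustration of the host-cron approach, the sketch below generates crontab entries that each start a throwaway container from the same image with a different command. The image name myjobs:latest and the script names are hypothetical.

```python
import shlex


def cron_line(schedule, image, command):
    """Return one crontab entry that runs `command` in a throwaway container."""
    # --rm removes the container once the job has finished.
    docker_cmd = ["docker", "run", "--rm", image, *command]
    quoted = " ".join(shlex.quote(part) for part in docker_cmd)
    return f"{schedule} {quoted}"


if __name__ == "__main__":
    # Hypothetical image and script names, all due at 03:00 every day.
    for script in ("job_a.py", "job_b.py", "job_c.py"):
        print(cron_line("0 3 * * *", "myjobs:latest", ["python", script]))
```

Entries that share the same schedule are started by cron in the same minute, so the jobs run side by side in separate containers. Note that cron treats % specially in commands, which this sketch does not escape.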
Having 12 containers running in the background, each with its own cron daemon, also wastes some memory, but the worst part is that cron doesn't use the environment variables from the parent process, so if you are injecting config with env vars you'll have to hack around that mess (write them to disk when the container starts and such).
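For what it's worth, the "write them to disk" workaround could look roughly like the sketch below: at container start, dump the environment as shell export lines into a file that each cron job sources first. The function names are mine, and it assumes every variable name is a valid shell identifier.

```python
import os
import shlex


def render_env_file(env):
    """Render environment variables as shell `export` lines, sorted by name."""
    lines = [f"export {key}={shlex.quote(env[key])}" for key in sorted(env)]
    return "\n".join(lines) + "\n"


def write_env_file(path, env=None):
    """Write the given (or current) environment to `path` for cron jobs to source."""
    content = render_env_file(dict(os.environ) if env is None else env)
    with open(path, "w") as handle:
        handle.write(content)
    os.chmod(path, 0o600)  # the file may contain secrets


if __name__ == "__main__":
    # Sample values only; a real entrypoint would call write_env_file(path).
    print(render_env_file({"API_URL": "http://api:5000", "GREETING": "hello world"}), end="")
```

The container's entrypoint would call write_env_file with some path before starting cron, and each crontab command would begin by sourcing that file.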
If you worry about jobs running in parallel, there are tons of task scheduling services out there you can use, but that might be overkill for a single standalone Docker host.

Running Scrapy in a docker container

I am setting up a new application which I would like to package using docker-compose. Currently, in one container, I have a Flask-Admin application which also exposes an API for interacting with the database. I will then have lots of scrapers that need to run once a day. These scrapers should scrape the data, reformat it and then send it to the API. I expect I should have another Docker container running for the scrapers.
Currently, on my local machine, I run scrapy runspider myspider.py to run each spider.
What would be the best way to have multiple scrapers in one container and have them scheduled to run at various points during the day?
You could configure the Docker container that has the scrapers to use cron to fire off the spiders at the appropriate times. Here's an example: "Run a cron job with Docker"

Multiple applications running simultaneously on a single Docker image

Suppose I have a webserver and a database server installed on the same common Docker image. Is it possible to run them simultaneously, as if they were running inside the same virtual machine?
Is running docker run <args> twice the best practice for this use case?
You should not use a single image for your web server and the database. You should use one image for the web server and one for the database.
To run this, you would run your database server and then run your webserver and link it to your database server.
There are many examples on the internet. I'll just leave this one here: https://github.com/saada/docker-compose-php-mysql
According to this Stack Overflow answer, it is perfectly possible to do that via a script that takes charge of starting each of these services:
Can I run multiple programs in a Docker container?
Although most people will just tell you to split everything into microservices in multiple different containers, in some cases it may well be much more manageable to have containers that launch more than one process, for instance in a cloud deployment where you want to run multiple web apps, each corresponding to a different system test.
So you would have your small isolated HSQL DB running in server mode, followed by your WildFly or Spring Boot app, and finally your system test run by mvn.
If you have all three in one container, then it is just a matter of choosing which Jenkins node your all-in-one container runs on. Since it packs everything it needs, independent of any other container, and the image size is not monstrous, you are really agile.
So you have to see what is best for you.
With big DBs like MySQL you are often better off running them in an isolated container as a base platform for all other Docker containers. With DBs like HSQL you can easily afford a DB per container.
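The "script that takes charge of starting each of these services" could be as small as the sketch below, used as the container's entrypoint. It is a bare-bones illustration, not a proper process supervisor: it does no signal forwarding or restarts, and the demo commands are stand-ins for a real database, app and test run.

```python
import subprocess
import sys


def run_services(commands):
    """Start every command, wait for all of them, and return their exit codes."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]


if __name__ == "__main__":
    # Stand-ins for real services; each just prints a line and exits.
    demo = [
        [sys.executable, "-c", "print('db ready')"],
        [sys.executable, "-c", "print('web ready')"],
    ]
    print(run_services(demo))
```

All commands are started before any of them is waited on, so they run at the same time inside the one container.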
