We are currently working on Dockerizing our Ruby on Rails application, which also includes Delayed Job. A question buzzing within our development team is whether and/or how to Dockerize the Delayed Job component separately from the application.
This would allow us to spin up additional Delayed Job containers when traffic in the jobs queue gets high. In addition, since Delayed Job boots the full Rails application when it first starts up, we thought the following benefits would follow:
The Delayed Job container might start up quicker
The application container would start up independently of the Delayed Job container's startup time
I know a guy responsible for a Rails app that uses Delayed Job. When it came time to dockerize that app, it got one container for the application and one for the jobs. Both containers use the same codebase, but one runs the frontend and one runs the jobs. It's not devops microservice-eriffic, but it works.
Beyond the logical separation between the two, Docker containers are supposed to run only a single process. He could have hacked it all into one container, but it seemed wrong to break a Docker fundamental right out of the gate.
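A minimal docker-compose sketch of that layout, assuming a stock Rails + Delayed Job setup (the service names, port, and Postgres service are placeholders; rake jobs:work is the standard foreground Delayed Job worker command):

```yaml
# Rough sketch only: both Ruby services are built from the same codebase/image,
# they differ only in the command they run.
version: "3.8"
services:
  web:
    build: .
    command: bundle exec rails s -b 0.0.0.0
    ports:
      - "3000:3000"
    depends_on:
      - db
  jobs:
    build: .
    # Stock Delayed Job foreground worker; swap in bin/delayed_job if you use it.
    command: bundle exec rake jobs:work
    depends_on:
      - db
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example   # placeholder credentials
```

With the worker as its own service, scaling under queue pressure is just docker compose up --scale jobs=3 (or the equivalent knob in your orchestrator), without touching the web container.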
Please pardon me if this is a very amateur question, but after reading multiple threads, posts, and references, I still do not understand the differences.
My current understanding:
1st method)
A traditional Docker setup would consist of 3 containers:
Scheduler that manages the schedule of the job
Queue that manages the queue of the multiple jobs
Worker that manages the work of each queue
I read from this source: https://laravel-news.com/laravel-scheduler-queue-docker
2nd method)
Docker + Apache Airflow composes a single container that does the same as the above 3 containers:
Worker (Airflow: since in Airflow we can also set up the scheduler and the queue)
I watched this tutorial: https://www.youtube.com/watch?v=vvr_WNzEXBE&t=575s
I first learnt from these two sources (and others), but I am confused about the following:
Since I can use docker-compose to build all the services, all I need is just one container (the 2nd method) and then use the scheduler that is already in Airflow to control the workflow, right? That would mean I do not need to create multiple containers as in the 1st method, which separates the tasks into different containers.
If the two are different, then what are the differences? I have tried to figure this out for days but still could not; I am sorry, I am new to this subject and am still studying it.
Thank you.
I think you are pointing at the multi-node vs single-node Airflow comparison. A multi-node Airflow setup will give you more computing power and higher availability for your Apache Airflow instance. You can run everything (webserver, scheduler and worker) on one machine/Docker instance, but if your project grows you can build a cluster and scale your pipeline.
Actually, an Airflow instance is made up of a number of daemons that work together to provide the full functionality of Airflow.
With multiple worker processes you can pull and execute more tasks from the queue in parallel/concurrently. On a single machine (depending on your use case and the machine's cores) you can configure this in {AIRFLOW_HOME}/airflow.cfg (for example, worker_concurrency = 6).
Now, since the daemons are independent of each other, people distribute them across multiple nodes/instances (Docker containers in your case). This is probably what you have seen.
Update:
about the tutorial links you shared
As you asked in the comment section: the YouTube tutorial you pointed to also uses one Docker container where you run everything; you aren't using multi-node there.
As for your first link (about Laravel scheduling), I am not sure, but it seems like it also uses only one container.
How you link multiple nodes of Airflow in a multi-node setup
As an example, you use the same external database instance (MySQL, Postgres) and all your nodes interact with it; similarly, they all take tasks from the same queue (for example, a shared/external RabbitMQ cluster).
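A rough docker-compose sketch of that wiring: each Airflow daemon gets its own container, but they all point at the same metadata database and the same broker. The image tag, credentials, and service names are placeholders, and database initialization/user creation is omitted for brevity:

```yaml
# Sketch only: webserver, scheduler and worker are separate containers that
# share one Postgres metadata DB and one RabbitMQ broker.
version: "3.8"

x-airflow-env: &airflow-env
  AIRFLOW__CORE__EXECUTOR: CeleryExecutor
  AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
  AIRFLOW__CELERY__BROKER_URL: amqp://guest:guest@rabbitmq:5672/
  AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  rabbitmq:
    image: rabbitmq:3-management

  webserver:
    image: apache/airflow:2.7.3   # placeholder version
    environment: *airflow-env
    command: webserver

  scheduler:
    image: apache/airflow:2.7.3
    environment: *airflow-env
    command: scheduler

  worker:
    image: apache/airflow:2.7.3
    environment: *airflow-env
    command: celery worker
    # Scale this service (or add more worker nodes) as your pipelines grow.
```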
What is the scheduler for, and how to execute it
The scheduler is the component that actually schedules the DAGs, for example running them weekly/daily/monthly as you have declared. In essence, you have only one scheduler, while workers are the thing you need more of. That doesn't mean you can't have more of the other daemons: you may have two webservers, for instance, but then you need to handle the port difference and share metadata between them.
To run the scheduler, you just need to run airflow scheduler; once it is up, it will start picking up your DAGs and executing them. For the first run it looks at start_date, and for subsequent runs it uses the schedule_interval you have defined in the DAG to schedule them (that's why it is called the scheduler).
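For reference, a minimal DAG showing where start_date and schedule_interval come in (the DAG id, dates, and task below are made up for illustration):

```python
# Example only -- a trivial DAG the scheduler can pick up and run daily.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_daily_dag",
    start_date=datetime(2023, 1, 1),   # anchors the first scheduled run
    schedule_interval="@daily",        # drives every run after the first
    catchup=False,                     # don't backfill runs between start_date and now
) as dag:
    BashOperator(
        task_id="say_hello",
        bash_command="echo 'picked up by the scheduler'",
    )
```

Drop a file like this into your DAGs folder and the running scheduler (plus a worker, if you are on the CeleryExecutor) takes care of the rest.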
I am considering replacing Celery with Dask. Currently we have a cluster where different jobs are submitted, each one generating multiple tasks that run in parallel. Celery has a killer feature, the "revoke" command: I can kill all the tasks of a given job without interfering with the other jobs that are running at the same time. How can I do that with Dask? I only find references saying that this is not possible, but for us it would be a disaster. I don't want to be forced to shut down the entire cluster when a calculation goes rogue, killing the jobs of the other users.
You can cancel tasks using the Client.cancel command.
If they haven't started yet then they won't start, however if they're already running in a thread somewhere then there isn't much that Python can do to stop them other than to tear down the process.
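A small sketch of that (the scheduler address and the slow_task function are placeholders): keep each job's futures in their own collection so one job can be cancelled without touching the others, with the caveat above about tasks that are already running.

```python
# Sketch: cancel all tasks of one "job" without affecting other jobs.
import time
from dask.distributed import Client

def slow_task(i):
    time.sleep(60)  # stand-in for a real computation
    return i

client = Client("tcp://scheduler-address:8786")  # placeholder address

# Track futures per job/user instead of in one big pile.
job_a = client.map(slow_task, range(100))
job_b = client.map(slow_task, range(100))

# Roughly the equivalent of Celery's revoke for job A only:
client.cancel(job_a)   # pending job A tasks never start; job B is untouched
```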
I'm performing some jobs on an AWS worker environment. I can't figure out why, but for some reason my job is executed multiple times. I can tell because I save the job in the database, and I saw it in the running state, then completed, and then running again, even though I launched it just once. My application is built in Ruby on Rails; I'm using active_job and the active_elastic_job gem because I take advantage of Elastic Beanstalk. Does anyone have any idea? I can provide all the info you want.
Here at my company we have our regular application on AWS EBS along with some background jobs. The problem is that these jobs are starting to get heavier, and we were thinking of separating them from the application. The question is: where should we run them?
We were thinking of doing it in AWS Lambda, but then we would have to port our Rails code to Python, Node, or Java, which seems like a lot of work. What other options are there? Should we just create another EC2 environment for the jobs? Thanks in advance.
Edit: I'm using the shoryuken gem (http://github.com/phstc/shoryuken) integrated with SQS, but it currently has a memory leak and my application sometimes goes down; I don't know if the memory leak is the cause, though. We have already split the application into an API part on EBS and a front-end part on S3.
Normally, just another EC2 instance with a copy of your Rails app, where instead of rails s to start the web server, you run rake resque:work or whatever your job runner start command is. Both would share the same Redis instance and database so that your web server writes the jobs to the queue and the worker picks them up and runs them.
If you need more workers, just add more EC2 instances pointing to the same Redis instance. I would advise separating your jobs by queue name, so that one worker can just process fast stuff e.g. email sending, and others can do long running or slow jobs.
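As a hedged sketch (Resque here, queue names made up; Sidekiq or Delayed Job would look very similar), the only real difference between the boxes is the start command:

```bash
# On the web instance: serve HTTP traffic as usual.
bundle exec rails s -b 0.0.0.0 -p 3000

# On a worker instance (same codebase, same Redis/database config),
# run the job runner instead -- here Resque working a fast queue:
QUEUE=mailers bundle exec rake resque:work

# On another worker instance, handle the slow queues (names are examples):
QUEUES=reports,exports bundle exec rake resque:work
```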
We had a similar requirement; for us it was the Sidekiq background jobs. They started to get very heavy, so we split them into a separate OpsWorks stack, with a simple recipe to build the machine dependencies (Ruby, MySQL, etc.). Since we don't have to worry about load balancers and requests timing out, it's fine for all machines to deploy at the same time.
Another thing you could use in OpsWorks is scheduled machines (if the jobs are needed at certain times during the day): have the machine provisioned a few minutes before the task is due, and once the task is done have it shut down automatically, which reduces your cost.
EB also has a different application type, the worker environment; you could check that out as well, but honestly I haven't looked into it, so I can't tell you its pros and cons.
We recently went down that route. I dockerized our Rails app and wrote a custom entrypoint for the Docker container. In summary, the entrypoint parses the command you pass after docker run IMAGE_NAME.
For example, if you run docker run IMAGE_NAME sb rake do-something-magical, the entrypoint understands that it should run that rake task with the sandbox environment config. If you run only docker run IMAGE_NAME, it runs rails s -b 0.0.0.0.
PS: I wrote a custom entrypoint because we have 3 different environments; the entrypoint downloads environment-specific config from S3.
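A stripped-down sketch of what such an entrypoint might look like (the sb shorthand, bucket name, config path, and RAILS_ENV value are all assumptions specific to that setup):

```bash
#!/bin/bash
# docker-entrypoint.sh -- illustrative sketch; adapt names and paths to your app.
set -e

case "$1" in
  sb)
    # "sb" = sandbox: fetch sandbox config from S3, then run whatever follows,
    # e.g. `docker run IMAGE_NAME sb rake do-something-magical`
    shift
    aws s3 cp "s3://my-config-bucket/sandbox/application.yml" config/application.yml
    export RAILS_ENV=sandbox
    exec bundle exec "$@"
    ;;
  "")
    # No arguments: default to the Rails server.
    exec bundle exec rails s -b 0.0.0.0
    ;;
  *)
    # Anything else: run it verbatim inside the container.
    exec "$@"
    ;;
esac
```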
I also set up an ECS cluster and wrote a task-runner function on Lambda; that Lambda function schedules a task on the ECS cluster, and we trigger the Lambda from CloudWatch Events. You can send custom payloads to the Lambda when using CloudWatch Events.
It sounds complicated but implementation is really simple.
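Roughly, the Lambda side could look like the sketch below (the cluster name, task definition, container name, and the command field in the payload are all placeholders; networking and launch-type settings are omitted):

```python
# lambda_task_runner.py -- sketch only; the ECS names below are placeholders.
import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    # CloudWatch Events passes its (optionally custom) payload in `event`;
    # here we forward one field as the command to run in the container.
    command = event.get("command", ["rake", "jobs:work"])

    response = ecs.run_task(
        cluster="background-jobs",      # placeholder cluster
        taskDefinition="rails-jobs",    # placeholder task definition
        count=1,
        overrides={
            "containerOverrides": [
                {"name": "rails", "command": command}
            ]
        },
    )
    return {"tasks_started": len(response.get("tasks", []))}
```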
You may consider submitting your tasks to AWS SQS and then using an Elastic Beanstalk worker environment to process your background tasks.
Elastic Beanstalk supports Rails applications:
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/create_deploy_Ruby_rails.html
Depending on what kind of work these background jobs perform, you might want to think about extracting those functions into microservices if you are running your jobs on a different instance anyway.
Here is a good codeship blog post on how to approach this.
For simple mailer type stuff, this definitely feels a little heavy handed, but if the functionality is more complex, e.g. general notification of different clients, it might well be worth the overhead.
I am currently working on moving my environment off Heroku, and part of my application runs a clock process that sets off a Sidekiq background job.
As I understand it, Sidekiq is composed of a client, which sends jobs off to be queued in Redis, and a server, which pulls jobs off the queue and processes them. I am now trying to split my application into the following containers on Docker:
- Redis container
- Clock container (Using Clockwork gem)
- Worker container
- Web application container (Rails)
However, I am not sure how one is supposed to split up this Sidekiq server and client. Essentially, the clock container needs to be running Sidekiq on it so that the client can send off jobs to the Redis queue every so often. However, the worker containers should also run Sidekiq (the server though) on them so that they can process the jobs. I assume that splitting up the responsibilities between different containers should be quite possible to do since Heroku allows you to split this across various dynos.
I can imagine one way to do this would be to point the clock container at a non-existent queue, so that it never pulls any jobs off the queue, and point the worker at a queue that does exist. However, this doesn't seem like the optimal approach to me, since the clock container would still be polling that non-existent queue for new jobs.
Any tips or guides on how I can start going about this?
The Sidekiq client just publishes jobs into Redis. A Sidekiq daemon process subscribes to Redis and starts worker threads as jobs are published.
So you can just install the redis gem in both containers (the clock container and the worker container), start the worker daemon only in the worker container, and provide a proper Redis config to both. You also have to make sure that the worker source code is available in both containers, as the Sidekiq client only stores the name of the worker class and the daemon then instantiates it through metaprogramming.
But actually, you can also just run a Sidekiq daemon process alongside every application instance that needs to process worker jobs. Yes, there is the Docker best practice of one process per container, but IMHO this is not an all-or-nothing rule; in this case I see both processes as one unit. It's just a way of running some code in the background. You would then configure instances of the same application to work against the same Sidekiq queues, or you could even configure each physical node to run against its own separate queue.
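For the container split described in the question, a rough compose sketch (service names, the clock.rb file, and the Redis URL are assumptions): only the worker service starts the Sidekiq daemon, while the web and clock containers just enqueue through the Sidekiq client, so no dummy queue is needed.

```yaml
# Sketch: all Ruby services are built from the same codebase; only `worker`
# runs the Sidekiq server, the others merely push jobs to Redis.
version: "3.8"
services:
  redis:
    image: redis:7

  web:
    build: .
    command: bundle exec rails s -b 0.0.0.0
    environment:
      REDIS_URL: redis://redis:6379/0
    depends_on: [redis]

  clock:
    build: .
    # Clockwork only enqueues jobs (Sidekiq client side); it never processes them.
    command: bundle exec clockwork clock.rb
    environment:
      REDIS_URL: redis://redis:6379/0
    depends_on: [redis]

  worker:
    build: .
    # The only service that runs the Sidekiq daemon and works the queues.
    command: bundle exec sidekiq
    environment:
      REDIS_URL: redis://redis:6379/0
    depends_on: [redis]
```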