DigitalOcean batch processing - Docker

Wondering if DigitalOcean has a solution/alternative to what AWS offers for batch job processing?
Something that automatically spins up X droplets (instances), does the job, and then shuts down?
Trying to figure out how this can be implemented (ideally without too much manual work).

Related

How to use delayed_job with Rails on Google App Engine?

This may be more of an App Engine question than a delayed_job question. But generally, how can I keep a long-lived process running to handle the scheduling of notifications and the sending of scheduled notifications on Google App Engine?
The maintainers of delayed_job (https://github.com/collectiveidea/delayed_job) include a script for production deploys, but this seems to stop after a few hours. Trying to figure out the best approach to ensure that the script stays running, and also that it is able to access the logs for debugging purposes.
I believe that Google Pub/Sub is also a possibility, but I would ideally like to avoid setting up additional infrastructure for such a small project.
For running long processes that last for hours, App Engine will not be the ideal solution, since requests are capped at 60 seconds (GAE Standard) and 60 minutes (GAE Flex).
The best option would be a Compute Engine based solution, since there you would be able to keep the GCE VM up for long periods.
Once you have deployed a RESTful application on your GCE VM, you can use Cloud Scheduler to create a scheduled job with this command:
gcloud scheduler jobs create http JOB --schedule=SCHEDULE --uri=APP_PATH
You can find more about this solution in this article.
If App Engine is required, take the mentioned maximum request times into consideration. Additionally, you can take a look at Cloud Tasks, since those fit your requirement quite well.
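To make that concrete, here is a minimal sketch of the kind of RESTful endpoint the scheduled HTTP job could hit on the GCE VM; the Sinatra route, the /tasks/send_notifications path, and the NotificationSender class are assumptions for illustration, not something from the original question or answer.

    # app.rb - minimal sketch of an endpoint for Cloud Scheduler to call (assumed names)
    require "sinatra"

    post "/tasks/send_notifications" do
      # NotificationSender is a hypothetical class standing in for the delayed_job logic.
      NotificationSender.deliver_due_notifications
      status 204
    end

The scheduled job created with the gcloud command above would then POST to this path (passed as --uri) on whatever --schedule you choose.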

How can I write a never-ending job in Rails (web scraping)?

Goal: I want to make a web scraper in a Rails app that runs indefinitely and can be scaled.
Current stack the app is running on:
ROR/Heroku/Redis/Postgres
Idea:
I was thinking of running a Sidekiq Job that runs every n minutes and checks if there are any proxies available to scrape with (these will be stored in a table with status sleeping/scraping).
Assuming there is a proxy available to scrape with, it will then check (using the Sidekiq API) whether there are any available workers to start up another job to scrape with that proxy.
This means I could scale the scraper by increasing the number of workers and the number of available proxies. If for any reason the job fails, the job that looks for available proxies will just start it again.
Questions: Is this the best solution for my goal? Is utilizing long-running Sidekiq jobs the best idea, or could this blow up?
Sidekiq is designed to run individual jobs which are "units of work" to your organization.
You can build your own loop and, inside that loop, create jobs for each page to scrape, but the loop itself should not be a job.
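As a rough sketch of that split, the dispatcher below enqueues one Sidekiq job per available proxy and then exits, so no single job runs forever; the Proxy model, its status column, and the job class names are assumptions based on the question, not a definitive design.

    # A minimal sketch, assuming a Proxy ActiveRecord model with a "status" column.
    class ScrapeDispatcherJob
      include Sidekiq::Job # Sidekiq::Worker on older Sidekiq versions

      def perform
        # Enqueue one unit of work per free proxy; the dispatcher itself finishes quickly.
        Proxy.where(status: "sleeping").find_each do |proxy|
          ScrapePageJob.perform_async(proxy.id)
        end
      end
    end

    class ScrapePageJob
      include Sidekiq::Job

      def perform(proxy_id)
        proxy = Proxy.find(proxy_id)
        proxy.update!(status: "scraping")
        # ... scrape with this proxy ...
      ensure
        proxy&.update!(status: "sleeping")
      end
    end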
If you want a job to run every n minutes, you could schedule it.
And since you're using Heroku, there is an add-on for that: https://devcenter.heroku.com/articles/scheduler
Another solution would be to set up cron jobs and schedule them with the whenever gem, for example:
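A minimal config/schedule.rb for the whenever gem might look like the sketch below; the five-minute interval and the dispatcher job name are placeholders, not something from the original answer.

    # config/schedule.rb (whenever gem); install into crontab with `whenever --update-crontab`
    every 5.minutes do
      runner "ScrapeDispatcherJob.perform_async"
    end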

AWS ECS scale up and down ways

We are using the AWS ECS service with Docker containers running in it. These Docker containers run application code that continuously polls SQS, gets a single message, processes it, and then kills itself; that's the life cycle of a task.
Now we are scaling tasks and EC2 instances in the cluster based on the number of messages coming into SQS. We are able to scale up, but it's difficult to scale down because we don't know whether any task is still processing a message; message processing time is long due to some complex logic.
Could anybody suggest the best way to scale up and scale down in this case?
Have you considered using AWS Lambda for this use case rather than ECS (provided that your application logic runs in less than 5 minutes)? You can use SQS as a trigger for the Lambda. The AWS documentation, Using AWS Lambda with Amazon SQS, provides a comprehensive guide on how to achieve this with Lambda.
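For reference, an SQS-triggered Lambda handler can stay very small. This is a minimal sketch using the Ruby runtime; the process_message helper is a hypothetical stand-in for the existing container logic, and the assumption that the message body is JSON is mine, not the question's.

    # handler.rb - minimal sketch of an SQS-triggered Lambda (Ruby runtime)
    require "json"

    def handler(event:, context:)
      # Lambda passes a batch of SQS records and deletes them from the queue on success.
      event["Records"].each do |record|
        payload = JSON.parse(record["body"]) # assumes the message body is JSON
        process_message(payload)             # hypothetical stand-in for the worker logic
      end
    end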
The use case you have mentioned isn't really a fit for ECS on EC2 instances. You should consider AWS ECS Fargate or AWS Batch. Fargate will give you more capabilities in terms of infrastructure, such as running tasks for longer periods and scaling tasks according to parameters like CPU or memory. In addition, you only pay for the tasks running at any given moment in your cluster.
Ref: https://aws.amazon.com/fargate/

Use case for constantly running tasks

I have a few processes on my machine that I would like to have constantly running. I like, however, how Jenkins organizes job logging: I can go and watch a build executing and see its STDOUT in real time.
Would it be an issue to have a job that never finishes? I've heard that after a while there would be interruptions. Is there a better tool for something like this? I would basically love to be able to see the output from a web-based view in the tool (and add hooks on failures).
For example, if I were hosting a Node.js site, I'd want to be able to see the output of people connecting to the website, or whatever is logged by the site. Ideally, as long as you want to run the server, the process would be running constantly.

Using "rails runner" for cron jobs is very CPU intensive - alternatives?

I'm currently using cron and "rails runner" to execute background jobs. For the most part these jobs are simple polls: "Find the records that are due to receive a reminder email. Send that email."
I've been watching my Amazon EC2 Small instance and noticed that each time one of these cron jobs kicks in, the CPU spikes to ~99%. The teeny tiny little query inside my current job is definitely not responsible. I'm presuming that the spike is simply due to the effort of loading the full Rails environment via "rails runner".
Is there a more CPU efficient way to handle regularly scheduled batch jobs?
P.S. I know that in the particular example of sending a reminder email at time X in the future, I could use delayed_job and simply schedule the job in the future. Not every possible task fits into the delayed_job framework very well, though, so I'm looking for a more traditional "cron job" type of solution. Like "rails runner", but without the crazy CPU consequences.
You can use workers which don't load the Rails env, or which load it only once (like Resque).
I don't think there is a way around this, since you do need to load a Rails environment to handle whatever it is you are handling. So with the "cron" model you will be starting up a handler each time, which in turn will create some load on your instance. I don't know how cloud services lend themselves to this, but I think the optimal model in your case would be a running daemon for job handling, combined with forking and REE for the job execution (that helps prevent memory leaks by letting as much as possible happen in a child process that dies at the end of the execution loop).
The daemon could be configured to accept signals (also via a job queue) that would spin off jobs doing specific things.
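As a rough illustration of that daemon-plus-forking model, the loop below loads the Rails environment once and forks a short-lived child per batch of work; the script path, the Reminder model, and the mailer are assumed names for the sketch, not something prescribed by the answer.

    # script/reminder_daemon.rb - minimal sketch; Rails is loaded once, work happens in forks
    require_relative "../config/environment"

    loop do
      pid = fork do
        # Hypothetical model and mailer; the child exits (freeing memory) after each batch.
        Reminder.where("send_at <= ?", Time.current).find_each do |reminder|
          ReminderMailer.reminder_email(reminder).deliver_now
        end
      end
      Process.wait(pid)
      sleep 60
    end

A process supervisor (or the signal/queue hook mentioned above) would then be responsible for keeping the daemon itself alive.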
