Running large amount of long running background jobs in Rails - ruby-on-rails

We're building a web-app where users will be uploading potentially large files that will need to be processed in the background. The task involves calling 3rd-party APIs so each job can take several hours to complete. We're using DelayedJob to run the background jobs. With every user kicking off a background job, each of which will take a few hours to finish, that will add up to a lot of background jobs every quickly. I am wondering what would be the best way to setup the deployment for this? We're currently hosted on DigitalOcean. I've kicked off 10 DelayedJob workers. Each one (when ideal) takes up 157MB. When actively running it utilizes around 900 MB. Our user-base right now is pretty small so it's not an issue but will be one soon. So on a 4GB droplet, I can probably run like 2 or 3 workers at a time. How should we approach this issue? Should we be looking at using DigitalOcean's API to auto-spin cheap droplets on demand? Should we subscribe to high-memory droplets on a monthly basis instead? If we go with auto-spinning droplets, should we stick with DigitalOcean or would Heroku make more sense? Or is the entire approach wrong and should we be approaching it from an entire different direction? Any help/advice would be very much appreciated.
Thanks!

It sounds like you are limited by memory on the number of workers that you can run on your DigitalOcean host.
If you are worried about scaling, I would focus on making the workers as efficient as possible. Have you done any benchmarking to understanding where the 900MB of memory is being allocated? I'm not sure what the nature of these jobs are, but you mentioned large files. Are you reading the contents of these files into memory, or are you streaming them? Are you using a database with SQL you can tune? Are you making many small API calls when you could be using a batch endpoint? Are you assigning intermediary variables that must then be garbage collected? Can you compress the files before you send them?
Look at the job structure itself. I've found that background jobs work best with many smaller jobs rather than one larger job. This allows execution to happen in parallel, and be more load balanced across all workers. You could even have a job that generates other jobs. If you need a job to orchestrate callbacks when a group of jobs finishes there is a DelayedJobGroup plugin at https://github.com/salsify/delayed_job_groups_plugin that allows you to invoke a final job only after the sibling jobs complete. I would aim for an execution time of a single job to be under 30 seconds. This is arbitrary but it illustrates what I mean by smaller jobs.
Some hosting providers like Amazon provide spot instances where you can pay a lower price on servers that do not have guaranteed availability. These pair well with the many fewer jobs approach I mentioned earlier.
Finally, Ruby might not be the right tool for the job. There are faster languages, and if you are limited by memory, or CPU, you might consider writing these jobs and their workers in another language like Javascript, Go or Rust. These can pair well with a Ruby stack, but offload computationally expensive subroutines to faster languages.
Finally, like many scaling issues, if you have more money than time, you can always throw more hardware at it. At least for a while.

I thing memory and time is more problem for you. you have to use sidekiq gem for this process because it will consume less time and memory consumption for doing the same job,because it uses redis as database which is key value pair db.if the problem continues go with java script.

Related

All tasks assigned to one worker when using Dask in adaptive mode

When using Dask normally things work fine. However, when I use Dask with an adaptive cluster I find that sometimes all the tasks get assigned to a single worker. Why is this?
This should be considered a usability bug, and it would be reasonable to file an issue about it.
However, to explain what is going on (at least today 2018-08-09) probably what happens is that
Your scheduler first has no tasks and so has no workers assigned to it
You submit a lot of work from a client, the scheduler responds and asks for many workers
The first worker arrives and the scheduler hands it all of the work
Milliseconds later, several other workers arrive. The scheduler then proceeds to load balance between the available workers
Ideally, the load balancing heuristics should handle the situation. There were older versions of Dask where this performed less well, but usually this is fine. I recommend first updating your version of the dask and distributed packages to the newest possible releases and if that doesn't work, report an issue with a minimal example if possible.

Concurrent SOAP api requests taking collectively longer time

I'm using savon gem to interact with a soap api. I'm trying to send three parallel request to the api using parallel gem. Normally each request takes around 13 seconds to complete so for three requests it takes around 39 seconds. After using parallel gem and sending three parallel requests using 3 threads it takes around 23 seconds to complete all three requests which is really nice but I'm not able to figure out why its not completing it in like 14-15 seconds. I really need to lower the total time as it directly affects the response time of my website. Any ideas on why it is happening? Are network requests blocking in nature?
I'm sending the requests as follows
Parallel.map(["GDSSpecialReturn", "Normal", "LCCSpecialReturn"], :in_threads => 3){ |promo_plan| self.search_request(promo_plan) }
I tried using multiple processes also but no use.
I have 2 theories:
Part of the work load can't run in parallel, so you don't see 3x speedup, but a bit less than that. It's very rare to see multithreaded tasks speedup 100% proportionally to the number of CPUs used, because there are always a few bits that have to run one at a time. See Amdahl's Law, which provides equations to describe this, and states that:
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program
Disk I/O is involved, and this runs slower in parallel because of disk seek time, limiting the IO per second. Remember that unless you're on an SSD, the disk has to make part of a physical rotation every time you look for something different on it. With 3 requests at once, the disk is skipping repeatedly over the disk to try to fulfill I/O requests in 3 different places. This is why random I/O on hard drives is much slower than sequential I/O. Even on an SSD, random I/O can be a bit slower, especially if small-block read-write is involved.
I think option 2 is the culprit if you're running your database on the same system. The problem is that when the SOAP calls hit the DB, it gets hit on both of these factors. Even blazing-fast 15000 RPM server hard drives can only manage ~200 IO operations per second. SSDs will do 10,000-100,000+ IO/s. See figures on Wikipedia for ballparks. Though, most databases do some clever memory caching to mitigate the problems.
A clever way to test if it's factor 2 is to run an H2 Database in-memory DB and test SOAP calls using this. They'll complete much faster, probably, and you should see similar execution time for 1,3, or $CPU-COUNT requests at once.
That's actually is big question, it depends on many factors.
1. Ruby language implementation
It could be different between MRI, Rubinus, JRuby. Tho I am not sure if the parallel gem
support Rubinus and JRuby.
2. Your Machine
How many CPU cores do you have in your machine, you can leverage this using parallel process? Have you tried using process do this if you have multiple cores?
Parallel.map(["GDSSpecialReturn", "Normal", "LCCSpecialReturn"]){ |promo_plan| self.search_request(promo_plan) } # by default it will use [number] of processes if you have [number] of CPUs
3. What happened underline self.search_request?
If you running this in MRI env, cause the GIL, it actually running your code not concurrently. Or put it precisely, the IO call won't block(MRI implementation), so only the network call part will be running concurrently, but not all others. That's why I am interesting about what other works you did inside self.search_request, cause that would have impact on the overall performance.
So I recommend you can test your code in different environments and different machines(it could be different between your local machine and the real production machine, so please do try tune and benchmark) to get the best result.
Btw, if you want to know more about the threads/process in ruby, highly recommend Jesse Storimer's Working with ruby threads, he did a pretty good job explaining all this things.
Hope it helps, thanks.

Good background processing options

I am looking for a good background job processor with following ability,
Works well with MySql
Can have priorities
Can easily schedule anything in background( not just emails)
Ability to reinitialize the job after completion (callback would be good. I have few task/jobs that keeps on running after every minutes), even a repetitive scheduler would work
Should not eat up lot of memory, (have this experience with DJ)
Few options that I am looking into Resque, DJ, Beanstalkd (haven't explored completely)
I have my production env in Amazon EC2 (if this helps for better solution)
Please suggest me which is a good option, is there something else apart from these that people use nowadays ?
I'd heartily recommend sidekiq - it's extremely flexible and it uses far less resources than Resque or DelayedJob.
It does require redis (like Resque), but redis is valuable addition to any Rails project since it can be reused as a session store and cache. Our primary db is MySQL and we deploy to EC2 :-) We've used delayed job and rescue in the past, but found them problematic and heavy on the resources they use. Sidekiq uses threads and a single sidekiq worker is as efficient as several DJ/Resque workers. Here's an interesting part of the project's README that I can corroborate:
You'll find that you might need 50 200MB resque processes to peg your
CPU whereas one 300MB Sidekiq process will peg the same CPU and
perform the same amount of work. Please see my blog post on Resque's
memory efficiency and how I was able to shrink a Carbon Five client's
resque processing farm from 9 machines to 1 machine.
To sum it all up:
It works fine with MySQL - not really, but it doesn't have problems with MySQL either
You can have priorities by setting up different processing queues
You can easily schedule anything (and there is special ala DJ support for e-mails in particular)
Not quite sure about that, we use whenever + cron for repetitive jobs
You're gonna love Sidekiq's small memory footprint

How should I schedule many Google Search scrapes over the course of a day?

Currently, my Nokogiri script iterates through Google's SERPs until it finds the position of the target website. It does this for each keyword for each website that each user specifies (users are capped on amount of websites & keywords they can track).
Right now, it's run in a rake that's hard-scheduled every day and batches all scrapes at once by looping through all the websites in the database. But I'm concerned about scalability and swarming Google with a batch of requests.
I'd like a solution that scales and can run these scrapes over the course of the day. I'm not sure what kind of solution is available or what I'm really looking for.
Note: The amount of websites/keywords change from day to day as users add and delete their websites and keywords. I don't mean to make this question too superfluous, but is this the kind of thing Beanstalkd/Stalker (job queuing) can be used for?
You will have to balance two issues: Scalability for lots of users versus Google shutting you down for scaping in violation of their terms of use.
So your system will need to be able to distribute tasks to various different IPs to conceal your bulk scraping which suggests at least two levels of queuing. One to manage all the jobs and send them to each separate IP for subsequent searching and collecting results and queues on each separate machine to hold the requested searches until they are executed and the results returned.
I have no idea what Google's thresholds are (I am sure they don't advertise it) but exceeding them and getting cut off would obviously be devastating for what you are trying to do so your simple looping rake task is exactly what you shouldn't do after a certain number of users.
So yes, use a queue of some sort but realize that you probably have a different goal from the typical goal of a queue in that you want to deliberately delay jobs rather that offload word to avoid UI delays. So you will be seeking ways to slow down the queue rather than have it just execute job after job as they arrive in the queue.
So based on a cursory inspection of DelayedJob and BackgroundJobs it looks like DelayedJob has what you would need with the run_at attribute. But I am only speculating here and I am sure an expert would have more to say.
If I'm understanding correclty, it sounds like one of these tools might fit the bill:
Delayed_job: https://github.com/tobi/delayed_job
or
BackgroundJobs: http://codeforpeople.rubyforge.org/svn/bj/trunk/README
I've used both of them, and found them easy to work with.
There are definitely some background job libraries that might work.
delayed_job: https://github.com/collectiveidea/delayed_job (beware of the unmaintained branch from tobi!)
resque: https://github.com/defunkt/resque
However, you might think about just scheduling a Cron job that runs more times during the day, and processes less items per run.
SaaS solution: http://momentapp.com/ "Launch delayed jobs with scheduled http requests" - disclaimer a) in beta b) I am not affiliated with this service

Running Jobs when DB is free on Ruby on Rails Heroku

I have a ruby on rails app that uses Heroku. I have the need to run things like import/export tasks on our db that lock up the whole system since they are so heavy on the DB. Is there a way to tell the system to only run these tasks when the database is not being used at that second?
There is no built-in way to schedule a job like this. There are a few things you can do, though.
Schedule the jobs to run during the least busy hours of the day. That will depend on your business, customer base and so on, but hopefully there is a window that is more suitable than others.
You could write your batch job to run for a longer time, doing small units of work. Between each unit of work, sleep for a few seconds, or take a look at the current load average and decide what to do based on that. This should lower the impact of the batch jobs.
Have the website update a "lock" somewhere, either in the database or in a memcached or something. If your normal website usage updates the database, you could look at the existing updated_at. Then only do batch work when there hasn't been any activity for a while. This doesn't guarantee that a new user won't pop in at the same time your batch job runs, of course, but could be a way to find a window where the site is less used.
Have you looked into using Background Jobs / Workers on Heroku? It's also worth reading about Heroku's Delayed Job queuing system

Resources