How can I write a never-ending job in Rails (web scraping)? - ruby-on-rails

Goal: I want to make a web scraper in a Rails app that runs indefinitely and can be scaled.
Current stack the app is running on:
RoR/Heroku/Redis/Postgres
Idea:
I was thinking of running a Sidekiq job every n minutes that checks whether any proxies are available to scrape with (these will be stored in a table with a status of sleeping/scraping).
Assuming a proxy is available, it will then check (using the Sidekiq API) whether there are any free workers, and if so start another job to scrape with that proxy.
This means I could scale the scraper by increasing the number of workers and the number of available proxies. If a scraping job fails for any reason, the job that looks for available proxies will simply start it again.
Questions: Is this the best solution for my goal? Is using long-running Sidekiq jobs a good idea, or could this blow up?

Sidekiq is designed to run individual jobs, which are "units of work" to your organization.
You can build your own loop and, inside that loop, create jobs for each page to scrape, but the loop itself should not be a job.
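As a rough sketch of that split (all class and table names here are illustrative, not taken from your app), the scheduled job only dispatches, and each scrape is its own small job:

    # Runs every n minutes via the scheduler; it dispatches, never loops forever.
    class ScrapeDispatcherJob
      include Sidekiq::Job   # Sidekiq::Worker on versions before 6.3

      def perform
        # Each available proxy becomes one small, retryable unit of work.
        Proxy.where(status: "sleeping").find_each do |proxy|
          ScrapeWithProxyJob.perform_async(proxy.id)
        end
      end
    end

    class ScrapeWithProxyJob
      include Sidekiq::Job

      def perform(proxy_id)
        proxy = Proxy.find(proxy_id)
        proxy.update!(status: "scraping")
        # ... scrape one bounded batch of pages with this proxy ...
      ensure
        proxy&.update!(status: "sleeping")   # release the proxy even on failure
      end
    end

If a worker dies mid-batch, Sidekiq's retry mechanism re-runs just that one small job rather than an hours-long loop.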

If you want a job to run every n minutes, you could schedule it.
And since you're using Heroku, there is an add-on for that: https://devcenter.heroku.com/articles/scheduler
Another solution would be to set up cron jobs and schedule them with the whenever gem.
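With whenever, the schedule could look something like this (the job class here is the hypothetical dispatcher sketched above):

    # config/schedule.rb (whenever gem)
    every 5.minutes do
      runner "ScrapeDispatcherJob.perform_async"
    end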

Related

How to continuously deploy with long-running jobs

We currently use delayed_job and rails to manage some long running jobs in our system. Some of these jobs take potentially hours to run, but we also like to deploy rather frequently, often many times a day. The problem with this setup is that we have to restart delayed_job during deployment to pick up code changes, so that any new jobs are processed with the latest code.
The solution we've arrived at is that for any job that needs to run for more than some small amount of time, we fork the delayed job so that it returns immediately, and the forked process handles the work. This way a deploy can restart all the delayed_job processes, while the long-running 'job' keeps going until it's finished as an orphaned process (sketched below).
We've looked at sidekiq, but it looks like we'd have the same issue there when trying to deploy new code.
Has anyone developed a solution they would recommend for dealing with long-running background processes that span multiple deployments?
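For illustration, the fork-and-detach approach described above might look roughly like this inside a job class (the work method is invented):

    class LongRunningJob
      def perform
        pid = fork do
          Process.setsid        # detach the child from the worker's session
          run_the_long_task     # hypothetical: the hours-long actual work
        end
        Process.detach(pid)     # reap the child so it doesn't become a zombie
        # perform returns immediately; a deploy can restart the delayed_job
        # workers while the orphaned child finishes on the old code.
      end
    end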

Keep delayed job running on Heroku

I'm connecting to Twitter's streaming API to get a stream of updates into my Rails app, adding them to the db, etc.
What's the best way to do this on Heroku? Right now I'm using the delayed_job gem - the problem is that the job (connecting to the Twitter Streaming API) expires after a few hours.
Is there a way to make the job run forever, or a better way to do this?
Thanks
I wouldn't make a job "run forever", as that would mean keeping the CPU loaded forever too.
The way this is usually handled is by using a cron job which starts the specific script at specific intervals (every minute, every hour, every few days, etc.).
Almost every web host provides an easy interface to set up such cron jobs via their backend (e.g. cPanel).
In case you're running your own server, you probably already know how to configure such jobs. If you don't, you'll have to look up the setup guide for the operating system your server runs; there's always a way to run "jobs" at specific intervals (even on MS Windows servers, via the Task Scheduler).
And for a more detailed description and better insight into what "cron" is, you might want to check the "cron" article on Wikipedia, which also provides some pretty good examples.
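As a concrete (hypothetical) example, a crontab entry that runs a Rails task every five minutes could look like this; the path and task name are placeholders:

    # m h dom mon dow  command
    */5 * * * * cd /var/www/myapp && RAILS_ENV=production bundle exec rake twitter:poll >> log/cron.log 2>&1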

Should the resque-scheduler queue be expected to handle /lots/ of delayed jobs?

I am currently using resque and resque-scheduler in an application that will have to handle a lot of recurring jobs - "do this every hour", "do this every day", etc. At the moment, I simply queue up the next run of the job in the job itself, e.g. HourlyJob ends with a .enqueue_at(1.hour.from_now, HourlyJob).
Should I be doing this? It "feels" like I should have a static recurring job using resque-scheduler's cron-type functionality that then schedules, say, the next 5 minutes' worth of delayed jobs... but all I am really doing is moving the work from the (probably fast, Redis-based) resque-scheduler to my (probably less well implemented, MySQL-based) code, surely?
Is there anything wrong with how I'm doing it now?
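For reference, the self-requeuing pattern described in the question looks roughly like this (the work method is invented):

    class HourlyJob
      @queue = :hourly

      def self.perform
        do_hourly_work                               # hypothetical work method
      ensure
        # The job re-schedules its own next run via resque-scheduler.
        Resque.enqueue_at(1.hour.from_now, HourlyJob)
      end
    end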
I'd personally use the cron style provided by resque-scheduler; your use case is exactly what it was built for:
You more directly indicate that these are recurring jobs.
Everything is located in the same YAML file rather than spread across multiple job classes/modules.
By queuing the next run of the job inside the actual job:
You run the risk of the next run going missing when your worker/job/server fails.
You're needlessly using more memory in Redis; the scheduler process will not add jobs to Redis until they're ready to be run.
Hope this helps.
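For example, the whole recurring schedule can live in one YAML file along these lines (names are illustrative; see the resque-scheduler README for the exact options):

    # config/resque_schedule.yml -- loaded at boot with
    # Resque.schedule = YAML.load_file("config/resque_schedule.yml")
    hourly_job:
      cron: "0 * * * *"        # top of every hour
      class: HourlyJob
      queue: hourly
      description: "Recurring hourly work"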

Using "rails runner" for cron jobs is very CPU intensive - alternatives?

I'm currently using cron and "rails runner" to execute background jobs. For the most part these jobs are simple polls: "Find the records that are due to receive a reminder email. Send that email."
I've been watching my Amazon EC2 Small instance and noticed that each time one of these cron jobs kicks in, the CPU spikes to ~99%. The teeny tiny little query inside my current job is definitely not responsible. I'm presuming the spike is simply due to the effort of loading the full Rails environment via "rails runner".
Is there a more CPU efficient way to handle regularly scheduled batch jobs?
P.S. I know that in the particular example of sending a reminder email at time X in the future, I could use delayed_job and simply schedule the job for the future. Not every possible task fits into the delayed_job framework very well, though, so I'm looking for a more traditional "cron job" type solution. Like "rails runner", but without the crazy CPU consequences.
You can use workers which don't load the Rails environment, or which load it only once (like Resque).
I don't think there is a complete solution for this, since you do need to load a Rails environment to handle whatever it is you are handling, so on the "cron" model you will be starting up a handler which in turn creates load on your instance. I don't know how well cloud services lend themselves to this, but I think the optimal model in your case would be a long-running daemon for job handling that forks per job, coupled with REE (Ruby Enterprise Edition), which helps prevent memory leaks by letting as much as possible happen in a child process that dies at the end of each execution loop.
The daemon could be configured to accept signals (also via a job queue) that would spin off jobs doing specific things.
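A bare-bones sketch of that daemon idea (all names are placeholders) could be:

    # Boot Rails exactly once, then fork a short-lived child per job so the
    # expensive environment load is not repeated on every run.
    require_relative "config/environment"

    loop do
      if (job = ReminderEmail.due.first)            # hypothetical model and scope
        pid = fork do
          ActiveRecord::Base.establish_connection   # don't share the parent's socket
          job.deliver                               # hypothetical work method
        end
        Process.wait(pid)                           # child exits; its memory is reclaimed
      else
        sleep 60
      end
    end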

Rails BackgroundJob: running jobs in parallel?

I'm very happy with Bj so far; I only have this one issue:
When one process takes 1 or 2 hours to complete, all other jobs in the queue seem to wait for that one job to finish. Worse still is when uploading to a server which times out regularly.
My question: is Bj running jobs in parallel or one after another?
Thank you,
Damir
BackgroundJob will only allow one worker to run per webserver instance. This is by design to keep things simple. Here is a quote from Bj's README:
If one ignores platform specific details the design of Bj is quite simple: the
main Rails application submits jobs to a table, stored in the database. The act
of submitting triggers exactly one of two things to occur:
1) a new long running background runner to be started
2) an existing background runner to be signaled
The background runner refuses to run two copies of itself for a given
hostname/rails_env combination. For example you may only have one background
runner processing jobs on localhost in development mode.
The background runner, under normal circumstances, is managed by Bj itself -
you need do nothing to start, monitor, or stop it - it just works. However,
some people will prefer to manage their own background process; see the 'External
Runner' section below for more on this.
The runner simply processes each job in a highest priority oldest-in fashion,
capturing stdout, stderr, exit_status, etc. and storing the information back
into the database while logging its actions. When there are no jobs to run
the runner goes to sleep for 42 seconds; however this sleep is interruptible,
such as when the runner is signaled that a new job has been submitted so,
under normal circumstances there will be zero lag between job submission and
job running for an empty queue.
You can learn more on the project's GitHub page.
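For context, submitting work to Bj is a one-liner along these lines (adapted from the README's examples, so treat the details as approximate):

    # Inserts a row into Bj's jobs table; the single runner for this
    # hostname/RAILS_ENV combination picks it up in priority order.
    Bj.submit "./script/runner ./lib/jobs/upload.rb"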
