I need to access and pull data from a number of APIs over the course of a number of days. This is streaming data, so the process will be running all the time. Each process will be pulling in data and inserting it into a separate Google Fusion table.
I want to run these processes in the background and forget about them, just being able to monitor whether they fail and don't restart.
I have looked at Delayed Job, Resque, Beanstalk, etc., and my question is: can these run processes concurrently? I don't want to queue processes, just run them in the background.
I looked at Spawn as well, but didn't completely understand how it worked.
So what options are available to me? Does anybody have any recommendations?
I would use the whenever gem to schedule cron jobs to pull data.
every 2.hours do
  runner "YourApi.do_whatever"
  runner "SecondApi.do_the_thing"
end
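Inside whenever's config/schedule.rb the calls need to be wrapped in a job type such as runner, which executes them via rails runner so your app environment is loaded. After editing the schedule, you write it out to cron with:

whenever --update-crontab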
Maybe a custom background daemon is a better fit for you; take a look at daemon_generator. Note that you will probably have to do some work if you want to do things concurrently, but just processing things in serial should be quite easy.
Related
Just wanted to know what the best approach would be:
Let's say I have 3 processes; each one does its job, calculates, and passes data to a final process whose function is to take the data from the other processes and populate a DB.
The reason for leaving the final process by itself is that the 3 other processes may take a variable time to complete, so I want each of them to pass data to the final one as soon as it has completed its job in order to avoid wasting time, and I don't want multiple processes to write to the DB at the same time.
But to do this, each process needs to know whether the final process is busy or not, and if it is available, send its data; otherwise wait for it to complete before sending.
My idea is to use the 'whenever' gem and create 3 processes that would run on their own, but I am puzzled by the last one, as I don't know much about daemons and the like, and I know I might be making all of this much more complicated than it really is.
Any suggestion is welcome, thank you.
So I think I can provide some insight into your problem. My dev team uses a home-grown message queue that's backed by our database. That means that messages (job metadata) are stored in our messages table.
Our Rails app then creates a daemon process using the daemons gem, which makes instantiating daemon processes much simpler. There's no need to be afraid of what daemon processes are; they are just Linux/Unix processes that run in the background.
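For illustration, a minimal sketch of such a worker using the daemons gem (the Message model and its pending scope are assumptions about your schema, not part of the gem):

require 'daemons'

# Daemonizes the block; control it with: ruby message_worker.rb start|stop|status
Daemons.run_proc('message_worker') do
  require_relative 'config/environment'   # load the Rails app so models are available
  loop do
    message = Message.pending.first       # hypothetical scope over the messages table
    message ? message.process! : sleep(5)
  end
end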
You specifically mention that you don't want multiple processes to write to your DB. It really sounds like you are concerned about deadlock issues from multiple daemons trying to read/write to the same table (please correct me if you are not, so I can modify my answer).
In order to avoid this issue, you can use row-level locking for your messages table. That way a daemon doesn't have to lock the entire table every time it wants to see if there are any jobs to pick up.
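With ActiveRecord that can look roughly like this (Message and its status column are assumptions about your schema; lock(true) issues a SELECT ... FOR UPDATE, so only the selected row is locked):

Message.transaction do
  message = Message.where(status: 'pending').order(:created_at).lock(true).first
  message&.update!(status: 'processing')   # another daemon can't claim this row in the meantime
end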
You also mention using 3 processes (I also call them daemons out of habit) to perform a task and then, once those three are done, notifying another process. You could implement this as a specific/unique message left by your 3 workers.
For example: worker A finishes his job, so he writes a custom message to the special_messages_table. Workers B and C finish their tasks and also write to this table. The entire time these daemons are processing, your final daemon would be polling the special_messages_table to see whether all three jobs had finished. Once it detects that they have, it can then start.
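As a rough sketch (the SpecialMessage model and worker names are made up), the final daemon's polling loop could look like:

loop do
  # Count how many of the three workers have left a "finished" message
  finished = SpecialMessage.where(worker: %w[a b c], status: 'finished').distinct.count(:worker)
  if finished == 3
    DatabasePopulator.run   # hypothetical: load the three results into the DB
    break
  end
  sleep 10
end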
This is just a rough outline of how you can use daemon processes to accomplish what you are asking. If you provide more details I would be happy to refine my answer. Don't be afraid of daemons!
I'm working with Amazon SQS queues and I have a class that consumes the messages on the queue. I am trying to get the messages consumed as close to real time as possible, so I need the consuming code to run endlessly. There will be messages on the queue consistently for more than half the day.
There are a few solutions I have come across to run this endlessly and I am wondering if there is a best practice for this type of need.
Option 1
On the web server use delayed_job or sidekiq to run the process continuously in the background.
Option 2
Have a separate server have a ruby application dedicated to consuming the messages.
Option 3
Placing the SQS consumer in a rake task and using a system call to fire off the task in the background.
Any insight is appreciated!
You can use shoryuken.
It will consume your messages continuously for as long as your queue has messages.
shoryuken -r your_worker.rb -C shoryuken.yml \
-l log/shoryuken.log -p shoryuken.pid -d
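your_worker.rb defines a standard Shoryuken worker; a minimal example (the queue name and processing call are placeholders):

class YourWorker
  include Shoryuken::Worker
  shoryuken_options queue: 'your-queue-name', auto_delete: true

  def perform(sqs_msg, body)
    MessageProcessor.process(body)   # hypothetical processing class
  end
end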
As you've probably already discovered, there isn't one obvious right way™ to handle this kind of thing. It depends a lot on what work you do for each job, the size of your app and infrastructure, and your personal preferences on APIs, message queuing philosophies, and architecture.
That said, I'd probably lean towards option 2 based on your description. Sidekiq and delayed_job don't speak SQS, and while you could teach them with something like sidekiq-sqs, it sounds like you might outgrow them pretty quick. Unless you need your Rails environment available to your workers, you'd have better luck separating your queue consumers into distinct applications, which makes it easy to scale horizontally just by starting more processes. It also allows you to further decouple the workers from your Rails app, which can make things easier to deploy and administer.
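Such a standalone consumer can be a very small Ruby program. As a rough sketch using the aws-sdk-sqs gem's queue poller (the queue URL and processing class are placeholders):

require 'aws-sdk-sqs'

poller = Aws::SQS::QueuePoller.new('https://sqs.us-east-1.amazonaws.com/123456789012/your-queue')

# Long-polls forever; a message is deleted automatically when the block returns without raising
poller.poll(wait_time_seconds: 20) do |msg|
  MessageProcessor.process(msg.body)
end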
Option 3 is a non-starter IMO. You'll want to have a daemon running to process jobs as they come in, and if rake has to load your environment on each job, things are going to get sloooow.
I am using a background job system (Sidekiq) in my app to manage heavy jobs that should not block the UI.
I would like to transmit data from the background job to the main thread when the job is finished, e.g. the status of the job or the data produced by the job.
At the moment I use Redis as middleware between the main thread and the background jobs. It stores the data, status, etc. of the background jobs, so the main thread can read what is happening behind the scenes.
My question is: is this good practice for passing data between the scheduled jobs and the main thread (using Redis or a key-value cache)? Are there other approaches? Which is best, and why?
Redis pub/sub is the thing you are looking for.
You just subscribe the main thread to a channel with the SUBSCRIBE command, and the worker announces the job status on that channel with the PUBLISH command.
As you already have Redis inside your environment, you don't need anything else to start.
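A minimal sketch with the redis gem (the channel name and payload are made up; note that subscribe blocks, so the listener needs its own thread or process):

require 'redis'
require 'json'

# In the Sidekiq worker, once the job is done (jid is Sidekiq's job id):
Redis.new.publish('job_status', { job_id: jid, status: 'done' }.to_json)

# In the listener:
Redis.new.subscribe('job_status') do |on|
  on.message do |_channel, message|
    payload = JSON.parse(message)
    # react to the finished job here
  end
end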
Here are two other options that I have used in the past:
Unix sockets. This was extremely fiddly; creating and closing connections was a nuisance, but it does work. Dealing with cleaning up sockets and interacting with the file system is also a bit involved. Would not recommend.
Standard RDBMS. This is very easy to implement, and made sense for my use case, since the heavy job was associated with a specific model, so the status of the process could be stored in columns on that table. It also means that you only have one store to worry about in terms of consistency.
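In that setup the sketch is just a couple of status updates on the model (the column and model names are assumptions about your schema):

# In the background job
report.update!(status: 'processing')
# ... heavy work ...
report.update!(status: 'done', result: result_data)

# In the main thread (e.g. a controller or polling endpoint)
report.reload.status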
I have used memcached as well, which does the same thing as Redis; here's a discussion comparing their features if you're interested. I found this to work well.
If Redis is working for you then I would stick with it. As far as I can see it is a reasonable solution to this problem. The only things that might cause issues are generating unique keys (probably not that hard), and also making sure that unused cache entries are cleaned up.
I am trying to find out the best way to run scripts in the background. I have been looking around and found plenty of options, but many/most seem to have become inactive in the past few years. Let me describe my needs.
The rails app is basically a front-end to configure when and how these scripts will be run. The scripts run and generate reports and send email alerts. So the user must be able to configure the start times and how often these scripts will run dynamically. The scripts themselves should have access to the rails environment in order to save the resulting reports in the DB.
Just trying to figure out the best method from the myriad of options.
I think you're looking for a background job queuing system.
For that, you're looking at either resque or delayed_job. Both support scheduling tasks at some point in the future -- delayed_job does this natively, whereas resque has a plugin for it called resque-scheduler.
You would enqueue jobs in the background with parameters that you specify, and then at the time you selected they'll be executed. You can set jobs to recur indefinitely or a fixed number of times (at least with resque-scheduler, not sure about delayed_job).
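For example, with resque-scheduler a future one-off job and a recurring job look roughly like this (FetchReportsJob and the schedule entry are placeholders):

# One-off job two hours from now
Resque.enqueue_in(2.hours, FetchReportsJob, report_config_id)

# Recurring job, declared in the resque-scheduler schedule (e.g. resque_schedule.yml):
# fetch_reports:
#   cron: "0 */2 * * *"
#   class: FetchReportsJob
#   queue: reports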
delayed_job is easier to set up since it saves everything in the database. resque is more robust but requires you to have redis in your stack -- but if you do already it's pretty much the ideal solution for your problem.
I recently learned about Sidekiq, and I think it is really great.
There's also a RailsCast about it - Sidekiq.
Take a look at the gem whenever at https://github.com/javan/whenever.
It allows you to schedule tasks like cron jobs.
Works very well under Linux, and the last commit was 14 days ago. A friend of mine used it in a project and was pretty satisfied with it.
Edit: take a look at the gem delayed_job as well; it is good for executing long tasks in the background. Useful when creating a cron job only to kick off other tasks.
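With delayed_job the usual pattern is to push any method call into the background via .delay (ReportBuilder and generate are just example names):

# Runs the class method in a background worker instead of inline
ReportBuilder.delay.generate(user.id)

# Or make an instance method always run asynchronously from inside the class:
# handle_asynchronously :generate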
I need to build a background job that goes through a list of RSS feeds and analyze them say every 10 minutes.
I have been using delayed_job for handling background jobs and I liked it a lot. I believe, though, that it's not built for recurring background jobs. I guess I can auto-schedule a background job at the end of each one (maybe with begin..rescue just to ensure it gets executed). Or preschedule, say, a month's worth of jobs in advance and have another job that reschedules them every month, etc.
This raised some concerns for me as I started asking myself: what if the server goes down in the middle of execution and the jobs don't get scheduled?
I have also looked at the Daemons gem, which seems like it just runs simple Ruby scripts with start/stop commands. I like the way delayed_job schedules and handles retries.
What do you recommend using in this case? What do you think the best way to design such a system with recurring background jobs? Also do you know a way I can monitor that background process and get notified if it stops?
I just implemented delayed_job for a similar task (using :run_at => 2.days.from_now) and found it to be a perfect fit. The easiest way to handle your concern about a process failing is to make creating the next job the first step of the job itself. Also, you can create a has_many relationship to the delayed_job model, which would allow you to access :last_error. Or look at the "Hooks" section of the readme; it has a perfect example for failure.
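A minimal sketch of that self-rescheduling pattern (AnalyzeFeedsJob, the Feed model, and FeedAnalyzer are hypothetical names; the enqueue call is delayed_job's standard API):

class AnalyzeFeedsJob
  def perform
    # Re-enqueue first, so the schedule survives even if the work below raises
    Delayed::Job.enqueue(AnalyzeFeedsJob.new, run_at: 10.minutes.from_now)

    Feed.find_each { |feed| FeedAnalyzer.analyze(feed) }
  end
end

# Kick off the first run:
Delayed::Job.enqueue(AnalyzeFeedsJob.new, run_at: 10.minutes.from_now)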
I think that this was a similar question: A cron job for rails: best practices? - not only are there answers, but also links to railscasts about background jobs in rails.
I used cron + delayed_job, but the scheduled tasks were supposed to run a few times a day, mostly just once.
Take a look at SimpleWorker. It's an elastic scheduling and background processing worker queue. It's cloud based and has persistence and redundancy so you don't need to worry if your servers go down or are restarted.
Very flexible in terms of scheduling, provides great introspection of jobs in the queue as well as notifications on status and errors.
Full disclosure: I work at SimpleWorker.