I just wanted to know what the best approach would be.
Let's say I have 3 processes; each one does its job, calculates, and passes data to a final process whose role is to take the data from the other processes and populate a DB.
The reason for leaving the final process by itself is that the 3 other processes may take a variable amount of time to complete, so I want each one to pass its data to the final one as soon as it has finished its job, to avoid wasting time, and I don't want multiple processes writing to the DB at the same time.
But to do this, each process needs to know whether the final process is busy or not, and send its data if it is available, otherwise wait for it to finish before sending.
My idea is to use the 'whenever' gem and create 3 processes that would run on their own, but I am puzzled by the last one, as I don't know much about daemons and the like, and I know I might be making all of this much more complicated than it really is.
Any suggestion is welcome, thank you.
So I think I can provide some insight into your problem. My dev team uses a home-grown messaging queue that's backed by our database. That means that messages (job metadata) are stored in our messages table.
Our Rails app then creates a daemon process using the daemons gem, which makes instantiating daemon processes much simpler. There's no need to be afraid of what daemon processes are; they are just Linux/Unix processes that run in the background.
You specifically mention that you don't want multiple processes to write to your DB. It really sounds like you are concerned about deadlock issues from multiple daemons trying to read/write to the same table (please correct me if you are not, so I can modify my answer).
In order to avoid this issue, you can use row-level locking for your messages table. That way a daemon doesn't have to lock the entire table every time it wants to see if there are any jobs to pick up.
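As a rough sketch of what that might look like with ActiveRecord on PostgreSQL (the Message model, its status column, and the use of SKIP LOCKED are my assumptions, not something from the question):

message = Message.transaction do
  # FOR UPDATE SKIP LOCKED (PostgreSQL 9.5+) lets each daemon claim one pending
  # row without blocking the other daemons or locking the whole table.
  row = Message.where(status: "pending")
               .order(:created_at)
               .lock("FOR UPDATE SKIP LOCKED")
               .first
  row&.update!(status: "processing")
  row
end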
You also mention using 3 processes (I also call them daemons out of habit) to perform a task, then, once those three are done, notifying another process. You could implement this as a specific/unique message left by your 3 workers.
For example: worker A finishes his job, so he writes a custom message to the special_messages_table. Workers B and C finish their tasks and also write to this table. The entire time these daemons are processing, your final daemon would be polling the special_messages_table to see if that combination of three jobs had finished. Once it detects that they have, it can start.
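A rough sketch of that final daemon, using the daemons gem the way this answer suggests (SpecialMessage, its job_name/processed columns, and the worker names are all placeholders I'm assuming, not part of the question):

require "daemons"

Daemons.run_proc("db_writer") do
  # In a real app you would also load your Rails environment here so the
  # ActiveRecord models are available inside the daemon.
  loop do
    finished = SpecialMessage.where(processed: false).pluck(:job_name)
    if (%w[worker_a worker_b worker_c] - finished).empty?
      # All three workers have reported in: do the single DB write here,
      # then mark these messages as handled.
      SpecialMessage.where(processed: false).update_all(processed: true)
    end
    sleep 1
  end
end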
This is just a rough outline of how you can use daemon processes to accomplish what you are asking. If you provide more details I would be happy to refine my answer. Don't be afraid of daemons!
I have a PendingEmail table which I push many records to for emails I want to send.
I then have multiple Que workers which process my app's jobs. One of said jobs is my SendEmailJob.
The purpose of this job is to check PendingEmail, pull the latest 500 records ordered by priority, make a batch request to my 3rd-party email provider, wait for the array of all 500 responses, then delete the successful items and set the failed records' error column. The single job continues in this fashion until the DB returns 0 records, at which point the job exits/destroys itself.
The issues are:
It's critical only one SendEmailJob processes email at one time.
I need to check the database every second unless a SendEmailJob is already running. If one is running, there's no issue, as that job will get to the new records in ~3 seconds.
If a table is locked (however that may be), my app/other workers MUST still be able to INSERT, as other parts of my app need to add emails to the table. I mainly just need to restrict SELECT I think.
All this needs to be FAST. Part of the reason I did it this way is performance, as I'm sending millions of emails in a short timespan.
Currently my jobs are initiated with a clock process (Clockwork), so it would add this job every 1 second.
What I'm thinking...
Que already uses advisory locks and other PG mechanisms. I'd rather not attempt to mess with that table trying to prevent adding more than one job in the first place. Instead, I think it's ok that potentially many SendEmailJob could be running at once, as long as they abort early if there is a lock in place.
Apparently there are some Rails ways to do this, but I assume I will need to execute SQL directly against PG to initiate some sort of lock in each job; before doing that, the job would check whether a lock already exists, and if it does, abort.
I just don't know which type of lock to choose, whether to do it in Rails or in the database directly. There are so many of them with such subtle differences (I'm using PG). Any insight would be greatly appreciated!
Answer: I needed an advisory lock.
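For anyone landing here, a minimal sketch of what that can look like (assuming an ActiveJob-style class; a plain Que job would look slightly different, and the lock key and job body are my own placeholders):

class SendEmailJob < ApplicationJob
  LOCK_KEY = 123_456  # any integer unique to this job type

  def perform
    conn = ActiveRecord::Base.connection
    # pg_try_advisory_lock returns immediately: true if we got the lock, false otherwise.
    got_lock = conn.select_value("SELECT pg_try_advisory_lock(#{LOCK_KEY})")
    return unless got_lock == true || got_lock == "t"  # older adapters return "t"/"f"

    begin
      # ... batch-send from PendingEmail until no rows remain ...
    ensure
      conn.select_value("SELECT pg_advisory_unlock(#{LOCK_KEY})")
    end
  end
end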
I am using Twilio to send/receive texts in a Rails 4.2 app. I am sending in bulk, around 1000 at a time, and receiving sporadically.
Currently when I receive a text I save it to the DB (to, from, body) and then pass that record to an ActiveJob worker to process later. For sending messages I currently persist the Twilio params to another DB and pass that record to a different ActiveJob worker. Since I am often doing it in batches I have two workers. The first outgoing message worker sends a single message. The second one queries the DB and finds all the user who should receive the message, creates a DB record for each message that should be sent, and then passes that record to the first outgoing message worker. So the second one basically just creates a bunch of jobs for the first one to process.
Right now I have the workers destroying the records once they finish processing (both incoming and outgoing). I am worried about not persisting things in case the server, Redis, or Resque goes down, but I do not know if this is actually a good design pattern. It was suggested to me to just use a vanilla Ruby object and pass its id to the worker, but I am not sure how that affects data reliability. So is it overkill to be creating all these DB records, and should I just be creating vanilla Ruby objects and passing those objects' ids to the workers?
Any and all insight is appreciated,
Drew
It seems to me that the approach of sending a minimal amount of data to your jobs is the best approach. Check out the 'Best Practices' section on the sidekiq wiki: https://github.com/mperham/sidekiq/wiki/Best-Practices
What if your queue backs up and that quote object changes in the meantime? Don't save state to Sidekiq, save simple identifiers. Look up the objects once you actually need them in your perform method.
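A minimal sketch of that pattern, assuming an ActiveJob class and an IncomingMessage model standing in for whichever table holds the Twilio record:

class ProcessIncomingMessageJob < ActiveJob::Base
  queue_as :default

  def perform(message_id)
    # Look the record up only when the job actually runs, so we always see fresh data.
    message = IncomingMessage.find_by(id: message_id)
    return if message.nil?  # record was deleted in the meantime; nothing to do

    # ... process the text here ...
  end
end

# Enqueue with just the id:
# ProcessIncomingMessageJob.perform_later(message.id)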
Also, in terms of reliability: you should be worried about your job queue going down. It happens. You either design your system to be fault tolerant of a failure, or you find a job queue system with higher reliability guarantees (though even then, no queue system can guarantee 100% message deliverability). Sidekiq Pro has better reliability guarantees than Sidekiq (non-pro), but if you design your jobs with a little bit of forethought, you can create jobs that can scan your database after a crash and re-queue any jobs that may have been lost.
How much work you spend designing fault-tolerant solutions really just depends on how critical it is that your information makes it from point A to point B :)
I am using a background job system (Sidekiq) in my app to manage some heavy jobs that should not block the UI.
I would like to transmit data from the background job to the main thread when the job is finished, e.g. the status of the job or the data produced by the job.
At the moment I use Redis as middleware between the main thread and the background jobs. It stores the data, status, ... of the background jobs so the main thread can read what is happening behind the scenes.
My question is: is this a good practice for passing data between a scheduled job and the main thread (using Redis or a key-value cache)? Are there other approaches? Which is best and why?
Redis pub/sub is the thing you are looking for.
You just subscribe the main thread to a channel with the SUBSCRIBE command, and the worker announces the job status on that channel with the PUBLISH command.
As you already have Redis inside your environment, you don't need anything else to start.
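A rough sketch with the redis-rb gem (the channel name and payload are arbitrary examples):

require "redis"
require "json"

# In the worker, when the job finishes:
Redis.new.publish("job_events", { job_id: 42, status: "done" }.to_json)

# In the listener. Note that SUBSCRIBE blocks, so run this in its own
# thread or process, not inside a web request:
Redis.new.subscribe("job_events") do |on|
  on.message do |_channel, raw|
    event = JSON.parse(raw)
    puts "Job #{event['job_id']} finished with status #{event['status']}"
  end
end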
Here are two other options that I have used in the past:
Unix sockets. This was extremely fiddly, creating and closing connections was a nuisance, but it does work. Also dealing with cleaning up sockets and interacting with the file system is a bit involved. Would not recommend.
Standard RDBMS. This is very easy to implement, and made sense for my use case, since the heavy job was associated with a specific model, so the status of the process could be stored in columns on that table. It also means that you only have one store to worry about in terms of consistency.
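A minimal sketch of that approach, assuming a hypothetical job_status string column on the model the heavy job belongs to:

class HeavyWorker
  include Sidekiq::Worker

  def perform(record_id)
    record = Record.find(record_id)
    record.update!(job_status: "running")
    # ... do the heavy work here ...
    record.update!(job_status: "done")
  rescue StandardError
    record&.update!(job_status: "failed")
    raise
  end
end

# The main thread simply reads the column when it needs the status:
# Record.find(record_id).job_status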
I have used memcached as well, which does the same thing as Redis; here's a discussion comparing their features if you're interested. I found this to work well.
If Redis is working for you then I would stick with it. As far as I can see it is a reasonable solution to this problem. The only things that might cause issues are generating unique keys (probably not that hard), and also making sure that unused cache entries are cleaned up.
I need to access and pull data from a number of APIs over the course of a number of days. This is streaming data, so the process will be running all the time. Each process will pull in data and insert it into a separate Google Fusion Table.
I want to run these processes in the background and forget about them, just being able to monitor whether they fail and don't restart.
I have looked at Delayed Job, Resque, Beanstalk, etc., and my question is: can these run processes concurrently? I don't want to queue processes, just run them in the background.
I looked at Spawn as well, but didn't completely understand how it worked.
So what options are available to me? Does anybody have any recommendations?
I would use the whenever gem to schedule cron jobs to pull data.
# config/schedule.rb
every 2.hours do
  runner "YourApi.do_whatever"
  runner "SecondApi.do_the_thing"
end
Maybe a custom background daemon is a better fit for you; take a look at daemon_generator. Note that you will probably have to do some work if you want to do things concurrently, but just processing things serially should be quite easy.
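A rough sketch of the custom-daemon route using the daemons gem (daemon_generator wraps similar ideas); the API classes, the pull_and_store method, and the thread-per-API layout are placeholders I'm assuming:

require "daemons"

Daemons.run_proc("api_pullers") do
  # One thread per API gives simple concurrency; collapse this to a single
  # loop if serial processing is enough for you.
  threads = [FirstApi, SecondApi, ThirdApi].map do |api|
    Thread.new do
      loop do
        api.pull_and_store  # hypothetical: fetch new data, insert into its Fusion Table
        sleep 5
      end
    end
  end
  threads.each(&:join)
end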
I'm working on a Rails application that periodically needs to perform large numbers of IO-bound operations. These operations can be performed asynchronously. For example, once per day, for each user, the system needs to query Salesforce.com to fetch the user's current list of accounts (companies) that he's tracking. This results in huge numbers (potentially > 100k) of small queries.
Our current approach is to use ActiveMQ with ActiveMessaging. Each of our users is pushed onto a queue as a different message. Then, the consumer pulls the user off the queue, queries Salesforce.com, and processes the results. But this approach gives us horrible performance. Within a single poller process, we can only process a single user at a time, so the Salesforce.com queries become serialized. Unless we run literally hundreds of poller processes, we can't come anywhere close to saturating the server running the pollers.
We're looking at EventMachine as an alternative. It has the advantage of allowing us to kickoff large numbers of Salesforce.com queries concurrently within a single EventMachine process. So, we get great parallelism and utilization of our server.
But there are two problems with EventMachine. 1) We lose the reliable message delivery we had with ActiveMQ/ActiveMessaging. 2) We can't easily restart our EventMachine processes periodically to lessen the impact of memory growth. For example, with ActiveMessaging, we have a cron job that restarts the poller once per day, and this can be done without worrying about losing any messages. But with EventMachine, if we restart the process, we could literally lose hundreds of messages that were in progress. The only way I can see around this is to build a persistence/reliable-delivery layer on top of EventMachine.
Does anyone have a better approach? What's the best way to reliably execute large numbers of asynchronous IO-bound operations?
I maintain ActiveMessaging, and have been thinking about the issues of a multi-threaded poller as well, though perhaps not at the same scale you guys are. I'll give you my thoughts here, but am also happy to discuss further on the ActiveMessaging list, or via email if you like.
One trick is that the poller is not the only serialized part of this. STOMP subscriptions, if you do client -> ack in order to prevent losing messages on interrupt, will only get sent a new message on a given connection when the prior message has been ack'd. Basically, you can only have one message being worked on at a time per connection.
So to keep using a broker, the trick will be to have many broker connections/subscriptions open at once. The current poller is pretty heavy for this, as it loads a whole Rails env per poller, and one poller is one connection. But there is nothing magical about the current poller; I could imagine writing a poller as an EventMachine client that creates many connections to the broker and works on many messages at once.
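As a rough illustration of the many-connections idea with the plain stomp gem (not ActiveMessaging's poller; credentials, queue name, and connection count are placeholders):

require "stomp"

clients = Array.new(10) do
  Stomp::Client.new("guest", "guest", "localhost", 61613)
end

clients.each do |client|
  # With :ack => "client", the broker only delivers the next message on this
  # connection after the current one is acknowledged, so each extra connection
  # buys you one more in-flight message.
  client.subscribe("/queue/salesforce_users", :ack => "client") do |msg|
    # ... query Salesforce.com for the user named in msg.body ...
    client.acknowledge(msg)
  end
end

clients.each(&:join)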
In my own experiments lately, I have been thinking about using Ruby Enterprise Edition and having a master thread that forks many poller worker threads so as to get the benefit of the reduced memory footprint (much like passenger does), but I think the EM trick could work as well.
I am also an admirer of the Resque project, though I do not know that it would be any better at scaling to many workers - I think the workers might be lighter weight.
http://github.com/defunkt/resque
I've used AMQP with RabbitMQ in a way that would work for you. Since ActiveMQ implements AMQP, I imagine you can use it in a similar way. I have not used ActiveMessaging, which, although it seems like an awesome package, I suspect may not be appropriate for this use case.
Here's how you could do it, using AMQP (a rough code sketch follows the steps):
Have the Rails process send a message saying "get info for user i".
The consumer pulls this off the message queue, making sure to specify that the message requires an 'ack' to be permanently removed from the queue. This means that if the message is not acknowledged as processed, it is returned to the queue for another worker eventually.
The worker then spins off the message into the thousands of small requests to SalesForce.
When all of these requests have successfully returned, another callback should be fired to ack the original message and return a "summary message" that has all the info germane to the original request. The key is using a message queue that lets you acknowledge successful processing of a given message, and making sure to do so only when relevant processing is complete.
Another worker pulls that message off the queue and performs whatever synchronous work is appropriate. Since all the latency-inducing bits have already been performed, I imagine this should be fine.
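Here's a sketch of the ack-based part of that workflow with the bunny gem against RabbitMQ (queue name, payload shape, and the SalesforceFetcher class are placeholders, not from the answer):

require "bunny"
require "json"

conn = Bunny.new
conn.start
channel = conn.create_channel
queue   = channel.queue("user_sync", durable: true)

# Producer side (the Rails process):
channel.default_exchange.publish({ user_id: 42 }.to_json,
                                 routing_key: queue.name, persistent: true)

# Consumer side: manual_ack means an unacknowledged message is requeued
# if this worker dies before finishing.
queue.subscribe(manual_ack: true, block: true) do |delivery_info, _props, payload|
  user_id = JSON.parse(payload)["user_id"]
  SalesforceFetcher.fetch_accounts_for(user_id)  # the many small Salesforce queries
  channel.ack(delivery_info.delivery_tag)        # ack only after everything succeeded
end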
If you're using (C)Ruby, try to never combine synchronous and asynchronous stuff in a single process. A process should either do everything via Eventmachine, with no code blocking, or only talk to an Eventmachine process via a message queue.
Also, asynchronous code is incredibly useful, but it is difficult to write, difficult to test, and bug-prone. Be careful. Investigate using another language or tool if appropriate.
Also check out "cramp" and "beanstalk".
Someone sent me the following link: http://github.com/mperham/evented/tree/master/qanat/. This is a system that's somewhat similar to ActiveMessaging except that it is built on top of EventMachine. It's almost exactly what we need. The only problem is that it seems to only work with Amazon's queue, not ActiveMQ.