Does Rails HireFire support queues?

Background:
I have 50 clients (for example). Their data is partitioned into 50 different schemas in PostgreSQL.
I feel it's a good idea to keep their processing as separate as possible, so I think putting their DJs into different queues is a good idea, or at least grouping them into queues based on their load (because I have a limit on the number of workers).
If Client_A has 10 large actions in the queue, Client_B shouldn't have to wait for them to be done just to send an email.
DJ supports queue-based workers. I could be wrong, but I don't see a way to set queues in the HireFire paradigm.
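For context, this is the DelayedJob queue mechanism I mean; the HeavyReport class and the queue names are just examples:

    # Enqueue onto a client-specific queue with DelayedJob:
    HeavyReport.new(client).delay(queue: "client_a").generate!

    # A worker can then be bound to one queue, or several:
    #   bin/delayed_job --queue=client_a start
    #   bin/delayed_job --queues=client_a,client_b start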
Does anyone know how to set up HireFire to run on a given queue?
I see more issues coming, but I'll ignore them for now :)

Related

Scaling chat log workers horizontally

I've thought about this a lot but can't come up with a solution I'm happy with.
Basically, this is the problem: log 100k+ chats (some slower, some faster) into Cassandra, saving the userId, channelId, timestamp, and the message.
Cassandra already supports horizontal scaling out of the box, I have no issue here.
Now, my software that reads these chats does it over TCP (IRC). Something like 300 messages/sec is usual for the top 1k channels, and from my experiments a single IRC connection can't handle that.
What I now want to build is multiple instances of the logger (with Docker/Kubernetes) and share the load between them. So ideally, if I have maybe 4 workers and 1k chats (for example), they would each join at least 250 channels. I say at least because I would want optional redundancy, so I can have 2 loggers in the same chat to make sure no messages get lost.
There is no issue with duplicates, because all messages have a unique ID.
Now, how would I best and dynamically share the currently joined channels between the workers? I want to avoid having a master or controlling point. It should also be easy to add more workers that then reduce the load on the other workers.
Are there any good articles about this kind of behaviour? Maybe good concepts or protocols that are already defined? Like I said, I want to avoid another central control point, so no RabbitMQ, Redis, or whatever.
Edit: I've looked into something like the Raft consensus algorithm, but I don't think it makes sense here, since I don't want my clients to agree on a shared state; I want to divide the state between them "equally".
I think in this case looking for a description of an existing algorithm might not be very useful: the problem is not complicated or generic enough to be worth a publication.
As described, the problem could be solved by using Cassandra itself as a mediator to share chat channel assignment information among the workers.
So (the trivial part) channels would have IDs and assigned worker ID(s), plus, in the optional case of redundancy, the required number of workers (2 or whatever number of workers you want processing this chat). A worker, before assigning itself to a channel, would check whether there are already enough assignees. If so, it would continue to the next channel; if not, it would assign itself to the channel. This is one of the options (alternatively, you could have workers hold the channel IDs, but since redundancy is rare, this way seems simpler). Workers would have a limit on how many channels they can process and would not try to exceed it by assigning themselves more channels.
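A minimal sketch of that check-then-assign step, assuming the DataStax cassandra-driver gem and a made-up channel_assignments table (channel_id, worker_id, required):

    # Sketch only. `session` is an already-connected Cassandra session.
    def try_assign(session, worker_id, channel_id, required = 1)
      assignees = session.execute(
        "SELECT worker_id FROM channel_assignments WHERE channel_id = ?",
        arguments: [channel_id]
      ).to_a
      return :skip if assignees.length >= required  # already enough workers

      # Not atomic: two workers may both pass the check above. That is the
      # over-assignment case resolved by the priority-based resignation below.
      session.execute(
        "INSERT INTO channel_assignments (channel_id, worker_id, required) VALUES (?, ?, ?)",
        arguments: [channel_id, worker_id, required]
      )
      :assigned
    end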
Now we only have to deal with the case of too many workers assigning themselves to the same channel, exceeding the requirement and exhausting worker capacity by all monitoring the same channels. Otherwise, if they all start at once, channels might end up with more assigned workers than needed. Even though this is unlikely to create a real problem in the described case (just a bit more redundancy than requested), you can handle it by prioritising workers. Much like the employment of school teachers in BC, Canada, which is done on a seniority basis (the most senior gets the job first), except that here it would be done voluntarily by the workers themselves rather than by the school administration. What this means is that each worker would check all its assigned channels and, should there be more workers than needed at the time, check whether it has the lowest priority among all the assignees. If it does, it would resign: remove itself and stop processing the channel.
That requires assigning distinct priorities to the workers, which could easily be achieved when spawning them by simply giving each the next sequential number (the oldest gets the highest priority, or vice versa if you are concerned about old, potentially dying workers taking up all the load and would prefer new ones to take on more while still fresh). More elaborately, this could also be done using Cassandra lightweight transactions, as described in one of the answers here (the one by AlonL). With just a few (you mentioned ~4) workers, either way should work, and the scaling concerns mentioned in the other answers there aren't a big deal for a few integer priorities. Also, instead of sequential number assignment, requiring the workers to self-assign a random 32-bit integer priority on initialization has virtually no chance of collision, so a "retry until no collisions" loop should exit on the very first iteration (which makes the second iteration a rarely exercised code path that would need an explicit test).
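And the voluntary resignation step might look roughly like this, assuming the same hypothetical table also stores each assignee's priority:

    # Each worker self-assigns a random 32-bit priority once at startup and
    # stores it alongside worker_id whenever it assigns itself to a channel.
    MY_PRIORITY = rand(2**32)

    # For a channel this worker is assigned to: if there are more assignees
    # than required and this worker has the lowest priority, resign.
    def maybe_resign(session, worker_id, channel_id, required)
      assignees = session.execute(
        "SELECT worker_id, priority FROM channel_assignments WHERE channel_id = ?",
        arguments: [channel_id]
      ).to_a
      return :keep if assignees.length <= required

      lowest = assignees.min_by { |row| row["priority"] }
      return :keep unless lowest["worker_id"] == worker_id

      session.execute(
        "DELETE FROM channel_assignments WHERE channel_id = ? AND worker_id = ?",
        arguments: [channel_id, worker_id]
      )
      :resigned
    end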
The trick is basically to limit the amount of data requiring synchronisation and to put the burden of regulation onto the workers themselves. There is no need for consensus algorithms, since there is not much complexity and we are not dealing with a huge number of potentially fraudulent workers trying to get assignments ahead of more senior peers.
The only issue I should mention is that there could be implicit worker rotation if a channel goes offline, which makes its worker stop processing it. You will get a different worker assignment the next time the channel comes online.

Do you have to use worker pools in Erlang?

I have a server I am creating (a messaging service) and I am doing some preliminary tests to benchmark it. So far, the fastest way to process the data is to do it directly in the user's process and to use worker pools. I have tested spawning, and that is unbelievably slow.
The test is just connecting 10k users and having each one send 15KB of data a couple of times at the same time (or trying to, at least), with the server processing the data (total length, headers, and payload).
The issue I have with worker pools is that they are only fast when you have enough workers to offset the number of connections. For example, if you have 500k or 1 million users, you would need more workers to process all the concurrent data coming in. And, from my testing, having 1000 workers would make it unusable.
So my question is the following: when does it make sense to use pools of workers? Will there be a tipping point where I would have to use workers to process the data to free up the user process? How many workers is too many? Is 500,000 too many?
And, if workers are the way to go (for those massive concurrent distributed servers), I am guessing you can dynamically create/delete them as needed?
Any literature is also appreciated!
Thanks for your answer!
Maybe worker pools are not the best tool for your problem. If I were you, I would try using Jay Nelson's epocxy, which gives you a very basic backpressure mechanism while still letting you parallelize your tasks. From that library, I would check out either the concurrency fount or the concurrency control tools.

Running large amount of long running background jobs in Rails

We're building a web app where users will be uploading potentially large files that will need to be processed in the background. The task involves calling third-party APIs, so each job can take several hours to complete. We're using DelayedJob to run the background jobs.

With every user kicking off a background job, each of which takes a few hours to finish, that adds up to a lot of background jobs very quickly. I am wondering what the best way to set up the deployment for this would be. We're currently hosted on DigitalOcean. I've kicked off 10 DelayedJob workers. Each one (when idle) takes up 157MB; when actively running, it uses around 900MB. Our user base right now is pretty small, so it's not an issue, but it will be one soon. So on a 4GB droplet, I can probably run only 2 or 3 workers at a time.

How should we approach this issue? Should we be looking at using DigitalOcean's API to auto-spin cheap droplets on demand? Should we subscribe to high-memory droplets on a monthly basis instead? If we go with auto-spinning droplets, should we stick with DigitalOcean, or would Heroku make more sense? Or is the entire approach wrong, and should we be coming at it from an entirely different direction? Any help/advice would be very much appreciated.
Thanks!
It sounds like you are limited by memory on the number of workers that you can run on your DigitalOcean host.
If you are worried about scaling, I would focus on making the workers as efficient as possible. Have you done any benchmarking to understand where the 900MB of memory is being allocated? I'm not sure what the nature of these jobs is, but you mentioned large files. Are you reading the contents of these files into memory, or are you streaming them? Are you using a database with SQL you can tune? Are you making many small API calls when you could be using a batch endpoint? Are you assigning intermediary variables that must then be garbage collected? Can you compress the files before you send them?
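On the file-reading point, for example, the difference between slurping a file and streaming it is often where a big chunk of that memory goes. A generic Ruby illustration (process is a placeholder for whatever the job does with the data):

    # Reads the whole file into memory at once: a 500MB upload becomes
    # (at least) a 500MB Ruby string held by the worker.
    data = File.read(path)
    process(data)

    # Streams the file in fixed-size chunks, keeping memory roughly constant.
    File.open(path, "rb") do |f|
      while (chunk = f.read(64 * 1024))
        process(chunk)
      end
    end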
Look at the job structure itself. I've found that background jobs work best as many smaller jobs rather than one larger job. This allows execution to happen in parallel and to be better load-balanced across all workers. You could even have a job that generates other jobs. If you need a job to orchestrate callbacks when a group of jobs finishes, there is a DelayedJobGroup plugin at https://github.com/salsify/delayed_job_groups_plugin that allows you to invoke a final job only after the sibling jobs complete. I would aim for the execution time of a single job to be under 30 seconds. This is arbitrary, but it illustrates what I mean by smaller jobs.
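A rough sketch of the "job that generates other jobs" idea with DelayedJob; the Upload model, each_row_batch method, and ThirdPartyApi are invented for illustration:

    # Parent job: splits the uploaded file and enqueues one small job per batch.
    class SplitUploadJob < Struct.new(:upload_id)
      def perform
        upload = Upload.find(upload_id)
        upload.each_row_batch(size: 100) do |rows|
          Delayed::Job.enqueue(ProcessBatchJob.new(upload_id, rows))
        end
      end
    end

    # Child job: small and fast, retried independently on failure.
    class ProcessBatchJob < Struct.new(:upload_id, :rows)
      def perform
        rows.each { |row| ThirdPartyApi.submit(row) }
      end
    end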
Some hosting providers, like Amazon, offer spot instances where you can pay a lower price for servers that do not have guaranteed availability. These pair well with the many-smaller-jobs approach I mentioned earlier.
Ruby might also not be the right tool for the job. There are faster languages, and if you are limited by memory or CPU, you might consider writing these jobs and their workers in another language like JavaScript, Go, or Rust. These can pair well with a Ruby stack, but offload computationally expensive subroutines to faster languages.
Finally, like many scaling issues, if you have more money than time, you can always throw more hardware at it. At least for a while.
I think memory and time are the bigger problems for you. You could use the Sidekiq gem for this, because it consumes less time and memory doing the same job; it uses Redis, a key-value database, as its backing store. If the problem continues, go with JavaScript.
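If you do evaluate Sidekiq, a minimal worker looks something like this (the class name, queue name, and Upload model are placeholders; Redis must be available):

    # Gemfile: gem "sidekiq"
    class FileProcessingWorker
      include Sidekiq::Worker
      sidekiq_options queue: "uploads", retry: 5

      def perform(upload_id)
        # Sidekiq runs jobs on threads inside one process, so concurrent
        # jobs share a single Ruby VM instead of one ~157MB process each.
        Upload.find(upload_id).process!
      end
    end

    # Enqueue from the app:
    FileProcessingWorker.perform_async(upload.id)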

Practical use of delayed background job when dealing with many users

When a background job starts, it's sent to the back of a queue where a worker handles it; one task clears and the next starts. I think I've got this right, except I don't understand the practical side of it in some cases. Sure, if you're a company sending out 15,000 newsletters once a week, using a delayed job makes perfect sense. But when you have an application of even 100 users, in which some task is long enough to need background work (like sending/fetching emails that might take a minute), then each user will have to wait in line while another user's task gets cleared (in case there's a single worker).
This is the part I'm not sure I'm getting right. I'm talking about the same job, but run individually for each user. Does that count as a job per user? If I have 100 users, do I need to keep 100 workers so that no one's process gets tied up?
I've tried using delayed_job to simulate that, and indeed when I sign in with a different account I have to wait until another user's email gets sent until mine is. While the plugin is swift and simple to work with, I think it's not the right approach here.
I've also tried using Ajax, but since it's an HTTP request it ties up the browser in loading mode until it gets a response from the server (even with async: true). Not sure if I ruled this one out too quickly, but I was sort of looking for a more elegant server-side solution.
Is there a way to achieve a background job like this? (I've heard of different, mostly commercial solutions promising little waiting time, but I'm interested in completely eliminating the queue between users.) If not, is there a method to make an Ajax request without waiting for a response? I realize my two questions are drastically different, but each seems like an appropriate solution to this problem.
Resque is a background processing engine that can support multiple queues.
Ways you could use this:
Group your tasks into queues based on their priority. If you need fast response times, put the task in a 'foreground' queue; slow tasks (like sending/receiving emails) can go in a 'background' queue (see the sketch after this list)
Have one queue per user (you will need to have many many workers for this)
This SO question also gives a way to use delayed_job with multiple queues/tables
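A minimal Resque sketch of the foreground/background split mentioned above (the class, queue, and mailer names are just examples):

    # A slow task bound to the low-priority queue.
    class SendEmailJob
      @queue = :background

      def self.perform(user_id)
        UserMailer.welcome(User.find(user_id)).deliver
      end
    end

    # Enqueue:
    Resque.enqueue(SendEmailJob, user.id)

    # Run a worker that drains queues in priority order:
    #   QUEUE=foreground,background rake resque:work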
The purpose of delayed_job and other message queues is to asynchronously process jobs outside of your core application. I always use a queue for sending email since I'm relying on an outside application (sometimes a third-party API like Gmail) to send them, and I can't guarantee its availability or operating efficiency.
So for your use case, even with very few users, I highly recommend offloading emails to delayed_job. This will speed up your front end (ajax) and will also give you retries upon failure. You could spin up multiple workers to process the queue, but it shouldn't be necessary with your numbers unless your calls to send mail are taking a really long time (more than a couple seconds?).
And yes, in most situations I'd create separate jobs for each user even though the message might be identical. The only time I'd process them all together would be if the email application/API has bulk sending and you can significantly reduce the number of calls by sending a large payload in a few calls.
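Concretely, the per-user pattern with delayed_job can be as small as this (the Notifier mailer and its method are placeholders):

    # One email job per user; each is retried independently on failure.
    users.each do |user|
      Notifier.delay.digest_email(user.id)
    end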

How should I schedule many Google Search scrapes over the course of a day?

Currently, my Nokogiri script iterates through Google's SERPs until it finds the position of the target website. It does this for each keyword for each website that each user specifies (users are capped on the number of websites & keywords they can track).
Right now, it's run in a rake task that's hard-scheduled every day and batches all scrapes at once by looping through all the websites in the database. But I'm concerned about scalability and about swarming Google with a batch of requests.
I'd like a solution that scales and can run these scrapes over the course of the day. I'm not sure what kind of solution is available or what I'm really looking for.
Note: The number of websites/keywords changes from day to day as users add and delete their websites and keywords. I don't mean to make this question too superfluous, but is this the kind of thing Beanstalkd/Stalker (job queuing) could be used for?
You will have to balance two issues: scalability for lots of users versus Google shutting you down for scraping in violation of their terms of use.
So your system will need to be able to distribute tasks to various different IPs to conceal your bulk scraping, which suggests at least two levels of queuing: one to manage all the jobs and send them to each separate IP for subsequent searching and collection of results, and a queue on each separate machine to hold the requested searches until they are executed and the results returned.
I have no idea what Google's thresholds are (I am sure they don't advertise them), but exceeding them and getting cut off would obviously be devastating for what you are trying to do, so your simple looping rake task is exactly what you shouldn't do after a certain number of users.
So yes, use a queue of some sort, but realize that you probably have a different goal from the typical goal of a queue, in that you want to deliberately delay jobs rather than offload work to avoid UI delays. So you will be seeking ways to slow down the queue rather than have it just execute job after job as they arrive.
So, based on a cursory inspection of DelayedJob and BackgroundJobs, it looks like DelayedJob has what you need with its run_at attribute. But I am only speculating here, and I am sure an expert would have more to say.
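For illustration, spreading a day's scrapes out with delayed_job's run_at option might look roughly like this (the Keyword model and Scraper class are hypothetical):

    # Spread today's scrapes evenly over a 24-hour window instead of
    # firing them all at once.
    keywords = Keyword.all
    interval = 24.hours / [keywords.size, 1].max

    keywords.each_with_index do |keyword, i|
      Scraper.delay(run_at: (interval * i).from_now).scrape(keyword.id)
    end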
If I'm understanding correctly, it sounds like one of these tools might fit the bill:
Delayed_job: https://github.com/tobi/delayed_job
or
BackgroundJobs: http://codeforpeople.rubyforge.org/svn/bj/trunk/README
I've used both of them, and found them easy to work with.
There are definitely some background job libraries that might work.
delayed_job: https://github.com/collectiveidea/delayed_job (beware of the unmaintained branch from tobi!)
resque: https://github.com/defunkt/resque
However, you might think about just scheduling a cron job that runs more times during the day and processes fewer items per run.
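If you go that route, one way to express "run more often, process less each time" is a schedule like the following (this sketch assumes the whenever gem; the rake task name is invented):

    # config/schedule.rb (whenever gem)
    every 2.hours do
      # Each run handles only the scrapes whose turn it is, rather than
      # looping over every website in one big daily batch.
      rake "scrapes:run_next_slice"
    end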
SaaS solution: http://momentapp.com/ "Launch delayed jobs with scheduled http requests" - disclaimer: a) it's in beta, b) I am not affiliated with this service.