Max number of queues - amazon-sqs

Is there a limit on the number of queues you can create with AWS SQS?
This page https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-limits.html doesn't state so one way or the other.
We're not looking to create thousands of the things but might dynamically create a good few dozen for a while then destroy them. I've come across unexpected limits with AWS before (only 4 transcoding pipelines - why?) so need to be sure on this.
Thanks for any advice.
AB

Indeed there is no informations about that in the AWS documentation.
I don't think there is a limit on the queues number.
We are actually working with 28 fulltime Queues on our infrastructure without any problem.
If at least you hit a limit, a simple AWS support ticket can increase it.
Just like the Ec2 Number limit increase process.
Hope it helps

Related

Scaling chat log workers horizontally

I've thought about this a lot but can't come up with a solution I'm happy with.
Basicly this is the problem: Log 100k+ Chats (some slower, some faster) into cassandra. So save userId, channelId, timestamp and the message.
Cassandra already supports horizontal scaling out of the box, I have no issue here.
Now my software that reads these chats does it over TCP (IRC). Something like 300 messages / sec are usual for the top 1k channels and 1 single IRC connection can't handle that from my experiments.
What I now want to build is multiple instances (with Docker/Kubernetes) of the logger and share the load between those. So ideally if I have maybe 4 workers and 1k chats (example). They would each join atleast 250 channels. I say atleast because I would want optional redundancy so I can have 2 loggers in the same chat to make sure no messages get lost.
There is no issue with duplicates, because all messages have a unique ID.
Now how would I best and dynamically share the current channels joined between the workers. I wanna avoid having a master or controlling point. Should also be easy to add more workers that then reduce the load on other workers.
Are there any good articles about this kind of behaviour? Maybe good concepts or protocols already defined? Like I said i wanna avoid another central control point so no rabbitmq, redis or whatever.
Edit: I've looked into something like the Raft Consensus Algorithm, but it doesn't make sense I think, since I don't want my clients to agree on a shared state instead divide the state between them "equally".
I think in this case looking for a description of existing algorithm might be not very useful: the problem is not complicated and generic enough to be worth publication.
As described, the problem could be solved by using Cassandra itself as a mediator and to share chat channel assignment information among the workers.
So (trivial part) channels would have IDs and assigned worker ID(s), plus in the optional case of redundancy - required amount of workers (2 or whatever number of workers you want to process this chat). Worker, before assigning itself to a channel would check if there is already enough assignees. If so would continue to the next channel. If not, assign itself to the channel. This is one of the options (alternatively you can have workers holding the channel IDs, but since redundancy is rare this way seems to be simpler). Workers would have a limit of channels they can process and will not try exceeding it by assigning more channels.
Now we only have to deal with the case of assigning too much workers to the same channel, exceeding requirements and exhausting the worker capacity by monitoring all the same channels. Otherwise, if they start all at once, channels might have more assigned workers than needed. Even though it is unlikely will create a real problem in described case (just a bit more redundancy than requested), you can handle that by prioritising workers. Much like employing of school teachers in Canada, BC is done on seniority basis - the most senior gets job first, except that here it'd be voluntarily done by the workers themselves, not by school administration. What this means, is that each worker would have to check all it's assigned channels and, should there be more workers than needed at this time, would check if it has the smallest priority among all the assignees. If it does, it would resign - remove itself and stop processing the channel.
That requires assigning distinct priorities of the workers, which could be easily achieved when spawning them, by simply setting each to a next sequential number (the oldest has the highest priority, or v.v if you concerned of old, potentially dying workers taking up all the load, and would prefer new ones to take on more while still fresh). More elaborately, this could also be done by using Cassandra Lightweight transactions as described in one of the answers here (the one by AlonL). With just a few (you mentioned ~4) workers either way should work and concerns about scaling mentioned in the other answers there isn't a big deal for a few integer priorities. Also, instead of sequential number assignment, requiring the workers to self-assign a random 32-bit integer priority on initialization has virtually no chance of collision, so loop "until no collisions" should exit on the very first iteration (which would make a second iteration very rarely code path requiring an explicit test).
The trick is basically to limit the amount of data requiring synchronisation and putting the load of regulation onto the workers themselves. There is no need for consensus algorithms as there is not much complexity and we are not dealing with huge number of potentially fraudulent workers, trying to get assignments ahead of more senior peers.
The only issue I should mention is that there could be implicit worker rotation if channels go offline which makes worker to stop processing. You will get a different worker assignment next time the channel goes online.

Running large amount of long running background jobs in Rails

We're building a web-app where users will be uploading potentially large files that will need to be processed in the background. The task involves calling 3rd-party APIs so each job can take several hours to complete. We're using DelayedJob to run the background jobs. With every user kicking off a background job, each of which will take a few hours to finish, that will add up to a lot of background jobs every quickly. I am wondering what would be the best way to setup the deployment for this? We're currently hosted on DigitalOcean. I've kicked off 10 DelayedJob workers. Each one (when ideal) takes up 157MB. When actively running it utilizes around 900 MB. Our user-base right now is pretty small so it's not an issue but will be one soon. So on a 4GB droplet, I can probably run like 2 or 3 workers at a time. How should we approach this issue? Should we be looking at using DigitalOcean's API to auto-spin cheap droplets on demand? Should we subscribe to high-memory droplets on a monthly basis instead? If we go with auto-spinning droplets, should we stick with DigitalOcean or would Heroku make more sense? Or is the entire approach wrong and should we be approaching it from an entire different direction? Any help/advice would be very much appreciated.
Thanks!
It sounds like you are limited by memory on the number of workers that you can run on your DigitalOcean host.
If you are worried about scaling, I would focus on making the workers as efficient as possible. Have you done any benchmarking to understanding where the 900MB of memory is being allocated? I'm not sure what the nature of these jobs are, but you mentioned large files. Are you reading the contents of these files into memory, or are you streaming them? Are you using a database with SQL you can tune? Are you making many small API calls when you could be using a batch endpoint? Are you assigning intermediary variables that must then be garbage collected? Can you compress the files before you send them?
Look at the job structure itself. I've found that background jobs work best with many smaller jobs rather than one larger job. This allows execution to happen in parallel, and be more load balanced across all workers. You could even have a job that generates other jobs. If you need a job to orchestrate callbacks when a group of jobs finishes there is a DelayedJobGroup plugin at https://github.com/salsify/delayed_job_groups_plugin that allows you to invoke a final job only after the sibling jobs complete. I would aim for an execution time of a single job to be under 30 seconds. This is arbitrary but it illustrates what I mean by smaller jobs.
Some hosting providers like Amazon provide spot instances where you can pay a lower price on servers that do not have guaranteed availability. These pair well with the many fewer jobs approach I mentioned earlier.
Finally, Ruby might not be the right tool for the job. There are faster languages, and if you are limited by memory, or CPU, you might consider writing these jobs and their workers in another language like Javascript, Go or Rust. These can pair well with a Ruby stack, but offload computationally expensive subroutines to faster languages.
Finally, like many scaling issues, if you have more money than time, you can always throw more hardware at it. At least for a while.
I thing memory and time is more problem for you. you have to use sidekiq gem for this process because it will consume less time and memory consumption for doing the same job,because it uses redis as database which is key value pair db.if the problem continues go with java script.

Does rails HireFire support Queues?

Background:
I have 50 clients. (for example) they have their data partitioned into 50 different schemas in postgresql.
I feel it's a good idea to keep their processing as separate as possible, so I think putting their DJ's into different queues is a good idea, At least grouping them based on their load, into different queues (because I have a limit on the number of workers)
If Client_A has 10 large actions in the queue, Client_B shouldn't have to wait for them to be done, to send an email.
DJ supports queue's based workers. I could be wrong, but I don't see a way to set queues in the hirefire paradigm.
Does anyone know how to setup-hirefire to run on a given queue?
I see more issues coming, but I'll ignore them for now :)

How should I schedule many Google Search scrapes over the course of a day?

Currently, my Nokogiri script iterates through Google's SERPs until it finds the position of the target website. It does this for each keyword for each website that each user specifies (users are capped on amount of websites & keywords they can track).
Right now, it's run in a rake that's hard-scheduled every day and batches all scrapes at once by looping through all the websites in the database. But I'm concerned about scalability and swarming Google with a batch of requests.
I'd like a solution that scales and can run these scrapes over the course of the day. I'm not sure what kind of solution is available or what I'm really looking for.
Note: The amount of websites/keywords change from day to day as users add and delete their websites and keywords. I don't mean to make this question too superfluous, but is this the kind of thing Beanstalkd/Stalker (job queuing) can be used for?
You will have to balance two issues: Scalability for lots of users versus Google shutting you down for scaping in violation of their terms of use.
So your system will need to be able to distribute tasks to various different IPs to conceal your bulk scraping which suggests at least two levels of queuing. One to manage all the jobs and send them to each separate IP for subsequent searching and collecting results and queues on each separate machine to hold the requested searches until they are executed and the results returned.
I have no idea what Google's thresholds are (I am sure they don't advertise it) but exceeding them and getting cut off would obviously be devastating for what you are trying to do so your simple looping rake task is exactly what you shouldn't do after a certain number of users.
So yes, use a queue of some sort but realize that you probably have a different goal from the typical goal of a queue in that you want to deliberately delay jobs rather that offload word to avoid UI delays. So you will be seeking ways to slow down the queue rather than have it just execute job after job as they arrive in the queue.
So based on a cursory inspection of DelayedJob and BackgroundJobs it looks like DelayedJob has what you would need with the run_at attribute. But I am only speculating here and I am sure an expert would have more to say.
If I'm understanding correclty, it sounds like one of these tools might fit the bill:
Delayed_job: https://github.com/tobi/delayed_job
or
BackgroundJobs: http://codeforpeople.rubyforge.org/svn/bj/trunk/README
I've used both of them, and found them easy to work with.
There are definitely some background job libraries that might work.
delayed_job: https://github.com/collectiveidea/delayed_job (beware of the unmaintained branch from tobi!)
resque: https://github.com/defunkt/resque
However, you might think about just scheduling a Cron job that runs more times during the day, and processes less items per run.
SaaS solution: http://momentapp.com/ "Launch delayed jobs with scheduled http requests" - disclaimer a) in beta b) I am not affiliated with this service

Fastest way to track API access

I'm creating an API for my Rails app, and I want to track how many times a user calls a particular API method, and cap them say at like 1,000 requests per day. I'm expecting very high request volumes across multiple users.
Do you have a suggestion as to how I can keep track of something like that per user? I want to avoid having to write to the database repeatedly and deal with locks.
I'm okay doing a delayed write (API limit don't have to be super exact), but is there a standard way of doing this?
You could try Apigee. It looks like it's "free up to 10,000 messages per hour".
Disclaimer: I have never used Apigee.
It really depends on the # of servers, the dataset, the # of users, etc.
One approach would be to maintain a quota datastructure in memory on the server and update it per invocation. If you have multiple servers you could maintain a memcache of the quota. Obviously, a memory-based implementation wouldn't survive a reboot or restart, so some sort of serializiation would be required to support that.
If quota accuracy is critical it's probably best to just do it in the DB. You could do it in a file, but then you face the same issues you're trying to avoid /w the database.
EDIT:
You could also do a mixed approach -- maintain a memory-based cache of user|api|invocation counts and periodically write them to the database.
A bit more info on the requirements would help pare down the options..
Here's a way of doing it using the rails cache
call_count_key = "api_calls_#{params[:api_key]}_#{Time.now.strftime('%Y-%m-%d-%Hh')}"
call_count = Rails.cache.read(call_count_key) || 0
call_count += 1
Rails.cache.write call_count_key, call_count
# here is our limit
raise "too many calls" if call_count > 100
This isn't a perfect solution as it doesn't handle concurrency properly and if you're using the in memory cache (rails' default) then this will be a per process counter
If you're ok with a hosted solution, take a look at my company, WebServius ( http://www.webservius.com ) that does API management (issuing keys, enforcing quotas, etc). We also have a billing support, so that you will be able to set per-call prices, etc.

Resources