Task assignment to agencies

I have to automate task assignment to agencies that perform a particular kind of work. The automation should allow tasks to be assigned based on a group of search criteria.
1. More than one criterion needs to be considered when assigning tasks.
2. The algorithm should take into account the number of requests in the queue and act accordingly.
3. Assignment to the agencies should be fair.

Related

Can I assign my own randomized Job ID to Sidekiq?

I am using Sidekiq to schedule some tasks based on a schedule that the user provides. However, if the user changes the schedule, I want to be able to simply update the old schedule with the new one.
Suggestion one
I saw a suggestion to just find the old job with Sidekiq::ScheduledSet.new.find_job(job_id), but I am trying to avoid creating a new model just to store the job ID and the task.
Suggestion two
Another suggestion I saw was to have the worker check whether the task's time matches the current time. That won't work: if the server goes offline, the delayed jobs it processes once it comes back online will have times that no longer match the current time.
If I could assign my own job ID, like a hex version of the job name or a padded version of the task ID, then I could easily avoid having to create a new model to store the job IDs, and rescheduling a task when the user changes the schedule would be a lot easier.
Other thoughts
Maybe I could check the job's at attribute and match that with the task; that might work, but I'm not sure how to access that attribute from within the worker without knowing the job ID.
Edit
I just tried to pull the current job's at attribute, but it looks like once the job kicks off, it no longer exists in Sidekiq::ScheduledSet, so it seems there is no way to match this job's time against the Task's time.
I am using Sidekiq to schedule some tasks based on a schedule that the user provides...
There's an extension for that. Sidekiq-Scheduler gives you a cron-like schedule configuration file. Then you can alter the schedule as you see fit. This seems like the best option as it avoids having to write your own scheduler interface.
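For instance, with sidekiq-scheduler's dynamic mode you can overwrite a named schedule entry whenever the user changes their schedule. A minimal sketch, assuming dynamic mode is enabled; TaskWorker and cron_expression are placeholder names:

# Assumes sidekiq-scheduler with :dynamic: true in the Sidekiq configuration.
require 'sidekiq-scheduler'

def update_user_schedule(task)
  Sidekiq.set_schedule(
    "user_task_#{task.id}",                 # one named entry per task
    { 'cron'  => task.cron_expression,      # e.g. '0 9 * * 1-5'
      'class' => 'TaskWorker',
      'args'  => [task.id] }
  )
end

Calling update_user_schedule again with the same name simply replaces the old entry, which is exactly the "update the old schedule with the new one" behaviour being asked for.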
Can I assign my own randomized Job ID to Sidekiq?
Yes, though it's undocumented. You can give Sidekiq::Client.push a jid attribute.
Sidekiq::Client.push('class' => MyWorker, 'args' => [1, 2, 3], 'jid' => ... )
This is not a good way to solve your problem: it relies on an undocumented feature and invites collisions with normal Sidekiq IDs.
Maybe if I could check the job's at attribute and match that with the task, that may work, but I'm not sure how to access that attribute from within the worker without knowing the job ID.
This sounds very error prone. You'd have to store the timestamp in a model anyway. Better to store the job ID in the first place.
I am trying to avoid having to create a new model just to simply store the job ID and the task.
Storing things in models is what Rails does really well, so this would seem to be the way to go. It will take a trivial amount of coding, database storage, and processing. You should have a model, view, and controller for your scheduled jobs anyway; otherwise, how will you create scheduled jobs and view your schedule?
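As a rough sketch of that approach (the sidekiq_jid column and TaskWorker are invented names for illustration), the record keeps the jid returned by perform_at so the old job can be removed when the user reschedules:

# Assumes a Task model with run_at and sidekiq_jid columns.
require 'sidekiq/api'

class Task < ApplicationRecord
  def reschedule!(new_run_at)
    # Remove the previously scheduled job if it is still pending.
    if sidekiq_jid.present?
      old_job = Sidekiq::ScheduledSet.new.find_job(sidekiq_jid)
      old_job&.delete
    end
    # Enqueue the replacement and remember its jid for next time.
    new_jid = TaskWorker.perform_at(new_run_at, id)
    update!(run_at: new_run_at, sidekiq_jid: new_jid)
  end
end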
However, the Sidekiq docs note that find_job is "a slow, inefficient operation. Do not use under normal conditions. Sidekiq Pro contains a faster version." This is because it has to iterate through all jobs.
I had a case where I had to reschedule jobs based on updates from the User. It is actually pretty slow and complicated.
It's simpler to not reschedule, but instead make the old queued tasks no-ops (no operations) and then queue up the new tasks.
This is basically defined by the logic within the task: the old jobs need some way of knowing that the user updated the schedule, check for that, and based on that if-check simply skip doing the work.
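A minimal sketch of that if-check (column and worker names are placeholders): the job carries the timestamp it was enqueued for and simply returns if the task's schedule has changed since then.

class TaskWorker
  include Sidekiq::Worker

  def perform(task_id, scheduled_for_epoch)
    task = Task.find_by(id: task_id)
    return if task.nil?                                # task was deleted: no-op
    return if task.run_at.to_i != scheduled_for_epoch  # schedule changed: no-op

    task.execute!                                      # placeholder for the real work
  end
end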

Scaling chat log workers horizontally

I've thought about this a lot but can't come up with a solution I'm happy with.
Basically, this is the problem: log 100k+ chats (some slower, some faster) into Cassandra, saving the userId, channelId, timestamp, and the message.
Cassandra already supports horizontal scaling out of the box, I have no issue here.
Now, my software that reads these chats does it over TCP (IRC). Something like 300 messages/sec is usual for the top 1k channels, and from my experiments a single IRC connection can't handle that.
What I now want to build is multiple instances (with Docker/Kubernetes) of the logger and share the load between them. So ideally, if I have maybe 4 workers and 1k chats (for example), they would each join at least 250 channels. I say at least because I would want optional redundancy, so I can have 2 loggers in the same chat to make sure no messages get lost.
There is no issue with duplicates, because all messages have a unique ID.
Now, how would I best dynamically share the currently joined channels between the workers? I want to avoid having a master or controlling point. It should also be easy to add more workers that then reduce the load on the other workers.
Are there any good articles about this kind of behaviour? Maybe good concepts or protocols already defined? Like I said, I want to avoid another central control point, so no RabbitMQ, Redis, or anything like that.
Edit: I've looked into something like the Raft consensus algorithm, but I don't think it fits, since I don't want my clients to agree on a shared state; I want to divide the state between them roughly "equally".
I think that in this case looking for a description of an existing algorithm might not be very useful: the problem is not complicated or generic enough to be worth publication.
As described, the problem could be solved by using Cassandra itself as a mediator and to share chat channel assignment information among the workers.
So (trivial part) channels would have IDs and assigned worker ID(s), plus, in the optional case of redundancy, the required number of workers (2 or however many workers you want to process this chat). A worker, before assigning itself to a channel, would check whether there are already enough assignees. If so, it would continue to the next channel; if not, it would assign itself to the channel. This is one of the options (alternatively, you could have workers hold the channel IDs, but since redundancy is rare, this way seems simpler). Workers would have a limit on the number of channels they can process and would not try to exceed it by assigning themselves more channels.
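A rough sketch of that claim loop, assuming the cassandra-driver gem and illustrative channels / channel_assignments tables (not a fixed schema):

require 'cassandra'

WORKER_ID = ENV.fetch('WORKER_ID')   # e.g. the container hostname
CAPACITY  = 250                      # limit of channels this worker will process

session = Cassandra.cluster.connect('chat_logger')
claimed = 0

session.execute('SELECT channel_id, required_workers FROM channels').each do |row|
  break if claimed >= CAPACITY

  assignees = session.execute(
    'SELECT worker_id FROM channel_assignments WHERE channel_id = ?',
    arguments: [row['channel_id']]
  ).to_a

  next if assignees.size >= row['required_workers']    # already enough loggers

  session.execute(
    'INSERT INTO channel_assignments (channel_id, worker_id) VALUES (?, ?)',
    arguments: [row['channel_id'], WORKER_ID]
  )
  claimed += 1
end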
Now we only have to deal with the case of too many workers assigning themselves to the same channel, exceeding the requirement and exhausting worker capacity by all monitoring the same channels. Otherwise, if they all start at once, channels might end up with more assigned workers than needed. Even though this is unlikely to create a real problem in the described case (just a bit more redundancy than requested), you can handle it by prioritising workers. It works much like the seniority-based hiring of school teachers in BC, Canada, where the most senior gets the job first, except that here it would be done voluntarily by the workers themselves, not by the school administration. What this means is that each worker would check all of its assigned channels and, should a channel have more workers than needed at the time, check whether it has the lowest priority among the assignees. If it does, it would resign: remove itself and stop processing the channel.
That requires assigning distinct priorities to the workers, which can easily be done when spawning them by simply giving each the next sequential number (the oldest has the highest priority, or vice versa if you are concerned about old, potentially dying workers taking up all the load and would prefer fresh ones to take on more). More elaborately, this could also be done using Cassandra lightweight transactions, as described in one of the answers here (the one by AlonL). With just a few (you mentioned ~4) workers, either way should work, and the scaling concerns mentioned in the other answers there are not a big deal for a few integer priorities. Also, instead of sequential assignment, requiring each worker to self-assign a random 32-bit integer priority on initialization has virtually no chance of collision, so a loop "until no collisions" should exit on the very first iteration (which makes the second iteration a very rarely exercised code path that needs an explicit test).
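For example, the random-priority variant with a lightweight transaction might look roughly like this (again with the cassandra-driver gem and an illustrative worker_priorities table):

require 'cassandra'
require 'securerandom'
require 'socket'

session = Cassandra.cluster.connect('chat_logger')

def register_priority(session)
  loop do
    priority = SecureRandom.random_number(2**31)   # random 31-bit priority
    result = session.execute(
      'INSERT INTO worker_priorities (priority, worker_id) VALUES (?, ?) IF NOT EXISTS',
      arguments: [priority, Socket.gethostname]
    )
    # '[applied]' is false when another worker already holds this priority;
    # with random values the retry branch should almost never run.
    return priority if result.first['[applied]']
  end
end

my_priority = register_priority(session)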
The trick is basically to limit the amount of data requiring synchronisation and to put the regulating load onto the workers themselves. There is no need for consensus algorithms, as there is not much complexity and we are not dealing with a huge number of potentially fraudulent workers trying to get assignments ahead of more senior peers.
The only issue I should mention is that there can be implicit worker rotation: if a channel goes offline, its worker stops processing it, and you will get a different worker assignment the next time the channel comes online.

Controlled concurrency with Amazon SQS

I have multiple publishers publishing events for a shipment entity on an SQS queue, and I have multiple listeners on it for parallel processing. But I want events for a particular shipment (having some identifier) to be processed sequentially, in order. Is there any built-in feature to support this?
ActiveMQ has a similar concept, Exclusive Consumer, which is not exactly what I need but could be adapted.
Yes, there is; they are called FIFO (First-In-First-Out) queues.
FIFO (First-In-First-Out) queues are designed to enhance messaging between applications when the order of operations and events is critical, or where duplicates can't be tolerated.
You will need to ensure that the messages you want processed in the correct order belong to the same Message Group ID:
The tag that specifies that a message belongs to a specific message group. Messages that belong to the same message group are always processed one by one, in a strict order relative to the message group (however, messages that belong to different message groups might be processed out of order).
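As a sketch with the aws-sdk-sqs gem (the queue URL and field names are placeholders), publishing every event under its shipment ID as the message group keeps events for that shipment in strict order:

require 'aws-sdk-sqs'
require 'json'

sqs = Aws::SQS::Client.new(region: 'us-east-1')

def publish_shipment_event(sqs, shipment_id, event)
  sqs.send_message(
    queue_url: 'https://sqs.us-east-1.amazonaws.com/123456789012/shipments.fifo',
    message_body: event.to_json,
    message_group_id: shipment_id.to_s,              # same shipment => in-order processing
    message_deduplication_id: event[:event_id].to_s  # or enable content-based deduplication
  )
end

Listeners on different message groups still run in parallel, so you keep parallel processing across shipments while each shipment's events stay ordered.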
Hope that helps!

Twilio TaskRouter - Controlling the order in which a TaskQueue selects Workers

Ahoy! First time TaskRouter user here. I have about 500 Workers which meet the TargetWorkers expression for my TaskQueue. I also have a priority for each Worker, where this priority is any integer.
Ideally: I'd like Tasks to be assigned to my Workers based on the Worker's priority. Doesn't need to be exactly like this, but that's my ideal. Any ideas on how to build this out in Twilio TaskRouter?
Example: Given two available Workers A and B, where A has high priority (1025) and B has lower priority (2). I want the incoming task to be assigned to A whenever possible. The task should only go to Worker B if Worker A is not available, times out, or rejects the offer.
Is there documentation somewhere which explains the order in which TaskQueues select between available workers?

How should I schedule many Google Search scrapes over the course of a day?

Currently, my Nokogiri script iterates through Google's SERPs until it finds the position of the target website. It does this for each keyword for each website that each user specifies (users are capped on amount of websites & keywords they can track).
Right now, it's run in a rake task that's hard-scheduled every day and batches all scrapes at once by looping through all the websites in the database. But I'm concerned about scalability and about swarming Google with a batch of requests.
I'd like a solution that scales and can run these scrapes over the course of the day. I'm not sure what kind of solution is available or what I'm really looking for.
Note: The number of websites/keywords changes from day to day as users add and delete their websites and keywords. I don't mean to make this question too broad, but is this the kind of thing Beanstalkd/Stalker (job queuing) can be used for?
You will have to balance two issues: scalability for lots of users versus Google shutting you down for scraping in violation of their terms of use.
So your system will need to be able to distribute tasks to various different IPs to conceal your bulk scraping, which suggests at least two levels of queuing: one to manage all the jobs and send them to each separate IP for searching and collecting results, and a queue on each separate machine to hold the requested searches until they are executed and the results returned.
I have no idea what Google's thresholds are (I am sure they don't advertise it) but exceeding them and getting cut off would obviously be devastating for what you are trying to do so your simple looping rake task is exactly what you shouldn't do after a certain number of users.
So yes, use a queue of some sort, but realize that you probably have a different goal from the typical goal of a queue: you want to deliberately delay jobs rather than offload work to avoid UI delays. So you will be seeking ways to slow down the queue rather than have it just execute job after job as they arrive in the queue.
So based on a cursory inspection of DelayedJob and BackgroundJobs it looks like DelayedJob has what you would need with the run_at attribute. But I am only speculating here and I am sure an expert would have more to say.
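As a hedged sketch of that idea (run_at is delayed_job's real option; ScrapeJob and the models are placeholder names), the daily rake task could fan the scrapes out over the next 24 hours instead of firing them all at once:

# Spread scrape jobs across the day with delayed_job's run_at option.
scrapes = Website.includes(:keywords).flat_map do |site|
  site.keywords.map { |kw| [site, kw] }
end

interval = 24.hours / [scrapes.size, 1].max          # spacing between jobs

scrapes.each_with_index do |(site, kw), i|
  Delayed::Job.enqueue(
    ScrapeJob.new(site.id, kw.id),                   # a plain Ruby object with #perform
    run_at: Time.current + interval * i
  )
end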
If I'm understanding correctly, it sounds like one of these tools might fit the bill:
Delayed_job: https://github.com/tobi/delayed_job
or
BackgroundJobs: http://codeforpeople.rubyforge.org/svn/bj/trunk/README
I've used both of them, and found them easy to work with.
There are definitely some background job libraries that might work.
delayed_job: https://github.com/collectiveidea/delayed_job (beware of the unmaintained branch from tobi!)
resque: https://github.com/defunkt/resque
However, you might think about just scheduling a cron job that runs more times during the day and processes fewer items per run (a rough sketch of this follows below).
SaaS solution: http://momentapp.com/ "Launch delayed jobs with scheduled http requests" - disclaimer a) in beta b) I am not affiliated with this service
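To make the cron suggestion above concrete (a rough sketch; Website and SerpScraper are invented names), each hourly run could handle only the slice of websites whose id falls into the current hour's bucket:

# Rake task sketch: run hourly via cron; each run processes roughly 1/24 of the sites.
namespace :scrapes do
  task hourly_batch: :environment do
    bucket = Time.current.hour                       # 0..23
    Website.find_each do |site|
      next unless site.id % 24 == bucket             # only this hour's slice
      site.keywords.each { |kw| SerpScraper.new(site, kw).run }
    end
  end
end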

Resources