Sidekiq worker and arguments size - ruby-on-rails

Currently I have this code in one of my Sidekiq workers:
def perform(identity_id, format, model, items_ids, columns, options)
  ExportListService::Dispatcher.new(
    identity: identity_from(identity_id),
    format: format,
    items: items_from(model, items_ids),
    columns: columns,
    options: options&.symbolize_keys! || {}
  ).perform
end
The items_from method is responsible for fetching each item from the database in the order it appears in the items_ids array; the service then processes them.
The order is very important: the controller that initiates this worker has multiple filters and options, so IDs can arrive in any order, and that order must not be lost when they are passed to Sidekiq.
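One common way to keep the order of items_ids is to fetch the records in a single query and re-order them in memory. The following is only a sketch in plain Ruby: Item, STORE and fetch_unordered are hypothetical stand-ins for the model and for a query such as where(id: ids), which does not guarantee the order of its results.

```ruby
# Hypothetical stand-in for an ActiveRecord model.
Item = Struct.new(:id, :name)

# Hypothetical stand-in for the table contents, in storage order.
STORE = [Item.new(1, "a"), Item.new(2, "b"), Item.new(3, "c"), Item.new(4, "d")].freeze

# Stand-in for a query like Model.where(id: ids): it returns the matching
# rows in storage order, not in the order the IDs were given.
def fetch_unordered(ids)
  STORE.select { |item| ids.include?(item.id) }
end

# Re-order the fetched records to match the order of the given IDs.
# IDs with no matching record are dropped.
def items_in_order(ids)
  by_id = fetch_unordered(ids).to_h { |item| [item.id, item] }
  ids.map { |id| by_id[id] }.compact
end

ordered = items_in_order([3, 1, 4])
puts ordered.map(&:id).inspect
```

The lookup hash keeps this at one query plus a linear pass, whatever order the IDs arrive in.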
This works great, but I realised the items_ids array could contain more than 5,000 entries.
In terms of scalability, what would be best?
Keep it as it is, assuming a big array won't impact the worker's performance. I didn't find anything relevant about argument size in Sidekiq, so I don't know whether it will hurt performance.
OR
Move the whole controller logic that sorts items_ids into the worker (which implies possibly duplicated, hard-to-maintain code).
Which solution should I choose? Is there any other possibility I didn't think of?

I would avoid pushing this much data to Sidekiq. We use Sidekiq workers mainly to kick off some async processing, only passing the information required to start the processing.
From your description, I would move the ID-sorting logic to the model and query the model from either the controller or the worker.
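To get a feel for how much data this is, here is a rough sketch that measures the JSON size of 5,000 IDs. Sidekiq serializes job arguments as JSON; the sequential IDs are an assumption, and real IDs may have more digits.

```ruby
require "json"

# 5,000 sequential integer IDs, as in the question (real IDs may be larger).
items_ids = (1..5_000).to_a

# Sidekiq stores job arguments as JSON, so this approximates the payload
# contributed by the items_ids argument alone.
payload = JSON.generate(items_ids)

puts "#{items_ids.size} IDs serialize to #{payload.bytesize} bytes of JSON"
```

For these sequential IDs that is roughly 24 KB. The cost is paid per enqueued job, so it adds up when many exports are queued at once.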

The way I do it is to create an ActiveRecord model and table called something like XYZJobRequest.
This JobRequest entry gets created at some point in the application life cycle. It holds all the required info, or a reference to it.
When calling Sidekiq, all I do is pass in the ID of the JobRequest object as the only parameter (usually from an "on_commit" hook). This keeps things simple.
Would that make sense for you?
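The job request idea can be sketched in plain Ruby as follows. ExportJobRequest and ExportWorker are hypothetical names, and the in-memory hash only stands in for the database table; in a real app the class would be an ActiveRecord model and the worker would include Sidekiq's worker module.

```ruby
# Hypothetical in-memory stand-in for an ActiveRecord-backed job request table.
class ExportJobRequest
  attr_reader :id, :format, :items_ids

  @records = {}

  class << self
    # Store a new request and return it (stands in for .create!).
    def create(format:, items_ids:)
      record = new(@records.size + 1, format, items_ids)
      @records[record.id] = record
    end

    # Look a request up by ID (stands in for .find).
    def find(id)
      @records.fetch(id)
    end
  end

  def initialize(id, format, items_ids)
    @id = id
    @format = format
    @items_ids = items_ids
  end
end

# Hypothetical worker: the only argument is the job request ID.
class ExportWorker
  def perform(job_request_id)
    request = ExportJobRequest.find(job_request_id)
    "exporting #{request.items_ids.size} items as #{request.format}"
  end
end

request = ExportJobRequest.create(format: "csv", items_ids: [42, 7, 19])
puts ExportWorker.new.perform(request.id)
```

The ordered ID list lives in the request record, so the queue only ever carries one small integer.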

Related

Passing Complex Hashes to Sidekiq Jobs

From the Best Practices Guide to using Sidekiq, I understand it's best to pass "string, integer, float, boolean, null(nil), array and hash" as arguments to the job.
I often just pass the id of a persisted object to my jobs, but due to latency constraints I need to save the object after running the job.
The non-persisted object I'm working with contains a mixture of data types:
#MyObject<00x000>{
id: nil
start_time: Fri, 11 Dec 2020 08:45:00 PST -08:00 (*this is a TimeWithZone object)
rate: 18.0 (*this is a BigDecimal object)
...
}
I plan to pass this object to my job by converting it to a hash first:
MyJob.perform_async(my_object.attributes)
and then later persist the object like so:
MyObject.new(my_object_hash).save
My question is, is this safe? Even though I am passing a 'simple' datatype to Sidekiq, it actually contains complex objects. Am I going to lose precision?
Thank you!
This sounds like a "potayto, potahto" solution. You are not using Sidekiq's serialization, but serializing the object yourself instead.
Let's have a look at why Sidekiq has this rule:
Even if they did serialize correctly, what happens if your queue backs up and that quote object changes in the meantime? [...]
Don't pass symbols, named parameters, keyword arguments or complex Ruby objects (like Date or Time!) as those will not survive the dump/load round trip correctly.
I like to add a third:
Serializing state makes it impossible to distinguish between persisted and ethereal (in-memory, memoized, lazy-loaded, etc.) data. E.g. a def sent_mails; @sent_mails ||= Mail.for(user_id: id); end now gets serialized: do you want that?
The solution is also provided by sidekiq:
Don't save state to Sidekiq, save simple identifiers. Look up the objects once you actually need them in your perform method.
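The dump/load round trip mentioned in the quoted guideline can be seen with the standard library alone. This is a small sketch, not Sidekiq's own code; the argument hashes are made up for illustration.

```ruby
require "json"

# Arguments as a caller might be tempted to pass them: symbol keys and a symbol value.
args = { format: :csv, page: 1 }

# Job arguments are dumped to JSON and parsed back before the job runs.
round_tripped = JSON.parse(JSON.generate(args))

puts round_tripped.inspect
puts "survived unchanged? #{round_tripped == args}"

# Using only JSON-native types (string keys, strings, numbers) round-trips cleanly.
safe_args = { "format" => "csv", "page" => 1 }
puts "safe survived unchanged? #{JSON.parse(JSON.generate(safe_args)) == safe_args}"
```

The symbols come back as strings, which is why code in perform that expects symbol keys or values can break silently.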
The XY problem here
Your real problem is not where or how to serialize state, because Sidekiq warns against serializing state regardless of where and how you do it.
The problem you need to solve is either how to store the state somewhere it can be stored properly, or how to avoid storing the state at all: not in Redis/Sidekiq, nor in the storage that is giving you problems.
Latency
Is your storage itself slow? Or is it a validation, a serialisation step, or some side effect of storing that is slow?
Can you improve this by making it a two-step process: insert the state first, then update/enrich/validate it asynchronously later? Rails won't help you here, and might even work against you, but a common model is to store objects in a special "queue" table or an event queue; Kafka, for example, is famous for this.
When storage happens over a slow network to a slow API, this is probably unsolvable. But when storage happens in a local database, there are decades of solutions for improving write performance that you can use, either inside your database or with a specialised queue for state storage (Sidekiq is not such a queue), depending on the tech used to store. E.g. Linux will let you store through memory, making writes really quick, but removing the guarantee that the data was actually written to disk.
E.g. In a bookkeeping api, we would store the validated object in PostgreSQL and then have async jobs add expensive attributes to this later (e.g. state that had to be retrieved from legacy APIs or through complex calculations).
E.g. in a write-heavy GIS system, we would store objects into a "to_process_places" table, that was monitored by tooling which processes the Places. It all really depends on your domain, and requirements.
Not using state.
A common solution is not to make objects, but use the actual payload by the customer. Just send the HTTP payload (in rails, the params) along and leave it at that. Maybe merge in a header (like the Request Date) or filter out some data (header tokens or cookies).
If your controller can operate with this data, so can a delayed job. Instead of building objects in the controller, leave that to the delayed job. This can even result in really neat and lean controllers: all they do is (some authentication and authorization and then) call the proper job and pass it a sanitized params.
Obviously this involves trade-offs, such as not being able to validate synchronously and instead reporting validation results by email, push notification, or delayed response, depending on your requirements (e.g. a large CSV import could just email any validation issues, but a login request might need an immediate response if the login is invalid).
It also requires some thought: you probably don't want to send the Base64 encoded CSV along to sidekiq, but instead write the file to a (temp) storage and pass the filename/url along instead. This might sound obvious, because it is: file uploads are essentially an implementation of the earlier mentioned "temporary state storage": you don't pass the entire PDF/high-res-header-image/CSV along to sidekiq, but store it somewhere so sidekiq can pick it up later to process it. Why should the other attributes not employ the same pattern if passing them along to sidekiq is problematic?
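The "store the file, pass its location" idea can be sketched with the standard library. stash_upload and process_upload are hypothetical helpers, and the temporary directory stands in for whatever shared storage the web process and the worker can both reach.

```ruby
require "json"
require "tmpdir"

# Hypothetical "controller" side: store the upload, enqueue only its location.
def stash_upload(csv_body, dir)
  path = File.join(dir, "import-#{Process.pid}.csv")
  File.write(path, csv_body)
  JSON.generate({ "path" => path }) # the small payload that would travel through the queue
end

# Hypothetical "job" side: receives the small payload and reads the file itself.
def process_upload(payload)
  path = JSON.parse(payload).fetch("path")
  File.readlines(path, chomp: true).size
end

Dir.mktmpdir do |dir|
  payload = stash_upload("name,price\nwidget,18.0\n", dir)
  puts "queued payload: #{payload.bytesize} bytes"
  puts "lines read by job: #{process_upload(payload)}"
end
```

The queued payload stays the same size no matter how large the file grows.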
The most important part from the best practices you linked is
Complex Ruby objects do not convert to JSON
Therefore you're not supposed to pass instances of a model to a worker.
If you're using Sidekiq workers, you should comply with this statement, and the hash you're passing should be just fine. I am not exactly sure about the TimeWithZone object, but you could try converting it to JSON or to a string, as they do in the best practices guide.
However, if you're using ActiveJob instead of Sidekiq workers (does your job inherit from ApplicationJob, or does it include Sidekiq::Worker?), then you don't have that problem, because ActiveJob uses Global ID to convert objects into a string and deserializes the object again before performing the job. That means you can pass an object to your job (a persisted one, since the Global ID is built from the record's id).
my_object = MyObject.find(1)
my_object.to_global_id #=> #<GlobalID:0x000045432da2344 [...] gid://your_app_name/MyObject/1>>
serialized_my_object = my_object.to_global_id.to_s
my_object = GlobalID.find(serialized_my_object)
You can find more information here
https://github.com/toptal/active-job-style-guide#active-record-models-as-arguments
After doing some experimentation on the Time objects in my job, I found that I am losing sub-millisecond precision at the other end of the job.
my_object.start_time
=> Mon, 21 Dec 2020 11:35:50 PST -08:00
my_object.start_time.strftime('%Y-%m-%d %H:%M:%S.%N')
=> "2020-12-21 11:35:50.151893000"
You can see here that we have 6 digits of precision after the decimal point (microseconds).
(see this answer for more about 'strftime')
Once we call JSON methods on the object:
generated = JSON.generate(my_object.attributes)
=> \"start_time\":\"2020-12-21T11:35:50.151-08:00\"
You can see here we are down to 3 digits of precision after the decimal. The remaining 3 digits are lost at this point.
parsed = JSON.parse(generated)
parsed['start_time']
=> "2020-12-21T11:35:50.151-08:00"
It appears at the most basic level, the JSON library recursively calls as_json on each of the key-value pairs in the hash. So really it depends on how your particular object implements as_json.
This issue caused test failures that involved querying our db for persisted objects (initialized with something like, start_time = Time.zone.now (!)) that are meant to overlap in time exactly with our MyObject class. Once the half-baked my_object blueprints made it through Sidekiq, they lost a sliver of precision, causing a slight misalignment.
One way to hack away at this issue is by monkey patching the Time class.
In our case, a better solution was to go in the opposite direction and to not use so much precision in our tests. The my_object in the example is something that a human user will have on their calendar; in production we never receive so much precision from clients. So instead we fixed our tests by instructing some of our test objects to use something like Time.zone.now.beginning_of_minute, rather than Time.zone.now. We intentionally removed precision to fix the issue, as well as more closely mirror reality.
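If the precision does matter, another option is to serialize the values explicitly before enqueueing. The sketch below uses only the standard library (plain Time and BigDecimal rather than TimeWithZone), so treat it as an illustration of the idea and not as what Rails does internally; the key names are made up.

```ruby
require "json"
require "time"
require "bigdecimal"

# A time with microsecond precision and a decimal rate, as in the example object.
start_time = Time.new(2020, 12, 21, 11, 35, Rational(50_151_893, 1_000_000), "-08:00")
rate = BigDecimal("18.0")

# Three fractional digits (as in the JSON output above) truncate to milliseconds.
lossy = Time.iso8601(start_time.iso8601(3))

# Serializing explicitly with six fractional digits keeps the microseconds,
# and passing the BigDecimal as a string avoids any float conversion.
payload = JSON.generate({ "start_time" => start_time.iso8601(6), "rate" => rate.to_s })
parsed = JSON.parse(payload)
restored_time = Time.iso8601(parsed["start_time"])
restored_rate = BigDecimal(parsed["rate"])

puts "milliseconds only: #{lossy.usec} usec (original #{start_time.usec})"
puts "time preserved? #{restored_time == start_time}"
puts "rate preserved? #{restored_rate == rate}"
```

The job then has to parse these strings itself, which is the price of keeping the payload to plain JSON types.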

Rails - how to cache data for server use, serving multiple users

I have a class method (placed in /app/lib/) which performs some heavy calculations and sub-http requests until a result is received.
The result isn't too dynamic, and requested by multiple users accessing a specific view in the app.
So I want to schedule a periodic run of the method (using cron and the Whenever gem), store the results somewhere on the server in JSON format and, on demand, serve just the stored results to the view.
How can this be achieved? What would be the correct way of doing that?
What I currently have:
def heavyMethod
  response = {}
  # some calculations, eventually building the response
  File.open(File.expand_path('../../../tmp/cache/tests_queue.json', __FILE__), "w") do |f|
    f.write(response.to_json)
  end
end
and also a corresponding method to read this file.
I searched but couldn't find an example of achieving this using Rails cache convention (and not some private code that I wrote), on data which isn't related with ActiveRecord.
Thanks!
Your solution should work fine, but using Rails.cache should be cleaner and a bit faster. The Rails guides provide enough information about Rails.cache and how to get it to work with memcached; let me summarize how I would use it in your case.
Heavy method
def heavyMethod
  response = {}
  # some calculations, eventually building the response
  Rails.cache.write("heavy_method_response", response)
end
Request
response = Rails.cache.fetch("heavy_method_response")
The only problem here is that when your server starts for the first time, the cache will be empty. The same applies if/when memcached restarts.
One advantage is that somewhere along the way the data you pass in is marshalled into storage and then unmarshalled on the way out, meaning you can pass in complex data structures and don't need to serialize to JSON manually.
Edit: memcached will evict your item if it runs out of memory. This should be very rare, since it uses an LRU (I think) algorithm to expire things, and I presume you will read this item often.
To guard against this:
set expires_in larger than your cron period,
change your fetch code to call the heavy method if the fetch misses (like Rails.cache.fetch("heavy_method_response") { heavy_method }), and change the heavy method to just return the object,
or use something like Redis, which with its default configuration does not evict items.
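The fetch-with-fallback behaviour can be illustrated without Rails. TinyCache below is a hypothetical, minimal stand-in written only for this sketch; in a Rails app you would call Rails.cache.fetch with a block and an expires_in option instead.

```ruby
# Hypothetical, minimal stand-in for a cache store, only to illustrate the
# fetch-with-fallback behaviour; in a Rails app you would use Rails.cache.
class TinyCache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @entries = {}
  end

  # Return the cached value, or run the block, store its result, and return it.
  # `now` is injectable so expiry can be demonstrated without sleeping.
  def fetch(key, expires_in:, now: Time.now)
    entry = @entries[key]
    return entry.value if entry && entry.expires_at > now

    value = yield
    @entries[key] = Entry.new(value, now + expires_in)
    value
  end
end

calls = 0
heavy_method = -> { calls += 1; { "queue_length" => 3 } }

cache = TinyCache.new
start = Time.now
cache.fetch("heavy_method_response", expires_in: 600, now: start) { heavy_method.call }       # miss, computed
cache.fetch("heavy_method_response", expires_in: 600, now: start + 60) { heavy_method.call }  # hit
cache.fetch("heavy_method_response", expires_in: 600, now: start + 900) { heavy_method.call } # expired, recomputed

puts "heavy method ran #{calls} times"
```

With the fallback in place, an empty or evicted cache costs one slow request instead of an error.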

Delay sending mails to boost page load time

In my Product#create method I have something like
ProductNotificationMailer.notify_product(n.email).deliver
This fires if the product gets saved. The thing is, before it fires, a bunch of logic and calculations run, which delays the confirmation page load. Is there a way to make sure the next page loads first, with the mail delivery happening later or in the background?
Thanks
Yes, you'll want to look into background workers. Sidekiq, DelayedJob or Resque are some popular ones.
Here's a great RailsCast demonstrating Sidekiq.
class NotificationWorker
  include Sidekiq::Worker

  def perform(n_id)
    n = N.find(n_id)
    ProductNotificationMailer.notify_product(n.email).deliver
  end
end
I'm not sure what n was in your example, so I just went with it. Now where you do the work, you can replace it with:
NotificationWorker.perform_async(n.id)
The reason you don't pass the full object n as an argument is that the arguments will be serialized, and it's easier and faster to serialize just the integer id.
Once the job is stored, a second process running in the background will do the work, freeing up your web process to go back to rendering the response immediately.
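The principle can be shown in a single Ruby script. This is an in-process illustration only: a Thread and a Queue stand in for the separate worker process and the Redis-backed queue that Sidekiq provides, and the sleep simulates a slow mail delivery.

```ruby
# In-process illustration only: a Thread and a Queue stand in for the
# separate worker process and the Redis-backed queue that Sidekiq provides.
jobs = Queue.new
delivered = []

worker = Thread.new do
  # Pop job IDs until the :stop marker arrives.
  while (id = jobs.pop) != :stop
    sleep 0.05 # pretend that sending the mail is slow
    delivered << id
  end
end

started = Time.now
jobs << 42 # like perform_async: enqueue only the ID and move on
enqueue_time = Time.now - started
puts format("request path spent %.4fs enqueueing", enqueue_time)

jobs << :stop
worker.join # only so the script can show the result
puts "delivered in background: #{delivered.inspect}"
```

The enqueueing side returns without waiting for the slow work, which is what frees the web process to render the response.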
Delayed Job will do this.
Here is the GitHub page,
and here is a RailsCast on setting it up.

Resque.. how can I get a list of the queues

OK. On Heroku I have 24 workers up (as I understand it).
Say I have 1000 clients, each with their own "schema" in a PostgreSQL database.
Each client has tasks that can be done "later"; sending orders to my company's back end is a great example.
I was thinking that I could create a new queue for each client, and each queue would have its own worker (process). It seems that isn't in the cards.
So my thinking now is to have a queue field in the client record,
so clients 1 through 15 are in queue_a
and clients 16 through 106 are in queue_b, etc. If one client is using heaps, we could move them to a new queue, or move others out of the slow queue. Clients with low volumes could be grouped together. It would be a balancing act, but it wouldn't be all that hard to manage if we kept track of metrics (which we will anyway).
(Any counter-ideas would be awesome to hear; I'm really in the spitball phase.)
Right now, though, I'd like to figure out how to create a worker for each queue.
https://gist.github.com/486161 tells me how to create X workers, but doesn't really let me assign a worker to a queue. If I knew that, and how to get a list of queues, I think I'd be on my way to a viable solution to the limits.
Reading http://blog.winfieldpeterson.com/2012/02/17/resque-queue-priority/,
I realize that my plan is fraught with hardship. The first client/queue added to the worker would get priority. I don't want that; I'd want them all to have the same priority, as long as they are part of the same queue.
I'll just stick to the topic :)
Getting all queues in Resque is pretty easy:
Resque.queues
is a list of all queue names. It does not include the 'failed' queue, so I did something like this:
(['failed'] + Resque.queues).each do |queue|
  queue_size = queue == 'failed' ? Resque::Failure.count : Resque.size(queue)
end

Need alternative to filters/observers for Ruby on Rails project

Rails has a nice set of filters (before_validation, before_create, after_save, etc) as well as support for observers, but I'm faced with a situation in which relying on a filter or observer is far too computationally expensive. I need an alternative.
The problem: I'm logging web server hits to a large number of pages. What I need is a trigger that will perform an action (say, send an email) when a given page has been viewed more than X times. Due to the huge number of pages and hits, using a filter or observer will result in a lot of wasted time because, 99% of the time, the condition it tests will be false. The email does not have to be sent out right away (i.e. a 5-10 minute delay is acceptable).
What I am instead considering is implementing some kind of process that sweeps the database every 5 minutes or so and checks to see which pages have been hit more than X times, recording that state in a new DB table, then sending out a corresponding email. It's not exactly elegant, but it will work.
Does anyone else have a better idea?
Rake tasks are nice! But you will end up writing more custom code for each background job you add. Check out the Delayed Job plugin http://blog.leetsoft.com/2008/2/17/delayed-job-dj
DJ is an asynchronous priority queue that relies on one simple database table. According to the DJ website you can create a job using Delayed::Job.enqueue() method shown below.
class NewsletterJob < Struct.new(:text, :emails)
  def perform
    emails.each { |e| NewsletterMailer.deliver_text_to_email(text, e) }
  end
end

Delayed::Job.enqueue(NewsletterJob.new("blah blah", Customers.find(:all).collect(&:email)))
I was once part of a team that wrote a custom ad server, which has the same requirements: monitor the number of hits per document, and do something once they reach a certain threshold. This server was going to be powering an existing very large site with a lot of traffic, and scalability was a real concern. My company hired two Doubleclick consultants to pick their brains.
Their opinion was: the fastest way to persist any information is to write it in a custom Apache log directive. So we built a site where, every time someone hit a document (ad, page, all the same), the server that handled the request would write a SQL statement to the log: "INSERT INTO impressions (timestamp, page, ip, etc) VALUES (x, 'path/to/doc', y, etc);" -- all output dynamically with data from the webserver. Every 5 minutes, we would gather these files from the web servers and then load them into the master database one at a time. Then, at our leisure, we could parse that data and do anything we pleased with it.
Depending on your exact requirements and deployment setup, you could do something similar. The computational requirement to check if you're past a certain threshold is still probably even smaller (guessing here) than executing the SQL to increment a value or insert a row. You could get rid of both bits of overhead by logging hits (special format or not), and then periodically gather them, parse them, input them to the database, and do whatever you want with them.
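The periodic "gather and parse" step could look roughly like this in Ruby. The log format, the sample lines and the helper names are all assumptions made for the sketch; a real job would read the collected log files instead of an array.

```ruby
# Hypothetical log format: "<timestamp> <path> <ip>", one hit per line.
LOG_LINES = [
  "2024-01-01T10:00:00Z /pages/1 10.0.0.1",
  "2024-01-01T10:00:02Z /pages/2 10.0.0.2",
  "2024-01-01T10:00:05Z /pages/1 10.0.0.3",
  "2024-01-01T10:00:09Z /pages/1 10.0.0.1"
].freeze

# Count hits per page from raw log lines; lines without a path are ignored.
def hits_per_page(lines)
  lines.each_with_object(Hash.new(0)) do |line, counts|
    _timestamp, path, _ip = line.split(" ", 3)
    counts[path] += 1 if path
  end
end

# Pages whose hit count exceeds the threshold.
def pages_over(counts, threshold)
  counts.select { |_path, hits| hits > threshold }.keys
end

counts = hits_per_page(LOG_LINES)
puts counts.inspect
puts "over threshold: #{pages_over(counts, 2).inspect}"
```

Because the counting happens in a batch, the request path only pays for appending one log line per hit.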
When saving your Hit model, update a redundant column in your Page model that stores a running total of hits. This costs you 2 extra queries, so each hit may take twice as long to process, but you can then decide whether to send the email with a simple if.
Your original solution isn't bad either.
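The running-total idea, with its "simple if", might look like this. Page here is a hypothetical in-memory stand-in for the model with a redundant hits_count column, and record_hit! is a made-up method name.

```ruby
# Hypothetical in-memory stand-in for a Page row with a redundant hits_count column.
class Page
  attr_reader :path, :hits_count

  def initialize(path)
    @path = path
    @hits_count = 0
  end

  # Record one hit. Returns true only on the hit that first exceeds the
  # threshold, so the caller triggers the email exactly once.
  def record_hit!(threshold)
    @hits_count += 1
    @hits_count == threshold + 1
  end
end

page = Page.new("/pricing")
notifications = 0
5.times { notifications += 1 if page.record_hit!(3) }
puts "hits: #{page.hits_count}, notifications: #{notifications}"
```

Comparing against threshold + 1 instead of using > keeps later hits from sending the email again.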
I have to write something here so that stackoverflow code-highlights the first line.
class ApplicationController < ActionController::Base
  before_filter :increment_fancy_counter

  private

  def increment_fancy_counter
    # somehow increment the counter here
  end
end

# lib/tasks/fancy_counter.rake
namespace :fancy_counter do
  task :process do
    # somehow process the counter here
  end
end
Have a cron job run rake fancy_counter:process however often you want it to run.
