Passing Complex Hashes to Sidekiq Jobs - ruby-on-rails

From the Best Practices Guide to using Sidekiq, I understand it's best to pass "string, integer, float, boolean, null(nil), array and hash" as arguments to the job.
I often just pass the id of a persisted object to my jobs, but due to latency constraints I need to save the object after running the job.
The non-persisted object I'm working with contains a mixture of data types:
#MyObject<00x000>{
id: nil
start_time: Fri, 11 Dec 2020 08:45:00 PST -08:00 (*this is a TimeWithZone object)
rate: 18.0 (*this is a BigDecimal object)
...
}
I plan to pass this object to my job by converting it to a hash first:
MyJob.perform_async(my_object.attributes)
and then later persist the object like so:
MyObject.new(my_object_hash).save
My question is, is this safe? Even though I am passing a 'simple' datatype to Sidekiq, it actually contains complex objects. Am I going to lose precision?
Thank you!

This sounds like a "potayto, potahto" solution. You are not not using the serialisation of Sidekiq, but instead serialize it yourself.
Let's have a look at why sidekiq has this rule:
Even if they did serialize correctly, what happens if your queue backs up and that quote object changes in the meantime? [...]
Don't pass symbols, named parameters, keyword arguments or complex Ruby objects (like Date or Time!) as those will not survive the dump/load round trip correctly.
I like to add a third:
Serializing state makes it impossible to distinguish between persisted and ethereal (in-memory, memoized, lazy-loaded etc) data. E.g. a def sent_mails; #sent_mails ||= Mail.for(user_id: id); end now gets serialized: do you want that?
The solution is also provided by sidekiq:
Don't save state to Sidekiq, save simple identifiers. Look up the objects once you actually need them in your perform method.
The XY problem here
Your real problem is not where or how to serialize state. Because sidekiq warns against serializing state regardless of where and how you do this.
The problem you need to solve is either how to store state somewhere where it can be stored properly. Or to avoid storing the state at all: not in redis/sidekiq, nor in the storage that is giving you problems.
Latency
Is your storage slow? Is it not a validation, a serialisation, some side-effect of storage that is slow?
Can you improve this by making it a two-step: insert the state and update/enrich/validate it async later? If you are using Rails, it won't help you here, or might even work against you, but a common model is to store objects in a special "queue" table or events queue; e.g. kafka is famous for this.
When e.g. storage happens over a slow network to a slow API, this is probably unsolvable, but when storage happens in a local database, there are decades of solutions to improve write performance here that you can use. Both inside your database, or with some specialised queue for state-storage (sidekiq is not such a specialised storage queue) depending on the tech used to store. E.g. Linux will allow you to store through memory, making writes to disk really quick, but removing the guarantee that it was really written to disk.
E.g. In a bookkeeping api, we would store the validated object in PostgreSQL and then have async jobs add expensive attributes to this later (e.g. state that had to be retrieved from legacy APIs or through complex calculations).
E.g. in a write-heavy GIS system, we would store objects into a "to_process_places" table, that was monitored by tooling which processes the Places. It all really depends on your domain, and requirements.
Not using state.
A common solution is not to make objects, but use the actual payload by the customer. Just send the HTTP payload (in rails, the params) along and leave it at that. Maybe merge in a header (like the Request Date) or filter out some data (header tokens or cookies).
If your controller can operate with this data, so can a delayed job. Instead of building objects in the controller, leave that to the delayed job. This can even result in really neat and lean controllers: all they do is (some authentication and authorization and then) call the proper job and pass it a sanitized params.
Obviously this requires trade-offs like not being able to validate in-sync, but to give such info over email, push-notification, or delayed response instead, depending on your requirements (e.g. a large CSV import could just email any validation issues, but a login request might need to get immediate response if the login is invalid).
It also requires some thought: you probably don't want to send the Base64 encoded CSV along to sidekiq, but instead write the file to a (temp) storage and pass the filename/url along instead. This might sound obvious, because it is: file uploads are essentially an implementation of the earlier mentioned "temporary state storage": you don't pass the entire PDF/high-res-header-image/CSV along to sidekiq, but store it somewhere so sidekiq can pick it up later to process it. Why should the other attributes not employ the same pattern if passing them along to sidekiq is problematic?

The most important part from the best practices you linked is
Complex Ruby objects do not convert to JSON
Therefore you're not supposed to pass instances of a model to a worker.
If you're using Sidekiq workers, you should comply with this statement and the hash you're passing should be just fine. I am not exactly sure about the TimeWithZone object, but you could try converting this to a JSON or to a string as they do in the best practices guide.
However, if you're using ActiveJob instead of Sidekiq workers (does your Job inherit from ApplicationJob or does it include Sidekiq::Worker ?), then you don't have that problem because ActiveJob uses Global ID to convert objects into a String. And then before performing the job is deserializing the object again. Meaning you can pass an object to your job.
my_object = MyObject.find(1)
my_object.to_global_id #=> #<GlobalID:0x000045432da2344 [...] gid://your_app_name/MyObject/1>>
serialized_my_object = my_object.to_global_id.to_s
my_object = GlobalID.find(serialized_my_object)
You can find more information here
https://github.com/toptal/active-job-style-guide#active-record-models-as-arguments

After doing some experimentation on the Time objects in my job, I found that I am losing nanosecond precision at the other end of the job.
my_object.start_time
=> Mon, 21 Dec 2020 11:35:50 PST -08:00
my_object.strftime('%Y-%m-%d %H:%M:%S.%N')
=> "2020-12-21 11:35:50.151893000"
You can see here, we have precision including 6 digits after the decimal.
(see this answer for more about 'strftime')
Once we call JSON methods on the object:
generated = JSON.generate(my_object.attributes))
=> \"start_time\":\"2020-12-21T11:35:50.151-08:00\"
You can see here we are down to 3 digits of precision after the decimal. The remaining 3 digits are lost at this point.
parsed = JSON.parse(generated)
parsed[‘start_time’] = "2020-12-21T11:35:50.151-08:00"
It appears at the most basic level, the JSON library recursively calls as_json on each of the key-value pairs in the hash. So really it depends on how your particular object implements as_json.
This issue caused test failures that involved querying our db for persisted objects (initialized with something like, start_time = Time.zone.now (!)) that are meant to overlap in time exactly with our MyObject class. Once the half-baked my_object blueprints made it through Sidekiq, they lost a sliver of precision, causing a slight misalignment.
One way to hack away at this issue is by monkey patching the Time class.
In our case, a better solution was to go in the opposite direction and to not use so much precision in our tests. The my_object in the example is something that a human user will have on their calendar; in production we never receive so much precision from clients. So instead we fixed our tests by instructing some of our test objects to use something like Time.zone.now.beginning_of_minute, rather than Time.zone.now. We intentionally removed precision to fix the issue, as well as more closely mirror reality.

Related

Sidekiq worker and arguments size

Currently I've this code in one of my Sidekiq worker :
def perform(identity_id, format, model, items_ids, columns, options)
ExportListService::Dispatcher.new(
identity: identity_from(identity_id),
format: format,
items: items_from(model, items_ids),
columns: columns,
options: options&.symbolize_keys! || {}
).perform
end
The items_from method is in charge of recovering each item from the database in the order it was sent within the array items_ids, then we proceed through the service.
The order is very important as the controller which initiate this worker has multiple filters and options, IDs can be sent in all kind of orders which should not be lost when transmitted to Sidekiq.
Works great, but I realised this items_ids array could have more than 5 000 entries in it.
In term of scalability, what would be best ?
Keep it as it is, it won't impact the worker performance if the array is big. I didn't find anything relevant to params length about Sidekiq, so I don't know if it'll break the performances.
OR
Take the whole controller logic to sort this items_ids and copy it into the worker (imply possible duplicate and hard to maintain code)
What solution should I take ? Is there any other possibility I didn't think of ?
I would avoid pushing this much data to Sidekiq. We use Sidekiq workers mainly to kick off some async processing, only passing the information required to start the processing.
From your description, I would move the ids sorting logic to the model and query the model from either controller or worker.
The way I do it is I create an ActiveRecord object/table called: XYZJobRequest
Now, this JobRequest entry gets created at some point in the application life cycle. It has all the info in it, or there is reference to the info in it.
When calling Sidekiq all I do is, pass in the ID of the JobRequest object as the only parameter. (Usually with a "on_commit" hook) This keeps things simple.
Would that make sense for you?

ActiveJob: how to do simple operations without a full blown job class?

With delayed_job, I was able to do simple operations like this:
#foo.delay.increment!(:myfield)
Is it possible to do the same with Rails' new ActiveJob? (without creating a whole bunch of job classes that do these small operations)
ActiveJob is merely an abstraction on top of various background job processors, so many capabilities depend on which provider you're actually using. But I'll try to not depend on any backend.
Typically, a job provider consists of persistence mechanism and runners. When offloading a job, you write it into persistence mechanism in some way, then later one of the runners retrieves it and runs it. So the question is: can you express your job data in a format, compatible with any action you need?
That will be tricky.
Let's define what is a job definition then. For instance, it could be a single method call. Assuming this syntax:
Model.find(42).delay.foo(1, 2)
We can use the following format:
{
class: 'Model',
id: '42', # whatever
method: 'foo',
args: [
1, 2
]
}
Now how do we build such a hash from a given call and enqueue it to a job queue?
First of all, as it appears, we'll need to define a class that has a method_missing to catch the called method name:
class JobMacro
attr_accessor :data
def initialize(record = nil)
self.data = {}
if record.present?
self.data[:class] = record.class.to_s
self.data[:id] = record.id
end
end
def method_missing(action, *args)
self.data[:method] = action.to_s
self.data[:args] = args
GenericJob.perform_later(data)
end
end
The job itself will have to reconstruct that expression like so:
data[:class].constantize.find(data[:id]).public_send(data[:method], *data[:args])
Of course, you'll have to define the delay macro on your model. It may be best to factor it out into a module, since the definition is quite generic:
def delay
JobMacro.new(self)
end
It does have some limitations:
Only supports running jobs on persisted ActiveRecord models. A job needs a way to reconstruct the callee to call the method, I've picked the most probable one. You can also use marshalling, if you want, but I consider that unreliable: the unmarshalled object may be invalid by the time the job gets to execute. Same about "GlobalID".
It uses Ruby's reflection. It's a tempting solution to many problems, but it isn't fast and is a bit risky in terms of security. So use this approach cautiously.
Only one method call. No procs (you could probably do that with ruby2ruby gem). Relies on job provider to serialize arguments properly, if it fails to, help it with your own code. For instance, que uses JSON internally, so whatever works in JSON, works in que. Symbols don't, for instance.
Things will break in spectacular ways at first.
So make sure to set up your debugging tools before starting off.
An example of this is Sidekiq's backward (Delayed::Job) compatibility extension for ActiveRecord.
As far as I know, this is currently not supported. You can easily simulate this feature using a custom-defined proxy-job that accepts a model or instance, a method to be performed and a list of arguments.
However, for the sake of code testing and maintainability, this shortcut is not a good approach. It's more effective (even if you need to write a little bit more of code) to have a specific job for everything you want to enqueue. It forces you to think more about the design of your app.
I wrote a gem that can help you with that https://github.com/cristianbica/activejob-perform_later. But be aware that I believe that having methods all around your code that might be executed in workers is the perfect recipe for disaster is not handled carefully :)

Rails - how to cache data for server use, serving multiple users

I have a class method (placed in /app/lib/) which performs some heavy calculations and sub-http requests until a result is received.
The result isn't too dynamic, and requested by multiple users accessing a specific view in the app.
So, I want to schedule a periodic run of the method (using cron and Whenever gem), store the results somewhere in the server using JSON format and, by demand, read the results alone to the view.
How can this be achieved? what would be the correct way of doing that?
What I currently have:
def heavyMethod
response = {}
# some calculations, eventually building the response
File.open(File.expand_path('../../../tmp/cache/tests_queue.json', __FILE__), "w") do |f|
f.write(response.to_json)
end
end
and also a corresponding method to read this file.
I searched but couldn't find an example of achieving this using Rails cache convention (and not some private code that I wrote), on data which isn't related with ActiveRecord.
Thanks!
Your solution should work fine, but using Rails.cache should be cleaner and a bit faster. Rails guides provides enough information about Rails.cache and how to get it to work with memcached, let me summarize how I would use it in your case
Heavy method
def heavyMethod
response = {}
# some calculations, eventually building the response
Rails.cache.write("heavy_method_response", response)
end
Request
response = Rails.cache.fetch("heavy_method_response")
The only problem here is that when ur server starts for the first time, the cache will be empty. Also if/when memcache restarts.
One advantage is that somewhere on the flow, the data u pass in is marshalled into storage, and then unmartialled on the way out. Meaning u can pass in complex datastructures, and dont need to serialize to json manually.
Edit: memcached will clear your item if it runs out of memory. Will be very rare since its using a LRU (i think) algoritm to expire things, and I presume you will use this often.
To prevent this,
set expires_in larger than your cron period,
change your fetch code to call the heavy_method if ur fetch fails (like Rails.cache.fetch("heavy_method_response") {heavy_method}, and change heavy_method to just return the object.
Use something like redis which will not delete items.

How does Rails 4 Russian doll caching prevent stampedes?

I am looking to find information on how the caching mechanism in Rails 4 prevents against multiple users trying to regenerate cache keys at once, aka a cache stampede: http://en.wikipedia.org/wiki/Cache_stampede
I've not been able to find out much information via Googling. If I look at other systems (such as Drupal) cache stampede prevention is implemented via a semaphores table in the database.
Rails does not have a built-in mechanism to prevent cache stampedes.
According to the README for atomic_mem_cache_store (a replacement for ActiveSupport::Cache::MemCacheStore that mitigates cache stampedes):
Rails (and any framework relying on active support cache store) does
not offer any built-in solution to this problem
Unfortunately, I'm guessing that this gem won't solve your problem either. It supports fragment caching, but it only works with time-based expiration.
Read more about it here:
https://github.com/nel/atomic_mem_cache_store
Update and possible solution:
I thought about this a bit more and came up with what seems to me to be a plausible solution. I haven't verified that this works, and there are probably better ways to do it, but I was trying to think of the smallest change that would mitigate the majority of the problem.
I assume you're doing something like cache model do in your templates as described by DHH (http://37signals.com/svn/posts/3113-how-key-based-cache-expiration-works). The problem is that when the model's updated_at column changes, the cache_key likewise changes, and all your servers try to re-create the template at the same time. In order to prevent the servers from stampeding, you would need to retain the old cache_key for a brief time.
You might be able to do this by (dum da dum) caching the cache_key of the object with a short expiration (say, 1 second) and a race_condition_ttl.
You could create a module like this and include it in your models:
module StampedeAvoider
def cache_key
orig_cache_key = super
Rails.cache.fetch("/cache-keys/#{self.class.table_name}/#{self.id}", expires_in: 1, race_condition_ttl: 2) { orig_cache_key }
end
end
Let's review what would happen. There are a bunch of servers calling cache model. If your model includes StampedeAvoider, then its cache_key will now be fetching /cache-keys/models/1, and returning something like /models/1-111 (where 111 is the timestamp), which cache will use to fetch the compiled template fragment.
When you update the model, model.cache_key will begin returning /models/1-222 (assuming 222 is the new timestamp), but for the first second after that, cache will keep seeing /models/1-111, since that is what is returned by cache_key. Once 1 second passes, all of the servers will get a cache-miss on /cache-keys/models/1 and will try to regenerate it. If they all recreated it immediately, it would defeat the point of overriding cache_key. But because we set race_condition_ttl to 2, all of the servers except for the first will be delayed for 2 seconds, during which time they will continue to fetch the old cached template based on the old cache key. Once the 2 seconds have passed, fetch will begin returning the new cache key (which will have been updated by the first thread which tried to read/update /cache-keys/models/1) and they will get a cache hit, returning the template compiled by that first thread.
Ta-da! Stampede averted.
Note that if you did this, you would be doing twice as many cache reads, but depending on how common stampedes are, it could be worth it.
I haven't tested this. If you try it, please let me know how it goes :)
The :race_condition_ttl setting in ActiveSupport::Cache::Store#fetch should help avoid this problem. As the documentation says:
Setting :race_condition_ttl is very useful in situations where a cache entry is used very frequently and is under heavy load. If a cache expires and due to heavy load seven different processes will try to read data natively and then they all will try to write to cache. To avoid that case the first process to find an expired cache entry will bump the cache expiration time by the value set in :race_condition_ttl. Yes, this process is extending the time for a stale value by another few seconds. Because of extended life of the previous cache, other processes will continue to use slightly stale data for a just a bit longer. In the meantime that first process will go ahead and will write into cache the new value. After that all the processes will start getting new value. The key is to keep :race_condition_ttl small.
Great question. A partial answer that applies to single multi-threaded Rails servers but not multiprocess(or) environments (thanks to Nick Urban for drawing this distinction) is that the ActionView template compilation code blocks on a mutex that is per template. See line 230 in template.rb here. Notice there is a check for completed compilation both before grabbing the lock and after.
The effect is to serialize attempts to compile the same template, where only the first will actually do the compilation and the rest will get the already completed result.
Very interesting question. I searched on google (you get more results if you search for "dog pile" instead of "stampede") but like you, did I not get any answers, except this one blog post: protecting from dogpile using memcache.
Basically does it store you fragment in two keys: key:timestamp (where timestamp would be updated_at for active record objects) and key:last.
def custom_write_dogpile(key, timestamp, fragment, options)
Rails.cache.write(key + ':' + timestamp.to_s, fragment)
Rails.cache.write(key + ':last', fragment)
Rails.cache.delete(key + ':refresh-thread')
fragment
end
Now when reading from the cache, and trying to fetch a non existing cache, will it instead try to fecth the key:last fragment instead:
def custom_read_dogpile(key, timestamp, options)
result = Rails.cache.read(timestamp_key(name, timestamp))
if result.blank?
Rails.cache.write(name + ':refresh-thread', 0, raw: true, unless_exist: true, expires_in: 5.seconds)
if Rails.cache.increment(name + ':refresh-thread') == 1
# The cache didn't exists
result = nil
else
# Fetch the last cache, as the new one has not been created yet
result = Rails.cache.read(name + ':last')
end
end
result
end
This is a simplified summary of the by Moshe Bergman that i linked to before, or you can find here.
There is no protection against memcache stampedes. This is a real problem when multiple machines are involved and multiple processes on those multiple machines. -Ouch-.
The problem is compounded when one of the key processes has "died" leaving any "locking" ... locked.
In order to prevent stampedes you have to re-compute the data before it expires. So, if your data is valid for 10 minutes, you need to regenerate again at the 5th minute and re-set the data with a new expiration for 10 more minutes. Thus you don't wait until the data expires to set it again.
Should also not allow your data to expire at the 10 minute mark, but re-compute it every 5 minutes, and it should never expire. :)
You can use wget & cron to periodically call the code.
I recommend using redis, which will allow you to save the data and reload it in the advent of a crash.
-daniel
A reasonable strategy would be to:
use a :race_condition_ttl with at least the expected time it takes to refresh the resource. Setting it to less time than expected to perform a refresh is not advisable as the angry mob will end up trying to refresh it, resulting in a stampede.
use an :expires_in time calculated as the maximum acceptable expiry time minus the :race_condition_ttl to allow for refreshing the resource by a single worker and avoiding a stampede.
Using the above strategy will ensure that you don't exceed your expiry/staleness deadline and also avoid a stampede. It works because only one worker gets through to refresh, whilst the angry mob are held off using the cache value with the race_condition_ttl extension time right up to the originally intended expiry time.

how to save activerecord object of rails to redis

I am using redis as my web cache, and I want to store those activerecord objects to redis directly, but using redis-rb I get an error.
It seems that I can't serialize it or some what. Is there a lib to do this for me?
Am I have to serialize it to json format?
Which serialization format would be the most efficient?
Redis stores strings (and a few other data structures of strings); so you can serialize into Redis values however you like so long as you end up with a string.
JSON is probably the best place to start as it's lean, not overly brittle, works well with live upgrade patterns, and is readable in situ. Later you can add more complexity to meet your goals as needed, e.g., compression. #to_json and #from_json are already on ActiveRecord if you want to use JSON (with YAJL or its ilk that shouldn't be excessively slow, relatively speaking.) #to_xml is also there, if you're into S&M.
Raw marshaling can also work, but occasionally goes horrifically wrong (I've had marshaled objects exceed 2MB after LZO compression that were only a few K in JSON.)
If it's really a bottleneck for you, you'll want to run your own efficiency tests for your goal(s), e.g., write speed, read speed, or storage size, with your own objects and data patterns.
You can convert your model to a hash using attributes method and then save it with mapped_hmset
def redis_set()
redis.mapped_hmset("namespace:modelName:#{self.id}", self.attributes)
end
def redis_get(id)
redis.hgetall("namespace:modelName:#{id}")
end
def self.set(friend_list, player_id)
redis.set("friend_list_#{player_id}", Marshal.dump(friend_list)) == 'OK' ? friend_list : nil
end
def self.get(player_id)
friend_list = redis.get("friend_list_#{player_id}")
Marshal.load(friend_list) if friend_list
end

Resources