Working with stale data when performing asynchronous jobs with Sidekiq

In order to process events asynchronously and create an activity feed, I'm using Sidekiq and Ruby on Rails' Global ID.
This works well for most types of activities; however, some of them require data that could change by the time the job is performed.
Here's a completely made-up example:
class Movie < ActiveRecord::Base
  include Redis::Objects
  value :score # stores an integer in Redis
  has_many :likes
  def popular?
    likes.count > 1000
  end
end
And a Sidekiq worker performing a job every time a movie is updated:
class MovieUpdatedWorker
  include Sidekiq::Worker
  def perform(global_id)
    movie = GlobalID::Locator.locate(global_id)
    MovieUpdatedActivity.create(movie: movie, score: movie.score) if movie.popular?
  end
end
Now, imagine Sidekiq is lagging behind and, before it gets a chance to perform its job, the movie's score is updated in Redis, some users unliked the movie and the popular method now returns false.
Sidekiq ends up working with updated data.
I'm looking for ways to schedule jobs while making sure the required data won't change when the job is performed. A few ideas:
1/ Manually pass in all the required data and adjust the worker accordingly:
MovieUpdatedWorker.perform_async(
  movie: self,
  score: score,
  likes_count: likes.count
)
This could work but would require reimplementing/duplicating all methods that rely on data such as score and popular? (imagine an app with much more than these two/three movable pieces).
This doesn't scale well either since serialized objects could take up a lot of room in Redis.
2/ Stubbing some methods on the record passed in to the worker:
MovieUpdatedWorker.perform_async(
  global_id,
  stubs: { score: score, popular?: popular? }
)
class MovieUpdatedWorker
  include Sidekiq::Worker
  def perform(global_id, stubs: {})
    movie = GlobalID::Locator.locate(global_id)
    # inspired by RSpec
    stubs.each do |message, return_value|
      movie.stub(message) { return_value }
    end
    MovieUpdatedActivity.create(movie: movie, score: movie.score) if movie.popular?
  end
end
This isn't functional, but you can imagine the convenience of dealing with an actual record, not having to reimplement existing methods, and dealing with the actual data.
Do you see other strategies to "freeze" object data and asynchronously process them? What do you think about these two?

I wouldn't say that the data was stale since you would actually have the newest version of it, just that it was no longer popular. It sounds like you actually want the stale version.
If you don't want the data to have changed, you need to cache it somehow. Either pass the data to the job directly, as you say, or add some form of versioning to the data in the database and pass a reference to the old version.
I think passing on the data you need to Redis is a reasonable way. You could serialize only the attributes you actually care about, like score.
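As a rough illustration of passing only the values the job depends on, here is a plain-Ruby sketch. It does not use Sidekiq or ActiveRecord: the Movie struct, the snapshot method and the top-level perform method are hypothetical stand-ins, and the JSON round trip only imitates the fact that Sidekiq serializes job arguments as JSON.

```ruby
require "json"

# Hypothetical stand-in for the Movie record; in the real app these values
# would come from ActiveRecord and Redis::Objects.
Movie = Struct.new(:id, :score, :likes_count) do
  def popular?
    likes_count > 1000
  end

  # Capture only the values the job depends on, at enqueue time.
  def snapshot
    { "movie_id" => id, "score" => score, "popular" => popular? }
  end
end

# Plain-Ruby stand-in for the worker's perform method (no Sidekiq here).
# It decides based on the snapshot, never on the live record.
def perform(snapshot)
  return nil unless snapshot["popular"]
  { movie_id: snapshot["movie_id"], score: snapshot["score"] }
end

movie = Movie.new(1, 87, 1500)
# Simulate the JSON serialization of job arguments.
payload = JSON.parse(JSON.generate(movie.snapshot))

# The record changes before the job gets a chance to run.
movie.score = 12
movie.likes_count = 3

activity = perform(payload)
```

The activity is still built from the values captured at enqueue time, and the payload stays small because it carries a few scalars rather than a serialized record.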

Related

Is it inefficient to continually call the parent model for specific data attributes?

Let's say on some Child model method I need to do calculations based on some data stored on its Parent model. For example,
def child_method(minutes)
  remaining_time = minutes % self.parent.parent_settings
  if remaining_time >= 1
    return minutes / self.parent.parent_settings
  else
    return [minutes / self.parent.parent_settings - 1, 0].max
  end
end
In the above I've called self.parent.parent_settings 3 times. Based on how Rails works, is this efficient? Or is it a terrible idea, and I should instead set the parent_settings locally, e.g.,:
def child_method(minutes)
  parent_settings = self.parent.parent_settings
  remaining_time = minutes % parent_settings
  if remaining_time >= 1
    return minutes / parent_settings
  else
    return [minutes / parent_settings - 1, 0].max
  end
end
I have more complex instances of this (e.g., where in one child method I'm accessing multiple parent attributes, and also in some instances, grandparent attributes). I realize the answer might be "it depends" on exactly what is the data, etc., but looking to see if there are general rules of thumb or convention

Like you said, it depends.
Rails will cache fetched associations as long as the object remains in memory:
puts self.parent.parent_settings.object_id
# ... Some code
puts self.parent.parent_settings.object_id # => This should be the same object ID as before
This cache is cleared automatically by the framework and can be explicitly cleared via #reload:
self.reload
Your code should be fine as long as you're not running child_method multiple times in a request/response cycle. Even if you do, the ActiveRecord query cache will intercept repeated identical queries: within a single controller action, each distinct SQL query hits the database only once and later occurrences are served from memory.
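To make the effect of the local variable concrete, here is a small plain-Ruby sketch. The Parent and Child classes are hypothetical stand-ins (no ActiveRecord involved), and the reads counter exists only to show how often the setting is looked up.

```ruby
# Hypothetical stand-in for the parent model; it counts how often the
# setting is read so the difference becomes visible.
class Parent
  attr_reader :reads

  def initialize(settings)
    @settings = settings
    @reads = 0
  end

  def parent_settings
    @reads += 1
    @settings
  end
end

class Child
  attr_reader :parent

  def initialize(parent)
    @parent = parent
  end

  def child_method(minutes)
    settings = parent.parent_settings # looked up once per call
    if minutes % settings >= 1
      minutes / settings
    else
      [minutes / settings - 1, 0].max
    end
  end
end

parent = Parent.new(15)
child = Child.new(parent)

with_remainder = child.child_method(50) # 50 % 15 = 5, so 50 / 15
exact_multiple = child.child_method(30) # 30 % 15 = 0, so [30 / 15 - 1, 0].max
```

With the local variable, each call reads the setting exactly once, regardless of which branch runs.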

Expire cache based on saved value

In my app there is a financial overview page with quite a lot of queries. This page is refreshed once a month after executing a background job, so I added caching:
@iterated_hours = Rails.cache.fetch("productivity_data", expires_in: 24.hours) do
  FinancialsIterator.new.create_productivity_iterations(@company)
end
The cache must expire when the background job finishes, so I created a model CacheExpiration:
class CacheExpiration < ApplicationRecord
  validates :cache_key, :expires_in, presence: true
end
So in the background job a record is created:
CacheExpiration.create(cache_key: "productivity_data", expires_in: DateTime.now)
And the Rails.cache.fetch is updated to:
expires_in = get_cache_key_expiration("productivity_data")
@iterated_hours = Rails.cache.fetch("productivity_data", expires_in: expires_in) do
  FinancialsIterator.new.create_productivity_iterations(@company)
end
private def get_cache_key_expiration(cache_key)
  cache_expiration = CacheExpiration.find_by_cache_key(cache_key)
  if cache_expiration.present?
    cache_expiration.expires_in
  else
    24.hours
  end
end
So now the expiration is set to a DateTime, is this correct or should it be a number of seconds? Is this the correct approach to make sure the cache is expired only once when the background job finishes?

Explicitly setting an expires_in value is very limiting and error prone IMO. You will not be able to change the value once a cache value has been created (well you can clear the cache manually) and if ever you want to change the background job to run more/less often, you also have to remember to update the expires_in value. Additionally, the time when the background job is finished might be different from the time the first request to the view is made. As a worst case, the request is made a minute before the background job updates the information for the view. Your users will have to wait a whole day to get current information.
A more flexible approach is to rely on updated_at or in their absence created_at fields of ActiveRecord models.
For that, you can either rely on the CacheExpiration model you already created (it might already have the appropriate fields) or use the last of the "huge number of records" you create. Simply order them and take the most recent one: SomeArModel.order(created_at: :desc).first
The benefit of this approach is that whenever the AR model is created or updated, your cache is busted and a new entry is created. There is no longer any coupling between the time a user called the endpoint and the time the background job ran. If a record is created by any means other than the background job, that case is handled as well.
ActiveRecord models are first class citizens when it comes to caching. You can simply pass them in as cache keys. Your code would then change to:
Rails.cache.fetch(CacheExpiration.find_by_cache_key("productivity_data")) do
  FinancialsIterator.new.create_productivity_iterations(@company)
end
But if at all possible, try to find an alternative model so you no longer have to maintain CacheExpiration.
Rails also has a guide on that topic
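The idea of key-based expiration can be sketched without Rails. In the following plain-Ruby example, TinyCache is a hypothetical in-memory stand-in for Rails.cache.fetch and Marker is a hypothetical stand-in for a record with an updated_at timestamp; the cache_key format here is an assumption chosen for illustration, not the exact format ActiveRecord produces.

```ruby
# Hypothetical in-memory stand-in for Rails.cache.fetch.
class TinyCache
  def initialize
    @store = {}
  end

  # Returns the stored value for key, computing and storing it on a miss.
  def fetch(key)
    @store.fetch(key) { @store[key] = yield }
  end
end

# Hypothetical stand-in for a record with an updated_at timestamp.
Marker = Struct.new(:name, :updated_at) do
  # The key changes whenever the record is touched.
  def cache_key
    "#{name}/#{updated_at.to_i}"
  end
end

cache = TinyCache.new
marker = Marker.new("productivity_data", Time.at(1_000))
computations = 0

first = cache.fetch(marker.cache_key) { computations += 1 }
second = cache.fetch(marker.cache_key) { computations += 1 } # cache hit

marker.updated_at = Time.at(2_000) # e.g. the background job touches the record
third = cache.fetch(marker.cache_key) { computations += 1 }  # new key, recomputed
```

Nothing is ever expired explicitly: the old entry simply stops being looked up once the key changes, which is why no expires_in bookkeeping is needed.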

Send email to only certain Users, Rails 5

I'm having trouble sending an email blast to only certain users who have a boolean set to true and to not send the email to those users who have it set to false.
In my app I have Fans following Artists through Artists Relationships. Inside my ArtistRelationship model I have a boolean that fans can set to true or false based on if they want email blasts from Artists or not when the Artist makes a post.
So far, I have this:
artist.rb
class Artist < ApplicationRecord
  def self.fan_post_email
    Artist.includes(:fans).find_each do |fan|
      fan.includes(:artist_relationships).where(:post_email => true).find_each do |fan|
        FanMailer.post_email(fan).deliver_now
      end
    end
  end
end
posts_controller.rb
class Artists::PostsController < ApplicationController
  def create
    @artist = current_artist
    @post = @artist.artist_posts.build(post_params)
    if @post.save
      redirect_to artist_path(@artist)
      @artist.fan_post_email
    else
      render 'new'
      flash.now[:alert] = "You've failed!"
    end
  end
end
I'm having trouble getting the fan_post_email method to work. I'm not entirely sure how to find the Fans that have the boolean set to true in the ArtistRelationship model.

You want to send mails to fans of a particular artist. Therefore you call
@artist.fan_post_email
That is, you call a method on an instance of the Artist class. Instance methods are not defined with self.[METHOD_NAME]; doing so defines class methods (which you would call as, e.g., Artist.foo).
First part then is to remove the self. part, second is adapting the scope. The complete method should look like this:
def fan_post_email
  artist_relationships
    .includes(:fan)
    .where(post_email: true)
    .find_each do |relationship|
      FanMailer.post_email(relationship.fan).deliver_now
    end
end
Let's walk through this method.
We need to get all fans in order to send mails to them. This can be done by using the artist_relationships association. But as we only want to have those fans having checked the e-mail flag, we limit those by the where statement.
The resulting SQL condition will give us all such relationships. But we do it in batches (find_each) in order to not have to load all of the records into memory upfront.
The block provided to find_each is yielded an ArtistRelationship instance. But we need the fan instances, not the ArtistRelationship instances, to send the mail in our block, and thus call post_email with the fan associated with the relationship. To avoid N+1 queries there (one query per relationship to load its fan record), we eager load the fan association on artist_relationships.
Unrelated to the question
Using that method within the normal request/response cycle of a user's request will probably slow down the application quite a lot. If an artist has many fans, the application will send an e-mail to every one of them before rendering the response for the user. If it is a popular artist, I can easily imagine this taking minutes.
There is a counterpart to deliver_now called deliver_later (see the documentation). Jobs, like sending an e-mail, can be queued and resolved independently of the request/response cycle. It will require setting up a worker like Sidekiq or delayed_job, but the increase in performance is definitely worth it.
If the queueing mechanism is set up, it probably makes sense to move the call to fan_post_email there as well as the method itself might also take some time.
Additionally, it might make sense to send e-mail as BCC which would allow you to send one e-mail to multiple fans at the same time.
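The BCC idea could look roughly like this. It is a plain-Ruby sketch with hypothetical Struct stand-ins for the models and a hypothetical bcc_batches helper; it only shows how the opted-in recipients would be selected and grouped, not how a mailer would consume the batches.

```ruby
# Hypothetical plain-Ruby stand-ins for the ActiveRecord models.
Fan = Struct.new(:email)
ArtistRelationship = Struct.new(:fan, :post_email)

# Returns arrays of addresses, one array per outgoing e-mail (its BCC list).
def bcc_batches(relationships, batch_size)
  relationships
    .select(&:post_email)                           # keep only opted-in fans
    .map { |relationship| relationship.fan.email }
    .each_slice(batch_size)                         # group recipients per e-mail
    .to_a
end

relationships = [
  ArtistRelationship.new(Fan.new("a@example.com"), true),
  ArtistRelationship.new(Fan.new("b@example.com"), false),
  ArtistRelationship.new(Fan.new("c@example.com"), true),
  ArtistRelationship.new(Fan.new("d@example.com"), true)
]

batches = bcc_batches(relationships, 2)
```

Each batch would then become one e-mail with the addresses in its BCC field. The batch size of 2 is arbitrary here; a real limit would depend on the mail provider.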

sidekiq method to store some data in sqlite3 database, rather than redis?

I am currently working on a Rails project. I was asked to save the progress of Sidekiq workers and store it, so that the user of the application can see the progress. Now I am faced with this dilemma: is it better to just write to a text file or to save it in a database?
If it is a database, then how to save it in a model object. I know we can store the progress of workers by just sending out the info to log file.
class YourWorker
  include Sidekiq::Worker
  def perform
    logger.info { "Things are happening." }
    logger.debug { "Here's some info: #{hash.inspect}" }
  end
end
So if I want to save the progress of workers in a data model, then how?

Your thread title says that the data is unstructured, but your problem description indicates that the data should be structured. Which is it? Speed is not always the most important consideration, and it doesn't seem to be very important in your case. The most important consideration is how your data will be used. Will the way your data is used in the future change? Usually, a database with an appropriate model is the better answer because it allows flexibility for future requirements. It also allows other clients access to your data.

You can create a Job class and then update some attribute of the currently working job.
class Job < ActiveRecord::Base
  # assume that there is a 'status' attribute that is defined as 'text'
end
Then when you queue something to happen you create a new Job and pass the id of the Job to perform or perform_async.
job = Job.create!
YourWorker.perform_async job.id
Then in your worker, you'd receive the id of the job to be worked on, and then retrieve and update that record.
def perform(job_id)
  job = Job.find job_id
  job.status = "It's happening!"
  job.save
end
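To show how progress could be reported step by step, here is a plain-Ruby sketch of the same flow. The Job class below is a hypothetical in-memory stand-in for the ActiveRecord model (no database, no Sidekiq), and the history array exists only so the intermediate states can be inspected afterwards.

```ruby
# Hypothetical in-memory stand-in for the ActiveRecord Job model.
class Job
  @records = {}
  @next_id = 0

  class << self
    def create!
      @next_id += 1
      @records[@next_id] = new(@next_id)
    end

    def find(id)
      @records.fetch(id)
    end
  end

  attr_reader :id, :status, :history

  def initialize(id)
    @id = id
    @status = "queued"
    @history = []
  end

  # Record every status change so progress can be inspected later.
  def status=(value)
    @history << value
    @status = value
  end
end

# Stand-in for the worker's perform method: update the status per item.
def perform(job_id, items)
  job = Job.find(job_id)
  items.each_with_index do |_item, index|
    job.status = "processed #{index + 1} of #{items.size}"
  end
  job.status = "done"
end

job = Job.create!
perform(job.id, %w[a b])
```

In a real application each status change would be saved to the database, and the page showing progress would read the status of the Job with that id.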

How do I change a variable every so minutes in rails?

I want to display a random record from the database for a certain amount of time, after that time it gets refreshed to another random record.
How would I go about that in rails?
Right now I'm looking in the directions of cronjobs, also the whenever gem, .. but I'm not 100% sure I really need all that for what seems to be a pretty simple action?

Use the Rails.cache mechanism.
In your controller:
@record = Rails.cache.fetch("cached_record", expires_in: 5.minutes) do
  # pick a random row by skipping a random number of records
  Model.offset(rand(Model.count)).first
end
During the first execution, result gets cached in the Rails cache. A new random record is retrieved after 5 minutes.
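The expiry behaviour described above can be sketched in plain Ruby. ExpiringValue is a hypothetical helper, not part of Rails, and the injectable clock is an assumption made so the example can be checked without actually waiting five minutes.

```ruby
# Hypothetical helper: holds a value and recomputes it once the TTL elapses.
class ExpiringValue
  # clock is injectable so the behaviour can be checked without sleeping
  def initialize(ttl_seconds, clock: -> { Time.now })
    @ttl = ttl_seconds
    @clock = clock
    @expires_at = nil
  end

  # Returns the held value, running the block only when it is missing or expired.
  def fetch
    now = @clock.call
    if @expires_at.nil? || now >= @expires_at
      @value = yield
      @expires_at = now + @ttl
    end
    @value
  end
end

now = Time.at(0)
picks = 0
holder = ExpiringValue.new(300, clock: -> { now })

first = holder.fetch { picks += 1 }  # nothing held yet, block runs
now += 60
second = holder.fetch { picks += 1 } # within 5 minutes, held value returned
now += 300
third = holder.fetch { picks += 1 }  # TTL elapsed, block runs again
```

In the controller, the block would pick the random record, so a new record is chosen at most once per five-minute window.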

I would have an expiry_date in my model and then present the user with a JavaScript timer. After the time has elapsed, I would send a request back to the server (Ajax probably, or maybe a page refresh) and check whether the time has indeed expired. If so, I would present the new record.

You could simply check the current time in your controller, something like:
def show
  # class variables persist across requests within a single server process
  @@last_refresh ||= Time.current
  @@current ||= MyModel.get_random
  if @@last_refresh < 5.minutes.ago
    @@current = MyModel.get_random
    @@last_refresh = Time.current
  end
end
This kind of code wouldn't scale to more servers (as it relies on class variables for data storage), so in reality you would want to store the two values in something like Redis (or even Memcached) for high performance. It really depends on how accurately you need this and how much performance you need. You could just as well use your normal database to store expiry times and then load the record whose time is current.

My first thought was to cache the record in a global, but you could end up with different records being served by different servers. How about adding a :chosen_at datetime column to your record...
class Model < ActiveRecord::Base
  def self.random
    # the currently chosen record is the one whose chosen_at is set
    @@random = where.not(chosen_at: nil).first
    return @@random unless @@random.nil? || @@random.chosen_at < 5.minutes.ago
    @@random.update_attribute(:chosen_at, nil) if @@random
    # pick a new record at random and mark it as chosen
    @@random = find(ids.sample)
    @@random.update_attribute(:chosen_at, Time.current)
    @@random
  end
end
