I am trying to understand the race_condition_ttl directive in Rails when using Rails.cache.fetch.
I have a controller action that looks like this:
def foo
#foo = Rails.cache.fetch("foo-testing", expires_in: 30.seconds, race_condition_ttl: 60.seconds) do
Time.now.to_s
end
#foo # this gets used in a view down the line...
end
Based on what I'm reading in the Rails docs, this value should expire after 30 seconds, but the stale value is allowed to be served for another 60 seconds. However, I can't figure out how to reproduce conditions that will show me this behavior working. Here is how I'm trying to test it.
100.times.map do
t = Thread.new { RestClient.get("http://myenvironment/foo") }
t
end.map {|t| t.join.value }.uniq
I have my Rails app running on a VM behind a standard nginx/unicorn setup. I am trying to spawn 100 threads hitting the site simultaneously to simulate the "dog pile effect". However, when I run my test code, all the threads report the same value back. What I would expect to see is that one thread gets the fresh value, while at least one other thread gets served some stale content.
Any pointers are welcome! Thanks so much.
You are setting race_condition_ttl to 60 seconds which means your threads will only start getting the new value after this time expires, even not taking into account the initial 30 seconds.
Your test doesn't look like it would take 1.5 minutes to run which would be required in order for the threads to start getting the new value. From the Rails Cache docs:
Yes, this process is extending the time for a stale value by another few seconds. Because of extended life of the previous cache, other processes will continue to use slightly stale data for a just a bit longer.
The text implies using a small race_condition_ttl and it makes sense both for its purpose and your test.
UPDATE
Also note that the life of stale cache is extended only if it expired recently. Otherwise a new value is generated and :race_condition_ttl does not play any role.
Without reading source it is not particularly clear how Rails decides when its server is getting hammered or what exactly recently means in the quote above. It seems clear though that the first process (of many) of those waiting to access the cache gets to set the new value while extending life of the previous one. The presence of waiting processes might be the condition Rails looks for. In any case the expected behaviour should be observed after both initial timeout and ttl expire and cache starts serving the updated value. The delay between initial timeout and the time new value starts showing up should be similar to the ttl. Of course the precondition is the server should be hammered around the moment of initial timeout expiration.
Related
I want to run my Rails 5 app on Puma. I use low-level caching and suppose the way to have thread-safe caching:
# somewhere in a model ...
##mutex = Mutex.new
def nice_suff
Rails.cache.fetch("a_key") do
##mutex.synchronize do
Rails.cache.fetch("a_key", 60) do
Model.stuff.to_a
end
end
end
end
Will this be working fine?
The proper way to handle concurrent cache access is already built-in.
val_1 = cache.fetch('foo', race_condition_ttl: 10.seconds) do
Model.stuff.to_a
end
Setting :race_condition_ttl is very useful in situations where a cache entry is used very frequently and is under heavy load. If a cache expires and due to heavy load several different processes will try to read data natively and then they all will try to write to cache. To avoid that case the first process to find an expired cache entry will bump the cache expiration time by the value set in :race_condition_ttl. Yes, this process is extending the time for a stale value by another few seconds. Because of extended life of the previous cache, other processes will continue to use slightly stale data for a just a bit longer. In the meantime that first process will go ahead and will write into cache the new value. After that all the processes will start getting the new value. The key is to keep :race_condition_ttl small.
I set 10000 keys in memcache
for i in 1..10000
Rails.cache.write("short_key#{i}", i)
end
After ~500s (not benchmarked but happens around 10m), when I do
_random = rand(10000)
Rails.cache.read("short_key#{_random}")
returns nil. This is fine. Memcached LRU policy might have destroyed those keys.
But, issue is I see a lot of free memory on server.
Also, when I run following command in telnet session,
stats cachedump 1 10
I get some random keys which I set earlier in loop and even when I try to fetch them via rails or telnet/get, memcached is not able to read that value.
Those key/values are eating up memory but somehow getting destroyed.
I use dalli to connect with memcached.
How can I correct this?
At first glance, this seems possible if the default keep alive time value is low (10 minutes or 500 seconds are both possible default values).
Since you are not setting up the expires_in (or equivalent time_to_live field), the key will be setup for default time, after which the value will expire.
Referring here:
Setting :expires_in will set an expiration time on the cache. All caches support auto-expiring content after a specified number of seconds. This value can be specified as an option to the constructor (in which case all entries will be affected), or it can be supplied to the fetch or write method to effect just one entry.
I am looking to find information on how the caching mechanism in Rails 4 prevents against multiple users trying to regenerate cache keys at once, aka a cache stampede: http://en.wikipedia.org/wiki/Cache_stampede
I've not been able to find out much information via Googling. If I look at other systems (such as Drupal) cache stampede prevention is implemented via a semaphores table in the database.
Rails does not have a built-in mechanism to prevent cache stampedes.
According to the README for atomic_mem_cache_store (a replacement for ActiveSupport::Cache::MemCacheStore that mitigates cache stampedes):
Rails (and any framework relying on active support cache store) does
not offer any built-in solution to this problem
Unfortunately, I'm guessing that this gem won't solve your problem either. It supports fragment caching, but it only works with time-based expiration.
Read more about it here:
https://github.com/nel/atomic_mem_cache_store
Update and possible solution:
I thought about this a bit more and came up with what seems to me to be a plausible solution. I haven't verified that this works, and there are probably better ways to do it, but I was trying to think of the smallest change that would mitigate the majority of the problem.
I assume you're doing something like cache model do in your templates as described by DHH (http://37signals.com/svn/posts/3113-how-key-based-cache-expiration-works). The problem is that when the model's updated_at column changes, the cache_key likewise changes, and all your servers try to re-create the template at the same time. In order to prevent the servers from stampeding, you would need to retain the old cache_key for a brief time.
You might be able to do this by (dum da dum) caching the cache_key of the object with a short expiration (say, 1 second) and a race_condition_ttl.
You could create a module like this and include it in your models:
module StampedeAvoider
def cache_key
orig_cache_key = super
Rails.cache.fetch("/cache-keys/#{self.class.table_name}/#{self.id}", expires_in: 1, race_condition_ttl: 2) { orig_cache_key }
end
end
Let's review what would happen. There are a bunch of servers calling cache model. If your model includes StampedeAvoider, then its cache_key will now be fetching /cache-keys/models/1, and returning something like /models/1-111 (where 111 is the timestamp), which cache will use to fetch the compiled template fragment.
When you update the model, model.cache_key will begin returning /models/1-222 (assuming 222 is the new timestamp), but for the first second after that, cache will keep seeing /models/1-111, since that is what is returned by cache_key. Once 1 second passes, all of the servers will get a cache-miss on /cache-keys/models/1 and will try to regenerate it. If they all recreated it immediately, it would defeat the point of overriding cache_key. But because we set race_condition_ttl to 2, all of the servers except for the first will be delayed for 2 seconds, during which time they will continue to fetch the old cached template based on the old cache key. Once the 2 seconds have passed, fetch will begin returning the new cache key (which will have been updated by the first thread which tried to read/update /cache-keys/models/1) and they will get a cache hit, returning the template compiled by that first thread.
Ta-da! Stampede averted.
Note that if you did this, you would be doing twice as many cache reads, but depending on how common stampedes are, it could be worth it.
I haven't tested this. If you try it, please let me know how it goes :)
The :race_condition_ttl setting in ActiveSupport::Cache::Store#fetch should help avoid this problem. As the documentation says:
Setting :race_condition_ttl is very useful in situations where a cache entry is used very frequently and is under heavy load. If a cache expires and due to heavy load seven different processes will try to read data natively and then they all will try to write to cache. To avoid that case the first process to find an expired cache entry will bump the cache expiration time by the value set in :race_condition_ttl. Yes, this process is extending the time for a stale value by another few seconds. Because of extended life of the previous cache, other processes will continue to use slightly stale data for a just a bit longer. In the meantime that first process will go ahead and will write into cache the new value. After that all the processes will start getting new value. The key is to keep :race_condition_ttl small.
Great question. A partial answer that applies to single multi-threaded Rails servers but not multiprocess(or) environments (thanks to Nick Urban for drawing this distinction) is that the ActionView template compilation code blocks on a mutex that is per template. See line 230 in template.rb here. Notice there is a check for completed compilation both before grabbing the lock and after.
The effect is to serialize attempts to compile the same template, where only the first will actually do the compilation and the rest will get the already completed result.
Very interesting question. I searched on google (you get more results if you search for "dog pile" instead of "stampede") but like you, did I not get any answers, except this one blog post: protecting from dogpile using memcache.
Basically does it store you fragment in two keys: key:timestamp (where timestamp would be updated_at for active record objects) and key:last.
def custom_write_dogpile(key, timestamp, fragment, options)
Rails.cache.write(key + ':' + timestamp.to_s, fragment)
Rails.cache.write(key + ':last', fragment)
Rails.cache.delete(key + ':refresh-thread')
fragment
end
Now when reading from the cache, and trying to fetch a non existing cache, will it instead try to fecth the key:last fragment instead:
def custom_read_dogpile(key, timestamp, options)
result = Rails.cache.read(timestamp_key(name, timestamp))
if result.blank?
Rails.cache.write(name + ':refresh-thread', 0, raw: true, unless_exist: true, expires_in: 5.seconds)
if Rails.cache.increment(name + ':refresh-thread') == 1
# The cache didn't exists
result = nil
else
# Fetch the last cache, as the new one has not been created yet
result = Rails.cache.read(name + ':last')
end
end
result
end
This is a simplified summary of the by Moshe Bergman that i linked to before, or you can find here.
There is no protection against memcache stampedes. This is a real problem when multiple machines are involved and multiple processes on those multiple machines. -Ouch-.
The problem is compounded when one of the key processes has "died" leaving any "locking" ... locked.
In order to prevent stampedes you have to re-compute the data before it expires. So, if your data is valid for 10 minutes, you need to regenerate again at the 5th minute and re-set the data with a new expiration for 10 more minutes. Thus you don't wait until the data expires to set it again.
Should also not allow your data to expire at the 10 minute mark, but re-compute it every 5 minutes, and it should never expire. :)
You can use wget & cron to periodically call the code.
I recommend using redis, which will allow you to save the data and reload it in the advent of a crash.
-daniel
A reasonable strategy would be to:
use a :race_condition_ttl with at least the expected time it takes to refresh the resource. Setting it to less time than expected to perform a refresh is not advisable as the angry mob will end up trying to refresh it, resulting in a stampede.
use an :expires_in time calculated as the maximum acceptable expiry time minus the :race_condition_ttl to allow for refreshing the resource by a single worker and avoiding a stampede.
Using the above strategy will ensure that you don't exceed your expiry/staleness deadline and also avoid a stampede. It works because only one worker gets through to refresh, whilst the angry mob are held off using the cache value with the race_condition_ttl extension time right up to the originally intended expiry time.
My Survey model has about 2500 instances and I need to apply the set_state method to each instance twice. I need to apply it the second time only after every instance has had the method applied to it once. (The state of an instance can depend on the state of other instances.)
I'm using delayed_job to create delayed jobs and workless to automatically scale up/down my worker dynos as required.
The set_state method typically takes about a second to execute. So I've run the following at the heroku console:
2.times do
Survey.all.each do |survey|
survey.delay.set_state
sleep(4)
end
end
Shouldn't be any issues with overloading the API, right?
And yet I'm still seeing the following in my logs for each delayed job:
Heroku::API::Errors::ErrorWithResponse: Expected(200) <=> Actual(429 Unknown)
I'm not seeing any infinite loops -- it just returns this message as soon as I create the delayed job.
How can I avoid blowing Heroku's API rate limits?
Reviewing workless, it looks like it incurs an API call per delayed job to check the worker count and potentially a second API call to scale up/down. So if you are running 5000 (2500x2) jobs within a short period, you'll end up with 5000+ API calls. Which would be well in excess of the 1200/requests per hour limit. I've commented over there to hopefully help toward reducing the overall API usage (https://github.com/lostboy/workless/issues/33#issuecomment-20982433), but I think we can offer a more specific solution for you.
In the mean time, especially if your workload is pretty predictable (like this). I'd recommend skipping workless and doing that portion yourself. ie it sounds like you already know WHEN the scaling would need to happen (scale up right before the loop above, scale down right after). If that is the case you could do something like this to emulate the behavior in workless:
require 'heroku-api'
heroku = Heroku::API.new(:api_key => ENV['HEROKU_API_KEY'])
client.post_ps_scale(ENV['APP_NAME'], 'worker', Survey.count)
2.times do
Survey.all.each do |survey|
survey.delay.set_state
sleep(4)
end
end
min_workers = ENV['WORKLESS_MIN_WORKERS'].present? ? ENV['WORKLESS_MIN_WORKERS'].to_i : 0
client.post_ps_scale(ENV['APP_NAME'], 'worker', min_workers)
Note that you'll need to remove workless from these jobs also. I didn't see a particular way to do this JUST for certain jobs though, so you might want to ask on that project if you need that. Also, if this needs to be 2 pass (the first time through needs to finish before the second), the 4 second sleep may in some cases be insufficient but that is a different can of worms.
I hope that helps narrow in on what you needed, but I'm certainly happy to discuss further and/or elaborate on the above as needed. Thanks!
Rails has a nice set of filters (before_validation, before_create, after_save, etc) as well as support for observers, but I'm faced with a situation in which relying on a filter or observer is far too computationally expensive. I need an alternative.
The problem: I'm logging web server hits to a large number of pages. What I need is a trigger that will perform an action (say, send an email) when a given page has been viewed more than X times. Due to the huge number of pages and hits, using a filter or observer will result in a lot of wasted time because, 99% of the time, the condition it tests will be false. The email does not have to be sent out right away (i.e. a 5-10 minute delay is acceptable).
What I am instead considering is implementing some kind of process that sweeps the database every 5 minutes or so and checks to see which pages have been hit more than X times, recording that state in a new DB table, then sending out a corresponding email. It's not exactly elegant, but it will work.
Does anyone else have a better idea?
Rake tasks are nice! But you will end up writing more custom code for each background job you add. Check out the Delayed Job plugin http://blog.leetsoft.com/2008/2/17/delayed-job-dj
DJ is an asynchronous priority queue that relies on one simple database table. According to the DJ website you can create a job using Delayed::Job.enqueue() method shown below.
class NewsletterJob < Struct.new(:text, :emails)
def perform
emails.each { |e| NewsletterMailer.deliver_text_to_email(text, e) }
end
end
Delayed::Job.enqueue( NewsletterJob.new("blah blah", Customers.find(:all).collect(&:email)) )
I was once part of a team that wrote a custom ad server, which has the same requirements: monitor the number of hits per document, and do something once they reach a certain threshold. This server was going to be powering an existing very large site with a lot of traffic, and scalability was a real concern. My company hired two Doubleclick consultants to pick their brains.
Their opinion was: The fastest way to persist any information is to write it in a custom Apache log directive. So we built a site where every time someone would hit a document (ad, page, all the same), the server that handled the request would write a SQL statement to the log: "INSERT INTO impressions (timestamp, page, ip, etc) VALUES (x, 'path/to/doc', y, etc);" -- all output dynamically with data from the webserver. Every 5 minutes, we would gather these files from the web servers, and then dump them all in the master database one at a time. Then, at our leisure, we could parse that data to do anything we well pleased with it.
Depending on your exact requirements and deployment setup, you could do something similar. The computational requirement to check if you're past a certain threshold is still probably even smaller (guessing here) than executing the SQL to increment a value or insert a row. You could get rid of both bits of overhead by logging hits (special format or not), and then periodically gather them, parse them, input them to the database, and do whatever you want with them.
When saving your Hit model, update a redundant column in your Page model that stores a running total of hits, this costs you 2 extra queries, so maybe each hit takes twice as long to process, but you can decide if you need to send the email with a simple if.
Your original solution isn't bad either.
I have to write something here so that stackoverflow code-highlights the first line.
class ApplicationController < ActionController::Base
before_filter :increment_fancy_counter
private
def increment_fancy_counter
# somehow increment the counter here
end
end
# lib/tasks/fancy_counter.rake
namespace :fancy_counter do
task :process do
# somehow process the counter here
end
end
Have a cron job run rake fancy_counter:process however often you want it to run.