Rails comes with a handy session hash into which we can cram stuff to our heart's content. I would, however, like something like ASP's application context, which instead of sharing data only within a single session, will share it with all sessions in the same application. I'm writing a simple dashboard app, and would like to pull data every 5 minutes, rather than every 5 minutes for each session.
I could, of course, store the cache update times in a database, but so far haven't needed to set up a database for this app, and would love to avoid that dependency if possible.
So, is there any way to get (or simulate) this sort of thing? If there's no way to do it without a database, is there any kind of "fake" database engine that comes with Rails, runs in memory, but doesn't bother persisting data between restarts?
Right answer: memcached . Fast, clean, supports multiple processes, integrates very cleanly with Rails these days. Not even that bad to set up, but it is one more thing to keep running.
90% Answer: There are probably multiple Rails processes running around -- one for each Mongrel you have, for example. Depending on the specifics of your caching needs, its quite possible that having one cache per Mongrel isn't the worst thing in the world. For example, supposing you were caching the results of a long-running query which
gets fresh data every 8 hours
is used every page load, 20,000 times a day
needs to be accessed in 4 processes (Mongrels)
then you can drop that 20,000 requests down to 12 with about a single line of code
##arbitrary_name ||= Model.find_by_stupidly_long_query(param)
The double at-mark, a Ruby symbol you might not be familiar with, is a global variable. ||= is the commonly used Ruby idiom to execute the assignment if and only if the variable is currently nil or otherwise evaluates to false. It will stay good until you explicitly empty it OR until the process stops, for any reason -- server restart, explicitly killed, what have you.
And after you go down from 20k calculations a day to 12 in about 15 seconds (OK, two minutes -- you need to wrap it in a trivial if block which stores the cache update time in a different global), you might find that there is no need to spend additional engineering assets on getting it down to 4 a day.
I actually use this in one of my production sites, for caching a few expensive queries which literally only need to be evaluated once in the life of the process (i.e. they change only at deployment time -- I suppose I could precalculate the results and write them to disk or DB but why do that when SQL can do the work for me).
You don't get any magic expiry syntax, reliability is pretty slim, and it can't be shared across processes -- but its 90% of what you need in a line of code.
You should have a look at memcached: http://wiki.rubyonrails.org/rails/pages/MemCached
There is a helpful Railscast on Rails 2.1 caching. It is very useful if you plan on using memcached with Rails.
Using the stock Rails cache is roughly equivalent to this.
#p3t0r- is right,MemCached is probably the best option, but you could also use the sqlite database that comes with Rails. That won't work over multiple machines though, where MemCached will. Also, sqlite will persist to disk, though I think you can set it up not to if you want. Rails itself has no application-scoped storage since it's being run as one-process-per-request-handler so it has no shared memory space like ASP.NET or a Java server would.
So what you are asking is quite impossible in Rails because of the way it is designed. What you ask is a shared object and Rails is strictly single threaded. Memcached or similar tool for sharing data between distributed processes is the only way to go.
The Rails.cache freezes the objects it stores. This kind of makes sense for a cache but NOT for an application context. I guess instead of doing a roundtrip to the moon to accomplish that simple task, all you have to do is create a constant inside config/environment.rb
APP_CONTEXT = Hash.new
Pretty simple, ah?
Related
I am building a rails app, have a site wide counter variable (counter) that is used (read and write) by different pages, basically many pages can cause the counter increment, what is a multi-thread-safe way to store this variable so that
1) it is thread-safe, I may have many concurrent user access might read and write this variable
2) high performing, I originally thought about persist this variable in DB, but wondering is there a better way given there can be high volume of request and I don't want to make this DB query the bottleneck of my app...
suggestions?
It depends whether it has to be perfectly accurate. Assuming not, you can store it in memcached and sync to the database occasionally. If memcached crashes, it expires (shouldn't happen if configured properly), you have to shutdown, etc., reload it from the database on startup.
You could also look at membase. I haven't tried it, but to simplify, it's a distributed memcached server that automatically persists to disk.
For better performance and accuracy, you could look at a sharded approach.
Well you need persistence, so you have to store it in the Database/Session/File, AFAIK.
I have inherited an app that generates a large array for every user that visit the app. I recently discovered that it is identical for nearly all the users!!
Now I want to somehow make one copy of it so it is not built over and over again. I have thought of a few options and wanted input to see which one is the best:
1) Create a model and shove the data into the database
2) Create a YAML file and have the app load it when it initializes.
I personally like the model idea but a few engineers at work feel as though it does not deserve to be a full model. 97% of the times users will see the same exact thing but 3% of the time users will get a slightly different array (a few elements will have changed).
Any other approaches that I should consider.??..thanks in advance.
Remember that if you store the data in the DB, each request which requires the data will have to execute a DB query to pull it out. If you are running multiple server threads, each thread could have its own copy in memory (if they are all handling requests which require the use of the array). In that case, you wouldn't be saving any memory (though you might save time from not having to regenerate the array).
If you are running multiple server processes (not threads), and if the array contents change as the application is running, and the changes have to be visible to all the processes, caching in memory won't work. You will have to use the DB in that case.
From the information in your comment, I suggest you try something like this:
Store the array in your DB, and make sure that the record(s) used have created/updated timestamps. Cache the contents in memory using a constant/global variable/class variable. Also store the last time the cache was updated.
Every time you need to use the array, retrieve the relevant "updated" timestamp from the DB. (You may need to use hand-coded SQL and ModelName.connection.execute to avoid pulling back all the data in the record, which ActiveRecord will probably do.) If the timestamp is later than the last time your cache was updated, pull the array from the DB and update your cache.
Use a Mutex ('require thread') when retrieving/updating the cached data, in case your server setup may use multiple threads. (I don't think that Passenger does, but I have had problems similar to threading problems when using Passenger+RMagick, so I would still use a Mutex to be safe.)
Wrap all the code which deals with the cached array in a library class (or a class method on the model used to store the data), so the details of cache management don't spill over into the rest of the application.
Do a little bit of performance testing on the cache setup using Benchmark.measure {}. If a bug in the setup actually made performance worse rather than better, that would be sad...
I'd go with option 2. You can add two constants (for the 97% and 3%) that load from a YAML file when the app initializes. That ought to shrink your memory footprint considerably.
Having said that, yikes, this is just a band-aid on a hack, but you knew that already. I'd consider putting some time into a redesign, if you have that luxury.
Caching is something that I kind of ignored for a long time, as projects that I worked on were on local intranets with very little activity. I'm working on a much larger Rails 3 personal project now, and I'm trying to work out what and when I should cache things.
How do people generally determine this?
If I know a site is going to be relatively low-activity, should I just cache every single page?
If I have a page that calls several partials, is it better to do fragment caching in those partials, or page caching on those partials?
The Ruby on Rails guides did a fine job of explaining how caching in Rails 3 works, but I'm having trouble understanding the decision-making process associated with it.
Don't ever cache for the sake of it, cache because there's a need (with the exception of something like the homepage, which you know is going to be super popular.) Launch the site, and either parse your logs or use something like NewRelic to see what's slow. From there, you can work out what's worth caching.
Generally though, if something takes 500ms to complete, you should cache, and if it's over 1 second, you're probably doing too much in the request, and you should farm whatever you're doing to a background process…for example, fetching a Twitter feed, or manipulating images.
EDIT: See apneadiving's answer too, he links to some great screencasts (albeit based on Rails 2, but the theory is the same.)
You'll want to think about caching several kinds of things:
Requests that are hit a lot, and seldom change
Requests that are "expensive" to draw, lots of database calls, etc. Also hopefully these seldom change.
The other side of caching that shouldn't go without mention, is expiration. Its also often the harder part. You have to know when a cache is no longer good, and clear it out so fresh content will be generated. Sweepers, or Observers, depending on how you implement your cache can help you with this. You could also do it just based on a time value, allow caches to have a max-age and clear them after that no matter what.
As for fragment vs full page caching, think of it in terms of how often those parts are updated. If 3 partials of a page are never updated, and one is, maybe you want to cache those 3, and allow that 1 to be fetched live for so you can have up to the second accuracy. Or if the different partials of a page should have different caching rules: maybe a "timeline" section is cached, but has a cache-age of 1 minute. While the "friends" partial is cached for 12 hours.
Hope this helps!
If the site is relatively low activity you shouldn't cache any page. You cache because of performance problems, and performance problems come about because you have too much data to query, too many users, or worse, both of those situations at the same time.
Before you even think about caching, the first thing you do is look through your application for the requests that are taking up the most time. Not the slowest requests, but the requests your application spends the most aggregate time performing. That is if you have a request A that runs 10 times at 1500ms and request B that runs 5000 times at 250ms you work on optimizing B first.
It's actually pretty easy to grep through your production.log and extract rendering times and URLs to combine them into a simple report. You can even do that in real-time if you want.
Once you've identified a problematic request, you go about picking apart what it's doing to service the request. The first thing is to look for any queries that can be combined by using eager loading or by looking ahead a bit more to anticipate what you'll need. The next thing is to ensure you're not loading data that isn't used.
So many times you'll see code to list users and it's loading 50KB per person of biographical data, their Facebook and Twitter handles, literally everything about them, and all you use is their name.
Fetch as little as you need, and fetch it in the most efficient way you can. Use connection.select_rows when you don't need models.
The next step is to look at what kind of queries you're running, and how they're under-performing. Ensure your indexes are all set properly and are being used. Check that you're not doing complicated JOIN operations that could be resolved by a bit of tactical de-normalization.
Have a look at what data you are storing in your application, and try and find things that can be removed from your production database and warehoused somewhere else. Cycle your data out regularly when it's no longer relevant, preserve it in a separate database if you need to.
Then go over and have a look at how your database server is tuned. Does it have sufficiently large buffers? Is it on hardware that could be upgraded with more memory at a nominal cost? Too many people are running a completely un-tuned database server and with a few simple settings they can get ten-fold performance increases.
If, and only if, you still have a performance problem at this point then you might want to consider caching.
You know why you don't cache first? It's because once you cache something, that cached data is immediately stale. If parts of your application use this data under the assumption it's always up to date, you will have problems. If you don't expire this cache when the data does change, you will have problems. If you cache the data and never use it again, you're just clogging up your cache and you will have problems. Basically you'll have lots of problems when you use caching, so it's often a last resort.
I have always been taught that storing objects in a session was a bad idea. Instead IDs should be stored that retrieve the record when needed.
However, I have an application that I wonder is an exception to this rule. I'm building a flashcard application, and the words being quizzed are in a table in the database whose schema doesn't change. I want to store the words currently being quizzed in a session, so a user can finish where they started in case they move on to a separate page.
In this case, is it possible to get away with storing these words as objects in the database? If so, why? The reason I ask is because the quiz is designed to move quickly, and I'd hate to waste a database call on retrieving a record that never changes in the first place. However, perhaps there are other negatives to a large session that I'm not aware of.
*For the record, I have tried caching it with the built-in memcache methods in Rails 2.3, but apparently that has a maximum size per item of 1MB.
The main reason not to store objects in the session is that if the object structure changes, you will get an exception. Consider the following:
class Foo
attr_accessor :bar
end
class Bar
end
foo = Foo.new
foo.bar = Bar.new
put_in_session(foo)
Then, in a subsequent release of the project, you change Bar's name. You reboot the server, and try to grab foo out of the session. When it tries to deserialize, it fails to find Bar and explodes.
It might seem like it would be easy to avoid this pitfall, but in practice, I've seen it bite a number of people. This is just because serializing an object can sometimes take more along with it than is immediately apparent (this sort of thing is supposed to be transparent) and unless you have rigorous rules about this, things will tend to get flummoxed up.
The reason it's normally frowned upon is that it's extremely common for this to bite people in ActiveRecord, since it's quite common for the structure of your app to shift over time, and sessions can be deserialized a week or longer after they were originally created.
If you understand all that and are willing to put in the energy to be sure that your model does not change and is not serializing anything extra, you're probably fine. But be careful :)
Rails tends to encourage RESTful design, and using sessions isn't very RESTful. I'd probably make a Quiz resource that has a bunch of words, as well as a current_word. This way, when they come back, you'll know where they were.
Now, REST isn't everything (depending on who you talk to), but there's a pretty good case against large sessions. Remember that sessions write things to and from disk, and the more data that you're writing, the longer it takes to read back...
Since your app is a Rails app, I would suggest either:
Using your clients' ability to cache
by caching the cards in javascript.
(you'd need a fairly ajaxy app to
do this, see the latest RailsCast for some interesting points on javascript page caching)
Use one of the many other rails-supported server-side
caching options (i.e. MemCached) to
cache this data.
A much more insidious issue you'll encounter storing objects directly in the session is when you're using CookieStore (the default in Rails 2+ I believe). It's very easy to get CookieOverflow errors which are very hard to recover from.
There is a good chance that we will be tech crunched in the next few days. Unfortunately, we have not gone live yet so we don't have a good estimation of how our system handles a production audience.
Our production setup consists of 2 EngineYard slices each with 3 mongrel instances, using Postgres as the database server.
Obviously a huge portion of how our app will hold up is to do with our actual code and queries etc. However, it would be good to see if there are any tips/pointers on what kind of load to expect or experiences from people who have been through it. Does 6 mongrel instances (possibly 8 if the servers can take it) sound like it will handle the load, or are at least most of it?
I have worked on several rails applications that experienced high load due to viral growth on Facebook.
Your mongrel count should be based on several factors. If your mongrels make API calls or deliver email and must wait for responses, then you should run as many as possible. Otherwise, try to maintain one mongrel per CPU core, with maybe a couple extra left over.
Make sure your server is using a Fair Proxy Balancer (not round robin). Here is the nginx module that does this: http://github.com/gnosek/nginx-upstream-fair/tree/master
And here are some other tips on improving and benchmarking your application performance to handle the load:
ActiveRecord
The most common problem Rails applications face is poor usage of ActiveRecord objects. It can be quite easy to make 100's of queries when only one is necessary. The easiest way to determine if this could be a problem with your application is to set up New Relic. After making a request to each major page on your site, take a look at the newrelic SQL overview. If you see a large number of very similar queries sequentially (select * from posts where id = 1, select * from posts where id = 2, select * from posts...) this may be a sign that you need to use a :include in one of your ActiveRecord calls.
Some other basic ActiveRecord tips (These are just the ones I can think of off the top of my head):
If you're not doing it already, make sure to correctly use indexes on your database tables.
Avoid making database calls in views, especially partials, it can be very easy to lose track of how much you are making database queries in views. Push all queries and calculations into your models or controllers.
Avoid making queries in iterators. Usually this can be done by using an :include.
Avoid having rails build ActiveRecord objects for large datasets as much as possible. When you make a call like Post.find(:all).size, a new class is instantiated for every Post in your database (and it could be a large query too). In this case you would want to use Post.count(:all), which will make a single fast query and return an integer without instantiating any objects.
Associations like User..has_many :objects create both a user.objects and user.object_ids method. The latter skips instantiation of ActiveRecord objects and can be much faster. Especially when dealing with large numbers of objects this is a good way to speed things up.
Learn and use named_scope whenever possible. It will help you keep your code tiny and makes it much easier to have efficient queries.
External APIs & ActionMailer
As much as you can, do not make API calls to external services while handling a request. Your server will stop executing code until a response is received. Not only will this add to load times, but your mongrel will not be able to handle new requests.
If you absolutely must make external calls during a request, you will need to run as many mongrels as possible since you may run into a situation where many of them are waiting for an API response and not doing anything else. (This is a very common problem when building Facebook applications)
The same applies to sending emails in some cases. If you expect many users to sign up in a short period of time, be sure to benchmark the time it takes for ActionMailer to deliver a message. If it's not almost instantaneous then you should consider storing emails in your database an using a separate script to deliver them.
Tools like BackgroundRB have been created to solve this problem.
Caching
Here's a good guide on the different methods of caching in rails.
Benchmarking (Locating performance problems)
If you suspect a method may be slow, try benchmarking it in console. Here's an example:
>> Benchmark.measure { User.find(4).pending_invitations }
=> #<Benchmark::Tms:0x77934b4 #cutime=0.0, #label="", #total=0.0, #stime=0.0, #real=0.00199985504150391, #utime=0.0, #cstime=0.0>
Keep track of methods that are slow in your application. Those are the ones you want to avoid executing frequently. In some cases only the first call will be slow since Rails has a query cache. You can also cache the method yourself using Memoization.
NewRelic will also provide a nice overview of how long methods and SQL calls take to execute.
Good luck!
Look into some load testing software like WEBLoad or if you have money, Quick Test Pro. This will help give you some idea. WEBLoad might be the best test in your situation.
You can generate thousands of virtual nodes hitting your site and you can inspect the performance of your servers from that load.
In my experience having watched some of our customers absorb a crunching, the traffic was fairly modest- not the bone crushing spike people seem to expect. Now, if you get syndicated and make on Yahoo's page or something, things may be different.
Search for the experiences of Facestat.com if you want to read about how they handled it (the Yahoo FP.)
My advise is just be prepared to turn off signups or go to a more static version of your site if your servers get too hot. Using a monitoring/profiling tool is a good idea as well, I like FiveRuns Manage tool for ease of setup.
Since you're using EngineYard, you should be able to allocate more machines to handle the load if necessary
Your big problems will probably not be the number of incoming requests, but will be the amount of data in your database showing you where your queries aren't using the indexes your expecting, or are returning too much data, e.g. The User List page works with 10 users, but dies when you try to show 10,000 users on that one page because you didn't add pagination (will_paginate plugin is almost your friend - watch out for 'select count(*)' queries that are generated for you)
So the two things to watch:
Missing indexes
Too much data per page
For #1, there's a plugin that runs an 'explain ...' query after every query so you can check index usage manually
There is a plugin that can generate data for you for various types of data that may help you fill your database up to test these queries too.
For #2, use will_paginate plugin or some other way to reduce data per page.
We've got basically the same setup as you, 2 prod slices and a staging slice at EY. We found ab to be a great load testing tool - just write a bash script with the urls that you expect to get hit and point it at your slice. Watch NewRelic stats and it should give you some idea of the load your app can handle and where you might need to optimise.
We also found query_reviewer to be very useful as well. It is great for finding those un-indexed tables and n+1 queries.