How can I postpone database updates in Rails? - ruby-on-rails

I'm building something akin to Google Analytics and currently I'm doing real time database updates. Here's the workflow for my app:
1. User makes a RESTful API request.
2. I find a record in the database and return it as JSON.
3. I record the request counter for the user in the database (i.e. if a user makes 2 API calls, I increment the request counter for the user by 2).
Steps 1 and 2 are really fast in SQL - they are SELECTs. Step 3 is really slow, because it's an UPDATE. In the real world, my database (MySQL) is NOT scaling. According to New Relic, step 3 is taking most of the time - up to 70%!
My thinking is that I need to stop doing synchronous DB operations. In the short term, I'm trying to reduce DB writes, so I'm thinking about a global hash (say declared in environment.rb) that is accessible from my controllers and models that I can write to in lieu of writing to the DB. Every so often I can have a task write the updates that need to be written to the DB.
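Roughly, I'm picturing something like this (just a sketch, not real code yet - the RequestCounterBuffer name, the mutex, and the StatisticsApi model are illustrative):
# Sketch only: process-local counter buffer, flushed periodically by a task.
require 'thread'

module RequestCounterBuffer
  @counts = Hash.new(0)
  @mutex  = Mutex.new

  # Called from controllers instead of issuing the UPDATE right away
  def self.increment(stat_id, by = 1)
    @mutex.synchronize { @counts[stat_id] += by }
  end

  # Called every few minutes from a background task
  def self.flush!
    pending = @mutex.synchronize do
      current, @counts = @counts, Hash.new(0)
      current
    end
    pending.each do |stat_id, count|
      StatisticsApi.update_all(["count_request = COALESCE(count_request, 0) + ?", count],
                               ["id = ?", stat_id])
    end
  end
end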
Questions:
Does this sound reasonable? Any gotchas?
Will I run into any concurrency problems?
How does this compare with writing logs to the file system and importing later?
Should I be using some message queuing system instead, like Starling? Any recommendations?
PS: Here's the offending query -- all columns of interest are indexed:
UPDATE statistics_api SET count_request = COALESCE(count_request, ?) + ? WHERE (id = ?)

Your hash solution sounds like it's a bit too complex. This set of slides is an insightful and up-to-date resource that addresses your issue head on:
http://www.slideshare.net/mattmatt/the-current-state-of-asynchronous-processing-with-ruby
They say the simplest thing would be:
Thread.new do
  MyModel.do_long_thing
end
But the Ruby mysql driver is blocking, so a mysql request in that thread could still block your request. You could use mysqlplus as a driver and get non-blocking requests, but now we're getting a pretty complex and specialized solution.
If you really just want this out of your request cycle, but can spare locking the server for it, you can do something like:
class MyController < ApplicationController
  after_filter :do_jobs

  def index
    # Defer the slow work until after the response has been rendered
    @job = Proc.new { MyModel.do_long_thing }
  end

  private

  def do_jobs
    return unless @job
    @job.call
  end
end
I'd abstract it into ApplicationController more, but you get the idea. The proc defers updates until after the request.
If you are serious about asynchronous and background processes, you'll need to look at the various options out there and make a decision about what fits your needs. Matt Grande recommended DelayedJob - that's a very popular pick right now, but if your entire server is bogged down with database writes, I would not suggest it. If this is just a particularly slow update, but your server is not overloaded, then maybe it's a good solution.
I currently use Workling with Starling in my most complex project. Workling has been pretty extensible, but Starling has been a little less than ideal. One of Workling's advantages is the ability to swap backends, so we can move off Starling if it becomes a large problem.
If your server is bogged with writes, you will need to look at scaling it up regardless of your asynchronous task approach.
Good luck! It sounds like your app is growing at an exciting pace :-)

I just asked a similar question over on the EventMachine mailing list, and it was suggested that I try Phat (http://www.mikeperham.com/2010/04/03/introducing-phat-an-asynchronous-rails-app/) to get asynchronous database access.
Maybe you should try it out.

Do it later with DelayedJob.
Edit: If your DB is being hit so much that one UPDATE is noticeably slowing down your requests, maybe you should consider setting up a master-slave database architecture.
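For reference, here is a minimal sketch of pushing that counter UPDATE onto DelayedJob using its classic Struct-based job API (UpdateRequestCounterJob and StatisticsApi are illustrative names, not the asker's actual models):
# Hypothetical delayed_job job: the worker runs the UPDATE, not the request cycle
class UpdateRequestCounterJob < Struct.new(:stat_id, :by)
  def perform
    # One atomic UPDATE per queued job
    StatisticsApi.update_all(["count_request = COALESCE(count_request, 0) + ?", by],
                             ["id = ?", stat_id])
  end
end

# In the controller action, instead of doing the UPDATE inline:
Delayed::Job.enqueue UpdateRequestCounterJob.new(record.id, 1)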

Related

rails 3 & activerecord: do I need to take special record locking precautions to update a counter field?

Each time a user's profile gets displayed we update an integer counter in the Users table.
I started thinking about high-concurrency situations and wondered what happens if a bunch of people hit a user's profile page at the exact same time: does rails/activerecord magically handle the record locking and semaphore stuff for me?
Or does my app need to explicitly call some sort of mechanism to avoid missing update events when concurrent calls are made to my update method?
def profile_was_hit
  self.update_attributes :hitcount => self.hitcount + 1
end
And along those lines, when should I use something like User.increment_counter(:hitcount, self.id) instead?
In the default configuration, a single Rails instance is only handling a single request at a time, so you don't have to worry about any concurrency trouble on the application layer.
If you have multiple instances of your application running (which you probably do/will), they will all make requests to the database without talking to one another about it. This is why this is handled at the database layer. MySQL, PostgreSQL, etc. are all going to lock the row on write.
The way you are handling this situation isn't ideal for performance though, because your application is reading the value, incrementing it, and then writing it. That gap between the read and the write lets concurrent requests miss updates. You can avoid this by pushing the increment responsibility down to your database (UPDATE users SET hitcount = hitcount + 1 WHERE id = ?). I believe ActiveRecord has support for this built in; I'll/you'll have to go dig around for it though. Update: Oh, duh, yes you want to use the increment_counter method to do this. Reference/Documentation.
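A minimal sketch of the difference, assuming a User model with a hitcount column:
# Read-modify-write: two statements; a concurrent request can slip in between them
user = User.find(params[:id])
user.update_attributes(:hitcount => user.hitcount + 1)

# Atomic increment pushed down to the database; no lost updates.
# Generates: UPDATE users SET hitcount = hitcount + 1 WHERE id = ...
User.increment_counter(:hitcount, user.id)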
Once you update your process to push incrementing responsibility to the database, I wouldn't worry about performance for a while. I once had a PHP app do this once per request and it scaled gloriously for me to 100+ updates/second (mysqld on the same machine, no persistent connections, and always < 10% CPU).

Using resque for implementing a command pattern

I am working on a multi-user tree editing app. It uses the resque gem for background processes. To avoid runtime multi-user conflicts I want to use the command pattern and store user actions in a Resque queue, so that if someone is deleting a branch, other users cannot edit children of that branch.
It works, but picking up a job from the queue for the first time is quite slow, because the Resque worker polls for jobs at a 5-second interval. That slows down the editing interface significantly. It is possible to do something like this:
cmd = MyCommand.create!(:attr1 => 'foo', :attr2 => 'bar')
Resque.enqueue(MyCommand, cmd.id)
workers = Resque.workers.select { |w| w.queues.include?('my_queue') }
raise "Should be only one queue for commands!" if workers.size != 1
not_done = true
while not_done
  not_done = workers[0].process
end
It does what I need, but I wonder if there is a more elegant way of doing this. Also, :process is a deprecated method for Worker instances.
I think your design approach is somewhat sound, but Redis/Resque may not be appropriate. What you want is a super fast in-memory queue that's similar to Resque, but that does not come with a polling delay.
I am pretty sure you can use MemCached for this, but there may be other options. Any solution where your queued commands have to be pulled at a certain interval would probably not provide acceptable performance for collaborative editing, unless it's OK to poll maybe every 100ms or even more often.
Finally, if you are placing every action on a single queue which can only process commands serially (one at a time), you are inevitably going to end up in a situation where the queue backs up because commands are coming in faster than they can be processed. This is why a more scalable solution may be to use versioning, where each element of the tree is versioned and, when it is updated/changed, all child elements are updated with a new version too. That way, an edit against an older version number is rejected.
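Rails' built-in optimistic locking is one way to get that versioning behaviour. A minimal sketch, assuming the tree nodes table has an integer lock_version column (the Node model name is illustrative):
# With a lock_version column present, ActiveRecord applies optimistic locking automatically
node = Node.find(params[:id])
begin
  node.update_attributes!(:title => params[:title])
rescue ActiveRecord::StaleObjectError
  # Someone else saved a newer version of this node first; reject the edit
  # and have the client reload the branch.
end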
Anyway.. good luck, sounds like a non-trivial problem to solve.

Storing Objects in a Session in Rails

I have always been taught that storing objects in a session was a bad idea. Instead IDs should be stored that retrieve the record when needed.
However, I have an application that I wonder is an exception to this rule. I'm building a flashcard application, and the words being quizzed are in a table in the database whose schema doesn't change. I want to store the words currently being quizzed in a session, so a user can finish where they started in case they move on to a separate page.
In this case, is it possible to get away with storing these words as objects in the session? If so, why? The reason I ask is because the quiz is designed to move quickly, and I'd hate to waste a database call on retrieving a record that never changes in the first place. However, perhaps there are other negatives to a large session that I'm not aware of.
*For the record, I have tried caching it with the built-in memcache methods in Rails 2.3, but apparently that has a maximum size per item of 1MB.
The main reason not to store objects in the session is that if the object structure changes, you will get an exception. Consider the following:
class Foo
  attr_accessor :bar
end

class Bar
end

foo = Foo.new
foo.bar = Bar.new
put_in_session(foo)
Then, in a subsequent release of the project, you change Bar's name. You reboot the server, and try to grab foo out of the session. When it tries to deserialize, it fails to find Bar and explodes.
It might seem like it would be easy to avoid this pitfall, but in practice, I've seen it bite a number of people. This is because serializing an object can sometimes take more along with it than is immediately apparent (this sort of thing is supposed to be transparent), and unless you have rigorous rules about this, things will tend to get gummed up.
The reason it's normally frowned upon is that it's extremely common for this to bite people in ActiveRecord, since it's quite common for the structure of your app to shift over time, and sessions can be deserialized a week or longer after they were originally created.
If you understand all that and are willing to put in the energy to be sure that your model does not change and is not serializing anything extra, you're probably fine. But be careful :)
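If you do decide to play it safe, a minimal sketch of the usual id-based approach (assuming a Word model for the flashcards):
# Store only the ids of the words being quizzed, plus the current position
session[:quiz_word_ids] = words.map(&:id)
session[:quiz_position] = 0

# On a later request, reload the (unchanging) records from their ids
words = Word.find(session[:quiz_word_ids])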
Rails tends to encourage RESTful design, and using sessions isn't very RESTful. I'd probably make a Quiz resource that has a bunch of words, as well as a current_word. This way, when they come back, you'll know where they were.
Now, REST isn't everything (depending on who you talk to), but there's a pretty good case against large sessions. Remember that session data gets written to and read back from storage on every request, and the more data you store, the longer that takes...
Since your app is a Rails app, I would suggest either:
Using your clients' ability to cache, by caching the cards in JavaScript (you'd need a fairly Ajax-y app to do this; see the latest RailsCast for some interesting points on JavaScript page caching).
Using one of the many other Rails-supported server-side caching options (i.e. MemCached) to cache this data.
A much more insidious issue you'll encounter storing objects directly in the session is when you're using CookieStore (the default in Rails 2+ I believe). It's very easy to get CookieOverflow errors which are very hard to recover from.

How to prepare to be tech crunched

There is a good chance that we will be tech crunched in the next few days. Unfortunately, we have not gone live yet so we don't have a good estimation of how our system handles a production audience.
Our production setup consists of 2 EngineYard slices each with 3 mongrel instances, using Postgres as the database server.
Obviously a huge portion of how our app will hold up has to do with our actual code and queries etc. However, it would be good to get some tips/pointers on what kind of load to expect, or experiences from people who have been through it. Do 6 mongrel instances (possibly 8 if the servers can take it) sound like they will handle the load, or at least most of it?
I have worked on several rails applications that experienced high load due to viral growth on Facebook.
Your mongrel count should be based on several factors. If your mongrels make API calls or deliver email and must wait for responses, then you should run as many as possible. Otherwise, try to maintain one mongrel per CPU core, with maybe a couple extra left over.
Make sure your server is using a Fair Proxy Balancer (not round robin). Here is the nginx module that does this: http://github.com/gnosek/nginx-upstream-fair/tree/master
And here are some other tips on improving and benchmarking your application performance to handle the load:
ActiveRecord
The most common problem Rails applications face is poor usage of ActiveRecord objects. It can be quite easy to make 100's of queries when only one is necessary. The easiest way to determine if this could be a problem with your application is to set up New Relic. After making a request to each major page on your site, take a look at the newrelic SQL overview. If you see a large number of very similar queries sequentially (select * from posts where id = 1, select * from posts where id = 2, select * from posts...) this may be a sign that you need to use a :include in one of your ActiveRecord calls.
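For example (model names are illustrative), the classic N+1 pattern and its eager-loading fix:
# N+1: one query for the posts, then one more query per post for its author
posts = Post.find(:all)
posts.each { |post| puts post.author.name }

# Eager loading with :include fetches the authors up front
posts = Post.find(:all, :include => :author)
posts.each { |post| puts post.author.name }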
Some other basic ActiveRecord tips (These are just the ones I can think of off the top of my head):
If you're not doing it already, make sure to correctly use indexes on your database tables.
Avoid making database calls in views, especially partials; it can be very easy to lose track of how many database queries you are making in views. Push all queries and calculations into your models or controllers.
Avoid making queries in iterators. Usually this can be done by using an :include.
Avoid having Rails build ActiveRecord objects for large datasets as much as possible. When you make a call like Post.find(:all).size, a new object is instantiated for every Post in your database (and it could be a large query too). In this case you would want to use Post.count(:all), which will make a single fast query and return an integer without instantiating any objects.
An association like has_many :objects on User creates both a user.objects and a user.object_ids method. The latter skips instantiation of ActiveRecord objects and can be much faster. Especially when dealing with large numbers of objects, this is a good way to speed things up.
Learn and use named_scope whenever possible. It will help you keep your code tiny and makes it much easier to have efficient queries.
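A few of these tips in code form (Rails 2-era syntax, illustrative model names):
# Counting without instantiating every record
Post.count                            # single COUNT(*) query, returns an integer
Post.find(:all).size                  # loads and instantiates every Post first

# Fetching ids without building ActiveRecord objects
user.object_ids                       # comes free with has_many :objects

# named_scope keeps query logic in the model and composes into one query
class Post < ActiveRecord::Base
  named_scope :published, :conditions => { :published => true }
  named_scope :recent,    :order => 'created_at DESC', :limit => 10
end

Post.published.recent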
External APIs & ActionMailer
As much as you can, do not make API calls to external services while handling a request. Your server will stop executing code until a response is received. Not only will this add to load times, but your mongrel will not be able to handle new requests.
If you absolutely must make external calls during a request, you will need to run as many mongrels as possible since you may run into a situation where many of them are waiting for an API response and not doing anything else. (This is a very common problem when building Facebook applications)
The same applies to sending emails in some cases. If you expect many users to sign up in a short period of time, be sure to benchmark the time it takes for ActionMailer to deliver a message. If it's not almost instantaneous then you should consider storing emails in your database and using a separate script to deliver them.
Tools like BackgroundRB have been created to solve this problem.
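A minimal sketch of the store-then-deliver idea (the QueuedEmail model and UserMailer are hypothetical names for illustration):
# During the request: a fast INSERT instead of a slow SMTP round trip
QueuedEmail.create!(:recipient => user.email, :template => 'welcome')

# Later, from a cron-driven rake task or script/runner: actually deliver
QueuedEmail.find(:all, :limit => 100).each do |email|
  UserMailer.deliver_welcome(email.recipient)   # Rails 2-style deliver_* call
  email.destroy
end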
Caching
Here's a good guide on the different methods of caching in rails.
Benchmarking (Locating performance problems)
If you suspect a method may be slow, try benchmarking it in console. Here's an example:
>> Benchmark.measure { User.find(4).pending_invitations }
=> #<Benchmark::Tms:0x77934b4 @cutime=0.0, @label="", @total=0.0, @stime=0.0, @real=0.00199985504150391, @utime=0.0, @cstime=0.0>
Keep track of methods that are slow in your application. Those are the ones you want to avoid executing frequently. In some cases only the first call will be slow since Rails has a query cache. You can also cache the method yourself using Memoization.
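For example, ActiveSupport's memoization (Rails 2.2+) caches the result per object after the first call; the pending_invitations body here is only illustrative:
class User < ActiveRecord::Base
  extend ActiveSupport::Memoizable

  def pending_invitations
    Invitation.find(:all, :conditions => { :user_id => id, :accepted => false })
  end
  memoize :pending_invitations   # later calls return the cached result for this object
end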
NewRelic will also provide a nice overview of how long methods and SQL calls take to execute.
Good luck!
Look into some load testing software like WEBLoad or, if you have money, Quick Test Pro. This will help give you some idea. WEBLoad might be the best test in your situation.
You can generate thousands of virtual nodes hitting your site and you can inspect the performance of your servers from that load.
In my experience, having watched some of our customers absorb a crunching, the traffic was fairly modest - not the bone-crushing spike people seem to expect. Now, if you get syndicated and make it onto Yahoo's front page or something, things may be different.
Search for the experiences of Facestat.com if you want to read about how they handled it (the Yahoo FP.)
My advice is just to be prepared to turn off signups or go to a more static version of your site if your servers get too hot. Using a monitoring/profiling tool is a good idea as well; I like the FiveRuns Manage tool for ease of setup.
Since you're using EngineYard, you should be able to allocate more machines to handle the load if necessary.
Your big problem will probably not be the number of incoming requests, but the amount of data in your database, which will show you where your queries aren't using the indexes you're expecting, or are returning too much data. For example, the User List page works with 10 users, but dies when you try to show 10,000 users on that one page because you didn't add pagination (the will_paginate plugin is almost your friend - watch out for the 'select count(*)' queries that are generated for you).
So the two things to watch:
Missing indexes
Too much data per page
For #1, there's a plugin that runs an 'explain ...' query after every query so you can check index usage manually
There is also a plugin that can generate test data of various types, which may help you fill your database up enough to exercise these queries.
For #2, use will_paginate plugin or some other way to reduce data per page.
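A minimal will_paginate usage sketch (model name and page size are illustrative):
# Controller: fetch one page of users instead of the whole table
@users = User.paginate(:page => params[:page], :per_page => 30)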
We've got basically the same setup as you, 2 prod slices and a staging slice at EY. We found ab to be a great load testing tool - just write a bash script with the urls that you expect to get hit and point it at your slice. Watch NewRelic stats and it should give you some idea of the load your app can handle and where you might need to optimise.
We also found query_reviewer to be very useful as well. It is great for finding those un-indexed tables and n+1 queries.

Application Context in Rails

Rails comes with a handy session hash into which we can cram stuff to our heart's content. I would, however, like something like ASP's application context, which instead of sharing data only within a single session, will share it with all sessions in the same application. I'm writing a simple dashboard app, and would like to pull data every 5 minutes, rather than every 5 minutes for each session.
I could, of course, store the cache update times in a database, but so far haven't needed to set up a database for this app, and would love to avoid that dependency if possible.
So, is there any way to get (or simulate) this sort of thing? If there's no way to do it without a database, is there any kind of "fake" database engine that comes with Rails, runs in memory, but doesn't bother persisting data between restarts?
Right answer: memcached. Fast, clean, supports multiple processes, integrates very cleanly with Rails these days. Not even that bad to set up, but it is one more thing to keep running.
90% Answer: There are probably multiple Rails processes running around - one for each Mongrel you have, for example. Depending on the specifics of your caching needs, it's quite possible that having one cache per Mongrel isn't the worst thing in the world. For example, suppose you were caching the results of a long-running query which
gets fresh data every 8 hours
is used every page load, 20,000 times a day
needs to be accessed in 4 processes (Mongrels)
then you can drop those 20,000 queries down to 12 with about a single line of code
@@arbitrary_name ||= Model.find_by_stupidly_long_query(param)
The double at-mark, Ruby syntax you might not be familiar with, denotes a class variable - here it effectively acts as a process-wide global. ||= is the commonly used Ruby idiom to execute the assignment if and only if the variable is currently nil or otherwise evaluates to false. It will stay good until you explicitly empty it OR until the process stops, for any reason - server restart, explicitly killed, what have you.
And after you go down from 20k calculations a day to 12 in about 15 seconds (OK, two minutes -- you need to wrap it in a trivial if block which stores the cache update time in a different global), you might find that there is no need to spend additional engineering assets on getting it down to 4 a day.
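A sketch of that trivial if block, assuming a 5-minute refresh window (the controller name is illustrative; Model.find_by_stupidly_long_query(param) is the same placeholder as above):
class DashboardController < ApplicationController
  @@expensive_data       = nil
  @@expensive_fetched_at = nil

  def index
    # Refresh at most once per 5 minutes, per process
    if @@expensive_data.nil? || @@expensive_fetched_at < 5.minutes.ago
      @@expensive_data       = Model.find_by_stupidly_long_query(param)
      @@expensive_fetched_at = Time.now
    end
    @data = @@expensive_data
  end
end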
I actually use this in one of my production sites, for caching a few expensive queries which literally only need to be evaluated once in the life of the process (i.e. they change only at deployment time -- I suppose I could precalculate the results and write them to disk or DB but why do that when SQL can do the work for me).
You don't get any magic expiry syntax, reliability is pretty slim, and it can't be shared across processes - but it's 90% of what you need in a line of code.
You should have a look at memcached: http://wiki.rubyonrails.org/rails/pages/MemCached
There is a helpful Railscast on Rails 2.1 caching. It is very useful if you plan on using memcached with Rails.
Using the stock Rails cache is roughly equivalent to this.
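For example, with the Rails 2.1+ cache store API (the key name is illustrative; :expires_in is honoured by stores such as mem_cache_store):
# Compute once, then serve from the cache store for 5 minutes
stats = Rails.cache.fetch('dashboard_stats', :expires_in => 5.minutes) do
  Model.find_by_stupidly_long_query(param)
end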
@p3t0r is right, MemCached is probably the best option, but you could also use the SQLite database that comes with Rails. That won't work over multiple machines though, where MemCached will. Also, SQLite will persist to disk, though I think you can set it up not to if you want. Rails itself has no application-scoped storage since it's run as one process per request handler, so it has no shared memory space like ASP.NET or a Java server would.
So what you are asking is quite impossible in Rails because of the way it is designed. What you ask is a shared object and Rails is strictly single threaded. Memcached or similar tool for sharing data between distributed processes is the only way to go.
The Rails.cache freezes the objects it stores. This kind of makes sense for a cache but NOT for an application context. I guess instead of doing a roundtrip to the moon to accomplish that simple task, all you have to do is create a constant inside config/environment.rb
APP_CONTEXT = Hash.new
Pretty simple, huh?
