Validating a legacy table with ActiveRecord - ruby-on-rails

Good day all,
We are doing a data migration from one system to a Rails application. Some of the tables we are working with are very large and moving them over 1 record at a time using ActiveRecord takes far too long. Therefore we resorted to copying the table over in SQL and validating after the fact.
The one-by-one validation check is still slow, but the speed increase from the SQL copy more than makes up for it. However, that hasn't quenched our thirst to see if we can get the validation check to happen more quickly. We attempted to split the table into chunks and pass each chunk to a Thread but it actually executed slower.
The question is: we have a large table and are currently iterating row by row to do the validation, like so:
Model.find_each do |m|
  logger.info "M #{m.id} is not valid" unless m.valid?
end
Anyone have any recommendations on how to speed this up?
Thanks
peer
EDIT: To be clear, we are not asking about this code specifically. We are looking for recommendations on how we can run this concurrently, giving each process a chunk of data, without needing a machine per process.

find_each is using find_in_batches, which fetches 1000 rows at a time by default. You could try playing with the batch_size option. The way you have it above seems pretty optimal; it's fetching from the database in batches and iterating over each one, which you need to do. I would monitor your RAM to see if the batch size is optimal, and you could also try using Ruby 1.9.1 to speed things up if you're currently using 1.8.*.
http://api.rubyonrails.org/classes/ActiveRecord/Batches/ClassMethods.html#M001846
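For example, a sketch of bumping the batch size (5000 is just a number to experiment with while you watch RAM):

# Fetch 5,000 rows per query instead of the default 1,000; larger
# batches mean fewer round trips but more objects in memory at once.
Model.find_each(:batch_size => 5000) do |m|
  logger.info "M #{m.id} is not valid" unless m.valid?
end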

I like zgchurch's response as a starting point.
What I would add is that threading is definitely not going to help here, especially because Ruby uses green threads (at least in 1.8.x), so there is no opportunity to utilize multiple processors anyway. Even if that weren't the case, it's very likely that this operation is I/O-heavy enough that you would get I/O contention eating into any multi-core benefits.
Now if you really want to speed this up, you should take a look at the actual validations and figure out a more efficient way to achieve them. Loading all the rows and instantiating ActiveRecord objects tends to dominate the cost in most validation scenarios. You may be spending 90-99.99% of your time just loading and unloading the data from memory.
In these types of situations I tend to go towards raw SQL. You can do things like validating foreign-key integrity tens of thousands of times faster than raw ActiveRecord validation callbacks can. Of course the viability of this approach depends on the actual ins and outs of your validations. Even if you need something a little richer than SQL to define validity, you could still probably get a 10-100x speed increase just by loading the minimal data with a thinner SQL interface and examining the data directly. If that's the case, Perl or Python might be a better choice for raw performance.
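As a rough illustration, here's what a foreign-key integrity check pushed entirely into SQL might look like (the table and column names "children", "parents", and "parent_id" are placeholders for your schema):

# One SQL pass finds every orphaned row; no AR objects are built.
orphan_ids = ActiveRecord::Base.connection.select_values(
  "SELECT c.id FROM children c
   LEFT JOIN parents p ON p.id = c.parent_id
   WHERE c.parent_id IS NOT NULL AND p.id IS NULL"
)
orphan_ids.each { |id| logger.info "Child #{id} has no parent" }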

Related

Does the use of many rails associations cause significant memory growth?

I have a Rails active job running on a Heroku dyno via Sidekiq.
When the job runs, the memory grows from 150MB to 550MB.
I found out the cause was n+1 queries to the DB, and fixed it by doing my calculations on the data in the DB query instead of in the code.
Afterwards I wanted to refactor a bit, as I generally like to keep the SQL somewhat simple and have the logic in the code. For this reason I switched my "joins" to "includes", which allows me to use the associations of the objects. It turned out, though, that this refactoring reintroduced the memory issue.
So my conclusion is that it is the use of the associations that causes the memory growth, as the number of SQL queries is the same after I fixed the N+1 issue. Please note the job handles a lot of objects, around 250,000. This seems plausible, cf. the delete vs. destroy functionality, where delete performs better because the objects themselves are not instantiated as they are with destroy.
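To illustrate what I mean, here is a simplified sketch of the two approaches (Order and LineItem are made-up models standing in for my real ones):

# joins + aggregate: the database does the math, Rails instantiates
# almost nothing.
totals = Order.joins(:line_items).group("orders.id").sum("line_items.amount")

# includes: every LineItem is loaded and instantiated, so memory grows
# with the dataset (~250,000 objects in my case).
Order.includes(:line_items).find_each do |order|
  total = order.line_items.to_a.sum(&:amount)
end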
Is my conclusion accurate, or am I missing something?
Thanks,
-Louise

Rails app performance decrease due to associations?

My current Rails 3 app has seen a performance decrease as I introduce more associations to the schema.
This performance is not related to page loads, but to a background task that I run using resque. The task parses an external data source to populate the database. It does this through a custom helper method (quite a long one).
I don't expect the task to execute very quickly, because it is parsing a lot of data, but recently after adding a few additional associations, I have seen the execution time for my test method increase from 2 minutes to about 5 minutes. I'm running on a VM, so perhaps that's why it is so slow in general.
Using ruby-prof, it seems that the bulk of the additional computation is spent in methods dealing with the associations I added:
ActiveRecord::Associations::CollectionAssociation#* (where * is any number of methods)
When I have added associations, I have included indexes on the association table. However, other than this, I'm not sure what, if anything, I can do to mitigate the performance hit I'm seeing from adding these associations.
So my questions are:
1) Is it possible that adding associations can cause such a drastic performance decrease in a Rails app?
2) What are the best practices for having a Rails app with many associations that performs well?
EDIT: Some additional info as requested
database: Postgres
the associations I added were both HABTM
it is the populating of these associations that takes up the time. I am creating hundreds, if not thousands of associations during the population process. For bulk inserts, I have been using activerecord-import, which speeds things up significantly, but I am not aware of an equivalent for associations.
I was doing the following (variables changed for simplicity):
# flavors_array is an array of all the Flavor objects I want to associate with my ice_cream
ice_cream.flavors = flavors_array
I did it this way because I know with certainty that there are no preexisting associations for this 'ice_cream' instance that I would be deleting by using '='
However, this method is still very slow
IMPORTANT: When I replace the above with a SQL statement to bulk insert the associations directly into the association table (via IDs), the performance improves dramatically. The operations take almost no time.
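For reference, the direct SQL version looks roughly like this (flavors_ice_creams is the conventional HABTM join-table name; adjust to your schema):

# One multi-row INSERT into the join table instead of one query per
# association; no AR objects are instantiated. Assumes flavors_array
# is non-empty and the ids are integers.
values = flavors_array.map { |f| "(#{ice_cream.id}, #{f.id})" }.join(", ")
ActiveRecord::Base.connection.execute(
  "INSERT INTO flavors_ice_creams (ice_cream_id, flavor_id) VALUES #{values}"
)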
You also need to analyze your queries; there is no magic formula, but evaluate your usage of includes vs. joins. Usually with associations, the data is not loaded unless required.
The Railscast below gives an excellent overview of this:
http://railscasts.com/episodes/181-include-vs-joins
Eager loading can prevent round trips to the DB and may give better performance.

Cucumber BDD for RAILS / ActiveRecord - Speedup slow tests by shimming out the database layer?

My company is using the Cucumber BDD test framework for a moderately large RAILS app.
It's slow. Dog slow. Just the single Cucumber feature I'm working on takes 30-40s to run, and there are dozens and dozens of different test files. It takes an hour plus to run the whole thing.
Looking into things a bit, there's a huge amount of repetitive database work but the total data size is pretty small. For example, I see slow statements like:
Given I have all the discussion threads for "fake user 47"
That takes 1000+ ms. The only thing it does is to take a few dozen rows of stock fixture data from a file on disk and load it into the database one row at a time. It's not a huge quantity of data, a few KiB at most. But once you add up all the overhead of ActiveRecord processing and converting it to a few dozen SQL statements, and then synchronously doing cross-process IPC to mysql to execute said statements ... well, it seems to add up.
MySQL itself is NOT the issue here, and switching to in-memory tables etc. has pretty much zero impact on the run time of the tests. The cucumber process runs at about 70% CPU utilization.
However, hitting the database is still a pretty expensive operation given that it's a cross-process request that has to go through multiple layers of adapters, data drivers and marshaling and whatnot on both sides, not to mention 2+ context switches per request.
So my theory is that if we were just storing data in, eg a simple hash table, then the tests would run a LOT faster. (looking up how to do profiling in Ruby so I can prove this theory).
Is it possible to run our test suite using a shimmed-out, in-process "simulated" database layer? Ideally something super lightweight that doesn't even need to deal with SQL at all, and just implements whatever contract ActiveRecord expects.
Also, we seem to do the same setup multiple times. One test will say "Given I have data X" and spend ages loading data X to do a few checks. Then another test will say "Given I have data Y", clear the db, spend ages loading data Y, and do a few checks. Then a 3rd test says "Given I have data X", clears the db, and spends ages again reloading X. And of course the 4th test probably uses data Y again....
Is there a clean pattern to reuse test initialization code across scenarios? Or to cache a particular initialized state?
This may not be a complete answer to your question, but for the final part of the problem you can use Cucumber's global hooks. Since this is integration testing, I suggest you populate the database according to a real-life scenario: wherever appropriate, populate both data X and data Y once, before all the tests, using those hooks. With careful planning the setup time can be optimized quite a bit.
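A minimal sketch of the idea; the fixture-loading helpers are placeholders, but the mechanism (top-level code in a support file runs once, before any scenario) is standard Cucumber:

# features/support/env.rb (excerpt)
# Code at the top level of a support file is Cucumber's "global hook":
# it runs a single time, so both shared datasets load exactly once.
load_threads_for("fake user 47")   # hypothetical fixture helper
load_data_set("Y")                 # hypothetical fixture helper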

Heroku database performance experience needed?

We are experiencing some serious scaling challenges for our intelligent search engine/aggregator. Our database holds around 200k objects. From profiling and New Relic it seems most of our trouble comes from the database. We are using the smallest dedicated database Heroku provides (Ronin).
We have been looking into indexing and caching. So far we managed to solve our problems by reducing database calls and caching content intelligently, but now even this seems to reach an end. We are constantly asking ourselves if our code/configuration is good enough or if we are simply not using enough "hardware".
We suspect that the database solution we buy from Heroku may be performing insufficiently. For example, just doing a simple count (no joins, nothing else) on the 200k items takes around 250ms. This seems like a long time, even allowing for Postgres's known weakness on counts.
We have also started to use geolocation lookups based on latitude/longitude. Both columns are indexed floats. Doing a distance calculation involves pretty complicated math, but we are using the well-recommended geocoder gem, which is said to run very optimized queries. Even so, geocoder still takes 4-10 seconds to perform a lookup on, say, 40,000 objects, returning only the nearest 10. This again sounds like a long time, and all the experienced people we've consulted say it sounds very odd, again hinting at database performance.
So basically we wonder: What can we expect from the database? Might there be a problem? And what can we expect if we decide to upgrade?
An additional question I have is: I read here that we can improve performance by loading the entire database into memory. Are we supposed to configure this ourselves and if so how?
UPDATE ON THE LAST QUESTION:
I got this from the helpful people at Heroku support:
"What this means is having enough memory (a large enough dedicated
database) to store your hot data set in memory. This isn't something
you have to do manually, Postgres is configured automatically use all
available memory on our dedicated databases.
I took a look at your database and it looks like you're currently
using about 1.25 GB of RAM, so you haven't maxed your memory usage
yet."
UPDATE ON THE NUMBERS AND FIGURES
Okay so now I've had time to look into the numbers and figures, and I'll try to answer the questions below as follows:
First of all, the db consists of around 29 tables with a lot of relations. But in reality most queries are done on a single table (some additional resources are joined in, to provide all needed information for the views).
The table has 130 columns.
Currently it holds around 200k records but only 70k are active - hence all indexes are made as partial-indexes on this "state".
All the columns we search on are indexed correctly, none is of text type, and many are just booleans.
Answers to questions:
1) The baseline performance is hard to pin down; we have so many different selects. The time varies typically from 90ms to 250ms when selecting a limit of 20 rows. We have a LOT of counts on the same table, all varying from 250ms to 800ms.
2) That's hard to say, because Heroku won't commit to a figure.
3) We have around 8-10 users/clients running requests at the same time.
4) Our query load: New Relic's database report says this about the last 24 hours: throughput 9.0 cpm, total time 0.234 s, avg time 25.9 ms.
5) Yes, we have examined the query plans of our long-running queries. The count queries are especially slow, often over 500ms for a pretty simple count over the 70k records, done on indexed columns, with a result around 300.
I've tuned a few Rails apps hosted on Heroku, and also hosted on other platforms, and usually the problems fall into a few basic categories:
Doing too much in Ruby that could be done at the db level (sorting, filtering, joining data, etc.)
Slow queries
Inefficient use of indexes (not enough, or too many)
Trying too hard to do it all in the db (this is not as common in rails, but does happen)
Not optimizing cacheable data
Not effectively using background processing
Right now it's hard to help you because your question doesn't contain any specifics. I think you'll get a better response if you pinpoint the biggest issue you need help with and then ask.
Some info that will help us help you:
What is the average response time of your actions? (from new relic, request-log-analyzer, logs)
What is the slowest request that you want help with?
What are the queries and code in that request?
Is the site's performance different when you run it locally vs. heroku?
In the end I think you'll find that it is not an issue specific to Heroku, and if you had your app deployed on amazon, engineyard, etc you'd have the same performance. The good news is I think that your problems are common, and shouldn't be too hard to fix once you've done some benchmarking and profiling.
-John McCaffrey
We are constantly asking...
...this seems a lot...
...that is suspected...
...What can we expect...
Good news! You can put an end to seeming, suspecting, wondering, and expecting through the magic of measurement!
Seriously though, you've not mentioned any of the basic points you'd need to get a useful answer:
What's the baseline performance of the DB running a sequential scan and single-row index fetches? You say Heroku say your DB fits in RAM, so you shouldn't see disk I/O issues when you measure.
Does this performance match whatever Heroku say it should be?
How many concurrent clients?
What's your query load - what queries and how often?
Have you checked the query plans for any of your suspiciously long-running queries?
Once you've got this sort of information, maybe someone can say something useful. As it stands anything you read here is just guesswork.
First: you should check your Postgres configuration (SHOW ALL from within psql or another client, or just look at postgresql.conf in the data directory). The parameter with the largest impact on performance is effective_cache_size, which should be set to about (total_physical_ram - memory_in_use_by_kernel_and_all_processes). For a 4GB machine, this often is around 3GB (4-1). (This is very coarse tuning, but will give the best results as a first step.)
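You can also inspect the live settings from a Rails console; a quick sketch (SHOW is standard Postgres, the setting names are real, the output values are illustrative):

conn = ActiveRecord::Base.connection
puts conn.select_value("SHOW effective_cache_size")   # e.g. "3GB"
puts conn.select_value("SHOW shared_buffers")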
Second: why do you want all the counts? Better to use a typical query: just ask for what is needed, not what is available. (Reason: there is no possible optimisation for a COUNT(*): either the whole table or a whole index needs to be scanned.)
Third: start gathering and analysing some query plans (for typical queries that perform badly). You can get a query plan by putting EXPLAIN ANALYZE before the actual query. (Another way is to increase the logging level and obtain the plans from the logfile.) A bad query plan can point you at missing statistics or indexes, or even at bad data modelling.
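From a Rails console, grabbing a plan for one of the slow counts could look like this (the table name "items" and the "state" column are stand-ins based on the question):

plan = ActiveRecord::Base.connection.select_all(
  "EXPLAIN ANALYZE SELECT COUNT(*) FROM items WHERE state = 'active'"
)
# Look for Seq Scan nodes, row-estimate mismatches, and total runtime.
plan.each { |row| puts row["QUERY PLAN"] }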
New Relic monitoring can be included as an add-on for Heroku (http://devcenter.heroku.com/articles/newrelic). At the very least this should give you a lot of insight into what is happening behind the scenes, and may help you pinpoint some issues.

How to prepare to be tech crunched

There is a good chance that we will be tech crunched in the next few days. Unfortunately, we have not gone live yet so we don't have a good estimation of how our system handles a production audience.
Our production setup consists of 2 EngineYard slices each with 3 mongrel instances, using Postgres as the database server.
Obviously a huge portion of how our app will hold up has to do with our actual code and queries etc. However, it would be good to get tips/pointers on what kind of load to expect, or experiences from people who have been through it. Do 6 mongrel instances (possibly 8 if the servers can take it) sound like they will handle the load, or at least most of it?
I have worked on several rails applications that experienced high load due to viral growth on Facebook.
Your mongrel count should be based on several factors. If your mongrels make API calls or deliver email and must wait for responses, then you should run as many as possible. Otherwise, try to maintain one mongrel per CPU core, with maybe a couple extra left over.
Make sure your server is using a Fair Proxy Balancer (not round robin). Here is the nginx module that does this: http://github.com/gnosek/nginx-upstream-fair/tree/master
And here are some other tips on improving and benchmarking your application performance to handle the load:
ActiveRecord
The most common problem Rails applications face is poor usage of ActiveRecord objects. It can be quite easy to make hundreds of queries when only one is necessary. The easiest way to determine if this could be a problem with your application is to set up New Relic. After making a request to each major page on your site, take a look at the New Relic SQL overview. If you see a large number of very similar sequential queries (select * from posts where id = 1, select * from posts where id = 2, select * from posts...), that may be a sign you need an :include in one of your ActiveRecord calls.
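A sketch in the Rails 2-era syntax this answer assumes (Post and its author association are just examples):

# N+1: one query for the posts, then one query per author.
posts = Post.find(:all)
posts.each { |post| puts post.author.name }

# Fixed: :include eager-loads all the authors in one extra query.
posts = Post.find(:all, :include => :author)
posts.each { |post| puts post.author.name }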
Some other basic ActiveRecord tips (These are just the ones I can think of off the top of my head):
If you're not doing it already, make sure to correctly use indexes on your database tables.
Avoid making database calls in views, especially partials; it can be very easy to lose track of how many database queries you are making in views. Push all queries and calculations into your models or controllers.
Avoid making queries in iterators. Usually this can be done by using an :include.
Avoid having Rails build ActiveRecord objects for large datasets as much as possible. When you make a call like Post.find(:all).size, a new object is instantiated for every Post in your database (and it could be a large query too). In this case you would want to use Post.count(:all), which will make a single fast query and return an integer without instantiating any objects.
Associations like has_many :objects create both a user.objects and a user.object_ids method. The latter skips instantiation of ActiveRecord objects and can be much faster. Especially when dealing with large numbers of objects this is a good way to speed things up (see the sketch after this list).
Learn and use named_scope whenever possible. It will help you keep your code tiny and makes it much easier to have efficient queries.
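A small sketch of those last two tips (the models are examples):

user.object_ids   # => [1, 5, 9, ...]  no AR objects instantiated
user.objects      # => full ActiveRecord objects, much heavier

# named_scope (Rails 2.x) keeps common conditions composable:
class Post < ActiveRecord::Base
  named_scope :published, :conditions => { :published => true }
end
Post.published.count   # one efficient query, no objects built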
External APIs & ActionMailer
As much as you can, do not make API calls to external services while handling a request. Your server will stop executing code until a response is received. Not only will this add to load times, but your mongrel will not be able to handle new requests.
If you absolutely must make external calls during a request, you will need to run as many mongrels as possible since you may run into a situation where many of them are waiting for an API response and not doing anything else. (This is a very common problem when building Facebook applications)
The same applies to sending emails in some cases. If you expect many users to sign up in a short period of time, be sure to benchmark the time it takes for ActionMailer to deliver a message. If it's not almost instantaneous, then you should consider storing emails in your database and using a separate script to deliver them.
Tools like BackgroundRB have been created to solve this problem.
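A rough sketch of the store-then-deliver idea (QueuedEmail and the Notifier mailer method are hypothetical names, not a real library API):

# During the request: just persist the message, which is fast.
QueuedEmail.create!(:recipient => user.email, :subject => "Welcome")

# In a separate cron/daemon script: deliver outside the request cycle.
QueuedEmail.find(:all, :limit => 100).each do |email|
  Notifier.deliver_queued_message(email)  # hypothetical Rails 2 mailer
  email.destroy
end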
Caching
Here's a good guide on the different methods of caching in rails.
Benchmarking (Locating performance problems)
If you suspect a method may be slow, try benchmarking it in console. Here's an example:
>> Benchmark.measure { User.find(4).pending_invitations }
=> #<Benchmark::Tms:0x77934b4 #cutime=0.0, #label="", #total=0.0, #stime=0.0, #real=0.00199985504150391, #utime=0.0, #cstime=0.0>
Keep track of methods that are slow in your application. Those are the ones you want to avoid executing frequently. In some cases only the first call will be slow since Rails has a query cache. You can also cache the method yourself using Memoization.
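In its simplest form, memoization can be hand-rolled like this (Invitation and its columns are made up for illustration):

class User < ActiveRecord::Base
  def pending_invitations
    # Only the first call hits the database; later calls on this
    # object reuse the cached result.
    @pending_invitations ||= Invitation.find(:all,
      :conditions => { :recipient_id => id, :accepted_at => nil })
  end
end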
NewRelic will also provide a nice overview of how long methods and SQL calls take to execute.
Good luck!
Look into some load-testing software like WebLOAD or, if you have money, QuickTest Pro. This will help give you some idea. WebLOAD might be the best fit in your situation.
You can generate thousands of virtual nodes hitting your site and you can inspect the performance of your servers from that load.
In my experience, having watched some of our customers absorb a crunching, the traffic was fairly modest - not the bone-crushing spike people seem to expect. Now, if you get syndicated and make it onto Yahoo's front page or something, things may be different.
Search for the experiences of Facestat.com if you want to read about how they handled it (the Yahoo front page).
My advice is to be prepared to turn off signups or switch to a more static version of your site if your servers get too hot. Using a monitoring/profiling tool is a good idea as well; I like the FiveRuns Manage tool for ease of setup.
Since you're using EngineYard, you should be able to allocate more machines to handle the load if necessary.
Your big problems will probably not be the number of incoming requests, but the amount of data in your database, which will show you where your queries aren't using the indexes you're expecting or are returning too much data. E.g., the user list page works with 10 users, but dies when you try to show 10,000 users on that one page because you didn't add pagination (the will_paginate plugin is almost your friend - watch out for the 'select count(*)' queries that are generated for you).
So the two things to watch:
Missing indexes
Too much data per page
For #1, there's a plugin that runs an 'explain ...' query after every query, so you can check index usage manually.
There is also a plugin that can generate test data of various types, which may help you fill your database up enough to exercise these queries.
For #2, use the will_paginate plugin or some other way to reduce the data per page.
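Typical will_paginate usage is just (a sketch):

# Controller: fetch a single page of users rather than all 10,000.
@users = User.paginate(:page => params[:page], :per_page => 50)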
We've got basically the same setup as you: 2 prod slices and a staging slice at EY. We found ab (ApacheBench) to be a great load-testing tool - just write a bash script with the URLs that you expect to get hit and point it at your slice. Watch the New Relic stats and it should give you some idea of the load your app can handle and where you might need to optimise.
We also found query_reviewer to be very useful. It is great for finding those un-indexed tables and n+1 queries.
