Can a default_scope that orders records by something other than ID significantly slow down a Rails application?
For example, I have a Rails (currently 3.1) app using PostgreSQL where nearly every Model has a default_scope ordering records by their name:
default_scope order('users.name')
Because the default_scopes order records by name rather than by ID, I am worried I might be incurring a significant performance penalty when ordinary queries run. For example with:
User.find(5563401)
or
User.where('created_at = ?', 2.weeks.ago)
or
User.some_scope_sorted_best_by_id.all
In the above examples, what performance penalty might I incur by having a default_scope by name on my Model? Should I be concerned about this default_scope affecting application performance?
Your question is missing the point. The default scope itself is just a few microseconds of Ruby execution to cause an order by clause to be added to every SQL statement sent to PostgreSQL.
So your question is really asking about the performance difference between unordered queries and ordered ones.
The PostgreSQL documentation is pretty explicit. Ordered queries on unindexed fields are much slower than unordered ones because (no surprise) PostgreSQL must sort the results before returning them, first creating a temporary table or index to hold the result. This can easily be a factor of 4 in query time, possibly much more.
If you introduce an index just to achieve quick ordering, you are still paying to maintain that index on every insert and update. And unless it's the primary index, sorted access still involves random seeks, which may actually be slower than creating a temporary table. This is also discussed in the Postgres docs.
In a nutshell, NEVER add an order clause to an SQL query that doesn't need it (unless you enjoy waiting for your database).
NB: I doubt a simple find() will have an ORDER BY attached, because it must return exactly one result. You can verify this very quickly by starting rails console, issuing a find, and watching the generated SQL scroll by. However, the where and all queries definitely will be ordered, and consequently definitely slower than they need to be.
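If you do keep the default_scope, you can still strip the ordering from hot queries. A minimal sketch, assuming the User model from the question (Rails 3-era API); check what's actually sent with to_sql:

# Bypass the default_scope entirely (no ORDER BY users.name):
User.unscoped.where('created_at > ?', 2.weeks.ago)

# Or keep other defaults but drop or replace just the ordering:
User.where('created_at > ?', 2.weeks.ago).except(:order)
User.where('created_at > ?', 2.weeks.ago).reorder('users.id')

# Verify the SQL that will be sent to PostgreSQL:
puts User.where('created_at > ?', 2.weeks.ago).to_sql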
I'm building my first Rails app and have it working great with Thinking Sphinx. I understand most of it, but would love it if someone could help me clarify a few conceptual questions:
When displaying search results after a sphinx query, should I be using the sphinx_attributes that are returned from the sphinx query? Or should my view use normal Rails objects, such as @property.title, @property.amenities.title, etc.? If I use normal Rails objects, doesn't that mean it's doing extra queries?
In a forum, I'd like to display 'unread posts'. Obviously this is true/false for each user/topic combination, so I'm thinking I should be caching the 'reader' ids within the topic's sphinx index. That way I can quickly query for all unread posts for a given user_id. I've got this working, but then realised it's pointless: there is a time delay between sphinx indexings, so if a user clicks on an unread post, it will still appear unread until the sphinx DB is re-indexed.
I'm still in development, so I'm manually indexing/rebuilding, but in production, what is a standard interval between re-indexings?
I have a model with several text fields - should I concat these all into one column in the sphinx index for a keyword search? Surely this is quicker than indexing all the separate fields.
Slightly off-topic, but just wondering - when you access nested models, for example @property.agents.name, does this affect performance? Or does Rails automatically fetch all associated entries when a property is pulled from the database?
To answer each of your points:
For both of your examples, sphinx_attributes would not be helpful. Firstly, you've already loaded the property, so the title is available directly without an extra database hit. And for property.amenities.title you're dealing with an array of strings, which Sphinx has no concept of. Generally, I would only use sphinx_attributes for complicated calculated attributes, not standard column references.
Yes, you're right, there will be a delay with this value.
It depends on how often your data changes. I have some apps where I can index every day because changes are so rare, but others where we'll run it every 10 minutes. If the data is particularly volatile, I'll look at using deltas (usually via Sidekiq) to have changes reflected in Sphinx in a few seconds.
I don't think it makes much difference either way - unless you want to search on any of those columns separately? If so, it'll need to be a separate field.
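To illustrate that choice, a sketch of both layouts in a define_index block (Thinking Sphinx v1/v2 syntax; the title, description, and directions columns are made up):

class Property < ActiveRecord::Base
  define_index do
    # Separate fields: each can be targeted individually via :conditions
    indexes title
    indexes description

    # Or fold several columns into one combined field for plain keyword search
    indexes [title, description, directions], :as => :content
  end
end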
By default, as you use each property's agents, the agents for that property will be loaded from the database (one SQL call per property). You could look at the eager loading docs for how to manage this better when you're dealing with multiple records. Thinking Sphinx has the ability to pass through :include options to the underlying ActiveRecord call.
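For the last point, a small sketch of passing :include through a search so the agents come back eager-loaded (using the question's model names; the search term is arbitrary):

# Two-ish queries instead of one per property:
properties = Property.search 'riverside', :include => :agents
properties.each do |property|
  puts property.agents.map(&:name).join(', ')
end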
I've been reading / playing around with the idea of using Redis to complement my ActiveRecord models, in particular as a way of modeling relationships. Also watched a few screencasts like this one: http://www.youtube.com/watch?v=dH6VYRMRQFw
It seems like a good idea in cases where you want to fetch one object at a time, but it seems like the approach breaks down when you need to show a list of objects along with each of their associations (e.g. in a View or in a JSON response in the case of an API).
Whereas with pure ActiveRecord you can use includes and eager loading to avoid running N extra queries, I can't see how to do the same when depending purely on Redis to model relationships.
For instance, suppose you have the following (taken from the very helpful redis_on_rails project):
class Conference < ActiveRecord::Base
  def attendees
    # Attendee.find(rdb[:attendee_ids])
    Attendee.find_all_by_id(rdb[:attendee_ids].smembers)
  end

  def register(attendee)
    Redis.current.multi do
      rdb[:attendee_ids].sadd(attendee.id)
      attendee.rdb[:conference_ids].sadd id
    end
  end

  def unregister(attendee)
    Redis.current.multi do
      rdb[:attendee_ids].srem(attendee.id)
      attendee.rdb[:conference_ids].srem id
    end
  end
end
If I did something like
conferences = Conference.first(20)
conferences.each do |c|
  c.attendees.each { |a| puts a.name }
end
I'm simply getting the first 20 conferences, fetching the attendees of each, and printing their names, but you can imagine a case where I am rendering the conferences along with their attendee lists in a view. Either way, I would be running into the classic N+1 query problem.
If I had modeled the relationship in SQL with has_many, I could have used includes to avoid the same problem.
Ideas, links, questions welcome.
Redis can provide major benefits to your application's infrastructure, but I've found that, due to the specific operations you can perform on the various data types, you really need to put some thought ahead of time into how you're going to access your data. In this example, if you are very often iterating over a bunch of conferences and outputting the attendees, and are not otherwise benefiting from Redis' ability to do rich set operations (such as intersections, unions, etc.), maybe it's not a good fit for that data model.
On the other hand, if you are benefiting from Redis in performance-intensive parts of your application, it may be worth eating the occasional N+1 GET on Redis in order to reap those benefits. You have to do profiling on the parts of the app that you care about to see if the tradeoffs are worth it.
You may also be able to structure your data in Redis/your application in such a way that you can avoid the N+1 GETs; for example, if you can get all the keys up front, you can use MGET to get all the keys at once, which is a fast O(N) operation, or use pipelining to avoid network latency for multiple lookups.
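As a concrete illustration with the Conference/Attendee example above, a sketch that assumes each conference's set lives under a key like "conference:<id>:attendee_ids" (the key format is a guess at what rdb[:attendee_ids] resolves to):

conferences = Conference.first(20)

# One network round trip for all 20 SMEMBERS calls instead of 20 separate ones:
id_sets = Redis.current.pipelined do
  conferences.each do |c|
    Redis.current.smembers("conference:#{c.id}:attendee_ids")
  end
end

# One SQL query for every attendee involved, instead of one per conference
# (assumes every id cached in Redis still exists in SQL):
attendees = Attendee.where(:id => id_sets.flatten.uniq).index_by(&:id)

conferences.zip(id_sets).each do |conference, ids|
  ids.each { |id| puts attendees[id.to_i].name }
end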
In an application I work on, we've built a caching layer that caches the foreign key IDs for has_many relationships so that we can do fast lookups on cached versions of a large set of models that have complex relationships with each other; while fetching these by SQL, we generate very large, relatively slow SQL queries, but by using Redis and the cached foreign keys, we can do a few MGETs without hitting the database at all. However, we only arrived at that solution by investigating where our bottlenecks were and discussing how we might avoid them.
After realizing that an application suffers from the N+1 problem because of the ORM, I would like more information about the improvements that can be made, along with statistics comparing the times before the improvements (with the N+1 problem) and after. What is the time difference before and after such improvements? Can anyone give me a link to a paper that analyzes the problem and presents statistics on it?
You really don't need statistical data for this, just math. N+1 (or better, 1+N) stands for
1 query to get a record, and
N queries to get all records associated with it
The bigger N is, the bigger the performance hit becomes, particularly if your queries are sent across the network to a remote database. That's why N+1 problems keep cropping up in production - they're usually insignificant in development mode with little data in the DB, but as your data grows in production to thousands or millions of rows, your queries will slowly choke your server.
You can instead use
a single query (via a join) or
2 queries (one for the primary record, one for all the associated records)
The first query will return more data than strictly needed (the data of the primary record is duplicated in each row), but that's usually a good tradeoff to make. The second query might get a bit cumbersome for large data sets, since all the foreign keys are passed in a single IN clause, but again, it's usually a tradeoff worth making.
The actual numbers depend on too many variables for statistics to be meaningful: number of records, DB version, hardware, etc.
Since you tagged this question with rails: ActiveRecord does a good job of avoiding N+1 queries if you know how to use it. Check out the explanation of eager loading.
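A minimal sketch of the difference in ActiveRecord terms (Post and Author are illustrative names; Rails 3-era syntax):

# 1 + N queries: one for the posts, then one per post for its author
Post.limit(50).each { |post| puts post.author.name }

# 2 queries: one for the posts, one IN query for all their authors
Post.includes(:author).limit(50).each { |post| puts post.author.name }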
The time difference would depend on how many additional selects were performed because of the N+1 problem. Here's a quote from an answer to another Stack Overflow question regarding N+1:

SELECT * FROM Cars;
/* for each car */
SELECT * FROM Wheel WHERE CarId = ?

"In other words, you have one select for the Cars, and then N additional selects, where N is the total number of cars."
In the example above, the time difference would depend on how many car records were in the database and how long it took to query the Wheel table each time the code/ORM fetched a new record. If you only had 2 car records, the difference after removing the N+1 problem would be negligible, but with a million car records it would have a significant effect.
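Since the honest answer to "what is the time difference" is "measure it on your own schema", here is a hedged sketch using Ruby's Benchmark, assuming a Car model with has_many :wheels as in the quoted example:

require 'benchmark'

Benchmark.bm(10) do |x|
  # N+1: one query for the cars, then one per car for its wheels
  x.report('n+1')   { Car.all.each { |c| c.wheels.to_a } }
  # Eager loaded: two queries in total
  x.report('eager') { Car.includes(:wheels).each { |c| c.wheels.to_a } }
end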
In Ruby on Rails, say the Actor model object is Tom Hanks, and its has_many :fans association holds 20,000 Fan objects; then
actor.fans
gives an Array with 20,000 elements. Presumably the elements are not pre-populated with values? Otherwise, fetching all 20,000 Fan objects from the DB would be extremely time consuming.
So it is on a "need to know" basis?
So does it pull data when I access actor.fans[500], and again when I access actor.fans[0]? If it jumps from record to record, it can't optimize performance with sequential reads, which can be faster on a hard disk because the records may sit in nearby sectors. For example, if the program touches 2 random elements, it is faster to read just those 2 records; but if it touches all the elements in random order, it may be faster to read all the records sequentially and then process them in the random order. How would RoR know whether I am touching only a few random elements or all of them?
Why would you want to fetch 20,000 records if you only use 2 of them? Then fetch only those two from the DB. If you want to list the fans, you will probably use pagination - i.e. use limit and offset in your query, or a pagination gem like will_paginate.
I see no logical explanation for why you would go the way you're trying to. Explain a real situation so we can help you.
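A sketch of what that pagination looks like on the association (page size and parameter names are illustrative; assumes a controller context for params):

page     = (params[:page] || 1).to_i
per_page = 50

# Only ever load one page of fans, never all 20,000:
fans = actor.fans.limit(per_page).offset((page - 1) * per_page)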
However, there is one thing you need to know while loading many associated objects from the DB - use :include, like
Actor.all(:include => :fans)
this will eager-load all the fans, so there will be only 2 queries instead of N+1, where N is the number of actors.
Look at the SQL which is spewed out by the server in development mode, and that will tell you how many fan records are being loaded. In this case actor.fans will indeed cause them all to be loaded, which is probably not what you want.
You have several options:
Use a paginator as suggested by Tadas;
Set up another association with the fans table that pulls in just the ones you're interested in. This can be done either with a :conditions option on the has_many declaration, e.g.
has_many :fans, :conditions => "country of residence = 'UK'"
Specifying the full SQL to narrow down the rows returned, with the :finder_sql option; or
Specifying the :limit option, which will, well, limit the number of rows returned.
All depends on what you want to do.
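A sketch combining a couple of those options in Rails 2/3-era syntax (:uk_fans and the country_of_residence column are made-up names):

class Actor < ActiveRecord::Base
  has_many :fans

  # A narrower association that only pulls the rows you care about:
  has_many :uk_fans, :class_name  => 'Fan',
                     :conditions  => { :country_of_residence => 'UK' },
                     :order       => 'created_at DESC',
                     :limit       => 100
end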
I've developed a web-based point of sale system for one of my clients in Ruby on Rails with a MySQL backend. These guys are growing so fast that they are ringing up close to 10,000 transactions per day corporate-wide. For this question, I will use the transactions table as an example. Currently, I store transactions.status as a string (ie: 'pending', 'completed', 'incomplete') within a varchar(255) field that has an index. In the beginning, looking up records by different statuses was fine, as I didn't have to worry about so many records. Over time, using the query analyzer, I have noticed that performance has worsened and that varchar fields can really slow down your query speed over thousands of lookups. I've been thinking about converting these varchar fields to integer-based status fields utilizing a STATUS constant within the Transaction model, like so:
class Transaction < ActiveRecord::Base
  # Map status names to the integers stored in the status column
  STATUS = { :incomplete => 0, :pending => 1, :completed => 2 }

  def self.expensive_query_by_status(status)
    find(:all,
         :select     => "id, cashier, total, status",
         :conditions => { :status => STATUS[status.to_sym] })
  end
end
Is this the best route for me to take? What do you guys suggest? I am already using proper indexes on various lookup fields, and memcached for query caching wherever possible. The app currently runs on a distributed environment of 3 servers, where the 1st is for the application, the 2nd for the DB, and the 3rd for caching (all in 1 datacenter and on the same VLAN).
Have you tried the alternative on a representative database? From the example given, I'm a little sceptical that it's going to make much difference, you see. If there are only three statuses, a query by status may be better off not using an index at all.
Say "completed" comprises 80% of your table - with no other indexed column involved, you're going to be requiring more reads if the index is used than not. So a query of that type is almost certainly going to get slower as the table grows. "incomplete" and "pending" queries would probably still benefit from an index, however; they'd only be affected as the total number of rows with those statuses grew.
How often do you look at everything, complete and otherwise, without some more selective criterion? Could you partition the table in some (internal or external) way? For example, store completed transactions in a separate table, moving new ones there as they reach their final (?) state. I think internal database partitioning was introduced in MySQL 5.1 - looking at the documentation it seems that a RANGE partition might be appropriate.
All that said, I do think there's probably some benefit to moving away from storing statuses as strings. Storage and bandwidth considerations aside, you're a lot less likely to inadvertently misspell an integer or, better yet, a constant or symbol.
You might want to start limiting your searches (if you're not doing that already); find(:all) is pretty taxing at that scale. Also, think about what your Transaction model reaches out for as it gets rendered into your views, and perhaps eager-load those associations to minimize extra requests to the DB.
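For instance, a hedged sketch of a bounded, eager-loading lookup in the same old-style find(:all) API (the :line_items association is made up):

Transaction.find(:all,
                 :conditions => { :status => Transaction::STATUS[:pending] },
                 :include    => :line_items,  # eager-load what the views touch
                 :limit      => 100)          # never pull the whole table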