With performance improvements in mind, I was wondering whether indexes are helpful on a join table (specifically in a Rails 3 has_and_belongs_to_many context), and if so, which ones.
Model and Table Setup
My models are Foo and Bar and, per Rails convention, I have a join table called bars_foos. There is no primary key or timestamps, making the only fields in this table bar_id:integer and foo_id:integer. I'm interested in knowing which of the following index setups is best and avoids duplication:
1. A compound index: add_index :bars_foos, [:bar_id, :foo_id]
2. Two indexes:
   A. add_index :bars_foos, :bar_id
   B. add_index :bars_foos, :foo_id
3. A combination of both 1 and 2-B
Basically, I'm not sure the compound index alone is enough, assuming an index is helpful to begin with. I believe a compound index can also serve as a single index on its first column, which is why I am pretty sure that using all three lines would result in unnecessary duplication.
Likely Usage
The most common usage will be: given an instance of model Foo, asking for its associated bars using the Rails syntax foo.bars, and vice versa, bar.foos for an instance of model Bar.
These will generate queries of the type SELECT * FROM bars_foos WHERE foo_id = ? and SELECT * FROM bars_foos WHERE bar_id = ? respectively and then using those resultant IDs to SELECT * FROM bars WHERE ID in (?) and SELECT * FROM foos WHERE ID in (?).
Please correct me in the comments if I am incorrect, but I do not believe that, in the context of the Rails application, it is ever going to try to do a query where it specifies both IDs like SELECT * FROM bars_foos where bar_id = ? AND foo_id = ?.
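For reference, here is a minimal sketch of the models described above, assuming the conventional HABTM setup from the question:

class Foo < ActiveRecord::Base
  has_and_belongs_to_many :bars  # reads from the bars_foos join table
end

class Bar < ActiveRecord::Base
  has_and_belongs_to_many :foos
end

With these in place, foo.bars and bar.foos generate the join-table lookups described above.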
Databases
In the event there are database specific optimization techniques, I will most likely be using PostgreSQL. However, others using this code may want to use it in MySQL or SQLite depending on their Rails configuration so all answers are appreciated.
The Answer
The oft-repeated answer, which tends to be the case more often than not, is: "it depends." More specifically, it depends on what your data is and how it will be used.
tl;dr Explanation
The short answer for my specific case (and to cover all future bases) is choice #2, which is what I suspected. However, choice #3 would also work just fine: depending on my usage of the data, the extra time and space spent creating the compound index could pay for itself in faster lookups later.
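In migration form, choice #2 is just the two single-column add_index calls from the question (the migration class name here is hypothetical):

class AddIndexesToBarsFoos < ActiveRecord::Migration
  def change
    add_index :bars_foos, :bar_id  # speeds up bar.foos lookups
    add_index :bars_foos, :foo_id  # speeds up foo.bars lookups
  end
end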
The Full Explanation
The reason for this is that databases try to be smart and to do things as fast as possible regardless of programmer input. The most basic question to consider when adding an index is: will this object be looked up by this key? If yes, an index can potentially help speed that up. However, whether the index is even used comes down to the selectivity and cardinality of the field.
Since foreign keys are typically the IDs of another AR class, cardinality will usually be high. But again, this depends on your data. In my example, if there are many Foos but few Bars, many of the entries in my join table will have similar bar_ids. With bar_id having a low cardinality, an index on bar_id may never be used and may be getting in the way by having the database devote time and resources* to adding to this index every time a new bars_foos entry is created. The same goes for many Bars and few Foos, and for few of both.
The general lesson is that when considering an index on a table, decide if the entries will be both looked up by this field and if this field has a high cardinality. That is, does this field have many distinct values? In the case of most join tables "it depends" and we must think more carefully about what the data represents and the relationships themselves. In my case, I will have both many Foos and Bars and will be looking up Foos by their associated bars and vice versa.
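If you want a rough feel for a field's cardinality before deciding, a quick check with raw SQL through ActiveRecord might look like this (a sketch; the table and column names are from the example above):

# Compare distinct bar_id values to the total number of join rows.
distinct_ids = ActiveRecord::Base.connection.select_value(
  "SELECT COUNT(DISTINCT bar_id) FROM bars_foos").to_i
total_rows = ActiveRecord::Base.connection.select_value(
  "SELECT COUNT(*) FROM bars_foos").to_i
puts "bar_id: #{distinct_ids} distinct values across #{total_rows} rows"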
Another good answer I got at the office was, "why are you worrying about your indexes? Build your app!"
Footnotes
* In a similar question on indexes on STI it was pointed out that the cost of an index is very low so when in doubt, just add it.
Depends on how you are going to query the data.
Assuming you want to search for all of these...
WHERE bar_id = ?
WHERE foo_id = ?
WHERE bar_id = ? AND foo_id = ?
...then you should probably go with an index on {bar_id, foo_id} and an index on {foo_id}.
While you could also create a third index on {bar_id}, the price of maintaining an additional index would probably outweigh the benefit of better clustering in the smaller index.
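In migration terms, that recommendation would be something like this sketch, using the question's table and column names:

add_index :bars_foos, [:bar_id, :foo_id]  # serves bar_id and (bar_id, foo_id) lookups
add_index :bars_foos, :foo_id             # serves foo_id-only lookups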
Also, how do you plan to cover your queries with indexes? Some of the alternatives, such as...
{foo_id, bar_id} and {bar_id}
{foo_id, bar_id} and {bar_id, foo_id}
...might cover certain kinds of queries better.
Covering is a balancing act - sometimes adding a field to an index just for covering purposes is justified, sometimes it's not. You won't know until you measure on realistic amounts of data.
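As a sketch of the covering idea: with a {foo_id, bar_id} index, a query that filters on foo_id and needs only bar_id can be answered from the index alone (an index-only scan in PostgreSQL), without touching the table:

# Hypothetical covering index for the join table.
add_index :bars_foos, [:foo_id, :bar_id]
# SELECT bar_id FROM bars_foos WHERE foo_id = ?
# can now be satisfied entirely from the index.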
(Disclaimer: I'm not familiar with Ruby. This answer is purely from the database perspective.)
Related
I would like to know: if I create a unique index on two columns in PostgreSQL, does that index also serve as a normal index on both columns, or do I have to create one unique index plus two more indexes for the individual columns, as shown in the code? I want to create a unique index on talent_id and job_id, and both columns should also be indexed separately. I have read many resources but haven't found a clear answer.
add_index :talent_actions, [:talent_id, :job_id], unique: true
Does the above code also handle the indexing below, or do I have to add it separately?
add_index :talent_actions, :talent_id
add_index :talent_actions, :job_id
Thank you.
An index is an object in the database, which can be used to look up data faster, if the query planner decides it will be appropriate. So the trivial answer to your question is "no", creating one index will not result in the same structures in the database as creating three different indexes.
I think what you actually want to know is this:
Do I need all three indexes, or will the unique index already optimise all queries?
This, as with any database optimisation, depends on the queries you run, and the data you have.
Here are some considerations:
The order of columns in a multi-column index matters. If you have an index of people sorted by surname then first name, then you can use it to search for everybody with the same surname; but you probably can't use it to search for somebody when you only know their first name.
Data distribution matters. If everyone in your list has the surname "Smith" or "Jones", then you can use a surname-first index to search for a first name fairly easily (just look up the first name under Jones, then under Smith).
Index size matters. The fewer columns an index has, the more of it fits in memory at once, so the faster it will be to use.
Often, there are multiple indexes the query planner could use, and its job is to estimate the cost of the above factors for the query you've written.
Usually, it doesn't hurt to create multiple indexes which you think might help, but it does use up disk space, and occasionally can cause the query planner to pick a worse plan. So the best approach is always to populate a database with some real data, and look at the query plans for some real queries.
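Applied to the question above: the unique index's leading column already serves lookups by talent_id alone, so as a sketch, only job_id is likely to need its own index:

add_index :talent_actions, [:talent_id, :job_id], unique: true
# The leading column covers WHERE talent_id = ? on its own,
# so a separate talent_id index is usually redundant.
add_index :talent_actions, :job_id
# Still needed for WHERE job_id = ? lookups by themselves.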
I have an app that consists mainly of restaurant model instances. One of the essential attributes for these restaurants is the cuisine each falls under. I'm currently at odds with myself over how to design this. On one hand, I thought of creating a Cuisine model and setting up either a has_many :through or HABTM association between Restaurants and Cuisines.
More recently I came across this post, which shows how to create a pre-defined set of attributes. To take the answer one step further, I'm assuming (in my case) I'd add a string-based cuisine column to my restaurant model and set up a select box in my restaurant form that would save the selected value.
What I was wondering is: what would be the most efficient way of doing this? The goal is to eventually be able to query restaurants based on what cuisine(s) they fall under. I wasn't sure if a model would be the best choice, since it would essentially serve as little more than a join table with a name attribute, and I wasn't sure if having this extra table for something so minute would be optimal.
On the other hand, I didn't know if using YAML for this would be a good fit, since the values would essentially be dummy strings with no tangible records on file, as I'd have with model instances. Can someone help me sort out this confusion?
There are many benefits of normalizing many-to-many relationships in the db. Here are some:
Searching, sorting, and creating indexes is faster, since tables are narrower, and more rows fit on a data page.
You can have more clustered indexes (one per table), so you get more flexibility in tuning queries.
Index searching is often faster, since indexes tend to be narrower and shorter.
More tables allow better use of segments to control physical placement of data.
You usually have fewer indexes per table, so data modification commands are faster.
Fewer null values and less redundant data, making your database more compact.
Triggers execute more quickly if you are not maintaining redundant data.
Data modification anomalies are reduced.
Normalization is conceptually cleaner and easier to maintain and change as your needs change.
Also, by normalizing you get the cleaner syntax and other infrastructure benefits from ActiveRecord, e.g.
cuisine.restaurants.where(city: 'Toledo')
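For completeness, a minimal sketch of the normalized setup behind that line, assuming the HABTM option from the question (the migration class name is hypothetical):

class Restaurant < ActiveRecord::Base
  has_and_belongs_to_many :cuisines
end

class Cuisine < ActiveRecord::Base
  has_and_belongs_to_many :restaurants
end

class CreateCuisinesRestaurants < ActiveRecord::Migration
  def change
    create_table :cuisines_restaurants, :id => false do |t|
      t.integer :cuisine_id
      t.integer :restaurant_id
    end
    # Index choices follow the earlier discussion of join-table indexes.
    add_index :cuisines_restaurants, [:cuisine_id, :restaurant_id]
    add_index :cuisines_restaurants, :restaurant_id
  end
end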
In my Rails application I have an index view that lists all of my projects.
This list can be sorted by clicking on any of the table column headers, e.g. Date, Name, updated_at etc. This happens by appending a &sort= GET parameter to the URL.
My question is: From a performance point-of-view, would it be advisable to add indexes to these columns in my database?
This is what a migration might look like:
class AddMoreIndexes < ActiveRecord::Migration
  def change
    add_index :projects, :date
    add_index :projects, :name
    add_index :projects, :updated_at
  end
end
Will I get any performance gains from this?
Indexes can be used to speed up an ORDER BY, but if you are also identifying a subset of rows to display, then an index that helps with that filter is likely to be chosen in preference. You'd need composite indexes in such a situation.
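For example, if the list were filtered to one user's projects and then sorted by name, a composite index could serve both the filter and the sort (a sketch; user_id is hypothetical, since the question's table may not have such a column):

# Filter column first, sort column second.
add_index :projects, [:user_id, :name]
# Project.where(:user_id => some_id).order(:name) can then walk
# the index in order, filtering and sorting in one pass.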
There are a couple of other problems.
Firstly, ordering on an indexed string value may require a linguistically sorted index, not the regular ASCII/Binary sort, so multilingual applications may not be helped at all.
Secondly, it can discourage normalisation of the database because you really need the display values to be in the table you're selecting.
You might like to look at using another method for the sort. I've been very happy with using Google Visualization tables, which come with jQuery sorting built in.
Depending on how you query your database, yes, indexes will give you performance gains. For example, whenever I add a foreign key to a table, I immediately index it. Why? I know queries will be running through it in my application; if not, I wouldn't have added the foreign key. Especially once you accumulate a large amount of data in your database, this will definitely give performance gains (sometimes by an incredible amount). If you plan to query your database by date, name, or updated_at, then yes, indexes on those columns could be a performance gain, depending on your queries. Otherwise, there really is no point.
Note, you wouldn't want to add an index for every column. Having necessary indices will help you, but if you have an index for every column, then you run the risk of confusing the SQL Query Optimizer and actually hindering your performance.
My suggestion: Add an index for every foreign key you have in your table, but if you're also running some heavy queries with other columns, then add an index there too.
In Rails, why are join queries considered bad when using ActiveRecord?
For example
Here I'm trying to find the number of companies that belong to a certain category.
class Company < ActiveRecord::Base
  has_one :company_profile
end
Finding the number of companies for a particular category_id:
number_of_companies = Company.find(:all, :joins => :company_profile, :conditions => ["company_profiles.category_id = ? AND is_published = ?", c_id, true])
How could this be better or is it just poor design?
company_profiles = CompanyProfile.find_all_by_category_id(c_id)
companies = []
company_profiles.each{|c_profile| companies.push(c_profile.company) }
Isn't it better that the first approach creates a single query, while I'd be running several queries in the second case?
Could someone explain why joins are considered to be bad practice in Rails?
Thanks in advance
To my knowledge, there is no such rule. The rule is to hit the database as little as possible, and Rails gives you the right tools for that, using joins.
The example Sam gives above is a case in point: simple code, but behind the scenes Rails has to do two queries, instead of only one using a join.
If there is one rule that comes to mind that I think is related, it is to avoid raw SQL where possible and use the Rails way as much as possible. This keeps your code database-agnostic (as Rails handles the differences for you). But sometimes even that is unavoidable.
It comes down to good database design, creating the correct indexes (which you need to define manually in migrations), and sometimes big nested structures/joins are needed.
Join queries are not bad, in fact, they are good, and ActiveRecord has them at its very heart. You don't need to break into find_by_sql to use them, options like :include will handle it for you. You can stay within the ORM, which gives the readability and ease of use, whilst still, for the most part, creating very efficient SQL (providing you have your indexes right!)
Bottom line - you need to do the bare minimum of database operations. Joins are a good way of letting the database do the heavy lifting for you, and lowering the number of queries that you execute.
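For instance, since the goal in the question is a count, the database can do the counting as well, still in a single query (Rails 2/3-style finder syntax, matching the question; is_published is assumed to live on companies):

number_of_companies = Company.count(
  :joins      => :company_profile,
  :conditions => ["company_profiles.category_id = ? AND is_published = ?",
                  c_id, true])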
By the by, DataMapper and Arel (the query engine in Rails 3) feature a lot of lazy loading - this means that code such as:
@category = Category.find(params[:id])
@category.companies.size
Would most likely result in a join query that only did a COUNT operation, as the first line wouldn't result in a query being sent to the db.
If you just want to find the number of companies in a category, all you need to do is find the category and then call the association and its size, since the association returns a collection.
@category = Category.find(params[:id])
@category.companies.size
In my Rails application, I have a variety of database tables that contain user data. Some of these tables have a lot of rows (as many as 500,000 rows per user in some cases) and are queried frequently. Whenever I query any table for anything, the user_id of the current user is somewhere in the query - either directly, if the table has a direct relation with the user, or through a join, if they are related through some other tables.
Should I denormalize the user_id and include it in every table, for faster performance?
Here's one example:
Address belongs to user, and has a user_id
Envelope belongs to user, and has a user_id
AddressesEnvelopes joins an Address and an Envelope, so it has envelope_id and address_id -- it doesn't have user_id, but could get to it through either the envelope or the address (which must belong to the same user).
One common expensive query is to select all the AddressesEnvelopes for a particular user, which I could accomplish by joining with either Address or Envelope, even though I don't need anything from those tables. Or I could just duplicate the user id in this table.
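To make that trade-off concrete, a sketch of the two variants, assuming an AddressesEnvelopes model (with belongs_to :address) exists for the join table (Rails 3 query syntax):

# Normalized: reach user_id through a join with addresses.
AddressesEnvelopes.joins(:address).where(:addresses => { :user_id => user_id })

# Denormalized: user_id duplicated onto the join table itself.
AddressesEnvelopes.where(:user_id => user_id)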
Here's a different scenario:
Letter belongs to user, and has a user_id
Recepient belongs to Letter, and has a letter_id
RecepientOption belongs to Recepient, and has a recepient_id
Would it make sense to duplicate the user_id in both Recepient and RecepientOption, even though I could always get to it by going up through the associations, through Letter?
Some notes:
There are never any objects that are shared between users. An entire hierarchy of related objects always belongs to the same user.
The user owner of objects never changes.
Database performance is important because it's a data intensive application. There are many queries and many tables.
So should I include user_id in every table so I can use it when creating indexes? Or would that be bad design?
I'd like to point out that it isn't necessary to denormalize if you are willing to work with composite primary keys. A sample for the AddressesEnvelopes case:
user(
#user_id
)
address(
#user_id
, #address_num
)
envelope(
#user_id
, #envelope_num
)
address_envelope(
#user_id
, #address_num
, #envelope_num
)
(the # indicates a primary key column)
I am not a fan of this design if I can avoid it, but considering the fact that you say all these objects are tied to a user, this type of design would make it relatively simple to partition your data (either logically, putting ranges of users in separate tables, or physically, using multiple databases or even machines).
Another thing that would make sense with this type of design is using clustered indexes (in MySQL, InnoDB tables are clustered on their primary key). If you ensure the user_id is always the first column in your index, it will ensure that for each table, all data for one user is stored close together on disk. This is great when you always query by user_id, but it can hurt performance if you query by another object (in which case duplication like you suggested may be a better solution).
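As a sketch of that clustering idea using the thread's own tables (the index itself is hypothetical):

# user_id leads, so each user's join rows sort together in the index.
add_index :addresses_envelopes, [:user_id, :address_id, :envelope_id],
          :unique => true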
At any rate, before you change the design, first make sure your schema is already optimized, and you have proper indexes on your foreign key columns. If performance really is paramount, you should simply try several solutions and do benchmarks.
As long as you
a) get a measurable performance improvement
and
b) know which parts of your database are real normalized data and which are redundant improvements
there is no reason not to do it!
Do you actually have a measured performance problem? 500,000 rows isn't a very large table. Your selects should be reasonably fast if they are not very complex and you have proper indexes on your columns.
I would first see if there are slow queries and try to optimize them with indexes. Only if that is not enough would I look into denormalization.
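In Rails 3.2 and later you can inspect a query plan straight from a relation; on older versions, EXPLAIN can be run as raw SQL. A quick sketch, reusing the hypothetical AddressesEnvelopes model from above:

# Look at the plan for the expensive join query before denormalizing.
puts AddressesEnvelopes.joins(:address).where(:addresses => { :user_id => user_id }).explain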
Denormalizations that you suggest seem reasonable if you can't achieve the required performance with other means. Just make sure that you keep denormalized fields up-to-date.