In my database, I have a model with a field whose value should be selected from a list of options. As an example, consider a model which needs to store a measurement, such as 5ft or 13cm or 12.24m3. The obvious way to achieve this is to have a decimal field and then some other field to store the unit of measurement.
So what is the best way to store the unit of measurement? I've used a couple of approaches in the past:
1) Storing the various options in another DB table (and associated model), and linking the two with a standard foreign key (and usually eager loading the associated model). This seems like overkill, as you are forcing the DB to perform a join on every query.
2) Storing the options as a constant Hash, loaded in one of the initializers, where the key into the Hash is stored in the unit of measurement field. This way, you effectively do the join in Ruby (which may or may not be a performance increase), but you lose the ability to query from the "unit of measurement" side. This wouldn't be a problem provided it's unlikely you'd need to do queries like "find me all measurements with units of cm".
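For illustration, option 2 might look something like this (a minimal sketch; the file, constant, and model names are mine, not from a real app):

# config/initializers/units.rb -- hypothetical initializer
UNITS = {
  "cm" => { :name => "centimetres",  :dimension => :length },
  "ft" => { :name => "feet",         :dimension => :length },
  "m3" => { :name => "cubic metres", :dimension => :volume }
}.freeze

# app/models/measurement.rb
class Measurement < ActiveRecord::Base
  validates_inclusion_of :unit, :in => UNITS.keys

  # the "join" happens in Ruby, not in the database
  def unit_info
    UNITS[unit]
  end
end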
Neither of these feels particularly elegant to me... can anyone suggest something better?
Have you seen constant_cache? It's sort of the combination of the best of 1 and 2 - lookup data is stored in the DB, but it's exposed as class constants on the lookup model and only loaded at application start, so you don't suffer the join penalties constantly. The following example comes from the README:
migration:
create_table :account_statuses do |t|
  t.string :name, :description
end
AccountStatus.create!(:name => 'Active', :description => 'Active user account')
AccountStatus.create!(:name => 'Pending', :description => 'Pending user account')
AccountStatus.create!(:name => 'Disabled', :description => 'Disabled user account')
model:
class AccountStatus < ActiveRecord::Base
  caches_constants
end
using it:
Account.new(:username => 'preagan', :status => AccountStatus::PENDING)
I would go with option one. How large will the UnitOfMeasurement table be? And, if you're using an integer primary key, why worry so much about speed?
Option 1 is the way to go for design reasons. Just declare it with an integer (even smallint) primary key and a field for the unit description.
Has ActiveRecord gotten support for natural keys, yet? If it has, you can just make the name (or whatever) column of the UnitOfMeasure table the PK, that way the value of the FK column has all the info you need, and you still have a fully normalized DB with a canonical set of UnitOfMeasurement values.
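For what it's worth, a sketch of the natural-key approach in plain ActiveRecord (table and model names are mine):

# migration: the unit name itself is the key, no surrogate id
create_table :unit_of_measures, :id => false do |t|
  t.string :name, :null => false
end
add_index :unit_of_measures, :name, :unique => true

# model
class UnitOfMeasure < ActiveRecord::Base
  self.primary_key = :name  # set_primary_key :name in older Rails
end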
Do you need to perform lookups on these values? If not, you could just as well store them as a string and parse the string later on in the application that reads the values. While you risk storing unparseable data, you gain speed and reduce DB complexity. Sometimes normalizing a database is not helpful. In the end /something/ within your system needs to know that "cm" is a length measure and "m3" is a volume measure, and that comparing "3cm" to "1m3" doesn't make any sense anyway. So you might as well put all that knowledge in code.
If you are only going to display that data anyway, what is normalizing good for here?
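For example, a rough sketch of the parse-it-in-code idea (the regex and names are illustrative only):

class Measurement
  UNIT_PATTERN = /\A(\d+(?:\.\d+)?)\s*([a-z0-9]+)\z/i

  # Measurement.parse("12.24m3") # => { :value => 12.24, :unit => "m3" }
  def self.parse(raw)
    if (m = UNIT_PATTERN.match(raw.to_s))
      { :value => m[1].to_f, :unit => m[2].downcase }
    end
  end
end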
Related
I'm building a Rails app that will have a very high number of models using single-table inheritance. For the model subclasses, I want to be able to set constant values for things like name and description. I know I can set defaults using attribute, like this:
class SpecialWidget < Widget
  attribute :name, :string, default: "Special Widget"
  attribute :description, :text, default: "This is an awfully important widget."
end
The advantage here, as I understand it, is that by storing the defaults in the database, I retain the ability to do things like use #order to sort by name and paginate. But it seems bad to store constants in the database like that. It seems better to use constant methods, like this:
class SpecialWidget < Widget
  def name
    "Special Widget"
  end

  def description
    "This is an awfully important widget."
  end
end
In fact, that's what I was doing originally, but then I read posts like these (one, two, three), which pointed out that if I wanted to do nice things like sort by those methods, I'd have to load the entire Widget.all into memory and then do a plain-old Ruby sort.
My application is built quite heavily around these STI models, and I will definitely have to sort by constants like name. Are the concerns about sorting and pagination significant disadvantages that will cause me to come to regret using methods in the future, or will the difference be negligible? What other disadvantages/problems might I have? I'd really like to be able to use methods instead of storing constants in the database, if possible without crippling my app's performance.
There are many benefits and few downsides to storing the default values in the database. But if it troubles you, you can have similar sorting efficiency by constructing your sort like this:
class SpecialWidget < Widget
  DefaultAttrs = { name: 'Special Widget', description: 'This is... etc' }
end

class Widget < ApplicationRecord
  def self.sort_by_name
    types = pluck(:type).uniq
    case_statements = types.map { |type| "WHEN '#{type}' THEN '#{type.constantize.const_get(:DefaultAttrs)[:name]}'" }
    case_sql = "CASE type #{case_statements.join(' ')} END"
    order(case_sql)
  end
end
... not very elegant, but it does the job!
Maybe it's better to put the constants in the database!
It depends entirely on the shape of your data and how you want to use it. You haven't provided enough contextual specifics to guarantee that my recommendation applies to your situation, but it's a recommendation that's specifically designed to work for 95+% of all situations.
Just Put the Data in the Relational Database
The database is the store for everything in your domain that is dynamic and needs to be persisted, i.e. state. It should be internally consistent, meaningfully self-descriptive, and well-structured in order to fully leverage the power of a relational DB to flexibly manipulate and represent complex inter-related data.
Based on what you've said, and assuming that there are a bunch of different "widget types" implemented using Rails' STI implementation with a type column, I would model Widget and SpecialWidget in the database like this:
widgets
id | type
-------------------
1 | 'Widget'
2 | 'SpecialWidget'
3 | 'Widget'
4 | 'Widget'
widget_types
type | name | description
--------------------------------------------------------------
'Widget' | 'Normal Widget' | 'A normal widget.'
'SpecialWidget' | 'Special Widget' | 'This is an awfully important widget.'
You called these values a "constant", but are they really? For the purposes of your domain, will they never change, like the value of Math::PI never changes? Or will descriptions be changed, widgets renamed, widgets added, and widgets expired? Without knowing for sure, I'm going to assume they're not actually constant.
Having name and description as methods is effectively storing that widget_types table in your application source code, moving data out of your database. If you really can't afford the extra millisecond a simple JOIN for two small strings on each Widget incurs, then just load the full widget_types table into cache once on application startup, and it'll perform the same as saving it in source code.
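A minimal sketch of that boot-time cache (assuming a WidgetType model over the widget_types table; the names are mine):

# config/initializers/widget_types.rb -- hypothetical
WIDGET_TYPES = WidgetType.all.index_by(&:type).freeze

# later, anywhere in the app, no JOIN at query time:
WIDGET_TYPES["SpecialWidget"].name  # => "Special Widget"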
This schema is more normalized (incurring benefits), the data itself describes all I need to know, and as you've pointed out, I can flexibly operate on that data (important since you "will definitely have to sort"). The data in this form is also extensible for future changes as they come.
Again: the database stores structured data for the purpose of on-demand flexible manipulation -- you can make up queries on the fly, and the DB can answer them.
I Really Don't Want to Put Data in the Database
Okay... then you'll have to pass that data into the database every time you want to operate on it. You can do it like so:
SELECT w.id, w.type, wt.name
FROM widgets w
INNER JOIN (
VALUES ('Widget', 'Normal Widget'), ('SpecialWidget', 'Special Widget')
) wt(type, name) ON wt.type = w.type
ORDER BY wt.name
The VALUES expression creates an ad-hoc table mapping the class to the name. By passing in that mapping and joining on it (every time), you can tell the DB to ORDER BY it.
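From Rails you could attach that ad-hoc table with a raw joins fragment (PostgreSQL syntax; a sketch, one of several ways to write it):

Widget
  .joins("INNER JOIN (VALUES ('Widget', 'Normal Widget'), " \
         "('SpecialWidget', 'Special Widget')) wt(type, name) " \
         "ON wt.type = widgets.type")
  .order(Arel.sql("wt.name"))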
In my Rails application I have an index view that lists all of my projects.
This list can be sorted by clicking on any of the table column headers, e.g. Date, Name, updated_at etc. This happens by appending a &sort= GET parameter to the URL.
My question is: From a performance point-of-view, would it be advisable to add indexes to these columns in my database?
This is what a migration might look like:
class AddMoreIndexes < ActiveRecord::Migration
def change
add_index :projects, :date
add_index :projects, :name
    add_index :projects, :updated_at
end
end
Will I get any performance gains from this?
Indexes can be used to speed an order-by, but if you were identifying a subset of rows to display then an index that is helpful for that is likely to be chosen in preference. You'd need composite indexes in such a situation.
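For example, if the list is always filtered by owner before sorting, a composite index (the column names here are hypothetical) lets the database satisfy both the filter and the order from one index:

class AddCompositeIndexToProjects < ActiveRecord::Migration
  def change
    add_index :projects, [:user_id, :name]
  end
end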
There are a couple of other problems.
Firstly, ordering on an indexed string value may require a linguistically sorted index, not the regular ASCII/Binary sort, so multilingual applications may not be helped at all.
Secondly, it can discourage normalisation of the database because you really need the display values to be in the table you're selecting.
You might like to look at using another method for the sort. I've been very happy with using Google visualisation tables, which come with jQuery sorting built in.
Depending on how you query your database, yes, indexes will give you performance gains. For example, whenever I add a foreign key to a table, I immediately index it, because I know my application's queries will run through it; if they didn't, I wouldn't have added the foreign key in the first place. Especially once you accumulate a large amount of data in your database, this can improve performance dramatically. If you plan to query your database by date, name, or updated_at, then indexes on those columns could be a gain, depending on the query. Otherwise, there really is no point.
Note, you wouldn't want to add an index for every column. Having necessary indices will help you, but if you have an index for every column, then you run the risk of confusing the SQL Query Optimizer and actually hindering your performance.
My suggestion: Add an index for every foreign key you have in your table, but if you're also running some heavy queries with other columns, then add an index there too.
Given I have a model house and it lives through several states, something like this:
Dreaming —> Planning —> Building —> Living —> Tearing down
If I wanted to retrieve, let's say, ten houses from the database and order them by the state field, I'd first get all houses in the Building state, then Dreaming, then Living, …
Is it possible to fetch all houses from the database and order them by the state in the order intended before retrieving them? Meaning, first all houses in the Dreaming state, then Planning, etc. E.g. by providing the order in an array for comparison of sorts.
I'd like to avoid doing this in Ruby after having fetched all entries as well as I wouldn't want to use IDs for the states.
After reading up on enum implementations, I guess, if I can make it work, I'll try to combine the enum column plugin with the state_machine plugin to achieve what I'm after. If anyone has done something like this before (especially the combination under Rails 3), I'd be grateful for input!
Here's some information on how to use SQL ENUMs in rails -- they're relatively database portable and do roughly what you want -- http://www.snowgiraffe.com/tech/311/enumeration-columns-with-rails/
If you are using MySQL, then the solution is to do ORDER BY FIELD(state, 'Building', 'Dreaming', 'Living', ...):
House.order("FIELD(state, 'Building', 'Dreaming', 'Living', ...)")
If you want to order the collection by a certain criterion, then you must store that criterion somewhere.
I don't know if this goes against your "not in Ruby" criteria but I would probably do something like this:
class House < ActiveRecord::Base
  STATES = { 0 => "Dreaming",
             1 => "Planning",
             2 => "Building",
             3 => "Living",
             4 => "Tearing Down" }

  validates_inclusion_of :state, :in => STATES.keys

  def state_name
    STATES[self.state]
  end
end
#houses = House.order("state")
In this case the db field state is an integer instead of a string. It makes it very effective for database storage as well as querying.
Then in your view, you call state_name to get the correct name from the STATES hash stored in the model. This can also be changed to use i18n localization by using labels instead of strings in the hash.
I have a model House that has many boolean attributes, like has_fireplace, has_basement, has_garage, and so on. House has around 30 such boolean attributes. What is the best way to structure this model for efficient database storage and search?
I would like to eventually search for all Houses that have a fireplace and a garage, for example.
The naive way, I suppose, would be to simply add 30 boolean attributes in the model that each corresponds to a column in the database, but I'm curious if there's a Rails best practice I'm unaware of.
Your 'naive' assumption is correct - the most efficient way from a query speed and productivity perspective is to add a column for each flag.
You could get fancy as some others have described, but unless you're solving some very specific performance problems, it's not worth the effort. You'd end up with a system that's harder to maintain, less flexible, and that takes longer to develop.
For that many booleans in a single model you might consider using a single integer and bitwise operations to represent, store and retrieve values. For example:
class Model < ActiveRecord::Base
  HAS_FIREPLACE = (1 << 0)
  HAS_BASEMENT  = (1 << 1)
  HAS_GARAGE    = (1 << 2)
  ...
end
Then some model attribute called flags would be set like this:
flags |= HAS_FIREPLACE
flags |= (HAS_BASEMENT | HAS_GARAGE)
And tested like this:
flags & HAS_FIREPLACE
flags & (HAS_BASEMENT | HAS_GARAGE)
which you could abstract into methods, as sketched below. This should be pretty efficient in time and space as an implementation.
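For instance, the abstraction might look like this (a sketch building on the constants above; it assumes an integer column named flags defaulting to 0):

class Model < ActiveRecord::Base
  HAS_FIREPLACE = (1 << 0)
  HAS_BASEMENT  = (1 << 1)
  HAS_GARAGE    = (1 << 2)

  def has_fireplace?
    (flags & HAS_FIREPLACE) != 0
  end

  def has_fireplace=(value)
    self.flags = value ? (flags | HAS_FIREPLACE) : (flags & ~HAS_FIREPLACE)
  end

  # searching tests the bitmask in SQL
  scope :with_fireplace, -> { where("flags & ? != 0", HAS_FIREPLACE) }
end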
I suggest the flag_shih_tzu gem. It helps you store many boolean attributes in one integer column. It gives you named scopes for each attribute and a way to chain them together as active record relations.
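Usage follows the gem's README pattern, roughly like this (by default it stores all the flags in a single integer column named flags):

class House < ActiveRecord::Base
  include FlagShihTzu
  has_flags 1 => :has_fireplace,
            2 => :has_basement,
            3 => :has_garage
end

House.has_fireplace.has_garage  # generated scopes, chainable
house.has_fireplace?            # generated boolean reader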
Here's another solution.
You could make a HouseAttribute model and set up a two-way has_and_belongs_to_many association
# house.rb
class House < ActiveRecord::Base
  has_and_belongs_to_many :house_attributes
end

# house_attribute.rb
class HouseAttribute < ActiveRecord::Base
  has_and_belongs_to_many :houses
end
Then each attribute for a house would be a database entry.
Don't forget to set up your join table on your database.
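The migration for that join table would be along these lines (HABTM expects a table named after both models, with no primary key):

create_table :house_attributes_houses, :id => false do |t|
  t.integer :house_attribute_id, :null => false
  t.integer :house_id,           :null => false
end
add_index :house_attributes_houses, :house_attribute_id
add_index :house_attributes_houses, :house_id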
If you're wanting to query on those attributes, then you're unfortunately probably stuck with first-class fields, if performance is a consideration. Bitfields and flag strings are an easy way to solve the problem, but they don't scale well against production data sets.
If you aren't going to worry about performance, then I'd use an implementation where each property is represented by a character ("a" = "garage", "b" = "fireplace", etc), and you just build a string that represents all the flags a record has. The primary advantage this has over a bitfield is that a) it's easier for a human to debug, and b) you don't need to worry about the size of your data types.
If performance is a concern, then you will likely need to promote them to first-class fields.
Normally I'd agree that your naive assumption is correct.
If the number of boolean fields keeps growing and growing (has_fusion_reactor?) you may also consider serializing an array of flags:
# house.rb
class House < ActiveRecord::Base
  serialize :flags
  …
end
# Setting flags
house.flags = [:fireplace, :pool, :doghouse]

# Appending
house.flags << :sauna

# Querying (flags is an Array, so use include?)
house.flags.include? :porch

# Searching
House.where "flags LIKE ?", "%pool%"
I'm thinking about something like this
You have a House Table (for details of the house)
You have another master table called Features (which has features like 'fireplace', 'basement', etc.)
and you have a joining table like Houses_Features
and it has house_id and feature_id
That way you can assign features to a given house. I don't know whether this matches your needs, but it's worth thinking about. :D
You could always have a TEXT column that you hold JSON in (say, data), and then your queries could use SQL's LIKE.
Eg: house.data #=> '{"has_fireplace":true,"has_basement":false,"has_garage":true}'
Thus, doing a find using LIKE '%"has_fireplace":true%' would return anything with a fireplace.
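In ActiveRecord terms that lookup is just a LIKE condition (a sketch; note it is brittle if the serializer ever changes key order or spacing):

House.where("data LIKE ?", '%"has_fireplace":true%')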
Using model relationships (eg, a model for Fireplace, Basement, and Garage in addition to just House) would be extremely cumbersome in this case, since you have so many models.
I was watching a screencast where the author said it is not good to have a primary key on a join table but didn't explain why.
The join table in the example had two columns defined in a Rails migration and the author added an index to each of the columns but no primary key.
Why is it not good to have a primary key in this example?
create_table :categories_posts, :id => false do |t|
  t.column :category_id, :integer, :null => false
  t.column :post_id, :integer, :null => false
end
add_index :categories_posts, :category_id
add_index :categories_posts, :post_id
EDIT: As I mentioned to Cletus, I can understand the potential usefulness of an auto number field as a primary key even for a join table. However in the example I listed above, the author explicitly avoids creating an auto number field with the syntax ":id => false" in the "create table" statement. Normally Rails would automatically add an auto-number id field to a table created in a migration like this and this would become the primary key. But for this join table, the author specifically prevented it. I wasn't sure why he decided to follow this approach.
Some notes:
The combination of category_id and post_id is unique in and of itself, so an additional ID column is redundant and wasteful
The phrase "not good to have a primary key" is incorrect in the screencast. You still have a Primary Key -- it is just made up of the two columns (e.g. CREATE TABLE foo( cid, pid, PRIMARY KEY( cid, pid ) ). For people who are used to tacking on ID values everywhere this may seem odd but in relational theory it is quite correct and natural; the screencast author would better have said it is "not good to have an implicit integer attribute called 'ID' as the primary key".
It is redundant to have the extra column because you will place a unique index on the combination of category_id and post_id anyway to ensure no duplicate rows are inserted
Finally, although common nomenclature is to call it a "composite key" this is also redundant. The term "key" in relational theory is actually the set of zero or more attributes that uniquely identify the row, so it is fine to say that the primary key is category_id, post_id
Place the MOST SELECTIVE column FIRST in the primary key declaration. A discussion of the construction of b(+/*) trees is out of the scope of this answer ( for some lower-level discussion see: http://www.akadia.com/services/ora_index_selectivity.html ) but in your case, you'd probably want it on post_id, category_id since post_id will show up less often in the table and thus make the index more useful. Of course, since the table is so small and the index will be, essentially, the data rows, this is not very important. It would be in broader cases where the table is wider.
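In the Rails migration from the question, that unique composite index (post_id first, per the selectivity advice above) would look something like:

add_index :categories_posts, [:post_id, :category_id], :unique => true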
It is a bad idea not to have a primary key on any table, period (if the DBMS is a relational DBMS - or an SQL DBMS). Primary keys are a crucial part of the integrity of your database.
I suppose if you don't mind your database being inaccurate and providing incorrect answers every so often, then you could do without...but most people want accurate answers from their DBMS and for such people, primary keys are crucial.
A DBA would tell you that the primary key in this case is actually the combination of the two FK columns. Since Rails/ActiveRecord doesn't play nice with composite PKs (by default, at least), that may be the reason.
The combination of foreign keys can be a primary key (called a composite primary key). Personally I favour using a technical primary key instead of that (auto number field, sequence, etc). Why? Well, it makes it much easier to identify the record, which you may need to do if you're going to delete it.
Think about it: if you're going to present a Webpage of all the linkages, having a primary key to identify the record makes it much easier.
Basically because there's no need for it. The combination of the two foreign key field adequately uniquely identifies any row.
But that merely says why it's not a Good Idea.... but why would it be a Bad Idea?
Consider the overhead an identity column would add. The table would take up 50% more disk space. Worse is the index situation. With an identity field, you have to maintain the identity count, plus a second index. You'll be tripling the disk space and tripling the work that needs to be performed on every insert. With the only advantage being a slightly shorter WHERE clause in a DELETE command.
On the other hand, if the composite key fields are the entire table, then the index can be the table.
Placing the most selective column first should only be relevant in the INDEX declaration. In the KEY declaration, it should not matter (because, as has been correctly pointed out, the KEY is a SET, and inside a set, order doesn't matter - the set {a1,a2} is the same set as {a2,a1}).
If a DBMS product is such that ordering of attributes inside a KEY declaration makes a difference, then that DBMS product is guilty of not properly distinguishing between the logical design of a database (the part where you do the KEY declaration) and the physical design of the database (the part where you do the INDEX declaration).
I wanted to comment on the following comment: "It is not correct to say zero or more".
I wanted to remark that the text to which this comment was added simply did not contain the text "zero or more", so the author of the comment I wanted to comment on was criticizing someone else for something that hadn't been said.
I also wanted to comment that it is not correct to say that it is not correct to say "zero or more". Relational theory, as commonly known today among the few people who still bother to study the details of that theory, actually REQUIRES the possibility of a key with no attributes.
But when I pressed the button "comment", the system responded to me that commenting requires a reputation score of 50 (or some such).
A sad illustration of how the world seems to have forgotten that science is not democracy, and that in science, the truth is not determined by whoever happens to be the majority, nor by whoever happens to have "enough reputation".
Pros of having a single PK
Uniquely identifies a row with a single value
Makes it easy to reference the relationship from elsewhere if needed
Some tools want you to have a single integer-valued PK
Cons of having a single PK
Uses more disk space
Need 3 indexes rather than 1
Without a unique constraint you could end up with multiple rows for the same relationship
Notes
You need to define a unique constraint if you want to avoid duplicates
In my opinion, don't use the single PK if your table is going to be huge; otherwise, trade off some disk space for the convenience. Yes, it's wasteful, but who cares about a few MB on disk in real-world applications?