Advice on handling versioning of an EAV-model table - ruby-on-rails

There are many ActiveRecord versioning gems available for Rails, but most, if not all, of them have trouble staying maintained. On top of that, some of them seem to have various foreign key association issues.
I'm in the process of coding a content management system where pages are stored in a tree-like hierarchy and the page fields are stored in a separate table using the EAV model.
Keeping that in mind, I'm not looking for an all-encompassing revisioning gem, because I honestly don't think I'll find one. What I am looking for is some advice on how to handle this as a custom implementation. Should I have a separate table for storing revisions and refer to a revision number in my EAV table? I foresee that this may lead to some complex validation problems, though. I currently have a problem finding a clean way to validate a regular EAV table anyway, so if anyone can comment on this it would be very much appreciated as well.
I hope this question is written well enough to SO standards. If you need any additional information, please do not hesitate to ask and I will try to help you help me. :)

I currently have a problem finding a clean way to validate a regular EAV table anyway, so if anyone can comment on this it would be very much appreciated as well.
There isn't a clean way to either validate or constrain an EAV table. That's why DBAs call it an anti-pattern; Bill Karwin covers it in his SQL Antipatterns slides (EAV starts on slide 16). Bill doesn't talk about versioning, so I will.
Versioning looks simple, but it's not. To version a row, you can add a column. It doesn't really matter much whether it's a version number or a timestamp.
create table test (
    test_id    integer not null,
    attr_ts    timestamp not null default current_timestamp,
    attr_name  varchar(35) not null,
    attr_value varchar(35) not null,
    primary key (test_id, attr_ts, attr_name)
);
insert into test (test_id, attr_name, attr_value) values
(1, 'emp_id', 1),
(1, 'emp_name', 'Alomar, Anton');
select * from test;
test_id  attr_ts                      attr_name  attr_value
-------  ---------------------------  ---------  -------------
      1  2012-10-28 21:00:59.688436   emp_id     1
      1  2012-10-28 21:00:59.688436   emp_name   Alomar, Anton
Although it might not look like it on output, all those attribute values are varchar(35). There's no simple way for the dbms to prevent someone from entering 'wibble' as an emp_id. If you need type checking, you have to do it in application code. (And you have to keep sleep-deprived DBAs from using the command-line and GUI interfaces the dbms provides.)
With a normalized table, of course, you'd just declare emp_id to be of type integer.
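For contrast, here's a minimal sketch of the normalized version of those two attributes; the type declaration itself is the constraint, with no EAV machinery needed:
create table employee (
    emp_id   integer not null primary key,  -- the dbms now rejects 'wibble' here
    emp_name varchar(35) not null
);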
With versioning, updating Anton's name becomes an insert.
insert into test (test_id, attr_name, attr_value) values
(1, 'emp_name', 'Alomar, Antonio');
With versioning, selection is mildly complicated. You can use a view instead of a common table expression.
with current_values as (
    select test_id, attr_name, max(attr_ts) cur_ver_ts
    from test
    -- You'll probably need an index on this pair of columns to get good performance.
    group by test_id, attr_name
)
select t.test_id, t.attr_name, t.attr_value
from test t
inner join current_values c
    on c.test_id = t.test_id
    and c.attr_name = t.attr_name
    and c.cur_ver_ts = t.attr_ts;
test_id  attr_name  attr_value
-------  ---------  ---------------
      1  emp_id     1
      1  emp_name   Alomar, Antonio
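If you'd rather not repeat that common table expression in every query, the same logic can be wrapped in the view mentioned above. A sketch (the view name is mine):
create view test_current as
select t.test_id, t.attr_name, t.attr_value
from test t
inner join (
    select test_id, attr_name, max(attr_ts) cur_ver_ts
    from test
    group by test_id, attr_name
) c
    on c.test_id = t.test_id
    and c.attr_name = t.attr_name
    and c.cur_ver_ts = t.attr_ts;

-- current values are then a plain select
select * from test_current;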
A normalized table of 1 million rows and 8 non-nullable columns has a million rows. A similar EAV table has 8 million rows. A versioned EAV table has 8 million rows, plus a row for every change to every value and every attribute name.
Storing a version number and joining to a second table that contains the current values doesn't gain much, if anything at all. Every (traditional) insert would require inserts into two tables. What would be one row of 8 columns becomes 16 rows (8 in each of the two tables).
Selection is a little simpler, requiring only a join.
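For comparison, here's a minimal sketch of that version-number layout (table and column names are my own, not from the question):
-- EAV rows keyed by an explicit version number instead of a timestamp
create table test_v (
    test_id    integer not null,
    attr_name  varchar(35) not null,
    version    integer not null,
    attr_value varchar(35) not null,
    primary key (test_id, attr_name, version)
);

-- one row per attribute, pointing at its current version
create table test_cur (
    test_id   integer not null,
    attr_name varchar(35) not null,
    version   integer not null,
    primary key (test_id, attr_name)
);

-- selection is a plain join, no max() needed
select t.test_id, t.attr_name, t.attr_value
from test_v t
inner join test_cur c
    on c.test_id = t.test_id
    and c.attr_name = t.attr_name
    and c.version = t.version;
Every change now means an insert into test_v plus an insert or update in test_cur, which is the doubled write cost described above.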

Related

Identifying the fact table in data warehouse design

I'm trying to design my first data mart with a star schema, starting from an Excel sheet containing information about help desk service calls. The sheet contains 33 fields with various pieces of information, and I can't identify the fact table because I want to do the reporting later based on different KPIs.
I want to know how to identify the fact table measures easily, and I have another question: can a fact table contain only foreign keys of dimensions and no measures? Thanks in advance, and sorry for my bad English.
You can have more than one fact table.
A fact table represents an event or process that you want to analyze.
The structure of a fact table depends on the process or event that you are trying to analyze.
You need to tell us the events or processes that you want to analyze before we can help you further.
Can a fact table contain only foreign keys of dimensions and no measures?
Yes. This is called a factless fact table.
Let's say you want to do a basic analysis of calls:
Your full table might look like this:
CALL_ID
START_DATE
DURATION
AGENT_NAME
AGENT_TENURE (how long worked for company)
CUSTOMER_NAME
CUSTOMER_TENURE (how long a customer)
PRODUCT_NAME (the product the customer is calling about)
RESOLVED
You would turn this into a fact table like this:
CALL_ID
START_DATE_KEY
AGENT_KEY
CUSTOMER_KEY
PRODUCT_KEY
DURATION (measure)
RESOLVED (quasi-measure)
And you would have a DATE dimension table, AGENT dimension table, CUSTOMER dimension table and PRODUCT dimension table.
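A rough SQL sketch of that design (names and types are my assumptions, not from the question; only the AGENT dimension is shown, and the other three follow the same pattern):
create table dim_agent (
    agent_key    integer primary key,
    agent_name   varchar(100) not null,
    agent_tenure integer not null     -- how long worked for company
);

create table fact_call (
    call_id        integer primary key,
    start_date_key integer not null,  -- FK to the DATE dimension
    agent_key      integer not null,  -- FK to dim_agent
    customer_key   integer not null,  -- FK to the CUSTOMER dimension
    product_key    integer not null,  -- FK to the PRODUCT dimension
    duration       integer not null,  -- measure
    resolved       boolean not null   -- quasi-measure
);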
Agile Data Warehouse Design is a good book, as are the ones by Kimball.
In general, the way I've done it (and there are a number of ways to do anything) is that the categorical data is referenced with a foreign key in the fact table, but anything you want to perform aggregations on (typically money, integer, or double types) can live in the fact table as well. So, for example, a fact table might contain a hierarchy of types, such as product_category >> product_name, and it usually contains a time and/or location field as well, all of which would be referenced by a foreign key to a lookup table. The measure columns are usually integer-based or money data, and are used in aggregate functions grouped by the other fields, like this:
select product_category, sum(measureOne) as measure_total
from facttable
where timeCol between X and Y
group by product_category  -- ...etc
At one time a few years ago, I did have a fact table that had no measure column... because the only measure I had was based on count, which I would do dynamically by grouping different dimensions in the fact table.

How can I make an INT field in rails that starts to auto_increment at 10001?

How can I create a field (non-primary id) in a rails table that auto_increments like an id but that starts from the number 10,000? My app is using sqlite3 and rails 4.
Thank you
The suggested solution does not address my question. I was looking for a raw SQL answer.
If you had two auto-increment columns on one table, both values would always be in sync (with an offset given by the start value). Therefore it doesn't make much sense to have two independent auto-increment columns on one table; it would just be a waste of disk space.
That said: some database engines support combined auto-increment columns (auto-increment columns that depend on other columns), as in the sketch below, but AFAIK no engine supports two independent columns.
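For example, MySQL's MyISAM engine allows an AUTO_INCREMENT column as the second column of a composite key, where it counts up separately within each prefix value (adapted from the MySQL manual; note this is no help on SQLite):
CREATE TABLE animals (
    grp  ENUM('fish','mammal','bird') NOT NULL,
    id   MEDIUMINT NOT NULL AUTO_INCREMENT,
    name CHAR(30) NOT NULL,
    PRIMARY KEY (grp, id)   -- id restarts at 1 within each grp value
) ENGINE=MyISAM;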
A simple work around would be something like this:
def secondary_id
  id + 10_000
end
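Since the question asked for raw SQL: SQLite cannot auto-increment a second column, but on SQLite 3.31 or newer the same id + 10_000 trick can be pushed into the schema with a virtual generated column. An untested sketch, assuming a hypothetical widgets table with a declared integer id primary key:
ALTER TABLE widgets
    ADD COLUMN secondary_id INTEGER
    GENERATED ALWAYS AS (id + 10000) VIRTUAL;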

2 column table, ignore duplicates on mass insert postgresql

I have a join table in Rails which is just a 2-column table with IDs.
In order to mass insert into this table, I use
ActiveRecord::Base.connection.execute("INSERT INTO myjointable (first_id,second_id) VALUES #{values}")
Unfortunately this gives me errors when there are duplicates. I don't need to update any values, simply move on to the next insert if a duplicate exists.
How would I do this?
As an FYI, I have searched Stack Overflow and most of the answers are a bit advanced for me to understand. I've also checked the PostgreSQL documents and played around in the Rails console, but still to no avail. I can't figure this one out, so I'm hoping someone else can help tell me what I'm doing wrong.
The closest statement I've tried is:
INSERT INTO myjointable (first_id, second_id)
SELECT 1, 2
WHERE NOT EXISTS (
    SELECT first_id FROM myjointable
    WHERE first_id = 1 AND second_id IN (...));
Part of the problem with this statement is that I am only inserting 1 value at a time whereas I want a statement that mass inserts. Also the second_id IN (...) section of the statement can include up to 100 different values so I'm not sure how slow that will be.
Note that for the most part there should not be many duplicates so I am not sure if mass inserting to a temporary table and finding distinct values is a good idea.
Edit to add context:
The reason I need a mass insert is because I have a many-to-many relationship between 2 models, where 1 of the models is never populated by a form. I have stocks and stock price histories. The stock price histories are never created in a form, but rather mass-inserted by pulling the data from Yahoo Finance with their finance API. I use the activerecord-import gem to mass insert stock price histories (i.e. Model.import columns, values), but I can't type jointable.import columns, values because jointable is an undefined local variable.
I ended up using the WITH clause to select my values and give it a name. Then I inserted those values and used WHERE NOT EXISTS to effectively skip any items that are already in my database.
So far it looks like it is working...
WITH withqueryname (first_id, second_id) AS (
    VALUES (1,2), (3,4), (5,6)  -- ...etc
)
INSERT INTO jointablename (first_id, second_id)
SELECT * FROM withqueryname
WHERE NOT EXISTS (
    SELECT first_id FROM jointablename
    WHERE first_id = 1
      AND second_id IN (1,2,3,4,5,6)  -- ...etc
);
You can interchange the VALUES list with a variable; mine was VALUES #{values}.
You can also interchange the second_id IN list with a variable; mine was second_id IN #{variable}.
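For readers on PostgreSQL 9.5 or newer (which postdates this question), the same skip-the-duplicates behaviour is built in, provided the join table has a unique index or primary key on the pair:
INSERT INTO myjointable (first_id, second_id)
VALUES (1,2), (3,4), (5,6)
ON CONFLICT (first_id, second_id) DO NOTHING;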
Here's how I'd tackle it: Create a temp table and populate it with your new values. Then lock the old join values table to prevent concurrent modification (important) and insert all value pairs that appear in the new table but not the old one.
One way to do this is by doing a left outer join of the old values onto the new ones and filtering for rows where the old join table values are null. Another approach is to use an EXISTS subquery. The two are highly likely to result in the same query plan once the query optimiser is done with them anyway.
Example, untested (since you didn't provide an SQLFiddle or sample data) but should work:
BEGIN;

CREATE TEMPORARY TABLE newjoinvalues (
    first_id  integer,
    second_id integer,
    PRIMARY KEY (first_id, second_id)
);

-- Now populate `newjoinvalues` with multi-valued inserts or COPY
COPY newjoinvalues (first_id, second_id) FROM stdin;

LOCK TABLE myjoinvalues IN EXCLUSIVE MODE;

INSERT INTO myjoinvalues
SELECT n.first_id, n.second_id
FROM newjoinvalues n
LEFT OUTER JOIN myjoinvalues m
    ON (n.first_id = m.first_id AND n.second_id = m.second_id)
WHERE m.first_id IS NULL AND m.second_id IS NULL;

COMMIT;
This won't update existing values, but you can do that fairly easily too, with a second query that does an UPDATE ... FROM while still holding the write table lock.
Note that the lock mode specified above will not block SELECTs, only writes like INSERT, UPDATE and DELETE, so queries can continue to be made against the table while the process is ongoing; you just can't update it.
If you can't accept that, an alternative is to run the update in SERIALIZABLE isolation (which only works properly for this purpose in Pg 9.1 and above). This will result in the query failing whenever a concurrent write occurs, so you have to be prepared to retry it over and over again. For that reason it's likely better to just live with locking the table for a while.

Best SQL indexes for join table

With performance improvements in mind, I was wondering whether indexes are helpful on a join table and, if so, which ones (specifically in a Rails 3 has_and_belongs_to_many context).
Model and Table Setup
My models are Foo and Bar and, per Rails convention, I have a join table called bars_foos. There is no primary key or timestamps, making the only fields in this table bar_id:integer and foo_id:integer. I'm interested in knowing which of the following indexing options is best, without duplication:
1. A compound index: add_index :bars_foos, [:bar_id, :foo_id]
2. Two indexes:
   A. add_index :bars_foos, :bar_id
   B. add_index :bars_foos, :foo_id
3. A combination of both 1 and 2-B
Basically, I'm not sure if the compound index is enough, assuming it is helpful to begin with. I believe that a compound index can be used as a single index for its first column, which is why I am pretty sure that using all three lines would result in unnecessary duplication.
Likely Usage
The most common usage will be given an instance of model Foo, I will be asking for its associated bars using the RoR syntax of foo.bars and vice versa with bar.foos for an instance of the model Bar.
These will generate queries of the type SELECT * FROM bars_foos WHERE foo_id = ? and SELECT * FROM bars_foos WHERE bar_id = ? respectively and then using those resultant IDs to SELECT * FROM bars WHERE ID in (?) and SELECT * FROM foos WHERE ID in (?).
Please correct me in the comments if I am incorrect, but I do not believe that, in the context of the Rails application, it will ever try to run a query that specifies both IDs, like SELECT * FROM bars_foos WHERE bar_id = ? AND foo_id = ?.
Databases
In the event there are database specific optimization techniques, I will most likely be using PostgreSQL. However, others using this code may want to use it in MySQL or SQLite depending on their Rails configuration so all answers are appreciated.
The Answer
The oft-repeated answer, which tends to be the case more often than not, is "it depends." More specifically, it depends on what your data is and how it will be used.
tl;dr Explanation
The short tl;dr answer for my specific case (and to cover all future bases) is choice #2, which is what I suspected. However, choice #3 would work just fine, as, depending on my usage of the data, the extra time and space used creating the compound index could reduce future query lookups.
The Full Explanation
The reason for this is that databases try to be smart and do things as fast as possible regardless of programmer input. The most basic item to consider when adding an index is whether this object will be looked up by this key. If yes, an index can potentially help speed that up. However, whether this index is even used comes down to the selectivity and cardinality of the field.
Since foreign keys are typically the IDs of another AR class, cardinality will usually be high. But again, this depends on your data. In my example, if there are many Foos but few Bars, many of the entries in my join table will have similar bar_ids. With bar_ids having a low cardinality, an index on bar_id may never be used and may just get in the way, by having the database devote time and resources* to adding to this index every time a new bars_foos entry is created. The same goes for many Bars and few Foos, or for few of both.
The general lesson is that when considering an index on a table, decide if the entries will be both looked up by this field and if this field has a high cardinality. That is, does this field have many distinct values? In the case of most join tables "it depends" and we must think more carefully about what the data represents and the relationships themselves. In my case, I will have both many Foos and Bars and will be looking up Foos by their associated bars and vice versa.
Another good answer I got at the office was, "why are you worrying about your indexes? Build your app!"
Footnotes
* In a similar question on indexes on STI it was pointed out that the cost of an index is very low so when in doubt, just add it.
Depends on how you are going to query the data.
Assuming you want to search for all of these...
WHERE bar_id = ?
WHERE foo_id = ?
WHERE bar_id = ? AND foo_id = ?
...then you should probably go with an index on {bar_id, foo_id} and an index on {foo_id}.
While you could also create a third index on {bar_id}, the price of maintaining an additional index would probably outweigh the benefit of better clustering in the smaller index.
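In raw SQL terms, that recommendation comes out to something like this (index names are mine; the Rails migration equivalent is choice #1 plus 2-B):
-- serves both WHERE bar_id = ? and WHERE bar_id = ? AND foo_id = ?
CREATE INDEX index_bars_foos_on_bar_id_and_foo_id ON bars_foos (bar_id, foo_id);

-- serves WHERE foo_id = ?
CREATE INDEX index_bars_foos_on_foo_id ON bars_foos (foo_id);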
Also, how do you plan to cover your queries with indexes? Some of the alternatives, such as...
{foo_id, bar_id} and {bar_id}
{foo_id, bar_id} and {bar_id, foo_id}
...might cover certain kinds of queries better.
Covering is a balancing act - sometimes adding a field to an index just for covering purposes is justified, sometimes it's not. You won't know until you measure on realistic amounts of data.
(Disclaimer: I'm not familiar with Ruby. This answer is purely from the database perspective.)

Why is it not good to have a primary key on a join table?

I was watching a screencast where the author said it is not good to have a primary key on a join table but didn't explain why.
The join table in the example had two columns defined in a Rails migration and the author added an index to each of the columns but no primary key.
Why is it not good to have a primary key in this example?
create_table :categories_posts, :id => false do |t|
  t.column :category_id, :integer, :null => false
  t.column :post_id, :integer, :null => false
end
add_index :categories_posts, :category_id
add_index :categories_posts, :post_id
EDIT: As I mentioned to Cletus, I can understand the potential usefulness of an auto number field as a primary key even for a join table. However in the example I listed above, the author explicitly avoids creating an auto number field with the syntax ":id => false" in the "create table" statement. Normally Rails would automatically add an auto-number id field to a table created in a migration like this and this would become the primary key. But for this join table, the author specifically prevented it. I wasn't sure why he decided to follow this approach.
Some notes:
The combination of category_id and post_id is unique in and of itself, so an additional ID column is redundant and wasteful
The phrase "not good to have a primary key" is incorrect in the screencast. You still have a Primary Key -- it is just made up of the two columns (e.g. CREATE TABLE foo( cid, pid, PRIMARY KEY( cid, pid ) ). For people who are used to tacking on ID values everywhere this may seem odd but in relational theory it is quite correct and natural; the screencast author would better have said it is "not good to have an implicit integer attribute called 'ID' as the primary key".
It is redundant to have the extra column because you will place a unique index on the combination of category_id and post_id anyway to ensure no duplicate rows are inserted
Finally, although common nomenclature is to call it a "composite key", this is also redundant. The term "key" in relational theory is actually the set of zero or more attributes that uniquely identify the row, so it is fine to say that the primary key is category_id, post_id
Place the MOST SELECTIVE column FIRST in the primary key declaration. A discussion of the construction of b(+/*) trees is outside the scope of this answer (for some lower-level discussion see: http://www.akadia.com/services/ora_index_selectivity.html), but in your case you'd probably want it on post_id, category_id, since post_id will show up less often in the table and thus make the index more useful. Of course, since the table is so small and the index will be, essentially, the data rows, this is not very important. It would be in broader cases where the table is wider.
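Putting those notes together, in raw SQL rather than a Rails migration, the join table would look something like this (a sketch):
CREATE TABLE categories_posts (
    category_id integer NOT NULL,
    post_id     integer NOT NULL,
    PRIMARY KEY (post_id, category_id)   -- post_id first, per the selectivity note above
);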
It is a bad idea not to have a primary key on any table, period (if the DBMS is a relational DBMS - or an SQL DBMS). Primary keys are a crucial part of the integrity of your database.
I suppose if you don't mind your database being inaccurate and providing incorrect answers every so often, then you could do without...but most people want accurate answers from their DBMS and for such people, primary keys are crucial.
A DBA would tell you that the primary key in this case is actually the combination of the two FK columns. Since Rails/ActiveRecord doesn't play nice with composite PKs (by default, at least), that may be the reason.
The combination of foreign keys can be a primary key (called a composite primary key). Personally I favour using a technical primary key instead of that (auto number field, sequence, etc). Why? Well, it makes it much easier to identify the record, which you may need to do if you're going to delete it.
Think about it: if you're going to present a Webpage of all the linkages, having a primary key to identify the record makes it much easier.
Basically because there's no need for it. The combination of the two foreign key fields uniquely identifies any row perfectly well.
But that merely says why it's not a Good Idea.... but why would it be a Bad Idea?
Consider the overhead an identity column would add. The table would take up 50% more disk space. Worse is the index situation: with an identity field, you have to maintain the identity count, plus a second index. You'll be tripling the disk space and tripling the work that needs to be performed on every insert, with the only advantage being a slightly shorter WHERE clause in a DELETE command.
On the other hand, if the composite key fields are the entire table, then the index can be the table.
Placing the most selective column first should only be relevant in the INDEX declaration. In the KEY declaration, it should not matter (because, as has been correctly pointed out, the KEY is a SET, and inside a set, order doesn't matter - the set {a1,a2} is the same set as {a2,a1}).
If a DBMS product is such that ordering of attributes inside a KEY declaration makes a difference, then that DBMS product is guilty of not properly distinguishing between the logical design of a database (the part where you do the KEY declaration) and the physical design of the database (the part where you do the INDEX declaration).
I wanted to comment on the following comment: "It is not correct to say zero or more".
I wanted to remark that the text the comment was added to simply did not contain the phrase "zero or more", so the author of the comment I wanted to comment on was criticizing someone else for something that hadn't been said.
I also wanted to comment that it is not correct to say that it is not correct to say "zero or more". Relational theory, as commonly known today among the few people who still bother to study its details, actually REQUIRES the possibility of a key with no attributes.
But when I pressed the button "comment", the system responded to me that commenting requires a reputation score of 50 (or some such).
A sad illustration of how the world seems to have forgotten that science is not democracy, and that in science, the truth is not determined by whoever happens to be the majority, nor by whoever happens to have "enough reputation".
Pros of having a single PK
Uniquely identifies a row with a single value
Makes it easy to reference the relationship from elsewhere if needed
Some tools want you to have a single integer value pk
Cons of having a single PK
Uses more disk space
Need 3 indexes rather than 1
Without a unique constraint you could end up with multiple rows for the same relationship
Notes
You need to define a unique constraint if you want to avoid duplicates
In my opinion, don't use the single PK if your table is going to be huge; otherwise, trade off some disk space for the convenience. Yes, it's wasteful, but who cares about a few MB on disk in real-world applications?
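For comparison, here is a sketch of the single-PK variant those pros and cons describe (PostgreSQL syntax assumed):
CREATE TABLE categories_posts (
    id          serial PRIMARY KEY,     -- the extra surrogate key
    category_id integer NOT NULL,
    post_id     integer NOT NULL,
    UNIQUE (category_id, post_id)       -- still needed to avoid duplicate rows
);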
