Database Normalization Validation - normalization

How do I know if I normalized correctly to 2NF or 3NF? I am still struggling how to validate, that I followed the algorithm correctly.
Is this a normlization that would correspond to 3NF? I an a little bit lost.

According to your data schema you have these rules:
At an Incident there can be MANY Responders.
A Responder can have ONE Device.
A Responder can have ONE res_latitude and ONE res_longitude.
A Device can have ONE Dev_installation.
If the above are what you want then i think it's ok (but see again the primary keys).
Also, i forgot to mention that the reason of keeping the responder_id and device_id in a separate table is to keep historical data in case device_id change responder_id. You could also merge ResponceIncidentDevice in one table with keys incident_id, responder_id, device_id so you will be able to know in what incidents a reponder went carrying what devices.
EDIT:
According to your comment you need to make the following changes. Also note that it is better to use lower case for all your tables and columns to avoid case sensitivity problems due to various engine implemantations.
Responders
responder_id res_latitude res_longitude
Responders_Devices (pk: responder_id, device_id)
responder_id device_id
1 1
1 2
2 3
2 4
3 5

Hey there are a lot many tutorials available on the subject but they are a little complex, I can understand your problem.
First of all your project isn't even legal for First Normalization form because your second table, RespondersIncidents, is a table which has two foreign keys but you have no primary keys.
Now let me simplify the rules for you.
1NF - You must have a primary key (One sentence layman definition)
2NF - No Partial Dependency, Try not to have two entries in one column and make sure that your primary key uniquely identifies the whole row.
3NF - No Functional Dependency, Make sure that in one row only your specific primary key has the power to identify the whole row. For e.g. if in one row there is primary key (auto generated) and student id as well which is unique then we have functional dependency here that means we don't need a separate primary key, we can use student id as primary key.
I hope this was informative for you. I kept it short and simple.

Related

Problems with Column in Fact Table

I'm building a DW just like the one from AdventureWorks. I have one fact table called FactSales and theres a table in the database called SalesReason that tells us the reason why a certain costumer buys our product.
The thing is there are two types of costumers - the resselers and the online customers - and only the online customers have a sales reason linked to them.
First of all, can I vave to Dimension tables pointing to the same FK in the Fact? Like in my case - Sk_OnlineCustomer and SK_Resseler both point to FK_Customer. Their Id numbers don't overlap-
And Second,
Should I build a reason dimension, link it to the fact and have a FK that most of the times is null or with a "dummy reason"?
Should I just put the reason in the fact sales without it being a key, just like a technical description that is nullable?
Should I divide the fact in two fact tables with one for the resselers and one for the online customers? But even in that case, I would have some costumers that don't answer to the reason, so the fk_reason would be null in some of its appearences in the new fact_Online_Customer.
In a solution I saw from the adventure works tutorial, it's created a new fact table called fact_reason. It Links the factSales with a DimReason.
That looks like a good solution, but I don't know how it works, because I never lerned in my classes that I could link a fact to a fact, thus I wouldn't be able to justify my option to my teacher.
If you could explain it I would appreciate it.
Thanks!
Please find my comments for your questions:
First of all, can I vave to Dimension tables pointing to the same FK in the Fact? Like in my case - Sk_OnlineCustomer and SK_Resseler both point to FK_Customer. Their Id numbers don't overlap-
Yes the dimension in this case would be Dim_Customer(for eg) and this could be a role playing dimension. You can expose reporting views to separate the Online customer and Reseller customer
And Second, Should I build a reason dimension, link it to the fact and have a FK that most of the times is null or with a "dummy reason"?
Yes it would make sense to build a reason dimension. In this you can tag a fact record to the reason
Should I divide the fact in two fact tables with one for the resselers and one for the online customers? But even in that case, I would have some costumers that don't answer to the reason, so the fk_reason would be null in some of its appearences in the new fact_Online_Customer.
I would suggest you keep one fact as your business activity is sales, you can add context to it, online or reseller using your dimensions. If you would prefer you can have separate Dim_Sales dimension to include the sales type and other details of the sales which you cannot include in the dact
To summarise you probably might be well off with the following facts:
Fact_Sales linked to
Dim_Customer
Dim_Sales
Dim_Reason (This can also may be go to the Dim_Sales)
Dim_Date(always include a date dimension when you build a DWH solution)
Hope that helps...

Can I add an identity column (non-PK) to table for use in application?

I am building a db in sql server, then I will build the web app. The data has some pretty good natural key columns that I will use as the PKs. However, several of them are composite keys, which will be a bit unwieldy in the application side.
For example, a golf course can have 1, 2, or 3 courses at one location. The location has a number (32201) and each course has a name (bayou front, bayou back, bayou exec) so I would make a composite key from the number and name.
However, in the application (asp mvc) the internal routing allows for passing a single integer id from controller to view, etc for use in identifying stuff. So I am thinking of adding an non-key identity column to the Golf Course table, and others like it, so I can us the identity filed in the app. So someone can select the bayou back course on the index page of the app, and I would pass the non-key identity id number to the controller to select the details for the course from the DB and pass them to the view.
It seems at first, like I would be bypassing the purpose of the primary key by using this non-key integer, but the PK would come into play for things like referential integrity when doing updates, inserts, etc.
Thoughts?
What you're talking about sounds like a surrogate key (as opposed to a natural key). There are pros and cons to using each, and both approaches have been discussed before, rather extensively, I think.
From the answer here: go ahead and use both. They are both tools, and each one should be used when it is the best tool for the job.
The only thing I'd add for your situation is that, since your tables would have both a surrogate and natural key on the same table, make sure that one (natural key or surrogate key) is the actual PK and that the other has a unique key or index on it. That way, either one can be used to uniquely identify rows in your tables.

Star schema: how to handle dimension table with constantly changing set of columns?

First project using star schema, still in planning stage. We would appreciate any thoughts and advice on the following problem.
We have a dimension table for "product features used", and the set of features grows and changes over time. Because of the dynamic set of features, we think the features cannot be columns but instead must be rows.
We have a fact table for "user events", and we need to know which product features were used within each event.
So it seems we need to have a primary key on the fact table, which is used as a foreign key within the dimension table (exactly the opposite direction from a conventional star schema). We have several different dimension tables with similar dynamics and therefore a similar need for a foreign key into the fact table.
On the other hand, most of our dimension tables are more conventional and the fact table can just store a foreign key into these conventional dimension tables. We don't like that this means that some joins (many-to-one) will use the dimension table's primary key, but other joins (one-to-many) will use the fact table's primary key. We have considered using the fact table key as a foreign key in all the dimension tables, just for consistency, although the storage requirements increase.
Is there a better way to implement the keys for the "dynamic" dimension tables?
Here's an example that's not exactly what we're doing but similar:
Suppose our app searches for restaurants.
Optional features that a user may specify include price range, minimum star rating, or cuisine. The set of optional features changes over time (for example we may get rid of the option to specify cuisine, and add an option for most popular). For each search that is recorded in the database, the set of features used is fixed.
Each search will be a row in the fact table.
We are currently thinking that we should have a primary key in the fact table, and it should be used as a foreign key in the "features" dimension table. So we'd have:
fact_table(search_id, user_id, metric1, metric2)
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, user_attribute1, user_attribute2)
Alternatively, for consistent joins and ignoring storage requirements for the sake of argument, we could use the fact table's primary key as a foreign key in all the dimension tables:
fact_table(search_id, metric1, metric2) /* no more user_id */
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, search_id, user_attribute1, user_attribute2)
What are the pitfalls with these key schemas? What would be better ways to do it?
You need a Bridge table, it is the recommended solution for many-to-many relationships between fact and dimension.
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/multivalued-dimension-bridge-table/
Edit after example added to question:
OK, maybe it is not a bridge, the example changes my view.
A fundamental requirement of dimensional modelling is to correctly identify the grain of your fact table. A common example is invoice and line-item, where the grain is usually line-item.
Hypothetical examples are often difficult because you can never be sure that the example mirrors the real use case, but I think that your scenario might be search-and-criteria, and that your grain should be at the criteria level.
For example, your fact table might look like this:
fact_search (date_id,time_id,search_id,criteria_id,criteria_value)
Thinking about the types of query I might want to do against search data, this design is my best choice. The only issue I see is with the data type of criteria_value, it would have to be a choice/text value, and would definitely be non-additive.

How to create a fact table using natural keys

We've got a data warehouse design with four dimension tables and one fact table:
dimUser id, email, firstName, lastName
dimAddress id, city
dimLanguage id, language
dimDate id, startDate, endDate
factStatistic id, dimUserId, dimAddressId, dimLanguageId, dimDate, loginCount, pageCalledCount
Our problem is: We want to build the fact table which includes calculating the statistics (depending on userId, date range) and filling the foreign keys.
But we don't know how, because we don't understand how to use natural keys (which seems to be the solution to our problem according to the literature we read).
I believe a natural key would be the userId, which is needed in all ETL jobs which calculate the dimension data.
But there are many difficulties:
in the ETL jobs load(), we do bulk inserts with INSERT IGNORE INTO to remove duplicates => we don't know the surrogate keys which were generated
if we create meta data (including a set of dimension_name, surrogate_key, natural_key) this will not work because of the duplicate elimination
The problem seems to be the duplicate elimination strategy. Is there a better approach?
We are using MySQL 5.1, if it makes any difference.
If your fact table is tracking logins and page calls per user, then you should have set of source tables which track these things, which is where you'll load your fact table data from. I would probably build the fact table at the grain of one row per user / login date - or even lower to persist atomic data if at all possible.
Here you would then have a fact table with two dimensions - User and Date. You can persist address and language as dimensions on the fact as well, but these are really just attributes of user.
Your dimensions should have surrogate keys, but also should have the source "business" or "natural" key available - either as an attribute on the dimension itself, or through a mapping table as your colleague suggested. It's not "wrong" to use a mapping table - it does make things easier when there are multiple sources.
If you store the business keys on a mapping table, or in the dimension as an attribue, then for each row to load in the fact, it's a simple lookup (usually via a join) against the dim or mapping table to get the surrogate key for the user (and then from the user to get the user's "current" address / language to persist on the fact). The date dimension usually hase a surrogate key stored in a YYYYMMDD or other "natural" format - you can just generate this from the date information on your source record that you're loading into the fact.
do not force for single query, try to load the data in separated queries and mix the data in some provider...

Why is it not good to have a primary key on a join table?

I was watching a screencast where the author said it is not good to have a primary key on a join table but didn't explain why.
The join table in the example had two columns defined in a Rails migration and the author added an index to each of the columns but no primary key.
Why is it not good to have a primary key in this example?
create_table :categories_posts, :id => false do |t|
t.column :category_id, :integer, :null => false
t.column :post_id, :integer, :null => false
end
add_index :categories_posts, :category_id
add_index :categories_posts, :post_id
EDIT: As I mentioned to Cletus, I can understand the potential usefulness of an auto number field as a primary key even for a join table. However in the example I listed above, the author explicitly avoids creating an auto number field with the syntax ":id => false" in the "create table" statement. Normally Rails would automatically add an auto-number id field to a table created in a migration like this and this would become the primary key. But for this join table, the author specifically prevented it. I wasn't sure why he decided to follow this approach.
Some notes:
The combination of category_id and post_id is unique in of itself, so an additional ID column is redundant and wasteful
The phrase "not good to have a primary key" is incorrect in the screencast. You still have a Primary Key -- it is just made up of the two columns (e.g. CREATE TABLE foo( cid, pid, PRIMARY KEY( cid, pid ) ). For people who are used to tacking on ID values everywhere this may seem odd but in relational theory it is quite correct and natural; the screencast author would better have said it is "not good to have an implicit integer attribute called 'ID' as the primary key".
It is redundant to have the extra column because you will place a unique index on the combination of category_id and post_id anyway to ensure no duplicate rows are inserted
Finally, although common nomenclature is to call it a "composite key" this is also redundant. The term "key" in relational theory is actually the set of zero or more attributes that uniquely identify the row, so it is fine to say that the primary key is category_id, post_id
Place the MOST SELECTIVE column FIRST in the primary key declaration. A discussion of the construction of b(+/*) trees is out of the scope of this answer ( for some lower-level discussion see: http://www.akadia.com/services/ora_index_selectivity.html ) but in your case, you'd probably want it on post_id, category_id since post_id will show up less often in the table and thus make the index more useful. Of course, since the table is so small and the index will be, essentially, the data rows, this is not very important. It would be in broader cases where the table is wider.
It is a bad idea not to have a primary key on any table, period (if the DBMS is a relational DBMS - or an SQL DBMS). Primary keys are a crucial part of the integrity of your database.
I suppose if you don't mind your database being inaccurate and providing incorrect answers every so often, then you could do without...but most people want accurate answers from their DBMS and for such people, primary keys are crucial.
A DBA would tell you that the primary key in this case is actually the combination of the two FK columns. Since Rails/ActiveRecord doesn't play nice with composite PKs (by default, at least), that may be the reason.
The combination of foreign keys can be a primary key (called a composite primary key). Personally I favour using a technical primary key instead of that (auto number field, sequence, etc). Why? Well, it makes it much easier to identify the record, which you may need to do if you're going to delete it.
Think about it: if you're going to present a Webpage of all the linkages, having a primary key to identify the record makes it much easier.
Basically because there's no need for it. The combination of the two foreign key field adequately uniquely identifies any row.
But that merely says why it's not a Good Idea.... but why would it be a Bad Idea?
Consider the overhead adding a identity column would add. The table would take up 50% more disk space. Worse is the index situation. With a identity field, you have to maintain the identity count, plus a second index. You'll be tripling the disk space and tripling the work the needs to be performed on every insert. With the only advantage being a slightly shorter WHERE clause in a DELETE command.
On the other hand, If the composite key fields are the entire table, then the index can be the table.
Placing the most selective column first should only be relevant in the INDEX declaration. In the KEY declaration, it should not matter (because, as has been correctly pointed out, the KEY is a SET, and inside a set, order doesn't matter - the set {a1,a2} is the same set as {a2,a1}).
If a DBMS product is such that ordering of attributes inside a KEY declaration makes a difference, then that DBMS product is guilty of not properly distinguishing between the logical design of a database (the part where you do the KEY declaration) and the physical design of the database (the part where you do the INDEX declaration).
I wanted to comment on the following comment : "It is not correct to say zero or more".
I wanted to remark that the text to which this comment was added simply did not contain the text "zero or more", so the author of the comment I wanted to comment on was criticizing someone else for something that hadn't been said.
I also wanted to comment that it is not correct to say that it is not correct say "zero or more". Relational theory as commonly known today among the few people who still bother to study the details of that theory, actually REQUIRES the possibility of a key with no attributes.
But when I pressed the button "comment", the system responded to me that commenting requires a reputation score of 50 (or some such).
A sad illustration of how the world seems to have forgotten that science is not democracy, and that in science, the truth is not determined by whoever happens to be the majority, nor by whoever happens to have "enough reputation".
Pros of having a single PK
Uniquely identifies a row with a single value
Makes it easy to reference the relationship from elsewhere if needed
Some tools want you to have a single integer value pk
Cons of having a single PK
Uses more disk space
Need 3 indexes rather than 1
Without a unique constraint you could end up with multiple rows for the same relationship
Notes
You need to define a unique constraint if you want to avoid duplicates
In my opinion don't use the single pk if you're table is going to be huge, otherwise trade off some disk space for the convenience. Yes it's wasteful, but who cares about a few MB on disk in real world applications.

Resources