In my Fact_Table I have several Date fields such as:
order_date
payment_date
purchasing_date
estimated_delivery_date
actual_delivery_date
...
How do I choose which of these should be linked to the Date_Dimension, and what should I do with the others?
Thank you for your advice,
You don't need to use foreign keys in a data warehouse, as your ETL should take care of integrity. Also, you might want hot-swappable dimensions in the future, and they don't use foreign keys.
Usually, a "smart" key is a bad idea, though I make an exception for dates, as it makes it easy to partition fact tables by date. Use an int type, and values like 20160201 (for Feb 1 2016).
You can, of course, join tables in SQL without foreign keys.
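To make that concrete, here is a minimal sketch (table and column names are invented for illustration) where every date field in the fact table is stored as an integer key into a single date dimension:

CREATE TABLE dim_date (
    date_key   INT PRIMARY KEY,   -- e.g. 20160201 for Feb 1 2016
    full_date  DATE NOT NULL,
    year       SMALLINT NOT NULL,
    month      TINYINT NOT NULL,
    day        TINYINT NOT NULL
);

CREATE TABLE fact_order (
    order_id                    BIGINT,
    order_date_key              INT,   -- each date column is its own reference into dim_date
    payment_date_key            INT,
    purchasing_date_key         INT,
    estimated_delivery_date_key INT,
    actual_delivery_date_key    INT,
    amount                      DECIMAL(12,2)
    -- no FOREIGN KEY constraints; the ETL is responsible for integrity
);

-- joining without a declared foreign key still works fine:
SELECT d.year, d.month, SUM(f.amount)
FROM fact_order f
JOIN dim_date d ON d.date_key = f.order_date_key
GROUP BY d.year, d.month;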
I am doing a course in BI development, and in order to solidify all the skills I have learned so far, I have started a hands-on project (DWH design, ETL application, data modeling, etc.).
During the DWH design, I have encountered a dilemma, so I would appreciate getting some best practices from more experienced pros.
It has 2 dimension tables:
DimWeather - a table that stores weather data. Each row is a day. The primary key is, of course, the date of the corresponding day.
DimDate - a simple calendar table. The Primary key, in this case, is also a date.
Both tables are connected to a Fact table that stores a bike rental log.
Following DWH design best practices, I need to create a surrogate key - let's call it DateKey for both of the tables.
I am wondering, how to execute it in this situation?
To the best of my knowledge, the surrogate key for each table has to be unique, but in this case both surrogate keys would follow exactly the same logic.
I would be glad to hear what you would do.
Thanks a lot for putting in the time and effort.
In my opinion, DimWeather should be a fact table if it stores measures and numerical data about the weather on a specific date (temperature, air pressure, humidity, etc.), and the dateID in this table should reference the "regular" DimDate table.
By definition, dimension tables should contain different attributes/hierarchies in order to put measures from the fact table into a specific context (time, location, demography, etc.). In your scenario, you would put weather measures into a specific context (i.e. AVG temperature in NYC in February 2020; MAX humidity in LA in December 2019, etc., depending on the structure of your weather table).
Other than that, a surrogate key is just a meaningless value (usually an integer with an identity feature to guarantee uniqueness of the key), EXCEPT for the DimDate dimension, where you can give the surrogate key meaning by creating integer values based on the date value (for example: 20200311 for '2020-03-11'). Of course, it's not forbidden to use the source primary key as the key in a dimension table, but it's bad practice, since you can end up with the same value in different source systems, and that can cause problems when you load data into the DWH.
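As a rough sketch of that layout (table and column names are assumptions, not your actual schema):

CREATE TABLE DimDate (
    DateKey  INT PRIMARY KEY,   -- meaningful surrogate key, e.g. 20200311 for '2020-03-11'
    FullDate DATE NOT NULL,
    Year     SMALLINT NOT NULL,
    Month    TINYINT NOT NULL,
    Day      TINYINT NOT NULL
);

CREATE TABLE FactWeather (
    DateKey     INT NOT NULL,   -- references DimDate.DateKey
    Temperature DECIMAL(5,2),
    AirPressure DECIMAL(7,2),
    Humidity    DECIMAL(5,2)
);

CREATE TABLE FactRental (
    DateKey     INT NOT NULL,   -- the same DimDate serves the bike rental fact as well
    RentalCount INT
);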
First project using star schema, still in planning stage. We would appreciate any thoughts and advice on the following problem.
We have a dimension table for "product features used", and the set of features grows and changes over time. Because of the dynamic set of features, we think the features cannot be columns but instead must be rows.
We have a fact table for "user events", and we need to know which product features were used within each event.
So it seems we need to have a primary key on the fact table, which is used as a foreign key within the dimension table (exactly the opposite direction from a conventional star schema). We have several different dimension tables with similar dynamics and therefore a similar need for a foreign key into the fact table.
On the other hand, most of our dimension tables are more conventional and the fact table can just store a foreign key into these conventional dimension tables. We don't like that this means that some joins (many-to-one) will use the dimension table's primary key, but other joins (one-to-many) will use the fact table's primary key. We have considered using the fact table key as a foreign key in all the dimension tables, just for consistency, although the storage requirements increase.
Is there a better way to implement the keys for the "dynamic" dimension tables?
Here's an example that's not exactly what we're doing but similar:
Suppose our app searches for restaurants.
Optional features that a user may specify include price range, minimum star rating, or cuisine. The set of optional features changes over time (for example we may get rid of the option to specify cuisine, and add an option for most popular). For each search that is recorded in the database, the set of features used is fixed.
Each search will be a row in the fact table.
We are currently thinking that we should have a primary key in the fact table, and it should be used as a foreign key in the "features" dimension table. So we'd have:
fact_table(search_id, user_id, metric1, metric2)
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, user_attribute1, user_attribute2)
Alternatively, for consistent joins and ignoring storage requirements for the sake of argument, we could use the fact table's primary key as a foreign key in all the dimension tables:
fact_table(search_id, metric1, metric2) /* no more user_id */
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, search_id, user_attribute1, user_attribute2)
What are the pitfalls with these key schemas? What would be better ways to do it?
You need a bridge table; it is the recommended solution for many-to-many relationships between a fact and a dimension.
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/multivalued-dimension-bridge-table/
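For reference, a bridge design for this scenario might be sketched like this (names are illustrative, not prescriptive):

CREATE TABLE fact_search (
    search_id         BIGINT,
    feature_group_key INT,   -- points at a group of features via the bridge
    user_key          INT,
    metric1           INT,
    metric2           INT
);

CREATE TABLE bridge_feature_group (
    feature_group_key INT,   -- one group per distinct combination of features
    feature_key       INT    -- one row per feature in that group
);

CREATE TABLE dim_feature (
    feature_key        INT,
    feature_attribute1 VARCHAR(50),
    feature_attribute2 VARCHAR(50)
);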
Edit after example added to question:
OK, maybe it is not a bridge; the example changes my view.
A fundamental requirement of dimensional modelling is to correctly identify the grain of your fact table. A common example is invoice and line-item, where the grain is usually line-item.
Hypothetical examples are often difficult because you can never be sure that the example mirrors the real use case, but I think that your scenario might be search-and-criteria, and that your grain should be at the criteria level.
For example, your fact table might look like this:
fact_search (date_id,time_id,search_id,criteria_id,criteria_value)
Thinking about the types of query I might want to run against search data, this design is my best choice. The only issue I see is with the data type of criteria_value: it would have to be a choice/text value, and it would definitely be non-additive.
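For instance, a query like the following (dim_criteria, dim_date and their columns are assumed here) would tell you how often each criterion is used over time:

SELECT d.year, c.criteria_name, COUNT(*) AS searches_using_criteria
FROM fact_search f
JOIN dim_criteria c ON c.criteria_id = f.criteria_id
JOIN dim_date d     ON d.date_id     = f.date_id
GROUP BY d.year, c.criteria_name;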
We've got a data warehouse design with four dimension tables and one fact table:
dimUser id, email, firstName, lastName
dimAddress id, city
dimLanguage id, language
dimDate id, startDate, endDate
factStatistic id, dimUserId, dimAddressId, dimLanguageId, dimDate, loginCount, pageCalledCount
Our problem is that we want to build the fact table, which involves calculating the statistics (per userId and date range) and filling in the foreign keys.
But we don't know how, because we don't understand how to use natural keys (which seems to be the solution to our problem according to the literature we read).
I believe a natural key would be the userId, which is needed in all ETL jobs which calculate the dimension data.
But there are many difficulties:
in the ETL jobs' load() step, we do bulk inserts with INSERT IGNORE INTO to eliminate duplicates => we don't know which surrogate keys were generated
if we create metadata (a set of dimension_name, surrogate_key, natural_key), this will not work because of the duplicate elimination
The problem seems to be the duplicate elimination strategy. Is there a better approach?
We are using MySQL 5.1, if it makes any difference.
If your fact table is tracking logins and page calls per user, then you should have a set of source tables which track these things, and that is where you'll load your fact table data from. I would probably build the fact table at the grain of one row per user / login date - or even lower, to persist atomic data if at all possible.
Here you would then have a fact table with two dimensions - User and Date. You can persist address and language as dimensions on the fact as well, but these are really just attributes of user.
Your dimensions should have surrogate keys, but they should also have the source "business" or "natural" key available - either as an attribute on the dimension itself, or through a mapping table as your colleague suggested. It's not "wrong" to use a mapping table - it does make things easier when there are multiple sources.
If you store the business keys on a mapping table, or in the dimension as an attribute, then for each row to load into the fact it's a simple lookup (usually via a join) against the dim or mapping table to get the surrogate key for the user (and then from the user to get the user's "current" address / language to persist on the fact). The date dimension usually has a surrogate key stored in a YYYYMMDD or other "natural" format - you can just generate this from the date information on the source record that you're loading into the fact.
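A rough sketch of that lookup during the fact load (the staging table and some column names are invented; adjust to your actual schema):

INSERT INTO factStatistic (dimUserId, dimAddressId, dimLanguageId, dimDateId, loginCount, pageCalledCount)
SELECT du.id,
       du.currentAddressId,    -- assumes the user dimension carries its current address key
       du.currentLanguageId,   -- and language key
       CAST(DATE_FORMAT(s.activityDate, '%Y%m%d') AS UNSIGNED) AS dimDateId,   -- YYYYMMDD date key
       SUM(s.logins),
       SUM(s.pageCalls)
FROM stagingActivity s
JOIN dimUser du ON du.sourceUserId = s.userId   -- natural/business key -> surrogate key lookup
GROUP BY du.id, du.currentAddressId, du.currentLanguageId,
         CAST(DATE_FORMAT(s.activityDate, '%Y%m%d') AS UNSIGNED);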
Don't try to force this into a single query; load the data with separate queries and combine the results in some provider...
I currently have a model Feature that is used by various other models; in this example, it will be used by customer.
To keep things flexible, Feature is used to store things such as First Name, Last Name, Name, Date of Birth, Company Registration Number, etc.
You will have noticed a problem with this - while most of these are strings, features such as Date of Birth would ideally be stored in a column of type Date (and would be a datepicker rather than a text input in the view).
How would this best be handled? At the present time I simply have a string column "value"; I have considered using multiple value columns (e.g. string_value, date_value) but this doesn't seem particularly efficient as there will always be a null column in every record.
Would appreciate any advice on how to handle this - thanks!
There are a couple of ways I could see you going with this, depending on your needs. I'm not completely satisfied with any of these, but perhaps they can point you in the right direction:
Serialize Everything
Rails can store any object as a byte stream, and in Ruby everything is an object. So in theory you could store string representations of any object, including Strings, DateTimes, or even your own models in a database column. The Marshal module handles this for you most of the time, and allows you to write your own serialization methods if your objects have special needs.
Pros: Really store anything in a single database column.
Cons: Ability to work with the data in the database is minimal. It's basically impossible to use this column as anything other than storage - you (probably) wouldn't be able to sort or filter your data based on it, since the format won't be anything the database will recognize.
Columns for every datatype
This is basically the solution you suggested in the question - figure out exactly which datatypes you might need to store - you mention strings and datestamps. If there aren't too many of those, it's feasible to simply have a column of each type and only store data in one of them. You can override the attribute accessor functions to use the proper column, and from the outside, Feature will act as though .value is whatever you need it to be.
Pros: Only need one table.
Cons: At least one null value in every record.
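In terms of the underlying table, this option might look roughly like the following (column names are illustrative):

CREATE TABLE features (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    customer_id  INT NOT NULL,
    name         VARCHAR(100) NOT NULL,   -- e.g. 'First Name', 'Date of Birth'
    string_value VARCHAR(255),            -- filled for text features
    date_value   DATE                     -- filled for date features; the other column stays NULL
);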
Multiple Models/Tables
You could make a model for each of the sorts of Feature you might need - TextFeature, DateFeature, etc. This guide on Multiple Table Inheritance conveys the idea and methodology.
Pros: No null values - every record contains only the columns it needs.
Cons: Complexity. In addition to needing multiple models, you may find yourself doing complex joins and unions if you need to work directly with features of different kinds in the database.
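At the database level, this approach boils down to one table per feature type, roughly like this (names are illustrative):

CREATE TABLE text_features (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    customer_id INT NOT NULL,
    name        VARCHAR(100) NOT NULL,
    value       VARCHAR(255) NOT NULL   -- no unused columns in this table
);

CREATE TABLE date_features (
    id          INT AUTO_INCREMENT PRIMARY KEY,
    customer_id INT NOT NULL,
    name        VARCHAR(100) NOT NULL,
    value       DATE NOT NULL
);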
In my Rails application, I have a variety of database tables that contain user data. Some of these tables have a lot of rows (as many as 500,000 rows per user in some cases) and are queried frequently. Whenever I query any table for anything, the user_id of the current user is somewhere in the query - either directly, if the table has a direct relation with the user, or through a join, if they are related through some other tables.
Should I denormalize the user_id and include it in every table, for faster performance?
Here's one example:
Address belongs to user, and has a user_id
Envelope belongs to user, and has a user_id
AddressesEnvelopes joins an Address and an Envelope, so it has envelope_id and address_id -- it doesn't have user_id, but could get to it through either the envelope or the address (which must belong to the same user).
One common expensive query is to select all the AddressesEnvelopes for a particular user, which I could accomplish by joining with either Address or Envelope, even though I don't need anything from those tables. Or I could just duplicate the user id in this table.
Here's a different scenario:
Letter belongs to user, and has a user_id
Recepient belongs to Letter, and has a letter_id
RecepientOption belongs to Recepient, and has a recepient_id
Would it make sense to duplicate the user_id in both Recepient and RecepientOption, even though I could always get to it by going up through the associations, through Letter?
Some notes:
There are never any objects that are shared between users. An entire hierarchy of related objects always belongs to the same user.
The user owner of objects never changes.
Database performance is important because it's a data intensive application. There are many queries and many tables.
So should I include user_id in every table so I can use it when creating indexes? Or would that be bad design?
I'd like to point out that it isn't necessary to denormalize if you are willing to work with composite primary keys. Sample for the AddressesEnvelopes case:
user(
#user_id
)
address(
#user_id
, #address_num
)
envelope(
#user_id
, #envelope_num
)
address_envelope(
#user_id
, #address_num
, #envelope_num
)
(the # indicates a primary key column)
I am not a fan of this design if I can avoid it, but considering the fact that you say all these objects are tied to a user, this type of design would make it relatively simple to partition your data (either logically, putting ranges of users in separate tables, or physically, using multiple databases or even machines).
Another thing that would make sense with this type of design is using clustered indexes (in MySQL, the primary key of an InnoDB table is a clustered index). If you ensure user_id is always the first column in your index, then for each table all data for one user will be stored close together on disk. This is great when you always query by user_id, but it can hurt performance if you query by another object (in which case duplication like you suggested may be a better solution).
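In MySQL/InnoDB terms, the address_envelope table from the sketch above might be declared like this (assuming address and envelope carry the same composite keys; user_id comes first in the primary key so that one user's rows cluster together on disk):

CREATE TABLE address_envelope (
    user_id      INT NOT NULL,
    address_num  INT NOT NULL,
    envelope_num INT NOT NULL,
    PRIMARY KEY (user_id, address_num, envelope_num),   -- InnoDB clusters rows by this key
    FOREIGN KEY (user_id, address_num)  REFERENCES address  (user_id, address_num),
    FOREIGN KEY (user_id, envelope_num) REFERENCES envelope (user_id, envelope_num)
) ENGINE = InnoDB;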
At any rate, before you change the design, first make sure your schema is already optimized, and you have proper indexes on your foreign key columns. If performance really is paramount, you should simply try several solutions and do benchmarks.
As long as you
a) get a measurable performance improvement
and
b) know which parts of your database are real normalized data and which are redundant improvements
there is no reason not to do it!
Do you actually have a measured performance problem? 500,000 rows isn't a very large table. Your selects should be reasonably fast if they are not very complex and you have proper indexes on your columns.
I would first see if there are slow queries and try to optimize them with indexes. Only if that is not enough would I look into denormalization.
The denormalizations you suggest seem reasonable if you can't achieve the required performance by other means. Just make sure that you keep the denormalized fields up to date.
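If you do denormalize, the change itself is small; here is a sketch for the AddressesEnvelopes case (column names assumed to follow the Rails conventions above):

ALTER TABLE addresses_envelopes ADD COLUMN user_id INT;

-- backfill from the owning address, then index for the common lookup
UPDATE addresses_envelopes ae
JOIN addresses a ON a.id = ae.address_id
SET ae.user_id = a.user_id;

CREATE INDEX index_addresses_envelopes_on_user_id ON addresses_envelopes (user_id);

-- the expensive query no longer needs a join
SELECT * FROM addresses_envelopes WHERE user_id = 42;

The application then has to set user_id whenever it creates a row, which is the "keep it up to date" part.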