I have a fact table named sales which has FKs to the dimensions product and store. Each of these dimensions holds information about that dimension alone, but I also have some information about a product in a specific store, such as where the product is located in that store.
I am tempted to model a dimension whose primary key is a combination of product and store. Is it OK to do that, or does a better alternative exist?
my thoughts...
Having a 3rd dimension for location is definitely a viable option. You could also include store details within this Dim (but still have the location as its level of granularity) and have a Location > Store hierarchy
You won't find references to a dimension having a PK with multiple columns because that would break one of the fundamental design principles of dimensional modelling: each dimension row is identified by a single surrogate key column.
I'm confused/surprised by your statement that your source system is generating surrogate keys. Given that surrogate keys (in this context) are entirely an artefact within a data warehouse, it seems unlikely that a source system would be generating them.
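To make the third-dimension idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. All table and column names (dim_location, aisle, shelf, quantity) and the sample rows are hypothetical, not taken from the question. The point is only that the location dimension gets its own single-column surrogate key and carries the store details as attributes, instead of using a composite (product, store) key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, store_name   TEXT);

-- Hypothetical third dimension: one row per in-store location,
-- identified by a single surrogate key. Store details ride along
-- as attributes, which gives a Location > Store hierarchy.
CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY,
    store_name   TEXT,
    aisle        TEXT,
    shelf        TEXT
);

CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    store_key    INTEGER REFERENCES dim_store(store_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    quantity     INTEGER
);

INSERT INTO dim_product VALUES (1, 'Milk'), (2, 'Bread');
INSERT INTO dim_store VALUES (1, 'Downtown');
INSERT INTO dim_location VALUES
    (10, 'Downtown', 'A1', 'Top'),
    (11, 'Downtown', 'B4', 'Bottom');
INSERT INTO fact_sales VALUES (1, 1, 10, 3), (2, 1, 11, 5), (1, 1, 10, 2);
""")

# Quantity sold per aisle: a single join from the fact to the location dimension.
rows = conn.execute("""
    SELECT l.store_name, l.aisle, SUM(f.quantity)
    FROM fact_sales f
    JOIN dim_location l ON l.location_key = f.location_key
    GROUP BY l.store_name, l.aisle
    ORDER BY l.aisle
""").fetchall()
print(rows)
```

Whether the separate store dimension is still needed once the location dimension carries the store attributes depends on how the facts are queried, so treat this as one possible layout rather than the answer.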
Be careful: another dimension means more joins, and more joins mean more complex queries.
You can stick to a simple model:
I am doing a course in BI development, and in order to solidify the skills I have learned so far, I have started a hands-on project (DWH design, ETL application, data modeling, etc.).
During the DWH design, I have encountered a dilemma, so I would appreciate getting some best practices from more experienced pros.
The design has two dimension tables:
DimWeather - a table that stores weather data. Each row is a day. The primary key is, of course, the date of the corresponding day.
DimDate - a simple calendar table. The Primary key, in this case, is also a date.
Both tables are connected to a Fact table that stores a bike rental log.
Following DWH design best practices, I need to create a surrogate key - let's call it DateKey for both of the tables.
I am wondering how to do this in this situation.
To the best of my knowledge, the surrogate key for each table has to be unique, but on the other hand the surrogate key would follow the same logic in both tables.
I would be glad to hear what you would do.
Thanks a lot for your time and effort.
In my opinion, DimWeather should be a fact table if it stores measures and numerical data about the weather on a specific date (temperature, air pressure, humidity, etc.), and the dateID in this table should reference the "regular" DimDate table.
By definition, dimension tables should contain the attributes/hierarchies that put measures from a fact table into a specific context (time, location, demography, etc.). In your scenario, you would put weather measures into a specific context (e.g. AVG temperature in NYC in February 2020, MAX humidity in LA in December 2019, etc., depending on the structure of your Weather table).
Other than that, a surrogate key is just a meaningless value (usually an integer with an identity/auto-increment property, to guarantee uniqueness of the key), EXCEPT for the DimDate dimension, where you can give the surrogate key meaning by deriving the integer from the date value (for example: 20200311 for '2020-03-11'). Of course, it's not forbidden to use the source primary key as the key in a dimension table, but it's bad practice, since the same value can occur in different source systems and that can cause problems when you load data into the DWH.
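As a small illustration of the date-based key mentioned above, the integer can be computed directly from the date. The function name date_key is my own choice; only the 20200311 example comes from the answer.

```python
from datetime import date


def date_key(d: date) -> int:
    """Build a YYYYMMDD integer key from a date, e.g. 2020-03-11 -> 20200311."""
    return d.year * 10000 + d.month * 100 + d.day


print(date_key(date(2020, 3, 11)))  # 20200311
```

Because the key is derived from the date itself, the rental fact and a weather fact can both compute the same DateKey independently and join to the one DimDate table.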
I am denormalising an OLTP database for use in a DWH.
At the moment I am denormalising studygroups.
Each studygroup has a key pointing towards 1 project.
Each project has a key pointing towards 1 department.
Each department has a key pointing towards 1 university.
Each university has a key pointing to 1 city.
Now I know that you are supposed to denormalize the sh*t out of your OLTP, but in this DWH department will be a dimension on its own. This goes for university also. Would it suffice to add a key from studygroup pointing at department, or is it wiser to denormalize as far as you can and add all attributes from the department and all attributes from its M:1 related tables to the dimension studygroup? Even when department and university will be dimensions by themselves?
In other words: how far/deep do you go when denormalizing?
The key concept behind a dimensional model is:
Keep your fact tables in 3NF (third normal form);
De-normalize your dimensions into 2NF (second normal form)
So ideally, the only joins you should have in your model are the joins between fact tables and relevant dimensions.
As part of this philosophy:
Avoid "snow flake" designs, where dimensions contain keys to other dimensions. It's always possible to come up with a data model that allows the same functionality as the snow flakes, without violating 3NF/2NF rule;
Never have any direct joins between 2 separate dimensions (i.e, department and study group) directly. All relations among dimensions must be resolved via fact tables;
Never have any direct joins between 2 separate fact tables. Any relations among fact tables must be resolved via shared dimensions.
Finally, consider that dimensional design, besides optimizing the data for querying, serves a second important purpose: it's a semantic model of the business (or whatever else it represents). So, when making decisions about combining data elements into dimensions and facts, consider their "logical affinity": they should make intuitive sense to the end users. If you have a hard time explaining the meaning of your dimension or fact table to a BI analyst, most likely you've made a modeling mistake.
For example, in your case you should consider the logical relations between universities, departments, study groups, etc. It's very likely that University/Department form a natural hierarchy. If so, they should belong to the same dimension. Study group, on the other hand, might not: let's assume it's possible to form study groups across multiple universities and/or multiple departments. Such Many:Many relations are a clear indication that they should be resolved via fact tables. In addition, relations between universities and departments are stable (they rarely change), while study groups are formed and dissolved very often, and thus should be modeled separately.
In general, if you see 1:1 or 1:M relations between dimensional elements, it's often an indication that they should be de-normalized into the same table (again, only if their combination makes logical sense). If the relations are M:M, most likely they belong to different tables (you can force them into the same table, but often such tables look like Frankenstein creatures).
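Here is a rough sketch of that advice using Python's sqlite3 module. The table names, the hours measure and the sample rows are all invented for illustration, and it assumes (as above) that a study group can span departments. Department, university and city are flattened into one dimension, and study group only meets department through the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The stable M:1 chain department -> university -> city is flattened
-- into a single dimension row.
CREATE TABLE dim_department (
    department_key  INTEGER PRIMARY KEY,
    department_name TEXT,
    university_name TEXT,
    city_name       TEXT
);

-- Study groups change often and may span departments, so they stay separate.
CREATE TABLE dim_studygroup (
    studygroup_key  INTEGER PRIMARY KEY,
    studygroup_name TEXT
);

-- The fact table is the only place where the two dimensions are related.
CREATE TABLE fact_participation (
    studygroup_key INTEGER REFERENCES dim_studygroup(studygroup_key),
    department_key INTEGER REFERENCES dim_department(department_key),
    hours          REAL
);

INSERT INTO dim_department VALUES
    (1, 'Physics',   'Univ A', 'City X'),
    (2, 'Chemistry', 'Univ A', 'City X'),
    (3, 'History',   'Univ B', 'City Y');
INSERT INTO dim_studygroup VALUES (1, 'Group 1'), (2, 'Group 2');
-- Group 1 spans two departments at two universities.
INSERT INTO fact_participation VALUES (1, 1, 4.0), (1, 3, 2.0), (2, 2, 3.0);
""")

# Rolling up to university needs no dimension-to-dimension join.
rows = conn.execute("""
    SELECT d.university_name, SUM(f.hours)
    FROM fact_participation f
    JOIN dim_department d ON d.department_key = f.department_key
    GROUP BY d.university_name
    ORDER BY d.university_name
""").fetchall()
print(rows)
```

The price of flattening is that university and city names repeat on every department row, which is the usual trade-off accepted in a star schema.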
You can get much better help by making your question more specific - draw your dimensional model, post it, and ask for specific issues/challenges you have. For general concepts, books from Kimball and Inmon are your best friends.
First project using star schema, still in planning stage. We would appreciate any thoughts and advice on the following problem.
We have a dimension table for "product features used", and the set of features grows and changes over time. Because of the dynamic set of features, we think the features cannot be columns but instead must be rows.
We have a fact table for "user events", and we need to know which product features were used within each event.
So it seems we need to have a primary key on the fact table, which is used as a foreign key within the dimension table (exactly the opposite direction from a conventional star schema). We have several different dimension tables with similar dynamics and therefore a similar need for a foreign key into the fact table.
On the other hand, most of our dimension tables are more conventional and the fact table can just store a foreign key into these conventional dimension tables. We don't like that this means that some joins (many-to-one) will use the dimension table's primary key, but other joins (one-to-many) will use the fact table's primary key. We have considered using the fact table key as a foreign key in all the dimension tables, just for consistency, although the storage requirements increase.
Is there a better way to implement the keys for the "dynamic" dimension tables?
Here's an example that's not exactly what we're doing but similar:
Suppose our app searches for restaurants.
Optional features that a user may specify include price range, minimum star rating, or cuisine. The set of optional features changes over time (for example we may get rid of the option to specify cuisine, and add an option for most popular). For each search that is recorded in the database, the set of features used is fixed.
Each search will be a row in the fact table.
We are currently thinking that we should have a primary key in the fact table, and it should be used as a foreign key in the "features" dimension table. So we'd have:
fact_table(search_id, user_id, metric1, metric2)
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, user_attribute1, user_attribute2)
Alternatively, for consistent joins and ignoring storage requirements for the sake of argument, we could use the fact table's primary key as a foreign key in all the dimension tables:
fact_table(search_id, metric1, metric2) /* no more user_id */
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, search_id, user_attribute1, user_attribute2)
What are the pitfalls with these key schemas? What would be better ways to do it?
You need a bridge table; it is the recommended solution for many-to-many relationships between a fact and a dimension.
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/multivalued-dimension-bridge-table/
Edit after example added to question:
OK, maybe it is not a bridge, the example changes my view.
A fundamental requirement of dimensional modelling is to correctly identify the grain of your fact table. A common example is invoice and line-item, where the grain is usually line-item.
Hypothetical examples are often difficult because you can never be sure that the example mirrors the real use case, but I think that your scenario might be search-and-criteria, and that your grain should be at the criteria level.
For example, your fact table might look like this:
fact_search (date_id,time_id,search_id,criteria_id,criteria_value)
Thinking about the types of query I might want to run against search data, this design is my best choice. The only issue I see is with the data type of criteria_value: it would have to be a choice/text value, and would definitely be non-additive.
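To show what querying at the criteria grain could look like, here is a minimal sketch with Python's sqlite3 module. The fact_search columns follow the layout above; the dim_criteria table, the criteria names and the sample values are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical dimension: one row per search option the app offers.
CREATE TABLE dim_criteria (
    criteria_id   INTEGER PRIMARY KEY,
    criteria_name TEXT
);

-- Grain: one row per criterion used within a search.
CREATE TABLE fact_search (
    date_id        INTEGER,
    time_id        INTEGER,
    search_id      INTEGER,
    criteria_id    INTEGER REFERENCES dim_criteria(criteria_id),
    criteria_value TEXT
);

INSERT INTO dim_criteria VALUES
    (1, 'price_range'), (2, 'min_stars'), (3, 'cuisine');
INSERT INTO fact_search VALUES
    (20200311, 930, 100, 1, '$$'),
    (20200311, 930, 100, 3, 'thai'),
    (20200311, 945, 101, 3, 'thai'),
    (20200311, 950, 102, 2, '4');
""")

# How many distinct searches used each criterion?
usage = conn.execute("""
    SELECT c.criteria_name, COUNT(DISTINCT f.search_id)
    FROM fact_search f
    JOIN dim_criteria c ON c.criteria_id = f.criteria_id
    GROUP BY c.criteria_name
    ORDER BY c.criteria_name
""").fetchall()
print(usage)
```

Adding or retiring a search option then only means adding a row to dim_criteria. One thing to watch with this grain is that a search with no criteria at all produces no fact rows, so it would need its own handling if those searches matter.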
I have master data with both the material and product details in a single table. I am creating a star schema, and my question is: do I need to make two dimension tables with separate material attributes and product attributes, or can I have both in a single dimension table? The current master data has the following fields:
Material id, name, type, product hier 1,2,3,4...product hierarchy, product category, sub category. In my case both material and product are same, so a single id.
I am thinking of making it in a single table, but is that the best practice? Any future potential issues?
Many thanks in advance,
Arun
The important (and obvious) thing is that the fact table has two separate foreign keys, PRODUCT_ID and MATERIAL_ID, both referencing your single dimension table.
This setup is not always best practice for OLTP systems, because in this case the database can't enforce the referential integrity. (You may store a product ID in the MATERIAL_ID column).
But for data-warehouse the database constraints are typically not enabled and are enforced in the loading job, so this setup is fine.
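A minimal sketch of that setup with Python's sqlite3 module: one dimension table joined twice under different aliases, once per foreign key. The names dim_item and fact_production, the units measure and the sample rows are hypothetical; only PRODUCT_ID and MATERIAL_ID come from the answer. No foreign key constraints are declared, mirroring the point that integrity is enforced by the loading job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Single dimension holding both products and materials.
CREATE TABLE dim_item (
    item_id  INTEGER PRIMARY KEY,
    name     TEXT,
    category TEXT
);

-- Two separate foreign keys into the same dimension.
CREATE TABLE fact_production (
    product_id  INTEGER,
    material_id INTEGER,
    units       INTEGER
);

INSERT INTO dim_item VALUES
    (1, 'Table', 'Furniture'),
    (2, 'Chair', 'Furniture'),
    (3, 'Oak',   'Wood'),
    (4, 'Pine',  'Wood');
INSERT INTO fact_production VALUES (1, 3, 10), (2, 3, 4), (2, 4, 6);
""")

# Units by product category and material: dim_item is joined once per role.
rows = conn.execute("""
    SELECT p.category, m.name, SUM(f.units)
    FROM fact_production f
    JOIN dim_item p ON p.item_id = f.product_id
    JOIN dim_item m ON m.item_id = f.material_id
    GROUP BY p.category, m.name
    ORDER BY m.name
""").fetchall()
print(rows)
```

In a BI tool the two roles are often exposed as two views over the same table so that end users see a "product" and a "material" dimension.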
The decision to split depends more on the origin of the two dimensions. If both of them are maintained together, I see no reason to split them. If the two dimensions are independent, with different lifecycles and separate sources, there is no reason to combine them.
And BTW, Kimball IMO talks about splitting hierarchy levels (not separate dimensions). So he sees it as a mistake to split the product attributes from the hierarchy and category attributes (which is not your problem).
It depends on your business requirement.
If you ever need to produce a report that shows (say) units produced of product category by material, then you need to keep them in separate dimensions.
We've got a data warehouse design with four dimension tables and one fact table:
dimUser id, email, firstName, lastName
dimAddress id, city
dimLanguage id, language
dimDate id, startDate, endDate
factStatistic id, dimUserId, dimAddressId, dimLanguageId, dimDate, loginCount, pageCalledCount
Our problem is: We want to build the fact table which includes calculating the statistics (depending on userId, date range) and filling the foreign keys.
But we don't know how, because we don't understand how to use natural keys (which seems to be the solution to our problem according to the literature we read).
I believe a natural key would be the userId, which is needed in all ETL jobs which calculate the dimension data.
But there are many difficulties:
in the ETL jobs load(), we do bulk inserts with INSERT IGNORE INTO to remove duplicates => we don't know the surrogate keys which were generated
if we create meta data (including a set of dimension_name, surrogate_key, natural_key) this will not work because of the duplicate elimination
The problem seems to be the duplicate elimination strategy. Is there a better approach?
We are using MySQL 5.1, if it makes any difference.
If your fact table is tracking logins and page calls per user, then you should have a set of source tables which track these things, which is where you'll load your fact table data from. I would probably build the fact table at the grain of one row per user / login date, or even lower to persist atomic data if at all possible.
Here you would then have a fact table with two dimensions - User and Date. You can persist address and language as dimensions on the fact as well, but these are really just attributes of user.
Your dimensions should have surrogate keys, but also should have the source "business" or "natural" key available - either as an attribute on the dimension itself, or through a mapping table as your colleague suggested. It's not "wrong" to use a mapping table - it does make things easier when there are multiple sources.
If you store the business keys on a mapping table, or in the dimension as an attribute, then for each row to load in the fact, it's a simple lookup (usually via a join) against the dim or mapping table to get the surrogate key for the user (and then from the user to get the user's "current" address / language to persist on the fact). The date dimension usually has a surrogate key stored in a YYYYMMDD or other "natural" format, so you can just generate this from the date information on the source record that you're loading into the fact.
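The lookup described above could look roughly like this. This is a sketch in Python with SQLite, where INSERT OR IGNORE stands in for MySQL's INSERT IGNORE. The source_user_id column, the stage_logins staging table and the sample rows are assumptions for illustration; the question's schema does not name them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_user (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    source_user_id INTEGER UNIQUE,                     -- natural key (assumed column)
    email          TEXT
);
-- Hypothetical staging table holding raw login events from the source.
CREATE TABLE stage_logins (source_user_id INTEGER, login_date TEXT);
CREATE TABLE fact_statistic (
    dim_user_id INTEGER,
    dim_date    INTEGER,
    login_count INTEGER
);
""")

# Bulk load with duplicate elimination: the repeated user 501 is skipped
# because of the UNIQUE constraint on the natural key.
users = [(501, "a@example.com"), (502, "b@example.com"), (501, "a@example.com")]
conn.executemany(
    "INSERT OR IGNORE INTO dim_user (source_user_id, email) VALUES (?, ?)", users
)

conn.executemany(
    "INSERT INTO stage_logins VALUES (?, ?)",
    [(501, "2020-03-11"), (501, "2020-03-11"), (502, "2020-03-11"), (501, "2020-03-12")],
)

# The fact load never needs the generated keys returned by the insert:
# it looks them up by joining on the natural key, and derives the
# YYYYMMDD date key from the source date.
conn.execute("""
    INSERT INTO fact_statistic (dim_user_id, dim_date, login_count)
    SELECT u.id,
           CAST(REPLACE(s.login_date, '-', '') AS INTEGER),
           COUNT(*)
    FROM stage_logins s
    JOIN dim_user u ON u.source_user_id = s.source_user_id
    GROUP BY u.id, s.login_date
""")

facts = conn.execute("""
    SELECT dim_user_id, dim_date, login_count
    FROM fact_statistic
    ORDER BY dim_user_id, dim_date
""").fetchall()
print(facts)
```

With the natural key kept on the dimension, the duplicate elimination in the dimension load stops being a problem, because the fact load resolves surrogate keys afterwards instead of capturing them at insert time.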
Do not force everything into a single query; try to load the data in separate queries and combine the data in some provider...