I have a case where I'm building a factless fact table for my DWH. There are 2 dimensions that I want to ask about for this case: location and store. I have 2 approaches.
1. Build dim_store and fact_account, then put all the location data into the fact_account table.
2. Build dim_store, dim_location, and fact_account, then put the store_id and location_id on the fact_account table.
Here is the visualization for these 2 approaches: [figure: approach 1] [figure: approach 2]
Which is the best approach and why?
Thank you in advance.
Option 1 is definitely wrong: what is described there is not a dimensional model.
Option 2 is a correctly designed dimensional model. Whether it is the best way to dimensionally model your data depends on your reporting requirements.
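As a minimal sketch of Option 2, assuming SQLite via Python's sqlite3 module: only the table names and the store_id / location_id keys come from the question; account_id, store_name, city, region and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (
    store_id   INTEGER PRIMARY KEY,
    store_name TEXT                 -- hypothetical attribute
);
CREATE TABLE dim_location (
    location_id INTEGER PRIMARY KEY,
    city        TEXT,               -- hypothetical attribute
    region      TEXT                -- hypothetical attribute
);
-- Factless fact table: it holds only keys, no numeric measures.
CREATE TABLE fact_account (
    account_id  INTEGER,            -- hypothetical degenerate key
    store_id    INTEGER REFERENCES dim_store(store_id),
    location_id INTEGER REFERENCES dim_location(location_id)
);
INSERT INTO dim_store VALUES (1, 'Store A'), (2, 'Store B');
INSERT INTO dim_location VALUES (10, 'City X', 'West'), (20, 'City Y', 'East');
INSERT INTO fact_account VALUES (100, 1, 10), (101, 1, 10), (102, 2, 20);
""")

# Counting rows of a factless fact table is the usual way to report on it.
accounts_per_region = conn.execute("""
    SELECT l.region, COUNT(*) AS accounts
    FROM fact_account AS f
    JOIN dim_location AS l ON l.location_id = f.location_id
    GROUP BY l.region
    ORDER BY l.region
""").fetchall()
print(accounts_per_region)
```

Because location attributes live in dim_location, a change to a location is made in one dimension row rather than in every fact row that mentions it.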
I have a fact named sales which has FKs to the dimensions product and store. Each of these dimensions has information about that dimension alone, but I have some information about a product in a specific store, like where a product is in that store.
I am tempted to model a dimension where the primary key is a combination of product and store. Is it OK to do that, or does some better alternative exist?
my thoughts...
Having a 3rd dimension for location is definitely a viable option. You could also include store details within this Dim (but still have the location as its level of granularity) and have a Location > Store hierarchy
You won't find references to a dimension having a PK with multiple columns because that would break one of the fundamental design principles of dimensional modelling
I'm confused/surprised by your statement that your source system is generating surrogate keys. Given that surrogate keys (in this context) are entirely an artefact within a data warehouse, it seems unlikely that a source system would be generating them
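One way to sketch the points above, assuming SQLite via Python's sqlite3 module: the third dimension gets a single-column surrogate key, and the product/store combination is kept only as a unique natural key. The names dim_product_placement, placement_key, aisle and fact_sales are hypothetical and not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product_placement (
    placement_key INTEGER PRIMARY KEY,   -- single-column surrogate key
    product_id    INTEGER NOT NULL,      -- natural key, part 1
    store_id      INTEGER NOT NULL,      -- natural key, part 2
    aisle         TEXT,                  -- where the product is in that store
    UNIQUE (product_id, store_id)
);
CREATE TABLE fact_sales (
    product_id    INTEGER,
    store_id      INTEGER,
    placement_key INTEGER REFERENCES dim_product_placement(placement_key),
    quantity      INTEGER
);
INSERT INTO dim_product_placement VALUES
    (1, 501, 1, 'Aisle 3'),
    (2, 501, 2, 'Aisle 7'),
    (3, 502, 1, 'Aisle 3');
INSERT INTO fact_sales VALUES
    (501, 1, 1, 4),
    (501, 2, 2, 1),
    (502, 1, 3, 5);
""")

# The fact joins to the placement dimension on one key column only.
quantity_by_aisle = conn.execute("""
    SELECT d.aisle, SUM(f.quantity)
    FROM fact_sales AS f
    JOIN dim_product_placement AS d ON d.placement_key = f.placement_key
    GROUP BY d.aisle
    ORDER BY d.aisle
""").fetchall()
print(quantity_by_aisle)

# The UNIQUE constraint still rejects a second row for the same product and store.
try:
    conn.execute("INSERT INTO dim_product_placement VALUES (4, 501, 1, 'Aisle 9')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```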
Be careful: another dimension = more joins = more complex queries.
You can stick to a simple model:
I have master data with both the material and product details in a single table. I am creating a star schema, and my question is: do I need to make two dimension tables with separate material attributes and product attributes, or can I have both in a single dimension table? The current master data has the following fields -
Material id, name, type, product hier 1,2,3,4...product hierarchy, product category, sub category. In my case both material and product are the same, so a single id.
I am thinking of making it in a single table, but is that the best practice? Any future potential issues?
Many thanks in advance,
Arun
The important (and obvious) thing is that the fact table has two separate foreign keys: PRODUCT_ID and MATERIAL_ID, both referencing your single dimension table.
This setup is not always best practice for OLTP systems, because in this case the database can't fully enforce referential integrity (you could store a product ID in the MATERIAL_ID column).
But in a data warehouse the database constraints are typically not enabled and integrity is enforced in the loading job, so this setup is fine.
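A rough sketch of this setup, assuming SQLite via Python's sqlite3 module. PRODUCT_ID and MATERIAL_ID come from the answer; dim_item, fact_production, the item_type column and the is_valid_fact_row helper are hypothetical names used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_item (
    item_id   INTEGER PRIMARY KEY,
    name      TEXT,
    item_type TEXT,      -- hypothetical: 'product' or 'material'
    category  TEXT
);
CREATE TABLE fact_production (
    PRODUCT_ID  INTEGER REFERENCES dim_item(item_id),
    MATERIAL_ID INTEGER REFERENCES dim_item(item_id),
    units       INTEGER
);
INSERT INTO dim_item VALUES
    (1, 'Widget', 'product',  'Hardware'),
    (2, 'Steel',  'material', 'Metals'),
    (3, 'Gadget', 'product',  'Hardware');
""")

def is_valid_fact_row(product_id, material_id):
    """Loading-job check: each key must point at a row of the right type."""
    row = conn.execute(
        """SELECT
               (SELECT item_type FROM dim_item WHERE item_id = ?),
               (SELECT item_type FROM dim_item WHERE item_id = ?)""",
        (product_id, material_id),
    ).fetchone()
    return row == ("product", "material")

# The last staged row has the two keys swapped and is filtered out by the load.
staged_rows = [(1, 2, 10), (3, 2, 5), (2, 1, 7)]
loaded = [r for r in staged_rows if is_valid_fact_row(r[0], r[1])]
conn.executemany("INSERT INTO fact_production VALUES (?, ?, ?)", loaded)

# The single dimension is joined twice, once per role.
units_by_category_and_material = conn.execute("""
    SELECT p.category, m.name, SUM(f.units)
    FROM fact_production AS f
    JOIN dim_item AS p ON p.item_id = f.PRODUCT_ID
    JOIN dim_item AS m ON m.item_id = f.MATERIAL_ID
    GROUP BY p.category, m.name
""").fetchall()
print(units_by_category_and_material)
```

The foreign keys alone would have accepted the swapped row, which is why the type check sits in the load step here.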
The decision to split is more dependent on the origin of the two dimensions. If both of them are maintained together, I see no reason to split them. If the two dimensions are independent, with different lifecycles and separate sources, there is no reason to combine them.
And BTW, Kimball IMO is talking about splitting hierarchy levels (not separate dimensions). So he sees it as a mistake to split the product attributes from the hierarchy and category attributes (which is not your problem).
It depends on your business requirement.
If you ever need to produce a report that shows (say) units produced of product category by material, then you need to keep them in separate dimensions.
I'm struggling to understand the best way to model a particular scenario for a data warehouse.
I have a Person dimension, and a Tenancy dimension. A person could be on 0, 1 or (rarely) multiple tenancies at any one time, and will often have a succession of tenancies over time. A tenancy could have one or more people associated with it. The people associated with a tenancy can change over time, and tenancies generally last for many years.
One option is to add tenancy reference, start and end dates to the Person Dimension as type 2 SCD columns. This would work well as long as I ignore the possibility of multiple concurrent tenancies for a person. However, I have other areas of the data warehouse where I am facing a similar design issue and ignoring multiple relationships is not a possibility.
Another option is to model the relationship as an accumulating snapshot fact table. I'm not sure how well this would work in practice though as I could only link it to one version of a Person and Tenancy (both of which will have type 2 SCD columns) and that would seem to make it impossible to produce current or historical reports that link people and tenancies together.
Are there any recommended ways of modelling this type of relationship?
Edit based on the patient answer and comments given by SQL.Injection
I've produced a basic model showing the model as described by SQL.Injection.
I've moved tenancy start/end dates to the 'junk' dimension (Dim.Tenancy) and added Person tenancy start/end dates to the fact table as I felt that was a more accurate way to describe the relationship.
However, now that I see it visually I don't think that this is fundamentally any different from the model that I started with, other than the fact table is a periodic snapshot rather than an accumulating snapshot. It certainly seems to suffer from the same flaw that whenever I update a type 2 slowly changing attribute in any of the dimensions it is not reflected in the fact.
In order to make this work to reflect current changes and also allow historical reporting it seems that I will have to add a row to the fact table every time a SCD2 change occurs on any of the dimensions. Then, in order to prevent over-counting by joining to multiple versions of the same entity I will also need to add new versions of the other related dimensions so that I have new keys to join on.
I need to think about this some more. I'm beginning to think that the database model is right and that it's my understanding of how the model will be used that is wrong.
In the meantime any comments or suggestions are welcome!
Your problem is similar to sales transactions with multiple items. The difference is that a transaction usually has multiple items, while your tenancy fact usually has a single person (the tenant).
Your hydra is born because you are trying to model the tenancy as a dimension, when you should be modeling it as a fact.
The reason why I think you have a tenancy dimension is that somewhere you have a rent fact. To model the rent fact, consider using the same approach I stated above: if two persons are tenants of the same property, two fact records should be inserted each month:
1) And now comes some magic (that is no magic at all): split the value of the rent by the number of tenants and store it in the fact
2) Also store the full value of the rent (you don't know how the data scientist is going to use the data)
3) Check 1) with the business users (I mean the people who build the risk models); there might be some advanced rule on how to do the splitting (a similar thing happens when the cost of shipping is to be divided across multiple item lines of the same order -- it might not be uniformly distributed)
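Steps 1) and 2) could be sketched roughly as follows. This assumes an even split, which is exactly what step 3) says to confirm with the business. The function name rent_fact_rows, the row fields and the choice to let the last tenant absorb the rounding remainder are all assumptions for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

def rent_fact_rows(tenancy_id, month, full_rent, person_ids):
    """Build one rent fact row per tenant for a given month.

    Each row carries the allocated share (step 1) and the full rent (step 2).
    """
    if not person_ids:
        raise ValueError("a tenancy needs at least one tenant to allocate rent")
    full = Decimal(full_rent)
    n = len(person_ids)
    share = (full / n).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    rows = []
    for i, person_id in enumerate(person_ids):
        # Assumption: the last tenant absorbs the rounding remainder,
        # so the allocated shares always sum to the full rent.
        allocated = full - share * (n - 1) if i == n - 1 else share
        rows.append({
            "tenancy_id": tenancy_id,
            "person_id": person_id,
            "month": month,
            "allocated_rent": allocated,
            "full_rent": full,
        })
    return rows

rows = rent_fact_rows("T1", "2024-01", "1000.00", ["P1", "P2", "P3"])
for row in rows:
    print(row["person_id"], row["allocated_rent"], row["full_rent"])
```

Summing allocated_rent over the rows gives the rent once, while full_rent must not be summed across tenants of the same tenancy and month.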
Hi there, I'm currently in the process of planning a very basic Rails app. I want to create a small weight-tracking app. It will store a weight (number) in a Weight model, and there will also be the ability to add a goal weight (number). Every week a user would enter their new weight, and it would be compared against the goal weight to show completion % etc. to the user.
Now my question is: would I have both a Weight model and a Goal model, or should it be a single Weight model with some extra meta information to set a weight as a goal? I will admit I'm very much a noob with Rails. My gut says 2 models, but I could be completely wrong.
This is a pretty subjective answer, but I would separate them for two reasons.
More modular. For example, you might want some model validations on the Weight model, but not on the Goal model. In this case, it's easier to make them into two different models rather than have them in one.
Model associations. You may want to create model associations in the future.
Weight wouldn't even be a model for me; weight would be a parameter from the user, as would goal_weight. You could implement a method weight_over_goal_weight? afterwards to check if weight is over or equal to goal_weight.
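The idea can be illustrated outside Rails with a small Python sketch. This is not Rails code: the User class is hypothetical, and the method is named weight_over_goal_weight without the trailing ?, which Python does not allow in identifiers.

```python
from dataclasses import dataclass

@dataclass
class User:
    # Both values live on the user; there is no separate Weight or Goal model.
    weight: float
    goal_weight: float

    def weight_over_goal_weight(self) -> bool:
        """True when the current weight is over or equal to the goal weight."""
        return self.weight >= self.goal_weight

user = User(weight=82.0, goal_weight=80.0)
print(user.weight_over_goal_weight())
```

This keeps only the latest weight, so a weekly history of entries would still need somewhere to be stored.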
I'm reading Ralph Kimball's book on Data warehouse and Dimension Modeling. I am reading one of the case studies, and it is about dimension modeling for an order system, where the requirement is to capture an order lifecycle, from order to fulfillment to shipped.
So, I was thinking that maybe they would suggest having multiple lines with a transaction type FK to a transaction dimension. However, the book suggests instead creating 'role-playing' dimensions: multiple date dimension tables (one for order date, one for fulfillment, and one for shipped). The fact table would then have a foreign key to each one of them, and therefore three columns to relate this.
Isn't this kind of restricting? Wouldn't a line-per-transaction be a better choice?
Design often involves trade-offs, and it's hard to know what design is best without a lot of details on the entire system.
But my take on this: the table from the book with three separate columns would likely speed up queries. Data warehouses are often denormalized like this to increase query performance, at the expense of simplicity and versatility of input.
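A minimal sketch of the three-column layout, assuming SQLite via Python's sqlite3 module. Here one physical dim_date table is joined under three aliases, which is one common way to implement role-playing dimensions; separate tables or views per role would be queried the same way. All table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,   -- e.g. 20240101
    full_date TEXT                   -- ISO date string
);
-- One row per order, with one date key per lifecycle milestone.
CREATE TABLE fact_order (
    order_id             INTEGER PRIMARY KEY,
    order_date_key       INTEGER REFERENCES dim_date(date_key),
    fulfillment_date_key INTEGER REFERENCES dim_date(date_key),
    shipped_date_key     INTEGER REFERENCES dim_date(date_key)
);
INSERT INTO dim_date VALUES
    (20240101, '2024-01-01'),
    (20240103, '2024-01-03'),
    (20240106, '2024-01-06');
INSERT INTO fact_order VALUES (1, 20240101, 20240103, 20240106);
""")

# Each alias plays a different role for the same date dimension.
lifecycle = conn.execute("""
    SELECT f.order_id,
           od.full_date AS order_date,
           sd.full_date AS shipped_date,
           CAST(julianday(sd.full_date) - julianday(od.full_date) AS INTEGER)
               AS days_to_ship
    FROM fact_order AS f
    JOIN dim_date AS od ON od.date_key = f.order_date_key
    JOIN dim_date AS sd ON sd.date_key = f.shipped_date_key
""").fetchall()
print(lifecycle)
```

With a line-per-transaction design, the same days_to_ship figure would need a self-join or pivot across rows of the fact table, which is the query cost the answer above is pointing at.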
Seems like a good answer to me: your line per transaction sounds better for the data capture tables that store the day to day transactional data, but not as great for analysis.