Can observations in a fact table also be dimensions? - data-warehouse

My understanding is that a fact table uses keys, which are dimensions that ought to have their own dimension table, to identify observations and assign them values. Can these values themselves be dimensions? Or does that violate some principle of a star schema?
For example, is this a valid fact table design?
| Start Time | Stop Time | Employee ID | Performance |
| ---------- | --------- | ----------- | ----------- |
| 01 | 60 | 0100 | Grade 3 |
| 01 | 20 | 0200 | Grade 2 |
| 20 | 60 | 0200 | Grade 3 |
My dimensions that I use to identify facts are the first three columns, with the final column being an observation. However, if I have more information about what each Performance means, does that mean that there needs to be a Performance dimension table? Or, because Performance is an observation rather than a dimension, does this data need to be in the fact table itself?

In a fact table there are normally 3 types of column:
measures: anything that can be aggregated
dimension keys: key to a record in a dimension table
degenerate dimensions: attributes that do not naturally sit in a dimension (often because they would be the only attribute in the dimension)
It is also possible for an attribute to be both a measure in a fact table and an attribute in a dimension. For example, the price of a product could be both a measure in a fact table and an attribute in the product dimension.
Does this help?
Update
Say you wanted to know the average price of the products you have sold: in this case product price is a measure and lives in a sales fact table; that fact table would almost certainly have an FK to your Product Dimension - so you could filter on product attributes e.g. average price of product for products whose category = "Food".
You might also want to filter a query based on product price: in this case product price is an attribute in your product dimension (which would probably be an SCD2 dimension to cater for price changes). For example, you might want to query your stock-level fact table (which doesn't hold product prices as measures but does have an FK to the product dimension) for all products whose price is between £10 and £20.
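As a rough sketch of the two uses of price described above, the following runs both queries against an in-memory SQLite database. The table names, column names and sample rows are all made up for illustration, and the product dimension is kept as a plain (non-SCD2) table to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT,
    price        REAL      -- price as a dimension attribute (used for filtering)
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    price       REAL       -- price as a measure (captured at the time of sale)
);
CREATE TABLE fact_stock_level (
    product_key    INTEGER REFERENCES dim_product(product_key),
    units_in_stock INTEGER -- no price measure here, only the FK
);
-- Hypothetical sample data
INSERT INTO dim_product VALUES
    (1, 'Bread', 'Food', 12.0),
    (2, 'Cheese', 'Food', 18.0),
    (3, 'Kettle', 'Homeware', 25.0);
INSERT INTO fact_sales VALUES (1, 2, 10.0), (2, 1, 20.0), (3, 1, 25.0);
INSERT INTO fact_stock_level VALUES (1, 40), (2, 15), (3, 7);
""")

# Price as a measure: average sold price, filtered on a product attribute.
avg_food_price = conn.execute("""
    SELECT AVG(s.price)
    FROM fact_sales s
    JOIN dim_product p ON p.product_key = s.product_key
    WHERE p.category = 'Food'
""").fetchone()[0]

# Price as an attribute: filter a fact table that holds no price measure.
stock_in_band = conn.execute("""
    SELECT p.product_name, f.units_in_stock
    FROM fact_stock_level f
    JOIN dim_product p ON p.product_key = f.product_key
    WHERE p.price BETWEEN 10 AND 20
    ORDER BY p.product_name
""").fetchall()

print(avg_food_price)
print(stock_in_band)
```

The sold price and the current dimension price are deliberately different in the sample rows, to show that the two columns answer different questions.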

Related

Relations between slowly changing dimensions in a data warehouse

I’m designing a data warehouse and am struggling to work out how I should be modelling this scenario. I’ve got Users stored in a Slowly Changing Dimension Type 2 table along these lines:
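| UserKey | UserID | Language | EffectiveDate | ExpiryDate | IsCurrent |
| ------- | ------ | -------- | ------------- | ---------- | --------- |
| 1 | 1001 | EN | 2021-01-01 | 9999-12-31 | Y |
| 2 | 1002 | EN | 2021-07-31 | 2022-01-06 | N |
| 3 | 1002 | FR | 2022-01-06 | 9999-12-31 | Y |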
And a Login fact table like:
| LoginKey | UserKey | LoginTime |
| -------- | ------- | --------- |
| 12345 | 2 | 2021-12-25 15:00 |
| 12399 | 3 | 2022-01-31 18:00 |
Thereby allowing us to report on logins by date by user language setting at the time, etc.
Now I have to consider that each user may have one, none, or many concurrent subscriptions, which I was thinking of modelling in a Type 1 SCD thus:
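| SubsKey | SubsID | SubsType | UserKey | StartDate | EndDate |
| ------- | ------ | -------- | ------- | --------- | ------- |
| 55501 | SBP501 | Premium | 2 | 2021-08-01 | 2022-08-01 |
| 55502 | SBB123 | Bonus | 3 | 2022-01-31 | 2023-01-31 |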
Is it right for one dimension table to reference the surrogate row key of another like this, or should it rather contain the UserID natural key? It seems unwieldy for the Subs table to have different UserKeys for the two concurrent Subscriptions for the same user like this. Or perhaps, when the third row was added to the Type 2 User table, should all the existing rows in Subs with UserKey=2 have been updated to UserKey=3?
The whole thing doesn't seem to fit comfortably into the classic snowflake pattern, which usually has the one-to-many relationship pointing the other way, as might be the case were Language to be a separate dimension table say, with a one-to-many relation on User.
Edit
I'm wrestling not only with the one-to-many example described (one user has many subscriptions) but also with many-to-one relations between SCD Type 2 tables. For example, if the user's language were stored in an SCD Type 2 table, should the User dimension reference the Language ID or the LanguageKey of the Language table's current row?
A subscription is a fact and so should be stored in a fact table - though you might also have a subscription dimension that holds attributes of a subscription such as its name.
You relate dimensions through fact tables, so your subscription fact would have FKs to Subscription, User, Date etc dimensions.
Relating dimensions directly to each other is called snowflaking and is generally a bad design.
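One way this could look, as a minimal sketch using the rows from the question: the subscription attributes stay in a dimension, and a subscription fact table carries the foreign keys to both dimensions. The table names are assumptions, and the start/end dates are stored as plain columns here rather than as keys to a Date dimension, purely to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_user (
    user_key INTEGER PRIMARY KEY,   -- surrogate key (one per SCD2 row)
    user_id  INTEGER,               -- natural key
    language TEXT
);
CREATE TABLE dim_subscription (
    subs_key  INTEGER PRIMARY KEY,
    subs_id   TEXT,
    subs_type TEXT
);
-- The fact table is what relates the two dimensions.
CREATE TABLE fact_subscription (
    subs_key   INTEGER REFERENCES dim_subscription(subs_key),
    user_key   INTEGER REFERENCES dim_user(user_key),
    start_date TEXT,
    end_date   TEXT
);
INSERT INTO dim_user VALUES (1, 1001, 'EN'), (2, 1002, 'EN'), (3, 1002, 'FR');
INSERT INTO dim_subscription VALUES
    (55501, 'SBP501', 'Premium'),
    (55502, 'SBB123', 'Bonus');
INSERT INTO fact_subscription VALUES
    (55501, 2, '2021-08-01', '2022-08-01'),
    (55502, 3, '2022-01-31', '2023-01-31');
""")

# Both subscriptions roll up to user 1002 through the natural key held in
# the dimension, while each keeps the language in force when it started.
rows = conn.execute("""
    SELECT u.user_id, u.language, s.subs_type
    FROM fact_subscription f
    JOIN dim_user u ON u.user_key = f.user_key
    JOIN dim_subscription s ON s.subs_key = f.subs_key
    ORDER BY f.start_date
""").fetchall()
print(rows)
```

With this layout the two different UserKey values for the same user are no longer awkward: they simply record which version of the user each subscription was taken out under.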
BTW for an SCD2 table, having the expiry date of one row the same as the effective date of the next row is not a good design. In your example, you would need business logic to define which row was active on 2022-01-06, whereas if a row expires on 2022-01-06 and the next row starts on 2022-01-07 there can be no confusion.
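The boundary problem can be reproduced with a small sketch. It loads the two rows for user 1002 into an in-memory SQLite table (the helper name `rows_active_on` is made up for this example) and runs an inclusive "as of" lookup, first with the shared boundary date from the question and then with the next row starting one day later.

```python
import sqlite3

def rows_active_on(rows, day):
    """Return the user_key values whose date range contains `day` (inclusive)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE dim_user (
            user_key INTEGER, user_id INTEGER, language TEXT,
            effective_date TEXT, expiry_date TEXT
        )
    """)
    conn.executemany("INSERT INTO dim_user VALUES (?, ?, ?, ?, ?)", rows)
    return conn.execute("""
        SELECT user_key FROM dim_user
        WHERE user_id = 1002
          AND ? BETWEEN effective_date AND expiry_date
        ORDER BY user_key
    """, (day,)).fetchall()

# As in the question: row 2 expires on the day row 3 becomes effective.
shared_boundary = [
    (2, 1002, "EN", "2021-07-31", "2022-01-06"),
    (3, 1002, "FR", "2022-01-06", "9999-12-31"),
]
# As suggested above: the next row starts the day after the previous one expires.
disjoint = [
    (2, 1002, "EN", "2021-07-31", "2022-01-06"),
    (3, 1002, "FR", "2022-01-07", "9999-12-31"),
]

ambiguous = rows_active_on(shared_boundary, "2022-01-06")
unambiguous = rows_active_on(disjoint, "2022-01-06")
print(ambiguous, unambiguous)
```

Another common convention is to keep the shared date but treat the range as half-open (effective date inclusive, expiry date exclusive); either way, the point is that exactly one row should match any given date.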
Based on your examples, the last table looks closer to SCD Type 4 than Type 1.
I agree that subscriptions could be a fact table with an accompanying dimension table.
Perhaps SCD Type 2 would be the best option for the subscriptions dimension table, with an added flag column to mark the current/active subscription along with its effective date.

Unit Price and Discounts - Fact or Dimension Table

I'm working on a datamart for our sales and marketing departments, and I've come across a modeling challenge. Our ERP stores pricing data in a few different ways:
List pricing for each item
A discount percentage from list pricing for a product line, either for groups of customers or for a specific account
A custom price for an item, either for groups of customers or for a specific account
The Pricing department primarily uses this data operationally, not analytically. For example, they generate reports for customers ("What special pricing / discount %s do I have?") and identify which items / item groups need to be changed when they engage in a new pricing strategy.
Pricing changes happen somewhat regularly on a small scale, usually on a customer-by-customer or item-by-item basis. Infrequently, there are large-scale adjustments to list pricing and group pricing (discounts and individual items) in addition to the customer-level discounts.
My thinking so far has been to create one or more fact tables to represent this process. Unfortunately, there's no pre-existing business key for pricing. There's also no specific "transaction date," since the ERP doesn't (accurately) maintain records of when pricing is changed. Essentially, a "pricing event" is going to be a combination of:
Effective date
End date
Item OR product line
(Not required for list price) customer or customer group
A price amount OR discount percentage
A single fact table seems problematic in that I'm going to have to deal with a lot of invalid combinations of dimensions and facts. First, a record will never have both a non-NULL price amount and a non-NULL discount percentage; pricing events are either-or. Second, only certain combinations of dimensions are valid for each fact. For example, a discount percentage will only ever have a product line, not an individual item.
Does it make sense to model pricing as a fact table in the first place? If so, how many tables should I be considering? My intuition is to use at least two, one for the percentages and one for the price amounts, but this still leaves a problem where each record will either have a valid customer group OR a valid customer (or neither, for list prices), since we need to maintain customer-specific pricing separate from any group pricing that customer might have.
You may need to keep them both as attributes and as facts.
The price a certain item was sold for is a fact. When you multiply it by the quantity sold it's actually an additive measure. So, keep it in the fact table. Total discount applied is also additive, I'd keep it. You can later query "how much was discounted in 2019 per customer", which would be much harder to achieve without those facts.
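To illustrate why keeping the discount as an additive measure pays off, here is a small sketch of the "how much was discounted in 2019 per customer" query. The `fact_sales` layout and all sample rows are assumptions made for the example, not the asker's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (
    customer_key    INTEGER,
    sale_date       TEXT,
    quantity        INTEGER,
    unit_price      REAL,   -- price the item was actually sold for
    discount_amount REAL    -- total discount on the line (additive)
);
-- Hypothetical sample data
INSERT INTO fact_sales VALUES
    (1, '2019-03-01', 2, 50.0, 10.0),
    (1, '2019-09-15', 1, 80.0,  8.0),
    (2, '2019-06-30', 4, 25.0,  0.0),
    (1, '2020-01-10', 1, 50.0,  5.0);
""")

# Because discount_amount is additive, a plain SUM answers the question.
discount_2019 = conn.execute("""
    SELECT customer_key, SUM(discount_amount)
    FROM fact_sales
    WHERE sale_date >= '2019-01-01' AND sale_date < '2020-01-01'
    GROUP BY customer_key
    ORDER BY customer_key
""").fetchall()
print(discount_2019)
```

A discount percentage on its own would not sum meaningfully across rows, which is why the sketch stores the discount as an amount.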
But if you also need to query things like "what's the discount customer X is on", then you should also keep that as an attribute of the customer dimension, and treat it as a type II dimension, so as to keep discount history. If you know when a certain discount was applied, great, if not take the 1st sale as the start date and you won't be too far off.
Maybe the list price can also be kept as an attribute of product or product line in a dimension, but only if they don't change too often; but if most customers get discounts anyway that would be of limited use.

Fact and Dimension Tables in DW

I wonder why fact tables are bigger in size than dimension tables in data warehouses. Dimension tables contain the attribute-level information, and are highly de-normalized, so why are dimension tables not bigger in size?
I could start by stealing some words off Kimball
"Dimensional modeling begins by dividing the world into measurements and context."
https://www.kimballgroup.com/2003/01/fact-tables-and-dimension-tables/
Fact tables record business activities or events, which is why they tend to grow in size. Dimension tables store information about the different contexts of those events.
For example, at a university, 100 students might be enrolling in 10 subjects. The dimensions in this scenario, Dim_Student and Dim_Subject, would have 100 rows and 10 rows respectively. But there will be far more enrolment activity, as each student can enrol in zero or many subjects at the same time. This leads to the Fact_Enrolment table (which records the enrolment activities) having many more rows than the dimensions.
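The row counts in this example can be checked with a quick sketch. It assumes, purely for illustration, that every student enrols in exactly 3 of the 10 subjects; the table names are lower-cased variants of the ones mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY);
CREATE TABLE dim_subject (subject_key INTEGER PRIMARY KEY);
CREATE TABLE fact_enrolment (
    student_key INTEGER REFERENCES dim_student(student_key),
    subject_key INTEGER REFERENCES dim_subject(subject_key),
    PRIMARY KEY (student_key, subject_key)
);
""")
conn.executemany("INSERT INTO dim_student VALUES (?)", [(s,) for s in range(100)])
conn.executemany("INSERT INTO dim_subject VALUES (?)", [(j,) for j in range(10)])

# Assumption: each student enrols in 3 distinct subjects.
conn.executemany(
    "INSERT INTO fact_enrolment VALUES (?, ?)",
    [(s, (s + k) % 10) for s in range(100) for k in range(3)],
)

row_counts = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("dim_student", "dim_subject", "fact_enrolment")
}
print(row_counts)
```

The dimensions stay at 100 and 10 rows however many enrolments happen, while the fact table grows with every new enrolment event.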
Note: in my experience, I have also worked with fact tables that had fewer rows than their dimensions at a particular point in time. They might grow eventually as the data warehouse grows.
Hope that helps.
Dimensions contain entity-level information whereas facts contain transaction-level information, and for a single dimension row multiple transactions can take place over a period of time. For example, in an HR system, there can be a person dimension containing the personal details of all employees, with typically 1-3 records for each employee.
Fact tables store the multiple transactions of those employees, e.g. hires, promotions, movements/changes of department, leave, terminations, etc., so for one person record in the person dimension there will be multiple records in the facts.
Fact tables also contain facts/measures corresponding to multiple dimensions, so facts are joined to multiple dimensions using surrogate key/foreign key references, which makes the fact table heavier than the dimensions.
"Dimension tables contain the attribute-level information, and are highly de-normalized"
Actually, I doubt that dimension tables are "highly de-normalized". Generally speaking, each row in a dimension table is identified by a primary key, so there is very little scope for duplicates in them. This can explain why they do not get too big compared to fact tables.

Preferred no of columns for a Fact table?

I have my Fact table with Policy data in it & I want to add Policy Products details to the warehouse.
One policy gets different types of products and the values also are dynamic.
Eg: Policy01 may have two products Building & Contents where sum insured values are 1000 & 500 respectively. And Policy02 get Building only of 750.
There are like 30 products available and I need to store sum insured value, gross & net premiums of each product per policy.
So if I add a separate column for each product type to the fact table, it'll add like 120 more columns (currently there are 23 columns). Also, there is a max of 5 products per policy, so only 20 columns will contain values and the others will remain empty.
Is it ok to have 100+ columns for fact table? Is it ok to keep this many empty values in a row?
Or is there any other approach I can solve this?
I'm a novice at DWH and hope someone can shed some light on how to add these to my fact table.
One approach is to add a product dimension:
You can then return totals by policy:
SELECT
    PolicyKey,
    SUM(PolicyProductValue) AS PolicyValue
FROM
    Fact.PolicyProductValue
GROUP BY
    PolicyKey
;
Or product:
SELECT
    ProductKey,
    SUM(PolicyProductValue) AS ProductValue
FROM
    Fact.PolicyProductValue
GROUP BY
    ProductKey
;
Or both:
SELECT
    PolicyKey,
    ProductKey,
    SUM(PolicyProductValue) AS PolicyProductValue
FROM
    Fact.PolicyProductValue
GROUP BY
    PolicyKey,
    ProductKey
;
This approach moves the products from the columns to the rows.
This technique offers several benefits:
It is easier to add new rows than columns.
You can add common filters to Dim.Product.
Dim.Product provides a location to create product hierarchies. Example:
| Product Key | Product Name | Product Group |
| ----------- | ------------ | --------------------|
| 0 | Building | Building & Contents |
| 1 | Contents | Building & Contents |
It's not ok to have 100+ columns in a fact table; it's a symptom of an incorrect data model (the same is true for missing values - a well designed fact table shouldn't have any).
The logic of the fact table design is the following:
First, decide on the table "granularity" - the most atomic level of data it will contain. In your case, data granularity is defined by Policy number + Product. Together they uniquely identify the most detailed information available to you.
Then, identify your "facts". Typically, facts are pieces of data that you can aggregate (sum, count, average, etc). In your case, they are Insured_Value, Gross_Premium, Net_Premium.
Finally, define business context for these facts (dimensions). In your case, they are Policy and Product (most likely, you will also have some kind of Date).
Your resulting fact table should look something like this:
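| Policy_Date | Policy_Number | Product_ID | Insured_Value | Gross_Premium | Net_Premium |
| ----------- | ------------- | ---------- | ------------- | ------------- | ----------- |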
Policy_Date will provide connection to "Calendar" dimension, Product_ID will connect to "Product" dimension (table that contains your 30 products and their descriptions).
Policy_Number is what's called a "Degenerate Dimension" - it's an ID that is usually not connected to any dimensions (but could be if you need it to). It's stored in a fact table just as a reference. Some people add a "Policy" dimension to the model, but usually it's a design mistake - such dimensions are too "tall", comparable in size to the fact table, which can dramatically slow down your model performance. It's usually better to split policy attributes into multiple small dimensions and leave the policy number as a degenerate dimension.
So, your typical policy with 5 products will be represented as 5 records in the fact table, rather than one record with 5 fields. This is the critical difference - never, ever store information (products in your case) in the name of the fact table fields.
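A minimal sketch of this one-row-per-policy-and-product layout, using the sum insured values from the question (Policy01 with Building 1000 and Contents 500, Policy02 with Building 750). The table names, the dates and the premium figures are invented for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL
);
-- Grain: one row per policy number + product.
CREATE TABLE fact_policy_product (
    policy_date   TEXT NOT NULL,
    policy_number TEXT NOT NULL,    -- degenerate dimension
    product_id    INTEGER NOT NULL REFERENCES dim_product(product_id),
    insured_value REAL NOT NULL,
    gross_premium REAL NOT NULL,    -- hypothetical figures below
    net_premium   REAL NOT NULL,    -- hypothetical figures below
    PRIMARY KEY (policy_number, product_id)
);
INSERT INTO dim_product VALUES (0, 'Building'), (1, 'Contents');
INSERT INTO fact_policy_product VALUES
    ('2020-01-01', 'Policy01', 0, 1000, 100, 90),
    ('2020-01-01', 'Policy01', 1,  500,  50, 45),
    ('2020-02-01', 'Policy02', 0,  750,  75, 70);
""")

insured_by_policy = conn.execute("""
    SELECT policy_number, SUM(insured_value)
    FROM fact_policy_product
    GROUP BY policy_number
    ORDER BY policy_number
""").fetchall()
print(insured_by_policy)
```

The table keeps six columns however many products exist, and no row carries empty product columns; adding a 31st product is an insert into the product dimension rather than a schema change.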

What is the best way to represent 'N' no. of Products in 'M' no. of warehouses with the quantities included

I want to model the relationship between the Product entity and the Warehouse (location) entity, as you can see in the picture below.
The problem is the quantity, since it differs for each product in each warehouse. I am not sure this is the correct way, since in most class diagrams (e.g. for Doctrine 2.5) there is no mapping class in the diagram; a simple annotation would do.
I know I can add an extra column to the Product entity, but what if there are many warehouses? I have not seen any practical example with many warehouses; usually there are just large warehouses (space).
What is the best way to represent 'N' no. of Products in 'M' no. of warehouses with the quantities included.
My ER Diagram
In table Product_Location, I assume that the primary key is a combination of ProductId and LocationId, and not just one of those IDs.
If that is the case, I don't see why you cannot have different quantity for a particular product in different locations.
For example:
Product A is stored in warehouse X and warehouse Y. The quantity of product A in warehouse X is 10. The quantity of product A in warehouse Y is 20. Thus, the content of table Product_Location will be:
A - X - 10 and A - Y - 20.
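The example above can be sketched as a junction table with a composite primary key. The column names and types are assumptions (the ER diagram itself is not shown here), and SQLite is used only because it runs without any setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_location (
        product_id  TEXT NOT NULL,
        location_id TEXT NOT NULL,
        quantity    INTEGER NOT NULL,
        PRIMARY KEY (product_id, location_id)  -- one row per product per location
    )
""")
conn.executemany(
    "INSERT INTO product_location VALUES (?, ?, ?)",
    [("A", "X", 10), ("A", "Y", 20)],
)

rows = conn.execute("""
    SELECT product_id, location_id, quantity
    FROM product_location
    ORDER BY location_id
""").fetchall()

total_a = conn.execute(
    "SELECT SUM(quantity) FROM product_location WHERE product_id = 'A'"
).fetchone()[0]

# The composite key rejects a second row for the same product and location.
try:
    conn.execute("INSERT INTO product_location VALUES ('A', 'X', 5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(rows, total_a, duplicate_rejected)
```

Adding another warehouse only adds rows to this table, so the number of warehouses does not affect the Product entity at all.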
Hope this helps.

Resources