Preferred number of columns for a fact table? - data-warehouse

My fact table holds policy data, and I want to add policy product details to the warehouse.
One policy can have several types of product, and the values are dynamic.
E.g. Policy01 may have two products, Building & Contents, with sums insured of 1000 & 500 respectively, while Policy02 has Building only, at 750.
There are about 30 products available, and I need to store the sum insured and the gross & net premiums of each product per policy.
So if I add a separate column for each product type to the fact table, it'll add something like 120 more columns (currently there are 23). Also, there are at most 5 products per policy, so only 20 of those columns will contain values and the others will remain empty.
Is it OK to have 100+ columns in a fact table? Is it OK to keep this many empty values in a row?
Or is there another approach I can use to solve this?
I'm a novice at DWH and hope someone can shed some light on how to add these to my fact table.

One approach is to add a product dimension:
You can then return totals by policy:
SELECT
PolicyKey,
SUM(PolicyProductValue) AS PolicyValue
FROM
Fact.PolicyProductValue
GROUP BY
PolicyKey
;
Or product:
SELECT
ProductKey,
SUM(PolicyProductValue) AS ProductValue
FROM
Fact.PolicyProductValue
GROUP BY
ProductKey
;
Or both:
SELECT
PolicyKey,
ProductKey,
SUM(PolicyProductValue) AS PolicyProductValue
FROM
Fact.PolicyProductValue
GROUP BY
PolicyKey,
ProductKey
;
This approach moves the products from the columns to the rows.
This technique offers several benefits:
It is easier to add new rows than columns.
You can add common filters to Dim.Product.
Dim.Product provides a location to create product hierarchies. Example:
| Product Key | Product Name | Product Group |
| ----------- | ------------ | --------------------|
| 0 | Building | Building & Contents |
| 1 | Contents | Building & Contents |
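To make the "products as rows" idea concrete, here is a minimal sketch that builds the two tables in an in-memory SQLite database and runs the totals queries from above. It is only an illustration under assumptions: SQLite has no schemas, so `Dim.Product` and `Fact.PolicyProductValue` become the plain table names `DimProduct` and `FactPolicyProductValue`, the policy key is stored as text for readability, and the rows are the Policy01/Policy02 sums insured from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table names are illustrative stand-ins for Dim.Product and
    -- Fact.PolicyProductValue (SQLite has no schema prefix).
    CREATE TABLE DimProduct (
        ProductKey   INTEGER PRIMARY KEY,
        ProductName  TEXT NOT NULL,
        ProductGroup TEXT NOT NULL
    );
    CREATE TABLE FactPolicyProductValue (
        PolicyKey          TEXT NOT NULL,
        ProductKey         INTEGER NOT NULL REFERENCES DimProduct (ProductKey),
        PolicyProductValue NUMERIC NOT NULL
    );
    INSERT INTO DimProduct VALUES
        (0, 'Building', 'Building & Contents'),
        (1, 'Contents', 'Building & Contents');
    -- One row per policy and product, using the sums insured from the question.
    INSERT INTO FactPolicyProductValue VALUES
        ('Policy01', 0, 1000),
        ('Policy01', 1, 500),
        ('Policy02', 0, 750);
""")

totals_by_policy = conn.execute("""
    SELECT PolicyKey, SUM(PolicyProductValue) AS PolicyValue
    FROM FactPolicyProductValue
    GROUP BY PolicyKey
    ORDER BY PolicyKey
""").fetchall()

# Joining to the dimension lets the report show names instead of keys.
totals_by_product = conn.execute("""
    SELECT p.ProductName, SUM(f.PolicyProductValue) AS ProductValue
    FROM FactPolicyProductValue AS f
    JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
    GROUP BY p.ProductName
    ORDER BY p.ProductName
""").fetchall()

print(totals_by_policy)
print(totals_by_product)
```

Adding a 31st product in this layout is one new row in `DimProduct`; neither table needs a new column.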

It's not ok to have 100+ columns in a fact table; it's a symptom of an incorrect data model (the same is true for missing values - a well designed fact table shouldn't have any).
The logic of the fact table design is the following:
First, decide on the table "granularity" - the most atomic level of data it will contain. In your case, data granularity is defined by Policy number + Product. Together they uniquely identify the most detailed information available to you.
Then, identify your "facts". Typically, facts are pieces of data that you can aggregate (sum, count, average, etc). In your case, they are Insured_Value, Gross_Premium, Net_Premium.
Finally, define business context for these facts (dimensions). In your case, they are Policy and Product (most likely, you will also have some kind of Date).
Your resulting fact table should look something like this:
Policy_Date
Policy_Number
Product_ID
Insured_Value
Gross_Premium
Net_Premium
Policy_Date will provide connection to "Calendar" dimension, Product_ID will connect to "Product" dimension (table that contains your 30 products and their descriptions).
Policy_Number is what's called a "Degenerate Dimension": an ID that is usually not connected to any dimension table (though it could be if you need it). It's stored in the fact table just as a reference. Some people add a "Policy" dimension to the model, but that is usually a design mistake: such dimensions are too "tall", comparable in size to the fact table, which can dramatically slow down your model's performance. It's usually better to split policy attributes into multiple small dimensions and leave the policy number as a degenerate dimension.
So, your typical policy with 5 products will be represented as 5 records in the fact table, rather than one record with 5 fields. This is the critical difference - never, ever store information (products in your case) in the name of the fact table fields.
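As a rough sketch of that design, the snippet below declares the fact table at the Policy number + Product grain and lets a composite primary key enforce it. Everything beyond the six column names listed above is an assumption for illustration: the snake_case table names, the use of SQLite, the dates, and the premium figures (only the sums insured come from the question).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL
    );
    CREATE TABLE fact_policy_product (
        policy_date   TEXT NOT NULL,      -- would join to a Calendar dimension
        policy_number TEXT NOT NULL,      -- degenerate dimension, no lookup table
        product_id    INTEGER NOT NULL REFERENCES dim_product (product_id),
        insured_value NUMERIC NOT NULL,
        gross_premium NUMERIC NOT NULL,
        net_premium   NUMERIC NOT NULL,
        PRIMARY KEY (policy_number, product_id)  -- the declared grain
    );
    INSERT INTO dim_product VALUES (1, 'Building'), (2, 'Contents');
""")

# Dates and premium figures are made up for the example.
rows = [
    ("2024-01-15", "Policy01", 1, 1000, 120, 100),
    ("2024-01-15", "Policy01", 2, 500, 60, 50),
    ("2024-02-01", "Policy02", 1, 750, 90, 75),
]
conn.executemany("INSERT INTO fact_policy_product VALUES (?, ?, ?, ?, ?, ?)", rows)

# A second row for the same policy and product violates the grain.
try:
    conn.execute(
        "INSERT INTO fact_policy_product VALUES ('2024-01-15', 'Policy01', 1, 1, 1, 1)"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Rolling the product-level rows up to policy level is a plain GROUP BY.
policy_totals = conn.execute("""
    SELECT policy_number,
           COUNT(*)           AS products,
           SUM(insured_value) AS insured_value,
           SUM(net_premium)   AS net_premium
    FROM fact_policy_product
    GROUP BY policy_number
    ORDER BY policy_number
""").fetchall()
print(duplicate_rejected, policy_totals)
```

The table stays six columns wide no matter how many of the 30 products a policy carries, and no row contains an empty measure.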

Related

Relations between slowly changing dimensions in a data warehouse

I’m designing a data warehouse and am struggling to work out how I should be modelling this scenario. I’ve got Users stored in a Slowly Changing Dimension Type 2 table along these lines:
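| UserKey | UserID | Language | EffectiveDate | ExpiryDate | IsCurrent |
| ------- | ------ | -------- | ------------- | ---------- | --------- |
| 1 | 1001 | EN | 2021-01-01 | 9999-12-31 | Y |
| 2 | 1002 | EN | 2021-07-31 | 2022-01-06 | N |
| 3 | 1002 | FR | 2022-01-06 | 9999-12-31 | Y |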
And a Login fact table like:
| LoginKey | UserKey | LoginTime |
| -------- | ------- | ---------------- |
| 12345 | 2 | 2021-12-25 15:00 |
| 12399 | 3 | 2022-01-31 18:00 |
Thereby allowing us to report on logins by date by user language setting at the time, etc.
Now I have to consider that each user may have one, none, or many concurrent subscriptions, which I was thinking of modelling in a Type 1 SCD thus:
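| SubsKey | SubsID | SubsType | UserKey | StartDate | EndDate |
| ------- | ------ | -------- | ------- | ---------- | ---------- |
| 55501 | SBP501 | Premium | 2 | 2021-08-01 | 2022-08-01 |
| 55502 | SBB123 | Bonus | 3 | 2022-01-31 | 2023-01-31 |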
Is it right for one dimension table to reference the surrogate row key of another like this, or should it rather contain the UserID natural key? It seems unwieldy for the Subs table to have different UserKeys for the two concurrent Subscriptions for the same user like this. Or perhaps, when the third row was added to the Type 2 User table, should all the existing rows in Subs with UserKey=2 have been updated to UserKey=3?
The whole thing doesn't seem to fit comfortably into the classic snowflake pattern, which usually has the one-to-many relationship pointing the other way, as might be the case were Language to be a separate dimension table say, with a one-to-many relation on User.
Edit
I'm wrestling not only with the one-to-many example described (one user has many subscriptions) but also with many-to-one relations between SCDT2 tables. E.g. if the user's language were stored in an SCDT2 table, should the User dimension reference the Language ID or the LanguageKey of the Language table's current row?
A subscription is a fact and so should be stored in a fact table - though you might also have a subscription dimension that holds attributes of a subscription such as its name.
You relate dimensions through fact tables, so your subscription fact would have FKs to Subscription, User, Date etc dimensions.
Relating dimensions directly to each other is called snowflaking and is generally a bad design.
BTW for an SCD2 table, having the expiry date of one row the same as the effective date of the next row is not a good design. In your example, you would need business logic to define which row was active on 2022-01-06, whereas if a row expires on 2022-01-06 and the next row starts on 2022-01-07 there can be no confusion.
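The date-range point can be checked with a small sketch. It loads the two rows for user 1002 from the question into an in-memory SQLite table (the table name `DimUser` is an assumption) and asks which row was active on 2022-01-06, first with the ranges as posted and then with the new row moved to start one day later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimUser (
        UserKey       INTEGER PRIMARY KEY,
        UserID        INTEGER NOT NULL,
        Language      TEXT NOT NULL,
        EffectiveDate TEXT NOT NULL,   -- ISO dates compare correctly as text
        ExpiryDate    TEXT NOT NULL
    );
    -- Rows for user 1002 as posted: row 2 expires on the day row 3 starts.
    INSERT INTO DimUser VALUES
        (2, 1002, 'EN', '2021-07-31', '2022-01-06'),
        (3, 1002, 'FR', '2022-01-06', '9999-12-31');
""")

ACTIVE_ON = """
    SELECT UserKey FROM DimUser
    WHERE UserID = ? AND ? BETWEEN EffectiveDate AND ExpiryDate
    ORDER BY UserKey
"""

overlapping = [r[0] for r in conn.execute(ACTIVE_ON, (1002, "2022-01-06"))]

# Shift the new row to start the day after the old one expires.
conn.execute("UPDATE DimUser SET EffectiveDate = '2022-01-07' WHERE UserKey = 3")
non_overlapping = [r[0] for r in conn.execute(ACTIVE_ON, (1002, "2022-01-06"))]

print(overlapping)      # two candidate rows for the same day
print(non_overlapping)  # exactly one
```

Another common convention, not the one described above, is to keep the shared boundary date and treat the range as half-open (`EffectiveDate <= d AND d < ExpiryDate`); either way the lookup rule has to be applied consistently.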
Based on your examples, the last table looks closer to SCD Type 4 than Type 1.
I agree that subscriptions could be modelled as a fact table with an accompanying dimension table.
An SCD Type 2 might be the best option for the subscriptions dimension table, with a flag column added to mark the current/active subscription alongside its effective date.

How to model a dimension table that links to several facts with different levels of grain?

I have a fact that stores a client's address. The problem is that the client can choose to enter the information at state level, county level, or street level. In the operational database there is one table for streets, which links to another table for counties, which links to another table for states. The client table has one column each for state, county and street, each containing an ID (so it can link to the higher object in the hierarchy).
How can I model the relationship between the fact and the dimension in a star-schema?
So I created one Location dimension with all states, all counties and all streets. The table looks like this:
| DIM_ID | Level | Street columns | County columns | State columns |
| ------ | ------ | --------------------- | --------------------- | ------------- |
| 1 | Street | Bolsa | Westminton | California |
| 2 | County | Westminton [county] | Westminton | California |
| 3 | State | [State of] California | [State of] California | California |

If the client discloses a street, the fact record links to row 1; if the client discloses only the county, it links to row 2; and if the client discloses only the state, it links to row 3.
What do you think of that approach?
I would probably model these levels separately, as they are being treated as separate, i.e.:
dim.Street
dim.County
dim.State
As for relating these to clients, I'd go for bridge tables, e.g.:
CREATE TABLE ClientStreet
(
ClientID INT NOT NULL
, StreetID INT NOT NULL
)
Etc.
And if you cannot provide a Street without providing a County and State, or provide a County without providing a State, I would have within dim.Street, CountyID, and within dim.County, StateID, i.e. a hierarchical structure.
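A minimal sketch of that hierarchical structure, assuming SQLite and the table names `DimState`, `DimCounty` and `DimStreet` in place of `dim.State` and so on (SQLite has no schema prefix). The street, county and state names are the ones from the question; the client ID is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimState  (StateID INTEGER PRIMARY KEY, StateName TEXT NOT NULL);
    CREATE TABLE DimCounty (
        CountyID   INTEGER PRIMARY KEY,
        CountyName TEXT NOT NULL,
        StateID    INTEGER NOT NULL REFERENCES DimState (StateID)
    );
    CREATE TABLE DimStreet (
        StreetID   INTEGER PRIMARY KEY,
        StreetName TEXT NOT NULL,
        CountyID   INTEGER NOT NULL REFERENCES DimCounty (CountyID)
    );
    -- Bridge table between clients and streets.
    CREATE TABLE ClientStreet (
        ClientID INTEGER NOT NULL,
        StreetID INTEGER NOT NULL REFERENCES DimStreet (StreetID),
        PRIMARY KEY (ClientID, StreetID)
    );
    INSERT INTO DimState  VALUES (1, 'California');
    INSERT INTO DimCounty VALUES (10, 'Westminton', 1);
    INSERT INTO DimStreet VALUES (100, 'Bolsa', 10);
    INSERT INTO ClientStreet VALUES (7, 100);   -- client 7 is a made-up ID
""")

# Only the street is recorded for the client; county and state are derived
# by walking up the hierarchy.
client_location = conn.execute("""
    SELECT cs.ClientID, st.StreetName, co.CountyName, sa.StateName
    FROM ClientStreet AS cs
    JOIN DimStreet AS st ON st.StreetID = cs.StreetID
    JOIN DimCounty AS co ON co.CountyID = st.CountyID
    JOIN DimState  AS sa ON sa.StateID  = co.StateID
""").fetchall()
print(client_location)
```

A client who discloses only a county would get a row in a `ClientCounty` bridge instead, and the same query would start one level higher.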
EDIT
With regards to your client dimensions with 3 IDs, this could also be a good model.
With regards to my hierarchical structure, and data modelling in general, I feel the way you model it really needs to:
Reflect reality (e.g., recording a client's Street as "[State of] California" seems inaccurate to me).
Reflect the reality of what is possible in terms of incoming data, i.e. do your clients input their address once, or can they change it (do you want to track changes?)?
One thing I'm wondering is whether your clients have to pick one and only one of these levels to record their address at. In that case, I'd have either the model I suggested above, or a client dimension with 3 IDs and a CHECK constraint to ensure that only one of them is ever populated. This would then be supported by the fact that a Street would have a CountyID, etc. So you would determine this kind of "channel" by which ID is populated in the client dimension.
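The "exactly one of three IDs" rule could be sketched as below. This assumes SQLite, where a comparison such as `StreetID IS NOT NULL` evaluates to 0 or 1 and can be summed; other databases may need a `CASE` expression instead. The table name `DimClient` and the `try_insert` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DimClient (
        ClientID INTEGER PRIMARY KEY,
        StreetID INTEGER,
        CountyID INTEGER,
        StateID  INTEGER,
        -- Each comparison evaluates to 0 or 1, so the sum counts populated IDs.
        CHECK (
            (StreetID IS NOT NULL) + (CountyID IS NOT NULL) + (StateID IS NOT NULL) = 1
        )
    )
""")

def try_insert(row):
    """Return True if the row satisfies the constraint, False otherwise."""
    try:
        conn.execute("INSERT INTO DimClient VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

street_only = try_insert((1, 100, None, None))   # accepted
state_only = try_insert((2, None, None, 1))      # accepted
two_levels = try_insert((3, 100, 10, None))      # rejected: two IDs populated
no_level = try_insert((4, None, None, None))     # rejected: nothing populated
print(street_only, state_only, two_levels, no_level)
```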

Identifying the fact table in data warehouse design

I'm trying to design my first data mart with a star schema from an Excel sheet containing information about help desk service calls. The sheet contains 33 fields of various kinds, and I can't identify the fact table, because I want to do the reporting later based on different KPIs.
I want to know how to identify the fact table measures easily and I have another question which is : Can a fact table contain only foreign keys of dimensions and no measures? Thanks in advance guys and sorry for my bad English.
You can have more than one fact table.
A fact table represents an event or process that you want to analyze.
The structure of a fact table depends on the process or event that you are trying to analyze.
You need to tell us the events or processes that you want to analyze before we can help you further.
Can a fact table contain only foreign keys of dimensions and no measures?
Yes. This is called a factless fact table.
Let's say you want to do a basic analysis of calls:
Your full table might look like this
CALL_ID
START_DATE
DURATION
AGENT_NAME
AGENT_TENURE (how long worked for company)
CUSTOMER_NAME
CUSTOMER_TENURE (how long a customer)
PRODUCT_NAME (the product the customer is calling about)
RESOLVED
You would turn this into a fact table like this:
CALL_ID
START_DATE_KEY
AGENT_KEY
CUSTOMER_KEY
PRODUCT_KEY
DURATION (measure)
RESOLVED (quasi-measure)
And you would have a DATE dimension table, AGENT dimension table, CUSTOMER dimension table and PRODUCT dimension table.
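Here is a cut-down sketch of that call fact table with just the AGENT dimension attached, to show how KPIs fall out of it. The units (tenure in months, duration in seconds), the integer date key format and the sample rows are assumptions; the CUSTOMER, PRODUCT and DATE dimensions are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DIM_AGENT (
        AGENT_KEY    INTEGER PRIMARY KEY,
        AGENT_NAME   TEXT NOT NULL,
        AGENT_TENURE INTEGER NOT NULL        -- assumed to be in months
    );
    CREATE TABLE FACT_CALL (
        CALL_ID        INTEGER PRIMARY KEY,
        START_DATE_KEY INTEGER NOT NULL,     -- e.g. 20240105, would join DIM_DATE
        AGENT_KEY      INTEGER NOT NULL REFERENCES DIM_AGENT (AGENT_KEY),
        DURATION       INTEGER NOT NULL,     -- measure, assumed to be in seconds
        RESOLVED       INTEGER NOT NULL CHECK (RESOLVED IN (0, 1))
    );
    INSERT INTO DIM_AGENT VALUES (1, 'Ana', 24), (2, 'Ben', 3);
    INSERT INTO FACT_CALL VALUES
        (1, 20240105, 1, 300, 1),
        (2, 20240105, 1, 500, 0),
        (3, 20240106, 2, 200, 1);
""")

# Storing RESOLVED as 0/1 lets it be summed like a measure.
kpis_by_agent = conn.execute("""
    SELECT a.AGENT_NAME,
           COUNT(*)        AS calls,
           AVG(f.DURATION) AS avg_duration,
           SUM(f.RESOLVED) AS resolved_calls
    FROM FACT_CALL AS f
    JOIN DIM_AGENT AS a ON a.AGENT_KEY = f.AGENT_KEY
    GROUP BY a.AGENT_NAME
    ORDER BY a.AGENT_NAME
""").fetchall()
print(kpis_by_agent)
```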
Agile Data Warehouse Design is a good book, as are the ones by Kimball.
In general, the way I've done it (and there are a number of ways to do anything) is that the categorical data is referenced with an FKey in the fact table, while anything you want to perform aggregations on (typically money, integer or double data types) can be in the fact table as well. So for example, a fact table might contain a hierarchy of types, such as product_category >> product_name, and it usually contains a time and/or location field as well, all of which would be referenced by an FKey to a lookup table. The measure columns are usually integer based or money data, and are used in aggregate functions grouped by the other fields like this:
select sum(measureOne) as sum, product_category from facttable
where timeCol between X and Y group by product_category...etc
At one time a few years ago, I did have a fact table that had no measure column... because the only measure I had was based on count, which I would do dynamically by grouping different dimensions in the fact table.
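A fact table without measures can be sketched like this: every column is a dimension key, and the only "measure" is the row count under whatever grouping the report needs. The table name, key values and rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Every column is a key into a dimension (dimension tables omitted here);
    -- there is no measure column at all.
    CREATE TABLE fact_call_event (
        date_key    INTEGER NOT NULL,
        agent_key   INTEGER NOT NULL,
        product_key INTEGER NOT NULL
    );
    INSERT INTO fact_call_event VALUES
        (20240105, 1, 10),
        (20240105, 1, 11),
        (20240105, 2, 10),
        (20240106, 2, 10);
""")

calls_per_day = conn.execute("""
    SELECT date_key, COUNT(*) AS calls
    FROM fact_call_event
    GROUP BY date_key
    ORDER BY date_key
""").fetchall()

calls_per_product = conn.execute("""
    SELECT product_key, COUNT(*) AS calls
    FROM fact_call_event
    GROUP BY product_key
    ORDER BY product_key
""").fetchall()
print(calls_per_day, calls_per_product)
```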

How are dimensions and fact tables related in a star diagram?

If you have a relational database and you want to start making reports, you might do the following (please let me know if this is incorrect).
Go through your relational database and make a list of all the columns that you want to include in your report.
Group related columns together and then split those (normalise) into additional tables. These are the dimensions.
The dimensions then have a primary key (possibly a combination of two columns), and the fact table has a foreign key to reference each dimension, plus fields that you don't separate out in the first place such as sales value.
The question:
I was originally seeing dimensions as data marts that referenced data from external sources, and a fact table that in turn referenced data in the dimensions.. that's incorrect, isn't it? It's the other way around...
Or in general, if you were to normalise a database you would always replace the columns you take out a table with a foreign key, and add a primary key to the new table?
A fact table represents a process or event that you want to analyze.
Step 1: What is the process or event that you want to analyze?
The columns in the fact table represent all of the variables that are pertinent to your analysis.
Step 2: What variables are pertinent to the analysis?
Whether you "split-out" columns into dimension tables is irrelevant to your understanding. It's an optimization to minimize the space taken up by fact tables.
If you want to discriminate between measures and dimensions, ask
Step 3: What are the (true) numeric values in my fact table? These are your measures.
An example of a true numeric value is a dollar amount, like Sales Order Line Item Extended Price. You can sum it up or take an average of it.
An example of a not true numeric value is Customer ID 12345. It's a number, but represents something that isn't a number (a customer). The sum of customer ids makes no sense, nor does the average. Dig?
Regarding your questions:
Fact tables do not need foreign keys to dimension tables. (hint: see Hot-Swappable Dimensions)
"dimensions as data marts that referenced data from external sources". Hm...maybe, but don't worry about data marts for now. A dimension is just a column in your fact table (that isn't a measure). A dimension table is just a collection of dimensions that are related.
Just start with Excel. Figure out the columns you need in your analysis. Put them in Excel. That's your fact table. If you expect your fact table to get large (100s of MB), then do ONE level of normalization:
Figure out your measures. Leave them in the fact table.
Figure out your dimensions. Group them together (Customer info into one group, Store info into another).
Put them in their own tables. Give them meaningless surrogate keys. Put those keys in the fact table.
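The one level of normalization described in those steps could look roughly like this in plain Python. The flat rows, the column names and the `split_out` helper are all made up for the illustration; the point is only that related columns move into their own table and a meaningless integer key takes their place in the fact rows.

```python
# A flat "Excel-style" extract; the column names and values are made up.
flat_rows = [
    {"customer_name": "Acme", "customer_city": "Leeds", "store": "North", "sales_value": 120.0},
    {"customer_name": "Acme", "customer_city": "Leeds", "store": "South", "sales_value": 80.0},
    {"customer_name": "Bolt", "customer_city": "York", "store": "North", "sales_value": 50.0},
]

def split_out(rows, dim_columns, key_name):
    """Move dim_columns into a dimension table and leave a surrogate key behind."""
    dimension = {}   # attribute tuple -> surrogate key
    facts = []
    for row in rows:
        attrs = tuple(row[c] for c in dim_columns)
        # setdefault assigns the next meaningless integer the first time
        # a combination of attributes is seen.
        key = dimension.setdefault(attrs, len(dimension) + 1)
        fact = {k: v for k, v in row.items() if k not in dim_columns}
        fact[key_name] = key
        facts.append(fact)
    dim_table = [
        {key_name: key, **dict(zip(dim_columns, attrs))}
        for attrs, key in dimension.items()
    ]
    return dim_table, facts

# Group customer columns into one dimension and store columns into another.
dim_customer, fact_sales = split_out(
    flat_rows, ["customer_name", "customer_city"], "customer_key"
)
dim_store, fact_sales = split_out(fact_sales, ["store"], "store_key")

print(dim_customer)
print(dim_store)
print(fact_sales)
```

After both passes the fact rows hold only the measure (`sales_value`) and the two surrogate keys.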

Should I flatten multiple customers into one dimension row or use a bridge table?

I'm new to data warehousing and I have a star schema with a contract fact table. It holds basic contract information like start date, end date, amount, etc.
I have to link these facts to a customer dimension. There's a maximum of 4 customers per contract. So I think I have two options. Either I flatten the 4 customers into one row, e.g.:
DimCustomers
name1, lastName1, birthDate1, ... , name4, lastName4, birthDate4
The other option, from what I've heard, is to create a bridge table between the facts and the customer dimension, which makes the model more complex.
What do you think I should do? What are the advantages and drawbacks of each solution, and is there a better one?
I would start by creating a customer dimension with all customers in it, and with only one customer per row. A customer dimension can be a useful tool by itself for CRM and other purposes and it means you'll have a single, reliable list of customers, which makes whatever design you then implement much easier.
After that it depends on the relationship between the customer(s) and the contract. The main scenarios I can think of are that a) one contract has 4 customer 'roles', b) one contract has 1-4 customers, all with the same role, and c) one contract has 1-n customers, all with the same role.
Scenario A would be that each contract has 4 customer roles, e.g. one customer who requested the contract, a second who signs it, a third who witnesses it and a fourth who pays for it. In that case your fact table will have one row per contract and 4 customer ID columns, each of which references the customer dimension:
...
RequesterCustomerID int,
SignatoryCustomerID int,
WitnessCustomerID int,
BillableCustomerID int,
...
Of course, if one customer is both a requester and a witness then you'll have the same ID in both RequesterCustomerID and WitnessCustomerID because you only have one row for him in your customer dimension. This is completely normal.
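For Scenario A, a minimal sketch of the four role columns might look like this. It assumes SQLite, the table names `DimCustomer` and `FactContract`, and invented customers; the four ID column names are the ones listed above. The customer dimension is joined once per role under a different alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimCustomer (
        CustomerID INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL
    );
    CREATE TABLE FactContract (
        ContractID          INTEGER PRIMARY KEY,
        Amount              NUMERIC NOT NULL,
        RequesterCustomerID INTEGER NOT NULL REFERENCES DimCustomer (CustomerID),
        SignatoryCustomerID INTEGER NOT NULL REFERENCES DimCustomer (CustomerID),
        WitnessCustomerID   INTEGER NOT NULL REFERENCES DimCustomer (CustomerID),
        BillableCustomerID  INTEGER NOT NULL REFERENCES DimCustomer (CustomerID)
    );
    INSERT INTO DimCustomer VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    -- Alice is both requester and witness, so her ID appears twice.
    INSERT INTO FactContract VALUES (500, 10000, 1, 2, 1, 3);
""")

contract_roles = conn.execute("""
    SELECT f.ContractID,
           req.Name AS Requester,
           sig.Name AS Signatory,
           wit.Name AS Witness,
           bil.Name AS Billable
    FROM FactContract AS f
    JOIN DimCustomer AS req ON req.CustomerID = f.RequesterCustomerID
    JOIN DimCustomer AS sig ON sig.CustomerID = f.SignatoryCustomerID
    JOIN DimCustomer AS wit ON wit.CustomerID = f.WitnessCustomerID
    JOIN DimCustomer AS bil ON bil.CustomerID = f.BillableCustomerID
""").fetchall()
print(contract_roles)
```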
Scenario B is that all customers have the same role, e.g. each contract has 1-4 signatories. If the number of signatories can never be more than 4, and if you're very confident that this will 'always' be true, then the simple solution is also to have one row per contract in the fact table with 4 columns that reference the customer dimension:
...
SignatoryCustomer1 int,
SignatoryCustomer2 int,
SignatoryCustomer3 int,
SignatoryCustomer4 int,
...
Even if most contracts only have 1 or 2 signatories, it's not doing much harm to have 2 less frequently used columns in the table.
Scenario C is where one contract has 1-n customers, where n is a number that varies widely and can even be very large (class action lawsuit?). If you have 50 customers on one contract, then adding 50 columns to the fact table becomes difficult to manage. In this case I would add a bridge table called ContractCustomers or whatever that links the fact table with the customer dimension. This isn't as 'neat' as the other solutions, but a pure star schema isn't very good at handling n:m relationships like this anyway.
There may also be more complex cases, where you mix scenarios A and C: a contract has 3 requesters, 5 signatories, 2 witnesses and the bill is split 3 ways between the requesters. In this case you will have no choice but to create some kind of bridge table that contains the specific customer mix for each contract, because it simply can't be represented cleanly with just one fact and one dimension table.
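A bridge table for the 1-n and mixed cases could be sketched as follows. The `Role` column is an assumption added here to cover the mixed scenario, and the table contents are invented. The second half of the snippet shows a pitfall worth knowing about: joining the fact table through the bridge repeats each contract's amount once per bridge row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimCustomer (CustomerID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE FactContract (ContractID INTEGER PRIMARY KEY, Amount NUMERIC NOT NULL);
    -- Bridge: one row per contract, customer and role. The Role column is an
    -- assumption added here to cover the mixed case.
    CREATE TABLE ContractCustomers (
        ContractID INTEGER NOT NULL REFERENCES FactContract (ContractID),
        CustomerID INTEGER NOT NULL REFERENCES DimCustomer (CustomerID),
        Role       TEXT NOT NULL,
        PRIMARY KEY (ContractID, CustomerID, Role)
    );
    INSERT INTO DimCustomer VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO FactContract VALUES (500, 10000), (501, 2500);
    INSERT INTO ContractCustomers VALUES
        (500, 1, 'Requester'), (500, 2, 'Signatory'), (500, 3, 'Signatory'),
        (501, 2, 'Requester'), (501, 2, 'Signatory');
""")

# Adding another signatory is just another row; no table change is needed.
signatories = conn.execute("""
    SELECT b.ContractID, COUNT(*) AS signatories
    FROM ContractCustomers AS b
    WHERE b.Role = 'Signatory'
    GROUP BY b.ContractID
    ORDER BY b.ContractID
""").fetchall()

# Joining the fact through the bridge repeats Amount once per bridge row,
# so sum the amount from the fact table alone to avoid double counting.
naive_total = conn.execute("""
    SELECT SUM(f.Amount)
    FROM FactContract AS f
    JOIN ContractCustomers AS b ON b.ContractID = f.ContractID
""").fetchone()[0]
true_total = conn.execute("SELECT SUM(Amount) FROM FactContract").fetchone()[0]
print(signatories, naive_total, true_total)
```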
Either way can work but each solution has different implications. Certainly you need customer and contract tables. A key question is: is it always a maximum of four or may it eventually increase beyond that?
If it will stay at 4, then you can have a repeating group of 4 customer IDs in the contract. The disadvantage of this is that it is fixed. If a contract does not have four, there are some empty spaces.
If, however, there might be more than 4, then the only viable solution is to use a bridge table because in a bridge table you add more customers by inserting new rows (with no change to the table structure). In the fixed solution, in this case you add more than 4 customers by altering the table. A bridge table is an example of what, for many decades now, ER modeling has called an associative entity. It is the more flexible of the two solutions.
However, I worked on a margin system once wherein large margin amounts needed five levels of manager approval. It has been five and will always be five, they told me. Each approving manager represented a different organizational level. In this case, we used a repeating group of five manager IDs, one for each level, and included them in the trade. So it is important to understand the current business rules and the future outlook.
