Modelling customer onboarding in fact or dimension table? - data-warehouse

I'm modelling the customer onboarding pipeline as an accumulating snapshot fact table. I'm considering the Joy Mundy's design tip to model this as a long-lived business process where I have a set of milestones which are updated as customers move through the pipeline. On top of this, some facts such as days from stage to stage are calculated.
As this table will have the same amount of records as my customer dimension, is it best practice to just add these fields to the customer dimension or keep a separate fact table with a one-to-one relationship?

Related

HR Data Mart Design Advice

I am working on a design for an HR data mart using the Kimball approach outlined in 'The Data Warehouse Toolkit'.
As per the Kimball design, I was planning to have a time-stamped, slowly-changing dimension to track employee profile changes (to support point-in-time analysis of employee state) and a head-count periodic snapshot fact table to support measures of new hires, leavers, leave taken, salary paid etc.
The problem I've encountered is that, in some cases, our employees can be assigned to multiple roles/jobs and each one needs to be tracked separately (i.e. the grain of my facts has to be at job-level, not employee level).
How might the Kimball design be adapted to fit a scenario where employee and role/job form a hierarchy like this? Ideally, I want to avoid duplicating employee profile data (address, demographics etc) for each role/job an employee is assigned to, but does this mean I need to snow-flake the dimension?
Options I've been considering include the below - I'd be interested in any thoughts or suggestions the community has on this so all input is welcome!
1) (see attached, design 1) A snowflake-style approach with an employee table which has a 1-to-Many link role table, which, in turn, has a 1-to-many link with the fact table. The advantage here is a clean employee dimension but I don't want to introduce unnecessary complexity. Is there any reason why I shouldn't link both dimensions directly to the fact table? The snowflake designs I've seen don't seem to do this.
2) (see attached, design 2) A combined Employee/Role dimension where each employee has a record for each assigned role but only one on them is flagged as 'Primary Role'. Point-in-time queries on the dimension can be performed by constraining on the 'Primary Role' flag.
Anything that occurred is an event and can be a fact. When you look at relationships between data, you need to also ask if the data value describes the entity (dim) or something that happened to/with the entity(fact). Everything can be a dim or a fact.(sometimes both)
A job describes an event that happened to the employee. You should have a fact employeejob that relates to the Dim employee and Dim job (as well as your date dimensions). This will then allow you to break down absences by employee and job. Your dim job would really just be job title, pay grades, etc. The fact would contain effective dates. Research factless fact tables.
Note that your vacancy reference would be part of a separate fact (when/where did you post it, how many applicants are all measurable facts about the vacancy). This may also be an example of a degenerate dimension.
I'm not fond of your monthly fact. I think that should just be some calculated measures built on fact absence and fact employeejob. When those events are put up against your dimensions, you can break them down by date, job type, manager, etc.

FactLoanVolume - One or Many Fact Tables

I am designing a Fact table to report on loan volume. The grain is one row per loan transaction. A loan has a few major milestones that we report on: In order of sequence, these are Lock Volume, Loan Funding Volume and Loan Sales Volume.
I have Lock Date, Loan Funding Date and Loan Sale Date as FK (there are other dimensions in addition to these) in the Fact table to role playing dimensions off my DimDate table.
My question is, should I create separate Fact Tables to report volume for each major milestone or should I keep all of this in one Fact Table and use a "far in the future" date (e.g., 12/31/2099) for a milestone on a loan that has not been met?
I have read the Kimball books but I didn't find a definitive answer(if one even exists).
Thanks
You may profit from immutable design, by setting the granularity more fine to the milestone level.
This gives you columns
transaction_id
milestone_type
milestone_date
in you fact table. The actual milestone of a transaction is the milestone from the last (most recent) record.
The one adavatage is that you may add new milestone types in the future, but the main gain is, that you never update your fact table - you use inserts only.
You may safe rollback a wrong ETL load, simple by deleting the records; which is while using updates much complicated.
You may also implement more complicated state diagrams, e.g. in case when some milestone is revoked and the transaction falls back in the previous state.
The question if you use one fact table or more depends on the fact if your milestones are homogenous or not. If the milestones have distinct attributes, you may get a more clean desing using dedicated fact tables, but the queries get complicated.
You would rather have only one Fact Table.
That following question and its conversation answer pretty well to the general question of " One or multiple fact tables? ", but maybe not to how to deal with your specific problem of dates.

Fact and Dimension Tables in DW

I wonder why fact tables are bigger in size than dimension tables in data warehouses. Dimension tables contain the attribute-level information, and are highly de-normalized, so why are dimension tables not bigger in size ?
I could start by stealing some words off Kimball
"Dimensional modeling begins by dividing the world into measurements and context."
https://www.kimballgroup.com/2003/01/fact-tables-and-dimension-tables/
Fact tables record business activities or events and for that reason fact tables could grow in size. Dim Tables store information on different contexts.
For eg: In an university 100 students might be enrolling in 10 subjects. Now if you see the dims, Dim_Student and Dim_Subject, in this scenario they might have 100 rows and 10 rows each. But the activity of enrolments will be much more, as students can enrol into 0 or many subjects at the same time. This could lead to the Fact_Enrolment(which records the enrolment activities) table having lot more rows when compared to the dims.
Note: However in my experience I have also worked with facts where the fact tables have lesser rows when compared to the dims, at a particular point in time. They might grow in size eventually when the DataWarehouse grows.
Hope that helps.
Dimensions contain entity level information whereas facts contain transaction level information and for a dimension multiple transaction can take place over a period of time. For example, in a HR system, there can be a person dimension containing personal details of all the employees wherein typically there may be 1-3 records for each employee.
Fact tables will store multiple transactions of the employees e.g., hires, promotion. movement/change of departments, leaves Termination etc. so corresponding to one-person record in person dimension there will be multiple records in facts.
Also Fact Tables contains facts / measures corresponding to multiple dimensions
And so facts are joined with multiple dimensions using a surrogate key/ foreign key reference to different dimensions which makes the fact table heavier than dimensions.
Dimension tables contains the attribute level information and highly de-normalized
Actually, I doubt that dimension tables are "highly de-normalized". Generally speaking, each row in a dimension table is identified by a primary key so there is very less scope of having duplicates in them. This can explain why they do not get too big in size compared to fact tables.

Material and Product hierarchies

I have a master data with both the material and product details in a single table. I am creating a star schema and my question is do i need to make two dimension table with separate material attributes and product attributes or can i have both in a single dimension table? The current master data looks has the following fields -
Material id, name, type, product hier 1,2,3,4...product hierarchy, product category, sub category. In my case both material and product are same, so a single id.
I am thinking of making it in a single table, but is that the best practice? Any future potential issues?
Many thanks in advance,
Arun
The important (and obvious) thing is, that the fact table has two separate foreign keys: PRODUCT_ID and MATERIAL_ID, both referencing your single dimension table.
This setup is not always best practice for OLTP systems, because in this case the database can't enforce the referential integrity. (You may store a product ID in the MATERIAL_ID column).
But for data-warehouse the database constraints are typically not enabled and are enforced in the loading job, so this setup is fine.
The decision to split is more dependent on the origin of the two dimensions. If both of them are maintained together, I see no reason to split them. If the two dimension are independent, with different lifecycles and separate sources, there is no reason to combine them.
And BTW Kimball IMO mentions the split of hierarch levels (not separate dimensions). So he sees as an mistake to split the product attributes and the hiearchy and category attributes (which is not your problem).
It depends on your business requirement.
If you ever need to produce a report that shows (say) units produced of product category by material, then you need to keep them in separate dimensions.

Fact table linked to Slowly Changing Dimension

I'm struggling to understand the best way to model a particular scenario for a data warehouse.
I have a Person dimension, and a Tenancy dimension. A person could be on 0, 1 or (rarely) multiple tenancies at any one time, and will often have a succession of tenancies over time. A tenancy could have one or more people associated with it. The people associated with a tenancy can change over time, and tenancies generally last for many years.
One option is to add tenancy reference, start and end dates to the Person Dimension as type 2 SCD columns. This would work well as long as I ignore the possibility of multiple concurrent tenancies for a person. However, I have other areas of the data warehouse where I am facing a similar design issue and ignoring multiple relationships is not a possibility.
Another option is to model the relationship as an accumulating snapshot fact table. I'm not sure how well this would work in practice though as I could only link it to one version of a Person and Tenancy (both of which will have type 2 SCD columns) and that would seem to make it impossible to produce current or historical reports that link people and tenancies together.
Are there any recommended ways of modelling this type of relationship?
Edit based on the patient answer and comments given by SQL.Injection
I've produced a basic model showing the model as described by SQL.Injection.
I've moved tenancy start/end dates to the 'junk' dimension (Dim.Tenancy) and added Person tenancy start/end dates to the fact table as I felt that was a more accurate way to describe the relationship.
However, now that I see it visually I don't think that this is fundamentally any different from the model that I started with, other than the fact table is a periodic snapshot rather than an accumulating snapshot. It certainly seems to suffer from the same flaw that whenever I update a type 2 slowly changing attribute in any of the dimensions it is not reflected in the fact.
In order to make this work to reflect current changes and also allow historical reporting it seems that I will have to add a row to the fact table every time a SCD2 change occurs on any of the dimensions. Then, in order to prevent over-counting by joining to multiple versions of the same entity I will also need to add new versions of the other related dimensions so that I have new keys to join on.
I need to think about this some more. I'm beginning to think that the database model is right and that it's my understanding of how the model will be used that is wrong.
In the meantime any comments or suggestions are welcome!
Your problem is similar to to the sale transactions with multiple item. The difference, is that a transaction usually has multiple items and your tenancy fact usually has a single person (the tenant).
Your hydra is born because you are trying to model the tenancy as a dimension, when you should be modeling it as a fact.
The reason why I think you have a tenancy dimension, is because somewhere you have a fact rent. To model the fact rent consider use the same approach i stated above, if two persons are tenants of the same property two fact records should be inserted each month:
1) And now comes some magic (that is no magic at all), split the value of the of the rent by the number of tenants and store it the fact
2) store also the full value of the rent (you don't know how the data scientist is going to use the data)
3) check 1) with the business user (i mean people that build the risk models); there might be some advanced rule on how to do the spliting (a similar thing happens when the cost of shipping is to be divided across multiple item lines of the same order -- it might not be uniformly distributed)

Resources