HR Data Mart Design Advice - data-warehouse

I am working on a design for an HR data mart using the Kimball approach outlined in 'The Data Warehouse Toolkit'.
As per the Kimball design, I was planning to have a time-stamped, slowly-changing dimension to track employee profile changes (to support point-in-time analysis of employee state) and a head-count periodic snapshot fact table to support measures of new hires, leavers, leave taken, salary paid etc.
The problem I've encountered is that, in some cases, our employees can be assigned to multiple roles/jobs and each one needs to be tracked separately (i.e. the grain of my facts has to be at job-level, not employee level).
How might the Kimball design be adapted to fit a scenario where employee and role/job form a hierarchy like this? Ideally, I want to avoid duplicating employee profile data (address, demographics etc) for each role/job an employee is assigned to, but does this mean I need to snow-flake the dimension?
Options I've been considering include the below - I'd be interested in any thoughts or suggestions the community has on this so all input is welcome!
1) (see attached, design 1) A snowflake-style approach with an employee table which has a 1-to-many link to a role table, which, in turn, has a 1-to-many link with the fact table. The advantage here is a clean employee dimension, but I don't want to introduce unnecessary complexity. Is there any reason why I shouldn't link both dimensions directly to the fact table? The snowflake designs I've seen don't seem to do this.
2) (see attached, design 2) A combined Employee/Role dimension where each employee has a record for each assigned role but only one of them is flagged as 'Primary Role'. Point-in-time queries on the dimension can be performed by constraining on the 'Primary Role' flag.

Anything that occurred is an event and can be a fact. When you look at relationships between data, you also need to ask whether the data value describes the entity (dim) or something that happened to or with the entity (fact). Everything can be a dim or a fact (sometimes both).
A job describes an event that happened to the employee. You should have a fact employeejob that relates to the Dim employee and Dim job (as well as your date dimensions). This will then allow you to break down absences by employee and job. Your dim job would really just be job title, pay grades, etc. The fact would contain effective dates. Research factless fact tables.
Note that your vacancy reference would be part of a separate fact (when/where did you post it, how many applicants are all measurable facts about the vacancy). This may also be an example of a degenerate dimension.
I'm not fond of your monthly fact. I think that should just be some calculated measures built on fact absence and fact employeejob. When those events are put up against your dimensions, you can break them down by date, job type, manager, etc.
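
As a rough illustration of the factless fact table suggested above, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (dim_employee, dim_job, fact_employee_job, effective_from, effective_to) and the sample rows are assumptions made for the example, and ISO date strings stand in for proper date dimension keys. The point is that the employee profile lives in one row per employee, while each job assignment becomes one fact row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_employee (employee_key INTEGER PRIMARY KEY, employee_name TEXT);
CREATE TABLE dim_job (job_key INTEGER PRIMARY KEY, job_title TEXT, pay_grade TEXT);
-- Factless fact: one row per employee-to-job assignment, no numeric measures.
CREATE TABLE fact_employee_job (
    employee_key   INTEGER NOT NULL REFERENCES dim_employee(employee_key),
    job_key        INTEGER NOT NULL REFERENCES dim_job(job_key),
    effective_from TEXT NOT NULL,
    effective_to   TEXT NOT NULL  -- '9999-12-31' marks a still-open assignment
);
INSERT INTO dim_employee VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO dim_job VALUES (10, 'Nurse', 'G5'), (20, 'Trainer', 'G6');
INSERT INTO fact_employee_job VALUES
    (1, 10, '2020-01-01', '9999-12-31'),
    (1, 20, '2021-06-01', '9999-12-31'),
    (2, 10, '2019-03-01', '2021-12-31');
""")

def jobs_held_on(as_of):
    """Return (job_title, assignment_count) pairs for assignments active on an ISO date."""
    return conn.execute("""
        SELECT j.job_title, COUNT(*)
        FROM fact_employee_job f
        JOIN dim_job j ON j.job_key = f.job_key
        WHERE ? BETWEEN f.effective_from AND f.effective_to
        GROUP BY j.job_title
        ORDER BY j.job_title
    """, (as_of,)).fetchall()

print(jobs_held_on("2021-07-01"))  # [('Nurse', 2), ('Trainer', 1)]
```

Ann holds two jobs at once here without her profile row being duplicated, and head-count style measures can be derived by counting assignment rows for a given date.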

Related

Is a table (from source system) that contains only relationships and current status of a row from another table a fact table in Data Warehouse?

I am developing a BI system for our company, from scratch, and currently, I am designing a data warehouse. I am completely new to this so there are many things that I don't really understand, so I need to hear some more insights into this.
My problems are:
1) In our source system, there are tables called "Booking" and "BookingAccess". Booking table holds the data of a booking, such as check-in time and check-out time, booking date, booking number, gross amount of that booking.
BookingAccess, by contrast, holds the foreign keys related to the booking, such as bookerID, customerID, processID, hotelID, paymentproviderID, and the current status of that booking. Booking and BookingAccess have a 1:1 relationship.
Our source system checks the validity of those bookings; the bookings are not ours. We receive the booking information from other sources and carry out the above process on their behalf. The gross amount is just a piece of information about the booking that we need to validate; it is not part of our business. The current status held in the BookingAccess table is the status of that booking in our system, which can be "Processing" or "Finished".
From what I have read of Ralph Kimball, in this situation "Booking" is the dimension table and BookingAccess should be the fact. I feel that BookingAccess is somewhat like an accumulating snapshot table, in which I should track the time when a booking becomes "Processing" and when it becomes "Finished".
Do I get it right?
2) In the "Booking" table, there is also a foreign key called "ImportID". This key links to a table called "Import". The "Import" table holds history records of the files that were imported into our system (these files contain the bookings which will be written to the "Booking" table), including attributes such as file name, imported date, total bookings imported...
From my point of view, this is clearly a fact table.
But the problem is that the "Import" table and the "Booking" table have a one-to-many relationship (one ImportID in the "Import" table can have 1, 2 or more records with the same ImportID in the "Booking" table). This goes against the idea of fact tables, which insists that the relationship between fact and dimension must be many-to-one, with the fact always on the many side.
So what approach should I use to solve this case? I'm thinking of using bridge tables to solve this problem. But I don't know if this is good practice, as there are a lot of records in the "Import" table, so I would have to create a big bridge table just to cover all of them.
3) Should I split a source-system table that contains a mix of relationships and information into a fact table containing only the relationships and a dimension table containing only the information? (For example, a table called "Customer" in the source system, which contains things like customer name, customer address, customertype id, customer parentID....)
I am asking because, if I use BI tools to analyze things (for example, counting the customers with customertypeid = 1), it feels somewhat weird if no fact tables are involved.
Or should I treat it as a mere dimension table and use a snowflake schema? But this will lead to a mix of star schema and snowflake schema in our data warehouse. Is this normal? I have read some official sources (most likely Oracle) stating that one should avoid using and mixing in snowflake schemas as much as possible. But some sources, like Microsoft, say that this is very normal. Even the AdventureWorks data warehouse sample database uses this kind of approach.
Or should I de-normalize every relation in that "Customer" table? I don't think this is a good approach, as it will make the Customer dimension contain a lot of columns, and it will be very hard to track the history of every row in the "DIM_Customer" table. For example, if any change occurs in any relation of the "Customer" table, the whole "DIM_Customer" table will need to be updated.
I still have a lot of questions regarding data warehouses. I am working on this nearly alone, without any help or consultant, so pardon me if I have caused any inconvenience or made mistakes.

Problems with Column in Fact Table

I'm building a DW just like the one from AdventureWorks. I have one fact table called FactSales, and there's a table in the database called SalesReason that tells us the reason why a certain customer buys our product.
The thing is, there are two types of customers, the resellers and the online customers, and only the online customers have a sales reason linked to them.
First of all, can I have two dimension tables pointing to the same FK in the fact? Like in my case: Sk_OnlineCustomer and SK_Resseler both point to FK_Customer. Their ID numbers don't overlap.
And second,
Should I build a reason dimension, link it to the fact and have a FK that most of the time is null or points to a "dummy reason"?
Should I just put the reason in the fact sales without it being a key, just like a technical description that is nullable?
Should I divide the fact into two fact tables, one for the resellers and one for the online customers? But even in that case, I would have some customers that don't give a reason, so the fk_reason would be null in some of its appearances in the new fact_Online_Customer.
In a solution I saw in the AdventureWorks tutorial, a new fact table called fact_reason is created. It links the factSales with a DimReason.
That looks like a good solution, but I don't know how it works, because I never learned in my classes that I could link a fact to a fact, so I wouldn't be able to justify my choice to my teacher.
If you could explain it I would appreciate it.
Thanks!
Please find my comments for your questions:
First of all, can I have two dimension tables pointing to the same FK in the fact? Like in my case: Sk_OnlineCustomer and SK_Resseler both point to FK_Customer. Their ID numbers don't overlap.
Yes. The dimension in this case would be, for example, Dim_Customer, and it could be a role-playing dimension. You can expose reporting views to separate the online customers from the reseller customers.
And second, should I build a reason dimension, link it to the fact and have a FK that most of the time is null or points to a "dummy reason"?
Yes, it would make sense to build a reason dimension. With it you can tag each fact record with its reason.
Should I divide the fact into two fact tables, one for the resellers and one for the online customers? But even in that case, I would have some customers that don't give a reason, so the fk_reason would be null in some of its appearances in the new fact_Online_Customer.
I would suggest you keep one fact, as your business activity is sales; you can add context to it (online or reseller) using your dimensions. If you prefer, you can have a separate Dim_Sales dimension to hold the sales type and other details of the sale which you cannot include in the fact.
To summarise, you would probably be well off with the following design:
Fact_Sales linked to
Dim_Customer
Dim_Sales
Dim_Reason (this could also be folded into Dim_Sales)
Dim_Date(always include a date dimension when you build a DWH solution)
Hope that helps...
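
To make the single-fact design with a reason dimension more concrete, here is a small sketch using Python's built-in sqlite3 module. It assumes a reserved "Not applicable" row (key 0) in the reason dimension, which is the "dummy reason" idea from the question rather than something the answer spells out; the table names, column names and sample figures are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_type TEXT);
CREATE TABLE dim_reason (reason_key INTEGER PRIMARY KEY, reason_name TEXT);
CREATE TABLE fact_sales (
    customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
    reason_key   INTEGER NOT NULL REFERENCES dim_reason(reason_key),
    sales_amount REAL NOT NULL
);
INSERT INTO dim_customer VALUES (1, 'Online'), (2, 'Reseller');
-- Key 0 is the assumed placeholder row, so the fact never carries a NULL key.
INSERT INTO dim_reason VALUES (0, 'Not applicable'), (1, 'Price'), (2, 'Promotion');
INSERT INTO fact_sales VALUES
    (1, 1, 100.0),  -- online sale with a reason
    (1, 2, 50.0),   -- online sale with a reason
    (2, 0, 400.0),  -- reseller sale: no reason collected
    (1, 0, 25.0);   -- online sale where the customer gave no reason
""")

# Inner joins keep every sale because each fact row points at a real reason row.
sales_by_reason = conn.execute("""
    SELECT c.customer_type, r.reason_name, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_reason r ON r.reason_key = f.reason_key
    GROUP BY c.customer_type, r.reason_name
    ORDER BY c.customer_type, r.reason_name
""").fetchall()

for row in sales_by_reason:
    print(row)
```

Because no key is NULL, the total across all reasons still equals total sales, which is the usual argument for a placeholder row over a nullable foreign key.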

FactLoanVolume - One or Many Fact Tables

I am designing a Fact table to report on loan volume. The grain is one row per loan transaction. A loan has a few major milestones that we report on: In order of sequence, these are Lock Volume, Loan Funding Volume and Loan Sales Volume.
I have Lock Date, Loan Funding Date and Loan Sale Date as FK (there are other dimensions in addition to these) in the Fact table to role playing dimensions off my DimDate table.
My question is, should I create separate Fact Tables to report volume for each major milestone or should I keep all of this in one Fact Table and use a "far in the future" date (e.g., 12/31/2099) for a milestone on a loan that has not been met?
I have read the Kimball books but I didn't find a definitive answer(if one even exists).
Thanks
You may profit from an immutable design by setting the grain finer, at the milestone level.
This gives you columns
transaction_id
milestone_type
milestone_date
in your fact table. The current milestone of a transaction is the milestone from its last (most recent) record.
One advantage is that you can add new milestone types in the future, but the main gain is that you never update your fact table: you use inserts only.
You can safely roll back a wrong ETL load simply by deleting its records, which is much more complicated when updates are involved.
You can also implement more complicated state diagrams, e.g. when a milestone is revoked and the transaction falls back to the previous state.
Whether you use one fact table or several depends on whether your milestones are homogeneous. If the milestones have distinct attributes, you may get a cleaner design using dedicated fact tables, but the queries get more complicated.
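
The insert-only idea can be sketched as follows with Python's built-in sqlite3 module. The three columns come from the answer above; the table name fact_loan_milestone, the load_id column (added here so that a bad load can be identified and deleted) and the sample rows are assumptions for the example. The sketch also assumes at most one milestone per transaction per date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fact_loan_milestone (
    transaction_id INTEGER NOT NULL,
    milestone_type TEXT NOT NULL,
    milestone_date TEXT NOT NULL,   -- ISO date string
    load_id        INTEGER NOT NULL -- hypothetical ETL batch identifier
)""")

def load(load_id, rows):
    """Append milestone rows for one ETL batch; existing rows are never updated."""
    conn.executemany(
        "INSERT INTO fact_loan_milestone VALUES (?, ?, ?, ?)",
        [(txn, kind, date, load_id) for txn, kind, date in rows],
    )

def rollback_load(load_id):
    """Undo a bad batch by deleting exactly the rows it inserted."""
    conn.execute("DELETE FROM fact_loan_milestone WHERE load_id = ?", (load_id,))

def current_milestones():
    """Return (transaction_id, milestone_type) for each transaction's latest row."""
    return conn.execute("""
        SELECT f.transaction_id, f.milestone_type
        FROM fact_loan_milestone f
        WHERE f.milestone_date = (
            SELECT MAX(g.milestone_date)
            FROM fact_loan_milestone g
            WHERE g.transaction_id = f.transaction_id)
        ORDER BY f.transaction_id
    """).fetchall()

load(1, [(101, "Lock", "2024-01-05"), (102, "Lock", "2024-01-07")])
load(2, [(101, "Funding", "2024-02-01")])
load(3, [(102, "Sale", "2024-02-15")])  # suppose this batch turns out to be wrong
rollback_load(3)
print(current_milestones())  # [(101, 'Funding'), (102, 'Lock')]
```

No "far in the future" placeholder date is needed in this layout, since a milestone that has not been reached simply has no row.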
You would be better off with only one fact table.
The following question and its discussion answer the general question of "One or multiple fact tables?" fairly well, though perhaps not how to deal with your specific problem of dates.

Fact table linked to Slowly Changing Dimension

I'm struggling to understand the best way to model a particular scenario for a data warehouse.
I have a Person dimension, and a Tenancy dimension. A person could be on 0, 1 or (rarely) multiple tenancies at any one time, and will often have a succession of tenancies over time. A tenancy could have one or more people associated with it. The people associated with a tenancy can change over time, and tenancies generally last for many years.
One option is to add tenancy reference, start and end dates to the Person Dimension as type 2 SCD columns. This would work well as long as I ignore the possibility of multiple concurrent tenancies for a person. However, I have other areas of the data warehouse where I am facing a similar design issue and ignoring multiple relationships is not a possibility.
Another option is to model the relationship as an accumulating snapshot fact table. I'm not sure how well this would work in practice though as I could only link it to one version of a Person and Tenancy (both of which will have type 2 SCD columns) and that would seem to make it impossible to produce current or historical reports that link people and tenancies together.
Are there any recommended ways of modelling this type of relationship?
Edit based on the patient answer and comments given by SQL.Injection
I've produced a basic model showing the model as described by SQL.Injection.
I've moved tenancy start/end dates to the 'junk' dimension (Dim.Tenancy) and added Person tenancy start/end dates to the fact table as I felt that was a more accurate way to describe the relationship.
However, now that I see it visually I don't think that this is fundamentally any different from the model that I started with, other than the fact table is a periodic snapshot rather than an accumulating snapshot. It certainly seems to suffer from the same flaw that whenever I update a type 2 slowly changing attribute in any of the dimensions it is not reflected in the fact.
In order to make this work to reflect current changes and also allow historical reporting it seems that I will have to add a row to the fact table every time a SCD2 change occurs on any of the dimensions. Then, in order to prevent over-counting by joining to multiple versions of the same entity I will also need to add new versions of the other related dimensions so that I have new keys to join on.
I need to think about this some more. I'm beginning to think that the database model is right and that it's my understanding of how the model will be used that is wrong.
In the meantime any comments or suggestions are welcome!
Your problem is similar to sales transactions with multiple items. The difference is that a transaction usually has multiple items, whereas your tenancy fact usually has a single person (the tenant).
Your hydra is born because you are trying to model the tenancy as a dimension, when you should be modeling it as a fact.
The reason why I think you have a tenancy dimension is that somewhere you have a rent fact. To model the rent fact, consider using the same approach I stated above: if two persons are tenants of the same property, two fact records should be inserted each month:
1) And now comes some magic (that is no magic at all): split the value of the rent by the number of tenants and store it in the fact
2) store also the full value of the rent (you don't know how the data scientist is going to use the data)
3) check 1) with the business users (I mean the people who build the risk models); there might be some advanced rule on how to do the splitting (a similar thing happens when the cost of shipping is to be divided across multiple item lines of the same order; it might not be uniformly distributed)
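
The three steps above can be sketched in Python as below. This is only an illustration under stated assumptions: the split is uniform (exactly the rule that step 3 says should be confirmed with the business), any rounding remainder is arbitrarily given to the first tenant, and the function name rent_fact_rows and the row field names are hypothetical.

```python
from decimal import Decimal, ROUND_DOWN

def rent_fact_rows(tenancy_id, month, tenant_ids, full_rent):
    """Build one rent fact row per tenant, storing both allocated and full rent."""
    if not tenant_ids:
        raise ValueError("a tenancy needs at least one tenant to allocate rent")
    full_rent = Decimal(full_rent)
    count = len(tenant_ids)
    # Uniform split, rounded down to whole cents (assumed allocation rule).
    share = (full_rent / count).quantize(Decimal("0.01"), rounding=ROUND_DOWN)
    # Leftover cents so that the allocated amounts still sum to the full rent.
    remainder = full_rent - share * count
    rows = []
    for position, person_id in enumerate(tenant_ids):
        allocated = share + remainder if position == 0 else share
        rows.append({
            "tenancy_id": tenancy_id,
            "person_id": person_id,
            "month": month,
            "allocated_rent": allocated,  # step 1: additive across tenants
            "full_rent": full_rent,       # step 2: repeated, do not sum across tenants
        })
    return rows

rows = rent_fact_rows("T1", "2024-03", ["P1", "P2", "P3"], "1000.00")
for row in rows:
    print(row["person_id"], row["allocated_rent"], row["full_rent"])
```

Summing allocated_rent over the three rows gives back 1000.00, whereas summing full_rent would triple-count the rent, which is why both columns are kept and labelled differently.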

Data warehouse multivalued attributes

Disclaimer: I have never created a data warehouse before. I have read several chapters of Kimball's Data Warehouse Toolkit.
Background: Plant (factory) management team needs to be able to slice and dice production information in various ways, and we want a consistent reporting format across manufacturing plants in our division. Through business analysis, we have concluded that the fact grain is 1 row per process completed. A completed process can either mean "machine" or "assemble." I am calling this the "Production fact".
The questions that the business needs to answer are the following:
Who was working when the process completed?
What was the cycle time of the process?
What is the serial number of the part that was being produced by the process?
My schema includes the following first-level dimensions. I do not have any dimensions beyond the first level, but there are some cross relations between the plant dimension and the part type, shift, and process dimensions.
Part Type (Attributes: Surrogate Key, Part Number, Model, Variant, Part Name)
Plant (Attributes: Surrogate Key, Plant Name, Plant Acronym)
Shift (Attributes: Surrogate Key, Plant Key, Start Hour24, Start Minute, End Hour24, End Minute)
Process (Attributes: Surrogate Key, Plant Key, Production line, Process Group, Process Name, Machine Type)
Date (typical date dimension attributes)
Time of Day (typical time of day dimension attributes)
The non dimensional facts are:
Part serial Number (instances of a part type)
Cycle time
Employee ID(s) *MULTI-VALUED*
Problem
My problem is that more than one employee may have been working the process at the time. So, I am wondering if I need to change my model and how to best represent the employee in the model. We are not trying to house employee information, just their company employee ID. I've considered the following options:
Allow for multiple employee IDs in the employee column of the fact table (e.g. comma separated). Disadvantage: the number of employees working on the process is a variable number. Would I need to create the field big enough to accommodate up to X number of employees? What should X be?
Create a record for each production fact per employee. This would mean more than one record for the same fact; that would be bad. :)
Create an employee dimension and a "Process Employees" bridge table between the employee dimension table and the fact table. Problem: the employees working on the process at the time are not represented in the fact table.
Create an Employee dimension, a Process Employees Group table, and a bridge table between the Process Employees Group table and the Employee dimension table. The employee group and bridge tables would need to be a) pre-populated with all possible employee combinations (this is not practical on any level since we have thousands of employees) or b) populated on the fly during ETL. 4b would require a check to see if a given group of employees already existed for each process; this might be taxing on the DBMS/ETL system if the source records are batched more frequently than a few times per day (e.g. 10 times per hour for near real-time reporting).
My Question(s)
I'm thinking that option 3 is the most viable option, but I have some reservations. Are there potential watch-outs? Are there other alternatives that I should consider? Is it okay to take the employees who worked on the process out of the fact table?
Thank you for any advice.
There is a concept called slowly changing dimensions.
These are considered dimensions; in this case it would be a table which I will call PartEmployee.
The structure of this table will be
PartId - PK
EmployeeId - PK
EmployeeStartDate - PK
EmployeeEndDate
The End Date will be null if the employee is still working on the part. When a new employee starts working on the part, the previous employee record for the part will be closed and a new record created for the part with the new employee.
Add an employee column to the PartFact table:
EmployeeId
This column will hold the current employee. The fact record will be updated every time a new employee starts working on the part...
This will give you the historical perspective of which employees worked on the part and also the information of the employee who worked on the part last.
Hope this helps...
I've had time to think about my options, and none of the 4 options listed in my original post are correct. The problem discussed seems to be a classic "coverage" problem; the business needs to know which employees were working which processes at a given time. If we have that information, we will know who was working on a particular part when a given process completed. This would best be represented as a fact-less fact table between an employee dimension and the production process dimension.
This approach also helps me to save space and improve querying power, because a single employee "coverage" fact will span multiple process production facts.
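
A minimal sketch of that coverage design, using Python's built-in sqlite3 module, might look like the following. The table names (fact_employee_coverage, fact_production), the column names, the half-open time intervals and the sample rows are all assumptions for illustration; timestamps are kept as zero-padded text so that string comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Factless "coverage" fact: which employee was working which process, and when.
CREATE TABLE fact_employee_coverage (
    employee_id  TEXT NOT NULL,
    process_key  INTEGER NOT NULL,
    covered_from TEXT NOT NULL,
    covered_to   TEXT NOT NULL
);
-- Production fact: one row per completed process, with no employee column.
CREATE TABLE fact_production (
    part_serial_number TEXT NOT NULL,
    process_key        INTEGER NOT NULL,
    completed_at       TEXT NOT NULL,
    cycle_time_seconds REAL NOT NULL
);
INSERT INTO fact_employee_coverage VALUES
    ('E1', 1, '2024-03-01 06:00', '2024-03-01 14:00'),
    ('E2', 1, '2024-03-01 10:00', '2024-03-01 18:00'),
    ('E3', 2, '2024-03-01 06:00', '2024-03-01 14:00');
INSERT INTO fact_production VALUES
    ('SN-001', 1, '2024-03-01 07:30', 95.0),
    ('SN-002', 1, '2024-03-01 11:15', 102.5);
""")

def employees_for_part(serial_number):
    """Return the employees covering the process when the given part completed."""
    rows = conn.execute("""
        SELECT c.employee_id
        FROM fact_production p
        JOIN fact_employee_coverage c
          ON c.process_key = p.process_key
         AND p.completed_at >= c.covered_from
         AND p.completed_at <  c.covered_to
        WHERE p.part_serial_number = ?
        ORDER BY c.employee_id
    """, (serial_number,)).fetchall()
    return [row[0] for row in rows]

print(employees_for_part("SN-001"))  # ['E1']
print(employees_for_part("SN-002"))  # ['E1', 'E2']
```

One coverage row serves every part completed during that interval, so the production fact keeps its grain of one row per completed process.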
