Dealing with multiple fact tables concerning related processes in dimensional modeling - data-warehouse

I have the following scenario where OLTP sales data is stored in two separate physical tables:
Sales
Refunds/Cancellations
A refund always refers to an existing sale (thus 'negating' it), though the dimensions of these tables are nearly the same (date, sales clerk, store, etc.). The data schema looks something like the following:
CREATE TABLE sale
(
sale_id uuid,
transaction_at timestamp with time zone,
store_id uuid,
clerk_id uuid,
clerk_number bigint,
currency character varying(3),
pos_id uuid,
total numeric,
net_total numeric
);
CREATE TABLE refund
(
refund_id uuid,
sale_id uuid, -- referenced sale to void
refunded_at timestamp with time zone,
pos_id uuid,
clerk_id uuid,
clerk_number bigint
)
I am trying to figure how to model this data for ingstion in a DW. Since I am relatively new to dimensional modeling I have begun reading The Data Warehouse Toolkit, but I am as of now unsure of the best approach to handle this case.
To my mind, these are two separate fact tables describing two different business processes (e.g. making a sale and getting a refund), though due to normalization concerns the refunds table (besides containing most of the same dimensions) is basically a pointer to a row in the sales table (which is fine for OLTP).
Analytical reports down the line would obviously want to look at these in a few ways:
All net sales per dimension (gross sales minus refunds)
All refund amounts per dimension
Other potential business use cases
As is, the first two cases would require joining the fact tables to either subtract the sales amount (case 1) or to get the information on refund amounts (case 2).
The approach that seems to make the most sense for me is something like the following (via some ETL/ELT process):
Load the (gross) sales data into a table in the DW
Load and denormalize refund data into a table in the DW, joining actual sale data so that amounts etc. are located in the fact table
Join either table with common conformed dimensions for further roll-ups and querying
This makes sense to me because:
Both fact tables have all required information from the physical event, and
There is no explicit dependency between the fact tables, and
Common dimensions can be reused
However, in this case, I still would not be able to get the net sales without joining these two tables. This makes me think that there should be a separate net_sale fact table, but this is problematic:
From a business point of view, sales without refunds are the vast majority of events that occur. A net_sale table would copy basically 99% of all sale data.
From a business process point of view, this table would describe an event that does not exist as such (there is no "net sale", only an aggregated view of sales amount per dimension minus refund costs).
Glossing over the third Chapter in The Data Warehouse Toolkit, I do not see this case mentioned explicitly (though there might be some parallels w.r.t. factless fact-tables and derived facts). What kind of approach would work in a case like this?

Related

Is a table (from source system) that contains only relationships and current status of a row from another table a fact table in Data Warehouse?

I am developing a BI system for our company, from scratch, and currently, I am designing a data warehouse. I am completely new to this so there are many things that I don't really understand, so I need to hear some more insights into this.
My problems are:
1) In our source system, there are tables called "Booking" and "BookingAccess". Booking table holds the data of a booking, such as check-in time and check-out time, booking date, booking number, gross amount of that booking.
Whereas in BookingAccess, it holds foreign keys related to the booking, such as bookerID, customerID, processID, hotelID, paymentproviderID and a current status of that booking. Booking and BookingAccess has a 1:1 relation ship.
Our source system is about checking the validity of those bookings, these bookings are not ours. We receive these booking information from other sources, outsource the above process for them. The gross amount is just an information of that booking that we need to validate, their are not parts of our business. The current status of a booking which is hold in the BookingAccess table is the current status of that booking in our system, which can be "Processing" or "Finshed".
From what I read from Ralph Kimball, in this situation, the "Booking" is the Dimension table, and the BookingAccess should be the fact. I feel that the BookingAccess is some what a [Accumulating Snapshot table], in which I should track the time when a booking is "Processing", and when a booking is "Finshed".
Do I get it right?
2) In "Booking" table, there is also a foreign key called "ImportID". This key links to a table called "Import". This "Import" table hold history records of files (these file contain bookings which will be written to the "Booking" table) which were imported to our system, including attributes such as file name, imported date, total booking imported...
From my point of view, this is clearly a fact table.
But the problem is that, the "Import" table and the "Booking" table has a relationship of one to many (1 ImportID in "Import" table can have 1, 2 or more records which have a same ImportID in "Booking" table). This is against the idea of fact tables which insists that the relationship between Fact and Dimension must be many-to-one, which fact is always in the many side.
So what approach should I use to solve this case? I'm thinking of using bridge tables to solve this problem. But I don't know if this is a good practice, as there are a lot of record in the "Import" table, so I will have to create a big bridge table just to covers all of this.
3) Should I separate a table (from source system) which contains a mix of relationships and information to a fact table containing only relationships, and dimension table containing only information? (For example, a table called "Customer" in source system. This table contains some things like customer name, customer address and customertype id, customer parentID....)
I am asking this because I feel that if I use BI tools to analyze things (for example, analyzing the number of customers which has customertypeid = 1), I feel it's some what weird if there are no fact tables involved in.
Or should I treat it as a mere dimension table and use snowflake-schema? But this will lead to a mix of Star-Schema and snowflake-schema in our Data Warehouse. Is this normal? I have read some official sources (most likely Oracle) stating that one should try to avoid using and mixing snowflake-schema as much as possible. But some sources like Microsoft say that this is very normal. Even the Advanture Work Data Warehouse sample database uses this kind of approach.
Or should I de-normalize every relation in that "Customer" table? But I don't think this is a good approach as it will make the Customer contain a lot of columns, and it will be very hard to track the history of every row in the "DIM_Customer" table. For example, if any change occur in any relation of "Customer" table, the whole "DIM_Customer" table will need to be updated.
I still have a lot of question regarding to Data Warehouse. I am working with it nearly alone, without any help or consultant. So pardon me if I made any kind of inconveniences or mistakes.

Fact and Dimension Tables in DW

I wonder why fact tables are bigger in size than dimension tables in data warehouses. Dimension tables contain the attribute-level information, and are highly de-normalized, so why are dimension tables not bigger in size ?
I could start by stealing some words off Kimball
"Dimensional modeling begins by dividing the world into measurements and context."
https://www.kimballgroup.com/2003/01/fact-tables-and-dimension-tables/
Fact tables record business activities or events and for that reason fact tables could grow in size. Dim Tables store information on different contexts.
For eg: In an university 100 students might be enrolling in 10 subjects. Now if you see the dims, Dim_Student and Dim_Subject, in this scenario they might have 100 rows and 10 rows each. But the activity of enrolments will be much more, as students can enrol into 0 or many subjects at the same time. This could lead to the Fact_Enrolment(which records the enrolment activities) table having lot more rows when compared to the dims.
Note: However in my experience I have also worked with facts where the fact tables have lesser rows when compared to the dims, at a particular point in time. They might grow in size eventually when the DataWarehouse grows.
Hope that helps.
Dimensions contain entity level information whereas facts contain transaction level information and for a dimension multiple transaction can take place over a period of time. For example, in a HR system, there can be a person dimension containing personal details of all the employees wherein typically there may be 1-3 records for each employee.
Fact tables will store multiple transactions of the employees e.g., hires, promotion. movement/change of departments, leaves Termination etc. so corresponding to one-person record in person dimension there will be multiple records in facts.
Also Fact Tables contains facts / measures corresponding to multiple dimensions
And so facts are joined with multiple dimensions using a surrogate key/ foreign key reference to different dimensions which makes the fact table heavier than dimensions.
Dimension tables contains the attribute level information and highly de-normalized
Actually, I doubt that dimension tables are "highly de-normalized". Generally speaking, each row in a dimension table is identified by a primary key so there is very less scope of having duplicates in them. This can explain why they do not get too big in size compared to fact tables.

Granularity in Star Schema leads to multiple values in Fact Table?

I'm trying to understand star schema at the moment & struggling a lot with granularity.
Say I have a fact table that has session_id, user_id, order_id, product_id and I want to roll-up to sessions by user by week (keeping in mind that not every session would lead to an order or a product & the DW needs to track the sessions for non-purchasing users as well as those who purchase).
I can see no reason to track order_ids or session_ids in the fact table so it would become something like:
week_date, user_id, total_orders, total_sessions ...
But how would I then track product_ids if a user makes more than one purchase in a week? I assume I can't keep multiple product ids in an array (eg: "20/02/2012","5","3","PR01,PR32,PR22")?
I'm thinking it may have to be kept at 'every session' level but that could lead to a very large amount of data. How would you implement granularity for an example such as above?
Dimensional modelling required Dimensions as well as Facts.
You need a Date/Calendar dimension, which includes columns like this:
calendar (id,cal_date,cal_year,cal_month,...)
The "grain" of your fact table is the key to data storage. If you have transactions, then the transaction should be the grain, and you store one row per transaction. Use proper (integer) surrogate keys to your dimensions, and your table won't be as large as you fear.
Now you can write a query like this, to sum sales of product by year:
select product_name,cal_year,sum(purchase_amount)
from fact_whatever
inner join calendar on id = fact_whatever.calendar_id
inner join product on id = fact_whatever.product_id
group by product_name,cal_year

Identifying the fact table in data warehouse design

I'm trying to design my first data mart with a star schema from an Excel Sheet containing informations about a Help Desk Service calls, this sheet contains 33 fields including different informations and I can't identify the fact table because I want to do the reporting later based on different KPI's.
I want to know how to identify the fact table measures easily and I have another question which is : Can a fact table contain only foreign keys of dimensions and no measures? Thanks in advance guys and sorry for my bad English.
You can have more than one fact table.
A fact table represents an event or process that you want to analyze.
The structure of the fact tables depend on the process or event that you are trying to analyze.
You need to tell us the events or processes that you want to analyze before we can help you further.
Can a fact table contain only foreign keys of dimensions and no measures?
Yes. This is called a factless fact table.
Let's say you want to do a basic analysis of calls:
Your full table might look like this
CALL_ID
START_DATE
DURATION
AGENT_NAME
AGENT_TENURE (how long worked for company)
CUSTOMER_NAME
CUSTOMER_TENURE (how long a customer)
PRODUCT_NAME (the product the customer is calling about)
RESOLVED
You would turn this into a fact table like this:
CALL_ID
START_DATE_KEY
AGENT_KEY
CUSTOMER_KEY
PRODUCT_KEY
DURATION (measure)
RESOLVED (quasi-measure)
And you would have a DATE dimension table, AGENT dimension table, CUSTOMER dimension table and PRODUCT dimension table.
Agile Data Warehouse Design is a good book, as are the ones by Kimball.
In general, the way I've done it (and there are a number of ways to do anything) is that the categorical data is referenced with a FKey in the fact table, but anything you want to perform aggregations on (typically as data types $/integers/doubles etc) can be in the fact table as well. So for example, a fact table might contain a hierarchy of types, such as product_category >> product_name, and it usually contains a time and/or location field as well; all of which would be referenced by a FKEY to a lookup table. The measure columns are usually integer based or money data, and are used in aggregate functions grouped by the other fields like this:
select sum(measureOne) as sum, product_category from facttable
where timeCol between X and Y group by product_category...etc
At one time a few years ago, I did have a fact table that had no measure column... because the only measure I had was based on count, which I would do dynamically by grouping different dimensions in the fact table.

Data warehouse multivalued attributes

Disclaimer: I have never created a data warehouse before. I have read several chapters of Kimball's Data Warehouse Toolkit.
Background: Plant (factory) management team needs to be able to slice and dice production information in various ways, and we want a consistent reporting format across manufacturing plants in our division. Through business analysis, we have concluded that the fact grain is 1 row per process completed. A completed process can either mean "machine" or "assemble." I am calling this the "Production fact".
The questions that the business needs to answer are the following:
Who was working when the process completed?
What was the cycle time of the process?
What is the serial number of the part was being produced by the process?
My schema includes the following first-level dimensions. I do not have any dimensions beyond the first level, but there are some cross relations between the plant dimension and the part type, shift, and process dimensions.
Part Type (Attributes: Surrogate Key, Part Number, Model, Variant, Part Name)
Plant (Attributes: Surrogate Key, Plant Name, Plant Acronym)
Shift (Attributes: Surrogate Key, Plant Key, Start Hour24, Start Minute, End Hour24, End Minute)
Process (Attributes: Surrogate Key, Plant Key, Production line, Process Group, Process
Name, Machine Type)
Date (typical date dimension attributes)
Time of Day (typical time of day dimension attributes)
The non dimensional facts are:
Part serial Number (instances of a part type)
Cycle time
Employee ID(s) *MULTI-VALUED*
Problem
My problem is that more than one employee may have been working the process at the time. So, I am wondering if I need to change my model and how to best represent the employee in the model. We are not trying to house employee information, just their company employee ID. I've considered the following options:
Allow for multiple employee IDs in the employee column of the fact table (e.g. comma separated). Disadvantage: the number of employees working on the process is a variable number. Would I need to create the field big enough to accommodate up to X number of employees? What should X be?
Create a record for each production fact per employee. This would be mean more than one record for the same fact; that would be bad. :)
Create an employee dimension and an "Process Employees" bridge table between the employee dimension table and the fact table. Problem: the employees working on the process at the time are not represented in the fact table.
Create an Employee dimension, a Process Employees Group table, and a bridge table between Process Employees Group table and the Employee dimension table. The employee group and bridge tables would need to be a) pre-populated with all possible employee combinations--this is not practical on any level since we have thousands of employees-- or b) populated on the fly during ETL. 4b would require a check to see if a given group employees already existed for each process; this might be taxing on the DBMS/ETL system if the source records are batched more frequently than a few times per day (e.g. 10 X's per hour for near real-time reporting).
My Question(s)
I'm thinking that option 3 is the most viable option, but I have some reservations. Are there potential watch-outs? Are there other alternatives that I should consider? Is it okay to take the employees who worked on the process out of the fact table?
Thank you for any advice.
There is a concept called slowly changing dimensions.
These are considered dimensions; basically over here the table which I will call PartEmployee;
The structure of this table will be
PartId - PK
EmployeeId - PK
EmployeeStartDate - PK
EmployeeEndDate
The End Date will be null if the employee is still working on the part. When a new employee starts working on the part, the previous employee record for the part will be closed and a new record created for the part with the new employee.
Add an employee on the PartFact table;
EmployeeId
This column will hold the current employee; This fact record will be updated everytime a new employee starts working on the part...
This will give you the historical perspective of which employees worked on the part and also the information of the employee who worked on the part last.
Hope this helps...
I've had time to think about my options, and none of the 4 options listed in my original post are correct. The problem discussed seems to be a classic "coverage" problem; the business needs to know which employees were working which processes at a given time. If we have that information, we will know who worked who was working on a particular part when a given process completed. This would best be represented as a fact-less fact table between an employee dimension and the production process dimension.
This approach helps also helps me to save space and improve querying power because a single employee "coverage" fact will span multiple process production facts.

Resources