Fact and Dimension Tables in DW - data-warehouse

I wonder why fact tables are bigger than dimension tables in data warehouses. Dimension tables contain the attribute-level information and are highly de-normalized, so why are they not the bigger ones?

I could start by borrowing some words from Kimball:
"Dimensional modeling begins by dividing the world into measurements and context."
https://www.kimballgroup.com/2003/01/fact-tables-and-dimension-tables/
Fact tables record business activities or events, and for that reason they can grow large. Dimension tables store information about the different contexts of those events.
For example, at a university 100 students might be enrolling in 10 subjects. The dimensions Dim_Student and Dim_Subject would then have 100 rows and 10 rows respectively. But there will be far more enrolment activity, as a student can enrol in zero or many subjects at the same time. The Fact_Enrolment table (which records the enrolment activities) could therefore have many more rows than the dims.
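To make the row counts concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names follow the example above; the column names and the enrolment pattern (every student taking 4 of the 10 subjects) are assumptions invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_subject (subject_key INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE fact_enrolment (
        student_key INTEGER REFERENCES dim_student(student_key),
        subject_key INTEGER REFERENCES dim_subject(subject_key)
    );
""")

# 100 students and 10 subjects, as in the example.
conn.executemany("INSERT INTO dim_student VALUES (?, ?)",
                 [(s, f"Student {s}") for s in range(1, 101)])
conn.executemany("INSERT INTO dim_subject VALUES (?, ?)",
                 [(c, f"Subject {c}") for c in range(1, 11)])

# Assumption: every student enrols in 4 distinct subjects.
enrolments = [(s, (s + k) % 10 + 1) for s in range(1, 101) for k in range(4)]
conn.executemany("INSERT INTO fact_enrolment VALUES (?, ?)", enrolments)


def row_count(table):
    """Return the number of rows in one of the tables created above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


counts = {t: row_count(t) for t in ("dim_student", "dim_subject", "fact_enrolment")}
print(counts)  # {'dim_student': 100, 'dim_subject': 10, 'fact_enrolment': 400}
```

The dims stay at 100 and 10 rows no matter how much activity there is, while the fact table gains one row per enrolment event.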
Note: in my experience I have also worked with fact tables that had fewer rows than the dims at a particular point in time. They might still outgrow the dims eventually as the data warehouse grows.
Hope that helps.

Dimensions contain entity-level information, whereas facts contain transaction-level information, and for one dimension row multiple transactions can take place over a period of time. For example, in an HR system there can be a person dimension containing the personal details of all employees, with typically 1-3 records for each employee.
Fact tables store the multiple transactions of those employees, e.g. hires, promotions, movements/changes of department, leave, terminations, etc., so for one person record in the person dimension there will be multiple records in the facts.
Also, fact tables contain facts/measures corresponding to multiple dimensions, so facts are joined with multiple dimensions using a surrogate key/foreign key reference to each dimension, which makes the fact table heavier than the dimensions.

"Dimension tables contain the attribute-level information and are highly de-normalized"
Actually, I doubt that dimension tables are "highly de-normalized". Generally speaking, each row in a dimension table is identified by a primary key, so there is very little scope for duplicates in them. This can explain why they do not get as big as fact tables.

Related

Dealing with multiple fact tables concerning related processes in dimensional modeling

I have the following scenario where OLTP sales data is stored in two separate physical tables:
Sales
Refunds/Cancellations
A refund always refers to an existing sale (thus 'negating' it), though the dimensions of these tables are nearly the same (date, sales clerk, store, etc.). The data schema looks something like the following:
CREATE TABLE sale
(
    sale_id uuid,
    transaction_at timestamp with time zone,
    store_id uuid,
    clerk_id uuid,
    clerk_number bigint,
    currency character varying(3),
    pos_id uuid,
    total numeric,
    net_total numeric
);
CREATE TABLE refund
(
    refund_id uuid,
    sale_id uuid, -- referenced sale to void
    refunded_at timestamp with time zone,
    pos_id uuid,
    clerk_id uuid,
    clerk_number bigint
);
I am trying to figure out how to model this data for ingestion into a DW. Since I am relatively new to dimensional modeling I have begun reading The Data Warehouse Toolkit, but as of now I am unsure of the best approach to handle this case.
To my mind, these are two separate fact tables describing two different business processes (e.g. making a sale and getting a refund), though due to normalization concerns the refunds table (besides containing most of the same dimensions) is basically a pointer to a row in the sales table (which is fine for OLTP).
Analytical reports down the line would obviously want to look at these in a few ways:
All net sales per dimension (gross sales minus refunds)
All refund amounts per dimension
Other potential business use cases
As is, the first two cases would require joining the fact tables to either subtract the sales amount (case 1) or to get the information on refund amounts (case 2).
The approach that seems to make the most sense for me is something like the following (via some ETL/ELT process):
Load the (gross) sales data into a table in the DW
Load and denormalize refund data into a table in the DW, joining actual sale data so that amounts etc. are located in the fact table
Join either table with common conformed dimensions for further roll-ups and querying
This makes sense to me because:
Both fact tables have all required information from the physical event, and
There is no explicit dependency between the fact tables, and
Common dimensions can be reused
However, in this case, I still would not be able to get the net sales without joining these two tables. This makes me think that there should be a separate net_sale fact table, but this is problematic:
From a business point of view, sales without refunds are the vast majority of events that occur. A net_sale table would copy basically 99% of all sale data.
From a business process point of view, this table would describe an event that does not exist as such (there is no "net sale", only an aggregated view of sales amount per dimension minus refund costs).
Skimming through the third chapter of The Data Warehouse Toolkit, I do not see this case mentioned explicitly (though there might be some parallels w.r.t. factless fact tables and derived facts). What kind of approach would work in a case like this?
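For illustration only, the net-sales roll-up described above can be derived at query time from the two fact tables, without materialising a net_sale table. The sketch below uses Python's sqlite3 module; the table names fact_sale and fact_refund, the refunded_total column (the sale amount copied onto the refund row during ETL), the presence of store_id on the refund row, and the sample rows are all assumptions, not part of the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified fact tables with store_id as the only dimension key.
    CREATE TABLE fact_sale (sale_id TEXT, store_id TEXT, total NUMERIC);
    -- refunded_total is assumed to be copied from the referenced sale during ETL.
    CREATE TABLE fact_refund (refund_id TEXT, sale_id TEXT, store_id TEXT,
                              refunded_total NUMERIC);

    INSERT INTO fact_sale VALUES ('s1', 'store_a', 100), ('s2', 'store_a', 50),
                                 ('s3', 'store_b', 80);
    INSERT INTO fact_refund VALUES ('r1', 's2', 'store_a', 50);
""")

# Stack both facts on the shared dimension key, with refunds as negative amounts.
NET_SALES_SQL = """
    SELECT store_id, SUM(amount) AS net_total
    FROM (
        SELECT store_id, total AS amount FROM fact_sale
        UNION ALL
        SELECT store_id, -refunded_total AS amount FROM fact_refund
    )
    GROUP BY store_id
    ORDER BY store_id
"""
net_sales = conn.execute(NET_SALES_SQL).fetchall()
print(net_sales)  # [('store_a', 100), ('store_b', 80)]
```

Because each fact is aggregated on its own conformed dimension key, neither table depends on the other, and the refund-only report is simply the second branch of the union.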

HR Data Mart Design Advice

I am working on a design for an HR data mart using the Kimball approach outlined in 'The Data Warehouse Toolkit'.
As per the Kimball design, I was planning to have a time-stamped, slowly-changing dimension to track employee profile changes (to support point-in-time analysis of employee state) and a head-count periodic snapshot fact table to support measures of new hires, leavers, leave taken, salary paid etc.
The problem I've encountered is that, in some cases, our employees can be assigned to multiple roles/jobs and each one needs to be tracked separately (i.e. the grain of my facts has to be at job-level, not employee level).
How might the Kimball design be adapted to fit a scenario where employee and role/job form a hierarchy like this? Ideally, I want to avoid duplicating employee profile data (address, demographics etc) for each role/job an employee is assigned to, but does this mean I need to snow-flake the dimension?
Options I've been considering include the below - I'd be interested in any thoughts or suggestions the community has on this so all input is welcome!
1) (see attached, design 1) A snowflake-style approach with an employee table which has a 1-to-many link to a role table, which, in turn, has a 1-to-many link with the fact table. The advantage here is a clean employee dimension, but I don't want to introduce unnecessary complexity. Is there any reason why I shouldn't link both dimensions directly to the fact table? The snowflake designs I've seen don't seem to do this.
2) (see attached, design 2) A combined Employee/Role dimension where each employee has a record for each assigned role but only one of them is flagged as 'Primary Role'. Point-in-time queries on the dimension can be performed by constraining on the 'Primary Role' flag.
Anything that occurred is an event and can be a fact. When you look at relationships between data, you also need to ask whether the data value describes the entity (dim) or something that happened to/with the entity (fact). Everything can be a dim or a fact (sometimes both).
A job describes an event that happened to the employee. You should have a fact employeejob that relates to the Dim employee and Dim job (as well as your date dimensions). This will then allow you to break down absences by employee and job. Your dim job would really just be job title, pay grades, etc. The fact would contain effective dates. Research factless fact tables.
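A rough sketch of that shape, using Python's sqlite3 module, might look like the following. All table names, column names and sample rows are hypothetical; the point is only that the employee profile is stored once while each job assignment becomes its own factless fact row with effective dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_employee (employee_key INTEGER PRIMARY KEY, full_name TEXT);
    CREATE TABLE dim_job (job_key INTEGER PRIMARY KEY, job_title TEXT, pay_grade TEXT);
    -- Factless fact: one row per employee/job assignment, no numeric measure.
    CREATE TABLE fact_employee_job (
        employee_key INTEGER REFERENCES dim_employee(employee_key),
        job_key INTEGER REFERENCES dim_job(job_key),
        effective_from TEXT,
        effective_to TEXT  -- NULL while the assignment is still open
    );
    INSERT INTO dim_employee VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO dim_job VALUES (10, 'Lecturer', 'G7'), (20, 'Tutor', 'G5');
    INSERT INTO fact_employee_job VALUES
        (1, 10, '2020-01-01', NULL),
        (1, 20, '2021-09-01', '2022-06-30'),
        (2, 20, '2019-03-01', NULL);
""")


def headcount_by_job(as_of):
    """Count the assignments per job title that are active on an ISO date."""
    return conn.execute("""
        SELECT j.job_title, COUNT(*) AS assignments
        FROM fact_employee_job f
        JOIN dim_job j ON j.job_key = f.job_key
        WHERE f.effective_from <= :d
          AND (f.effective_to IS NULL OR f.effective_to >= :d)
        GROUP BY j.job_title
        ORDER BY j.job_title
    """, {"d": as_of}).fetchall()


print(headcount_by_job("2022-01-01"))  # [('Lecturer', 1), ('Tutor', 2)]
```

Ann holds two jobs on that date, yet her profile appears only once in dim_employee, so no snowflaking or duplication of the profile is needed.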
Note that your vacancy reference would be part of a separate fact (when/where did you post it, how many applicants are all measurable facts about the vacancy). This may also be an example of a degenerate dimension.
I'm not fond of your monthly fact. I think that should just be some calculated measures built on fact absence and fact employeejob. When those events are put up against your dimensions, you can break them down by date, job type, manager, etc.

How deep to go when denormalising

I am denormalising an OLTP database for use in a DWH.
At the moment I am denormalising studygroups.
Each studygroup has a key pointing towards 1 project.
Each project has a key pointing towards 1 department.
Each department has a key pointing towards 1 university.
Each university has a key pointing to 1 city.
Now I know that you are supposed to denormalize the sh*t out of your OLTP, but in this DWH department will be a dimension on its own. The same goes for university. Would it suffice to add a key from studygroup pointing at department, or is it wiser to denormalize as far as you can and add all attributes from the department, and all attributes from its M:1 related tables, to the studygroup dimension? Even when department and university will be dimensions by themselves?
In other words: how far/deep do you go when denormalizing?
The key concept behind a dimensional model is:
Keep your fact tables in 3NF (third normal form);
De-normalize your dimensions into 2NF (second normal form)
So ideally, the only joins you should have in your model are the joins between fact tables and relevant dimensions.
As part of this philosophy:
Avoid "snowflake" designs, where dimensions contain keys to other dimensions. It's always possible to come up with a data model that allows the same functionality as the snowflake, without violating the 3NF/2NF rule;
Never have any direct joins between 2 separate dimensions (e.g. department and study group). All relations among dimensions must be resolved via fact tables;
Never have any direct joins between 2 separate fact tables. Any relations among fact tables must be resolved via shared dimensions.
Finally, consider that dimensional design, besides optimizing the data for querying, serves a second important purpose: it's a semantic model of the business (or whatever else it represents). So, when making decisions about combining data elements into dimensions and facts, consider their "logical affinity": they should make intuitive sense to the end users. If you have a hard time explaining the meaning of your dimension or fact table to a BI analyst, most likely you've made a modeling mistake.
For example, in your case you should consider the logical relations between universities, departments, study groups, etc. It's very likely that University/Department form a natural hierarchy. If so, they should belong to the same dimension. Study group, on the other hand, might not: let's assume it's possible to form study groups across multiple universities and/or multiple departments. Such many-to-many relations are a clear indication that they should be resolved via fact tables. In addition, relations between universities and departments are stable (they rarely change), while study groups are formed and dissolved very often, and thus should be modeled separately.
In general, if you see 1:1 or 1:M relations between dimensional elements, it's often an indication that they should be de-normalized into the same table (again, only if their combination makes logical sense). If the relations are M:M, most likely they belong to different tables (you can force them into the same table, but often such tables look like Frankenstein creatures).
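As a minimal sketch of folding a stable 1:M chain into one dimension, the following uses Python's sqlite3 module to flatten department, university and city into a single dim_department table. The source column names and sample rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalised OLTP-style source tables (hypothetical columns).
    CREATE TABLE city (city_id INTEGER PRIMARY KEY, city_name TEXT);
    CREATE TABLE university (university_id INTEGER PRIMARY KEY, university_name TEXT,
                             city_id INTEGER REFERENCES city(city_id));
    CREATE TABLE department (department_id INTEGER PRIMARY KEY, department_name TEXT,
                             university_id INTEGER REFERENCES university(university_id));
    INSERT INTO city VALUES (1, 'City X'), (2, 'City Y');
    INSERT INTO university VALUES (1, 'University A', 1), (2, 'University B', 2);
    INSERT INTO department VALUES (1, 'Physics', 1), (2, 'History', 1), (3, 'Law', 2);

    -- One denormalised dimension: the M:1 chain becomes plain columns.
    CREATE TABLE dim_department AS
    SELECT d.department_id AS department_key,
           d.department_name, u.university_name, c.city_name
    FROM department d
    JOIN university u ON u.university_id = d.university_id
    JOIN city c ON c.city_id = u.city_id;
""")

dim_rows = conn.execute(
    "SELECT department_name, university_name, city_name "
    "FROM dim_department ORDER BY department_key"
).fetchall()
for row in dim_rows:
    print(row)
```

The university and city names repeat on every department row, which is the intended trade-off: a query needs only one join from the fact table to reach any level of the hierarchy.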
You can get much better help by making your question more specific - draw your dimensional model, post it, and ask for specific issues/challenges you have. For general concepts, books from Kimball and Inmon are your best friends.

Identifying the fact table in data warehouse design

I'm trying to design my first data mart with a star schema from an Excel sheet containing information about help desk service calls. The sheet contains 33 fields of various kinds, and I can't identify the fact table, because I want to do the reporting later based on different KPIs.
I want to know how to identify the fact table measures easily, and I have another question: can a fact table contain only foreign keys of dimensions and no measures? Thanks in advance guys and sorry for my bad English.
You can have more than one fact table.
A fact table represents an event or process that you want to analyze.
The structure of a fact table depends on the process or event that you are trying to analyze.
You need to tell us the events or processes that you want to analyze before we can help you further.
Can a fact table contain only foreign keys of dimensions and no measures?
Yes. This is called a factless fact table.
Let's say you want to do a basic analysis of calls:
Your full table might look like this
CALL_ID
START_DATE
DURATION
AGENT_NAME
AGENT_TENURE (how long worked for company)
CUSTOMER_NAME
CUSTOMER_TENURE (how long a customer)
PRODUCT_NAME (the product the customer is calling about)
RESOLVED
You would turn this into a fact table like this:
CALL_ID
START_DATE_KEY
AGENT_KEY
CUSTOMER_KEY
PRODUCT_KEY
DURATION (measure)
RESOLVED (quasi-measure)
And you would have a DATE dimension table, AGENT dimension table, CUSTOMER dimension table and PRODUCT dimension table.
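A minimal sketch of that split, using Python's sqlite3 module, could look like this. It covers only the agent and product dimensions and a handful of invented rows; the customer and date dimensions would follow the same pattern, and all table names here are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calls_flat (call_id INTEGER, start_date TEXT, duration INTEGER,
                             agent_name TEXT, product_name TEXT, resolved INTEGER);
    INSERT INTO calls_flat VALUES
        (1, '2024-01-01', 300, 'Ann', 'Router', 1),
        (2, '2024-01-01', 120, 'Bob', 'Router', 0),
        (3, '2024-01-02', 600, 'Ann', 'Modem', 1);

    -- Dimensions: one row per distinct value, keyed by a generated surrogate key.
    CREATE TABLE dim_agent (agent_key INTEGER PRIMARY KEY AUTOINCREMENT, agent_name TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY AUTOINCREMENT, product_name TEXT);
    INSERT INTO dim_agent (agent_name)
        SELECT DISTINCT agent_name FROM calls_flat ORDER BY agent_name;
    INSERT INTO dim_product (product_name)
        SELECT DISTINCT product_name FROM calls_flat ORDER BY product_name;

    -- Fact: dimension keys plus the measures.
    CREATE TABLE fact_call AS
    SELECT c.call_id, c.start_date AS start_date_key, a.agent_key, p.product_key,
           c.duration, c.resolved
    FROM calls_flat c
    JOIN dim_agent a ON a.agent_name = c.agent_name
    JOIN dim_product p ON p.product_name = c.product_name;
""")

# Average call duration and number of resolved calls per agent.
per_agent = conn.execute("""
    SELECT a.agent_name, AVG(f.duration), SUM(f.resolved)
    FROM fact_call f
    JOIN dim_agent a ON a.agent_key = f.agent_key
    GROUP BY a.agent_name
    ORDER BY a.agent_name
""").fetchall()
print(per_agent)  # [('Ann', 450.0, 2), ('Bob', 120.0, 0)]
```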
Agile Data Warehouse Design is a good book, as are the ones by Kimball.
In general, the way I've done it (and there are a number of ways to do anything) is that the categorical data is referenced with a foreign key in the fact table, while anything you want to perform aggregations on (typically money, integer or floating-point columns) can sit in the fact table itself. So for example, a fact table might contain a hierarchy of types, such as product_category >> product_name, and it usually contains a time and/or location field as well, all of which would be referenced by a foreign key to a lookup table. The measure columns are usually integer or money data, and are used in aggregate functions grouped by the other fields like this:
SELECT SUM(measureOne) AS measure_total, product_category
FROM facttable
WHERE timeCol BETWEEN X AND Y
GROUP BY product_category ...etc
At one time a few years ago, I did have a fact table that had no measure column... because the only measure I had was based on count, which I would do dynamically by grouping different dimensions in the fact table.
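A small sketch of that idea, again with Python's sqlite3 module: the fact table below holds nothing but dimension keys, and the only measure is a row count computed at query time. The table name, key columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical factless fact table: dimension keys only, no measure column.
    CREATE TABLE fact_call_event (date_key TEXT, agent_key INTEGER, product_key INTEGER);
    INSERT INTO fact_call_event VALUES
        ('2024-01-01', 1, 10), ('2024-01-01', 2, 10),
        ('2024-01-02', 1, 20), ('2024-01-02', 1, 10);
""")

ALLOWED_COLUMNS = {"date_key", "agent_key", "product_key"}


def count_by(column):
    """Count fact rows per value of one dimension key column."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unknown column: {column}")
    sql = (f"SELECT {column}, COUNT(*) FROM fact_call_event "
           f"GROUP BY {column} ORDER BY {column}")
    return conn.execute(sql).fetchall()


print(count_by("agent_key"))    # [(1, 3), (2, 1)]
print(count_by("product_key"))  # [(10, 3), (20, 1)]
```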

How are dimensions and fact tables related in a star diagram?

If you have a relational database and you want to start making reports, you might do the following (please let me know if this is incorrect).
Go through your relational database and make a list of all the columns that you want to include in your report.
Group related columns together and then split those (normalise) into additional tables. These are the dimensions.
The dimensions then have a primary key (possibly a combination of two columns), and the fact table has a foreign key to reference each dimension, plus the fields that you didn't separate out in the first place, such as sales value.
The question:
I was originally seeing dimensions as data marts that referenced data from external sources, and a fact table that in turn referenced data in the dimensions. That's incorrect, isn't it? It's the other way around...
Or in general, if you were to normalise a database you would always replace the columns you take out a table with a foreign key, and add a primary key to the new table?
A fact table represents a process or event that you want to analyze.
Step 1: What is the process or event that you want to analyze?
The columns in the fact table represent all of the variables that are pertinent to your analysis.
Step 2: What variables are pertinent to the analysis?
Whether you "split-out" columns into dimension tables is irrelevant to your understanding. It's an optimization to minimize the space taken up by fact tables.
If you want to discriminate between measures and dimensions, ask
Step 3: What are the (true) numeric values in my fact table? These are your measures.
An example of a true numeric value is a dollar amount, like Sales Order Line Item Extended Price. You can sum it up or take an average of it.
An example of a value that is not truly numeric is Customer ID 12345. It's a number, but it represents something that isn't a number (a customer). The sum of customer IDs makes no sense, nor does the average. Dig?
Regarding your questions:
Fact tables do not need foreign keys to dimension tables. (hint: see Hot-Swappable Dimensions)
"dimensions as data marts that referenced data from external sources". Hm...maybe, but don't worry about data marts for now. A dimension is just a column in your fact table (that isn't a measure). A dimension table is just a collection of dimensions that are related.
Just start with Excel. Figure out the columns you need in your analysis. Put them in Excel. That's your fact table. If you expect your fact table to get large (100s of MB), then do ONE level of normalization:
Figure out your measures. Leave them in the fact table.
Figure out your dimensions. Group them together (Customer info into one group, Store info into another).
Put them in their own tables. Give them meaningless surrogate keys. Put those keys in the fact table.
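The steps above can be sketched in plain Python. The helper name split_out_dimension, the column names and the sample rows are all hypothetical; it performs one level of normalization by moving a group of related columns into a dimension table and leaving a meaningless integer key behind in the fact rows.

```python
def split_out_dimension(rows, dim_columns, key_name):
    """Move dim_columns into a dimension table; leave a surrogate key in the facts."""
    dimension = {}  # tuple of attribute values -> surrogate key
    facts = []
    for row in rows:
        attrs = tuple(row[c] for c in dim_columns)
        # Meaningless surrogate key: the next unused integer, reused for repeats.
        key = dimension.setdefault(attrs, len(dimension) + 1)
        fact = {c: v for c, v in row.items() if c not in dim_columns}
        fact[key_name] = key
        facts.append(fact)
    dim_table = [dict(zip(dim_columns, attrs), **{key_name: key})
                 for attrs, key in dimension.items()]
    return facts, dim_table


# The flat "Excel" fact table: measures and dimension attributes side by side.
rows = [
    {"customer_name": "Acme", "customer_city": "Oslo", "store": "North", "sales_value": 120.0},
    {"customer_name": "Acme", "customer_city": "Oslo", "store": "South", "sales_value": 80.0},
    {"customer_name": "Birch", "customer_city": "Bergen", "store": "North", "sales_value": 45.5},
]

facts, dim_customer = split_out_dimension(
    rows, ["customer_name", "customer_city"], "customer_key")
print(dim_customer)
print(facts)
```

The measure sales_value stays in the fact rows, and the two Acme rows share customer_key 1, so the customer attributes are stored only once.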

Resources