Split fact table because of one missing foreign key? - data-warehouse

Imagine that we have two different messages:
CarDataLog
CarStatusLog
CarDataLog contains data about the car and has a direct relation to the Car and the corresponding Person.
CarStatusLog contains data about the same car as above (whose data log included a customer), but this time the data is a status, for example a field like "CleaningState": "NotCleaned" or "Cleaned".
Both log messages contain a Car_ID. Should we create one fact table with foreign keys to Car and Person, and accept the risk that person_id is sometimes null because it is not given? Or would a better approach be to create two fact tables, at the risk of spreading out the 'grain'?
The use case would be: get the data for a specific car, including the states it has had and the person's first name.
I am new to data warehousing and I hope someone can assist me with this issue.

A standard practice in data warehousing is to add a dummy row to each dimension table that is used to match "UNKNOWN" data. This prevents NULLs in the foreign keys of the fact table.
Depending on your use case, you may have multiple types of "UNKNOWN" data. For example, you could use a key of -1 for "UNKNOWN" and -2 for "NOT APPLICABLE" dimensional data.
See also: https://www.kimballgroup.com/2010/10/design-tip-128-selecting-default-values-for-nulls/
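For example, a minimal sketch of the idea (all table and column names here are made up for illustration):

-- Dummy rows in the Person dimension for missing / not-applicable data,
-- using the -1 / -2 convention described above
INSERT INTO Person_dim (Person_SK, FirstName, LastName)
VALUES (-1, 'UNKNOWN', 'UNKNOWN'),
       (-2, 'NOT APPLICABLE', 'NOT APPLICABLE');

-- During the fact load, fall back to the dummy key instead of inserting NULL:
--   COALESCE(p.Person_SK, -1) AS Person_SK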

You need dims such as Car_dim, Person_dim, Status_dim (with CleaningState values such as "NotCleaned" or "Cleaned"), and Date_dim. Person_dim can have an "Unknown" row for when you get a null person name.
Dim and fact tables have a parent/child relationship, which means you have to load data into the dims first (the dim is the parent) and then load the fact (child) table.
Load the dim IDs from the above dims into your fact table based on the data you get. Make sure both logs have date fields in them, so you can join the two logs on Car_ID where the dates in both logs match for that Car_ID.
If you get a scenario where a Car_ID exists in CarDataLog but not in CarStatusLog, then you need a row of "Unknown Status" in Status_dim so you can use it in the fact table. Good luck!
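A rough sketch of that fact load, assuming the dimension names above and log tables with a LogDate and Person_ID column (all names are illustrative, and -1 is the surrogate key of the "Unknown" rows):

INSERT INTO CarLog_fact (Car_SK, Person_SK, Status_SK, Date_SK)
SELECT
    c.Car_SK,
    COALESCE(p.Person_SK, -1)  AS Person_SK,   -- 'Unknown' person
    COALESCE(st.Status_SK, -1) AS Status_SK,   -- 'Unknown Status'
    d.Date_SK
FROM CarDataLog dl
LEFT JOIN CarStatusLog sl ON sl.Car_ID  = dl.Car_ID
                         AND sl.LogDate = dl.LogDate
JOIN Car_dim c          ON c.Car_ID = dl.Car_ID
LEFT JOIN Person_dim p  ON p.Person_ID = dl.Person_ID
LEFT JOIN Status_dim st ON st.CleaningState = sl.CleaningState
JOIN Date_dim d         ON d.FullDate = dl.LogDate;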

Related

Normalizing issue with data history

I have two entities: Location and Employee. Each employee works in a single location at a time, so at any given moment the model is a simple one-to-many relationship from Location to Employee.
There is, however, a requirement to also store historical information for all locations and employees for every end-of-month. I can achieve this by adding a Month PK attribute to both entities, but how do I handle the relationship in that case?
A foreign key has to reference a composite PK in its entirety. Several alternatives come to mind:
Option 1: repeat the Month attribute in the Employee entity to get the full PK as FK attributes. This feels a bit redundant: if an employee has existed in a given month, surely she has to work in a location in that same month, i.e. the two Month attributes always have to have the exact same value.
Option 2: re-use the Month attribute in the PK of the Employee entity as a foreign key referencing Location. I don't even know if this is allowed (note: I'm going to be using SQL Server eventually, if it matters here)?
Option 3: create a separate bridge entity that holds the history of Location-Employee relationships. This feels kind of neat, but then again I have some doubts as to whether or not I can use one Month attribute here or if I need two of them. Also, it would allow many-to-many relationships (an employee in several locations on a given month), which is not supposed to happen in this case and I'd like to be able to enforce this in the data model.
Am I missing something obvious here? What is the "correct" and properly normalized solution? Or should I just leave the FK constraints out?
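For concreteness, a rough SQL Server sketch of what Options 1 and 2 combined could look like (column names and types are assumptions): Employee repeats the Month attribute, and that same column serves both in its primary key and in the composite foreign key to Location, which SQL Server allows.

CREATE TABLE Location (
    LocationId    int           NOT NULL,
    SnapshotMonth date          NOT NULL,   -- end-of-month snapshot
    LocationName  nvarchar(100) NOT NULL,
    CONSTRAINT PK_Location PRIMARY KEY (LocationId, SnapshotMonth)
);

CREATE TABLE Employee (
    EmployeeId    int           NOT NULL,
    SnapshotMonth date          NOT NULL,   -- same snapshot as the referenced location row
    LocationId    int           NOT NULL,
    EmployeeName  nvarchar(100) NOT NULL,
    CONSTRAINT PK_Employee PRIMARY KEY (EmployeeId, SnapshotMonth),
    -- Re-using SnapshotMonth in the FK forces both Month values to match
    CONSTRAINT FK_Employee_Location
        FOREIGN KEY (LocationId, SnapshotMonth)
        REFERENCES Location (LocationId, SnapshotMonth)
);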

How to populate fact table with Surrogate keys from dimensions

Could you please help me understand how to populate a fact table with surrogate keys from dimensions?
I have the following fact table and dimensions:
ClaimFacts
ContractDim_SK
ClaimDim_SK
AccountingDim_SK
ClaimNbr
ClaimAmount
ContractDim
ContractDim_SK (PK)
ContractNbr(BK)
ReportingPeriod(BK)
Code
Name
AccountingDim
TransactionNbr(BK)
ReportingPeriod(PK)
TransactionCode
CurrencyCode
(Should I add ContractNbr here ?? original table in OLTP has it)
ClaimDim
ClaimDim_SK (PK)
ClaimNbr (BK)
ReportingPeriod(BK)
ClaimDesc
ClaimName
(Should I add ContractNbr here ?? original table in OLTP has it)
My logic for loading data into the fact table is the following:
First I load data into the dimensions (the surrogate keys are created as identity columns).
From the transactional model (OLTP), the fact table is filled with the measures (ClaimNbr and ClaimAmount).
I don't know how to populate the fact table with the SKs of the dimensions: how do I know which row in the fact table each key pulled from a dimension belongs to (which key belongs to this ClaimNbr)?
Should I add ContractNbr to all dimensions and join them together when loading the keys into the fact?
What’s the right approach to do this?
Please help,
Thank you
The way it usually works:
In your dimensions, you will have "Natural Keys" (aka "Business Keys") - keys that come from external systems. For example, Contract Number. Then you create synthetic (surrogate) keys for the table.
In your fact table, all keys initially must also be "Natural Keys". For example, Contract Number. Such keys must exist for each dimension that you want to connect to the fact table. Sometimes, a dimension might need several natural keys (collectively, they represent the dimension table's "granularity" level). For example, Location might need State and City keys if modeled at the State-City level.
Join your dim table to the fact table on the natural keys, and in the result omit the natural key from the fact and select the surrogate key from the dim. I usually do a left join (fact LEFT JOIN dim) to catch records that don't match. I also join dims one by one (to better control what's happening).
Basic example (using T-SQL). Let's say you have the following 2 tables:
Table Source.Sales
( Contract_BK,
  Amount,
  Quantity )

Table Dim.Contract
( Contract_SK,
  Contract_BK,
  Contract_Type )
To swap keys:
SELECT
c.Contract_SK
,s.Amount
,s.Quantity
INTO
Fact.Sales
FROM
Source.Sales s LEFT JOIN Dim.Contract c ON s.Contract_BK = c.Contract_BK
-- Test for missing keys (fact rows that did not find a matching dimension row)
SELECT
    *
FROM
    Fact.Sales
WHERE
    Contract_SK IS NULL

Identifying the fact table in data warehouse design

I'm trying to design my first data mart with a star schema from an Excel sheet containing information about help desk service calls. The sheet contains 33 fields with different kinds of information, and I can't identify the fact table because I want to do the reporting later based on different KPIs.
I want to know how to identify the fact table measures easily, and I have another question: can a fact table contain only foreign keys of dimensions and no measures? Thanks in advance guys, and sorry for my bad English.
You can have more than one fact table.
A fact table represents an event or process that you want to analyze.
The structure of the fact tables depends on the process or event that you are trying to analyze.
You need to tell us the events or processes that you want to analyze before we can help you further.
Can a fact table contain only foreign keys of dimensions and no measures?
Yes. This is called a factless fact table.
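As a hypothetical aside (the names below are invented), a factless fact table for calls would hold only dimension keys, and analysis would be done by counting rows:

-- Factless fact table: no numeric measures, only dimension keys
CREATE TABLE FACT_CALL_EVENT (
    DATE_KEY     int NOT NULL,
    AGENT_KEY    int NOT NULL,
    CUSTOMER_KEY int NOT NULL,
    PRODUCT_KEY  int NOT NULL
);

-- e.g. number of calls per product
SELECT PRODUCT_KEY, COUNT(*) AS CALL_COUNT
FROM FACT_CALL_EVENT
GROUP BY PRODUCT_KEY;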
Let's say you want to do a basic analysis of calls:
Your full table might look like this
CALL_ID
START_DATE
DURATION
AGENT_NAME
AGENT_TENURE (how long worked for company)
CUSTOMER_NAME
CUSTOMER_TENURE (how long a customer)
PRODUCT_NAME (the product the customer is calling about)
RESOLVED
You would turn this into a fact table like this:
CALL_ID
START_DATE_KEY
AGENT_KEY
CUSTOMER_KEY
PRODUCT_KEY
DURATION (measure)
RESOLVED (quasi-measure)
And you would have a DATE dimension table, AGENT dimension table, CUSTOMER dimension table and PRODUCT dimension table.
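Once those tables exist, KPI queries become joins plus aggregations. A sketch of one such query (the table and column names below are only illustrative, and RESOLVED is assumed to be stored as 1/0):

-- Average duration and resolution rate per product and month
SELECT
    d.YEAR_MONTH,
    p.PRODUCT_NAME,
    AVG(f.DURATION) AS AVG_DURATION,
    AVG(CASE WHEN f.RESOLVED = 1 THEN 1.0 ELSE 0.0 END) AS RESOLUTION_RATE
FROM FACT_CALL f
JOIN DIM_DATE d    ON d.DATE_KEY    = f.START_DATE_KEY
JOIN DIM_PRODUCT p ON p.PRODUCT_KEY = f.PRODUCT_KEY
GROUP BY d.YEAR_MONTH, p.PRODUCT_NAME;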
Agile Data Warehouse Design is a good book, as are the ones by Kimball.
In general, the way I've done it (and there are a number of ways to do anything) is that the categorical data is referenced with a FKey in the fact table, but anything you want to perform aggregations on (typically money/integer/double columns) can be in the fact table as well. So, for example, a fact table might contain a hierarchy of types, such as product_category >> product_name, and it usually contains a time and/or location field as well; all of these would be referenced by a FKey to a lookup table. The measure columns are usually integer or money data, and are used in aggregate functions grouped by the other fields, like this:
select sum(measureOne) as measure_total, product_category
from facttable
where timeCol between X and Y
group by product_category ...etc
At one time a few years ago, I did have a fact table that had no measure column... because the only measure I had was based on count, which I would do dynamically by grouping different dimensions in the fact table.

How to create a fact table using natural keys

We've got a data warehouse design with four dimension tables and one fact table:
dimUser id, email, firstName, lastName
dimAddress id, city
dimLanguage id, language
dimDate id, startDate, endDate
factStatistic id, dimUserId, dimAddressId, dimLanguageId, dimDate, loginCount, pageCalledCount
Our problem is: we want to build the fact table, which involves calculating the statistics (per userId and date range) and filling in the foreign keys.
But we don't know how, because we don't understand how to use natural keys (which seem to be the solution to our problem according to the literature we have read).
I believe a natural key would be the userId, which is needed in all ETL jobs that calculate the dimension data.
But there are many difficulties:
in the ETL jobs' load() step, we do bulk inserts with INSERT IGNORE INTO to remove duplicates => we don't know the surrogate keys that were generated
if we create metadata (including a set of dimension_name, surrogate_key, natural_key) this will not work because of the duplicate elimination
The problem seems to be the duplicate elimination strategy. Is there a better approach?
We are using MySQL 5.1, if it makes any difference.
If your fact table is tracking logins and page calls per user, then you should have a set of source tables which track these things, and that is where you'll load your fact table data from. I would probably build the fact table at the grain of one row per user / login date - or even lower, to persist atomic data if at all possible.
Here you would then have a fact table with two dimensions - User and Date. You can persist address and language as dimensions on the fact as well, but these are really just attributes of user.
Your dimensions should have surrogate keys, but also should have the source "business" or "natural" key available - either as an attribute on the dimension itself, or through a mapping table as your colleague suggested. It's not "wrong" to use a mapping table - it does make things easier when there are multiple sources.
If you store the business keys in a mapping table, or in the dimension as an attribute, then for each row to load into the fact it's a simple lookup (usually via a join) against the dim or mapping table to get the surrogate key for the user (and then from the user to get the user's "current" address / language to persist on the fact). The date dimension usually has a surrogate key stored in YYYYMMDD or another "natural" format - you can just generate this from the date information on the source record that you're loading into the fact.
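An illustrative sketch of that lookup, assuming the source userId is stored as a business-key attribute on dimUser and that the aggregated statistics sit in a hypothetical staging table (stagingStatistic, statDate, city and language are made-up names):

INSERT INTO factStatistic (dimUserId, dimAddressId, dimLanguageId, dimDate, loginCount, pageCalledCount)
SELECT
    u.id,                -- surrogate keys looked up via the business keys
    a.id,
    l.id,
    d.id,
    s.loginCount,
    s.pageCalledCount
FROM stagingStatistic s
JOIN dimUser u     ON u.userId   = s.userId      -- business key kept on the dimension
JOIN dimAddress a  ON a.city     = s.city
JOIN dimLanguage l ON l.language = s.language
JOIN dimDate d     ON s.statDate BETWEEN d.startDate AND d.endDate;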
Do not force this into a single query; try to load the data with separate queries and combine the data in some provider...

To normalize or not to normalize user_ids

In my Rails application, I have a variety of database tables that contain user data. Some of these tables have a lot of rows (as many as 500,000 rows per user in some cases) and are queried frequently. Whenever I query any table for anything, the user_id of the current user is somewhere in the query - either directly, if the table has a direct relation with the user, or through a join, if they are related through some other tables.
Should I denormalize the user_id and include it in every table, for faster performance?
Here's one example:
Address belongs to user, and has a user_id
Envelope belongs to user, and has a user_id
AddressesEnvelopes joins an Address and an Envelope, so it has envelope_id and address_id -- it doesn't have user_id, but could get to it through either the envelope or the address (which must belong to the same user).
One common expensive query is to select all the AddressesEnvelopes for a particular user, which I could accomplish by joining with either Address or Envelope, even though I don't need anything from those tables. Or I could just duplicate the user id in this table.
Here's a different scenario:
Letter belongs to user, and has a user_id
Recepient belongs to Letter, and has a letter_id
RecepientOption belongs to Recepient, and has a recepient_id
Would it make sense to duplicate the user_id in both Recepient and RecepientOption, even though I could always get to it by going up through the associations, through Letter?
Some notes:
There are never any objects that are shared between users. An entire hierarchy of related objects always belongs to the same user.
The user owner of objects never changes.
Database performance is important because it's a data intensive application. There are many queries and many tables.
So should I include user_id in every table so I can use it when creating indexes? Or would that be bad design?
I'd like to point out that it isn't necessary to denormalize if you are willing to work with composite primary keys. Sample for the AddressesEnvelopes case:
user(
    #user_id
)
address(
    #user_id
  , #address_num
)
envelope(
    #user_id
  , #envelope_num
)
address_envelope(
    #user_id
  , #address_num
  , #envelope_num
)
(the # indicates a primary key column)
I am not a fan of this design if I can avoid it, but considering that you say all these objects are tied to a user, this type of design would make it relatively simple to partition your data (either logically, putting ranges of users in separate tables, or physically, using multiple databases or even machines).
Another thing that would make sense with this type of design is using clustered indexes (in MySQL, the primary key of an InnoDB table is a clustered index). If you ensure the user_id is always the first column in your index, it will ensure that for each table, all data for one user is stored close together on disk. This is great when you always query by user_id, but it can hurt performance if you query by another object (in which case duplication like you suggested may be a better solution).
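A sketch of the join table from the design above in MySQL DDL, with user_id leading the clustered primary key (the column types here are assumptions):

CREATE TABLE address_envelope (
    user_id      INT NOT NULL,
    address_num  INT NOT NULL,
    envelope_num INT NOT NULL,
    -- user_id first: all of a user's rows cluster together on disk
    PRIMARY KEY (user_id, address_num, envelope_num),
    FOREIGN KEY (user_id, address_num)  REFERENCES address  (user_id, address_num),
    FOREIGN KEY (user_id, envelope_num) REFERENCES envelope (user_id, envelope_num)
) ENGINE=InnoDB;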
At any rate, before you change the design, first make sure your schema is already optimized, and you have proper indexes on your foreign key columns. If performance really is paramount, you should simply try several solutions and do benchmarks.
As long as you
a) get a measurable performance improvement
and
b) know which parts of your database are real normalized data and which are redundant improvements
there is no reason not to do it!
Do you actually have a measured performance problem? 500,000 rows isn't a very large table. Your selects should be reasonably fast if they are not very complex and you have proper indexes on your columns.
I would first see if there are slow queries and try to optimize them with indexes. Only if that is not enough would I look into denormalization.
The denormalizations that you suggest seem reasonable if you can't achieve the required performance by other means. Just make sure that you keep the denormalized fields up to date.
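One way to keep such a field in sync (purely illustrative; a Rails callback would work just as well, and the Rails-convention table names below are assumptions) is a database trigger that copies user_id from the parent row on insert:

-- Copy user_id from the parent address whenever a join row is inserted
CREATE TRIGGER addresses_envelopes_set_user_id
BEFORE INSERT ON addresses_envelopes
FOR EACH ROW
SET NEW.user_id = (SELECT user_id FROM addresses WHERE id = NEW.address_id);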

Resources