How to manage foreign keys with Kettle/Spoon? - data-warehouse

I'm filling my data warehouse tables (MySQL) with Spoon after some transformations. My dimension tables are already filled, and I have to fill my fact table from a CSV file. When I try to do it, Kettle warns me that some records violate a foreign key constraint, and the transformation is aborted. This happens because not all foreign keys in my CSV appear in my dimension tables. How can I check in Kettle whether a foreign key is present in the dimension table it refers to?
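One common way to avoid the abort is to look up each incoming key in the dimension before the insert and route rows with no match to a reject stream. In Kettle this is typically done with a lookup step followed by a filter (for example Database lookup, then Filter rows). The sketch below shows the same logic in Python with an in-memory SQLite database so that it runs anywhere. The table names, column names, and CSV content are hypothetical, and the SQL would need adapting to MySQL.

```python
import csv
import io
import sqlite3

# Hypothetical dimension and fact tables; names are for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (
        customer_id INTEGER NOT NULL REFERENCES dim_customer(customer_id),
        amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Alice'), (2, 'Bob');
""")

# Stand-in for the CSV file that feeds the fact table.
csv_text = "customer_id,amount\n1,10.0\n3,99.0\n2,5.5\n"

# Read the set of keys that exist in the dimension once, up front.
known_keys = {row[0] for row in conn.execute("SELECT customer_id FROM dim_customer")}

accepted, rejected = [], []
for record in csv.DictReader(io.StringIO(csv_text)):
    key = int(record["customer_id"])
    row = (key, float(record["amount"]))
    # Route each row by whether its foreign key is present in the dimension.
    (accepted if key in known_keys else rejected).append(row)

# Only rows with a matching dimension key reach the fact table.
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", accepted)
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print("loaded:", loaded, "rejected:", rejected)
```

The rejected rows can then be written to an error file, or the missing members can be added to the dimension before a second pass.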

Related

Primary and Foreign Key in DW tables

I've read that dimension tables hold the primary key and fact tables contain the foreign key which references the primary key of the dimension tables.
Now the confusion I am having is this - suppose I have an ETL pipeline which populates the dimension table (let's say customer) from a source (say another DB). Let's assume this is a frequently changing table and has over 200 columns. How do I incorporate these changes in the dimension tables? I want to have only the latest record for each customer (type 1 SCD) in the DWH.
One thing I could do is delete the row in the dimension table and re-insert the new, updated row. But this approach won't work because of the primary key / foreign key constraint (which will not allow me to delete the record).
Should I write an update statement with all 200 columns in the ETL script? Or is there any other approach?
Strictly speaking you just need to update the fields that changed. But the cost of updating all in a single record is probably similar (assuming it’s row based storage), and it’s probably easier to write.
You can’t delete and re-insert, as the new row will have a new PK and old facts will no longer be linked.
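To make the point concrete, here is a minimal sketch (Python with in-memory SQLite, hypothetical table and column names) showing that the delete is rejected while an in-place update keeps the surrogate key, so existing facts stay linked. The SET clause is generated from the column list, which is one way to avoid hand-writing 200 assignments. This is an illustration under those assumptions, not a description of any particular ETL tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY,
        customer_bk TEXT UNIQUE,
        name TEXT,
        city TEXT
    );
    CREATE TABLE fact_orders (
        customer_sk INTEGER NOT NULL REFERENCES dim_customer(customer_sk),
        amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'C-100', 'Alice', 'Paris');
    INSERT INTO fact_orders VALUES (1, 42.0);
""")

# Deleting a dimension row that facts still reference is rejected.
try:
    conn.execute("DELETE FROM dim_customer WHERE customer_bk = 'C-100'")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True

# Type 1 change: overwrite the attributes in place, matched on the business key.
incoming = {"customer_bk": "C-100", "name": "Alice", "city": "Berlin"}
columns = [c for c in incoming if c != "customer_bk"]
set_clause = ", ".join(f"{c} = :{c}" for c in columns)
conn.execute(
    f"UPDATE dim_customer SET {set_clause} WHERE customer_bk = :customer_bk",
    incoming,
)
conn.commit()

# The fact row still joins to the same surrogate key and sees the new value.
row = conn.execute(
    "SELECT d.customer_sk, d.city FROM fact_orders f "
    "JOIN dim_customer d ON d.customer_sk = f.customer_sk"
).fetchone()
print(delete_blocked, row)
```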

How to populate fact table with Surrogate keys from dimensions

Could you please help me understand how to populate a fact table with surrogate keys from dimensions?
I have the following fact table and dimensions:
ClaimFacts
    ContractDim_SK
    ClaimDim_SK
    AccountingDim_SK
    ClaimNbr
    ClaimAmount
ContractDim
    ContractDim_SK (PK)
    ContractNbr (BK)
    ReportingPeriod (BK)
    Code
    Name
AccountingDim
    TransactionNbr (BK)
    ReportingPeriod (PK)
    TransactionCode
    CurrencyCode
    (Should I add ContractNbr here? The original table in OLTP has it.)
ClaimDim
    ClaimDim_SK (PK)
    ClaimNbr (BK)
    ReportingPeriod (BK)
    ClaimDesc
    ClaimName
    (Should I add ContractNbr here? The original table in OLTP has it.)
My logic to load data into the fact table is the following:
First I load data into the dimensions (surrogate keys are created as identity columns).
From the transactional model (OLTP), the fact table will be filled with the measures (ClaimNbr and ClaimAmount).
I don't know how to populate the fact table with the SKs of the dimensions: how do I know which row in the fact table each key pulled from a dimension belongs to (which key belongs to this ClaimNbr)?
Should I add ContractNbr to all dimensions and join them together when loading keys into the fact table?
What's the right approach to do this?
Please help,
Thank you
The way it usually works:
In your dimensions, you will have "Natural Keys" (aka "Business Keys"): keys that come from external systems, for example Contract Number. Then you create synthetic (surrogate) keys for the table.
In your fact table, all keys initially must also be "Natural Keys", for example Contract Number. Such keys must exist for each dimension that you want to connect to the fact table. Sometimes a dimension might need several natural keys (collectively, they represent the dimension table's "Granularity" level). For example, Location might need State and City keys if modeled at the State-City level.
Join your dim table to the fact table on the natural keys, and in the result omit the natural key from the fact and select the surrogate key from the dim. I usually do a left join (fact left join dim) to catch records that don't match. I also join dims one by one (to better control what's happening).
Basic example (using T-SQL). Let's say you have the following 2 tables:
Table Source.Sales
    (Contract_BK,
     Amount,
     Quantity)
Table Dim.Contract
    (Contract_SK,
     Contract_BK,
     Contract_Type)
To swap keys:
-- Replace the natural key with the surrogate key while loading the fact table
SELECT
    c.Contract_SK
    ,s.Amount
    ,s.Quantity
INTO
    Fact.Sales
FROM
    Source.Sales s
    LEFT JOIN Dim.Contract c ON s.Contract_BK = c.Contract_BK;
-- Test for missing keys
SELECT
    *
FROM
    Fact.Sales
WHERE
    Contract_SK IS NULL;
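The same key swap extends to a dimension whose grain needs several natural keys, such as the State and City case mentioned above. The following is a minimal sketch in Python with in-memory SQLite; the dim_location, source_sales, and fact_sales tables and their contents are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_location (
        location_sk INTEGER PRIMARY KEY,
        state TEXT,
        city TEXT,
        UNIQUE (state, city)
    );
    CREATE TABLE source_sales (state TEXT, city TEXT, amount REAL);
    INSERT INTO dim_location VALUES (1, 'TX', 'Austin'), (2, 'TX', 'Dallas');
    INSERT INTO source_sales VALUES
        ('TX', 'Austin', 10.0), ('TX', 'Dallas', 20.0), ('CA', 'Austin', 30.0);
""")

# Swap the two-part natural key for the surrogate key. The LEFT JOIN keeps
# unmatched source rows so they can be inspected instead of silently dropped.
conn.execute("""
    CREATE TABLE fact_sales AS
    SELECT l.location_sk, s.amount
    FROM source_sales s
    LEFT JOIN dim_location l ON s.state = l.state AND s.city = l.city
""")

matched = conn.execute(
    "SELECT location_sk, amount FROM fact_sales "
    "WHERE location_sk IS NOT NULL ORDER BY location_sk"
).fetchall()
missing = conn.execute(
    "SELECT COUNT(*) FROM fact_sales WHERE location_sk IS NULL"
).fetchone()[0]
print(matched, missing)
```

Joining on the city alone would wrongly match the 'CA', 'Austin' row to the Texas record, which is why every column of the dimension's grain has to take part in the join.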

how to capture the updates happening on dimension table

I have a fact table joined to 5 dimension tables. Both the fact and dimension tables have the metadata fields DWcreateddate, DWupdatedate, DWdeleteddate, DWdeletedflag. I am building a table which flattens out the fact table by joining all the dimensions on surrogate keys.
I am doing an incremental load into the flattened table. I track the upserts happening on the fact table through the metadata fields and do the incremental load with a stored procedure. If a record is updated to a new name in the dimension table, the fact's DWupdatedate doesn't get the latest date, so my flattened table ends up with the old name. Can someone help me overcome this?
You should never update your dimension. Once created, a dimension record should be left alone, with a few exceptions like slowly changing dimensions. You should create a new dimension record instead.
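If the dimension is nevertheless updated in place, as the question describes, one possible workaround (not part of the answer above) is to have the incremental query pick up a fact row when either the fact or any joined dimension row changed since the last load. The sketch below uses Python with in-memory SQLite and a single hypothetical dimension; only the DWupdatedate column name comes from the question, everything else, including the watermark value, is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_sk INTEGER PRIMARY KEY, name TEXT, DWupdatedate TEXT
    );
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY, product_sk INTEGER,
        amount REAL, DWupdatedate TEXT
    );
    -- Product 1 was renamed after the last load; its fact row was not touched.
    INSERT INTO dim_product VALUES
        (1, 'Widget v2', '2024-03-05'), (2, 'Gadget', '2024-01-01');
    INSERT INTO fact_sales VALUES
        (10, 1, 5.0, '2024-01-10'), (11, 2, 7.0, '2024-01-11');
""")

last_load = "2024-02-01"  # hypothetical watermark of the previous incremental run

# Select rows to refresh when the fact OR the dimension changed since last_load.
# ISO-8601 date strings compare correctly as plain text.
changed = conn.execute(
    """
    SELECT f.sale_id, d.name, f.amount
    FROM fact_sales f
    JOIN dim_product d ON d.product_sk = f.product_sk
    WHERE f.DWupdatedate > :since OR d.DWupdatedate > :since
    ORDER BY f.sale_id
    """,
    {"since": last_load},
).fetchall()
print(changed)
```

With five dimensions the WHERE clause would need one such condition per dimension, which can make the incremental query noticeably more expensive.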

Load fact table using informatica

How can we load a fact table in a star schema using Informatica PowerCenter? Can you please provide an example of the mappings/transformations for this?
To load a fact table in a star schema, where the dimension tables are independent of each other, add a lookup on every dimension that the fact references. Override each lookup query so that it returns only active records, and use only the natural key (the key that identifies the record in the dimension) in the lookup condition. From the matching row take the surrogate key, the artificial key generated when the dimension table was loaded, along with any other fields you want to load into the fact table.
Take the staging tables as source tables and the dimensions as lookups, then load the data into the fact table.
eg. http://www.folkstalk.com/2012/11/how-to-load-rows-into-fact-table-in.html
I was not able to find one when I was learning, hence adding this screenshot as a reference for new learners.
The mapping basically looks up each of the dimension tables and loads the dimension keys into the fact table as foreign keys; the rest of the active records should come from the Source Qualifier (SQ). I have used an SQL override to perform all the joins and conditions required for loading the fact records.
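Outside of Informatica, the lookup logic described above can be sketched in a few lines. The example below uses Python with in-memory SQLite; the staging, dimension, and fact tables and the is_active flag are hypothetical stand-ins, and the dictionary plays the role of a lookup cache restricted to active records. It is a sketch of the idea, not of PowerCenter's actual behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY,
        customer_bk TEXT,
        name TEXT,
        is_active INTEGER
    );
    CREATE TABLE stg_orders (order_id INTEGER, customer_bk TEXT, amount REAL);
    CREATE TABLE fact_orders (order_id INTEGER, customer_sk INTEGER, amount REAL);
    INSERT INTO dim_customer VALUES
        (1, 'C-100', 'Alice (old)', 0),
        (2, 'C-100', 'Alice', 1),
        (3, 'C-200', 'Bob', 1);
    INSERT INTO stg_orders VALUES (500, 'C-100', 12.0), (501, 'C-200', 8.0);
""")

# Lookup cache: natural key -> surrogate key, restricted to active rows only.
lookup = dict(
    conn.execute(
        "SELECT customer_bk, customer_sk FROM dim_customer WHERE is_active = 1"
    )
)

# Staging is the source; each row gets its surrogate key from the lookup.
fact_rows = [
    (order_id, lookup.get(customer_bk), amount)
    for order_id, customer_bk, amount in conn.execute(
        "SELECT order_id, customer_bk, amount FROM stg_orders ORDER BY order_id"
    )
]
conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", fact_rows)
conn.commit()
print(fact_rows)
```

Without the is_active filter, the natural key 'C-100' would match two dimension rows and the fact could pick up the outdated surrogate key.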

Fact table primary key

I have a fact table with 8 foreign keys (referencing 8 dimensions), but even a combination of all eight keys does not uniquely identify a row. Do I need to add another attribute from the original data (i.e. the "project-id" attribute, which is useless for anything else) so that I can have a primary key, or can I leave the fact table as it is, without a primary key?
The first rule of a fact table is to declare your grain - what uniquely identifies a row.
It sounds like you haven't declared your grain for this table. If the grain of the table is "one row per project", then you need to include project as a degenerate dimension in the table.
Every table must have a primary key. That's relational rule #1.
You can always add a surrogate key, but I like the idea of a fact table having attributes that satisfy a unique constraint. I second your idea: add more attributes until you have a unique constraint.
Along with those 8 foreign keys, include a simple surrogate key (like a row index) in each row. This will identify every row of the fact table uniquely.
For the surrogate key you may start from an index, say 1, for the first row and then increment it by one each time you add a new row to the fact table.
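As an illustration of the row-index idea, the sketch below lets the database generate the surrogate key, so two fact rows with identical foreign keys remain distinguishable. It uses Python with in-memory SQLite and a hypothetical fact_work table with two foreign keys instead of eight, to keep it short; other databases offer an equivalent identity or auto-increment column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_work (
        fact_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- row-index surrogate key
        date_sk INTEGER NOT NULL,
        employee_sk INTEGER NOT NULL,
        hours REAL
    )
""")

# The first two rows share every foreign key value and every measure.
rows = [(20240101, 7, 3.0), (20240101, 7, 3.0), (20240102, 7, 4.5)]
conn.executemany(
    "INSERT INTO fact_work (date_sk, employee_sk, hours) VALUES (?, ?, ?)", rows
)
conn.commit()

keys = [r[0] for r in conn.execute("SELECT fact_sk FROM fact_work ORDER BY fact_sk")]
print(keys)
```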

Resources