I have a fact table with 8 foreign keys (referencing 8 dimensions), but even the combination of all eight keys does not uniquely identify a row. Do I need to add another attribute from the original data (e.g. a "project-id" attribute, which is useless for anything else) so that I can have a primary key, or can I leave the fact table as it is, without a primary key?
The first rule of a fact table is to declare your grain - what uniquely identifies a row.
It sounds like you haven't declared your grain for this table. If the grain of the table is "one row per project", then you need to include project as a degenerate dimension in the table.
Every table must have a primary key. That's relational rule #1.
You can always add a surrogate key, but I like the idea of a fact table having attributes that satisfy a unique constraint. I second your idea: add more attributes until you have a unique constraint.
Along with those 8 foreign keys, add a simple surrogate key (such as a row index) to each row. This identifies every row of the fact table uniquely.
For the surrogate key, you can start at 1 for the first row and increment by one each time you add a new entry to the fact table.
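As a rough sketch of the surrogate-key suggestion above, here is a small Python example using the standard-library sqlite3 module. The table and column names (fact_work, date_sk, employee_sk, hours) are made up for illustration, and two foreign keys stand in for the eight in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical fact table: two foreign keys stand in for the eight in the question.
conn.execute("""
    CREATE TABLE fact_work (
        fact_sk     INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        date_sk     INTEGER NOT NULL,
        employee_sk INTEGER NOT NULL,
        hours       REAL NOT NULL
    )
""")

# The first two rows share every foreign key, so the keys alone cannot tell them apart.
rows = [(20240101, 7, 4.0), (20240101, 7, 4.0), (20240102, 7, 8.0)]
conn.executemany(
    "INSERT INTO fact_work (date_sk, employee_sk, hours) VALUES (?, ?, ?)", rows
)

keys = [r[0] for r in conn.execute("SELECT fact_sk FROM fact_work ORDER BY fact_sk")]
print(keys)
```

Each row gets its own fact_sk even though the foreign-key combination repeats. Whether that is preferable to adding project-id as a degenerate dimension depends on the grain you declare.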
I've read that dimension tables hold the primary key and fact tables contain the foreign key, which references the primary key of the dimension tables.
Now the confusion I am having is this - suppose I have an ETL pipeline which populates the dimension table (let's say customer) from a source (say another DB). Let's assume this is a frequently changing table and has over 200 columns. How do I incorporate these changes in the dimension tables? I want to have only the latest record for each customer (type 1 SCD) in the DWH.
One thing I could do is delete the row in the dimension table and re-insert the updated row. But this approach won't work because of the primary key - foreign key constraint, which will not allow me to delete the record.
Should I write an update statement with all 200 columns in the ETL script? Or is there any other approach?
Strictly speaking you just need to update the fields that changed. But the cost of updating all in a single record is probably similar (assuming it’s row based storage), and it’s probably easier to write.
You can’t delete and re-insert, as the new row will have a new PK and old facts will no longer be linked.
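To illustrate the update-in-place approach, here is a minimal sketch in Python with sqlite3. The schema (dim_customer, fact_sales) and the two attribute columns are assumptions standing in for the real 200-column table. The SET clause is generated from a column list, so nobody has to type out 200 assignments by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_bk TEXT NOT NULL UNIQUE,               -- business key from the source
        name        TEXT,
        city        TEXT
    );
    CREATE TABLE fact_sales (
        customer_sk INTEGER NOT NULL REFERENCES dim_customer (customer_sk),
        amount      REAL NOT NULL
    );
""")

COLUMNS = ["name", "city"]  # stands in for the ~200 attribute columns


def upsert_type1(conn, row):
    """Overwrite attributes in place (Type 1); insert if the business key is new."""
    assignments = ", ".join(f"{col} = :{col}" for col in COLUMNS)
    cur = conn.execute(
        f"UPDATE dim_customer SET {assignments} WHERE customer_bk = :customer_bk",
        row,
    )
    if cur.rowcount == 0:
        all_cols = ["customer_bk"] + COLUMNS
        names = ", ".join(all_cols)
        params = ", ".join(f":{col}" for col in all_cols)
        conn.execute(f"INSERT INTO dim_customer ({names}) VALUES ({params})", row)


upsert_type1(conn, {"customer_bk": "C-1", "name": "Ann", "city": "Oslo"})
sk_before = conn.execute(
    "SELECT customer_sk FROM dim_customer WHERE customer_bk = 'C-1'"
).fetchone()[0]
conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (sk_before, 10.0))

# The source row changes: update in place, so the surrogate key stays the same.
upsert_type1(conn, {"customer_bk": "C-1", "name": "Ann", "city": "Bergen"})
sk_after, city = conn.execute(
    "SELECT customer_sk, city FROM dim_customer WHERE customer_bk = 'C-1'"
).fetchone()
linked_facts = conn.execute(
    "SELECT COUNT(*) FROM fact_sales f JOIN dim_customer d USING (customer_sk)"
).fetchone()[0]
print(sk_before, sk_after, city, linked_facts)
```

Because the row is updated rather than deleted and re-inserted, the existing fact row still joins to the customer after the change.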
I am trying to make one to many relationship between Contacts table and DepartmentTitle table.
I was thinking of introducing surrogate key on DepartmentTitle table so that I can reference this DepartmentTitle to Contacts table to trigger one to many relationship between these two tables. But I don't want to register same combination of the composite keys in the DepartmentTitle and that has prevented me from introducing the surrogate key to the table. I want the combination of composite keys in DepartmentTitle table to be unique.
To remedy the situation, I thought of implementing the ER diagram below, where departmentTitleID would be unique and used as the reference ID to the table (but would not be the primary key). Would this work? If not, what would be the solution?
If you're going to introduce a surrogate key, use it as your primary key. However, I would rather have Department_ID and Title_ID as separate columns in Contacts, since that allows Contacts to be joined directly to Department and/or Title as needed, without always needing to join DepartmentTitle. You can still have a composite foreign key constraint from the two columns in Contacts to the same in DepartmentTitle.
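A minimal sketch of that layout, written in Python with sqlite3 so it can be run as-is. The column names DepartmentTitle_ID, Contact_ID and Name are assumptions, and the Department and Title tables are left out for brevity. The surrogate key is the primary key, the (Department_ID, Title_ID) pair stays unique, and Contacts references the pair through a composite foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE DepartmentTitle (
        DepartmentTitle_ID INTEGER PRIMARY KEY,   -- surrogate key
        Department_ID      INTEGER NOT NULL,
        Title_ID           INTEGER NOT NULL,
        UNIQUE (Department_ID, Title_ID)          -- no duplicate combinations
    );
    CREATE TABLE Contacts (
        Contact_ID    INTEGER PRIMARY KEY,
        Name          TEXT NOT NULL,
        Department_ID INTEGER NOT NULL,
        Title_ID      INTEGER NOT NULL,
        FOREIGN KEY (Department_ID, Title_ID)
            REFERENCES DepartmentTitle (Department_ID, Title_ID)
    );
""")

conn.execute("INSERT INTO DepartmentTitle (Department_ID, Title_ID) VALUES (1, 10)")
conn.execute("INSERT INTO Contacts (Name, Department_ID, Title_ID) VALUES ('Ann', 1, 10)")


def try_insert(sql):
    """Return True if the statement succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Same department/title combination again: rejected by the UNIQUE constraint.
duplicate_ok = try_insert(
    "INSERT INTO DepartmentTitle (Department_ID, Title_ID) VALUES (1, 10)"
)
# A combination that is not registered in DepartmentTitle: rejected by the foreign key.
orphan_ok = try_insert(
    "INSERT INTO Contacts (Name, Department_ID, Title_ID) VALUES ('Bob', 2, 10)"
)
print(duplicate_ok, orphan_ok)
```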
I want to convert the given ER diagram with (min, max) notation to tables, and I'm unsure what the primary key of the "trainieren" relation is.
If the Relation R is between A and B and:
one to one -> the primary key is the primary key of either A or B
one to many -> the primary key is the primary key of the entity that takes part multiple times in the relation
many to many -> the primary key is the primary key of A and B
I'd interpret (0,1) and (1,1) as one and (1,3) and (1,*) as many, therefore
my solution would be (primary keys in strong text) trainieren: {[Trainer.AkkrNr, Teams.Land]}
Generally, we try to use only one or many cardinality indicators, since those map easily to table structure. Most data modelers would do the same as you did in discarding the upper limit to simplify the model.
If you want to enforce that limit, there are a few ways to implement it:
Use your structure and a trigger on insert/update to count how many trainers the given team has, and throw an error if it exceeds 3.
You could add a position column to the primary key of trainieren and a constraint to limit it to values 1, 2 and 3. However, that imposes an ordering that wasn't part of the conceptual model.
Change trainieren to (Teams.Land PK, Trainer1.AkkrNr, Trainer2.AkkrNr, Trainer3.AkkrNr). Trainer2 and Trainer3 would need to be nullable, and this design loses the constraint that each trainer belongs to only one team. You could fix that with a trigger. Yuck.
Since there's no ideal way to implement an upper bound on the relationship cardinality, most data modelers would follow the same approach as you did, and leave it to the database client (usually the application logic) to enforce that limit.
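For what it's worth, the position-column option could look roughly like this. It is a sketch in Python with sqlite3; the column name Position and the sample values are assumptions, and the foreign keys to Teams and Trainer are omitted to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trainieren (
        Land     TEXT    NOT NULL,                                 -- team
        Position INTEGER NOT NULL CHECK (Position IN (1, 2, 3)),   -- at most 3 slots
        AkkrNr   INTEGER NOT NULL UNIQUE,                          -- one team per trainer
        PRIMARY KEY (Land, Position)
    )
""")


def assign(land, position, akkr_nr):
    """Try to assign a trainer to a team slot; return False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO trainieren VALUES (?, ?, ?)", (land, position, akkr_nr)
        )
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    assign("DE", 1, 100),
    assign("DE", 2, 101),
    assign("DE", 3, 102),
    assign("DE", 4, 103),  # rejected: position outside 1..3
    assign("DE", 3, 104),  # rejected: slot 3 is already taken
    assign("FR", 1, 100),  # rejected: trainer 100 already has a team
]
print(results)
```

The upper bound of three is enforced declaratively, at the cost of the artificial ordering mentioned above.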
Could you please help me understand how to populate a fact table with surrogate keys from dimensions?
I have the following fact table and dimensions:
ClaimFacts
    ContractDim_SK
    ClaimDim_SK
    AccountingDim_SK
    ClaimNbr
    ClaimAmount

ContractDim
    ContractDim_SK (PK)
    ContractNbr (BK)
    ReportingPeriod (BK)
    Code
    Name

AccountingDim
    TransactionNbr (BK)
    ReportingPeriod (PK)
    TransactionCode
    CurrencyCode
    (Should I add ContractNbr here? The original table in OLTP has it.)

ClaimDim
    CalimsDim_Sk (PK)
    CalimNbr (BK)
    ReportingPeriod (BK)
    ClaimDesc
    ClaimName
    (Should I add ContractNbr here? The original table in OLTP has it.)
My logic to load data into fact table is the following :
First I load data into the dimensions (the surrogate keys are created as identity columns).
From the transactional model (OLTP), the fact table is filled with the measures (ClaimNbr and ClaimAmount).
I don't know how to populate the fact table with the SKs of the dimensions: when I pull a key from a dimension, how do I know which fact row it belongs to (which key belongs to this ClaimNbr)?
Should I add contract Nbr in all dimensions and join them together when loading keys to fact?
What’s the right approach to do this?
Please help,
Thank you
The way it usually works:
In your dimensions, you will have "Natural Keys" (aka "Business Keys") - keys that come from external systems. For example, Contract Number. Then you create synthetic (surrogate) keys for the table.
In your fact table, all keys initially must also be "Natural Keys". For example, Contract Number. Such keys must exist for each dimension that you want to connect to the fact table. Sometimes, a dimension might need several natural keys (collectively, they represent dimension table "Granularity" level). For example, Location might need State and City keys if modeled on State-City level.
Join your dim table to the fact table on the natural keys, and from the result omit the natural key from the fact and select the surrogate key from the dim. I usually do a left join (fact left join dim), so that I can catch records that don't match. I also join dims one by one (to better control what's happening).
Basic example (using T-SQL). Let's say you have the following 2 tables:
Table Source.Sales
    ( Contract_BK,
      Amount,
      Quantity )
Table Dim.Contract
    ( Contract_SK,
      Contract_BK,
      Contract Type )
To Swap keys:
SELECT
    c.Contract_SK
   ,s.Amount
   ,s.Quantity
INTO
    Fact.Sales
FROM
    Source.Sales s
    LEFT JOIN Dim.Contract c ON s.Contract_BK = c.Contract_BK;

-- Test for missing keys
SELECT
    *
FROM
    Fact.Sales
WHERE
    Contract_SK IS NULL;
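The same key swap works when a dimension has a composite natural key, as ContractDim in the question does (ContractNbr plus ReportingPeriod). Below is a small runnable sketch in Python with sqlite3. The staging table StageClaims and all sample values are hypothetical, and it assumes the OLTP extract carries both the contract number and the reporting period for each claim.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ContractDim (
        ContractDim_SK  INTEGER PRIMARY KEY AUTOINCREMENT,
        ContractNbr     TEXT NOT NULL,
        ReportingPeriod TEXT NOT NULL,
        UNIQUE (ContractNbr, ReportingPeriod)   -- composite natural key
    );

    -- Hypothetical staging extract from the OLTP system: natural keys plus measures.
    CREATE TABLE StageClaims (
        ClaimNbr TEXT, ContractNbr TEXT, ReportingPeriod TEXT, ClaimAmount REAL
    );

    INSERT INTO ContractDim (ContractNbr, ReportingPeriod) VALUES
        ('K-1', '2024-01'), ('K-1', '2024-02');
    INSERT INTO StageClaims VALUES
        ('CL-1', 'K-1', '2024-01', 100.0),
        ('CL-2', 'K-1', '2024-02', 250.0),
        ('CL-3', 'K-9', '2024-02', 75.0);       -- contract not present in the dimension

    -- Swap the natural keys for the surrogate key while loading the fact table.
    CREATE TABLE ClaimFacts AS
    SELECT d.ContractDim_SK, s.ClaimNbr, s.ClaimAmount
    FROM StageClaims s
    LEFT JOIN ContractDim d
           ON s.ContractNbr = d.ContractNbr
          AND s.ReportingPeriod = d.ReportingPeriod;
""")

loaded = conn.execute(
    "SELECT ClaimNbr, ContractDim_SK FROM ClaimFacts ORDER BY ClaimNbr"
).fetchall()
# Rows whose natural key found no match in the dimension.
missing = [
    r[0]
    for r in conn.execute("SELECT ClaimNbr FROM ClaimFacts WHERE ContractDim_SK IS NULL")
]
print(loaded, missing)
```

The left join keeps the unmatched claim in the result with a NULL surrogate key, so it can be reported or routed to an "unknown" dimension member instead of silently disappearing.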
I understand the general concept of a surrogate key in a DWH environment.
But there are two aspects I don't understand and couldn't find information about:
Is it common practice that a surrogate key is unique in the whole DWH or unique in one Dimension?
If I have a Dimension with a hierarchy, does that hierarchy influence the generation of the surrogate key?
1) A surrogate key is unique to one row: it serves as a common handle for the relationships between all the cells in that row. Because of how data is stored, a surrogate key is not strictly necessary to infer the relationship between the cells in a row. But if your row represents a countable identity (a row) in an entity (a table), which is the case if your database is normalized, then referring to one single surrogate key (usually the primary key) is easier than keeping a reference to every column that participates in the primary key. Maintaining an index on one compact column is easier than on the whole row, for example.
In fact tables, surrogate keys have another application. Because data is often combined from many sources, chances are that you will run into the problem of composite primary keys (several columns combined are used to identify each row uniquely), as well as the problem of duplicate business keys (the keys taken from the various source systems). Because surrogate keys are used for lookups, their compactness is important. Use an incrementing integer or a fixed-length hash, and keep the business key from the source in a separate column.
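To make the two key styles concrete, here is a small Python sketch. The function names, the "crm"/"erp" source labels and the 16-character truncation are all assumptions for illustration. Truncating a hash shortens the key but makes collisions more likely, so the length is something to choose deliberately.

```python
import hashlib
from itertools import count


def hash_key(source_system: str, business_key: str) -> str:
    """Fixed-length key derived from the source system plus the business key."""
    raw = f"{source_system}|{business_key}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


_next_id = count(1)
_assigned = {}


def integer_key(source_system: str, business_key: str) -> int:
    """Incrementing integer key, stable per (source system, business key) pair."""
    pair = (source_system, business_key)
    if pair not in _assigned:
        _assigned[pair] = next(_next_id)
    return _assigned[pair]


# Two source systems reuse the business key "1001"; the third row is a repeat load.
rows = [("crm", "1001"), ("erp", "1001"), ("crm", "1001")]
dimension = [
    {
        "sk_int": integer_key(source, bk),
        "sk_hash": hash_key(source, bk),
        "source": source,        # kept alongside the surrogate key
        "business_key": bk,      # business key stays in its own column
    }
    for source, bk in rows
]
for row in dimension:
    print(row)
```

Both variants give the duplicate business key "1001" a different surrogate key per source system, and return the same key when the same source row is loaded again.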
2) This question is difficult to answer because I don't know what software you are using to manage your dimensions and hierarchies. This influences things a lot. In a typical denormalized Kimball architecture, a surrogate key is used to reference a unique row in the dimension table. In a dimension table with several hierarchies, the meaning of this can be a bit confusing. The surrogate key will only be truly unique for the hierarchy with the highest cardinality (number of members), as this is what determines how many rows the dimension table has. So the practice is that the key is unique to the dimension table AND to exactly ONE of the hierarchies in it - the one with the highest number of members. If you add versioning of hierarchies (slowly changing dimensions) to this, the exact meaning of the surrogate key can be deceptive.
Note/Rant: I generally find the idea of multiple hierarchies in one dimension table quite appalling. True, it reduces the number of dimension references in the fact table, but there are drawbacks. The denormalization of the dimension table (the ugly duplication) has several consequences. One of them is the risk of double counting when joining on a dimension table. This is often fixed (or glossed over) by the software packages used, which check whether values are the same, sum them, and reduce the count if they are. But this is a common source of counting anomalies and summing errors which can only be handled down the road by really dirty hacks, of which I have seen quite a few.