Surrogate Keys in a Data Warehouse - data-warehouse

I understand the general concept of a surrogate key in a DWH environment.
But there are two aspects I don't understand and couldn't find information about:
Is it common practice that a surrogate key is unique in the whole DWH or unique in one Dimension?
If I have a Dimension with a hierarchy, does that hierarchy influence the generation of the surrogate key?

1) A surrogate key is unique to one row - it is used as a common handle for the relationships betweeen all the cells in a row. Due to how data is stored, a surrogate key is not strictly nescessary to infer the releationship between cells in a row. But if your row represents a countable identity (a row) in an entity ( a table), which would be the case if your database is normalized, then refering to one single surrogate key (usually the primary key) is easier than keeping a reference to all the participants in the primary key. Maintaing an index on one compact column is easier than on the whole row, for example.
In fact tables, surrogate keys have another application. Because data is often combined from many sources, chances are that you will run into the problem with composite primary keys (several columns combined are used identify each row uniquely), as well as the problem of duplicate business keys (the keys taken from the various source systems). Because surrogate keys are used for lookups, the compactness of it is important. Use an incrementing integer or a fixed length hash, and keep the business key from the source in a separte column.
2) This questions is difficult to answer because I dont know what software you are using to manage your dimensions and hierarchies. This influences things a lot. On a typical denormalized Kimball architecture, in a dimension table, a surrogate key is used to reference a unique row in the dimension table. In a dimesion table with several hierarhies, the meaning of this can be a bit confusing. The surrogate key will only be truly unique for the hierarchy with the highest cardinality (number of members), as it is this which will determine how many rows will be in the dimension table. So the practise is that the key is unique to the dimension table, AND exactly ONE of the hierarchies in it - the one with the highest number of members. If you add versioning of hierarchies (slowly changing dimensions) to this, the exact meaning of the surrogate key can be deceptive.
Note/Rant : I generally find the idea of multiple hierarchies in one dimension table quite apalling. True, it reduces the number of dimension references in the fact table, but there are drawbacks. There are several consequences to the denormalization of the dimension table, (the ugly duplication). One of them is the risk of double counting when joining on a dimension table. This is often fixed (or glossed over) by the software packages used, checking if values are the same and then summing them and reducing the count if the are. But this is a common source of counting anomalies and summing errors which can only be handled down the road by really dirty hacks. Of which I have see quite a few.

yes, A surrogate key is unique to one row - it is used as a common handle for the relationships betweeen all the cells in a row. Due to how data is stored, a surrogate key is not strictly nescessary to infer the releationship between cells in a row. But if your row represents a countable identity (a row) in an entity ( a table), which would be the case if your database is normalized, then refering to one single surrogate key (usually the primary key) is easier than keeping a reference to all the participants in the primary key. Maintaing an index on one compact column is easier than on the whole row, for example.

Related

Is A Table Linking a Dimension to a Fact table, a Dimension of a Fact?

In my incomplete view of BI tables, a Fact table represents action and a dimension an entity.
I have a FactOrder table that contains order information (including OrderId and CustomerId). There is a separate dimension for people who are actually connected to the order but are not customers. So they are saved in a separate table called DimServiceUser. A linking table connects an Order to a ServiceUser. Should this intermediate OrderServeruser table be defined as a Dimension, a Fact or another type?
It really is more of a bridge table. Here's what is going on.
Your FactOrder table is a fact table, but it also contains a degenerate dimension. A degenerate dimension acts as a dimension key in the fact table, however does not join to a corresponding dimension table because all its interesting attributes have already been placed in other analytic dimensions. So you have an implied DimOrder in there that didn't require a separate table.
A bridge table can connect a set of values to a single fact table row, or it can connect two dimensions (such as customers and bank accounts). It is a way of handling legitimate many-to-many relationships. A bridge table is like a factless fact table. But in dimensional modeling we do not join fact tables together, whereas it is acceptable to join bridge tables and fact tables together. If you must force your bridge table into being a fact or dimension, it is closer to a fact table. But doing so may make it easier to implement bad modeling habits in the future. If you can call it a bridge instead, I would just go with that. (Make sure you read that third link on "like a factless fact table". It was written by the author of Star Schema: The Complete Reference. That is a pretty well accepted source.)
As OrderService does not contain any facts/measurements therefore you cannot call it a fact table.
Dimension table:
A dimension table contains dimensions of a fact.
They are joined to fact table via a foreign key.
Dimension tables are de-normalized tables.
The Dimension Attributes are the various columns in a dimension table
Dimensions offers descriptive characteristics of the facts with the help of their attributes
No set limit set for given for number of dimensions
The dimension can also contain one or more hierarchical relationships
Based on the above definition of dimension table, I believe you table should be referred as dimension table.
OrderServeruser table should be prefix with "bridge".

How to populate fact table with Surrogate keys from dimensions

Could you please help understand how to populate fact table with Surrogate keys from dimensions.
I have the following fact table and dimensions:
ClaimFacts
ContractDim_SK
ClaimDim_SK
AccountingDim_SK
ClaimNbr
ClaimAmount
ContractDim
ContractDim_SK (PK)
ContractNbr(BK)
ReportingPeriod(BK)
Code
Name
AccountingDim
TransactionNbr(BK)
ReportingPeriod(PK)
TransactionCode
CurrencyCode
(Should I add ContractNbr here ?? original table in OLTP has it)
ClaimDim
CalimsDim_Sk(PK)
CalimNbr (BK)
ReportingPeriod(BK)
ClaimDesc
ClaimName
(Should I add ContractNbr here ?? original table in OLTP has it)
My logic to load data into fact table is the following :
First I load data into dimensions (with Surrogate keys are created as identity columns)
From transactional model (OLTP) the fact table will be filled with the measures (ClaimNbr And ClaimAmount)
I don’t know how to populate fact table with SKs of Dimensions, how to know where to put the key I am pulling from dimensions to which row in fact table (which key belongs to this claimNBR ?)
Should I add contract Nbr in all dimensions and join them together when loading keys to fact?
What’s the right approach to do this?
Please help,
Thank you
The way it usually works:
In your dimensions, you will have "Natural Keys" (aka "Business Keys") - keys that come from external systems. For example, Contract Number. Then you create synthetic (surrogat) keys for the table.
In your fact table, all keys initially must also be "Natural Keys". For example, Contract Number. Such keys must exist for each dimension that you want to connect to the fact table. Sometimes, a dimension might need several natural keys (collectively, they represent dimension table "Granularity" level). For example, Location might need State and City keys if modeled on State-City level.
Join your dim table to the fact table on natural keys, and from the result omit natural key from fact and select surrogat key from dim. I usually do a left join (fact left join dim), to control records that don't match. I also join dims one by one (to better control what's happening).
Basic example (using T-SQL). Let's say you have the following 2 tables:
Table Source.Sales
( Contract_BK,
Amount,
Quantity)
Table Dim.Contract
( Contract_SK,
Contract_BK,
Contract Type)
To Swap keys:
SELECT
c.Contract_SK
,s.Amount
,s.Quantity
INTO
Fact.Sales
FROM
Source.Sales s LEFT JOIN Dim.Contract c ON s.Contract_BK = c.Contract_BK
-- Test for missing keys
SELECT
*
FROM
Fact.Sale
WHERE
Contract_SK IS NULL

Star schema: how to handle dimension table with constantly changing set of columns?

First project using star schema, still in planning stage. We would appreciate any thoughts and advice on the following problem.
We have a dimension table for "product features used", and the set of features grows and changes over time. Because of the dynamic set of features, we think the features cannot be columns but instead must be rows.
We have a fact table for "user events", and we need to know which product features were used within each event.
So it seems we need to have a primary key on the fact table, which is used as a foreign key within the dimension table (exactly the opposite direction from a conventional star schema). We have several different dimension tables with similar dynamics and therefore a similar need for a foreign key into the fact table.
On the other hand, most of our dimension tables are more conventional and the fact table can just store a foreign key into these conventional dimension tables. We don't like that this means that some joins (many-to-one) will use the dimension table's primary key, but other joins (one-to-many) will use the fact table's primary key. We have considered using the fact table key as a foreign key in all the dimension tables, just for consistency, although the storage requirements increase.
Is there a better way to implement the keys for the "dynamic" dimension tables?
Here's an example that's not exactly what we're doing but similar:
Suppose our app searches for restaurants.
Optional features that a user may specify include price range, minimum star rating, or cuisine. The set of optional features changes over time (for example we may get rid of the option to specify cuisine, and add an option for most popular). For each search that is recorded in the database, the set of features used is fixed.
Each search will be a row in the fact table.
We are currently thinking that we should have a primary key in the fact table, and it should be used as a foreign key in the "features" dimension table. So we'd have:
fact_table(search_id, user_id, metric1, metric2)
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, user_attribute1, user_attribute2)
Alternatively, for consistent joins and ignoring storage requirements for the sake of argument, we could use the fact table's primary key as a foreign key in all the dimension tables:
fact_table(search_id, metric1, metric2) /* no more user_id */
feature_dimension_table(feature_id, search_id, feature_attribute1, feature_attribute2)
user_dimension_table(user_id, search_id, user_attribute1, user_attribute2)
What are the pitfalls with these key schemas? What would be better ways to do it?
You need a Bridge table, it is the recommended solution for many-to-many relationships between fact and dimension.
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/multivalued-dimension-bridge-table/
Edit after example added to question:
OK, maybe it is not a bridge, the example changes my view.
A fundamental requirement of dimensional modelling is to correctly identify the grain of your fact table. A common example is invoice and line-item, where the grain is usually line-item.
Hypothetical examples are often difficult because you can never be sure that the example mirrors the real use case, but I think that your scenario might be search-and-criteria, and that your grain should be at the criteria level.
For example, your fact table might look like this:
fact_search (date_id,time_id,search_id,criteria_id,criteria_value)
Thinking about the types of query I might want to do against search data, this design is my best choice. The only issue I see is with the data type of criteria_value, it would have to be a choice/text value, and would definitely be non-additive.

How are dimensions and fact tables related in a star diagram?

If you have a relational database and you want to start making reports, you might do the following (please let me know if this is incorrect).
Go through your relational database and make a list of all the columns that you want to include in your report.
Group related columns together and then split those (normalise) into additional tables. These are the dimensions.
The dimensions then have a primary key (possibly a combination of two rows), and the fact table has a foreign key to reference each dimension, plus fields that you don't separate out in the first place such as sales value.
The question:
I was originally seeing dimensions as data marts that referenced data from external sources, and a fact table that in turn referenced data in the dimensions.. that's incorrect, isn't it? It's the other way around...
Or in general, if you were to normalise a database you would always replace the columns you take out a table with a foreign key, and add a primary key to the new table?
A fact table represents a process or event that you want to analyze.
Step 1: What is the process or event that you want to analyze?
The columns in the fact table represent all of the variables that are pertinent to your analysis.
Step 2: What variables are pertinent to the analysis?
Whether you "split-out" columns into dimension tables is irrelevant to your understanding. It's an optimization to minimize the space taken up by fact tables.
If you want to discriminate between measures and dimensions, ask
Step 3: What are the (true) numeric values in my fact table? These are your measures.
An example of a true numeric value is a dollar amount, like Sales Order Line Item Extended Price. You can sum it up or take an average of it.
An example of a not true numeric value is Customer ID 12345. It's a number, but represents something that isn't a number (a customer). The sum of customer ids makes no sense, nor does the average. Dig?
Regarding your questions:
Fact tables do not need foreign keys to dimension tables. (hint: see Hot-Swappable Dimensions)
"dimensions as data marts that referenced data from external sources". Hm...maybe, but don't worry about data marts for now. A dimension is just a column in your fact table (that isn't a measure). A dimension table is just a collection of dimensions that are related.
Just start with Excel. Figure out the columns you need in your analysis. Put them in Excel. That's your fact table. If you expect your fact table to get large (100s of MB), then do ONE level of normalization:
Figure out your measures. Leave them in the fact table.
Figure out your dimensions. Group them together (Customer info into one group, Store info into another).
Put them in their own tables. Give them meaningless surrogate keys. Put those keys in the fact table.

How to create a fact table using natural keys

We've got a data warehouse design with four dimension tables and one fact table:
dimUser id, email, firstName, lastName
dimAddress id, city
dimLanguage id, language
dimDate id, startDate, endDate
factStatistic id, dimUserId, dimAddressId, dimLanguageId, dimDate, loginCount, pageCalledCount
Our problem is: We want to build the fact table which includes calculating the statistics (depending on userId, date range) and filling the foreign keys.
But we don't know how, because we don't understand how to use natural keys (which seems to be the solution to our problem according to the literature we read).
I believe a natural key would be the userId, which is needed in all ETL jobs which calculate the dimension data.
But there are many difficulties:
in the ETL jobs load(), we do bulk inserts with INSERT IGNORE INTO to remove duplicates => we don't know the surrogate keys which were generated
if we create meta data (including a set of dimension_name, surrogate_key, natural_key) this will not work because of the duplicate elimination
The problem seems to be the duplicate elimination strategy. Is there a better approach?
We are using MySQL 5.1, if it makes any difference.
If your fact table is tracking logins and page calls per user, then you should have set of source tables which track these things, which is where you'll load your fact table data from. I would probably build the fact table at the grain of one row per user / login date - or even lower to persist atomic data if at all possible.
Here you would then have a fact table with two dimensions - User and Date. You can persist address and language as dimensions on the fact as well, but these are really just attributes of user.
Your dimensions should have surrogate keys, but also should have the source "business" or "natural" key available - either as an attribute on the dimension itself, or through a mapping table as your colleague suggested. It's not "wrong" to use a mapping table - it does make things easier when there are multiple sources.
If you store the business keys on a mapping table, or in the dimension as an attribue, then for each row to load in the fact, it's a simple lookup (usually via a join) against the dim or mapping table to get the surrogate key for the user (and then from the user to get the user's "current" address / language to persist on the fact). The date dimension usually hase a surrogate key stored in a YYYYMMDD or other "natural" format - you can just generate this from the date information on your source record that you're loading into the fact.
do not force for single query, try to load the data in separated queries and mix the data in some provider...

Resources