The definition states
Hash join is used when projections of the joined tables are not
already sorted on the join columns. In this case, the optimizer builds
an in-memory hash table on the inner table's join column. The
optimizer then scans the outer table for matches to the hash table,
and joins data from the two tables accordingly.
So table A is hashed and table B not. So how can we compare a hashed value with a non-hashed value?
Hash tables (of any kind) do not make comparisons between hashed and unhashed values.
Hash tables (of any kind) use a hash of the key to determine the (small) bucket into which to place the actual key (here, the FK value) and the key's value (here, the row). The hash is used to deterministically, but evenly, evenly distribute entries across the buckets, with the goal of minimising the maximum number of entries in all buckets.
By keeping the bucket sizes (the number of entries per bucket) small by dynamically resizing the number of buckets, hash tables make for remarkably fast retrieval, because once the bucket is found, there are few entries to traverse in the entries found there to find the hit (or not).
Hash tables are so fast they are considered to have constant time operations - ie the time to find/add an entry is constant no matter how many entries (here, rows) are stored in it.
Related
I've read that dimension tables hold the primary key and and fact tables contain the foreign key which references the primary key of Dimension tables.
Now the confusion I am having is this - suppose I have an ETL pipeline which populates the dimension table (let's say customer) from a source (say another DB). Let's assume this is a frequently changing table and has over 200 columns. How do I incorporate these changes in the dimension tables? I want to have only the latest record for each customer (type 1 SCD) in the DWH.
One thing what I could do is delete the row in the dimension table and re-insert the new updated row. But this approach won't work because of the primary key - foreign key constraint (which will not allow me to delete the record).
Should I write an update statement with all 200 columns in the ETL script? Or is there any other approach?
Strictly speaking you just need to update the fields that changed. But the cost of updating all in a single record is probably similar (assuming it’s row based storage), and it’s probably easier to write.
You can’t delete and re-insert, as the new row will have a new PK and old facts will no longer be linked.
Could you please help understand how to populate fact table with Surrogate keys from dimensions.
I have the following fact table and dimensions:
ClaimFacts
ContractDim_SK
ClaimDim_SK
AccountingDim_SK
ClaimNbr
ClaimAmount
ContractDim
ContractDim_SK (PK)
ContractNbr(BK)
ReportingPeriod(BK)
Code
Name
AccountingDim
TransactionNbr(BK)
ReportingPeriod(PK)
TransactionCode
CurrencyCode
(Should I add ContractNbr here ?? original table in OLTP has it)
ClaimDim
CalimsDim_Sk(PK)
CalimNbr (BK)
ReportingPeriod(BK)
ClaimDesc
ClaimName
(Should I add ContractNbr here ?? original table in OLTP has it)
My logic to load data into fact table is the following :
First I load data into dimensions (with Surrogate keys are created as identity columns)
From transactional model (OLTP) the fact table will be filled with the measures (ClaimNbr And ClaimAmount)
I don’t know how to populate fact table with SKs of Dimensions, how to know where to put the key I am pulling from dimensions to which row in fact table (which key belongs to this claimNBR ?)
Should I add contract Nbr in all dimensions and join them together when loading keys to fact?
What’s the right approach to do this?
Please help,
Thank you
The way it usually works:
In your dimensions, you will have "Natural Keys" (aka "Business Keys") - keys that come from external systems. For example, Contract Number. Then you create synthetic (surrogat) keys for the table.
In your fact table, all keys initially must also be "Natural Keys". For example, Contract Number. Such keys must exist for each dimension that you want to connect to the fact table. Sometimes, a dimension might need several natural keys (collectively, they represent dimension table "Granularity" level). For example, Location might need State and City keys if modeled on State-City level.
Join your dim table to the fact table on natural keys, and from the result omit natural key from fact and select surrogat key from dim. I usually do a left join (fact left join dim), to control records that don't match. I also join dims one by one (to better control what's happening).
Basic example (using T-SQL). Let's say you have the following 2 tables:
Table Source.Sales
( Contract_BK,
Amount,
Quantity)
Table Dim.Contract
( Contract_SK,
Contract_BK,
Contract Type)
To Swap keys:
SELECT
c.Contract_SK
,s.Amount
,s.Quantity
INTO
Fact.Sales
FROM
Source.Sales s LEFT JOIN Dim.Contract c ON s.Contract_BK = c.Contract_BK
-- Test for missing keys
SELECT
*
FROM
Fact.Sale
WHERE
Contract_SK IS NULL
If you have a relational database and you want to start making reports, you might do the following (please let me know if this is incorrect).
Go through your relational database and make a list of all the columns that you want to include in your report.
Group related columns together and then split those (normalise) into additional tables. These are the dimensions.
The dimensions then have a primary key (possibly a combination of two rows), and the fact table has a foreign key to reference each dimension, plus fields that you don't separate out in the first place such as sales value.
The question:
I was originally seeing dimensions as data marts that referenced data from external sources, and a fact table that in turn referenced data in the dimensions.. that's incorrect, isn't it? It's the other way around...
Or in general, if you were to normalise a database you would always replace the columns you take out a table with a foreign key, and add a primary key to the new table?
A fact table represents a process or event that you want to analyze.
Step 1: What is the process or event that you want to analyze?
The columns in the fact table represent all of the variables that are pertinent to your analysis.
Step 2: What variables are pertinent to the analysis?
Whether you "split-out" columns into dimension tables is irrelevant to your understanding. It's an optimization to minimize the space taken up by fact tables.
If you want to discriminate between measures and dimensions, ask
Step 3: What are the (true) numeric values in my fact table? These are your measures.
An example of a true numeric value is a dollar amount, like Sales Order Line Item Extended Price. You can sum it up or take an average of it.
An example of a not true numeric value is Customer ID 12345. It's a number, but represents something that isn't a number (a customer). The sum of customer ids makes no sense, nor does the average. Dig?
Regarding your questions:
Fact tables do not need foreign keys to dimension tables. (hint: see Hot-Swappable Dimensions)
"dimensions as data marts that referenced data from external sources". Hm...maybe, but don't worry about data marts for now. A dimension is just a column in your fact table (that isn't a measure). A dimension table is just a collection of dimensions that are related.
Just start with Excel. Figure out the columns you need in your analysis. Put them in Excel. That's your fact table. If you expect your fact table to get large (100s of MB), then do ONE level of normalization:
Figure out your measures. Leave them in the fact table.
Figure out your dimensions. Group them together (Customer info into one group, Store info into another).
Put them in their own tables. Give them meaningless surrogate keys. Put those keys in the fact table.
I understand the general concept of a surrogate key in a DWH environment.
But there are two aspects I don't understand and couldn't find information about:
Is it common practice that a surrogate key is unique in the whole DWH or unique in one Dimension?
If I have a Dimension with a hierarchy, does that hierarchy influence the generation of the surrogate key?
1) A surrogate key is unique to one row - it is used as a common handle for the relationships betweeen all the cells in a row. Due to how data is stored, a surrogate key is not strictly nescessary to infer the releationship between cells in a row. But if your row represents a countable identity (a row) in an entity ( a table), which would be the case if your database is normalized, then refering to one single surrogate key (usually the primary key) is easier than keeping a reference to all the participants in the primary key. Maintaing an index on one compact column is easier than on the whole row, for example.
In fact tables, surrogate keys have another application. Because data is often combined from many sources, chances are that you will run into the problem with composite primary keys (several columns combined are used identify each row uniquely), as well as the problem of duplicate business keys (the keys taken from the various source systems). Because surrogate keys are used for lookups, the compactness of it is important. Use an incrementing integer or a fixed length hash, and keep the business key from the source in a separte column.
2) This questions is difficult to answer because I dont know what software you are using to manage your dimensions and hierarchies. This influences things a lot. On a typical denormalized Kimball architecture, in a dimension table, a surrogate key is used to reference a unique row in the dimension table. In a dimesion table with several hierarhies, the meaning of this can be a bit confusing. The surrogate key will only be truly unique for the hierarchy with the highest cardinality (number of members), as it is this which will determine how many rows will be in the dimension table. So the practise is that the key is unique to the dimension table, AND exactly ONE of the hierarchies in it - the one with the highest number of members. If you add versioning of hierarchies (slowly changing dimensions) to this, the exact meaning of the surrogate key can be deceptive.
Note/Rant : I generally find the idea of multiple hierarchies in one dimension table quite apalling. True, it reduces the number of dimension references in the fact table, but there are drawbacks. There are several consequences to the denormalization of the dimension table, (the ugly duplication). One of them is the risk of double counting when joining on a dimension table. This is often fixed (or glossed over) by the software packages used, checking if values are the same and then summing them and reducing the count if the are. But this is a common source of counting anomalies and summing errors which can only be handled down the road by really dirty hacks. Of which I have see quite a few.
yes, A surrogate key is unique to one row - it is used as a common handle for the relationships betweeen all the cells in a row. Due to how data is stored, a surrogate key is not strictly nescessary to infer the releationship between cells in a row. But if your row represents a countable identity (a row) in an entity ( a table), which would be the case if your database is normalized, then refering to one single surrogate key (usually the primary key) is easier than keeping a reference to all the participants in the primary key. Maintaing an index on one compact column is easier than on the whole row, for example.
We've got a data warehouse design with four dimension tables and one fact table:
dimUser id, email, firstName, lastName
dimAddress id, city
dimLanguage id, language
dimDate id, startDate, endDate
factStatistic id, dimUserId, dimAddressId, dimLanguageId, dimDate, loginCount, pageCalledCount
Our problem is: We want to build the fact table which includes calculating the statistics (depending on userId, date range) and filling the foreign keys.
But we don't know how, because we don't understand how to use natural keys (which seems to be the solution to our problem according to the literature we read).
I believe a natural key would be the userId, which is needed in all ETL jobs which calculate the dimension data.
But there are many difficulties:
in the ETL jobs load(), we do bulk inserts with INSERT IGNORE INTO to remove duplicates => we don't know the surrogate keys which were generated
if we create meta data (including a set of dimension_name, surrogate_key, natural_key) this will not work because of the duplicate elimination
The problem seems to be the duplicate elimination strategy. Is there a better approach?
We are using MySQL 5.1, if it makes any difference.
If your fact table is tracking logins and page calls per user, then you should have set of source tables which track these things, which is where you'll load your fact table data from. I would probably build the fact table at the grain of one row per user / login date - or even lower to persist atomic data if at all possible.
Here you would then have a fact table with two dimensions - User and Date. You can persist address and language as dimensions on the fact as well, but these are really just attributes of user.
Your dimensions should have surrogate keys, but also should have the source "business" or "natural" key available - either as an attribute on the dimension itself, or through a mapping table as your colleague suggested. It's not "wrong" to use a mapping table - it does make things easier when there are multiple sources.
If you store the business keys on a mapping table, or in the dimension as an attribue, then for each row to load in the fact, it's a simple lookup (usually via a join) against the dim or mapping table to get the surrogate key for the user (and then from the user to get the user's "current" address / language to persist on the fact). The date dimension usually hase a surrogate key stored in a YYYYMMDD or other "natural" format - you can just generate this from the date information on your source record that you're loading into the fact.
do not force for single query, try to load the data in separated queries and mix the data in some provider...