Surrogate keys in fact-less fact tables - data-warehouse

Why do you need surrogate keys in fact-less fact tables (or many-to-many dimensional relationship tables)?

A few circumstances in which assigning a surrogate key to the rows of a fact table is beneficial:
Sometimes the business rules of the organization legitimately allow multiple identical rows to exist for a fact table. Normally as a designer, you try to avoid this at all costs by searching the source system for some kind of transaction time stamp to make the rows unique. But occasionally you are forced to accept this undesirable input. In these situations it will be necessary to create a surrogate key for the fact table to allow the identical rows to be loaded.
Certain ETL techniques for updating fact rows are only feasible if a surrogate key is assigned to the fact rows. Specifically, one technique for loading updates to fact rows is to insert the rows to be updated as new rows, then delete the original rows as a second step, all within a single transaction. The advantages of this technique from an ETL perspective are improved load performance, improved recovery capability and improved audit capabilities. The surrogate key for the fact table rows is required because rows with identical primary keys (the combination of dimension keys) will often exist for the old and new versions of the updated fact rows between the insert of the updated row and the delete of the old row (a sketch follows this list).
A similar ETL requirement is to determine exactly where a load job was suspended, either to resume loading or to back out the job entirely. A sequentially assigned surrogate key makes this task straightforward.
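A minimal sketch of the insert-then-delete technique, assuming a hypothetical fact_sales table with a sequence-assigned surrogate key fact_key and a staging table stage_fact_updates that carries both the old and the newly assigned surrogate keys (generic SQL; all names are illustrative only):

    -- Step 1: insert the corrected versions of the affected rows as new rows.
    -- The dimension keys are temporarily duplicated, but fact_key stays unique.
    INSERT INTO fact_sales (fact_key, date_key, product_key, customer_key, sales_amount)
    SELECT u.new_fact_key, u.date_key, u.product_key, u.customer_key, u.corrected_amount
    FROM   stage_fact_updates AS u;

    -- Step 2: delete the superseded originals by their old surrogate keys,
    -- then commit both steps as one transaction.
    DELETE FROM fact_sales
    WHERE  fact_key IN (SELECT old_fact_key FROM stage_fact_updates);
    COMMIT;

    -- Restart/back-out support: a sequentially assigned surrogate key marks
    -- exactly how far a suspended load got.
    SELECT MAX(fact_key) AS last_loaded_key FROM fact_sales;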

Related

How to preserve Referential Integrity when initializing Key FACT\DIM tables

Where I work we initialize ("INIT" => truncate & load) a Fact\Dimension table on the rare occasions when the need arises.
Such an "INIT" requires all objects referencing the initialized object to be initialized subsequently in order to preserve RI (Referential Integrity).
Example: we found a bug in an attribute-rich (~25 attributes) SCD Dim_Employees which will change the effective dates of each record. This requires all referencing objects to recalculate their foreign keys.
Do you have the same situation and, if so, how do you manage it?
We have redimensioned our data multiple times as we changed from SCD2/SCD1 to SCD6 or found things wrong with one of the data streams, like you mention.
It's not too hard to remap your data. You just need to make sure that before you truncate your data in the INIT step you clone it (or clone within your time-travel window). Then you join your fact table to the old dimension on the old dimension key, and then join to your new dimension via the old dimension's foreign key and time; now you know the old-key-to-new-key mapping. That mapping becomes the source of your update, and if you do it in one operation with all other ETL operations paused, you have maintained your data integrity.
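A hedged sketch of that remapping, assuming the old dimension was cloned as dim_employees_old before the INIT, the dimension carries a durable employee_id plus effective dates, and the fact carries a transaction date (all names illustrative; UPDATE ... FROM syntax varies by platform):

    -- Build the old-key -> new-key mapping from the clone and the rebuilt dimension.
    CREATE TABLE key_remap AS
    SELECT o.dim_employee_key AS old_key,
           n.dim_employee_key AS new_key,
           n.effective_from,
           n.effective_to
    FROM   dim_employees_old AS o
    JOIN   dim_employees     AS n
           ON n.employee_id = o.employee_id;

    -- Repoint the fact rows, picking the new version that covers each fact date.
    UPDATE fact_timesheet
    SET    dim_employee_key = m.new_key
    FROM   key_remap AS m
    WHERE  fact_timesheet.dim_employee_key = m.old_key
      AND  fact_timesheet.transaction_date BETWEEN m.effective_from AND m.effective_to;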
If you cannot pause the live ETL processes, you will need a multi-step update where you have your new dimension table and the old one live at the same time, do your normal mapping to both, and fix your reads to coalesce the results. Then, once your new facts are correctly mapping to the new dimension, turn around and back-fill all the old facts with new-dimension (ND) keys. When you have no ND gaps you can stop the old-dimension/new-dimension (OD/ND) coalesce, then stop the OD mapping and drop the OD column.
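While both keys coexist, the coalesce described above can live in a view along these lines (a rough sketch; OD = old dimension key, ND = new dimension key, names hypothetical):

    -- Readers resolve whichever key is populated until the back-fill completes.
    CREATE VIEW v_fact_timesheet AS
    SELECT f.*,
           COALESCE(f.dim_employee_key_nd, m.new_key) AS dim_employee_key
    FROM   fact_timesheet AS f
    LEFT JOIN key_remap   AS m
           ON m.old_key = f.dim_employee_key_od;
    -- Once every fact row has an ND key, drop the coalesce, the OD mapping,
    -- and finally the dim_employee_key_od column.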

Merging without rewriting one table

I'm wondering about something that doesn't seem efficient to me.
I have 2 tables, one very large table DATA (millions of rows and hundreds of cols), with an id as primary key.
I then have another table, NEW_COL, with a variable number of rows (1 to millions) but always 2 cols: id, and new_col_name.
I want to update the first table, adding the new_data to it.
Of course, i know how to do it with a proc sql/left join, or a data step/merge.
Yet it seems inefficient: as far as I can tell from execution times (which may be wrong), these 2 ways of doing it rewrite the huge table completely, even when NEW_COL is only 1 row (almost 1 min).
I tried doing 2 SQL steps, with ALTER TABLE ADD COLUMN and then UPDATE, but it's waaaaaaaay too slow, as UPDATE with a join doesn't seem efficient at all.
So, is there an efficient way to "add a column" to an existing table WITHOUT rewriting this huge table ?
Thanks!
SAS datasets are row stores, not columnar stores like tables in some other databases. As such, adding rows is far easier and more efficient than adding columns. A key-joined view could be argued to be the most 'efficient' way to add a column to a data rectangle.
If you are adding columns so often that the 1 min resource incursion is a problem, you may need to upgrade hardware with faster drives, a less contentious operating environment, or more memory and SASFILE if the new columns are frequent yet temporary in nature.
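A key-joined view of the kind Richard mentions can be sketched in PROC SQL; nothing is rewritten because the join is evaluated only when the view is read (DATA, NEW_COL and new_col_name are the questioner's names; the view name is made up):

    proc sql;
      /* No copy of DATA is made; the left join runs each time the view is read. */
      create view data_plus_new as
        select a.*, b.new_col_name
        from DATA as a
             left join NEW_COL as b
               on a.id = b.id;
    quit;

The trade-off is that the join cost is paid on every read instead of once at load time.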
@Richard's answer is perfect. If you are adding columns on a regular basis then there is a problem with your design. You need to give more details on what you are doing so someone can suggest an alternative.
I would try a hash join; you can find code for a simple hash join. This is an efficient way of joining because in your case you have one large table and one small table; if the small table fits into memory, it is much better than a left join. I have done various joins this way and query run times were considerably less (on the order of 10x).
With the ALTER TABLE approach you are rewriting the table, and it also takes a lock on the table so nobody else can use it.
You should perform these joins when the workload is lighter, i.e. outside office hours; you may need to schedule the jobs at night, when more SAS resources are available.
Thanks for your answers guys.
To add some information: I don't have any constraints about table locking, load balancing or anything else, as it's a "project tool" script I use.
The goal, in a data prep step ('starting point data generator'), is to recompute an already existing variable, or to add a new one (less often, but still quite regularly). Thus, I just don't want to lose time waiting for the whole table to be rewritten when I only need to update one variable for specific rows.
When I monitor the server, the computation of the data and the joining step are very fast. But when I want to update only 1 row, I see the whole table being rewritten. Seems a waste of resources to me.
But it seems it's a mandatory step, so can't do much about it.
Too bad.

Fact table design guidance for 100s of facts

I'm trying to create a data mart for a healthcare application. The facts in the data mart are basically going to be measurements and findings related to the heart, and we have hundreds of them: starting from 1,000 and going up to as many as 20,000 per exam type.
I'm wondering what my design choices for the fact tables are:
Grain: 1 row per patient per exam type.
Some of the choices that I can think of -
1) A big wide fact table with 1000 or more columns.
2) EAV based design - A separate Measure dimension table. Its foreign key will go into the fact table and the measure value will be in the fact table. So the grain of the fact table will be changed to 1 row per patient per exam type per measurement.
3) Create smaller multiple fact tables per exam type per some other criteria like subgroup. But the end user is going to query across subgroups for that exam type and fact-fact join is not recommended.
4) any other ideas?
Any inputs would be appreciated.
1. A big wide fact table with 1000 or more columns.
One very wide fact table gives the end user maximum flexibility if queries are executed directly against the data warehouse. However, some considerations should be taken into account, as you might hit some limits depending on the platform.
SQL Server 2014 limits are as follows:
Bytes per row: 8,060. Row-overflow storage might be a solution; however, it supports only a few column types that are typically not suited to fact measures, i.e. varchar, nvarchar, varbinary, sql_variant, and it is not supported in In-Memory OLTP. https://technet.microsoft.com/en-us/library/ms186981(v=sql.105).aspx
Columns per non-wide table: 1,024. Wide tables and sparse columns are a solution, as the limit for columns per wide table is 30,000 (see the sketch after this list); however, the same bytes-per-row limit applies. https://technet.microsoft.com/en-us/library/cc280604(v=sql.120).aspx
Columns per SELECT/INSERT/UPDATE statement: 4,096
Non-clustered indexes per table: 999
https://technet.microsoft.com/en-us/library/ms143432(v=sql.120).aspx
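For reference, a wide table in SQL Server is an ordinary table whose optional measures are declared SPARSE plus a column set; a rough sketch with hypothetical measure names:

    -- Sparse columns use no storage when NULL, which is what raises the column
    -- limit to 30,000, but the 8,060 bytes-per-row limit still applies to the
    -- columns that are actually populated in a given row.
    CREATE TABLE dbo.FactExamWide (
        exam_key     INT          NOT NULL PRIMARY KEY,
        patient_key  INT          NOT NULL,
        hdl_mg_dl    DECIMAL(9,2) SPARSE NULL,
        ldl_mg_dl    DECIMAL(9,2) SPARSE NULL,
        -- ... hundreds more measurement columns ...
        all_measures XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
    );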
2. EAV based design - A separate Measure dimension table. This foreign key will go into the fact table and the measure value will be in fact table. So the grain of the fact table will be changed to 1 row per patient per exam type per measurement.
According to Kimball, the EAV design is called fact normalization. It may make sense when the number of measurements is extremely large but sparsely populated for a given fact, and no computations are made between facts.
Because the facts are normalized:
Extensibility is very easy, i.e. it's easy to add new measurements without the need to amend the data structure.
It's good to extract all measurements for one exam and present measurements as rows on the screen.
It's hard to extract/aggregate/compute across several measurements (e.g. the average HDL to CHOL ratio) and present measurements/aggregates/computations as columns, i.e. it requires complex WHERE/PIVOT logic or multi-joins; SQL makes it difficult to compute between facts in different rows (a sketch of such a pivot follows this list).
If the primary end-user platform is an OLAP cube then fact normalization makes sense, as cubes allow computations across any dimension.
Data importing could be an issue if the data arrives in a flat CSV format.
This question is also discussed here: Should I use EAV model?
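To make the pivoting point concrete, here is a hedged sketch of computing a ratio between two measurements at the normalized grain, with hypothetical table names and measure codes:

    -- Grain: 1 row per patient per exam per measurement, so computing
    -- HDL / total cholesterol per exam means turning rows into columns first.
    SELECT f.patient_key,
           f.exam_key,
           MAX(CASE WHEN m.measure_code = 'HDL'  THEN f.measure_value END) AS hdl,
           MAX(CASE WHEN m.measure_code = 'CHOL' THEN f.measure_value END) AS chol,
           MAX(CASE WHEN m.measure_code = 'HDL'  THEN f.measure_value END)
         / NULLIF(MAX(CASE WHEN m.measure_code = 'CHOL' THEN f.measure_value END), 0)
                                                                           AS hdl_chol_ratio
    FROM   fact_exam_measurement AS f
    JOIN   dim_measure           AS m ON m.measure_key = f.measure_key
    GROUP BY f.patient_key, f.exam_key;

With the wide design the same ratio is a simple expression over two columns of one row.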
3) Create smaller multiple fact tables per exam type per some other criteria like subgroup. But the end user is going to query across subgroups for that exam type and fact-fact join is not recommended.
In some scenarios multiple smaller fact tables make perfect sense. One reason is hitting a physical limit set by the platform, e.g. bytes per row.
The facts could be grouped either by subject area, e.g. measurement group/subgroup, or by frequency of usage. Each table could be placed on a separate file group and drive to maximize I/O.
Further, you could duplicate measurements across different fact tables to reduce the need for fact-table joins, i.e. put one measurement both in a specific measurement-subgroup fact table and in a frequently used measurements fact table.
However some considerations should be taken into account if there are some specific requirements for data loading. For example, if a record errors out in your ETL to one fact table, you might want to make sure that the corresponding records in the other fact tables are deleted and staged to your error table so you don't end up with any bogus information. This is especially true if end users have their own calculations in the front end tool.
If you use OLAP cubes, then with multiple fact tables each fact table becomes the source of its own measure group.
In terms of fact-to-fact joins, you (the BI application) should never issue SQL that joins two fact tables together across the fact tables' foreign keys. Instead, the technique of drilling across two fact tables should be used, where the answer sets from two or more fact tables are created separately, and the results sort-merged on the common row-header attribute values to produce the correct result.
More on this topic: http://www.kimballgroup.com/2003/04/the-soul-of-the-data-warehouse-part-two-drilling-across/
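A hedged sketch of drilling across, assuming two fact tables that share a conformed patient dimension (table and column names are illustrative):

    -- Build one answer set per fact table at the same grain, then sort-merge
    -- them on the common row header instead of joining the fact tables directly.
    WITH exam_counts AS (
        SELECT patient_key, COUNT(*) AS exams
        FROM   fact_exam
        GROUP BY patient_key
    ),
    lab_counts AS (
        SELECT patient_key, COUNT(*) AS lab_results
        FROM   fact_lab_result
        GROUP BY patient_key
    )
    SELECT COALESCE(e.patient_key, l.patient_key) AS patient_key,
           e.exams,
           l.lab_results
    FROM   exam_counts AS e
    FULL OUTER JOIN lab_counts AS l
           ON l.patient_key = e.patient_key;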
4) any other ideas?
SQL XML or some kind of NoSQL store could be an option, but the same querying / aggregation / computation / presentation issues exist.

Table Normalization with no Domain values

There is a debate between our ETL team and a Data Modeler on whether a table should be normalized or not, and I was hoping to get some perspective from the online community.
Currently the tables are set up as such
MainTable            LookupTable
  PrimaryKey (PK)      Code (PK)
  Code (FK)            Name
  OtherColumns
Both tables are only being populated by a periodic file (from a 3rd party) through an ETL job.
A single record in the file contains all attributes in both tables for a single row.
The file populating these tables is a delta (only rows with some change in them are in the file).
One change to one attribute for one record (again only by the 3rd party) will result in all the data for that record being in the file.
The domain values for Code and Name are not known.
Question: Should the LookupTable be denormalized into MainTable?
ETL team: Yes. With this setup, every row from the file will first have to check the 2nd table to see if its FK is in there (insert if it is not), then add the MainTable row. More code, worse performance, and yes, slightly more space. However, regardless of a change to a LookupTable.Name from the 3rd party, the periodic file will reflect every row affected, and we will still have to parse through each row. If lumped into MainTable, all it is is a simple update or insert.
Data Modeler: This is standard good database design.
Any thoughts?
Build prototypes. Make measurements.
You started with this, which your data modeler says is a standard good database design.
MainTable            LookupTable
  PrimaryKey (PK)      Code (PK)
  Code (FK)            Name
  OtherColumns
He's right. But this, too, is a good database design.
MainTable
PrimaryKey (PK)
Name
OtherColumns
If all updates to these tables come only from the ETL job, you don't need to be terribly concerned about enforcing data integrity through foreign keys. The ETL job would add new names to the lookup table anyway, regardless of what their values happen to be. Data integrity depends mainly on the system the data is extracted from. (And the quality of the ETL job.)
With this setup, every row from the file will first have to check the 2nd table to see if its FK is in there (insert if it is not), then add the MainTable row.
If they're doing row-by-row processing, hire new ETL guys. Seriously.
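For what it's worth, maintaining the lookup table does not need a per-row probe; assuming the file is bulk-loaded into a staging table first, one set-based statement per load covers it (names hypothetical):

    -- Add any codes the lookup table has not seen yet, in a single statement.
    INSERT INTO LookupTable (Code, Name)
    SELECT DISTINCT s.Code, s.Name
    FROM   Staging AS s
    WHERE  NOT EXISTS (SELECT 1 FROM LookupTable AS lt WHERE lt.Code = s.Code);
    -- MainTable is then inserted/updated from the same staging set with a
    -- MERGE or an UPDATE-plus-INSERT pair, depending on the platform.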
More Code, Worse Performance, and yes slightly more space.
They'll need a little more code to update two tables instead of one. How long does it take to write the SQL statements? How long to run them? (How long each way?)
Worse performance? Maybe. Maybe not. If you use a fixed-width code, like an integer or char(3), updates to the codes won't affect the width of the row. And since the codes are shorter than the names, more rows might fit in a page. (It doesn't make any sense to use a code that is longer than the name.) More rows per page usually means less I/O.
Less space, surely. Because you're storing a short code instead of a long name in every row of "MainTable".
For example, the average length of a country name is about 11.4 characters. If you used 3-character ISO country codes, you'd save an average of 8.4 bytes per row in "MainTable". For 100 million rows, you save about 840 million bytes. The size of that lookup table is negligible, about 6k.
And you don't usually need a join to get the full name; country codes are intended to be human-readable without expansion.

Deletion of rows from Informix Database

I have around 3 Million rows in a Table in Informix DB.
We have to delete them before loading new data.
It has a primary key on one of its columns.
For deleting the same, I thought of going with rowid usage. But when I tried
select rowid from table
it responded with -857 error [Rowid does not exist].
So, I am not sure how to go about the deletion. I prefer not to go with the primary key, as deletion by primary key is costly compared with rowid deletion.
Any suggestion on the above would be helpful.
If you get error -857, the chances are that the table is fragmented, and was created without the WITH ROWIDS option.
Which version of Informix are you using, and on which platform?
The chances are high that you have the TRUNCATE TABLE statement, which is designed to drop all the rows from a table very quickly indeed.
Failing that, you can use a straight-forward:
DELETE FROM TableName;
as long as you have sufficient logical log space available. If that won't work, then you'll need to do repeated DELETE statements based on ranges of the primary key (or any other convenient column).
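A rough sketch of deleting in primary-key ranges so that each transaction stays within the available logical-log space (assuming an integer primary key id covering roughly 1 to 3,000,000; the slice size is arbitrary):

    -- Delete in slices, committing after each one; drive the ranges from a
    -- script or stored procedure rather than writing them out by hand.
    DELETE FROM TableName WHERE id >= 1      AND id < 100000;
    DELETE FROM TableName WHERE id >= 100000 AND id < 200000;
    -- ... repeat until the top of the key range ...
    DELETE FROM TableName WHERE id >= 2900000;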
Or you could consider dropping the table and then creating it afresh, possibly with the WITH ROWIDS clause (though I would not particularly recommend using the WITH ROWIDS clause: it becomes a physical column with an index instead of being a virtual column as it is in a non-fragmented table). One of the downsides of dropping and rebuilding a table is that the referential constraints have to be reinstated, and any views built on the table are automatically dropped when the table is dropped, so they have to be reinstated too.
I'm assuming this is IDS? How many new rows will be loaded, and how often is this process repeated? Despite having to re-establish referential constraints and views, in my opinion it is much better to drop the table, create it from scratch, load the data and then create the indexes, because if you just delete all the rows, the deleted rows still remain physically in the table with a NULL \0 flag at the end of the row; the table size will therefore be even larger when loading in the new rows, and performance will suffer. It's also a good opportunity to create fresh indexes and, if possible, pre-sort the load data so that it's in the most desirable order (as when creating a CLUSTERED INDEX). If you're going to fragment your tables on expressions or another scheme, then ROWIDs go out the window, but use WITH ROWIDS if you're sure the table will never be fragmented. If your table has a serial column, are there any other tables using the serial column as a foreign key?
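A sketch of the drop-and-rebuild sequence in Informix, with made-up table and column names; LOAD is a dbaccess statement, and any foreign keys and views dropped along with the table must be recreated afterwards:

    DROP TABLE TableName;                       -- dependent views are dropped too

    CREATE TABLE TableName (
        id        SERIAL,
        some_col  INTEGER,
        other_col VARCHAR(50),
        PRIMARY KEY (id)
    ) WITH ROWIDS;                              -- only if the table stays non-fragmented

    LOAD FROM 'new_data.unl' INSERT INTO TableName;  -- pre-sorted extract if possible

    CREATE INDEX ix_tablename_other ON TableName (other_col);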
