How are common dimension tables shared between fact tables in star-schema data marts? - data-warehouse

I am quite new to DW and I am just learning the stuff. I read on the Internet that after the ETL process, DW data is stored in data marts for reasons such as ease of use. Each data mart can use its own structure. Let's say a data mart uses a star structure. Now my questions arise:
First of all, can a data mart use two structures, for instance, star and snowflake?
Assume we have two data marts that both use a star structure. Suppose each of them has only one fact table and some dimension tables. The thing is, it turns out that some of the dimension tables in the first data mart are the same as those in the second one.
Given that they are in different data marts, what should we do? Should we duplicate the tables across the data marts?
What if the fact tables were in the same data mart? Should we duplicate dimension tables or just create a foreign key to the table we already have?

Snowflake describes one or more objects in your model, so some parts of your model could be snowflaked and others not.
Data marts are logical groupings of your facts and dimensions, not physical ones. So you don't duplicate these tables, and they can appear in as many data marts as necessary.
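As a concrete sketch of that idea (the schema, table, and column names here are hypothetical, and the [Shared], [Sales], and [Finance] schemas are assumed to already exist), a single physical dimension table can be referenced by fact tables belonging to two different marts:
CREATE TABLE [Shared].[DimCustomer] -- one physical conformed dimension
( CustomerSK BIGINT NOT NULL PRIMARY KEY
, CustomerName NVARCHAR(100) NOT NULL
);
CREATE TABLE [Sales].[FactOrders] -- fact table in the "Sales" mart
( CustomerSK BIGINT NOT NULL REFERENCES [Shared].[DimCustomer] (CustomerSK)
, OrderAmount DECIMAL(18,2) NOT NULL
);
CREATE TABLE [Finance].[FactInvoices] -- the "Finance" mart reuses the same dimension
( CustomerSK BIGINT NOT NULL REFERENCES [Shared].[DimCustomer] (CustomerSK)
, InvoiceAmount DECIMAL(18,2) NOT NULL
);
Both marts query the same DimCustomer rows; the "mart" is just the logical grouping, not a second copy of the dimension.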

Related

Where to put dimensions in a Kimball data warehouse

I am creating a data warehouse following Kimball's theory. So, I am aiming for the greatest amount of dimension table re-use. In this theory, how should I physically organize the dimension tables? I have used databases to organize my data marts (i.e., 1 mart per database, with possibly multiple fact tables per mart). Since a given dimension can be used by multiple marts (which is what I want to aim for), I don't know where I should put my dimension tables.
For example, should I put them all under a certain schema in a certain database (e.g., schema 'Dimension' under database 'Dimensions')? Or, perhaps, should I incrementally add them to each new database as I build out new data marts?
A data mart is a logical subset of your data warehouse, not a physical one. Your data warehouse should (under most circumstances) reside in a single database.
Traditional data warehouses often use separate databases to create application boundaries based on either workload, domain or security.
As an example, a traditional SQL Server data warehouse might include a staging database, a data warehouse database, and some data mart databases. In this topology, each database operates as a workload and security boundary in the architecture.
You can, for example, create a schema for an HR data mart and load all the related dimensions under it:
CREATE SCHEMA [HR]; -- name for the data mart HR
CREATE TABLE [HR].[DimEmployee] -- create dimensions related to data mart HR in the HR schema
( EmployeeSK BIGINT NOT NULL
, ...
);

How deep to go when denormalising

I am denormalising an OLTP database for use in a DWH.
At the moment I am denormalising studygroups.
Each studygroup has a key pointing towards 1 project.
Each project has a key pointing towards 1 department.
Each department has a key pointing towards 1 university.
Each university has a key pointing to 1 city.
Now I know that you are supposed to denormalize the sh*t out of your OLTP, but in this DWH, department will be a dimension on its own. This goes for university as well. Would it suffice to add a key from studygroup pointing at department, or is it wiser to denormalize as far as you can and add all attributes from department, plus all attributes from its M:1 related tables, to the studygroup dimension? Even when department and university will be dimensions by themselves?
In other words: how far/deep do you go when denormalizing?
The key concept behind a dimensional model is:
Keep your fact tables in 3NF (third normal form);
De-normalize your dimensions into 2NF (second normal form)
So ideally, the only joins you should have in your model are the joins between fact tables and relevant dimensions.
As part of this philosophy:
Avoid "snow flake" designs, where dimensions contain keys to other dimensions. It's always possible to come up with a data model that allows the same functionality as the snow flakes, without violating 3NF/2NF rule;
Never have any direct joins between 2 separate dimensions (i.e, department and study group) directly. All relations among dimensions must be resolved via fact tables;
Never have any direct joins between 2 separate fact tables. Any relations among fact tables must be resolved via shared dimensions.
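To make the "fact-to-dimension joins only" rule concrete, a typical star query joins the fact table to its dimensions and to nothing else (table and column names here are hypothetical):
SELECT d.CalendarYear
, c.CustomerSegment
, SUM(f.SalesAmount) AS TotalSales
FROM FactSales f
JOIN DimDate d ON f.OrderDateSK = d.DateSK -- fact-to-dimension join
JOIN DimCustomer c ON f.CustomerSK = c.CustomerSK -- fact-to-dimension join
GROUP BY d.CalendarYear, c.CustomerSegment;
There is no dimension-to-dimension or fact-to-fact join anywhere in the query.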
Finally, consider that dimensional design, besides optimizing the data for querying, serves a second important purpose: it's a semantic model of the business (or whatever else it represents). So, when making decisions about combining data elements into dimensions and facts, consider their "logical affinity" - they should make intuitive sense to the end users. If you have a hard time explaining to a BI analyst the meaning of your dimension or fact table, most likely you've made a modeling mistake.
For example, in your case you should consider the logical relations between universities, departments, study groups, etc. It's very likely that University/Department form a natural hierarchy. If so, they should belong to the same dimension. Study group, on the other hand, might not - let's assume it's possible to form study groups across multiple universities and/or multiple departments. Such Many:Many relations are a clear indication that they should be resolved via fact tables. In addition, relations between universities and departments are stable (they rarely change), while study groups are formed and dissolved very often, and thus should be modeled separately.
In general, if you see 1:1 or 1:M relations between dimensional elements, it's often an indication that they should be de-normalized into the same table (again, only if their combination makes logical sense). If the relations are M:M, most likely they belong to different tables (you can force them into the same table, but often such tables look like Frankenstein creatures).
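For instance (an illustrative sketch based on the question's tables, with hypothetical column names), the stable 1:M chain from department up to university and city could be collapsed into a single dimension:
CREATE TABLE DimDepartment
( DepartmentSK BIGINT NOT NULL PRIMARY KEY -- surrogate key
, DepartmentName VARCHAR(100) NOT NULL
, UniversityName VARCHAR(100) NOT NULL -- de-normalized from university
, UniversityCity VARCHAR(100) NOT NULL -- de-normalized via university -> city
);
Studygroup, with its M:M behavior and frequent changes, would stay out of this table and relate to it through a fact table.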
You can get much better help by making your question more specific - draw your dimensional model, post it, and ask for specific issues/challenges you have. For general concepts, books from Kimball and Inmon are your best friends.

How are dimensions and fact tables related in a star diagram?

If you have a relational database and you want to start making reports, you might do the following (please let me know if this is incorrect).
Go through your relational database and make a list of all the columns that you want to include in your report.
Group related columns together and then split those out (normalise) into additional tables. These are the dimensions.
The dimensions then have a primary key (possibly a combination of two columns), and the fact table has a foreign key referencing each dimension, plus the fields that you don't separate out in the first place, such as sales value.
The question:
I was originally seeing dimensions as data marts that referenced data from external sources, and a fact table that in turn referenced data in the dimensions. That's incorrect, isn't it? It's the other way around...
Or, in general, if you were to normalise a database, would you always replace the columns you take out of a table with a foreign key, and add a primary key to the new table?
A fact table represents a process or event that you want to analyze.
Step 1: What is the process or event that you want to analyze?
The columns in the fact table represent all of the variables that are pertinent to your analysis.
Step 2: What variables are pertinent to the analysis?
Whether you "split out" columns into dimension tables is irrelevant to your understanding. It's an optimization to minimize the space taken up by fact tables.
If you want to discriminate between measures and dimensions, ask
Step 3: What are the (true) numeric values in my fact table? These are your measures.
An example of a true numeric value is a dollar amount, like Sales Order Line Item Extended Price. You can sum it up or take an average of it.
An example of a not true numeric value is Customer ID 12345. It's a number, but represents something that isn't a number (a customer). The sum of customer ids makes no sense, nor does the average. Dig?
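In SQL terms (FactOrderLine and its columns are hypothetical names):
SELECT SUM(ExtendedPrice) FROM FactOrderLine; -- meaningful: ExtendedPrice is a true numeric measure
SELECT SUM(CustomerID) FROM FactOrderLine; -- valid SQL, but meaningless: CustomerID is a label, not a quantity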
Regarding your questions:
Fact tables do not need foreign keys to dimension tables. (hint: see Hot-Swappable Dimensions)
"dimensions as data marts that referenced data from external sources". Hm...maybe, but don't worry about data marts for now. A dimension is just a column in your fact table (that isn't a measure). A dimension table is just a collection of dimensions that are related.
Just start with Excel. Figure out the columns you need in your analysis. Put them in Excel. That's your fact table. If you expect your fact table to get large (100s of MB), then do ONE level of normalization:
Figure out your measures. Leave them in the fact table.
Figure out your dimensions. Group them together (Customer info into one group, Store info into another).
Put them in their own tables. Give them meaningless surrogate keys. Put those keys in the fact table.
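A minimal sketch of that one level of normalization, with hypothetical names:
CREATE TABLE DimCustomer
( CustomerSK BIGINT NOT NULL PRIMARY KEY -- meaningless surrogate key
, CustomerName VARCHAR(100)
, CustomerRegion VARCHAR(50)
);
CREATE TABLE DimStore
( StoreSK BIGINT NOT NULL PRIMARY KEY -- meaningless surrogate key
, StoreName VARCHAR(100)
);
CREATE TABLE FactSales
( CustomerSK BIGINT NOT NULL REFERENCES DimCustomer (CustomerSK)
, StoreSK BIGINT NOT NULL REFERENCES DimStore (StoreSK)
, SalesAmount DECIMAL(18,2) NOT NULL -- the measure stays in the fact table
);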

Bitemporal master-satellite relationship

I am a DB newbie to the bitemporal world and had a naive question.
Say you have a master-satellite relationship between two tables, where the master stores essential information and the satellite stores information that is relevant to only a few of the records in the master table. An example would be 'trade' as the master table and 'trade_support' as the satellite table, where 'trade_support' only houses supporting information for non-electronic trades (which will be a small minority).
In a non-bitemporal landscape, we would model it as a parent-child relationship. My question is: in a bitemporal world, should such a use case be still modeled as a two-table parent-child relationship with 4 temporal columns on both tables? I don't see a reason why it can't be done, but the question of "should it be done" is quite hazy in my mind. Any gurus to help me out with the rationale behind the choice?
Pros:
Normalization
Cons:
Additional table and temporal columns to maintain and manage via DAOs
Defining performant join conditions
I believe this should be a pretty common use-case and wanted to know if there are any best practices that I can benefit from.
Thanks in advance!
Bitemporal data management and foreign keys can be quite tricky. For a master-satellite relationship between bitemporal tables, an "artificial key" needs to be introduced in the master table; it is not unique, but identical across the different temporal or historical versions of an object. This key is referenced from the satellite. When joining the two tables, a bitemporal context (T_TIME, V_TIME), where T_TIME is the transaction time and V_TIME is the valid time, must be given for the join. The join would be something like the following:
SELECT m.*, s.*
FROM master m
LEFT JOIN satellite s
ON m.key = s.master_key -- artificial key shared by all versions
AND <V_TIME> between s.valid_from and s.valid_to -- valid-time filter on the satellite
AND <T_TIME> between s.t_from and s.t_to -- transaction-time filter on the satellite
WHERE <V_TIME> between m.valid_from and m.valid_to -- valid-time filter on the master
AND <T_TIME> between m.t_from and m.t_to -- transaction-time filter on the master
In this query the valid period is given by the columns valid_from and valid_to and the transaction period is given by the columns t_from and t_to for both the master and the satellite table. The artificial key in the master is given by the column m.key and the reference to this key by s.master_key. A left outer join is used to retrieve also those entries of the master table for which there is no corresponding entry in the satellite table.
As you have noted above, this join condition is likely to be slow.
On the other hand, this layout may be more space efficient: if only the master data (in table trade) or only the satellite data (in table trade_support) is updated, this requires a new entry in the respective table only. When using one table for all data, a new entry for all columns of the combined table would be necessary. You will also end up with a table with many null values.
So the question you are asking boils down to a trade-off between space requirements and concise code. The amount of space you are sacrificing with the single-table solution depends on the number of columns of your satellite table. I would probably go for the single-table solution, since it is much easier to understand.
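For comparison, a single-table layout (a sketch with hypothetical columns) carries the four temporal columns once, at the cost of nullable satellite columns and wider rows:
CREATE TABLE trade_combined
( trade_key BIGINT NOT NULL -- artificial key shared by all versions of a trade
, valid_from DATE NOT NULL -- valid time
, valid_to DATE NOT NULL
, t_from TIMESTAMP NOT NULL -- transaction time
, t_to TIMESTAMP NOT NULL
, quantity DECIMAL(18,4) NOT NULL -- a column from trade (master)
, support_contact VARCHAR(100) NULL -- columns from trade_support (satellite)
, confirmation_ref VARCHAR(50) NULL -- NULL for electronic trades
);
Any update, whether to a master or a satellite attribute, now closes the current version and inserts a full new row.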
If you have any chance to switch database technology, a document-oriented database might make more sense. I have written a prototype of a bitemporal Scala layer based on MongoDB, which is available here:
https://github.com/1123/bitemporaldb
This will allow you to work without joins, and with a more flexible structure of your trade data.

Ralph Kimball's Data Warehouse Toolkit book - Order lifecycle mart design

I'm reading Ralph Kimball's book on data warehousing and dimensional modeling. One of the case studies is about dimensional modeling for an order system, where the requirement is to capture the order lifecycle, from order to fulfillment to shipped.
So, I was thinking that maybe they would suggest having multiple lines with a transaction-type FK to a transaction dimension. However, the book instead suggests creating 'role-playing' dimensions: multiple date dimension tables (one for order date, one for fulfillment, and one for shipped). Each of them would then have a foreign key into the fact table, and therefore the fact table would have three columns to relate to them.
Isn't this kind of restrictive? Wouldn't a line per transaction be a better choice?
Design often involves trade offs, and it's hard to know what design is best without a lot of details on the entire system.
But my take on this: the design from the book, with three separate date columns, would likely speed up queries. Data warehouses are often denormalized like this to increase query performance, at the expense of simplicity and versatility of input.
Seems like a good answer to me: your line-per-transaction design sounds better for the data capture tables that store the day-to-day transactional data, but not as good for analysis.
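A common way to implement role-playing dates (a sketch, assuming a single physical DimDate; all names here are hypothetical) is one date dimension, three foreign keys in the fact table, and a view per role:
CREATE TABLE DimDate
( DateSK INT NOT NULL PRIMARY KEY
, CalendarDate DATE NOT NULL
);
CREATE TABLE FactOrders
( OrderDateSK INT NOT NULL REFERENCES DimDate (DateSK) -- role: order date
, FulfillmentDateSK INT NOT NULL REFERENCES DimDate (DateSK) -- role: fulfillment date
, ShipDateSK INT NOT NULL REFERENCES DimDate (DateSK) -- role: ship date
, OrderAmount DECIMAL(18,2) NOT NULL
);
-- Views give each role its own name without storing the calendar three times
CREATE VIEW DimOrderDate AS SELECT DateSK, CalendarDate FROM DimDate;
CREATE VIEW DimShipDate AS SELECT DateSK, CalendarDate FROM DimDate;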
