I am creating a data warehouse following Kimball's approach, so I am aiming for the greatest possible re-use of dimension tables. Under this approach, how should I physically organize the dimension tables? I have been using databases to organize my data marts (i.e., one mart per database, with possibly multiple fact tables per mart). Since a given dimension can be used by multiple marts (which is exactly what I want to aim for), I don't know where I should put my dimension tables.
For example, should I put them all under a certain schema in a certain database (e.g., schema 'Dimension' under database 'Dimensions')? Or, perhaps, should I incrementally add them to each new database as I build out new data marts?
A data mart is a logical subset of your data warehouse, not a physical one. Your data warehouse should (under most circumstances) reside in a single database.
Traditional data warehouses often use separate databases to create application boundaries based on workload, domain, or security.
As an example, a traditional SQL Server data warehouse might include a staging database, a data warehouse database, and some data mart databases. In this topology, each database operates as a workload and security boundary in the architecture.
You can, for example, create a schema for an HR data mart and load all the related dimensions under it.
CREATE SCHEMA [HR]; -- name for the data mart HR
CREATE TABLE [HR].[DimEmployee] -- create dimensions related to data mart HR in the HR schema
( EmployeeSK BIGINT NOT NULL
, ...
);
I am quite new to data warehousing and just learning the subject. I read on the Internet that after the ETL process, DW data is stored in data marts for reasons such as ease of use. Each data mart can use a particular structure; let's say a data mart uses a star schema. Now my questions arise:
First of all, can a data mart use two structures, for instance, star and snowflake?
Assume we have two data marts, each using a star schema. Suppose that both of them have only one fact table and some dimension tables. The thing is, it turns out that some of the dimension tables in the first data mart are the same as those in the second one.
Considering that they are in different data marts, what should we do? Should we duplicate the tables in the different data marts?
What if the fact tables were in the same data mart? Should we duplicate dimension tables or just create a foreign key to the table we already have?
Snowflaking describes one or more objects in your model, so some parts of your model can be snowflaked and others not.
Data marts are logical groupings of your facts and dimensions, not physical ones. So you don't duplicate these tables; they can appear in as many data marts as necessary.
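As a rough sketch of that idea (SQL Server syntax; the schema and table names here are made up for illustration), two star-schema marts in the same database can reference one and the same physical dimension table:
CREATE SCHEMA sales;
GO
CREATE SCHEMA finance;
GO
CREATE TABLE dbo.dim_date
( date_sk       INT  NOT NULL PRIMARY KEY
, calendar_date DATE NOT NULL
);
-- Both fact tables point at the same dim_date; the dimension is shared, not duplicated.
CREATE TABLE sales.fact_orders
( order_sk      BIGINT NOT NULL
, order_date_sk INT    NOT NULL REFERENCES dbo.dim_date (date_sk)
);
CREATE TABLE finance.fact_invoices
( invoice_sk      BIGINT NOT NULL
, invoice_date_sk INT    NOT NULL REFERENCES dbo.dim_date (date_sk)
);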
Consider a data warehouse that contains one fact table and three dimension tables.
Fact table:
fact_orders
Dimension tables:
dim_user
dim_product
dim_date
All the data in these tables is extracted from our business systems.
In the business system, a user has many attributes, some of which can change over time (mobile, avatar_url, nick_name, status), while others won't change once the record is created (id, gender, register_channel).
So, generally, which fields should we use in the dim_user table, and why?
dim_user should have both changeable and unchangeable fields. In a denormalized model, it is preferable to keep all the related attributes of a dimension in a single table.
Also, it is preferable to keep all the available information about the user in the dimension table, as it might be needed for reporting purposes. Attributes that won't be needed for reporting can be skipped.
If you want to keep the history of changes to the user, consider implementing a slowly changing dimension (SCD Type 2). Otherwise, you can simply update the dimension attributes in place as they change, which is called SCD Type 1.
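For illustration only, a minimal sketch of such a dim_user table with SCD Type 2 history columns, in SQL Server syntax (the surrogate key and history columns are assumptions, not taken from the question):
CREATE TABLE dim_user
( user_sk          BIGINT       NOT NULL  -- surrogate key
, user_id          BIGINT       NOT NULL  -- business key from the source system
, gender           CHAR(1)          NULL  -- unchangeable attributes
, register_channel VARCHAR(50)      NULL
, mobile           VARCHAR(50)      NULL  -- changeable attributes
, avatar_url       VARCHAR(500)     NULL
, nick_name        VARCHAR(100)     NULL
, status           VARCHAR(20)      NULL
, valid_from       DATE         NOT NULL  -- SCD Type 2 history tracking
, valid_to         DATE         NOT NULL
, is_current       BIT          NOT NULL
);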
I wonder if it's possible to create logic that automatically creates a denormalized table and its data (and maintains it) from a specific SQL-like query.
Imagine a system where the user can maintain his data model and data. All data is stored in "relational" tables, but those tables are only used for the user to maintain his data. If he wants to display data on a webpage, he has to write a query (SQL) which will automatically be turned into a denormalized table and kept up to date when the relational data is updated or deleted.
Let's say I got a query like this:
select t1.a, t1.b from t1 where t1.c = 1
The logic will automatically create a denormalized table with a copy of the needed data according to the query. It's mostly like a view (I wonder if views would be more performant than my approach). Whenever this query (give it a name) is needed by some business logic, it will be replaced by a simple query on that new table.
Any update to t1 will look up all queries where t1 is involved and update the denormalized data automatically, but for a performance win it will only update the affected rows (in this example, just one row). That's the point where I'm not sure whether it's achievable in an automatic way. The example query is simple, but what if there are queries with joins, aggregation, or even subqueries?
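To make the idea concrete: on the relational side this is essentially what a materialized (indexed) view does. A minimal sketch in SQL Server syntax, assuming a hypothetical table t1 with columns a, b, and c; the engine then maintains the materialized rows automatically on every insert, update, or delete of t1:
CREATE TABLE dbo.t1 (id INT NOT NULL PRIMARY KEY, a INT, b INT, c INT);
GO
CREATE VIEW dbo.v_t1_filtered
WITH SCHEMABINDING  -- required for an indexed view
AS
SELECT id, a, b
FROM dbo.t1
WHERE c = 1;
GO
-- The unique clustered index materializes the view's result set.
CREATE UNIQUE CLUSTERED INDEX ix_v_t1_filtered ON dbo.v_t1_filtered (id);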
Does an approach like this exist in the NoSQL world, and can somebody share their experience with it?
I would also like to know whether creating one table per query conflicts with any best practices when using NoSQL databases.
I have an idea of how to handle simple queries: when data is updated, find the involved entity by its primary key and run the query on that specific entity again (so that joins are updated, too). But with aggregation and subqueries I don't really know how to determine which entries of the denormalized table are involved.
I am a DB newbie to the bitemporal world and had a naive question.
Say you have a master-satellite relationship between two tables - where the master stores essential information and the satellite stores information that is relevant to only few of the records of the master table. Example would be 'trade' as a master table and 'trade_support' as the satellite table where 'trade_support' will only house supporting information for non-electronic trades (which will be a small minority).
In a non-bitemporal landscape, we would model it as a parent-child relationship. My question is: in a bitemporal world, should such a use case still be modeled as a two-table parent-child relationship with four temporal columns on both tables? I don't see a reason why it can't be done, but the question of whether it should be done is quite hazy in my mind. Any gurus to help me out with the rationale behind the choice?
Pros:
Normalization
Cons:
Additional table and temporal columns to maintain and manage via DAOs
Defining performant join conditions
I believe this should be a pretty common use-case and wanted to know if there are any best practices that I can benefit from.
Thanks in advance!
Bitemporal data management and foreign keys can be quite tricky. For a master-satellite relationship between bitemporal tables, an "artificial key" needs to be introduced in the master table; it is not unique, but is identical across the different temporal or historical versions of an object. This key is referenced from the satellite. When joining the two tables, a bitemporal context (T_TIME, V_TIME), where T_TIME is the transaction time and V_TIME is the valid time, must be given. The join would be something like the following:
SELECT m.*, s.*
FROM master m
LEFT JOIN satellite s
  ON m.key = s.master_key
 AND <V_TIME> BETWEEN s.valid_from AND s.valid_to
 AND <T_TIME> BETWEEN s.t_from AND s.t_to
WHERE <V_TIME> BETWEEN m.valid_from AND m.valid_to
  AND <T_TIME> BETWEEN m.t_from AND m.t_to
In this query the valid period is given by the columns valid_from and valid_to, and the transaction period by the columns t_from and t_to, for both the master and the satellite table. The artificial key in the master is the column m.key, and the reference to this key is s.master_key. A left outer join is used so that entries of the master table with no corresponding entry in the satellite table are also retrieved.
As you have noted above, this join condition is likely to be slow.
On the other hand, this layout may be more space efficient: if only the master data (in table trade) or only the satellite data (in table trade_support) is updated, only a new entry in the respective table is required. With a single table for all data, a new entry spanning all columns of the combined table would be necessary. You will also end up with a table containing many null values.
So the question you are asking boils down to a trade-off between space requirements and concise code. The amount of space you are sacrificing with the single-table solution depends on the number of columns of your satellite table. I would probably go for the single-table solution, since it is much easier to understand.
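For illustration, a minimal sketch of what the combined single-table layout could look like, in generic SQL (all column names here are assumptions; the support columns are simply NULL for electronic trades):
CREATE TABLE trade_combined
( trade_key       BIGINT        NOT NULL  -- artificial key, shared by all temporal versions
, instrument      VARCHAR(50)   NOT NULL  -- essential trade attributes
, quantity        DECIMAL(18,4) NOT NULL
, support_desk    VARCHAR(50)       NULL  -- supporting attributes, NULL for electronic trades
, support_comment VARCHAR(500)      NULL
, valid_from      DATE          NOT NULL  -- valid-time period
, valid_to        DATE          NOT NULL
, t_from          TIMESTAMP     NOT NULL  -- transaction-time period
, t_to            TIMESTAMP     NOT NULL
);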
If you have any chance to switch database technology, a document-oriented database might make more sense. I have written a prototype of a bitemporal Scala layer based on MongoDB, which is available here:
https://github.com/1123/bitemporaldb
This will allow you to work without joins, and with a more flexible structure of your trade data.
I'm reading Ralph Kimball's book on data warehousing and dimensional modeling. I am reading one of the case studies, which is about dimensional modeling for an order system, where the requirement is to capture the order lifecycle, from order to fulfillment to shipment.
So, I was thinking that maybe they would suggest having multiple lines with a transaction-type FK to a transaction dimension. However, the book instead suggests creating 'role-playing' dimensions: multiple date dimension tables (one for order date, one for fulfillment date, and one for ship date). The fact table would then hold a foreign key to each of them, so it would have three date columns for this.
Isn't this kind of restrictive? Wouldn't a line per transaction be a better choice?
Design often involves trade-offs, and it's hard to know which design is best without a lot of detail about the entire system.
But my take on this: the design from the book, with three separate date columns, would likely speed up queries. Data warehouses are often denormalized like this to increase query performance, at the expense of simplicity and versatility of input.
That seems like a good answer to me: your line-per-transaction design sounds better for the data capture tables that store the day-to-day transactional data, but not as good for analysis.
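For reference, role-playing date dimensions are commonly implemented as a single physical date dimension exposed through one view per role, with the fact table carrying one foreign key column per role. A minimal sketch in generic SQL (table and column names are illustrative, not taken from the book):
CREATE TABLE dim_date
( date_sk       INT  NOT NULL PRIMARY KEY
, calendar_date DATE NOT NULL
);
-- One physical date dimension plays three roles via views.
CREATE VIEW dim_order_date AS
  SELECT date_sk AS order_date_sk, calendar_date AS order_date FROM dim_date;
CREATE VIEW dim_fulfillment_date AS
  SELECT date_sk AS fulfillment_date_sk, calendar_date AS fulfillment_date FROM dim_date;
CREATE VIEW dim_ship_date AS
  SELECT date_sk AS ship_date_sk, calendar_date AS ship_date FROM dim_date;
CREATE TABLE fact_orders
( order_number        BIGINT NOT NULL
, order_date_sk       INT NOT NULL REFERENCES dim_date (date_sk)  -- one date column per role
, fulfillment_date_sk INT     NULL REFERENCES dim_date (date_sk)
, ship_date_sk        INT     NULL REFERENCES dim_date (date_sk)
);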