Data warehouse design

I am going to design a data warehouse (although it's not an easy process). I am wondering how, throughout the ETL process, the data in the data warehouse gets extracted/transformed into a data mart.
Are there any model design differences between a data warehouse and a data mart? Is it usually a star schema or a snowflake schema? Should we place the tables like the following?
In Datawarehouse
dim_tableA
dim_tableB
fact_tableA
fact_tableB
And in Datamart A
dim_tableA (full copy from datawarehouse)
fact_tableA (full copy from datawarehouse)
And in Datamart B
dim_tableB (full copy from datawarehouse)
fact_tableB (full copy from datawarehouse)
Is there a real-life example that can demonstrate the model difference between a data warehouse and a data mart?
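For example, I imagine the load from the warehouse into Data Mart A could be little more than a copy of the tables it needs; the schema names below are just my own illustration:
-- populate Data Mart A from the warehouse tables it needs
insert into datamart_a.dim_tableA
select * from datawarehouse.dim_tableA;
insert into datamart_a.fact_tableA
select * from datawarehouse.fact_tableA;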

I echo both of Nick's responses, and in a more technical way, following the Kimball methodology:
In my opinion and experience, at a high level we have data marts like Service Analytics, Financial Analytics, Sales Analytics, Marketing Analytics, Customer Analytics, etc. These are grouped as below:
Subject Areas -> Logical grouping (star modelling) -> Data Marts -> Dimensions & Facts (as per Kimball)
Example:
AP Real Time -> Supplier, Supplier Transactions, GL Data -> Financial Analytics + Customer Analytics -> Physical Tables

Data marts contain repositories of summarized data collected for analysis on a specific section or unit within an organization, for example, the sales department. ... A data warehouse is a large centralized repository of data that contains information from many sources within an organization.
Depending on their needs, companies can use multiple data marts for different departments and opt for data mart consolidation by merging different marts to build a single data warehouse later. This approach is called the Kimball Dimensional Design Method. Another method, called The Inmon Approach, is to first design a data warehouse and then create multiple data marts for particular services as needed.
An example: in a data warehouse, email clicks are recorded based on a click date, with the email address being just one of the click parameters. For a CRM expert, the email address (or any other customer identifier) will be the entry point: for each contact, the frequency of clicks, the date of the last click, etc.
The data mart is a prism that adapts the data to the user, so its success depends a lot on the way the data is organized. The more understandable it is to the user, the better the result. This is why the name of each field, and the way it is calculated, must stay as close as possible to how the business actually uses it.
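A rough sketch of that example (table and column names are invented for illustration): the warehouse keeps one fact row per click, while the CRM-facing data mart re-aggregates the same facts per contact:
-- warehouse side: one row per click, keyed by click date
-- dw.fact_email_click(click_date, email_address, campaign_id, ...)
-- CRM data mart side: one row per contact
create view mart_crm.contact_click_summary as
select
    email_address,
    count(*)        as click_count,
    max(click_date) as last_click_date
from dw.fact_email_click
group by email_address;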

Related

How to standardize city names inserted by user

I need to write a small ETL pipeline because I need to move some data from a source database to a target database (a data warehouse) to perform some analysis on the data.
Among that data, I need to clean and conform city names. Cities are inserted manually by international users; consequently, for a single city I can have multiple names (for example, London or Londra).
In my source database I have not only big cities but also small villages.
Well, if I do not standardize city names, our analysis could be nonsensical.
What is the best practice to standardize city names in my target database? Do you have any ideas or suggestions?
Thank you
The only reliable way to do this is to use commercial address validation software - preferably in your source system when the data is being created but it could be integrated into your data pipeline processes.
Assuming you can't afford/justify the use of commercial software, the only other solution is to create your own translation table i.e. a table that holds the values that are entered and what value you want them to be translated to.
While you can build this table based on historic data, there will always be new values that are not in the table, so you would need a process to identify these, add the new records to your translation data and then fix the affected records. You would also need to accept that there would be uncleansed data in your warehouse for a period of time after each data load.
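A minimal sketch of that translation-table approach (all table and column names are assumptions, not from any particular tool):
-- translation table: raw value entered by users -> standardized value
create table city_translation (
    raw_city      varchar(100) primary key,
    standard_city varchar(100) not null
);
insert into city_translation values ('Londra', 'London'), ('Londres', 'London');
-- load step: use the standardized name where we have one, keep the raw value otherwise
select s.*, coalesce(t.standard_city, s.city) as city_clean
from source.customers s
left join city_translation t on t.raw_city = s.city;
-- maintenance step: list new values that still need a translation
select distinct s.city
from source.customers s
left join city_translation t on t.raw_city = s.city
where t.raw_city is null;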

Does a data warehouse need to satisfy 2NF or another normal form?

I'm investigating data warehouses and I have a question about star schemas.
It's in
Oracle® OLAP Application Developer's Guide
10g Release 1 (10.1)
3.2.1 Dimension Table: TIME_DIM
https://docs.oracle.com/cd/B13789_01/olap.101/b10333/global.htm#CHDCGABE
To represent the hierarchy MONTH -> QUARTER -> YEAR, we need some keys such as: YEAR_ID, QUARTER_ID. But there are some things that I do not understand:
1) Why do we need the fields YEAR_DSC & QUARTER_DSC? I think we could look these values up from YEAR & QUARTER tables, and it breaks 2NF.
2) What is the normal form that a schema in data warehouse needs to satisfy? (1NF, 2NF, 3NF, or any.)
NFs (normal forms) don't matter for data warehouse base tables.
We normalize to reduce certain kinds of redundancy so that when we update a database we don't have to say the same thing in multiple places, and so that we can't accidentally fail to say the same thing everywhere it needs to be said. That is not a problem in query results, because we are not updating them. The same is true for a data warehouse's base tables (which are themselves just queries on the original database's base tables).
Data warehouses are usually optimized for reading speed, and that usually means some denormalization compared to the original database, to avoid recomputation at the expense of space. (Notice, though, that sometimes rereading something bigger can be slower than reading smaller parts and recomputing the big thing.) We probably don't want to drop normalized tables when moving to a data warehouse, because they answer simple queries and we don't want to slow those down by recomputing them. Other than those tradeoffs, there's no reason not to denormalize. Some particular warehouse design methods have their own rules about which parts should be denormalized, and by how much.
(Whatever our original database design NF is chosen to be, we should always first normalize to 5NF then consciously denormalize. We don't need to normalize or know constraints to update or query a database.)
Read some textbook basics on why we normalize & why we use data warehouses.
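To make the trade-off concrete, here is a rough sketch of a denormalized time dimension in the spirit of the Oracle example (this is not the actual Oracle DDL; names and types are assumptions). The quarter and year descriptions are stored on every month row instead of being looked up from separate QUARTER and YEAR tables, which is deliberate redundancy that lets a report group by year or quarter without extra joins:
create table time_dim (
    month_id    integer primary key,
    month_dsc   varchar(20),
    quarter_id  integer,
    quarter_dsc varchar(20),  -- repeated for every month in the quarter
    year_id     integer,
    year_dsc    varchar(20)   -- repeated for every month in the year
);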

How to handle denormalization in a BI data model

As far as I know, normalization is done to avoid inconsistency in the database.
By normalizing we:
reduce data redundancy, and
protect data integrity.
That's why most OLTP databases are in 3NF.
Different OLTP databases come together in a data warehouse (DWH, OLAP). DWHs are denormalized (1NF), and it is obvious it has to be like that, because the main table of a DWH has hundreds of columns.
From that DWH we can build several data marts that we would later use for doing analysis with a BI reporting tool (Cognos, QlikView, BO .. )
The problem is that the data model for the BI report is not normalized.
Couldn't that be a problem for redundancy and data integrity for the report?
In OLAP systems (such as data warehouses), the key efficiency needs are in querying and data retrieval.
Therefore, some design decisions are made in order to retrieve the information faster, even if updates may be slower.
An example of such a model is a star schema, in which we denormalize data in such a way that all the data is stored within a one-join-hop distance.
Key elements such as transactions are located in the big table (the fact table), with foreign keys pointing at the dimensions.
The dimensions themselves are smaller, and may contain non-normalized data. For example, an address dimension may store street, neighborhood and city data without normalizing it to 3NF.
There are certainly redundancy issues (you don't really have to store Day_of_Week for each date row), but they are insignificant, since storage is not a bottleneck in this scenario.
As for integrity - you face it only on updates (e.g. a less realistic scenario of a country change for a State_Province in Dim_Store), and in a DWH an update is a rare case, where we allow ourselves to be inefficient.
Moreover - integrity is not enforced by the DB (or normalization) but by design and implementation of the ETL process.
Read more on data warehouse modeling.
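A sketch of what the one-join-hop claim looks like in practice (table and column names are assumptions): every attribute a report needs, including the non-normalized address fields, is one join away from the fact table:
select d.city, d.neighborhood, c.year_dsc, sum(f.amount) as sales_amount
from fact_sales f
join dim_store    d on d.store_key    = f.store_key     -- street/neighborhood/city stored flat in the dimension
join dim_calendar c on c.calendar_key = f.calendar_key  -- day_of_week, year_dsc stored per date row
group by d.city, d.neighborhood, c.year_dsc;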
Regarding redundancy: some data warehouse engines, like Amazon Redshift, allow data compression that is very handy for denormalization. Let's say you have a table of sales events with 100M records, and every sale has a city. In an OLTP data model, you would have sales and cities tables with a city_id connecting them. In an OLAP data model with compression enabled, it's much easier to have a sales table with a text city attribute, compressed. You'll be able to calculate sales by city without joining tables, and your city values won't occupy much disk space because they will be encoded.
More info about compression is in Amazon docs: Choosing a Column Compression Type
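For illustration, a minimal Redshift-style sketch of that compressed text attribute (the column sizes and the choice of encoding are assumptions):
create table sales (
    sale_id   bigint,
    sale_date date,
    city      varchar(80) encode zstd,  -- repeated city names compress well
    amount    decimal(12,2)
);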
Regarding data integrity: you have to design your ETL routines to minimize the possibility of duplicate data and also run scheduled checks for duplicates based on criteria like this:
-- total row count
select count(*) from table;
-- count of distinct combinations of the key columns
select count(*) from (select distinct <expression> from table);
where <expression> is a list of columns whose combination should be unique (your artificial primary key). If the two counts differ, you have duplicates.

What is the difference between a Data Mart and a DSS (Decision Support System)?

I am trying to understand the difference between a data mart and a DSS.
When I checked information on the Internet about DSS vs. DWH, I found this:
"A data warehouse is often the component that stores data for a DSS."
The problem is that, as far as I know, the DWH is also the component that stores data for a data mart.
So, what is the difference between a DSS and a data mart?
Thanks in advance, Enrique
A more appropriate question would be: what is similar between a data mart and a DSS?
A data mart is a subject-oriented set of related tables where you have one fact table (transactions) and multiple dimension tables (categories). Example: a sales data mart. Fact table (salesID, agentID, categoryID, dateID, amount, quantity). Dimension Agent (AgentID, AgentName, AgentType, etc.).
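A rough DDL sketch of that sales data mart (column names follow the example above; data types are assumptions):
create table dim_agent (
    agentID   integer primary key,
    agentName varchar(100),
    agentType varchar(50)
);
create table fact_sales (
    salesID    bigint primary key,
    agentID    integer references dim_agent(agentID),
    categoryID integer,  -- would reference a category dimension
    dateID     integer,  -- would reference a date dimension
    amount     decimal(12,2),
    quantity   integer
);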
A data warehouse (it is a database) is a centralised repository of aggregated data from one or multiple sources, aimed at serving reporting purposes. It is usually denormalized. It could be based on data marts, or on one logical data model in third normal form.
A DSS is an information system; it is not a database nor a single entity. It relies on data, but it also has its own model and user interface. The model is critical for the decision recommendation engine.
What may have led you to misunderstand is that some DSSs sit on top of DWHs, specifically on Kimball-style (data mart based) DWHs.

Data Warehouse: when is the cleaning and transforming performed?

I am reading a book "Modeling the agile data warehouse with data vault" by H. Hultgren. He states:
EDW represents what did happen - not what should have happened
When is the cleaning and possible transforming performed? By transforming I mean standardization of the values (for example, the sex column should contain only two possible values, 'f' and 'm', and not 'female' or 'male' or 0 or 1).
If you are importing data through ETL, that is one place to do it. Or you can use some other kind of data cleansing tool. This is a very general question. It depends on the architecture of your data warehouse.
For example you might have a data warehouse that loads data and tries to automatically clean it or you might have an architecture where every single 'bad' record goes to an approval area to be cleaned by a person. I can assure you in the real world, no business user wants to have to pick from 6 values for gender.
The other thing is you might be loading data from three different systems, and these three different representations are completely valid in each system, but an end user doesn't want to have to pick from 6 choices - they want the data to be cleansed.
I'm thinking maybe this statement
EDW represents what did happen - not what should have happened
is a data vault specific thing, since DV is all about modelling and storing the source system data no matter how the schema changes. I guess in this case you would treat the data vault as an ODS and preserve the data as-is, then cleanse it on the way into the reporting star schema.
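As a sketch of that "cleanse on the way into the reporting star schema" step (the mapping rules, source table and column names are invented for illustration):
-- standardize the sex attribute while loading the reporting dimension
insert into star.dim_customer (customer_id, customer_name, sex)
select
    customer_id,
    customer_name,
    case
        when lower(sex) in ('f', 'female') then 'f'
        when lower(sex) in ('m', 'male')   then 'm'
        when sex = '0'                     then 'f'  -- source-specific codes: assumed mapping
        when sex = '1'                     then 'm'
        else 'u'                                     -- unknown, flagged for review
    end as sex
from vault.customer_details_current;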
