New to Data Warehousing and Data Marts - data-warehouse

I'm completely new to Data Warehousing and Marts and wanted to ask for some advice on the best resources to learn and gain knowledge to start me off on the right path. I have a project to work on but need some guidance or somewhere to start really.
The problem is I've been given a matter of weeks to create a small mart with Fact and Dims, and then need to write stored procs for a GUI to feed in and out of it. I need to know how to create a scripted SCD; I have a basic idea of how it works and can use MERGE scripts.

I assume you're looking for free resources.
You can start with Kimball Dimensional Modeling Techniques. Check also SCD not easy as 1, 2, 3 and Slowly Changing Dimensions.
Since you need to design them, Fact tables and Dimension tables would be helpful as well.
Do you have any concrete assignments? If so, you can show what you've done so far.
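To make the scripted SCD idea concrete, here is a minimal Type 2 sketch in Python using the standard-library sqlite3 module. The table and column names (dim_customer, customer_id, city, valid_from, valid_to, is_current) are hypothetical and only for illustration. On SQL Server the same expire-then-insert logic could be expressed with a MERGE script instead of a row-by-row loop.

```python
import sqlite3

# Hypothetical dimension: one row per version of a customer.
DDL = """
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id  TEXT NOT NULL,                      -- business key from the source
    city         TEXT NOT NULL,                      -- tracked attribute
    valid_from   TEXT NOT NULL,
    valid_to     TEXT,                               -- NULL while the row is current
    is_current   INTEGER NOT NULL
)
"""

def apply_scd2(conn, rows, load_date):
    """Apply (customer_id, city) source rows as a Type 2 change set."""
    for customer_id, city in rows:
        current = conn.execute(
            "SELECT customer_key, city FROM dim_customer "
            "WHERE customer_id = ? AND is_current = 1",
            (customer_id,),
        ).fetchone()
        if current is not None and current[1] == city:
            continue  # nothing changed, keep the current version
        if current is not None:
            # Expire the old version instead of overwriting it.
            conn.execute(
                "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
                "WHERE customer_key = ?",
                (load_date, current[0]),
            )
        # Insert the new (or first) version as the current row.
        conn.execute(
            "INSERT INTO dim_customer "
            "(customer_id, city, valid_from, valid_to, is_current) "
            "VALUES (?, ?, ?, NULL, 1)",
            (customer_id, city, load_date),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
apply_scd2(conn, [("C1", "Leeds"), ("C2", "York")], "2024-01-01")
apply_scd2(conn, [("C1", "Bristol"), ("C2", "York")], "2024-02-01")

history = conn.execute(
    "SELECT customer_id, city, valid_from, valid_to, is_current "
    "FROM dim_customer ORDER BY customer_key"
).fetchall()
```

After the second load, C1 has two rows (an expired one and a current one) while the unchanged C2 keeps its single current row.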

Related

EDW Kimball vs Inmon

I've been tasked with coming up with a recommendation of how to proceed with an EDW and am looking for clarification on what I'm seeing. Everything that I am learning about states that Kimball's approach will bring value quicker to the business vs Inmon's. I get that Kimball's approach is a dimensional model from the get-go and different data marts (star schema) are integrated through conformed dimensions... thus the theory is I can simply come up with my immediate DM to solve a business need and go on from there.
What I'm learning states that Inmon's model suggests that I have an EDW designed in 3NF. The EDW is not defined by source system but instead by the structure of the business, Corporate Factory (Orders, HR, etc.). So data from disparate systems maps into this structure. Once the data is in this form, ETLs are then created to produce DMs.
Personally I feel Inmon's approach is a better way. I believe this way is going to ensure that data is going to be consistent and it feels like you can do more with this data. What holds me back with this approach though is everything I'm reading says it's going to take much more time to deliver something but I'm not seeing how that is true. From my narrow view, it feels like no matter what the end result is we need a DM. Regardless of using Kimball's or Inmon's approach the end result is the same.
So then the question becomes how do we get there? In Kimball's approach we will create ETLs to some staging location and generally from there create a DM. In Inmon's approach I feel we just add in another layer... that is, from the staging area we load this data into another database in 3NF organized by function. What I'm missing is how this step adds so much time.
I feel I can look at the end DM that needs to be made, map it back to a DW in 3NF, and then, as more DMs are requested, keep building up the DW in 3NF with more and more data. However, if I create a DM in Kimball's model, that DM is going to be built around the level of grain decided for that DM. What if the next DM requested wants reporting at an even deeper grain? To me it feels like Kimball's methodology would take more work, while with Inmon's it doesn't matter: I have everything at the transactional level, so if DMs of varying grains are requested, I have the data. I just ETL it to a DM, and all DMs will report the same since they are sourced from the same data.
I dunno... just looking for others' views. Everything I read says Kimball's is quicker... I say sure, maybe a little bit, but there is certainly a cost attached to going the quicker route. And for the sake of argument... let's say it takes a week to get a DM up and running through Kimball's methodology... to me it feels like it should only take 10% maybe 20% longer utilizing Inmon's.
If anyone has any real world experience with the different models and whether one really takes so much longer than the other... please share. Or if I have this so backwards tell me that too!
For context: I look after a 3 billion record data warehouse for a large multi-national. Our data makes its way from the various source systems through staging and into a 3NF db. From here our ELT processes move the data into a dimensionally modelled, star schema db.
If I could start again I would definitely drop the 3NF step. When I first built that layer I thought it would add real value. I felt sure that normalisation would protect the integrity of my data. I was equally confident the 3NF db would be the best place to run large/complex queries.
But in practice, it has slowed our development. Most changes require an update to the stage, 3NF and star schema db.
The extra layer also increases the amount of time it takes to publish our data. The additional transformations, checks and reconciliations all add up.
The promised improvement in integrity never materialised. I realise now that because I control the ETL, and the validation processes within, I can ensure my data is both denormalised and accurate. In reporting data we control every cell in every table. The more I think about that, the more I see it as a real opportunity.
The need for large and complex queries was another myth busted by experience. I now see the need to write complex reporting queries as a failing of my star db. When this occurs I always ask myself: why isn't this question easy to answer? The answer is most often bad table design. The heavy lifting is best carried out when transforming the data.
Running a 3NF and star also creates an opportunity for the two systems to disagree. When this happens it is often a very subtle difference. Neither is wrong, per se. Instead, it is possible the 3NF and star query are asking slightly different questions, and therefore returning different results. Although technically correct, this can be hard to explain. Even minor and explainable differences can erode confidence, over time.
In defence of our 3NF db, it does make loading into the star easier. But I would happily trade more complex SSIS packages for one less layer.
Having said all of this, it is very hard to recommend an approach to anyone without a deep understanding of their systems, requirements, culture, skills, etc. Having read your question I am sure you have wrestled with all these issues, and many more no doubt! In the end, only you can decide what the best approach for your situation is. Once you've made your mind up, stick with it. Consistency, clarity and a well-defined methodology are more important than anything else.
Dimensions and measures are a well proven method for presenting and simplifying data to end users.
If you present end users with a schema based on the source system (3NF) and with a dimensionally modelled star schema (Kimball), they will be able to make much more sense of the dimensionally modelled one.
I've never really looked into an Inmon decision support system but to me it seems to be just the ODS portion of a full data warehouse.
You are right in saying "The EDW is not defined by source system but instead the structure of the business". A star schema reflects this but an ODS (a copy of the source system) doesn't.
A star schema takes longer to build than just an ODS but gives many benefits, including:
Slowly changing dimensions can track changes over time
Denormalisation simplifies joins and improves performance
Surrogate keys allow you to disconnect from source systems
Conformed dimensions let you report across business units (e.g. profit per headcount)
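As an illustration of the conformed-dimension benefit, the sketch below computes profit per headcount from two separate fact tables that share one business-unit dimension. It uses Python's sqlite3 module, and every table and column name (dim_business_unit, fact_profit, fact_headcount) is a hypothetical stand-in. Each fact is aggregated to the dimension's grain first and only then joined, so the two facts cannot multiply each other's rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_business_unit (
    bu_key  INTEGER PRIMARY KEY,   -- surrogate key shared by both facts
    bu_code TEXT NOT NULL,         -- business key from the source systems
    bu_name TEXT NOT NULL
);
CREATE TABLE fact_profit    (bu_key INTEGER, profit REAL);        -- e.g. from finance
CREATE TABLE fact_headcount (bu_key INTEGER, headcount INTEGER);  -- e.g. from HR

INSERT INTO dim_business_unit VALUES (1, 'RET', 'Retail'), (2, 'WHS', 'Wholesale');
INSERT INTO fact_profit    VALUES (1, 600.0), (1, 400.0), (2, 900.0);
INSERT INTO fact_headcount VALUES (1, 10), (2, 30);
""")

# Aggregate each fact separately, then join both results on the shared key.
rows = conn.execute("""
WITH p AS (SELECT bu_key, SUM(profit) AS profit FROM fact_profit GROUP BY bu_key),
     h AS (SELECT bu_key, SUM(headcount) AS headcount FROM fact_headcount GROUP BY bu_key)
SELECT d.bu_name, p.profit / h.headcount AS profit_per_head
FROM dim_business_unit AS d
JOIN p ON p.bu_key = d.bu_key
JOIN h ON h.bu_key = d.bu_key
ORDER BY d.bu_name
""").fetchall()
```

Because both facts carry the same bu_key, the report needs no knowledge of how finance or HR identify a business unit in their own systems.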
If your Inmon 3NF database is not just an ODS (replica of source systems), but some kind of actual business model then you have two layers to model: the 3NF layer and the star schema layer.
It's difficult nowadays to sell the benefit of even one layer of data modelling when everyone thinks they can just do it all in a 'self service' tool (which I believe is a fallacy)! Your system should be no more complicated than it needs to be, because all that complexity adds up to maintenance, and that's the real issue: introducing changes 12 months into the build when you have to change many layers.
To paraphrase #destination-data: your source system to star schema transformation (and separation) is already achieved through ETL, so the 3NF seems redundant to me. You design your star schema to be independent from source systems by correctly implementing surrogate keys and business keys, and modelling it on the business, not on the source system.
With ETL and back-end data wrangling taking up about 70% of the project time for this kind of endeavour, an extra layer makes a big difference. It's an extra layer of transforming from source to target, to agree with the business and to test. It all adds up.
Whilst I'm not saying that dimensional models (the Kimball kind) are always easy to change, you've got a whole lot more inflexibility should you have to always change lots of layers when you want to change your BI.
In fact, where I've been consulting in places that have data warehouses that are considered to be inflexible and expensive to develop for, and not keeping pace with changes to the business, they have without exception included the 3NF layer prior to the DMs. As Nick mentioned, it is hard nowadays to sell the idea of a 'proper' data warehouse as opposed to a Data Discovery BI tool, and the appeal of these is often driven by DWs being seen to be slow and expensive to develop.
Kimball isn't against having a 3NF layer prior to his DW if it makes sense for a situation, he just doesn't agree with Inmon that there's a point.
One common misunderstanding is that Kimball proposes distinct data marts, so that you'd have to change them each time there is a different reporting request. Instead, Kimball's DMs are based on real life business processes and modelled accordingly. Although it's true you will then try to make them suitable for reporting, you try to make them so they can answer foreseeable queries. You don't aggregate and store just the aggregates: you work with the transactional data in a Kimball dimensional model.
So no need to be reluctant from that perspective.
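The point about grain can be sketched in a few lines of Python with sqlite3. The schema (dim_date, fact_sales) and the helper sales_by are hypothetical. The fact table is kept at transaction level, and coarser views are derived by grouping on dimension attributes rather than by storing separate aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
-- Grain: one row per sales transaction line, nothing pre-aggregated.
CREATE TABLE fact_sales (date_key INTEGER, product TEXT, quantity INTEGER, amount REAL);
INSERT INTO dim_date VALUES (1, '2024-01-30', '2024-01'), (2, '2024-01-31', '2024-01'),
                            (3, '2024-02-01', '2024-02');
INSERT INTO fact_sales VALUES (1, 'Widget', 2, 20.0), (1, 'Gadget', 1, 15.0),
                              (2, 'Widget', 1, 10.0), (3, 'Widget', 4, 40.0);
""")

def sales_by(conn, level):
    """Roll the transactional fact up to a coarser level of dim_date."""
    if level not in ("day", "month"):  # column names cannot be bound as parameters
        raise ValueError(level)
    return conn.execute(
        f"SELECT d.{level}, SUM(f.amount) FROM fact_sales AS f "
        f"JOIN dim_date AS d ON d.date_key = f.date_key "
        f"GROUP BY d.{level} ORDER BY d.{level}"
    ).fetchall()

daily = sales_by(conn, "day")
monthly = sales_by(conn, "month")
```

A later request for a finer or coarser report then becomes a different GROUP BY over the same rows, not a rebuild of the mart.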
If an ODS works for you, then go for it, but a Kimball DW will meet the majority of requirements.

Landing Zone vs Staging tables

I am building a small DWH in SQL Server. We have 6 source tables that we have to combine into a single BASE table based on a given logic.
My question is: should I start by creating 6 LZ tables (one for each of the 6 source tables) to land the data on the system, then combine these 6 LZ tables into 1 Staging table, and finally move the data from the Staging table to the Base table?
My first thought was to create 6 Staging tables (instead of LZ tables) and then combine these 6 to form the base table. However, I decided against it based on my understanding that the structure of LZ tables should match the source tables and that the structure of the Staging table should reflect the base table.
Which alternative should be pursued in this case? What are the pros and cons?
Please share your thoughts.
Thanks
Honestly, I don't see one single answer to this question. It really depends on various factors - frequency of extract, source system availability, complexity of transformations, data lineage requirements, etc.
I would start by creating one staging table per source system table/entity. If you are using an ETL tool to do the ETL process, most ETL tools are pretty good at doing simple to complex transformations "on the fly" (in memory). I have used SSIS extensively and it handles most transformations well.
You can sometimes end up with some other tables in the staging area if your transformations have very complex business rules. It helps in debugging in the sense that you can see the data before, during and after transformations. But as I said, that really depends on the data and the transformations required.
It really is a broad question and difficult to answer in a few paragraphs but I hope it helps you in getting you started with your ETL process!
In my experience, using LZ tables or a data dump area is a good idea.
First of all, it provides a one-to-one mapping with minimal transformations, if any, e.g. adding the file name attribute.
Secondly if the process fails, before achieving another milestone, the Landing Zone tables allow for restarting the process without the need to access the data source, which may or may not be accessible at that time.
You can also archive the data from the LZ tables. If you are only taking a subset of the data further down the pipeline, this might save you lots of work when the pipeline suddenly needs another attribute that exists in the original files and whose historic values are needed.
Hope that helps
In order to land the source tables, I would recommend making a separate table for every source table.
There should be no dependency on the source tables. The landed tables then help you build the staging area.
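The landing, staging and base flow discussed above can be sketched as follows, using Python's sqlite3 module and two source tables instead of six. All table names, column names and file names are hypothetical. The landing tables mirror the source and add only a file name, the staging table has the shape of the base table, and a rerun of the later steps reads only the landing tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Landing zone: same shape as the source, plus load metadata only.
CREATE TABLE lz_orders    (order_id TEXT, customer_id TEXT, amount TEXT, source_file TEXT);
CREATE TABLE lz_customers (customer_id TEXT, name TEXT, source_file TEXT);
-- Staging: shaped like the base table, typed and combined.
CREATE TABLE stg_order_base (order_id INTEGER, customer_name TEXT, amount REAL);
CREATE TABLE base_order     (order_id INTEGER PRIMARY KEY, customer_name TEXT, amount REAL);
""")

def land(conn, table, rows, source_file):
    """Copy source rows as-is, tagging each with the file it came from."""
    marks = ", ".join("?" * (len(rows[0]) + 1))
    conn.executemany(
        f"INSERT INTO {table} VALUES ({marks})",
        [row + (source_file,) for row in rows],
    )

def stage_and_publish(conn):
    """Rebuild staging from the landed data, then refresh the base table."""
    conn.execute("DELETE FROM stg_order_base")
    conn.execute("""
        INSERT INTO stg_order_base
        SELECT CAST(o.order_id AS INTEGER), TRIM(c.name), CAST(o.amount AS REAL)
        FROM lz_orders AS o
        JOIN lz_customers AS c ON c.customer_id = o.customer_id
    """)
    conn.execute("INSERT OR REPLACE INTO base_order SELECT * FROM stg_order_base")
    conn.commit()

land(conn, "lz_orders", [("1", "C1", "19.90"), ("2", "C2", "5.00")], "orders_2024-01-01.csv")
land(conn, "lz_customers", [("C1", " Ada "), ("C2", "Grace")], "customers_2024-01-01.csv")
stage_and_publish(conn)
stage_and_publish(conn)  # a rerun needs only the landing tables, not the source

base = conn.execute("SELECT * FROM base_order ORDER BY order_id").fetchall()
```

Running stage_and_publish twice leaves the base table unchanged, which is the restart property the landing zone is meant to provide.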

Is it a good way of working with ETL, if I use joins in the table input step?

I am wondering if it is the correct way of working with ETL by using a join (in my case I use 3 joins to get the desired values) in the table input step in my transformation. Or is there a better way? Thank you for your help.
As is often the case, the answer depends on your environment. For instance, if you have a fast changing source system and lots of transformations with longer durations, first copying the needed information into a staging database can help you create reproducible results through all transformations involved. Directly joining tables from the source system can in that case create different results for two transformations running one after the other.
If you have a timeframe where your source system doesn't change much or at all - or if you need that information only in this single transformation - joining the tables may be no problem at all.
From a technical point of view there is nothing to say against joins (actually there are arguments for joins, especially performance). Comprehensibility is another matter, and here again your specific environment matters. ETL processes are often badly documented, and working on a transformation that was created years ago by someone else can be either easy or a complete pain. If your joins make sense from a technical perspective and you obtain your data from a consistent source, I don't see why you shouldn't use them. They should generally be much faster than lookup steps in an ETL transformation.
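The difference between joining in the input query and using lookup steps can be sketched in Python with sqlite3. The tables (orders, customers, products) and both function names are hypothetical, and the second function only emulates an uncached lookup step; real ETL tools may cache lookups. Both variants return the same rows, but the join issues one statement while the lookup variant issues two extra queries per input row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders    (order_id INTEGER, customer_id INTEGER, product_id INTEGER);
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products  (product_id INTEGER PRIMARY KEY, title TEXT);
INSERT INTO orders    VALUES (1, 10, 100), (2, 11, 100), (3, 10, 101);
INSERT INTO customers VALUES (10, 'Ada'), (11, 'Grace');
INSERT INTO products  VALUES (100, 'Widget'), (101, 'Gadget');
""")

def extract_with_joins(conn):
    """One statement; the database resolves both references itself."""
    return conn.execute("""
        SELECT o.order_id, c.name, p.title
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        JOIN products  AS p ON p.product_id  = o.product_id
        ORDER BY o.order_id
    """).fetchall()

def extract_with_lookups(conn):
    """Emulates per-row lookup steps: two extra queries for every input row."""
    out = []
    for order_id, customer_id, product_id in conn.execute(
        "SELECT order_id, customer_id, product_id FROM orders ORDER BY order_id"
    ).fetchall():
        name = conn.execute(
            "SELECT name FROM customers WHERE customer_id = ?", (customer_id,)
        ).fetchone()[0]
        title = conn.execute(
            "SELECT title FROM products WHERE product_id = ?", (product_id,)
        ).fetchone()[0]
        out.append((order_id, name, title))
    return out

joined = extract_with_joins(conn)
looked_up = extract_with_lookups(conn)
```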

Linq to Sql structure standard

I was wondering what the "correct"-way is when using a Linq to Sql-model in Visual Studio.
Should I create a new model for each of my components, say Blog, Users, News and so on and have all different xxxDataContext's with tables and SPROCs added in each of these.
Or should I create one MyDbDataContext and always work against that?
What are the pros and cons? My gut tells me to divide it up into smaller contexts, but it also feels like that could bring problems as the project expands.
What's the deal? Help me Stackoverflow :)
There will always be overhead when creating the data context, as the model needs to be built. Depending on the number of tables in your database this might not be much of a big deal though. If it's only 10 tables or so, the overhead will not be much more than that for a context with, say, 1 table (sorry, I don't have actual stress testing to show the overhead, but, hey, maybe that gives me something to blog on this weekend). When looking at large databases the overhead might be enough to consider using separate contexts.
The main advantage I would see with using a single data context is that you gain the ability to use JOINs in your LINQ query, and they will be translated to T-SQL. Whereas if you do the join after the arrays of objects are pulled, the performance might be a bit slower. Additionally, keeping track of multiple data contexts might be confusing, and good naming conventions would be needed. So building your own data model with business logic that encapsulates the contexts would be a bit harder. I've done this and it's not fun :)
However, if you still feel you want to go that route, then I would recommend putting similar tables (that you might need to join) in the same context. Also, there are some tuts online that recommend using a shared MappingSource when using multiple contexts that use the same source. Information on this can be found here: http://www.albahari.com/nutshell/speedinguplinqtosql.aspx
Sorry, I know that's not really a black and white answer, but hopefully it helps :)
Addition:
Just wanted to add that I did a small test and ran 20,000 SELECT statements against a small sized table using 2 different data contexts:
DataClasses1DataContext contained mappings to all tables in the db (4 total)
DataClasses2DataContext contained a single mapping for just the one table
Results:
Time to execute 20000 SELECTs using DataClasses1DataContext: 00:00:10.4843750
Time to execute 20000 SELECTs using DataClasses2DataContext: 00:00:10.4218750
As you can see, it's not much of a difference.

Where can I find a real dataset anywhere online that I could try doing a data warehouse cube with?

I am studying data warehouses and I have to do one final project for my studies.
I am thinking about doing a cube for a data warehouse. Where can I find a real dataset anywhere online that I could try doing a cube with?
You can refer to this page to see how to convert a part of the Northwind database to a star schema for building cubes: Northwind Star Schema.
Here's an example on Adventure Works. Of course, it's prebuilt SSAS, but I guess you could look at the underlying AdventureWorks DB and do the dimensional modeling yourself.
I think doing a DW on an existing popular dataset like Northwind or AdventureWorks is probably not a great idea, because so many people have done them. Even StackOverflow has had data mining done, but perhaps it would be a good candidate - I'm not sure what Brent's work actually comprised.
So if you are looking to do an original project, you might need to look further afield, if only to distinguish your work from previous work.

Resources