Landing Zone vs Staging tables

I am building a small DWH in SQL Server. We have 6 source tables that we have to combine into a single BASE table based on a given logic.
My question is: should I start by creating 6 LZ tables (one for each of the 6 source tables) to land the data on the system, then combine these 6 LZ tables into 1 Staging table, and finally move the data from the Staging table to the Base table?
My first thought was to create 6 Staging tables (instead of LZ tables) and then combine these 6 to form the Base table. However, I decided against it based on my understanding that the structure of LZ tables should match the source tables, while the structure of the Staging table should reflect the Base table.
Which alternative should be pursued in this case? What are the pros and cons?
Please share your thoughts.
Thanks

Honestly, I don't see one single answer to this question. It really depends on various factors - frequency of extract, source system availability, complexity of transformations, data lineage requirements, etc.
I would start by creating one staging table per source system table/entity. If you are using an ETL tool to do the ETL process, most ETL tools are pretty good at doing simple to complex transformations "on the fly" (in memory). I have used SSIS extensively and it handles most transformations well.
You can sometimes end up with some other tables in the staging area if your transformations have very complex business rules. It helps in debugging in the sense that you can see the data before, during and after transformations. But as I said, that really depends on the data and the transformations required.
It really is a broad question and difficult to answer in a few paragraphs, but I hope this helps you get started with your ETL process!

In my experience, using LZ tables or a data dump area is a good idea.
First of all, it provides a one-to-one mapping with minimal transformations, if any (e.g. adding a file name attribute).
Secondly, if the process fails before reaching the next milestone, the Landing Zone tables allow you to restart the process without going back to the data source, which may or may not be accessible at that time.
You can also archive the data from the LZ tables. If you only take a subset of the data further down the pipeline, this can save you a lot of work if the pipeline suddenly needs another attribute, the historic values are required, and that attribute is present in the original files.
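For illustration, here is a minimal sketch of what a landing table might look like in SQL Server. The table and column names are made up for the example; the point is only that the structure mirrors the source one-to-one, with a couple of audit columns added during the load:

    -- Hypothetical landing table: same columns as the source table,
    -- plus audit columns populated at load time.
    CREATE TABLE lz.Customer
    (
        CustomerID     int            NOT NULL,
        CustomerName   nvarchar(200)  NULL,
        Email          nvarchar(200)  NULL,
        -- audit columns added during the load
        SourceFileName nvarchar(260)  NULL,
        LoadDateTime   datetime2      NOT NULL DEFAULT SYSUTCDATETIME()
    );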
Hope that helps

In order to land the source tables, I would recommend you make a separate table for every source table.
There should be no dependency on the source tables. The landed tables will then help you build the staging area.

Related

Enterprise Data Warehouse - Should the EDW table be named the same as it is in the source system

So we are loading an EDW with several Electronic Medical Record systems. We give each source system a database, internally referred to as a source mart. Then we merge similar data into tables in another database called Essentials.
I am curious as to the best practice for naming the tables at the source mart level. I think they should keep exactly the same name as in the source system. That way, when apps are ported over, we have some level of lineage to map to. Developers on the existing system would know that the table PAT_REF is patient data on both systems and would not have to maintain a second dictionary to figure out that the table has been named something else.
But once we merge tables from multiple systems into the Essentials database, we would rename the tables based on what Data Governance worked out with all parties involved in using the data.
I could swear I saw this in one of the bazillion best-practices documents out there, but I only seem to find docs going through normalization steps at the first level of data. I don't see the point of trying to design facts and dimensions at that level and then trying to merge them with the other source systems, not to mention the huge hit those normalized queries would put on the source server.
We use the same table name in our staging area as we do in our source systems.
To load them into the combined data warehouse we write views that define the relationships and dependencies from the source systems. Then in the data warehouse the table names reflect those of the views used to load them.
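As a rough sketch of that pattern (the view and column names below are invented, apart from PAT_REF which is mentioned above): the staging tables keep their source names, the view encapsulates the join logic, and the warehouse table is loaded from, and named after, the view.

    -- Staging tables keep their source-system names (e.g. PAT_REF).
    -- A view defines the relationships used to load the warehouse table.
    CREATE VIEW stg.v_Patient AS
    SELECT  p.PAT_ID,
            p.PAT_NAME,
            a.ADDRESS_LINE1
    FROM    stg.PAT_REF p
    JOIN    stg.PAT_ADDRESS a
            ON a.PAT_ID = p.PAT_ID;
    GO

    -- The warehouse table reflects the name of the view that loads it.
    INSERT INTO dw.Patient (PatientID, PatientName, AddressLine1)
    SELECT PAT_ID, PAT_NAME, ADDRESS_LINE1
    FROM   stg.v_Patient;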

Is it a good way of working with ETL, if I use joins in the table input step?

I am wondering whether it is correct ETL practice to use a join (in my case I use 3 joins to get the desired values) in the table input step of my transformation. Or is there a better way? Thank you for your help.
As is often the case, the answer depends on your environment. For instance, if you have a fast-changing source system and lots of transformations with longer durations, first copying the needed information into a staging database can help you create reproducible results across all the transformations involved. Directly joining tables from the source system can in that case produce different results for two transformations running one after the other.
If you have a timeframe where your source system doesn't change much or at all - or if you need that information only in this single transformation - joining the tables may be no problem at all.
From a technical point of view there is nothing to be said against joins (actually there are arguments for joins, especially performance). Comprehensibility is another matter, and here again your specific environment matters. ETL processes are often badly documented, and working on a transformation that was created years ago by someone else can be either easy or a complete pain. If your joins make sense from a technical perspective and you obtain your data from a consistent source, I don't see why you shouldn't use them. They will usually be much faster than lookup steps in an ETL transformation.
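To make that concrete, the join in a table input step is just the SQL you type into the step. A hedged example with made-up table names, using three joins (as in the question) to resolve lookup values at extraction time:

    -- Example query for a table input step: three joins resolving
    -- descriptive values while extracting from a consistent source.
    SELECT  o.OrderID,
            o.OrderDate,
            c.CustomerName,
            p.ProductName,
            s.StatusDescription
    FROM    dbo.Orders o
    JOIN    dbo.Customers     c ON c.CustomerID = o.CustomerID
    JOIN    dbo.Products      p ON p.ProductID  = o.ProductID
    JOIN    dbo.OrderStatuses s ON s.StatusID   = o.StatusID;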

When developing web applications when would you use a Graph database versus a Document database?

I am developing a web-based application using Rails. I am debating between using a Graph Database, such as InfoGrid, or a Document Database, such as MongoDB.
My application will need to store both small sets of data, such as a URL, and very large sets of data, such as Virtual Machines. This data will be tied to a single user.
I am interested in learning about people's experiences with either Graph or Document databases and why they would choose one or the other.
Thank you
I don't feel experienced enough with both worlds to answer your question properly and fully; however, I have been using a document database for some time, so here are some personal hints.
Document databases are based on a concept of keys, values and static views, and are pretty good at finding the set of documents that have a particular value.
They don't conceptualize the relations between documents.
So if your software has to provide advanced "queries" where selection criteria act on several 'types of document', or if you simply need to perform a selection using several elements, the [key, value] concept is not appropriate.
There are also a number of other cases where document databases are inappropriate: presenting large datasets in "paged" tables, sortable on several columns, is one of the cases where performance is low and disk space usage is huge.
So in many cases you'll have to perform "server-side" processing in order to pick up the pieces, and with Rails, or any other Ruby-based framework, you might run into performance issues.
Graph databases are based on the concept of a triplestore, meaning that they also conceptualize the relations between entities.
The graph can be traversed using the relations (and entity roles), and might be more convenient when performing searches across relation-structured data.
As I have no experience with graph databases, I don't know whether a graph database can be easily queried/traversed with several criteria; however, if a well-informed reader has such information, I'd really appreciate examples of such queries/traversals.
I'm currently reading about InfoGrid and trying to figure out whether such databases could be handy for performing complex requests on a very large set of data, relations included.
From what I can read, InfoGrid should be considered a "data federator" able to search/mine the data from several sources (Stores), which can also be a NoSQL database such as Mongo.
Which means that you could use a Mongo store for updates and InfoGrid for data searching, and maybe spare a lot of CPU and disk when it comes to complex searches inside a NoSQL database.
Of course it might seem a little "overkill" if your app simply stores a large set of huge binary files in a database and all you need is to perform simple key queries and retrieve the results. In that case a NoSQL database such as Mongo or Couch would probably be handy.
Hope some of this helps ;)
When connecting related documents by edges, will you get a shallow or a deep graph? I think the answer to that question is important when deciding between graphdbs and documentdbs. See Square Pegs and Round Holes in the NOSQL World by Jim Webber for thoughts along these lines.

Avoid writing SQL queries altogether in SSIS

Working on a Data Warehouse project, the guy who gave us the tutorial advised that we stick to using SQL queries over defining a lot of data flow transformations, citing reasons such as it consuming a lot of memory on the ETL box, so we'd rather leave the processing to the DB box. Is this really advisable? Where's the balance between relying on GUI tools and executing a bunch of SQL scripts in your Integration Services package?
And honestly, I'd like to avoid writing SQL queries as much as I can. (but that's beside the point. I'd really like to look at this objectively.)
The answer is: it depends, but you want to pick one or the other for any given job and avoid mixing the two where possible.
Generally, it's best to either do everything possible within the tool or do everything possible within stored procedure code. When you have significant amounts of logic split between layers the system becomes harder to trace and debug.
Where the tool can do the transformations without the data flows becoming awkward and convoluted you could use the tool and try to have little or no logic in queries. This means that one single layer has the business logic and it should be fairly obvious where to find it. However, ETL tools tend to handle highly complex transformations relatively poorly. The sweet spot for this type of approach is on systems where you have a large number of data sources but relatively simple transformations.
If you have relatively complex transformations you may be better off putting all the business logic and transformation into a layer of stored procedures. SQL code is better at implementing complex transformations in a maintainable way - I have it on fairly good authority that around half of all data warehouse projects in the banking and insurance sectors use this type of architecture for precisely that reason. In this case the ETL tool can be used to implement relatively dumb data copies. Source data can be copied into staging areas essentially verbatim and then picked up by a body of stored procedure code that does the ETL. The ETL tool can be used for data copies, bulk load operations, logging, scheduling and other framework tasks.
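As a minimal sketch of that architecture (the procedure, schema and table names here are hypothetical), the tool does a dumb verbatim copy into a staging table and a stored procedure carries the transformation logic:

    -- Hypothetical stored procedure holding the business logic;
    -- the ETL tool only copies source rows into stg.Customer verbatim.
    CREATE PROCEDURE etl.LoadDimCustomer
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dw.DimCustomer (CustomerID, CustomerName, CountryName)
        SELECT  s.CustomerID,
                UPPER(LTRIM(RTRIM(s.CustomerName))),   -- example cleansing rule
                ISNULL(c.CountryName, 'Unknown')       -- example lookup with default
        FROM    stg.Customer s
        LEFT JOIN dw.DimCountry c
               ON c.CountryCode = s.CountryCode
        WHERE   NOT EXISTS (SELECT 1
                            FROM   dw.DimCustomer d
                            WHERE  d.CustomerID = s.CustomerID);
    END;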
In either case you're best off picking one approach. Otherwise, you can end up with business logic spread across extraction layers, database views, data flows, and stored procedure code. Logic spread across multiple layers is much harder to test.
When all of the logic is (for example) contained within stored procedures or focussed ETL transformation jobs you can unit test a given transformation in isolation. The clarity in design also helps with maintenance and auditing.
I find that using SQL code is not only faster to run, but also faster to develop and much, much easier to maintain.
Generally, when you want to process each row individually, use a data flow; otherwise it may be better to use a SQL command.
Personally I'd go with writing the SQL where I can. It's easier to optimise later and (usually) faster as well. Google will give much more detailed answers.
Another factor to think about is the provider you use for your connections.
You need to make the decision based on your needs. We use a Postgres DB, so we have to create a load of staging tables for some processes, which speeds the whole thing up.
You should also take into consideration the box it is running on: if you have an all-powerful DB box and a little ETL box, there'd be no point in running much of the processing on the ETL box.
If you do all your processing on the ETL box you'll be dragging a lot of data across the network as well.
Check out these links to get you started:
ssistalk.com/category/ssis/ssis-advanced-techniques/
msdn.microsoft.com/en-us/library/ms141031.aspx
weblogs.sqlteam.com/jamesn/Default.aspx
I think this is a difficult question; and an interesting one as well.
One reason to use SSIS is to improve maintainability, IMHO. If you pack all the logic into SQL statements (and you certainly can!) you tend to defeat that reason for using SSIS in the first place. You cannot really "see the data flow" anymore.
On the other hand, I feel there are times when a well-placed SQL statement has its value. For example, when you read data from a table and for whatever reason already know you will only ever need the rows satisfying condition X, I see no reason to read the whole table and then "conditional-split most of it away" in the next step.
What I do not know is what this means in terms of performance, by the way. Is SSIS smart enough to see what is happening and turn the "read whole table and conditionally split it" into a "SELECT Y FROM ... WHERE X" on the fly (or when building/deploying)?
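To illustrate the first point with made-up names: instead of reading the whole table into the data flow and conditionally splitting it, you can put the condition in the source query yourself. (As far as I know, the source component runs the query exactly as written, so the filter is not pushed down automatically.)

    -- Instead of "SELECT * FROM dbo.Orders" followed by a Conditional Split
    -- on OrderStatus, apply condition X in the OLE DB Source query:
    SELECT  OrderID, CustomerID, OrderDate, Amount
    FROM    dbo.Orders
    WHERE   OrderStatus = 'Shipped';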
The big question is where to draw the line, and this depends to a certain extent on the people working on your ETL process. If everyone who will ever support the process has known SQL from the start, you can support a higher amount of SQL in your ETL than if you have co-workers (or customers, or successors you care about) who hardly understand what is happening in all your SQL, let alone change, improve or add to it.
So I think the bottom line is that neither avoiding SQL entirely nor doing everything in SQL is better. Try to come up with some simple rules that fit your requirements and that everyone can live with, then follow them. That buys you the most value from using SSIS.
SQL Server does some things well and other things not so well. I use SSIS to import data to or export data from SQL Server, and during the course of the move I use SSIS where it makes sense. I can easily do work on a per-row basis, which is not very efficient in SQL Server (cursors). To say that you shouldn't use transformations and data flows on an ETL box because it is too expensive on the ETL box is like saying "don't drive your car too fast, because it causes the engine to work". The purpose of an ETL tool like SSIS is to take some of the processing that SQL Server does not do well and move it to an engine that does.
Got to use the right tool for the job. Generally, you do most things in SSIS, with certain things done in "pure" SQL.
For instance, in cases where you do a lot of UPDATEs (a table difference on a dimension table in a dimensional model, say), you really don't want to execute an UPDATE for each row. In this scenario, you do a regular insert into a temporary table and then do the UPDATE in SQL, joining on the appropriate keys.
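A sketch of that pattern with invented table and column names: stage the changed rows first, then apply a single set-based UPDATE joined on the business key.

    -- Temporary table receiving the changed rows from the data flow.
    CREATE TABLE #CustomerChanges
    (
        CustomerID   int           NOT NULL,
        CustomerName nvarchar(200) NULL,
        City         nvarchar(100) NULL
    );

    -- One set-based UPDATE instead of an UPDATE per row.
    UPDATE  d
    SET     d.CustomerName = c.CustomerName,
            d.City         = c.City
    FROM    dw.DimCustomer d
    JOIN    #CustomerChanges c
            ON c.CustomerID = d.CustomerID;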

Linq to Sql structure standard

I was wondering what the "correct" way is when using a LINQ to SQL model in Visual Studio.
Should I create a new model for each of my components, say Blog, Users, News and so on, and have separate xxxDataContexts with tables and SPROCs added to each of these?
Or should I create one MyDbDataContext and always work against that?
What are the pros/cons? My gut tells me to divide it up into smaller contexts, but it also feels like that could bring problems as the project expands.
What's the deal? Help me, Stack Overflow :)
There will always be overhead when creating the data context, as the model needs to be built. Depending on the number of tables in your database, this might not be a big deal. If it's only 10 tables or so, the overhead will not be much more than that for a context with, say, 1 table (sorry, I don't have actual stress testing to show the overhead, but, hey, maybe that gives me something to blog about this weekend). When looking at large databases, the overhead might be enough to consider using separate contexts.
The main advantage I see with using a single data context is that you gain the ability to use JOINs in your LINQ query and have them translated to T-SQL, whereas if you do the join after the arrays of objects are pulled, the performance might be a bit slower. Additionally, keeping track of multiple data contexts can be confusing and good naming conventions would be needed, so building your own data model with business logic that encapsulates the contexts would be a bit harder. I've done this and it's not fun :)
However, if you still feel you want to go that route, then I would recommend putting similar tables (that you might need to join) in the same context. Also, there are some tuts online that recommend using a shared MappingSource when using multiple contexts that use the same source. Information on this can be found here: http://www.albahari.com/nutshell/speedinguplinqtosql.aspx
Sorry, I know that's not really a black and white answer, but hopefully it helps :)
Addition:
Just wanted to add that I did a small test and ran 20,000 SELECT statements against a small table using 2 different data contexts:
DataClasses1DataContext contained mappings to all tables in the db (4 total)
DataClasses2DataContext contained a single mapping for just the one table
Results:
Time to execute 20000 SELECTs using DataClasses1DataContext: 00:00:10.4843750
Time to execute 20000 SELECTs using DataClasses2DataContext: 00:00:10.4218750
As you can see, it's not much of a difference.