Loading Denormalized Data into a Data Warehouse - Normalization

We are building a data warehouse by consuming file feeds from different sources.
The file feeds are all denormalized/flattened (in the Transactions (fact) file, the Account attributes keep repeating in every record).
Also, the account information changes often (the feed gives an as-is version of the data).
What is the best practice in this situation? Should the data warehouse have a star schema model, with the Account information as a slowly changing dimension and a Transaction fact? Will re-normalizing make the ETL process complex?

In my company, whenever some input is denormalized, we normalize it and from there we proceed with loading our schemas (whatever your target schema is).
The reason is that, being denormalized, those inputs are difficult to check for inconsistencies (data quality). Apart from that, conforming all of your inputs to some standard makes your code more maintainable.
In our case, following the Kimball practices has been a total success: fact tables, slowly changing dimensions and all that jazz.
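For illustration, here is a minimal sketch of that kind of model (all table and column names are hypothetical), with Account handled as an SCD Type 2 dimension and the fact carrying its surrogate key:

    -- Hypothetical star schema sketch: SCD Type 2 Account dimension + Transaction fact.
    CREATE TABLE dim_account (
        account_key   INT           NOT NULL PRIMARY KEY,  -- surrogate key
        account_id    VARCHAR(20)   NOT NULL,              -- business key from the feed
        account_name  VARCHAR(100)  NOT NULL,
        account_type  VARCHAR(20)   NOT NULL,
        valid_from    DATE          NOT NULL,
        valid_to      DATE          NOT NULL,              -- e.g. 9999-12-31 for the current row
        is_current    CHAR(1)       NOT NULL               -- 'Y' or 'N'
    );

    CREATE TABLE fact_transaction (
        transaction_id  BIGINT        NOT NULL,
        account_key     INT           NOT NULL REFERENCES dim_account (account_key),
        date_key        INT           NOT NULL,            -- points at a date dimension
        amount          DECIMAL(18,2) NOT NULL
    );

The ETL looks up (or versions) the account row on each load and stamps each fact with the surrogate key that was current at transaction time, so the repeated account attributes in the feed collapse into one dimension row per version.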

Hard to answer without such details as daily volume, latency threshold, resource availability, reporting requirements, platform and tool constraints, etc. A traditional ODS, where you import data into and store a normalized structure before creating data marts from it, is great but not optimal for big data or real-time analysis. A more modern approach, using a data lake in Hadoop or a virtualization layer, may not be feasible for your organization.
General Opinions:
1) re-normalizing does seem unnecessary from both a complexity and performance standpoint unless you have some ongoing use for the normalized data store.
2) Whether or not you build a traditional star schema or a graph or whatever should be governed by the reporting requirements and tools, not the source data format. Those sources will change, btw.
3) "Transaction" does not sound like a fact to me. A purchase transaction, e.g., could feed a sales fact, an accumulating snapshot for a sales cycle, a funnel conversion fact, etc.
4) I'm not sure whether "Account" is a customer, or a balance account such as a credit card, online payment service, bank account, etc. They imply different SCD types. In any case, Google will be sufficient to get plenty of information about building those dimensions.

EDW Kimball vs Inmon

I've been tasked with coming up with a recommendation on how to proceed with an EDW and am looking for clarification on what I'm seeing. Everything I am learning states that Kimball's approach will bring value to the business quicker than Inmon's. I get that Kimball's approach is a dimensional model from the get-go, and different data marts (star schemas) are integrated through conformed dimensions... thus the theory is I can simply come up with the DM that solves my immediate business need and go on from there.
What I'm learning states that Inmon's model suggests I have an EDW designed in 3NF. The EDW is not defined by source system but instead by the structure of the business, the Corporate Information Factory (Orders, HR, etc.). So data from disparate systems maps into this structure. Once the data is in this form, ETLs are then created to produce DMs.
Personally, I feel Inmon's approach is a better way. I believe it will ensure that the data is consistent, and it feels like you can do more with it. What holds me back with this approach, though, is that everything I'm reading says it's going to take much more time to deliver something, and I'm not seeing how that is true. From my narrow view, it feels like no matter what, the end result is that we need a DM. Regardless of whether we use Kimball's or Inmon's approach, the end result is the same.
So then the question becomes: how do we get there? In Kimball's approach we create ETLs to some staging location and generally from there create a DM. In Inmon's approach I feel we just add another layer... that is, from the staging area we load the data into another database in 3NF organized by function. What I'm missing is how this step adds so much time.
I feel I can look at the end DM that needs to be made, map it back to a DW in 3NF, and then, as more DMs are requested, keep building up the 3NF DW with more and more data. However, if I create a DM following Kimball's model, that DM is built around the level of grain decided for it. What if the next DM requested wants reporting at an even deeper grain? To me it feels like Kimball's methodology would take more work, while with Inmon's it doesn't matter: I have everything at the transactional level, so when DMs of varying grains are requested, I have the data; I just ETL it into a DM, and all DMs will report the same since they are sourced from the same data.
I dunno... just looking for others' views. Everything I read says Kimball's is quicker... I say sure, maybe a little bit, but there is certainly a cost attached to going the quicker route. And for the sake of argument... let's say it takes a week to get a DM up and running through Kimball's methodology... to me it feels like it should only take 10%, maybe 20%, longer using Inmon's.
If anyone has any real-world experience with the different models, and whether one really takes so much longer than the other... please share. Or if I have this backwards, tell me that too!
For context: I look after a 3-billion-record data warehouse for a large multinational. Our data makes its way from the various source systems through staging and into a 3NF db. From there our ELT processes move the data into a dimensionally modelled, star schema db.
If I could start again I would definitely drop the 3NF step. When I first built that layer I thought it would add real value. I felt sure that normalisation would protect the integrity of my data. I was equally confident the 3NF db would be the best place to run large/complex queries.
But in practice, it has slowed our development. Most changes require an update to the stage, 3NF and star schema db.
The extra layer also increases the amount of time it takes to publish our data. The additional transformations, checks and reconciliations all add up.
The promised improvement in integrity never materialised. I realise now that because I control the ETL, and the validation processes within, I can ensure my data is both denormalised and accurate. In reporting data we control every cell in every table. The more I think about that, the more I see it as a real opportunity.
Large and complex queries were another myth busted by experience. I now see the need to write complex reporting queries as a failing of my star db. When this occurs I always ask myself: why isn't this question easy to answer? The answer is most often bad table design. The heavy lifting is best carried out when transforming the data.
Running a 3NF and star also creates an opportunity for the two systems to disagree. When this happens it is often a very subtle difference. Neither is wrong, per se. Instead, it is possible the 3NF and star query are asking slightly different questions, and therefore returning different results. Although technically correct, this can be hard to explain. Even minor and explainable differences can erode confidence, over time.
In defence of our 3NF db, it does make loading into the star easier. But I would happily trade more complex SSIS packages for one less layer.
Having said all of this: it is very hard to recommend an approach to anyone without a deep understanding of their systems, requirements, culture, skills, etc. Having read your question I am sure you have wrestled with all these issues, and many more no doubt! In the end, only you can decide what the best approach for your situation is. Once you've made your mind up, stick with it. Consistency, clarity and a well-defined methodology are more important than anything else.
Dimensions and measures are a well proven method for presenting and simplifying data to end users.
If you present an end user with a schema based on the source system (3NF) versus a dimensionally modelled star schema (Kimball), they will be able to make much more sense of the dimensionally modelled one.
I've never really looked into an Inmon decision support system, but to me it seems to be just the ODS portion of a full data warehouse.
You are right in saying "The EDW is not defined by source system but instead the structure of the business". A star schema reflects this, but an ODS (a copy of the source system) doesn't.
A star schema takes longer to build than just an ODS, but gives many benefits, including:
Slowly changing dimensions can track changes over time
Denormalisation simplifies joins and improves performance
Surrogate keys allow you to disconnect from source systems
Conformed dimensions let you report across business units (e.g. profit per headcount)
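As a rough illustration of that last point (all table and column names are made up), "profit per headcount" becomes a drill-across query once both facts share a conformed department dimension:

    -- Aggregate each fact separately, then join on the conformed key to avoid fan-out.
    WITH profit AS (
        SELECT department_key, SUM(profit_amount) AS total_profit
        FROM   fact_profit
        GROUP  BY department_key
    ),
    headcount AS (
        SELECT department_key, SUM(headcount) AS total_headcount
        FROM   fact_headcount
        GROUP  BY department_key
    )
    SELECT d.department_name,
           p.total_profit / NULLIF(h.total_headcount, 0) AS profit_per_headcount
    FROM   dim_department d
    JOIN   profit    p ON p.department_key = d.department_key
    JOIN   headcount h ON h.department_key = d.department_key;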
If your Inmon 3NF database is not just an ODS (replica of source systems), but some kind of actual business model then you have two layers to model: the 3NF layer and the star schema layer.
It's difficult nowadays to sell the benefit of even one layer of data modelling when everyone thinks they can just do it all in a 'self-service' tool (which I believe is a fallacy). Your system should be no more complicated than it needs to be, because all that complexity adds up to maintenance, and that's the real issue: introducing changes 12 months into the build when you have to change many layers.
To paraphrase #destination-data: your source-system-to-star-schema transformation (and separation) is already achieved through ETL, so the 3NF layer seems redundant to me. You design your star schema to be independent from source systems by correctly implementing surrogate keys and business keys, and by modelling it on the business, not on the source system.
With ETL and back-end data wrangling taking up about 70% of the project time for this kind of endeavour, an extra layer makes a big difference. It's an extra layer to transform from source to target, to agree with the business, and to test. It all adds up.
Whilst I'm not saying that dimensional models (the Kimball kind) are always easy to change, you've got a lot more inflexibility if you always have to change several layers whenever you want to change your BI.
In fact, the places where I've been consulting whose data warehouses are considered inflexible, expensive to develop for, and not keeping pace with changes to the business have, without exception, included the 3NF layer prior to the DMs. As Nick mentioned, it is hard nowadays to sell the idea of a 'proper' data warehouse as opposed to a Data Discovery BI tool, and the appeal of those is often driven by DWs being seen as slow and expensive to develop.
Kimball isn't against having a 3NF layer prior to his DW if it makes sense for a situation; he just doesn't agree with Inmon that there's a point to it.
One common misunderstanding is that Kimball proposes distinct data marts, so that you'd have to change them each time there is a different reporting request. Instead, Kimball's DMs are based on real-life business processes and modelled accordingly. Although it's true you will then try to make them suitable for reporting, you try to make them able to answer foreseeable queries. You don't aggregate and store just the aggregates: you work with the transactional data in a Kimball dimensional model.
So no need to be reluctant from that perspective.
If an ODS works for you, then go for it - but a Kimball DW will meet the majority of requirements.

Massive data operations: from stored procs to DDD

Let's take the example of product classification. All the products need to be classified as vegetable or not. The business logic is: a product can be classified as a vegetable if it comes from company A, B, or C. If the product is not from those companies, it is not a vegetable. There are millions of products. This can be done in a stored proc with a few lines of code, and the operation may take only a few seconds if it is done synchronously.
As I understand it, DDD goes against the idea of putting the logic in a stored procedure. The logic can instead be put as a behavior on Product, which can classify itself based on which company it comes from. To do this, all the millions of products need to be read into memory, processed and then saved back to the database.
The problem here is the large amount of memory this operation needs. If the operation is done in chunks of, say, 50,000, the repository first has to figure out how many products need to be classified and tell the domain that the long-running operation has to proceed in chunks. Surely this approach is going to take more time and give a worse user experience, since the user has to wait longer than a stored procedure would take.
What is a reasonable approach in DDD when it comes to long-running processes? Is the delay simply expected, so the app has to inform the user that the classification is going to take time and let them know when it is complete? And should we avoid the stored procedure and keep the logic as part of the domain?
UPDATE
Just to add some clarity, this classification process is done quite often. The application has to support the classification process; it is not an ETL job and can't wait longer. That's why I'm trying to find the trade-offs between using a stored procedure and DDD.
Also note that it is not a Query, but a Command. The command could be called ClassifyAllProductsCommand(). When this command is run, there was no classification before; after the classification, other users of the system should see the new classification. For example, product A is classified as Unavailable, and after the classification it can be Vegetable or Meat.
Classification is an interesting thing. It is a separate thing. Classification should never be implemented as structure... but that is another story :)
Your classification may even be regarded as a bounded context in the same way that reporting may be a bounded context. As such you may wish to handle classification separately. Your classification is not an aggregate root. It plays an auxiliary role. If it has no impact on the consistency in your domain modelling it may not even necessarily be part of your Product aggregate. It may be added and it may even be changed independently (not as bulk) but if it is used to determine the validity of your aggregate then your classification sub-system is going to have to take that into account.
Please bear in mind that it isn't a matter of DDD vs a stored procedure. You are executing queries against your data store. Whether that is done via a stored procedure or dynamically should not affect your decision. There is nothing preventing, say, a ProductRepository from calling a stored procedure.
You can have your classification sub-system still execute your SP or use DML directly. However, this isn't necessarily going to be part of your domain. You most certainly do not want to classify each product individually if it is something that happens quite often and as a bulk operation. If your current design dictates that these are bulk operations then keep them as such and don't force them into a DDD structure that is going to be prohibitive.
It is a design choice, and sometimes making changes to individual items does not make sense. It should certainly be your aim to work on a single aggregate at a time, but things like reporting or classification are a different animal that doesn't always fit cleanly into Domain-Driven Design thinking.
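For reference, the set-based operation under discussion is essentially one statement, whether it lives in a stored procedure or is issued by the classification sub-system as plain DML (table and column names are made up):

    -- Classify every product in one pass: vegetable if sourced from company A, B or C.
    UPDATE product
    SET    classification = CASE
               WHEN source_company_id IN ('A', 'B', 'C') THEN 'Vegetable'
               ELSE 'NotVegetable'
           END;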
I think you're confusing DDD. If you were looking for Vegetable type Products, you would call a service that would retrieve Products for a particular Company. There would be no need to load all the products into memory.
Application or domain-centric design, just means designing your application around the business domain and not from a collection of database tables upwards (like a data-centric approach).
In contrast, you end up with more data associations (joins) being done in your application and fewer in monolithic stored procedures. This moves your business logic into the application rather than the persistence device (the database), which kinda makes a lot of sense.
Also, if you deny yourself huge table joins then you also think carefully about things that traditionally cause massive overhead on your database and end up moving towards better design, like creating a separate reporting database, message buses, asynchronous tasks, etc.
EDIT
It seems like a common phrase in DDD but "it depends on your specific domain".
Without knowing the detail, I would want to know how often these classifications occur. Can they be done as the Products are created? Are they done often or rarely, planned or unpredictably?
If the classifications are common and must be done across all million products, it might be best to create a smaller model for the Product, maybe something with just SmallProduct.Id and SmallProduct.CompanyId (probably naming it something better). Then cache this smaller collection in memory and perform operations against it.
If the check to see whether the product is a Vegetable is common and only one of a few possible classifications, it might be best to have Classifications in their own table and a linking table to link them to Products. Then the problem becomes more of a one-time data setup issue.
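A minimal sketch of that linking-table layout (all names assumed):

    CREATE TABLE classification (
        classification_id  INT          NOT NULL PRIMARY KEY,
        name               VARCHAR(50)  NOT NULL          -- e.g. 'Vegetable', 'Meat'
    );

    CREATE TABLE product_classification (
        product_id         BIGINT NOT NULL,
        classification_id  INT    NOT NULL REFERENCES classification (classification_id),
        PRIMARY KEY (product_id, classification_id)
    );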
On the rare chance that you're using a Document Database, you could just store these classifications in a collection on the Product object itself.
It seems you are interpreting "classification" as your aggregate root, containing products (as entities).
Honestly, it does not feel like a good design decision (I might be wrong; it depends on the specifics of the requirements).
What if you think of the product as the aggregate root (containing suppliers, discounts, etc.)? In that case, you'll need to load only one product at a time.
If the classification/supplier has a complex domain, you should consider having a separate bounded context for that.
Also, in your comment:
Just to add some clarity, this classification process is done quite often. The application has to support the classification process; it is not an ETL job and can't wait longer. That's why I'm trying to find the trade-offs between using a stored procedure and DDD.
Really? You can't fire an event and have the product service update the classification when there's an update on the supplier? The user will see an inconsistent state (say, an "undefined" category) for a few seconds or minutes. That is not so bad, is it?
But, if you are talking about a batch job, then, by all means, go with the stored procedure.

How to do some reporting with Rails (with a dedicated DB)

In a Rails app, I am wondering how to build a reporting solution. I heard that I should use a separate database for reporting purposes, but knowing that I will need to store a huge amount of data, I have a lot of questions:
What kind of DBMS should I choose?
When should I store data in the reporting database?
Should the database schema of the production db and reporting db be identical?
I am storing basic data (information about users, about the results of operations) and I will need, for example, to run a report to know how many users failed an operation during the previous month.
I know that it is a vague question, but any hint would be highly appreciated.
Thanks!
Work Backwards
Start from what the end users want for reporting, or how they want to (or should) visualize data. Once you have some concepts in mind, start working backwards to figure out how to achieve those goals. Starting with the assumption that it should be a replicated copy in an RDBMS excludes several reasonable possibilities.
Making a Real-time Interface
If users are looking to aggregate values (counts, averages, etc.) on the fly (per web request), it would be worthwhile looking into replicating the master down to a reporting database, if the SQL performance is acceptable (and stays acceptable if you were to double the input data). SQL engines usually do a great job at aggregation and scale pretty far. This would also give you the capability to join data sets together and return complex results as users request them.
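For instance, the report in the question ("how many users failed an operation during the previous month") stays a cheap on-the-fly aggregate against the replicated copy; a sketch with guessed table and column names:

    SELECT COUNT(DISTINCT user_id) AS users_with_failures
    FROM   operations
    WHERE  status = 'failed'
    AND    performed_at >= :month_start   -- first day of the previous month
    AND    performed_at <  :month_end;    -- first day of the current month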
Just remember, replication isn't easy or without its own set of problems.
This'll start to show signs of weakness in the hundreds of millions of rows range with normalized data, in my experience. At some point, inserts fight with selects on the same table enough that both become exceptionally slow (remember, replication is still a stream of inserts). Alternatively, indexes become so large that storage I/O is required for rekeying, so overall table performance diminishes.
Batching
On the other hand, if reporting falls under the scheme of sending out standardized reports with little interaction, I wouldn't necessarily recommend backing it with an RDBMS. In this case, results are combined, aggregated, joined, etc. once. Paying the overhead of RDBMS indexing and storage bloat isn't worth it.
Batch engines like Hadoop will scale horizontally (many smaller machines instead of a few huge machines) so processing larger volumes of data is economical.
Batch to RDBMS or K/V Store
This is also a useful path if a lot of computation is needed to make the records more meaningful to a reporting engine. Alternatively, records could be denormalized before storing them in the reporting storage engine. The denormalized or simplified results would then be shipped to a key/value store or RDBMS to make reporting easier and achieve higher performance, at the cost of latency, compute, and possibly storage.
Personal Advice
Don't over-design it to start with. The decisions you make on the initial implementation will probably all change at some point. However, design it with the current and near-term problems in mind. Also, benchmarks done by others are not terribly useful if your usage model isn't exactly the same as theirs; benchmark your usage model.
I would recommend using a pre-built reporting service rather than writing everything out manually if you need a large set of reports.
You might want to look at Tableau http://www.tableausoftware.com/ and others available.
Database: yes, a separate one seems safer; plus, reporting is generally on old and consolidated data, and your live data might be too large to perform analysis on.
Database type: you have to choose based on the reporting services used. I think Mongo is not supported by many of the reporting services; MySQL is preferred.
If there are only one or two reports, you could just build them in Rails.

Data warehouse for analytical CRM

Is it beneficial to pull the data from a data warehouse for an analytical CRM application, or should it be pulled from the source systems without the need for a data warehouse? Please help me answer this.
For CRM it is better to fetch the data from the data warehouse, where data transformations are developed according to the business needs using various ETL tools. Using these transformations you can integrate the CRM analytics for analysing large chunks of data.
I guess the answer will lie in a few factors:
what data you need,
the granularity of that data, and
the ease of extraction.
If you need data from more than one source system, then you will have to join that data across them. One big strength of getting the data from a DWH is that it tends to have data from a number of source systems, well connected across those systems, with business rules applied consistently.
A data warehouse should hold the lowest-granularity data, but sometimes, for pragmatic reasons, decisions may have been taken to partly summarise the data, so you may not have the appropriate granularity.
The big advantage of a DWH is that it is a simple dimensional model structure (for a Kimball star schema, anyhow), so as long as the first two are true, I would always get my data from the DWH.
Good luck!
Sharing my thoughts on the business case for pulling from a data warehouse rather than directly from the CRM system:
A DWH can hold a lot more indicators for decision making and analysis at enterprise level, across various systems, than a single system like CRM. Therefore, if you want to take your analysis of CRM data further, you can easily merge in information from other systems to perform better analytics/BI from the DWH.
A DWH brings conformity across systems, so you can see the customer's data in a single view. For example, you can have pipeline and sales information from CRM and then perform revenue calculation in another system for the same customer. It's possible that you want both sets of details in a single place, with the same customer record linked to both measures. Then you might want to add Risk (credit information) from an external source into the same record in the DWH. This brings true scalability in terms of reporting and ad-hoc requests.
It removes non-core work and detaches the CRM production system from BI and reporting (not talking about specific CRM reports). This has various advantages, both in terms of operations and convenience. You can google this subject to understand the benefits further.
For now these are the only points that come to me. I will try adding more thoughts later.
P.S: I am more than happy to be corrected :-)

Achieving better DB performance

I have a website backed by a relational database comprised of the usual e-commerce related tables (Order, OrderItem, ShoppingCart, CreditCard, Payment, Customer, Address, etc...).
The stored proc which returns order history is painfully slow due to the amount of data plus the numerous joins which must occur, and depending on the search parameters it sometimes times out (despite the indexing that is in place).
The DB schema is pretty well normalized and I believe I can achieve better performance by moving toward something like a data warehouse. DW projects aren't trivial, and then there's the issue of keeping the data in sync, so I was wondering if anyone knows of a shortcut. Perhaps an out-of-the-box solution that will create the DW schema and keep the data in sync (via triggers, perhaps). I've heard of Lucene, but it seems geared more toward text searches and document management. Does anyone have other suggestions?
How big is your database?
There aren't really any shortcuts, but dimensional modelling is really NOT that hard. You first determine a grain, then identify your facts and the dimensions associated with those facts. Then you divide the dimensions into tables designed so that the dimensions only grow slowly over time. The choice of dimensions is completely practical and based on the data's behavior.
I recommend you have a look at Kimball's books.
For a database of a few GB, it's certainly possible to repopulate a reporting database from scratch several times a day (no history, just repopulating from a 3NF source into a different model of the same data). There are also real-time data warehousing techniques which apply changes continuously throughout the day.
So while DW projects might not be trivial, the denormalization techniques are very approachable and usable without necessarily building a complete time-invariant data warehouse.
Materialized views are what you might use in Oracle. They give you the "keeping the data in sync" feature you are looking for, combined with fast access to aggregate data. Since you didn't mention any specifics of your platform (server specs, number of rows, number of hits/second, etc.), I can't really help much more than that.
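As a rough sketch of what that could look like in Oracle (all object names are hypothetical): a materialized view that pre-aggregates order history and is refreshed on demand or on a schedule:

    CREATE MATERIALIZED VIEW mv_order_history
    BUILD IMMEDIATE
    REFRESH FORCE ON DEMAND
    AS
    SELECT o.customer_id,
           TRUNC(o.order_date, 'MM')        AS order_month,
           COUNT(*)                         AS order_count,
           SUM(oi.quantity * oi.unit_price) AS order_total
    FROM   orders o
    JOIN   order_item oi ON oi.order_id = o.order_id
    GROUP  BY o.customer_id, TRUNC(o.order_date, 'MM');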
Of course, we are assuming you've already checked that all your SQL is written properly and optimally, that your indexing is correct, that you are properly using caching in all levels of your app, that your DB server has enough RAM, fast hard drives, etc.
Also, have you considered denormalizing your schema just enough to serve up your most common queries faster? That's better than implementing an entire data warehouse, which might not even be what you want anyway. Usually a data warehouse is for reporting purposes, not for serving interactive apps.
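One hedged sketch of that kind of targeted denormalization (table and column names assumed; your real join keys will differ): a flattened order-history table that a scheduled job or the ETL keeps current, so the history screen reads one table instead of joining many:

    CREATE TABLE order_history_flat (
        order_id     BIGINT        NOT NULL PRIMARY KEY,
        customer_id  BIGINT        NOT NULL,
        order_date   DATE          NOT NULL,
        item_count   INT           NOT NULL,
        order_total  DECIMAL(18,2) NOT NULL
    );

    -- Full rebuild shown for simplicity; an incremental merge would be more realistic.
    TRUNCATE TABLE order_history_flat;
    INSERT INTO order_history_flat (order_id, customer_id, order_date, item_count, order_total)
    SELECT o.order_id,
           o.customer_id,
           o.order_date,
           COUNT(oi.order_item_id),
           SUM(oi.quantity * oi.unit_price)
    FROM   orders o
    JOIN   order_item oi ON oi.order_id = o.order_id
    GROUP  BY o.order_id, o.customer_id, o.order_date;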
