Data warehousing: is it important to track historical data (SQL Server)? - data-warehouse

We started designing a process for detecting changes in our ERP database so we can build data warehouse databases. Since they don't like placing triggers on the ERP databases, or even enabling CDC (SQL Server), we are thinking of reading changes from the databases that are replicated to a central repository through transactional replication, and then keeping an extra copy into which the changes are merged (CDC will be enabled on that extra copy).
I wonder whether there is a scenario in which data that changes within, say, 15 minutes is important enough to force a change in our design. As planned, the design would not track every single change; it would only pick up the latest value after a period of time. For example, if a value on a row changes from A to B and then, a minute later, from B to C, replication will bring only that last value to the central repository. When we merge the table into our extra copy (which might still hold the value A), it will be updated straight to C, and the value B is lost.
Is there a good scenario in a data warehouse database where you need to track ALL the changes a table has gone through?
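For reference, the CDC setup on the extra copy would just be the standard one-time enable per database and per table. Roughly (the database and table names here are placeholders):

    -- Run once in the database that holds the merged extra copy.
    USE StagingCopy;
    EXEC sys.sp_cdc_enable_db;

    -- Enable capture for each table whose changes should be tracked.
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Orders',   -- placeholder table name
        @role_name     = NULL;        -- no gating role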

Taking care of historical data in a DW is important in some cases, such as:
When a dimension value changes. Say, a supplier merged with another and changed its commercial name.
When the fact table uses calculations derived from information outside the fact table, and that information changes. Say a conversion rate changes, for example.
When you need to run queries that reflect fact information in previous periods (versions of the fact table).
An example where every change matters might be a bank account's balance, a storage warehouse item count, a stock price, etc.
For your particular case, you should check with your customer how the system will be used and what exactly its benefits are, and design accordingly. How granular the captured change should be (every hour, day, etc.) is primarily your customer's call.
Some techniques for handling dimension data changes are described under Kimball's Slowly Changing Dimensions.
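For illustration, a Type 2 change (one of the Kimball SCD techniques) expires the current dimension row and inserts a new version, so both the old and the new commercial name survive. A minimal sketch with invented table and column names, where the @ parameters would come from the ETL step:

    -- Type 2 SCD sketch: close the current supplier row...
    UPDATE DimSupplier
    SET    RowEndDate = @ChangeDate,
           IsCurrent  = 0
    WHERE  SupplierBusinessKey = @SupplierKey
      AND  IsCurrent = 1;

    -- ...then insert a new row carrying the new commercial name.
    INSERT INTO DimSupplier
        (SupplierBusinessKey, CommercialName, RowStartDate, RowEndDate, IsCurrent)
    VALUES
        (@SupplierKey, @NewCommercialName, @ChangeDate, NULL, 1);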

In direct answer to your question: it depends on the application.
Examples:
The value is the description field of an item in some inventory, where the items themselves do not change (i.e. item ID X is always a sparkly-thingy). In this case, saving short-lived descriptions is probably not required.
The value is the last reading of a temperature sensor. If it goes over a certain value, action is taken to bring the temperature back down. In this case you certainly need to save each and every change.
This raises three points:
The second case, where every single change is required, shows very bad design. Such a system should surely insert new values with a timestamp into a table rather than update a single value (see the sketch at the end of this answer).
Bad designs do exist. Hence:
The amount of data being warehoused depends on the nature of the data.
a. Will you be able to derive any intelligence from your warehoused data?
b. Will you be able to know based on changes at the database level what happened at the business level?
c. What happens to your data when the database schema changes because you upgraded the ERP product?
I'm wondering whether saving a log of changes at the table level is even usable. You might be able to reverse engineer what a set of changes means and then save that to the warehouse, or actually get the ERP to "tell" you what it has done and save those changes.
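For the temperature-sensor case above, the append-only layout alluded to in point 1 would look roughly like this (a sketch; table and column names are invented):

    -- Append-only readings: every change is a new row, nothing is ever updated.
    CREATE TABLE SensorReading (
        SensorId    int           NOT NULL,
        ReadingTime datetime2     NOT NULL,
        Temperature decimal(5, 2) NOT NULL,
        CONSTRAINT PK_SensorReading PRIMARY KEY (SensorId, ReadingTime)
    );

    -- Each new reading is simply inserted with its own timestamp.
    INSERT INTO SensorReading (SensorId, ReadingTime, Temperature)
    VALUES (42, SYSUTCDATETIME(), 21.50);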

Related

Is it okay if Operational data store holds data for rolling time period?

We are planning to build an operational data store for the front-end users' data extraction requirements.
As far as I understand Kimball's approach to building an ODS\DW, it should hold data for the complete time period, not just a rolling time period.
The reason is that there could be a need to extract older data from the ODS\DW.
So I need your thoughts on this. How should I approach it?
I would create a snapshot table that holds the values for the rolling period for each day, and filter on the client side which snapshot to display.
Once the period is over, the final values can be stored in the permanent data mart.
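A minimal sketch of that idea (table and column names are invented): one snapshot row per day per entity, with older rows purged once they fall outside the rolling window and have been moved to the permanent data mart:

    -- One row per day per account; the client filters on SnapshotDate.
    CREATE TABLE DailySnapshot (
        SnapshotDate date          NOT NULL,
        AccountId    int           NOT NULL,
        Balance      decimal(18,2) NOT NULL,
        CONSTRAINT PK_DailySnapshot PRIMARY KEY (SnapshotDate, AccountId)
    );

    -- Keep only the rolling window, e.g. the last 90 days.
    DELETE FROM DailySnapshot
    WHERE SnapshotDate < DATEADD(day, -90, CAST(GETDATE() AS date));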
Kimball's approach for a data warehouse would be to load transactional data into any data warehouse if you can, because it is more flexible in terms of being rolled up. Certainly at the ODS stage you wouldn't want to 'pre-aggregate' your data if there could be a need to get hold of the older detail.
If you store both the transactional data and pre-aggregated versions of it (in aggregate fact tables, with indexes/views or with a cube, or just by filtering on the report side as the other answer suggests), you can get the best of both worlds.
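For example, an aggregate fact table rolled up from the transactional fact could be refreshed like this (a sketch; the table and column names are invented):

    -- Rebuild a monthly aggregate from the detailed transactional fact table.
    TRUNCATE TABLE FactSalesMonthly;

    INSERT INTO FactSalesMonthly (YearMonth, ProductKey, TotalQuantity, TotalAmount)
    SELECT YEAR(TransactionDate) * 100 + MONTH(TransactionDate) AS YearMonth,
           ProductKey,
           SUM(Quantity),
           SUM(Amount)
    FROM   FactSalesTransaction
    GROUP BY YEAR(TransactionDate) * 100 + MONTH(TransactionDate), ProductKey;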
(Note: Kimball's approach in fact does not require an ODS: they're fine if you want to build one, but their focus is on the dimensionally modelled data warehouse.)

data warehouse - non volatile vs change data capture

Some say a data warehouse is non-volatile, meaning that no updates of the data are allowed.
However, sometimes we have to capture changes in data, for example changes in transaction status.
Then change data capture comes as a solution.
My question is: should we rely on the fundamental concept of the data warehouse being non-volatile? If we should, what are the alternatives for capturing data changes?
Non-volatile doesn't mean "no updates". An accumulating snapshot fact table usually uses updates. Non-volatile pertains more to the notion that data is not discarded and is not temporary. Even if old data is archived, there is still a way to retrieve it at some point. At least this is how I understand the recommendation.
I prefer to avoid updates entirely, mostly by inserting "correction facts". For example, you have a snapshot fact table with an account balance. On a given day the account balance is 1000; a late arriving fact changes that balance and it now should be 1100. Instead of updating the previously inserted fact, I'd rather insert a correction fact with value 100, the difference between the previously known value and the new value. However, for an accumulating snapshot fact table this may not be possible or recommended. Tracking status changes is, usually, modeled through accumulating snapshots, which will require updates.
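In SQL terms, the correction in that example is just another insert rather than an update of the earlier row. A sketch with invented table and column names:

    -- Original fact: the balance was loaded as 1000 for that day.
    INSERT INTO FactAccountBalance (AccountKey, DateKey, BalanceAmount)
    VALUES (17, 20230115, 1000.00);

    -- A late-arriving fact says it should be 1100: insert the 100 difference
    -- as a correction fact instead of updating the original row.
    INSERT INTO FactAccountBalance (AccountKey, DateKey, BalanceAmount)
    VALUES (17, 20230115, 100.00);

    -- Queries sum the facts, so the balance now reads 1100.
    SELECT SUM(BalanceAmount) AS Balance
    FROM   FactAccountBalance
    WHERE  AccountKey = 17 AND DateKey = 20230115;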
When we say the data warehouse is non-volatile, that simply means data in the data warehouse is stable: more data is added, but data is never removed. This enables management to gain a consistent picture of the business.

calculating lots of statistics on database user data: optimizing performance

I have the following situation (in Rails 3): my table contains financial transactions for each user (users can buy and sell products). Since lots of such transactions occur, I present statistics related to the current user on the website, e.g. current balance, overall profit, how many products were sold/bought overall, averages, etc. (and the same on a per-month/per-year basis instead of just overall). Parts of this information are displayed to the user on many forms/pages so that the user can always see his current account information (different bits of the statistics are displayed on different pages, though).
My question is: how can I optimize database performance (and is it worth it)? Surely, if the user is just browsing, there is no need to re-calculate all of the values every time a new page is loaded unless a change to the underlying database has been made?
My first solution would be to store these statistics in their own table and update them once a financial transaction has been added/edited (in Rails maybe using :after_update ?). Taking this further, if, for example, a new transaction has been made, then I can just modify the average instead of re-calculating the whole thing.
My second idea would be to use some kind of caching (if this is possible?), or to store these values in the session object.
Which one is the preferred/recommended way, or is all of this a waste of time as the current largest number of financial transactions is in the range of 7000-9000?
You probably want to investigate summary tables, also known as materialized views.
This link may be helpful:
http://wiki.postgresql.org/wiki/Materialized_Views
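In PostgreSQL that can be as simple as a materialized view over the transactions table, refreshed whenever the underlying data changes (a sketch; the table and column names are assumptions about your schema):

    -- Per-user summary, precomputed so pages don't re-aggregate on every request.
    CREATE MATERIALIZED VIEW user_transaction_stats AS
    SELECT user_id,
           SUM(amount)  AS balance,
           COUNT(*)     AS transaction_count,
           AVG(amount)  AS average_amount
    FROM   transactions
    GROUP BY user_id;

    -- Re-run after new transactions arrive (e.g. from a scheduled job or an
    -- after_save callback).
    REFRESH MATERIALIZED VIEW user_transaction_stats;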

ASP.NET MVC 3 - Web Application - Efficiently Aggregate Data

I am running an ASP.NET MVC 3 web application and would like to gather statistics such as:
How often is a specific product viewed
Which search phrases typically return specific products in their result list
How often (for specific products) does a search result convert to a view
I would like to aggregate this data and break it down:
By product
By product by week
etc.
I'm wondering what the cleanest and most efficient strategies for aggregating the data are. I can think of a couple, but I'm sure there are many more:
Insert the data into a staging table, then run a job to aggregate the data and push it into permanent tables.
Use a queuing system (MSMQ/Rhino/etc.) and create a service to aggregate this data before it ever gets pushed to the database.
My concerns are:
I would like to limit the number of moving parts.
I would like to reduce the impact on the database. The fewer round trips and the less extraneous data stored, the better.
In certain scenarios (not listed) I would like the data to be somewhat close to real-time (accurate to the hour may be appropriate)
Does anyone have real-world experience with this, and if so, which approach would you suggest and what are the positives and negatives? If there is a better solution that I am not thinking of, I'd love to hear it...
Thanks
JP
I needed to do something similar in a recent project. We've implemented a full audit system in a secondary database; it tracks changes to every record in the live DB. Essentially every insert, update and delete actually writes two records, one in the live DB and one in the audit DB.
Since we have this data in real time in the audit DB, we use this second database to fill any reports we might need. One of the tricks I've found when working with a reporting DB is to forget about normalisation. Just create a table for each report you want, and have it carry just the data you want for that report. It's duplicating data, but the performance gains are worth it.
As to filling the actual data in the reports, we use a mixture. Daily reports are generated by a scheduled task at around 3am; ditto for the weekly and monthly reports, normally over weekends or late at night.
Other reports are generated on demand, using mostly the data since the last daily run, so it's not that many records, once again all from the secondary database.
I agree that you should create a separate database for your statistics, it will reduce the impact on your database.
You can go with your idea of having "staging" tables and "aggregate" tables; that way, when you want near-real-time data you go to the staging table, and when you want historical data you go to the aggregates (see the sketch below).
Finally, I would recommend you use an asynchronous call to save your statistics; that way your pages will not take a hit in response time.
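A rough sketch of the staging/aggregate split (all names invented): raw events land in a narrow staging table, and a scheduled job rolls them up into the permanent aggregate table:

    -- Narrow staging table: one row per product view, written asynchronously.
    CREATE TABLE StagingProductView (
        ProductId  int           NOT NULL,
        ViewedAt   datetime2     NOT NULL,
        SearchTerm nvarchar(200) NULL
    );

    -- Hourly job: roll staged rows up into the permanent aggregate, then clear
    -- staging (a real job would wrap this in a transaction or use a high-water mark).
    INSERT INTO ProductViewsByHour (ProductId, HourBucket, ViewCount)
    SELECT ProductId,
           DATEADD(hour, DATEDIFF(hour, 0, ViewedAt), 0) AS HourBucket,
           COUNT(*)
    FROM   StagingProductView
    GROUP BY ProductId, DATEADD(hour, DATEDIFF(hour, 0, ViewedAt), 0);

    TRUNCATE TABLE StagingProductView;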
I suggest that you create a separate database for this. The best way is to use BI techniques; SQL Server ships with separate services for BI.

Recommendations on handling object status fields in rails apps: store versus calculate?

I have a rails app that tracks membership cardholders, and needs to report on a cardholder's status. The status is defined - by business rule - as being either "in good standing," "in arrears," or "canceled," depending on whether the cardholder's most recent invoice has been paid.
Invoices are sent 30 days in advance, so a customer who has just been invoiced is still in good standing, one who is 20 days past the payment due date is in arrears, and a member who fails to pay his invoice more than 30 days after it is due would be canceled.
I'm looking for advice on whether it would be better to store the cardholder's current status as a field at the customer level (and deal with the potential update anomalies that arise when invoice records are updated without the corresponding cardholder record being updated), or whether it makes more sense to simply calculate the current cardholder status from the data in the database every time the status is requested (which could place a lot of load on the database and slow down the app).
Recommendations? Or other ideas I haven't thought of?
One important constraint: while it's unlikely that anyone will modify the database directly, there's always that possibility, so I need to try to put some safeguards in place to prevent the various database records from becoming out of sync with each other.
The storage of calculated data in your database is generally an optimisation. I would suggest that you calculate the value on every request and then monitor the performance of your application. If the fact that this data is not stored becomes an issue for you, then that is the time to refactor and store the value within the database.
Storing calculated values, particularly those that can affect multiple tables, is generally a bad idea for the reasons that you have mentioned.
When/if you do refactor and store the value in the DB then you probably want a batch job that checks the value for data integrity on a regular basis.
The simplest approach would be to calculate the current cardholder status based on data in the database every time the status is requested. That way you have no duplication of data, and therefore no potential problems with the duplicates becoming out of step.
If, and only if, your measurements show that this calculation is causing a significant slowdown, then you can think about caching the value.
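Calculated on demand, the status is a single query over each cardholder's most recent invoice. A sketch with assumed table and column names (PostgreSQL syntax):

    -- Derive the status from the latest invoice: paid or not yet due means
    -- in good standing; unpaid and up to 30 days late means in arrears;
    -- unpaid and more than 30 days late means canceled.
    SELECT c.id AS cardholder_id,
           CASE
               WHEN i.paid_at IS NOT NULL OR i.due_date >= CURRENT_DATE THEN 'in good standing'
               WHEN i.due_date >= CURRENT_DATE - INTERVAL '30 days'     THEN 'in arrears'
               ELSE 'canceled'
           END AS status
    FROM   cardholders c
    JOIN   invoices i
      ON   i.id = (SELECT i2.id
                   FROM   invoices i2
                   WHERE  i2.cardholder_id = c.id
                   ORDER BY i2.issued_at DESC
                   LIMIT 1);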
Recently I had a similar decision to make, and I decided to store the status as a field in the database. This is because I wanted to reduce SQL queries and it looks simpler. I chose to do it that way because I will very often need to get this status, and calculating it is (at least in my case) a bit complicated.
A possible problem with this is that it can get out of sync, so I added after_save and after_destroy callbacks to the child model to keep it synchronized. And of course, if somebody modified the database in a different way, it would cause some problems.
You can write a simple rake task that checks all statuses and, if needed, corrects them. You can run it in cron so you don't have to worry about it.
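The fix-up such a task runs can be a single statement that recomputes each status and overwrites any stored value that has drifted (a sketch; the table and column names, and the status rules, are assumed from the question above):

    -- Recompute the status from each cardholder's latest invoice and correct
    -- any stored value that no longer matches.
    UPDATE cardholders c
    SET    status = d.derived_status
    FROM  (
        SELECT DISTINCT ON (i.cardholder_id)
               i.cardholder_id,
               CASE
                   WHEN i.paid_at IS NOT NULL OR i.due_date >= CURRENT_DATE THEN 'in good standing'
                   WHEN i.due_date >= CURRENT_DATE - INTERVAL '30 days'     THEN 'in arrears'
                   ELSE 'canceled'
               END AS derived_status
        FROM   invoices i
        ORDER BY i.cardholder_id, i.issued_at DESC
    ) d
    WHERE  d.cardholder_id = c.id
      AND  c.status IS DISTINCT FROM d.derived_status;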
