Some say a data warehouse is non-volatile, meaning that no updates to the data are allowed.
However, sometimes we have to capture changes in data, for example changes in transaction status.
Change data capture then comes in as a solution.
My question is: should we rely on the fundamental concept that a data warehouse is non-volatile? If so, what are the alternatives for capturing data changes?
Non-volatile doesn't mean "no updates". An accumulating snapshot fact table usually uses updates. Non-volatile pertains more to the notion that data is not discarded; it's not temporary. Even if old data is archived, there's still a way to retrieve it at some point. At least this is how I understand the recommendation.
I prefer to avoid updates entirely, mostly by inserting "correction facts". For example, you have a snapshot fact table with an account balance. On a given day the account balance is 1000; a late-arriving fact changes that balance, and it should now be 1100. Instead of updating the previously inserted fact, I'd rather insert a correction fact with value 100, the difference between the previously known value and the new value. However, for an accumulating snapshot fact table this may not be possible or recommended. Tracking status changes is usually modeled through accumulating snapshots, which will require updates.
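As a rough illustration of the correction-fact idea, here is a minimal Python sketch of the 1000 to 1100 case. The in-memory "fact table" and the names `facts`, `insert_fact`, and `balance` are made up for this example; a real warehouse would do the same thing with INSERT statements and a SUM over the fact rows.

```python
from datetime import date

# Hypothetical append-only fact table: rows are (account_id, snapshot_date, amount).
# Rows are only ever appended, never updated in place.
facts = []

def insert_fact(account_id, snapshot_date, amount):
    """Append a fact row; existing rows are left untouched."""
    facts.append((account_id, snapshot_date, amount))

def balance(account_id, snapshot_date):
    """The balance for a day is the sum of the original fact and all corrections."""
    return sum(amount for acc, day, amount in facts
               if acc == account_id and day == snapshot_date)

day = date(2024, 1, 31)  # arbitrary example date
insert_fact("ACC-1", day, 1000)           # originally loaded balance

# A late-arriving fact says the balance should have been 1100.
# Insert only the difference instead of updating the first row.
correction = 1100 - balance("ACC-1", day)
insert_fact("ACC-1", day, correction)

print(correction)             # 100
print(balance("ACC-1", day))  # 1100
print(len(facts))             # 2 rows: the original fact plus one correction
```

Both rows stay in the table, so the history of what was known at load time is preserved.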
When we say the data warehouse is non-volatile, that simply means data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
The code I'm working on makes heavy use of TFDMemTables, and of clones of those tables created using CloneCursor.
Sometimes, under specific conditions which I am unable to identify, the source table and its clone become out of sync: the data between them may be different, the record count as well.
Calling Refresh on the cloned table puts things back in order.
From my understanding, CloneCursor is used to address the same underlying memory where data is stored, meaning alterations to the underlying data from either of the two pointers should be reflected in the other table, while still allowing the user to have separate filtering / record positioning per "view". So how can it possibly go out of sync?
I built a small simulator, where I can insert / delete / filter records in either the table or its clone, and observe the impact on the other one. Changes were reflected correctly.
Another downside of Refresh is that it slows the execution tremendously, if overused.
Has anyone faced similar issues or found explanations / documentation regarding this matter?
Edit:
To clarify what I mean by "out of sync": reading a value from the table using FieldByName will return X prior to Refresh, and Y post-Refresh. I was not able to reproduce this behavior in the simulator mentioned above.
I have an app that will get Core Data objects from a server. The number of items may be very large. What's the best way to limit the number of items that Core Data will store so I don't use too much space on the phone? I was thinking that for ordered items, in applicationWillTerminate I could mark all but the first N items as toDelete and then delete them the next time the app starts (per this article http://inessential.com/2014/02/22/core_data_and_deleting_objects). Any thoughts?
As often happens, what strategy is good depends on how people use your data. What data is more important to keep available? What is less important?
Keeping the first N items in an ordered relationship is a simple rule, and fairly easy to implement. But whether it's good for your app depends on what that data is, how a person would use it, and whether not having the rest of the related objects is likely to matter. You don't even need a toDelete flag, you just need to know the value of N. But keep in mind that you can't rely on applicationWillTerminate actually being called, so it's a bad place to put critical code.
Other strategies might include:
Delete the oldest data as measured by the length of time since it was downloaded. Local data matches what's newest on the server.
Delete the oldest data as measured by the length of time since the user has accessed it. Local data matches what the user is interested in, while also allowing for new data from the server.
These are more complex, requiring date tracking in your persistent store. Only you can really say whether the advantages are worth that complexity.
Starting out though, a more important question is: does this even matter? How many items is "very large"? Does a "very large" number of items translate into a lot of data, or just a lot of little items?
By using both the synchronous=OFF and journal_mode=MEMORY options, I am able to reduce the time per update from 15 ms to around 2 ms, which is a major performance improvement. These updates happen one at a time, so many other optimizations (like wrapping a bunch of them in a transaction) are not applicable.
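For reference, this is roughly how the two options are set, sketched here with Python's built-in sqlite3 module rather than the iOS API (the PRAGMA statements themselves are the same SQL either way). The helper name `open_fast`, the table, and the file name are made up for the example.

```python
import os
import sqlite3
import tempfile

def open_fast(path):
    """Open a database with the two speed-oriented (and less safe) settings."""
    # isolation_level=None puts the sqlite3 module in autocommit mode,
    # so every statement below is its own implicit transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA synchronous = OFF")      # do not wait for data to reach the disk
    conn.execute("PRAGMA journal_mode = MEMORY")  # keep the rollback journal in RAM
    return conn

# Hypothetical demo database in a temporary directory.
tmpdir = tempfile.mkdtemp()
conn = open_fast(os.path.join(tmpdir, "demo.db"))
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO t (status) VALUES ('new')")
conn.execute("UPDATE t SET status = 'done' WHERE id = 1")  # one update at a time

sync = conn.execute("PRAGMA synchronous").fetchone()[0]
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
status = conn.execute("SELECT status FROM t WHERE id = 1").fetchone()[0]
print(sync, mode, status)  # 0 memory done
conn.close()
```

Both settings apply per connection and have to be issued again each time the database is opened.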
According to the SQLite documentation, the DB can go 'corrupt' in the worst case if there is a power outage of some type. However, isn't the worst thing that can happen that the data is lost, or possibly that part of a transaction is lost (which I guess is a form of corruption)? Is it really possible for arbitrary corruption to occur with either of these options? If so, why?
I am not using any explicit transactions, so partially written data from transactions is not a concern, and I can handle losing data once in a blue moon. But if 'corruption' means that all the data in the DB can be randomly changed in an unpredictable way, that would be a strong reason not to use these options.
Does anyone know what the real worst-case behavior would be on iOS?
Tables are organized as B-trees with the rowid as the key.
If some writes get lost while SQLite is updating the tree structure, the entire table might become unreadable.
(The same can happen with indexes, but those could be simply dropped and recreated.)
Data is organized in pages (typically 1 KB or 4 KB). If some page update gets lost while some tree is being reorganized, all the data in these pages (i.e., some random rows from the table with nearby rowid values) might become corrupted.
If SQLite needs to allocate a new page, and that page contains plausible data (e.g., deleted data from the same table), and the writing of that page gets lost, then you have incorrect data in the table, without the ability to detect it.
We started designing a process for detecting changes in our ERP database in order to build data warehouse databases. Since they don't like placing triggers on the ERP databases or even enabling CDC (SQL Server), we are thinking of reading changes from the databases that get replicated to a central repository through transactional replication, then having an extra copy into which the changes are merged (we will have CDC on the extra copy)...
I wonder whether data that changes within, let's say, 15 minutes could be important enough to warrant a change in our design. The way we plan to design this, we would not be able to track every single change; we would only get the latest value after a period of time. For example, if a value on a row changes from A to B and then, 1 minute later, from B to C, the replication system will bring only that last value to the central repository. We then merge the table with our extra copy: that extra copy might have had the value A, it will be updated with C, and we have lost the value B.
Is there a good scenario in a data warehouse database where you need to track ALL the changes a table has gone through?
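To make the A, B, C scenario concrete, here is a small Python simulation. It is only a model of the situation described: the timings, the sync schedule, and the helper names `value_at` and `captured_by_periodic_merge` are assumptions made for the example, not part of SQL Server replication or CDC.

```python
# Hypothetical timeline of one row in the ERP database: (minute, new_value).
# The value is A at minute 0, becomes B at minute 16 and C one minute later.
source_changes = [(0, "A"), (16, "B"), (17, "C")]

# Assumed schedule: the extra copy is merged every 15 minutes.
sync_times = [15, 30]

def value_at(changes, minute):
    """Latest value written at or before the given minute."""
    current = None
    for t, value in changes:
        if t <= minute:
            current = value
    return current

def captured_by_periodic_merge(changes, sync_times):
    """Values that CDC on the extra copy would record, one per observed change."""
    captured = []
    for t in sync_times:
        value = value_at(changes, t)
        if not captured or captured[-1] != value:
            captured.append(value)
    return captured

seen = captured_by_periodic_merge(source_changes, sync_times)
full_history = [value for _, value in source_changes]

print(seen)          # ['A', 'C']  (B never reaches the extra copy)
print(full_history)  # ['A', 'B', 'C']
```

Whether losing B matters is exactly the business question being asked; the sketch only shows that the periodic merge cannot recover it afterwards.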
Taking care of historical data in a DW is important in some cases such as:
When the dimension value changes. Say, a supplier merged with another and changed their commercial name
When the fact table uses calculations derived based on other information outside the fact table that changes. Say conversion rate changes for example.
When you need to run queries that reflect fact information in previous periods (versions of the fact table).
An example where every change matters may be a bank account's balance, a storage warehouse item count, a stock price, etc.
For your particular case, you should check with your customer how the system will be used and what exactly its benefits are, and design accordingly. How granularly changes should be captured (every hour, day, etc.) is primarily your customer's call.
Some techniques for handling changes in dimension data are described in Kimball's Slowly Changing Dimensions.
In direct answer to your question: depends on the application.
Examples:
The value is the description field of an item in some inventory, where the items themselves do not change (i.e. item ID X is always a sparkly-thingy). In this case saving short lived descriptions is probably not required.
The value is the last reading of a temperature sensor. If it goes over a certain value, action is taken to bring the temperature back. In this case you certainly need to save each and every change.
This raises three points:
The second case where every single change is required shows very bad design. Such a system would surely insert new values with a time stamp into a table and not update a single value.
Bad designs do exist. Hence:
The amount of data being warehoused depends on the nature of the data.
a. Will you be able to derive any intelligence from your warehoused data?
b. Will you be able to know based on changes at the database level what happened at the business level?
c. What happens to your data when the database schema changes because you upgraded the ERP product?
I'm wondering whether saving a log of changes on the table level is usable. You might be able to reverse engineer what a set of changes means and then save that to the warehouse, or actually get the ERP to "tell" you what it has done and save those changes.
Is it possible to store all changes of a set by using some means of logical paths - of the changes as they occur - such that one may revert the changes by essentially "stepping back"? I assume that something would need to map the changes as they occur, and the process of reverting them would thus ultimately be linear.
Apologies for any incoherence; this isn't specific to any particular language. Rather, it's a problem of memory: can a set of finite size (e.g. some store of user input) that is changed continuously (e.g. at any given time, for any amount of time, with no limit on how much it can be changed) be mapped procedurally, such that new, future changes are assumed to be the consequence of prior changes (in a second, mirror store that can be used to revert the state of the set all the way to its initial state)?
You might want to look at some functional data structures. Functional languages, like Erlang, make it easy to roll back to an earlier state, since changes are always made on new data structures instead of mutating existing ones. While this feature can be used repeatedly internally, Erlang programming typically uses it abundantly at the top level of a "process", so that on any kind of failure it aborts both the processing and all the changes in their entirety simply by throwing an exception (in a non-functional language, using mutable data structures, you'd be able to throw an exception to abort, but restoring the originals would be your program's job, not the runtime's). This is one reason that Erlang has a solid reputation.
Some of this functional style of programming is usefully applied to non-functional languages, in particular, use of immutable data structures, such as immutable sets, lists, or trees.
Regarding immutable sets, for example, one might design a functionally-oriented data structure where modifications always generate a new set given some changes and an existing set (a change set consisting of additions and removals). You'd leave the old set hanging around for reference (by whomever); languages with automatic garbage collection reclaim the old ones when they're no longer being used (referenced).
You can put an id or tag into your set data structure; this way you can do some introspection to see which data structure id someone has a hold of. You can also capture the id of the base from which each new version was generated; this gives you some history or lineage.
If desired, you can also capture a reference to the entire old data structure in the new one, or you can maintain a global list of all of the sets as they are being generated. If you do, however, you'll have to take over more responsibility for storage management, as an automatic collector will probably not find any unused (unreferenced) garbage to collect without some additional help.
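A minimal Python sketch of such a set follows. The class name `VersionedSet`, its fields, and the `change` and `lineage` methods are invented for this illustration; it keeps a reference to the full parent version, which is the "hold onto the old data structure" variant described above.

```python
from dataclasses import dataclass
from itertools import count
from typing import FrozenSet, Iterable, List, Optional

_next_id = count(1)  # hypothetical id generator for versions

@dataclass(frozen=True)
class VersionedSet:
    items: FrozenSet[str]
    version_id: int
    parent: Optional["VersionedSet"] = None  # the version this one was based on

    def change(self, add: Iterable[str] = (), remove: Iterable[str] = ()) -> "VersionedSet":
        """Return a NEW version; this version is never mutated."""
        new_items = (self.items | frozenset(add)) - frozenset(remove)
        return VersionedSet(new_items, next(_next_id), self)

    def lineage(self) -> List[int]:
        """Ids from this version back to the initial one."""
        node, ids = self, []
        while node is not None:
            ids.append(node.version_id)
            node = node.parent
        return ids

v1 = VersionedSet(frozenset(), next(_next_id))
v2 = v1.change(add=["a", "b"])
v3 = v2.change(add=["c"], remove=["a"])

print(sorted(v3.items))         # ['b', 'c']
print(sorted(v3.parent.items))  # ['a', 'b']  (stepping back one change)
print(v3.lineage())             # [3, 2, 1]
```

Because every version references its parent, nothing here is ever garbage collected while the newest version is alive; dropping the `parent` field and keeping only a parent id would trade the ability to step back for lower memory use.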
Database designs do some of this in their transaction controllers. For the purposes of your question, you can think of a database as a glorified set. You might look into MVCC (Multi-Version Concurrency Control) as one example that is reasonably well written up in the literature. This technique keeps old snapshot versions of data structures around (temporarily), meaning that mutations always appear to be in new versions of the data. An old snapshot is maintained until no active transaction references it; then it is discarded.
When two concurrently running transactions both modify the database, they each get a new version based on the same current and latest data set. (The transaction controller knows exactly which version each transaction is based on, though the transaction's client doesn't see the version information.) Assuming both concurrent transactions choose to commit their changes, the versioning control in the transaction controller recognizes that the second committer is trying to commit a change set that is not a logical successor to the first (since both change sets, as we postulated above, were based on the same earlier version). If possible, the transaction controller will merge the changes as if the 2nd committer had really been working from the newer version committed by the first committer. (There are varying definitions of when this is possible; MVCC says it is when there are no write conflicts, which is a less-than-perfect answer but fast and scalable.) If it is not possible, it will abort the 2nd committer's transaction and inform the 2nd committer thereof (they then have the opportunity, should they like, to retry their transaction starting from the newer base).
Under the covers, the various snapshot versions in flight by concurrent transactions will probably share the bulk of the data (with some transaction-specific change sets that are consulted first) in order to make the snapshots cheap.
There is usually no API provided to access older versions, so in this domain, the transaction controller knows that as transactions retire, the original snapshot versions they were using can also be (reference counted and) retired.
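The commit rule described above can be sketched in a few lines of Python. This is a toy model under loud assumptions: the `Store` and `Txn` classes are invented for the example, each snapshot is a full copy of the data (real systems share data between versions), and the only conflict check is "was any key I wrote committed by someone else after my snapshot was taken".

```python
class Store:
    """Toy versioned key-value store; not how any particular database does it."""
    def __init__(self):
        self.version = 0        # version number of the latest committed state
        self.data = {}          # latest committed data
        self.last_written = {}  # key -> version in which it was last committed

    def begin(self):
        # Each transaction gets a private snapshot of the current version.
        return Txn(self, self.version, dict(self.data))

class Txn:
    def __init__(self, store, base_version, snapshot):
        self.store = store
        self.base_version = base_version
        self.snapshot = snapshot
        self.writes = {}        # transaction-specific change set, consulted first

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        return self.snapshot.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Write conflict: someone committed one of our keys after our snapshot.
        for key in self.writes:
            if self.store.last_written.get(key, 0) > self.base_version:
                return False    # abort; the caller may retry from a newer base
        # No conflict: merge our change set on top of the latest version.
        self.store.version += 1
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.last_written[key] = self.store.version
        return True

store = Store()
t1, t2 = store.begin(), store.begin()  # both based on version 0
t1.write("x", 1)
t2.write("y", 2)
print(t1.commit(), t2.commit())        # True True  (different keys, merged)

t3, t4 = store.begin(), store.begin()  # both based on version 2
t3.write("x", 10)
t4.write("x", 20)
print(t3.commit(), t4.commit())        # True False (same key, 2nd committer aborts)
print(store.data)                      # {'x': 10, 'y': 2}
```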
Another area where this is done is append-only files. Logging is a way of recording changes; some databases are based 100% on log-oriented designs.
BerkeleyDB has a nice log structure. Though used mostly for recovery, it does contain all the history so you can recreate the database from the log (up to the point you purge the log in which case you should also archive the database). Again someone has to decide when they can start a new log file, and when they can purge old log files, which you'd do to conserve space.
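The general idea of recreating state from a log can be sketched in Python as follows. The log format and the names `record` and `replay` are made up for this example and have nothing to do with BerkeleyDB's actual log format; the point is only that replaying a prefix of an append-only log yields any earlier state of a set.

```python
log = []  # append-only list of (operation, item) entries; hypothetical format

def record(op, item):
    """Append a change to the log; earlier entries are never modified."""
    log.append((op, item))

def replay(entries):
    """Rebuild the set by applying the given log entries in order."""
    state = set()
    for op, item in entries:
        if op == "add":
            state.add(item)
        elif op == "remove":
            state.discard(item)
    return state

record("add", "x")
record("add", "y")
record("remove", "x")

current = replay(log)        # state after all changes
previous = replay(log[:-1])  # one step back
initial = replay(log[:0])    # all the way back to the start

print(current)           # {'y'}
print(sorted(previous))  # ['x', 'y']
print(initial)           # set()
```

Stepping back is linear in the length of the log, and purging old entries means giving up the ability to reach the states before the purge point unless a snapshot of the set is archived alongside.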
These database techniques can be applied in memory as well. (Nothing is free, though, of course ;)
Anyway, yes, there are fields where this is done.
Immutable data structures help preserve history, by simply keeping old copies; changes always go to new copies. (And efficiency techniques can make this not as bad as it sounds.)
Ids can help you understand lineage without necessarily holding onto all the old copies.
If you do want to hold onto all the old copies, you have to look at your domain design to understand when, how, and whether old data structures can still be accessed, with an eye toward how to eventually reclaim them. You'll most likely have to get involved in defining how they get released, if ever, or how they get archived for posterity, at the cost of slower access later.