I have always wondered whether the FactInternetSale table in AdventureWorksDW is an accumulating snapshot table, since it has a ShipDateKey in it.
The AdventureWorks OLTP documentation says that the ShipDate of SalesOrderHeader is the date that "the order was shipped to customer". I interpret this to mean that when the order is shipped, the ship date is updated.
That also means the corresponding rows in the DW FactInternetSale need to be updated. The ship date marks an important milestone of an order, and this is exactly the behavior of an accumulating snapshot fact table.
So should this table be considered an accumulating snapshot fact table? If so, is it a problem that there is no real transaction fact table?
In Kimball's Data Warehouse Toolkit, this kind of problem is handled by strictly separating the order transaction fact table from the shipping fact table: the order transaction fact table contains only the information recorded when the order is placed and is never updated, and its dates are always expected dates, not actual dates, while the shipping fact table contains the true ship date of each item. On top of those there is an accumulating snapshot fact table that contains all the important milestones of an order, not only the ship date but the other key milestones as well. By having the dates of the important milestones, we can of course derive the current status of the order.
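As I understand it, such an accumulating snapshot would look roughly like this (just a sketch with assumed names, not the actual AdventureWorks schema):

-- Sketch: an accumulating snapshot fact for an order, with one date key per milestone.
-- The milestone keys are updated in place as the order progresses.
CREATE TABLE FactOrderFulfillment (
    OrderKey        INT,
    OrderDateKey    INT,             -- set when the order is placed
    ShipDateKey     INT,             -- updated when the order ships
    DeliveryDateKey INT,             -- updated when the order is delivered
    OrderAmount     DECIMAL(18,2)
);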
In my personal opinion, an order fact table that does not contain the current status of the order is practically useless. What is the point of knowing the total amount of orders if you cannot tell how much comes from fulfilled (shipped) orders and how much from unfulfilled ones? In my experience, users (data analysts) always end up using the accumulating snapshot table for their work, because a predicate on "current status" is never absent from their queries.
In my real-world work, I usually design this order fact table as an accumulating snapshot from the start, skipping the strictly separated transaction fact table that Kimball uses, because I feel that approach is time-consuming and adds little value. My transaction fact tables are usually just the actions performed on the order (for example: shipping).
What do you think about this?
No, it's not an accumulating snapshot fact table
I have a requirement to create a fact table which stores the granted_share_qty awarded to employees. There are surrounding dimensions like SPS Grant_dim, which stores info about each grant, SPS Plan Dim, which stores info about the plan, SPS Client Dim, which stores info about the employer, and SPS Customer Dim, which stores info about the customer. The DimKeys (surrogate keys) and DurableKeys (supernatural keys) from each dimension are added to the fact.
The reporting need is "as-of", i.e. on any given date one should be able to see the granted_share_qty as of that date (similar to an account balance as of a date), along with point-in-time values of a few attributes from the Grant, Plan, Client and Customer dimensions.
First, we thought of creating a daily snapshot table where the data is repeated every day in the fact (unless the source sends any changes). However, since there could be more than 100 million grant records, repeating this every day was almost impossible. Moreover, the granted_share_qty doesn't change that often, so why copy it every day?
So instead of a daily snapshot we thought of adding EFFECTIVE_DT and EXPIRATION_DT columns to the fact table (like a TIMESPAN PERIODIC SNAPSHOT table, if such a thing exists).
This reduces the volume and perfectly satisfies a reporting need like "get me the granted_qty and the grant, client, plan and customer details as of 10/01/2022", which translates to "select granted_qty from fact where 10/01/2022 between EFFECTIVE_DT and EXPIRATION_DT and Fact.DimKeys = Dim.DimKeys".
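For concreteness, the query we have in mind looks roughly like this (a sketch only; the table and column names are illustrative, not our real schema):

-- As-of query sketch: pick the fact row whose validity window covers the report date,
-- then join to the dimensions on their surrogate keys.
SELECT f.granted_share_qty,
       g.grant_attributes,          -- placeholder for the point-in-time grant attributes
       c.client_attributes          -- placeholder for the point-in-time client attributes
FROM   FACT_GRANT  f
JOIN   DIM_GRANT   g ON g.GRANT_DIM_KEY  = f.GRANT_DIM_KEY
JOIN   DIM_CLIENT  c ON c.CLIENT_DIM_KEY = f.CLIENT_DIM_KEY
WHERE  DATE '2022-10-01' BETWEEN f.EFFECTIVE_DT AND f.EXPIRATION_DT;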
The challenge, however, is to keep the DimKeys of the fact in sync with the DimKeys of the dimensions. Even if the fact doesn't change, any DimKey change due to versioning in any of the dimensions needs to be tracked and versioned in the fact. This has become an implementation nightmare.
(To make things worse, the dimensions can undergo multiple intraday changes, so these have to be tracked in near real time :-( )
Any thoughts on how to handle such situations will be highly appreciated. (Database: Snowflake)
P.S. We could remove the DimKeys from the fact and use DurableKeys + date to join the facts to the Type 2 dimensions, but that proposal is not favored/approved as of now.
Thanks
Sunil
First, we thought of creating a daily snapshot table where the data is repeated every day in the fact (unless the source sends any changes). However
Stop right there. Whenever you know the right model but think it's unworkable for some reason, try harder. At a minimum, test your assumption that it would be "too much data", and consider not materializing the snapshot but leaving it as a view and computing it at query time.
... moreover, the granted_share_qty doesn't change that often, so why copy it every day?
And there's your answer. Use a monthly snapshot instead of a daily snapshot, and you've divided the data by 30.
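As a sketch of what I mean, you could keep your compact effective/expiration fact and expose the snapshot as a view over a month-end date dimension, something like this (names are illustrative):

-- Monthly snapshot exposed as a view over the timespan fact, computed at query time.
-- DIM_MONTH_END is assumed to hold one row per month-end date.
CREATE VIEW FACT_GRANT_MONTHLY_SNAPSHOT AS
SELECT d.month_end_date,
       f.grant_dim_key,
       f.granted_share_qty
FROM   DIM_MONTH_END d
JOIN   FACT_GRANT f
  ON   d.month_end_date BETWEEN f.effective_dt AND f.expiration_dt;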
I am building a data warehouse for my company. Recently I realized that there are some holes (potentially very dangerous ones) in my SCD Type 2 dimension implementation, so I have to review it.
Currently the "fromdate" of an SCD Type 2 dimension row is the date it arrived in the data warehouse, or the date it replaced an older record, and the "todate" is usually null, or the date a new record with the same natural key came in and replaced the old one.
Currently, when loading facts, I get the surrogate key for each fact by using the natural key and the condition iscurrent = true or todate = null.
I just realized that this doesn't guarantee the correctness of the surrogate key for the fact. For example:
What if the change happened at 11:00 AM? That means half of the transactions that occurred during that day relate to the old dimension record and half relate to the new one. But when the data comes to the data warehouse, all the transactions of that day will be treated as related to the new dimension record, which is not correct.
If we use the transaction datetime to get the surrogate key more precisely, then when loading fact records, all the transactions that occurred before the day the dimension row arrived in the data warehouse will not be able to find any matching dimension surrogate key. For example: I created the dimension table yesterday, so every start date in that SCD 2 dimension has a minimum value of yesterday, while nearly all the old transactions (which haven't been loaded to the data warehouse yet) happened before that day. So they will have no surrogate key. Such a paradox.
I even tried to make it more precise by adjusting the start date of a row, passing in the create date of that dimension row from the OLTP system. But I still cannot find a fully correct way to do it. For one thing, the datetimes in the data warehouse and the OLTP system differ (because they might be in different GMT+X time zones)...
And many other problems...
I understand that if we want to track history with perfectly precise accuracy, the only way is to implement it in the OLTP system by writing the related entity version directly into the transaction records; the data warehouse cannot do it alone. But I still feel that there are too many holes in the SCD 2 concept, or that I didn't implement SCD Type 2 correctly. So please tell me whether the problems above are normal, or point out the mistakes in my understanding.
If time matters, use a datetime, not a date. But first consider whether time actually matters.
Again, solved by using a datetime.
Decide what timezone your data warehouse is in. Options include:
- UTC
- The source system's timezone
- The local system's timezone
- A timezone-aware data type
Just a note: I suggest you use 2099-01-01 rather than NULL as your end date for the current record. Then you can easily use between when searching for a matching dimension member.
You'll need to be more specific about the "many other problems".
Edit:
One observation based on the comments so far: don't use Is_Current to look up the surrogate key; use the business key in the transaction and the transaction datetime between the dimension start and end dates.
This means you can reload data from three months ago and it will pick up the correct dimension member (not the current one).
This reinforces my other comment not to use NULL for the active record's end date. Instead, use a datetime far in the future, so you can always use BETWEEN on these dates and get a result.
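A rough sketch of that lookup (table and column names are illustrative):

-- Surrogate key lookup sketch: match on the business key and require the
-- transaction datetime to fall inside the dimension row's validity window.
SELECT f.transaction_id,
       d.customer_sk                -- surrogate key picked up from the dimension
FROM   staging_transactions f
JOIN   dim_customer d
  ON   d.customer_bk = f.customer_bk
 AND   f.transaction_datetime BETWEEN d.start_datetime AND d.end_datetime;
-- Because the active row ends at a far-future datetime (e.g. 2099-01-01) rather than NULL,
-- the BETWEEN predicate always finds exactly one row.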
Why do you need surrogate keys in factless fact tables (or many-to-many dimensional relationship tables)?
A few circumstances in which assigning a surrogate key to the rows of a fact table is beneficial:
Sometimes the business rules of the organization legitimately allow multiple identical rows to exist for a fact table. Normally as a designer, you try to avoid this at all costs by searching the source system for some kind of transaction time stamp to make the rows unique. But occasionally you are forced to accept this undesirable input. In these situations it will be necessary to create a surrogate key for the fact table to allow the identical rows to be loaded.
Certain ETL techniques for updating fact rows are only feasible if a surrogate key is assigned to the fact rows. Specifically, one technique for loading updates to fact rows is to insert the rows to be updated as new rows, then to delete the original rows as a second step, within a single transaction (see the sketch after this list). The advantages of this technique from an ETL perspective are improved load performance, improved recovery capability and improved audit capabilities. The surrogate key for the fact table rows is required because multiple identical primary keys will often exist for the old and new versions of the updated fact rows between the insert of the updated row and the delete of the old row.
A similar ETL requirement is to determine exactly where a load job was suspended, either to resume loading or to back out the job entirely. A sequentially assigned surrogate key makes this task straightforward.
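A minimal sketch of the insert-then-delete technique from the second point, assuming a fact_sk surrogate key on the fact and a staging table that carries both the old and the newly assigned keys (names are illustrative):

-- Update fact rows by inserting the new versions first, then deleting the old
-- versions, all in one transaction. fact_sk keeps the old and new versions apart
-- even while both exist with the same dimensional key combination.
BEGIN TRANSACTION;

INSERT INTO fact_sales (fact_sk, order_key, product_key, sales_amount)
SELECT new_fact_sk, order_key, product_key, corrected_sales_amount
FROM   staging_fact_updates;

DELETE FROM fact_sales
WHERE  fact_sk IN (SELECT old_fact_sk FROM staging_fact_updates);

COMMIT;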
Sometimes my app will add many Realm records at once. I need to be able to consistently keep them in the same order.
The documentation recommends that I use NSDate:
Another common motivation for auto-incrementing properties is to preserve order of insertion. In some situations, this can be accomplished by appending objects to a List or by using a createdAt property with a default value of NSDate().
However, since records are sometimes added so quickly, the dates are not always unique, especially considering that Realm stores NSDate only to second accuracy.
Is there something I'm missing about the suggestion in the documentation? Maybe the documentation wasn't considering records added in quick succession? If so, would it be recommended to keep an Int position property and to always query for the last record at the moment of adding a new one, so as to ensure sequential positions? However, querying for the last record in that case won't return the previous record unless you've also added and finalized a write, which is wasteful if you need to add a lot of records. Then it would require batch-create logic, which is unfortunate.
However, since records are sometimes added so quickly, the dates are not always unique, especially considering that Realm stores NSDate only to second accuracy.
The limitation on date precision was addressed back in Realm v0.101. Realm can now represent dates with greater precision than NSDate.
However, querying for the last record in that case won't return the previous record unless you've also added and finalized a write, which is wasteful if you need to add a lot of records.
It's not necessary to commit a write transaction for queries on the same thread to see data that you've added during the write transaction.
Is there something I'm missing about the suggestion in the documentation?
You skipped over the first suggestion: appending objects to a List. Lists in Realm are inherently ordered, so you do not need to find a way to create unique, ordered values. Simply append the new object to the list, and rely on the list's order to determine the order in which the objects were added. This also has the advantage of being safe when using Realm Mobile Platform's synchronization features, as incrementing fields can generate duplicates on different devices and timestamps may not be reliable.
We are designing a staging layer to handle incremental loads. I want to start with a simple scenario to design the staging.
In the source database there are two tables, e.g. tbl_Department and tbl_Employee. Both tables load into a single table in the destination database, e.g. tbl_EmployeRecord.
The query which loads tbl_EmployeRecord is:
SELECT EMPID,EMPNAME,DEPTNAME
FROM tbl_Department D
INNER JOIN tbl_Employee E
ON D.DEPARTMENTID=E.DEPARTMENTID
Now we need to identify the incremental changes in tbl_Department and tbl_Employee, store them in staging, and load only the incremental data to the destination.
The columns of the tables are:
tbl_Department : DEPARTMENTID,DEPTNAME
tbl_Employee : EMPID,EMPNAME,DEPARTMENTID
tbl_EmployeRecord : EMPID,EMPNAME,DEPTNAME
Kindly suggest how to design the staging for this to handle Insert, Update and Delete.
Identifying Incremental Data
The incremental loading needs to be based on some segregating information present in your source tables. Such information helps you identify the incremental portion of the data that you will load. Oftentimes, the load date or last-updated date of the record is a good choice for this.
Consider this: your source table has a date column that stores both the date a record was inserted and the date when any update was made to it. On any given day during your staging load, you can use this date to identify which records have been newly inserted or updated since your last staging load, and consider only those changed/updated records as your incremental delta.
Given your table structures, I am not sure which column you could use for this. ID columns will not help, because if a record gets updated you won't know it.
Maintaining Load History
It is important to store information about how much you have loaded today so that you can load the next part in the next run. To do this, maintain a staging table, often called a batch load details table. That table will typically have a structure such as the one below:
BATCH ID | START DATE | END DATE | LOAD DATE | STATUS
------------------------------------------------------
1 | 01-Jan-14 | 02-Jan-14 | 02-Jan-14 | Success
You need to insert a new record into this table every day before you start the data load. The new record will have a start date equal to the end date of the last successful load, and a null status. Once the load is successful, you update the status to 'Success'.
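For example, the bookkeeping might look roughly like this (a sketch only; the exact syntax depends on your database, and the column names follow the table above):

-- Before the load: open a new batch whose window starts where the last successful one ended.
INSERT INTO BATCH_LOAD (BATCH_ID, START_DATE, END_DATE, LOAD_DATE, STATUS)
SELECT MAX(BATCH_ID) + 1,
       MAX(END_DATE),            -- start from the end of the last successful batch
       CURRENT_DATE,
       CURRENT_DATE,
       NULL
FROM   BATCH_LOAD
WHERE  STATUS = 'Success';

-- After the load finishes successfully: close the open batch.
UPDATE BATCH_LOAD
SET    STATUS = 'Success'
WHERE  STATUS IS NULL;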
Modifying the Data Extraction Query to Take Advantage of the Batch Load Table
Once you maintain your loading history as above, you can include this table in your extraction query:
SELECT EMPID,EMPNAME,DEPTNAME
FROM tbl_Department D
INNER JOIN tbl_Employee E
ON D.DEPARTMENTID=E.DEPARTMENTID
WHERE E.load_date >= (SELECT max(START_DATE) FROM BATCH_LOAD WHERE status IS NULL)
What I am going to suggest is by no means a standard. In fact, you should evaluate my suggestion carefully against your requirements.
Suggestion
Use incremental loading for transaction data, not for master data. Transaction data are generally higher in volume and can easily be segregated into incremental chunks. Master data tend to be more manageable and can be loaded in full every time. In the above example, I am assuming your Employee table behaves like transaction data whereas your Department table is your master data.
I trust this article on incremental loading will be very helpful for you
I'm not sure what database you are using, so I'll just talk in conceptual terms. If you want to add tags for specific technologies, we can probably provide specific advice.
It looks like you have 1 row per employee and that you are only keeping the current record for each employee. I'm going to assume that EMPIDs are unique.
First, add a field to the query that currently populates the dimension. This field will be a hash of the other fields in the table (EMPID, EMPNAME, DEPTNAME). You can create a view, populate a new staging table, or just use the query. Also add this same hash field to the dimension table. Basically, the hash is an easy way to generate a field that is unique for each record and efficient to compare. That lets you detect the three cases below (a sketch of the comparison follows the list).
Inserts: These are the records for which the EMPID does not already exist in the dimension table but does exist in your staging query/view.
Updates: These are the records for which the EMPID exists in both the staging query/view and the dimension table, but the hash field doesn't match.
Deletes: These are the records for which the EMPID exists in the dimension but does not exist in the staging query/view.
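To illustrate the comparison, a sketch only (how you compute the hash and the staging object names depend on your platform; stg_employee stands for the staging query/view described above):

-- Inserts: EMPID in staging but not in the dimension.
SELECT s.EMPID, s.EMPNAME, s.DEPTNAME, s.row_hash
FROM   stg_employee s
LEFT JOIN dim_employee d ON d.EMPID = s.EMPID
WHERE  d.EMPID IS NULL;

-- Updates: EMPID in both, but the hashes differ.
SELECT s.EMPID, s.EMPNAME, s.DEPTNAME, s.row_hash
FROM   stg_employee s
JOIN   dim_employee d ON d.EMPID = s.EMPID
WHERE  d.row_hash <> s.row_hash;

-- Deletes: EMPID in the dimension but no longer in staging.
SELECT d.EMPID
FROM   dim_employee d
LEFT JOIN stg_employee s ON s.EMPID = d.EMPID
WHERE  s.EMPID IS NULL;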
If this will be high-volume, you may want to create new tables to hold the records that should be inserted and the records that should be updated. Once you have identified the records, you can insert/update them all at once instead of one-by-one.
It's a bit uncommon to delete lots of records from a data warehouse, as it is typically used to keep history. I would suggest perhaps creating a status column or a bit field that indicates whether the record is active or deleted in the source. Of course, how you handle deletes should depend on your business needs/reporting requirements. Just remember that if you do a hard delete you can never get that data back if you decide you need it later.
Updating the existing dimension in place (rather than creating historical records for each change) is called a Type 1 dimension in dimensional modeling terms. This is fairly common. But if you decide you need to keep history, you can use the hash to help you create the SCD Type 2 records.