Transaction lifecycle tracking in data warehouse


How do you store facts in which the data is related, and how do you configure the measure? For example, I have a data warehouse that tracks the lifecycle of an order, which changes state from ordered, to shipped, to refunded. A state like 'refunded' is not always present. In my model I am using the transaction store model, so every time the order changes state, another row is added to the fact table. For an order that was placed in April and refunded in May, there will be two rows: one with a state of 'ordered' and another with a state of 'refunded'. If the user wants to see all the orders placed in April, and how many of those orders were refunded, how would he see that? Is this an MDX query that is run at runtime? Is this a calculated measure I can store in the cube? How would I do that? My thinking is that it should be a fact that the user can use in a PivotTable, but I'm not sure.

One way to model this would be to create a factless fact table to model events. Your ORDERS fact table models the transaction amount, customer information etc, while the factless fact table (perhaps called ORDER_STATUS) models any events that occur in relation to a specific order.
With this model, it's easy to count or add all transactions based on their order status by checking for existence of records in the factless fact table.
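As a rough illustration of the existence check described above, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names and sample dates are assumptions made for the example; they are not taken from the question, and a real cube or MDX measure would look different.

```python
import sqlite3

# Hypothetical schema and data: names and dates are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL);
    -- Factless fact table: one row per status event, no measures.
    CREATE TABLE order_status (order_id INTEGER, status TEXT, status_date TEXT);

    INSERT INTO orders VALUES (1, '2023-04-03', 50.0),
                              (2, '2023-04-17', 20.0),
                              (3, '2023-05-02', 75.0);
    INSERT INTO order_status VALUES (1, 'ordered',  '2023-04-03'),
                                    (1, 'refunded', '2023-05-09'),
                                    (2, 'ordered',  '2023-04-17'),
                                    (2, 'shipped',  '2023-04-19'),
                                    (3, 'ordered',  '2023-05-02');
""")

# Orders placed in April, and how many of those have a 'refunded' event
# (in any month). COUNT(r.order_id) ignores the NULLs from the LEFT JOIN.
april_orders, refunded_later = conn.execute("""
    SELECT COUNT(*), COUNT(r.order_id)
    FROM orders o
    LEFT JOIN (SELECT DISTINCT order_id
               FROM order_status
               WHERE status = 'refunded') r ON r.order_id = o.order_id
    WHERE o.order_date >= '2023-04-01' AND o.order_date < '2023-05-01'
""").fetchone()

print(april_orders, refunded_later)
```

The order date filter applies to the ORDERS table only, so a refund recorded in May is still counted against the April order it belongs to.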

Related

Is a table (from source system) that contains only relationships and current status of a row from another table a fact table in Data Warehouse?

I am developing a BI system for our company, from scratch, and currently, I am designing a data warehouse. I am completely new to this so there are many things that I don't really understand, so I need to hear some more insights into this.
My problems are:
1) In our source system, there are tables called "Booking" and "BookingAccess". Booking table holds the data of a booking, such as check-in time and check-out time, booking date, booking number, gross amount of that booking.
BookingAccess, on the other hand, holds the foreign keys related to the booking, such as bookerID, customerID, processID, hotelID and paymentproviderID, plus the current status of that booking. Booking and BookingAccess have a 1:1 relationship.
Our source system checks the validity of those bookings; the bookings are not ours. We receive the booking information from other sources and carry out the above process on their behalf. The gross amount is just a piece of information about the booking that we need to validate; it is not part of our business. The current status held in the BookingAccess table is the status of that booking in our system, which can be "Processing" or "Finished".
From what I have read from Ralph Kimball, in this situation "Booking" is the dimension table and BookingAccess should be the fact. I feel that BookingAccess is somewhat like an accumulating snapshot table, in which I should track the time when a booking is "Processing" and when it is "Finished".
Do I get it right?
2) In the "Booking" table there is also a foreign key called "ImportID", which links to a table called "Import". The "Import" table holds history records of the files that were imported into our system (these files contain the bookings that are written to the "Booking" table), with attributes such as file name, import date, total bookings imported...
From my point of view, this is clearly a fact table.
But the problem is that the "Import" table and the "Booking" table have a one-to-many relationship (one ImportID in the "Import" table can have 1, 2 or more records with the same ImportID in the "Booking" table). This goes against the idea that the relationship between fact and dimension must be many-to-one, with the fact always on the many side.
So what approach should I use to solve this case? I'm thinking of using bridge tables. But I don't know if this is good practice: there are a lot of records in the "Import" table, so I would have to create a big bridge table just to cover all of them.
3) Should I separate a table (from source system) which contains a mix of relationships and information to a fact table containing only relationships, and dimension table containing only information? (For example, a table called "Customer" in source system. This table contains some things like customer name, customer address and customertype id, customer parentID....)
I am asking this because, when I use BI tools to analyze things (for example, the number of customers that have customertypeid = 1), it feels somewhat weird if no fact tables are involved.
Or should I treat it as a mere dimension table and use a snowflake schema? But this will lead to a mix of star schema and snowflake schema in our data warehouse. Is this normal? I have read some official sources (most likely Oracle) stating that one should avoid using and mixing in snowflake schemas as much as possible. But some sources like Microsoft say that this is very normal. Even the AdventureWorks data warehouse sample database uses this kind of approach.
Or should I denormalize every relation in that "Customer" table? But I don't think this is a good approach, as it will make the Customer dimension contain a lot of columns, and it will be very hard to track the history of every row in the "DIM_Customer" table. For example, if any change occurs in any relation of the "Customer" table, the whole "DIM_Customer" table will need to be updated.
I still have a lot of questions regarding data warehousing. I am working on it nearly alone, without any help or consultant, so pardon me if I have made any mistakes.

FactLoanVolume - One or Many Fact Tables

I am designing a Fact table to report on loan volume. The grain is one row per loan transaction. A loan has a few major milestones that we report on: In order of sequence, these are Lock Volume, Loan Funding Volume and Loan Sales Volume.
I have Lock Date, Loan Funding Date and Loan Sale Date as FK (there are other dimensions in addition to these) in the Fact table to role playing dimensions off my DimDate table.
My question is, should I create separate Fact Tables to report volume for each major milestone or should I keep all of this in one Fact Table and use a "far in the future" date (e.g., 12/31/2099) for a milestone on a loan that has not been met?
I have read the Kimball books but I didn't find a definitive answer (if one even exists).
Thanks

You may profit from an immutable design by making the grain finer, at the milestone level.
This gives you the columns
transaction_id
milestone_type
milestone_date
in your fact table. The current milestone of a transaction is the milestone from the last (most recent) record.
One advantage is that you can add new milestone types in the future, but the main gain is that you never update your fact table: you use inserts only.
You can safely roll back a bad ETL load simply by deleting its records, which is much more complicated when updates are involved.
You can also implement more complicated state diagrams, e.g. when a milestone is revoked and the transaction falls back to the previous state.
Whether you use one fact table or more depends on whether your milestones are homogeneous. If the milestones have distinct attributes, you may get a cleaner design using dedicated fact tables, but the queries get more complicated.
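A minimal sketch of the insert-only idea, using Python's built-in sqlite3 module. The table name fact_loan_milestone, the load_id column (an ETL batch id used here only to show the rollback) and the sample rows are assumptions for illustration, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_loan_milestone (
        transaction_id INTEGER,
        milestone_type TEXT,
        milestone_date TEXT,
        load_id        INTEGER   -- assumption: ETL batch id, used for rollback
    );
    -- Rows are only ever inserted, never updated.
    INSERT INTO fact_loan_milestone VALUES
        (101, 'lock',    '2024-01-05', 1),
        (102, 'lock',    '2024-01-10', 1),
        (101, 'funding', '2024-01-20', 2),
        (101, 'sale',    '2024-02-15', 3);
""")

# Current milestone = the row with the most recent milestone_date per transaction.
CURRENT_SQL = """
    SELECT f.transaction_id, f.milestone_type
    FROM fact_loan_milestone f
    WHERE f.milestone_date = (SELECT MAX(m.milestone_date)
                              FROM fact_loan_milestone m
                              WHERE m.transaction_id = f.transaction_id)
    ORDER BY f.transaction_id
"""
current = dict(conn.execute(CURRENT_SQL).fetchall())

# Rolling back a bad load is a plain delete of that batch's rows.
conn.execute("DELETE FROM fact_loan_milestone WHERE load_id = 3")
after_rollback = dict(conn.execute(CURRENT_SQL).fetchall())

print(current, after_rollback)
```

After the delete, transaction 101 falls back to its previous milestone without any row having been updated.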

You should prefer a single fact table.
The following question and its discussion answer the general question of "One or multiple fact tables?" fairly well, but perhaps not how to deal with your specific problem of dates.

Track multiple status in Transaction Fact Table

I have to track the status of my business process for analysis purposes. I have seen a post saying that we can keep the status in a transaction fact table against time/transaction type/service center, and use an accumulating fact table to study the process lag. If a few transactions have multiple statuses in a single day, should I store all of those statuses in the transaction fact table? Here I am assuming that my ETL runs at the end of the business day.
Secondly, should I keep all my key dimension keys in the transaction fact table? The keys in this case are Transaction Type, Department id, Service_type, Service_id and Submission Channel. Or should I divide them across multiple fact tables?
Thirdly, if I need to report which department is meeting its SLA, what would be the best approach: calculate and keep track of Within SLA and Not Within SLA in the transaction fact table, or compute this value at run time?
Thanks in advance for your help and assistance.

For status tracking you should have:
A transaction fact table where only events show up (on its own it does not provide event tracing)
An accumulating snapshot table where each process's statuses are tracked and updated as they happen.
As for the keys, you should keep as much detail as possible. There is no need to drop keys if they may hold valuable information in the future.
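The two tables can be sketched as follows with Python's built-in sqlite3 module. Everything named here (fact_status_event, fact_process_snapshot, the three status names and the record_status helper) is hypothetical and only meant to show how an event row and a snapshot update might go together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Transaction fact: one row per status event, insert-only.
    CREATE TABLE fact_status_event (process_id INTEGER, status TEXT, event_date TEXT);
    -- Accumulating snapshot: one row per process, one date column per status.
    CREATE TABLE fact_process_snapshot (
        process_id     INTEGER PRIMARY KEY,
        submitted_date TEXT,
        in_review_date TEXT,
        completed_date TEXT
    );
""")

# Assumption: a fixed mapping from status name to snapshot date column.
STATUS_COLUMN = {
    "submitted": "submitted_date",
    "in_review": "in_review_date",
    "completed": "completed_date",
}

def record_status(process_id, status, event_date):
    """Append the event and update the matching date on the snapshot row."""
    conn.execute("INSERT INTO fact_status_event VALUES (?, ?, ?)",
                 (process_id, status, event_date))
    conn.execute("INSERT OR IGNORE INTO fact_process_snapshot (process_id) VALUES (?)",
                 (process_id,))
    column = STATUS_COLUMN[status]  # taken from the fixed mapping above, not user input
    conn.execute(f"UPDATE fact_process_snapshot SET {column} = ? WHERE process_id = ?",
                 (event_date, process_id))

record_status(7, "submitted", "2024-03-01")
record_status(7, "in_review", "2024-03-01")  # second status on the same day
record_status(7, "completed", "2024-03-04")

event_count = conn.execute(
    "SELECT COUNT(*) FROM fact_status_event WHERE process_id = 7").fetchone()[0]
lag_days = conn.execute(
    "SELECT julianday(completed_date) - julianday(submitted_date) "
    "FROM fact_process_snapshot WHERE process_id = 7").fetchone()[0]

print(event_count, lag_days)
```

In this sketch every status change is kept as its own event row, even when two happen on the same day, while the snapshot row carries the dates needed to measure the lag between milestones.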

Repeated query on related table are very slow. How to adjust schema to be efficient?

I have two related tables in a Postgres DB. Let's say the first table is Products and the second is Transactions. I have a search feature that queries Products based on specific attributes, some of which depend on Transactions (such as the last sale price of a specific product).
My problem is that the search has to query Transactions for the most recent transaction every time it runs, which takes a long time. It is worth noting that Transactions will only be updated monthly or quarterly with the latest data.
My simple mind thought a solution would be to add fields for the most recent sale price, etc. to the Products table, so that Products has a most_recent_sales_price field which is updated by a query whenever the Transactions table is updated. My gut is telling me that this is a hacky way of caching (which I know very little about). Is there a better approach for this?
Edit: there are approximately 1 million transactions and 50,000 products in the DB.
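For reference, the denormalized column described in the question could be refreshed along these lines. This is only a sketch of that idea using Python's built-in sqlite3 module rather than Postgres; the column names sale_date and price, the helper refresh_most_recent_prices and the sample rows are assumptions, and it is not presented as the best approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        most_recent_sales_price REAL   -- cached value, refreshed after each load
    );
    CREATE TABLE transactions (product_id INTEGER, sale_date TEXT, price REAL);

    INSERT INTO products (product_id, name) VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO transactions VALUES (1, '2024-01-10', 9.5),
                                    (1, '2024-03-02', 11.0),
                                    (2, '2024-02-20', 4.0);
""")

def refresh_most_recent_prices(conn):
    """Run once after each monthly or quarterly Transactions load."""
    conn.execute("""
        UPDATE products
        SET most_recent_sales_price = (
            SELECT t.price
            FROM transactions t
            WHERE t.product_id = products.product_id
            ORDER BY t.sale_date DESC
            LIMIT 1)
    """)

refresh_most_recent_prices(conn)
prices = dict(conn.execute(
    "SELECT product_id, most_recent_sales_price FROM products"))
print(prices)
```

The search can then filter on products.most_recent_sales_price directly, at the cost of the value being stale between loads.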

Freezing associated objects

Does anyone know of any method in Rails by which an associated object may be frozen. The problem I am having is that I have an order model with many line items which in turn belong to a product or service. When the order is paid for, I need to freeze the details of the ordered items so that when the price is changed, the order's totals are preserved.

I worked on an online purchase system before. What you want to do is have an Order class and a LineItem class. LineItems store product details like price, quantity, and maybe some other information you need to keep for records. It's more complicated but it's the only way I know to lock in the details.
An Order is simply made up of LineItems and probably contains shipping and billing addresses. The total price of the Order can be calculated by adding up the LineItems.
Basically, you freeze the data before the person makes the purchase. When products are added to an order, the data is frozen because LineItems duplicate the necessary product information. This way, when a product is removed from your system, you can still make sense of old orders.
You may want to look at a Rails plugin called 'AASM' (formerly acts_as_state_machine) to handle the state of an order.
Edit: AASM can be found here http://github.com/rubyist/aasm/tree/master
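The idea of LineItems duplicating product details can be sketched independently of Rails. The following is a plain Python illustration, not ActiveRecord code; the class and attribute names are assumptions chosen to mirror the description above.

```python
from dataclasses import dataclass, field
from decimal import Decimal

@dataclass
class Product:
    name: str
    price: Decimal  # current catalogue price, free to change later

@dataclass(frozen=True)
class LineItem:
    # Copies of the product details taken when the item is added to the order.
    product_name: str
    unit_price: Decimal
    quantity: int

    @classmethod
    def from_product(cls, product, quantity):
        return cls(product.name, product.price, quantity)

    @property
    def total(self):
        return self.unit_price * self.quantity

@dataclass
class Order:
    line_items: list = field(default_factory=list)

    def add(self, product, quantity):
        self.line_items.append(LineItem.from_product(product, quantity))

    @property
    def total(self):
        return sum((item.total for item in self.line_items), Decimal("0"))

widget = Product("widget", Decimal("1.00"))
order = Order()
order.add(widget, 3)

widget.price = Decimal("5.00")  # later price change in the catalogue
order_total = order.total       # still based on the copied unit price
print(order_total)
```

Because the line item holds its own copy of the name and price, the order still makes sense even if the product is later repriced or removed.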

A few options:
1) Add a version number to your model. At the day job we do course scheduling. A particular course might be updated occasionally but, for business rule reasons, it's important to know what it looked like on the day you signed up. Add :version_number to the model and find_latest_course(course_id), alter code as appropriate, stir a bit. In this case you don't "edit" models so much as you do a new save of the new, updated version. (Then, obviously, your LineItems carry an item_id and an item_version_number.)
This generic pattern can be extended to cover, shudder, audit trails.
2) Copy data into LineItem objects at LineItem creation time. Just because you can slap has_a on anything, doesn't mean you should. If a 'LineItem' is supposed to hold a constant record of one item which appeared on an invoice, then make the LineItem hold a constant record of one item which appeared on an invoice. You can then update InventoryItem#current_price at will without affecting your previously saved LineItems.
3) If you're lazy, just freeze the price on the order object. Not really much to recommend this but, hey, it works in a pinch. You're probably just delaying the day of reckoning though.
"I ordered from you 6 months ago and now am doing my taxes. Why won't your bookstore show me half of the books I ordered? What do you mean their IDs were purged when you stopped selling them?! I need to know which I can get deductions for!"

Shouldn't the prices already be frozen when the items are added to the order? Say I put a widget into my shopping basket thinking it costs $1 and by the time I'm at the register, it costs $5 because you changed the price.
Back to your problem: I don't think it's a language issue, but a functional one. Instead of associating the prices with items, you need to copy the prices. If every item in the order has its own version of a price, future price changes won't affect it, you can add discounts, etc.
Actually, to be clean you need to add versioning to your prices. When an item's price changes, you don't overwrite the price, you add a newer version. The line items in your order will still be associated with the old price.
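A minimal sketch of price versioning in plain Python (not Rails). The classes PriceVersion, Item and LineItem and the set_price method are hypothetical names used only to illustrate adding a new version instead of overwriting the old one.

```python
from dataclasses import dataclass, field
from decimal import Decimal

@dataclass(frozen=True)
class PriceVersion:
    version: int
    amount: Decimal

@dataclass
class Item:
    name: str
    prices: list = field(default_factory=list)  # full price history, oldest first

    def set_price(self, amount):
        # Never overwrite: append a newer version instead.
        new_version = PriceVersion(len(self.prices) + 1, amount)
        self.prices.append(new_version)
        return new_version

    @property
    def current_price(self):
        return self.prices[-1]

@dataclass(frozen=True)
class LineItem:
    item: Item
    price: PriceVersion  # the version that was current when the order was placed
    quantity: int

book = Item("book")
book.set_price(Decimal("10.00"))
line = LineItem(book, book.current_price, 2)

book.set_price(Decimal("12.50"))  # later price change creates version 2
print(line.price, book.current_price)
```

The line item keeps pointing at version 1 of the price, while new orders would pick up version 2.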
