Is this the right approach for preparing fact data before loading to DW? - data-warehouse

Our source system data is not a typical sales transaction table that references product and customer business keys on the transaction record. There are one or two tables in between before I can get to the customer or product information. Before loading data into the staging tables in the DW, I plan to prepare it with Spark by joining the transaction table to all the intermediate tables stored in the data lake, so that each row carries the customer and product business keys. Is this the right approach? I don't want to perform these joins on the staging tables. Instead, I want to load prepared data with all the context (associated business keys). I will replace these keys with surrogate keys when loading the data into the target fact table.
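To make the flow described above concrete, here is a minimal sketch in plain Python rather than Spark. It assumes a single hypothetical intermediate table (contracts) between the transaction and the customer/product business keys. Every table name, column name and key value here is invented for illustration and does not come from the actual source system.

```python
# Hypothetical source rows: the transaction only references a contract,
# and the contract is what links to the customer and product business keys.
transactions = [
    {"txn_id": 1, "contract_id": "C-10", "amount": 250.0},
    {"txn_id": 2, "contract_id": "C-11", "amount": 90.0},
]
contracts = {
    "C-10": {"customer_bk": "CUST-A", "product_bk": "PROD-X"},
    "C-11": {"customer_bk": "CUST-B", "product_bk": "PROD-X"},
}

# Hypothetical dimension lookups: business key -> surrogate key.
dim_customer = {"CUST-A": 1001, "CUST-B": 1002}
dim_product = {"PROD-X": 2001}


def prepare_staging_rows(transactions, contracts):
    """Resolve the intermediate table so each staged row carries business keys."""
    staged = []
    for txn in transactions:
        contract = contracts[txn["contract_id"]]
        staged.append({
            "txn_id": txn["txn_id"],
            "customer_bk": contract["customer_bk"],
            "product_bk": contract["product_bk"],
            "amount": txn["amount"],
        })
    return staged


def load_fact_rows(staged, dim_customer, dim_product):
    """Swap business keys for surrogate keys when loading the fact table."""
    return [
        {
            "txn_id": row["txn_id"],
            "customer_sk": dim_customer[row["customer_bk"]],
            "product_sk": dim_product[row["product_bk"]],
            "amount": row["amount"],
        }
        for row in staged
    ]


staging_rows = prepare_staging_rows(transactions, contracts)
fact_rows = load_fact_rows(staging_rows, dim_customer, dim_product)
```

The two functions mirror the two stages in the question: the joins happen before staging, and the surrogate key lookup happens only at fact load time.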

Related

Is a table (from source system) that contains only relationships and current status of a row from another table a fact table in Data Warehouse?

I am developing a BI system for our company from scratch, and I am currently designing a data warehouse. I am completely new to this and there are many things I don't really understand, so I would appreciate some insight.
My problems are:
1) In our source system, there are tables called "Booking" and "BookingAccess". Booking table holds the data of a booking, such as check-in time and check-out time, booking date, booking number, gross amount of that booking.
Whereas BookingAccess holds foreign keys related to the booking, such as bookerID, customerID, processID, hotelID and paymentproviderID, plus the current status of that booking. Booking and BookingAccess have a 1:1 relationship.
Our source system is about checking the validity of those bookings; these bookings are not ours. We receive this booking information from other sources and carry out the validation process on their behalf. The gross amount is just a piece of information about the booking that we need to validate; it is not part of our business. The current status of a booking, which is held in the BookingAccess table, is the current status of that booking in our system, and can be "Processing" or "Finished".
From what I have read from Ralph Kimball, in this situation "Booking" is the dimension table and BookingAccess should be the fact. I feel that BookingAccess is somewhat of an accumulating snapshot table, in which I should track the time when a booking is "Processing" and when it is "Finished".
Do I get it right?
2) In the "Booking" table, there is also a foreign key called "ImportID". This key links to a table called "Import". The "Import" table holds history records of the files that were imported into our system (these files contain the bookings that are written to the "Booking" table), including attributes such as file name, imported date, total bookings imported...
From my point of view, this is clearly a fact table.
But the problem is that the "Import" table and the "Booking" table have a one-to-many relationship (one ImportID in the "Import" table can have 1, 2 or more records with the same ImportID in the "Booking" table). This goes against the idea of fact tables, which insists that the relationship between fact and dimension must be many-to-one, with the fact always on the many side.
So what approach should I use to solve this case? I'm thinking of using bridge tables to solve this problem. But I don't know if this is good practice: there are a lot of records in the "Import" table, so I would have to create a big bridge table just to cover all of them.
3) Should I separate a table (from source system) which contains a mix of relationships and information to a fact table containing only relationships, and dimension table containing only information? (For example, a table called "Customer" in source system. This table contains some things like customer name, customer address and customertype id, customer parentID....)
I am asking this because if I use BI tools to analyze things (for example, the number of customers that have customertypeid = 1), it feels somewhat weird if there are no fact tables involved.
Or should I treat it as a mere dimension table and use a snowflake schema? But this will lead to a mix of star schema and snowflake schema in our data warehouse. Is this normal? I have read some official sources (most likely Oracle) stating that one should try to avoid using and mixing in snowflake schemas as much as possible. But some sources like Microsoft say that this is very normal. Even the AdventureWorks Data Warehouse sample database uses this kind of approach.
Or should I de-normalize every relation in that "Customer" table? But I don't think this is a good approach, as it will make the Customer dimension contain a lot of columns, and it will be very hard to track the history of every row in the "DIM_Customer" table. For example, if any change occurs in any relation of the "Customer" table, the whole "DIM_Customer" table will need to be updated.
I still have a lot of questions regarding data warehousing. I am working on this nearly alone, without any help or consultant, so pardon me if I have made any mistakes.

Reducing over multiple joins in CouchDB

In my CouchDB database, I have the following models (implemented as documents in the database with different type fields):
Team: name, id (has many matches, has many fans)
Match: name, team_a, team_b, time (has many teams, has many tweets)
Fan: team_id (has many tweets)
Tweet: time, sentiment, fan_id
I want to average the tweet sentiment for each team. If I were using SQL I'd do it like this:
SELECT team.id, avg(tweet.sentiment)
FROM team
JOIN match ON team.id = match.team_a OR team.id = match.team_b
JOIN fan ON fan.team_id = team.id
JOIN tweet ON (tweet.time BETWEEN match.time AND match.time + interval '1 hour') AND tweet.fan_id = fan.id
GROUP BY team.id
However in CouchDB you can at best do 1 join in a view function, as explained in the docs (by emitting the join field as the key).
How can this be better modelled in CouchDB to allow for this query to work? I don't really want to denormalise too much, but I guess I will if I have to?
It's a bit complex, but I use what I call "tertiary indexes". The goal is to be able to write a view that is applied to another view. Unfortunately, the only way to do this is to use a view to write data to a secondary database and then have another view that works on that database. Doing this requires an outside process - I use a script that listens to the _changes feed of the primary database, and then updates the relevant documents in the secondary database when something changes.
So in your example your secondary database could consist of a single document for each team with all of the (or the latest) match/fan/tweet data in that one document. Then you write a view that extracts the sentiment (or whatever) from that secondary database.
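As a rough illustration of the secondary-database idea, the following sketch folds the primary documents into one document per team and then runs a simple "view" over those documents. It works entirely in memory and makes no CouchDB calls. In a real setup the folding step would be driven by the script listening to the _changes feed. The sample documents, the `sentiments` field and the function names are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical documents from the primary database, distinguished by "type".
docs = [
    {"type": "team", "id": "t1", "name": "Reds"},
    {"type": "team", "id": "t2", "name": "Blues"},
    {"type": "match", "name": "Reds v Blues", "team_a": "t1", "team_b": "t2",
     "time": datetime(2024, 1, 1, 15, 0)},
    {"type": "fan", "id": "f1", "team_id": "t1"},
    {"type": "fan", "id": "f2", "team_id": "t2"},
    {"type": "tweet", "fan_id": "f1", "sentiment": 0.8, "time": datetime(2024, 1, 1, 15, 10)},
    {"type": "tweet", "fan_id": "f1", "sentiment": 0.4, "time": datetime(2024, 1, 1, 15, 50)},
    {"type": "tweet", "fan_id": "f2", "sentiment": -0.5, "time": datetime(2024, 1, 1, 15, 30)},
    # Outside the one-hour match window, so it should be ignored.
    {"type": "tweet", "fan_id": "f2", "sentiment": 1.0, "time": datetime(2024, 1, 2, 9, 0)},
]


def build_team_documents(docs):
    """Fold matches, fans and tweets into one secondary document per team."""
    by_type = defaultdict(list)
    for doc in docs:
        by_type[doc["type"]].append(doc)

    fan_team = {fan["id"]: fan["team_id"] for fan in by_type["fan"]}
    team_docs = {
        team["id"]: {"_id": team["id"], "name": team["name"], "sentiments": []}
        for team in by_type["team"]
    }

    for match in by_type["match"]:
        window_end = match["time"] + timedelta(hours=1)
        for tweet in by_type["tweet"]:
            team_id = fan_team.get(tweet["fan_id"])
            plays = team_id in (match["team_a"], match["team_b"])
            if plays and match["time"] <= tweet["time"] <= window_end:
                team_docs[team_id]["sentiments"].append(tweet["sentiment"])
    return team_docs


def average_sentiment(team_docs):
    """Stand-in for the view on the secondary database: average per team."""
    return {
        team_id: sum(doc["sentiments"]) / len(doc["sentiments"])
        for team_id, doc in team_docs.items()
        if doc["sentiments"]
    }


team_docs = build_team_documents(docs)
averages = average_sentiment(team_docs)
```

The trade-off is the one already mentioned: the per-team documents are denormalised copies, so they are only as fresh as the process that rebuilds them.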

How to store data in fact table with multiple products in an order in data warehouse

I am trying to design a dimensional model for a data warehouse for one of my projects (Sales Order). I'm new to this concept.
So far, I could understand that the product, customer and date can be stored in the dimension table and the order info will be in the fact table.
Date_dimension table structure will be
date_dim_id, date, week_number, month_number
Product_dimension table structure will be
product_dim_id, product_name, desc, sku
Order_fact table structure will be
order_id, product_dim_id(fk), date_dim_id(fk), order_quantity, order_total_price, etc
If an order is placed with 2 or more products, will there be repeated entries in the order_fact table for the same order_id and date_dim_id?
Please help on this. I'm confused here. I know that in a relational database, order table will have one entry per order and relation between the product and order will be maintained in a different table having the order_id and product_id as the foreign key.
Thanks in advance.
This is a classic case where you should (probably) have two fact tables: FactOrderHeader and FactOrderDetail.
FactOrderHeader will have a record for each order, storing information regarding the value of the order and any order level discounts; though they could be expressed as an OrderDetail record in some cases.
FactOrderDetail will have a record for each order line, storing information regarding the product, product cost, product sale price, number of items, item discount, etc.
You may need to have a DimOrderHeader as well, if there are non-Fact pieces of information that you want to store, for example, date the order was taken, delivered, paid.
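A small sketch of the header/detail split, using Python's built-in sqlite3 module with an in-memory database. The table names follow the answer above. The `line_number` and `line_price` columns and all sample values are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FactOrderHeader (
    order_id          INTEGER PRIMARY KEY,
    date_dim_id       INTEGER NOT NULL,
    order_total_price REAL NOT NULL
);
CREATE TABLE FactOrderDetail (
    order_id       INTEGER NOT NULL,
    line_number    INTEGER NOT NULL,
    product_dim_id INTEGER NOT NULL,
    date_dim_id    INTEGER NOT NULL,
    order_quantity INTEGER NOT NULL,
    line_price     REAL NOT NULL,
    PRIMARY KEY (order_id, line_number)
);
""")

# One order (order_id 500) containing two different products.
conn.execute("INSERT INTO FactOrderHeader VALUES (500, 20240101, 35.0)")
conn.executemany(
    "INSERT INTO FactOrderDetail VALUES (?, ?, ?, ?, ?, ?)",
    [
        (500, 1, 11, 20240101, 2, 20.0),
        (500, 2, 12, 20240101, 1, 15.0),
    ],
)

# The header has one row per order; the detail has one row per order line.
header_rows = conn.execute(
    "SELECT COUNT(*) FROM FactOrderHeader WHERE order_id = 500"
).fetchone()[0]
detail_rows = conn.execute(
    "SELECT COUNT(*) FROM FactOrderDetail WHERE order_id = 500"
).fetchone()[0]
detail_total = conn.execute(
    "SELECT SUM(line_price) FROM FactOrderDetail WHERE order_id = 500"
).fetchone()[0]
```

In this layout the order_id and date_dim_id do repeat in the detail table, once per product on the order, while the header keeps a single row per order.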

How to mark data as demo data in SQL database

We have Accounts, Deals, Contacts, Tasks and some other objects in the database. When a new organisation is created, we want to set up some of these objects as "Demo Data" which they can view, edit and delete as they wish.
We also want to give the user the option to delete all demo data so we need to be able to quickly identify it.
Here are two possible ways of doing this:
Have a "IsDemoData" field on all the above objects : This would mean that the field would need to be added if new types of demo data become required. Also, it would increase database size as IsDemoData would be redundant for any record that is not demo data.
Have a DemoDataLookup table with TableName and ID. The ID here would not be a strong foreign key but a theoretical foreign key to a record in the table stated by table name.
Which of these is better, and is there a better normalised solution?
As a DBA, I think I'd rather see demo data isolated in a schema named "demo".
This is simple with some SQL database management systems, not so simple with others. In PostgreSQL, for example, you can write all your SQL with unqualified names, and put the "demo" schema first in the schema search path. When your clients no longer want the demo data, just drop the demo schema.

Best way to save/store customer purchase order data?

I have a custom membership provider which I extended by adding a couple of fields: first name, last name, address, zip code and city.
Now, these fields reside in the aspnet_Membership table so that I can easily access them when using the static Membership ASP.NET class.
Now, I want to be able to save customer purchase order data (first name, last name, address, zip code and city) to the database.
Should I use a new set of fields in my order model/table (first name, last name, address, zip code, city), or should I create a relationship between my aspnet_Membership table and my Orders table?
Also, if I have dupe data, once a user asks for his account to be removed I won't have any orphan rows in my Orders table if I use the first method.
So, which is best: to have the user data (first name, last name, address, zip code, city) in only one table and create a relationship between the aspnet_Membership table and the Orders table, OR to create the dupe fields in my Orders table with no relationship to the aspnet_Membership table? Pros and cons?
Thanks!
/P
In this scenario, I would rather have the relationship.
Also, since the data you are storing is orders (I assume so at least, from the name :)), I would maintain a separate set of data on the order, so that one could optionally specify billing/shipping data different from their identity on the site.
Another valid reason for duplicating at least some data in the Orders table is to have all the data relevant to an order in that table. This avoids problems if the client requests that their data be deleted, and preserves the original values on the order if the customer data changes over time.
If you are able to, though, you should not actually delete User data, but have a field in which you specify if the User isActive or isDeleted.
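The combination suggested above (keep the relationship, copy the address onto the order, and flag users as deleted instead of removing them) might look roughly like this. The sketch uses Python's built-in sqlite3 module with simplified, hypothetical `users` and `orders` tables standing in for aspnet_Membership and Orders. The `ship_*` and `is_deleted` column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    first_name TEXT, last_name TEXT, address TEXT, zip_code TEXT, city TEXT,
    is_deleted INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    user_id  INTEGER NOT NULL REFERENCES users(user_id),
    -- Snapshot of the shipping details at the time of the order.
    ship_first_name TEXT, ship_last_name TEXT,
    ship_address TEXT, ship_zip_code TEXT, ship_city TEXT
);
""")

conn.execute(
    "INSERT INTO users VALUES (1, 'Ann', 'Lee', '1 Old Road', '11111', 'Oldtown', 0)"
)

# Copy the current user details onto the order when it is placed.
conn.execute("""
INSERT INTO orders (order_id, user_id, ship_first_name, ship_last_name,
                    ship_address, ship_zip_code, ship_city)
SELECT 100, user_id, first_name, last_name, address, zip_code, city
FROM users WHERE user_id = 1
""")

# Later the user moves, then asks for the account to be removed.
conn.execute(
    "UPDATE users SET address = '2 New Street', city = 'Newtown' WHERE user_id = 1"
)
conn.execute("UPDATE users SET is_deleted = 1 WHERE user_id = 1")

# The order keeps its original values and still points at an existing user row.
order_city = conn.execute(
    "SELECT ship_city FROM orders WHERE order_id = 100"
).fetchone()[0]
active_users = conn.execute(
    "SELECT COUNT(*) FROM users WHERE is_deleted = 0"
).fetchone()[0]
```

Because the user row is only flagged, the foreign key on the order never dangles, and the copied columns keep the order accurate even after the profile changes.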

Resources