Granularity in Star Schema leads to multiple values in Fact Table? - data-warehouse

I'm trying to understand star schema at the moment & struggling a lot with granularity.
Say I have a fact table that has session_id, user_id, order_id, product_id and I want to roll-up to sessions by user by week (keeping in mind that not every session would lead to an order or a product & the DW needs to track the sessions for non-purchasing users as well as those who purchase).
I can see no reason to track order_ids or session_ids in the fact table so it would become something like:
week_date, user_id, total_orders, total_sessions ...
But how would I then track product_ids if a user makes more than one purchase in a week? I assume I can't keep multiple product ids in an array (eg: "20/02/2012","5","3","PR01,PR32,PR22")?
I'm thinking it may have to be kept at 'every session' level but that could lead to a very large amount of data. How would you implement granularity for an example such as above?

Dimensional modelling requires Dimensions as well as Facts.
You need a Date/Calendar dimension, which includes columns like this:
calendar (id,cal_date,cal_year,cal_month,...)
The "grain" of your fact table is the key to data storage. If you have transactions, then the transaction should be the grain, and you store one row per transaction. Use proper (integer) surrogate keys to your dimensions, and your table won't be as large as you fear.
Now you can write a query like this, to sum sales of product by year:
select product_name, cal_year, sum(purchase_amount)
from fact_whatever
inner join calendar on calendar.id = fact_whatever.calendar_id
inner join product on product.id = fact_whatever.product_id
group by product_name, cal_year
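To make the grain point concrete for the sessions example, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (fact_session, week_start) and the sample rows are assumptions for illustration, not part of the question. The fact table keeps one row per session and product line, with NULL order_id and product_id for sessions that did not lead to a purchase, and the weekly roll-up is computed at query time.

```python
import sqlite3

# Hypothetical schema: the grain is one row per session and product line.
# order_id and product_id are NULL for non-purchasing sessions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calendar (id INTEGER PRIMARY KEY, cal_date TEXT, week_start TEXT);
CREATE TABLE fact_session (
    session_id  INTEGER NOT NULL,
    user_id     INTEGER NOT NULL,
    calendar_id INTEGER NOT NULL REFERENCES calendar(id),
    order_id    INTEGER,
    product_id  INTEGER
);
INSERT INTO calendar VALUES (1, '2012-02-20', '2012-02-20'),
                            (2, '2012-02-22', '2012-02-20');
INSERT INTO fact_session VALUES
    (100, 5, 1, 9001, 1),     -- user 5 buys one product
    (101, 5, 2, 9002, 32),    -- user 5 buys two products in one session
    (101, 5, 2, 9002, 22),
    (102, 5, 2, NULL, NULL),  -- user 5 browses without buying
    (103, 6, 1, NULL, NULL);  -- user 6 never buys
""")

# Roll up to sessions and orders by user by week. COUNT(DISTINCT ...) ignores
# NULLs, so non-purchasing sessions count as sessions but not as orders.
weekly = conn.execute("""
    SELECT c.week_start, f.user_id,
           COUNT(DISTINCT f.session_id) AS total_sessions,
           COUNT(DISTINCT f.order_id)   AS total_orders
    FROM fact_session f
    JOIN calendar c ON c.id = f.calendar_id
    GROUP BY c.week_start, f.user_id
    ORDER BY f.user_id
""").fetchall()
print(weekly)
```

Because product_id stays on the detail rows, there is no need for an array of product ids in a weekly row; the products a user bought in a week come from the same table with a different GROUP BY.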

Related

Rails how to make a table with composite key: (Category and Month/Year), and value: Amount

I need to report a budget to actual table, displaying net values by month per category.
Currently I'm calculating on the fly but it's already slow with hardly more than seeded data.
I think the proper way to do this is with a composite key table. The key is the unique combination of Category and Period (Month-Year), the value is the amount.
Each transaction would trigger a callback to update this table. At report time, a user would supply the period of the report (March 2022) and it would find all rows of that period.
I think it would look something like this:
key                              value
March-2021/All Categories        $335
March-2021/Food                  $75
March-2021/Fuel                  $60
March-2021/Entertainment         $200
...                              ...
March-2022/All Categories        $49
March-2022/Food                  $25
March-2022/Fuel                  -$10
March-2022/Entertainment         $34
...                              ...
April-2022/All Categories        $58
April-2022/Food                  $5
April-2022/Fuel                  $30
April-2022/Entertainment         $23
April-2022/Some_New_Category     $20
I've read this SO question, "How to implement composite primary keys in rails", but I'd rather not blindly use a gem, and anyway it seems to involve composite 'primary' keys (User model, Organization model and Department model), whereas I'm just using a Category model and a non-model period (Month-Year).
I can't seem to find anything about how to implement this. Am I approaching this wrong?
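One possible shape for this, sketched with Python's sqlite3 module rather than Rails so it runs on its own: keep an ordinary summary table and put a unique constraint on the (category, period) pair instead of making it a composite primary key. The table name category_period_totals, the 'YYYY-MM' period format, and the amounts in cents are assumptions for illustration. In a Rails migration the equivalent constraint would typically be a unique index over the two columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical summary table: one row per (category, period).
CREATE TABLE category_period_totals (
    id       INTEGER PRIMARY KEY,
    category TEXT    NOT NULL,
    period   TEXT    NOT NULL,            -- assumed format 'YYYY-MM'
    amount   INTEGER NOT NULL DEFAULT 0,  -- assumed to be stored in cents
    UNIQUE (category, period)
);
""")

def record_transaction(conn, category, period, amount_cents):
    """Apply one transaction to the running total for its category and period."""
    # Create the row if it does not exist yet; the UNIQUE constraint makes
    # this a no-op when the (category, period) pair is already present.
    conn.execute(
        "INSERT OR IGNORE INTO category_period_totals (category, period) VALUES (?, ?)",
        (category, period),
    )
    conn.execute(
        "UPDATE category_period_totals SET amount = amount + ? "
        "WHERE category = ? AND period = ?",
        (amount_cents, category, period),
    )

record_transaction(conn, "Food", "2022-03", 1500)
record_transaction(conn, "Food", "2022-03", 1000)
record_transaction(conn, "Fuel", "2022-03", -1000)
record_transaction(conn, "Entertainment", "2022-03", 3400)
record_transaction(conn, "Food", "2022-04", 500)

# Report for one period: a single indexed lookup, no recalculation.
march = conn.execute(
    "SELECT category, amount FROM category_period_totals "
    "WHERE period = ? ORDER BY category",
    ("2022-03",),
).fetchall()
march_total = sum(amount for _, amount in march)
print(march, march_total)
```

The "All Categories" row from the table above is left out here on purpose, since it can be derived by summing the rows for the period.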

Dealing with multiple fact tables concerning related processes in dimensional modeling

I have the following scenario where OLTP sales data is stored in two separate physical tables:
Sales
Refunds/Cancellations
A refund always refers to an existing sale (thus 'negating' it), though the dimensions of these tables are nearly the same (date, sales clerk, store, etc.). The data schema looks something like the following:
CREATE TABLE sale
(
    sale_id uuid,
    transaction_at timestamp with time zone,
    store_id uuid,
    clerk_id uuid,
    clerk_number bigint,
    currency character varying(3),
    pos_id uuid,
    total numeric,
    net_total numeric
);
CREATE TABLE refund
(
    refund_id uuid,
    sale_id uuid, -- referenced sale to void
    refunded_at timestamp with time zone,
    pos_id uuid,
    clerk_id uuid,
    clerk_number bigint
);
I am trying to figure out how to model this data for ingestion in a DW. Since I am relatively new to dimensional modeling I have begun reading The Data Warehouse Toolkit, but I am as of now unsure of the best approach to handle this case.
To my mind, these are two separate fact tables describing two different business processes (e.g. making a sale and getting a refund), though due to normalization concerns the refunds table (besides containing most of the same dimensions) is basically a pointer to a row in the sales table (which is fine for OLTP).
Analytical reports down the line would obviously want to look at these in a few ways:
All net sales per dimension (gross sales minus refunds)
All refund amounts per dimension
Other potential business use cases
As is, the first two cases would require joining the fact tables to either subtract the sales amount (case 1) or to get the information on refund amounts (case 2).
The approach that seems to make the most sense for me is something like the following (via some ETL/ELT process):
Load the (gross) sales data into a table in the DW
Load and denormalize refund data into a table in the DW, joining actual sale data so that amounts etc. are located in the fact table
Join either table with common conformed dimensions for further roll-ups and querying
This makes sense to me because:
Both fact tables have all required information from the physical event, and
There is no explicit dependency between the fact tables, and
Common dimensions can be reused
However, in this case, I still would not be able to get the net sales without joining these two tables. This makes me think that there should be a separate net_sale fact table, but this is problematic:
From a business point of view, sales without refunds are the vast majority of events that occur. A net_sale table would copy basically 99% of all sale data.
From a business process point of view, this table would describe an event that does not exist as such (there is no "net sale", only an aggregated view of sales amount per dimension minus refund costs).
Glossing over the third Chapter in The Data Warehouse Toolkit, I do not see this case mentioned explicitly (though there might be some parallels w.r.t. factless fact-tables and derived facts). What kind of approach would work in a case like this?
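For reference, the two-fact-table layout described above can still answer the net sales question without a separate net_sale table: aggregate each fact table to the shared dimension first, then join the two aggregates. The sketch below uses Python's sqlite3 module; the names fact_sale, fact_refund and refund_total, the integer amounts, and the choice to attribute a refund to the store of the original sale are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical DW tables, reduced to a single dimension (store) for brevity.
CREATE TABLE fact_sale   (sale_id TEXT, store_id TEXT, total INTEGER);
-- Refund rows are denormalized during ETL: the amount and store are copied
-- from the referenced sale so the fact stands on its own.
CREATE TABLE fact_refund (refund_id TEXT, sale_id TEXT, store_id TEXT,
                          refund_total INTEGER);
INSERT INTO fact_sale   VALUES ('s1', 'A', 100), ('s2', 'A', 50), ('s3', 'B', 80);
INSERT INTO fact_refund VALUES ('r1', 's2', 'A', 50);
""")

# Aggregate each fact separately, then join on the shared dimension.
# Joining the raw fact rows instead would risk double counting.
net_by_store = conn.execute("""
    WITH sales AS (
        SELECT store_id, SUM(total) AS gross
        FROM fact_sale GROUP BY store_id
    ),
    refunds AS (
        SELECT store_id, SUM(refund_total) AS refunded
        FROM fact_refund GROUP BY store_id
    )
    SELECT s.store_id,
           s.gross,
           COALESCE(r.refunded, 0)           AS refunded,
           s.gross - COALESCE(r.refunded, 0) AS net
    FROM sales s
    LEFT JOIN refunds r ON r.store_id = s.store_id
    ORDER BY s.store_id
""").fetchall()
print(net_by_store)
```

The same query could be wrapped in a view if net sales are reported often, which avoids copying the sale rows into a third table.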

ER Model representing entities not stored in DB and user choice

I'm trying to create a ER diagram of a simple retail chain type database model. You have your customer, the various stores, inventory etc.
My first question is, how to represent a customer placing an order in a store. If the customer is a discount card holder, the company has their name, address etc, so I can have a cardHolder entity connect to item and store with an order relationship. But how do I represent an order being placed by a customer who is not really an entity in the database?
Secondly, how are conditional... stuff represented in ER diagrams, e.g. in a car dealership, a customer may choose one or more optional extras when buying a car. I would think that there is a Car entity with the relevant attributes and the options as a multi-valued attribute, but how do you represent a user picking those options (i.e. order table shows the car ordered, extras chosen and the added cost of extras) in the order relationship?
First, do you really need to model customers as distinct entities, or do you just need order, payment and delivery details? Many retail systems don't track individual customers. If you need to, you can have a customer table with a surrogate key and unique constraints on identifying attributes like SSN or discount card number (even if those attributes are optional). It's generally hard to prevent duplication in customer tables since there's no ideal natural key for people, so consider whether this is really required.
How to model optional extras depends on what they depend on. Some extras might be make or model-specific, e.g. the choice of certain colors or manual/automatic transmission. Extended warranties might be available across the board.
Here's an example of car-specific optional extras:
car (car_id PK, make, model, color, vin, price, ...)
car_extras (extra_id PK, car_id FK, option_name, price)
order (order_id PK, date_time, car_id FK, customer_id FK, payment_id FK, discount)
order_extras (order_id PK/FK, car_id FK, extra_id PK/FK)
I excluded price totals since those can be calculated via aggregate queries.
In my example, order_extras.car_id is redundant, but supports better integrity via the use of composite FK constraints (i.e. (order_id, car_id) references the corresponding columns in order, and (car_id, extra_id) references the corresponding columns in car_extras to prevent invalid extras from being linked to an order).
[ER diagram of the tables above; image not included]
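As a rough check of the composite FK idea, the sketch below builds a cut-down version of these tables with Python's sqlite3 module and tries to attach extras to an order. The sample rows, the reduced column lists and the helper try_add_extra are assumptions for illustration. SQLite requires the referenced column pairs to be declared UNIQUE and needs foreign key enforcement switched on explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
PRAGMA foreign_keys = ON;  -- SQLite does not enforce FKs by default

CREATE TABLE car (car_id INTEGER PRIMARY KEY, make TEXT, model TEXT, price INTEGER);
CREATE TABLE car_extras (
    extra_id    INTEGER PRIMARY KEY,
    car_id      INTEGER NOT NULL REFERENCES car(car_id),
    option_name TEXT,
    price       INTEGER,
    UNIQUE (car_id, extra_id)          -- target of the composite FK
);
CREATE TABLE "order" (
    order_id INTEGER PRIMARY KEY,
    car_id   INTEGER NOT NULL REFERENCES car(car_id),
    UNIQUE (order_id, car_id)          -- target of the composite FK
);
CREATE TABLE order_extras (
    order_id INTEGER NOT NULL,
    car_id   INTEGER NOT NULL,
    extra_id INTEGER NOT NULL,
    PRIMARY KEY (order_id, extra_id),
    FOREIGN KEY (order_id, car_id) REFERENCES "order"(order_id, car_id),
    FOREIGN KEY (car_id, extra_id) REFERENCES car_extras(car_id, extra_id)
);

INSERT INTO car VALUES (1, 'MakeA', 'ModelA', 20000), (2, 'MakeB', 'ModelB', 30000);
INSERT INTO car_extras VALUES (10, 1, 'Sunroof', 800), (20, 2, 'Tow bar', 400);
INSERT INTO "order" VALUES (1, 1);     -- order 1 is for car 1
""")

def try_add_extra(order_id, car_id, extra_id):
    """Return True if the extra could be linked to the order, else False."""
    try:
        conn.execute("INSERT INTO order_extras VALUES (?, ?, ?)",
                     (order_id, car_id, extra_id))
        return True
    except sqlite3.IntegrityError:
        return False

valid_extra = try_add_extra(1, 1, 10)  # extra 10 belongs to car 1
wrong_extra = try_add_extra(1, 1, 20)  # extra 20 belongs to car 2
wrong_car   = try_add_extra(1, 2, 20)  # order 1 is not for car 2
print(valid_extra, wrong_extra, wrong_car)
```

Only the first insert succeeds: the second fails the (car_id, extra_id) constraint and the third fails the (order_id, car_id) constraint.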
First, as you suggest, you can definitely have two kinds of customers: discount card holders, whose details the company already has, and new customers, whose details it does not.
There are three possible ways to achieve this:
1) Have two different order tables in the system (which I personally wouldn't suggest).
2) Have a single order table in the system and link customer details only for orders placed by discount card holders.
3) Insert a row into the discount card holder table for each new/unregistered customer, keeping only one order table in the system.
Having a single order table keeps the system standardized and is more convenient for many other operations.
Secondly, to address your concern, you need to follow normalization. It will reduce the current problem, remove redundancy, and keep the entities lightweight, which directly affects performance as the data grows.
The chosen extras can be listed on the order against the customer by adding them via foreign keys when the bill is generated. Working with keys gives fast, robust results, rather than storing redundant or repeated details in several places.
With normalization, the problem is handled by using foreign keys wherever you need to refer to data, which avoids errors.
Preferably aim for fourth normal form (4NF). Have a look at the following link to get started with normalization.
http://www.w3schools.in/dbms/database-normalization/

How to store data in fact table with multiple products in an order in data warehouse

I am trying to design a dimensional model for data warehousing for one of my projects (Sales Order). I'm new to this concept.
So far, I could understand that the product, customer and date can be stored in the dimension table and the order info will be in the fact table.
Date_dimension table structure will be
date_dim_id, date, week_number, month_number
Product_dimension table structure will be
product_dim_id, product_name, desc, sku
Order_fact table structure will be
order_id, product_dim_id(fk), date_dim_id(fk), order_quantity, order_total_price, etc
If an order is placed with 2 or more products, will there be repeated entries in the order_fact table for the same order_id and date_dim_id?
Please help on this. I'm confused here. I know that in a relational database, order table will have one entry per order and relation between the product and order will be maintained in a different table having the order_id and product_id as the foreign key.
Thanks in advance.
This is a classic case where you should (probably) have two fact tables:
FactOrderHeader and FactOrderDetail.
FactOrderHeader will have a record for each order, storing information regarding the value of the order and any order-level discounts, though those could be expressed as an OrderDetail record in some cases.
FactOrderDetail will have a record for each order line, storing information regarding the product, product cost, product sale price, number of items, item discount, etc.
You may need to have a DimOrderHeader as well, if there are non-Fact pieces of information that you want to store, for example, date the order was taken, delivered, paid.
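A minimal sketch of the header/detail split, using Python's sqlite3 module. The snake_case table names, the line_price column and the sample values are assumptions for illustration. An order with two products produces one header row and two detail rows that repeat the order_id and date_dim_id, which is expected at the order-line grain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables: one header row per order, one detail row per order line.
CREATE TABLE fact_order_header (
    order_id          INTEGER PRIMARY KEY,
    date_dim_id       INTEGER NOT NULL,
    order_total_price INTEGER NOT NULL
);
CREATE TABLE fact_order_detail (
    order_id       INTEGER NOT NULL,
    product_dim_id INTEGER NOT NULL,
    date_dim_id    INTEGER NOT NULL,
    order_quantity INTEGER NOT NULL,
    line_price     INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_dim_id)
);
INSERT INTO fact_order_header VALUES (1, 20120220, 70);
INSERT INTO fact_order_detail VALUES
    (1, 11, 20120220, 2, 40),   -- same order_id and date_dim_id ...
    (1, 12, 20120220, 1, 30);   -- ... repeated once per product
""")

detail_rows = conn.execute(
    "SELECT COUNT(*) FROM fact_order_detail WHERE order_id = 1"
).fetchone()[0]
line_total = conn.execute(
    "SELECT SUM(line_price) FROM fact_order_detail WHERE order_id = 1"
).fetchone()[0]
header_total = conn.execute(
    "SELECT order_total_price FROM fact_order_header WHERE order_id = 1"
).fetchone()[0]
print(detail_rows, line_total, header_total)
```

Product-level questions (quantity sold per product) go to the detail table, while order-level questions (number of orders, order-level discounts) go to the header table.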

Rating System Database Structure

I have two entity groups: Restaurants and Users. Restaurants can be rated (1-5) by users, and the rating from each user should be retrievable.
Restaurant(id, name, ..... , total_number_of_votes, total_voting_points )
User (id, name ...... )
Rating (id, restaurant_id, user_id, rating_value)
Do I need to store the avg value so that it need not be calculated every time? Which table is the best place to store avg_rating, total_no_of_votes, total_voting_points?
Well, if you store the average value somewhere; it will only be accurate as of the last time you calculated it. (i.e. you have 5 reviews; then store the averages somewhere. You get 5 more new reviews, and then your saved average is incorrect).
My opinion is that this sort of logic is perfectly suited to a middle-tier. Calculating an average shouldn't be very resource intensive, and really shouldn't impact performance.
If you really want to store it in the database; I would probably store them in their own table, and update those values via triggers. However, this could be even more resource intensive than calculating it in the middle-tier.
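For illustration, here is roughly what the trigger approach could look like, run through Python's sqlite3 module. The table restaurant_rating_totals and the trigger name are assumptions, and only inserts are handled; updates and deletes of ratings would need their own triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE restaurant (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE rating (
    id            INTEGER PRIMARY KEY,
    restaurant_id INTEGER NOT NULL REFERENCES restaurant(id),
    user_id       INTEGER NOT NULL,
    rating_value  INTEGER NOT NULL CHECK (rating_value BETWEEN 1 AND 5),
    UNIQUE (restaurant_id, user_id)    -- assumes one rating per user per restaurant
);
-- Hypothetical separate table holding the running totals.
CREATE TABLE restaurant_rating_totals (
    restaurant_id         INTEGER PRIMARY KEY REFERENCES restaurant(id),
    total_number_of_votes INTEGER NOT NULL DEFAULT 0,
    total_voting_points   INTEGER NOT NULL DEFAULT 0
);
-- Keep the totals in step with every new rating.
CREATE TRIGGER rating_after_insert AFTER INSERT ON rating
BEGIN
    INSERT OR IGNORE INTO restaurant_rating_totals (restaurant_id)
    VALUES (NEW.restaurant_id);
    UPDATE restaurant_rating_totals
    SET total_number_of_votes = total_number_of_votes + 1,
        total_voting_points   = total_voting_points + NEW.rating_value
    WHERE restaurant_id = NEW.restaurant_id;
END;

INSERT INTO restaurant VALUES (1, 'Example Diner');
INSERT INTO rating (restaurant_id, user_id, rating_value)
VALUES (1, 1, 5), (1, 2, 3), (1, 3, 4);
""")

votes, points = conn.execute(
    "SELECT total_number_of_votes, total_voting_points "
    "FROM restaurant_rating_totals WHERE restaurant_id = 1"
).fetchone()
average = points / votes
print(votes, points, average)
```

Storing the two totals rather than the average itself means the average is always derivable and never stale, at the cost of one extra write per rating.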
Some databases, for example PostgreSQL, allow you to store an array as part of a row, e.g.
create table restaurants (
...,
ratings integer[],
...
);
So you could, for example, keep the last 5 ratings in the same row as the restaurant. When you get a new rating, shuffle the old ratings left, and add the new rating at the end, then calculate the average.
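The shuffle-and-append step can be sketched in a few lines of Python. The function name add_rating and the keep parameter are assumptions for illustration. Note that the result is the average of the most recent ratings only, not of every rating the restaurant has received.

```python
from collections import deque

def add_rating(recent, new_rating, keep=5):
    """Append new_rating, drop the oldest beyond `keep`, return (ratings, average)."""
    # A bounded deque discards items from the left once it is full, which is
    # the "shuffle the old ratings left" step.
    window = deque(recent, maxlen=keep)
    window.append(new_rating)
    ratings = list(window)
    return ratings, sum(ratings) / len(ratings)

# The stored array holds the last five ratings; a new rating of 5 arrives.
ratings, average = add_rating([4, 5, 3, 4, 2], 5)
print(ratings, average)
```

The returned list is what would be written back to the ratings column of the restaurant row.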

Resources