Unit Price and Discounts - Fact or Dimension Table - data-warehouse

I'm working on a datamart for our sales and marketing departments, and I've come across a modeling challenge. Our ERP stores pricing data in a few different ways:
List pricing for each item
A discount percentage from list pricing for a product line, either for groups of customers or for a specific account
A custom price for an item, either for groups of customers or for a specific account
The Pricing department primarily uses this data operationally, not analytically. For example, they generate reports for customers ("What special pricing / discount %s do I have?") and identify which items / item groups need to be changed when they engage in a new pricing strategy.
Pricing changes happen somewhat regularly on a small scale, usually on a customer-by-customer or item-by-item basis. Infrequently, there are large-scale adjustments to list pricing and group pricing (discounts and individual items) in addition to the customer-level discounts.
My head has been in creating one or more fact tables to represent this process. Unfortunately, there's no pre-existing business key for pricing. There's also no specific "transaction date," since the ERP doesn't (accurately) maintain records of when pricing is changed. Essentially, a "pricing event" is going to be a combination of:
Effective date
End date
Item OR product line
(Not required for list price) customer or customer group
A price amount OR discount percentage
A single fact table seems problematic in that I'm going to have to deal with a lot of invalid combinations of dimensions and facts. First, a record will never have both a non-NULL price amount and a non-NULL discount percentage; pricing events are either-or. Second, only certain combinations of dimensions are valid for each fact. For example, a discount percentage will only ever have a product line, not an individual item.
Does it make sense to model pricing as a fact table in the first place? If so, how many tables should I be considering? My intuition is to use at least two, one for the percentages and one for the price amounts, but this still leaves a problem where each record will either have a valid customer group OR a valid customer (or neither, for list prices), since we need to maintain customer-specific pricing separate from any group pricing that customer might have.

You may need to keep them both as attributes and as facts.
The price a certain item was sold for is a fact. When you multiply it by the quantity sold it's actually an additive measure. So, keep it in the fact table. Total discount applied is also additive, I'd keep it. You can later query "how much was discounted in 2019 per customer", which would be much harder to achieve without those facts.
But if you also need to query things like "what discount is customer X on", then you should also keep that as an attribute of the customer dimension, and treat it as a type 2 slowly changing dimension so as to keep discount history. If you know when a certain discount was applied, great; if not, take the first sale as the start date and you won't be too far off.
Maybe the list price can also be kept as an attribute of product or product line in a dimension, but only if they don't change too often; but if most customers get discounts anyway that would be of limited use.
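As a rough illustration of keeping the discount both as a fact and as a type 2 dimension attribute, here is a minimal sketch using Python's built-in sqlite3 module. The table names (dim_customer, fact_sales), the column names, the '9999-12-31' open-ended marker and the sample rows are all assumptions made up for the example, not anything taken from the ERP in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key, one per version of the customer
    customer_id  TEXT NOT NULL,        -- business key (hypothetical)
    discount_pct REAL NOT NULL,        -- type 2 attribute: a change creates a new row
    valid_from   TEXT NOT NULL,
    valid_to     TEXT NOT NULL         -- '9999-12-31' marks the current version
);
CREATE TABLE fact_sales (
    sale_date       TEXT NOT NULL,
    customer_key    INTEGER NOT NULL REFERENCES dim_customer(customer_key),
    quantity        INTEGER NOT NULL,
    unit_price      REAL NOT NULL,     -- price actually charged
    discount_amount REAL NOT NULL      -- additive: total discount on the line
);
""")
conn.executemany("INSERT INTO dim_customer VALUES (?,?,?,?,?)", [
    (1, "C1", 5.0, "2018-01-01", "2019-06-30"),
    (2, "C1", 10.0, "2019-07-01", "9999-12-31"),
    (3, "C2", 0.0, "2018-01-01", "9999-12-31"),
])
conn.executemany("INSERT INTO fact_sales VALUES (?,?,?,?,?)", [
    ("2019-03-10", 1, 10, 95.0, 50.0),
    ("2019-09-02", 2, 4, 90.0, 40.0),
    ("2019-09-15", 3, 1, 100.0, 0.0),
    ("2020-01-05", 2, 2, 90.0, 20.0),
])

# "How much was discounted in 2019 per customer": a plain SUM over the fact.
discount_2019 = conn.execute("""
    SELECT c.customer_id, SUM(f.discount_amount)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    WHERE f.sale_date BETWEEN '2019-01-01' AND '2019-12-31'
    GROUP BY c.customer_id
    ORDER BY c.customer_id
""").fetchall()

# "What discount is customer C1 on": read the current dimension row.
current_discount = conn.execute("""
    SELECT discount_pct FROM dim_customer
    WHERE customer_id = 'C1' AND valid_to = '9999-12-31'
""").fetchone()[0]
```

The first query never touches the discount percentage, and the second never touches the fact table, which is the point of storing the value in both places.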

Related

Dealing with multiple fact tables concerning related processes in dimensional modeling

I have the following scenario where OLTP sales data is stored in two separate physical tables:
Sales
Refunds/Cancellations
A refund always refers to an existing sale (thus 'negating' it), though the dimensions of these tables are nearly the same (date, sales clerk, store, etc.). The data schema looks something like the following:
CREATE TABLE sale
(
sale_id uuid,
transaction_at timestamp with time zone,
store_id uuid,
clerk_id uuid,
clerk_number bigint,
currency character varying(3),
pos_id uuid,
total numeric,
net_total numeric
);
CREATE TABLE refund
(
refund_id uuid,
sale_id uuid, -- referenced sale to void
refunded_at timestamp with time zone,
pos_id uuid,
clerk_id uuid,
clerk_number bigint
);
I am trying to figure out how to model this data for ingestion into a DW. Since I am relatively new to dimensional modeling, I have begun reading The Data Warehouse Toolkit, but I am as yet unsure of the best approach to handle this case.
To my mind, these are two separate fact tables describing two different business processes (e.g. making a sale and getting a refund), though due to normalization concerns the refunds table (besides containing most of the same dimensions) is basically a pointer to a row in the sales table (which is fine for OLTP).
Analytical reports down the line would obviously want to look at these in a few ways:
All net sales per dimension (gross sales minus refunds)
All refund amounts per dimension
Other potential business use cases
As is, the first two cases would require joining the fact tables to either subtract the sales amount (case 1) or to get the information on refund amounts (case 2).
The approach that seems to make the most sense for me is something like the following (via some ETL/ELT process):
Load the (gross) sales data into a table in the DW
Load and denormalize refund data into a table in the DW, joining actual sale data so that amounts etc. are located in the fact table
Join either table with common conformed dimensions for further roll-ups and querying
This makes sense to me because:
Both fact tables have all required information from the physical event, and
There is no explicit dependency between the fact tables, and
Common dimensions can be reused
However, in this case, I still would not be able to get the net sales without joining these two tables. This makes me think that there should be a separate net_sale fact table, but this is problematic:
From a business point of view, sales without refunds are the vast majority of events that occur. A net_sale table would copy basically 99% of all sale data.
From a business process point of view, this table would describe an event that does not exist as such (there is no "net sale", only an aggregated view of sales amount per dimension minus refund costs).
Skimming the third chapter of The Data Warehouse Toolkit, I do not see this case mentioned explicitly (though there might be some parallels w.r.t. factless fact tables and derived facts). What kind of approach would work in a case like this?
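To make the trade-off concrete, here is a minimal sketch of the two-fact-table approach described above, using Python's sqlite3 module rather than PostgreSQL. It assumes the refund fact has already been denormalized during ETL so that it carries the store and the refunded amount copied from the referenced sale; the names fact_sale, fact_refund and refunded_total, and the sample rows, are hypothetical. Net sales are then computed at query time by stacking both facts with UNION ALL instead of materializing a net_sale table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sale (
    sale_id   TEXT PRIMARY KEY,
    store_id  TEXT NOT NULL,
    sale_date TEXT NOT NULL,
    total     REAL NOT NULL
);
CREATE TABLE fact_refund (
    refund_id      TEXT PRIMARY KEY,
    sale_id        TEXT NOT NULL,   -- degenerate reference to the original sale
    store_id       TEXT NOT NULL,   -- copied from the sale during ETL (assumption)
    refund_date    TEXT NOT NULL,
    refunded_total REAL NOT NULL    -- copied from sale.total during ETL (assumption)
);
""")
conn.executemany("INSERT INTO fact_sale VALUES (?,?,?,?)", [
    ("s1", "A", "2020-01-01", 100.0),
    ("s2", "A", "2020-01-01", 50.0),
    ("s3", "B", "2020-01-02", 30.0),
])
conn.execute(
    "INSERT INTO fact_refund VALUES (?,?,?,?,?)",
    ("r1", "s2", "A", "2020-01-03", 50.0),
)

# Case 1: net sales per store. Refunds enter as negative amounts, so no
# row-level join between the two fact tables is needed.
net_by_store = conn.execute("""
    SELECT store_id, SUM(amount)
    FROM (
        SELECT store_id, total AS amount FROM fact_sale
        UNION ALL
        SELECT store_id, -refunded_total FROM fact_refund
    ) AS movements
    GROUP BY store_id
    ORDER BY store_id
""").fetchall()

# Case 2: refund amounts per store come from the refund fact alone.
refunds_by_store = conn.execute("""
    SELECT store_id, SUM(refunded_total)
    FROM fact_refund
    GROUP BY store_id
    ORDER BY store_id
""").fetchall()
```

Whether the stacked query lives in a view or in the reporting layer is a separate design choice that this sketch does not settle.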

Rails & Postgres: Best practice for many boolean relationship entries

I'm looking for input on what's best practice for my problem. In abstract terms, I need to store a lot of relationships and need to prioritise between database size, query speed and “ease of maintenance”. The setup is Ruby on Rails with PostgreSQL.
More specific description: Imagine a website that's essentially a searchable database of Products sold by Vendors. Vendors may not ship worldwide, and I want to filter out Products by Vendors who won't ship to a user based on their GeoIP country. To make matters slightly more complex, a Product page can have two different features (let’s call them F1 and F2) that are separately geo-IP-dependent.
Example: A Vendor may want their Product pages to have feature F1 for all countries worldwide except a few e.g. because of embargos; but feature F2 may only be available for countries within e.g. Europe.
Country filters will always be set at the Vendor level.
The “search” function of the website is a basic SQL search in which I want Products to show up if at least one of the features is available for the current user’s country.
The website will allow in the range of 1,000 to 3,000 Vendors (this is a hard limit), and a total of around 10,000 to 50,000 Products. Let's assume that in the beginning, filtering is only relevant to around 100 Vendors.
I had the following ideas and hope that others have feedback on these, or additional approaches:
One relation model CountryVendor with two boolean columns (in which case optionally, a Product could still be shown if the respective country_vendor does not yet exist; i.e. show if !country_vendor&.allow).
Assuming ca. 200 countries, this would imply ca. 2,000 rows in the beginning, and around 600k rows if filters were in place for every one of the 3,000 potential Vendors.
(Theoretically, if non-existence is treated as true, I could also set up a rake task that removes rows that are false for both features, thus reining in the table size.)
Two relation models CountryVendorF1 and CountryVendorF2, each with just one boolean column. Not sure if this will effectively be much different, but I imagine it closer to how I think of the UI for setting up the country filters (without going into detail here).
Two JSON columns in the Vendors table that would store true/false for each Country. (Maybe with an ISO code string as the index for simplicity.) There wouldn’t be thousands of new rows, although the DB would still grow in size, but querying might become slow.
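For comparison, the first idea (one CountryVendor relation with two boolean columns, where a missing row means "allowed") could be sketched as below. This uses Python's sqlite3 module instead of Rails and PostgreSQL purely to keep the example self-contained; the table and column names (country_vendors, allow_f1, allow_f2) and the sample data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vendors  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE products (
    id        INTEGER PRIMARY KEY,
    vendor_id INTEGER NOT NULL REFERENCES vendors(id),
    name      TEXT NOT NULL
);
CREATE TABLE country_vendors (
    vendor_id    INTEGER NOT NULL REFERENCES vendors(id),
    country_code TEXT NOT NULL,      -- ISO 3166-1 alpha-2
    allow_f1     INTEGER NOT NULL,   -- 0/1 stands in for a boolean
    allow_f2     INTEGER NOT NULL,
    PRIMARY KEY (vendor_id, country_code)
);
""")
conn.executemany("INSERT INTO vendors VALUES (?,?)", [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO products VALUES (?,?,?)", [(1, 1, "Anvil"), (2, 2, "Widget")])
# Acme has filters for two countries; Globex has no filter rows at all.
conn.executemany("INSERT INTO country_vendors VALUES (?,?,?,?)", [
    (1, "DE", 1, 0),
    (1, "IR", 0, 0),
])

def visible_products(country_code):
    """Products whose vendor allows F1 or F2 here, or has no filter row for the country."""
    rows = conn.execute("""
        SELECT p.name
        FROM products p
        LEFT JOIN country_vendors cv
               ON cv.vendor_id = p.vendor_id AND cv.country_code = ?
        WHERE cv.vendor_id IS NULL OR cv.allow_f1 = 1 OR cv.allow_f2 = 1
        ORDER BY p.name
    """, (country_code,)).fetchall()
    return [name for (name,) in rows]
```

With the composite primary key on (vendor_id, country_code), the join is a single index lookup per product, so even the 600k-row worst case mentioned above should remain a modest table.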

ER Model representing entities not stored in DB and user choice

I'm trying to create an ER diagram of a simple retail chain type database model. You have your customer, the various stores, inventory etc.
My first question is, how to represent a customer placing an order in a store. If the customer is a discount card holder, the company has their name, address etc, so I can have a cardHolder entity connect to item and store with an order relationship. But how do I represent an order being placed by a customer who is not really an entity in the database?
Secondly, how are conditional... stuff represented in ER diagrams, e.g. in a car dealership, a customer may choose one or more optional extra when buying a car. I would think that there is a Car entity with the relevant attributes and the options as a multi-valued attribute, but how do you represent a user picking those options (I.e. order table shows the car ordered, extras chosen and the added cost of extras) in the order relationship?
First, do you really need to model customers as distinct entities, or do you just need order, payment and delivery details? Many retail systems don't track individual customers. If you need to, you can have a customer table with a surrogate key and unique constraints on identifying attributes like SSN or discount card number (even if those attributes are optional). It's generally hard to prevent duplication in customer tables since there's no ideal natural key for people, so consider whether this is really required.
How to model optional extras depends on what they depend on. Some extras might be make- or model-specific, e.g. the choice of certain colors or manual/automatic transmission. Extended warranties might be available across the board.
Here's an example of car-specific optional extras:
car (car_id PK, make, model, color, vin, price, ...)
car_extras (extra_id PK, car_id FK, option_name, price)
order (order_id PK, date_time, car_id FK, customer_id FK, payment_id FK, discount)
order_extras (order_id PK/FK, car_id FK, extra_id PK/FK)
I excluded price totals since those can be calculated via aggregate queries.
In my example, order_extras.car_id is redundant, but supports better integrity via the use of composite FK constraints (i.e. (order_id, car_id) references the corresponding columns in order, and (car_id, extra_id) references the corresponding columns in car_extras to prevent invalid extras from being linked to an order).
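A minimal runnable sketch of those composite FK constraints, using Python's sqlite3 module. A few things are assumptions for the sake of the example: the order table is named orders because ORDER is a reserved word, most descriptive columns are omitted, and the UNIQUE constraints are added because a composite foreign key needs a unique target.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE car (car_id INTEGER PRIMARY KEY, make TEXT, model TEXT);
CREATE TABLE car_extras (
    extra_id    INTEGER PRIMARY KEY,
    car_id      INTEGER NOT NULL REFERENCES car(car_id),
    option_name TEXT NOT NULL,
    price       REAL NOT NULL,
    UNIQUE (car_id, extra_id)        -- target of the composite FK below
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    car_id   INTEGER NOT NULL REFERENCES car(car_id),
    UNIQUE (order_id, car_id)        -- target of the composite FK below
);
CREATE TABLE order_extras (
    order_id INTEGER NOT NULL,
    car_id   INTEGER NOT NULL,
    extra_id INTEGER NOT NULL,
    PRIMARY KEY (order_id, extra_id),
    FOREIGN KEY (order_id, car_id) REFERENCES orders (order_id, car_id),
    FOREIGN KEY (car_id, extra_id) REFERENCES car_extras (car_id, extra_id)
);
""")
conn.executemany("INSERT INTO car VALUES (?,?,?)", [(1, "Volvo", "V60"), (2, "Fiat", "500")])
conn.executemany("INSERT INTO car_extras VALUES (?,?,?,?)", [
    (10, 1, "sunroof", 900.0),
    (20, 2, "towbar", 400.0),
])
conn.execute("INSERT INTO orders VALUES (100, 1)")  # order 100 is for car 1

def try_add_extra(order_id, car_id, extra_id):
    """Return True if the extra could be linked to the order, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO order_extras VALUES (?,?,?)", (order_id, car_id, extra_id))
        return True
    except sqlite3.IntegrityError:
        return False

ok = try_add_extra(100, 1, 10)          # sunroof belongs to car 1
wrong_extra = try_add_extra(100, 1, 20) # towbar belongs to car 2
wrong_car = try_add_extra(100, 2, 20)   # order 100 is not for car 2

# Totals are derived with an aggregate query rather than stored.
extras_total = conn.execute("""
    SELECT SUM(e.price)
    FROM order_extras oe
    JOIN car_extras e ON e.car_id = oe.car_id AND e.extra_id = oe.extra_id
    WHERE oe.order_id = 100
""").fetchone()[0]
```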
Here's an ER diagram for the tables above:
First, as per your thought you can definitely have two kinds of customers. Discount card holders whose details are present with the company and new customers whose details aren't available with the company.
There are three possible ways to achieve what you are trying to do:
1) Have two different order tables in the system (which I personally wouldn't suggest).
2) Have a single Order table in the system and look up customer details only for those who are discount card holders.
3) Insert a row in the discount card holder table for new/unregistered customers, keeping only one order table in the system.
Having a single order table would make the system standardized and would be more convenient while performing many other operations.
Secondly, to address your concern, you need to follow normalization. It will reduce the problem you are facing, remove redundancy from the system, and keep the entities lightweight, which will directly affect performance as the data grows.
The extra chosen items can be listed in the order against the customer by adding it at the time of generating a bill using foreign key. Dealing with keys will result in fast and robust results instead of storing redundant/repeating details at various places.
By following normalization, the problem can be handled by applying foreign keys wherever you want to refer data to avoid problems or errors.
Preferably, aim for 4NF. Have a look at the following link to get started with normalization.
http://www.w3schools.in/dbms/database-normalization/

E-R diagram confusion

I am in the process of designing this E-R diagram for a shop of which I have shown part of below (the rest is not relevant). See the link please:
E-R diagram
The issue that I have is that the shop only sells two items, Socks and Shoes.
Have I correctly detailed this in my diagram? I'm not sure if my cardinalities and/or my design is correct. A customer has to buy at least one of these items for the order to exist (but has the liberty to buy any number).
The Shoe and Sock entities would have their respective ID attribute, and I am planning to translate to a relational schema like this:
(I forgot to add to my diagram the ORDER_CONTAINS relationship to have an attribute called "Quantity". )
Table: Order_Contains
ORDER_ID    | SHOEID            | SOCKID            | QTY
primary key | FK, could be null | FK, could be null | INT
This clearly won't work since the Qty would be meaningless. Is there a way I can reduce the products to just two products and make all this work?
Having two one-to-many relationships combined into one with nullable fields is a poor design. How would you record an order containing both shoes and socks - a row per shoe with SOCKID set to NULL and vice-versa for socks, or would you combine rows? In the former case the meaning of QTY is clear though it depends on the contents of SHOEID/SOCKID fields, but what would the QTY mean in the latter case? How would you deal with rows where both SHOEID and SOCKID are NULL and the QTY is positive? Keep in mind Murphy's law of databases - if it can be recorded it will be. Worse, your primary key (ORDER_ID) will prevent you from recording more than one row, so a customer couldn't buy more than one (pair of) socks or shoes.
A better design would be to have two separate relations:
Order_Socks (ORDER_ID PK/FK, SOCKID PK/FK, QTY)
Order_Shoes (ORDER_ID PK/FK, SHOEID PK/FK, QTY)
With this, there's only one way to record the contents of an order and it's unambiguous.
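The two relations can be tried out with a small sketch like the one below, written with Python's sqlite3 module. The foreign keys to the order, sock and shoe tables are left out for brevity, and the CHECK on QTY is an addition that the answer does not mention.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Order_Socks (
    ORDER_ID INTEGER NOT NULL,
    SOCKID   INTEGER NOT NULL,
    QTY      INTEGER NOT NULL CHECK (QTY > 0),  -- assumption: no zero or negative lines
    PRIMARY KEY (ORDER_ID, SOCKID)
);
CREATE TABLE Order_Shoes (
    ORDER_ID INTEGER NOT NULL,
    SHOEID   INTEGER NOT NULL,
    QTY      INTEGER NOT NULL CHECK (QTY > 0),
    PRIMARY KEY (ORDER_ID, SHOEID)
);
""")
# Order 1 contains two pairs of sock 7 and one pair of shoe 3.
conn.execute("INSERT INTO Order_Socks VALUES (1, 7, 2)")
conn.execute("INSERT INTO Order_Shoes VALUES (1, 3, 1)")

# The composite key allows many items per order but only one row per item.
try:
    conn.execute("INSERT INTO Order_Socks VALUES (1, 7, 5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Full contents of an order: stack the two relations.
order_contents = conn.execute("""
    SELECT 'shoe' AS kind, SHOEID AS item_id, QTY FROM Order_Shoes WHERE ORDER_ID = :o
    UNION ALL
    SELECT 'sock', SOCKID, QTY FROM Order_Socks WHERE ORDER_ID = :o
    ORDER BY kind, item_id
""", {"o": 1}).fetchall()
```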
You have not explained very well the context here. I'll try to explain from what I understand, and give you some hints.
Does your shop only, and always (forever), sell 2 products? Do the details of these products (color, model, weight, width, etc.) need to be persisted in the database? If yes, then we have two entities in the model, SOCKS and SHOES. Each entity has its own properties. A purchase or an order is usually seen as an event on the ERD. If your customers always buy (or order) socks with shoes, then there will always be a link between three entities:
CLIENTS --- SHOES --- SOCKS
This connection / association / relationship is an event, and this would be the purchase (or order).
If a customer can buy separate shoes and socks, then socks and shoes are subtypes of a super entity, called PRODUCTS, and a purchase is an event between CUSTOMERS and PRODUCTS. Here in this case we have a partitioning relationship.
If however, your customers buy separate products, and your store will not sell forever only 2 products, and details of the products are not always the same and will not be saved as columns in a table, then the case is another.
Shoes and socks are considered products, as well as other items that can be considered in future. Thus, we have records/rows in a PRODUCTS table.
When a customer places an order (or makes a purchase), he or she is acquiring products. There is a strong link between customers and products here, again usually an event, which would be the purchase (or an order).
I do not know if you already do this, but before starting a diagram, write the problem context down on paper or in a document. Capture all the details present in the situation.
The entities are seen when they have properties. If you need to save the name of a customer, the customer's eye color, the customer's e-mail, and so on, then you will have certainly a CUSTOMER entity.
If you see that entities relate in some way, then you have a relationship, and you should ask yourself what kind of relationship these entities form. In your case of products and customers, we have a purchasing relationship between them. The established relationship is a purchase (or an order, as you call it). One customer can buy various products, and one product (not the physical unit on the shelf, but the type or model) can be purchased by several customers; thus, we have a many-to-many relationship.
The relationship created changes according to the context. Let's invent something crazy as an example. Say we have customers and products, and you want to persist a situation where customers lick products (something really crazy, just so you can see how the context defines the relationship).
There would be an intimate connection between the customer and product entities (really close... I think...). In this case, the relationship represents a history of customers licking products. This would generate an EVENT. In this event you could put properties such as the date, the number of times a customer licked a given product, the weather, the time, the traffic light color on the street, etc.; only what you need to persist according to your context and your needs.
Remember that for N-N relationships, we need to see whether new entities will emerge from the relationship. This usually happens when you are decomposing the conceptual model into the logical model. Product orders will probably generate not one but two entities: the ORDER and the products of orders. It is within the products of orders that you place the list of products ordered by each customer, and the quantity.
I would like to point you to various materials for studying ERDs, but unfortunately they are all in Portuguese. I hope I have helped you in some way. If you can be more specific about your problem, I think I can help you better. If anything is unclear, please ask.

Point of Sale and Inventory database schema

I’m trying to create a basic Point of Sale and Inventory management system.
Some things to take into account:
The products are always the same (same ID) through the whole system, but inventory (available units for sale per product) is unique per location. Location Y and Z may both have for sale units of product X, but if, for example, two units are sold from location Y, location Z’s inventory should not be affected. Its stocked units are still intact.
Selling one (1) unit of product X from location Y, means inventory of location Y should subtract one unit from its inventory.
From that, I thought of these tables:
locations
id
name
products
id
name
transactions
id
description
inventories_header
id
location_id
product_id
inventories_detail
inventories_id
transaction_id
unit_cost
unit_price
quantity
orders_header
id
date
total (calculated from orders_detail quantity * price; just for future data validation)
orders_detail
order_id
transaction_id
product_id
quantity
price
Okay, so, are there any questions? Of course.
How do I keep track of changes in unit cost? If some day I start paying more for a certain product, I would need to keep track of the marginal utility ((cost*quantity) - (price*quantity) = marginal utility) some way. I thought of inventories_detail mostly for this. I wouldn’t have cared otherwise.
Are the relationships well established? I still have a hard time deciding whether locations have inventories, or inventories have several locations. It’s maddening.
How would you keep/know your current stock levels? Since I had to separate the inventory table to keep up with cost updates, I guess I would just have to add up all the quantities stated in inventories_detail.
Do you have any suggestions you want to share?
I’m sure I still have some questions, but these are mostly the ones I need addressing. Also, since I’m using Ruby on Rails for the first time, actually, as a learning experience, it’s a shame to be stopped at design, not letting me punch through implementation quicker, but I guess that’s the way it should be.
Thanks in advance.
The tricky part here is that you're really doing more than a POS solution. You're also doing an inventory management & basic cost accounting system.
The first scenario you need to address is what accounting method you'll use to determine the cost of any item sold. The most common options would be FIFO, LIFO, or Specific Identification (all terms that can be Googled).
In all 3 scenarios, you should record your purchases of your goods in a data structure (typically called PurchaseOrder, but in this case I'll call it SourcingOrder to differentiate from your orders tables in the original question).
The structure below assumes that each sourcing order line will be for one location (otherwise things get even more complex). In other words, if I buy 2 widgets for store A and 2 for store B, I'd add 2 lines to the order with quantity 2 for each, not one line with quantity 4.
SourcingOrder
- order_number
- order_date
SourcingOrderLine
- product_id
- unit_cost
- quantity
- location_id
Inventory can be one level...
InventoryTransaction
- product_id
- quantity
- sourcing_order_line_id
- order_line_id
- location_id
- source_inventory_transaction_id
Each time a SourcingOrderLine is received at a store, you'll create an InventoryTransaction with a positive quantity and FK references to the sourcing_order_line_id, product_id and location_id.
Each time a sale is made, you'll create an InventoryTransaction with a negative quantity and FK references to the order_line_id, product_id, location_id and source_inventory_transaction_id.
The source_inventory_transaction_id would be a link from the negative quantity InventoryTransaction back to the positive quantity InventoryTransaction, chosen using whichever accounting method you pick.
Current inventory for a location would be:
SELECT sum(quantity) FROM inventory_transactions WHERE product_id = ? AND location_id = ?
GROUP BY product_id, location_id;
Marginal cost would be calculated by tracing back from the sale, through the 2 related inventory transactions to the SourcingOrder line.
NOTE: You have to handle the case where you allocate one order line across 2 inventory transactions because the ordered quantity was larger than what was left in the next inventory transaction to be allocated. This data structure will handle this, but you'll need to work out the logic and query yourself.
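One possible shape for that allocation logic, assuming FIFO, is sketched below with Python's sqlite3 module. It simplifies the structure above in ways that are assumptions of the example: unit_cost is copied onto the receipt row instead of being looked up through SourcingOrderLine, the order and sourcing order references are omitted, and the helper names receive, sell and stock are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE inventory_transaction (
    id          INTEGER PRIMARY KEY,
    product_id  INTEGER NOT NULL,
    location_id INTEGER NOT NULL,
    quantity    INTEGER NOT NULL,   -- positive = receipt, negative = sale
    unit_cost   REAL,               -- receipts only (copied from the sourcing line)
    source_inventory_transaction_id INTEGER REFERENCES inventory_transaction(id)
)""")

def receive(product_id, location_id, quantity, unit_cost):
    conn.execute(
        "INSERT INTO inventory_transaction (product_id, location_id, quantity, unit_cost) "
        "VALUES (?,?,?,?)", (product_id, location_id, quantity, unit_cost))

def sell(product_id, location_id, quantity):
    """Allocate a sale against the oldest receipts first; return the cost of goods sold."""
    receipts = conn.execute("""
        SELECT r.id, r.unit_cost,
               r.quantity + COALESCE((SELECT SUM(s.quantity)
                                      FROM inventory_transaction s
                                      WHERE s.source_inventory_transaction_id = r.id), 0)
        FROM inventory_transaction r
        WHERE r.product_id = ? AND r.location_id = ? AND r.quantity > 0
        ORDER BY r.id               -- insertion order stands in for receipt date
    """, (product_id, location_id)).fetchall()
    if sum(remaining for _, _, remaining in receipts) < quantity:
        raise ValueError("insufficient stock")
    cost, left = 0.0, quantity
    for receipt_id, unit_cost, remaining in receipts:
        take = min(left, remaining)
        if take == 0:
            continue
        # One negative row per receipt consumed, so a sale can span receipts.
        conn.execute(
            "INSERT INTO inventory_transaction "
            "(product_id, location_id, quantity, source_inventory_transaction_id) "
            "VALUES (?,?,?,?)", (product_id, location_id, -take, receipt_id))
        cost += take * unit_cost
        left -= take
    return cost

def stock(product_id, location_id):
    return conn.execute(
        "SELECT COALESCE(SUM(quantity), 0) FROM inventory_transaction "
        "WHERE product_id = ? AND location_id = ?", (product_id, location_id)).fetchone()[0]

receive(1, 1, 2, 5.0)   # location 1: two units at 5.00
receive(1, 1, 3, 6.0)   # location 1: three units at 6.00
receive(1, 2, 4, 5.0)   # location 2 keeps its own stock
cogs = sell(1, 1, 3)    # consumes 2 units at 5.00 and 1 unit at 6.00
```

A sale of three units here produces two negative rows, one per receipt it draws from, which is the split case the note describes.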
Brian is correct. Just to add some additional info: if you are building a complete system for your business or client, I would suggest that you start at the organizational level and work down to the POS and accounting processes. That would make your database experience more extensive. In my experience in system development, inventory modules always start from stock taking + (purchases - purchase returns) = SKUs available for sale. The POS is not directly attached to the inventory module; rather, it is reconciled daily by the sales supervisor. Total daily sales quantities are then deducted from the SKUs available for sale. You will also need to work out the costing and pricing modules. Correct normalization of the database is always a must.
