How to create a dimensional model with different metrics depending on the hierarchical level - data-warehouse

I need to create a dimensional environment for sales analysis for a retail company.
The hierarchy that will be present in my Sales fact is:
1 - Country
1.1 - Region
1.1.1 - State
1.1.1.1 - City
1.1.1.1.1 - Neighbourhood
1.1.1.1.1.1 - Store
1.1.1.1.1.1.1 - Section
1.1.1.1.1.1.1.1 - Category
1.1.1.1.1.1.1.1.1 - Subcategory
1.1.1.1.1.1.1.1.1.1 - Product
Metrics such as Number of Sales, Revenue and Medium Ticket (Revenue / Number of Sales) make sense up to the Subcategory level, because if I go down to the Product level the aggregation will need to change (I guess).
Also, metrics such as Productivity, which is Revenue / Number of Store Staff, won't make sense in this fact table, because they only work up to the Store level (also, I guess).
I'd like to know the best way to resolve this, because all of these metrics are about Sales, but some make sense only down to a specific level of my hierarchy and others don't.
Waiting for the reply and Thanks in advance!

You should split your hierarchy into 2 dimensions, Stores and Products.
The Stores dimension is all about the location of the sale, and you can put the number of employees in this dimension:
Store_Key  STORE   Neighbourhood  City  Country  Num_Staff
1          Store1  4th Street     LA    US       10
2          Store2  Main Street    NY    US       2
The Products dimension looks like:
Product_Key  Prod_Name      SubCat    Category   Unit_Cost
1            Cheese Sticks  Dairy     Food       $2.00
2            Timer          Software  Computing  $25.00
Then your fact table has a record for each sale, keyed to the above dimensions:
Store_Key  Product_Key  Date       Quantity  Tot_Amount
1          1            31/7/2014  5         $10.00      (store1 sells 5 cheese)
1          2            31/7/2014  1         $25.00      (store1 sells 1 timer)
2          1            31/7/2014  3         $6.00       (store2 sells 3 cheese)
2          2            31/7/2014  1         $25.00      (store2 sells 1 timer)
Now that your data is in place you can use your reporting tool to get the measures you need. Example SQL is something like below
SELECT store.STORE,
       SUM(fact.Tot_Amount) AS revenue,
       COUNT(*) AS num_sales,
       SUM(fact.Tot_Amount) / store.Num_Staff AS Productivity
FROM tbl_Store store, tb_Fact fact
WHERE fact.Store_Key = store.Store_Key
GROUP BY store.STORE, store.Num_Staff
This should return the following result:
STORE   revenue  num_sales  Productivity
Store1  $35.00   2          3.5
Store2  $31.00   2          15.5
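As a rough sketch of how the two dimensions let each metric be computed at the level where it makes sense, the sample rows above can be loaded into an in-memory SQLite database from Python. The table names dim_store, dim_product and fact_sales are illustrative assumptions, not names from the answer, and SQLite only stands in for whatever warehouse engine is actually used. Productivity is grouped through the store dimension, while Medium Ticket (Revenue / Number of Sales) is grouped through the product dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, store TEXT, num_staff INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, prod_name TEXT,
                          subcat TEXT, category TEXT);
CREATE TABLE fact_sales  (store_key INTEGER, product_key INTEGER,
                          quantity INTEGER, tot_amount REAL);
INSERT INTO dim_store   VALUES (1, 'Store1', 10), (2, 'Store2', 2);
INSERT INTO dim_product VALUES (1, 'Cheese Sticks', 'Dairy', 'Food'),
                               (2, 'Timer', 'Software', 'Computing');
INSERT INTO fact_sales  VALUES (1, 1, 5, 10.0), (1, 2, 1, 25.0),
                               (2, 1, 3, 6.0),  (2, 2, 1, 25.0);
""")

# Store-level metric: needs num_staff, so it is grouped through the store dimension.
productivity = conn.execute("""
    SELECT s.store, SUM(f.tot_amount) / s.num_staff
    FROM fact_sales f JOIN dim_store s ON s.store_key = f.store_key
    GROUP BY s.store, s.num_staff
    ORDER BY s.store
""").fetchall()

# Product-hierarchy metric: revenue per sale, grouped at the Category level.
ticket = conn.execute("""
    SELECT p.category, SUM(f.tot_amount) / COUNT(*)
    FROM fact_sales f JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

print(productivity)  # [('Store1', 3.5), ('Store2', 15.5)]
print(ticket)        # [('Computing', 25.0), ('Food', 8.0)]
```

The same fact table serves both queries; only the dimension that is joined and the GROUP BY columns change.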


MS Access: compare multiple query results from one table against the results of a query on the same table

I am building an MS Access database to manage part numbers of mixtures. It's pretty much a bill of materials. I have a table, tblMixtures, that references itself in the PreMixtures field. I set this up so that a mixture can be a pre-mixture in another mixture, which can in turn be a pre-mixture in another mixture, etc. Each PartNumber in tblMixtures is related to many Components in tblMixtureComponents by the PartNumber. The Components and their associated data are stored in tblComponentData. I have put example data in the tables below.
tblMixtures
PartNumber  Description  PreMixtures
1           Mixture 1    4, 5
2           Mixture 2    4, 6
3           Mixture 3
4           Mixture 4    3
5           Mixture 5
6           Mixture 6
tblMixtureComponents
ID  PartNumber  Component  Concentration
1   1           A          20%
2   1           B          40%
3   1           C          40%
4   2           A          40%
5   2           B          30%
6   2           D          30%
tblComponentData
ID  Name  Density  Category
1   A     1.5      O
2   B     2        F
3   C     2.5      I
4   D     1        F
I have built the queries needed to pull the information together for the final mixture and even display the details of the pre-mixtures and components used for each mixture. However, with literally tens of thousands of part numbers, there can be a lot of overlap in pre-mixtures used for mixtures. In other words, Mixture 4 can be used as a pre-mixture for Mixture 1 and Mixture 2 and a lot more. I want to build a query that will identify all possible mixtures that can be used as a pre-mixture in a selected mixture. So I want a list of all the mixtures that have the same components or subset of components as the selected mixtures. The pre-mixture doesn’t have to have all the components in the mixture, but it can’t have any components that are not in the mixture.
If you haven't solved it yet...
The PreMixtures column storing a collection of data is a sign that you need to "Normalize" your database design a little more. If you are going to be getting premixture data from a query then you do not need to store this as table data. If you did, you would be forced to update the premix data every time your mixtures or components changed.
Also we need to address that tblMixtures doesn't have an id field. Consider the following table changes:
tblMixture:
id  description
1   Mixture 1
2   Mixture 2
3   Mixture 3
tblMixtureComponent:
id  mixtureId  componentId
1   1          A
2   1          B
3   1          C
4   2          A
5   2          B
6   2          D
7   3          A
8   4          B
I personally like to use column naming that exposes primary-to-foreign-key relationships: tblMixture.id is clearly related to tblMixtureComponent.mixtureId. I am lazy, so I would probably abbreviate everything too.
Now, as for the query, first let's get the components of mixture 1:
SELECT tblMixtureComponent.mixtureId, tblMixtureComponent.componentId
FROM tblMixtureComponent
WHERE tblMixtureComponent.mixtureId = 1
Should return:
mixtureId  componentId
1          A
1          B
1          C
We could change the WHERE clause to the id of any mixture we wanted. Next we need to get the ids of all mixtures that contain bad components, that is, components that mixture 1 does not have. So we will wrap a join around the last query:
SELECT tblMixtureComponent.mixtureId
FROM tblMixtureComponent LEFT JOIN
(SELECT tblMixtureComponent.mixtureId,
tblMixtureComponent.componentId
FROM tblMixtureComponent
WHERE tblMixtureComponent.mixtureId = 1) AS GoodComp
ON tblMixtureComponent.componentId = GoodComp.componentId
WHERE GoodComp.componentId Is Null
Should return:
mixtureId
2
Great, so now we have the ids of all the mixtures we don't want. Let's add another join to get the inverse:
SELECT tblMixture.id
FROM tblMixture LEFT JOIN
(SELECT tblMixtureComponent.mixtureId
FROM tblMixtureComponent LEFT JOIN
(SELECT tblMixtureComponent.mixtureId,
tblMixtureComponent.componentId
FROM tblMixtureComponent
WHERE tblMixtureComponent.mixtureId = 1) AS GoodComp
ON tblMixtureComponent.componentId = GoodComp.componentId
WHERE GoodComp.componentId Is Null) AS BadMix
ON tblMixture.id = BadMix.mixtureId
WHERE BadMix.mixtureId Is Null AND tblMixture.id <> 1
Should return:
id
3
4
What's left is the ids of all mixtures that share components with mixture 1 and contain no components that mixture 1 lacks.
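The nested LEFT JOIN approach can be checked with a small sketch that loads the sample rows into an in-memory SQLite database from Python. Several things here are assumptions rather than part of the answer: SQLite stands in for Access, the function name candidate_premixtures is made up for illustration, and tblMixture is given a row for mixture 4, since tblMixtureComponent references it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblMixture (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE tblMixtureComponent (id INTEGER PRIMARY KEY,
                                  mixtureId INTEGER, componentId TEXT);
-- Assumption: a row for mixture 4 exists, because the component table references it.
INSERT INTO tblMixture VALUES (1, 'Mixture 1'), (2, 'Mixture 2'),
                              (3, 'Mixture 3'), (4, 'Mixture 4');
INSERT INTO tblMixtureComponent VALUES
  (1, 1, 'A'), (2, 1, 'B'), (3, 1, 'C'),
  (4, 2, 'A'), (5, 2, 'B'), (6, 2, 'D'),
  (7, 3, 'A'), (8, 4, 'B');
""")

def candidate_premixtures(conn, target_id):
    """Return ids of mixtures whose components are all present in target_id."""
    sql = """
    SELECT m.id
    FROM tblMixture AS m
    LEFT JOIN (
        -- BadMix: mixtures holding at least one component the target lacks
        SELECT mc.mixtureId
        FROM tblMixtureComponent AS mc
        LEFT JOIN (
            SELECT componentId FROM tblMixtureComponent WHERE mixtureId = ?
        ) AS GoodComp ON mc.componentId = GoodComp.componentId
        WHERE GoodComp.componentId IS NULL
    ) AS BadMix ON m.id = BadMix.mixtureId
    WHERE BadMix.mixtureId IS NULL AND m.id <> ?
    ORDER BY m.id
    """
    return [row[0] for row in conn.execute(sql, (target_id, target_id))]

print(candidate_premixtures(conn, 1))  # [3, 4]
```

Note that a mixture with no component rows at all would also pass this filter, so it may be worth excluding empty mixtures separately.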
Sorry, I did this on a phone...

SPSS descriptives long data

I am trying to run descriptives (Means/frequencies) on my data that are in long format/repeated measures. So for example, for 1 participant I have:
Participant  Age
ID 1         25
ID 1         25
ID 1         25
ID 1         25
ID 2         30   (second participant, etc.)
So SPSS reads that as an N of 5 and uses that to compute the mean. I want SPSS to ignore repeated cases (Only read ID 1 data as one person, ignore the other 3). How do I do this?
Assuming the ages are always identical for all occurrences of the same ID, you should aggregate your data (Data => Aggregate) into a separate dataset that keeps only the first age for each ID. Then you can analyse age in the new dataset with no repetitions.
You can use this syntax:
DATASET DECLARE OneLinePerID.
AGGREGATE /OUTFILE='OneLinePerID' /BREAK=ID /age=first(age) .
dataset activate OneLinePerID.
means age.
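To make the effect of the aggregation concrete, here is a minimal Python sketch of the same idea (it is not SPSS syntax): keep the first age seen for each ID, then average over people rather than over rows. The rows mirror the example in the question.

```python
from statistics import mean

# Long-format rows: one row per measurement, so participants repeat.
rows = [("ID 1", 25), ("ID 1", 25), ("ID 1", 25), ("ID 1", 25), ("ID 2", 30)]

# Equivalent of /BREAK=ID /age=first(age): keep the first age per participant.
first_age = {}
for participant, age in rows:
    first_age.setdefault(participant, age)

naive_mean = mean([age for _, age in rows])       # averages over 5 rows
per_person_mean = mean(list(first_age.values()))  # averages over 2 people

print(naive_mean, per_person_mean)  # 26 27.5
```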

Joining four tables but excluding duplicates

I am trying to join four tables (users, user_payments, content_type and media_content) but I always get duplicates. Instead of seeing, for example, that user Smith purchased media_content_id_purchase 5011 for a price of 3.99 and streamed media_content_stream_id 5000 for a price of 0.001 per min, I get multiple combinations, such as media_content_id_purchase 5011 costing 3.99, 1.99, 6.99, etc., together with a media_content_id_stream that also has all sorts of prices.
This is my query:
select u.surname, up.media_content_id_purchase, ct.purchase_price, up.media_content_id_stream, ct.stream_price, ct.min_price
from users u, user_payments up, content_type ct, media_content mc
where u.user_ID=up.user_ID_purchase and
up.media_content_ID_purchase=mc.media_content_ID or up.media_content_ID_purchase is null and
ct.content_type_ID=mc.content_type_ID;
My goal is to display each user and what they have consumed with the corresponding prices.
Thanks!!!
Perhaps you should try using SELECT DISTINCT?
http://www.w3schools.com/sql/sql_distinct.asp
As you can see there, SELECT DISTINCT is supposed to return only the different (distinct) values.
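A small sketch of what SELECT DISTINCT does, using SQLite from Python. The purchases table and its rows are made up for illustration and are not the asker's schema. DISTINCT collapses rows that are identical in every selected column, so rows that differ only in price are still returned separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table: two identical rows plus one that differs in price.
CREATE TABLE purchases (surname TEXT, media_content_id INTEGER, price REAL);
INSERT INTO purchases VALUES ('Smith', 5011, 3.99),
                             ('Smith', 5011, 3.99),
                             ('Smith', 5011, 1.99);
""")

all_rows = conn.execute(
    "SELECT surname, media_content_id, price FROM purchases"
).fetchall()
distinct_rows = conn.execute(
    "SELECT DISTINCT surname, media_content_id, price FROM purchases ORDER BY price"
).fetchall()

print(len(all_rows))   # 3
print(distinct_rows)   # [('Smith', 5011, 1.99), ('Smith', 5011, 3.99)]
```

If the unwanted rows come from the join conditions pairing a purchase with several different prices, DISTINCT alone will probably not remove them, because those rows are not identical.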

Keep historical database relations integrity when data changes

I am hesitating between various alternatives when it comes to relations that have "historical" value.
For example, let's say a User has bought an item on a certain date. If I just store this the classic way, like:
transaction_id: 1
user_id: 2
item_id: 3
created_at: 01/02/2010
Then obviously the user might change their name, the item might change its price, and 3 years later, when I try to create a report of what happened, I have false data.
I have two alternatives:
1. Keep it stupid, as I showed earlier, but use something like https://github.com/airblade/paper_trail and do something like:
t = Transaction.find(1);
u = t.user.version_at(t.created_at)
2. Create tables like transaction_users and transaction_items and copy the users/items into these tables when a transaction is made. The structure would then become:
transaction_id: 1
transaction_user_id: 2
transaction_item_id: 3
created_at: 01/02/2010
Both approaches have their merits, though solution 1 looks much simpler. Do you see a problem with solution 1? How is this "historical data" problem usually solved? I have to solve this problem for 2-3 models like this in my project; what do you reckon would be the best solution?
Taking the example of item price, you could also:
Store a copy of the price at the time in the transaction table
Create a temporal table for item prices
Storing a copy of the price in the transaction table:
TABLE Transaction(
user_id -- User buying the item
,trans_date -- Date of transaction
,item_no -- The item
,item_price -- A copy of Price from the Item table as-of trans_date
)
Getting the price as of the time of transaction is then simply:
select item_price
from transaction;
Creating a temporal table for item prices:
TABLE item (
item_no
,etcetera -- All other information about the item, such as name, color
,PRIMARY KEY(item_no)
)
TABLE item_price(
item_no
,from_date
,price
,PRIMARY KEY(item_no, from_date)
,FOREIGN KEY(item_no)
REFERENCES item(item_no)
)
The data in the second table would look something like:
ITEM_NO FROM_DATE PRICE
======= ========== =====
A 2010-01-01 100
A 2011-01-01 90
A 2012-01-01 50
B 2013-03-01 60
This says that from the first of January 2010 the price of item A was 100. It changed to 90 on the first of January 2011, and then to 50 on the first of January 2012.
You will most likely add a TO_DATE to the table, even though it's a denormalization (the TO_DATE is the next FROM_DATE).
Finding the price as of the transaction would be something along the lines of:
select t.item_no
,t.trans_date
,p.price
from transaction t
join item_price p on(
t.item_no = p.item_no
and t.trans_date >= p.from_date
and t.trans_date < p.to_date -- to_date is the next from_date, so exclude it
);
ITEM_NO TRANS_DATE PRICE
======= ========== =====
A 2010-12-31 100
A 2011-01-01 90
A 2011-05-01 90
A 2012-01-01 50
A 2012-05-01 50
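The lookup can be tried end to end with an in-memory SQLite database from Python. This is only a sketch under a few assumptions: the to_date column is filled with the next from_date, open-ended prices use the placeholder '9999-12-31', the transaction table is named trans because TRANSACTION is a reserved word in SQLite, and the match uses a half-open interval (from_date inclusive, to_date exclusive) so that a sale on a boundary date matches exactly one price row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_price (item_no TEXT, from_date TEXT, to_date TEXT, price INTEGER,
                         PRIMARY KEY (item_no, from_date));
CREATE TABLE trans (item_no TEXT, trans_date TEXT);
-- to_date is the next from_date; '9999-12-31' is an assumed open-ended marker.
INSERT INTO item_price VALUES
  ('A', '2010-01-01', '2011-01-01', 100),
  ('A', '2011-01-01', '2012-01-01', 90),
  ('A', '2012-01-01', '9999-12-31', 50),
  ('B', '2013-03-01', '9999-12-31', 60);
INSERT INTO trans VALUES
  ('A', '2010-12-31'), ('A', '2011-01-01'), ('A', '2011-05-01'),
  ('A', '2012-01-01'), ('A', '2012-05-01');
""")

# ISO-8601 date strings compare correctly as text, so plain comparisons work.
rows = conn.execute("""
    SELECT t.item_no, t.trans_date, p.price
    FROM trans AS t
    JOIN item_price AS p
      ON t.item_no = p.item_no
     AND t.trans_date >= p.from_date
     AND t.trans_date <  p.to_date
    ORDER BY t.trans_date
""").fetchall()

for row in rows:
    print(row)
```

With these rows the output matches the result table above: 100 for 2010-12-31, 90 for both 2011 dates, and 50 for both 2012 dates.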
I went with PaperTrail; it keeps the history of all my models, even their destruction. I could always switch to point 2 later on if it doesn't scale.

Find nodes with the same relationships as the initial node

I have customers (id, name, type), commerces (id, name, type) and relationships between them (idcustomer, idcommerce, quantity) indicating that a customer has bought at a commerce, and the quantity.
I want to find nodes that have the same relationships as the origin node. That is, if customer 1 bought at commerces id=10 and id=11, I want to find the other customers who have bought at (at least) exactly those same commerces, in order to recommend the rest of their commerces.
Right now I have the following command, but it doesn't work: it returns all customers who have bought at any one of the commerces where customer 1 bought, not only those who bought at all of them.
START m=node:id(id="1") MATCH (m)-[:BUY]->(commerces)<-[:BUY]-(customers) RETURN customers;
Example
Customer 1 bought commerce 10, 11
Customer 2 bought commerce 10, 3
Customer 3 bought commerce 10, 11, 4
Customer 4 bought commerce 5, 8, 10
The return that I want is Customer 3 in order to recommend commerce 4.
Thank you.
Here is one solution.
The first query gets all of the commerces the start node m buys from; that is the collect(commerce) in the first WITH clause, aliased mCos.
The second query gets the commerces each customer shares with m; that is customerCommerces in the second WITH clause.
Then the WHERE clause eliminates the customers who share only a subset of m's commerces, so only the customers who share all of them with m are returned.
START m=node:id(id="1")
Match (m)-[:BUY]->(commerce)
With collect(commerce) as mCos
START m=node:id(id="1")
Match (m)-[:BUY]->(commerce)<-[:BUY]-(customer)
with mCos, customer, collect(commerce) as customerCommerces
Where length(mCos) = length(customerCommerces)
Return customer
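The logic of the query can be sketched outside the graph database with plain Python sets, using the example data from the question. The function name recommend is an illustrative assumption. A customer qualifies when the origin customer's commerces are a subset of theirs, and the commerces to recommend are the set difference.

```python
# customer id -> set of commerce ids they bought at (data from the question)
purchases = {1: {10, 11}, 2: {10, 3}, 3: {10, 11, 4}, 4: {5, 8, 10}}

def recommend(purchases, origin):
    """Map each customer who bought at all of origin's commerces
    to the extra commerces that origin has not bought at."""
    base = purchases[origin]
    result = {}
    for customer, commerces in purchases.items():
        # base <= commerces is the subset test: every commerce of origin is shared
        if customer != origin and base <= commerces:
            result[customer] = commerces - base
    return result

print(recommend(purchases, 1))  # {3: {4}}
```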
