Data Warehouse dimension for schedules (Dimensional Modeling)

I have not found an example or a way of building a dimension that contains schedule attributes. For example, in my scenario I'm building a data warehouse that will help to gather analytics on podcast/radio show episodes.
We have the following:
dim_episode
dim_podcast_show
dim_date
fact_user_daily_activity
I'm trying to add another dimension that contains schedule attributes for the podcast_show. For example, some shows air their episodes every day, others on Tuesdays and Thursdays, and others only on Saturdays.
dim_show_schedule (Option 1)
| schedule_key | show_key | time | sunday_flag | monday_flag | tuesday_flag | wednesday_flag | thursday_flag | friday_flag | saturday_flag |
|--------------|----------|-------|-------------|-------------|--------------|----------------|---------------|-------------|---------------|
| 1 | 0 | 00:30 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 12:30 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
| 3 | 2 | 21:00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
However, would it be better to have a bridge table with something like:
bridge_show_schedule (Option 2)
| show_key | day_key |
|----------|---------|
| 0 | 2 |
| 0 | 4 |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 1 | 5 |
dim_show_schedule (Option 3) (suggested by @nsousa)
| schedule_key | show_key | time | day |
|--------------|----------|-------|-------------|
| 1 | 0 | 00:30 | tuesday |
| 1 | 0 | 00:30 | thursday |
| 2 | 1 | 12:30 | monday |
| 2 | 1 | 12:30 | tuesday |
| 2 | 1 | 12:30 | wednesday |
| 2 | 1 | 12:30 | thursday |
| 2 | 1 | 12:30 | friday |
| 3 | 2 | 21:00 | saturday |
I've searched Kimball's The Data Warehouse Lifecycle Toolkit and could not find an example of this use case.
Any thoughts?

If you keep a dimension with a string attribute that says which days the show airs, e.g. “M,W,F”, the dimension has at most 2^7 = 128 rows, one per combination of weekdays. A bridge table is an unnecessary complication.
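As a rough illustration of where the 128 comes from, here is a minimal Python sketch that decodes a 7-bit mask into a day-pattern label. The day abbreviations and the bit order (bit 0 = Sunday) are my own assumptions, not part of any standard.

```python
# Hypothetical day abbreviations; bit 0 = Sunday ... bit 6 = Saturday.
DAY_CODES = ["Su", "M", "Tu", "W", "Th", "F", "Sa"]


def day_pattern_label(mask):
    """Decode a 7-bit mask into a label such as 'M,W,F'."""
    return ",".join(
        code for bit, code in enumerate(DAY_CODES) if mask & (1 << bit)
    )


# One label per possible on/off combination of the seven weekdays.
labels = [day_pattern_label(mask) for mask in range(2 ** 7)]

print(len(set(labels)))               # 128
print(day_pattern_label(0b0101010))   # M,W,F
```

Note that the 128 combinations include the empty pattern (a show that airs on no day), which you may or may not want to keep as a row.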

Option 1
You can create a schedule dimension that has a unique record for every possible schedule (2^7 = 128 combinations of weekdays) combined with every reasonable start time. Even at 5-minute intervals (288 start times per day), that is 128 × 288 = 36,864 rows, which is trivial for a dimension.
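To sanity-check the row count, a small Python sketch can generate such a dimension. The column names (`schedule_key`, `time`, `day_flags`) and the 5-minute step are assumptions for illustration; in a real warehouse this would be a one-off load script or SQL.

```python
from itertools import product

MINUTES_PER_DAY = 24 * 60


def build_schedule_dimension(step_minutes=5):
    """Cross every weekday on/off combination with every start time."""
    rows = []
    key = 1
    for flags in product([0, 1], repeat=7):  # (sunday, ..., saturday)
        for minute in range(0, MINUTES_PER_DAY, step_minutes):
            rows.append({
                "schedule_key": key,
                "time": f"{minute // 60:02d}:{minute % 60:02d}",
                "day_flags": flags,
            })
            key += 1
    return rows


dim = build_schedule_dimension()
print(len(dim))  # 36864
```

Because the rows describe schedules rather than shows, there is no `show_key` here; the show dimension or a fact table would point at `schedule_key` instead.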
Option 2
If you want to leverage a date dimension instead, create a "Scheduled" fact that relates the show dimension to the date dimension for each future air date. Your ETL process would populate this relationship. Your date dimension should already include the week and day-of-week logic. You could also use your show duration attribute to create a semi-additive calculated measure, which makes it easy to get the total programming time for a period.
I would opt for Option 2 as it provides many more possibilities for analytics.
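A minimal sketch of the ETL step for the "Scheduled" fact, assuming a weekly pattern per show and a `YYYYMMDD` integer `date_key` (both are assumptions; the show keys and weekdays mirror the sample rows in the question):

```python
from datetime import date, timedelta

# Hypothetical weekly patterns: show_key -> weekdays it airs, using
# date.weekday() numbering (Monday=0 ... Sunday=6).
SHOW_SCHEDULE = {
    0: {1, 3},            # Tuesday, Thursday
    1: {0, 1, 2, 3, 4},   # Monday through Friday
    2: {5},               # Saturday
}


def build_scheduled_fact(start, end, schedule=SHOW_SCHEDULE):
    """Return one (show_key, date_key) row per scheduled airing in the range."""
    rows = []
    day = start
    while day <= end:
        for show_key, weekdays in sorted(schedule.items()):
            if day.weekday() in weekdays:
                rows.append((show_key, int(day.strftime("%Y%m%d"))))
        day += timedelta(days=1)
    return rows


# One week, Monday 2019-06-03 through Sunday 2019-06-09.
fact = build_scheduled_fact(date(2019, 6, 3), date(2019, 6, 9))
print(len(fact))  # 8
```

Each row joins to dim_date through `date_key`, so questions like "how many airings were scheduled per week" become plain counts over the fact.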

Related

Time span accumulating fact tables design

I need to design a star schema for order processing. The progress of an order looks like this:
Customer C places an order for item I with quantity 100
Factory F1 takes part of the order, quantity 30
Factory F2 takes part of the order, quantity 20
50 items are bought from the market
F1 delivers 20 items
F1 delivers 7 items
F1 cancels the contract (we need to buy 3 more items from the market)
F2 delivers 20 items
3 items are bought from the market
The order is completed
How can I design a fact table for this case, given that the number of steps is not fixed and the events do not all have the same data type?
I'm sorry for my bad English.

According to Kimball, an accumulating snapshot fact table:
"summarizes the measurement events occurring at predictable steps between the beginning and the end of a process."
For this use case I would go with a transaction fact table instead: the events (steps) are unpredictable, so it behaves more like an event fact table, similar to a log or audit trail.
| order_key | date_key | full_datetime | entity_key (customer, factory, etc. varchar) | entity_type | state | quantity |
|-----------|----------|---------------------|----------------------------------------------|-------------|----------|----------|
| 1 | 20190602 | 2019-06-02 04:30:00 | C1 | customer | request | 100 |
| 1 | 20190602 | 2019-06-02 05:30:00 | F1 | factory | receive | 30 |
| 1 | 20190602 | 2019-06-02 05:30:00 | F2 | factory | receive | 20 |
| 1 | 20190602 | 2019-06-02 05:40:00 | Company? | company | buy | 50 |
| 1 | 20190603 | 2019-06-03 06:40:00 | F1 | factory | deliver | 20 |
| 1 | 20190603 | 2019-06-03 02:40:00 | F1 | factory | deliver | 7 |
| 1 | 20190603 | 2019-06-03 04:40:00 | F1 | factory | deliver | 3 |
| 1 | 20190603 | 2019-06-03 06:40:00 | F1 | factory | cancel | |
| 1 | 20190604 | 2019-06-04 07:40:00 | F2 | factory | deliver | 20 |
| 1 | 20190604 | 2019-06-04 07:40:00 | Company? | company | buy | 3 |
| 1 | 20190604 | 2019-06-04 09:40:00 | Company? | company | complete | 100 |
I'm not sure about your reporting needs, as they were not specified, but assuming you need to measure lags/durations between unpredictable steps, you could PIVOT with dynamic SQL to create the required view (see the question "SQL Server dynamic PIVOT query?").
Let me know if you come up with something different, as I'm interested in this particular use case. Good luck!
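To show what the pivot buys you, here is a small Python sketch (not SQL Server, and not a dynamic PIVOT) that turns event rows into one "earliest timestamp per state" record per order and then measures a lag. The rows are a subset of the sample table above; the function names are hypothetical.

```python
from datetime import datetime

# (order_key, full_datetime, entity_key, state, quantity), a subset of the rows above
EVENTS = [
    (1, "2019-06-02 04:30:00", "C1", "request", 100),
    (1, "2019-06-02 05:30:00", "F1", "receive", 30),
    (1, "2019-06-02 05:30:00", "F2", "receive", 20),
    (1, "2019-06-04 07:40:00", "F2", "deliver", 20),
    (1, "2019-06-04 09:40:00", "Company?", "complete", 100),
]


def first_seen_by_state(events):
    """Pivot events into {order_key: {state: earliest timestamp}}."""
    pivot = {}
    for order_key, ts, _entity, state, _qty in events:
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        states = pivot.setdefault(order_key, {})
        if state not in states or when < states[state]:
            states[state] = when
    return pivot


def lag(pivot, order_key, from_state, to_state):
    """Elapsed time between the first occurrences of two states."""
    states = pivot[order_key]
    return states[to_state] - states[from_state]


pivot = first_seen_by_state(EVENTS)
print(lag(pivot, 1, "request", "complete"))  # 2 days, 5:10:00
```

The set of states is discovered from the data rather than fixed in advance, which is the same reason the SQL version needs dynamic SQL.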

Automated way to create a confusion matrix in Google Sheets?

I have a table of this form in Google Sheets:
+---------+------------+--------+
| item_id | prediction | actual |
+---------+------------+--------+
| 1 | 1 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 0 |
| 4 | 0 | 1 |
| 5 | 0 | 0 |
| 6 | 1 | 1 |
+---------+------------+--------+
And I'd like to know if there's an automated way to get this kind of summary, with the counts of items that fit the criteria specified in that row/column combination:
+----------+--------------+--------------+-------+
| | prediction=0 | prediction=1 | total |
+----------+--------------+--------------+-------+
| actual=0 | 1 | 1 | 2 |
| actual=1 | 1 | 3 | 4 |
+----------+--------------+--------------+-------+
| total | 2 | 4 | |
+----------+--------------+--------------+-------+
I've been doing this somewhat manually in Google Sheets by using COUNTIFS, but I'm wondering if there's a built-in way? I tried using pivot tables, but couldn't figure out how to get the calculated fields to show the data I want.

A coworker figured it out: you can get this by creating a pivot table with actual as the rows and prediction as the columns, and setting the value to item_id summarized by COUNTUNIQUE.
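If you want to double-check the pivot table's numbers outside Sheets, the same cross-tabulation is a few lines of Python. This is only a sanity check on the sample rows from the question, not a Sheets feature.

```python
from collections import Counter

# (item_id, prediction, actual) rows from the question
ROWS = [
    (1, 1, 1),
    (2, 1, 1),
    (3, 1, 0),
    (4, 0, 1),
    (5, 0, 0),
    (6, 1, 1),
]


def confusion_counts(rows):
    """Count items per (actual, prediction) pair."""
    return Counter((actual, prediction) for _item_id, prediction, actual in rows)


counts = confusion_counts(ROWS)
for actual in (0, 1):
    print(f"actual={actual}", [counts[(actual, p)] for p in (0, 1)])
# actual=0 [1, 1]
# actual=1 [1, 3]
```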

weka gives 100% correctly classified instances for every dataset

I can't get a meaningful accuracy figure: every dataset I provide yields 100% accuracy for every classifier algorithm I apply. My data set has 10 people.
I get the same accuracy with the Naive Bayes, J48, and JRip classifiers.
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
| id | name | q1 | q2 | q3 | m1 | m2 | tut | fl | proj | fexam | total | grade |
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+
| 1 | abv | 5 | 5 | 5 | 13 | 13 | 4 | 8 | 7 | 40 | 100 | p |
| 2 | ca | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 40 | 48 | f |
| 3 | ga | 4 | 2 | 3 | 5 | 10 | 4 | 5 | 6 | 20 | 59 | f |
| 4 | ui | 5 | 4 | 4 | 12 | 13 | 3 | 7 | 7 | 39 | 94 | p |
| 5 | pa | 4 | 1 | 1 | 4 | 3 | 2 | 4 | 5 | 22 | 46 | f |
| 6 | la | 2 | 3 | 1 | 1 | 2 | 0 | 4 | 2 | 11 | 26 | f |
| 7 | ka | 5 | 4 | 1 | 3 | 3 | 1 | 6 | 4 | 24 | 51 | f |
| 8 | ma | 5 | 3 | 3 | 9 | 8 | 4 | 8 | 0 | 20 | 60 | p |
| 9 | ash | 2 | 5 | 5 | 11 | 12 | 3 | 7 | 6 | 30 | 81 | p |
| 10 | opo | 4 | 2 | 1 | 13 | 1 | 3 | 7 | 3 | 35 | 69 | p |
+----+-------+----+----+----+----+----+-----+----+------+-------+-------+-------+

Make sure to not include any unique identifier column.
Also don't include the total.
Most likely, the classifiers learned that "name" is a good predictor and/or that you need a total above 59 points to pass.
I suggest you also withhold at least one of the individual scores because of that: otherwise some classifiers will still learn that the sum of the individual points determines passing.
I assume you want to find out whether one part is most indicative of passing, i.e. "if you do well on part 3, you will likely pass". But to answer this question, you need to account for, e.g., the different number of points available per question; otherwise, your predictor will just identify which question carries the most points.
Also, 10 is far too small a sample size!
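To see why the total column gives the answer away, here is a quick check in Python (not Weka) using the (total, grade) pairs from the question's table. The threshold of 59 is read off the data, not a known grading rule.

```python
# (total, grade) pairs copied from the question's table
RECORDS = [
    (100, "p"), (48, "f"), (59, "f"), (94, "p"), (46, "f"),
    (26, "f"), (51, "f"), (60, "p"), (81, "p"), (69, "p"),
]


def rule_accuracy(records, threshold):
    """Accuracy of the one-line rule 'pass if total > threshold'."""
    hits = sum(
        ("p" if total > threshold else "f") == grade
        for total, grade in records
    )
    return hits / len(records)


print(rule_accuracy(RECORDS, 59))  # 1.0
```

A single comparison on `total` reproduces every grade, so any classifier that is allowed to see that column has nothing left to learn.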

You can see from the output that is displayed that the tree that J48 generated used only the variable fl, so I do not think that you have the problem that @Anony-Mousse referred to.
I notice that you are testing on the training set (see the "Test Options" radio buttons at upper left of the GUI). That almost always overestimates the accuracy. What you are seeing is overfitting. Instead, use cross-validation to get a better estimate of the accuracy you could expect on new data. With only 10 data points, you should use either 10 folds or 5.
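Here is a deliberately extreme toy in Python (not what J48 did on this data) showing how far apart the two estimates can be. The "model" just memorizes name -> grade and falls back to the majority class for unseen names; the names and grades come from the question's table.

```python
from collections import Counter

# (name, grade) pairs from the question's table
DATA = [
    ("abv", "p"), ("ca", "f"), ("ga", "f"), ("ui", "p"), ("pa", "f"),
    ("la", "f"), ("ka", "f"), ("ma", "p"), ("ash", "p"), ("opo", "p"),
]


def train_memorizer(rows):
    """Memorize name -> grade; unseen names get the majority training grade."""
    lookup = dict(rows)
    fallback = Counter(grade for _name, grade in rows).most_common(1)[0][0]
    return lambda name: lookup.get(name, fallback)


def accuracy(model, rows):
    return sum(model(name) == grade for name, grade in rows) / len(rows)


def leave_one_out_accuracy(rows):
    """Train on all rows but one, test on the held-out row, repeat for each row."""
    hits = 0
    for i, (name, grade) in enumerate(rows):
        model = train_memorizer(rows[:i] + rows[i + 1:])
        hits += model(name) == grade
    return hits / len(rows)


print(accuracy(train_memorizer(DATA), DATA))  # 1.0 when tested on the training set
print(leave_one_out_accuracy(DATA))           # 0.0 with 10-fold (leave-one-out) CV
```

With 10 rows, 10-fold cross-validation is the same as leave-one-out, which is what the second function does.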

Try evaluating your model with k-fold cross-validation or a percentage split.
Typically, in a percentage split, the training set is 2/3 of the dataset and the test set is 1/3.
Also, I feel your dataset is very small, and high accuracy can easily occur by chance in that case.

Normalization, changing it to 1nf, 2nf and 3nf

INVOICE
I have to put this into 1NF, 2NF and 3NF.
| PROD_NUM | PROD_LABEL | PROD_PRICE |
|----------|------------|------------|
| AA-E3422QW | ROTARY SANDER | 49.95 |
| AA-E3422QW | ROTARY SANDER | 49.95 |
| QD-300932X | 0.25IN. DRILL BIT | 3.45 |
| RU-95748G | BAND SAW | 33.99 |
| GH-778345P | POWER DRILL | 87.75 |
| VEN_CODE | VEN_NAME |
|----------|----------|
| 211 | NEVERFAIL, INC |
| 211 | NEVERFAIL, INC |
| 211 | NEVERFAIL, INC |
| 309 | BEGOOD, INC |
| 157 | TOUGHGO, INC |
So far I have these as my 2NF. Am I on the right track? And how do I put the table into 3NF?
So will my 2NF look like this? (2NF table image)

I think the picture you were given is already in 1NF.
What you showed is actually 3NF, but you'll need an additional table to record which product is supplied by which vendor, and the invoice table needs to be modified as well.
Vendor - Unique list of vendors
VEN_ID | VEN_CODE | VEN_NAME
-------|----------|---------------
1 | 211 | NEVERFAIL, INC
2 | 309 | BEGOOD, INC
3 | 157 | TOUGHGO, INC
Product - Unique list of products
PROD_ID | PROD_NUM | PROD_LABEL | PROD_PRICE
--------|------------|-------------------|-----------
1 | AA-E3422QW | ROTARY SANDER | 49.95
2 | QD-300932X | 0.25IN. DRILL BIT | 3.45
3 | RU-95748G | BAND SAW | 33.99
4 | GH-778345P | POWER DRILL | 87.75
Vendor_Product - the mapping between products and vendors
VEN_ID | PROD_ID
-------|----------
1 | 1
1 | 2
2 | 3
3 | 4
Purchases - The transactions that happened
PURCH_ID | INV_NUM | SALE_DATE | PROD_ID | QUANT_SOLD
---------|---------|-------------|---------|------------
1 | 211347 | 15-JAN-2006 | 1 | 1
2 | 211347 | 15-JAN-2006 | 2 | 8
3 | 211347 | 15-JAN-2006 | 3 | 1
4 | 211348 | 15-JAN-2006 | 1 | 2
5 | 211349 | 16-JAN-2006 | 4 | 1
I think that is good, but it can be split again.
Invoices - A unique list of invoices
INV_ID | INV_NUM | SALE_DATE
--------|---------|-------------
1 | 211347 | 15-JAN-2006
2 | 211348 | 15-JAN-2006
3 | 211349 | 16-JAN-2006
Purchases - The transactions that happened
PURCH_ID | INV_ID | PROD_ID | QUANT_SOLD
---------|--------|---------|---------
1 | 1 | 1 | 1
2 | 1 | 2 | 8
3 | 1 | 3 | 1
4 | 2 | 1 | 2
5 | 3 | 4 | 1
To get 2NF, combine the vendor information back into the Product table, giving these columns:
PROD_ID | PROD_NUM | PROD_LABEL | PROD_PRICE | VEN_CODE | VEN_NAME
In that case, the Vendor and Vendor_Product tables aren't needed.
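As a quick check that the fully split version (Vendor, Product, Vendor_Product, Invoices, Purchases) loses no information, here is a sketch using Python's built-in sqlite3 module. The lowercase table and column names and the column types are my own choices; the data is copied from the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vendor  (ven_id INTEGER PRIMARY KEY, ven_code INTEGER, ven_name TEXT);
CREATE TABLE product (prod_id INTEGER PRIMARY KEY, prod_num TEXT,
                      prod_label TEXT, prod_price REAL);
CREATE TABLE vendor_product (ven_id  INTEGER REFERENCES vendor(ven_id),
                             prod_id INTEGER REFERENCES product(prod_id));
CREATE TABLE invoice (inv_id INTEGER PRIMARY KEY, inv_num INTEGER, sale_date TEXT);
CREATE TABLE purchase (purch_id INTEGER PRIMARY KEY,
                       inv_id  INTEGER REFERENCES invoice(inv_id),
                       prod_id INTEGER REFERENCES product(prod_id),
                       quant_sold INTEGER);

INSERT INTO vendor VALUES (1, 211, 'NEVERFAIL, INC'), (2, 309, 'BEGOOD, INC'),
                          (3, 157, 'TOUGHGO, INC');
INSERT INTO product VALUES (1, 'AA-E3422QW', 'ROTARY SANDER', 49.95),
                           (2, 'QD-300932X', '0.25IN. DRILL BIT', 3.45),
                           (3, 'RU-95748G', 'BAND SAW', 33.99),
                           (4, 'GH-778345P', 'POWER DRILL', 87.75);
INSERT INTO vendor_product VALUES (1, 1), (1, 2), (2, 3), (3, 4);
INSERT INTO invoice VALUES (1, 211347, '15-JAN-2006'), (2, 211348, '15-JAN-2006'),
                           (3, 211349, '16-JAN-2006');
INSERT INTO purchase VALUES (1, 1, 1, 1), (2, 1, 2, 8), (3, 1, 3, 1),
                            (4, 2, 1, 2), (5, 3, 4, 1);
""")

# Join everything back together to rebuild the original flat invoice lines.
rows = conn.execute("""
    SELECT i.inv_num, p.prod_num, v.ven_code, pu.quant_sold
    FROM purchase pu
    JOIN invoice i         ON i.inv_id   = pu.inv_id
    JOIN product p         ON p.prod_id  = pu.prod_id
    JOIN vendor_product vp ON vp.prod_id = p.prod_id
    JOIN vendor v          ON v.ven_id   = vp.ven_id
    ORDER BY pu.purch_id
""").fetchall()

for row in rows:
    print(row)
```

The join returns one row per purchase line, which is the flat 1NF view rebuilt from the normalized tables.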

How do you SUM two fields from two tables, even when the field in the second table could be null?

I have the following tables:
products.rb
# has_many :sales
+----+----------+----------+-------+
| id | name | quantity | price |
+----+----------+----------+-------+
| 1 | Pencil | 30 | 1.0 |
| 2 | Pen | 50 | 1.5 |
| 3 | Notebook | 100 | 2.0 |
+----+----------+----------+-------+
sales.rb
# belongs_to :product
+----+----------+------------+
| id | quantity | product_id |
+----+----------+------------+
| 1 | 10 | 1 |
| 2 | 2 | 1 |
| 3 | 5 | 1 |
| 4 | 2 | 2 |
| 5 | 10 | 2 |
+----+----------+------------+
I'd like to know, first, how many items I have left, regardless of their type. The answer is of course 151, but that'd be cheating. I could simply make a SUM of both tables individually, then put them together to know the final number, but I'm wondering if this could be done via activerecord in a single command.
I tried the following:
Product.includes(:sales).group('products.id').sum('products.quantity - sales.quantity')
but I get:
=> {1=>73, 2=>88, 3=>0}
which is understandable, as it is going through each one to do the sum like this:
+-------------------+----------------+-----+
| products.quantity | sales.quantity | sum |
+-------------------+----------------+-----+
| 30 | 10 | 20 |
| 30 | 2 | 28 |
| 30 | 5 | 25 |
+-------------------+----------------+-----+
which equals 73.
Anyway, how could this be achieved with ActiveRecord? I want to know the total number of items, but I'd also like to know the total of each type.

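I'm not aware of a pure ActiveRecord way to achieve what you want, but you can try mixing a little SQL in there. Wrapping the subquery in COALESCE makes a product with no sales (where SUM returns NULL) count as 0 sold, and dropping the group call returns the overall total instead of a per-product hash:
Product
.group('products.id')
.sum('products.quantity - COALESCE((SELECT SUM(sales.quantity) FROM sales WHERE sales.product_id = products.id), 0)')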
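To check the SQL on its own, outside Rails, here is a sketch with Python's built-in sqlite3 module and the sample data from the question. The COALESCE(..., 0) wrapper is my addition for products that have no sales at all (like the Notebook), where SUM over zero rows yields NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, quantity INTEGER, price REAL);
CREATE TABLE sales (id INTEGER PRIMARY KEY, quantity INTEGER, product_id INTEGER);
INSERT INTO products VALUES (1, 'Pencil', 30, 1.0), (2, 'Pen', 50, 1.5),
                            (3, 'Notebook', 100, 2.0);
INSERT INTO sales VALUES (1, 10, 1), (2, 2, 1), (3, 5, 1), (4, 2, 2), (5, 10, 2);
""")

# Subtract each product's total sales once, instead of once per joined sales row.
REMAINING_SQL = """
SELECT p.id,
       p.quantity - COALESCE((SELECT SUM(s.quantity)
                              FROM sales s
                              WHERE s.product_id = p.id), 0) AS remaining
FROM products p
ORDER BY p.id
"""

per_product = dict(conn.execute(REMAINING_SQL).fetchall())
total_remaining = sum(per_product.values())

print(per_product)      # {1: 13, 2: 38, 3: 100}
print(total_remaining)  # 151
```

The correlated subquery runs once per product row, which avoids the double counting you get when products.quantity is repeated for every joined sale.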
