Understanding Slowly Changing Dimension Type 2

I am having a difficult time understanding how to use Slowly Changing Dimension Type 2 in my scenario.
I have gone through different tutorial websites, but they don't fit my case.
I have an employee dimension table containing:
+----+----------+-------------+------------+
| id | employee | designation | Location   |
+----+----------+-------------+------------+
| 1  | Ola      | CEO         | Newyork    |
| 2  | Ahmed    | DEVELOPER   | California |
| 3  | Ola      | Manager     | California |
+----+----------+-------------+------------+
I have an Account fact table:
+--------+---------+
| emp_id | Amount  |
+--------+---------+
| 1      | 2000000 |
| 2      | 300000  |
+--------+---------+
Now we see that the dimension has changed, and thus the same employee, Ola, has been given a new ID.
How would we manage this in the fact table?
The new ID for Ola will not be found in the fact table.
If we add a new row to the fact table with Ola's new ID, how do we link the two rows to the same employee when they are identified by different primary keys?
How do we tell that this is not a new employee, and that only the location / designation has changed?

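I am sure there are many ways of doing it; here is one. Add an "emp_key" column to your dimension table that is unique per employee and stays the same across all versions of that employee's row, plus Valid From / Valid To columns that record when each version applies. Your dimension table will then look like this: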
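+----+---------+----------+-------------+------------+------------+----------+
| id | emp_key | employee | designation | Location   | Valid From | Valid To |
+----+---------+----------+-------------+------------+------------+----------+
| 1  | EMP1    | Ola      | CEO         | Newyork    | 1/1/1900   | 1/1/2016 |
| 2  | EMP2    | Ahmed    | DEVELOPER   | California | 1/1/1900   | NULL     |
| 3  | EMP1    | Ola      | Manager     | California | 1/2/2016   | NULL     |
+----+---------+----------+-------------+------------+------------+----------+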
You can continue loading your fact table with the new ID for the employee. In this case the employee will appear under two different IDs in the fact table (1 and 3), both of which map to the same emp_key (EMP1) in the dimension.
+--------+---------+
| emp_id | Amount  |
+--------+---------+
| 1      | 2000000 |
| 2      | 300000  |
| 3      | 100000  |
+--------+---------+
If you want a rollup (say, the sum of amounts) for an employee across all of their versions, join the fact and dimension tables on the ID and group by emp_key:
select dim.emp_key, sum(fact.amount) from employee dim join account fact on dim.id = fact.emp_id group by dim.emp_key;
If you only want the amount since he became a manager, roll up on the ID field instead:
select dim.id, sum(fact.amount) from employee dim join account fact on dim.id = fact.emp_id group by dim.id;
or, without touching the dimension at all:
select fact.emp_id, sum(fact.amount) from account fact group by fact.emp_id;
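The two rollups above can be tried end to end with a small sketch. This is only an illustration using Python's built-in sqlite3 module; the table names `employee` and `account`, the column types, and the lowercase column names are assumptions, and the date strings are stored exactly as they appear in the answer's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (               -- dimension table (name assumed)
        id          INTEGER PRIMARY KEY,  -- surrogate key, new for each version
        emp_key     TEXT NOT NULL,        -- stable key, shared by all versions
        employee    TEXT,
        designation TEXT,
        location    TEXT,
        valid_from  TEXT,
        valid_to    TEXT                  -- NULL marks the current version
    );
    CREATE TABLE account (                -- fact table (name assumed)
        emp_id INTEGER REFERENCES employee(id),
        amount INTEGER
    );
    INSERT INTO employee VALUES
        (1, 'EMP1', 'Ola',   'CEO',       'Newyork',    '1/1/1900', '1/1/2016'),
        (2, 'EMP2', 'Ahmed', 'DEVELOPER', 'California', '1/1/1900', NULL),
        (3, 'EMP1', 'Ola',   'Manager',   'California', '1/2/2016', NULL);
    INSERT INTO account VALUES (1, 2000000), (2, 300000), (3, 100000);
""")

# Rollup per employee: join on the surrogate id, group by the stable key.
per_employee = dict(conn.execute("""
    SELECT dim.emp_key, SUM(fact.amount)
    FROM employee AS dim
    JOIN account  AS fact ON dim.id = fact.emp_id
    GROUP BY dim.emp_key
"""))

# Rollup per version of the employee: group by the surrogate id only.
per_version = dict(conn.execute("""
    SELECT fact.emp_id, SUM(fact.amount)
    FROM account AS fact
    GROUP BY fact.emp_id
"""))

print(per_employee)
print(per_version)
```

Grouping by `emp_key` merges Ola's CEO and Manager rows into one total, while grouping by the surrogate id keeps them apart.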

Related

Google Sheets Return Any(All) Row Value(s) For MAX-GROUP Query

I am looking to return non-grouped row values from a query of a table sorted by the MAX value of a column, within a group.
DATA TABLE
| NAME | ASSET | ACTION | DATE |
|--|--|--|--|
| JOE | CAR | BOUGHT | 1/1/2020 |
| JANE | HORSE | BOUGHT | 1/1/2021 |
| JOE | HORSE | BOUGHT | 2/1/2021 |
| JANE | HORSE | SOLD | 3/1/2021 |
| JOE | CAR | SOLD | 1/1/2022 |
| JOE | CAR | BOUGHT | 2/1/2022 |
For the table above, I tried the following formula:
=QUERY(A1:D5,"SELECT A,B,C,D, MAX(D) GROUP BY A,B",TRUE)
The following TARGET TABLE is output I'm looking for:
| NAME | ASSET | ACTION | DATE |
|--|--|--|--|
| JANE | HORSE | SOLD | 3/1/2021 |
| JOE | HORSE | BOUGHT | 2/1/2021 |
| JOE | CAR | BOUGHT | 2/1/2022 |
However, because 'C' is not included in the GROUP BY, the formula returns an error: "Unable to parse query string for Function QUERY parameter 2: ADD_COL_TO_GROUP_BY_OR_AGG: C"
If I omit columns C and D ("ACTION" and "DATE") from the SELECT: =QUERY(A1:D5,"SELECT A,B, MAX(D) GROUP BY A,B",TRUE), I get the correct record rows, but the status (ACTION) is missing.
MAX-DATE TABLE
| NAME | ASSET | max DATE |
|--|--|--|
| JANE | HORSE | 3/1/2021 |
| JOE | HORSE | 2/1/2021 |
| JOE | CAR | 2/1/2022 |
Or, when I add column C as a PIVOT: =QUERY(A1:D5,"SELECT A,B, MAX(D) GROUP BY A,B PIVOT C",TRUE), I get the correct record rows, but not the current status within each row.
PIVOT ACTION TABLE
| NAME | ASSET | BOUGHT | SOLD |
|--|--|--|--|
| JANE | HORSE | 1/1/2021 | 3/1/2021 |
| JOE | HORSE | 2/1/2021 | |
| JOE | CAR | 2/1/2022 | 1/1/2022 |
Still I have not found a method to create my TARGET TABLE.
Am I overlooking a method to include a non-grouped field into a query using MAX()? Or is it impossible within Google Sheets Query without JOIN functions?
(I hope it is obvious that I desire to apply this to a large and dynamic dataset.)
Thank you for your insight. Cheers!

QUERY is not very flexible here: every selected column must be either aggregated or listed in GROUP BY.
Instead, you can filter by comparing column D with a "virtual" column built with BYROW, which holds the latest date for each row's NAME/ASSET pair: =BYROW(A2:A,LAMBDA(each,MAXIFS($D$2:$D,$A$2:$A,each,$B$2:$B,OFFSET(each,,1))))
You don't need to create that column in the sheet; it only illustrates the idea. Use it directly inside FILTER:
=FILTER(A2:D,D2:D = BYROW(A2:A,LAMBDA(each,MAXIFS($D$2:$D,$A$2:$A,each,$B$2:$B,OFFSET(each,,1)))))
This way, each row's date is compared with the maximum date of its group, and only the matching rows are kept.
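The same "keep the rows whose date equals the group maximum" idea can be sketched outside of Sheets. The Python below is only an illustration of the logic, not a replacement for the formula; the dates from the question are read as month/day/year, which is an assumption (reading them as day/month/year gives the same ordering for this data).

```python
from datetime import date

# Rows from the question's DATA TABLE: (NAME, ASSET, ACTION, DATE).
rows = [
    ("JOE",  "CAR",   "BOUGHT", date(2020, 1, 1)),
    ("JANE", "HORSE", "BOUGHT", date(2021, 1, 1)),
    ("JOE",  "HORSE", "BOUGHT", date(2021, 2, 1)),
    ("JANE", "HORSE", "SOLD",   date(2021, 3, 1)),
    ("JOE",  "CAR",   "SOLD",   date(2022, 1, 1)),
    ("JOE",  "CAR",   "BOUGHT", date(2022, 2, 1)),
]

# Step 1 (the MAXIFS part): latest date per (NAME, ASSET) group.
latest = {}
for name, asset, _action, day in rows:
    key = (name, asset)
    if key not in latest or day > latest[key]:
        latest[key] = day

# Step 2 (the FILTER part): keep rows whose date equals their group's maximum.
result = [row for row in rows if row[3] == latest[(row[0], row[1])]]

for row in result:
    print(row)
```

Because the comparison is done per row, the non-grouped ACTION value comes along for free, which is what the GROUP BY version could not provide.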

Union Vertical Blending in Data Studio

I want to blend several tables into one. All of the tables have the same columns, so I'm thinking of a vertical UNION of all of them.
My data source is Google Sheets/ Spreadsheets.
The data will look like this:
Table1
| Type | Object | Amount |
|:---- |:---------:| ------:|
| Tech | PC | $100 |
| Tech | Keyboard | $50 |
| Tech | Mouse | $60 |
Table2
| Type | Object | Amount |
|:----- |:-----------------------:| ------:|
| Sales | Sales Incentives | $1000 |
| Sales | Meeting with Client | $400 |
| Sales | Visiting stores | $80 |
While the desired output would be:
| Type | Object | Amount |
|:----- |:-----------------------:| ------:|
| Sales | Sales Incentives | $1000 |
| Sales | Meeting with Client | $400 |
| Sales | Visiting stores | $80 |
| Tech | PC | $100 |
| Tech | Keyboard | $50 |
| Tech | Mouse | $60 |
Can anyone help me with this? Thank you.

I just got the answer:
You can blend the sources with a FULL OUTER JOIN and then combine each column with the formula:
COALESCE(Name (Source #1),Name (Source #2),Name (Source #3))
You can see full information here
Thanks to Mehdi Oidjida for the help.
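To see why a full outer join plus COALESCE behaves like a vertical union, here is a rough plain-Python model of the idea. It is not Data Studio code: the helper names `full_outer_join` and `coalesce` are hypothetical, joining on the "Object" column is an assumption, and the sketch assumes each key appears at most once per table.

```python
table1 = [
    {"Type": "Tech", "Object": "PC", "Amount": 100},
    {"Type": "Tech", "Object": "Keyboard", "Amount": 50},
    {"Type": "Tech", "Object": "Mouse", "Amount": 60},
]
table2 = [
    {"Type": "Sales", "Object": "Sales Incentives", "Amount": 1000},
    {"Type": "Sales", "Object": "Meeting with Client", "Amount": 400},
    {"Type": "Sales", "Object": "Visiting stores", "Amount": 80},
]


def full_outer_join(left, right, key):
    """Pair rows on `key`; a row without a partner is paired with None."""
    right_by_key = {row[key]: row for row in right}
    matched = set()
    pairs = []
    for l_row in left:
        r_row = right_by_key.get(l_row[key])
        if r_row is not None:
            matched.add(l_row[key])
        pairs.append((l_row, r_row))
    # Right-hand rows that found no partner still appear in the result.
    pairs.extend((None, r_row) for r_row in right if r_row[key] not in matched)
    return pairs


def coalesce(*values):
    """Return the first value that is not None (None if there is none)."""
    return next((v for v in values if v is not None), None)


def field(row, col):
    return row[col] if row is not None else None


blended = [
    {col: coalesce(field(l, col), field(r, col)) for col in ("Type", "Object", "Amount")}
    for l, r in full_outer_join(table1, table2, "Object")
]

for row in blended:
    print(row)
```

Since no key is shared between the two tables, every joined row has values from exactly one source, and the coalesce step picks whichever side is filled in.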

how to create relationship using cypher

I have been learning neo4j/cypher for the last week. I have finally been able to upload two csv files and create a relationship, "captured". However, I am not fully confident in my understanding of the code, as I was following the tutorial on the neo4j site. Could you please help me confirm that what I did is correct?
I have two csv files, "cap.csv" and "survey.csv". The survey table contains data for each unique survey conducted at the survey sites. The cap table contains data for each unique organism captured. In the cap table I have a foreign key, "survey_id", which in the Postgres db you would join to the primary key of the survey table.
I want to create a relationship, "captured", showing each unique organism that was captured, based on the "date" column in the survey table.
Survey table
| lake_id | date | survey_id | duration |
|--|--|--|--|
| 1 | 05/27/14 | 1 | 7 |
| 2 | 03/28/13 | 2 | 10 |
| 2 | 06/29/19 | 3 | 23 |
| 3 | 08/21/21 | 4 | 54 |
| 1 | 07/23/18 | 5 | 23 |
| 2 | 07/22/23 | 6 | 12 |
Capture table
| cap_id | species | capture_life_stage | weight | survey_id |
|--|--|--|--|--|
| 1 | a | adult | 10 | 1 |
| 2 | a | adult | 10 | 2 |
| 3 | b | juv | 23 | 3 |
| 4 | a | adult | 54 | 4 |
| 5 | b | juv | 23 | 5 |
| 6 | c | juv | 12 | 6 |
LOAD CSV WITH HEADERS FROM 'file:///cap.csv' AS row
WITH
row.id as id,
row.species as species,
row.capture_life_stage as capture_life_stage,
toInteger(row.weight) as weight,
row.survey_id as survey_id
MATCH (c:cap {id: id})
MERGE (s) - [rel:captured {survey_id: survey_id}] ->(c)
return count(rel)
I am struggling to understand the code I wrote above. I followed the neo4j tutorial exactly but used my data (https://neo4j.com/developer/desktop-csv-import/).
I am fairly confident from data checks, but did the above code create the "captured" relationship showing each unique organism captured on that unique survey date? Based on the visual I can see I believe it did but I don't fully understand each step in the code.
What is the purpose of the MATCH (c:cap {id: id}) in the code?

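The pattern
MATCH (c:cap {id: id})
is the same as
MATCH (c:cap)
WHERE c.id = id
It is a shorter way of finding the cap node with the given id; the MERGE that follows then creates a relationship between that node and another node, s.
Question: s is not defined in your query. Where is it? Since s is never matched, MERGE cannot tie the relationship to an existing survey node; whenever no matching relationship already exists, it creates a new, empty node for s.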

Filter out all of user's entries if one of them was selected

I am trying to write a formula that will allow you to put up to 3 values into a pool (per person) and then when one of those values is selected for use all 3 potential values are removed from the pool.
My current formula is long but fairly simple in that it takes the values from each user and adds/subtracts them based on whether they are being used.
=(SUMIF($I$2:$I$50, $A24, $L$2:$L$50) + SUMIF($J$2:$J$50, $A24, $L$2:$L$50) + SUMIF($K$2:$K$50, $A24, $L$2:$L$50)) - (SUMIF($A$2:$A$20, $A24, $L$2:$L$20) + SUMIF($B$2:$B$20, $A24, $L$2:$L$20) + SUMIF($C$2:$C$20, $A24, $L$2:$L$20) + SUMIF($D$2:$D$20, $A24, $L$2:$L$20) + SUMIF($E$2:$E$20, $A24, $L$2:$L$20))
The limiting factor here is that each user is adding a potential of 3 values to the pool but when one is selected it is only subtracting that from the pool and then leaving the other 2 which invalidates the pool as a whole.
For reference I have copied the contents over to the following sheet
https://docs.google.com/spreadsheets/d/1tbL1WuaUoR8JM8cndTBiKnoaLm-UxCaddpSOJbCVmiM/edit?usp=sharing
Does anyone have an idea for how to properly remove all available options from the pool rather than just the one that was selected? I have a few ideas from how to possibly make it work with code but I am trying to make it an automated formula rather than something that needs to be specifically "run" to calculate it.

You need some marker against each user to indicate whether their entries are still eligible for taking. Here is an example: column Eligible (L) is True if we haven't chosen from that user yet. Column Selected (N) is filled manually, by choosing from what is available in column Available (O). The formulas are:
For Eligible column (second row shown):
=and(isna(match(I2, N$2:N, 0)), isna(match(J2, N$2:N, 0)), isna(match(K2, N$2:N, 0)))
which says that none of I2, J2, or K2 matches anything in column N.
And in column O, only one array formula is needed:
={filter(I2:I, L2:L); filter(J2:J, L2:L); filter(K2:K, L2:L)}
which filters out non-eligible users and stacks the foods in one column using array notation {}.
+--------+---------+--------+------------+-----------+--+----------+------------+
| User | Food 1 | Food 2 | Food 3 | Eligible | | Selected | Available |
+--------+---------+--------+------------+-----------+--+----------+------------+
| User A | Chicken | Pear | Watermelon | TRUE | | Grape | Chicken |
| User B | Garlic | Grape | Rice | FALSE | | Beef | Potato |
| User C | Beef | Corn | Salt | FALSE | | | Pear |
| User D | Potato | Pepper | Orange | TRUE | | | Pepper |
| | | | | | | | Watermelon |
| | | | | | | | Orange |
+--------+---------+--------+------------+-----------+--+----------+------------+
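The logic of the two formulas can be sketched in Python as a sanity check. This is only an illustration using the sample data from the table above; the variable names are hypothetical.

```python
# Each user's three entries (columns Food 1..3 in the example table).
users = {
    "User A": ["Chicken", "Pear", "Watermelon"],
    "User B": ["Garlic", "Grape", "Rice"],
    "User C": ["Beef", "Corn", "Salt"],
    "User D": ["Potato", "Pepper", "Orange"],
}
selected = {"Grape", "Beef"}  # the manually filled "Selected" column

# Eligible column: TRUE only if none of the user's entries has been selected.
eligible = {user: selected.isdisjoint(foods) for user, foods in users.items()}

# Available column: stack Food 1, then Food 2, then Food 3 of eligible users,
# mirroring the {filter(I...); filter(J...); filter(K...)} array.
available = [
    foods[column]
    for column in range(3)
    for user, foods in users.items()
    if eligible[user]
]

print(eligible)
print(available)
```

Once any one of a user's entries is selected, the whole user drops out, so their other two entries leave the pool at the same time.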

Ideal solution for the following case scenario in database

There are 50 exams to be taken online by millions of students. A student may take one or several exams, and may also take the same exam more than once (retries).
Which of the solutions below is better for this case? I am also open to a better solution than these two.
Option 1: store each exam in its own table:
Subject 1
+----------------+---------+
| student id | Marks |
+----------------+---------+
| 1 | 85 |
| 2 | 32 |
| 2 | 60 |
+----------------+---------+
Subject 2
+----------------+---------+
| student id | Marks |
+----------------+---------+
| 1 | 85 |
| 2 | 32 |
| 2 | 60 |
+----------------+---------+
As shown above, each table contains a student id only if that student has taken that exam, and contains it several times if the student has taken the exam more than once.
Option 2:
+----------------+---------+---------+
| student id | Subject | Marks |
+----------------+---------+---------+
| 1 | Subj1 | 85 |
| 2 | Subj1 | 32 |
| 2 | Subj1 | 60 |
| 1 | Subj2 | 80 |
| 3 | Subj2 | 90 |
+----------------+---------+---------+
with all the values in a single table.
Which is better from a performance and storage perspective?
My various que

I think the best design here is the following:
A STUDENT table with information about students
An EXAM table with information about exams
An EXAM_TRY table referencing the STUDENT and EXAM tables, with fields DATE_OF_EXAM and RESULT_OF_EXAM
Two indexes on the foreign keys in EXAM_TRY
Depending on the situation, an index on the date field (for example, you would need it to plan work for examiners)
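The layout described above could look roughly like the following sketch, written with Python's built-in sqlite3 module. The column names other than DATE_OF_EXAM and RESULT_OF_EXAM, the index names, the student names, and the exam dates are assumptions made for illustration; the marks come from Option 2 in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE exam    (exam_id    INTEGER PRIMARY KEY, subject TEXT);
    CREATE TABLE exam_try (
        try_id         INTEGER PRIMARY KEY,
        student_id     INTEGER NOT NULL REFERENCES student(student_id),
        exam_id        INTEGER NOT NULL REFERENCES exam(exam_id),
        date_of_exam   TEXT,
        result_of_exam INTEGER
    );
    -- Two indexes on the foreign keys, plus an optional one on the date.
    CREATE INDEX idx_try_student ON exam_try(student_id);
    CREATE INDEX idx_try_exam    ON exam_try(exam_id);
    CREATE INDEX idx_try_date    ON exam_try(date_of_exam);
""")

conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(1, "student 1"), (2, "student 2"), (3, "student 3")])
conn.executemany("INSERT INTO exam VALUES (?, ?)", [(1, "Subj1"), (2, "Subj2")])
conn.executemany(
    "INSERT INTO exam_try (student_id, exam_id, date_of_exam, result_of_exam) "
    "VALUES (?, ?, ?, ?)",
    [   # dates are made up; a retry is simply another row
        (1, 1, "2024-01-10", 85),
        (2, 1, "2024-01-10", 32),
        (2, 1, "2024-02-10", 60),
        (1, 2, "2024-01-11", 80),
        (3, 2, "2024-01-11", 90),
    ],
)

# Best mark and number of attempts per student and subject.
best = conn.execute("""
    SELECT t.student_id, e.subject, MAX(t.result_of_exam), COUNT(*)
    FROM exam_try AS t
    JOIN exam AS e ON e.exam_id = t.exam_id
    GROUP BY t.student_id, e.subject
    ORDER BY t.student_id, e.subject
""").fetchall()

for row in best:
    print(row)
```

Keeping every attempt as its own row in one table means adding a 51st exam is just an insert into the exam table rather than a schema change.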
