Slowly changing dimension with two pairs of start/end dates - data-warehouse

For my DWH I am considering implementing an SCD with two pairs of start/end dates: effective_from_dttm/effective_to_dttm and valid_from_dttm/valid_to_dttm. The reason is that I need to track changes both by the timestamp from the source and by the timestamp that shows when a row was extracted into the staging area of the DWH. In the staging area I have tables with two timestamps: a reliable timestamp (processed_dttm) that I generate when extracting from the source, and an unreliable timestamp (last_upd_dttm) that comes from the source. The effective dates are derived from last_upd_dttm and the valid dates from processed_dttm.
Consider this example:
DDS Table
id|changing_field|effective_from_dttm|effective_to_dttm|valid_from_dttm|valid_to_dttm
--+--------------+-------------------+-----------------+---------------+-------------
1 |something     |01.01.2022         |null             |10.02.2022     |null
INPUT TABLE
id|changing_field|last_upd_dttm|processed_dttm
--+--------------+-------------+--------------
1 |something_new!|01.01.2022   |11.02.2022
DDS Table (delta applied)
id|changing_field|effective_from_dttm|effective_to_dttm|valid_from_dttm|valid_to_dttm
--+--------------+-------------------+-----------------+---------------+-------------
1 |something     |01.01.2022         |null             |10.02.2022     |11.02.2022
1 |something_new!|01.01.2022         |null             |11.02.2022     |null
As I said, last_upd_dttm is an unreliable timestamp: for example, a record in the source system can be changed while last_upd_dttm stays the same, due to possible fraud or a mistake, which I need to detect.
INPUT TABLE
id|changing_field |last_upd_dttm|processed_dttm
--+---------------+-------------+--------------
1 |something_new2!|12.02.2022   |12.02.2022
DDS Table (delta applied)
id|changing_field |effective_from_dttm|effective_to_dttm|valid_from_dttm|valid_to_dttm
--+---------------+-------------------+-----------------+---------------+-------------
1 |something      |01.01.2022         |12.02.2022       |10.02.2022     |11.02.2022
1 |something_new! |01.01.2022         |12.02.2022       |11.02.2022     |12.02.2022
1 |something_new2!|12.02.2022         |null             |12.02.2022     |null
Now I have a history of effectivity both in the source system and in the DWH. Is there a simpler approach for reflecting such a history? I suspect that having two pairs of dates will make the process of building marts too complicated. Is there a type of SCD with this double versioning, or somewhere I can read about such an approach?
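The delta logic implied by the three tables above can be sketched in Python. This is only one reading of the example, not an established SCD type: the `DdsRow` class, the `apply_delta` function, and the rule "close the effective interval only when last_upd_dttm has moved forward" are assumptions inferred from the sample rows.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DdsRow:
    # Hypothetical in-memory stand-in for one row of the DDS table.
    id: int
    changing_field: str
    effective_from_dttm: date
    effective_to_dttm: Optional[date]
    valid_from_dttm: date
    valid_to_dttm: Optional[date]


def apply_delta(rows: List[DdsRow], id_: int, changing_field: str,
                last_upd_dttm: date, processed_dttm: date) -> List[DdsRow]:
    """Apply one staging record and return the new list of DDS rows."""
    result = []
    for row in rows:
        if row.id == id_:
            # Source-side interval: close it only if the source timestamp moved forward.
            if row.effective_to_dttm is None and last_upd_dttm > row.effective_from_dttm:
                row = replace(row, effective_to_dttm=last_upd_dttm)
            # Warehouse-side interval: close the version that was current until this load.
            if row.valid_to_dttm is None:
                row = replace(row, valid_to_dttm=processed_dttm)
        result.append(row)
    # The incoming record becomes the new open version on both time axes.
    result.append(DdsRow(id_, changing_field, last_upd_dttm, None, processed_dttm, None))
    return result


dds = [DdsRow(1, "something", date(2022, 1, 1), None, date(2022, 2, 10), None)]
dds = apply_delta(dds, 1, "something_new!", date(2022, 1, 1), date(2022, 2, 11))
dds = apply_delta(dds, 1, "something_new2!", date(2022, 2, 12), date(2022, 2, 12))
```

After the two calls, `dds` holds the same three rows as the last DDS table. A change that arrives with an unchanged last_upd_dttm shows up as two versions sharing the same effective_from_dttm but with different valid intervals, which is the case that needs to be detected.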

Related

Should we compare a null value with a known value?

I have a binary classification problem and need to prepare the data for model training. There are two classes: duplicate and non-duplicate. Assume two records of the data look like this:
Id|Name|Phone|Email  |City
--+----+-----+-------+------
A1|Mick|12345|m#m.com|London
A2|Mick|12345|null   |London
It seems that these two records are duplicates. I need to turn them into one record and assign each feature a binary value: 1 if the values match, otherwise 0, as follows:
Id1|Id2|Name|Phone|Email|City|Label
---+---+----+-----+-----+----+-----
A1 |A2 |1   |1    |?    |1   |1
As the first table shows, we have a missing value for the email in the second row. I know I cannot compare a known value with a missing one. The question is, what is the best practice in this case?
Note: The number of missing values is high in my dataset, and I cannot drop them.
I tried to put 0, but I know it introduces bias in the dataset.
You can drop the records with null values. To do this, use pandas dropna().
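If dropping rows is not an option, as the note in the question says, one possible alternative is to keep "could not compare" as its own value instead of forcing it to 0. The sketch below is only an illustration of that idea: `compare_pair` is a hypothetical helper, and whether a third value suits the model is an assumption to validate, not an established best practice.

```python
from typing import Dict, List, Optional


def compare_pair(a: Dict[str, Optional[str]],
                 b: Dict[str, Optional[str]],
                 fields: List[str]) -> Dict[str, Optional[int]]:
    """Return 1 for a match, 0 for a mismatch, None when either side is missing."""
    features: Dict[str, Optional[int]] = {}
    for field in fields:
        left, right = a.get(field), b.get(field)
        if left is None or right is None:
            features[field] = None  # comparison is undefined, so do not guess 0 or 1
        else:
            features[field] = 1 if left == right else 0
    return features


a1 = {"Name": "Mick", "Phone": "12345", "Email": "m#m.com", "City": "London"}
a2 = {"Name": "Mick", "Phone": "12345", "Email": None, "City": "London"}
pair_features = compare_pair(a1, a2, ["Name", "Phone", "Email", "City"])
```

Here `pair_features["Email"]` stays `None`, so the missing comparison can later be encoded as a separate category or indicator column rather than being counted as a mismatch.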

vlookup several criteria and a range

Here's my table:
Exchange No.|Name   |Tier        |30d Volume (higher than)|Maker   |Taker  |Specials
------------+-------+------------+------------------------+--------+-------+--------
1           |FTX    |1           |$0                      |0.0200% |0.0700%|
            |FTX    |2           |$2,000,000              |0.0150% |0.0600%|
            |FTX    |3           |$5,000,000              |0.0100% |0.0550%|
            |FTX    |4           |$10,000,000             |0.0050% |0.0500%|
2           |Binance|Regular User|$0                      |0.0120% |0.0500%|
            |Binance|VIP 1       |$15,000,000             |0.0120% |0.0500%|
            |Binance|VIP 2       |$50,000,000             |-0.0100%|0.0500%|
I want to retrieve the correct fees in another table as follows:
Volume (past 30d):|volume variable, ie $10,000|FTX        |Binance
Column #          |2                          |3          |4
IN:               |Maker                      |correct fee|correct fee
OUT-stop:         |Maker                      |correct fee|correct fee
OUT-profit:       |Maker                      |correct fee|correct fee
OUT-manually:     |Maker                      |correct fee|correct fee
Cell C3 (of the second table) should take the fee in cell E2 (of the first table).
Why? Because:
C1 (of the second table) says "FTX", as per column B, rows 2:5 (of the first table)
The volume (in B1 of the second table) is higher than D2 but lower than D3 (of the first table)
B3 (of the second table) says "Maker", which is column E (of the first table)
So I tried VLOOKUP, but only my criterion No. 3 works with it.
The other criteria are a "higher than" range (No. 1) and two different columns as the "index" (in the VLOOKUP formula), which are, by the way, the searched text (No. 2).
Does anyone have an idea how to take these special criteria into account in VLOOKUP, or something similar?
try:
=INDEX(VLOOKUP(M2, QUERY({'Exchange Fees'!B2:B&":", 'Exchange Fees'!D2:D,
FILTER('Exchange Fees'!A2:G, 'Exchange Fees'!A1:G1=M3)},
"select Col2,Col3 where Col1 = '"&N2&"'", ), 2, 1))
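For reference, the lookup the question describes (match the exchange, take the highest tier whose threshold does not exceed the volume, then pick the Maker or Taker column) can be sketched outside Sheets in Python. The `FEES` dictionary and `lookup_fee` function are hypothetical names; the numbers are copied from the first table, and treating a volume exactly equal to a threshold as belonging to that tier is an assumption.

```python
from bisect import bisect_right

# (30d volume threshold in $, maker fee in %, taker fee in %), sorted by threshold.
FEES = {
    "FTX": [
        (0, 0.0200, 0.0700),
        (2_000_000, 0.0150, 0.0600),
        (5_000_000, 0.0100, 0.0550),
        (10_000_000, 0.0050, 0.0500),
    ],
    "Binance": [
        (0, 0.0120, 0.0500),
        (15_000_000, 0.0120, 0.0500),
        (50_000_000, -0.0100, 0.0500),
    ],
}
FEE_COLUMN = {"Maker": 1, "Taker": 2}


def lookup_fee(exchange: str, volume: float, side: str) -> float:
    """Return the fee (in %) of the highest tier whose threshold is <= volume."""
    tiers = FEES[exchange]
    thresholds = [tier[0] for tier in tiers]
    index = bisect_right(thresholds, volume) - 1
    if index < 0:
        raise ValueError("volume is below the lowest tier threshold")
    return tiers[index][FEE_COLUMN[side]]


fee = lookup_fee("FTX", 10_000, "Maker")
```

The binary search plays the role of an approximate-match VLOOKUP on a sorted threshold column, which is why each exchange's tiers must stay sorted by volume.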

How do I order a mixed text and integer field in a pivot table in Google Sheets?

Let's say that we have two columns on a sheet:
Name Room
-------------
Steve A1
Jill A1
Sam A1
Steve A2
...
Lisa A10
Sally A11
Jim A11
My actual dataset has up to a hundred of these rooms.
The issue I'm running into is with pivot tables. When I want to get a list of rooms and the count (counta is the one I'm using) it works, but the order is not what I wanted. It comes out as:
Room Count
--------------
A1 3
A10 1
A11 2
...
A2 1
I guess I can kind of see why it would be doing that. I'd much rather have it list it out in order. A1, A2, A3... A10, A11, A12, etc.
Is there an easy way to do this without some sort of data manipulation?
An "easy" way to do this without "data manipulation" is to copy the PT, Paste special, Paste values only and then drag the relevant rows (presumably at most only 8) to where you want them. The easiest way is probably with "data manipulation", for example:
=if(len(A1)=2,SUBSTITUTE(A1,"A","A0"),A1)
(Though in your case, whichever column is the right one, it would not be column A.)
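The zero-padding idea behind that formula can be sketched in Python as well. `pad_room` is a hypothetical helper, and it assumes room names consist of letters followed by digits; anything else is returned unchanged.

```python
import re


def pad_room(room: str, width: int = 2) -> str:
    """Left-pad the numeric part of a room name with zeros, e.g. A1 -> A01."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", room)
    if match is None:
        return room  # leave names that do not fit the assumed pattern untouched
    prefix, number = match.groups()
    return prefix + number.zfill(width)


padded = sorted(pad_room(room) for room in ["A1", "A10", "A11", "A2"])
```

Once padded, plain text sorting gives the intended order; `width` needs to be at least the number of digits in the largest room number.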
I suggest you transform the string elements into number values using a lookup table.
I've created a sample spreadsheet here.
The input data in the 'input' sheet has the keys as you described.
The next sheet is the "lookup table" that translates each key into a number. I suggest choosing large numbers to leave room for future intermediate numbers if needed.
Pivot 1 is based on the original data as you described
Pivot 2 is based on the re-calculated room name using the lookup table.
The formula I used for the re-calculation is:
=VALUE(SUBSTITUTE(A2,MID(A2,1,1),VLOOKUP(MID(A2,1,1),'Lookup table'!$A$1:$B$2,2)))
I was a little lazy with the string lookup in the original name (MID), assuming your string is the first character and is 1 character long. This can be mended specifically with pattern matching.
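The pattern-matching variant mentioned above could look like the following Python sketch, which splits each room name into a text prefix and an integer and sorts on that pair. `room_key` is a hypothetical name, and the "non-digits followed by digits" pattern is an assumption about the data.

```python
import re
from typing import Tuple


def room_key(room: str) -> Tuple[str, int]:
    """Sort key that orders by text prefix first, then by the numeric suffix."""
    match = re.fullmatch(r"(\D*)(\d+)", room)
    if match is None:
        return (room, 0)  # no numeric suffix: fall back to plain text ordering
    prefix, number = match.groups()
    return (prefix, int(number))


ordered = sorted(["A1", "A10", "A11", "A2", "B1"], key=room_key)
```

Unlike a one-character MID lookup, this handles prefixes of any length, and the original room names stay unchanged in the output.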

how to split one column into multiple variable columns in parse.com

I have the following records in my database
fields: brand |model |attributes
----------------------------
record 1) apple |iPhone|6s space grey sprint
record 2) audi | a6 |quattro coupe
How can I store the attributes under different column names, with version/color/carrier columns for the first record and engine/type columns for the second record? The table should have 5 columns in total for the first record and 4 columns for the second record.
How do I achieve this? Should I split the table? If there are a million products and each has attributes of varying length, the number of columns in the table will be large. What's an efficient way of doing this?
You can store the related attributes in a second table.
record|attribute|value
------+---------+-----
1 |version |6s
1 |color |space grey
1 |carrier |sprint
2 |engine |quattro
2 |type |coupe
This would allow for an item to have any number of (optional) attributes.
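As a rough illustration of that second-table layout, here is a sketch using Python's built-in sqlite3 module as a stand-in for the real data store. The table names `product` and `product_attribute` and the helper `attributes_for` are hypothetical, and in Parse the same shape would be modelled with its own classes rather than SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, brand TEXT, model TEXT);
    CREATE TABLE product_attribute (
        product_id INTEGER REFERENCES product(id),
        attribute  TEXT,
        value      TEXT
    );
""")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, "apple", "iPhone"), (2, "audi", "a6")],
)
conn.executemany(
    "INSERT INTO product_attribute VALUES (?, ?, ?)",
    [
        (1, "version", "6s"),
        (1, "color", "space grey"),
        (1, "carrier", "sprint"),
        (2, "engine", "quattro"),
        (2, "type", "coupe"),
    ],
)


def attributes_for(product_id: int) -> dict:
    """Collect all attribute/value pairs stored for one product."""
    cursor = conn.execute(
        "SELECT attribute, value FROM product_attribute WHERE product_id = ?",
        (product_id,),
    )
    return dict(cursor.fetchall())


iphone = attributes_for(1)
```

The product table keeps a fixed set of columns no matter how many products there are; each extra attribute is just one more row in the second table.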

Combining the select clause in query function with Indirect function. - Google Spreadsheets

So basically I need a dynamic select statement which changes references when dragged across rows or columns.
Example of what I need.
=sum(query('Sheet1'!$A$1:$F$621, Indirect("Select"&$F&"Where A='ABC' AND B="&"Sheet2!"&$A1)))/20
Sample Sheet 1 (data sheet), described in text since I am not allowed to use images until I reach 10 rep:
Column 1 - Sales Sites (ABC, DEF, GHI.....)
Column 2 - Sales Roles (SM, ASM, SE.....)
Column 3 - Sales in Month Jan
Column 4 - Sales in Month Feb
Column 5 - Sales in Month Mar
Sample Sheet 2 (Desired Output)
Description (in pivot terms):
Site wise (filter)
Role wise (Rows)
Month wise (Columns - Sum of Jan, Feb etc)
Value (Sum of Jan/20)--To get day wise sales numbers
FYI:
I have tried using pivots, but Google Spreadsheets doesn't allow calculated fields in pivots in any manner (for the /20 in the formula), hence I am trying to achieve the same result with a formula.
I know a table based on the pivot table could help solve this problem, but to make it more efficient I am trying to avoid using 2 tables.
Many thanks in advance for your help. Please let me know if you need additional info to understand the scenario.
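To make the desired output concrete, here is a small Python sketch of the calculation: filter by site, group by role, sum each month, and divide by 20. The sample rows in `SALES` are invented purely for illustration, and `daily_sales` is a hypothetical name; only the column layout and the /20 come from the description above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MONTHS = ("Jan", "Feb", "Mar")

# Hypothetical rows: (site, role, sales in Jan, sales in Feb, sales in Mar).
SALES: List[Tuple[str, str, int, int, int]] = [
    ("ABC", "SM", 200, 400, 100),
    ("ABC", "SM", 100, 0, 300),
    ("ABC", "ASM", 60, 80, 20),
    ("DEF", "SM", 500, 500, 500),
]


def daily_sales(rows, site: str, days: int = 20) -> Dict[str, Dict[str, float]]:
    """Sum monthly sales per role for one site and convert them to per-day values."""
    totals = defaultdict(lambda: [0] * len(MONTHS))
    for row_site, role, *monthly in rows:
        if row_site != site:
            continue  # site acts as the filter
        for i, amount in enumerate(monthly):
            totals[role][i] += amount
    return {
        role: {month: total / days for month, total in zip(MONTHS, sums)}
        for role, sums in totals.items()
    }


abc = daily_sales(SALES, "ABC")
```

Each role becomes a row and each month a column, which is the shape the formula in Sheet 2 is meant to produce cell by cell.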
