Tableau FIXED LOD vs COUNTD - tableau-desktop

I am working with a dataset containing 22,232,726 entries collected between 2008 and 2021. Because original entries cannot be deleted from the database, a new entry must be created with the same ID to update an observation.
I want to remove all repeated IDs leaving only the latest entry per ID for my analysis.
I used the following Level of Detail function in Tableau to achieve this:
{FIXED [ID]: MAX([Date])} = [Date]
The function returns a total of 17,980,416 entries. However, when I run a distinct count COUNTD([ID]) before and after applying the LOD filter, I get 17,899,956 distinct IDs. Why is my LOD function returning an extra 80,460 repeated IDs to the result?
FYI, there are no nulls in the ID or Date columns. There can be repeated dates for the same ID, but I expected Tableau to keep only one of them in the results. How can I remove these extra repeated entries or fix this counting problem?

I eventually found a solution to the problem by using a Row_ID field as the criterion for selecting one of the records with an identical ID and Date. I used two LOD calculations as filters.
The first filter kept all unique IDs with the latest Date, including some repeated IDs with the same latest date.
1:{FIXED [ID]: MAX([Date])} = [Date]
The second filter took the repeated records with identical ID and Date and kept only the one with the last Row_ID.
2:{FIXED [ID],[Date]: MAX([Row_ID])}=[Row_ID]
The original dataset doesn't have a Row_ID variable, so I created one with pandas in Python by passing the index and index_label parameters to to_csv:
df.to_csv("my-file-name.csv", index=True, index_label='Row_ID')
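
The effect of the two filters can be reproduced outside Tableau. The sketch below is plain Python over a handful of made-up rows (the data and the helper function names are assumptions for illustration, not taken from the original dataset). It suggests why filter 1 alone keeps every row tied on the latest Date, and how adding filter 2 leaves exactly one row per ID.

```python
from collections import defaultdict

# Made-up rows: (Row_ID, ID, Date). ID "B" has two rows sharing its latest date.
rows = [
    (0, "A", "2020-01-01"),
    (1, "A", "2021-03-05"),
    (2, "B", "2019-07-10"),
    (3, "B", "2019-07-10"),
    (4, "C", "2018-02-02"),
]

def latest_date_filter(rows):
    """Mimics {FIXED [ID]: MAX([Date])} = [Date] (hypothetical helper)."""
    max_date = defaultdict(str)  # "" sorts before any ISO date string
    for _, id_, date in rows:
        max_date[id_] = max(max_date[id_], date)
    return [r for r in rows if r[2] == max_date[r[1]]]

def last_row_id_filter(rows):
    """Mimics {FIXED [ID],[Date]: MAX([Row_ID])} = [Row_ID] (hypothetical helper)."""
    max_row = {}
    for row_id, id_, date in rows:
        key = (id_, date)
        max_row[key] = max(max_row.get(key, row_id), row_id)
    return [r for r in rows if r[0] == max_row[(r[1], r[2])]]

after_first = latest_date_filter(rows)         # ties on the latest date survive
after_both = last_row_id_filter(after_first)   # one row per ID remains
distinct_ids = {r[1] for r in rows}

print(len(after_first), len(after_both), len(distinct_ids))
```

With this toy data the first filter keeps one more row than there are distinct IDs, which is the same kind of gap as the 80,460 extra rows in the question.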

Related

Combining multiple data from ID

I have a question for my current Spreadsheet A.
Now I'm trying to make a new sheet for report generation where:
Report shows each ticket recorded on spreadsheet A.
Each ticket has 3 recorded process times (Verification, Repair, QA).
The month for when the job ticket is first registered.
For illustration purposes, the new sheet should look like this:
Ticket ID | Verification | Repair                    | QA                        | Month
T-001     | X Hour       | Y Hour                    | Z Hour                    | 9
T-002     | X Hour       | Blank if no recorded time | Blank if no recorded time | 9
...       | ...          | ...                       | ...                       | ...
Can Google Sheets do that? If so, how do I do it?
I have tried looking for tutorial videos on VLOOKUP/HLOOKUP/QUERY/SEARCH/FIND, but I can't seem to get the results I need.
EDITED: Changed question 3 from Name to Month
My solution is not the most elegant but it works:
https://docs.google.com/spreadsheets/d/1dEMYbI751pp55YF5M0V19U0QbytsabgwAO_97I1LXqw/copy
First, get all ticket names using the UNIQUE formula:
=unique(C3:C)
Once you have them, you need to find rows using two conditions: Process and Ticket.
To do this with VLOOKUP, I build a temporary array that contains the Process and Ticket columns stitched together, plus the duration column.
Then I run VLOOKUP against that array using the two stitched keys:
=ifna(
arrayformula(
vlookup(G2&$F$3:$F,ArrayFormula({$B$3:$B&$C$3:$C,$D$3:$D}),2,false)))
IFNA suppresses the error messages that would otherwise be displayed when no value is found.
The first ARRAYFORMULA makes the formula work for an entire column.
The last task is to determine the name of an employee. I use VLOOKUP, but because the name is further left than the Ticket, I have to make a temporary array {C3:C,A3:A} to search for the name.
Warning: VLOOKUP returns only the first name found in the list.
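
For readers who find the stitched-key trick easier to follow in code, here is a rough Python sketch of the same logic. The sample rows are made up, and the column order (employee, process, ticket, duration) is an assumption inferred from the A:D references in the formulas above.

```python
# Made-up log rows: (employee, process, ticket, duration). The layout mirrors
# columns A:D in the formulas above, which is an assumption.
log = [
    ("Ann", "Verification", "T-001", "2 Hour"),
    ("Bob", "Repair", "T-001", "5 Hour"),
    ("Ann", "QA", "T-001", "1 Hour"),
    ("Cid", "Verification", "T-002", "3 Hour"),
]
processes = ["Verification", "Repair", "QA"]

# Counterpart of =unique(C3:C): ticket IDs in first-seen order.
tickets = list(dict.fromkeys(ticket for _, _, ticket, _ in log))

# Counterpart of the stitched {B&C, D} array: (process, ticket) -> duration.
# setdefault keeps the first match, as an exact-match VLOOKUP does.
durations = {}
for _, process, ticket, duration in log:
    durations.setdefault((process, ticket), duration)

# Counterpart of IFNA(VLOOKUP(...)): blank when no record exists.
report = [
    [ticket] + [durations.get((p, ticket), "") for p in processes]
    for ticket in tickets
]
for line in report:
    print(line)
```

Keying on a (process, ticket) tuple plays the same role as concatenating the two columns, without the risk of two different pairs producing the same concatenated string.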

MAX function not bringing me my desired results

I have four columns from which I want to build a list. We collect weekly data from our third-party vendors and sort it by the data collection week. Vendors do not always submit this data, so a vendor may submit one week but not the next. I need a running total of enrollments by collection week, broken down by name. I used the MAX function, but that only gives me the latest date in the whole table; I want the latest date for each individual district. For example, if the latest week for Name A is 2/21/2020 and the latest week for Name B is 2/14/2020, I want both dates and their enrollment totals. As it stands, I get only the overall max date, 2/21/2020, and the districts whose latest submission was earlier are not returned.
The code below is what I have.
SELECT DATACOLLECTIONWEEK, NAME,DISTSCH,TOTALENROLLMENTS
FROM DB.SCHEMA.TEST
WHERE datacollectionweek = (SELECT MAX(datacollectionweek)
FROM DB.SCHEMA.TEST)
A correlated subquery computes the maximum per name instead of for the whole table:
SELECT DATACOLLECTIONWEEK, NAME,DISTSCH,TOTALENROLLMENTS
FROM DB.SCHEMA.TEST as DB1
WHERE DB1.datacollectionweek = (SELECT MAX(datacollectionweek)
FROM DB.SCHEMA.TEST AS DB2
WHERE DB1.NAME = DB2.NAME)
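
To try the correlated subquery without access to the original database, it can be run against an in-memory SQLite table from Python. This is only a sketch: the table name test, the sample rows, and the ISO date strings (chosen so that MAX orders them correctly as text) are assumptions, and the original database engine may behave differently in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (datacollectionweek TEXT, name TEXT, "
    "distsch TEXT, totalenrollments INTEGER)"
)
# Made-up rows: Name A last submitted on 2020-02-21, Name B on 2020-02-14.
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?)",
    [
        ("2020-02-14", "Name A", "D1", 100),
        ("2020-02-21", "Name A", "D1", 110),
        ("2020-02-07", "Name B", "D2", 40),
        ("2020-02-14", "Name B", "D2", 45),
    ],
)

# The inner query is re-evaluated for each outer row, restricted to that
# row's name, so every name is compared against its own latest week.
query = """
SELECT datacollectionweek, name, distsch, totalenrollments
FROM test AS db1
WHERE db1.datacollectionweek = (
    SELECT MAX(datacollectionweek)
    FROM test AS db2
    WHERE db1.name = db2.name
)
ORDER BY name
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Both names come back, each with its own latest week, which is the behaviour the question asks for.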

Google Sheets Avg Query on empty columns (AVG_SUM_ONLY_NUMERIC)

A Google Sheets average (avg) query fails with the error AVG_SUM_ONLY_NUMERIC if any column in the dataset is empty. How can you overcome this?
This occurs because the query runs on a dynamically generated dataset, so it is impossible to know beforehand which columns are empty. Moreover, the query output "layout" must not change: if a column is empty, the query should return a blank or 0 for that column.
Let's take a look.
Scenario: a Google Sheet is used to record markings for students' tests.
When students take a test, the teacher assigns multiple grades for it: for instance, one marking for writing, one for comprehension, etc.
The sheet should then build columns containing the average of all the markings assigned on the same date.
For instance, in the above sheet (link here), columns with markings given on December 16th (cols B,G,M,R,V) should be averaged in column AE.
Thanks to brilliant user Marikamitsos, this is achieved with the following query in cell AE4:
=ARRAYFORMULA(QUERY(TRANSPOSE(QUERY(TRANSPOSE(FILTER(B4:Z,B3:Z3=AE3)),
"select "&TEXTJOIN(",", 1, IF(LEN(A4:A),
"avg(Col"&ROW(A4:A)-ROW(A4)+1&")", )))&""),
"select Col2")*1)
How does the above work?
Dataset is filtered by date
Filtered dataset is transposed and an avg Query is run on it
Result dataset is being queried again to easily filter out labels
All this works fine until a student has no markings for a given date, as occurs in cell AG4: student Bob has no markings for the October 28th test, and the query throws the error AVG_SUM_ONLY_NUMERIC.
Could there be a way to insert a 0 in the filtered dataset FILTER(B4:Z,B3:Z3=AE3) so that ONLY empty rows are set to 0? This would prevent the query from failing without altering the dataset layout.
Or could there be a way to ignore zeroes in avg query?
NOTE: students cannot be graded with '0' when skipping a test!
See if this works
=ARRAYFORMULA(QUERY(TRANSPOSE(QUERY(TRANSPOSE(FILTER(B4:Z+0,B3:Z3=AG3)), "select "&TEXTJOIN(",", 1, IF(LEN(A4:A), "avg(Col"&ROW(A4:A)-ROW(A4)+1&")", )))&""),"select Col2")*1)
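
As a way to check results from the sheet, the semantics the question asks for (average only the cells that hold a marking, and return a blank when a student has none for that date) can be sketched in Python. The dates, student names, and markings below are made up, and average_for_date is a hypothetical helper, not a Sheets function.

```python
from statistics import mean

# Made-up sheet: one header date per column, then one list of markings per
# student. None stands for an empty cell.
dates = ["16/12", "28/10", "16/12", "28/10"]
markings = {
    "Alice": [7, 6, 9, 8],
    "Bob": [5, None, 6, None],
}

def average_for_date(row, dates, target):
    """Average the non-empty cells whose column date matches `target`.

    Returns "" (a blank) when the student has no markings for that date,
    so a skipped test is neither counted as 0 nor raises an error.
    """
    values = [v for v, d in zip(row, dates) if d == target and v is not None]
    return mean(values) if values else ""

for student, row in markings.items():
    print(student,
          average_for_date(row, dates, "16/12"),
          average_for_date(row, dates, "28/10"))
```

The output keeps one value per student per date, so the layout does not change when a student skips a test.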

How to count entries by date with google time stamp

I have a large amount of data from multiple google form submitters with a google timestamp, column A. I am using
=ARRAYFORMULA(COUNTIFS('Angela-5'!$AQ$2:$AQ,"Missed appointment",INT('Angela-5'!$A$2:$A),TODAY()))
to count entries for today, which works. However, when I try to count entries for the last week,
=ARRAYFORMULA(COUNTIFS('Angela-5'!$AQ$2:$AQ,"Missed appointment",INT('Angela-5'!$A$2:$A),TODAY()-7))
it does not work.
How can I make this work?
Try
=ARRAYFORMULA(COUNTIFS('Angela-5'!$AQ$2:$AQ,"Missed appointment",INT('Angela-5'!$A$2:$A),(+TODAY()-7)))
TODAY()-7 is not being recognised as a formula by criterion 2; changing it to (+TODAY()-7) forces the formula value (a week ago) to be evaluated, and criterion 2 then recognises it.
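
The counting logic itself can be sketched in Python to make the two possible readings of "last week" explicit: entries on the single day seven days ago (which is what the formula above matches) versus entries anywhere in the last seven days. The sample responses and the helper names are assumptions for illustration.

```python
from datetime import date, datetime, timedelta

# Made-up form responses: (timestamp, status).
responses = [
    (datetime(2021, 5, 10, 9, 30), "Missed appointment"),
    (datetime(2021, 5, 10, 16, 5), "Attended"),
    (datetime(2021, 5, 3, 11, 0), "Missed appointment"),
    (datetime(2021, 5, 6, 8, 45), "Missed appointment"),
]

def count_on_day(responses, status, day):
    """Like COUNTIFS(status_col, status, INT(timestamp_col), day)."""
    return sum(1 for ts, s in responses if s == status and ts.date() == day)

def count_between(responses, status, start, end):
    """Counts matches whose date falls in the inclusive range [start, end]."""
    return sum(1 for ts, s in responses
               if s == status and start <= ts.date() <= end)

today = date(2021, 5, 10)  # fixed so the example is reproducible
week_ago = today - timedelta(days=7)

print(count_on_day(responses, "Missed appointment", today))
print(count_on_day(responses, "Missed appointment", week_ago))
print(count_between(responses, "Missed appointment", week_ago, today))
```

Calling ts.date() drops the time of day, which is the role INT() plays on the timestamp column in the Sheets formula.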

remove duplicates based on one column and keep last entry

I'm trying to remove duplicates based on one column and keep the last entry. Right now my formula is keeping the first value.
I'm using the formula found in this post:
Selecting all rows with distinct column values - Google query language
Well, the short answer is just to change 0 (or false) in your formula to 1 (or true) so that VLOOKUP matches the last entry for each unique value:
=ArrayFormula(iferror(VLOOKUP(unique(Data!D:D),{Data!D:D,Data!A:D}, {2,3,4,5},1 ),""))
This does appear to work for your test data, but that isn't the end of the story.
According to the documentation, when VLOOKUP is used this way the data has to be sorted on the lookup column, but in the comments above you said that you can't assume the data is sorted on the lookup column. Things do go horribly wrong if you try this on unsorted data, so you have to sort it on the lookup column like this:
=ArrayFormula(iferror(VLOOKUP(sort(unique(Data1!D2:D),1,true),sort({Data1!D2:D,Data1!A2:D},1,true), {2,3,4,5},1 )))
The only slight downside is that this doesn't include the headings (because they would get sorted to the end of the data).
Here is the same test data sorted in descending order on ID
This gives the correct result (but without headers)
You can add the headers just by putting
=query(Data1!A:D,"select * limit 0")
above the data.
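
For comparison, the "keep the last entry per key" rule is short to express in Python, and it does not depend on the data being sorted first. This is a sketch with made-up rows; the ID is placed in the fourth position to match Data!D:D in the formulas above.

```python
# Made-up rows in sheet order: (col_A, col_B, col_C, ID).
rows = [
    ("a1", "b1", "c1", 3),
    ("a2", "b2", "c2", 1),
    ("a3", "b3", "c3", 3),
    ("a4", "b4", "c4", 2),
    ("a5", "b5", "c5", 1),
]

# Later rows overwrite earlier ones, so each ID ends up mapped to its last
# row. No prior sorting is required for this step.
last_by_id = {}
for row in rows:
    last_by_id[row[3]] = row

# Sort by ID only to mirror the sorted output of the Sheets formula.
deduped = [last_by_id[key] for key in sorted(last_by_id)]
for row in deduped:
    print(row)
```

This can serve as a cross-check on the sheet: each ID should appear once, paired with the values from its final occurrence in the original order.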
