Pivoting a dask dataframe using multiple columns as index - dask

I have a Dask DataFrame of the following format:
date hour device param value
20190701 21 dev_01 att_1 0.000000
20190718 22 dev_01 att_2 20.000000
20190718 22 dev_01 att_3 18.611111
20190701 21 dev_01 att_4 18.706083
20190718 22 dev_01 att_5 23.333333
I am trying to pivot the dataframe using the dask.dataframe pivot_table() API. However, I want to use 'date', 'hour' and 'device' as the index (i.e., in the pivoted table each row would be uniquely identified by the date, hour and device identifier):
ddf.pivot_table(index = ['date', 'hour', 'device'], columns='param', values='value')
However, it's failing with the following error:
'index' must be the name of an existing column
As I understand from the API documentation, the 'index' parameter accepts the name of a single column (not a list), hence this error.
Is there any alternative way to pivot a dask dataframe using multiple columns as the index?

As mentioned in the docstring, 'index' must be the name of a single column, and the column passed as 'columns' must be of categorical dtype. So to accomplish what you want, you would have to combine your three columns into a single column and pivot on that.
This is doable using normal Pandas syntax, but will likely require a full pass through the data to get the categories.
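A minimal sketch of that idea, assuming the three columns are joined into one string column. The helper name add_pivot_key, the column name key and the "|" separator are made up for illustration. It is run here on a small pandas frame built from the sample rows in the question; the same column expressions are expected to work on a dask DataFrame, but that part is not exercised below.

```python
import pandas as pd


def add_pivot_key(df, cols=("date", "hour", "device"), sep="|"):
    """Return df with the given columns joined into one string column 'key'.

    Hypothetical helper: uses only column-wise operations (astype, +, assign).
    """
    key = df[cols[0]].astype(str)
    for col in cols[1:]:
        key = key + sep + df[col].astype(str)
    return df.assign(key=key)


# Sample rows copied from the question.
pdf = pd.DataFrame({
    "date": [20190701, 20190718, 20190718, 20190701, 20190718],
    "hour": [21, 22, 22, 21, 22],
    "device": ["dev_01"] * 5,
    "param": ["att_1", "att_2", "att_3", "att_4", "att_5"],
    "value": [0.0, 20.0, 18.611111, 18.706083, 23.333333],
})

# Pivot on the single combined key instead of a list of columns.
wide = add_pivot_key(pdf).pivot_table(index="key", columns="param", values="value")

# Split the key back into its three parts (they come back as strings).
parts = wide.index.to_series().str.split("|", expand=True)
parts.columns = ["date", "hour", "device"]
```

With dask, the equivalent would presumably be add_pivot_key(ddf).categorize(columns=["param"]).pivot_table(index="key", columns="param", values="value"); the categorize step is the full pass over the data mentioned above.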

Related

How to check for overlapping dates

I am looking for a solution in either Google Sheets or Apps Script to check for overlapping dates for the same account. There will be multiple accounts and the dates won't be in any particular order. Here is an example below. I am trying to produce the rightmost column, "Check", with a formula or some automation. Any suggestions would be greatly appreciated.
Start Date  End Date    Account No.  Check
2023-01-01  2023-01-02  123          ERROR
2023-01-02  2023-01-05  123          ERROR
2023-02-25  2023-02-27  456          OK
2023-01-11  2023-01-12  456          OK
2023-01-01  2023-01-15  789          ERROR
2023-01-04  2023-01-07  789          ERROR
2023-01-01  2023-01-10  012          OK
2023-01-15  2023-01-20  012          OK
I also found some similar past questions, but they don't have the "for the same account" component and/or require some sort of chronological order, which my sheet will not have.
How to calculate the overlap between some Google Sheet time frames?
How to check if any of the time ranges overlap with each other in Google Sheets
Another approach (to be entered in D2):
=arrayformula(lambda(last_row,
lambda(acc_no,start_date,end_date,
if(isnumber(match(acc_no,unique(query(query(split(flatten(acc_no&"|"&split(map(start_date,end_date,lambda(start_date,end_date,join("|",sequence(1,end_date-(start_date-1),start_date)))),"|")),"|"),"select Col1,count(Col2) where Col2 is not null group by Col1,Col2",0),"select Col1 where Col2>1",1)),0)),"ERROR","OK"))(
C2:index(C2:C,last_row),A2:index(A2:A,last_row),B2:index(B2:B,last_row)))(
counta(A2:A)))
Briefly, we are creating a sequence of dateserial numbers between the start & end dates for each row, doing some string manipulation to turn it into a table of account number against each date, then QUERYing it to get each account number which has dateserials with count>1 (i.e. overlaps), using UNIQUE to get the distinct list of those account numbers, then finally matching this list against the original list of account numbers to give the ERROR/OK output.
(1) Here is one way, considering each case which could result in an overlap separately:
=ArrayFormula(if(A2:A="",,
if((countifs(A2:A,"<="&A2:A,B2:B,">="&A2:A,C2:C,C2:C,row(A2:A),"<>"&row(A2:A))
+countifs(A2:A,"<="&B2:B,B2:B,">="&B2:B,C2:C,C2:C,row(A2:A),"<>"&row(A2:A))
+countifs(A2:A,">="&A2:A,B2:B,"<="&B2:B,C2:C,C2:C,row(A2:A),"<>"&row(A2:A))
)>0,"ERROR","OK")
)
)
(2) Here is the method using the Overlap formula
min(end1,end2)-max(start1,start2)+1
which results in
=ArrayFormula(if(byrow(A2:index(C:C,counta(A:A)),lambda(r,sum(text(if(index(r,2)<B2:B,index(r,2),B2:B)-if(index(r,1)>A2:A,index(r,1),A2:A)+1,"0;\0;\0")*(C2:C=index(r,3))*(row(A2:A)<>row(r)))))>0,"ERROR","OK"))
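For readers who want to check the logic outside Sheets, here is a rough Python sketch of the same overlap formula, applied to every pair of rows that share an account. The function names are made up for illustration, and the rows are the sample data from the question.

```python
from datetime import date


def overlap_days(start1, end1, start2, end2):
    """min(end1, end2) - max(start1, start2) + 1; positive means the ranges overlap."""
    return (min(end1, end2) - max(start1, start2)).days + 1


def check_overlaps(rows):
    """rows: list of (start, end, account). Returns 'ERROR'/'OK' per row."""
    result = []
    for i, (s1, e1, acc1) in enumerate(rows):
        clash = any(
            acc1 == acc2 and overlap_days(s1, e1, s2, e2) > 0
            for j, (s2, e2, acc2) in enumerate(rows)
            if j != i  # never compare a row with itself
        )
        result.append("ERROR" if clash else "OK")
    return result


rows = [
    (date(2023, 1, 1), date(2023, 1, 2), "123"),
    (date(2023, 1, 2), date(2023, 1, 5), "123"),
    (date(2023, 2, 25), date(2023, 2, 27), "456"),
    (date(2023, 1, 11), date(2023, 1, 12), "456"),
    (date(2023, 1, 1), date(2023, 1, 15), "789"),
    (date(2023, 1, 4), date(2023, 1, 7), "789"),
    (date(2023, 1, 1), date(2023, 1, 10), "012"),
    (date(2023, 1, 15), date(2023, 1, 20), "012"),
]
checks = check_overlaps(rows)
```

Because every row is compared with every other row of the same account, the input order does not matter, at the cost of quadratic work.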
(3) Most efficient is to use the original method of comparing previous and next dates, but then you need to sort and sort back like this:
=lambda(data,sort(map(sequence(rows(data)),lambda(c,if(if(c=1,0,(index(data,c-1,2)>=index(data,c,1))*(index(data,c-1,3)=index(data,c,3)))+if(c=rows(data),0,(index(data,c+1,1)<=index(data,c,2))*(index(data,c+1,3)=index(data,c,3)))>0,"ERROR","OK"))),index(data,0,4),1))(SORT(filter({A2:C,row(A2:A)},A2:A<>""),3,1,1,1))
However, this only checks for local overlaps, not global ones. You can see what I mean if you change the dataset slightly:
Clearly the first and third pair of dates have an overlap but G4 contains "OK". This is because each pair of dates is only checked against the adjacent pairs of dates. This also applies to the original reference cited by OP - here's an example where it would give a similar result:
The formula posted by @The God of Biscuits gives the correct (global) result :-)
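The difference between the local and global checks can be sketched in Python. The three rows below are hypothetical, invented only to illustrate the point (they are not the dataset from the missing screenshot), and both function names are made up.

```python
from datetime import date


def adjacent_check(rows):
    """Flag a row only if it overlaps its neighbour after sorting by account, start."""
    order = sorted(range(len(rows)), key=lambda i: (rows[i][2], rows[i][0]))
    flags = ["OK"] * len(rows)
    for pos, i in enumerate(order):
        start, end, acc = rows[i]
        if pos > 0:
            _, prev_end, prev_acc = rows[order[pos - 1]]
            if prev_acc == acc and prev_end >= start:
                flags[i] = "ERROR"
        if pos < len(order) - 1:
            next_start, _, next_acc = rows[order[pos + 1]]
            if next_acc == acc and next_start <= end:
                flags[i] = "ERROR"
    return flags


def global_check(rows):
    """Flag a row if it overlaps any other row of the same account."""
    flags = []
    for i, (s1, e1, a1) in enumerate(rows):
        clash = any(
            a1 == a2 and s1 <= e2 and s2 <= e1
            for j, (s2, e2, a2) in enumerate(rows)
            if j != i
        )
        flags.append("ERROR" if clash else "OK")
    return flags


# Hypothetical data: the long first range covers the third one,
# but the third row's only neighbour is the short second range.
rows = [
    (date(2023, 1, 1), date(2023, 1, 31), "A"),
    (date(2023, 1, 2), date(2023, 1, 3), "A"),
    (date(2023, 1, 10), date(2023, 1, 12), "A"),
]
local_flags = adjacent_check(rows)
global_flags = global_check(rows)
```

Here the adjacent check marks the third row "OK" while the global check marks all three rows "ERROR", which is the discrepancy described above.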

Tableau FIXED LOD vs COUNTD

I am working with a dataset containing 22,232,726 entries collected between 2008 and 2021. Because original entries cannot be deleted from the database, a new entry must be created with the same ID to update an observation.
I want to remove all repeated IDs leaving only the latest entry per ID for my analysis.
I used the following Level of Detail function in Tableau to achieve this:
{FIXED [ID]: MAX([Date])} = [Date]
The function returns a total of 17,980,416 entries. However, when I run a distinct count COUNTD([ID]) before and after applying the LOD filter, I get 17,899,956 distinct IDs. Why is my LOD function returning an extra 80,460 repeated IDs to the result?
FYI, there are no Nulls in the ID nor the Date columns. So there can be repeated dates for the same ID, but I expected Tableau to keep only one of them in the results. How can I remove these extra repeated entries or fix this counting problem?
I eventually found a solution to the problem by using a Row_ID field as the criterion for selecting one of the records with an identical ID and Date. I used 2 LOD calcs as filters.
The first filter kept all unique IDs with the latest Date, including some repeated IDs with the same latest date.
1: {FIXED [ID]: MAX([Date])} = [Date]
The second filter took the repeated records with identical ID and Date and kept only the one with the last Row_ID.
2: {FIXED [ID], [Date]: MAX([Row_ID])} = [Row_ID]
The original dataset doesn't have a Row_ID variable, so I created it with pandas in Python by passing the index and index_label parameters to to_csv:
df.to_csv("my-file-name.csv", index=True, index_label='Row_ID')
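As a cross-check outside Tableau, the two filters can be mirrored in pandas. This is only a sketch on a tiny made-up frame (the five rows are hypothetical); the column names ID, Date and Row_ID follow the post.

```python
import pandas as pd

# Hypothetical data: ID 2 has two entries sharing the same latest date.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2],
    "Date": pd.to_datetime(
        ["2020-01-01", "2021-01-01", "2019-05-01", "2021-06-01", "2021-06-01"]
    ),
})
df["Row_ID"] = range(len(df))  # same role as index_label='Row_ID' in to_csv

# Filter 1: keep rows whose Date equals the latest Date for their ID.
# Ties on the latest date survive, so an ID can still appear more than once.
step1 = df[df["Date"] == df.groupby("ID")["Date"].transform("max")]

# Filter 2: among rows with identical ID and Date, keep the highest Row_ID.
step2 = step1[
    step1["Row_ID"] == step1.groupby(["ID", "Date"])["Row_ID"].transform("max")
]
```

After the first filter the frame has three rows for two distinct IDs, which is the same kind of mismatch as the 17,980,416 rows versus 17,899,956 distinct IDs in the question; the second filter leaves one row per ID.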

Is there a function to get mean values of a column for every unique date in a date column?

I have an AQI (Air Quality Index) dataset with various columns such as O3, SO2, PM2.5, etc., and a datetime column containing timestamps like 20-Sep-2017 - 01:00 and 20-Sep-2017 - 00:00. I want the mean value of each column for every unique date; for example, O3 has several values on 20-Sep-2017, but I only want their mean for that day. I've tried regex and many other things but did not get the desired output.
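One possible approach in pandas, sketched under assumptions: the column names datetime, O3 and SO2 and the three sample rows are made up, and the timestamp format is inferred from the examples in the question. The idea is to parse the strings into real datetimes and group by the date part.

```python
import pandas as pd

# Hypothetical sample in the timestamp format quoted in the question.
df = pd.DataFrame({
    "datetime": ["20-Sep-2017 - 00:00", "20-Sep-2017 - 01:00", "21-Sep-2017 - 00:00"],
    "O3": [10.0, 20.0, 30.0],
    "SO2": [1.0, 3.0, 5.0],
})

# Parse the strings; %b is the abbreviated month name (English under the default locale).
df["datetime"] = pd.to_datetime(df["datetime"], format="%d-%b-%Y - %H:%M")

# Group by calendar date and average every pollutant column.
daily_mean = df.groupby(df["datetime"].dt.date)[["O3", "SO2"]].mean()
```

The result has one row per unique date, so no regex on the timestamp strings is needed.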

Google Sheets Avg Query on empty columns (AVG_SUM_ONLY_NUMERIC)

A Google Sheets average (avg) query fails with the error AVG_SUM_ONLY_NUMERIC if any column in the dataset is empty. How can you overcome this?
Essentially, this occurs because the query is run on a dynamically generated dataset, so it's impossible to know beforehand which columns are empty. Moreover, the query output "layout" must not change, so if a column is empty, the query should return blank or 0 for that column.
Let's give it a look
Scenario: a Google Sheet is used to record marks for students' tests.
When students take a single test, the teacher assigns multiple grades for it: for instance, one mark for writing, one for comprehension, etc.
The sheet should finally build columns containing an average for all the markings assigned within the same date.
For instance, in the above sheet (link here), columns with markings given on December 16th (cols B,G,M,R,V) should be averaged in column AE.
Thanks to brilliant user Marikamitsos, this is achieved with the following query in cell AE4:
=ARRAYFORMULA(QUERY(TRANSPOSE(QUERY(TRANSPOSE(FILTER(B4:Z,B3:Z3=AE3)),
"select "&TEXTJOIN(",", 1, IF(LEN(A4:A),
"avg(Col"&ROW(A4:A)-ROW(A4)+1&")", )))&""),
"select Col2")*1)
How does the above work?
Dataset is filtered by date
Filtered dataset is transposed and an avg Query is run on it
Result dataset is being queried again to easily filter out labels
All this works fine until a student has no markings for a given date, as occurs in cell AG4: student Bob has no markings for the October 28th test, and the query throws the error AVG_SUM_ONLY_NUMERIC.
Could there be a way to insert a 0 in the filtered dataset FILTER(B4:Z,B3:Z3=AE3) so that ONLY empty rows will be set to 0? This would prevent the query from failing, while avoiding altering the dataset layout.
Or could there be a way to ignore zeroes in avg query?
NOTE: students cannot be graded with '0' when skipping a test!
See if this works (adding +0 to the filtered range turns blank cells into zeros, so every column is numeric):
=ARRAYFORMULA(QUERY(TRANSPOSE(QUERY(TRANSPOSE(FILTER(B4:Z+0,B3:Z3=AG3)), "select "&TEXTJOIN(",", 1, IF(LEN(A4:A), "avg(Col"&ROW(A4:A)-ROW(A4)+1&")", )))&""),"select Col2")*1)

How do I get QUERY function to return correct data?

So I have this spreadsheet with data in it, there are 29 columns and 54 rows.
On the 2nd sheet I'm trying to find all of the rows that fit a certain criteria.
For some reason, if I include column X in my query data, the results are completely messed up. The 1st row of the result is just the first 23 rows concatenated together, whether they fit the criteria or not. If I only include up to column W, the query is OK and returns the correct results. But the problem is that I need to get data from columns A and AB, so I need to include column X in my data range.
In this spreadsheet you can see the data on Sheet1, the query that includes column X on Sheet2, and on Sheet3 I have the same exact query except it only goes up to Column W and you can see the correct results there.
Basically, I need the query to return the value of Column A and Column AB for every row where Column B is marked with an "x".
Here is the sheet
Include the third parameter of query, which is the number of header rows:
=query(Sheet1!A2:X, "select A where B='x'", 1)
The parameter is optional, but if it's omitted, query will guess the number of header rows based on the data. Sometimes it guesses correctly, sometimes not (hence the dependence on what columns are included in the query). In your case, it decided that the table had 23 header rows and concatenated them in the output.
I don't know why you have an arrayformula wrapper around query; it does not really do anything.
This is a duplicate of https://webapps.stackexchange.com/questions/103761/how-do-i-get-query-to-return-the-right-data which I answered hours ago:
You can use the Filter function to do this, with a literal array:
