I have a large number of parquet files in s3 which together represent a large timeseries describing an evolution of a certain state
delta timestamp
0 True 1
1 False 2
2 False 2
3 True 3
4 True 4
5 True 5
6 True 6
7 False 7
8 False 7
9 False 7
10 True 8
11 True 9
Rows that share a timestamp and have delta==False together represent the full state of the system at that timestamp, while rows with delta==True represent a partial change to the previous state.
My goal is to apply a row wise operation to this timeseries which restores state, updates it on each change and performs certain calculation, producing a new timeseries (or an aggregate value, this is not relevant at this point). All of this should be performed in parallel on a multi-machine cluster with Dask.
For this, I want to create a partitioned Dask DataFrame in which each partition starts with a block of rows with delta == False, i.e. each partition knows the full state at its beginning, and then apply my function to each partition.
Partition 1 (note row 0 is discarded since no previous `delta==False` values are found)
1 False 2
2 False 2
3 True 3
4 True 4
5 True 5
6 True 6
Partition 2
7 False 7
8 False 7
9 False 7
10 True 8
11 True 9
In original data timestamps are already sorted (though minor misordering may occur)
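The boundary rule described above can be sketched in plain Python, independent of Dask. This is only an illustration of where the partition boundaries would fall for the sample rows; the function name `split_on_full_state` and the tuple layout are assumptions made for the sketch, and it does not address how to distribute the work across a cluster.

```python
# Rows from the example above: (row_id, delta, timestamp).
rows = [
    (0, True, 1), (1, False, 2), (2, False, 2), (3, True, 3),
    (4, True, 4), (5, True, 5), (6, True, 6), (7, False, 7),
    (8, False, 7), (9, False, 7), (10, True, 8), (11, True, 9),
]


def split_on_full_state(rows):
    """Split rows into lists that each begin with a delta == False block.

    Hypothetical helper: a new partition starts at a delta == False row
    that follows a delta == True row or has a different timestamp than
    the previous row. Rows before the first full state are discarded.
    """
    partitions = []
    current = None  # stays None until the first full state is seen
    prev = None
    for row in rows:
        _, delta, timestamp = row
        starts_full_state = (not delta) and (
            prev is None or prev[1] or prev[2] != timestamp
        )
        if starts_full_state:
            current = [row]
            partitions.append(current)
        elif current is not None:
            current.append(row)
        prev = row
    return partitions


partitions = split_on_full_state(rows)
partition_ids = [[row[0] for row in part] for part in partitions]
print(partition_ids)
```

For the sample data this yields two groups of row ids, matching Partition 1 and Partition 2 as listed above.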
My questions are:
Does this sound like a correct approach? If so, how do I perform such a partition? Going through docs, I found set_index, which in my case doesn't help, and a map_overlap function which also doesn't fit since it uses fixed threshold for each partition, which I don't know beforehand.
What would happen if I have a long range of rows with delta==True? I will need to somehow partition it as well, while keeping a reference to the previous partition until I reach a partition which contains the full state, i.e. records with delta==False. I guess this should be handled by a task graph, but is there a way to tell Dask to wait for a certain partition to be processed?
Thanks!
This may be beyond my skill level in Google Sheets, and it's certainly straining my brain to think through, but I have two columns out of a large spreadsheet (30000 lines or so), and I need to find the identifiers in one column that are paired ONLY with one specific value in the other column. That is, I would need the following list to return only the values on the left that had a 3 in the right column every time that value appears on the left, not just for a specific instance.
"Unique" Identifier (can repeat)  Value
1  2
2  3
3  2
4  2
5  3
6  2
1  2
2  2
3  2
4  2
5  2
6  2
I have the following formula from another couple answers mocked up, but it doesn't get me all the way there: =UNIQUE(FILTER(A2:A,B2:B>0))
How can I get it to exclude the ones that have, for instance, both a 2 and a 3 in the right column for the same value in the left column?
Edit: To put it in more real terms (I was trying to keep it abstract so I could understand the basics), I have a Catalog ID and a Condition for items, and need to find all Catalog IDs that only have Good copies, not any Very Good copies. This link should show what I want to achieve:
https://docs.google.com/spreadsheets/d/e/2PACX-1vSjenkDS2Mk3t4kTcDoJqSc8AV6ONu4Q17K1HPaIUdJkb7dhdnbAt-CzUxGO3ZoJISNpGajUtFTGz8c/pubhtml?gid=0&single=true
to return only the values on the left that had a 3 in the right column every time
try:
=UNIQUE(FILTER(A:A; B:B=3))
update 1:
=UNIQUE(FILTER(Sheet1!A:A; Sheet1!B:B="Good"))
update 2:
=UNIQUE(FILTER(Sheet1!A:A, Sheet1!B:B="Good",
NOT(COUNTIF(FILTER(Sheet1!A:A, Sheet1!B:B<>"Good"), Sheet1!A:A))))
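The logic of the last formula (keep an identifier only if every one of its rows has the wanted value) can be sketched outside Sheets as follows. This is an illustrative Python sketch using the numeric sample pairs from the question; the function name `ids_with_only` is an assumption made for the example.

```python
from collections import defaultdict

# (identifier, value) pairs from the numeric sample in the question.
pairs = [
    (1, 2), (2, 3), (3, 2), (4, 2), (5, 3), (6, 2),
    (1, 2), (2, 2), (3, 2), (4, 2), (5, 2), (6, 2),
]


def ids_with_only(pairs, wanted):
    """Return identifiers whose rows all carry the wanted value."""
    values_by_id = defaultdict(set)
    for identifier, value in pairs:
        values_by_id[identifier].add(value)
    # An identifier qualifies only if its set of values is exactly {wanted}.
    return sorted(i for i, vals in values_by_id.items() if vals == {wanted})


only_twos = ids_with_only(pairs, 2)
only_threes = ids_with_only(pairs, 3)
print(only_twos, only_threes)
```

In this sample no identifier has a 3 on every row (2 and 5 each also appear with a 2), so the result for 3 is empty, while 1, 3, 4 and 6 qualify for 2.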
I have a dataset that looks like this:
outlet name (string variable): name of media outlet (maximum 12), the last three outlets in the file are The Guardian, The Telegraph and The Independent.
score 1: scale
score 2: scale
...
score 7: scale.
What I want to do is compute a set of 21 new variables that show for each of the cases (media outlets), for each of the seven variables (scores), the difference between the score of that specific outlet, and the scores of the three outlets of interest: The Guardian, The Telegraph and The Independent (7 variables X 3 benchmark outlets=21). Essentially I want to compare each outlet's scores to my three benchmark outlets.
So for example I should have a new variable, named score1_Guardian, that for outlet 1 will be computed as: the score outlet 1 got for that variable - the score The Guardian got for that variable. Variable score2_Guardian will show, for each outlet, the difference between the score each specific outlet got on that variable and the score The Guardian got for that variable, and so on. So in this example, the outlet The Guardian will score 0 on all score1_Guardian to score7_Guardian variables.
There are simpler ways to do this than what I suggest below, but I like it better this way: less code and fewer temporary variables.
First I create a fake dataset according to your parameters:
data list list/outlet (a12) score1 to score7 (7f6).
begin data
'outlet1' 1 2 3 4 5 6 7
'outlet2' 2 3 4 5 6 7 8
'outlet3' 5 6 7 8 9 1 2
'Guardian' 7 8 9 1 2 5 6
'Telegraph' 5 12 12 3 4 4 2
'Independent' 2 2 2 2 2 2 2
end data.
Now we can get to work:
*going from wide to long form - just to avoid creating too many variables on the way.
varstocases /make score from score1 to score7/index scorenum(score).
if outlet='Guardian' Guardian=score.
if outlet='Telegraph' Telegraph=score.
if outlet='Independent' Independent=score.
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES OVERWRITEVARS=YES
/BREAK=scorenum /Guardian=MAX(Guardian) /Telegraph=MAX(Telegraph) /Independent=MAX(Independent).
*now we have three new variables ready to compare.
compute Guardian=score - Guardian.
compute Telegraph=score - Telegraph.
compute Independent=score - Independent.
* last step - going back to wide format.
compute scorenum=substr(scorenum,6,1).
CASESTOVARS /id=outlet /index=scorenum/sep="_".
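For readers who want to check the expected numbers, the same computation can be sketched in Python on the fake dataset above. This is not SPSS syntax, only an illustration of the arithmetic; the function name `benchmark_differences` is an assumption, and the output names follow the score1_Guardian pattern from the question rather than whatever names CASESTOVARS generates.

```python
# Fake dataset from the answer above: outlet -> score1..score7.
scores = {
    "outlet1": [1, 2, 3, 4, 5, 6, 7],
    "outlet2": [2, 3, 4, 5, 6, 7, 8],
    "outlet3": [5, 6, 7, 8, 9, 1, 2],
    "Guardian": [7, 8, 9, 1, 2, 5, 6],
    "Telegraph": [5, 12, 12, 3, 4, 4, 2],
    "Independent": [2, 2, 2, 2, 2, 2, 2],
}
benchmarks = ["Guardian", "Telegraph", "Independent"]


def benchmark_differences(scores, benchmarks):
    """For each outlet, subtract each benchmark outlet's score from its own."""
    result = {}
    for outlet, values in scores.items():
        row = {}
        for bench in benchmarks:
            pairs = zip(values, scores[bench])
            for i, (own, ref) in enumerate(pairs, start=1):
                row[f"score{i}_{bench}"] = own - ref
        result[outlet] = row
    return result


diffs = benchmark_differences(scores, benchmarks)
print(diffs["outlet1"]["score1_Guardian"])
```

Each outlet ends up with 21 new values (7 scores x 3 benchmarks), and the Guardian row is 0 on all of its score*_Guardian values, as the question expects.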
I'm trying to find a simple solution for first-n-per-group.
I have a table of data: the first column holds dates and the rest hold data. I want to group by the date, as multiple entries per date are allowed. The second column contains some numbers, and for each date I want the FIRST record.
Currently the aggregate function I could possibly use is MIN() but that will return the lowest value and not the first.
A B
01/01/2018 10
01/01/2018 15
02/01/2018 10
02/01/2018 2
02/01/2018 100
02/01/2018 20
03/01/2018 5
03/01/2018 2
Desired output
A B
01/01/2018 10
02/01/2018 10
03/01/2018 5
Current results using MIN() - undesired
A B
01/01/2018 10
02/01/2018 2
03/01/2018 2
It's a shame there isn't a FIRST() aggregate function in Google Sheets, which would make this a lot easier.
I saw a couple of examples of using the Row Number and ArrayQuery, but that doesn't seem to work for me. There are about 5000 rows of data so trying to keep this as efficient as possible, and not have to recalculate the entire sheet on any change, each taking a few seconds.
Currently I have this, which appends a third column with the Row Number:
=query({A1:B, arrayformula(row(A1:B))}, "select min(Col1),min(Col2) group by Col1")
Thanks
EDIT 1
A suggested solution was =SORTN(A:B,2^99,2,1,1), which is a clean simple one. However, this requires a large range of "free space" to display the returned dataset. Imagine 3000+ rows.
I was hoping for a QUERY() -based solution, as I wanted to do further operations with the results. Specifically, count the occurrences of distinct values.
For example: I wanted a returned dataset of
A B
01/01/2018 10
02/01/2018 10
03/01/2018 5
Yet I want to count the occurrences of those values (and then ignoring the dates). For example:
B C
10 2
5 1
Perhaps I've confused the situation by using numbers? the "data" in ColB is TEXT (short 3 letter codes), however I used numbers to show I couldn't use MIN() function as that returns the numerically lowest value.
So in brief:
Go through all rows (3000+ rows) and group by the FIRST row of a particular date
return the FIRST value of that row
COUNT() all unique occurrences of those FIRST values, disregarding the date. Just a list with the unique values and their count (again, only the first one of any particular day)
=SORTN(A:B,2^99,2,1,1)
If your data is sorted as in the sample, you can easily remove duplicates with SORTN().
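The two steps asked for in the question (first value per date, then a count of those first values) can be sketched in Python using the sample rows. This is only an illustration of the intended result, not a Sheets formula; the function name `first_per_date` is an assumption made for the example.

```python
from collections import Counter

# Sample rows from the question: (date, value), already in sheet order.
rows = [
    ("01/01/2018", 10), ("01/01/2018", 15),
    ("02/01/2018", 10), ("02/01/2018", 2),
    ("02/01/2018", 100), ("02/01/2018", 20),
    ("03/01/2018", 5), ("03/01/2018", 2),
]


def first_per_date(rows):
    """Keep the first value seen for each date, in order of appearance."""
    first = {}
    for date, value in rows:
        first.setdefault(date, value)  # later rows for the same date are ignored
    return first


firsts = first_per_date(rows)
counts = Counter(firsts.values())  # occurrences of each first value, dates ignored
print(firsts)
print(counts)
```

For the sample this gives 10, 10 and 5 as the first values, and counts of 2 for 10 and 1 for 5, matching the desired output tables in the question.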
I have a google spreadsheet that has 6 cells with specific numbers in them. Every week, a series of numbers is entered in and I would like to flag the numbers in a separate column if they appear for that week. I was using the formula below where my numbers are in D2->I2 and the weekly ones would be in D18->I18 for example.
=arrayformula(sumproduct((D2:I2=D18:I18)))
Now, while this works, it's not quite what I'm trying to do. Unless the numbers match each other exactly, 1 2 3 4 5 6 to 1 2 3 4 5 6 then the addition doesn't happen. What I would like to have happen is that if, for example, the master column has 1 2 3 4 5 6 and the weekly column has 3 7 9 1 8 5 then the cell with the formula would display the value of 3 for matching three of the numbers that week.
Does anyone have a suggestion on how best to accomplish this?
See if this works:
=ArrayFormula(sum(--regexmatch(D2:I2&"", join("|", D18:I18&""))))
with exclusion of empty cells in both ranges:
=iferror(ArrayFormula(sum(--regexmatch(to_text(filter(D2:I2, len(D2:I2))), "\b("&join("|", to_text(filter(D18:I18, len(D18:I18))))&")\b"))))
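What these formulas count can be sketched in Python: how many of the master numbers appear anywhere in the weekly numbers, regardless of position. This is a minimal sketch of the intended result, not a translation of the regex approach; the function name `count_matches` and the use of None for empty cells are assumptions.

```python
def count_matches(master, weekly):
    """Count master numbers that appear anywhere in the weekly numbers."""
    # Empty cells are modelled as None and skipped on both sides.
    weekly_set = {v for v in weekly if v is not None}
    return sum(1 for v in master if v is not None and v in weekly_set)


# Example from the question: master 1..6 against one week's draw.
matches = count_matches([1, 2, 3, 4, 5, 6], [3, 7, 9, 1, 8, 5])
print(matches)
```

With the example from the question the count is 3 (the numbers 1, 3 and 5).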
I imported data from SurveyMonkey into SPSS, and SurveyMonkey automatically assigns values and value labels. My values and labels are currently something like this:
1 "Married"
2 "Single"
3 "777"
4 "999"
I recoded the variables so that 3=777 and 4=999, and then set 777 and 999 to missing. I then used ADD VALUE LABELS to add 777="Refused" and 999="Don't know". How do I use syntax to delete the values and value labels for 3 and 4? They no longer apply since I recoded values 3 and 4. I know I can use VALUE LABELS to replace all my values and labels, but I would have to specify all my categories, which would be tedious. Ideally I would recode the 3 and 4 values, add value labels for the new 777 and 999 values, and delete the old 3 and 4. If I only had a few variables I would consider doing it a different way, but I want to write syntax that I can use for a list of about 100 variables. I will also be pulling data from SurveyMonkey on a weekly basis and would like to have the syntax file to rename, recode, and add value labels ready to go each time I pull the data.
I don't believe there is a way to delete value labels for specific values only. So the workaround is to explicitly set the value labels for the entire set of values:
DATA LIST FREE / MS.
BEGIN DATA
1 2 3 4
END DATA.
/* 1. Original values labels */.
VALUE LABELS MS 1 "Single" 2 "Married" 3 "777" 4 "999".
CTABLES /TABLE MS[C].
/* 2. Recode values and re-label - Note that values 3 and 4 still have value labels; the labels are just blank, so CTABLES still registers them */.
RECODE MS (3=777) (4=999).
ADD VALUE LABELS MS 3 "" 4 "" 777 "Refused" 999 "Unknown".
CTABLES /TABLE MS[C].
/* 3. Workaround is to assign explicitly entire set of values */.
VALUE LABELS MS 1 "Single" 2 "Married" 777 "Refused" 999 "Unknown".
CTABLES /TABLE MS[C].
Update:
Well, nothing is impossible in the realms of computing. Raynald Levesque outlines a workaround solution here. And Ruben Geert van den Berg provides a python solution on his website also.
This can be done with a Python BEGIN PROGRAM / END PROGRAM block inside SPSS syntax:
DATA LIST FREE / MS (F1.0).
BEGIN DATA
END DATA.
VALUE LABELS MS 1 "Married" 2 "Single" 3 "777" 4 "999".
ADD VALUE LABELS MS 777 "Refused" 999 "Don't know".
BEGIN PROGRAM.
import spss
qst='MS'
values=[3,4]
with spss.DataStep():
    datasetObj=spss.Dataset()
    varObj = datasetObj.varlist[qst]
    valObj=varObj.valueLabels
    print 'Before:',valObj
    for i in values:
        try:
            del valObj[i]
        except:
            continue
    print 'After:',valObj
END PROGRAM.
Output Log:
Before: {1.0: 'Married', 2.0: 'Single', 3.0: '777', 4.0: '999', 999.0: "Don't know", 777.0: 'Refused'}
After: {1.0: 'Married', 2.0: 'Single', 777.0: 'Refused', 999.0: "Don't know"}