I'm new to PySpark, having used pandas for most of my data work. I'm trying to use PySpark's sliding window with the Window function to build samples from my data set. Ideally, I want to slide the window with a gap between rows.
I'm wondering whether either the rangeBetween or the rowsBetween method has a step_size parameter (I couldn't find one in their documentation).
EXAMPLE: with a set of rows as below (assume each row is a date)
A
B
C
D
E
F
G
H
I want to be able to say: choose the first 3 rows, move the window forward by 2 rows, and then choose the next 3 rows,
like: A,B,C ; C,D,E ; E,F,G ; G,H,I etc ...
I tried this:
from pyspark.sql import Window, functions as func
windowSpec = Window.orderBy(func.col("column")).rangeBetween(start, end)
Obviously the above snippet only slides the window logically between start and end; it doesn't give me the flexibility to skip rows in between when they logically satisfy the condition.
Any help is much appreciated. TIA!
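To make the requested sampling pattern concrete, here is a minimal sketch in plain Python (not PySpark). The function name stepped_windows and its size/step parameters are hypothetical, chosen only for illustration, and the sketch assumes that only full windows are wanted.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def stepped_windows(rows: Sequence[T], size: int, step: int) -> List[List[T]]:
    """Return every full window of `size` rows, starting a new window every `step` rows."""
    if size < 1 or step < 1:
        raise ValueError("size and step must be positive")
    last_start = len(rows) - size  # last index at which a full window still fits
    return [list(rows[i:i + size]) for i in range(0, last_start + 1, step)]


rows = list("ABCDEFGH")
samples = stepped_windows(rows, size=3, step=2)
print(samples)  # [['A', 'B', 'C'], ['C', 'D', 'E'], ['E', 'F', 'G']]
```

With the eight rows A to H, the window starting at G is incomplete and is therefore dropped. One possible way to carry the idea over to PySpark, stated here as an assumption rather than a tested recipe, is to number the ordered rows with row_number() and keep only the windows whose starting row number falls on a multiple of the step.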
I am trying to determine how often values appear in a row based on the lead value of the row. Essentially, if "A" is the first value of the row, what percentage of those "A" rows contain the value "B" in the subsequent columns, what percentage contain "C" in subsequent columns, etc.
Below is an example table with the leads and their partners:
Lead  Partner 1  Partner 2
A     B          C
A     C          E
B     A          E
C     B          A
A     D          B
B     C          E
A     B          D
B     E          D
C     D          B
A     E          C
I want to output a table which states what percentage of the time each of the values B-E appears in rows which start with A. In the example above, A is the lead 5 times, and B appears in those A rows 3 times, so the value is 60%.
A Partners:
Value  %
B      60%
C      60%
D      40%
E      40%
Partners will always be unique, i.e. the same value won't appear in both columns 2 and 3 (e.g. no "BEE"). It doesn't matter which column the partner appears in (2 or 3); it only matters whether it appears in either column of a row where A is the lead.
I plan to have multiple "Partner tables" like the solution above, so I can also see how many times A&C-E appear in B-led rows, etc. But once I know how to make one table I can then make the others.
I tried a combination of IF and COUNTIF formulas, basically trying to say:
If A2 contains A, then count the number of times B appears in the subsequent columns and divide it by the number of times A is in the lead.
=IF(A2="A",COUNTIF(B2:C11,"B")/COUNTIF(A2:A11,"A"),0)
This of course gives skewed results, because it counts how many times B appears in all rows, not just the ones which are led by A. I'm having trouble limiting the count of Bs to only the A rows.
Thank you!
You can use this formula (with the lead value in F1 and the partner value to count in F2):
=COUNTIF(FILTER(B:C,A:A = $F$1),F2)/COUNTA(FILTER(A:A,A:A = $F$1))
Or with BYROW for the four rows (or as many as you need):
=BYROW(F2:F5,LAMBDA(each,COUNTIF(FILTER(B:C,A:A = $F$1),each)/COUNTA(FILTER(A:A,A:A = $F$1))))
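As a cross-check of the numbers the formulas should produce, here is a minimal Python sketch of the same filter-then-count idea, using the example table from the question. The names ROWS and partner_percentages are hypothetical and exist only for this illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Example table from the question: (lead, partner 1, partner 2)
ROWS: List[Tuple[str, str, str]] = [
    ("A", "B", "C"), ("A", "C", "E"), ("B", "A", "E"), ("C", "B", "A"),
    ("A", "D", "B"), ("B", "C", "E"), ("A", "B", "D"), ("B", "E", "D"),
    ("C", "D", "B"), ("A", "E", "C"),
]


def partner_percentages(rows: List[Tuple[str, str, str]], lead: str) -> Dict[str, float]:
    """Percentage of `lead`-led rows in which each partner value appears."""
    led = [row for row in rows if row[0] == lead]   # like FILTER(..., A:A = lead)
    if not led:
        return {}
    counts = Counter(partner for row in led for partner in row[1:])  # like COUNTIF over B:C
    return {partner: 100 * n / len(led) for partner, n in sorted(counts.items())}


print(partner_percentages(ROWS, "A"))  # {'B': 60.0, 'C': 60.0, 'D': 40.0, 'E': 40.0}
```

Because partners are unique within a row, counting occurrences across both partner columns is the same as counting rows that contain the value.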
My intention is to convert a single line of data into rows consisting of a specific number of columns in Google Sheets.
For example, starting with the raw data:
   A    B        C        D    E        F
1  id1  attr1-1  attr1-2  id2  attr2-1  attr2-2
And the expected result is:
(by dividing columns by three)
   A    B        C
1  id1  attr1-1  attr1-2
2  id2  attr2-1  attr2-2
I already know that it can be done somewhat manually, like:
=ARRAYFORMULA({A1:C1;D1:F1})
But I have to start over every time the target range is moved OR the subset size needs to change (in the case above it was three)!
So I guess there must be a more graceful way (i.e. a formula that does not require manual updates) to do the same thing, and I suspect ARRAYFORMULA() is the key.
Any help will be appreciated!
I added a new sheet ("Erik Help") where I reduced your manually entered parameters from two to one (leaving only # of columns to be entered in A2).
The formula that reshapes the grid:
=ArrayFormula(IFERROR(VLOOKUP(SEQUENCE(ROUNDUP(COUNTA(7:7)/A2),A2),{SEQUENCE(COUNTA(7:7),1),FLATTEN(FILTER(7:7,7:7<>""))},2,FALSE)))
SEQUENCE is used to shape the grid according to whatever is entered in A2. The number of rows is the count of items in Row 7 divided by the number in A2 (rounded up to the next whole number), and the number of columns is simply whatever number is entered in A2.
Example: If there are 11 items in Row 7 and you want 4 columns, ROUNDUP(11/4) gives 3 rows for the SEQUENCE, alongside your requested 4 columns.
Then, each of those numbers in the grid is VLOOKUP'ed in a virtual array consisting of a vertical SEQUENCE of ordered numbers matching the number of data pieces in Row 7 (in Column 1) and a FLATTENed (vertical) version of the Row-7 data pieces themselves (in Column 2). Matches are filled into the original SEQUENCE grid, while non-matches are left blank by IFERROR.
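The grid-then-lookup steps described above can be sketched in Python as follows. This is only an illustration of the logic, not of how Sheets evaluates the formula; the function name reshape_row and the sample item labels are hypothetical.

```python
import math
from typing import Dict, List


def reshape_row(row: List[str], columns: int) -> List[List[str]]:
    """Reshape one row of cells into a grid with `columns` columns, padding with blanks."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    items = [cell for cell in row if cell != ""]             # FILTER(7:7, 7:7<>"")
    lookup: Dict[int, str] = {i + 1: item for i, item in enumerate(items)}  # number -> item
    n_rows = math.ceil(len(items) / columns)                 # ROUNDUP(COUNTA(7:7)/A2)
    grid = []
    for r in range(n_rows):
        # Cell (r, c) of SEQUENCE(n_rows, columns) holds the number r*columns + c + 1;
        # a number with no matching item becomes a blank, as IFERROR does in the formula.
        grid.append([lookup.get(r * columns + c + 1, "") for c in range(columns)])
    return grid


eleven = [f"item{i}" for i in range(1, 12)]
for line in reshape_row(eleven, 4):
    print(line)
```

With 11 items and 4 columns this yields 3 rows, and the last cell of the third row is blank.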
Though it's a bit messy, I managed to get it done thanks to the SEQUENCE() function.
It constructs a grid from a given number of rows and columns, and that was exactly what I was looking for.
For reference, I set up a sheet with the sample data here:
https://docs.google.com/spreadsheets/d/1p972tYlsPvC6nM39qLNjYRZZWGZYsUnGaA7kXyfJ8F4/edit#gid=0
Use a custom formula
Although you have already solved this, if you do this kind of thing a lot, it could be worth looking into Apps Script and custom formulas.
In this case you could use something like:
function transposeSingleRow(range, size) {
  // guard: a size below 1 would never move the counter past the end of the row
  if (!(size >= 1)) {
    throw new Error("size must be a number >= 1");
  }
  // initialize new range
  let newRange = [];
  // initialize counter to keep track of the position in the row
  let count = 0;
  // loop through the single row (range[0])
  while (count < range[0].length) {
    // add a slice of the original row to the new range
    newRange.push(range[0].slice(count, count + size));
    // increment counter
    count += size;
  }
  return newRange;
}
Which works like this:
The nice thing about this function is that you select the range and then pass a number that says how many elements make up a complete row. So if instead of 3 attributes you had 4, instead of calling:
=transposeSingleRow(A7:L7, 3)
you could do:
=transposeSingleRow(A7:L7, 4)
Additionally, if you want this conversion to be permanent and not dependent on formula recalculation, it would be necessary to run it fully in Apps Script without using formulas.
Here's my problem: I have 2 sheets in my document (let's call them Sheet 1 and Sheet 2). They contain similar data and both look like this (names and values may differ):
Column A, C, D and F contain times (in m:ss).
Column B and E both calculate the time-difference between NameX and NameY and add ">, < or ~ ~" depending on the actual difference (ignore the coloring).
Now here comes my problem: I want to find 3 minima (on Sheet 3).
Minimum 1 is easy, as I can just use this function (it automatically filters out columns B and E):
=MIN('Sheet 1'!A2:F2, 'Sheet 2'!A2:F2)
Minimum 2 and 3 are where I struggle.
Minimum 2: Using the example values, I want to find the minimum of (1:01+1:02), (1:02+1:05), (1:01+1:01) and (1:01+1:02) (+ whatever times are on sheet 2). Result should be 2:02.
Minimum 3: Again, using the example values, I want to find the minimum of (1:01+1:02+1:03), (1:02+1:05+0:30), (1:01+1:01+1:12) and (1:01+1:02+2:02) (+ whatever times are on sheet 2). Result should be 2:37.
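To make the target values concrete, the three minima can be sketched in Python using the example times quoted above for columns A, C, D and F. The names COLUMNS, to_seconds, to_mss and minimum_k are hypothetical, and the sketch assumes every column has at least k entries.

```python
from typing import Dict, List

# Example times (m:ss) for the four time columns, read off the sums quoted above
COLUMNS: Dict[str, List[str]] = {
    "A": ["1:01", "1:02", "1:03"],
    "C": ["1:02", "1:05", "0:30"],
    "D": ["1:01", "1:01", "1:12"],
    "F": ["1:01", "1:02", "2:02"],
}


def to_seconds(time_text: str) -> int:
    """Convert an m:ss string to seconds."""
    minutes, seconds = time_text.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mss(total_seconds: int) -> str:
    """Convert seconds back to an m:ss string."""
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"


def minimum_k(columns: Dict[str, List[str]], k: int) -> str:
    """Smallest sum of the first k times over all columns."""
    sums = [sum(to_seconds(t) for t in times[:k]) for times in columns.values()]
    return to_mss(min(sums))


for k in (1, 2, 3):
    print(k, minimum_k(COLUMNS, k))  # 1 1:01, 2 2:02, 3 2:37
```

Times from further sheets would simply be added as more columns before taking the minimum.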
I am currently using this formula (for minimum 3):
=MIN(
IFERROR(FILTER(IFERROR(ARRAYFORMULA({'Sheet 1'!A2:F2}+{'Sheet 1'!A3:F3}+{'Sheet 1'!A4:F4})),
IFERROR(ARRAYFORMULA({'Sheet 1'!A2:F2}+{'Sheet 1'!A3:F3}+{'Sheet 1'!A4:F4}))<>0)),
IFERROR(FILTER(IFERROR(ARRAYFORMULA({'Sheet 2'!A2:F2}+{'Sheet 2'!A3:F3}+{'Sheet 2'!A4:F4})),
IFERROR(ARRAYFORMULA({'Sheet 2'!A2:F2}+{'Sheet 2'!A3:F3}+{'Sheet 2'!A4:F4}))<>0))
)
Some notes: The inner IFERROR function is needed to filter out the errors that obviously occur when trying to add up columns B and E. The FILTER function filters out columns that are empty (there are none in this example). The outer IFERROR function filters out FILTER functions that return an error when they get no input at all (all columns in a sheet are empty). I want to filter these out since I don't want to get 0:00 as the result.
My problem is this: In my actual sheet I have 11 sheets with 16 rows to add up, and I don't want to extend the formula above into an insane monster that would be many times as long.
So my question is: Is there an easier way to solve this problem for minimum 3 (and therefore 4, 5, 6 ...) that I'm not seeing?
It's a little monstrous, but this might work:
=MIN(FILTER({
MMULT(SEQUENCE(1,ROWS(Sheet1!A2:F),1,0),N(Sheet1!A2:F));
MMULT(SEQUENCE(1,ROWS(Sheet2!A2:F),1,0),N(Sheet2!A2:F));
MMULT(SEQUENCE(1,ROWS(Sheet3!A2:F),1,0),N(Sheet3!A2:F));
MMULT(SEQUENCE(1,ROWS(Sheet4!A2:F),1,0),N(Sheet4!A2:F));
MMULT(SEQUENCE(1,ROWS(Sheet5!A2:F),1,0),N(Sheet5!A2:F));
MMULT(SEQUENCE(1,ROWS(Sheet6!A2:F),1,0),N(Sheet6!A2:F));
MMULT(SEQUENCE(1,ROWS(Sheet7!A2:F),1,0),N(Sheet7!A2:F));
MMULT(SEQUENCE(1,ROWS(Sheet8!A2:F),1,0),N(Sheet8!A2:F));
MMULT(SEQUENCE(1,ROWS(Sheet9!A2:F),1,0),N(Sheet9!A2:F));
MMULT(SEQUENCE(1,ROWS(Sheet10!A2:F),1,0),N(Sheet10!A2:F));
MMULT(SEQUENCE(1,ROWS(Sheet11!A2:F),1,0),N(Sheet11!A2:F))},
{1,0,1,1,0,1}))
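As far as I can tell, each MMULT line multiplies a row of ones by the sheet's values to get one sum per column, N() turns the text cells in columns B and E into 0, and the final {1,0,1,1,0,1} mask keeps only columns A, C, D and F. A rough Python sketch of that mechanism follows. Times are given in seconds, the second sheet's values are made up for illustration, and the names n, column_sums and min_masked_sum are hypothetical.

```python
from typing import List, Union

Cell = Union[int, float, str]
MASK = [1, 0, 1, 1, 0, 1]  # keep columns A, C, D, F; drop B and E


def n(cell: Cell) -> float:
    """Rough stand-in for N(): numbers pass through, text counts as 0."""
    return cell if isinstance(cell, (int, float)) else 0


def column_sums(sheet: List[List[Cell]]) -> List[float]:
    """Row vector of ones times the sheet matrix, i.e. one sum per column."""
    ones = [1] * len(sheet)
    width = len(sheet[0])
    return [sum(ones[r] * n(sheet[r][c]) for r in range(len(sheet))) for c in range(width)]


def min_masked_sum(sheets: List[List[List[Cell]]]) -> float:
    """Minimum column sum over all sheets, looking only at the masked-in columns."""
    kept = [total for sheet in sheets
            for total, keep in zip(column_sums(sheet), MASK) if keep]
    return min(kept)


sheet1 = [[61, "<", 62, 61, "~", 61],
          [62, "<", 65, 61, "<", 62],
          [63, ">", 30, 72, "<", 122]]
sheet2 = [[60, "<", 70, 65, ">", 66],   # made-up values for a second sheet
          [64, "<", 61, 70, "<", 63],
          [59, ">", 58, 40, "<", 80]]
print(min_masked_sum([sheet1, sheet2]))  # 157 seconds, i.e. 2:37
```

Note that the open-ended ranges (A2:F) in the formula add up every populated row, so the row count is controlled by how much data is on each sheet rather than by a parameter.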
I need to repeat a table every x rows.
Here is what I have:
Auto Populate a Table
(1) I want to auto-populate
(2) this table
(3) 'x' times
(4) Every 'y' rows
I've tried OFFSET, SEQUENCE, IMPORTRANGE and ARRAYFORMULA, but I can't find a way to do it!
But I'm sure it's possible to do it.
I do this to help doctors make their schedules.
I tried to think differently (see Solution 2).
Instead of duplicating the table n times,
I auto-populate (x days) * (x workstations) rows:
=ARRAYFORMULA("1"&T(SEQUENCE(DATES!B3*INFOS!K2)))
THEN in another column I add one day every (x workstations) rows:
=DATE(YEAR(DATES!$B$1);MONTH(DATES!$B$1);DAY(DATES!$B$1)+INT((ROW()-2)/INFOS!$K$2))
AND now I have to repeat my list of workstations until the end of the column (but I'm still working out how).
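The row layout described here (one date per block of workstations, with the workstation list repeating) can be sketched in Python with integer division and modulo. This is only an illustration of the indexing, not a Sheets solution; build_schedule and the workstation labels are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple


def build_schedule(start: date, days: int, workstations: List[str]) -> List[Tuple[date, str]]:
    """One row per (day, workstation): the date advances every len(workstations) rows."""
    per_day = len(workstations)
    return [
        (start + timedelta(days=i // per_day),  # same idea as INT((ROW()-2)/K2)
         workstations[i % per_day])             # cycles through the workstation list
        for i in range(days * per_day)
    ]


for day, station in build_schedule(date(2024, 1, 1), 2, ["W1", "W2", "W3"]):
    print(day.isoformat(), station)
```

In the sheet, the same cycling index could presumably be obtained with MOD on the row number in place of INT, but that is an assumption and has not been checked against the actual workbook.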
If I have a simple spreadsheet such as this:
A B C D
1 Hello 30 60
2 World 45 90
...
…where I want column D to simply be column C*2, how could I do that? Obviously, I could manually set the contents of each row's D column to be =C1 *2, =C2 *2, and so on, but if I have hundreds of rows, it'd be easier to have something along the lines of =C$ROW *2 — is that possible?
If you copy/paste the formula down, the spreadsheet app will automatically update indices appropriately. This is the standard way to do this kind of operation in spreadsheet applications; if you want to force a particular row/col ref to not update, prepend a $:
=sum($A$1:$A$9999) // this reference will never change
=$A1 // this will always reference column A but will follow row changes
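To illustrate the copy-down behaviour described above, here is a deliberately simplified Python model of how row numbers shift when a formula is filled down and how a $ in front of the row pins it. It is not how any spreadsheet application is implemented; shift_rows and fill_down are hypothetical names, and the pattern would misread function names that end in digits (such as LOG10).

```python
import re
from typing import List

# optional $, column letters, optional $, row number (e.g. C1, $A1, $A$9999)
_REF = re.compile(r"(\$?)([A-Z]+)(\$?)(\d+)")


def shift_rows(formula: str, offset: int) -> str:
    """Shift every relative row reference by `offset`; rows prefixed with $ stay fixed."""
    def replace(match: "re.Match[str]") -> str:
        col_abs, col, row_abs, row = match.groups()
        new_row = row if row_abs else str(int(row) + offset)
        return f"{col_abs}{col}{row_abs}{new_row}"
    return _REF.sub(replace, formula)


def fill_down(formula: str, count: int) -> List[str]:
    """Formulas for `count` consecutive rows, starting with the original."""
    return [shift_rows(formula, i) for i in range(count)]


print(fill_down("=C1*2", 3))               # ['=C1*2', '=C2*2', '=C3*2']
print(fill_down("=$A1", 2))                # ['=$A1', '=$A2']
print(fill_down("=sum($A$1:$A$9999)", 2))  # unchanged in both rows
```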