Pyspark substring with values from another table - join

PySpark: I have two DataFrames. The first has a single column containing a long string. The second is a lookup DataFrame holding the start and end positions of substrings. I'd like to use the second DataFrame to split up the first, producing a DataFrame with the original data plus the extracted substrings:
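Dataframe A:
+------------+
| Data       |
+------------+
| 000 456 9b |
| 876 998 1c |
+------------+
Dataframe B:
+-------------+-------+-----+--------+
| Description | Start | End | Length |
+-------------+-------+-----+--------+
| City        | 1     | 3   | 3      |
| Country     | 5     | 7   | 3      |
| IheartSpark | 9     | 10  | 2      |
+-------------+-------+-----+--------+
The result would be this:
+------------+------+---------+-------------+
| Data       | City | Country | IheartSpark |
+------------+------+---------+-------------+
| 000 456 9b | 000  | 456     | 9b          |
| 876 998 1c | 876  | 998     | 1c          |
+------------+------+---------+-------------+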
Dataframe B has only about 30 rows, so I was thinking of broadcasting it if possible (this will run on a cluster).
Any thoughts?

Try crossJoin and pivot to get the desired output.
Example:
df.show()
#+----------+
#|      Data|
#+----------+
#|000 456 9b|
#|876 998 1c|
#+----------+
df1.show()
#+-----------+-----+---+------+
#|      Descr|start|end|length|
#+-----------+-----+---+------+
#|       City|    1|  3|     3|
#|    Country|    5|  7|     3|
#|IheartSpark|    9| 10|     2|
#+-----------+-----+---+------+
from pyspark.sql.functions import broadcast, col, expr, first
# Pair every Data row with every lookup row (broadcast marks the small
# lookup table for broadcasting), cut out each substring (SQL substring
# is 1-based), then pivot the descriptions into columns.
df.crossJoin(broadcast(df1)).\
    withColumn("nn", expr("substring(Data, start, length)")).\
    groupBy("Data").\
    pivot("Descr").\
    agg(first(col("nn"))).\
    show()
#+----------+----+-------+-----------+
#|      Data|City|Country|IheartSpark|
#+----------+----+-------+-----------+
#|000 456 9b| 000|    456|         9b|
#|876 998 1c| 876|    998|         1c|
#+----------+----+-------+-----------+
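To see what the cross join, substring, and pivot compute together, here is a minimal plain-Python sketch that needs no Spark. The names `sql_substring` and `split_rows` are made up for illustration, and the sketch assumes every start position is at least 1, as in the lookup table above.

```python
# Plain-Python sketch of the cross join + substring + pivot logic.
# Sample values are copied from the question; function names are hypothetical.
data_rows = ["000 456 9b", "876 998 1c"]
lookup = [
    ("City", 1, 3),         # (Descr, start, length); start is 1-based
    ("Country", 5, 3),
    ("IheartSpark", 9, 2),
]


def sql_substring(text, start, length):
    """Mimic a 1-based substring(text, start, length) for start >= 1."""
    return text[start - 1:start - 1 + length]


def split_rows(rows, spec):
    """Build one record per data row with one field per lookup entry."""
    result = []
    for data in rows:
        record = {"Data": data}
        for descr, start, length in spec:
            record[descr] = sql_substring(data, start, length)
        result.append(record)
    return result


table = split_rows(data_rows, lookup)
for record in table:
    print(record)
```

The inner loop over `lookup` plays the role of the cross join, and collecting the fields into one record per `Data` value plays the role of the pivot.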

Related

Google sheets COUNTIF excluding hidden rows

In Google Sheets, I have a list of strings (one per row), where each string is split with one character per column, so my sheet looks something like this:
  | A | B | C | D | E | F
--+---+---+---+---+---+---
1 | F | R | A | N | K |
2 | P | A | S | S | 1 | 2
I then have this sheet filtered so that I can select, for example, only the rows where the first character is F. On another sheet in the same workbook, I have a table of how often each character appears in each column, which looks something like this:
  | A    | B       | C   | D   | E   | F
--+------+---------+-----+-----+-----+---
1 | Char | Overall | 1   | 2   | 3   |
2 | A    | 979     | 141 | 304 | 165 |
3 | B    | 281     | 173 | 69  | 15  |
I would like to have this table dynamically update, so that when I filter the first sheet my table shows the frequency only for the strings that meet the filter.
In Excel, this can be accomplished using a combination of SUMPRODUCT and SUBTOTAL, but this doesn't work in Google Sheets. I've seen this done in Sheets using helper columns, but I would like the solution to work for an arbitrary number of strings with different lengths without having to change the sheet. Can this be done in Google Sheets?
Thanks!
Hidden cells are treated as having the value 0. One way to solve this is to add a "helper" column in column A and set all the values in it to 1.
| A | B | C | D | E | F | G
--+--------+------+---+---------+-----+-----+-----
1 | Helper | Char | | Overall | 1 | 2 | 3
--+--------+------+---+---------+-----+-----+-----
2 | 1 | A | | 979 | 141 | 304 | 165
3 | 1 | B | | 281 | 173 | 69 | 15
Now instead of using COUNTIF, use the COUNTIFS formula where the second condition A2:A = 1. For example:
=COUNTIFS([YOUR_CONDITION], A2:A,"=1")
The column A values of hidden rows evaluate to 0, so those rows are not counted.

Google Sheets: Compare each cell of a column separately and check another cell in the found row for conditional formatting

Hello all Sheet users out there.
I have a sheet with a list of resources with their production and usage being calculated on the left side and the overall prod/use being monitored on the right side.
    A        B    C        D    | E         F            G            H
1   Input    In   Output   Out  | Resource  totIn        totOut       effective
2   Iron     20   FeIngot  30   | Iron      30           =SUMIF(...)  =totIn-totOut
3   Copper   20   CuIngot  20   | Copper    25           =SUMIF(...)  =totIn-totOut
4   Stone    10   Gravel   50   | CuIngot   =SUMIF(...)  =SUMIF(...)  =totIn-totOut
5   FeIngot  10   FePlate  5    | FeIngot   =SUMIF(...)  =SUMIF(...)  =totIn-totOut
6   CuIngot  25   Wire     75   | Stone     45           =SUMIF(...)  =totIn-totOut
7   CuIngot  10   Cable    20   | Gravel    =SUMIF(...)  =SUMIF(...)  =totIn-totOut
The actual sheet would look more like this:
    A        B    C        D    | E         F      G       H
1   Input    In   Output   Out  | Resource  totIn  totOut  effective
2   Iron     20   FeIngot  30   | Iron      30     20      10
3   Copper   20   CuIngot  20   | Copper    25     20      5
4   Stone    10   Gravel   50   | CuIngot   20     35      -15
5   FeIngot  10   FePlate  5    | FeIngot   30     10      20
6   CuIngot  25   Wire     75   | Stone     45     10      35
7   CuIngot  10   Cable    20   | Gravel    50     0       50
On the left side, I want to mark all cells in column "In" red that have a negative effective production calculated on the right side. I thought about using the conditional formatting, looping through every text cell in the "Resource" column to find the one that equals the "Input" of the same row the cell I want to check is in and then check if the "effective" value of the "Resource" I found is less than 0. The problem is that I don't know how to loop through the values and store the matching row to check if the H value is negative.
Example 1: B6 is checked. A6 needs to be compared to every cell in E2:E and when there is a match, in this case E4, check if H4 is negative. It is, so there is formatting applied.
Example 2: B3 is checked. A3 needs to be compared to every cell in E2:E and when there is a match, in this case E3, check if H3 is negative. It is not, so there is no formatting applied.
Is there any way that I can apply this formatting in the conditional formatting tool?
Keep in mind that my sheet is much more complex than these examples and it has about 120 resources that can't all be moved in order with the left side because multiple rows can use the same resource as input or output.
Thank you in advance for every ounce of your help.
Try this custom formula in conditional formatting: =VLOOKUP($A1,$E:$H,4,FALSE)<0
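The lookup the formula performs for each row can be sketched in Python, using the sample values from the question. The function name `should_highlight` is made up for illustration, and treating a resource that is not found as "no formatting" is an assumption of this sketch.

```python
# Left side: (Input, In) pairs from columns A:B, rows 2-7 of the sample sheet.
inputs = [("Iron", 20), ("Copper", 20), ("Stone", 10),
          ("FeIngot", 10), ("CuIngot", 25), ("CuIngot", 10)]

# Right side: Resource (column E) mapped to effective (column H).
effective = {"Iron": 10, "Copper": 5, "CuIngot": -15,
             "FeIngot": 20, "Stone": 35, "Gravel": 50}


def should_highlight(resource, effective_by_resource):
    """True when the resource is found and its effective value is negative."""
    value = effective_by_resource.get(resource)
    return value is not None and value < 0


# One flag per left-side row: should the "In" cell be marked red?
flags = [should_highlight(name, effective) for name, _ in inputs]
print(flags)
```

Both CuIngot rows are flagged because they share the same entry on the right side, which is why the resources do not need to be in the same order on both sides.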

Duplicates in Google Sheets

I am using Google Sheets and trying to concatenate multiple Column A values into Column C whenever Column B has duplicates:
Sample data:
    Column A   Column B      Column C
1   1247       Santa Fe      1250/1150
2   1250       Santa Fe      1247/1150
3   1258       North Shore   1354
4   1341       Hogan         1255
5   1255       Hogan         1341
6   1354       North Shore   1258
7   1150       Santa Fe      1247/1250
Here, Column C needs to hold the concatenated Column A values of the other rows that share the same Column B value.
C1:
=JOIN("/",FILTER($A$1:$A$7,$B$1:$B$7=B1,ROW($B$1:$B$7)<>ROW(B1)))
Drag fill down.
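The same filter-and-join logic can be sketched in Python with the sample data from the question. The function name `join_duplicates` is made up for illustration; a row with no duplicate yields an empty string in this sketch.

```python
# (Column A, Column B) pairs from the sample data, in sheet order.
rows = [(1247, "Santa Fe"), (1250, "Santa Fe"), (1258, "North Shore"),
        (1341, "Hogan"), (1255, "Hogan"), (1354, "North Shore"),
        (1150, "Santa Fe")]


def join_duplicates(rows, sep="/"):
    """For each row, join the Column A values of the OTHER rows sharing its Column B value."""
    result = []
    for i, (_, key) in enumerate(rows):
        others = [str(value) for j, (value, other_key) in enumerate(rows)
                  if other_key == key and j != i]  # same key, different row
        result.append(sep.join(others))
    return result


column_c = join_duplicates(rows)
for (value, key), joined in zip(rows, column_c):
    print(value, key, joined)
```

The `other_key == key` test corresponds to the `$B$1:$B$7=B1` condition, and `j != i` corresponds to excluding the current row with the `ROW(...)` comparison.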

How can I sort my results based on an order set by the user?

In my database I have many columns, which can be summarized as:
Code input
Amount 1
Amount 2
Code Phase
Code Sector
Code Group
Here is an example of the rows I have:
+--------------+----------+----------+-------+--------+-------+
| code_input   | amount_1 | amount_2 | phase | sector | group |
+--------------+----------+----------+-------+--------+-------+
| 0171090150   | 22       | 14       | 09    | 90     | 10    |
| 0258212527   | 12       | 99       | 08    | 30     | 20    |
| 0359700504   | 30       | 10       | 09    | 20     | 20    |
+--------------+----------+----------+-------+--------+-------+
The user has a screen where he can decide which level goes first, second, third, and fourth. For example, he can put code_phase first, code_sector second, code_group third, and code_input last, or he can change that order (code_sector first, code_phase second, etc.).
In my database, the amounts are recorded per input. Therefore, if a sector includes two inputs, the sector's total is the sum of those two inputs.
Example of result with one order:
# => Order: Phase, Sector, Group, Input
- Phase 09 52 24
  - Sector 90 22 14
    - Group 10 22 14
      - Input 0171090150 22 14
  - Sector 20 30 10
    - Group 20 30 10
      - Input 0359700504 30 10
- Phase 08 12 99
  - Sector 30 12 99
    - Group 20 12 99
      - Input 0258212527 12 99
I use Ruby on Rails and have 3,579 lines of code covering all the combinations users can choose. But the code is unmaintainable and cumbersome, and sometimes I get confused myself about where a mistake may be.
So I wonder whether anyone knows of a gem that could help; it need not handle the whole ordering, but anything that greatly simplifies my code would help. Alternatively, can you recommend a method or algorithm for producing this ordering?
Sorry for my English.

Pandas stack unstack pivot hierarchical index - reshape dataframe

I have massaged a dataframe so it looks like this:
123
456
789
0AB
CDE
FGH
...
,,,
I would like to transform it, so it looks like this:
123789CDE...
4560ABFGH,,,
The pattern is this:
123 789 CDE ...
456 0AB FGH ,,,
That is, I take two rows and concatenate the next two rows, etc, so I get a wide dataframe.
But my real dataframe is not three columns, it is maybe 50 columns, and maybe 100,000 rows, so my dataframe is 100,000 x 50 big. I want to take 100 rows, and concatenate the next 100 rows, etc so I get a wide dataframe with dimension 100 x (50 * 100,000/100) = 100 x 50,000.
Can Pandas do this? My aim is to do some calculations on each of these 100 rows. Or is hierarchical indexing better?
shell [33]>>> df
[33]>>>
     0
0  123
1  456
2  789
3  0AB
4  CDE
5  FGH
6  ...
7  ,,,
shell [34]>>> pd.DataFrame(df.values.reshape(4, 2)).sum()
[34]>>>
0    123789CDE...
1    4560ABFGH,,,
dtype: object
Another approach is using groupby.
shell [35]>>> df['group'] = 0
shell [36]>>> df.loc[1::2, 'group'] = 1  # .loc avoids chained assignment
shell [37]>>> grouped = df.groupby('group')
shell [38]>>> grouped.sum()
[38]>>>
                  0
group
0      123789CDE...
1      4560ABFGH,,,
It may be worth not creating a new frame and instead working directly on the groups, especially with multiple columns and a huge number of rows.
shell [39]>>> for key, group in grouped:
   ....:     print(key)
   ....:     print(group)
   ....:
0
     0  group
0  123      0
2  789      0
4  CDE      0
6  ...      0
1
     0  group
1  456      1
3  0AB      1
5  FGH      1
7  ,,,      1
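For the multi-column case described in the question (chunks of 100 rows laid side by side), a NumPy reshape is one possible route. This is a minimal sketch rather than a tuned solution: the function name `widen` is made up, it assumes the row count is an exact multiple of the chunk size, and it discards the original column labels.

```python
import numpy as np
import pandas as pd


def widen(df, chunk_rows):
    """Lay consecutive chunks of `chunk_rows` rows side by side."""
    n_rows, n_cols = df.shape
    if n_rows % chunk_rows:
        raise ValueError("row count must be a multiple of chunk_rows")
    n_chunks = n_rows // chunk_rows
    values = df.to_numpy()
    # (chunk, row-in-chunk, col) -> (row-in-chunk, chunk, col) -> 2-D wide array
    wide = (values.reshape(n_chunks, chunk_rows, n_cols)
                  .transpose(1, 0, 2)
                  .reshape(chunk_rows, n_chunks * n_cols))
    return pd.DataFrame(wide)


small = pd.DataFrame(np.arange(12).reshape(6, 2))  # 6 rows x 2 columns
wide = widen(small, chunk_rows=2)                  # 2 rows x 6 columns
print(wide)
```

With a 100,000 x 50 frame, `widen(df, chunk_rows=100)` would give the 100 x 50,000 shape the question asks for.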
