Condition-based join in PySpark

Given two dataframes:
A
+---+---+---+
|id1|id2|id3|
+---+---+---+
|11 |22 |aaa|
|12 |23 |bbb|
|13 |34 |L12|
|14 |32 |L22|
+---+---+---+
B
+---+---+----+
|id1|id2|type|
+---+---+----+
| 22|11 |red |
| 23|12 |red |
| 34|L12|blue|
| 32|L22|blue|
+---+---+----+
I'd like to join them as follows:
if B.type == 'red': A.id1 == B.id2
else if B.type == 'blue': (A.id2 == B.id1) & (A.id3 == B.id2)
Thus, in the end I'd have:
+---+---+---+---+---+----+
|id1|id2|id3|id1|id2|type|
+---+---+---+---+---+----+
| 11| 22|aaa| 22| 11| red|
| 12| 23|bbb| 23| 12| red|
| 13| 34|L12| 34|L12|blue|
| 14| 32|L22| 32|L22|blue|
+---+---+---+---+---+----+
But I can only get the above result by building the whole join condition up front,
e.g. join_condition = (when(B.type == 'red', A.id1 == B.id2) ...
I'd like to approach the problem like:
reds = B.filter(B.type == 'red')
blues = B.filter(B.type == 'blue')
and then join them one by one:
a_reds = A.join(reds, A.id1 == reds.id2, 'left')
a_blues = A.join(blues, (A.id2 == blues.id1) & (A.id3 == blues.id2), 'left')
Now, to get a unified table, I'd like to union them, but without the null rows that the left joins introduce.
e.g.:
+---+---+---+----+----+----+
|id1|id2|id3| id1| id2|type|
+---+---+---+----+----+----+
| 14| 32|L22|null|null|null|
| 11| 22|aaa| 22| 11| red|
| 12| 23|bbb| 23| 12| red|
| 13| 34|L12|null|null|null|
| 12| 23|bbb|null|null|null|
| 14| 32|L22| 32| L22|blue|
| 13| 34|L12| 34| L12|blue|
| 11| 22|aaa|null|null|null|
+---+---+---+----+----+----+
Can it be done? If so, how?
Thank you.

You can avoid the null records by not doing a left join (the default inner join keeps only matching rows).
Or you can filter out records where type is null after performing the union.
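For example (a minimal sketch, assuming A and B are the DataFrames from the question):
from pyspark.sql import functions as F
reds = B.filter(B.type == 'red')
blues = B.filter(B.type == 'blue')
# Inner joins (the default) keep only matching rows, so no nulls appear.
a_reds = A.join(reds, A.id1 == reds.id2)
a_blues = A.join(blues, (A.id2 == blues.id1) & (A.id3 == blues.id2))
# union() is positional; both joins emit A's columns followed by B's.
result = a_reds.union(a_blues)
# Or, if you built a_reds/a_blues with how='left', drop unmatched rows instead:
# result = a_reds.union(a_blues).filter(F.col('type').isNotNull())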

You can use a conditional join instead of 2 joins + a union.
# Assuming A and B are the DataFrames defined above.
from pyspark.sql import functions as F
join_cond = (F.when(F.col('type') == 'red', A.id1 == B.id2)
.when(F.col('type') == 'blue', (A.id2 == B.id1) & (A.id3 == B.id2)))
df = A.join(B, join_cond)
Result
+---+---+---+---+---+----+
|id1|id2|id3|id1|id2|type|
+---+---+---+---+---+----+
| 11| 22|aaa| 22| 11| red|
| 12| 23|bbb| 23| 12| red|
| 13| 34|L12| 34|L12|blue|
| 14| 32|L22| 32|L22|blue|
+---+---+---+---+---+----+
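Note that when() without an otherwise() evaluates to NULL for row pairs where neither branch matches, and a NULL join condition is treated as false, so those pairs are simply excluded from the (default inner) join.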

Related

Google Sheets - Return unique id with maximum associated mark

Good day
I'm trying to return the maximum mark for each student. If they fail the training, they can make a new attempt, and my summary sheet should include only unique emails with the highest mark obtained.
Example of data:
|    | A            | B     |
|----|--------------|-------|
| 1  | email        | score |
| 2  | abc#mail.com | 1     |
| 3  | abd#mail.com | 3     |
| 4  | abc#mail.com | 3     |
| 5  | abc#mail.com | 4     |
| 6  | abe#mail.com | 5     |
| 7  | abe#mail.com | 4     |
| 8  | abe#mail.com | 7     |
| 9  | jvr#mail.com | 1     |
| 10 | jvr#mail.com | 7     |
And I would like to return this table:
|   | D            | E     |
|---|--------------|-------|
| 1 | email        | score |
| 2 | abc#mail.com | 4     |
| 3 | abd#mail.com | 3     |
| 4 | abe#mail.com | 7     |
| 5 | jvr#mail.com | 7     |
Code used in COL D2:
=UNIQUE(A2:A,FALSE,FALSE)
Code used in COL E2:
=if(G2<>"", ARRAYFORMULA(VLOOKUP(G2,D2:E,2,false)),"")
Code used in COL E3:
=if(G3<>"", ARRAYFORMULA(VLOOKUP(G3,D3:E,2,false)),"")
Is there any way to optimize this?
In D2 paste
=UNIQUE(A2:A)
In E2 paste this formula and fill it down alongside the results in column D
=MAX(TRANSPOSE(FILTER($B$2:$B, $A$2:$A=D2)))
use:
=SORTN(SORT(A2:B, 2, 0), 9^9, 2, 1, 1)
A2:B range
2 sort by 2nd column
0 in descending order
9^9 return all rows
2 group by
1 first column
1 in ascending order
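If you prefer an SQL-like approach, a single QUERY can also return the grouped maximum directly (a sketch; adjust the ranges to your sheet):
=QUERY(A1:B, "select A, max(B) where A is not null group by A label max(B) 'score'", 1)
The trailing 1 tells QUERY to treat the first row as a header.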

Switch values and headers

I have a Google Sheet with a table:
card1 | card2 | card3 | card4 | card5 | card6 | card7
------+-------+-------+-------+-------+-------+------
set3 | set1 | set1 | set2 | set2 | set4 | set1
set4 | set2 | set3 | set3 | set4 | | set2
| set4 | | | | | set3
| | | | | | set4
How can I switch the values in the table and the values in the header to produce a table like this:
set1 | set2 | set3 | set4
------+-------+-------+-------
card2 | card2 | card1 | card1
card5 | card4 | card3 | card2
| card5 | card4 | card5
| card7 | card7 | card6
| | | card7
use:
=ARRAYFORMULA(TRIM(TRANSPOSE(SPLIT(FLATTEN(QUERY(QUERY(SPLIT(FLATTEN(
IF(A2:G30="",,A2:G30&"¤×"&A1:G1&"¤")), "×"),
"select max(Col2) where Col2 is not null group by Col2 pivot Col1"),,9^9)), "¤"))))

How to combine 2 query results into a single query in Google Sheets

I would like to combine the 2 results into one, displayed in 2 columns.
What I have:
+------------+------------+------------+------------+------------+------------+------------+
| B | C | D | E | F | G | H |
+------------+------------+------------+------------+------------+------------+------------+
| Supplier A | 40ft | 19-0201 | 02/09/2019 | 05/09/2019 | 05/09/2019 | $2,590.60 |
| Supplier B | 20ft | 19-0206 | 04/09/2019 | 06/09/2019 | 07/09/2019 | $7,198.10 |
| Supplier C | 40ft | 19-0208 | 04/09/2019 | 06/09/2019 | 07/09/2019 | $3,673.40 |
| Supplier B | 20ft | 19-0207 | 04/09/2019 | 07/09/2019 | 08/09/2019 | $5,592.20 |
| Supplier C | 20ft | 19-0203 | 06/09/2019 | 05/09/2019 | 06/09/2019 | $863.30 |
| Supplier B | 20ft | 19-0204 | 05/09/2019 | 05/09/2019 | 06/09/2019 | $4,190.20 |
| Supplier D | 28ft | 19-0205 | 05/09/2019 | 07/09/2019 | 08/09/2019 | $1,390.60 |
| Supplier E | 14ft | 19-0209 | 07/09/2019 | 09/09/2019 | 09/09/2019 | $180.30 |
| Supplier B | 10ft | 19-0211 | 08/09/2019 | 08/09/2019 | 09/09/2019 | $12,392.80 |
| Supplier C | 40ft | 19-0210 | 07/09/2019 | 10/09/2019 | 11/09/2019 | $6,591.30 |
| Supplier B | 20ft | 19-0202 | 03/09/2019 | 12/09/2019 | 13/09/2019 | $1,380.50 |
| Supplier F | 14ft | 19-0213 | 09/09/2019 | 12/09/2019 | 12/09/2019 | $4,576.30 |
This is the first query:
=ARRAYFORMULA(TEXTJOIN(CHAR(10),TRUE,SUBSTITUTE(TRIM(TRANSPOSE(QUERY(TRANSPOSE(QUERY(B16:H34,"SELECT B, D, SUM(H) GROUP BY B, D ORDER BY B ASC LABEL SUM(H)'' FORMAT SUM(H) '$##,##0.00' ")),,COLUMNS(QUERY(B16:H34,"SELECT B, D, SUM(H) GROUP BY B, D ORDER BY B ASC LABEL SUM(H)'' FORMAT SUM(H) '$##,##0.00' ")))))," → "," → ")))&CHAR(10)&CHAR(10)&"Total Costing : "&TEXT(SUM(H16:H34),"$0,000.00")
+-----------------------------------+
| Supplier A 19-0201 $2,590.60 |
| Supplier B 19-0202 $1,380.50 |
| Supplier B 19-0204 $4,190.20 |
| Supplier B 19-0206 $7,198.10 |
| Supplier B 19-0207 $5,592.20 |
| Supplier B 19-0211 $12,392.80 |
| Supplier C 19-0203 $863.30 |
| Supplier C 19-0208 $3,673.40 |
| Supplier C 19-0210 $6,591.30 |
| Supplier D 19-0205 $1,390.60 |
| Supplier E 19-0209 $180.30 |
| Supplier F 19-0213 $4,576.30 |
| |
| Total Costing        $50,618.60   |
+-----------------------------------+
And my second query:
={QUERY({B16:H34},"SELECT Col1, SUM(Col7)/"& SUM(H16:H34)&" WHERE Col1 IS NOT NULL GROUP BY Col1 LABEL SUM(Col7)/"& SUM(H16:H34)&"'Scale Of Amount' FORMAT SUM(Col7)/"& SUM(H16:H34)&"'(0.00%)'");"Total Costing Scale","(100%)"}
+---------------+--------------+
| | SUM of Amount|
+---------------+--------------+
| Supplier A | (5.12%) |
| Supplier B | (60.75%) |
| Supplier C | (21.98%) |
| Supplier D | (2.75%) |
| Supplier E | (0.36%) |
| Supplier F | (9.04%) |
| Total Costing | (100.00%)    |
+---------------+--------------+
How do I make it show:
+-------------------------+-------------------------+
| | SUM of Amount |
+-------------------------+-------------------------+
| Supplier A 19-0201 | $2,590.60 (5.12%) |
| Supplier B 19-0202 | $1,380.50 |
| Supplier B 19-0204 | $4,190.20 |
| Supplier B 19-0206 | $7,198.10 |
| Supplier B 19-0207 | $5,592.20 |
| Supplier B 19-0211 | $12,392.80 (60.75%) |
| Supplier C 19-0203 | $863.30 |
| Supplier C 19-0208 | $3,673.40 |
| Supplier C 19-0210 | $6,591.30 (21.98%) |
| Supplier D 19-0205 | $1,390.60 (2.75%) |
| Supplier E 19-0209 | $180.30 (0.36%) |
| Supplier F 19-0213 | $4,576.30 (9.04%) |
| | |
| Total Costing           | $50,618.60 (100.00%)    |
+-------------------------+-------------------------+
This is a formula that can be used:
=arrayformula({query( sort ( unique({B2:B,D2:D,sumif(B2:B&":"&D2:D,"=" &
B2:B&":"&D2:D,H2:H)}),1,true,2,true),"Select * where Col1 is not null"),
iferror(vlookup(transpose(split(join(",",rept("0,",query(filter(B2:B,B2:B<>""),
"Select Count(Col1) group by Col1 label Count(Col1) ''")-1) &
sequence(counta(unique(B2:B)))),",",true,false)),
{sequence(counta(unique(B2:B))),
query( unique({B2:B,sumif(B2:B,"="&B2:B,H2:H)/sum(H2:H)}),
"Select Col1, Col2 where Col1 is not null")},3,false),"");
{"Total","",sum(filter(H2:H,H2:H<>"")),1}})
Update 1:
= arrayformula
(
{
query (sort (unique({B2:B,D2:D,sumif(B2:B&":"&D2:D,"=" & B2:B&":"&D2:D,H2:H)}),1,true,2,true),"Select * where Col1 is not null"),
iferror (
vlookup(transpose(split(join(",",
rept
(
"0,",query(unique(filter({B2:B,B2:B&":"&D2:D},B2:B<>"")),"Select Count(Col1) group by Col1 label Count(Col1) ''")-1
) & sequence (counta(unique(B2:B)))),",",true,false)),
{
sequence(counta(unique(B2:B))),
query (unique ({B2:B,sumif(B2:B,"="&B2:B,H2:H)/sum(H2:H)}),"Select Col1, Col2 where Col1 is not null")
},3,false
),""
) ; {"Total","",sum(filter(H2:H,H2:H<>"")),1}
}
)

Group by using all columns from the left table after a join with replicated names in a PySpark DataFrame

I have a Spark DataFrame obtained by joining two tables. They share the column "name":
valuesA = [('A',1,5),('B',7,12),('C',3,6),('D',4,9)]
TableA = spark.createDataFrame(valuesA,['name','id', 'otherValue']).alias('ta')
valuesB = [('A',1),('A',4),('B',2),('B',8),('E',4)]
TableB = spark.createDataFrame(valuesB,['name','id']).alias('tb')
joined = TableA.join(TableB, TableA.name==TableB.name, 'left')
I would like to do something similar to joined.select('ta.*').show() for a groupBy, but joined.groupBy('ta.*').count() raises an error.
How can I implement something like that without having to explicitly list the columns? joined.groupBy(TableA.columns).count() raises an error because "name" is not unique.
As an alternative, how can I retrieve the columns with their proper aliases from joined?
PS: Doing the join as joined = TableA.join(TableB, ['name'], 'left') is not a useful solution, because tables A and B share column names that are not used in the join condition.
You can always use a list comprehension to get a list of column names for the groupBy:
aliasListTableA = ['ta.' + c for c in TableA.columns]
joined.groupBy(aliasListTableA).count().show()
Output:
+----+---+----------+-----+
|name| id|otherValue|count|
+----+---+----------+-----+
| B| 7| 12| 2|
| D| 4| 9| 1|
| C| 3| 6| 1|
| A| 1| 5| 2|
+----+---+----------+-----+
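Alternatively, since joined.select('ta.*') works, you can project down to TableA's columns first and then group; after the projection the names are unique (a small sketch under the same setup):
ta_only = joined.select('ta.*')
ta_only.groupBy(ta_only.columns).count().show()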
In general I try to avoid aliases, as they hide the origin of a column; renaming with a prefix keeps it visible:
aliasListTableA = ['ta_' + c for c in TableA.columns]
aliasListTableB = ['tb_' + c for c in TableB.columns]
joined = joined.toDF(*(aliasListTableA + aliasListTableB))
joined.show()
Output:
+-------+-----+-------------+-------+-----+
|ta_name|ta_id|ta_otherValue|tb_name|tb_id|
+-------+-----+-------------+-------+-----+
| B| 7| 12| B| 2|
| B| 7| 12| B| 8|
| D| 4| 9| null| null|
| C| 3| 6| null| null|
| A| 1| 5| A| 1|
| A| 1| 5| A| 4|
+-------+-----+-------------+-------+-----+
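With the prefixed names in place, the grouping no longer hits the "name is not unique" problem:
joined.groupBy(aliasListTableA).count().show()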

How do I write queries to compare nodes and edges on different paths?

We are new to Neo4j (and excited!) and I'm trying to apply Cypher to our problem. We have a query that matches paths, but we need to remove paths that involve any nodes or edges traversed on any other path originating from the same node or edge. Here's a test case:
CREATE (a1:A {name: 'a1'})-[ab1:AB {name: 'ab1'}]->(b1:B {name: 'b1'}),
(a1)-[ab2:AB {name: 'ab2'}]->(b2:B {name: 'b2'}),
(a2:A {name: 'a2'})-[ab3:AB {name: 'ab3'}]->(b1),
(a2)-[ab5:AB {name: 'ab5'}]->(b3:B {name: 'b3'}),
(a3:A {name: 'a3'})-[ab4:AB {name: 'ab4'}]->(b2),
(a3)-[ab6:AB {name: 'ab6'}]->(b3),
(a4:A {name: 'a4'})-[ab7:AB {name: 'ab7'}]->(b3);
Formatted for readability:
a1-[ab1]->b1
a1-[ab2]->b2
a2-[ab3]->b1
a2-[ab5]->b3
a3-[ab4]->b2
a3-[ab6]->b3
a4-[ab7]->b3
We want to find these paths: [A, AB, B, AB, A, AB, B, AB, A] (four steps). (Note: we don't care about edge directionality.) Here's my first try (our terminology: i_id = 'initial' and t_id = 'terminal').
MATCH p = (i_id:A)-[ab1:AB]-(b1:B)-[ab2:AB]-(a1:A)-[ab3:AB]-(b2:B)-[ab4:AB]-(t_id:A)
RETURN i_id.name, ab1.name, b1.name, ab2.name, a1.name, ab3.name, b2.name, ab4.name, t_id.name
ORDER BY i_id.name;
The result is reasonable, given Cypher's Uniqueness feature:
+-------------------------------------------------------------------------------------------------+
| i_id.name | ab1.name | b1.name | ab2.name | a1.name | ab3.name | b2.name | ab4.name | t_id.name |
+-------------------------------------------------------------------------------------------------+
| "a1" | "ab1" | "b1" | "ab3" | "a2" | "ab5" | "b3" | "ab6" | "a3" |
| "a1" | "ab1" | "b1" | "ab3" | "a2" | "ab5" | "b3" | "ab7" | "a4" |
| "a1" | "ab2" | "b2" | "ab4" | "a3" | "ab6" | "b3" | "ab5" | "a2" |
| "a1" | "ab2" | "b2" | "ab4" | "a3" | "ab6" | "b3" | "ab7" | "a4" |
| "a2" | "ab3" | "b1" | "ab1" | "a1" | "ab2" | "b2" | "ab4" | "a3" |
| "a2" | "ab5" | "b3" | "ab6" | "a3" | "ab4" | "b2" | "ab2" | "a1" |
| "a3" | "ab4" | "b2" | "ab2" | "a1" | "ab1" | "b1" | "ab3" | "a2" |
| "a3" | "ab6" | "b3" | "ab5" | "a2" | "ab3" | "b1" | "ab1" | "a1" |
| "a4" | "ab7" | "b3" | "ab5" | "a2" | "ab3" | "b1" | "ab1" | "a1" |
| "a4" | "ab7" | "b3" | "ab6" | "a3" | "ab4" | "b2" | "ab2" | "a1" |
+-------------------------------------------------------------------------------------------------+
However, we want additional filtering. Consider WHERE i_id.name = 'a2':
+-------------------------------------------------------------------------------------------------+
| i_id.name | ab1.name | b1.name | ab2.name | a1.name | ab3.name | b2.name | ab4.name | t_id.name |
+-------------------------------------------------------------------------------------------------+
| "a2" | "ab3" | "b1" | "ab1" | "a1" | "ab2" | "b2" | "ab4" | "a3" |
| "a2" | "ab5" | "b3" | "ab6" | "a3" | "ab4" | "b2" | "ab2" | "a1" |
+-------------------------------------------------------------------------------------------------+
Notice how the first path contains ab4.name = "ab4", which is also found on the second path as ab3.name. Conversely, "ab2" is found on the second path as ab4.name and on the first path as ab3.name. In our application we want these two to 'cancel out' so that the query returns no matches for a2.
So finally, my question: How would you approach doing this in Cypher? Multiple queries are OK as long as they execute quickly :-) I'm brand new to Cypher, but some of the things I thought might be useful are (straw-clutching here :-):
comparing paths as collections (something like WHERE ab4.name NOT IN ...?)
labeling/adding properties to items to indicate the i_id and path they're located at?
FOREACH?
UNWIND?
GROUP BY?
We'd like to do as much in Cypher as possible, but if the answer is "You can't do that," then we'll pull the above candidate results into memory and finish processing there. Thanks very much!
So I've worked up a solution using your second suggestion, which is to add properties to the relationships to indicate if they are on two or more pathways.
First, create a traversed property on each AB relationship and set it to 0:
MATCH ()-[ab:AB]-()
SET ab.traversed = 0
Now I'm going to use a2 as the starting node for an example. This query finds all pathways from a2 to another node with label A that are four steps long. The traversed property of each of the relationships is set to the count of times that relationship was encountered on a pathway.
MATCH p = (a2:A {name:'a2'})-[:AB*4]-(:A)
UNWIND RELATIONSHIPS(p) AS r
WITH r, COUNT(*) AS times_traversed
SET r.traversed = times_traversed
RETURN r.name, r.traversed
ORDER BY r.name
And we get the following output:
+--------+-------------+
| r.name | r.traversed |
+--------+-------------+
| "ab1"  |           1 |
| "ab2"  |           2 |
| "ab3"  |           1 |
| "ab4"  |           2 |
| "ab5"  |           1 |
| "ab6"  |           1 |
+--------+-------------+
As you explain in your example, ab2 and ab4 are on both pathways and so their traversed property is 2.
With these properties set on each relationship, you can filter to only those pathways whose sum of traversed properties equals the path length, which is 4 in your case.
MATCH p = (a2:A {name:'a2'})-[:AB*4]-(:A)
WHERE REDUCE(traversal = 0, r IN RELATIONSHIPS(p) | traversal + r.traversed) = LENGTH(p)
RETURN p
This returns no paths, since the sum of the traversed properties is 6 for both pathways, and not the required 4.
But like I said, this is super inelegant and there is probably a better way to do this.
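For what it's worth, the same "every relationship used exactly once across all candidate paths" rule can be expressed without SET by collecting the paths and comparing them against each other (an untested sketch of the idea):
MATCH p = (:A {name: 'a2'})-[:AB*4]-(:A)
WITH collect(p) AS paths
UNWIND paths AS p
WITH p, REDUCE(total = 0, q IN paths |
     total + SIZE([r IN RELATIONSHIPS(p) WHERE r IN RELATIONSHIPS(q)])) AS hits
WHERE hits = LENGTH(p)
RETURN p
Each path overlaps itself in all four relationships, so hits equals 4 exactly when a path shares no relationship with any other candidate; for a2 both paths score 6 and are filtered out, matching the traversed-property result above.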
