PySpark: join multiple DataFrames with one SQL join

Coming from SAS, I want to join multiple DataFrames in one SQL join in PySpark. In SAS that's possible; however, I get the sense that in PySpark it is not. My script looks like this:
A.createOrReplaceTempView("A")
B.createOrReplaceTempView("B")
C.createOrReplaceTempView("C")
D = spark.sql("""select a.*, b.VAR_B, c.VAR_C
from A a left join B b on a.VAR = b.VAR
left join C c on a.VAR = c.VAR""")
Is that possible in PySpark? Thank you!

In PySpark, joins work in a similar way to SQL.
First define a DataFrame for each table, for example:
df_a = spark.sql('select * from a')
df_b = spark.sql('select * from b')
df_c = spark.sql('select * from c')
Then you can do the joins as follows:
df_joined_a = df_a.join(df_b, df_a['VAR'] == df_b['VAR'], 'left')\
    .select(df_a['*'], df_b['VAR'].alias('b_var'))
df_joined_c = df_joined_a.join(df_c, df_joined_a['VAR'] == df_c['VAR'], 'left')\
    .select(df_joined_a['*'], df_c['VAR'].alias('c_var'))
More examples are available here -
https://sparkbyexamples.com/pyspark/pyspark-join-explained-with-examples/
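To the original question: yes, that single multi-table SQL statement works as-is once the temp views are registered, as long as the query is passed as one string (e.g. triple-quoted as above). The same three-way left join can also be chained in one DataFrame expression; a rough sketch reusing the names from the question:
D = (df_a
     .join(df_b, df_a['VAR'] == df_b['VAR'], 'left')
     .join(df_c, df_a['VAR'] == df_c['VAR'], 'left')
     .select(df_a['*'], df_b['VAR_B'], df_c['VAR_C']))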

Related

Skewed dataset join in Spark?

I am joining two big datasets using Spark RDDs. One dataset is very skewed, so a few of the executor tasks take a long time to finish the job. How can I solve this scenario?
Pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Short version (a minimal RDD sketch follows these steps):
Add a random element to the large RDD and create a new join key with it
Add a random element to the small RDD using explode/flatMap to increase the number of entries and create the same new join key
Join the RDDs on the new join key, which will now be distributed better thanks to the random salting
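A minimal sketch of these three steps with plain PySpark pair RDDs (large_rdd and small_rdd are placeholder names for RDDs of (key, value) pairs; N is the salt range):
import random

N = 10  # salt range; pick something proportional to the skew

# 1. large, skewed RDD: append a random salt to every key
salted_large = large_rdd.map(lambda kv: ((kv[0], random.randint(0, N - 1)), kv[1]))

# 2. small RDD: replicate each record N times, once per salt value
salted_small = small_rdd.flatMap(lambda kv: [((kv[0], i), kv[1]) for i in range(N)])

# 3. join on the salted key, then strip the salt from the result keys
joined = salted_large.join(salted_small).map(lambda pair: (pair[0][0], pair[1]))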
Say you have to join two tables A and B on A.id = B.id. Let's assume that table A has skew on id = 1.
i.e. select A.id from A join B on A.id = B.id
There are two basic approaches to solve the skew join issue:
Approach 1:
Break your query/dataset into two parts - one containing only the skewed data and the other containing the non-skewed data.
In the above example, the query will become:
1. select A.id from A join B on A.id = B.id where A.id <> 1;
2. select A.id from A join B on A.id = B.id where A.id = 1 and B.id = 1;
The first query will not have any skew, so all the tasks of ResultStage will finish at roughly the same time.
If we assume that B has only a few rows with B.id = 1, they will fit into memory, so the second query will be converted to a broadcast join. This is also called a map-side join in Hive.
Reference: https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
The partial results of the two queries can then be merged to get the final results.
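A rough PySpark sketch of Approach 1, assuming A and B are available as DataFrames and id = 1 is the skewed key (the broadcast hint reflects the assumption that the matching slice of B is small):
from pyspark.sql import functions as F

# query 1: everything except the skewed key, joined normally
non_skew = A.where(F.col("id") != 1).join(B, "id")

# query 2: only the skewed key, with the small matching slice of B broadcast
skew = A.where(F.col("id") == 1).join(F.broadcast(B.where(F.col("id") == 1)), "id")

# merge the partial results
result = non_skew.unionByName(skew)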
Approach 2:
As also mentioned by LeMuBei above, the second approach tries to randomize the join key by appending an extra column.
Steps:
1. Add a column to the larger table (A), say skewLeft, and populate it with random numbers between 0 and N-1 for all rows.
2. Add a column to the smaller table (B), say skewRight, and replicate the smaller table N times, so that the values in the new skewRight column vary from 0 to N-1 for each copy of the original data. For this you can use the explode sql/dataset operator.
3. After 1 and 2, join the two datasets/tables with the join condition updated to:
A.id = B.id AND A.skewLeft = B.skewRight
Reference: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Depending on the particular kind of skew you're experiencing, there may be different ways to solve it. The basic idea is:
Modify your join column, or create a new join column, that is not skewed but which still retains adequate information to do the join
Do the join on that non-skewed column -- resulting partitions will not be skewed
Following the join, you can update the join column back to your preferred format, or drop it if you created a new column
The "Fighting the Skew In Spark" article referenced in LiMuBei's answer is a good technique if the skewed data participates in the join. In my case, skew was caused by a very large number of null values in the join column. The null values were not participating in the join, but since Spark partitions on the join column, the post-join partitions were very skewed as there was one gigantic partition containing all of the nulls.
I solved it by adding a new column which changed all null values to a well-distributed temporary value, such as "NULL_VALUE_X", where X is replaced by random numbers between say 1 and 10,000, e.g. (in Java):
// Before the join, create a join column with well-distributed temporary values for null swids. This column
// will be dropped after the join. We need to do this so the post-join partitions will be well-distributed,
// and not have a giant partition with all null swids.
String swidWithDistributedNulls = "swid_with_distributed_nulls";
int numNullValues = 10000; // Just use a number that will always be bigger than number of partitions
Column swidWithDistributedNullsCol =
    when(csDataset.col(CS_COL_SWID).isNull(),
        functions.concat(
            functions.lit("NULL_SWID_"),
            functions.round(functions.rand().multiply(numNullValues))))
    .otherwise(csDataset.col(CS_COL_SWID));
csDataset = csDataset.withColumn(swidWithDistributedNulls, swidWithDistributedNullsCol);
Then I joined on this new column, and after the join:
outputDataset = outputDataset.drop(swidWithDistributedNulls);
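For reference, a rough PySpark equivalent of the same idea (the DataFrame and column names here are illustrative, not taken from the Java code above):
from pyspark.sql import functions as F

num_null_values = 10000  # just bigger than the number of partitions

cs_df = cs_df.withColumn(
    "swid_with_distributed_nulls",
    F.when(F.col("swid").isNull(),
           F.concat(F.lit("NULL_SWID_"),
                    F.round(F.rand() * num_null_values)))
     .otherwise(F.col("swid")))

# join on "swid_with_distributed_nulls" instead of "swid", then drop it afterwards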
Taking reference from https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/, below is the code for fighting the skew in Spark using the PySpark DataFrame API.
Creating the two DataFrames:
from math import exp
from random import randint
from datetime import datetime
def count_elements(splitIndex, iterator):
    n = sum(1 for _ in iterator)
    yield (splitIndex, n)

def get_part_index(splitIndex, iterator):
    for it in iterator:
        yield (splitIndex, it)
num_parts = 18
# create the large skewed rdd
skewed_large_rdd = sc.parallelize(range(0,num_parts), num_parts).flatMap(lambda x: range(0, int(exp(x))))
skewed_large_rdd = skewed_large_rdd.mapPartitionsWithIndex(lambda ind, x: get_part_index(ind, x))
skewed_large_df = spark.createDataFrame(skewed_large_rdd,['x','y'])
small_rdd = sc.parallelize(range(0,num_parts), num_parts).map(lambda x: (x, x))
small_df = spark.createDataFrame(small_rdd,['a','b'])
Dividing the data into 100 bins for large df and replicating the small df 100 times
salt_bins = 100
from pyspark.sql import functions as F
skewed_transformed_df = skewed_large_df.withColumn('salt', (F.rand()*salt_bins).cast('int')).cache()
small_transformed_df = small_df.withColumn('replicate', F.array([F.lit(i) for i in range(salt_bins)]))
small_transformed_df = small_transformed_df.select('*', F.explode('replicate').alias('salt')).drop('replicate').cache()
Finally the join avoiding the skew
t0 = datetime.now()
result2 = skewed_transformed_df.join(small_transformed_df, (skewed_transformed_df['x'] == small_transformed_df['a']) & (skewed_transformed_df['salt'] == small_transformed_df['salt']) )
result2.count()
print("The direct join takes %s" % str(datetime.now() - t0))
Apache DataFu has two methods for doing skewed joins that implement some of the suggestions in the previous answers.
The joinSkewed method does salting (adding a random number column to split the skewed values).
The broadcastJoinSkewed method is for when you can divide the dataframe into skewed and regular parts, as described in Approach 1 from the answer by moriarty007.
These methods in DataFu are useful for projects using Spark 2.x. If you are already on Spark 3, Adaptive Query Execution has built-in handling for skewed joins.
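For the Spark 3 route, enabling the built-in handling is just a matter of configuration; a minimal sketch (these are the standard AQE settings; the thresholds shown are illustrative):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# optional: tune what counts as a skewed partition
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")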
Full disclosure - I am a member of Apache DataFu.
You could try to repartition the "skewed" RDD to more partitions, or try to increase spark.sql.shuffle.partitions (which is by default 200).
In your case, I would try to set the number of partitions to be much higher than the number of executors.
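A minimal sketch of both suggestions (the partition count and DataFrame name are placeholders):
# raise shuffle parallelism well above the number of executors (default is 200)
spark.conf.set("spark.sql.shuffle.partitions", "2000")

# or explicitly repartition the skewed side into more partitions before the join
skewed_df = skewed_df.repartition(2000)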

How to perform update in Apache Spark SQL

I have to update a JavaSchemaRDD with some new values, subject to some WHERE conditions.
This is the SQL query which I want to convert into Spark SQL:
UPDATE t1
SET t1.column1 = '0', t1.column2 = 1, t1.column3 = 1
FROM TABLE1 t1
INNER JOIN TABLE2 t2 ON t1.id_column = t2.id_column
WHERE (t2.column1 = 'A') AND (t2.column2 > 0)
Yes, I found the solution myself. I achieved this using Spark Core only; I did not use Spark SQL for it. I have two RDDs (which can also be thought of as tables or datasets), t1 and t2. Looking at my query in the question, I am updating t1 based on one join condition and two WHERE conditions, which means I need three columns (id_column, column1 and column2) from t2. So I collected those columns into three separate collections, then iterated over the first RDD t1 and, during the iteration, applied those three condition checks (one join and two WHERE conditions) using Java "if" statements. The values in the first RDD were then updated based on the outcome of the "if" checks.
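For reference, with the DataFrame API the same UPDATE semantics can be emulated with a join plus when/otherwise. This is a sketch of an alternative to the RDD approach described above, not the original code; the table and column names are taken from the SQL in the question:
from pyspark.sql import functions as F

# rows of t2 that satisfy the WHERE conditions, keyed by id_column
matches = (t2.where((F.col("column1") == "A") & (F.col("column2") > 0))
             .select("id_column").distinct()
             .withColumn("matched", F.lit(True)))

# left join and overwrite the three columns only where a match was found
updated_t1 = (t1.join(matches, "id_column", "left")
                .withColumn("column1", F.when(F.col("matched"), F.lit("0")).otherwise(F.col("column1")))
                .withColumn("column2", F.when(F.col("matched"), F.lit(1)).otherwise(F.col("column2")))
                .withColumn("column3", F.when(F.col("matched"), F.lit(1)).otherwise(F.col("column3")))
                .drop("matched"))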

F# query for join/group/aggregate?

How can I get F# to do the equivalent of
select a.id, avg(case when a.score = b.score then 1.0 else 0.0 end)
from table1 a join table2 b on a.id = b.id and a.date = b.date
group by a.id
in a query expression? I've come up with
query {
    for a in db.table1 do
    join b in db.table2 on ((a.id, a.date) = (b.id, b.date))
    groupBy a.id into g
    select (g.Key, ???) }
but I can't figure out what to insert into "???". To make things worse, the "score" column can be null, which complicates the math.
Alternatively, is there an easier way to do this? I'm not very familiar with .NET database access. Ideally, I'd just give it a block of SQL, it would parse it, and spit back some typed data. As it is, trying to figure out the not-SQL syntax for straightforward SQL is pretty frustrating.
The translation to SQL can generally deal better with C#-style LINQ operations than with native F# functions. So it is easier to go with Select and Average extension methods than with standard F# functions like Seq.map and Seq.average.
I tried writing a similar grouping using a sample Northwind database - I did not find nice tables to join, but the following does basic aggregation with CASE and works fine:
open System.Linq
query {
    for p in db.Products do
    groupBy p.CategoryID.Value into g
    select (g.Key, g.Select(fun a ->
        if a.UnitPrice.Value > 10.0M then 1.0 else 0.0).Average()) }
|> Array.ofSeq
It generates a query that is a bit more complicated, but looks right (and uses CASE on the SQL side). You can see that by setting db.DataContext.Log <- System.Console.Out.
Another thing that should generally work would be to use nested query, but I have not tried that.
It has been some years now, but never too late? Is this of any help? In particular, note the use of FirstOrDefault. Sorry for the Norwegian names, but they're not important. The "x" demonstrates access to the first table.
type Result3 = { Aarsak: int; Beskrivelse: string; Antall: int; Varighet: Nullable<int> }

let query3 = query {
    for stopptid in dc.StoppTider do
    where (stopptid.DatoS = datoS && stopptid.SkiftNr = skiftNr)
    groupBy stopptid.Aarsak into g
    join stoppaarsak in dc.StoppAarsak on (g.FirstOrDefault().Aarsak.ToString() = stoppaarsak.Nr)
    select { Aarsak = g.Key; Beskrivelse = stoppaarsak.Norsk; Antall = g.Count(); Varighet = g.Sum(fun x -> x.Varighet) }
}
I first ended up here when googling. Since it didn't help, I googled equivalent solutions in C#, got a hit on SO, got that to work for my case in C#, then in F#. This is the link:
LINQ: combining join and group by

SQL Join with subquery

I want to create a query that joins 3 tables together (a, b, c), then updates a specific cell in b.1 based on the subtraction of two specific cells in table b (b.2 - b.3).
Any help would be appreciated.
UPDATE b
SET b.c1 = b.c2 - b.c3
FROM a
JOIN b ON ...
JOIN c ON ...

Can a DB2 WITH statement be used as part of an UPDATE or MERGE?

I need to update some rows in a DB2 table. Identifying the rows to be updated involved a series of complicated statements, which I managed to boil down to a series of WITH statements. Now that I have the correct data values, I need to update the table.
Since I managed to get these values with a WITH statement, I was hoping to use it in the UPDATE/MERGE. A simplified example follows:
with data1
(
ID_1
)
as
(
Select ID
from ID_TABLE
where ID > 10
)
,
cmedb.data2
(
MIN_ORIGINAL_ID
,OTHER_ID
)
as
(
Select min(ORIGINAL_ID)
,OTHER_ID
from OTHER_ID_TABLE
where OTHER_ID in
(
Select distinct ID_1
From data1
)
group by OTHER_ID
)
select MIN_ORIGINAL_ID
,OTHER_ID
from cmedb.data2
Now that I have the two columns of data, I want to use them to update a table. So instead of having the select at the bottom, I've tried all sorts of combinations of merges and updates, including putting the WITH statement above the UPDATE/MERGE, or making it part of the UPDATE/MERGE statement. The following is what comes closest in my mind to what I want to do:
merge into ID_TABLE as it
using
(
select MIN_ORIGINAL_ID
,OTHER_ID
from cmedb.data2
) AS SEL
ON
(
it.ID = sel.OTHER_ID
)
when matched then
update
set it.ORIGINAL_ID = sel.MIN_ORIGINAL_ID
So it doesn't work. I'm unsure if this is even possible, as I've found no examples on the internet using WITH statements in combination with UPDATE or MERGE. I do have examples of WITH statements being used in conjunction with INSERT, so I believe it might be possible.
If anyone can help it would be great, and please let me know if I've left out any information that would be useful to solve the problem.
Disclaimer: The example I've provided is a boiled down version of what I'm trying to do, and may not actually make any sense!
As @Andrew White says, you can't use a common table expression in a MERGE statement.
However, you can eliminate the common table expressions with nested subselects. Here is your example select statement, rewritten using nested subselects:
select min_original_id, other_id
from (
select min(original_id), other_id
from other_id_table
where other_id in (
select distinct id_1 from (select id from id_table where id > 10) AS DATA1 (ID_1)
)
group by other_id
) AS T (MIN_ORIGINAL_ID, OTHER_ID);
This is somewhat convoluted (the exact statement could be written better), but I realize that you were just giving a simplified example.
You may be able to rewrite your MERGE statement using nested subselects instead of common table expressions. It is certainly syntactically possible.
For example:
merge into other_id_table x
using (
select min_original_id, other_id
from (
select min(original_id), other_id
from other_id_table
where other_id in (
select distinct id_1 from (select id from id_table where id > 10) AS DATA1 (ID_1)
)
group by other_id
) AS T (MIN_ORIGINAL_ID, OTHER_ID)
) as y
on y.other_id = x.other_id
when matched
then update set other_id = y.min_original_id;
Again, this is convoluted, but it shows you that it is at least possible.
A way to use a WITH statement with UPDATE (and INSERT too) is to select from the data-change statement, i.e. SELECT ... FROM FINAL TABLE (UPDATE ...):
WITH TEMP_TABLE AS (
SELECT [...]
)
SELECT * FROM FINAL TABLE (
UPDATE TABLE_A SET (COL1, COL2) = (SELECT [...] FROM TEMP_TABLE)
WHERE [...]
);
I'm looking up the grammar now, but I am pretty sure the answer is no, at least not in the version of DB2 I last used. Take a peek at the UPDATE and MERGE doc pages for their syntax. Even if you see the fullselect in the syntax, you can't use WITH there, as that is explicitly separate according to the SELECT doc page.
If you're running DB2 V8 or later, there's an interesting SQL hack here that allows you to UPDATE/INSERT in a query with a WITH statement. For inserts & updates that require a lot of preliminary data prepping, I find this method offers a lot of clarity.
Edit: One correction here - selecting from UPDATE statements was introduced in V9, I believe, so the above will work for inserts on V8 or greater, and for updates on V9 or greater.
Put the CTEs into a view, and select from the view in the merge. You get a clean, readable view that way, and a clean, readable merge.
Another method is to simply substitute your WITH queries and just use subselects.
For example, if you had the following (I tried to include a somewhat complex example with some WHERE logic, an aggregate function (MAX) and a GROUP BY, just to make it more real-world):
WITH
Q1 AS (
SELECT
A.X,
A.Y,
A.Z,
MAX(A.W) AS W
FROM
TABLEB B
INNER JOIN TABLEA A ON B.X = A.X AND B.Y = A.Y AND B.Z = A.Z
WHERE A.W <= DATE('2013-01-01')
GROUP BY
A.X,
A.Y,
A.Z
),
Q2 AS (
SELECT
A.X,
A.Y,
A.Z,
A.W,
MAX(A.V) AS V
FROM
Q1
INNER JOIN TABLEA A ON Q1.X = A.X AND Q1.Y = A.Y AND Q1.Z = A.Z AND Q1.W = A.W
GROUP BY
A.X,
A.Y,
A.Z,
A.W
)
SELECT
B.U,
A.T
FROM
Q2
INNER JOIN TABLEA A ON Q2.X = A.X AND Q2.Y = A.Y AND Q2.Z = A.Z AND Q2.W = A.W AND Q2.V = A.V
RIGHT OUTER JOIN TABLEB B ON Q2.X = B.X AND Q2.Y = B.Y AND Q2.Z = B.Z
... you could turn this into something appropriate for a MERGE INTO by doing the following:
1. Remove the WITH at the top.
2. Remove the comma from the end of the Q1 block (after the closing parenthesis).
3. Take the Q1 AS from before the opening parenthesis, put it after the closing parenthesis (minus the comma), and put the AS in front of the Q1, i.e. AS Q1.
4. Cut this new Q1 block and paste it into the Q2 block in place of the Q1 after FROM (replacing Q1 with the query in your clipboard). NOTE: leave the other references to Q1 (in the inner join keys) alone, of course.
5. Now you have a bigger Q2 query. Do steps 3 and 4 again, this time replacing the Q2 (after the FROM) in your main select with the bigger Q2 query in your clipboard.
In the end, you'll have a straight SELECT query that looks like this (reformatted to show proper indentation):
SELECT
    B.U,
    A.T
FROM
    (SELECT
        A.X,
        A.Y,
        A.Z,
        A.W,
        MAX(A.V) AS V
    FROM
        (SELECT
            A.X,
            A.Y,
            A.Z,
            MAX(A.W) AS W
        FROM
            TABLEB B
            INNER JOIN TABLEA A ON B.X = A.X AND B.Y = A.Y AND B.Z = A.Z
        WHERE A.W <= DATE('2013-01-01')
        GROUP BY
            A.X,
            A.Y,
            A.Z) AS Q1
        INNER JOIN TABLEA A ON Q1.X = A.X AND Q1.Y = A.Y AND Q1.Z = A.Z AND Q1.W = A.W
    GROUP BY
        A.X,
        A.Y,
        A.Z,
        A.W) AS Q2
    INNER JOIN TABLEA A ON Q2.X = A.X AND Q2.Y = A.Y AND Q2.Z = A.Z AND Q2.W = A.W AND Q2.V = A.V
    RIGHT OUTER JOIN TABLEB B ON Q2.X = B.X AND Q2.Y = B.Y AND Q2.Z = B.Z
I have done this in my own personal experience (just now actually) and it works perfectly.
Good luck.
