Skewed dataset join in Spark? - join

I am joining two big datasets using Spark RDD. One dataset is very much skewed so few of the executor tasks taking a long time to finish the job. How can I solve this scenario?

Pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
Short version:
Add random element to large RDD and create new join key with it
Add random element to small RDD using explode/flatMap to increase number of entries and create new join key
Join RDDs on new join key which will now be distributed better due to random seeding

Say you have to join two tables A and B on A.id=B.id. Lets assume that table A has skew on id=1.
i.e. select A.id from A join B on A.id = B.id
There are two basic approaches to solve the skew join issue:
Approach 1:
Break your query/dataset into 2 parts - one containing only skew and the other containing non skewed data.
In the above example. query will become -
1. select A.id from A join B on A.id = B.id where A.id <> 1;
2. select A.id from A join B on A.id = B.id where A.id = 1 and B.id = 1;
The first query will not have any skew, so all the tasks of ResultStage will finish at roughly the same time.
If we assume that B has only few rows with B.id = 1, then it will fit into memory. So Second query will be converted to a broadcast join. This is also called Map-side join in Hive.
Reference: https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
The partial results of the two queries can then be merged to get the final results.
Approach 2:
Also mentioned by LeMuBei above, the 2nd approach tries to randomize the join key by appending extra column.
Steps:
Add a column in the larger table (A), say skewLeft and populate it with random numbers between 0 to N-1 for all the rows.
Add a column in the smaller table (B), say skewRight. Replicate the smaller table N times. So values in new skewRight column will vary from 0 to N-1 for each copy of original data. For this, you can use the explode sql/dataset operator.
After 1 and 2, join the 2 datasets/tables with join condition updated to-
*A.id = B.id && A.skewLeft = B.skewRight*
Reference: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

Depending on the particular kind of skew you're experiencing, there may be different ways to solve it. The basic idea is:
Modify your join column, or create a new join column, that is not skewed but which still retains adequate information to do the join
Do the join on that non-skewed column -- resulting partitions will not be skewed
Following the join, you can update the join column back to your preferred format, or drop it if you created a new column
The "Fighting the Skew In Spark" article referenced in LiMuBei's answer is a good technique if the skewed data participates in the join. In my case, skew was caused by a very large number of null values in the join column. The null values were not participating in the join, but since Spark partitions on the join column, the post-join partitions were very skewed as there was one gigantic partition containing all of the nulls.
I solved it by adding a new column which changed all null values to a well-distributed temporary value, such as "NULL_VALUE_X", where X is replaced by random numbers between say 1 and 10,000, e.g. (in Java):
// Before the join, create a join column with well-distributed temporary values for null swids. This column
// will be dropped after the join. We need to do this so the post-join partitions will be well-distributed,
// and not have a giant partition with all null swids.
String swidWithDistributedNulls = "swid_with_distributed_nulls";
int numNullValues = 10000; // Just use a number that will always be bigger than number of partitions
Column swidWithDistributedNullsCol =
when(csDataset.col(CS_COL_SWID).isNull(), functions.concat(
functions.lit("NULL_SWID_"),
functions.round(functions.rand().multiply(numNullValues)))
)
.otherwise(csDataset.col(CS_COL_SWID));
csDataset = csDataset.withColumn(swidWithDistributedNulls, swidWithDistributedNullsCol);
Then joining on this new column, and then after the join:
outputDataset.drop(swidWithDistributedNullsCol);

Taking reference from https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/
below is the code for fighting the skew in spark using Pyspark dataframe API
Creating the 2 dataframes:
from math import exp
from random import randint
from datetime import datetime
def count_elements(splitIndex, iterator):
n = sum(1 for _ in iterator)
yield (splitIndex, n)
def get_part_index(splitIndex, iterator):
for it in iterator:
yield (splitIndex, it)
num_parts = 18
# create the large skewed rdd
skewed_large_rdd = sc.parallelize(range(0,num_parts), num_parts).flatMap(lambda x: range(0, int(exp(x))))
skewed_large_rdd = skewed_large_rdd.mapPartitionsWithIndex(lambda ind, x: get_part_index(ind, x))
skewed_large_df = spark.createDataFrame(skewed_large_rdd,['x','y'])
small_rdd = sc.parallelize(range(0,num_parts), num_parts).map(lambda x: (x, x))
small_df = spark.createDataFrame(small_rdd,['a','b'])
Dividing the data into 100 bins for large df and replicating the small df 100 times
salt_bins = 100
from pyspark.sql import functions as F
skewed_transformed_df = skewed_large_df.withColumn('salt', (F.rand()*salt_bins).cast('int')).cache()
small_transformed_df = small_df.withColumn('replicate', F.array([F.lit(i) for i in range(salt_bins)]))
small_transformed_df = small_transformed_df.select('*', F.explode('replicate').alias('salt')).drop('replicate').cache()
Finally the join avoiding the skew
t0 = datetime.now()
result2 = skewed_transformed_df.join(small_transformed_df, (skewed_transformed_df['x'] == small_transformed_df['a']) & (skewed_transformed_df['salt'] == small_transformed_df['salt']) )
result2.count()
print "The direct join takes %s"%(str(datetime.now() - t0))

Apache DataFu has two methods for doing skewed joins that implement some of the suggestions in the previous answers.
The joinSkewed method does salting (adding a random number column to split the skewed values).
The broadcastJoinSkewed method is for when you can divide the dataframe into skewed and regular parts, as described in Approach 2 from the answer by moriarty007.
These methods in DataFu are useful for projects using Spark 2.x. If you are already on Spark 3, there are dedicated methods for doing skewed joins.
Full disclosure - I am a member of Apache DataFu.

You could try to repartition the "skewed" RDD to more partitions, or try to increase spark.sql.shuffle.partitions (which is by default 200).
In your case, I would try to set the number of partitions to be much higher than the number of executors.

Related

proc sql inner join behavior and required select statements

I recently started using SAS, only receiving a basic training that didn't cover proc sql. I'd like to read up a bit more on SAS sql when I have the time.
For now, I found a solution to what I wanted to do, but I'm having difficulties understanding what is happening.
My issue started when I wanted to find out which subjects in my dataset have a certain value for all their records. I made use of my previously written snippet of code that I thought I understood. I just tried adding a couple more variables and group by statements:
data have;
input subject:$1. myvar:1. mycount:1.;
datalines;
a 1 1
a 0 2
a 0 3
b 1 1
b 0 2
b 1 3
c 1 1
c 1 2 /*This subject has myvar = 1 for all its observations*/
;
run;
*find subjects;
proc sql;
create table want as
/* select*/
/* distinct x.subject */
/* from */
(select distinct subject, count(myvar) as myvar_c
from have where myvar = 1 group by subject) x,
(select distinct subject, max(mycount) as max_c
from have group by subject) y
where x.subject = y.subject and x.myvar_c = y.max_c;
quit;
When removing the commented 'select distinct x.subject from' in the create table statement, the above code works as should.
However, I've previously also created another piece of code, to select all subjects in my dataset that have two types of records:
data have2;
input subject:$1. mytype:1.;
datalines;
a 1
a 0
a 0
b 1
b 0
b 1
c 1
c 1 /*This subject doesn't have two types of records in all its observations*/
;
run;
*Find subjects;
proc sql;
create table want2 as select
distinct x.subject from
have2 x,
(select distinct subject, count(distinct mytype) as mytype_c from have2 group by subject) y
where y.mytype_c = 2 and x.subject = y.subject;
quit;
Which is similar, but didn't require the additional select statement. The first code has 3 select statements, the second code only requires two select statements.
Can someone inform me why this is exactly required?
Or link me some good documentation that lists the specifications of these types of joins - can anyone also inform me of the specific name of this type of join where you only use a comma?
while I'm writing, also see that could've used my code I initially wrote to find subjects that have only 1 type of record and tweak it for my current issue >.< but still would like to know what is happening in the first example.
The SQL join construct
FROM ONE, TWO, THREE, …
is known as a CROSS JOIN and is a join without criteria. The comma (,) syntax is less prevalent today and the following construct is recommended
FROM ONE
CROSS JOIN TWO
CROSS JOIN THREE
The result set is a cartesian product and the number of rows is the product of the number of rows in the cross joined tables.
When the query has criteria (WHERE clause) the join is an INNER JOIN.
The SAS documentation for Proc SQL is a good starting point and includes examples.
joined-table Component
Joins a table with itself or with other tables or views.
…
Table of Contents
Syntax
Required Arguments
Optional Argument
Details
Types of Joins
Joining Tables
Table Limit
Specifying the Rows to Be Returned
Table Aliases
Joining a Table with Itself
Inner Joins
Outer Joins
Cross Joins
Union Joins
Natural Joins
Joining More Than Two Tables
Comparison of Joins and Subqueries
General tip:
If you want to fool around (fiddle) with SQL queries in a browser, try visiting
SQL Fiddle web site.

How to join two RDD's with specific column in Pyspark?

How can I Join two RDD with item_id columns
##
RDD1 = spark.createDataFrame([('45QNN', 867),
('45QNN', 867),
('45QNN', 900 )]
, ['id', 'item_id'])
RDD1=RDD1.rdd
RDD2 = spark.createDataFrame([('867',229000,'house',90),
('900',350000,'apartment',120)]
, ['item_id', 'amount','parent','size'])
RDD2=RDD2.rdd
As suggested How do you perform basic joins of two RDD tables in Spark using Python?
I did try but it gets empty data set
innerJoinedRdd = RDD1.join(RDD2)
or
RDD2.join(RDD1, RDD1("item_id")==RDD2("item_id")).take(5)
I need all the columns except parent. Please help?

Can I use another dataframe column to query spark sql

I have two huge tables in Hive. 'table 1' and 'table 2'. Both table has a common column 'key'.
I have queried 'table 1' with the desired conditions and created a DataFrame 'df1'.
Now, I want to query 'table 2' and want to use a column from 'df1' in the where clause.
Here is the code sample:
val df1 = hiveContext.sql("select * from table1 limit 100")
Can I do something like
val df2 = hiveContext.sql("select * from table2 where key = df1.key")
** Note : I don't want to make a single query with joining both tables
Any help will be appreciated.
Since you have explicitly written that you do NOT want to join the tables, then the short answer is "No, you cannot do such a query".
I'm not sure why you don't want to do the join, but it is definitely needed if you want to do the query. If you are worried about joining two "huge tables", then don't be. Spark was build for this kind of thing :)
The solution that I found is the following
Let me first give the dataset size.
Dataset1 - pretty small (10 GB)
Dataset2 - big (500 GB+)
There are two solutions to dataframe joins
Solution 1
If you are using Spark 1.6+, repartition both dataframes by the
column on which join has to be done. When I did it, the join was done
in less than 2 minutes.
df.repartition(df("key"))
Solution 2
If you are not using Spark 1.6+ (even if using 1.6+), if one
data is small, cache it and use that in broadcast
df_small.cache
df_big.join(broadcast(df_small) , "key"))
This was done in less than a minute.

How to perform update in Apache Spark SQL

I have to update a JavaSchemaRDD with some new values by having some WHERE conditions.
This is the SQL query which I want to convert into Spark SQL:
UPDATE t1
SET t1.column1 = '0', t1.column2 = 1, t1.column3 = 1
FROM TABLE1 t1
INNER JOIN TABLE2 t2 ON t1.id_column = t2.id_column
WHERE (t2.column1 = 'A') AND (t2.column2 > 0)
Yup I got solution my self. I have achieved this using Spark core only, I have not used Spark-Sql for this. I have 2 RDD's (also can be called as tables or datasets) t1 and t2. If we observe my query in the question I am updating t1 based on one join condition and two where conditions. Meaning I need three columns(id_column, column1 and column2) from t2. So I have taken these columns in to 3 individual collections. And then I put an iteration over 1st RDD t1 and during the iteration I have added those three condition statements(1 Join and 2 where conditions) using java "if" conditions. So based on "if" conditions result first RDD values got updated.

Hive Bucketed Map Join

I am facing issue in executing bucketed map join.
I am using hive 0.10.
Table1 is a partitioned table on year,month and day. Each partition data is bucketed by column c1 into 128 buckets. I have almost 100 million records per day.
Table 1
create table1
(
....
....
)
partitioned by (year int,month int,day int)
CLUSTERED BY(c1) INTO 128 BUCKETS;
Table2 is a large lookup table bucketed on column c1. I have 80 million records loaded into 128 buckets.
Table 2
create table2
(
c1
c2
...
)
CLUSTERED BY(c1) INTO 128 BUCKETS;
I have checked the data and it's loaded as per expectation into buckets.
Now, I am trying to enforce bucketed map join.That's where I am stuck.
set hive.auto.convert.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.mapjoin.bucket.cache.size=1000000;
select a.c1 as c1_tb2,a.c2
b.c1,b....
from table2 a
JOIN table1 b
ON (a.c1=b.c1);
I am still not getting bucketed map join. Am I missing something? Even I tried to execute join on only 1 partition. But, still I am getting same result.
Or
Bucketed map join doesn't work partition tables?
Please help.Thanks.
This explanation is for Hive 0.13. AFAICT, bucketed map join doesn't take effect for auto converted map joins. You will need to explicitly call out map join in the syntax like this:
set hive.optimize.bucketmapjoin = true;
explain extended select /* +MAPJOIN(b) */ count(*)
from nation_b1 a
join nation_b2 b on (a.n_regionkey = b.n_regionkey);
Note that only explain extended shows you the flag that indicates if bucket map join is being used or not. Look for this line in the plan.
BucketMapJoin: true
Tables are bucketed in hive to manage/process the portion of data individually. It will make the process easy to manage and efficient in terms of performance.
Lets understand the join when the data is stored in buckets:
Lets say there are two tables user and user_visits and both table data is bucketed using user_id in 4 buckets . It means bucket 1 of user will contain rows with same user ids as that of bucket 1 of user_visits. And if a join is performed on these two tables on user_id columns, if it is possible to send bucket 1 of both tables to same mapper then good amount of optimization can be achieved. This is exactly done in bucketed map join.
Prerequisites for bucket map join:
Tables being joined are bucketized on the join columns,
The number of buckets in one table is a same/multiple of the number of buckets in the other table.
The buckets can be joined with each other, If the tables being joined are bucketized on the join columns. If table A has 4 buckets and table B has 4 buckets, the following join
SELECT /*+ MAPJOIN(b) */ a.key, a.valueFROM a JOIN b ON a.key = b.key
can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 for A will only fetch bucket 1 of B. It is not the default behavior, and is governed by the following parameter
set hive.optimize.bucketmapjoin = true
If the tables being joined are sorted and bucketized on the join columns, and they have the same number of buckets, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper. If both A and B have 4 buckets,
SELECT /*+ MAPJOIN(b) */ a.key, a.valueFROM A a JOIN B b ON a.key = b.key
can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket for B. This is not the default behavior, and the following parameters need to be set:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

Resources