pyspark - Check if a shuffle is caused when joining 2 DataFrames repartitioned on the same column

Spark Version: 2.4.7
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) is set
I have two pyspark dataframes which I want to co-partition so as to avoid shuffling. I repartitioned them with the same number of partitions and the same repartitioning column, and wrote them out as parquet files. Now I'm trying to check whether a shuffle actually happens when I subsequently use these tables.
Stats on initial dataframes
>>> initial_df1.count(), initial_df1.select('id').distinct().count() # distinct rows per id
(1000, 1000)
>>> initial_df2.count(), initial_df2.select('id').distinct().count() # 300 rows per id (25 yrs of data, 1 row for each month)
(300000, 1000)
>>> initial_df1.rdd.getNumPartitions(), initial_df2.rdd.getNumPartitions()
(2, 8)
>>> initial_df1.rdd.glom().map(len).collect()
[500, 500]
>>> initial_df2.rdd.glom().map(len).collect()
[56400, 50700, 45000, 40800, 36300, 30000, 25200, 15600]
Step 1: repartition the initial dataframes with the same number of partitions and the same partitioning column
>>> initial_df1.repartition(4, 'id').write.parquet(df1_write_path)
>>> initial_df2.repartition(4, 'id').write.parquet(df2_write_path)
Step 2: read the previously repartitioned dataframes at a later point of time to perform some computation
>>> repartitioned_df1 = spark.read.parquet(df1_write_path)
>>> repartitioned_df2 = spark.read.parquet(df2_write_path)
Stats on repartitioned dataframes
>>> repartitioned_df1.rdd.getNumPartitions(), repartitioned_df2.rdd.getNumPartitions()
(4, 4)
>>> repartitioned_df1.rdd.glom().map(len).collect()
[279, 255, 238, 228]
>>> repartitioned_df2.rdd.glom().map(len).collect()
[83700, 76500, 71400, 68400]
Step 3: join the repartitioned dataframes read above to perform aggregations and write the aggregated dataframe
>>> agg_df = repartitioned_df1.join(repartitioned_df2, 'id').groupBy('id').agg(F.sum(col1_df1), F.avg(col1_df2))
>>> agg_df.write.parquet(agg_df_write_path)
Tested two scenarios:
(1) repartitioned_df1 and repartitioned_df2 are used in step (3)
(2) initial_df1 and initial_df2 are used in step (3)
Observations:
agg_df.explain() gives the same Spark plan in both cases
Shuffle read and write sizes in the Spark UI show the same set of values for both cases
scenario (1) is at least 2x faster than scenario (2)
Question: How do I check whether a shuffle is actually happening (i.e. the dataframes are hash partitioned again), or whether the initial repartitioning left the two dataframes co-partitioned, so that there is only movement of the partitions already created during the initial write step?
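Not part of the original question, but one concrete way to make the check explicit (a sketch that uses PySpark's internal _jdf accessor, so treat it as a debugging aid rather than a stable API): in Spark 2.4 a shuffle shows up as an "Exchange hashpartitioning(...)" node in the physical plan, so you can render the executed plan as a string and search it.
# A minimal sketch (assumes agg_df from step 3 above).
plan = agg_df._jdf.queryExecution().executedPlan().toString()
print(plan)
if "Exchange hashpartitioning" in plan:
    print("the join/aggregation re-shuffles the data by 'id'")
else:
    print("no hash-partitioning exchange found in the executed plan")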

Related

Python vectorizing a dataframe lookup table

I have two dataframes. One is a lookup table consisting of key/value pairs. The other is my main dataframe. The main dataframe has many more records than the lookup table. I need to construct a 'key' from existing columns in my main dataframe and then lookup a value matching that key in my lookup table. Here they are:
from datetime import datetime
import pandas as pd

lk = pd.DataFrame({'key': ['key10', 'key9'], 'value': [100, 90]})
lk.set_index('key', inplace=True)
date_today = datetime.now()
df = pd.DataFrame({'date1': [date_today, date_today, date_today],
                   'year': [1999, 2001, 2003],
                   'month': [10, 9, 10],
                   'code': [10, 4, 5],
                   'date2': [None, date_today, None],
                   'keyed_value': [0, 0, 0]})
This is how I get a value:
df['constructed'] = "key" + df['month'].astype('str')

def getKeyValue(lk, k):
    return lk.loc[k, 'value']

print(getKeyValue(lk, df['constructed']))
Here are my issues:
1) I don't want to use iteration or apply methods. My actual data is over 2 million rows and 200 columns, and it was really slow (over 2 minutes) with apply. So I opted for an inner join, hence the need to create a new 'constructed' column, which I drop after the join. The join brought execution down to 48 seconds, but there has to be a faster way (I am hoping).
2) How do I vectorize this? I don't know how to even approach it. Is it even possible? I tried this but just got an error:
df['keyed_values'] = getKeyValue(lk, df['constructed'])
Any help or pointers are much appreciated.
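Not part of the original post, but a sketch of one vectorized alternative under the setup above: since lk is indexed by 'key', the constructed keys can be mapped through the lookup Series with no per-row Python calls.
# Vectorized lookup via Series.map (assumes lk is indexed by 'key' as above).
df['constructed'] = "key" + df['month'].astype('str')
df['keyed_value'] = df['constructed'].map(lk['value'])  # NaN where a key has no match
df = df.drop(columns='constructed')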

Loading Cassandra Data into Dask Dataframe

I am trying to load data from a cassandra database into a Dask dataframe. I have tried querying the following with no success:
query="""SELECT * FROM document_table"""
df = man.session.execute(query)
df = dd.DataFrame(list(df))
TypeError Traceback (most recent call last)
<ipython-input-135-021507f6f2ab> in <module>()
----> 1 a = dd.DataFrame(list(df))
TypeError: __init__() missing 3 required positional arguments: 'name', 'meta', and 'divisions'
Does anybody know an easy way to load data directly from Cassandra into Dask? There is too much data to load into pandas first.
Some problems with your code:
the line df = man.session.execute(query) presumably loads the whole data set into memory; Dask is not invoked here and plays no part in this (someone with knowledge of the Cassandra driver can confirm)
list(df) produces a list of the column names of a dataframe and drops all the data
dd.DataFrame, if you read the docs, is not constructed like this
What you probably want to do is a) make a function that returns one partition of the data, b) delay this function and call it with the various values of the partitions, and c) use dd.from_delayed to make the dask dataframe. E.g., assuming the table has a field partfield which handily has possible values 1..6 and a similar number of rows for each partition:
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def part(x):
    session = ...  # construct the Cassandra session here
    q = "SELECT * FROM document_table WHERE partfield={}".format(x)
    rows = session.execute(q)
    return pd.DataFrame(list(rows))  # each delayed call returns one pandas partition

parts = [part(x) for x in range(1, 7)]
df = dd.from_delayed(parts)

Dask groupby and apply: ValueError: Expected axis has 6 elements, new values have 5 elements

I am trying to collapse rows of a dataframe based on a key. My file is big and pandas throws a memory error, so I am currently trying to use dask. I am attaching a snippet of the code here.
def f(x):
    p = x.groupby(id).agg(''.join).reset_index()
    return p

metadf = pd.DataFrame(columns=['c1','p1','pd1','d1'])
df = df.groupby(idname).apply(f, meta=metadf).reset_index().compute()
p has the same structure as metadf, and the shapes of both dataframes are the same.
When I execute this, I get the following error:
"ValueError: Length mismatch: Expected axis has 6 elements, new values have 5 elements"
What am I missing here? Is there any other way to collapse rows based on a key in dask?
The task at hand is to perform the following transformation on a dask dataframe.
Input csv file :
key,c1,c2,c3......,cn
1,car,phone,cat,.....,kite
2,abc,def,hij,.......,pot
1,yes,no,is,.........,hello
2,hello,yes,no,......,help
Output csv file:
key,c1,c2,c3,.......,cn
1,caryes,phoneno,catis,.....,kitehello
2,abchello,defyes,hijno,....,pothelp
In this case meta= corresponds to the output of df.groupby(...).apply(f) and not just to the output of f. Perhaps these differ in some subtle way?
I would address this by first not providing meta= at all. Dask.dataframe will give you a warning asking you to be explicit but things should hopefully progress anyway if it is able to determine the right dtypes and columns by running some sample data through your function.
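Not from the original answer, but a minimal sketch of that suggestion (the column name 'key' and the file names are assumed from the sample above): drop meta= first and let dask infer the output structure from sample data.
import dask.dataframe as dd

df = dd.read_csv("input.csv")  # hypothetical path; columns: key, c1, ..., cn

def f(x):
    # concatenate the string columns of each group into a single row per key
    return x.groupby('key').agg(''.join).reset_index()

# No meta= here: dask warns but infers the output columns/dtypes by running f on sample data.
out = df.groupby('key').apply(f)
out.compute().to_csv("output.csv", index=False)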

Fuzzy join between two large datasets in Spark

I need to do a fuzzy join between two large datasets (say 30 GB each) based on the similarity of two string columns. For example:
Table 1:
Key1 |Value1
-------------
1 |qsdm fkq jmsk fqj msdk
Table 2:
Key2 |Value2
-------------
1 |qsdm fkqj mskf qjm sdk
We aim to calculate the cosine similarity between each row of value1 and each row of value2; then, given a predefined threshold, I can join the two tables.
Keywords: entity resolution, cosine similarity, inverted indices (to optimize the similarity computation), TF-IDF, token weight, words, document (a cell in a value column), dataset.
I use Spark (PySpark) to compute the join. At one point in the process, I have:
an RDD RDD1 of (key1, dict1): key1 is the key of table1, dict1 is a dictionary of words and their weights over the table1 dataset (a weight vector)
an RDD RDD2 of (key2, dict2): key2 is the key of table2, dict2 is a dictionary of words and their weights over the table2 dataset (a weight vector)
an RDD NORM1 of (key1, norm1): key1 is the key of table1, norm1 is a value pre-computed from dict1
an RDD NORM2 of (key2, norm2): key2 is the key of table2, norm2 is a value pre-computed from dict2
Using the inverted-index strategy, I have reduced the number of document-pair (string) similarity computations. The result is an RDD
CommonTokens of ((key1, key2), tokens): key1 is a key in table1, key2 is a key in table2, tokens is the list of tokens common to value1 and value2. For each element in CommonTokens, I compute the cosine similarity to generate ((key1, key2), similarity).
In Spark, I did the following:
collectAsMap on RDD1, NORM1, RDD2, NORM2 to build 4 dictionaries
create a similarity function:
input: (key1, key2, commonTokens)
look up key1 in RDD1 and NORM1, and key2 in RDD2 and NORM2
calculate the cosine
return (key1, key2, similarity)
apply the similarity function defined above to CommonTokens with map
Configuration to submit my job to YARN:
spark-submit --master yarn-client --executor-cores 3 --executor-memory 20G --driver-memory 20G --driver-cores 12 --queue cku --num-executors 6 run/Join.py &
Problems in Spark:
lots of collectAsMap calls ==> overloads the driver ==> deadlock
cannot do an RDD transformation inside another RDD transformation (instead of using collectAsMap, use RDD1, RDD2, NORM1, NORM2 directly to look up key1 and key2 inside the map over CommonTokens)
I tried to "convert" RDD1, RDD2, NORM1, NORM2 to dataframes and use Spark SQL to "select" (look up), but it did not work inside the map either
A bonus question: is my algorithm efficient for my case?
Thanks for any suggestions.
(Sorry for my English; feel free to ask me for further information if my question is not clear.)
You likely want to look into Locality Sensitive Hashing. Fortunately, Spark has already done the work for you. This will reduce the number of computations and give you the Euclidean distance between the two vectors (Euclidean vs cosine). The only real warning I'd give is that you would have to make all the vectors the same length, but it seems it would give you what you want with less work.
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
dataA = [(0, Vectors.dense([1.0, 1.0]),),
         (1, Vectors.dense([1.0, -1.0]),),
         (2, Vectors.dense([-1.0, -1.0]),),
         (3, Vectors.dense([-1.0, 1.0]),)]
dfA = spark.createDataFrame(dataA, ["id", "features"])
dataB = [(4, Vectors.dense([1.0, 0.0]),),
         (5, Vectors.dense([-1.0, 0.0]),),
         (6, Vectors.dense([0.0, 1.0]),),
         (7, Vectors.dense([0.0, -1.0]),)]
dfB = spark.createDataFrame(dataB, ["id", "features"])
key = Vectors.dense([1.0, 0.0])
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(dfA)
# Feature Transformation
print("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show()
# Compute the locality sensitive hashes for the input rows, then perform approximate
# similarity join.
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxSimilarityJoin(transformedA, transformedB, 1.5)`
print("Approximately joining dfA and dfB on Euclidean distance smaller than 1.5:")
model.approxSimilarityJoin(dfA, dfB, 1.5, distCol="EuclideanDistance")\
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("EuclideanDistance")).show()
# Compute the locality sensitive hashes for the input rows, then perform approximate nearest
# neighbor search.
# We could avoid computing hashes by passing in the already-transformed dataset, e.g.
# `model.approxNearestNeighbors(transformedA, key, 2)`
print("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()
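Not part of the original answer, but one note on the Euclidean-vs-cosine point: for L2-normalized vectors the squared Euclidean distance equals 2 * (1 - cosine similarity), so normalizing the vectors first (a sketch below, using pyspark.ml.feature.Normalizer) lets a Euclidean threshold stand in for a cosine threshold.
from pyspark.ml.feature import Normalizer

# L2-normalize the vectors so that for any two rows a and b: ||a - b||^2 = 2 * (1 - cos(a, b)).
normalizer = Normalizer(inputCol="features", outputCol="norm_features", p=2.0)
dfA_norm = normalizer.transform(dfA)
dfB_norm = normalizer.transform(dfB)

# e.g. cosine similarity >= 0.5  <=>  Euclidean distance <= sqrt(2 * (1 - 0.5)) = 1.0
brp_norm = BucketedRandomProjectionLSH(inputCol="norm_features", outputCol="hashes",
                                       bucketLength=2.0, numHashTables=3)
model_norm = brp_norm.fit(dfA_norm)
model_norm.approxSimilarityJoin(dfA_norm, dfB_norm, 1.0, distCol="EuclideanDistance").show()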

Join on DataFrames creating CartesianProduct in Physical Plan on Spark 1.5.2

I am running into a performance issue when joining data frames created from avro files using the spark-avro library.
The data frames are created from 120K avro files and the total size is around 1.5 TB.
The two data frames are very large, with billions of records.
The join of these two DataFrames runs forever.
This process runs on a yarn cluster with 300 executors, each with 4 executor cores and 8GB memory.
Any insights on this join will help. I have posted the explain plan below.
I notice a CartesianProduct in the physical plan. I am wondering if this is causing the performance issue.
Below is the logical plan and the physical plan. (Due to the confidential nature, I am unable to post any of the column names or file names here.)
== Optimized Logical Plan ==
Limit 21
Join Inner, [ Join Conditions ]
Join Inner, [ Join Conditions ]
Project [ List of columns ]
Relation [ List of columns ] AvroRelation[ fileName1 ] -- large file - .5 billion records
InMemoryRelation [List of columns ], true, 10000, StorageLevel(true, true, false, true, 1), (Repartition 1, false), None
Project [ List of Columns ]
Relation[ List of Columns] AvroRelation[ filename2 ] -- another large file - 800 million records
== Physical Plan ==
Limit 21
Filter (filter conditions)
CartesianProduct
Filter (more filter conditions)
CartesianProduct
Project (selecting a few columns and applying a UDF to one column)
Scan AvroRelation[avro file][ columns in Avro File ]
InMemoryColumnarTableScan [List of columns ], true, 10000, StorageLevel(true, true, false, true, 1), (Repartition 1, false), None)
Project [ List of Columns ]
Scan AvroRelation[Avro File][List of Columns]
Code Generation: true
The code is shown below.
val customerDateFormat = new SimpleDateFormat("yyyy/MM/dd");
val dates = new RetailDates()
val dataStructures = new DataStructures()
// Reading CSV Format input files -- retailDates
// This DF has 75 records
val retailDatesWithSchema = sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", ",")
.schema(dates.retailDatesSchema)
.load(datesFile)
.coalesce(1)
.cache()
// Create UDF to convert String to Date
val dateUDF: (String => java.sql.Date) = (dateString: String) => new java.sql.Date(customerDateFormat.parse(dateString).getTime())
val stringToDateUDF = udf(dateUDF)
// Reading Avro Format Input Files
// This DF has 500 million records
val userInputDf = sqlContext.read.avro("customerLocation")
val userDf = userInputDf
  .withColumn("CAL_DT", stringToDateUDF(col("CAL_DT")))
  .select("CAL_DT", "USER_ID", "USER_CNTRY_ID")
val userDimDf = sqlContext.read.avro(userDimFiles).select("USER_ID","USER_CNTRY_ID","PRIMARY_USER_ID") // This DF has 800 million records
val retailDatesWithSchemaBroadcast = sc.broadcast(retailDatesWithSchema)
val userDimDfBroadcast = sc.broadcast(userDimDf)
val userAndRetailDates = userDf
.join((retailDatesWithSchemaBroadcast.value).as("retailDates"),
userDf("CAL_DT") between($"retailDates.WEEK_BEGIN_DATE", $"retailDates.WEEK_END_DATE")
, "inner")
val userAndRetailDatesAndUserDim = userAndRetailDates
.join((userDimDfBroadcast.value)
.withColumnRenamed("USER_ID", "USER_DIM_USER_ID")
.withColumnRenamed("USER_CNTRY_ID","USER_DIM_COUNTRY_ID")
.as("userdim")
, userAndRetailDates("USER_ID") <=> $"userdim.USER_DIM_USER_ID"
&& userAndRetailDates("USER_CNTRY_ID") <=> $"userdim.USER_DIM_COUNTRY_ID"
, "inner")
userAndRetailDatesAndUserDim.show()
Thanks,
Prasad.
There is not much here to go on (even if your data or column/table names are confidential, it could be useful to see some code showing what you are trying to achieve), but CartesianProduct is definitely a problem. O(N^2) is something you really want to avoid on large datasets, and in this particular case it hits all the weak spots in Spark.
Generally speaking, if a join is expanded to either an explicit Cartesian product or an equivalent operation, it means the join expression is not based on equality and therefore cannot be optimized using a shuffle-based (or broadcast + hashing) join (SortMergeJoin, HashJoin).
Edit:
In your case the following condition is most likely the problem:
userDf("CAL_DT") between(
  $"retailDates.WEEK_BEGIN_DATE", $"retailDates.WEEK_END_DATE")
It would be better to compute, for example, WEEK_BEGIN_DATE on userDf and join directly:
$"userDf.WEEK_BEGIN_DATE" === $"retailDates.WEEK_BEGIN_DATE"
Another small improvement is to parse the date without using UDFs, for example with the unix_timestamp function.
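Not from the original answer: a rough PySpark sketch of both suggestions (unix_timestamp instead of a UDF, plus a derived week-begin date joined on equality). The dataframe names are hypothetical stand-ins for the question's userInputDf and retailDatesWithSchema, and the week logic assumes weeks begin on Monday, which may not match the real WEEK_BEGIN_DATE definition.
from pyspark.sql import functions as F

# Parse the "yyyy/MM/dd" strings without a UDF.
user_df = user_input_df.withColumn(
    "CAL_DT",
    F.unix_timestamp(F.col("CAL_DT"), "yyyy/MM/dd").cast("timestamp").cast("date"))

# Derive the Monday that starts each calendar week (hypothetical stand-in for WEEK_BEGIN_DATE).
user_df = user_df.withColumn(
    "WEEK_BEGIN_DATE", F.date_sub(F.next_day(F.col("CAL_DT"), "Mon"), 7))

# An equality join lets Spark use a hash/sort-merge join instead of a Cartesian product.
user_and_retail_dates = user_df.join(retail_dates_df, on="WEEK_BEGIN_DATE", how="inner")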
EDIT:
Another issue here, pointed out by rchukh, is that <=> in Spark <= 1.6 is expanded to a Cartesian product - SPARK-11111
