Join on DataFrames creating CartesianProduct in Physical Plan on Spark 1.5.2 - join

I am running into a performance issue when joining data frames created from Avro files using the spark-avro library.
The data frames are created from 120K Avro files, and the total size is around 1.5 TB.
The two data frames are huge, with billions of records.
The join of these two DataFrames runs forever.
This process runs on a YARN cluster with 300 executors, each with 4 cores and 8 GB of memory.
Any insights on this join will help. I have posted the explain plan below.
I notice a CartesianProduct in the physical plan, and I am wondering whether it is causing the performance issue.
Below are the logical plan and the physical plan. (Due to the confidential nature of the data, I am unable to post any of the column names or file names here.)
== Optimized Logical Plan ==
Limit 21
Join Inner, [ Join Conditions ]
Join Inner, [ Join Conditions ]
Project [ List of columns ]
Relation [ List of columns ] AvroRelation[ fileName1 ] -- large file - .5 billion records
InMemoryRelation [List of columns ], true, 10000, StorageLevel(true, true, false, true, 1), (Repartition 1, false), None
Project [ List of Columns ]
Relation[ List of Columns] AvroRelation[ filename2 ] -- another large file - 800 million records
== Physical Plan ==
Limit 21
Filter (filter conditions)
CartesianProduct
Filter (more filter conditions)
CartesianProduct
Project (selecting a few columns and applying a UDF to one column)
Scan AvroRelation[avro file][ columns in Avro File ]
InMemoryColumnarTableScan [List of columns ], true, 10000, StorageLevel(true, true, false, true, 1), (Repartition 1, false), None)
Project [ List of Columns ]
Scan AvroRelation[Avro File][List of Columns]
Code Generation: true
The code is shown below.
val customerDateFormat = new SimpleDateFormat("yyyy/MM/dd");
val dates = new RetailDates()
val dataStructures = new DataStructures()
// Reading CSV Format input files -- retailDates
// This DF has 75 records
val retailDatesWithSchema = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .schema(dates.retailDatesSchema)
  .load(datesFile)
  .coalesce(1)
  .cache()
// Create UDF to convert String to Date
val dateUDF: (String => java.sql.Date) = (dateString: String) => new java.sql.Date(customerDateFormat.parse(dateString).getTime())
val stringToDateUDF = udf(dateUDF)
// Reading Avro Format Input Files
// This DF has 500 million records
val userInputDf = sqlContext.read.avro("customerLocation")
val userDf = userInputDf.withColumn("CAL_DT", stringToDateUDF(col("CAL_DT"))).select(
"CAL_DT","USER_ID","USER_CNTRY_ID"
)
val userDimDf = sqlContext.read.avro(userDimFiles).select("USER_ID","USER_CNTRY_ID","PRIMARY_USER_ID") // This DF has 800 million records
val retailDatesWithSchemaBroadcast = sc.broadcast(retailDatesWithSchema)
val userDimDfBroadcast = sc.broadcast(userDimDf)
val userAndRetailDates = userDf
  .join((retailDatesWithSchemaBroadcast.value).as("retailDates"),
    userDf("CAL_DT") between($"retailDates.WEEK_BEGIN_DATE", $"retailDates.WEEK_END_DATE"),
    "inner")
val userAndRetailDatesAndUserDim = userAndRetailDates
  .join((userDimDfBroadcast.value)
      .withColumnRenamed("USER_ID", "USER_DIM_USER_ID")
      .withColumnRenamed("USER_CNTRY_ID", "USER_DIM_COUNTRY_ID")
      .as("userdim"),
    userAndRetailDates("USER_ID") <=> $"userdim.USER_DIM_USER_ID"
      && userAndRetailDates("USER_CNTRY_ID") <=> $"userdim.USER_DIM_COUNTRY_ID",
    "inner")
userAndRetailDatesAndUserDim.show()
Thanks,
Prasad.

There is not much here to go on (even if your data or even column / table names are confidential, it could be useful to see some code which could show what you are trying to achieve), but CartesianProduct is definitely a problem. O(N^2) is something you really want to avoid on large datasets, and in this particular case it hits all the weak spots in Spark.
Generally speaking, if a join is expanded to either an explicit Cartesian product or an equivalent operation, it means that the join expression is not based on equality and therefore cannot be optimized using a shuffle-based (or broadcast + hashing) join (SortMergeJoin, HashJoin).
Edit:
In your case, the following condition is most likely the problem:
userDf("CAL_DT") between(
$"retailDates.WEEK_BEGIN_DATE", $"retailDates.WEEK_END_DATE")
It would be better to compute, for example, WEEK_BEGIN_DATE on userDf and join directly on equality:
$"userDf.WEEK_BEGIN_DATE" === $"retailDates.WEEK_BEGIN_DATE"
Another small improvement is to parse the date without a UDF, for example with the unix_timestamp function.
EDIT:
Another issue here, pointed out by rchukh, is that <=> in Spark <= 1.6 is expanded to a Cartesian product (SPARK-11111).
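To illustrate the idea of replacing the range condition with an equality key, here is a minimal plain-Python sketch (not Spark code). It assumes the retail weeks do not overlap; the sample dates, user IDs and the names retail_weeks, day_to_week_begin and user_rows are made up for illustration. The small calendar is expanded to one entry per day, so each CAL_DT can be matched by equality instead of by a between test.

```python
from datetime import date, timedelta

# Hypothetical retail calendar: (WEEK_BEGIN_DATE, WEEK_END_DATE) pairs.
retail_weeks = [
    (date(2015, 1, 4), date(2015, 1, 10)),
    (date(2015, 1, 11), date(2015, 1, 17)),
]

# Expand every week into one entry per day, so a date maps to its
# week start through a plain equality lookup.
day_to_week_begin = {}
for begin, end in retail_weeks:
    day = begin
    while day <= end:
        day_to_week_begin[day] = begin
        day += timedelta(days=1)

# Hypothetical user rows: (CAL_DT, USER_ID).
user_rows = [
    (date(2015, 1, 5), "u1"),
    (date(2015, 1, 11), "u2"),
    (date(2015, 2, 1), "u3"),  # outside the calendar, dropped by the inner join
]

# Inner join on equality: keep only rows whose date has a matching week.
joined = [
    (cal_dt, user_id, day_to_week_begin[cal_dt])
    for cal_dt, user_id in user_rows
    if cal_dt in day_to_week_begin
]
```

In Spark the same expansion would presumably be done once on the 75-row calendar, after which the large table can be joined on an equality column, which is what allows a hash or sort-merge join.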

Related

Filtering out based on count using Apache Beam

I am using Dataflow and Apache Beam to process a dataset and store the result in a headerless csv file with two columns, something like this:
A1,a
A2,a
A3,b
A4,a
A5,c
...
I want to filter out certain entries based on the following two conditions:
1- In the second column, if the number of occurrences of a certain value is less than N, then remove all such rows. For instance if N=10 and c only appears 7 times, then I want all those rows to be filtered out.
2- In the second column, if the number of occurrences of a certain value is more than M, then only keep M many of such rows and filter out the rest. For instance if M=1000 and a appears 1200 times, then I want 200 of such entries to be filtered out, and the other 1000 cases to be stored in the csv file.
In other words, I want to make sure every value in the second column appears at least N and at most M times.
My question is whether this is possible by using some filter in Beam? Or should it be done as a post-process step once the csv file is created and saved?
You can use beam.Filter to collect all the second-column values that satisfy your range's lower-bound condition into a PCollection.
Then correlate that PCollection (as a side input) with your original PCollection to filter out all the lines that need to be excluded.
As for the upper bound, since you want to keep M of the elements rather than exclude them completely, you should do some post-processing or come up with a combine transform to do that.
Here is an example with the Python SDK using word count.
import random
import re

import apache_beam as beam


class ReadWordsFromText(beam.PTransform):
    def __init__(self, file_pattern):
        super().__init__()
        self._file_pattern = file_pattern

    def expand(self, pcoll):
        # Read the text file and split every line into words.
        return (pcoll.pipeline
                | beam.io.ReadFromText(self._file_pattern)
                | beam.FlatMap(lambda line: re.findall(r'[\w\']+', line.strip(), re.UNICODE)))


p = beam.Pipeline()
words = (p
         | 'read' >> ReadWordsFromText('gs://apache-beam-samples/shakespeare/kinglear.txt')
         | 'lower' >> beam.Map(lambda word: word.lower()))
# Assume this is the data PCollection you want to filter.
data = words | beam.Map(lambda word: (word, random.randint(1, 101)))
counts = (words
          | 'count' >> beam.combiners.Count.PerElement())
# Keep only the words that occur more than 100 times.
words_with_counts_bigger_than_100 = counts | beam.Filter(lambda count: count[1] > 100) | beam.Map(lambda count: count[0])
Now you have a PCollection containing only the words that occur more than 100 times.
def cross_join(left, rights):
    for x in rights:
        if left[0] == x:
            yield (left, x)

data_with_word_counts_bigger_than_100 = data | beam.FlatMap(cross_join, rights=beam.pvalue.AsIter(words_with_counts_bigger_than_100))
Now you have filtered out the elements below the lower bound from the data set.
Note that the 66 in an element such as ('king', 66) is the fake random data I put in.
To debug with such visualizations, you can use Interactive Beam. You can set up your own notebook runtime by following the instructions, or you can use the hosted solution provided by Google Dataflow Notebooks.
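For the post-processing route, the two conditions from the question can be sketched in plain Python (this is not Beam API, just the logic). The function name filter_by_count and the sample values of N and M are assumptions for illustration, and which M rows survive is arbitrary; here it is simply the first M encountered.

```python
from collections import Counter


def filter_by_count(rows, n, m):
    """Drop rows whose label occurs fewer than n times; keep at most m rows per label."""
    counts = Counter(label for _, label in rows)
    kept = Counter()
    result = []
    for key, label in rows:
        if counts[label] < n:
            continue  # lower bound: label is too rare, drop every such row
        if kept[label] >= m:
            continue  # upper bound: m rows with this label were already kept
        kept[label] += 1
        result.append((key, label))
    return result


rows = [("A1", "a"), ("A2", "a"), ("A3", "b"), ("A4", "a"), ("A5", "c")]
filtered = filter_by_count(rows, n=2, m=2)
```

With this sample input, b and c fall below the lower bound and the third a row is cut by the upper bound.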

Python vectorizing a dataframe lookup table

I have two dataframes. One is a lookup table consisting of key/value pairs. The other is my main dataframe. The main dataframe has many more records than the lookup table. I need to construct a 'key' from existing columns in my main dataframe and then lookup a value matching that key in my lookup table. Here they are:
import pandas as pd
from datetime import datetime

lk = pd.DataFrame({'key': ['key10', 'key9'], 'value': [100, 90]})
lk.set_index('key', inplace=True)
date_today = datetime.now()
df = pd.DataFrame({'date1': [date_today, date_today, date_today],
                   'year': [1999, 2001, 2003],
                   'month': [10, 9, 10],
                   'code': [10, 4, 5],
                   'date2': [None, date_today, None],
                   'keyed_value': [0, 0, 0]})
This is how I get a value:
df['constructed'] = "key" + df['month'].astype('str')
def getKeyValue(lk, k):
    return lk.loc[k, 'value']
print(getKeyValue(lk, df['constructed']))
Here are my issues:
1) I don't want to use iteration or apply methods. My actual data is over 2 million rows and 200 columns. It was really slow (over 2 minutes) with apply, so I opted for an inner join, hence the need to create a new 'constructed' column. After the join I drop the 'constructed' column. The join has helped by bringing execution time down to 48 seconds, but there has to be a faster way (I am hoping).
2) How do I vectorize this? I don't even know how to approach it. Is it even possible? I tried this but just got an error:
df['keyed_values'] = getKeyValue(lk, df['constructed'])
Any help or pointers are much appreciated.
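One possible vectorized approach, offered as a sketch rather than a benchmarked answer, is to map the constructed key through the lookup Series with Series.map. It assumes the lookup keys are unique; keys missing from the lookup table come back as NaN. The frames below are cut-down versions of the ones in the question.

```python
import pandas as pd

# Cut-down versions of the frames from the question.
lk = pd.DataFrame({'key': ['key10', 'key9'], 'value': [100, 90]}).set_index('key')
df = pd.DataFrame({'month': [10, 9, 10], 'code': [10, 4, 5]})

# Build the key as a temporary Series, so no extra column has to be dropped later.
constructed = 'key' + df['month'].astype(str)

# Series.map looks each key up in the index of lk['value'] and keeps
# the row order (and index) of df, so the result can be assigned directly.
df['keyed_value'] = constructed.map(lk['value'])
```

Because the result keeps the index of df, the assignment does not run into the alignment problem that the lk.loc[...] attempt hits when the same key appears more than once.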

Spark dataframe reduceByKey

I am using Spark 1.5/1.6, and I want to do a reduceByKey operation on a DataFrame without converting the DataFrame to an RDD.
Each row looks like this, and I have multiple rows for each id1:
id1, id2, score, time
I want to have something like:
id1, [ (id21, score21, time21), (id22, score22, time22), (id23, score23, time23) ]
So, for each "id1", I want all of its records in a list.
By the way, the reason I don't want to convert the DataFrame to an RDD is that I have to join this (reduced) DataFrame to another DataFrame, and I am repartitioning on the join key, which makes the join faster. I guess the same cannot be done with an RDD.
Any help will be appreciated.
To preserve the partitioning already achieved, re-use the parent RDD's partitioner in the reduceByKey invocation:
val rdd = df.rdd // Must be keyed as (key, value) pairs before calling reduceByKey
val parentRdd = rdd.dependencies(0).rdd // Assuming first parent has the
                                        // desired partitioning: adjust as needed
val parentPartitioner = parentRdd.partitioner.get // partitioner is an Option[Partitioner]
val optimizedReducedRdd = rdd.reduceByKey(parentPartitioner, reduceFn)
If you do not specify the partitioner, as in the following:
df.rdd.reduceByKey(reduceFn) // This is non-optimized: uses full shuffle
then the behavior you noted would occur, i.e. a full shuffle. That is because a HashPartitioner would be used instead.
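To make the target shape from the question concrete, here is a small plain-Python sketch (not Spark code) of what reducing by id1 into a list produces. The sample rows are invented for illustration.

```python
from collections import defaultdict

# Hypothetical rows in the (id1, id2, score, time) layout from the question.
rows = [
    ("a", "x1", 0.5, 10),
    ("a", "x2", 0.7, 11),
    ("b", "y1", 0.1, 12),
]

# Collect every (id2, score, time) tuple under its id1 key, which is what a
# reduceByKey with list concatenation as the reduce function would yield.
grouped = defaultdict(list)
for id1, id2, score, time in rows:
    grouped[id1].append((id2, score, time))

result = dict(grouped)
```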

Joining two spark dataframes on time (TimestampType) in python

I have two dataframes and I would like to join them based on one column, with a caveat that this column is a timestamp, and that timestamp has to be within a certain offset (5 seconds) in order to join records. More specifically, a record in dates_df with date=1/3/2015:00:00:00 should be joined with events_df with time=1/3/2015:00:00:01 because both timestamps are within 5 seconds from each other.
I'm trying to get this logic working with python spark, and it is extremely painful. How do people do joins like this in spark?
My approach is to add two extra columns to dates_df that will determine the lower_timestamp and upper_timestamp bounds with a 5 second offset, and perform a conditional join. And this is where it fails, more specifically:
joined_df = dates_df.join(events_df,
dates_df.lower_timestamp < events_df.time < dates_df.upper_timestamp)
joined_df.explain()
The plan captures only the last part of the condition:
Filter (time#6 < upper_timestamp#4)
CartesianProduct
....
and it gives me a wrong result.
Do I really have to do a full blown cartesian join for each inequality, removing duplicates as I go along?
Here is the full code:
from datetime import datetime, timedelta
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import udf
master = 'local[*]'
app_name = 'stackoverflow_join'
conf = SparkConf().setAppName(app_name).setMaster(master)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
def lower_range_func(x, offset=5):
    return x - timedelta(seconds=offset)
def upper_range_func(x, offset=5):
    return x + timedelta(seconds=offset)
lower_range = udf(lower_range_func, TimestampType())
upper_range = udf(upper_range_func, TimestampType())
dates_fields = [StructField("name", StringType(), True), StructField("date", TimestampType(), True)]
dates_schema = StructType(dates_fields)
dates = [('day_%s' % x, datetime(year=2015, day=x, month=1)) for x in range(1,5)]
dates_df = sqlContext.createDataFrame(dates, dates_schema)
dates_df.show()
# extend dates_df with time ranges
dates_df = dates_df.withColumn('lower_timestamp', lower_range(dates_df['date'])).\
withColumn('upper_timestamp', upper_range(dates_df['date']))
event_fields = [StructField("time", TimestampType(), True), StructField("event", StringType(), True)]
event_schema = StructType(event_fields)
events = [(datetime(year=2015, day=3, month=1, second=3), 'meeting')]
events_df = sqlContext.createDataFrame(events, event_schema)
events_df.show()
# finally, join the data
joined_df = dates_df.join(events_df,
dates_df.lower_timestamp < events_df.time < dates_df.upper_timestamp)
joined_df.show()
I get the following output:
+-----+--------------------+
| name|                date|
+-----+--------------------+
|day_1|2015-01-01 00:00:...|
|day_2|2015-01-02 00:00:...|
|day_3|2015-01-03 00:00:...|
|day_4|2015-01-04 00:00:...|
+-----+--------------------+
+--------------------+-------+
|                time|  event|
+--------------------+-------+
|2015-01-03 00:00:...|meeting|
+--------------------+-------+
+-----+--------------------+--------------------+--------------------+--------------------+-------+
| name|                date|     lower_timestamp|     upper_timestamp|                time|  event|
+-----+--------------------+--------------------+--------------------+--------------------+-------+
|day_3|2015-01-03 00:00:...|2015-01-02 23:59:...|2015-01-03 00:00:...|2015-01-03 00:00:...|meeting|
|day_4|2015-01-04 00:00:...|2015-01-03 23:59:...|2015-01-04 00:00:...|2015-01-03 00:00:...|meeting|
+-----+--------------------+--------------------+--------------------+--------------------+-------+
I ran a Spark SQL query with explain() to see how it is done, and replicated the same behavior in Python. First, here is how to do the same with Spark SQL:
dates_df.registerTempTable("dates")
events_df.registerTempTable("events")
results = sqlContext.sql("SELECT * FROM dates INNER JOIN events ON dates.lower_timestamp < events.time and events.time < dates.upper_timestamp")
results.explain()
This works, but the question was about how to do it in Python, so the solution seems to be just a plain join followed by two filters:
joined_df = dates_df.join(events_df).filter(dates_df.lower_timestamp < events_df.time).filter(events_df.time < dates_df.upper_timestamp)
joined_df.explain() yields the same plan as the Spark SQL results.explain(), so I assume this is how things are done.
A year late, but this might help others.
As you said, a full Cartesian product is insane in your case. Your matching records will be close in time (within 5 seconds), so you can take advantage of that and save a lot of time if you first group records into buckets based on their timestamp, then join the two dataframes on that bucket, and only then apply the filter. That method causes Spark to use a SortMergeJoin instead of a CartesianProduct and greatly boosts performance.
There is a small caveat here: you must match against both the bucket and the next one.
It's better explained in my blog, with working code examples (Scala + Spark 2.0, but you can implement the same in Python too):
http://zachmoshe.com/2016/09/26/efficient-range-joins-with-spark.html
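The bucketing idea can be sketched in plain Python (not Spark, and not the code from the blog post). It assumes the bucket width equals the 5-second tolerance from the question. Because the window in the question is symmetric (lower < time < upper), this sketch checks the previous bucket as well as the current and next one. The names bucket and bucketed_join are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

EPOCH = datetime(1970, 1, 1)
OFFSET = 5  # seconds; bucket width equals the join tolerance


def bucket(ts, offset=OFFSET):
    """Index of the offset-second bucket that a timestamp falls into."""
    return int((ts - EPOCH).total_seconds() // offset)


def bucketed_join(dates, events, offset=OFFSET):
    # Index the events by bucket: this is the equality key of the join.
    events_by_bucket = defaultdict(list)
    for time, event in events:
        events_by_bucket[bucket(time, offset)].append((time, event))

    joined = []
    for name, date in dates:
        b = bucket(date, offset)
        # Timestamps less than `offset` apart are at most one bucket apart.
        for neighbour in (b - 1, b, b + 1):
            for time, event in events_by_bucket.get(neighbour, []):
                # Apply the exact range filter only to the few candidates.
                if abs((time - date).total_seconds()) < offset:
                    joined.append((name, date, time, event))
    return joined


# Same sample data as in the question.
dates = [('day_%s' % d, datetime(2015, 1, d)) for d in range(1, 5)]
events = [(datetime(2015, 1, 3, 0, 0, 3), 'meeting')]
result = bucketed_join(dates, events)
```

On the question's sample data only day_3 pairs with the meeting event, whereas the chained comparison in the question also returned day_4.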

Neo4j performance - is it really fast?

I want to run some tests on Neo4j and compare its performance with other databases, in this case PostgreSQL.
This Postgres database has about 2000000 'content' records distributed across 3000 'categories'. (This means that there is a 'content' table, a 'category' table and a 'content-to-category' relation table, since one content can be in more than one category.)
So, mapping this to a Neo4j db, I'm creating 'content' and 'category' nodes and their relations (content to category, and content to content, because contents can have related contents).
category -> category ( categories can have sub-categories )
content -> category
content -> content (related)
Do you think this 'schema' is OK for this type of domain?
Migrating all the data from PostgreSQL to Neo4j is taking forever (about 4 to 5 days). This is just some searching for nodes and creating/updating accordingly. (The search uses indexes, and the insert/update is taking 500 ms for each node.)
Am I doing something wrong?
The migration is done, so I went on to try some querying...
I ended up with about 2000000 content nodes, 3000 category nodes, and more than 4000000 relationships.
(Please note that I'm new to the Neo4j world, so I have no idea how to optimize Cypher queries...)
One of the queries I wanted to test is: get the 10 latest published contents of a given 'definition' in a given category (this includes contents that are in sub-categories of the given category).
After experimenting a little, I ended up with something like this:
START
c = node : node_auto_index( 'type: category AND code: category_code' ),
n = node : node_auto_index( 'type: content AND state: published AND definitionCode: definition_name' )
MATCH (c) <- [ r:BELONGS_TO * ] - (n)
RETURN n.published_stamp, n.title
ORDER BY n.published_stamp DESC
LIMIT 6
This takes around 3 seconds, excluding the first run, which takes a lot longer... Is this normal?
What am I doing wrong?
Please note that I'm using Neo4j 1.9.2 and auto-indexing some node properties (type, code, state, definitionCode and published_stamp included; title is not auto-indexed).
Also, returning 'c' from the previous query ( start c = node : node_auto_index( 'type: category AND code : category-code' ) return c; ) is fast (again, excluding the first run, that takes around 20-30ms).
Also, I'm not sure if this is the right way to use indexes...
Thank you in advance (sorry if something does not make sense; ask me and I'll try to explain better).
Have you looked at the batch import facilities: http://www.neo4j.org/develop/import? You really should look at that for the initial import - it will take minutes instead of days.
I will ask some of our technical folks to get back to you on some of the other stuff. You really should not be seeing this.
Rik
How many nodes are returned by this?
START
n = node : node_auto_index( 'type: content AND state: published AND definitionCode: definition_name' )
RETURN count(*)
I would try to let the graph do the work.
How deep are your hierarchies usually?
Usually you limit arbitrary length relationships to not have the combinatorial explosion:
I would also have a different relationship-type between content and category than the category tree.
Can you point out your current relationship-types?
START
c = node : node_auto_index( 'type: category AND code: category_code' )
MATCH (c) <- [:BELONGS_TO*5] - (n)
WHERE n.type = 'content' AND n.state='published' and n.definitionCode = 'definition_name'
RETURN n.published_stamp, n.title
ORDER BY n.published_stamp DESC
LIMIT 6
Can you try that?
For import it is easiest to generate CSV from your SQL and import that using http://github.com/jexp/batch-import
Are you running Linux, maybe on an ext4 filesystem?
You might want to set the barrier=0 mount option, as described here: http://structr.org/blog/neo4j-performance-on-ext4
Further discussion of this topic: https://groups.google.com/forum/#!topic/neo4j/nflUyBsRKyY
