Python vectorizing a dataframe lookup table - vectorization

I have two dataframes. One is a lookup table consisting of key/value pairs. The other is my main dataframe. The main dataframe has many more records than the lookup table. I need to construct a 'key' from existing columns in my main dataframe and then lookup a value matching that key in my lookup table. Here they are:
import pandas as pd
from datetime import datetime

lk = pd.DataFrame({'key': ['key10', 'key9'], 'value': [100, 90]})
lk.set_index('key', inplace=True)

date_today = datetime.now()
df = pd.DataFrame({'date1': [date_today, date_today, date_today],
                   'year': [1999, 2001, 2003],
                   'month': [10, 9, 10],
                   'code': [10, 4, 5],
                   'date2': [None, date_today, None],
                   'keyed_value': [0, 0, 0]})
This is how I get a value:
df['constructed'] = "key" + df['month'].astype('str')

def getKeyValue(lk, k):
    return lk.loc[k, 'value']

print(getKeyValue(lk, df['constructed']))
Here are my issues:
1) I don't want to use iteration or apply methods. My actual data is over 2 million rows and 200 columns, and apply was really slow (over 2 minutes). So I opted for an inner join, hence the need to create a new 'constructed' column, which I drop after the join. The join brought execution down to 48 seconds, but there has to be a faster way (I am hoping).
2) How do I vectorize this? I don't know how to even approach it. Is it even possible? I tried this but just got an error:
df['keyed_values'] = getKeyValue(lk, df['constructed'])
Any help or pointers are much appreciated.
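One vectorized possibility, not from the original post but a sketch based on the columns above, is to turn the lookup table into a Series and use Series.map, which avoids both apply and an explicit join (keys missing from lk come back as NaN):
df['constructed'] = 'key' + df['month'].astype(str)
# lk['value'] is a Series indexed by 'key', so .map does a vectorized lookup
df['keyed_value'] = df['constructed'].map(lk['value'])
# Roughly equivalent merge-based variant (what the inner join above does),
# written as a left join so unmatched rows are preserved:
# df = df.merge(lk, left_on='constructed', right_index=True, how='left')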

Related

Dask - load dataframe from SQL without specifying index_col

I'm trying to load a Dask dataframe from a SQL connection. Per the read_sql_table documentation, it is necessary to pass in an index_col. What should I do if there's a possibility that there are no good columns to act as index?
Could this be a suitable replacement?
import math
import dask
import dask.dataframe as dd
import pandas as pd

# Break SQL query into chunks
chunks = []
num_chunks = math.ceil(num_records / chunk_size)

# Run the query for each chunk on Dask workers
for i in range(num_chunks):
    query = 'SELECT * FROM ' + table + ' LIMIT ' + str(i * chunk_size) + ',' + str(chunk_size)
    chunk = dask.delayed(pd.read_sql)(query, sql_uri)
    chunks.append(chunk)

# Aggregate chunks
df = dd.from_delayed(chunks)
dfs[table] = df
Unfortunately, LIMIT/OFFSET is not in general a reliable way to partition a query in most SQL implementations. In particular, it is often the case that, to get to an offset and fetch later rows from a query, the engine must first parse through earlier rows, and thus the work to generate a number of partitions is much magnified. In some cases, you might even end up with missed or duplicated rows.
This was the reasoning behind requiring boundary values in Dask's read_sql_table implementation.
However, there is nothing in principle wrong with the way you are setting up your dask dataframe. If you can show that your server does not suffer from the problems we were anticipating, then you are welcome to take that approach.
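For reference, the boundary-value approach that read_sql_table asks for looks roughly like this; this is only a sketch, where my_table, the connection URI, and the numeric id column are placeholders, and the exact signature depends on your Dask version:
import dask.dataframe as dd

# Partition on an indexed numeric column so each worker issues a
# bounded WHERE query instead of LIMIT/OFFSET
df = dd.read_sql_table(
    'my_table',                          # placeholder table name
    'postgresql://user:pass@host/db',    # placeholder connection URI
    index_col='id',                      # column used to compute boundaries
    npartitions=10,                      # let Dask pick the boundary values
)

# Or supply the boundaries explicitly:
# df = dd.read_sql_table('my_table', uri, index_col='id',
#                        divisions=[0, 100000, 200000, 300000])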

Loading Cassandra Data into Dask Dataframe

I am trying to load data from a cassandra database into a Dask dataframe. I have tried querying the following with no success:
query="""SELECT * FROM document_table"""
df = man.session.execute(query)
df = dd.DataFrame(list(df))
TypeError Traceback (most recent call last)
<ipython-input-135-021507f6f2ab> in <module>()
----> 1 a = dd.DataFrame(list(df))
TypeError: __init__() missing 3 required positional arguments: 'name', 'meta', and 'divisions'
Does anybody know an easy way to load data directly from Cassandra into Dask? It is too much data to load into pandas first.
Some problems with your code:
the line df = man.session.execute(query) presumably loads the whole data set into memory. Dask is not invoked here and plays no part in this (someone with knowledge of the Cassandra driver can confirm).
list(df) produces a list of the column names of a dataframe and drops all the data
dd.DataFrame, if you read the docs, is not constructed like this.
What you probably want to do is a) make a function that returns one partition of the data, b) delay this function and call it with the various partition values, and c) use dd.from_delayed to make the dask dataframe. E.g., assuming the table has a field partfield which handily has possible values 1..6 and a similar number of rows for each partition:
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def part(x):
    session = ...  # construct a Cassandra session here
    q = "SELECT * FROM document_table WHERE partfield={}".format(x)
    # each delayed call returns one pandas partition
    return pd.DataFrame(list(session.execute(q)))

parts = [part(x) for x in range(1, 7)]
df = dd.from_delayed(parts)
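With the parts defined this way nothing is read from Cassandra up front; a small usage sketch (the calls below are assumptions about how you might inspect the result, not part of the original answer):
print(df.npartitions)   # 6 delayed partitions, none computed yet
print(df.head())        # computes only the first partition
total_rows = len(df)    # touches every partition without materialising the whole table at once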

Dask groupby and apply : Value error Expected axis has 6 elements, new values have 5 elements

I am trying collapse rows of a dataframe based on a key. My file is big and pandas throws a memory error. I am currently trying to use dask. I am attaching the snippet of the code here.
def f(x):
    p = x.groupby(id).agg(''.join).reset_index()
    return p

metadf = pd.DataFrame(columns=['c1','p1','pd1','d1'])
df = df.groupby(idname).apply(f, meta=metadf).reset_index().compute()
p has the same structure as metadf. The shapes of both dataframes are the same.
When I execute this, I get the following error:
"ValueError: Length mismatch: Expected axis has 6 elements, new values have 5 elements"
What am I missing here? Is there any other way to collapse rows based on a key in dask?
The task at hand is to do the following sample in a dask dataframe.
Input csv file :
key,c1,c2,c3......,cn
1,car,phone,cat,.....,kite
2,abc,def,hij,.......,pot
1,yes,no,is,.........,hello
2,hello,yes,no,......,help
Output csv file:
key,c1,c2,c3,.......,cn
1,caryes,phoneno,catis,.....,kitehello
2,abchello,defyes,hijno,....,pothelp
In this case meta= corresponds to the output of df.groupby(...).apply(f) and not just to the output of f. Perhaps these differ in some subtle way?
I would address this by first not providing meta= at all. Dask.dataframe will give you a warning asking you to be explicit but things should hopefully progress anyway if it is able to determine the right dtypes and columns by running some sample data through your function.
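One way to act on that advice, sketched here with the question's placeholder names idname and f, is to either drop meta= entirely or derive it from a small pandas sample so it matches the output of groupby(...).apply(f), extra index levels included:
# Option 1: let dask infer meta (it warns, but usually works)
out = df.groupby(idname).apply(f).reset_index().compute()

# Option 2: build meta from a small pandas sample so its columns/index
# match what groupby(...).apply(f) actually produces
sample = df.get_partition(0).compute().head(100)
meta = sample.groupby(idname).apply(f)
out = df.groupby(idname).apply(f, meta=meta).reset_index().compute()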

Spark dataframe reduceByKey

I am using Spark 1.5/1.6, where I want to do a reduceByKey operation on a DataFrame; I don't want to convert the df to an rdd.
Each row looks like the following, and I have multiple rows for id1:
id1, id2, score, time
I want to have something like:
id1, [ (id21, score21, time21), (id22, score22, time22), (id23, score23, time23) ]
So, for each "id1", I want all records in a list.
By the way, the reason I don't want to convert the df to an rdd is that I have to join this (reduced) dataframe to another dataframe, and I am re-partitioning on the join key, which makes it faster; I guess the same cannot be done with an rdd.
Any help will be appreciated.
To simply preserve the partitioning already achieved, re-use the parent RDD's partitioner in the reduceByKey invocation:
val rdd = df.rdd                           // note: key it as an RDD[(K, V)] before reduceByKey
val parentRdd = rdd.dependencies(0).rdd    // Assuming the first parent has the
                                           // desired partitioning: adjust as needed
val parentPartitioner = parentRdd.partitioner.get
val optimizedReducedRdd = rdd.reduceByKey(parentPartitioner, reduceFn)
If you were to not specify the partitioner as follows:
df.rdd.reduceByKey(reduceFn) // This is non-optimized: uses a full shuffle
then the behavior you noted would occur - i.e. a full shuffle occurs. That is because the HashPartitioner would be used instead.

Join on DataFrames creating CartesianProduct in Physical Plan on Spark 1.5.2

I am running into performance issue when joining data frames created from avro files using spark-avro library.
The data frames are created from 120K avro files and the total size is around 1.5 TB.
The two data frames are very large, with billions of records.
The join for these two DataFrames runs forever.
This process runs on a yarn cluster with 300 executors with 4 executor cores and 8GB memory.
Any insights on this join will help. I have posted the explain plan below.
I notice a CartesianProduct in the Physical Plan. I am wondering if this is causing the performance issue.
Below are the logical plan and the physical plan. (Due to the confidential nature of the data, I am unable to post any of the column names or file names here.)
== Optimized Logical Plan ==
Limit 21
Join Inner, [ Join Conditions ]
Join Inner, [ Join Conditions ]
Project [ List of columns ]
Relation [ List of columns ] AvroRelation[ fileName1 ] -- large file - .5 billion records
InMemoryRelation [List of columns ], true, 10000, StorageLevel(true, true, false, true, 1), (Repartition 1, false), None
Project [ List of Columns ]
Relation[ List of Columns] AvroRelation[ filename2 ] -- another large file - 800 million records
== Physical Plan ==
Limit 21
Filter (filter conditions)
CartesianProduct
Filter (more filter conditions)
CartesianProduct
Project (selecting a few columns and applying a UDF to one column)
Scan AvroRelation[avro file][ columns in Avro File ]
InMemoryColumnarTableScan [List of columns ], true, 10000, StorageLevel(true, true, false, true, 1), (Repartition 1, false), None)
Project [ List of Columns ]
Scan AvroRelation[Avro File][List of Columns]
Code Generation: true
The code is shown below.
val customerDateFormat = new SimpleDateFormat("yyyy/MM/dd")

val dates = new RetailDates()
val dataStructures = new DataStructures()

// Reading CSV format input files -- retailDates
// This DF has 75 records
val retailDatesWithSchema = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .schema(dates.retailDatesSchema)
  .load(datesFile)
  .coalesce(1)
  .cache()

// Create a UDF to convert String to Date
val dateUDF: (String => java.sql.Date) = (dateString: String) => new java.sql.Date(customerDateFormat.parse(dateString).getTime())
val stringToDateUDF = udf(dateUDF)

// Reading Avro format input files
// This DF has 500 million records
val userInputDf = sqlContext.read.avro("customerLocation")
val userDf = userInputDf.withColumn("CAL_DT", stringToDateUDF(col("CAL_DT"))).select(
  "CAL_DT", "USER_ID", "USER_CNTRY_ID"
)

// This DF has 800 million records
val userDimDf = sqlContext.read.avro(userDimFiles).select("USER_ID", "USER_CNTRY_ID", "PRIMARY_USER_ID")

val retailDatesWithSchemaBroadcast = sc.broadcast(retailDatesWithSchema)
val userDimDfBroadcast = sc.broadcast(userDimDf)

val userAndRetailDates = userDf
  .join((retailDatesWithSchemaBroadcast.value).as("retailDates"),
    userDf("CAL_DT") between($"retailDates.WEEK_BEGIN_DATE", $"retailDates.WEEK_END_DATE"),
    "inner")

val userAndRetailDatesAndUserDim = userAndRetailDates
  .join((userDimDfBroadcast.value)
      .withColumnRenamed("USER_ID", "USER_DIM_USER_ID")
      .withColumnRenamed("USER_CNTRY_ID", "USER_DIM_COUNTRY_ID")
      .as("userdim"),
    userAndRetailDates("USER_ID") <=> $"userdim.USER_DIM_USER_ID"
      && userAndRetailDates("USER_CNTRY_ID") <=> $"userdim.USER_DIM_COUNTRY_ID",
    "inner")

userAndRetailDatesAndUserDim.show()
Thanks,
Prasad.
There is not much here to go on (even if your data or even column / table names are confidential, it could be useful to see some code which shows what you are trying to achieve), but CartesianProduct is definitely a problem. O(N^2) is something you really want to avoid on large datasets, and in this particular case it hits all the weak spots in Spark.
Generally speaking, if a join is expanded to either an explicit Cartesian product or an equivalent operation, it means that the join expression is not based on equality and therefore cannot be optimized using a shuffle-based (or broadcast + hashing) join (SortMergeJoin, HashJoin).
Edit:
In your case the following condition is most likely the problem:
userDf("CAL_DT") between(
$"retailDates.WEEK_BEGIN_DATE", $"retailDates.WEEK_END_DATE")
It would be better to compute, for example, WEEK_BEGIN_DATE on userDf and join directly on:
$"userDf.WEEK_BEGIN_DATE" === $"retailDates.WEEK_BEGIN_DATE"
Another small improvement is to parse the dates without using UDFs, for example with the unix_timestamp function.
EDIT:
Another issue here, pointed out by rchukh, is that <=> in Spark <= 1.6 is expanded to a Cartesian product - SPARK-11111.
