How to calculate the columnwise minimum of a dask pivot table? - dask

I would like to create a pivot table in dask and then calculate the column-wise minimum.
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
df = dd.read_csv("data.csv")
# In order to use pivot_table, the columns used as index and columns need to be categorical:
df = df.categorize(columns=['A', 'B'])
#df['A'] = df['A'].cat.as_ordered()
#df['B'] = df['B'].cat.as_ordered()
pt = df.pivot_table(index='A', columns='B', values='C', aggfunc='mean')
pt.min().compute()
TypeError: Categorical is not ordered for operation min
you can use .as_ordered() to change the Categorical to an ordered one
...
# Trying to uncategorize the index, takes forever
pt.index = list(pt.index)
pt.min().compute()
Is there a better way to achieve this?
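
For reference, a hedged sketch of the workaround hinted at by both the error message and the commented-out lines above: make the categoricals ordered before pivoting so the reduction no longer raises. I cannot confirm this avoids the error in every dask version, so treat it as a starting point rather than a definitive fix.
import dask.dataframe as dd

df = dd.read_csv("data.csv")
df = df.categorize(columns=['A', 'B'])
# follow the error's suggestion and order the categoricals
df['A'] = df['A'].cat.as_ordered()
df['B'] = df['B'].cat.as_ordered()
pt = df.pivot_table(index='A', columns='B', values='C', aggfunc='mean')
print(pt.min().compute())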

Related

Python vectorizing a dataframe lookup table

I have two dataframes. One is a lookup table consisting of key/value pairs. The other is my main dataframe. The main dataframe has many more records than the lookup table. I need to construct a 'key' from existing columns in my main dataframe and then lookup a value matching that key in my lookup table. Here they are:
import pandas as pd
from datetime import datetime

lk = pd.DataFrame({'key': ['key10', 'key9'], 'value': [100, 90]})
lk.set_index('key', inplace=True)

date_today = datetime.now()
df = pd.DataFrame({'date1': [date_today, date_today, date_today],
                   'year': [1999, 2001, 2003],
                   'month': [10, 9, 10],
                   'code': [10, 4, 5],
                   'date2': [None, date_today, None],
                   'keyed_value': [0, 0, 0]})
This is how I get a value:
df['constructed'] = "key" + df['month'].astype('str')

def getKeyValue(lk, k):
    return lk.loc[k, 'value']

print(getKeyValue(lk, df['constructed']))
Here are my issues:
1) I don't want to use iteration or apply methods. My actual data is over 2 million rows and 200 columns. It was really slow (over 2 minutes) with apply, so I opted for an inner join, and hence the need to create a new 'constructed' column. After the join I drop the 'constructed' column. The join has helped by bringing execution down to 48 seconds, but there has to be a faster way (I am hoping).
2) How do I vectorize this? I don't know how to even approach it. Is it even possible? I tried this but just got an error:
df['keyed_values'] = getKeyValue(lk, df['constructed'])
Any help or pointers are much appreciated.
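For what it's worth, a vectorized lookup can usually be written with Series.map against the lookup table's index, which avoids both apply and an explicit join. A minimal sketch using the frames defined above (not from the original thread):
# build the key, then map it through the lookup table's index;
# keys with no match become NaN
df['constructed'] = 'key' + df['month'].astype('str')
df['keyed_value'] = df['constructed'].map(lk['value'])
The same result can also be had with df.merge(lk, left_on='constructed', right_index=True, how='left'), but map avoids carrying the extra joined columns.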

looping program for MLP Keras prediction

I am (sort of a beginner starting out) experimenting with Keras on a time series data application where I created a regression model and then saved it to run on a different Python script.
The time series data that I am dealing with is hourly, and I am using a saved Keras model to predict a value for each hour in the data set (data = a CSV file read into pandas). With a year's worth of time series data there are 8760 predictions (hours in a year), and I am attempting to sum the values of the predictions at the end.
In the code below I am not showing how the model architecture gets recreated (a Keras requirement for a saved model). The code works, it's just extremely slow. This method seems fine for under 200 predictions, but at 8760 it bogs down way too much to ever finish.
I don't have any experience with databases, but would that be a better method versus storing 8760 Keras predictions in a Python list? Thanks for any tips, I am still riding the learning curve.
# set initial loop params & empty list to store modeled data
row_num = 0
total_estKwh = []
for i, row in data.iterrows():
    params = row.values
    if (params.ndim == 1):
        params = np.array([params])
    estimatedKwh = load_trained_model(weights_path).predict(params)
    print('Analyzing row number:', row_num)
    total_estKwh.append(estimatedKwh)
    row_num += 1

df = pd.DataFrame.from_records(total_estKwh)
total = df.sum()
totalStd = np.std(df.values)
totalMean = df.mean()
Seems you are making your life very difficult without obvious reason...
For starters, you don't need to load your model for every row - this is overkill! You should definitely move load_trained_model(weights_path) out of the for loop, with something like
model = load_trained_model(weights_path) # load ONCE
and replace the respective line in the loop with
estimatedKwh = model.predict(params)
Second, it is again not efficient to call the model for prediction row by row; it is preferable to first prepare your params as an array, and then feed this to the model to get batch predictions. Forget the print statement, too.
All in all, try this:
params_array = []
for i, row in data.iterrows():
    params = row.values
    if (params.ndim == 1):
        params = np.array([params])  # is this if really necessary??
    params_array.append(params)

params_array = np.asarray(params_array, dtype=np.float32)
total_estKwh = load_trained_model(weights_path).predict(params_array)

df = pd.DataFrame.from_records(total_estKwh)
total = df.sum()
totalStd = np.std(df.values)
totalMean = df.mean()
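If every column of data is already a numeric feature (which the row-wise code above implies), a simpler variant of the same idea, offered only as a sketch under that assumption, skips the Python loop entirely:
model = load_trained_model(weights_path)        # load the model once
features = data.values.astype(np.float32)       # shape: (n_rows, n_features)
total_estKwh = model.predict(features)          # one batched prediction
total = total_estKwh.sum()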

Loading Cassandra Data into Dask Dataframe

I am trying to load data from a Cassandra database into a Dask dataframe. I have tried the following query with no success:
query="""SELECT * FROM document_table"""
df = man.session.execute(query)
df = dd.DataFrame(list(df))
TypeError Traceback (most recent call last)
<ipython-input-135-021507f6f2ab> in <module>()
----> 1 a = dd.DataFrame(list(df))
TypeError: __init__() missing 3 required positional arguments: 'name', 'meta', and 'divisions'
Does anybody know an easy way to load data directly from Cassandra into Dask? It is too much data to load into pandas first.
Some problems with your code:
the line df = man.session.execute(query) presumably loads the whole data-set into memory - Dask is not invoked here, it plays no part in this (someone with knowledge of the Cassandra driver can confirm)
list(df) produces a list of the column names of a dataframe and drops all the data
dd.DataFrame, if you read the docs, is not constructed like this.
What you probably want to do is a) make a function that returns one partition of the data, b) delay this function and call it with the various values of the partitions, and c) use dd.from_delayed to make the dask dataframe. E.g., assuming the table has a field partfield which handily has possible values 1..6 and a similar number of rows for each partition:
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def part(x):
    session = ...  # construct the Cassandra session here
    q = "SELECT * FROM document_table WHERE partfield={}".format(x)
    return pd.DataFrame(list(session.execute(q)))
parts = [part(x) for x in range(1, 7)]
df = dd.from_delayed(parts)
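Continuing from the snippet above, if the column names and dtypes are known up front, from_delayed can also be given a meta frame so dask does not have to inspect a partition to infer them. A sketch with a purely hypothetical schema for document_table:
# hypothetical schema - replace with document_table's real columns and dtypes
meta = pd.DataFrame({'doc_id': pd.Series(dtype='int64'),
                     'body': pd.Series(dtype='object')})
df = dd.from_delayed(parts, meta=meta)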

Spark dataframe reduceByKey

I am using Spark 1.5/1.6, where I want to do a reduceByKey operation on a DataFrame; I don't want to convert the df to an rdd.
Each row looks like the following, and I have multiple rows for id1:
id1, id2, score, time
I want to have something like:
id1, [ (id21, score21, time21), (id22, score22, time22), (id23, score23, time23) ]
So, for each "id1", I want all records in a list.
By the way, the reason I don't want to convert the df to an rdd is that I have to join this (reduced) dataframe to another dataframe, and I am re-partitioning on the join key, which makes it faster; I guess the same cannot be done with an rdd.
Any help will be appreciated.
To simply preserve the partitioning already achieved, re-use the parent RDD's partitioner in the reduceByKey invocation:
val rdd = df.rdd // RDD[Row]: map it to (key, value) pairs before reducing
val parentRdd = rdd.dependencies(0).rdd // Assuming the first parent has the
                                        // desired partitioning: adjust as needed
val parentPartitioner = parentRdd.partitioner.get
val optimizedReducedRdd = rdd.reduceByKey(parentPartitioner, reduceFn)
If you do not specify the partitioner, as follows:
df.rdd.reduceByKey(reduceFn) // Non-optimized: uses a full shuffle
then the behavior you noted would occur - i.e. a full shuffle occurs. That is because the HashPartitioner would be used instead.
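As a side note (not part of the answer above), the per-id1 lists the question describes can be produced directly in PySpark by keying the rows on id1 and grouping, at the cost of going through the RDD API the asker wanted to avoid. A sketch using the column names from the question:
# key each row on id1 and collect the remaining fields into per-key lists
pairs = df.rdd.map(lambda r: (r['id1'], (r['id2'], r['score'], r['time'])))
grouped = pairs.groupByKey().mapValues(list)   # id1 -> [(id2, score, time), ...]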

Joining two spark dataframes on time (TimestampType) in python

I have two dataframes and I would like to join them based on one column, with a caveat that this column is a timestamp, and that timestamp has to be within a certain offset (5 seconds) in order to join records. More specifically, a record in dates_df with date=1/3/2015:00:00:00 should be joined with events_df with time=1/3/2015:00:00:01 because both timestamps are within 5 seconds from each other.
I'm trying to get this logic working with python spark, and it is extremely painful. How do people do joins like this in spark?
My approach is to add two extra columns to dates_df that will determine the lower_timestamp and upper_timestamp bounds with a 5-second offset, and perform a conditional join. And this is where it fails, more specifically:
joined_df = dates_df.join(events_df,
                          dates_df.lower_timestamp < events_df.time < dates_df.upper_timestamp)
joined_df.explain()
Captures only the last part of the query:
Filter (time#6 < upper_timestamp#4)
CartesianProduct
....
and it gives me a wrong result.
Do I really have to do a full blown cartesian join for each inequality, removing duplicates as I go along?
Here is the full code:
from datetime import datetime, timedelta
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import udf
master = 'local[*]'
app_name = 'stackoverflow_join'
conf = SparkConf().setAppName(app_name).setMaster(master)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
def lower_range_func(x, offset=5):
    return x - timedelta(seconds=offset)

def upper_range_func(x, offset=5):
    return x + timedelta(seconds=offset)

lower_range = udf(lower_range_func, TimestampType())
upper_range = udf(upper_range_func, TimestampType())
dates_fields = [StructField("name", StringType(), True), StructField("date", TimestampType(), True)]
dates_schema = StructType(dates_fields)
dates = [('day_%s' % x, datetime(year=2015, day=x, month=1)) for x in range(1,5)]
dates_df = sqlContext.createDataFrame(dates, dates_schema)
dates_df.show()
# extend dates_df with time ranges
dates_df = dates_df.withColumn('lower_timestamp', lower_range(dates_df['date'])).\
                    withColumn('upper_timestamp', upper_range(dates_df['date']))
event_fields = [StructField("time", TimestampType(), True), StructField("event", StringType(), True)]
event_schema = StructType(event_fields)
events = [(datetime(year=2015, day=3, month=1, second=3), 'meeting')]
events_df = sqlContext.createDataFrame(events, event_schema)
events_df.show()
# finally, join the data
joined_df = dates_df.join(events_df,
                          dates_df.lower_timestamp < events_df.time < dates_df.upper_timestamp)
joined_df.show()
I get the following output:
+-----+--------------------+
| name| date|
+-----+--------------------+
|day_1|2015-01-01 00:00:...|
|day_2|2015-01-02 00:00:...|
|day_3|2015-01-03 00:00:...|
|day_4|2015-01-04 00:00:...|
+-----+--------------------+
+--------------------+-------+
| time| event|
+--------------------+-------+
|2015-01-03 00:00:...|meeting|
+--------------------+-------+
+-----+--------------------+--------------------+--------------------+--------------------+-------+
| name| date| lower_timestamp| upper_timestamp| time| event|
+-----+--------------------+--------------------+--------------------+--------------------+-------+
|day_3|2015-01-03 00:00:...|2015-01-02 23:59:...|2015-01-03 00:00:...|2015-01-03 00:00:...|meeting|
|day_4|2015-01-04 00:00:...|2015-01-03 23:59:...|2015-01-04 00:00:...|2015-01-03 00:00:...|meeting|
+-----+--------------------+--------------------+--------------------+--------------------+-------+
I did a Spark SQL query with explain() to see how it is done, and replicated the same behavior in Python. First, here is how to do the same with Spark SQL:
dates_df.registerTempTable("dates")
events_df.registerTempTable("events")
results = sqlContext.sql("SELECT * FROM dates INNER JOIN events ON dates.lower_timestamp < events.time and events.time < dates.upper_timestamp")
results.explain()
This works, but the question was about how to do it in python, so the solution seems to be just a plain join, followed by two filters:
joined_df = dates_df.join(events_df).filter(dates_df.lower_timestamp < events_df.time).filter(events_df.time < dates_df.upper_timestamp)
joined_df.explain() yields the same query as the Spark SQL results.explain(), so I assume this is how things are done.
Although this is a year later, it might help others.
As you said, a full cartesian product is insane in your case. Your matching records will be close in time (within 5 seconds), so you can take advantage of that and save a lot of time if you first group records into buckets based on their timestamp, then join the two dataframes on that bucket, and only then apply the filter. Using that method causes Spark to use a SortMergeJoin instead of a CartesianProduct and greatly boosts performance.
There is a small caveat here - you must match to both the bucket and the next one.
It's better explained in my blog, with working code examples (Scala + Spark 2.0, but you can implement the same in Python too...):
http://zachmoshe.com/2016/09/26/efficient-range-joins-with-spark.html
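A rough PySpark sketch of that bucketing idea, assuming the bucket length is chosen equal to the full 10-second window so that each window overlaps at most its own bucket and the next one; everything beyond the column names from the question is illustrative:
from pyspark.sql import functions as F

bucket = 10  # seconds: length of the [lower_timestamp, upper_timestamp] window

# each event falls into exactly one bucket
events_b = events_df.withColumn(
    'bucket', F.floor(F.unix_timestamp('time') / bucket))

# each date row is duplicated into its own bucket and the next one
dates_b = dates_df.withColumn(
    'bucket', F.explode(F.array(
        F.floor(F.unix_timestamp('lower_timestamp') / bucket),
        F.floor(F.unix_timestamp('lower_timestamp') / bucket) + 1)))

# equi-join on the bucket (SortMergeJoin), then apply the exact range filter
joined = (dates_b.join(events_b, 'bucket')
          .filter((F.col('lower_timestamp') < F.col('time')) &
                  (F.col('time') < F.col('upper_timestamp'))))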
