How to perform a Diff on a Dask SeriesGroup object - dask

I have a multi-index dask dataframe on which I need to perform a groupby followed by a diff. This operation is trivial in pure pandas via the following command:
df.groupby('IndexName')['ValueName'].diff().
Dask, however, doesn't implement the diff function on SeriesGroupBy objects. I've attempted to implement my own with the following command:
df.groupby('IndexName')['ValueName'].apply(lambda x: x.diff(1) )
but this yields the following error:
ValueError: Wrong number of items passed 0, placement implies 3987
Any ideas?
Below is the sample dataframe:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask import delayed

dummy = {
    'Index1': pd.DataFrame({'A': np.arange(10), 'ValueName': np.random.rand(10)}),
    'Index2': pd.DataFrame({'A': np.arange(5), 'ValueName': np.random.rand(5)})
}
pdf = pd.concat(dummy, names=['IndexName'])

def getDummy(f):
    return f

dfs = [delayed(getDummy)(f) for f in [pdf]]
# NOTE: dd.from_pandas doesn't support a MultiIndex...but delayed does
df = dd.from_delayed(dfs)
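One possible workaround (an untested sketch on my part, not an official dask SeriesGroupBy.diff): keep the groupby-apply but pass an explicit meta so dask knows the result is a single float column; a missing meta is a common source of "Wrong number of items passed" errors in groupby.apply.

# Sketch only: mirrors the apply attempt above, adding meta so dask can infer the schema.
result = df.groupby('IndexName')['ValueName'].apply(
    lambda s: s.diff(),          # plain pandas diff within each group
    meta=('ValueName', 'f8'),    # declare the output: one float64 series
)
print(result.compute())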

Related

Question about dask output when using dask.array.map_overlap

I would like to use dask.array.map_overlap to deal with a scipy interpolation function. However, I keep running into errors that I cannot understand, and I am hoping someone can explain them to me.
Here is the error message I receive when I run .compute():
ValueError: could not broadcast input array from shape (1070,0) into shape (1045,0)
To resolve the issue, I started using .to_delayed() to check each partition's output, and this is what I found.
Following is my Python code.
Step 1. Load the netCDF file through xarray, then convert it to a dask.array with chunk size (400, 400)
df = xr.open_dataset('./Brazil Sentinal2 Tile/' + data_file +'.nc')
lon, lat = df['lon'].data, df['lat'].data
slon = da.from_array(df['lon'], chunks=(400,400))
slat = da.from_array(df['lat'], chunks=(400,400))
data = da.from_array(df.isel(band=0).__xarray_dataarray_variable__.data, chunks=(400,400))
Step 2. declare a function for da.map_overlap use
def sumsum2(lon, lat, data, hex_res=10):
    hex_col = 'hex' + str(hex_res)
    lon_max, lon_min = lon.max(), lon.min()
    lat_max, lat_min = lat.max(), lat.min()
    b = box(lon_min, lat_min, lon_max, lat_max, ccw=True)
    b = transform(lambda x, y: (y, x), b)
    b = mapping(b)
    target_df = pd.DataFrame(h3.polyfill(b, hex_res), columns=[hex_col])
    target_df['lat'] = target_df[hex_col].apply(lambda x: h3.h3_to_geo(x)[0])
    target_df['lon'] = target_df[hex_col].apply(lambda x: h3.h3_to_geo(x)[1])
    tlon, tlat = target_df[['lon', 'lat']].values.T
    abc = lNDI(points=(lon.ravel(), lat.ravel()),
               values=data.ravel())(tlon, tlat)
    target_df['out'] = abc
    print(np.stack([tlon, tlat, abc], axis=1).shape)
    return np.stack([tlon, tlat, abc], axis=1)
Step 3. Apply the da.map_overlap
b = da.map_overlap(sumsum2, slon[:1200,:1200], slat[:1200,:1200], data[:1200,:1200],
                   depth=10, trim=True, boundary=None, align_arrays=False, dtype='float64')
Step 4. Using to_delayed() to test output shape
print(b.to_delayed().flatten()[0].compute().shape, )
print(b.to_delayed().flatten()[1].compute().shape)
(1065, 3)
(1045, 0)
(1090, 3)
(1070, 0)
This says that the output of da.map_overlap is effectively empty ( shapes (1045, 0) and (1070, 0) ), while the output I prepare inside sumsum2 is 2-D ( shapes (1065, 3) and (1090, 3) ).
In addition, if I turn off the trim argument:
c = da.map_overlap(sumsum2,
                   slon[:1200,:1200],
                   slat[:1200,:1200],
                   data[:1200,:1200],
                   depth=10,
                   trim=False,
                   boundary=None,
                   align_arrays=False,
                   dtype='float64')
print(c.to_delayed().flatten()[0].compute().shape, )
print(c.to_delayed().flatten()[1].compute().shape)
The output becomes
(1065, 3)
(1065, 3)
(1090, 3)
(1090, 3)
Does this mean that with trim=True everything gets cut out?
because...
#-- print out the values
b.to_delayed().flatten()[0].compute()[:10,:]
(1065, 3)
array([], shape=(1045, 0), dtype=float64)
while...
#-- print out the values
c.to_delayed().flatten()[0].compute()[:10,:]
array([[ -47.83683837, -18.98359832, 1395.01848583],
[ -47.8482856 , -18.99038681, 2663.68391094],
[ -47.82800624, -18.99207069, 1465.56517187],
[ -47.81897323, -18.97919009, 2769.91556363],
[ -47.82066663, -19.00712956, 1607.85927095],
[ -47.82696896, -18.97167714, 2110.7516765 ],
[ -47.81562653, -18.98302933, 2662.72112163],
[ -47.82176881, -18.98594465, 2201.83205114],
[ -47.84567 , -18.97512514, 1283.20631652],
[ -47.84343568, -18.97270783, 1282.92117225]])
Any thoughts for this?
Thank You.
I guess I got the answer. Please let me know if I am wrong.
The reason I am not allowed to use trim=True is that I change the shape of the output array (after searching online, I noticed that the output array's shape should match the input array's shape). Since I change the shape, dask has no idea how to trim it, so it returns an empty array (weird).
With trim=False, since I am not asking dask to cut out the buffer zone, the return values come through intact (although I still don't know why dask cannot concatenate the chunked arrays; I believe that is also related to the shape change).
The solution is to wrap da.concatenate in a delayed call:
delayed(da.concatenate)([e.to_delayed().flatten()[idx] for idx in range(len(e.to_delayed().flatten()))])
In this case we are not relying on the concatenation inside map_overlap; we use our own concatenation to combine the outputs we want.
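For concreteness, here is a slightly expanded sketch of that reassembly (my own illustration, taking e to be the trim=False result called c above; I use np.concatenate rather than da.concatenate simply so that a single compute() returns a plain numpy array):

import numpy as np
from dask import delayed

blocks = c.to_delayed().flatten()
# Each block computes to an (n_i, 3) numpy array; stitch them together ourselves
# instead of relying on map_overlap's own trimming/concatenation.
combined = delayed(np.concatenate)([blk for blk in blocks], axis=0)
result = combined.compute()   # one array of shape (sum(n_i), 3)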

Error with bagImpute and predict from caret package

I get the following error when trying to use the preProcess function from the caret package. The predict function works for knn and median imputation, but gives an error for bagging. How should I edit my call to the predict function?
Reproducible example:
library(caret)
data = iris
set.seed(1)
data = as.data.frame(lapply(data, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(cc), replace = TRUE) ]))
preprocess_values = preProcess(data, method = c("bagImpute"), verbose = TRUE)
data_new = predict(preprocess_values, data)
This gives the following error:
> data_new = predict(preprocess_values, data)
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "NULL"
The preprocessing/imputation functions in caret work only for numerical variables.
As stated in the help of preProcess
x a matrix or data frame. Non-numeric predictors are allowed but will be ignored.
You most likely found a bug in the part that should ignore the non-numeric variables; it throws an uninformative error instead of ignoring them.
If you remove the factor variable, the imputation works:
library(caret)
df <- iris
set.seed(1)
df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.8, 0.2), size = length(cc), replace = TRUE) ]))
df <- df[,-5] #remove factor variable
preprocess_values <- preProcess(df, method = c("bagImpute"), verbose = TRUE)
data_new <- predict(preprocess_values, df)
The last line of code works but results in a bunch of warnings:
In cprob[tindx] + pred :
longer object length is not a multiple of shorter object length
These warnings are not from caret but from ipred::bagging, which caret::preProcess calls internally. The cause is rows in the data that contain 3 NA values; when those rows are removed
df <- df[rowSums(sapply(df, is.na)) != 3,]
preprocess_values <- preProcess(df, method = c("bagImpute"), verbose = TRUE)
data_new <- predict(preprocess_values, df)
the warnings disappear.
You should check out recipes, and specifically step_bagimpute, to overcome the above-mentioned limitations.

tensorflow inference graph performance optimization

I am trying to understand certain surprising results I see when implementing a TF graph.
The graph I am working with is just a forest (a bunch of trees). This is a plain forward-inference graph, nothing related to training. I am sharing snippets of the two implementations.
code snippet 1:
with tf.name_scope("main"):
def get_tree_output(offset):
loop_vars = (offset,)
leaf_indice = tf.while_loop(cond,
body,
loop_vars,
back_prop=False,
parallel_iterations=1,
name="while_loop")
tree_score = tf.gather(score_tensor, leaf_indice, name="tree-scores")
output = tf.add(tree_score, output)
leaf_indices = tf.map_fn(get_tree_output,
tree_offsets_tensor,
dtype=INT_TYPE,
parallel_iterations=n_trees,
back_prop=False,
name="tree-scores")
tree_scores = tf.gather(score_tensor, leaf_indices, name="tree-scores")
output = tf.reduce_sum(tree_scores, name="sum-output")
output = tf.sigmoid(output, name="sigmoid-output")
code snippet 2:
with tf.name_scope("main"):
tree_offsets_tensor = tf.constant(tree_offsets, dtype=INT_TYPE, name="tree_offsets_tensor")
loop_vars = (tree_offsets_tensor,)
leaf_indices = tf.while_loop(cond,
body,
loop_vars,
back_prop=False,
parallel_iterations=n_trees,
name="while_loop")
tree_scores = tf.gather(score_tensor, leaf_indices, name="tree-scores")
output = tf.reduce_sum(tree_scores, name="sum-output")
output = tf.sigmoid(output, name="sigmoid-output")
The rest of the code is exactly the same: the constant tensors, the variables, and the condition and body for the while loop. Threading and parallelism settings were also the same in both cases.
Code snippet 2 takes about 500 microseconds to do inference.
Code snippet 1 takes about 12 milliseconds to do inference.
The difference is that in snippet 1 I use map_fn to operate on tree_offsets_tensor, whereas in snippet 2 I get rid of the map_fn and use the tensor directly. So, as I understand it, in snippet 1 get_tree_output is called with one element of tree_offsets_tensor at a time, and we end up with a separate while_loop for each individual offset value, whereas in snippet 2 there is a single while_loop that takes all the offset values at once (basically the whole offset tensor).
I also tried another variation of snippet 1: instead of using map_fn, I wrote a hand-written for loop.
code snippet 1 (for-loop variation):
output = 0
with tf.name_scope("main"):
    for offset in tree_offsets:
        loop_vars = (offset,)
        leaf_indice = tf.while_loop(cond,
                                    body,
                                    loop_vars,
                                    back_prop=False,
                                    parallel_iterations=1,
                                    name="while_loop")
        tree_score = tf.gather(score_tensor, leaf_indice, name="tree-scores")
        output = tf.add(tree_score, output)
    # leaf_indices = tf.map_fn(get_tree_output,
    #                          tree_offsets_tensor, dtype=INT_TYPE,
    #                          parallel_iterations=n_trees, back_prop=False,
    #                          name="tree-scores")
    # tree_scores = tf.gather(score_tensor, leaf_indices, name="tree-scores")
    # output = tf.reduce_sum(tree_scores, name="sum-output")
    output = tf.sigmoid(output, name="sigmoid-output")
This gives a minor improvement: about 9 milliseconds.

how to convert a pandas str.split call to dask

I have a dask data frame where the index is a string which looks like this:
12/09/2016 00:00;32.0046;-106.259
12/09/2016 00:00;32.0201;-108.838
12/09/2016 00:00;32.0224;-106.004
(it's basically a string encoding the datetime;latitude;longitude of the row)
I'd like to split that while still in the dask context to individual columns representing each of the fields.
I can do that with a pandas dataframe as:
df['date'], df['Lat'], df['Lon'] = df.index.str.split(';', 2).str
But that doesn't work in dask in any of the ways I've tried. If I directly substitute a dask df for the pandas df, I get the error:
'Index' object has no attribute 'str'
If I use the column name instead of index as:
forecastDf['date'], forecastDf['Lat'], forecastDf['Lon'] = forecastDf['dateLocation'].str.split(';', 2).str
I get the error:
TypeError: 'StringAccessor' object is not iterable
Here is a runnable example of this working in pandas:
import pandas as pd
df = pd.DataFrame()
df['dateLocation'] = ['12/09/2016 00:00;32.0046;-106.259','12/09/2016 00:00;32.0201;-108.838','12/09/2016 00:00;32.0224;-106.004']
df = df.set_index('dateLocation')
df['date'], df['Lat'], df['Lon'] = df.index.str.split(';', 2).str
df.head()
Here is the error I get if I directly convert that to dask
import dask.dataframe as dd
dd = dd.from_pandas(df, npartitions=1)
dd['date'], dd['Lat'], dd['Lon'] = dd.index.str.split(';', 2).str
>>TypeError: 'StringAccessor' object is not iterable
forecastDf['date'] = forecastDf['dateLocation'].str.partition(';')[0]
rest = forecastDf['dateLocation'].str.partition(';')[2]  # everything after the first ';'
forecastDf['Lat'] = rest.str.partition(';')[0]
forecastDf['Lon'] = rest.str.partition(';')[2]
Let me know if this works for you!
First make sure the column is string dtype
forecastDD['dateLocation'] = forecastDD['dateLocation'].astype('str')
Then you can use this to split in dask
splitColumns = client.persist(forecastDD['dateLocation'].str.split(';',2))
You can then index the columns in the new dataframe splitColumns and add them back to the original data frame.
forecastDD = forecastDD.assign(
    date=splitColumns.apply(lambda x: x[0], meta=('date', 'object')),
    Lat=splitColumns.apply(lambda x: x[1], meta=('Lat', 'f8')),
    Lon=splitColumns.apply(lambda x: x[2], meta=('Lon', 'f8')))
Unfortunately I couldn't figure out how to do it without calling compute and creating the temp dataframe.
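As a further note, here is a sketch of my own (assuming a dask version where str.split supports expand=True; older versions may not): splitting with n=2 and expand=True yields the three pieces as columns directly, which avoids the per-element apply calls.

# Sketch only: n=2 tells dask how many columns to expect, so it can build the metadata.
parts = forecastDD['dateLocation'].str.split(';', n=2, expand=True)
forecastDD = forecastDD.assign(
    date=parts[0],
    Lat=parts[1].astype('float64'),
    Lon=parts[2].astype('float64'),
)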

Polynomial regression in spark/ or external packages for spark

After investing a good amount of time searching the net on this topic, I am ending up here to see if I can get some pointers. Please read further.
After analyzing Spark 2.0, I concluded that polynomial regression is not possible with Spark alone. Is there some extension to Spark which can be used for polynomial regression?
- With RSpark it could be done (but I am looking for a better alternative)
- RFormula in Spark does prediction, but the coefficients are not available (which is my main requirement, as I am primarily interested in the coefficient values)
Polynomial regression is just a special case of linear regression (as in "Polynomial regression is linear regression" and "Polynomial regression"). Since Spark has a method for linear regression, you can call that method with the inputs transformed so that they suit polynomial regression. For instance, if you only have one independent variable x and you want quadratic regression, you have to change your independent input matrix to [x x^2].
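To make the idea concrete outside of Spark, here is a small numpy sketch of my own (not part of the original answer): expand x into [1, x, x^2] and solve an ordinary least-squares problem; the recovered coefficients are exactly the polynomial coefficients the asker is after.

import numpy as np

# Toy data: y = 1 + 2*x + 3*x^2 plus noise (illustration only)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1 + 2 * x + 3 * x**2 + rng.normal(scale=0.1, size=x.size)

# Expand the single feature into polynomial features [1, x, x^2]
X = np.column_stack([np.ones_like(x), x, x**2])

# Plain linear least squares on the expanded features = polynomial regression
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [1, 2, 3]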
I would like to add some information to @Mehdi Lamrani's answer:
If you want to do a polynomial linear regression in SparkML, you may use the class PolynomialExpansion.
For information check the class in the SparkML Doc
or in the Spark API Doc
Here is an implementation example:
Let's assume we have train and test datasets, stored in two CSV files, with headers containing the names of the columns (features, label).
Each dataset contains three features named f1, f2, f3, each of type Double (this is the X matrix), as well as a label feature (the Y vector) named myLabel.
For this code I used Spark+Scala:
Scala version : 2.12.8
Spark version 2.4.0.
We assume that the SparkML library was already added as a dependency in build.sbt.
First of all, import the libraries:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{PolynomialExpansion, VectorAssembler}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf
import org.apache.spark.{SparkConf, SparkContext}
Create Spark Session and Spark Context :
val ss = org.apache.spark.sql
.SparkSession.builder()
.master("local")
.appName("Read CSV")
.enableHiveSupport()
.getOrCreate()
val conf = new SparkConf().setAppName("test").setMaster("local[*]")
val sc = new SparkContext(conf)
Instantiate the variables you are going to use :
val f_train:String = "path/to/your/train_file.csv"
val f_test:String = "path/to/your/test_file.csv"
val degree:Int = 3 // Set the degree of your choice
val maxIter:Int = 10 // Set the max number of iterations
val lambda:Double = 0.0 // Set your lambda
val alpha:Double = 0.3 // Set the learning rate
Let's now create several UDFs, which will be used for data reading and pre-processing.
The type parameters of the udf toFeatures are the return type Vector followed by the types of the feature arguments: (Double, Double, Double).
val toFeatures = udf[Vector, Double, Double, Double] {
  (a, b, c) => Vectors.dense(a, b, c)
}
val encodeIntToDouble = udf[Double, Int](_.toDouble)
Now let's create a function which extracts the data from CSV and creates new features from the existing ones using PolynomialExpansion:
def getDataPolynomial(
    currentfile: String,
    sc: SparkSession,
    sco: SparkContext,
    degree: Int
): DataFrame = {
  val df_rough: DataFrame = sc.read
    .format("csv")
    .option("header", "true") // first line in file has headers
    .option("mode", "DROPMALFORMED")
    .option("inferSchema", value = true)
    .load(currentfile)
    .toDF("f1", "f2", "f3", "myLabel")
  // you may add or not the last line

  val df: DataFrame = df_rough
    .withColumn("featNormTemp", toFeatures(df_rough("f1"), df_rough("f2"), df_rough("f3")))
    .withColumn("label", encodeIntToDouble(df_rough("myLabel")))

  val polyExpansion = new PolynomialExpansion()
    .setInputCol("featNormTemp")
    .setOutputCol("polyFeatures")
    .setDegree(degree)

  val polyDF: DataFrame = polyExpansion.transform(df.select("featNormTemp"))
  val datafixedWithFeatures: DataFrame = polyDF.withColumn("features", polyDF("polyFeatures"))
  val datafixedWithFeaturesLabel = datafixedWithFeatures
    .join(df, df("featNormTemp") === datafixedWithFeatures("featNormTemp"))
    .select("label", "polyFeatures")
  datafixedWithFeaturesLabel
}
Now run the function for both the train and test datasets, using the chosen degree for the polynomial expansion.
val X:DataFrame = getDataPolynomial(f_train,ss,sc,degree)
val X_test:DataFrame = getDataPolynomial(f_test,ss,sc,degree)
Run the algorithm in order to get a linear regression model, using a pipeline:
val assembler = new VectorAssembler()
.setInputCols(Array("polyFeatures"))
.setOutputCol("features2")
val lr = new LinearRegression()
.setMaxIter(maxIter)
.setRegParam(lambda)
.setElasticNetParam(alpha)
.setFeaturesCol("features2")
.setLabelCol("label")
// Fit the model:
val pipeline:Pipeline = new Pipeline().setStages(Array(assembler,lr))
val lrModel:PipelineModel = pipeline.fit(X)
// Get prediction on the test set :
val result:DataFrame = lrModel.transform(X_test)
Finally, evaluate the result using the root mean squared error:
def leastSquaresError(result: DataFrame): Double = {
  val rm: RegressionMetrics = new RegressionMetrics(
    result
      .select("label", "prediction")
      .rdd
      .map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double])))
  Math.sqrt(rm.meanSquaredError)
}
val error:Double = leastSquaresError(result)
println("Error : "+error)
I hope this might be useful!
