ParamGridBuilder in PySpark does not work with LinearRegressionWithSGD - machine-learning

I'm trying to figure out why LinearRegressionWithSGD does not work with Spark's ParamGridBuilder. From the Spark documentation:
lr = LinearRegression(maxIter=10)
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()
However, simply swapping LinearRegression for LinearRegressionWithSGD does not work, and consequently SGD-specific parameters (such as iterations or miniBatchFraction) cannot be passed in either.
Thanks!!

That is because you are trying to mix functionality from two different libraries: LinearRegressionWithSGD comes from pyspark.mllib (i.e. the old, RDD-based API), while both LinearRegression & ParamGridBuilder come from pyspark.ml (the new, dataframe-based API).
Indeed, a few lines before the code snippet in the documentation you quote (BTW, in the future it would be good to provide a link, too) you'll find the line:
from pyspark.ml.regression import LinearRegression
while for LinearRegressionWithSGD you have used something like:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
These two libraries are not compatible: pyspark.mllib takes RDDs of LabeledPoint as input, which are not compatible with the dataframes used in pyspark.ml; and since ParamGridBuilder is part of the latter, it can only be used with dataframes, and not with the algorithms included in pyspark.mllib (check the documentation links provided above).
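For illustration only, here is a minimal sketch (not from the question; the toy data and the SparkContext variable sc are assumptions) of how the old API is driven: training happens through a static train() call on an RDD of LabeledPoint, so there is no Estimator/Param interface for ParamGridBuilder to iterate over.
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# Toy RDD of LabeledPoint -- the only input format pyspark.mllib accepts
train_rdd = sc.parallelize([
    LabeledPoint(1.0, [1.0, 0.5]),
    LabeledPoint(0.0, [0.3, 0.1]),
])

# SGD-specific settings go straight into train(); nothing here exposes Params
sgd_model = LinearRegressionWithSGD.train(train_rdd, iterations=100, miniBatchFraction=1.0)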
Moreover, keep in mind that LinearRegressionWithSGD is deprecated in Spark 2:
Note: Deprecated in 2.0.0. Use ml.regression.LinearRegression or LBFGS.
UPDATE: Thanks to @rvisio's comment below, we know now that, although undocumented, one can actually use solver='sgd' for LinearRegression in pyspark.ml; here is a short example adapted from the docs:
spark.version
# u'2.2.0'
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression
df = spark.createDataFrame([
    (1.0, 2.0, Vectors.dense(1.0)),
    (0.0, 2.0, Vectors.sparse(1, [], []))], ["label", "weight", "features"])
lr = LinearRegression(maxIter=5, regParam=0.0, solver="sgd", weightCol="weight") # solver='sgd'
model = lr.fit(df) # works OK
lr.getSolver()
# 'sgd'
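Since the original goal was hyperparameter tuning, a hedged sketch of wiring the sgd solver into ParamGridBuilder and CrossValidator follows; the grid values and numFolds=2 are arbitrary placeholders, and the two-row toy df above is far too small for meaningful cross-validation, but it shows the plumbing:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

lr = LinearRegression(solver="sgd", weightCol="weight")

# Grid over standard Params exposed by the pyspark.ml estimator
paramGrid = ParamGridBuilder() \
    .addGrid(lr.maxIter, [5, 10]) \
    .addGrid(lr.fitIntercept, [False, True]) \
    .build()

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=RegressionEvaluator(),  # defaults to RMSE
                    numFolds=2)

cv_model = cv.fit(df)  # in practice, use a realistically sized DataFrame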

Related

How to know the shape of sparse tensor in tensorflow 2.8

I am trying to understand the code given here by Google. It has the line below in the function def build_model(ratings, embedding_dim=3, init_stddev=1.):
U = tf.Variable(tf.random_normal(
    [A_train.dense_shape[0], embedding_dim], stddev=init_stddev))
It's assigning random values to the user vector U. What is not clear is where A_train.dense_shape[0] gets its value from. All the online documentation states that without using session.run we can't get the value out of a tensor; since I am using TensorFlow 2.8, hopefully we can get the values without session.run. The problem is that when I try to print it, inside or outside the function, I am not getting a satisfactory result even with TensorFlow 2.x.
Below are all the print calls that I have tried:
tf.print(A_train.dense_shape[0])
print(A_train.dense_shape[0])
Any suggestions on what I am doing wrong here? My TensorFlow version is 2.8.2.
When we write tf.print(A_train.dense_shape[0]), the computation is still a node in a graph; that graph must then be executed, which we can do with the code below:
trr, ter = split_dataframe(ratings)  ## this function is defined in the colab notebook given by Google
A_trr = build_rating_sparse_tensor(trr)  ## this function is defined in the colab notebook given by Google
A_trr_shape = tf.print(A_trr.dense_shape[1])  ## print the output
with tf.Session() as sess:
    sess.run(A_trr_shape)  ## execute the shape graph
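As a side note, the Session-based snippet above relies on TF1-style graph execution. If eager execution is active (the default in TensorFlow 2.x, unless the notebook disables it via the compat.v1 setup), the dense_shape of a SparseTensor can be read directly without any Session. A minimal sketch with made-up data:
import tensorflow as tf  # 2.x, eager execution on by default

# Toy 3x4 sparse tensor (indices and values are arbitrary)
A_train = tf.sparse.SparseTensor(indices=[[0, 1], [2, 3]],
                                 values=[1.0, 2.0],
                                 dense_shape=[3, 4])

print(A_train.dense_shape[0])       # tf.Tensor(3, shape=(), dtype=int64)
print(int(A_train.dense_shape[0]))  # 3 as a plain Python int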

How to fetch values using Permutation Feature Importance

I have a dataset with 5K (and 60 features) records focused on binary classification.
Please note that this solution doesn't work here
I am trying to generate feature importance using Permutation Feature Importance. However, I get the below error. Can you please look at my code and let me know whether I am making any mistake?
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
model = logreg.fit(X_train_std, y_train)
perm = PermutationImportance(model, random_state=1)
eli5.show_weights(perm, feature_names=X.columns.tolist())
I get an error like the one shown below:
AttributeError: 'PermutationImportance' object has no attribute 'feature_importances_'
Can you help me resolve this error?
If you look at the attributes of the PermutationImportance object via
dir(perm)
you can see all attributes and methods, but only after you fit your PI object, meaning that you need to do:
perm = PermutationImportance(model, random_state=1).fit(X_train_std, y_train)
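Putting it together, a minimal end-to-end sketch (the synthetic data and variable names are placeholders standing in for the 5K x 60 dataset, assuming eli5 and scikit-learn are installed):
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real binary-classification data
X, y = make_classification(n_samples=5000, n_features=60, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# fit() is what populates feature_importances_ on the PermutationImportance object
perm = PermutationImportance(model, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm)  # pass feature_names=... when working from a DataFrame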

dask.distributed not utilising the cluster

I'm not able to process this block using the distributed cluster.
import pandas as pd
from dask import dataframe as dd
import dask

df = pd.DataFrame({'reid_encod': [[1,2,3,4,5,6,7,8,9,10],
                                  [1,2,3,4,5,6,7,8,9,10],
                                  [1,2,3,4,5,6,7,8,9,10],
                                  [1,2,3,4,5,6,7,8,9,10],
                                  [1,2,3,4,5,6,7,8,9,10],
                                  [1,2,3,4,5,6,7,8,9,10]]})
dask_df = dd.from_pandas(df, npartitions=3)

save_val = []

def add(dask_df):
    for _, outer_row in dask_df.iterrows():
        for _, inner_row in dask_df.iterrows():
            for base_encod in outer_row['reid_encod']:
                for compare_encod in inner_row['reid_encod']:
                    val = base_encod + compare_encod
                    save_val.append(val)
    return save_val

from dask.distributed import Client
client = Client(...)

dask_compute = dask.delayed(add)(dask_df)
dask_compute.compute()
Also I have a few queries:
Does dask.delayed use the available clusters to do the computation.
Can I parallelize the for loop iteration of this pandas DF using delayed, and use multiple computers present in the cluster to do computations.
does dask.distributed work on pandas dataframe.
can we use dask.delayed in dask.distributed.
If the above programming approach is wrong, can you guide me whether to choose delayed or dask DF for the above scenario.
For the record, some answers follow, although I wish to note my earlier general points about this question.
Does dask.delayed use the available clusters to do the computation.
If you have created a client to a distributed cluster, dask will use it for computation unless you specify otherwise.
Can I paralleize the for loop iteratition of this pandas DF using delayed, and use multiple computers present in the cluster to do computations.
Yes, in general you can use delayed with pandas dataframes for parallelism if you wish. However, your dataframe only has a few (identical) rows, so it is not obvious in this case how; it depends on what you really want to achieve.
does dask.distributed work on pandas dataframe.
Yes, you can do anything that Python can do with distributed, since it is just Python processes executing code. Whether it brings you the performance you are after is a separate question.
can we use dask.delayed in dask.distributed.
Yes, distributed can execute anything that dask in general can, including delayed functions/objects
If the above programming approach is wrong, can you guide me whether to choose delayed or dask DF for the above scenario.
Not easily: it is not clear to me that this is a dataframe operation at all. It seems more like an array operation, but, again, I note that your function does not actually return anything useful at all.
In the tutorial: passing pandas dataframes to delayed; the same applies to the dataframe API.
The main problem with your code is sketched in this section of the best practices: don't pass Dask collections to delayed functions. This means you should use either the delayed API or the dataframe API. While you can convert between dataframes and delayed objects, simply passing one into a delayed function like this is not recommended.
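As a generic sketch of the recommended pattern (not a fix for the specific computation above): hand delayed functions the underlying pandas pieces via to_delayed(), rather than the Dask collection itself. The aggregation below is arbitrary, purely for illustration:
import dask

# Each element is a dask.delayed wrapping a plain pandas DataFrame partition
partitions = dask_df.to_delayed()

@dask.delayed
def process(pdf):
    # pdf is an ordinary pandas DataFrame here; sum every list in the column
    return pdf['reid_encod'].apply(sum).sum()

totals = dask.compute(*[process(p) for p in partitions])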
Furthermore,
your dataframe only has a handful of rows, so you get essentially no parallelism whatsoever; you can only slow things down like this.
this appears to be an everything-to-everything (N^2) operation, so if you had many rows (the normal case for Dask), it would presumably take extremely long, no matter how many cores you used
passing lists in a pandas row is not a great idea; perhaps you wanted to use an array? (see the sketch after this list)
the function doesn't return anything useful, so it's not at all clear what you are trying to achieve. Under the description of MVCE, you will see references to "expected outcome" and "what went wrong". To get more help, please be more precise.
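To illustrate the array point, here is a hedged sketch (an assumption about intent, not the author's code) of computing the same everything-to-everything sums with dask.array and broadcasting instead of nested Python loops:
import numpy as np
import dask.array as da

# Stack the per-row lists from the question's df into a plain 2-D array: shape (n_rows, 10)
encod = np.array(df['reid_encod'].tolist())

# Lazy dask array, chunked along the rows
d_encod = da.from_array(encod, chunks=(2, 10))

# Every element of every row added to every element of every row: shape (n, n, 10, 10)
pairwise = d_encod[:, None, :, None] + d_encod[None, :, None, :]

result = pairwise.compute()  # runs on the distributed client if one is active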

Example of tf.Estimator with model parallel execution

I am currently experimenting with distributed tensorflow.
I am using the tf.estimator.Estimator class (custom model function) together with tf.contrib.learn.Experiment and managed to get a working data parallel execution.
However, I would now like to try model parallel execution. I was not able to find any example for that, except Implementation of model parallelism in tensorflow.
But I am not sure how to implement this using tf.estimators (e.g. how to deal with the input functions?).
Does anybody have any experience with it or can provide a working example?
First up, you should stop using tf.contrib.learn.Estimator in favor of tf.estimator.Estimator, because contrib is an experimental module, and classes that have graduated to the core API (such as Estimator) automatically get their contrib versions deprecated.
Now, back to your main question: you can create a distributed model and pass it via the model_fn parameter of tf.estimator.Estimator.__init__.
def my_model(features, labels, mode):
    net = features[X_FEATURE]
    with tf.device('/device:GPU:1'):
        for units in [10, 20, 10]:
            net = tf.layers.dense(net, units=units, activation=tf.nn.relu)
            net = tf.layers.dropout(net, rate=0.1)
    with tf.device('/device:GPU:2'):
        logits = tf.layers.dense(net, 3, activation=None)
        onehot_labels = tf.one_hot(labels, 3, 1, 0)
        loss = tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels,
                                               logits=logits)
        optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

[...]

classifier = tf.estimator.Estimator(model_fn=my_model)
The model above defines 6 layers with /device:GPU:1 placement and 3 other layers with /device:GPU:2 placement. The return value of my_model function should be an EstimatorSpec instance. A complete working example can be found in tensorflow examples.
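Regarding the input functions from the question: device placement lives entirely inside model_fn, so the input side is unchanged from the data parallel case. A hedged sketch using the TF 1.x numpy input-fn helper (the arrays are placeholders, and X_FEATURE is assumed to be whatever feature key my_model reads):
import numpy as np
import tensorflow as tf

# Placeholder arrays standing in for real training data
x_train = np.random.rand(100, 4).astype(np.float32)
y_train = np.random.randint(0, 3, size=100).astype(np.int32)

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={X_FEATURE: x_train},  # same feature key that my_model reads
    y=y_train,
    batch_size=16,
    num_epochs=None,
    shuffle=True)

classifier.train(input_fn=train_input_fn, steps=200)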

save binarizer together with sklearn model

I'm trying to build a service that has 2 components. In component 1, I train a machine learning model using sklearn by creating a Pipeline. This model gets serialized using joblib.dump (really numpy_pickle.dump). Component 2 runs in the cloud, loads the model trained by (1), and uses it to label text that it gets as input.
I'm running into an issue where, during training (component 1) I need to first binarize my data since it is text data, which means that the model is trained on binarized input and then makes predictions using the mapping created by the binarizer. I need to get this mapping back when (2) makes predictions based on the model so that I can output the actual text labels.
I tried adding the binarizer to the pipeline like this, thinking that the model would then have the mapping itself:
p = Pipeline([
    ('binarizer', MultiLabelBinarizer()),
    ('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(clf))
])
But I get the following error:
model = p.fit(training_features, training_tags)
*** TypeError: fit_transform() takes 2 positional arguments but 3 were given
My goal is to make sure the binarizer and model are tied together so that the consumer knows how to decode the model's output.
What are some existing paradigms for doing this? Should I be serializing the binarizer together with the model in some other object that I create? Is there some other way of passing the binarizer to Pipeline so that I don't have to do that, and would I be able to get the mappings back from the model if I did that?
Your intuition that you should add the MultiLabelBinarizer to the pipeline was the right way to solve this problem. It would have worked, except that MultiLabelBinarizer.fit_transform does not have the fit_transform(self, X, y=None) method signature that is now standard for sklearn estimators. Instead, it has a unique fit_transform(self, y) signature which I had never noticed before. As a result of this difference, when you call fit on the pipeline, it tries to pass training_tags as a third positional argument to a function with two positional arguments, which doesn't work.
The solution to this problem is tricky. The cleanest way I can think of to work around it is to create your own MultiLabelBinarizer that overrides fit_transform and ignores its third argument. Try something like the following.
class MyMLB(MultiLabelBinarizer):
    def fit_transform(self, X, y=None):
        return super(MyMLB, self).fit_transform(X)
Try adding this to your pipeline in place of the MultiLabelBinarizer and see what happens. If you're able to fit() the pipeline, the last problem that you'll have is that your new MyMLB class has to be importable on any system that will de-pickle your now trained, pickled pipeline object. The easiest way to do this is to put MyMLB into its own module and place a copy on the remote machine that will be de-pickling and executing the model. That should fix it.
I misunderstood how the MultiLabelBinarizer worked. It is a transformer of outputs, not of inputs. Not only does this explain the alternative fit_transform() method signature for that class, but it also makes it fundamentally incompatible with the idea of inclusion in a single classification pipeline which is limited to transforming inputs and making predictions of outputs. However, all is not lost!
Based on your question, you're already comfortable with serializing your model to disk as [some form of] a .pkl file. You should be able to also serialize a trained MultiLabelBinarizer, and then unpack it and use it to unpack the outputs from your pipeline. I know you're using joblib, but I'll write up this sample code as if you're using pickle. I believe the idea will still apply.
import pickle
from sklearn.preprocessing import MultiLabelBinarizer

X = <training_data>
y = <training_labels>

# Perform multi-label binarization on the class labels.
mlb = MultiLabelBinarizer()
multilabel_y = mlb.fit_transform(y)

p = Pipeline([
    ('vect', CountVectorizer(min_df=min_df, ngram_range=ngram_range)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(clf))
])

# Use the binarized labels to fit the pipeline.
p.fit(X, multilabel_y)

# Serialize both the pipeline and binarizer to disk.
with open('my_sklearn_objects.pkl', 'wb') as f:
    pickle.dump((mlb, p), f)
Then, after shipping the .pkl files to the remote box...
# Hydrate the serialized objects.
with open('my_sklearn_objects.pkl', 'rb') as f:
    mlb, p = pickle.load(f)

X = <input data>  # Get your input data from somewhere.

# Predict the classes using the pipeline.
mlb_predictions = p.predict(X)

# Turn those classes into labels using the binarizer.
classes = mlb.inverse_transform(mlb_predictions)

# Do something with predicted classes.
<...>
Is this the paradigm for doing this? As far as I know, yes. Not only that, but if you desire to keep them together (which is a good idea, I think) you can serialize them as a tuple as I did in the example above so they stay in a single file. No need to serialize a custom object or anything like that.
Model serialization via pickle et al. is the sklearn-approved way to save estimators between runs and move them between computers. I've used this process successfully many times before, including in production systems.
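Since the question uses joblib, the same idea with joblib instead of pickle looks roughly like this (a sketch; the file name is arbitrary, and on older scikit-learn versions the import may be from sklearn.externals import joblib):
import joblib

# Save the binarizer and the fitted pipeline together in one file.
joblib.dump((mlb, p), 'my_sklearn_objects.joblib')

# ...later, on the remote box:
mlb, p = joblib.load('my_sklearn_objects.joblib')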
