Adding prediction thresholds to MultilayerPerceptronClassifier class in PySpark - machine-learning

I am trying to optimize the prediction threshold for the MultilayerPerceptronClassifier in (Py)Spark using cross validation. I made a subclass of MultilayerPerceptronClassifier that actually allows thresholds to be provided. It seems to work in a regular Pipeline; however, whenever I plug it into a CrossValidator it raises an error.
The class I made:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.param.shared import HasThresholds

class MLP(MultilayerPerceptronClassifier, HasThresholds):
    def __init__(self, thresholds=None, **kwargs):
        super(MLP, self).__init__(**kwargs)
        self.setParams(thresholds=thresholds, **kwargs)

    def setParams(self, thresholds=None, **kwargs):
        return self._set(thresholds=thresholds, **kwargs)
Sample data (labeled):
+---------+-----+------+------------------------------+
|family_id|label|weight| embedded|
+---------+-----+------+------------------------------+
| 60009405| 1.0| 1.0|[0.10171283965701926,0.0415...|
| 55022499| 1.0| 1.0|[0.15376672673361091,-0.001...|
| 63938820| 1.0| 1.0|[0.16867649792968614,0.0126...|
| 37452877| 1.0| 1.0|[0.18771651450592225,0.0191...|
| 64559476| 1.0| 1.0|[0.1504634794488278,-0.0032...|
| 59544896| 0.0| 1.25|[0.12911133907668226,0.0116...|
| 46383793| 0.0| 1.25|[0.13390121417649795,-0.013...|
| 59473587| 0.0| 1.25|[0.1262944439844325,0.01176...|
| 63938820| 0.0| 1.25|[0.16867649792968614,0.0126...|
+---------+-----+------+------------------------------+
This seems to work correctly:
mlp = MLP(featuresCol='embedded', layers=[200, 10, 2], thresholds=[1e-20, 1-1e-20])
pipe = Pipeline(stages=[mlp])
model = pipe.fit(labeled)
model.transform(labeled).show(10)
+---------+-----+------+--------------------+--------------------+--------------------+----------+
|family_id|label|weight| embedded| rawPrediction| probability|prediction|
+---------+-----+------+--------------------+--------------------+--------------------+----------+
| 60009405| 1.0| 1.0|[0.10171283965701...|[-11.937067045534...|[6.74683311024104...| 0.0|
| 55022499| 1.0| 1.0|[0.15376672673361...|[-11.914377530833...|[7.32793349054270...| 0.0|
| 63938820| 1.0| 1.0|[0.16867649792968...|[-0.5160228904601...|[0.50001341804946...| 0.0|
| 37452877| 1.0| 1.0|[0.18771651450592...|[-10.034360656260...|[4.62078113096099...| 0.0|
| 64559476| 1.0| 1.0|[0.15046347944882...|[-11.971196504198...|[6.19667960173464...| 0.0|
| 59544896| 0.0| 1.25|[0.12911133907668...|[10.5489426088559...|[0.99999999980450...| 0.0|
| 46383793| 0.0| 1.25|[0.13390121417649...|[10.6067487531592...|[0.99999999982723...| 0.0|
| 59473587| 0.0| 1.25|[0.12629444398443...|[10.5199541406352...|[0.99999999979221...| 0.0|
| 63938820| 0.0| 1.25|[0.16867649792968...|[-0.5160228904601...|[0.50001341804946...| 0.0|
+---------+-----+------+--------------------+--------------------+--------------------+----------+
Note that I set the thresholds to extreme values to show that, with these thresholds, the model always predicts 0.
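For intuition, here is a hypothetical sketch (with made-up probabilities) of the decision rule from the thresholds Param documentation quoted in the answer below: the class with the largest p/t is predicted, so a near-zero threshold for class 0 makes class 0 win regardless of the probabilities:
probs = [0.4, 0.6]                  # per-class probabilities (made up)
thresholds = [1e-20, 1 - 1e-20]     # the extreme thresholds used above
scaled = [p / t for p, t in zip(probs, thresholds)]
print(scaled.index(max(scaled)))    # 0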
Now, the following does not work:
mlp = MLP(featuresCol='embedded', layers=[200, 10, 2])
pipe = Pipeline(stages=[mlp])
grid = ParamGridBuilder().\
    addGrid(mlp.thresholds, [[0.3, 0.7], [0.7, 0.3]]).\
    build()
cv = CrossValidator(estimator=pipe,
                    evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                    numFolds=2,
                    estimatorParamMaps=grid,
                    parallelism=len(grid))
model = cv.fit(labeled)
error message:
Traceback (most recent call last):
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-190-27dfb1e1d326>", line 1, in <module>
cv.fit(labeled)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/base.py", line 132, in fit
return self._fit(dataset)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/tuning.py", line 303, in _fit
tasks = _parallelFitTasks(est, train, eva, validation, epm, collectSubModelsParam)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/tuning.py", line 49, in _parallelFitTasks
modelIter = est.fitMultiple(train, epm)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/base.py", line 103, in fitMultiple
estimator = self.copy()
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/pipeline.py", line 128, in copy
stages = [stage.copy(extra) for stage in that.getStages()]
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/pipeline.py", line 128, in <listcomp>
stages = [stage.copy(extra) for stage in that.getStages()]
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/wrapper.py", line 262, in copy
that._transfer_params_to_java()
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/wrapper.py", line 124, in _transfer_params_to_java
pair = self._make_java_param_pair(param, self._paramMap[param])
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/wrapper.py", line 115, in _make_java_param_pair
return java_param.w(java_value)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o8256.w.
: java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofDouble$.length$extension(ArrayOps.scala:276)
at scala.collection.mutable.ArrayOps$ofDouble.length(ArrayOps.scala:276)
at scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
at scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43)
at scala.collection.mutable.ArrayOps$ofDouble.forall(ArrayOps.scala:270)
at org.apache.spark.ml.param.shared.HasThresholds$$anonfun$2.apply(sharedParams.scala:201)
at org.apache.spark.ml.param.shared.HasThresholds$$anonfun$2.apply(sharedParams.scala:201)
at org.apache.spark.ml.param.Param.validate(params.scala:72)
at org.apache.spark.ml.param.ParamPair.<init>(params.scala:656)
at org.apache.spark.ml.param.Param.$minus$greater(params.scala:87)
at org.apache.spark.ml.param.Param.w(params.scala:83)
at sun.reflect.GeneratedMethodAccessor66.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
It seems like it cannot find the thresholds parameter. I am not sure how to solve this issue.
Can someone help me?

This depends on the Spark version you are using: this feature is implemented from Spark 3.0.0 onwards. In Spark versions < 3.0.0 the feature is not present, which is why you get the error (and from Spark 3.1.1 a lot of unbalanced-dataset problems become easier to address).
This is what you can access in Spark < 3.0.0:
model_fit.bestModel.thresholds
Param(parent='MultilayerPerceptronClassifier_xxxxxx', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold")
In Spark >= 3.0.0:
setThresholds(value)
Sets the value of thresholds.
New in version 3.1.0: more parameters can be tuned and, besides MultilayerPerceptronClassificationModel, a MultilayerPerceptronClassificationSummary has been added.
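As a hedged sketch of what this looks like on Spark >= 3.0.0, where the thresholds param is exposed on MultilayerPerceptronClassifier itself (so no custom subclass is needed; the column names and the labeled DataFrame are taken from the question):
from pyspark.ml import Pipeline
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# thresholds is a regular param here, so it can go straight into the grid
mlp = MultilayerPerceptronClassifier(featuresCol='embedded', layers=[200, 10, 2])
pipe = Pipeline(stages=[mlp])
grid = ParamGridBuilder().\
    addGrid(mlp.thresholds, [[0.3, 0.7], [0.7, 0.3]]).\
    build()
cv = CrossValidator(estimator=pipe,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                    numFolds=2)
model = cv.fit(labeled)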
More details can be found on releases:
spark-release-3-0-0
MLlib
Highlight
Multiple columns support was added to Binarizer (SPARK-23578), StringIndexer (SPARK-11215), StopWordsRemover (SPARK-29808) and PySpark QuantileDiscretizer (SPARK-22796)
Support Tree-Based Feature Transformation(SPARK-13677)
Two new evaluators MultilabelClassificationEvaluator (SPARK-16692) and RankingEvaluator (SPARK-28045) were added
Sample weights support was added in DecisionTreeClassifier/Regressor (SPARK-19591), RandomForestClassifier/Regressor (SPARK-9478), GBTClassifier/Regressor (SPARK-9612), RegressionEvaluator (SPARK-24102), BinaryClassificationEvaluator (SPARK-24103), BisectingKMeans (SPARK-30351), KMeans (SPARK-29967) and GaussianMixture (SPARK-30102)
R API for PowerIterationClustering was added (SPARK-19827)
Added Spark ML listener for tracking ML pipeline status (SPARK-23674)
Fit with validation set was added to Gradient Boosted Trees in Python (SPARK-24333)
RobustScaler transformer was added (SPARK-28399)
Factorization Machines classifier and regressor were added (SPARK-29224)
Gaussian Naive Bayes (SPARK-16872) and Complement Naive Bayes (SPARK-29942) were added
ML function parity between Scala and Python (SPARK-28958)
predictRaw is made public in all the Classification models. predictProbability is made public in all the Classification models except LinearSVCModel (SPARK-30358)
Changes of behavior
Please read the migration guide for details.
A few other behavior changes that are missed in the migration guide:
In Spark 3.0, a multiclass logistic regression in Pyspark will now (correctly) return LogisticRegressionSummary, not the subclass BinaryLogisticRegressionSummary. The additional methods exposed by BinaryLogisticRegressionSummary would not work in this case anyway. (SPARK-31681)
In Spark 3.0, pyspark.ml.param.shared.Has* mixins do not provide any set(self, value) setter methods anymore, use the respective self.set(self., value) instead. See SPARK-29093 for details. (SPARK-29093)
spark-release-3-1-1
PySpark
Project Zen
Project Zen: Improving Python usability (SPARK-32082)
PySpark type hints support (SPARK-32681)
Redesign PySpark documentation (SPARK-31851)
Migrate to NumPy documentation style (SPARK-32085)
Installation option for PyPI Users (SPARK-32017)
Un-deprecate inferring DataFrame schema from list of dict (SPARK-32686)
Simplify the exception message from Python UDFs (SPARK-33407)
Other Notable Changes
Stage Level Scheduling APIs (SPARK-29641)
Deduplicate deterministic PythonUDF calls (SPARK-33303)
Support higher order functions in PySpark functions(SPARK-30681)
Support data source v2x write APIs (SPARK-29157)
Support percentile_approx in PySpark functions(SPARK-30569)
Support inputFiles in PySpark DataFrame (SPARK-31763)
Support withField in PySpark Column (SPARK-32835)
Support dropFields in PySpark Column (SPARK-32511)
Support nth_value in PySpark functions (SPARK-33020)
Support acosh, asinh and atanh (SPARK-33563)
Support getCheckpointDir method in PySpark SparkContext (SPARK-33017)
Support to fill nulls for missing columns in unionByName (SPARK-32798)
Update cloudpickle to v1.5.0 (SPARK-32094)
Add MapType support for PySpark with Arrow (SPARK-24554)
DataStreamReader.table and DataStreamWriter.toTable (SPARK-33836)
Changes of behavior
Please read the migration guides for PySpark.
Programming guides: PySpark Getting Started and PySpark User Guide.
Structured Streaming
Performance Enhancements
Cache fetched list of files beyond maxFilesPerTrigger as unread file (SPARK-30866)
Streamline the logic on file stream source and sink metadata log (SPARK-30462)
Avoid reading compact metadata log twice if the query restarts from compact batch (SPARK-30900)
Feature Enhancements
Add DataStreamReader.table API (SPARK-32885)
Add DataStreamWriter.toTable API (SPARK-32896)
Left semi stream-stream join (SPARK-32862)
Full outer stream-stream join (SPARK-32863)
Provide a new option to have retention on output files (SPARK-27188)
Add Spark Structured Streaming History Server Support (SPARK-31953)
Introduce State schema validation among query restart (SPARK-27237)
Other Notable Changes
Introduce schema validation for streaming state store (SPARK-31894)
Support to use a different compression codec in state store (SPARK-33263)
Kafka connector infinite wait because metadata never updated (SPARK-28367)
Upgrade Kafka to 2.6.0 (SPARK-32568)
Pagination support for Structured Streaming UI pages (SPARK-31642, SPARK-30119)
State information in Structured Streaming UI (SPARK-33223)
Watermark gap information in Structured Streaming UI (SPARK-33224)
Expose state custom metrics information on SS UI (SPARK-33287)
Add a new metric regarding number of rows later than watermark (SPARK-24634)
Changes of behavior
Please read the migration guides for Structured Streaming.
Programming guides: Structured Streaming Programming Guide.
MLlib
Highlight
LinearSVC blockify input vectors (SPARK-30642)
LogisticRegression blockify input vectors (SPARK-30659)
LinearRegression blockify input vectors (SPARK-30660)
AFT blockify input vectors (SPARK-31656)
Add support for association rules in ML (SPARK-19939)
Add training summary for LinearSVCModel (SPARK-20249)
Add summary to RandomForestClassificationModel (SPARK-23631)
Add training summary to FMClassificationModel (SPARK-32140)
Add summary to MultilayerPerceptronClassificationModel (SPARK-32449)
Add FMClassifier to SparkR (SPARK-30820)
Add SparkR LinearRegression wrapper (SPARK-30818)
Add FMRegressor wrapper to SparkR (SPARK-30819)
Add SparkR wrapper for vector_to_array (SPARK-33040)
adaptively blockify instances - LinearSVC (SPARK-32907)
make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator (SPARK-33520)
Improve performance of ML ALS recommendForAll by GEMV (SPARK-33518)
Add UnivariateFeatureSelector (SPARK-34080)
Other Notable Changes
GMM compute summary and update distributions in one job (SPARK-31032)
Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel (SPARK-31077)
Flatten the result dataframe of tests in testChiSquare (SPARK-31301)
MinHash keyDistance optimization (SPARK-31436)
KMeans optimization based on triangle-inequality (SPARK-31007)
Add weight support in ClusteringEvaluator (SPARK-31734)
Add getMetrics in Evaluators (SPARK-31768)
Add instance weight support in LinearRegressionSummary (SPARK-31944)
Add user-specified fold column to CrossValidator (SPARK-31777)
ML params default value parity in feature and tuning (SPARK-32310)
Fix double caching in KMeans/BiKMeans (SPARK-32676)
aft transform optimization (SPARK-33111)
FeatureHasher transform optimization (SPARK-32974)
Add array_to_vector function for dataframe column (SPARK-33556)
ML params default value parity in classification, regression, clustering and fpm (SPARK-32310)
Summary.totalIterations greater than maxIters (SPARK-31925)
tree models prediction optimization (SPARK-32298)
Changes of behavior
Please read the migration guides for MLlib.
Programming guide: Machine Learning Library (MLlib) Guide.
Note: I noticed this issue too on Spark 2.4.5 when doing modeling and looking to adjust the threshold in order to improve the performance of MLPC models for a very unbalanced target.

Related

How to get vocabulary size of word2vec?

I have a pretrained word2vec model in pyspark and I would like to know how big is its vocabulary (and perhaps get a list of words in the vocabulary).
Is this possible? I would guess it has to be stored somewhere since it can predict for new data, but I couldn't find a clear answer in the documentation.
I tried w2v_model.getVectors().count() but the result (970) seems too small for my use case. In case it may be relevant, I'm using short-text data and my dataset has tens of millions of messages, each having from 10 to 30/40 words. I am using min_count=50.
Not quite sure why you doubt the result of .getVectors().count(), which gives the desired result indeed, as shown in the documentation link you have provided yourself.
Here is the example posted there, with a vocabulary of just three (3) tokens - a, b, and c:
from pyspark.ml.feature import Word2Vec
sent = ("a b " * 100 + "a c " * 10).split(" ") # 3-token vocabulary
doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)
So, unsurprisingly, it is
model.getVectors().count()
# 3
and asking for the vectors themselves
model.getVectors().show()
gives
+----+--------------------+
|word| vector|
+----+--------------------+
| a|[0.09511678665876...|
| b|[-1.2028766870498...|
| c|[0.30153277516365...|
+----+--------------------+
In your case, with min_count=50, every word that appears less than 50 times in your corpus will not be represented; reducing this number will result in more vectors.
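If rarer words matter for your use case, the corresponding knob in the ML API is minCount (a quick sketch; the vectorSize value is arbitrary and the column names follow the example above):
from pyspark.ml.feature import Word2Vec

# lower minCount to keep rarer words in the vocabulary (default is 5)
word2Vec = Word2Vec(vectorSize=100, minCount=5,
                    inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)
model.getVectors().count()   # the vocabulary grows as minCount decreases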

Estimate a numerical value through Spark MLlib Regression

I'm training a Spark MLlib linear regressor, but I believe I didn't understand part of the library's hands-on usage.
I have 1 feature (NameItem) and one output (Accumulator).
The first one is categorical (Speed, Temp, etc.), the second is numerical of double type.
The training set is made of several millions of entries, and they are not linearly correlated (I checked with a heatmap and correlation indexes).
Issue: I'd like to estimate the Accumulator value given the NameItem value through linear regression, but I think that is not what I'm actually doing.
Question: How can I do it?
I first divided the dataset in training set and data set:
(trainDF, testDF) = df.randomSplit((0.80, 0.20), seed=42)
After that I tried a pipeline approach, as most tutorials show:
1) I indexed NameItem
indexer = StringIndexer(inputCol="NameItem", outputCol="CategorizedItem", handleInvalid = "keep")
2) Then I encoded it
encoderInput = [indexer.getOutputCol()]
encoderOutput = ["EncodedItem"]
encoder = OneHotEncoderEstimator(inputCols=encoderInput, outputCols=encoderOutput)
3) And also assembled it
assemblerInput = encoderOutput
assembler = VectorAssembler(inputCols=assemblerInput, outputCol="features")
After that I continued with the effective training:
lr = LinearRegression(labelCol="Accumulator")
pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
lrModel = pipeline.fit(trainDF)
That's what I obtain when I apply the prediction on the test set:
predictions = lrModel.transform(testDF).show(5, False)
+--------------+-----------------+---------------+-----------------+-------------------------------+------------------+
|NameItem |Accumulator |CategorizedItem|EncodedItem |features |prediction |
+--------------+-----------------+---------------+-----------------+-------------------------------+------------------+
|Speed |44000.00000000 |265.0 |(688,[265],[1.0])|(689,[265,688],[1.0,44000.0]) |44000.100892495786|
|Speed |245000.00000000 |265.0 |(688,[265],[1.0])|(689,[265,688],[1.0,245000.0]) |245000.09963708033|
|Temp |4473860.00000000 |66.0 |(688,[66],[1.0]) |(689,[66,688],[1.0,4473860.0]) |4473859.874261986 |
|Temp |6065.00000000 |66.0 |(688,[66],[1.0]) |(689,[66,688],[1.0,6065.0]) |6065.097757082314 |
|Temp |10140.00000000 |66.0 |(688,[66],[1.0]) |(689,[66,688],[1.0,10140.0]) |10140.097731630483|
+--------------+-----------------+---------------+-----------------+-------------------------------+------------------+
only showing top 5 rows
How can it be possible that for the same categorical feature (for example Temp) I get 3 different predictions?
Even though they are very close to the expected value, I feel there's something wrong.
How can it be possible that for the same categorical feature (for example Temp) I get 3 different predictions?
It's because somehow your output Accumulator has found its way into features (which of course should not be the case), so the model just "predicts" (essentially copies) this part of the input; that's why the predictions are so "accurate"...
Seems like the VectorAssembler messes things up. Thing is, you don't really need a VectorAssembler here, since in fact you only have a "single" feature (the one-hot encoded sparse vector in EncodedItem). This might be the reason why VectorAssembler behaves like that here (it is asked to "assemble" a single feature), but in any case this would be a bug.
So what I suggest is to get rid of the VectorAssembler, and rename the EncodedItem directly as features, i.e.:
indexer = StringIndexer(inputCol="NameItem", outputCol="CategorizedItem", handleInvalid = "keep")
encoderInput = [indexer.getOutputCol()]
encoderOutput = ["features"] # 1st change
encoder = OneHotEncoderEstimator(inputCols=encoderInput, outputCols=encoderOutput)
lr = LinearRegression(labelCol="Accumulator")
pipeline = Pipeline(stages=[indexer, encoder, lr]) # 2nd change
lrModel = pipeline.fit(trainDF)
UPDATE (after feedback in the comments)
My Spark version is 1.4.4
Unfortunately I cannot reproduce the issue, simply because I do not have access to Spark 1.4.4, which you are using. But I have confirmed that it works OK in the most recent version, Spark 2.4.4, making me even more inclined to believe that there was indeed some bug back in v1.4 which has since been resolved.
Here is a reproduction in Spark 2.4.4, using some dummy data resembling yours:
spark.version
# '2.4.4'
from pyspark.ml.feature import VectorAssembler, OneHotEncoderEstimator, StringIndexer
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
# dummy data resembling yours:
df = spark.createDataFrame([['Speed', 44000],
['Temp', 23000],
['Temp', 5000],
['Speed', 75000],
['Weight', 5300],
['Height', 34500],
['Weight', 6500]],
['NameItem', 'Accumulator'])
df.show()
# result:
+--------+-----------+
|NameItem|Accumulator|
+--------+-----------+
| Speed| 44000|
| Temp| 23000|
| Temp| 5000|
| Speed| 75000|
| Weight| 5300|
| Height| 34500|
| Weight| 6500|
+--------+-----------+
indexer = StringIndexer(inputCol="NameItem", outputCol="CategorizedItem", handleInvalid = "keep")
encoderInput = [indexer.getOutputCol()]
encoderOutput = ["EncodedItem"]
encoder = OneHotEncoderEstimator(inputCols=encoderInput, outputCols=encoderOutput)
assemblerInput = encoderOutput
assembler = VectorAssembler(inputCols=assemblerInput, outputCol="features")
lr = LinearRegression(labelCol="Accumulator")
pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
lrModel = pipeline.fit(df)
lrModel.transform(df).show() # predicting on the same df, for simplicity
The result of the last transform is
+--------+-----------+---------------+-------------+-------------+------------------+
|NameItem|Accumulator|CategorizedItem| EncodedItem| features| prediction|
+--------+-----------+---------------+-------------+-------------+------------------+
| Speed| 44000| 2.0|(4,[2],[1.0])|(4,[2],[1.0])| 59500.0|
| Temp| 23000| 1.0|(4,[1],[1.0])|(4,[1],[1.0])|14000.000000000004|
| Temp| 5000| 1.0|(4,[1],[1.0])|(4,[1],[1.0])|14000.000000000004|
| Speed| 75000| 2.0|(4,[2],[1.0])|(4,[2],[1.0])| 59500.0|
| Weight| 5300| 0.0|(4,[0],[1.0])|(4,[0],[1.0])| 5900.000000000004|
| Height| 34500| 3.0|(4,[3],[1.0])|(4,[3],[1.0])| 34500.0|
| Weight| 6500| 0.0|(4,[0],[1.0])|(4,[0],[1.0])| 5900.000000000004|
+--------+-----------+---------------+-------------+-------------+------------------+
from where you can see that:
The features now do not include the values of the output variable Accumulator, as it should be indeed; in fact, as I had argued above, features is now identical with EncodedItem, making the VectorAssembler redundant, exactly as we should expect since we only have one single feature.
The prediction values are now identical for the same values of NameItem, again as we would expect them to be, plus that they are less accurate and thus more realistic.
So, most certainly, your issue has to do with the vastly outdated Spark version 1.4.4 you are using. Spark has made leaps since v1.4, and you should seriously consider updating...

Spark ML Error: Incorrect no. of classes detected while using Linear SVC

I am working on a binary classification problem and using SparkML, I trained and evaluated my data using Random Forest and Logistic Regression models and now I wanted to check how well SVM classifies my data.
Snippet of my training data:-
+----------+------+
| spam | count|
+----------+------+
| No|197378|
| Yes| 7652|
+----------+------+
Note:- My dependent variable: 'spam': string (nullable = true)
+-----+------+
|label| count|
+-----+------+
| 0.0|197488|
| 1.0| 7650|
+-----+------+
Note:- label: double (nullable = false)
Updates to my question:-
trainingData.select('label').distinct().show()
+-----+
|label|
+-----+
| 0.0|
| 1.0|
+-----+
So, I used the below code to fit my training data using Linear SVC:-
from pyspark.ml.classification import LinearSVC
lsvc = LinearSVC()
# Fit the model
lsvcModel = lsvc.fit(trainingData)
In my data frame, the label (dependent variable) has only 2 classes, but I get an error saying more classes were detected. Not really sure what's causing this exception.
Any help is much appreciated!
Error:-
IllegalArgumentException: u'requirement failed: LinearSVC only supports
binary classification. 3 classes detected in
LinearSVC_4240bb949b9fad486ec0__labelCol'
You can try to convert your label value into categorical data using OneHotEncoder with the handleInvalid parameter set to "keep".
I had the same problem.
scala> TEST_DF_37849c70_7cd3_4fd6_a9a0_df4de727df25.select("si_37849c70_7cd3_4fd6_a9a0_df4de727df25_logicProp1_lable_left").distinct.show
+-------------------------------------------------------------+
|si_37849c70_7cd3_4fd6_a9a0_df4de727df25_logicProp1_lable_left|
+-------------------------------------------------------------+
| 0.0|
| 1.0|
+-------------------------------------------------------------+
error: requirement failed: LinearSVC only supports binary classification. 3 classes detected in linearsvc_d18a38204551__labelCol
But in my case, using StringIndexer with the setHandleInvalid("skip") option, it works.
Maybe LinearSVC trips over the StringIndexer "keep" option: with "keep", StringIndexer reserves an extra index for unseen/invalid labels, so the label column metadata can report one more class than is actually present in the data.
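A minimal sketch of the "skip" approach, assuming the string label column is named spam as in the question and that a features column has already been assembled:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import LinearSVC

# "skip" drops rows with unseen/invalid labels instead of adding an extra
# label index, so LinearSVC sees exactly the two classes present in the data
indexer = StringIndexer(inputCol="spam", outputCol="label", handleInvalid="skip")
lsvc = LinearSVC(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[indexer, lsvc])
# lsvcModel = pipeline.fit(trainingData)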

No result after calculating the similarity of two words based on word vectors via Spacy's parser?

I have an example in spacy code:
from numpy import dot
from numpy.linalg import norm
from spacy.lang.en import English
parser = English()
# you can access known words from the parser's vocabulary
nasa = parser.vocab[u'NASA']
# cosine similarity
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
# gather all known words, take only the lowercased versions
allWords = list({w for w in parser.vocab if w.has_vector and
                 w.orth_.islower() and w.lower_ != unicode("nasa")})
# sort by similarity to NASA
allWords.sort(key=lambda w: cosine(w.vector, nasa.vector))
allWords.reverse()
print("Top 10 most similar words to NASA:")
for word in allWords[:10]:
    print(word.orth_)
The result is like this:
Top 10 most similar words to NASA:
Process finished with exit code 0
So no similar words come out.
I have tried to install the parser and glove via cmd:
python -m spacy.en.download parser
python -m spacy.en.download glove
But it failed; it turned out to be:
C:\Python\python.exe: No module named en
By the way, I use:
Python 2.7.9
Spacy 2.0.9
What's wrong with it? Thank you
The parser you are instantiating contains no word vectors. Check https://spacy.io/models/ for an overview of models.
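For example, a hedged sketch assuming spaCy 2.x with a vectors-enabled model such as en_core_web_md installed beforehand via python -m spacy download en_core_web_md (the old spacy.en.download commands no longer exist in 2.x):
import spacy

# load a model that actually ships word vectors (English() alone has none)
nlp = spacy.load("en_core_web_md")
nasa = nlp.vocab[u"NASA"]
print(nasa.has_vector)   # True with a vectors-enabled model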

KMeans clustering in PySpark

I have a Spark dataframe 'mydataframe' with many columns. I am trying to run kmeans on only two columns, lat and long (latitude & longitude), using them as simple values. I want to extract 7 clusters based on just those 2 columns and then attach the cluster assignment to my original dataframe. I've tried:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat', 'long')
data_rdd = data.rdd # needs to be an RDD
data_rdd.cache()
# Build the model (cluster the data)
clusters = KMeans.train(data_rdd, 7, maxIterations=15, initializationMode="random")
But I am getting an error after a while:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5191.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5191.0 (TID 260738, 10.19.211.69, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last)
I've tried to detach and re-attach the cluster. Same result. What am I doing wrong?
Since, based on another recent question of yours, I guess you are in your very first steps with Spark clustering (you are even importing sqrt & array without ever using them, probably because it is like that in the docs example), let me offer advice at a more general level rather than on the specific question you are asking here (hopefully also saving you from subsequently opening 3-4 more questions trying to get your cluster assignments back into your dataframe)...
Since
you have your data already in a dataframe
you want to attach the cluster membership back into your initial
dataframe
you have no reason to revert to an RDD and use the (soon to be deprecated) MLlib package; you will do your job much more easily, elegantly, and efficiently using the (now recommended) ML package, which works directly with dataframes.
Step 0 - make some toy data resembling yours:
spark.version
# u'2.2.0'
df = spark.createDataFrame([[0, 33.3, -17.5],
[1, 40.4, -20.5],
[2, 28., -23.9],
[3, 29.5, -19.0],
[4, 32.8, -18.84]
],
["other","lat", "long"])
df.show()
# +-----+----+------+
# |other| lat| long|
# +-----+----+------+
# | 0|33.3| -17.5|
# | 1|40.4| -20.5|
# | 2|28.0| -23.9|
# | 3|29.5| -19.0|
# | 4|32.8|-18.84|
# +-----+----+------+
Step 1 - assemble your features
In contrast to most ML packages out there, Spark ML requires your input features to be gathered in a single column of your dataframe, usually named features; and it provides a specific method for doing this, VectorAssembler:
from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(inputCols=["lat", "long"], outputCol="features")
new_df = vecAssembler.transform(df)
new_df.show()
# +-----+----+------+-------------+
# |other| lat| long| features|
# +-----+----+------+-------------+
# | 0|33.3| -17.5| [33.3,-17.5]|
# | 1|40.4| -20.5| [40.4,-20.5]|
# | 2|28.0| -23.9| [28.0,-23.9]|
# | 3|29.5| -19.0| [29.5,-19.0]|
# | 4|32.8|-18.84|[32.8,-18.84]|
# +-----+----+------+-------------+
As perhaps already guessed, the argument inputCols serves to tell VectorAssembler which particular columns in our dataframe are to be used as features.
Step 2 - fit your KMeans model
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=2, seed=1) # 2 clusters here
model = kmeans.fit(new_df.select('features'))
select('features') here serves to tell the algorithm which column of the dataframe to use for clustering - remember that, after Step 1 above, your original lat & long features are no longer used directly.
Step 3 - transform your initial dataframe to include cluster assignments
transformed = model.transform(new_df)
transformed.show()
# +-----+----+------+-------------+----------+
# |other| lat| long| features|prediction|
# +-----+----+------+-------------+----------+
# | 0|33.3| -17.5| [33.3,-17.5]| 0|
# | 1|40.4| -20.5| [40.4,-20.5]| 1|
# | 2|28.0| -23.9| [28.0,-23.9]| 0|
# | 3|29.5| -19.0| [29.5,-19.0]| 0|
# | 4|32.8|-18.84|[32.8,-18.84]| 0|
# +-----+----+------+-------------+----------+
The last column of the transformed dataframe, prediction, shows the cluster assignment - in my toy case, I have ended up with 4 records in cluster #0 and 1 record in cluster #1.
You can further manipulate the transformed dataframe with select statements, or even drop the features column (which has now fulfilled its function and may be no longer necessary)...
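For instance, a small follow-up sketch using the transformed dataframe from above:
# keep only the original columns plus the cluster assignment
transformed.drop('features').show()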
Hopefully you are much closer now to what you actually wanted to achieve in the first place. For extracting cluster statistics etc., another recent answer of mine might be helpful...
Despite my other general answer, and in case you, for whatever reason, must stick with MLlib & RDDs, here is what causes your error using the same toy df.
When you select columns from a dataframe to convert to RDD, as you do, the result is an RDD of Rows:
df.select('lat', 'long').rdd.collect()
# [Row(lat=33.3, long=-17.5), Row(lat=40.4, long=-20.5), Row(lat=28.0, long=-23.9), Row(lat=29.5, long=-19.0), Row(lat=32.8, long=-18.84)]
which is not suitable as an input to MLlib KMeans. You'll need a map operation for this to work:
df.select('lat', 'long').rdd.map(lambda x: (x[0], x[1])).collect()
# [(33.3, -17.5), (40.4, -20.5), (28.0, -23.9), (29.5, -19.0), (32.8, -18.84)]
So, your code should be like this:
from pyspark.mllib.clustering import KMeans, KMeansModel
rdd = df.select('lat', 'long').rdd.map(lambda x: (x[0], x[1]))
clusters = KMeans.train(rdd, 2, maxIterations=10, initializationMode="random") # works OK
clusters.centers
# [array([ 40.4, -20.5]), array([ 30.9 , -19.81])]
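As a short follow-up sketch, if you do stay with MLlib, KMeansModel.predict gives the cluster assignment for a single point or for a whole RDD (the actual indices depend on the random initialization):
clusters.predict((33.3, -17.5))     # cluster index for a single point
clusters.predict(rdd).collect()     # cluster indices for all points, in input order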
