KMeans clustering in PySpark - machine-learning

I have a spark dataframe 'mydataframe' with many columns. I am trying to run kmeans on only two columns: lat and long (latitude & longitude) using them as simple values). I want to extract 7 clusters based on just those 2 columns and then I want to attach the cluster asignment to my original dataframe. I've tried:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat', 'long')
data_rdd = data.rdd # needs to be an RDD
data_rdd.cache()
# Build the model (cluster the data)
clusters = KMeans.train(data_rdd, 7, maxIterations=15, initializationMode="random")
But I am getting an error after a while:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5191.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5191.0 (TID 260738, 10.19.211.69, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last)
I've tried to detach and re-attach the cluster. Same result. What am I doing wrong?

Since, based on another recent question of yours, I guess you are in your very first steps with Spark clustering (you are even importing sqrt & array, without ever using them, probably because it is like that in the docs example), let me offer advice in a more general level rather than in the specific question you are asking here (hopefully also saving you from subsequently opening 3-4 more questions, trying to get your cluster assignments back into your dataframe)...
Since
you have your data already in a dataframe
you want to attach the cluster membership back into your initial
dataframe
you have no reason to revert to an RDD and use the (soon to be deprecated) MLlib package; you will do your job much more easily, elegantly, and efficiently using the (now recommended) ML package, which works directly with dataframes.
Step 0 - make some toy data resembling yours:
spark.version
# u'2.2.0'
df = spark.createDataFrame([[0, 33.3, -17.5],
[1, 40.4, -20.5],
[2, 28., -23.9],
[3, 29.5, -19.0],
[4, 32.8, -18.84]
],
["other","lat", "long"])
df.show()
# +-----+----+------+
# |other| lat| long|
# +-----+----+------+
# | 0|33.3| -17.5|
# | 1|40.4| -20.5|
# | 2|28.0| -23.9|
# | 3|29.5| -19.0|
# | 4|32.8|-18.84|
# +-----+----+------+
Step 1 - assemble your features
In contrast to most ML packages out there, Spark ML requires your input features to be gathered in a single column of your dataframe, usually named features; and it provides a specific method for doing this, VectorAssembler:
from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(inputCols=["lat", "long"], outputCol="features")
new_df = vecAssembler.transform(df)
new_df.show()
# +-----+----+------+-------------+
# |other| lat| long| features|
# +-----+----+------+-------------+
# | 0|33.3| -17.5| [33.3,-17.5]|
# | 1|40.4| -20.5| [40.4,-20.5]|
# | 2|28.0| -23.9| [28.0,-23.9]|
# | 3|29.5| -19.0| [29.5,-19.0]|
# | 4|32.8|-18.84|[32.8,-18.84]|
# +-----+----+------+-------------+
As perhaps already guessed, the argument inputCols serves to tell VectoeAssembler which particular columns in our dataframe are to be used as features.
Step 2 - fit your KMeans model
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=2, seed=1) # 2 clusters here
model = kmeans.fit(new_df.select('features'))
select('features') here serves to tell the algorithm which column of the dataframe to use for clustering - remember that, after Step 1 above, your original lat & long features are no more directly used.
Step 3 - transform your initial dataframe to include cluster assignments
transformed = model.transform(new_df)
transformed.show()
# +-----+----+------+-------------+----------+
# |other| lat| long| features|prediction|
# +-----+----+------+-------------+----------+
# | 0|33.3| -17.5| [33.3,-17.5]| 0|
# | 1|40.4| -20.5| [40.4,-20.5]| 1|
# | 2|28.0| -23.9| [28.0,-23.9]| 0|
# | 3|29.5| -19.0| [29.5,-19.0]| 0|
# | 4|32.8|-18.84|[32.8,-18.84]| 0|
# +-----+----+------+-------------+----------+
The last column of the transformed dataframe, prediction, shows the cluster assignment - in my toy case, I have ended up with 4 records in cluster #0 and 1 record in cluster #1.
You can further manipulate the transformed dataframe with select statements, or even drop the features column (which has now fulfilled its function and may be no longer necessary)...
Hopefully you are much closer now to what you actually wanted to achieve in the first place. For extracting cluster statistics etc., another recent answer of mine might be helpful...

Despite my other general answer, and in case you, for whatever reason, must stick with MLlib & RDDs, here is what causes your error using the same toy df.
When you select columns from a dataframe to convert to RDD, as you do, the result is an RDD of Rows:
df.select('lat', 'long').rdd.collect()
# [Row(lat=33.3, long=-17.5), Row(lat=40.4, long=-20.5), Row(lat=28.0, long=-23.9), Row(lat=29.5, long=-19.0), Row(lat=32.8, long=-18.84)]
which is not suitable as an input to MLlib KMeans. You'll need a map operation for this to work:
df.select('lat', 'long').rdd.map(lambda x: (x[0], x[1])).collect()
# [(33.3, -17.5), (40.4, -20.5), (28.0, -23.9), (29.5, -19.0), (32.8, -18.84)]
So, your code should be like this:
from pyspark.mllib.clustering import KMeans, KMeansModel
rdd = df.select('lat', 'long').rdd.map(lambda x: (x[0], x[1]))
clusters = KMeans.train(rdd, 2, maxIterations=10, initializationMode="random") # works OK
clusters.centers
# [array([ 40.4, -20.5]), array([ 30.9 , -19.81])]

Related

Adding prediction thresholds to MultilayerPerceptronClassifier class in PySpark

I am trying to optimize the prediction threshold for the MultilayerPerceptronClassifier in (Py)Spark using cross validation. I tried to make a subclass of MultilayerPerceptronClassifier which actually allows thresholds to be provided. It seems to work in a regular Pipeline, however whenever I plug it into a CrossValidator it gives error messages.
The class I made:
class MLP(MultilayerPerceptronClassifier, HasThresholds):
def __init__(self, thresholds=None, **kwargs):
super(MLP, self).__init__(**kwargs)
self.setParams(thresholds=thresholds, **kwargs)
def setParams(self, thresholds=None, **kwargs):
return self._set(thresholds=thresholds, **kwargs)
Sample data (labeled):
+---------+-----+------+------------------------------+
|family_id|label|weight| embedded|
+---------+-----+------+------------------------------+
| 60009405| 1.0| 1.0|[0.10171283965701926,0.0415...|
| 55022499| 1.0| 1.0|[0.15376672673361091,-0.001...|
| 63938820| 1.0| 1.0|[0.16867649792968614,0.0126...|
| 37452877| 1.0| 1.0|[0.18771651450592225,0.0191...|
| 64559476| 1.0| 1.0|[0.1504634794488278,-0.0032...|
| 59544896| 0.0| 1.25|[0.12911133907668226,0.0116...|
| 46383793| 0.0| 1.25|[0.13390121417649795,-0.013...|
| 59473587| 0.0| 1.25|[0.1262944439844325,0.01176...|
| 63938820| 0.0| 1.25|[0.16867649792968614,0.0126...|
+---------+-----+------+------------------------------+
This seems to work correctly:
mlp = MLP(featuresCol='embedded', layers=[200, 10, 2], thresholds=[1e-20, 1-1e-20])
pipe = Pipeline(stages=[mlp])
model = pipe.fit(labeled)
model.transform(labeled).show(10)
+---------+-----+------+--------------------+--------------------+--------------------+----------+
|family_id|label|weight| embedded| rawPrediction| probability|prediction|
+---------+-----+------+--------------------+--------------------+--------------------+----------+
| 60009405| 1.0| 1.0|[0.10171283965701...|[-11.937067045534...|[6.74683311024104...| 0.0|
| 55022499| 1.0| 1.0|[0.15376672673361...|[-11.914377530833...|[7.32793349054270...| 0.0|
| 63938820| 1.0| 1.0|[0.16867649792968...|[-0.5160228904601...|[0.50001341804946...| 0.0|
| 37452877| 1.0| 1.0|[0.18771651450592...|[-10.034360656260...|[4.62078113096099...| 0.0|
| 64559476| 1.0| 1.0|[0.15046347944882...|[-11.971196504198...|[6.19667960173464...| 0.0|
| 59544896| 0.0| 1.25|[0.12911133907668...|[10.5489426088559...|[0.99999999980450...| 0.0|
| 46383793| 0.0| 1.25|[0.13390121417649...|[10.6067487531592...|[0.99999999982723...| 0.0|
| 59473587| 0.0| 1.25|[0.12629444398443...|[10.5199541406352...|[0.99999999979221...| 0.0|
| 63938820| 0.0| 1.25|[0.16867649792968...|[-0.5160228904601...|[0.50001341804946...| 0.0|
+---------+-----+------+--------------------+--------------------+--------------------+----------+
Note that I set the thresholds extremely to show that the model always predicts 0, using these thresholds.
Now, the following does not work:
mlp = MLP(featuresCol='embedded', layers=[200, 10, 2])
pipe = Pipeline(stages=[mlp])
grid = ParamGridBuilder().\
addGrid(mlp.thresholds, [[0.3, 0.7], [0.7, 0.3]]).\
build()
cv = CrossValidator(estimator=pipe,valuator=MulticlassClassificationEvaluator(metricName='f1'),numFolds=2,estimatorParamMaps=grid,parallelism=len(grid))
model = cv.fit(labeled)
error message:
Traceback (most recent call last):
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-190-27dfb1e1d326>", line 1, in <module>
cv.fit(labeled)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/base.py", line 132, in fit
return self._fit(dataset)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/tuning.py", line 303, in _fit
tasks = _parallelFitTasks(est, train, eva, validation, epm, collectSubModelsParam)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/tuning.py", line 49, in _parallelFitTasks
modelIter = est.fitMultiple(train, epm)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/base.py", line 103, in fitMultiple
estimator = self.copy()
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/pipeline.py", line 128, in copy
stages = [stage.copy(extra) for stage in that.getStages()]
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/pipeline.py", line 128, in <listcomp>
stages = [stage.copy(extra) for stage in that.getStages()]
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/wrapper.py", line 262, in copy
that._transfer_params_to_java()
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/wrapper.py", line 124, in _transfer_params_to_java
pair = self._make_java_param_pair(param, self._paramMap[param])
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/ml/wrapper.py", line 115, in _make_java_param_pair
return java_param.w(java_value)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/Users/thijsvandepoll/PycharmProjects/focusbv/venv/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o8256.w.
: java.lang.NullPointerException
at scala.collection.mutable.ArrayOps$ofDouble$.length$extension(ArrayOps.scala:276)
at scala.collection.mutable.ArrayOps$ofDouble.length(ArrayOps.scala:276)
at scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
at scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43)
at scala.collection.mutable.ArrayOps$ofDouble.forall(ArrayOps.scala:270)
at org.apache.spark.ml.param.shared.HasThresholds$$anonfun$2.apply(sharedParams.scala:201)
at org.apache.spark.ml.param.shared.HasThresholds$$anonfun$2.apply(sharedParams.scala:201)
at org.apache.spark.ml.param.Param.validate(params.scala:72)
at org.apache.spark.ml.param.ParamPair.<init>(params.scala:656)
at org.apache.spark.ml.param.Param.$minus$greater(params.scala:87)
at org.apache.spark.ml.param.Param.w(params.scala:83)
at sun.reflect.GeneratedMethodAccessor66.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
It seems like it cannot find the parameter thresholds. I am not sure how to solve this issue.
Can someone help me?
Depends of the spark version u do use, this feature is implemented from spark 3.0.0 , in spark version < 3.0.0 this feature is not present and this is why u do get the error (from spark 3.1.1 a lot of unbalanced dataset problems will be easy to address).
This is what u can access in spark < 3.0.0:
model_fit.bestModel.thresholds
Param(parent='MultilayerPerceptronClassifier_xxxxxx', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold")
in spark >=3.0.0:
setThresholds(value)
Sets the value of thresholds.
New in version 3.1.0 more of parameters can be tuned and has been
added, beside
MultilayerPerceptronClassificationModel also
MultilayerPerceptronClassificationSummary .
More details can be found on releases:
spark-release-3-0-0
MLlib
Highlight
Multiple columns support was added to Binarizer (SPARK-23578), StringIndexer (SPARK-11215), StopWordsRemover (SPARK-29808) and PySpark QuantileDiscretizer (SPARK-22796)
Support Tree-Based Feature Transformation(SPARK-13677)
Two new evaluators MultilabelClassificationEvaluator (SPARK-16692) and RankingEvaluator (SPARK-28045) were added
Sample weights support was added in DecisionTreeClassifier/Regressor (SPARK-19591), RandomForestClassifier/Regressor (SPARK-9478), GBTClassifier/Regressor (SPARK-9612), RegressionEvaluator (SPARK-24102), BinaryClassificationEvaluator (SPARK-24103), BisectingKMeans (SPARK-30351), KMeans (SPARK-29967) and GaussianMixture (SPARK-30102)
R API for PowerIterationClustering was added (SPARK-19827)
Added Spark ML listener for tracking ML pipeline status (SPARK-23674)
Fit with validation set was added to Gradient Boosted Trees in Python (SPARK-24333)
RobustScaler transformer was added (SPARK-28399)
Factorization Machines classifier and regressor were added (SPARK-29224)
Gaussian Naive Bayes (SPARK-16872) and Complement Naive Bayes (SPARK-29942) were added
ML function parity between Scala and Python (SPARK-28958)
predictRaw is made public in all the Classification models. predictProbability is made public in all the Classification models except LinearSVCModel (SPARK-30358)
Changes of behavior
Please read the migration guide for details.
A few other behavior changes that are missed in the migration guide:
In Spark 3.0, a multiclass logistic regression in Pyspark will now (correctly) return LogisticRegressionSummary, not the subclass BinaryLogisticRegressionSummary. The additional methods exposed by BinaryLogisticRegressionSummary would not work in this case anyway. (SPARK-31681)
In Spark 3.0, pyspark.ml.param.shared.Has* mixins do not provide any set(self, value) setter methods anymore, use the respective self.set(self., value) instead. See SPARK-29093 for details. (SPARK-29093)
spark-release-3-1-1
PySpark
Project Zen
Project Zen: Improving Python usability (SPARK-32082)
PySpark type hints support (SPARK-32681)
Redesign PySpark documentation (SPARK-31851)
Migrate to NumPy documentation style (SPARK-32085)
Installation option for PyPI Users (SPARK-32017)
Un-deprecate inferring DataFrame schema from list of dict (SPARK-32686)
Simplify the exception message from Python UDFs (SPARK-33407)
Other Notable Changes
Stage Level Scheduling APIs (SPARK-29641)
Deduplicate deterministic PythonUDF calls (SPARK-33303)
Support higher order functions in PySpark functions(SPARK-30681)
Support data source v2x write APIs (SPARK-29157)
Support percentile_approx in PySpark functions(SPARK-30569)
Support inputFiles in PySpark DataFrame (SPARK-31763)
Support withField in PySpark Column (SPARK-32835)
Support dropFields in PySpark Column (SPARK-32511)
Support nth_value in PySpark functions (SPARK-33020)
Support acosh, asinh and atanh (SPARK-33563)
Support getCheckpointDir method in PySpark SparkContext (SPARK-33017)
Support to fill nulls for missing columns in unionByName (SPARK-32798)
Update cloudpickle to v1.5.0 (SPARK-32094)
Add MapType support for PySpark with Arrow (SPARK-24554)
DataStreamReader.table and DataStreamWriter.toTable (SPARK-33836)
Changes of behavior
Please read the migration guides for PySpark.
Programming guides: PySpark Getting Started and PySpark User Guide.
Structured Streaming
Performance Enhancements
Cache fetched list of files beyond maxFilesPerTrigger as unread file (SPARK-30866)
Streamline the logic on file stream source and sink metadata log (SPARK-30462)
Avoid reading compact metadata log twice if the query restarts from compact batch (SPARK-30900)
Feature Enhancements
Add DataStreamReader.table API (SPARK-32885)
Add DataStreamWriter.toTable API (SPARK-32896)
Left semi stream-stream join (SPARK-32862)
Full outer stream-stream join (SPARK-32863)
Provide a new option to have retention on output files (SPARK-27188)
Add Spark Structured Streaming History Server Support (SPARK-31953)
Introduce State schema validation among query restart (SPARK-27237)
Other Notable Changes
Introduce schema validation for streaming state store (SPARK-31894)
Support to use a different compression codec in state store (SPARK-33263)
Kafka connector infinite wait because metadata never updated (SPARK-28367)
Upgrade Kafka to 2.6.0 (SPARK-32568)
Pagination support for Structured Streaming UI pages (SPARK-31642, SPARK-30119)
State information in Structured Streaming UI (SPARK-33223)
Watermark gap information in Structured Streaming UI (SPARK-33224)
Expose state custom metrics information on SS UI (SPARK-33287)
Add a new metric regarding number of rows later than watermark (SPARK-24634)
Changes of behavior
Please read the migration guides for Structured Streaming.
Programming guides: Structured Streaming Programming Guide.
MLlib
Highlight
LinearSVC blockify input vectors (SPARK-30642)
LogisticRegression blockify input vectors (SPARK-30659)
LinearRegression blockify input vectors (SPARK-30660)
AFT blockify input vectors (SPARK-31656)
Add support for association rules in ML (SPARK-19939)
Add training summary for LinearSVCModel (SPARK-20249)
Add summary to RandomForestClassificationModel (SPARK-23631)
Add training summary to FMClassificationModel (SPARK-32140)
Add summary to MultilayerPerceptronClassificationModel (SPARK-32449)
Add FMClassifier to SparkR (SPARK-30820)
Add SparkR LinearRegression wrapper (SPARK-30818)
Add FMRegressor wrapper to SparkR (SPARK-30819)
Add SparkR wrapper for vector_to_array (SPARK-33040)
adaptively blockify instances - LinearSVC (SPARK-32907)
make CrossValidator/TrainValidateSplit/OneVsRest Reader/Writer support Python backend estimator/evaluator (SPARK-33520)
Improve performance of ML ALS recommendForAll by GEMV (SPARK-33518)
Add UnivariateFeatureSelector (SPARK-34080)
Other Notable Changes
GMM compute summary and update distributions in one job (SPARK-31032)
Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel (SPARK-31077)
Flatten the result dataframe of tests in testChiSquare (SPARK-31301)
MinHash keyDistance optimization (SPARK-31436)
KMeans optimization based on triangle-inequality (SPARK-31007)
Add weight support in ClusteringEvaluator (SPARK-31734)
Add getMetrics in Evaluators (SPARK-31768)
Add instance weight support in LinearRegressionSummary (SPARK-31944)
Add user-specified fold column to CrossValidator (SPARK-31777)
ML params default value parity in feature and tuning (SPARK-32310)
Fix double caching in KMeans/BiKMeans (SPARK-32676)
aft transform optimization (SPARK-33111)
FeatureHasher transform optimization (SPARK-32974)
Add array_to_vector function for dataframe column (SPARK-33556)
ML params default value parity in classification, regression, clustering and fpm (SPARK-32310)
Summary.totalIterations greater than maxIters (SPARK-31925)
tree models prediction optimization (SPARK-32298)
Changes of behavior
Please read the migration guides for MLlib.
Programming guide: Machine Learning Library (MLlib) Guide.
Note: this issue I did noticed too on spark 2.4.5 when doing modeling
and were looking to adjust threshold in order to improve performance
of the MLPC models for very unbalanced target.

Estimate a numerical value through Spark MLlib Regression

I'm training a Spark MLlib linear regressor but I believe I didn't understand part of the libraries hands-on usage.
I have 1 feature (NameItem) and one output (Accumulator).
The first one is categorical (Speed, Temp, etc), the second is numerical in double type.
Training set is made of several milions of entries and they are not linearly correlated (I checked with heatmap and correlation indexes).
Issue: I'd like to estimate the Accumulator value given the NameItem value through linear regression, but I think it is not what I'm actually doing.
Question: How can I do It?
I first divided the dataset in training set and data set:
(trainDF, testDF) = df.randomSplit((0.80, 0.20), seed=42)
After that I tried a pipeline approach, as most tutorials show:
1) I indexed NameItem
indexer = StringIndexer(inputCol="NameItem", outputCol="CategorizedItem", handleInvalid = "keep")
2) Then I encoded it
encoderInput = [indexer.getOutputCol()]
encoderOutput = ["EncodedItem"]
encoder = OneHotEncoderEstimator(inputCols=encoderInput, outputCols=encoderOutput)
3) And also assembled it
assemblerInput = encoderOutput
assembler = VectorAssembler(inputCols=assemblerInput, outputCol="features")
After that I continued with the effective training:
lr = LinearRegression(labelCol="Accumulator")
pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
lrModel = pipeline.fit(trainDF)
That's what I obtain when I apply the prediction on the test set:
predictions = lrModel.transform(testDF).show(5, False)
+--------------+-----------------+---------------+-----------------+-------------------------------+------------------+
|NameItem |Accumulator |CategorizedItem|EncodedItem |features |prediction |
+--------------+-----------------+---------------+-----------------+-------------------------------+------------------+
|Speed |44000.00000000 |265.0 |(688,[265],[1.0])|(689,[265,688],[1.0,44000.0]) |44000.100892495786|
|Speed |245000.00000000 |265.0 |(688,[265],[1.0])|(689,[265,688],[1.0,245000.0]) |245000.09963708033|
|Temp |4473860.00000000 |66.0 |(688,[66],[1.0]) |(689,[66,688],[1.0,4473860.0]) |4473859.874261986 |
|Temp |6065.00000000 |66.0 |(688,[66],[1.0]) |(689,[66,688],[1.0,6065.0]) |6065.097757082314 |
|Temp |10140.00000000 |66.0 |(688,[66],[1.0]) |(689,[66,688],[1.0,10140.0]) |10140.097731630483|
+--------------+-----------------+---------------+-----------------+-------------------------------+------------------+
only showing top 5 rows
How can it be possible that for the same categorical feature (for example Temp) I get 3 different predictions?
Even though they are very close to the expected value, I feel there's something wrong.
How can it be possible that for the same categorical feature (for example Temp) I get 3 different predictions?
It's because somehow your output Accumulator has found its way into features (which of course should not be the case), so the model just "predicts" (essentially copies) this part of the input; that's why the predictions are so "accurate"...
Seems like the VectorAssembler messes things up. Thing is, you don't really need a VectorAssembler here, since in fact you only have a "single" feature (the one-hot encoded sparse vector in EncodedItem). This might be the reason why VectorAssembler behaves like that here (it is asked to "assemble" a single feature), but in any case this would be a bug.
So what I suggest is to get rid of the VectorAssembler, and rename the EncodedItem directly as features, i.e.:
indexer = StringIndexer(inputCol="NameItem", outputCol="CategorizedItem", handleInvalid = "keep")
encoderInput = [indexer.getOutputCol()]
encoderOutput = ["features"] # 1st change
encoder = OneHotEncoderEstimator(inputCols=encoderInput, outputCols=encoderOutput)
lr = LinearRegression(labelCol="Accumulator")
pipeline = Pipeline(stages=[indexer, encoder, lr]) # 2nd change
lrModel = pipeline.fit(trainDF)
UPDATE (after feedback in the comments)
My Spark version Is 1.4.4
Unfortunately I cannot reproduce the issue, simply because I have not access to Spark 1.4.4, which you are using. But I have confirmed that it works OK in the most recent version of Spark 2.4.4, making me even more inclined to believe that there was indeed some bug back in v1.4, which however has subsequently been resolved.
Here is a reproduction in Spark 2.4.4, using some dummy data resembling yours:
spark.version
# '2.4.4'
from pyspark.ml.feature import VectorAssembler, OneHotEncoderEstimator, StringIndexer
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
# dummy data resembling yours:
df = spark.createDataFrame([['Speed', 44000],
['Temp', 23000],
['Temp', 5000],
['Speed', 75000],
['Weight', 5300],
['Height', 34500],
['Weight', 6500]],
['NameItem', 'Accumulator'])
df.show()
# result:
+--------+-----------+
|NameItem|Accumulator|
+--------+-----------+
| Speed| 44000|
| Temp| 23000|
| Temp| 5000|
| Speed| 75000|
| Weight| 5300|
| Height| 34500|
| Weight| 6500|
+--------+-----------+
indexer = StringIndexer(inputCol="NameItem", outputCol="CategorizedItem", handleInvalid = "keep")
encoderInput = [indexer.getOutputCol()]
encoderOutput = ["EncodedItem"]
encoder = OneHotEncoderEstimator(inputCols=encoderInput, outputCols=encoderOutput)
assemblerInput = encoderOutput
assembler = VectorAssembler(inputCols=assemblerInput, outputCol="features")
lr = LinearRegression(labelCol="Accumulator")
pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
lrModel = pipeline.fit(df)
lrModel.transform(df).show() # predicting on the same df, for simplicity
The result of the last transform is
+--------+-----------+---------------+-------------+-------------+------------------+
|NameItem|Accumulator|CategorizedItem| EncodedItem| features| prediction|
+--------+-----------+---------------+-------------+-------------+------------------+
| Speed| 44000| 2.0|(4,[2],[1.0])|(4,[2],[1.0])| 59500.0|
| Temp| 23000| 1.0|(4,[1],[1.0])|(4,[1],[1.0])|14000.000000000004|
| Temp| 5000| 1.0|(4,[1],[1.0])|(4,[1],[1.0])|14000.000000000004|
| Speed| 75000| 2.0|(4,[2],[1.0])|(4,[2],[1.0])| 59500.0|
| Weight| 5300| 0.0|(4,[0],[1.0])|(4,[0],[1.0])| 5900.000000000004|
| Height| 34500| 3.0|(4,[3],[1.0])|(4,[3],[1.0])| 34500.0|
| Weight| 6500| 0.0|(4,[0],[1.0])|(4,[0],[1.0])| 5900.000000000004|
+--------+-----------+---------------+-------------+-------------+------------------+
from where you can see that:
The features now do not include the values of the output variable Accumulator, as it should be indeed; in fact, as I had argued above, features is now identical with EncodedItem, making the VectorAssembler redundant, exactly as we should expect since we only have one single feature.
The prediction values are now identical for the same values of NameItem, again as we would expect them to be, plus that they are less accurate and thus more realistic.
So, most certainly, your issue has to do with the vastly outdated Spark version 1.4.4 you are using. Spark has made leaps since v1.4, and you should seriously consider updating...

How can one add batch mechanism to the input function in Tensorflow tutorial overcoming tf.Sparsetensor objects?

How can one add batch mechanism to the input_fn in the TensorFlow Wide & Deep Learning Tutorial overcoming that some features are represented as tf.Sparsetensor objects?
I have made many attempts around adding tf.train.slice_input_producer and tf.train.batchto the original code (see below), but have failed miserably so far.
I would like to keep the global working of that input_fn as it is handy while while training and evaluating the model.
Can someone help, please?
def input_fn(df):
# Creates a dictionary mapping from each continuous feature column name (k) to
# the values of that column stored in a constant Tensor.
continuous_cols = {k: tf.constant(df[k].values)
for k in CONTINUOUS_COLUMNS}
# Creates a dictionary mapping from each categorical feature column name (k)
# to the values of that column stored in a tf.SparseTensor.
categorical_cols = {k: tf.SparseTensor(indices=[[i, 0] for i in range(df[k].size)],
values=df[k].values,
shape=[df[k].size, 1]) for k in CATEGORICAL_COLUMNS}
# Merges the two dictionaries into one.
feature_cols = dict(continuous_cols.items() + categorical_cols.items())
# Converts the label column into a constant Tensor.
labels = tf.constant(df[LABEL_COLUMN].values)
'''
Changes from here:
'''
features_slices, features_slices = tf.train.slice_input_producer([features_cols, labels], ...)
features_batches, labels_batches = tf.train.batch([features_slices, features_slices], ...)
# Returns the feature and label batches.
return features_batches, labels_batches

TensorBoard - Plot training and validation losses on the same graph?

Is there a way to plot both the training losses and validation losses on the same graph?
It's easy to have two separate scalar summaries for each of them individually, but this puts them on separate graphs. If both are displayed in the same graph it's much easier to see the gap between them and whether or not they have begin to diverge due to overfitting.
Is there a built in way to do this? If not, a work around way? Thank you much!
The work-around I have been doing is to use two SummaryWriter with different log dir for training set and cross-validation set respectively. And you will see something like this:
Rather than displaying the two lines separately, you can instead plot the difference between validation and training losses as its own scalar summary to track the divergence.
This doesn't give as much information on a single plot (compared with adding two summaries), but it helps with being able to compare multiple runs (and not adding multiple summaries per run).
Just for anyone coming accross this via a search: The current best practice to achieve this goal is to just use the SummaryWriter.add_scalars method from torch.utils.tensorboard. From the docs:
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
r = 5
for i in range(100):
writer.add_scalars('run_14h', {'xsinx':i*np.sin(i/r),
'xcosx':i*np.cos(i/r),
'tanx': np.tan(i/r)}, i)
writer.close()
# This call adds three values to the same scalar plot with the tag
# 'run_14h' in TensorBoard's scalar section.
Expected result:
Many thanks to niko for the tip on Custom Scalars.
I was confused by the official custom_scalar_demo.py because there's so much going on, and I had to study it for quite a while before I figured out how it worked.
To show exactly what needs to be done to create a custom scalar graph for an existing model, I put together the following complete example:
# + <
# We need these to make a custom protocol buffer to display custom scalars.
# See https://developers.google.com/protocol-buffers/
from tensorboard.plugins.custom_scalar import layout_pb2
from tensorboard.summary.v1 import custom_scalar_pb
# >
import tensorflow as tf
from time import time
import re
# Initial values
(x0, y0) = (-1, 1)
# This is useful only when re-running code (e.g. Jupyter).
tf.reset_default_graph()
# Set up variables.
x = tf.Variable(x0, name="X", dtype=tf.float64)
y = tf.Variable(y0, name="Y", dtype=tf.float64)
# Define loss function and give it a name.
loss = tf.square(x - 3*y) + tf.square(x+y)
loss = tf.identity(loss, name='my_loss')
# Define the op for performing gradient descent.
minimize_step_op = tf.train.GradientDescentOptimizer(0.092).minimize(loss)
# List quantities to summarize in a dictionary
# with (key, value) = (name, Tensor).
to_summarize = dict(
X = x,
Y_plus_2 = y + 2,
)
# Build scalar summaries corresponding to to_summarize.
# This should be done in a separate name scope to avoid name collisions
# between summaries and their respective tensors. The name scope also
# gives a title to a group of scalars in TensorBoard.
with tf.name_scope('scalar_summaries'):
my_var_summary_op = tf.summary.merge(
[tf.summary.scalar(name, var)
for name, var in to_summarize.items()
]
)
# + <
# This constructs the layout for the custom scalar, and specifies
# which scalars to plot.
layout_summary = custom_scalar_pb(
layout_pb2.Layout(category=[
layout_pb2.Category(
title='Custom scalar summary group',
chart=[
layout_pb2.Chart(
title='Custom scalar summary chart',
multiline=layout_pb2.MultilineChartContent(
# regex to select only summaries which
# are in "scalar_summaries" name scope:
tag=[r'^scalar_summaries\/']
)
)
])
])
)
# >
# Create session.
with tf.Session() as sess:
# Initialize session.
sess.run(tf.global_variables_initializer())
# Create writer.
with tf.summary.FileWriter(f'./logs/session_{int(time())}') as writer:
# Write the session graph.
writer.add_graph(sess.graph) # (not necessary for scalars)
# + <
# Define the layout for creating custom scalars in terms
# of the scalars.
writer.add_summary(layout_summary)
# >
# Main iteration loop.
for i in range(50):
current_summary = sess.run(my_var_summary_op)
writer.add_summary(current_summary, global_step=i)
writer.flush()
sess.run(minimize_step_op)
The above consists of an "original model" augmented by three blocks of code indicated by
# + <
[code to add custom scalars goes here]
# >
My "original model" has these scalars:
and this graph:
My modified model has the same scalars and graph, together with the following custom scalar:
This custom scalar chart is simply a layout which combines the original two scalar charts.
Unfortunately the resulting graph is hard to read because both values have the same color. (They are distinguished only by marker.) This is however consistent with TensorBoard's convention of having one color per log.
Explanation
The idea is as follows. You have some group of variables which you want to plot inside a single chart. As a prerequisite, TensorBoard should be plotting each variable individually under the "SCALARS" heading. (This is accomplished by creating a scalar summary for each variable, and then writing those summaries to the log. Nothing new here.)
To plot multiple variables in the same chart, we tell TensorBoard which of these summaries to group together. The specified summaries are then combined into a single chart under the "CUSTOM SCALARS" heading. We accomplish this by writing a "Layout" once at the beginning of the log. Once TensorBoard receives the layout, it automatically produces a combined chart under "CUSTOM SCALARS" as the ordinary "SCALARS" are updated.
Assuming that your "original model" is already sending your variables (as scalar summaries) to TensorBoard, the only modification necessary is to inject the layout before your main iteration loop starts. Each custom scalar chart selects which summaries to plot by means of a regular expression. Thus for each group of variables to be plotted together, it can be useful to place the variables' respective summaries into a separate name scope. (That way your regex can simply select all summaries under that name scope.)
Important Note: The op which generates the summary of a variable is distinct from the variable itself. For example, if I have a variable ns1/my_var, I can create a summary ns2/summary_op_for_myvar. The custom scalars chart layout cares only about the summary op, not the name or scope of the original variable.
Here is an example, creating two tf.summary.FileWriters which share the same root directory. Creating a tf.summary.scalar shared by the two tf.summary.FileWriters. At every time step, get the summary and update each tf.summary.FileWriter.
import os
import tqdm
import tensorflow as tf
def tb_test():
sess = tf.Session()
x = tf.placeholder(dtype=tf.float32)
summary = tf.summary.scalar('Values', x)
merged = tf.summary.merge_all()
sess.run(tf.global_variables_initializer())
writer_1 = tf.summary.FileWriter(os.path.join('tb_summary', 'train'))
writer_2 = tf.summary.FileWriter(os.path.join('tb_summary', 'eval'))
for i in tqdm.tqdm(range(200)):
# train
summary_1 = sess.run(merged, feed_dict={x: i-10})
writer_1.add_summary(summary_1, i)
# eval
summary_2 = sess.run(merged, feed_dict={x: i+10})
writer_2.add_summary(summary_2, i)
writer_1.close()
writer_2.close()
if __name__ == '__main__':
tb_test()
Here is the result:
The orange line shows the result of the evaluation stage, and correspondingly, the blue line illustrates the data of the training stage.
Also, there is a very useful post by TF team to which you can refer.
For completeness, since tensorboard 1.5.0 this is now possible.
You can use the custom scalars plugin. For this, you need to first make tensorboard layout configuration and write it to the event file. From the tensorboard example:
import tensorflow as tf
from tensorboard import summary
from tensorboard.plugins.custom_scalar import layout_pb2
# The layout has to be specified and written only once, not at every step
layout_summary = summary.custom_scalar_pb(layout_pb2.Layout(
category=[
layout_pb2.Category(
title='losses',
chart=[
layout_pb2.Chart(
title='losses',
multiline=layout_pb2.MultilineChartContent(
tag=[r'loss.*'],
)),
layout_pb2.Chart(
title='baz',
margin=layout_pb2.MarginChartContent(
series=[
layout_pb2.MarginChartContent.Series(
value='loss/baz/scalar_summary',
lower='baz_lower/baz/scalar_summary',
upper='baz_upper/baz/scalar_summary'),
],
)),
]),
layout_pb2.Category(
title='trig functions',
chart=[
layout_pb2.Chart(
title='wave trig functions',
multiline=layout_pb2.MultilineChartContent(
tag=[r'trigFunctions/cosine', r'trigFunctions/sine'],
)),
# The range of tangent is different. Let's give it its own chart.
layout_pb2.Chart(
title='tan',
multiline=layout_pb2.MultilineChartContent(
tag=[r'trigFunctions/tangent'],
)),
],
# This category we care less about. Let's make it initially closed.
closed=True),
]))
writer = tf.summary.FileWriter(".")
writer.add_summary(layout_summary)
# ...
# Add any summary data you want to the file
# ...
writer.close()
A Category is group of Charts. Each Chart corresponds to a single plot which displays several scalars together. The Chart can plot simple scalars (MultilineChartContent) or filled areas (MarginChartContent, e.g. when you want to plot the deviation of some value). The tag member of MultilineChartContent must be a list of regex-es which match the tags of the scalars that you want to group in the Chart. For more details check the proto definitions of the objects in https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/custom_scalar/layout.proto. Note that if you have several FileWriters writing to the same directory, you need to write the layout in only one of the files. Writing it to a separate file also works.
To view the data in TensorBoard, you need to open the Custom Scalars tab. Here is an example image of what to expect https://user-images.githubusercontent.com/4221553/32865784-840edf52-ca19-11e7-88bc-1806b1243e0d.png
The solution in PyTorch 1.5 with the approach of two writers:
import os
from torch.utils.tensorboard import SummaryWriter
LOG_DIR = "experiment_dir"
train_writer = SummaryWriter(os.path.join(LOG_DIR, "train"))
val_writer = SummaryWriter(os.path.join(LOG_DIR, "val"))
# while in the training loop
for k, v in train_losses.items()
train_writer.add_scalar(k, v, global_step)
# in the validation loop
for k, v in val_losses.items()
val_writer.add_scalar(k, v, global_step)
# at the end
train_writer.close()
val_writer.close()
Keys in the train_losses dict have to match those in the val_losses to be grouped on the same graph.
Tensorboard is really nice tool but by its declarative nature can make it difficult to get it to do exactly what you want.
I recommend you checkout Losswise (https://losswise.com) for plotting and keeping track of loss functions as an alternative to Tensorboard. With Losswise you specify exactly what should be graphed together:
import losswise
losswise.set_api_key("project api key")
session = losswise.Session(tag='my_special_lstm', max_iter=10)
loss_graph = session.graph('loss', kind='min')
# train an iteration of your model...
loss_graph.append(x, {'train_loss': train_loss, 'validation_loss': validation_loss})
# keep training model...
session.done()
And then you get something that looks like:
Notice how the data is fed to a particular graph explicitly via the loss_graph.append call, the data for which then appears in your project's dashboard.
In addition, for the above example Losswise would automatically generate a table with columns for min(training_loss) and min(validation_loss) so you can easily compare summary statistics across your experiments. Very useful for comparing results across a large number of experiments.
Please let me contribute with some code sample in the answer given by #Lifu Huang. First download the loger.py from here and then:
from logger import Logger
def train_model(parameters...):
N_EPOCHS = 15
# Set the logger
train_logger = Logger('./summaries/train_logs')
test_logger = Logger('./summaries/test_logs')
for epoch in range(N_EPOCHS):
# Code to get train_loss and test_loss
# ============ TensorBoard logging ============#
# Log the scalar values
train_info = {
'loss': train_loss,
}
test_info = {
'loss': test_loss,
}
for tag, value in train_info.items():
train_logger.scalar_summary(tag, value, step=epoch)
for tag, value in test_info.items():
test_logger.scalar_summary(tag, value, step=epoch)
Finally you run tensorboard --logdir=summaries/ --port=6006and you get:

How to find clusters a matrix

I have no clue about data mining or data analysis or statistical analysis but I think what I need is finding "clusters in a matrix". I have a data set of ~20k records and each has ~40 characteristics all of which are either turned on or off.
+--------+------+------+------+------+------+------+
| record | hasA | hasB | hasC | hasD | hasE | hasF |
+--------+------+------+------+------+------+------+
| foo | 1 | 0 | 1 | 0 | 0 | 0 |
| bar | 1 | 1 | 0 | 0 | 1 | 1 |
| baz | 1 | 1 | 1 | 0 | 0 | 0 |
+--------+------+------+------+------+------+------+
I'm quite convinced most of those 20k records have characteristics that fall into one of several categories. There must be means to determine how similar record 'foo' is to record 'bar'.
So, what is it that I'm actually looking at? What algorithm am I looking for?
Transform each record r into a binary vector v(r) so that i-th component of v(r) is set to 1 if r has i-th characteristic, and 0 otherwise.
Now run hierarchical clustering algorithm on this set of vectors under the Hamming distance or Jaccard distance, whichever you think is more appropriate; also make sure there's a notion of distance between clusters defined in terms of the underlying distance (see linkage criteria).
Then decide where to cut the resulting dendrogram based on common sense. Where you cut the dendrogram will affect the number of clusters.
One downside of hierarchical clustering is that it's rather slow. It takes O(n^3) time in general, so it would take quite a while on a large data set. For single- and complete-linkages you can bring the time down to O(n^2).
Hierarchical clustering is very easy to implement in languages such as Python. You can also use the implementation from the scipy library.
Example: Hierarchical Clustering in Python
Here's a code snippet to get you started. I assume S is the set of records transformed into binary vectors (i.e. each list in S corresponds to a record from your data set).
import numpy as np
import scipy
import scipy.cluster.hierarchy as sch
import matplotlib.pylab as plt
# This is the set of binary vectors, each of which would
# correspond to a record in your case.
S = [
[0, 0, 0, 1, 1], # 0
[0, 0, 0, 0, 1], # 1
[0, 0, 0, 1, 0], # 2
[1, 1, 1, 0, 0], # 3
[1, 0, 1, 0, 0], # 4
[0, 1, 1, 0, 0]] # 5
# Use Hamming distance with complete linkage.
Z = sch.linkage(sch.distance.pdist(S, metric='hamming'), 'complete')
# Compute the dendrogram
P = sch.dendrogram(Z)
plt.show()
The result is as you'd expect: cut at 0.5 to get two clusters, one of the first three vectors (which have ones at beginning, zeros at the end) and the other of the last three vectors (which have ones at the end, zeros at the beginning). Here's the image:
Hierarchical clustering starts with each vector being its own cluster. In each successive steps it merges the closest clusters. It repeats this until there is a single cluster left.
The dendrogram essentially encodes the whole clustering process. At the beginning each vector is its own cluster. Then {3} and {5} merge into {3,5} and {0} and {2} merge into {0,2}. Next, {4} and {3,5} merge into {3,4,5}, and {1} and {0,2} merge into {0,1,2}. Finally, {0,1,2} and {3,4,5} merge into {0,1,2,3,4,5}.
From the dendrogram you can usually see at which point it makes the most sense to cut---this will define your clusters.
I encourage you to experiment with various distances (e.g. Hamming distance, Jaccard distance) and linkages (e.g. single linkage, complete linkage), and various representations (e.g. binary vectors).
Are you sure you want cluster analysis?
To find similar records you don't need cluster analysis. Simply find similar records with any distance measure such as Jaccard similarity or Hamming distance (both of which are for binary data). Or cosine distance, so that you can use e.g. Lucene to find similar records fast.
To find common patterns, the use of frequent itemset mining may yield much more meaningful results, because these can work on a subset of attributes only. For example, in a supermarket, the columns Noodles, Tomato, Basil, Cheese may constitute a frequent pattern.
Most clustering algorithms attempt to divide the data into k groups. While this at first appears a good idea (get k target groups) it rarely matches what real data contains. For example customers: why would every customer belong to exactly one audience? What if the audiences are e.g. car lovers, gun lovers, football lovers, soccer moms - are you sure you don't want to allow overlap of these groups?
Furhermore, a problem with cluster analysis is that it's incredibly easy to use badly. It does not "fail hard" - you always get a result, and you might not realize that it's a bad result...

Resources