What are the performance metrics for Clustering Algorithms? [closed] - machine-learning

I'm working on k-means clustering, but unlike supervised learning I cannot figure out the performance metrics for clustering algorithms. How do I measure accuracy after training on the data?

For KMeans you can look at its inertia_ attribute, which gives you an idea of how well the k-means algorithm has worked.
from sklearn.cluster import KMeans

kmeans = KMeans(...)   # assuming you have already fitted it on your data X
kmeans.inertia_        # lower is better
Alternatively, you can call the score() function, which gives you the same quantity but with a negative sign. The convention is that a bigger score means better, but for k-means a lower inertia_ is better, so an extra negation is applied to keep the two consistent.
# Call score with data X
kmeans.score(X) # greater is better
This is the most basic way of analyzing the performance of k-means. In practice, if you make the number of clusters too high, score() will keep increasing (in other words, inertia_ will keep decreasing), because inertia_ is simply the sum of squared distances from each point to the centroid of the cluster it is assigned to. If you increase the number of clusters too much, that sum shrinks because every point gets a centroid very close to it, even though the quality of the clustering is horrible in this case. For a better analysis you should compute the silhouette score, or better yet, use a silhouette diagram.
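If you are using scikit-learn, the silhouette score is available directly; a minimal sketch, assuming X and the fitted kmeans object from above:

from sklearn.metrics import silhouette_score

# The silhouette score ranges from -1 to 1; values near 1 mean points sit
# well inside their own cluster and far from the neighbouring clusters.
silhouette_score(X, kmeans.labels_)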
You will find all of the implementations in this notebook: 09_unsupervised_learning.ipynb
The book corresponding to this repository is: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. It is a great book to learn all of these details.

Related

Batch normalization [closed]

Why does batch normalization work across different samples of the same feature instead of across different features of the same sample? Shouldn't it be a normalization over the different features? In the diagram, why do we use the first row and not the first column?
Could someone help me?
Because different features of the same object mean different things, and it is not meaningful to compute statistics over them: they can have different ranges, means, standard deviations, etc. For example, one of your features could be a person's age and another the person's height; the mean of those two values is not a meaningful number.
In classic machine learning (especially with linear models and KNN) you should normalize your features, i.e. compute the mean and standard deviation of each specific feature over the entire dataset and transform the feature to (X - mean(X)) / std(X). Batch normalization is the analogue of this applied to stochastic optimization methods like SGD (it is not meaningful to use global statistics on a mini-batch, and furthermore you want to use batch norm more often than just before the first layer). The more fundamental ideas can be found in the original paper.
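To make the row/column point concrete, here is a small NumPy sketch (toy numbers, purely illustrative) showing that batch-norm-style statistics are computed per feature across the samples in a batch, not across the features of one sample:

import numpy as np

# Toy batch: 4 samples (rows) x 3 features (columns): age, height, salary.
X = np.array([[25.0, 1.70, 50000.0],
              [40.0, 1.80, 80000.0],
              [30.0, 1.60, 60000.0],
              [35.0, 1.75, 70000.0]])

# Per-feature statistics across the batch (axis=0, i.e. down each column).
mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / std   # each column now has mean 0 and std 1

# Averaging across the features of one sample (axis=1) would mix age,
# height and salary into a single number, which is not meaningful.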

difference between classification and regression in k-nearest neighbor? [closed]

What is the difference between using k-nearest neighbours for classification and using it for regression?
And when KNN is used in a recommendation system, is that considered classification or regression?
In classification tasks, the user seeks to predict a category, which is usually represented as an integer label, but represents a category of "things". For instance, you could try to classify pictures between "cat" and "dog" and use label 0 for "cat" and 1 for "dog".
The KNN algorithm for classification will look at the k nearest neighbours of the input you are trying to make a prediction on. It will then output the most frequent label among those k examples.
In regression tasks, the user wants to predict a numerical value (usually continuous), for instance estimating the price of a house or rating how good a movie is.
In this case, the KNN algorithm collects the values associated with the k closest examples to the one you want to predict on and aggregates them into a single output. Usually you would take the average of the k neighbours' values, but you could use the median or a weighted average (or anything else that makes sense for the task at hand).
For your specific problem you could use both, but regression makes more sense to me, in order to predict some kind of "matching percentage" between the user and the item you want to recommend to them.
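To make the contrast concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier and KNeighborsRegressor on toy one-dimensional data (the data and numbers are purely illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])                 # category labels
y_reg = np.array([1.1, 1.9, 3.2, 10.5, 11.2, 11.8])    # continuous targets

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_reg)

clf.predict([[2.5]])   # majority vote among the 3 nearest labels -> 0
reg.predict([[2.5]])   # average of the 3 nearest values -> about 2.07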

How do you calculate accuracy metric for a regression problem with multiple outputs? [closed]

My CNN (Conv1D) in PyTorch has 20 inputs and 6 outputs. The predicted output is counted as "accurate" only if all 6 of them match, right? So unless all my predicted results are accurate to the 8th decimal place, will I ever be able to get decent accuracy?
The standard accuracy metric is used for classification tasks. In order to use accuracy, you have to say whether an output is one of the following: true positive (TP), true negative (TN), false positive (FP) or false negative (FN).
These classification metrics can be used to a certain extent in regression tasks, when you can apply these labels (TP, TN, FP, FN) to the outputs, for example via a simple threshold. This heavily depends on the kind of problem you are dealing with and may or may not be possible or useful.
As Andrey said, in general you want to use metrics like the mean absolute error (MAE) or the mean squared error (MSE). These metrics can be hard to interpret, though, so I would suggest looking at papers that tackle a similar problem and seeing which metrics they use to evaluate their results and compare themselves to other work.
Accuracy isn't a suitable metric for regression tasks. For regression tasks you should use such metrics as MAE, RMSE and so on.
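As a minimal sketch (assuming scikit-learn and toy arrays), MAE and MSE for a model with 6 outputs can be computed like this; by default both average over all outputs and samples:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy targets/predictions for 2 samples with 6 outputs each.
y_true = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                   [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]])
y_pred = np.array([[0.1, 0.25, 0.28, 0.4, 0.52, 0.6],
                   [0.9, 1.1, 1.25, 1.3, 1.38, 1.6]])

mean_absolute_error(y_true, y_pred)   # averaged over all 6 outputs
mean_squared_error(y_true, y_pred)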

Stream normalization for online clustering in evolving environments [closed]

TL;DR: how do you normalize stream data, given that the whole data set is not available and you are dealing with clustering in evolving environments?
Hi! I'm currently studying dynamic clustering for non-stationary data streams. I need to normalize the data because all features should have the same impact on the final clustering, but I don't know how to do it.
I need to apply a standard normalization. My initial approach was to:
Fill a buffer with initial data points
Use those data points to get mean and standard deviation
Use those measures to normalize the current data points
Send those points normalized to the algorithm one by one
Use the previous measures to keep normalizing incoming data points for a while
Every so often, recalculate the mean and standard deviation
Re-represent the current micro-cluster centroids with the new measures (having the older ones, it shouldn't be a problem to go back and normalize again)
Use the new measures to keep normalizing incoming data points for a while
And so on ....
The thing is that normalizing the data should not interfere with what the clustering algorithm does. You cannot tell the clustering algorithm 'ok, the micro clusters you have so far need to be re-normalized with this new mean and stdev'. I developed my own algorithm where I could do this, but I am also using existing algorithms (CluStream and DenStream), and it does not feel right to modify them to make this possible.
Any ideas?
TIA
As more data streams in, the estimated standardization parameters (e.g., mean and std) are updated and converge further towards the true values [1, 2, 3]. In evolving environments this is even more pronounced, as the data distributions are now time-varying too [4]. Therefore, the more recent streamed samples, which have been standardized using the more recent estimates, are more accurate and representative.
A solution is to merge the present with a partial reflection of the past by embedding a decay parameter in the update rule of your clustering algorithm. It boosts the contribution of the more recent samples, which have been standardized using the more recent distribution estimates. You can see an implementation of this idea in Apache Spark's MLlib [5, 6, 7], whose streaming k-means update is roughly c_{t+1} = (c_t·n_t·α + x_t·m_t) / (n_t·α + m_t) and n_{t+1} = n_t + m_t, where c_t is a cluster's previous centre, n_t the number of points assigned to it so far, x_t the centre of the points in the current batch, m_t their count, and α the new decay parameter; a lower α makes the algorithm favour the more recent samples more.
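The same decay idea can be applied to the standardization statistics themselves. A minimal sketch of exponentially decayed running mean/variance estimates (purely illustrative, not the Spark API; the update rule and the alpha value are assumptions):

import numpy as np

def update_stats(mean, var, batch, alpha=0.9):
    # Exponentially decayed running mean/variance; a lower alpha forgets
    # the past faster and favours the most recent batch.
    new_mean = alpha * mean + (1 - alpha) * batch.mean(axis=0)
    new_var = alpha * var + (1 - alpha) * batch.var(axis=0)
    return new_mean, new_var

rng = np.random.default_rng(0)
mean, var = np.zeros(3), np.ones(3)
for _ in range(5):                                      # pretend these are incoming batches
    batch = rng.normal(loc=2.0, scale=3.0, size=(100, 3))
    batch_std = (batch - mean) / np.sqrt(var + 1e-8)    # standardize with current estimates
    mean, var = update_stats(mean, var, batch)          # then update the estimates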
Data normalization affects clustering for algorithms that depend on the L2 distance. Therefore you can't really have a global solution to your question.
If your clustering algorithm supports it, one option would be to use clustering with a warm start, in the following way (a small sketch follows the list):
at each step, find the "evolved" clusters from scratch, using the samples re-normalized according to the new mean and std dev
do not initialize clusters randomly, but instead, use the clusters found in the previous step as represented in the new space.
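A minimal sketch of that warm-start idea with scikit-learn's KMeans (which accepts an array as init); the variable names, statistics and toy data are assumptions:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy stand-ins: prev_centers were found in the space normalized with
# old_mean/old_std; X_recent is the latest raw buffer used to re-estimate.
old_mean, old_std = np.array([10.0, 0.5]), np.array([2.0, 0.1])
prev_centers = np.array([[-1.0, -1.0], [1.0, 1.0]])
X_recent = rng.normal(loc=(12.0, 0.6), scale=(3.0, 0.2), size=(200, 2))

# 1. Re-estimate the normalization statistics and re-normalize the samples.
new_mean, new_std = X_recent.mean(axis=0), X_recent.std(axis=0)
X_norm = (X_recent - new_mean) / new_std

# 2. Re-express the old centroids in the new normalized space
#    (undo the old normalization, then apply the new one).
init_centers = (prev_centers * old_std + old_mean - new_mean) / new_std

# 3. Warm-start the clustering from those centroids instead of a random init.
kmeans = KMeans(n_clusters=len(init_centers), init=init_centers, n_init=1)
kmeans.fit(X_norm)
prev_centers = kmeans.cluster_centers_   # carried into the next step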

Ridge regression vs Lasso Regression [closed]

Is lasso regression or elastic-net regression always better than ridge regression?
I've run these regressions on a few data sets and I've always got the same result: the mean squared error is lowest for lasso regression. Is this a mere coincidence, or is this true in any case?
On the topic, James, Witten, Hastie and Tibshirani write in their book "An Introduction to Statistical Learning":
These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors that is related to the response is never known a priori for real data sets. A technique such as cross-validation can be used in order to determine which approach is better on a particular data set. (chapter 6.2)
It's different for each problem. In lasso regression, the algorithm tries to remove the extra features that are of no use by driving their coefficients to exactly zero, which sounds better because you can also train nicely with less data, but the optimization is a little harder. In ridge regression, the algorithm makes those extra features less influential without removing them completely, which is easier to optimize.
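As the quoted passage suggests, cross-validation is a practical way to compare the two on a particular data set; a minimal sketch with scikit-learn on toy data (the data, alphas and scoring choice are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Toy data: only a few of the 50 predictors carry signal.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=1.0)):
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=5)
    print(type(model).__name__, -scores.mean())   # lower MSE is better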
