When using KNeighborsClassifier, what is the motivation for using weights="distance"?
According to sklearn docs:
‘distance’ : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
What is the motivation for using this?
The idea of a nearest-neighbors classifier is to consider those points of the training set which are close to the point you want to classify, and guess this point's class based on their known class labels.
If all these close training points have the same label, the result is clear. But what if they don't all have the same label? You could take their most common label, but this may not always be the best guess.
For example, imagine one training point with label A being very close to the point you want to classify, while two training points with label B are somewhat further away, but still close. Should the new point be labelled A or B? Weighting the points by how close they are (i.e. by the inverse of their distance) provides an objective way to answer this question.
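A minimal sketch of exactly that scenario with scikit-learn; the coordinates and labels below are made up purely for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: one class-A point very close to the query,
# two class-B points somewhat further away (values are made up).
X_train = np.array([[0.1], [0.8], [0.9]])
y_train = np.array(["A", "B", "B"])
query = np.array([[0.0]])

for w in ("uniform", "distance"):
    clf = KNeighborsClassifier(n_neighbors=3, weights=w)
    clf.fit(X_train, y_train)
    print(w, clf.predict(query))
# 'uniform'  -> B (plain majority vote: two B's beat one A)
# 'distance' -> A (the nearby A point outweighs the two distant B's)
```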
I have an accident location dataset. I have applied several clustering algorithms to this dataset using the latitude and longitude columns. Now I would like to measure the accuracy of the different clustering algorithms separately to compare them.
I want to apply the confusion matrix described in this article.
But I am not able to understand what I should consider as a label. I have made my clusters using only the two columns latitude and longitude. Can anyone guide me, please? I have the code, but it's not clear to me what the label or class label is in my case.
In a confusion matrix you provide two sets of labels for each entry. One of these labels is the cluster assignment generated by the clustering you did. The second label can be the ground truth, which allows you to determine accuracy/precision.
Your case sounds like there is no ground truth, so you can't compare for accuracy. You CAN use the result of one of the different algorithms you used as the second set of labels, to compare the result between these two clusterings.
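As a sketch of that idea, assuming the coordinates and two cluster assignments are already available (all variable names and parameter values here are illustrative), you could quantify the agreement between two clusterings like this:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

# Hypothetical accident coordinates; replace with your latitude/longitude columns.
coords = np.random.RandomState(0).uniform(size=(200, 2))

labels_kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(coords)
labels_dbscan = DBSCAN(eps=0.1, min_samples=5).fit_predict(coords)

# Agreement between the two clusterings (1.0 = identical partitions).
print(adjusted_rand_score(labels_kmeans, labels_dbscan))

# Contingency table: rows = K-means clusters, columns = DBSCAN clusters.
print(contingency_matrix(labels_kmeans, labels_dbscan))
```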
I have an explanatory variable x and a response variable y. I am trying to find which power of the feature I should train with. You can ignore the colors for my question. The scatter data is from the sensor, and the line plot is the theoretical curve from the lab, which you can also ignore for my question.
For this answer, I understand you want to obtain some polynomial curve going through the croissant-shaped zone where the points are dense.
Also, I assume that the independent variable is on the horizontal axis, while the dependent one is on the vertical axis. Otherwise, as you can see from the blue line, there is no function that could give you this.
Now, to select the degree of the polynomial, you can use stepwise regression.
This means running the regression while adding or removing one feature at a time (i.e. increasing or decreasing the degree of the polynomial in this case) and calculating a score such as AIC, BIC, or even adjusted R², to assess whether adding or removing that feature is worth it.
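A rough sketch of this procedure on made-up data, fitting polynomials of increasing degree with numpy and comparing them by AIC under a Gaussian error model:

```python
import numpy as np

# Made-up data standing in for the sensor measurements.
rng = np.random.RandomState(0)
x = np.linspace(0, 1, 200)
y = 2 * x**3 - x + rng.normal(scale=0.05, size=x.shape)

def aic_for_degree(x, y, degree):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    n = len(y)
    rss = np.sum(residuals**2)
    k = degree + 2  # polynomial coefficients plus the noise variance
    # AIC under a Gaussian error model: n*log(RSS/n) + 2k (up to a constant).
    return n * np.log(rss / n) + 2 * k

for d in range(1, 7):
    print(d, round(aic_for_degree(x, y, d), 1))
# Pick the degree with the lowest AIC; increase it only while the score keeps dropping.
```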
Why is it wrong to think that silhouette_score only needs the data, since it "outputs a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation)"?
However, I also need to input the labels (which the function itself computes); so why are the labels a necessary input?
how similar an object is to its own cluster
In order to compute the silhouette, you need to know to which cluster your samples belong.
Also:
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of.
You need the labels to know what "intra-cluster" and "nearest-cluster" mean.
Silhouette_score is a metric for clustering quality, not a clustering algorithm. It considers both the inter-cluster and intra-cluster distances.
For that calculation to happen, you need to supply both the data and the cluster labels (estimated by an unsupervised method such as K-means).
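A minimal sketch of that workflow on synthetic data: the clustering step produces the labels, and the metric then needs both the data and those labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# The clustering step produces the labels...
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# ...and the metric needs both X (to compute distances) and labels
# (to know which distances are intra-cluster vs. nearest-cluster).
print(silhouette_score(X, labels))
```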
I have a bunch of gray-scale images decomposed into superpixels. Each superpixel in these images has a label in the range [0, 1]. You can see a sample image below.
Here is the challenge: I want the spatially (locally) neighboring superpixels to have consistent labels (close in value).
I'm kind of interested in smoothing local labels but do not want to apply Gaussian smoothing functions or whatever, as some colleagues suggested. I have also heard about Conditional Random Field (CRF). Is it helpful?
Any suggestion would be welcome.
I'm kind of interested in smoothing local labels but do not want to apply Gaussian smoothing functions or whatever, as some colleagues suggested.
And why is that? Why do you not consider the helpful advice of your colleagues, who are actually right? Applying a smoothing function is the most reasonable way to go.
I have also heard about Conditional Random Field (CRF). Is it helpful?
This also suggests that you should go with your colleagues' advice, as a CRF has nothing to do with your problem. A CRF is a classifier (a sequence classifier, to be exact) requiring labeled examples to learn from, and has nothing to do with the setting presented.
What are typical approaches?
The exact thing proposed by your colleagues: define a smoothing function and apply it to your function values and their neighbourhoods. (I will not use the term "labels", as it is misleading; you have continuous values in [0, 1], and "label" denotes a categorical variable in machine learning.)
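A minimal sketch of such a smoothing pass, assuming the values and the superpixel adjacency are already available (the value array, the neighbours dictionary, and the blending weight alpha are all made up for illustration):

```python
import numpy as np

# Hypothetical setup: values[i] is the current value of superpixel i in [0, 1],
# and neighbours[i] lists the indices of its spatially adjacent superpixels.
values = np.array([0.9, 0.8, 0.1, 0.2, 0.85])
neighbours = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0]}

def smooth(values, neighbours, alpha=0.5):
    """One smoothing pass: blend each value with the mean of its neighbours."""
    smoothed = values.copy()
    for i, js in neighbours.items():
        smoothed[i] = (1 - alpha) * values[i] + alpha * np.mean(values[js])
    return smoothed

print(smooth(values, neighbours))  # repeat the pass until the values stop changing much
```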
Another approach would be to define some optimization problem, where staying close to your current assignment of values is one goal and "closeness" between neighbouring values is the second, for example:
Let us assume that you have points with values {(x_i, y_i)}_{i=1}^N and that n(x) returns indices of neighbouring points of x.
Consequently you are trying to find {a_i}_{i=1}^N such that they minimize
SUM_{i=1}^N (y_i - a_i)^2 + C * SUM_{i=1}^N SUM_{j \in n(x_i)} (a_i - a_j)^2
where the first sum measures closeness to the current values, the second (double) sum measures closeness to the neighbouring values, and C is a constant weighting the two parts.
You can solve the above optimization problem with many techniques, for example with the scipy.optimize.minimize function.
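A small sketch of that route, reusing the made-up values and neighbourhood structure from the smoothing example above (C and the bounds are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up current values y_i for 5 superpixels and a toy neighbourhood structure.
y = np.array([0.9, 0.8, 0.1, 0.2, 0.85])
neighbours = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0]}
C = 0.5  # weight of the smoothness term

def objective(a):
    data_term = np.sum((y - a) ** 2)
    smooth_term = sum((a[i] - a[j]) ** 2
                      for i, js in neighbours.items() for j in js)
    return data_term + C * smooth_term

result = minimize(objective, x0=y.copy(), method="L-BFGS-B",
                  bounds=[(0.0, 1.0)] * len(y))
print(result.x)  # smoothed values, pulled towards their neighbours
```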
I am not sure that your request makes any sense.
Having close label values for nearby superpixels is trivial: take some smooth function of (X, Y), such as a constant or affine function taking values in [0, 1], and assign the function value to the superpixel centered at (X, Y).
You could also take the distance function from any point in the plane.
But this is of no use as it is unrelated to the image content.
I have a concern about support vector machines, namely their classification scores:
Do these classification scores have an upper bound?
I think not, since an SVM is just a hyperplane, and the score is basically a point's distance to that hyperplane. Without restrictions, a point could lie anywhere in the space, and thus the distance does not have any bound, does it?
I am asking, because I have read the following line:
"When decision scores are bounded — and SVM scores are bounded by the margin — ..."
Could you explain what is meant by that? I don't see how the margin is a bound on the decision score...
Thanks for your help, I appreciate it!
Your intuition is correct. Whatever you read is misleading at best or plain wrong (some context is required in any case). SVM decision values do not have an upper bound; they depend entirely on the test instances.
SVM decision values are a linear combination of inner products in feature space of the test instance and the support vectors. If the test instance has infinite norm, these inner products will be infinite as well.
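A quick sketch with scikit-learn on synthetic data illustrates the point: scaling a test instance further away from the separating hyperplane makes its decision value grow without bound, margin or not.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable blobs (synthetic data).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=-2, size=(50, 2)), rng.normal(loc=+2, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear").fit(X, y)

# Push a test instance further and further from the hyperplane:
for scale in (1, 10, 100, 1000):
    point = np.array([[2.0, 2.0]]) * scale
    print(scale, clf.decision_function(point))
# The decision value keeps growing with the norm of the test instance,
# so it is not bounded by the margin.
```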