k-medoids: How are new centroids picked? - machine-learning

My understanding of k-medoids is that the centroids are picked randomly from the existing points. Clusters are formed by assigning the remaining points to the nearest centroid, and the error is calculated (absolute distance).
a) How are new centroids picked? From the examples it seems that they are picked randomly, and the error is calculated again to see whether those new centroids are better or worse?
b) How do you know that you need to stop picking new centroids?

It's worth reading the Wikipedia page on the k-medoids algorithm. You are right that the k medoids are selected randomly from the n data points in the first step.
The new medoids are picked by swapping every medoid m with every non-medoid o in a loop and recalculating the total cost. If the cost increased, you undo the swap.
The algorithm stops if there is no swap for a full iteration.
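If it helps to see that loop concretely, here is a minimal PAM-style sketch in plain NumPy (my own toy code, not an optimized or reference implementation): it picks k random medoids, then repeatedly tries every (medoid, non-medoid) swap, keeps a swap only if the total cost drops, and stops after a full pass with no accepted swap.

import numpy as np

def k_medoids(X, k, rng=np.random.default_rng(0)):
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))     # random initial medoids

    def cost(meds):
        # each point contributes the distance to its nearest medoid
        return dist[:, meds].min(axis=1).sum()

    improved = True
    while improved:                        # stop after a full pass with no swap
        improved = False
        for i in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o           # try swapping medoid i with non-medoid o
                if cost(candidate) < cost(medoids):
                    medoids = candidate    # keep the swap only if the total cost drops
                    improved = True
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

X = np.random.default_rng(1).normal(size=(60, 2))
medoids, labels = k_medoids(X, k=3)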

The process for choosing the initial medoids is fairly complicated; many people seem to just use random initial centers instead.
After this, k-medoids always considers every possible change of replacing one of the medoids with a non-medoid. The best such change is then applied, if it improves the result. If no further improvements are possible, the algorithm stops.
Don't rely on vague descriptions. Read the original publications.

Before answering, a brief overview of k-medoids is needed, which I give in the first two steps below; the last two steps answer your questions.
1) The first step of k-medoids is that k centroids/medoids are randomly picked from your dataset. Suppose your dataset contains n points; these k medoids are chosen from those n points. You can choose them randomly, or you can use a smart initialization like the one used in k-means++.
2) The second step is the assignment step, wherein you take each point in your dataset, compute its distance to each of the k medoids, find the closest one, and add the point to the set S_j corresponding to the centroid C_j (as we have k centroids C_1, C_2, ..., C_k).
3) The third step of the algorithm is the update step. This answers your question about how new centroids are picked after they have been initialized. I will explain the update step with an example to make it clearer.
Suppose you have ten points in your dataset, (x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10), and the problem calls for 2 clusters, so we first choose 2 centroids/medoids randomly from these ten points; let's say they are (x_2, x_5). The assignment step remains the same. In the update step, you take the points that are not medoids (points other than x_2 and x_5), and for each candidate you repeat the assignment step and recompute the loss, i.e. the sum of the (squared) distances of the points x_i from their medoids. You then compare the loss obtained with medoid x_2 against the loss obtained with the non-medoid point. If the loss is reduced, you swap x_2 with the non-medoid point that reduced it; if the loss is not reduced, you keep x_2 as your medoid and do not swap.
So there can be a lot of swaps in the update step, which also makes this algorithm computationally expensive.
4) The last step answers your second question, i.e. when one should stop picking new centroids. When you compare the loss of the medoid/centroid point with the loss computed for a non-medoid: if the difference is negligible, you can stop and keep the medoid point as the centroid; but if the loss reduction is significant, you keep performing swaps until the loss no longer decreases.
I hope that answers your questions.

Related

Regarding the backward pass of a convolution layer in deep learning

I understand how to compute the forward part in deep learning. Now I want to understand the backward part. Let's take X(2,2) as an example. The backward pass at position X(2,2) can be computed as in the figure below.
My question is: where is dE/dY (such as dE/dY(1,1), dE/dY(1,2), ...) in the formula? How is it computed at the first iteration?
SHORT ANSWER
Those terms are in the final expansion at the bottom of the slide; they contribute to the summation for dE/dX(2,2). In your first back-propagation, you start at the end and work backwards (hence the name) -- and the Y values are the ground-truth labels. So much for computing them. :-)
LONG ANSWER
I'll keep this in more abstract, natural-language terms. I'm hopeful that the alternate explanation will help you see the big picture as well as sort out the math.
You start the training with assigned weights that may or may not be at all related to the ground truth (labels). You move blindly forward, making predictions at each layer based on naive faith in those weights. The Y(i,j) values are the resulting meta-pixels from that faith.
Then you hit the labels at the end. You work backward, adjusting each weight. Note that, at the last layer, the Y values are the ground-truth labels.
At each layer, you mathematically deal with two factors:
How far off was this prediction?
How heavily did this parameter contribute to that prediction?
You adjust the X-to-Y weight by "off * weight * learning_rate".
When you complete that for layer N, you back up to layer N-1 and repeat.
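To make the bookkeeping concrete, here is a toy NumPy sketch of one convolution layer's backward pass (my own example using cross-correlation with a 3x3 input and a 2x2 kernel, not the exact notation from your slide). The key point for the question: dE/dY is not computed inside this layer at all; it is handed down by whatever sits above the layer (ultimately the loss at the output), and the layer combines it with the kernel to produce dE/dX and dE/dW.

import numpy as np

def conv2d_forward(X, W):
    # Valid cross-correlation: Y[i, j] = sum_{a,b} X[i+a, j+b] * W[a, b]
    H, Wd = X.shape
    k = W.shape[0]
    Y = np.zeros((H - k + 1, Wd - k + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = np.sum(X[i:i + k, j:j + k] * W)
    return Y

def conv2d_backward(X, W, dY):
    # Given dE/dY from the layer above, return dE/dX and dE/dW
    dX = np.zeros_like(X)
    dW = np.zeros_like(W)
    k = W.shape[0]
    for i in range(dY.shape[0]):
        for j in range(dY.shape[1]):
            # Output pixel Y[i, j] saw the patch X[i:i+k, j:j+k], so its
            # upstream gradient dY[i, j] flows back into that patch.
            dX[i:i + k, j:j + k] += dY[i, j] * W
            dW += dY[i, j] * X[i:i + k, j:j + k]
    return dX, dW

X = np.arange(9, dtype=float).reshape(3, 3)
W = np.array([[1.0, 0.0], [0.0, -1.0]])
Y = conv2d_forward(X, W)
dY = np.ones_like(Y)                 # pretend the layer above sent all-ones gradients
dX, dW = conv2d_backward(X, W, dY)
print(dX[1, 1])                      # dE/dX(2,2) in the question's 1-based indexing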
PROGRESSION
Whether you initialize your weights with fixed or random values (I generally recommend the latter), you'll notice that there's really not much progress in the early iterations. Since this is slow adjustment from guess-work weights, it takes several iterations to get a glimmer of useful learning into the last layers. The first layers are still cluelessly thrashing at this point. The loss function will bounce around close to its initial values for a while. For instance, with GoogLeNet's image recognition, this flailing lasts for about 30 epochs.
Then, finally, you get some valid learning in the latter layers, the patterns stabilize enough that some consistency percolates back to the early layers. At this point, you'll see the loss function drop to a "directed experimentation" level. From there, the progression depends a lot on the paradigm and texture of the problem: some have a sharp drop, then a gradual convergence; others have a more gradual drop, almost an exponential decay to convergence; more complex topologies have additional sharp drops as middle or early phases "get their footing".

What is the meaning of neural network iterations, gradient descent steps, epoch, and batch size?

Could you explain the terms below? They really confuse me.
1. iterations
2. gradient descent steps
3. epoch
4. batch size
In the neural network terminology:
one epoch = one forward pass and one backward pass of all the training examples
batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).
Example: if you have 1000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.
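A throwaway Python sketch of that arithmetic and of how the terms nest (the variable names are mine, not from any library):

import math

num_examples = 1000
batch_size = 500
iterations_per_epoch = math.ceil(num_examples / batch_size)   # -> 2
num_epochs = 3

for epoch in range(num_epochs):                # one epoch = one pass over all examples
    for it in range(iterations_per_epoch):     # one iteration = one forward + one backward pass
        start = it * batch_size
        batch = range(start, min(start + batch_size, num_examples))
        # here you would run the forward pass on `batch`, the backward pass,
        # and then apply one gradient-descent step

print(iterations_per_epoch)                    # 2 iterations per epoch, 6 updates in total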
Gradient Descent:
Please watch this lecture:
https://www.coursera.org/learn/machine-learning/lecture/8SpIM/gradient-descent (Source: Andrew ng, Coursera)
So let's see what gradient descent does. Imagine this is like the landscape of some grassy park, with two hills like so, and I want us to imagine that you are physically standing at that point on the hill, on this little red hill in your park.
Turns out, that if you're standing at that point on the hill, you look all around and you find that the best direction is to take a little step downhill is roughly that direction.
Okay, and now you're at this new point on your hill. You're gonna, again, look all around and say what direction should I step in order to take a little baby step downhill? And if you do that and take another step, you take a step in that direction.
And then you keep going. From this new point you look around, decide what direction would take you downhill most quickly. Take another step, another step, and so on until you converge to this local minimum down here.
In gradient descent, what we're going to do is we're going to spin 360 degrees around, just look all around us, and ask, if I were to take a little baby step in some direction, and I want to go downhill as quickly as possible, what direction do I take that little baby step in? If I wanna go down, so I wanna physically walk down this hill as rapidly as possible.
I hope you now understand the significance of gradient descent steps. Hope this is helpful!
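As a tiny illustration (my own toy function, not from the lecture), each "baby step downhill" is simply a move against the gradient of the loss:

import numpy as np

def loss(w):
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2      # minimum at w = (3, -1)

def grad(w):
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

w = np.array([0.0, 0.0])         # start somewhere on the "hill"
learning_rate = 0.1
for step in range(100):          # each loop body is one gradient-descent step
    w = w - learning_rate * grad(w)

print(w, loss(w))                # w is now close to (3, -1), the bottom of the hill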
In addition to Sayali's great answer, here are the definitions from the Keras Python package:
Sample: one element of a dataset. Example: one image is a sample in a convolutional network. Example: one audio file is a sample for a speech recognition model.
Batch: a set of N samples. The samples in a batch are processed independently, in parallel. If training, a batch results in only one update to the model. A batch generally approximates the distribution of the input data better than a single input. The larger the batch, the better the approximation; however, it is also true that the batch will take longer to process and will still result in only one update. For inference (evaluate/predict), it is recommended to pick a batch size that is as large as you can afford without going out of memory (since larger batches will usually result in faster evaluating/prediction).
Epoch: an arbitrary cutoff, generally defined as "one pass over the entire dataset", used to separate training into distinct phases, which is useful for logging and periodic evaluation.
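A minimal sketch of where these terms show up in practice, assuming TensorFlow's bundled Keras is installed (the data here is random and only meant to show the batch_size/epochs bookkeeping):

import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 20).astype("float32")   # 1000 samples, 20 features each
y = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# 1000 samples / batch_size 500 = 2 iterations (weight updates) per epoch,
# so 3 epochs perform 6 gradient-descent steps in total
model.fit(x, y, batch_size=500, epochs=3)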

Algorithm suitable for a region-growing image segmentation based on the minimization of a metric

Is there a pixel-based region-growing algorithm that can be employed for the extraction of features (segmentation) from an image by adding pixels to the seed based on the minimization of a certain metric? Potentially, a pixel can be removed if the metric is not optimized when this pixel is added (i.e. it should be possible to backtrack and go back to the seed obtained in the previous iterations).
I'll try to explain further my objectives:
This algorithm starts from a central pixel selected as an initial seed on the image.
Afterwards, each of the 4 neighbors is explored (right, left, bottom and top neighbors) separately, to see if the metric is optimized by growing the seed in the selected direction.
A neighboring pixel might not optimize the metric immediately, even if the seed created by adding this pixel will be optimal in future iterations.
There is a possibility that a neighboring pixel is added to the seed but is removed later, if the obtained seed is not optimal.
Can anyone suggest an artificial intelligence technique (or a greedy approach) that is adequate for solving this kind of problem? Also, what would be a good criterion for judging that the addition of a pixel will optimize the metric, even if that only happens in future iterations?
P.S.: I started implementing what's explained above in Python but got stuck on the issue of determining whether a path (neighboring pixel) is worth exploring or not. Right now, I add a neighboring pixel only if the seed it produces improves (i.e. minimizes) the error relative to the metric. However, even if adding the right or left neighbor does not optimize the metric immediately, one of these two paths might lead to the optimal solution in the future (as explained in the third objective).
You've basically outlined the most successful algorithm you could get with this approach. Its success will depend heavily on the metric you use to add/remove pixels, but there are a few things you can do to emulate the behavior you want.
Definitions
We'll call the metric we're optimizing M, where M(R) is the metric's value for a region R, and a region R is some collection of pixels. I will assume that optimizing the metric means obtaining the largest possible value of M, but this approach works if the goal is to minimize M as well.
Methodology
This approach is going to be slightly backwards compared to your original outline, but it should satisfy both requirements: adding pixels that lie on non-optimal paths from the seed, and removing pixels that do not contribute significantly to the optimization.
We will begin at a seed s, but instead of evaluating paths as we go, we will add all pixels in the image (or up to the maximum feature size) iteratively to our region. At each step we will assign the added pixel a value based on how much it improves the metric for the current region, M(p). This is not the same as the value of the region containing the pixel (M(R) where p is in R); rather, it is the difference between the value of the region containing the pixel and the value of the region before the pixel was added (M(p) = M(R) - M(R') where R = R' + p). If you have the capacity to evaluate a single pixel, you could simply use that instead.
The next change is to include a regularization term in M(R) that penalizes the score based on the number of pixels included: N(R) = M(R) - a * |R|, where a is some arbitrary positive constant and |R| is the cardinality (number of pixels) of our region. Note: if the goal is to minimize M, then a should be negative. This has the effect of penalizing the region's score if it includes too many pixels.
Finally, after all pixels have been added to the region and N(p) has been evaluated for each pixel, we iterate over the region again. This time we begin at the last pixel added and iterate backwards over our set of pixels, ending at the seed s. At each iteration we determine the score of the region, N(R). If N(R) has decreased since the last iteration, we remove the pixel p with the lowest score N(p). This should have the effect of keeping the smallest number of pixels in the region that contribute the most to the score.
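For what it's worth, here is a rough Python sketch of that grow-then-prune idea. The metric callable, the penalty weight a, and the max_size cap are placeholders you would supply, and the pruning rule is simplified (a pixel is dropped whenever removing it does not lower the penalized score), so treat it as a starting point rather than a faithful implementation of the exact procedure above.

import numpy as np

def neighbors4(p, shape):
    # 4-connected neighbors (right, left, bottom, top) that lie inside the image
    r, c = p
    cand = [(r, c + 1), (r, c - 1), (r + 1, c), (r - 1, c)]
    return [(i, j) for i, j in cand if 0 <= i < shape[0] and 0 <= j < shape[1]]

def grow_and_prune(img, seed, metric, a=0.1, max_size=200):
    # metric(img, region_set) -> float to MAXIMIZE; `a` penalizes region size
    region = [seed]                        # kept in the order pixels were added
    in_region = {seed}
    frontier = set(neighbors4(seed, img.shape))

    # Phase 1: greedy growth -- always add the frontier pixel with the best
    # marginal gain M(p) = M(R + p) - M(R), even if that gain is negative.
    while frontier and len(region) < max_size:
        base = metric(img, in_region)
        gains = {p: metric(img, in_region | {p}) - base for p in frontier}
        best = max(gains, key=gains.get)
        region.append(best)
        in_region.add(best)
        frontier.discard(best)
        frontier |= {q for q in neighbors4(best, img.shape) if q not in in_region}

    # Phase 2: backward pruning with the size-penalized score N(R) = M(R) - a*|R|
    def N(pixels):
        return metric(img, set(pixels)) - a * len(pixels)

    for p in reversed(region[1:]):         # never remove the seed itself
        without = [q for q in region if q != p]
        if N(without) >= N(region):        # removing p does not hurt the score
            region = without
    return set(region)

Here metric is whatever scoring function you are already using, and a / max_size are knobs you would have to tune for your images.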
Additional Considerations
If the remaining pixels lie on non-contiguous paths after pruning, you could run a secondary algorithm to add in adjoining pixels. You'll need to do some testing to determine an optimal value of a such that enough pixels are kept to reconstruct the building, but not so many that every pixel from the image is included.
My Opinion (that you didn't ask for)
In general I think you would have more luck with more robust algorithms such as Convolutional Neural Networks for feature classification. They'll likely be faster and definitely more accurate than the algorithm described above.

Why do we maximize variance during Principal Component Analysis?

I'm trying to read up on PCA and saw that the objective is to maximize the variance. I don't quite understand why. Any explanation of related topics would also be helpful.
Variance is a measure of the "variability" of the data you have. Potentially the number of components is infinite (actually, once the data is in numeric form it is at most equal to the rank of the matrix, as @jazibjamil pointed out), so you want to "squeeze" the most information into each component of the finite set you build.
If, to exaggerate, you were to select a single principal component, you would want it to account for the most variability possible: hence the search for maximum variance, so that the one component collects the most "uniqueness" from the data set.
Note that PCA does not actually increase the variance of your data. Rather, it rotates the data set in such a way as to align the directions in which it is spread out the most with the principal axes. This enables you to remove those dimensions along which the data is almost flat. This decreases the dimensionality of the data while keeping the variance (or spread) among the points as close to the original as possible.
Maximizing the component vector variances is the same as maximizing the 'uniqueness' of those vectors. Thus your vectors are as distant from each other as possible. That way, if you only use the first N component vectors, you're going to capture more of the space with highly varying vectors than with similar vectors. Think about what "principal component" actually means.
Take for example a situation where you have 2 lines that are orthogonal in a 3D space. You can capture the environment much more completely with those orthogonal lines than with 2 lines that are parallel (or nearly parallel). When applied to very high-dimensional spaces using very few vectors, this becomes a much more important relationship among the vectors to maintain. In a linear algebra sense, you want PCA to produce independent rows; otherwise some of those rows will be redundant.
See this PDF from Princeton's CS Department for a basic explanation.
Maximizing variance basically means choosing the axes that capture the maximum spread of the data points. Why? Because the direction of those axes is what really matters: it explains the correlations in the data, and later on we compress/project the points along those axes to get rid of some dimensions.
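A small NumPy illustration of that view (not tied to any particular PCA library): the first principal direction of a correlated 2-D cloud carries most of the total variance, so projecting onto it keeps most of the spread.

import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: most of the spread lies along one diagonal direction.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [2.0, 0.5]])
Xc = X - X.mean(axis=0)                      # center the data

# Principal directions = right singular vectors of the centered data
# (equivalently, eigenvectors of the covariance matrix).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
var_per_component = S**2 / (len(X) - 1)      # variance captured by each direction

print(var_per_component)                     # first component carries most of the variance
Z = Xc @ Vt[0]                               # 1-D projection that keeps the most spread
print(Z.var(ddof=1), Xc.var(axis=0, ddof=1).sum())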

OpenCV RANSAC returns the same transformation every time

I am confused as to how to use the OpenCV findHomography method to compute the optimal transformation.
The way I use it is as follows:
cv::Mat h = cv::findHomography(src, dst, CV_RANSAC, 5.f);
No matter how many times I run it, I get the same transformation matrix. I thought RANSAC is supposed to randomly select a subset of points to do the fitting, so why does it return the same transformation matrix every time? Is it related to some random number initialization? How can I make this behaviour actually random?
Secondly, how can I tune the number of RANSAC iterations in this setup? Usually the number of iterations is based on inlier ratios and things like that.
I thought RANSAC is supposed to randomly select a subset of points to do the fitting, so why does it return the same transformation matrix every time?
RANSAC repeatedly selects a subset of points, then fits a model based upon them, then checks how many data points in the data set are inliers given that fitted model. Once it's done that lots of times, it picks the fitted model that had the most inliers, and refits the model to those inliers.
For any given data set, set of variable model parameters, and rule for what constitutes an inlier, there will exist one or more (but often exactly one) largest possible set of "inliers". For example, given this data set (image from Wikipedia):
... then with some sort of reasonable definition of an outlier, the maximal possible set of inliers any linear model can have is the one in blue below:
Let's call the set of blue points above - the maximal possible set of inliers - I.
If you randomly select a small number of points (e.g. two or three) and draw a line of best fit through them, it's hopefully intuitively obvious that it'll only take you a handful of tries until you hit an iteration where:
all the randomly-selected points you pick are from I, and so
the line of best fit through those points is roughly equal to the line of best fit in the graph above, and so
the set of inliers found on that iteration is exactly I
From that iteration onwards, all further iterations are a waste that cannot possibly improve the model further (although RANSAC has no way of knowing this, since it doesn't magically know when it's found the maximal set of inliers).
If you have a large enough number of iterations relative to the size of your data set, and a large enough proportion of the data set are inliers, then you will eventually find the maximal set of inliers with a close to 100% chance every time you run RANSAC. As a consequence, RANSAC will (almost) always output exactly the same model.
And that's a good thing! Often, you want RANSAC to find the absolute maximal set of inliers and don't want to settle for anything less. If you're getting different results each time you run RANSAC in such a scenario, that's a sign that you want to increase your number of iterations.
(Of course, in the case above we're talking about trying to fit a line through points in a 2D plane, which isn't what findHomography does, but the principle is the same; there will typically still be a single maximal set of inliers and eventually RANSAC will find it.)
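To see that behaviour in isolation, here is a toy NumPy sketch of RANSAC line fitting (2-D points rather than homographies, with my own thresholds): given enough iterations it converges on the same maximal inlier set, and hence the same line, on almost every run.

import numpy as np

rng = np.random.default_rng()
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)     # inliers near a line
y[::10] += rng.uniform(5, 15, size=y[::10].size)            # sprinkle in outliers

def ransac_line(x, y, iters=100, thresh=0.5):
    best_inliers = np.zeros(x.size, dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(x.size, size=2, replace=False)     # minimal random sample
        if x[i] == x[j]:
            continue
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        inliers = np.abs(y - (slope * x + intercept)) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refit on the largest inlier set found; with enough iterations this set
    # (and therefore the returned line) is the same on almost every run
    return np.polyfit(x[best_inliers], y[best_inliers], 1)

print(ransac_line(x, y))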
How can I make this behaviour actually random?
Decrease the number of iterations (maxIters) so that RANSAC sometimes fails to find the maximal set of inliers.
But there's generally no reason to do this besides pure intellectual curiosity; you'll basically be deliberately telling RANSAC to output an inferior model.
findHomography will already give you the optimal transformation. The real question is about the meaning of optimal.
For example, with RANSAC you'll have the model with maximum number of inliers, while with LMEDS you'll have the model with minimum median error.
You can modify the default behavior by (see the Python sketch below):
changing the number of RANSAC iterations by setting maxIters (the maximum number allowed is 2000)
decreasing (or increasing) the ransacReprojThreshold used to validate inliers and outliers (usually between 1 and 10).
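A hedged Python sketch of those two knobs, assuming a reasonably recent OpenCV whose Python binding exposes the maxIters and confidence arguments of cv2.findHomography (the point data here is synthetic):

import numpy as np
import cv2

# src/dst: matched point coordinates, shape (N, 1, 2), float32
src = np.random.rand(50, 1, 2).astype(np.float32) * 100
dst = src + 5.0   # a pure translation, just to have something consistent to fit

H, mask = cv2.findHomography(
    src, dst,
    method=cv2.RANSAC,
    ransacReprojThreshold=5.0,   # inlier/outlier cut-off (usually 1-10)
    maxIters=2000,               # RANSAC iteration budget (2000 is the cap)
    confidence=0.995,
)
print(H)                           # ~identity with a (5, 5) translation
print(int(mask.sum()), "inliers")  # mask flags which matches were kept as inliers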
Regarding your questions.
No matter how many times I run it, I get the same transformation matrix.
Probably your points are good enough that you always find the optimal model.
I thought RANSAC is supposed to randomly select a subset of points to do the fitting
RANSAC (RANdom SAmple Consensus) first selects a random subset, then checks whether the model built with these points is good enough. If not, it selects another random subset.
How can I make this behaviour actually random?
I can't imagine a scenario where this would be useful, but you can randomly select 4 pairs of points from src and dst and use getPerspectiveTransform. Unless your points are perfect, you'll get a different matrix for each subset.
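A short sketch of that idea (synthetic matches, my own noise level): sample 4 random correspondences and fit them exactly with cv2.getPerspectiveTransform, so each run generally yields a different matrix.

import numpy as np
import cv2

rng = np.random.default_rng()
src = (rng.random((50, 2)) * 100).astype(np.float32)
dst = src + rng.normal(scale=0.5, size=src.shape).astype(np.float32)   # noisy matches

idx = rng.choice(len(src), size=4, replace=False)     # a random minimal sample of 4 pairs
H = cv2.getPerspectiveTransform(src[idx], dst[idx])    # exact fit to those 4 pairs
print(H)   # re-running picks a different subset, hence (with noise) a different matrix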

Resources