I am attempting to estimate the similarity between three different entities (here expressed as curves).
One of the curves represents a "teacher" (green curve) and the other two are "students".
While researching how to solve this problem, I have come across multiple techniques:
Procrustes Analysis (see "Procrustes Analysis with NumPy?")
Peak Finding (see "Peak finding algorithm")
Minkowski Distance (to penalize outliers more heavily)
All three methods have their own advantages and disadvantages, but none of them seems to help me with the problem demonstrated in the image:
I "know" that "student 3" (orange curve) is closer to the "teacher", however distance wise "student 5" is measured as closest one
Peak estimations works well for sharp edges, and it does not perform well here.
I do not have a background in signal processing (which is what the problem appears to be requiring), and I would appreciate general suggestions/techniques on how to address these types of problems.
This problem isn't necessarily related to signal processing but to curve fitting or optimization in general. When you say that student 3 is "closer", you have to define "closeness". By using a pre-defined distance function like you did, you have arbitrarily chosen a distance measure which isn't necessarily suited to your needs. Estimating from the drawing, I think that by using Euclidean distance you'll get what you want (that student 3 is closer).
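If the curves are sampled at the same x positions, that Euclidean distance is a one-liner; a minimal sketch, with made-up values standing in for the teacher and the two students:

```python
import numpy as np

# Hypothetical curves, all sampled at the same x positions (values are made up).
teacher   = np.array([0.0, 1.0, 3.0, 2.0, 1.0, 0.5])
student_3 = np.array([0.1, 0.9, 2.8, 2.1, 1.1, 0.4])
student_5 = np.array([0.0, 2.0, 2.0, 2.0, 2.0, 0.5])

def euclidean(a, b):
    """Plain Euclidean (L2) distance between two equally sampled curves."""
    return np.sqrt(np.sum((a - b) ** 2))

print("teacher vs student 3:", euclidean(teacher, student_3))
print("teacher vs student 5:", euclidean(teacher, student_5))
```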
I'm going through CS231n to understand the basics of neural networks.
Attached is the slide in which Justin (the tutor) gives the reasoning for why data preprocessing is required, which I don't completely understand. The explanation he gave is similar to the one on the slide, and I don't get it either. The slide is below.
The second question I have is: is it actually normalisation or standardisation? This link implies that it is standardisation, whereas the course material says it is normalisation.
Any help will be appreciated.
A) The meaning of "less sensitive to small changes in weights" can easily be visualized. Imagine making a small change to the weights of the drawn hyperplane, i.e. rotating it a bit. If the samples are located around the origin, you'll notice that they can still be correctly classified. If they're far away from the origin, the same little change in weights will lead to bigger misclassifications.
B) Sometimes standardization and normalization are used interchangeably.
Standardization: I quote from Pattern Recognition and Machine Learning by Bishop: "For the purposes of this example, we have made a linear re-scaling of the data, known as standardizing, such that each of the variables has zero mean and unit standard deviation."
Normalization could be, e.g., min-max normalization, where you scale all feature values to the [0,1] range, or feature vector normalization, where you divide each feature vector by its modulus (norm).
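To make the distinction concrete, here is a small NumPy sketch (with a made-up toy feature matrix) of standardization, min-max normalization, and feature-vector normalization:

```python
import numpy as np

# Toy feature matrix: rows are samples, columns are features (made-up values).
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: zero mean, unit standard deviation per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: scale each feature to the [0, 1] range.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Feature-vector normalization: divide each sample by its L2 norm (modulus).
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

print(X_std, X_minmax, X_unit, sep="\n\n")
```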
I have a question that came up while programming the K-means algorithm in Matlab. Why is the K-means algorithm not suitable for classifying an elongated data set?
In short, draw some thick lines on a paper. Can you really represent each one with a single point? How would single points give information about orientation?
K-means assigns each data point to its nearest centroid. That is to say, for each centroid c, all points whose distance from c is smaller (in comparison to all other centroids) will be assigned to c. And, since a (hyper)ball is, in fact, the set of all points within some distance of a center, I think it is easy to see how the resulting clusters tend to be spherical. (To be exact, k-means practically creates a Voronoi diagram in the vector space.)
Elongated clusters, however, don't necessarily satisfy the requirement that all their points are closer to their "center of mass" than to some other cluster's center.
It is difficult to choose initial cluster center points in an elongated data set, but this choice has a powerful effect on the result. You may get different results when you choose different points.
You will get only one result in this case when you choose 3 initial points:
But it is different in an elongated data set.
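A quick way to see both effects (the spherical bias and the sensitivity to initialization) is to run k-means on synthetic cigar-shaped data. A minimal sketch, assuming scikit-learn is available; the data and seeds are made up:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is installed

rng = np.random.default_rng(0)

# Two elongated (cigar-shaped) clusters lying side by side.
cluster_a = np.column_stack([rng.normal(scale=5.0, size=200),
                             rng.normal(scale=0.3, size=200)])
cluster_b = np.column_stack([rng.normal(scale=5.0, size=200),
                             rng.normal(scale=0.3, size=200) + 2.0])
X = np.vstack([cluster_a, cluster_b])

# Different initializations can give different partitions, and k-means
# often cuts the elongated clusters across their long axis instead of
# recovering the two parallel "cigars".
for seed in (0, 1):
    labels = KMeans(n_clusters=2, n_init=1, random_state=seed).fit_predict(X)
    print(f"seed={seed}: cluster sizes {np.bincount(labels)}")
```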
How exactly is a U-matrix constructed in order to visualise a self-organizing map? More specifically, suppose that I have an output grid of 3x3 nodes (that have already been trained); how do I construct a U-matrix from this? You can e.g. assume that the neurons (and inputs) have dimension 4.
I have found several resources on the web, but they are not clear or they are contradictory. For example, the original paper is full of typos.
A U-matrix is a visual representation of the distances between neurons in the input data dimension space. Namely you calculate the distance between adjacent neurons, using their trained vector. If your input dimension was 4, then each neuron in the trained map also corresponds to a 4-dimensional vector. Let's say you have a 3x3 hexagonal map.
The U-matrix will be a 5x5 matrix with interpolated elements for each connection between two neurons like this
The {x,y} elements are the distances between neurons x and y, and the values in the {x} elements are the mean of the surrounding values. For example, {4,5} = distance(4,5) and {4} = mean({1,4}, {2,4}, {4,5}, {4,7}). For the calculation of the distances you use the trained 4-dimensional vector of each neuron and the distance formula that you used for the training of the map (usually Euclidean distance). So, the values of the U-matrix are only numbers (not vectors). You can then assign a light gray colour to the largest of these values, a dark gray to the smallest, and corresponding shades of gray to the other values. You can use these colours to paint the cells of the U-matrix and get a visual representation of the distances between neurons.
Also have a look at this web article.
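As a rough illustration of the recipe above, here is a minimal NumPy sketch, assuming a rectangular (rather than hexagonal) 3x3 grid and random numbers standing in for the trained 4-dimensional codebook vectors:

```python
import numpy as np

# Hypothetical trained 3x3 map with 4-dimensional codebook vectors.
rng = np.random.default_rng(0)
codebook = rng.random((3, 3, 4))

rows, cols = codebook.shape[:2]
U = np.full((2 * rows - 1, 2 * cols - 1), np.nan)   # 5x5 for a 3x3 map

# Distance cells: Euclidean distance between horizontally / vertically
# adjacent neurons, computed from their trained vectors.
for i in range(rows):
    for j in range(cols):
        if j + 1 < cols:
            U[2 * i, 2 * j + 1] = np.linalg.norm(codebook[i, j] - codebook[i, j + 1])
        if i + 1 < rows:
            U[2 * i + 1, 2 * j] = np.linalg.norm(codebook[i, j] - codebook[i + 1, j])

# Neuron cells: mean of the surrounding distance cells.
for i in range(rows):
    for j in range(cols):
        neighbours = []
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = 2 * i + di, 2 * j + dj
            if 0 <= r < U.shape[0] and 0 <= c < U.shape[1]:
                neighbours.append(U[r, c])
        U[2 * i, 2 * j] = np.mean(neighbours)

print(np.round(U, 2))  # remaining NaNs are the unfilled "diagonal" cells of the grid
```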
The original paper cited in the question states:
A naive application of Kohonen's algorithm, although preserving the topology of the input data is not able to show clusters inherent in the input data.
Firstly, that's true; secondly, it reflects a deep misunderstanding of the SOM; thirdly, it is also a misunderstanding of the purpose of calculating the SOM.
Just take the RGB color space as an example: are there 3 colors (RGB), or 6 (RGBCMY), or 8 (+BW), or more? How would you define that independently of the purpose, i.e. inherent in the data itself?
My recommendation would be not to use maximum likelihood estimators of cluster boundaries at all - not even such primitive ones as the U-matrix - because the underlying argument is already flawed. No matter which method you then use to determine the clusters, you would inherit that flaw. More precisely, the determination of cluster boundaries is not interesting at all, and it loses information regarding the true intention of building a SOM. So, why do we build SOMs from data?
Let us start with some basics:
Any SOM is a representative model of a data space, since it reduces the dimensionality of the latter. Because it is a model, it can be used as a diagnostic as well as a predictive tool. Yet neither use is justified by some universal objectivity. Instead, models are deeply dependent on the purpose and on the accepted associated risk of errors.
Let us assume for a moment that the U-matrix (or something similar) were reasonable, so we determine some clusters on the map. It is not only an issue of how to justify the criterion for doing so (outside of the purpose itself); it is also problematic because any further calculation destroys some information (it is a model about a model).
The only interesting thing about a SOM is the accuracy itself, viz. the classification error, not some estimate of it. Thus, the assessment of the model in terms of validation and robustness is the only thing that is interesting.
Any prediction has a purpose and the acceptance of the prediction is a function of the accuracy, which in turn can be expressed by the classification error. Note that the classification error can be determined for 2-class models as well as for multi-class models. If you don't have a purpose, you should not do anything with your data.
Conversely, the concept of "number of clusters" is completely dependent on the criterion of "allowed divergence within clusters", so it masks the most important thing about the structure of the data. It is also dependent on the risk, and the risk structure (in terms of type I/II errors), you are willing to take.
So, how could we determine the number of classes on a SOM? If there is no exterior a priori reasoning available, the only feasible way is an a posteriori check of the goodness-of-fit: on a given SOM, impose different numbers of classes and measure the deviations in terms of misclassification cost, then choose (subjectively) the most pleasing one (using some fancy heuristic, like Occam's razor).
Taken together, the U-matrix pretends to an objectivity that cannot exist there. It is a serious misunderstanding of modeling altogether.
IMHO it is one of the greatest advantages of the SOM that all the parameters implied by it are accessible and open for being parameterized. Approaches like the U-matrix destroy just that, by disregarding this transparency and closing it again with opaque statistical reasoning.
Trying to write some code that deals with this task:
As a starting point, I have around 20 "profiles" (imagine a landscape profile), i.e. one-dimensional arrays of around 1000 real values.
Each profile has a real-valued desired outcome, the "effective height".
The effective height is some sort of average, but the height, width and position of peaks play a particular role.
My aim is to generalize from the input data so as to calculate the effective height for further profiles.
Is there a machine learning algorithm or principle that could help?
Principle 1: Extract the most important features, instead of feeding it everything
As you said, "The effective height is some sort of average but height, width and position of peaks play a particular role." So that you have a strong priori assumption that these measures are the most important for learning. If I were you, I would calculate these measures at first, and use them as the input for learning, instead of the raw data.
Principle 2: While choosing a learning algorithm, the first thing to care about would be linear separability
Suppose the height is a function of those measures; then you have to think about to what extent that function is linear. For example, if the function is almost linear, then a very simple perceptron would be perfect. Otherwise, if it's far from linear, you might want to pick a multi-layer neural network. If it's far, far from linear... please go back to Principle 1 and check whether you are extracting the right features.
Principle 3: More data helps
As you said, you have around 20 "profiles" for training. Generally speaking, that's not enough. Almost all machine learning algorithms were designed for somewhat big data. Even when an algorithm is claimed to be good at learning from small samples, that usually doesn't mean as small as 20. Get more data!
Maybe multivariate linear regression suffices?
I would probably use a combination of what you said about which features play the most important role, and then train a regression on that. Basically, you need at least one coefficient corresponding to each feature, and you need substantially more data points than coefficients. So I would pick something like the heights and widths of the two biggest peaks. You've now reduced every profile to just 4 numbers. Now do this trick: divide the data into 5 groups of 4. Pick the first 4 groups, reduce all those profiles to 4 numbers, and then use the desired outcomes to fit a regression. Once you have trained the regression, try it on the remaining group of 4 profiles and see how well it works. Repeat this procedure 5 times, each time leaving out a different group. This is called cross-validation, and it's very handy.
Obviously getting more data would help.
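A minimal sketch of that cross-validation procedure, assuming scikit-learn is available and using made-up feature values in place of the real peak measurements:

```python
import numpy as np
from sklearn.linear_model import LinearRegression   # assumes scikit-learn is installed
from sklearn.model_selection import KFold

# Hypothetical data: 20 profiles already reduced to 4 features each
# (e.g. height and width of the two biggest peaks), plus known effective heights.
rng = np.random.default_rng(0)
X = rng.random((20, 4))
y = X @ np.array([2.0, 0.5, 1.5, 0.1]) + rng.normal(scale=0.05, size=20)

errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])    # train on 4 groups
    pred = model.predict(X[test_idx])                             # test on the held-out group
    errors.append(np.mean((pred - y[test_idx]) ** 2))

print("mean cross-validated MSE:", np.mean(errors))
```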
I have implemented k-means clustering for determining the clusters in 300 objects. Each of my objects has about 30 dimensions. The distance is calculated using the Euclidean metric.
I need to know:
How would I determine if my algorithm works correctly? I can't have a graph which will give some idea about the correctness of my algorithm.
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?
The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, starting with item 1 on the list.
How would I determine if my [clustering] algorithms works correctly?
k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions like "are the cluster assignments returned by k-means more meaningful for k=3 or k=5?"
Still, there is one widely accepted test that yields intuitive results and that is straightforward to apply. This diagnostic metric is just this ratio:
inter-centroidal separation / intra-cluster variance
As the value of this ratio increases, the quality of your clustering result increases.
This is intuitive. The first of these metrics just answers the question: how far apart is each cluster from the others (measured between the cluster centers)?
But inter-centroidal separation alone doesn't tell the whole story, because two clustering algorithms could return results having the same inter-centroidal separation though one is clearly better, because the clusters are "tighter" (i.e., smaller radii); in other words, the cluster edges have more separation. The second metric--intra-cluster variance--accounts for this. This is just the mean variance, calculated per cluster.
In sum, the ratio of inter-centroidal separation to intra-cluster variance is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or to compare the results from the same algorithm run under different variable parameters--e.g., number of iterations, choice of distance metric, number of centroids (value of k).
The desired result is tight (small) clusters, each one far away from the others.
The calculation is simple:
For inter-centroidal separation:
calculate the pair-wise distance between cluster centers; then
calculate the median of those distances.
For intra-cluster variance:
for each cluster, calculate the distance of every data point in a given cluster from its cluster center; next
(for each cluster) calculate the variance of the sequence of distances from the step above; then
average these variance values.
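A minimal NumPy/SciPy sketch of that recipe, assuming you already have the data matrix, the cluster labels, and the cluster centers (all names here are hypothetical; with scikit-learn's KMeans they would be e.g. km.labels_ and km.cluster_centers_):

```python
import numpy as np
from scipy.spatial.distance import pdist

def cluster_quality(X, labels, centers):
    """Ratio of inter-centroidal separation to intra-cluster variance:
    median pairwise distance between centers divided by the mean
    per-cluster variance of point-to-center distances."""
    inter = np.median(pdist(centers))                   # pairwise center distances
    variances = []
    for k, c in enumerate(centers):
        d = np.linalg.norm(X[labels == k] - c, axis=1)  # point-to-center distances
        variances.append(d.var())
    intra = np.mean(variances)
    return inter / intra
```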
That's my answer to the first question. Here's the second question:
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30 ?
First, the easy question--is Euclidean distance a valid metric as dimensions/features increase?
Euclidean distance is perfectly scalable--works for two dimensions or two thousand. For any pair of data points:
subtract their feature vectors element-wise,
square each item in that result vector,
sum that result,
take the square root of that scalar.
Nowhere in this sequence of calculations is scale implicated.
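Those four steps translate directly into NumPy; a quick sketch with arbitrary example vectors:

```python
import numpy as np

def euclidean(a, b):
    diff = a - b              # subtract the feature vectors element-wise
    squared = diff ** 2       # square each item in the result
    total = squared.sum()     # sum that result
    return np.sqrt(total)     # take the square root of the scalar

# Works the same for 2 or 2000 dimensions:
print(euclidean(np.array([1.0, 2.0]), np.array([4.0, 6.0])))   # 5.0
print(euclidean(np.random.rand(2000), np.random.rand(2000)))
```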
But whether Euclidean distance is the appropriate similarity metric for your problem depends on your data. For instance, is it purely numeric (continuous)? Or does it have discrete (categorical) variables as well (e.g., gender: M/F)? If one of your dimensions is "current location" and, of 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's sort of what Euclidean distance would do.
In any event, since we don't know anything about your data, I'll just give you a simple flow diagram so that you can apply it to your data and identify an appropriate similarity metric.
To identify an appropriate similarity metric given your data:
Euclidean distance is good when the dimensions are comparable and on the same scale. If one dimension represents length and another the weight of an item, plain Euclidean distance should be replaced with a weighted one.
Project the data to 2D and show the picture - this is a good option to see visually whether it works.
Or you may use a sanity check - e.g., find the cluster centers and check that the items in each cluster aren't too far away from them.
Can't you just try sum |xi - yi| instead of sum (xi - yi)^2 in your code, and see if it makes much difference?
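In NumPy the swap is a one-line change; a tiny sketch with made-up vectors:

```python
import numpy as np

a = np.array([1.0, 5.0, 2.0])
b = np.array([4.0, 1.0, 2.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # sum (xi - yi)^2, then the square root
manhattan = np.sum(np.abs(a - b))           # sum |xi - yi|

print(euclidean, manhattan)
```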
I can't have a graph which will give some idea about the correctness of my algorithm.
A couple of possibilities:
look at some points midway between 2 clusters in detail
vary k a bit, see what happens (what is your k ?)
use PCA to map 30d down to 2d; see the plots under calculating-the-percentage-of-variance-measure-for-k-means, also SO questions/tagged/pca
By the way, scipy.spatial.cKDTree can easily give you, say, the 3 nearest neighbors of each point, in p=2 (Euclidean) or p=1 (Manhattan, L1), to look at. It's fast up to ~ 20d, and with early cutoff works even in 128d.
Added: I like Cosine distance in high dimensions; see euclidean-distance-is-usually-not-good-for-sparse-data for why.
Euclidean distance is the intuitive and "normal" distance between continuous variables. It can be inappropriate if the data is too noisy or has a non-Gaussian distribution.
You might want to try the Manhattan distance (or cityblock distance), which is robust to that (bear in mind that robustness always comes at a cost: a bit of the information is lost, in this case).
There are many further distance metrics for specific problems (for example the Bray-Curtis distance for count data). You might want to try some of the distances implemented in pdist from the Python module scipy.spatial.distance.
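For instance, a short sketch comparing a few of the metrics that pdist supports, with random data standing in for the 300 objects with 30 dimensions:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(300, 30)   # placeholder for the 300 objects, 30 dimensions each

# Condensed pairwise distances, expanded to full 300x300 matrices.
d_euclidean  = squareform(pdist(X, metric="euclidean"))
d_cityblock  = squareform(pdist(X, metric="cityblock"))    # Manhattan
d_braycurtis = squareform(pdist(X, metric="braycurtis"))

print(d_euclidean.shape, d_cityblock.shape, d_braycurtis.shape)
```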