How do I make a U-matrix? - machine-learning

How exactly is an U-matrix constructed in order to visualise a self-organizing-map? More specifically, suppose that I have an output grid of 3x3 nodes (that have already been trained), how do I construct a U-matrix from this? You can e.g. assume that the neurons (and inputs) have dimension 4.
I have found several resources on the web, but they are not clear or they are contradictory. For example, the original paper is full of typos.

A U-matrix is a visual representation of the distances between neurons in the input data dimension space. Namely you calculate the distance between adjacent neurons, using their trained vector. If your input dimension was 4, then each neuron in the trained map also corresponds to a 4-dimensional vector. Let's say you have a 3x3 hexagonal map.
The U-matrix will be a 5x5 matrix with interpolated elements for each connection between two neurons like this
The {x,y} elements are the distance between neuron x and y, and the values in {x} elements are the mean of the surrounding values. For example, {4,5} = distance(4,5) and {4} = mean({1,4}, {2,4}, {4,5}, {4,7}). For the calculation of the distance you use the trained 4-dimensional vector of each neuron and the distance formula that you used for the training of the map (usually Euclidian distance). So, the values of the U-matrix are only numbers (not vectors). Then you can assign a light gray colour to the largest of these values and a dark gray to the smallest and the other values to corresponding shades of gray. You can use these colours to paint the cells of the U-matrix and have a visualized representation of the distances between neurons.
Have also a look at this web article.

The original paper cited in the question states:
A naive application of Kohonen's algorithm, although preserving the topology of the input data is not able to show clusters inherent in the input data.
Firstly, that's true, secondly, it is a deep mis-understanding of the SOM, thirdly it is also a mis-understanding of the purpose of calculating the SOM.
Just take the RGB color space as an example: are there 3 colors (RGB), or 6 (RGBCMY), or 8 (+BW), or more? How would you define that independent of the purpose, ie inherent in the data itself?
My recommendation would be not to use maximum likelihood estimators of cluster boundaries at all - not even such primitive ones as the U-Matrix -, because the underlying argument is already flawed. No matter which method you then use to determine the cluster, you would inherit that flaw. More precisely, the determination of cluster boundaries is not interesting at all, and it is loosing information regarding the true intention of building a SOM. So, why do we build SOM's from data?
Let us start with some basics:
Any SOM is a representative model of a data space, for it reduces the dimensionality of the latter. For it is a model it can be used as a diagnostic as well as a predictive tool. Yet, both cases are not justified by some universal objectivity. Instead, models are deeply dependent on the purpose and the accepted associated risk for errors.
Let us assume for a moment the U-Matrix (or similar) would be reasonable. So we determine some clusters on the map. It is not only an issue how to justify the criterion for it (outside of the purpose itself), it is also problematic because any further calculation destroys some information (it is a model about a model).
The only interesting thing on a SOM is the accuracy itself viz the classification error, not some estimation of it. Thus, the estimation of the model in terms of validation and robustness is the only thing that is interesting.
Any prediction has a purpose and the acceptance of the prediction is a function of the accuracy, which in turn can be expressed by the classification error. Note that the classification error can be determined for 2-class models as well as for multi-class models. If you don't have a purpose, you should not do anything with your data.
Inversely, the concept of "number of clusters" is completely dependent on the criterion "allowed divergence within clusters", so it is masking the most important thing of the structure of the data. It is also dependent on the risk and the risk structure (in terms of type I/II errors) you are willing to take.
So, how could we determine the number classes on a SOM? If there is no exterior apriori reasoning available, the only feasible way would be an a-posteriori check of the goodness-of-fit. On a given SOM, impose different numbers of classes and measure the deviations in terms of mis-classification cost, then choose (subjectively) the most pleasing one (using some fancy heuristics, like Occam's razor)
Taken together, the U-matrix is pretending objectivity where no objectivity can be. It is a serious misunderstanding of modeling altogether.
IMHO it is one of the greatest advantages of the SOM that all the parameters implied by it are accessible and open for being parameterized. Approaches like the U-matrix destroy just that, by disregarding this transparency and closing it again with opaque statistical reasoning.

Related

Why doesn't the hidden state of a neuron network provide better dimension reduction result than original input?

I just read a great post here. I am curious about content of "An example with images" in that post. If the hidden states mean a lot of features of the original picture and getting closer to final result, using dimension reduction on hidden states should provide better result than the original raw pixels, I think.
Hence, I tried it on mnist digits with 2 hidden layers of 256 unit NN, using T-SNE for dimension reduction; the result is far from ideal. From left to right, top to bot, they are raw pixels, second hidden layer and final prediction. Can anyone explain that?
BTW, the accuracy of this model is around 94.x%.
You have ten classes, and as you mentioned, your model is performing well on this dataset - so in this 256 dimensional space - the classes are separated well using linear subspaces.
So why T-SNE projections don't have this property?
One trivial answer which comes to my mind is that projecting a highly dimensional set to two dimensions may lose the linear separation property. Consider following example : a hill where one class is at its peak and second - on lower height levels around. In three dimensions these classes are easily separated by a plane but one can easily find a two dimensional projection which doesn't have that property (e.g. projection in which you are loosing the height dimension).
Of course T-SNE is not such linear projections but it's main purpose is to preserve a local structure of data, so that general property like linear separation property might be easly losed when using this approach.

Linear Regression :: Normalization (Vs) Standardization

I am using Linear regression to predict data. But, I am getting totally contrasting results when I Normalize (Vs) Standardize variables.
Normalization = x -xmin/ xmax – xmin
 
Zero Score Standardization = x - xmean/ xstd
 
a) Also, when to Normalize (Vs) Standardize ?
b) How Normalization affects Linear Regression?
c) Is it okay if I don't normalize all the attributes/lables in the linear regression?
Thanks,
Santosh
Note that the results might not necessarily be so different. You might simply need different hyperparameters for the two options to give similar results.
The ideal thing is to test what works best for your problem. If you can't afford this for some reason, most algorithms will probably benefit from standardization more so than from normalization.
See here for some examples of when one should be preferred over the other:
For example, in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance (depending on the question and if the PCA computes the components via the correlation matrix instead of the covariance matrix; but more about PCA in my previous article).
However, this doesn’t mean that Min-Max scaling is not useful at all! A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for the RGB color range). Also, typical neural network algorithm require data that on a 0-1 scale.
One disadvantage of normalization over standardization is that it loses some information in the data, especially about outliers.
Also on the linked page, there is this picture:
As you can see, scaling clusters all the data very close together, which may not be what you want. It might cause algorithms such as gradient descent to take longer to converge to the same solution they would on a standardized data set, or it might even make it impossible.
"Normalizing variables" doesn't really make sense. The correct terminology is "normalizing / scaling the features". If you're going to normalize or scale one feature, you should do the same for the rest.
That makes sense because normalization and standardization do different things.
Normalization transforms your data into a range between 0 and 1
Standardization transforms your data such that the resulting distribution has a mean of 0 and a standard deviation of 1
Normalization/standardization are designed to achieve a similar goal, which is to create features that have similar ranges to each other. We want that so we can be sure we are capturing the true information in a feature, and that we dont over weigh a particular feature just because its values are much larger than other features.
If all of your features are within a similar range of each other then theres no real need to standardize/normalize. If, however, some features naturally take on values that are much larger/smaller than others then normalization/standardization is called for
If you're going to be normalizing at least one variable/feature, I would do the same thing to all of the others as well
First question is why we need Normalisation/Standardisation?
=> We take a example of dataset where we have salary variable and age variable.
Age can take range from 0 to 90 where salary can be from 25thousand to 2.5lakh.
We compare difference for 2 person then age difference will be in range of below 100 where salary difference will in range of thousands.
So if we don't want one variable to dominate other then we use either Normalisation or Standardization. Now both age and salary will be in same scale
but when we use standardiztion or normalisation, we lose original values and it is transformed to some values. So loss of interpretation but extremely important when we want to draw inference from our data.
Normalization rescales the values into a range of [0,1]. also called min-max scaled.
Standardization rescales data to have a mean (μ) of 0 and standard deviation (σ) of 1.So it gives a normal graph.
Example below:
Another example:
In above image, you can see that our actual data(in green) is spread b/w 1 to 6, standardised data(in red) is spread around -1 to 3 whereas normalised data(in blue) is spread around 0 to 1.
Normally many algorithm required you to first standardise/normalise data before passing as parameter. Like in PCA, where we do dimension reduction by plotting our 3D data into 1D(say).Here we required standardisation.
But in Image processing, it is required to normalise pixels before processing.
But during normalisation, we lose outliers(extreme datapoints-either too low or too high) which is slight disadvantage.
So it depends on our preference what we chose but standardisation is most recommended as it gives a normal curve.
None of the mentioned transformations shall matter for linear regression as these are all affine transformations.
Found coefficients would change but explained variance will ultimately remain the same. So, from linear regression perspective, Outliers remain as outliers (leverage points).
And these transformations also will not change the distribution. Shape of the distribution remains the same.
lot of people use Normalisation and Standardisation interchangeably. The purpose remains the same is to bring features into the same scale. The approach is to subtract each value from min value or mean and divide by max value minus min value or SD respectively. The difference you can observe that when using min value u will get all value + ve and mean value u will get bot + ve and -ve values. This is also one of the factors to decide which approach to use.

Dimension Reduction of Feature in Machine Learning

Is there any way to reduce the dimension of the following features from 2D coordinate (x,y) to one dimension?
Yes. In fact, there are infinitely many ways to reduce the dimension of the features. It's by no means clear, however, how they perform in practice.
A feature reduction usually is done via a principal component analysis (PCA) which involves a singular value decomposition. It finds the directions with highest variance -- that is, those direction in which "something is going on".
In your case, a PCA might find the black line as one of the two principal components:
The projection of your data onto this one-dimensional subspace than yields the reduced form of your data.
Already with the eye one can see that on this line the three feature sets can be separated -- I coloured the three ranges accordingly. For your example, it is even possible to completely separate the data sets. A new data point then would be classified according to the range in which its projection onto the black line lies (or, more generally, the projection onto the principal component subspace) lies.
Formally, one could obtain a division with further methods that use the PCA-reduced data as input, such as for example clustering methods or a K-nearest neighbour model.
So, yes, in case of your example it could be possible to make such a strong reduction from 2D to 1D, and, at the same time, even obtain a reasonable model.

Interpreting a Self Organizing Map

I have been doing reading about Self Organizing Maps, and I understand the Algorithm(I think), however something still eludes me.
How do you interpret the trained network?
How would you then actually use it for say, a classification task(once you have done the clustering with your training data)?
All of the material I seem to find(printed and digital) focuses on the training of the Algorithm. I believe I may be missing something crucial.
Regards
SOMs are mainly a dimensionality reduction algorithm, not a classification tool. They are used for the dimensionality reduction just like PCA and similar methods (as once trained, you can check which neuron is activated by your input and use this neuron's position as the value), the only actual difference is their ability to preserve a given topology of output representation.
So what is SOM actually producing is a mapping from your input space X to the reduced space Y (the most common is a 2d lattice, making Y a 2 dimensional space). To perform actual classification you should transform your data through this mapping, and run some other, classificational model (SVM, Neural Network, Decision Tree, etc.).
In other words - SOMs are used for finding other representation of the data. Representation, which is easy for further analyzis by humans (as it is mostly 2dimensional and can be plotted), and very easy for any further classification models. This is a great method of visualizing highly dimensional data, analyzing "what is going on", how are some classes grouped geometricaly, etc.. But they should not be confused with other neural models like artificial neural networks or even growing neural gas (which is a very similar concept, yet giving a direct data clustering) as they serve a different purpose.
Of course one can use SOMs directly for the classification, but this is a modification of the original idea, which requires other data representation, and in general, it does not work that well as using some other classifier on top of it.
EDIT
There are at least few ways of visualizing the trained SOM:
one can render the SOM's neurons as points in the input space, with edges connecting the topologicaly close ones (this is possible only if the input space has small number of dimensions, like 2-3)
display data classes on the SOM's topology - if your data is labeled with some numbers {1,..k}, we can bind some k colors to them, for binary case let us consider blue and red. Next, for each data point we calculate its corresponding neuron in the SOM and add this label's color to the neuron. Once all data have been processed, we plot the SOM's neurons, each with its original position in the topology, with the color being some agregate (eg. mean) of colors assigned to it. This approach, if we use some simple topology like 2d grid, gives us a nice low-dimensional representation of data. In the following image, subimages from the third one to the end are the results of such visualization, where red color means label 1("yes" answer) andbluemeans label2` ("no" answer)
onc can also visualize the inter-neuron distances by calculating how far away are each connected neurons and plotting it on the SOM's map (second subimage in the above visualization)
one can cluster the neuron's positions with some clustering algorithm (like K-means) and visualize the clusters ids as colors (first subimage)

Why do we maximize variance during Principal Component Analysis?

I'm trying to read through PCA and saw that the objective was to maximize the variance. I don't quite understand why. Any explanation of other related topics would be helpful
Variance is a measure of the "variability" of the data you have. Potentially the number of components is infinite (actually, after numerization it is at most equal to the rank of the matrix, as #jazibjamil pointed out), so you want to "squeeze" the most information in each component of the finite set you build.
If, to exaggerate, you were to select a single principal component, you would want it to account for the most variability possible: hence the search for maximum variance, so that the one component collects the most "uniqueness" from the data set.
Note that PCA does not actually increase the variance of your data. Rather, it rotates the data set in such a way as to align the directions in which it is spread out the most with the principal axes. This enables you to remove those dimensions along which the data is almost flat. This decreases the dimensionality of the data while keeping the variance (or spread) among the points as close to the original as possible.
Maximizing the component vector variances is the same as maximizing the 'uniqueness' of those vectors. Thus you're vectors are as distant from each other as possible. That way if you only use the first N component vectors you're going to capture more space with highly varying vectors than with like vectors. Think about what Principal Component actually means.
Take for example a situation where you have 2 lines that are orthogonal in a 3D space. You can capture the environment much more completely with those orthogonal lines than 2 lines that are parallel (or nearly parallel). When applied to very high dimensional states using very few vectors, this becomes a much more important relationship among the vectors to maintain. In a linear algebra sense you want independent rows to be produced by PCA, otherwise some of those rows will be redundant.
See this PDF from Princeton's CS Department for a basic explanation.
max variance is basically setting these axis that occupy the maximum spread of the datapoints, why? because the direction of this axis is what really matters as it kinda explains correlations and later on we will compress/project the points along those axis to get rid of some dimensions

Resources