Can K-means do dimensionality reduction? - machine-learning

My question is: if we have 10 columns of continuous variables, can we run k-means to shrink those 10 columns down to a single column of corresponding cluster labels, and then fit a decision tree or logistic regression on it?
When new data comes in, we would use the k-means result to determine its cluster label and then pass that to the machine learning model.

K-means is absolutely not a dimensionality reduction technique. Dimensionality reduction algorithms map the input space to a lower-dimensional input space, while what you are proposing maps the input space directly to a discrete set of integer cluster labels. Collapsing 10 continuous columns into a single label discards the geometry inside each cluster, so a downstream model can no longer tell apart points that land in the same cluster.
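For concreteness, here is a minimal sketch of the pipeline the question describes, assuming sklearn and synthetic data (the column count, cluster count, and all variable names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))    # 10 continuous columns
y_train = rng.integers(0, 2, size=200)  # binary target

# Step 1: "shrink" the 10 columns into a single cluster-label column.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)
label_feature = km.labels_.reshape(-1, 1)

# Step 2: fit the supervised model on the single label column.
clf = LogisticRegression().fit(label_feature, y_train)

# New data: assign a cluster with predict(), then feed it to the model.
X_new = rng.normal(size=(5, 10))
print(clf.predict(km.predict(X_new).reshape(-1, 1)))
```

Note that cluster labels are nominal, so treating them as a single numeric column imposes an arbitrary ordering; one-hot encoding the labels is the usual workaround, but the information loss described above remains either way.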

Related

2-dimensional clustering for segmentation and variance minimization

I have a dataset with two cardinal attributes of comparable scale. I wish to divide the data points into 4 clusters, so as to have complete segmentation by attribute 1, while minimizing the variance within attribute 2.
E.g.: If plotting attribute 1 on the x-axis and attribute 2 on the y-axis, the resulting clusters should represent vertical cuts through the data set, which are sized horizontally so as to minimize the variance in attribute 2.
The only approach I have come up with so far is to employ k-means clustering and scale up attribute 1 so as to be the dominant factor in the distance function.
Any other suggestions for suitable unsupervised learning / clustering algorithms?
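For reference, a minimal sketch of the scaling workaround described in the question, assuming sklearn and synthetic data (the scale factor of 100 is an arbitrary assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 10, 500),  # attribute 1
                     rng.normal(0, 1, 500)])   # attribute 2

X_scaled = X.copy()
X_scaled[:, 0] *= 100  # make attribute 1 dominate the Euclidean distance

# With attribute 1 dominant, the 4 clusters approximate vertical cuts along x.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
```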

How to cluster a multi-dimensional dataset in Python?

I have been doing clustering using sklearn's k-means and DBSCAN algorithms on datasets with 2 features. Now I have to cluster data with very high dimensionality, say 800-900 features, and I want to know how this can be done as accurately as possible.
P.S.: After some searching I have realised that one can apply PCA for dimension reduction, but I want to know whether there is any other way, in any other library if not sklearn.
You can run KMeans and DBSCAN on high-dimensional data.
Also, it is the intrinsic dimensionality that matters: a 900-dimensional data set where 898 dimensions are constant 0 will behave exactly like a 2-dimensional data set (though it will probably take 450x longer, but that is to be expected).
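As a concrete starting point, here is a hedged sketch of the PCA-then-cluster route the question mentions, assuming sklearn; the 95% variance threshold and the cluster count are arbitrary assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 900))  # stand-in for the 800-900 feature data

# Keep enough principal components to explain 95% of the variance.
X_reduced = PCA(n_components=0.95).fit_transform(X)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_reduced)
```

Outside sklearn, the umap-learn package is one commonly used alternative for the reduction step.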

One-class Support Vector Machine sensitivity drops when the number of training samples increases

I am using a One-Class SVM for outlier detection. It appears that as the number of training samples increases, the sensitivity TP/(TP+FN) of the One-Class SVM detection results drops, while the classification rate and specificity both increase.
What's the best way of explaining this relationship in terms of hyperplane and support vectors?
Thanks
The more training examples you have, the less your classifier is able to detect true positives correctly.
It means that the new data does not fit well with the model you are training.
Here is a simple example.
Below you have two classes, and we can easily separate them using a linear kernel.
The sensitivity of the blue class is 1.
As I add more yellow training data near the decision boundary, the generated hyperplane can't fit the data as well as before.
As a consequence, we now see that there are two misclassified blue data points.
The sensitivity of the blue class is now 0.92.
As the number of training points increases, the support vectors generate a somewhat less optimal hyperplane. Perhaps the extra data turns a linearly separable data set into one that is no longer linearly separable. In such a case, trying a different kernel, such as the RBF kernel, can help, as in the sketch below.
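A minimal sketch of that kernel swap, assuming sklearn and a synthetic, non-linearly-separable data set (make_circles); recall plays the role of sensitivity here:

```python
from sklearn.datasets import make_circles
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The linear kernel cannot separate concentric circles; RBF can.
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    print(kernel, recall_score(y_te, clf.predict(X_te)))
```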
EDIT: more information about the RBF kernel:
In this video you can see what happens with an RBF kernel.
The same logic applies: if the training data is not easily separable in n dimensions, you will get worse results.
You should try to select a better C using cross-validation.
In this paper, figure 3 illustrates that the results can be worse if C is not properly selected:
"More training data could hurt if we did not pick a proper C. We need to cross-validate on the correct C to produce good results."
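A hedged sketch of that cross-validation step, assuming sklearn; the C grid and the recall scoring (i.e. sensitivity) are assumptions:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

# 5-fold cross-validation over a small grid of C values.
search = GridSearchCV(SVC(kernel="rbf"), {"C": [0.01, 0.1, 1, 10, 100]},
                      scoring="recall", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```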

Gaussian basis function selection - Linear Regression

I'm looking to set up a linear regression using 2D Gaussian basis functions. My input training variables cover a two-dimensional space. Before applying the machine learning (Bayesian linear regression), I need to select parameters for the Gaussians (mean and variance) and also decide how many basis functions to use.
I am currently spacing the means (of a preallocated number of basis Gaussians) evenly over a grid and just assuming constant variance. This is obviously not the best approach.
Any ideas on how to calculate these variables?
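For reference, a minimal sketch of the grid-of-Gaussians setup described above, assuming numpy/sklearn; the 5x5 grid, the shared variance, and the synthetic target are all arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

def gaussian_features(X, centers, var):
    """Map 2-D inputs to one Gaussian activation per basis center."""
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * var))

# Means spaced evenly on a 5x5 grid over [0, 1]^2, constant variance.
g = np.linspace(0.0, 1.0, 5)
centers = np.array([[cx, cy] for cx in g for cy in g])

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = np.sin(4 * X[:, 0]) + np.cos(4 * X[:, 1]) + rng.normal(0, 0.1, 200)

Phi = gaussian_features(X, centers, var=0.05)
model = BayesianRidge().fit(Phi, y)
```

One common refinement is to place the centers by running k-means on the inputs and to set each variance from the distances to neighboring centers, rather than fixing both on a uniform grid.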

Interpreting a Self Organizing Map

I have been reading about Self Organizing Maps, and I understand the algorithm (I think), but something still eludes me.
How do you interpret the trained network?
How would you then actually use it for, say, a classification task (once you have done the clustering with your training data)?
All of the material I can find (printed and digital) focuses on the training of the algorithm. I believe I may be missing something crucial.
Regards
SOMs are mainly a dimensionality reduction algorithm, not a classification tool. They are used for dimensionality reduction just like PCA and similar methods (once trained, you can check which neuron is activated by your input and use that neuron's position as the value); the only real difference is their ability to preserve a given topology in the output representation.
So what a SOM actually produces is a mapping from your input space X to a reduced space Y (most commonly a 2d lattice, making Y a 2-dimensional space). To perform actual classification you should transform your data through this mapping and then run some other classification model (SVM, neural network, decision tree, etc.).
In other words, SOMs are used for finding another representation of the data: one that is easy for humans to analyze further (as it is mostly 2-dimensional and can be plotted) and very easy for any further classification model. They are a great method for visualizing high-dimensional data and analyzing "what is going on", how some classes are grouped geometrically, etc. But they should not be confused with other neural models like artificial neural networks or even growing neural gas (which is a very similar concept, yet one that gives a direct data clustering), as they serve a different purpose.
Of course one can use SOMs directly for classification, but this is a modification of the original idea, which requires a different data representation, and in general it does not work as well as using some other classifier on top of it, as in the sketch below.
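A minimal sketch of this SOM-then-classifier pipeline, assuming the third-party minisom package and sklearn; the grid size, iteration count, synthetic data, and choice of decision tree are all assumptions:

```python
import numpy as np
from minisom import MiniSom
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))           # high-dimensional inputs
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic labels

# Train a 10x10 SOM: the mapping from R^10 to a 2-D lattice.
som = MiniSom(10, 10, input_len=10, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 1000)

# Replace each point by the (row, col) of its best-matching neuron.
X_mapped = np.array([som.winner(x) for x in X])

# Run an ordinary classifier on the 2-D representation.
clf = DecisionTreeClassifier().fit(X_mapped, y)
```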
EDIT
There are at least a few ways of visualizing the trained SOM:
one can render the SOM's neurons as points in the input space, with edges connecting the topologically close ones (this is possible only if the input space has a small number of dimensions, like 2-3)
one can display data classes on the SOM's topology - if your data is labeled with some numbers {1,...,k}, we can bind k colors to them; for the binary case let us consider blue and red. Next, for each data point we compute its corresponding neuron in the SOM and add that label's color to the neuron. Once all data have been processed, we plot the SOM's neurons, each at its original position in the topology, with the color being some aggregate (e.g. mean) of the colors assigned to it. This approach, if we use some simple topology like a 2d grid, gives us a nice low-dimensional representation of the data (see the sketch after this list). In the following image, the subimages from the third one to the end are the results of such visualization, where red means label `1` (a "yes" answer) and blue means label `2` (a "no" answer)
one can also visualize the inter-neuron distances by calculating how far apart each pair of connected neurons is and plotting it on the SOM's map (second subimage in the above visualization)
one can cluster the neurons' positions with some clustering algorithm (like k-means) and visualize the cluster ids as colors (first subimage)
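A hedged sketch of the label-coloring visualization from the second item above, assuming minisom and matplotlib; the data, grid size, and color map are assumptions:

```python
import matplotlib.pyplot as plt
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] > 0).astype(float)  # binary labels: 1 = "yes", 0 = "no"

som = MiniSom(10, 10, input_len=10, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 1000)

# Aggregate (mean) the labels of the points mapped onto each neuron.
sums, counts = np.zeros((10, 10)), np.zeros((10, 10))
for x, label in zip(X, y):
    i, j = som.winner(x)
    sums[i, j] += label
    counts[i, j] += 1
mean_label = np.divide(sums, counts,
                       out=np.full((10, 10), np.nan), where=counts > 0)

# Red neurons are dominated by "yes" points, blue by "no"; empty cells stay blank.
plt.imshow(mean_label, cmap="coolwarm")
plt.colorbar(label="mean label per neuron")
plt.show()
```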
