How to cluster a multi-dimensional dataset in Python? - machine-learning

I have been clustering datasets with 2 features using sklearn's KMeans and DBSCAN algorithms. Now I have to cluster data with very high dimensionality (say 800-900 features), and I want to know how this can be done as accurately as possible.
P.S.: After some searching I have realised that one can apply PCA for dimensionality reduction, but I want to know whether there is any other way, in another library if not in sklearn.

You can run KMeans and DBSCAN on high-dimensional data.
Also, it is the intrinsic dimensionality that matters. A 900-dimensional data set where 898 dimensions are constant 0 will behave exactly like a 2-dimensional data set (well, it probably takes 450x longer, but that is to be expected).
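For concreteness, here is a minimal sketch (synthetic data, assumed parameter values) of both options in sklearn: clustering the 900-dimensional data directly with KMeans, and reducing it with PCA before running DBSCAN.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a ~900-dimensional dataset.
X, _ = make_blobs(n_samples=2000, n_features=900, centers=5, random_state=0)
X = StandardScaler().fit_transform(X)

# KMeans works directly on the high-dimensional data.
kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Optional: reduce dimensionality first, keeping enough components
# to explain 95% of the variance.
X_reduced = PCA(n_components=0.95).fit_transform(X)

# DBSCAN's eps is an assumed placeholder here and has to be tuned
# for whichever space you cluster in.
dbscan_labels = DBSCAN(eps=10.0, min_samples=5).fit_predict(X_reduced)
```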

Related

Can K-means do dimensionality reduction?

My question is: if we have 10 continuous-variable columns,
can we use k-means to shrink the 10 columns down to 1 column of cluster labels,
and then run a decision tree or logistic regression on that?
When new data comes in, we would use the k-means result to determine its label and pass that to the machine learning model.
K-means is absolutely not a dimensionality reduction technique. Dimensionality reduction algorithms map the input space to a lower dimensional input space, while what you are proposing is mapping the input space directly to the output space which consists of the set of all integer labels.
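To make the distinction concrete, here is a small sketch (synthetic data, assumed parameters): the proposed step collapses the 10 columns into a single integer label, whereas PCA maps them to a smaller but still continuous input space.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # 10 continuous columns (synthetic)

# What the question proposes: replace the 10 columns with one integer cluster label.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # shape (500,), values in {0, 1, 2}

# Dimensionality reduction proper: a lower-dimensional, still continuous input space.
X_low = PCA(n_components=3).fit_transform(X)  # shape (500, 3)
```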

Why does having too many principal components for handwritten digit classification result in lower accuracy?

I'm currently using PCA to do handwritten digit recognition on the MNIST database (each digit has about 1000 observations and 784 features). One thing I have found confusing is that the accuracy is highest with 40 PCs. If the number of PCs grows beyond this point, the accuracy starts to drop continuously.
From my understanding of PCA, I thought the more components I have, the better I can describe a dataset. Why does the accuracy decrease if I have too many PCs?
In order to identify the optimum number of components, you need to plot the elbow curve:
https://en.wikipedia.org/wiki/Elbow_method_(clustering)
The idea behind PCA is to reduce the dimensionality of the data by finding the principal components.
Lastly, I do not think that PCA can overfit the data, as it is not a learning/fitting algorithm.
You are just projecting the data onto the eigenvectors to capture most of the variance along each axis.
This video should help: https://www.youtube.com/watch?v=_UVHneBUBW0
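As an illustration of the effect, here is a hedged sketch that sweeps the number of PCs and reports cross-validated accuracy; sklearn's small digits dataset stands in for MNIST, and the classifier and component grid are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# 64-pixel digits as a small stand-in for MNIST's 784 features.
X, y = load_digits(return_X_y=True)

# Cross-validated accuracy for an increasing number of principal components.
for n_components in [5, 10, 20, 30, 40, 50, 60]:
    pipe = make_pipeline(PCA(n_components=n_components),
                         LogisticRegression(max_iter=1000))
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{n_components:3d} PCs -> accuracy {score:.3f}")
```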

SMOTE oversampling for anomaly detection using a classifier

I have sensor data and I want to do live anomaly detection. I use LOF on the training set to detect anomalies and then feed the labeled data to a classifier to classify new data points. I thought about using SMOTE because I want more anomaly points in the training data to overcome the imbalanced classification problem, but the issue is that SMOTE creates many points which are inside the normal range.
How can I do oversampling without creating samples in the normal data range?
[Figure: the data before applying SMOTE]
[Figure: the data after applying SMOTE]
SMOTE is going to linearly interpolate synthetic points between a minority class sample and its k-nearest neighbors. This means that you're going to end up with points between a sample and its neighbors. When samples are all over the place like this, it makes sense that you're going to create synthetic points in the middle.
SMOTE should really be used to identify more specific regions in the feature space as the decision region for the minority class. This doesn't seem to be your use case. You want to know which points "don't belong," per se.
This seems like a fairly nice use case for DBSCAN, a density-based clustering algorithm that will identify points beyond some distance, eps, as not belonging to the same neighborhood.
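A minimal sketch of that idea, assuming synthetic sensor readings and placeholder values for eps and min_samples: DBSCAN labels points that are not density-reachable from any cluster as -1, and those are the candidates for points that "don't belong."

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))    # dense "normal" readings
outliers = rng.uniform(low=-6.0, high=6.0, size=(15, 2))  # sparse points far from the dense region
X = StandardScaler().fit_transform(np.vstack([normal, outliers]))

# Points that are not density-reachable from any cluster get the label -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
anomaly_mask = labels == -1
print("points flagged as anomalies:", anomaly_mask.sum())
```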

Weights in eigenface approach

1) In the eigenface approach, an eigenface is a combination of elements from different faces. What are these elements?
2) The output face is an image composed of different eigenfaces with different weights. What do the weights of the eigenfaces mean exactly? I know that the weight is the percentage of an eigenface in the image, but what does that mean exactly? Does it mean the number of selected pixels?
Please study PCA to understand the physical meaning of eigenfaces when PCA is applied to an image. The answer lies in understanding the eigenvectors and eigenvalues associated with PCA.
EigenFaces is based on Principal Component Analysis.
Principal Component Analysis performs dimensionality reduction: it finds the distinctive features in the training images and removes the features that the face images have in common.
By keeping only the distinctive features, the recognition task becomes simpler.
Using PCA, you compute the eigenvectors of your face image data.
From these eigenvectors you compute an EigenFace for every training subject, i.e. one EigenFace for every class in your data.
So if you have 9 classes, the number of EigenFaces will be 9.
A weight usually means how important something is.
In EigenFaces, the weight of a particular EigenFace is a vector that tells you how important that particular EigenFace is in contributing to the MeanFace.
If you have 9 EigenFaces, then for every EigenFace you get exactly one weight vector of dimension N, where N is the number of eigenvectors.
So each of the N elements in one weight vector tells you how important that particular eigenvector is for the corresponding EigenFace.
Facial recognition with EigenFaces is done by comparing the weights of the training images and the test images with some kind of distance function.
You can refer to this GitHub link: https://github.com/jayshah19949596/Computer-Vision-Course-Assignments/blob/master/EigenFaces/EigenFaces.ipynb
The code at the above link is well documented, so if you know the basics you will understand it.
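If it helps, here is a hedged sketch of the pipeline with sklearn: PCA on the flattened face images gives the eigenfaces, projecting an image onto them gives its weight vector, and recognition compares weight vectors with a distance function. The LFW dataset and the number of components are assumptions, and note that in this sketch the eigenfaces are the principal components themselves rather than one per class.

```python
import numpy as np
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

# Faces as flattened pixel vectors (downloads on first call).
faces = fetch_lfw_people(min_faces_per_person=50)
X = faces.data  # shape (n_samples, n_pixels)

# PCA on mean-centred faces: each principal component is one eigenface.
n_components = 50
pca = PCA(n_components=n_components, whiten=True).fit(X)
eigenfaces = pca.components_.reshape((n_components,) + faces.images.shape[1:])

# The "weights" of an image are its coordinates in eigenface space,
# i.e. the projection of (image - mean face) onto each eigenface.
weights = pca.transform(X)  # shape (n_samples, n_components)

# Recognition: compare weight vectors with a distance function
# (here, Euclidean nearest neighbour).
query = weights[0]
nearest = np.argmin(np.linalg.norm(weights[1:] - query, axis=1)) + 1
print("closest training face index:", nearest)
```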

One-class Support Vector Machine Sensitivity Drops when the Number of Training Samples Increases

I am using a One-Class SVM for outlier detection. It appears that as the number of training samples increases, the sensitivity TP/(TP+FN) of the One-Class SVM detection result drops, while the classification rate and specificity both increase.
What's the best way of explaining this relationship in terms of the hyperplane and support vectors?
Thanks
The more training examples you have, the less your classifier is able to detect true positives correctly.
This means that the new data does not fit the model you are training very well.
Here is a simple example.
Below you have two classes, and we can easily separate them using a linear kernel.
The sensitivity of the blue class is 1.
As I add more yellow training data near the decision boundary, the generated hyperplane can't fit the data as well as before.
As a consequence, we now see that there are two misclassified blue data points.
The sensitivity of the blue class is now 0.92.
As the amount of training data increases, the support vectors generate a somewhat less optimal hyperplane. Maybe, because of the extra data, a linearly separable data set becomes non-linearly separable. In such a case, trying a different kernel, such as the RBF kernel, can help.
EDIT: more information about the RBF kernel:
In this video you can see what happens with an RBF kernel.
The same logic applies: if the training data is not easily separable in n dimensions, you will get worse results.
You should try to select a better C using cross-validation.
In this paper, figure 3 illustrates that the results can be worse if C is not properly selected:
More training data could hurt if we did not pick a proper C. We need to
cross-validate on the correct C to produce good results
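A hedged sketch of that cross-validation step, on synthetic imbalanced data with an assumed parameter grid; the scoring is set to recall, which is the same sensitivity TP/(TP+FN) discussed above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, imbalanced two-class problem (class 1 is the minority).
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Cross-validate C (and gamma) for an RBF-kernel SVM; "recall" scores
# the sensitivity TP/(TP+FN) of the positive (minority) class.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="recall", cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("cross-validated recall:", round(search.best_score_, 3))
```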
