SMOTE oversampling for anomaly detection using a classifier - machine-learning

I have sensor data and I want to do live anomaly detection: run LOF on the training set to label anomalies, then train a classifier on the labeled data to classify new data points. I thought about using SMOTE because I want more anomalous points in the training data to overcome the class imbalance, but the issue is that SMOTE created many points that fall inside the normal range.
How can I do oversampling without creating samples in the normal data range?
[Figure: the data before applying SMOTE]
[Figure: the data after SMOTE]

SMOTE linearly interpolates synthetic points between a minority-class sample and its k nearest minority-class neighbors. This means you end up with points on the line segments between a sample and its neighbors. When the minority samples are scattered all over the place like this, with the normal data in between, it makes sense that synthetic points get created in the middle.
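To make the interpolation concrete, here is a minimal NumPy sketch of the sampling step. It is not the imbalanced-learn implementation; the anomaly coordinates, the "normal band" |y| < 2, and the helper name `smote_like_sample` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority (anomaly) points lying on both sides of a normal band around y = 0.
minority = np.array([[0.0, 5.0], [1.0, -5.0], [2.0, 6.0], [3.0, -6.0]])

def smote_like_sample(points, k, rng):
    """Draw one synthetic point the way SMOTE does: pick a sample, pick one of
    its k nearest minority neighbors, and interpolate between the two."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]  # index 0 is the point itself
    j = rng.choice(neighbors)
    gap = rng.random()  # uniform in [0, 1)
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=3, rng=rng) for _ in range(200)])

# Any segment joining an anomaly above the band to one below it crosses the band.
inside_normal_band = np.abs(synthetic[:, 1]) < 2.0
print(f"{inside_normal_band.mean():.0%} of synthetic points fall inside |y| < 2")
```

Whenever a sample and its chosen neighbor sit on opposite sides of the normal data, part of the connecting segment lies inside the normal range, which is what the "after SMOTE" plot shows.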
SMOTE should really be used to identify more specific regions in the feature space as the decision region for the minority class. This doesn't seem to be your use case. You want to know which points "don't belong," per se.
This seems like a fairly nice use case for DBSCAN, a density-based clustering algorithm that marks points which are not within some distance, eps, of a dense neighborhood as noise instead of assigning them to a cluster.
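A rough sketch of that idea with scikit-learn's DBSCAN follows. The data is synthetic, and the `eps` and `min_samples` values are illustrative guesses that would need tuning on the real sensor data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Hypothetical sensor readings: a dense normal cluster plus a few far-away points.
normal = rng.normal(loc=0.0, scale=0.3, size=(200, 2))
anomalies = np.array([[4.0, 4.0], [-4.0, 3.5], [5.0, -4.0]])
X = np.vstack([normal, anomalies])

# eps and min_samples are illustrative values, not recommendations.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN gives the label -1 to noise points, i.e. points that are neither
# core points nor within eps of a core point.
noise_idx = np.flatnonzero(labels == -1)
print("points flagged as noise:", noise_idx)
```

Note that scikit-learn's DBSCAN has no `predict` method, so for live data the noise labels would still have to feed a downstream model, as in the LOF-then-classifier setup from the question.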

Related

Reusing image-to-image GANs for spatial denoising of trajectories

I work on particle tracking experiments that generate trajectories (x and y coordinates over time) from videos. Some experimental setups result in trajectories with a lot of spatial noise.
I'm looking into using machine-learning models to denoise those trajectories, as our available algorithmic methods are limited. My goal is to train the model with two inputs: simulated trajectories as ground truth, and the same trajectories with induced noise.
So far, most of the solutions I found for multi-input models that aren't classification or regression point to CNNs. However, I came across image-to-image denoising models (such as https://arxiv.org/abs/1611.07004) which seem to work based on the same relation between inputs, although with a different shape.
Would it be feasible to adapt such a model for this purpose?
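For reference, the paired training data described above could be generated along these lines. This is only a sketch: the 2D random walk, the Gaussian noise model, and the name `simulate_pair` are assumptions, not the actual simulation used in these experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pair(n_steps, step_std, noise_std, rng):
    """Return (clean, noisy) trajectories, each of shape (n_steps, 2)."""
    steps = rng.normal(scale=step_std, size=(n_steps, 2))
    clean = np.cumsum(steps, axis=0)  # assumed ground truth: a 2D random walk
    noisy = clean + rng.normal(scale=noise_std, size=clean.shape)  # induced spatial noise
    return clean, noisy

pairs = [simulate_pair(100, 1.0, 0.5, rng) for _ in range(32)]
clean_batch = np.stack([c for c, _ in pairs])  # targets
noisy_batch = np.stack([n for _, n in pairs])  # model inputs
print(clean_batch.shape, noisy_batch.shape)    # (32, 100, 2) (32, 100, 2)
```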

Why does having too many principal components for handwritten digits classification result in less accuracy

I'm currently using PCA to do handwritten digit recognition on the MNIST database (each digit has about 1000 observations and 784 features). One thing I have found confusing is that the accuracy is highest with 40 PCs. If the number of PCs grows beyond this point, the accuracy starts to drop continuously.
From my understanding of PCA, I thought the more components I have, the better I can describe a dataset. Why does the accuracy drop if I have too many PCs?
To identify the optimal number of components, plot the elbow curve:
https://en.wikipedia.org/wiki/Elbow_method_(clustering)
The idea behind PCA is to reduce the dimensionality of the data by finding the principal components.
Lastly, I do not think that PCA itself can overfit the data, as it is never fitted to the labels.
You are just projecting the data onto eigenvectors to capture most of the variance along each axis. The later components capture little variance and are mostly noise, so it is more likely the classifier trained on them that overfits.
This video should help: https://www.youtube.com/watch?v=_UVHneBUBW0
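As a rough sketch of the elbow idea, the snippet below computes the cumulative explained variance per number of components. It assumes scikit-learn's small 8x8 digits set (64 features) as a stand-in for MNIST, and the 95% threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: the bundled 8x8 digits set stands in for the 784-feature MNIST data.
X, _ = load_digits(return_X_y=True)

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that retains at least 95% of the variance.
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("components for 95% variance:", n_95)

# Values to plot for the elbow curve: variance retained vs. number of components.
for k in (5, 10, 20, 40, 64):
    print(k, round(float(cumulative[k - 1]), 3))
```

Plotting `cumulative` against the component index gives the elbow curve; the flat tail corresponds to components that add very little variance.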

Logistic Regression is sensitive to outliers? Using on synthetic 2D dataset

I am currently using sklearn's Logistic Regression function to work on a synthetic 2D problem. The dataset is shown below:
I'm basically plugging the data into sklearn's model, and this is what I'm getting (the light green; disregard the dark green):
The code for this is only two lines; model = LogisticRegression(); model.fit(tr_data,tr_labels). I've checked the plotting function; that's fine as well. I'm using no regularizer (should that affect it?)
It seems really strange to me that the boundaries behave in this way. Intuitively I feel they should be more diagonal, as the data is (mostly) located top-right and bottom-left, and from testing some things out it seems a few stray datapoints are what's causing the boundaries to behave in this manner.
For example here's another dataset and its boundaries
Would anyone know what might be causing this? From my understanding Logistic Regression shouldn't be this sensitive to outliers.
Your model is overfitting the data (the decision regions it found do indeed perform better on the training set than the diagonal line you would expect).
The loss is optimal when all the data is classified correctly with probability 1. The distances to the decision boundary enter into the probability computation. The unregularized algorithm can use large weights to make the decision region very sharp, so in your example it finds an optimal solution where (some of) the outliers are classified correctly.
Stronger regularization prevents that, and the distances then play a bigger role. Try different values for the inverse regularization strength C, e.g.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.1)
model.fit(tr_data, tr_labels)
Note: the default value C=1.0 already corresponds to a regularized version of logistic regression.
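To see the effect of C, one can fit the model for several values and compare the weight norms. The sketch below uses made-up data (two blobs plus two stray points) as a stand-in for `tr_data` and `tr_labels`; the C values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for the question's data: two blobs plus two stray points.
X0 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(200, 2))
stray = np.array([[3.0, -3.0], [3.5, -2.5]])  # labeled 1, far from the main diagonal
tr_data = np.vstack([X0, X1, stray])
tr_labels = np.array([0] * 200 + [1] * 200 + [1] * len(stray))

norms = []
for C in (0.001, 0.1, 1.0, 1000.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(tr_data, tr_labels)
    w = model.coef_[0]
    norms.append(float(np.linalg.norm(w)))
    # Small C (strong regularization) keeps the weights small, so the predicted
    # probabilities change gradually with the distance to the boundary.
    print(f"C={C:<7} |w|={norms[-1]:.2f}  direction={w / np.linalg.norm(w)}")
```

Comparing the printed directions across C values shows how much the stray points are able to rotate the boundary as the regularization is relaxed.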
Let us further qualify why logistic regression overfits here. After all, there are just a few outliers but hundreds of other data points. To see why, it helps to note that logistic loss is a kind of smoothed version of hinge loss (used in SVM).
SVM does not 'care' about samples on the correct side of the margin at all: as long as they do not cross the margin, they incur zero cost. Since logistic regression is a smoothed version of SVM, the far-away samples do incur a cost, but it is negligible compared to the cost incurred by samples near the decision boundary.
So, unlike e.g. Linear Discriminant Analysis, samples close to the decision boundary have disproportionately more impact on the solution than far-away samples.
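The comparison between the two losses can be checked numerically. This is a small sketch using the standard definitions of hinge and logistic loss as functions of the margin; the margin values are arbitrary.

```python
import numpy as np

# Margin m = y * f(x) with y in {-1, +1}; m > 0 means the sample is classified correctly.
margins = np.array([-2.0, 0.0, 1.0, 3.0, 6.0])

hinge = np.maximum(0.0, 1.0 - margins)  # SVM: exactly zero once m >= 1
logistic = np.log1p(np.exp(-margins))   # logistic regression: small but never zero

for m, h, l in zip(margins, hinge, logistic):
    print(f"margin={m:+.1f}  hinge={h:.4f}  logistic={l:.4f}")
```

Both losses are dominated by samples with small or negative margins, which is why a handful of points near (or across) the boundary can outweigh hundreds of far-away ones.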

One-class Support Vector Machine Sensitivity Drops when the number of training sample increase

I am using One-Class SVM for outlier detection. It appears that as the number of training samples increases, the sensitivity TP/(TP+FN) of the One-Class SVM detection result drops, while the classification rate and specificity both increase.
What's the best way of explaining this relationship in terms of hyperplane and support vectors?
Thanks
The more training examples you have, the less your classifier is able to detect true positives correctly.
It means that the new data does not fit correctly with the model you are training.
Here is a simple example.
Below you have two classes, and we can easily separate them using a linear kernel.
The sensitivity of the blue class is 1.
As I add more yellow training data near the decision boundary, the generated hyperplane can't fit the data as well as before.
As a consequence, there are now two misclassified blue data points.
The sensitivity of the blue class is now 0.92.
As the amount of training data increases, the support vectors generate a somewhat less optimal hyperplane. Perhaps the extra data turns a linearly separable data set into one that is no longer linearly separable. In such a case, trying a different kernel, such as the RBF kernel, can help.
EDIT: Adding more information about the RBF kernel:
In this video you can see what happens with an RBF kernel.
The same logic applies: if the training data is not easily separable in n dimensions, you will get worse results.
You should try to select a better C using cross-validation.
In this paper, Figure 3 illustrates that the results can be worse if C is not properly selected:
More training data could hurt if we did not pick a proper C. We need to cross-validate on the correct C to produce good results
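A minimal sketch of selecting C by cross-validation is shown below, assuming a two-class `SVC` as in the example above and made-up data. Balanced accuracy (the mean of sensitivity and specificity) is used as the score here, which is one possible choice; scoring on sensitivity alone can favor a model that labels everything positive. Note that scikit-learn's `OneClassSVM` is parameterized by `nu` rather than `C`, so for the one-class case `nu` would be the value to tune.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical two-class data with some overlap near the boundary.
X = np.vstack([rng.normal(-1.0, 1.0, size=(150, 2)),
               rng.normal(1.0, 1.0, size=(150, 2))])
y = np.array([0] * 150 + [1] * 150)

# The candidate C values are illustrative only.
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring="balanced_accuracy",
    cv=5,
)
search.fit(X, y)
print("best C:", search.best_params_["C"])
print("cross-validated balanced accuracy:", round(float(search.best_score_), 3))
```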

Interpreting a Self Organizing Map

I have been reading about Self Organizing Maps, and I understand the algorithm (I think), however something still eludes me.
How do you interpret the trained network?
How would you then actually use it for, say, a classification task (once you have done the clustering with your training data)?
All of the material I seem to find (printed and digital) focuses on the training of the algorithm. I believe I may be missing something crucial.
Regards
SOMs are mainly a dimensionality reduction algorithm, not a classification tool. They are used for dimensionality reduction just like PCA and similar methods (once trained, you can check which neuron is activated by your input and use this neuron's position as the value); the only actual difference is their ability to preserve a given topology in the output representation.
So what a SOM actually produces is a mapping from your input space X to the reduced space Y (most commonly a 2D lattice, making Y a 2-dimensional space). To perform actual classification, you should transform your data through this mapping and run some other classification model (SVM, neural network, decision tree, etc.).
In other words, SOMs are used for finding another representation of the data, one that is easy for humans to analyze further (as it is mostly 2-dimensional and can be plotted) and very easy for any further classification model to work with. This is a great method for visualizing high-dimensional data, analyzing "what is going on", how some classes are grouped geometrically, etc. But they should not be confused with other neural models like artificial neural networks or even growing neural gas (which is a very similar concept, yet gives a direct data clustering), as they serve a different purpose.
Of course one can use SOMs directly for classification, but this is a modification of the original idea that requires a different data representation, and in general it does not work as well as using some other classifier on top of the SOM.
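The mapping from input space to grid positions can be sketched with a small NumPy SOM. This is a simplified illustration, not a reference implementation: the grid size, the linear decay schedules, the toy data, and the function names `train_som` and `best_matching_unit` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_matching_unit(weights, x):
    """Grid position (row, col) of the neuron whose weight vector is closest to x."""
    dists = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dists), dists.shape)

def train_som(X, rows, cols, n_iter=2000, lr0=0.5, rng=rng):
    """Train a small rectangular SOM; returns weights of shape (rows, cols, dim)."""
    sigma0 = max(rows, cols) / 2.0
    weights = rng.random((rows, cols, X.shape[1]))
    # Grid coordinates of every neuron, shape (rows, cols, 2).
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)               # linearly decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3  # shrinking neighborhood radius
        x = X[rng.integers(len(X))]
        bmu = np.array(best_matching_unit(weights, x))
        # Gaussian neighborhood on the grid, centered at the best matching unit.
        d2 = np.sum((grid - bmu) ** 2, axis=-1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights

# Hypothetical data: two well-separated clusters in 5 dimensions.
X = np.vstack([rng.normal(0.0, 0.1, size=(100, 5)), rng.normal(1.0, 0.1, size=(100, 5))])
weights = train_som(X, rows=6, cols=6)

# Each sample's reduced representation is the 2D grid position of its best
# matching unit; these coordinates can then be passed to any classifier.
embedded = np.array([best_matching_unit(weights, x) for x in X])
print(embedded.shape)  # (200, 2)
```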
EDIT
There are at least a few ways of visualizing the trained SOM:
one can render the SOM's neurons as points in the input space, with edges connecting the topologically close ones (this is possible only if the input space has a small number of dimensions, like 2-3)
display data classes on the SOM's topology: if your data is labeled with some numbers {1,..k}, we can bind k colors to them; for the binary case let us consider blue and red. Next, for each data point we calculate its corresponding neuron in the SOM and add this label's color to the neuron. Once all data have been processed, we plot the SOM's neurons, each at its original position in the topology, with the color being some aggregate (e.g. mean) of the colors assigned to it. This approach, if we use some simple topology like a 2D grid, gives us a nice low-dimensional representation of the data. In the following image, the subimages from the third one to the end are the results of such a visualization, where red means label 1 ("yes" answer) and blue means label 2 ("no" answer)
one can also visualize the inter-neuron distances by calculating how far apart each pair of connected neurons is and plotting this on the SOM's map (second subimage in the above visualization)
one can cluster the neurons' positions with some clustering algorithm (like K-means) and visualize the cluster ids as colors (first subimage)
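The inter-neuron distance map from the list above can be sketched as follows. The snippet assumes a rectangular grid with 4-connected neighbors and uses a hand-made weight array in place of a trained SOM; the function name `u_matrix` is chosen here for illustration.

```python
import numpy as np

def u_matrix(weights):
    """Mean distance from each neuron to its 4-connected grid neighbors.

    weights has shape (rows, cols, dim), with rows and cols at least 2.
    """
    rows, cols, _ = weights.shape
    total = np.zeros((rows, cols))
    count = np.zeros((rows, cols))
    # Distances between vertically adjacent neurons, shape (rows - 1, cols).
    dv = np.linalg.norm(weights[1:] - weights[:-1], axis=-1)
    total[1:] += dv
    total[:-1] += dv
    count[1:] += 1
    count[:-1] += 1
    # Distances between horizontally adjacent neurons, shape (rows, cols - 1).
    dh = np.linalg.norm(weights[:, 1:] - weights[:, :-1], axis=-1)
    total[:, 1:] += dh
    total[:, :-1] += dh
    count[:, 1:] += 1
    count[:, :-1] += 1
    return total / count

# Hypothetical "trained" weights: left half of the grid near 0, right half near 1.
weights = np.zeros((4, 4, 3))
weights[:, 2:] = 1.0

u = u_matrix(weights)
print(np.round(u, 3))
```

High values in the resulting map mark borders between groups of neurons with dissimilar weights; in this toy grid they appear only in the two middle columns.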
