Conclusion from PCA of dataset

I have a set of data for sequence labeling.
I ran PCA on the dataset (with the first 2 principal components on the x and y axes) and it turns out as below:
Using an LSTM network to classify the dataset above, I then decided to extract the activations from the hidden layer of the LSTM. What I obtain is like the figure below:
My question is, what conclusion can I draw by comparing both the results?
Is it fair to say that the features of the original dataset are now self-organized after running it through an LSTM classifier?

Related

Data normalization Convolutional Autoencoders

I am a little bit confused about how to normalize/standardize image pixel values before training a convolutional autoencoder. The goal is to use the autoencoder for denoising, meaning that my training data consists of noisy images and the original non-noisy images used as ground truth.
To my knowledge there are two options to pre-process the images:
- normalization
- standardization (z-score)
When normalizing using the MinMax approach (scaling between 0 and 1) the network works fine, but my question here is:
- When using the min/max values of the training set for scaling, should I use the min/max values of the noisy images or of the ground truth images?
The second thing I observed when training my autoencoder:
- Using z-score standardization, the loss decreases for the first two epochs, then it stops at about 0.030 and stays there (it gets stuck). Why is that? With normalization the loss decreases much more.
Thanks in advance,
cheers,
Mike

[Note: This answer is a compilation of the comments above, for the record]
MinMax is really sensitive to outliers and to some types of noise, so it shouldn't be used in a denoising application. You can use the 5% and 95% quantiles instead, or use z-score (for which ready-made implementations are more common).
For more realistic training, normalization should be performed on the noisy images.
Because the last layer uses sigmoid activation (info from your comments), the network's outputs will be forced between 0 and 1. Hence it is not suited for an autoencoder on z-score-transformed images (because target intensities can take arbitrary positive or negative values). The identity activation (called linear in Keras) is the right choice in this case.
Note, however, that this remark on activation only concerns the output layer; any activation function can be used in the hidden layers. Rationale: negative values in the output can be obtained through negative weights multiplying the ReLU output of hidden layers.
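To illustrate the point about outliers, here is a minimal sketch in plain Python that compares MinMax scaling with 5%/95% quantile scaling. The pixel values, the helper names (`quantile`, `fit_quantile_scaler`, `scale`) and the choice to clip values outside the quantile range are assumptions made for this example, not part of the original answer. The bounds are estimated from the noisy images and then reused for the ground truth.

```python
def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def fit_quantile_scaler(noisy_pixels, q_low=0.05, q_high=0.95):
    """Estimate the scaling bounds from the noisy training pixels only."""
    s = sorted(noisy_pixels)
    return quantile(s, q_low), quantile(s, q_high)


def scale(pixels, low, high):
    """Map [low, high] to [0, 1] and clip whatever falls outside."""
    return [min(1.0, max(0.0, (p - low) / (high - low))) for p in pixels]


# Hypothetical flattened pixel values: 0..100 plus one hot pixel (outlier).
noisy = [float(v) for v in range(101)] + [10000.0]
clean = [float(v) for v in range(101)]

low, high = fit_quantile_scaler(noisy)
minmax_scaled = scale(clean, min(noisy), max(noisy))
quantile_scaled = scale(clean, low, high)

print(max(minmax_scaled))    # the outlier squeezes the clean data close to 0
print(max(quantile_scaled))  # the clean data still spans the full range
```

With MinMax, a single hot pixel determines the scale and the useful intensities end up in a tiny interval; the quantile bounds are barely affected by it.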

SMOTE oversampling for anomaly detection using a classifier

I have sensor data and I want to do live anomaly detection: use LOF on the training set to detect anomalies, then use the labeled data to train a classifier that classifies new data points. I thought about using SMOTE because I want more anomalous points in the training data to overcome the imbalanced classification problem, but the issue is that SMOTE created many points which are inside the normal range.
How can I do oversampling without creating samples in the normal data range?
The graph for the data before applying SMOTE:
The data after SMOTE:

SMOTE linearly interpolates synthetic points between a minority-class sample and its k nearest minority-class neighbors. This means that you end up with points on the segments between a sample and its neighbors. When the samples are all over the place like this, it makes sense that synthetic points are created in the middle.
SMOTE should really be used to identify more specific regions in the feature space as the decision region for the minority class. This doesn't seem to be your use case. You want to know which points "don't belong," per se.
This seems like a fairly nice use case for DBSCAN, a density-based clustering algorithm that will identify points beyond some distance, eps, as not belonging to the same neighborhood.
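The interpolation behaviour described above can be reproduced with a small plain-Python sketch. The 1-D readings, the normal range of [-1, 1] and the helper `smote_point` are made up for illustration; picking two random anomalies is only a stand-in for "a sample and one of its k nearest neighbors" when k is large enough to reach anomalies on the other side of the normal data.

```python
import random


def smote_point(sample, neighbor, gap):
    """SMOTE-style synthetic point on the segment between two minority samples."""
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


# Hypothetical 1-D sensor readings: normal values lie in [-1, 1],
# anomalies sit on both sides of that range.
anomalies = [[-10.0], [-8.0], [9.0], [11.0]]

rng = random.Random(0)
synthetic = []
for _ in range(200):
    a, b = rng.sample(anomalies, 2)  # stand-in for a sample and one neighbor
    synthetic.append(smote_point(a, b, rng.random()))

inside_normal = [p for p in synthetic if -1.0 <= p[0] <= 1.0]
print(len(inside_normal), "of", len(synthetic), "synthetic points fall in the normal range")
```

Every synthetic point lies between two existing anomalies, so whenever the anomalies surround the normal data some of the new points land inside it.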

How should I optimize neural network for image classification using pretrained models

Thank you for viewing my question. I'm trying to do image classification based on some pre-trained models; the images should be classified into 40 classes. I want to use the pre-trained VGG and Xception models to convert each image into two 1000-dimensional vectors and stack them into a 1*2000-dimensional vector as the input of my network, which has a 40-dimensional output. The network has 2 hidden layers, one with 1024 neurons and the other with 512 neurons.
Structure:
image-> vgg(1*1000 dimensions), xception(1*1000 dimensions)->(1*2000 dimensions) as input -> 1024 neurons -> 512 neurons -> 40 dimension output -> softmax
However, using this structure I can only achieve about 30% accuracy. So my question is: how could I optimize the structure of my network to achieve higher accuracy? I'm new to deep learning, so I'm not quite sure my current design is 'correct'. I'm really looking forward to your advice.

I'm not entirely sure I understand your network architecture, but some pieces don't look right to me.
There are two major transfer learning scenarios:
ConvNet as fixed feature extractor. Take a pretrained network (either VGG or Xception will do; you do not need both), remove the last fully-connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. For example, in an AlexNet, this would compute a 4096-D vector for every image that contains the activations of the hidden layer immediately before the classifier. Once you extract the 4096-D codes for all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) for the new dataset.
Tip #1: take only one pretrained network.
Tip #2: no need for multiple hidden layers for your own classifier.
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it’s possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet become progressively more specific to the details of the classes contained in the original dataset.
Tip #3: keep the early pretrained layers fixed.
Tip #4: use a small learning rate for fine-tuning because you don't want to distort other pretrained layers too quickly and too much.
This architecture more closely resembles the ones I have seen solve the same problem, and it has a better chance of reaching high accuracy.

There are a couple of steps you may try when the model is not fitting well:
Increase training time and decrease learning rate. It may be stopping at very bad local optima.
Add additional layers that can extract specific features for the large number of classes.
Create multiple two-class deep networks for each class ('yes' or 'no' output class). This will let each network be more specialized for each class, rather than training one single network to learn all 40 classes.
Increase training samples.

Understanding Faster rcnn

I'm trying to understand Fast(er) R-CNN, and these are the questions I'm trying to answer:
1. To train a Fast R-CNN model, do we have to give bounding box information in the training phase?
2. If we have to give bounding box information, then what is the role of the ROI layer?
3. Can we use a pre-trained model that was only trained for classification, not object detection, and use it for Fast(er) R-CNN?

Your answers:
1.- Yes.
2.- The ROI layer is used to produce a fixed-size vector from variable-sized regions of interest (and hence from variable-sized images). This is performed by using max-pooling, but instead of using the typical n by n cells, the region is divided into n by n non-overlapping sub-regions (which vary in size) and the maximum value in each sub-region is output. The ROI layer also does the job of projecting the bounding box from input space to feature space.
3.- Faster R-CNN MUST be used with a pretrained network (typically on ImageNet), it cannot be trained end-to-end. This might be a bit hidden in the paper but the authors do mention that they use features from a pretrained network (VGG, ResNet, Inception, etc).
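As a rough illustration of answer 2, the sketch below max-pools two regions of different sizes into the same 2 by 2 output and projects a bounding box from image pixels to feature-map cells. It is plain Python on a single channel; the function names, the toy feature map and the total stride of 16 are assumptions made for this example, and real implementations differ in how they round the bin edges.

```python
def roi_max_pool(feature_map, roi, out_size):
    """Max-pool one region of interest into an out_size x out_size grid.

    feature_map: 2-D list (rows x cols) holding one channel.
    roi: (top, left, bottom, right) in feature-map cells, bottom/right exclusive.
    Assumes the region is at least out_size cells high and wide.
    """
    top, left, bottom, right = roi
    height, width = bottom - top, right - left
    pooled = []
    for i in range(out_size):
        # Integer bin edges: bins differ in size when height % out_size != 0.
        r0 = top + (i * height) // out_size
        r1 = top + ((i + 1) * height) // out_size
        row = []
        for j in range(out_size):
            c0 = left + (j * width) // out_size
            c1 = left + ((j + 1) * width) // out_size
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


def project_box(box, stride):
    """Map (x0, y0, x1, y1) in image pixels to (top, left, bottom, right) cells."""
    x0, y0, x1, y1 = box
    # Floor the top-left corner, ceil the bottom-right corner.
    return (y0 // stride, x0 // stride, -(-y1 // stride), -(-x1 // stride))


# Toy 6 x 8 feature map whose value encodes the cell position.
fmap = [[r * 8 + c for c in range(8)] for r in range(6)]

roi_a = project_box((0, 0, 112, 80), stride=16)  # a 5 x 7 cell region
roi_b = (2, 1, 6, 8)                             # a 4 x 7 cell region

print(roi_max_pool(fmap, roi_a, 2))
print(roi_max_pool(fmap, roi_b, 2))
```

Both regions produce a 2 by 2 output even though their sizes differ, which is what lets the fully-connected layers after the ROI layer work with a fixed input length.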

Feeding HOG into SVM: the HOG has 9 bins, but the SVM takes in a 1D matrix

In OpenCV, there is a CvSVM class which takes in a matrix of samples to train the SVM. The matrix is 2D, with the samples in the rows.
I created my own method to generate a histogram of oriented gradients (HOG) from a video feed. To do this, I created a 9-channel matrix to store the HOG, where each channel corresponds to an orientation bin. So in the end I have a 40x30 matrix of type CV_32FC(9).
I also made a visualisation for the HOG and it's working.
I don't see how I'm supposed to feed this matrix into the OpenCV SVM, because if I flatten it, I don't see how the SVM is supposed to learn a 9D hyperplane from 1D input data.

The SVM always takes in a single row of data per feature vector. The dimensionality of the feature vector is thus the length of the row. If you're dealing with 2D data, then there are 2 items per feature vector. Example of 2D data is on this webpage:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Code of an equivalent demo in OpenCV: http://sites.google.com/site/btabibian/labbook/svmusingopencv
The point is that even though you're thinking of the histogram as 2D with 9-bin cells, the feature vector is in fact the flattened version of this. So it's correct to flatten it out into a long feature vector. The result for me was a feature vector of length 2304 (16x16x9) and I get 100% prediction accuracy on a small test set (i.e. it's probably slightly less than 100% but it's working exceptionally well).
The reason this works is that the SVM works with one weight per item of the feature vector. So it doesn't have anything to do with the problem's dimension; the hyperplane is always in the same dimension as the feature vector. Another way of looking at it is to forget about the hyperplane and just view it as a bunch of weights, one for each item in the feature vector. In this case, it needs one weight for every item, then it multiplies each item by its weight, sums the products and outputs the result.
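A minimal plain-Python sketch of both ideas: flattening a 40x30 grid of 9-bin cells into one row, and scoring that row with one weight per item. The HOG values, the weights and the bias are made up for illustration (a trained linear SVM would learn them), and the row-major flattening order is an assumption; any order works as long as it is the same for every sample.

```python
# Hypothetical HOG grid matching the question: 40 x 30 cells with 9 bins each.
rows, cols, bins = 40, 30, 9
hog = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
hog[2][5][3] = 1.0  # a single non-zero orientation bin, for illustration


def flatten(grid):
    """Row-major flattening of a rows x cols x bins grid into one feature vector."""
    return [value for row in grid for cell in row for value in cell]


def decision_value(weights, bias, features):
    """Linear decision function: multiply each item by its weight, sum, add bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


features = flatten(hog)
print(len(features))  # 10800 = 40 * 30 * 9

# Made-up weights: one per item of the flattened feature vector.
weights = [0.0] * len(features)
index = (2 * cols + 5) * bins + 3  # position of hog[2][5][3] after flattening
weights[index] = 2.0
print(decision_value(weights, -0.5, features))  # 1.5
```

Each flattened vector becomes one row of the training matrix, so N frames give an N x 10800 matrix.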