PyTorch Feature Scaling within the Model or within the Dataloader - machine-learning

I am trying to apply simple feature scaling in PyTorch. For example, I have an image, and I want to scale certain pixel values down by a factor of 10. I have two options:
Divide those features by 10.0 directly in the dataset's __getitem__ function, which feeds the dataloader;
Pass the original features into the model's forward function and scale the corresponding features down before passing them through the trainable layers.
I ran several experiments and observed that after the first epoch the validation losses of the two approaches start to diverge slightly, and after a couple of hundred epochs the two trained models differ substantially. Any suggestions?
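For reference, the two placements can be sketched side by side. This is only a plain-Python illustration under stated assumptions: the class names, the feature indices and the single weighted-sum "layer" are hypothetical stand-ins, and in PyTorch the classes would subclass torch.utils.data.Dataset and torch.nn.Module instead.

```python
SCALE = 10.0
SCALED_IDX = [0, 2]  # assumed positions of the features to scale down


def scale_features(features, idx=SCALED_IDX, scale=SCALE):
    """Return a copy of `features` with the entries at `idx` divided by `scale`."""
    out = list(features)
    for i in idx:
        out[i] = out[i] / scale
    return out


class ScaledDataset:
    """Option 1: scale inside __getitem__, before the sample reaches the model."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        return scale_features(self.samples[i])


class RawDataset:
    """Returns samples unchanged, for use with option 2."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        return list(self.samples[i])


class ScalingModel:
    """Option 2: scale at the top of forward, before any trainable layer."""

    def __init__(self, weights):
        self.weights = weights  # stand-in for trainable parameters

    def trainable_layer(self, x):
        return sum(w * v for w, v in zip(self.weights, x))

    def forward(self, x, already_scaled=False):
        if not already_scaled:
            x = scale_features(x)
        return self.trainable_layer(x)


samples = [[255.0, 3.0, 128.0], [10.0, 7.0, 40.0]]
model = ScalingModel(weights=[0.5, -1.0, 0.25])

scaled_ds = ScaledDataset(samples)
raw_ds = RawDataset(samples)

out_option1 = [model.forward(scaled_ds[i], already_scaled=True) for i in range(len(scaled_ds))]
out_option2 = [model.forward(raw_ds[i]) for i in range(len(raw_ds))]
```

In this sketch both options hand exactly the same numbers to the trainable layer, so the forward outputs match. If that also holds in the real code, the divergence would have to come from somewhere else (for example unseeded initialisation or shuffling), though that is a guess since the training code is not shown.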

Related

Data normalization Convolutional Autoencoders

I am a little confused about how to normalize/standardize image pixel values before training a convolutional autoencoder. The goal is to use the autoencoder for denoising, meaning that my training images consist of noisy images and the original non-noisy images used as ground truth.
To my knowledge there are two options to pre-process the images:
- normalization
- standardization (z-score)
When normalizing using the MinMax approach (scaling between 0-1) the network works fine, but my question here is:
- When using the min max values of the training set for scaling, should I use the min/max values of the noisy images or of the ground truth images?
The second thing I observed when training my autoencoder:
- Using z-score standardization, the loss decreases for the first two epochs, then stops at about 0.030 and stays there (it gets stuck). Why is that? With normalization the loss decreases much more.
Thanks in advance,
cheers,
Mike
[Note: This answer is a compilation of the comments above, for the record]
MinMax is really sensitive to outliers and to some types of noise, so it shouldn't be used in a denoising application. You can use the 5% and 95% quantiles instead, or use z-score (for which ready-made implementations are more common).
For more realistic training, normalization should be performed on the noisy images.
Because the last layer uses sigmoid activation (info from your comments), the network's outputs will be forced between 0 and 1. Hence it is not suited for an autoencoder on z-score-transformed images (because target intensities can take arbitrary positive or negative values). The identity activation (called linear in Keras) is the right choice in this case.
Note, however, that this remark on activation only concerns the output layer; any activation function can be used in the hidden layers. Rationale: negative values in the output can be obtained through negative weights multiplying the ReLU output of hidden layers.
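The quantile suggestion above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names are made up for the example, pixels are treated as a flat list, and the percentile uses simple linear interpolation. The bounds are fitted on the noisy images, as recommended above.

```python
def percentile(values, q):
    """Linear-interpolation percentile of `values` for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


def fit_quantile_scaler(noisy_pixels, q_low=5, q_high=95):
    """Compute the scaling bounds from the noisy training pixels."""
    return percentile(noisy_pixels, q_low), percentile(noisy_pixels, q_high)


def quantile_scale(pixels, low, high):
    """Map `low` to 0 and `high` to 1; values outside the bounds fall outside [0, 1]."""
    span = high - low
    return [(p - low) / span for p in pixels]


# Toy data: intensities 0..99 plus one extreme outlier.
noisy = [float(v) for v in range(100)] + [10000.0]

low, high = fit_quantile_scaler(noisy)
scaled_mid = quantile_scale([50.0], low, high)[0]

# Plain min-max on the same data, for comparison.
minmax_mid = (50.0 - min(noisy)) / (max(noisy) - min(noisy))
```

With the single outlier present, min-max squeezes the mid-range pixel to 0.005, while the quantile bounds keep it at 0.5. The same `low` and `high` would then be reused for the ground-truth images and for any later data.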

Future-proofing feature scaling in machine learning?

I have a question about how feature scaling works after training a model.
Let's say a neural network model predicts the height of a tree by training on outside temperature.
The lowest outside temperature in my training data is 60F and the max is 100F. I scale the temperature between 0 and 1 and train the model. I save the model for future predictions. Two months later, I want to predict on some new data. But this time the min and max temperatures in my test data are -20F and 50F, respectively.
How does the trained model deal with this? The range I imposed the scaling on in the training set to generate my trained model does not match the test data range.
What would prevent me from hard-coding a range to scale to that I know the data will always be within, say from -50F to 130F? The problem I see here is if I have a model with many features. If I impose a different hard scale to each feature, using feature scaling is essentially pointless, is it not?
Different scales won't work. Your model trains on one scale and learns that scale; if you change the scale, the model will still assume the original one and make badly shifted predictions.
Training again will overwrite what was learned before.
So, yes, hardcode your scaling (preferably directly on your data, not inside the model).
And for a quality result, train with all the data you can gather.
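A hard-coded scaling of this kind could look like the sketch below. The -50F to 130F bounds come from the question; the function and constant names are just illustrative.

```python
# Fixed bounds chosen up front, independent of any particular dataset.
TEMP_MIN_F = -50.0
TEMP_MAX_F = 130.0


def scale_temperature(temp_f, lo=TEMP_MIN_F, hi=TEMP_MAX_F):
    """Map a temperature in Fahrenheit onto [0, 1] using fixed bounds."""
    return (temp_f - lo) / (hi - lo)


# Training data (60F..100F) and later data (-20F..50F) share one scale.
train_scaled = [scale_temperature(t) for t in (60.0, 100.0)]
later_scaled = [scale_temperature(t) for t in (-20.0, 50.0)]
```

Because the bounds never change, 60F maps to the same value at training time and two months later. The later data still covers a range the model has not seen, though, which is why training on all the data you can gather matters.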

How should I optimize neural network for image classification using pretrained models

Thank you for viewing my question. I'm trying to do image classification based on some pre-trained models; the images should be classified into 40 classes. I want to use the VGG and Xception pre-trained models to convert each image to two 1000-dimensional vectors and stack them into a 1*2000-dimensional vector as the input of my network, which has a 40-dimensional output. The network has 2 hidden layers, one with 1024 neurons and the other with 512 neurons.
Structure:
image-> vgg(1*1000 dimensions), xception(1*1000 dimensions)->(1*2000 dimensions) as input -> 1024 neurons -> 512 neurons -> 40 dimension output -> softmax
However, using this structure I can only achieve about 30% accuracy. So my question is: how could I optimize the structure of my network to achieve higher accuracy? I'm new to deep learning, so I'm not quite sure my current design is 'correct'. I'm really looking forward to your advice.
I'm not entirely sure I understand your network architecture, but some pieces don't look right to me.
There are two major transfer learning scenarios:
ConvNet as fixed feature extractor. Take a pretrained network (either VGG or Xception will do; you do not need both), remove the last fully-connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. For example, in an AlexNet, this would compute a 4096-D vector for every image that contains the activations of the hidden layer immediately before the classifier. Once you extract the 4096-D codes for all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) for the new dataset.
Tip #1: take only one pretrained network.
Tip #2: no need for multiple hidden layers for your own classifier.
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it’s possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet become progressively more specific to the details of the classes contained in the original dataset.
Tip #3: keep the early pretrained layers fixed.
Tip #4: use a small learning rate for fine-tuning because you don't want to distort other pretrained layers too quickly and too much.
This architecture more closely resembles the ones I have seen solve the same problem, and it has a better chance of reaching high accuracy.
There are a couple of steps you may try when the model is not fitting well:
Increase training time and decrease learning rate. It may be stopping at a very poor local optimum.
Add additional layers that can extract specific features for the large number of classes.
Create multiple two-class deep networks for each class ('yes' or 'no' output class). This will let each network be more specialized for each class, rather than training one single network to learn all 40 classes.
Increase training samples.

How to fit a classifier with high accuracy on the training set with low features?

I have input (r,c) in range (0, 1] as the coordinate of a pixel of an image and its color 1 or 2 only.
I have about 6,400 pixels.
My attempt at fitting X=(r,c) and y=color was a failure: the accuracy won't go higher than 70%.
Here's the image:
The first is the actual image; the second is the image I train on, which has only 2 colors. The last is the image that the neural network generated, with about 500 weights and 50 training iterations. The input layer has 2 units, there is one hidden layer of size 100, and the output layer has 2 units. (For binary classification like this I may need only one output unit, but I am preparing for multi-class classification.)
The classifier failed to fit the training set. Why is that? I tried generating high-order polynomial terms of those 2 features, but it doesn't help. I tried using a Gaussian kernel with 20-100 random landmarks on the picture to add more features, and got similar output. I also tried logistic regression, which doesn't help either.
Please help me increase the accuracy.
Here's the input: input.txt (you can load it into Octave; the variables are coordinate (the r,c features) and idx (the color)).
You can try plotting it first to make sure that you understand the input then try training on it and tell me if you get better result.
Your problem is hard to model. You are trying to fit a function from R^2 to R which has lots of complexity: lots of "spikes" and lots of discontinuous regions (pixels that are completely separated from the rest). This is not an easy problem, and not a useful one. In order to overfit your network in such a setting you will need plenty of hidden units. So what are the options for doing so?
General things that are missing in the question, and are important
Your output variable should be {0, 1} if you are fitting your network through cross entropy cost (log likelihood), which you should use for classification.
50 iterations (if you are talking about mini-batch iterations) is orders of magnitude too small, unless you mean 50 epochs (iterations over the whole training set).
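As a small sketch of the first point: the colour labels {1, 2} from the question can be remapped to {0, 1} and scored with binary cross-entropy. The helper names and the example probabilities below are made up for illustration; a real network would supply the predicted probabilities.

```python
import math


def to_binary_targets(colors):
    """Map colour labels {1, 2} to targets {0, 1}."""
    return [c - 1 for c in colors]


def binary_cross_entropy(targets, probs, eps=1e-12):
    """Mean negative log likelihood of {0, 1} targets under predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)


colors = [1, 2, 2, 1]             # labels as given in the question
targets = to_binary_targets(colors)
probs = [0.1, 0.9, 0.8, 0.3]      # hypothetical predicted P(target == 1)
loss = binary_cross_entropy(targets, probs)
```

Feeding the raw labels {1, 2} into this loss would not make sense, since the formula assumes each target is either 0 or 1.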
Actual things, that will probably need to be done (at least one of the below):
I assume that you are using ReLU activations (or tanh, it is hard to say from the output). You can instead use RBF activations and increase the number of hidden neurons to ~5000.
If you do not want to go with RBFs, then you will need 1-2 additional hidden layers to fit a function of this complexity. Try an architecture of type 100-100-100 instead.
If the above fails - increase number of hidden units, that's all you need - enough capacity.
In general: neural networks are not designed for working with low-dimensional datasets. It is a nice example from the web that you can learn a pixel-position-to-color mapping, but it is completely artificial and seems to actually harm people's intuitions.

Using an ANN to calculate a position vector's length and the angle between it and the x-axis

I'm new to neural networks and trying to get the hang of it by solving the following task:
Given a semi circle which defines an area above the x-axis, I would like to teach an ANN to output the length of a vector pointing to any position within that area. In addition, I would also like to know the angle between it and the x-axis.
I thought of this as a classical example of supervised learning and used backpropagation to train a feed-forward network. The network consists of two input neurons, two output neurons, and a variable number of hidden neurons organised in a variable number of hidden layers.
My training data is a random and unsorted sample of points within that area and the respective desired values. The coordinates of the points serve as the input of the net while I use the calculated values to minimise the error.
However, even after thousands of training iterations and empirical changes to the network's topology, I am unable to produce results with an error below ~0.2 (Radius: 20.0, Topology: 2/4/2).
Are there any obvious pitfalls I'm failing to see or does the chosen approach just not fit the task? Which other network types and/or learning techniques could be used to complete the task?
I wouldn't use variable amounts of hidden layers, I would use just one.
Then, I wouldn't use two output neurons, I would use two separate ANNs, one for each of the values you're after. This should do better, since your outputs aren't clearly related in my opinion.
Then, I would experiment with number of hidden neurons between 2 and 10 and different activation functions (logistic and tanh, maybe ReLUs).
After that, do you scale your data? It might be worth scaling both your inputs and outputs. Sigmoid units return small numbers, so it is good if you can adapt your outputs to be small as well (in [-1, 1] or [0, 1]). For example, if you want your angles in degrees, divide all of your targets by 360 before training the ANN on them. Then when the ANN returns a result, multiply it by 360 and see if that helps.
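The scaling idea could be sketched like this. The radius of 20.0 is taken from the question and the division by 360 from the suggestion above; the function names are hypothetical, and dividing the coordinates and the length by the radius is an assumption about how one might scale them.

```python
import math

RADIUS = 20.0        # radius of the semicircle, from the question
ANGLE_SCALE = 360.0  # divide angles in degrees by 360, as suggested above


def make_example(x, y):
    """Build one scaled training pair from a point (x, y) in the semicircle."""
    length = math.hypot(x, y)
    angle_deg = math.degrees(math.atan2(y, x))
    inputs = (x / RADIUS, y / RADIUS)                      # x in [-1, 1], y in [0, 1]
    targets = (length / RADIUS, angle_deg / ANGLE_SCALE)   # both in [0, 1]
    return inputs, targets


def unscale_prediction(scaled_length, scaled_angle):
    """Convert network outputs back to a length and an angle in degrees."""
    return scaled_length * RADIUS, scaled_angle * ANGLE_SCALE


inputs, targets = make_example(0.0, 10.0)
length, angle_deg = unscale_prediction(*targets)
```

The network is trained on `inputs` and `targets`, and `unscale_prediction` is applied to whatever it returns. Note that an error of 0.2 means something different on the scaled targets than on the raw ones, so errors should be compared after unscaling.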
Finally, there are a number of ways to train your neural network. Gradient descent is the classic, but probably not the best. Better methods are conjugate gradient, BFGS etc. See here for optimizers if you're using python - even if not, they might give you an idea of what to search for in your language.
