Why do we need to preserve the "expected output" during dropout? - machine-learning

I am very confused about why we need to preserve the expected value of the output when performing dropout regularisation. Why does it matter if the mean of the outputs of layer l differs between the training and testing phases?
Weights that are non-zero after dropout are just slightly scaled versions of themselves, so how does that affect the decision-making power of the neural network?
According to a comment under this question, the output-layer sigmoid might interpret a value as 0 instead of 1 if the activations are not scaled. But weights that are dropped don't contribute anyway.
Please shed some light on this; I am not able to see the bigger picture of the concept.

Found the answer to this, courtesy of Andrew Ng's lecture videos.
We preserve the expected value of the activations where dropout is applied so that the expected input to the next layer, and therefore the cost, is the same as it would be without dropout. Hence we scale up the surviving activations (dividing them by the keep probability), while dropout itself encourages the network to spread out its weights.
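To make the scaling concrete, here is a minimal NumPy sketch of inverted dropout. The function name `inverted_dropout` and the value `keep_prob=0.8` are just illustrative choices, not something from the lecture. It compares the mean activation with and without the rescaling step.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob, then rescale the survivors."""
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 1000))  # stand-in for the activations of layer l

# With rescaling: the mean stays close to the no-dropout mean of 1.0.
dropped = inverted_dropout(a, keep_prob=0.8, rng=rng)

# Without rescaling: the mean shrinks to roughly keep_prob.
unscaled = a * (rng.random(a.shape) < 0.8)

print(dropped.mean(), unscaled.mean())
```

Without the division, the next layer would see inputs that are about 20% smaller during training than at test time, when nothing is dropped.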

Related

Data normalization Convolutional Autoencoders

I am a little bit confused about how to normalize/standardize image pixel values before training a convolutional autoencoder. The goal is to use the autoencoder for denoising, meaning that my training images consist of noisy images and the original non-noisy images used as ground truth.
To my knowledge there are two options to pre-process the images:
- normalization
- standardization (z-score)
When normalizing using the MinMax approach (scaling between 0-1) the network works fine, but my question here is:
- When using the min max values of the training set for scaling, should I use the min/max values of the noisy images or of the ground truth images?
The second thing I observed when training my autoencoder:
- Using z-score standardization, the loss decreases for the first two epochs; after that it stops at about 0.030 and stays there (it gets stuck). Why is that? With normalization the loss decreases much more.
Thanks in advance,
cheers,
Mike
[Note: This answer is a compilation of the comments above, for the record]
MinMax is really sensitive to outliers and to some types of noise, so it shouldn't be used in a denoising application. You can use the 5% and 95% quantiles instead, or use z-score (for which ready-made implementations are more common).
For more realistic training, normalization should be performed on the noisy images.
Because the last layer uses sigmoid activation (info from your comments), the network's outputs will be forced between 0 and 1. Hence it is not suited for an autoencoder on z-score-transformed images (because target intensities can take arbitrary positive or negative values). The identity activation (called linear in Keras) is the right choice in this case.
Note however that this remark on activation only concerns the output layer; any activation function can be used in the hidden layers. Rationale: negative values in the output can be obtained through negative weights multiplying the ReLU output of hidden layers.
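As a rough illustration of the quantile suggestion, the NumPy sketch below fits the 5%/95% range on the noisy images and applies the same range to the ground truth. The helper names `fit_percentile_range` and `apply_range` are made up for this example, and the synthetic data (including the single outlier pixel) is only there to show the effect.

```python
import numpy as np

def fit_percentile_range(images, low=5.0, high=95.0):
    """Return the low/high percentiles over all pixels of the given images."""
    lo, hi = np.percentile(images, [low, high])
    return float(lo), float(hi)

def apply_range(images, lo, hi):
    """Map lo to 0 and hi to 1; values outside the range fall outside [0, 1]."""
    return (images - lo) / (hi - lo)

rng = np.random.default_rng(0)
clean = rng.random((8, 16, 16))                     # synthetic ground truth
noisy = clean + rng.normal(0.0, 0.1, clean.shape)   # synthetic noisy input
noisy[0, 0, 0] = 50.0                               # one outlier pixel

# Fit the range on the noisy images, then reuse it for the ground truth.
lo, hi = fit_percentile_range(noisy)
noisy_scaled = apply_range(noisy, lo, hi)
clean_scaled = apply_range(clean, lo, hi)

# Plain MinMax for comparison: the outlier squashes everything else towards 0.
minmax_scaled = (noisy - noisy.min()) / (noisy.max() - noisy.min())

print(np.median(noisy_scaled), np.median(minmax_scaled))
```

Because scaled values can fall outside [0, 1] here, this pairs naturally with a linear output layer rather than a sigmoid one.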

CNN Regression on Grid - Limitation of Convolutional Neural Networks?

I'm working on a (high energy physics related) problem using CNNs.
For understanding the problem, let's consider these examples here.
The left-hand side is the input to the CNN, the right-hand side the desired output. So the network is supposed to cluster the input. The actual algorithm behind this clustering (i.e. how we got the desired output for training) is really complex and we want the CNN to learn this.
I've tried different CNN architectures, for example one similar to the U-net architecture (https://arxiv.org/abs/1505.04597) but also various concatenations of convolutional layers, etc.
The outputs are always really similar (for all architectures).
Here you can see some CNN predictions.
In principle the network is performing quite well, but as you can see, in most cases the CNN output consists of several filled pixels that are directly next to each other, which will never (!) happen in the true cases.
I've been using mean squared error as the loss function in all of the networks.
Do you have any suggestions for how one could avoid this problem and improve the network's performance?
Or is this a general limitation of CNNs, so that in practice it is not possible to solve such a problem using CNNs?
Thank you very much!
My suggestion would be to split up the work. First use a U-Shaped NN to find the activations in a binary segmentation task (like in your paper) and then regress on the found activations to find their final values. In my experience this works way better than doing regression on large images, because the MSE will result in blurry outputs, as you have observed.
The CNN does not know that you want a sharp result. As mentioned by @Thomas, MSE tends to give you blurry results; that is the nature of the loss function. A blurry result does not incur a large MSE loss.
An easy modification would be to use L1 loss (absolute difference instead of squared error). Its gradient has constant magnitude, unlike MSE, whose gradient shrinks as the error shrinks.
If you really wanted a sharp result, it would be easier to add a manual step -- non maximum suppression (NMS). In practice, a 3x3 box-max filter might do.
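A minimal sketch of that post-processing step, assuming the prediction is a 2-D float array: a pixel is kept only if it equals the maximum of its 3x3 neighbourhood. The function name `local_max_suppress` and the toy prediction are hypothetical. Note that tied neighbours (a flat plateau) would both survive in this simple version.

```python
import numpy as np

def local_max_suppress(pred):
    """Keep a pixel only if it equals the maximum of its 3x3 neighbourhood."""
    h, w = pred.shape
    padded = np.pad(pred, 1, mode="constant", constant_values=-np.inf)
    # Stack the nine shifted views and take the per-pixel maximum (3x3 max filter).
    neighbourhood_max = np.max(
        [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)], axis=0
    )
    return np.where(pred == neighbourhood_max, pred, 0.0)

# Toy CNN output with a blurred blob around (1, 1) and an isolated hit at (3, 3).
pred = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.4, 0.0],
    [0.0, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7],
])
sharp = local_max_suppress(pred)
print(sharp)
```

In this toy case the 0.4 and 0.3 neighbours of the 0.9 peak are zeroed, so no two filled pixels end up directly next to each other.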

How to fit a classifier with high accuracy on the training set with low features?

I have input (r,c) in range (0, 1] as the coordinate of a pixel of an image and its color 1 or 2 only.
I have about 6,400 pixels.
My attempt at fitting X=(r,c) and y=color was a failure: the accuracy won't go higher than 70%.
Here's the image:
The first is the actual image; the second is the image I train on, which has only 2 colors. The last is the image that the neural network generated, with about 500 weights, trained for 50 iterations. The input layer has 2 units, there is one hidden layer of size 100, and the output layer has 2 units. (For binary classification like this I may need only one output unit, but I am preparing for multi-class classification.)
The classifier failed to fit the training set, why is that? I tried generating high polynomial terms of those 2 features but it doesn't help. I tried using Gaussian kernel and random 20-100 landmarks on the picture to add more features, also got similar output. I tried using logistic regressions, doesn't help.
Please help me increase the accuracy.
Here's the input: input.txt (you can load it into Octave; the variables are coordinate (the r,c features) and idx (the color)).
You can try plotting it first to make sure that you understand the input then try training on it and tell me if you get better result.
Your problem is hard to model. You are trying to fit a function from R^2 to R which has lots of complexity: lots of "spikes" and lots of discontinuous regions (pixels that are completely separated from the rest). This is not an easy problem, and not a useful one. In order to overfit your network to such a setting you will need plenty of hidden units. So what are the options for doing so?
General things that are missing in the question, and are important:
Your output variable should be {0, 1} if you are fitting your network through cross entropy cost (log likelihood), which you should use for classification.
50 iterations (if you are talking about mini-batch iterations) is orders of magnitude too small, unless you mean 50 epochs (iterations over the whole training set).
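For the first point, a small sketch of what that looks like in NumPy: the colors {1, 2} are shifted to {0, 1} and scored with binary cross entropy. The array `idx` mirrors the variable name from the question, but the values and the predicted probabilities `p` are made up for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean negative log likelihood for labels in {0, 1}."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

idx = np.array([1, 2, 2, 1])          # colors as stored in the question: 1 or 2
y = idx - 1                           # remapped to {0, 1}
p = np.array([0.1, 0.9, 0.8, 0.2])    # hypothetical predicted P(y = 1)

loss = binary_cross_entropy(y, p)
print(loss)
```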
Actual things, that will probably need to be done (at least one of the below):
I assume that you are using ReLU activations (or Tanh, hard to say from looking at the output). You can instead use RBF activations and increase the number of hidden neurons to ~5000.
If you do not want to go with RBFs, then you will need 1-2 additional hidden layers to fit a function of this complexity. Try an architecture of type 100-100-100 instead.
If the above fails - increase number of hidden units, that's all you need - enough capacity.
In general: neural networks are not designed for working with low-dimensional datasets. Learning a pixel-position-to-color mapping is a nice example from the web, but it is completely artificial and seems to actually harm people's intuitions.

Pre-processing data: Normalizing data labels in regression?

Recently I was told that the labels of regression data should also be normalized for better results, but I am pretty doubtful of that. I have never tried normalizing labels in either regression or classification, which is why I don't know whether that statement is true. Can you please give me a clear explanation (mathematical or from experience) of this problem?
Thank you so much.
Any help would be appreciated.
When you say "normalize" labels, it is not clear what you mean (i.e. whether you mean this in a statistical sense or something else). Can you please provide an example?
On Making labels uniform in data analysis
If you are trying to neaten labels for use with the text() function, you could try the abbreviate() function to shorten them, or the format() function to align them better.
The pretty() function works well for rounding labels on plot axes. For instance, the base function hist() for drawing histograms calls on Sturges or other algorithms and then uses pretty() to choose nice bin sizes.
The scale() function will standardize values by subtracting their mean and dividing by the standard deviation, which in some circles is referred to as normalization.
On the reasons for scaling in regression (in response to comment by questor). Suppose you regress Y on covariates X1, X2, ... The reasons for scaling covariates Xk depend on the context. It can enable comparison of the coefficients (effect sizes) of each covariate. It can help ensure numerical accuracy (these days not usually an issue unless covariates are on hugely different scales and/or the data is big). For a readable intro see Psychosomatic medicine editors' guide. For a mathematically intense discussion see Sylvain Sardy's guide.
In particular, in Bayesian regression, rescaling is advisable to ensure convergence of MCMC estimation; e.g. see this discussion.
You mean features not labels.
It is not necessary to normalize your features for regression or classification, even though in some cases it is a trick that can help the model converge faster. You might want to check this post.
In my experience, when using a simple model like a linear regression with only a few variables, keeping the features as they are (without normalization) is preferable since the model is more interpretable.
It may be that what you mean is that you should scale your labels. The reason is so convergence is faster, and you don't get numeric instability.
For example, if your labels are in the range (1000, 1000000) and the weights are initialized close to zero, an MSE loss would be so large that you'd likely get NaN errors.
See https://datascience.stackexchange.com/q/22776/38707 for a similar discussion.
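A quick NumPy sketch of the magnitude involved, using synthetic labels drawn from that range and assuming the untrained model predicts roughly zero. The helper `to_original_units` is a hypothetical name. It shows how predictions made in the scaled space would be mapped back.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(1_000, 1_000_000, size=1_000)  # synthetic labels in the stated range
initial_pred = np.zeros_like(y)                # weights near zero: predictions near zero

raw_mse = np.mean((y - initial_pred) ** 2)     # on the order of 1e11

# Standardize the labels (z-score) before training.
y_mean, y_std = y.mean(), y.std()
y_scaled = (y - y_mean) / y_std
scaled_mse = np.mean((y_scaled - initial_pred) ** 2)  # equals 1 up to float rounding

def to_original_units(pred_scaled):
    """Undo the label scaling on model predictions."""
    return pred_scaled * y_std + y_mean

print(raw_mse, scaled_mse)
```

The gradients scale with these losses, which is why the unscaled case can blow up with a learning rate that would be fine for scaled labels.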
For a regression problem, with algorithms including decision trees, logistic regression and linear regression, I tested two modes: (1) with label scaling using MinMaxScaler and (2) without label scaling. The result I got was that the R2 score is the same in both modes, while MSE and MAE scale with the labels.
For the diabetes dataset using linear regression, the results before and after scaling are:
without scaling:
Mean Squared Error: 3424.3166
Mean Absolute Error: 46.1742
R2_score : 0.33
after scaling labels:
Mean Squared Error: 0.0332
Mean Absolute Error: 0.1438
R2_score : 0.33
The following link may also be useful; it says that scaling can help with faster convergence: scale or not scale labels in deep learning?
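The pattern in those numbers is what you would expect for least squares with an intercept: min-max scaling the labels by a range `span` divides MAE by `span` and MSE by `span**2`, and leaves R2 unchanged. (The figures above are consistent with a label range of about 321.) Here is a sketch on synthetic data, not the diabetes dataset, with hand-written helpers `fit_predict` and `r2_score` standing in for a library model and metric.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_predict(X, y):
    """Ordinary least squares with an intercept; returns in-sample predictions."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 100.0 + X @ np.array([30.0, -20.0, 10.0]) + rng.normal(0.0, 40.0, size=200)

span = y.max() - y.min()
y_scaled = (y - y.min()) / span        # same transform as min-max scaling to [0, 1]

pred = fit_predict(X, y)
pred_scaled = fit_predict(X, y_scaled)

mse, mse_scaled = np.mean((y - pred) ** 2), np.mean((y_scaled - pred_scaled) ** 2)
mae, mae_scaled = np.mean(np.abs(y - pred)), np.mean(np.abs(y_scaled - pred_scaled))

print(r2_score(y, pred), r2_score(y_scaled, pred_scaled))
print(mse / mse_scaled, span ** 2)
```

So for a closed-form linear fit the scaling changes only the units of the error, not the quality of the fit. The convergence argument applies to iteratively trained models.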

Multilayer perceptron for ocr works with only some data sets

NEW DEVELOPMENT
I recently used OpenCV's MLP implementation to test whether it could solve the same tasks. OpenCV was able to classify the same data sets that my implementation was able to, but unable to solve the ones that mine could not. Maybe this is due to termination parameters (determining when to end training). I stopped before 100,000 iterations, and the MLP did not generalize. This time the network architecture was 400 input neurons, 10 hidden neurons, and 2 output neurons.
I have implemented the multilayer perceptron algorithm, and verified that it works with the XOR logic gate. For OCR I taught the network to correctly classify the letters "A" and "B" drawn with a thick drawing utensil (a marker). However, when I try to teach the network to classify letters drawn with a thin drawing utensil (a pencil), the network seems to become stuck in a valley and unable to classify the letters in a reasonable amount of time. The same goes for letters I drew with GIMP.
I know people say we have to use momentum to get out of the valley, but the sources I read were vague. I tried increasing a momentum value when the change in error was insignificant and decreasing when above, but it did not seem to help.
My network architecture is 400 input neurons (one for each pixel), 2 hidden layers with 25 neurons each, and 2 neurons in the output layer. The images are gray scale images and the inputs are -0.5 for a black pixel and 0.5 for a white pixel.
EDIT:
Currently the network is training until the calculated error for each training example falls below an accepted error constant. I have also tried stopping training at 10,000 epochs, but this yields bad predictions. The activation function used is the sigmoid logistic function. The error function I am using is the sum of the squared error.
I suppose I may have reached a local minimum rather than a valley, but this should not happen repeatedly.
Momentum is not always good: it can help the model jump out of a bad valley, but it may also make the model jump out of a good valley, especially when the previous weight-update directions were not good.
There are several reasons that make your model not work well.
The parameters are not well set; it is always a non-trivial task to set the parameters of an MLP.
An easy way is to first set the learning rate, momentum weight and regularization weight to large values, and to set the number of iterations (or epochs) to a very large number. Once the model diverges, halve the learning rate, momentum weight and regularization weight.
This approach lets the model converge slowly to a local optimum, and also gives it the chance to jump out of a bad valley.
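A toy sketch of the halving idea, under some simplifying assumptions of mine: only the learning rate is halved (no momentum or regularization), "diverged" is taken to mean that the loss became non-finite or went up, and the model is a one-dimensional quadratic rather than an MLP. The function name `train_with_halving` is hypothetical.

```python
import numpy as np

def train_with_halving(grad_fn, loss_fn, w0, lr, max_steps=200):
    """Gradient descent that halves the learning rate whenever a step makes things worse."""
    w, best_loss = w0, loss_fn(w0)
    for _ in range(max_steps):
        candidate = w - lr * grad_fn(w)
        loss = loss_fn(candidate)
        if not np.isfinite(loss) or loss > best_loss:
            lr *= 0.5          # treated as divergence: halve and retry from the last good point
            continue
        w, best_loss = candidate, loss
    return w, lr

# Toy problem: minimise w**2, starting with a learning rate that is far too large.
w, lr = train_with_halving(
    grad_fn=lambda w: 2.0 * w,
    loss_fn=lambda w: w ** 2,
    w0=5.0,
    lr=3.0,
)
print(w, lr)
```

Here the first two steps overshoot, the rate is halved twice to 0.75, and from then on every step is accepted.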
Moreover, in my opinion, one output neuron is enough for a two-class problem. There is no need to increase the complexity of the model if it is not necessary. Similarly, if possible, use a three-layer MLP instead of a four-layer MLP.
