When I define a binaryproto mean file in a prototxt file, will Caffe subtract the mean from the input automatically (i.e., internally, in the background)? Or should I write my own implementation of the mean subtraction from the input?
When you define mean_file in the "Data" layer parameters, Caffe will subtract the mean values from the inputs. This is already implemented in Caffe for training and validation.
Note that when you deploy your network, you will need to subtract the mean manually.
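For the deploy phase, a minimal sketch of that manual mean subtraction using the Caffe Python interface could look like the following (mean.binaryproto, deploy.prototxt, weights.caffemodel and input.jpg are placeholder names, and the 0-255 scaling assumes the mean was computed on raw pixel values):

    import numpy as np
    import caffe

    # Load the binaryproto mean file and convert it to an ndarray of shape (C, H, W)
    blob = caffe.proto.caffe_pb2.BlobProto()
    with open('mean.binaryproto', 'rb') as f:
        blob.ParseFromString(f.read())
    mean = caffe.io.blobproto_to_array(blob)[0]

    # Subtract the mean from an input image before feeding the deploy net.
    # Assumes the net's input blob is named 'data' and matches the image/mean size.
    net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)
    image = caffe.io.load_image('input.jpg')      # H x W x C, float in [0, 1]
    image = image.transpose(2, 0, 1) * 255.0      # to C x H x W, 0-255 like the mean
    net.blobs['data'].data[...] = image - mean
    output = net.forward()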
I have a transformer (for rec.) and a (sequence) classification task with a lot of classes. Usually, for classification tasks, probabilities or logits are calculated for all classes and the maximum value is selected. However, I think there would be too many parameters in this case, so I was wondering whether the following approach is valid:
I want to predict 10 classes based on an input sequence over x possible classes:
- convert all classes to INTs
- build a transformer with mean pooling and a linear output layer of size 10 (then round all values), predicting the classes directly
I am not sure about this because the ordering of the class INTs has no meaning, so will the model be able to learn correct weights?
Maybe a basic question, but I couldn't find anything on it.
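A minimal sketch of the approach described above, in PyTorch (all names and sizes here are made up for illustration):

    import torch
    import torch.nn as nn

    # Class IDs as integer tokens -> transformer encoder -> mean pooling -> linear head
    num_input_classes, d_model = 500, 64   # "x No of classes" mapped to INTs
    num_outputs = 10                       # the 10 values to predict

    embed = nn.Embedding(num_input_classes, d_model)
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
        num_layers=2,
    )

    # The questioned approach: a linear layer of size 10 whose outputs are rounded
    # to integer class IDs, instead of one logit per possible class.
    head = nn.Linear(d_model, num_outputs)

    tokens = torch.randint(0, num_input_classes, (8, 16))  # batch of 8 sequences, length 16
    pooled = encoder(embed(tokens)).mean(dim=1)            # mean pooling over the sequence
    predicted_ids = head(pooled).round().long()            # rounding treats class IDs as ordered

The rounding step is exactly where the concern about the ordering of class INTs comes in: a regression head like this implicitly assumes that neighbouring IDs are "close", which is not true for arbitrary class labels.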
A module's parameters get changed during training; that is, they are what is learnt during the training of a neural network. But what is a buffer?
And is it learnt during neural network training?
The PyTorch documentation for the register_buffer() method reads:
This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the persistent state.
As you already observed, model parameters are learned and updated using SGD during the training process.
However, sometimes there are other quantities that are part of a model's "state" and should be
- saved as part of state_dict.
- moved to cuda() or cpu() with the rest of the model's parameters.
- cast to float/half/double with the rest of the model's parameters.
Registering these "arguments" as the model's buffers allows PyTorch to track them and save them like regular parameters, but prevents PyTorch from updating them via the SGD mechanism.
An example of a buffer can be found in the _BatchNorm module, where running_mean, running_var and num_batches_tracked are registered as buffers and updated by accumulating statistics of the data forwarded through the layer. This is in contrast to the weight and bias parameters, which learn an affine transformation of the data using regular SGD optimization.
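A small self-contained sketch of the same idea (the module and its update rule are hypothetical, loosely modelled on what _BatchNorm does):

    import torch
    import torch.nn as nn

    class RunningMeanTracker(nn.Module):
        def __init__(self, num_features):
            super().__init__()
            # Learnable parameter: updated by the optimizer during training.
            self.weight = nn.Parameter(torch.ones(num_features))
            # Buffer: saved in state_dict and moved with .to()/.cuda(),
            # but never touched by the optimizer.
            self.register_buffer("running_mean", torch.zeros(num_features))

        def forward(self, x):
            if self.training:
                # Custom update logic: accumulated from data rather than learned by SGD.
                self.running_mean = 0.9 * self.running_mean + 0.1 * x.mean(dim=0).detach()
            return (x - self.running_mean) * self.weight

    m = RunningMeanTracker(4)
    print([name for name, _ in m.named_parameters()])  # ['weight']
    print([name for name, _ in m.named_buffers()])      # ['running_mean']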
Both parameters and buffers are things you create for a module (nn.Module).
Say you have a linear layer nn.Linear. It already has weight and bias parameters. But if you need a new parameter, you use register_parameter() to register a new named parameter, which is a tensor.
When you register a new parameter it will appear inside the module.parameters() iterator, but when you register a buffer it will not.
The difference:
Buffers are named tensors that are not updated by gradients at every step, unlike parameters.
For buffers, you write your own update logic (it is fully up to you).
The good thing is that when you save the model, all parameters and buffers are saved, and when you move the model to or from CUDA, the parameters and buffers move with it as well.
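For example (a throwaway sketch; the names scale and step_count are made up):

    import torch
    import torch.nn as nn

    layer = nn.Linear(3, 2)
    layer.register_parameter("scale", nn.Parameter(torch.ones(2)))  # learnable, appears in parameters()
    layer.register_buffer("step_count", torch.zeros(1))             # saved, but not in parameters()

    print([n for n, _ in layer.named_parameters()])  # ['weight', 'bias', 'scale']
    print(list(layer.state_dict().keys()))           # ['weight', 'bias', 'scale', 'step_count']

    layer.to("cpu")  # parameters and buffers move together (likewise for .cuda())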
Can I compute the function f(x) = sqr(x) using an OpenCV ANN?
I need to train my ANN using a set of integers and their square values.
I need to get the squared value of an integer as the output from the ANN model.
If we can do that using an OpenCV ANN, what would be the number of input neurons and output neurons, and how do I specify the classes, etc.?
You mention class specification, but I don't think that this is a class categorization problem. I think it would be better to treat the input as X, and the output as sqr(X). Then this becomes a general function approximation problem.
There is an issue with this particular problem however. Neural networks aren't well suited for functions with unbounded input/output. The output of a neural network is usually limited to the range of its activation function, and the input value is usually scaled to some reasonable range. Assuming you are using the default activation (symmetrical sigmoid), your output is limited to (-1, 1). If you have a limited range of integers you want to use, you can still do this, but you'll have to scale the inputs and outputs accordingly.
If you use this method, there will be one input node, and one output node, corresponding to the scaled versions of X and sqr(X) respectively. OpenCV will try to take care of scaling for you automatically. It's probably best for you to trust this, UNLESS you are planning on providing multiple different sets of training data. The different sets may have different distributions, hence a different scale.
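Under those assumptions, a rough sketch with OpenCV's ANN_MLP in Python could look like this (the 1-10-1 topology, the 0-20 training range and the training parameters are guesses, not a recipe):

    import numpy as np
    import cv2

    # Training data: integers and their squares, over a limited range
    x = np.arange(0, 21, dtype=np.float32).reshape(-1, 1)   # 1 input node
    y = (x ** 2).astype(np.float32)                          # 1 output node

    ann = cv2.ml.ANN_MLP_create()
    ann.setLayerSizes(np.array([1, 10, 1], dtype=np.int32))  # 1 input, 10 hidden, 1 output
    ann.setActivationFunction(cv2.ml.ANN_MLP_SIGMOID_SYM, 1.0, 1.0)
    ann.setTrainMethod(cv2.ml.ANN_MLP_BACKPROP, 0.1, 0.1)
    ann.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER | cv2.TERM_CRITERIA_EPS, 10000, 1e-6))

    # By default OpenCV rescales the inputs and outputs to the activation's range
    ann.train(x, cv2.ml.ROW_SAMPLE, y)

    _, prediction = ann.predict(np.float32([[7.0]]))
    print(prediction)   # should come out roughly 49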
I have this 5-5-2 backpropagation neural network I'm training, and after reading this awesome article by LeCun I started to put into practice some of the ideas he suggests.
Currently I'm evaluating it with a 10-fold cross-validation algorithm I made myself, which goes basically like this:
for each epoch
    for each possible split (training, validation)
        train and validate
    end
    compute mean MSE between all k splits
end
My inputs and outputs are standardized (0-mean, variance 1) and I'm using a tanh activation function. All the network algorithms seem to work properly: I used the same implementation to approximate the sin function and it does so pretty well.
Now, the question is as the title implies: should I standardize each train/validation set separately or do I simply need to standardize the whole dataset once?
Note that if I do the latter, the network doesn't produce meaningful predictions, but I prefer having a more "theoretical" answer than just looking at the outputs.
By the way, I implemented it in C, but I'm also comfortable with C++.
You will most likely be better off standardizing each training set individually. The purpose of cross-validation is to get a sense for how well your algorithm generalizes. When you apply your network to new inputs, the inputs will not be ones that were used to compute your standardization parameters. If you standardize the entire data set at once, you are ignoring the possibility that a new input will fall outside the range of values over which you standardized.
So unless you plan to re-standardize every time you process a new input (which I'm guessing is unlikely), you should only compute the standardization parameters for the training set of the partition being evaluated. Furthermore, you should compute those parameters only on the training set of the partition, not the validation set (i.e., each of the 10-fold partitions will use 90% of the data to calculate standardization parameters).
So you assume the inputs are normally distributed, and you are subtracting the mean and dividing by the standard deviation to get N(0,1)-distributed inputs?
Yes, I agree with #bogatron that you standardize each training set separately, but I would say even more strongly that it's a "must" not to use the validation set data too. The problem is not values outside the range seen in the training set; that is fine, since the transformation to a standard normal is still defined for any value. Rather, you can't compute the mean / standard deviation over all the data, because you can't in any way use the validation data in training, even if only via this statistic.
It should further be emphasized that you use the mean from the training set with the validation set, not the mean from the validation set. It has to be the same transformation of features that was used during training. It would not be valid to transform the validation set differently.
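A minimal sketch of that per-fold scheme in Python/NumPy (the 10-fold split and the 5-feature random data here are placeholders for the actual network and dataset):

    import numpy as np

    def standardize_fold(train_x, val_x):
        # Compute mean/std on the training split only and apply the same
        # transformation to the validation split.
        mean = train_x.mean(axis=0)
        std = train_x.std(axis=0) + 1e-12   # guard against zero variance
        return (train_x - mean) / std, (val_x - mean) / std

    X = np.random.randn(100, 5)             # stand-in for the real inputs
    folds = np.array_split(np.random.permutation(len(X)), 10)
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), val_idx)
        train_std, val_std = standardize_fold(X[train_idx], X[val_idx])
        # ... train the 5-5-2 network on train_std, evaluate MSE on val_std ...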
I am trying to pre-process biological data to train a neural network, and despite an extensive search and repeated reading about the various normalization methods, I am none the wiser as to which method should be used when. In particular, I have a number of input variables which are positively skewed and have been trying to establish whether there is a normalization method that is most appropriate.
I was also worried about whether the nature of these inputs would affect the performance of the network, and as such I have experimented with data transformations (log transformation in particular). However, some inputs have many zeros but may also take small decimal values, and they seem to be highly affected by a log(x + 1) transformation (or adding any number from 1 down to 0.0000001, for that matter), with the resulting distribution failing to approach normal (it either remains skewed or becomes bimodal with a sharp peak at the minimum value).
Is any of this relevant to neural networks? I.e., should I be using specific feature transformation / normalization methods to account for the skewed data, or should I just ignore it, pick a normalization method and push ahead?
Any advice on the matter would be greatly appreciated!
Thanks!
As the features in your input vector are of a different nature, you should use a different normalization method for each feature. The network should be fed uniformly scaled data on every input for better performance.
As you wrote that some data is skewed, I suppose you can run some algorithm to "normalize" it. If applying a logarithm does not work, perhaps other functions and methods, such as rank transforms, can be tried out.
If the small decimal values occur entirely within a specific feature, then just normalize that feature in its own way, so that the values get transformed into your working range: either [0, 1] or [-1, +1], I suppose.
If some inputs have many zeros, consider removing them from the main neural network and creating an additional neural network which operates on vectors with non-zeroed features. Alternatively, you may try to run Principal Component Analysis (for example, via an autoassociative memory network with structure N-M-N, M < N) to reduce the input space dimension and so eliminate the zeroed components (they will still be taken into account somehow in the new, combined inputs). By the way, the new M inputs will be automatically normalized. Then you can pass the new vectors to your actual worker neural network.
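As a rough illustration of the transform options mentioned above (the feature values are invented, and the rank transform uses SciPy):

    import numpy as np
    from scipy import stats

    # A skewed feature with many zeros and small decimal values
    feature = np.array([0.0, 0.0, 0.003, 0.01, 0.0, 0.2, 1.5, 0.0, 4.7, 0.04])

    # Log transform: often stays skewed when values span many orders of magnitude
    log_feature = np.log1p(feature)

    # Rank transform: maps values to their ranks and rescales them to [0, 1];
    # tied values (all the zeros) share an average rank
    ranks = stats.rankdata(feature)
    rank_feature = (ranks - 1) / (len(ranks) - 1)

    print(log_feature)
    print(rank_feature)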
This is an interesting question. Normalization is meant to keep features' values in one scale to facilitate the optimization process.
I would suggest the following:
1- Check whether you need to normalize your data at all. If, for example, the means of the variables or features are within the same scale of values, you may proceed with no normalization. MSVMpack uses a normalization check condition of this kind in its SVM implementation. Even if you decide you do need normalization, you are still advised to also run the models on the data without normalization, for comparison.
2- If you know the actual maximum or minimum values of a feature, use them to normalize the feature. I think this kind of normalization would preserve the skewness of the values.
3- Try decimal scaling normalization with other features, if applicable.
Finally, you are still advised to apply different normalization techniques and compare the MSE for every technique, including the z-score, which may harm the skewness of your data.
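To make point 2 concrete, a small sketch (the feature values and the assumed theoretical bounds of 0 and 10 are invented):

    import numpy as np

    feature = np.array([0.0, 0.003, 0.01, 0.2, 1.5, 4.7])

    # Min-max scaling with known bounds: a linear map that preserves the
    # shape (and skew) of the data while bringing it into [0, 1]
    known_min, known_max = 0.0, 10.0
    minmax = (feature - known_min) / (known_max - known_min)

    # z-score scaling for comparison
    zscore = (feature - feature.mean()) / feature.std()

    print(minmax)
    print(zscore)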
I hope that I have answered your question and given some support.