I am trying to fine-tune a conv-net. It has the following structure (adapted from OverFeat):
net:add(SpatialConvolution(3, 96, 7, 7, 2, 2))
net:add(nn.ReLU(true))
net:add(SpatialMaxPooling(3, 3, 3, 3))
net:add(SpatialConvolutionMM(96, 256, 7, 7, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialMaxPooling(2, 2, 2, 2))
net:add(SpatialConvolutionMM(256, 512, 3, 3, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(512, 512, 3, 3, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(512, 1024, 3, 3, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(1024, 1024, 3, 3, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialMaxPooling(3, 3, 3, 3))
net:add(SpatialConvolutionMM(1024, 4096, 5, 5, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(4096, 4096, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(4096, total_classes, 1, 1, 1, 1))
net:add(nn.View(total_classes))
net:add(nn.LogSoftMax())
And I'm using SGD as the optimization method with the following parameters:
optimState = {
  learningRate = 1e-3,
  weightDecay = 0,
  momentum = 0,
  learningRateDecay = 1e-7
}
optimMethod = optim.sgd
I am training it as follows:
optimMethod(feval, parameters, optimState)
where:
-- 'feval' is the function with the forward and backward passes on the current batch
parameters,gradParameters = net:getParameters()
From my references, I have learned that when fine-tuning a pre-trained network, the lower (convolutional) layers should have lower learning rates and the higher layers relatively higher learning rates.
I referred to torch7's documentation of optim/sgd to set different learning rates for each layer. From there, I gather that I can achieve what I want by setting config.learningRates, i.e. a vector of individual learning rates. I am new to Torch, so please pardon me if this seems like a silly question, but it would be really helpful if someone could explain to me how and where to create and use this vector to serve my purpose.
Thanks in advance.
I don't know if you still need an answer, as you posted this question one year ago.
Anyway, just in case someone sees this, I've written a post here about how to set different learning rates for different layers in torch.
The solution is to use net:parameters() instead of net:getParameters(). Instead of returning two tensors, it returns two tables of tensors, containing the parameters (and the gradParameters) for each layer in separate tensors.
In this way, you can run an sgd() step (with a different learning rate) for each layer. You can find the full code by clicking the above link.
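To illustrate the idea outside Torch, here is a minimal plain-Python sketch of a per-layer SGD step. The helper name `sgd_step_per_layer`, the toy parameter values, and the learning rates are assumptions made up for this illustration; the two lists only mirror the per-layer tables that net:parameters() is described as returning above.

```python
def sgd_step_per_layer(params, grads, learning_rates):
    """Apply one plain SGD update, using a separate learning rate per layer.

    params and grads are lists with one entry per layer (each entry a flat
    list of floats), mirroring the per-layer tables described above.
    """
    if not (len(params) == len(grads) == len(learning_rates)):
        raise ValueError("need one gradient list and one learning rate per layer")
    for layer_params, layer_grads, lr in zip(params, grads, learning_rates):
        for i, g in enumerate(layer_grads):
            layer_params[i] -= lr * g  # update in place, as an optimizer would
    return params


# Hypothetical two-layer example: a small rate for the pre-trained lower
# layer and a larger rate for the freshly initialised top layer.
params = [[1.0, 2.0], [3.0]]
grads = [[10.0, 10.0], [10.0]]
learning_rates = [0.001, 0.1]
sgd_step_per_layer(params, grads, learning_rates)
print(params)
```

With the same gradient everywhere, the lower layer moves by 0.01 while the top layer moves by 1.0, which is the behaviour wanted for fine-tuning.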
I am a newbie in machine learning and am learning the basic concepts of regression. My confusion is best explained with an example of input samples and their target values. (Please note that this example is a general case; I observed the performance and predicted values on a large custom dataset of images. Also note that the target values are not floats.) For example, I have:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]
and
xtest = [13, 14, 15, 16]
ytest = [25, 25, 35, 35]
As you can see, every three consecutive samples (every two in the test set) share the same target value. Suppose I have a multi-layer perceptron network with one Flatten() and two Dense() layers. After training, the network predicts the same target value for all test samples:
yPredicted = [40, 40, 40, 40]
Because the predicted values are all the same, the correlation between ytest and yPredicted is undefined: it returns null and gives an error.
But when I have:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [332, 433, 456, 675, 234, 879, 242, 634, 789, 432, 897, 982]
And:
xtest = [13, 14, 15, 16]
ytest = [985, 341, 354, 326]
The predicted values are:
yPredicted = [987, 345, 435, 232]
which gives very good correlations.
My question is: what is it in a machine learning algorithm that makes learning better when each input has a different target value? Why does the network not work when target values are repeated for a large number of inputs?
Why the network does not work when having repeated values for a large number of inputs?
Most certainly, this is not the reason why your network does not perform well in the first dataset shown.
(You have not provided any code, so inevitably this will be a qualitative answer)
Looking closely at your first dataset:
xtrain = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ytrain = [10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40]
it's not difficult to conclude that we have a monotonic (increasing) function y(x) (it is not strictly monotonic, but it is monotonic nevertheless over the whole x range provided).
Given that, your model has absolutely no way of "knowing" that, for x > 12, the qualitative nature of the function changes significantly (and rather abruptly), as apparent from your test set:
xtest = [13, 14, 15, 16]
ytest = [25, 25, 35, 35]
and you should not expect it to know or "guess" it in any way (despite what many people may seem to believe, NN are not magic).
Looking closely at your second dataset, you will realize that this is not the case with it, hence the network is unsurprisingly able to perform better here; when doing such experiments, it is very important to be sure that we are comparing apples to apples, and not apples to oranges.
Another general issue with your attempts here and your question is the following: neural nets are not good at extrapolation, i.e. predicting such numerical functions outside the numeric domain on which they have been trained. For details, please see my own answer at Is deep learning bad at fitting simple non linear functions outside training scope?
A last unusual thing here is your use of correlation; I am not sure why you chose to do this, but you may be interested to know that, in practice, we do not normally assess model performance using a correlation measure between predicted outcomes and ground truth. For regression problems such as yours, we use measures such as the mean squared error (MSE) instead.
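As a small illustration of the last point, the sketch below computes both measures for the first test set in the question. The helper names are made up for this example, and the correlation helper returns NaN by choice when one series is constant, because the Pearson formula then divides by zero.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_correlation(a, b):
    """Pearson correlation; returns NaN when either input has zero variance."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return math.nan  # correlation is undefined for a constant series
    return cov / math.sqrt(var_a * var_b)


y_test = [25, 25, 35, 35]
y_predicted = [40, 40, 40, 40]

print(mean_squared_error(y_test, y_predicted))   # 125.0
print(pearson_correlation(y_test, y_predicted))  # nan
```

The MSE stays well defined and informative even when every prediction is identical, whereas the correlation cannot be computed at all.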
Given a very large table of the following format (snippet):
Subject, Condition, VPH, Task, Round, Item, Decision, Self, Other, RT
1, 1, 1, SVO, 0, 0, 4, 2.5, 2.0, 8.598
1, 1, 1, SVO, 1, 5, 3, 4.1, 3.4, 7.785
1, 1, 1, SVO, 2, 4, 3, 3.2, 3.4, 15.713
2, 2, 1, SVO, 0, 0, 4, 2.5, 2.0, 15.439
2, 2, 1, SVO, 1, 2, 7, 4.9, 2.3, 30.777
2, 2, 1, SVO, 2, 3, 8, 4.3, 4.3, 13.549
3, 3, 1, SVO, 0, 0, 5, 2.8, 1.5, 9.066
... (And so on)
Needed: Compute the mean over all rounds for self and others for each subject.
What I have so far:
I sorted the roughly 100 MB .txt file using bash sort so that each subject's rounds appear consecutively (as the example shows). After that I imported the .txt file into SPSS 24. Right now I have no idea how to write a function that computes, for each subject, the mean of the variables Self and Other over the three rounds. E.g. (some pseudo-code):
for n = 1 to last_subject do:
    get rows of Self where line_subject equals n
    compute mean over this content
    write result as new variable self_mean after variable RT at line n
    increase n by one
As I am totally new to SPSS, I would really appreciate detailed help. I would also be happy with references that specifically address computation over rows (I found lots of material on computation over columns).
Thank you very much!
Edit: example output
After computing the table should look like this:
Subject, Mean_Self, Mean_Others
1, 3.27, 2.9
2, ..., ...
3,
... (And so on)
So the Mean_Self for the example at the top is computed like so:
mean(2.5, 4.1, 3.2)
where:
2.5 was used from line 1 of variable Self
4.1 was used from line 2 of variable Self
3.2 was used from line 3 of variable Self
2.5 from line 4 of variable Self was not used because the variable Subject changed; therefore we want to repeat the process with the new Subject (here 2) until it changes again. The results should form a table like the one above. The same procedure applies to the variable Other.
If I understand correctly, what you need is the AGGREGATE command. AGGREGATE can create a new dataset/file with your aggregated data, or add the aggregated data to your active dataset, as you described above:
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=Subject
/Self_mean=MEAN(Self)
/Other_mean=MEAN(Other).
To get the new variables in a new, separate table, look up the other AGGREGATE options: for example, /OUTFILE=* (with MODE=ADDVARIABLES removed) will make the aggregated data replace the original file in the window, while /OUTFILE="path/filename" will save the aggregated data to a file.
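For readers who want to see the same group-then-average logic outside SPSS, here is a minimal Python sketch using only the standard library. It is not SPSS syntax and not part of the AGGREGATE command; the function name `subject_means` and the embedded sample text (the first rows of the table in the question) are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict

# The first rows of the table from the question, used as sample input.
SAMPLE = """Subject, Condition, VPH, Task, Round, Item, Decision, Self, Other, RT
1, 1, 1, SVO, 0, 0, 4, 2.5, 2.0, 8.598
1, 1, 1, SVO, 1, 5, 3, 4.1, 3.4, 7.785
1, 1, 1, SVO, 2, 4, 3, 3.2, 3.4, 15.713
2, 2, 1, SVO, 0, 0, 4, 2.5, 2.0, 15.439
2, 2, 1, SVO, 1, 2, 7, 4.9, 2.3, 30.777
2, 2, 1, SVO, 2, 3, 8, 4.3, 4.3, 13.549
"""


def subject_means(text):
    """Return {subject: (mean_self, mean_other)} for comma-separated text."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    # Per subject: running sum of Self, running sum of Other, row count.
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for row in reader:
        entry = totals[row["Subject"]]
        entry[0] += float(row["Self"])
        entry[1] += float(row["Other"])
        entry[2] += 1
    return {s: (a / n, b / n) for s, (a, b, n) in totals.items()}


for subject, (mean_self, mean_other) in subject_means(SAMPLE).items():
    print(subject, round(mean_self, 2), round(mean_other, 2))
```

Because the sums are keyed by subject, this approach does not rely on the file being sorted first.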
While studying the Deep MNIST for Experts tutorial, I ran into several difficulties.
I'd like to know why they used convolution and pooling in a multilayer convolutional network.
I also don't understand the following two functions.
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')
I'd like to know the meaning of strides=[1, 1, 1, 1] in the conv2d function.
Should we always use ksize=[1, 2, 2, 1] and strides=[1, 2, 2, 1] in the max_pool_2x2 function?
What is the difference between padding='SAME' and padding='VALID'?
I would suggest checking the following answer. It has a wonderful explanation of the whole convolution operation, and it should cover your question about conv2d.
For max pooling:
ksize: this is the kernel size, i.e. the size of the window for each dimension of the input tensor. You can change it according to your needs; for example, the AlexNet paper uses ksize=[1, 3, 3, 1].
strides: the filter is applied to image patches of the same size as the filter, stepping according to the strides argument. strides=[1, 2, 2, 1] applies the filter to every other image patch in each dimension, and so on.
The difference between the padding options is explained well in this post.
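As a rough sketch of how the two padding modes affect the output size along one spatial dimension, the helper below follows the formulas given in TensorFlow's documentation for 'SAME' and 'VALID'. The function name `output_size` and the example sizes are assumptions for illustration (28 is the MNIST image width).

```python
import math


def output_size(input_size, window, stride, padding):
    """Output length along one spatial dimension for TensorFlow-style padding."""
    if padding == "SAME":
        # The input is padded as needed, so the window size does not matter.
        return math.ceil(input_size / stride)
    if padding == "VALID":
        # No padding: only windows that fit entirely inside the input count.
        return math.ceil((input_size - window + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")


# A 28-pixel-wide input with a 2-wide window and stride 2.
print(output_size(28, 2, 2, "SAME"))   # 14
print(output_size(28, 2, 2, "VALID"))  # 14
# The two modes differ when the window does not tile the input exactly.
print(output_size(7, 2, 2, "SAME"))    # 4
print(output_size(7, 2, 2, "VALID"))   # 3
```

With stride 1, 'SAME' keeps the spatial size unchanged, which is why the tutorial's conv2d leaves a 28x28 image at 28x28 while each max_pool_2x2 halves it.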
I have a CNN trained on images (cropped faces) of Mark Ruffalo. For my positive class I have around 200 images, and for the negative data points I have sampled 200 random faces.
The model has high recall but very low precision. How could I increase the precision? I am also constrained by the number of positive images that I have. I am ready to sacrifice some recall in this tradeoff.
I have tried increasing the number of negative samples, but that introduces a form of bias and the model starts classifying everything as negative, settling into a local optimum.
I have based my CNN upon overfeat:
local features = nn.Sequential()
features:add(nn.SpatialConvolutionMM(3, 96, 11, 11))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
features:add(nn.SpatialConvolutionMM(96, 256, 5, 5))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
features:add(nn.SpatialConvolutionMM(256, 512, 3, 3))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- 24x24x512
features:add(nn.SpatialConvolutionMM(512, 1024, 3, 3))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
--11x11x1024
features:add(nn.SpatialConvolutionMM(1024, 1024, 3, 3))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- 1.3. Create Classifier (fully connected layers)
local classifier = nn.Sequential()
classifier:add(nn.View(1024*4*4))
classifier:add(nn.Dropout(0.5))
classifier:add(nn.Linear(1024*4*4, 3072))
classifier:add(nn.Threshold(0, 1e-6))
classifier:add(nn.Dropout(0.5))
classifier:add(nn.Linear(3072, 4096))
classifier:add(nn.Threshold(0, 1e-6))
classifier:add(nn.Linear(4096, noutputs))
model = nn.Sequential():add(features):add(classifier)
Kindly Help
Try working with the raw output of the CNN instead of taking the sign() of the output node (since there is a positive and a negative class, I assume there is a single output in the range [-1, 1]).
For instance, for one sample the output could be [0.9], indicating that the positive class should be picked. But if you experiment with these values, you can hopefully find a specific threshold that gives you the precision you need. In other words, if you find that anything greater than [-0.35] should be chosen as the positive class because it gives you better precision, then -0.35 should be your threshold value.
This is where ROC analysis comes in handy.
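A minimal sketch of such a threshold sweep is shown below. The scores and labels are made up for illustration, the helper name `precision_recall` is hypothetical, and returning a precision of 1.0 when nothing is predicted positive is just one possible convention.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when scores above `threshold` count as positive."""
    predicted = [s > threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # nothing predicted positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up raw outputs in [-1, 1] and their true labels (1 = positive class).
scores = [0.9, 0.6, 0.4, 0.1, -0.2, -0.5, -0.8]
labels = [1, 1, 0, 1, 0, 0, 0]

for threshold in (-0.35, 0.0, 0.5):
    precision, recall = precision_recall(scores, labels, threshold)
    print(threshold, round(precision, 2), round(recall, 2))
```

On this toy data, raising the threshold from -0.35 to 0.5 lifts precision from 0.6 to 1.0 while recall drops from 1.0 to about 0.67, which is the tradeoff the question is willing to accept.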
Let me know if this helps.
I know that a Gaussian Process model is best suited for regression rather than classification. However, I would still like to apply a Gaussian Process to a classification task but I am not sure what is the best way to bin the predictions generated by the model. I have reviewed the Gaussian Process classification example that is available on the scikit-learn website at:
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
But I found this example confusing (I have listed the things I found confusing about it at the end of the question). To try to get a better understanding, I have created a very basic Python code example using scikit-learn that generates classifications by applying a decision boundary to the predictions made by a Gaussian process:
# A minimal example illustrating how to use a
# Gaussian process for binary classification
import numpy as np
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.gaussian_process import GaussianProcess

if __name__ == "__main__":
    # Define some basic training and test data.
    # If the descriptive features have large values
    # (i.e., 8s and 9s) the target is 1.
    # If the descriptive features have small values
    # (i.e., 2s and 3s) the target is 0.
    TRAININPUTS = np.array([[8, 9, 9, 9, 9],
                            [9, 8, 9, 9, 9],
                            [9, 9, 8, 9, 9],
                            [9, 9, 9, 8, 9],
                            [9, 9, 9, 9, 8],
                            [2, 3, 3, 3, 3],
                            [3, 2, 3, 3, 3],
                            [3, 3, 2, 3, 3],
                            [3, 3, 3, 2, 3],
                            [3, 3, 3, 3, 2]])
    TRAINTARGETS = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
    TESTINPUTS = np.array([[8, 8, 9, 9, 9],
                           [9, 9, 8, 8, 9],
                           [3, 3, 3, 3, 3],
                           [3, 2, 3, 2, 3],
                           [3, 2, 2, 3, 2],
                           [2, 2, 2, 2, 2]])
    TESTTARGETS = np.array([1, 1, 0, 0, 0, 0])
    DECISIONBOUNDARY = 0.5

    # Fit a Gaussian process model to the data
    gp = GaussianProcess(theta0=10e-1, random_start=100)
    gp.fit(TRAININPUTS, TRAINTARGETS)

    # Generate a set of predictions for the test data
    y_pred = gp.predict(TESTINPUTS)
    print("Predicted Values:")
    print(y_pred)
    print("----------------")

    # Convert the continuous predictions into classes
    # by splitting on a decision boundary of 0.5
    predictions = []
    for y in y_pred:
        if y > DECISIONBOUNDARY:
            predictions.append(1)
        else:
            predictions.append(0)
    print("Binned Predictions (decision boundary = 0.5):")
    print(predictions)
    print("----------------")

    # Print the confusion matrix, specifying 1 as the positive class
    cm = confusion_matrix(TESTTARGETS, predictions, labels=[1, 0])
    print("Confusion Matrix (1 as positive class):")
    print(cm)
    print("----------------")
    print("Classification Report:")
    print(metrics.classification_report(TESTTARGETS, predictions))
When I run this code I get the following output:
Predicted Values:
[ 0.96914832 0.96914832 -0.03172673 0.03085167 0.06066993 0.11677634]
----------------
Binned Predictions (decision boundary = 0.5):
[1, 1, 0, 0, 0, 0]
----------------
Confusion Matrix (1 as positive class):
[[2 0]
[0 4]]
----------------
Classification Report:
             precision    recall  f1-score   support
          0       1.00      1.00      1.00         4
          1       1.00      1.00      1.00         2
avg / total       1.00      1.00      1.00         6
The approach used in this basic example seems to work fine with this simple dataset. But this approach is very different from the classification example given on the scikit-learn website that I mentioned above (URL repeated here):
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
So I'm wondering if I am missing something here. I would appreciate it if anyone could:
1. With respect to the classification example given on the scikit-learn website:
1.1 explain what the probabilities generated in this example are probabilities of. Are they the probability of the query instance belonging to the class >0?
1.2 explain why the example uses a cumulative distribution function instead of a probability density function.
1.3 explain why the example divides the predictions made by the model by the square root of the mean squared error before they are input into the cumulative distribution function.
2. With respect to the basic code example I have listed here, clarify whether or not applying a simple decision boundary to the predictions generated by a Gaussian process model is an appropriate way to do binary classification.
Sorry for such a long question and thanks for any help.
In the GP classifier, a standard GP distribution over functions is "squashed", usually with the standard normal CDF (the link used in probit models), to map it to a distribution over binary categories.
Another interpretation of this process is through a hierarchical model (this paper has the derivation), with a hidden variable drawn from a Gaussian Process.
In sklearn's gp library, it looks like the outputs of y_pred, MSE = gp.predict(xx, eval_MSE=True) are the (approximate) posterior means (y_pred) and posterior variances (MSE) evaluated at the points in xx, before any squashing occurs.
To obtain the probability that a point from the test set belongs to the positive class, you can convert the normal distribution over y_pred to a binary distribution by applying the Normal CDF (see [this paper again] for details).
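The hierarchical model behind the probit squashing function is defined with a decision boundary of 0 (the standard normal distribution is symmetric around 0, so PHI(0) = 0.5). So you should set DECISIONBOUNDARY = 0. Note that this presumes the two classes are coded symmetrically around 0 (for example -1 and +1); with the 0/1 targets used in the question, the targets would need recoding first.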
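The squashing step can be sketched in a few lines of plain Python. This is only an illustration of the formula, not scikit-learn's implementation: the function names and the example means and variances are hypothetical, the variance is assumed to be strictly positive, and the classes are assumed to be coded so that positive latent values mean the positive class.

```python
import math


def standard_normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def positive_class_probability(mean, variance):
    """P(latent value > 0) for a Gaussian with the given mean and variance."""
    # Dividing by the standard deviation standardises the latent value.
    return standard_normal_cdf(mean / math.sqrt(variance))


# Hypothetical posterior means and variances at three test points.
for mean, variance in [(-1.0, 0.25), (0.0, 0.25), (1.0, 0.25)]:
    p = positive_class_probability(mean, variance)
    print(mean, round(p, 4), 1 if p > 0.5 else 0)
```

Dividing the mean by the square root of the variance is what turns the prediction into a standard normal quantity, and a probability above 0.5 corresponds exactly to a posterior mean above 0.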