While studying the Deep MNIST for Experts tutorial, I have run into several difficulties.
I'd like to know why they used convolution and pooling in a Multilayer Convolutional Network.
And I don't understand the following two functions.
import tensorflow as tf

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')
I'd like to know the meaning of strides=[1, 1, 1, 1] in the conv2d function.
Should we always use ksize=[1, 2, 2, 1] and strides=[1, 2, 2, 1] in the max_pool_2x2 function?
What is the difference between padding='SAME' and padding='VALID'?
I would say check the following answer; it has a wonderful explanation of the whole convolution operation. This should cover your question about conv2d.
For max pooling:
ksize: this is basically the kernel size, i.e. the size of the window for each dimension of the input tensor. You can change it according to your needs; for example, in the AlexNet paper they used ksize=[1, 3, 3, 1].
strides: the filter is applied to image patches of the same size as the filter and strided according to the strides argument. strides=[1, 2, 2, 1] applies the filter to every other image patch in each dimension, and so on.
The difference between the two padding modes is explained well in this post.
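To make the padding difference concrete, here is a minimal sketch (assuming a recent TensorFlow; the exact values don't matter, only the output shapes) that runs the same 2x2, stride-2 max pool in both modes on a 5x5 input:

import tensorflow as tf

x = tf.ones([1, 5, 5, 1])  # batch=1, height=5, width=5, channels=1

same = tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
valid = tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='VALID')

print(same.shape)   # (1, 3, 3, 1): SAME zero-pads so every input pixel is covered, ceil(5/2) = 3
print(valid.shape)  # (1, 2, 2, 1): VALID keeps only complete windows, floor(5/2) = 2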
I have some Keras code that I need to convert to PyTorch. I've done some research, but so far I am not able to reproduce the results I got from Keras. I have spent many hours on this; any tips or help is very appreciated.
Here is the Keras code I am dealing with. The input shape is (None, 105, 768), where None is the batch size, and I want to apply Conv1D to the input. The desired output in Keras is (None, 105).
x = tf.keras.layers.Dropout(0.2)(input)
x = tf.keras.layers.Conv1D(1,1)(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Activation('softmax')(x)
Here is what I've tried, but it gives worse results:
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1d = nn.Conv1d(768, 1, 1)
        self.dropout = nn.Dropout(0.2)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, input):
        x = self.dropout(input)
        x = x.view(x.shape[0], x.shape[2], x.shape[1])
        x = self.conv1d(x)
        x = torch.squeeze(x, 1)
        return self.softmax(x)
The culprit is your attempt to swap the dimensions of the input around, since Keras and PyTorch have different conventions for the dimension order.
x = x.view(x.shape[0],x.shape[2],x.shape[1])
.view() does not swap the dimensions; it changes which part of the underlying data belongs to a given dimension. You can think of the tensor as a flat 1D array, where .view() only decides how many elements each dimension covers. An example makes this much easier to understand.
# Let's start with a 1D tensor
# That's how the underlying data looks in memory.
x = torch.arange(6)
# => tensor([0, 1, 2, 3, 4, 5])
# How the tensor looks when using Keras' convention (expected input)
keras_version = x.view(2, 3)
# => tensor([[0, 1, 2],
# [3, 4, 5]])
# Vertical isn't swapped with horizontal, but the data is arranged differently
# The numbers are still incrementing from left to right
incorrect_pytorch_version = keras_version.view(3, 2)
# => tensor([[0, 1],
# [2, 3],
# [4, 5]])
To swap the dimensions you need to use torch.transpose.
correct_pytorch_version = keras_version.transpose(0, 1)
# => tensor([[0, 3],
# [1, 4],
# [2, 5]])
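Applied to the model in the question, a minimal sketch of the fix looks like this (the class name FixedNet and the random test input are my own additions):

import torch
import torch.nn as nn

class FixedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.dropout = nn.Dropout(0.2)
        self.conv1d = nn.Conv1d(768, 1, 1)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):          # x: (batch, 105, 768), as in Keras
        x = self.dropout(x)
        x = x.transpose(1, 2)      # (batch, 768, 105): a real swap, not a .view()
        x = self.conv1d(x)         # (batch, 1, 105)
        x = torch.squeeze(x, 1)    # (batch, 105), matching the Keras output
        return self.softmax(x)

out = FixedNet()(torch.randn(4, 105, 768))
print(out.shape)  # torch.Size([4, 105])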
For example (I can do this with Theano without a problem):
# log_var has shape --> (num, )
# Mean has shape --> (?, num)
std_var = T.repeat(T.exp(log_var)[None, :], Mean.shape[0], axis=0)
With TensorFlow I can do this:
std_var = tf.tile(tf.reshape(tf.exp(log_var), [1, -1]), (tf.shape(Mean)[0], 1))
But I don't know how to do the same in Keras; maybe like this:
std_var = K.repeat(K.reshape(K.exp(log_var), [1, -1]), Mean.get_shape()[0])
or
std_var = K.repeat_elements(K.exp(log_var), Mean.get_shape()[0], axis=0)
... because Mean has unknown dimension at axis 0.
I need this for a custom layer output:
return K.concatenate([Mean, Std], axis=1)
Keras has an abstraction layer, keras.backend, which you seem to have found already (you refer to it as K). This layer provides all the functions you will need, for both Theano and TensorFlow.
Say your TensorFlow code works, which is
std_var = tf.tile(tf.reshape(tf.exp(log_var), [1, -1]), (tf.shape(Mean)[0], 1))
then you can translate it to the abstract version by writing it like this:
std_var = K.tile(K.reshape(K.exp(log_var), (1, -1)), (K.shape(Mean)[0], 1))
Both Theano and TensorFlow support the unknown axis syntax (-1 for unknown axes) so this is not a problem.
On a side note, I am not sure whether your TF code is correct. You reshape to (1, -1), meaning that the dimension of axis 0 will be 1. I think what you actually want is this:
std_var = K.tile(K.reshape(K.exp(log_var), (-1, num)), (K.shape(Mean)[0], 1))
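As a quick sanity check, here is a hedged sketch (using tf.keras.backend as the K above; the concrete shapes are my own choice) showing that the tiled result picks up Mean's leading dimension:

import tensorflow as tf
from tensorflow.keras import backend as K

log_var = tf.zeros([5])   # shape (num,) with num = 5
Mean = tf.ones([3, 5])    # stand-in for a (batch, num) tensor

std_var = K.tile(K.reshape(K.exp(log_var), (1, -1)), (K.shape(Mean)[0], 1))
print(std_var.shape)      # (3, 5): one copy of exp(log_var) per row of Mean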
I am trying to fine-tune a conv-net. It has the following structure (adapted from OverFeat):
net:add(SpatialConvolution(3, 96, 7, 7, 2, 2))
net:add(nn.ReLU(true))
net:add(SpatialMaxPooling(3, 3, 3, 3))
net:add(SpatialConvolutionMM(96, 256, 7, 7, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialMaxPooling(2, 2, 2, 2))
net:add(SpatialConvolutionMM(256, 512, 3, 3, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(512, 512, 3, 3, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(512, 1024, 3, 3, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(1024, 1024, 3, 3, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialMaxPooling(3, 3, 3, 3))
net:add(SpatialConvolutionMM(1024, 4096, 5, 5, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(4096, 4096, 1, 1, 1, 1))
net:add(nn.ReLU(true))
net:add(SpatialConvolutionMM(4096, total_classes, 1, 1, 1, 1))
net:add(nn.View(total_classes))
net:add(nn.LogSoftMax())
And I'm using SGD as the optimization method with the following parameters:
optimState = {
    learningRate = 1e-3,
    weightDecay = 0,
    momentum = 0,
    learningRateDecay = 1e-7
}
optimMethod = optim.sgd
I am training it as follows:
optimMethod(feval, parameters, optimState)
where:
-- 'feval' is the function with the forward and backward passes on the current batch
parameters, gradParameters = net:getParameters()
From my references, I have learned that while fine-tuning a pre-trained network, it is recommended that the lower (convolutional) layers should have lower learning rates and the higher layers should have relatively higher learning rates.
I referred to torch7's documentation of optim/sgd to set different learning rates for each layer. From there, I gather that by setting config.learningRates, i.e. a vector of individual learning rates, I can achieve what I want. I am new to Torch, so please pardon me if this seems like a silly question, but it would be really helpful if someone could explain how and where to create and use this vector to serve my purpose?
Thanks in advance.
I don't know if you still need an answer, as you posted this question one year ago.
Anyway, just in case someone sees this, I've written a post here about how to set different learning rates for different layers in torch.
The solution is to use net:parameters() instead of net:getParameters(). Instead of returning two tensors, it returns two tables of tensors, containing the parameters (and the gradParameters) for each layer in separate tensors.
In this way, you can run an sgd() step (with a different learning rate) for each layer. You can find the full code by clicking the above link.
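For comparison (not the Torch7 API this question uses, but PyTorch appears elsewhere on this page), the same idea is expressed in modern PyTorch with per-parameter groups; a rough sketch with hypothetical layers:

import torch
import torch.nn as nn

# hypothetical two-stage net: a pre-trained conv layer plus a fresh classifier head
features = nn.Conv2d(3, 96, 7, stride=2)
classifier = nn.Linear(96, 10)

# one optimizer, but a smaller learning rate for the pre-trained lower layers
optimizer = torch.optim.SGD([
    {'params': features.parameters(), 'lr': 1e-4},  # pre-trained layers: small steps
    {'params': classifier.parameters()},            # new head: uses the default lr
], lr=1e-3, momentum=0.9)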
I know that a Gaussian Process model is best suited for regression rather than classification. However, I would still like to apply a Gaussian Process to a classification task but I am not sure what is the best way to bin the predictions generated by the model. I have reviewed the Gaussian Process classification example that is available on the scikit-learn website at:
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
But I found this example confusing (I have listed the things I found confusing about this example at the end of the question). To try to get a better understanding, I have created a very basic Python code example using scikit-learn that generates classifications by applying a decision boundary to the predictions made by a Gaussian Process:
# A minimal example illustrating how to use
# Gaussian Processes for binary classification
import numpy as np
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.gaussian_process import GaussianProcess

if __name__ == "__main__":
    # define some basic training and test data
    # If the descriptive features have large values
    # (i.e., 8s and 9s) the target is 1
    # If the descriptive features have small values
    # (i.e., 2s and 3s) the target is 0
    TRAININPUTS = np.array([[8, 9, 9, 9, 9],
                            [9, 8, 9, 9, 9],
                            [9, 9, 8, 9, 9],
                            [9, 9, 9, 8, 9],
                            [9, 9, 9, 9, 8],
                            [2, 3, 3, 3, 3],
                            [3, 2, 3, 3, 3],
                            [3, 3, 2, 3, 3],
                            [3, 3, 3, 2, 3],
                            [3, 3, 3, 3, 2]])
    TRAINTARGETS = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
    TESTINPUTS = np.array([[8, 8, 9, 9, 9],
                           [9, 9, 8, 8, 9],
                           [3, 3, 3, 3, 3],
                           [3, 2, 3, 2, 3],
                           [3, 2, 2, 3, 2],
                           [2, 2, 2, 2, 2]])
    TESTTARGETS = np.array([1, 1, 0, 0, 0, 0])
    DECISIONBOUNDARY = 0.5

    # Fit a Gaussian Process model to the data
    gp = GaussianProcess(theta0=10e-1, random_start=100)
    gp.fit(TRAININPUTS, TRAINTARGETS)

    # Generate a set of predictions for the test data
    y_pred = gp.predict(TESTINPUTS)
    print "Predicted Values:"
    print y_pred
    print "----------------"

    # Convert the continuous predictions into classes
    # by splitting on a decision boundary of 0.5
    predictions = []
    for y in y_pred:
        if y > DECISIONBOUNDARY:
            predictions.append(1)
        else:
            predictions.append(0)
    print "Binned Predictions (decision boundary = 0.5):"
    print predictions
    print "----------------"

    # print out the confusion matrix, specifying 1 as the positive class
    cm = confusion_matrix(TESTTARGETS, predictions, [1, 0])
    print "Confusion Matrix (1 as positive class):"
    print cm
    print "----------------"
    print "Classification Report:"
    print metrics.classification_report(TESTTARGETS, predictions)
When I run this code I get the following output:
Predicted Values:
[ 0.96914832 0.96914832 -0.03172673 0.03085167 0.06066993 0.11677634]
----------------
Binned Predictions (decision boundary = 0.5):
[1, 1, 0, 0, 0, 0]
----------------
Confusion Matrix (1 as positive class):
[[2 0]
[0 4]]
----------------
Classification Report:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00         4
          1       1.00      1.00      1.00         2

avg / total       1.00      1.00      1.00         6
The approach used in this basic example seems to work fine with this simple dataset. But this approach is very different from the classification example given on the scikit-learn website that I mentioned above (url repeated here):
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
So I'm wondering if I am missing something here, and I would appreciate it if anyone could:
With respect to the classification example given on the scikit-learn website:
1.1 explain what the probabilities generated in this example are probabilities of? Are they the probability of the query instance belonging to the class >0?
1.2 explain why the example uses a cumulative distribution function instead of a probability density function?
1.3 explain why the example divides the predictions made by the model by the square root of the mean squared error before they are input into the cumulative distribution function?
With respect to the basic code example I have listed here, clarify whether or not applying a simple decision boundary to the predictions generated by a gaussian process model is an appropriate way to do binary classification?
Sorry for such a long question and thanks for any help.
In the GP classifier, a standard GP distribution over functions is "squashed," usually using the standard normal CDF (also called the probit function), to map it to a distribution over binary categories.
Another interpretation of this process is through a hierarchical model (this paper has the derivation), with a hidden variable drawn from a Gaussian Process.
In sklearn's gp library, it looks like the outputs of y_pred, MSE = gp.predict(xx, eval_MSE=True) are the (approximate) posterior means (y_pred) and posterior variances (MSE) evaluated at the points in xx, before any squashing occurs.
To obtain the probability that a point from the test set belongs to the positive class, you can convert the normal distribution over y_pred to a binary distribution by applying the normal CDF (see the same paper for details).
The hierarchical model with the probit squashing function implies a 0 decision boundary (the standard normal distribution is symmetric around 0, meaning PHI(0) = 0.5), so you should set DECISIONBOUNDARY = 0.
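Putting this recipe into code, here is a hedged sketch of the squashing step (the numbers are purely illustrative; eval_MSE=True is the old GaussianProcess API used in the question):

import numpy as np
from scipy.stats import norm

# y_pred, MSE = gp.predict(TESTINPUTS, eval_MSE=True)  # posterior means and variances
y_pred = np.array([0.97, -0.03])   # illustrative posterior means
MSE = np.array([0.05, 0.05])       # illustrative posterior variances

# squash through the standard normal CDF to get P(positive class | x)
prob_positive = norm.cdf(y_pred / np.sqrt(MSE))

# a 0 boundary on y_pred is the same as a 0.5 boundary on the probability
predictions = (prob_positive > 0.5).astype(int)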
I'm doing a tweet classification, where each tweet can belong to one of few classes.
The training set output is given as the probability of each sample belonging to each class.
Eg: tweet#1 : C1-0.6, C2-0.4, C3-0.0 (C1,C2,C3 being classes)
I'm planning to use a Naive Bayes classifier from scikit-learn. I couldn't find a fit method in naive_bayes.py which takes a probability for each class for training.
I need a classifier which accepts per-class output probabilities for the training set
(i.e., y.shape = [n_samples, n_classes]).
How can I process my data set to apply a NaiveBayes classifier?
This is not so easy, as the "class probabilities" can have many interpretations.
In the case of an NB classifier and sklearn, the easiest procedure I see is:
Split (duplicate) your training samples according to the following rule:
given a sample (x, [p1, p2, ..., pk]) (where pi is the probability for the ith class), create the artificial training samples
(x, 1, p1), (x, 2, p2), ..., (x, k, pk). So you get k new observations, each "attached" to one class, with pi treated as a sample weight, which NB (in sklearn) accepts.
Train your NB with fit(X, Y, sample_weights) (where X is a matrix of your x observations, Y is a vector of the classes from the previous step, and sample_weights is a vector of the pi from the previous step).
For example if your training set consists of two points:
( [0 1], [0.6 0.4] )
( [1 3], [0.1 0.9] )
You transform them to:
( [0 1], 1, 0.6)
( [0 1], 2, 0.4)
( [1 3], 1, 0.1)
( [1 3], 2, 0.9)
and train NB with
X = [ [0 1], [0 1], [1 3], [1 3] ]
Y = [ 1, 2, 1, 2 ]
sample_weights = [ 0.6 0.4 0.1 0.9 ]
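For completeness, here is a runnable sketch of the whole trick (GaussianNB is my choice for the toy numeric features; any sklearn NB variant whose fit() accepts sample_weight works the same way):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# duplicated observations, one per (sample, class) pair, as described above
X = np.array([[0, 1], [0, 1], [1, 3], [1, 3]])
Y = np.array([1, 2, 1, 2])
sample_weights = np.array([0.6, 0.4, 0.1, 0.9])

nb = GaussianNB()
nb.fit(X, Y, sample_weight=sample_weights)
print(nb.predict_proba([[0, 1]]))  # soft class memberships for a new point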