What is the default batch size of PyTorch SGD?

What does PyTorch SGD do if I feed it the whole dataset and do not specify a batch size? I don't see anything "stochastic" or random in that case.
For example, in the following simple code, I feed the whole data (x, y) into a model.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(5):
    y_pred = model(x_data)
    loss = criterion(y_pred, y_data)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Suppose there are 100 data pairs (x, y), i.e. x_data and y_data each have 100 elements.
Question: It seems to me that all 100 gradients are calculated before a single update of the parameters, so the size of a "mini-batch" is 100, not 1, and there is no randomness. Am I right? At first, I thought SGD meant randomly choosing 1 data point and calculating its gradient, which would be used as an approximation of the true gradient over all the data.

The SGD optimizer in PyTorch is just gradient descent. The stochastic part comes from how you usually pass a random subset of your data through the network at a time (i.e. a mini-batch). The code you posted passes the entire dataset through on each epoch before doing backprop and stepping the optimizer, so you're really just doing regular (full-batch) gradient descent.
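For illustration, here is a minimal sketch of how the updates become stochastic once you iterate over random mini-batches with a DataLoader (the batch size of 10 is arbitrary; x_data, y_data, model, and criterion are the tensors and objects from the question):

import torch
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(x_data, y_data)                    # 100 (x, y) pairs
loader = DataLoader(dataset, batch_size=10, shuffle=True)  # random mini-batches

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(5):
    for x_batch, y_batch in loader:    # 10 batches per epoch
        y_pred = model(x_batch)
        loss = criterion(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()               # one update per mini-batch

With batch_size=1 and shuffle=True you would recover the classic "one random data point per update" form of SGD.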

Related

Use SMOTE to oversample image data

I'm doing binary classification with CNNs and the data is imbalanced: positive medical images : negative medical images = 0.4 : 0.6. So I want to use SMOTE to oversample the positive medical image data before training.
However, the data is 4-dimensional, (761, 64, 64, 3), which causes the error
Found array with dim 4. Estimator expected <= 2
So, I reshape my train_data:
X_res, y_res = smote.fit_sample(X_train.reshape(X_train.shape[0], -1), y_train.ravel())
And it works fine. Before feeding it to the CNNs, I reshape it back with:
X_res = X_res.reshape(X_res.shape[0], 64, 64, 3)
Now, I'm not sure whether this is a correct way to oversample, and whether the reshape operation changes the images' structure?
I had a similar issue. I used the reshape function to flatten the images:
from imblearn.over_sampling import SMOTE

# X_train.shape is (8000, 250, 250, 3); flatten each image into one row
ReX_train = X_train.reshape(8000, 250 * 250 * 3)   # ReX_train.shape is (8000, 187500)

smt = SMOTE()
Xs_train, ys_train = smt.fit_sample(ReX_train, y_train)  # fit_resample in newer imblearn
This approach is painfully slow, but it helped to improve the performance.
As soon as you flatten an image you are losing localized information; this is one of the reasons why convolutions are used in image-based machine learning.
A tensor of shape 8000x250x250x3 has an inherent meaning: 8000 image samples, each of width 250 and height 250, with 3 channels. Once you reshape it to 8000x(250*250*3), each row is just a bunch of numbers, and unless you use some kind of sequence network, training on it is a bad idea.
Oversampling is bad for image data; you can do image augmentations instead (20-crop, introducing noise such as Gaussian blur, rotations, translations, etc.), as sketched below.
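To illustrate the augmentation alternative, here is a minimal sketch using torchvision transforms (assuming a recent torchvision with tensor-transform support; the 64x64 image size comes from the question):

import torch
from torchvision import transforms

# Hypothetical augmentation pipeline: each call produces a slightly different image
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),           # small random rotations
    transforms.GaussianBlur(kernel_size=3),  # mild blur as noise
])

img = torch.rand(3, 64, 64)   # dummy CHW image tensor standing in for a real sample
new_img = augment(img)        # a fresh augmented sample for the minority class

Applying such transforms to minority-class images at training time yields new samples without the distortions that interpolating flattened pixel vectors (as SMOTE does) can introduce.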
1. Flatten the images
2. Apply SMOTE on this flattened image data and its labels
3. Reshape the flattened data back into RGB images
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
train_rows = len(X_train)
X_train = X_train.reshape(train_rows, -1)    # e.g. (80, 30000)
X_train, y_train = sm.fit_resample(X_train, y_train)
X_train = X_train.reshape(-1, 100, 100, 3)   # e.g. (>80, 100, 100, 3) after oversampling

What's the difference between tf.nn.conv2d with strides = 2 and tf.nn.max_pool with 2x2 pooling?

As mentioned above, both
tf.nn.conv2d with strides = 2
and
tf.nn.max_pool with 2x2 pooling
can reduce the size of the input by half, and I know the outputs may be different. What I don't know is whether that affects the final training result or not. Any clue about this? Thanks.
In both your examples, assume we have a [height, width] kernel applied with strides [2, 2]. That means we apply the kernel to a 2-D window of size [height, width] on the 2-D input to get an output value, and then slide the window over by 2 (to the right or down) to get the next output value.
In both cases you end up with 4x fewer outputs than inputs (2x fewer in each dimension), assuming padding='SAME'.
The difference is how the output values are computed for each window:
conv2d:
- the output is a linear combination of the input values, times a weight for each cell in the [height, width] kernel
- these weights become trainable parameters in your model
max_pool:
- the output is just the maximum input value within the [height, width] window of input values
- there are no weights and no trainable parameters introduced by this operation
The final training results could actually differ: the convolution multiplies the tensor by a filter, which you might not want, since it takes extra computation time and can also overfit your model because it introduces more weights.
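As a quick check of the shape claim, here is a minimal sketch in TensorFlow (eager mode; the input and kernel are random placeholders):

import tensorflow as tf

x = tf.random.normal([1, 8, 8, 3])   # NHWC input
w = tf.random.normal([2, 2, 3, 3])   # 2x2 kernel, 3 input -> 3 output channels

conv = tf.nn.conv2d(x, w, strides=[1, 2, 2, 1], padding='SAME')  # trainable weights w
pool = tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')  # no weights

print(conv.shape, pool.shape)  # both (1, 4, 4, 3): 2x fewer in each spatial dimension

The spatial downsampling is identical; only conv2d adds parameters that must be learned.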

How to set local biases in caffe and torch?

When convolving a multi-channel image into a one-channel image, you can usually have only one bias variable (as the output is one channel). If I want to set local biases, that is, set a bias for each pixel of the output image, how can I do this in caffe and torch?
In TensorFlow, this is very simple: you just set a bias matrix. For example:
data is 25 (height) x 25 (width) x 48 (channels)
weights is 3x3 (kernel size) x 48 (input channels) x 1 (output channels)
biases is 25x25
then:
hidden = tf.nn.conv2d(data, weights, [1, 1, 1, 1], padding='SAME')
output = tf.nn.relu(hidden + biases)
Is there a similar solution in caffe or torch?
For caffe, here is a post about the Scale layer: Scale layer in Caffe. The Scale layer can only provide a single bias variable.
The answer is the Bias layer: a Bias layer can have a weight matrix, which can be treated as the biases.
For torch, there is an nn.Add() layer, much like TensorFlow's tf.add() function, so the nn.Add() layer is the solution.
All of these have been verified with actual models.
But still, thank you very much @Shai.
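For reference, here is a minimal sketch of the same per-pixel-bias idea in modern PyTorch (not the Lua torch asked about above; the 25x25x48 sizes come from the example):

import torch
import torch.nn as nn

class ConvWithLocalBias(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(48, 1, kernel_size=3, padding=1, bias=False)
        # one learnable bias per output pixel, broadcast over the batch
        self.local_bias = nn.Parameter(torch.zeros(1, 1, 25, 25))

    def forward(self, x):  # x: (N, 48, 25, 25)
        return torch.relu(self.conv(x) + self.local_bias)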

Reshape y_train for binary text classification in Tensorflow

I have a classical y_train composed of 0 (negative) and 1 (positive) in a one-dimensional shape. I want to train a TensorFlow model, but I have to initialize the y placeholder with the number of classes I want. So in this text classification case, I want the model to predict negative or positive, i.e. 2 classes? But how do I convert my y_train to fit the output I'm looking for? Thanks for your time!
"ValueError: Cannot feed value of shape (25000, 1) for Tensor u'Placeholder_5:0', which has shape (Dimension(None), Dimension(2))"
It appears your y_train contains the label values themselves, whereas the y_train needed by the model requires label probabilities (here, a one-hot encoding). In your case, since there are only two labels, you can convert them to label probabilities as follows:
y_train = tf.concat([1 - y_train, y_train], 1)  # in TF >= 1.0 the axis argument comes second
If you have more labels, have a look at sparse_to_dense to convert them to probabilities.
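For the multi-class case, a minimal sketch with tf.one_hot, which reproduces the two-column result above for binary labels and generalizes to any number of classes:

import tensorflow as tf

y_train = tf.constant([0, 1, 1, 0])       # integer labels, shape (4,)
y_onehot = tf.one_hot(y_train, depth=2)   # shape (4, 2): [[1,0],[0,1],[0,1],[1,0]]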

Finding the function approximated by a neural network

If I have a feed-forward multilayer perceptron with a sigmoid activation function, which is trained and has known weights, how can I find the equation of the curve approximated by the network (the curve that separates the 2 types of data)?
In general, there is no closed-form solution for the input points where your NN output is 0.5 (or 0, in the case of -1/1 labels instead of 0/1).
What is usually done for visualization in a low-dimensional input space is to grid up the input space and compute the contours of the NN output. (The contours are a smooth estimate of what the NN response surface looks like.)
In MATLAB, one would do
[X, Y] = meshgrid(linspace(-1, 1), linspace(-1, 1));
contour(X, Y, f(X, Y))
where f is your trained NN evaluated elementwise over the grid, assuming a [-1, 1] x [-1, 1] input space.
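A Python equivalent of the same gridding idea, with a dummy f standing in for the trained network:

import numpy as np
import matplotlib.pyplot as plt

def f(points):  # dummy stand-in for the trained NN; returns values in (0, 1)
    return 1.0 / (1.0 + np.exp(-(points[:, 0] + points[:, 1])))

xs = np.linspace(-1, 1, 100)
X, Y = np.meshgrid(xs, xs)
Z = f(np.stack([X.ravel(), Y.ravel()], axis=1)).reshape(X.shape)  # NN output on the grid
plt.contour(X, Y, Z, levels=[0.5])  # the 0.5 level set is the decision boundary
plt.show()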
