Adapting an MNIST-designed network for a larger dataset

I am attempting to implement this deep clustering algorithm, which was designed to cluster the MNIST dataset of single-channel 28x28 images.
The images I am trying to use are 416x416, 3-channel RGB. The script is initialised with the following class.
class CachedMNIST(Dataset):
    def __init__(self, train, cuda, testing_mode=False):
        img_transform = transforms.Compose([transforms.Lambda(self._transformation)])
        # img_transform = transforms.Compose([transforms.Resize((28*28)), transforms.ToTensor(), transforms.Grayscale()])
        self.ds = torchvision.datasets.ImageFolder(root=train, transform=img_transform)
        self.cuda = cuda
        self.testing_mode = testing_mode
        self._cache = dict()

    @staticmethod
    def _transformation(img):
        return (
            torch.ByteTensor(torch.ByteStorage.from_buffer(img.tobytes())).float()
            * 0.02
        )
If the images are left unaltered, the tensor output by the _transformation function has shape torch.Size([256, 519168]), far too large for the AutoEncoder network to handle.
Error 1
RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x519168 and 784x500)
When I attempted to resize the images, the result is a 4D tensor, torch.Size([256, 1, 784, 748]), and even when reducing the batch size to minuscule amounts, CUDA crashes because there is not enough memory.
Error 2
RuntimeError: CUDA out of memory.
I'm hoping someone can point me in the right direction to tackle this problem as there must be a more efficient way to adapt the network.
AutoEncoder Model
StackedDenoisingAutoEncoder(
  (encoder): Sequential(
    (0): Sequential(
      (linear): Linear(in_features=784, out_features=500, bias=True)
      (activation): ReLU()
    )
    (1): Sequential(
      (linear): Linear(in_features=500, out_features=500, bias=True)
      (activation): ReLU()
    )
    (2): Sequential(
      (linear): Linear(in_features=500, out_features=2000, bias=True)
      (activation): ReLU()
    )
    (3): Sequential(
      (linear): Linear(in_features=2000, out_features=10, bias=True)
    )
  )
  (decoder): Sequential(
    (0): Sequential(
      (linear): Linear(in_features=10, out_features=2000, bias=True)
      (activation): ReLU()
    )
    (1): Sequential(
      (linear): Linear(in_features=2000, out_features=500, bias=True)
      (activation): ReLU()
    )
    (2): Sequential(
      (linear): Linear(in_features=500, out_features=500, bias=True)
      (activation): ReLU()
    )
    (3): Sequential(
      (linear): Linear(in_features=500, out_features=784, bias=True)
    )
  )
)

Error 1 is happening because the first linear layer has in_features=784. That number comes from the 28x28 pixels in the 1-channel MNIST data. Your input is 416x416x3 = 519168 values (different if you resize your inputs). To resolve this error, you need to make the in_features of that first linear layer match the number of pixels (times the number of channels) of your input. You can do this by changing that number, by resizing your input, or (most likely) both. Note that you will also have to flatten your input so that it is a vector, and that whatever in_features becomes for the encoder, you will want the decoder's final out_features to match it; otherwise you will be comparing two vectors of different sizes when training.
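A minimal sketch of one way to make the shapes line up while keeping the 784-input network above: resize to 28x28, convert to grayscale, and flatten each image (the 0.02 scaling mirrors the original transform; the root path is a placeholder):
import torch
import torchvision
from torchvision import transforms

# Sketch only: resize + grayscale + flatten so each image becomes a
# 784-element vector matching in_features=784 of the first encoder layer.
img_transform = transforms.Compose([
    transforms.Resize((28, 28)),                         # (28, 28), not 28*28
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),                               # shape: (1, 28, 28)
    transforms.Lambda(lambda t: t.reshape(-1) * 0.02),   # shape: (784,)
])

ds = torchvision.datasets.ImageFolder(root="path/to/train", transform=img_transform)
x, _ = ds[0]
print(x.shape)  # torch.Size([784])
Alternatively, keep the full 416x416x3 resolution and set the first encoder layer's in_features (and the decoder's final out_features) to 519168, though that makes the first layer extremely large.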
Error 2, CUDA OOM, can happen for lots of reasons (small GPU, too large a network, too large a batch size, etc.). The network you have doesn't look particularly large, but you could reduce its size by shrinking some of the internal layers (their in_features and out_features). Just be sure that if you adjust these, the out_features of one layer still matches the in_features of the next. Also, in this example the decoder is an exact mirror of the encoder, so if you adjust the encoder, make the corresponding mirror adjustment in the decoder.
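As an illustration only (the specific sizes here are assumptions, not a recommendation), here is a smaller encoder/decoder pair built as plain nn.Sequential modules that keeps the mirror structure and the matching in/out features between adjacent layers:
import torch.nn as nn

# Hypothetical, smaller dimensions; the decoder mirrors the encoder exactly.
dims = [784, 256, 256, 1024, 10]   # encoder: 784 -> 256 -> 256 -> 1024 -> 10

def mlp(sizes):
    layers = []
    for i, (d_in, d_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        layers.append(nn.Linear(d_in, d_out))
        if i < len(sizes) - 2:      # ReLU after every layer except the last
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

encoder = mlp(dims)          # 784 -> ... -> 10
decoder = mlp(dims[::-1])    # 10 -> ... -> 784, mirror of the encoder
Note that with 519168-dimensional inputs the first linear layer alone would hold roughly 260 million weights (519168 x 500), so resizing the images down is usually a more effective fix for the OOM error than shrinking the internal layers.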

Related

Expected output from the autoencoder layer?

I wrote a little linear autoencoder on the ORL 32x32 dataset (1024 - 961 - 900 - 961 - 1024).
self.encoder1 = torch.nn.Sequential(
    torch.nn.Linear(inout_size, 961),
    # torch.nn.BatchNorm1d(512),
    torch.nn.ReLU(),
)
self.encoder2 = torch.nn.Sequential(
    torch.nn.Linear(961, 900),
    # torch.nn.BatchNorm1d(256),
    torch.nn.ReLU(),
)
self.decoder = torch.nn.Sequential(
    torch.nn.Linear(900, 961),
    torch.nn.ReLU(),
    torch.nn.Linear(961, inout_size),
)
The autoencoder learns some contours and outlines of the image very well after only about 50 epochs.
But I was interested in what the outputs from the first and second layers of the autoencoder look like, to see what it actually uses to reconstruct the image. The data I got is quite strange, and I would like clarification on whether it is expected or not, and if not, what the expected outputs would be.
original is the original image, la1 shows the image after the first layer, and la2 the image after exiting the second layer.
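For reference, a minimal sketch of how such intermediate outputs can be pulled out and reshaped for viewing, assuming the encoder1/encoder2 modules above live on a model called net (961 = 31x31 and 900 = 30x30, so both activations can be displayed as square images):
import torch
import matplotlib.pyplot as plt

# Sketch only: a 'net' holding the encoder1/encoder2/decoder modules above is assumed.
x = torch.rand(1, 1024)          # one flattened 32x32 ORL image
with torch.no_grad():
    la1 = net.encoder1(x)        # shape (1, 961) -> viewable as 31x31
    la2 = net.encoder2(la1)      # shape (1, 900) -> viewable as 30x30

fig, axes = plt.subplots(1, 3)
axes[0].imshow(x.view(32, 32), cmap="gray")
axes[0].set_title("original")
axes[1].imshow(la1.view(31, 31), cmap="gray")
axes[1].set_title("la1")
axes[2].imshow(la2.view(30, 30), cmap="gray")
axes[2].set_title("la2")
plt.show()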

How to use a Swin Transformer to generate an embedding, not classification?

Using timm's implementation of Swin Transformer, how does one generate an embedding vector?
I would like to use timm's SwinTransformer class to generate an embedding vector for use with metric learning (sub-center ArcFace).
What I've tried:
To create the SwinTransformer I have something like:
from timm import create_model
backbone_name = 'swin_large_patch4_window7_224'
EMBEDDING_SIZE = 128
NUM_CLASSES = EMBEDDING_SIZE # ???
backbone = create_model(backbone_name, pretrained=True, num_classes=NUM_CLASSES)
This results in backbone having the structure:
SwinTransformer(
  (patch_embed): PatchEmbed(
    (proj): Conv2d(3, 192, kernel_size=(4, 4), stride=(4, 4))
    (norm): LayerNorm((192,), eps=1e-05, elementwise_affine=True)
  )
  (pos_drop): Dropout(p=0.0, inplace=False)
  (layers): Sequential(
    ... *SNIP SNIP SNIP* ...
  )
  (norm): LayerNorm((1536,), eps=1e-05, elementwise_affine=True)
  (avgpool): AdaptiveAvgPool1d(output_size=1)
  (head): Linear(in_features=1536, out_features=128, bias=True)
)
So, I did what I normally do, which is freeze the backbone and replace timm's head with a 3-layer multilayer perceptron.
The "smoke test" of trying to train this went really poorly, but I haven't started exploring training parameters, MLP layer sizes, etc.
Looking at the SwinTransformer architecture above, my intuition is that the weights going into head are already separating classes, not generating the activations for an embedding.
Is there a layer in the SwinTransformer where I can grab the activations and put them into a simple MLP embedding projector?
Should I just use the SwinTransformer as-is, no layers frozen, and train towards the output of head being an embedding? Won't that be very inefficient?
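For concreteness, a minimal sketch of the freeze-and-replace-head setup described above. The MLP layer sizes are assumptions; creating the model with num_classes=0 makes timm return the pooled pre-head features (1536-dimensional for this backbone) from forward():
import torch
import torch.nn as nn
from timm import create_model

EMBEDDING_SIZE = 128

# num_classes=0 removes timm's classification head, so forward() returns
# the pooled 1536-d features for swin_large_patch4_window7_224.
backbone = create_model('swin_large_patch4_window7_224', pretrained=True, num_classes=0)
for p in backbone.parameters():
    p.requires_grad = False               # frozen backbone, as described above

embedder = nn.Sequential(                 # hypothetical 3-layer MLP projector
    nn.Linear(backbone.num_features, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, EMBEDDING_SIZE),
)

x = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    feats = backbone(x)                   # (1, 1536)
emb = embedder(feats)                     # (1, 128) embedding for the metric-learning loss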

Why use Conv2D(64) twice instead of a single Conv2D(128)? Are they the same or different?

model = Sequential()
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', input_shape=(X_train.shape[1:])))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2), strides=(2, 2)))
model.add(Dropout(0.5))
In the above code, can I use Conv2D(128) once instead of Conv2D(64) twice?
No, you cannot, because the two configurations do not represent the same function. This pattern was introduced in the VGG network paper, and it is used to increase the representational power of the network. Two stacked layers with 3x3 filters have a receptive field roughly equivalent to one layer with a 5x5 filter (through composition); stacking is not equivalent to adding up the number of filters.
In particular, a single convolutional layer with 128 filters is not the same as two convolutional layers with 64 filters each, especially since there is a ReLU activation between them, which makes the behaviour more non-linear.
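To make the difference concrete, here is a small sketch (assuming Keras and a 28x28x1 input, which is not from the question) that builds both variants and compares their parameter counts; note the stacked version also applies a ReLU between the two convolutions:
from tensorflow.keras import layers, models

# Two stacked 3x3 convolutions with 64 filters each (ReLU in between).
stacked = models.Sequential([
    layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.Conv2D(64, (3, 3), activation='relu'),
])

# One 3x3 convolution with 128 filters.
single = models.Sequential([
    layers.Conv2D(128, (3, 3), activation='relu', input_shape=(28, 28, 1)),
])

print(stacked.count_params(), single.count_params())
# Different parameter counts, different output depths (64 vs 128 channels),
# and different effective receptive fields (5x5 vs 3x3).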

Why does a fully convolutional network plateau first and then learn?

I'm training a fully convolutional network to classify handwritten Chinese characters. The dev dataset I am using has 250 classes with 200-300 samples in each class.
I found that no matter how I tweak the model, all the variants I've tried so far show a similar behaviour: they plateau at first, and then the accuracy starts to shoot up while the loss decreases, as shown in the screenshot below:
I would love to know more about the reasons behind this behaviour.
Thanks a lot!
Edit:
Sorry for not providing more details before.
My best-performing network so far is below, using an Adadelta optimizer with LR 0.1. The weights were initialised using Xavier initialisation.
inp = Input(shape=(30, 30, 1))
x = Lambda(
    lambda image: tf.image.resize_images(
        image, size=(resize_size, resize_size),
        method=tf.image.ResizeMethod.BILINEAR
    )
)(inp)
x = Conv2D(filters=96, kernel_size=(5, 5), strides=(1, 1), padding="same", activation="relu")(x)
x = Conv2D(filters=96, kernel_size=(1, 1), strides=(1, 1), padding="same", activation="relu")(x)
x = MaxPooling2D(pool_size=(3, 3), strides=2)(x)
x = Conv2D(filters=192, kernel_size=(5, 5), strides=(2, 2), padding="same", activation="relu")(x)
x = Conv2D(filters=192, kernel_size=(1, 1), strides=(1, 1), padding="same", activation="relu")(x)
x = MaxPooling2D(pool_size=(3, 3), strides=2)(x)
x = Conv2D(filters=192, kernel_size=(3, 3), padding="same", activation="relu")(x)
x = Conv2D(filters=192, kernel_size=(1, 1), padding="same", activation="relu")(x)
x = Conv2D(filters=10, kernel_size=(1, 1), padding="same", activation="relu")(x)
x = AveragePooling2D(pool_size=(3, 3))(x)
x = Flatten()(x)
x = Dense(250, activation="softmax")(x)
model = Model(inp, x)
model.compile(
    loss=categorical_crossentropy,
    optimizer=Adadelta(lr=0.1),
    metrics=["accuracy"],
)
As for the input data, they are all handwritten Chinese characters that I transformed into an MNIST-like format, with size 30x30x1 (the Lambda layer after the Input layer is there because I was following the original FCN paper, which used a 32x32 input size), as shown below:
And this is how the loss and accuracy charts above came about.
Hope this provides better intuition. Thanks.
We can't answer specifically, because you've neglected to identify your network and inputs sufficiently, let alone the training methods. To fully trace the high-level training characteristics, we'd need some detailed visualization of the kernels through the iterations in question.
In general, this is simply because a highly complex model usually needs a few iterations before it gets better than random results. We begin with random weights and kernels. In the first few iterations, the model has to work through the chaos, establish a few useful patterns in the early-level kernels, and find weights that correlate with enough output categories that the accuracy moves above 0.4% (the random-guess baseline for 250 classes) with statistical significance.
Part of the problem is that, in those first few iterations, the model also stumbles across patterns that look useful amid the chaos but actually harm long-term learning. For instance, it may build a pattern for black dots and guess correctly that this correlates with mammal eyes and vehicle wheels. All too soon, that generalization, that an airplane and an Airedale are structurally related, turns out to be a wrong assumption. It has to break down the second-level correlations between those categories and find something else.
This is the sort of learning that keeps the accuracy low for longer than you might think. The model spends the first few iterations jumping to hundreds of conclusions about classifications, anything that correlates with one or two correct guesses. Then it has to learn enough to separate valid ones from invalid ones. That is where the model starts making advances it can retain.

Intuition behind Stacking Multiple Conv2D Layers before Dropout in CNN

Background:
Tagging TensorFlow since Keras runs on top of it and this is more a general deep learning question.
I have been working on the Kaggle Digit Recognizer problem and used Keras to train CNN models for the task. The model below has the original CNN structure I used for this competition, and it performed okay.
def build_model1():
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), padding="Same", activation="relu", input_shape=[28, 28, 1]))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Conv2D(64, (3, 3), padding="Same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Conv2D(64, (3, 3), padding="Same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation="softmax"))
    return model
Then I read some other notebooks on Kaggle and borrowed another CNN structure (copied below), which works much better than the one above in that it achieved better accuracy, lower error rate, and took many more epochs before overfitting the training data.
def build_model2():
    model = models.Sequential()
    model.add(layers.Conv2D(32, (5, 5), padding='Same', activation='relu', input_shape=(28, 28, 1)))
    model.add(layers.Conv2D(32, (5, 5), padding='Same', activation='relu'))
    model.add(layers.MaxPool2D((2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Conv2D(64, (3, 3), padding='Same', activation='relu'))
    model.add(layers.Conv2D(64, (3, 3), padding='Same', activation='relu'))
    model.add(layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(10, activation="softmax"))
    return model
Question:
Is there any intuition or explanation behind the better performance of the second CNN structure? What is it that makes stacking 2 Conv2D layers better than just using 1 Conv2D layer before max pooling and dropout? Or is there something else that contributes to the result of the second model?
Thank y'all for your time and help.
The main difference between these two approaches is that the latter (2 convs) has more flexibility in expressing non-linear transformations without losing information. Maxpool removes information from the signal, and dropout forces a distributed representation, so both effectively make it harder to propagate information. If, for a given problem, a highly non-linear transformation has to be applied to the raw data, stacking multiple convs (with ReLU) makes it easier to learn; that's it. Also note that you are comparing a model with 3 max poolings against one with only 2, so the second one will potentially lose less information. Another difference is that the second model has a much bigger fully connected part at the end, while the first one is tiny (64 neurons with 0.5 dropout means that on average only about 32 neurons are active during training, which is a tiny layer!). To sum up:
These architectures differ in many aspects, not just in stacking conv layers.
Stacking convs usually leads to less information being lost in processing; see for example the "all convolutional" architectures.
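As a quick way to see the capacity difference described above, a small sketch (assuming the build_model1 / build_model2 definitions from the question are already defined in the session) that compares parameter counts:
# Rough capacity comparison; assumes build_model1 and build_model2 from above.
m1, m2 = build_model1(), build_model2()
print("model1 params:", m1.count_params())
print("model2 params:", m2.count_params())
# model2's Dense(256) head is much larger than model1's Dense(64); with
# Dropout(0.5), model1's 64-unit layer averages only ~32 active units
# during training.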
