Convolutional Autoencoder in Pytorch for Dummies - machine-learning

I am here to ask some more general questions about Pytorch and Convolutional Autoencoders.
If I only use Convolutional Layers (FCN), do I even have to care about the input shape? And then how do I choose the number of featuremaps best?
Does a ConvTranspose2d Layer automatically unpool?
Can you spot any errors or unconventional code in my example?
By the way, I want to make a symmetrical Convolutional Autoencoder to colorize black and white images with different image sizes.
self.encoder = nn.Sequential (
# conv 1
nn.Conv2d(in_channels=3, out_channels=512, kernel_size=3, stride=1, padding=1),
nn.ReLU,
nn.MaxPool2d(kernel_size=2, stride=2), # 1/2
nn.BatchNorm2d(512),
# conv 2
nn.Conv2d(in_channels=512, out_channels=256, kernel_size=3, stride=1, padding=1),
nn.ReLU,
nn.MaxPool2d(kernel_size=2, stride=2), # 1/4
nn.BatchNorm2d(256),
# conv 3
nn.Conv2d(in_channels=256, out_channels=128, kernel_size=3, stride=1, padding=1),
nn.ReLU,
nn.MaxPool2d(kernel_size=2, stride=2), # 1/8
nn.BatchNorm2d(128),
# conv 4
nn.Conv2d(in_channels=128, out_channels=64, kernel_size=3, stride=1, padding=1),
nn.ReLU,
nn.MaxPool2d(kernel_size=2, stride=2), #1/16
nn.BatchNorm2d(64)
)
self.encoder = nn.Sequential (
# conv 5
nn.ConvTranspose2d(in_channels=64, out_channels=128, kernel_size=3, stride=1, padding=1),
nn.ReLU,
nn.BatchNorm2d(128),
# conv 6
nn.ConvTranspose2d(in_channels=128, out_channels=256, kernel_size=3, stride=1, padding=1),
nn.ReLU,
nn.BatchNorm2d(256),
# conv 7
nn.ConvTranspose2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=1),
nn.ReLU,
nn.BatchNorm2d(512),
# conv 8
nn.ConvTranspose2d(in_channels=512, out_channels=512, kernel_size=3, stride=1, padding=1),
nn.Softmax()
)
def forward(self, x):
h = x
h = self.encoder(h)
h = self.decoder(h)
return h

No, you don't need to care about input width and height with a fully convolutional model. But should probably ensure that each downsampling operation in the encoder is matched by a corresponding upsampling operation in the decoder.
I'm not sure what you mean by unpooling. If you mean upsampling (increasing spatial dimensions), then this is what the stride parameter is for. In PyTorch, a transpose convolution with stride=2 will upsample twice. Note, however, that instead of a transpose convolution, many practitioners prefer to use bilinear upsampling followed by a regular convolution. This is one reason why.
If, on the other hand, you mean actual unpooling, then you should look at the documentation of torch.MaxUnpool2d. You need to collect maximal value indices from the MaxPool2d operation and feed them into MaxUnpool2d.
The general consensus seems to be that you should increase the number of feature maps as you downsample. Your code appears to do the reverse. Consecutive powers of 2 seem like a good place to start. It's hard to suggest a better rule of thumb. You probably need to experiment a little.
In other notes, I'm not sure why you apply softmax to the encoder output.

Related

1D Convolutions CNN Keras

I'm very new to Keras and I'm tying to implement a CNN using 1D convolutions for binary classification on the raw time series data. Each training example has 160 time steps and I have 120 training examples. The training data is of shape (120,160). Here is the code:
X_input = Input((160,1))
X = Conv1D(6, 5, strides=1, name='conv1', kernel_initializer=glorot_uniform(seed=0))(X_input)
X = Activation('relu')(X)
X = MaxPooling1D(2, strides=2)(X)
X = Conv1D(16, 5, strides=1, name='conv2', kernel_initializer=glorot_uniform(seed=0))(X)
X = Activation('relu')(X)
X = MaxPooling1D(2, strides=2)(X)
X = Flatten()(X)
X = Dense(120, activation='relu', name='fc1', kernel_initializer=glorot_uniform(seed=0))(X)
X = Dense(84, activation='relu', name='fc2', kernel_initializer=glorot_uniform(seed=0))(X)
X = Dense(2, activation='sigmoid', name='fc3', kernel_initializer=glorot_uniform(seed=0))(X)
model = Model(inputs=X_input, outputs=X, name='model')
X_train = X_train.reshape(-1,160,1) # shape (120,160,1)
t_train = y_train.reshape(-1,1,1) # shape (120,1,1)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train)
The error that I get is expected fc3 to have 2 dimensions, but got array with shape (120, 1, 1).
I tried removing each layer and just leaving the 'conv1' component but I get the error expected conv1 to have shape (156, 6) but got array with shape (1, 1). It seems like my input shape is wrong; however, looking at other examples it seems that this worked for other people.
I think the issue is not your inputs, but rather your targets.
The output of the model is 2 dimensions, but when it checks against the targets, it realizes that the targets are in an array with shape (120, 1, 1).
You can try changing the y_train reshape line as follows (fyi, it also seems that you accidentally typed t_train instead of y_train):
y_train = y_train.reshape(-1,1)
Also, it seems that you probably want to use 1 instead of 2 for the last Dense layer (see Difference between Dense(2) and Dense(1) as the final layer of a binary classification CNN?)

Negative weights and biases in siamese keras model

I try to train a siamese model in keras. I use a really simple encoder with only covnets to encode a 32x32 RGB picture into a feature vector. The encoder encodes two pictures A and B. Then a MLP compares the two vectors and computes a score between 0 and 1 which is should be high if A and B are of the same class and low if they are not.
I used relu as the activation function on all layers but the model only learned to encode everything into a 0-vector. I switched to 'tanh' and saw that a lot of weight and biases, and also the entries in the feature-vector are negative. So i now understand why with relu everything was zero. But how come i get negative values? The input is positive, output as well and y-values are 0 or 1. I think there is something wrong with my model.
It doesn't perform very well either. It gets to around 60% accuracy.
Here is my model:
def model():
initializer = keras.initializers.random_uniform(minval=0.0001, maxval=0.001)
enc = Sequential()
enc.add(Conv2D(32, (3, 3), padding='same', activation='tanh',kernel_initializer=initializer))
enc.add(Conv2D(32, (3, 3), padding='same', strides=(2, 2), activation='tanh',kernel_initializer=initializer))
enc.add(Conv2D(32, (3, 3), padding='same', activation='tanh',kernel_initializer=initializer))
enc.add(Conv2D(16, (3, 3), padding='same', strides=(2, 2), activation='tanh',kernel_initializer=initializer))
enc.add(Conv2D(32, (3, 3), padding='same', activation='tanh',kernel_initializer=initializer))
enc.add(Conv2D(4, (3, 3), padding='same', strides=(2, 2), activation='tanh',kernel_initializer=initializer))
enc.add(Flatten())
input1 = Input((32, 32, 3))
# enc.build((1,32,32,3))
# enc.summary()
input2 = Input((32, 32, 3))
enc1 = enc(input1)
enc2 = enc(input2)
twin = concatenate([enc1, enc2])
twin = Dense(64, activation='tanh',kernel_initializer=initializer)(twin)
twin = Dense(32, activation='tanh',kernel_initializer=initializer)(twin)
twin = Dense(1, activation='sigmoid',kernel_initializer=initializer)(twin)
twin = Model(inputs=[input1, input2], outputs=twin)
twin.summary()
twin.compile(optimizer=adam(0.0001), loss='binary_crossentropy', metrics=["acc"])
return twin
Edit: I found out it was all good. Just my data was bad. I had only 1/10's samples of one class compared to the others. Oversampling didn't help. I removed the class from the dataset for now. Its working. I might add the class back in with augmented copies as additional samples and see how it goes.

What is the best Neural Network architecture for mapping one input image to two outputs?

I have generated a data set using EMNIST that has one character per image or two characters per image.The image is sized at 28x56(hxw)
I basically want to predict the one or two characters in a given image. I am not sure on which architecture to follow to implement this. There are 62 character classes.
ex:-single character two characters
For single character y= [23]
For two characters y= [35,11]
I tried the following.
I tried implementing this thorough a CTC but I got stuck in a infinite loss that I couldn't fix.
Padded the single character ground truths with 62 to note a blank character and trained a CNN with following layers.
print()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
y_train = sequence.pad_sequences(y_train, padding='post', value = 62)
y_test = sequence.pad_sequences(y_test,padding='post', value = 62)
X_train = X_train/255.0
X_test = X_test/255.0
input_shape = (28, 56, 1)
model = Sequential()
model.add(Conv2D(filters=72, kernel_size=(11,11), padding = 'same', activation='relu',input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2,2),strides=2))
model.add(Conv2D(filters=144, kernel_size=(7,7) , padding = 'same', activation='relu'))
model.add(Conv2D(filters=144, kernel_size=(3,3) , padding = 'same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(units=1024, activation='relu'))
model.add(Dropout(.5))
model.add(Dense(512, activation='relu'))
model.add(Dense(units=2, activation='relu'))
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
model.summary()
batch_size = 128
steps = math.ceil(X_train.shape[0]/batch_size)
datagen = ImageDataGenerator(
featurewise_center=False, # set input mean to 0 over the dataset
samplewise_center=False, # set each sample mean to 0
featurewise_std_normalization=False, # divide inputs by std of the dataset
samplewise_std_normalization=False, # divide each input by its std
zca_whitening=False, # apply ZCA whitening
rotation_range=0, # randomly rotate images in the range (degrees, 0 to 180)
zoom_range = 0.2, # Randomly zoom image
width_shift_range=0.2, # randomly shift images horizontally (fraction of total width)
height_shift_range=0.1, # randomly shift images vertically (fraction of total height)
horizontal_flip=False, # randomly flip images
vertical_flip=False)
history = model.fit_generator(datagen.flow(X_train,y_train, batch_size=batch_size),
epochs = 6, validation_data = (X_test, y_test),
verbose = 1,steps_per_epoch=steps)
I was able to reach an accuracy of around 90% for validation set. However when I feed a generated image to see it's prediction it's a few characters off from the correct classification. Is there something wrong in the way I have created the model or pre-processed the data?
I have recognized my error. I have tried to tackle the problem using regression method wheres the problem is a classification problem.

ValueError: strides should be of length 1, 1 or 3 but was 2

train input shape : (13974, 100, 6, 5)
train output shape : (13974, 1,1)
test input shape : (3494, 100, 6, 5)
test output shape : (3494, 1, 1)
I am developing the following model. of 2D CNN LSTM.
model = Sequential()
model.add(TimeDistributed(Conv2D(1, (1,1), activation='relu',
input_shape=(6,5,1))))
model.add(TimeDistributed(MaxPooling2D(pool_size=(6, 5))))
model.add(TimeDistributed(Flatten()))
model.add(LSTM(units=300, return_sequences= False, input_shape=(100,1)))
model.add(Dense(1))
when I try to fit as follow
model.fit(train_input,train_output,epochs=50,batch_size=60)
it gives me a error.
ValueError: strides should be of length 1, 1 or 3 but was 2
please correct my model. I am converting the 6,5 image to a single unit and predict the 101th time stamp from 100 time stamps.
Your question is quite unclear, but I believe you have sequence of 100 images of size 6 x 5. It is better to incorporate Conv3D in your usecase, and also there is no necessary to have TimeDistributed everywhere. This is just an illustration for your usecase, you may have to add more layers of Conv and MaxPool and experiment with other hyper-parameters to get good fit.
# Add the channel dimension in input
train_input = np.expand_dims(train_input, -1)
# Remove the extra dimension in output
train_output = np.reshape(train_output, (-1, 1))
model = Sequential()
model.add(Conv3D(1, (1,1,1), activation='relu', input_shape=(100, 6,5, 1)))
model.add(MaxPooling3D(pool_size=(6, 5, 1)))
model.add(Reshape((16, 5)))
model.add(LSTM(units=300, return_sequences= False))
model.add(Dense(1))

OneClass SVM model using pretrained ResNet50 network

I'm trying to build OneClass classifier for image recognition. I found this article, but because I have no full source code I don't exactly understand what am i doing.
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=42)
# X_train (2250, 200, 200, 3)
resnet_model = ResNet50(input_shape=(200, 200, 3), weights='imagenet', include_top=False)
features_array = resnet_model.predict(X_train)
# features_array (2250, 7, 7, 2048)
pca = PCA(svd_solver='randomized', n_components=450, whiten=True, random_state=42)
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)
param_grid = {'svc__C': [1, 5, 10, 50], 'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
grid = GridSearchCV(model, param_grid)
grid.fit(X_train, y_train)
I have 2250 images (food and not food) 200x200px size, I send this data to predict method of ResNet50 model. Result is (2250, 7, 7, 2048) tensor, any one know what this dimensionality does it mean?
When I try to run grid.fit method i get an error:
ValueError: Found array with dim 4. Estimator expected <= 2.
These are the findings I could make.
You are getting the output tensor above the global average pooling layer. (See resnet_model.summary() to know about how input dimension changes to output dimension)
For a simple fix, add an Average pooling 2d Layer on top of resnet_model.
(So that output shape becomes (2250,1,1, 2048))
resnet_model = ResNet50(input_shape=(200, 200, 3), weights='imagenet', include_top=False)
resnet_op = AveragePooling2D((7, 7), name='avg_pool_app')(resnet_model.output)
resnet_model = Model(resnet_model.input, resnet_op, name="ResNet")
This generally is present in the source code of ResNet50 itself. Basically we are appending an AveragePooling2D layer to the resnet50 model. The last line combines the layer (2nd line) and the base line model into a model object.
Now the output dimension (feature_array) will be (2250, 1, 1, 2048) (because of added average pooling layer).
To avoid the ValueError you ought to reshape this feature_array to (2250, 2048)
feature_array = np.reshape(feature_array, (-1, 2048))
In the last line of the program in the question,
grid.fit(X_train, y_train)
you have fit with X_train (which are images in this case). The correct variable here is features_array (This is considered to be summary of the image). Entering this line will rectify the error,
grid.fit(features_array, y_train)
For more finetuning in this fashion by extracting feature vectors do look here (training with neural nets instead of using PCA and SVM).
Hope this helps!!

Resources