Use of Tanh() in the output layer of generator network

I am studying Generative Adversarial Networks. While reading the DCGAN paper by Radford et al., I noticed that the output layer of their generator network uses Tanh(). The range of Tanh() is (-1, 1), but the pixel values of an image in double-precision format lie in [0, 1]. Can someone please explain why Tanh() is used in the output layer, and how the generator produces images with proper pixel values?

If you look at the code for the paper, you will see that the authors preprocess the training images so that their values lie in [-1, 1]: https://github.com/soumith/dcgan.torch/blob/master/data/donkey_folder.lua#L68
Then, at generation time, they rescale the generated images back to values in [0, 1]: https://github.com/soumith/dcgan.torch/blob/master/generate.lua#L89.
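The two rescaling steps can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the authors' Lua code; the function names and the sample activation value are made up for the example. The functions use plain arithmetic, so they work on single floats as shown here and would work elementwise on arrays as well.

```python
import math

def to_tanh_range(pixel):
    """Map a pixel value from [0, 1] to [-1, 1] (done to training images)."""
    return pixel * 2.0 - 1.0

def to_unit_range(value):
    """Map a value from [-1, 1] back to [0, 1] (done to generator output)."""
    return (value + 1.0) / 2.0

# The generator's last layer emits tanh(x), which always lies in (-1, 1).
raw_activation = 0.3  # hypothetical pre-activation value
generated = math.tanh(raw_activation)

# Rescaling makes the value usable as an ordinary pixel intensity.
displayable = to_unit_range(generated)
```

Because the training images are mapped into the same range that Tanh() produces, the discriminator sees real and generated images on the same scale.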

Related

What is the purpose of having the same input and output in PyTorch nn.Linear function?

I think this is a comprehension issue, but I would appreciate any help.
I'm trying to learn how to use PyTorch for autoencoding. In the nn.Linear function, there are two specified parameters,
nn.Linear(input_size, hidden_size)
When reshaping a tensor to its minimum meaningful representation, as one would in autoencoding, it makes sense that the hidden_size would be smaller. However, in the PyTorch tutorial there is a line specifying identical input_size and hidden_size:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )
I guess my question is, what is the purpose of having the same input and hidden size? Wouldn't this just return an identical tensor?
I suspect that this is just a requirement after calling the nn.ReLU() activation function.
As Wikipedia puts it:
An autoencoder is a type of artificial neural network used to learn
efficient codings of unlabeled data. The
encoding is validated and refined by attempting to regenerate the
input from the encoding.
In other words, the idea of an autoencoder is to learn an identity function. This identity function will be learned only for particular inputs (i.e. inputs without anomalies). Two points follow from this:
(1) The input has the same dimensions as the output.
(2) Autoencoders are (generally) built to learn the essential features of the input.
Because of point (1), an autoencoder consists of a series of layers (e.g. a series of nn.Linear() or nn.Conv2d() layers).
Because of point (2), you generally have an Encoder that compresses the information (in your code snippet, from 28x28 down to the final 10) and a Decoder that decompresses it (10 -> 28x28). Across implementations of this architecture, the latent space dimensionality (10) is generally much smaller than the input (28x28). With the end goal of the Encoder in mind, you can see that the layers along the way (such as nn.Linear(28*28, 512)) may produce intermediate data that is larger than the final code; this extra data disappears once the series of layers produces the final output (10).
Note that because the model in your question includes a nonlinearity after the linear layer, the model will not learn an identity transform between the input and output. In the specific case of the relu nonlinearity, the model could learn an identity transform if all of the input values were positive, but in general this won't be the case.
I find it a little easier to picture the issue with an even smaller model consisting of Linear --> Sigmoid --> Linear. In such a case, the input is mapped through the first matrix transform and then "squashed" into the interval (0, 1) as the "hidden" layer representation. The next ("output") layer would need to take this squashed view of the input and come up with some way of "unsquashing" it back into the original. But an affine output layer cannot do this, so the model has to learn some other, non-identity, transforms for the two matrices.
There are some neat visualizations of this concept on Chris Olah's blog that are well worth a look.
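To make the "same input and output size does not mean identical tensor" point concrete, here is a minimal pure-Python sketch. It assumes a layer computes y = W x + b, as nn.Linear does; the 3x3 weights and the input vector are made-up values, and the helper names are hypothetical.

```python
def linear(weights, bias, x):
    """Affine map y = W x + b, the computation a linear layer performs."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    """Elementwise max(0, v)."""
    return [max(0.0, v) for v in x]

x = [1.0, -2.0, 3.0]
bias = [0.0, 0.0, 0.0]

# Hypothetical 3x3 weights: input and output sizes match, but the map
# is not the identity, so the output differs from the input.
weights = [[0.5, -1.0, 0.0],
           [1.0, 0.0, 2.0],
           [0.0, 1.0, -0.5]]
y = relu(linear(weights, bias, x))

# Even with identity weights, ReLU zeroes the negative entry of x.
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
z = relu(linear(identity, bias, x))
```

Here y has the same length as x but different values, and z matches x only in its non-negative entries. A square layer such as nn.Linear(512, 512) therefore adds another learned transformation rather than passing the tensor through unchanged.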

What does the 4D array returned by net.forward() in OpenCV DNN mean? I have little knowledge of deep learning

I need face detection for my homework. After searching the Internet, I found that using a pre-trained deep learning face detector model with OpenCV's DNN module is easy and works well. I learnt it from this tutorial: https://www.pyimagesearch.com/2018/02/26/face-detection-with-opencv-and-deep-learning/ , but I am really confused about the 4D array returned by net.forward():
import cv2
net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "res10_300x300_ssd_iter_140000_fp16.caffemodel")
def detect_img(net, image):
    blob = cv2.dnn.blobFromImage(image, 1.0, (300, 300), (104.0, 177.0, 123.0), False, False)
    net.setInput(blob)
    detections = net.forward()  # Here is the 4D array.
    print(detections.shape)
    return show_detections(image, detections)
I know almost nothing about deep learning. I guessed a few things from reading "deploy.prototxt", which seems to be a configuration file for the pre-trained model, but I am still really confused. Is there a quick way to understand the meaning of the 4D array? And could I, with little knowledge of deep learning, get a rough understanding of how the pre-trained model works within a week?
The 3rd dimension lets you iterate over the predictions, and the 4th dimension holds the actual results for each prediction:
class_label = int(inference_results[0, 0, i, 1]) --> gives the integer class ID of the ith box (an index, not a one-hot vector)
conf = inference_results[0, 0, i, 2] --> gives the confidence of the ith box prediction
TopLeftX, TopLeftY, BottomRightX, BottomRightY = inference_results[0, 0, i, 3:7] --> gives the bounding box coordinates, normalized to [0, 1]; multiply by the image width and height to get pixel coordinates
For this model the first two dimensions both have size 1, so the output shape is (1, 1, N, 7). The 2nd dimension would be used when the predictions are made in more than one stage (for example, YOLO makes predictions at 3 different layers).
In that case you could iterate over these predictions using the 2nd dimension, like [:, i, :, :].
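As a rough sketch of how the array is typically consumed, the function below walks the 3rd dimension and converts each row into a pixel-space box. It assumes each row is laid out as [image_id, class_id, confidence, x1, y1, x2, y2] with normalized coordinates; the function name, the 0.5 threshold, and the sample data are all made up for illustration. Nested lists stand in for the real array here, and the same indexing works on the NumPy array that net.forward() returns.

```python
def parse_detections(detections, image_width, image_height, min_confidence=0.5):
    """Turn a (1, 1, N, 7) detection array into pixel-space boxes.

    Assumes each row is [image_id, class_id, confidence, x1, y1, x2, y2]
    with coordinates normalized to [0, 1].
    """
    results = []
    for row in detections[0][0]:
        confidence = float(row[2])
        if confidence < min_confidence:
            continue  # skip weak candidates
        x1, y1, x2, y2 = (float(v) for v in row[3:7])
        box = (int(x1 * image_width), int(y1 * image_height),
               int(x2 * image_width), int(y2 * image_height))
        results.append((int(row[1]), confidence, box))
    return results

# Hypothetical output with N = 2 candidate boxes.
fake_detections = [[[
    [0.0, 1.0, 0.98, 0.25, 0.25, 0.75, 0.5],
    [0.0, 1.0, 0.10, 0.0, 0.0, 0.5, 0.5],
]]]
faces = parse_detections(fake_detections, image_width=400, image_height=200)
```

Only the first candidate survives the threshold, and its box is scaled to the 400x200 image, giving (100, 50, 300, 100).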

How to interpret patterns occurring in convolutional layer after training?

I am pretty sure I understand the principle of CNNs and why they are preferred over plain fully connected neural networks. What I am trying to comprehend is how to interpret the patterns that occur after training the model.
So let's assume I want to recognize the number "1" written on a 256x256 image plane (a 1-bit image, black/white), which is then forwarded to an output that says either "is a one" or "is not a one".
If the model is untrained and the first handwritten "1" is forwarded, the result could be [0.28, 0.72], which is obviously wrong. I then calculate the error between [0.28, 0.72] and [1, 0] (for example based on the mean squared error), take its derivative, and use that to move towards a local minimum (backpropagation). Then I calculate the delta values for each weight (using the chain rule and partial derivatives) until I finally reach the convolutional layer, for which delta values are also calculated.
But my question now is: What exactly do the patterns mean that occur when a bunch of delta values are added up into the convolutional layer "weights"? Why do they find certain features characteristic of the number "1"? Or is it more that the network does not find any specific features per se, but rather "encodes" the relationship between handwritten "1"s and the desired output [1, 0] into the convolutional layers?
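For reference, the error calculation described above can be written out as a small sketch. It assumes the mean squared error over the two outputs; the learning rate is a made-up value, and the update is applied directly to the outputs only to show the direction of the step (in a real network the chain rule carries this gradient back to the weights).

```python
def mse(output, target):
    """Mean squared error between two equal-length vectors."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)

def mse_gradient(output, target):
    """Partial derivative of the mean squared error w.r.t. each output."""
    n = len(output)
    return [2.0 * (o - t) / n for o, t in zip(output, target)]

output = [0.28, 0.72]
target = [1.0, 0.0]

error = mse(output, target)          # roughly 0.5184
grad = mse_gradient(output, target)  # roughly [-0.72, 0.72]

# Step against the gradient with a hypothetical learning rate.
learning_rate = 0.1
updated = [o - learning_rate * g for o, g in zip(output, grad)]
```

After the step, the outputs have moved towards [1, 0] and the error is smaller than before.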

Best output activation function for binary mask classification

I have a CNN which inputs a satellite image and should output a mask where it finds cars. I have manually labelled images and created masks for each image where each pixel is 1 if there is part of a car in that pixel, 0 otherwise.
I am trying to work out the best output layer activation function and loss function, and I'm fishing for opinions. I know there is a wealth of information out there but I find myself getting confused about whether my problem is regression or classification.
Could someone please offer their opinion? I am currently using the following output and loss in keras:
conv10 = Conv2D(1, 1, activation='sigmoid')(conv9)
model = Model(inputs=[inputs], outputs=[conv10])
model.compile(optimizer=Adam(learning_rate=1e-4), loss='binary_crossentropy', metrics=['accuracy'])
Is this a good idea? Thanks!
This seems like a good idea from my point of view, because you want to output a probability P(px is part of a car | image) for each pixel px in image. Therefore, that's a binary classification problem, for which using the binary_crossentropy loss function (plus a sigmoid activation in the output layer) is appropriate.
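To show what that combination computes per pixel, here is a small pure-Python sketch. It is not Keras code; the mask, the raw network outputs, the 0.5 threshold, and the clipping constant are made-up values for illustration.

```python
import math

def sigmoid(logit):
    """Squash a raw network output into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def binary_crossentropy(labels, probabilities, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

# Hypothetical flattened 2x2 mask (1 = car pixel) and raw network outputs.
mask = [1, 0, 0, 1]
logits = [2.0, -1.5, 0.3, -0.2]

probs = [sigmoid(z) for z in logits]
loss = binary_crossentropy(mask, probs)

# Thresholding the probabilities yields the predicted binary mask.
predicted_mask = [1 if p >= 0.5 else 0 for p in probs]
```

Each pixel is treated as its own yes/no decision, which is why this counts as classification rather than regression. The loss shrinks as the predicted probabilities approach the labels.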

feature number in tensorflow tf.nn.conv2d

In the TensorFlow example "Deep MNIST for Experts" https://www.tensorflow.org/get_started/mnist/pros
it is not clear to me how the number of features specified in the weight variables is determined.
For example:
We can now implement our first layer. It will consist of convolution,
followed by max pooling. The convolution will compute 32 features for
each 5x5 patch.
W_conv1 = weight_variable([5, 5, 1, 32])
Why is 32 picked here?
In order to build a deep network, we stack several layers of this
type. The second layer will have 64 features for each 5x5 patch.
W_conv2 = weight_variable([5, 5, 32, 64])
Again, why is 64 picked?
Now that the image size has been reduced to 7x7, we add a
fully-connected layer with 1024 neurons to allow processing on the
entire image.
W_fc1 = weight_variable([7 * 7 * 64, 1024])
Why 1024 here?
Thanks
Each of these filters actually does something: check for edges, check for colour changes, shift the image right or left, sharpen, blur, etc.
Each filter works towards finding the meaning of the image by sharpening, enhancing, smoothing, intensifying, and so on.
For example, this page explains what such filters do:
http://setosa.io/ev/image-kernels/
So all these filters are actually neurons whose outputs are max-pooled and, after some activation, eventually fed into an FC layer.
If you just want to understand the filters themselves, that is a separate topic. However, if you want to learn how conv. architectures work, then since these filter counts have been tried and tested on this dataset, you should just go with them for now.
The filters also learn through backprop.
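As an illustration of a filter that "checks for edges", here is a minimal pure-Python sketch that slides a hand-written 3x3 kernel over a tiny image. The image, the kernel values, and the function name are made up for the example. In a CNN the kernel values are not written by hand like this but learned through backprop.

```python
def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products.

    This sliding-window sum is the operation deep-learning libraries
    usually call convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# Tiny 4x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1] for _ in range(4)]

# Hand-written kernel that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 0, 1] for _ in range(3)]

response = correlate2d_valid(image, vertical_edge)
```

The response is 0 where the window covers only dark pixels and 3 where it straddles the edge, so this one filter acts as an edge detector. A layer with 32 filters computes 32 such response maps, each with its own learned kernel.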
32 and 64 are the numbers of filters in the respective layers.
1024 is the number of output neurons in the fully connected layer.
Your question is basically about the reason behind the choice of these hyperparameters.
There is no mathematical or programming reason behind these specific choices. They were picked after experiments because they delivered good accuracy on the MNIST dataset.
You can change these numbers; that is one way of modifying a model.
Unfortunately, you will not find the reasoning behind these particular values documented in TensorFlow or in other literature.
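While 32, 64 and 1024 are free choices, the other numbers in the weight shapes follow from the architecture. The sketch below works through that arithmetic. It assumes the convolutions keep the spatial size (SAME padding) and that each layer ends with a 2x2 max pool of stride 2, which matches the tutorial's statement that the image is reduced to 7x7; the helper names are made up.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights plus one bias per output feature."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

def pooled_size(size, pool=2):
    """Spatial size after a pool x pool max pool with stride pool."""
    return -(-size // pool)  # ceiling division

size = 28                 # MNIST images are 28x28
size = pooled_size(size)  # 14 after the first layer's pooling
size = pooled_size(size)  # 7 after the second layer's pooling

conv1 = conv_params(5, 5, 1, 32)    # weight shape [5, 5, 1, 32]
conv2 = conv_params(5, 5, 32, 64)   # weight shape [5, 5, 32, 64]

# The fully connected layer sees every feature at every position.
fc1_inputs = size * size * 64       # the 7 * 7 * 64 in W_fc1
fc1 = fc1_inputs * 1024 + 1024      # weights plus biases
```

The 32 in W_conv2 has to match the number of filters chosen for the first layer, and 7 * 7 * 64 has to match the flattened output of the second. Changing 32, 64 or 1024 changes the parameter counts and the model's capacity, which is why they are tuned by experiment.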
