I am using the tf.layers API for defining convolution and other layers in my network. One thing I stumbled upon is the kernel_initializer option for the convolution layers. Does this parameter refer to the weights of the convolution layer? If so, does that mean I can directly pass a weight matrix to that layer?
The short answer is yes; you can implement the initializer yourself and pass your own weight matrix through it.
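For example, a minimal sketch in the TF 1.x style (since tf.layers is being used); the weight array, its shape, and the input sizes below are illustrative assumptions:

import numpy as np
import tensorflow as tf

# Hypothetical precomputed weights, shaped (kernel_h, kernel_w, in_channels, out_channels).
my_weights = np.random.randn(3, 3, 1, 16).astype(np.float32)

inputs = tf.placeholder(tf.float32, [None, 28, 28, 1])
conv = tf.layers.conv2d(
    inputs, filters=16, kernel_size=3, padding='same',
    # constant_initializer fills the kernel with the given array,
    # so the layer starts from exactly these weights.
    kernel_initializer=tf.constant_initializer(my_weights))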
Some discussions about this topic:
https://github.com/tensorflow/tensorflow/issues/9744
https://stackoverflow.com/a/43284391/456105
I know that a dense layer means a classic fully connected layer, where each input goes to each neuron for multiplication. But recently some questions arose in my mind, and when I searched YouTube, blogs, StackOverflow and articles, nobody gave me a satisfying answer.
1 - Why do we need fully connected (dense) layers in neural networks, and what is their use? Can't we use sparse layers (meaning some inputs go to only some neurons, so not all neurons receive all inputs)?
2 - What will happen if we use sparse layers? I know the computation will be less, but what will be the effect on the output? Will the neurons be able to perform just like dense layers or not?
3 - Which is better to use in a neural network, sparse or dense layers? (Pros and cons)
4 - If we can use sparse layers and they perform well, why have I not heard this term as often as FCN (fully connected layer)?
A sparse layer is not the same as a dropout layer in a neural network. In a dropout layer you prune/drop some neurons, but the remaining neurons still get all the output from the previous layer. So they are not the same.
Thank you in advance for the help.
Using sparse layers would simply introduce more choices you would have to tweak: what would your sparse layer look like, and what connects to what? Using dense layers at least guarantees that every connection exists, so there is a chance it will be used.
You also answer your own question: evidently sparse layers are not better, or you would have heard of them. Dropout, on the other hand, is useful and widely used.
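To make the "what connects to what" point concrete, here is a rough keras-style sketch of what one hand-rolled sparse layer could look like; the mask density and layer sizes are arbitrary assumptions, which is exactly the extra tweaking mentioned above.

import numpy as np
import tensorflow as tf

# Arbitrary fixed connectivity: each output unit only sees about 25% of the inputs.
# Choosing this mask is itself an extra design decision that a dense layer avoids.
in_dim, out_dim = 128, 64
mask = (np.random.rand(in_dim, out_dim) < 0.25).astype(np.float32)

class SparseDense(tf.keras.layers.Layer):
    def __init__(self, units, mask):
        super().__init__()
        self.units = units
        self.mask = tf.constant(mask)

    def build(self, input_shape):
        self.kernel = self.add_weight(shape=(int(input_shape[-1]), self.units), name='kernel')
        self.bias = self.add_weight(shape=(self.units,), initializer='zeros', name='bias')

    def call(self, x):
        # Zero out the pruned connections on every forward pass.
        return tf.matmul(x, self.kernel * self.mask) + self.bias

layer = SparseDense(out_dim, mask)

Whether any particular mask actually helps is an empirical question; dropout sidesteps it by sampling the dropped units randomly during training instead of fixing them in advance.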
I am implementing a CNN model for detection of forged images. The paper I am referring to asks to initialize the kernel weights of the first layer with 30 basic high-pass filters (the ones used in the calculation of residual maps in SRM). What are these high-pass filters, and how do I do this?
Also, is there any function so that these filters can be applied to a batch of images instead of a single image at a time, similar to ImageDataGenerator?
Research Paper Reference: https://ieeexplore.ieee.org/document/7823911
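One common pattern, sketched here under assumptions (the random array is only a placeholder for the 30 SRM high-pass kernels from the paper), is to feed the fixed filter bank to the first Conv2D through a custom initializer. Note that a Conv2D layer already operates on whole batches of shape (batch, height, width, channels), so nothing extra is needed for the batch part.

import numpy as np
import tensorflow as tf

# Placeholder filter bank; replace with the 30 SRM high-pass kernels,
# stacked as (height, width, in_channels, out_channels).
srm_filters = np.random.randn(5, 5, 1, 30).astype(np.float32)

def srm_init(shape, dtype=None):
    # Custom initializer: ignore the requested random init and return the fixed bank.
    return tf.constant(srm_filters, dtype=dtype)

first_layer = tf.keras.layers.Conv2D(
    filters=30, kernel_size=5, padding='same',
    kernel_initializer=srm_init,
    trainable=False)  # keep the residual filters fixed, if the paper requires it

batch = np.random.rand(8, 256, 256, 1).astype(np.float32)
residual_maps = first_layer(batch)  # shape (8, 256, 256, 30)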
The U-Net has activation functions in all the layers, but there seems to be no activation function in the upsampling layer (which is done using transpose convolution). Why does this offer more efficiency than having an activation function?
From my understanding, activation functions offer non-linearity. So the question really is: what benefit is there to maintaining linearity in the transpose convolutions while keeping non-linearity in the regular convolutions? Wouldn't it always be best to have an activation function in these layers?
My only other intuition is that perhaps they're trying to keep the upsampling as closely related as possible to regular morphological interpolation methods.
I think your interpretation is right: they were just trying to keep the process similar to upsampling with classic interpolation methods, for better interpretability of the architecture (while still allowing the network the flexibility to learn the best weights for the upsampling). In general, if you want to add more non-linearity, you can add any desired activation function (such as a ReLU) after that layer, but personally, from my experience, I would say that performance will not change much.
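As an assumed, keras-style illustration of that option, the upsampling can either stay linear or have a non-linearity added right after it:

import tensorflow as tf

# Linear transpose convolution, as in the questioner's reading of the architecture.
up_linear = tf.keras.layers.Conv2DTranspose(64, kernel_size=2, strides=2)

# The same upsampling step with an explicit non-linearity added afterwards.
up_relu = tf.keras.Sequential([
    tf.keras.layers.Conv2DTranspose(64, kernel_size=2, strides=2),
    tf.keras.layers.ReLU(),
])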
In the original paper (Section 2), there seems to be a ReLU activation function in the upsampling path?
"Every step in the expansive path consists of an upsampling of the
feature map followed by a 2x2 convolution (“up-convolution”) that halves the
number of feature channels, a concatenation with the correspondingly cropped
feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU."
https://arxiv.org/pdf/1505.04597.pdf
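Read literally, one expansive-path step from that excerpt could be sketched in keras roughly as follows (filter counts are assumed, and the cropping of the skip connection is glossed over):

import tensorflow as tf
from tensorflow.keras import layers

def expansive_step(x, skip, filters):
    # 2x2 "up-convolution" that halves the number of feature channels
    x = layers.Conv2DTranspose(filters, kernel_size=2, strides=2)(x)
    # concatenation with the (cropped) feature map from the contracting path
    x = layers.Concatenate()([x, skip])
    # two 3x3 convolutions, each followed by a ReLU
    x = layers.Conv2D(filters, 3, activation='relu')(x)
    x = layers.Conv2D(filters, 3, activation='relu')(x)
    return x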
The assumption made by the OP is incorrect. The upsampling layers in the UNet do involve an activation function. Below is a screenshot from the video on the page linked by the OP that shows the exact upconvolution operation used in UNet. You can see that they use ReLU in this operation.
I am confused about how to correctly use dropout with an RNN in keras, specifically with GRU units. The keras documentation refers to this paper (https://arxiv.org/abs/1512.05287), and I understand that the same dropout mask should be used for all time-steps. This is achieved by the dropout argument when specifying the GRU layer itself. What I don't understand is:
Why there are several examples over the internet, including keras' own example (https://github.com/keras-team/keras/blob/master/examples/imdb_bidirectional_lstm.py) and the "Trigger word detection" assignment in Andrew Ng's Coursera Sequence Models course, where they add a dropout layer explicitly, "model.add(Dropout(0.5))", which, in my understanding, will apply a different mask at every time-step.
The paper mentioned above suggests that doing this is inappropriate and we might lose the signal as well as long-term memory due to the accumulation of this dropout noise over all the time-steps.
But then, how are these models (using different dropout masks at every time-step) able to learn and perform well?
I myself have trained a model which uses a different dropout mask at every time-step, and although I haven't gotten the results I wanted, the model is able to overfit the training data. This, in my understanding, invalidates the "accumulation of noise" and "signal getting lost" argument over all the time-steps (I have 1000-time-step series being input to the GRU layers).
Any insights, explanations or experience with the situation will be helpful. Thanks.
UPDATE:
To make it clearer, I'll mention an extract from the keras documentation of the Dropout layer: "noise_shape: 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input. For instance, if your inputs have shape (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features)."
So, I believe, it can be seen that when using the Dropout layer explicitly and needing the same mask at every time-step (as mentioned in the paper), we need to edit this noise_shape argument, which is not done in the examples I linked earlier.
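For reference, a minimal sketch of what that noise_shape edit would look like (the feature dimension below is just an assumed placeholder):

import tensorflow as tf

features = 64  # assumed feature dimension, for illustration only

# Default Dropout: a different mask is sampled at every time-step.
per_step = tf.keras.layers.Dropout(0.5)

# Same mask for all time-steps, as the documentation describes: set the time-step
# axis of noise_shape to 1 (the batch axis is left as None so it is inferred).
per_sequence = tf.keras.layers.Dropout(0.5, noise_shape=(None, 1, features))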
As Asterisk explained in his comment, there is a fundamental difference between dropout within a recurrent unit and dropout after the unit's output. This is the architecture from the keras tutorial you linked in your question:
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

# max_features and maxlen are defined earlier in the linked keras example
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))  # applied once, to the LSTM's final output
model.add(Dense(1, activation='sigmoid'))
You're adding a dropout layer after the LSTM has finished its computation, meaning that there won't be any more recurrent passes in that unit. Imagine this dropout layer as teaching the network not to rely on the output for a specific feature of a specific time step, but to generalize over information in different features and time steps. Dropout here is no different from dropout in feed-forward architectures.
What Gal & Ghahramani propose in their paper (which you linked in the question) is dropout within the recurrent unit. There, you're dropping input information between the time steps of a sequence. I found this blogpost to be very helpful to understand the paper and how it relates to the keras implementation.
I know there are ways to use specific convolution filters, like dilated convolution via the op tf.nn.atrous_conv2d. But if I want to realize a structure like Convolution in Convolution for Network in Network, how can I change the filter of an op like tf.nn.conv2d?
Is writing a brand new op the only way to achieve that?
Or, if I would like to use a data transformation to realize it, will automatic differentiation still work? Is there any documentation on using data transformations?
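For example, composing existing differentiable ops seems to work for an NIN-style block: the sketch below (sizes assumed, and not the exact CiC operator) builds a 3x3 convolution followed by 1x1 convolutions, and tf.gradients differentiates through the whole composition with no custom op. I am unsure whether a CiC-style filter structure can always be expressed this way.

import tensorflow as tf

# Not the exact CiC operator: just an NIN-style "mlpconv" built from existing ops.
x = tf.placeholder(tf.float32, [None, 32, 32, 3])
w1 = tf.Variable(tf.random_normal([3, 3, 3, 16]))
w2 = tf.Variable(tf.random_normal([1, 1, 16, 16]))  # 1x1 "micro-network" layer

h = tf.nn.relu(tf.nn.conv2d(x, w1, strides=[1, 1, 1, 1], padding='SAME'))
y = tf.nn.relu(tf.nn.conv2d(h, w2, strides=[1, 1, 1, 1], padding='SAME'))

# Automatic differentiation flows through the composed ops without a new op.
grads = tf.gradients(tf.reduce_sum(y), [w1, w2])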