How to use tf.nn.crelu in tensorflow? - machine-learning

I am trying different activation functions in my simple neural network.
It does not matter using tf.nn.relu, tf.nn.sigmoid,... the network does what it should do.
But if I am using tf.nn.crelu, I have a dimension error.
It returns something like [max, min] and the width dimension is twice bigger.
What do I have to do? Fitting the following weights and biases to the output of crelu?

You're right, if you're building the network manually, you need to adjust the dimensions of the following layer to match tf.nn.crelu output. In this sense, tf.nn.crelu is not interchangeable with tf.nn.relu, tf.nn.elu, etc.
The situation is simpler if you use a high-level API, e.g. tensorflow slim. In this case, the layer functions are taking care of matching dimensions, so you can replace tf.nn.relu easily with tf.nn.crelu in code. However, keep in mind that the network is silently becoming twice as big.
Here's an example:
with slim.arg_scope([slim.conv2d, slim.fully_connected],
activation_fn=tf.nn.crelu,
normalizer_fn=slim.batch_norm,
normalizer_params={'is_training': is_training, 'decay': 0.95}):
conv1 = slim.conv2d(x_image, 16, [5, 5], scope='conv1')
pool1 = slim.max_pool2d(conv1, [2, 2], scope='pool1')
conv2 = slim.conv2d(pool1, 32, [5, 5], scope='conv2')
pool2 = slim.max_pool2d(conv2, [2, 2], scope='pool2')
flatten = slim.flatten(pool2)
fc = slim.fully_connected(flatten, 1024, scope='fc1')
drop = slim.dropout(fc, keep_prob=keep_prob)
logits = slim.fully_connected(drop, 10, activation_fn=None, scope='logits')
slim.arg_scope simply applies all provided arguments to the underlying layers, in particular activation_fn. Also note activation_fn=None in the last layer to fix the output dimension. Complete code can be found here.

Related

Pytorch VNet final softmax activation layer for segmentation. Different channel dimensions to labels. How do I get prediction output?

I am trying to build a V-Net. When I pass the images to segment during training, the output has 2 channels after the softmax activation (as specified in the architecture in the attached image) but the label and input has 1. How do I convert this such that output is the segmented image? Do I just take one of the channels as the final output when training (e.g output = output[:, 0, :, :, :]) and the other channel would be background?
outputs = network(inputs)
batch_size = 32
outputs.shape: [32, 2, 64, 128, 128]
inputs.shape: [32, 1, 64, 128, 128]
labels.shape: [32, 1, 64, 128, 128]
Here is my Vnet forward pass:
def forward(self, x):
# Initial input transition
out = self.in_tr(x)
# Downward transitions
out, residual_0 = self.down_depth0(out)
out, residual_1 = self.down_depth1(out)
out, residual_2 = self.down_depth2(out)
out, residual_3 = self.down_depth3(out)
# Bottom layer
out = self.up_depth4(out)
# Upward transitions
out = self.up_depth3(out, residual_3)
out = self.up_depth2(out, residual_2)
out = self.up_depth1(out, residual_1)
out = self.up_depth0(out, residual_0)
# Pass to convert to 2 channels
out = self.final_conv(out)
# return softmax
out = F.softmax(out)
return out [batch_size, 2, 64, 128, 128]
V Net architecture as described in (https://arxiv.org/pdf/1606.04797.pdf)
That paper has two outputs as they predict two classes:
The network predictions, which consist of two volumes having the same resolution as the original input data, are processed through a soft-max layer which
outputs the probability of each voxel to belong to foreground and to background.
Therefore this is not an autoencoder, where your inputs are passed back through the model as outputs. They use a set of labels which distinguish between their pixels of interest (foreground) and other (background). You will need to change your data if you wish to use the V-net in this manner.
It won't be as simple as designating a channel as output because this will be a classification task rather than a regression task. You will need annotated labels to work with this model architecture.

Converting keras code to pytorch code with Conv1D layer

I have some keras code that I need to convert to Pytorch. I've done some research but so far I am not able to reproduce the results I got from keras. I have spent many hours on this any tips or help is very appreciated.
Here is the keras code I am dealing with. The input shape is (None, 105, 768) where None is the batch size and I want to apply Conv1D to the input. The desire output in keras is (None, 105)
x = tf.keras.layers.Dropout(0.2)(input)
x = tf.keras.layers.Conv1D(1,1)(x)
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Activation('softmax')(x)
What I've tried, but worse in term of results:
self.conv1d = nn.Conv1d(768, 1, 1)
self.dropout = nn.Dropout(0.2)
self.softmax = nn.Softmax()
def forward(self, input):
x = self.dropout(input)
x = x.view(x.shape[0],x.shape[2],x.shape[1])
x = self.conv1d(x)
x = torch.squeeze(x, 1)
x = self.softmax(x)
The culprit is your attempt to swap the dimensions of the input around, since Keras and PyTorch have different conventions for the dimension order.
x = x.view(x.shape[0],x.shape[2],x.shape[1])
.view() does not swap the dimensions, but changes which part of the data is part of a given dimension. You can consider it as a 1D array, then you decide how many steps you take to cover the dimension. An example makes it much simpler to understand.
# Let's start with a 1D tensor
# That's how the underlying data looks in memory.
x = torch.arange(6)
# => tensor([0, 1, 2, 3, 4, 5])
# How the tensor looks when using Keras' convention (expected input)
keras_version = x.view(2, 3)
# => tensor([[0, 1, 2],
# [3, 4, 5]])
# Vertical isn't swapped with horizontal, but the data is arranged differently
# The numbers are still incrementing from left to right
incorrect_pytorch_version = keras_version.view(3, 2)
# => tensor([[0, 1],
# [2, 3],
# [4, 5]])
To swap the dimensions you need to use torch.transpose.
correct_pytorch_version = keras_version.transpose(0, 1)
# => tensor([[0, 3],
# [1, 4],
# [2, 5]])

Error: "DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 9")" using Flux in Julia

i'm still new in Julia and in machine learning in general, but I'm quite eager to learn. In the current project i'm working on I have a problem about dimensions mismatch, and can't figure what to do.
I have two arrays as follow:
x_array:
9-element Array{Array{Int64,N} where N,1}:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 72, 73]
[11, 12, 13, 14, 15, 16, 17, 72, 73]
[18, 12, 19, 20, 21, 22, 72, 74]
[23, 24, 12, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 72, 74]
[36, 37, 38, 39, 40, 38, 41, 42, 72, 73]
[43, 44, 45, 46, 47, 48, 72, 74]
[49, 50, 51, 52, 14, 53, 72, 74]
[54, 55, 41, 56, 57, 58, 59, 60, 61, 62, 63, 62, 64, 72, 74]
[65, 66, 67, 68, 32, 69, 70, 71, 72, 74]
y_array:
9-element Array{Int64,1}
75
76
77
78
79
80
81
82
83
and the next model using Flux:
model = Chain(
LSTM(10, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
softmax
)
I zip both arrays, and then feed them into the model using Flux.train!
data = zip(x_array, y_array)
Flux.train!(loss, Flux.params(model), data, opt)
and immediately throws the next error:
ERROR: DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 9")
Now, I know that the first dimension of matrix A is the sum of the hidden layers (256 + 256 + 128 + 128 + 128 + 128) and the second dimension is the input layer, which is 10. The first thing I did was change the 10 for a 9, but then it only throws the error:
ERROR: DimensionMismatch("dimensions must match")
Can someone explain to me what dimensions are the ones that mismatch, and how to make them match?
Introduction
First off, you should know that from an architectural standpoint, you are asking something very difficult from your network; softmax re-normalizes outputs to be between 0 and 1 (weighted like a probability distribution), which means that asking your network to output values like 77 to match y will be impossible. That's not what is causing the dimension mismatch, but it's something to be aware of. I'm going to drop the softmax() at the end to give the network a fighting chance, especially since it's not what's causing the problem.
Debugging shape mismatches
Let's walk through what actually happens inside of Flux.train!(). The definition is actually surprisingly simple. Ignoring everything that doesn't matter to us, we are left with:
for d in data
gs = gradient(ps) do
loss(d...)
end
end
Therefore, let's start by pulling the first element out of your data, and splatting it into your loss function. You didn't specify your loss function or optimizer in the question. Although softmax usually means you should use crossentropy loss, your y values are very much not probabilities, and so if we drop the softmax we can just use the dead-simple mse() loss. For optimizer, we'll default to good old ADAM:
model = Chain(
LSTM(10, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
#softmax, # commented out for now
)
loss(x, y) = Flux.mse(model(x), y)
opt = ADAM(0.001)
data = zip(x_array, y_array)
Now, to simulate the first run of Flux.train!(), we take first(data) and splat that into loss():
loss(first(data)...)
This gives us the error message you've seen before; ERROR: DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 12"). Looking at our data, we see that yes, indeed, the first element of our dataset has a length of 12. And so we will change our model to instead expect 12 values instead of 10:
model = Chain(
LSTM(12, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
)
And now we re-run:
julia> loss(first(data)...)
50595.52542674723 (tracked)
Huzzah! It worked! We can run this again:
julia> loss(first(data)...)
50578.01417593167 (tracked)
The value changes because the RNN holds memory within itself which gets updated each time we run the network, otherwise we would expect the network to give the same answer for the same inputs!
The problem comes, however, when we try to run the second training instance through our network:
julia> loss([d for d in data][2]...)
ERROR: DimensionMismatch("matrix A has dimensions (1024,12), vector B has length 9")
Understanding LSTMs
This is where we run into Machine Learning problems more than programming problems; the issue here is that we have promised to feed that first LSTM network a vector of length 10 (well, 12 now) and we are breaking that promise. This is a general rule of deep learning; you always have to obey the contracts you sign about the shape of the tensors that are flowing through your model.
Now, the reasons you're using LSTMs at all is probably because you want to feed in ragged data, chew it up, then do something with the result. Maybe you're processing sentences, which are all of variable length, and you want to do sentiment analysis, or somesuch. The beauty of recurrent architectures like LSTMs is that they are able to carry information from one execution to another, and they are therefore able to build up an internal representation of a sequence when applied upon one time point after another.
When building an LSTM layer in Flux, you are therefore declaring not the length of the sequence you will feed in, but rather the dimensionality of each time point; imagine if you had an accelerometer reading that was 1000 points long and gave you X, Y, Z values at each time point; to read that in, you would create an LSTM that takes in a dimensionality of 3, then feed it 1000 times.
Writing our own training loop
I find it very instructive to write our own training loop and model execution function so that we have full control over everything. When dealing with time series, it's often easy to get confused about how to call LSTMs and Dense layers and whatnot, so I offer these simple rules of thumb:
When mapping from one time series to another (E.g. constantly predict future motion from previous motion), you can use a single Chain and call it in a loop; for every input time point, you output another.
When mapping from a time series to a single "output" (E.g. reduce sentence to "happy sentiment" or "sad sentiment") you must first chomp all the data up and reduce it to a fixed size; you feed many things in, but at the end, only one comes out.
We're going to re-architect our model into two pieces; first the recurrent "pacman" section, where we chomp up a variable-length time sequence into an internal state vector of pre-determined length, then a feed-forward section that takes that internal state vector and reduces it down to a single output:
pacman = Chain(
LSTM(1, 128), # map from timepoint size 1 to 128
LSTM(128, 256), # blow it up even larger to 256
LSTM(256, 128), # bottleneck back down to 128
)
reducer = Chain(
Dense(128, 9),
#softmax, # keep this commented out for now
)
The reason we split it up into two pieces like this is because the problem statement wants us to reduce a variable-length input series to a single number; we're in the second bullet point above. So our code naturally must take this into account; we will write our loss(x, y) function to, instead of calling model(x), it will instead do the pacman dance, then call the reducer on the output. Note that we also must reset!() the RNN state so that the internal state is cleared for each independent training example:
function loss(x, y)
# Reset internal RNN state so that it doesn't "carry over" from
# the previous invocation of `loss()`.
Flux.reset!(pacman)
# Iterate over every timepoint in `x`
for x_t in x
y_hat = pacman(x_t)
end
# Take the very last output from the recurrent section, reduce it
y_hat = reducer(y_hat)
# Calculate reduced output difference against `y`
return Flux.mse(y_hat, y)
end
Feeding this into Flux.train!() actually trains, albeit not very well. ;)
Final observations
Although your data is all Int64's, it's pretty typical to use floating point numbers with everything except embeddings (an embedding is a way to take non-numeric data such as characters or words and assign numbers to them, kind of like ASCII); if you're dealing with text, you're almost certainly going to be working with some kind of embedding, and that embedding will dictate what the dimensionality of your first LSTM is, whereupon your inputs will all be "one-hot" encoded.
softmax is used when you want to predict probabilities; it's going to ensure that for each input, the outputs are all between [0...1] and moreover that they sum to 1.0, like a good little probability distribution should. This is most useful when doing classification, when you want to wrangle your wild network output values of [-2, 5, 0.101] into something where you can say "we have 99.1% certainty that the second class is correct, and 0.7% certainty it's the third class."
When training these networks, you're often going to want to batch multiple time series at once through your network for hardware efficiency reasons; this is both simple and complex, because on one hand it just means that instead of passing a single Sx1 vector through (where S is the size of your embedding) you're instead going to be passing through an SxN matrix, but it also means that the number of timesteps of everything within your batch must match (because the SxN must remain the same across all timesteps, so if one time series ends before any of the others in your batch you can't just drop it and thereby reduce N halfway through a batch). So what most people do is pad their timeseries all to the same length.
Good luck in your ML journey!

How does correlation work for an even-sized filter in this example?

(I know a question like this exists, but I wanted help with a specific example)
If the linear filter has even dimensions, how is the "center" defined? i.e. in the following scenario:
filter = np.array([[a, b],
[c, d]])
and the image was:
image = np.array([[0, 1, 0],
[1, 1, 1],
[0, 1, 0]])
what would be the result of correlation of the image with the linear filter?
Which of the elements of the even-sized filter is considered the origin is an arbitrary choice. Each implementation will make a different choice. Though a and d are the two most likely choices for reasons of similarity of the two image dimensions.
For example, MATLAB's imfilter (which implements correlation, not convolution) does the following:
f = [1,2;4,8];
img = [0,1,0;1,1,1;0,1,0];
imfilter(img,f,'same')
ans =
14 13 4
11 7 1
2 1 0
meaning that a is the origin of the kernel in this case. Other implementations might make a different choice.

The dark mystery of tensorflow and tensorboard using cross-validation in training. Weird graphs showing up

This is the first time I'm using tensorboard, as I am getting a weird bug for my graph.
This is what I get if I open up the 'STEP' window.
However, this is what I get if I open up the 'RELATIVE'. (Similary when opening the 'WALL' window).
In addition to that, to test the performance of the model, I apply cross-validation every few steps. The accuracy of this cross-validation drops from ~10% (random guessing), to 0% after some time. I am not sure where I have made a mistake, as I am not a pro with tensorflow, but I suspect my problem to be in the graph building. The code looks as follows:
def initialize_parameters():
global_step = tf.get_variable("global_step", shape=[], trainable=False,
initializer=tf.constant_initializer(1), dtype=tf.int64)
Weights = {
"W_Conv1": tf.get_variable("W_Conv1", shape=[3, 3, 1, 64],
initializer=tf.random_normal_initializer(mean=0.00, stddev=0.01),
),
...
"W_Affine3": tf.get_variable("W_Affine3", shape=[128, 10],
initializer=tf.random_normal_initializer(mean=0.00, stddev=0.01),
)
}
Bias = {
"b_Conv1": tf.get_variable("b_Conv1", shape=[1, 16, 8, 64],
initializer=tf.random_normal_initializer(mean=0.00, stddev=0.01),
),
...
"b_Affine3": tf.get_variable("b_Affine3", shape=[1, 10],
initializer=tf.random_normal_initializer(mean=0.00, stddev=0.01),
)
}
return Weights, Bias, global_step
def build_model(W, b, global_step):
keep_prob = tf.placeholder(tf.float32)
learning_rate = tf.placeholder(tf.float32)
is_training = tf.placeholder(tf.bool)
## 0.Layer: Input
X_input = tf.placeholder(shape=[None, 16, 8], dtype=tf.float32, name="X_input")
y_input = tf.placeholder(shape=[None, 10], dtype=tf.int8, name="y_input")
inputs = tf.reshape(X_input, (-1, 16, 8, 1)) #must be a 4D input into the CNN layer
inputs = tf.contrib.layers.batch_norm(
inputs,
center=False,
scale=False,
is_training=is_training
)
## 1. Layer: Conv1 (64, stride=1, 3x3)
inputs = layer_conv(inputs, W['W_Conv1'], b['b_Conv1'], is_training)
...
## 7. Layer: Affine 3 (128 units)
logits = layer_affine(inputs, W['W_Affine3'], b['b_Affine3'], is_training)
## 8. Layer: Softmax, or loss otherwise
predict = tf.nn.softmax(logits) #should be an argmax, or should this even go through
## Output: Loss functions and model trainers
loss = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(
labels=y_input,
logits=logits
)
)
trainer = tf.train.GradientDescentOptimizer(
learning_rate=learning_rate
)
updateModel = trainer.minimize(loss, global_step=global_step)
## Test Accuracy
correct_pred = tf.equal(tf.argmax(y_input, 1), tf.argmax(predict, 1))
acc_op = tf.reduce_mean(tf.cast(correct_pred, "float"))
return X_input, y_input, loss, predict, updateModel, keep_prob, learning_rate, is_training
Now I suspect my error to be in the definition of the loss-function of the graph, but I am not sure. Any idea what the problem could be? Or does the model converge correctly and all those errors are expected?
Yes I think you are runing same model more than once with your cross-validation implementation.
Just try at the end of every loop
session.close()
I suspect you are getting such strange output (and I have seen similar myself) because you are running the same model more than once and it is saving the Tensorboard output in exactly the same place. I can't see in your code how you name the file where you are putting the output? Try to make the file path in this part of code unique:
`summary_writer = tf.summary.FileWriter(unique_path_to_log, sess.graph)`
You can also try to locate the directory where your existing output has bene put in and try to remove the files that have the older (or newer?) timestamps and this way Tensorboard will not be confused as to which one to use.

Resources