Why is ReLU used in regression with Neural Networks? - machine-learning

I am following the official TensorFlow with Keras tutorial and I got stuck here: Predict house prices: regression - Create the model
Why is an activation function used for a task where a continuous value is predicted?
The code is:
def build_model():
    model = keras.Sequential([
        keras.layers.Dense(64, activation=tf.nn.relu,
                           input_shape=(train_data.shape[1],)),
        keras.layers.Dense(64, activation=tf.nn.relu),
        keras.layers.Dense(1)
    ])
    optimizer = tf.train.RMSPropOptimizer(0.001)
    model.compile(loss='mse', optimizer=optimizer, metrics=['mae'])
    return model

The general reason for using non-linear activation functions in hidden layers is that, without them, no matter how many layers or how many units per layer, the network would behave just like a simple linear unit. This is nicely explained in this short video by Andrew Ng: Why do you need non-linear activation functions?
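To make the linearity argument concrete, here is a minimal NumPy sketch. It is not from the tutorial: the weights are random and the 13-feature input size is only an assumption for illustration. It checks that two stacked layers without an activation reduce to a single linear layer, and that a ReLU in between breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with arbitrary illustrative weights (assumed sizes: 13 -> 64 -> 1).
W1, b1 = rng.normal(size=(13, 64)), rng.normal(size=64)
W2, b2 = rng.normal(size=(64, 1)), rng.normal(size=1)

x = rng.normal(size=(5, 13))  # 5 samples, 13 features

# Stacking the two layers with no activation in between ...
stacked = (x @ W1 + b1) @ W2 + b2

# ... gives the same result as one linear layer with merged weights and bias.
W, b = W1 @ W2, b1 @ W2 + b2
single = x @ W + b
linear_collapses = np.allclose(stacked, single)

# With a ReLU between the layers, the merged linear layer no longer matches.
with_relu = np.maximum(x @ W1 + b1, 0.0) @ W2 + b2
relu_differs = not np.allclose(with_relu, single)

print(linear_collapses, relu_differs)
```

The same argument extends to any number of activation-free layers, since a product of matrices is again a single matrix.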
In your case, looking more closely, you'll see that the activation function of your final layer is not the relu as in your hidden layers, but the linear one (which is the default activation when you don't specify anything, like here):
keras.layers.Dense(1)
From the Keras docs:
Dense
[...]
Arguments
[...]
activation: Activation function to use (see activations). If you don't specify anything, no activation is applied (ie. "linear" activation: a(x) = x).
which is indeed what is expected for a regression network with a single continuous output.

Related

getting values between layers of a neural network using functional API

I am using the simple neural network below for classification:
model2 = keras.Sequential([
    keras.layers.Flatten(input_shape=(X.shape[1], X.shape[2])),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(5, activation='softmax')
])
I would like to check how the values for different classes change after going through each layer using the Euclidean distance to see which layer is the most useful and which is the least.
Does it make sense? If so, how can I do it using functional API? I encountered the first problems at the very beginning - they are related to the shape of the data - originally it is (1991, 13, 1292).

Can the number of units in NN input layer be different than the number of features in the data?

Based on the TensorFlow Keras API tutorial:
model = keras.Sequential([
    keras.layers.Dense(10, activation='softmax', input_shape=(32,)),
    keras.layers.Dense(10, activation='softmax')
])
I don't understand why the number of units in the input layer is 10 while the input shape is 32. There are also many examples like this one in the TensorFlow tutorials.
This is a rather common confusion among new practitioners, and not without reason: the answer, as already hinted at in the comments, is that in the Keras Sequential API there is an implicit input layer, determined by the input_shape argument of the first explicit layer.
This is directly visible in the Keras Functional API (check the example in the docs), where Input is an explicit layer itself, and in which your model would be written as:
inputs = Input(shape=(32,)) # input layer
x = Dense(10, activation='softmax')(inputs) # hidden layer
outputs = Dense(10, activation='softmax')(x) # output layer
model = Model(inputs, outputs)
i.e. your model is actually an example of a "good old" neural net with three layers (input, hidden, and output), even though it looks like a two-layer net in the Keras Sequential API.
(BTW, and irrelevant to the question, it does not make much sense to have softmax as activation for your hidden layer.)
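One way to convince yourself of the implicit input layer is to count parameters. The sketch below is plain Python; dense_param_count is a hypothetical helper written for this illustration, not a Keras function. The 32 inputs show up as the first dimension of the first Dense layer's kernel, not as a layer with its own weights.

```python
def dense_param_count(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: kernel plus biases."""
    return n_inputs * n_units + n_units


input_features = 32  # from input_shape=(32,): the implicit input layer, no weights

# First Dense(10): a 32x10 kernel plus 10 biases.
hidden = dense_param_count(input_features, 10)

# Second Dense(10): a 10x10 kernel plus 10 biases.
output = dense_param_count(10, 10)

print(hidden, output, hidden + output)  # 330 110 440
```

These per-layer counts are what you would expect model.summary() to list for the two Dense layers.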

Tensorflow single sigmoid output with log loss vs two linear outputs with sparse softmax cross entropy loss for binary classification

I am experimenting with a binary classifier implementation in TensorFlow. If I have two plain outputs (i.e. no activation) in the final layer and use tf.losses.sparse_softmax_cross_entropy, my network trains as expected. However, if I change the output layer to produce a single output with a tf.sigmoid activation and use tf.losses.log_loss as the loss function, my network does not train (i.e. loss/accuracy does not improve).
Here is what my output layer/loss function looks like in the first (i.e. working) case:
out = tf.layers.dense(prev, 2)
loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=out)
In the second case, I have the following:
out = tf.layers.dense(prev, 1, activation=tf.sigmoid)
loss = tf.losses.log_loss(labels=y, predictions=out)
Tensor y is a vector of 0/1 values; it is not one-hot encoded. The network learns as expected in the first case, but not in the second case. Apart from these two lines, everything else is kept the same.
I do not understand why the second set-up does not work. Interestingly, if I express the same network in Keras and use the second set-up, it works. Am I using the wrong TensorFlow functions to express my intent in the second case? I'd like to produce a single sigmoid output and use binary cross-entropy loss to train a simple binary classifier.
I'm using Python 3.6 and TensorFlow 1.4.
Here is a small, runnable Python script to demonstrate the issue. Note that you need to have downloaded the StatOil/C-CORE dataset from Kaggle to be able to run the script as is.
Thanks!
Using a sigmoid activation on two outputs doesn't give you a probability distribution:
import tensorflow as tf
import tensorflow.contrib.eager as tfe
tfe.enable_eager_execution()
start = tf.constant([[4., 5.]])
out_dense = tf.layers.dense(start, units=2)
print("Logits (un-transformed)", out_dense)
out_sigmoid = tf.layers.dense(start, units=2, activation=tf.sigmoid)
print("Elementwise sigmoid", out_sigmoid)
out_softmax = tf.nn.softmax(tf.layers.dense(start, units=2))
print("Softmax (probability distribution)", out_softmax)
Prints:
Logits (un-transformed) tf.Tensor([[-3.64021587 6.90115976]], shape=(1, 2), dtype=float32)
Elementwise sigmoid tf.Tensor([[ 0.94315267 0.99705648]], shape=(1, 2), dtype=float32)
Softmax (probability distribution) tf.Tensor([[ 0.05623185 0.9437682 ]], shape=(1, 2), dtype=float32)
Instead of tf.nn.softmax, you could also use tf.sigmoid on a single logit, then set the other output to one minus that.
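As a rough illustration of that last remark, here is a plain-Python sketch (no TensorFlow; the logit values are arbitrary assumptions). It checks that a sigmoid on a single logit z, paired with one minus that value, matches a two-class softmax whose other logit is fixed at 0, while an elementwise sigmoid on two independent logits does not sum to 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


z = 1.7  # arbitrary single logit
p = sigmoid(z)
two_class = [1.0 - p, p]       # single sigmoid output plus its complement
reference = softmax([0.0, z])  # two-class softmax with the other logit at 0
matches = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(two_class, reference))

# Elementwise sigmoid on two independent logits: each value is in (0, 1),
# but together they are not a probability distribution.
independent_sum = sigmoid(2.0) + sigmoid(3.0)

print(matches, independent_sum)
```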

Visual proof that neural network can approximate any function

I want to create a neural network that fits my sample exactly; however, at some point the learning process stops. Different optimizers like adam or sgd don't help either.
Cybenko proved that the statement from the title is true for any sigmoidal function as an activation function so I want to use either sigmoid or tanh. I want also to keep only one hidden layer.
What tweaks can I use? Here is my setup (python3) and results:
features = np.linspace(-0.5,.5,2000).reshape(2000,1)
target = np.sin(12*features)
hn = 100
nn = models.Sequential()
#Only one hidden layer
nn.add(layers.Dense(units=hn, activation='sigmoid',
                    input_shape=(features.shape[1],)))
nn.add(layers.Dense(units=1, activation='linear'))
# Compile neural network
nn.compile(loss='mse',
           optimizer='RMSprop',
           metrics=['mse'])
nn.fit(features,
       target,
       epochs=200,
       verbose=2,
       batch_size=5)

Cross Entropy Loss for Semantic Segmentation Keras

I'm pretty sure this is a silly question but I can't find it anywhere else so I'm going to ask it here.
I'm doing semantic image segmentation using a cnn (unet) in keras with 7 labels. So my label for each image is (7,n_rows,n_cols) using the theano backend. So across the 7 layers for each pixel, it's one-hot encoded. In this case, is the correct error function to use categorical cross-entropy? It seems that way to me but the network seems to learn better with binary cross-entropy loss. Can someone shed some light on why that would be and what the principled objective is?
Binary cross-entropy loss should be used with a sigmoid activation in the last layer, and it severely penalizes opposite predictions. It does not take into account that the output is one-hot encoded and that the predictions should sum to 1. But because mispredictions are penalized so severely, the model still learns to classify reasonably well.
To enforce the one-hot prior, use a softmax activation with categorical cross-entropy. This is what you should use.
The problem with using softmax in your case is that Keras doesn't support a softmax over each pixel.
The easiest way to go about it is to permute the dimensions to (n_rows,n_cols,7) using a Permute layer and then reshape to (n_rows*n_cols,7) using a Reshape layer. Then you can add the softmax activation layer and use cross-entropy loss. The data should also be reshaped accordingly.
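To see what the permute-and-reshape step does to the data, here is a small NumPy sketch. It is not Keras code: the 4x6 image size and the random scores are assumptions for illustration. It moves the 7 class channels last, flattens the pixels, and applies a softmax over the class axis.

```python
import numpy as np

n_rows, n_cols, n_classes = 4, 6, 7  # illustrative sizes, not from the question
rng = np.random.default_rng(0)

# Channels-first scores, matching the (7, n_rows, n_cols) layout in the question.
scores = rng.normal(size=(n_classes, n_rows, n_cols))

# "Permute" to (n_rows, n_cols, 7), then "Reshape" to (n_rows*n_cols, 7).
flat = np.transpose(scores, (1, 2, 0)).reshape(n_rows * n_cols, n_classes)

# Softmax over the class axis, shifted by the row max for numerical stability.
shifted = flat - flat.max(axis=1, keepdims=True)
probs = np.exp(shifted)
probs /= probs.sum(axis=1, keepdims=True)

print(probs.shape)
```

Each row of probs is then one pixel's class distribution, which is the shape a per-sample categorical cross-entropy expects; the labels need the same transpose and reshape.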
The other way is to implement a depth-softmax:
def depth_softmax(matrix):
    sigmoid = lambda x: 1 / (1 + K.exp(-x))
    sigmoided_matrix = sigmoid(matrix)
    softmax_matrix = sigmoided_matrix / K.sum(sigmoided_matrix, axis=0)
    return softmax_matrix
and use it as a lambda layer:
model.add(Deconvolution2D(7, 1, 1, border_mode='same', output_shape=(7,n_rows,n_cols)))
model.add(Permute((2,3,1)))
model.add(BatchNormalization())
model.add(Lambda(depth_softmax))
If tf image_dim_ordering is used then you can do away with the Permute layers.
For more reference check here.
I tested the solution of @indraforyou and think that the proposed method has some mistakes. As the comment section does not allow for proper code segments, here is what I think would be the fixed version:
def depth_softmax(matrix):
    from keras import backend as K
    exp_matrix = K.exp(matrix)
    softmax_matrix = exp_matrix / K.expand_dims(K.sum(exp_matrix, axis=-1), axis=-1)
    return softmax_matrix
This method will expect the ordering of the matrix to be (height, width, channels).
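For comparison, here is a NumPy sketch of the two normalizations applied to a single pixel. The function names and the three-channel input are assumptions for illustration, and both functions normalize over the last (channel) axis. Both outputs sum to 1, but only the exponential version is an actual softmax; normalizing sigmoid outputs gives a much flatter distribution.

```python
import numpy as np


def softmax_last_axis(x):
    # Exponentiate and normalize over the channel axis (shifted for stability).
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def normalized_sigmoid_last_axis(x):
    # Sigmoid first, then normalize: sums to 1 but is not a softmax.
    s = 1.0 / (1.0 + np.exp(-x))
    return s / s.sum(axis=-1, keepdims=True)


# One pixel with three channel scores, laid out as (height, width, channels).
pixel = np.array([[[4.0, 0.0, -4.0]]])

soft = softmax_last_axis(pixel)
norm_sig = normalized_sigmoid_last_axis(pixel)

print(soft.round(3), norm_sig.round(3))
```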
