How to define a custom loss function properly? - machine-learning

Question: How should I define a custom loss function that matches with the default value of the learning rate?
It is obvious that defining a proper loss function is vital for training a neural network.
On the other hand, learning rate is an important parameter that must be set properly.
As I know, loss function and learning rate are tightly related because they directly determine new values of the network weights:
w_new = w_old - learning_rate * (gradient_of_loss w.r.t. w)
For example one may choose the following loss function:
loss_1 = || y_pred - y_true||^2_2
But another chooses the following:
loss_2 = (1/2)|| y_pred - y_true||^2_2
In many neural network/deep learning framework, the value of learning rate is set as default. So using default learning rate, the effective learning rate in latter loss function is half of the former. Please correct me if I'm wrong.

Related

understanding test and validation set usage for early stop and model selection

I implemented an ANN (1 hidden layer of 64 units, learning rate = 0.001, epsilon = 0.001, iters = 500) with pythons OpenCV module. Train error ~ 3% and test error ~ 12%
In order to improve the accruacy/ generalisation of my NN I decided to proceed by- implementing model selection (of #hidden units and learning rate) to get an accurate value of hyperparameters and plotting learning curves to determine if more data is needed (currently have 2.5k).
Having read some sources regarding NN training and model selection, I'm very confused on the following matter -
1) In order to perform model selection, I know the following needs to be done-
create set possibleHiddenUnits {4, 8, 16, 32, 64}
randomly select Tr & Va sets from the total set of Tr + Va with some split e.g. 80/20
foreach ele in possibleHiddenUnits
(*) compute weights for the NN using backpropagation and an iterative optimisation algorithm like Gradient Descent (where we provide the termination criteria in the form of number of iterations / epsilon)
compute Validation set error using these trained weights
select the number of hidden units which min Va set error
Alternatively, I believe we can also use k-fold cross validation.
a. how do you decide what the number of iterations/ epsilon for GD should be?
b. does 1 iteration out of x iterations of GD (where the entire training set is used to compute the gradients of cost wrt weights through backprop) constitute an 'epoch'?
2) Sources (whats is the difference between train, validation and test set, in neural networks? and How to use k-fold cross validation in a neural network) mention that the training for a NN is done in the following way as it prevents over-fitting
for each epoch
for each training data instance
propagate error through the network
adjust the weights
calculate the accuracy over training data
for each validation data instance
calculate the accuracy over the validation data
if the threshold validation accuracy is met
exit training
else
continue training
a. I believe this method should be executed once the model selection has been done. But then how do we avoid overfitting of the model in step (*) of the model selection process above?
b. Am I right in assuming that one epoch constitues one iteration of training where weights are calculated using the entire Tr set through GD + backprop and GD involves x (>1) iters over the entire Tr set to calculate the weights ?
Also, out off 1b and 2b which is correct?
This is more of a comment but since I cant make comments yet I write it here. Have you tried other methods like l2 regularization or dropout? I dont know a lot about model selection but dropout has a very similiar effect like taking lots of models and averaging them. Normaly dropout should do the trick and you wont have problems with overfitting anymore.

Initial bias values for a neural network

I am currently building a CNN in tensorflow and I am initialising my weight matrix using a He normal weight initialisation. However, I am unsure how I should initialise my bias values. I am using ReLU as my activation function between each convolutional layer. Is there a standard method to initialising bias values?
# Define approximate xavier weight initialization (with RelU correction described by He)
def xavier_over_two(shape):
std = np.sqrt(shape[0] * shape[1] * shape[2])
return tf.random_normal(shape, stddev=std)
def bias_init(shape):
return #???
Initializing the biases. It is possible and common to initialize the
biases to be zero, since the asymmetry breaking is provided by the
small random numbers in the weights. For ReLU non-linearities, some
people like to use small constant value such as 0.01 for all biases
because this ensures that all ReLU units fire in the beginning and
therefore obtain and propagate some gradient. However, it is not clear
if this provides a consistent improvement (in fact some results seem
to indicate that this performs worse) and it is more common to simply
use 0 bias initialization.
source: http://cs231n.github.io/neural-networks-2/
Be aware of the specific case of the last layer's bias. As Andrej Karpathy explains in his Recipe for Training Neural Networks:
init well. Initialize the final layer weights correctly. E.g. if you are regressing some values that have a mean of 50 then initialize the final bias to 50. If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization. Setting these correctly will speed up convergence and eliminate “hockey stick” loss curves where in the first few iteration your network is basically just learning the bias.

Different costs for underestimation and overestimation

I have a regression problem, but the cost function is different: The cost for an underestimate is higher than an overestimate. For example, if predicted value < true value, the cost will be 3*(true-predicted)^2; if predicted value > true value, the cost will be 1*(true-predicted)^2.
I'm thinking of using classical regression models such as linear regression, random forest etc. What modifications should I make to adjust for this cost function?
As I know, the ML API such as scikit-learn does not provide the functionality to directly modify the cost function. If I have to use these APIs, what can I do?
Any recommended reading?
You can use Tensorflow (or theano) for custom cost functions. The common linear regression implementation is here.
To find out how you can implement your custom cost function looking at a huber loss function implemented in tensorflow might help you. Here comes your custom cost function which you should replace in the linked code so instead of
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
in the linked code you'll have:
error = y_known - y_pred
condition = tf.less(error, 0)
overestimation_loss = 1 * tf.square(error)
underestimation_loss = 3 * tf.square(error)
cost = tf.reduce_mean(tf.where(condition, overestimation_loss, underestimation_loss))
Here when condition is true, error is lower than zero which means you y_known is smaller than y_pred so you'll have overestimation and so the tf.where statement will choose overestimation_loss otherwise underestimation loss.
The secret is that you'll compute both losses and choose where to use them using tf.where and condition.
Update:
If you want to use other libraries, if huber loss is implemented you can take a look to get ideas because huber loss is a conditional loss function similar to yours.
You can use asymmetric cost function to make your model overestimate or underestimate. You can replace cost function in this implementation with:
def acost(a): return tf.pow(pred-Y, 2) * tf.pow(tf.sign(pred-Y) + a, 2)
for more detail see this link

Does it makes any sense that weights and threshold are growing proportionally when training my perceptron?

I am moving my first steps in neural networks and to do so I am experimenting with a very simple single layer, single output perceptron which uses a sigmoidal activation function. I am updating my weights on-line each time a training example is presented using:
weights += learningRate * (correct - result) * {input,1}
Here weights is a n-length vector which also contains the weight from the bias neuron (- threshold), result is the result as computed by the perceptron (and processed using the sigmoid) when given the input, correct is the correct result and {input,1} is the input augmented with 1 (the fixed input from the bias neuron). Now, when I try to train the perceptron to perform logic AND, the weights don't converge for a long time, instead they keep growing similarly and they maintain a ratio of circa -1.5 with the threshold, for instance the three weights are in sequence:
5.067160008240718 5.105631826680446 -7.945513136885797
...
8.40390853077094 8.43890306970281 -12.889540730182592
I would expect the perceptron to stop at 1, 1, -1.5.
Apart from this problem, which looks like connected to some missing stopping condition in the learning, if I try to use the identity function as activation function, I get weight values oscillating around:
0.43601272528257057 0.49092558197172703 -0.23106430854347537
and I obtain similar results with tanh. I can't give an explanation to this.
Thank you
Tunnuz
It is because the sigmoid activation function doesn't reach one (or zero) even with very highly positive (or negative) inputs. So (correct - result) will always be non-zero, and your weights will always get updated. Try it with the step function as the activation function (i.e. f(x) = 1 for x > 0, f(x) = 0 otherwise).
Your average weight values don't seem right for the identity activation function. It might be that your learning rate is a little high -- try reducing it and see if that reduces the size of the oscillations.
Also, when doing online learning (aka stochastic gradient descent), it is common practice to reduce the learning rate over time so that you converge to a solution. Otherwise your weights will continue to oscillate.
When trying to analyze the behavior of the perception, it helps to also look at correct and result.

Probability and Neural Networks

Is it a good practice to use sigmoid or tanh output layers in Neural networks directly to estimate probabilities?
i.e the probability of given input to occur is the output of sigmoid function in the NN
EDIT
I wanted to use neural network to learn and predict the probability of a given input to occur..
You may consider the input as State1-Action-State2 tuple.
Hence the output of NN is the probability that State2 happens when applying Action on State1..
I Hope that does clear things..
EDIT
When training NN, I do random Action on State1 and observe resultant State2; then teach NN that input State1-Action-State2 should result in output 1.0
First, just a couple of small points on the conventional MLP lexicon (might help for internet searches, etc.): 'sigmoid' and 'tanh' are not 'output layers' but functions, usually referred to as "activation functions". The return value of the activation function is indeed the output from each layer, but they are not the output layer themselves (nor do they calculate probabilities).
Additionally, your question recites a choice between two "alternatives" ("sigmoid and tanh"), but they are not actually alternatives, rather the term 'sigmoidal function' is a generic/informal term for a class of functions, which includes the hyperbolic tangent ('tanh') that you refer to.
The term 'sigmoidal' is probably due to the characteristic shape of the function--the return (y) values are constrained between two asymptotic values regardless of the x value. The function output is usually normalized so that these two values are -1 and 1 (or 0 and 1). (This output behavior, by the way, is obviously inspired by the biological neuron which either fires (+1) or it doesn't (-1)). A look at the key properties of sigmoidal functions and you can see why they are ideally suited as activation functions in feed-forward, backpropagating neural networks: (i) real-valued and differentiable, (ii) having exactly one inflection point, and (iii) having a pair of horizontal asymptotes.
In turn, the sigmoidal function is one category of functions used as the activation function (aka "squashing function") in FF neural networks solved using backprop. During training or prediction, the weighted sum of the inputs (for a given layer, one layer at a time) is passed in as an argument to the activation function which returns the output for that layer. Another group of functions apparently used as the activation function is piecewise linear function. The step function is the binary variant of a PLF:
def step_fn(x) :
if x <= 0 :
y = 0
if x > 0 :
y = 1
(On practical grounds, I doubt the step function is a plausible choice for the activation function, but perhaps it helps understand the purpose of the activation function in NN operation.)
I suppose there an unlimited number of possible activation functions, but in practice, you only see a handful; in fact just two account for the overwhelming majority of cases (both are sigmoidal). Here they are (in python) so you can experiment for yourself, given that the primary selection criterion is a practical one:
# logistic function
def sigmoid2(x) :
return 1 / (1 + e**(-x))
# hyperbolic tangent
def sigmoid1(x) :
return math.tanh(x)
what are the factors to consider in selecting an activation function?
First the function has to give the desired behavior (arising from or as evidenced by sigmoidal shape). Second, the function must be differentiable. This is a requirement for backpropagation, which is the optimization technique used during training to 'fill in' the values of the hidden layers.
For instance, the derivative of the hyperbolic tangent is (in terms of the output, which is how it is usually written) :
def dsigmoid(y) :
return 1.0 - y**2
Beyond those two requriements, what makes one function between than another is how efficiently it trains the network--i.e., which one causes convergence (reaching the local minimum error) in the fewest epochs?
#-------- Edit (see OP's comment below) ---------#
I am not quite sure i understood--sometimes it's difficult to communicate details of a NN, without the code, so i should probably just say that it's fine subject to this proviso: What you want the NN to predict must be the same as the dependent variable used during training. So for instance, if you train your NN using two states (e.g., 0, 1) as the single dependent variable (which is obviously missing from your testing/production data) then that's what your NN will return when run in "prediction mode" (post training, or with a competent weight matrix).
You should choose the right loss function to minimize.
The squared error does not lead to the maximum likelihood hypothesis here.
The squared error is derived from a model with Gaussian noise:
P(y|x,h) = k1 * e**-(k2 * (y - h(x))**2)
You estimate the probabilities directly. Your model is:
P(Y=1|x,h) = h(x)
P(Y=0|x,h) = 1 - h(x)
P(Y=1|x,h) is the probability that event Y=1 will happen after seeing x.
The maximum likelihood hypothesis for your model is:
h_max_likelihood = argmax_h product(
h(x)**y * (1-h(x))**(1-y) for x, y in examples)
This leads to the "cross entropy" loss function.
See chapter 6 in Mitchell's Machine Learning
for the loss function and its derivation.
There is one problem with this approach: if you have vectors from R^n and your network maps those vectors into the interval [0, 1], it will not be guaranteed that the network represents a valid probability density function, since the integral of the network is not guaranteed to equal 1.
E.g., a neural network could map any input form R^n to 1.0. But that is clearly not possible.
So the answer to your question is: no, you can't.
However, you can just say that your network never sees "unrealistic" code samples and thus ignore this fact. For a discussion of this (and also some more cool information on how to model PDFs with neural networks) see contrastive backprop.

Resources