When I was monitoring my DQN training, I noticed that there's a reg_loss part to my total loss. I don't know what this comes from, or what it means. It decreases during training, but does not seem to be used when calculating gradients.
Code snippet from dqn_agent.py
# line 500, reg_loss is not calculated here, but from q_network
# I didn't find definition of losses in the Network interface
agg_loss = common.aggregate_losses(
    per_example_loss=td_loss,
    sample_weight=weights,
    regularization_loss=self._q_network.losses)
......
# line 535, notice here only td_loss is returned, which is then used to get the gradient
return tf_agent.LossInfo(total_loss, DqnLossInfo(td_loss=td_loss,
                                                 td_error=td_error))
Can anyone tell me what this loss is and what's the use of it? Thanks.
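For context: in Keras, a layer's `losses` property collects the penalty terms produced by weight regularizers (for example an L2 kernel regularizer). Assuming the Q-network follows that convention, reg_loss would be the sum of those penalties, added on top of the weighted TD loss. The NumPy sketch below only illustrates that arithmetic; the function name `aggregate_losses_sketch`, the L2 coefficient, and all values are hypothetical and are not the TF-Agents implementation.

```python
import numpy as np

def aggregate_losses_sketch(per_example_loss, sample_weight, regularization_losses):
    """Hypothetical stand-in: weighted mean of the per-example loss plus summed penalties."""
    weighted = float(np.mean(per_example_loss * sample_weight))
    regularization = float(np.sum(regularization_losses))
    return {
        "weighted": weighted,
        "regularization": regularization,
        "total_loss": weighted + regularization,
    }

# Assumed example values, for illustration only.
td_loss = np.array([0.5, 1.5, 1.0])
weights = np.ones_like(td_loss)
kernel = np.array([[1.0, -2.0], [0.5, 0.0]])
l2_penalty = 0.01 * np.sum(kernel ** 2)  # an L2 weight penalty with coefficient 0.01

agg = aggregate_losses_sketch(td_loss, weights, [l2_penalty])
print(agg)
```

If the total (and not only the TD term) is what gets differentiated, the penalty shrinks as the weights shrink, which would be consistent with a reg_loss that decreases during training.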
I'm working on a regression model and to evaluate the model performance, my boss thinks that we should use this metric:
Total Absolute Error Mean = mean(y_predicted) / mean(y_true) - 1
Where mean(y_predicted) is the average of all the predictions and mean(y_true) is the average of all the true values.
I have never seen this metric being used in machine learning before, so I convinced him to add Mean Absolute Percentage Error (MAPE) as an alternative. Yet even though my model performs better in terms of MAPE, some areas underperform when we look at Total Absolute Error Mean.
My gut feeling is that this metric is wrong in displaying the real accuracy, but I can't seem to understand why.
Is Total Absolute Error Mean a valid performance metric? If not, then why? If it is, why would a regression model's accuracy increase in terms of MAPE, but not in terms of Total Absolute Error Mean?
Thank you in advance!
I would kindly suggest informing your boss that, when one wishes to introduce a new metric, it is on them to demonstrate why it is useful on top of the existing ones, not the other way around (i.e. us demonstrating why it is not). BTW, this is exactly the standard procedure when someone really comes up with a new proposed metric in a research paper, like the recent proposal of the Maximal Information Coefficient (MIC).
That said, it is not difficult to demonstrate in practice that this proposed metric is a poor one with some dummy data:
import numpy as np
from sklearn.metrics import mean_squared_error
# your proposed metric:
def taem(y_true, y_pred):
    # mean of predictions over mean of true values, minus 1 (as defined in the question)
    return np.mean(y_pred) / np.mean(y_true) - 1
# dummy true data:
y_true = np.array([0,1,2,3,4,5,6])
Now, suppose that we have a really awesome model, which predicts perfectly, i.e. y_pred1 = y_true; in this case both MSE and your proposed TAEM will indeed be 0:
y_pred1 = y_true # PERFECT predictions
mean_squared_error(y_true, y_pred1)
# 0.0
taem(y_true, y_pred1)
# 0.0
So far so good. But let's now consider the output of a really bad model, which predicts high values when it should have predicted low ones, and vice versa; in other words, consider a different set of predictions:
y_pred2 = np.array([6,5,4,3,2,1,0])
which is actually y_pred1 in reverse order. Now, it is easy to see that here we will also have a perfect TAEM score:
taem(y_true, y_pred2)
# 0.0
while of course MSE would have warned us that we are very far indeed from perfect predictions:
mean_squared_error(y_true, y_pred2)
# 16.0
Bottom line: Any metric that ignores element-wise differences in favor of only averages suffers from similar limitations, namely taking identical values for any permutation of the predictions, a characteristic which is highly undesirable for a useful performance metric.
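The same limitation also explains how a model can improve on MAPE while looking worse on TAEM. The sketch below uses made-up numbers (the arrays and the names `pred_a` and `pred_b` are assumptions for illustration): model A makes large errors that cancel out on average, while model B makes smaller errors that all point in the same direction.

```python
import numpy as np

def taem(y_true, y_pred):
    # Ratio of means minus 1, as defined in the question.
    return np.mean(y_pred) / np.mean(y_true) - 1

def mape(y_true, y_pred):
    # Mean absolute percentage error as a fraction; assumes y_true has no zeros.
    return np.mean(np.abs((y_pred - y_true) / y_true))

y_true = np.array([100.0, 100.0, 100.0, 100.0])
pred_a = np.array([90.0, 110.0, 90.0, 110.0])    # large errors that cancel on average
pred_b = np.array([105.0, 105.0, 105.0, 105.0])  # small errors, all in one direction

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, "MAPE:", round(mape(y_true, pred), 4), "TAEM:", round(taem(y_true, pred), 4))
```

Model B halves the MAPE of model A, yet its TAEM moves away from zero, because TAEM only measures the overall bias of the predictions and not their per-example accuracy.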
I'm trying to program a neural network with backpropagation in Python.
It usually converges to 1. To the left of the image there are some delta values. They are very small; should they be larger? Do you know why this convergence could happen?
Sometimes it goes up in the direction of the point and then goes down again.
Here is the complete code:
http://pastebin.com/9BiwhWrD the backpropagation code starts at line 146
(The root stuff in line 165 does nothing; I was just trying out some ideas.)
Any ideas what could be wrong? Have you ever seen behaviour like this?
Thank you very much.
This happened because the input values were too large: the sigmoid activation saturates at f(x) = 1 as x -> inf. I had to normalize the data,
e.g.:
a = np.array([1, 2, 3, 4, 5], dtype=float)  # float dtype so the in-place division works
a /= a.max()
or avoid generating unnormalized data in the first place.
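A small sketch of why large inputs stall training, under assumed input values: for raw inputs in the hundreds the sigmoid output is numerically 1 and its gradient is numerically 0, whereas the same inputs divided by their maximum keep a usable gradient.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

raw = np.array([50.0, 120.0, 300.0, 450.0])  # assumed unnormalized inputs
scaled = raw / raw.max()                     # same data scaled into (0, 1]

out_raw = sigmoid(raw)
out_scaled = sigmoid(scaled)

# Gradient of the sigmoid expressed through its output: y * (1 - y)
grad_raw = out_raw * (1.0 - out_raw)
grad_scaled = out_scaled * (1.0 - out_scaled)

print("raw outputs:     ", out_raw)
print("raw gradients:   ", grad_raw)
print("scaled gradients:", grad_scaled)
```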
Also, the intermediate value was updated BEFORE the sigmoid was applied. But the derivative of the sigmoid looks like this: y'(x) = y(x) * (1 - y(x)). In my case it was effectively y'(x) = x * (1 - x), i.e. computed on the value before the sigmoid.
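The standard sigmoid derivative is y(x) * (1 - y(x)), where y(x) is the sigmoid output. A quick way to catch this kind of bug is to compare the analytic derivative against a finite-difference estimate. This is a minimal sketch; the helper names are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad_from_output(y):
    # Expects the sigmoid OUTPUT y = sigmoid(x), not the pre-activation x.
    return y * (1.0 - y)

x = np.linspace(-4.0, 4.0, 9)
y = sigmoid(x)

analytic = sigmoid_grad_from_output(y)
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)  # central difference

# The bug described above: feeding the pre-activation into the same formula.
wrong = sigmoid_grad_from_output(x)

max_err = np.max(np.abs(analytic - numeric))
max_err_wrong = np.max(np.abs(wrong - numeric))
print("max error (correct):", max_err)
print("max error (buggy):  ", max_err_wrong)
```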
There were also errors in how I updated the weights after calculating the deltas. I rewrote the whole loop using a tutorial for neural networks with Python, and then it worked.
It still does not support bias, but it can do classification. For regression it's not precise enough, but I guess this has to do with the missing bias.
Here is the code:
http://pastebin.com/hRCKe1dK
Someone suggested that I put my training data into a neural network framework and see if it works. It didn't. So it was fairly clear that the problem had to do with the data, which gave me the idea that it should be between -1 and 1.
Update: This question is outdated and was asked for a pre 1.0 version of tensorflow. Do not refer to answers or suggest new ones.
I'm using the tf.nn.sigmoid_cross_entropy_with_logits function for the loss and it's going to NaN.
I'm already using gradient clipping; in the one place where tensor division is performed, I've added an epsilon to prevent division by zero; and the arguments to all softmax functions have an epsilon added to them as well.
Yet, I'm getting NaN's mid-way through training.
Are there any known issues where TensorFlow does this that I have missed?
It's quite frustrating because the loss is randomly going to NaN during training and ruining everything.
Also, how could I go about detecting if the training step will result in NaN and maybe skip that example altogether? Any suggestions?
EDIT: The network is a Neural Turing Machine.
EDIT 2: Here's the code for gradient clipping:
optimizer = tf.train.AdamOptimizer(self.lr)
gvs = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -1.0, 1.0), var) if grad is not None else (grad, var)
              for grad, var in gvs]
train_step = optimizer.apply_gradients(capped_gvs)
I had to add the if grad is not None condition because I was getting an error without it. Could the problem be here?
Potential Solution: I've been using tf.contrib.losses.sigmoid_cross_entropy for a while now, and so far the loss hasn't diverged. I will test some more and report back.
Use 1e-4 for the learning rate; that value always seems to work for me with the Adam optimizer. Even with gradient clipping, training can still diverge. Another sneaky source of NaNs is taking a square root: although it is stable for all positive inputs, its gradient diverges as the value approaches zero. Finally, I would check that all inputs to the model are reasonable.
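To illustrate the square-root point: the gradient of sqrt(x) is 0.5 / sqrt(x), which grows without bound as x approaches zero. The sketch below shows this with NumPy, together with one common workaround, adding a small epsilon inside the root. The epsilon value of 1e-8 is an assumption for illustration, not a recommendation.

```python
import numpy as np

def sqrt_grad(x):
    # d/dx sqrt(x) = 0.5 / sqrt(x)
    return 0.5 / np.sqrt(x)

def safe_sqrt_grad(x, eps=1e-8):
    # d/dx sqrt(x + eps) = 0.5 / sqrt(x + eps), bounded by 0.5 / sqrt(eps)
    return 0.5 / np.sqrt(x + eps)

xs = np.array([1.0, 1e-4, 1e-12, 0.0])

with np.errstate(divide="ignore"):
    raw = sqrt_grad(xs)   # last entry is inf
safe = safe_sqrt_grad(xs)  # all entries finite

print("plain gradient:  ", raw)
print("with epsilon:    ", safe)
```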
I know it has been a while since this was asked, but I'd like to add another solution that helped me, on top of clipping. I found that, if I increase the batch size, the loss tends to not go close to 0, and doesn't end up (as of yet) going to NaN. Hope this helps anyone that finds this!
In my case, the NaN values were the result of NaNs in the training dataset. I was working on a multiclass classifier, and the problem was a dataframe positional filter on the one-hot encoded labels.
Fixing the target dataset resolved my issue. Hope this helps someone else.
Best of luck.
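A cheap guard against this failure mode is to check features and labels for non-finite values before training starts. The following is a minimal sketch; the helper name `assert_finite` and the example arrays are hypothetical.

```python
import numpy as np

def assert_finite(name, array):
    """Raise ValueError if the array contains NaN or +/-inf; otherwise return it as float."""
    array = np.asarray(array, dtype=float)
    bad = ~np.isfinite(array)
    if bad.any():
        first = np.argwhere(bad)[0].tolist()
        raise ValueError(
            f"{name} has {int(bad.sum())} non-finite value(s); first at index {first}"
        )
    return array

# Assumed example: one-hot labels where a bad filter left a row of NaNs.
good_labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
bad_labels = np.array([[1.0, 0.0, 0.0], [np.nan, np.nan, np.nan]])

assert_finite("good_labels", good_labels)
try:
    assert_finite("bad_labels", bad_labels)
except ValueError as err:
    print("caught:", err)
```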
For me, the fix was adding an epsilon to the arguments of a log function.
I no longer see the errors, and I also noticed a moderate increase in the model's training accuracy.
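A sketch of what the epsilon does, using a hand-written binary cross-entropy in NumPy (the function, the probabilities, and eps=1e-7 are assumptions for illustration): when a predicted probability is exactly 0 or 1, log(0) makes the loss non-finite, while a small epsilon keeps it finite.

```python
import numpy as np

def binary_cross_entropy(p, y, eps=0.0):
    # eps is added inside both logs so that log(0) never occurs when eps > 0.
    return -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

p = np.array([0.0, 1.0, 0.8])  # predicted probabilities, two of them saturated and wrong
y = np.array([1.0, 0.0, 1.0])  # true labels

with np.errstate(divide="ignore", invalid="ignore"):
    naive = binary_cross_entropy(p, y)       # non-finite because of log(0)
stable = binary_cross_entropy(p, y, eps=1e-7)  # large but finite

print("without epsilon:", naive)
print("with epsilon:   ", stable)
```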
I can't get TensorFlow RELU activations (neither tf.nn.relu nor tf.nn.relu6) working without NaN values for activations and weights killing my training runs.
I believe I'm following all the right general advice. For example I initialize my weights with
weights = tf.Variable(tf.truncated_normal(w_dims, stddev=0.1))
biases = tf.Variable(tf.constant(0.1 if neuron_fn in [tf.nn.relu, tf.nn.relu6] else 0.0, shape=b_dims))
and use a slow training rate, e.g.,
tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)
But any network of appreciable depth results in NaN for the cost and at least some weights (at least in the summary histograms for them). In fact, the cost is often NaN right from the start (before training).
I seem to have these issues even when I use L2 (about 0.001) regularization, and dropout (about 50%).
Is there some parameter or setting that I should adjust to avoid these issues? I'm at a loss as to where to even begin looking, so any suggestions would be appreciated!
Following He et al. (as suggested in lejlot's comment), initializing the weights of the l-th layer to a zero-mean Gaussian distribution with standard deviation $\sqrt{2 / n_l}$,
where $n_l$ is the flattened length of the input vector, or
stddev=np.sqrt(2.0 / np.prod(input_tensor.get_shape().as_list()[1:]))
results in weights that generally do not diverge.
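To see the effect without TensorFlow, here is a NumPy sketch that pushes random data through a stack of ReLU layers, once with a standard deviation of sqrt(2 / fan_in) and once with a fixed stddev of 0.1. The layer width, depth, and batch size are assumptions for illustration.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    # Zero-mean Gaussian with standard deviation sqrt(2 / fan_in).
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def forward_relu_stack(x, n_layers, width, rng, init_std=None):
    """Pass x through n_layers bias-free ReLU layers and return the final activations."""
    h = x
    for _ in range(n_layers):
        fan_in = h.shape[1]
        if init_std is None:
            w = he_normal(fan_in, width, rng)
        else:
            w = rng.normal(0.0, init_std, size=(fan_in, width))
        h = np.maximum(h @ w, 0.0)
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 512))

he_out = forward_relu_stack(x, n_layers=20, width=512, rng=rng)
fixed_out = forward_relu_stack(x, n_layers=20, width=512, rng=rng, init_std=0.1)

rms_he = float(np.sqrt(np.mean(he_out ** 2)))
rms_fixed = float(np.sqrt(np.mean(fixed_out ** 2)))
print("RMS activation with sqrt(2 / fan_in):", rms_he)
print("RMS activation with stddev=0.1:      ", rms_fixed)
```

With the fan-in-scaled initialization the activation scale stays roughly constant across layers, while the fixed stddev lets it grow layer by layer for this width, which is the kind of growth that can end in overflow and NaN in a deeper or wider network.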
If you use a softmax classifier at the top of your network, try to make the initial weights of the layer just below the softmax very small (e.g. std=1e-4). This makes the initial distribution of outputs of the network very soft (high temperature), and helps ensure that the first few steps of your optimization are not too large and numerically unstable.
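A small NumPy sketch of this effect, with assumed shapes (128 features, 10 classes) and random features standing in for the last hidden layer: a final-layer stddev of 1e-4 gives nearly uniform class probabilities, while a stddev of 1.0 gives much more peaked ones.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 128))  # assumed activations feeding the final layer
n_classes = 10

w_small = rng.normal(0.0, 1e-4, size=(128, n_classes))
w_large = rng.normal(0.0, 1.0, size=(128, n_classes))

probs_small = softmax(features @ w_small)
probs_large = softmax(features @ w_large)

print("largest probability, std=1e-4:", probs_small.max())
print("largest probability, std=1.0: ", probs_large.max())
```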
Have you tried gradient clipping and/or a smaller learning rate?
Basically, you will need to process your gradients before applying them, as follows (from tf docs, mostly):
# Replace this with what follows
# opt = tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)
# Create an optimizer.
opt = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.5)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(cross_entropy_loss, tf.trainable_variables())
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
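# Skip variables that have no gradient (None) instead of passing None to clip_by_value.
capped_grads_and_vars = [(tf.clip_by_value(g, -5., 5.), v) for g, v in grads_and_vars if g is not None]
# Ask the optimizer to apply the capped gradients.
train_op = opt.apply_gradients(capped_grads_and_vars)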
Also, the discussion in this question might help.
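The clipping step itself is framework independent. Here is a NumPy sketch of clipping by value followed by a plain gradient-descent update; the helper names and the example gradients are hypothetical, and the update rule is plain SGD rather than the momentum optimizer used above.

```python
import numpy as np

def clip_gradients_by_value(grads_and_vars, clip_min=-5.0, clip_max=5.0):
    """Clip each gradient element-wise; skip variables that have no gradient."""
    clipped = []
    for grad, var in grads_and_vars:
        if grad is None:
            continue
        clipped.append((np.clip(grad, clip_min, clip_max), var))
    return clipped

def sgd_step(grads_and_vars, learning_rate=0.001):
    # Plain gradient descent: var <- var - learning_rate * grad
    return [var - learning_rate * grad for grad, var in grads_and_vars]

grads_and_vars = [
    (np.array([100.0, -0.5, -30.0]), np.zeros(3)),  # exploding gradient entries
    (None, np.ones(2)),                             # variable not used by the loss
]

clipped = clip_gradients_by_value(grads_and_vars)
new_vars = sgd_step(clipped)
print(clipped[0][0])
print(new_vars[0])
```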
I am using the backpropagation algorithm for my model. It works perfectly fine for a simple XOR case and when I tested it on a smaller subset of my actual data.
There are 3 inputs in total and a single output (0, 1, 2).
I have split the data set into training set (80% amounting to approx 5.5k) and the rest 20% as validation data.
I use trainingRate and momentum for calculating the delta weights.
I have normalized the input as below
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(input_array)
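For reference, min-max scaling maps each column to [0, 1] using that column's minimum and maximum. The NumPy sketch below mirrors that computation and reuses the training-set statistics for the validation split, so both splits share one scale. The helper names and arrays are hypothetical, and this is not scikit-learn's implementation.

```python
import numpy as np

def fit_min_max(x):
    # Per-column minimum and maximum, learned from the training data only.
    return x.min(axis=0), x.max(axis=0)

def apply_min_max(x, x_min, x_max):
    # Constant columns would divide by zero, so use a span of 1 for them.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (x - x_min) / span

train = np.array([[1.0, 200.0, 5.0],
                  [3.0, 400.0, 5.0],
                  [2.0, 300.0, 5.0]])
val = np.array([[4.0, 100.0, 5.0]])

x_min, x_max = fit_min_max(train)
train_scaled = apply_min_max(train, x_min, x_max)
val_scaled = apply_min_max(val, x_min, x_max)  # may fall outside [0, 1]

print(train_scaled)
print(val_scaled)
```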
I use 1 hidden layer with sigmoid and linear activation functions for input-hidden and hidden-output respectively.
I train with trainingRate = 0.0005, momentum = 0.6, Epochs = 100,000. Any higher trainingRate shoots the error up to NaN. Momentum values between 0.5 and 0.9 work fine, and any other value makes the error NaN.
I tried various number of nodes in the hidden layer such as 3,6,9,10 and the error converged to 4140.327574 in each case. I am not sure how to reduce this. Changing the activation functions doesn't help. I even tried adding another hidden layer with gaussian activation function but I cannot reduce the error whatsoever.
Is it because of the outliers? Do I need to clean those values from the training data?
Any suggestion would be of great help, be it the activation function, hidden layers, etc. I have been trying to get this working for quite some time and I am sort of stuck now.
Well, I'm having a similar problem and still haven't fixed it, but I can tell you a couple of things I have found. I think the net is overfitting: at some point my error goes down and then starts going up again, on the verification set too. Is this also your case?
Check whether you are implementing the "early stopping" algorithm correctly; most of the time the problem is not the backpropagation, but the error analysis or the validation analysis.
Hope this helps!
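For reference, one common form of early stopping keeps track of the best validation error and stops after a fixed number of epochs without improvement. The sketch below works on a precomputed list of validation errors; the function name, the `patience` and `min_delta` parameters, and the error values are assumptions for illustration.

```python
def early_stopping_epoch(val_errors, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation errors."""
    best_error = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, err in enumerate(val_errors):
        if err < best_error - min_delta:
            # New best: remember it and reset the patience counter.
            best_error, best_epoch = err, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_errors) - 1

# Assumed validation curve: improves, then starts to rise (overfitting).
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_errors, patience=3)
print("best epoch:", best_epoch, "stopped at epoch:", stop_epoch)
```

In a real training loop the weights from the best epoch would be saved and restored after stopping.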