I am running a simple convolutional neural network, doing regression and predicting the results. It predicts 30 outputs (floats)
The prediction results are almost the same irrespective of any input. (converging to mean on trained outputs)
The training after 1000 iterations converges to maximum loss of 0.0107 (which is good one) based on this dataset.
What is causing this?
I tried to set the bias to 1.0, it brings little variables but still the same below. When i set bias to 0, the results are far worse, all outputs are 100% same. i am already using regularisation max(0,x) no improvement with results.
The outputs are below. As you can see, the first, second, third arrays are almost same..
[[ 66.60850525 37.19641876 29.36295891 ..., 71.91300964 47.92261505
85.02180481]
[ 66.4874115 37.09647369 29.23101997 ..., 71.90777588 47.74259186
85.10979462]
[ 66.54870605 37.19485474 29.36085892 ..., 71.84892273 47.8970108
85.05699921]
...,
[ 65.7435379 36.78604889 28.57537079 ..., 71.98916626 47.03699493
85.88017273]
[ 65.7435379 36.78604889 28.57537079 ..., 71.98916626 47.03699493
85.88017273]
[ 65.7435379 36.78604889 28.57537079 ..., 71.98916626 47.03699493
85.88017273]]
The network model runs with this parameters
base_lr: 0.001
lr_policy: "fixed"
display: 100
max_iter: 1000
momentum: 0.9
Judging by the output and the fact that bias affect the result greatly I have a feeling that maybe you didn't normalize your input and output.
Try to normalize them between -1 & +1.
Related
I use a CatBoostClassifier and my classes are highly imbalanced. I applied a scale_pos_weight parameter to account for that. While training with an evaluation dataset (test) CatBoost shows a high precision on test. However, when I make predictions on test using a predict method, I only get a low precision score (calculated using the sklearn.metrics).
I think this might be related to class weights that I applied. However, I don't quite understand how a precision score is affected by this.
params = frozendict({
'task_type': 'CPU',
'loss_function': 'Logloss',
'eval_metric': 'F1',
'custom_metric': ['F1', 'Precision', 'Recall'],
'iterations': 100,
'random_seed': 20190128,
'scale_pos_weight': 56.88657244809081,
'learning_rate': 0.5412829495147387,
'depth': 7,
'l2_leaf_reg': 9.526905230698302
})
from catboost import CatBoostClassifier
model = cb.CatBoostClassifier(**params)
model.fit(
X_train, y_train,
cat_features=np.where(X_train.dtypes == np.object)[0],
eval_set=(X_test, y_test),
verbose=False,
plot=True
)
model.get_best_score()
{'learn': {'Recall': 0.9243007537531925,
'Logloss': 0.15892360013680026,
'F1': 0.9416723809244181,
'Precision': 0.9640191600545249},
'validation_0': {'Recall': 0.914252301192093,
'Logloss': 0.1714387314107052,
'F1': 0.9357892623978286,
'Precision': 0.9642642597943112}}
y_test_pred = model.predict(data=X_test)
from sklearn.metrics import balanced_accuracy_score, recall_score, precision_score, f1_score
print('Balanced accuracy: {:.2f}'.format(balanced_accuracy_score(y_test, y_test_pred)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_test_pred)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_test_pred)))
print('F1: {:.2f}'.format(f1_score(y_test, y_test_pred)))
Balanced accuracy: 0.94
Precision: 0.29
Recall: 0.91
F1: 0.44
I expected to get the same precision as CatBoost show while training, however, it's not so. What am I doing wrong?
Default use_weights is set to True , which means adding weights to the evaluation metrics, e.g. Precision:use_weights=True,
To let your own precision calculator the same as his, change to Precision: use_weights=False
Also, get_best_score gives the highest score over the iterations, you need to specify which iteration to be used in prediction. You can set use_best_model=True in model.fit to automatically choose the iteration.
The predict function uses a standard threshold of 0.5 to convert the probabilities of the prediction into a binary value. When you are dealing with a imbalanced problem, the threshold of 0.5 is not always the best value, that's why on the test set you are achieving a poor precision.
In order to find a better threshold, catboost has some methods that help you to do so, like get_roc_curve, get_fpr_curve, get_fnr_curve. These 3 methods can help you to visualize the true positive, false positive and false negative rates by changing the prediction threhsold.
Besides these visualization methods, catboost has a method called select_threshold which gives you the best threshold by that optimizes one of the curves.
You can check this on their documentation.
In addition to setting the use_bet_model=True, ensure that the class balance in both datasets is the same, or use balanced accuracy metrics to account for different class balance.
If you've done both of these, and you still see much worse accuracy metrics on a test set versus the train set, it is a sign of overfitting. I'd recommend you take advantage of the CatBoost's overfitting detector. The most common first method is to set early_stopping_rounds to an integer like 10, which will stop training once an improvement in the selected loss function isn't achieved after that number of training rounds (see early_stopping_rounds documentation).
I am new babie to the Deep Learning field, and I am use log-likelihood method to compare the MSE metrics.Could anyone be able to show how to calculate the following 2 predicted output examples with 3 outputs neurons each. Thanks
yt = [ [1,0,0],[0,0,1]]
yp = [ [0.9, 0.2,0.2], [0.2,0.8,0.3] ]
MSE or Mean Squared Error is simply the expected value of the squared difference between the predicted and the ground truth labels, represented as
\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]
where theta is the ground truth labels and theta^hat is the predicted labels
I am not sure what are you referring to exactly, like a theoretical question or a part of code
As a Python implementation
def mean_squared_error(A, B):
return np.square(np.subtract(A,B)).mean()
yt = [[1,0,0],[0,0,1]]
yp = [[0.9, 0.2,0.2], [0.2,0.8,0.3]]
mse = mean_squared_error(yt, yp)
print(mse)
This will give a value of 0.21
If you are using one of the DL frameworks like TensorFlow, then they are already providing the function which calculates the mse loss between tensors
tf.losses.mean_squared_error
where
tf.losses.mean_squared_error(
labels,
predictions,
weights=1.0,
scope=None,
loss_collection=tf.GraphKeys.LOSSES,
reduction=Reduction.SUM_BY_NONZERO_WEIGHTS
)
Args:
labels: The ground truth output tensor, same dimensions as 'predictions'.
predictions: The predicted outputs.
weights: Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions
must be either 1, or the same as the corresponding losses dimension).
scope: The scope for the operations performed in computing the loss.
loss_collection: collection to which the loss will be added.
reduction: Type of reduction to apply to loss.
Returns:
Weighted loss float Tensor. If reduction is NONE, this has the same
shape as labels; otherwise, it is scalar.
When we train neural networks, we typically use gradient descent, which relies on a continuous, differentiable real-valued cost function. The final cost function might, for example, take the mean squared error. Or put another way, gradient descent implicitly assumes the end goal is regression - to minimize a real-valued error measure.
Sometimes what we want a neural network to do is perform classification - given an input, classify it into two or more discrete categories. In this case, the end goal the user cares about is classification accuracy - the percentage of cases classified correctly.
But when we are using a neural network for classification, though our goal is classification accuracy, that is not what the neural network is trying to optimize. The neural network is still trying to optimize the real-valued cost function. Sometimes these point in the same direction, but sometimes they don't. In particular, I've been running into cases where a neural network trained to correctly minimize the cost function, has a classification accuracy worse than a simple hand-coded threshold comparison.
I've boiled this down to a minimal test case using TensorFlow. It sets up a perceptron (neural network with no hidden layers), trains it on an absolutely minimal dataset (one input variable, one binary output variable) assesses the classification accuracy of the result, then compares it to the classification accuracy of a simple hand-coded threshold comparison; the results are 60% and 80% respectively. Intuitively, this is because a single outlier with a large input value, generates a correspondingly large output value, so the way to minimize the cost function is to try extra hard to accommodate that one case, in the process misclassifying two more ordinary cases. The perceptron is correctly doing what it was told to do; it's just that this does not match what we actually want of a classifier. But the classification accuracy is not a continuous differentiable function, so we can't use it as the target for gradient descent.
How can we train a neural network so that it ends up maximizing classification accuracy?
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
tf.set_random_seed(1)
# Parameters
epochs = 10000
learning_rate = 0.01
# Data
train_X = [
[0],
[0],
[2],
[2],
[9],
]
train_Y = [
0,
0,
1,
1,
0,
]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))
# Model
pred = tf.tensordot(X, W, 1) + b
cost = tf.reduce_sum((pred-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Train
for epoch in range(epochs):
# Print update at successive doublings of time
if epoch&(epoch-1) == 0 or epoch == epochs-1:
print('{} {} {} {}'.format(
epoch,
cost.eval({X: train_X, Y: train_Y}),
W.eval(),
b.eval(),
))
optimizer.run({X: train_X, Y: train_Y})
# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))
# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))
How can we train a neural network so that it ends up maximizing classification accuracy?
I'm asking for a way to get a continuous proxy function that's closer to the accuracy
To start with, the loss function used today for classification tasks in (deep) neural nets was not invented with them, but it goes back several decades, and it actually comes from the early days of logistic regression. Here is the equation for the simple case of binary classification:
The idea behind it was exactly to come up with a continuous & differentiable function, so that we would be able to exploit the (vast, and still expanding) arsenal of convex optimization for classification problems.
It is safe to say that the above loss function is the best we have so far, given the desired mathematical constraints mentioned above.
Should we consider this problem (i.e. better approximating the accuracy) solved and finished? At least in principle, no. I am old enough to remember an era when the only activation functions practically available were tanh and sigmoid; then came ReLU and gave a real boost to the field. Similarly, someone may eventually come up with a better loss function, but arguably this is going to happen in a research paper, and not as an answer to a SO question...
That said, the very fact that the current loss function comes from very elementary considerations of probability and information theory (fields that, in sharp contrast with the current field of deep learning, stand upon firm theoretical foundations) creates at least some doubt as to if a better proposal for the loss may be just around the corner.
There is another subtle point on the relation between loss and accuracy, which makes the latter something qualitatively different than the former, and is frequently lost in such discussions. Let me elaborate a little...
All the classifiers related to this discussion (i.e. neural nets, logistic regression etc) are probabilistic ones; that is, they do not return hard class memberships (0/1) but class probabilities (continuous real numbers in [0, 1]).
Limiting the discussion for simplicity to the binary case, when converting a class probability to a (hard) class membership, we are implicitly involving a threshold, usually equal to 0.5, such as if p[i] > 0.5, then class[i] = "1". Now, we can find many cases whet this naive default choice of threshold will not work (heavily imbalanced datasets are the first to come to mind), and we'll have to choose a different one. But the important point for our discussion here is that this threshold selection, while being of central importance to the accuracy, is completely external to the mathematical optimization problem of minimizing the loss, and serves as a further "insulation layer" between them, compromising the simplistic view that loss is just a proxy for accuracy (it is not). As nicely put in the answer of this Cross Validated thread:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Enlarging somewhat an already broad discussion: Can we possibly move completely away from the (very) limiting constraint of mathematical optimization of continuous & differentiable functions? In other words, can we do away with back-propagation and gradient descend?
Well, we are actually doing so already, at least in the sub-field of reinforcement learning: 2017 was the year when new research from OpenAI on something called Evolution Strategies made headlines. And as an extra bonus, here is an ultra-fresh (Dec 2017) paper by Uber on the subject, again generating much enthusiasm in the community.
I think you are forgetting to pass your output through a simgoid. Fixed below:
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
tf.set_random_seed(1)
# Parameters
epochs = 10000
learning_rate = 0.01
# Data
train_X = [
[0],
[0],
[2],
[2],
[9],
]
train_Y = [
0,
0,
1,
1,
0,
]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))
# Model
# CHANGE HERE: Remember, you need an activation function!
pred = tf.nn.sigmoid(tf.tensordot(X, W, 1) + b)
cost = tf.reduce_sum((pred-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Train
for epoch in range(epochs):
# Print update at successive doublings of time
if epoch&(epoch-1) == 0 or epoch == epochs-1:
print('{} {} {} {}'.format(
epoch,
cost.eval({X: train_X, Y: train_Y}),
W.eval(),
b.eval(),
))
optimizer.run({X: train_X, Y: train_Y})
# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))
# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))
The output:
0 0.28319069743156433 [ 0.75648874] -0.9745011329650879
1 0.28302448987960815 [ 0.75775659] -0.9742625951766968
2 0.28285878896713257 [ 0.75902224] -0.9740257859230042
4 0.28252947330474854 [ 0.76154679] -0.97355717420578
8 0.28187844157218933 [ 0.76656926] -0.9726400971412659
16 0.28060704469680786 [ 0.77650583] -0.970885694026947
32 0.27818527817726135 [ 0.79593837] -0.9676888585090637
64 0.2738055884838104 [ 0.83302218] -0.9624817967414856
128 0.26666420698165894 [ 0.90031379] -0.9562843441963196
256 0.25691407918930054 [ 1.01172411] -0.9567816257476807
512 0.2461051195859909 [ 1.17413962] -0.9872989654541016
1024 0.23519910871982574 [ 1.38549554] -1.088881492614746
2048 0.2241383194923401 [ 1.64616168] -1.298340916633606
4096 0.21433120965957642 [ 1.95981205] -1.6126530170440674
8192 0.2075471431016922 [ 2.31746769] -1.989408016204834
9999 0.20618653297424316 [ 2.42539024] -2.1028473377227783
4/5 = perceptron accuracy
4/5 = threshold accuracy
I'd like to predict the interest rate and I've got some relevant factors like stock index and money supply number, something like that. The number of factors may be up to 200.
For example,the training data like, X contains factors and y is the interest rate I want to train and predict.
factor1 factor2 factor3 factor176 factor177 factor178
X= [[ 2.1428 6.1557 5.4101 ..., 5.86 6.0735 6.191 ]
[ 2.168 6.1533 5.2315 ..., 5.8185 6.0591 6.189 ]
[ 2.125 4.7965 3.9443 ..., 5.7845 5.9873 6.1283]...]
y= [[ 3.5593]
[ 3.014 ]
[ 2.7125]...]
So I want to use tensorflow/tflearn to train this model but I don't really know what method exactly I should choose to do regression. I have tried LinearRegression from tflearn before, but the result is not so great.
For now, I just use the code I found online.
net = tflearn.input_data([None, 178])
net = tflearn.fully_connected(net, 64, activation='linear',
weight_decay=0.0005)
net = tflearn.fully_connected(net, 1, activation='linear')
net = tflearn.regression(net, optimizer=
tflearn.optimizers.AdaGrad(learning_rate=0.01, initial_accumulator_value=0.01),
loss='mean_square', learning_rate=0.05)
model = tflearn.DNN(net, tensorboard_verbose=0, checkpoint_path='tmp/')
model.fit(X, y, show_metric=True,
batch_size=1, n_epoch=100)
The result is roughly 50% accuracy when the error range is ±10%.
I have tried to make the window to 7 days but the result is still bad. So I want to know what additional layer I can use to make this network better.
First of all this network makes no sense. If you do not have any activations on your hidden units, you network is equivalent to linear regression.
So first of all change
net = tflearn.fully_connected(net, 64, activation='linear',
weight_decay=0.0005)
to
net = tflearn.fully_connected(net, 64, activation='relu',
weight_decay=0.0005)
Another general thing is to always normalise your data. Your X's are big, y's are big as well - make sure they aren't, by for example whitening them (making them 0 mean and 1 std).
Finding right architecture is hard problem and you will not find any "magical recipies" for that. Start with understanding what you are doing. Log your training, see if the training loss converges to small values, if it does not - you either do not train long enough, network is too small, or training hyperparameters are off (like too big learning right, too high regularisation etc.)
I have tried to train simple backpropagation neural network with the xor function. When I use tanh(x) as activation function, with the derivative 1-tanh(x)^2, I get the right result after about 1000 iterations. However, when I use g(x) = 1/(1+e^(-x)) as an activation function, with the derivative g(x)*(1-g(x)), I need about 50000 iterations to get the right result. What can be the reason?
Thank you.
Yes, what you observe is true. I have similar observations when training neural networks using back propagations. For XOR problem, I used to set up a 2x20x2 network, logistic function takes 3000+ episodes to get below result:
[0, 0] -> [0.049170633762142486]
[0, 1] -> [0.947292007836417]
[1, 0] -> [0.9451808598939389]
[1, 1] -> [0.060643862846171494]
While using tanh as activation function, here is the result after 800 episodes. tanh converges consistently faster than logistic.
[0, 0] -> [-0.0862215901296476]
[0, 1] -> [0.9777578145233919]
[1, 0] -> [0.9777632805205176]
[1, 1] -> [0.12637838259658932]
The two functions' shape look like below (credit: efficient backprop):
The left is the standard logistic function: 1/(1+e^(-x)).
The right is the tanh function, also known as hyperbolic tangent.
It's easy to see that tanh is antisymmetric about the origin.
According to efficient Backprop,
Symmetric sigmoids such as tanh often converge faster than standard logistic function.
Also from wiki Logistic regression:
Practitioners caution that sigmoidal functions which are antisymmetric about the origin (e.g. the hyperbolic tangent) lead to faster convergence when training networks with backpropagation.
See efficient Backprop for more details explaining the intuition here.
See elliott for an alternative of tanh with easier computations. It's shown below as the black curve (the blue one is the original tanh).
Two things should stand out from the above chart. First, TANH usually needed fewer iterations to train than Elliott. So the training accuracy is not as good with Elliott, for an Encoder. However, notice the training times. Elliott completed its entire task, even with the extra iterations it had to do, in half the time of TANH. This is a huge improvement and literally means that in this case, Elliott will cut your training time in half, and deliver the same final training error. While it does take more training iterations to get there, the speed per iteration is so much faster it still results in the training time being cut in half.