Neural network and XOR as a classification - machine-learning

I read somewhere that mean squared error loss is good for regression, and cross entropy loss for classification.
When I tried to train XOR as a classification problem with cross entropy loss, the network failed to converge.
My setting:
The network is 2-2-2.
The first output signals class 0 and the second class 1 (so two classes of inputs).
Cross entropy is used to calculate the error in the output layer of the network instead of mean squared error.
As the activation function, I'm using logsig.
Apparently I'm missing something. Where is my mistake?

Here's an implementation of this network in Mathematica:
net = NetChain[{2, Tanh, 2, Tanh, 1, LogisticSigmoid}, "Input" -> {2}];
eps = 0.01;
data = {{0, 0} -> {eps}, {1, 0} -> {1 - eps}, {0, 1} -> {1 - eps}, {1, 1} -> {eps}};
trained = NetTrain[net, data, CrossEntropyLossLayer["Binary"],
   MaxTrainingRounds -> Quantity[5, "Minutes"], TargetDevice -> "GPU"]
This converges after a few thousand rounds. So I don't think you're missing anything; there's probably a bug in your library.
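For readers reproducing this outside Mathematica, here is a rough numpy sketch (not the asker's or answerer's code) of the 2-2-2 setup described in the question: a sigmoid hidden layer, a softmax over the two output classes (used here instead of two independent logsig outputs), and cross entropy loss. With only 2 hidden units, convergence depends on the random initialization, so a different seed or learning rate may be needed.

import numpy as np

rng = np.random.default_rng(0)

# XOR inputs and one-hot class targets (class 0 = "output 0", class 1 = "output 1")
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# 2-2-2 network: 2 inputs, 2 sigmoid hidden units, 2 softmax outputs
W1 = rng.normal(size=(2, 2)); b1 = np.zeros(2)
W2 = rng.normal(size=(2, 2)); b2 = np.zeros(2)

lr = 0.5
for step in range(20000):
    H = sigmoid(X @ W1 + b1)          # hidden activations
    P = softmax(H @ W2 + b2)          # class probabilities
    loss = -np.mean(np.sum(T * np.log(P + 1e-12), axis=1))

    # Backprop: softmax + cross entropy gives the simple (P - T) error signal
    dZ2 = (P - T) / len(X)
    dW2 = H.T @ dZ2; db2 = dZ2.sum(axis=0)
    dH = dZ2 @ W2.T
    dZ1 = dH * H * (1 - H)            # derivative of the sigmoid
    dW1 = X.T @ dZ1; db1 = dZ1.sum(axis=0)

    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(loss)
print(P.round(3))  # rows should approach [1,0], [0,1], [0,1], [1,0]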

Related

Difference between Logistic Loss and Cross Entropy Loss

I'm confused about the logistic loss and the cross entropy loss in the binary classification scenario.
According to Wikipedia (https://en.wikipedia.org/wiki/Loss_functions_for_classification), the logistic loss is defined as:
V(v) = log(1 + exp(-v)) / log(2), where v = y*y_hat
The cross entropy loss is defined as:
L(y, y_hat) = -[y*log(y_hat) + (1 - y)*log(1 - y_hat)]
From the Wikipedia article (https://en.wikipedia.org/wiki/Loss_functions_for_classification):
It's easy to check that the logistic loss and binary cross entropy loss (Log loss) are in fact the same (up to a multiplicative constant 1/log(2))
However, when I test it with some code, I found they are not the same. Here is the python code:
from numpy import exp
from math import log

def cross_entropy_loss(y, yp):
    return -log(1 - yp) if y == 0 else -log(yp)

def logistic_loss(y, yp):
    return log(1 + exp(-y * yp)) / log(2)

y, yp = 0, 0.3  # y in {0, 1} for cross_entropy_loss
l1 = cross_entropy_loss(y, yp)
y, yp = -1, 0.3  # y in {-1, 1} for logistic loss
l2 = logistic_loss(y, yp)
print(l1, l2, l1 / l2)
y, yp = 1, 0.9
l1 = cross_entropy_loss(y, yp)
l2 = logistic_loss(y, yp)
print(l1, l2, l1 / l2)
The output shows that the loss values are not the same, and the ratio between them is not constant either:
0.35667494393873245 1.2325740743522222 0.2893740436056004
0.10536051565782628 0.49218100325603786 0.21406863523949665
Could somebody explain why they are "in fact the same"?
In Wikipedia, v is defined as
v = y*f(x),
where f(x) is the raw classifier output. What is not defined in Wikipedia is \hat{y} (i.e. the predicted probability). It should be defined via the logistic function:
\hat{y} = 1/(1+exp(-f(x))).
By substituting the above definition into the logistic loss formula from Wikipedia, you should be able to recover the cross entropy loss. Please note that the cross entropy loss equation (as you have presented it above) is formulated for y={0,1}, while the equations from the Wikipedia article are for y={-1,1}.
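A quick numerical check of that substitution (a sketch; the two loss functions are as in the question, only what is passed as the second argument changes, and the sigmoid helper is added here): feed the logistic loss the raw score f(x) with y in {-1, 1}, and feed the cross entropy loss sigmoid(f(x)) with the corresponding y in {0, 1}. The ratio is then the constant log(2).

from math import exp, log

def sigmoid(f):
    return 1.0 / (1.0 + exp(-f))

def cross_entropy_loss(y, yp):      # y in {0, 1}, yp = predicted probability
    return -log(1 - yp) if y == 0 else -log(yp)

def logistic_loss(y, f):            # y in {-1, 1}, f = raw score f(x)
    return log(1 + exp(-y * f)) / log(2)

for f in (-2.0, -0.5, 0.3, 1.7):
    l_ce = cross_entropy_loss(1, sigmoid(f))   # positive class, {0, 1} convention
    l_lg = logistic_loss(1, f)                 # positive class, {-1, 1} convention
    print(f, l_ce / l_lg)                      # always ~0.693 = log(2)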

Caffe: Check failed: outer_num_ * inner_num_ == bottom[1]->count() (10 vs. 60) Number of labels must match number of predictions

I am trying to fine tune AlexNet for a multi-label regression task. For this I have replaced the last layer, which produces the 1000-label output (for the image classification task), with a 6-label output that provides me with 6 floats. I replaced the last layers as mentioned here.
My training data is prepared in h5 format and is shaped as (11000, 3, 544, 1024) for the data and (11000, 1, 6) for the labels. While retraining the weights of AlexNet in the Caffe library, I get the following error:
I1013 10:50:49.759560 3107 net.cpp:139] Memory required for data: 950676640
I1013 10:50:49.759562 3107 layer_factory.hpp:77] Creating layer accuracy_retrain
I1013 10:50:49.759567 3107 net.cpp:86] Creating Layer accuracy_retrain
I1013 10:50:49.759568 3107 net.cpp:408] accuracy_retrain <- fc8_fc8_retrain_0_split_0
I1013 10:50:49.759572 3107 net.cpp:408] accuracy_retrain <- label_data_1_split_0
I1013 10:50:49.759575 3107 net.cpp:382] accuracy_retrain -> accuracy
F1013 10:50:49.759587 3107 accuracy_layer.cpp:31] Check failed: outer_num_ * inner_num_ == bottom[1]->count() (10 vs. 60) Number of labels must match number of predictions; e.g., if label axis == 1 and prediction shape is (N, C, H, W), label count (number of labels) must be N*H*W, with integer values in {0, 1, ..., C-1}.
My batch size for both the training and testing phases is 10. The error arises in the testing phase, possibly in the accuracy layer (complete error log here). I am not sure why this problem arises; perhaps my labels are misshaped. Any help in this regard will be highly appreciated.
I solved this problem. It seems the accuracy layer is only used for classification tasks, along with a SoftmaxWithLoss layer. As stated in this answer, EuclideanLoss can be used to test the regression network instead.

Create a List and Use it in Loss Function Tensorflow

I am trying to create a list based on my neural network outputs and use it in Tensorflow as a loss function.
Assume that results is a list of shape [1, batch_size] that is output by a neural network. I check whether each value of this list is in a specific range, passed in as a placeholder called valid_range; if it is, I add 1 to a list, and if it is not, I add -1. The goal is to make all predictions of the network fall in the range, so the correct prediction is a tensor of all 1s, which I call correct_predictions.
values_list = []
for j in range(batch_size):
    a = results[0, j] >= valid_range[0]
    b = result[0, j] <= valid_range[1]
    c = tf.logical_and(a, b)
    if (c == 1):
        values_list.append(1)
    else:
        values_list.append(-1.)
values_list_tensor = tf.convert_to_tensor(values_list)
correct_predictions = tf.ones([batch_size, ], tf.float32)
Now, I want to use this as a loss function in my network, so that I can force all the predictions to be in the specified range. I try to train like this:
loss = tf.reduce_mean(tf.squared_difference(values_list_tensor, correct_predictions))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip_threshold)
optimize = optimizer.apply_gradients(zip(gradients, variables))
This, however, has a problem and throws an error on the last optimize line, saying:
ValueError: No gradients provided for any variable: ['<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d4afd0>',
'<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d66050>'
...
I tried to debug this in Tensorboard, and I noticed that the list I am creating does not appear in the graph, so basically the x part of the loss function is not part of the network itself. Is there some way to accurately create a list based on the predictions of a neural network and use it in the loss function in Tensorflow to train the network?
Please help, I have been stuck on this for a few days now.
Edit:
Following what was suggested in the comments, I decided to use an l2 loss function, multiplying it by the binary vector I had from before, values_list_tensor. The binary vector now has values 1 and 0 instead of 1 and -1. This way, when the prediction is in the range the loss is 0; otherwise it is the normal l2 loss. As I am unable to see the values of the tensors, I am not sure if this is correct. However, I can view the final loss and it is always 0, so something is wrong here. I am unsure whether the multiplication is being done correctly and whether values_list_tensor is calculated accurately. Can someone help and tell me what could be wrong?
loss = tf.reduce_mean(tf.nn.l2_loss(tf.matmul(tf.transpose(tf.expand_dims(values_list_tensor, 1)), tf.expand_dims(result[0, :], 1))))
Thanks
To answer the question in the comment: one way to write a piecewise function is to use tf.cond. For example, here is a function that returns 0 in [-1, 1] and x everywhere else:
sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32)
y = tf.cond(tf.logical_or(tf.greater(x, 1.0), tf.less(x, -1.0)), lambda : x, lambda : 0.0)
y.eval({x: 1.5}) # prints 1.5
y.eval({x: 0.5}) # prints 0.0
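Note that tf.cond only shows how to express a piecewise value; for the original goal of pushing predictions into [valid_range[0], valid_range[1]], the hard ±1 indicator still has zero gradient almost everywhere. One common alternative (not from the answer above) is a hinge-style penalty that is zero inside the range and grows linearly outside it. A minimal numpy sketch of the idea, with illustrative names:

import numpy as np

def range_penalty(preds, lo, hi):
    """Zero when lo <= preds <= hi, otherwise the distance back to the range."""
    return np.maximum(0.0, lo - preds) + np.maximum(0.0, preds - hi)

preds = np.array([0.2, 0.7, 1.4, -0.3])
print(range_penalty(preds, 0.0, 1.0))         # [0.  0.  0.4 0.3]
print(range_penalty(preds, 0.0, 1.0).mean())  # could serve as (part of) a loss term

Written with tf.maximum or tf.nn.relu instead of np.maximum, the same expression stays differentiable, so it appears in the graph and provides usable gradients.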

SVM with RBF: Decision values tend to be equal to the negative of the bias term for faraway test samples

Using an RBF kernel in an SVM, why do the decision values of test samples far away from the training ones tend to be equal to the negative of the bias term b?
A consequence is that, once the SVM model is generated, if I set the bias term to 0, the decision values of test samples far away from the training ones tend to 0. Why does this happen?
In LibSVM, the bias term b is rho. The decision value is the distance from the hyperplane.
I need to understand what causes this behavior. Does anyone understand it?
Running the following R script, you can see this behavior:
library(e1071)
library(mlbench)
data(Glass)
set.seed(2)
writeLines('separating training and testing samples')
testindex <- sort(sample(1:nrow(Glass), trunc(nrow(Glass)/3)))
training.samples <- Glass[-testindex, ]
testing.samples <- Glass[testindex, ]
writeLines('normalizing samples according to training samples between 0 and 1')
fnorm <- function(ran, data) {
  (data - ran[1]) / (ran[2] - ran[1])
}
minmax <- data.frame(sapply(training.samples[, -10], range))
training.samples[, -10] <- mapply(fnorm, minmax, training.samples[, -10])
testing.samples[, -10] <- mapply(fnorm, minmax, testing.samples[, -10])
writeLines('making the dataset binary')
training.samples$Type <- factor((training.samples$Type == 1) * 1)
testing.samples$Type <- factor((testing.samples$Type == 1) * 1)
writeLines('training the SVM')
svm.model <- svm(Type ~ ., data=training.samples, cost=1, gamma=2**-5)
writeLines('predicting the SVM with outlier samples')
points = c(0, 0.8, 1, # non-outliers
1.5, -0.5, 2, -1, 2.5, -1.5, 3, -2, 10, -9) # outliers
outlier.samples <- t(sapply(points, function(p) rep(p, 9)))
svm.pred <- predict(svm.model, testing.samples[, -10], decision.values=TRUE)
svm.pred.outliers <- predict(svm.model, outlier.samples, decision.values=TRUE)
writeLines('') # printing
svm.pred.dv <- c(attr(svm.pred, 'decision.values'))
svm.pred.outliers.dv <- c(attr(svm.pred.outliers, 'decision.values'))
names(svm.pred.outliers.dv) <- points
writeLines('test sample decision values')
print(head(svm.pred.dv))
writeLines('non-outliers and outliers decision values')
print(svm.pred.outliers.dv)
writeLines('svm.model$rho')
print(svm.model$rho)
writeLines('')
writeLines('<< setting svm.model$rho to 0 >>')
writeLines('predicting the SVM with outlier samples')
svm.model$rho <- 0
svm.pred <- predict(svm.model, testing.samples[, -10], decision.values=TRUE)
svm.pred.outliers <- predict(svm.model, outlier.samples, decision.values=TRUE)
writeLines('') # printing
svm.pred.dv <- c(attr(svm.pred, 'decision.values'))
svm.pred.outliers.dv <- c(attr(svm.pred.outliers, 'decision.values'))
names(svm.pred.outliers.dv) <- points
writeLines('test sample decision values')
print(head(svm.pred.dv))
writeLines('non-outliers and outliers decision values')
print(svm.pred.outliers.dv)
writeLines('svm.model$rho')
print(svm.model$rho)
Comments about the code:
It uses a dataset of 9 dimensions.
It splits the dataset into training and testing.
It normalizes the samples between 0 and 1 for all dimensions.
It makes the problem binary.
It fits an SVM model.
It predicts the testing samples, getting the decision values.
It predicts some synthetic (outlier) samples outside [0, 1] in the feature space, getting the decision values.
It shows that the decision value for outliers tends to be the negative of the bias term b generated by the model.
It sets the bias term b to 0.
It predicts the testing samples, getting the decision values.
It predicts some synthetic (outlier) samples outside [0, 1] in the feature space, getting the decision values.
It shows that the decision value for outliers tends to be 0.
Do you mean the negative of the bias term instead of the inverse?
The decision function of the SVM is sign(w^T x - rho), where rho is the bias term, w is the weight vector, and x is the input. But that's the primal space / linear form. w^T x is replaced by our kernel function, which in this case is the RBF kernel.
The RBF kernel is defined as K(x, x') = exp(-γ ||x - x'||^2). So if the distance between two points is very large, it gets squared, and we get a huge number. γ is a positive number, so we are turning that huge value into a huge negative exponent. exp(-10) is already on the order of 5*10^-5, so for far away points the RBF kernel becomes essentially zero. If a sample is far away from all of your training data, then all of the kernel products will be nearly zero. That means the kernelized w^T x will be nearly zero, and so what you are left with is essentially sign(0 - rho), i.e. the negative of your bias term.
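To make the argument concrete, here is a small numpy sketch of the kernelized decision value f(x) = sum_i alpha_i y_i K(x_i, x) - rho. It is not LibSVM's internals; the support vectors, dual coefficients, and rho are made up for illustration. As the test point moves away from every support vector, the kernel terms vanish and the decision value collapses to -rho.

import numpy as np

def rbf(x, z, gamma=2**-5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

# toy "support vectors" with dual coefficients alpha_i * y_i and a bias rho
support_vectors = np.array([[0.1, 0.2], [0.8, 0.9], [0.4, 0.5]])
dual_coefs = np.array([0.7, -0.5, 0.3])   # alpha_i * y_i
rho = 0.42

def decision_value(x):
    k = np.array([rbf(sv, x) for sv in support_vectors])
    return dual_coefs @ k - rho

for scale in (1, 3, 10, 100):
    x = np.full(2, float(scale))          # move the test point further away
    print(scale, decision_value(x))       # approaches -rho = -0.42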

XOR gate with a neural network

I was trying to implement an XOR gate with TensorFlow. I succeeded in implementing it, but I don't fully understand why it works. I got help from the Stack Overflow posts here and here, covering both one-hot and non-one-hot outputs. Here is the network as I understood it, to set things clear.
My Question #1:
Notice the ReLU function and the sigmoid function. Why do we need them (specifically the ReLU function)? You may say that it is in order to achieve non-linearity. I understand how ReLU achieves non-linearity; I got the answer from here. Now, from what I understand, the difference between using ReLU and not using ReLU is this (see the picture). [I tested the tf.nn.relu function. The output is like this.]
Now, if the first function works, why not the second function? From my perspective ReLU achieves non-linearity by combining multiple linear functions, so both are linear functions (the upper two). If the first one achieves non-linearity, the second one should too, shouldn't it? The question is: without using ReLU, why does the network get stuck?
XOR gate with one hot true outputs
hidden1_neuron = 10

def Network(x, weights, bias):
    layer1 = tf.nn.relu(tf.matmul(x, weights['h1']) + bias['h1'])
    layer_final = tf.matmul(layer1, weights['out']) + bias['out']
    return layer_final

weight = {
    'h1' : tf.Variable(tf.random_normal([2, hidden1_neuron])),
    'out': tf.Variable(tf.random_normal([hidden1_neuron, 2]))
}
bias = {
    'h1' : tf.Variable(tf.random_normal([hidden1_neuron])),
    'out': tf.Variable(tf.random_normal([2]))
}

x = tf.placeholder(tf.float32, [None, 2])
y = tf.placeholder(tf.float32, [None, 2])
net = Network(x, weight, bias)

cross_entropy = tf.nn.softmax_cross_entropy_with_logits(net, y)
loss = tf.reduce_mean(cross_entropy)
train_op = tf.train.AdamOptimizer(0.2).minimize(loss)
init_op = tf.initialize_all_variables()

xTrain = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
yTrain = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(5000):
        train_data = sess.run(train_op, feed_dict={x: xTrain, y: yTrain})
        loss_val = sess.run(loss, feed_dict={x: xTrain, y: yTrain})
        if not (i % 500):
            print(loss_val)
    result = sess.run(net, feed_dict={x: xTrain})
    print(result)
The code you see above implements the XOR gate with one-hot outputs. If I take out tf.nn.relu, the network gets stuck. Why?
My Question #2:
How can I tell whether a network is going to get stuck in some local minimum [or at some value]? Is it from the plot of the cost function (or loss function)? Say, for the network designed above, I used cross entropy as the loss function, but I could not find a plot of the cross entropy function. (If you can provide this, it would be very helpful.)
My Question #3:
Notice that in the code there is a line hidden1_neuron = 10. It means that I have set the number of neurons in the hidden layer to 10. Reducing the number of neurons to 5 makes the network get stuck. So what should the number of neurons in the hidden layer be?
The output when the network works the way it is supposed to :
2.42076
0.000456363
0.000149548
7.40216e-05
4.34194e-05
2.78939e-05
1.8924e-05
1.33214e-05
9.62602e-06
7.06308e-06
[[ 7.5128479 -7.58900356]
[-5.65254211 5.28509617]
[-6.96340656 6.62380219]
[ 7.26610374 -5.9665451 ]]
The output when the network gets stuck:
1.45679
0.346579
0.346575
0.346575
0.346574
0.346574
0.346574
0.346574
0.346574
0.346574
[[ 15.70696926 -18.21559143]
[ -7.1562047 9.75774956]
[ -0.03214722 -0.03214724]
[ -0.03214722 -0.03214724]]
Question 1
Both the ReLU and the sigmoid function are non-linear. In contrast, the function drawn to the right of the ReLU function is linear. Stacking multiple linear activation functions still leaves the network linear.
Therefore, the network gets stuck when trying to fit what is effectively a linear model to a non-linear problem; XOR is not linearly separable.
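The collapse to a linear model is easy to verify: without a non-linear activation in between, two stacked layers compose into a single affine map. A small numpy sketch of that identity:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 10)), rng.normal(size=10)
W2, b2 = rng.normal(size=(10, 2)), rng.normal(size=2)

x = rng.normal(size=(4, 2))

# two "linear" layers without an activation in between...
two_layers = (x @ W1 + b1) @ W2 + b2

# ...are exactly one affine layer with collapsed weights
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))  # True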
Question 2
Yes, you will have to pay attention to the progression of the error rate. In larger problem instances, you would typically watch the development of the error function on a held-out test set. This is done by measuring the accuracy of the network after a period of training.
Question 3
The XOR problem requires at least 2 input, 2 hidden, and 1 output node; that is, five nodes are required to correctly model the XOR problem with a simple neural network (see the sketch below).
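As a sanity check that 2 hidden units really do suffice, here is a minimal numpy sketch of a 2-2-1 network with hand-picked weights (one hidden unit acting roughly like OR, the other like AND, and the output like "OR and not AND"):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# hidden unit 1 ~ OR(x1, x2), hidden unit 2 ~ AND(x1, x2)
W1 = np.array([[20.0, 20.0],
               [20.0, 20.0]])
b1 = np.array([-10.0, -30.0])

# output ~ h1 AND NOT h2, i.e. XOR
W2 = np.array([[20.0], [-20.0]])
b2 = np.array([-10.0])

H = sigmoid(X @ W1 + b1)
out = sigmoid(H @ W2 + b2)
print(out.round(3))   # approximately [0, 1, 1, 0]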
