XOR gate with a neural network - machine-learning

I was trying to implement an XOR gate with TensorFlow. I succeeded, but I don't fully understand why it works. I got help from Stack Overflow posts here and here, covering both the one-hot and the non-one-hot versions of the outputs. Here is the network as I understand it, to set things clear.
My Question #1:
Notice the ReLU function and the sigmoid function. Why do we need them (specifically the ReLU function)? You may say that it is needed to achieve non-linearity, and I understand how ReLU achieves non-linearity; I got that answer from here. From what I understand, the difference between using ReLU and not using ReLU is as shown in the picture (I tested the tf.nn.relu function; its output looks like that).
Now, if the first function works, why not the second? From my perspective, ReLU achieves non-linearity by combining multiple linear functions, so both of the drawn functions (the upper two) are linear. If the first one achieves non-linearity, the second one should too, shouldn't it? So the question is: without ReLU, why does the network get stuck?
XOR gate with one hot true outputs
import numpy as np
import tensorflow as tf   # uses the old TF 1.x-style APIs (tf.placeholder, tf.initialize_all_variables)

hidden1_neuron = 10

def Network(x, weights, bias):
    layer1 = tf.nn.relu(tf.matmul(x, weights['h1']) + bias['h1'])
    layer_final = tf.matmul(layer1, weights['out']) + bias['out']
    return layer_final

weight = {
    'h1' : tf.Variable(tf.random_normal([2, hidden1_neuron])),
    'out': tf.Variable(tf.random_normal([hidden1_neuron, 2]))
}
bias = {
    'h1' : tf.Variable(tf.random_normal([hidden1_neuron])),
    'out': tf.Variable(tf.random_normal([2]))
}

x = tf.placeholder(tf.float32, [None, 2])
y = tf.placeholder(tf.float32, [None, 2])

net = Network(x, weight, bias)

cross_entropy = tf.nn.softmax_cross_entropy_with_logits(net, y)
loss = tf.reduce_mean(cross_entropy)
train_op = tf.train.AdamOptimizer(0.2).minimize(loss)

init_op = tf.initialize_all_variables()

xTrain = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
yTrain = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(5000):
        train_data = sess.run(train_op, feed_dict={x: xTrain, y: yTrain})
        loss_val = sess.run(loss, feed_dict={x: xTrain, y: yTrain})
        if not (i % 500):
            print(loss_val)
    result = sess.run(net, feed_dict={x: xTrain})
    print(result)
The code above implements the XOR gate with one-hot outputs. If I take out tf.nn.relu, the network gets stuck. Why?
My Question #2:
How can I tell whether a network is going to get stuck in some local minimum (or at some value)? Can I tell from a plot of the cost (loss) function? For the network designed above I used cross-entropy as the loss function, but I could not find a plot of the cross-entropy function. (If you can provide one, that would be very helpful.)
My Question #3:
Notice that the code contains the line hidden1_neuron = 10, i.e. I have set the number of neurons in the hidden layer to 10. Reducing the number of neurons to 5 makes the network get stuck. So how many neurons should the hidden layer have?
The output when the network works the way it is supposed to:
2.42076
0.000456363
0.000149548
7.40216e-05
4.34194e-05
2.78939e-05
1.8924e-05
1.33214e-05
9.62602e-06
7.06308e-06
[[ 7.5128479 -7.58900356]
[-5.65254211 5.28509617]
[-6.96340656 6.62380219]
[ 7.26610374 -5.9665451 ]]
The output when the network gets stuck:
1.45679
0.346579
0.346575
0.346575
0.346574
0.346574
0.346574
0.346574
0.346574
0.346574
[[ 15.70696926 -18.21559143]
[ -7.1562047 9.75774956]
[ -0.03214722 -0.03214724]
[ -0.03214722 -0.03214724]]

Question 1
Both the ReLU and the sigmoid functions are non-linear. By contrast, the function drawn to the right of the ReLU function is linear. Stacking layers with linear activation functions still leaves the network linear overall, because a composition of linear maps is itself a linear map.
Therefore the network gets stuck: without ReLU it is effectively trying to fit a linear model to a non-linear problem, and XOR is not linearly separable.
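To see the collapse concretely, here is a minimal NumPy sketch (the shapes match the XOR network above, the parameter names are made up for illustration) showing that two stacked linear layers are equivalent to a single linear layer, so removing the ReLU leaves a model that can only represent linearly separable functions:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 10)), rng.normal(size=10)   # hypothetical layer-1 parameters
W2, b2 = rng.normal(size=(10, 2)), rng.normal(size=2)    # hypothetical layer-2 parameters

x = np.array([[0., 1.]])

# Two linear layers with no activation in between...
two_layers = (x @ W1 + b1) @ W2 + b2

# ...collapse into one linear layer with W = W1 W2 and b = b1 W2 + b2.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))   # True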
Question 2
Yes, you will have to pay attention to how the loss progresses. In larger problem instances you would typically watch the error on a held-out validation or test set, by measuring the accuracy of the network after every period of training. A loss that flattens early at a comparatively high value (like the 0.346574 plateau in the output above) is a typical sign that the network is stuck.
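For a toy problem like this you can monitor the loss and accuracy on the training set itself. A minimal sketch in the same old-style TensorFlow API, reusing net, x, y, loss, train_op, xTrain, yTrain and an open sess from the code in the question:

correct = tf.equal(tf.argmax(net, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

prev_loss = float('inf')
for i in range(5000):
    sess.run(train_op, feed_dict={x: xTrain, y: yTrain})
    if i % 500 == 0:
        loss_val, acc_val = sess.run([loss, accuracy],
                                     feed_dict={x: xTrain, y: yTrain})
        print(i, loss_val, acc_val)
        # A loss that stops decreasing while accuracy stays low suggests
        # the network is stuck (e.g. the plateau at ~0.3466 above).
        if abs(prev_loss - loss_val) < 1e-6 and acc_val < 1.0:
            print('training appears stuck')
        prev_loss = loss_val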
Question 3
The XOR problem requires at least 2 input, 2 hidden, and 1 output node; that is, five nodes are needed to model XOR correctly with a simple feed-forward network. In practice, giving the hidden layer more units than this minimum (such as the 10 used above) makes it much less likely that gradient-based training gets stuck in a poor solution.
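To illustrate that the minimal 2-2-1 architecture is sufficient in principle, here is a small NumPy sketch with hand-picked weights (chosen for illustration, not learned) that computes XOR exactly; training a network this small just tends to be more sensitive to initialization:

import numpy as np

def step(z):
    # Heaviside step used as a hard threshold activation.
    return (z > 0).astype(float)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden unit 1 fires when x1 + x2 >= 1 (OR), hidden unit 2 when x1 + x2 >= 2 (AND).
W1 = np.array([[1., 1.],
               [1., 1.]])
b1 = np.array([-0.5, -1.5])

# Output fires when OR is true but AND is false: exactly XOR.
W2 = np.array([[1.], [-1.]])
b2 = np.array([-0.5])

h = step(X @ W1 + b1)
y = step(h @ W2 + b2)
print(y.ravel())   # [0. 1. 1. 0.]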

Related

Exporting a neural network created in Python to CoreML, is that possible?

Is it possible to export a neural network algorithm, like this one published by this guy, to a CoreML model?
from numpy import exp, array, random, dot

class NeuralNetwork():
    def __init__(self):
        # Seed the random number generator, so it generates the same numbers
        # every time the program runs.
        random.seed(1)
        # We model a single neuron, with 3 input connections and 1 output connection.
        # We assign random weights to a 3 x 1 matrix, with values in the range -1 to 1
        # and mean 0.
        self.synaptic_weights = 2 * random.random((3, 1)) - 1

    # The Sigmoid function, which describes an S shaped curve.
    # We pass the weighted sum of the inputs through this function to
    # normalise them between 0 and 1.
    def __sigmoid(self, x):
        return 1 / (1 + exp(-x))

    # The derivative of the Sigmoid function.
    # This is the gradient of the Sigmoid curve.
    # It indicates how confident we are about the existing weight.
    def __sigmoid_derivative(self, x):
        return x * (1 - x)

    # We train the neural network through a process of trial and error,
    # adjusting the synaptic weights each time.
    def train(self, training_set_inputs, training_set_outputs, number_of_training_iterations):
        for iteration in xrange(number_of_training_iterations):
            # Pass the training set through our neural network (a single neuron).
            output = self.think(training_set_inputs)
            # Calculate the error (the difference between the desired output
            # and the predicted output).
            error = training_set_outputs - output
            # Multiply the error by the input and again by the gradient of the Sigmoid curve.
            # This means less confident weights are adjusted more.
            # This means inputs, which are zero, do not cause changes to the weights.
            adjustment = dot(training_set_inputs.T, error * self.__sigmoid_derivative(output))
            # Adjust the weights.
            self.synaptic_weights += adjustment

    # The neural network thinks.
    def think(self, inputs):
        # Pass inputs through our neural network (our single neuron).
        return self.__sigmoid(dot(inputs, self.synaptic_weights))

if __name__ == "__main__":
    # Intialise a single neuron neural network.
    neural_network = NeuralNetwork()
    print "Random starting synaptic weights: "
    print neural_network.synaptic_weights
    # The training set. We have 4 examples, each consisting of 3 input values
    # and 1 output value.
    training_set_inputs = array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
    training_set_outputs = array([[0, 1, 1, 0]]).T
    # Train the neural network using a training set.
    # Do it 10,000 times and make small adjustments each time.
    neural_network.train(training_set_inputs, training_set_outputs, 10000)
    print "New synaptic weights after training: "
    print neural_network.synaptic_weights
    # Test the neural network with a new situation.
    print "Considering new situation [1, 0, 0] -> ?: "
    print neural_network.think(array([1, 0, 0]))
What should be done?
Yes, this is possible. You can use the NeuralNetworkBuilder class from coremltools for this.
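A rough sketch of how that could look for this single-neuron model (the layer names and the example weight values are illustrative, and the exact argument layout should be checked against your coremltools version):

import numpy as np
import coremltools
from coremltools.models import datatypes
from coremltools.models.neural_network import NeuralNetworkBuilder

# The model takes 3 inputs and produces 1 sigmoid output, like the Python network.
input_features = [('input', datatypes.Array(3))]
output_features = [('output', datatypes.Array(1))]
builder = NeuralNetworkBuilder(input_features, output_features)

# Weights learned by the Python network, shape (3, 1); Core ML expects (output, input).
W = np.array([[9.67], [-0.21], [-4.63]])   # hypothetical trained values

builder.add_inner_product(name='dense', W=W.T, b=None,
                          input_channels=3, output_channels=1,
                          has_bias=False,
                          input_name='input', output_name='dense_out')
builder.add_activation(name='sigmoid', non_linearity='SIGMOID',
                       input_name='dense_out', output_name='output')

mlmodel = coremltools.models.MLModel(builder.spec)
mlmodel.save('single_neuron.mlmodel')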

Create a List and Use it in Loss Function Tensorflow

I am trying to create a list based on my neural network outputs and use it in TensorFlow as a loss function.
Assume results is a tensor of shape [1, batch_size] output by a neural network. For each value in this tensor I check whether it lies in a specific range, passed in as a placeholder called valid_range; if it does I append 1 to a list, otherwise -1. The goal is to force all predictions of the network into the range, so the correct prediction is a tensor of all 1s, which I call correct_predictions.
values_list = []
for j in range(batch_size):
    a = results[0, j] >= valid_range[0]
    b = results[0, j] <= valid_range[1]
    c = tf.logical_and(a, b)
    if (c == 1):
        values_list.append(1)
    else:
        values_list.append(-1.)
values_list_tensor = tf.convert_to_tensor(values_list)

correct_predictions = tf.ones([batch_size, ], tf.float32)
Now, I want to use this as a loss function in my network, so that I can force all the predictions to be in the specified range. I try to train like this:
loss = tf.reduce_mean(tf.squared_difference(values_list_tensor, correct_predictions))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip_threshold)
optimize = optimizer.apply_gradients(zip(gradients, variables))
This, however, has a problem and throws an error on the last optimize line, saying:
ValueError: No gradients provided for any variable: ['<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d4afd0>',
'<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d66050>'
...
I tried to debug this in TensorBoard, and I noticed that the list I am creating does not appear in the graph; essentially the values_list_tensor part of the loss function is not part of the network itself. Is there some way to create a list based on the predictions of a neural network and use it in a TensorFlow loss function to train the network?
Please help, I have been stuck on this for a few days now.
Edit:
Following what was suggested in the comments, I decided to use an L2 loss function, multiplying it by the binary vector values_list_tensor I had from before. The binary vector now has values 1 and 0 instead of 1 and -1, so that when the prediction is in the range the loss is 0, and otherwise it is the normal L2 loss. As I am unable to see the values of the tensors, I am not sure whether this is correct. However, I can view the final loss, and it is always 0, so something is wrong here. Am I doing the multiplication correctly, and is values_list_tensor being computed accurately? Can someone tell me what could be wrong?
loss = tf.reduce_mean(tf.nn.l2_loss(tf.matmul(tf.transpose(tf.expand_dims(values_list_tensor, 1)), tf.expand_dims(result[0, :], 1))))
Thanks
To answer the question in the comment: one way to write a piecewise function is with tf.cond. For example, here is a function that returns 0 on [-1, 1] and x everywhere else:
sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32)
y = tf.cond(tf.logical_or(tf.greater(x, 1.0), tf.less(x, -1.0)), lambda : x, lambda : 0.0)
y.eval({x: 1.5}) # prints 1.5
y.eval({x: 0.5}) # prints 0.0
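Since the question is really about a per-element mask over a batch, a differentiable alternative (a sketch, assuming results has shape [1, batch_size] and valid_range holds the lower and upper bounds, as described above) is to build the whole mask with tensor ops such as tf.logical_and and tf.where instead of a Python loop, so the operation stays inside the graph and gradients can flow:

in_range = tf.logical_and(results[0, :] >= valid_range[0],
                          results[0, :] <= valid_range[1])
# Zero loss where the prediction is already in range, squared distance to the
# nearest bound where it is not.
below = tf.square(valid_range[0] - results[0, :])
above = tf.square(results[0, :] - valid_range[1])
out_of_range_penalty = tf.where(results[0, :] < valid_range[0], below, above)
loss = tf.reduce_mean(tf.where(in_range,
                               tf.zeros_like(out_of_range_penalty),
                               out_of_range_penalty))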

The dark mystery of tensorflow and tensorboard using cross-validation in training. Weird graphs showing up

This is the first time I'm using TensorBoard, and I am getting a weird bug with my graphs.
One screenshot shows what I get when I open the 'STEP' view; another shows what I get in the 'RELATIVE' view (and similarly the 'WALL' view), and the two look completely different.
In addition, to test the performance of the model I apply cross-validation every few steps. The accuracy of this cross-validation drops from ~10% (random guessing) to 0% after some time. I am not sure where I have made a mistake, as I am not a TensorFlow expert, but I suspect the problem is in how I build the graph. The code looks as follows:
def initialize_parameters():
    global_step = tf.get_variable("global_step", shape=[], trainable=False,
                                  initializer=tf.constant_initializer(1), dtype=tf.int64)
    Weights = {
        "W_Conv1": tf.get_variable("W_Conv1", shape=[3, 3, 1, 64],
                                   initializer=tf.random_normal_initializer(mean=0.00, stddev=0.01),
                                   ),
        ...
        "W_Affine3": tf.get_variable("W_Affine3", shape=[128, 10],
                                     initializer=tf.random_normal_initializer(mean=0.00, stddev=0.01),
                                     )
    }
    Bias = {
        "b_Conv1": tf.get_variable("b_Conv1", shape=[1, 16, 8, 64],
                                   initializer=tf.random_normal_initializer(mean=0.00, stddev=0.01),
                                   ),
        ...
        "b_Affine3": tf.get_variable("b_Affine3", shape=[1, 10],
                                     initializer=tf.random_normal_initializer(mean=0.00, stddev=0.01),
                                     )
    }
    return Weights, Bias, global_step

def build_model(W, b, global_step):
    keep_prob = tf.placeholder(tf.float32)
    learning_rate = tf.placeholder(tf.float32)
    is_training = tf.placeholder(tf.bool)

    ## 0. Layer: Input
    X_input = tf.placeholder(shape=[None, 16, 8], dtype=tf.float32, name="X_input")
    y_input = tf.placeholder(shape=[None, 10], dtype=tf.int8, name="y_input")

    inputs = tf.reshape(X_input, (-1, 16, 8, 1))  # must be a 4D input into the CNN layer
    inputs = tf.contrib.layers.batch_norm(
        inputs,
        center=False,
        scale=False,
        is_training=is_training
    )

    ## 1. Layer: Conv1 (64, stride=1, 3x3)
    inputs = layer_conv(inputs, W['W_Conv1'], b['b_Conv1'], is_training)
    ...

    ## 7. Layer: Affine 3 (128 units)
    logits = layer_affine(inputs, W['W_Affine3'], b['b_Affine3'], is_training)

    ## 8. Layer: Softmax, or loss otherwise
    predict = tf.nn.softmax(logits)  # should be an argmax, or should this even go through

    ## Output: Loss functions and model trainers
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=y_input,
            logits=logits
        )
    )
    trainer = tf.train.GradientDescentOptimizer(
        learning_rate=learning_rate
    )
    updateModel = trainer.minimize(loss, global_step=global_step)

    ## Test Accuracy
    correct_pred = tf.equal(tf.argmax(y_input, 1), tf.argmax(predict, 1))
    acc_op = tf.reduce_mean(tf.cast(correct_pred, "float"))

    return X_input, y_input, loss, predict, updateModel, keep_prob, learning_rate, is_training
Now I suspect my error is in the definition of the loss function of the graph, but I am not sure. Any idea what the problem could be? Or does the model converge correctly and are all those artifacts expected?
Yes, I think you are running the same model more than once with your cross-validation implementation.
Just try, at the end of every loop:
session.close()
I suspect you are getting such strange output (and I have seen similar myself) because you are running the same model more than once and it is saving the TensorBoard output in exactly the same place. I can't see in your code where you name the file you write the output to; try making the file path in this part of the code unique:
`summary_writer = tf.summary.FileWriter(unique_path_to_log, sess.graph)`
You can also locate the directory where your existing output has been written and remove the files with the older (or newer?) timestamps; that way TensorBoard will not be confused about which run to use.
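For example, a common pattern (a sketch; log_root is a made-up name, and sess is assumed to be the active session as in the line above) is to stamp each run's log directory with the current time so TensorBoard treats every run separately:

import os
import time

log_root = 'logs'                                   # hypothetical base directory
run_dir = os.path.join(log_root, time.strftime('run_%Y%m%d_%H%M%S'))
summary_writer = tf.summary.FileWriter(run_dir, sess.graph)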

Transposed convolution on feature maps using Theano

I asked a similar question on Cross Validated about the image interpretation. I'm moving my detailed question here to include some code details.
The results I'm getting are not fully what I want, so maybe you have faced this issue before and can help me figure it out.
It is a fully convolutional neural network (no fully connected part).
Training part
First the images are transposed to match the layout the convolution function expects, (batch_no, img_channels, width, height):
input.transpose(0, 3, 1, 2)
Learning was optimized with a learning rate of 3e-6, He-uniform initialization, and Nesterov momentum, for 500 epochs, until the following convergence:
Training cost: 1.602449
Training loss: 4.610442
Validation error: 5.126761
Test loss: 5.885714
Backward part
Loading the image:
jpgfile = np.array(Image.open(join(testing_folder,img_name)))
Reshape it into a batch of one:
batch = jpgfile.reshape(1, jpgfile.shape[0], jpgfile.shape[1], 3)
Run the model to extract the first feature map after the ReLU activation:
output = classifier.layer0.output
Test_model = theano.function(
    inputs=[x],
    outputs=output,
)
layer_Fmaps = Test_model(test_set_x)
Apply the backward model to reconstruct the image using only the activated neurons:
bch, ch, row, col = layer_Fmaps.shape
output_grad_reshaped = layer_Fmaps.reshape((-1, 1, row, col))
output_grad_reshaped = output_grad_reshaped[0].reshape(1, 1, row, col)
input_shape = (1, 3, 226, 226)

W = classifier.layer0.W.get_value()[0].reshape(1, 3, 7, 7)
kernel = theano.shared(W)
inp = T.tensor4('inp')
deconv_out = T.nnet.abstract_conv.conv2d_grad_wrt_inputs(
    output_grad=inp,
    filters=kernel,
    input_shape=input_shape,
    filter_shape=(1, 3, 7, 7),
    border_mode=(0, 0),
    subsample=(1, 1)
)
f = theano.function(
    inputs=[inp],
    outputs=deconv_out)

f_out = f(output_grad_reshaped)
deconved_relu = T.nnet.relu(f_out)[0].transpose(1, 2, 0)
deconved = f_out[0].transpose(1, 2, 0)
Here we have two image results: the first is the transposed-convolution output without activation, and the second with ReLU applied, since the kernels might have some negative weights.
It is clear from the transposed-convolution image that this kernel has learned to detect some useful feature of this image. But the reconstruction breaks the image's color scheme during the transposed convolution. It might be because the pixel values are small floating-point numbers. Do you see where the problem is?
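Following the question's own hypothesis that the reconstructed pixel values are small floats, one thing worth checking is how the array is converted for display. A minimal sketch (assuming deconved is the (row, col, 3) float array from above) that rescales a reconstruction to the 0-255 range before saving it:

import numpy as np
from PIL import Image

def to_uint8(img):
    # Min-max rescale a float image to [0, 255] so small float values
    # do not all collapse toward black when cast to uint8.
    img = img - img.min()
    if img.max() > 0:
        img = img / img.max()
    return (img * 255).astype(np.uint8)

Image.fromarray(to_uint8(deconved)).save('deconved_rescaled.png')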

Why does gradient descent update 0-valued weights at all?

I was reading this question and the discussion makes sense to me: when all weights are initialized to zero, gradient descent can't tell where the error came from, so it can't update those weights.
What I don't understand is why I can't see this empirically. I'm running the following piece of code (runnable here):
w = tf.Variable(tf.zeros([2, 1]))
b = tf.Variable(tf.zeros([1]))

x = tf.placeholder(tf.float32, shape=[1, 2])
y = tf.placeholder(tf.float32, shape=[1])

pred = tf.sigmoid(tf.matmul(x, w) + b)
loss = tf.reduce_mean(tf.square(pred - y))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for i in range(100):
        for x_ex, y_ex in dataset:
            sess.run(train_step, feed_dict={x: x_ex, y: y_ex})
            print(sess.run(w))
And the output I'm seeing is like:
[[ 0.]
[ 0.]]
[[ 0.02530853]
[ 0. ]]
[[ 0.02530853]
[ 0.02499614]]
[[-0.00059909]
[-0.00091148]]
[[-0.00059909]
[-0.00091148]]
[[ 0.02472398]
[-0.00091148]]
[[ 0.02472398]
[ 0.02410331]]
If the weights start out as zero, why is gradient descent able to update them at all?
As a follow-up question, if a weight is randomly initialized to be positive, but the optimal value for that weight is negative, do we just have to trust that an update step won't accidentally set the weight to exactly 0 (and thus halt that weight's updatability)? I know the odds of weight + update step being exactly 0 are almost negligible, but it could still be an issue, especially with millions of weights in a NN.
It's not so much a problem with gradient descent itself as with how the partial derivatives are calculated by backpropagation.
This is how backpropagation computes the partial derivative of the cost J with respect to a weight in layer l:
∂J/∂Θ^(l)_ij = a^(l)_j δ^(l+1)_i
where the activation a is obtained by applying the non-linear function g (e.g. sigmoid, tanh, ReLU) to the layer's weighted input:
a^(l)_j = g(Θ^(l-1) a^(l-1))
and where δ is the error propagated backwards from the next layer:
δ^(l) = ((Θ^(l))^T δ^(l+1)) .* g'(Θ^(l-1) a^(l-1))
The .* stands for element-wise multiplication.
So if you look at how the activation is computed, zero weights prevent the activation from increasing or decreasing; all-zero weights mean zero activation.
There are other ways to calculate the gradient which do not have this issue!
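For the single-layer model in the question there is no hidden activation in the way, so one can check the gradient at zero weights directly. A hand-derived NumPy sketch of the same squared-error-on-sigmoid loss as the snippet above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros((2, 1))
b = np.zeros(1)

x = np.array([[1.0, 0.0]])   # one training example with a non-zero input
y = np.array([1.0])

pred = sigmoid(x @ w + b)    # 0.5 at zero weights
# Gradient of (pred - y)^2 with respect to w, derived by hand:
grad_w = 2 * (pred - y) * pred * (1 - pred) * x.T
print(grad_w)                # [[-0.25], [0.]]

At w = 0 the prediction is 0.5, so the gradient is proportional to the input; that is why, in the printed output above, the weight attached to a non-zero input moves while the other stays at zero until its input is non-zero.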