Neural Net - trying to predict that 5 + 5 = 10 - machine-learning

I'm learning about Neural Networks and I recently had this idea: trying to give a NN training data of the function $f(x) = 2x$. The question is, can the NN accurately predict that it has to double the input number to give the correct output?
This is just a "mental exercise", to better my understanding of how NNs work.
My Python code doesn't work, here's what I've tried:
Neural Network class:
import numpy as np
class NeuralNetwork:
    def __init__(self, inputnodes, hiddennodes, outputnodes, learningrate):
        self.inodes = inputnodes
        self.hnodes = hiddennodes
        self.onodes = outputnodes
        self.lr = learningrate
        self.wih = np.random.normal(0.0, pow(self.inodes, -0.5), (self.hnodes, self.inodes))
        self.who = np.random.normal(0.0, pow(self.hnodes, -0.5), (self.onodes, self.hnodes))
    def train(self, inputs_list, targets_list):
        inputs = np.array(inputs_list, ndmin=2).T
        targets = np.array(targets_list, ndmin=2).T
        hidden_outputs = np.dot(self.wih, inputs)
        final_outputs = np.dot(self.who, hidden_outputs)
        output_errors = targets - final_outputs
        hidden_errors = np.dot(self.who.T, output_errors)
        self.who += self.lr * np.dot(
            (output_errors * final_outputs * (1.0 - final_outputs)),
            np.transpose(hidden_outputs)
        )
        self.wih += self.lr * np.dot(
            (hidden_errors * hidden_outputs * (1.0 - hidden_outputs)),
            np.transpose(inputs)
        )
    def query(self, inputs_list):
        inputs = np.array(inputs_list, ndmin=2).T
        hidden_outputs = np.dot(self.wih, inputs)
        final_outputs = np.dot(self.who, hidden_outputs)
        return final_outputs
Training the network and predicting a value:
input_nodes = 1
hidden_nodes = 20
output_nodes = 1
learning_rate = 0.3
nn = NeuralNetwork(input_nodes, hidden_nodes, output_nodes, learning_rate)
for i in range(10):
    i += 1
    inputs = np.log(i)
    targets = np.log(2*i)
    nn.train(inputs, targets)
print(nn.query(np.asfarray([4])))
Here's the output I'm getting trying to run this code:
x.py:26: RuntimeWarning: overflow encountered in multiply
(output_errors * final_outputs * (1.0 - final_outputs)),
x.py:31: RuntimeWarning: overflow encountered in multiply
(hidden_errors * hidden_outputs * (1.0 - hidden_outputs)),
[[nan]]
I don't really know how to interpret this, and if my design is correct for this application. Any help would be appreciated.
Thanks.

Some suggestions:
Since the function of interest (f(x)=2x) is linear and requires only one weight, we can vastly simplify the network by having 1 weight and 0 hidden layers. We're trying to debug a problem, so we should simplify as much as possible to eliminate sources of error. Using a hidden layer with multiple hidden nodes implies that we need to find matrices such that W1.dot(W2)=2 because we seek the function x.dot(W1).dot(W2), which is harder because changing 1 weight changes the entire product; finding the correct answer requires aligning all of those weights.
Because the function of interest is linear, we know that any use of nonlinear functions is a distraction. Also, saturation of sigmoid and tanh functions, or the dying ReLU phenomenon, could introduce additional problems into the optimization dynamics and prevent us from making progress. See: https://stats.stackexchange.com/questions/301285/what-is-vanishing-gradient
The learning rate is probably too large. I believe this is the problem because you're having numerical overflow; this can happen when the optimizer consistently overshoots the minimum. See: https://stats.stackexchange.com/questions/364360/how-can-change-in-cost-function-be-positive
Scaling the inputs and the targets of a regression problem can dramatically improve the optimizer dynamics. For an example, see https://stats.stackexchange.com/questions/432707/alternating-negative-and-positive-value-of-slope-and-y-intercept-in-gradient-des/432714#432714
Additional tips for training neural networks are here: https://stats.stackexchange.com/questions/352036/what-should-i-do-when-my-neural-network-doesnt-learn/352037#352037
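To make the first and fourth suggestions concrete, here is a minimal sketch of the simplified model: a single weight, no hidden layer, no activation, trained by per-sample gradient descent on squared error. The function name fit_single_weight, the input scaling to (0, 1], and the learning rate of 0.1 are assumptions chosen for illustration, not values taken from the question.

```python
import numpy as np

def fit_single_weight(xs, ys, learning_rate=0.1, epochs=200):
    """Fit y = w * x with one weight by per-sample gradient descent."""
    w = 0.0  # assumed starting value; the problem is convex in w
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = w * x - y
            # gradient of 0.5 * error**2 with respect to w is error * x
            w -= learning_rate * error * x
    return w

# Inputs scaled to (0, 1] so the updates stay small and stable (assumption).
xs = np.arange(1, 11, dtype=float) / 10.0
ys = 2.0 * xs

w = fit_single_weight(xs, ys)
prediction = w * 5.0  # the learned weight also applies outside the training range
print(w, prediction)
```

Because the model is linear, the learned weight extrapolates to inputs it never saw, which a network with saturating activations would not do as easily.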

I think you are missing a very important building block of artificial neural network architectures: the activation function, which squashes each layer's output into a range such as [0, 1] or [-1, 1].
Applying an activation function after computing each hidden layer's outputs may solve this problem, because the values propagating through the network stay within that range (for example [0, 1]), so overflow is much less likely.
Notes:
Sigmoid and tanh are the most popular activations and are suitable for your problem.
Your learning rate may be slightly too large; try 0.01.

Related

In backpropagation, what does it mean when the error of a neural network converges to 0.5?

I've been trying to learn the math behind neural networks and have implemented (in Octave) a version of the following equations which include bias terms.
Back-propagation equations matrix form:
Visual representation of the problem and Network:
clear; clc; close all;
# Initialize weights and bias from input to hidden layer
W1 = rand(3,4)
b1 = ones(3,1)
# Initialize weights from hidden to output
W2 = rand(2,3)
b2 = ones(2,1)
# Define the sigmoid function and its derivative
s = @(z) 1./(1 + exp(-z));
ds = @(z) s(z).*(1-s(z));
data = csvread("data.txt");
for j = 1 : 100
  for i = 1 : length(data)
    x0 = data(i,2:5)';
    # Find the truth
    if data(i,6) == 1 ;
      t = [1;0] ;
    else
      t = [0;1];
    end
    # Forward propagate
    x1 = s(W1*x0 + b1);
    x2 = s(W2*x1 + b2);
    iter = (j-1)*length(data) + i;
    E((j-1)*length(data) + i) = norm(x2-t)^2;
    E(length(E))
    # Back propagate
    delta2 = (x2-t).*ds(W2*x1+b2);
    delta1 = W2'*delta2.*ds(W1*x0+b1);
    dedw2 = delta2*x1';
    dedw1 = delta1*x0';
    alpha = 0.001*(40000-iter)/40000;
    W2 = W2 - alpha*dedw2;
    W1 = W1 - alpha*dedw1;
    b2 = b2 - alpha*delta2;
    b1 = b1 - alpha*delta1;
  end
end
plot(E)
title('Gradient Descent')
xlabel('Iteration')
ylabel('Error')
When I run this, I converge on weights that give a constant error of 0.5 rather than 0.0. The error plot looks something like this, depending on the initial samples of W1 and W2:
The resulting weights W1 and W2 yield output ~[0.5,0.5] for the whole set rather than [1,0](isStairs = true) or [0,1](isStairs = False)
Other information:
If I loop over a single data point instead of the entire learning set, it does converge to zero error for that particular case. (like 20 iterations or so), so I assume my derivatives are correct?
For the model to converge the learning rate has to be insanely small. Not sure what this means.
Is this neural network valid to solve the described problem? If so, what does it mean to converge to an error of 0.5?
A NN learns from data. If there is only one example, it will learn that example by heart and you get zero error. With more examples, the points will likely not lie on a nice curve but will be noisy, so it is harder for the network to learn the data by heart (this also depends on the number of free parameters the NN has, but you get the idea). However, you don't want the NN to learn everything in detail. You want it to learn the overall trend, not the noise. This also means that your error won't converge to zero, because the data contains noise that your NN should not learn. So don't worry if you have a (small) error at the end.
But what about the learning rate? Imagine you have 10 examples. Eight of them lie on a perfect line but two are noisy: one slightly to the right (let's say +1) and the other slightly to the left (-1). When the NN predicts one of those points and updates to minimize the resulting error, the update will jump from + to - or vice versa. Depending on your learning rate, this jumping may eventually converge to the middle (which is the correct function) or may go on forever. This is essentially what the learning rate does: it determines how much impact an estimation error has on the update of the network. So a good idea is to choose a larger learning rate at the beginning (where the network performs badly due to its random initialization) and decrease it once the network has learned something. You can achieve the same thing with a small learning rate throughout, but it will take longer ;)
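The jumping described above can be reproduced with a toy sketch: a single parameter is pulled toward targets that alternate between +1 and -1, so the correct answer is the middle, 0. The helper sgd_on_alternating_noise, the starting value 5.0, and both learning rate schedules are hypothetical choices for illustration only.

```python
def sgd_on_alternating_noise(steps, learning_rate_at):
    """Run SGD on 0.5 * (theta - target)**2 with targets alternating +1, -1."""
    theta = 5.0  # assumed poor starting point
    for k in range(steps):
        target = 1.0 if k % 2 == 0 else -1.0
        theta -= learning_rate_at(k) * (theta - target)
    return theta

# Constant learning rate: the estimate keeps bouncing around the middle.
theta_const = sgd_on_alternating_noise(1000, lambda k: 0.5)

# Decaying learning rate: large steps first, then ever smaller ones.
theta_decay = sgd_on_alternating_noise(1000, lambda k: 1.0 / (k + 1))

print(theta_const, theta_decay)
```

With the constant rate the final estimate is still about a third of the way toward the last noisy target, while the decaying schedule settles on the middle.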

How does the Keras Adam optimizer learning rate hyper-parameter relate to individually computed learning rates for network parameters?

So through my limited understanding of Adam (mainly through this post: https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c) I gather that the Adam optimizer computes individual learning rates for each parameter in a network.
But in the Keras docs (https://keras.io/optimizers/) the Adam optimizer takes a learning rate parameter.
My question is how does the learning rate parameter taken by the Adam object correlate to these computed learning rates? As far as I can tell, this isn't covered in the post linked (Or it is but it went over my head).
As this is a very specific question, I won't go into the mathematical details of Adam. I guess the line in the article saying that it computes individual learning rates for different parameters threw you off.
This is the screenshot of the actual Adam algorithm proposed in the paper https://arxiv.org/pdf/1412.6980.pdf
Adam keeps an exponentially decaying average of past gradients, so it behaves like a heavy ball with friction, which helps with faster convergence and stability.
But if you look at the algorithm, there is an alpha (step size); this is the equivalent of the Keras learning rate = 0.001 that we provide. The algorithm needs a step size to update the parameters (simply put, it's a scaling factor for the weight update). As for the varying learning rate (or update), look at the last equation: it uses m_t and v_t, which are updated in the loop, while alpha stays fixed throughout the algorithm. This alpha is the Keras learning rate that we have to provide.
Because alpha stays the same, we sometimes use learning rate scheduling, where we decrease the learning rate after a few epochs. There are other variations where we increase the learning rate first and then decrease it.
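To illustrate how the single alpha relates to the per-parameter updates, here is a small NumPy sketch of one Adam step following the update rule in the paper linked above. The function adam_step and the example gradients are hypothetical; this is not the Keras implementation.

```python
import numpy as np

def adam_step(grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a vector of parameters; returns (update, m, v)."""
    m = beta1 * m + (1 - beta1) * grad       # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # decaying average of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    update = alpha * m_hat / (np.sqrt(v_hat) + eps)
    return update, m, v

# Three parameters whose gradients differ by six orders of magnitude.
grads = np.array([1e-3, 1.0, 1e3])
update, m, v = adam_step(grads, np.zeros(3), np.zeros(3), t=1)
update_2x, _, _ = adam_step(grads, np.zeros(3), np.zeros(3), t=1, alpha=0.002)
print(update, update_2x)
```

Each parameter gets its own scaling through m_hat and v_hat, which is the "individual learning rate" part, but every update is multiplied by the same alpha. On the first step all three updates come out close to alpha despite the very different gradients, and doubling alpha doubles all of them.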
Just wanted to add this, in case an implementation/example in 1-D clarifies anything:
import numpy as np
from math import sqrt
eps = 1e-6  # tolerance, also keeps the denominator away from zero
delta = 1e-6  # step for the finite-difference derivative
MAX_ITER = 100000
def f(x):
    return (np.square(x) / 10) - 2*np.sin(x)
def df(x):
    # backward finite-difference approximation of f'(x)
    return (f(x) - f(x - delta))/delta
def main():
    x_0 = -13  # initial position
    a = 0.1  # step size / learning rate
    x_k = x_0
    B_1 = 0.99  # first decay rate
    B_2 = 0.999  # second decay rate
    i = 0
    m_k = df(x_k)
    d_k = df(x_k)**2
    while True:
        # update moment estimates and parameters
        m_k = B_1 * m_k + (1 - B_1) * df(x_k)
        d_k = B_2 * d_k + (1 - B_2) * (df(x_k)**2)
        x_k = x_k - a * m_k / sqrt(d_k + eps)
        # termination criterion
        if abs(df(x_k)/df(x_0)) <= eps:
            break
        if i > MAX_ITER:
            break
        i = i+1
    return x_k, i
if __name__ == "__main__":
    x_min, iterations = main()
    print(x_min, iterations)

Is there an optimizer in keras based on precision or recall instead of loss?

I am developing a segmentation neural network with only two classes, 0 and 1 (0 is the background and 1 is the object that I want to find in the image). On each image, there are about 80% of 1 and 20% of 0. As you can see, the dataset is unbalanced and it makes the results wrong. My accuracy is 85% and my loss is low, but that is only because my model is good at finding the background!
I would like to base the optimizer on another metric, like precision or recall, which is more useful in this case.
Does anyone know how to implement this?
You don't optimize precision or recall directly. You just track them as validation scores to pick the best weights. Do not mix up loss, optimizer, and metrics; they are not meant for the same thing.
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
THRESHOLD = 0.5
def precision(y_true, y_pred, threshold_shift=0.5-THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)
    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)
    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))
    precision = tp / (tp + fp)
    return precision
def recall(y_true, y_pred, threshold_shift=0.5-THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)
    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)
    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))
    recall = tp / (tp + fn)
    return recall
def fbeta(y_true, y_pred, beta = 2, threshold_shift=0.5-THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)
    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)
    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_squared = beta ** 2
    return (beta_squared + 1) * (precision * recall) / (beta_squared * precision + recall)
def model_fit(X,y,X_test,y_test):
    # weight class 1 by the inverse of its frequency in y
    class_weight={
        1: 1/(np.sum(y) / len(y)),
        0:1}
    np.random.seed(47)
    model = Sequential()
    model.add(Dense(1000, input_shape=(X.shape[1],)))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(500))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(250))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    # the loss drives training; fbeta, precision and recall are only reported
    model.compile(loss='binary_crossentropy', optimizer='adamax',metrics=[fbeta,precision,recall])
    model.fit(X, y,validation_data=(X_test,y_test), epochs=200, batch_size=50, verbose=2,class_weight = class_weight)
    return model
No. To do a 'gradient descent', you need to compute a gradient. For this, the function needs to be reasonably smooth. Precision/recall or accuracy is not a smooth function: it has only sharp edges, where the gradient is undefined, and flat regions, where the gradient is zero. Hence you cannot use a gradient-based numerical method to find a minimum of such a function; you would have to use some kind of combinatorial optimization, and that would be NP-hard.
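The flat regions mentioned above are easy to see numerically. The following NumPy sketch nudges a few predicted probabilities by a small amount: thresholded recall does not move at all, while binary cross-entropy does. The helper names and the example arrays are made up for illustration.

```python
import numpy as np

def hard_recall(y_true, y_prob, threshold=0.5):
    """Recall after thresholding the probabilities (piecewise constant)."""
    y_hat = (y_prob >= threshold).astype(float)
    tp = np.sum(y_true * y_hat)
    fn = np.sum(y_true * (1 - y_hat))
    return tp / (tp + fn)

def binary_cross_entropy(y_true, y_prob):
    """Mean binary cross-entropy (smooth in y_prob)."""
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1.0, 1.0, 0.0, 1.0])
y_prob = np.array([0.9, 0.4, 0.2, 0.6])
nudged = y_prob + 1e-3  # a small step, as an optimizer would take

recall_change = hard_recall(y_true, nudged) - hard_recall(y_true, y_prob)
loss_change = binary_cross_entropy(y_true, nudged) - binary_cross_entropy(y_true, y_prob)
print(recall_change, loss_change)
```

Since the recall does not respond to the nudge, it gives the optimizer no direction to move in, whereas the cross-entropy changes with every small step.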
As others have stated, precision/recall is not directly usable as a loss function. However, better proxy loss functions have been found that help with a whole family of precision/recall related functions (e.g. ROC AUC, precision at fixed recall, etc.)
The research paper Scalable Learning of Non-Decomposable Objectives covers this with a method to sidestep the combinatorial optimization by the use of certain calculated bounds, and some Tensorflow code by the authors is available at the tensorflow/models repository. Additionally, there is a followup question on StackOverflow that has an answer that adapts this into a usable Keras loss function.
Special thanks to Francois Chollet and other participants on the Keras issue thread here that turned up that research paper. You may also find that thread provides other useful insights into the problem at hand.
Having the same problem with an unbalanced dataset, I'd suggest you use the F1 score as the metric of your optimizer.
Andrew Ng teaches that having ONE metric for the model is the simplest (best?) way to train a model. If you have 2 metrics, like precision and recall - it's not clear which one is more important. Trying to set limits on one metric obviously impacts the other metric...
The F1 score combines recall and precision: it is their harmonic mean.
The version of Keras that I'm using unfortunately has no implementation of the F1 score as a metric, unlike accuracy and the many other Keras metrics listed at https://keras.io/api/metrics/.
I found an implementation of the F1 score as a Keras metric, evaluated at each epoch, at:
https://medium.com/@aakashgoel12/how-to-add-user-defined-function-get-f1-score-in-keras-metrics-3013f979ce0d
I've implemented the simple function from the above article and the model trains now on F1 score as its Keras optimizer metric. Results on test: accuracy went down a bit and F1 score went up a lot.
I have the same problem with an unbalanced dataset for binary classification, and I want to increase the recall (sensitivity) too. I found out that there is a built-in recall metric in tf.keras that can be used in the compile statement as follows (opt is whatever optimizer you already use):
from tensorflow.keras.metrics import Recall, BinaryAccuracy
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=[BinaryAccuracy(), Recall()])
The recommended approach to deal with an unbalanced dataset like yours is to use the class_weight or sample_weight arguments. See the model fit API for details.
Quote:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
With weights that are inversely proportional to the class frequency the loss will avoid just predicting the background class.
I understand that this is not how you formulated the question but imho it is the most practical approach to the issue you are facing.
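As a rough sketch of what "inversely proportional to the class frequency" can look like, the helper below builds a class_weight dictionary from a label array. The function name inverse_frequency_weights and the normalisation (total / (number of classes * count)) are assumptions; any weights with the same ratio between classes would have a similar effect.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Map each class index to a weight inversely proportional to its frequency."""
    classes, counts = np.unique(labels, return_counts=True)
    total = counts.sum()
    return {int(c): float(total / (len(classes) * n)) for c, n in zip(classes, counts)}

# Toy labels with the 80% / 20% split described in the question.
labels = np.array([1] * 8 + [0] * 2)
class_weight = inverse_frequency_weights(labels)
print(class_weight)
```

The resulting dictionary can be passed as the class_weight argument of model.fit, so that mistakes on the rarer class contribute more to the loss.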
I think that the Callbacks and Early Stopping mechanisms provide one with techniques that can lead you as close as possible to what you want to achieve. Please read the following article by Jason Brownlee about Early Stopping (read to the end!):
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/

How to find if a data set can train a neural network?

I'm a newbie to machine learning and this is one of the first real-world ML tasks I've been challenged with.
Some experimental data contains 512 independent boolean features and a boolean result.
There are about 1e6 real experiment records in the provided data set.
In the classic XOR example, all 4 out of 4 possible states are required to train a NN. In my case the data covers only about 2^(20-512) = 2^-492 of the possible states (1e6 ≈ 2^20), which is close to zero.
I have no more information about the nature of the data, just these (512 + 1) * 1e6 bits.
I tried a NN with 1 hidden layer on the available data. The output of the trained NN, even on samples from the training set, is always close to 0; not a single one is close to "1". I played with the weight initialization and the gradient descent learning rate.
My code uses TensorFlow 1.3 and Python 3. Model excerpt:
with tf.name_scope("Layer1"):
    #W1 = tf.Variable(tf.random_uniform([512, innerN], minval=-2/512, maxval=2/512), name="Weights_1")
    W1 = tf.Variable(tf.zeros([512, innerN]), name="Weights_1")
    b1 = tf.Variable(tf.zeros([1]), name="Bias_1")
    Out1 = tf.sigmoid( tf.matmul(x, W1) + b1)
with tf.name_scope("Layer2"):
    W2 = tf.Variable(tf.random_uniform([innerN, 1], minval=-2/512, maxval=2/512), name="Weights_2")
    #W2 = tf.Variable(tf.zeros([innerN, 1]), name="Weights_2")
    b2 = tf.Variable(tf.zeros([1]), name="Bias_2")
    y = tf.nn.sigmoid( tf.matmul(Out1, W2) + b2)
with tf.name_scope("Training"):
    y_ = tf.placeholder(tf.float32, [None,1])
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels = y_, logits = y)
    )
    train_step = tf.train.GradientDescentOptimizer(0.005).minimize(cross_entropy)
with tf.name_scope("Testing"):
    # Test trained model
    correct_prediction = tf.equal( tf.round(y), tf.round(y_))
# ...
# Train
for step in range(500):
    batch_xs, batch_ys = Datasets.train.next_batch(300, shuffle=False)
    _, my_y, summary = sess.run([train_step, y, merged_summaries],
                                feed_dict={x: batch_xs, y_: batch_ys})
I suspect two cases:
my fault – bad NN implementation, wrong architecture;
bad data. Compared to XOR example, incomplete training data would result in a failing NN. However, the training examples fed to the trained NN are supposed to give right predictions, aren't they?
How can I evaluate whether it is possible at all to train a neural network (a 2-layer perceptron) on the provided data to forecast the result? An example of an acceptable set would be the XOR example, as opposed to some random noise.
There are only ad hoc ways to know if it is possible to learn a function with a differentiable network from a dataset. That said, these ad hoc ways do usually work. For example, the network should be able to overfit the training set without any regularisation.
A common technique to gauge this is to only fit the network on a subset of the full dataset. Check that the network can overfit to that, then increase the size of the subset, and increase the size of the network as well. Unfortunately, deciding whether to add extra layers or add more units in a hidden layer is an arbitrary decision you'll have to make.
However, looking at your code, there are a few things that could be going wrong here:
Are your outputs balanced? By that I mean, do you have the same number of 1s as 0s in the dataset targets?
Your initialisation in the first layer is all zeros, the gradient to this will be zero, so it can't learn anything (although, you have a real initialisation above it commented out).
Sigmoid nonlinearities are more difficult to optimise than simpler nonlinearities, such as ReLUs.
I'd recommend using the built-in definitions for layers in Tensorflow to not worry about initialisation, and switching to ReLUs in any hidden layers (you need sigmoid at the output for your boolean target).
Finally, deep learning isn't actually very good at most "bag of features" machine learning problems because they lack structure. For example, the order of the features doesn't matter. Other methods often work better, but if you really want to use deep learning then you could look at this recent paper, showing improved performance by just using a very specific nonlinearity and weight initialisation (change 4 lines in your code above).

Using RNN to recover sine wave from noisy signal

I am involved with an application that needs to estimate the state of a certain system in real time by measuring a set of (non-linearly) dependent parameters. Up until now the application was using an extended Kalman filter, but it was found to be underperforming in certain circumstances, which is likely caused by the fact that the differences between the real system and its model used in the filter are too significant to be modeled as white noise. We cannot use a more precise model for a number of unrelated reasons.
We decided to try recurrent neural networks for the task. Since my experience with neural networks is quite limited, before tackling the real task itself, I decided to practice with a hand crafted problem first. That problem, however, I could not solve, so I'm asking for help here.
Here's what I did: I generated some sine waveforms of varying phase, frequency, amplitude, and offset. Then I distorted the waveforms with some white noise, and (unsuccessfully) attempted to train an LSTM network to recover my waveforms from the noisy signal. I expected that the network will eventually learn to fit a sine waveform into the noisy data set.
Here's the source (slightly abridged, but it should work):
#!/usr/bin/env python3
import time
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.layers.wrappers import TimeDistributed
from keras.objectives import mean_absolute_error, cosine_proximity
POINTS_PER_WF = int(1e4)
X_SPACE = np.linspace(0, 100, POINTS_PER_WF)
def make_waveform_with_noise():
    def add_noise(vec):
        stdev = float(np.random.uniform(0.01, 0.2))
        return vec + np.random.normal(0, stdev, size=len(vec))
    f = np.random.choice((np.sin, np.cos))
    wf = f(X_SPACE * np.random.normal(scale=5)) *\
         np.random.normal(scale=5) + np.random.normal(scale=50)
    return wf, add_noise(wf)
RESCALING = 1e-3
BATCH_SHAPE = (1, POINTS_PER_WF, 1)
model = Sequential([
    TimeDistributed(Dense(5, activation='tanh'), batch_input_shape=BATCH_SHAPE),
    LSTM(20, activation='tanh', inner_activation='sigmoid', return_sequences=True),
    LSTM(20, activation='tanh', inner_activation='sigmoid', return_sequences=True),
    TimeDistributed(Dense(1, activation='tanh'))
])
def compute_loss(y_true, y_pred):
    skip_first = POINTS_PER_WF // 2
    y_true = y_true[:, skip_first:, :] * RESCALING
    y_pred = y_pred[:, skip_first:, :] * RESCALING
    me = mean_absolute_error(y_true, y_pred)
    cp = cosine_proximity(y_true, y_pred)
    return me + cp
model.summary()
model.compile(optimizer='adam', loss=compute_loss,
              metrics=['mae', 'cosine_proximity'])
NUM_ITERATIONS = 30000
for iteration in range(NUM_ITERATIONS):
    wf, noisy_wf = make_waveform_with_noise()
    y = wf.reshape(BATCH_SHAPE) * RESCALING
    x = noisy_wf.reshape(BATCH_SHAPE) * RESCALING
    info = model.train_on_batch(x, y)
model.save_weights('final.hdf5')
The first dense layer is actually useless, the reason I added it is because I wanted to make sure I can successfully combine LSTM and time distributed dense layers, since my real application will likely need that setup.
The error function was modified a number of times. Initially I was using plain mean squared error, but the training process was extremely slow, and it was mostly converging to simply copying the input noisy signal into the output. The cosine proximity metric I added later essentially defines the degree of similarity between the shapes of the functions; it seemed to speed up the learning quite a bit. Also note that I'm applying the loss function only to the last half of the dataset; the motivation for that is that I expected that the network will need to see a few periods of the signal in order to be able to correctly identify the parameters of the waveform. However, I found that this modification has no visible effect on the performance of the network.
The latest version of the script uses the Adam optimizer. I also experimented with RMSProp with varying learning rate and decay settings, but I found no noticeable difference in the behavior of the network.
I am using Theano 0.9 (dev) backend configured to use 64 bit floating point, in order to prevent possible issues with numerical stability. The epsilon value is set accordingly to 1e-14.
This is what the output looks like after 15k..30k training steps (performance stops improving starting from about 15k steps) (the first plot is zoomed in for the sake of clarity):
Plot legend:
blue (0) - noisy signal, input of the RNN
green (1) - recovered signal, output of the RNN
red (2) - ground truth
My question is: what am I doing wrong?
