Ambiguity in recurrent neural network training in Julia Flux - machine-learning

I'm using Julia's Flux library to learn about neural networks. According to the documentation for train! (where train! takes arguments (loss, params, data, opt)):
For each datapoint d in data, compute the gradient of loss with respect to params through backpropagation and call the optimizer opt.
(see source for train!: https://github.com/FluxML/Flux.jl/blob/master/src/optimise/train.jl)
For a conventional NN based on Dense -- let's say with a one-dimensional input and output, i.e. with one feature -- this is easy to understand. Each element in data is a pair of single numbers, an independent sample of 1-d input/output values. train! does forward- and backpropagation on each pair of 1-d samples one at a time. In the process, the loss function is evaluated on each sample. (Do I have this right?)
My question is: how does this extend to a recurrent NN? Take the case of an RNN with 1-d (i.e. one feature) input and output. It seems like there's some ambiguity in how to structure the input and output data, and the results change based on the structure. As one example:
x = [[1], [2], [3]]
y = [4, 5, 6]
data = zip(x, y)
m = RNN(1, 1)
opt = Descent()
loss(x, y) = sum((Flux.stack(m.(x), 1) .- y) .^ 2)
train!(loss, params(m), data, opt)
(loss function taken from: https://github.com/FluxML/Flux.jl/blob/master/docs/src/models/recurrence.md)
In this example, when train! loops through each sample (for d in data), each value of d is a pair of single values from x and y, e.g. ([1], 4). loss is evaluated based on these single values. This is the same as in the Dense case.
On the other hand, consider:
x = [[[1], [2], [3]]]
y = [[4, 5, 6]]
m = RNN(1, 1)
opt = Descent()
loss(x, y) = sum((Flux.stack(m.(x), 1) .- y) .^ 2)
train!(loss, params(m), zip(x, y), opt)
Note that the only difference here is that x and y are nested in an extra pair of square brackets. As a result there's only one d in data, and it's a pair of sequences: ([[1], [2], [3]], [4, 5, 6]). loss can be evaluated on this version of d, and it returns a 1-d value, as required for training. But the value returned by loss is different than in any of the three results from the previous case, so the training process turns out differently.
The point is that both structures are valid in the sense that loss and train! handle them without error. Conceptually, I can make an argument for both structures being correct. But the results are different, and I assume that only one way is right. In other words, for training an RNN, should each d in data be a whole sequence, or a single element from a sequence?

Related

how are the leaf values of xgboost regression trees relate to the prediction

It seems that the sum of corresponding leaf values of each tree doesn't equal to the prediction. Here is a sample code:
X = pd.DataFrame({'x': np.linspace(-10, 10, 10)})
y = X['x'] * 2
model = xgb.XGBRegressor(booster='gbtree', tree_method='exact', n_estimators=100, max_depth=1).fit(X, y)
Xtest = pd.DataFrame({'x': np.linspace(-20, 20, 101)})
Ytest = model.predict(Xtest)
plt.plot(X['x'], y, 'b.-')
plt.plot(Xtest['x'], Ytest, 'r.')
The tree dumps reads:
model.get_booster().get_dump()[:2]
['0:[x<0] yes=1,no=2,missing=1\n\t1:leaf=-2.90277791\n\t2:leaf=2.65277767\n',
'0:[x<2.22222233] yes=1,no=2,missing=1\n\t1:leaf=-1.90595233\n\t2:leaf=2.44333339\n']
If I only use one tree to do prediction:
Ytest2 = model.predict(Xtest, ntree_limit=1)
plt.plot(XX1['x'], Ytest2, '.')
np.unique(Ytest2) # array([-2.4028, 3.1528], dtype=float32)
Clearly, Ytest2's unique values does not corresponds to the leaf value of the first tree, which is -2.90277791 and 2.65277767, although the observed split point is right at 0.
How are the leaf values related to the predictions?
Why are the leaf values in the first tree not symmetric, provided that the input is symmetric?
Before fitting the first tree, xgboost makes an initial prediction. This is controlled by the parameter base_score, which defaults to 0.5. And indeed, -2.902777 + 0.5 ~=-2.4028 and 2.652777 + 0.5 ~= 3.1528.
That also explains your second question: the differences from that initial prediction are not symmetric. If you set learning_rate=1 you probably could get the predictions to be symmetric after one round, or you could just set base_score=0.

Simple RNN example showing numerics

I'm trying to understand RNNs and I would like to find a simple example that actually shows the one hot vectors and the numerical operations. Preferably conceptual since actual code may make it even more confusing. Most examples I google just show boxes with loops coming out of them and its really difficult to understand what exactly is going on. In the rare case where they do show the vectors its still difficult to see how they are getting the values.
for example I don't know where the values are coming from in this picture https://i1.wp.com/karpathy.github.io/assets/rnn/charseq.jpeg
If the example could integrate LSTMs and other popular extensions that would be cool too.
In the simple RNN case, a network accepts an input sequence x and produces an output sequence y while a hidden sequence h stores the network's dynamic state, such that at timestep i: x(i) ∊ ℝM, h(i) ∊ ℝN, y(i) ∊ ℝP the real valued vectors of M/N/P dimensions corresponding to input, hidden and output values respectively. The RNN changes its state and omits output based on the state equations:
h(t) = tanh(Wxh ∗ [x(t); h(t-1)]), where Wxh a linear map: ℝM+N ↦ ℝN, * the matrix multiplication and ; the concatenation operation. Concretely, to obtain h(t) you concatenate x(t) with h(t-1), you apply matrix multiplication between Wxh (of shape (M+N, N)) and the concatenated vector (of shape M+N) , and you use a tanh non-linearity on each element of the resulting vector (of shape N).
y(t) = sigmoid(Why * h(t)), where Why a linear map: ℝN ↦ ℝP. Concretely, you apply matrix multiplication between Why (of shape (N, P)) and h(t) (of shape N) to obtain a P-dimensional output vector, on which the sigmoid function is applied.
In other words, obtaining the output at time t requires iterating through the above equations for i=0,1,...,t. Therefore, the hidden state acts as a finite memory for the system, allowing for context-dependent computation (i.e. h(t) fully depends on both the history of the computation and the current input, and so does y(t)).
In the case of gated RNNs (GRU or LSTM), the state equations get somewhat harder to follow, due to the gating mechanisms which essentially allow selection between the input and the memory, but the core concept remains the same.
Numeric Example
Let's follow your example; we have M = 4, N = 3, P = 4, so Wxh is of shape (7, 3) and Why of shape (3, 4). We of course do not know the values of either W matrix, so we cannot reproduce the same results; we can still follow the process though.
At timestep t<0, we have h(t) = [0, 0, 0].
At timestep t=0, we receive input x(0) = [1, 0, 0, 0]. Concatenating x(0) with h(0-), we get [x(t); h(t-1)] = [1, 0, 0 ..., 0] (let's call this vector u to ease notation). We apply u * Wxh (i.e. multiplying a 7-dimensional vector with a 7 by 3 matrix) and get a vector v = [v1, v2, v3], where vi = Σj uj Wji = u1 W1i + u2 W2i + ... + u7 W7i. Finally, we apply tanh on v, obtaining h(0) = [tanh(v1), tanh(v2), tanh(v3)] = [0.3, -0.1, 0.9]. From h(0) we can also get y(0) via the same process; multiply h(0) with Why (i.e. 3 dimensional vector with a 3 by 4 matrix), get a vector s = [s1, s2, s3, s4], apply sigmoid on s and get σ(s) = y(0).
At timestep t=1, we receive input x(1) = [0, 1, 0, 0]. We concatenate x(1) with h(0) to get a new u = [0, 1, 0, 0, 0.3, -0.1, 0.9]. u is again multiplied with Wxh, and tanh is again applied on the result, giving us h(1) = [1, 0.3, 1]. Similarly, h(1) is multiplied by Why, giving us a new s vector on which we apply the sigmoid to obtain σ(s) = y(1).
This process continues until the input sequence finishes, ending the computation.
Note: I have ignored bias terms in the above equations because they do not affect the core concept and they make notation impossible to follow

Create a List and Use it in Loss Function Tensorflow

I am trying to create a list based on my neural network outputs and use it in Tensorflow as a loss function.
Assume that results is list of size [1, batch_size] that is output by a neural network. I check to see whether the first value of this list is in a specific range passed in as a placeholder called valid_range, and if it is add 1 to a list. If it is not, add -1. The goal is to make all predictions of the network in the range, so the correct predictions is a tensor of all 1, which I call correct_predictions.
values_list = []
for j in range(batch_size):
a = results[0, j] >= valid_range[0]
b = result[0, j] <= valid_range[1]
c = tf.logical_and(a, b)
if (c == 1):
values_list.append(1)
else:
values_list.append(-1.)
values_list_tensor = tf.convert_to_tensor(values_list)
correct_predictions = tf.ones([batch_size, ], tf.float32)
Now, I want to use this as a loss function in my network, so that I can force all the predictions to be in the specified range. I try to train like this:
loss = tf.reduce_mean(tf.squared_difference(values_list_tensor, correct_predictions))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip_threshold)
optimize = optimizer.apply_gradients(zip(gradients, variables))
This, however, has a problem and throws an error on the last optimize line, saying:
ValueError: No gradients provided for any variable: ['<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d4afd0>',
'<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d66050>'
...
I tried to debug this in Tensorboard, and I notice that the list I am creating does not appear in the graph, so basically the x part of the loss function is not part of the network itself. Is there some way to accurately create a list based on the predictions of a neural network and use it in the loss function in Tensorflow to train the network?
Please help, I have been stuck on this for a few days now.
Edit:
Following what was suggested in the comments, I decided to use a l2 loss function, multiplying it by the binary vector I had from before values_list_tensor. The binary vector now has values 1 and 0 instead of 1 and -1. This way when the prediction is in the range the loss is 0, else it is the normal l2 loss. As I am unable to see the values of the tensors, I am not sure if this is correct. However, I can view the final loss and it is always 0, so something is wrong here. I am unsure if the multiplication is being done correctly and if values_list_tensor is calculated accurately? Can someone help and tell me what could be wrong?
loss = tf.reduce_mean(tf.nn.l2_loss(tf.matmul(tf.transpose(tf.expand_dims(values_list_tensor, 1)), tf.expand_dims(result[0, :], 1))))
Thanks
To answer the question in the comment. One way to write a piece-wise function is using tf.cond. For example, here is a function that returns 0 in [-1, 1] and x everywhere else:
sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32)
y = tf.cond(tf.logical_or(tf.greater(x, 1.0), tf.less(x, -1.0)), lambda : x, lambda : 0.0)
y.eval({x: 1.5}) # prints 1.5
y.eval({x: 0.5}) # prints 0.0

How tf.gradients work in TensorFlow

Given I have a linear model as the following I would like to get the gradient vector with regards to W and b.
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
# Construct a linear model
pred = tf.add(tf.mul(X, W), b)
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
However if I try something like this where cost is a function of cost(x,y,w,b) and I only want to gradients with respect to w and b:
grads = tf.gradients(cost, tf.all_variable())
My placeholders will also be included (X and Y).
Even if I do get a gradient with [x,y,w,b] how do I know which element in the gradient that belong to each parameter since it is just a list without names to which parameter the derivative has be taken with regards to?
In this question I'm using parts of this code and I build on this question.
Quoting the docs for tf.gradients
Constructs symbolic partial derivatives of sum of ys w.r.t. x in xs.
So, this should work:
dc_dw, dc_db = tf.gradients(cost, [W, b])
Here, tf.gradients() returns the gradient of cost wrt each tensor in the second argument as a list in the same order.
Read tf.gradients for more information.

SVM with RBF: Decision values tend to be equal to the negative of the bias term for faraway test samples

Using RBF kernel in SVM, why the decision value of test samples faraway from the training ones tend to be equal to the negative of the bias term b?
A consequence is that, once the SVM model is generated, if I set the bias term to 0, the decision value of test samples faraway from the training ones tend to 0. Why it happens?
Using the LibSVM, the bias term b is the rho. The decision value is the distance from the hyperplane.
I need to understand what defines this behavior. Does anyone understand that?
Running the following R script, you can see this behavior:
library(e1071)
library(mlbench)
data(Glass)
set.seed(2)
writeLines('separating training and testing samples')
testindex <- sort(sample(1:nrow(Glass), trunc(nrow(Glass)/3)))
training.samples <- Glass[-testindex, ]
testing.samples <- Glass[testindex, ]
writeLines('normalizing samples according to training samples between 0 and 1')
fnorm <- function(ran, data) {
(data - ran[1]) / (ran[2] - ran[1])
}
minmax <- data.frame(sapply(training.samples[, -10], range))
training.samples[, -10] <- mapply(fnorm, minmax, training.samples[, -10])
testing.samples[, -10] <- mapply(fnorm, minmax, testing.samples[, -10])
writeLines('making the dataset binary')
training.samples$Type <- factor((training.samples$Type == 1) * 1)
testing.samples$Type <- factor((testing.samples$Type == 1) * 1)
writeLines('training the SVM')
svm.model <- svm(Type ~ ., data=training.samples, cost=1, gamma=2**-5)
writeLines('predicting the SVM with outlier samples')
points = c(0, 0.8, 1, # non-outliers
1.5, -0.5, 2, -1, 2.5, -1.5, 3, -2, 10, -9) # outliers
outlier.samples <- t(sapply(points, function(p) rep(p, 9)))
svm.pred <- predict(svm.model, testing.samples[, -10], decision.values=TRUE)
svm.pred.outliers <- predict(svm.model, outlier.samples, decision.values=TRUE)
writeLines('') # printing
svm.pred.dv <- c(attr(svm.pred, 'decision.values'))
svm.pred.outliers.dv <- c(attr(svm.pred.outliers, 'decision.values'))
names(svm.pred.outliers.dv) <- points
writeLines('test sample decision values')
print(head(svm.pred.dv))
writeLines('non-outliers and outliers decision values')
print(svm.pred.outliers.dv)
writeLines('svm.model$rho')
print(svm.model$rho)
writeLines('')
writeLines('<< setting svm.model$rho to 0 >>')
writeLines('predicting the SVM with outlier samples')
svm.model$rho <- 0
svm.pred <- predict(svm.model, testing.samples[, -10], decision.values=TRUE)
svm.pred.outliers <- predict(svm.model, outlier.samples, decision.values=TRUE)
writeLines('') # printing
svm.pred.dv <- c(attr(svm.pred, 'decision.values'))
svm.pred.outliers.dv <- c(attr(svm.pred.outliers, 'decision.values'))
names(svm.pred.outliers.dv) <- points
writeLines('test sample decision values')
print(head(svm.pred.dv))
writeLines('non-outliers and outliers decision values')
print(svm.pred.outliers.dv)
writeLines('svm.model$rho')
print(svm.model$rho)
Comments about the code:
It uses a dataset of 9 dimensions.
It splits the dataset into training and testing.
It normalizes the samples between 0 and 1 for all dimensions.
It makes the problem to be binary.
It fits a SVM model.
It predicts the testing samples, getting the decision values.
It predicts some synthetic (outlier) samples outside [0, 1] in the feature space, getting the decision values.
It shows that the decision value for outliers tends to be the negative of the bias term b generated by the model.
It sets the bias term b to 0.
It predicts the testing samples, getting the decision values.
It predicts some synthetic (outlier) samples outside [0, 1] in the feature space, getting the decision values.
It shows that the decision value for outliers tends to be 0.
Do you mean negative of the bias term instead of inverse?
The decision function of the SVM is sign(w^T x - rho), where rho is the bias term , w is the weight vector, and x is the input. But thats in the primal space / linear form. w^T x is replaced by our kernel function, which in this case is the RBF kernel.
The RBF kernel is defined as . So if the distance between two things is very large, then it gets squared - we get a huge number. γ is a positive number, so we are making our huge giant value a huge giant negative value. exp(-10) is already on the order of 5*10^-5, so for far away points the RBF kernel is going to become essentailly zero. If sample is far aware from all of your training data, than all of the kernel products will be nearly zero. that means w^T x will be nearly zero. And so what you are left with is essentially sign(0-rho), ie: the negative of your bias term.

Resources