How do you implement softmax with Tensorflow.JS

Using Tensorflow.JS, I am trying to get a machine learning model running with a last dense layer using a softmax activation function. When I try to run it, I receive:
Error when checking target: expected dense_Dense2 to have shape [,1], but got array with shape [3,2].
If I comment out the fit function and run it, I get back a 1x2 array (as expected, since I have 2 units on the last layer and I am only entering one test sample).
Additionally, when I change the y variable to this array: [[1,2,3]], it trains (but I don't think this is correct, since the ys do not match the shape of the last layer, which has 2 units).
Any advice or help would be appreciated to fill in my gap of knowledge.
var tf = require('@tensorflow/tfjs');
let xs = tf.tensor([
let ys = tf.tensor([
async function createModel () {
const model = tf.sequential();
model.add(tf.layers.dense({inputShape: [2], units: 32, activation: "relu"}));
model.add(tf.layers.dense({units: 2}));
model.compile({loss: "sparseCategoricalCrossentropy",optimizer:"adam"});
//await, ys, {epochs:1000});

Here is the softmax activation on the last layer:
const model = tf.sequential();
model.add(tf.layers.dense({inputShape: [2], units: 32, activation: "relu"}));
model.add(tf.layers.dense({units: 2, activation:'softmax'}));
model.compile({loss: "sparseCategoricalCrossentropy",optimizer:"adam"});
For the error:
Error when checking target: expected dense_Dense2 to have shape [,1], but got array with shape [3,2].
You can consider the answer given here
Your error is related to the loss function sparseCategoricalCrossentropy, which expects the labels as a tensor1d. If you change the loss function to categoricalCrossentropy, it will work. Both losses compute the same thing; only the way the labels are encoded differs. However, the labels as given in the question are encoded for neither categoricalCrossentropy nor sparseCategoricalCrossentropy.
When using sparseCategoricalCrossentropy, the labels are a 1d tensor where each value is the index of the category.
When using categoricalCrossentropy, the labels are a 2d tensor, i.e. a one-hot encoding.
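To make the two label encodings concrete, here is a minimal sketch in plain Python (not Tensorflow.JS). The helper names softmax, sparse_cross_entropy and categorical_cross_entropy are hypothetical and written only for this illustration; the logits are made-up numbers. It suggests that a class index and the matching one-hot vector give the same loss value for one sample.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(probs, class_index):
    # Label is a single integer: the index of the true class.
    return -math.log(probs[class_index])

def categorical_cross_entropy(probs, one_hot):
    # Label is a one-hot vector with the same length as probs.
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = softmax([2.0, 0.5])      # output of a 2-unit softmax layer (made-up logits)
sparse_label = 1                 # encoding for the sparse loss
one_hot_label = [0.0, 1.0]       # encoding for the categorical loss

loss_sparse = sparse_cross_entropy(probs, sparse_label)
loss_onehot = categorical_cross_entropy(probs, one_hot_label)
```

For a batch of 3 samples and 2 classes, this would correspond to labels of shape [3] for the sparse loss and [3, 2] for the categorical loss.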


Ambiguity in recurrent neural network training in Julia Flux

I'm using Julia's Flux library to learn about neural networks. According to the documentation for train! (where train! takes arguments (loss, params, data, opt)):
For each datapoint d in data, compute the gradient of loss with respect to params through backpropagation and call the optimizer opt.
(see source for train!:
For a conventional NN based on Dense -- let's say with a one-dimensional input and output, i.e. with one feature -- this is easy to understand. Each element in data is a pair of single numbers, an independent sample of 1-d input/output values. train! does forward- and backpropagation on each pair of 1-d samples one at a time. In the process, the loss function is evaluated on each sample. (Do I have this right?)
My question is: how does this extend to a recurrent NN? Take the case of an RNN with 1-d (i.e. one feature) input and output. It seems like there's some ambiguity in how to structure the input and output data, and the results change based on the structure. As one example:
x = [[1], [2], [3]]
y = [4, 5, 6]
data = zip(x, y)
m = RNN(1, 1)
opt = Descent()
loss(x, y) = sum((Flux.stack(m.(x), 1) .- y) .^ 2)
train!(loss, params(m), data, opt)
(loss function taken from:
In this example, when train! loops through each sample (for d in data), each value of d is a pair of single values from x and y, e.g. ([1], 4). loss is evaluated based on these single values. This is the same as in the Dense case.
On the other hand, consider:
x = [[[1], [2], [3]]]
y = [[4, 5, 6]]
m = RNN(1, 1)
opt = Descent()
loss(x, y) = sum((Flux.stack(m.(x), 1) .- y) .^ 2)
train!(loss, params(m), zip(x, y), opt)
Note that the only difference here is that x and y are nested in an extra pair of square brackets. As a result there's only one d in data, and it's a pair of sequences: ([[1], [2], [3]], [4, 5, 6]). loss can be evaluated on this version of d, and it returns a 1-d value, as required for training. But the value returned by loss is different than in any of the three results from the previous case, so the training process turns out differently.
The point is that both structures are valid in the sense that loss and train! handle them without error. Conceptually, I can make an argument for both structures being correct. But the results are different, and I assume that only one way is right. In other words, for training an RNN, should each d in data be a whole sequence, or a single element from a sequence?

How to create custom (convolution) connection between two different keras layers

I am implementing a custom connection between two different keras layers. The neural network begins something like below:
model = tf.keras.Sequential()
c1 = model.add(Conv2D(6, kernel_size=[5,5], strides=(stride,stride), padding="valid", input_shape=(32,32,1),
activation = 'tanh'))
s2 = model.add(AveragePooling2D(pool_size=2, strides=2, padding='valid'))
Now, the output of s2 has a size of 14*14*6
Here, I want to apply my custom connection to convolution layer c3, which has an output size of 10*10*16 (that is, 16 filters need to be applied to s2 of size 14*14*6 to get an output of 10*10*16). For this, I need to use kernel_size = 5*5, filters = 16, stride = 1, and padding = valid.
However, not all 6 feature maps of s2 are connected to each of the 16 feature maps of c3. The connections are explained as given here.
For example (following the explanation in the link above), to build your first feature map of c3, you convolve 3 of your input maps (of s2 of size 14*14*6) with 5x5 filters, which gives you 3 10x10 maps that are summed up to give your first feature map, which is then of size 10x10.
I read somewhere that we need to use the Functional API to build this.
But I am not sure how to proceed further. Can someone help with implementing this?
My initial approach of implementing this is as follows:
from keras.models import Model
from keras.layers import Conv2D, Input, Concatenate, Lambda, Add
inputTensor = Input(shape=(14, 14,6))
stride =1
group0_a = Lambda(lambda x: x[:,:,0])(inputTensor)
group0_b = Lambda(lambda x: x[:,:,1])(inputTensor)
group0_c = Lambda(lambda x: x[:,:,2])(inputTensor) # Take 0,1,2 feature map of s2
conv_group0_a = Conv2D(1, kernel_size=[5,5], strides=(stride,stride), padding="valid", activation = 'tanh')(group0_a)
conv_group0_b = Conv2D(1, kernel_size=[5,5], strides=(stride,stride), padding="valid", activation = 'tanh')(group0_b)
conv_group0_c = Conv2D(1, kernel_size=[5,5], strides=(stride,stride), padding="valid", activation = 'tanh')(group0_c) #Applying convolution on each of 0, 1, 2 feature maps of s2 with distinct kernels
added_0 = Add()([conv_group0_a, conv_group0_b, conv_group0_c]) #adding all the three to get one of the 10*10*16
#Repeat this for 16 neurons of c3 and then finally
output_layer = Concatenate()([]) #concatenate them
Mymodel = Model(inputTensor,output_layer)
I want to know if my approach is correct (I know it is not, because I am getting too many errors). So, I need help recreating the custom connection explained above. Any help is appreciated.
The above code is correct. The only change I made is group0_a = Lambda(lambda x: x[:,:,0:1])(inputTensor); that is, instead of slicing x as x[:,:,0], I sliced it as x[:,:,0:1], which keeps the sliced axis instead of dropping it.
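A quick shape check with NumPy may help when choosing the slice. It is only a sketch: the array below is a hypothetical stand-in for a batch of s2 outputs, and it assumes the Lambda receives the tensor with its leading batch axis, so that the layout is (batch, height, width, channels). Under that assumption, an integer index drops an axis, a range slice keeps it, and the feature-map axis is the fourth one.

```python
import numpy as np

# Hypothetical stand-in for a batch of 2 outputs of s2: (batch, height, width, channels)
batch = np.zeros((2, 14, 14, 6))

int_index = batch[:, :, 0]            # integer index on axis 2: that axis is dropped
range_slice = batch[:, :, 0:1]        # range slice on axis 2: axis kept with length 1
channel_slice = batch[:, :, :, 0:1]   # range slice on the last (channel) axis

shapes = (int_index.shape, range_slice.shape, channel_slice.shape)
# shapes == ((2, 14, 6), (2, 14, 1, 6), (2, 14, 14, 1))
```

If the intent is to hand a single 14x14 feature map to Conv2D, the last form, which yields a (batch, 14, 14, 1) tensor, appears to be the one that matches.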

Simple RNN example showing numerics

I'm trying to understand RNNs and I would like to find a simple example that actually shows the one-hot vectors and the numerical operations. Preferably conceptual, since actual code may make it even more confusing. Most examples I google just show boxes with loops coming out of them, and it's really difficult to understand what exactly is going on. In the rare case where they do show the vectors, it's still difficult to see how they are getting the values.
For example, I don't know where the values are coming from in this picture.
If the example could integrate LSTMs and other popular extensions that would be cool too.
In the simple RNN case, a network accepts an input sequence x and produces an output sequence y, while a hidden sequence h stores the network's dynamic state, such that at timestep i: x(i) ∊ ℝ^M, h(i) ∊ ℝ^N, y(i) ∊ ℝ^P are real-valued vectors of M/N/P dimensions corresponding to input, hidden and output values respectively. The RNN changes its state and emits output based on the state equations:
h(t) = tanh(Wxh * [x(t); h(t-1)]), where Wxh is a linear map ℝ^(M+N) ↦ ℝ^N, * is matrix multiplication and ; is the concatenation operation. Concretely, to obtain h(t) you concatenate x(t) with h(t-1), apply matrix multiplication between Wxh (of shape (M+N, N)) and the concatenated vector (of shape M+N), and apply a tanh non-linearity to each element of the resulting vector (of shape N).
y(t) = sigmoid(Why * h(t)), where Why is a linear map ℝ^N ↦ ℝ^P. Concretely, you apply matrix multiplication between Why (of shape (N, P)) and h(t) (of shape N) to obtain a P-dimensional output vector, to which the sigmoid function is applied.
In other words, obtaining the output at time t requires iterating through the above equations for i=0,1,...,t. Therefore, the hidden state acts as a finite memory for the system, allowing for context-dependent computation (i.e. h(t) fully depends on both the history of the computation and the current input, and so does y(t)).
In the case of gated RNNs (GRU or LSTM), the state equations get somewhat harder to follow, due to the gating mechanisms which essentially allow selection between the input and the memory, but the core concept remains the same.
Numeric Example
Let's follow your example; we have M = 4, N = 3, P = 4, so Wxh is of shape (7, 3) and Why of shape (3, 4). We of course do not know the values of either W matrix, so we cannot reproduce the same results; we can still follow the process though.
At timestep t<0, we have h(t) = [0, 0, 0].
At timestep t=0, we receive input x(0) = [1, 0, 0, 0]. Concatenating x(0) with the initial hidden state h(-1) = [0, 0, 0], we get [x(t); h(t-1)] = [1, 0, 0, 0, 0, 0, 0] (let's call this vector u to ease notation). We apply u * Wxh (i.e. multiplying a 7-dimensional vector with a 7 by 3 matrix) and get a vector v = [v_1, v_2, v_3], where v_i = Σ_j u_j W_ji = u_1 W_1i + u_2 W_2i + ... + u_7 W_7i. Finally, we apply tanh on v, obtaining h(0) = [tanh(v_1), tanh(v_2), tanh(v_3)] = [0.3, -0.1, 0.9]. From h(0) we can also get y(0) via the same process: multiply h(0) with Why (i.e. a 3-dimensional vector with a 3 by 4 matrix), get a vector s = [s_1, s_2, s_3, s_4], apply sigmoid on s and get σ(s) = y(0).
At timestep t=1, we receive input x(1) = [0, 1, 0, 0]. We concatenate x(1) with h(0) to get a new u = [0, 1, 0, 0, 0.3, -0.1, 0.9]. u is again multiplied with Wxh, and tanh is again applied on the result, giving us h(1) = [1, 0.3, 1]. Similarly, h(1) is multiplied by Why, giving us a new s vector on which we apply the sigmoid to obtain σ(s) = y(1).
This process continues until the input sequence finishes, ending the computation.
Note: I have ignored bias terms in the above equations because they do not affect the core concept and they make the notation much harder to follow.
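For readers who do want to see the steps as code, the walkthrough above can be sketched in NumPy. This is an illustration under stated assumptions: the weights are random (the real values behind the picture are unknown, so the numbers will not match it), biases are omitted as in the text, and the function name rnn_forward is hypothetical.

```python
import numpy as np

M, N, P = 4, 3, 4                      # input, hidden and output sizes from the example
rng = np.random.default_rng(0)
Wxh = rng.normal(scale=0.5, size=(M + N, N))   # assumed random weights, shape (7, 3)
Why = rng.normal(scale=0.5, size=(N, P))       # assumed random weights, shape (3, 4)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def rnn_forward(inputs, Wxh, Why):
    """Run the simple RNN over a sequence, starting from a zero hidden state."""
    h = np.zeros(Wxh.shape[1])         # h(t) = [0, 0, 0] for t < 0
    hidden_states, outputs = [], []
    for x in inputs:
        u = np.concatenate([x, h])     # u = [x(t); h(t-1)], 7-dimensional
        h = np.tanh(u @ Wxh)           # v = u * Wxh, then element-wise tanh
        y = sigmoid(h @ Why)           # s = h * Why, then element-wise sigmoid
        hidden_states.append(h)
        outputs.append(y)
    return np.array(hidden_states), np.array(outputs)

# One-hot inputs x(0) = [1, 0, 0, 0] and x(1) = [0, 1, 0, 0]
inputs = np.eye(M)[[0, 1]]
hidden_states, outputs = rnn_forward(inputs, Wxh, Why)
```

Because x(0) is one-hot and the initial state is zero, the first hidden state is simply tanh applied to the first row of Wxh, which is one way to see where such values come from.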

Tensorflow - Using tf.losses.hinge_loss causing Shapes Incompatible error

My current code using sparse_softmax_cross_entropy works fine.
loss_normal = (
However, when I try to use the hinge_loss:
loss_normal = (
It reported an error saying:
ValueError: Shapes (1024, 2) and (1024,) are incompatible
The error seems to originate from this function in the file:
with ops.name_scope(scope, "hinge_loss", (logits, labels)) as scope:
I modified my code as below to just extract 1 column of the logits tensor:
loss_normal = (
But it still reports a similar error:
ValueError: Shapes (1024, 1) and (1024,) are incompatible.
Can someone please help point out why my code works fine with sparse_softmax_cross_entropy loss but not hinge_loss?
The tensor labels has the shape [1024], the tensor logits has [1024, 2] shape. This works fine for tf.nn.sparse_softmax_cross_entropy_with_logits:
labels: Tensor of shape [d_0, d_1, ..., d_{r-1}] (where r is rank of
labels and result) and dtype int32 or int64. Each entry in labels must
be an index in [0, num_classes). Other values will raise an exception
when this op is run on CPU, and return NaN for corresponding loss and
gradient rows on GPU.
logits: Unscaled log probabilities of shape
[d_0, d_1, ..., d_{r-1}, num_classes] and dtype float32 or float64.
But the tf.losses.hinge_loss requirements are different:
labels: The ground truth output tensor. Its shape should match the
shape of logits. The values of the tensor are expected to be 0.0 or
1.0.
logits: The logits, a float tensor.
You can resolve this in two ways:
Reshape the labels to [1024, 1] and use just one column of logits, like you did - logits[:,1:]:
labels = tf.reshape(labels, [-1, 1])
hinge_loss = (
I think you'll also need to reshape the class_weights the same way.
Use all of the learned logits features via tf.reduce_sum, which will make a flat (1024,) tensor:
logits = tf.reduce_sum(logits, axis=1)
hinge_loss = (
This way you don't need to reshape labels or class_weights.
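The shape requirement behind both options can be illustrated with a small NumPy sketch. The hinge_loss function below is a hypothetical re-implementation written for this example, not the TensorFlow function: it assumes {0, 1} labels mapped to {-1, 1} and a plain mean over all elements, and the labels and logits are made-up values with a batch of 4 instead of 1024.

```python
import numpy as np

def hinge_loss(labels, logits):
    """Mean hinge loss; labels in {0, 1} must have the same shape as logits."""
    labels = np.asarray(labels, dtype=float)
    logits = np.asarray(logits, dtype=float)
    if labels.shape != logits.shape:
        raise ValueError(f"Shapes {logits.shape} and {labels.shape} are incompatible")
    signed = 2.0 * labels - 1.0                      # map {0, 1} to {-1, 1}
    return np.mean(np.maximum(0.0, 1.0 - signed * logits))

labels = np.array([0, 1, 1, 0])                      # shape (4,)
logits = np.array([[0.2, -0.4],
                   [-1.0, 2.0],
                   [0.3, 0.1],
                   [1.5, -2.0]])                     # shape (4, 2)

# Option 1: labels reshaped to (4, 1), one column of logits kept as (4, 1)
loss_a = hinge_loss(labels.reshape(-1, 1), logits[:, 1:])

# Option 2: logits summed over the class axis to a flat (4,) vector
loss_b = hinge_loss(labels, logits.sum(axis=1))
```

Passing labels of shape (4,) together with logits[:, 1:] of shape (4, 1) raises the same kind of shape error as in the question, which is why the reshape is needed in the first option.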

Create a List and Use it in Loss Function Tensorflow

I am trying to create a list based on my neural network outputs and use it in Tensorflow as a loss function.
Assume that results is a list of size [1, batch_size] that is output by a neural network. For each value in this list, I check whether it is in a specific range passed in as a placeholder called valid_range, and if it is, I add 1 to a list. If it is not, I add -1. The goal is to make all predictions of the network fall in the range, so the correct predictions are a tensor of all 1s, which I call correct_predictions.
values_list = []
for j in range(batch_size):
    a = results[0, j] >= valid_range[0]
    b = results[0, j] <= valid_range[1]
    c = tf.logical_and(a, b)
    if (c == 1):
        values_list.append(1)   # prediction is inside the valid range
    else:
        values_list.append(-1)  # prediction is outside the valid range
values_list_tensor = tf.convert_to_tensor(values_list)
correct_predictions = tf.ones([batch_size, ], tf.float32)
Now, I want to use this as a loss function in my network, so that I can force all the predictions to be in the specified range. I try to train like this:
loss = tf.reduce_mean(tf.squared_difference(values_list_tensor, correct_predictions))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip_threshold)
optimize = optimizer.apply_gradients(zip(gradients, variables))
This, however, has a problem and throws an error on the last optimize line, saying:
ValueError: No gradients provided for any variable: ['< object at 0x7f0245d4afd0>',
'< object at 0x7f0245d66050>'
I tried to debug this in Tensorboard, and I notice that the list I am creating does not appear in the graph, so basically the x part of the loss function is not part of the network itself. Is there some way to accurately create a list based on the predictions of a neural network and use it in the loss function in Tensorflow to train the network?
Please help, I have been stuck on this for a few days now.
Following what was suggested in the comments, I decided to use an L2 loss function, multiplying it by the binary vector values_list_tensor I had from before. The binary vector now has values 1 and 0 instead of 1 and -1. This way, when the prediction is in the range the loss is 0; otherwise it is the normal L2 loss. As I am unable to see the values of the tensors, I am not sure if this is correct. However, I can view the final loss and it is always 0, so something is wrong here. I am unsure whether the multiplication is being done correctly and whether values_list_tensor is calculated accurately. Can someone help and tell me what could be wrong?
loss = tf.reduce_mean(tf.nn.l2_loss(tf.matmul(tf.transpose(tf.expand_dims(values_list_tensor, 1)), tf.expand_dims(result[0, :], 1))))
To answer the question in the comment: one way to write a piecewise function is to use tf.cond. For example, here is a function that returns 0 in [-1, 1] and x everywhere else:
sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32)
y = tf.cond(tf.logical_or(tf.greater(x, 1.0), tf.less(x, -1.0)), lambda : x, lambda : 0.0)
y.eval({x: 1.5}) # prints 1.5
y.eval({x: 0.5}) # prints 0.0
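The tf.cond example above takes a single scalar predicate. For a whole batch of predictions, the same piecewise rule can be applied element-wise; in TensorFlow that would typically be done with tf.where. A minimal NumPy sketch of the idea follows, where the function name outside_range_or_zero and its low/high parameters are hypothetical and chosen only for this illustration.

```python
import numpy as np

def outside_range_or_zero(x, low=-1.0, high=1.0):
    """Return 0 where low <= x <= high and x everywhere else, element-wise."""
    x = np.asarray(x, dtype=float)
    return np.where((x > high) | (x < low), x, 0.0)

values = outside_range_or_zero([1.5, 0.5, -2.0, 1.0])
# values is [1.5, 0.0, -2.0, 0.0]
```

Unlike a Python list built with if statements, an expression of this form stays a function of the predictions, which is what a loss needs in order to produce gradients.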
