One-hot-encoded labels, multi-hot-encoded output in Keras - machine-learning

I have a 1D image with 1x2048 pixels as input and 32 classes, for which I have defined a layer of 32 filters with the same size as the image (1x2048), which are L1-regularized.
My image examples are one-hot encoded. However, my goal is to get a multi-hot encoded output when I sum some of these images together and feed the result to the trained model.
The training goes well and the model can classify each class separately, but if I sum two images and feed them to the model, it only outputs a one-hot encoded vector (although I expect a two-hot encoded vector). If I look at the kernels after training, they make sense, as most of the weights are zero except the ones that define my class.
I don't understand why I get a one-hot vector output rather than a multi-hot vector.
The reason I don't simply sum the images and use the sums for training is that the number of possible combinations of the images would exceed my memory capacity.
An image of the network I have in mind
input_shape = (1, 2048, 1)

model = Sequential()
model.add(Conv2D(32, kernel_size=(1, 2048), strides=(1, 1),
                 activation='sigmoid',
                 input_shape=input_shape,
                 kernel_regularizer=keras.regularizers.l1(0.01),
                 kernel_constraint=keras.constraints.non_neg()))
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=optimizer, metrics=['accuracy'])

You are using the wrong loss function
categorical_crossentropy will always give you exactly one 1-value in your output vector, no matter the input; it tries to classify every instance into one (and only one) of the available classes.
What you want, though, is (potentially) multiple ones in your output. Therefore, you should use binary_crossentropy instead. Also see this post.
On a side note, I would strongly advise you to consider this twice: if the case with multiple classes does not occur that often, it may result in a lot of false positives, i.e. cases where more than one class is predicted.
On another note, you might want to consider using Conv1D, since your signal is only 1-dimensional.
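For illustration, a minimal sketch of the question's model adapted along those lines (the reshaped (2048, 1) input, the added Flatten and the placeholder optimizer are assumptions, not code from the question):

import keras
from keras.models import Sequential
from keras.layers import Conv1D, Flatten

input_shape = (2048, 1)  # the 1x2048 image treated as a 1D signal

model = Sequential()
model.add(Conv1D(32, kernel_size=2048, strides=1,
                 activation='sigmoid',
                 input_shape=input_shape,
                 kernel_regularizer=keras.regularizers.l1(0.01),
                 kernel_constraint=keras.constraints.non_neg()))
model.add(Flatten())  # collapse the (1, 32) feature map to a flat 32-vector
model.compile(loss=keras.losses.binary_crossentropy,  # multi-label loss
              optimizer='adam',                        # placeholder optimizer
              metrics=['accuracy'])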

@Azerila
The thing you are looking for is Mixup augmentation. It can be implemented as follows:
import tensorflow_probability as tfp

tfd = tfp.distributions  # assumption: tfd refers to TensorFlow Probability's distributions module

def mixup(entry1, entry2):
    # each entry is an (image, label) pair
    image1, label1 = entry1
    image2, label2 = entry2
    # sample the mixing coefficient from a Beta(0.2, 0.2) distribution
    alpha = [0.2]
    dist = tfd.Beta(alpha, alpha)
    l = dist.sample(1)[0][0]
    # blend both images and both labels with the same coefficient
    img = l * image1 + (1 - l) * image2
    lab = l * label1 + (1 - l) * label2
    return img, lab
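A possible usage sketch, assuming the training data is a tf.data.Dataset named train_ds that yields (image, one-hot label) pairs (the shuffle buffer size is arbitrary):

import tensorflow as tf

# Pair each example with an example from a shuffled copy of the dataset
# and mix the two entries, which yields soft multi-hot labels.
mixed_ds = tf.data.Dataset.zip((train_ds, train_ds.shuffle(1024)))
mixed_ds = mixed_ds.map(mixup)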

Related

What is the purpose of having the same input and output in PyTorch nn.Linear function?

I think this is a comprehension issue, but I would appreciate any help.
I'm trying to learn how to use PyTorch for autoencoding. In the nn.Linear function, there are two specified parameters,
nn.Linear(input_size, hidden_size)
When reshaping a tensor to its minimum meaningful representation, as one would in autoencoding, it makes sense that the hidden_size would be smaller. However, in the PyTorch tutorial there is a line specifying identical input_size and hidden_size:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )
I guess my question is, what is the purpose of having the same input and hidden size? Wouldn't this just return an identical tensor?
I suspect that this is just a requirement after calling the nn.ReLU() activation function.
As stated by Wikipedia:

"An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. The encoding is validated and refined by attempting to regenerate the input from the encoding."
In other words, the idea of the autoencoder is to learn an identity. This identity function will be learned only for particular inputs (i.e. inputs without anomalies). From this, the following points derive:
1. The input will have the same dimensions as the output.
2. Autoencoders are (generally) built to learn the essential features of the input.
Because of point (1), an autoencoder will have a series of layers (e.g. a series of nn.Linear() or nn.Conv()).
Because of point (2), you generally have an Encoder which compresses the information (as in your code snippet, you go from 28x28 down to 10) and a Decoder that decompresses the information (10 -> 28x28). Generally the latent space dimensionality (10) is much smaller than the input (28x28) across implementations of this theoretical architecture. Now that the end goal of the Encoder part is clear, you may appreciate that the compression may pass through intermediate representations along the way (nn.Linear(28*28, 512)), which disappear once the series of layers gives the final output (10).
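To make the Encoder/Decoder split concrete, here is a minimal sketch of a fully-connected autoencoder for 28x28 inputs (the layer sizes are illustrative, not taken from the question):

import torch.nn as nn

# Sketch only: a small fully-connected autoencoder.
# Encoder: 28*28 -> 512 -> 10 (latent code)
# Decoder: 10 -> 512 -> 28*28 (reconstruction)
class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )
        self.decoder = nn.Sequential(
            nn.Linear(10, 512),
            nn.ReLU(),
            nn.Linear(512, 28 * 28),
        )

    def forward(self, x):
        code = self.encoder(x)      # compress to the latent representation
        return self.decoder(code)   # reconstruct the flattened input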
Note that because the model in your question includes a nonlinearity after the linear layer, the model will not learn an identity transform between the input and output. In the specific case of the relu nonlinearity, the model could learn an identity transform if all of the input values were positive, but in general this won't be the case.
I find it a little easier to imagine the issue if we had an even smaller model consisting of Linear --> Sigmoid --> Linear. In such a case, the input will be mapped through the first matrix transform and then "squashed" into the space [0, 1] as the "hidden" layer representation. The next ("output") layer would need to take this squashed view of the input and come up with some way of "unsquashing" it back into the original. But with an affine output layer, it's not possible to do this, so the model will have to learn some other, non-identity, transforms for the two matrices.
There are some neat visualizations of this concept on Chris Olah's blog that are well worth a look.

Usage of GRU in a non-sequential context

I was implementing a GRU in Keras. I was still a bit confused about some things, but got to this model:
modelGRU = tf.keras.models.Sequential()
modelGRU.add(layers.Bidirectional(tf.keras.layers.GRU(50, activation='tanh', input_shape=(1, 4))))
modelGRU.add(layers.Dense(99))
Then I found out that my model does not make any sense, since I put the model parameters (which are 4 parameters like depth, angle, ..., and which are the same at all times) into a single GRU. This gives me an output of dimension 100 (50*2), and then a dense layer is used to generate the 99 outputs. These 99 outputs are a time series, and that is why I initially thought of a GRU, but of course the implementation above is not right, since my model parameters have no time or sequential information. However, this model seems to be working better than the model I implemented once I understood everything better:
params_input = keras.Input(shape=(99, 4))
aantal_units = 5
naRNN = tf.keras.layers.GRU(aantal_units, input_shape=(99, 5), return_sequences=True)(params_input)
ylist = tf.unstack(naRNN, num=99, axis=1)
ylistdense = []
for ii in range(0, 99):
    yy = tf.keras.layers.Dense(1, activation='linear')(ylist[ii])
    ylistdense.append(yy)
conc = tf.keras.layers.concatenate(ylistdense)
model = keras.Model(inputs=params_input, outputs=conc)
Here, for my input, I copied the model parameters 99 times in order to have an input of shape (99, 4), put these into a GRU layer, and then for every timestep individually I apply a dense layer in order to predict the outcome.
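A minimal sketch of how such a repeated input could be built inside the model itself with RepeatVector (this is an assumption about how the repetition was done, not code from the question):

from tensorflow import keras

# Take a single (4,) parameter vector and repeat it 99 times, producing
# the (99, 4) "sequence" that the GRU then consumes.
static_params = keras.Input(shape=(4,))
repeated = keras.layers.RepeatVector(99)(static_params)               # (99, 4)
seq_features = keras.layers.GRU(5, return_sequences=True)(repeated)   # (99, 5)
outputs = keras.layers.TimeDistributed(keras.layers.Dense(1))(seq_features)
model = keras.Model(inputs=static_params, outputs=outputs)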
Here the architecture of my second implementation is visualized:
So my question is: can a GRU be used for non-sequential input? Or is there something wrong with my second implementation?

Specifying class or sample weights in Keras for one-hot encoded labels in a TF Dataset

I am trying to train an image classifier on an unbalanced training set. In order to cope with the class imbalance, I want to weight either the classes or the individual samples. Weighting the classes does not seem to work, and somehow for my setup I was not able to find a way to specify the sample weights. Below you can read how I load and encode the training data, and the two approaches that I tried.
Training data loading and encoding
My training data is stored in a directory structure where each image is placed in the subfolder corresponding to its class (I have 32 classes in total). Since the training data is too big to load into memory all at once, I make use of image_dataset_from_directory and thereby describe the data as a TF Dataset:
train_ds = keras.preprocessing.image_dataset_from_directory(training_data_dir,
                                                             batch_size=batch_size,
                                                             image_size=img_size,
                                                             label_mode='categorical')
I use label_mode 'categorical', so that the labels are described as a one-hot encoded vector.
I then prefetch the data:
train_ds = train_ds.prefetch(buffer_size=buffer_size)
Approach 1: specifying class weights
In this approach I try to specify the class weights of the classes via the class_weight argument of fit:
model.fit(
    train_ds, epochs=epochs, callbacks=callbacks, validation_data=val_ds,
    class_weight=class_weights
)
For each class we compute a weight which is inversely proportional to the number of training samples for that class. This is done as follows (before the train_ds.prefetch() call described above):
class_num_training_samples = {}
for f in train_ds.file_paths:
    class_name = f.split('/')[-2]
    if class_name in class_num_training_samples:
        class_num_training_samples[class_name] += 1
    else:
        class_num_training_samples[class_name] = 1

max_class_samples = max(class_num_training_samples.values())
class_weights = {}
for i in range(0, len(train_ds.class_names)):
    class_weights[i] = max_class_samples / class_num_training_samples[train_ds.class_names[i]]
What I am not sure about is whether this solution works, because the Keras documentation does not specify what the keys of the class_weight dictionary should be when the labels are one-hot encoded.
I tried training the network this way but found that the weights did not have a real influence on the resulting network: when I looked at the distribution of predicted classes for each individual class, I could recognize the distribution of the overall training set, where for each class the dominant classes are the most likely predictions.
Running the same training without any class weight specified led to similar results.
So the weights don't seem to have an influence in my case.
Is this because specifying class weights does not work for one-hot encoded labels, or is this because I am probably doing something else wrong (in the code I did not show here)?
Approach 2: specifying sample weight
As an attempt to come up with a different (in my opinion less elegant) solution I wanted to specify the individual sample weights via the sample_weight argument of the fit method. However from the documentation I find:
[...] This argument is not supported when x is a dataset, generator, or keras.utils.Sequence instance, instead provide the sample_weights as the third element of x.
Which is indeed the case in my setup, where train_ds is a dataset. Now I am really having trouble finding documentation from which I can derive how to modify train_ds such that it has a third element with the weight. I thought the map method of a dataset could be useful, but the solution I came up with is apparently not valid:
train_ds = train_ds.map(lambda img, label: (img, label, class_weights[np.argmax(label)]))
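A sketch of the same idea expressed entirely with TensorFlow ops (tf.argmax and tf.gather on a constant weight tensor, assuming class_weights is the dictionary computed above with integer keys 0..31):

import tensorflow as tf

# Per-class weights as a constant tensor, ordered by class index.
weight_tensor = tf.constant([class_weights[i] for i in range(len(class_weights))],
                            dtype=tf.float32)

def add_sample_weights(img, label):
    # label is a batch of one-hot vectors; look up each sample's class weight
    class_idx = tf.argmax(label, axis=-1)
    return img, label, tf.gather(weight_tensor, class_idx)

train_ds = train_ds.map(add_sample_weights)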
Does anyone have a solution that may work in combination with a dataset loaded by image_dataset_from_directory?

CNN max pooling - non-consecutive sliding window (skip-gram like)?

When using Keras to build a simple CNN like the code below, and when it is used on text-based problems such as document classification, I understand that this is as if we are extracting 4-grams from the text (kernel_size of 4) and using them as features.
model = Sequential()
model.add(embedding_layer)
model.add(Conv1D(filters=100, kernel_size=4, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=4))
model.add(Dense(4, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
In this case, the kernel size in the Conv1D layer is like a sliding window of size 4 that walks over sequences of tokens in the text to emit 4-grams.
I wonder if there is a way to create a 'non-consecutive' sliding window in the convolution, i.e., one that would generate the 'skip-gram' equivalent. So for example, given the following 1D vector:
[a, b, c, d, e, f]
a Conv1D with kernel_size=3 and skip=1 would scan the following sequences:
[(a,c,d),(b,d,e),(c,e,f),(d,f,padding),(e,padding,padding)] union [(a,b,d),(b,c,e),(c,d,f),(d,e,padding),(e,f,padding),(f,padding,padding)]
The reason I say 'union' is simply because I suppose, from the implementation point of view, it may be easier to generate either part 1 or part 2, given another parameter for the revised Conv1D layer. And if that's the case and doable, I can work around this by concatenating multiple layers. But the minimum is really to have an extended Conv1D layer that would take additional parameters such that it does either the first or the second part of the scanning.
The idea is not new as this paper already experimented it: http://www.aclweb.org/anthology/D/D16/D16-1085.pdf
But excuse my lack of in-depth knowledge of Keras; I do not know how to implement it. Any suggestions, please?
Many thanks in advance
You can do this by creating a custom convolutional layer where certain elements in the weight matrix are zero.
You can take the regular Conv1D layer as the base class.
But before doing this, notice that you can create a "dilated" convolution by passing the dilation_rate=n parameter when creating a regular convolutional layer. This will skip n-1 grams between each taken gram in the window, so your window will have fixed, regular spacing.
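For instance, a minimal sketch of that dilated variant (the filter count and kernel size here are arbitrary):

from keras.layers import Conv1D

# dilation_rate=2 makes the kernel look at every second position:
# a window of size 3 covers positions (t, t+2, t+4), skipping one gram
# between each taken gram.
dilated_conv = Conv1D(filters=100, kernel_size=3, dilation_rate=2,
                      padding='same', activation='relu')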
Creating a custom layer for that:
import keras.backend as K
from keras.layers import Conv1D

#a 1D convolution that skips some entries
class SkipConv1D(Conv1D):

    #in the init, let's just add a parameter to tell which grams to skip
    def __init__(self, validGrams, **kwargs):

        #for this example, I'm assuming validGrams is a list
        #it should contain zeros and ones, where 0's go on the skip positions
        #example: [1,1,0,1] will skip the third gram in the window of 4 grams
        assert len(validGrams) == kwargs.get('kernel_size')

        self.validGrams = K.reshape(K.constant(validGrams), (len(validGrams), 1, 1))
        #the chosen shape matches the dimensions of the kernel
        #the first dimension is the kernel size, the others are input and output channels

        #initialize the regular conv layer:
        super(SkipConv1D, self).__init__(**kwargs)

        #here, the filters, size, etc, go inside kwargs, so you should use them named
        #but you may make them explicit in this __init__ definition
        #if you think it's more comfortable to use it like this

    #in the build method, let's replace the original kernel:
    def build(self, input_shape):

        #build as the original layer:
        super(SkipConv1D, self).build(input_shape)

        #replace the kernel
        self.originalKernel = self.kernel
        self.kernel = self.validGrams * self.originalKernel
Be aware of some things that weren't taken care of in this answer:
The method get_weights() will still return the original kernel, not the kernel with the skipped mask. (It's possible to fix this, but it will be extra work; if necessary, please tell me.)
There are unused weights in this layer. This is a simple implementation; the focus here was to keep it as similar as possible to an existing Conv layer, with all its features. It's also possible to use only the strictly necessary weights, but this would increase the complexity a lot and require rewriting much of the original Keras code to recreate all the original possibilities.
If your kernel_size is long, it will be very tedious to define the validGrams var. You may want to create a version that takes some skipped indices and then converts them into the kind of list used above.
Different channels skipping different grams:
It's possible to do this inside a layer as well, if instead of using a validGrams with shape (length,), you use one with shape (length,outputFilters).
In this case, at the point where we create the validGrams matrix, we should reshape it like:
validGrams = np.asarray(validGrams)
shp = (validGrams.shape[0],1,validGrams.shape[1])
validGrams = validGrams.reshape(shp)
self.validGrams = K.constant(validGrams)
You can also simply use many parallel SkipConv1D with different parameters and then concatenate their results.
inputs = Input(yourInputShape)
out = embedding_layer(inputs)
out1 = SkipConv1D(filters=50,kernel_size=4,validGrams=[1,0,1,1])(out)
out2 = SkipConv1D(filters=50,kernel_size=4,validGrams=[1,1,0,1])(out)
out = Concatenate()([out1,out2]) #if using 'channels_first' use Concatenate(axis=1)
out = MaxPooling1D(pool_size=4)(out)
out = Dense(4, activation='softmax')(out)
model = Model(inputs,out)

Translating a TensorFlow LSTM into synapticjs

I'm working on implementing an interface between a TensorFlow basic LSTM that's already been trained and a JavaScript version that can be run in the browser. The problem is that in all of the literature I've read, LSTMs are modeled as mini-networks (using only connections, nodes and gates), while TensorFlow seems to have a lot more going on.
The two questions that I have are:
Can the TensorFlow model be easily translated into a more conventional neural network structure?
Is there a practical way to map the trainable variables that TensorFlow gives you to this structure?
I can get the 'trainable variables' out of TensorFlow; the issue is that they appear to have only one bias value per LSTM node, whereas most of the models I've seen would include several biases for the memory cell, the inputs and the output.
Internally, the LSTMCell class stores the LSTM weights as one big matrix instead of 8 smaller ones for efficiency purposes. It is quite easy to divide it horizontally and vertically to get the more conventional representation. However, it might be easier and more efficient if your library does a similar optimization.
Here is the relevant piece of code of the BasicLSTMCell:
concat = linear([inputs, h], 4 * self._num_units, True)
# i = input_gate, j = new_input, f = forget_gate, o = output_gate
i, j, f, o = array_ops.split(1, 4, concat)
The linear function does the matrix multiplication to transform the concatenated input and the previous h state into 4 matrices of shape [batch_size, self._num_units]. The linear transformation uses the single weight matrix and bias variables that you're referring to in the question. The result is then split into the different gates used by the LSTM transformation.
If you'd like to explicitly get the transformations for each gate, you can split that matrix and bias into 4 blocks. It is also quite easy to implement it from scratch using 4 or 8 linear transformations.
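For illustration, a sketch of how the concatenated kernel and bias could be split back into per-gate blocks with NumPy (the function and variable names are mine; the i, j, f, o order follows the snippet above, but the layout can differ between TensorFlow versions):

import numpy as np

def split_lstm_weights(kernel, bias, num_units):
    # kernel has shape (input_size + num_units, 4 * num_units)
    # bias   has shape (4 * num_units,)
    w_input = kernel[:-num_units, :]      # rows acting on the input x
    w_recurrent = kernel[-num_units:, :]  # rows acting on the previous state h
    gates = {}
    for k, name in enumerate(["i", "j", "f", "o"]):
        cols = slice(k * num_units, (k + 1) * num_units)
        gates[name] = {
            "w_input": w_input[:, cols],
            "w_recurrent": w_recurrent[:, cols],
            "bias": bias[cols],
        }
    return gates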
