I am doing multilabel classification with a recurrent neural network. My question is about the loss function: my output will be vectors of true/false (1/0) values indicating each label's class. Many resources say the Hamming loss is the appropriate objective. However, the Hamming loss causes a problem in the gradient calculation:
H = average(y_true XOR y_pred), and XOR is not differentiable, so the gradient of the loss cannot be computed. Are there other loss functions for training multilabel classifiers? I've tried MSE and binary cross-entropy with an individual sigmoid on each output.
H = average(y_true*(1-y_pred)+(1-y_true)*y_pred)
is a continuous approximation of the Hamming loss: it is differentiable in y_pred and equals the XOR average whenever every y_pred is exactly 0 or 1.
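A minimal, dependency-free sketch of that approximation is below. The function names `soft_hamming_loss` and `soft_hamming_grad` are made up for illustration, and the sketch assumes `y_pred` holds per-label probabilities in [0, 1] (for example sigmoid outputs). It only shows that the expression has a well-defined gradient; it is not a recommendation over binary cross-entropy.

```python
def soft_hamming_loss(y_true, y_pred):
    """Mean of y*(1-p) + (1-y)*p over all labels (hypothetical helper)."""
    terms = [t * (1.0 - p) + (1.0 - t) * p for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)


def soft_hamming_grad(y_true):
    """Gradient of the loss w.r.t. each prediction: (1 - 2*y) / n.

    It depends only on the label, so it pushes p down when y = 0
    and up when y = 1.
    """
    n = len(y_true)
    return [(1.0 - 2.0 * t) / n for t in y_true]


y_true = [1, 0, 1, 0]

# With hard 0/1 predictions the value equals the XOR average (one mismatch in four).
hard = soft_hamming_loss(y_true, [1, 0, 0, 0])

# With probabilities the loss varies smoothly.
soft = soft_hamming_loss(y_true, [0.9, 0.2, 0.6, 0.1])

grad = soft_hamming_grad(y_true)
```

Note that the gradient is constant in `y_pred`, so on its own it gives no stronger signal for confidently wrong predictions; that is one practical difference from binary cross-entropy.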
My goal is to train a VAE with a convolutional encoder/decoder on CIFAR-10. The only way I can see to compute the KL divergence term of the ELBO is to flatten the latent space after the convolutional encoder and compute the mean and variance from it using two dense layers. Then I reshape the sampled latent code z = mu + epsilon * sigma back into something of shape (B, C, W, H) and apply transposed convolutions to turn it back into an image.
However, https://github.com/rtflynn/Cifar-Autoencoder shows that (I've verified this with my own code too) CNN autoencoders for images generally seem to do a lot worse when you put these dense layers in between them. Probably something related to the loss of spatial information due to flattening the latent space. So I'm not confident that my method of computing the KL divergence will give good results.
Are there any other ways I could do this?
Hyperparameter tuning uses techniques such as grid search or random search.
Gradient descent is mostly used to minimize the loss function.
My question is: when do we use grid search, and when do we use gradient descent?
Gradient descent is used to optimize the model, meaning its weights and biases, so as to minimize the loss. It tries to reach a minimum of the loss function, with the aim that the model generalises reasonably well. It optimizes the model based on the hyperparameters given to it.
For example, the learning rate appears in the update rule:
W = W - ( learning_rate * gradient )
Here the learning rate is a hyperparameter that controls how much W, the weights, changes at each step.
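As a small illustration of that update rule, here is a sketch that applies it repeatedly to a toy one-dimensional loss. The loss (w - 3)^2, the starting point, the learning rate of 0.1 and the helper name `gradient_descent` are all assumptions chosen for the example.

```python
def gradient_descent(grad_fn, w, learning_rate, steps):
    """Repeatedly apply W = W - learning_rate * gradient."""
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w


def toy_grad(w):
    """Gradient of the toy loss (w - 3)**2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


# The learning rate is fixed before training starts; only w is updated.
w_final = gradient_descent(toy_grad, w=0.0, learning_rate=0.1, steps=100)
```

With this learning rate the distance to the minimum shrinks by a constant factor each step, so `w_final` ends up very close to 3.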
To choose a better value for a hyperparameter, grid search and random search are used. Hyperparameters stay constant during training but need to be tuned so that the model converges to a good solution.
Gradient descent optimizes the model for a given set of hyperparameters, whereas grid search and random search tune the hyperparameters themselves.
Gradient descent is used to optimize the model (weights and biases).
Hyperparameter tuning algorithms tune the hyperparameters, which in turn affect gradient descent.
A typical workflow looks like this:
1. Train the model with some chosen hyperparameters.
2. Evaluate the model's loss and accuracy.
3. Run hyperparameter tuning to get better values for the hyperparameters.
4. Train the model again with the updated hyperparameters.
5. Repeat this routine until the model reaches a sufficiently high accuracy and a low loss.
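The routine above can be sketched as a grid search wrapped around a gradient descent loop. Everything concrete here is an assumption for illustration: the toy loss (w - 3)^2, the candidate learning rates, the step count, and the helper names `train` and `grid_search`. A real setup would score each candidate on held-out validation data rather than on the training loss.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def train(learning_rate, steps=50):
    """Inner loop: gradient descent tunes the weight for a fixed learning rate."""
    w = 0.0
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)
        w = w - learning_rate * gradient
    return loss(w)


def grid_search(candidates):
    """Outer loop: try every candidate hyperparameter and keep the best one."""
    results = {lr: train(lr) for lr in candidates}
    best = min(results, key=results.get)
    return best, results


# 1.1 is included on purpose: it is too large and makes training diverge.
best_lr, results = grid_search([0.001, 0.01, 0.1, 1.1])
```

Random search would replace the fixed candidate list with values sampled from a range; the inner training loop stays the same.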
I've heard several different accounts of how weights and biases are set up in a neural network, and it's left me with a few questions:
Which layers use weights? (I've been told the input layer doesn't, are there others?)
Does each layer get a global bias (1 per layer)? Or does each individual neuron get its own bias?
In common textbook networks such as a multilayer perceptron, every hidden layer and the output layer have weights (in a classifier, that means every layer up to the softmax-normalized output). Every node in those layers has its own single bias.
Here's a blog post that I find particularly helpful in explaining the conceptual function of this arrangement:
http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Essentially, the combination of weights and biases allows the network to form intermediate representations that are arbitrary rotations, scalings, and (thanks to nonlinear activation functions) distortions of the previous layer's representation, ultimately linearizing the relationship between input and output.
This arrangement can also be expressed by the simple linear-algebraic expression L2 = sigma(W L1 + B) where L1 and L2 are activation vectors of two adjacent layers, W is a weight matrix, B is a bias vector, and sigma is an activation function, which is somewhat mathematically and computationally appealing.
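To make the shapes concrete, here is a minimal sketch of L2 = sigma(W L1 + B) in plain Python. The sigmoid activation, the numbers, and the helper name `dense_forward` are assumptions for illustration. W has one row per neuron in the next layer and B has one entry per neuron, which matches the "one bias per node" answer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense_forward(W, b, l1):
    """Compute sigma(W l1 + b) for one layer.

    W: list of rows, one row of weights per output neuron.
    b: list of biases, one per output neuron.
    l1: activations of the previous layer.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, l1)) + bias)
        for row, bias in zip(W, b)
    ]


# Two inputs feeding three neurons: W is 3x2 and b has 3 entries.
W = [[0.5, -0.5], [1.0, 1.0], [0.0, 0.0]]
b = [0.0, -1.0, 2.0]
l2 = dense_forward(W, b, [1.0, 1.0])
```

The input layer has no W or B of its own here; it only supplies `l1`.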
I am building a convolution autoencoder that uses MSE as its error function. How is MSE defined for images? If the image is presented in simple matrix form, is MSE simply the square of the difference of individual determinants? Or is it the square of the determinant of the difference of the matrices?
There is no determinant involved in calculating MSE. MSE stands for Mean Squared Error, and it is simply the mean of the squared differences between corresponding pixels of the two images. In other words, the cost is model-agnostic: MSE is defined in exactly the same way whether you use a convolutional autoencoder, a simple autoencoder, or a simple MLP.
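A short sketch of that per-pixel definition, assuming two single-channel images of the same shape stored as nested lists (the helper name `mse` and the pixel values are made up for the example):

```python
def mse(image_a, image_b):
    """Mean of squared per-pixel differences between two same-shaped images."""
    squared_diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(image_a, image_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(squared_diffs) / len(squared_diffs)


original = [[0.0, 1.0], [0.5, 0.5]]
reconstruction = [[0.0, 0.0], [0.5, 1.0]]

# Squared differences are 0, 1, 0 and 0.25, so the mean is 1.25 / 4.
error = mse(original, reconstruction)
```

For multi-channel images the same idea applies: average over every pixel in every channel.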
I have trained an SVM and a logistic regression classifier on my dataset. Both classifiers provide a weight vector whose size equals the number of features. I can use this weight vector to select the 10 most important features by simply picking the 10 features with the highest weights.
Should I use the absolute values of the weights, i.e. select the 10 features with the highest absolute values?
Second, from what I have read, this only works for an SVM with a linear kernel, not with an RBF kernel. For non-linear kernels the weights are somehow no longer linear. What is the exact reason that the weight vector cannot be used to determine feature importance in a non-linear kernel SVM?
As I answered in a similar question, the weight vector of any linear classifier indicates feature importance: the final value is a linear combination of the feature values with the weights as coefficients, so the larger the weight, the more the corresponding summand contributes to the final value.
Thus, for a linear classifier you can take the features with the largest weights in absolute value, since a large negative weight influences the result as much as a large positive one (not the features with the largest values themselves, nor the largest product of weight and feature value).
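A minimal sketch of that selection, ranking by the magnitude of each weight. The weights, feature names and the helper `top_k_features` are hypothetical, and the ranking assumes the features were scaled to comparable ranges before training, otherwise the weights are not directly comparable.

```python
def top_k_features(weights, names, k):
    """Return the k feature names with the largest absolute weight."""
    ranked = sorted(
        zip(names, weights),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical weight vector from a trained linear classifier.
weights = [0.2, -1.5, 0.9, -0.1]
names = ["f0", "f1", "f2", "f3"]

# f1 ranks first even though its weight is negative.
selected = top_k_features(weights, names, k=2)
```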
It also explains why SVMs with non-linear kernels like RBF don't have this property: both the feature values and the weights are transformed into another space, so you can't say that a bigger weight leads to a bigger impact, see wiki.
If you need to select the most important features for a non-linear SVM, use dedicated feature selection methods, namely wrapper methods.