I built a regression model with an SVM using the RBF kernel; my data has more than 7 dimensions. My MSE is 36,069, which is a low error for the problem I'm solving, but my R-squared is very low (R2 = 0.30).
Is it okay to evaluate my SVM with R-squared, given that my data is not 2D?
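For what it's worth, both MSE and R-squared are defined for inputs of any dimensionality. A minimal scikit-learn sketch (synthetic 8-dimensional data, arbitrary hyperparameters) that computes both for an RBF-kernel SVR might look like this:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                      # 8-dimensional inputs
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X_tr, y_tr)

pred = model.predict(X_te)
print("MSE:", mean_squared_error(y_te, pred))
print("R2 :", r2_score(y_te, pred))
```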
I have trained a fastText model on a binary text classification problem and generated the learning curve over increasing training-set size.
Very quickly I get a very low training loss, close to 0, which then stays constant.
I interpret this as the model overfitting on the data.
But the validation loss curve looks good to me, slowly decreasing.
Cross-validation on unseen data also produces accuracies with little variation, around 90%.
So I am wondering whether I really have an overfitting model, as the learning curve suggests.
Is there any other check I can do on my model?
Since fastText also uses epochs, I am even wondering whether a learning curve should vary the number of epochs (keeping the training-set size constant), slowly increase the training-set size (keeping the number of epochs constant), or both.
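If it helps, one way to build a size-based learning curve with the fastText Python bindings is to retrain on growing subsets while holding the number of epochs fixed. This is only a sketch; the file names (train.txt, valid.txt, train_subset.txt) and hyperparameters are placeholders:

```python
import random
import fasttext

# Placeholder training file: one "__label__X some text" line per example.
with open("train.txt") as f:
    lines = f.readlines()
random.Random(0).shuffle(lines)

for frac in (0.1, 0.25, 0.5, 0.75, 1.0):
    subset = lines[: int(frac * len(lines))]
    with open("train_subset.txt", "w") as f:
        f.writelines(subset)
    # Keep the number of epochs fixed while the training-set size grows.
    model = fasttext.train_supervised(input="train_subset.txt", epoch=5, lr=0.5)
    n, precision, recall = model.test("valid.txt")  # held-out validation file
    print(f"{frac:.0%} of the data -> validation precision {precision:.3f}")
```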
I'm training a Siamese CNN model on an X-ray dataset and it gets stuck at 0.50 validation accuracy. Moreover, training and validation loss decrease while training accuracy also hovers around 0.50.
I think there is an error somewhere, but I don't know exactly what it is. Maybe using cross-validation could help.
Code:
Architecture of the CNN
Summary
Training the model...(results)
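Since the code and results above are not reproduced here, one generic sanity check (assuming a Keras-style model with a single sigmoid output; `model`, `x_val`, and `y_val` are hypothetical names, not from the question) is to look at how the predicted probabilities are distributed:

```python
import numpy as np

def check_prediction_spread(model, x_val, y_val):
    """Print how predicted probabilities and labels are distributed."""
    probs = model.predict(x_val).ravel()
    print("prediction range:", probs.min(), probs.max())
    print("fraction predicted positive:", (probs > 0.5).mean())
    print("fraction of positive labels:", np.mean(y_val))
```

If almost all probabilities sit on one side of 0.5, or the pairs are heavily imbalanced, accuracy can stay pinned near 0.50 even while the loss keeps decreasing.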
I am doing multilabel classification using a recurrent neural network. My question is about the loss function: my outputs are vectors of true/false (1/0) values indicating each label's class. Many resources say the Hamming loss is the appropriate objective. However, the Hamming loss has a problem in the gradient calculation:
H = average(y_true XOR y_pred), and the XOR is not differentiable, so the gradient of the loss cannot be computed. Are there other loss functions for training multilabel classification? I've tried MSE and binary cross-entropy with individual sigmoid outputs.
H = average(y_true * (1 - y_pred) + (1 - y_true) * y_pred)
is a continuous approximation of the Hamming loss.
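As a sketch (assuming a Keras model whose final layer applies an element-wise sigmoid; the function name is my own), this relaxation can be dropped in as a custom loss:

```python
import tensorflow as tf

def soft_hamming_loss(y_true, y_pred):
    # Continuous relaxation of the Hamming loss:
    # average over all labels of y_true*(1 - y_pred) + (1 - y_true)*y_pred.
    y_true = tf.cast(y_true, y_pred.dtype)
    return tf.reduce_mean(y_true * (1.0 - y_pred) + (1.0 - y_true) * y_pred)

# e.g. model.compile(optimizer="adam", loss=soft_hamming_loss)
```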
I have trained an SVM and a logistic regression classifier on my dataset. Both classifiers provide a weight vector whose length equals the number of features. I can use this weight vector to select the 10 most important features by simply taking the 10 features with the highest weights.
Should I use the absolute values of the weights, i.e. select the 10 features with the highest absolute weights?
Second, as I have read, this only works for an SVM with a linear kernel, not with an RBF kernel. For a non-linear kernel the weights are somehow no longer linear. What exactly is the reason that the weight vector cannot be used to determine feature importance in the case of a non-linear-kernel SVM?
As I answered in a similar question, the weight vector of any linear classifier indicates feature importance, simply because the final value is a linear combination of the feature values with the weights as coefficients: the bigger the weight, the more impact the corresponding summand has on the final value.
Thus, for a linear classifier you can take the features with the biggest weights (not the features with the biggest values themselves, nor the biggest product of weight and feature value).
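As an illustrative scikit-learn sketch (synthetic data, my own variable names), ranking by the absolute value of the weights is the usual choice, since a large negative weight is just as influential as a large positive one:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
clf = SVC(kernel="linear").fit(X, y)

weights = clf.coef_.ravel()                      # one weight per feature
top10 = np.argsort(np.abs(weights))[-10:][::-1]  # largest |weight| first
print("top-10 feature indices:", top10)
```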
This also explains why SVMs with non-linear kernels such as RBF do not have this property: both the feature values and the weights are transformed into another space, and you cannot say that a bigger weight leads to a bigger impact; see the wiki.
If you need to select the most important features for a non-linear SVM, use dedicated feature-selection methods, namely wrapper methods.
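For example, a wrapper-style selection around an RBF SVM could use scikit-learn's SequentialFeatureSelector; this is only a sketch on synthetic data with arbitrary settings:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
# Forward selection: repeatedly add the feature that most improves
# cross-validated accuracy of the RBF SVM.
selector = SequentialFeatureSelector(SVC(kernel="rbf"),
                                     n_features_to_select=10, cv=5)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```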
I was thinking about the risks hidden in training a non-linear classifier on a labelled (large enough) dataset that is linearly separable.
What are the main ways the classification could be misled? Any examples?
In the bias-variance tradeoff, a non-linear classifier has, in general, a larger variance than a linear one. If the dataset is generated by a linearly separable process but the measurements are noisy, the non-linear classifier will be more susceptible to overfitting.
However, if the dataset is large enough and the classifier is unbiased, a non-linear classifier will eventually produce, in effect, a separating hyperplane.
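To make the variance point concrete, here is a small illustrative sketch (synthetic linearly separable data with 10% label noise, arbitrary models and seed) comparing a linear model with a more flexible non-linear one under cross-validation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # linearly separable labels
flip = rng.random(200) < 0.1              # 10% label noise
y[flip] = 1 - y[flip]

for clf in (LogisticRegression(), DecisionTreeClassifier()):
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{type(clf).__name__}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On data like this, the more flexible model typically fits the noisy training labels closely but generalizes somewhat worse and less consistently than the linear model, which is the overfitting risk described above.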