I have a target variable which can be either 0 or 1, and 99.34% of it is 0 (about 50,000 entries in total). Logistic regression and Naive Bayes both just predicted all zeroes. Does anyone have a suggestion for this type of problem? I would like to determine feature importance.
Cheers
edit: I have about 10 features to predict with
One possibility is to give weights to the training examples, so that examples of class 1 count for more in the loss function than examples of class 0. I'm not sure what language/library you are using, but for example in scikit-learn's LogisticRegression there is an argument called class_weight that takes care of this for you (either by setting it to 'balanced' or by choosing the weights yourself). Or, if you've implemented the logistic regression from scratch, you can easily add these weights to your loss function yourself; it doesn't make the gradient computation any more complicated.
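For instance, a minimal sketch with scikit-learn (the synthetic data below just stands in for your 50,000 rows and 10 features):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in: ~50,000 rows, 10 features, ~0.66% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 10))
y = (rng.random(50_000) < 0.0066).astype(int)

# 'balanced' reweights each class inversely to its frequency in y;
# you could instead pass e.g. class_weight={0: 1, 1: 150}.
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)

# Coefficient magnitudes hint at feature importance (standardize features first).
print(np.abs(clf.coef_[0]))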
I currently have a dataset of drawings, each drawing being represented by some features. Each feature (independent variable) is a continuous number. None of the drawings have a label as of yet, which is why I am planning to run a sort of questionnaire with people. However, before I can correctly set up such a questionnaire, I should have an idea of what kind of labels I should use for my training data.
At first I was thinking about letting people rate the drawings on a scale, for example from 1 to 5, with 1 being bad, 3 being average and 5 being good. Alternatively, I could reduce the question to a simple good-or-bad question. The latter would mean I lose some valuable information, but the dependent variable could then be considered 'binary'.
Using the training data I then compose, I would need a machine learning algorithm (model) which, given a drawing, predicts whether the drawing is good or not. Ideally, I would have some way of tuning the strictness of this prediction. For example, instead of simply predicting 'good' or 'bad', the model could predict the likelihood of a painting being good on a scale of 0 to 1. I could then say "Well, let's consider all paintings which are at least 70% likely to be good as good." Another option would be for the model to predict goodness using the same categorical values the people used to rate the drawings initially, so it would predict the drawing as being a 1, 2, 3, 4 or 5. Similar to my first example, I could then say "Well, all paintings rated at least a 4 are considered good paintings" and tune this threshold to my liking.
After doing some research, I came up with logistic and linear regression as good candidates. However, which of the two would be best for my scenario? Equally important, how would I need to format my labels? Just simple 0's and 1's, or a scale?
You could use a one-vs-all (one-vs-rest) representation if you wanted to do multi-class categorical classification:
Essentially, you train one classifier for every category you have (with 10 categories, you have 10 classifiers), and each classifier is trained to predict whether or not an example belongs to its specific category, as in the sketch below.
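Here is a hedged sketch of that setup in scikit-learn (synthetic data; the 10 categories follow this answer's running example):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 10-class problem standing in for the rating categories.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=6,
                           n_classes=10, random_state=0)

# OneVsRestClassifier trains one logistic regression per class
# (class k vs. everything else) under the hood.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict_proba(X[:3]))  # one probability column per class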
There are alternative ways to make multi-class logistic regression work that only require training a single model, such as using categorical cross-entropy, but given that you'd like to use ordinal data, a linear regression used as a regression model is likely a better fit. You'd predict a value between 1 and 10 and then round to the nearest integer. This way you aren't penalizing close guesses as much as far-away guesses.
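And a sketch of that regression-plus-rounding idea (the features and the 1-10 ratings here are hypothetical):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical continuous drawing features X and ordinal ratings y in 1..10.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.integers(1, 11, size=200)

reg = LinearRegression().fit(X, y)

# Round to the nearest rating and clip to the valid range, so a prediction
# of 6.8 counts as closer to a true 7 than a prediction of 2.1 would.
pred = np.clip(np.rint(reg.predict(X)), 1, 10).astype(int)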
What keeps you from using a logistic regression model? For lack of a better dataset, I used the standard diabetes data. The target variable is an integer between 50 and 200. I normalised the target to [0, 1] so that it matches the output range of the sigmoid activation function. For the loss I decided to use MSE:
import tensorflow as tf
from sklearn import datasets

diabetes = datasets.load_diabetes()
x_train = diabetes.data
# Scale the target to [0, 1] so it matches the sigmoid's output range.
y_train = (diabetes.target - min(diabetes.target)) / (max(diabetes.target) - min(diabetes.target))

# A single sigmoid unit: essentially a logistic-regression-style model.
inputs = tf.keras.Input(shape=(x_train.shape[1],))
outputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.MSE,
              metrics=['mae'])  # regression metric; categorical accuracy doesn't apply here

history = model.fit(x_train, y_train,
                    batch_size=64,
                    epochs=300,
                    validation_data=(x_train, y_train))
You could also use a linear regression model; there you only need to replace the activation function with a linear one. However, I think the squashing character of the sigmoid is an advantage, besides ensuring that there is no prediction larger than 1 or smaller than 0.
A last alternative would be to train on pair-wise preferences. The idea is to show a person two drawings and ask which one they like more, then build a binary model on the pairs, e.g. logistic regression. This approach appears preferable to me, as it is easier for the person to answer.
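If you go that route, here is a minimal sketch of the pairwise idea (a Bradley-Terry-style trick; the features, pairs and sizes below are all hypothetical):

import numpy as np
from sklearn.linear_model import LogisticRegression

# feats[i] holds the features of drawing i; a pair (i, j) means the
# annotator preferred drawing i over drawing j (all hypothetical).
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 8))
pairs = [(a, b) for a, b in rng.integers(0, 100, size=(300, 2)) if a != b]

# Classify the sign of the feature difference: "i beats j" -> label 1.
X = np.array([feats[i] - feats[j] for i, j in pairs])
y = np.ones(len(pairs))
X, y = np.vstack([X, -X]), np.concatenate([y, np.zeros(len(pairs))])

ranker = LogisticRegression().fit(X, y)
scores = feats @ ranker.coef_[0]   # higher score = drawing judged "better"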
I have a few questions related to the usage of various activation functions in neural networks. I would highly appreciate it if someone could give good explanatory answers.
Why is ReLU used only on hidden layers specifically?
Why is sigmoid not used in multi-class classification?
Why do we not use any activation function in regression problems where all the values are negative?
Why do we use average='micro', 'macro' or 'weighted' when calculating performance metrics in multi-class classification?
I'll answer the first two questions to the best of my ability:
ReLU (= max(0, x)) is used to extract feature maps from the data. This is why it is used in the hidden layers, where we're learning what important characteristics or features the data holds that could let the model learn how to classify, for example. In the fully connected layers it's time to make a decision about the output, so we usually use sigmoid or softmax, which give numbers between 0 and 1 (probabilities) that yield an interpretable result.
Sigmoid gives an independent probability for each class. So, if you have 10 classes, you'll have 10 probabilities that need not sum to 1, and depending on the threshold used, your model could predict, for example, that an image belongs to two classes, when in multi-class classification you want exactly one predicted class per image. That's why softmax is used in this context: it produces a probability distribution over the classes that sums to 1, and you take the class with the maximum probability, so exactly one class is predicted.
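A quick numerical illustration of that difference, in plain NumPy:

import numpy as np

logits = np.array([2.0, 1.0, 0.5])

sigmoid = 1 / (1 + np.exp(-logits))              # per-class, independent scores
softmax = np.exp(logits) / np.exp(logits).sum()  # one distribution over classes

print(sigmoid, sigmoid.sum())   # ~[0.88 0.73 0.62], sums to ~2.23
print(softmax, softmax.sum())   # ~[0.63 0.23 0.14], sums to 1.0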
I have used ResNet50 to solve a multi-class classification problem. The model outputs probabilities for each class. Which loss function should I choose for my model?
After choosing binary cross-entropy: [training log screenshot]
After choosing categorical cross-entropy: [training log screenshot]
The above results are for the same model with just different loss functions. This model is supposed to classify images into 26 classes, so categorical cross-entropy should work.
Also, in the first case the accuracy is about 96% but the loss is very high. Why?
Edit 2:
Model architecture: [layer listing screenshot]
You definitely need to use categorical_crossentropy for a multi-class classification problem. With binary_crossentropy, each of your 26 outputs is treated as an independent binary decision, effectively turning the task into a multi-label problem, which is not what you want here.
I would say that the reason you are seeing high accuracy in the first (and to some extent the second) case is that you are overfitting. The first dense layer you are adding contains 8 million parameters (run model.summary() to see this), and you only have 70k images to train it with over 8 epochs. This architectural choice is very demanding both in computing power and in data. You are also using a very basic optimizer (SGD); try the more powerful Adam.
Finally, I am a bit surprised by your choice of a 'sigmoid' activation function in the output layer. Why not the more classic 'softmax'?
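A minimal sketch of that combination in Keras (the 2048-dimensional input standing in for pooled ResNet50 features is an assumption):

import tensorflow as tf

num_classes = 26

# Softmax head on top of (assumed) 2048-dim pooled features.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(num_classes, activation='softmax',
                          input_shape=(2048,)),
])

# 'categorical_crossentropy' expects one-hot labels; use
# 'sparse_categorical_crossentropy' instead for integer labels.
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])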
For a multi-class classification problem you use the categorical_crossentropy loss, as what it does is match the ground-truth probability distribution with the one predicted by the model.
This is exactly what is used for multi-class classification; you have a misconception if you think you can't use this loss.
I've been learning about logistic regression for a few days, and I think the labels in a logistic regression dataset need to be 1 or 0. Is that right?
But when I look at the libSVM library's regression datasets, I see the label values are continuous numbers (e.g. 1.0086, 1.0089, ...). Did I miss something?
Note that the libSVM library can be used for regression problems.
Thanks so much!
Contrary to its name, logistic regression is a classification algorithm: it outputs the class probability conditioned on the data point. Therefore the training-set labels need to be either 0 or 1. For the dataset you mentioned, logistic regression is not a suitable algorithm.
SVM is also a classification algorithm, and it uses input labels of -1 or 1. It is not a probabilistic algorithm and it doesn't output class probabilities. It can also be adapted to regression.
Are you using a third-party library or programming this yourself? Generally the labels are used as ground truth so you can see how effective your approach was.
For example, if your algorithm predicts -1 for a particular instance while the ground-truth label is +1, it means you did not successfully classify that instance.
Note that "regression" is a general term. To say someone will perform regression analysis doesn't necessarily tell you what algorithm they will be using, nor all of the nature of the data sets. All it really tells you is that you have a set of samples with features which you want to use to predict a single outcome value (a model for conditional probability).
One major difference between logistic regression and linear regression is that the former is usually trained on categorical, binary-labeled sample sets; while the latter is trained on real-labeled (ℝ) sample sets.
Any time your labels are real valued, it means you're probably going to use linear regression or similar, or else convert those real valued labels to categorical labels (e.g. via thresholds or bins) if you want to in fact use logistic regression. There is potentially a big difference in the quality and interpretation of your results though, if you try to convert from one such problem setup to another.
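For instance, a small sketch of that conversion (synthetic data; the median split is just one possible threshold):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic real-valued labels, then thresholded into two categories
# so that logistic regression applies.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y_real = X @ np.array([1.0, -0.5, 0.3, 0.0]) + rng.normal(scale=0.5, size=300)

y_binary = (y_real > np.median(y_real)).astype(int)  # threshold choice matters
clf = LogisticRegression().fit(X, y_binary)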
See also Regression Analysis.
I have a binary-class dataset (0/1) with a large skew towards the "0" class (about 30,000 vs 1,500). There are 7 features for each instance and no missing values.
When I use J48 or any other tree classifier, I get almost all of the "1" instances misclassified as "0".
Setting the classifier to "unpruned", setting the minimum number of instances per leaf to 1, setting the confidence factor to 1, adding a dummy attribute with the instance ID number - none of this helped.
I just can't create a model that overfits my data!
I've also tried almost all of the other classifiers Weka provides, but got similar results.
Using IB1 gets 100% accuracy (train set on train set), so it's not a problem of multiple instances with the same feature values but different classes.
How can I create a completely unpruned tree?
Or otherwise force Weka to overfit my data?
Thanks.
Update: Okay, this is absurd. I've used only about 3,100 negative and 1,200 positive examples, and this is the tree I got (unpruned!):
J48 unpruned tree
------------------
F <= 0.90747: 1 (201.0/54.0)
F > 0.90747: 0 (4153.0/1062.0)
Needless to say, IB1 still gives 100% precision.
Update 2: I don't know how I missed it - unpruned SimpleCart works and gives 100% accuracy train-on-train; pruned SimpleCart is not as biased as J48 and has decent false positive and false negative rates.
Weka contains two meta-classifiers of interest:
weka.classifiers.meta.CostSensitiveClassifier
weka.classifiers.meta.MetaCost
They allow you to make any algorithm cost-sensitive (not restricted to SVM) and to specify a cost matrix (the penalty for each type of error); you would give a higher penalty for misclassifying "1" instances as "0" than for erroneously classifying "0" as "1".
The result is that the algorithm then tries to minimize the expected misclassification cost rather than simply predict the most likely class.
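Weka's meta-classifiers are Java, but the same idea can be sketched in scikit-learn with class weights (the 20:1 cost ratio and the synthetic data below are assumptions):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: ~31,500 rows, 7 features, ~1,500 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(31_500, 7))
y = (rng.random(31_500) < 1500 / 31_500).astype(int)

# Make misclassifying a "1" as "0" cost 20x more than the reverse.
tree = DecisionTreeClassifier(class_weight={0: 1, 1: 20}).fit(X, y)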
The quick and dirty solution is to resample: throw away all but 1,500 of your negative examples and train on a balanced data set. I am pretty sure there is a resample component in Weka to do this.
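Outside Weka, a sketch of that undersampling in plain NumPy (synthetic data with roughly the question's class balance):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(31_500, 7))
y = (rng.random(31_500) < 1500 / 31_500).astype(int)

# Keep every positive example and sample an equal number of negatives.
pos = np.flatnonzero(y == 1)
neg = rng.choice(np.flatnonzero(y == 0), size=pos.size, replace=False)
keep = rng.permutation(np.concatenate([pos, neg]))
X_bal, y_bal = X[keep], y[keep]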
The other solution is to use a classifier with a variable cost for each class. I'm pretty sure libSVM allows you to do this, and I know Weka can wrap libSVM. However, I haven't used Weka in a while, so I can't be of much practical help here.