I am working on a classifier based on logistic regression in Spark ML, and I wonder whether I should train on equal quantities of data for the two classes (true/false).
For example, when I want to classify people as male or female, is it OK to train a model with 100 male samples + 100 female samples?
The live population may be 40% male and 60% female, but this percentage is forecast from the past, so it can change (e.g. to 30% female, 70% male).
In this situation, what female/male ratio of data should I train on?
Is this related to overfitting?
If I train a model on 40% female + 60% male data, is it useless for classifying field data composed of 70% female + 30% male?
The Spark binary-classification sample data has 43 false and 57 true examples:
https://github.com/apache/spark/blob/master/data/mllib/sample_binary_classification_data.txt
What does the true/false ratio of the training data mean for logistic regression?
I am really not good at English, but I hope you understand me.
It should not matter what ratio you use, as long as it is reasonable.
60:40, 30:70, 50:50, it's okay. Just make sure it's not too lopsided, like 99:1.
If the entire data set is 70:30 female:male and you want to use only a subset of it, going for a 60:40 female:male ratio will not kill you.
Consider the following example:
Your test data contains 99% males and 1% females.
Technically, you can classify all males correctly and ALL females incorrectly, and your algorithm will show an error of only 1%. Seems pretty good, right? No, because your data is too lopsided.
This low error is not a result of overfitting (high variance), but rather a result of a lopsided data set.
This is an extreme example, but you get the point.
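To see why accuracy alone is misleading here, here is a minimal sketch (my own illustration with made-up numbers, not from the original post) of a classifier that always predicts the majority class on such a 99:1 test set:
# a trivial "always predict male" classifier on a 99:1 test set
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 990 + [0] * 10)            # 1 = male (99%), 0 = female (1%)
y_pred = np.ones_like(y_true)                      # always predict "male"

print(accuracy_score(y_true, y_pred))              # 0.99 -> only 1% error, looks great
print(recall_score(y_true, y_pred, pos_label=0))   # 0.0  -> every female is misclassified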
Related
I am working on salamander biology, and I am looking for the climatic and geomorphological variables that best (or at least sufficiently) explain their presence/absence in an area. I have 1855 pixels with salamander presence and 104760 without, and my climatic and geomorphological variables cover all of this area (all of these pixels). I am applying multiple logistic regression in R using glm(). The multicollinearity of my models seems acceptable (all variables have VIF < 3), but the AIC values of my models are high (18272.47, 17576.52, 17391.83, 17087.87, 17026.07) and unfortunately so are the delta AICs (61.79, 365.76, 550.44, 1246.40). I am more of a "salamander biologist" than a statistician. Can I ask for any advice or recommendations?
Many thanks
You have high AIC and delta AIC because you have a lot of observations.
AIC is only useful when you compare models fit to the same dataset; by itself, an AIC value doesn't mean anything. The formula for AIC (as on Wikipedia) is:
2k - 2*log(likelihood of the model), i.e. 2k minus twice the log-likelihood, where k is the number of
estimated parameters.
So the more observations you have, the larger -2*logLik (the deviance) becomes, and with it the AIC. For example (below, the deviance is -2*logLik):
data = iris
data$Species = factor(ifelse(data$Species=="versicolor","v","o"))
fit_full = glm(Species ~ .,data=data,family="binomial")
summary(fit_full)[c("aic","deviance")]
$aic
[1] 155.0697
$deviance
[1] 145.0697
We fit on a random subset of 50 observations:
fit_50 = glm(Species ~ .,data[sample(nrow(data),50),],family="binomial")
summary(fit_50)[c("aic","deviance")]
$aic
[1] 106.369
$deviance
[1] 96.36902
One check you can do is an analysis of deviance, anova(fit_full, test="Chisq"), to see whether any of your independent variables shows a strong effect. Another thing you can do is see whether you are predicting the labels correctly (confusionMatrix() comes from the caret package):
library(caret)
pred_labels = ifelse(predict(fit_full, type = "response") > 0.5, "v", "o")
confusionMatrix(table(pred_labels, data$Species))$overall
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
0.7400000 0.3809524 0.6621433 0.8081242 0.6666667
AccuracyPValue McnemarPValue
0.0325328 0.1093146
I have an anomaly detection problem with a big difference between healthy and anomalous data (i.e. >20,000 healthy datapoints against <30 anomalies).
Currently I use just precision, recall and F1 score to measure the performance of my model. I also have no good method for setting the threshold parameter, but that is not the problem at the moment.
I want to measure whether the model is able to distinguish between the two classes independently of the threshold. I have read that the ROC-AUC measure can be used if the data is unbalanced (https://medium.com/usf-msds/choosing-the-right-metric-for-evaluating-machine-learning-models-part-2-86d5649a5428). But with my data I get very high ROC-AUC scores (>0.97), even though the model outputs low scores when an anomaly occurs.
Does someone know a better performance measure for this task, or should I stick with the ROC-AUC score?
Let me add an example of my problem:
Consider a case with 20448 data points, 26 of which are anomalies. With my model I get the following anomaly scores for these anomalies:
[1.26146367, 1.90735495, 3.08136725, 1.35184909, 2.45533306,
2.27591039, 2.5894709 , 1.8333928 , 2.19098432, 1.64351134,
1.38457746, 1.87627623, 3.06143893, 2.95044859, 1.35565042,
2.26926566, 1.59751463, 3.1462369 , 1.6684134 , 3.02167491,
3.14508974, 1.0376038 , 1.86455995, 1.61870919, 1.35576177,
1.64351134]
If I now count how many data points have a higher anomaly score than, for example, 1.38457746, I get 281 data points. That looks like bad performance from my perspective, but the ROC-AUC score is still 0.976038.
len(np.where(scores > 1.38457746)[0]) # 281
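For reference, here is a minimal sketch of how such numbers are computed. The healthy scores below are a synthetic stand-in (only the 26 anomaly scores are listed above), so the exact AUC value will differ from 0.976038:
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
healthy_scores = rng.normal(loc=0.8, scale=0.4, size=20422)        # hypothetical scores for the 20422 healthy points
anomaly_scores = np.array([1.26146367, 1.90735495, 3.08136725,
                           1.35184909, 2.45533306, 2.27591039])    # a few of the 26 anomaly scores listed above

scores = np.concatenate([healthy_scores, anomaly_scores])
labels = np.concatenate([np.zeros(healthy_scores.size),            # 0 = healthy
                         np.ones(anomaly_scores.size)])            # 1 = anomaly

print(roc_auc_score(labels, scores))       # stays high because most of the 20,000+ healthy points rank below the anomalies
print(int((scores > 1.38457746).sum()))    # how many points score above a given anomaly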
I'm new to the machine learning field.
I'm trying to classify 10 people based on their phone call logs.
The phone call logs look like this:
UserId  IsInboundCall  Duration  PhoneNumber(hashed)
1       false          23        1011112222
2       true           45        1033334444
Training an SVM from sklearn on about 8700 of these logs gives an accuracy of 88%.
I have several questions about this result:
What is a proper way to use non-ordinal data (e.g. a phone number)?
I'm not sure about using a hashed phone number as a feature, but this multi-class classifier's accuracy is not bad. Is that just a coincidence?
How do I use non-ordinal data as a feature?
If this classifier has to classify more than 1000 classes (more than 1000 users), will SVM still work in that case?
Any advice would be helpful. Thanks.
1) Try the SVM without PhoneNumber as a feature to get a sense of how much impact it has.
2) For non-ordinal (categorical) data you can either map each category to a number or use a 1-of-K (one-hot) approach. Say you added a Phone OS field with possible values {IOS, Android, Blackberry}: you can represent it as a single number 0, 1, 2, or as 3 features (1,0,0), (0,1,0), (0,0,1). See the sketch after this list.
3) The SVM will still give good results as long as the data is approximately linearly separable. To achieve this you might need to add more features and map them into a different feature space (an RBF kernel is a good start).
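As an illustration of point 2, here is a minimal sketch using sklearn's OneHotEncoder; the Phone OS column is the hypothetical example above, not part of the original call-log data:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

phone_os = np.array([["IOS"], ["Android"], ["Blackberry"], ["Android"]])  # hypothetical categorical column

encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(phone_os).toarray()

print(encoder.categories_)   # [array(['Android', 'Blackberry', 'IOS'], ...)]
print(one_hot)               # each row is a 1-of-K vector, e.g. IOS -> [0., 0., 1.]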
I am working on classification using the Random Forest algorithm in Spark, and I have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row serves as the label and the rest serve as features. But I want to treat the label as a category, not a number, so 165.22352099 will denote one category and so will -552.8234. For this I have encoded my features as well as my label into categorical data. What I am having difficulty with is deciding what I should pass for the numClasses parameter of the Random Forest algorithm in Spark MLlib. Should it be equal to the number of unique values in my label? My label has about 10000 unique values, so if I put 10000 as the value of numClasses, wouldn't that decrease the performance dramatically?
Here is the typical signature of building a model for Random Forest in MlLib:
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
numTrees=3, featureSubsetStrategy="auto",
impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something you should not do. Your problem is clearly regression/ranking, not classification. Why would you think of it as classification? Try to answer these two questions:
Do you have at least 100 samples per value (10,000 * 100 = 1,000,000 samples)?
Is there really no structure in the classes? For example, are objects with value "200" no more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered yes to both, then the answer is: "yes, you should encode each distinct value as a different class", leading to 10000 unique classes, which in turn leads to:
extremely imbalanced classification (RF without a balancing meta-learner will nearly always fail in such a scenario)
an extreme number of classes (there are no models able to solve this well; RF certainly will not)
an extremely low-dimensional problem: given how few features you have, I would be surprised if you could predict even a binary classification from them. You can see how irregular these values are; you have 3 points which differ only in the first feature yet have completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up, with nearly 100% certainty this is not a classification problem. You should either:
regress on the last value (keyword: regression) - see the sketch after this list,
build a ranking (keyword: learning to rank), or
bucket your values into at most 10 distinct bins and then classify (keywords: imbalanced classification, sparse binary representation).
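If you go the regression route, here is a hedged sketch of what that could look like with the same MLlib API, mirroring the trainClassifier call from the question; trainingData is assumed to be the same RDD of LabeledPoint, with the numeric last value as the (continuous) label:
from pyspark.mllib.tree import RandomForest

model = RandomForest.trainRegressor(trainingData,
                                    categoricalFeaturesInfo={},  # or {column_index: arity} for the encoded columns
                                    numTrees=3,
                                    featureSubsetStrategy="auto",
                                    impurity="variance",         # regression splits on variance, not gini
                                    maxDepth=4,
                                    maxBins=32)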
I'm training and cross-validating (10-fold) data using libSVM (with linear kernel).
Each datapoint consists of 1800 fMRI intensity voxels.
There are around 88 datapoints in the training-set file for svm-train.
The training-set file looks as follows:
+1 1:0.9 2:-0.2 ... 1800:0.1
-1 1:0.6 2:0.9 ... 1800:-0.98
...
I should also mention I'm using the svm-train program that comes with the libSVM package.
The problem is that when running svm-train it reports 100% accuracy!
This doesn't seem to reflect the true classification performance!
The data isn't unbalanced, since the number of datapoints labeled +1 equals the number of datapoints labeled -1.
I've also checked the scaling (it is correct), and I tried changing the labels randomly to see how that impacts the accuracy - it only decreases from 100% to 97.9%.
Could you please help me understand the problem?
If so, what can I do to fix it?
Thanks,
Gal Star
Make sure you include '-v 10' in the svm-train options. I'm not sure whether your 100% accuracy comes from the training sample or a validation sample. It is very easy to get 100% training accuracy when you have far fewer samples than features. But if your model suffers from overfitting, the validation accuracy may be low.
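To illustrate the difference, here is a minimal sketch (my own, using sklearn instead of the libSVM command line) on random data of the same shape as described in the question; with 88 samples and 1800 features a linear SVM separates even pure noise perfectly, so only the cross-validated accuracy is meaningful:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(88, 1800))         # 88 datapoints, 1800 voxel features (pure noise here)
y = np.array([1] * 44 + [-1] * 44)      # balanced labels, as in the question

clf = SVC(kernel="linear", C=1.0)
print(clf.fit(X, y).score(X, y))                    # training accuracy: 1.0
print(cross_val_score(clf, X, y, cv=10).mean())     # 10-fold CV accuracy: near chance (~0.5)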