I am trying to predict the total expenditure of a consumer from the Consumer Expenditure Survey (data here). I chose Age, Income, Urban/Rural, Sex, and Education as the variables to predict the total expenditure of a household.
The correlation between Income and Expenditure is relatively low, and the predictions have an RMSE of ~3000 on data with a mean of ~10000. I used transformation, normalization, scaling, and cross-validation to pre-process the data. However, none of the models perform well in predicting the total expenditure. Is there any way to improve the predictions?
(I tried linear regression, lasso, KNN, random forest, and gradient boosting.)
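For reference, here is a minimal sketch of the kind of pipeline I mean, assuming scikit-learn; the column names and the synthetic data are purely illustrative stand-ins for the survey data:

```python
# Hedged sketch: preprocessing + model pipeline with cross-validated RMSE.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({                      # stand-in for the survey data
    "Age": rng.integers(18, 80, n),
    "Income": rng.gamma(2.0, 5000.0, n),
    "UrbanRural": rng.choice(["Urban", "Rural"], n),
    "Sex": rng.choice(["M", "F"], n),
    "Education": rng.integers(0, 5, n),
    "TotalExpenditure": rng.gamma(2.0, 5000.0, n),
})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     ["UrbanRural", "Sex", "Education"]),
])
model = Pipeline([("pre", pre), ("gbm", GradientBoostingRegressor())])

# Expenditure is typically right-skewed, so predicting log1p(target)
# often stabilizes the errors; the resulting RMSE is on the log scale.
scores = cross_val_score(
    model, df.drop(columns="TotalExpenditure"),
    np.log1p(df["TotalExpenditure"]),
    scoring="neg_root_mean_squared_error", cv=5,
)
print(-scores.mean())
```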
Here's the scatterplot for income vs. expenditure:

[scatterplot: income vs. expenditure]
I think the models are not performing well because of the low correlation. Any ideas for tackling such situations?
I am currently building a binary classification model to predict stock price movements (trend prediction). More specifically, the model predicts the probability that a stock outperforms the daily median return:
> Class 0: return >= median return
>
> Class 1: return < median return
Accordingly, I should be dealing with a balanced prediction problem.
Every day, the ten stocks with the highest predicted probability will be bought, and the ten with the lowest will be shorted. So, ideally, the model performs well on both classes (I use softmax, so the model must decide exclusively for one class).
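To make the setup concrete, here is a minimal sketch of the daily selection rule; the DataFrame and the `prob_up` column name are hypothetical:

```python
# Hedged sketch of the daily long/short selection; 'prob_up' is a
# hypothetical column holding the softmax probability of class 0
# (return >= median return) for each stock on a given day.
import pandas as pd

def select_positions(preds: pd.DataFrame) -> tuple[pd.Index, pd.Index]:
    """Return the tickers to buy and to short for one day."""
    ranked = preds.sort_values("prob_up", ascending=False)
    longs = ranked.head(10).index    # ten highest probabilities: buy
    shorts = ranked.tail(10).index   # ten lowest probabilities: short
    return longs, shorts
```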
I am wondering whether I should use accuracy, F1, or AUC-ROC when choosing the optimal model under these circumstances.
My understanding is that all of these are suitable metrics when the two classes are equally important. This StackExchange answer recommends the AUC over accuracy because it will "strongly discourage people going for models that are representative, but not discriminative (...) and [only] select models that achieve false positive and true positive rates that are significantly above random chance, which is not guaranteed for accuracy". In contrast, this answer recommends the F1 score because it is a combination of accuracy and the AUC score.
I guess what's confusing me is that I will make use of both classes based on the probability assigned by the model. Also, I do not have an imbalanced dataset, which is what usually calls for using the AUC-ROC.
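For concreteness, this is how I would compute the three candidates on validation data, assuming scikit-learn; the labels and probabilities below are synthetic stand-ins:

```python
# Hedged sketch: accuracy and F1 use hard labels at a threshold, while
# AUC-ROC is computed directly from the predicted probabilities.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, 500)                      # stand-in true labels
prob = np.clip(0.5 * y_val + rng.random(500), 0, 1)  # stand-in P(class 1)
y_pred = (prob >= 0.5).astype(int)                   # hard labels at 0.5

print("accuracy:", accuracy_score(y_val, y_pred))  # threshold-dependent
print("F1:      ", f1_score(y_val, y_pred))        # threshold-dependent
print("AUC-ROC: ", roc_auc_score(y_val, prob))     # ranking-based
```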
Which evaluation metric should I choose to find the optimal model on validation data?
Thanks a lot for any thoughts or recommendations.
I'm trying to compare multiple species distribution modelling approaches via k-fold cross-validation. Currently I'm calculating the RMSE and AUC to compare model performance. A friend suggested additionally using the sum of log-likelihoods as a metric to compare models. However, one of the models is a random forest fitted with the ranger package. If it is possible at all, how would I calculate the log-likelihood for a random forest model, and would it actually be a metric comparable with the other models (GAM, GLM)?
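To make the question concrete, this is how I would naively compute a Bernoulli log-likelihood from predicted class probabilities (a sketch in Python rather than R/ranger; y is presence/absence and p is the forest's predicted probability of presence):

```python
# Hedged sketch: sum of Bernoulli log-likelihoods from class probabilities.
import numpy as np

def bernoulli_log_likelihood(y, p, eps=1e-9):
    """y: 0/1 presence-absence labels; p: predicted P(presence).
    Probabilities are clipped because a random forest can output exactly
    0 or 1, which would make the log-likelihood -inf."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Example with stand-in values:
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.6, 1.0, 0.0])
print(bernoulli_log_likelihood(y, p))
```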
Thanks for your help.
I am working on a classification project, and I am evaluating different ML models based on their training accuracy, testing accuracy, confusion matrix, and AUC score. I am now stuck on understanding the difference between the score I get by calculating the accuracy of an ML model on the test set (X_test) and the AUC score.
If I am correct, both metrics quantify how well an ML model is able to predict the correct class of previously unseen data. I also understand that for both, the higher the number, the better, as long as the model is neither over-fit nor under-fit.
Assuming an ML model is neither over-fit nor under-fit, what is the difference between the test accuracy score and the AUC score?
I don't have a background in math and stats, and I pivoted towards data science from a business background. Therefore, I would appreciate an explanation a business person can understand.
Both metrics quantify the quality of a classification model; however, accuracy quantifies a single operating point of the classifier, i.e., it describes a single confusion matrix. The AUC (area under the curve) represents the trade-off between the true positive rate (TPR) and the false positive rate (FPR) across multiple confusion matrices, generated for different FPR values for the same classifier.
A confusion matrix is of the form:

|                 | predicted positive | predicted negative |
|-----------------|--------------------|--------------------|
| actual positive | tp                 | fn                 |
| actual negative | fp                 | tn                 |
1) The accuracy is a measure for a single confusion matrix and is defined as:

$$\text{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$$
where tp = true positives, tn = true negatives, fp = false positives, and fn = false negatives (the count of each).
2) The AUC measures the area under the ROC curve (receiver operating characteristic), which is the trade-off curve between the true positive rate and the false positive rate. For each choice of the false positive rate (FPR), the corresponding true positive rate (TPR) is determined. That is, for a given classifier an FPR of 0, 0.1, 0.2, and so forth is accepted, and for each FPR the TPR it yields is evaluated. You therefore get a function TPR(FPR) that maps the interval [0, 1] onto the same interval, because both rates are defined on that interval. The area under this curve is called the AUC; it lies between 0 and 1, and a random classifier is expected to yield an AUC of 0.5.
The AUC, as it is the area under the curve, is defined as:

$$\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d\,\text{FPR}$$
However, in real (and finite) applications, the ROC is a step function, and the AUC is determined by a weighted sum of these levels.
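A small sketch of the step-function AUC on finite data, assuming scikit-learn; the labels and scores are synthetic:

```python
# Hedged sketch: the empirical ROC is a step function, and auc() sums
# the areas under its levels.
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(auc(fpr, tpr))  # area under the empirical (step) ROC curve
```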
Graphics are from Borgelt's Intelligent Data Mining Lecture.
What's the relation between mutual information and prediction accuracy for classification, or MSE for regression? Is it possible to have high accuracy / low MSE together with low mutual information in data mining?
Mutual information is defined for pairs of probability distributions. Much of what can be said regarding its relationship to other quantities depends heavily on how you compute and represent these probability distributions (e.g. discrete versus continuous probability distributions).
Given a set of probability distributions, the relationship between classification accuracy and mutual information has been studied in the literature. In short, one quantity puts bounds on the other, at least for discrete probability distributions.
I don't know of any formal studies looking at the relationship between the MSE and mutual information.
All of that being said, if I had a concrete data set and got low mutual information scores for two variables but also a very low MSE in a regression model, I would take a hard look at how the mutual information was computed. 99 out of 100 times this occurs because the original formulation of Shannon entropy (and by extension mutual information) is used on continuous / floating point data, even though this method only applies to discrete data.
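A sketch of that pitfall, assuming scikit-learn; the data is synthetic:

```python
# Hedged sketch: discrete mutual information applied to raw floats is
# meaningless, because every value becomes its own "symbol".
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(scale=0.1, size=1000)  # near-deterministic relation

# Wrong: every float is unique, so this says nothing about the relation.
print(mutual_info_score(x, y))

# Better: discretize first, then use the discrete formulation ...
xb = np.digitize(x, np.histogram_bin_edges(x, bins=20))
yb = np.digitize(y, np.histogram_bin_edges(y, bins=20))
print(mutual_info_score(xb, yb))

# ... or use an estimator designed for continuous variables.
print(mutual_info_regression(x.reshape(-1, 1), y)[0])
```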
I have been working on sentiment analysis prediction using the Rotten Tomatoes movie reviews dataset.
The dataset has 5 classes {0, 1, 2, 3, 4}, where 0 is very negative and 4 is very positive.
The dataset is highly unbalanced:
total samples = 156061
'0': 7072 (4.5%)
'1': 27273 (17.5%)
'2': 79583 (51.0%)
'3': 32927 (21.1%)
'4': 9206 (5.9%)
As you can see, class 2 has almost 51% of the samples, while classes 0 and 4 together contribute only ~10% of the training set.
So there is a very strong bias towards class 2, which reduces the classification accuracy for classes 0 and 4.
What can I do to balance the dataset? One solution would be to undersample, reducing each class to 7072 samples to get equal counts, but that shrinks the dataset drastically!
How can I optimize and balance the dataset without affecting the overall classification accuracy?
You should not balance the dataset; you should train the classifier in a balanced manner. Nearly all existing classifiers can be trained with some cost-sensitive objective. For example, SVMs let you "weight" your samples: simply weight samples of the smaller classes more heavily. Similarly, naive Bayes has class priors: change them! Random forests, neural networks, and logistic regression all let you "weight" samples in some way; it is the core technique for getting more balanced results.
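A minimal sketch of the weighting idea, assuming scikit-learn; the counts mirror the distribution in the question:

```python
# Hedged sketch: inverse-frequency class weights for a cost-sensitive fit.
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

classes = np.array([0, 1, 2, 3, 4])
counts = np.array([7072, 27273, 79583, 32927, 9206])
y = np.repeat(classes, counts)  # stand-in labels, just to derive weights

# 'balanced' weight = n_samples / (n_classes * class_count):
weights = compute_class_weight("balanced", classes=classes, y=y)
print(dict(zip(classes, weights.round(3))))
# class 0 -> ~4.413, class 2 -> ~0.392: rare classes count ~11x more

# Pass the weights to any estimator that supports them:
clf = SVC(class_weight=dict(zip(classes.tolist(), weights)))
```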
For classification problems, you can try the class_weight='balanced' option in your estimator, such as LogisticRegression, SVM, etc. For example:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
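A short illustration, assuming scikit-learn; X_train and y_train are hypothetical:

```python
# Hedged sketch: class_weight='balanced' reweights each class by
# n_samples / (n_classes * class_count) during fitting.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
# clf.fit(X_train, y_train)  # hypothetical training data
```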