Closed. This question is opinion-based. It is not currently accepting answers.
Closed 1 year ago.
I have to evaluate a logistic regression model. The model is meant to detect fraud, so in production the algorithm will face highly imbalanced data.
Some people say that I only need to balance the training set, while the test set should remain similar to real-life data. On the other hand, many people say the model must be trained and tested on balanced samples.
I tested my model on both balanced and unbalanced sets and got the same ROC AUC (0.73), but different precision-recall AUCs: 0.4 for the unbalanced set and 0.74 for the balanced one.
What should I choose?
And what metrics should I use to evaluate my model's performance?
Since you are dealing with an imbalanced problem (a disproportionately greater amount of non-fraud than fraud), I recommend using the F-score on an unbalanced test set that matches the real-world distribution. This lets you compare models without having to balance your test set, since balancing would overrepresent fraud and underrepresent non-fraud cases in your test set.
Here are some references and how to implement it in sklearn:
https://en.wikipedia.org/wiki/F-score
https://deepai.org/machine-learning-glossary-and-terms/f-score
https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
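To make this concrete, here is a minimal sketch of the suggested workflow using a synthetic stand-in for a fraud dataset (the dataset, model, and ~5% positive rate are assumptions for illustration, not your actual data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a fraud dataset: roughly 5% positive (fraud) cases.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Stratified split keeps the real-world imbalance in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

f1 = f1_score(y_test, y_pred)
pr_auc = average_precision_score(y_test, y_score)  # area under the PR curve
print(f"F1 = {f1:.3f}, PR AUC = {pr_auc:.3f}")
```

Note that `average_precision_score` summarizes the precision-recall curve, which is exactly the metric that differed between your balanced and unbalanced sets; evaluating it on the realistically imbalanced test set gives the honest number.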
I'm working on a deep learning classifier (Keras and Python) that classifies time series into three categories. The loss function that I'm using is the standard categorical cross-entropy. In addition to this, I also have an attention map which is being learnt within the same model.
I would like this attention map to be as small as possible, so I'm using a regularizer. Here comes the problem: how do I set the right regularization parameter? I want the network to reach its maximum classification accuracy first, and only then start minimising the intensity of the attention map. For this reason, I train my model once without the regulariser and a second time with it on. However, if the regularisation parameter (lambda) is too high, the network completely loses accuracy and only minimises the attention, while if it is too small, the network only cares about the classification error and won't minimise the attention, even when accuracy is already at its maximum.
Is there a smarter way to combine the categorical cross-entropy with the regulariser? Maybe something that considers the variation of the categorical cross-entropy over time, and if it doesn't go down for, say, N iterations, only considers the regulariser?
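The "don't go down for N iterations" idea can be prototyped framework-free. Below is a hypothetical `LambdaScheduler` (the class name, `target`, `patience`, and `tol` parameters are all made up for illustration) that keeps the regularisation weight at zero until the cross-entropy loss has stopped improving, then switches it on; in Keras you would call `update()` once per epoch and feed the returned value into your combined loss:

```python
class LambdaScheduler:
    """Hypothetical schedule: lambda stays 0 until the classification loss
    has not improved by more than `tol` for `patience` consecutive updates,
    then jumps to `target`."""

    def __init__(self, target=1e-3, patience=5, tol=1e-4):
        self.target = target      # lambda value once the plateau is reached
        self.patience = patience  # how many stale epochs to tolerate
        self.tol = tol            # minimum improvement that counts
        self.best = float("inf")
        self.stale = 0
        self.value = 0.0

    def update(self, ce_loss):
        """Call once per epoch with the current cross-entropy loss."""
        if ce_loss < self.best - self.tol:
            self.best = ce_loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.value = self.target  # plateau reached: enable regulariser
        return self.value
```

The total loss each epoch would then be `cross_entropy + scheduler.value * attention_penalty`. This is a sketch of the scheduling logic only, not a drop-in Keras loss.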
Thank you
Regularisation is a way to combat overfitting, so you should first determine whether your model overfits. A simple check: compare the F1 score on the train and test sets. If the train F1 is high and the test F1 is low, you likely have overfitting, and adding some regularisation makes sense.
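A quick sketch of that check, using a deliberately unconstrained decision tree on synthetic data as an assumed example of a model that overfits:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree memorizes the training set, so it tends to overfit.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

f1_train = f1_score(y_train, clf.predict(X_train))
f1_test = f1_score(y_test, clf.predict(X_test))
print(f"train F1 = {f1_train:.2f}, test F1 = {f1_test:.2f}")
# A large gap (train near 1.0, test noticeably lower) signals overfitting.
```

The same comparison works for any classifier, including the Keras model above, by computing F1 on predictions for the train and validation sets.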
I am new to Machine Learning. Can anyone tell me the major difference between classification and regression in machine learning?
Regression aims to predict a continuous output value. For example, say that you are trying to predict the revenue of a certain brand as a function of many input parameters. A regression model would literally be a function which can output potentially any revenue number based on certain inputs. It could even output revenue numbers which never appeared anywhere in your training set.
Classification aims to predict which class (a discrete integer or categorical label) the input corresponds to. E.g. let us say that you had divided the sales into Low and High sales, and you were trying to build a model which could predict Low or High sales (binary/two-class classification). The inputs might even be the same as before, but the output would be different. In the case of classification, your model would output either "Low" or "High," and in theory every input would generate only one of these two responses.
(This answer is true for any machine learning method; my personal experience has been with random forests and decision trees).
Regression - the output variable takes continuous values.
Example: Given a picture of a person, we have to predict their age on the basis of the given picture.
Classification - the output variable takes class labels.
Example: Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
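The distinction above can be shown with the revenue example from the earlier answer; this is a toy sketch (the numbers and the >30 threshold are invented), fitting a regressor and a classifier to the same inputs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Same inputs for both models; two different kinds of target.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
revenue = np.array([10.0, 20.0, 28.0, 41.0, 52.0, 60.0])  # continuous target
label = (revenue > 30).astype(int)                         # discrete target

reg = LinearRegression().fit(X, revenue)    # regression
clf = LogisticRegression().fit(X, label)    # classification

# The regressor can output a revenue value never seen in training;
# the classifier can only output one of the known classes, 0 or 1.
print(reg.predict([[3.5]]))
print(clf.predict([[3.5]]))
```

The regressor's output lands somewhere on a continuum; the classifier's output is always one of the labels it was trained on, which is exactly the difference the answers describe.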
I am a beginner in the Machine Learning field, but as far as I know, regression is for "continuous values" and classification is for "discrete values". With regression, you fit a line (or curve) to your continuous values and can see whether your model fits well or poorly. With classification, on the other hand, you see how inputs are assigned to discrete labels. If I am wrong, please feel free to correct me.
I have a question about data preprocessing for machine learning, specifically transforming the data so it has zero mean and unit variance.
I have split my data into two datasets (I know I should have three, but for simplicity let's say two). Should I compute the mean and variance on the training set only and then use that same transformation on each test input vector, or should I transform the entire dataset (training and testing) together so that the whole thing has zero mean and unit variance? My belief is that I should do the former, so that I don't introduce an appreciable amount of bias into the test set. But I am no expert, hence my question.
Your preprocessor should be fitted on the training set only; the fitted mean and variance are then used to transform the test set. Computing these statistics on train and test together leaks information about the test set.
Let me link you to a good course on Deep Learning and quote it (both from Andrej Karpathy):
Common pitfall. An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test).
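In scikit-learn this pattern maps directly onto `StandardScaler`; a minimal sketch with made-up random data standing in for your train/test split:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for a real train/test split.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler().fit(X_train)   # statistics from training data ONLY
X_train_std = scaler.transform(X_train)  # exactly zero mean / unit variance
X_test_std = scaler.transform(X_test)    # same train-derived mean/std applied
```

The test set ends up only approximately standardized, and that is correct: it was transformed with the training statistics, so no information about the test set leaked into the preprocessing.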
We have a dataset:
the daily sale count of 100 products from January to June.
Our objective is to predict each day's sale count in July.
How should we split the dataset into a training set and a validation set?
Time series are the typical case where you should not split randomly (in general, you should not split randomly whenever there is significant example-example correlation).
Sales usually aren't a strictly dynamic time series (such as stock prices), but using train_test_split could still be problematic.
You can obtain the desired cross-validation splits without using sklearn (e.g. sklearn: User defined cross validation for time series data, Pythonic Cross Validation on Time Series...).
70-80% for training is standard. Assuming a uniform distribution of examples, you can use the data from January to April/May for the training set and the remaining records for validation.
Currently, to my knowledge, sklearn does not support rigorous cross-validation of time-dependent problems. All out-of-the-box cross-validation routines will construct training folds that include future information relative to test folds (e.g. [WIP] RollingWindow cross-validation #3638).
Moreover you should consider if your data are seasonal or have another obvious division in groups (e.g. geographic regions).
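Note that newer scikit-learn versions do ship `TimeSeriesSplit`, which produces exactly the kind of folds discussed above: every training index precedes every validation index. A minimal sketch on placeholder time-ordered data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder: 24 time-ordered observations (e.g. daily sales totals).
X = np.arange(24).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(X):
    # Each fold trains only on the past and validates on the future.
    print(f"train up to t={train_idx.max()}, validate t={val_idx.min()}..{val_idx.max()}")
```

For the January-June sales question, the equivalent manual split is simply a cutoff date: rows on or before the cutoff go to training, later rows to validation.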
As given in the textbook Machine Learning by Tom M. Mitchell, the first statement about decision trees states that "Decision tree learning is a method for approximating discrete-valued functions". Could someone kindly elaborate on this statement, and perhaps justify it with an example? Thanks in advance :)
In a simple example, consider observation rows with two attributes; the training data contains a classification (discrete values) based on a combination of those attributes. The learning phase has to determine which attributes to consider in which order, so that it can effectively achieve the desired modelling.
For instance, consider a model that will answer "What should I order for dinner?" given the inputs of desired price range, cuisine, and spiciness. The training data will contain your history from a variety of restaurant experiences. The model will have to determine which order of tests is most effective in reaching a good entrée classification: eliminate restaurants based on cuisine first, then consider price, and finally tune the choice according to Scoville units; or perhaps check the spiciness first and start by dumping choices that aren't spicy enough before going on to the other two factors.
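The "discrete-valued function" part can be seen directly in code. Here is a toy encoding of the dinner example (the features, their numeric encoding, and the dish labels are all invented for illustration): whatever input you feed the fitted tree, its output is always one of the discrete labels it saw in training, never a value in between.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy encoding: [price tier (1-3), cuisine id, spiciness (Scoville / 1000)]
X = [[1, 0, 0], [3, 1, 5], [2, 1, 8], [1, 2, 2], [3, 0, 0], [2, 2, 9]]
y = ["tacos", "curry", "curry", "noodles", "salad", "curry"]  # discrete values

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The learned function maps any input to one of the training labels.
print(tree.predict([[2, 1, 7]]))
```

That is Mitchell's point: the tree approximates a function whose range is a finite set of discrete values, by learning which attribute tests to apply in which order.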
Does that explain what you need?