I'm trying to build a classifier "model" using some classification techniques. Beginning with the C4.5 technique, I faced the problem of missing values, so:
How should I deal with the missing values that exist in a dataset?
Should I leave the "?" in the missing attribute?
There are several ways of dealing with missing values:
Get missing data: If possible, try to acquire missing values.
Discard missing data: Reduce the data available to a dataset having no missing values by discarding all instances with missing values or features.
Imputation: A better strategy is to impute the missing values, i.e., to infer them from the known part of the data. A common approach is to use the mean, the median, or the most frequent value of the row or column in which the missing values are located. Using multiple imputation is recommended.
This might help: http://jmlr.csail.mit.edu/papers/volume8/saar-tsechansky07a/saar-tsechansky07a.pdf
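As a minimal sketch of the imputation option, here is how mean imputation looks with scikit-learn's SimpleImputer (the toy matrix X is a placeholder, not the asker's data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing entries marked as NaN.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of its column;
# strategy can also be "median" or "most_frequent".
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```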
I am currently working on a project for my MSc and I am having this issue with the dataset. I don't have previous experience in machine learning and this is my first exposure.
In my dataset I started doing my EDA (Exploratory Data Analysis) and I have a categorical feature with missing data, Province_State. This column has 52360 missing values, which as a percentage is 5.40%. I guess that is not too bad, and according to what I have learnt, I should either impute these missing values or delete the column if I have a reasonable justification.
My reasoning is that not every country has provinces, so it is perfectly normal that there are missing values. I don't see a point in imputing these missing values with a random value, because that is not logical and it will also introduce inaccuracy into the model: we cannot make up a value that does not practically exist for that particular country.
I think I should do one of the following:
Impute all the missing values to a constant value such as -1 or "NotApplicable"
Remove the feature from the dataset
Please help me with a solution and thank you very much in advance.
(This dataset can be accessed from this link)
There are many ways to handle missing data. Deleting the whole column is not a good idea in most cases, as you will be discarding information. However, if you still want to delete the feature, first perform univariate analysis on it to see whether it is useful, and decide accordingly.
Instead of removing the feature you can use any of the following ways:
Impute missing values with Mean/Median.
Predict missing values.
Impute all the missing values to -1.
Use algorithms that support missing values.
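For the Province_State case specifically, the constant-value option can be as simple as the following pandas sketch (the toy DataFrame is illustrative, not the asker's dataset):

```python
import pandas as pd

# Toy data: countries without provinces legitimately lack a value.
df = pd.DataFrame({"Country": ["US", "Monaco", "Canada"],
                   "Province_State": ["California", None, "Ontario"]})

# An explicit "NotApplicable" category preserves the signal that
# the country has no provinces, instead of pretending a value exists.
df["Province_State"] = df["Province_State"].fillna("NotApplicable")
print(df)
```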
I have a dataset and am trying to predict house prices. Several variables (#bedrooms, #bathrooms, area, ...) use the constants 0 or -1 to indicate "not known". Is this good practice?
Dropping these values would result in the loss of too much data. Interpolation does not seem like a good option, especially since there are cases where multiple of these values are unknown and they are relatively highly correlated to each other.
Substituting these values with the mean of the column would not work, seeing as all houses are fundamentally different.
Does anyone have advice on this?
Totally depends on what ML algorithm you want to use. Some can handle null values for missing data and others can't.
Usually interpolating/predicting these missing values is a reasonable idea. You could run another algorithm first to predict the missing values based on your available data and then run a second algorithm for the housing price prediction.
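As a hedged sketch of that two-stage idea, scikit-learn's (experimental) IterativeImputer regresses each incomplete feature on the others; the column meanings and the -1 sentinel follow the question, but the data here is made up:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Columns: bedrooms, bathrooms, area; -1 marks "not known" in the raw data.
X = np.array([[3.0, 2.0, 120.0],
              [-1.0, 1.0, 80.0],
              [4.0, -1.0, 150.0]])
X[X == -1] = np.nan  # convert the sentinel to a proper missing marker

# Each feature with missing entries is predicted from the other features,
# which exploits the correlation between bedrooms, bathrooms, and area.
imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```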
I have missing values in my target variable (y). Since I want to train my model with more data, I don't want to drop the missing rows; instead, I'd like to use the KNN Imputer algorithm. But I'd also like to prevent data leakage, so the best way is to split the data into "train" and "test", then impute the missing target values in the train dataset (the same can be done for missing values in the test dataset).
However, I faced an error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
As I understand it, the missing values (NaN) caused that specific error.
How should I proceed when there are missing values in the target variable?
Imputing the target variable is not advised (unless you are sure about the value), because the targets control what the learning algorithm learns. If you already knew the value of the target variable, there would be no need for an ML algorithm, right? Therefore, the best way to deal with a missing target is to drop those rows. For other missing features, you can use imputation strategies.
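A minimal sketch of that advice, assuming a pandas DataFrame df with a target column named "y" (the toy data and names are illustrative): drop rows whose target is missing, then impute the features only.

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"x1": [1.0, 2.0, None, 4.0],
                   "x2": [10.0, None, 30.0, 40.0],
                   "y":  [0.0, 1.0, None, 1.0]})

df = df.dropna(subset=["y"])        # remove rows with a missing target
X, y = df[["x1", "x2"]], df["y"]

# KNN imputation is applied to the features only, never to y;
# fit it on the training split alone to avoid leakage.
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
```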
I am working on a classification problem where missing values are denoted by '?'. Why replace them with -99999, as in:
df.replace('?',-99999,inplace=True)
This depends on what your eventual use of this data will be. Having strings in a numerical column is not great from a data cleanliness perspective, so replacing with np.nan (a float type) or the newish pd.NA is probably the best idea from a data presentation standpoint. Most models cannot make use of those values, but some can, e.g. xgboost. For models that cannot handle missing values (or when you don't want the model to handle them internally), you need to decide the best way to impute.
Imputing with values outside the real data range, like -99999, is largely fine for tree models: they don't care about scale, so you're really just saying it's less than everything else. In parametric models like logistic regression though, this will badly mess up parameter estimates, and I'd strongly advise against it. Adding missingness indicators helps out, but still I suspect numerical issues with such large imputation values, and so mean/median or model-based imputation would be better.
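For parametric models, a hedged alternative to the -99999 sentinel looks like this in pandas (the single-column DataFrame is a made-up stand-in for the asker's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature": ["1.5", "?", "3.2", "?"]})

# Convert '?' to a proper missing marker and make the column numeric.
df["feature"] = df["feature"].replace("?", np.nan).astype(float)

# Keep the missingness signal as an explicit indicator column...
df["feature_missing"] = df["feature"].isna().astype(int)

# ...then impute with the median instead of an extreme sentinel.
df["feature"] = df["feature"].fillna(df["feature"].median())
print(df)
```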
We can replace it with any very big number; the only purpose of doing that is to turn it into an outlier, because in most cases dropping the missing values causes loss of data.
Before I dive into the question itself, I'll give a brief explanation of the data set and the problem.
The Data set
I have a data set of roughly 20000 records and I intend to use it to train a classifier which classifies a given record as 'Positive' or 'Negative'. The data set is also pretty imbalanced, with a 5:1 ratio favoring the 'Positive' side.
One of the features in the data set, 'Price', contains a monetary value (thus is > 0) and has a few missing values (about 200). When I analyzed the data set, all the rows which had NaN for 'Price' were classified as 'Negative'.
The Problem
What would be the best strategy to impute this column? I came up with the following options:
I could drop these rows, but since all of them are from the 'Negative' class, that doesn't seem viable.
Impute it with an extreme value such as -1000.00, since it is a monetary value. While that may work in this situation, it would not work if the feature could also take negative values, and I wish to learn a more generic approach to the problem.
Impute it as normal with a strategy such as 'mean' or 'nearest neighbour', which could still affect performance, as the majority of the classes are 'Positive'.
I could add a new column called 'wasCompleted' which has a value of 1 if there was a value for the 'Price' feature and 0 otherwise, and still go with an option like (2) or (3), which would still not solve the issues within those strategies.
Considering this scenario, what would be the best option for imputing these values?
There is at least one more option to consider:
Leave it as it is, and use an ML method that can handle missing values natively, which is often much better than any kind of imputation or creation of additional features. Such a method is, e.g., LightGBM.
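A minimal sketch of that option, with made-up data standing in for the question's 20000 records: LightGBM's sklearn wrapper accepts NaN in the feature matrix directly.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(100) < 0.1, 0] = np.nan   # inject missing 'Price'-like values
y = rng.integers(0, 2, size=100)

# No imputation step: LightGBM learns a default direction for NaNs
# at each tree split, so the missingness itself becomes a signal.
clf = lgb.LGBMClassifier()
clf.fit(X, y)
print(clf.predict(X[:5]))
```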