I have missing values in my target variable (y). Since I want to train my model with more data, I don't want to drop the rows with missing values; instead, I'd like to use the KNN Imputer algorithm. But I'd also like to prevent data leakage, so the best way seems to be to split the data into "train" and "test" sets and then impute the missing target values in the train dataset (the same can be done for missing values in the test dataset).
However, I ran into an error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
As I understand it, the missing values (NaN) caused this error.
How to proceed when there are missing values in Target Variable?
Imputing the target variable is not advised (unless you are sure about the value), because the target values control how the learning algorithm learns. If you already knew the value of the target variable, there would be no need for an ML algorithm, right? Therefore, the best way to deal with a missing target is to delete that row. For other missing features, you can use imputation strategies.
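For completeness, here is a minimal sketch of that workflow with scikit-learn: drop rows with a missing target, then impute missing features, fitting the imputer on the train split only. The toy DataFrame and the "target" column name are made up for illustration:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.impute import KNNImputer

    # Toy data; "target" stands in for your y column
    df = pd.DataFrame({
        "f1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
        "f2": [10.0, np.nan, 30.0, 40.0, 50.0, 60.0],
        "target": [0, 1, 0, np.nan, 1, 0],
    })

    # Drop rows where the target itself is missing
    df = df.dropna(subset=["target"])
    X, y = df.drop(columns=["target"]), df["target"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42)

    # Fit the imputer on the training features only, then apply it to the
    # test features, so no information leaks from test into train
    imputer = KNNImputer(n_neighbors=2)
    X_train_imp = imputer.fit_transform(X_train)
    X_test_imp = imputer.transform(X_test)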
Related
I am currently working on a project for my MSc and I am having this issue with the dataset. I don't have previous experience in machine learning and this is my first exposure.
While doing my EDA (Exploratory Data Analysis) on the dataset, I found a categorical feature with missing data, Province_State. This column has 52360 missing values, which is 5.40% of the rows. I guess that is not too bad, and according to what I have learnt, I should impute these missing values or delete the column if I have reasonable justification.
My reasoning is that not every country has provinces, so it is pretty normal that there are missing values. I don't see a point in imputing these missing values with a random value, because that is not logical and it would also introduce inaccuracy into the model; we cannot come up with a value that does not practically exist for that particular country.
I think I should do one of the following:
Impute all the missing values to a constant value such as -1 or "NotApplicable"
Remove the feature from the dataset
Please help me with a solution and thank you very much in advance.
(This dataset can be accessed from this link)
There are many ways to handle missing data. Deleting the whole column is not a good idea in most cases, as you will be discarding information; however, if you still want to delete the feature, perform univariate analysis on it first, see if it's useful, and decide accordingly.
Instead of removing the feature, you can use any of the following approaches (a quick sketch of the constant-value option follows the list):
Impute missing values with Mean/Median.
Predict missing values.
Impute all the missing values to -1.
Use algorithms that support missing values.
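For the Province_State case specifically, a constant like "NotApplicable" keeps the "this country has no province" signal as an explicit category. A minimal pandas sketch; the column names follow the question, but the toy rows are made up:

    import numpy as np
    import pandas as pd

    # Toy rows standing in for the real dataset
    df = pd.DataFrame({
        "Country_Region": ["US", "France", "US", "Italy"],
        "Province_State": ["New York", np.nan, "Texas", np.nan],
    })

    # Keep the "no province" information as an explicit category
    df["Province_State"] = df["Province_State"].fillna("NotApplicable")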
I have a dataset and am trying to predict house prices. Several variables (#bedrooms, #bathrooms, area, ...) use the constants 0 or -1 to indicate "not known". Is this good practice?
Dropping these values would result in the loss of too much data. Interpolation does not seem like a good option, especially since there are cases where multiple of these values are unknown and they are relatively highly correlated to each other.
Substituting these values with the mean of the column would not work either, since all houses are fundamentally different.
Does anyone have advice on this?
It totally depends on which ML algorithm you want to use. Some can handle null values for missing data and others can't.
Usually interpolating/predicting these missing values is a reasonable idea. You could run one algorithm first to predict the missing values based on your available data, and then run a second algorithm for the housing price prediction.
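One way to do this in scikit-learn is IterativeImputer, which regresses each feature with missing values on the other features. A minimal sketch, assuming the 0/-1 sentinels are first converted to NaN; the toy matrix is made up:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    # Toy matrix: bedrooms, bathrooms, area; 0 and -1 are "not known" sentinels
    X = np.array([
        [3.0,  2.0, 120.0],
        [2.0,  1.0,  -1.0],
        [-1.0, 2.0, 150.0],
        [4.0,  0.0, 200.0],
    ])

    # Convert the sentinels to NaN so the imputer treats them as missing
    # (in real data you would do this per column, in case 0 is a valid value)
    X[np.isin(X, [0.0, -1.0])] = np.nan

    # Each feature with missing values is regressed on the other features
    imputer = IterativeImputer(random_state=0)
    X_imputed = imputer.fit_transform(X)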
I have a (probably stupid) question about predicting a new instance with a missing predictor(s).
I am given a dataset. Let's say I preprocess and clean the data, and as a result, say, 10 predictors are left. Then I train my model on the resulting data, so I am ready to use the model to predict.
Now, what should I do if I want to predict a new instance for which 1 or 2 predictors are missing?
There are at least two reasonable solutions.
(1) Average the output over the possible values of the missing variable or variables, conditional on the values of the non-missing variables. That is, compute a weighted average of prediction(missing, non-missing) over each possible value of missing, weighted by the probability of that value of missing given non-missing. This is essentially a variety of what's called "multiple imputation" in the literature.
The first thing to try is to just weight by the unconditional distribution of missing. If that seems too complicated, a very rough approximation is to substitute the mean value of missing into the prediction.
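Here is a rough sketch of both ideas: Monte Carlo averaging over the missing feature's marginal (unconditional) training distribution, and the mean-substitution shortcut. The model and data are toy stand-ins:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 3))
    y_train = X_train @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=200)
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

    def predict_with_missing(x, missing_idx, n_samples=200):
        """Average predictions over the missing feature, sampled from its
        marginal (unconditional) training distribution."""
        samples = rng.choice(X_train[:, missing_idx], size=n_samples)
        X_filled = np.tile(x, (n_samples, 1))
        X_filled[:, missing_idx] = samples
        return model.predict(X_filled).mean()

    x_new = np.array([0.5, np.nan, -0.2])
    print(predict_with_missing(x_new, missing_idx=1))

    # The very rough approximation: plug in the feature's training mean
    x_mean = x_new.copy()
    x_mean[1] = X_train[:, 1].mean()
    print(model.predict(x_mean.reshape(1, -1))[0])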
(2) Build a model for each combination of variables. If you have n variables, this means building 2^n models. If n = 10, 1024 models is not a big deal these days. Then, if you are missing some variables, just use the model for the ones that are present.
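A minimal sketch of the one-model-per-subset idea, using linear regression as a stand-in learner on made-up data:

    import numpy as np
    from itertools import combinations
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, 2.0, 3.0])

    # One model per non-empty subset of features: 2^n - 1 models in total
    n = X.shape[1]
    models = {
        subset: LinearRegression().fit(X[:, subset], y)
        for r in range(1, n + 1)
        for subset in combinations(range(n), r)
    }

    # At prediction time, pick the model trained on exactly the present features
    x_new = np.array([0.5, np.nan, -0.2])
    present = tuple(i for i in range(n) if not np.isnan(x_new[i]))
    y_hat = models[present].predict(x_new[np.array(present)].reshape(1, -1))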
By the way, you might get more interest in this question at stats.stackexchange.com.
I'm trying to build a classifier model using some classification techniques. Beginning with the C4.5 technique, I faced the problem of missing values, so:
How do I deal with missing values in a dataset?
Should I keep the "?" for the missing attribute values?
There are several ways of dealing with missing values:
Get missing data: If possible, try to acquire missing values.
Discard missing data: Reduce the data available to a dataset having no missing values by discarding all instances with missing values or features.
Imputation: A better strategy is to impute the missing values, i.e., to infer them from the known part of the data. A common approach is to use the mean, the median, or the most frequent value of the row or column in which the missing values are located (a short sketch follows below). It is recommended to use multiple imputation.
This might help: http://jmlr.csail.mit.edu/papers/volume8/saar-tsechansky07a/saar-tsechansky07a.pdf
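For the imputation option, a minimal scikit-learn sketch on made-up data (SimpleImputer also accepts "mean" and "most_frequent" as strategies):

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, np.nan]])

    # Replace each NaN with the median of its column
    imputer = SimpleImputer(strategy="median")
    print(imputer.fit_transform(X))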
What's the best way to handle missing feature attribute values with Weka's C4.5 (J48) decision tree? The problem of missing values occurs during both training and classification.
If values are missing from training instances, am I correct in assuming that I place a '?' value for the feature?
Suppose that I am able to successfully build the decision tree and then create my own tree code in C++ or Java from Weka's tree structure. During classification time, if I am trying to classify a new instance, what value do I put for features that have missing values? How would I descend the tree past a decision node for which I have an unknown value?
Would using Naive Bayes be better for handling missing values? I would just assign a very small non-zero probability for them, right?
From Pedro Domingos' ML course at the University of Washington:
Here are the three approaches Pedro suggests for a missing value of attribute A (a rough sketch of all three follows below):
Assign most common value of A among other examples sorted to node n
Assign most common value of A among other examples with same target value
Assign probability p_i to each possible value v_i of A; assign fraction p_i of the example to each descendant in the tree.
The slides and video are now viewable here.
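Here is a rough pandas sketch of the three strategies on toy data (the values are made up, and a real implementation would apply these per tree node):

    import numpy as np
    import pandas as pd

    # Toy examples sorted to node n; one example is missing attribute A
    df = pd.DataFrame({"A": ["red", "blue", np.nan, "red", "red"],
                       "y": [1, 0, 1, 1, 0]})

    # (1) Most common value of A among all examples at the node
    fill_1 = df["A"].mode()[0]                                # "red"

    # (2) Most common value of A among examples with the same target value
    t = df.loc[df["A"].isna(), "y"].iloc[0]                   # target of the missing example
    fill_2 = df.loc[df["y"] == t, "A"].mode()[0]              # "red"

    # (3) Probability p_i of each value v_i; the example is then split
    # fractionally among the tree's descendants with these weights
    p = df["A"].value_counts(normalize=True)                  # red: 0.75, blue: 0.25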
An alternative approach is to leave the missing value as "?" and not use it in the information gain calculation. No node should end up testing an unknown value during classification, because you ignored it during the information gain step. For classification, I believe you simply treat the missing value as unknown and do not drop the instance when it reaches a decision on that specific attribute.
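A minimal sketch of an information gain computation that simply skips rows where the attribute is missing; this is toy code, not Weka's actual J48 implementation (C4.5 itself distributes such instances fractionally):

    import numpy as np
    import pandas as pd

    def entropy(y):
        p = y.value_counts(normalize=True)
        return float(-(p * np.log2(p)).sum())

    def info_gain_ignoring_missing(df, attr, target):
        """Information gain for attr, computed only on rows where attr is known."""
        known = df.dropna(subset=[attr])
        child_entropy = sum(len(g) / len(known) * entropy(g[target])
                            for _, g in known.groupby(attr))
        return entropy(known[target]) - child_entropy

    df = pd.DataFrame({"A": ["red", "blue", np.nan, "red"],
                       "y": [1, 0, 1, 1]})
    print(info_gain_ignoring_missing(df, "A", "y"))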