Imputation of missing numeric values while preserving the fact of their absence

Before I dive into the question itself, I'll give a brief explanation of the data set and the problem.
The Data set
I have a data set of roughly 20000 records and I intend to use it to train a classifier which classifies a given record as 'Positive' or 'Negative'. The data set is also pretty imbalanced, with a 5:1 ratio favoring the 'Positive' side.
One of the features within the data set, called 'Price', contains a monetary value (and is thus > 0) and has a few missing values (about 200). When I analyzed the data set, all the rows which had NaN for 'Price' were classified as 'Negative'.
The Problem
What would be the best strategy to impute this column? I came up with the following options:
1. I could drop these rows, but since all of them are from the 'Negative' class, that doesn't seem viable.
2. Impute it with an extreme value such as -1000.00, since it is a monetary value. While this may work in this situation, it would not work if the feature could also take negative values, and I wish to learn a more generic approach to the problem.
3. Impute it as normal with a strategy such as 'mean' or 'nearest neighbour', which could still affect performance since the majority of the classes are 'Positive'.
4. I could add a new column called 'wasCompleted', with a value of 1 if there was a value for the 'Price' feature and 0 if there wasn't, and still go with an option like (2) or (3), which would still not solve the issues within those strategies.
Considering this scenario, what would be the best option for imputing these values?

There is at least one more option to consider:
Leave it as it is, and use an ML method which can handle missing values much better than any kind of imputation or creation of additional features. Such a method is, for example, LightGBM.
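As a minimal, illustrative sketch of that route (the file name, the 'label' column, and the Positive/Negative encoding below are assumptions, not details from the question), LightGBM can be trained on the data while 'Price' still contains NaN:

```python
# Minimal sketch: LightGBM consumes NaN directly, so 'Price' is left unimputed.
# 'records.csv', the 'label' column, and the class encoding are placeholders.
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("records.csv")                      # hypothetical ~20000-row data set
X = df.drop(columns=["label"])                       # 'Price' keeps its NaN entries
y = df["label"].map({"Positive": 1, "Negative": 0})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# use_missing is LightGBM's switch for native missing-value handling (on by default);
# is_unbalance compensates for the 5:1 class ratio mentioned in the question.
clf = lgb.LGBMClassifier(use_missing=True, is_unbalance=True)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```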

Related

Machine learning with handling features which are supposed to have missing data

I am currently working on a project for my MSc and I am having this issue with the dataset. I don't have previous experience in machine learning and this is my first exposure.
In my dataset I started doing my EDA (Exploratory Data Analysis) and I have a categorical feature with missing data, which is Province_State. This column has 52360 missing values, which as a percentage is 5.40%. I guess that is not too bad, and according to what I have learnt, I should impute these missing values or delete the column if I have a reasonable justification.
My logical reasoning is that not every country has provinces, so it is pretty normal that there are missing values. I clearly don't see a point in imputing these missing values with a random value, because that is not logical and it will also lead to inaccuracy in the model: we cannot come up with a value which does not practically exist for that particular country.
I think I should do one of the following:
Impute all the missing values to a constant value such as -1 or "NotApplicable"
Remove the feature from the dataset
Please help me with a solution and thank you very much in advance.
There are many ways to handle missing data. Deleting the whole column is not a good idea in most cases, as you will be discarding information. However, if you still want to delete the feature, perform univariate analysis on that feature, see whether it is useful, and decide accordingly.
Instead of removing the feature, you can use any of the following (a small sketch of the constant-value option follows the list):
Impute missing values with Mean/Median.
Predict missing values.
Impute all the missing values to -1.
Use algorithms that support missing values.
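For the constant-value option, a pandas one-liner is usually enough. Here is a small sketch (the file name is a placeholder; only the Province_State column name comes from the question):

```python
# Sketch: treat "no province" as its own category instead of a missing value.
import pandas as pd

df = pd.read_csv("covid_data.csv")   # hypothetical path to the dataset

# Countries without provinces get an explicit "NotApplicable" level rather than NaN,
# so the model can learn that the value is absent for a structural reason.
df["Province_State"] = df["Province_State"].fillna("NotApplicable")

print(df["Province_State"].value_counts().head())
```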

Predicting house prices: is it okay to use a constant (int) to indicate "unknown"

I have a dataset and am trying to predict house prices. Several variables (#bedrooms, #bathrooms, area, ...) use the constants 0 or -1 to indicate "not known". Is this good practice?
Dropping these values would result in the loss of too much data. Interpolation does not seem like a good option, especially since there are cases where multiple of these values are unknown and they are relatively highly correlated to each other.
Taking the mean of the column to substitute these values with would not work seeing as all houses are fundamentally different.
Does anyone have advice on this?
Totally depends on what ML algorithm you want to use. Some can handle null values for missing data and others can't.
Usually interpolating/predicting these missing values is a reasonable idea. You could run another algorithm first to predict the missing values based on your available data and then run a second algorithm for the housing price prediction.
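One way to sketch that two-step idea is scikit-learn's (experimental) IterativeImputer, which models each incomplete feature from the other features. The column names and the sentinel handling below are assumptions, not details from the question:

```python
# Sketch: turn the 0 / -1 "unknown" sentinels back into NaN, then predict them
# from the other (correlated) features instead of substituting a column mean.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.read_csv("houses.csv")                    # hypothetical dataset
features = ["bedrooms", "bathrooms", "area"]      # assumed column names

df[features] = df[features].replace({0: np.nan, -1: np.nan})

imputer = IterativeImputer(random_state=0)        # regresses each feature on the rest
df[features] = imputer.fit_transform(df[features])
```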

Why doesn't ID3 algorithm work on the UCI Mushroom dataset in Weka?

I can’t seem to apply the ID3 classification algorithm to Mushroom.arff dataset. This dataset consists of nominal attributes only. I think I need to preprocess this in order for it to work, but I don’t know how. How do I proceed?
The ID3 algorithm is an unpruned decision tree generation algorithm with the following properties:
It can only deal with nominal attributes.
It fails to handle missing values.
Empty leaves may result in unclassified instances.
The Mushroom dataset consists of 22 nominal attributes and satisfies the first condition, however upon inspection you’ll find the attribute 'stalk-root' has 2480 (31%) missing values. This is the reason it is unselectable in Weka by default when you try to classify.
In order to fix this, you may proceed with these two solutions.
You may remove the attribute.
Open the .arff file, select the stalk-root attribute in the Attributes tab and click Remove.
You’ll now see that ID3 is available. I was able to get an F-score of 1.0.
You may use techniques to handle missing values.
In situations where you do not want to lose out on information (in this case, the “stalk-root” attribute), you may proceed with these techniques; a small sketch of the second one follows the list:
Use a measure of central tendency for the attribute such as mean, median to replace the empty values.
Use the attribute mean or median for all samples belonging to the same class as the given tuple.
Use the most probable value to fill in the missing value, determined with inference-based tools such as a Bayesian formalism.
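Weka has its own preprocessing filters for this, but to make the class-conditional technique concrete, here is a small pandas sketch with toy values standing in for the mushroom data:

```python
# Sketch: fill each missing 'stalk-root' with the most frequent value within its own class.
import pandas as pd

df = pd.DataFrame({
    "class":      ["e", "e", "p", "p", "p"],
    "stalk-root": ["b", None, "c", None, "c"],
})

most_frequent = lambda s: s.fillna(s.mode().iloc[0])   # per-class mode
df["stalk-root"] = df.groupby("class")["stalk-root"].transform(most_frequent)
print(df)
```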

A binary classification dataset with an 'age' feature, some of whose values are missing

This classification problem has 300000 tuples and 20 features. I want to use the SVM algorithm to solve this problem. The 'age' feature is between 1 and 100, but this feature is missing and blank for some tuples. How should I solve this?
This of course depends on the distribution of your missing variable, but I would try imputation: fill in the blanks using the mean age value and see what kind of results you get. One step further would be to create a model predicting age given the other input variables and use that for imputation.
You might also add a variable indicating that a given row has some imputed values - this in some cases yields better training results, as you give your algorithm more information.
In addition to simple imputation by the mean, as already mentioned by #dratewka, I would suggest trying:
Imputing the feature using classic imputation mechanisms, e.g. K-nearest-neighbour imputation. With this, for a sample S whose age is missing, the K samples that are nearest to S are used to derive a suitable value for imputing age (with the distance of the K neighbours to S measured over all other features); see the sketch after this list.
After performing the previous step, try your prediction both with age and leaving it out. If you see that your prediction performance is not influenced by age, disregarding this information altogether in the first place might be reasonable as well.
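As a small sketch of the K-nearest-neighbour idea, scikit-learn's KNNImputer does exactly this; the toy matrix below stands in for the real 300000 × 20 data:

```python
# Sketch: each missing age is replaced by the mean age of the K most similar rows,
# where similarity is computed from the features that are present.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [25.0,   1.2, 0.0],
    [np.nan, 1.1, 0.0],
    [40.0,   3.5, 1.0],
    [np.nan, 3.4, 1.0],
])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```

scikit-learn's imputers also accept add_indicator=True, which appends the kind of "was imputed" flag suggested in the first answer.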

What does it mean to have zero mean in the data?

I'm trying to find ways to normalize my dataset (represented as a matrix with documents as rows and columns as features) and I came across a technique called feature scaling. I found a Wikipedia article on it here.
One of the methods listed is Standardization which says "Feature standardization makes the values of each feature in the data have zero-mean and unit-variance." What does that mean (no pun intended)?
In this method, "we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation." When they say 'subtract the mean', is it the mean of the entire matrix or the mean of the column pertaining to that feature?
Also, if this feature scaling method is applied, does the mean not have to be subtracted from columns when performing Principal Component Analysis (PCA) on the data?
The basic idea is to do a simple (and reversible) transformation on your dataset to make it easier to handle. You are subtracting a constant from each column and then dividing each column by a (different) constant. Those constants are column-specific.
When they say 'subtract the mean', is it the mean of the entire matrix or the mean of the column pertaining to that feature?
The mean of the column pertaining to that feature.
...does the mean not have to be subtracted from columns when performing Principal Component Analysis (PCA) on the data?
Correct. PCA requires data with a mean of zero. Usually this is enforced by subtracting the mean as a first step. If the mean has already been subtracted, that step is not required. However, there is no harm in performing the "subtract the mean" operation twice, because the second time the mean is already zero, so nothing will change. Formally, we might say that standardization is idempotent.
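To make the column-wise operation and the idempotency point concrete, here is a small NumPy sketch on purely illustrative data:

```python
# Sketch: per-column standardization, plus a check that applying it twice changes nothing.
import numpy as np

X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(100, 4))

def standardize(A):
    # subtract each column's mean, then divide by each column's standard deviation
    return (A - A.mean(axis=0)) / A.std(axis=0)

Z = standardize(X)
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))   # ~0 and ~1 per column
print(np.allclose(standardize(Z), Z))                     # True: nothing changes
```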
From looking at the article, my understanding is that you would subtract the mean of that feature. This gives you data for the feature with the same overall layout, just normalized.
Imagine you added data for a new feature. You're probably going to want the data for your original features to remain the same, and not be influenced by the new feature.
I guess you would still get a "standardized" range of values if you subtracted the mean of the whole data set, but that would be something different - you're probably more interested in how the data of a single feature lies around its mean.
You could also have a look (or ask the question) on math.stackexchange.com.

Resources