I am currently working on a project for my MSc and I am having an issue with my dataset. I don't have previous experience in machine learning; this is my first exposure to it.
While doing my EDA (Exploratory Data Analysis) I found a categorical feature with missing data, Province_State. This column has 52360 missing values, which is 5.40% of the rows. I guess that is not too bad, and according to what I have learnt, I should either impute these missing values or delete the column if I have a sound justification.
My reasoning is that not every country has provinces, so it is perfectly normal for there to be missing values. I don't see the point of imputing them with an arbitrary value: that is not logical, and it would introduce inaccuracy into the model, because we would be inventing a value that does not actually exist for that particular country.
I think I should do one of the following:
Impute all the missing values to a constant value such as -1 or "NotApplicable"
Remove the feature from the dataset
Please help me with a solution and thank you very much in advance.
(This dataset can be accessed from this link)
There are many ways to handle missing data. Deleting the whole column is not a good idea in most cases, as you would be discarding information. However, if you still want to delete the feature, perform univariate analysis on it first, see whether it is useful, and decide accordingly.
Instead of removing the feature you can use any of the following ways:
Impute missing values with Mean/Median.
Predict missing values.
Impute all the missing values to a constant such as -1 or "NotApplicable" (see the sketch after this list).
Use algorithms that support missing values.
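For the Province_State case above, a minimal sketch of the constant-fill option might look like this, assuming the data is already in a pandas DataFrame named df; the file path and the has_province helper column are made up for illustration:

```python
import pandas as pd

# Hypothetical load; replace the path with your own file.
df = pd.read_csv("covid_data.csv")

# Flag whether the country reports provinces at all, before filling.
df["has_province"] = df["Province_State"].notna().astype(int)

# Treat "no province" as its own category rather than a truly missing value.
df["Province_State"] = df["Province_State"].fillna("NotApplicable")
```

This keeps the information that a country has no provinces instead of throwing the column away.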
Related
I have a dataset and am trying to predict house prices. Several variables (#bedrooms, #bathrooms, area, ...) use the constants 0 or -1 to indicate "not known". Is this good practice?
Dropping these values would result in the loss of too much data. Interpolation does not seem like a good option, especially since there are cases where multiple of these values are unknown and they are relatively highly correlated to each other.
Substituting these values with the mean of the column would not work either, seeing as all houses are fundamentally different.
Does anyone have advice on this?
This totally depends on which ML algorithm you want to use. Some can handle null values for missing data and others can't.
Usually interpolating/predicting these missing values is a reasonable idea. You could run another algorithm first to predict the missing values based on your available data and then run a second algorithm for the housing price prediction.
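As a rough illustration of the "predict the missing values first" idea, here is a minimal sketch using scikit-learn's IterativeImputer, which regresses each incomplete feature on the others; the column names and toy values are invented, and the 0/-1 sentinels are converted to NaN first:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy stand-in for the housing data; 0 and -1 mark "not known", as in the question.
df = pd.DataFrame({
    "bedrooms":  [3, 2, -1, 4, 3],
    "bathrooms": [2, 0, 1, 3, 2],
    "area":      [120.0, 80.0, 95.0, -1.0, 110.0],
    "price":     [250_000, 180_000, 210_000, 320_000, 260_000],
})

features = ["bedrooms", "bathrooms", "area"]

# Turn the sentinel values into real missing values first.
df[features] = df[features].replace({0: np.nan, -1: np.nan})

# Each feature with missing entries is predicted from the other features.
df[features] = IterativeImputer(random_state=0).fit_transform(df[features])
```

Note that the target (price) is deliberately left out of the imputation to avoid leaking it into the features.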
I am working on customer segmentation based on customers' purchases across different types of product categories.
Below is a dummy representation of my data (the values are the percentage of the total revenue per category that the customer purchased):
Image Link
As seen in the image linked above, this dummy data has only a few 0s, but the original data has many. As a result, using this data for k-means clustering does not produce any acceptable insights and skews the data towards the left.
Dropping the rows or averaging over the missing data would be misleading. :/
How to deal with missing values is your choice, and it will of course impact your clustering. There is no single "correct" way.
A few popular ways:
Fill each column's missing values with the mean/median of that feature.
Bootstrapping: select a random row and copy its values to fill the missing ones.
Nearest neighbor: find the closest neighbor and fill the missing values from it (see the sketch at the end of this answer).
Without seeing your full data and knowing what you're trying to do with the clustering, it's a bit hard to help. It depends on the case...
You can always do some feature extraction (e.g. PCA), maybe it will give some better insights
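As a minimal sketch of the nearest-neighbour option, assuming (and this is a judgement call) that a 0 really means "unknown" rather than "did not purchase", scikit-learn's KNNImputer could be used; the category names and values below are invented:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Invented stand-in for the percentage-of-revenue matrix.
df = pd.DataFrame({
    "electronics": [40.0, 0.0, 55.0, 10.0],
    "groceries":   [35.0, 80.0, 0.0, 60.0],
    "clothing":    [25.0, 20.0, 45.0, 30.0],
})

# Only do this if a 0 really means "unknown"; otherwise keep the zeros.
X = df.replace(0.0, np.nan)

# Each missing entry is filled with the average of the k most similar customers.
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
```

If the zeros genuinely mean "no purchase in that category", they are not missing values at all, and a different representation (or a clustering approach less sensitive to sparsity) may be the better fix.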
Before I dive into the question itself, I'll give a brief explanation of the data set and the problem.
The Data set
I have a data set of roughly 20000 records and I intend to use it to train a classifier which classifies a given record as 'Positive' or 'Negative'. The data set is also pretty imbalanced, with a 5:1 ratio favoring the 'Positive' side.
One of the features in the data set, 'Price', contains a monetary value (and is therefore positive) and has a few missing values (about 200). When I analyzed the data set, all the rows which had NaN for 'Price' were classified as 'Negative'.
The Problem
What would be the best strategy to impute this column? I came up with the following options:
I could drop these rows, but since all of them are from the 'Negative' class, that doesn't seem viable.
Impute it with an extreme value such as -1000.00, since it is a monetary value. While that may work in this situation, it would not work if the feature could also take negative values, and I wish to learn a more generic approach to the problem.
Impute it as normal with a strategy such as 'mean' or 'nearest neighbour', which could still affect performance, as the majority of the classes are 'Positive'.
I could add a new column called 'wasCompleted' which is 1 if there was a value for the 'Price' feature and 0 if there wasn't, and still go with an option like (2) or (3). That would still not solve the issues with those strategies.
Considering this scenario what would be the best option to consider to impute these values?
There is at least one more option to consider:
Leave it as it is, and use an ML method that can handle missing values natively; this is often much better than any kind of imputation or creation of additional features. One such method is LightGBM.
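A minimal sketch of that approach, combined with the asker's own 'wasCompleted' idea, might look like the following; the toy rows, column names and hyper-parameters are invented for illustration:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Toy stand-in for the ~20000-row dataset; 'Price' keeps its NaNs on purpose.
df = pd.DataFrame({
    "Price":         [120.5, np.nan, 89.9, 250.0, np.nan, 42.0],
    "other_feature": [1.2, 0.4, 3.1, 0.9, 2.2, 1.7],
    "label":         [1, 0, 1, 1, 0, 1],   # 1 = Positive, 0 = Negative
})

# Optional indicator column (option 4 in the question), kept alongside the raw price.
df["wasCompleted"] = df["Price"].notna().astype(int)

X = df[["Price", "other_feature", "wasCompleted"]]
y = df["label"]

# LightGBM learns a default direction for NaN at each split, so no imputation
# is needed; is_unbalance helps with the 5:1 class ratio.
model = lgb.LGBMClassifier(is_unbalance=True, n_estimators=50, min_child_samples=2)
model.fit(X, y)
```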
I'm trying to build a classifier model using some classification techniques. Beginning with the C4.5 technique, I faced the problem of missing values, so:
How do I deal with missing values in a data set?
Should I keep the "?" placeholder for the missing attribute values?
There are several ways of dealing with missing values:
Get missing data: If possible, try to acquire missing values.
Discard missing data: Reduce the data available to a dataset having no missing values by discarding all instances with missing values or features.
Imputation: A better strategy is to impute the missing values, i.e., to infer them from the known part of the data. A common approach is to use the mean, the median or the most frequent value of the row or column in which the missing values are located (a minimal sketch follows below). It is recommended to use multiple imputation.
This might help: http://jmlr.csail.mit.edu/papers/volume8/saar-tsechansky07a/saar-tsechansky07a.pdf
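As a minimal sketch of the imputation option above (the tiny table and the "?" placeholder are invented, mimicking typical C4.5/Weka-style data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny invented example where "?" marks a missing value.
df = pd.DataFrame({
    "outlook":  ["sunny", "?", "rain", "overcast"],
    "humidity": [70, 85, np.nan, 65],
})

# Turn the "?" placeholder into a real missing value first.
df = df.replace("?", np.nan)

# 'most_frequent' works for both categorical and numeric columns;
# 'mean' or 'median' are alternatives for numeric ones.
imputer = SimpleImputer(strategy="most_frequent")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```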
I'm trying to find ways to normalize my dataset (represented as a matrix with documents as rows and columns as features) and I came across a technique called feature scaling. I found a Wikipedia article on it here.
One of the methods listed is Standardization which says "Feature standardization makes the values of each feature in the data have zero-mean and unit-variance." What does that mean (no pun intended)?
In this method, "we subtract the mean from each feature. Then we divide the values (mean is already subtracted) of each feature by its standard deviation." When they say 'subtract the mean', is it the mean of the entire matrix or the mean of the column pertaining to that feature?
Also, if this feature scaling method is applied, does the mean not have to be subtracted from columns when performing Principal Component Analysis (PCA) on the data?
The basic idea is to do a simple (and reversible) transformation on your dataset to make it easier to handle. You are subtracting a constant from each column and then dividing each column by a (different) constant. Those constants are column-specific.
When they say 'subtract the mean', is it the mean of the entire matrix or the mean of the column pertaining to that feature?
The mean of the column pertaining to that feature.
...does the mean not have to be subtracted from columns when performing Principal Component Analysis (PCA) on the data?
Correct. PCA requires data with a mean of zero. Usually this is enforced by subtracting the mean as a first step, so if the mean has already been subtracted, that step is not required. However, there is no harm in performing the "subtract the mean" operation twice: the second time the mean is already zero, so nothing will change. Formally, we might say that standardization is idempotent.
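A small numpy sketch of the column-wise version, and of why repeating the centering step is harmless (the random data is invented):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))   # rows = documents, columns = features

# Standardize column-wise: subtract each column's mean, divide by its std.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))   # approximately 0 for every feature
print(X_std.std(axis=0))    # 1 for every feature

# Centering an already-centred matrix changes nothing, so a PCA routine that
# subtracts the column means again does no harm.
X_again = X_std - X_std.mean(axis=0)
print(np.allclose(X_std, X_again))  # True
```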
From looking at the article, my understanding is that you would subtract the mean of that feature. This will give you a set of data for the feature that describes the same layout of the data but normalized.
Imagine you added data for a new feature. You're probably going to want the data for your original features to remain the same, and not be influenced by the new feature.
I guess you would still get a "standardized" range of values if you subtracted the mean of the whole data set, but that would be something different - you're probably more interested in how the data of a single feature lies around its mean.
You could also have a look (or ask the question) on math.stackexchange.com.