Why replace missing values with -99999 in a classification dataset?

I'm working on a classification problem where missing values are denoted by '?'. Why replace them with -99999, like this?
df.replace('?', -99999, inplace=True)

This depends on what your eventual use of this data will be. Having strings in a numerical column is not great from a data-cleanliness perspective, so replacing '?' with np.nan (a float type) or the newer pd.NA is probably the best idea from a data-representation standpoint. Most models cannot make use of those values, but some can, e.g. xgboost. For models that cannot handle missing values (or when you don't want the model to handle them internally), you need to decide on the best way to impute.
Imputing with values outside the real data range, like -99999, is largely fine for tree models: they don't care about scale, so you're really just telling them that the missing values are less than everything else. In parametric models like logistic regression, though, this will badly distort the parameter estimates, and I'd strongly advise against it. Adding missingness indicators helps, but I'd still suspect numerical issues with such large imputation values, so mean/median or model-based imputation would be better.
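For illustration, here's a minimal sketch of the np.nan route with a missingness indicator and median imputation (the toy frame and column names are made up):

    import numpy as np
    import pandas as pd

    # Toy frame using '?' as the missing-value marker
    df = pd.DataFrame({"age": ["34", "?", "51"], "label": [0, 1, 0]})

    # Replace the placeholder with a real missing value, then cast to numeric
    df["age"] = pd.to_numeric(df["age"].replace("?", np.nan))

    # Keep the missingness signal as an explicit indicator column
    df["age_missing"] = df["age"].isna().astype(int)

    # Median imputation: safe for parametric models, unlike -99999
    df["age"] = df["age"].fillna(df["age"].median())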

We can replace missing values with any very large number; the only purpose of doing so is to turn them into an outlier the model can separate out, because in most cases dropping the missing values causes too much loss of data.

Related

Predicting house price: is it okay to use a constant (int) to indicate "unknown"?

I have a dataset and am trying to predict house prices. Several variables (#bedrooms, #bathrooms, area, ...) use the constants 0 or -1 to indicate "not known". Is this good practice?
Dropping these values would result in the loss of too much data. Interpolation does not seem like a good option, especially since there are cases where multiple of these values are unknown and they are relatively highly correlated with each other.
Taking the column mean to substitute for these values would not work, seeing as all houses are fundamentally different.
Does anyone have advice on this?
This totally depends on which ML algorithm you want to use. Some can handle null values for missing data and others can't.
Usually interpolating/predicting these missing values is a reasonable idea. You could run one algorithm first to predict the missing values from your available data, and then run a second algorithm for the house-price prediction.
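As a sketch of that two-step idea, scikit-learn's IterativeImputer predicts each missing feature from the others, which suits correlated columns like bedrooms/bathrooms/area (the frame below is made up, with -1 standing in for "not known"):

    import numpy as np
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    # Made-up data; -1 marks "not known" as in the question
    X = pd.DataFrame({
        "bedrooms":  [3, -1, 4, 2, 5],
        "bathrooms": [2, 1, -1, 1, 3],
        "area":      [120, 80, 150, -1, 200],
    })

    # Turn the sentinel into real NaNs, then impute each column from the others
    X = X.replace(-1, np.nan)
    X_imputed = pd.DataFrame(
        IterativeImputer(random_state=0).fit_transform(X),
        columns=X.columns,
    )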

Best practices for using categorical variables in H2O?

I'm trying to use H2O's Random Forest for a multinomial classification into 71 classes with a 38,000-example training set. I have one feature that is a string and in many cases is predictive, so I want to use it as a categorical feature.
The hitch is that even after canonicalizing the strings (uppercasing, stripping out numbers, punctuation, etc.), I still have 7,000 different strings (some due to spelling or OCR errors, etc.). I have code to remove strings that are relatively rare, but I'm not sure what a reasonable cut-off value is. (I can't seem to find any help in the documentation.)
I'm also not sure what to do with the nbins_cats hyperparameter. Should I make it equal to the number of distinct categorical values I have? [Added: the default for nbins_cats is 1024 and I'm well below that at around 300 distinct categorical values, so I guess I don't have to do anything with this parameter.]
I'm also thinking that if a categorical value is associated with too many of the categories I'm trying to predict, maybe I should drop it as well.
I'm also guessing I need to increase the tree depth to handle this better.
Also, is there a special value to indicate "don't know" for the strings that I am filtering out? (I'm mapping it to a unique string but I'm wondering if there is a better value that indicates to H2O that the categorical value is unknown.)
Many thanks in advance.
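On the cut-off question, a common heuristic is to lump levels seen fewer than some minimum number of times into a single bucket, which can also double as the "don't know" level. A hypothetical pandas sketch (the threshold is arbitrary and worth tuning on validation data):

    import pandas as pd

    def collapse_rare(s: pd.Series, min_count: int = 20, other: str = "__RARE__") -> pd.Series:
        # Replace levels whose frequency is below min_count with one shared bucket
        counts = s.value_counts()
        rare = counts[counts < min_count].index
        return s.where(~s.isin(rare), other)

    names = pd.Series(["acme", "acme", "acme co", "typo-acme", "beta", "beta"])
    print(collapse_rare(names, min_count=2).tolist())
    # ['acme', 'acme', '__RARE__', '__RARE__', 'beta', 'beta']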
High-cardinality categorical predictors can sometimes hurt model performance; specifically, in the case of tree-based models, the tree ensemble (GBM or Random Forest) ends up memorizing the training data and then generalizes poorly to validation data.
A good indication of whether this is happening is if your string/categorical column has very high variable importance. This means that the trees are continuing to split on this column to memorize the training data. Another indication is if you see much smaller error on your training data than on your validation data. This means the trees are overfitting to the training data.
Some methods for handling high cardinality predictors are:
removing the predictor from the model
performing categorical encoding
performing grid search on nbins_cats and categorical_encoding
There is a Python example in the H2O tutorials GitHub repo that showcases the effects of removing the predictor from the model and of performing grid search.
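For concreteness, a hedged sketch of such a grid search with the H2O Python API (the file, frame, and column names are hypothetical; the categorical_encoding values are among H2O's documented options):

    import h2o
    from h2o.estimators import H2ORandomForestEstimator
    from h2o.grid.grid_search import H2OGridSearch

    h2o.init()
    # Hypothetical frame with a high-cardinality string column "name_str"
    train = h2o.import_file("train.csv")
    train["name_str"] = train["name_str"].asfactor()
    train["target"] = train["target"].asfactor()

    hyper_params = {
        "nbins_cats": [64, 256, 1024],
        "categorical_encoding": ["enum", "eigen", "sort_by_response"],
    }
    grid = H2OGridSearch(
        model=H2ORandomForestEstimator(ntrees=100, seed=42, nfolds=5),
        hyper_params=hyper_params,
    )
    grid.train(x=["name_str"], y="target", training_frame=train)
    print(grid.get_grid(sort_by="logloss", decreasing=False))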

Should 'deceptive' training cases be given to a Naive Bayes Classifier?

I am setting up a Naive Bayes Classifier to try to determine sameness between two records based on five string properties. I am only comparing each pair of properties for exact equality (i.e., with a Java .equals() method). I have some training data, both TRUE and FALSE cases, but let's just focus on the TRUE cases for now.
Let's say there are some TRUE training cases where all five properties are different. That means every comparator fails, but the records are actually determined to be the 'same' after some human assessment.
Should this training case be fed to the Naive Bayes Classifier? On the one hand, considering that NBC treats each variable independently, these cases shouldn't totally break it. On the other hand, it certainly seems true that feeding in enough of these cases wouldn't be beneficial to the classifier's performance. I understand that seeing a lot of these cases means better comparators are required, but I'm wondering what to do in the meantime. Another consideration is that the flip side is impossible: there's no way all five properties could be the same between two records and the records still be 'different'.
Is this a preferential issue, or is there a definitive accepted practice for handling this?
Usually you will want a training data set that is as representative as feasible of the domain from which you hope to classify observations (though this is often difficult). An unrepresentative set may lead to a poorly functioning classifier, particularly in a production environment where varied data are received. That being said, preprocessing may be used to limit the exposure of a classifier trained on a particular subset of data, so it is quite dependent on the purpose of the classifier.
I'm not sure why you wish to exclude some elements though. Parameter estimation/learning should account for the fact that two different inputs may map to the same output --- that is why you would use machine learning instead of simply using a hashmap. Considering that you usually don't have 'all data' to build your model, you have to rely on this type of inference.
Have you had a look at NLTK? It is in Python, but it seems that OpenNLP may be a suitable substitute in Java. You can employ better feature-extraction techniques that lead to a model that accounts for minor variations in input strings.
Lastly, it seems to me that you want to learn a mapping from input strings to the classes 'same' and 'not same', i.e. that you want to infer a distance measure (just checking). It would make more sense to invest effort in directly finding a better measure (e.g. for character-transposition issues you could use edit distances). I'm not sure that NB is well suited to your problem, as it attempts to determine a class given an observation (or its features). This class would have to be discernible over many different strings (I'm assuming you are going to concatenate string1 and string2 and offer them to the classifier). Will there be enough structure present to derive such a widely applicable property? This classifier is basically going to need to handle all pairwise 'comparisons', unless you build an NB for each one-vs-many pairing, which does not seem like a simple approach.
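To illustrate the "better measure" point: instead of a binary .equals() per property, a graded string similarity gives the model (or a plain threshold) much more evidence to work with. A minimal sketch using difflib's SequenceMatcher from the Python standard library as a stand-in for an edit-distance-style measure (the records and property names are invented):

    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        # Normalized similarity in [0, 1]; 1.0 means identical strings
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Invented records with five string properties
    rec1 = {"name": "Jon Smith", "city": "NYC", "zip": "10001",
            "phone": "555-0100", "email": "jon@x.com"}
    rec2 = {"name": "John Smith", "city": "NYC", "zip": "10001",
            "phone": "555-0100", "email": "jon@x.com"}

    # Graded per-property features instead of hard equals() booleans
    features = [similarity(rec1[k], rec2[k]) for k in rec1]
    print(features)  # name scores ~0.95 rather than a flat "not equal"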

Is a decision tree with a perfect attribute considered overfitting?

I have a 6-dimensional training dataset where there is a perfect numeric attribute which separates all the training examples this way: if TIME < 200 the example belongs to class1; if TIME >= 200 it belongs to class2. J48 creates a tree with only one level and this attribute as the only node.
However, the test dataset does not follow this hypothesis and all the examples are misclassified. I'm having trouble figuring out whether this case is considered overfitting or not. I would say it is not, as the dataset is that simple, but as far as I understand the definition of overfitting, it implies a close fit to the training data, and that is what I have. Any help?
Usually a great training score and a bad testing score means overfitting. But this assumes the data are IID, and you are clearly violating that assumption: your training data is completely different from the testing data (there is a clear rule in the training data which has no meaning for the testing data). In other words, your train/test split is incorrect, or your whole problem does not follow the basic assumptions of statistical ML. Of course, we often fit models without valid assumptions about the data; in your case, the most natural approach is to drop the feature that violates the assumption the most: the one used to construct the node. This kind of "expert decision" should be made prior to building any classifier. You have to think about what is different in the test scenario compared to the training one and remove the things that show this difference; otherwise you have a heavy skew in your data collection, and statistical methods will fail.
Yes, it is overfitting. The first rule in creating a training set is to make it look as much like any other set as possible. Your training set is clearly different from any other: it has the answer embedded within it, while your test set doesn't. Any learning algorithm will likely find the correlation to the answer and use it and, just like the J48 algorithm, will regard the other variables as noise. It is the software equivalent of Clever Hans.
You can overcome this either by removing the variable or by training on a set drawn randomly from the entire available set. However, since you know that there is a subset with an embedded major hint, you should remove the hint.
You're lucky: at times these hints can be quite subtle, and you won't discover them until you start applying the model to future data.
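A toy sketch of the "embedded hint" effect (all data below is synthetic): a feature that perfectly separates the training set but is meaningless at test time yields a one-split tree with perfect training accuracy and chance-level test accuracy.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    n = 500

    # Training set: TIME < 200 exactly when the class is 0 (a perfect but leaky rule)
    y_train = rng.integers(0, 2, n)
    time_train = np.where(y_train == 0, rng.uniform(0, 200, n), rng.uniform(200, 400, n))
    X_train = np.column_stack([time_train, rng.normal(size=(n, 5))])

    # Test set: TIME carries no information about the class
    y_test = rng.integers(0, 2, n)
    X_test = np.column_stack([rng.uniform(0, 400, n), rng.normal(size=(n, 5))])

    tree = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
    print("train accuracy:", tree.score(X_train, y_train))  # 1.0
    print("test accuracy:", tree.score(X_test, y_test))     # around 0.5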

Data Vectorization

I've met a tricky issue when trying to vectorize a feature. The feature looks like this:
most of it is numeric, like 0, 1, 33.3, 100, etc.
some of it is empty, which represents "not provided".
some of it is "auto", which means it adapts to the context.
Now my question is: how do I encode this feature into vectors effectively? One thing I could do is treat all the numeric values as categorical too, but that would result in an explosion of the feature space, and it is also bad at representing similar data points. What should I do?
Thanks!
--- THE ALGORITHM/MODEL I'M USING ---
It's an LSTM (Long Short-Term Memory) neural network. Currently I'm going with the following approach. Say I have 2 data points:
col1
entry1: 1.0
entry2: auto
It'll be encoded into:
col1-a col1-b
entry1: 1.0 0
entry2: dummy 1
So col1-b will represent whether it's auto or not. The dummy number will be the median of all the numeric data. Will this work?
Also, each numeric value has a unit associated with it, so there's another column with values like 'px' and 'pt'. Does the numeric value still have meaning if I extract the unit into another column? The two have their actual meaning when combined (numeric + unit), but can the NN notice that if they sit in different dimensions?
That depends on what type of algorithm you will be using. If you want to use something like association-rule classification, then you will have to treat all of your variables as categorical data. If you want to use logistic regression, then that isn't needed. You'd have to provide more details to get a better answer.
edit
I made some edits after reading your edit.
It sounds like what you have is at least reasonable. I've read books where people use the mean/median/mode to fill in missing values for numeric data. As for which specific one works best for you, I don't know; can you try training your classifier with each version?
As for your issue with the "auto" column, it sounds like you want to do something similar to running a regression with categorical data. I don't have much experience with neural networks, but I know that if you were to use something like logistic regression then this is the approach you would want to use. Hopefully this gives you an idea of what you have to research.
As far as treating all of your numerical data as categorical data, you can do that as well, but you have to normalize it first. You can do something like min-max normalization and then just take the integer part of the number. Now your data will be the same as categorical data.
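A minimal pandas sketch of the value-plus-flag encoding described above, extended with the unit column as one-hot categories (the column names and data are invented):

    import pandas as pd

    # Invented raw feature: numbers, empty strings, and "auto", plus a unit column
    raw = pd.DataFrame({"col1": ["1.0", "auto", "", "33.3"],
                        "unit": ["px", "px", "pt", "pt"]})

    # Numeric part: non-numeric entries become NaN
    vals = pd.to_numeric(raw["col1"], errors="coerce")

    out = pd.DataFrame({
        "col1_value":   vals.fillna(vals.median()),        # dummy = median, as proposed
        "col1_is_auto": (raw["col1"] == "auto").astype(int),
        "col1_missing": (raw["col1"] == "").astype(int),   # keep "not provided" distinct from "auto"
    })

    # One-hot encode the unit so the network can learn value + unit jointly
    out = pd.concat([out, pd.get_dummies(raw["unit"], prefix="unit")], axis=1)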
