I have 6 text features (say f1, f2, ..., f6) available in the data on which I trained a model. But when this model is deployed and a new data point arrives for which I have to make a prediction, it has only 2 features (f1 and f2). So there is a feature mismatch problem. How can I tackle it?
I have a few thoughts, but they are not very efficient.
Use only two features for training (f1 and f2), and discard the other features (f3, ..., f6). But this leads to a loss of information, and my test-set accuracy decreases.
Learn some relation between (f3, ..., f6) and (f1, f2), so that even though (f3, ..., f6) are not present in the new data point, their information can be approximated from f1 and f2 alone (a rough sketch of what I mean is below).
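Something like the following is what I have in mind for the second idea: a minimal sketch assuming scikit-learn and numeric features, with the estimator choices purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: columns 0-1 are f1, f2 and columns 2-5 are f3..f6.
X_train = np.random.rand(100, 6)
y_train = np.random.randint(0, 2, 100)

# Learn to predict f3..f6 from f1, f2 on the training set.
imputer = LinearRegression().fit(X_train[:, :2], X_train[:, 2:])

# Train the actual classifier on the full feature set, as before.
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# At prediction time only f1, f2 arrive; fill in estimates for f3..f6.
x_new = np.array([[0.3, 0.7]])
x_full = np.hstack([x_new, imputer.predict(x_new)])
print(clf.predict(x_full))
```

Whether this helps presumably depends on how strongly f3, ..., f6 actually depend on f1 and f2.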
The best way is, of course, to train a new model using only f1, f2, and any new data you may have.
Don't want to do that? If you don't have f3...f6, you shouldn't magically expect the model to work as intended.
Now, think about what those "f3...f6" actually are. Are they related to the information you do have? If they are, you may be able to approximate them. We can't tell you exactly what to do because we have no clue what they are. Interpolation? Regression? A rough approximation?
My suggestion: you are missing most of the predictors for your model. Your old model is meaningless. Please just train a new one.
Perhaps you could fill in f3 to f6 with an average value computed over all the training data that includes that feature. That way, features f3 through f6 won't stand out too much and won't lean your classifier one way or the other, so the classifier will be more likely to rely on the features you do have, f1 and f2.
When calculating this, make sure the averages are computed for each class first and then averaged. That way, if your data set has a large amount of one class, it won't skew the average.
Of course this might be an oversimplification, and it would work best with binary classification. It depends on the data set and the classification task.
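A minimal sketch of that idea, assuming pandas and illustrative column names f1..f6 and "label":

```python
import numpy as np
import pandas as pd

# Placeholder training data with columns f1..f6 and a label column.
train = pd.DataFrame(np.random.rand(100, 6), columns=[f"f{i}" for i in range(1, 7)])
train["label"] = np.random.randint(0, 2, 100)

missing = ["f3", "f4", "f5", "f6"]

# Average per class first, then across classes, so class imbalance doesn't skew it.
fill_values = train.groupby("label")[missing].mean().mean(axis=0)

# At prediction time the new point only has f1 and f2; fill in the neutral averages.
new_point = pd.DataFrame({"f1": [0.4], "f2": [0.9]})
for col in missing:
    new_point[col] = fill_values[col]
```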
Hope this helps :)
I loaded a dataset with 156 variables for a project. The goal is to figure out a model to predict a test data set. I am confused about where to start. Normally I would start with a basic linear regression model, but with 156 columns/variables, how should one start building a model? Thank you!
The question here is pretty open ended.
You need to confirm whether you are solving for regression or classification.
You need to go through some descriptive statistics of your data set to find out what type of values you have. Are there outliers, missing values, columns whose values are in the billions as against columns whose values are small fractions?
If you have categorical data, what kinds of categories do you have? What is the frequency count of each categorical value?
Accordingly, you clean the data (if required).
After this, you may want to understand the correlation (via Pearson's correlation or chi-square, depending on the data types of your variables) among these 156 variables and see how strongly they are related.
You may then choose to get rid of certain variables after looking at the correlations, or perform a PCA (which helps retain most of the variance in the dataset) to bring the variables down to fewer dimensions.
You may then look at fitting regression or classification models (depending on your need), starting with a simpler model first and then adjusting things as you work on improving accuracy (or minimizing the loss). A rough sketch of this workflow is below.
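A rough sketch of that workflow in scikit-learn, assuming numeric predictors and a regression target; the synthetic data below is just a stand-in for the real 156 variables.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in for the real data: 156 numeric predictors and one target column.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 156)), columns=[f"x{i}" for i in range(156)])
y = X.iloc[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=500)

print(X.describe())   # descriptive statistics: ranges, scale differences, outliers
corr = X.corr()       # pairwise Pearson correlations among the 156 predictors

model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),   # keep enough components for ~95% of the variance
    ("reg", LinearRegression()),
])
print(cross_val_score(model, X, y, cv=5).mean())   # baseline R^2 via 5-fold cross-validation
```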
So say for each of my ‘things’ to classify I have:
{house, flat, bungalow, electricityHeated, gasHeated, ... }
Which would be made into a feature vector:
{1,0,0,1,0,...} which would mean a house that is heated by electricity.
For my training data I would have all of this, but for the actual thing I want to classify I might only have what kind of house it is and a couple of other things, not all the data, i.e.
{1,0,0,?,?,...}
So how would I represent this?
I would want to find the probability that a new item would be gasHeated.
I would be using a linear SVM classifier. I don't have any code to show because this is purely theoretical at the moment. Any help would be appreciated :)
When I read this question, it seems that you may have confused features and labels.
You said that you want to predict whether a new item is "gasHeated"; in that case, "gasHeated" should be a label rather than a feature.
Btw, one of the most common ways to deal with missing values is to set them to "zero" (or some unused value, say -1). But normally you should have missing values in both the training data and the testing data for this trick to be effective. If missing values only appear in your testing data but not in your training data, it means your training data and testing data are not from the same distribution, which violates a basic assumption of machine learning.
Let's say you have a trained model and a testing sample {?,0,0,0}. Then you can create two new testing samples, {1,0,0,0}, {0,0,0,0}, and you will have two predictions.
I personally don't think SVM is a good approach if you have missing values in your testing dataset. As I mentioned above, you can get two new predictions, but what if they disagree? In my opinion it is difficult to assign a probability to the results of an SVM, unlike logistic regression or Naive Bayes. I would prefer Random Forest in this situation.
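To make this concrete, here is a minimal sketch (scikit-learn assumed; the data is synthetic and the four-feature layout is purely illustrative): fill the unknown slot both ways and compare the predicted probabilities from a Random Forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative training data: 4 binary features, label = gasHeated (0/1).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 4))
y_train = rng.integers(0, 2, size=200)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Test sample {?, 0, 0, 0}: try both possible values for the missing first feature.
candidates = np.array([[1, 0, 0, 0],
                       [0, 0, 0, 0]])
probs = clf.predict_proba(candidates)[:, 1]   # P(gasHeated = 1) for each completion
print(probs, probs.mean())   # e.g. average them if both completions are equally plausible
```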
I am using z-score to normalize my data before training my model. When I do predictions on a daily basis, I tend to have very few observations each day, perhaps just a dozen or so. My question is, can I normalize the test data just by itself, or should I attach it to the entire training set to normalize it?
The reason I am asking is, the normalization is based on mean and std_dev, which obviously might look very different if my dataset consists only of a few observations.
You need to have all of your data in the same units. Among other things, this means that you need to use the same normalization transformation for all of your input. You don't need to include the new data in the training per se -- however, keep the parameters of the normalization (the m and b of y = mx + b) and apply those to the test data as you receive them.
It's certainly not a good idea to predict on a test set using a model trained on a very different data distribution. I would use the mean and std of your training data to normalize your test set.
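A minimal sketch of that, using scikit-learn's StandardScaler to hold the training mean and std (you could just as well store those two numbers per feature yourself):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(1000, 5)   # placeholder training data
X_today = np.random.rand(12, 5)     # today's handful of observations

scaler = StandardScaler().fit(X_train)   # mean and std come from the training set only
X_train_z = scaler.transform(X_train)
X_today_z = scaler.transform(X_today)    # reuse them; do NOT refit on the small daily batch
```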
There are two data sets: the training one, and a data set of features whose labels are yet to be predicted (the new one).
I built a Random Forest classifier. Along the way I had to do two things:
Normalize continuous numeric features.
Perform a one-hot-encoding on the categorical ones.
Now I have two questions. When I am predicting labels for the new data:
Do I need to normalize the incoming features? (common sense tells me that yes :) ) If so, should I take the mean, max, min values for a specific feature from the training data set or should I somehow take into account the new values of the features?
How do I one-hot-encode the new values of the features? Do I expand the dictionary of possible categories for a specific feature to take into account the possibly new values?
In my case I possess both data sets, so I could calculate all this stuff in advance, but what if I only had a classifier and a new data set?
I only have a basic knowledge of the type of classifiers and normalization techniques you're using, but the general rule, that I think applies to what you're doing as well, is to do the following.
Your classifier is not a Random Forest Classifier. That is only one step of the pipeline that acts as your actual classifier. This pipeline / actual classifier is what you describe:
Normalize continuous numeric features.
Perform a one-hot-encoding on the categorical ones.
Use a Random Forest Classifier on what you get from the first 2 steps.
This pipeline, which encompasses those 3 things, is what you're actually using as your classifier.
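In scikit-learn terms, such a three-step classifier might look roughly like the sketch below; the column names and estimator settings are placeholders.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

numeric_cols = ["age", "income"]          # illustrative column names
categorical_cols = ["country", "device"]

clf = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])),
    ("rf", RandomForestClassifier(random_state=0)),
])

# clf.fit(X_train, y_train) builds the state (means, stds, category lists, trees);
# clf.predict(X_new) reuses that state on unseen data.
```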
Now, how does a classifier work?
You build some state based on the training data.
You use that state to make predictions on the test data.
So:
Do I need to normalize the incoming features? (common sense tells me that yes :) ) If so, should I take the mean, max, min values for a specific feature from the training data set or should I somehow take into account the new values of the features?
Your classifier normalizes the incoming features for the training data, so it will normalize those for unseen instances too. To do this, it must use the state it has built during training.
For example, if you were doing min-max scaling on your features, your state would store a min(f) and max(f) for each feature f. Then, during testing / prediction, you would do min-max scaling for each feature f using the stored min(f) and max(f) values.
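As a tiny sketch of that stored state (plain numpy, purely illustrative):

```python
import numpy as np

X_train = np.random.rand(100, 3)
f_min, f_max = X_train.min(axis=0), X_train.max(axis=0)   # state built during training

def min_max_scale(X):
    # Reuse the training min/max at prediction time; never recompute them on test data.
    return (X - f_min) / (f_max - f_min)

X_new_scaled = min_max_scale(np.random.rand(5, 3))
```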
I'm not sure what you mean by "normalize continuous numeric features". Do you mean discretization? If you build some state for this discretization during training, then you need to find a way to factor that in.
How do I one-hot-encode the new values of the features? Do I expand the dictionary of possible categories for a specific feature to take into account the possibly new values?
Don't you know how many values each category can have beforehand? Usually you do (since categoricals are things like nationality, continent, etc. - things you know in advance). If you can get a value for a categorical feature that you haven't seen during training, it raises the question of whether you should even care about it. What good is a categorical value you've never trained on?
Maybe add an "unknown" category. I think expanding by a single one should be fine; what good will more of them do if you've never trained on them?
What kind of categoricals do you have?
I could be wrong, but do you really need one-hot encoding? AFAIK, tree-based classifiers don't seem to benefit that much from it.
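If you do stick with one-hot encoding, note that scikit-learn's OneHotEncoder can map categories it never saw during training to an all-zero row, which is essentially the "unknown" bucket suggested above. A minimal sketch:

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["DE"], ["FR"], ["US"]])             # categories seen during training

print(enc.transform([["FR"], ["MX"]]).toarray())
# [[0. 1. 0.]    <- known category gets its own column
#  [0. 0. 0.]]   <- unseen "MX" becomes all zeros instead of raising an error
```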
I have a data set with two classes and was trying to get an optimal classifier using Weka. The best classifier I could obtain was about 79% accuracy. Then I tried adding attributes to my data by classifying it and saving the probability distribution generated by this classification in the data itself.
When I reran the training process on the modified data I got over 93% accuracy!! I am sure this is wrong, but I can't figure out exactly why.
These are the exact steps I went through:
Open the data in Weka.
Click on add Filter and select AddClassification from Supervised->attribute.
Select a classifier. I select J48 with default settings.
Set "Output Classification" to false and set Output Distribution to true.
Run the filter and restore the class to be your original nominal class. Note the additional attributes added to the end of the attribute list. They will have the names: distribution_yourFirstClassName and distribution_yourSecondClassName.
Go to the Classify tab and select a classifier: again I selected J48.
Run it. In this step I noticed much higher accuracy than before.
Is this a valid way of creating classifiers? Didn't I "cheat" by adding classification information to the original data? If it is valid, how would one proceed to create a classifier that can predict unlabeled data? How would it add the additional attributes (the distributions)?
I did try reproducing the same effect using a FilteredClassifier but it didn't work.
Thanks.
The process that you appear to have undertaken seems somewhat close to the Stacking ensemble method, where classifier outputs are used to generate an ensemble output (more on that here).
In your case, however, the attributes and the output of a previously trained classifier are being used to predict your class. It is likely that most of the second J48 model's rules will be based on the first (as the class output will correlate more strongly with the first J48's predictions than with the other attributes), but with some fine-tuning to improve model accuracy. In this case, the concept of 'two heads are better than one' is used to improve the overall performance of the model.
That's not to say that it is all good though. If you needed to use your J48 with unseen data, then you would not be able to use the same J48 that was used for your attributes (unless you saved it previously). Additionally, you are adding more processing work by using more than one classifier as opposed to the single J48. These costs would also need to be considered against the problem that you are tackling.
Hope this helps!
Okay, here is how I did cascaded learning:
I have the dataset D and divided it into 10 equal-sized stratified folds (D1 to D10) without repetition.
I applied algorithm A1 to train a classifier C1 on D1 to D9 and then, just like you, applied C1 to D10 to get the additional distribution over the positive and negative classes. I call D10 with these two (or more, depending on what information from C1 you want included) additional attributes/features D10_new.
Next, I applied the same algorithm to train a classifier C2 on D1 to D8 plus D10 and then, just like you, applied C2 to D9 to get the additional class distribution. I call this D9 with the additional attributes/features D9_new.
In this way I create D1_new to D10_new.
Then I applied another classifier (perhaps with algorithm A2) to these D1_new to D10_new to predict the labels (a 10-fold CV is a good choice).
In this setup, you remove the bias of a classifier having seen the data it is later tested on. Also, it is advisable that A1 and A2 be different.
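For reference, the same out-of-fold scheme can be written compactly with scikit-learn's cross_val_predict; in the sketch below, A1 is a decision tree and A2 is logistic regression purely as examples, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Level 1 (A1): out-of-fold class-probability predictions, so no row is scored
# by a model that saw it during training.
oof = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y,
                        cv=10, method="predict_proba")

# Append the distributions as new attributes (the D1_new ... D10_new above).
X_new = np.hstack([X, oof])

# Level 2 (A2): evaluate with another 10-fold CV.
print(cross_val_score(LogisticRegression(max_iter=1000), X_new, y, cv=10).mean())
```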