I have a (probably stupid) question about predicting a new instance when one or more predictors are missing.
I am given some data. Say I preprocess and clean it, and as a result 10 predictors are left. I then train my model on the resulting data, so I am ready to use the model to predict.
Now, what should I do if I want to predict a new instance for which 1 or 2 predictors are missing?
There are at least two reasonable solutions.
(1) Average the output over the possible values of the missing variable or variables, conditional on the values of the non-missing variables. That is, compute a weighted average of prediction(missing, non-missing) over each possible value of missing, weighted by the probability of that value of missing given non-missing. This is essentially a variety of what's called "multiple imputation" in the literature.
The first thing to try is to just weight by the unconditional distribution of missing. If that seems too complicated, a very rough approximation is to substitute the mean value of missing into the prediction.
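A minimal sketch of option (1) for a single missing predictor, assuming a fitted scikit-learn-style model; the toy data and the uniform fallback weights are illustrative assumptions:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def predict_with_missing(model, x, miss_idx, values, probs=None):
        """Average model.predict over candidate values of one missing feature,
        weighted by P(value | observed) -- uniform if probs is None."""
        probs = np.full(len(values), 1.0 / len(values)) if probs is None else np.asarray(probs)
        rows = np.tile(x, (len(values), 1))
        rows[:, miss_idx] = values          # plug in each candidate value
        return float(probs @ model.predict(rows))

    # Toy demo: the second feature of the new instance is missing.
    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]])
    model = LinearRegression().fit(X, [3.0, 5.0, 8.0])
    print(predict_with_missing(model, np.array([2.0, 0.0]), 1, [2.0, 3.0, 5.0]))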
(2) Build a model for each combination of variables. If you have n variables, this means building 2^n models. If n = 10, 1024 models is not a big deal these days. Then, if some variables are missing, just use the model trained on the ones that are present.
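A sketch of option (2), assuming scikit-learn linear models and np.nan marking the missing entries (both assumptions, not requirements):

    from itertools import combinations
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def fit_submodels(X, y):
        """One model per non-empty feature subset: 2^n - 1 in total
        (n = 10 gives 1023, which is cheap to train these days)."""
        n = X.shape[1]
        return {s: LinearRegression().fit(X[:, list(s)], y)
                for k in range(1, n + 1)
                for s in combinations(range(n), k)}

    def predict_present(models, x):
        """Route to the model trained on exactly the features present in x."""
        present = tuple(i for i, v in enumerate(x) if not np.isnan(v))
        return models[present].predict(np.asarray(x)[list(present)].reshape(1, -1))[0]

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
    models = fit_submodels(X, y)
    print(predict_present(models, np.array([0.5, np.nan, -1.0])))  # feature 1 missing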
By the way, you might get more interest in this question at stats.stackexchange.com.
I haven't been able to find any canonical sources on how tweedie_variance_power comes into play when predicting using an XGBoost algorithm with objective=reg:tweedie. My dependent variable is log-transformed auto insurance claim amount, so when I go to predict, in order to get units in dollars, I apply exp to the "raw" predictions from XGBoost (which look like they're on a log scale).
However (and perhaps this is due to this model not being a very good one), when I apply exp(log_predictions), the resulting, presumably dollar-amount, predictions are much lower than expected, given the dollar amounts in the training data. Am I missing something? Does my tweedie_variance_power = 2 for this model also need to be accounted for when transforming back to dollar units?
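For concreteness, here is a minimal self-contained version of what I'm checking (the data is a synthetic stand-in, and tweedie_variance_power is set near my value rather than exactly 2):

    import numpy as np
    import xgboost as xgb

    # Synthetic stand-in: positive, skewed "claim amounts", log-transformed.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = np.log1p(rng.gamma(2.0, 500.0, size=200))

    dtrain = xgb.DMatrix(X, label=y)
    bst = xgb.train({"objective": "reg:tweedie", "tweedie_variance_power": 1.9},
                    dtrain, num_boost_round=50)

    raw = bst.predict(dtrain, output_margin=True)   # the log-scale "link"
    default = bst.predict(dtrain)                   # predict() without options

    # If these agree, predict() has already inverted the log link, and my
    # extra np.exp on top of it would transform twice.
    print(np.allclose(default, np.exp(raw)))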
Related question: Xgboost tweedie: Why is the formula to get the prediction from the link = exp(link)/ 2?
I loaded a dataset with 156 variables for a project. The goal is to build a model to predict a test data set. I am confused about where to start. Normally I would start with a basic linear regression model, but with 156 columns/variables, how should one start model building? Thank you!
The question here is pretty open-ended.
You need to confirm whether you are solving a regression or a classification problem.
You need to go through some descriptive statistics of your data set to find out what kinds of values it contains: are there outliers, missing values, columns whose values are in the billions as against columns whose values are small fractions?
If you have categorical data, what types of categories do you have, and what is the frequency count of each categorical value?
Accordingly, you clean the data (if required).
After this, you may want to understand the correlation (via Pearson's or chi-square, depending on the data types of your variables) among these 156 variables and see how correlated they are.
You may then choose to get rid of certain variables after looking at the correlations, or perform a PCA (which retains the directions of highest variance in the dataset) to bring the variables down to fewer dimensions.
You may then fit regression or classification models (depending on your need), starting with a simpler model first and then adjusting things as you work on improving accuracy (or minimizing loss).
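A minimal starter pipeline along these lines, sketched with scikit-learn on synthetic stand-in data (the column count and the binary target are assumptions for illustration):

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in: 156 numeric columns plus a made-up binary target.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(500, 156)))
    df["target"] = (df[0] + df[1] + rng.normal(size=500) > 0).astype(int)

    X, y = df.drop(columns=["target"]), df["target"]
    print(X.describe().T.head())              # ranges, scales, missing counts
    print(X.corr().abs().round(2).head())     # a first look at correlations

    # Standardize (PCA is scale-sensitive), reduce dimensions, fit a simple model.
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=0.95),     # keep 95% of the variance
                          LogisticRegression(max_iter=1000))
    print(cross_val_score(model, X, y, cv=5).mean())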
This classification problem has 300,000 tuples and 20 features. I want to use the SVM algorithm to solve it. The 'age' feature is between 1 and 100, but for some tuples this feature is missing and blank. How should I handle it?
This of course depends on the distribution of your missing variable, but I would try imputation: fill in the blanks using the mean age value and see what kind of results you get. One step further would be to build a model that predicts age from the other input variables and use that for imputation.
You might also add a variable indicating that a given row has some imputed values; in some cases this yields better training results, as you give your algorithm more information.
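A minimal sketch of both suggestions together, using scikit-learn's SimpleImputer (the toy matrix is made up; blanks are assumed to be encoded as np.nan):

    import numpy as np
    from sklearn.impute import SimpleImputer

    # Toy matrix: the second column is 'age', with blanks encoded as np.nan.
    X = np.array([[1.0, 25.0],
                  [0.0, np.nan],
                  [1.0, 40.0]])

    imputer = SimpleImputer(strategy="mean", add_indicator=True)
    X_imp = imputer.fit_transform(X)
    # add_indicator=True appends a binary column flagging the imputed rows,
    # which is exactly the extra "was imputed" variable suggested above.
    print(X_imp)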
In addition to the simple imputation by mean already mentioned by #dratewka, I would suggest trying:
Imputing the feature using classic imputation mechanisms, e.g. K-nearest-neighbour imputation. Here, for a sample S whose age is missing, the K samples nearest to S are used to derive a suitable value for imputing age (with the distance between S and its neighbours measured on all the other features); see the sketch below.
After performing the previous step, try your prediction both with age and leaving it out. If you see that your prediction performance is not influenced by age, disregarding that information altogether might be reasonable as well.
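A minimal illustration with scikit-learn's KNNImputer (the toy numbers are made up; the real data would of course be much larger):

    import numpy as np
    from sklearn.impute import KNNImputer

    # Toy data: the last column is 'age', with one missing entry.
    X = np.array([[1.0, 0.2, 30.0],
                  [0.9, 0.3, 35.0],
                  [1.1, 0.1, np.nan],
                  [5.0, 4.0, 70.0]])

    # Each missing age is filled with the mean age of the K samples that are
    # nearest in the remaining (non-missing) features.
    imputer = KNNImputer(n_neighbors=2)
    print(imputer.fit_transform(X))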
I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree--I'd like to find out how to combine the scores into an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have a higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
So for each trial, my input is a set of these score-vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing, where a pool of candidates is interviewed by a few people who might have differing opinions but in general each tends to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness; it's binary, right or wrong. However, it doesn't seem like traditional binary classification, because among an input set of vectors at most one can be "classified" as right; the rest are wrong.
Thanks
Your problem doesn't exactly belong in the machine-learning category; the multiplication method might work better here. You can also try different statistical models for your output function.
ML problems, and more specifically classification problems, need training data from which your network can learn any existing patterns in the data and use them to assign a particular class to an input vector.
If you really want to use classification, then I think your problem fits the category of one-vs-all classification. You will need a network (or just a single output layer) with a number of cells/sigmoid units equal to your number of candidates (each cell representing one candidate). Note that here your number of candidates has to be fixed.
You can use your entire set of candidate vectors as input to all the cells of your network. The output can be specified using one-hot encoding, i.e. 00100 if candidate no. 3 was the actual correct candidate, and 00000 if there is no correct candidate.
For this to work, you will need a big data set containing your candidate vectors and the corresponding actual correct candidate. To label this data you will either need a function (again, like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data the same way. It will then maximize the number of correct outputs, but the definition of 'correct' here will be how you classified the training data.
You can also use a different type of output where each cell of the output layer corresponds to one of your scoring functions, and 00001 means that the candidate your 5th scoring function selected was the right one. This way your number of candidates does not have to be fixed. But again, you will have to set the training outputs manually for your network to learn from them.
One-vs-all is a classification technique where there are multiple cells in the output layer, each performing a binary classification of one class vs. all the others. At the end, the sigmoid with the highest probability is assigned 1 and the rest zero.
Once your system has learned how you classify data through your training data, you can feed new data in and it will give you output in the same way, i.e. 01000 etc.
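A minimal sketch of this setup (PyTorch here; the pool size, hidden width, and decision threshold are made-up assumptions, not part of the original question):

    import torch
    import torch.nn as nn

    N_CANDIDATES = 5   # assumed fixed pool size
    N_SCORES = 3       # scoring functions per candidate

    # Input: all candidate score-vectors concatenated; output: one sigmoid
    # unit per candidate. Target is one-hot (e.g. 00100), or all zeros when
    # no candidate is correct.
    model = nn.Sequential(
        nn.Linear(N_CANDIDATES * N_SCORES, 32),
        nn.ReLU(),
        nn.Linear(32, N_CANDIDATES),
    )
    loss_fn = nn.BCEWithLogitsLoss()   # applies the sigmoids internally
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    def train_step(x, target):
        """x: (batch, N_CANDIDATES * N_SCORES); target: (batch, N_CANDIDATES)."""
        optimizer.zero_grad()
        loss = loss_fn(model(x), target)
        loss.backward()
        optimizer.step()
        return loss.item()

    def predict(x, threshold=0.5):
        """Index of the highest-probability candidate, or -1 for 'none'."""
        probs = torch.sigmoid(model(x))
        best = probs.argmax(dim=1)
        none = torch.full_like(best, -1)
        return torch.where(probs.max(dim=1).values > threshold, best, none)

Independent sigmoid units with BCEWithLogitsLoss (rather than a softmax) are what make the all-zero "no correct candidate" target representable.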
I hope my answer was able to help you. :)
Hi, I've been doing a machine learning project on predicting whether a given (query, answer) pair is a good match (label the pair 1 if it is a good match, 0 otherwise). The problem is that in the training set all the items are labelled 1, so I'm confused: I don't think such a training set has strong discriminative power. To be more specific, right now I can extract some features like:
1. textual similarity between query and answer
2. some attributes like the posting date, who created it, which aspect is it about etc.
Maybe I should try semi-supervised learning (I've never studied it, so I have no idea whether it will work)? But with such a training set I cannot even do validation...
Actually, you can train on a data set with only positive examples; a one-class SVM does this. However, this presumes that anything "sufficiently outside" the original data set is negative data, with "sufficiently outside" controlled mainly by nu (the allowed error-rate bound) and the kernel parameters (e.g. gamma, or the degree of a polynomial kernel).
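A minimal sketch with scikit-learn's OneClassSVM (the feature matrix is random placeholder data standing in for the real (query, answer) features):

    import numpy as np
    from sklearn.svm import OneClassSVM

    # X_pos: placeholder features for the positive (label-1) pairs only.
    rng = np.random.default_rng(0)
    X_pos = rng.normal(size=(200, 5))

    # nu bounds the fraction of training points treated as errors/outliers;
    # gamma controls how tightly the RBF boundary hugs the positive data.
    clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_pos)

    # +1 = resembles the training (positive) data; -1 = "sufficiently outside".
    print(clf.predict(rng.normal(size=(3, 5))))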
A solution for your problem depends on the data you have. You are quite right that a model trains better when given representative negative examples. The description you give strongly suggests that you do know of pairs that are insufficient matches.
Do you need a strict +/- scoring for the matches? Most applications simply rank them: the match strength is the score. This changes your problem from classification to prediction. If you do need a strict +/- partition (classification), then I suggest that you slightly alter your training set: include only the obvious examples and throw out anything scored near your comfort threshold for declaring a match.
With these inputs only, train your model. You'll have a clear "alley" between good and bad matches, and the model will "decide" which way to judge the in-between cases in testing and production.
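A tiny sketch of that filtering step (the scores and comfort thresholds are made-up placeholders, with the matching features assumed to exist alongside them):

    import numpy as np

    # Placeholder match-strength scores and made-up comfort thresholds.
    rng = np.random.default_rng(0)
    scores = rng.random(1000)
    low, high = 0.4, 0.6

    keep = (scores < low) | (scores > high)      # drop the ambiguous middle
    labels = (scores[keep] > high).astype(int)   # obvious negatives vs positives
    # Train on (features[keep], labels): the gap between low and high is the
    # "alley" the model gets to decide about at test time.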