Predicting probability distribution with partial information about the distribution - machine-learning

I need to predict predict probability distribution for a given input data. However instead of having labeled data, I have partial information about the output probability distribution. For example, if the distribution is over four classes, I know correct value of probability for one of the classes for a given input. Following are sample set of data
y_1 -> (?, 0.25, ?, ?)
y_2 -> (?, ?, 0.1, ?)
i.e., for y_1, we know that probability of belonging it to class 2 is 25%. However we do not know anything about the other three classes. Is there a way I can perform machine learning on this kind of data. One hueristic approach is to distribute the remaining probability eaqually. I was wondering if there is any work which deals with above kind of problem. Any pointer is appreciated.

This sounds like an example of logged contextual bandit where you use logged data derived from an exploration experiment where you know the probability of chosen action.More details and example : https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Logged-Contextual-Bandit-Example

Related

How to build separate classifiers for each label in the dataset?

I have a list of columns and each column is to be labelled by a label from another list of labels.
Eg: Two columns namely, ALT_ID and MTRC_NM are matched with labels Alternate ID and Metric Name respectively.
This fuzzy string matching has been taken care of. Problem is, I want to incorporate a learning model in this.
Essentially, after the matched results are displayed, the user curates the matches as CORRECT or INCORRECT. Based on this feedback and other features of the column (like minimum value, maximum value), I want to train a classifier such that the learning model will eventually stop making the incorrect matches in the future.
Note: In the first run, only the name of the column is used to produce the first set of results. After this, I want to use other features(like minimum value) to train the model.
Problem is, there can be 10,000 terms (or labels), maybe even more and the user just marks these as CORRECT or INCORRECT. For incorrect classifications, the user does not tell us what the correct classification should be.
I believe one solution could be to make separate classifiers for each label and based on the Correct/Incorrect feedback for a particular classification, we can use these feature vectors to train a classifier for this classification. So in the future, if the fuzzy string matching nominates Metric Name as the classification for some column, we can let the "Metric Name" classifier decide if it is correct or incorrect.
I don't know how to make separate classifiers for each label. I also don't know if this approach is feasible. Any other solution to this problem will also help.
You do not want to create separate models for each label as training more than 10 000 models isn't really feasible. Two possible things that come to my mind are:
Create a supervised learning model with one label as input and probability of each of 10 000 labels as output which only uses correct examples for predictions.
Create a reinforcement learning model with the same input but with output which maximises reward function defined as +1 for each positive prediction and -1 for each negative prediction. This model will also try to maximise the number of correct predictions but will be able to learn from incorrect predictions at the same time i.e. predict -1 score for an incorrect pair (x,y).

How to explain feature importance after one-hot encode used for decision tree

I know decision tree has feature_importance attribute calculated by Gini and it could be used to check which features are more important.
However, for application in scikit-learn or Spark, it only accepts numeric attribute, so I have to transfer string attribute to numeric attribute and then do one-hot encoder on that. When features are put into decision tree model, it's 0-1 encoded other than original format, my question is, how to explain feature importance for original attributes? should I avoid one-hot encoder when try to explain feature importance?
Thanks.
Conceptually, you may want to use something along the lines of permutation importance. The basic idea, is that you take your original dataset, and randomly shuffle the values of each column 1 at a time. Then, you score your perturbed data with the model and compare the performance to the original performance. If done 1 column at a time, you can assess the performance hit you take by destroying each variable, indexing it to the variable that had the most loss (which would become 1, or 100%). If you can do this to your original dataset, prior to the 1 hot encoding, then you'll be getting an importance measure that groups them together overall.

What type of ML is this? Algorithm to repeatedly choose 1 correct candidate from a pool (or none)

I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree--I'd like to find how to combine the scores together for an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
So for each trial, my input is a set of these score-vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing where a pool of candidates are interviewed by a few people who might have differing opinions but in general each tend to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness. It's binary right or wrong. However, it doesn't seem like traditional binary classification because among an input set of vectors, there can be at most 1 "classified" as right, the rest are wrong.
Thanks
Your problem doesn't exactly belong in the machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
ML, and more specifically classification, problems need training data from which your network can learn any existing patterns in the data and use them to assign a particular class to an input vector.
If you really want to use classification then I think your problem can fit into the category of OnevsAll classification. You will need a network (or just a single output layer) with number of cells/sigmoid units equal to your number of candidates (each representing one). Note, here your number of candidates will be fixed.
You can use your entire candidate vector as input to all the cells of your network. The output can be specified using one-hot encoding i.e. 00100 if your candidate no. 3 was the actual correct candidate and in case of no correct candidate output will be 00000.
For this to work, you will need a big data set containing your candidate vectors and corresponding actual correct candidate. For this data you will either need a function (again like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data in the same way as you did. This way, it will maximize the number of correct outputs but the definition of correct here will be how you classify the training data.
You can also use a different type of output where each cell of output layer corresponds to your scoring functions and 00001 means that the candidate your 5th scoring function selected was the right one. This way your candidates will not have to be fixed. But again, you will have to manually set the outputs of the training data for your network to learn it.
OnevsAll is a classification technique where there are multiple cells in the output layer and each perform binary classification in between one of the classes vs all others. At the end the sigmoid with the highest probability is assigned 1 and rest zero.
Once your system has learned how you classify data through your training data, you can feed your new data in and it will give you output in the same way i.e. 01000 etc.
I hope my answer was able to help you.:)

When classifying, does one need to normalize new incoming features when predicting on real data?

There are two data sets - the training one and a data set of features, labels for which are yet to be predicted (the new one).
I built a Random Forest classifier. Along the way I had to do two things:
Normalize continuous numeric features.
Perform a one-hot-encoding on the categorical ones.
Now I have two questions. When i am predicting labels for the new data:
Do I need to normalize the incoming features? (common sense tells me that yes :) ) If so, should I take the mean, max, min values for a specific feature from the training data set or should I somehow take into account the new values of the features?
How do I hot-one-encode the new values of the features? Do I expand the dictionary of the possible categories for a specific category taking into account the possibly new values of the features?
In my case I possess both data sets, so I could calculate all this stuff in advance, but what if I only had a classifier and a new data set?
I only have a basic knowledge of the type of classifiers and normalization techniques you're using, but the general rule, that I think applies to what you're doing as well, is to do the following.
Your classifier is not a Random Forest Classifier. That is only one step of the pipeline that acts as your actual classifier. This pipeline / actual classifier is what you describe:
Normalize continuous numeric features.
Perform a one-hot-encoding on the categorical ones.
Use a Random Forest Classifier on what you get from the first 2 steps.
This pipeline, that encompasses 3 things, is what you're actually using as your classifier.
Now, how does a classifier work?
You build some state based on the training data.
You use that state to make predictions on the test data.
So:
Do I need to normalize the incoming features? (common sense tells me that yes :) ) If so, should I take the mean, max, min values for a specific feature from the training data set or should I somehow take into account the new values of the features?
Your classifier normalizes the incoming features for the training data, so it will normalize those for unseen instances too. To do this, it must use the state it has built during training.
For example, if you were doing min-max scaling on your features, your state would store a min(f) and max(f) for each feature f. Then, during testing / prediction, you would do min-max scaling for each feature f using the stored min(f) and max(f) values.
I'm not sure what you mean by "normalize continuous numeric features". Do you mean discretization? If you build some state for this discretization during training, then you need to find a way to factor that in.
How do I hot-one-encode the new values of the features? Do I expand the dictionary of the possible categories for a specific category taking into account the possibly new values of the features?
Don't you know how many values each category can have beforehand? Usually you do (since categoricals are things like nationality, continent etc. - things you know in advance). If you can get a value for a categorical feature that you haven't seen during training, it begs the question if you should even care about it. What good is a categorical value you've never trained on?
Maybe add an "unknown" category. I think expanding for a single one should be fine, what good are more going to do if you've never trained on them?
What kind of categoricals do you have?
I could be wrong, but do you really need one-hot encoding? AFAIK, tree-based classifiers don't seem to benefit that much from it.

How to build multivariate ranking system?

I have data of various sellers on ecommerce platform. I am trying to compute seller ranking score based on various features, such as
1] Order fulfillment rates [numeric]
2] Order cancel rate [numeric]
3] User rating [1-5] { 1-2 : Worst, 3: Average , 5: Good} [categorical]
4] Time taken to confirm the order. (shorter the time taken better is the seller) [numeric]
My first instinct was to normalize all the features, then multiply parameters/feature by some weight . Add them together for each seller score. Finally, find relative ranking of sellers based on this score.
My Seller score equation looks like
Seller score = w1* Order fulfillment rates - w2*Order cancel rate + w3 * User rating + w4 * Time taken to confirm order
where, w1,w2,w3,w4 are weights.
My question is three fold
Are there better algorithms/approaches to solve this problem? i.e I linearly added the various features, I want to know better approach to build the ranking system?
How to come with the values for the weights?
Apart from using above features, few more that I can think of are ratio of positive to negative reviews, rate of damaged goods etc. How will these fit into my Score equation?
How to incorporate numeric and categorical variables in finding seller ranking score? (I have few categorical variables)
Is there an accepted way to weight multivariate systems like this ?
I would suggest the following approach:
First of all, keep in a matrix all features that you have available, whether you consider them useful or not.
(Hint: categorical variables are converted to numerical by simple encoding. Thus you can easily incorporate them (in the exact way you encoded user rating))
Then, you have to apply a dimensionality reduction algorithm, such as Singular Value Decomposition (SVD), in order to keep the most significant variables. Applying SVD may surprise you as to which features may be significant and which aren't.
After applying SVD, choosing the right weights for the n-most important features you decided to keep, is really up to you because it is purely qualitative and domain-dependent (as far as which features are more important).
The only way you could possibly calculate weights in a formalistic way is if the features were directly connected to something, e.g., revenue. Since this very rarely occurs, I suggest manually defining the weights; but for the sake of normalization, set:
w1 + w2 + ... + wn = 1
That is, split the "total importance" among the features you selected in a manner that sums to 1.

Resources