I am studying machine learning, and I am very confused about the definition of "prediction" and "label", and I would like to know what is the relationship between them?
My understand is:
"prediction" is something you are going to predict, based on "label".
E.g., label = MCQ1 MCQ2, prediction = Final_term_mark
it can predict a student's final term mark by his/her grade of MCQ1 and MCQ2.
Is that correct?
Labels are the known values for old data.
Prediction is your predicted value for new data, where you do not have a label (or pretend that you do not have a label - in evaluation).
During training, you try to make your predictions match the labels.
According to developers.google.com. A label is a thing we're predicting—the y variable in simple linear regression. The label could be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip, or just about anything.
Predictors are more like an estimator that is capable of making prediction like a Linear Regression model.
Related
I have a list of columns and each column is to be labelled by a label from another list of labels.
Eg: Two columns namely, ALT_ID and MTRC_NM are matched with labels Alternate ID and Metric Name respectively.
This fuzzy string matching has been taken care of. Problem is, I want to incorporate a learning model in this.
Essentially, after the matched results are displayed, the user curates the matches as CORRECT or INCORRECT. Based on this feedback and other features of the column (like minimum value, maximum value), I want to train a classifier such that the learning model will eventually stop making the incorrect matches in the future.
Note: In the first run, only the name of the column is used to produce the first set of results. After this, I want to use other features(like minimum value) to train the model.
Problem is, there can be 10,000 terms (or labels), maybe even more and the user just marks these as CORRECT or INCORRECT. For incorrect classifications, the user does not tell us what the correct classification should be.
I believe one solution could be to make separate classifiers for each label and based on the Correct/Incorrect feedback for a particular classification, we can use these feature vectors to train a classifier for this classification. So in the future, if the fuzzy string matching nominates Metric Name as the classification for some column, we can let the "Metric Name" classifier decide if it is correct or incorrect.
I don't know how to make separate classifiers for each label. I also don't know if this approach is feasible. Any other solution to this problem will also help.
You do not want to create separate models for each label as training more than 10 000 models isn't really feasible. Two possible things that come to my mind are:
Create a supervised learning model with one label as input and probability of each of 10 000 labels as output which only uses correct examples for predictions.
Create a reinforcement learning model with the same input but with output which maximises reward function defined as +1 for each positive prediction and -1 for each negative prediction. This model will also try to maximise the number of correct predictions but will be able to learn from incorrect predictions at the same time i.e. predict -1 score for an incorrect pair (x,y).
In the tutorial https://www.tensorflow.org/tutorials/keras/classification
at
https://www.tensorflow.org/tutorials/keras/classification#make_predictions
A prediction is an array of 10 numbers. They represent the model's
"confidence" that the image corresponds to each of the 10 different
articles of clothing. You can see which label has the highest
confidence value:
Instead of confidence, if I want to estimate the probability of each class (different articles of clothing). How will I do that?
As #desertnaut mention a comment, above, the confidence in the code
probability_model = tf.keras.Sequential([model,
tf.keras.layers.Softmax()])
predictions = probability_model.predict(test_images)
given by the variable predictions are indeed probability.
I know some machine learning algorithms can output the probability of predicted labels of an input sample.
For example, give a sample with three possible labels, a probability tuple (0.2,0.3,0.5) can be outputted through some probabilistic learning algorithms, such as logistic regression or probability estimate tree. Then the label with maximum probability (here 0.5) is outputted as the final prediction.
My question is, given a new sample having the predicted probability tuple (0.3,0.4,0.3), how can I quantitatively determine the likelihood of that the predicted label (here the second label) is correct?
Many thanks
(This IMHO question doesn't belong here. It does belong to stat stack exchange)
The answer is freakingly simple: the probability/likelihood -- which is not exactly the same -- is 0.4, which is pretty low.
If you want to run a small experiment. Build/learn a model classify a few instances and compare it to the ground truth. In addition sum up the probability of the most likely label. You will see that the sum of probabilities matches the fraction of correctly classified instances
or:
your model is wrong
your sample set is to small
I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.
I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. if the text is actually that label? I don't have a corpus and can't use tf-idf, etc.
Another idea , is to user neural networks with softmax output function, softmax will give you a probability for every class, when the network is very confident about a class, will give it a high probability, and lower probabilities to the other classes, but if its insecure, the differences between probabilities will be low and none of them will be very high, what if you define a treshold like : if the probability for every class is less than 70% , predict "other"
Whew! Classic ML algorithms don't combine both multi-classification and "in/out" at the same time. Perhaps what you could do would be to train five models, one for each class, with a one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claim it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
What about creating histograms? You could use a bag of words approach using significant indicators of for e.g. Tech and Finance. So, you could try to identify such indicators by analyzing the certain website's tags and articles or just browse the web for such inidicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vactor X has n dimensions where n represents the number of indicators. For example Xi then holds the count for the occurence of the word "asset" and Xi+k the count of the word "big data" in the current article.
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must match the zero or more category, train a model which returns probability scores (such as a neural net as Luis Leal suggested) per label/class. You could than rate your output by that score and say that every class with a score higher than some threshold t is a matching category.
Try this NBayes implementation.
For identifying "Other" categories, dont bother much. Just train on your required categories which clearly identifies them, and introduce a threshold in the classifier.
If the values for a label does not cross a threshold, then the classifier adds the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classify4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training of non-matches.
http://classifier4j.sourceforge.net/usage.html
I currently have a system set up where I train from old posts/categories and try to predict what category a new post will be. I am using a pipeline with TfidfVectorizer and LinearSVC to train the dataset and storing that in a pickle, then I process new posts by loading that pickle and using predict from the loaded pickle to classify the new posts. Currently, I am struggling with a few labels and I don't know why.
I am looking to provide some output on what words were triggered in the new post for each classification label so that I can see why a certain label was chosen when classifying new data against a training set, but I cannot find a way to do this.
I know that I can output the top features in my vectorizer when I am training, but how can I output essentially the reason why a certain label was chosen over another one?
During the training phase of the SVM for each word of the corpus vocabulary you learn a weight for each of the classes.
Then, during inference, you calculate the dot product between the class weights and the vector description of the instance to be classified. The algorithm returns the class that yields the highest dot product scores. Hence, you can have an estimate of how things work by examining those weights (coef_ attribute) for your instance.
I agree however that other methods like trees are more interpretable.