How to evaluate ML image classifier with confidence - machine-learning

Suppose I have a model that classifies images in to one of n categories. I know how to calculate the accuracy and sensitivity based just on the output label. However, I want to be more specific. How could I also incorporate the confidence percentage which is produced with each output???

You could use bootstrapping to obtain a confidence interval of your model on the dataset. A full demonstration here. If you want it for an individual sample, you may define another list, like the stat list, and store the predicted probabilities for that individual in there instead.

Related

How to build separate classifiers for each label in the dataset?

I have a list of columns and each column is to be labelled by a label from another list of labels.
Eg: Two columns namely, ALT_ID and MTRC_NM are matched with labels Alternate ID and Metric Name respectively.
This fuzzy string matching has been taken care of. Problem is, I want to incorporate a learning model in this.
Essentially, after the matched results are displayed, the user curates the matches as CORRECT or INCORRECT. Based on this feedback and other features of the column (like minimum value, maximum value), I want to train a classifier such that the learning model will eventually stop making the incorrect matches in the future.
Note: In the first run, only the name of the column is used to produce the first set of results. After this, I want to use other features(like minimum value) to train the model.
Problem is, there can be 10,000 terms (or labels), maybe even more and the user just marks these as CORRECT or INCORRECT. For incorrect classifications, the user does not tell us what the correct classification should be.
I believe one solution could be to make separate classifiers for each label and based on the Correct/Incorrect feedback for a particular classification, we can use these feature vectors to train a classifier for this classification. So in the future, if the fuzzy string matching nominates Metric Name as the classification for some column, we can let the "Metric Name" classifier decide if it is correct or incorrect.
I don't know how to make separate classifiers for each label. I also don't know if this approach is feasible. Any other solution to this problem will also help.
You do not want to create separate models for each label as training more than 10 000 models isn't really feasible. Two possible things that come to my mind are:
Create a supervised learning model with one label as input and probability of each of 10 000 labels as output which only uses correct examples for predictions.
Create a reinforcement learning model with the same input but with output which maximises reward function defined as +1 for each positive prediction and -1 for each negative prediction. This model will also try to maximise the number of correct predictions but will be able to learn from incorrect predictions at the same time i.e. predict -1 score for an incorrect pair (x,y).

Is there a way to find the most representative set of samples of the entire dataset?

I'm working on text classification and I have a set of 200.000 tweets.
The idea is to manually label a short set of tweets and train classifiers to predict the labels of the rest. Supervised learning.
What I would like to know is if there is a method to choose what samples to include in the train set in a way that this train set is a good representation of the whole data set, and because the high diversity included in the train set, the trained classifiers have considerable trust to be applied on the rest of tweets.
This sounds like a stratification question - do you have pre-existing labels or do you plan to design the labels based on the sample you're constructing?
If it's the first scenario, I think the steps in order of importance would be:
Stratify by target class proportions (so if you have three classes, and they are 50-30-20%, train/dev/test should follow the same proportions)
Stratify by features you plan to use
Stratify by tweet length/vocabulary etc.
If it's the second scenario, and you don't have labels yet, you may want to look into using n-grams as a feature, coupled with a dimensionality reduction or clustering approach. For example:
Use something like PCA or t-SNE to maximize distance between tweets (or a large subset), then pick candidates from different regions of the projected space
Cluster them based on lexical items (unigrams or bigrams, possibly using log frequencies or TF-IDF and stop word filtering, if content words are what you're looking for) - then you can cut the tree at a height that gives you n bins, which you can then use as a source for samples (stratify by branch)
Use something like LDA to find n topics, then sample stratified by topic
Hope this helps!
It seems that before you know anything about the classes you are going to label, a simple uniform random sample will do almost as well as any stratified sample - because you don't know in advance what to stratify on.
After labelling this first sample and building the first classifier, you can start so-called active learning: make predictions for the unlabelled dataset, and sample some tweets in which your classifier is least condfident. Label them, retrain the classifier, and repeat.
Using this approach, I managed to create a good training set after several (~5) iterations, with ~100 texts in each iteration.

multi-label text classification with zero or more labels

I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.
I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. if the text is actually that label? I don't have a corpus and can't use tf-idf, etc.
Another idea , is to user neural networks with softmax output function, softmax will give you a probability for every class, when the network is very confident about a class, will give it a high probability, and lower probabilities to the other classes, but if its insecure, the differences between probabilities will be low and none of them will be very high, what if you define a treshold like : if the probability for every class is less than 70% , predict "other"
Whew! Classic ML algorithms don't combine both multi-classification and "in/out" at the same time. Perhaps what you could do would be to train five models, one for each class, with a one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claim it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
What about creating histograms? You could use a bag of words approach using significant indicators of for e.g. Tech and Finance. So, you could try to identify such indicators by analyzing the certain website's tags and articles or just browse the web for such inidicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vactor X has n dimensions where n represents the number of indicators. For example Xi then holds the count for the occurence of the word "asset" and Xi+k the count of the word "big data" in the current article.
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must match the zero or more category, train a model which returns probability scores (such as a neural net as Luis Leal suggested) per label/class. You could than rate your output by that score and say that every class with a score higher than some threshold t is a matching category.
Try this NBayes implementation.
For identifying "Other" categories, dont bother much. Just train on your required categories which clearly identifies them, and introduce a threshold in the classifier.
If the values for a label does not cross a threshold, then the classifier adds the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classify4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training of non-matches.
http://classifier4j.sourceforge.net/usage.html

Naive Bays classifier: output percentage is too low

I'm writing a naive bayes classifier for a class project and I just got it working... sort of. While I do get an error-free output, the winning output label had an output probability of 3.89*10^-85.
Wow.
I have a couple of ideas of what I might be doing wrong. Firstly, I am not normalizing the output percentages for the classes, so all of the percentages are effectively zero. While that would give me numbers that look nice, I don't know if that's the correct thing to do.
My second idea was to reduce the number of features. Our input data is a list of pseudo-images in the form of a very long text file. Currently, our features are just the binary value of every pixel of the image, and with a 28x28 image that's a lot of features. If I instead chopped the image into blocks of size, say, 7x7, how much would that actually improve the output percentages?
tl;dr Here's the general things I'm trying to understand about naive bayes:
1) Do you need to normalize the output percentages from testing each class?
2) How much of an effect does having too many features have on the results?
Thanks in advance for any help you can give me.
It could be normal. The output of a naive bayes is not meant to be a real probability. What it is meant to do is order a score among competing classes.
The reason why the probability is so low is that many Naive Bayes implementations are the product of the probabilities of all the observed features of the instance that is being classified. If you are classifying text, each feature may have a low conditional probability for each class (example: lower than 0.01). If you multiply 1000s of feature probabilities, you quickly end up with numbers such as you have reported.
Also, the probabilities returned are not the probabilities of each class given the instance, but an estimate of the probabilities of observing this set of features, given the class. Thus, the more you have features, the less likely it is to observe these exact features. A bayesian theorem is used to change argmax_c P(class_c|features) to argmax_c P(class_c)*P(features|class_c), and then the P(features|class_c) is further simplified by making independence assumption, which allows changing that to a product of the probabilities of observing each individual feature given the class. These assumptions don't change the argmax (the winning class).
If I were you, I would not really care about the probability output, focus instead on the accuracy of your classifier and take action to improve the accuracy, not the calculated probabilities.

How to calculate TF*IDF for a single new document to be classified?

I am using document-term vectors to represent a collection of document. I use TF*IDF to calculate the term weight for each document vector. Then I could use this matrix to train a model for document classification.
I am looking forward to classify new document in future. But in order to classify it, I need to turn the document into a document-term vector first, and the vector should be composed of TF*IDF values, too.
My question is, how could I calculate the TF*IDF with just a single document?
As far as I understand, TF can be calculated based on a single document itself, but the IDF can only be calculated with a collection of document. In my current experiment, I actually calculate the TF*IDF value for the whole collection of documents. And then I use some documents as training set and the others as test set.
I just suddenly realized that this seems not so applicable to real life.
ADD 1
So there are actually 2 subtly different scenarios for classification:
to classify some documents whose content are known but label are not
known.
to classify some totally unseen document.
For 1, we can combine all the documents, both with and without labels. And get the TF*IDF over all of them. This way, even we only use the documents with labels for training, the training result will still contain the influence of the documents without labels.
But my scenario is 2.
Suppose I have the following information for term T from the summary of the training set corpus:
document count for T in the training set is n
total number of training documents is N
Should I calculate the IDF of t for a unseen document D as below?
IDF(t, D)= log((N+1)/(n+1))
ADD 2
And what if I encounter a term in the new document which didn't show up in the training corpus before?
How should I calculate the weight for it in the doc-term vector?
TF-IDF doesn't make sense for a single document, independent of a corpus. It's fundamentally about emphasizing relatively rare and informative words.
You need to keep corpus summary information in order to compute TF-IDF weights. In particular, you need the document count for each term and the total number of documents.
Whether you want to use summary information from the whole training set and test set for TF-IDF, or for just the training set is a matter of your problem formulation. If it's the case that you only care to apply your classification system to documents whose contents you have, but whose labels you do not have (this is actually pretty common), then using TF-IDF for the entire corpus is okay. If you want to apply your classification system to entirely unseen documents after you train, then you only want to use the TF-IDF summary information from the training set.
TF obviously only depends on the new document.
IDF, you compute only on your training corpus.
You can add a slack term to the IDF computation, or adjust it as you suggested. But for a reasonable training set, the constant +1 term will not have a whole lot of effect. AFAICT, in classic document retrieval (think: search), you don't bother to do this. Often, they query document will not become part of your corpus, so why would it be part of IDF?
For unseen words, TF calculation is not a problem as TF is a document specific metric. While computing IDF, you can use smoothed inverse document frequency technique.
IDF = 1 + log(total documents / document frequency of a term)
Here the lower bound for IDF is 1. So if a word is not seen in the training corpus, its IDF is 1. Since, there is no universally agreed single formula for computing tf-idf or even idf, your formula for tf-idf calculation is also reasonable.
Note that, in many cases, unseen terms are ignored if they don't have much impact in the classification task. Sometimes, people replace unseen tokens with a special symbol like UNKNOWN_TOKEN and do their computation.
Alternative of TF-IDF: Another way of computing weight of each term of a document is using Maximum Likelihood Estimation. While computing MLE, you can smooth using additive smoothing technique which is also known as Laplace smoothing. MLE is used in case you are using Generative models like Naive Bayes algorithm for document classification.

Resources