Evaluate per-class precision and recall in FastText - machine-learning

I am using the Facebook Research FastText library for text classification, following this tutorial. I have 2 labels for which I am performing the classification (2-class). The output of the prediction on the test file shows the overall precision and recall. How can I calculate per-class precision and recall for my test file?

I had to deal with this recently myself. This GitHub issue describes the problem and presents a solution.
In summary, you need to do this as a post-processing step. The code linked above compares your actual labels against the predicted ones and computes a confusion matrix that accurately reflects the classifier's performance for binary classification. That code computes only the confusion matrix and accuracy. If you also want precision and recall, you can similarly use the scikit-learn API, e.g. sklearn.metrics.precision_recall_fscore_support.
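For example, here is a minimal post-processing sketch, assuming the official fasttext Python bindings, a trained model.bin, and a test.txt in the standard "__label__<class> <text>" format (the file names are placeholders for your own):

```python
import fasttext
from sklearn.metrics import classification_report

model = fasttext.load_model("model.bin")  # placeholder path

y_true, y_pred = [], []
with open("test.txt", encoding="utf-8") as f:
    for line in f:
        # each line: "__label__<class> <text...>"
        label, _, text = line.strip().partition(" ")
        y_true.append(label)
        labels, _ = model.predict(text)  # returns (labels, probabilities)
        y_pred.append(labels[0])

# classification_report prints per-class precision, recall, F1 and support
print(classification_report(y_true, y_pred))
```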

Related

Sklearn models: decision function vs predict_proba for roc curve

In sklearn, roc_curve requires (y_true, y_scores). Generally, for y_scores, I feed in the probabilities output by a classifier's predict_proba function. But in the sklearn example, I see both predict_proba and decision_function being used.
I wonder what is the difference in terms of real life model evaluation?
The functional form of logistic regression is
$$f(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$$
This is what is returned by predict_proba.
The term inside the exponential, i.e.
$$d(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$
is what is returned by decision_function. The "hyperplane" referred to in the documentation is
$$\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k = 0$$
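A quick numerical check of this relationship, on a toy dataset (the dataset is purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

d = clf.decision_function(X)    # d(x): unbounded score
p = clf.predict_proba(X)[:, 1]  # f(x) = 1 / (1 + exp(-d(x)))

assert np.allclose(p, 1.0 / (1.0 + np.exp(-d)))
```

Since the sigmoid is strictly monotonic, both scores rank the samples identically, so for logistic regression roc_curve traces the same ROC curve whichever one you pass.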
My understanding after reading a few resources:
Decision Function: gives the signed distances from the hyperplane, which are therefore unbounded. They cannot be equated to probabilities. To get probabilities, there are two solutions: Platt scaling, and Multi-Attribute Spaces, which calibrates outputs using Extreme Value Theory.
Predict Proba: gives actual probabilities (0 to 1); however, the probability attribute has to be set to True while fitting the model itself. It uses Platt scaling, which is known to have theoretical issues.
Refer to this in the documentation.
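A sketch of that SVC behaviour on toy data (note that scikit-learn's docs warn the Platt-scaled probabilities are obtained via internal cross-validation and may not be perfectly consistent with the decision function):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

svc = SVC(probability=True).fit(X, y)  # probability must be enabled at fit time
d = svc.decision_function(X)           # unbounded distances to the hyperplane
p = svc.predict_proba(X)[:, 1]         # Platt-scaled probabilities in (0, 1)
```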

Is it possible to calculate AUC using OOB sample in Bagged trees?

I have a few questions on the OOB sample in bagged trees.
1. Do we always calculate error only on OOB samples? If yes, which error metric is used for evaluation (e.g. RMSE, misclassification error)?
2. Do we also have this OOB concept in boosting?
Is it possible to calculate AUC using OOB sample in Bagged trees?
An ROC curve is the most commonly used way to visualize the performance of a binary classifier, and AUC is (arguably) the best way to summarize its performance in a single number. It does not matter whether you are using bagged trees or not. You can find a nice explanation here.
1. Do we always calculate error only on OOB samples?
Not necessarily; before bootstrapping, you can set aside a validation set and do cross-validation.
If yes, which error metric is used for evaluation (e.g. RMSE, misclassification error)?
For a regression problem, the residual sum of squares (RSS) for the tree can be used.
For a classification problem, the misclassification error rate can be used.
2. Do we also have this OOB concept in boosting?
Let's recall what OOB is. The key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. On average, each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations. Reference: An Introduction to Statistical Learning, Section 8.2.1, Out-of-Bag Error Estimation
Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set. Reference: An Introduction to Statistical Learning, Section 8.2.3
Therefore, going by the definition, the OOB concept is not applicable to boosting.
But note that most implementations of boosted tree algorithms have an option to enable some form of OOB estimate. Please refer to the documentation of the respective implementation to understand their version.
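Coming back to the original question, yes: an OOB AUC drops out directly once you keep per-sample OOB probability estimates. A sketch with scikit-learn's BaggingClassifier (toy data; with very few estimators some samples may never be out-of-bag):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)

# The default base estimator is a decision tree, i.e. bagged trees
bag = BaggingClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)

# For each training sample, oob_decision_function_ averages the class
# probabilities over the trees that did NOT see that sample during fitting
print("OOB AUC:", roc_auc_score(y, bag.oob_decision_function_[:, 1]))
```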

When training a classifier based on a training set, what should I do if some of the training samples are worth more (are more valuable) than the rest?

I am trying to train a classifier based on a given training set (say a 2-class problem with 100 samples per class). How can I train my classifier in a way that some of the samples in the training set (say the first 20 samples from each class) are more valuable than the rest of the samples? (For various reasons, these samples are more similar to the test set, so they should be considered more important when training the classifier.)
Is it ok if I just replicate those samples a couple of times?
I don't know if it matters or not, but my classifier consists of a feature selection step (a filter-based method called fast correlation-based filter) and a classification step (linear SVM). Also, my test set is a totally different set, and I cannot use it at all for any step of the training.
Is it ok if I just replicate those samples a couple of times?
It depends on the method you are using. For some it is fine; the SVM you are referring to, for example, has an additive loss function over samples and does not care about duplicates. However, this is not how you should approach the problem with an SVM, since it directly supports weighting of samples, and that is what you should do: attach weights to the samples. Depending on the library / language used it might be available or not, but this is the correct way. With libsvm-based implementations, for example, you would simply pass sample_weight to your fit call, like here.
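A minimal sketch with scikit-learn's SVC (which wraps libsvm); the data and the weight factor of 5 are made up purely for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: 100 samples per class, 10 features
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = np.array([0] * 100 + [1] * 100)

# Give the first 20 samples of each class five times the weight of the rest
w = np.ones(200)
w[0:20] = 5.0
w[100:120] = 5.0

clf = SVC(kernel="linear").fit(X, y, sample_weight=w)
```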

Machine learning: after training, how exactly does it get a prediction? opencv

So after you have a machine learning algorithm trained, with your layers, nodes, and weights, how exactly does it go about getting a prediction for an input vector? I am using a multilayer perceptron (neural network).
From what I currently understand, you start with your input vector to be predicted. You send it to your hidden layer(s), where each node computes the sum of the products of the inputs and the weights found in training, adds its bias term, and then runs the result through the same activation function used in training. You repeat this for each hidden layer, and then do the same for your output layer. Each node in the output layer is then one of your predictions.
Is this correct?
I got confused when using OpenCV to do this, because the guide says that when you use the predict function:
If you are using the default cvANN_MLP::SIGMOID_SYM activation function with the default parameter values fparam1=0 and fparam2=0, then the function used is y = 1.7159*tanh(2/3 * x), so the output will range from [-1.7159, 1.7159], instead of [0,1].
However, the training documentation also states that SIGMOID_SYM uses the activation function:
f(x) = beta * (1 - e^{-alpha x}) / (1 + e^{-alpha x})
where alpha and beta are user-defined parameters.
So, I'm not quite sure what this means. Where does the tanh function come into play? Can anyone clear this up please? Thanks for your time!
The documentation where this is found is here:
the reference to tanh is under the function description for predict;
the reference to the activation function is by the S-shaped graph in the top part of the page.
Since this is a general question, and not code specific, I did not post any code with it.
I would suggest that you read about the specific algorithm you are using or planning to use. To be honest, there is no single definitive algorithm to solve a problem, but you can explore what features you have and what you need.
How an algorithm performs prediction depends entirely on the choice of algorithm. A Support Vector Machine (SVM) performs prediction by fitting hyperplanes in the feature space, using some metric such as distance during learning, and then the learnt model is used for prediction. KNN, on the other hand, uses a simple nearest-neighbour measurement for prediction.
Please do more work on what exactly you need, and read through the research papers to get a proper understanding. There is no magic involved in prediction, only mathematical formulations.
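On the tanh confusion in the question itself: the symmetric sigmoid f(x) = beta*(1 - e^{-alpha x})/(1 + e^{-alpha x}) is algebraically a scaled tanh, since (1 - e^{-alpha x})/(1 + e^{-alpha x}) = tanh(alpha x / 2); with beta = 1.7159 and alpha = 4/3 this is exactly the quoted y = 1.7159*tanh(2/3 * x), so the two documentation statements describe the same function. A minimal forward-pass sketch under that reading (the weights are made up; this shows the general MLP mechanics, not OpenCV's API):

```python
import numpy as np

def sym_sigmoid(x, alpha=4.0 / 3.0, beta=1.7159):
    # beta * (1 - exp(-alpha*x)) / (1 + exp(-alpha*x)) == beta * tanh(alpha*x/2)
    return beta * np.tanh(alpha * x / 2.0)

def predict(x, layers):
    # layers: list of (W, b) pairs; each layer computes activation(W @ x + b)
    for W, b in layers:
        x = sym_sigmoid(W @ x + b)
    return x

# Made-up 2-3-1 network, purely for illustration
rng = np.random.RandomState(0)
layers = [(rng.randn(3, 2), rng.randn(3)), (rng.randn(1, 3), rng.randn(1))]
print(predict(np.array([0.5, -1.0]), layers))  # output lies in [-1.7159, 1.7159]
```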

What are the metrics to evaluate a machine learning algorithm

I would like to know the various techniques and metrics used to evaluate how accurate/good an algorithm is, and how to use a given metric to draw a conclusion about an ML model.
One way to do this is to use precision and recall, as defined here on Wikipedia.
Another way is to use the accuracy metric, as explained here. So, what I would like to know is whether there are other metrics for evaluating an ML model.
A while ago I compiled a list of metrics used to evaluate classification and regression algorithms, in the form of a cheat sheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
The technique is mostly to run an algorithm on some data to get a model, then apply that model to new, previously unseen data, evaluate the metric on that data set, and repeat.
Some techniques (actually resampling techniques from statistics), sketched in code after this list:
Jackknife
Cross-validation
k-fold cross-validation
Bootstrap
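Two of these with scikit-learn, as a sketch (the estimator and dataset are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.utils import resample

X, y = make_classification(n_samples=300, random_state=0)
clf = LogisticRegression(max_iter=1000)

# k-fold cross-validation (k=5): accuracy on each held-out fold
scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold accuracies:", scores)

# Bootstrap: train on a sample drawn with replacement, test on the left-out points
idx = resample(np.arange(len(y)), random_state=0)
oob = np.setdiff1d(np.arange(len(y)), idx)
print("bootstrap estimate:", clf.fit(X[idx], y[idx]).score(X[oob], y[oob]))
```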
ML in general is quite a vast field, but I'll try to answer anyway. The Wikipedia definition of ML is the following:
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context, learning can be defined as the parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. Once the algorithm has "learned" the association between input and output, it can be tested on further input data for which the output is also known.
Let's suppose your problem is to recognize words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I assume this case to keep it simple). You'd record X words N times and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding for the moment what your algorithm would look like.
Now, on the one hand, depending on the algorithm, if you feed it one of the remaining repetitions, it may give you a certainty estimate that characterizes the recognition of just that repetition. On the other hand, you may use all of the remaining repetitions to test the learned algorithm: for each repetition you pass it to the algorithm and compare the expected output with the actual output. In the end you'll have an accuracy value for the learned algorithm, calculated as the ratio of correct to total classifications.
In any case, the actual accuracy will depend on the quality of your training and test data.
A good starting point for reading would be Pattern Recognition and Machine Learning by Christopher M. Bishop.
There are various metrics for evaluating the performance of an ML model, and there is no rule that says there are only 20 or 30 of them. You can create your own metrics depending on your problem; there are many cases where, when solving a real-world problem, you will need custom metrics.
As for the existing ones, they are already listed in the first answer; I will just highlight the merits and demerits of each metric to give a better understanding.
Accuracy is the simplest of the metrics and is commonly used. It is the number of correctly classified points divided by the total number of points in your dataset (for a 2-class problem, where some points belong to class 1 and some to class 2). It is not preferred when the dataset is imbalanced, because it is biased toward the majority class and is then hard to interpret.
Log loss is a metric that works on probability scores, which give you a better understanding of how strongly a specific point is assigned to class 1. A nice property of this metric is that it is the loss built into logistic regression, a popular ML technique.
The confusion matrix is best used for 2-class classification problems: it gives four numbers, and the diagonal entries give you an idea of how good your model is. From this matrix other interpretable metrics are derived, such as precision, recall, and F1-score.
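To make these concrete, a small sketch computing all three with scikit-learn (the labels and probabilities are made-up placeholders):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, log_loss)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.2, 0.7, 0.6])  # P(class 1)
y_pred = (y_prob >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_true, y_pred))
print("log loss:", log_loss(y_true, y_prob))   # scores the probabilities themselves
print(confusion_matrix(y_true, y_pred))        # diagonal entries = correct predictions
print(classification_report(y_true, y_pred))   # per-class precision / recall / F1
```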
