How to evaluate classifiers? Precision & recall? - machine-learning

I have some labeled data which classifies datasets as positive or negative. Now I have an algorithm that does the same automatically, and I want to compare the results.
I was told to use precision and recall, but I'm not sure whether those are appropriate because the true negatives don't even appear in the formulas. I'd rather use a general "prediction rate" for both positives and negatives.
What would be a good way to evaluate the algorithm? Thanks!!

There is no general "best" method of evaluation; everything depends on your aim, as each method captures different phenomena:
Accuracy is the simplest measure, well suited to multi-class classification and reasonably well-balanced data
F1-score captures the precision/recall tradeoff
MCC (Matthews correlation coefficient) is a good measure that uses all four cells of the confusion matrix, including the true negatives, and is well suited to datasets with a large disproportion in class sizes
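To make the comparison concrete, here is a minimal sketch in plain Python that computes all of these measures from the confusion counts. The function names and the toy labels are made up for illustration; in practice a library such as scikit-learn provides equivalent metrics.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, F1 and MCC as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC is the only one of these that uses the true negatives in its numerator.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical toy data: 2 positives among 10 samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On this toy data accuracy looks high because of the many true negatives, while precision, recall, F1 and MCC are all noticeably lower.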

Related

Confidence Probability for Binary Machine Learning Classification

When using sklearn and getting probabilities with the predict_proba(x) function for a binary classification [1, 0], the function returns the probability that the sample falls into each class, for example [.8, .2].
Is there a community-adopted standard way to reduce this to a single classification confidence which takes all factors into consideration?
Option 1)
Just take the probability of the class that was predicted (.8 in this example)
Option 2)
Some mathematical formula or function call which takes into consideration all of the different probabilities and returns a single number. Such a confidence approach could take into consideration how close the probabilities of the different classes are and return a lower confidence if there is not much separation between the classes.
There's no standard way of doing it. But what you can do is vary the threshold. What I mean is that if you use predict instead, it outputs a binary classification of your dataset; what it's doing is taking 0.5 as the threshold for predicting. That is, if the probability of class 1 is >0.5 it classifies the sample as 1, and as 0 if it is <=0.5. But this can lead to a bad F1-score in some cases.
So the approach should be to vary the threshold and choose the one which yields the maximum F1-score, or any other metric you want to use as a score function. ROC (receiver operating characteristic) curves are meant for exactly this purpose. In fact, one motive for sklearn exposing the class probabilities is to let you choose the best threshold.
A very nice example is predicting whether a patient has cancer or not. You have to choose your threshold wisely: if you choose it high you might get a lot of false negatives, and if you choose it low you might get a lot of false positives. So you just choose the threshold according to your needs (here a lower one, as it's better to get more false positives than to miss a cancer).
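As a rough illustration of varying the threshold, the sketch below scans a grid of candidate thresholds and keeps the one with the highest F1-score. The scores stand in for the positive-class column of predict_proba; the data, the grid and the function names are all made up for the example, and the threshold should be chosen on held-out data rather than on the test set.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1-score when samples with score > threshold are predicted positive."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest F1-score."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))

# Hypothetical positive-class probabilities for 8 validation samples.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.12, 0.22, 0.31, 0.37, 0.41, 0.46, 0.83]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

best = best_threshold(y_true, scores, candidates)
print("F1 at default 0.5:", f1_at_threshold(y_true, scores, 0.5))
print("best threshold:", best, "F1:", f1_at_threshold(y_true, scores, best))
```

Here the default 0.5 threshold misses two of the three positives, while a lower threshold recovers them at the cost of one false positive.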
Hope it helps!

Machine Learning Experiment Design with Small Positive Sample Set in Sci-kit Learn

I am interested in any tips on how to train a classifier with a very limited positive set and a large negative set.
I have about 40 positive examples (quite lengthy articles about a particular topic), and about 19,000 negative samples (most drawn from the scikit-learn newsgroups dataset). I also have about 1,000,000 tweets that I could work with as negatives for the topic I am trying to train on. Is the size of the negative set versus the positive going to negatively influence training a classifier?
I would like to use cross-validation in scikit-learn. Do I need to break this into train / test-dev / test sets? I know there are some pre-built libraries in scikit-learn. Any implementation examples that you recommend or have used previously would be helpful.
Thanks!
The answer to your first question is yes; the amount by which it will affect your results depends on the algorithm. My advice would be to keep an eye on the class-based statistics such as recall and precision (found in classification_report).
For RandomForest() you can look at this thread which discusses the sample weight parameter. In general sample_weight is what you're looking for in scikit-learn.
For SVMs have a look at either this example or this example.
For NB classifiers, this should be handled implicitly by Bayes' rule; however, in practice you may see some poor performance.
For your second question it's up for discussion. Personally I break my data into a training and test split, perform cross-validation on the training set for parameter estimation, retrain on all the training data and then test on my test set. However, the amount of data you have may influence the way you split your data (more data means more options).
You could probably use Random Forest for your classification problem. There are basically 3 parameters to deal with data imbalance: Class Weight, Samplesize and Cutoff.
Class Weight - The higher the weight a class is given, the more its error rate is decreased.
Samplesize - Oversample the minority class to reduce class imbalance while sampling the data for each tree (not sure if scikit-learn supports this; it used to be a param in R).
Cutoff - If more than a fraction x of the trees vote for the minority class, classify it as the minority class. By default x is 1/2 in Random Forest for a 2-class problem. You can set it to a lower value for the minority class.
Check out balancing prediction error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
For the 2nd question, if you are using Random Forest, you do not need to keep a separate train/validation/test set. Random Forest does not choose any parameters based on a validation set, so a validation set is unnecessary.
Also, during the training of a Random Forest, the data for training each individual tree is obtained by sampling with replacement from the training data, so each training sample is left out of roughly 1/3 of the trees. We can use the votes of these 1/3 of the trees to compute the out-of-bag prediction of the Random Forest for that sample. Thus with OOB accuracy you just need a training set, and not validation or test data, to estimate performance on unseen data. Check Out of Bag error at https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm for further study.
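The "roughly 1/3" figure can be checked with a small simulation. This is only a sketch of the bootstrap step, not of a Random Forest itself; the function name and the sample and tree counts are arbitrary choices. For n samples drawn with replacement n times, the chance that a given sample is never drawn is (1 - 1/n)^n, which approaches 1/e (about 0.37) as n grows.

```python
import random

def out_of_bag_fraction(n_samples, n_trees, seed=0):
    """Average fraction of training samples left out of each bootstrap sample."""
    rng = random.Random(seed)
    total_oob = 0
    for _ in range(n_trees):
        # One bootstrap sample: draw n_samples indices with replacement.
        in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
        total_oob += n_samples - len(in_bag)
    return total_oob / (n_samples * n_trees)

fraction = out_of_bag_fraction(n_samples=1000, n_trees=50)
expected = (1 - 1 / 1000) ** 1000  # analytic value for n = 1000
print(f"simulated: {fraction:.3f}, analytic: {expected:.3f}")
```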

Training on imbalanced data using TensorFlow

The Situation:
I am wondering how to use TensorFlow optimally when my training data is imbalanced in label distribution between 2 labels. For instance, suppose the MNIST tutorial is simplified to only distinguish between 1's and 0's, where all images available to us are either 1's or 0's. This is straightforward to train using the provided TensorFlow tutorials when we have roughly 50% of each type of image to train and test on. But what about the case where 90% of the images available in our data are 0's and only 10% are 1's? I observe that in this case, TensorFlow routinely predicts my entire test set to be 0's, achieving an accuracy of a meaningless 90%.
One strategy I have used with some success is to pick random batches for training that do have an even distribution of 0's and 1's. This approach ensures that I can still use all of my training data, and it produces decent results, with less than 90% accuracy but a much more useful classifier. Since accuracy is somewhat useless to me in this case, my metric of choice is typically area under the ROC curve (AUROC), and this produces a result respectably higher than .50.
Questions:
(1) Is the strategy I have described an accepted or optimal way of training on imbalanced data, or is there one that might work better?
(2) Since the accuracy metric is not as useful in the case of imbalanced data, is there another metric that can be maximized by altering the cost function? I can certainly calculate AUROC post-training, but can I train in such a way as to maximize AUROC?
(3) Is there some other alteration I can make to my cost function to improve my results for imbalanced data? Currently, I am using a default suggestion given in TensorFlow tutorials:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
I have heard this may be possible by up-weighting the cost of miscategorizing the smaller label class, but I am unsure of how to do this.
(1) It's OK to use your strategy. I'm working with imbalanced data as well, and I first try down-sampling and up-sampling methods to make the training set evenly distributed. Another option is an ensemble method that trains each classifier on an evenly distributed subset.
(2) I haven't seen any method to maximise the AUROC directly. My thought is that AUROC is based on the true positive and false positive rates, which doesn't tell you how well the model works on each instance. Thus, it may not necessarily maximise the capability to separate the classes.
(3) Regarding weighting the cost by the ratio of class instances, it is similar to the question "Loss function for class imbalanced binary classifier in Tensor flow" and its answer.
Regarding imbalanced datasets, the first two methods that come to mind are upweighting positive samples and sampling to achieve balanced batch distributions.
Upweighting positive samples
This refers to increasing the losses of misclassified positive samples when training on datasets that have far fewer positive samples. This incentivizes the ML algorithm to learn parameters that are better for positive samples. For binary classification, there is a simple API in TensorFlow that achieves this. See weighted_cross_entropy_with_logits, referenced below
https://www.tensorflow.org/api_docs/python/tf/nn/weighted_cross_entropy_with_logits
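To show what the upweighting does, here is a plain-Python sketch of the per-example loss that the linked function documents, namely pos_weight * label * -log(sigmoid(logit)) + (1 - label) * -log(1 - sigmoid(logit)). This is not the TensorFlow implementation, which uses a numerically stable form; the helper names and the weight of 9 (roughly the negative-to-positive ratio for a 90/10 split) are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_cross_entropy(label, logit, pos_weight):
    """Per-example sigmoid cross-entropy with the positive term scaled by pos_weight."""
    p = sigmoid(logit)
    return -(pos_weight * label * math.log(p)
             + (1 - label) * math.log(1 - p))

# With a 90/10 split, an assumed pos_weight of 9 makes one missed positive
# cost as much as nine equally uncertain negatives.
loss_pos = weighted_cross_entropy(label=1, logit=0.0, pos_weight=9.0)
loss_neg = weighted_cross_entropy(label=0, logit=0.0, pos_weight=9.0)
print(loss_pos, loss_neg)
```

With pos_weight = 1 this reduces to the ordinary sigmoid cross-entropy.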
Batch Sampling
This involves sampling the dataset so that each batch of training data has an even distribution of positive samples to negative samples. This can be done using the rejection sampling API provided by TensorFlow.
https://www.tensorflow.org/api_docs/python/tf/contrib/training/rejection_sample
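For readers not using that API, the same idea can be sketched in plain Python. Note this is a simpler technique than rejection sampling: it just draws half of each batch from each class with replacement. The function name and the toy data are hypothetical.

```python
import random

def balanced_batch(positives, negatives, batch_size, rng):
    """Draw a batch with (about) half positives and half negatives.

    Sampling is with replacement, so the minority class is reused across
    batches while the majority class is still covered over time.
    """
    half = batch_size // 2
    batch = [(x, 1) for x in rng.choices(positives, k=half)]
    batch += [(x, 0) for x in rng.choices(negatives, k=batch_size - half)]
    rng.shuffle(batch)
    return batch

# Hypothetical data: 5 positive and 95 negative example ids.
positives = list(range(5))
negatives = list(range(100, 195))
rng = random.Random(0)
batch = balanced_batch(positives, negatives, batch_size=32, rng=rng)
print(sum(label for _, label in batch), "positives out of", len(batch))
```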
I'm one who is struggling with imbalanced data. My strategies to counter imbalanced data are as below.
1) Use a cost function that accounts for the 0 and 1 labels at the same time, like below.
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(_pred) + (1-y)*tf.log(1-_pred), reduction_indices=1))
2) Use SMOTE, an oversampling method that makes the number of 0 and 1 labels similar. Refer to here, http://comments.gmane.org/gmane.comp.python.scikit-learn/5278
Both strategies worked when I tried to build a credit rating model.
Logistic regression is a typical method for handling imbalanced data and binary classification such as predicting default rate. AUROC is one of the best metrics to use with imbalanced data.
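Since AUROC comes up repeatedly in this thread, here is a minimal sketch of one way to compute it: as the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counting one half. The function name and toy scores are made up, and the pairwise loop is quadratic, so this is only suitable for small data.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

# Hypothetical scores for 2 positives and 3 negatives.
y_true = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
print(auroc(y_true, scores))
```

A classifier that gives every sample the same score gets 0.5 under this definition, regardless of class balance, which is why AUROC is less misleading than accuracy here.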
1) Yes. This is a well-received strategy to counter imbalanced data. But this strategy works well in neural nets only if you are using SGD.
Another easy way to balance the training data is to use weighted examples. Just scale the per-instance loss by a larger or smaller weight when seeing imbalanced examples. If you use online gradient descent, it can be as simple as using a larger or smaller learning rate when seeing imbalanced examples.
Not sure about 2.

How can I prove my results after mining a dataset?

I wonder if there's any way to prove the correctness of my results after applying some data mining algorithms to a set of data. When I say data mining algorithms I'm talking about the basic algorithms.
If you have many examples, a simple way is to split the available data into three partitions:
training data (around 50%-60% of available examples, randomly chosen);
validation data (20%-25%);
test data (20%-25%).
Training data are used to adjust parameters of the data mining algorithms.
With validation data you can compare models/algorithms/parameters and choose a winner.
Test data can give you a forecast of winner's performance in the "real world" because they are independent (during the training/validation phase you don't make any choice based on test data).
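A minimal sketch of such a split in plain Python, assuming a 60/20/20 division; the function name, the fractions and the fixed seed are illustrative choices rather than a recommendation.

```python
import random

def three_way_split(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the examples and cut them into train / validation / test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is the test set
    return train, val, test

# Hypothetical dataset of 100 example ids.
train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

The test part should be set aside immediately and only touched once, after the winner has been chosen on the validation part.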
Anyway there are many schemes and probably the best place to delve deeper into the matter is http://stats.stackexchange.com
There can be several ways to prove the correctness of your results. Firstly, you have to choose performance criteria:
Accuracy of algorithm
Standard deviation of results
Computation time
Depending on which of these criteria you choose, you have to adopt a different mechanism to prove the correctness of your algorithm.
1. Accuracy of algorithm
For this you have to understand which points can be questioned when you say that your algorithm's accuracy is XY.WZ%.
First question: is your algorithm giving better results because of over-fitting?
To guard against over-fitting, you can divide your data into three parts
training data
validation data
testing data
By doing so, if you get good testing results, you can be reasonably confident that your algorithm did not over-fit. If there is a big difference between training and testing accuracy, that is a sign of over-fitting.
What if you find out that your algorithm over-fits?
You can use several regularization techniques that keep the values of the weight coefficients low and help prevent over-fitting. You can learn more about this in the machine learning lectures by Andrew Ng on Coursera.
Second question: is your data-set fairly chosen?
Suppose you have 100 data points and you divide them into a 50-30-20 split (training-validation-testing). Now the question is which 50 go to training, which 30 to validation, and so on. For different selections you will get different accuracy values. So you should take 5-10 different splits and report the average of the results. This technique is known as cross-validation.
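One common variant of this idea is k-fold cross-validation, sketched below in plain Python. The evaluate callback is a hypothetical stand-in for "train on these indices, return the accuracy on those"; here it is a trivial majority-class baseline on made-up labels so the example stays self-contained.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]

def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    return mean([evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)])

# Hypothetical labels: 30 positives and 70 negatives.
labels = [1] * 30 + [0] * 70

def majority_baseline(train_idx, test_idx):
    """Predict the majority training label; return accuracy on the test fold."""
    ones = sum(labels[i] for i in train_idx)
    majority = 1 if ones * 2 > len(train_idx) else 0
    return sum(1 for i in test_idx if labels[i] == majority) / len(test_idx)

score = cross_validate(len(labels), k=5, evaluate=majority_baseline)
print(f"mean accuracy over 5 folds: {score:.2f}")
```

Every example lands in a test fold exactly once, so the averaged score does not depend on one lucky or unlucky split.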
Another way to support the correctness of your algorithm is to provide a confusion matrix in the case of multiclass classification, and sensitivity and specificity in the case of binary classification. You can look at their wiki pages.
2. Standard deviation of results
If your algorithm is based on random population generation or on heuristics, then you are most likely to get a different solution on each run of the algorithm. In this case, you should provide the standard deviation over multiple runs of your algorithm on the same data-set with the same parameter settings.
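A small sketch of how such a report could be produced with Python's statistics module. The noisy_algorithm function is a hypothetical stand-in for one run of a randomized algorithm; only the seed changes between runs, while the data and parameters stay fixed.

```python
import random
from statistics import mean, stdev

def noisy_algorithm(seed):
    """Hypothetical stand-in: returns the accuracy of one randomized run."""
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.05, 0.05)

# Same data and parameters, ten different random seeds.
scores = [noisy_algorithm(seed) for seed in range(10)]
avg, spread = mean(scores), stdev(scores)
print(f"accuracy: {avg:.3f} +/- {spread:.3f} over {len(scores)} runs")
```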
3. Computation time of algorithm
This might not be important in every case, but if you are comparing your algorithm with other algorithms then you should provide a comparison of computation time. This has nothing to do with the correctness of your algorithm, but it does give a more complete picture of it.
What good are proven results?
At most you will be able to prove that your implementation matches some theoretical mathematical model, or that an approximative algorithm approximates this mathematical model.
But in practice, real data will not satisfy your mathematical assumptions anyway.
Often, the best proof is: does it work?
That is, on real, unseen data. Not on the data that you used to choose your parameters, because then you are prone to overfitting.

Suggestions to improve my normalized accuracy with libsvm

I have a problem when I try to classify my data using libsvm. My training and test data are highly unbalanced. When I do the grid search for the SVM parameters and train on my data with weights for the classes, testing gives an accuracy of 96.8113%. But because the testing data is unbalanced, all the correctly predicted values are from the negative class, which is larger than the positive class.
I tried a lot of things, from changing the weights to changing the gamma and cost values, but my normalized accuracy (which takes into account the positive and negative classes) is lower with each try. Training on 50% positives and 50% negatives with the default grid.py parameters, I get a very low accuracy (18.4234%).
I want to know if the problem is in my description (how to build the feature vectors), in the unbalancing (should I use balanced data in another way?), or whether I should change my classifier.
Better data always helps.
I think that imbalance is part of the problem. But a more significant part of the problem is how you're evaluating your classifier. Evaluating accuracy given the distribution of positives and negatives in your data is pretty much useless. So is training on 50% and 50% and testing on data that is distributed 99% vs 1%.
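To see why raw accuracy is misleading here, consider the sketch below. It assumes that "normalized accuracy" in the question means the average of the per-class recalls (often called balanced accuracy); the function names and the 99-to-1 toy data are made up for the example.

```python
def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        if not idx:
            raise ValueError(f"no samples of class {cls}")
        recalls.append(sum(1 for i in idx if y_pred[i] == cls) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical test set: 1 positive, 99 negatives, and a classifier
# that always answers "negative".
y_true = [1] + [0] * 99
y_pred = [0] * 100
print("accuracy:", accuracy(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
```

The always-negative classifier scores 99% on plain accuracy but only 50% on balanced accuracy, which is what a coin flip would get.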
There are problems in real life that are like the one you're studying (that have a great imbalance of positives to negatives). Let me give you two examples:
Information retrieval: given all documents in a huge collection, return the subset that is relevant to search term q.
Face detection: given a large image, mark all locations where there are human faces.
Many approaches to these types of systems are classifier-based. To compare two classifiers, three tools are commonly used: ROC curves, precision-recall curves and the F-score. These tools give a more principled way to judge when one classifier is working better than another.
