Is there any metric to evaluate output probabilities' precision in classification models? - machine-learning

I am currently developing a model in Python and Keras for a binary classification task (success/failure). My aim is to generate success probabilities for each observation so that I can use them later on in another task.
Do you know of any metric that quantifies the accuracy of these probabilities individually (and not the overall accuracy of the model)?
Thank you in advance.
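One possible direction, offered as a sketch rather than a definitive answer: proper scoring rules such as the Brier score and log loss compare each predicted probability with the observed outcome, so they can be inspected per observation and then averaged. The function names and the toy data below are hypothetical and chosen only for illustration.

```python
import math

def brier_terms(y_true, p_pred):
    """Per-observation squared error between the outcome (0/1) and the predicted probability."""
    return [(p - y) ** 2 for y, p in zip(y_true, p_pred)]

def log_loss_terms(y_true, p_pred, eps=1e-15):
    """Per-observation negative log-likelihood; probabilities are clipped to avoid log(0)."""
    terms = []
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        terms.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return terms

# Toy data (made up): 1 = success, 0 = failure
y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.6, 0.7]

per_obs_brier = brier_terms(y_true, p_pred)
brier_score = sum(per_obs_brier) / len(per_obs_brier)
per_obs_log_loss = log_loss_terms(y_true, p_pred)
mean_log_loss = sum(per_obs_log_loss) / len(per_obs_log_loss)

print(per_obs_brier)   # the last observation (p=0.7 for a failure) has the largest error
print(brier_score, mean_log_loss)
```

Lower is better for both scores. A confident but wrong probability is penalised much more heavily than a hesitant one, which plain accuracy does not capture.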

Related

What does the KNN algorithm do in the training phase?

Unlike other algorithms such as linear regression, KNN doesn't seem to perform any calculation in the training phase. Linear regression, for example, finds its coefficients in the training phase. But what about KNN?
During the training phase, KNN arranges the data (a sort of indexing process) so that it can find the closest neighbors efficiently during the inference phase. Otherwise, it would have to compare each new case with the whole dataset during inference, which would make it quite inefficient.
You can read more about it at: https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbor-algorithms
KNN belongs to the group of lazy learners. As opposed to eager learners such as logistic regression, SVMs, and neural nets, lazy learners just store the training data in memory. Then, during inference, KNN finds the K nearest neighbours in the training data in order to classify the new instance.
KNN is an instance-based method that relies completely on the training examples; in other words, it memorizes all of them. For classification, whenever a new example appears, it computes the distance (typically Euclidean) between the input example and all the training examples, and returns the majority label among the K closest training examples (for K = 1, simply the label of the closest one).
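The behaviour described above can be sketched as a brute-force classifier. This is a minimal illustration, not how scikit-learn implements it; the class name TinyKNN and the toy data are hypothetical.

```python
import math
from collections import Counter

class TinyKNN:
    """Brute-force K-nearest-neighbours classifier (illustrative sketch only)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; no coefficients are estimated.
        self.X_ = list(X)
        self.y_ = list(y)
        return self

    def predict_one(self, x):
        # All the work happens at inference time: compute the Euclidean
        # distance to every stored example and keep the k closest.
        neighbours = sorted(
            zip(self.X_, self.y_),
            key=lambda pair: math.dist(x, pair[0]),
        )[: self.k]
        # Majority vote among the k nearest labels.
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]
model = TinyKNN(k=3).fit(X, y)
print(model.predict_one((0.5, 0.5)))
print(model.predict_one((5.5, 5.5)))
```

Index structures such as KD-trees or ball trees change only how the neighbours are found, not what is predicted.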
KNN is a lazy learner. This means that, unlike other algorithms that learn during their training phase (linear regression, etc.), KNN does not learn anything in the training phase. It just stores the data points in RAM at training time.
"Like in case of linear regressions it finds the coefficients in the training phase. But what about KNN?" --> KNN has no coefficients to fit. Instead, its hyperparameters (the value of K, the distance metric, etc.) are tuned in the testing phase, where the best-performing combination is selected.
Unlike other algorithms, which learn in the training phase and are evaluated in the testing phase, KNN has its parameters selected and evaluated (e.g. with K-fold CV) in the testing phase.
Distance calculation: https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbor-algorithms
KNN Python docs: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

Multi-label classification Keras metrics

Which metric is better for multi-label classification in Keras: accuracy or categorical_accuracy? In this case the last activation function is sigmoid and the loss function is binary_crossentropy.
I would not use accuracy for classification tasks with unbalanced classes.
Especially in multi-label tasks, most of your labels are probably False. That is, each data point has only a small set of labels compared to the cardinality of all the possible labels.
For that reason accuracy is not a good metric: if your model predicts all False (sigmoid activation output < 0.5), you would still measure a very high accuracy.
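To make the point above concrete, here is a small sketch with made-up data: four samples, ten possible labels, one true label each. The helper names are hypothetical.

```python
# Each row is one sample; each column is one of 10 possible labels.
y_true = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
]
# A useless model that predicts "False" for every label.
y_pred = [[0] * 10 for _ in y_true]

def elementwise_accuracy(y_true, y_pred):
    """Fraction of individual label decisions that are correct."""
    pairs = [(t, p) for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp)]
    return sum(t == p for t, p in pairs) / len(pairs)

def recall(y_true, y_pred):
    """Fraction of the true labels that the model actually found."""
    pairs = [(t, p) for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp)]
    positives = sum(t for t, _ in pairs)
    found = sum(1 for t, p in pairs if t == 1 and p == 1)
    return found / positives if positives else 0.0

acc = elementwise_accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(acc, rec)  # 0.9 0.0
```

The model never finds a single label, yet 36 of its 40 label decisions are "correct".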
I would analyze either the AUC or recall/precision at each epoch.
Alternatively, a multi-label task can be seen as a ranking task (as in recommender systems), and you could evaluate precision@k or recall@k, where k is the number of top predicted labels.
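As a rough sketch of the ranking view, precision@k and recall@k for a single sample could be computed as follows. The function name and the example scores are hypothetical.

```python
def precision_recall_at_k(true_labels, scores, k):
    """true_labels: set of correct label indices; scores: one score per label."""
    # Indices of the k highest-scoring labels.
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hits = sum(1 for i in top_k if i in true_labels)
    precision = hits / k
    recall = hits / len(true_labels) if true_labels else 0.0
    return precision, recall

scores = [0.1, 0.8, 0.3, 0.7, 0.05]   # sigmoid outputs for 5 labels
true_labels = {1, 2}                  # labels 1 and 2 are correct

p_at_2, r_at_2 = precision_recall_at_k(true_labels, scores, k=2)
print(p_at_2, r_at_2)  # 0.5 0.5
```

In practice these values would be averaged over all samples in the evaluation set.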
If your Keras back-end is TensorFlow, check out the full list of supported metrics here: https://www.tensorflow.org/api_docs/python/tf/keras/metrics.
Actually, the string accuracy does not refer to one fixed metric in Keras. When you set metrics=['accuracy'] in Keras, the appropriate accuracy metric is inferred automatically from the loss function used. As a result, since you have used binary_crossentropy as the loss function, binary_accuracy will be chosen as the metric.
Now, you should definitely choose binary_accuracy over categorical_accuracy in a multi-label classification task since classes are independent from each other and the prediction for each class should be considered independently of the predictions for other classes.
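The difference between the two metrics can be illustrated in plain Python. This is a sketch that mimics the usual semantics (binary_accuracy thresholds every output at 0.5; categorical_accuracy only compares the argmax of each row), not the Keras implementation itself, and the data is made up.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Each label is judged independently after thresholding."""
    matches = [
        int((p > threshold) == bool(t))
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    ]
    return sum(matches) / len(matches)

def categorical_accuracy(y_true, y_pred):
    """Only the position of the largest value in each row is compared."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)
    matches = [int(argmax(t) == argmax(p)) for t, p in zip(y_true, y_pred)]
    return sum(matches) / len(matches)

y_true = [[1, 1, 0], [0, 1, 1]]
y_pred = [[0.9, 0.2, 0.1], [0.2, 0.7, 0.6]]  # second label of row 1 is missed

bin_acc = binary_accuracy(y_true, y_pred)
cat_acc = categorical_accuracy(y_true, y_pred)
print(bin_acc, cat_acc)
```

Here categorical accuracy reports a perfect score even though one true label was missed, while binary accuracy counts that miss (5 of 6 decisions correct).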

Can linear classification take non binary targets?

I'm following a TensorFlow example that takes a bunch of features (real estate related) and "expensive" (i.e. based on house price) as the binary target.
I was wondering if the target could take more than just a 0 or 1. Let's say, 0 (not expensive), 1 (expensive), 3 (very expensive).
I don't think this is possible as the logistic regression model has asymptotes nearing 0 and 1.
This might be a stupid question, but I'm totally new to ML.
I think I found the answer myself. From Wikipedia:
First, the conditional distribution y|x is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0,1) through the logistic distribution function because logistic regression predicts the probability of particular outcomes.
Logistic Regression is defined for binary classification tasks. (For more details, please see logistic_regression.) For multi-class classification problems, you can use the Softmax Classification algorithm. The following tutorial shows how to write a Softmax Classifier with the TensorFlow library.
Softmax_Regression in Tensorflow
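To show why softmax generalises the logistic function to more than two classes, here is a small standard-library sketch. The logits and class names are made up for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps a single score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Maps one score per class to probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["not expensive", "expensive", "very expensive"]
logits = [0.5, 2.0, 1.0]                  # hypothetical linear-model outputs
probs = softmax(logits)
predicted = classes[probs.index(max(probs))]
print(probs, predicted)

# With two classes and logits [z, 0], softmax reduces to the logistic function.
z = 1.3
two_class = softmax([z, 0.0])[0]
print(two_class, sigmoid(z))
```

Each class gets its own probability, so the target is no longer limited to 0 or 1.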
However, if your data set is not linearly separable (which is the case for most real-world datasets), you have to use an algorithm that can handle nonlinear decision boundaries. An algorithm such as a Neural Network or an SVM with kernels would be a good choice. The following IPython notebook shows how to create a simple Neural Network in TensorFlow.
Neural Network in Tensorflow
Good Luck!

Logistic Regression only recognizing predominant classes

I am participating in the Kaggle San Francisco Crime competition and I am currently trying a number of different classifiers to benchmark their performance. I am using a LogisticRegressionClassifier from sklearn, without any parameter tuning, and I noticed from sklearn.metrics.classification_report that it is only predicting the predominant classes, i.e. the classes which have the highest number of occurrences in my training set.
Intuition tells me that this has to do with parameter tuning, but I am not sure which parameters I have to tweak in order to make the classifier more aware of the less predominant classes (LogisticRegressionClassifier has quite a few). At the moment it is predicting only 3 classes out of 38 or something like that, so it definitely needs improvement.
Any ideas?
If your model is predicting only the predominant classes, then you are facing the problem of imbalanced classes. Here are some good reads on tackling this in machine learning.
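One common remedy for class imbalance is to weight classes inversely to their frequency. The sketch below uses the formula n_samples / (n_classes * class_count), which is the heuristic behind scikit-learn's class_weight='balanced' option; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Toy, made-up label distribution: one class dominates.
labels = ["theft"] * 8 + ["assault"] * 2
weights = balanced_class_weights(labels)
print(weights)  # {'theft': 0.625, 'assault': 2.5}
```

Errors on the rare class then cost four times as much as errors on the frequent one, which pushes the classifier away from predicting only the predominant classes.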
Logistic Regression is a binary classifier and uses the one-vs-all or one-vs-one technique for multiclass classification, which does not work well if you have a large number of output classes (33 in your case). Try using another classifier. For a start, use a softmax classifier, which is an extension of the logistic classifier with support for multi-class classification. In scikit-learn, set the multi_class parameter to multinomial to use softmax regression.
Another way to improve your model could be to use GridSearch for parameter tuning.
On a side note, I would recommend that you try other models as well.

What is the difference between classification and prediction?

What is the difference between classification and prediction in machine learning?
Classification is the prediction of a categorical variable within a predefined vocabulary based on training examples.
The prediction of numerical (continuous) variables is called regression.
In summary, classification is one kind of prediction, but there are others. Hence, prediction is a more general problem.
Functionality
Classification is about determining a (categorical) class (or label) for an element in a dataset.
Prediction is about predicting a missing/unknown element (continuous value) of a dataset.
Working Strategy
In classification, data is grouped into categories based on a training dataset.
In prediction, a classification/regression model is built to predict the outcome (continuous value).
Example
In a hospital, the grouping of patients based on their medical record or treatment outcome is considered classification, whereas, if you use a classification model to predict the treatment outcome for a new patient, it is considered a prediction.
Classification is the process of identifying the category or class label to which a new observation belongs.
Prediction is the process of identifying the missing or unavailable numerical data for a new observation.
That is the key difference between classification and prediction. Prediction is not concerned with the class label, as classification is.
Predictions can be made using both regression and classification models. Once a model is trained on the training data, the next phase is to make predictions for data whose real/ground-truth values are either unknown or kept aside to evaluate the performance of the model. If the problem is about determining classes/labels/categories, then it is classification, and if the problem is about determining real (numeric) values, then it is regression. In a nutshell, predictions are made with both classification and regression models on the test data set.
1. Prediction is like saying something that may happen in the future. Prediction may be a kind of classification.
2. Prediction is mostly based on our assumptions about the future,
whereas
1. Classification is the categorization of the things or data that we already have. This categorization can be based on any kind of technique or algorithm.
2. Classification is mostly based on our current or past assumptions.
