I read this line today:
Every regression gets better with the addition of more features or variables... But adding more features increases complexity and reduces interpretability of the model as well.
I am unable to understand what interpretability is (I searched for it on Google but still did not get it).
Please help, thank you.
I would say that interpretability in a regression problem is when you can explain the result of your model to non-statisticians / domain experts.
For example: you try to predict people's height from many variables, including sex. If you use linear regression, you will be able to say that the model adds 20 cm (again, for example) to the predicted height if the person is a man (compared to a woman). The domain expert will understand the relationship between the explanatory variables and the predicted result, without understanding statistics or how a linear regression works.
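To make this concrete, here is a minimal sketch (the data, the column meanings, and the 20 cm effect are all invented for illustration) showing how that kind of statement can be read directly off a fitted linear regression coefficient:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy data (entirely made up): height explained by age and sex (1 = man, 0 = woman).
    rng = np.random.default_rng(0)
    n = 200
    age = rng.uniform(20, 60, n)
    sex = rng.integers(0, 2, n)
    height = 150 + 0.1 * age + 20 * sex + rng.normal(0, 5, n)

    X = np.column_stack([age, sex])
    model = LinearRegression().fit(X, height)

    # The coefficient on "sex" is the interpretable statement:
    # "being a man adds about this many cm to the predicted height".
    print("effect of sex (cm):", round(model.coef_[1], 1))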
In addition, I disagree with the claim that adding more features or variables always improves the regression result.
What is a better regression? An improvement in the chosen metrics? On the training set or the test set? A "better regression" doesn't mean anything on its own...
If we assume that a better regression is one that predicts the target better on a new dataset, more variables do not always improve predictive power, especially when there is no regularization, when an added feature leaks future information, and in many other cases.
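As a rough illustration of that last point (purely synthetic data, no claim about any particular dataset), adding irrelevant noise features to an unregularized linear regression keeps improving the training fit while the test error gets worse:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(42)
    n_train, n_test = 50, 200
    x_signal = rng.normal(size=(n_train + n_test, 1))
    y = 3.0 * x_signal[:, 0] + rng.normal(scale=1.0, size=n_train + n_test)

    for n_noise in [0, 10, 30, 45]:
        noise = rng.normal(size=(n_train + n_test, n_noise))
        X = np.hstack([x_signal, noise])
        model = LinearRegression().fit(X[:n_train], y[:n_train])
        train_mse = mean_squared_error(y[:n_train], model.predict(X[:n_train]))
        test_mse = mean_squared_error(y[n_train:], model.predict(X[n_train:]))
        print(f"{n_noise:2d} noise features | train MSE {train_mse:.2f} | test MSE {test_mse:.2f}")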
I have always been using the R² score metric. I know there are several evaluation metrics out there, and I have read several articles about them. Since I'm still a beginner in machine learning, I'm still very confused about:
When to use each of them. Does it depend on our case? If yes, please give me an example.
I read an article that said the R² score is not straightforward and that we need other things to measure the performance of our model. Does that mean we need more than one evaluation metric in order to get better insight into our model's performance?
Is it recommended to measure model performance with just one evaluation metric?
This article also said that knowing the distribution of our data and our business goal helps us choose appropriate metrics. What does that mean?
How do we know, for each metric, that the model is 'good' enough?
There are different evaluation metrics for regression problems, such as the ones below.
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)
R² or Coefficient of Determination
Mean Squared Percentage Error (MSPE)
and so on.
As you mentioned, you need to choose among them based on your problem type, what you want to measure, and the distribution of your data.
To do this, you need to understand how these metrics evaluate the model. You can check the definitions and pros/cons of evaluation metrics from this nice blog post.
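If it helps, here is a minimal sketch of how the first four metrics in that list can be computed with scikit-learn (the arrays below are just placeholders):

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    y_true = np.array([3.0, 5.0, 2.5, 7.0])   # placeholder ground truth
    y_pred = np.array([2.8, 5.4, 2.9, 6.1])   # placeholder predictions

    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)                        # RMSE is just the square root of MSE
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)

    print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")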
R² shows how much of the variation in your target variable is explained by the independent variables. A good model can have an R² score close to 1.0, but it does not have to. Models with a low R² can also have a low MSE. So, to assess the predictive power of your model, it is better to use MSE, RMSE, or other metrics alongside R².
No, you can use multiple evaluation metrics. The important thing is that, if you compare two models, you need to use the same test dataset and the same evaluation metrics.
For example, if you want to penalize bad predictions heavily, you can use the MSE metric, because it measures the average squared error of the predictions. On the other hand, if your data contains many outliers, MSE puts too much weight on those examples.
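A small, made-up example of that effect: with a single badly wrong prediction, MSE blows up much more than MAE does:

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error

    y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
    y_small = np.array([10.5, 11.5, 11.2, 12.8, 12.1])    # small errors everywhere
    y_outlier = np.array([10.5, 11.5, 11.2, 12.8, 30.0])  # one prediction badly off

    for name, y_pred in [("small errors", y_small), ("one outlier", y_outlier)]:
        mse = mean_squared_error(y_true, y_pred)
        mae = mean_absolute_error(y_true, y_pred)
        print(f"{name:12s}  MSE={mse:6.2f}  MAE={mae:5.2f}")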
The definition of a good model changes with the complexity of your problem. For example, if you train a model that predicts heads or tails and it achieves 49% accuracy, it is not good enough, because the baseline for this problem is 50%. For a different problem, 49% accuracy may be enough. So, in summary, it depends on your problem, and you need to define or estimate that human (baseline) threshold.
I am trying to predict the price of a house, so I added no-of-rooms as one of the variables. The values of that variable were (3, 2, 1) when I was training the model. Now I am passing no-of-rooms = 6 to get an output (a value that was not used before to get a prediction). How will it produce an output for a new value? Does it only consider the variables other than no-of-rooms? I used boosted decision tree regression as the model.
The short answer is that when you train your model on a set of features and then use a test set to run predictions, yes, it will be able to utilize/understand feature values that the model hasn't previously seen during training. If your test set contains large outliers that differ significantly from what the model saw during training, this will affect accuracy, but it will still attempt a prediction.
This is less of an Azure Machine Learning question and more one of machine learning basics (or really just the basics of how regression works). I would do some research on both "linear regression" and the concept of "overfitting in machine learning". These are two very basic conceptual topics that will help with your understanding. Understanding regression will help you see why a model can use a value it hasn't previously seen to create a prediction.
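As a sketch of what happens mechanically (using scikit-learn's GradientBoostingRegressor as a stand-in for the Azure boosted decision tree module, with made-up prices), a tree ensemble trained on no-of-rooms values 1-3 will still return a prediction for no-of-rooms = 6; tree-based models simply cannot extrapolate beyond the split thresholds they learned, so the result tends to look like the prediction for the largest room count seen in training:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Made-up training data: (no_of_rooms, area_sqft) -> price.
    X_train = np.array([[1, 500], [2, 800], [3, 1200],
                        [1, 550], [2, 850], [3, 1250]])
    y_train = np.array([100_000, 160_000, 240_000, 110_000, 170_000, 250_000])

    model = GradientBoostingRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)

    # no_of_rooms = 6 was never seen in training, but a prediction is still produced.
    print(model.predict(np.array([[6, 2400]])))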
I am using a Logistic Regression (in scikit) for a binary classification problem, and am interested in being able to explain each individual prediction. To be more precise, I'm interested in predicting the probability of the positive class, and having a measure of the importance of each feature for that prediction.
Using the coefficients (Betas) as a measure of importance is generally a bad idea as answered here, but I'm yet to find a good alternative.
So far the best I have found are the following 3 options:
Monte Carlo Option: Fixing all other features, re-run the prediction replacing the feature we want to evaluate with random samples from the training set. Do this a large number of times. This would establish a baseline probability for the positive class. Then compare with the probability of the positive class of the original run. The difference is a measure of Importance of the feature.
"Leave-one-out" classifiers: To evaluate the importance of a feature, first create a model which uses all features, and then another that uses all features except the one being tested. Predict the new observation using both models. The difference between the two would be the importance of the feature.
Adjusted betas: Based on this answer, ranking the importance of the features by 'the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.'
All options (using betas, Monte Carlo and "Leave-one-out") seem like poor solutions to me.
The Monte Carlo is dependent on the distribution of the training set, and I cannot find any literature to support it.
The "leave one out" would be easily tricked by two correlated features (when one were absent, the other one would step in to compensate, and both would be given 0 importance).
The adjusted betas sounds plausible, but I cannot find any literature to support it.
Actual question: What is the best way to interpret the importance of each feature, at the moment of a decision, with a linear classifier?
Quick note #1: for Random Forests this is trivial, we can simply use the prediction + bias decomposition, as explained beautifully in this blog post. The problem here is how to do something similar with linear classifiers such as Logistic Regression.
Quick note #2: there are a number of related questions on stackoverflow (1 2 3 4 5). I have not been able to find an answer to this specific question.
If you want the importance of the features for a particular decision, why not simulate the decision_function (which is provided by scikit-learn, so you can test whether you get the same value) step by step? The decision function for linear classifiers is simply:
intercept_ + coef_[0]*feature[0] + coef_[1]*feature[1] + ...
The importance of a feature i is then just coef_[i]*feature[i]. Of course, this is similar to looking at the magnitude of the coefficients, but since it is multiplied by the actual feature value, and it is also what happens under the hood, it might be your best bet.
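Here is a minimal sketch of that decomposition (the data and feature count are invented just to have a fitted model); the per-feature contributions, together with the intercept, sum to the value returned by decision_function:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy binary-classification data, purely for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

    clf = LogisticRegression().fit(X, y)

    x_new = X[0]                          # the single decision we want to explain
    contributions = clf.coef_[0] * x_new  # coef_[i] * feature[i] for each feature
    decision = clf.intercept_[0] + contributions.sum()

    print("per-feature contributions:", contributions)
    print("reconstructed decision value:", decision)
    print("sklearn decision_function:", clf.decision_function(x_new.reshape(1, -1))[0])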
I suggest using eli5, which already has similar things implemented.
For your question:
Actual question: What is the best way to interpret the importance of each feature, at the moment of a decision, with a linear classifier?
I would say the answer comes from the function show_weights() from eli5.
Furthermore, this can be used with many other classifiers.
For more info, you can see this related question.
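For reference, a rough sketch of how this might look (assuming eli5 is installed and the code is run in a notebook, where show_weights/show_prediction render HTML tables; the toy data and feature names are invented):

    import numpy as np
    import eli5
    from sklearn.linear_model import LogisticRegression

    # Toy fitted model, just to have something to explain.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] - X[:, 2] > 0).astype(int)
    clf = LogisticRegression().fit(X, y)
    feature_names = ["f0", "f1", "f2"]

    # Global weights of the classifier (what show_weights() reports).
    eli5.show_weights(clf, feature_names=feature_names)

    # Per-decision explanation for a single example, which is closer to the
    # "importance at the moment of a decision" asked about in the question.
    eli5.show_prediction(clf, X[0], feature_names=feature_names)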
I would like to know what techniques and metrics are used to evaluate how accurate/good an algorithm is, and how to use a given metric to draw a conclusion about an ML model.
One way to do this is to use precision and recall, as defined here on Wikipedia.
Another way is to use the accuracy metric, as explained here. So, what I would like to know is whether there are other metrics for evaluating an ML model.
A while ago, I compiled a list of metrics used to evaluate classification and regression algorithms, in the form of a cheatsheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
The technique is mostly to run an algorithm on some data to get a model, then apply that model to new, previously unseen data, evaluate the metric on that data set, and repeat.
Some techniques (actually resampling techniques from statistics):
Jackknife
Cross-validation
K-fold cross-validation
Bootstrap
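For example, k-fold cross-validation with scikit-learn, evaluating a chosen metric on each held-out fold (the dataset and classifier here are just placeholders):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    clf = LogisticRegression(max_iter=5000)

    # 5-fold cross-validation: train on 4 folds, score the held-out fold, repeat.
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print("per-fold F1:", scores.round(3), "mean:", scores.mean().round(3))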
ML in general is quite a vast field, but I'll try to answer anyway. The Wikipedia definition of ML is the following:
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context, learning can be defined as the parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. When the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is also known.
Let's suppose your problem is to recognize words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I assume this case to keep it simple). You'd record X words N times each and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding, for the moment, what your algorithm looks like.
Now, on the one hand, depending on the algorithm, if you feed it one of the remaining repetitions, it may give you some certainty estimate, which can be used to characterize the recognition of that single repetition. On the other hand, you can use all of the remaining repetitions to test the learned algorithm: for each repetition, you pass it to the algorithm and compare the expected output with the actual output. In the end, you'll have an accuracy value for the learned algorithm, calculated as the quotient of correct to total classifications.
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good starting point for further reading would be Pattern Recognition and Machine Learning by Christopher M. Bishop.
There are various metrics for evaluating the performance of an ML model, and there is no rule that there are only 20 or 30 of them. You can create your own metrics depending on your problem; there are many cases where solving a real-world problem requires a custom metric.
Coming to the existing ones, which are already listed in the first answer, I would just highlight the merits and demerits of each metric for a better understanding.
Accuracy is the simplest of the metrics and is commonly used. It is the number of correctly classified points divided by the total number of points in your dataset. This applies to a two-class problem where some points belong to class 1 and some to class 2. It is not preferred when the dataset is imbalanced, because it is biased towards the majority class, and in that case it is not very informative.
Log loss is a metric that works on probability scores, which give you a better understanding of how confidently a specific point is assigned to class 1. The best part of this metric is that it is built into logistic regression, which is a popular ML technique.
The confusion matrix is best used for a two-class classification problem: it gives four numbers, and the diagonal numbers help you get an idea of how good your model is. From this matrix, other metrics such as precision, recall, and F1-score can be derived, and these are interpretable.
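To make those three concrete, here is a minimal sketch computing accuracy, log loss, and the confusion matrix with scikit-learn (the labels and probabilities are made up):

    import numpy as np
    from sklearn.metrics import accuracy_score, log_loss, confusion_matrix

    y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                   # placeholder labels
    y_prob = np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.2, 0.7, 0.6])   # predicted P(class 1)
    y_pred = (y_prob >= 0.5).astype(int)

    print("accuracy:", accuracy_score(y_true, y_pred))
    print("log loss:", round(log_loss(y_true, y_prob), 3))
    print("confusion matrix (rows = true, cols = predicted):")
    print(confusion_matrix(y_true, y_pred))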
I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The independent variables, or predictors, are all the phrases used in these papers (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with very high accuracy (more than 90%), while removing it drops the accuracy to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly, you are asking whether a feature should be removed because it is a good predictor (it makes your classifier work better). The answer is short and simple: do not remove it. In fact, the whole point is to find exactly such features.
The only reason to remove such a feature would be that this phenomenon occurs only in the training set and not in real data. But in that case you have the wrong data, which does not represent the underlying data distribution, and you should gather better data or "clean" the current data so that it has characteristics analogous to the "real" data.
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued! In such a case, none of the papers you'd like to test your algorithm with will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
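A short sketch of that ranking (the tiny corpus, labels, and phrases are invented; with real data you would use your own document-term matrix):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Tiny made-up corpus: paper text snippets and accepted (1) / rejected (0) labels.
    docs = ["great novel method results", "paper accepted on revision",
            "weak evaluation poor results", "strong baselines novel idea"]
    labels = [1, 1, 0, 1]

    vec = CountVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(docs)
    clf = LogisticRegression().fit(X, labels)

    # Rank features by learned weight: most positive first (indicative of "accepted").
    names = np.array(vec.get_feature_names_out())
    order = np.argsort(clf.coef_[0])[::-1]
    for name, w in zip(names[order][:5], clf.coef_[0][order][:5]):
        print(f"{name:25s} {w:+.3f}")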
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.
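A rough sketch of that leave-one-feature-out procedure (the dataset and classifier are placeholders, and as noted above this requires one model fit per feature, so it does not scale to large feature sets):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    def val_accuracy(cols):
        clf = LogisticRegression(max_iter=5000).fit(X_tr[:, cols], y_tr)
        return accuracy_score(y_val, clf.predict(X_val[:, cols]))

    all_cols = list(range(X.shape[1]))
    baseline = val_accuracy(all_cols)

    # Drop each feature in turn and measure how much validation accuracy falls.
    drops = {i: baseline - val_accuracy([c for c in all_cols if c != i]) for i in all_cols}
    most_important = max(drops, key=drops.get)
    print("baseline accuracy:", round(baseline, 3))
    print("largest drop when removed: feature", most_important, "->", round(drops[most_important], 4))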