How to choose a scoring method for regression? - machine-learning

Besides R-squared, are there any other effective and widely recognized scoring methods for evaluating a regressor such as a Gradient Boosting Regressor, Random Forest Regressor, SVR, and so on?
If there are many different scoring methods, what factors should we consider when choosing among them? Thank you!

You need to take care of overfitting, and R-squared alone won't do that. You may want to use adjusted R-squared, or a least-squares error with a penalty term on the weights, something like
(y - w.x)^2 + lambda*(w.w)
where lambda is a regularization weight that restricts the model, y is the true output, x is the input, and w is the vector of learned weights, so that w.x is the predicted output.
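As a hedged illustration of comparing standard regression metrics, here is a minimal sketch using scikit-learn; the data and model are made up, and adjusted R-squared is computed by hand since scikit-learn does not provide it directly:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Toy data: 200 samples, 5 features (purely illustrative).
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 0.0]) + rng.randn(200) * 0.3

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
n, p = X_test.shape
# Adjusted R-squared penalizes features that do not actually help.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print("R^2:", r2)
print("Adjusted R^2:", adj_r2)
print("MSE:", mean_squared_error(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
```

MSE punishes large errors more heavily than MAE, which is one concrete factor to weigh when choosing among them.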

Related

How can I get the joint probability of a new image and a latent variable from a generative model?

I just learned that a generative model tries to learn p(x|z)p(z) = p(x,z).
But after studying some sample code for generative models such as VAEs and GANs, I found that the output of the model is the generated image x, which is a 2D matrix.
My understanding is that the content of the matrix represents the probability of every pixel given the latent variable; is this right?
If it is, is it possible to get the joint probability p(x,z) between the latent variables z and a whole image x from a generative model?
Thanks!
What a generative model is trying to learn is just p(x). Here p(x|z) = 1 if g(z) = x and 0 otherwise, because the generators of GANs and (typical) VAEs are deterministic mappings and therefore always map the same input to the same output.
Extracting the probability of x is not an easy task, though, and depends on the approach. With GANs you can approximate it by sampling from the model: e.g. you sample 1000 images and count how often each image occurred; that image then has a probability of occurrences / 1000. By the law of large numbers you will eventually recover the actual probability distribution of your generator this way.
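A minimal sketch of that counting idea, using a toy discrete "generator" (a real GAN outputs continuous images, where exact repeats essentially never occur, so in practice you would have to bin or discretize; the function g and the sample count are my own assumptions for illustration):

```python
import numpy as np
from collections import Counter

rng = np.random.RandomState(0)

def g(z):
    # Toy deterministic "generator": maps a latent z in [0, 1) to one of
    # four discrete outputs. Stands in for a trained GAN generator.
    return int(np.floor(z * 4))

# Sample latents, push them through the generator, count outcomes.
samples = [g(z) for z in rng.uniform(0, 1, size=1000)]
counts = Counter(samples)

# occurrences / n_samples approximates p(x) by the law of large numbers.
for x, c in sorted(counts.items()):
    print(f"x = {x}: estimated p(x) = {c / 1000:.3f}")
```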
If you want an exact way to calculate probabilities, you can use flow networks such as GLOW or RealNVP, which optimize log(p(x)) directly and have a way to recover p(x).

Inverse prediction in Machine Learning

I have a question about inverse prediction in machine learning/data science. Here is an example to illustrate it: I have 20 input features X = (x0, x1, ... x19) and 3 output variables Y = (y0, y1, y2). The amount of training/test data is usually small, e.g. fewer than 1000 items or even fewer than 100 in the training set.
In general, using a machine learning toolbox (such as scikit-learn), I can train models (such as random forests, linear/polynomial regression, and neural networks) from X --> Y. But what I actually want to know is, for example, how I should set X so that the y1 values fall in a specific range (for example y1 > 100).
Does anyone know how to solve this kind of "inverse prediction"? Two approaches come to mind:
1. Train the model in the normal way, X --> Y, then set up a dense mesh in the high-dimensional X space (20 dimensions in this example). Use every point in this mesh as input to the trained model and select all points where the predicted y1 > 100. Finally, use some method such as clustering to look for patterns in the selected points.
2. Directly learn a model from Y to X. Then set up a dense mesh in the Y space, with y1 > 100, and use the trained model to compute the corresponding X points.
The second approach might be fine when Y is also high-dimensional. But in my application Y is very low-dimensional and X is very high-dimensional, which makes me think method 2 is not very practical.
Does anyone have other ideas? This seems like it should be fairly common in industry, so perhaps some people have met a similar situation before.
Thank you!
From what I understand of your needs, approach #1 is an excellent fit for this problem. I recommend using a simple binary SVM classifier to discriminate good from bad X vectors. SVMs work well in high-dimensional spaces, and reading out the coefficients is easy in most SVM interfaces.
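A hedged sketch of approach #1 combined with that SVM readout. A full mesh in 20 dimensions is infeasible, so random sampling stands in for it here; the data, the forward relation, and the y1 > 100 threshold are all illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)

# Illustrative training data: 20 inputs, one target y1.
X_train = rng.uniform(-1, 1, size=(500, 20))
y1_train = 60 * X_train[:, 0] + 50 * X_train[:, 1] + 100  # hypothetical relation

forward = RandomForestRegressor(random_state=0).fit(X_train, y1_train)

# Sample candidate points in X space instead of a full 20-D mesh.
X_candidates = rng.uniform(-1, 1, size=(20000, 20))
good = forward.predict(X_candidates) > 100  # the y1 > 100 criterion

# Train a linear SVM to discriminate good vs. bad X vectors,
# then read out which inputs matter most from the coefficients.
svm = LinearSVC(dual=False).fit(X_candidates, good)
print("most influential features:", np.argsort(-np.abs(svm.coef_[0]))[:5])
```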
A similar note that may be useful:
In inverse/backward prediction, we can predict Y --> X with accuracy similar to the direct/forward prediction X --> Y simply by solving the system of equations that links X and Y through the fitted weights and intercepts. This works best for linear problems of the form AX = B. Note that generic Python code for inverse prediction can carry considerable numerical error, whereas directly solving the (n x n) system of equations is usually the better choice with suitable accuracy.
Regards
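As I read that note, for a purely linear model the inverse step is just a linear solve. A minimal sketch with NumPy; the square matrix A is an assumption (the solve needs as many outputs as inputs, otherwise np.linalg.lstsq would be the fallback):

```python
import numpy as np

rng = np.random.RandomState(0)

# Forward linear model: y = A x + b, with A square (n outputs = n inputs).
n = 3
A = rng.randn(n, n)
b = rng.randn(n)

x_true = rng.randn(n)
y = A @ x_true + b

# Inverse prediction: recover x from y by solving A x = y - b directly.
x_recovered = np.linalg.solve(A, y - b)
print(np.allclose(x_true, x_recovered))  # True (up to numerical error)
```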

Can a machine learning model provide information about mean and standard deviation of data on which it was trained?

Consider a parametric binary classifier (such as logistic regression, an SVM, etc.) trained on a dataset (say, one containing two features, e.g. blood pressure and cholesterol level). The dataset is thrown away, and the trained model can only be used as a black box (no tweaks can be made, and no inside information can be gathered from the trained model). Only a set of data points can be provided and their labels predicted.
Is it possible to get information about the mean and/or standard deviation and/or range of the features of the dataset on which this model was trained? If yes, how? If no, why not?
Thank you for your response! :)
An SVM does not provide any information about the data statistics. It is a maximum-margin classifier: it finds the best separating hyperplane between the two classes in feature space, as a linear combination of "support vectors". If you use kernel functions, this combination lives in the kernel space, not even in the original feature space. An SVM has no straightforward probabilistic interpretation whatsoever.
Logistic regression is a discriminative classifier and models the conditional probability p(y|x,w), where y is your label, x is your data, and w are the weights. After maximum-likelihood training you are left with w, which again defines a discriminator (hyperplane) in feature space, so you cannot recover the feature statistics from it.
The following can be considered. Use a Gaussian classifier: assume that your class is produced by the prior class probability p(y), and that a class-conditional density p(x|y,w) produces your data. Then by Bayes' rule you have p(y|x,w) = p(y)p(x|y,w)/p(x). If you define the class-conditional density p(x|y,w) as Gaussian, its parameter set w will consist of the mean vector m and covariance matrix C of x, assuming x is produced by class y. But remember that this works only under the assumption that the current data vector belongs to a specific class. Conditioned on w, a better option for the mean vector is E[x|w], the expectation of x with respect to p(x|w). It comes down to a weighted average of the mean vectors for the classes y=0 and y=1, weighted by their prior class probabilities. The same should work for the covariance as well, but it needs to be derived properly; I am not 100% sure right now.
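To make that concrete, a hedged sketch with scikit-learn's GaussianNB, which does store per-class means and prior probabilities after fitting; the overall mean is then the prior-weighted average described above. The data and feature names are illustrative:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)

# Illustrative two-feature data (e.g. blood pressure, cholesterol).
X0 = rng.normal([120, 180], [10, 20], size=(100, 2))  # class 0
X1 = rng.normal([150, 240], [12, 25], size=(100, 2))  # class 1
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

clf = GaussianNB().fit(X, y)

# Per-class means E[x|y] and prior class probabilities p(y).
# (Per-class variances are in clf.var_; sigma_ in older scikit-learn.)
class_means = clf.theta_          # shape (n_classes, n_features)
priors = clf.class_prior_

# E[x] = sum_y p(y) E[x|y]: the prior-weighted average of class means.
overall_mean = priors @ class_means
print("estimated overall feature means:", overall_mean)
```

Note this only answers the question when the model is such a generative classifier; a true black box as described in the question exposes none of these attributes.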

When to use generative algorithms in machine learning?

Suppose I have a training set made by (x, y) samples.
To apply a generative algorithm, let's say Gaussian discriminant analysis, must I assume that
p(x|y) ~ Normal(mu, sigma) for every possible sigma
or do I just need to know whether x ~ Normal(mu, sigma) given y?
How can I evaluate whether p(x|y) follows a multivariate normal distribution well enough (up to some threshold) for me to use a generative algorithm?
That's a lot of questions.
To apply a generative algorithm, let's say Gaussian discriminant
analysis, must I assume that
p(x|y) ~ Normal(mu, sigma) for every possible sigma
No, you must assume that's true for some mu, sigma pair. In practice you won't know what mu and sigma are, so you'll need to either estimate them (frequentist maximum likelihood / maximum a posteriori estimates) or, even better, incorporate the uncertainty about your parameter estimates into the predictions (the Bayesian methodology).
How can I evaluate whether p(x|y) follows a multivariate normal distribution?
Classically, using a goodness-of-fit test. If the dimensionality of x is more than a handful, though, this won't work: standard tests rely on counts of items in bins, and the number of bins you need in high dimensions is astronomical, so the expected counts per bin are very low.
A better idea is to say the following: what are my options for modelling the (conditional) distribution of x? You can compare between these options using model comparison techniques. Read up on model checking and comparison.
Finally, your last point:
well enough (up to some threshold) for me to use a generative algorithm?
The paradox of many generative methods, including Fisher's linear discriminant analysis and the naive Bayes classifier, is that the classifier can work very well even though the model is a poor fit to the data. There is no particularly sound reason why this should be the case, but many have observed it to be empirically true. Whether it works can be checked much more easily than whether the assumed distribution explains the data well: just split your data into training and test sets and find out!
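A hedged sketch of that last suggestion, using a Gaussian naive Bayes classifier on real-valued features that are certainly not exactly Gaussian within each class; the dataset and split are illustrative choices, not from the original post:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Real-valued features whose class-conditional distributions are
# only approximately Gaussian at best.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)

# The model's distributional assumptions are imperfect, but held-out
# accuracy tells you directly whether it works well enough.
print("test accuracy:", clf.score(X_test, y_test))
```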

Can someone explain to me the difference between a cost function and the gradient descent equation in logistic regression?

I'm going through the ML class on Coursera covering logistic regression and also the Manning book Machine Learning in Action, and I'm trying to learn by implementing everything in Python.
I'm not able to understand the difference between the cost function and the gradient. There are examples on the net where people compute the cost function, and then there are places where they don't and just go with the gradient descent update w := w - alpha * grad_w f(w).
What is the difference between the two, if any?
Whenever you train a model on your data, you produce new (predicted) values for a specific target. That target, however, already has real values in the dataset, and we know that the closer the predicted values are to the corresponding real values, the better the model.
We use a cost function to measure how close the predicted values are to the corresponding real values.
We should also keep in mind that the weights of the trained model are what determine the predictions. Imagine that our model is y = 0.9*X + 0.1; the predicted value is then nothing but 0.9*X + 0.1 for different values of X.
[0.9 and 0.1 are just illustrative values.]
So, taking Y as the real value corresponding to this X, the cost formula measures how close 0.9*X + 0.1 is to Y.
We are responsible for finding the better weights (0.9 and 0.1 here) so that the model achieves the lowest cost, i.e. predicted values as close as possible to the real ones.
Gradient descent is an optimization algorithm (there are others), and its responsibility is to find the minimum cost while trying the model with different weights, or indeed, while updating the weights.
We first run the model with some initial weights; gradient descent then updates the weights over thousands of iterations, evaluating the cost of the model at each step, until it finds the minimum cost.
One point: gradient descent does not minimize the weights, it just updates them. The algorithm is looking for the minimum cost.
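To tie the two together, here is a minimal sketch, not from the original posts: a squared-error cost for a toy model y = m*x + c, and a gradient-descent loop that repeatedly updates the weights to reduce that cost (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy data generated from y = 0.9*x + 0.1 plus noise.
x = rng.uniform(-1, 1, 100)
y = 0.9 * x + 0.1 + rng.randn(100) * 0.05

def cost(m, c):
    # Mean squared error: how close predictions m*x + c are to y.
    return np.mean((y - (m * x + c)) ** 2)

m, c, alpha = 0.0, 0.0, 0.1  # initial weights and learning rate
for _ in range(1000):
    pred = m * x + c
    # Gradient of the cost with respect to each weight.
    grad_m = -2 * np.mean((y - pred) * x)
    grad_c = -2 * np.mean(y - pred)
    # Gradient descent step: update the weights, don't "minimize" them.
    m -= alpha * grad_m
    c -= alpha * grad_c

print(f"m = {m:.2f}, c = {c:.2f}, cost = {cost(m, c):.5f}")
```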
A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function. If your cost is a function of K variables, then the gradient is the length-K vector that defines the direction in which the cost is increasing most rapidly. So in gradient descent, you follow the negative of the gradient to the point where the cost is a minimum. If someone is talking about gradient descent in a machine learning context, the cost function is probably implied (it is the function to which you are applying the gradient descent algorithm).
It's strange to think about, but there is more than one measure of how "accurately" a line fits the data points.
To assess how accurately a line fits the data, we have a "cost" function, which compares predicted vs. actual values and assigns a "penalty" for how wrong the prediction is:
penalty = cost_function(predicted, actual)
A naive cost function might just take the difference between the predicted and actual values.
More sophisticated functions will square the value, since we'd rather have many small errors than one large error.
Additionally, each point has a different "sensitivity" to moving the line. Some points react very strongly to movement; others react less strongly.
Often you can make a tradeoff and move TOWARD a point that is sensitive and AWAY from a point that is NOT sensitive. In that scenario, you get more than you give up.
The "gradient" is a way of measuring how sensitive each point is to moving the line.
This article does a good job of describing WHY there is more than one measure, and WHY some points are more sensitive than others:
https://towardsdatascience.com/wrapping-your-head-around-gradient-descent-with-pictures-3fbd810235f5?source=friends_link&sk=7117e5de8c66bd4a4c2bb2a87a928773
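As a hedged illustration of the "many small errors vs. one large error" point above (the numbers are made up):

```python
import numpy as np

# Residuals: many small errors vs. one large error, same absolute total.
many_small = np.array([1.0, 1.0, 1.0, 1.0])
one_large = np.array([0.0, 0.0, 0.0, 4.0])

for name, errors in [("many small", many_small), ("one large", one_large)]:
    # Naive cost: plain differences. Squared cost: penalizes big misses more.
    print(name,
          "absolute:", np.abs(errors).sum(),
          "squared:", (errors ** 2).sum())
```

Both cases have the same total absolute error (4), but the squared cost is 4 vs. 16, so a squared cost function prefers many small errors over one large one.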
Let's take the example of a logistic regression model for binary classification. During training, the model's output (predicted value) for any given input will deviate from the actual output (expected value). The model needs to be trained to minimal error (loss) so that it can perform well with high accuracy.
The function used to find the parameter values (m and c in the case of the linear equation y = mx + c) at which the minimal error (loss) occurs is called the cost function or loss function. "Loss function" usually refers to the loss for a single row/record of the training sample, while "cost function" refers to the loss over the entire training dataset.
Now, how do we find the parameter values (m and c in our case) at which the minimum loss occurs? By using the gradient descent algorithm, whose update equation helps us find the point of minimum loss; the parameter values at that point are used to build the model (say y = 0.5x + 2, where m = 0.5 and c = 2 are the values at which the loss is minimum).
The cost function is, loosely, the "cost" at which you are building your model; for a good model, that cost should be minimal. To find the minimum of the cost function we use the gradient descent method, which gives the coefficient values that minimize the cost function.
