I have a question about inverse prediction in machine learning/data science. Here is an example to illustrate it: I have 20 input features X = (x0, x1, ..., x19) and 3 output variables Y = (y0, y1, y2). The amount of training/test data is usually small, e.g. fewer than 1000 items, or even fewer than 100 in the training set.
In general, using a machine learning toolbox such as scikit-learn, I can train models (random forest, linear/polynomial regression, neural networks) that map X --> Y. But what I actually want to know is, for example, how I should set X so that y1 falls in a specific range (for example y1 > 100).
Does anyone know how to solve this kind of "inverse prediction"? Two approaches come to mind:
1. Train the model in the normal way, X --> Y, then lay a dense mesh over the high-dimensional X space (20 dimensions in this example). Feed every mesh point to the trained model and select the points where the predicted y1 > 100. Finally, use some method such as clustering to look for patterns in the selected points (a rough sketch of this appears below).
2. Learn models directly from Y to X. Then lay a dense mesh over the Y space, restricted to y1 > 100, and use the trained models to compute the corresponding X points.
The second method might be fine if Y were also high-dimensional. But in my application Y is very low-dimensional and X is very high-dimensional, so many different X map to the same Y, which makes me think method 2 is not very practical.
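For what it's worth, a rough sketch of method 1 might look like the following. The model, the data, and the sampling bounds are all placeholders, and random sampling stands in for a full mesh, which would be astronomically large in 20 dimensions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy stand-in for the real training set: 200 samples, 20 features, 3 targets.
X_train = rng.uniform(size=(200, 20))
Y_train = rng.normal(loc=100, scale=30, size=(200, 3))

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, Y_train)

# Method 1: sample candidate X points (a full 20-D mesh would be far too large),
# predict, and keep the points whose predicted y1 exceeds the target.
candidates = rng.uniform(size=(50_000, 20))
y_pred = model.predict(candidates)          # shape (50000, 3): columns y0, y1, y2
good_X = candidates[y_pred[:, 1] > 100]     # predicted y1 > 100

# Look for structure in the "good" region, e.g. with clustering.
if len(good_X) >= 5:
    centers = KMeans(n_clusters=5, n_init=10).fit(good_X).cluster_centers_
    print(centers)
```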
Does anyone have other ideas? I think this must be fairly common in industry, and maybe some people have met a similar situation before.
Thank you!
From what I understand of your needs, #1 is an excellent fit for this problem. I recommend that you use a simple binary SVM classifier to discriminate good/bad X vectors. SVMs work well in high-dimensional spaces, and reading out the coefficients is easy in most SVM interfaces.
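A minimal sketch of that suggestion, with toy data standing in for your real (X, Y) and a linear SVM so that the coefficients are directly readable:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Toy stand-in for the real data: y1 depends mostly on features 3 and 7.
X = rng.uniform(size=(500, 20))
y1 = 80 + 60 * X[:, 3] - 40 * X[:, 7] + rng.normal(scale=5, size=500)

labels = (y1 > 100).astype(int)     # 1 = "good" X (y1 above target), 0 = "bad" X

clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10_000))
clf.fit(X, labels)

# Coefficients (on the scaled features) show which inputs push y1 above 100.
print(clf.named_steps["linearsvc"].coef_)
```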
A related note that may be useful:
For inverse/backward prediction, the backward direction Y ---> X can often be predicted with accuracy similar to that of the forward direction X ---> Y simply by solving the system of equations relating X and Y, given the learned weights and intercepts. This works best for linear problems of the form AX = B. Note that a model trained directly for inverse prediction in Python can carry considerable error, whereas solving the (n*n) system of equations is often the better choice, with suitable accuracy.
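A minimal sketch of this idea for a purely linear model, using NumPy's least-squares solver; the data, the target y1 value, and all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy linear data: Y = X W + b with 20 inputs and 3 outputs.
X = rng.normal(size=(300, 20))
Y = X @ rng.normal(size=(20, 3)) + rng.normal(size=3)

model = LinearRegression().fit(X, Y)
W, b = model.coef_, model.intercept_        # Y ≈ X @ W.T + b, W has shape (3, 20)

# Inverse step: for a desired output y_target, solve W x = y_target - b.
# With 20 unknowns and only 3 equations the system is under-determined, so
# lstsq returns the minimum-norm solution, one of infinitely many valid X.
y_target = np.array([0.0, 120.0, 0.0])      # e.g. we want y1 = 120
x_solution, *_ = np.linalg.lstsq(W, y_target - b, rcond=None)

print(model.predict(x_solution.reshape(1, -1)))   # close to y_target
```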
Regards
How do I know whether feature scaling is required in linear regression, multilinear regression, or polynomial regression? In some places I read that feature scaling is not required because the coefficients take care of it, and elsewhere I read that it is required, so what is the actual answer?
Both statements are correct but incomplete.
If you are using a simple linear model such as y = w1 * x1 + w2 * x2, then feature scaling is not required, because the coefficients w1 and w2 will be learned and will adapt to the scale of each feature.
But if you modify the above expression with a regularization term, or define constraints over the variables, then without feature scaling the coefficients will be biased toward the feature with the larger magnitude.
In conclusion: feature scaling becomes important once we modify the simple linear model, e.g. by adding regularization. It is also good practice to normalize the features before applying any algorithm.
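As a rough illustration of that point (toy data, arbitrary penalty strength), compare ridge regression with and without scaling; without scaling the penalty falls almost entirely on the small-scale feature's coefficient:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

x1 = rng.normal(size=300)                 # feature on a small scale
x2 = 1000 * rng.normal(size=300)          # feature on a much larger scale
y = 3 * x1 + 0.002 * x2 + rng.normal(scale=0.1, size=300)
X = np.column_stack([x1, x2])

# Without scaling, the penalty mostly shrinks x1's (necessarily larger)
# coefficient, while x2's tiny coefficient is barely penalized at all.
unscaled = Ridge(alpha=1000.0).fit(X, y)
scaled = make_pipeline(StandardScaler(), Ridge(alpha=1000.0)).fit(X, y)

print("without scaling:", unscaled.coef_)                     # x1's coefficient shrunk well below 3
print("with scaling:   ", scaled.named_steps["ridge"].coef_)  # both features penalized comparably
```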
Suppose we have two features, weight and price. The raw values of "Weight" cannot be meaningfully compared with those of "Price." Without scaling, the algorithm in effect assumes that, since the "Weight" values are larger than the "Price" values, "Weight" is more important than "Price." (link)
Feature scaling is required when the data columns vary widely in their ranges. Getting the min, max and mean of each column is a great way to check this.
Plotting the data is a good next step; it makes the ranges of the different dimensions easy to see.
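For example, something like the following (pandas/matplotlib, with made-up numbers) prints the per-column summary and plots the ranges:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up numbers; replace with your own table.
df = pd.DataFrame({"weight": [110, 95, 134, 150, 122],
                   "price": [1.2, 0.9, 3.4, 2.1, 1.8]})

print(df.describe())        # count, mean, std, min, max, ... per column

df.hist(figsize=(6, 3))     # quick per-column plots make range differences obvious
plt.show()
```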
I just learned that a generative model tries to learn p(x|z)p(z) = p(x,z).
But after studying some sample code for generative models such as VAEs and GANs, I found that the output of the model is the generated image x, which is a 2D matrix.
My understanding is that the entries of this matrix represent the probability of each pixel given the latent variable; is this right?
If so, is it possible to get the joint probability p(x,z) of the latent variable z and a whole image x from the generative model?
Thanks!
What a generative model is trying to learn is just p(x). p(x|z) = 1 if g(z) = x and 0 otherwise, because GANs and VAEs are deterministic mappings and therefore map the same input z to the same output with probability 1.
Extracting the probability of x is not an easy task though, and it depends on the approach. With GANs you can approximate it by sampling from the model: e.g., you sample 1000 images and count how often each image occurred; an image then has a probability of occurrences / 1000. By the law of large numbers you will eventually recover the actual probability distribution of your generator this way.
If you want an exact way to calculate probabilities, you can use flow networks such as Glow or RealNVP, which optimize log(p(x)) directly and provide a way to recover p(x).
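To make the sampling idea concrete, here is a toy sketch; the "generator" g is a made-up deterministic map over a discrete space, since exact repeats are what make the counting work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generator": a deterministic map g(z) over a discrete latent space,
# standing in for a GAN/VAE decoder whose outputs we can compare exactly.
def g(z):
    return z % 4                    # only 4 distinct "images" can be produced

z_samples = rng.integers(0, 10, size=100_000)   # z ~ p(z), here uniform on {0,...,9}
x_samples = g(z_samples)

# Monte Carlo estimate of p(x): count how often each output occurs.
values, counts = np.unique(x_samples, return_counts=True)
p_x = counts / counts.sum()
print(dict(zip(values.tolist(), p_x.round(3).tolist())))   # ≈ {0: 0.3, 1: 0.3, 2: 0.2, 3: 0.2}
```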
Besides R-squared, is there any other effective and widely recognized scoring method for evaluating a regressor such as a gradient boosting regressor, random forest regressor, SVR, and so on?
If there are many different scoring methods, what factors should we consider when choosing among them? Thank you!
You need to take care of overfitting, and R-squared won't do that for you. You may want to use adjusted R-squared, or a least-squares error with a penalty term on the weights, something like
(y - w.x)^2 + lambda*(w.w)
where lambda is a weight that restricts the model, y is the true output, x is the input, and w is the vector of learned weights, so that w.x is the predicted output.
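As a concrete sketch (the model and dataset here are just placeholders), adjusted R-squared can be computed from r2_score by hand, and held-out error metrics such as MAE and RMSE are also widely recognized:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

y_pred = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te)

r2 = r2_score(y_te, y_pred)
n, p = X_te.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)      # adjusted R^2 penalizes extra predictors

print("R^2:         ", r2)
print("adjusted R^2:", adj_r2)
print("MAE:         ", mean_absolute_error(y_te, y_pred))
print("RMSE:        ", np.sqrt(mean_squared_error(y_te, y_pred)))
```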
Suppose I have a training set made of (x, y) samples.
To apply a generative algorithm, say Gaussian discriminant analysis, must I assume that
p(x|y) ~ Normal(mu, sigma) for every possible sigma,
or do I just need to know whether x ~ Normal(mu, sigma) given y?
How can I evaluate whether p(x|y) follows a multivariate normal distribution well enough (up to some threshold) for me to use a generative algorithm?
That's a lot of questions.
To apply a generative algorithm, say Gaussian discriminant analysis, must I assume that
p(x|y) ~ Normal(mu, sigma) for every possible sigma?
No, you must assume that's true for some mu, sigma pair. In practice you won't know what mu and sigma is, so you'll need to either estimate it (frequentist, Max Likelihood/Max A Posteriori estimates), or even better incorporate uncertainty about your estimates of the parameters into predictions (Bayesian methodology).
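As a small illustration of the frequentist ML step only, here is a sketch (with made-up toy data) of estimating mu and Sigma for each class from samples; np.cov with bias=True gives the 1/N maximum-likelihood estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class, 2-dimensional data.
x_class0 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=200)
x_class1 = rng.multivariate_normal([2.0, 2.0], [[1.0, 0.3], [0.3, 1.0]], size=200)

# Maximum-likelihood estimates of mu and Sigma for p(x | y) in each class.
for label, x in [(0, x_class0), (1, x_class1)]:
    mu_hat = x.mean(axis=0)
    sigma_hat = np.cov(x, rowvar=False, bias=True)   # bias=True -> 1/N, the ML estimate
    print(f"class {label}: mu = {mu_hat.round(2)}")
    print(f"class {label}: Sigma =\n{sigma_hat.round(2)}")
```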
How can I evaluate if p(x|y) follows a multivariate Normal distribution?
Classically, using a goodness of fit test. If the dimensionality of x is more than a handful, though, this won't work because standard tests involve the number of items in bins, and the number of bins you need in high dimensions is astronomical so you have very low expected counts.
A better idea is to say the following: what are my options for modelling the (conditional) distribution of x? You can compare between these options using model comparison techniques. Read up on model checking and comparison.
Finally, your last point:
well enough (up to some threshold) for me to use a generative algorithm?
The paradox of many generative methods, including Fisher's Linear Discriminant Analysis for example, as well as the Naive Bayes classifier, is that the classifier can work very well even though the model is poor for the data. There's no particularly sound reason why this should be the case, but many have observed it to be empirically true. Whether it works can be checked much more easily than whether the assumed distribution explains the data very well: just split your data into training and testing and find out!
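For instance, a minimal sketch of that check, using scikit-learn's Gaussian Naive Bayes on a built-in dataset whose features are certainly not exactly Gaussian; the dataset choice is purely illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# The features here are certainly not exactly Gaussian or independent, yet the
# generative classifier may still predict well; the held-out split tells us directly.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

clf = GaussianNB().fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```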
I have read through a lot of papers and understand the basic concept of a support vector machine at a very high level. You give it a training input vector with a set of features, call it x, and based on how the "optimization function" evaluates x (let's say we're talking about text classification), the text associated with x is classified into one of two pre-defined classes; this is the binary classification case.
So my first question is: all the papers say that this training input vector x is first mapped to a higher (maybe infinite) dimensional space. What does this mapping achieve, and why is it required? Say the input vector x has 5 features; who decides which "higher dimension" x gets mapped to?
Second question is about the following optimization equation:
min (1/2) w^T w + C Σ_{i=1..n} ξ_i
so I understand that w has something to do with the margin of the hyperplane from the support vectors, and I know that C is some sort of penalty, but I don't know what it is a penalty for. Also, what does ξ_i represent in this case?
A simple explanation of the second question would be much appreciated as I have not had much luck understanding it by reading technical papers.
When they talk about mapping to a higher-dimensional space, they mean that the kernel accomplishes the same thing as mapping the points to a higher-dimensional space and then taking dot products there. SVMs are fundamentally a linear classifier, but if you use kernels, they're linear in a space that's different from the original data space.
To be concrete, let's talk about the kernel
K(x, y) = (xy + 1)^2 = (xy)^2 + 2xy + 1,
where x and y are each real numbers (one-dimensional). Note that
(x^2, sqrt(2) x, 1) • (y^2, sqrt(2) y, 1) = x^2 y^2 + 2 x y + 1
has the same value. So K(x, y) = phi(x) • phi(y), where phi(a) = (a^2, sqrt(2) a, 1), and doing an SVM with this kernel (the inhomogeneous polynomial kernel of degree 2) is the same as if you first mapped your 1d points into this 3d space and then used a linear kernel.
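A quick numerical sanity check of this identity (plain NumPy, values chosen arbitrarily):

```python
import numpy as np

def K(x, y):
    return (x * y + 1) ** 2                            # inhomogeneous polynomial kernel, degree 2

def phi(a):
    return np.array([a ** 2, np.sqrt(2) * a, 1.0])     # explicit 3-d feature map

x, y = 1.7, -0.4
print(K(x, y), phi(x) @ phi(y))                        # both print 0.1024
```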
The popular Gaussian RBF kernel function is equivalent to mapping your points into an infinite-dimensional Hilbert space.
You're the one who decides what feature space it's mapped into, when you pick a kernel. You don't necessarily need to think about the explicit mapping when you do that, though, and it's important to note that the data is never actually transformed into that high-dimensional space explicitly; otherwise infinite-dimensional points would be hard to represent. :)
The ξ_i are the "slack variables". Without them, SVMs would never be able to account for training sets that aren't linearly separable -- which most real-world datasets aren't. The ξ_i are, in some sense, the amounts by which you need to push data points on the wrong side of the margin over to the correct side. C is a parameter that determines how much it costs you to increase the ξ (that's why it's multiplied there).
1) The higher-dimensional space comes in through the kernel mechanism. However, when evaluating a test sample, the higher-dimensional space need not be computed explicitly. (Clearly this must be the case, because we cannot represent infinite dimensions on a computer.) For instance, radial basis function kernels imply an infinite-dimensional space, yet we never need to map into it explicitly. We only need to compute K(x_sv, x_test), where x_sv is one of the support vectors and x_test is the test sample.
The specific higher dimensional space is chosen by the training procedure and parameters, which choose a set of support vectors and their corresponding weights.
2) C is the weight associated with the cost of not being able to classify the training set perfectly. The optimization equation says to trade off between the two undesirable cases of imperfect classification and a small margin. The ξ_i variables represent by how much we fail to classify instance i of the training set, i.e., the training error on instance i.
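If it helps, here is a small sketch of that trade-off using scikit-learn's SVC; the dataset and the particular C values are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Noisy, non-separable 2-D data.
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           class_sep=0.8, flip_y=0.1, random_state=0)

# Small C tolerates more slack (wider margin, more training errors);
# large C penalizes slack heavily (narrower margin, fewer training errors).
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<7} support vectors={int(clf.n_support_.sum()):4d} "
          f"training accuracy={clf.score(X, y):.2f}")
```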
See Chris Burges' tutorial on SVMs for about the most intuitive explanation of this stuff you're going to get anywhere (IMO).