I am confused about the terms mentioned below in the machine learning paradigm. Can anybody clarify the difference? I would be grateful.
Feature correlation
means that two features X1 and X2 are dependent on each other, regardless of the target Y. In other words, if the value of X1 increases, X2 will also tend to increase (or decrease).
For example: features Height (X1) and Weight (X2) of a person, and prediction variable RunningSpeed (Y). If we increase the height, the weight will obviously tend to increase as well.
Feature Interaction
on the other hand, describes how our model's prediction depends on the features X1 and X2 jointly: what the output is if only X1 is used, what it is if only X2 is used, and what it is when a combination of X1 and X2 is used. Such a combination defines the interaction among the features, and may be formed with operators such as +, -, *, /.
For example: HouseSize (X1), HouseLocation (X2) and Price (prediction Y). Here X1 and X2 are not correlated with each other, but both of them contribute to predicting the house price.
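To make the distinction concrete, here is a minimal sketch with made-up house data: the two features are generated independently (so they are uncorrelated), yet a linear model improves once an explicit interaction term X1*X2 is added. The column names and numbers are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Made-up data: size (X1) and a location score (X2) are drawn independently,
# so they are uncorrelated, yet the price depends on both and on their product.
size = rng.uniform(50, 200, size=500)
location = rng.uniform(0, 1, size=500)
price = 1000 * size + 50000 * location + 500 * size * location + rng.normal(0, 5000, 500)

X = np.column_stack([size, location])

# Feature correlation: plain Pearson correlation between X1 and X2.
print("corr(X1, X2):", np.corrcoef(size, location)[0, 1])  # close to 0

# Feature interaction: add the explicit product term X1*X2 and compare fits.
lin = LinearRegression().fit(X, price)
X_inter = PolynomialFeatures(degree=2, interaction_only=True,
                             include_bias=False).fit_transform(X)  # [x1, x2, x1*x2]
lin_inter = LinearRegression().fit(X_inter, price)
print("R^2 without interaction:", lin.score(X, price))
print("R^2 with interaction term:", lin_inter.score(X_inter, price))
```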
Related
I am trying to predict y, a column of 0s and 1s (classification), using features (X). I'm using ML models like XGBoost.
One of my features, in reality, is highly predictive, let's call it X1. X1 is a column of -1/0/1. When X1 = 1, 80% of the time y = 1. When X1 = -1, 80% of the time y = 0. When X1 = 0, it has no correlation with y.
So in reality, ML aside, any sane person would select this feature for their model, because if you see X1 = 1 or X1 = -1 you have an 80% chance of predicting whether y is 0 or 1.
However X1 is only -1 or 1 about 5% of the time, and is 0 95% of the time. When I run it through feature selection techniques like Sequential Feature Selection, it doesn't get chosen! And I can understand why ML doesn't choose it, because 95% of the time it is a 0 (and thus uncorrelated with y). And so for any score that I've come across, models with X1 don't score well.
So my question is more generically, how can one deal with this paradox between ML technique and real-life logic? What can I do differently in ML feature selection/modelling to take advantage of the information embedded in the X1 -1's and 1's, which I know (in reality) are highly predictive? What feature selection technique would have spotted the predictive power of X1, if we didn't know anything about it? So far, all methods that I know of need predictive power to be unconditional. Instead, here X1 is highly predictive conditional on not being 0 (which is only 5% of the time). What methods are out there to capture this?
Many thanks for any insight!
Probably sklearn.feature_selection.RFE would be a good option, since it is not tied to a separate scoring method. What I mean by that is that it recursively fits the estimator you're planning to use on smaller and smaller subsets of features, removing the features with the lowest scores at each step until the desired number of features is reached.
This seems like a good approach, since regardless of whether the feature in question looks like a good predictor to you, this selection method tells you how important the feature is to the model. So if a feature is not kept, it is not as relevant to the model in question.
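As a rough sketch of how that looks in scikit-learn, using a toy dataset and a gradient-boosted classifier as a stand-in for XGBoost (all numbers are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# Toy data standing in for the real problem: 10 features, a few informative.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)

# RFE repeatedly fits the estimator and drops the least important feature
# until only n_features_to_select remain.
estimator = GradientBoostingClassifier(random_state=0)
selector = RFE(estimator, n_features_to_select=5, step=1).fit(X, y)

print("selected feature mask:", selector.support_)
print("feature ranking (1 = kept):", selector.ranking_)
```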
What is the best approach to normalize / standardize features that have no theoretical maximum value?
For example, a stock price that has always been between $0 and $1000 could still go higher, so what is the correct approach?
I thought about scaling against a higher maximum (e.g. 2000), but it doesn't feel right, because no data would be available for the 1000-2000 range, and I think this would introduce bias.
TL;DR: use z-scores, maybe take log, maybe take inverse logit, maybe don't normalize at all.
If you wish to normalize safely, use a monotonic mapping, e.g.:
To map (0, inf) into (-inf, inf), you can use y = log(x)
To map (-inf, inf) into (0, 1), you can use y = 1 / (1 + exp(-x)) (inverse logit)
To map (0, inf) into (0, 1), you can use y = x / (1 + x) (inverse logit after log)
If you don't care about bounds, use a linear mapping: y=(x - m) / s, where m is the mean of your feature, and s is its standard deviation. This is called standard scaling, or sometimes z-scoring.
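A small sketch of these mappings applied to a made-up positive-valued feature (the values are purely illustrative):

```python
import numpy as np

x = np.array([0.5, 10.0, 250.0, 990.0])  # made-up positive feature values

# (0, inf) -> (-inf, inf)
log_x = np.log(x)

# (-inf, inf) -> (0, 1): inverse logit (sigmoid)
sigmoid = 1.0 / (1.0 + np.exp(-log_x))

# (0, inf) -> (0, 1): the same composition, written directly
bounded = x / (1.0 + x)

# Standard scaling (z-scores): unbounded, but no theoretical maximum is needed.
z = (x - x.mean()) / x.std()

print(log_x, sigmoid, bounded, z, sep="\n")
```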
The question you should ask yourself first is: why normalize at all? What are you going to do with your data? Use it as an input feature, or as a target to predict?
For an input feature, leaving it non-normalized is OK, unless you apply regularization to the model coefficients (like Ridge or Lasso), which works best if all the features are on the same scale (that is, after standard scaling).
For a target feature, leaving it non-normalized is sometimes also OK.
Additive models (like linear regression or gradient boosting) sometimes work better with symmetric distributions. Distributions of stock values (and money values in general) are often skewed to the right, so taking log makes them more convenient.
Finally, if you predict your target with a neural net whose output activation is a sigmoid, the output is inherently bounded. In this case, you might want the target to be bounded as well. To achieve this, you can use x / (1 + x) as the target: if x is always positive, this value will always be between 0 and 1, just like the output of the neural net.
In machine learning suppose we have a GDA (Gaussian Discriminant Analysis) model for classification.
If y can take the values 0 or 1 and x represents a vector of n features (n x 1 dimensional),
What does p(x| y=0) or p(x|y=1) signify for a particular training example?
x is actually a vector, so how is the conditional probability defined in this case?
Any help would be much appreciated.
Say that X0 is the set of vectors x that mapped to output 0, and X1 is the set of vectors x that mapped to output 1. Take the mean of each set's vectors, and, similarly, approximate the covariance.
Now build two multivariate normal distributions, with these means and covariances, respectively.
Once you have these distributions, simply plug the vector you care about into each PDF to obtain its density. Note that since the distributions are continuous, the probability of any single point is 0 in general; what you get is a density.
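A minimal sketch of this recipe, using simulated data and scipy's multivariate_normal to form the two class-conditional densities (as described above, each class gets its own covariance estimate here; all data is made up):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Made-up training data: X is (m, n), y is 0/1.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X0, X1 = X[y == 0], X[y == 1]  # the two sets described in the answer

# Class-conditional Gaussians: p(x | y=0) and p(x | y=1).
p_x_given_0 = multivariate_normal(mean=X0.mean(axis=0), cov=np.cov(X0, rowvar=False))
p_x_given_1 = multivariate_normal(mean=X1.mean(axis=0), cov=np.cov(X1, rowvar=False))

x_new = X[0]
print("density under y=0:", p_x_given_0.pdf(x_new))
print("density under y=1:", p_x_given_1.pdf(x_new))
```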
I am working on a data mining project and I have to design the following model.
I am given 4 features x1, x2, x3 and x4, and a set of functions defined on these features, such that each function depends on some subset of the available features.
e.g.
F1(x1, x2) = x1^2 + 2*x2^2
F2(x2, x3) = 2*x2^2 + 3*x3^3
F3(x3, x4) = 3*x3^3 + 4*x4^4
This means F1 is some function that depends on features x1 and x2, F2 is some function that depends on x2 and x3, and so on.
Now a training data set is available where the values of x1, x2, x3, x4 are known, along with the sum F1 + F2 + F3 (I know the total sum but not the value of each individual function).
Using this training data I have to build a model which can correctly predict the total sum of all the functions, i.e. F1 + F2 + F3.
I am new to the data mining and machine learning field, so I apologize in advance if this question is too trivial or wrong. I have tried to model it in many ways but have not arrived at a clear approach. I would appreciate any help with this.
Your problem is non-linear regression. You have features x1, x2, x3, x4 and a target S, where S = F1 + F2 + F3.
You would like to predict S from the x's, but S is a non-linear function of them.
Since S is non-linear, you need to use a non-linear regression algorithm for this problem. Standard nonlinear regression may solve your problem, or you may choose other approaches: you may, for example, try tree regression or MARS (Multivariate Adaptive Regression Splines). These are well-known algorithms and you can find both commercial and open-source implementations.
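As a rough sketch of that idea, here is a tree-based regressor fitted on simulated data generated from the example functions above; only (x1..x4, S) are used for training, exactly as in the problem statement. The data generation is purely illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Simulated training data: we only ever observe (x1..x4, S), never F1/F2/F3 alone.
X = rng.uniform(-1, 1, size=(5000, 4))
x1, x2, x3, x4 = X.T
S = (x1**2 + 2*x2**2) + (2*x2**2 + 3*x3**3) + (3*x3**3 + 4*x4**4)

X_train, X_test, S_train, S_test = train_test_split(X, S, random_state=0)

# A tree-based regressor learns the non-linear mapping x -> S directly from the data.
model = GradientBoostingRegressor(random_state=0).fit(X_train, S_train)
print("R^2 on held-out data:", model.score(X_test, S_test))
```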
Two questions concerning machine learning algorithms like linear / logistic regression, ANNs, and SVMs:
The models in these algorithms deal with data sets where each example has a number of features and a single output value (e.g. predicting the price of a house from its features). But what if the features are enough to produce more than one piece of information about the item of interest, i.e. more than one output? Consider this example: a data set about cars where each example (car) has the features initial velocity, acceleration and time. In the real world these features are enough to determine two quantities: velocity via v = v_i + a*t and distance via s = v_i*t + 0.5*a*t^2. So I want example X with features (x1, x2, ..., xn) to have outputs y1 and y2 in the same step, so that after training the model, if a new car example is given with initial velocity, acceleration and time, the model will be able to predict the velocity and distance at the same time. Is this possible?
In the house-price prediction example, where example X is given with features (x1, x2, x3) and the model predicts the price, can the process be reversed by any means? Meaning, if I give the model an example with features x1, x2 and price y, can it predict the feature x3?
Depends on the model. A linear model such as linear regression cannot reliably learn the distance formula, since it is a cubic function of the given variables. You'd need to add v_i*t and a*t^2 as features to get a good prediction of the distance. A non-linear model such as a cubic-kernel SVM regression or a multi-layer ANN should be able to learn this from the given features, though, given enough data.
More generally, predicting multiple values with a single model sometimes works and sometimes doesn't -- when in doubt, just fit several models.
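For instance, here is a minimal sketch with simulated kinematics data, using scikit-learn's MultiOutputRegressor to wrap one regressor per target; this is essentially the "fit several models" approach packaged as a single estimator, and the data generation is illustrative only:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)

# Made-up car data: initial velocity v_i, acceleration a, time t.
v_i = rng.uniform(0, 30, 5000)
a = rng.uniform(-3, 3, 5000)
t = rng.uniform(0, 20, 5000)
X = np.column_stack([v_i, a, t])

# Two targets computed from the kinematics formulas in the question.
velocity = v_i + a * t
distance = v_i * t + 0.5 * a * t**2
Y = np.column_stack([velocity, distance])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# MultiOutputRegressor fits one copy of the base regressor per target column.
model = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
model.fit(X_train, Y_train)
print("average R^2 over the two targets:", model.score(X_test, Y_test))
```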
You can try. Whether it'll work depends on the relation between the variables and the model.