Understanding Polynomial Regression
I understand that we use polynomial regression for some kinds of nonlinear data sets, to fit a curve to them. I know how to write a polynomial regression equation for a single independent variable, but I don't really understand how this equation is constructed for 2 variables:
y = a1*x1 + a2*x2 + a3*x1*x2 + a4*x1^2 + a5*x2^2
What will the equation be for polynomial regression in case we have 3 or more variables? What actually is the logic behind developing these polynomial equations for more than one variable?
You can choose whatever you want, but the general 'formula' (to the best of my own experience and knowledge) is:
Powers (so x1, x1^2, x1^3, etc.) up to whichever degree you choose (many stop at 2).
Cross products (x1 * x2, x1 * x3, etc.).
Combinations (x1^2 * x2, x1 * x2^2, etc.), and then you can even add higher-order combinations (x1 * x2 * x3, possibly with powers as well).
But this quickly gets out of hand, and you can end up with too many features.
I would stick to powers up to 2 and cross products of pairs only (no powers), much like your example. If you have three features, you could also include the product of all three, but with more than three I wouldn't bother with triplets.
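If you don't want to enumerate these terms by hand, scikit-learn can generate them for you. A minimal sketch (assuming scikit-learn >= 1.0 for get_feature_names_out):

```python
# Generate all degree-2 polynomial terms (powers and cross products)
# for two variables, exactly the terms in the equation above.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])  # columns are x1 and x2

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out(['x1', 'x2']))
# ['x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']
print(X_poly)
```

With interaction_only=True you get just the cross products without the pure powers, which matches the "pairs only, no powers" suggestion above.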
The idea with polynomials is to model a complex relationship between the features and the target; polynomials are often a good approximation to more complex relationships (that are not really polynomial in nature).
I hope this is what you meant and that this can help you.
I am confused about the terms feature correlation and feature interaction in machine learning. Can anybody explain the difference? I would be grateful.
Feature correlation
means that features X1 and X2 are dependent on each other, regardless of the target Y. In other words, if the value of X1 increases, then X2 will also tend to increase or decrease.
For example: take the height (X1) and weight (X2) of a person as features, and running speed (Y) as the prediction variable. If we increase the height, the weight will usually increase as well.
Feature interaction
on the other hand describes how the model's prediction depends on combinations of the features X1 and X2: what the output would be if only X1 were used, what it would be if only X2 were used, and what it would be if a combination of both were used. Such a combination (e.g. +, -, *, /) defines the interaction among the features.
For example: house size (X1), house location (X2) and price (prediction Y). Here X1 and X2 need not be correlated with each other, but both of them contribute to the prediction of the house price.
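A minimal synthetic sketch (all numbers made up for illustration) contrasting the two ideas: correlation is a property of the features alone, while an interaction is a combined term the model needs in order to predict the target:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlation: height and weight move together, no target involved.
height = rng.normal(170, 10, size=100)           # X1
weight = 0.9 * height + rng.normal(0, 5, 100)    # X2, correlated with X1
print(np.corrcoef(height, weight)[0, 1])         # high correlation (~0.87)

# Interaction: size and location are independent, but the price
# depends on their product.
size = rng.uniform(50, 200, size=100)            # X1
location_score = rng.uniform(0, 1, size=100)     # X2, independent of X1
price = size * location_score * 1000             # target built from X1 * X2

X = np.column_stack([size, location_score, size * location_score])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
print(coef)  # interaction column gets weight ~1000, raw features ~0
```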
What is the best approach to normalize / standardize features that have no theoretical maximum value?
For example, a stock value that has always been between $0 and $1000 could still go higher, so what is the correct approach?
I thought about scaling with a higher maximum (e.g. 2000), but it doesn't feel right, because no data would be available for the 1000-2000 range, and I think this would introduce bias.
TL;DR: use z-scores, maybe take log, maybe take inverse logit, maybe don't normalize at all.
If you wish to normalize safely, use a monotonic mapping, e.g.:
To map (0, inf) into (-inf, inf), you can use y = log(x)
To map (-inf, inf) into (0, 1), you can use y = 1 / (1 + exp(-x)) (inverse logit)
To map (0, inf) into (0, 1), you can use y = x / (1 + x) (inverse logit after log)
If you don't care about bounds, use a linear mapping: y=(x - m) / s, where m is the mean of your feature, and s is its standard deviation. This is called standard scaling, or sometimes z-scoring.
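A minimal NumPy sketch of these mappings (x is assumed strictly positive for the log-based transforms):

```python
import numpy as np

x = np.array([0.5, 10.0, 250.0, 990.0])  # e.g. stock prices in (0, inf)

log_x = np.log(x)                     # (0, inf) -> (-inf, inf)
inv_logit = 1 / (1 + np.exp(-log_x))  # (-inf, inf) -> (0, 1)
bounded = x / (1 + x)                 # (0, inf) -> (0, 1), same values as inv_logit

z = (x - x.mean()) / x.std()          # standard scaling (z-scores)
print(bounded)
print(z)
```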
The question you should ask yourself first is: why normalize at all? What are you going to do with your data? Use it as an input feature, or as a target to predict?
For an input feature, leaving it unnormalized is OK, unless you apply regularization to the model coefficients (like Ridge or Lasso), which works best if all the features are on the same scale (that is, after standard scaling).
For a target feature, leaving it unnormalized is sometimes also OK.
Additive models (like linear regression or gradient boosting) sometimes work better with symmetric distributions. Distributions of stock values (and money values in general) are often skewed to the right, so taking the log makes them more convenient.
Finally, if you predict your feature with a neural net with a sigmoid activation function, its output is inherently bounded. In this case, you might wish the target to be bounded as well. To achieve this, you may use x / (1 + x) as the target: if x is always positive, this value will always be between 0 and 1, just like the output of the neural net.
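If you go this route, remember that the mapping is invertible, so you can translate the model's bounded predictions back to the original scale. A small sketch:

```python
import numpy as np

x = np.array([0.1, 1.0, 100.0])  # original, unbounded positive target
y = x / (1 + x)                  # bounded target in (0, 1), matches a sigmoid output
x_back = y / (1 - y)             # invert predictions back to the original scale
print(np.allclose(x, x_back))    # True
```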
I have a function f(x): R^n --> R (sorry, is there a way to do LaTeX here?), and I want to build a machine learning algorithm that estimates f(x) for any input point x, based on a bunch of sample xs in a training data set. If I know the value of f(x) for every x in the training data, this should be simple - just do a regression, or take the weighted average of nearby points, or whatever.
However, this isn't what my training data looks like. Rather, I have a bunch of pairs of points (x, y), and I know the value of f(x) - f(y) for each pair, but I don't know the absolute values of f(x) for any particular x. It seems like there ought to be a way to use this data to find an approximation to f(x), but I haven't found anything after some Googling; there are papers like this but they seem to assume that the training data comes in the form of a set of discrete labels for each entity, rather than having labels over pairs of entities.
This is just making something up, but could I try kernel density estimation over f'(x), and then do integration to get f(x)? Or is that crazy, or is there a known better technique?
You could assume that f is linear, which would simplify things - if f is linear we know that:
f(x-y) = f(x) - f(y)
For example, suppose you assume f(x) = <w, x>, making w the parameter you want to learn. What would the squared loss for a sample pair (x, y) with known difference d look like?
loss((x, y), d) = (f(x) - f(y) - d)^2
                = (<w, x> - <w, y> - d)^2
                = (<w, x - y> - d)^2
                = (<w, z> - d)^2    // where z := x - y
which is simply the squared loss for the input z = x - y and target d.
Practically, you would need to construct z=x-y for each pair and then learn f using linear regression over inputs z and outputs d.
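A minimal sketch of this approach on synthetic data (the hidden w_true and all numbers are made up for illustration):

```python
# Build z = x - y for each pair and fit ordinary least squares on (z, d).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n_pairs, n_features = 200, 5
w_true = rng.normal(size=n_features)   # hidden linear function f(x) = <w, x>

X = rng.normal(size=(n_pairs, n_features))
Y = rng.normal(size=(n_pairs, n_features))
d = (X - Y) @ w_true                   # observed differences f(x) - f(y)

Z = X - Y                              # one input per pair
model = LinearRegression(fit_intercept=False)  # any intercept cancels in differences
model.fit(Z, d)
print(np.allclose(model.coef_, w_true, atol=1e-8))  # True: w recovered exactly
```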
This model might be too weak for your needs, but it's probably the first thing you should try. As soon as you step away from the linearity assumption, you'd likely end up with a difficult non-convex optimization problem.
I don't see a way to recover absolute values, though. Any constant in your function (f(x) = g(x) + c) disappears from every difference, in the same way the constant of integration cannot be recovered when you only know a derivative.
I am working on a data mining project and I have to design the following model.
I am given 4 features (x1, x2, x3, x4) and three functions defined on these features, such that each function depends on some subset of the available features,
e.g.
F1(x1, x2) = x1^2 + 2*x2^2
F2(x2, x3) = 2*x2^2 + 3*x3^3
F3(x3, x4) = 3*x3^3 + 4*x4^4
This means F1 is some function which depends on features x1 and x2, F2 is some function which depends on x2 and x3, and so on.
A training data set is available where the values of x1, x2, x3, x4 are known, together with the total sum F1 + F2 + F3 (I know the total sum, but not the value of each individual function).
Using this training data, I have to build a model which can correctly predict the total sum of all the functions, i.e. F1 + F2 + F3.
I am new to the data mining and machine learning field, so I apologize in advance if this question is too trivial or wrong. I have tried to model it in many ways, but I don't have a clear idea about it. I would appreciate any help.
Your problem is non-linear regression. You have features x1, x2, x3, x4 and a target S, where S = F1 + F2 + F3.
You would like to predict S from these features, but S is a non-linear function of them.
Since S is non-linear, you need a non-linear regression algorithm for this problem. Ordinary non-linear regression may solve your problem, or you may choose other approaches: you could try, for example, tree regression or MARS (Multivariate Adaptive Regression Splines). These are well-known algorithms, and you can find both commercial and open-source implementations.
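A minimal sketch on synthetic data generated from the functions above, using a decision tree regressor as one possible non-linear model (all numbers are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 4))
x1, x2, x3, x4 = X.T

# Only the total sum S = F1 + F2 + F3 is observed, never the individual Fs.
S = (x1**2 + 2*x2**2) + (2*x2**2 + 3*x3**3) + (3*x3**3 + 4*x4**4)

X_tr, X_te, S_tr, S_te = train_test_split(X, S, random_state=0)
model = DecisionTreeRegressor(max_depth=8).fit(X_tr, S_tr)
print(model.score(X_te, S_te))  # R^2 on held-out data
```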
Two questions concerning machine learning algorithms like linear / logistic regression, ANNs, and SVMs:
The models in these algorithms deal with data sets where each example has a number of features and a single output value (e.g. predicting the price of a house from its features). But what if the features are enough to produce more than one piece of information about the item of interest, i.e. more than one output? Consider this example: a data set about cars where each example (car) has the features initial velocity, acceleration, and time. In the real world these features are enough to determine two quantities: velocity via v = v_i + a*t, and distance via s = v_i*t + 0.5*a*t^2. So I want example X with features (x1, x2, ..., xn) to have outputs y1 and y2 in the same step, so that after training the model, if a new car example is given with initial velocity, acceleration, and time, the model will be able to predict the velocity and the distance at the same time. Is this possible?
In the house price prediction example, where example X with features (x1, x2, x3) yields a predicted price, can the process be reversed by any means? Meaning: if I give the model an example X with features x1 and x2 together with price y, can it predict the feature x3?
Depends on the model. A linear model such as linear regression cannot reliably learn the distance formula, since it's a cubic function of the given variables; you'd need to add v_i*t and a*t^2 as features to get a good prediction of the distance. A non-linear model such as an SVM regression with a cubic kernel or a multi-layer ANN should be able to learn this from the given features, though, given enough data.
More generally, predicting multiple values with a single model sometimes works and sometimes doesn't -- when in doubt, just fit several models.
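For the car example, a minimal sketch (synthetic data, made-up ranges): scikit-learn's LinearRegression accepts a two-column target directly, and degree-3 polynomial features let a linear model represent the cubic distance formula:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
v_i = rng.uniform(0, 30, 1000)   # initial velocity
a = rng.uniform(-3, 3, 1000)     # acceleration
t = rng.uniform(0, 10, 1000)     # time

X = np.column_stack([v_i, a, t])
Y = np.column_stack([v_i + a * t,                # y1: velocity
                     v_i * t + 0.5 * a * t**2])  # y2: distance

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, Y)           # 2-D targets are handled natively
print(model.score(X, Y))  # ~1.0: both outputs learned at once
```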
You can try. Whether it'll work depends on the relation between the variables and the model.