I have started to learn ML and am confused by make_friedman1. It greatly improved my accuracy and increased the size of my data, but the data itself is no longer the same after using this function. What does make_friedman1 actually do?
If the make_friedman1 you are asking about is the one in sklearn.datasets, then it is the function that generates the “Friedman #1” regression problem. The inputs are 10 independent variables uniformly distributed on the interval [0, 1], and only 5 of these 10 are actually used. Outputs are created according to the formula:
y = 10 sin(π x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e
where e ~ N(0, sd).
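A minimal sketch of calling it (the arguments below are the documented n_samples, n_features, noise and random_state parameters):

from sklearn.datasets import make_friedman1

# 100 samples, 10 features on [0, 1]; only the first 5 features enter the
# formula above, the remaining 5 are pure noise. noise is the sd of e.
X, y = make_friedman1(n_samples=100, n_features=10, noise=1.0, random_state=0)

print(X.shape, y.shape)   # (100, 10) (100,)
print(X.min(), X.max())   # feature values lie in [0, 1]

Note that make_friedman1 does not transform or augment a data set you already have; it generates a brand-new synthetic data set, which is why your data looked different after you used it.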
Quoting from Friedman's original paper, Multivariate Adaptive Regression Splines:
A new method is presented for flexible regression modeling of high
dimensional data. The model takes the form of an expansion in product
spline basis functions, where the number of basis functions as well as
the parameters associated with each one (product degree and knot
locations) are automatically determined by the data. This procedure is
motivated by the recursive partitioning approach to regression and
shares its attractive properties. Unlike recursive partitioning,
however, this method produces continuous models with continuous
derivatives. It has more power and flexibility to model relationships
that are nearly additive or involve interactions in at most a few
variables.
A spline is built by joining many polynomial curves end-to-end to form a single smooth curve.
I have 38 variables, such as oxygen, temperature, pressure, etc., and my task is to determine the total yield produced every day from these variables. When I calculate the regression coefficients and intercept, they seem abnormal and very high (impractical). For example, the 'temperature' coefficient came out as +375.456, and I cannot give that a meaning such as "an increase of one unit in temperature increases yield by 375.456 g"; that is impractical in my scenario. However, the prediction accuracy seems right.

I would like to know how to interpret this huge intercept (-5341.27355) and the huge beta values shown below. One other important point: I removed multicollinear columns, and I am not scaling or normalizing the variables because I need the beta coefficients to keep their meaning, so that I can say something like "an increase in temperature by one unit increases yield by 10 g". Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that all of these variables are linearly related to the yield, so I would suggest that you have a look at simple non-linear regression techniques such as decision trees or kernel ridge regression. These are, however, more difficult to interpret.
Going back to your issue: these high weights might well be due to a high amount of correlation between the variables, or simply to you not having very much training data.
If you use Lasso regression instead of plain linear regression, the solution is biased away from large regression coefficients, and the fit will likely improve as well.
A small example of how to do this in scikit-learn, including cross-validation of the regularization hyper-parameter:
import numpy as np
from sklearn.linear_model import LassoCV

# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))

# Make y linearly dependent on the features
y = np.sum(np.random.random((1, n_features)) * X, axis=1)

# LassoCV picks the regularization strength alpha by cross-validation
model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X, y)
print(model.intercept_)
If you have a linear regression, the formula looks like this (y = target, x = input features):
y = x1*b1 + x2*b2 + x3*b3 + x4*b4 + ... + c
where b1, b2, b3, b4, ... are your modl.coef_ and c is the intercept. As you already realized, one of your biggest coefficients is 3.319e+02 ≈ 332, and the intercept is also quite big at about -5341.
As you already mentioned, a coefficient tells you how much the target variable changes if the corresponding feature changes by one unit while all other features are held constant.
So for your interpretation: the larger the absolute value of a coefficient, the larger the influence of that feature in your analysis. But it is important to note that the model uses a lot of large coefficients, which means your model does not depend on only one variable.
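To make the formula concrete, here is a small sketch (with made-up data, not your 38 variables) showing that a fitted scikit-learn LinearRegression prediction is exactly the intercept plus the dot product of coef_ with the feature values:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.random_sample((50, 3))                            # 50 samples, 3 made-up features
y = 10 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)

x_new = np.array([0.2, 0.5, 0.9])
manual = model.intercept_ + np.dot(model.coef_, x_new)    # c + x1*b1 + x2*b2 + x3*b3
print(manual, model.predict(x_new.reshape(1, -1))[0])     # the two values match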
Is activation only used for non-linearity, or for both problems? I am still confused about why we need an activation function and how it can help.
Generally, such a question would be suited for Stats Stackexchange or the Data Science Stackexchange, since it is a purely theoretical question, and not directly related to programming (which is what Stackoverflow is for).
Anyway, I am assuming that you are referring to the classes of linearly separable and not linearly separable problems when you talk about "both problems".
In fact, a non-linear activation function is always used, no matter which kind of problem you are trying to solve with a neural network. The simple reason for non-linearities as activation functions is the following:
Every layer in the network consists of a sequence of linear operations, plus the non-linearity.
Formally - and this is something you might have seen before - you can express the mathematical operation of a single layer F and its input h as:
F(h) = Wh + b
where W represents a matrix of weights, plus a bias b. This operation is purely sequential, and for a simple multi-layer perceptron (with n layers and without non-linearities), we can write the calculations as follows:
y = F_n(F_n-1(F_n-2(...(F_1(x)))))
which is equivalent to
y = W_n W_n-1 W_n-2 ... W_1 x + W_n ... W_2 b_1 + W_n ... W_3 b_2 + ... + b_n
Specifically, we note that these are only multiplications and additions, which we can rearrange in any way we like; particularly, we could aggregate this into one uber-matrix W_p and bias b_p, to rewrite it in a single formula:
y = W_p x + b_p
This has the same expressive power as the above multi-layer perceptron, but it can inherently be modeled by a single layer (while having far fewer parameters than before).
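You can check this collapse numerically; here is a small sketch with two random "layers" (shapes made up for illustration) showing that their composition is exactly one affine map:

import numpy as np

rng = np.random.RandomState(0)
W1, b1 = rng.randn(4, 3), rng.randn(4)   # layer 1: 3 inputs -> 4 units
W2, b2 = rng.randn(2, 4), rng.randn(2)   # layer 2: 4 inputs -> 2 units
x = rng.randn(3)

two_layers = W2 @ (W1 @ x + b1) + b2

# the same map, collapsed into a single layer
W_p = W2 @ W1
b_p = W2 @ b1 + b2
one_layer = W_p @ x + b_p

print(np.allclose(two_layers, one_layer))   # True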
Introducing non-linearities to this equation turns the simple "building blocks" F(h) into:
F(h) = g(Wh + b)
Now, this reformulation of a sequence of layers is no longer possible, and the non-linearity additionally allows us to approximate arbitrary functions.
EDIT:
To address another concern of yours ("how does it help?"), I should explicitly mention that not every problem is linearly separable, and such problems cannot be solved by a purely linear network (i.e. one without non-linearities). One classic simple example is the XOR operator.
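As a quick illustration (a sketch, not a rigorous experiment), scikit-learn's MLPClassifier lets you switch the activation between 'identity' (purely linear) and 'tanh'; only the non-linear version can fit XOR exactly:

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                       # XOR

for act in ("identity", "tanh"):
    clf = MLPClassifier(hidden_layer_sizes=(8,), activation=act,
                        solver="lbfgs", max_iter=2000, random_state=0)
    clf.fit(X, y)
    # a purely linear model cannot get all four XOR points right; the tanh net can
    print(act, clf.score(X, y))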
I'm new to Machine Learning. I'm building a simple model that should be able to predict the sin function.
I generated some sin values and am feeding them into my model:
import numpy as np
from math import sin

xs = np.arange(-10, 40, 0.1)
squarer = lambda t: sin(t)
vfunc = np.vectorize(squarer)   # (np.sin(xs) would do the same without the lambda)
ys = vfunc(xs)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(units=256, input_shape=(1,), activation="tanh"))
model.add(Dense(units=256, activation="tanh"))
# ... a number of layers here ...
model.add(Dense(units=256, activation="tanh"))
model.add(Dense(units=1))
model.compile(optimizer="sgd", loss="mse")
model.fit(xs, ys, epochs=500, verbose=0)
I then generate some test data, which overlaps my training data but also introduces some new data:
import matplotlib.pyplot as plt

test_xs = np.arange(-15, 45, 0.01)
test_ys = model.predict(test_xs)

plt.plot(xs, ys)
plt.plot(test_xs, test_ys)
plt.show()
The predicted data and the training data look as follows. The more layers I add, the more curves the network is able to learn, but the training time increases.
Is there a way to make it predict sin for any number of curves? Preferably with a small number of layers.
With a fully connected network I guess you won't be able to get arbitrarily long sequences, but with an RNN it looks like people have achieved this. A Google search will pop up many such efforts; I found this one quickly: http://goelhardik.github.io/2016/05/25/lstm-sine-wave/
An RNN learns a sequence based on a history of inputs, so it's designed to pick up these kinds of patterns.
I suspect the limitation you observed is akin to performing a polynomial fit. If you increase the degree of polynomial you can better fit a function like this, but a polynomial can only represent a fixed number of inflection points depending on the degree you choose. Your observation here appears the same. As you increase layers you add more non-linear transitions. However, you are limited by a fixed number of layers you chose as the architecture in a fully connected network.
An RNN does not work on the same principles, because it maintains a state and can use the state passed forward through the sequence to learn the pattern of a single period of the sine wave and then repeat that pattern based on the state information.
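A rough sketch of the RNN idea (assuming TensorFlow/Keras is installed; the windowing scheme and layer sizes below are made up for illustration): instead of regressing y directly on x, you frame the task as predicting the next value of the series from the previous `window` values, and then extrapolate by feeding the model its own predictions.

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

window = 20
series = np.sin(np.arange(0, 100, 0.1))

# inputs of shape (samples, window, 1), next-step targets
X = np.array([series[i:i + window] for i in range(len(series) - window)])[..., None]
y = series[window:]

model = Sequential([Input(shape=(window, 1)), LSTM(32), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=10, verbose=0)

# extrapolate past the training range by feeding predictions back in
seed = series[-window:]
preds = []
for _ in range(500):
    nxt = model.predict(seed[None, :, None], verbose=0)[0, 0]
    preds.append(nxt)
    seed = np.append(seed[1:], nxt)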
I am new to machine learning. I have a question regarding polynomial regression using one feature.
My understanding is that if there is one input feature, we can create a hypothesis function by taking the squares and cubes of that feature.
Suppose x1 is the input feature and our hypothesis function becomes something like this:
htheta(x) = theta0 + (theta1)x1 + (theta2)x1^2 + (theta3)x1^3.
My question is: what is the use case for such a scenario? For what type of data will this type of hypothesis function help?
This scenario is for simple curve-fitting problems. For example, you might have a spring and want to know how far the spring is stretched as a function of how much force you apply (the spring needn't be a linear spring obeying Hooke's law). You could build a model by collecting a bunch of measurements of different forces applied to the spring (measured in Newtons) and the resulting spring extension (also called displacement) in centimeters. You could then build a model of the form F(x) = theta_1 * x + theta_2 * x^3 + theta_3 * x^5 and fit the three theta parameters. You could of course do this with any other single-variable problem (height vs. age, weight vs. blood pressure, current vs. voltage). In practice, though, you generally have many more than just one input variable.
It is also worth pointing out that the transformations needn't be polynomial in the input variable (x in this case). You could just as well try logs, square roots, exponentials, etc. If you're asking why it is always a parameter times a function of the input variable, that is more of a modeling choice than anything (specifically, a linear model, since it's linear in theta). It does not have to be this way; it is a simple assumption that restricts the class of functions. Linear models also satisfy some intuitive statistical properties, which further justifies their use (see here).
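A short sketch of fitting such a single-feature polynomial with scikit-learn (made-up spring-like data; note the model stays linear in the theta parameters even though it is non-linear in x):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.linspace(0, 5, 50).reshape(-1, 1)                                  # e.g. force in Newtons
y = 1.5 * x[:, 0] + 0.3 * x[:, 0] ** 3 + rng.normal(scale=0.5, size=50)   # extension in cm

# builds the columns [1, x, x^2, x^3] and fits theta0..theta3 by least squares
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.predict([[2.0]]))   # predicted extension for a force of 2 N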
2 questions concerning machine learning algorithms like linear / logistic regression, ANN, SVM:
The models in these algorithms deal with data sets where each example has a number of features and one possible output value (e.g., predicting the price of a house from features f). But what if the features are enough to produce more than one piece of information about the item of interest, i.e., more than one output? Consider this example: a data set about cars where each example (car) has the features (initial velocity, acceleration, time). In the real world these features are enough to determine two quantities: velocity via v = v_i + a*t and distance via s = v_i*t + 0.5*a*t^2. So I want example X with features (x1, x2, ..., xn) to have outputs y1 and y2 in the same step, so that after training the model, if a new car example is given with initial velocity, acceleration, and time, the model will be able to predict the velocity and the distance at the same time. Is this possible?
In the house price prediction example, where example X is given with features (x1, x2, x3) and the model predicts the price, can the process be reversed by any means? Meaning, if I give the model an example X with features x1, x2 and the price y, can it predict the feature x3?
Depends on the model. A linear model such as linear regression cannot reliably learn the distance formula, since it is a cubic function of the given variables. You'd need to add v_i×t and a×t² as features to get a good prediction of the distance. A non-linear model such as an SVM regression with a cubic polynomial kernel, or a multi-layer ANN, should be able to learn this from the given features, though, given enough data.
More generally, predicting multiple values with a single model sometimes works and sometimes doesn't -- when in doubt, just fit several models.
You can try. Whether it'll work depends on the relation between the variables and the model.
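For the car example in question 1, here is a small sketch (with simulated data) combining both points from the answer: engineer the v_i*t and a*t^2 terms as extra features so a linear model can represent the physics, and pass a two-column y so a single model predicts velocity and distance at the same time:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
v_i = rng.uniform(0, 30, 1000)       # initial velocity
a   = rng.uniform(-3, 3, 1000)       # acceleration
t   = rng.uniform(0, 10, 1000)       # time

v = v_i + a * t                      # target 1: velocity
s = v_i * t + 0.5 * a * t ** 2       # target 2: distance

# engineered features that make both targets linear in the inputs
X = np.column_stack([v_i, a, t, a * t, v_i * t, a * t ** 2])
Y = np.column_stack([v, s])          # two outputs predicted at once

model = LinearRegression().fit(X, Y)
print(model.predict([[10, 2, 3, 2 * 3, 10 * 3, 2 * 3 ** 2]]))   # ≈ [[16., 39.]]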