Linear Regression with time variant independent variables - time-series

In this model I would be trying to predict HDI of a country.
Model looks something like this:
HDI(t) ~ Life Expectancy(t) + GNI (t)
However life expectancy and gni themselves are time variant independent variables. Is there anyway to account for this in linear regression such that i can predict
HDI(2018) using data from years before 2017? Is there anyway for linear regression to capture this?
Would be creating dummy variables like country be of any help?

As you mentioned HDI(t) is depending on Life Expectancy and GNI at the same time t so if you don't have their values in t it would be a guess.
By the way what are you looking for something like a and b in HDI = a * Life Expectancy + b * GNI?
You can assume a and b are time invariant and solve the problem for 2017 (or for several years and calculate mean) then you'll have them :)

Related

Search for the optimal value of x for a given y

Please help me find an approach to solving the following problem: Let X is a matrix X_mxn = (x1,…,xn), xi is a time series and a vector Y_mx1. To predict values ​​from Y_mx1, let's train some model, let linear regression. We get Y = f (X). Now we need to find X for some given value of Y. The most naive thing is brute force, but what are the competent ways to solve such problems? Perhaps there is a use of the scipy.optimize package here, please enlighten me.
get an explanation or matherial to read for understanding
Most scipy-optimize algorithm use gradient method, for those optimization problem, we could apply these into re-engineering of data (find the best date to invest in the stock market...)
If you want to optimize the result, you should choose a good step size and suitable optimize method.
However, we should not classify tge problem as "predict" of xi because what we are doing is to find local/global maximum/minimum.
For example Newton-CG, your data/equation should contain all the information needed/a simulation, but no prediction is made from the method.
If you want to do a pretiction on "time", you could categorize the time data in "year,month..." then using unsupervise learning to "group" the data. If trend is obtained, then we can re-enginning the result to know the time

Time series with multiple independent variables

its been a while since I worked with time series data.
I have to build a model with a data for past 8 years. A dataset contains one dependent variable - price and few independent variables (lets assume, there are 2 independent variables). Each independent variable has its problems - trend, seasonality or both.
date
price (y)
x1
x2
01-01-2022
8
34.674
1.3333
02-01-2022
6
68.542
2.0
03-01-2022
5
44.523
4.0001
How should I approach this task? Should I apply transformations to each independent variable? What model options do I have? Which are suitable for time series with multiple independent variables?
As I understand, Vector auto regression (VAR) would be incorrect, as I want to predict only one feature (price).

Why the hypothesis has to introduce two parameters, namely θ0 and θ1

I was learning Machine Learning from this course on Coursera taught by Andrew Ng. The instructor defines the hypothesis as a linear function of the "input" (x, in my case) like the following:
hθ(x) = θ0 + θ1(x)
In supervised learning, we have some training data and based on that we try to "deduce" a function which closely maps the inputs to the corresponding outputs. To deduce the function, we introduce the hypothesis as a linear function of input (x). My question is, why the function involving two θs is chosen? Why it can't be as simple as y(i) = a * x(i) where a is a co-efficient? Later we can go about finding a "good" value of a for a given example (i) using an algorithm? This question might look very stupid. I apologize but I'm not very good at machine learning I am just a beginner. Please help me understand this.
Thanks!
The a corresponds to θ1. Your proposed linear model is leaving out the intercept, which is θ0.
Consider an output function y equal to the constant 5, or perhaps equal to a constant plus some tiny fraction of x which never exceeds .01. Driving the error function to zero is going to be difficult if your model doesn't have a θ0 that can soak up the D.C. component.

Polynomial regression for one input feature

I am new to machine learning. I am having a question regarding polynomial regression using one feature.
My understanding is that if there is one input feature, we can create a hypothesis function by taking the squares and cubes the feature.
Suppose x1 is the input feature and our hypothesis function becomes something like this :
htheta(x) = theta0 + (theta1)x1 + (theta2)x1^2 + (theta3)x1^3.
My question is what is the use case of such scenario ? In what type of data, this type of hypothesis function will help ?
This scenario is for simple curve fitting problems. For example, you might have a spring and want to know how far the spring is stretched as a function of how much force you apply (the spring needn't be a linear spring obeying Hooke's law). You could build a model by collecting a bunch of measurements of different forces applied on the spring (measured in Newtons) and the resulting spring extension (also called displacement) in centimeters. You could then build a model of the form F(x) = theta_1 * x + theta_2 * x^3 + theta_3 * x^5 and fit the three theta parameters. You could of course do this with any other single variable problem (height vs. age, weight vs. blood pressure, current vs. voltage). In practice, you generally have many more than just a one dependent variable though.
Also worth pointing out that the transformations needn't be polynomial in the dependent variable (x in this case). You could just as well try logs, square roots, exponentials etc. If you're asking why is it always a parameter times a function of the input variable, this is more of a modeling choice than anything (specifically a linear model since it's linear in theta). It does not have to be this way and is a simple assumption that restricts the class of functions. Linear models also satisfy some intuitive statistical properties which also justify their use (see here)

Continuous Regression in Machine Learning

Suppose we have a set of inputs (named x1, x2, ..., xn) that give us the output y. The goal is to predict y from some values of x1... xn that were not seem yet. It's clear to me that this problem can be modelled as a Regression problem on the realm of Machine Learning.
However, let's say data keep coming. I'm able to predict y from x1... xn. Furthermore, I'm able to check afterwards whether or not that prediction was a good one. If it was a good one, everything is fine. On the other hand, I would like to update my model in case that prediction deviates a lot from the real y. The one way I can see this is to insert this new data on my training set and train the regression algorithm again. Two problems arise from that. First, it may cost more than I can afford to recompute my module from scratch from time to time. Second, I may already have too much data on my training set so that new coming data is negligible. However, the new coming data might be more import than the older ones due to the nature of my problem.
It seems that a good solution would be to compute a kind of continuous regression that is more related to the new data than to the older one. I have searched for such approach but I have not found anything relevant. Perhaps I'm looking at the wrong direction. Does anyone have a clue on how to do it?
If you want to consider the newer data more important you have to use weights. Usually it is called
sample_weight
in fit() function in scikit-learn (if you use this library).
Weights can be defined as 1 / (time pass from this current observation).
Now about the second problem. If the recalculation takes much time you can cut your observations and use the latest ones. Fit your model on the whole data and on the fresh one + some part of the old data and check how much your weights are changed. I suppose if you really have a dependence between {x_i} and {y} you don't need the whole dataset.
Otherwise you can use weights again. But for now you will weight weights in the model:
model for old data: w1*x1 + w2*x2 + ...
model for new data: ~w1*x1 + ~w2*x2 + ...
common model: (w1*a1_1 + ~w1*a1_2)*x1 + (w2*a2_1 + ~w2*a2_2)*x2 + ...
Here a1_1, a2_1 are the weights for 'old model', a2_1, a2_2 - for new one, w1, w2 - coefficients of old model, ~w1, ~w2 - of the new one.
Parameters {a} can be estimated as in the first bullet (be hands), but you also can create another linear model to estimate them. But my advice: don't use non-linear regression for {a} - you will overfit.

Resources