Time series with multiple independent variables - time-series

It's been a while since I worked with time series data.
I have to build a model using data from the past 8 years. The dataset contains one dependent variable, price, and a few independent variables (let's assume there are 2). Each independent variable has its own problems: trend, seasonality, or both.
date        price (y)   x1       x2
01-01-2022  8           34.674   1.3333
02-01-2022  6           68.542   2.0
03-01-2022  5           44.523   4.0001
How should I approach this task? Should I apply transformations to each independent variable? What model options do I have, and which are suitable for time series with multiple independent variables?
As I understand it, vector autoregression (VAR) would be incorrect here, as I want to predict only one variable (price).

Related

generalized linear mixed model output spss

I am writing my master's thesis and I ran a generalized linear mixed regression model in SPSS (version 28) using count data.
Research question: what effect does population mobility have on Covid-19 incidence at the federal state level in Germany during the period from February 2020 to November 2021?
To test the effect of population mobility (independent variable) on Covid-19 incidence (dependent variable), hierarchical models were used, with fixed factors:
mobility variables in 6 places (scale)
cumulative vaccination rate (second dose only) (scale)
season (summer as the reference category) (nominal)
and random effects:
one model with the days variable (time level) (scale)
a second model with the federal states variable (each state numbered from 1 to 16; place level) (nominal)
a third model with both days and federal states (time and place level)
First, I built an intercept-only model to check which type of regression is more suitable for the count data (Poisson or negative binomial) and to choose the better of two candidate offset variables. Based on AIC/BIC, negative binomial regression fit the data best.
Secondly, I checked collinearity among the original 6 mobility variables and excluded those that were highly correlated based on VIF (only one variable was excluded).
Thirdly, I built 7 generalized linear models by gradually adding the fixed effects (the 5 remaining mobility variables, the cumulative vaccination rate for dose 2, and season, with summer as the reference category) to the intercept-only model. From these 7 models, the one with the best fit was selected as the final model.
Finally, I built generalized linear mixed models from this final model by adding a classic random effect: first the days variable only (random-intercept component for time; TIME level), then the federal states variable only (random-intercept component for place; PLACE level), and finally both together.
I am not sure whether I ran this last step (the generalized linear mixed models) correctly.
These are my steps:
Analyze -> Mixed Models -> Generalized Linear Mixed Models -> Fields and Effects:
1. Target -> case
Target distribution and relationship (link) with the linear model -> custom:
Distribution -> negative binomial
Link function -> log
2. Fixed effects -> include intercept & 5 mobility variables & cumulative vaccination rate & season
3. Random effects -> no intercept & days variable (TIME level)
Random effect covariance type: variance component
4. Weight and offset -> use offset field -> log expected cases adjusted wave variable
Build options (general, estimation, etc.) left unchanged (SPSS defaults)
Model options (estimated means, etc.) left unchanged (SPSS defaults)
I followed the same steps for the other 2 models, differing only in the random effects:
3. Random effects -> no intercept & federal state variable (PLACE level)
3. Random effects -> no intercept & days variable & federal state variable (TIME & PLACE level)
Output:
1. The variance of the random effect of the days variable (time level) was very small, 5.565E-6, indicating only a marginal effect in the model. (MODEL 1)
2. The covariance of the random effect of the federal states was zero and the variance was 0.079 (place level). (MODEL 2)
3. The variance of the random effect of the days variable was very small, 4.126E-6; the covariance of the random effect of the federal states was zero and the variance was 0.060 (time and place level). (MODEL 3)
Can someone please check my steps, tell me which of the models from the last step is best for presenting the results, and also explain the last point in the output shown in the picture?
Thanks in advance to all of you.

Darts: Methods for Efficiently Predicting on a Test set (without retraining)

I am using the TFTModel. After training (and validating) with the fit method, I would like to predict all data points in the train, validation, and test sets using the already trained model.
Currently, there are only these two methods:
historical_forecasts: supports predicting for multiple time steps (with the corresponding look-backs), but only for a single time series
predict: supports predicting for multiple time series, but only for the n next time steps
What I am looking for is a method like historical_forecasts that accepts lists of series, past_covariates, and future_covariates and predicts them without retraining. My best attempt so far is to run the following code block on an already trained model:
predictions = []
for s, past_cov, future_cov in zip(series, past_covariates, future_covariates):
    predictions.append(model.historical_forecasts(
        s,
        past_covariates=past_cov,
        future_covariates=future_cov,
        retrain=False,
        start=model.input_chunk_length,
        verbose=True,
    ))
Here series, past_covariates, and future_covariates are lists of target time series and covariates respectively, each consisting of the concatenated train, val, and test series, which I split again afterwards to ensure the availability of the past values needed for predicting at the start of val and test.
My question: is there a more efficient way to do this through better batching with the current interface, or would I have to call the torch model myself?

cox proportional hazard regression in SPSS using reference group

I am running a Cox proportional hazards regression in SPSS to see the association of a predictor with the risk of a disease over a 10-year follow-up. I have another variable, 'age_quartiles', with values 1, 2, 3, 4, and I want to use '1' as the reference to get HRs for 2, 3, and 4 relative to '1'. When I put this variable in Strata, I still get a single 'HR', as follows ('S_URAT_07' is the predictor, with continuous values):
Question: How do I get HRs for the predictor based on 'age_quartiles' 2, 3, and 4, keeping 1 as the reference group? 'age_quartiles' is not a predictor here. Am I supposed to choose a specific method?
As I answered yesterday to this same question on Cross Validated:
The model you're fitting involves only the one parameter for changes in hazard as S_URAT_07 varies (e.g., the B is the change in log hazard for a single unit increase in S_URAT_07), regardless of the level of age_quartiles. What differs by age_quartiles is the baseline hazard function when it's used as a strata or stratification variable, and the hazards are then no longer proportional.
If you specify age_quartiles as a factor (called a categorical covariate in COXREG) rather than a strata variable, you'll again get a single coefficient for S_URAT_07, but also a set of three coefficients that reflect proportionally differing baselines for each level of age_quartiles. You can specify simple contrasts on the factor with the first level as the reference category to reflect comparisons with that category.
If you specify age_quartiles as a factor and also include the interaction between it and S_URAT_07, then you get separate proportional baseline hazard functions, but also allow the impact of S_URAT_07 to differ depending on the age_quartiles level.

Learning from time-series data to predict time-series (not forecasting)

I have a number of datasets, each of which contains a number of input variables (let's say 3) as time series and an output variable, also a time series, all over the same time period.
Each of these series has the same number of data points (say 1000*10 = 10,000 if 10 seconds of data were gathered at 1000 Hz).
I want to learn from this data so that, given a new dataset with 3 input time series, I can predict the time series for the output variable.
I will write the problem below in some non-standard notation. I will avoid terms like features, sample, and target, because I haven't yet formulated the problem for any algorithm and don't want to prejudge what will map to what.
Datasets to learn from look like this:
dataset1:{Inputs=(timSeries1,timSeries2,timSeries3), Output=(timSeriesOut)}
dataset2:{Inputs=(timSeries1,timSeries2,timSeries3), Output=(timSeriesOut)}
dataset3:{Inputs=(timSeries1,timSeries2,timSeries3), Output=(timSeriesOut)}
.
.
datasetn:{Inputs=(timSeries1,timSeries2,timSeries3), Output=(timSeriesOut)}
Now, given a new (timSeries1, timSeries2, timSeries3), I want to predict (timSeriesOut):
datasetPredict: {Inputs=(timSeries1,timSeries2,timSeries3), Output = ?}
What technique should I use, and how should the problem be formulated? Should I just break it into a separate learning problem for each timestamp, with three features and one target (either for that timestamp or the next)?
Thank you all!
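The per-timestamp framing suggested at the end can be sketched as a simple baseline: stack every time step of every dataset as one row and fit a single map from the 3 inputs at time t to the output at time t. The sketch below uses numpy least squares on synthetic data where the true map is linear; real signals with memory would need lagged inputs or a sequence model instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n_datasets, T = 5, 1000

# Synthetic collection: the output is a fixed linear map of the three inputs
datasets = []
for _ in range(n_datasets):
    X = rng.normal(size=(T, 3))                      # timSeries1..3
    y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=T)
    datasets.append((X, y))

# Per-timestamp framing: every time step of every dataset becomes one row
X_all = np.vstack([X for X, _ in datasets])          # (n_datasets*T, 3)
y_all = np.concatenate([y for _, y in datasets])

coef, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)

# Predict the full output series for a new, unseen dataset
X_new = rng.normal(size=(T, 3))
y_pred = X_new @ coef
```

If the output at time t also depends on earlier inputs, the same framing still works after augmenting each row with lagged values of the three input series.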

Are data dependencies relevant when preparing data for neural network?

Data: I have N rows of data like this: (x, y, z), where logically f(x, y) = z, i.e. z depends on x and y; in my case (setting1, setting2, signal). Different x's and y's can lead to the same z, but those z's wouldn't mean the same thing.
There are 30 unique setting1 values, 30 unique setting2 values, and 1 signal for each (setting1, setting2) pairing, hence 900 signal values.
Data set: These [900,3] data points are considered one data set. I have many samples of these data sets.
I want to make a classification based on these data sets, but I need to flatten the data (turn each data set into one row). If I flatten it, I will duplicate all the setting values (setting1 and setting2) 30 times, i.e. I will have a row with 3 x 900 = 2700 columns.
Question:
Is it correct to keep all the duplicate setting1, setting2 values in the data set? Or should I remove them and include only the unique values a single time, i.e. have a row with 30 + 30 + 900 = 960 columns? I'm worried that the logical dependency of the signal on the settings will be lost this way. Is this relevant? Or shouldn't I bother including the settings at all (e.g. due to correlations)?
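The two flattening options under discussion can be made concrete with a small numpy sketch (the setting grid and signal values below are synthetic placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
s1 = np.repeat(np.arange(30), 30)   # setting1 for each of the 900 rows
s2 = np.tile(np.arange(30), 30)     # setting2 for each of the 900 rows
sig = rng.normal(size=900)          # one signal per (setting1, setting2) pair

# Option A: naive flattening -> 3 x 900 = 2700 columns, settings duplicated
flat_full = np.concatenate([s1, s2, sig])

# Option B: each unique setting kept once -> 30 + 30 + 900 = 960 columns
flat_dedup = np.concatenate([np.unique(s1), np.unique(s2), sig])
```

In option B no information is lost as long as the 900 signal values are always laid out in the same (setting1, setting2) order, because the pairing is then implicit in the column position.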
If I understand correctly, you are training a NN on a sample where each observation is [900,3].
You are flattening it and getting an input layer of 3 x 900.
Some of those values are the result of a function of the others.
Which function matters: if it is a linear function, the NN might not work well.
From here:
"If inputs are linearly dependent then you are in effect introducing
the same variable as multiple inputs. By doing so you've introduced a
new problem for the network, finding the dependency so that the
duplicated inputs are treated as a single input and a single new
dimension in the data. For some dependencies, finding appropriate
weights for the duplicate inputs is not possible."
Also, if you add dependent variables you risk biasing the NN towards those variables.
E.g., if you run LMS on [x1, x2, x3, average(x1, x2)] to predict y, you effectively assign a higher weight to x1 and x2.
Unless you have a reason to believe those weights should be higher, don't include their function.
I was not able to find a link to support this, but my intuition is that you might want to decrease your input layer beyond just omitting the dependent values:
From Professor A. Ng's ML course I remember that the input should be the minimum set of values that is 'reasonable' for making the prediction.
'Reasonable' is vague, but I understand it like this: if you try to predict the price of a house, include footage, area quality, and distance from a major hub; do not include average sunspot activity during the open-house day, even though you have that data.
I would remove the duplicates; I would also look for any other data that can be omitted, and maybe run PCA over the full set of N x [900,3].
