Time series modeling with fable and cross-validation

I am building a time series model using fable and cross-validation to determine the best model definition to use. Is there a risk in fitting
model(ETS(GDP))
vs
model(ETS(GDP ~ error('A') + trend('A') + season('A'))) and the other ETS specifications?
I am asking because, when I perused the mable from **model(ETS(GDP))**, the chosen model differed across some .id values: for example, ETS(A,A,A) for .id = 1, ETS(A,Ad,A) for .id = 2, etc. If this is the case, is it correct to define all the variants of ETS explicitly in order to ensure consistency?
Here is a mable I am referring to:
# A mable: 7 x 5
# Key: .id, LOB [7]
.id LOB ETS ETS_Exponential ARIMA_Exponential
<int> <chr> <model> <model> <model>
1 1 LG <ETS(A,N,N)> <ETS(A,N,N)> <ARIMA(0,0,1) w/ mean>
2 2 LG <ETS(M,N,N)> <ETS(A,N,N)> <ARIMA(0,0,1) w/ mean>
3 3 LG <ETS(A,N,N)> <ETS(A,N,N)> <ARIMA(0,0,1) w/ mean>
4 4 LG <ETS(A,N,N)> <ETS(A,N,N)> <ARIMA(0,0,1) w/ mean>
5 5 LG <ETS(A,N,N)> <ETS(M,N,N)> <ARIMA(0,0,1) w/ mean>
6 6 LG <ETS(A,N,N)> <ETS(M,N,N)> <ARIMA(0,0,0) w/ mean>
7 7 LG <ETS(A,N,N)> <ETS(M,N,N)> <ARIMA(0,0,0) w/ mean>
Thanks.

Why would you want the models to be the same? For example, if you wanted to compare model parameters for some reason, then you might want to fit the same model to all series. But if you just want good forecasts, you are probably better off having different models for different series -- some will be trended, some will be seasonal, etc., and you probably need to allow for that.
If in doubt, you could try both approaches and see which one gives the best forecasts (assuming that is what your ultimate purpose is here).
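To make "try both approaches" concrete, here is a minimal sketch of fitting the automatic and explicit specifications side by side. The tsibble `gdp` (keyed by .id, with a GDP column) and the model names are my assumptions, not taken from the question; adapt them to your data.

```r
# Sketch only: `gdp` is an assumed tsibble keyed by .id with a GDP measure.
library(fable)
library(fabletools)

fits <- gdp %>%
  model(
    ETS_auto = ETS(GDP),  # fable selects a specification per series
    ETS_AAA  = ETS(GDP ~ error("A") + trend("A") + season("A")),
    ETS_AAdA = ETS(GDP ~ error("A") + trend("Ad") + season("A"))
  )

accuracy(fits)  # in-sample comparison; use stretch_tsibble() for a CV comparison
```

The per-series automatic choice and the forced common specification then sit in the same mable, so you can compare them directly on forecast accuracy.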


Loop to select which variable to omit from analysis

I have datasets with a large number of variables and I need to run PCA over these datasets with one variable removed each time. Below are 20 variables from an example dataset. I would like to run PCA with one variable removed from each PCA solution: for example, the first PCA solution will include all variables excluding Var_1_GroupA, the second will include all variables excluding Var_2_GroupA, etc. I am familiar with using macros to write loops but am unsure how to complete this task using macros or Python code.
Var_1_GroupA
Var_2_GroupA
Var_1_GroupB
Var_2_GroupB
Var_3_GroupB
Var_1_GroupC
Var_2_GroupC
Var_3_GroupC
Var_4_GroupC
Var_5_GroupC
Var_1_GroupD
Var_1_GroupE
new_Var_1_GroupA
new_Var_1_GroupB
new_Var_1_GroupC
new_Var_2_GroupC
Var_1_GroupF
Var_1_GroupG
Var_1_GroupH
Var_2_GroupH
In the example below I create 10 variables, and then run a simple means command with a different set of variables each time - excluding one of the variables at a time. You can edit the code to match your variables and your analysis code.
data list list/var1 to var10 (10F1).
begin data
1 2 3 4 5 6 7 8 9 9
5 4 3 6 3 8 1 2 5 8
0 8 6 4 2 1 3 5 7 9
end data.
dataset name wrk.
define !loopit (!pos=!cmdend)
!do !a !in(!1)
means
!do !b !in(!1) !if (!b<>!a) !then !b !ifend !doend
.
!doend
!enddefine.
!loopit var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 .
Note: you have to list the variable names in the macro call; you can't use var1 to var10.
If you run into trouble while adapting this to your exact needs, these are very helpful in debugging macros:
set mexpand=on.
set mprint=on.

Selecting a cut-off score in SPSS

I have 5 variables for one questionnaire about social support. I want to define the group with low vs. high support. According to the authors low support is defined as a sum score <= 18 AND two items scoring <= 3.
It would be great to get a dummy variable which shows which people are low vs high in support.
How can I do this in the syntax?
Thanks ;)
Assuming your variables are named Var1, Var2 .... Var5, and that they are consecutive in the dataset, this should work:
recode Var1 to Var5 (1 2 3=1)(4 thru hi=0) into L1 to L5.
compute LowSupport = sum(Var1 to Var5) <= 18 and sum(L1 to L5)>=2.
execute.
New variable LowSupport will have value 1 for rows that have the parameters you defined and 0 for other rows.
Note: If your variables are not consecutive you'll have to list all of them instead of using Var1 to var5.

Update model : 3-way interaction terms not dropped

My question is highly related to this one:
R update() interaction term not dropped
However, I don't have multiple categories in my predictor variables, so I don't understand how my issue relates to the answer. Maybe I'm just not understanding it...
I'd like to remove the insignificant 3-way interaction terms in a model reduction process one at a time.
However, the following happens:
model1 <- lme(sum.leafmass ~ stand.td.Sept.2017*stand.wtd.Sept.2017*I((stand.td.Sept.2017)^2)*I((stand.wtd.Sept.2017)^2), random = ~1|block/fence, method="ML", data=subset(Total.CiPEHR, species=="EV"), na.action=na.omit)
model2 <- update(model1,.~.-stand.td.Sept.2017:stand.wtd.Sept.2017:I((stand.td.Sept.2017)^2):I((stand.wtd.Sept.2017)^2))
summary(model2) ## works correctly: the insignificant 4-way interaction is eliminated
DF t-value p-value
(Intercept) 4 3.849259 0.0183
stand.td.Sept.2017 4 -1.436666 0.2242
stand.wtd.Sept.2017 4 -2.921806 0.0432
I((stand.td.Sept.2017)^2) 4 4.594303 0.0101
I((stand.wtd.Sept.2017)^2) 4 -0.313197 0.7698
stand.td.Sept.2017:stand.wtd.Sept.2017 4 -1.301935 0.2629
stand.td.Sept.2017:I((stand.td.Sept.2017)^2) 4 1.853451 0.1374
stand.wtd.Sept.2017:I((stand.td.Sept.2017)^2) 4 4.354757 0.0121
stand.td.Sept.2017:I((stand.wtd.Sept.2017)^2) 4 -0.028199 0.9789
stand.wtd.Sept.2017:I((stand.wtd.Sept.2017)^2) 4 1.598564 0.1852
I((stand.td.Sept.2017)^2):I((stand.wtd.Sept.2017)^2) 4 -1.683214 0.1676
stand.td.Sept.2017:stand.wtd.Sept.2017:I((stand.td.Sept.2017)^2) 4 1.972616 0.1198
stand.td.Sept.2017:stand.wtd.Sept.2017:I((stand.wtd.Sept.2017)^2) 4 -1.635314 0.1773
stand.td.Sept.2017:I((stand.td.Sept.2017)^2):I((stand.wtd.Sept.2017)^2) 4 2.190518 0.0936
stand.wtd.Sept.2017:I((stand.td.Sept.2017)^2):I((stand.wtd.Sept.2017)^2) 4 -0.968249 0.3877
##attempt to remove insignificant 3-way interaction
model3 <- update(model2,.~.,-stand.wtd.Sept.2017:I((stand.td.Sept.2017)^2):I((stand.wtd.Sept.2017)^2))
summary(model3)
DF t-value p-value
(Intercept) 4 3.849259 0.0183
stand.td.Sept.2017 4 -1.436666 0.2242
stand.wtd.Sept.2017 4 -2.921806 0.0432
I((stand.td.Sept.2017)^2) 4 4.594303 0.0101
I((stand.wtd.Sept.2017)^2) 4 -0.313197 0.7698
stand.td.Sept.2017:stand.wtd.Sept.2017 4 -1.301935 0.2629
stand.td.Sept.2017:I((stand.td.Sept.2017)^2) 4 1.853451 0.1374
stand.wtd.Sept.2017:I((stand.td.Sept.2017)^2) 4 4.354757 0.0121
stand.td.Sept.2017:I((stand.wtd.Sept.2017)^2) 4 -0.028199 0.9789
stand.wtd.Sept.2017:I((stand.wtd.Sept.2017)^2) 4 1.598564 0.1852
I((stand.td.Sept.2017)^2):I((stand.wtd.Sept.2017)^2) 4 -1.683214 0.1676
stand.td.Sept.2017:stand.wtd.Sept.2017:I((stand.td.Sept.2017)^2) 4 1.972616 0.1198
stand.td.Sept.2017:stand.wtd.Sept.2017:I((stand.wtd.Sept.2017)^2) 4 -1.635314 0.1773
stand.td.Sept.2017:I((stand.td.Sept.2017)^2):I((stand.wtd.Sept.2017)^2) 4 2.190518 0.0936
stand.wtd.Sept.2017:I((stand.td.Sept.2017)^2):I((stand.wtd.Sept.2017)^2) 4 -0.968249 0.3877
##3-way interaction term still there.
Why won't the interaction term drop? The predictor variables are continuous and so should be independent of each other, right?
Please explain if I'm missing something basic here.
Solved my own question.
It was a simple syntax error: I had an extra comma after the .~. portion.
###Incorrect syntax.
model3 <- update(model2,.~.,-stand.wtd.Sept.2017:I((stand.td.Sept.2017)^2):I((stand.wtd.Sept.2017)^2))
###Correct syntax.
model3 <- update(model2,.~.-stand.wtd.Sept.2017:I((stand.td.Sept.2017)^2):I((stand.wtd.Sept.2017)^2))
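The same mechanics can be shown with a tiny base-R example (using lm() rather than lme(); the data are made up). With the comma, "- x:y" is passed as a separate argument rather than as part of the formula, so the formula ". ~ ." leaves the model unchanged:

```r
# Hypothetical data, just to illustrate the update() formula mechanics
set.seed(1)
d <- data.frame(x = rnorm(20), y = rnorm(20))
d$resp <- d$x + d$y + rnorm(20)

m  <- lm(resp ~ x * y, data = d)   # includes the x:y interaction
m2 <- update(m, . ~ . - x:y)       # correct: the "- x:y" is part of the formula

"x:y" %in% names(coef(m))   # TRUE
"x:y" %in% names(coef(m2))  # FALSE -- interaction dropped
```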

Generating means of a variable using dummy variables & foreach in Stata

My dataset includes TWO main variables X and Y.
Variable X represents distinct codes (e.g. 001X01, 001X02, etc) for multiple computer items with different brands.
Variable Y represents the tax charged for each code of variable X (e.g. 15 = 15% for 001X01) at a store.
I've created categories for these computer items using dummy variables (e.g. HD dummy variable for Hard-Drives, takes value of 1 when variable X represents a HD, etc). I have a list of over 40 variables (two of them representing X and Y, and the rest is a bunch of dummy variables for the different categories I've created for computer items).
I would like to display the averages of all these categories using a loop in Stata, but I'm not sure how to do this.
For example the code:
mean Y if HD == 1
Mean estimation Number of obs = 5
--------------------------------------------------------------
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
Tax | 7.1 2.537716 1.154172 15.24583
gives me the mean Tax for the category representing Hard Drives. How can I use a loop in Stata to automatically display all the mean Taxes charged for each category? I would do it by hand without a problem, but I want to repeat this process for multiple years, so I would like to use a loop for each year in order to come up with this output.
My goal is to create a separate Excel file with each of the computer categories I've created (38 total) and the average tax for each category by year.
Why bother with the loop and creating the indicator variables? If I understand correctly, your initial dataset allows the use of a simple collapse:
clear all
set more off
input ///
code tax str10 categ
1 0.15 "hd"
2 0.25 "pend"
3 0.23 "mouse"
4 0.29 "pend"
5 0.16 "pend"
6 0.50 "hd"
7 0.54 "monitor"
8 0.22 "monitor"
9 0.21 "mouse"
10 0.76 "mouse"
end
list
collapse (mean) tax, by(categ)
list
To take the results to Excel you can try export excel or putexcel.
Run help collapse and help export for details.
Edit
Because you insist, below is an example that gives the same result using loops.
I assume the same data input as before. Some testing using this example database
with expand 1000000, shows that speed is virtually the same. But almost surely,
you (including your future you) and your readers will prefer collapse.
It is much clearer, cleaner and more concise. It is even prettier.
levelsof categ, local(parts)
gen mtax = .
quietly {
foreach part of local parts {
summarize tax if categ == "`part'", meanonly
replace mtax = r(mean) if categ == "`part'"
}
}
bysort categ: keep if _n == 1
keep categ mtax
Stata has features that make it quite different from other languages. Once you
start getting a hold of it, you will find that many things done with loops elsewhere,
can be made loop-less in Stata. In many cases, the latter style will be preferred.
See corresponding help files using help <command> and if you are not familiarized with saved results (e.g. r(mean)), type help return.
A supplement to Roberto's excellent answer: After collapse, you will need a loop to export the results to excel.
levelsof categ, local(levels)
foreach x of local levels {
export excel `x', replace
}
I prefer to use numerical codes for variables such as your category variable. I then assign them value labels. Here's a version of Roberto's code which does this and which, for closer correspondence to your problem, adds a "year" variable
input code tax categ year
1 0.15 1 1999
2 0.25 2 2000
3 0.23 3 2013
4 0.29 1 2010
5 0.16 2 2000
6 0.50 1 2011
7 0.54 4 2000
8 0.22 4 2003
9 0.21 3 2004
10 0.76 3 2005
end
#delim ;
label define catl
1 hd
2 pend
3 mouse
4 monitor
;
#delim cr
label values categ catl
collapse (mean) tax, by(categ year)
levelsof categ, local(levels)
foreach x of local levels {
export excel `:label (categ) `x'', replace
}
The #delim ; command makes it possible to easily list each code on a separate line. The "label" function in the export statement is an extended macro function that inserts a value label into the file name.

splitting space delimited entries into new columns in R

I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, representing multi-select questions (i.e. questions with more than one response). In the end I want to parse these space-delimited entries into their own columns and create headers for them so I know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0
I imagine this is a relatively common issue to deal with but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv ? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible response. However, if munging your dataset into a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of reponses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you are trying to do, or what your reasons are for coding it like this. Thus my advice is more general -- feel free to clarify and I will try to give a more concrete response.
1) I assume that you are coding the survey on your own, which is great because it means you have influence over your .csv file. I would NEVER use different kinds of separators in the same .csv file. Just do the naming from the very beginning, as you suggested in the second block.
Otherwise you might get into trouble with checkboxes, for example. Say someone checks 3 out of 5 possible answers and the next person checks only 1 (i.e. "don't know"). It is then much harder to build a spreadsheet (data.frame) style view of the results, as opposed to having an empty field (which turns into an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to run a panel survey (i.e. a longitudinal study asking the same participants repeatedly). That (among many other things) would be a good reason to consider saving your data to a MySQL database instead of a .csv file. RMySQL can connect directly to the database and access its tables and, more importantly, its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal / opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.
To prevent re-inventing the wheel you might want to check out LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. The TYPO3 CMS extensions pbsurvey and ke_questionnaire should work well too (I have only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
now you have a data.frame with 0s, so what you need to do is to get 1s in an appropriate position (this is a bit cumbersome), a function will do the job...
# first change the spaces to commas and convert the character value to a numeric vector
# (here x is one cell of a multi-select column, e.g. "3 5 99")
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assign 1 according to the indexes in z
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you may want to set up the data.frame of _M indicator variables first, so you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix with logicals, and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame, and m is matrix with logicals.
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M",colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
vec <- as.vector(dat[,i]) #turn into vector
b <- lapply(strsplit(vec," "),as.numeric) #split up and turn into numeric
c <- sort(unique(unlist(b))) #which values were used
newcolnames <- paste(colnames(dat[i]),"_",c,sep="") #column names
e <- matrix(nrow=responses,ncol=length(c)) #create new matrix for indicators
colnames(e) <- newcolnames
#next loop looks for responses and puts indicators in the correct places
#(use j here so we don't reuse the outer loop's i)
for (j in 1:responses) {
e[j,] <- ifelse(c %in% b[[j]],1,0)
}
dat <- cbind(dat,e)
}
Suggestions for improvement are welcome.
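In that spirit, here is a loop-free variant of the same indicator expansion as a base-R function. This is a sketch under the same assumptions as above (a data frame `dat` whose multi-select columns end in "_M" and hold space-delimited codes); the function name is my own.

```r
# Expand every "_M" column into one 0/1 indicator column per observed response
expand_multiselect <- function(dat) {
  for (col in grep("_M$", names(dat), value = TRUE)) {
    resp <- strsplit(as.character(dat[[col]]), " ")            # split each cell
    levs <- sort(unique(as.numeric(unlist(resp))))             # responses actually used
    ind  <- t(vapply(resp,
                     function(r) as.numeric(levs %in% as.numeric(r)),
                     numeric(length(levs))))                   # one indicator row per record
    colnames(ind) <- paste(col, levs, sep = "_")               # e.g. Q2_M_1, Q2_M_2, ...
    dat <- cbind(dat, ind)
  }
  dat
}

dat2 <- expand_multiselect(data.frame(Q1 = c(6, 6),
                                      Q2_M = c("1 2 88", ""),
                                      stringsAsFactors = FALSE))
names(dat2)  # "Q1" "Q2_M" "Q2_M_1" "Q2_M_2" "Q2_M_88"
```

Empty cells simply get 0 in every indicator column; if you need them kept as NA (as in the desired output above), add a check on `length(r) == 0` inside the vapply.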
