Interaction of variables: significance in a GLM

Assume that we have the following GLM:
X ~ A + B + A:B
and assume that neither A nor B is significant but their interaction is. Should I keep only the interaction term and update my model to
X ~ A:B
or keep the model the way it is (X ~ A + B + A:B)? Also, what happens when, in the same model, A and the interaction are significant but B is not? Is the correct model then A + A:B or the full one, A + B + A:B?
Thank you for your time in advance!
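For concreteness, a minimal R sketch of how such nested candidates are usually compared (the data frame d, the response and the family are all hypothetical); a likelihood-ratio test tells you whether dropping a term actually hurts the fit:

# Hypothetical data frame d with response X and factors A and B
full     <- glm(X ~ A + B + A:B, data = d, family = gaussian)  # same as X ~ A*B
drop_B   <- glm(X ~ A + A:B,     data = d, family = gaussian)
int_only <- glm(X ~ A:B,         data = d, family = gaussian)

anova(drop_B,   full, test = "LRT")  # does the main effect of B matter?
anova(int_only, full, test = "LRT")  # do both main effects matter?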

Related

Three Way Interaction in mgcv

I am interested in implementing a three-way interaction in mgcv but while there has been some discussion both here and on Cross Validated, I have had trouble finding an answer to how exactly one should code a three-way interaction between two continuous variables and one categorical variable. In my study, I have only four variables (socioeconomic class (Socio), sex, year of death (YOD), and age) for each individual and I am curious how these variables explain the likelihood of someone being buried with burial goods (N=c.12,000).
I have read Pedersen et al. 2019 and have elected not to include global smooths in my model. However, I am not certain whether this means I should also not include the lower-order interaction terms in my three-way interaction model. For example, should my code be:
mgcv::gam(Goods ~ Socio + te(Age, YOD, by = Socio, k = 5),
          family = binomial(link = 'logit'), data = mydata, method = 'ML')
or should I still include the lower-order terms alongside the three-way interaction:
mgcv::gam(Goods ~ Socio + s(Age, by = Socio, k = 5) + s(YOD, by = Socio, k = 5) +
            te(Age, YOD, k = 5) + ti(Age, YOD, by = Socio, k = 5),
          family = binomial(link = 'logit'), data = mydata, method = 'ML')
or is there a different means of coding this?
Unless you want to perform a test on the interaction, you are better off not decomposing the te() into main and interaction effects. This is because the models
y ~ te(x1, x2)                  # main effects + interaction in one smooth
y ~ s(x1) + s(x2) + ti(x1, x2)  # decomposed model
aren't exactly equivalent:
1. you have to be very picky about setting k on all the terms so that the two models use the same number of basis functions, and
2. the decomposed model uses more smoothness parameters than the te() version, so it is a slightly more complex model even if you arrange for all the bases to be of roughly comparable size (see the sketch below).
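To illustrate the bookkeeping, a minimal sketch of matching the marginal basis sizes by hand (the response y, covariates x1 and x2, and data frame dat are hypothetical):

library(mgcv)

# One tensor-product smooth: 5 basis functions per marginal
m_te <- gam(y ~ te(x1, x2, k = c(5, 5)), data = dat, method = "ML")

# Decomposed version: the marginal bases must be matched to the te() fit by
# hand, and the model still carries extra smoothness parameters
m_ti <- gam(y ~ s(x1, k = 5) + s(x2, k = 5) + ti(x1, x2, k = c(5, 5)),
            data = dat, method = "ML")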
Note that you are including a global smooth in the second model: the term te(Age, YOD, k = 5) will be a global smooth of Age and YOD over the whole data set.
The decomposed version of your first model would be:
Goods ~ Socio +
s(Age, by = Socio) +
s(YOD, by = Socio) +
ti(Age, YOD, by = Socio)
Setting things up to test whether you need the factor by terms would require more work; I think you'd be better off doing that post hoc on the te() model, where you can compare the fitted surfaces by differencing them.
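A rough sketch of that post-hoc comparison, assuming a fitted te() model m and two hypothetical Socio levels "high" and "low": predict both surfaces over a common grid on the link scale and difference them.

# Common grid over the covariates (ranges are made up for illustration)
grid <- expand.grid(Age = seq(0, 80, length.out = 50),
                    YOD = seq(1800, 1900, length.out = 50))

# Predicted surface on the link scale for each Socio level
grid$Socio <- factor("high", levels = c("high", "low"))
p_high <- predict(m, newdata = grid, type = "link")
grid$Socio <- factor("low", levels = c("high", "low"))
p_low  <- predict(m, newdata = grid, type = "link")

# Pointwise difference between the fitted surfaces; for a proper test you
# would also want the standard error of this difference, which can be built
# from predict(m, ..., type = "lpmatrix")
surf_diff <- p_high - p_low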

gretl - dummy interactions

There does not seem to be an "easy" way (as there is in R or Python) to create interaction terms between dummy variables in gretl. Do we really need to code them manually, which will be difficult for factors with many levels? Here is a minimal example of manual coding:
open credscore.gdt
SelfemplOwnRent = OwnRent * Selfempl
# model 1
ols Acc 0 OwnRent Selfempl SelfemplOwnRent
My manual interaction term will not scale to factors with many levels, and in fact it does not even do the job for binary variables.
Thanks,
ML
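(For comparison, the "easy" route the question alludes to in R would be something like the sketch below, assuming the same variables sit in a hypothetical data frame d; the * operator expands the main effects and the interaction automatically:)

# OwnRent * Selfempl is shorthand for OwnRent + Selfempl + OwnRent:Selfempl,
# and factors with many levels are expanded into dummies automatically
fit <- lm(Acc ~ OwnRent * Selfempl, data = d)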
One way of doing this is to use lists. Use the dummify command to generate dummies for each level and the ^ operator to create the interactions. Example:
open griliches.gdt
discrete med
list X = dummify(med)
list D = dummify(mrt)
list INT = X^D
ols lw 0 X D INT
The command discrete turns your variable into a discrete variable and allows you to use dummify (this step is not necessary if your variable is already discrete). All the interaction terms are now stored in the list INT, and you can easily access them in the following ols command.
@Markus Loecher, on your second question: you can always use the rename command to rename a series, so you would have to loop over all elements of the list INT to do so. However, I would rather suggest renaming both input series (mrt and med in the example above) before computing the interaction terms if you want shorter series names.

SPSS version 23, MIXED module: maximum dummy variables?

I am using the MIXED routine with repeated measures. I have 10 dummy variables (0/1) and 8 scale variables as fixed effects. The results keep showing that one of the dummy variables is redundant. I played around with the order in which the dummy and scale variables are listed; usually the last-listed dummy variable gets flagged as redundant. Is there a maximum number of dummy variables that should be included in the model? Eight of the dummy variables refer to 8 geographical regions of a country.
To understand why SPSS "kicks out" one of the dummy variables, you should look at where these dummies come from.
Let's say we have a dependent variable y measured on a sample of objects, and these objects come from 8 regions, recorded in a nominal variable x. In a flat regression model, we model the relation between y and x:
y = a + bx + e.
We want to know the value of b. But x is a nominal variable, so the categories (regions) are not numbers but names, and names don't fit in the above equation.
You have probably recoded x into dummies x1 through x8. Now look at the records in your data and their scores for x and the dummy variables. Here's an example of one record:
x x1 x2 x3 x4 x5 x6 x7 x8
8 0 0 0 0 0 0 0 1
If you look at the dummy variables one by one and you get to x7, you know that the first 7 are all zeroes, so for this record x8 must be 1. This is what SPSS means when it "kicks out" a redundant variable. The phenomenon is called perfect collinearity: the information in the last dummy you add to the model is redundant, because it is already in there.
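A minimal R sketch of the same phenomenon (data are simulated purely for illustration); with an intercept plus all 8 dummies, the design matrix is rank-deficient and the last coefficient cannot be estimated:

set.seed(1)
region <- factor(sample(1:8, 100, replace = TRUE))
y <- rnorm(100)

X <- model.matrix(~ region - 1)  # all 8 dummy columns, no category left out
fit <- lm(y ~ X)                 # intercept + 8 dummies: perfect collinearity
coef(fit)                        # the last dummy's coefficient comes back NA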
In conclusion: leave out one of the dummies. The dummy variable you leave out serves as the reference category in your model. For each of the other dummies, the coefficient tells you how much the records with that value/category of x differ from the reference category that was left out.
There are different ways to code your dummy variables so that you use the mean as the reference instead of one of the categories. Take a look at dummy coding on Wikipedia.
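For instance, mean-referenced ("deviation" or sum-to-zero) coding in R, reusing the simulated region factor from the sketch above:

# Each coefficient now compares a region to the grand mean rather than to a
# left-out reference category
contrasts(region) <- contr.sum(8)
fit_dev <- lm(y ~ region)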
I also like this article that explains how degrees of freedom work. Although I hadn't mentioned that term before, it touches on the very same idea of how dummy coding works.

How to use non-target attributes for time series prediction in Weka?

I have used the Weka time series plugin with algorithms like SMOReg (with RegSMOImproved and RegSMO) and HoltWinters, but for all of them I've observed that lag variables are created only for target attributes.
How does one have lag variables created for other (non-target) attributes so that the algorithm uses these too?
E.g., I have 5 attributes "date, a, b, c, d", of which I have to predict "a", i.e. "a" is the "target" attribute.
I've observed that lag variables are created only for "date" and "a", and that none of b, c or d are used by the algorithms.
Note that "overlay" does not really help me, because I don't have "future" values for any of b, c or d.
What I need is for lag variables to be created for b, c and d, and for them to be used for prediction by the chosen algorithm.
==================== update ====================
I tried the following approach:
1. use the "filters->unsupervised->Copy" filter to make multiple (14) copies of the a, b, c, d variables
2. use the "filters->unsupervised->TimeSeriesDelta" filter to shift the copies by consecutive values (e.g., 1st copy by 1 day, 2nd copy by 2 days, ..., 14th copy by 14 days)
3. use SMOReg from the "classify" panel (with a 70% split) instead of from the "forecast" panel (with 0.3 hold-out training evaluation)
but faced the following barriers:
1. I can classify (regress, actually, since the target is numeric) only 1 variable at a time
2. it did not accept the "date" attribute (even though the "date" values are numeric: 20150601, 20150602, 20150603, and so on)
3. it ran for a long time and then crashed :(
Any guidance will be greatly appreciated.
PS: the above example is contrived. In my real example, I have date + 8 attributes (all of them numeric), and 3 of them are targets (multivariate forecasting).
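(For what it's worth, the copy-and-shift workaround above amounts to building lagged feature columns by hand before handing the data to any learner. A minimal sketch of that idea in R, with hypothetical column names, since the Weka GUI steps can't be shown as code:)

# Hypothetical data frame dat with a target column a and predictors b, c
make_lags <- function(x, lags) {
  sapply(lags, function(k) c(rep(NA, k), head(x, -k)))  # shift x down by k
}

lags  <- 1:14
feats <- cbind(make_lags(dat$b, lags), make_lags(dat$c, lags))
colnames(feats) <- c(paste0("b_lag", lags), paste0("c_lag", lags))

train <- na.omit(data.frame(a = dat$a, feats))  # drop rows lost to lagging
fit   <- lm(a ~ ., data = train)                # any regressor can stand in here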
==================== update ====================
https://github.com/log0ymxm/weka-timeseriesforecasting/blob/master/src/main/java/weka/classifiers/timeseries/core/TSLagMaker.java#L2974
shows that the extra (non-target) attributes are being removed, because line #3027 says:
// otherwise, this is some attribute that we are not predicting and
// wont be able to determine the value for when forecasting future
// instances. So we can't let the model use it.
==================== update ====================
https://github.com/log0ymxm/weka-timeseriesforecasting/blob/master/src/main/java/weka/classifiers/timeseries/WekaForecaster.java#L576
shows that the fields to lag are the same as the fields to forecast.
From the last 2 updates, I found that:
- non-target attributes are removed
- only target attributes are lagged
I thought this was algorithm-specific (e.g. HoltWinters, etc.), but it's a feature/bug of the timeseriesForecasting plugin itself.
Basically, what I want isn't possible without a code change :(

Multiple, Binomial Dependent Variables for GLM (or LME4) in R

Hi everyone. I'm pretty new to R. I've been trying to educate myself about this issue, but I've continued to run into roadblocks.
I have a data set with two categorical independent variables: habitat (1, 2, 3) and site (1, 2, 3, 4, 5). My response variables are the presence or absence of AFLP loci. I have 96 loci, and I want to determine which, if any, of these loci are significantly associated with habitat (site is a random effect). Each locus can be assumed to be independent of the others.
As far as relevance to other researchers goes, this is a problem that people trying to analyze molecular data with GLM or LME will increasingly run into.
Here is my code:
## Independent variables
Site    <- AFLP$Site      ## AFLP is my data file
Habitat <- AFLP$Habitat
## Dependent variables
Loci <- AFLP[, 4:99]
## Establishing matrix of variables
mydata <- cbind(Site, Habitat, Loci)
## glm
model1 <- glm(Loci ~ (1|Site) + Habitat, data = mydata, family = "binomial")
I get this error:
Error in model.frame.default(formula = Loci ~ (1 | Site) + Habitat, data = mydata, :
invalid type (list) for variable 'Loci'
I know this error is associated with the data type of Loci; however, I've tried a bunch of things and still can't figure out how to correctly address the issue.
My problem seems to be similar to the ones in the below links, but again, I haven't been able to figure out how to apply this information to my data set.
http://stackoverflow.com/questions/18067519/using-r-to-do-a-regression-with-multiple-dependent-and-multiple-independent-vari
https://stats.stackexchange.com/questions/26585/how-to-do-a-generalized-linear-model-with-multiple-dependent-variables-in-r
Thank you in advance. If this turns out to have a simple answer, I apologize for taking up space. I have been Googling and trying to educate myself, but I haven't made any headway.
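(A hedged sketch of one way around the error, under the assumption that each locus really is analyzed independently: glm cannot take a whole data frame as the response, and the (1|Site) random-effect syntax belongs to lme4 rather than glm, so one option is to fit a separate binomial mixed model per locus with lme4::glmer:)

library(lme4)

loci <- names(AFLP)[4:99]  # one 0/1 column per locus

# One binomial mixed model per locus: Habitat fixed, Site random
fits <- lapply(loci, function(locus) {
  f <- reformulate(c("Habitat", "(1|Site)"), response = locus)
  glmer(f, data = AFLP, family = binomial)
})
names(fits) <- loci
# With 96 models, any per-locus p-values would need a multiple-testing
# correction (e.g. p.adjust with method = "fdr")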
