Saving a gamlss model to RDS format

I'm fitting an R gamlss model:
set.seed(1)
df <- data.frame(group = c(rep("g1", 100), rep("g2", 100), rep("g3", 100)),
                 value = c(rgamma(100, rate = 5, shape = 3),
                           rgamma(100, rate = 5, shape = 4),
                           rgamma(100, rate = 5, shape = 5)))
df$group <- factor(df$group, levels = c("g1", "g2", "g3"))
gamlss.fit <- gamlss::gamlss(formula = value ~ group, sigma.formula = ~ group,
                             data = df, family = gamlss.dist::GA(mu.link = "log"))
This is what I get:
> gamlss.fit
Family: c("GA", "Gamma")
Fitting method: RS()
Call: gamlss::gamlss(formula = value ~ group, sigma.formula = ~group, family = gamlss.dist::GA(mu.link = "log"), data = df)
Mu Coefficients:
(Intercept)      groupg2      groupg3
    -0.5392       0.2553       0.5162
Sigma Coefficients:
(Intercept)      groupg2      groupg3
   -0.66318      0.02355     -0.08610
Degrees of Freedom for the fit: 6 Residual Deg. of Freedom 294
Global Deviance: 217.18
AIC: 229.18
SBC: 251.402
I want to save this gamlss.fit model in RDS format for later use. The saveRDS call itself works fine:
saveRDS(gamlss.fit, "my.gamlss.fit.RDS")
But if I then terminate the current R session, open a new one, and read the saved model back in, I get:
Call: gamlss::gamlss(formula = value ~ group, sigma.formula = ~group,
family = gamlss.dist::GA(mu.link = "log"), data = df)
No coefficients
Degrees of Freedom: Total (i.e. Null); 294 Residual
Error in signif(x$null.deviance, digits) :
non-numeric argument to mathematical function
So I cannot really use this object for anything downstream.
I thought that tidypredict's parse_model function might come in handy, but it doesn't seem to support parsing the gamlss model:
> gamlss.parsed.fit <- tidypredict::parse_model(gamlss.fit)
Error: Functions inside the formula are not supported.
- Functions detected: `gamlss`,`gamlss.dist`,`GA`. Use `dplyr` transformations to prepare the data.
This saveRDS problem seems specific to gamlss, because if I fit a glm model:
glm.fit <- glm(formula = value ~ group, data = df, family = Gamma(link = "log"))
Which gives:
> glm.fit
Call: glm(formula = value ~ group, family = Gamma(link = "log"), data = df)
Coefficients:
(Intercept)      groupg2      groupg3
    -0.5392       0.2553       0.5162
Degrees of Freedom: 299 Total (i.e. Null); 297 Residual
Null Deviance: 93.25
Residual Deviance: 79.99 AIC: 226.9
I get the same after reading it back from the saved RDS file:
Call: glm(formula = value ~ group, family = Gamma(link = "log"), data = df)
Coefficients:
(Intercept)      groupg2      groupg3
    -0.5392       0.2553       0.5162
Degrees of Freedom: 299 Total (i.e. Null); 297 Residual
Null Deviance: 93.25
Residual Deviance: 79.99 AIC: 226.9
By the way, tidypredict's parse_model doesn't support parsing a glm model either:
> glm.parsed.fit <- tidypredict::parse_model(glm.fit)
Error: Functions inside the formula are not supported.
- Functions detected: `Gamma`. Use `dplyr` transformations to prepare the data.
Any idea whether and how a gamlss model can be saved without using the save function, whose drawbacks are discussed here?
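A likely explanation, judging from the print output above (hedged, but easy to test): the restored object is intact, and the problem is only that the new session has not loaded gamlss. The object's class vector includes "glm" and "lm", so without gamlss attached, print() falls through to print.glm, which looks for components such as null.deviance that a gamlss fit does not store, producing exactly the signif() error shown. A minimal sketch of the workaround under that assumption:

## In the new session, attach gamlss BEFORE touching the restored object,
## so S3 dispatch finds gamlss's own print/predict methods:
library(gamlss)
gamlss.fit <- readRDS("my.gamlss.fit.RDS")
gamlss.fit  # should now print the full gamlss summary again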

Related

Clustered resampling for inner layer of Caret recursive feature elimination

I have data where IDs are contained within clusters.
I would like to perform recursive feature elimination using caret's rfe function, which follows the resampled RFE procedure (Algorithm 2 in the caret feature-selection documentation):
Clustered resampling for the outer layer (line 2.1 of the algorithm) is straightforward, using the index parameter.
However, within each outer resample, I would like to tune the tuning parameters using cluster-based cross-validation (inner resampling, line 2.9). Model tuning in the inner layer is possible by specifying a tuneGrid in rfe and supplying an appropriate trControl. It is this trControl that I would like to change to allow clustered resampling.
The outer resampling is specified in the rfeControl parameter of rfe.
The inner resampling is specified by trControl of rfe which is passed to train.
The trouble I am having is that I can't seem to specify any inner indices, because after the outer resampling, those indices are no longer valid or no longer present in the outer-resampled data.
I am looking for a way to tell train to take an outer resample (which will be missing a cluster against which to validate) and to tune the model using inner resampling based on folds of the remaining clusters (see the sketch after the trainControl stub below).
The MWE is as minimal as possible:
library(caret)
library(tidyverse)
library(parallel)
library(doParallel)
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
### Create some random data, 10 features, with some influence over a binomial outcome
set.seed(42)
id <- 1:1000
cluster <- rep(1:10, each = 100)
dat <- data.frame(id, cluster, replicate(10,rnorm(n = 1000, mean = runif(1, 0,100)+cluster, sd = runif(1, 0,20))))
dat <- dat %>% mutate(temp = rowSums(across(X1:X10)), prob = range01(temp), outcome = rbinom(n = nrow(dat), size = 1, prob = prob))
dat$outcome <- as.factor(dat$outcome)
levels(dat$outcome) <- c("control", "case")
dat$outcome <- factor(dat$outcome, levels=rev(levels(dat$outcome)))
### Manual outer folds based on cluster ###
unit_indices <- lapply(1:10, function(i) which(dat$cluster != i))
### Inner resampling method (THIS IS WHAT I'D LIKE TO CHANGE) ###
cv5 <- trainControl(classProbs = TRUE, method = "cv", number = 5, allowParallel = F) ## Is there a way to have inner cluster-based resampling WITHIN the outer cluster-based resampling?
caret_rfe_functions <- list(
  summary = twoClassSummary,
  fit = function(x, y, first, last, ...) {
    train(x, y, ...)
  },
  pred = caretFuncs$pred,
  rank = function(object, x, y) {
    vimp <- varImp(object)$importance
    vimp <- vimp[order(vimp$Overall, decreasing = TRUE), , drop = FALSE]
    vimp$var <- rownames(vimp)
    vimp
  },
  selectSize = function(x, metric = "ROC", tol = 1, maximize = TRUE) {
    if (!maximize) {
      best <- min(x[, metric])
      perf <- (x[, metric] - best) / best * 100
      flag <- perf <= tol
    } else {
      best <- max(x[, metric])
      perf <- (best - x[, metric]) / best * 100
      flag <- perf <= tol
    }
    min(x[flag, "Variables"])
  },
  selectVar = caretFuncs$selectVar
)
caret_rfe_ctrl <- rfeControl(
  functions = caret_rfe_functions,
  saveDetails = TRUE,
  index = unit_indices,
  indexOut = NULL,
  returnResamp = "all",
  allowParallel = TRUE, ### change this if you don't want to / can't go parallel
  verbose = TRUE
)
#### Feature selection ####
set.seed(42)
cl <- makePSOCKcluster(10) ### for parallel processing if available
registerDoParallel(cl)
rfe_profile_nnet <- rfe(
  form = outcome ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10,
  data = dat,
  sizes = seq(2, 10, 1),
  rfeControl = caret_rfe_ctrl,
  ## pass options to train()
  method = "nnet",
  preProc = c("center", "scale"),
  metric = "ROC",
  tuneGrid = expand.grid(size = 1:5, decay = 5),
  trControl = cv5) ### I would like to change this to allow inner cluster-based resampling
stopCluster(cl)
rfe_profile_nnet
plot(rfe_profile_nnet)
Presumably the inner cluster-based resampling would be achieved by specifying a new trainControl containing some dynamic inner index based on the outer resample that is selected at the time:
inner_cluster_tune <- trainControl(classProbs = TRUE,
                                   index = {insert magic here}, ### This is the important bit
                                   returnResamp = "all",
                                   summaryFunction = twoClassSummary,
                                   allowParallel = FALSE) ### especially if the outer resample is parallelised
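One way to build that index, sketched below: rebuild leave-one-cluster-out folds from the clusters that survive each outer resample, with positions re-indexed against the outer training set (make_inner_index is a hypothetical helper, not a caret function). The remaining obstacle, as noted above, is that rfe passes a single static trControl to every outer resample, so this only constructs the folds for one outer resample at a time:

## Hypothetical helper: inner leave-one-cluster-out folds, with row positions
## relative to the outer training set rather than the original data
make_inner_index <- function(outer_train_idx, clusters) {
  inner_clusters <- clusters[outer_train_idx]  # clusters left in this outer resample
  lapply(unique(inner_clusters),
         function(cl) which(inner_clusters != cl))
}
## e.g. inner folds for the first outer resample:
inner_cluster_tune <- trainControl(classProbs = TRUE,
                                   index = make_inner_index(unit_indices[[1]], dat$cluster),
                                   returnResamp = "all",
                                   summaryFunction = twoClassSummary,
                                   allowParallel = FALSE)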
If you try with the original cluster indices, e.g.
inner_cluster_tune <- trainControl(classProbs = TRUE,
                                   index = unit_indices,
                                   returnResamp = "all",
                                   summaryFunction = twoClassSummary,
                                   allowParallel = FALSE)
there are various warnings about missing data in the resamples, and messages like:
24: In `[<-.data.frame`(`*tmp*`, , object$method$center, value = structure(list(... :
  provided 81 variables to replace 9 variables
All help greatly appreciated.
As a postscript question: you can see which parameters were used within your rfe like so:
> rfe_profile_nnet$fit
Neural Network
1000 samples
8 predictor
2 classes: 'case', 'control'
Pre-processing: centered (8), scaled (8)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results across tuning parameters:
  size  Accuracy  Kappa
  1     0.616     0.1605071
  2     0.616     0.1686937
  3     0.620     0.1820503
  4     0.618     0.1788491
  5     0.618     0.1788063
Tuning parameter 'decay' was held constant at a value of 5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 3 and decay = 5.
But does anyone know whether this refers to one, or all, of the outer resamples? Presumably the same tuning parameters won't necessarily be chosen across all outer resamples.
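If I recall caret's rfe internals correctly (hedged; verify against your caret version), $fit is neither: it is the final model refit on the full training data using the selected subset size, not any single outer resample. The per-resample details are stored separately:

rfe_profile_nnet$resample      # performance for each outer resample
rfe_profile_nnet$optVariables  # variables chosen for the final refit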

Hide p_value and put stars to significant OR gtsummary

I'm using the gtsummary package.
I need to merge different univariate logistic regressions, and for a clean presentation I want to hide the p-values and bold or star the significant ORs (p < 0.05).
Can anyone help me?
Maybe it's easier to use another presentation type like kable or huxtable, I don't know?
Thank you for your help.
Have a nice day.
There is a function called add_significance_stars() that hides the p-value and adds stars to the estimate indicating various levels of statistical significance. I've also added code to bold the estimate if significant with modify_table_styling().
library(gtsummary)
#> #BlackLivesMatter
packageVersion("gtsummary")
#> [1] '1.4.0'
tbl <-
  trial %>%
  select(death, age, grade) %>%
  tbl_uvregression(
    y = death,
    method = glm,
    method.args = list(family = binomial),
    exponentiate = TRUE
  ) %>%
  # add significance stars to significant estimates
  add_significance_stars() %>%
  # additionally bold significant estimates
  modify_table_styling(
    columns = estimate,
    rows = p.value < 0.05,
    text_format = "bold"
  )
Created on 2021-04-14 by the reprex package (v2.0.0)
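Since the question also asks about merging several univariate models: gtsummary provides tbl_merge() (side by side) and tbl_stack() (one on top of the other). A minimal sketch, assuming tbl1 and tbl2 are tables built as above:

tbl_merge(
  tbls = list(tbl1, tbl2),
  tab_spanner = c("**Model 1**", "**Model 2**")
)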
Here's a quick huxtable version:
l1 <- glm(I(cyl==8) ~ gear, data = mtcars, family = binomial)
l2 <- glm(I(cyl==8) ~ carb, data = mtcars, family = binomial)
huxtable::huxreg(l1, l2, statistics = "nobs", bold_signif = 0.05)
─────────────────────────────────────────────────
                       (1)             (2)
              ─────────────────────────────
(Intercept)        5.999 *        -1.880 *
                  (2.465)         (0.902)
gear              -1.736 *
                  (0.693)
carb                               0.579 *
                                  (0.293)
              ─────────────────────────────
nobs                  32              32
─────────────────────────────────────────────────
  *** p < 0.001; ** p < 0.01; * p < 0.05.

Column names: names, model1, model2
It doesn't show it here, but the significant coefficients are bold on screen (and in any other kind of output).

CFA in data with 3 levels - estimating factor scores at level 2?

I am working on a dataset with 3 levels:
Teacher
School
Country
Using survey responses from the teachers, the aim is to use Confirmatory Factor Analysis (CFA) with the ultimate goal of having the factor scores at the school level.
A further objective is to test for measurement invariance across countries.
I want to use the lavaan package in R because it can deal with the complex survey design of my data through the lavaan.survey extension (sampling design, weights, etc.).
I have done some preliminary analysis where I use country ID as the group argument in the cfa function. This lets me perform the measurement invariance analysis across countries. The issue is that the factor scores are given at the individual teacher level, and I am interested in the school level.
Any ideas about how to get these factor scores at the school level?
Here are some examples of the functions I use. I do not think a data sample is needed, but I will create one if requested.
library(lavaan)
library(semTools)
#define model
reduced_mod <-'
leadership_sup =~ TC3G22D + TC3G22E + TC3G22K
continous_develop_collab =~ TT3G32A + TT3G32B + TT3G32C + TT3G32D '
#Fit model with different restraints:
fit_no_restraint <- cfa(model = reduced_mod, data = cfa_data, group="countryID")
fit_metric <- cfa(model = reduced_mod, data = cfa_data, group="countryID", group.equal = c("loadings"))
fit_scalar <- cfa(model = reduced_mod, data = cfa_data, group="countryID", group.equal = c("loadings", "intercepts"))
#Compare fit statistics
compareFit(scalar = fit_scalar , metric = fit_metric , config = fit_no_restraint)
It seems that you want multilevel measurement invariance. You should use measEq.syntax() from the semTools package:
## ---------------------
## Multilevel Invariance
## ---------------------
## To test invariance across levels in a MLSEM, specify syntax as though
## you are fitting to 2 groups instead of 2 levels.
mlsem <- ' f1 =~ y1 + y2 + y3
f2 =~ y4 + y5 + y6 '
## metric invariance
syntax.metric <- measEq.syntax(configural.model = mlsem, meanstructure = TRUE,
ID.fac = "std.lv", sample.nobs = c(1, 1),
group = "cluster", group.equal = "loadings")
## by definition, Level-1 means must be zero, so fix them
syntax.metric <- update(syntax.metric,
change.syntax = paste0("y", 1:6, " ~ c(0, NA)*1"))
## save as a character string
mod.metric <- as.character(syntax.metric, groups.as.blocks = TRUE)
## convert from multigroup to multilevel
mod.metric <- gsub(pattern = "group:", replacement = "level:",
x = mod.metric, fixed = TRUE)
## fit model to data
fit.metric <- lavaan(mod.metric, data = Demo.twolevel, cluster = "cluster")
summary(fit.metric)
Source
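To then pull the factor scores out at the school (cluster) level, lavPredict() with level = 2 should return the between-level scores for a two-level fit like the one above (hedged; assumes a lavaan version with multilevel support, 0.6 or later):

## between-level (school-level) factor scores from the two-level fit
school_scores <- lavPredict(fit.metric, level = 2)
head(school_scores)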

Why does RMSE increase with horizon when using the timeslice method in caret's trainControl function?

I'm using the timeslice method in caret's trainControl function to perform cross-validation on a time series model. I've noticed that RMSE increases with the horizon argument.
I realise this might happen for several reasons, e.g., if explanatory variables are being forecast and/or there's autocorrelation in the data such that the model can better predict nearer vs. farther ahead observations. However, I'm seeing the same behaviour even when neither is the case (see trivial reproducible example below).
Can anyone explain why RMSEs increase with the horizon?
# Make data
X = data.frame(matrix(rnorm(1000 * 3), ncol = 3))
X$y = rowSums(X) + rnorm(nrow(X))
# Iterate over different forecast horizons and record RMSEs
library(caret)
forecast_horizons = c(1, 3, 10, 50, 100)
rmses = numeric(length(forecast_horizons))
for (i in 1:length(forecast_horizons)) {
ctrl = trainControl(method = 'timeslice', initialWindow = 500, horizon = forecast_horizons[i], fixedWindow = T)
rmses[i] = train(y ~ ., data = X, method = 'lm', trControl = ctrl)$results$RMSE
}
print(rmses) #0.7859786 0.9132649 0.9720110 0.9837384 0.9849005
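One plausible explanation, offered as a hypothesis rather than a statement about caret's documented behaviour: this is an aggregation artifact. caret computes RMSE within each slice and then averages across slices; because sqrt() is concave, the mean of per-slice RMSEs sits below the pooled RMSE (Jensen's inequality), and the gap is largest at horizon = 1, where each slice's "RMSE" is a single absolute error. Pure noise reproduces the pattern:

## With iid N(0,1) errors and no forecastability at all:
set.seed(1)
e <- rnorm(1e5)
mean(abs(e))     # ~0.798 (= sqrt(2/pi)), what horizon = 1 averages toward
sqrt(mean(e^2))  # ~1.000, what large horizons average toward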

How can I change the max sequence length in a Tensorflow RNN Model?

I am currently trying to adapt my TensorFlow classifier, which tags a sequence of words as positive or negative, to handle much longer sequences without retraining. My model is an RNN with a max sequence length of 210. One input is one word (300 dimensions); I vectorised the words with Google's word2vec, so I can feed a sequence of at most 210 words. My question is: how can I change the max sequence length to, for example, 3000, for classifying movie reviews?
My working model with a fixed max sequence length of 210 (tf version 1.1.0):
import tensorflow as tf

n_chunks = 210     # max sequence length
chunk_size = 300   # word-vector dimension (word2vec)
# rnn_size and n_classes are hyperparameters defined elsewhere in the full script

x = tf.placeholder("float", [None, n_chunks, chunk_size])
y = tf.placeholder("float", None)
seq_length = tf.placeholder("int64", None)

with tf.variable_scope("rnn1"):
    lstm_cell = tf.contrib.rnn.LSTMCell(rnn_size, state_is_tuple=True)
    lstm_cell = tf.contrib.rnn.DropoutWrapper(lstm_cell, input_keep_prob=0.8)
    outputs, _ = tf.nn.dynamic_rnn(lstm_cell, x, dtype=tf.float32,
                                   sequence_length=seq_length)
    fc = tf.contrib.layers.fully_connected(outputs, 1000,
                                           activation_fn=tf.nn.relu)
    output = tf.contrib.layers.flatten(fc)
    #*1
    logits = tf.contrib.layers.fully_connected(output, n_classes,
                                               activation_fn=None)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    logits=logits, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)
...
# train
# train_x padded to fit (batch_size * n_chunks * chunk_size);
# train_seq_lengths holds the true length of each sequence in the batch
sess.run([optimizer, cost], feed_dict={x: train_x, y: train_y,
                                       seq_length: train_seq_lengths})
# predict:
...
pred = tf.nn.softmax(logits)
pred = sess.run(pred, feed_dict={x: word_vecs, seq_length: sq_l})
Modifications I have already tried:
1. Replacing n_chunks with None and simply feeding the data in:

x = tf.placeholder(tf.float32, [None, None, 300])
# model fails to build
# ValueError: The last dimension of the inputs to `Dense` should be defined.
# Found `None`.
# at *1
...
# all entries in word_vecs still have the same length, for example
# 3000 (batch_size * 3000 (!= n_chunks) * 300)
pred = tf.nn.softmax(logits)
pred = sess.run(pred, feed_dict={x: word_vecs, seq_length: sq_l})
2. Changing x and then restoring the old model:

x = tf.placeholder(tf.float32, [None, n_chunks * 10, chunk_size])
...
saver = tf.train.Saver(tf.all_variables(), reshape=True)
saver.restore(sess, "...")
# fails as well:
# InvalidArgumentError (see above for traceback): Input to reshape is a
# tensor with 420000 values, but the requested shape has 840000
# [[Node: save/Reshape_5 = Reshape[T=DT_FLOAT, Tshape=DT_INT32,
# _device="/job:localhost/replica:0/task:0/cpu:0"](save/RestoreV2_5,
# save/Reshape_5/shape)]]
# run prediction
If it is possible, could you please provide a working example, or explain why it isn't?
I am just wondering: why not just assign n_chunks a value of 3000?
In your first attempt, you cannot use two Nones, since TF cannot infer how many dimensions to use for each one. The first dimension is set to None because it depends on the batch size. In your second attempt, you change only one place, but the other places where n_chunks is used may conflict with the x placeholder.
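For what it's worth, the ValueError at *1 in the first attempt comes from flattening across an unknown time dimension. A common TF 1.x workaround, sketched here under the question's shapes (not the asker's code): replace flatten() with a gather of each sequence's last valid output, so the dense layer always sees a fixed [batch, rnn_size] tensor however long the sequences are.

# assumes outputs is [batch, time, rnn_size] from dynamic_rnn and the
# seq_length placeholder holds each sequence's true length
batch_size = tf.shape(outputs)[0]
max_time = tf.shape(outputs)[1]
rnn_size_static = outputs.get_shape().as_list()[-1]  # static last dim

flat = tf.reshape(outputs, [-1, rnn_size_static])    # [batch*time, rnn_size]
idx = tf.range(batch_size) * max_time + (tf.cast(seq_length, tf.int32) - 1)
last_output = tf.gather(flat, idx)                   # [batch, rnn_size]

logits = tf.contrib.layers.fully_connected(last_output, n_classes,
                                           activation_fn=None)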
