Hide p_value and put stars to significant OR gtsummary - p-value

I'm using the gtsummary package.
I need to merge several univariate logistic regressions and, to get a clean presentation, I want to hide the p-value and bold or add a star to the significant ORs (p < 0.05).
Can anyone help me?
Maybe it's easier to use another presentation type like kable or huxtable, I don't know.
Thank you for your help.
Have a nice day

There is a function called add_significance_stars() that hides the p-value and adds stars to the estimate indicating various levels of statistical significance. I've also added code to bold the estimate if significant with modify_table_styling().
library(gtsummary)
#> #BlackLivesMatter
packageVersion("gtsummary")
#> [1] '1.4.0'
tbl <-
  trial %>%
  select(death, age, grade) %>%
  tbl_uvregression(
    y = death,
    method = glm,
    method.args = list(family = binomial),
    exponentiate = TRUE
  ) %>%
  # add significance stars to sig estimates
  add_significance_stars() %>%
  # additionally bolding significant estimates
  modify_table_styling(
    columns = estimate,
    rows = p.value < 0.05,
    text_format = "bold"
  )
Created on 2021-04-14 by the reprex package (v2.0.0)
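To make the idea concrete without any table package, here is a minimal base R sketch of the same presentation: one univariate logistic regression per predictor, the odds ratio shown, the p-value hidden and replaced by stars. The mtcars data, the eight_cyl outcome and the star() helper are illustrative assumptions, not part of gtsummary; the cut-offs follow the usual legend (*** p < 0.001; ** p < 0.01; * p < 0.05).

```r
# mtcars is only a stand-in dataset; eight_cyl is a made-up binary outcome
dat <- transform(mtcars, eight_cyl = as.integer(cyl == 8))
predictors <- c("gear", "carb")

# Hypothetical helper: turn a p-value into significance stars
star <- function(p) {
  ifelse(p < 0.001, "***", ifelse(p < 0.01, "**", ifelse(p < 0.05, "*", "")))
}

# Fit one univariate model per predictor and keep only "OR + stars"
rows <- lapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "eight_cyl"), data = dat, family = binomial)
  est <- coef(summary(fit))[v, ]
  data.frame(
    variable = v,
    OR = paste0(formatC(exp(est[["Estimate"]]), format = "f", digits = 2),
                star(est[["Pr(>|z|)"]])),
    stringsAsFactors = FALSE
  )
})
or_table <- do.call(rbind, rows)
print(or_table)
```

The resulting data frame can be passed to any table printer; bolding would have to be handled by whichever printer is used.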

Here's a quick huxtable version:
l1 <- glm(I(cyl==8) ~ gear, data = mtcars, family = binomial)
l2 <- glm(I(cyl==8) ~ carb, data = mtcars, family = binomial)
huxtable::huxreg(l1, l2, statistics = "nobs", bold_signif = 0.05)
────────────────────────────────────────────────────
                       (1)            (2)
              ───────────────────────────────────
  (Intercept)        5.999 *       -1.880 *
                    (2.465)        (0.902)
  gear              -1.736 *
                    (0.693)
  carb                              0.579 *
                                   (0.293)
              ───────────────────────────────────
  nobs                  32             32
────────────────────────────────────────────────────
  *** p < 0.001; ** p < 0.01; * p < 0.05.

Column names: names, model1, model2
It doesn't show it here, but the significant coefficients are bold on screen (and in any other kind of output).

Related

Clustered resampling for inner layer of Caret recursive feature elimination

I have data where IDs are contained within clusters.
I would like to perform recursive feature elimination using Caret's rfe function which performs the following procedure:
Clustered resampling for the outer layer (line 2.1) is straightforward, using the index parameter.
However, within each outer resample, I would like to tune tuning parameters using cluster-based cross-validation (inner resampling) (line 2.9). Model tuning in the inner layer is possible by specifying a tuneGrid in rfe and having an appropriate trControl. It is this trControl that I would like to change to allow clustered resampling.
The outer resampling is specified in the rfeControl parameter of rfe.
The inner resampling is specified by trControl of rfe which is passed to train.
The trouble I am having is that I can't seem to specify any inner indices, because after the outer resampling, those indices are no longer valid or no longer present in the outer-resampled data.
I am looking for a way to tell train to take an outer resample (which will be missing a cluster against which to validate), and to tune the model using inner resampling based on folds of the remaining clusters.
The MWE is as minimal as possible:
library(caret)
library(tidyverse)
library(parallel)
library(doParallel)
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
### Create some random data, 10 features, with some influence over a binomial outcome
set.seed(42)
id <- 1:1000
cluster <- rep(1:10, each = 100)
dat <- data.frame(id, cluster, replicate(10,rnorm(n = 1000, mean = runif(1, 0,100)+cluster, sd = runif(1, 0,20))))
dat <- dat %>% mutate(temp = rowSums(across(X1:X10)), prob = range01(temp), outcome = rbinom(n = nrow(dat), size = 1, prob = prob))
dat$outcome <- as.factor(dat$outcome)
levels(dat$outcome) <- c("control", "case")
dat$outcome <- factor(dat$outcome, levels=rev(levels(dat$outcome)))
### Manual outer folds-based cluster ###
for(i in 1:10) {
assign(paste0("index", i), which(dat$cluster!=i))
}
unit_indices <- list(index1, index2, index3, index4, index5, index6, index7, index8, index9, index10)
### Inner resampling method (THIS IS WHAT I'D LIKE TO CHANGE) ###
cv5 <- trainControl(classProbs = TRUE, method = "cv", number = 5, allowParallel = F) ## Is there a way to have inner cluster-based resampling WITHIN the outer cluster-based resampling?
caret_rfe_functions <- list(
  summary = twoClassSummary,
  fit = function(x, y, first, last, ...) {
    train(x, y, ...)
  },
  pred = caretFuncs$pred,
  rank = function(object, x, y) {
    vimp <- varImp(object)$importance
    vimp <- vimp[order(vimp$Overall, decreasing = TRUE), , drop = FALSE]
    vimp$var <- rownames(vimp)
    vimp
  },
  selectSize = function(x, metric = "ROC", tol = 1, maximize = TRUE) {
    if (!maximize) {
      best <- min(x[, metric])
      perf <- (x[, metric] - best) / best * 100
      flag <- perf <= tol
    } else {
      best <- max(x[, metric])
      perf <- (best - x[, metric]) / best * 100
      flag <- perf <= tol
    }
    min(x[flag, "Variables"])
  },
  selectVar = caretFuncs$selectVar
)
caret_rfe_ctrl <- rfeControl(
functions = caret_rfe_functions,
saveDetails = TRUE,
index = unit_indices,
indexOut = NULL,
returnResamp = "all",
allowParallel = T, ### change this if you don't want to / can't go parallel
verbose = TRUE
)
#### Feature selection ####
set.seed(42)
cl <- makePSOCKcluster(10) ### for parallel processing if available
registerDoParallel(cl)
rfe_profile_nnet <- rfe(
  form = outcome ~
    X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10,
  data = dat,
  sizes = seq(2, 10, 1),
  rfeControl = caret_rfe_ctrl,
  ## pass options to train()
  method = "nnet",
  preProc = c("center", "scale"),
  metric = "ROC",
  tuneGrid = expand.grid(size = c(1:5), decay = 5),
  trControl = cv5 ### I would like to change this to allow inner cluster-based resampling
)
stopCluster(cl)
rfe_profile_nnet
plot(rfe_profile_nnet)
Presumably the inner cluster-based resampling would be achieved by specifying a new trainControl containing some dynamic inner index based on the outer resample that is selected at the time:
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = {insert magic here}, ### This is the important bit
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F) ### especially if the outer resample is parallelised
If you try with the original cluster indices e.g.
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = unit_indices,
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F)
There are various warnings about missing data in the resamples, and things like 24: In [<-.data.frame(*tmp*, , object$method$center, value = structure(list( ... : provided 81 variables to replace 9 variables.
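To illustrate why the original indices stop being valid, and what a "dynamic" inner index would have to look like, here is a minimal base R sketch. It assumes the cluster label of every row is available; make_inner_index is a hypothetical helper, not a caret function, and how to hand its result to train from inside rfe is left open.

```r
# Hypothetical helper: leave-one-cluster-out indices, numbered relative to the
# outer-resampled data (the rows that train() actually receives).
#   cluster          : cluster label for every row of the full data
#   outer_train_rows : row numbers (in the full data) kept by one outer resample
make_inner_index <- function(cluster, outer_train_rows) {
  inner_cluster <- cluster[outer_train_rows]   # labels of the retained rows only
  remaining <- sort(unique(inner_cluster))
  index <- lapply(remaining, function(k) which(inner_cluster != k))
  names(index) <- paste0("without_cluster_", remaining)
  index
}

cluster <- rep(1:10, each = 100)
outer_train_rows <- which(cluster != 3)   # outer resample that holds out cluster 3
inner_index <- make_inner_index(cluster, outer_train_rows)

length(inner_index)        # one inner fold per remaining cluster
range(unlist(inner_index)) # positions within the 900 retained rows, not within 1:1000
```

The original unit_indices contain row numbers up to 1000, whereas the outer-resampled data has only 900 rows, which is consistent with the warnings above. Because the cluster column is not among the predictors in the formula, the labels for the retained rows would have to reach the inner fit some other way; that part is an open assumption here.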
All help greatly appreciated.
As a postscript question, you can see which parameters were used within your rfe like so:
> rfe_profile_nnet$fit
Neural Network
1000 samples
8 predictor
2 classes: 'case', 'control'
Pre-processing: centered (8), scaled (8)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results across tuning parameters:
size Accuracy Kappa
1 0.616 0.1605071
2 0.616 0.1686937
3 0.620 0.1820503
4 0.618 0.1788491
5 0.618 0.1788063
Tuning parameter 'decay' was held constant at a value of 5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 3 and decay = 5.
But does anyone know if this refers to one, or all, of the outer resamples? Presumably the same tuning parameters won't necessarily be chosen across all outer resamples.

Cost function for logistic regression: weird/oscillating cost history

Background and my thought process:
I wanted to see if I could utilize logistic regression to create a hypothesis function that could predict recessions in the US economy by looking at a date and its corresponding leading economic indicators. Leading economic indicators are known to be good predictors of the economy.
To do this, I got data from OECD on the composite leading (economic) indicators from January, 1970 to July, 2021 in addition to finding when recessions occurred from 1970 to 2021. The formatted data that I use for training can be found further below.
Knowing that the relationship between a recession and the Date/LEI wouldn't be a simple linear one, I decided to create more parameters for each datapoint so I could fit a polynomial equation to the data. Thus, each datapoint has the following parameters: Date, LEI, LEI^2, LEI^3, LEI^4, and LEI^5.
The Problem:
When I attempt to train my hypothesis function, I get a very strange cost history that seems to indicate that I implemented either my cost function or my gradient descent incorrectly. Below is the image of my cost history:
I have tried implementing the suggestions from this post to fix my cost history, as originally I had the same NaN and Inf issues described in the post. While the suggestions helped me fix the NaN and Inf issues, I couldn't find anything to help me fix my cost function once it started oscillating. Some of the other fixes I've tried are adjusting the learning rate, double-checking my cost and gradient descent, and introducing more parameters for datapoints (to see if a higher-degree polynomial equation would help).
My Code
The main file is predictor.m.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Program: Predictor.m
% Author: Hasec Rainn
% Desc: Predictor.m uses logistic regression
% to predict when economic recessions will occur
% in the United States. The data it uses is from the past 50 years.
%
% In particular, it uses dates and their corresponding economic leading
% indicators to learn a non-linear hypothesis function to fit to the data.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
LI_Data = dlmread("leading_indicators_formatted.csv"); %Get LI data
RD_Data = dlmread("recession_dates_formatted.csv"); %Get RD data
%our datapoints of interest: Dates and their corresponding
%leading Indicator values.
%We are going to increase the number of parameters per datapoint to allow
%for a non-linear hypothesis function. Specifically, let the 3rd, 4th
%5th, and 6th columns represent LI^2, LI^3, LI^4, and LI^5 respectively
X = LI_Data; %datapoints of interest (row = 1 datapoint)
X = [X, X(:,2).^2]; %Adding LI^2
X = [X, X(:,2).^3]; %Adding LI^3
X = [X, X(:,2).^4]; %Adding LI^4
X = [X, X(:,2).^5]; %Adding LI^5
%normalize data
X(:,1) = normalize( X(:,1) );
X(:,2) = normalize( X(:,2) );
X(:,3) = normalize( X(:,3) );
X(:,4) = normalize( X(:,4) );
X(:,5) = normalize( X(:,5) );
X(:,6) = normalize( X(:,6) );
%What we want to predict: if a recession happens or doesn't happen
%for a corresponding year
Y = RD_Data(:,2); %row = 1 datapoint
%defining a few useful variables:
nIter = 4000; %how many iterations we want to run gradient descent for
ndp = size(X, 1); %number of data points we have to work with
nPara = size(X,2); %number of parameters per data point
alpha = 1; %set the learning rate to 1
%Defining Theta
Theta = ones(1, nPara); %initialize the weights of Theta to 1
%Make a cost history so we can see if gradient descent is implemented
%correctly
costHist = zeros(nIter, 1);
for i = 1:nIter
costHist(i, 1) = cost(Theta, Y, X);
Theta = Theta - (sum((sigmoid(X * Theta') - Y) .* X));
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function: Cost
% Author: Hasec Rainn
% Parameters: Theta (vector), Y (vector), X (matrix)
% Desc: Uses Theta, Y, and X to determine the cost of our current
% hypothesis function H_theta(X). Uses manual loop approach to
% avoid errors that arise from log(0).
% Additionally, limits the range of H_Theta to prevent Inf
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function expense = cost(Theta, Y, X)
  m = size(X, 1); %number of data points
  hTheta = sigmoid(X*Theta'); %hypothesis function
  %limit the range of hTheta to [10^-50, 0.9999999999999]
  for i=1:size(hTheta, 1)
    if (hTheta(i) <= 10^(-50))
      hTheta(i) = 10^(-50);
    endif
    if (hTheta(i) >= 0.9999999999999)
      hTheta(i) = 0.9999999999999;
    endif
  endfor
  expense = 0;
  for i = 1:m
    if Y(i) == 1
      expense = expense + -log(hTheta(i));
    endif
    if Y(i) == 0
      expense = expense + -log(1-hTheta(i));
    endif
  endfor
endfunction
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function: normalization
% Author: Hasec Rainn
% Parameters: vector
% Desc: Takes in an input and normalizes its value(s)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function n = normalize(data)
dMean = mean(data);
dStd = std(data);
n = (data - dMean) ./ dStd;
endfunction
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function: Sigmoid
% Author: Hasec Rainn
% Parameters: scalar, vector, or matrix
% Desc: Takes an input and forces its value(s) to be between
% 0 and 1. If a matrix or vector, sigmoid is applied to
% each element.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function result = sigmoid(z)
result = 1 ./ ( 1 + e .^(-z) );
endfunction
The data I used for my learning process can be found here: formatted LI data and recession dates data.
The problem you're running into here is your gradient descent function.
In particular, while you correctly calculate the cost portion (that is, (hTheta - Y) or (sigmoid(X * Theta') - Y)), you do not calculate the derivative of the cost correctly: in Theta = Theta - (sum((sigmoid(X * Theta') - Y) .* X)), the .*X is not correct.
The derivative with respect to parameter j is the error of each datapoint (found in the vector hTheta - Y) multiplied by that datapoint's value for parameter j, summed over all datapoints, and this is done for every parameter. For more information, check out this article.
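The update rule described above, written in matrix form with the learning rate and the average over datapoints included, can be sketched as follows. This is a minimal R illustration on simulated data, not the original Octave program: the data, alpha = 0.5 and the variable names are assumptions. Note that the loop in the question defines alpha but never uses it and does not divide the sum by ndp, which may be worth checking as well, since an unscaled step is a common cause of an oscillating cost.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Mean cross-entropy cost, with predictions clipped away from 0 and 1
cost <- function(theta, X, y, eps = 1e-12) {
  h <- pmin(pmax(sigmoid(X %*% theta), eps), 1 - eps)
  -mean(y * log(h) + (1 - y) * log(1 - h))
}

# Simulated data (assumption): intercept column plus one standardised feature
set.seed(1)
m  <- 200
x1 <- rnorm(m)
X  <- cbind(1, x1)
y  <- rbinom(m, 1, sigmoid(-0.5 + 1.5 * x1))

alpha  <- 0.5    # learning rate
n_iter <- 500
theta  <- c(0, 0)
cost_hist <- numeric(n_iter)

for (i in seq_len(n_iter)) {
  cost_hist[i] <- cost(theta, X, y)
  # gradient: t(X) %*% (h - y), averaged over the m datapoints
  grad  <- crossprod(X, sigmoid(X %*% theta) - y) / m
  theta <- theta - alpha * as.vector(grad)
}

plot(cost_hist, type = "l", xlab = "iteration", ylab = "cost")
```

With the step scaled by alpha / m, the cost history in this sketch decreases smoothly instead of jumping around.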

Saving a gamlss model to an RDS format

I'm fitting an R gamlss model:
set.seed(1)
df <- data.frame(group = c(rep("g1",100),rep("g2",100),rep("g3",100)),
value = c(rgamma(100,rate=5,shape=3),rgamma(100,rate=5,shape=4),rgamma(100,rate=5,shape=5)))
df$group <- factor(df$group, levels=c("g1","g2","g3"))
gamlss.fit <- gamlss::gamlss(formula = value ~ group, sigma.formula = ~group, data = df, family=gamlss.dist::GA(mu.link="log"))
This is what I get:
> gamlss.fit
Family: c("GA", "Gamma")
Fitting method: RS()
Call: gamlss::gamlss(formula = value ~ group, sigma.formula = ~group, family = gamlss.dist::GA(mu.link = "log"), data = df)
Mu Coefficients:
(Intercept) groupg2 groupg3
-0.5392 0.2553 0.5162
Sigma Coefficients:
(Intercept) groupg2 groupg3
-0.66318 0.02355 -0.08610
Degrees of Freedom for the fit: 6 Residual Deg. of Freedom 294
Global Deviance: 217.18
AIC: 229.18
SBC: 251.402
I want to save this gamlss.fit model in RDS format for later use. The saveRDS function works fine.
saveRDS(gamlss.fit, "my.gamlss.fit.RDS")
But then if I terminate the current R session, open a new one and read the RDS saved gamlss.fit model, I get:
Call: gamlss::gamlss(formula = value ~ group, sigma.formula = ~group,
family = gamlss.dist::GA(mu.link = "log"), data = df)
No coefficients
Degrees of Freedom: Total (i.e. Null); 294 Residual
Error in signif(x$null.deviance, digits) :
non-numeric argument to mathematical function
So I cannot really use this object for anything downstream.
I thought that tidypredict's parse_model function might come in handy, but it doesn't seem to support parsing the gamlss model:
> gamlss.parsed.fit <- tidypredict::parse_model(gamlss.fit)
Error: Functions inside the formula are not supported.
- Functions detected: `gamlss`,`gamlss.dist`,`GA`. Use `dplyr` transformations to prepare the data.
This saveRDS problem seems to be specific to gamlss, because if I fit a glm model:
glm.fit <- glm(formula = value ~ group, data = df, family="Gamma"(link='log'))
Which gives:
> glm.fit
Call: glm(formula = value ~ group, family = Gamma(link = "log"), data = df)
Coefficients:
(Intercept) groupg2 groupg3
-0.5392 0.2553 0.5162
Degrees of Freedom: 299 Total (i.e. Null); 297 Residual
Null Deviance: 93.25
Residual Deviance: 79.99 AIC: 226.9
I'll get the same after reading it from the RDS saved file:
Call: glm(formula = value ~ group, family = Gamma(link = "log"), data = df)
Coefficients:
(Intercept) groupg2 groupg3
-0.5392 0.2553 0.5162
Degrees of Freedom: 299 Total (i.e. Null); 297 Residual
Null Deviance: 93.25
Residual Deviance: 79.99 AIC: 226.9
BTW, tidypredict's parse_model doesn't support parsing a glm model either:
> glm.parsed.fit <- tidypredict::parse_model(glm.fit)
Error: Functions inside the formula are not supported.
- Functions detected: `Gamma`. Use `dplyr` transformations to prepare the data.
Any idea if and how a gamlss model can be saved without using the save function, whose drawbacks are discussed here?
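One thing that may be worth checking (an assumption based on the printout, not a confirmed diagnosis): the output after readRDS has the layout that print.glm produces. If the gamlss package is not loaded in the new session, there is no print method for the "gamlss" class, so print() would fall through to the next class in the object's class vector. The toy sketch below shows that mechanism with a made-up class "toyfit"; it does not use gamlss itself.

```r
# "toyfit" is a made-up class standing in for "gamlss"; this is not a real model
fit <- structure(list(coefficients = c(a = 1, b = 2)),
                 class = c("toyfit", "glm", "lm"))

path <- tempfile(fileext = ".RDS")
saveRDS(fit, path)
restored <- readRDS(path)
identical(fit, restored)   # the contents and the class vector survive the round trip

# No print method exists yet for the first class, so print() would dispatch
# on the next class in the vector ("glm")
is.null(getS3method("print", "toyfit", optional = TRUE))

# Once a method is available (for a real package: after loading it), it is used
print.toyfit <- function(x, ...) {
  cat("toyfit with", length(x$coefficients), "coefficients\n")
  invisible(x)
}
print(restored)
```

If that is what is happening, calling library(gamlss) in the new session before printing or using the restored object may be all that is needed.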

CFA in data with 3 levels - estimating factor scores at level 2?

I am working on a dataset with 3 levels:
Teacher
School
Country
Using survey responses from the teachers, the aim is to use Confirmatory Factor Analysis (CFA) with the ultimate goal of having the factor scores at the school level.
A further objective is to test for measurement invariance across countries.
I want to use the lavaan package in R, because it is able to deal with the complex survey design of my data through the lavaan.survey extension (sampling design, weights, etc.).
I have done some preliminary analysis, where I use country-ID as the group argument in the cfa function. This lets me perform the measurement invariance analysis across countries. The issue is that my factor scores are given at the individual teacher level, and I am interested in the school level.
Any ideas about how to get these factor scores at the school level?
Here are some examples of the functions I use. I do not think that a data sample is needed, but I will create one if it is requested.
library(lavaan)
library(semTools)
#define model
reduced_mod <-'
leadership_sup =~ TC3G22D + TC3G22E + TC3G22K
continous_develop_collab =~ TT3G32A + TT3G32B + TT3G32C + TT3G32D '
#Fit model with different restraints:
fit_no_restraint <- cfa(model = reduced_mod, data = cfa_data, group="countryID")
fit_metric <- cfa(model = reduced_mod, data = cfa_data, group="countryID", group.equal = c("loadings"))
fit_scalar <- cfa(model = reduced_mod, data = cfa_data, group="countryID", group.equal = c("loadings", "intercepts"))
#Compare fit statistics
compareFit(scalar = fit_scalar , metric = fit_metric , config = fit_no_restraint)
It seems that you want multilevel measurement invariance. You should use the measEq.syntax() from the semTools package:
## ---------------------
## Multilevel Invariance
## ---------------------
## To test invariance across levels in a MLSEM, specify syntax as though
## you are fitting to 2 groups instead of 2 levels.
mlsem <- ' f1 =~ y1 + y2 + y3
f2 =~ y4 + y5 + y6 '
## metric invariance
syntax.metric <- measEq.syntax(configural.model = mlsem, meanstructure = TRUE,
ID.fac = "std.lv", sample.nobs = c(1, 1),
group = "cluster", group.equal = "loadings")
## by definition, Level-1 means must be zero, so fix them
syntax.metric <- update(syntax.metric,
change.syntax = paste0("y", 1:6, " ~ c(0, NA)*1"))
## save as a character string
mod.metric <- as.character(syntax.metric, groups.as.blocks = TRUE)
## convert from multigroup to multilevel
mod.metric <- gsub(pattern = "group:", replacement = "level:",
x = mod.metric, fixed = TRUE)
## fit model to data
fit.metric <- lavaan(mod.metric, data = Demo.twolevel, cluster = "cluster")
summary(fit.metric)
Source

combine time series plot by using R

I want to combine three graphics on one graph. The data is the "nottem" dataset that ships with R. Can someone help me write code to put a seasonal means model, a harmonic (cosine) model, and the time series plot together, using different colors? I have already written the model code; I just don't know how to combine them to compare.
Code:
library(TSA)
nottem
month. = season(nottem)
model = lm(nottem ~ month. - 1)
summary(model)
har. = harmonic(nottem, 1)
model1 = lm(nottem ~ har.)
summary(model1)
plot(nottem, type = "l", ylab = "Average monthly temperature at Nottingham castle")
points(y = nottem, x = time(nottem), pch = as.vector(season(nottem)))
Just put your time series inside a matrix:
x = cbind(serie1 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2)),
serie2 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2)))
plot(x)
Or configure the plot region:
par(mfrow = c(2, 1)) # 2 rows, 1 column
serie1 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2))
serie2 = ts(cumsum(rnorm(100)), freq = 12, start = c(2013, 2))
require(zoo)
plot(serie1)
lines(rollapply(serie1, width = 10, FUN = mean), col = 'red')
plot(serie2)
lines(rollapply(serie2, width = 10, FUN = mean), col = 'blue')
hope it helps.
PS.: zoo package is not needed in this example, you could use the filter function.
You can extract the seasonal mean with:
s.mean = tapply(serie, cycle(serie), mean)
# January, assuming serie is monthly data
print(s.mean[1])
This graph is pretty hard to read, because your three sets of values are so similar. Still, if you simply want to graph all of these on the same plot, you can do it pretty easily by using the coefficients generated by your models.
Step 1: Plot the raw data. This comes from your original code.
plot(nottem,type="l",ylab="Average monthly temperature at Nottingham castle")
Step 2: Set up x-values for the mean and cosine plots.
x <- seq(1920, (1940 - 1/12), by=1/12)
Step 3: Plot the seasonal means by repeating the coefficients from the first model.
lines(x=x, y=rep(model$coefficients, 20), col="blue")
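Step 4: Calculate the y-values for the harmonic function using the coefficients from the second model, and then plot. harmonic(nottem, 1) returns a cosine and a sine column, so the model has an intercept, a cosine coefficient, and a sine coefficient.
y <- model1$coefficients[1] + model1$coefficients[2] * cos(2 * pi * x) + model1$coefficients[3] * sin(2 * pi * x)
lines(x=x, y=y, col="red")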
ggplot variant: If you decide to switch to the popular 'ggplot2' package for your plot, you would do it like so:
library(ggplot2)
library(reshape2) # provides melt()
x <- seq(1920, (1940 - 1/12), by=1/12)
y.seas.mean <- rep(model$coefficients, 20)
y.har.cos <- model1$coefficients[1] + model1$coefficients[2] * cos(2 * pi * x) + model1$coefficients[3] * sin(2 * pi * x)
plot_Data <- melt(data.frame(x=x, temp=as.numeric(nottem), seas.mean=y.seas.mean, har.cos=y.har.cos), id="x")
ggplot(plot_Data, aes(x=x, y=value, col=variable)) + geom_line()
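The same overlay can also be sketched with base R only, using fitted values instead of rebuilding the curves from coefficients. This is a minimal sketch under the assumption that a month factor built from cycle() and explicit cos/sin terms are acceptable stand-ins for TSA's season() and harmonic(); the object names are illustrative.

```r
# nottem ships with R (datasets package): monthly temperatures, 1920-1939
tt    <- as.numeric(time(nottem))
month <- factor(cycle(nottem))

seasonal_fit <- lm(nottem ~ month - 1)                            # one mean per month
harmonic_fit <- lm(nottem ~ cos(2 * pi * tt) + sin(2 * pi * tt))  # first harmonic

plot(nottem, type = "l", ylab = "Average monthly temperature at Nottingham castle")
lines(tt, as.numeric(fitted(seasonal_fit)), col = "blue")
lines(tt, as.numeric(fitted(harmonic_fit)), col = "red")
legend("topleft", legend = c("observed", "seasonal means", "harmonic"),
       col = c("black", "blue", "red"), lty = 1, bty = "n")
```

Using fitted() avoids having to line up the coefficient order by hand, which is where the cosine-only version can go wrong.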
