CFA in data with 3 levels - estimating factor scores at level 2? - methodology

I am working on a dataset with 3 levels:
Teacher
School
Country
Using survey responses from the teachers, the aim is to use Confirmatory Factor Analysis (CFA) with the ultimate goal of having the factor scores at the school level.
A further objective is to test for measurement invariance across countries.
I want to use the lavaan package in R, because it is able to deal with the complex survey design of my data trough the lavaan.survey-extension (sampling design, weights etc.)
I have done some preliminary analysis, where i use country-ID as the group argument in the cfa-function. This gives me the possibility to perform the measurement invariance analysis across countries. The issue is, that my factor scores are given at individual teacher level, and i am interested in the school-level.
Any ideas about how to get these factor scores at the school level?
Here are some examples of the functions i use. I do not think that i data sample is needed, but i will create some if it is requested.
library(lavaan)
library(SEMtools)
#define model
reduced_mod <-'
leadership_sup =~ TC3G22D + TC3G22E + TC3G22K
continous_develop_collab =~ TT3G32A + TT3G32B + TT3G32C + TT3G32D '
#Fit model with different restraints:
fit_no_restraint <- cfa(model = reduced_mod, data = cfa_data, group="countryID")
fit_metric <- cfa(model = reduced_mod, data = cfa_data, group="countryID", group.equal = c("loadings"))
fit_scalar <- cfa(model = reduced_mod, data = cfa_data, group="countryID", group.equal = c("loadings", "intercepts"))
#Compare fit statistics
compareFit(scalar = fit_scalar , metric = fit_metric , config = fit_no_restraint)

It seems that you want multilevel measurement invariance. You should use the measEq.syntax() from the semTools package:
## ---------------------
## Multilevel Invariance
## ---------------------
## To test invariance across levels in a MLSEM, specify syntax as though
## you are fitting to 2 groups instead of 2 levels.
mlsem <- ' f1 =~ y1 + y2 + y3
f2 =~ y4 + y5 + y6 '
## metric invariance
syntax.metric <- measEq.syntax(configural.model = mlsem, meanstructure = TRUE,
ID.fac = "std.lv", sample.nobs = c(1, 1),
group = "cluster", group.equal = "loadings")
## by definition, Level-1 means must be zero, so fix them
syntax.metric <- update(syntax.metric,
change.syntax = paste0("y", 1:6, " ~ c(0, NA)*1"))
## save as a character string
mod.metric <- as.character(syntax.metric, groups.as.blocks = TRUE)
## convert from multigroup to multilevel
mod.metric <- gsub(pattern = "group:", replacement = "level:",
x = mod.metric, fixed = TRUE)
## fit model to data
fit.metric <- lavaan(mod.metric, data = Demo.twolevel, cluster = "cluster")
summary(fit.metric)
Source

Related

Clustered resampling for inner layer of Caret recursive feature elimination

I have data where IDs are contained within clusters.
I would like to perform recursive feature elimination using Caret's rfe function which performs the following procedure:
Clustered resampling for the outer layer (line 2.1) is straightforward, using the index parameter.
However, within each outer resample, I would like to tune tuning parameters using cluster-based cross-validation (inner resampling) (line 2.9). Model tuning in the inner layer is possible by specifying a tuneGrid in rfe and having an appropriate trControl. It is this trControl that I would like to change to allow clustered resampling.
The outer resampling is specified in the rfeControl parameter of rfe.
The inner resampling is specified by trControl of rfe which is passed to train.
The trouble I am having is that I can't seem to specify any inner indices, because after the outer resampling, those indices are no longer valid or no longer present in the outer-resampled data.
I am looking for a way to tell train to take an outer resample (which will be missing a cluster against which to validate), and to tune the model using inner resampling by based on folds of the remaining clusters.
The MWE is as minimal as possible:
library(caret)
library(tidyverse)
library(parallel)
library(doParallel)
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
### Create some random data, 10 features, with some influence over a binomial outcome
set.seed(42)
id <- 1:1000
cluster <- rep(1:10, each = 100)
dat <- data.frame(id, cluster, replicate(10,rnorm(n = 1000, mean = runif(1, 0,100)+cluster, sd = runif(1, 0,20))))
dat <- dat %>% mutate(temp = rowSums(across(X1:X10)), prob = range01(temp), outcome = rbinom(n = nrow(dat), size = 1, prob = prob))
dat$outcome <- as.factor(dat$outcome)
levels(dat$outcome) <- c("control", "case")
dat$outcome <- factor(dat$outcome, levels=rev(levels(dat$outcome)))
### Manual outer folds-based cluster ###
for(i in 1:10) {
assign(paste0("index", i), which(dat$cluster!=i))
}
unit_indices <- list(index1, index2, index3, index4, index5, index6, index7, index8, index9, index10)
### Inner resampling method (THIS IS WHAT I'D LIKE TO CHANGE) ###
cv5 <- trainControl(classProbs = TRUE, method = "cv", number = 5, allowParallel = F) ## Is there a way to have inner cluster-based resampling WITHIN the outer cluster-based resampling?
caret_rfe_functions <- list(summary = twoClassSummary,
fit = function (x, y, first, last, ...) {
train(x, y, ...)
},
pred = caretFuncs$pred,
rank = function(object, x, y) {
vimp <- varImp(object)$importance
vimp <- vimp[order(vimp$Overall,decreasing = TRUE),,drop = FALSE]
vimp$var <- rownames(vimp)
vimp
},
selectSize = function (x, metric = "ROC", tol = 1, maximize = TRUE)
{
if (!maximize) {
best <- min(x[, metric])
perf <- (x[, metric] - best)/best * 100
flag <- perf <= tol
}
else {
best <- max(x[, metric])
perf <- (best - x[, metric])/best * 100
flag <- perf <= tol
}
min(x[flag, "Variables"])
},
selectVar = caretFuncs$selectVar)
caret_rfe_ctrl <- rfeControl(
functions = caret_rfe_functions,
saveDetails = TRUE,
index = unit_indices,
indexOut = NULL,
returnResamp = "all",
allowParallel = T, ### change this if you don't want to / can't go parallel
verbose = TRUE
)
#### Feature selection ####
set.seed(42)
cl <- makePSOCKcluster(10) ### for parallel processing if available
registerDoParallel(cl)
rfe_profile_nnet <- rfe(
form = outcome ~
X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10,
data = dat,
sizes = seq(2,10,1),
rfeControl = caret_rfe_ctrl,
## pass options to train()
method = "nnet",
preProc = c("center", "scale"),
metric = "ROC",
tuneGrid = expand.grid(size = c(1:5), decay = 5),
trControl = cv5) ### I would like to change this to allow inner cluster-based resampling
stopCluster(cl)
rfe_profile_nnet
plot(rfe_profile_nnet)
Presumably the inner cluster-based resampling would be achieved by specifying a new trainControl containing some dynamic inner index based on the outer resample that is selected at the time:
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = {insert magic here}, ### This is the important bit
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F) ### especially if the outer resample is parallelised
If you try with the original cluster indices e.g.
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = unit_indices,
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F)
There are various warnings about missing data in the resamples, and things like 24: In [<-.data.frame(*tmp*, , object$method$center, value = structure(list( ... : provided 81 variables to replace 9 variables.
All help greatly appreciated.
As a postscript question , you can see which parameters were used within your rfe like so:
> rfe_profile_nnet$fit
Neural Network
1000 samples
8 predictor
2 classes: 'case', 'control'
Pre-processing: centered (8), scaled (8)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results across tuning parameters:
size Accuracy Kappa
1 0.616 0.1605071
2 0.616 0.1686937
3 0.620 0.1820503
4 0.618 0.1788491
5 0.618 0.1788063
Tuning parameter 'decay' was held constant at a value of 5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 3 and decay = 5.
But does anyone know if this refers to one, or all of the outer resamples? Presumably the same tuning parameters won't necessarily be chosen across all outer resamples

Saving a gamlss model to an RDS format

I'm fitting an R gamlss model:
set.seed(1)
df <- data.frame(group = c(rep("g1",100),rep("g2",100),rep("g3",100)),
value = c(rgamma(100,rate=5,shape=3),rgamma(100,rate=5,shape=4),rgamma(100,rate=5,shape=5)))
df$group <- factor(df$group, levels=c("g1","g2","g3"))
gamlss.fit <- gamlss::gamlss(formula = value ~ group, sigma.formula = ~group, data = df, family=gamlss.dist::GA(mu.link="log"))
This is what I get:
> gamlss.fit
Family: c("GA", "Gamma")
Fitting method: RS()
Call: gamlss::gamlss(formula = value ~ group, sigma.formula = ~group, family = gamlss.dist::GA(mu.link = "log"), data = df)
Mu Coefficients:
(Intercept) groupg2 groupg3
-0.5392 0.2553 0.5162
Sigma Coefficients:
(Intercept) groupg2 groupg3
-0.66318 0.02355 -0.08610
Degrees of Freedom for the fit: 6 Residual Deg. of Freedom 294
Global Deviance: 217.18
AIC: 229.18
SBC: 251.402
I want to save this gamlss.fit model in RDS format for later use. The saveRDS function works fine.
saveRDS(gamlss.fit, "my.gamlss.fit.RDS")
But then if I terminate the current R session, open a new one and read the RDS saved gamlss.fit model, I get:
Call: gamlss::gamlss(formula = value ~ group, sigma.formula = ~group,
family = gamlss.dist::GA(mu.link = "log"), data = df)
No coefficients
Degrees of Freedom: Total (i.e. Null); 294 Residual
Error in signif(x$null.deviance, digits) :
non-numeric argument to mathematical function
So I cannot really use this object for anything downstream.
I thought that tidypredict's parse_model function might come in handy, but it doesn't seem to support parsing the gamlss model:
> gamlss.parsed.fit <- tidypredict::parse_model(gamlss.fit)
Error: Functions inside the formula are not supported.
- Functions detected: `gamlss`,`gamlss.dist`,`GA`. Use `dplyr` transformations to prepare the data.
This saveRDS is specific to gamlss because if I fit a glm model:
glm.fit <- glm(formula = value ~ group, data = df, family="Gamma"(link='log'))
Which gives:
> glm.fit
Call: glm(formula = value ~ group, family = Gamma(link = "log"), data = df)
Coefficients:
(Intercept) groupg2 groupg3
-0.5392 0.2553 0.5162
Degrees of Freedom: 299 Total (i.e. Null); 297 Residual
Null Deviance: 93.25
Residual Deviance: 79.99 AIC: 226.9
I'll get the same after reading it from the RDS saved file:
Call: glm(formula = value ~ group, family = Gamma(link = "log"), data = df)
Coefficients:
(Intercept) groupg2 groupg3
-0.5392 0.2553 0.5162
Degrees of Freedom: 299 Total (i.e. Null); 297 Residual
Null Deviance: 93.25
Residual Deviance: 79.99 AIC: 226.9
BTW, tidypredict's parse_model neither supports parsing a glm model:
> glm.parsed.fit <- tidypredict::parse_model(glm.fit)
Error: Functions inside the formula are not supported.
- Functions detected: `Gamma`. Use `dplyr` transformations to prepare the data.
Any idea if and how a gamlss model can be saved not using the save function, which its drawbacks are discussed here

Hide p_value and put stars to significant OR gtsummary

I'm using gtsummary package.
I need to merge different univariate logistic regression and in order to have a good presentation, I want to hide the p_value and bold or put a star to the significant OR (p< 0.05).
Anyone can help me?
Maybe it's easier to use another presentation type like kable, huxtable, I don't know?
Thank you for your help.
Have a nice day
There is a function called add_significance_stars() that hides the p-value and adds stars to the estimate indicating various levels of statistical significance. I've also added code to bold the estimate if significant with modify_table_styling().
library(gtsummary)
#> #BlackLivesMatter
packageVersion("gtsummary")
#> [1] '1.4.0'
tbl <-
trial %>%
select(death, age, grade) %>%
tbl_uvregression(
y = death,
method = glm,
method.args = list(family = binomial),
exponentiate = TRUE
) %>%
# add significance stars to sig estimates
add_significance_stars() %>%
# additioanlly bolding significant estimates
modify_table_styling(
columns = estimate,
rows = p.value < 0.05,
text_format = "bold"
)
Created on 2021-04-14 by the reprex package (v2.0.0)
Here's a quick huxtable version:
l1 <- glm(I(cyl==8) ~ gear, data = mtcars, family = binomial)
l2 <- glm(I(cyl==8) ~ carb, data = mtcars, family = binomial)
huxtable::huxreg(l1, l2, statistics = "nobs", bold_signif = 0.05)
────────────────────────────────────────────────────
(1) (2)
───────────────────────────────────
(Intercept) 5.999 * -1.880 *
(2.465) (0.902)
gear -1.736 *
(0.693)
carb 0.579 *
(0.293)
───────────────────────────────────
nobs 32 32
────────────────────────────────────────────────────
*** p < 0.001; ** p < 0.01; * p < 0.05.
Column names: names, model1, model2
It doesn't show it here, but the significant coefficients are bold on screen (and in any other kind of output).

How to feed previous time-stamp prediction as additional input to the next time-stamp?

This question might have been asked, but I got confused.
I am trying to apply one of RNN types, e.g. LSTM for time-series forecasting. I have inputs, y (stock returns). For each timestamp, I'd like to get the predictions. Q1 - Am I correct choosing seq2seq approach?
I also want to use predictions from previous timestamp (initializing initial values with some constant) as additional (still using my existing inputs) input in the form of squared residuals, i.e. using
eps_{t-1} = (y_{t-1} - y^_{t-1})^2 as additional input at t (as well as previous inputs).
So, how can I do this in tensorflow or in pytorch?
I tried to depict what I want on the attached graph. The graph
p.s. Sorry, it the question is poorly formulated
Let say your input if of dimension (32,10,1) with batch_size 32, time steps of length 10 and dimension of 1. Same for your target (stock return). This code make use of the tf.scan function, which is usefull when implementing custom recurrent networks (it will iterate over the timesteps). It remains to use the residual of t-1 in t somewhere, as you would like to.
ps: it is a very basic implementation of lstm from scratch, without any bias or output activation.
import tensorflow as tf
import numpy as np
tf.reset_default_graph()
BS = 32
TS = 10
inputs_dim = 1
target_dim = 1
inputs = tf.placeholder(shape=[BS, TS, inputs_dim], dtype=tf.float32)
stock_returns = tf.placeholder(shape=[BS, TS, target_dim], dtype=tf.float32)
state_size = 16
# initial hidden state
init_state = tf.placeholder(shape=[2, BS, state_size],
dtype=tf.float32, name='initial_state')
# initializer
xav_init = tf.contrib.layers.xavier_initializer
# params
W = tf.get_variable('W', shape=[4, state_size, state_size],
initializer=xav_init())
U = tf.get_variable('U', shape=[4, inputs_dim, state_size],
initializer=xav_init())
W_out = tf.get_variable('W_out', shape=[state_size, target_dim],
initializer=xav_init())
#the function to feed tf.scan with
def step(prev, inputs_):
#unpack all inputs and previous outputs
st_1, ct_1 = prev[0][0], prev[0][1]
x = inputs_[0]
target = inputs_[1]
#get previous squared residual
eps = prev[1]
"""
here do whatever you want with eps_t-1
like x += eps if x if of the same dimension
or include it somewhere in your graph
"""
# lstm gates (add bias if needed)
#
# input gate
i = tf.sigmoid(tf.matmul(x,U[0]) + tf.matmul(st_1,W[0]))
# forget gate
f = tf.sigmoid(tf.matmul(x,U[1]) + tf.matmul(st_1,W[1]))
# output gate
o = tf.sigmoid(tf.matmul(x,U[2]) + tf.matmul(st_1,W[2]))
# gate weights
g = tf.tanh(tf.matmul(x,U[3]) + tf.matmul(st_1,W[3]))
ct = ct_1*f + g*i
st = tf.tanh(ct)*o
"""
make prediction, compute residual in t
and pass it to t+1
Normaly, we would compute prediction outside the scan function,
but as we need it here, we could just keep it and return it back
as an output of the scan function
"""
prediction_t = tf.matmul(st, W_out) # + bias
eps = (target - prediction_t)**2
return [tf.stack((st, ct), axis=0), eps, prediction_t]
states, eps, preds = tf.scan(step, [tf.transpose(inputs, [1,0,2]),
tf.transpose(stock_returns, [1,0,2])], initializer=[init_state,
tf.zeros((32,1), dtype=tf.float32),
tf.zeros((32,1),dtype=tf.float32)])
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
out = sess.run(preds, feed_dict=
{inputs:np.random.rand(BS,TS,inputs_dim),
stock_returns:np.random.rand(BS,TS,target_dim),
init_state:np.zeros((2,BS,state_size))})
out = tf.transpose(out,[1,0,2])
print(out)
And the output :
Tensor("transpose_2:0", shape=(32, 10, 1), dtype=float32)
Base code from here

shape of input to calculate information gain

I want to calculate the information gain on 20_newsgroup data set.
I am using the code here(also I put a copy of the code down of the question).
As you see the input to the algorithm is X,y
My confusion is that, X is going to be a matrix with documents in rows and features as column. (according to 20_newsgroup it is 11314,1000
in case i only considered 1000 features).
but according to the concept of information gain, it should calculate information gain for each feature.
(So I was expecting to see the code in a way loop through each feature, so the input to the function be a matrix where rows are features and columns are class)
But X is not feature here but X stands for documents, and I can not see the part in the code that take care of this part! ( I mean considering each document, and then going through each feature of that document; like looping through rows but at the same time looping through columns as the features are stored in columns).
I have read this and this and many similar questions but they are not clear in terms of input matrix shape.
this is the code for reading 20_newsgroup:
newsgroup_train = fetch_20newsgroups(subset='train')
X,y = newsgroup_train.data,newsgroup_train.target
cv = CountVectorizer(max_df=0.99,min_df=0.001, max_features=1000,stop_words='english',lowercase=True,analyzer='word')
X_vec = cv.fit_transform(X)
(X_vec.shape) is (11314,1000) which is not features in the 20_newsgroup data set. I am thinking am I calculating Information gain in a incorrect way?
This is the code for Information gain:
def information_gain(X, y):
def _calIg():
entropy_x_set = 0
entropy_x_not_set = 0
for c in classCnt:
probs = classCnt[c] / float(featureTot)
entropy_x_set = entropy_x_set - probs * np.log(probs)
probs = (classTotCnt[c] - classCnt[c]) / float(tot - featureTot)
entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
for c in classTotCnt:
if c not in classCnt:
probs = classTotCnt[c] / float(tot - featureTot)
entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
return entropy_before - ((featureTot / float(tot)) * entropy_x_set
+ ((tot - featureTot) / float(tot)) * entropy_x_not_set)
tot = X.shape[0]
classTotCnt = {}
entropy_before = 0
for i in y:
if i not in classTotCnt:
classTotCnt[i] = 1
else:
classTotCnt[i] = classTotCnt[i] + 1
for c in classTotCnt:
probs = classTotCnt[c] / float(tot)
entropy_before = entropy_before - probs * np.log(probs)
nz = X.T.nonzero()
pre = 0
classCnt = {}
featureTot = 0
information_gain = []
for i in range(0, len(nz[0])):
if (i != 0 and nz[0][i] != pre):
for notappear in range(pre+1, nz[0][i]):
information_gain.append(0)
ig = _calIg()
information_gain.append(ig)
pre = nz[0][i]
classCnt = {}
featureTot = 0
featureTot = featureTot + 1
yclass = y[nz[1][i]]
if yclass not in classCnt:
classCnt[yclass] = 1
else:
classCnt[yclass] = classCnt[yclass] + 1
ig = _calIg()
information_gain.append(ig)
return np.asarray(information_gain)
Well, after going through the code in detail, I learned more about X.T.nonzero().
Actually it is correct that information gain needs to loop through features.
Also it is correct that the matrix scikit-learn give us here is based on doc-features.
But:
in code it uses X.T.nonzero() which technically transform all the nonzero values into array. and then in the next row loop through the length of that array range(0, len(X.T.nonzero()[0]).
Overall, this part X.T.nonzero()[0] is returning all the none zero features to us :)

Resources