Random forest variable importance - random-forest

Does anyone know how to do variable importance for multiple factors (TILLAGEm, CROPSTAGEm ) in R? Show multiple factor effects on..
1.factor
TILLAGEm <- randomForest(TILLAGE~SWC+DHA+GLU+PHOS+LC, data=datarandom,importance=TRUE,proximity=TRUE,mtry=3,ntree=1000, maxnodes=21 )
2.factor
CROPSTAGEm <- randomForest(CROPSTAGE~SWC+DHA+GLU+PHOS+LC, data=datarandom, importance=TRUE,proximity=TRUE,mtry=2,ntree=1000, maxnodes=21 )
RandomForest::varImpPlot(TILLAGEm, type=1, sort = T, scale = T, bg = "blue", pch=22, main = 'TILLAGE') #PLOT
RandomForest::varImpPlot(TILLAGEm, type=1, sort = T, scale = T, bg = "blue", pch=22, main = 'TILLAGE') #PLOT
type here
enter image description here

Related

Linefitting how to deal with continuous values?

I'm trying to fit a line using quadratic poly, but because the fit results in continuous values, the integer conversion (for CartesianIndex) rounds it off, and I loose data at that pixel.
I tried the method
here. So I get new y values as
using Images, Polynomials, Plots,ImageView
img = load("jTjYb.png")
img = Gray.(img)
img = img[end:-1:1, :]
nodes = findall(img.>0)
xdata = map(p->p[2], nodes)
ydata = map(p->p[1], nodes)
f = fit(xdata, ydata, 2)
ydata_new .= round.(Int, f.(xdata)
new_line_fitted_img=zeros(size(img))
new_line_fitted_img[xdata,ydata_new].=1
imshow(new_line_fitted_img)
which results in chopped line as below
whereas I was expecting it to be continuous line as it was in pre-processing
Do you expect the following:
Raw Image
Fitted Polynomial
Superposition
enter image description here
enter image description here
enter image description here
Code:
using Images, Polynomials
img = load("img.png");
img = Gray.(img)
fx(data, dCoef, cCoef, bCoef, aCoef) = #. data^3 *aCoef + data^2 *bCoef + data*cCoef + dCoef;
function fit_poly(img::Array{<:Gray, 2})
img = img[end:-1:1, :]
nodes = findall(img.>0)
xdata = map(p->p[2], nodes)
ydata = map(p->p[1], nodes)
f = fit(xdata, ydata, 3)
xdt = unique(xdata)
xdt, fx(xdt, f.coeffs...)
end;
function draw_poly!(X, y)
the_min = minimum(y)
if the_min<0
y .-= the_min - 1
end
initialized_img = Gray.(zeros(maximum(X), maximum(y)))
initialized_img[CartesianIndex.(X, y)] .= 1
dif = diff(y)
for i in eachindex(dif)
the_dif = dif[i]
if abs(the_dif) >= 2
segment = the_dif รท 2
initialized_img[i, y[i]:y[i]+segment] .= 1
initialized_img[i+1, y[i]+segment+1:y[i+1]-1] .= 1
end
end
rotl90(initialized_img)
end;
X, y = fit_poly(img);
y = convert(Vector{Int64}, round.(y));
draw_poly!(X, y)

Clustered resampling for inner layer of Caret recursive feature elimination

I have data where IDs are contained within clusters.
I would like to perform recursive feature elimination using Caret's rfe function which performs the following procedure:
Clustered resampling for the outer layer (line 2.1) is straightforward, using the index parameter.
However, within each outer resample, I would like to tune tuning parameters using cluster-based cross-validation (inner resampling) (line 2.9). Model tuning in the inner layer is possible by specifying a tuneGrid in rfe and having an appropriate trControl. It is this trControl that I would like to change to allow clustered resampling.
The outer resampling is specified in the rfeControl parameter of rfe.
The inner resampling is specified by trControl of rfe which is passed to train.
The trouble I am having is that I can't seem to specify any inner indices, because after the outer resampling, those indices are no longer valid or no longer present in the outer-resampled data.
I am looking for a way to tell train to take an outer resample (which will be missing a cluster against which to validate), and to tune the model using inner resampling by based on folds of the remaining clusters.
The MWE is as minimal as possible:
library(caret)
library(tidyverse)
library(parallel)
library(doParallel)
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
### Create some random data, 10 features, with some influence over a binomial outcome
set.seed(42)
id <- 1:1000
cluster <- rep(1:10, each = 100)
dat <- data.frame(id, cluster, replicate(10,rnorm(n = 1000, mean = runif(1, 0,100)+cluster, sd = runif(1, 0,20))))
dat <- dat %>% mutate(temp = rowSums(across(X1:X10)), prob = range01(temp), outcome = rbinom(n = nrow(dat), size = 1, prob = prob))
dat$outcome <- as.factor(dat$outcome)
levels(dat$outcome) <- c("control", "case")
dat$outcome <- factor(dat$outcome, levels=rev(levels(dat$outcome)))
### Manual outer folds-based cluster ###
for(i in 1:10) {
assign(paste0("index", i), which(dat$cluster!=i))
}
unit_indices <- list(index1, index2, index3, index4, index5, index6, index7, index8, index9, index10)
### Inner resampling method (THIS IS WHAT I'D LIKE TO CHANGE) ###
cv5 <- trainControl(classProbs = TRUE, method = "cv", number = 5, allowParallel = F) ## Is there a way to have inner cluster-based resampling WITHIN the outer cluster-based resampling?
caret_rfe_functions <- list(summary = twoClassSummary,
fit = function (x, y, first, last, ...) {
train(x, y, ...)
},
pred = caretFuncs$pred,
rank = function(object, x, y) {
vimp <- varImp(object)$importance
vimp <- vimp[order(vimp$Overall,decreasing = TRUE),,drop = FALSE]
vimp$var <- rownames(vimp)
vimp
},
selectSize = function (x, metric = "ROC", tol = 1, maximize = TRUE)
{
if (!maximize) {
best <- min(x[, metric])
perf <- (x[, metric] - best)/best * 100
flag <- perf <= tol
}
else {
best <- max(x[, metric])
perf <- (best - x[, metric])/best * 100
flag <- perf <= tol
}
min(x[flag, "Variables"])
},
selectVar = caretFuncs$selectVar)
caret_rfe_ctrl <- rfeControl(
functions = caret_rfe_functions,
saveDetails = TRUE,
index = unit_indices,
indexOut = NULL,
returnResamp = "all",
allowParallel = T, ### change this if you don't want to / can't go parallel
verbose = TRUE
)
#### Feature selection ####
set.seed(42)
cl <- makePSOCKcluster(10) ### for parallel processing if available
registerDoParallel(cl)
rfe_profile_nnet <- rfe(
form = outcome ~
X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10,
data = dat,
sizes = seq(2,10,1),
rfeControl = caret_rfe_ctrl,
## pass options to train()
method = "nnet",
preProc = c("center", "scale"),
metric = "ROC",
tuneGrid = expand.grid(size = c(1:5), decay = 5),
trControl = cv5) ### I would like to change this to allow inner cluster-based resampling
stopCluster(cl)
rfe_profile_nnet
plot(rfe_profile_nnet)
Presumably the inner cluster-based resampling would be achieved by specifying a new trainControl containing some dynamic inner index based on the outer resample that is selected at the time:
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = {insert magic here}, ### This is the important bit
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F) ### especially if the outer resample is parallelised
If you try with the original cluster indices e.g.
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = unit_indices,
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F)
There are various warnings about missing data in the resamples, and things like 24: In [<-.data.frame(*tmp*, , object$method$center, value = structure(list( ... : provided 81 variables to replace 9 variables.
All help greatly appreciated.
As a postscript question , you can see which parameters were used within your rfe like so:
> rfe_profile_nnet$fit
Neural Network
1000 samples
8 predictor
2 classes: 'case', 'control'
Pre-processing: centered (8), scaled (8)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results across tuning parameters:
size Accuracy Kappa
1 0.616 0.1605071
2 0.616 0.1686937
3 0.620 0.1820503
4 0.618 0.1788491
5 0.618 0.1788063
Tuning parameter 'decay' was held constant at a value of 5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 3 and decay = 5.
But does anyone know if this refers to one, or all of the outer resamples? Presumably the same tuning parameters won't necessarily be chosen across all outer resamples

glmnet: Fit a GLMM with lasso or ridge and add binomial cloglog link

How can one specify link functions in glmnet for lasso / ridge / elastic net regression?
I have found the following post but not sure this helps me when I need to specify a cloglog link.
How to specify log link in glmnet?
I have a survey data set with binary response 0/1 (disease no/yes) and several predictor variables, which are mostly binary categorical (yes/no, male/female), some are counts (herd size), and a few are categorical with several levels.
I previously ran a generalized linear mixed model using glmer() function with binomial family and link = cloglog as doing so created the exact interpretation of the resulting intercept that I wanted (in disease study the intercept from this setup is equivalent to the mean value 'force of infection' - the rate at which susceptibles become infected - among the variation specified in the random effect (in my case the geographic unit (village or subvillage or household).
As there are several survey variables now available to me, I wanted to try a lasso and a ridge regression using glmnet. It is my understanding that I should best do this by putting in the glmm formula into the glmnet. However, I cannot find any documentation about how to add a link. I did so, in the syntax I thought would work, and it did run. But it also ran with nonsense entered in the link function.
Here is a reproducible example:
library(msm)
library(glmnet)
set.seed(1)
N = 1000
X = cbind( rbinom(n=N,size=1,prob=0.5), rnorm(n=N) )
beta = c(-0.1,0.1)
phi.true = exp( X%*%beta )
p = 1 - exp(-phi.true)
y = rbinom(n=N,size=1,prob = p)
dat <- data.frame(x=X,y=y)
x <- model.matrix(y~., dat)
glmnet(x, y, family="binomial", link="logit", alpha = 1, lambda = 2)
I get the same output whether I put in 'logit', 'cloglog' or even a name 'adam'. And cannot use same syntax as GLMM as in glmnet must be a character vector.
OUTPUT:
> glmnet(x, y, family="binomial"(link="logit"), alpha = 1, lambda = 2)
Error in match.arg(family) : 'arg' must be NULL or a character vector
> glmnet(x, y, family="binomial", link="logit", alpha = 1, lambda = 2)
Call: glmnet(x = x, y = y, family = "binomial", alpha = 1, lambda = 2, link = "logit")
Df %Dev Lambda
1 0 -7.12e-15 2
> glmnet(x, y, family="binomial", link="cloglog", alpha = 1, lambda = 2)
Call: glmnet(x = x, y = y, family = "binomial", alpha = 1, lambda = 2, link = "cloglog")
Df %Dev Lambda
1 0 -7.12e-15 2
> glmnet(x, y, family="binomial", link="adam", alpha = 1, lambda = 2)
Call: glmnet(x = x, y = y, family = "binomial", alpha = 1, lambda = 2, link = "adam")
Df %Dev Lambda
1 0 -7.12e-15 2
Is it not possible to change the default link function for binomial family in glmnet?
I think you want to use family = binomial(link = "cloglog")
See the new glmnet vignette: https://cran.r-project.org/web/packages/glmnet/vignettes/glmnetFamily.pdf

How to create stratified folds for repeatedcv in caret?

The way to create stratified folds for cv in caret is like this
library(caret)
library(data.table)
train_dat <- data.table(group = c(rep("group1",10), rep("group2",5)), x1 = rnorm(15), x2 = rnorm(15), label = factor(c(rep("treatment",15), rep("control",15))))
folds <- createFolds(train_dat[, group], k = 5)
fitCtrl <- trainControl(method = "cv", index = folds, classProbs = T, summaryFunction = twoClassSummary)
train(label~., data = train_dat[, !c("group"), with = F], trControl = fitCtrl, method = "xgbTree", metric = "ROC")
To balance group1 and group2, the creation of fold indexes is based on "group" variable.
However, is there any way to createFolds for repeatedcv in caret? So, I can have a balanced split for repeatedcv. Should I combined several createFolds and run trainControl?
trControl = trainControl(method = "cv", index = many_repeated_folds)
Thanks!
createMultiFolds is probably what you are interested in.

tf.IndexedSlicesValue when returned from tf.gradients()

I'm having the following problem, I have four embedding matrices and want to get the gradients of my loss function with respect to those matrices.
When I run the session to return the values for the gradients, two of those returned objects are of type tensorflow.python.framework.ops.IndexedSlicesValue, the other two are numpy arrays. Now for the numpy arrays, their shape corresponds to the shape of their corresponding embedding matrix, but I'm having problems with the IndexedSlicesValue objects.
If I call .values on one of those objects, I get an array whose shape does not match that of the gradient, the shape of the embedding matrix is [22,30], but calling .values on the IndexedSlicesValue object I get an array with shape [4200,30] ( The shape of my input tensor had dimensions of [30,20,7], the product of those dimensions equals 4200, not sure if this is relevant).
The IndexedSlicesValue object has an attribute called dense_shape, which is an array that holds the dimensions the gradient should have, i.e. array([22,30]) is value returned by .dense_shape.
I don't really understand the docs here: https://www.tensorflow.org/versions/r0.7/api_docs/python/state_ops.html#IndexedSlices
It says:
An IndexedSlices is typically used to represent a subset of a
larger tensor dense of shape [LARGE0, D1, .. , DN] where LARGE0 >> D0.
The values in indices are the indices in the first dimension of the
slices that have been extracted from the larger tensor.
So this array of shape (4200,30) is extracted from an array corresponding to an even larger, dense tensor?
What exactly is the gradient in this IndexedSlicesValue object and why does tensorflow automatically use this type for some gradients returned by tf.gradients()?
Here is my code:
input_tensor = tf.placeholder(tf.int32, shape = [None, memory_size, max_sent_length], name = 'Input')
q_tensor = tf.placeholder(tf.int32, shape = [None,max_sent_length], name = 'Question')
a_tensor = tf.placeholder(tf.float32, shape = [None,V+1], name = 'Answer')
# Embedding matrices
A_prior = tf.get_variable(name = 'A', shape = [V+1,d], initializer = tf.random_normal_initializer(stddev = 0.1))
A = tf.concat(0,[tf.zeros(shape = tf.pack([1,tf.shape(A_prior)[1]])),tf.slice(A_prior,[1,0],[-1,-1])])
B = tf.get_variable(name = 'B', shape = [V+1,d], initializer = tf.random_normal_initializer(stddev = 0.1))
C = tf.get_variable(name = 'C', shape = [V+1,d], initializer = tf.random_normal_initializer(stddev = 0.1))
W = tf.get_variable(name = 'W', shape = [V+1,d], initializer= tf.random_normal_initializer(stddev = 0.1))
embeddings = tf.reduce_sum(tf.nn.embedding_lookup(A,input_tensor),2)
u = tf.reshape(tf.reduce_sum(tf.nn.embedding_lookup(B,q_tensor),1),[-1,1,d])
test = tf.transpose(embeddings, perm = [0,2,1])
test_batch_mul = tf.squeeze(tf.batch_matmul(u,test))
cond = tf.not_equal(test_batch_mul,0.0)
tt = tf.fill(tf.shape(test_batch_mul),-1000.0)
softmax_in = tf.select(cond, test_batch_mul, tt)
p_values = tf.nn.softmax(softmax_in)
c_values = tf.reduce_sum(tf.nn.embedding_lookup(C,input_tensor),2)
o = tf.squeeze(tf.batch_matmul(tf.expand_dims(p_values,1),c_values))
a_pred = tf.nn.softmax(tf.matmul(tf.squeeze(u)+o,tf.transpose(W)))
loss = tf.nn.softmax_cross_entropy_with_logits(a_pred, a_tensor, name = 'loss')
cost = tf.reduce_mean(loss)
global_step = tf.Variable(0,name = 'global_step', trainable= False)
#optimizer = tf.train.MomentumOptimizer(0.01,0.9)
vars_list = tf.trainable_variables()
grads = tf.gradients(cost, vars_list)
#train_op = optimizer.minimize( cost, global_step, vars_list, name = 'train_op')
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)
input_feed = {input_tensor : phrases, q_tensor : questions, a_tensor : answers}
grad_results = sess.run(grads, feed_dict = input_feed)
I had the same issue, apparently IndexedSlices objects are automatically created for some Embedding matrices when computing their gradients,
If you want to access the gradients of the trainable variables of the Embedding, you need to convert the IndexedSlices to a tensor, by simply using:
tf.convert_to_tensor(gradients_of_the_embedding_layer)

Resources