different result from varpart and RsquareAdj - rda

I'm getting different results from varpart and RsquareAdj
> (allel_freq.varpar<-varpart(allel_freq.h,env.PCoA,PCNM.red))
Partition of variation in RDA
Call: varpart(Y = allel_freq.h, X = env.PCoA, PCNM.red)
Explanatory tables:
X1: env.PCoA
X2: PCNM.red
No. of explanatory tables: 2
Total variation (SS): 0.0017369
Variance: 0.00017369
No. of observations: 11
Partition table:
                 Df R.squared Adj.R.squared Testable
[a+b] = X1        2   0.23618       0.04522     TRUE
[b+c] = X2        2   0.54147       0.42683     TRUE
[a+b+c] = X1+X2   4   0.65547       0.42578     TRUE
Individual fractions
[a] = X1|X2       2                -0.00106     TRUE
[b]               0                 0.04628    FALSE
[c] = X2|X1       2                 0.38056     TRUE
[d] = Residuals                     0.57422    FALSE
In varpart, the unique effect of env.PCoA (fraction [a]) is -0.00106, but when I use RsquareAdj on the corresponding partial RDA I get a different adjusted R-squared (-0.2266707). Why do the two disagree?
> rda.envspe<-rda(allel_freq.h,env.PCoA,cbind(PCNM.red))
> RsquareAdj(rda.envspe)
$r.squared
[1] 0.1139976
$adj.r.squared
[1] -0.2266707
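The numbers in the two outputs can be cross-checked by hand. The sketch below assumes the usual Ezekiel adjustment, 1 - (1 - R2)(n - 1)/(n - m - 1), and uses only values printed above; the helper name adj_r2 is made up for illustration. It reproduces the Adj.R.squared column, shows that fraction [a] equals the difference of two adjusted R-squared values, and shows that the $r.squared from RsquareAdj equals the difference of the two unadjusted R-squared values. It does not attempt to reproduce the -0.2266707, which depends on how the installed vegan version adjusts a partial model.

```python
def adj_r2(r2, n, m):
    """Ezekiel-adjusted R-squared for n observations and m predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - m - 1)

n = 11  # No. of observations in the varpart output

# Unadjusted R-squared values copied from the partition table
r2_x1, r2_x2, r2_full = 0.23618, 0.54147, 0.65547

adj_x1 = adj_r2(r2_x1, n, 2)      # [a+b]
adj_x2 = adj_r2(r2_x2, n, 2)      # [b+c]
adj_full = adj_r2(r2_full, n, 4)  # [a+b+c]

# Fraction [a] as a difference of adjusted values
frac_a = adj_full - adj_x2

# Partial (unadjusted) R-squared of X1 given X2
partial_r2 = r2_full - r2_x2

print(round(adj_x1, 5), round(adj_x2, 5), round(adj_full, 5))
print("fraction [a]:", round(frac_a, 5))
print("partial R2:", round(partial_r2, 5))
```

So the two functions agree on the unadjusted partial R-squared; they differ only in how the adjustment is applied to the partial model.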


glmnet: Fit a GLMM with lasso or ridge and add binomial cloglog link

How can one specify link functions in glmnet for lasso / ridge / elastic net regression?
I have found the following post but not sure this helps me when I need to specify a cloglog link.
How to specify log link in glmnet?
I have a survey data set with binary response 0/1 (disease no/yes) and several predictor variables, which are mostly binary categorical (yes/no, male/female), some are counts (herd size), and a few are categorical with several levels.
I previously ran a generalized linear mixed model using the glmer() function with binomial family and link = cloglog, because doing so gave the intercept exactly the interpretation I wanted (in a disease study, the intercept from this setup is equivalent to the mean 'force of infection', the rate at which susceptibles become infected, across the variation specified in the random effect; in my case the geographic unit: village, subvillage or household).
As there are several survey variables now available to me, I wanted to try a lasso and a ridge regression using glmnet. My understanding is that the best way to do this is to pass the GLMM formula to glmnet. However, I cannot find any documentation about how to add a link. I added one in the syntax I thought would work, and it did run. But it also ran with nonsense entered as the link function.
Here is a reproducible example:
library(msm)
library(glmnet)
set.seed(1)
N = 1000
X = cbind( rbinom(n=N,size=1,prob=0.5), rnorm(n=N) )
beta = c(-0.1,0.1)
phi.true = exp( X%*%beta )
p = 1 - exp(-phi.true)
y = rbinom(n=N,size=1,prob = p)
dat <- data.frame(x=X,y=y)
x <- model.matrix(y~., dat)
glmnet(x, y, family="binomial", link="logit", alpha = 1, lambda = 2)
I get the same output whether I put in 'logit', 'cloglog' or even an arbitrary name such as 'adam'. I also cannot use the same syntax as for the GLMM, because in glmnet the family argument must be a character vector.
OUTPUT:
> glmnet(x, y, family="binomial"(link="logit"), alpha = 1, lambda = 2)
Error in match.arg(family) : 'arg' must be NULL or a character vector
> glmnet(x, y, family="binomial", link="logit", alpha = 1, lambda = 2)
Call: glmnet(x = x, y = y, family = "binomial", alpha = 1, lambda = 2, link = "logit")
Df %Dev Lambda
1 0 -7.12e-15 2
> glmnet(x, y, family="binomial", link="cloglog", alpha = 1, lambda = 2)
Call: glmnet(x = x, y = y, family = "binomial", alpha = 1, lambda = 2, link = "cloglog")
Df %Dev Lambda
1 0 -7.12e-15 2
> glmnet(x, y, family="binomial", link="adam", alpha = 1, lambda = 2)
Call: glmnet(x = x, y = y, family = "binomial", alpha = 1, lambda = 2, link = "adam")
Df %Dev Lambda
1 0 -7.12e-15 2
Is it not possible to change the default link function for binomial family in glmnet?
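I think you want to use family = binomial(link = "cloglog"), i.e. pass a family object rather than a character string (this is supported from glmnet 4.0 onwards).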
See the new glmnet vignette: https://cran.r-project.org/web/packages/glmnet/vignettes/glmnetFamily.pdf
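For reference, the simulation in the question generates p = 1 - exp(-exp(X %*% beta)), which is the inverse of the cloglog link. The following is a minimal Python sketch of that relationship, not glmnet code; the function names and the covariate row x are made up for illustration, while beta is taken from the question.

```python
import math

def cloglog(p):
    """Complementary log-log link: maps a probability to the linear predictor."""
    return math.log(-math.log(1.0 - p))

def inv_cloglog(eta):
    """Inverse link: maps the linear predictor back to a probability."""
    return 1.0 - math.exp(-math.exp(eta))

beta = (-0.1, 0.1)   # coefficients from the question's simulation
x = (1, 0.5)         # hypothetical covariate row (binary, continuous)

eta = sum(b * v for b, v in zip(beta, x))  # linear predictor
phi = math.exp(eta)                        # corresponds to phi.true in the question
p = inv_cloglog(eta)                       # probability used to draw y

print(eta, phi, p)
print(cloglog(p))  # recovers eta up to floating-point error
```

A model fitted with the logit link would not recover beta on this scale, which is why the choice of link matters here.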

Best way to parallelize computation over dask blocks that do not return np arrays?

I'd like to return a dask dataframe from an overlapping dask array computation, where each block's computation returns a pandas dataframe. The example below shows one way to do this, simplified for demonstration purposes. I've found a combination of da.overlap.overlap and to_delayed().ravel() as able to get the job done, if I pass in the relevant block key and chunk information.
Edit:
Thanks to @AnnaM, who caught bugs in the original post and then generalized the code! Building on her comments, I'm including an updated version of the code. Also, in response to Anna's interest in memory usage, I verified that this does not seem to take up more memory than naively expected.
import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
def extract_features_generalized(chunk, offsets, depth, columns):
    shape = np.asarray(chunk.shape)
    offsets = np.asarray(offsets)
    depth = np.asarray(depth)
    # Local coordinates of all nonzero entries, one row per feature
    coordinates = np.stack(np.nonzero(chunk)).T
    # Keep only features in the chunk interior (drop the overlap region)
    keep = ((coordinates >= depth) & (coordinates < (shape - depth))).all(axis=1)
    # Shift local coordinates to coordinates in the full array
    data = coordinates + offsets - depth
    df = pd.DataFrame(data=data, columns=columns)
    return df[keep]
def my_overlap_generalized(data, chunksize, depth, columns, boundary):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth, boundary=boundary)
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        # block.key[1:] is the block index; scale it to an element offset
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
                                                              depth=depth, columns=columns)
        dfs.append(df_block)
    return dd.from_delayed(dfs)
data = np.zeros((2,4,8,16,16))
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap_generalized(arr,
                            chunksize=(-1,-1,-1,8,8),
                            depth=(0,0,0,2,2),
                            columns=['r', 'c', 'z', 'y', 'x'],
                            boundary=tuple(['reflect']*5))
df.compute().reset_index()
-- Remainder of original post, including original bugs --
My example only does xy overlaps, but it's easy to generalize. Is there anything below that is suboptimal or could be done better? Is anything likely to break because it's relying on low-level information that could change (e.g. block key)?
def my_overlap(data, chunk_xy, depth_xy):
    data = data.rechunk((-1,-1,-1, chunk_xy, chunk_xy))
    data_overlapping_chunks = da.overlap.overlap(data,
                                                 depth=(0,0,0,depth_xy,depth_xy),
                                                 boundary={3: 'reflect', 4: 'reflect'})
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features)(block, offsets=offsets, depth_xy=depth_xy)
        dfs.append(df_block)
    # All computation is delayed, so downstream computations need to know the format of the data. If the meta
    # information is not specified, a single computation will be done (which could be expensive) at this point
    # to infer the metadata.
    # This empty dataframe has the index, column, and type information we expect in the computation.
    columns = ['r', 'c', 'z', 'y', 'x']
    # The dtypes are float64, except for a small number of columns
    df_meta = pd.DataFrame(columns=columns, dtype=np.float64)
    df_meta = df_meta.astype({'c': np.int64, 'r': np.int64})
    df_meta.index.name = 'feature'
    return dd.from_delayed(dfs, meta=df_meta)
def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y+offsets[3]-depth_xy,
                       'x': x+offsets[4]-depth_xy})
    df = df[(df.y > depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.z > depth_xy) & (df.z < (chunk.shape[4] - depth_xy))]
    return df
data = np.zeros((2,4,8,16,16)) # round, channel, z, y, x
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap(arr, chunk_xy=8, depth_xy=2)
df.compute().reset_index()
First of all, thanks for posting your code. I am working on a similar problem and this was really helpful for me.
When testing your code, I discovered a few mistakes in the extract_features function that prevent your code from returning correct indices.
Here is a corrected version:
def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y, 'x': x})
    df = df[(df.y >= depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.x >= depth_xy) & (df.x < (chunk.shape[4] - depth_xy))]
    df['y'] = df['y'] + offsets[3] - depth_xy
    df['x'] = df['x'] + offsets[4] - depth_xy
    return df
The updated code now returns the indices that were set to 1:
   index  r  c  z  y  x
0      0  0  0  4  2  2
1      1  0  1  4  6  2
2      2  0  3  4  2  2
3      1  1  2  4  8  2
For comparison, this is the output of the original version:
   index  r  c  z  y  x
0      1  0  1  4  6  2
1      3  1  2  4  8  2
2      0  0  1  4  6  2
3      1  1  2  4  8  2
It returns rows 2 and 4 of the expected output, twice each.
The reason is three mistakes in the extract_features function:
1. You first add the offset and subtract the depth, and only then filter out the overlapping parts; the order needs to be swapped.
2. df.y > depth_xy should be replaced with df.y >= depth_xy.
3. df.z should be replaced with df.x, since it is the x dimension that has an overlap.
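The first two points can be seen in a one-dimensional toy version that needs neither dask nor pandas. This is only a sketch: the helper names are made up, and the array ends are padded with zeros instead of dask's 'reflect' boundary, which does not matter here because the overlap region is discarded anyway.

```python
def overlapping_chunks(data, chunk, depth):
    """Yield (chunk_index, block), each block carrying `depth` extra cells per side."""
    padded = [0] * depth + list(data) + [0] * depth  # zero padding at the array ends
    for start in range(0, len(data), chunk):
        yield start // chunk, padded[start:start + chunk + 2 * depth]

def extract(block, offset, depth):
    """Return global positions of nonzero cells in the interior of a block."""
    local = [j for j, value in enumerate(block) if value]
    # Filter in local coordinates first, with >= on the lower bound ...
    interior = [j for j in local if depth <= j < len(block) - depth]
    # ... and only then shift to global coordinates.
    return [j + offset - depth for j in interior]

data = [0] * 16
for pos in (2, 6, 8, 15):
    data[pos] = 1

found = []
for idx, block in overlapping_chunks(data, chunk=8, depth=2):
    found += extract(block, offset=idx * 8, depth=2)

print(sorted(found))  # every nonzero position should appear exactly once
```

Position 8 sits exactly at local index depth in the second block, so a strict > on the lower bound would drop it, while filtering after the shift would compare global coordinates against local bounds.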
To take this further, here is a generalized version of the code that works for an arbitrary number of dimensions:
def extract_features_generalized(chunk, offsets, depth, columns):
    coordinates = np.nonzero(chunk)
    df = pd.DataFrame()
    rows_to_keep = np.ones(len(coordinates[0]), dtype=int)
    for i in range(len(columns)):
        df[columns[i]] = coordinates[i]
        rows_to_keep = rows_to_keep * np.array((df[columns[i]] >= depth[i])) * \
            np.array((df[columns[i]] < (chunk.shape[i] - depth[i])))
        df[columns[i]] = df[columns[i]] + offsets[i] - depth[i]
    del coordinates
    return df[rows_to_keep > 0]
def my_overlap_generalized(data, chunksize, depth, columns):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth,
                                                 boundary=tuple(['reflect']*len(columns)))
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
                                                              depth=depth, columns=columns)
        dfs.append(df_block)
    return dd.from_delayed(dfs)
data = np.zeros((2,4,8,16,16))
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap_generalized(arr, chunksize=(-1,-1,-1,8,8),
                            depth=(0,0,0,2,2), columns=['r', 'c', 'z', 'y', 'x'])
df.compute().reset_index()

machine learning using R and randomForestSRC package

I'm trying to use "surv.randomForestSRC" as the learner for machine learning in R.
My code and results are below. "newHCC" is survival data for HCC patients, with the results of multiple numeric parameters.
> newHCC$status = (newHCC$status == 1)
> surv.task = makeSurvTask(data = newHCC, target = c("time", "status"))
> surv.task
Supervised task: newHCC
Type: surv
Target: time,status
Events: 61
Observations: 127
Features:
numerics factors ordered
30 0 0
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
> lrn = makeLearner("surv.randomForestSRC")
> rdesc = makeResampleDesc(method = "RepCV", folds=10, reps=10)
> r = resample(learner = lrn, task = surv.task, resampling = rdesc)
[Resample] repeated cross-validation iter 1: cindex.test.mean=0.485
[Resample] repeated cross-validation iter 2: cindex.test.mean=0.556
[Resample] repeated cross-validation iter 3: cindex.test.mean=0.825
[Resample] repeated cross-validation iter 4: cindex.test.mean=0.81
...
[Resample] repeated cross-validation iter 100: cindex.test.mean=0.683
[Resample] Aggr. Result: cindex.test.mean=0.688
I have several questions.
1. How can I check the parameters that were used, such as ntree, mtry and so on?
2. Is there a good way to tune them?
3. How can I see the predicted individual risk, like what we get from predict in the randomForestSRC package?
Many thanks in advance.
1 and 2. You can try the following:
# Search space for the tuning
surv_param <- makeParamSet(
  makeIntegerParam("ntree", lower = 50, upper = 100),
  makeIntegerParam("mtry", lower = 1, upper = 6),
  makeIntegerParam("nodesize", lower = 10, upper = 50),
  makeIntegerParam("nsplit", lower = 3, upper = 50)
)
# Random search with 10 iterations
rancontrol <- makeTuneControlRandom(maxit = 10L)
surv_tune <- tuneParams(learner = lrn, resampling = rdesc, task = surv.task,
                        par.set = surv_param, control = rancontrol)
# Refit on the full task with the tuned hyperparameters
surv.tree <- setHyperPars(lrn, par.vals = surv_tune$x)
surv <- mlr::train(surv.tree, surv.task)
# Print the fitted model, including the parameters that were used
getLearnerModel(surv)
model <- predict(surv, surv.task)
3. As of today, you cannot predict individual risk with surv.randomForestSRC in mlr. Only the predict type "response" is available.

Is there a way to use Machine Learning to classify discrete and infinite scale data?

The data looks like this:
x y
7773 0
9805 4
7145 0
7645 1
2529 1
4814 2
6027 2
7499 2
3367 1
8861 5
9776 2
8009 5
3844 2
1218 2
1120 1
4553 0
3017 1
2582 2
1691 2
5342 0
...
The real function f(x) is: (Return the circle count of a decimal integer)
#         0  1  2  3  4  5  6  7  8  9
_f_map = [1, 0, 0, 0, 0, 0, 1, 0, 2, 1]
def f(x):
    x = int(x)
    assert x >= 0
    if x == 0:
        return 1
    r = 0
    while x:
        r += _f_map[x % 10]
        x //= 10  # integer division, so the loop also terminates correctly on Python 3
    return r
The training data and test data can be generated randomly:
import random
data = []
target = []
for i in range(3000):  # xrange on Python 2
    x = random.randint(0, 999999)  # hardcode a scale
    data.append([x])
    target.append(f(x))
The real function is discrete and defined on an unbounded (infinite) range.
Is there a method or a model that can classify this data?
I tried an SVM (Support Vector Machine) and got an accuracy of about 20%.
This looks like a typical use case for sequence models. You can train an LSTM or another recurrent neural network by treating each number as a sequence of digits fed to the network. It then only has to learn a sum and a simple per-digit mapping (your _f_map).
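The sketch below is not an LSTM. It only illustrates why a digit-level representation makes the problem easy: once a number is encoded by its digits (here as hypothetical digit-count features), the target is an exact linear function of the features, whereas the raw integer x carries almost no usable structure for an SVM. The names F_MAP and digit_counts are made up for illustration, and f is a compact Python 3 rewrite of the function from the question.

```python
import random

F_MAP = [1, 0, 0, 0, 0, 0, 1, 0, 2, 1]  # circles in each digit 0..9

def f(x):
    """Circle count of a non-negative decimal integer."""
    return sum(F_MAP[int(d)] for d in str(int(x)))

def digit_counts(x):
    """Encode x as a 10-dimensional vector: how often each digit occurs."""
    counts = [0] * 10
    for d in str(int(x)):
        counts[int(d)] += 1
    return counts

random.seed(0)
xs = [random.randint(0, 999999) for _ in range(3000)]

# With digit-count features, the target is the dot product with F_MAP,
# so even a plain linear model could fit it exactly.
exact = all(
    f(x) == sum(c * w for c, w in zip(digit_counts(x), F_MAP))
    for x in xs
)
print(exact)
```

A recurrent network reading the digits one by one has to discover the same per-digit weights and the running sum on its own.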

Is an admissible heuristic always monotone (consistent)?

For the A* search algorithm, given a heuristic h, suppose h is admissible.
That is:
h(n) ≤ h*(n) for every node n, where h* is the real cost from n to goal.
Does this ensure the heuristic is monotone?
That is:
f(n) ≤ f(n') = g(n') + h(n') for every successor n' of n, where f(n) = g(n) + h(n) and g(n) is the accumulated cost.
No.
Assume you have three successor states s1, s2, s3 and a goal state g so that s1 -> s2 -> s3 -> g.
s1 is the starting node.
Consider also the following values for h(s) and h*(s) (i.e. true cost):
h(s1) = 3 , h*(s1) = 6
h(s2) = 4 , h*(s2) = 5
h(s3) = 1 , h*(s3) = 3
h(g) = 0 , h*(g) = 0
Following the only path to the goal we have:
g(s1) = 0, g(s2) = 1, g(s3) = 3, g(g) = 6, coinciding with the true cost above.
Although the heuristic function is admissible (h(s) <= h*(s) for every state), f(n) is not monotonic: f(s2) = h(s2) + g(s2) = 4 + 1 = 5, while f(s3) = h(s3) + g(s3) = 1 + 3 = 4, so f decreases from s2 to its successor s3. Equivalently, h(s2) = 4 > c(s2, s3) + h(s3) = 2 + 1 = 3, which violates consistency.
Of course this means you have a quite uninformative heuristic.
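Both properties are easy to check mechanically. The sketch below uses the chain s1 -> s2 -> s3 -> g with edge costs 1, 2 and 3 (the differences of the g values above) and the true costs h* from the answer; the values in h are illustrative and chosen here so that the violation shows up, and the function names are made up.

```python
def is_admissible(h, h_star):
    """h never overestimates the true cost to the goal."""
    return all(h[n] <= h_star[n] for n in h)

def is_consistent(h, edges):
    """h(n) <= c(n, n') + h(n') for every edge (n, n')."""
    return all(h[a] <= cost + h[b] for (a, b), cost in edges.items())

edges = {("s1", "s2"): 1, ("s2", "s3"): 2, ("s3", "g"): 3}
h_star = {"s1": 6, "s2": 5, "s3": 3, "g": 0}
h = {"s1": 3, "s2": 4, "s3": 1, "g": 0}  # illustrative heuristic values

# Accumulated cost g and f = g + h along the only path
path = ["s1", "s2", "s3", "g"]
g_cost = {"s1": 0}
for a, b in zip(path, path[1:]):
    g_cost[b] = g_cost[a] + edges[(a, b)]
f_values = [g_cost[n] + h[n] for n in path]

print(is_admissible(h, h_star), is_consistent(h, edges), f_values)
```

The converse does hold: a consistent heuristic with h(goal) = 0 is always admissible.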
