I can't get the coefficients of the taxa that contribute to the group differences

I'm trying to get the coefficients of the taxa that generate the most differences between the groups in my permutest of the feeding types for the animals.
I paste the results of the permutest:
permutest(betadispersionAlimentación, pairwise = TRUE)
[screenshot of the permutest output]
When I try to get the coefficients, RStudio answers me with this:
coefficients(betadispersionAlimentación)["C",]
NULL
Which type of object should I get the coefficients from:
the PERMANOVA?
the betadisper?
the permutest?
I don't know how I can get the taxa (genus, family, species...) that contribute to the differences.
Can anybody help me, please?
Thanks a lot!

For anybody who has the same problem:
Since the day I posted the question I have worked on this.
To get the genus or the species that contributes most to the dissimilarity, I used the vegan library.
Here I used vegdist to create the distance object:
DIST_ABUNDANCES <- vegdist(t(OTU_ABUNDANCES))
Here I ran the anova function on the betadisper result to get all the values:
anova(betadisper(DIST_ABUNDANCES, META_ABUNDANDES$Sexo))
Here I ran the permutest to compare all the values of the variable Sexo:
permutest(betadisper(DIST_ABUNDANCES, META_ABUNDANDES$Sexo), pairwise = TRUE)
Here I ran a PERMANOVA with adonis (the step the linked tutorial uses) and took its coefficients to find the taxa behind the differences:
PERMANOVA_ABUNDANCES <- adonis(t(OTU_ABUNDANCES) ~ Sexo, data = META_ABUNDANDES)
COEF_ABUNDANCES <- coefficients(PERMANOVA_ABUNDANCES)["Sexo1",]
Then I created the top-coefficients object:
TOP.COEF_ABUNDANCES <- COEF_ABUNDANCES[rev(order(abs(COEF_ABUNDANCES)))[1:20]]
And here I plotted the coefficients:
par(mar = c(3, 14, 2, 1))
barplot(sort(TOP.COEF_ABUNDANCES), horiz = T, las = 1, main = "Top taxa")
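The names on that vector are the taxa themselves (the columns of the adonis coefficient matrix), so listing them may be the quickest check:
names(TOP.COEF_ABUNDANCES) # the 20 taxa with the largest absolute coefficients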
I followed this tutorial: https://mibwurrepo.github.io/Microbial-bioinformatics-introductory-course-Material-2018/multivariate-comparisons-of-microbial-community-composition.html

Related

Seurat data visualization

Hi, I am using the public pbmc data to practice single-cell analysis.
I got stuck at this point with this error message.
I just started with R and am having a hard time.
Could anyone give me a pointer?
Many thanks!
Ridge plots - from ggridges. Visualize single cell expression distributions in each cluster
Code: RidgePlot(pbmc3k.final, features = features, ncol = 2)
Error in FetchData.Seurat(object = object, vars = features, slot = slot) :
object 'features' not found
You would need to define which 'features' you want to see, so:
features = c('Sox2', 'Sox9') #etc, as an example
or you can assign them to a variable called features:
features <- c('Sox2', 'Sox9')
RidgePlot(pbmc3k.final, features = features, ncol = 2)
# note: 'Sox2' and 'Sox9' are mouse-style gene symbols; human gene symbols are all caps (SOX2, SOX9).
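Since the pbmc data is human, here is a quick sketch with genes that do appear in the pbmc3k dataset (this assumes pbmc3k.final is the processed object from Seurat's pbmc3k tutorial):
features <- c("MS4A1", "GNLY", "CD3E") # human symbols, all caps
RidgePlot(pbmc3k.final, features = features, ncol = 2)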

Predict type="response" to type="terms conversion in R

Can someone please help me with the math? I need to convert the output of my GLM from type = "response" to type = "terms", and I want to understand the math behind the conversion.
Let's say I am using gender (female(1), male(0)) to predict the college admission rate (0 to 1).
model <- glm(admission_rate ~ gender, data = data,family = quasipoisson(link="log"))
Model coefficients are
intercept 0.24918
genderFemale -0.23229
Now when I run
predict(model, newdata = data, type = "response")
the values I get follow the equation y = 0.24918 + (-0.23229) * 1 for female and y = 0.24918 for male. Since the model uses a log link, we take the exponent of each, and what we get are the fitted values produced by type = "response":
female = 1.017
male = 1.283
I have tried so many things to convert these to the fitted values produced by type = "terms", but did not get them to match.
The fitted values produced by type = "terms" should be
female = 0.152984
male = -0.07
constant = 0.096198
If you can explain the math behind this, I would really appreciate it!
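A sketch of the math, assuming standard treatment contrasts: predict(..., type = "terms") centers each term around its sample mean on the link scale. For a single 0/1 dummy x, the term is beta * (x - mean(x)) and the reported constant is intercept + beta * mean(x), so constant + term equals the link-scale prediction, and exponentiating it recovers type = "response". With the coefficients above, constant = 0.096198 implies mean(x) = (0.24918 - 0.096198) / 0.23229 ≈ 0.659, giving terms -0.23229 * (1 - 0.659) ≈ -0.079 and -0.23229 * (0 - 0.659) ≈ 0.153. A quick check in R (assuming the fitted model object above):
tt <- predict(model, type = "terms")
const <- attr(tt, "constant") # the reported constant
all.equal(unname(exp(const + rowSums(tt))),
          unname(predict(model, type = "response"))) # should be TRUE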

Distance function for DBSCAN

I would like to use a clustering algorithm to find a clustering for a big digraph, and I would also like to remove noise from this graph. So I was thinking of using the DBSCAN approach, because I saw that we can give the algorithm a distance function for determining the distance/similarity between two different nodes.
My question is: how can I define a distance function under which two nodes that are close in terms of hops are similar, and an isolated node is dissimilar?
I don't have coordinates or node attributes, so I cannot use those. I only have the topology of the graph.
The expected output would be something like this: [figure of the desired clustering]
I'm really concerned about the complexity of the solution. How can I approximate a clustering with linear complexity ...
What is wrong with the obvious?
Distance(a, b) = length of the shortest path, or infinity if there is none.
You probably should take directions into account, so a0 to a3 is 1.
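In igraph that metric is a single call; here is a minimal sketch on a toy directed cycle plus one isolated node (my own example, not the OP's graph):
library(igraph)
# toy graph: directed cycle a0 -> a1 -> a2 -> a3 -> a0, plus isolated c0
g <- graph_from_literal(a0 -+ a1 -+ a2 -+ a3 -+ a0, c0)
D <- distances(g, mode = "out") # hop counts; unreachable pairs come back as Inf
D["a3", "a0"] # 1: direct edge
D["a0", "c0"] # Inf: no path to the isolated node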
The distance metric suggested by @Anony-Mousse is a good and natural one, but I question the use of DBSCAN. Using the proposed
distance = length of the shortest path, or infinity if there is none
any two nodes that are directly linked would be at distance 1. If you used DBSCAN with epsilon < 1, all points would be noise points, so you will want epsilon > 1. From your example, it looks like if there is even one point at distance 1 you want them in the same component, so it looks like you want minNumPts = 2. This would give the result that if two points are connected by a path of any length, they would be in the same cluster. It looks to me like what you are after has nothing to do with density and clustering; rather, I think what you want is connected components. If two nodes are connected by a path of any length, they are in the same component. Finding this via DBSCAN or some other clustering method may be possible, but that is probably the wrong way to think about this. You have a graph and a graph-theoretic problem, so you should probably use methods from graph theory.
I will illustrate using R and igraph; there are other tools if you don't care for these. Most of the work is simply setting up your problem.
library(igraph)
to = c("a1", "a2", "a3", "a0", "b1", "b2", "b3", "b0")
from = c("a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3")
EL = data.frame(from, to)
Vert = c("a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3", "c0", "d0")
Vdf = data.frame(Vert)
g = graph_from_data_frame(d = EL, vertices=Vdf)
LO = matrix(c(1.2,1,1,1.2, 2.2,2,2,2.2, 0,3, 4,3,2,1, 4,3,2,1, 4,4), ncol=2)
plot(g, layout=LO)
Now we can use a one-liner to get everything that we need about the components.
Comp = components(g, mode="weak")
Comp
$membership
a0 a1 a2 a3 b0 b1 b2 b3 c0 d0
1 1 1 1 2 2 2 2 3 4
$csize
[1] 4 4 1 1
$no
[1] 4
This is telling us the component membership of the nodes, the number of nodes per component, and the number of components. Since you wanted to call the single-node components "noise" in the style of DBSCAN, you can see that components 3 and 4 have one node each. They are the noise. The others are "real" components.
To show how to use this, and to come to closure with a pretty picture, I will plot the graph, coloring the components and using light gray for the "noise".
ColorMap = rainbow(Comp$no)
ColorMap[Comp$csize == 1] = "lightgray"
plot(g, layout=LO, vertex.color=ColorMap[Comp$membership])
I encourage you to think about your graph problem as a graph.

Predictors of different size for time series prediction using LSTM with Keras

I would like to predict the values of a time series X using another time series Y and the past values of X. In detail, I would like to predict X at time t (Xt) using (Xt-p, ..., Xt-1) and (Yt-p, ..., Yt-1, Yt), with p the size of the "look back".
So my problem is that my two predictors do not have the same length.
Let's use an example to be clearer.
If I use a timestep of 2, I would have for one observation:
[(Xt-p, Yt-p), ..., (Xt-1, Yt-1), (??, Yt)] as input and Xt as output. I do not know what to use instead of the ??.
I understand that, mathematically speaking, I need my predictors to have the same length, so I am looking for a value to replace the missing one.
I really do not know if there is a good solution here, or whether I could do something else, so any help would be greatly appreciated.
Cheers!
PS: you could see my problem as if I wanted to predict the number of ice creams sold one day in advance in a city, using the weather forecast for the next day. X would be the number of ice creams and Y could be the temperature.
You could, for example, use two LSTMs, one per input, and concatenate their outputs; that way X and Y do not need the same length:
from keras.layers import Input, LSTM, Dense, Concatenate
from keras.models import Model

input_x = Input(shape=input_shape_x)
input_y = Input(shape=input_shape_y)
lstm_for_x = LSTM(50, return_sequences=False)(input_x)
lstm_for_y = LSTM(50, return_sequences=False)(input_y)
merged = Concatenate()([lstm_for_x, lstm_for_y]) # keras >= 2.0; in keras < 2.0 use merge([lstm_for_x, lstm_for_y], mode="concat")
output = Dense(1)(merged)
model = Model([input_x, input_y], output)
model.compile(..)
model.fit([X, Y], X_next)
Here X is an array of X windows of length p, Y is an array of Y windows of length p + 1 (each includes Yt), and X_next holds the value of X one step after each window; using two separate inputs is what lets the windows have different lengths.

Glm with caret package producing "missing values in resampled performance measures"

I obtained the following code from the Stack Overflow question "caret train() predicts very different then predict.glm()".
The following code produces an error.
I am using caret 6.0-52.
library(car); library(caret); library(e1071)
#data import and preparation
data(Chile)
chile <- na.omit(Chile) #remove "na's"
chile <- chile[chile$vote == "Y" | chile$vote == "N" , ] #only "Y" and "N" required
chile$vote <- factor(chile$vote) #required to remove unwanted levels
chile$income <- factor(chile$income) # treat income as a factor
tc <- trainControl("cv", 2, savePredictions=T, classProbs=TRUE,
                   summaryFunction=twoClassSummary) # "cv" = 2-fold cross-validation
fit <- train(chile$vote ~ chile$sex +
               chile$education +
               chile$statusquo,
             data = chile,
             method = "glm",
             family = binomial,
             metric = "ROC",
             trControl = tc)
Running this code produces the following error.
Something is wrong; all the ROC metric values are missing:
ROC Sens Spec
Min. : NA Min. :0.9354 Min. :0.9187
1st Qu.: NA 1st Qu.:0.9354 1st Qu.:0.9187
Median : NA Median :0.9354 Median :0.9187
Mean :NaN Mean :0.9354 Mean :0.9187
3rd Qu.: NA 3rd Qu.:0.9354 3rd Qu.:0.9187
Max. : NA Max. :0.9354 Max. :0.9187
NA's :1
Error in train.default(x, y, weights = w, ...) : Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Would anyone know what the issue is, or can anyone reproduce / not reproduce this error? I've seen other answers to this error message saying it has to do with not having both classes represented in each cross-validation fold, but that isn't the issue here, as the number of folds is set to 2.
Looks like I needed to install and load the pROC package.
install.packages("pROC")
library(pROC)
You should install using
install.packages("caret", dependencies = c("Imports", "Depends", "Suggests"))
That gets most of the default packages. If there are specific modeling packages that are missing, the code usually prompts you to install them.
I know I'm late to the party, but I think you need to set classProbs = TRUE in trainControl.
You are using logistic regression when using the parameters method = "glm", family = binomial.
In this case, you must make sure that the target variable (chile$vote) has only 2 factor levels, because logistic regression only performs binary classification.
If the target has more than two labels, then you must set family = "multinomial".
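Putting these answers together, here is a minimal sketch of a call that should complete (it assumes the Chile data preparation above and that pROC is installed; note the formula uses plain column names rather than chile$...):
library(pROC)
tc <- trainControl(method = "cv", number = 2, savePredictions = TRUE,
                   classProbs = TRUE, summaryFunction = twoClassSummary)
fit <- train(vote ~ sex + education + statusquo,
             data = chile, method = "glm", family = binomial,
             metric = "ROC", trControl = tc)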
