How to use the prepare_analogy_questions and check_analogy_accuracy functions in the text2vec package?

The following code:
library(text2vec)
text8_file = "text8"
if (!file.exists(text8_file)) {
download.file("http://mattmahoney.net/dc/text8.zip", "text8.zip")
unzip ("text8.zip", files = "text8")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)
# Create iterator over tokens
tokens <- space_tokenizer(wiki)
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
RcppParallel::setThreadOptions(numThreads = 4)
glove_model = GloVe$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10, learning_rate = .25)
word_vectors_main = glove_model$fit_transform(tcm, n_iter = 20)
word_vectors_context = glove_model$components
word_vectors = word_vectors_main + t(word_vectors_context)
causes an error:
qlst <- prepare_analogy_questions("questions-words.txt", rownames(word_vectors))
> Error in (function (fmt, ...) :
invalid format '%d'; use format %s for character objects
The file questions-words.txt is from the word2vec sources: https://github.com/nicholas-leonard/word2vec/blob/master/questions-words.txt

This was a small bug in the formatting of an informational message (introduced with the switch to futile.logger). I've just fixed it and pushed to GitHub.
You can install the updated version of the package with devtools::install_github("dselivanov/text2vec")
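Once the updated version is installed, the analogy test should run end to end. A minimal sketch (the check_analogy_accuracy argument names follow the package documentation of that era and may differ between versions):
qlst <- prepare_analogy_questions("questions-words.txt", rownames(word_vectors))
res <- check_analogy_accuracy(questions_list = qlst, m_word_vectors = word_vectors)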

Related

Why do all the emission mu of my HMM in pyro converge to the same number?

I'm trying to create a Gaussian HMM model in pyro to infer the parameters of a very simple Markov sequence. However, my model fails to infer the parameters, and something weird happened during the training process. Using the same sequence, hmmlearn successfully inferred the true parameters.
Full code can be accessed here:
https://colab.research.google.com/drive/1u_4J-dg9Y1CDLwByJ6FL4oMWMFUVnVNd#scrollTo=ZJ4PzdTUBgJi
My model is modified from the example here:
https://github.com/pyro-ppl/pyro/blob/dev/examples/hmm.py
I manually created a first-order Markov sequence with 3 states; the true means are [-10, 0, 10] and the sigmas are [1, 2, 1].
Here is my model:
# imports assumed by this excerpt (the full code is in the linked notebook)
import torch
import pyro
import pyro.distributions as dist
from pyro import poutine
from pyro.ops.indexing import Vindex
from torch.distributions import constraints

def model(observations, num_state):
    assert not torch._C._get_tracing_state()
    with poutine.mask(mask = True):
        p_transition = pyro.sample("p_transition",
                                   dist.Dirichlet((1 / num_state) * torch.ones(num_state, num_state)).to_event(1))
        p_init = pyro.sample("p_init",
                             dist.Dirichlet((1 / num_state) * torch.ones(num_state)))
        p_mu = pyro.param(name = "p_mu",
                          init_tensor = torch.randn(num_state),
                          constraint = constraints.real)
        p_tau = pyro.param(name = "p_tau",
                           init_tensor = torch.ones(num_state),
                           constraint = constraints.positive)
        current_state = pyro.sample("x_0",
                                    dist.Categorical(p_init),
                                    infer = {"enumerate" : "parallel"})
        for t in pyro.markov(range(1, len(observations))):
            current_state = pyro.sample("x_{}".format(t),
                                        dist.Categorical(Vindex(p_transition)[current_state, :]),
                                        infer = {"enumerate" : "parallel"})
            pyro.sample("y_{}".format(t),
                        dist.Normal(Vindex(p_mu)[current_state], Vindex(p_tau)[current_state]),
                        obs = observations[t])
My training is set up as follows:
# imports assumed by this excerpt
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoDelta
from pyro.optim import Adam

device = torch.device("cuda:0")
obs = torch.tensor(obs)
obs = obs.to(device)
torch.set_default_tensor_type("torch.cuda.FloatTensor")
guide = AutoDelta(poutine.block(model, expose_fn = lambda msg : msg["name"].startswith("p_")))
elbo = Trace_ELBO(max_plate_nesting = 1)
optim = Adam({"lr": 0.001})
svi = SVI(model, guide, optim, elbo)
As training goes on, the ELBO decreases steadily. However, the three state means converge to the same value.
I have tried putting the for loop of my model into a pyro.plate and switching pyro.param to pyro.sample and vice versa, but nothing worked for my model.
I have not tried this model, but I think it should be possible to solve the problem by modifying the model in the following way:
def model(observations, num_state):
    assert not torch._C._get_tracing_state()
    with poutine.mask(mask = True):
        p_transition = pyro.sample("p_transition",
                                   dist.Dirichlet((1 / num_state) * torch.ones(num_state, num_state)).to_event(1))
        p_init = pyro.sample("p_init",
                             dist.Dirichlet((1 / num_state) * torch.ones(num_state)))
        p_mu = pyro.sample("p_mu",
                           dist.Normal(torch.zeros(num_state), torch.ones(num_state)).to_event(1))
        # HalfCauchy needs a positive scale; torch.zeros(num_state) would give a degenerate scale of 0
        p_tau = pyro.sample("p_tau",
                            dist.HalfCauchy(torch.ones(num_state)).to_event(1))
        current_state = pyro.sample("x_0",
                                    dist.Categorical(p_init),
                                    infer = {"enumerate" : "parallel"})
        for t in pyro.markov(range(1, len(observations))):
            current_state = pyro.sample("x_{}".format(t),
                                        dist.Categorical(Vindex(p_transition)[current_state, :]),
                                        infer = {"enumerate" : "parallel"})
            pyro.sample("y_{}".format(t),
                        dist.Normal(Vindex(p_mu)[current_state], Vindex(p_tau)[current_state]),
                        obs = observations[t])
The model would then be trained using MCMC:
# MCMC (imports assumed: from pyro.infer import MCMC, NUTS)
hmc_kernel = NUTS(model, target_accept_prob = 0.9, max_tree_depth = 7)
mcmc = MCMC(hmc_kernel, num_samples = 1000, warmup_steps = 100, num_chains = 1)
mcmc.run(obs, 3)  # model(observations, num_state); this example has 3 states
The results could then be analysed using:
mcmc.get_samples()
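For example, the inferred emission parameters could be summarised from the posterior draws; a minimal sketch (assuming the site names "p_mu" and "p_tau" from the model above):
samples = mcmc.get_samples()
print(samples["p_mu"].mean(dim=0))   # posterior means of the three state means
print(samples["p_tau"].mean(dim=0))  # posterior means of the three state scales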

Training a random forest (ranger) using caret with a custom F4 metric in R runs, but after the full run an error shows "undefined columns selected"

library(MLmetrics)
library(caret)
library(doSNOW)
library(ranger)
The data is the "bank additional" full dataset (link in the original post); the following code then generates data1:
library(VIM)
data1<-hotdeck(data,variable=c('job','marital','education','default','housing','loan'),domain_var = "y",imp_var=FALSE)
#converting the categorical variables to factors as they should be
library(magrittr)
library(dplyr)  # mutate_at() comes from dplyr
data1 %<>%
    mutate_at(colnames(data1)[grepl('factor|logical|character', sapply(data1, class))], factor)
Now, splitting
library(caret)
#splitting data into train/test 70/30
set.seed(1234)
trainIndex<-createDataPartition(data1$y,p=0.7,times = 1,list = F)
train<-data1[trainIndex,-11]
test<-data1[-trainIndex,-11]
levels(train$y)
train$y = as.factor(train$y)
# train$y = factor(train$y,levels = c("yes","no"))
# train$y = relevel(train$y,ref="yes")
Here, I got an idea of how to create an F1 metric from Training Model in Caret Using F1 Metric,
and using the F-beta score formula I created f1_val. Now I can't understand what lev, obs and pred are indicating. In my train dataset there is only the column y, which would correspond to data$obs, but there is no data$pred. So, is the following error due to this, and how do I rectify it?
f1 <- function(data, lev = NULL, model = NULL) {
    # F-beta with beta = 4: (1 + 4^2) * precision * recall / (4^2 * precision + recall)
    precision <- precision(data$obs, data$pred)
    recall <- sensitivity(data$obs, data$pred)
    f1_val <- (17 * precision * recall) / (16 * precision + recall)
    names(f1_val) <- c("F1")
    f1_val
}
tgrid <- expand.grid(
.mtry = 1:5,
.splitrule = "gini",
.min.node.size = seq(1,500,75)
)
model_caret <- train(train$y~., data = train,
method = "ranger",
trControl = trainControl(method="cv",
number = 2,
verboseIter = T,
classProbs = T,
summaryFunction = f1),
tuneGrid = tgrid,
num.trees = 500,
importance = "impurity",
metric = "F1")
After running for 3-4 minutes we get the following:
Aggregating results
Selecting tuning parameters
Fitting mtry = 5, splitrule = gini, min.node.size = 1 on full training set
but then this error:
Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) :
undefined columns selected
Also, when running model_caret we get:
Error: object 'model_caret' not found
Kindly help. Thanks in advance.
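For reference: inside a caret summary function, data is a data frame that caret builds during resampling, with columns obs (the observed outcomes) and pred (the resample predictions), plus one probability column per class when classProbs = TRUE; lev holds the outcome's factor levels and model the method name. So data$pred is not expected to be a column of your training set. The "undefined columns selected" error most likely comes from the formula train$y ~ . instead: all.vars() on that formula includes "train", which is not a column of data = train; and because train() fails, model_caret is never created, hence "object 'model_caret' not found". A sketch of the conventional call (untested on this dataset):
model_caret <- train(y ~ ., data = train,  # bare column name, not train$y
                     method = "ranger",
                     trControl = trainControl(method = "cv", number = 2,
                                              verboseIter = TRUE, classProbs = TRUE,
                                              summaryFunction = f1),
                     tuneGrid = tgrid, num.trees = 500,
                     importance = "impurity", metric = "F1")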

How to load Seurat Object into WGCNA Tutorial Format

As far as I can find, there is only one tutorial about loading Seurat objects into WGCNA (https://ucdavis-bioinformatics-training.github.io/2019-single-cell-RNA-sequencing-Workshop-UCD_UCSF/scrnaseq_analysis/scRNA_Workshop-PART6.html). I am really new to programming so it's probably just my inexperience, but I am not sure how to load my Seurat object into a format that works with WGCNA's tutorials (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/).
Here is what I have tried thus far:
This tries to replicate datExpr and datTraits from part I.1:
library(WGCNA)
library(Seurat)
#example Seurat object -----------------------------------------------
ERlist <- list(c("CPB1", "RP11-53O19.1", "TFF1", "MB", "ANKRD30B",
"LINC00173", "DSCAM-AS1", "IGHG1", "SERPINA5", "ESR1",
"ILRP2", "IGLC3", "CA12", "RP11-64B16.2", "SLC7A2",
"AFF3", "IGFBP4", "GSTM3", "ANKRD30A", "GSTT1", "GSTM1",
"AC026806.2", "C19ORF33", "STC2", "HSPB8", "RPL29P11",
"FBP1", "AGR3", "TCEAL1", "CYP4B1", "SYT1", "COX6C",
"MT1E", "SYTL2", "THSD4", "IFI6", "K1AA1467", "SLC39A6",
"ABCD3", "SERPINA3", "DEGS2", "ERLIN2", "HEBP1", "BCL2",
"TCEAL3", "PPT1", "SLC7A8", "RP11-96D1.10", "H4C8",
"PI15", "PLPP5", "PLAAT4", "GALNT6", "IL6ST", "MYC",
"BST2", "RP11-658F2.8", "MRPS30", "MAPT", "AMFR", "TCEAL4",
"MED13L", "ISG15", "NDUFC2", "TIMP3", "RP13-39P12.3", "PARD68"))
tnbclist <- list(c("FABP7", "TSPAN8", "CYP4Z1", "HOXA10", "CLDN1",
"TMSB15A", "C10ORF10", "TRPV6", "HOXA9", "ATP13A4",
"GLYATL2", "RP11-48O20.4", "DYRK3", "MUCL1", "ID4", "FGFR2",
"SHOX2", "Z83851.1", "CD82", "COL6A1", "KRT23", "GCHFR",
"PRICKLE1", "GCNT2", "KHDRBS3", "SIPA1L2", "LMO4", "TFAP2B",
"SLC43A3", "FURIN", "ELF5", "C1ORF116", "ADD3", "EFNA3",
"EFCAB4A", "LTF", "LRRC31", "ARL4C", "GPNMB", "VIM",
"SDR16C5", "RHOV", "PXDC1", "MALL", "YAP1", "A2ML1",
"RP1-257A7.5", "RP11-353N4.6", "ZBTB18", "CTD-2314B22.3", "GALNT3",
"BCL11A", "CXADR", "SSFA2", "ADM", "GUCY1A3", "GSTP1",
"ADCK3", "SLC25A37", "SFRP1", "PRNP", "DEGS1", "RP11-110G21.2",
"AL589743.1", "ATF3", "SIVA1", "TACSTD2", "HEBP2"))
genes = c(unlist(c(ERlist,tnbclist)))
mat = matrix(rnbinom(500*length(genes),mu=500,size=1),ncol=500)
rownames(mat) = genes
colnames(mat) = paste0("cell",1:500)
sobj = CreateSeuratObject(mat)
sobj = NormalizeData(sobj)
sobj$ClusterName = factor(sample(0:1,ncol(sobj),replace=TRUE))
sobj$Patient = paste0("Patient", 1:500)
sobj = AddModuleScore(object = sobj, features = tnbclist,
name = "TNBC_List",ctrl=5)
sobj = AddModuleScore(object = sobj, features = ERlist,
name = "ER_List",ctrl=5)
#WGCNA -----------------------------------------------------------------
sobjwgcna <- sobj
sobjwgcna <- FindVariableFeatures(sobjwgcna, selection.method = "vst", nfeatures = 2000,
verbose = FALSE, assay = "RNA")
options(stringsAsFactors = F)
sobjwgcnamat <- GetAssayData(sobjwgcna)
datExpr <- t(sobjwgcnamat)[,VariableFeatures(sobjwgcna)]
datTraits <- sobjwgcna@meta.data
datTraits = subset(datTraits, select = -c(nCount_RNA, nFeature_RNA))
I then copy-paste the code as written in the WGCNA I.2a tutorial (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-02-networkConstr-auto.pdf), and that all works until I get to this line in the I.3 tutorial (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-03-relateModsToExt.pdf):
MEList = moduleEigengenes(datExpr, colors = moduleColors)
Error in t.default(expr[, restrict1]) : argument is not a matrix
I tried converting both moduleColors and datExpr into a matrix with as.matrix(), but the error still persists.
Hopefully this makes sense, and thanks for reading!
So doing as.matrix(datExpr) right after datExpr <- t(sobjwgcnamat)[,VariableFeatures(sobjwgcna)] worked. I had been trying it right before MEList = moduleEigengenes(datExpr, colors = moduleColors), and that didn't work. It seems simple, but order matters, I guess.
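In code, the working order looks like this (just restating the fix described above):
sobjwgcnamat <- GetAssayData(sobjwgcna)
datExpr <- t(sobjwgcnamat)[, VariableFeatures(sobjwgcna)]
datExpr <- as.matrix(datExpr)  # coerce right away, before running the WGCNA tutorial steps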

gap characters corresponding to the missing residues in a protein sequence using biopython

I have extracted the sequence from a PDB file without the missing residues, but I need a sequence that has gaps (dash characters) replacing the missing residues. How can I do it?
Thank you.
I did it manually, by parsing the ATOM records of the PDB file to get the existing residues and the REMARK 465 records to get the missing residues:
pname = '4g5j.pdb'  # downloaded file from PDB
fin = open(pname, 'r')
content = fin.readlines()
fin.close()
res = []
mis_res = []
print('CHAIN A will be used for ALIGNMENT')
het_chain = 'A'
for i, line in enumerate(content):
    if line[0:4] == 'ATOM':
        split = [line[:6], line[6:11], line[12:16], line[17:20], line[21], line[22:26], line[30:38], line[38:46], line[46:54]]
        if split[4] != het_chain:
            continue
        res.append(int(split[5]))
for i, line in enumerate(content):
    if line[0:10] == 'REMARK 465':
        split = [line[:10], line[19], line[21:26]]
        if split[1] == het_chain:
            mis_res.append(int(split[2]))
resindexes = sorted(list(set(res)))
missed_resindexes = sorted(list(set(mis_res)))
missed_resindexes = [el for el in missed_resindexes if el not in resindexes]
all_indexes = sorted(resindexes + missed_resindexes)
print(len(all_indexes))
# here you should have your real sequence!
real_seq = 'GSMGEAPNQALLRILKETEFKKIKVLGSGAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEYLIPQQG'
missed_seq = real_seq[:]
for i, el in enumerate(real_seq):
    if all_indexes[i] in missed_resindexes:
        print(i)
        missed_seq = missed_seq[:i] + '-' + missed_seq[i+1:]
print(missed_seq)
OUTPUT:
---GEAPNQALLRILKETEFKKIKVLGS----TVYKGLWIPEGEKVKIPVAIKE----------KEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEY------
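With Biopython, the same gapped sequence can also be built without hard-coding real_seq, by reading residue names straight from the ATOM records. A sketch (it reuses content, het_chain and all_indexes from above; Bio.SeqUtils.seq1 maps three-letter residue codes to one-letter codes):
from Bio.SeqUtils import seq1  # three-letter -> one-letter residue codes

resname_by_index = {}
for line in content:
    if line[0:4] == 'ATOM' and line[21] == het_chain:
        resname_by_index[int(line[22:26])] = line[17:20].strip()
gapped = ''.join(seq1(resname_by_index[i]) if i in resname_by_index else '-'
                 for i in all_indexes)
print(gapped)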

Caret - Setting the seeds inside the gafsControl()

I am trying to set the seeds inside the caret's gafsControl(), but I am getting this error:
Error in { : task 1 failed - "supplied seed is not a valid integer"
I understand that seeds for trainControl() is a list whose length equals the number of resamples plus one, with each (resample) entry holding one seed per combination of the model's tuning parameters (in my case 36: an SVM with 6 sigma and 6 cost values). However, I couldn't figure out what I should use for gafsControl(). I've tried iters*popSize (100*10), iters (100), and popSize (10), but none has worked.
Thanks in advance.
here is my code (with simulated data):
library(caret)
library(doMC)
library(kernlab)
registerDoMC(cores=32)
set.seed(1234)
train.set <- twoClassSim(300, noiseVars = 100, corrVar = 100, corrValue = 0.75)
mylogGA <- caretGA
mylogGA$fitness_extern <- mnLogLoss
#Index for gafsControl
set.seed(1045481)
ga_index <- createFolds(train.set$Class, k=3)
#Seed for the gafsControl()
set.seed(1056)
ga_seeds <- vector(mode = "list", length = 4)
for(i in 1:3) ga_seeds[[i]] <- sample.int(1500, 1000)
## For the last model:
ga_seeds[[4]] <- sample.int(1000, 1)
#Index for the trainControl()
set.seed(1045481)
tr_index <- createFolds(train.set$Class, k=5)
#Seeds for the trainControl()
set.seed(1056)
tr_seeds <- vector(mode = "list", length = 6)
for(i in 1:5) tr_seeds[[i]] <- sample.int(1000, 36)#
## For the last model:
tr_seeds[[6]] <- sample.int(1000, 1)
gaCtrl <- gafsControl(functions = mylogGA,
method = "cv",
number = 3,
metric = c(internal = "logLoss",
external = "logLoss"),
verbose = TRUE,
maximize = c(internal = FALSE,
external = FALSE),
index = ga_index,
seeds = ga_seeds,
allowParallel = TRUE)
tCtrl = trainControl(method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = mnLogLoss,
index = tr_index,
seeds = tr_seeds,
allowParallel = FALSE)
svmGrid <- expand.grid(sigma= 2^c(-25, -20, -15,-10, -5, 0), C= 2^c(0:5))
t1 <- Sys.time()
set.seed(1234235)
svmFuser.gafs <- gafs(x = train.set[, names(train.set) != "Class"],
y = train.set$Class,
gafsControl = gaCtrl,
trControl = tCtrl,
popSize = 10,
iters = 100,
method = "svmRadial",
preProc = c("center", "scale"),
tuneGrid = svmGrid,
metric="logLoss",
maximize = FALSE)
t2<- Sys.time()
svmFuser.gafs.time<-difftime(t2,t1)
save(svmFuser.gafs, file ="svmFuser.gafs.rda")
save(svmFuser.gafs.time, file ="svmFuser.gafs.time.rda")
Session Info:
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8
[4] LC_COLLATE=en_CA.UTF-8 LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] kernlab_0.9-22 doMC_1.3.3 iterators_1.0.7 foreach_1.4.2 caret_6.0-52 ggplot2_1.0.1 lattice_0.20-33
loaded via a namespace (and not attached):
[1] Rcpp_0.12.0 magrittr_1.5 splines_3.2.2 MASS_7.3-43 munsell_0.4.2
[6] colorspace_1.2-6 foreach_1.4.2 minqa_1.2.4 car_2.0-26 stringr_1.0.0
[11] plyr_1.8.3 tools_3.2.2 parallel_3.2.2 pbkrtest_0.4-2 nnet_7.3-10
[16] grid_3.2.2 gtable_0.1.2 nlme_3.1-122 mgcv_1.8-7 quantreg_5.18
[21] MatrixModels_0.4-1 iterators_1.0.7 gtools_3.5.0 lme4_1.1-9 digest_0.6.8
[26] Matrix_1.2-2 nloptr_1.0.4 reshape2_1.4.1 codetools_0.2-11 stringi_0.5-5
[31] compiler_3.2.2 BradleyTerry2_1.0-6 scales_0.3.0 stats4_3.2.2 SparseM_1.7
[36] brglm_0.5-9 proto_0.3-10
>
I am not so familiar with the gafsControl() function that you mention, but I encountered a very similar issue when setting parallel seeds using trainControl(). The instructions describe how to create a list (length = number of resamples + 1) where each item is a list (length = number of parameter combinations to test). I find that doing that does not work (see topepo/caret issue #248 for info). However, if you then turn each item into a vector, e.g.
seeds <- lapply(seeds, as.vector)
then the seeds seem to work (i.e. models and predictions are entirely reproducible). I should clarify that this is using doMC as the backend; it may be different for other parallel backends.
Hope this helps.
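For concreteness, a sketch of that workaround with the dimensions from the question (5-fold CV, 36 tuning combinations):
set.seed(1056)
tr_seeds <- vector(mode = "list", length = 6)         # resamples + 1
for (i in 1:5) tr_seeds[[i]] <- sample.int(1000, 36)  # one seed per tuning combination
tr_seeds[[6]] <- sample.int(1000, 1)                  # for the final model
tr_seeds <- lapply(tr_seeds, as.vector)               # the workaround described above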
I was able to figure out my mistake by inspecting gafs.default. The seeds argument of gafsControl() takes a vector of length (n_repeats * nresampling) + 1, not a list (as in trainControl$seeds). The documentation (?gafsControl) actually states that seeds is a vector of integers that can be used to set the seed during each search, and that the number of seeds must be equal to the number of resamples plus one. I figured it out the hard way; this is a reminder to carefully read the documentation :D.
if (!is.null(gafsControl$seeds)) {
    if (length(gafsControl$seeds) < length(gafsControl$index) + 1)
        stop(paste("There must be at least", length(gafsControl$index) + 1,
                   "random number seeds passed to gafsControl"))
} else {
    gafsControl$seeds <- sample.int(1e+05, length(gafsControl$index) + 1)
}
So, the proper way to set my ga_seeds is:
#Index for gafsControl
set.seed(1045481)
ga_index <- createFolds(train.set$Class, k=3)
#Seed for the gafsControl()
set.seed(1056)
ga_seeds <- sample.int(1500, 4)
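A quick sanity check of the length requirement (k = 3 folds, so length(ga_index) + 1 = 4):
length(ga_index) + 1  # 4 seeds required
length(ga_seeds)      # 4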
If you set the seeds that way, can you ensure that the same feature subset is selected on each run? I am asking because of the randomness of the GA.
