Rewriting ParamSet ids from mlr3::paradox() - mlr3

Let's say I have the following ParamSet object:
my_ps = paradox::ps(
minsplit = p_int(1, 64, logscale = TRUE),
cp = p_dbl(1e-04, 1, logscale = TRUE))
Is it possible to rename minsplit to survTree.minsplit without changing anything else?
The reason for this is that I use some learners as part of a GraphLearner and so their parameters names changed and I would like to have some code that adds the learner$id in front the parameters to use later for tuning (rather than rewriting them from scratch with the new names)

I think I have a partial solution here. It is only partial, because it does not support the transformation.
Where it works:
library(paradox)
my_ps = paradox::ps(
minsplit = p_int(1, 64),
cp = p_dbl(1e-04, 1)
)
my_ps$set_id = "john"
my_psc = ParamSetCollection$new(list(my_ps))
print(my_psc)
#> <ParamSetCollection>
#> id class lower upper nlevels default value
#> 1: john.minsplit ParamInt 1e+00 64 64 <NoDefault[3]>
#> 2: john.cp ParamDbl 1e-04 1 Inf <NoDefault[3]>
Created on 2022-12-07 by the reprex package (v2.0.1)
Where it does not:
library(paradox)
my_ps = paradox::ps(
minsplit = p_int(1, 64, logscale = TRUE),
cp = p_dbl(1e-04, 1)
)
my_ps$set_id = "john"
my_psc = ParamSetCollection$new(list(my_ps))
#> Error in .__ParamSetCollection__initialize(self = self, private = private, : Building a collection out sets, where a ParamSet has a trafo is currently unsupported!
Created on 2022-12-07 by the reprex package (v2.0.1)
The underlying problem is that we did not solve the problem of how to reconcile the parameter transformations of individual ParamSets and a possible parameter transformation of the ParamSetCollection
I fear that there is currently no neat solution for your problem.

Sorry I can not comment yet, this is not exactly the solution you are looking for but I hope this will fix the problem you are having.
You can set the param_space in the learner, before putting it in the graph, i.e. sticking with your search space. After you create the GraphLearner regularly it will have the desired search space.
A concrete example:
library(mlr3verse)
learner = lrn("regr.rpart", cp = to_tune(0.1, 0.2))
glrn = as_learner(po("pca") %>>% po("learner", learner))
at = auto_tuner(
"random_search",
glrn,
rsmp("holdout"),
term_evals = 10
)
task = tsk("mtcars")
at$train(task)

Related

How to load Seurat Object into WGCNA Tutorial Format

As far as I can find, there is only one tutorial about loading Seurat objects into WGCNA (https://ucdavis-bioinformatics-training.github.io/2019-single-cell-RNA-sequencing-Workshop-UCD_UCSF/scrnaseq_analysis/scRNA_Workshop-PART6.html). I am really new to programming so it's probably just my inexperience, but I am not sure how to load my Seurat object into a format that works with WGCNA's tutorials (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/).
Here is what I have tried thus far:
This tries to replicate datExpr and datTraits from part I.1:
library(WGCNA)
library(Seurat)
#example Seurat object -----------------------------------------------
ERlist <- list(c("CPB1", "RP11-53O19.1", "TFF1", "MB", "ANKRD30B",
"LINC00173", "DSCAM-AS1", "IGHG1", "SERPINA5", "ESR1",
"ILRP2", "IGLC3", "CA12", "RP11-64B16.2", "SLC7A2",
"AFF3", "IGFBP4", "GSTM3", "ANKRD30A", "GSTT1", "GSTM1",
"AC026806.2", "C19ORF33", "STC2", "HSPB8", "RPL29P11",
"FBP1", "AGR3", "TCEAL1", "CYP4B1", "SYT1", "COX6C",
"MT1E", "SYTL2", "THSD4", "IFI6", "K1AA1467", "SLC39A6",
"ABCD3", "SERPINA3", "DEGS2", "ERLIN2", "HEBP1", "BCL2",
"TCEAL3", "PPT1", "SLC7A8", "RP11-96D1.10", "H4C8",
"PI15", "PLPP5", "PLAAT4", "GALNT6", "IL6ST", "MYC",
"BST2", "RP11-658F2.8", "MRPS30", "MAPT", "AMFR", "TCEAL4",
"MED13L", "ISG15", "NDUFC2", "TIMP3", "RP13-39P12.3", "PARD68"))
tnbclist <- list(c("FABP7", "TSPAN8", "CYP4Z1", "HOXA10", "CLDN1",
"TMSB15A", "C10ORF10", "TRPV6", "HOXA9", "ATP13A4",
"GLYATL2", "RP11-48O20.4", "DYRK3", "MUCL1", "ID4", "FGFR2",
"SHOX2", "Z83851.1", "CD82", "COL6A1", "KRT23", "GCHFR",
"PRICKLE1", "GCNT2", "KHDRBS3", "SIPA1L2", "LMO4", "TFAP2B",
"SLC43A3", "FURIN", "ELF5", "C1ORF116", "ADD3", "EFNA3",
"EFCAB4A", "LTF", "LRRC31", "ARL4C", "GPNMB", "VIM",
"SDR16C5", "RHOV", "PXDC1", "MALL", "YAP1", "A2ML1",
"RP1-257A7.5", "RP11-353N4.6", "ZBTB18", "CTD-2314B22.3", "GALNT3",
"BCL11A", "CXADR", "SSFA2", "ADM", "GUCY1A3", "GSTP1",
"ADCK3", "SLC25A37", "SFRP1", "PRNP", "DEGS1", "RP11-110G21.2",
"AL589743.1", "ATF3", "SIVA1", "TACSTD2", "HEBP2"))
genes = c(unlist(c(ERlist,tnbclist)))
mat = matrix(rnbinom(500*length(genes),mu=500,size=1),ncol=500)
rownames(mat) = genes
colnames(mat) = paste0("cell",1:500)
sobj = CreateSeuratObject(mat)
sobj = NormalizeData(sobj)
sobj$ClusterName = factor(sample(0:1,ncol(sobj),replace=TRUE))
sobj$Patient = paste0("Patient", 1:500)
sobj = AddModuleScore(object = sobj, features = tnbclist,
name = "TNBC_List",ctrl=5)
sobj = AddModuleScore(object = sobj, features = ERlist,
name = "ER_List",ctrl=5)
#WGCNA -----------------------------------------------------------------
sobjwgcna <- sobj
sobjwgcna <- FindVariableFeatures(sobjwgcna, selection.method = "vst", nfeatures = 2000,
verbose = FALSE, assay = "RNA")
options(stringsAsFactors = F)
sobjwgcnamat <- GetAssayData(sobjwgcna)
datExpr <- t(sobjwgcnamat)[,VariableFeatures(sobjwgcna)]
datTraits <- sobjwgcna#meta.data
datTraits = subset(datTraits, select = -c(nCount_RNA, nFeature_RNA))
I then copy-paste the code as written in the WGCNA I.2a tutorial (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-02-networkConstr-auto.pdf), and that all works until I get to this line in the I.3 tutorial (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-03-relateModsToExt.pdf):
MEList = moduleEigengenes(datExpr, colors = moduleColors)
Error in t.default(expr[, restrict1]) : argument is not a matrix
I tried converting both moduleColors and datExpr into a matrix with as.matrix(), but the error still persists.
Hopefully this makes sense, and thanks for reading!
So doing as.matrix(datExpr) right after datExpr <- t(sobjwgcnamat)[,VariableFeatures(sobjwgcna)] worked. I had been trying it right before MEList = moduleEigengenes(datExpr, colors = moduleColors)
and that didn't work. Seems simple but order matters I guess.

How to perform up-sampling using sample() function(py-spark)

I am working on a Binary Classification Machine Learning Problem and I am trying to balance the training set as I have an imbalanced target class variable. I am using Py-Spark for building the model.
Below is the code which is working to balance the data
train_initial, test = new_data.randomSplit([0.7, 0.3], seed = 2018)
train_initial.groupby('label').count().toPandas()
label count
0 0.0 712980
1 1.0 2926
train_new = train_initial.sampleBy('label', fractions={0: 2926./712980, 1: 1.0}).cache()
The above code performs under-sampling, but I think this might lead to loss of information. However, I am not sure how to perform upsampling. I also tried to use sample function as below:
train_up = train_initial.sample(True, 10.0, seed = 2018)
Although, it increases the count of 1 in my data set, it also increases the count of 0 and gives the below result.
label count
0 0.0 7128722
1 1.0 29024
Can someone please help me to achieve up-sampling in py-spark.
Thanks a lot in Advance!!
The problem is that you are oversampling the whole data frame. You should filter the data from the two classes
df_class_0 = df_train[df_train['label'] == 0]
df_class_1 = df_train[df_train['label'] == 1]
df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)
the example comes from : https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Please note that there are better way to perform oversampling (e.g. SMOTE)
For anyone trying to do random oversampling on a imbalanced dataset in pyspark. The following code will get you started (in this snippet 0 is the mayority class , and 1 is the class to be oversampled):
df_a = df.filter(df['label'] == 0)
df_b = df.filter(df['label'] == 1)
a_count = df_a.count()
b_count = df_b.count()
ratio = a_count / b_count
df_b_overampled = df_b.sample(withReplacement=True, fraction=ratio, seed=1)
df = df_a.unionAll(df_b_oversampled)
I might be quite late to the rescue here. But this is what I would recommend:
Step 1. Sample only for label = 1
train_1= train_initial.where(col('label')==1).sample(True, 10.0, seed = 2018)
step 2. Merge this data with label = 0 data
train_0=train_initial.where(col('label')==0)
train_final = train_0.union(train_1)
PS: please import the col with
from pyspark.sql.functions import col

Learning2Search (vowpal-wabbit) for NER gives weird results

We are trying to use Learning2Search from vowpal-wabbit for NER
We are using ATIS dataset.
In ATIS there are 127 Entities (including Others category)
Training set has 4978 and test has 893 sentences.
How ever when we run it on test set it is mapping everything either class 1(Airline name) or class 2(Airport code)
Which is wired.
We tried another dataset (https://github.com/glample/tagger/tree/master/dataset), same behavior.
Looks like I am not using it the right way. Any pointers will be of great help.
Code snippet :
with open("/tweetsdb/ner/datasets/atis.pkl") as f:
train, test, dicts = cPickle.load(f)
idx2words = {v: k for k, v in dicts['words2idx'].iteritems()}
idx2labels = {v: k for k, v in dicts['labels2idx'].iteritems()}
idx2tables = {v: k for k, v in dicts['tables2idx'].iteritems()}
#Convert the dataset into a format compatible with Vowpal Wabbit
training_set = []
for i in xrange(len(train[0])):
zip_label_ent_idx = zip(train[2][i], train[0][i])
label_ent_actual = [(int(i[0]), idx2words[i[1]]) for i in zip_label_ent_idx]
training_set.append(label_ent_actual)
# Do like wise to get test chunk
class SequenceLabeler(pyvw.SearchTask):
def __init__(self, vw, sch, num_actions):
pyvw.SearchTask.__init__(self, vw, sch, num_actions)
sch.set_options( sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES )
def _run(self, sentence):
output = []
for n in range(len(sentence)):
pos,word = sentence[n]
with self.vw.example({'w': [word]}) as ex:
pred = self.sch.predict(examples=ex, my_tag=n+1, oracle=pos, condition=[(n,'p'), (n-1, 'q')])
output.append(pred)
return output
vw = pyvw.vw("--search 3 --search_task hook --ring_size 1024")
Code for training the model:
#Training
sequenceLabeler = vw.init_search_task(SequenceLabeler)
for i in xrange(3):
sequenceLabeler.learn(training_set[:10])
Code for Prediction:
pred = []
for i in random.sample(xrange(len(test_set)), 10):
test_example = [ (999, word[1]) for word in test_set[i] ]
test_labels = [ label[0] for label in test_set[i] ]
print 'input sentence:', ' '.join([word[1] for word in test_set[i]])
print 'actual labels:', ' '.join([str(label) for label in test_labels])
print 'predicted labels:', ' '.join([str(pred) for pred in sequenceLabeler.predict(test_example)])
To see the full code, pls refer to this notebook:
https://github.com/nsanthanam/ner/blob/master/vowpal_wabbit_atis.ipynb
I am also new to this algorithm, but did some pilot studies recently.
To your problem, the answer is that you set a wrong parameter in
vw = pyvw.vw("--search 3 --search_task hook --ring_size 1024")
Here, the search should be set as '127', and in this way, vw will use your 127 tags.
vw = pyvw.vw("--search 127 --search_task hook --ring_size 1024")
Also, my feeling is that vw doesn't work really well with so many tags. I might be wrong, please let me know your result :)

Caret - Setting the seeds inside the gafsControl()

I am trying to set the seeds inside the caret's gafsControl(), but I am getting this error:
Error in { : task 1 failed - "supplied seed is not a valid integer"
I understand that seeds for trainControl() is a vector equal to the number of resamples plus one, with the number of combinations of models's tuning parameters (in my case 36, SVM with 6 Sigma and 6 Cost values) in each (resamples) entries. However, I couldn't figure out what I should use for gafsControl(). I've tried iters*popSize (100*10), iters (100), popSize (10), but none has worked.
Thanks in advance.
here is my code (with simulated data):
library(caret)
library(doMC)
library(kernlab)
registerDoMC(cores=32)
set.seed(1234)
train.set <- twoClassSim(300, noiseVars = 100, corrVar = 100, corrValue = 0.75)
mylogGA <- caretGA
mylogGA$fitness_extern <- mnLogLoss
#Index for gafsControl
set.seed(1045481)
ga_index <- createFolds(train.set$Class, k=3)
#Seed for the gafsControl()
set.seed(1056)
ga_seeds <- vector(mode = "list", length = 4)
for(i in 1:3) ga_seeds[[i]] <- sample.int(1500, 1000)
## For the last model:
ga_seeds[[4]] <- sample.int(1000, 1)
#Index for the trainControl()
set.seed(1045481)
tr_index <- createFolds(train.set$Class, k=5)
#Seeds for the trainControl()
set.seed(1056)
tr_seeds <- vector(mode = "list", length = 6)
for(i in 1:5) tr_seeds[[i]] <- sample.int(1000, 36)#
## For the last model:
tr_seeds[[6]] <- sample.int(1000, 1)
gaCtrl <- gafsControl(functions = mylogGA,
method = "cv",
number = 3,
metric = c(internal = "logLoss",
external = "logLoss"),
verbose = TRUE,
maximize = c(internal = FALSE,
external = FALSE),
index = ga_index,
seeds = ga_seeds,
allowParallel = TRUE)
tCtrl = trainControl(method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = mnLogLoss,
index = tr_index,
seeds = tr_seeds,
allowParallel = FALSE)
svmGrid <- expand.grid(sigma= 2^c(-25, -20, -15,-10, -5, 0), C= 2^c(0:5))
t1 <- Sys.time()
set.seed(1234235)
svmFuser.gafs <- gafs(x = train.set[, names(train.set) != "Class"],
y = train.set$Class,
gafsControl = gaCtrl,
trControl = tCtrl,
popSize = 10,
iters = 100,
method = "svmRadial",
preProc = c("center", "scale"),
tuneGrid = svmGrid,
metric="logLoss",
maximize = FALSE)
t2<- Sys.time()
svmFuser.gafs.time<-difftime(t2,t1)
save(svmFuser.gafs, file ="svmFuser.gafs.rda")
save(svmFuser.gafs.time, file ="svmFuser.gafs.time.rda")
Session Info:
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8
[4] LC_COLLATE=en_CA.UTF-8 LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] kernlab_0.9-22 doMC_1.3.3 iterators_1.0.7 foreach_1.4.2 caret_6.0-52 ggplot2_1.0.1 lattice_0.20-33
loaded via a namespace (and not attached):
[1] Rcpp_0.12.0 magrittr_1.5 splines_3.2.2 MASS_7.3-43 munsell_0.4.2
[6] colorspace_1.2-6 foreach_1.4.2 minqa_1.2.4 car_2.0-26 stringr_1.0.0
[11] plyr_1.8.3 tools_3.2.2 parallel_3.2.2 pbkrtest_0.4-2 nnet_7.3-10
[16] grid_3.2.2 gtable_0.1.2 nlme_3.1-122 mgcv_1.8-7 quantreg_5.18
[21] MatrixModels_0.4-1 iterators_1.0.7 gtools_3.5.0 lme4_1.1-9 digest_0.6.8
[26] Matrix_1.2-2 nloptr_1.0.4 reshape2_1.4.1 codetools_0.2-11 stringi_0.5-5
[31] compiler_3.2.2 BradleyTerry2_1.0-6 scales_0.3.0 stats4_3.2.2 SparseM_1.7
[36] brglm_0.5-9 proto_0.3-10
>
I am not so familiar with the gafsControl() function that you mention, but I encountered a very similar issue when setting parallel seeds using trainControl(). In the instructions, it describes how to create a list (length = number of resamples + 1), where each item is a list (length = number of parameter combinations to test). I find that doing that does not work (see topepo/caret issue #248 for info). However, if you then turn each item into a vector, e.g.
seeds <- lapply(seeds, as.vector)
then the seeds seem to work (i.e. models and predictions are entirely reproducible). I should clarify that this is using doMC as the backend. It may be different for other parallel backends.
Hope this helps
I was able to figure out my mistake by inspecting gafs.default. The seeds inside gafsControl() takes a vector with length (n_repeats*nresampling)+1 and not a list (as in trainControl$seeds). It is actually stated in the documentation of ?gafsControl that seeds is a vector or integers that can be used to set the seed during each search. The number of seeds must be equal to the number of resamples plus one. I figured it out the hard way, this is a reminder to carefully read the documentation :D.
if (!is.null(gafsControl$seeds)) {
if (length(gafsControl$seeds) < length(gafsControl$index) +
1)
stop(paste("There must be at least", length(gafsControl$index) +
1, "random number seeds passed to gafsControl"))
}
else {
gafsControl$seeds <- sample.int(1e+05, length(gafsControl$index) +
1)
}
So, the proper way to set my ga_seeds is:
#Index for gafsControl
set.seed(1045481)
ga_index <- createFolds(train.set$Class, k=3)
#Seed for the gafsControl()
set.seed(1056)
ga_seeds <- sample.int(1500, 4)
If that way settings seeds you can ensure each run the same feature subset is selected ? I ams asking due randominess of GA

Pc-Stable from pcalg

I am using the pc-stable from the package ‘pcalg’ version 2.0-10 to learn the structure . what I understand this algorithm does not effect the the order of the input data because it is order_independent. when I run it with different order ,I got different graph. can any one help me with this issue and this is my code.
library(pracma)
randindexMatriax <- matrix(0,10,ncol(TrainData))
numberUnique_val_col = vector()
pdf("Graph for Test PC Stable with random order.pdf")
par(mfrow=c(2,1))
for (i in 1:10)
{
randindex<-randperm(1:ncol(TrainData))
randindexMatriax[i,]<-randindex
TrainDataRandOrder<-data[,randindex]
V <- colnames( TrainDataRandOrder)
UD <-data.frame(TrainDataRandOrder)
numberUnique_val_col= sapply(UD,function(x)length(unique(x)))
suffStat <- list(dm = TrainDataRandOrder,nlev = c(numberUnique_val_col[1],numberUnique_val_col[2], numberUnique_val_col[3],numberUnique_val_col[4],
numberUnique_val_col[5],numberUnique_val_col[6], numberUnique_val_col[7],
numberUnique_val_col[8],numberUnique_val_col[9],
numberUnique_val_col[10],numberUnique_val_col[11],
numberUnique_val_col[12],numberUnique_val_col[13],
numberUnique_val_col[14],numberUnique_val_col[15],
numberUnique_val_col[16],numberUnique_val_col[17],
numberUnique_val_col[18],numberUnique_val_col[19], numberUnique_val_col[20]), adaptDF = FALSE)
pc.fit <- pc(suffStat, indepTest= disCItest, alpha=0.01, labels=V, fixedGaps = NULL, fixedEdges = NULL,NAdelete = TRUE, m.max = Inf,skel.method = "stable", conservative = TRUE,solve.confl = TRUE, verbose = TRUE)
The "Stable" part of PC-Stable only affects the Skeleton phase of the algorithm. The Orientation phase is still order-dependent. Do the two graphs have identical "skeletons"? That is, if you convert all directed edges into undirected edges, are the two graphs identical?
If not, you may have uncovered a bug in pcalg! Please post a sample dataset and two orderings of the columns that produce graphs with different skeletons.

Resources