How can I import from multiple files in r-drake? - drake-r-package

I want to import data of a similar category from multiple source files.
Every source has a short label.
How can I incorporate this into drake, without writing out every file as its own target?
I thought the following would work, but it does not. Ideally, I would like to have the targets raw_a and raw_b.
input_files <- list(
'a' = 'file_1.csv',
'b' = 'file_2.csv'
)
plan <-
drake::drake_plan(
raw = drake::target(
import_file(file),
transform = map(
file = file_in(!! input_files)
)
)
)
with
import_file <- function(file) {
readr::read_csv(file, skip = 2)
}

You are so close. file_in() needs to go literally in the command, not the transformation.
library(drake)
input_files <- c("file_1.csv", "file_2.csv")
plan <- drake_plan(
raw = target(
import_file(file_in(file)),
transform = map(file = !!input_files)
)
)
config <- drake_config(plan)
vis_drake_graph(config)
Created on 2019-10-19 by the reprex package (v0.3.0)

This is probably the idiomatic solution.
plan <-
drake::drake_plan(
raw = drake::target(
import_file(file),
transform = map(
file = file_in('file_1.csv', 'file_2.csv'),
label = c('a', 'b'),
.id = label
)
)
)

file_in needs to around the whole string
plan <-
drake::drake_plan(
raw = drake::target(
import_file(file),
transform = map(
file = list(
file_in('file_1.csv'),
file_in('file_2.csv')
)
)
)
)

Related

Rewriting ParamSet ids from mlr3::paradox()

Let's say I have the following ParamSet object:
my_ps = paradox::ps(
minsplit = p_int(1, 64, logscale = TRUE),
cp = p_dbl(1e-04, 1, logscale = TRUE))
Is it possible to rename minsplit to survTree.minsplit without changing anything else?
The reason for this is that I use some learners as part of a GraphLearner and so their parameters names changed and I would like to have some code that adds the learner$id in front the parameters to use later for tuning (rather than rewriting them from scratch with the new names)
I think I have a partial solution here. It is only partial, because it does not support the transformation.
Where it works:
library(paradox)
my_ps = paradox::ps(
minsplit = p_int(1, 64),
cp = p_dbl(1e-04, 1)
)
my_ps$set_id = "john"
my_psc = ParamSetCollection$new(list(my_ps))
print(my_psc)
#> <ParamSetCollection>
#> id class lower upper nlevels default value
#> 1: john.minsplit ParamInt 1e+00 64 64 <NoDefault[3]>
#> 2: john.cp ParamDbl 1e-04 1 Inf <NoDefault[3]>
Created on 2022-12-07 by the reprex package (v2.0.1)
Where it does not:
library(paradox)
my_ps = paradox::ps(
minsplit = p_int(1, 64, logscale = TRUE),
cp = p_dbl(1e-04, 1)
)
my_ps$set_id = "john"
my_psc = ParamSetCollection$new(list(my_ps))
#> Error in .__ParamSetCollection__initialize(self = self, private = private, : Building a collection out sets, where a ParamSet has a trafo is currently unsupported!
Created on 2022-12-07 by the reprex package (v2.0.1)
The underlying problem is that we did not solve the problem of how to reconcile the parameter transformations of individual ParamSets and a possible parameter transformation of the ParamSetCollection
I fear that there is currently no neat solution for your problem.
Sorry I can not comment yet, this is not exactly the solution you are looking for but I hope this will fix the problem you are having.
You can set the param_space in the learner, before putting it in the graph, i.e. sticking with your search space. After you create the GraphLearner regularly it will have the desired search space.
A concrete example:
library(mlr3verse)
learner = lrn("regr.rpart", cp = to_tune(0.1, 0.2))
glrn = as_learner(po("pca") %>>% po("learner", learner))
at = auto_tuner(
"random_search",
glrn,
rsmp("holdout"),
term_evals = 10
)
task = tsk("mtcars")
at$train(task)

How to load Seurat Object into WGCNA Tutorial Format

As far as I can find, there is only one tutorial about loading Seurat objects into WGCNA (https://ucdavis-bioinformatics-training.github.io/2019-single-cell-RNA-sequencing-Workshop-UCD_UCSF/scrnaseq_analysis/scRNA_Workshop-PART6.html). I am really new to programming so it's probably just my inexperience, but I am not sure how to load my Seurat object into a format that works with WGCNA's tutorials (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/).
Here is what I have tried thus far:
This tries to replicate datExpr and datTraits from part I.1:
library(WGCNA)
library(Seurat)
#example Seurat object -----------------------------------------------
ERlist <- list(c("CPB1", "RP11-53O19.1", "TFF1", "MB", "ANKRD30B",
"LINC00173", "DSCAM-AS1", "IGHG1", "SERPINA5", "ESR1",
"ILRP2", "IGLC3", "CA12", "RP11-64B16.2", "SLC7A2",
"AFF3", "IGFBP4", "GSTM3", "ANKRD30A", "GSTT1", "GSTM1",
"AC026806.2", "C19ORF33", "STC2", "HSPB8", "RPL29P11",
"FBP1", "AGR3", "TCEAL1", "CYP4B1", "SYT1", "COX6C",
"MT1E", "SYTL2", "THSD4", "IFI6", "K1AA1467", "SLC39A6",
"ABCD3", "SERPINA3", "DEGS2", "ERLIN2", "HEBP1", "BCL2",
"TCEAL3", "PPT1", "SLC7A8", "RP11-96D1.10", "H4C8",
"PI15", "PLPP5", "PLAAT4", "GALNT6", "IL6ST", "MYC",
"BST2", "RP11-658F2.8", "MRPS30", "MAPT", "AMFR", "TCEAL4",
"MED13L", "ISG15", "NDUFC2", "TIMP3", "RP13-39P12.3", "PARD68"))
tnbclist <- list(c("FABP7", "TSPAN8", "CYP4Z1", "HOXA10", "CLDN1",
"TMSB15A", "C10ORF10", "TRPV6", "HOXA9", "ATP13A4",
"GLYATL2", "RP11-48O20.4", "DYRK3", "MUCL1", "ID4", "FGFR2",
"SHOX2", "Z83851.1", "CD82", "COL6A1", "KRT23", "GCHFR",
"PRICKLE1", "GCNT2", "KHDRBS3", "SIPA1L2", "LMO4", "TFAP2B",
"SLC43A3", "FURIN", "ELF5", "C1ORF116", "ADD3", "EFNA3",
"EFCAB4A", "LTF", "LRRC31", "ARL4C", "GPNMB", "VIM",
"SDR16C5", "RHOV", "PXDC1", "MALL", "YAP1", "A2ML1",
"RP1-257A7.5", "RP11-353N4.6", "ZBTB18", "CTD-2314B22.3", "GALNT3",
"BCL11A", "CXADR", "SSFA2", "ADM", "GUCY1A3", "GSTP1",
"ADCK3", "SLC25A37", "SFRP1", "PRNP", "DEGS1", "RP11-110G21.2",
"AL589743.1", "ATF3", "SIVA1", "TACSTD2", "HEBP2"))
genes = c(unlist(c(ERlist,tnbclist)))
mat = matrix(rnbinom(500*length(genes),mu=500,size=1),ncol=500)
rownames(mat) = genes
colnames(mat) = paste0("cell",1:500)
sobj = CreateSeuratObject(mat)
sobj = NormalizeData(sobj)
sobj$ClusterName = factor(sample(0:1,ncol(sobj),replace=TRUE))
sobj$Patient = paste0("Patient", 1:500)
sobj = AddModuleScore(object = sobj, features = tnbclist,
name = "TNBC_List",ctrl=5)
sobj = AddModuleScore(object = sobj, features = ERlist,
name = "ER_List",ctrl=5)
#WGCNA -----------------------------------------------------------------
sobjwgcna <- sobj
sobjwgcna <- FindVariableFeatures(sobjwgcna, selection.method = "vst", nfeatures = 2000,
verbose = FALSE, assay = "RNA")
options(stringsAsFactors = F)
sobjwgcnamat <- GetAssayData(sobjwgcna)
datExpr <- t(sobjwgcnamat)[,VariableFeatures(sobjwgcna)]
datTraits <- sobjwgcna#meta.data
datTraits = subset(datTraits, select = -c(nCount_RNA, nFeature_RNA))
I then copy-paste the code as written in the WGCNA I.2a tutorial (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-02-networkConstr-auto.pdf), and that all works until I get to this line in the I.3 tutorial (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-03-relateModsToExt.pdf):
MEList = moduleEigengenes(datExpr, colors = moduleColors)
Error in t.default(expr[, restrict1]) : argument is not a matrix
I tried converting both moduleColors and datExpr into a matrix with as.matrix(), but the error still persists.
Hopefully this makes sense, and thanks for reading!
So doing as.matrix(datExpr) right after datExpr <- t(sobjwgcnamat)[,VariableFeatures(sobjwgcna)] worked. I had been trying it right before MEList = moduleEigengenes(datExpr, colors = moduleColors)
and that didn't work. Seems simple but order matters I guess.

Filter source files for custom rule

I used a bazel macro to run a python test on a subset of source files. Similar to this:
def report(name, srcs):
source_labels = [file for file in srcs if file.startswith("a")]
if len(source_labels) == 0:
return;
source_filenames = ["$(location %s)" % x for x in source_labels]
native.py_test(
name = name + "_report",
srcs = ["report_tool"],
data = source_labels,
main = "report_tool.py",
args = source_filenames,
)
report("foo", ["foo.hpp", "afoo.hpp"])
This worked fine until one of my source files started using a select and now I get the error:
File "/home/david/foo/report.bzl", line 47, in report
[file for file in srcs if file.startswith("a")]
type 'select' is not iterable
I tried to move the code to a bazel rule, but then I get a different error that py_test can not be used in the analysis phase.
The reason that the select is causing the error is that macros are evaluated during the loading phase, whereas selectss are not evaluated until the analysis phase (see Extension Overview).
Similarly, py_test can't be used in a rule implementation because the rule implementation is evaluated in the analysis phase, whereas the py_test would need to have been loaded in the loading phase.
One way past this is to create a separate Starlark rule that takes a list of labels and just creates a file with each filename from the label. Then the py_test takes that file as data and loads the other files from there. Something like this:
def report(name, srcs):
file_locations_label = "_" + name + "_file_locations"
_generate_file_locations(
name = file_locations_label,
labels = srcs
)
native.py_test(
name = name + "_report",
srcs = ["report_tool.py"],
data = srcs + [file_locations_label],
main = "report_tool.py",
args = ["$(location %s)" % file_locations_label]
)
def _generate_file_locations_impl(ctx):
paths = []
for l in ctx.attr.labels:
f = l.files.to_list()[0]
if f.basename.startswith("a"):
paths.append(f.short_path)
ctx.actions.write(ctx.outputs.file_paths, "\n".join(paths))
return DefaultInfo(runfiles = ctx.runfiles(files = [ctx.outputs.file_paths]))
_generate_file_locations = rule(
implementation = _generate_file_locations_impl,
attrs = { "labels": attr.label_list(allow_files = True) },
outputs = { "file_paths": "%{name}_files" },
)
This has one disadvantage: Because the py_test has to depend on all the sources, the py_test will get rerun even if the only files that have changed are the ignored files. (If this is a significant drawback, then there is at least one way around this, which is to have _generate_file_locations filter the files too, and have the py_test depend on only _generate_file_locations. This could maybe be accomplished through runfiles symlinks)
Update:
Since the test report tool comes from an external repository and can't be easily modified, here's another approach that might work better. Rather than create a rule that creates a params file (a file containing the paths to process) as above, the Starlark rule can itself be a test rule that uses the report tool as the test executable:
def _report_test_impl(ctx):
filtered_srcs = []
for f in ctx.attr.srcs:
f = f.files.to_list()[0]
if f.basename.startswith("a"):
filtered_srcs.append(f)
report_tool = ctx.attr._report_test_tool
ctx.actions.write(
output = ctx.outputs.executable,
content = "{report_tool} {paths}".format(
report_tool = report_tool.files_to_run.executable.short_path,
paths = " ".join([f.short_path for f in filtered_srcs]))
)
runfiles = ctx.runfiles(files = filtered_srcs).merge(
report_tool.default_runfiles)
return DefaultInfo(runfiles = runfiles)
report_test = rule(
implementation = _report_test_impl,
attrs = {
"srcs": attr.label_list(allow_files = True),
"_report_test_tool": attr.label(default="//:report_test_tool"),
},
test = True,
)
This requires that the test report tool be a py_binary somewhere so that the test rule above can depend on it:
py_binary(
name = "report_test_tool",
srcs = ["report_tool.py"],
main = "report_tool.py",
)

Citing within an RMarkdown table

I am attempting to create a table which has citations built into the table. Here is a visual of what I am trying to achieve.
As far as I know you can only add footnotes in rowvars or colvars in kableExtra (love that package).
# Create a dataframe called df
Component <- c('N2','P3')
Latency <- c('150 to 200ms', '625 to 800ms')
Location <- c('FCz, Fz, Cz', 'Pz, Oz')
df <- data.frame(Component, Latency, Location)
Below is my attempt after reading through kableExtra's Git page
# Trying some code taken from the kableExtra guide
row.names(df) <- df$Component
df[1] <- NULL
dt_footnote <- df
names(dt_footnote)[1] <- paste0(names(dt_footnote)[2],
footnote_marker_symbol(1))
row.names(dt_footnote)[2] <- paste0(row.names(dt_footnote)[2],
footnote_marker_alphabet(1))
kable(dt_footnote, align = "c",
# Remember this escape = F
escape = F, "latex", longtable = T, booktabs = T, caption = "My Table Name") %>%
kable_styling(full_width = F) %>%
footnote(alphabet = "Jones, 2013",
symbol = "Footnote Symbol 1; ",
footnote_as_chunk = T)
But this code only works on the headers. The ultimate goal would be if I could use a BibTex reference such as #JonesFunctionalMixedEffectModels2013 such that the final part of the code would look like
footnote(alphabet = #davidsonFunctionalMixedEffectModels2009,
symbol = "Footnote Symbol 1; ", footnote_as_chunk = T)
Anyone have any ideas?
Thanks
What I did at the end was to generate a temporary table with pander, then copy the references' number manually to my kable
pander(
df,
caption = "Temporal",
style = "simple",
justify = "left")

How to use prepare_analogy_questions and check_analogy_accuracy functions in text2vec package?

Following code:
library(text2vec)
text8_file = "text8"
if (!file.exists(text8_file)) {
download.file("http://mattmahoney.net/dc/text8.zip", "text8.zip")
unzip ("text8.zip", files = "text8")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)
# Create iterator over tokens
tokens <- space_tokenizer(wiki)
# Create vocabulary. Terms will be unigrams (simple words).
it = itoken(tokens, progressbar = FALSE)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)
# Use our filtered vocabulary
vectorizer <- vocab_vectorizer(vocab)
# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
RcppParallel::setThreadOptions(numThreads = 4)
glove_model = GloVe$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10, learning_rate = .25)
word_vectors_main = glove_model$fit_transform(tcm, n_iter = 20)
word_vectors_context = glove_model$components
word_vectors = word_vectors_main + t(word_vectors_context)
causes error:
qlst <- prepare_analogy_questions("questions-words.txt", rownames(word_vectors))
> Error in (function (fmt, ...) :
invalid format '%d'; use format %s for character objects
File questions-words.txt from word2vec sources https://github.com/nicholas-leonard/word2vec/blob/master/questions-words.txt
This was a small bug in information message formatting (after introduction of futille.logger). Just fixed it and pushed to github.
You can install updated version of the package with devtools::install_github("dselivanov/text2vec"

Resources