Use a dedicated environment for drake, with r_make() - drake-r-package

I'm trying to adapt the recommendation in Section 12.7.6.5 of the manual (use a dedicated environment rather than the global environment) to interactive usage with r_make().
I modified the _drake.R configuration script as follows:
envir <- new.env(parent = globalenv())
source("R/packages.R", local = envir) # Load your packages, e.g. library(drake).
source("R/functions.R", local = envir) # Define your custom code as a bunch of functions.
source("R/plan.R", local = envir) # Create your drake plan.
drake_config(plan, envir = envir)
with the packages, functions, and plan defined in the corresponding R/ scripts.
When I run
library(drake)
r_make()
I get:
Error in if (nrow(plan) < 1L) { : argument is of length zero
Error: <callr_status_error: callr subprocess failed: argument is of length zero>
-->
<callr_remote_error in if (nrow(plan) < 1L) { ...:
argument is of length zero>
in process 1598
See `.Last.error.trace` for a stack trace.
Am I missing something?

If you are already using r_make(), you most likely do not need to bother with envir. Because r_make() begins and ends in its own isolated callr::r() process, the global environment of the master session is already protected. In fact, r_make() is much better than envir when it comes to environment reproducibility, so you are already on the right track.
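For instance, here is a minimal sketch of what that isolation looks like from the interactive session (a hypothetical session, assuming a valid _drake.R in the working directory):

library(drake)
ls()     # objects in your interactive session
r_make() # _drake.R runs in a fresh callr::r() process
ls()     # unchanged: nothing sourced by _drake.R leaks back in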
But if you do still want to use envir, please make sure the plan is defined in the environment that calls drake_config(): that is, the global environment of the session that runs _drake.R. So you can either call drake_config(envir$plan, envir = envir) or write source("plan.R") instead of source("plan.R", local = envir).
Examples:
writeLines(
  c(
    "library(drake)",
    "plan <- drake_plan(x = 1)"
  ),
  "plan.R"
)
writeLines(
  c(
    "envir <- new.env(parent = globalenv())",
    "source(\"plan.R\", local = envir)",
    "ls() # does not contain the plan",
    "ls(envir) # contains the plan",
    "drake_config(envir$plan, envir = envir)"
  ),
  "_drake.R"
)
cat(readLines("plan.R"), sep = "\n")
#> library(drake)
#> plan <- drake_plan(x = 1)
cat(readLines("_drake.R"), sep = "\n")
#> envir <- new.env(parent = globalenv())
#> source("plan.R", local = envir)
#> ls() # does not contain the plan
#> ls(envir) # contains the plan
#> drake_config(envir$plan, envir = envir)
library(drake)
r_make()
#> target x
Created on 2020-01-13 by the reprex package (v0.3.0)
writeLines(
  c(
    "library(drake)",
    "plan <- drake_plan(x = 1)"
  ),
  "plan.R"
)
writeLines(
  c(
    "envir <- new.env(parent = globalenv())",
    "source(\"plan.R\") # source into global envir",
    "ls()",
    "ls(envir)",
    "drake_config(plan, envir = envir)"
  ),
  "_drake.R"
)
cat(readLines("plan.R"), sep = "\n")
#> library(drake)
#> plan <- drake_plan(x = 1)
cat(readLines("_drake.R"), sep = "\n")
#> envir <- new.env(parent = globalenv())
#> source("plan.R") # source into global envir
#> ls()
#> ls(envir)
#> drake_config(plan, envir = envir)
library(drake)
r_make()
#> target x
Created on 2020-01-13 by the reprex package (v0.3.0)

Related

(Lua) How do I break a string into entries in a table?

So I want to take an input like R U R' U' and turn it into a table that contains
R
U
R'
U'
I haven't found an example of code that worked; I tried this solution from codegrepper, and it didn't work. My general program is supposed to take an input like R and find its place in a table: if R is at index 1, it takes the value at index 1 from another table (which holds r^ as value 1), does the same with the rest, and prints the result when done. So if there is an optimization that could make this all quicker, I would like to see it. Thanks and goodbye.
function split(str, pat)
  local t = {}
  local fpat = "(.-)" .. pat
  local last_end = 1
  local s, e, cap = str:find(fpat, 1)
  while s do
    if s ~= 1 or cap ~= "" then table.insert(t, cap) end
    last_end = e + 1
    s, e, cap = str:find(fpat, last_end)
  end
  if last_end <= #str then
    cap = str:sub(last_end)
    table.insert(t, cap)
  end
  return t
end
Then split the input with split(var, " ").
local myString = "R U R' U'"
local myTable = {}
for e in string.gmatch(myString, "%S+") do
  table.insert(myTable, e)
end
Lua Users Wiki
s:gmatch(pattern)
This returns a pattern finding iterator. The iterator will search through the string passed looking for instances of the pattern you passed.
First you need to match all the space-separated parts. You can do this using gmatch. Then you can insert these parts as keys in a hash table, the value being the one-indexed index of the occurrence:
local str = "R U R' U'"
local index = 1
local last_occurrence = {}
for match in str:gmatch("%S+") do
  last_occurrence[match] = index
  index = index + 1
end
Now you can use your "other table" to obtain the value in constant time:
local other_table = {"r^"}
local value = other_table[last_occurrence.R] -- "r^"

Simstudy package duplicate keys error and variables referenced not previously defined error

I tried running the following code and encountered several errors with the simstudy package.
library(simstudy)
clusterDef <- defData(varname = "u_3", dist = "normal", formula = 0,
                      variance = 25.77, id = "clus") # cluster-level random effect
clusterDef <- defData(clusterDef, varname = "error", dist = "normal", formula = 0,
                      variance = 38.35) # error term
clusterDef <- defData(clusterDef, varname = "ind", dist = "nonrandom",
                      formula = 25) # individuals per cluster
#Generate individual-level random effect and treatment variable
indDef <- defDataAdd(varname = "u_2", dist = "normal", formula = 0,
                     variance = 120.62)
#Generate clusters of data
set.seed(12345)
cohortsw <- genData(3, clusterDef)
cohortswTm <- addPeriods(cohortsw, nPeriods = 6, idvars = "clus", perName = "period")
cohortswTm <- trtStepWedge(cohortswTm, "clus", nWaves = 3, lenWaves = 1, startPer = 1, grpName = "trt")
cohortswTm <- genCluster(cohortswTm, cLevelVar = "clus", numIndsVar = "ind", level1ID = "id")
Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||
!anyDuplicated(f__, : Join results in 2700 rows; more than 468 =
nrow(x)+nrow(i). Check for duplicate key values in i each of which
join to the same group in x over and over again. If that's ok, try
by=.EACHI to run j for each group to avoid the large allocation. If
you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
Otherwise, please search for this error message in the FAQ, Wiki,
Stack Overflow and data.table issue tracker for advice.
cohortswTm <- addColumns(indDef, cohortswTm)
#Define coefficients for time as a categorical variable
timecoeff1 <- -5.42
timecoeff2 <- -5.72
timecoeff3 <- -7.03
timecoeff4 <- -6.13
timecoeff5 <- -9.13
#Generate outcome y
y <- defDataAdd(varname = "Y", formula = "17.87 + 5.0*trt + timecoeff1*I(period == 1) + timecoeff2*I(period == 2) + timecoeff3*I(period == 3) + timecoeff4*I(period == 4) + timecoeff5*I(period == 5) + u_3 + u_2 + error", dist = "normal")
#Add outcome to dataset
cohortswTm <- addColumns(y, cohortswTm)
Error: Variable(s) referenced not previously defined: timecoeff1,
timecoeff2, timecoeff3, timecoeff4, timecoeff5
Does anybody know why I am getting the errors highlighted above? How would I fix the code to prevent them from occurring?
Any help is much appreciated.
The first error is generated because you are trying to create individual-level data within each cluster, but each cluster appears repeatedly (over 6 periods), and genCluster expects cLevelVar to be a unique id. In this case, you can generate the individuals within each cluster-period (here, 25 per cluster per period, from ind) by modifying the genCluster command to be
cohortswTm <- genCluster(cohortswTm, cLevelVar = "timeID",
                         numIndsVar = "ind", level1ID = "id")
This code creates a "closed" cohort: individuals are observed in only a single period. Generating an open cohort, where individuals might be observed over multiple periods, is a bit more involved, and is described here.
The second error is generated because simstudy data definitions can only include variables that have been defined in the context of the data definition. So, any constants need to be in the formula. (The formula itself can be updated using updateDef and updateDefAdd if you want to explore the effects of different covariate levels.)
This is how y should be defined:
y <- defDataAdd(varname = "Y",
                formula = "17.87 + 5.0*trt - 5.42*I(period == 1) -
                           5.72*I(period == 2) - 7.03*I(period == 3) -
                           6.13*I(period == 4) - 9.13*I(period == 5) +
                           u_3 + u_2 + error",
                dist = "normal")
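If you later want to explore different coefficient values, you do not have to retype the definition: as mentioned above, updateDefAdd() can revise the formula in place. A minimal sketch, assuming the argument names documented in simstudy:

# hypothetical update: the same definition with a larger treatment effect
y <- updateDefAdd(y, changevar = "Y",
                  newformula = "17.87 + 10.0*trt - 5.42*I(period == 1) -
                                5.72*I(period == 2) - 7.03*I(period == 3) -
                                6.13*I(period == 4) - 9.13*I(period == 5) +
                                u_3 + u_2 + error")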

Create groups of targets

Let's say that I have the following plan:
test_plan = drake_plan(
  foo = target(x + 1, transform = map(x = c(5, 10))),
  bar = 42
)
Now I want to create a new target that contains the two sub-targets foo_5 and foo_10 as well as the target bar. How can I do this? I feel it must be super simple, but I can't find a solution.
Thanks!
Yes, it is both possible and simple. The built-in solution is to use tags: https://books.ropensci.org/drake/static.html#tags. Example:
library(drake)
drake_plan(
  foo = target(
    x + 1,
    transform = map(x = c(5, 10), .tag_out = group)
  ),
  bar = target(
    42,
    # You need a transform to use a tag, even for 1 target.
    transform = map(tmp = 1, .tag_out = group)
  ),
  baz_map = target(group, transform = map(group)),
  baz_combine = target(c(group), transform = combine(group))
)
#> # A tibble: 7 x 2
#>   target         command
#>   <chr>          <expr>
#> 1 foo_5          5 + 1
#> 2 foo_10         10 + 1
#> 3 bar_1          42
#> 4 baz_map_foo_5  foo_5
#> 5 baz_map_foo_10 foo_10
#> 6 baz_map_bar_1  bar_1
#> 7 baz_combine    c(foo_5, foo_10, bar_1)
Created on 2019-11-16 by the reprex package (v0.3.0)
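Downstream, the grouped targets behave like any others. A minimal sketch of building the plan and reading the combined target back (assumes the plan above is assigned to a variable first):

plan <- drake_plan(
  foo = target(x + 1, transform = map(x = c(5, 10), .tag_out = group)),
  bar = target(42, transform = map(tmp = 1, .tag_out = group)),
  baz_combine = target(c(group), transform = combine(group))
)
make(plan)
readd(baz_combine) # the values of foo_5, foo_10, and bar_1, concatenated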

How can I get the length of a protein chain from a PDB file with Biopython?

First I tried it this way:
for model in structure:
    for residue in model.get_residues():
        if PDB.is_aa(residue):
            x += 1
and then this way:
len(structure[0][chain])
But neither of them seems to work.
Your code should work and give you the correct results.
from Bio import PDB

parser = PDB.PDBParser()
pdb1 = './1bfg.pdb'
structure = parser.get_structure("1bfg", pdb1)

# count residues whose id marks them as standard (blank hetero flag)
res_no = 0
non_resi = 0
for model in structure:
    for chain in model:
        for r in chain.get_residues():
            if r.id[0] == ' ':
                res_no += 1
            else:
                non_resi += 1
print("Residues: %i" % (res_no))
print("Other: %i" % (non_resi))

# same count via PDB.is_aa()
res_no2 = 0
non_resi2 = 0
for model in structure:
    for residue in model.get_residues():
        if PDB.is_aa(residue):
            res_no2 += 1
        else:
            non_resi2 += 1
print("Residues2: %i" % (res_no2))
print("Other2: %i" % (non_resi2))
Output:
Residues: 126
Other: 99
Residues2: 126
Other2: 99
Your statement
print(len(structure[0]['A']))
gives you the total count (225) of all residues, in this case all amino acids and water molecules.
The numbers seem to be correct when compared to manual inspection using PyMol.
What is the specific error message you are getting or the output you are expecting? Any specific PDB file?
Since the PDB file mostly stores the coordinates of the resolved atoms, it is not always possible to get the full structure from it. Another approach would be to use the cif files.
from Bio import PDB

pdb1 = './1bfg.cif'
m = PDB.MMCIF2Dict.MMCIF2Dict(pdb1)
if '_entity_poly.pdbx_seq_one_letter_code' in m.keys():
    print('Full structure:')
    full_structure = m['_entity_poly.pdbx_seq_one_letter_code']
    print(full_structure)
    print(len(full_structure))
Output:
Full structure:
PALPEDGGSGAFPPGHFKDPKRLYCKNGGFFLRIHPDGRVDGVREKSDPHIKLQLQAEERGVVSIKGVSANRYLAMKEDGRLLASKSVTDECFFFERLESNNYNTYRSRKYTSWYVALKRTGQYKLGSKTGPGQKAILFLPMSAKS
146
For multiple chains:
from Bio import PDB

pdb1 = './4hlu.cif'
m = PDB.MMCIF2Dict.MMCIF2Dict(pdb1)
if '_entity_poly.pdbx_seq_one_letter_code' in m.keys():
    full_structure = m['_entity_poly.pdbx_seq_one_letter_code']
    chains = m['_entity_poly.pdbx_strand_id']
    for c in chains:
        print('Chain %s' % (c))
        print('Sequence: %s' % (full_structure[chains.index(c)]))
It's just:
from Bio import PDB
from Bio.PDB import PDBParser

pdb = PDBParser().get_structure("1bfg", "1bfg.pdb")
for chain in pdb.get_chains():
    print(len([_ for _ in chain.get_residues() if PDB.is_aa(_)]))
I appreciated Peters' answer, but I also realized that the r.id[0] == ' ' check is more robust (e.g. with HIE). PDB.is_aa() cannot detect that HIE, ε-nitrogen-protonated histidine, is an amino acid. So I recommend:
from Bio import PDB

parser = PDB.PDBParser()
pdb1 = './1bfg.pdb'
structure = parser.get_structure("1bfg", pdb1)

res_no = 0
non_resi = 0
for model in structure:
    for chain in model:
        for r in chain.get_residues():
            if r.id[0] == ' ':
                res_no += 1
            else:
                non_resi += 1
print("Residues: %i" % (res_no))
print("Other: %i" % (non_resi))
I think you would actually want to do something like
import Bio.PDB.MMCIF2Dict

m = Bio.PDB.MMCIF2Dict.MMCIF2Dict(pdb_cif_file)
if '_entity_poly.pdbx_seq_one_letter_code' in m.keys():
    full_structure = m['_entity_poly.pdbx_seq_one_letter_code']
    chains = m['_entity_poly.pdbx_strand_id']
    for c in chains:
        # a strand id entry can list several chains, e.g. "A,B"
        for ci in c.split(','):
            print('Chain %s' % (ci))
            print('Sequence: %s' % (full_structure[chains.index(c)]))

Get the number of times a test has failed in Jenkins

We have some tests that fail intermittently for no apparent reason; mainly, JUnit times out. I want to know if I can get the number of times each test has failed. With that, I can see whether certain tests keep having issues, or whether the failures are not tied to tricky tests and are more an issue with the stability of Jenkins on that server.
I encountered the same problem, and we wrote a Python script that can grab the failing tests in the last N builds:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 script: scans the test report pages of the last nb_build builds
# and counts how many times each test has failed.
import urllib
import re
import sys

project = "HERE_THE_PROJECT_NAME"
jenkin_host = "http://path.to.your.jenkins/jenkins/job/%s" % project
last_build = int(re.search("%s #(\d+)" % project, urllib.urlopen(jenkin_host + "/rssAll").read()).group(1))
start_build = last_build
nb_build = 200
REG_EXP = """All Failed Tests(.*)All Tests"""
FAILURE_REG_EXP = """javascript:hideStackTrace\(([^<]*)\)"""
all_failures = {}
last_seen = {}
print "Loading %s builds starting from build number %s" % (nb_build, start_build)
build_ok = 0
for build_id in range(start_build - nb_build, start_build + 1):
    test_page = jenkin_host + "/%s/testReport/" % build_id
    sys.stdout.write(".")
    sys.stdout.flush()
    failures = set()
    for line in urllib.urlopen(test_page).readlines():
        line_piece = re.search(REG_EXP, line)
        if line_piece:
            piece = line_piece.group(1)
            # collect every failed test referenced in the "All Failed Tests" section
            for match in re.finditer(FAILURE_REG_EXP, piece):
                failures.add(match.group(1))
    if not failures:
        build_ok += 1
    for failure in failures:
        all_failures[failure] = all_failures.get(failure, 0) + 1
        last_seen[failure] = build_id
print
print "Done (found %s build OK)" % build_ok
# report tests that failed more than once, most frequent first
nbs = [x for x in list(set(all_failures.values())) if x > 1]
nbs.sort(reverse=True)
for i in nbs:
    for test, nb in all_failures.iteritems():
        if nb == i:
            print "%d : %s (last seen : %s)" % (nb, test, last_seen[test])
And I obtain:
Loading 200 builds starting from build number 11032
.........................................................................................................................................................................................................
Done (found 148 build OK)
8 : 'one failing test' (last seen : 10906)
7 : 'another-failing-test' (last seen : 11019)
