`knitr_in`, `file_out`, and `vis_drake_graph` usage in R drake

I'm trying to understand how to use knitr_in, file_out, and vis_drake_graph properly in drake.
I have three questions.
Q1: Usage of knitr_in and file_out to create markdown reports
While a code like this works correctly for one of my smaller projects:
make_hyp_data_aggregated_report <- function() {
  render(
    input = knitr_in("rmd/hyptest-is-data-being-aggregated.Rmd"),
    output_file = file_out("~/projectname/reports/01-hyp-test.html"),
    quiet = TRUE
  )
}
plan <- drake_plan(
  ...
  hyp_data_aggregated_report = make_hyp_data_aggregated_report(),
  ...
)
Essentially the same code in my large project (with 10+ reports) doesn't work quite right: the reports get built, but the knitr_in files don't show up as the blue squares in the graph from drake::vis_drake_graph() in my large project.
Both projects use drake::loadd(....) within the R Markdown files to get objects from the cache.
Is there some code in vis_drake_graph that removes these squares once the graph gets busy?
Q2: file_out objects in vis_drake_graph
Is there a way to display the file_out objects themselves as circles/squares in vis_drake_graph?
Q3: packages showing up in vis_drake_graph
Is there a way to prevent vis_drake_graph from displaying packages explicitly (basically anything with ::)?

Q1
Every literal file path needs its own knitr_in() or file_out(). drake finds these keywords by statically analyzing your code, so if you have one function with one knitr_in(), that still only counts as one file path even if you call the function multiple times. I recommend writing these keywords at the plan level, e.g.
plan <- drake_plan(
  r1 = render(knitr_in("report1.Rmd"), output_file = file_out("report1.html")),
  r2 = render(knitr_in("report2.Rmd"), output_file = file_out("report2.html")),
  r3 = render(knitr_in("report3.Rmd"), output_file = file_out("report3.html"))
)
Q2
They should appear unless you set show_output_files = FALSE in vis_drake_graph().
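For example, a minimal sketch (assuming a drake version where vis_drake_graph() accepts the plan directly; older versions take a drake_config() object instead):
library(drake)
vis_drake_graph(plan, show_output_files = TRUE)  # file_out() files drawn as nodes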
Q3
No, but if it's any consolation, I do regret the decision to track namespaced functions and objects at all in drake. drake's approach is fundamentally suboptimal for tracking packages, and I plan to get rid of it if there ever comes time for a round of breaking changes. Otherwise, there is no way to get rid of it except vis_drake_graph(targets_only = TRUE), which also gets rid of all the imports in the graph.
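A sketch of that workaround, under the same assumption about passing the plan directly:
vis_drake_graph(plan, targets_only = TRUE)  # hides every import, including pkg::fun nodes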

Related

Clean way to loop over a masked list in Julia

In Julia, I have a list of neighbors of a location stored in all_neighbors[loc]. This lets me loop over those neighbors conveniently with the syntax for neighbor in all_neighbors[loc], which leads to readable code such as the following:
active_neighbors = 0
for neighbor in all_neighbors[loc]
    if cube[neighbor] == ACTIVE
        active_neighbors += 1
    end
end
Astute readers will see that this is nothing more than a reduction. Because I'm just counting active neighbors, I figured I could do this in a one-liner using the count function. However,
# This does not work
active_neighbors = count(x->x==ACTIVE, cube[all_neighbors[loc]])
does not work because the all_neighbors mask doesn't get interpreted correctly as simply a mask over the cube array. Does anyone know the cleanest way to write this reduction? An alternative solution I came up with is:
active_neighbors = count(x->x==ACTIVE, [cube[all_neighbors[loc][k]] for k = 1:length(all_neighbors[loc])])
but I really don't like this because it's even less readable than what I started with. Thanks for any advice!
This should work:
count(x -> cube[x] == ACTIVE, all_neighbors[loc])
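A self-contained sketch with made-up data (ACTIVE, cube, and all_neighbors here stand in for the asker's variables):
const ACTIVE = 1
cube = [1 0 1; 0 1 0; 1 1 0]
# neighbors of location (2, 2), stored as indices into cube
all_neighbors = Dict((2, 2) => [CartesianIndex(1, 1), CartesianIndex(1, 2), CartesianIndex(2, 1)])
loc = (2, 2)
active_neighbors = count(x -> cube[x] == ACTIVE, all_neighbors[loc])  # 1 here: only (1, 1) is ACTIVE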

Issue returning desired data with Lua

Wondering if I could get some help with this:
function setupRound()
    local gameModes = {'mode 1','mode 2','mode 3'} -- Game modes
    local maps = {'map1','map2','map3'}
    --local newMap = maps[math.random(1,#maps)]
    local mapData = {maps[math.random(#maps)],gameModes[math.random(#gameModes)]}
    local mapData = mapData
    return mapData
end
a = setupRound()
print(a[1],a[2]) --Fix from Egor
What the problem is:
When trying to get the info from setupRound() I get table: 0x18b7b20
How I am trying to get mapData:
a = setupRound()
print(a)
Edit:
Output Issues
With the current script I always get the same output: map3 mode 2.
What is the cause of this?
Efficiency; is this the best way to do it?
While this really isn't a question, I just wanted to know whether the method I am using is truly the most efficient way of doing this.
First of all
this line does nothing useful and can be removed (it does something, just not something you'd want):
local mapData = mapData
Output Issues
The problem is math.random. Write a script that's just print(math.random(1,100)) and run it 100 times: it will print the same number each time. This is because Lua, by default, does not seed its random number generator on startup. The easiest fix is to call math.randomseed(os.time()) at the beginning of your program.
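For example:
math.randomseed(os.time())  -- seed once, at program start
print(math.random(1, 100))  -- now differs between runs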
Efficiency; is this the best way to do it?
Depends. For what you seem to want, yes, it's definitely efficient enough. If anything, I'd change it to the following to avoid magic numbers, which make the code harder to understand in the future.
-- etc.
local mapData = {
    map = maps[math.random(#maps)],
    mode = gameModes[math.random(#gameModes)]
}
-- etc.
print(a.map, a.mode)
And remember:
Premature optimization is the root of all evil.
— Donald Knuth
You did very well by creating a separate function for generating your modes and maps. This separates concerns and keeps the code modular and neat.
Now, you have your game modes in a table modes = {} (which is basically a list of strings), and you have your maps in another table maps = {}.
Each item in a table has a key that, when omitted, becomes a number counting upwards. In your case there are 3 items in modes and 3 items in maps, so the keys are 1, 2, 3. The key is used to grab a certain item in that table (list). E.g. maps[2] grabs the second item in the maps table, whose value is map2. The same applies to the modes table. Hence the output you asked about.
To get a random game mode, you just call math.random(#modes). math.random accepts up to two parameters, which define the range to pick the random number from. You can also pass a single parameter; Lua then assumes you want to start at 1, so math.random(3) actually becomes math.random(1, 3). #modes here stands for "count all game modes in that table and give me that count", which is 3.
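In code:
local modes = {"mode 1", "mode 2", "mode 3"}
print(#modes)               -- 3: the # operator counts the items
print(math.random(#modes))  -- same as math.random(1, 3)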
To return the chosen map and game mode from that function, we can use another table just to hold both values. This time, however, the table has named keys to access the values inside it: "map" and "mode".
Complete example would be:
local function setupRound()
    local modes = {"mode 1", "mode 2", "mode 3"} -- different game modes
    local maps = {"map 1", "map 2", "map 3"} -- different maps
    return {map = maps[math.random(#maps)], mode = modes[math.random(#modes)]}
end

for i = 1, 10 do
    local freshRound = setupRound()
    print(freshRound.map, freshRound.mode)
end

How to share the same index among multiple dask arrays

I'm trying to build a dask-based IPython application that holds a meta-class consisting of several sub-dask-arrays (all shaped (n_samples, dim_1, dim_2, ...)), and it should be able to slice the sub-dask-arrays through its getitem operator.
In the getitem method I call da.Array.compute (the code is still in its very early state), so that I can iterate over batches of the sub-arrays.
class MetaClass(object):
    ...
    def __getitem__(self, inds):
        new_m = MetaClass()
        inds = inds.compute()
        for name, var in vars(self).items():
            if isinstance(var, da.Array):
                try:
                    setattr(new_m, name, var[inds])
                except Exception as e:
                    print(e)
            else:
                setattr(new_m, name, var)
        return new_m

# Here I construct the meta-class to work with some directory.
m = MetaClass('/my/data/...')
# m.type is one of the sub-dask-arrays
m2 = m[m.type == 2]
It works as expected and I get the sliced arrays, but it comes with huge memory consumption; I assume that behind the scenes dask copies the index for each sub-dask-array.
My question is, how do I achieve the same results, without using so much memory?
(I tried not to "compute" the "inds" in getitem, but then I get arrays with unknown (nan) chunk shapes, which cannot be iterated, and iterating is a must for the application.)
I have been thinking about three possible solutions, and I'd be happy to be advised which of them is the "right" one for me (or to get another solution I haven't thought of):
To use a Dask DataFrame, though I'm not sure how to fit multidimensional dask arrays into one (I would really appreciate some help, or even a link that explains how to deal with multidimensional arrays in dd).
To forget about the entire MetaClass and use one dask array with a nasty dtype (something like [("type",int,(1,)),("images",np.uint8,(1000,1000))]); again, I'm not familiar with this and would really appreciate some help with it (tried to google it.. it's a bit complicated..).
To share the index as a global inside the calling function (getitem) with property and its get-function mechanism (https://docs.python.org/2/library/functions.html#property). But the big downside here is that I lose the types of the arrays (a big loss for representation and everything that needs anything but the data itself).
Thanks in advance!!!
You can use map_blocks on the sub-arrays with a shared function that holds the indices in its memory.
Here is an example:
import numpy as np

# `inds` is a plain NumPy boolean mask, computed once and shared
# (via the enclosing scope) by every sub-array.

def bool_mask(arr, block_info=None):
    # block_info tells map_blocks which slice of the full array this block covers
    from_ind, to_ind = block_info[0]["array-location"][0]
    return arr[inds[from_ind:to_ind]]

def getitem(var):
    original_chunks = var.chunks[0]
    tmp_inds = np.cumsum([0] + list(original_chunks))
    from_inds = tmp_inds[:-1]
    to_inds = tmp_inds[1:]
    # the new chunk sizes along axis 0 are the number of True entries per block
    new_chunks_0 = np.array(list(map(lambda f, t: inds[f:t].sum(), from_inds, to_inds)))
    new_chunks = tuple([tuple(new_chunks_0.tolist())] + list(var.chunks[1:]))
    return var.map_blocks(bool_mask, dtype=var.dtype, chunks=new_chunks)
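A hypothetical usage sketch (the array and mask here are made up; in the question's setting, inds would be something like (m.type == 2).compute()):
import dask.array as da
import numpy as np

x = da.random.random((100, 10, 10), chunks=(20, 10, 10))
inds = np.random.rand(100) > 0.5   # the shared NumPy boolean index over axis 0
filtered = getitem(x)              # each block is masked lazily via map_blocks
print(filtered.shape)              # chunk sizes are known, so the shape is too
print(filtered.compute().shape)    # evaluates without a per-array copy of the index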

Is there any way to resize a Surface by any real number value in ROBLOX?

I've been working with the built-in Resize function in Roblox Studio and have been using it to expand the Top Surface of multiple Parts in order to form a wall-like structure.
The only problem that has arisen when using this method is that the surface of the wall created is not even: some Parts are higher than others.
I later discovered that this is because the built-in Resize function only takes integers as its second parameter (the "expand-by" value). Ideally I need the Parts to be able to expand by any real number.
Are there any alternatives to the built-in Resize function that allow one to resize a Surface by any real number?
Yes, this is possible, but it requires a custom function. The key observation is that a Part resizes about its center, so after growing the Size by n along an axis, the Position (or CFrame) must shift by n/2 in the resize direction to keep the opposite face anchored. With that fairly basic math, a simple function accomplishes the task:
local Resize
do
    local directions = {
        [Enum.NormalId.Top]    = {Scale = Vector3.new(0, 1, 0), Position = Vector3.new(0, 1, 0)},
        [Enum.NormalId.Bottom] = {Scale = Vector3.new(0, 1, 0), Position = Vector3.new(0, -1, 0)},
        [Enum.NormalId.Right]  = {Scale = Vector3.new(1, 0, 0), Position = Vector3.new(1, 0, 0)},
        [Enum.NormalId.Left]   = {Scale = Vector3.new(1, 0, 0), Position = Vector3.new(-1, 0, 0)},
        [Enum.NormalId.Front]  = {Scale = Vector3.new(0, 0, 1), Position = Vector3.new(0, 0, 1)},
        [Enum.NormalId.Back]   = {Scale = Vector3.new(0, 0, 1), Position = Vector3.new(0, 0, -1)},
    }
    function Resize(p, d, n, c)
        local prop = c and 'Position' or 'CFrame'
        p.Size = p.Size + directions[d].Scale * n
        p[prop] = p[prop] + directions[d].Position * (n / 2)
        return p.Size, p[prop]
    end
end
Resize(workspace.Part, Enum.NormalId.Bottom, 10, false) -- resize workspace.Part downwards by 10 studs, ignoring collisions
If you're interested more on how and why this code works the way it does, here's a link to a pastebin that's loaded with comments, which I felt would be rather ugly for the answer here: http://pastebin.com/LYKDWZnt

splitting space delimited entries into new columns in R

I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (i.e. questions with more than one response). In the end I want to parse these space-delimited entries into their own columns and create headers for them so I know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6, , , , 3, 1, 1, 0, 0, 0
I imagine this is a relatively common issue to deal with, but I have not been able to find a solution in the R section. Any ideas how to do this in R after importing the .csv? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible response. However, if munging your dataset into a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
(processed_q2 <- process_multichoice(q2))  # wrapping in parentheses prints the result
[[1]]
[1]  1  2  3 NA  4

[[2]]
[1] 2 5
The reason different columns for different responses are suggested is that it is still quite unpleasant to retrieve statistics from the data in this form, although you can do things like:
# Number of responses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you are trying to do or what your reasons are for coding it this way, so my advice is more general; feel free to clarify and I will try to give a more concrete response.
1) I assume that you are coding the survey on your own, which is great because it means you have influence over your .csv file. I would NEVER use different kinds of separators in the same .csv file. Just do the naming from the very beginning, just as you suggested in the second block.
Otherwise you might get into trouble with checkboxes, for example. Say someone checks 3 out of 5 possible answers, and the next person checks only 1 (e.g. "don't know"). It will then be much harder to create a spreadsheet (data.frame) type of results view, as opposed to having an empty field (which becomes an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to run a panel survey (i.e. a longitudinal study asking the same participants over and over again). That (among many other things) would be a good reason to save your data to a MySQL database instead of .csv. RMySQL can connect directly to the database and access its tables and, more importantly, its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.
To avoid re-inventing the wheel you might want to check out LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. The TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (I have only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
Now you have a data.frame filled with 0s, so what you need to do is put 1s in the appropriate positions (this is a bit cumbersome); a function will do the job...
# first you gotta change spaces to commas and convert the character value to a numeric one
# (x is one cell of an _M column, e.g. "3 5 99")
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assign 1 according to the indexes in z
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you'll want to wrap this insertion in a function so it can be applied automatically across your _M variables. Avoid for loops!
Or, even better, create a matrix of logicals and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame and m is the matrix of logicals.
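A minimal sketch of that logical-matrix idea (dtf and m here are hypothetical stand-ins):
dtf <- as.data.frame(matrix(0, nrow = 2, ncol = 5))  # two respondents, five alternatives
m <- matrix(FALSE, nrow = 2, ncol = 5)
m[1, c(1, 3)] <- TRUE  # respondent 1 ticked alternatives 1 and 3
m[2, c(2, 5)] <- TRUE  # respondent 2 ticked alternatives 2 and 5
dtf[m] <- 1            # one assignment, no loops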
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M", colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
  vec <- as.vector(dat[, i])                    # turn the column into a vector
  b <- lapply(strsplit(vec, " "), as.numeric)   # split entries up and turn them numeric
  vals <- sort(unique(unlist(b)))               # which values were used
  newcolnames <- paste(colnames(dat[i]), "_", vals, sep = "")  # column names
  e <- matrix(nrow = responses, ncol = length(vals))  # new matrix for indicators
  colnames(e) <- newcolnames
  # next loop looks for responses and puts indicators in the correct places
  for (j in 1:responses) {
    e[j, ] <- ifelse(vals %in% b[[j]], 1, 0)
  }
  dat <- cbind(dat, e)
}
Suggestions for improvement are welcome.
