Manually add dependency in drake workflow? - drake-r-package

Let's say I have a drake plan where I create a SQL table in an external database, and after that job, I download from some table that depends on the initial job. My plan might look like this:
drake_plan(up_job = create_sql_file('some_input.csv'),
           down_job = download_from_sql('my_code.sql'))
Is there a way of manually forcing down_job to be downstream of up_job? There's nothing inherent in create_sql_file or download_from_sql that drake would be able to parse to infer the relationship, but I'd still like to apply it manually.
Thanks!

To have down_job depend on up_job, either up_job or a file_out() created by up_job should be mentioned in the command of down_job.
Example using the up_job return value
library(drake)
plan <- drake_plan(
  db_path = create_sql_db_from(file_in("some_input.csv")),
  down_job = download_from_sql(db = db_path, file_in("my_code.sql"))
)
plan
#> # A tibble: 2 x 2
#>   target   command
#>   <chr>    <chr>
#> 1 db_path  "create_sql_db_from(file_in(\"some_input.csv\"))"
#> 2 down_job "download_from_sql(db = db_path, file_in(\"my_code.sql\"))"
config <- drake_config(plan)
vis_drake_graph(config)
Example with file paths
library(drake)
plan <- drake_plan(
  up_job = create_sql_db_from(file_in("some_input.csv"), file_out("db_path")),
  down_job = download_from_sql(file_in("db_path"), file_in("my_code.sql"))
)
plan
#> # A tibble: 2 x 2
#>   target   command
#>   <chr>    <chr>
#> 1 up_job   "create_sql_db_from(file_in(\"some_input.csv\"), file_out(\"db_…
#> 2 down_job "download_from_sql(file_in(\"db_path\"), file_in(\"my_code.sql\…
config <- drake_config(plan)
vis_drake_graph(config)
Created on 2019-01-25 by the reprex package (v0.2.1)

Related

How do you convert a GPX file directly into a SpatVector of lines while preserving attributes?

I'm trying to teach myself coding skills for spatial data analysis. I've been using Robert Hijmans' document, "Spatial Data in R," and so far, it's been great. To test my skills, I'm messing around with a GPX file I got from my smartwatch during a run, but I'm having issues getting my data into a SpatVector of lines (or a line, more specifically). I haven't been able to find anything online on this topic.
As you can see below with a data sample, the SpatVector "run" has point geometries even though "lines" was specified. From Hijmans' example of SpatVectors with lines, I gathered that adding columns with "id" and "part" both equal to 1 does something that enables the data to be converted to a SpatVector with line geometries. Accordingly, in the SpatVector "run2," the geometry is lines.
My questions are: 1) Is adding the "id" and "part" columns necessary? 2) What do they actually do, i.e. why are these columns needed? 3) Is there a way to go directly from the original data to a SpatVector of lines? In the process I used to get "run2," I lost all the attributes from the original data, and I don't want to lose them.
Thanks!
library(plotKML)
library(terra)
library(sf)
library(lubridate)
library(XML)
library(raster)
#reproducible example
GPX <- structure(list(lon = c(-83.9626053348184, -83.9625438954681,
-83.962496034801, -83.9624336734414, -83.9623791072518, -83.9622404705733,
-83.9621777739376, -83.9620685577393, -83.9620059449226, -83.9619112294167,
-83.9618398994207, -83.9617654681206, -83.9617583435029, -83.9617464412004,
-83.9617786277086, -83.9617909491062, -83.9618581719697), lat = c(42.4169608857483,
42.416949570179, 42.4169420264661, 42.4169377516955, 42.4169291183352,
42.4169017933309, 42.4168863706291, 42.4168564472347, 42.4168310500681,
42.4167814292014, 42.4167292937636, 42.4166279565543, 42.4166054092348,
42.4164886493236, 42.4163396190852, 42.4162954464555, 42.4161833804101
), ele = c("267.600006103515625", "268.20001220703125", "268.79998779296875",
"268.600006103515625", "268.600006103515625", "268.399993896484375",
"268.600006103515625", "268.79998779296875", "268.79998779296875",
"269", "269", "269.20001220703125", "269.20001220703125", "269.20001220703125",
"268.79998779296875", "268.79998779296875", "269"), time = c("2020-10-25T11:30:32.000Z",
"2020-10-25T11:30:34.000Z", "2020-10-25T11:30:36.000Z", "2020-10-25T11:30:38.000Z",
"2020-10-25T11:30:40.000Z", "2020-10-25T11:30:45.000Z", "2020-10-25T11:30:47.000Z",
"2020-10-25T11:30:51.000Z", "2020-10-25T11:30:53.000Z", "2020-10-25T11:30:57.000Z",
"2020-10-25T11:31:00.000Z", "2020-10-25T11:31:05.000Z", "2020-10-25T11:31:06.000Z",
"2020-10-25T11:31:12.000Z", "2020-10-25T11:31:19.000Z", "2020-10-25T11:31:21.000Z",
"2020-10-25T11:31:27.000Z"), extensions = c("18.011677", "18.011977",
"18.012176", "18.012678", "18.013078", "18.013277", "18.013578",
"18.013877", "17.013977", "17.014278", "17.014478", "17.014677",
"17.014676", "17.014677", "16.014477", "16.014477", "16.014576"
)), row.names = c(NA, 17L), class = "data.frame")
crdref <- "+proj=longlat +datum=WGS84"
run <- vect(GPX, type="lines", crs=crdref)
run
data <- cbind(id=1, part=1, GPX$lon, GPX$lat)
run2 <- vect(data, type="lines", crs=crdref)
run2
There is a vect method for a matrix and one for a data.frame. The data.frame method can only make points (and has no type argument, so that is ignored). I will change that into an informative error and clarify this in the manual.
So to make a line, you could do
library(terra)
g <- as.matrix(GPX[,1:2])
v <- vect(g, "lines")
To add attributes you would first need to determine what they are. You have one line but 17 rows in GPX that need to be reduced to one row. You could just take the first row
att <- GPX[1, -c(1:2)]
But you may prefer to take the average instead
GPX$ele <- as.numeric(GPX$ele)
GPX$extensions <- as.numeric(GPX$extensions)
GPX$time <- as.POSIXct(GPX$time)
att <- as.data.frame(lapply(GPX[, -c(1:2)], mean))
#        ele       time extensions
#1 268.7412 2020-10-25    17.3078
values(v) <- att
Or in one step
v <- vect(g, "lines", atts=att)
v
#class       : SpatVector
#geometry    : lines
#dimensions  : 1, 3 (geometries, attributes)
#extent      : -83.96261, -83.96175, 42.41618, 42.41696 (xmin, xmax, ymin, ymax)
#coord. ref. :
#names       :   ele       time extensions
#type        : <num>      <chr>      <num>
#values      : 268.7 2020-10-25      17.31
The id and part columns are not necessary if you make a single line, but you need them when you wish to create multiple lines and/or line parts (a "multi-line").
gg <- cbind(id=rep(1:3, each=6)[-1], part=1, g)
vv <- vect(gg, "lines")
plot(vv, col=rainbow(5), lwd=8)
lines(v)
points(v, cex=2, pch=1)
And with multiple lines you would use id in aggregate to compute attributes for each line.
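For example, a minimal sketch building on the gg/vv objects above (the id grouping and the use of mean are illustrative choices, not the only option):
GPX$id <- rep(1:3, each = 6)[-1]   # same ids used to build gg above
att2 <- aggregate(GPX[, c("ele", "extensions")], by = list(id = GPX$id), FUN = mean)
values(vv) <- att2                 # one row of attributes per line geometry
vv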

stack variables from netCDF file

I'm working with several netCDF files. Each nc file has 33 variables. I need to create a stack for each nc file with these 33 variables to do some calculations, but the only way I know is to convert each variable into a raster and stack them one by one, like this:
library(raster)
library(ncdf4)
library(rgdal)
nc_data <- nc_open('./data/GCAM/RAW/93d4aa096b15491b1ba136b46d8063cdca59d253c75d59791b4d4cb6f8a1ae91/Project ID 68344/GCAM-Demeter/GCAM-Harmonized/Mean_Std/GCAM_Demeter_LU_H_ssp1_rcp26_modelmean_2030.nc')
PTF0 <- nc_data$var[[1]]
data1 <- ncvar_get( nc_data, PTF0 )
data1 <- raster(data1)
plot(data1)
Can anyone help me automate this? Thanks in advance.
This is the structure of the NetCDF file; I highlighted the variables that I need to stack. Actually, I need to stack just PTF1 to PTF8.
What you are doing now could be done like this
f <- 'GCAM_Demeter_LU_H_ssp1_rcp26_modelmean_2030.nc'
library(raster)
r <- raster(f, "PTF0")
(assuming that "PTF0" is a variable name)
But if you want to create a single object for multiple variables in one step, use terra instead
library(terra)
r <- rast(f)
You can specify the variables you want
rr <- rast(f, c("PFT1", "PTF2"))
You can also create a SpatRasterDataSet and then extract the variables you want like this
s <- sds(f)
x <- rast(s[2:8])
Thanks, Mr. Robert! With these few lines I solved the issue and can do all I need.
library(raster)
library(ncdf4)
library(rgdal)
library(terra)
f <- nc_open('./data/GCAM/RAW/93d4aa096b15491b1ba136b46d8063cdca59d253c75d59791b4d4cb6f8a1ae91/Project ID 68344/GCAM-Demeter/GCAM-Harmonized/Mean_Std/GCAM_Demeter_LU_H_ssp1_rcp26_modelmean_2030.nc')
rr <- rast(f$filename)
rrr <- rr[[2:8]]
plot(rrr)
soma <- sum(rrr)
plot(soma)
Now I'll spend more time thinking about how to automate this, because I need to do this operation for each one of the NetCDF files...
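One way this could be automated, sketched under the assumption that all the .nc files sit under the same directory and share the same variable order (layers 2 to 8 being PTF1 to PTF8):
library(terra)
# Directory and file pattern are assumptions; adjust to the actual layout.
files <- list.files("./data/GCAM/RAW", pattern = "\\.nc$",
                    recursive = TRUE, full.names = TRUE)
somas <- lapply(files, function(f) {
  r <- rast(f)      # all variables of one file as layers
  sum(r[[2:8]])     # same subset as above
})
names(somas) <- basename(files)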

Can rmarkdown return a value to a target

I find myself using rmarkdown/rnotebooks quite a bit to do exploratory analysis, since I can combine code, prose and graphs. Many times, I'll write my entire predictive modeling approach and the model itself within markdown.
However, I then end up with forecast models embedded within rmarkdown, unlinked to a target within my drake_plan. Today, I save these to disk first, then read them back into my plan using file_in() or a similar approach.
My question is - can I have a markdown document return an object directly to a drake target?
Conceptually:
plan = drake_plan(
  dat = read_data(),
  model = analyze_data(dat)
)
analyze_data = function(dat){
  result = render(....)
  return(result)
}
This way - I can get my model directly into my drake target, but if I need to investigate the model, I can open up my markdown/HTML.
I recommend you include those models as targets in the plan, but what you describe is possible. R Markdown and knitr automatically run code chunks in the calling environment, so the variable assignments you make in the report are available.
library(drake)
library(tibble)
simulate <- function(n){
  tibble(x = rnorm(n), y = rnorm(n))
}
render_and_return <- function(input, output) {
  rmarkdown::render(input, output_file = output, quiet = TRUE)
  return_value # Assigned in the report.
}
lines <- c(
  "---",
  "output: html_document",
  "---",
  "",
  "```{r show_data}",
  "return_value <- head(readd(large))", # return_value gets assigned here.
  "```"
)
writeLines(lines, "report.Rmd")
plan <- drake_plan(
  large = simulate(1000),
  subset = render_and_return(knitr_in("report.Rmd"), file_out("report.html"))
)
make(plan)
#> target large
#> target subset
readd(subset)
#> # A tibble: 6 x 2
#>        x       y
#>    <dbl>   <dbl>
#> 1  1.30  -0.912
#> 2 -0.327  0.0622
#> 3  1.29   1.18
#> 4 -1.52   1.06
#> 5 -1.18   0.0295
#> 6 -0.985 -0.0475
Created on 2019-10-10 by the reprex package (v0.3.0)

How do I perform a "diff" on two Sources given a key using Apache Beam Python SDK?

I posed the question generically, because maybe it is a generic answer. But a specific example is comparing 2 BigQuery tables with the same schema, but potentially different data. I want a diff, i.e. what was added, deleted, modified, with respect to a composite key, e.g. the first 2 columns.
Table A
C1  C2  C3
----------
a   a   1
a   b   1
a   c   1
Table B
C1  C2  C3   # Notes if comparing B to A
----------------------------------------
a   a   1    # No change to the key a + a
a   b   2    # Key a + b changed from 1 to 2
             # Deleted key a + c with value 1
a   d   1    # Added key a + d
I basically want to be able to make/report the comparison notes.
Or, from a Beam perspective, I may want to just output up to 4 labeled PCollections: Unchanged, Changed, Added, Deleted. How do I do this and what would the PCollections look like?
What you want to do here, basically, is join two tables and compare the result of that, right? You can look at my answer to this question to see the two ways in which you can join two tables (side inputs, or CoGroupByKey).
I'll also code a solution for your problem using CoGroupByKey. I'm writing the code in Python because I'm more familiar with the Python SDK, but you'd implement similar logic in Java:
import apache_beam as beam
from apache_beam import pvalue
from apache_beam.io import WriteToText

def make_kv_pair(x):
    """Output the record with the (x[0], x[1]) composite key added."""
    return ((x[0], x[1]), x)

# p is an existing Pipeline object
table_a = (p | 'ReadTableA' >> beam.Read(beam.io.BigQuerySource(....))
             | 'SetKeysA' >> beam.Map(make_kv_pair))
table_b = (p | 'ReadTableB' >> beam.Read(beam.io.BigQuerySource(....))
             | 'SetKeysB' >> beam.Map(make_kv_pair))

joined_tables = ({'table_a': table_a, 'table_b': table_b}
                 | beam.CoGroupByKey())

output_types = ['changed', 'added', 'removed', 'unchanged']

class FilterDoFn(beam.DoFn):
    def process(self, element):
        key, values = element
        table_a_value = list(values['table_a'])
        table_b_value = list(values['table_b'])
        if table_a_value == table_b_value:
            yield pvalue.TaggedOutput('unchanged', key)
        elif len(table_a_value) < len(table_b_value):
            yield pvalue.TaggedOutput('added', key)
        elif len(table_a_value) > len(table_b_value):
            yield pvalue.TaggedOutput('removed', key)
        elif table_a_value != table_b_value:
            yield pvalue.TaggedOutput('changed', key)

key_collections = (joined_tables
                   | beam.ParDo(FilterDoFn()).with_outputs(*output_types))

# Now you can handle each output
key_collections.unchanged | WriteToText(...)
key_collections.changed | WriteToText(...)
key_collections.added | WriteToText(...)
key_collections.removed | WriteToText(...)

Parallel query of SQLite database in R

I have a large database (~100 Gb) from which I need to pull every entry, perform some comparisons on it, and then store the results of those comparisons. I have attempted to run parallel queries within a single R session without any success. I can just run multiple R sessions all at once, but I am looking for a better approach. Here is what I attempted:
library(RSQLite)
library(data.table)
library(foreach)
library(doMC)
#---------
# SETUP
#---------
#connect to db
db <- dbConnect(SQLite(), dbname="genes_drug_combos.sqlite")
#---------
# QUERY
#---------
# 856086 combos = 1309 * 109 * 6
registerDoMC(8)
# I would run 6 separate R sessions (one for each i)
res_list <- foreach(i=1:6) %dopar% {
  a <- i*109-108
  b <- i*109
  pb <- txtProgressBar(min=a, max=b, style=3)
  res <- list()
  for (j in a:b) {
    # get preds for drug combos
    statement <- paste("SELECT * from combo_tstats WHERE rowid BETWEEN", (j*1309)-1308, "AND", j*1309)
    combo_preds <- dbGetQuery(db, statement)
    # here I do some stuff to the result returned from the query
    combo_names <- combo_preds$drug_combo
    combo_preds <- as.data.frame(t(combo_preds[,-1]))
    colnames(combo_preds) <- combo_names
    # get top drug combos
    top_combos <- get_top_drugs(query_genes, drug_info=combo_preds, es=T)
    # update progress and store result
    setTxtProgressBar(pb, j)
    res[[length(res)+1]] <- top_combos
  }
  # bind results together
  res <- rbindlist(res)
}
I don't get any errors, but only one core spins up. In contrast, if I run multiple R sessions, all my cores go at it. What am I doing wrong?
Some things I have learned while concurrently accessing the same SQLite database file with RSQLite:
1. Make sure each worker has its own DB connection.
parallel::clusterEvalQ(cl = cl, {
  db.conn <- RSQLite::dbConnect(RSQLite::SQLite(), "./export/models.sqlite")
  RSQLite::dbClearResult(RSQLite::dbSendQuery(db.conn, "PRAGMA busy_timeout=5000;"))
})
2. Use PRAGMA busy_timeout=5000;
By default this is set to 0, and chances are that you will end up with a "database is locked" error each time a worker tries to write to the DB while it is locked. The previous code sets this PRAGMA in each worker connection. Note that SELECT operations are never locked, only INSERT/DELETE/UPDATE.
3. Use PRAGMA journal_mode=WAL;
This only has to be set once and stays on by default forever. It will add two (more or less permanent) files to the DB, and it will improve concurrent read/write performance.
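For example, a minimal sketch of setting it once up front (the database file name is the one from the question):
con <- RSQLite::dbConnect(RSQLite::SQLite(), "genes_drug_combos.sqlite")
RSQLite::dbClearResult(RSQLite::dbSendQuery(con, "PRAGMA journal_mode=WAL;"))  # the setting persists in the DB file
RSQLite::dbDisconnect(con)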
With the above settings I have not experienced this issue.
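Putting these points together, here is a sketch of how the original foreach loop could be restructured so that each worker opens its own connection; get_top_drugs, query_genes, and the table layout are taken from the question and assumed to exist:
library(RSQLite)
library(data.table)
library(foreach)
library(doMC)

registerDoMC(8)

res_list <- foreach(i = 1:6) %dopar% {
  # Each worker opens (and later closes) its own connection.
  db <- dbConnect(SQLite(), dbname = "genes_drug_combos.sqlite")
  dbClearResult(dbSendQuery(db, "PRAGMA busy_timeout=5000;"))
  res <- list()
  for (j in (i * 109 - 108):(i * 109)) {
    statement <- paste("SELECT * FROM combo_tstats WHERE rowid BETWEEN",
                       (j * 1309) - 1308, "AND", j * 1309)
    combo_preds <- dbGetQuery(db, statement)
    combo_names <- combo_preds$drug_combo
    combo_preds <- as.data.frame(t(combo_preds[, -1]))
    colnames(combo_preds) <- combo_names
    # get_top_drugs() and query_genes come from the question
    res[[length(res) + 1]] <- get_top_drugs(query_genes, drug_info = combo_preds, es = TRUE)
  }
  dbDisconnect(db)
  rbindlist(res)
}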
