Can rmarkdown return a value to a target - drake-r-package

I find myself using rmarkdown/rnotebooks quite a bit to do exploratory analysis, since I can combine code, prose, and graphs. Many times, I'll write my entire predictive modeling approach, and the model itself, within markdown.
However, I then end up with forecast models embedded within rmarkdown, unlinked to any target within my drake_plan. Today, I save these to disk first, then read them back into my plan using file_in() or a similar approach.
My question is - can I have a markdown document return an object directly to a drake target?
Conceptually:
plan = drake_plan(
  dat = read_data(),
  model = analyze_data(dat)
)

analyze_data = function(dat) {
  result = render(....)
  return(result)
}
This way I can get my model directly into a drake target, but if I need to investigate the model, I can still open up the markdown/HTML.

I recommend you include those models as targets in the plan, but what you describe is possible. R Markdown and knitr run code chunks in the calling environment by default, so the variable assignments you make in the report are available after rmarkdown::render() returns.
library(drake)
library(tibble)

simulate <- function(n) {
  tibble(x = rnorm(n), y = rnorm(n))
}

render_and_return <- function(input, output) {
  rmarkdown::render(input, output_file = output, quiet = TRUE)
  return_value # Assigned in the report.
}

lines <- c(
  "---",
  "output: html_document",
  "---",
  "",
  "```{r show_data}",
  "return_value <- head(readd(large))", # return_value gets assigned here.
  "```"
)
writeLines(lines, "report.Rmd")

plan <- drake_plan(
  large = simulate(1000),
  subset = render_and_return(knitr_in("report.Rmd"), file_out("report.html"))
)
make(plan)
#> target large
#> target subset
readd(subset)
#> # A tibble: 6 x 2
#>         x       y
#>     <dbl>   <dbl>
#> 1  1.30   -0.912
#> 2 -0.327   0.0622
#> 3  1.29    1.18
#> 4 -1.52    1.06
#> 5 -1.18    0.0295
#> 6 -0.985  -0.0475
Created on 2019-10-10 by the reprex package (v0.3.0)

Related

How do you convert a GPX file directly into a SpatVector of lines while preserving attributes?

I'm trying to teach myself coding skills for spatial data analysis. I've been using Robert Hijmans' document, "Spatial Data in R," and so far, it's been great. To test my skills, I'm messing around with a GPX file I got from my smartwatch during a run, but I'm having issues getting my data into a SpatVector of lines (or a line, more specifically). I haven't been able to find anything online on this topic.
As you can see below with a data sample, the SpatVector "run" has point geometries even though "lines" was specified. From Hijmans' example of SpatVectors with lines, I gathered that adding columns with "id" and "part" both equal to 1 does something that enables the data to be converted to a SpatVector with line geometries. Accordingly, in the SpatVector "run2," the geometry is lines.
My questions are: 1) Is adding the "id" and "part" columns necessary? 2) What do they actually do, i.e. why are these columns needed? 3) Is there a way to go directly from the original data to a SpatVector of lines? In the process I used to get "run2," I lost all the attributes from the original data, and I don't want to lose them.
Thanks!
library(plotKML)
library(terra)
library(sf)
library(lubridate)
library(XML)
library(raster)
#reproducible example
GPX <- structure(list(lon = c(-83.9626053348184, -83.9625438954681,
-83.962496034801, -83.9624336734414, -83.9623791072518, -83.9622404705733,
-83.9621777739376, -83.9620685577393, -83.9620059449226, -83.9619112294167,
-83.9618398994207, -83.9617654681206, -83.9617583435029, -83.9617464412004,
-83.9617786277086, -83.9617909491062, -83.9618581719697), lat = c(42.4169608857483,
42.416949570179, 42.4169420264661, 42.4169377516955, 42.4169291183352,
42.4169017933309, 42.4168863706291, 42.4168564472347, 42.4168310500681,
42.4167814292014, 42.4167292937636, 42.4166279565543, 42.4166054092348,
42.4164886493236, 42.4163396190852, 42.4162954464555, 42.4161833804101
), ele = c("267.600006103515625", "268.20001220703125", "268.79998779296875",
"268.600006103515625", "268.600006103515625", "268.399993896484375",
"268.600006103515625", "268.79998779296875", "268.79998779296875",
"269", "269", "269.20001220703125", "269.20001220703125", "269.20001220703125",
"268.79998779296875", "268.79998779296875", "269"), time = c("2020-10-25T11:30:32.000Z",
"2020-10-25T11:30:34.000Z", "2020-10-25T11:30:36.000Z", "2020-10-25T11:30:38.000Z",
"2020-10-25T11:30:40.000Z", "2020-10-25T11:30:45.000Z", "2020-10-25T11:30:47.000Z",
"2020-10-25T11:30:51.000Z", "2020-10-25T11:30:53.000Z", "2020-10-25T11:30:57.000Z",
"2020-10-25T11:31:00.000Z", "2020-10-25T11:31:05.000Z", "2020-10-25T11:31:06.000Z",
"2020-10-25T11:31:12.000Z", "2020-10-25T11:31:19.000Z", "2020-10-25T11:31:21.000Z",
"2020-10-25T11:31:27.000Z"), extensions = c("18.011677", "18.011977",
"18.012176", "18.012678", "18.013078", "18.013277", "18.013578",
"18.013877", "17.013977", "17.014278", "17.014478", "17.014677",
"17.014676", "17.014677", "16.014477", "16.014477", "16.014576"
)), row.names = c(NA, 17L), class = "data.frame")
crdref <- "+proj=longlat +datum=WGS84"
run <- vect(GPX, type="lines", crs=crdref)
run
data <- cbind(id=1, part=1, GPX$lon, GPX$lat)
run2 <- vect(data, type="lines", crs=crdref)
run2
There is a vect method for a matrix and one for a data.frame. The data.frame method can only make points (and has no type argument, so that is ignored). I will change that into an informative error and clarify this in the manual.
So to make a line, you could do
library(terra)
g <- as.matrix(GPX[,1:2])
v <- vect(g, "lines")
To add attributes you would first need to determine what they are. You have one line but 17 rows in GPX that need to be reduced to one row. You could just take the first row
att <- GPX[1, -c(1:2)]
But you may prefer to take the average instead
GPX$ele <- as.numeric(GPX$ele)
GPX$extensions <- as.numeric(GPX$extensions)
GPX$time <- as.POSIXct(GPX$time)
att <- as.data.frame(lapply(GPX[, -c(1:2)], mean))
# ele time extensions
#1 268.7412 2020-10-25 17.3078
values(v) <- att
Or in one step
v <- vect(g, "lines", atts=att)
v
#class : SpatVector
#geometry : lines
#dimensions : 1, 3 (geometries, attributes)
#extent : -83.96261, -83.96175, 42.41618, 42.41696 (xmin, xmax, ymin, ymax)
#coord. ref. :
#names : ele time extensions
#type : <num> <chr> <num>
#values : 268.7 2020-10-25 17.31
The id and part columns are not necessary if you make a single line. But you need them when you wish to create multiple lines and/or line parts (in a "multi-line").
gg <- cbind(id=rep(1:3, each=6)[-1], part=1, g)
vv <- vect(gg, "lines")
plot(vv, col=rainbow(5), lwd=8)
lines(v)
points(v, cex=2, pch=1)
And with multiple lines you would use id in aggregate to compute attributes for each line.
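For example, a minimal sketch of that aggregation, assuming the same per-vertex ids used to build gg above, and that ele and extensions have already been converted to numeric as shown earlier:
line_id <- rep(1:3, each = 6)[-1]                       # same ids used to build gg
att_by_line <- aggregate(GPX[, c("ele", "extensions")],
                         by = list(id = line_id), FUN = mean)
vv <- vect(gg, "lines", atts = att_by_line)             # one attribute row per line
vv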

Manually add dependency in drake workflow?

Let's say I have a drake plan where I create a SQL table in an external database, and after that job, I download from some table that depends on the initial job. My plan might look like this
drake_plan(
  up_job = create_sql_file('some_input.csv'),
  down_job = download_from_sql('my_code.sql')
)
Is there a way of manually forcing down_job to be downstream of up_job? There's nothing inherent in create_sql_file() or download_from_sql() that drake could parse to infer the relationship, but I'd still like to enforce it manually.
Thanks!
To have down_job depend on up_job, either up_job or a file_out() created by up_job should be mentioned in the command of down_job.
Example using the up_job return value
library(drake)

plan <- drake_plan(
  db_path = create_sql_db_from(file_in("some_input.csv")),
  down_job = download_from_sql(db = db_path, file_in("my_code.sql"))
)

plan
#> # A tibble: 2 x 2
#>   target   command
#>   <chr>    <chr>
#> 1 db_path  "create_sql_db_from(file_in(\"some_input.csv\"))"
#> 2 down_job "download_from_sql(db = db_path, file_in(\"my_code.sql\"))"

config <- drake_config(plan)
vis_drake_graph(config)
Example with file paths
library(drake)

plan <- drake_plan(
  up_job = create_sql_db_from(file_in("some_input.csv"), file_out("db_path")),
  down_job = download_from_sql(file_in("db_path"), file_in("my_code.sql"))
)

plan
#> # A tibble: 2 x 2
#>   target   command
#>   <chr>    <chr>
#> 1 up_job   "create_sql_db_from(file_in(\"some_input.csv\"), file_out(\"db_…
#> 2 down_job "download_from_sql(file_in(\"db_path\"), file_in(\"my_code.sql\…

config <- drake_config(plan)
vis_drake_graph(config)
Created on 2019-01-25 by the reprex package (v0.2.1)
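If you would rather keep the original function signatures untouched, another way to use the first option is to simply mention up_job inside down_job's command; the reference alone is enough for drake's static code analysis to record the dependency. A minimal sketch, assuming download_from_sql() needs no value from up_job:
plan <- drake_plan(
  up_job = create_sql_file(file_in("some_input.csv")),
  down_job = {
    up_job  # referenced only to declare the dependency
    download_from_sql(file_in("my_code.sql"))
  }
)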

Spark streaming join weird results

I'm trying to observe how Spark Streaming uses the RDDs inside a DStream to join two DStreams, but I'm seeing strange results, which is confusing.
In my code, I collect data from a socket stream and split it into two paired DStreams by some logic. To have a few batches available for the join, I created a window covering the last three batches. However, the results of the join make no sense to me. Please help me understand.
object Join extends App {

  val conf = new SparkConf().setMaster("local[4]").setAppName("KBN Streaming")
  val sc = new SparkContext(conf)
  sc.setLogLevel("ERROR")

  val BATCH_INTERVAL_SEC = 10
  val ssc = new StreamingContext(sc, Seconds(BATCH_INTERVAL_SEC))
  val lines = ssc.socketTextStream("localhost", 8091)
  //println(s"lines.slideDuration : ${lines.slideDuration}")
  //lines.print()

  val ds = lines.map(x => x)

  import scala.util.Random
  val randNums = List(1, 2, 3, 4, 5, 6)

  val less = ds.filter(x => x.length <= 2)
  val lessPairs = less.map(x => (Random.nextInt(randNums.size), x))
  lessPairs.print

  val greater = ds.filter(x => x.length > 2)
  val greaterPairs = greater.map(x => (Random.nextInt(randNums.size), x))
  greaterPairs.print

  val join = lessPairs.join(greaterPairs).window(Seconds(30), Seconds(30))
  join.print

  ssc.start
  ssc.awaitTermination
}
Test Results:
-------------------------------------------
Time: 1473344240000 ms
-------------------------------------------
(1,b)
(4,s)

-------------------------------------------
Time: 1473344240000 ms
-------------------------------------------
(5,333)

-------------------------------------------
Time: 1473344250000 ms
-------------------------------------------
(2,x)

-------------------------------------------
Time: 1473344250000 ms
-------------------------------------------
(4,the)

-------------------------------------------
Time: 1473344260000 ms
-------------------------------------------
(2,a)
(0,b)

-------------------------------------------
Time: 1473344260000 ms
-------------------------------------------
(2,ten)
(1,one)
(3,two)

-------------------------------------------
Time: 1473344260000 ms
-------------------------------------------
(4,(b,two))
When join is called, the two RDDs are recomputed, so they contain different values than the ones shown when printed. We therefore need to cache both RDDs when they are computed for the first time, so that the same values are used when join is called later (instead of recomputing both RDDs again). I tried this on multiple examples and it works fine; I was missing a basic core concept of Spark.
Excerpt from "Learning Spark" book:
Persistence (Caching)
As discussed earlier, Spark RDDs are lazily evaluated, and sometimes we may wish to use the same RDD multiple times. If we do this naively, Spark will recompute the RDD and all of its dependencies each time we call an action on the RDD.

Yahoo Finance - How to get companies Key Statistics

I have used CodeProject to get share data from Yahoo
(http://www.codeproject.com/Articles/37550/Stock-quote-and-chart-from-Yahoo-in-C).
In Yahoo Finance there are 'Key Statistics' which I would like to use, but they are not available by this means (e.g. the data at http://uk.finance.yahoo.com/q/ks?s=BNZL.L). Is there any way to get this information directly? I would really rather not screen-scrape if possible.
I am using C#/.NET4.
You can use my .NET library, Yahoo! Managed. It has the MaasOne.Finance.YahooFinance.CompanyStatisticsDownload class, which does exactly what you want.
P.S.: You need to use the latest version (0.10.1); v0.10.0.2 is obsolete for the Key Statistics download.
I landed on this question while searching for an answer a couple of days ago, and thought I would share a solution I created in R (and posted on R-Bloggers). I know the answer I am providing is not in C#, but XPath and XML are supported in every language, so you can use this approach there. The blog post is at http://www.r-bloggers.com/pull-yahoo-finance-key-statistics-instantaneously-using-xml-and-xpath-in-r/
#######################################################################
## Alternate method to download all key stats using XML and XPath - PREFERRED WAY
#######################################################################
setwd("C:/Users/i827456/Pictures/Blog/Oct-25")
require(XML)
require(plyr)

getKeyStats_xpath <- function(symbol) {
  yahoo.URL <- "http://finance.yahoo.com/q/ks?s="
  html_text <- htmlParse(paste(yahoo.URL, symbol, sep = ""), encoding = "UTF-8")

  # Search for <td> nodes anywhere that have class 'yfnc_tablehead1'
  nodes <- getNodeSet(html_text, "/*//td[@class='yfnc_tablehead1']")

  if (length(nodes) > 0) {
    measures <- sapply(nodes, xmlValue)

    # Clean up the column names
    measures <- gsub(" *[0-9]*:", "", gsub(" \\(.*?\\)[0-9]*:", "", measures))

    # Remove duplicates by appending an index to repeated names
    dups <- which(duplicated(measures))
    # print(dups)
    for (i in 1:length(dups))
      measures[dups[i]] = paste(measures[dups[i]], i, sep = " ")

    # Use the sibling of each <td> node to get the value
    values <- sapply(nodes, function(x) xmlValue(getSibling(x)))

    df <- data.frame(t(values))
    colnames(df) <- measures
    return(df)
  } else {
    return(NULL)  # nothing found for this symbol
  }
}

tickers <- c("AAPL")
stats <- ldply(tickers, getKeyStats_xpath)
rownames(stats) <- tickers
write.csv(t(stats), "FinancialStats_updated.csv", row.names = TRUE)
#######################################################################
If you don't mind using the key statistics from BarChart.com, here is a simple function script:
library(XML)

getKeyStats <- function(symbol) {
  barchart.URL <- "http://www.barchart.com/profile.php?sym="
  barchart.URL.Suffix <- "&view=key_statistics"
  html_table <- readHTMLTable(paste(barchart.URL, symbol, barchart.URL.Suffix, sep = ""))
  df_keystats = html_table[[5]]
  print(df_keystats)
}
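A usage sketch, assuming the BarChart profile page still returns the key statistics as the fifth HTML table:
getKeyStats("AAPL")  # prints the key statistics table scraped from BarChart.com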

Determining memory usage of objects?

I'd like to work out how much RAM is being used by each of my objects inside my current workspace. Is there an easy way to do this?
Some time ago I stole this little nugget from here:
sort(sapply(ls(), function(x) { object.size(get(x)) }))
It has served me well.
1. by object size
To get memory allocation on an object-by-object basis, call object.size() and pass in the object of interest:
object.size(My_Data_Frame)
(Pass in the object itself, unquoted; if you only have the object's name as a string, wrap it in a get() call.)
You can loop through your workspace and get the size of all of the objects in it, like so:
for (itm in ls()) {
  print(formatC(c(itm, object.size(get(itm))),
                format = "d",
                big.mark = ",",
                width = 30),
        quote = F)
}
2. by object type
To get memory usage for your session, by object type, use memory.profile():
memory.profile()
       NULL      symbol    pairlist     closure environment     promise    language
          1        9434      183964        4125        1359        6963       49425
    special     builtin        char     logical     integer      double     complex
        173        1562       20652        7383       13212        4137           1
(There's another function, memory.size(), but I have heard and read that it only seems to work on Windows. It returns a value in MB; so to get the maximum memory used at any time in the session, use memory.size(max = T).)
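For completeness, a quick illustration of those Windows-only calls described in the note above:
memory.size()            # memory currently in use by R, in MB (Windows only)
memory.size(max = TRUE)  # maximum memory used at any point in this session, in MB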
You could try the lsos() function from this question:
R> a <- rnorm(100)
R> b <- LETTERS
R> lsos()
       Type Size Rows Columns
b character 1496   26      NA
a   numeric  840  100      NA
R>
This question was posted and got legitimate answers long ago, but I want to share another useful tip: you can get the size of objects using the gdata library and its ll() function.
library(gdata)
ll()                      # returns a data frame with variable names as rownames, and class and size (in KB) as columns
subset(ll(), KB > 1000)   # list objects larger than 1000 KB
ll()[order(ll()$KB), ]    # sort by size (ascending)
Another (slightly prettier) option using dplyr:
data.frame('object' = ls()) %>%
  dplyr::mutate(
    size_unit = object %>% sapply(. %>% get() %>% object.size %>% format(., unit = 'auto')),
    size = as.numeric(sapply(strsplit(size_unit, split = ' '), FUN = function(x) x[1])),
    unit = factor(sapply(strsplit(size_unit, split = ' '), FUN = function(x) x[2]),
                  levels = c('Gb', 'Mb', 'Kb', 'bytes'))
  ) %>%
  dplyr::arrange(unit, dplyr::desc(size)) %>%
  dplyr::select(-size_unit)
A data.table function that separates memory and unit for easier sorting:
ls.obj <- {
  as.data.table(
    sapply(ls(), function(x) {
      format(object.size(get(x)), nsmall = 3, digits = 3, unit = "Mb")
    }),
    keep.rownames = TRUE
  )[, c("mem", "unit") := tstrsplit(V2, " ", fixed = TRUE)
  ][, setnames(.SD, "V1", "obj")
  ][, .(obj, mem = as.numeric(mem), unit)
  ][order(-mem)]
}
ls.obj
    obj     mem unit
1: obj1 848.283   Mb
2: obj2  37.705   Mb
...
Here's a tidyverse-based function to calculate the size of all objects in your environment:
weigh_environment <- function(env) {
  purrr::map_dfr(env, ~ tibble::tibble("object" = .) %>%
    dplyr::mutate(size = object.size(get(.x)),
                  size = as.numeric(size),
                  megabytes = size / 1000000))
}
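A usage sketch, assuming the objects of interest live in the global environment so that get() can find them:
weigh_environment(ls())  # one row per object, with size in bytes and megabytes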
I've used the solution from this link
for (thing in ls()) { message(thing); print(object.size(get(thing)), units='auto') }
Works fine!
