I have a folder with hundreds of rasters with different names. Some names are identical except for the last 5 characters before the extension (e.g. "_2010"), which give the year. For example, I have:
"raster_a_2010.tif"
"raster_a_1990.tif"
"raster_f_2010.tif"
"raster_f_1990.tif"
I need to stack the rasters that share the first part of the name (so raster_a_2010 with raster_a_1990), but I need to do it automatically, without specifying each pattern one by one.
So far I did this, but I'm still far from the solution. Basically, I'm trying to create a list of vectors that recognises each pattern, and then I'd like to use this list to create the stack:
raster_year <- base::list.files(file.path (dir_years,"raster"))
#list of raster
files_base <- basename(list.files(file.path (dir_years,"raster")))
files_group <- substring(files_base, 1, nchar(files_base) - 4)
## Group the files by the extracted portion of the base name
files_grouped <- group_by(data.frame(file = raster_year, group = files_group), group)
files_grouped
V <-as.list(as.data.frame(files_grouped))
pattern <- unique(V$group)
file_vector <- list.files(path = dir_years_raster, pattern =files_grouped$group, full.names = TRUE)
I think you are looking for the pattern argument to list.files.
For example:
ff <- list.files(file.path(dir_years,"raster"), pattern="_a_")
Or if you first read all files you can subset them with grep
ff <- c("raster_a_2010.tif" , "raster_a_1990.tif", "raster_f_2010.tif", "raster_f_1990.tif")
grep("_a_", ff, value=TRUE)
#[1] "raster_a_2010.tif" "raster_a_1990.tif"
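To avoid typing each pattern by hand, the shared prefix can be derived from the file names themselves. The following is only a sketch: it assumes every name ends in an underscore, a four-digit year and ".tif", and the names prefix and groups are made up for the illustration.

```r
ff <- c("raster_a_2010.tif", "raster_a_1990.tif",
        "raster_f_2010.tif", "raster_f_1990.tif")

# Strip the trailing "_<year>.tif" to get the part of the name shared within a group
prefix <- sub("_[0-9]{4}\\.tif$", "", ff)

# One character vector of file names per prefix
groups <- split(ff, prefix)
groups

# With real files on disk (and full paths), each group could then be read
# as one multi-layer raster, for example:
# stacks <- lapply(groups, terra::rast)
```

The last step is commented out because it needs the actual files; with list.files(..., full.names = TRUE) as input the same split() call should work on full paths, provided the regular expression still matches the end of each path.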
I'm trying to teach myself coding skills for spatial data analysis. I've been using Robert Hijmans' document, "Spatial Data in R," and so far, it's been great. To test my skills, I'm messing around with a GPX file I got from my smartwatch during a run, but I'm having issues getting my data into a SpatVector of lines (or a line, more specifically). I haven't been able to find anything online on this topic.
As you can see below with a data sample, the SpatVector "run" has point geometries even though "lines" was specified. From Hijmans' example of SpatVectors with lines, I gathered that adding columns with "id" and "part" both equal to 1 does something that enables the data to be converted to a SpatVector with line geometries. Accordingly, in the SpatVector "run2," the geometry is lines.
My questions are: 1) Is adding the "id" and "part" columns necessary? 2) What do they actually do, i.e. why are these columns necessary? 3) Is there a way to go directly from the original data to a SpatVector of lines? In the process I used to get "run2," I lost all the attributes from the original data, and I don't want to lose them.
Thanks!
library(plotKML)
library(terra)
library(sf)
library(lubridate)
library(XML)
library(raster)
#reproducible example
GPX <- structure(list(lon = c(-83.9626053348184, -83.9625438954681,
-83.962496034801, -83.9624336734414, -83.9623791072518, -83.9622404705733,
-83.9621777739376, -83.9620685577393, -83.9620059449226, -83.9619112294167,
-83.9618398994207, -83.9617654681206, -83.9617583435029, -83.9617464412004,
-83.9617786277086, -83.9617909491062, -83.9618581719697), lat = c(42.4169608857483,
42.416949570179, 42.4169420264661, 42.4169377516955, 42.4169291183352,
42.4169017933309, 42.4168863706291, 42.4168564472347, 42.4168310500681,
42.4167814292014, 42.4167292937636, 42.4166279565543, 42.4166054092348,
42.4164886493236, 42.4163396190852, 42.4162954464555, 42.4161833804101
), ele = c("267.600006103515625", "268.20001220703125", "268.79998779296875",
"268.600006103515625", "268.600006103515625", "268.399993896484375",
"268.600006103515625", "268.79998779296875", "268.79998779296875",
"269", "269", "269.20001220703125", "269.20001220703125", "269.20001220703125",
"268.79998779296875", "268.79998779296875", "269"), time = c("2020-10-25T11:30:32.000Z",
"2020-10-25T11:30:34.000Z", "2020-10-25T11:30:36.000Z", "2020-10-25T11:30:38.000Z",
"2020-10-25T11:30:40.000Z", "2020-10-25T11:30:45.000Z", "2020-10-25T11:30:47.000Z",
"2020-10-25T11:30:51.000Z", "2020-10-25T11:30:53.000Z", "2020-10-25T11:30:57.000Z",
"2020-10-25T11:31:00.000Z", "2020-10-25T11:31:05.000Z", "2020-10-25T11:31:06.000Z",
"2020-10-25T11:31:12.000Z", "2020-10-25T11:31:19.000Z", "2020-10-25T11:31:21.000Z",
"2020-10-25T11:31:27.000Z"), extensions = c("18.011677", "18.011977",
"18.012176", "18.012678", "18.013078", "18.013277", "18.013578",
"18.013877", "17.013977", "17.014278", "17.014478", "17.014677",
"17.014676", "17.014677", "16.014477", "16.014477", "16.014576"
)), row.names = c(NA, 17L), class = "data.frame")
crdref <- "+proj=longlat +datum=WGS84"
run <- vect(GPX, type="lines", crs=crdref)
run
data <- cbind(id=1, part=1, GPX$lon, GPX$lat)
run2 <- vect(data, type="lines", crs=crdref)
run2
There is a vect method for a matrix and one for a data.frame. The data.frame method can only make points (and has no type argument, so that is ignored). I will change that into an informative error and clarify this in the manual.
So to make a line, you could do
library(terra)
g <- as.matrix(GPX[,1:2])
v <- vect(g, "lines")
To add attributes you would first need to determine what they are. You have one line but 17 rows in GPX that need to be reduced to one row. You could just take the first row
att <- GPX[1, -c(1:2)]
But you may prefer to take the average instead
GPX$ele <- as.numeric(GPX$ele)
GPX$extensions <- as.numeric(GPX$extensions)
GPX$time <- as.POSIXct(GPX$time)
att <- as.data.frame(lapply(GPX[, -c(1:2)], mean))
# ele time extensions
#1 268.7412 2020-10-25 17.3078
values(v) <- att
Or in one step
v <- vect(g, "lines", atts=att)
v
#class : SpatVector
#geometry : lines
#dimensions : 1, 3 (geometries, attributes)
#extent : -83.96261, -83.96175, 42.41618, 42.41696 (xmin, xmax, ymin, ymax)
#coord. ref. :
#names : ele time extensions
#type : <num> <chr> <num>
#values : 268.7 2020-10-25 17.31
The id and part columns are not necessary if you make a single line. But you need them when you wish to create multiple lines and/or line parts (in a "multi-line").
gg <- cbind(id=rep(1:3, each=6)[-1], part=1, g)
vv <- vect(gg, "lines")
plot(vv, col=rainbow(5), lwd=8)
lines(v)
points(v, cex=2, pch=1)
And with multiple lines you would use id in aggregate to compute attributes for each line.
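As a rough illustration of that last point, base R's aggregate() can reduce the point rows to one attribute row per line id. The small table pts below is invented for the example and is not the GPX data above.

```r
# Hypothetical point table: one row per track point, with the line id it belongs to
pts <- data.frame(
  id  = c(1, 1, 1, 2, 2),
  ele = c(267.6, 268.2, 268.8, 269.0, 269.2)
)

# One row per line: mean elevation of the points that make up each line
att <- aggregate(ele ~ id, data = pts, FUN = mean)
att
```

The resulting data.frame has one row per id, in the same order as the lines built from the id column, so it could be supplied as the attribute table when the lines are created.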
I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (i.e. questions with more than one response). In the end I want to parse these space-delimited entries into their own columns and create headers for them so I know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6, , , , 3, 1, 1, 0, 0, 0
I imagine this is a relatively common issue to deal with but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv ? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible response. However, if munging your dataset to a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of responses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you are trying to do or what your reasons are for coding it like this. Thus my advice is more general, so feel free to clarify and I will try to give a more concrete response.
1) I see that you are coding the survey on your own, which is great because it means you have influence over your .csv file. I would NEVER use different kinds of separators in the same .csv file. Just do the naming from the very beginning, as you suggested in the second block.
Otherwise you might get into trouble with checkboxes, for example. Let's say someone checks 3 out of 5 possible answers and the next person only checks 1 (e.g. "don't know"). Now it will be much harder to create a spreadsheet (data.frame) type of results view, as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to do a panel survey (i.e. a longitudinal study asking the same participants over and over again). That (among many others) would be a good reason to think about saving your data to a MySQL database instead of a .csv file. RMySQL can connect directly to the database and access its tables and, more importantly, its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.
To avoid reinventing the wheel, you might want to check LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. Besides, the TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
Now you have a data.frame of 0s, so what you need to do is put 1s in the appropriate positions (this is a bit cumbersome); a function will do the job...
# x is one space-delimited entry, e.g. the first Q4_M value from your example
x <- "3 5 99"
# first you gotta change spaces to commas and convert the character variable to a numeric one
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assign 1 according to the indexes in z
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you would like to reconsider creating a data.frame with _M variables, so you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix with logicals, and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame, and m is matrix with logicals.
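A minimal sketch of the logical-matrix idea, under some assumptions: the vector entries stands in for one _M column, and the names picked, levels_used and m are invented for the illustration. Unlike the 99-column version above, it only creates columns for the values that actually occur.

```r
# Hypothetical multi-select column with space-delimited responses
entries <- c("3 5 99", "1 2")

# Split each entry and convert the pieces to numbers
picked <- lapply(strsplit(entries, " "), as.numeric)

# All response values that occur anywhere in the column
levels_used <- sort(unique(unlist(picked)))

# Logical matrix: one row per respondent, one column per response value
m <- do.call(rbind, lapply(picked, function(p) levels_used %in% p))

# Start from a data.frame of 0s and flip the selected cells to 1
dtf <- as.data.frame(matrix(0, nrow = length(entries), ncol = length(levels_used)))
names(dtf) <- paste0("Q4_M_", levels_used)
dtf[m] <- 1
dtf
```

Building m with do.call(rbind, ...) keeps it a matrix even when only one response value occurs, which sapply() would not guarantee.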
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M", colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
  vec <- as.vector(dat[, i]) # turn into vector
  b <- lapply(strsplit(vec, " "), as.numeric) # split up and turn into numeric
  lv <- sort(unique(unlist(b))) # which values were used (not named c, to avoid masking c())
  newcolnames <- paste(colnames(dat)[i], "_", lv, sep = "") # column names
  e <- matrix(nrow = responses, ncol = length(lv)) # create new matrix for indicators
  colnames(e) <- newcolnames
  # next loop looks for responses and puts indicators in the correct places
  for (r in 1:responses) { # r rather than i, so the outer loop index is not reused
    e[r, ] <- ifelse(lv %in% b[[r]], 1, 0)
  }
  dat <- cbind(dat, e)
}
Suggestions for improvement are welcome.
I have the following list
["txtvers=1","userid=3A6524D4-E31C-491D-94DD-555883B1600A","name=Jarrod Roberson","version=2"]
I want to create a Dict where the left side of the = is the key and the right side is the value.
Preferably where the key is an atom.
Using the following list comprehension I get this.
KVL = [string:tokens(T,"=") || T <- TXT].
[["txtvers","1"], ["userid","3A6524D4-E31C-491D-94DD-555883B1600A"], ["name","Jarrod Roberson"], ["version","2"]]
What I am struggling with now is how to convert the nested lists into a list of tuples
that I can pass to dict:from_list.
What I want is something like this:
[{txtvers,"1"}, {userid,"3A6524D4-E31C-491D-94DD-555883B1600A"}, {name,"Jarrod Roberson"}, {version,"2"}]
I know there has to be a concise way to do this but I just can't get my head around it.
KVL = [begin [K,V]=string:tokens(T,"="), {list_to_atom(K), V} end || T <- L].
;)
A little disclaimer for anyone else taking hints from this question: it is always a good idea to turn lists into atoms using list_to_existing_atom.
split_keyvalue(Str) ->
    try
        {K, [$=|V]} = lists:splitwith(fun(X) -> X =/= $= end, Str),
        {erlang:list_to_existing_atom(K), V}
    catch
        error:badarg ->
            fail
    end.
split_keyvalues(List) ->
    [KV || {_,_}=KV <- lists:map(fun split_keyvalue/1, List)].
The reason is that it opens up a possible DoS attack if (malicious) user-supplied data can create millions and millions of unique atoms. Atoms are never garbage collected, and the atom table has a fixed maximum size (1,048,576 atoms by default; it can be raised with the +t emulator flag).
Also, tokens splits on every equals sign in the string. Isn't it better to split on the first one only?
Even shorter:
KVL = [{list_to_atom(K), V} || [K,V] <- [string:tokens(T,"=") || T <- L]].
I actually got it to work finally!
A = [ string:tokens(KV,"=") || KV <- TXT].
[["txtvers","1"],
["userid","3A6524D4-E31C-491D-94DD-555883B1600A"],
["name","Jarrod Roberson"],
["version","2"]]
B = [{list_to_atom(K),V} || [K|[V|_]] <- A].
[{txtvers,"1"},
{userid,"3A6524D4-E31C-491D-94DD-555883B1600A"},
{name,"Jarrod Roberson"},
{version,"2"}]
I'd like to work out how much RAM is being used by each of my objects inside my current workspace. Is there an easy way to do this?
some time ago I stole this little nugget from here:
sort( sapply(ls(),function(x){object.size(get(x))}))
it has served me well
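The same idea can be written so that it works on any environment and returns plain numbers in bytes, which makes sorting straightforward. This is only a sketch: the environment env and the objects small and big are made up so the example is self-contained; for the real workspace you would use globalenv() instead.

```r
# Stand-in for the workspace, so the example does not depend on what is loaded
env <- new.env()
assign("small", 1:10, envir = env)
assign("big", numeric(1e5), envir = env)

# Size in bytes of every object in env, as a named numeric vector
sizes <- vapply(
  ls(env),
  function(nm) as.numeric(object.size(get(nm, envir = env))),
  numeric(1)
)

# Largest objects first
sizes <- sort(sizes, decreasing = TRUE)
sizes
```

Passing the environment to both ls() and get() avoids surprises when the code is later wrapped in a function, where a bare ls() would list the function's local variables instead.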
1. by object size
to get memory allocation on an object-by-object basis, call object.size() and pass in the object of interest:
object.size(My_Data_Frame)
(If the argument is a variable, pass it unquoted; if you only have the object's name as a string, wrap it in a get() call.)
you can loop through your namespace and get the size of all of the objects in it, like so:
for (itm in ls()) {
print(formatC(c(itm, object.size(get(itm))),
format="d",
big.mark=",",
width=30),
quote=FALSE)
}
2. by object type
to get memory usage for your namespace, by object type, use memory.profile()
memory.profile()
       NULL      symbol    pairlist     closure environment     promise    language
          1        9434      183964        4125        1359        6963       49425
    special     builtin        char     logical     integer      double     complex
        173        1562       20652        7383       13212        4137           1
(There's another function, memory.size(), but I have heard and read that it only works on Windows. It just returns a value in MB; to get the maximum memory used at any time in the session, use memory.size(max=TRUE).)
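Because memory.size() is Windows-specific (and, as far as I know, only a stub in recent R releases), a platform-independent way to see how much memory R currently has in use is gc(). The sketch below assumes the usual layout of its return value, a matrix whose second column is the memory in use in Mb; the names g and total_mb are made up.

```r
# gc() triggers a garbage collection and returns a matrix of memory statistics
g <- gc()

# Rows are Ncells (fixed-size cells) and Vcells (vector heap);
# the second column holds the memory currently in use, in Mb
total_mb <- sum(g[, 2])
total_mb
```

This gives a session-wide total rather than a per-object breakdown, so it complements object.size() rather than replacing it.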
You could try the lsos() function from this question:
R> a <- rnorm(100)
R> b <- LETTERS
R> lsos()
       Type Size Rows Columns
b character 1496   26      NA
a   numeric  840  100      NA
R>
This question was posted and got legitimate answers a long time ago, but I want to share another useful tip for getting the size of an object: the ll() function from the gdata library.
library(gdata)
ll() # return a dataframe that consists of a variable name as rownames, and class and size (in KB) as columns
subset(ll(), KB > 1000) # list of object that have over 1000 KB
ll()[order(ll()$KB),] # sort by the size (ascending)
another (slightly prettier) option using dplyr
library(dplyr)  # provides %>%
data.frame(object = ls(), stringsAsFactors = FALSE) %>%
  dplyr::mutate(size_unit = object %>% sapply(. %>% get() %>% object.size %>% format(., units = 'auto')),
                size = as.numeric(sapply(strsplit(size_unit, split = ' '), FUN = function(x) x[1])),
                unit = factor(sapply(strsplit(size_unit, split = ' '), FUN = function(x) x[2]), levels = c('Gb', 'Mb', 'Kb', 'bytes'))) %>%
  dplyr::arrange(unit, dplyr::desc(size)) %>%
  dplyr::select(-size_unit)
A data.table function that separates memory and unit for easier sorting:
library(data.table)
ls.obj <- {as.data.table(sapply(ls(),
function(x){format(object.size(get(x)),
nsmall=3,digits=3,unit="Mb")}),keep.rownames=TRUE)[,
c("mem","unit") := tstrsplit(V2, " ", fixed=TRUE)][,
setnames(.SD,"V1","obj")][,.(obj,mem=as.numeric(mem),unit)][order(-mem)]}
ls.obj
obj mem unit
1: obj1 848.283 Mb
2: obj2 37.705 Mb
...
Here's a tidyverse-based function to calculate the size of all objects in your environment:
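library(dplyr)  # provides %>%
# env is a character vector of object names, e.g. weigh_environment(ls())
weigh_environment <- function(env){
  purrr::map_dfr(env, ~ tibble::tibble("object" = .) %>%
                   dplyr::mutate(size = object.size(get(.x)),
                                 size = as.numeric(size),
                                 megabytes = size / 1000000))
}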
I've used the solution from this link
for (thing in ls()) { message(thing); print(object.size(get(thing)), units='auto') }
Works fine!