Create a dummy variable for all neighboring regions depending on another dummy variable (sf object) - spatial

I have a spatial dataframe (sf) of all European NUTS2 regions. Within this sf object certain regions have a dummy = 1, and others a dummy = 0.
How can I create a new dummy (let's say "dummy_neighbor") where all the neighboring (st_touches?) regions of the regions with dummy==1 get dummy_neighbor==1, and all those not touching a dummy==1 region get dummy_neighbor==0?

For the time being I have this work-around, but I guess there must be an easier solution?
# load packages
library(sf)
library(here)
library(tidyverse)
library(spdep)
library(expp)
# import nuts 2 sf object
nuts2_sf <- st_read(here("Data", "nuts2_data", "final_nuts.shp"))
# take row numbers as extra column for later
nuts2_sf$rownumber = 1:nrow(nuts2_sf)
#neighbouring list
neighbour <- poly2nb(nuts2_sf, row.names="NUTS_ID", queen=TRUE)
# transform nb into data frame
nb_df <- plyr::ldply(neighbour, rbind)
nb_df$rownumber = 1:nrow(nb_df) # get rownumber as column
# merge neighbour-list-df with sf by rownumber
df <- merge(nuts2_sf, nb_df, by = "rownumber")
# extract all neighbours of tp100_d=1
# (tp100_d is the name of the original dummy variable)
df_dummy <- df %>% filter(tp100_d == 1)
df_dummy$geometry <- NULL
all_neighbours <- as.vector(as.matrix(df_dummy[,c(66:76)])) %>% unique(.)
# create new neighbourhood dummy for all neighbours (but not if it's a neighbour that has a 1 in its original dummy)
nuts2_sf <- nuts2_sf %>% mutate(nb_dummy = ifelse(rownumber %in% all_neighbours & tp100_d == 0, 1, 0))
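(For reference, here is an untested sketch of the kind of shorter solution I am hoping for, using st_touches() directly on the sf object; tp100_d is the original dummy as above:)
# st_touches() returns, for each region, the row indices of the regions it touches
touching <- st_touches(nuts2_sf)
flagged <- which(nuts2_sf$tp100_d == 1)
# dummy_neighbor = 1 if the region touches at least one tp100_d == 1 region
# and is not itself flagged
nuts2_sf <- nuts2_sf %>%
  mutate(dummy_neighbor = as.integer(lengths(lapply(touching, intersect, flagged)) > 0 & tp100_d == 0))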

Related

How to optimize my MIP model to find the best combination of columns and rows with a minimum of NaN values?

I have a dataframe with lots of NaN values. My objective is to find the best combination of columns and rows that maximizes my data and minimizes the NaN values. One solution I found is to use the ompr package and create a MIP model to solve the problem.
Here's the model :
m <- +!is.na(M) # 0/1 matrix: 0 if NA, else 1
nr <- nrow(m)
nc <- ncol(m)
n_years <- 17
model <- MIPModel() %>%
# keep[i,j] is 1 if matrix cell [i,j] is to be kept else 0
add_variable(keep[i,j], i = 1:nr, j = 1:nc, type = "binary") %>%
# rm_row[i] is 1 if row i is selected for removal else 0
add_variable(rm_row[i], i = 1:nr, type = "binary") %>%
# rm_col[j] is 1 if column j is selected for removal else 0
add_variable(rm_col[j], j = 1:nc, type = "binary") %>%
# maximize good cells kept
set_objective(sum_expr(keep[i,j], i = 1:nr, j = 1:nc), "max") %>%
# cell can be kept only when row is not selected for removal
add_constraint(sum_expr(keep[i,j], j = 1:nc) <= 1 - rm_row[i], i = 1:nr) %>%
# cell can be kept only when column is not selected for removal
add_constraint(sum_expr(keep[i,j], i = 1:nr) <= 1 - rm_col[j], j = 1:nc) %>%
# only non-NA values can be kept
add_constraint(m[i,j] + rm_row[i] + rm_col[j] >= 1, i = 1:nr, j = 1:nc) %>%
# keep at most n columns i.e. remove at least (nc - n_years) columns
add_constraint(sum_expr(rm_col[j], j = 1:nc) >= nc - n_years)
model
This model works just fine as long as my data is not huge, which is not the case here (3500 rows x 180 columns).
The model has a lot of constraints, which is why it takes too long to solve. Is there another way to formulate this model so that it has fewer constraints and solves faster?
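One way to shrink the model (a sketch, untested on the real data): if it is acceptable to keep exactly n_years columns and then as many fully non-NA rows as possible, the keep[i,j] variables can be dropped entirely, and only the NA cells need a constraint (a row and a column cannot both be kept if their cell is NA). Note that this optimizes a slightly different objective (kept rows times a fixed number of columns, rather than total kept cells), and it assumes ompr quantifiers accept filter expressions such as m[i, j] == 0; if your version does not, those constraints can be added in a loop instead.
model2 <- MIPModel() %>%
  # row_kept[i] / col_kept[j] are 1 if row i / column j is kept
  add_variable(row_kept[i], i = 1:nr, type = "binary") %>%
  add_variable(col_kept[j], j = 1:nc, type = "binary") %>%
  # keep exactly n_years columns
  add_constraint(sum_expr(col_kept[j], j = 1:nc) == n_years) %>%
  # a row and a column cannot both be kept if the cell at (i, j) is NA
  add_constraint(row_kept[i] + col_kept[j] <= 1, i = 1:nr, j = 1:nc, m[i, j] == 0) %>%
  # maximize the number of rows kept (kept cells = kept rows * n_years)
  set_objective(sum_expr(row_kept[i], i = 1:nr), "max")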

Constrain axis limits in chordDiagram (circlize) when making gifs

I hope somebody will be able to help me with this chordDiagram visualisation I am trying to create. I am well aware that this visualization type may not be ideal for this particular data, but it is how I pictured the data and what I wanted to create, and now it feels too late to give it up :) I am too curious how one can fix it. It is my first real post here, though I am an active user of stackoverflow and I genuinely admire the audience here.
So I have this data on the change in the size of an area in km2 over time (d0), and I am trying to create a GIF out of it using the example here: https://guyabel.com/post/animated-directional-chord-diagrams/
The data "d0":
Time <- as.numeric(c(10,10,10,100,100,100,200,200,200,5,5,5,50,50,50,0,0,0))
Year <- as.character(c(2050,2100,2200,2050,2100,2200,2050,2100,2200,2050,2100,2200,2050,2100,2200,2050,2100,2200))
Area_km2 <- as.numeric(c(4.3075211,7.1672926,17.2780622,5.9099250,8.2909189,16.9748961,6.5400554,8.9036313,16.5627228,3.0765610,6.3929883,18.0708108,5.3520782,8.4503856,16.7938196,0.5565978,1.8415855,12.5089476))
(d0 <- as.data.frame(cbind(Time,Year,Area_km2)))
I also have the color codes stored in a separate dataframe (d1) following the above mentioned example.
The data "d1":
year <- as.numeric(c(2050,2100,2200))
order1 <- as.character(c(1,2,3))
col1 <- c("#40A4D8","#33BEB7","#0C5BCE")
(d1 <- as.data.frame(cbind(year,order1,col1)))
So the idea was to have self-linking flows within each sector that increase in size over time, which should look like growing segments in the final animated GIF (or like growing pie segments). But I noticed that, no matter how hard I try, I cannot seem to constrain the axis of each segment to the limits of that particular year in every single frame. It seems that the axis keeps being added onto over time, which is not what I want.
For example, in the first figure (figure0), the "starting frame", the size of the links matches the dataframe well:
figure0
So it is:
orig_year   Area_km2   .frame
2050        0.557      0
2100        1.84       0
2200        12.5       0
But when one plots the next figure (figure1), the axis seems to have taken the values from the starting frame and added the current values on top (4, 7.4 and 19 respectively), instead of showing what the values should be according to the data frame (3.08, 6.39 and 18.1):
figure1
orig_year   Area_km2   .frame
2050        3.08       1
2100        6.39       1
2200        18.1       1
And it keeps on doing so as one loops through the data and creates new plots for the next frames. I wonder whether it is possible to constrain the axis so that the links just gradually increase over time and the axis follows that increase, i.e. also grows gradually with the data?
Any help is highly appreciated!
Thanks.
My code:
# Load packages used below
library(tidyverse)
library(tweenr)
# Sort by Time
(d0 <- arrange(d0,Time))
# Copy the Year column as the destination year
(d0$Dest_year <- d0$Year)
# Re-arrange data
(d2 <- d0 %>%
mutate(corridor=paste(Year,Dest_year,sep="->")) %>%
dplyr::select(Time,corridor,Area_km2) %>%
mutate(ease="linear") %>%
tweenr::tween_elements('Time','corridor','ease',nframes=30) %>%
tibble::as_tibble())
(d2 <- d2 %>%
separate(col=.group,into=c("orig_year","dest_year"),sep="->") %>%
dplyr::select(orig_year,dest_year,Area_km2,everything()))
d2$Time <- NULL
# Create a directory to store the individual plots
dir.create("./plot-gif/")
# Fixing scales
scale_gap <- function(Area_km2_m,Area_km2_max,gap_at_max=1,gaps=NULL) {
p <- Area_km2_m/Area_km2_max
if(length(gap_at_max)==1 & !is.null(gaps)) {
gap_at_max <- rep(gap_at_max,gaps)
}
gap_degree <- (360-sum(gap_at_max))*(1-p)
gap_m <- (gap_degree + sum(gap_at_max))/gaps
return(gap_m)
}
# Function to derive the size of gaps in each frame for an animated GIF
(d3 <- d2 %>% group_by(orig_year) %>% mutate(gaps=scale_gap(Area_km2_m=Area_km2,Area_km2_max=max(.$Area_km2),gap_at_max=4,gaps=9)))
library(magrittr)
# Get the values for axis limits
(axmax <- d2 %>% group_by(orig_year,.frame) %>% mutate(max=mean(Area_km2)))
# Creating unique chordDiagrams for each frame
library(circlize)
for(f in unique(d2$.frame)){
png(file=paste0("./plot-gif/figure",f,".png"),height=7,width=7,units="in",res=500)
circos.clear()
par(mar=rep(0,4),cex=1)
circos.par(start.degree=90,track.margin=c(-0.1,0.1),
gap.degree=filter(d3,.frame==f)$gaps,
points.overflow.warning=FALSE)
chordDiagram(x=filter(d2,.frame==f),directional=2,order=d1$year,
grid.col=d1$col1,annotationTrack=c("grid","name","axis"),
transparency=0.25,annotationTrackHeight=c(0.05,0.1),
direction.type=c("diffHeight"),
diffHeight=-0.04,link.sort=TRUE,
xmax=axmax$max)
dev.off()
}
# Now make a GIF
library(magick)
img <- image_read(path="./plot-gif/figure0.png")
for(f in unique(d2$.frame)[-1]){
img0 <- image_read(path=paste0("./plot-gif/figure",f,".png"))
img <- c(img,img0)
message(f)
}
img1 <- image_scale(image=img,geometry="720x720")
ani0 <- image_animate(image=img1,fps=10)
image_write(image=ani0,path="./plot-gif/figure.gif")
I will start with your d0 object. I construct d0 without converting everything to character, keeping the columns in their original numeric format. I also reorder d0 by Time and Year:
Time = c(10,10,10,100,100,100,200,200,200,5,5,5,50,50,50,0,0,0)
Year = c(2050,2100,2200,2050,2100,2200,2050,2100,2200,2050,2100,2200,2050,2100,2200,2050,2100,2200)
Area_km2 = c(4.3075211,7.1672926,17.2780622,5.9099250,8.2909189,16.9748961,6.5400554,8.9036313,16.5627228,3.0765610,6.3929883,18.0708108,5.3520782,8.4503856,16.7938196,0.5565978,1.8415855,12.5089476)
d0 = data.frame(Time = Time,
Year = Year,
Area_km2 = Area_km2,
Dest_year = Year)
d0 = d0[order(d0$Time, d0$Year), ]
The key thing is to calculate proper values for "gaps" between sectors so that the same unit from data corresponds to the same degree in different plots.
We first calculate the maximal total width of the circular plot:
width = tapply(d0$Area_km2, d0$Time, sum)
max_width = max(width)
We assume there are n sectors (n = 3 in d0). We let the first n-1 gaps be 2 degrees and dynamically adjust the last gap according to the total amount of values in each plot. For the plot with the largest total value, the last gap is also set to 2 degrees.
n = 3
degree_per_unit = (360 - n*2)/max_width
Now degree_per_unit can be shared between multiple plots. For each plot we calculate the value of last_gap:
for(t in sort(unique(Time))) {
l = d0$Time == t
d0_current = d0[l, c("Year", "Dest_year", "Area_km2")]
last_gap = 360 - (n-1)*2 - sum(d0_current$Area_km2)*degree_per_unit
circos.par(gap.after = c(rep(2, n-1), last_gap))
chordDiagram(d0_current, grid.col = c("2050" = "red", "2100" = "blue", "2200" = "green"))
circos.clear()
title(paste0("Time = ", t, ", Sum = ", sum(d0_current$Area_km2)))
Sys.sleep(1)
}
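To carry this over to the animated version, the same degree_per_unit can be reused inside the per-frame loop, recomputing only last_gap from each frame's total. An untested sketch along the lines of the original loop (n and degree_per_unit as defined above):
for (f in unique(d2$.frame)) {
  d_f <- filter(d2, .frame == f)
  # same unit-to-degree mapping in every frame, only the last gap changes
  last_gap <- 360 - (n - 1) * 2 - sum(d_f$Area_km2) * degree_per_unit
  png(file = paste0("./plot-gif/figure", f, ".png"), height = 7, width = 7, units = "in", res = 500)
  circos.par(start.degree = 90, gap.after = c(rep(2, n - 1), last_gap), points.overflow.warning = FALSE)
  chordDiagram(x = d_f[, c("orig_year", "dest_year", "Area_km2")], order = d1$year,
    grid.col = d1$col1, annotationTrack = c("grid", "name", "axis"))
  circos.clear()
  dev.off()
}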

Hide p_value and put stars to significant OR gtsummary

I'm using the gtsummary package.
I need to merge different univariate logistic regressions, and for a good presentation I want to hide the p-values and bold or star the significant ORs (p < 0.05).
Can anyone help me?
Maybe it's easier to use another presentation type like kable or huxtable, I don't know?
Thank you for your help.
Have a nice day
There is a function called add_significance_stars() that hides the p-value and adds stars to the estimate indicating various levels of statistical significance. I've also added code to bold the estimate if significant with modify_table_styling().
library(gtsummary)
#> #BlackLivesMatter
packageVersion("gtsummary")
#> [1] '1.4.0'
tbl <-
trial %>%
select(death, age, grade) %>%
tbl_uvregression(
y = death,
method = glm,
method.args = list(family = binomial),
exponentiate = TRUE
) %>%
# add significance stars to sig estimates
add_significance_stars() %>%
# additionally bolding significant estimates
modify_table_styling(
columns = estimate,
rows = p.value < 0.05,
text_format = "bold"
)
Created on 2021-04-14 by the reprex package (v2.0.0)
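If the univariate models are instead fit one by one (rather than via tbl_uvregression()), the stars can be added to each table before combining them with tbl_stack(). A rough sketch with two hypothetical models m1 and m2 on the package's trial data:
m1 <- glm(death ~ age, data = trial, family = binomial)
m2 <- glm(death ~ grade, data = trial, family = binomial)
tbl_combined <-
  list(m1, m2) %>%
  lapply(function(m) tbl_regression(m, exponentiate = TRUE) %>% add_significance_stars()) %>%
  tbl_stack()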
Here's a quick huxtable version:
l1 <- glm(I(cyl==8) ~ gear, data = mtcars, family = binomial)
l2 <- glm(I(cyl==8) ~ carb, data = mtcars, family = binomial)
huxtable::huxreg(l1, l2, statistics = "nobs", bold_signif = 0.05)
────────────────────────────────────────────────────
                        (1)              (2)
              ───────────────────────────────────
  (Intercept)        5.999 *         -1.880 *
                    (2.465)          (0.902)
  gear              -1.736 *
                    (0.693)
  carb                                0.579 *
                                     (0.293)
              ───────────────────────────────────
  nobs                  32               32
────────────────────────────────────────────────────
  *** p < 0.001; ** p < 0.01; * p < 0.05.
Column names: names, model1, model2
It doesn't show it here, but the significant coefficients are bold on screen (and in any other kind of output).

CFA in data with 3 levels - estimating factor scores at level 2?

I am working on a dataset with 3 levels:
Teacher
School
Country
Using survey responses from the teachers, the aim is to use Confirmatory Factor Analysis (CFA) with the ultimate goal of having the factor scores at the school level.
A further objective is to test for measurement invariance across countries.
I want to use the lavaan package in R, because it can deal with the complex survey design of my data through the lavaan.survey extension (sampling design, weights, etc.).
I have done some preliminary analysis where I use the country ID as the group argument in the cfa() function. This gives me the possibility to perform the measurement-invariance analysis across countries. The issue is that my factor scores are given at the individual teacher level, and I am interested in the school level.
Any ideas about how to get these factor scores at the school level?
Here are some examples of the functions I use. I do not think a data sample is needed, but I will create one if it is requested.
library(lavaan)
library(semTools)
#define model
reduced_mod <-'
leadership_sup =~ TC3G22D + TC3G22E + TC3G22K
continous_develop_collab =~ TT3G32A + TT3G32B + TT3G32C + TT3G32D '
#Fit model with different restraints:
fit_no_restraint <- cfa(model = reduced_mod, data = cfa_data, group="countryID")
fit_metric <- cfa(model = reduced_mod, data = cfa_data, group="countryID", group.equal = c("loadings"))
fit_scalar <- cfa(model = reduced_mod, data = cfa_data, group="countryID", group.equal = c("loadings", "intercepts"))
#Compare fit statistics
compareFit(scalar = fit_scalar , metric = fit_metric , config = fit_no_restraint)
It seems that you want multilevel measurement invariance. You should use measEq.syntax() from the semTools package:
## ---------------------
## Multilevel Invariance
## ---------------------
## To test invariance across levels in a MLSEM, specify syntax as though
## you are fitting to 2 groups instead of 2 levels.
mlsem <- ' f1 =~ y1 + y2 + y3
f2 =~ y4 + y5 + y6 '
## metric invariance
syntax.metric <- measEq.syntax(configural.model = mlsem, meanstructure = TRUE,
ID.fac = "std.lv", sample.nobs = c(1, 1),
group = "cluster", group.equal = "loadings")
## by definition, Level-1 means must be zero, so fix them
syntax.metric <- update(syntax.metric,
change.syntax = paste0("y", 1:6, " ~ c(0, NA)*1"))
## save as a character string
mod.metric <- as.character(syntax.metric, groups.as.blocks = TRUE)
## convert from multigroup to multilevel
mod.metric <- gsub(pattern = "group:", replacement = "level:",
x = mod.metric, fixed = TRUE)
## fit model to data
fit.metric <- lavaan(mod.metric, data = Demo.twolevel, cluster = "cluster")
summary(fit.metric)
Source
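If the end goal is specifically factor scores at the school level, it may also be worth looking at lavaan's built-in two-level CFA via the cluster argument. A rough, untested sketch (schoolID is a hypothetical column name, and the _w/_b factor labels are only illustrative):
ml_mod <- '
level: 1
  leadership_sup_w =~ TC3G22D + TC3G22E + TC3G22K
  collab_w =~ TT3G32A + TT3G32B + TT3G32C + TT3G32D
level: 2
  leadership_sup_b =~ TC3G22D + TC3G22E + TC3G22K
  collab_b =~ TT3G32A + TT3G32B + TT3G32C + TT3G32D
'
fit_ml <- cfa(ml_mod, data = cfa_data, cluster = "schoolID")
# Between-level (school) factor scores; the level argument of lavPredict()
# is available in recent lavaan versions (check ?lavPredict for your version)
school_scores <- lavPredict(fit_ml, level = 2L)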

How to normalize a bigger-than-memory dask array?

I'm trying to normalize a dask array by a reduction of itself (e.g., b = a / a.sum(), with a and b being dask arrays). Computing this normalized array will first compute everything needed for the original array and only then perform the divisions, and hence it spills to disk if memory is not sufficient.
Code example:
from dask.distributed import Client
import dask.array as da
# Create 1000 MB array full of 1's of with chunks of 50MB
a = da.ones(shape=(1/8 * 1000e6, 1), chunks=(1/8 * 50e6, 1))
# Create normalized array with sum = 1
b = a / a.sum()
# Create cluster to small for all of a or b at once
client = Client(n_workers=1, threads_per_worker=1, memory_limit=500e6)
# Compute sum of b (Spills to disk)
print(b.sum().compute())
Is there something like the following?
b = a / same_as_a_but_different_tasks.sum()
I solved this by copying the array and renaming all tasks in the top layer:
import dask.array as da
from copy import deepcopy
def copy_with_renamed_top_layer(a, prepend_name="copy-of-"):
# copy array and dask
b = a.copy()
b.dask = deepcopy(b.dask)
# get new name
orig_name = a.name
new_name = prepend_name + orig_name
# rename dependencies
b.dask.dependencies[new_name] = b.dask.dependencies.pop(orig_name)
# rename tasks of uppermost layer
b.dask.layers[new_name] = b.dask.layers.pop(orig_name)
b.dask.layers[new_name] = {
(new_name, ) + k[1:]: v
for k, v in b.dask.layers[new_name].items()
}
# rename array
b.name = new_name
return b
# Create 1000 MB array full of 1's of with chunks of 50MB
a = da.ones(shape=(1/8 * 1000e6, 1), chunks=(1/8 * 50e6, 1))
# copy and rename uppermost layer
a_copy = copy_with_renamed_top_layer(a)
# Create normalized array with sum = 1
b = a / a_copy.sum()
This is, however, a highly fragile solution as it relies on the current internal API.
