Is there a method to deal with missing observations by adding weights in a GLMM?

I wonder how to construct a GLMM with count data where some observations were lost. I used yellow sticky boards to trap ladybeetles, so the response is a count. Another variable is the percentage of grass area within a 1 km radius. ID identifies each yellow sticky board; each cotton field held 3 boards. SITE identifies the different cotton fields, and several sites belong to each Region. I want to build a GLMM.
As you can see, I can create the GLMM as below:
library(lme4)
library(glmmTMB)
library(mgcv)
m1 = glmer(Ladybeetles ~ Grass.perc + (1 | Region/SITE), data = df, family = "poisson")   # negative binomial also tried
m2 = glmmTMB(Ladybeetles ~ Grass.perc + (1 | Region/SITE), data = df, family = "poisson") # negative binomial also tried
But I don't want to build it like that, because the counts vary hugely from board to board. Instead I want to group by the factor variables and sum the counts across boards. I could do it like this:
df = df %>%
  group_by(Region, SITE) %>%
  summarise(Ladybeetles = sum(Ladybeetles),
            Grass.perc = mean(Grass.perc),
            SampleNum = n()
  ) # sum ladybeetles of all yellow boards at each field
m1 = glmer(Ladybeetles ~ Grass.perc + (1 | Region), data = df, family = "poisson")   # negative binomial also tried
m2 = glmmTMB(Ladybeetles ~ Grass.perc + (1 | Region), data = df, family = "poisson") # negative binomial also tried
However, you might see there are some problems in my data.
Most fields have the full set of boards, but site3, site4, and site5 each lost a board to wind or to the farmer's activity (marked with red fill). Then I cannot simply sum the ladybeetle counts like that. Someone might advise grouping, taking the mean, and building GLMMs with a Gaussian or Gamma error distribution. But from my perspective, a Poisson or negative binomial distribution is the natural choice for count data, and I have lots of other data like this that often violate the assumptions of those other families.
I wonder if there is any model structure in lme4 or glmmTMB that weights observations by the number of samples, so that I can still use a Poisson or negative binomial distribution when observations are lost. So I add weights = SampleNum, but I have no idea whether this weighting approach is right. I hope someone can help me.
glmer(Ladybeetles ~ Grass.perc + (1 | Region),
      weights = SampleNum, # SampleNum is the number of boards in each group
      data = df, family = "poisson")
glmmTMB(Ladybeetles ~ Grass.perc + (1 | Region),
        weights = SampleNum, # SampleNum is the number of boards in each group
        data = df, family = "poisson")
That is, I add weights = SampleNum.
Here is my data.
df =
structure(list(Region = c("ITI", "ITI", "ITI", "ITI", "ITI",
"ITI", "ITI", "ITI", "ITI", "ITI", "ITI", "ITI", "ITI", "ITI",
"ITI", "KZ", "KZ", "KZ", "KZ", "KZ", "KZ", "KZ", "KZ", "KZ",
"KZ", "KZ", "KZ", "BST", "BST", "BST", "BST", "BST", "BST", "BST",
"BST", "BST", "BST", "BST", "BST"), SITE = c("site1", "site1",
"site1", "site2", "site2", "site2", "site3", "site3", "site4",
"site4", "site5", "site5", "site6", "site6", "site6", "site7",
"site7", "site7", "site8", "site8", "site8", "site9", "site9",
"site9", "site10", "site10", "site10", "site11", "site11", "site11",
"site12", "site12", "site12", "site13", "site13", "site13", "site14",
"site14", "site14"), ID = c("ID1", "ID2", "ID3", "ID1", "ID2",
"ID3", "ID1", "ID2", "ID1", "ID3", "ID2", "ID3", "ID1", "ID2",
"ID3", "ID1", "ID2", "ID3", "ID1", "ID2", "ID3", "ID1", "ID2",
"ID3", "ID1", "ID2", "ID3", "ID1", "ID2", "ID3", "ID1", "ID2",
"ID3", "ID1", "ID2", "ID3", "ID1", "ID2", "ID3"), Ladybeetles = c(0L,
1L, 7L, 7L, 2L, 8L, 4L, 0L, 3L, 3L, 9L, 10L, 22L, 10L, 15L, 0L,
7L, 1L, 6L, 0L, 0L, 5L, 2L, 2L, 12L, 5L, 5L, 7L, 9L, 6L, 4L,
5L, 4L, 11L, 13L, 27L, 8L, 1L, 7L), Grass.perc = c(5L, 5L, 5L,
13L, 13L, 13L, 4L, 4L, 21L, 21L, 13L, 13L, 5L, 5L, 5L, 2L, 2L,
2L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 10L, 10L, 10L, 11L, 11L,
11L, 15L, 15L, 15L, 19L, 19L, 19L)), class = "data.frame", row.names = c(NA,
-39L))

I think this is more of a CrossValidated question, but I'll give it a shot (it may be migrated).
tl;dr I don't think you should be trying to aggregate like this: this kind of unbalanced design is exactly where GLMMs are useful.
I think you should be using a negative binomial or other overdispersed model: the ratio deviance()/df.residual() (a rough guide to the amount of overdispersion) is greater than 2; a Poisson model would only be appropriate if this dispersion ratio were close to 1.
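As a quick sketch of that check on the Poisson fit from the question (assuming it is stored as m1 and fitted to the full, un-aggregated df):
library(lme4)
m1 <- glmer(Ladybeetles ~ Grass.perc + (1 | Region/SITE),
            data = df, family = "poisson")
## rough overdispersion check: values well above 1 suggest Poisson is too restrictive
deviance(m1) / df.residual(m1)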
The random effect of Region is estimated as zero — not surprising because your data set is fairly small, fairly noisy, and you have only 3 regions. You could simplify the model and replace (1|Region/SITE) with (1|SITE), if the differences among regions are not very interesting to you.
I fitted the model with (1|SITE) and family = "nbinom2", then I used the DHARMa package to explore residuals (plot(ss <- simulateResiduals(model)); plotResiduals(ss, df$Grass.perc)). I found there was some nonlinearity in the pattern. I also tried with nbinom1 rather than nbinom2 and found it improved the fit, so I ended up with:
glmmTMB(Ladybeetles ~ poly(Grass.perc, 2) + (1 | SITE),
        data = df, family = "nbinom1")
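A sketch of the diagnostic workflow described above (assuming the fitted model is stored as m and the DHARMa package is installed):
library(glmmTMB)
library(DHARMa)
m <- glmmTMB(Ladybeetles ~ poly(Grass.perc, 2) + (1 | SITE),
             data = df, family = "nbinom1")
ss <- simulateResiduals(m)         # simulation-based scaled residuals
plot(ss)                           # QQ plot and residual-vs-predicted panels
plotResiduals(ss, df$Grass.perc)   # look for remaining pattern against the covariate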


How to merge based on a subset of string in a column?

This is an extension of my previous post.
I have the following dataframes (df1 and df2) that I'm trying to merge:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("Molly Homes, Jane Doe", "Sally", "David", "Laura", "John", "Kate")
df1 <- data.frame(year, state, name)
year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("Homes (v. Vista)", "#laura v. dAvid")
df2 <- data.frame(year, state, versus)
And df4 is my ideal output:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("Molly Homes, Jane Doe", "Sally", "David", "Laura", "John", "Kate")
versus <- c("Homes (v. Vista)", "# george v. SALLY", "#laura v. dAvid", "#laura v. dAvid", NA, NA)
df4 <- data.frame(year, state, name, versus)
The kind responders on the last post suggested this (and a variation):
library(dplyr)
df3 <- left_join(df1, df2, by = c("year", "state")) %>%
  rowwise() %>%
  mutate(versus := if_else(grepl(name, versus, ignore.case = TRUE), versus, as.character(NA)))
The problem with the above code is that it doesn't match on substrings. Ideally, I'd like grepl(x, y) to match in both directions: if x is in y and/or y is in x, then it's TRUE and the value in the "versus" column is kept.
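For illustration only, the symmetric check I have in mind would behave like this hypothetical helper (either_in is not from any package):
# TRUE if either string contains the other, ignoring case (a restatement of the desired rule)
either_in <- function(x, y) {
  grepl(tolower(x), tolower(y), fixed = TRUE) | grepl(tolower(y), tolower(x), fixed = TRUE)
}
either_in("Homes", "Homes (v. Vista)")  # TRUE
either_in("Homes (v. Vista)", "Homes")  # TRUE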
fuzzyjoin is meant for regex searches like this :-)
library(dplyr)
# library(tidyr) # unnest
# library(fuzzyjoin) # fuzzy_*_join
df1 %>%
  mutate(
    rn = row_number(),
    ptn = strsplit(name, "[ ,]+")
  ) %>%
  tidyr::unnest(ptn) %>%
  fuzzyjoin::fuzzy_left_join(df2,
    by = c("year" = "year", "state" = "state", "ptn" = "versus"),
    match_fun = list(`==`, `==`, function(...) Vectorize(grepl)(..., ignore.case = TRUE))
  ) %>%
  group_by(rn, year = year.x, state = state.x, name) %>%
  summarize(versus = na.omit(versus)[1], .groups = "drop") %>%
  select(-rn)
# # A tibble: 6 x 4
# year state name versus
# <chr> <chr> <chr> <chr>
# 1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
# 2 2002 TN Sally NA
# 3 1999 AL David #laura v. dAvid
# 4 1999 AL Laura #laura v. dAvid
# 5 1997 CA John NA
# 6 2002 TN Kate NA
We need a way to retrieve the set of whole words in name and check whether any of them appear (case-insensitively) within the versus column. Here is one simple way to do this:
Create a function f(n, v) that takes strings n and v, extracts the whole words (wrds) from n, and counts how many of them are found in v; it returns TRUE if this count exceeds 0.
f <- function(n, v) {
  wrds <- stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
  sum(sapply(wrds[which(nchar(wrds) > 1)], grepl, x = v, ignore.case = TRUE)) > 0
}
Left join the original frames, and apply f() by row, retaining versus if one or more whole words from name are found in versus, else setting it to NA:
left_join(df1, df2, by = c("year", "state")) %>%
  rowwise() %>%
  mutate(versus := if_else(f(name, versus), versus, NA_character_))
Output:
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Input:
df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997",
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe",
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA,
-6L))
df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "#laura v. dAvid")), class = "data.frame", row.names = c(NA,
-2L))

Apache Beam Streaming Lag Operator

I'm currently thinking of building a pipeline that has a LAG operator like in SQL, but I'm not sure if it's possible.
To be clearer, let's say I have stream of data like this:
# sensor_name, temperature
("station 1", 30.0)
("station 1", 31.0)
("station 1", 32.0)
("station 1", 33.0)
("station 2", 30.0)
("station 2", 31.0)
("station 2", 32.0)
and apply a PTransform so that the output becomes
("station 1", {"now":30.0, "before":None})
("station 1", {"now":31.0, "before":30.0})
("station 1", {"now":32.0, "before":31.0})
("station 1", {"now":33.0, "before":32.0})
("station 2", {"now":30.0, "before":None})
("station 2", {"now":31.0, "before":30.0})
("station 2", {"now":32.0, "before":31.0})
Is it possible to do so? thanks!
Here is a working sample using the public taxi-rides topic.
This is the stateful DoFn:
import logging

import apache_beam as beam
from apache_beam.coders import FloatCoder, TupleCoder
from apache_beam.transforms.userstate import BagStateSpec


class UpdateLast(beam.DoFn):
    # one bag of (meter_reading, timestamp) tuples is kept per key
    RIDE_TRACK = BagStateSpec('rides', TupleCoder((FloatCoder(), FloatCoder())))

    def process(self,
                element,
                timestamp_param=beam.DoFn.TimestampParam,
                ride_state=beam.DoFn.StateParam(RIDE_TRACK)):
        key = element[0]
        meter_reading = element[1]
        timestamp = float(timestamp_param)
        bag_content = [x for x in ride_state.read()]
        if not bag_content:
            logging.info("Generating entry %s for key %s", (meter_reading, timestamp), key)
            ride_state.add((meter_reading, timestamp))
            output = {"now": meter_reading, "before": None}
            yield (key, output)
        else:
            # There should only be one element in the bag
            bag_ride = bag_content[0]
            old_meter = bag_ride[0]
            old_timestamp = bag_ride[1]
            # We only need to check if the element is more recent
            if timestamp > old_timestamp:
                # Update bag
                ride_state.clear()
                ride_state.add((meter_reading, timestamp))
                output = {"now": meter_reading, "before": old_meter}
                logging.info("KEY %s: updating from %s to %s", key, old_meter, meter_reading)
                yield (key, output)
            else:
                # Invert old and new if element is old
                output = {"now": old_meter, "before": meter_reading}
                yield (key, output)
And a pipeline for you to test it:
import json

from apache_beam import Map, ParDo
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners import DataflowRunner

# bucket, project and region are assumed to be defined elsewhere
options = PipelineOptions(
    temp_location=f"{bucket}/tmp/",
    project=project,
    region=region,
    streaming=True,
    job_name="statedofn",
    num_workers=4,
    max_num_workers=20,
)

p = beam.Pipeline(DataflowRunner(), options=options)

topic = "projects/pubsub-public-data/topics/taxirides-realtime"

pubsub = (p | "Read Topic" >> ReadFromPubSub(topic=topic)
            | "Json Loads" >> Map(json.loads)
            | beam.Filter(lambda x: x["ride_status"] == "enroute")
            | "KV" >> Map(lambda x: (x["ride_id"], x["meter_reading"]))
          )

state_df = (pubsub | "Stateful Do Fn" >> ParDo(UpdateLast())
                   | Map(logging.info)
            )

p.run()
output:
('052b8a40-1c57-4a3c-a012-73ffeddb1f02', {'now': 9.875244, 'before': 9.857124})
('835a9a99-c2fc-4f3d-9284-59098827fe05', {'now': 26.973698, 'before': 26.940273})
('952c0fa5-2bb8-4c9a-b38c-72d66dedfddc', {'now': 17.828278, 'before': 17.808857})
('952c0fa5-2bb8-4c9a-b38c-72d66dedfddc', {'now': 17.847698, 'before': 17.828278})
('d5641df2-2fd8-4416-bde7-4def6d477a29', {'now': 2.3575556, 'before': 2.3346667})
('d5641df2-2fd8-4416-bde7-4def6d477a29', {'now': 2.3804445, 'before': 2.3575556})

How to create multiline string sf object to create a time slider in Leaflet map

I'm trying to create a leaflet map which shows different movement paths for different months of the year. I.e. I've got a dataset showing multiple journeys per month and I want to display the movement paths separately for each month using the addTimeslider feature of the leaflet.extras2 package.
To do so I have been trying to adapt the code posted by SymbolixAU I found here: leaflet add multiple polylines
This code uses sf functions including st_linestring to create an object that can be supplied to a addPolylines leaflet function to show all movement paths at once.
I'm pretty sure for my purposes (showing data separately for each month) I have to use st_multilinestring, which takes a list of matrices containing the coordinates for multiple polylines per row (with one row per month) rather than a single polyline per row.
Once I have that, I think I could supply that object to the addTimeslider function of leaflet.extras2 to achieve what I need. I'm quite sure of this because when I used the sf object created using sf_linestring in the addTimeslider feature, I was able to use the time slider on the map to show individual movement paths one at a time.
However, I have been trying for hours and haven't been successful. Would be hugely grateful for any pointers, please and thank you.
Some example data:
#load packages
library(dplyr)
library(leaflet)
library(leaflet.extras2)
library(sf)
library(data.table)
# create the example dataset
data <- structure(list(arrival_month = structure(c(3L, 3L, 4L, 4L, 4L,
5L, 5L, 6L, 6L, 6L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = c("ordered", "factor")), start_lat = c(33.40693,
33.64672, 33.57127, 33.42848, 33.54936, 33.53418, 33.60399, 33.49554,
33.5056, 33.61696), start_long = c(-112.0298, -111.9255, -112.049,
-112.0998, -112.0912, -112.0911, -111.9273, -111.9687, -112.0563,
-111.9866), finish_lat...4 = c(33.40687, 33.64776, 33.57125,
33.42853, 33.54893, 33.53488, 33.60401, 33.49647, 33.5056, 33.61654
), finish_lat...5 = c(-112.0343, -111.9303, -112.0481, -112.0993,
-112.0912, -112.0911, -111.931, -111.9711, -112.0541, -111.986
)), row.names = c(NA, -10L), class = c("data.table", "data.frame"))
My attempt at the code:
# Convert the data into a list of matrices for each month
mnths <- c("May", "March", "April", "June")
mat_list <- list()
for (i in mnths) {
  month <- as.matrix(data %>% filter(arrival_month == i) %>% select(-1))
  mat_list[[i]] <- month
}
# convert to an sf object
data_DT <- setDT(data)
sf <- data_DT[
  , {
    geometry <- sf::st_multilinestring(x = mat_list)
    geometry <- sf::st_sfc(geometry)
    geometry <- sf::st_sf(geometry = geometry)
  }
  , by = arrival_month
]
sf <- sf::st_as_sf(sf)
This yields the following result:
It's not correct because each row contains the coordinates for all the months, rather than just for the month in the respective row. I'm at a loss as to where to go from here - any help would be hugely appreciated.
Thanks
I would do it slightly differently today, making use of {sfheaders} to build the linestrings.
library(sfheaders)
library(sf)
library(data.table)

setDT(data)
data[, line_id := .I]  ## Assuming each row is a line

## create a long-form of the data
dt_line <- rbindlist(
  list(
    data[, .(arrival_month, line_id, lon = start_long, lat = start_lat, sequence = 1)]
    , data[, .(arrival_month, line_id, lon = finish_lat...5, lat = finish_lat...4, sequence = 2)]  ## I think 'finish_lat...5' is actually the 'long'
  )
)

setorder(dt_line, line_id, sequence)

sf <- sfheaders::sf_multilinestring(
  obj = dt_line
  , x = "lon"
  , y = "lat"
  , multilinestring_id = "arrival_month"
  , linestring_id = "line_id"
  , keep = TRUE
)
sf::st_crs(sf) <- 4326  ## Assuming the coordinates are lon/lat (WGS 84)
# Simple feature collection with 4 features and 2 fields
# Geometry type: MULTILINESTRING
# Dimension: XY
# Bounding box: xmin: -112.0998 ymin: 33.40687 xmax: -111.9255 ymax: 33.64776
# Geodetic CRS: WGS 84
#. arrival_month sequence geometry
# 1 3 1 MULTILINESTRING ((-112.0298...
# 2 4 1 MULTILINESTRING ((-112.049 ...
# 3 5 1 MULTILINESTRING ((-112.0911...
# 4 6 1 MULTILINESTRING ((-111.9687...
Note that arrival_month has been re-coded to its underlying integer factor codes.
I managed to adapt the code provided by SymbolixAU to suit my purposes, by using sf_multilinestring instead of sf_linestring:
## Generate data
data <- structure(list(arrival_month = structure(c(3L, 3L, 4L, 4L, 4L,
5L, 5L, 6L, 6L, 6L), .Label = c("January", "February", "March",
"April", "May", "June", "July", "August", "September", "October",
"November", "December"), class = c("ordered", "factor")), start_lat = c(33.40693,
33.64672, 33.57127, 33.42848, 33.54936, 33.53418, 33.60399, 33.49554,
33.5056, 33.61696), start_long = c(-112.0298, -111.9255, -112.049,
-112.0998, -112.0912, -112.0911, -111.9273, -111.9687, -112.0563,
-111.9866), finish_lat...4 = c(33.40687, 33.64776, 33.57125,
33.42853, 33.54893, 33.53488, 33.60401, 33.49647, 33.5056, 33.61654
), finish_lat...5 = c(-112.0343, -111.9303, -112.0481, -112.0993,
-112.0912, -112.0911, -111.931, -111.9711, -112.0541, -111.986
)), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
## Convert to a data.table and add an id column
setDT(data)
data[, line_id := .I]
## create a long-form of the data
dt_line <- rbindlist(
  list(
    data[, .(arrival_month, line_id, lon = start_long, lat = start_lat, sequence = 1)]
    , data[, .(arrival_month, line_id, lon = finish_lat...5, lat = finish_lat...4, sequence = 2)]  ## I think 'finish_lat...5' is actually the 'long'
  )
)
setorder(dt_line, line_id, sequence)
## Create multilinestring sf object
sf <- sfheaders::sf_multilinestring(
  obj = dt_line
  , x = "lon"
  , y = "lat"
  , linestring_id = "line_id"
  , multilinestring_id = "arrival_month"
  , keep = TRUE
)
sf::st_as_sf(sf)
## Convert arrival_month back to factor
sf$arrival_month <- as.factor(sf$arrival_month)
Might not be the most elegant but this does the job.
The key was to use sf_multilinestring and include/specify both linestring_id and multilinestring_id to distinguish between separate sets of polylines.
This is the result:
Now when this sf object is used in AddTimeslider it behaves as desired.
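For reference, a minimal sketch of how the resulting sf object might be passed to addTimeslider; the time column and its values are fabricated here purely for illustration, and the exact options should be checked against ?leaflet.extras2::addTimeslider:
library(leaflet)
library(leaflet.extras2)
## fabricate a POSIXct time stamp per multilinestring from its month code (illustration only)
sf$time <- as.POSIXct(sprintf("2022-%02d-01", as.integer(as.character(sf$arrival_month))))
leaflet() %>%
  addTiles() %>%
  addTimeslider(data = sf,
                options = timesliderOptions(position = "topright",
                                            timeAttribute = "time",
                                            range = TRUE))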
Credit to SymbolixAU for most of the code and bringing my attention to the sfheaders package

Clipping rasters based on a ShapeFile using R

I am using the WorldPop Population Count for DR Congo as a GeoTIFF (.tif) file (see here). This single-band raster file has about 5.17 million grid cells, each at 30-arc-second resolution, covering the entire country.
I would like to clip out just the two northeastern provinces (Ituri and North-Kivu) in R and save the clipped file as a .tif file. I am able to do this using QGIS (see here), however that is way too complicated. I am looking to use GADM boundaries inside the R workspace, overlay them on top of the population count raster, and clip out the two provinces.
Here is what I have for my preliminary attempt at importing the raster (.tif) file and GADM (.rds) in R.
###########################################################################
# #
# Spatial tracking of the 2018-2020 Kivu Ebola outbreak in DRC #
# #
# This source code is issued under the GNU General Public License, v3.0. #
# #
# This script is free software; you can redistribute it and/or modify #
# it under the terms of the GNU General Public License as published by #
# the Free Software Foundation; either version 3.0 of the License, or #
# (at your option) any later version. #
# #
# See the GNU General Public License for more details. #
# #
# https://www.gnu.org/licenses/gpl-3.0.en.html #
###########################################################################
rm(list = ls())
#install.packages("raster", dependencies = T)
#install.packages("rgdal", dependencies = T)
#install.packages("ncdf4", dependencies = T)
#install.packages("rstudioapi", dependencies = T)
library(sp)
library(raster)
library(rgdal)
library(ncdf4)
library(rstudioapi)
# library(colorRamps)
# library(ggmap)
# library(ggplot2)
#----------------------------#
# Set your working directory #
#----------------------------#
setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) # RStudio IDE preferred
getwd() # Path to your working directory
#----------------------------------------------------------------#
# Source 1: WorldPop UN-Adjusted Population Count GeoTIFF raster #
#----------------------------------------------------------------#
# Downloaded from https://www.worldpop.org/geodata/summary?id=35845
DRCWorldPop <- raster("cod_ppp_2020_1km_Aggregated_UNadj.tif")
DRCWorldPop # this original raster layer has 2,261 rows and 2,289 columns = 5,175,429 grid cells
dim(DRCWorldPop); length(DRCWorldPop); extent(DRCWorldPop)
names(DRCWorldPop) <- "Susceptible"
names(DRCWorldPop); res(DRCWorldPop); projection(DRCWorldPop)
summary(getValues(DRCWorldPop))
minValue(DRCWorldPop); maxValue(DRCWorldPop)
DRCWorldPop <- replace(DRCWorldPop, is.na(DRCWorldPop), 0)
summary(getValues(DRCWorldPop))
sum(getValues(DRCWorldPop))
nlayers(DRCWorldPop)
#---------------------#
# Source 2: From GADM #
#---------------------#
#?getData
drc <- getData("GADM", level=1, country="COD")
#drc$NAME_1 # List of all provinces
drc <- drc[drc$NAME_1 %in% c("Ituri", "Nord-Kivu"), ]
r <- raster(drc, resolution = res(DRCWorldPop)[1])
values(r) <- 0
r # this raster layer has 689 rows and 487 columns
extent(r)
plot(log(DRCWorldPop), main = "2020 UN-Adjusted Population Count \n for DR Congo (log-scale)", col=topo.colors(100), legend.width=2, legend.shrink=1, legend.args=list(text='Persons', side=4, font=2, line=2.5, cex=0.8), axes=T)
lines(drc, col="black", lwd=1)
# plot(drc, col = "light yellow")
# lines(drc, col="black", lwd=2)
#--------------------------------------------#
# Merging and Cropping Source 1 and Source 2 #
#--------------------------------------------#
# ?merge
merged <- merge(DRCWorldPop, r, tolerance = 0.07)
# Task: What is the significance of tolerance?
# ?crop
cropped <- crop(merged, drc)
cropped # this raster layer has 689 rows and 487 columns
extent(cropped)
plot(log(cropped), main = "2020 UN-Adjusted Population Count \n for DR Congo (log-scale)", legend.width=2, legend.shrink=1, legend.args=list(text='Persons', side=4, font=2, line=2.5, cex=0.8), axes=T)
lines(drc, col="red", lwd=2)
# Task: How to change the plot Legend to make it more colorful?
# Task: The plot is in log-scale, how to construct a similar colorful plot in raw scale?
writeRaster(cropped, "cropped.tif", format = "GTiff", overwrite = TRUE)
croppedDRCWorldPop <- raster("cropped.tif")
croppedDRCWorldPop # this cropped raster layer has 689 rows and 487 columns
#--------------------------------#
# Aggregating the cropped raster #
#--------------------------------#
aggregationFactor <- 10 # in km
DRC_aggr_count <- aggregate(croppedDRCWorldPop, fact = c(aggregationFactor, aggregationFactor), fun = sum, na.rm = TRUE)
DRC_aggr_count # this raster layer has 69 rows and 49 columns
names(DRC_aggr_count); res(DRC_aggr_count); projection(DRC_aggr_count)
dim(DRC_aggr_count); length(DRC_aggr_count); extent(DRC_aggr_count); isLonLat(DRC_aggr_count)
summary(getValues(DRC_aggr_count))
sum(getValues(DRC_aggr_count))
nrow(DRC_aggr_count)
ncol(DRC_aggr_count)
# ?xyFromCell
#xyFromCell(DRC_aggr_count, 1:ncell(DRC_aggr_count), spatial=FALSE)
# #xy for corners of a raster:
# xyFromCell(DRC_aggr_count, c(1, ncol(DRC_aggr_count), ncell(DRC_aggr_count)-ncol(DRC_aggr_count)+1, ncell(DRC_aggr_count)))
#
# xmin(DRC_aggr_count)
# xmax(DRC_aggr_count)
# ymin(DRC_aggr_count)
# ymax(DRC_aggr_count)
# origin(DRC_aggr_count)
# https://www.nationalgeographic.org/encyclopedia/latitude/
# https://www.nationalgeographic.org/encyclopedia/longitude/
plot(log(DRC_aggr_count), main = "2020 UN-Adjusted Population Count \n for DR Congo (log-scale)", legend.width=2, legend.shrink=1, legend.args=list(text='Persons', side=4, font=2, line=2.5, cex=0.8))
lines(drc, col="red", lwd=2)
text(32,3,"aggregated", xpd = TRUE)
# Task: How to change the plot Legend to make it more colorful?
# Task: The plot is in log-scale, how to construct a similar colorful plot in raw scale?
# library(RColorBrewer)
# warna <- brewer.pal(n = 11, name = "RdYlGn")
# #warna <- rev(warna)
# plot(log(DRC_aggr_count), col=palette(warna), main = "2020 UN-Adjusted Population Count \n for DR Congo (log-scale)", legend.width=2, legend.shrink=1, legend.args=list(text='Persons', side=4, font=2, line=2.5, cex=0.8))
# lines(drc, col="red", lwd=2)
# #plot.window(xlim=extent(DRC_aggr_count)[1:2], ylim=extent(DRC_aggr_count)[3:4])
#spplot(log(DRC_aggr_count))
#-----------------------------------#
# Export cropped raster to a netCDF #
#-----------------------------------#
if (require(ncdf4)) {
#rnc <- writeRaster(DRCWorldPop, filename ='Congo_full_0000.nc', format = "CDF", varname = "Susceptible", varunit = "Persons", longname = "Susceptible", overwrite = TRUE)
rnc_aggr_tif <- writeRaster(DRC_aggr_count, filename ='cod_ppp_2020_10km_Aggregated_UNadj.tif', format = "GTiff", varname = "Susceptible", varunit = "Persons", longname = "Susceptible", overwrite = TRUE)
rnc_aggr <- writeRaster(DRC_aggr_count, filename ='Congo_0000.nc', format = "CDF", varname = "Susceptible", varunit = "Persons", longname = "Susceptible", overwrite = TRUE)
}
I would like the cell values for the neighbouring countries (Uganda, Rwanda, South Sudan, Burundi) to be set equal to zero. The resulting .tif has 689 rows and 487 columns, which is equivalent to 335,543 non-overlapping grid cells, each containing a population count. This is too big for my simulation purposes, so I aggregate the cropped raster by a factor of 10 to get a raster with 69 rows and 49 columns. Aggregation is done using the code above, and the result is finally saved as a NetCDF file and viewed with the Panoply tool. The base NetCDF file has one layer (the population count data, which we call "Susceptible"). To this NetCDF file we add additional epidemic compartments as follows.
#---------------------------------------------------------#
# Adding more epidemic compartments to an existing netCDF #
#---------------------------------------------------------#
episim <- nc_open("Congo_0000.nc", write = TRUE)
episim$dim$latitude$vals; length(episim$dim$latitude$vals) # lat is vertical axis (or)rows in our case
episim$dim$longitude$vals; length(episim$dim$longitude$vals) # lon is horizontal axis (or) columns in our case
nrows <- length(episim$dim$latitude$vals)
ncols <- length(episim$dim$longitude$vals)
# Longitude is "East - West" means columns
# Latitude is "North - South" means rows
#ncatt_get(episim, 0, attname=NA, verbose=FALSE)
ncatt_put(episim, 0, "created_by", attval = c("Ashok Krishnamurthy"), verbose=FALSE)
ncatt_put(episim, 0, "contact", attval = c("Ashok Krishnamurthy <akrishnamurthy#mtroyal.ca>"), verbose=FALSE)
ncatt_put(episim, 0, "nRows", attval = nrows, verbose=FALSE)
ncatt_put(episim, 0, "nCols", attval = ncols, verbose=FALSE)
#Upper Left Corner pair is (first row, first column) = (27.22375, 3.674584)
ncatt_put(episim, 0, "ULCornerLongitude", attval = episim$dim$longitude$vals[1], verbose=FALSE)
ncatt_put(episim, 0, "ULCornerLatitude", attval = episim$dim$latitude$vals[1], verbose=FALSE)
#Lower Left Corner pair is (last row, first column) = (27.22375, -2.075416)
ncatt_put(episim, 0, "LLCornerLongitude", attval = episim$dim$longitude$vals[1], verbose=FALSE) #
ncatt_put(episim, 0, "LLCornerLatitude", attval = episim$dim$latitude$vals[nrows], verbose=FALSE) #
#ncatt_put(episim, 0, "cellSize", attval = abs(episim$dim$longitude$vals[1] - episim$dim$longitude$vals[2]), verbose=FALSE)
ncatt_put(episim, 0, "hcellSize", attval = res(DRC_aggr_count)[1], verbose=FALSE)
ncatt_put(episim, 0, "vcellSize", attval = res(DRC_aggr_count)[2], verbose=FALSE)
# episim$var[[1]] # str(episim$var[[1]]$dim) # str(episim$var[[1]])
#----------------------------------------------------------------#
# Adding new Epidemic State Variables to an existing netCDF file #
#----------------------------------------------------------------#
x <- ncdim_def(name = "longitude", units = "degrees_east", vals = episim$dim$longitude$vals, unlim = FALSE, create_dimvar = TRUE, calendar = NA, longname = "longitude")
y <- ncdim_def(name = "latitude", units = "degrees_north", vals = episim$dim$latitude$vals, unlim = FALSE, create_dimvar = TRUE, calendar = NA, longname = "latitude")
#?ncvar_def
Vaccinated <- ncvar_def(name = "Vaccinated", units = "Persons", dim = list(x,y), missval = NULL, prec = "float", shuffle=FALSE, compression=NA, chunksizes=NA, verbose=FALSE, longname = "Vaccinated")
Exposed <- ncvar_def(name = "Exposed", units = "Persons", dim = list(x,y), missval = NULL, prec = "float", shuffle=FALSE, compression=NA, chunksizes=NA, verbose=FALSE, longname = "Exposed")
Infected <- ncvar_def(name = "Infected", units = "Persons", dim = list(x,y), missval = NULL, prec = "float", shuffle=FALSE, compression=NA, chunksizes=NA, verbose=FALSE, longname = "Infected")
Recovered <- ncvar_def(name = "Recovered", units = "Persons", dim = list(x,y), missval = NULL, prec = "float", shuffle=FALSE, compression=NA, chunksizes=NA, verbose=FALSE, longname = "Recovered")
Dead <- ncvar_def(name = "Dead", units = "Persons", dim = list(x,y), missval = NULL, prec = "float", shuffle=FALSE, compression=NA, chunksizes=NA, verbose=FALSE, longname = "Dead")
Inhabitable <- ncvar_def(name = "Inhabitable", units = "Binary", dim = list(x,y), missval = NULL, prec = "integer", shuffle=FALSE, compression=NA, chunksizes=NA, verbose=FALSE, longname = "Inhabitable")
#ProvinceIdentifier <- ncvar_def(name = "ProvinceIdentifier", units = "Province", dim = list(x,y), missval = NULL, prec = "integer", shuffle=FALSE, compression=NA, chunksizes=NA, verbose=FALSE, longname = "ProvinceIdentifier")
# Epidemic State Variables (or Epidemic Compartments) are added in the following order
ncvar_add(episim, Vaccinated)
ncvar_add(episim, Exposed)
ncvar_add(episim, Infected)
ncvar_add(episim, Recovered)
ncvar_add(episim, Dead)
ncvar_add(episim, Inhabitable)
#ncvar_add(episim, ProvinceIdentifier)
currSusceptible <- ncvar_get(episim, episim$var[[2]]) # WorldPop
#ProvinceIdentifier <- ncvar_get(epiProvince, epiProvince$var[[2]]) # Sitansu
dim(currSusceptible) # 49 rows and 69 columns
dim(t(currSusceptible)) # 69 rows and 49 columns
# sum(currSusceptible)
# sum(currSusceptible>0)
# sum(currSusceptible == 0)
# max(currSusceptible)
# currSusceptible[ncols, nrows]
# currSusceptible[nrows, ncols] # Subscript out of bounds error! AS EXPECTED
# dim(ProvinceIdentifier)
# dim(t(ProvinceIdentifier))
# table(ProvinceIdentifier)
# I could use a combination of transpose and flip from raster package
currVaccinated <- currExposed <- currInfected <- currRecovered <- currDead <- currInhabitable <-
  matrix(0, length(episim$dim$longitude$vals), length(episim$dim$latitude$vals))
for (i in 1:ncols) {
  for (j in 1:nrows) {
    if (currSusceptible[i, j] > 0) {
      currInhabitable[i, j] <- 1
    }
    if (currSusceptible[i, j] == 0) {
      currInhabitable[i, j] <- 0
    }
  }
}
table(currInhabitable)
# Some cells in DRC have a population count equal to zero. Possibly forests, deserts or uninhabited areas
# currInhabitable
# 0 1
# 903 2478
nc_close(episim)
################################################################################################
episim <- nc_open("Congo_0000.nc", write = TRUE) # Fill values to an existing ncdf file
ncvar_put(episim, episim$var[[2]], currSusceptible)
ncvar_put(episim, episim$var[[3]], currVaccinated)
ncvar_put(episim, episim$var[[4]], currExposed)
ncvar_put(episim, episim$var[[5]], currInfected)
ncvar_put(episim, episim$var[[6]], currRecovered)
ncvar_put(episim, episim$var[[7]], currDead)
ncvar_put(episim, episim$var[[8]], currInhabitable)
# ncvar_put(episim, episim$var[[9]], ProvinceIdentifier)
cat(paste("The file", episim$filename, "has", episim$nvars, "variables"), fill=TRUE)
for (v in 1:episim$nvars) cat(paste("Variable ", v, " is ", episim$var[[v]]$name,".",sep=""), fill=TRUE)
#episim # class(episim) # str(episim)
nc_close(episim)
I would like to be able to overlay these two layers, clip out just the two provinces (Ituri and North-Kivu), and save the result back as a .tif file.
Here is a minimal, reproducible, self-contained example:
library(raster)
# example polygons
f <- system.file("external/lux.shp", package="raster")
p <- shapefile(f)
# example raster
r <- raster(p, res=0.01)
values(r) <- 1:ncell(r)
# select two areas
pp <- p[1:2, ]
# have a look
plot(r)
lines(p)
lines(pp, col="red", lwd=3)
Solution
x <- crop(r, pp)
m <- mask(x, pp)
plot(m)
lines(pp)
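Applied to the objects from the question, the same pattern would look roughly like this (a sketch; DRCWorldPop and the two-province drc subset are assumed to be in memory, and the output filename is made up):
x <- crop(DRCWorldPop, drc)   # crop to the bounding box of the two provinces
m <- mask(x, drc)             # set cells outside the province polygons to NA
# m[is.na(m)] <- 0            # optional: zeros instead of NA for the neighbouring areas
writeRaster(m, "Ituri_NordKivu_pop.tif", format = "GTiff", overwrite = TRUE)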
Also see this question from only yesterday...

Deep CNN doesn't learn and accuracy just stay in same value

I have a deep CNN based on ResNet, and a dataset of shape (10000, 50, 50, 1) for classifying digits. When I run it and it starts learning, the accuracy just stops at some value and gently oscillates (around 0.2). I am wondering whether it is overfitting or whether there is another issue involved.
Here is the identity block:
def identity_block(X, f, filters, stage, block):
    # defining name basics
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    # retrieve filters
    F1, F2, F3 = filters
    # save the shortcut
    X_shortcut = X
    # first component
    X = Conv2D(filters=F1, kernel_size=(1, 1), strides=(1, 1), padding='valid', name=conv_name_base + '2a',
               kernel_initializer=initializers.glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base + '2a')(X)
    X = Activation('relu')(X)
    # second component
    X = Conv2D(filters=F2, kernel_size=(f, f), strides=(1, 1), padding='same', name=conv_name_base + '2b',
               kernel_initializer=initializers.glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base + '2b')(X)
    X = Activation('relu')(X)
    # third component
    X = Conv2D(filters=F3, kernel_size=(1, 1), strides=(1, 1), padding='valid', name=conv_name_base + '2c',
               kernel_initializer=initializers.glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base + '2c')(X)
    # final component: add the shortcut back in
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)
    return X
And the convolutional block:
def conv_block(X, f, filters, stage, block, s=2):
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    # Retrieve filters
    F1, F2, F3 = filters
    # Save shortcut
    X_shortcut = X
    # First component
    X = Conv2D(F1, kernel_size=(1, 1), strides=(s, s), name=conv_name_base + '2a',
               kernel_initializer=initializers.glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base + '2a')(X)
    X = Activation('relu')(X)
    # Second component
    X = Conv2D(F2, kernel_size=(f, f), strides=(1, 1), padding='same', name=conv_name_base + '2b',
               kernel_initializer=initializers.glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base + '2b')(X)
    X = Activation('relu')(X)
    # Third component
    X = Conv2D(F3, kernel_size=(1, 1), strides=(1, 1), name=conv_name_base + '2c', padding='valid',
               kernel_initializer=initializers.glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name=bn_name_base + '2c')(X)
    # Shortcut path
    X_shortcut = Conv2D(F3, kernel_size=(1, 1), strides=(s, s), name=conv_name_base + '1',
                        kernel_initializer=initializers.glorot_uniform(seed=0))(X_shortcut)
    X_shortcut = BatchNormalization(axis=3, name=bn_name_base + '1')(X_shortcut)
    # Finally, add the shortcut back in
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)
    return X
And finally the ResNet:
def ResNet(input_shape=(50, 50, 1), classes=10):
    inp = Input(shape=(50, 50, 1))
    # zero padding
    X = ZeroPadding2D((3, 3), name='pad0')(inp)
    # stage 1
    X = Conv2D(32, (5, 5), name='conv1', input_shape=input_shape,
               kernel_initializer=initializers.glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis=3, name='bn1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((2, 2), name='pool1')(X)
    # Stage 2
    stage2_filtersize = 32
    X = conv_block(X, 3, filters=[stage2_filtersize, stage2_filtersize, stage2_filtersize], stage=2, block='a', s=1)
    X = identity_block(X, 3, [stage2_filtersize, stage2_filtersize, stage2_filtersize], stage=2, block='b')
    X = identity_block(X, 3, [stage2_filtersize, stage2_filtersize, stage2_filtersize], stage=2, block='c')
    # Stage 3
    stage3_filtersize = 64
    X = conv_block(X, 3, filters=[stage3_filtersize, stage3_filtersize, stage3_filtersize], stage=3, block='a', s=1)
    X = identity_block(X, 3, [stage3_filtersize, stage3_filtersize, stage3_filtersize], stage=3, block='b')
    X = identity_block(X, 3, [stage3_filtersize, stage3_filtersize, stage3_filtersize], stage=3, block='c')
    # Stage 4
    stage4_filtersize = 128
    X = conv_block(X, 3, filters=[stage4_filtersize, stage4_filtersize, stage4_filtersize], stage=4, block='a', s=1)
    X = identity_block(X, 3, [stage4_filtersize, stage4_filtersize, stage4_filtersize], stage=4, block='b')
    X = identity_block(X, 3, [stage4_filtersize, stage4_filtersize, stage4_filtersize], stage=4, block='c')
    # final pooling
    X = AveragePooling2D((2, 2), padding='same', name='Pool0')(X)
    # FC
    X = Flatten(name='D0')(X)
    X = Dense(classes, activation='softmax', kernel_initializer=initializers.glorot_uniform(seed=0), name='D2')(X)
    # create model
    model = Model(inputs=inp, outputs=X)
    return model
Update 1: here are the compile and fit calls:
model.compile(optimizer='adam',
              loss=tensorflow.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
print("model compiled settings imported successfully")

early_stopping = EarlyStopping(monitor='val_loss', patience=2)
model.fit(X_train, Y_train, validation_split=0.2, callbacks=[early_stopping], epochs=10)
test_loss, test_acc = model.evaluate(X_test, Y_test, verbose=2)
First try normalizing the values of the digit images (50x50).
Then also consider how a neural network learns its weights. Convolutional neural networks learn by continually adding gradient error vectors, multiplied by a learning rate computed from backpropagation, to the various weight matrices throughout the network as training examples are passed through.
The most important thing to consider is the multiplication by the learning rate: if the training inputs are not scaled, the ranges of the feature values will likely differ from feature to feature, so the learning rate will cause corrections in each dimension that differ from one another. This is effectively random, so the network could overcompensate a correction in one weight dimension while undercompensating in another, which is far from ideal and can result in an oscillating state or a very slow training state.
Oscillating means the model is unable to home in on a better optimum for the weights.
Slow training means it is moving too slowly to reach a better optimum.
This is why it is common practice to normalize images before using them as input to a neural network, or to any gradient-based model.
TF_Support's answer:
Provide a few samples of the dataset, the loss curve, and the accuracy plot so we can clearly understand what you're trying to learn; that is more important than the code you provided.
I would guess you are trying to learn very hard samples; 50 by 50 grayscale is not much to work with. Is your network overfitting? (We could only figure that out by looking at plots of the validation metrics.) (Is 0.2 your training accuracy?)
First do a sanity check on the dataset by training a very simple CNN. I see you have 10 classes (not sure, just guessing from the function's default value), so random-guess accuracy is 10%; set a baseline first with a simple CNN and then try to improve on it with the ResNet.
Increase the learning rate and see how the accuracy fluctuates. After a few epochs, reduce the learning rate once the accuracy is better than the baseline.
