Identifying outlying datapoints from residuals (GeoLight package) - geolocation

I am analysing some data collected from a geolocator placed on a migratory bird. In a nutshell, my data are sunrise and sunset times, which are then used to determine position on the globe.
I am using the GeoLight package (http://cran.r-project.org/web/packages/GeoLight/GeoLight.pdf) to identify outlying data points - specifically the loessFilter function, which fits a local polynomial (loess) regression and flags residuals that lie more than 3 interquartile ranges beyond the quartiles (the multiplier is set by k when applying the function).
My problem: the function produces plots in which the outlying data points are marked in red, but there seems to be an issue with the TRUE/FALSE vector it returns to say which points are outliers - every element comes back TRUE, even when outliers are clearly marked on the plots.
I have therefore modified the function code to return the indices of the outlying residuals instead.
However, when I then remove those rows from the original dataset and re-run the function, the outlying points are still there. So there is some discrepancy between the residual indices and the rows of the original data: if the output states that residual 78 is an outlier, removing row 78 from the original data does not remove the outlying data point.
I would very much appreciate some help with removing the outlying datapoints identified using the function. It seems like a very easy fix but I can't seem to figure it out.
Code for full function and data below
Thanks
Emma
log2$tFirst<-as.POSIXlt(log2$tFirst)
log2$tSecond<-as.POSIXlt(log2$tSecond)
CODE TO GET OUTLYING RESIDUALS
i.get.outliers<-function(residuals, k=3) {
x <- residuals
# x is a vector of residuals
# k is a measure of how many interquartile ranges to take before saying that point is an outlier
# it looks like 3 is a good preset for k
QR<-quantile(x, probs = c(0.25, 0.75))
IQR<-QR[2]-QR[1]
Lower.band<-QR[1]-(k*IQR)
Upper.Band<-QR[2]+(k*IQR)
delete<-which(x<Lower.band | x>Upper.Band)
return(as.vector(delete))
}
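As a quick illustration of what this helper returns (purely synthetic residuals, not the real data):
set.seed(1)
res <- c(rnorm(100), 10)     # 100 well-behaved residuals plus one extreme value
i.get.outliers(res, k = 3)   # returns 101, the index of the extreme residual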
LOESS FILTER FUNCTION CODE
loessFilter <- function(tFirst, tSecond, type, k=3, plot=TRUE){
tw <- data.frame(datetime=as.POSIXct(c(tFirst,tSecond),"UTC"),type=c(type,ifelse(type==1,2,1)))
tw <- tw[!duplicated(tw$datetime),]
tw <- tw[order(tw[,1]),]
hours <- as.numeric(format(tw[,1],"%H"))+as.numeric(format(tw[,1],"%M"))/60
for(t in 1:2){
cor <- rep(NA, 24)
for(i in 0:23){
cor[i+1] <- max(abs((c(hours[tw$type==t][1],hours[tw$type==t])+i)%%24 -
(c(hours[tw$type==t],hours[tw$type==t][length(hours)])+i)%%24),na.rm=T)
}
hours[tw$type==t] <- (hours[tw$type==t] + (which.min(round(cor,2)))-1)%%24
}
dawn <- data.frame(id=1:sum(tw$type==1),
datetime=tw$datetime[tw$type==1],
type=tw$type[tw$type==1],
hours = hours[tw$type==1], filter=FALSE)
dusk <- data.frame(id=1:sum(tw$type==2),
datetime=tw$datetime[tw$type==2],
type=tw$type[tw$type==2],
hours = hours[tw$type==2], filter=FALSE)
for(d in seq(30,k,length=5)){
predict.dawn <- predict(loess(dawn$hours[!dawn$filter]~as.numeric(dawn$datetime[!dawn$filter]),span=0.1))
predict.dusk <- predict(loess(dusk$hours[!dusk$filter]~as.numeric(dusk$datetime[!dusk$filter]),span=0.1))
del.dawn <- i.get.outliers(as.vector(residuals(loess(dawn$hours[!dawn$filter]~
as.numeric(dawn$datetime[!dawn$filter]),span=0.1))),k=d)
del.dusk <- i.get.outliers(as.vector(residuals(loess(dusk$hours[!dusk$filter]~
as.numeric(dusk$datetime[!dusk$filter]),span=0.1))),k=d)
if(length(del.dawn)>0) dawn$filter[!dawn$filter][del.dawn] <- TRUE
if(length(del.dusk)>0) dusk$filter[!dusk$filter][del.dusk] <- TRUE
}
if(plot){
par(mfrow=c(2,1),mar=c(3,3,0.5,3),oma=c(2,2,0,0))
plot(dawn$datetime[dawn$type==1],dawn$hours[dawn$type==1],pch="+",cex=0.6,xlab="",ylab="",yaxt="n")
lines(dawn$datetime[!dawn$filter], predict(loess(dawn$hours[!dawn$filter]~as.numeric(dawn$datetime[!dawn$filter]),span=0.1)) , type="l")
points(dawn$datetime[dawn$filter],dawn$hours[dawn$filter],col="red",pch="+",cex=1)
axis(2,labels=F)
mtext("Sunrise",4,line=1.2)
plot(dusk$datetime[dusk$type==2],dusk$hours[dusk$type==2],pch="+",cex=0.6,xlab="",ylab="",yaxt="n")
lines(dusk$datetime[!dusk$filter], predict(loess(dusk$hours[!dusk$filter]~as.numeric(dusk$datetime[!dusk$filter]),span=0.1)), type="l")
points(dusk$datetime[dusk$filter],dusk$hours[dusk$filter],col="red",pch="+",cex=1)
axis(2,labels=F)
legend("bottomleft",c("Outside filter","Inside filter"),pch=c("+","+"),col=c("black","red"),
bty="n",cex=0.8)
mtext("Sunset",4,line=1.2)
mtext("Time",1,outer=T)
mtext("Sunrise/Sunset hours (rescaled)",2,outer=T)
}
all <- rbind(subset(dusk,filter),subset(dawn,filter))
filter <- rep(FALSE,length(tFirst))
filter[tFirst%in%all$datetime | tSecond%in%all$datetime] <- TRUE
# original code:
#return(!filter)
# altered code to return outliers:
#return(del.dusk)
# replace with code below to print outlying points
return(c("delete dawn",del.dawn,"delete dusk",del.dusk))
}
APPLY FUNCTION
loessFilter(log2$tFirst, log2$tSecond, type=1, k=4, plot=TRUE)
Remove the values - both the sunrise and sunset records need to be removed:
log2b<-log2[-c(77,78,124,125),]
length(log2$tFirst)
length(log2b$tFirst)
Repeat the function to see if the values have gone:
loessFilter(log2b$tFirst, log2b$tSecond, type=1, k=4, plot=TRUE)
The outliers are still there!
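For what it's worth, here is a minimal sketch of how the flagged points could be mapped back to log2, assuming the function is altered to return the flagged datetimes (the all data frame it already builds) rather than the subset indices - untested on the real data:
# inside loessFilter, replace the final return() with, e.g.:  return(all$datetime)
bad.times <- loessFilter(log2$tFirst, log2$tSecond, type = 1, k = 4, plot = TRUE)
drop <- as.numeric(as.POSIXct(log2$tFirst)) %in% as.numeric(bad.times) |
as.numeric(as.POSIXct(log2$tSecond)) %in% as.numeric(bad.times)
log2b <- log2[!drop, ]
loessFilter(log2b$tFirst, log2b$tSecond, type = 1, k = 4, plot = TRUE)
Matching on the datetimes (here via their numeric values) avoids the index problem: the indices returned above refer to the de-duplicated, re-ordered dawn/dusk subsets built inside the function, not to rows of log2.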
HERE ARE THE DATA:
http://www.4shared.com/file/jxVuTsVHce/002_geolight.html
A bit too long to post the full data here and the example won't work with a dummy dataset :)

Related

Use the Survey package to weight observations in stacked imputations

I am exploring model variable selection within imputed data.
One technique is to stack imputations in long format (where n observations in M imputed datasets creates a dataset n x M long), and use weighted regression to reduce the contribution of each observation proportionally to the number of imputations. If we treated the stacked dataset as one large dataset, the standard errors would be too small.
I am trying to use the weights argument in svyglm to account for the stacked data, resulting in SEs that you would expect with n observations, rather than n x M observations.
To illustrate:
library(mice)
### create data
set.seed(42)
n <- 50
id <- 1:n
var1 <- rbinom(n,1,0.4)
var2 <- runif(n,30,80)
var3 <- rnorm(n, mean = 12, sd = 5)
var4 <- rnorm(n, mean = 100, sd = 20)
prob <- (((var1*var2)+var3)-min((var1*var2)+var3)) / (max((var1*var2)+var3)-min((var1*var2)+var3))
outcome <- rbinom(n, 1, prob = prob)
data <- data.frame(id, var1, var2, var3, var4, outcome)
### Add missingness
data_miss <- ampute(data)
patt <- data_miss$patterns
patt <- patt[2:5,]
data_miss <- ampute(data, patterns = patt)
data_miss <- data_miss$amp
## create 5 imputed datasets
nimp <- 5
imp <- mice(data_miss, m = nimp)
## Stack data
data_long <- complete(imp, action = "long")
## Generate model in stacked data (SEs will be too small)
modlong <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = data_long)
summary(modlong)
The long data give overly small SEs, as we've increased the size of our dataset by 5x; a quick back-of-the-envelope check follows the output:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.906417 0.965090 -3.012 0.0026 **
var1 2.221053 0.311167 7.138 9.48e-13 ***
var2 -0.002543 0.010468 -0.243 0.8081
var3 0.076955 0.032265 2.385 0.0171 *
var4 0.006595 0.008031 0.821 0.4115
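The deflation is roughly what you would expect: standard errors scale with about 1/sqrt(n), and stacking multiplies n by nimp = 5.
sqrt(nimp)              # ~2.24
0.311167 * sqrt(nimp)   # the var1 SE rescaled back: ~0.70, close to the single-imputation value (~0.76) shown further down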
Add weights
data_long$weight <- 1/nimp
library(survey)
des <- svydesign(ids = ~1, data = data_long, weights = ~weight)
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des)
summary(mod_svy)
The weighted regression gives similar SEs to the unweighted model
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036691 -2.804 0.00546 **
var1 2.221053 0.310906 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
Adding rescale = F (which, as I understand it, stops the weights being rescaled to sum to the sample size) doesn't change anything:
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des, rescale = F)
summary(mod_svy)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036688 -2.804 0.00546 **
var1 2.221053 0.310905 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
I would have expected SEs similar to those obtained when running a model in a single imputed dataset
## Assess SEs in single imputation
mod_singleimp <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = complete(imp,1))
summary(mod_singleimp)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.679589 2.116806 -1.266 0.20556
var1 2.476193 0.761195 3.253 0.00114 **
var2 0.014823 0.025350 0.585 0.55874
var3 0.048940 0.072752 0.673 0.50114
var4 -0.004551 0.017986 -0.253 0.80026
All assistance greatly appreciated, as would suggestions of other ways to achieve the same goal.
Alternative options
The psfmi package allows for stepwise selection in multiply imputed datasets and pooling of the resulting models. However, it is computationally intensive and slow with large datasets, particularly if the process needs to be bootstrapped (e.g. during internal validation) - hence the search for a less intensive stacking approach.
Sorry, no, this isn't going to work.
To handle stacked imputation data with weights you need frequency weights, so that a weight of 1/10 means you have 1/10 of an observation. With svydesign you specify sampling weights, so that a weight of 1/10 means your observation represents 10 observations in the population. These will (and should) give different standard errors. Pretending you have frequency weights when you actually have imputations is a clever hack to avoid having software that understands what it's doing, which is fine but isn't compatible with survey, which understands what it's doing and is doing something different.
Currently, if you want to use svyglm with multiple imputations you need to compute the standard errors separately -- most conveniently with Rubin's rules using mitools::MIcombine, which is set up to work with the survey package (see the help for with.svyimputationList and withPV).
It might be worth putting in a feature request to the mitools or survey developers (with citations to examples) to allow for stacked analysis of imputations, but this isn't just a matter of adjusting the weights.
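For reference, a minimal sketch of the Rubin's-rules route with mitools that this answer points to, reusing the mice object imp from the question (a sketch under those assumptions, not a tested drop-in):
library(mitools)
# fit the model separately in each completed dataset, then pool with Rubin's rules
imp_list <- imputationList(lapply(seq_len(nimp), function(i) complete(imp, action = i)))
fits <- with(imp_list, glm(outcome ~ var1 + var2 + var3 + var4, family = binomial))
summary(MIcombine(fits))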

Select prior probability of inclusion in CausalImpact or bsts?

In the CausalImpact package, the supplied covariates are independently selected with some prior probability M/J, where M is the expected model size and J is the number of covariates. However, on page 11 of the paper, they say they get the values by "asking about the expected model size M." I checked the documentation for CausalImpact but was unable to find any more information. Where is this done in the package? Is there a parameter I can set in a function call to specify my desired M?
You are right, this is not directly possible through CausalImpact itself, but it can be done. CausalImpact uses bsts behind the scenes, and that package lets you set the parameter. So you have to define your model with bsts first, set the parameter there, and then pass the fitted model to your CausalImpact call, like this (modified example from the CausalImpact manual):
post.period <- c(71, 100)
post.period.response <- y[post.period[1] : post.period[2]]
y[post.period[1] : post.period[2]] <- NA
ss <- AddLocalLevel(list(), y)
bsts.model <- bsts(y ~ x1, ss, niter = 1000, expected.model.size = 4)
impact <- CausalImpact(bsts.model = bsts.model,
post.period.response = post.period.response)
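To tie this back to the M/J prior mentioned in the question, the implied prior inclusion probability is simply the ratio of the expected model size to the number of candidate covariates (the numbers below are purely illustrative):
J <- 20   # hypothetical number of candidate covariates
M <- 4    # expected.model.size passed to bsts()
M / J     # prior probability that any given covariate enters the model: 0.2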

if (freq) x$counts else x$density length > 1 and only the first element will be used

For my thesis I have to calculate the number of workers at risk of substitution by machines. I have calculated the probability of substitution (X) and the number of employees at risk (Y) for each occupation category. I have a dataset like this:
X Y
1 0.1300 0
2 0.1000 0
3 0.0841 1513
4 0.0221 287
5 0.1175 3641
....
700 0.9875 4000
I tried to plot a histogram with this command:
hist(dataset1$X,dataset1$Y,xlim=c(0,1),ylim=c(0,30000),breaks=100,main="Distribution",xlab="Probability",ylab="Number of employee")
But I get this error:
In if (freq) x$counts else x$density
length > 1 and only the first element will be used
Can someone tell me what is the problem and write me the right command?
Thank you!
It is worth pointing out that the message displayed is a Warning message, and should not prevent the results being plotted. However, it does indicate there are some issues with the data.
Without the full dataset, it is not 100% obvious what may be the problem. I believe it is caused by the data not being in the correct format, with two potential issues. Firstly, some values have a value of 0, and these won't be plotted on the histogram. Secondly, the observations appear to be inconsistently spaced.
Histograms are best built from one of two kinds of input:
A data frame that has already been aggregated into consistently sized bins (a sketch of this route follows the code below).
A vector of the individual X values in the data.
I prefer the second technique. As originally shown here, the expandRows() function in the splitstackshape package can be used to repeat each row of the data frame according to its count column:
set.seed(123)
dataset1 <- data.frame(X = runif(900, 0, 1), Y = round(runif(900, 0, 1000)))  # integer counts for expandRows
library(splitstackshape)
dataset2 <- expandRows(dataset1, "Y")
hist(dataset2$X, xlim=c(0,1))
# alternatively, for the first (aggregation) route, start by binning X:
dataset1$bins <- cut(dataset1$X, breaks = seq(0,1,0.01), labels = FALSE)
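A sketch of one way to finish that aggregation route, summing Y within each bin and plotting the totals (illustrative only, using the simulated dataset1 from above):
bin.totals <- tapply(dataset1$Y, dataset1$bins, sum)   # total number of employees per probability bin
barplot(bin.totals, xlab = "Probability bin", ylab = "Number of employees")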

Find the area of a region using monte carlo methods

I am interested in finding the area of the dark gray region. In R I have used the area() function from the spatstat package, but it is very time consuming since I need to run it inside an MH (Metropolis-Hastings) algorithm. I have read that I can calculate this area using a Monte Carlo approach, but I do not know how to proceed - can someone help me?
Some example code from the spatstat book for the plot and for the area:
source("figurelayout.R")
source("startup.R")
library(spatstat)
requireversion(spatstat, "1.41-1.073")
W <- as.owin(swedishpines)
x <- c(28,29,55,60,66)
y <- c(70,38,32,72,59)
X <- ppp(x=x,y=y, window = W)
u <- list(x=48,y=50)
u <- as.ppp(u, W)
rad <- 14
Xplusr <- dilation(X, rad) # dilation of X by radius rad
uplusr <- disc(rad, u) # disc of radius rad centred at u
ovlap <- intersect.owin(uplusr, Xplusr)
B<-setminus.owin( disc( rad, c(u$x, u$y) ), Xplusr)
AIdemo <- layered(W,
ovlap,
Xplusr,
uplusr,
X,
u,B)
layerplotargs(AIdemo) <- list(list(),
list(col="darkgrey", border=NA),
list(lwd=2),
list(lwd=2, lty=2),
list(pch=16),
list(pch=3),
list(col="red", border=NA))
###################################################
### code chunk number 98: Unit.Rnw:3-5
###################################################
newplot(6, 0.7)
setmargins(0)
###################################################
### code chunk number 99: 13gibbs.Rnw:3214-3215
###################################################
plot(AIdemo, main="")
plot(setminus.owin( disc( rad, c(u$x, u$y) ), Xplusr))
area<-area(B)
What you are asking for is a simple Monte Carlo application. You will need to draw uniform samples across a bounding region that contains your region of interest, and take the proportion of samples that fall inside it.
To check whether a point lies within the region, you can describe it with the centre-radius form of the circle equation (your region is just a circle); a rough sketch follows the link below.
Follow the link:
http://geekyisawesome.blogspot.com.au/2013/03/montecarlo-method-of-finding-area-of.html
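A rough base-R sketch of that idea (cx, cy, r and the box limits x.lo, x.hi, y.lo, y.hi are placeholders for your circle and sampling window):
n <- 1e5
px <- runif(n, x.lo, x.hi)                     # uniform samples over the bounding box
py <- runif(n, y.lo, y.hi)
inside <- (px - cx)^2 + (py - cy)^2 <= r^2     # centre-radius containment test
(x.hi - x.lo) * (y.hi - y.lo) * mean(inside)   # Monte Carlo estimate of the circle's area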
Your question title suggests that you want to use Monte Carlo, but the question itself just says "I am interested in finding the area of the dark gray region ...". Another way to calculate this area in spatstat is to use the function areaLoss.
Continuing your example (which I guess is really copyrighted example code from the companion website of the spatstat book):
Y <- superimpose(u, X) # Centres of all circles
areaLoss(Y, rad, exact = TRUE) # Remaining area in each circle
areaLoss(Y, rad, exact = TRUE, subset = 1) # Only circle of interest
Depending on how many overlapping circles you have this may be faster than what you currently are doing. Also depending on the number of circles setting exact = FALSE may be faster.
If you really wanted to estimate the area of B by simple Monte Carlo it can be done in a few lines of code:
Bbox <- as.rectangle(B)
points <- runifpoint(1000, Bbox)
insideB <- inside.owin(points, w = B)
fraction <- mean(insideB)
area(Bbox) * fraction
Thank you for your comments, here is the code for my solution :)
x.min <- W$xrange[1]
x.max <- W$xrange[2]
y.min <- W$yrange[1]
y.max <- W$yrange[2]
n.sims <- 1000
random.points <- data.frame(Xinterval=runif(n=n.sims,x.min,x.max),
Yinterval=runif(n=n.sims,y.min,y.max))
contador <- 0 # counter ("contador") for points that fall inside B
for ( ii in 1:n.sims ) {
if(inside.owin(random.points$Xinterval[ii], random.points$Yinterval[ii], w=B)
){
contador<-contador+1
}
}
estimatedArea <- area(W)*contador/n.sims # fraction of hits times the window area (valid here because W is a rectangle)

Glm with caret package producing "missing values in resampled performance measures"

I obtained the following code from this Stack Overflow question: "caret train() predicts very different then predict.glm()".
The following code is producing an error.
I am using caret 6.0-52.
library(car); library(caret); library(e1071)
#data import and preparation
data(Chile)
chile <- na.omit(Chile) #remove "na's"
chile <- chile[chile$vote == "Y" | chile$vote == "N" , ] #only "Y" and "N" required
chile$vote <- factor(chile$vote) #required to remove unwanted levels
chile$income <- factor(chile$income) # treat income as a factor
tc <- trainControl("cv", 2, savePredictions=T, classProbs=TRUE,
summaryFunction=twoClassSummary) # "cv" = cross-validation, 2-fold
fit <- train(chile$vote ~ chile$sex +
chile$education +
chile$statusquo ,
data = chile ,
method = "glm" ,
family = binomial ,
metric = "ROC",
trControl = tc)
Running this code produces the following error.
Something is wrong; all the ROC metric values are missing:
ROC Sens Spec
Min. : NA Min. :0.9354 Min. :0.9187
1st Qu.: NA 1st Qu.:0.9354 1st Qu.:0.9187
Median : NA Median :0.9354 Median :0.9187
Mean :NaN Mean :0.9354 Mean :0.9187
3rd Qu.: NA 3rd Qu.:0.9354 3rd Qu.:0.9187
Max. : NA Max. :0.9354 Max. :0.9187
NA's :1
Error in train.default(x, y, weights = w, ...) : Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Would anyone know what the issue is, or be able to reproduce (or not reproduce) this error? I've seen other answers to this error message saying it has to do with classes not being represented in each cross-validation fold, but that isn't the issue here, as the number of folds is set to 2.
Looks like I needed to install and load the pROC package.
install.packages("pROC")
library(pROC)
You should install using
install.packages("caret", dependencies = c("Imports", "Depends", "Suggests"))
That gets most of the default packages. If there are specific modeling packages that are missing, the code usually prompts you to install them.
I know I'm late to the party, but I think you need to set classProbs = TRUE in train control.
You are using logistic regression when using the parameters method = "glm", family = binomial.
In this case, you must make sure that the target variable (chile$vote) has only 2 factor levels, because logistic regression only performs binary classification.
If the target has more than two labels, then you must set family = "multinomial"
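As a quick sanity check on the points above (using the chile object from the question):
nlevels(chile$vote)   # logistic regression (method = "glm", family = binomial) needs exactly 2 levels
levels(chile$vote)    # with classProbs = TRUE these must also be valid R names; "N" and "Y" are fine
table(chile$vote)     # both classes should appear in every resampling fold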
