missing value where TRUE/FALSE needed with Caret

I have a data frame that contains date variables
(the test data and code are available here).
When I use functions = caretFuncs, I get the following error message:
Error in { : task 1 failed - "missing value where TRUE/FALSE needed"
In addition: Warning messages:
1: In FUN(newX[, i], ...) : NAs introduced by coercion
2: In FUN(newX[, i], ...) : NAs introduced by coercion
3: In FUN(newX[, i], ...) : NAs introduced by coercion
4: In FUN(newX[, i], ...) : NAs introduced by coercion
5: In FUN(newX[, i], ...) : NAs introduced by coercion
6: In FUN(newX[, i], ...) : NAs introduced by coercion
7: In FUN(newX[, i], ...) : NAs introduced by coercion
8: In FUN(newX[, i], ...) : NAs introduced by coercion
9: In FUN(newX[, i], ...) : NAs introduced by coercion
10: In FUN(newX[, i], ...) : NAs introduced by coercion
What can I do?
Code to reproduce the error:
library(mlbench)
library(caret)
library(maps)
library(rgdal)
library(raster)
library(sp)
library(spdep)
library(GWmodel)
library(e1071)
library(plyr)
library(kernlab)
library(zoo)
mydata <- read.csv("Realestatedata_all_delete_date.csv", header=TRUE)
mydata$estate_TransDate <- as.Date(paste(mydata$estate_TransDate,1,sep="-"),format="%Y-%m-%d")
mydata$estate_HouseDate <- as.Date(mydata$estate_HouseDate,format="%Y-%m-%d")
rfectrl <- rfeControl(functions=caretFuncs, method="cv",number=10,verbose=TRUE,returnResamp = "final")
results <- rfe(mydata[,1:48],mydata[,49],sizes = c(1:48),rfeControl=rfectrl,method = "svmRadial")
print(results)
predictors(results)
plot(results, type=c("g", "o"))

You have NAs (missing values) in mydata in the following input variables (which you feed to the classifier):
colnames(mydata)[unique(which(is.na(mydata[,1:48]), arr.ind = TRUE)[,2])]
gives:
[1] "Aport_Distance" "Univ_Distance" "ParkR_Distance"
[4] "TRA_StationDistance" "THSR_StationDistance" "Schools_Distance"
[7] "Lib_Distance" "Sport_Distance" "ParkS_Distance"
[10] "Hyper_Distance" "Shop_Distance" "Post_Distance"
[13] "Hosp_Distance" "Gas_Distance" "Incin_Distance"
[16] "Mort_Distance"
In addition, your date variables (transaction date and house date) appear to be converted to NAs inside rfe(..), which would explain the "NAs introduced by coercion" warnings.
The SVM regressor does not seem to handle NAs as is.
I would convert the dates to a numeric quantity such as 'years since a given reference date':
mydata$estate_TransAge <- as.numeric(as.Date("2015-11-01") - mydata$estate_TransDate) / 365.25
mydata$estate_HouseAge <- as.numeric(as.Date("2015-11-01") - mydata$estate_HouseDate) / 365.25
# define the set of input variables
inputVars = setdiff(colnames(mydata),
# exclude these
c("estate_TransDate", "estate_HouseDate", "estate_TotalPrice")
)
And also remove those entries with any NA in any of the columns you use as input to the regressor:
traindata <- mydata[complete.cases(mydata[,inputVars]),]
then run rfe with:
rfectrl <- rfeControl(functions=caretFuncs, method="cv",number=10,verbose=TRUE,returnResamp = "final")
results <- rfe(
traindata[,inputVars],
traindata[,"estate_TotalPrice"],
rfeControl=rfectrl,
method = "svmRadial"
)
In my case, this would have taken a long time to complete, so I tested it on only one percent of the data (sample_frac comes from dplyr, which the code above does not load):
traindata <- dplyr::sample_frac(traindata, 0.01)
The question remains what to do when you are given data for price prediction in which some of the input variables are NA.

Related

Leave one out cross validation RStudio randomForest package error

I am trying to create a LOOCV loop using the randomForest package. I have adapted the following code from this link (https://stats.stackexchange.com/questions/459293/loocv-in-caret-package-randomforest-example-not-unique-results), but I cannot get it to run successfully.
Here is the code that I am running but on the iris dataset.
irisdata <- iris[1:150,]
predictionsiris <- 1:150
for (k in 1:150){
set.seed(123)
predictioniris[k] <- predict(randomForest(Petal.Width ~ Sepal.Length, data = irisdata[-k], ntree = 10), newdata = irisdata[k,,drop=F])[2]
}
What I would expect to happen is for it to run the random forest model on all but one row and then use that one row to test the model.
However, when I run this code, I get the following error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'predict': object 'Sepal.Length' not found
Any suggestions? I have been messing around with LOOCV code for the past two days including messing with code in this page (Compute Random Forest with a leave one ID out cross validation in R) and running the following:
iris %>%
mutate(ID = 1:516)
loocv <- NULL
for(i in iris$ID){
test[[i]] <- slice(iris, i)
train[[i]] <- slice(iris, i+1:516)
rf <- randomForest(Sepal.Length ~., data = train, ntree = 10, importance = TRUE)
loocv[[i]] <- predict(rf, newdata = test)
}
but I have had no success. Any help would be appreciated.

Type error when use apex.amp O1 in PyTorch

When I try to use NVIDIA apex.amp O1 to accelerate my training, it reports an error at this line of my code: logits = einsum('b x y d, r d -> b x y r', q, rel_k)
RuntimeError: expected scalar type Half but found Float
I take this to mean that rel_k should be a torch.HalfTensor.
rel_k is defined as follows: self.rel_height = nn.Parameter(torch.randn(height * 2 - 1, dim_head) * scale)
But when I set the type of rel_k to torch.HalfTensor, I get another error telling me not to specify the dtype manually:
RuntimeError: Found param encoder.layers.0.blocks.0.attn.rel_pos_emb.rel_height with type torch.cuda.HalfTensor, expected torch.cuda.FloatTensor.
When using amp.initialize, you do not need to call .half() on your model
before passing it, no matter what optimization level you choose.
What should I do to use amp O1 correctly in my code?

Use the Survey package to weight observations in stacked imputations

I am exploring model variable selection within imputed data.
One technique is to stack imputations in long format (where n observations in M imputed datasets creates a dataset n x M long), and use weighted regression to reduce the contribution of each observation proportionally to the number of imputations. If we treated the stacked dataset as one large dataset, the standard errors would be too small.
I am trying to use the weights argument in svyglm to account for the stacked data, so that the SEs are what you would expect with n observations rather than n x M observations.
To illustrate:
library(mice)
### create data
set.seed(42)
n <- 50
id <- 1:n
var1 <- rbinom(n,1,0.4)
var2 <- runif(n,30,80)
var3 <- rnorm(n, mean = 12, sd = 5)
var4 <- rnorm(n, mean = 100, sd = 20)
prob <- (((var1*var2)+var3)-min((var1*var2)+var3)) / (max((var1*var2)+var3)-min((var1*var2)+var3))
outcome <- rbinom(n, 1, prob = prob)
data <- data.frame(id, var1, var2, var3, var4, outcome)
### Add missingness
data_miss <- ampute(data)
patt <- data_miss$patterns
patt <- patt[2:5,]
data_miss <- ampute(data, patterns = patt)
data_miss <- data_miss$amp
## create 5 imputed datasets
nimp <- 5
imp <- mice(data_miss, m = nimp)
## Stack data
data_long <- complete(imp, action = "long")
## Generate model in stacked data (SEs will be too small)
modlong <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = data_long)
summary(modlong)
the long data gives overly small SEs, as we've increased the size of our dataset by 5x
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.906417 0.965090 -3.012 0.0026 **
var1 2.221053 0.311167 7.138 9.48e-13 ***
var2 -0.002543 0.010468 -0.243 0.8081
var3 0.076955 0.032265 2.385 0.0171 *
var4 0.006595 0.008031 0.821 0.4115
Add weights
data_long$weight <- 1/nimp
library(survey)
des <- svydesign(ids = ~1, data = data_long, weights = ~weight)
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des)
summary(mod_svy)
The weighted regression gives similar SEs to the unweighted model
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036691 -2.804 0.00546 **
var1 2.221053 0.310906 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
Adding rescale = F (to apparently stop weights being rescaled to the sum of the sample size) doesn't change anything
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des, rescale = F)
summary(mod_svy)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036688 -2.804 0.00546 **
var1 2.221053 0.310905 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
I would have expected SEs similar to those obtained when running a model in a single imputed dataset
## Assess SEs in single imputation
mod_singleimp <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = complete(imp,1))
summary(mod_singleimp)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.679589 2.116806 -1.266 0.20556
var1 2.476193 0.761195 3.253 0.00114 **
var2 0.014823 0.025350 0.585 0.55874
var3 0.048940 0.072752 0.673 0.50114
var4 -0.004551 0.017986 -0.253 0.80026
All assistance is greatly appreciated, as are suggestions for other ways of achieving the same goal.
Alternative options
The psfmi package allows for stepwise selection in multiply imputed datasets and pooling of models. However, it is computationally intensive and slow with large datasets, particularly if the process needs to be bootstrapped (e.g. during internal validation), hence the need for a less intensive stacking approach.
Sorry, no, this isn't going to work.
To handle stacked imputation data with weights you need frequency weights, so that a weight of 1/10 means you have 1/10 of an observation. With svydesign you specify sampling weights, so that a weight of 1/10 means your observation represents 10 observations in the population. These will (and should) give different standard errors. Pretending you have frequency weights when you actually have imputations is a clever hack to avoid having software that understands what it's doing, which is fine but isn't compatible with survey, which understands what it's doing and is doing something different.
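To make the distinction concrete, here is a small sketch in Python (plain arithmetic, not the survey package's code) for the simplest estimator, a weighted mean. The data values, the number of copies M, and both helper functions are made up for illustration; the sampling-weight version assumes the usual with-replacement linearization formula with every row treated as its own sampled unit.

```python
from math import sqrt


def se_frequency(y, w):
    """SE of a weighted mean when w are frequency weights,
    i.e. sum(w) is the number of observations you really have."""
    n_eff = sum(w)
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / n_eff
    ss = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
    return sqrt(ss / (n_eff - 1) / n_eff)


def se_sampling(y, w):
    """Linearization SE of a weighted mean when w are sampling weights:
    each row is one sampled unit, whatever its weight."""
    rows = len(y)
    total_w = sum(w)
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    ss = sum((wi * (yi - ybar)) ** 2 for wi, yi in zip(w, y))
    return sqrt(rows / (rows - 1) * ss) / total_w


values = [1.0, 2.0, 4.0, 7.0]   # hypothetical original observations
M = 5                           # hypothetical number of stacked copies
stacked = values * M            # every observation repeated M times
weights = [1 / M] * len(stacked)

se_single = se_frequency(values, [1.0] * len(values))
se_stacked_unweighted = se_frequency(stacked, [1.0] * len(stacked))
se_freq = se_frequency(stacked, weights)
se_samp = se_sampling(stacked, weights)

print(se_single, se_stacked_unweighted, se_freq, se_samp)
```

In this toy case the frequency-weight standard error on the stacked data equals the one from the original observations, while the sampling-weight standard error equals the unweighted stacked one, which is the same pattern as the svyglm output in the question.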
Currently, if you want to use svyglm with multiple imputations, you need to compute the standard errors separately, most conveniently with Rubin's rules using mitools::MIcombine, which is set up to work with the survey package (see the help for with.svyimputationList and withPV).
It might be worth putting in a feature request to the mitools or survey developers (with citations to examples) to allow for stacked analysis of imputations, but this isn't just a matter of adjusting the weights.
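For reference, the pooling that Rubin's rules perform for a single coefficient is short enough to write out. This is a sketch in Python of the standard formulas, not of what mitools::MIcombine does internally (MIcombine also handles full covariance matrices and degrees of freedom); the function name and the three estimate/standard-error pairs are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed datasets with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)                        # average of the m estimates
    within = mean([se ** 2 for se in std_errors])   # average squared SE
    between = variance(estimates)                   # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between          # total variance
    return pooled, sqrt(total)


# Hypothetical per-imputation results for one coefficient
est = [2.0, 2.2, 2.4]
ses = [0.5, 0.5, 0.5]
pooled_est, pooled_se = pool_rubin(est, ses)
print(pooled_est, pooled_se)
```

The pooled standard error is never smaller than the average within-imputation one, because the between-imputation spread is added on top; that extra term is what a stacked analysis with adjusted weights cannot recover.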

Getting an error while converting Tibble to h2o hex file

I am running the h2o package in RStudio and get an error when converting a tibble to an H2O frame.
My code is below:
#Augment Time Series Signature
PO_Data_aug = PO_Data %>%
tk_augment_timeseries_signature()
PO_Data_aug
# Split into training, validation and test sets
train_tbl = PO_Data_aug %>% filter(Date <= '2017-12-29')
valid_tbl = PO_Data_aug %>% filter(Date>'2017-12-29'& Date <='2018-03-31')
test_tbl = PO_Data_aug %>% filter(Date > '2018-03-31')
str(train_tbl)
train_tbl$month.lbl<-as.character(train_tbl$month.lbl)
h2o.init() # Fire up h2o
##hex
train_h2o = as.h2o(train_tbl)
valid_h2o = as.h2o(valid_tbl)
test_h2o = as.h2o(test_tbl)
ERROR: Unexpected HTTP Status code: 412 Precondition Failed (url = http://localhost:54321/3/Parse)
ERROR MESSAGE:
Provided column type ordered is unknown. Cannot proceed with parse due to invalid argument.
Please advise.
This is actually a bug in H2O -- it has nothing to do with tibbles. There is no support for the "ordered" column type in data.frames or tibbles. We will fix this (ticket here).
The work-around right now is to manually convert your "ordered" columns into un-ordered "factor" columns.
tb <- tibble(x = ordered(c(1,2,3)), y = 1:3)
tb$x <- factor(tb$x, ordered = FALSE)
hf <- as.h2o(tb)
as.h2o() expects an R data frame. You could pass a plain R data frame instead of your tibble or, as Tom mentioned in the comments, use one of the file formats that H2O supports. Note that as_data_frame() returns a tibble, so as.data.frame() is the call that gives a plain data frame:
train_h2o = as.h2o(as.data.frame(train_tbl))
valid_h2o = as.h2o(as.data.frame(valid_tbl))
test_h2o = as.h2o(as.data.frame(test_tbl))

Classification with torch model exported from digits - lua 5.1

I'm very new to deep learning and I'm trying to run a classification with Lua.
I've installed DIGITS with Torch and Lua 5.1 and trained the following model:
After that, I ran a classification on the DIGITS server to test the example, and here is the result:
I've exported the model, and now I'm trying to run a classification with the following Lua code:
local image_url = '/home/delpech/mnist/test/5/04131.png'
local network_url = '/home/delpech/models/snapshot_30_Model.t7'
local network_name = paths.basename(network_url)
print '==> Loading network'
local net = torch.load(network_name)
--local net = torch.load(network_name):unpack():float()
net:evaluate()
print(net)
print '==> Loading synsets'
print 'Loads mapping from net outputs to human readable labels'
local synset_words = {}
--for line in io.lines'/home/delpech/models/labels.txt' do table.insert(synset_words, line:sub(11)) end
for line in io.lines'/home/delpech/models/labels.txt' do table.insert(synset_words, line) end
print 'synset words'
for line in io.lines'/home/delpech/models/labels.txt' do print(line) end
print '==> Loading image and imagenet mean'
local im = image.load(image_url)
print '==> Preprocessing'
local I = image.scale(im,28,28,'bilinear'):float()
print 'Propagate through the network, sort outputs in decreasing order and show 10 best classes'
local _,classes = net:forward(I):view(-1):sort(true)
for i=1,10 do
print('predicted class '..tostring(i)..': ', synset_words[classes[i]])
end
But here is the output :
delpech#delpech-K55VD:~/models$ lua classify.lua
==> Downloading image and network
==> Loading network
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> output]
(1): nn.MulConstant
(2): nn.SpatialConvolution(1 -> 20, 5x5)
(3): nn.SpatialMaxPooling(2x2, 2,2)
(4): nn.SpatialConvolution(20 -> 50, 5x5)
(5): nn.SpatialMaxPooling(2x2, 2,2)
(6): nn.View(-1)
(7): nn.Linear(800 -> 500)
(8): nn.ReLU
(9): nn.Linear(500 -> 10)
(10): nn.LogSoftMax
}
==> Loading synsets
Loads mapping from net outputs to human readable labels
synset words
0
1
2
3
4
5
6
7
8
9
==> Loading image and imagenet mean
==> Preprocessing
Propagate through the network, sort outputs in decreasing order and show 5 best classes
predicted class 1: 4
predicted class 2: 8
predicted class 3: 0
predicted class 4: 1
predicted class 5: 9
predicted class 6: 6
predicted class 7: 7
predicted class 8: 2
predicted class 9: 5
predicted class 10: 3
And this is actually not the classification provided by digits...
OK, after searching through the DIGITS source code, it turned out that I had missed two things:
you have to get the mean image from the job folder and apply the following preprocessing:
print '==> Preprocessing'
for i=1,im_mean:size(1) do
im[i]:csub(im_mean[i])
end
and I had to load my images this way and multiply every pixel by 255:
local im = image.load(image_url):type('torch.FloatTensor'):contiguous();
im:mul(255)
Here is the complete answer:
require 'image'
require 'nn'
require 'torch'
require 'paths'
local function main()
print '==> Downloading image and network'
local image_url = '/home/delpech/mnist/test/7/03079.png'
local network_url = '/home/delpech/models/snapshot_30_Model.t7'
local mean_url = '/home/delpech/models/mean.jpg'
print '==> Loading network'
local net = torch.load(network_url)
net:evaluate();
print '==> Loading synsets'
print 'Loads mapping from net outputs to human readable labels'
local synset_words = {}
for line in io.lines'/home/delpech/models/labels.txt' do table.insert(synset_words, line) end
print '==> Loading image and imagenet mean'
-- image.load returns values in [0, 1]; rescale to [0, 255] as described above
local im = image.load(image_url):type('torch.FloatTensor'):contiguous()
im:mul(255)
local I = image.scale(im,28,28,'bilinear'):float()
local im_mean = image.load(mean_url):type('torch.FloatTensor'):contiguous()
im_mean:mul(255)
-- scale the mean image (the original line scaled im here by mistake)
local Imean = image.scale(im_mean,28,28,'bilinear'):float()
print '==> Preprocessing'
-- subtract the mean image channel by channel
for i=1,Imean:size(1) do
I[i]:csub(Imean[i])
end
local _,classes = net:forward(I):sort(true);
for i=1,10 do
print('predicted class '..tostring(i)..': ', synset_words[classes[i]])
end
end
main()
