Clustered resampling for inner layer of Caret recursive feature elimination - r-caret

I have data where IDs are contained within clusters.
I would like to perform recursive feature elimination using Caret's rfe function which performs the following procedure:
Clustered resampling for the outer layer (line 2.1) is straightforward, using the index parameter.
However, within each outer resample, I would like to tune tuning parameters using cluster-based cross-validation (inner resampling) (line 2.9). Model tuning in the inner layer is possible by specifying a tuneGrid in rfe and having an appropriate trControl. It is this trControl that I would like to change to allow clustered resampling.
The outer resampling is specified in the rfeControl parameter of rfe.
The inner resampling is specified by trControl of rfe which is passed to train.
The trouble I am having is that I can't seem to specify any inner indices, because after the outer resampling, those indices are no longer valid or no longer present in the outer-resampled data.
I am looking for a way to tell train to take an outer resample (which will be missing a cluster against which to validate), and to tune the model using inner resampling by based on folds of the remaining clusters.
The MWE is as minimal as possible:
library(caret)
library(tidyverse)
library(parallel)
library(doParallel)
range01 <- function(x){(x-min(x))/(max(x)-min(x))}
### Create some random data, 10 features, with some influence over a binomial outcome
set.seed(42)
id <- 1:1000
cluster <- rep(1:10, each = 100)
dat <- data.frame(id, cluster, replicate(10,rnorm(n = 1000, mean = runif(1, 0,100)+cluster, sd = runif(1, 0,20))))
dat <- dat %>% mutate(temp = rowSums(across(X1:X10)), prob = range01(temp), outcome = rbinom(n = nrow(dat), size = 1, prob = prob))
dat$outcome <- as.factor(dat$outcome)
levels(dat$outcome) <- c("control", "case")
dat$outcome <- factor(dat$outcome, levels=rev(levels(dat$outcome)))
### Manual outer folds-based cluster ###
for(i in 1:10) {
assign(paste0("index", i), which(dat$cluster!=i))
}
unit_indices <- list(index1, index2, index3, index4, index5, index6, index7, index8, index9, index10)
### Inner resampling method (THIS IS WHAT I'D LIKE TO CHANGE) ###
cv5 <- trainControl(classProbs = TRUE, method = "cv", number = 5, allowParallel = F) ## Is there a way to have inner cluster-based resampling WITHIN the outer cluster-based resampling?
caret_rfe_functions <- list(summary = twoClassSummary,
fit = function (x, y, first, last, ...) {
train(x, y, ...)
},
pred = caretFuncs$pred,
rank = function(object, x, y) {
vimp <- varImp(object)$importance
vimp <- vimp[order(vimp$Overall,decreasing = TRUE),,drop = FALSE]
vimp$var <- rownames(vimp)
vimp
},
selectSize = function (x, metric = "ROC", tol = 1, maximize = TRUE)
{
if (!maximize) {
best <- min(x[, metric])
perf <- (x[, metric] - best)/best * 100
flag <- perf <= tol
}
else {
best <- max(x[, metric])
perf <- (best - x[, metric])/best * 100
flag <- perf <= tol
}
min(x[flag, "Variables"])
},
selectVar = caretFuncs$selectVar)
caret_rfe_ctrl <- rfeControl(
functions = caret_rfe_functions,
saveDetails = TRUE,
index = unit_indices,
indexOut = NULL,
returnResamp = "all",
allowParallel = T, ### change this if you don't want to / can't go parallel
verbose = TRUE
)
#### Feature selection ####
set.seed(42)
cl <- makePSOCKcluster(10) ### for parallel processing if available
registerDoParallel(cl)
rfe_profile_nnet <- rfe(
form = outcome ~
X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10,
data = dat,
sizes = seq(2,10,1),
rfeControl = caret_rfe_ctrl,
## pass options to train()
method = "nnet",
preProc = c("center", "scale"),
metric = "ROC",
tuneGrid = expand.grid(size = c(1:5), decay = 5),
trControl = cv5) ### I would like to change this to allow inner cluster-based resampling
stopCluster(cl)
rfe_profile_nnet
plot(rfe_profile_nnet)
Presumably the inner cluster-based resampling would be achieved by specifying a new trainControl containing some dynamic inner index based on the outer resample that is selected at the time:
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = {insert magic here}, ### This is the important bit
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F) ### especially if the outer resample is parallelised
If you try with the original cluster indices e.g.
inner_cluster_tune <- trainControl(classProbs = TRUE,
index = unit_indices,
returnResamp = "all",
summaryFunction = twoClassSummary,
allowParallel = F)
There are various warnings about missing data in the resamples, and things like 24: In [<-.data.frame(*tmp*, , object$method$center, value = structure(list( ... : provided 81 variables to replace 9 variables.
All help greatly appreciated.
As a postscript question , you can see which parameters were used within your rfe like so:
> rfe_profile_nnet$fit
Neural Network
1000 samples
8 predictor
2 classes: 'case', 'control'
Pre-processing: centered (8), scaled (8)
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results across tuning parameters:
size Accuracy Kappa
1 0.616 0.1605071
2 0.616 0.1686937
3 0.620 0.1820503
4 0.618 0.1788491
5 0.618 0.1788063
Tuning parameter 'decay' was held constant at a value of 5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 3 and decay = 5.
But does anyone know if this refers to one, or all of the outer resamples? Presumably the same tuning parameters won't necessarily be chosen across all outer resamples

Related

Flux.jl : Customizing optimizer

I'm trying to implement a gradient-free optimizer function to train convolutional neural networks with Julia using Flux.jl. The reference paper is this: https://arxiv.org/abs/2005.05955. This paper proposes RSO, a gradient-free optimization algorithm updates single weight at a time on a sampling bases. The pseudocode of this algorithm is depicted in the picture below.
optimizer_pseudocode
I'm using MNIST dataset.
function train(; kws...)
args = Args(; kws...) # collect options in a stuct for convinience
if CUDA.functional() && args.use_cuda
#info "Training on CUDA GPU"
CUDA.allwoscalar(false)
device = gpu
else
#info "Training on CPU"
device = cpu
end
# Prepare datasets
x_train, x_test, y_train, y_test = getdata(args, device)
# Create DataLoaders (mini-batch iterators)
train_loader = DataLoader((x_train, y_train), batchsize=args.batchsize, shuffle=true)
test_loader = DataLoader((x_test, y_test), batchsize=args.batchsize)
# Construct model
model = build_model() |> device
ps = Flux.params(model) # model's trainable parameters
best_param = ps
if args.optimiser == "SGD"
# Regular training step with SGD
elseif args.optimiser == "RSO"
# Run RSO function and update ps
best_param .= RSO(x_train, y_train, args.RSOupdate, model, args.batchsize, device)
end
And the corresponding RSO function:
function RSO(X,L,C,model, batch_size, device)
"""
model = convolutional model structure
X = Input data
L = labels
C = Number of rounds to update parameters
W = Weight set of layers
Wd = Weight tensors of layer d that generates an activation
wid = weight tensor that generates an activation aᵢ
wj = a weight in wid
"""
# Normalize input data to have zero mean and unit standard deviation
X .= (X .- sum(X))./std(X)
train_loader = DataLoader((X, L), batchsize=batch_size, shuffle=true)
#println("model = $(typeof(model))")
std_prep = []
σ_d = Float64[]
D = 1
for layer in model
D += 1
Wd = Flux.params(layer)
# Initialize the weights of the network with Gaussian distribution
for id in Wd
wj = convert(Array{Float32, 4}, rand(Normal(0, sqrt(2/length(id))), (3,3,4,4)))
id = wj
append!(std_prep, vec(wj))
end
# Compute std of all elements in the weight tensor Wd
push!(σ_d, std(std_prep))
end
W = Flux.params(model)
# Weight update
for _ in 1:C
d = D
while d > 0
for id in 1:length(W[d])
# Randomly sample change in weights from Gaussian distribution
for j in 1:length(w[d][id])
# Randomly sample mini-batch
(x, l) = train_loader[rand(1:length(train_loader))]
# Sample a weight from normal distribution
ΔWj[d][id][j] = rand(Normal(0, σ_d[d]), 1)
loss, acc = loss_and_accuracy(data_loader, model, device)
W = argmin(F(x,l, W+ΔWj), F(x,l,W), F(x,l, W-ΔWj))
end
end
d -= 1
end
end
return W
end
The problem here is the second block of the RSO function. I'm trying to evaluate the loss with the change of single weight in three scenarios, which are F(w, l, W+gW), F(w, l, W), F(w, l, W-gW), and choose the weight-set with minimum loss. But how do I do that using Flux.jl? The loss function I'm trying to use is logitcrossentropy(ŷ, y, agg=sum). In order to generate y_hat, we should use model(W), but changing single weight parameter in Zygote.Params() form was already challenging....
Based on the paper you shared, it looks like you need to change the weight arrays per each output neuron per each layer. Unfortunately, this means that the implementation of your optimization routine is going to depend on the layer type, since an "output neuron" for a convolution layer is quite different than a fully-connected layer. In other words, just looping over Flux.params(model) is not going to be sufficient, since this is just a set of all the weight arrays in the model and each weight array is treated differently depending on which layer it comes from.
Fortunately, Julia's multiple dispatch does make this easier to write if you use separate functions instead of a giant loop. I'll summarize the algorithm using the pseudo-code below:
for layer in model
for output_neuron in layer
for weight_element in parameters(output_neuron)
weight_element = sample(N(0, sqrt(2 / num_outputs(layer))))
end
end
sigmas[layer] = stddev(parameters(layer))
end
for c in 1 to C
for layer in reverse(model)
for output_neuron in layer
for weight_element in parameters(output_neuron)
x, y = sample(batches)
dw = N(0, sigmas[layer])
# optimize weights
end
end
end
end
It's the for output_neuron ... portions that we need to isolate into separate functions.
In the first block, we don't actually do anything different to every weight_element, they are all sampled from the same normal distribution. So, we don't actually need to iterate the output neurons, but we do need to know how many there are.
using Statistics: std
# this function will set the weights according to the
# normal distribution and the number of output neurons
# it also returns the standard deviation of the weights
function sample_weight!(layer::Dense)
sample = randn(eltype(layer.weight), size(layer.weight))
num_outputs = size(layer.weight, 1)
# notice the "." notation which is used to mutate the array
layer.weight .= sample .* num_outputs
return std(layer.weight)
end
function sample_weight!(layer::Conv)
sample = randn(eltype(layer.weight), size(layer.weight))
num_outputs = size(layer.weight, 4)
# notice the "." notation which is used to mutate the array
layer.weight .= sample .* num_outputs
return std(layer.weight)
end
sigmas = map(sample_weights!, model)
Now, for the second block, we will do a similar trick by defining different functions for each layer.
function optimize_layer!(loss, layer::Dense, data, sigma)
for i in 1:size(layer.weight, 1)
for j in 1:size(layer.weight, 2)
wj = layer.weight[i, j]
x, y = data[rand(1:length(data))]
dw = randn() * sigma
ws = [wj + dw, wj, wj - dw]
losses = Float32[]
for (k, w) in enumerate(ws)
layer.weight[i, j] = w
losses[k] = loss(x, y)
end
layer.weight[i, j] = ws[argmin(losses)]
end
end
end
function optimize_layer!(loss, layer::Conv, data, sigma)
for i in 1:size(layer.weight, 4)
# we use a view to reference the full kernel
# for this output channel
wid = view(layer.weight, :, :, :, i)
# each index let's us treat wid like a vector
for j in eachindex(wid)
wj = wid[j]
x, y = data[rand(1:length(data))]
dw = randn() * sigma
ws = [wj + dw, wj, wj - dw]
losses = Float32[]
for (k, w) in enumerate(ws)
wid[j] = w
losses[k] = loss(x, y)
end
wid[j] = ws[argmin(losses)]
end
end
end
for c in 1:C
for (layer, sigma) in reverse(zip(model, sigmas))
optimize_layer!(layer, data, sigma) do x, y
logitcrossentropy(model(x), y; agg = sum)
end
end
end
Notice that nowhere did I use Flux.params which does not help us here. Also, Flux.params would include both the weight and bias, and the paper doesn't look like it bothers with the bias at all. If you had an optimization method that generically optimized any parameter regardless of layer type the same (i.e. like gradient descent), then you could use for p in Flux.params(model) ....
Thanks #darsnack :)
I found your answer a bit late, so in the meantime I could figure out my own script that works. Mine is indeed a bit hardcoded but could you also give feedback on this?
function RSO(train_loader, test_loader, C,model, batch_size, device, args)
"""
model = convolutional model structure
C = Number of rounds to update parameters (epochs)
batch_size = size of the mini batch that will be used to calculate loss
device = CPU or GPU
"""
# Evaluate initial weight
test_loss, test_acc = loss_and_accuracy(test_loader, model, device)
println("Initial Weight:")
println(" test_loss = $test_loss, test_accuracy = $test_acc")
random_batch = []
for (x, l) in train_loader
push!(random_batch, (x,l))
end
# Initialize weights
std_prep = []
σ_d = Float64[]
D = 0
for layer in model
D += 1
Wd = Flux.params(layer)
# Initialize the weights of the network with Gaussian distribution
for id in Wd
if typeof(id) == Array{Float32, 4}
wj = convert(Array{Float32, 4}, rand(Normal(0, sqrt(2/length(id))), size(id)))
elseif typeof(id) == Vector{Float32}
wj = convert(Vector{Float32}, rand(Normal(0, sqrt(2/length(id))), length(id)))
elseif typeof(id) == Matrix{Float32}
wj = convert(Matrix{Float32}, rand(Normal(0, sqrt(2/length(id))), size(id)))
end
id = wj
append!(std_prep, vec(wj))
end
# Compute std of all elements in the weight tensor Wd
push!(σ_d, std(std_prep))
end
# Weight update
for c in 1:C
d = D
# First update the weights of the layer closest to the labels
# and then sequentially move closer to the input
while d > 0
Wd = Flux.params(model[d])
for id in Wd
# Randomly sample change in weights from Gaussian distribution
for j in 1:length(id)
# Randomly sample mini-batch
(x, y) = rand(random_batch, 1)[1]
x, y = device(x), device(y)
# Sample a weight from normal distribution
ΔWj = rand(Normal(0, σ_d[d]), 1)[1]
# Weight update with three scenario
## F(x,l, W+ΔWj)
id[j] = id[j]+ΔWj
ŷ = model(x)
ls_pos = logitcrossentropy(ŷ, y, agg=sum) / size(x)[end]
## F(x,l,W)
id[j] = id[j]-ΔWj
ŷ = model(x)
ls_org = logitcrossentropy(ŷ, y, agg=sum) / size(x)[end]
## F(x,l, W-ΔWj)
id[j] = id[j]-ΔWj
ŷ = model(x)
ls_neg = logitcrossentropy(ŷ, y, agg=sum) / size(x)[end]
# Check weight update that gives minimum loss
min_loss = argmin([ls_org, ls_pos, ls_neg])
# Save weight update with minimum loss
if min_loss == 1
id[j] = id[j] + ΔWj
elseif min_loss == 2
id[j] = id[j] + 2*ΔWj
elseif min_loss == 3
id[j] = id[j]
end
end
end
d -= 1
end
train_loss, train_acc = loss_and_accuracy(train_loader, model, device)
test_loss, test_acc = loss_and_accuracy(test_loader, model, device)
track!(args.tracker, test_acc)
println("RSO Round=$c")
println(" train_loss = $train_loss, train_accuracy = $train_acc")
println(" test_loss = $test_loss, test_accuracy = $test_acc")
end
return Flux.params(model)
end

Large, exploding loss in Pytorch transformer model

I am trying to solve a sequence to sequence problem with a transformer model. The data is derived from a set of crossword puzzles.
The positional encoding and transformer classes are as follows:
class PositionalEncoding(nn.Module):
def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 3000):
super().__init__()
self.dropout = nn.Dropout(p=dropout)
position = torch.arange(max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(1, max_len, d_model)
pe[0, :, 0::2] = torch.sin(position * div_term)
pe[0, :, 1::2] = torch.cos(position * div_term)
self.register_buffer('pe', pe)
def debug(self, x):
return x.shape, x.size()
def forward(self, x: Tensor) -> Tensor:
x = x + self.pe[:, :x.size(1), :]
return self.dropout(x)
class Transformer(nn.Module):
def __init__(
self,
num_tokens,
dim_model,
num_heads,
num_encoder_layers,
num_decoder_layers,
batch_first,
dropout_p,
):
super().__init__()
self.model_type = "Transformer"
self.dim_model = dim_model
self.positional_encoder = PositionalEncoding(
d_model=dim_model, dropout=dropout_p, max_len=3000
)
self.embedding = nn.Embedding.from_pretrained(vec_weights, freeze=False)#nn.Embedding(num_tokens, dim_model)
self.transformer = nn.Transformer(
d_model=dim_model,
nhead=num_heads,
num_encoder_layers=num_encoder_layers,
num_decoder_layers=num_decoder_layers,
dropout=dropout_p,
batch_first = batch_first
)
self.out = nn.Linear(dim_model, num_tokens)
def forward(self, src, tgt, tgt_mask=None, src_pad_mask=None, tgt_pad_mask=None):
src = self.embedding(src)*math.sqrt(self.dim_model)
tgt = self.embedding(tgt)*math.sqrt(self.dim_model)
src = self.positional_encoder(src)
tgt = self.positional_encoder(tgt)
transformer_out = self.transformer(src, tgt, tgt_mask=tgt_mask, src_key_padding_mask=src_pad_mask, tgt_key_padding_mask=tgt_pad_mask)
out = self.out(transformer_out)
return out
def get_tgt_mask(self, size) -> torch.tensor:
mask = torch.tril(torch.ones(size, size) == 1)
mask = mask.float()
mask = mask.masked_fill(mask == 0, float('-inf'))
mask = mask.masked_fill(mask == 1, float(0.0))
return mask
def create_pad_mask(self, matrix: torch.tensor, pad_token: int) -> torch.tensor:
return (matrix == pad_token)
The input tensors are a source tensor of size N by S, where N is the batch size and S is the source sequence length, and a target tensor of size N by T, where T is the target sequence length. S is about 10 and T is about 5, while the total number of items is about 160,000-200,000, divided into batch sizes of 512. They are torch.IntTensors, with elements in the range from 0 to V, where V is the vocabulary length.
The first layer is an embedding layer that takes the input from N by S to N by S by E, where E is the embedding dimension (300), or to N by T by E in the case of the target. The second layer adds position encoding without changing the shape. Then both tensors are passed through the transformer layer, which outputs an N by T by E tensor. Finally, we pass this output through a linear layer, which produces an N by T by V output, where V is the size of the vocabulary used in the problem. Here V is about 56,697. The most frequent tokens (words) appear about 50-60 times in the target tensor.
The transformer class also contains the functions for implementing the masking matrices.
Then we create the model and run it (this process is wrapped in a function).
device = "cuda"
src_train, src_test = torch.utils.data.random_split(src_t, [int(0.9*len(src_t)), len(src_t)-int(0.9*len(src_t))])
src_train, src_test = src_train[:512], src_test[:512]
tgt_train, tgt_test = torch.utils.data.random_split(tgt_t, [int(0.9*len(tgt_t)), len(tgt_t)-int(0.9*len(tgt_t))])
tgt_train, tgt_test = tgt_train[:512], tgt_test[:512]
train_data, test_data = list(zip(src_train, tgt_train)), list(zip(src_test, tgt_test))
train, test = torch.utils.data.DataLoader(dataset=train_data), torch.utils.data.DataLoader(dataset=test_data)
model = Transformer(num_tokens=ntokens, dim_model=300, num_heads=2, num_encoder_layers=3, num_decoder_layers=3, batch_first = True, dropout_p=0.1).to(device)
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0000001)
n_epochs = 50
def train_model(model, optimizer, loss_function, n_epochs):
loss_value=0
for epoch in range(n_epochs):
print(f"Starting epoch {epoch}")
for batch, data in enumerate(train):
x, y = data
if batch%100 == 0:
print(f"Batch is {batch}")
batch += 1
optimizer.zero_grad()
x, y = torch.tensor(x).to(device), torch.tensor(y).to(device)
y_input, y_base = y[:, :-1], y[:, 1:]
y_input, y_base = y_input.to(device), y_base.to(device)
tgt_mask = model.get_tgt_mask(y_input.shape[1]).to(device)
pad_token = vocabulary_table[embeddings.key_to_index["/"]]
src_pad_mask = model.create_pad_mask(x, pad_token).to(device)
tgt_pad_mask = model.create_pad_mask(y_input, pad_token).to(device)
z = model(x, y_input, tgt_mask, src_pad_mask, tgt_pad_mask)
z = z.permute(0, 2, 1).to(device)
y_base = y_base.long().to(device)
loss = loss_function(z, y_base).to(device)
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0, norm_type=2)
optimizer.step()
loss_value += float(loss)
if batch%100 == 0:
print(f"For epoch {epoch}, batch {batch} the cross-entropy loss is {loss_value}")
#Free GPU memory.
del z
del x
del y
del y_input
del y_base
del loss
torch.cuda.empty_cache()
return model.parameters(), loss_value
Basically, we split the data into test and training sets and use an SGD optimizer and cross-entropy loss. We create a masking matrix for the padding for both the target and source tensors, and a masking matrix for unseen elements for the target tensor. We then do the usual gradient update steps. Right now, there is no validation loop, because I cannot even get the training loss to decrease.
The loss is very high, reaching more than 1000 after 100 batches. More concerningly, the loss also increases rapidly during training, rather than decreasing. In the code that I included, I tried clipping the gradients, lowering the learning rate, and using a much smaller sample to debug the code.
What could be causing this behavior?
You are only adding things to your loss, so naturally it can only increase.
loss_value += float(loss)
You're supposed to set it to zero after every epoch. Now you set it to zero once, in the beginning of the training process. There is a training loop template here if you're interested (https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html). This explains the increasing loss. To further troubleshoot (if needed) I'd throw in a validation loop.

ReLU activation function outputs HUGE numbers

I have FINALLY been able to implement backpropagation, but there are still some bugs I need to fix. The main is issue the following: My ReLU activation function produces really big dJdW values (derivative of error function wrt weights). When this gets subtracted from the weights, my output becomes a matrix of -int or inf. How do I stop this? As of now, the only solution I have is to make my learning rate scalar variable REALLY small.
import numpy as np
class Neural_Network(object):
def __init__(self, input_, hidden_, output_, numHiddenLayer_, numExamples_):
# Define Hyperparameters
self.inputLayerSize = input_
self.outputLayerSize = output_
self.hiddenLayerSize = hidden_
self.numHiddenLayer = numHiddenLayer_
self.numExamples = numExamples_
self.learningRate = 0.000000001 # LEARNING RATE: Why does ReLU produce such large dJdW values?
self.weightDecay = 0.5
# in -> out
self.weights = [] # stores matrices of each layer of weights
self.z = [] # stores matrices of each layer of weighted sums
self.a = [] # stores matrices of each layer of activity
self.biases = [] # stores all biases
# Biases are matrices that are added to activity matrix
# Dimensions -> numExamples_*hiddenLayerSize or numExamples_*outputLayerSize
for i in range(self.numHiddenLayer):
# Biases for hidden layer
b = [np.random.random() for x in range(self.hiddenLayerSize)];
B = [b for x in range(self.numExamples)];
self.biases.append(np.mat(B))
# Biases for output layer
b = [np.random.random() for x in range(self.outputLayerSize)]
B = [b for x in range(self.numExamples)];
self.biases.append(np.mat(B))
# Weights (Parameters)
# Weight matrix between input and first layer
W = np.random.rand(self.inputLayerSize, self.hiddenLayerSize)
self.weights.append(W)
for i in range(self.numHiddenLayer-1):
# Weight matrices between hidden layers
W = np.random.rand(self.hiddenLayerSize, self.hiddenLayerSize)
self.weights.append(W)
# Weight matric between hiddenlayer and outputlayer
self.weights.append(np.random.rand(self.hiddenLayerSize, self.outputLayerSize))
def setBatchSize(self, numExamples):
# Changes the number of rows (examples) for biases
if (self.numExamples > numExamples):
self.biases = [b[:numExamples] for b in self.biases]
def sigmoid(self, z):
# Apply sigmoid activation function
return 1/(1+np.exp(-z))
def sigmoidPrime(self, z):
# Derivative of sigmoid function
return self.sigmoid(x)*(1-self.sigmoid(z))
def ReLU(self, z):
# Apply activation function
'''
for (i, j), item in np.ndenumerate(z):
if (item < 0):
item *= 0.01
else:
item = item
return z'''
return np.multiply((z < 0), z * 0.01) + np.multiply((z >= 0), z)
def ReLUPrime(self, z):
# Derivative of ReLU activation function\
'''
for (i, j), item in np.ndenumerate(z):
if (item < 0):
item = 0.01
else:
item = 1
return z'''
return (z < 0) * 0.01 + (z >= 0) * 1
def forward(self, X):
# Propagate outputs through network
self.z = []
self.a = []
self.z.append(np.dot(X, self.weights[0]) + self.biases[0])
self.a.append(self.ReLU(self.z[0]))
#viewZ = self.z
#viewA = self.a
for i in range(1, self.numHiddenLayer):
self.z.append(np.dot(self.a[-1], self.weights[i]) + self.biases[i])
self.a.append(self.ReLU(self.z[-1]))
self.z.append(np.dot(self.z[-1], self.weights[-1]) + self.biases[-1])
self.a.append(self.ReLU(self.z[-1]))
yHat = self.ReLU(self.z[-1])
return yHat
def backProp(self, X, y):
# Compute derivative wrt W
# out -> in
dJdW = [] # stores matrices of each dJdW (equal in size to self.weights[])
delta = [] # stores matrices of each backpropagating error
self.yHat = self.forward(X)
# Quantifying Error
J = np.multiply((y-self.yHat),(y-self.yHat)) * 0.5
Javrg = np.dot(J.T, np.mat([1 for x in range(self.numExamples)]).reshape(self.numExamples, 1))
print(Javrg.item(0))
delta.insert(0,np.multiply(-(y-self.yHat), self.ReLUPrime(self.z[-1]))) # delta = (y-yHat)(sigmoidPrime(final layer unactivated))
dJdW.insert(0, np.dot(self.a[-2].T, delta[0]) + (self.weightDecay*self.weights[-1])) # dJdW
for i in range(len(self.weights)-1, 1, -1):
# Iterate from self.weights[-1] -> self.weights[1]
delta.insert(0, np.multiply(np.dot(delta[0], self.weights[i].T), self.ReLUPrime(self.z[i-1])))
dJdW.insert(0, np.dot(self.a[i-2].T, delta[0]) + (self.weightDecay*self.weights[i-1]))
delta.insert(0, np.multiply(np.dot(delta[0], self.weights[1].T), self.ReLUPrime(self.z[0])))
dJdW.insert(0, np.dot(X.T, delta[0]) + (self.weightDecay*self.weights[0]))
return dJdW
def train(self, X, y):
for t in range(60000):
dJdW = self.backProp(X, y)
for i in range(len(dJdW)):
self.weights[i] -= self.learningRate*dJdW[i]
# Instantiating Neural Network
inputs = [int(np.random.randint(0,1000)) for x in range(1000)]
x = np.mat([x for x in inputs]).reshape(1000,1)
y = np.mat([x+1 for x in inputs]).reshape(1000,1)
NN = Neural_Network(1,3,1,1,1000)
# Training
print("INPUT: ", end = '\n')
print(x, end = '\n\n')
print("BEFORE TRAINING", NN.forward(x), sep = '\n', end = '\n\n')
print("ERROR: ")
NN.train(x,y)
print("\nAFTER TRAINING", NN.forward(x), sep = '\n', end = '\n\n')
# Testing
test = np.mat([int(np.random.randint(0,10080)) for x in range(1000)]).reshape(1000,1)
print("TEST INPUT:", test, sep = '\n', end = '\n\n')
print(NN.forward(test), end = '\n\n')
NN.setBatchSize(1) # changing settings to receive one input at a time
while True:
# Give numbers between 0-100 (I need to fix overfitting) and it will get next value
inputs = input()
x = np.mat([int(i) for i in inputs.split(" ")])
print(NN.forward(x))
I first made the ANN using sigmoid but Leaky ReLU is faster.
The code is a bit much so here is a summary:
Neural Network Class
define hyperparameter and stuff (include really small learning rate scalar)
activation functions and their derivatives (ReLU and sigmoid)
Member functions: forward propagation, backpropagation, setBatchSize etc.
Instantiating ANN
setting hyperparameters (topology of ANN)
creating data (one array has values x and the output array has values x+1)
Training
using inputs generated in step 2 to train ANN
Testing
Testing using randomly generated inputs
User can give inputs
Hope that helps you help me. Thanks!
Your ReLU and ReLUPrime are wrong. When you iterate over a collection and mutate items it doesn't change the collection. Also: try to not explicitly iterate over arrays in numpy, but use vectorized operations, because they are way faster. It should be a good exercise to rewrite ReLU and its derivative in vectorized form. If you aren't sure what I mean, check out this answer.
Apart from that sigmoidPrime is wrong, it should be
self.sigmoid(z) * (1-self.sigmoid(z))
PS
This problem isn't really well suited for neural network, at least not for this encoding - I've tried it with exact hyperparameters with scikit-learn MLPRegressor and its output doesn't make much sense.

Neural Network MNIST: Backpropagation is correct, but training/test accuracy very low

I am building a neural network to learn to recognize handwritten digits from MNIST. I have confirmed that backpropagation calculates the gradients perfectly (gradient checking gives error < 10 ^ -10).
It appears that no matter how I train the weights, the cost function always tends towards around 3.24-3.25 (never below that, just approaching from above) and the training/test set accuracy is very low (around 11% for the test set). It appears that the h values in the end are all very close to 0.1 and to each other.
I cannot find why my program cannot produce better results. I was wondering if anyone could maybe take a look at my code and please tell me any reasons for this occurring. Thank you so much for all your help, I really appreciate it!
Here is my Python code:
import numpy as np
import math
from tensorflow.examples.tutorials.mnist import input_data
# Neural network has four layers
# The input layer has 784 nodes
# The two hidden layers each have 5 nodes
# The output layer has 10 nodes
num_layer = 4
num_node = [784,5,5,10]
num_output_node = 10
# 30000 training sets are used
# 10000 test sets are used
# Can be adjusted
Ntrain = 30000
Ntest = 10000
# Sigmoid Function
def g(X):
return 1/(1 + np.exp(-X))
# Forwardpropagation
def h(W,X):
a = X
for l in range(num_layer - 1):
a = np.insert(a,0,1)
z = np.dot(a,W[l])
a = g(z)
return a
# Cost Function
def J(y, W, X, Lambda):
cost = 0
for i in range(Ntrain):
H = h(W,X[i])
for k in range(num_output_node):
cost = cost + y[i][k] * math.log(H[k]) + (1-y[i][k]) * math.log(1-H[k])
regularization = 0
for l in range(num_layer - 1):
for i in range(num_node[l]):
for j in range(num_node[l+1]):
regularization = regularization + W[l][i+1][j] ** 2
return (-1/Ntrain * cost + Lambda / (2*Ntrain) * regularization)
# Backpropagation - confirmed to be correct
# Algorithm based on https://www.coursera.org/learn/machine-learning/lecture/1z9WW/backpropagation-algorithm
# Returns D, the value of the gradient
def BackPropagation(y, W, X, Lambda):
delta = np.empty(num_layer-1, dtype = object)
for l in range(num_layer - 1):
delta[l] = np.zeros((num_node[l]+1,num_node[l+1]))
for i in range(Ntrain):
A = np.empty(num_layer-1, dtype = object)
a = X[i]
for l in range(num_layer - 1):
A[l] = a
a = np.insert(a,0,1)
z = np.dot(a,W[l])
a = g(z)
diff = a - y[i]
delta[num_layer-2] = delta[num_layer-2] + np.outer(np.insert(A[num_layer-2],0,1),diff)
for l in range(num_layer-2):
index = num_layer-2-l
diff = np.multiply(np.dot(np.array([W[index][k+1] for k in range(num_node[index])]), diff), np.multiply(A[index], 1-A[index]))
delta[index-1] = delta[index-1] + np.outer(np.insert(A[index-1],0,1),diff)
D = np.empty(num_layer-1, dtype = object)
for l in range(num_layer - 1):
D[l] = np.zeros((num_node[l]+1,num_node[l+1]))
for l in range(num_layer-1):
for i in range(num_node[l]+1):
if i == 0:
for j in range(num_node[l+1]):
D[l][i][j] = 1/Ntrain * delta[l][i][j]
else:
for j in range(num_node[l+1]):
D[l][i][j] = 1/Ntrain * (delta[l][i][j] + Lambda * W[l][i][j])
return D
# Neural network - this is where the learning/adjusting of weights occur
# W is the weights
# learn is the learning rate
# iterations is the number of iterations we pass over the training set
# Lambda is the regularization parameter
def NeuralNetwork(y, X, learn, iterations, Lambda):
W = np.empty(num_layer-1, dtype = object)
for l in range(num_layer - 1):
W[l] = np.random.rand(num_node[l]+1,num_node[l+1])/100
for k in range(iterations):
print(J(y, W, X, Lambda))
D = BackPropagation(y, W, X, Lambda)
for l in range(num_layer-1):
W[l] = W[l] - learn * D[l]
print(J(y, W, X, Lambda))
return W
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
# Training data, read from MNIST
inputpix = []
output = []
for i in range(Ntrain):
inputpix.append(2 * np.array(mnist.train.images[i]) - 1)
output.append(np.array(mnist.train.labels[i]))
np.savetxt('input.txt', inputpix, delimiter=' ')
np.savetxt('output.txt', output, delimiter=' ')
# Train the weights
finalweights = NeuralNetwork(output, inputpix, 2, 5, 1)
# Test data
inputtestpix = []
outputtest = []
for i in range(Ntest):
inputtestpix.append(2 * np.array(mnist.test.images[i]) - 1)
outputtest.append(np.array(mnist.test.labels[i]))
np.savetxt('inputtest.txt', inputtestpix, delimiter=' ')
np.savetxt('outputtest.txt', outputtest, delimiter=' ')
# Determine the accuracy of the training data
count = 0
for i in range(Ntrain):
H = h(finalweights,inputpix[i])
print(H)
for j in range(num_output_node):
if H[j] == np.amax(H) and output[i][j] == 1:
count = count + 1
print(count/Ntrain)
# Determine the accuracy of the test data
count = 0
for i in range(Ntest):
H = h(finalweights,inputtestpix[i])
print(H)
for j in range(num_output_node):
if H[j] == np.amax(H) and outputtest[i][j] == 1:
count = count + 1
print(count/Ntest)
Your network is tiny, 5 neurons make it basically a linear model. Increase it to 256 per layer.
Notice, that trivial linear model has 768 * 10 + 10 (biases) parameters, adding up to 7690 floats. Your neural network on the other hand has 768 * 5 + 5 + 5 * 5 + 5 + 5 * 10 + 10 = 3845 + 30 + 60 = 3935. In other words despite being nonlinear neural network, it is actualy a simpler model than a trivial logistic regression applied to this problem. And logistic regression obtains around 11% error on its own, thus you cannot really expect to beat it. Of course this is not a strict argument, but should give you some intuition for why it should not work.
Second issue is related to other hyperparameters, you seem to be using:
huge learning rate (is it 2?) it should be more of order 0.0001
very little training iterations (are you just executing 5 epochs?)
your regularization parameter is huge (it is set to 1), so your network is heavily penalised for learning anything, again - change it to something order of magnitude smaller
The NN architecture is most likely under-fitting. Maybe, the learning rate is high/low. Or there are most issues with the regularization parameter.

Tuning parameters in caret error despite assigning grids and as.factor

Any help appreciated. Been at this for weeks. :(
install.packages("klaR", dependencies=TRUE)
library(klaR)
install.packages("caret", dependencies=TRUE)
library(caret)
install.packages("e1071", dependencies=TRUE)
library(e1071)
install.packages("gmodels", dependencies=TRUE)
library(gmodels)
install.packages("gbm", dependencies=TRUE)
library(gbm)
install.packages("foreach", dependencies=TRUE)
library(foreach)
Load Grading Data
grading <- read.csv("~/PA_DataFinal/GradingData160315.csv")
create stratified sample # 1%
dfstrat <- stratified(grading, "FailPass", .01)
save(dfstrat, file = "c:/Users/gillisn/Documents/PA_DataFinal/RResults/GradingRResults/iteration 1/dfstrat.rda")
split data into train and test #75:25. FailPass is the responseVble
set.seed(1)
inTrainingSet <- createDataPartition(dfstrat$FailPass, p = .75, list = FALSE)
trainSet <- dfstrat[inTrainingSet,]
testSet <- dfstrat[-inTrainingSet, ]
set predictors and labels
There are 48 labels and its the last one that want to train on.
Take all the predictors 1-47
x,y is training data
x <- trainSet[,-48]
y <- as.factor(trainSet$FailPass)
i,j is test data
i <- testSet[,-48,]
j <- as.factor(testSet$FailPass)
Set Training control parameters
Bootstrapping itself around in 25 times.
bootControl <- trainControl(number = 25)
The grid is for the decision tree
gbmGrid <- expand.grid(.interaction.depth = (1:5) * 2, .n.trees = (1:10)*25, .shrinkage = .1)
nbGrid <- expand.grid(.fL=0, .usekernel=FALSE)
svmGrid >- expandGrid(.sigma=, .c=)
set.seed(2)
Train the models
naive bayes
nbFit <- train(x,y,method='nb',tuneGrid="nbGrid")
svm
svmFit <- train(x, y,method = "svmRadial", tuneLength = 10,trControl = bootControl, scaled = FALSE)
gbm
gbmFit <- train(x, y,method = "gbm", trControl = bootControl, verbose = FALSE, bag.fraction = 0.5, tuneGrid = gbmGrid)
predict the models on training data
models <- list(svm = svmFit, nb = nbFit, gbm = gbmFit)
predict(models)

Resources