How does the createFolds function in the caret package divide the dataset into K folds? Is it random sampling or stratified uniform sampling?

How does the createFolds function in the caret package divide the dataset into K folds? Is it random sampling or stratified uniform sampling?
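One way to see for yourself (a minimal sketch using the built-in iris data, not from the original question) is to build folds on a factor outcome and tabulate the class counts per fold; roughly equal counts for each class in every fold would indicate stratified sampling:

library(caret)
data(iris)

set.seed(42)
# createFolds returns a list of held-out row indices, one element per fold
folds <- createFolds(iris$Species, k = 5)

# count how many rows of each class land in each fold
sapply(folds, function(idx) table(iris$Species[idx]))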

Related

Variational autoencoder (VAE) predicts 1 constant value

I'm currently training a VAE model.
The images in question are microstructure rock images (like these).
I defined a compound loss function as the sum of two terms:
MSE, as my images are grayscale but not binary.
KL divergence.
I was getting NaN values for the loss function, but figured out that a way around this is to use a weighted sum of the two losses. I chose to weight the MSE by the image size (256x256), so it becomes:
MSE = MSE * 256 * 256
and to weight the KL divergence by a factor of 0.1.
The NaN problem was then solved, but when predicting, my model outputs just one value for the whole image: the prediction is an array of 256*256 values that are all the same, e.g. 0.502.
Model specs:
10-layer encoder / decoder
Latent space of dimension 5
SGD optimizer at lr=0.001
The training loss drops from around a billion to about 3000 by the 2nd epoch and fluctuates around that value.
Accuracy during training or validation is below 0.001; I've read this metric is irrelevant anyway when it comes to VAEs.
Here is how I sample from the latent distribution:
def get_sample_from_dist(args):
    # reparameterization trick: z = mu + sigma * epsilon, with epsilon ~ N(0, 1);
    # note that exp() is applied, so the second argument is treated as log(sigma)
    mean_vec, log_std_vec = args
    eta_vec = K.random_normal(shape=(K.shape(mean_vec)[0], K.int_shape(mean_vec)[1]),
                              mean=0, stddev=1)
    return mean_vec + K.exp(log_std_vec) * eta_vec

sample = Lambda(get_sample_from_dist, output_shape=(latent_dim,), name='sample')([mu, log_sigma])
and here is how the encoder generates mu and log_sigma:
# x is the output of the last encoder layer
mu = Dense(latent_dim, name='latent_mu')(x)
log_sigma = Dense(latent_dim, name='latent_sigma')(x)
and here is my loss:
def vae_loss_func(inputs, outputs, mu, log_sigma):
    x1 = K.flatten(inputs)
    x2 = K.flatten(outputs)
    # per-pixel MSE, scaled up by the image size (256x256)
    reconstruction_loss = losses.mse(x1, x2) * 256**2
    # KL divergence term, down-weighted by 0.1
    kl_loss = -0.5 * 0.1 * K.sum(1 + log_sigma - K.square(mu) - K.square(K.exp(log_sigma)), axis=-1)
    vae_loss = K.mean(reconstruction_loss + kl_loss)
    return vae_loss
Any thoughts on where things are going wrong?
I tried different weighting factors in the loss function and adding strided and dropout layers; none of these worked. I expect the generated image to vary in pixel value and eventually capture the rock structure.
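One inconsistency worth checking (an observation from reading the code, not a confirmed fix): the sampling layer applies K.exp to log_sigma, treating it as the log of the standard deviation, but the closed-form KL term for that parameterization is -0.5 * sum(1 + 2*log_sigma - mu^2 - exp(2*log_sigma)); the posted loss drops the factor of 2 on log_sigma. A minimal sketch of the loss with a KL term consistent with the sampling layer:

def vae_loss_func(inputs, outputs, mu, log_sigma):
    x1 = K.flatten(inputs)
    x2 = K.flatten(outputs)
    reconstruction_loss = losses.mse(x1, x2) * 256**2
    # KL(N(mu, sigma^2) || N(0, 1)) with log_sigma = log(std),
    # so log(var) = 2 * log_sigma and var = exp(2 * log_sigma)
    kl_loss = -0.5 * 0.1 * K.sum(1 + 2*log_sigma - K.square(mu) - K.exp(2*log_sigma), axis=-1)
    return K.mean(reconstruction_loss + kl_loss)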

How to obtain Accuracy, Cohen's Kappa, and AUC values from k fold cross validation?

I would like to obtain not only Accuracy and Cohen's kappa values from a k-fold cross validation, but AUC as well. I know how to obtain the average Accuracy, Cohen's kappa, and AUC, as well as the Accuracy and Cohen's kappa for each fold, but I don't know how to obtain an AUC value for each fold.
Here is an example using different data:
# load packages and data
library(caret)
library(mlbench)
data(Sonar)
# rename data
my_data <- Sonar
# apply train control to get accuracy and Cohen's kappa
fitControl <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,
  savePredictions = TRUE
)
# run through k-fold cross validation
model <- train(
  Class ~ .,
  data = my_data,
  method = "glm",
  trControl = fitControl
)
getTrainPerf(model)
# get every accuracy and kappa value
model$resample
I also know that I can use ROC as the metric in the train function, fit the model to optimize ROC, and then obtain ROC values. But I would like to optimize Cohen's kappa and still see AUC scores for each fold. How might I accomplish this?
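One approach worth trying (a sketch, not verified against the poster's data): the summaryFunction argument of trainControl decides which per-fold metrics caret computes, and a custom function can return both twoClassSummary's ROC/Sens/Spec and defaultSummary's Accuracy/Kappa, while metric = "Kappa" in train keeps kappa as the optimization target. The name allMetrics below is a hypothetical helper:

library(caret)
library(mlbench)
data(Sonar)

# report ROC (AUC), Sens, Spec plus Accuracy and Kappa for every fold
allMetrics <- function(data, lev = NULL, model = NULL) {
  c(twoClassSummary(data, lev, model), defaultSummary(data, lev, model))
}

fitControl <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,           # required for ROC
  savePredictions = TRUE,
  summaryFunction = allMetrics
)

model <- train(
  Class ~ .,
  data = Sonar,
  method = "glm",
  metric = "Kappa",            # still optimize Cohen's kappa
  trControl = fitControl
)

model$resample                 # per-fold ROC, Sens, Spec, Accuracy, Kappa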

Why do I get some negative values (predictors) as output of regressor estimators (Lasso, Ridge, ElasticNet)

For my regression problem, I am using GridSearchCV from scikit-learn to get the best alpha value and using this alpha value in my estimator (Lasso, Ridge, ElasticNet).
My target values in the training dataset do not contain any negative values, but some of the predicted values are negative (around 5-10%).
I am using the following code.
My training data contains some null values, and I am replacing them with the mean of that feature.
return Lasso(alpha=best_parameters['alpha']).fit(X, y).predict(X_test)
Any idea why I am getting some negative values?
The shapes of X, y, and X_test are (20L, 400L), (20L,), and (10L, 400L).
Lasso is just regularized linear regression, so for any trained model there are some inputs for which the prediction will be negative.
Consider a linear function
f(x) = w'x + b
where w and x are vectors and ' is the transposition operator.
No matter what the values of w and b are, as long as w is not the zero vector there are always values of x for which f(x) < 0. It does not matter that the training set used to compute w and b contained no negative values; a linear model always crosses zero somewhere (possibly at very large input values).
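A tiny illustration of this point (a made-up sketch, not from the original answer): fit a Lasso on strictly positive targets and evaluate it outside the training range, where the fitted line dips below zero.

import numpy as np
from sklearn.linear_model import Lasso

# strictly positive targets lying on a line with positive slope
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

model = Lasso(alpha=0.1).fit(X, y)

# extrapolating below the training range crosses zero
print(model.predict([[-5.0]]))  # negative, even though every y >= 1

If negative predictions are physically meaningless for the problem, common workarounds are clipping predictions at zero or regressing on log(y) and exponentiating the output.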

The cost function and gradient of softmax classifier

When training a softmax classifier, I used the minFunc function in Matlab, but it didn't work: the step size reached TolX quickly and the accuracy was not even 5%. There must be something wrong, but I just couldn't find it.
Here is my Matlab code about the cost function and gradient:
z = x*W;  % x is the input data, an m*n matrix: m is the number of samples,
          % n the number of units in the input layer. W is an n*o matrix,
          % o the number of units in the output layer.
a = sigmoid(z) ./ repmat(sum(sigmoid(z), 2), 1, o);  % a is the output of the classifier
J = -mean(sum(target.*log(a), 2)) + l/2*sum(sum(W.^2));  % cost function; target is the desired output, an m*o matrix; l is the weight decay parameter
Wgrad = -x'*(target - a)/m + l*W;
The formula can be found here. Can anyone point out where my error is?
I found the error: I should not use the sigmoid function in the normalization; it should simply be exp.
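For reference, here is a corrected sketch of the forward pass with exp in place of sigmoid (the row-max subtraction is an extra numerical-stability step, an addition rather than part of the original post; it does not change the softmax output):

z = x*W;
z = z - repmat(max(z, [], 2), 1, o);         % subtract the row max so exp does not overflow
a = exp(z) ./ repmat(sum(exp(z), 2), 1, o);  % softmax output
J = -mean(sum(target.*log(a), 2)) + l/2*sum(sum(W.^2));
Wgrad = -x'*(target - a)/m + l*W;            % the gradient keeps the same form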

Neural Networks: Why does the perceptron rule only work for linearly separable data?

I previously asked for an explanation of linearly separable data. Still reading Mitchell's Machine Learning book, I have some trouble understanding why exactly the perceptron rule only works for linearly separable data.
Mitchell defines a perceptron as follows:
o(x1, ..., xn) = 1 if w0 + w1*x1 + ... + wn*xn > 0, and -1 otherwise.
That is, the output is 1 or -1 depending on whether the weighted sum of the inputs exceeds some threshold.
Now, the problem is to determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each of the given training examples. One way of achieving this is through the perceptron rule:
One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example. This process is repeated, iterating through the training examples as many times as needed until the perceptron classifies all training examples correctly. Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi according to the rule:
wi <- wi + eta*(t - o)*xi
where t is the target output, o is the perceptron's output, and eta is the learning rate.
So, my question is: Why does this only work with linearly separable data? Thanks.
Because the dot product of w and x is a linear combination of the inputs, you are in fact splitting your data into 2 classes with the hyperplane a_1*x_1 + ... + a_n*x_n > 0.
Consider a 2D example: X = (x, y) and W = (a, b); then X * W = a*x + b*y. sgn returns 1 if its argument is greater than 0, so for class #1 you have a*x + b*y > 0, which is equivalent to y > -(a/b)*x (assuming b != 0). That equation is linear and divides the 2D plane into 2 parts.
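To make the limitation concrete, here is a small illustrative sketch (an addition, not part of the original answer): the perceptron rule converges on the linearly separable AND function but never converges on XOR, which no single line can separate.

import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=100):
    # perceptron rule: w_i <- w_i + eta*(t - o)*x_i, bias folded in as w_0
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a constant 1 input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, ti in zip(Xb, t):
            o = 1 if xi @ w > 0 else -1
            if o != ti:
                w += eta * (ti - o) * xi
                errors += 1
        if errors == 0:
            return True   # converged: every example classified correctly
    return False          # no separating line found within the epoch budget

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(train_perceptron(X, np.array([-1, -1, -1, 1])))  # AND: True
print(train_perceptron(X, np.array([-1, 1, 1, -1])))   # XOR: False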
