Bayesian vs OLS - machine-learning

I found this question online. Can someone explain in details please, why using OLS is better? Is it only because the number of samples is not enough? Also, why not use all the 1000 samples to estimate the prior distribution?
We have 1000 randomly sampled data points. The goal is to try to build
a regression model with one response variable from k regressor
variables. Which is better? 1. (Bayesian Regression) Using the first
500 samples to estimate the parameters of an assumed prior
distribution and then use the last 500 samples to update the prior to
a posterior distribution with posterior estimates to be used in the
final regression model. 2. (OLS Regression) Use a simple ordinary
least squares regression model with all 1000 regressor variables

"Better" is always a matter of opinion, and it greatly depends on context.
Advantages to a frequentist OLS approach: Simpler, faster, more accessible to a wider audience (and therefore less to explain). A wise professor of mine used to say "You don't need to build an atom smasher when a flyswatter will do the trick."
Advantages to an equivalent Bayesian approach: More flexible to further model development, can directly model posteriors of derived/calculated quantities (there are more, but these have been my motivations for going Bayesian with a given analysis). Note the word "equivalent" - there are things you can do in a Bayesian framework that you can't do within a frequentist approach.
And hey, here's a exploration in R, first simulating data, then using a typical OLS approach.
N <- 1000
x <- 1:N
epsilon <- rnorm(N, 0, 1)
y <- x + epsilon
summary(lm(y ~ x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9053 -0.6723 0.0116 0.6937 3.7880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0573955 0.0641910 0.894 0.371
## x 0.9999997 0.0001111 9000.996 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.014 on 998 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.102e+07 on 1 and 998 DF, p-value: < 2.2e-16
...and here's an equivalent Bayesian regression, using non-informative priors on the regression parameters and all 1000 data points.
library(R2jags)
cat('model {
for (i in 1:N){
y[i] ~ dnorm(y.hat[i], tau)
y.hat[i] <- a + b * x[i]
}
a ~ dnorm(0, .0001)
b ~ dnorm(0, .0001)
tau <- pow(sigma, -2)
sigma ~ dunif(0, 100)
}', file="test.jags")
test.data <- list(x=x,y=y,N=1000)
test.jags.out <- jags(model.file="test.jags", data=test.data,
parameters.to.save=c("a","b","tau","sigma"), n.chains=3, n.iter=10000)
test.jags.out$BUGSoutput$mean$a
## [1] 0.05842661
test.jags.out$BUGSoutput$sd$a
## [1] 0.06606705
test.jags.out$BUGSoutput$mean$b
## [1] 0.9999976
test.jags.out$BUGSoutput$sd$b
## [1] 0.0001122533
Note that the parameter estimates and standard errors/standard deviations are essentially equivalent!
Now here's another Bayesian regression, using the first 500 data points to estimate the priors and then the last 500 to estimate posteriors.
test.data <- list(x=x[1:500],y=y[1:500],N=500)
test.jags.out <- jags(model.file="test.jags", data=test.data,
parameters.to.save=c("a","b","tau","sigma"), n.chains=3, n.iter=10000)
cat('model {
for (i in 1:N){
y[i] ~ dnorm(y.hat[i], tau)
y.hat[i] <- a + b * x[i]
}
a ~ dnorm(a_mn, a_prec)
b ~ dnorm(b_mn, b_prec)
a_prec <- pow(a_sd, -2)
b_prec <- pow(b_sd, -2)
tau <- pow(sigma, -2)
sigma ~ dunif(0, 100)
}', file="test.jags1")
test.data1 <- list(x=x[501:1000],y=y[501:1000],N=500,
a_mn=test.jags.out$BUGSoutput$mean$a,a_sd=test.jags.out$BUGSoutput$sd$a,
b_mn=test.jags.out$BUGSoutput$mean$b,b_sd=test.jags.out$BUGSoutput$sd$b)
test.jags.out1 <- jags(model.file="test.jags1", data=test.data1,
parameters.to.save=c("a","b","tau","sigma"), n.chains=3, n.iter=10000)
test.jags.out1$BUGSoutput$mean$a
## [1] 0.01491162
test.jags.out1$BUGSoutput$sd$a
## [1] 0.08513474
test.jags.out1$BUGSoutput$mean$b
## [1] 1.000054
test.jags.out1$BUGSoutput$sd$b
## [1] 0.0001201778
Interestingly, the inferences are similar to the OLS results, but not nearly as much so. This leads me to suspect that the 500 data points used to train the prior are not carrying as much weight in the analysis as the last 500, and the prior is effectively getting washed out, though I'm not sure on this point.
Regardless, I can't think of a reason not to use all 1000 data points (and non-informative priors) either, particularly since I suspect the 500+500 is using the first 500 and last 500 differently.
So perhaps, the answer to all of this is: I trust the OLS and 1000-point Bayesian results more than the 500+500, and OLS is simpler.

In my opinion is not a matter of better but a matter of which inference approach you're comfortable with.
You must remember that OLS comes from the frequentist school of inference and estimation is donde ML process which for this particular problem coincides with a geometric argument of distance minimization (in my personal opinion it is very odd, as supposedly we aare dealing with a rondom phenomena).
On the other hand, in the bayesian approach, inference is done through posterior distribution which is the multiplication of the prior (that represents the decision maker's previous information about the phenom) and the likelihood.
Again, the question is a matter of what inference approach you're comfortable with.

Related

Importance weighted autoencoder doing worse than VAE

I've been implementing VAE and IWAE models on the caltech silhouettes dataset and am having an issue where the VAE outperforms IWAE by a modest margin (test LL ~120 for VAE, ~133 for IWAE!). I don't believe this should be the case, according to both theory and experiments produced here.
I'm hoping someone can find some issue in how I'm implementing that's causing this to be the case.
The network I'm using to approximate q and p is the same as that detailed in the appendix of the paper above. The calculation part of the model is below:
data_k_vec = data.repeat_interleave(K,0) # Generate K samples (in my case K=50 is producing this behavior)
mu, log_std = model.encode(data_k_vec)
z = model.reparameterize(mu, log_std) # z = mu + torch.exp(log_std)*epsilon (epsilon ~ N(0,1))
decoded = model.decode(z) # this is the sigmoid output of the model
log_prior_z = torch.sum(-0.5 * z ** 2, 1)-.5*z.shape[1]*T.log(torch.tensor(2*np.pi))
log_q_z = compute_log_probability_gaussian(z, mu, log_std) # Definitions below
log_p_x = compute_log_probability_bernoulli(decoded,data_k_vec)
if model_type == 'iwae':
log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, K)
elif model_type =='vae':
log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, 1)*1/K
log_w_minus_max = log_w_matrix - torch.max(log_w_matrix, 1, keepdim=True)[0]
ws_matrix = torch.exp(log_w_minus_max)
ws_norm = ws_matrix / torch.sum(ws_matrix, 1, keepdim=True)
ws_sum_per_datapoint = torch.sum(log_w_matrix * ws_norm, 1)
loss = -torch.sum(ws_sum_per_datapoint) # value of loss that gets returned to training function. loss.backward() will get called on this value
Here are the likelihood functions. I had to fuss with the bernoulli LL in order to not get nan during training
def compute_log_probability_gaussian(obs, mu, logstd, axis=1):
return torch.sum(-0.5 * ((obs-mu) / torch.exp(logstd)) ** 2 - logstd, axis)-.5*obs.shape[1]*T.log(torch.tensor(2*np.pi))
def compute_log_probability_bernoulli(theta, obs, axis=1): # Add 1e-18 to avoid nan appearances in training
return torch.sum(obs*torch.log(theta+1e-18) + (1-obs)*torch.log(1-theta+1e-18), axis)
In this code there's a "shortcut" being used in that the row-wise importance weights are being calculated in the model_type=='iwae' case for the K=50 samples in each row, while in the model_type=='vae' case the importance weights are being calculated for the single value left in each row, so that it just ends up calculating a weight of 1. Maybe this is the issue?
Any and all help is huge - I thought that addressing the nan issue would permanently get me out of the weeds but now I have this new problem.
EDIT:
Should add that the training scheme is the same as that in the paper linked above. That is, for each of i=0....7 rounds train for 2**i epochs with a learning rate of 1e-4 * 10**(-i/7)
The K-sample importance weighted ELBO is
$$ \textrm{IW-ELBO}(x,K) = \log \sum_{k=1}^K \frac{p(x \vert z_k) p(z_k)}{q(z_k;x)}$$
For the IWAE there are K samples originating from each datapoint x, so you want to have the same latent statistics mu_z, Sigma_z obtained through the amortized inference network, but sample multiple z K times for each x.
So its computationally wasteful to compute the forward pass for data_k_vec = data.repeat_interleave(K,0), you should compute the forward pass once for each original datapoint, then repeat the statistics output by the inference network for sampling:
mu = torch.repeat_interleave(mu,K,0)
log_std = torch.repeat_interleave(log_std,K,0)
Then sample z_k. And now repeat your datapoints data_k_vec = data.repeat_interleave(K,0), and use the resulting tensor to efficiently evaluate the conditional p(x |z_k) for each importance sample z_k.
Note you may also want to use the logsumexp operation when calculating the IW-ELBO for numerical stability. I can't quite figure out what's going on with the log_w_matrix calculation in your post, but this is what I would do:
log_pz = ...
log_qzCx = ....
log_pxCz = ...
log_iw = log_pxCz + log_pz - log_qzCx
log_iw = log_iw.reshape(-1, K)
iwelbo = torch.logsumexp(log_iw, dim=1) - np.log(K)
EDIT: Actually after thinking about it a bit and using the score function identity, you can interpret the IWAE gradient as an importance weighted estimate of the standard single-sample gradient, so the method in the OP for calculation of the importance weights is equivalent (if a bit wasteful), provided you place a stop_gradient operator around the normalized importance weights, which you call w_norm. So I the main problem is the absence of this stop_gradient operator.

Couting or integrating multivariate Gaussian probabilities on opposite side of non-linear decision line

So I have something that looks like the following:
However, I am having real trouble integrating the data on the other side of this decision line to get my errors.
In general, given analytic form of the decision boundary you could compute the integrals exactly. However, why not use monte carlo which is fast, simple and generic (will work for any distributions and decision boundaries). All you have to do is repeatedly sample from your gaussians, check if the sampled point is on the correct side (N_c) or incorrect (N_i) and in the limit you will get your integrals from
INTEGRAL_of_distributions_being_on_correct_side ~ N_c / (N_c + N_i)
INTEGRAL_of_distributions_being_on_incorrect_side ~ N_i / (N_c + N_i)
thus in pseudo code:
N_c = 0
N_i = 0
for i=1 to N do
y ~ P({-, +}) # sample distribution
x ~ P(X|y) # sample point from given class
if side_of_decision(x) == y then
N_c += 1
else
N_i += 1
end
end
return N_c, N_i
In your case P({-, +}) is probably just 50-50 chance and P(X|-) and P(X|+) are your two Gaussians.

What are some specific examples of Ensemble Learning?

What are some concrete real life examples which can be solved using Boosting/Bagging algorithms? Code snippets would be greatly appreciated.
Ensembles are used to fight overfitting / improve generalization or to fight specific weaknesses / use strength of different classifiers. They can be applied in any classification task.
I used ensembles in my masters thesis. The code is on Github.
Example 1
For example, think of a binary problem where you have to tell if a data point is of class A or B. This could be an image and you have to decide if there is a (A) a dog or (B) a cat on it. Now you have two classifiers (1) and (2) (e.g. two neural networks, but trained in different ways; or one SVM and a decision tree, or ...). They make the following errors:
(1): Predicted
T | A B
R ------------
U A | 90% 10%
E B | 50% 50%
(2): Predicted
T | A B
R ------------
U A | 60% 40%
E B | 40% 60%
You could, for example, combine them to an ensemble by first using (1). If it predicts B, then you can use (2). Otherwise you stick with it.
Now, what would be the expected error, (falsely) assuming both are independent)?
If the true class is A, then we predict with 90% the true result. In 10% of the cases we predict B and use the second classifier. This one gets it right in 60% of the cases. This means if we have A, we predict A in 0.9 + 0.1*0.6 = 0.96 = 96% of the cases.
If the true class is B, we predict in 50% of the cases B. But we also need to get it right the second time, so only in 0.5*0.6 = 0.3 = 30% of the cases we get it right there.
So in this simple example we made the situation better for one class, but worse for the other.
Example 2
Now, lets say we have 3 classifiers with
Predicted
T | A B
R ------------
U A | 60% 40%
E B | 40% 60%
each, but the classifications are independent. What do you get when you make a majority vote?
If you have class A, the probability that at least two say it is class A is
0.6 * 0.6 * 0.6 + 0.6 * 0.6 * 0.4 + 0.6 * 0.4 * 0.6 + 0.4 * 0.6 * 0.6
= 1*0.6^3 + 3*(0.6^2 * 0.4^1)
= (3 nCr 3) * 0.6 + (3 nCr 2) * (0.6^2 * 0.4^1)
= 0.648
The same goes for the other class. So we improved the classifier to
Predicted
T | A B
R ------------
U A | 65% 35%
E B | 35% 65%
Code
See sklearns page on Ensembles for code.
The most specific example of ensemble learning are random forests.
Ensemble is the art of combining diverse set of learners (individual models) together to improvise on the stability and predictive power of the model.
Ensemble Learning Techniques:
Bagging : Bagging tries to implement similar learners on small sample populations and then takes a mean of all the predictions. In generalized bagging, you can use different learners on different population.
Boosting : Boosting is an iterative technique which adjust the weight of an observation based on the last classification. If an observation was classified incorrectly, it tries to increase the weight of this observation and vice versa. Boosting in general decreases the bias error and builds strong predictive models.
Stacking : This is a very interesting way of combining models. Here we use a learner to combine output from different learners. This can lead to decrease in either bias or variance error depending on the combining learner we use.
for more reference:
Basics of Ensemble Learning Explained
Here's Python based pseudo code for basic Ensemble Learning:
# 3 ML/DL models -> first_model, second_model, third_model
all_models = [first_model, second_model, third_model]
first_model.load_weights(first_weight_file)
second_model.load_weights(second_weight_file)
third_model.load_weights(third_weight_file)
def ensemble_average(models: List [Model]): # averaging
outputs = [model.outputs[0] for model in all_models]
y = Average()(outputs)
model = Model(model_input, y, name='ensemble_average')
pred = model.predict(x_test, batch_size = 32)
pred = numpy.argmax(pred, axis=1)
E = numpy.sum(numpy.not_equal(pred, y_test))/ y_test.shape[0]
return E
def ensemble_vote(models: List [Model]): # max-voting
pred = []
yhats = [model.predict(x_test) for model in all_models]
yhats = numpy.argmax(yhats, axis=2)
yhats = numpy.array(yhats)
for i in range(0,len(x_test)):
m = mode([yhats[0][i], yhats[1][i], yhats[2][i]])
pred = numpy.append(pred, m[0])
E = numpy.sum(numpy.not_equal(pred, y_test))/ y_test.shape[0]
return E
# Errors calculation
E1 = ensemble_average(all_models);
E2 = ensemble_vote(all_models);

SVM with RBF: Decision values tend to be equal to the negative of the bias term for faraway test samples

Using RBF kernel in SVM, why the decision value of test samples faraway from the training ones tend to be equal to the negative of the bias term b?
A consequence is that, once the SVM model is generated, if I set the bias term to 0, the decision value of test samples faraway from the training ones tend to 0. Why it happens?
Using the LibSVM, the bias term b is the rho. The decision value is the distance from the hyperplane.
I need to understand what defines this behavior. Does anyone understand that?
Running the following R script, you can see this behavior:
library(e1071)
library(mlbench)
data(Glass)
set.seed(2)
writeLines('separating training and testing samples')
testindex <- sort(sample(1:nrow(Glass), trunc(nrow(Glass)/3)))
training.samples <- Glass[-testindex, ]
testing.samples <- Glass[testindex, ]
writeLines('normalizing samples according to training samples between 0 and 1')
fnorm <- function(ran, data) {
(data - ran[1]) / (ran[2] - ran[1])
}
minmax <- data.frame(sapply(training.samples[, -10], range))
training.samples[, -10] <- mapply(fnorm, minmax, training.samples[, -10])
testing.samples[, -10] <- mapply(fnorm, minmax, testing.samples[, -10])
writeLines('making the dataset binary')
training.samples$Type <- factor((training.samples$Type == 1) * 1)
testing.samples$Type <- factor((testing.samples$Type == 1) * 1)
writeLines('training the SVM')
svm.model <- svm(Type ~ ., data=training.samples, cost=1, gamma=2**-5)
writeLines('predicting the SVM with outlier samples')
points = c(0, 0.8, 1, # non-outliers
1.5, -0.5, 2, -1, 2.5, -1.5, 3, -2, 10, -9) # outliers
outlier.samples <- t(sapply(points, function(p) rep(p, 9)))
svm.pred <- predict(svm.model, testing.samples[, -10], decision.values=TRUE)
svm.pred.outliers <- predict(svm.model, outlier.samples, decision.values=TRUE)
writeLines('') # printing
svm.pred.dv <- c(attr(svm.pred, 'decision.values'))
svm.pred.outliers.dv <- c(attr(svm.pred.outliers, 'decision.values'))
names(svm.pred.outliers.dv) <- points
writeLines('test sample decision values')
print(head(svm.pred.dv))
writeLines('non-outliers and outliers decision values')
print(svm.pred.outliers.dv)
writeLines('svm.model$rho')
print(svm.model$rho)
writeLines('')
writeLines('<< setting svm.model$rho to 0 >>')
writeLines('predicting the SVM with outlier samples')
svm.model$rho <- 0
svm.pred <- predict(svm.model, testing.samples[, -10], decision.values=TRUE)
svm.pred.outliers <- predict(svm.model, outlier.samples, decision.values=TRUE)
writeLines('') # printing
svm.pred.dv <- c(attr(svm.pred, 'decision.values'))
svm.pred.outliers.dv <- c(attr(svm.pred.outliers, 'decision.values'))
names(svm.pred.outliers.dv) <- points
writeLines('test sample decision values')
print(head(svm.pred.dv))
writeLines('non-outliers and outliers decision values')
print(svm.pred.outliers.dv)
writeLines('svm.model$rho')
print(svm.model$rho)
Comments about the code:
It uses a dataset of 9 dimensions.
It splits the dataset into training and testing.
It normalizes the samples between 0 and 1 for all dimensions.
It makes the problem to be binary.
It fits a SVM model.
It predicts the testing samples, getting the decision values.
It predicts some synthetic (outlier) samples outside [0, 1] in the feature space, getting the decision values.
It shows that the decision value for outliers tends to be the negative of the bias term b generated by the model.
It sets the bias term b to 0.
It predicts the testing samples, getting the decision values.
It predicts some synthetic (outlier) samples outside [0, 1] in the feature space, getting the decision values.
It shows that the decision value for outliers tends to be 0.
Do you mean negative of the bias term instead of inverse?
The decision function of the SVM is sign(w^T x - rho), where rho is the bias term , w is the weight vector, and x is the input. But thats in the primal space / linear form. w^T x is replaced by our kernel function, which in this case is the RBF kernel.
The RBF kernel is defined as . So if the distance between two things is very large, then it gets squared - we get a huge number. γ is a positive number, so we are making our huge giant value a huge giant negative value. exp(-10) is already on the order of 5*10^-5, so for far away points the RBF kernel is going to become essentailly zero. If sample is far aware from all of your training data, than all of the kernel products will be nearly zero. that means w^T x will be nearly zero. And so what you are left with is essentially sign(0-rho), ie: the negative of your bias term.

Learning Weka - Precision and Recall - Wiki example to .Arff file

I'm new to WEKA and advanced statistics, starting from scratch to understand the WEKA measures. I've done all the #rushdi-shams examples, which are great resources.
On Wikipedia the http://en.wikipedia.org/wiki/Precision_and_recall examples explains with an simple example about a video software recognition of 7 dogs detection in a group of 9 real dogs and some cats.
I perfectly understand the example, and the recall calculation.
So my first step, let see in Weka how to reproduce with this data.
How do I create such a .ARFF file?
With this file I have a wrong Confusion Matrix, and the wrong Accuracy By Class
Recall is not 1, it should be 4/9 (0.4444)
#relation 'dogs and cat detection'
#attribute 'realanimal' {dog,cat}
#attribute 'detected' {dog,cat}
#attribute 'class' {correct,wrong}
#data
dog,dog,correct
dog,dog,correct
dog,dog,correct
dog,dog,correct
cat,dog,wrong
cat,dog,wrong
cat,dog,wrong
dog,?,?
dog,?,?
dog,?,?
dog,?,?
dog,?,?
cat,?,?
cat,?,?
Output Weka (without filters)
=== Run information ===
Scheme:weka.classifiers.rules.ZeroR
Relation: dogs and cat detection
Instances: 14
Attributes: 3
realanimal
detected
class
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
ZeroR predicts class value: correct
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 4 57.1429 %
Incorrectly Classified Instances 3 42.8571 %
Kappa statistic 0
Mean absolute error 0.5
Root mean squared error 0.5044
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 7
Ignored Class Unknown Instances 7
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 1 0.571 1 0.727 0.65 correct
0 0 0 0 0 0.136 wrong
Weighted Avg. 0.571 0.571 0.327 0.571 0.416 0.43
=== Confusion Matrix ===
a b <-- classified as
4 0 | a = correct
3 0 | b = wrong
There must be something wrong with the False Negative dogs,
or is my ARFF approach totally wrong and do I need another kind of attributes?
Thanks
Lets start with the basic definition of Precision and Recall.
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
Where TP is True Positive, FP is False Positive, and FN is False Negative.
In the above dog.arff file, Weka took into account only the first 7 tuples, it ignored the remaining 7. It can be seen from the above output that it has classified all the 7 tuples as correct(4 correct tuples + 3 wrong tuples).
Lets calculate the precision for correct and wrong class.
First for the correct class:
Prec = 4/(4+3) = 0.571428571
Recall = 4/(4+0) = 1.
For wrong class:
Prec = 0/(0+0)= 0
recall =0/(0+3) = 0

Resources