Logistic regression using Flux.jl

I have a dataset consisting of student marks in 2 subjects and the result of whether the student was admitted to college or not. I need to perform a logistic regression on the data and find the optimum parameter θ that minimizes the loss, and then predict the results for the test data. I am not trying to build any complex non-linear network here.
The data looks like this
I have the loss function for logistic regression defined like this, which works fine:
predict(X) = sigmoid(X*θ)
loss(X,y) = (1 / length(y)) * sum(-y .* log.(predict(X)) .- (1 .- y) .* log.(1 .- predict(X)))
I need to minimize this loss function and find the optimum θ. I want to do it with Flux.jl or any other library which makes it even easier.
I tried using Flux.jl after reading the examples, but I am not able to minimize the cost.
My code snippet:
function update!(ps, η = .1)
    for w in ps
        w.data .-= w.grad .* η
        print(w.data)
        w.grad .= 0
    end
end

for i = 1:400
    back!(L)
    update!((θ, b))
    @show L
end

You can use either GLM.jl (simpler) or Flux.jl (more involved but more powerful in general).
In the code I generate the data so that you can check that the result is correct. Additionally, I use a binary response variable; if your target variable has a different encoding you might need to change the code a bit.
Here is the code to run (you can tweak the parameters to increase the convergence speed - I chose ones that are safe):
using GLM, DataFrames, Flux.Tracker

srand(1)
n = 10000
df = DataFrame(s1=rand(n), s2=rand(n))
df[:y] = rand(n) .< 1 ./ (1 .+ exp.(-(1 .+ 2 .* df[1] .+ 0.5 .* df[2])))
model = glm(@formula(y ~ s1 + s2), df, Binomial(), LogitLink())

x = Matrix(df[1:2])
y = df[3]
W = param(rand(2,1))
b = param(rand(1))
predict(x) = 1.0 ./ (1.0+exp.(-x*W .- b))
loss(x,y) = -sum(log.(predict(x[y,:]))) - sum(log.(1 - predict(x[.!y,:])))

function update!(ps, η = .0001)
    for w in ps
        w.data .-= w.grad .* η
        w.grad .= 0
    end
end

i = 1
while true
    back!(loss(x,y))
    max(maximum(abs.(W.grad)), abs(b.grad[1])) > 0.001 || break
    update!((W, b))
    i += 1
end
And here are the results:
julia> model # GLM result
StatsModels.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: y ~ 1 + s1 + s2
Coefficients:
Estimate Std.Error z value Pr(>|z|)
(Intercept) 0.910347 0.0789283 11.5338 <1e-30
s1 2.18707 0.123487 17.7109 <1e-69
s2 0.556293 0.115052 4.83513 <1e-5
julia> (b, W, i) # Flux result with number of iterations needed to converge
(param([0.910362]), param([2.18705; 0.556278]), 1946)

Thanks for this helpful example. However, it does not seem to run with my setup (Julia 1.1, Flux 0.7.1), since the 1+ and 1- operations in the predict and loss functions are not broadcast over the TrackedArray objects. Fortunately the fix is simple (note the dots!):
predict(x) = 1.0 ./ (1.0 .+ exp.(-x*W .- b))
loss(x,y) = -sum(log.(predict(x[y,:]))) - sum(log.(1 .- predict(x[.!y,:])))

Related

Use the Survey package to weight observations in stacked imputations

I am exploring model variable selection within imputed data.
One technique is to stack imputations in long format (where n observations in M imputed datasets creates a dataset n x M long), and use weighted regression to reduce the contribution of each observation proportionally to the number of imputations. If we treated the stacked dataset as one large dataset, the standard errors would be too small.
I am trying to use the weights argument in svyglm to account for the stacked data, resulting in SEs that you would expect with n observations, rather than n x M observations.
To illustrate:
library(mice)
### create data
set.seed(42)
n <- 50
id <- 1:n
var1 <- rbinom(n,1,0.4)
var2 <- runif(n,30,80)
var3 <- rnorm(n, mean = 12, sd = 5)
var4 <- rnorm(n, mean = 100, sd = 20)
prob <- (((var1*var2)+var3)-min((var1*var2)+var3)) / (max((var1*var2)+var3)-min((var1*var2)+var3))
outcome <- rbinom(n, 1, prob = prob)
data <- data.frame(id, var1, var2, var3, var4, outcome)
### Add missingness
data_miss <- ampute(data)
patt <- data_miss$patterns
patt <- patt[2:5,]
data_miss <- ampute(data, patterns = patt)
data_miss <- data_miss$amp
## create 5 imputed datasets
nimp <- 5
imp <- mice(data_miss, m = nimp)
## Stack data
data_long <- complete(imp, action = "long")
## Generate model in stacked data (SEs will be too small)
modlong <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = data_long)
summary(modlong)
The long data gives overly small SEs, since we have increased the size of our dataset by 5x:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.906417 0.965090 -3.012 0.0026 **
var1 2.221053 0.311167 7.138 9.48e-13 ***
var2 -0.002543 0.010468 -0.243 0.8081
var3 0.076955 0.032265 2.385 0.0171 *
var4 0.006595 0.008031 0.821 0.4115
Add weights
data_long$weight <- 1/nimp
library(survey)
des <- svydesign(ids = ~1, data = data_long, weights = ~weight)
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des)
summary(mod_svy)
The weighted regression gives similar SEs to the unweighted model
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036691 -2.804 0.00546 **
var1 2.221053 0.310906 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
Adding rescale = F (to apparently stop the weights being rescaled to sum to the sample size) doesn't change anything:
mod_svy <- svyglm(formula = outcome ~ var1 + var2 + var3 + var4, family = quasibinomial(), design = des, rescale = F)
summary(mod_svy)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.906417 1.036688 -2.804 0.00546 **
var1 2.221053 0.310905 7.144 1.03e-11 ***
var2 -0.002543 0.010547 -0.241 0.80967
var3 0.076955 0.030955 2.486 0.01358 *
var4 0.006595 0.008581 0.769 0.44288
I would have expected SEs similar to those obtained when running a model on a single imputed dataset:
## Assess SEs in single imputation
mod_singleimp <- glm(outcome ~ var1 + var2 + var3 + var4, family = "binomial", data = complete(imp,1))
summary(mod_singleimp)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.679589 2.116806 -1.266 0.20556
var1 2.476193 0.761195 3.253 0.00114 **
var2 0.014823 0.025350 0.585 0.55874
var3 0.048940 0.072752 0.673 0.50114
var4 -0.004551 0.017986 -0.253 0.80026
All assistance is greatly appreciated, as are suggestions of other ways of achieving the same goal.
Alternative options
The psfmi package allows for stepwise selection in multiply imputed datasets and pooling of the models. However, it is computationally intensive and slow with large datasets, particularly if the process needs to be bootstrapped (e.g. during internal validation) - hence the requirement for a less intensive stacking approach.
Sorry, no, this isn't going to work.
To handle stacked imputation data with weights you need frequency weights, so that a weight of 1/10 means you have 1/10 of an observation. With svydesign you specify sampling weights, so that a weight of 1/10 means your observation represents 10 observations in the population. These will (and should) give different standard errors. Pretending you have frequency weights when you actually have imputations is a clever hack to avoid having software that understands what it's doing, which is fine but isn't compatible with survey, which understands what it's doing and is doing something different.
Currently, if you want to use svyglm with multiple imputations you need to compute the standard errors separately -- most conveniently with Rubin's rules using mitools::MIcombine, which is set up to work with the survey package (see the help for with.svyimputationList and withPV).
It might be worth putting in a feature request to the mitools or survey developers (with citations to examples) to allow for stacked analysis of imputations, but this isn't just a matter of adjusting the weights.
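Rubin's rules themselves are only a few lines of arithmetic (this is the pooling that mitools::MIcombine performs). Here is a rough sketch in Python/NumPy rather than R, assuming you have already fitted one model per imputed dataset and extracted the coefficient vectors and their squared standard errors:
import numpy as np

def rubin_pool(estimates, variances):
    """Pool M per-imputation estimates with Rubin's rules.
    estimates, variances: arrays of shape (M, p) holding the coefficients
    and their squared standard errors from each of the M imputed-data fits."""
    estimates = np.asarray(estimates)
    variances = np.asarray(variances)
    M = estimates.shape[0]
    q_bar = estimates.mean(axis=0)          # pooled point estimate
    W = variances.mean(axis=0)              # within-imputation variance
    B = estimates.var(axis=0, ddof=1)       # between-imputation variance
    T = W + (1 + 1 / M) * B                 # total variance
    return q_bar, np.sqrt(T)                # pooled estimates and their SEs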

Importance weighted autoencoder doing worse than VAE

I've been implementing VAE and IWAE models on the Caltech Silhouettes dataset and am having an issue where the VAE outperforms the IWAE by a modest margin (test LL ~120 for VAE, ~133 for IWAE!). I don't believe this should be the case, according to both the theory and the experiments produced here.
I'm hoping someone can find some issue in how I'm implementing that's causing this to be the case.
The network I'm using to approximate q and p is the same as that detailed in the appendix of the paper above. The calculation part of the model is below:
data_k_vec = data.repeat_interleave(K,0) # Generate K samples (in my case K=50 is producing this behavior)
mu, log_std = model.encode(data_k_vec)
z = model.reparameterize(mu, log_std) # z = mu + torch.exp(log_std)*epsilon (epsilon ~ N(0,1))
decoded = model.decode(z) # this is the sigmoid output of the model
log_prior_z = torch.sum(-0.5 * z ** 2, 1)-.5*z.shape[1]*T.log(torch.tensor(2*np.pi))
log_q_z = compute_log_probability_gaussian(z, mu, log_std) # Definitions below
log_p_x = compute_log_probability_bernoulli(decoded,data_k_vec)
if model_type == 'iwae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, K)
elif model_type == 'vae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, 1)*1/K
log_w_minus_max = log_w_matrix - torch.max(log_w_matrix, 1, keepdim=True)[0]
ws_matrix = torch.exp(log_w_minus_max)
ws_norm = ws_matrix / torch.sum(ws_matrix, 1, keepdim=True)
ws_sum_per_datapoint = torch.sum(log_w_matrix * ws_norm, 1)
loss = -torch.sum(ws_sum_per_datapoint) # value of loss that gets returned to training function. loss.backward() will get called on this value
Here are the likelihood functions. I had to fuss with the Bernoulli LL in order not to get NaNs during training:
def compute_log_probability_gaussian(obs, mu, logstd, axis=1):
    return torch.sum(-0.5 * ((obs-mu) / torch.exp(logstd)) ** 2 - logstd, axis) - .5*obs.shape[1]*T.log(torch.tensor(2*np.pi))

def compute_log_probability_bernoulli(theta, obs, axis=1):  # Add 1e-18 to avoid nan appearances in training
    return torch.sum(obs*torch.log(theta+1e-18) + (1-obs)*torch.log(1-theta+1e-18), axis)
In this code there's a "shortcut" being used in that the row-wise importance weights are being calculated in the model_type=='iwae' case for the K=50 samples in each row, while in the model_type=='vae' case the importance weights are being calculated for the single value left in each row, so that it just ends up calculating a weight of 1. Maybe this is the issue?
Any and all help is hugely appreciated - I thought that addressing the NaN issue would permanently get me out of the weeds, but now I have this new problem.
EDIT:
I should add that the training scheme is the same as that in the paper linked above: for each of i = 0, ..., 7 rounds, train for 2**i epochs with a learning rate of 1e-4 * 10**(-i/7).
The K-sample importance weighted ELBO is
$$ \textrm{IW-ELBO}(x,K) = \log \frac{1}{K} \sum_{k=1}^K \frac{p(x \vert z_k)\, p(z_k)}{q(z_k;x)}$$
For the IWAE there are K samples originating from each datapoint x, so you want the same latent statistics mu_z, Sigma_z obtained through the amortized inference network, but with z sampled K times for each x.
It is therefore computationally wasteful to compute the full forward pass for data_k_vec = data.repeat_interleave(K,0); you should compute the encoder forward pass once for each original datapoint, then repeat the statistics output by the inference network before sampling:
mu = torch.repeat_interleave(mu,K,0)
log_std = torch.repeat_interleave(log_std,K,0)
Then sample z_k. And now repeat your datapoints data_k_vec = data.repeat_interleave(K,0), and use the resulting tensor to efficiently evaluate the conditional p(x |z_k) for each importance sample z_k.
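Putting those pieces together, a sketch of the forward pass (reusing the OP's model.encode / reparameterize / decode methods and variable names, whose exact interfaces are an assumption here):
mu, log_std = model.encode(data)                        # one encoder pass per datapoint
mu_k = torch.repeat_interleave(mu, K, 0)                # repeat the statistics K times
log_std_k = torch.repeat_interleave(log_std, K, 0)
z = model.reparameterize(mu_k, log_std_k)               # K latent samples per datapoint
decoded = model.decode(z)
data_k_vec = data.repeat_interleave(K, 0)               # repeated only to evaluate p(x | z_k)
log_p_x = compute_log_probability_bernoulli(decoded, data_k_vec)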
Note you may also want to use the logsumexp operation when calculating the IW-ELBO for numerical stability. I can't quite figure out what's going on with the log_w_matrix calculation in your post, but this is what I would do:
log_pz = ...
log_qzCx = ....
log_pxCz = ...
log_iw = log_pxCz + log_pz - log_qzCx
log_iw = log_iw.reshape(-1, K)
iwelbo = torch.logsumexp(log_iw, dim=1) - np.log(K)
EDIT: Actually, after thinking about it a bit and using the score function identity, you can interpret the IWAE gradient as an importance weighted estimate of the standard single-sample gradient, so the method in the OP for calculating the importance weights is equivalent (if a bit wasteful), provided you place a stop_gradient operator around the normalized importance weights, which you call ws_norm. So I think the main problem is the absence of this stop_gradient operator.
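In PyTorch the stop_gradient operator is just .detach(). A minimal sketch of that fix, reusing the tensors from the OP's snippet and torch.logsumexp in place of the manual max-subtraction:
log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, K)
log_w_norm = log_w_matrix - torch.logsumexp(log_w_matrix, dim=1, keepdim=True)
ws_norm = torch.exp(log_w_norm).detach()                 # stop-gradient on the normalized weights
loss = -torch.sum(torch.sum(ws_norm * log_w_matrix, dim=1))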

XGBoost - How to use "mae" as objective function?

I know XGBoost needs the first and second gradients, but has anybody else used "mae" as the objective function?
A little bit of theory first, sorry! You asked for the gradient and hessian for MAE; however, the MAE is not continuously twice differentiable, so calculating the first and second derivatives becomes tricky. The "kink" at x = 0 prevents the MAE from being continuously differentiable.
Moreover, the second derivative is zero at all the points where it is well behaved. In XGBoost, the second derivative is used as a denominator in the leaf weights, and when it is zero this creates serious math errors.
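For context, XGBoost's optimal weight for a leaf j is w_j = -G_j / (H_j + lambda), where G_j and H_j are the sums of the per-observation first and second derivatives of the loss in that leaf and lambda is the L2 regularization term; with the MAE, h_i = 0 wherever the loss is differentiable, so only the regularization term keeps that denominator away from zero.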
Given these complexities, our best bet is to try to approximate the MAE using some other, nicely behaved function. Let's take a look.
Several functions approximate the absolute value. Clearly, for very small values, the Squared Error (MSE) is a fairly good approximation of the MAE. However, I assume that this is not sufficient for your use case.
Huber Loss is a well documented loss function. However, it is not smooth, so we cannot guarantee smooth derivatives. Instead, we can approximate it using the Pseudo-Huber function, which can be implemented in Python XGBoost as follows:
import numpy as np
import xgboost as xgb
dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_test, label=y_test)
param = {'max_depth': 5}
num_round = 10
def huber_approx_obj(preds, dtrain):
    d = preds - dtrain.get_label()  # remove .get_label() when using the sklearn interface
    h = 1  # h is the Huber delta parameter
    scale = 1 + (d / h) ** 2
    scale_sqrt = np.sqrt(scale)
    grad = d / scale_sqrt
    hess = 1 / scale / scale_sqrt
    return grad, hess
bst = xgb.train(param, dtrain, num_round, obj=huber_approx_obj)
Other functions can be used by replacing obj=huber_approx_obj.
Fair Loss is not well documented at all, but it seems to work rather well. The Fair loss function is c * abs(x) - c**2 * log(abs(x)/c + 1) (see the docstring below). It can be implemented as such:
def fair_obj(preds, dtrain):
    """y = c * abs(x) - c**2 * np.log(abs(x)/c + 1)"""
    x = preds - dtrain.get_label()
    c = 1
    den = abs(x) + c
    grad = c * x / den
    hess = c * c / den ** 2
    return grad, hess
This code is taken and adapted from the second place solution in the Kaggle Allstate Challenge.
Log-Cosh Loss function:
def log_cosh_obj(preds, dtrain):
    x = preds - dtrain.get_label()
    grad = np.tanh(x)
    hess = 1 / np.cosh(x) ** 2
    return grad, hess
Finally, you can create your own custom loss functions using the above functions as templates.
Warning: Due to API changes, newer versions of XGBoost may require loss functions of the form:
def custom_objective(y_true, y_pred):
    ...
    return grad, hess
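For example, the Pseudo-Huber objective above would look roughly like this under that newer signature (a sketch only, assuming a recent scikit-learn wrapper that accepts a callable objective and passes labels and predictions directly as arrays):
import numpy as np
import xgboost as xgb

def huber_approx_obj(y_true, y_pred):
    d = y_pred - y_true
    h = 1  # Huber delta parameter
    scale = 1 + (d / h) ** 2
    scale_sqrt = np.sqrt(scale)
    grad = d / scale_sqrt
    hess = 1 / scale / scale_sqrt
    return grad, hess

# reusing the x_train / y_train arrays from the snippet above
# model = xgb.XGBRegressor(objective=huber_approx_obj, max_depth=5, n_estimators=10)
# model.fit(x_train, y_train)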
For the Huber loss above, I think the gradient is missing a negative sign upfront. Should be as
grad = - d / scale_sqrt
I am running the Huber/Fair objective from above on ~normally distributed Y, but for some reason with alpha < 0 (and all the time for Fair) the resulting predictions all equal zero...

Implementation of backpropagation algorithm

I'm building a neural network with the architecture:
input layer --> fully connected layer --> ReLU --> fully connected layer --> softmax
I'm using the equations outlined here DeepLearningBook to implement backprop. I think my mistake is in eq. 1. When differentiating, do I consider each example independently, yielding an N x C (no. of examples x no. of classes) matrix, or all together, yielding an N x 1 matrix?
# derivative of softmax
da2 = -a2 # a2 comprises activation values of output layer
da2[np.arange(N),y] += 1
da2 *= (a2[np.arange(N),y])[:,None]
# derivative of ReLU
da1 = a1 # a1 comprises activation values of hidden layer
da1[a1>0] = 1
# eq. 1
mask = np.zeros(a2.shape)
mask[np.arange(N),y] = 1
delta_2 = ((1/a2) * mask) * da2 / N
# delta_L = - (1 / a2[np.arange(N),y])[:,None] * da2 / N
# eq.2
delta_1 = np.dot(delta_2,W2.T) * da1
# eq. 3
grad_b1 = np.sum(delta_1,axis=0)
grad_b2 = np.sum(delta_2,axis=0)
# eq. 4
grad_w1 = np.dot(X.T,delta_1)
grad_w2 = np.dot(a1.T,delta_2)
Oddly, the commented line in eq. 1 returns the correct value for biases but I can't seem to justify using that equation since it returns an N x 1 matrix which is multiplied with the corresponding rows of da2.
Edit: I'm working on the assignment problems of the CS231n course which can be found here: CS231n
I also couldn't find any explanation about this elsewhere, so I wrote a post :) Please read it here.
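For reference, when the loss is the mean cross-entropy of a softmax output, eq. 1 collapses to the well-known form delta_2 = (a2 - one_hot(y)) / N, which sidesteps the separate softmax-derivative step entirely. A NumPy sketch for the architecture above (variable names follow the question; treat this as an illustrative sketch, not the assignment's official solution):
import numpy as np

def backprop_grads(X, y, a1, a2, W2):
    # X: (N, D) inputs, y: (N,) integer labels
    # a1: (N, H) ReLU activations, a2: (N, C) softmax probabilities
    N = X.shape[0]
    delta_2 = a2.copy()
    delta_2[np.arange(N), y] -= 1      # softmax + cross-entropy: a2 - one_hot(y)
    delta_2 /= N                       # loss is averaged over the N examples
    delta_1 = delta_2 @ W2.T
    delta_1[a1 <= 0] = 0               # ReLU derivative
    grad_W2 = a1.T @ delta_2
    grad_b2 = delta_2.sum(axis=0)
    grad_W1 = X.T @ delta_1
    grad_b1 = delta_1.sum(axis=0)
    return grad_W1, grad_b1, grad_W2, grad_b2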

Vectorized gradient descent basics

I'm implementing simple gradient descent in Octave but it's not working. Here is the data I'm using:
X = [1 2 3
1 4 5
1 6 7]
y = [10
11
12]
theta = [0
0
0]
alpha = 0.001 and itr = 50
This is my gradient descent implementation:
function theta = Gradient(X, y, theta, alpha, itr)
  m = length(y);
  for i = 1:itr,
    th1 = theta(1) - alpha * (1/m) * sum((X * theta - y) .* X(:, 1));
    th2 = theta(2) - alpha * (1/m) * sum((X * theta - y) .* X(:, 2));
    th3 = theta(3) - alpha * (1/m) * sum((X * theta - y) .* X(:, 3));
    theta(1) = th1;
    theta(2) = th2;
    theta(3) = th3;
  end
end
Questions are:
It produces some values of theta which I use in theta * [1 2 3] and expect an output near about 10 (from y). Is that the correct way to test the hypothesis? [h(x) = theta' * x]
How can I determine how many times should it iterate? If I give it 1500 iterations, theta gets extremely small (in e).
If I use double digit numbers in X, theta gets too small again. Even with < 5 iterations.
I've been struggling with these things for a long time now. Unable to resolve it myself.
Sorry for bad formatting.
Your batch gradient descent implementation seems perfectly fine to me. Can you be more specific about the error you are facing? Having said that, regarding your question "Is that the correct way to test the hypothesis? [h(x) = theta' * x]":
Based on the dimensions of your test set here, you should test it as h(x) = X*theta.
For your second question, the number of iterations depends on the data set provided. To decide on the optimal number of iterations, plot your cost function against the number of iterations; as the iterations increase, the cost should decrease, and from that plot you can decide how many iterations you need. You might also try increasing the value of alpha in steps of 0.001, 0.003, 0.01, 0.03, 0.1, ... to find the best possible alpha value.
For your third question, I guess you are directly trying to model the data you show in this question. This data set is very small, containing just 3 training examples. If you are trying to implement the linear regression algorithm, you need a proper training set with enough data to train your model; then you can test the model on your test data.
Refer to Andrew Ng's Machine Learning course on www.coursera.org; you will find more information in that course.
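As a side note, the three th1/th2/th3 lines collapse into a single vectorized update, theta = theta - alpha/m * X' * (X*theta - y) in Octave. A rough NumPy equivalent that also records the cost so it can be plotted against the iteration count, using the small example data from the question:
import numpy as np

def gradient_descent(X, y, alpha=0.001, iters=50):
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(iters):
        err = X @ theta - y                    # residuals h(x) - y
        theta -= (alpha / m) * (X.T @ err)     # vectorized form of the th1/th2/th3 updates
        costs.append((err @ err) / (2 * m))    # J(theta); plot this against the iteration
    return theta, costs

X = np.array([[1., 2., 3.], [1., 4., 5.], [1., 6., 7.]])
y = np.array([10., 11., 12.])
theta, costs = gradient_descent(X, y)
print(X @ theta)   # predictions h(x) = X * theta, to compare with y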
