H2O giving a different R^2 than calculating manually? - machine-learning

I am confused about how H2O calculates R^2. I created a dummy dataframe and used H2O's H2ORandomForestEstimator:
import pandas as pd
import h2o
from h2o.estimators import H2ORandomForestEstimator
h2o.init()
df = pd.DataFrame({'x':[1,2,3,4,5],'y':[3,9,2,8,1]})
h2o_df=h2o.H2OFrame(df)
rf = H2ORandomForestEstimator()
rf.train('x','y',h2o_df)
rf.r2()
This returns -0.667, which would indicate a pretty poor fit! But I calculated R^2 with the predict method:
y_true = df.y
y_pred = rf.predict(h2o_df).as_data_frame().predict
SSE = sum((y_pred-y_true)**2)
SST = sum((y_true-y_true.mean())**2)
r2 = 1-(SSE/SST)
r2
This returns 0.727, which makes a lot more sense. What is happening internally with the .r2() method?

Pretty sure this is a bug. As a workaround, rf.model_performance(h2o_df).r2() returns the correct value for R^2 (the same as when calculating manually).
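For reference, the manual calculation from the question can be wrapped in a small helper. This is only a sketch of the formula R^2 = 1 - SSE/SST: the y values are the ones from the question, while the helper name r_squared and the prediction lists are hypothetical and do not come from H2O. It shows that R^2 is 1 for a perfect fit, 0 when the model always predicts the mean, and negative when the predictions are worse than the mean.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    y_mean = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - sse / sst


y_true = [3, 9, 2, 8, 1]                      # y column from the question
mean_pred = [sum(y_true) / len(y_true)] * 5   # hypothetical: always predict the mean
bad_pred = [9, 3, 8, 2, 1]                    # hypothetical: worse than the mean

r2_perfect = r_squared(y_true, y_true)
r2_mean = r_squared(y_true, mean_pred)
r2_bad = r_squared(y_true, bad_pred)
```

A negative value such as the -0.667 in the question therefore means that the predictions being scored were worse than a constant mean prediction.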

Related

Use of 'is_unbalance' parameter in Lightgbm

I am trying to use the 'is_unbalance' parameter in my model training for a binary classification problem where the positive class is approximately 3%. If I set the parameter 'is_unbalance', I observe that the binary log loss drops in the first iteration but then keeps on increasing. I'm noticing this behavior only if I enable this parameter 'is_unbalance'. Otherwise, there is a steady drop in log_loss. Appreciate your help on this. Thanks.
When you do not balance the sets for such an unbalanced dataset, the objective value will obviously keep dropping, and the model will probably end up classifying all predictions as the majority class while still showing a fantastic objective value.
Balancing the classes is necessary, but you don't have to stop at is_unbalance: you can use scale_pos_weight, a customized metric, or per-sample weights, as follows:
# Weight each class inversely to its frequency (the rarest class gets weight 1)
WEIGHTS = y_train.value_counts(normalize = True).min() / y_train.value_counts(normalize = True)
# Look up the weight of each training label (y_train is a pandas Series)
TRAIN_WEIGHTS = y_train.map(WEIGHTS).values
train_data = lgb.Dataset(X_train, label=y_train, weight = TRAIN_WEIGHTS)
Also, optimizing other hyperparameters should solve the issue of increasing log_loss.
When you set is_unbalance: True, the algorithm will try to automatically balance the weight of the dominated label (using the pos/neg fraction in the training set).
If you want to change scale_pos_weight (it defaults to 1, which assumes that the positive and negative labels are equally weighted) for an unbalanced dataset, you can use the following formula (based on this issue on the LightGBM repository) to set it correctly:
scale_pos_weight = number of negative samples / number of positive samples
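As a rough sketch of that formula, the ratio can be computed from the training labels and placed in the parameter dictionary. The helper name, the toy label list (about 3% positives, as in the question), and the choice of metric are assumptions for illustration. LightGBM documents the parameter as scale_pos_weight and says to use either it or is_unbalance, not both.

```python
from collections import Counter


def scale_pos_weight_from_labels(labels):
    """Return (number of negative samples) / (number of positive samples)."""
    counts = Counter(labels)
    n_pos, n_neg = counts[1], counts[0]
    if n_pos == 0:
        raise ValueError("no positive samples in the labels")
    return n_neg / n_pos


# Hypothetical labels: 3 positives and 97 negatives.
labels = [1] * 3 + [0] * 97

# Parameter dictionary that could be passed to lgb.train;
# is_unbalance is deliberately left out.
params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "scale_pos_weight": scale_pos_weight_from_labels(labels),
}
```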

when setting .eval() my model performs worse than when I set .train()

During the training phase, I select the model parameters with the best performance metric.
if performance_metric.item()>max_performance:
    max_performance= performance_metric.item()
    torch.save(neural_net.state_dict(), PATH+'/best_model.pt')
This is the neural network model used:
class Neural_Net(nn.Module):
    def __init__(self, M,shape_input,batch_size):
        super(Neural_Net, self).__init__()
        self.lstm = nn.LSTM(shape_input,M)
        #self.dense1 = nn.Linear(shape_input,M)
        self.dense1 = nn.Linear(M,M) #Used with the LSTM
        torch.nn.init.xavier_uniform_(self.dense1.weight)
        self.dense2 = nn.Linear(M,M)
        torch.nn.init.xavier_uniform_(self.dense2.weight)
        self.dense3 = nn.Linear(M,1)
        torch.nn.init.xavier_uniform_(self.dense3.weight)
        self.drop = nn.Dropout(0.7)
        self.bachnorm1 = nn.BatchNorm1d(M)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
        self.hidden_cell = (torch.zeros(1,batch_size,M),torch.zeros(1,batch_size,M))

    def forward(self, x):
        lstm_out, self.hidden_cell = self.lstm(x.view(1 ,len(x), -1), self.hidden_cell)
        x = self.drop(self.relu(self.dense1(self.bachnorm1(lstm_out.view(len(x), -1)))))
        x = self.drop(self.relu(self.dense2(x)))
        x = self.relu(self.dense3(x))
        return x
After that I load the model with the best parameters and set the evaluation mode:
neural_net.load_state_dict(torch.load(PATH+'/best_model.pt'))
neural_net.eval()
The results are completely random. When I set train() the performance is similar to the selected best model parameter.
Is there an important aspect of eval() that I am forgetting? Is the batch normalization used correctly? In the test phase I am using batches of the same size as in the training phase.
It is hard to be specific without knowing your batch size, your training/test dataset sizes, or the discrepancies between the training and test datasets, but this issue has been discussed previously on the PyTorch forums.
In my experience, it sounds very much like the latent representation of your training data in the model is significantly different from that of your validation data. The main advice I can provide is to try reducing the momentum of your batchnorm layer. It might also be worth substituting a layernorm layer (which doesn't track a running mean/standard deviation) or setting track_running_stats=False in the nn.BatchNorm1d constructor, and seeing if the problem persists.
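To illustrate why train() and eval() can behave so differently, here is a simplified single-feature model of batch normalization in plain Python. It is a sketch, not PyTorch's implementation, and the function names and batch values are hypothetical. It assumes the PyTorch conventions: running mean starts at 0, running variance starts at 1, and the running estimates are updated as (1 - momentum) * running + momentum * batch statistic. In training mode the layer normalizes with the statistics of the current batch; in evaluation mode it uses the running estimates, which may be far from the batch statistics.

```python
import statistics


def batchnorm_train_step(batch, running_mean, running_var, momentum=0.1, eps=1e-5):
    """Training mode: normalise with batch statistics, then update the running estimates."""
    mean = statistics.fmean(batch)
    var_biased = statistics.pvariance(batch, mu=mean)      # used to normalise the batch
    var_unbiased = statistics.variance(batch, xbar=mean)   # used for the running estimate
    normalised = [(x - mean) / (var_biased + eps) ** 0.5 for x in batch]
    running_mean = (1 - momentum) * running_mean + momentum * mean
    running_var = (1 - momentum) * running_var + momentum * var_unbiased
    return normalised, running_mean, running_var


def batchnorm_eval(batch, running_mean, running_var, eps=1e-5):
    """Evaluation mode: normalise with the stored running estimates only."""
    return [(x - running_mean) / (running_var + eps) ** 0.5 for x in batch]


batch = [10.0, 12.0, 14.0, 16.0]        # hypothetical activations
running_mean, running_var = 0.0, 1.0    # initial running estimates

train_out, running_mean, running_var = batchnorm_train_step(batch, running_mean, running_var)
eval_out = batchnorm_eval(batch, running_mean, running_var)
```

After a single step the running mean has only moved a tenth of the way towards the batch mean, so the same batch is centred around zero in training mode but lands far from zero in evaluation mode. The layers after the batchnorm then see inputs they were never trained on.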

MLJ: selecting rows and columns for training in evaluate

I want to implement a kernel ridge regression that also works within MLJ. Moreover, I want to have the option to use either feature vectors or a predefined kernel matrix as in Python sklearn.
When I run this code
const MMI = MLJModelInterface
MMI.@mlj_model mutable struct KRRModel <: MLJModelInterface.Deterministic
    mu::Float64 = 1::(_ > 0)
    kernel::String = "linear"
end
function MMI.fit(m::KRRModel,verbosity::Int,K,y)
    K = MLJBase.matrix(K)
    fitresult = inv(K+m.mu*I)*y
    cache = nothing
    report = nothing
    return (fitresult,cache,report)
end
N = 10
K = randn(N,N)
K = K*K
a = randn(N)
y = K*a + 0.2*randn(N)
m = KRRModel()
kregressor = machine(m,K,y)
cv = CV(; nfolds=6, shuffle=nothing, rng=nothing)
evaluate!(kregressor, resampling=cv, measure=rms, verbosity=1)
the evaluate! function evaluates the machine on different subsets of rows of K. Due to the Representer Theorem, a kernel ridge regression has a number of nonzero coefficients equal to the number of samples. Hence, a reduced size matrix K[train_rows,train_rows] can be used instead of K[train_rows,:].
To indicate that I'm using a kernel matrix, I'd set m.kernel = "". How do I make evaluate! select the columns as well as the rows to form a smaller matrix when m.kernel = ""?
This is my first time using MLJ and I'd like to make as few modifications as possible.
Quoting the answer I got on the Julia Discourse from @ablaom:
The intended use of evaluate! is to estimate the generalisation error
associated with some supervised learning model, by subsampling
observations, as in cross-validation, a common use-case. I’m afraid
there is no natural way for evaluate! to do feature subsampling.
https://alan-turing-institute.github.io/MLJ.jl/dev/evaluating_model_performance/
FYI: There is a version of kernel regression implementing the MLJ
model interface, namely kernel partial least squares regression from
the package lalvim/PartialLeastSquaresRegressor.jl on GitHub.
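For comparison, the row-and-column selection described in the question (K[train_rows,train_rows] for fitting, and the test rows against the training columns for predicting) can be sketched in plain Python. This only illustrates the indexing, not MLJ or sklearn code; the helper submatrix, the toy kernel matrix, and the fold indices are all hypothetical.

```python
def submatrix(K, rows, cols):
    """Select the given rows and columns of a matrix stored as a list of lists."""
    return [[K[i][j] for j in cols] for i in rows]


# Hypothetical symmetric 5x5 kernel matrix with K[i][j] = (i + 1) * (j + 1).
K = [[(i + 1) * (j + 1) for j in range(5)] for i in range(5)]

train_rows = [0, 1, 3]   # hypothetical fold
test_rows = [2, 4]

# Fitting uses the kernel between training samples only.
K_train = submatrix(K, train_rows, train_rows)

# Prediction uses the kernel between test samples and training samples.
K_test = submatrix(K, test_rows, train_rows)
```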

In backpropagation, what does it mean when the error of a neural network converges to 0.5?

I've been trying to learn the math behind neural networks and have implemented (in Octave) a version of the following equations which include bias terms.
Back-propagation equations matrix form:
Visual representation of the problem and Network:
clear; clc; close all;
#Initialize weights and bias from input to hidden layer
W1 = rand(3,4)
b1 = ones(3,1)
#Initialize weights from hidden to output
W2 = rand(2,3)
b2 = ones(2,1)
#define sigmoid function
s = @(z) 1./(1 + exp(-z));
ds = @(z) s(z).*(1-s(z));
data = csvread("data.txt");
for j = 1 : 100
    for i = 1 : length(data)
        x0 = data(i,2:5)';
        #Find the truth
        if data(i,6) == 1 ;
            t = [1;0] ;
        else
            t = [0;1];
        end
        #Forward propagate
        x1 = s(W1*x0 + b1);
        x2 = s(W2*x1 + b2);
        iter = (j-1)*length(data) + i;
        E((j-1)*length(data) + i) = norm(x2-t)^2;
        E(length(E))
        #Back propagate
        delta2 = (x2-t).*ds(W2*x1+b2);
        delta1 = W2'*delta2.*ds(W1*x0+b1);
        dedw2 = delta2*x1';
        dedw1 = delta1*x0';
        alpha = 0.001*(40000-iter)/40000;
        W2 = W2 - alpha*dedw2;
        W1 = W1 - alpha*dedw1;
        b2 = b2 - alpha*delta2;
        b1 = b1 - alpha*delta1;
    end
end
plot(E)
title('Gradient Descent')
xlabel('Iteration')
ylabel('Error')
When I run this, I converge on weights that give a constant error of 0.5 rather than 0.0. The error plot looks something like this, depending on the initial samples of W1 and W2:
The resulting weights W1 and W2 yield an output of ~[0.5,0.5] for the whole set rather than [1,0] (isStairs = true) or [0,1] (isStairs = false).
Other information:
If I loop over a single data point instead of the entire learning set, it does converge to zero error for that particular case. (like 20 iterations or so), so I assume my derivatives are correct?
For the model to converge the learning rate has to be insanely small. Not sure what this means.
Is this neural network valid to solve the described problem? If so, what does it mean to converge to an error of 0.5?
The NN learns from data. If there is only one example, it will learn this example by heart and you will have zero error. But if you have more examples, they will likely not lie on a nice curve and will be noisy instead. So it is harder for the network to learn the data by heart (this also depends on the number of free parameters the NN has, but you get the idea). However, you don't want the NN to learn everything in detail. You want it to learn the overall trend, not the noise. This also means that your error won't converge to zero, because there is noise that your NN should not learn. So don't worry if you have a (small) error at the end.
But what about the learning rate? Well, imagine you have 10 examples. Eight of them describe a perfect line but two exhibit noise: one slightly to the right (let's say +1) and the other slightly to the left (-1). If the NN estimates one of those points and updates to minimize the error drawn from it, the update will jump from + to - or vice versa. Depending on your learning rate, this jumping may eventually converge to the middle (which is the correct function) or may go on forever. This is essentially what the learning rate does: it determines how much impact an estimation error has on the update/learning of the network. So a good idea is to choose a larger learning rate at the beginning (where the network performs really badly due to its random initialization) and to decrease the rate once it has already learned something. You can achieve the same thing with a small learning rate, but it will take longer ;)
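The jumping described above can be reproduced with a toy problem. This is a sketch under simplifying assumptions, not the network from the question: a single parameter w is fitted by per-sample gradient steps to hypothetical targets that alternate between +1 and -1, so the correct value is the middle, 0. The function name and both learning-rate schedules are chosen purely for illustration.

```python
def sgd_estimate(targets, learning_rate):
    """Fit one constant w to the targets with per-sample gradient steps
    on the squared error 0.5 * (w - t) ** 2."""
    w = 0.0
    for step, t in enumerate(targets, start=1):
        w -= learning_rate(step) * (w - t)
    return w


# Hypothetical noisy data: the targets alternate around the true value 0.
targets = [1.0, -1.0] * 50

# Constant learning rate: every sample pulls w strongly towards itself.
w_constant = sgd_estimate(targets, lambda step: 0.5)

# Decaying learning rate: large steps first, smaller steps later.
w_decaying = sgd_estimate(targets, lambda step: 1.0 / step)
```

With the constant rate of 0.5 the estimate keeps jumping between roughly +1/3 and -1/3 and never settles. With the decaying rate it ends up at the middle value 0.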

XGBoost: Is it possible to predict multiple labels and calculate their MAPE?

As far as I know, XGBoost supports multi-class prediction with objective functions such as softmax.
In my case, I'd like it to output several labels (float numbers) and minimize their MAPE. Is that viable? What should I do to make it happen? (For instance, how do I construct a DMatrix with multiple labels in the first place?)
import numpy
import xgboost as xgb
data = numpy.array([[1,2,3],[3,4,5]])
label = numpy.array([[0.2,0.1], [0.3,0.4]])
dtrain = xgb.DMatrix(data, label=label)
param = {'gamma':2.0,'nthread':8, 'max_depth':15, 'eta':0.000000003, 'silent':1, 'objective':'multi:softprob', 'eval_metric':'auc' ,'num_class':105}
bst = xgb.train(param, dtrain, num_round)
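To make the target metric concrete, here is a small sketch of MAPE (mean absolute percentage error) averaged over every row and every output. It does not answer whether XGBoost can optimize this directly. The labels are the ones from the question; the function name and the predictions are hypothetical.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error over all rows and outputs, as a fraction.

    Assumes that no true value is zero.
    """
    errors = [
        abs((t - p) / t)
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    ]
    return sum(errors) / len(errors)


labels = [[0.2, 0.1], [0.3, 0.4]]            # label array from the question
predictions = [[0.22, 0.11], [0.33, 0.44]]   # hypothetical: each value 10% too high

error = mape(labels, predictions)
```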

Resources