Avoiding data leakage when using BaggingClassifier (Regressor) with feature scaling (StandardScaler)

I am running bagging with LogisticRegression. Since the latter uses regularization, features must be scaled. Since bagging takes a sample (with replacement) from the original training data set, scaling should take place after that. Scaling the original data set and then taking a sample amounts to data leakage. This is similar to how scaling is (mis)used with CV: it is wrong to scale the whole data set and then feed it to CV.
It appears that there are no built-in tools to avoid data leakage with bagging (see the code below), but I may be wrong. Any help will be appreciated.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# c_values, skf, all_model_features and y_train are defined elsewhere
single_log_reg = LogisticRegression(solver="liblinear", random_state = np.random.RandomState(18))
bagged_logistic = BaggingClassifier(single_log_reg, n_estimators = 100, random_state = np.random.RandomState(42))
logit_bagged_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler(with_mean = False)),
    ('bagged_logit', bagged_logistic)
])
logit_bagged_grid = {'bagged_logit__base_estimator__C': c_values,
                     'bagged_logit__max_features': [100, 200, 400, 600, 800, 1000]}
logit_bagged_searcher = GridSearchCV(estimator = logit_bagged_pipeline, param_grid = logit_bagged_grid, cv = skf,
                                     scoring = "roc_auc", n_jobs = 6, verbose = 4)
logit_bagged_searcher.fit(all_model_features, y_train)

The leakage you mention is really only a major concern if you intend to use out-of-bag performance estimates. Otherwise, each of your models gets a little information from the scaling as to how its bag compares to the rest of the data, which might lead to slight overfitting, but your test scores will be fine.
But, it is relatively straightforward to do this in sklearn. You just need to tie the scaler to the logistic regression inside the bagging:
single_log_reg = LogisticRegression(solver="liblinear", random_state = 18)
logit_scaled_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler(with_mean = False)),
    ('logit', single_log_reg),
])
bagged_logsc = BaggingClassifier(logit_scaled_pipeline, n_estimators = 100, random_state = 42)
# the grid keys are relative to the estimator passed to GridSearchCV, which is
# now the BaggingClassifier itself, so no 'bagged_logsc__' prefix is needed
logit_bagged_grid = {
    'base_estimator__logit__C': c_values,
    'max_features': [100, 200, 400, 600, 800, 1000],
}
logit_bagged_searcher = GridSearchCV(estimator = bagged_logsc, param_grid = logit_bagged_grid, cv = skf,
                                     scoring = "roc_auc", n_jobs = 6, verbose = 4)
logit_bagged_searcher.fit(all_model_features, y_train)
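If you are ever unsure which keys the grid expects, the estimator itself can list its tunable parameter names; a quick check using the bagged_logsc object above:

# keys are relative to the estimator given to GridSearchCV, so this list should
# contain entries such as 'base_estimator__logit__C' and 'max_features'
print(sorted(bagged_logsc.get_params().keys()))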
On random states, see https://stackoverflow.com/a/69756672/10495893.

Related

Using the LSTM layer in an encoder in PyTorch

I want to build an autoencoder with LSTM layers. But, at the first step of the encoder, I got an error. Could you please help me with that?
Here is the model which I tried to build:
import numpy as np
import torch
import torch.nn as nn

r_input = torch.nn.LSTM(1, 1, 28)
activation = nn.functional.relu
mu_r = nn.Linear(22, 6)
log_var_r = nn.Linear(22, 6)
y = np.random.rand(1, 1, 28)

def encode_r(y):
    y = torch.reshape(y, (-1, 1, 28))  # torch.Size([batch_size, 1, 28])
    hidden = torch.flatten(activation(r_input(y)), start_dim = 1)
    z_mu = mu_r(hidden)
    z_log_var = log_var_r(hidden)
    return z_mu, z_log_var
But I got this error in my code:
RuntimeError: input.size(-1) must be equal to input_size. Expected 1, got 28.
You're not creating the layer in the correct way.
torch.nn.LSTM takes input_size as its first argument and hidden_size as its second (the third positional argument is num_layers, not the sequence length), but your tensor has 28 features in its last dimension. It seems that you want the encoder to output a tensor with a dimension of 22, to match nn.Linear(22, 6). You're also passing the batch as the first dimension, so you need to include batch_first=True as an argument.
r_input = torch.nn.LSTM(28, 22, batch_first=True)
This should work for your specific setup. You should also note that an LSTM returns two outputs, the output sequence and a tuple of the final hidden and cell states; the first one is the one you want to use here.
hidden = torch.flatten(activation(r_input(y)[0]), start_dim=1)
Please read the declaration in the official documentation for more information.
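Putting the pieces together, a minimal sketch of the corrected encoder (the shapes follow the question's setup; the dummy input y is only for illustration):

import numpy as np
import torch
import torch.nn as nn

# input_size=28 features per step, hidden_size=22 to match nn.Linear(22, 6);
# batch_first=True because y is shaped (batch, seq_len, features)
r_input = nn.LSTM(28, 22, batch_first=True)
activation = nn.functional.relu
mu_r = nn.Linear(22, 6)
log_var_r = nn.Linear(22, 6)

def encode_r(y):
    y = torch.reshape(y, (-1, 1, 28))                      # (batch_size, 1, 28)
    out, _ = r_input(y)                                    # out: (batch_size, 1, 22)
    hidden = torch.flatten(activation(out), start_dim=1)   # (batch_size, 22)
    return mu_r(hidden), log_var_r(hidden)

y = torch.from_numpy(np.random.rand(1, 1, 28)).float()
z_mu, z_log_var = encode_r(y)   # both have shape (1, 6)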

K-means initialization with farthest-first traversal and k-means++

I am confused about k-means++ initialization. I understand that k-means++ chooses the farthest data point as the next center. But what about outliers? What is the difference between initialization with farthest-first traversal and k-means++?
I saw someone explain it this way:
Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1 - x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).
Suppose c2 = 4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
What if the array or list is [0, 1, 2, 4, 5, 6, 100]? Obviously, 100 is the outlier in this case, and it will be chosen as a center at some point. Can someone give a better explanation?
k-means++ chooses points with probability (proportional to the squared distance to the nearest already-chosen center), whereas farthest-first traversal always picks the farthest point deterministically.
But yes, with extreme outliers it is likely to choose the outlier.
That is fine, because so will k-means: most likely the best SSQ solution has a one-element cluster containing only that point.
If you have such data, the k-means solutions tend to be rather useless, and you should probably choose another algorithm such as DBSCAN instead.
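To make the probabilities concrete, here is a small sketch (a hypothetical helper, not from the thread) computing the k-means++ selection probabilities for the asker's list:

import numpy as np

def kmeanspp_probs(points, centers):
    # k-means++ picks the next center with probability proportional to the
    # squared distance from each point to its nearest already-chosen center
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    d2 = np.min((points[:, None] - centers[None, :]) ** 2, axis=1)
    return d2 / d2.sum()

probs = kmeanspp_probs([0, 1, 2, 4, 5, 6, 100], centers=[0])
print(probs[-1])  # ~0.992: the outlier 100 gets nearly all the probability mass,
                  # while farthest-first traversal would pick it with certainty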

How to split a model trained in keras?

I trained a model with 4 hidden layers and 2 dense layers, and I have saved that model.
Now I want to load that model and split it into two models: one with the hidden layers and another with only the dense layers.
I have split off the hidden-layer part of the model in the following way:
model = load_model("model.hdf5")
HL_model = Model(inputs=model.input, outputs=model.layers[7].output)
Here model is the loaded model, in which the 7th layer is my last hidden layer. I tried to split off the dense part like this:
DL_model = Model(inputs=model.layers[8].input, outputs=model.layers[-1].output)
and I am getting this error:
TypeError: Input layers to a `Model` must be `InputLayer` objects.
After splitting, the output of the HL_model will be the input for the DL_model.
Can anyone help me create a model with the dense layers?
PS:
I have also tried the code below:
from keras.layers import Input
inputs = Input(shape=(9, 9, 32), tensor=model_1.layers[8].input)
model_3 = Model(inputs=inputs, outputs=model_1.layers[-1].output)
And I get this error:
RuntimeError: Graph disconnected: cannot obtain value for tensor Tensor("conv2d_1_input:0", shape=(?, 144, 144, 3), dtype=float32) at layer "conv2d_1_input". The following previous layers were accessed without issue: []
Here (144, 144, 3) is the input image size of the model.
You need to specify a new Input layer first, then stack the remaining layers over it:
DL_input = Input(model.layers[8].input_shape[1:])
DL_model = DL_input
for layer in model.layers[8:]:
    DL_model = layer(DL_model)
DL_model = Model(inputs=DL_input, outputs=DL_model)
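As a quick sanity check, the two halves chained together should reproduce the full model. A sketch, assuming layer 8 consumes layer 7's output directly; the (144, 144, 3) image shape is taken from the question:

import numpy as np
x = np.random.rand(1, 144, 144, 3).astype("float32")   # hypothetical dummy batch
full_pred = model.predict(x)
split_pred = DL_model.predict(HL_model.predict(x))
print(np.allclose(full_pred, split_pred))               # expected: True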
More generically, you can use the following function to split a model:
from keras.layers import Input
from keras.models import Model

def get_bottom_top_model(model, layer_name):
    layer = model.get_layer(layer_name)
    bottom_input = Input(model.input_shape[1:])
    bottom_output = bottom_input
    top_input = Input(layer.output_shape[1:])
    top_output = top_input
    bottom = True
    for layer in model.layers:
        if bottom:
            bottom_output = layer(bottom_output)
        else:
            top_output = layer(top_output)
        if layer.name == layer_name:
            bottom = False
    bottom_model = Model(bottom_input, bottom_output)
    top_model = Model(top_input, top_output)
    return bottom_model, top_model

bottom_model, top_model = get_bottom_top_model(model, "dense_1")
Here layer_name is just the name of the layer that you want to split at.

How to use masking layer to mask input/output in LSTM autoencoders?

I am trying to use an LSTM autoencoder to do sequence-to-sequence learning with variable-length sequences as inputs, using the following code:
from keras.layers import Input, Masking, LSTM, RepeatVector
from keras.models import Model

inputs = Input(shape=(None, input_dim))
masked_input = Masking(mask_value=0.0, input_shape=(None, input_dim))(inputs)
encoded = LSTM(latent_dim)(masked_input)
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
where the inputs are raw sequence data padded with 0s to the same length (timesteps). Using the code above, the output is also of length timesteps, but when we calculate the loss function we only want the first N_i elements of the output (where N_i is the length of input sequence i, which may be different for different sequences). Does anyone know a good way to do that?
Thanks!
Option 1: you can always train without padding if you accept training in separate batches.
See this answer for a simple way of separating batches of equal length: Keras misinterprets training data shape
In this case, all you have to do is perform the "repeat" operation in another manner, since you don't have the exact length at training time.
So, instead of RepeatVector, you can use this:
import keras.backend as K
from keras.layers import Lambda

def repeatFunction(x):
    # x[0] is encoded: (batch, latent_dim)
    # x[1] is inputs:  (batch, length, features)
    latent = K.expand_dims(x[0], axis=1)         # shape (batch, 1, latent_dim)
    inpShapeMaker = K.ones_like(x[1][:, :, :1])  # shape (batch, length, 1)
    return latent * inpShapeMaker

# instead of RepeatVector:
Lambda(repeatFunction, output_shape=(None, latent_dim))([encoded, inputs])
Option 2 (doesn't smell good): use another masking after RepeatVector.
I tried this, and it works, but we don't get 0s at the end; we get the last value repeated until the end. So you will have to apply a weird padding to your target data, repeating the last step until the end.
Example: the target [[[1,2],[5,7]]] will have to become [[[1,2],[5,7],[5,7],[5,7]...]]
This may unbalance your data a lot, I think.
def makePadding(x):
    # x[0] is encoded already repeated
    # x[1] is inputs
    # padding = 1 for actual data in inputs, 0 for 0
    padding = K.cast(K.not_equal(x[1][:, :, :1], 0), dtype=K.floatx())
    # assuming you don't have 0 for non-padded data

    # padding repeated for latent_dim
    padding = K.repeat_elements(padding, rep=latent_dim, axis=-1)
    return x[0] * padding

inputs = Input(shape=(timesteps, input_dim))
masked_input = Masking(mask_value=0.0)(inputs)
encoded = LSTM(latent_dim)(masked_input)
decoded = RepeatVector(timesteps)(encoded)
decoded = Lambda(makePadding, output_shape=(timesteps, latent_dim))([decoded, inputs])
decoded = Masking(mask_value=0.0)(decoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
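For illustration, the "repeat the last step" target padding described above could be built like this (a hypothetical numpy helper, assuming every target in a batch shares the same un-padded length):

import numpy as np

def pad_target_repeat_last(target, timesteps):
    # pad each target sequence to `timesteps` by repeating its last step
    target = np.asarray(target, dtype=float)   # (batch, length, features)
    tail = np.repeat(target[:, -1:, :], timesteps - target.shape[1], axis=1)
    return np.concatenate([target, tail], axis=1)

padded = pad_target_repeat_last([[[1, 2], [5, 7]]], timesteps=4)
# gives [[[1, 2], [5, 7], [5, 7], [5, 7]]], matching the example above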
Option 3 (best): crop the outputs directly using the inputs; this also eliminates the gradients for the padded steps.
def cropOutputs(x):
    # x[0] is decoded at the end
    # x[1] is inputs
    # both have the same shape
    # padding = 1 for actual data in inputs, 0 for 0
    padding = K.cast(K.not_equal(x[1], 0), dtype=K.floatx())
    # if you have zeros for non-padded data, they will lose their backpropagation
    return x[0] * padding

....
....
decoded = LSTM(input_dim, return_sequences=True)(decoded)
decoded = Lambda(cropOutputs, output_shape=(timesteps, input_dim))([decoded, inputs])
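Putting Option 3 together, a minimal end-to-end sketch (assuming input_dim, latent_dim and timesteps are defined as in the question, cropOutputs is defined as above, and the compile settings are just placeholders):

from keras.layers import Input, Masking, LSTM, RepeatVector, Lambda
from keras.models import Model

inputs = Input(shape=(timesteps, input_dim))
masked_input = Masking(mask_value=0.0)(inputs)
encoded = LSTM(latent_dim)(masked_input)
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
# zero out the decoder outputs wherever the inputs were padding
decoded = Lambda(cropOutputs, output_shape=(timesteps, input_dim))([decoded, inputs])

sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
sequence_autoencoder.compile(optimizer="adam", loss="mse")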
For this LSTM autoencoder architecture, which I assume you understand, the mask is lost at the RepeatVector because the LSTM encoder layer has return_sequences=False.
So another option, instead of cropping as above, is to create a custom bottleneck layer that propagates the mask.

How to estimate? "simple" Nonlinear Regression + Parameter Constraints + AR residuals

I am new to this site, so please bear with me. I want to estimate the nonlinear model shown in the link https://i.stack.imgur.com/cNpWt.png subject to the constraints a > 0, b > 0 and gamma1 in [0, 1].
In the nonlinear model [1] the dependent variable is X(t), the regressors are R(t) and F(t), and ξ(t) is the error term.
An example of the dataset (68 rows of time series) is shown here: https://i.stack.imgur.com/2Vf0j.png
To estimate the nonlinear regression I use the nls() function with no problem, as shown below:
NLM1 = nls(Xt ~ (a*Rt - b*Ft)/(1 - gamma1*Rt), start = list(a = 10, b = 10, gamma1 = 0.5),
           algorithm = "port", lower = c(0, 0, 0), upper = c(Inf, Inf, 1), data = temp2)
I want to estimate NLM1 while also allowing for an AR(1) process in the residuals.
Basically, I want the same step as going from lm() to gls(). My problem is that in the gnls() function I don't know how to impose constraints on the model parameters a, b and gamma1, and the model estimates wrong values for them.
nls() has the option for lower and upper bounds; I can't do the same in gnls().
In gnls() I need to add the constraints, something like lower = c(0, 0, 0), upper = c(Inf, Inf, 1) in nls():
NLM1_AR1 = gnls(model = Xt ~ (a*Rt - b*Ft)/(1 - gamma1*Rt), data = temp2,
                start = list(a = 13, b = 10, gamma1 = 0.5), correlation = corARMA(p = 1))
Does anyone know how to do this?
Thank you
