Hyperparameter tuning; what parameter space for ML algorithms (rf, adaboost, xgboost)

I'm trying to tune the hyperparameters of several ML algorithms (rf, adaboost and xgboost) to train a model with a multiclass classification variable as target. I'm working with the MLR package in R. However, I'm not sure about the following:
which hyperparameters to tune (and for which hyperparameters to use the default)
what the search space should be for the hyperparameters that are tuned
Do you know any sources where I can find information about this?
For example:
filterParams(getParamSet("classif.randomForest"), tunable = TRUE)
            Type           len   Def    Constr    Req  Tunable  Trafo
ntree       integer        -     500    1 to Inf  -    TRUE     -
mtry        integer        -     -      1 to Inf  -    TRUE     -
replace     logical        -     TRUE   -         -    TRUE     -
classwt     numericvector  <NA>  -      0 to Inf  -    TRUE     -
cutoff      numericvector  <NA>  -      0 to 1    -    TRUE     -
sampsize    integervector  <NA>  -      1 to Inf  -    TRUE     -
nodesize    integer        -     1      1 to Inf  -    TRUE     -
maxnodes    integer        -     -      1 to Inf  -    TRUE     -
importance  logical        -     FALSE  -         -    TRUE     -
localImp    logical        -     FALSE  -         -    TRUE     -
Space: lower, upper, transformation
params_to_tune <- makeParamSet(makeNumericParam("mtry", lower = 0, upper = 1, trafo = function(x) ceiling(x*ncol(train_x))))

In general, you want to tune all the parameters that are marked tunable with value ranges as large as you can afford. In practice, some of these won't make a difference in terms of performance, but you usually don't know that beforehand.
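As a rough illustration of what "a space with lower, upper, and a transformation" means, here is a minimal random-search sampler in plain Python (not mlr). The parameter names mirror the randomForest table in the question, but the ranges, the feature count `N_FEATURES`, and the helper `sample_config` are assumptions made up for this sketch. Note that the mtry transformation is clamped to at least 1, because `ceiling(0 * ncol)` would otherwise give an invalid value of 0.

```python
import math
import random

N_FEATURES = 20  # hypothetical stand-in for ncol(train_x) in the question

# Each entry is (lower, upper, trafo). The ranges are illustrative assumptions,
# not recommended defaults for any particular dataset.
SPACE = {
    # fraction of features, mapped to an integer count of at least 1
    "mtry": (0.0, 1.0, lambda x: max(1, math.ceil(x * N_FEATURES))),
    # sampled on a log2 scale: 2**6 = 64 up to 2**11 = 2048 trees
    "ntree": (6.0, 11.0, lambda x: int(round(2 ** x))),
    "nodesize": (1.0, 10.0, lambda x: int(round(x))),
}

def sample_config(rng):
    """Draw one configuration: sample uniformly, then apply the transformation."""
    return {name: trafo(rng.uniform(lo, hi)) for name, (lo, hi, trafo) in SPACE.items()}

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]
for cfg in configs:
    print(cfg)
```

Each sampled configuration would then be evaluated with cross-validation, and widening a range only costs more evaluations.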


BERT HuggingFace gives NaN Loss

I'm trying to fine-tune BERT for a text classification task, but I'm getting NaN losses and can't figure out why.
First I define a BERT-tokenizer and then tokenize my text:
import numpy as np
import pandas as pd
from tqdm import tqdm
from transformers import DistilBertTokenizer, RobertaTokenizer
distil_bert = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(distil_bert, do_lower_case=True, add_special_tokens=True,
                                                max_length=128, pad_to_max_length=True)
def tokenize(sentences, tokenizer):
    input_ids, input_masks, input_segments = [], [], []
    for sentence in tqdm(sentences):
        inputs = tokenizer.encode_plus(sentence, add_special_tokens=True, max_length=25, pad_to_max_length=True,
                                       return_attention_mask=True, return_token_type_ids=True)
        # collect the encoded fields of each sentence
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])
    return np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')
train = pd.read_csv('train_dataset.csv')
d = train['text']
input_ids, input_masks, input_segments = tokenize(d, tokenizer)
Next, I load my integer labels which are: 0, 1, 2, 3.
d_y = train['label']
0 0
1 1
2 0
3 2
4 0
5 0
6 0
7 0
8 3
9 1
Name: label, dtype: int64
Then I load the pretrained Transformer model and put layers on top of it. I use SparseCategoricalCrossEntropy Loss when compiling the model:
import tensorflow as tf
from transformers import TFDistilBertForSequenceClassification, DistilBertConfig, AutoTokenizer, TFDistilBertModel
distil_bert = 'distilbert-base-uncased'
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.0000001)
config = DistilBertConfig(num_labels=4, dropout=0.2, attention_dropout=0.2)
config.output_hidden_states = False
transformer_model = TFDistilBertModel.from_pretrained(distil_bert, config = config)
input_ids_in = tf.keras.layers.Input(shape=(25,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(25,), name='masked_token', dtype='int32')
embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in)[0]
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
X = tf.keras.layers.GlobalMaxPool1D()(X)
X = tf.keras.layers.Dense(50, activation='relu')(X)
X = tf.keras.layers.Dropout(0.2)(X)
X = tf.keras.layers.Dense(4, activation='softmax')(X)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in], outputs = X)
for layer in model.layers[:3]:
    layer.trainable = False
Finally, I run the model using previously tokenized input_ids and input_masks as inputs to the model and get a NAN Loss after the first epoch:
model.fit(x=[input_ids, input_masks], y = d_y, epochs=3)
Epoch 1/3
20/20 [==============================] - 4s 182ms/step - loss: 0.9714 - sparse_categorical_accuracy: 0.6153
Epoch 2/3
20/20 [==============================] - 0s 19ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
Epoch 3/3
20/20 [==============================] - 0s 20ms/step - loss: nan - sparse_categorical_accuracy: 0.5714
<tensorflow.python.keras.callbacks.History at 0x7fee0e220f60>
EDIT: The model computes a loss in the first epoch but starts returning NaNs
in the second epoch. What could be causing that problem?
Does anyone have any idea what I am doing wrong?
All suggestions are welcome!
The problem is here:
X = tf.keras.layers.Dense(1, activation='softmax')(X)
At the end of the network, you only have a single neuron, corresponding to a single class. Softmax over a single logit always returns 1, so the output probability is always 100% for class 0. If you have classes 0, 1, 2, 3, you need to have 4 outputs at the end.
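To see why a single softmax unit cannot work here, the following plain-Python sketch (a hand-written `softmax` helper, not the Keras layer) evaluates softmax on one logit and on four logits. The logit values are arbitrary numbers chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One output unit: the result is [1.0] no matter what the logit is.
single = [softmax([z]) for z in (-7.3, 0.0, 42.0)]
print(single)

# Four output units: a proper distribution over classes 0..3.
four = softmax([0.2, -1.0, 3.1, 0.5])
print(four, sum(four))
```

With one unit the network carries no information about classes 1, 2, or 3, whereas four units give one probability per class.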
The problem may have occurred because num_labels was not specified.
With a single unit in the final output layer, K = 1 (the number of labels) in the softmax
\sigma(\vec{z})_{i}=\frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}
so when fine-tuning for multi-class classification we need to provide num_labels.
model = TFBertForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5)
I'd also suggest removing NA values from the pandas data frame before using the dataset for training and evaluation.
train = pd.read_csv('train_dataset.csv')
# drop whole rows so that texts and labels stay aligned
train = train.dropna(subset=['text', 'label'])
d = train['text']
I had a similar problem where my model produced NaN losses only during the last batch of an epoch. All the other batches resulted in typical loss values. In my case, the problem was that the batches were not all the same size. After I made all batches equally sized, the NaNs were gone. It might be worth checking whether this is also true in your case.

Sum of two losses in Keras (Perceptual and MSE)

I want to add a perceptual loss to the MSE loss in my objective function. I wrote the code below for this:
def custom_objective(y_true, y_pred):
    tosub = K.constant([103.939, 116.779, 123.68])
    y1 = vgg_model(y_pred * 255. - tosub)
    y2 = vgg_model(y_true * 255. - tosub)
    loss2 = K.mean(K.square(y2 - y1), axis=-1)
    loss1 = K.mean(K.square(y_pred - y_true), axis=-1)
    loss = loss1 + loss2
    return loss
The problem is that the shape of loss1 is something like (BatchSize, 224, 224), but the shape of loss2 is (BatchSize, 7, 7), so I get an error about incompatible shapes, which is correct. How can I add these two properly? Should I unravel them first, and how?
The loss function should always return a scalar (per sample in the batch or over the whole batch), since we want to minimize it (i.e. you can't minimize a vector, unless you define what you mean by "minimizing a vector"). Therefore, one simple way to reduce this to a scalar is to take the average across all the axes, except the batch axis which is averaged over internally:
loss2 = K.mean(K.square(y2 - y1), axis=[1,2,3])
loss1 = K.mean(K.square(y_pred - y_true), axis=[1,2,3])
loss = loss1 + loss2
Update: Let me clarify that it is OK if the loss function returns a vector or even an n-D array (actually the loss function above returns a vector of length batch_size), but keep in mind that at the end Keras takes the average of returned values and that's the real value of loss (which would be minimized).
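The effect of averaging over every non-batch axis can be checked without Keras. The sketch below uses nested Python lists as stand-in tensors. The shapes (2, 4, 4, 3) and (2, 2, 2, 5) are small hypothetical substitutes for the image and VGG feature shapes, and `full`, `flatten`, and `per_sample_mse` are helper names invented for this example.

```python
def full(shape, value):
    """Build a nested list of the given shape filled with value."""
    if not shape:
        return value
    return [full(shape[1:], value) for _ in range(shape[0])]

def flatten(x):
    """Flatten an arbitrarily nested list into a flat list of numbers."""
    if isinstance(x, list):
        return [v for item in x for v in flatten(item)]
    return [x]

def per_sample_mse(y_true, y_pred):
    """Squared error averaged over all non-batch axes: one value per sample."""
    losses = []
    for t, p in zip(y_true, y_pred):
        ft, fp = flatten(t), flatten(p)
        losses.append(sum((a - b) ** 2 for a, b in zip(ft, fp)) / len(ft))
    return losses

# Pixel-space tensors and feature-space tensors with different spatial shapes.
loss1 = per_sample_mse(full((2, 4, 4, 3), 1.0), full((2, 4, 4, 3), 0.5))
loss2 = per_sample_mse(full((2, 2, 2, 5), 2.0), full((2, 2, 2, 5), 1.0))

# Both are now vectors of length batch_size, so they can be added directly.
loss = [a + b for a, b in zip(loss1, loss2)]
print(loss)
```

Because both terms collapse to one number per sample, the mismatch between the original spatial shapes no longer matters.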

Strange Loss function behaviour when training CNN

I'm trying to train my network on MNIST using a self-made CNN (C++).
It gives good enough results when I use a simple model, like:
Convolution (2 feature maps, 5x5) (Tanh) -> MaxPool (2x2) -> Flatten -> Fully-Connected (64) (Tanh) -> Fully-Connected (10) (Sigmoid).
After 4 epochs, it behaves as shown in plot [1].
After 16 epochs, it gives ~6.5% error on a test dataset.
But with 4 feature maps in the Conv layer, the MSE value isn't improving, and sometimes it even increases 2.5 times, as in plot [2].
The online training mode is used, with the help of the Adam optimizer (alpha: 0.01, beta_1: 0.9, beta_2: 0.999, epsilon: 1.0e-8). The update is calculated as:
double AdamOptimizer::calc(int t, double& m_t, double& v_t, double g_t)
{
    // update the biased first and second moment estimates
    m_t = this->beta_1 * m_t + (1.0 - this->beta_1) * g_t;
    v_t = this->beta_2 * v_t + (1.0 - this->beta_2) * (g_t * g_t);
    // bias correction (t is zero-based here, hence t + 1)
    double m_t_aver = m_t / (1.0 - std::pow(this->beta_1, t + 1));
    double v_t_aver = v_t / (1.0 - std::pow(this->beta_2, t + 1));
    return -(this->alpha * m_t_aver) / (std::sqrt(v_t_aver) + this->epsilon);
}
So, can this problem be caused by the lack of some additional learning techniques (dropout, batch normalization), or by wrongly set parameters? Or is it caused by some implementation issue?
P. S. I provide a github link if necessary.
Try to decrease the learning rate.
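One way to see why the learning rate matters so much with Adam: on the very first update, the bias-corrected step has magnitude close to alpha regardless of how large or small the gradient is. The sketch below is a Python transcription of the update formula from the question, written for this illustration (the function name `adam_step` and the test gradients are assumptions). With alpha = 0.01 and online training, every weight can move by roughly 0.01 per sample.

```python
import math

def adam_step(t, m, v, g, alpha=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    """One Adam update for a single weight; t is the zero-based step index."""
    m = beta_1 * m + (1.0 - beta_1) * g
    v = beta_2 * v + (1.0 - beta_2) * g * g
    m_hat = m / (1.0 - beta_1 ** (t + 1))  # bias-corrected first moment
    v_hat = v / (1.0 - beta_2 ** (t + 1))  # bias-corrected second moment
    delta = -alpha * m_hat / (math.sqrt(v_hat) + epsilon)
    return delta, m, v

# First step (t = 0, zero-initialized moments) for very different gradient sizes.
first_steps = [adam_step(0, 0.0, 0.0, g)[0] for g in (1e-4, 1.0, 1e4)]
print(first_steps)
```

All three steps come out at about -alpha, so lowering alpha (for example to 0.001) directly shrinks the per-sample weight change.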

Is there an easy way to implement a Optimizer.Maximize() function in TensorFlow

There are several experiments that rely on gradient ascent rather than gradient descent. I have looked into some approaches that use "cost" and the minimize function to simulate a "maximize" function, but I am still not certain how to properly implement a maximize() function. Also, in most of these cases, I would say they are closer to unsupervised learning. So given this code concept for a cost function:
cost = (Yexpected - Ycalculated)^2
train_step = tf.train.AdamOptimizer(0.5).minimize(cost)
I would like to write something where I am following the positive gradient and there may not be a Yexpected value:
maxMe = Function(Ycalculated)
train_step = tf.train.AdamOptimizer(0.5).maximize(maxMe)
A good example of this need is "http://cs229.stanford.edu/proj2009/LvDuZhai.pdf" with Recurrent Reinforcement Learning.
I have read a few papers and references that state changing the sign will flip the direction of movement to increasing gradient, but given TensorFlow's internal calculation of the gradient, I am not sure if this will work to Maximize as I don't know of a way to validate the results:
maxMe = Function(Ycalculated)
train_step = tf.train.AdamOptimizer(0.5).minimize( -1 * maxMe )
The intuition is simple: minimize() keeps pushing the given value down. If you start at 5, then over successive iterations (depending on the learning rate) the value becomes, say, 4, then 3, 2, 1, 0, and lower still if possible. If you instead pass -5 (which is really +5 with the sign flipped explicitly), the optimizer changes the parameters to push that number down further: -5, -6, -7, -8, and so on. Because the sign was flipped, the underlying function is actually increasing. In other words, in the latter case the gradient step changes the parameters of the neural network in a way that maximizes the function rather than minimizing it.
Toy example with arbitrary numbers:
The input x = 1.5, The weight parameter at time (t) w_t = 0.1,
The observed response y = 3.0, The learning rate lr = 0.1.
x * w = 0.15 (this is y predicted for the current w)
loss function = (3.0 - 0.15)^2 = 8.1
Applying gradient descent:
w_(t+1) = w_t - lr * (derivative of loss function with respect to w)
w_(t+1) = 0.1 - (0.1 * [1.5 * 2(0.15 - 3.0)]) = 0.1 - (-0.855) = 0.955
If we use the new w_(t+1) we will have:
1.5 * 0.955 = 1.43 (which is closer to the correct answer 3.0)
and the new loss is: (3.0 - 1.43)^2 = 2.46 (smaller error).
If we keep iterating, we will adjust w to a value that gives us the minimum cost possible.
Now let's repeat the same experiment but with the sign flipped to negative:
loss function = - (3.0 - 0.15)^2 = -8.1
Applying gradient descent:
w_(t+1) = w_t - lr * (derivative of loss function with respect to w)
w_(t+1) = 0.1 - (0.1 * [1.5 * -2(0.15 - 3.0)]) = 0.1 - 0.855 = −0.755
If we apply the new w_(t+1) we will have:
1.5 * −0.755 = −1.1325 and the new squared error is: (3.0 - (-1.1325))^2 = 17.08
(the original function is being maximized!).
That is also applicable to any differentiable function, but this is just a simple naive example to demonstrate the idea.
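The toy example can be re-run in a few lines of Python to check the direction of each update. This is a sketch of plain gradient descent on the same numbers (x = 1.5, y = 3.0, lr = 0.1, starting weight 0.1). The `sign` argument and the helper names are introduced only for this illustration.

```python
def grad_step(w, x, y, lr, sign=1.0):
    """One gradient-descent step on the objective sign * (y - x*w)**2."""
    grad = sign * 2.0 * x * (x * w - y)
    return w - lr * grad

def sq_err(w, x, y):
    """The original (unsigned) squared error."""
    return (y - x * w) ** 2

x, y, lr, w0 = 1.5, 3.0, 0.1, 0.1

w_min = grad_step(w0, x, y, lr, sign=1.0)   # minimize the squared error
w_max = grad_step(w0, x, y, lr, sign=-1.0)  # minimize its negative

print("start   ", w0, sq_err(w0, x, y))
print("minimize", w_min, sq_err(w_min, x, y))
print("maximize", w_max, sq_err(w_max, x, y))
```

Minimizing the negated objective moves the weight in the opposite direction and raises the squared error, which is exactly the gradient-ascent behaviour being asked for.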
So, you can do, as you suggested already:
optimizer.minimize( -1 * value)
Or if you like, create a wrapper function (which in fact is needless, but just to mention it):
def maximize(optimizer, value, **kwargs):
    return optimizer.minimize(-value, **kwargs)

How to calculate MAPE for Training/Test set in application of Neural Network in MATLAB efficiently?

I've been using MATLAB on a time series dataset (an electricity dataset) as part of my course. It consists of 40,000+ samples. After building the neural network, I wanted to test its accuracy. I'm mostly interested in the MAPE (mean absolute percentage error) and RMS (root mean square) errors. To calculate them, I've used the following lines of code.
mape_res = zeros(N_TRAIN, 1);   % column vector; zeros(N_TRAIN) would allocate an N_TRAIN-by-N_TRAIN matrix
mse_res = zeros(N_TRAIN, 1);
for i_train = 1:N_TRAIN
    Inp = inputs_consumption(i_train);
    Actual_Output = targets_consumption(i_train + 1);
    Observed_Output = sim(ann, Inp);
    mape_res(i_train) = abs(Observed_Output - Actual_Output) / Actual_Output;
    mse_res(i_train) = Observed_Output - Actual_Output;
end
mape = sum(mape_res) / N_TRAIN;
mse = sum(power(mse_res, 2)) / N_TRAIN;
sprintf('The MSE on training is %g', mse)
sprintf('The MAPE on training is %g', mape)
The problem with the code above is that, for a large dataset (40K samples), it takes almost 15 minutes to iterate through the loop, which is a long wait just to get the error rates. Is there a more efficient way to calculate them?
You could always do a rolling average that gets updated each iteration, as follows:
mape_res = abs(Observed_Output - Actual_Output) / Actual_Output;
mse_res = Observed_Output - Actual_Output;
alpha = 1 / i_train;
mape = mape * (1 - alpha) + mape_res * alpha;
mse = mse * (1 - alpha) + power(mse_res,2) * alpha;
Then you could either display the resulting values each iteration, use them for stopping criteria if the desired error rate is reached, or both. This also has the added benefit of not requiring the initialization and population of the mape_res and mse_res vectors unless they happen to be needed elsewhere...
Edit: Do make sure to initialize the mape and mse values to zero prior to entering the for loop :)
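That the rolling update with alpha = 1 / i reproduces the ordinary mean can be confirmed with a short Python sketch (Python rather than MATLAB, purely for illustration). The error values and the `running_mean` helper are hypothetical.

```python
def running_mean(values):
    """Incrementally updated mean using alpha = 1 / i at step i."""
    mean = 0.0  # must start at zero before the loop
    for i, v in enumerate(values, start=1):
        alpha = 1.0 / i
        mean = mean * (1.0 - alpha) + v * alpha
    return mean

# Hypothetical per-sample absolute percentage errors.
errors = [0.12, 0.05, 0.30, 0.08, 0.21]

batch = sum(errors) / len(errors)   # mean computed after the loop
running = running_mean(errors)      # mean maintained inside the loop
print(batch, running)
```

The two values agree up to floating-point rounding, so no per-sample storage is needed when only the final averages are wanted.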
