xgboost custom early stopping function - machine-learning

It does not seem that xgboost supports a custom early stopping function. Right?
This page https://xgboost.readthedocs.io/en/latest/python/callbacks.html says xgboost allows a custom metric and early_stopping_rounds to be specified.
What I need is a custom early stopping function where you can specify a threshold for the metric, e.g., stop when the metric is above/below the threshold or has stopped decreasing by more than the threshold. TensorFlow Keras supports such a callback https://www.tensorflow.org/guide/keras/custom_callback.
Thanks.

Found this https://xgboost.readthedocs.io/en/latest/python/examples/callbacks.html#sphx-glr-python-examples-callbacks-py. Testing it now. Thanks.

Related

What is the default behavior of Keras model.compile metrics parameter?

The default value of the Keras model.compile metrics parameter is metrics=None. There are plenty of explanations and information about this parameter's different values, and I believe I pretty much understand their meaning and purpose, but what I struggle to find is the behavior of the default value metrics=None.
The official documentation of the model.compile method here doesn't say anything about the default value (which is None), and googling about it for a while hasn't brought any enlightenment for me either so far.
I'd be grateful for any helpful hint on this!
It means it will not calculate any metric; it is not necessary to specify one. The only parameters that need to be specified are optimizer and loss.
These are the only things necessary to train a model; the metric doesn't go into the calculation for backpropagation.
By not specifying a metric, simply no metric will be calculated. A metric is something extra that is calculated for you to indicate how well the problem is being solved. Usually for classification you would use cross entropy.
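A minimal sketch that demonstrates this: with the default metrics=None, the training history reported by fit contains only the loss.

```python
import numpy as np
import tensorflow as tf

# Compile with the default metrics=None: only optimizer and loss are given,
# so only the loss is computed and reported during training.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

hist = model.fit(np.zeros((8, 4)), np.zeros((8, 1)), epochs=1, verbose=0)
print(list(hist.history.keys()))  # only "loss", no extra metrics
```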

How to look at the parameters of a pytorch model?

I have a simple pytorch neural net that I copied from openai, and I modified it to some extent (mostly the input).
When I run my code, the output of the network remains the same on every episode, as if no training occurs.
I want to see if any training happens, or if some other reason causes the results to be the same.
How can I make sure any movement happens to the weights?
Thanks
It depends on what you are doing, but the easiest would be to check the weights of your model.
You can print them (and compare them with the ones from the previous iteration) using the following code:
for parameter in model.parameters():
    print(parameter.data)
If the weights are changing, the neural network is being optimized (which doesn't necessarily mean it learns anything useful in particular).
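To make the comparison with the previous iteration concrete, here is a sketch that snapshots the parameters before an optimization step and checks afterwards whether anything moved (the model, data, and helper name are illustrative):

```python
import torch

def params_changed(model, before):
    """Return True if any parameter differs from the earlier snapshot."""
    return any(
        not torch.equal(p.detach(), q)
        for p, q in zip(model.parameters(), before)
    )

model = torch.nn.Linear(3, 1)
# Snapshot the weights before training.
before = [p.detach().clone() for p in model.parameters()]

# One optimization step on dummy data.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = (model(torch.randn(8, 3)) - torch.randn(8, 1)).pow(2).mean()
loss.backward()
opt.step()

print(params_changed(model, before))  # True if the weights moved
```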

How to tune maximum entropy's parameter?

I am doing text classification with scikit learn's logistic regression function (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). I am using grid search in order to choose a value for the C parameter. Do I need to do the same for max_iter parameter? why?
Both the C and max_iter parameters have default values in sklearn, which means they need to be tuned. But, from what I understand, early stopping and L1/L2 regularization are two separate methods for avoiding overfitting, and performing one of them is enough. Am I incorrect in assuming that tuning the value of max_iter is equivalent to early stopping?
To summarize, here are my main questions:
1- Does max_iter need tuning? why? (the documentation says it is only useful for certain solvers)
2- Is tuning the max_iter equivalent to early stopping?
3- Should we perform early stopping and L1/L2 regularization at the same time?
Here are some simple and grossly simplified responses to your numbered questions:
Yes, sometimes you need to tune max_iter. Why? See next.
No. max_iter is the number of iterations that the logistic regression classifier's solver is allowed to step through before being stopped. The aim is to reach a "stable" solution for the parameters of the logistic regression model, i.e., it is an optimisation problem. If your max_iter is too low, you may not reach an optimal solution and your model is underfit. If your value is too high, you can essentially wait forever to have a solution for little gain in accuracy. You may also get stuck at local optima if max_iter is too low.
Yes or No.
a. L1/L2 regularisation is essentially "smoothing" of your complex model so that it does not overfit to the training data. If parameters become too large, they are penalised in the cost.
b. Early stopping is when you stop optimising your model (e.g., via gradient descent) at some stage in which you deem acceptable (before max_iter). For example, a metric such as RMSE can be used to define when to stop, or a comparison of the metrics from your test/training data.
c. When to use them? This is dependent on your problem. If you have a simple linear problem, with limited features, you will not need regularisation or early stopping. If you have thousands of features and experience overfitting then apply regularisation as one solution. If you do not want to wait for the optimisation to run to the end when you are playing with parameters as you only care about a certain level of accuracy, you could apply early stopping.
Finally, how do I tune max_iter correctly? This depends on your problem at hand. If you find your classification metric shows your model is performing poorly, it could be that your solver has not taken enough steps to reach a minimum. I'd suggest you do this by hand and look at the cost vs. max_iter to see if it is reaching a minimum properly rather than automate it.
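One practical way to see whether max_iter is too low, as described above, is to watch for scikit-learn's ConvergenceWarning; a sketch (the dataset here is synthetic and illustrative):

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A too-small max_iter stops the solver before it reaches a stable
# solution, and scikit-learn emits a ConvergenceWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    LogisticRegression(max_iter=2, solver="lbfgs").fit(X, y)

warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print(warned)  # True: the solver did not converge in 2 iterations
```

If no such warning appears at the default max_iter, raising it further usually buys nothing, which matches the advice to inspect convergence by hand rather than grid-search it.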

Keras: Difference between Kernel and Activity regularizers

I have noticed that weight_regularizer is no more available in Keras and that, in its place, there are activity and kernel regularizer.
I would like to know:
What are the main differences between kernel and activity regularizers?
Could I use activity_regularizer in place of weight_regularizer?
The activity regularizer works as a function of the output of the net, and is mostly used to regularize hidden units, while weight_regularizer, as the name says, works on the weights (e.g. making them decay). Basically you can express the regularization loss as a function of the output (activity_regularizer) or of the weights (weight_regularizer).
The new kernel_regularizer replaces weight_regularizer - although it's not very clear from the documentation.
From the definition of kernel_regularizer:
kernel_regularizer: Regularizer function applied to
the kernel weights matrix
(see regularizer).
And activity_regularizer:
activity_regularizer: Regularizer function applied to
the output of the layer (its "activation").
(see regularizer).
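A minimal sketch of the two side by side, attaching the same L2 penalty in both places on a single Dense layer (the layer sizes are arbitrary):

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(
    8,
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),    # penalty on the weights matrix
    activity_regularizer=tf.keras.regularizers.l2(1e-4),  # penalty on the layer's output
)

# After a forward pass, layer.losses holds one loss term per regularizer;
# both are added to the total training loss.
_ = layer(tf.ones((2, 4)))
print(len(layer.losses))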
Important Edit: Note that there is a bug in the activity_regularizer that was only fixed in version 2.1.4 of Keras (at least with the TensorFlow backend). In the older versions, the activity regularizer function is applied to the input of the layer, instead of to the output (the actual activations of the layer, as intended). So beware: if you are using an older version of Keras (before 2.1.4), activity regularization probably will not work as intended.
You can see the commit on GitHub
Five months ago François Chollet provided a fix to the activity regularizer, that was then included in Keras 2.1.4
This answer is a bit late, but may be useful for future readers.
So, necessity is the mother of invention, as they say. I only understood it when I needed it.
The above answer doesn't really state the difference, because both of them end up affecting the weights. So what's the difference between penalizing the weights themselves and penalizing the output of the layer?
Here is the answer: I encountered a case where the weights of the net were small and nice, ranging between -0.3 and +0.3.
So I really couldn't penalize them; there was nothing wrong with them. A kernel regularizer was useless. However, the output of the layer was HUGE, in the hundreds.
Keep in mind that the input to the layer was also small, always less than one. But those small values interacted with the weights in such a way that they produced those massive outputs. Here I realized that what I needed was an activity regularizer, rather than a kernel regularizer. With this, I'm penalizing the layer for those large outputs; I don't care whether the weights themselves are small, I just want to deter it from reaching such a state, because it saturates my sigmoid activation and causes tons of other trouble, like vanishing gradients and stagnation.

NuSVR vs SVR in scikit-learn

sklearn provides two SVM-based regression estimators, SVR and NuSVR. The latter claims to be using libsvm. However, other than that, I don't see any description of when to use which.
Does anyone have an idea?
I am trying to do regression on a 3m x 21 matrix using 5-fold cross-validation with SVR, but it is taking forever to finish. I've aborted the job and I'm now considering using NuSVR. But I'm not sure what advantage it provides.
NuSVR - http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVR.html#sklearn.svm.NuSVR
SVR - http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR
They are equivalent but slightly different parametrizations of the same implementation.
Most people use SVR.
You cannot use that many samples with a kernel SVR. You could try SVR(kernel="linear"), but that would probably also be infeasible. I recommend using SGDRegressor. You might need to adjust the learning rate and number of epochs, though.
You can also try RandomForestRegressor which should work just fine.
Check out the GitHub code for NuSVR. It says that it is based on libsvm as well. NuSVR allows you to limit the number of support vectors used, via its nu parameter.
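To make the SGDRegressor suggestion concrete, here is a sketch on synthetic data of the same width as the question's matrix (the dataset and hyperparameters are illustrative); a linear model fit by SGD scales to millions of rows, unlike kernel SVR:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 3m x 21 problem (smaller here for speed).
X, y = make_regression(n_samples=1000, n_features=21, noise=0.1, random_state=0)

# SGD is sensitive to feature scale, so standardize first.
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
reg.fit(X, y)
print(reg.score(X, y))  # R^2 on the training data
```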