What does having multiple metrics do when training a model? - machine-learning

I'm trying to decide on metrics to use when fitting my model. I'm curious as to what impact multiple metrics will have?
metrics={'output_1': 'metric_1', 'output_2': ['metric_2', 'metric_3']})
vs
metrics={'output_1': ['metric_1', 'metric_2'], 'output_2': ['metric_3', 'metric_2']})
Does the order of the metrics matter?

The metrics will be shown in the logs and on plot to give you an indication of how good your model performs at any particular stage of the training phase. They are not used as optimization functions.
So they won't have any impact on your model (or performance of your model).

Related

How to properly score a DNN model during hyperparameter tuning?

I'm using a grid search to tune the hyperparameters of my DNN, which has 2 depth layers. I'm currently scoring each model based on the average loss in the test set, but I'm not sure if this is the best approach. Would it be better to use the accuracy, or both the loss and accuracy, as a scoring metric? How do other people typically score their models during hyperparameter tuning? Any advice or insights would be greatly appreciated.
The first thing in your experimental setup is using the test set while making hyperparameter tunning. You should train your model with your train set and make your hyperparameter tunning with your validation set. After finishing this process, you need to use test set to get the model score is the best option to the way of using/splitting the dataset correctly.
The second part of your question is very open-ended, but you may benefit from the following tips:
Different metrics may be suitable for different tasks, so it is important to choose the right metric. For instance, in some classification tasks you would like to track accuracy, and some of them recall or precision etc. (or you can use and track multiple metrics to understand your model behavior more deeper)
The recent advancement on this topic is generally referred to as AutoML and there are many different applications/libraries/methodologies that are used for hyperparameter tuning. So you may also want to search other methods rather than just using GridSeach. If you want to continue with GridSearch, to find the optimal parameters for your problem, you can switch the GridSearchCV so you can test your model more than once with a different part of the dataset which makes your hyperparameter tunning operation more robust.

Can normalization decrease model performance of ensemble methods?

Normalization e.g. z-scoring is a common preprocessing method in Machine Learning.
I am analyzing a dataset and use ensemble methods like Random Forests or the XGBOOST framework.
Now I compare models using
non normalized features
z-scored features
Using crossvalidation I observe in both cases that with higher max_depth parameter the training error decreases.
For the 1. case the test error also decreases and saturates at a certain MAE:
For the z-scored features however the test error is non decreasing at all.
In this question: https://datascience.stackexchange.com/questions/16225/would-you-recommend-feature-normalization-when-using-boosting-trees it was discussed that normalization is not necessary for tree based methods. But the example above shows that it has a severe effect.
So I have two questions regarding this:
Does it imply that overfitting with ensemble based methods is possible even when the test error decreases?
Should normalization like z-scoring always be common practice when working with ensemble methods?
Is it possible that normalization methods decrease the model performance?
Thanks!
It is not easy to see what is going on in the absence of any code or data.
Normalisation may or may not be helpful depending on the particular data and how the normalisation step is applied.
Tree based methods ought to be robust enough to handle the raw data.
In your cross validations is your code doing the normalisation separately for each fold?
Doing a single normalisation prior to cv may lead to significant leakage.
With very high values of depth you will have a much more complex model that will fit the training data well but will fail to generalise to new data.
I tend to prefer max depths from 2 to 5.
If I can't get a reasonable model I turn my efforts to feature engineering rather than trying to tweak the hyperparameters too much.

Choosing right metrics for regression model

I have always been using r2 score metrics. I know there are several evaluation metrics out there i have read several articles about it. Since i'm still a beginner in machine learning. I'm still very confused of
When to use each of it, is depending on our case, if yes please give me example
I read this article and it said, r2 score is not straightforward, we need other stuff to measure the performance of our model. Does it mean we need more than 1 evaluation metrics in order to get better insight of our model performance?
Is it recommended if we only measure our model performance by just one evaluation metrics?
From this article it said knowing the distribution of our data and our business goal helps us to understand choose appropriate metrics. What does it mean by that?
How to know for each metrics that the model is 'good' enough?
There are different evaluation metrics for regression problems like below.
Mean Squared Error(MSE)
Root-Mean-Squared-Error(RMSE)
Mean-Absolute-Error(MAE)
R² or Coefficient of Determination
Mean Square Percentage Error (MSPE)
so on so forth..
As you mentioned you need to use them based on your problem type, what you want to measure and the distribution of your data.
To do this, you need to understand how these metrics evaluate the model. You can check the definitions and pros/cons of evaluation metrics from this nice blog post.
R² shows what variation of your purpose variable is described by independent variables. A good model can give R² score close to 1.0 but it does not mean it should be. Models which have low R² can also give low MSE score. So to ensure your predictive power of your model it is better to use MSE, RMSE or other metrics besides the R².
No. You can use multiple evaluation metrics. The important thing is if you compare two models, you need to use same test dataset and the same evaluation metrics.
For example, if you want to penalize your bad predictions too much, you can use MSE evaluation metric because it basically measures the average squared error of our predictions or if your data have too much outlier MSE give too much penalty to this examples.
The good model definition changes based on your problem complexity. For example if you train a model which predicts that heads or tails and gives %49 accuracy it is not good enough because the baseline of this problem is %50. But for any other problem, %49 accuracy may enough for your problem. So in a summary, it depends on your problem and you need to define or think that human(baseline) threshold.

(cross) validation of CNN models - when to bring in test data?

For a research project, I am working on using a CNN for defect detection on images of weld beads, using a dataset containing about 500 images. To do so, I am currently testing different models (e.g. ResNet18-50), as well as data augmentation and transfer learning techniques.
After having experimented for a while, I am wondering which way of training/validation/testing is best suited to provide an accurate measurement of performance while at the same time keeping the computational expensiveness at a reasonable level, considering I want to test as many models etc.as possible.
I guess it would be reasonable performing some sort of cross validation (cv), especially since the dataset is so small. However, after conducting some research, I could not find a clear best way to apply this and it seems to me that different people follow different strategies an I am a little confused.
Could somebody maybe clarify:
Do you use the validation set only to find the best performing epoch/weights while training and then directly test every model on the test set (and repeat this k times as a k-fold-cv)?
Or do you find the best performing model by comparing the average validation set accuracies of all models (with k-runs per model) and then check its accuracy on the testset? If so, which exact weigths do you load for testing or would one perform another cv for the best model to determine the final test accuracy?
Would it be an option to perform multiple consecutive training-validation-test runs for each model, while before each run shuffling the dataset and splitting it in “new” training-, validation- & testsets to determine an average test accuracy (like monte-carlo-cv, but maybe with less amount of runs)?
Thanks a lot!
In your case I would remove 100 images for your test set. You should make sure that they resemble what you expect your final model to be able to handle. E.g. it should be objects which are not in the training or validation set, because you probably also want it to generalize to new weld beads. Then you can do something like 5 fold cross validation, where you randomly or intelligently sample 100 images as your validation set. Then you train your model on the remaining 300 and use the remaining 100 for validation purposes. Then you can use a confidence interval on the performance of the model. This confidence interval is what you use to tune your hyperparameters.
The test set is then useful to predict the performance on novel data, but you should !!!NEVER!!! use it to tune hyperparmeter.

Is it a good practice to use your full data set for predictions?

I know you're supposed to separate your training data from your testing data, but when you make predictions with your model is it OK to use the entire data set?
I assume separating your training and testing data is valuable for assessing the accuracy and prediction strength of different models, but once you've chosen a model I can't think of any downsides to using the full data set for predictions.
You can use full data for prediction but better retain indexes of train and test data. Here are pros and cons of it:
Pro:
If you retain index of rows belonging to train and test data then you just need to predict once (and so time saving) to get all results. You can calculate performance indicators (R2/MAE/AUC/F1/precision/recall etc.) for train and test data separately after subsetting actual and predicted value using train and test set indexes.
Cons:
If you calculate performance indicator for entire data set (not clearly differentiating train and test using indexes) then you will have overly optimistic estimates. This happens because (having trained on train data) model gives good results of train data. Which depending of % split of train and test, will gives illusionary good performance indicator values.
Processing large test data at once may create memory bulge which is can result in crash in all-objects-in-memory languages like R.
In general, you're right - when you've finished selecting your model and tuning the parameters, you should use all of your data to actually build the model (exception below).
The reason for dividing data into train and test is that, without out-of-bag samples, high-variance algorithms will do better than low-variance ones, almost by definition. Consequently, it's necessary to split data into train and test parts for questions such as:
deciding whether kernel-SVR is better or worse than linear regression, for your data
tuning the parameters of kernel-SVR
However, once these questions are determined, then, in general, as long as your data is generated by the same process, the better predictions will be, and you should use all of it.
An exception is the case where the data is, say, non-stationary. Suppose you're training for the stock market, and you have data from 10 years ago. It is unclear that the process hasn't changed in the meantime. You might be harming your prediction, by including more data, in this case.
Yes, there are techniques for doing this, e.g. k-fold cross-validation:
One of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the data set into two sets of 70% for training and 30% for test) is that there is not enough data available to partition it into separate training and test sets without losing significant modelling or testing capability. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique.
That said, there may not be a good reason for doing so if you have plenty of data, because it means that the model you're using hasn't actually been tested on real data. You're inferring that it probably will perform well, since models trained using the same methods on less data also performed well. That's not always a safe assumption. Machine learning algorithms can be sensitive in ways you wouldn't expect a priori. Unless you're very starved for data, there's really no reason for it.

Resources