Tuning of hyperparameters of SVR [closed] - machine-learning

What is the best way of selecting the hyperparameters of SVR when tuning them with GridSearchCV? I learned that the input to GridSearchCV is a set of candidate values for C, gamma and epsilon. GridSearchCV evaluates each of these values and reports the best among the set of values given to it as input. How do I choose that set of candidate values in the first place? Is there a better way of selecting them than trial and error? Trial and error is time consuming, and one might also miss the optimal values of the hyperparameters.

As always, a good hyperparameter range depends on the problem. It is difficult to find one solution that fits all problems.
The literature recommends an epsilon between 1e-3 and 1. For the C parameter, a good search space would be between 1 and 100; a C that is too large will simply overfit the training data.
The gamma is already calculated by scikit-learn's SVR (the default gamma='scale' is derived from the data), so I would not change it.
Don't forget that you can also tune the kernel, and this might be the most important hyperparameter to tune.
To conclude, there is no free lunch when searching for the best hyperparameter ranges. The best option is to read the literature and the documentation, as well as to understand the impact of each parameter.
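As a concrete starting point using the ranges above, here is a minimal GridSearchCV sketch; the random data, the pipeline, and the grid values are illustrative assumptions rather than a prescription:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
# Placeholder data; substitute your own X and y
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = rng.rand(200)
# Log-spaced grids over the ranges discussed above; gamma is left at its default
param_grid = {
    "svr__C": np.logspace(0, 2, 5),         # 1 ... 100
    "svr__epsilon": np.logspace(-3, 0, 4),  # 1e-3 ... 1
    "svr__kernel": ["rbf", "linear"],
}
pipe = make_pipeline(StandardScaler(), SVR())
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, search.best_score_)
Log-spacing the grid is a cheap way to cover several orders of magnitude without an explosion of candidate values.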

Related

Batch normalization [closed]

Why does batch normalization operate on the same feature across different samples, instead of on different features of the same sample? Shouldn't it be a normalization over different features? In the diagram, why do we use the first row and not the first column?
Could someone help me?
Because different features of the same object mean different things, and it is not logical to compute statistics over those values: they can have different ranges, means, standard deviations, etc. For example, one of your features could be a person's age and another the person's height; the mean of those two numbers is not a meaningful quantity.
In classic machine learning (especially with linear models and KNN) you should normalize your features, i.e. compute the mean and std of each specific feature over the entire dataset and transform the feature to (X - mean(X)) / std(X). Batch normalization is the analogue of this applied to stochastic optimization methods like SGD (it's not meaningful to use global statistics on a mini-batch, and furthermore you want to use batch norm more often than just before the first layer). The more fundamental ideas can be found in the original paper.
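To make the row-versus-column point concrete, here is a small NumPy sketch (the feature names and numbers are made up) showing that the statistics are computed per feature, across the samples of the batch:
import numpy as np
# Hypothetical mini-batch: 4 samples (rows) x 3 features (columns), e.g. [age, height_cm, income]
batch = np.array([
    [25.0, 170.0, 40000.0],
    [32.0, 165.0, 52000.0],
    [47.0, 180.0, 61000.0],
    [51.0, 175.0, 38000.0],
])
# Batch-norm style statistics: per feature, across samples (axis=0, i.e. down each column)
mean = batch.mean(axis=0)
std = batch.std(axis=0)
normalized = (batch - mean) / (std + 1e-5)  # small epsilon for numerical stability
# Averaging age, height and income of one sample (axis=1) would mix incompatible units,
# which is why normalization is done per feature rather than per sample.
print(normalized.mean(axis=0))  # approximately 0 for every feature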

What are the performance metrics for Clustering Algorithms? [closed]

I'm working on k-means clustering, but unlike in supervised learning I cannot figure out the performance metrics for clustering algorithms. How do I measure accuracy after training on the data?
For k-means you can look at its inertia_, which gives you an idea of how well the algorithm has worked.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)  # n_clusters=3 is only an illustrative choice
kmeans.fit(X)                  # assuming X is your training data
print(kmeans.inertia_)         # lower is better
Alternatively, you can call the score() function, which returns the same quantity but with a negative sign. The convention is that a bigger score means better, while for k-means a lower inertia_ is better, so an extra negation is applied to keep them consistent.
# Call score with data X
kmeans.score(X) # greater is better
This is the most basic way of analyzing k-means performance. In reality, if you make the number of clusters too high, score() will increase accordingly (in other words, inertia_ will decrease), because inertia_ is simply the sum of squared distances from each point to the centroid of the cluster it is assigned to. If you increase the number of clusters too much, that sum keeps shrinking, since every point ends up with a centroid very close to it, even though the quality of the clustering may be terrible. For a better analysis you should compute the silhouette score, or even better, use a silhouette diagram.
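For example, a minimal sketch comparing silhouette scores for several cluster counts on a made-up toy dataset (make_blobs and the range of k are assumptions) could look like this:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
# Toy data; replace with your own feature matrix
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))  # closer to +1 is better
Unlike inertia_, the silhouette score does not automatically improve as you add more clusters, which makes it a more honest way to compare different values of k.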
You will find all of the implementations in this notebook: 09_unsupervised_learning.ipynb
The book corresponding to this repository is: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. It is a great book to learn all of these details.

How do you calculate accuracy metric for a regression problem with multiple outputs? [closed]

My CNN (Conv1D) in PyTorch has 20 inputs and 6 outputs. A prediction is counted as "accurate" only if all 6 outputs match, right? So unless all my predicted values are accurate to the 8th decimal place, will I ever be able to get decent accuracy?
The standard accuracy metric is used for classification tasks. In order to use accuracy you have to say whether an output is one of the following: true positive (TP), true negative (TN), false positive (FP), or false negative (FN).
These classification metrics can be used to a certain extent in regression tasks, when you can apply these labels (TP, TN, FP, FN) to the outputs, for example via a simple threshold. This heavily depends on the kind of problem you are dealing with and may or may not be possible or useful.
As Andrey said, in general you want to use metrics like the mean absolute error (MAE) or the mean squared error (MSE). These can be hard to interpret, though, so I would suggest looking at papers that tackle a similar problem and seeing which metrics they use to evaluate their results and compare themselves to other work.
Accuracy isn't a suitable metric for regression tasks; for regression you should use metrics such as MAE, RMSE, and so on.
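As an illustration, here is a minimal PyTorch sketch for a 6-output regression, with random tensors standing in for your model's predictions and targets:
import torch
# Hypothetical batch of 8 samples with 6 outputs each
preds = torch.randn(8, 6)
targets = torch.randn(8, 6)
mse = torch.nn.functional.mse_loss(preds, targets)    # mean squared error over all outputs
mae = torch.nn.functional.l1_loss(preds, targets)     # mean absolute error over all outputs
per_output_mae = (preds - targets).abs().mean(dim=0)  # MAE for each of the 6 outputs separately
print(mse.item(), mae.item(), per_output_mae)
Reporting the error per output can also help you spot whether one of the 6 targets is much harder to predict than the others.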

Ridge regression vs Lasso Regression [closed]

Is lasso regression or elastic-net regression always better than ridge regression?
I've run these regressions on a few data sets and I always get the same result: the mean squared error is lowest with lasso regression. Is this a mere coincidence, or is it true in general?
On this topic, James, Witten, Hastie and Tibshirani write in their book "An Introduction to Statistical Learning":
These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors that is related to the response is never known a priori for real data sets. A technique such as cross-validation can be used in order to determine which approach is better on a particular data set. (chapter 6.2)
It's different for each problem. In lasso regression the algorithm tries to drive the coefficients of useless extra features all the way to zero, effectively removing them, which can work very well when only a few features matter, although the optimization is a little harder. In ridge regression the algorithm only shrinks those extra features so they become less influential, without removing them completely, which is easier to optimize.
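As the quoted passage suggests, cross-validation is the practical way to compare the approaches on your own data. Here is a minimal sketch on a synthetic sparse dataset; the dataset parameters and the alpha values are assumptions, and in practice you would also tune alpha (e.g. with RidgeCV/LassoCV):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score
# Sparse setting: only 10 of the 50 features are truly informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=10, noise=10.0, random_state=0)
for model in (Ridge(alpha=1.0), Lasso(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)):
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(type(model).__name__, round(mse, 2))
In a sparse setting like this the lasso often wins because it can zero out the many irrelevant coefficients, whereas on a data set where most predictors matter, ridge would tend to come out ahead.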

Handling high cardinality features with supervised ratio and weight of evidence [closed]

Say a data set has a categorical feature with high cardinality, such as zip codes or cities. Encoding this feature would give hundreds of feature columns. Approaches such as the supervised ratio and weight of evidence (WOE) seem to give better performance.
The question is, the supervised ratio and WOE are to be calculated on the training set, right? So I take the training set, process it, calculate the SR and WOE, update the training set with the new values, and keep the calculated values to be used on the test set as well. But what happens if the test set has zip codes which were not in the training set, so that there is no SR or WOE value to use? (Practically this is possible if the training set does not cover all the possible zip codes, or if there are only one or two records from certain zip codes which might fall into either the training set or the test set.)
(The same would happen with the encoding approach as well.)
I am mostly interested in this question: are SR and/or WOE the recommended ways to handle a feature with high cardinality? If so, what do we do when there are values in the test set which were not in the training set?
If not, what are the recommended ways of handling high cardinality features, and which algorithms are more robust to them? Thank you
This is a well-known problem when applying value-wise transformations to a categorical feature. The most common workaround is to have a set of rules to translate unseen values into values known to your training set.
This can be just a single 'NA' value (or 'others', as another answer suggests), or something more elaborate (e.g. in your example, you can map unseen zip codes to the closest known one in the training set).
Another possible solution in some scenarios is to have the model refuse to make a prediction in those cases and just return an error.
For your second question, there is not really one recommended way of encoding high cardinality features (there are many methods, and some may work better than others depending on the other features, the target variable, etc.); what we can recommend is to implement a few and experiment to see which one is most effective for your problem. You can consider the preprocessing method as just another parameter of your learning algorithm.
That's a great question, thanks for asking!
When handling a feature with high cardinality, like zip codes, I keep only the most frequent values in my training set and put all the others into a new category "others", then I calculate their WOE or whatever metric I am using.
If unseen zip codes appear in the test set, they fall into the 'others' category. In general, this approach works well in practice.
I hope this naive solution can help you!
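A minimal pandas sketch of this 'others' bucket plus WOE idea is shown below; the column names, the toy data, the cut-off of two frequent zip codes, and the 0.5 smoothing constant are all made-up illustrations:
import numpy as np
import pandas as pd
# Hypothetical training and test frames with a binary target
train = pd.DataFrame({"zip": ["10001", "10001", "10002", "10003", "94105", "94105"],
                      "target": [1, 0, 1, 0, 1, 1]})
test = pd.DataFrame({"zip": ["10001", "99999"]})  # "99999" never appears in training
# 1. Keep only the most frequent zip codes; map the rest to "others"
top = train["zip"].value_counts().nlargest(2).index
train["zip_grp"] = np.where(train["zip"].isin(top), train["zip"], "others")
# 2. Compute a smoothed weight of evidence per group on the training set only
events = train.groupby("zip_grp")["target"].sum()
non_events = train.groupby("zip_grp")["target"].count() - events
woe = np.log(((events + 0.5) / events.sum()) / ((non_events + 0.5) / non_events.sum()))
# 3. Apply to the test set: unseen zip codes fall back to the "others" group
test["zip_grp"] = np.where(test["zip"].isin(top), test["zip"], "others")
test["zip_woe"] = test["zip_grp"].map(woe)
print(test)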

Resources