Ridge regression vs Lasso Regression [closed] - machine-learning

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
Is Lasso regression or Elastic-net regression always better than the ridge regression?
I've conducted these regressions on a few data sets and I've always got the same result that the mean squared error is the least in lasso regression. Is this a mere coincidence or is this true in any case?

On the topic, James, Witten, Hastie and Tibshirani write in their book "An Introduction to Statistical Learning":
These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors that is related to the response is never known a priori for real data sets. A technique such as cross-validation can be used in order to determine which approach is better on a particular data set. (chapter 6.2)

It's different for each problem. Lasso regression tries to remove the extra features that aren't useful by shrinking their coefficients all the way to zero, which sounds attractive because the resulting sparse model can work well even with less data, though the optimization is a little harder. Ridge regression, on the other hand, makes those extra features less influential by shrinking their coefficients toward zero without removing them completely, which is easier to compute.
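Following the cross-validation advice from the quote above, here is a minimal sketch of how one could compare the two on a single dataset. The scikit-learn names are real; the synthetic data and the alpha values are illustrative assumptions, and with a sparse ground truth the lasso is expected (but not guaranteed) to win.

```python
# Compare ridge and lasso by 5-fold cross-validated MSE on synthetic data
# whose true coefficients are sparse (only 5 of 50 features matter).
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=1.0)):
    # cross_val_score returns negated MSE, so flip the sign back.
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(type(model).__name__, round(mse, 2))
```

On a dataset where many predictors carry similar weight, the same comparison can easily come out the other way, which is exactly the book's point.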

Related

Tuning of hyperparameters of SVR [closed]

What is the best way of selecting the hyperparameters of SVR for tuning with GridSearchCV? I learnt that the input to GridSearchCV is a set of values for C, gamma and epsilon. The GridSearchCV algorithm evaluates each of these values and suggests the best among the set given to it as input. How do I select the set of values to supply? Is there a better way of selecting them apart from trial and error? Trial and error is time-consuming, and one might also miss the optimal values of the hyperparameters.
As always, a good hyperparameter range depends on the problem. It is difficult to find one solution that fits all problems.
The literature recommends an epsilon between 1e-3 and 1. Concerning the C parameter, a good search space would be between 1 and 100. A C that is too large will simply overfit the training data.
Scikit-learn's SVR already computes a reasonable default for gamma; I would not change it.
Don't forget that you can also tune the kernel and this might be the most important hyperparameter to tune.
To conclude, there is no free lunch when searching for the best hyperparameter ranges. The best option is to read the literature and documentation, as well as to understand the impact of each parameter.
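Putting the ranges above into a concrete search, a GridSearchCV setup for SVR could look like the sketch below. The scikit-learn API calls are real; the specific grid values and the synthetic dataset are illustrative assumptions, not prescriptions.

```python
# Grid search over kernel, C (roughly 1..100) and epsilon (roughly 1e-3..1)
# for an SVR, scored by cross-validated MSE.
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

param_grid = {
    "kernel": ["rbf", "linear"],       # often the most influential choice
    "C": [1, 10, 100],
    "epsilon": [1e-3, 1e-2, 1e-1, 1],
}
search = GridSearchCV(SVR(), param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
```

A common workflow is to run a coarse grid like this first, then refine with a narrower grid (or a log-spaced randomized search) around the best values found.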

What are the performance metrics for Clustering Algorithms? [closed]

I'm working on k-means clustering, but unlike supervised learning I cannot figure out the performance metrics for clustering algorithms. How do I measure performance after training on the data?
For k-means you can look at its inertia_, which gives you an idea of how well the algorithm has worked.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)  # assuming 3 clusters
kmeans.fit(X)                  # assuming X already holds your data
kmeans.inertia_                # lower is better
Alternatively, you can call the score() function, which gives you the same quantity but with a negative sign. By convention a bigger score means better, while for k-means a lower inertia_ is better, so an extra negation is applied to keep the two consistent.
# Call score with data X
kmeans.score(X) # greater is better
This is the most basic way of analyzing k-means performance. In reality, if you make the number of clusters too high, score() will increase accordingly (in other words, inertia_ will decrease), because inertia_ is nothing but the sum of squared distances from each point to the centroid of the cluster it is assigned to. If you increase the number of clusters too much, that sum keeps shrinking, since every point gets a centroid very near to it, even though the quality of the clustering is horrible in that case. For a better analysis you should compute the silhouette score, or better yet use a silhouette diagram.
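A minimal sketch of that effect on synthetic blobs (the scikit-learn names are real; the data and the choices of k are illustrative): inertia_ keeps dropping as k grows, while the silhouette score can flag a k that is too large.

```python
# inertia_ decreases monotonically with k, but the silhouette score
# tends to peak at the "right" number of clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in (2, 4, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1),
          round(silhouette_score(X, km.labels_), 3))
```

With 4 true blobs, k=8 will still show a lower inertia_ than k=4, but its silhouette score is typically worse, which is exactly why inertia_ alone cannot pick k for you.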
You will find all of the implementations in this notebook: 09_unsupervised_learning.ipynb
The book corresponding to this repository is: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. It is a great book to learn all of these details.

How do you calculate accuracy metric for a regression problem with multiple outputs? [closed]

My CNN (Conv1D) in PyTorch has 20 inputs and 6 outputs. The predicted output is said to be "accurate" only if all 6 of them match, right? So, unless all my predicted results are accurate to the 8th decimal place, will I ever be able to get decent accuracy?
The standard accuracy metric is used for classification tasks. In order to use accuracy you have to say whether an output is one of the following: true positive (TP), true negative (TN), false positive (FP), or false negative (FN).
These classification metrics can be used to a certain extent in regression tasks, when you can apply these labels (TP, TN, FP, FN) to the outputs, maybe via a simple threshold. This heavily depends on the kind of problem you are dealing with and may or may not be possible or useful.
As Andrey said, in general you want to use metrics like the mean absolute error (MAE) or the mean squared error (MSE). But these metrics can be hard to interpret. I would suggest looking into papers that tackle a problem similar to yours and seeing which metrics they use to evaluate their results and compare themselves to other work.
Accuracy isn't a suitable metric for regression tasks. For regression you should use metrics such as MAE, RMSE, and so on.
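As a hedged sketch of what that looks like for a multi-output regression like the 6-output network in the question (the scikit-learn functions are real; the toy arrays simply stand in for model predictions):

```python
# MAE and RMSE for a multi-output regression, averaged over all 6 outputs.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]])
y_pred = y_true + 0.5  # every output off by exactly 0.5

mae = mean_absolute_error(y_true, y_pred)           # 0.5
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # 0.5
print(mae, rmse)
```

Unlike exact-match accuracy, these errors degrade gracefully: a prediction that is close but not identical still scores well, which is what you want for continuous outputs.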

Can a model have both high bias and high variance? Overfitting and Underfitting? [closed]

As I understand it when creating a supervised learning model, our model may have high bias if we are making very simple assumptions (for example if our function is linear) which cause the algorithm to miss relationships between our features and target output resulting in errors. This is underfitting.
On the other hand, if we make our model too flexible (many polynomial features), it will be very sensitive to small fluctuations in our training set, modeling the random noise in the training data rather than the intended outputs. This is overfitting.
This makes sense to me, but I heard that a model can have both high variance and high bias and I just don't understand how that would possible. If high bias and high variance are synonyms for underfitting and overfitting, then how can you have both overfitting and underfitting on the same model? Is it possible? How can it happen? What does it look like when it does happen?
Imagine a regression problem. I define a predictor which, for every possible input, outputs the maximum of the target variable observed in the training data.
This model is both biased (it can only represent a single output, no matter how rich or varied the input) and has high variance (the max of a dataset exhibits a lot of variability between datasets).
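That argument can be checked with a tiny simulation (the setup below, including the normal target and sample sizes, is my own illustrative assumption): across many resampled training sets, the "predict the training max" model is systematically far from the true mean (bias) and its prediction jumps around from dataset to dataset (variance).

```python
# Simulate the max-of-training-data predictor over many training sets.
import numpy as np

rng = np.random.default_rng(0)

# Each "training set" is 50 draws from N(0, 1); the model's single
# prediction is that set's maximum.
preds = [rng.normal(0.0, 1.0, size=50).max() for _ in range(1000)]

print(round(float(np.mean(preds)), 2))  # well above the true mean 0 -> high bias
print(round(float(np.var(preds)), 3))   # nonzero spread across datasets -> high variance
```

So "high bias" and "high variance" are not mutually exclusive properties of a model; they measure different things.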
You're right to a certain extent that bias means a model is likely to underfit and variance means it's susceptible to overfitting, but they're not quite the same.
In my view, high bias and high variance can also both happen when the fitted line chases outliers in the data.

Imbalanced Data for Random ferns [closed]

For a multiclass problem, should the data be balanced for machine learning algorithms such as random forests and random ferns, or is it OK for it to be imbalanced to a certain extent?
The issue with imbalanced classes arises when the disproportion alters the separability of the class instances. But this does not happen in every imbalanced dataset: sometimes the more data you have from one class, the better you can differentiate the scarce data from it, since the extra data makes it easier to find which features are meaningful for building a discriminating plane (even if you are not using discriminant analysis, the point is still to separate the instances according to class).
For example, I remember the KDD Cup 2004 protein classification task, in which one class had 99.1% of the instances in the training set, yet if you tried to use undersampling methods to alleviate the imbalance you would only get worse results. That means the large amount of data from the first class helped define the data in the smaller one.
Concerning random forests, and decision trees in general, they work by selecting, at each step, the most promising feature to partition the set into two (or more) class-meaningful subsets. Having inherently more data about one class does not bias this partitioning by default (that is, not always), but only when the imbalance is not representative of the classes' real distributions.
So I suggest that you first run a multivariate analysis to gauge the extent of imbalance among the classes in your dataset, and then run a series of experiments with different undersampling ratios if you are still in doubt.
I have used random forests for this kind of task before. Although the data doesn't need to be balanced, if the positive samples are too few their pattern may be drowned in the noise. Most classification methods (even random forests and AdaBoost) have this flaw to some degree. Oversampling may be a good way to deal with this problem.
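Two simple remedies mentioned in this thread can be sketched as follows (the scikit-learn API is real; the synthetic dataset and the naive oversampling scheme are illustrative assumptions, not a recommended pipeline):

```python
# Handling imbalance with a random forest: class reweighting vs. naive
# random oversampling of the minority class.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Roughly 95% / 5% class split.
X, y = make_classification(n_samples=500, weights=[0.95, 0.05],
                           random_state=0)

# Option 1: reweight classes inside the forest itself.
clf = RandomForestClassifier(class_weight="balanced",
                             random_state=0).fit(X, y)

# Option 2: duplicate minority-class rows until both classes are equal.
minority = np.where(y == 1)[0]
extra = np.random.default_rng(0).choice(
    minority, size=len(y) - 2 * len(minority))
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
clf2 = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
```

Naive duplication can encourage overfitting to the duplicated points; smarter resamplers (e.g. SMOTE-style synthesis) exist, but whether any of this helps should be checked against a proper validation set, per the answers above.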
Perhaps the paper "Logistic Regression in Rare Events Data" is useful for this sort of problem, although its topic is logistic regression.
