Dimensionality / noise reduction techniques for regression problems [closed]

What are some techniques for dimensionality reduction in regression problems? I have tried the only unsupervised techniques I know, PCA and kernel PCA (using the scikit-learn library), but I have not seen any improvement from them. Perhaps they are only suitable for classification problems? What other techniques can I try? Preferably, ones implemented in sklearn.

This is a very general question, and the suitability of these techniques (or combinations of them) really depends on the specifics of your problem.
In general, there are several categories of dimension reduction (aside from the ones you mentioned).
Perhaps the simplest form of dimension reduction is to use only some of the features, in which case we are really talking about feature selection (see sklearn's feature selection module).
Another way would be to cluster the features (see sklearn's clustering module) and replace each cluster by an aggregate of its components.
Finally, some regressors use L1 penalization and properties of convex optimization to select a subset of features while fitting the model; in sklearn, see the lasso and the elastic net.
Once again, this is a very broad problem. There are entire books, and even competitions, devoted to feature selection alone, which is itself only a subset of dimension reduction. A short sketch of all three approaches follows.
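A minimal sketch of the three approaches on synthetic data, assuming scikit-learn; the dataset and the parameter choices (k=10 features/clusters, alpha=1.0) are illustrative, not recommendations:

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso

# Synthetic regression data: 50 features, only 10 of them informative.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

# 1. Feature selection: keep the k features most related to the target.
X_selected = SelectKBest(f_regression, k=10).fit_transform(X, y)

# 2. Feature clustering: group similar features and replace each cluster
#    with an aggregate (the mean, by default).
X_clustered = FeatureAgglomeration(n_clusters=10).fit_transform(X)

# 3. L1 penalization: the lasso drives many coefficients to exactly zero,
#    selecting a feature subset while it fits the regression.
lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)

print(X_selected.shape, X_clustered.shape, kept.size)
```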

Adding to @AmiTavory's good answer: PCA (principal component analysis) can be used here. If you do not wish to perform dimensionality reduction, simply retain as many components from the PCA as there are input features: in your case, 20.
The resulting output will be a set of orthogonal components; you can consider them the "transformation" you are seeking, as follows: the components are ranked by the amount of input variance each one explains.
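A minimal sketch of this, assuming scikit-learn and a stand-in 20-feature matrix (your real inputs would replace the random data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the real inputs: 500 samples, 20 features.
X = np.random.RandomState(0).randn(500, 20)

# Keeping as many components as input features makes PCA a pure rotation:
# nothing is discarded, the axes are merely re-oriented.
pca = PCA(n_components=20)
X_rotated = pca.fit_transform(X)

# The components come ranked by the fraction of input variance they explain.
print(pca.explained_variance_ratio_)
```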

Related

Which supervised machine learning classification method suits randomly spread classes? [closed]

If classes are randomly spread, or the data is noisy, which type of supervised ML classification model will give better results, and why?
It is difficult to say which classifier will perform best on a general problem. It often requires testing a variety of algorithms on a given problem to determine which performs best.
Best performance also depends on the nature of the problem. There is a great answer in another Stack Overflow question which looks at various scoring metrics. For each problem, one needs to understand and choose the scoring metric that fits best.
All of that said, neural networks, random forest classifiers, support vector machines, and a variety of others are all candidates for creating useful models, given that classes are, as you indicated, equally distributed. When classes are imbalanced, the rules shift slightly, as most ML algorithms assume balance.
My suggestion would be to try a few different algorithms, and tune the hyperparameters, to compare them for your specific application. You will often find one algorithm is better, but not remarkably so. In my experience, how your data are preprocessed and how your features are prepared is often of far greater importance. Once again, this is a highly generic answer, as it depends greatly on your application.
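A minimal sketch of such a comparison, assuming scikit-learn and a synthetic noisy dataset; swap in your own X, y and a scoring metric suited to your problem:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# flip_y=0.2 injects label noise to mimic a noisy problem.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                           random_state=0)

candidates = {
    "random forest": RandomForestClassifier(random_state=0),
    "svm (rbf)": make_pipeline(StandardScaler(), SVC()),
    "neural net": make_pipeline(StandardScaler(),
                                MLPClassifier(max_iter=1000, random_state=0)),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```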

Does the use of kernels in SVMs increase the chances of overfitting? [closed]

When using kernels to delimit non-linear domains in SVMs, we introduce new features based on the training examples, so we end up with as many features as training examples. But having as many features as examples increases the chances of overfitting, right? Should we drop some of these new features?
You really can't drop any of the kernel-generated features; in many cases you don't know which features are being used or what weight is given to them. Besides the use of kernels, SVMs employ regularization, and this regularization reduces the risk of overfitting.
You can read about the connection between the formulation of SVMs and statistical learning theory, but the high-level summary is that the SVM doesn't just find a separating hyperplane; it finds the one that maximizes the margin.
The Wikipedia article on SVMs is very good and provides excellent links on regularization, parameter search, and many other important topics.
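A minimal sketch of how that regularization is exposed in practice, assuming scikit-learn's SVC (which wraps libsvm): C trades margin width against training error, so tuning it via grid search is the usual way to control overfitting. The grid values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Small C = stronger regularization (wider margin, less overfitting risk);
# large C = weaker regularization (fits the training data more tightly).
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```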
Increasing the number of features does increase the chances of overfitting. You could use cross-validation (which libsvm includes) to test whether the model you trained is overfitting, and use feature selection tools, such as http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/fselect/fselect.py, to select features.
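A minimal sketch of that cross-validation check, using scikit-learn's SVC (which wraps libsvm) rather than libsvm's command-line tools; a large gap between training and validation scores is the classic sign of overfitting. The deliberately high gamma here is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

result = cross_validate(SVC(kernel="rbf", gamma=1.0), X, y, cv=5,
                        return_train_score=True)
print("train:", result["train_score"].mean())  # near 1.0 with gamma this high
print("valid:", result["test_score"].mean())   # much lower => overfitting
```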

LaSVM documentation and information [closed]

I have thousands of samples for training and testing, and I want to classify them with an SVM with an RBF kernel. The problem is that libsvm's RBF kernel implementation is very slow with 10k or more data points; most of the time goes into the grid search.
I have read about liblinear and LaSVM, but liblinear is not what I want, because SVMs with a linear kernel usually achieve lower accuracy than with an RBF kernel.
I searched for LaSVM but couldn't find useful information about it; the project site says very little. I want to know whether LaSVM can use the RBF kernel or only a specific kind of kernel, whether I should scale the test and training data, and whether I can grid-search my kernel parameters with cross-validation.
LaSVM has an RBF kernel implementation too. Based on my experience with large data (>100,000 instances in >1,000 dimensions), it is no faster than LIBSVM, though. If you really want to use a nonlinear kernel for huge data, you could try EnsembleSVM.
If your data is truly huge and you are not familiar with ensemble learning, LIBLINEAR is the way to go. With a high number of input dimensions, the linear kernel is usually not much worse than RBF while being orders of magnitude faster.
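A minimal sketch of the LIBLINEAR route, assuming scikit-learn's LinearSVC (which wraps liblinear); scaling the features first helps both accuracy and speed, and with a linear kernel only C needs tuning, so the grid search stays cheap. The dataset shape is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=10000, n_features=500, random_state=0)

# Scale, then fit a linear SVM; only C is left to grid-search.
model = make_pipeline(StandardScaler(), LinearSVC())
grid = GridSearchCV(model, {"linearsvc__C": [0.01, 0.1, 1, 10]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```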

Different usage of Machine Learning classifiers [closed]

I have learned several classifiers in machine learning: decision trees, neural networks, SVMs, Bayesian classifiers, k-NN, etc.
Can anyone help me understand when I should prefer one classifier over another? For example, in which situations (nature of the dataset, etc.) should I prefer a decision tree over a neural net, or when might an SVM work better than a Bayesian classifier?
This depends heavily on the nature of the dataset. There are several meta-learning approaches that will suggest which classifier to use, but generally there is no golden rule.
If your data is easily separable (it is easy to distinguish entries from different classes), decision trees or SVMs with a linear kernel may be good enough. However, if your data needs to be transformed into other, higher-dimensional spaces, kernel-based classifiers such as RBF SVMs might work well. SVMs also work better with non-redundant, independent features. When combinations of features are needed, artificial neural networks and Bayesian classifiers work well too.
Yet again, this is highly subjective and depends strongly on your feature set. For instance, a single feature that is highly correlated with the class might determine which classifier works best. That said, the no-free-lunch theorem says no classifier is better at everything, although SVMs are generally regarded as the current best bet for binary classification.
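A minimal illustration of the separability point, assuming scikit-learn; on two interleaved half-moons (not linearly separable), an RBF kernel should pull ahead of a linear one:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two interleaved half-moon classes: not separable by any straight line.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(f"{kernel}: {scores.mean():.3f}")
```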

Survey to determine satisfaction: how to find the questions that mattered? [closed]

If a survey is given to determine overall customer satisfaction, and there are 20 general questions and a final summary question: "What's your overall satisfaction 1-10", how could it be determined which questions are most significantly related to the summary question's answer?
In short, which questions actually mattered and which ones were just wasting space on the survey...
Information about the relevance of certain features is given by the linear classification or regression weights associated with those features.
For your specific application, you could try training an L1- or L0-regularized regressor (see http://en.wikipedia.org/wiki/Least-angle_regression and http://en.wikipedia.org/wiki/Matching_pursuit). These regularizers force many of the regression weights to zero, which means the features associated with those weights can effectively be ignored.
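A minimal sketch of this, using scikit-learn's LassoCV (an L1-regularized regressor) rather than the exact methods linked above; the survey data here is random and purely illustrative, with only two questions actually driving the overall score:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.RandomState(0)
X = rng.randint(1, 11, size=(300, 20))        # 300 respondents, 20 questions
y = X[:, 0] + 0.5 * X[:, 3] + rng.randn(300)  # only questions 1 and 4 matter

lasso = LassoCV(cv=5).fit(X, y)

# Questions whose weight survives the L1 penalty are the ones that mattered.
for i, w in enumerate(lasso.coef_):
    if abs(w) > 1e-6:
        print(f"question {i + 1}: weight {w:.2f}")
```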
There are many different approaches to answering this question, at varying levels of sophistication. I would start by calculating the correlation matrix for all pairwise combinations of answers, which indicates which individual questions are most positively (or most negatively) correlated with the overall satisfaction score. This is straightforward in Excel with the Analysis ToolPak.
Next, I would look into clustering techniques, starting simple and moving up in sophistication only if necessary. Not knowing anything about the domain this survey data covers, it is hard to say which algorithm would be most effective, but for starters I would look at k-means and its variants if your clusters are likely to be similarly sized. However, if a vast majority of the responses are very similar, I would look into expectation-maximization-based algorithms. A good open-source toolkit for exploring data and testing the efficacy of various algorithms is Weka.
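A minimal sketch of the correlation-matrix starting point, using pandas instead of Excel; the column names and data are illustrative, with the last column holding the overall 1-10 score:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
answers = pd.DataFrame(rng.randint(1, 11, size=(300, 20)),
                       columns=[f"q{i + 1}" for i in range(20)])
answers["overall"] = answers["q1"] + 0.5 * answers["q4"] + rng.randn(300)

# Questions ranked by how strongly they correlate with the overall score.
print(answers.corr()["overall"].drop("overall").sort_values(ascending=False))
```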
