Is logistic regression better for linearly separable data?

Regarding classification...
Suppose a dataset is found to be linearly separable (linear separability tested using an SVM, clustering, a single perceptron, etc.).
Can we then go with a simpler model like logistic regression (instead of an SVM or anything else), on the principle that the simpler model is the better model?
Please correct me if I'm wrong.
Thanks in advance!
Surya

Don't confuse the algorithm with the model. With linearly separable data, each of those algorithms should return a simple hyperplane, a sum of linear (first-degree) terms with real coefficients. Thus, each of the models is equally simple.
If you're concerned with the simplest algorithm, then you do have a point.
I would stick with a straightforward SVM: it directly computes the maximum-margin separation, which ends up determined by the nearest N+1 observations (given N features).
Each of the algorithms has its advantages with respect to run time, clarity, accuracy, etc. If your criterion is something other than the maximum margin, then linear regression (which does have a closed-form solution) may be the best choice.
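As a quick illustration, here is a hedged sketch (assuming scikit-learn; the blob data is synthetic) showing that logistic regression and a linear SVM both recover the same kind of model on separable data, a hyperplane w·x + b = 0:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Two well-separated clusters -> linearly separable data.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)

logreg = LogisticRegression().fit(X, y)
svm = LinearSVC().fit(X, y)

# Both models are just a hyperplane w.x + b = 0; only the fitting
# criterion (log-loss vs. maximum margin) differs.
print("logreg accuracy:", logreg.score(X, y))   # expect 1.0
print("svm accuracy:   ", svm.score(X, y))      # expect 1.0
print("logreg w, b:", logreg.coef_, logreg.intercept_)
print("svm w, b:   ", svm.coef_, svm.intercept_)
```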

Related

Feature selection/extraction in an ANN model for regression

I am trying to fit an ANN model for regression with 15 input parameters.
Some of these parameters are related to each other and the relationship is not linear. Say, one of the input parameters can be expressed as a non-linear function of other parameters. But I don't know these relations exactly because I lack domain knowledge. Is there a way to find these relationships among the input parameters?
I have tried finding these relationships with the pandas correlation matrix, but couldn't draw any conclusions, since it only captures the linear correlation between two parameters.
Thanks in advance.
One way to find non-linear relationships between your input features is to follow an approach similar to computing the Variance Inflation Factor (VIF), which is normally used to detect linear relationships between input features. In the regular setting, you run an Ordinary Least Squares (OLS) regression of each feature i on all the other features and compute VIF_i = 1 / (1 - R_i^2) from the resulting R_i^2.
The modification we need is to replace the OLS step with a neural network, which can capture non-linear relationships between the features, and to replace R_i^2 with the network's goodness of fit (its R^2, or accuracy for a categorical feature).
Finally, for whichever features the VIF comes out high, you can conclude that they are strongly (possibly non-linearly) related to the other features.
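A minimal sketch of that idea, assuming scikit-learn and pandas; the function name, the network size, and the use of cross-validated R² are my own choices, not a standard API:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

def nonlinear_vif(X: pd.DataFrame, feature: str) -> float:
    """Regress `feature` on all the other columns with a small MLP and
    return 1 / (1 - R^2), mirroring the usual VIF formula."""
    others = X.drop(columns=[feature])
    target = X[feature]
    mlp = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
    # Cross-validated R^2 guards against the MLP simply memorising the data.
    r2 = cross_val_score(mlp, others, target, cv=5, scoring="r2").mean()
    r2 = min(max(r2, 0.0), 0.999)  # clamp to avoid negative R^2 / division by zero
    return 1.0 / (1.0 - r2)

# Usage: features with a large value are well predicted (linearly or not)
# from the remaining ones.
# for col in df.columns: print(col, nonlinear_vif(df, col))
```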

Random forest is worse than linear regression? Is it normal and what is the reason?

I am trying to use machine learning to predict a dataset. It is a regression problem with 180 input features and one continuous-valued output. I am comparing deep neural networks, random forest regression, and linear regression.
As I expected, a 3-hidden-layer deep neural network outperforms the other two approaches, with a root mean square error (RMSE) of 0.1. However, I was surprised to see that random forest performs even worse than linear regression (RMSE 0.29 vs. 0.27). I expected the random forest to discover more complex dependencies between the features and thereby decrease the error. I have tried tuning the random forest's parameters (number of trees, maximum features, max_depth, etc.) and different k-fold cross-validation settings, but its performance is still below linear regression.
I searched online, and one answer says linear regression may perform better if the response has a smooth, nearly linear dependence on the covariates. I do not fully get the point: if that were the case, the deep neural network shouldn't give much of a performance gain either, yet it does.
I am struggling to find an explanation. Under what circumstances is random forest worse than linear regression while deep neural networks perform much better?
If your features have a linear relationship to the target variable, then a linear model will usually perform better than a random forest; it all depends on how linear the relationship between your features and the target is.
That said, linear models are not inherently superior, nor is the random forest in any way inferior.
Try scaling and transforming the data, e.g. with MinMaxScaler() from scikit-learn, to see if the linear model improves further (a sketch follows below).
Pro tips
If a linear model is working like a charm, you need to ask yourself why and how, and get into the basics of both models to understand why it worked on your data. These questions will lead you to better feature engineering. As a matter of fact, Kaggle Grandmasters do use linear models in stacking to get that top-1% score, precisely by capturing the linear relations in the dataset.
So at the end of the day, linear models can work wonders too.
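A hedged sketch of the comparison, assuming scikit-learn; the synthetic, mostly linear data stands in for the asker's 180-feature problem:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Mostly linear synthetic data: the regime where linear models shine.
X, y = make_regression(n_samples=500, n_features=180, noise=10.0, random_state=0)

models = {
    "linear": make_pipeline(MinMaxScaler(), LinearRegression()),
    "forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: RMSE = {rmse:.3f}")  # linear wins when the signal is linear
```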

High bias or variance? - SVM and weird learning curves

I have never seen learning curves like these. Am I right that huge overfitting is occurring? The model fits the training data better and better while generalizing worse and worse on the test data.
Usually when there is high variance, as here, more examples should help. In this case I suspect they won't. Why is that? And why can't such learning curves be found easily in the literature/tutorials?
[Figure: learning curves for an SVM; param1 is C, param2 is gamma]
You have to remember that an SVM is a non-parametric model, so more samples do not necessarily reduce variance. A reduction in variance can be more or less guaranteed for a parametric model (like a neural net), but an SVM is not one of them: more samples mean not only more training data but also a more complex model. Your learning curves are a typical example of SVM overfitting, which happens a lot with the RBF kernel.
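Curves like these can be reproduced with scikit-learn's learning_curve; a minimal sketch, with deliberately large (purely illustrative) C and gamma to encourage the overfitting pattern discussed above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Large C and gamma make the RBF SVM fit the training set almost perfectly
# while generalizing poorly, i.e. a wide train/test gap (high variance).
sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="rbf", C=100.0, gamma=1.0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  test={te:.3f}")  # gap = variance
```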

Is there any classifier which is able to make decisions very fast?

Most classification algorithms are developed to improve training speed. However, is there any classifier or algorithm that focuses on decision-making speed (low computational complexity and a simple, realizable structure)? I can get enough training data, and I can endure a long training time.
There are many methods which classify fast; you can more or less sort models by classification speed as follows (fastest first, slowest last):
Decision Tree (especially with limited depth)
Linear models (linear regression, logistic regression, linear SVM, LDA, ...) and Naive Bayes
Non-linear models based on explicit data transformation (Nystroem kernel approximation, RVFL, RBFNN, EEM), Kernel methods (such as kernel SVM) and shallow neural networks
Random Forest and other committees
Big neural networks (e.g. CNNs)
KNN with arbitrary distance
Obviously this list is not exhaustive; it just shows some general ideas. A rough timing sketch follows below.
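As a hedged micro-benchmark of that ordering, assuming scikit-learn; the exact ranking will vary with data size, dimensionality, and implementation:

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
models = {
    "tree (depth 5)": DecisionTreeClassifier(max_depth=5),
    "logistic":       LogisticRegression(max_iter=1000),
    "kernel SVM":     SVC(kernel="rbf"),
    "kNN":            KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X, y)
    t0 = time.perf_counter()
    model.predict(X)  # time the decision step only, not training
    print(f"{name}: {time.perf_counter() - t0:.4f}s")
```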
One way of obtaining such a model is to build a complex, slow model, then use it as a black-box label generator to train a simpler model (on a potentially unlimited training set), thus getting a fast classifier at the cost of very expensive training. There are many works showing this can be done, for example by training a shallow neural network on the outputs of a deep one; a toy sketch follows.
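A minimal sketch of that idea, using a random forest as the slow teacher and a shallow decision tree as the fast student (both choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

# Slow but accurate teacher.
slow = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Train a fast student on the teacher's labels; with an unlabeled pool or a
# generative model, X_pool could be (almost) unlimited.
X_pool = X  # stand-in for a large unlabeled pool
fast = DecisionTreeClassifier(max_depth=8).fit(X_pool, slow.predict(X_pool))

print("agreement with teacher:", (fast.predict(X) == slow.predict(X)).mean())
```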
In general, classification speed should not be a problem. Some exceptions are algorithms whose prediction-time complexity depends on the number of training samples. One example is k-nearest-neighbors, which has no training time but, for classification, needs to check all points (if implemented naively). Other examples are all classifiers that work with kernels, since they compute the kernel between the current sample and all training samples.
Many classifiers work with a scalar product of the features and a learned coefficient vector. These should be fast enough in almost all cases. Examples are logistic regression, linear SVM, perceptrons, and many more. See lejlot's answer for a nice list.
If these are still too slow, you might try to reduce the dimension of your feature space first and then try again (this also speeds up training time).
By the way, this question might not be suited for StackOverflow, as it is quite broad and recommendation-oriented rather than problem-oriented. Maybe try https://stats.stackexchange.com/ next time.
I have a decision tree that is represented in compressed form and is at least 4 times faster than the actual tree at classifying an unseen instance.

unigrams & bigrams (tf-idf) less accurate than just unigrams (tf-idf)?

This is a question about linear regression with n-grams, using tf-idf (term frequency–inverse document frequency) scores. To do this, I am using scipy sparse matrices and sklearn for the linear regression.
I have 53 cases and over 6000 features when using unigrams. The predictions are based on cross validation using LeaveOneOut.
When I create a tf-idf sparse matrix of only unigram scores, I get slightly better predictions than with a tf-idf sparse matrix of unigram+bigram scores. The more columns I add to the matrix (columns for trigrams, 4-grams, 5-grams, etc.), the less accurate the regression predictions become.
Is this common? How is this possible? I would have thought that more features would be better.
It's not common for bigrams to perform worse than unigrams, but there are situations where it may happen. In particular, adding extra features may lead to overfitting. Tf-idf is unlikely to alleviate this, as longer n-grams will be rarer, leading to higher idf values.
I'm not sure what kind of variable you're trying to predict, and I've never done regression on text, but here's some comparable results from literature to get you thinking:
In random text generation with small (but non-trivial) training sets, 7-grams tend to reconstruct the input text almost verbatim, i.e. cause complete overfit, while trigrams are more likely to generate "new" but still somewhat grammatical/recognizable text (see Jurafsky & Martin; can't remember which chapter and I don't have my copy handy).
In classification-style NLP tasks performed with kernel machines, quadratic kernels tend to fare better than cubic ones because the latter often overfit on the training set. Note that unigram+bigram features can be thought of as a subset of the quadratic kernel's feature space, and {1,2,3}-grams of that of the cubic kernel.
Exactly what is happening depends on your training set; it might simply be too small.
As larsmans said, adding more variables/features makes it easier for the model to overfit and hence lose test accuracy. In the master branch of scikit-learn there is now a min_df parameter to cut off any feature occurring in fewer than that number of documents. Hence min_df=2 to min_df=5 might help you get rid of spurious bigrams.
Alternatively you can use L1 or L1 + L2 penalized linear regression (or classification) using either the following classes:
sklearn.linear_model.Lasso (regression)
sklearn.linear_model.ElasticNet (regression)
sklearn.linear_model.SGDRegressor (regression) with penalty == 'elastic_net' or 'l1'
sklearn.linear_model.SGDClassifier (classification) with penalty == 'elastic_net' or 'l1'
This will make it possible to ignore spurious features, leading to a sparse model with many zero weights for noisy features. Grid-searching the regularization parameters will be very important, though.
You can also try univariate feature selection, as done in the text classification example of scikit-learn (check the SelectKBest and chi2 utilities).
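Putting those two suggestions together, here is a minimal sketch assuming scikit-learn; the toy corpus and the alpha value are placeholders just to make it runnable:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Placeholder corpus and targets; substitute your 53 documents.
docs = ["the cat sat", "the dog ran", "a cat ran fast", "the dog sat down"] * 4
y = [1.0, 2.0, 1.5, 2.5] * 4

# Unigrams + bigrams, dropping n-grams seen in fewer than 2 documents.
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = vec.fit_transform(docs)

# L1 penalty zeroes out spurious n-gram features; grid-search alpha in practice.
lasso = Lasso(alpha=0.1)
scores = cross_val_score(lasso, X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")
print("mean LOO MAE:", -scores.mean())
```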
