What's the relationship between an SVM and hinge loss? - machine-learning

My colleague and I are trying to wrap our heads around the difference between logistic regression and an SVM. Clearly they are optimizing different objective functions. Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss? Or is it more complex than that? How do the support vectors come into play? What about the slack variables? Why can't you have deep SVM's the way you can't you have a deep neural network with sigmoid activation functions?

I will answer one thing at at time
Is an SVM as simple as saying it's a discriminative classifier that simply optimizes the hinge loss?
SVM is simply a linear classifier, optimizing hinge loss with L2 regularization.
Or is it more complex than that?
No, it is "just" that, however there are different ways of looking at this model leading to complex, interesting conclusions. In particular, this specific choice of loss function leads to extremely efficient kernelization, which is not true for log loss (logistic regression) nor mse (linear regression). Furthermore you can show very important theoretical properties, such as those related to Vapnik-Chervonenkis dimension reduction leading to smaller chance of overfitting.
Intuitively look at these three common losses:
hinge: max(0, 1-py)
log: y log p
mse: (p-y)^2
Only the first one has the property that once something is classified correctly - it has 0 penalty. All the remaining ones still penalize your linear model even if it classifies samples correctly. Why? Because they are more related to regression than classification they want a perfect prediction, not just correct.
How do the support vectors come into play?
Support vectors are simply samples placed near the decision boundary (losely speaking). For linear case it does not change much, but as most of the power of SVM lies in its kernelization - there SVs are extremely important. Once you introduce kernel, due to hinge loss, SVM solution can be obtained efficiently, and support vectors are the only samples remembered from the training set, thus building a non-linear decision boundary with the subset of the training data.
What about the slack variables?
This is just another definition of the hinge loss, more usefull when you want to kernelize the solution and show the convexivity.
Why can't you have deep SVM's the way you can't you have a deep neural network with sigmoid activation functions?
You can, however as SVM is not a probabilistic model, its training might be a bit tricky. Furthermore whole strength of SVM comes from efficiency and global solution, both would be lost once you create a deep network. However there are such models, in particular SVM (with squared hinge loss) is nowadays often choice for the topmost layer of deep networks - thus the whole optimization is actually a deep SVM. Adding more layers in between has nothing to do with SVM or other cost - they are defined completely by their activations, and you can for example use RBF activation function, simply it has been shown numerous times that it leads to weak models (to local features are detected).
To sum up:
there are deep SVMs, simply this is a typical deep neural network with SVM layer on top.
there is no such thing as putting SVM layer "in the middle", as the training criterion is actually only applied to the output of the network.
using of "typical" SVM kernels as activation functions is not popular in deep networks due to their locality (as opposed to very global relu or sigmoid)

Related

Random forest is worse than linear regression? It it normal and what is the reason?

I am trying to use machine learning to predict a dataset. It is a regression problem with 180 input features and 1 continuously-valued output. I try to compare deep neural networks, random forest regression, and linear regression.
As I expect, 3-hidden-layer deep neural networks outperform other two approaches with a root mean square error (RMSE) of 0.1. However, I unexpected to see that random forest even performs worse than linear regression (RMSE 0.29 vs. 0.27). In my expectation, the random forest can discover more complex dependencies between features to decrease error. I have tried to tune the parameters of random forest (number of trees, maximum features, max_depth, etc.). I also tried different K-cross validation, but the performance is still less than linear regression.
I searched online, and one answer says linear regression may perform better if features have a smooth, nearly linear dependence on the covariates. I do not fully get the point because if that is the case, should not deep neural networks give much performance gain?
I am struggling to give an explanation. Under what situation, random forest is worse than linear regression, but deep neural networks can perform much better?
If your features explain linear relation to the target variable then a Linear Model usually performs well than a Random Forest Model. It totally depends on the linear relations between your features.
That said, Linear models are not superior or the Random Forest is any inferior one.
Try scaling and transforming the data using MinMaxScaler() from scikit-learn to see if the linear model improves further
Pro Tips
If linear model is working like a charm you need to ask your self Why? and How? And get into the basics of both the models to understand why it worked on your data. These questions will lead you to feature engineer better. And as a matter of fact, Kaggle Grand Masters do use Linear Models in stacking to get that top 1% score by capturing the linear relations in the dataset.
So at the end of the day, linear models could wonders too.

Why do neural networks work so well?

I understand all the computational steps of training a neural network with gradient descent using forwardprop and backprop, but I'm trying to wrap my head around why they work so much better than logistic regression.
For now all I can think of is:
A) the neural network can learn it's own parameters
B) there are many more weights than simple logistic regression thus allowing for more complex hypotheses
Can someone explain why a neural network works so well in general? I am a relative beginner.
Neural Networks can have a large number of free parameters (the weights and biases between interconnected units) and this gives them the flexibility to fit highly complex data (when trained correctly) that other models are too simple to fit. This model complexity brings with it the problems of training such a complex network and ensuring the resultant model generalises to the examples it’s trained on (typically neural networks require large volumes of training data, that other models don't).
Classically logistic regression has been limited to binary classification using a linear classifier (although multi-class classification can easily be achieved with one-vs-all, one-vs-one approaches etc. and there are kernalised variants of logistic regression that allow for non-linear classification tasks). In general therefore, logistic regression is typically applied to more simple, linearly-separable classification tasks, where small amounts of training data are available.
Models such as logistic regression and linear regression can be thought of as simple multi-layer perceptrons (check out this site for one explanation of how).
To conclude, it’s the model complexity that allows neural nets to solve more complex classification tasks, and to have a broader application (particularly when applied to raw data such as image pixel intensities etc.), but their complexity means that large volumes of training data are required and training them can be a difficult task.
Recently Dr. Naftali Tishby's idea of Information Bottleneck to explain the effectiveness of deep neural networks is making the rounds in the academic circles.
His video explaining the idea (link below) can be rather dense so I'll try to give the distilled/general form of the core idea to help build intuition
https://www.youtube.com/watch?v=XL07WEc2TRI
To ground your thinking, vizualize the MNIST task of classifying the digit in the image. For this, I am only talking about simple fully-connected neural networks (not Convolutional NN as is typically used for MNIST)
The input to a NN contains information about the output hidden inside of it. Some function is needed to transform the input to the output form. Pretty obvious.
The key difference in thinking needed to build better intuition is to think of the input as a signal with "information" in it (I won't go into information theory here). Some of this information is relevant for the task at hand (predicting the output). Think of the output as also a signal with a certain amount of "information". The neural network tries to "successively refine" and compress the input signal's information to match the desired output signal. Think of each layer as cutting away at the unneccessary parts of the input information, and
keeping and/or transforming the output information along the way through the network.
The fully-connected neural network will transform the input information into a form in the final hidden layer, such that it is linearly separable by the output layer.
This is a very high-level and fundamental interpretation of the NN, and I hope it will help you see it clearer. If there are parts you'd like me to clarify, let me know.
There are other essential pieces in Dr.Tishby's work, such as how minibatch noise helps training, and how the weights of a neural network layer can be seen as doing a random walk within the constraints of the problem.
These parts are a little more detailed, and I'd recommend first toying with neural networks and taking a course on Information Theory to help build your understanding.
Consider you have a large dataset and you want to build a binary classification model for that, Now you have two options that you have pointed out
Logistic Regression
Neural Networks ( Consider FFN for now )
Each node in a neural network will be associated with an activation function for example let's choose Sigmoid since Logistic regression also uses sigmoid internally to make decision.
Let's see how the decision of logistic regression looks when applied on the data
See some of the green spots present in the red boundary?
Now let's see the decision boundary of neural network (Forgive me for using a different color)
Why this happens? Why does the decision boundary of neural network is so flexible which gives more accurate results than Logistic regression?
or the question you asked is "Why neural networks works so well ?" is because of it's hidden units or hidden layers and their representation power.
Let me put it this way.
You have a logistic regression model and a Neural network which has say 100 neurons each of Sigmoid activation. Now each neuron will be equivalent to one logistic regression.
Now assume a hundred logistic units trained together to solve one problem versus one logistic regression model. Because of these hidden layers the decision boundary expands and yields better results.
While you are experimenting you can add more number of neurons and see how the decision boundary is changing. A logistic regression is same as a neural network with single neuron.
The above given is just an example. Neural networks can be trained to get very complex decision boundaries
Neural networks allow the person training them to algorithmically discover features, as you pointed out. However, they also allow for very general nonlinearity. If you wish, you can use polynomial terms in logistic regression to achieve some degree of nonlinearity, however, you must decide which terms you will use. That is you must decide a priori which model will work. Neural networks can discover the nonlinear model that is needed.
'Work so well' depends on the concrete scenario. Both of them do essentially the same thing: predicting.
The main difference here is neural network can have hidden nodes for concepts, if it's propperly set up (not easy), using these inputs to make the final decission.
Whereas linear regression is based on more obvious facts, and not side effects. A neural network should de able to make more accurate predictions than linear regression.
Neural networks excel at a variety of tasks, but to get an understanding of exactly why, it may be easier to take a particular task like classification and dive deeper.
In simple terms, machine learning techniques learn a function to predict which class a particular input belongs to, depending on past examples. What sets neural nets apart is their ability to construct these functions that can explain even complex patterns in the data. The heart of a neural network is an activation function like Relu, which allows it to draw some basic classification boundaries like:
Example classification boundaries of Relus
By composing hundreds of such Relus together, neural networks can create arbitrarily complex classification boundaries, for example:
Composing classification boundaries
The following article tries to explain the intuition behind how neural networks work: https://medium.com/machine-intelligence-report/how-do-neural-networks-work-57d1ab5337ce
Before you step into neural network see if you have assessed all aspects of normal regression.
Use this as a guide
and even before you discard normal regression - for curved type of dependencies - you should strongly consider kernels with SVM
Neural networks are defined with an objective and loss function. The only process that happens within a neural net is to optimize for the objective function by reducing the loss function or error. The back propagation helps in finding the optimized objective function and reach our output with an output condition.

High bias or variance? - SVM and weired learning curves

I have never seen such learning curves. Am I right, that huge overfitting occurs? The model is fitting better and better to the training data, while it generalizes worse for the test data.
Usually when there is high variance, like here, more examples should help. In this case, they won't, I suspect. Why is that? Why such example of learning curves can't be found easily in literature/tutorials?
Learning curves. SVM, param1 is C, param2 is gamma
You have to remember that SVM is non parametric model, thus more samples does not have to reduce variance. Reduction in variance can be more or less guaranteed for parametric model (like neural net), but SVM is not one of them - more samples mean not only better training data but also more complex model. Your learning curves are typical example of SVM overfitting, which happens a lot with RBF kernel.

Is there any classifier which is able to make decisions very fast?

Most classification algorithms are developed to improve the training speed. However, is there any classifier or algorithm focusing on the decision making speed(low computation complexity and simple realizable structure)? I can get enough training data,and endure the long training time.
There are many methods which classify fast, you could more or less sort models by classification speed in a following way (first ones - the fastest, last- slowest)
Decision Tree (especially with limited depth)
Linear models (linear regression, logistic regression, linear svm, lda, ...) and Naive Bayes
Non-linear models based on explicit data transformation (Nystroem kernel approximation, RVFL, RBFNN, EEM), Kernel methods (such as kernel SVM) and shallow neural networks
Random Forest and other committees
Big Neural Networks (ie. CNN)
KNN with arbitrary distance
Obviously this list is not exhaustive, it just shows some general ideas.
One way of obtaining such model is to build a complex, slow model, then use it as a black box label generator to train a simplier model (but on potentialy infinite training set) - thus getting a fast classifier at the cost of very expensive training. There are many works showing that one can do that for example by training a shallow neural network on outputs of deep nn.
In general classification speed should not be a problem. Some exceptions are algorithms which have a time complexity depending on the number of samples you have for training. One example is k-Nearest-Neighbors which has no training time, but for classification it needs to check all points (if implemented in a naive way). Other examples are all classifiers which work with kernels since they compute the kernel between the current sample and all training samples.
Many classifiers work with a scalar product of the features and a learned coefficient vector. These should be fast enough in almost all cases. Examples are: Logistic regression, linear SVM, perceptrons and many more. See #lejlot's answer for a nice list.
If these are still too slow you might try to reduce the dimension of your feature space first and then try again (this also speeds up training time).
Btw, this question might not be suited for StackOverflow as it is quite broad and recommendation instead of problem oriented. Maybe try https://stats.stackexchange.com/ next time.
I have a decision tree which is represented in the compressed form and which is at least 4 times faster than the actual tree in classifying an unseen instance.

SVM versus MLP (Neural Network): compared by performance and prediction accuracy

I should decide between SVM and neural networks for some image processing application. The classifier must be fast enough for near-real-time application and accuracy is important too. Since this is a medical application, it is important that the classifier has the low failure rate.
which one is better choice?
A couple of provisos:
performance of a ML classifier can refer to either (i) performance of the classifier itself; or (ii) performance of the predicate step: execution speed of the model-building algorithm. Particularly in this case, the answer is quite different depending on which of the two is intended in the OP, so i'll answer each separately.
second, by Neural Network, i'll assume you're referring to the most common implementation--i.e., a feed-forward, back-propagating single-hidden-layer perceptron.
Training Time (execution speed of the model builder)
For SVM compared to NN: SVMs are much slower. There is a straightforward reason for this: SVM training requires solving the associated Lagrangian dual (rather than primal) problem. This is a quadratic optimization problem in which the number of variables is very large--i.e., equal to the number of training instances (the 'length' of your data matrix).
In practice, two factors, if present in your scenario, could nullify this advantage:
NN training is trivial to parallelize (via map reduce); parallelizing SVM training is not trivial, but it's also not impossible--within the past eight or so years, several implementations have been published and proven to work (https://bibliographie.uni-tuebingen.de/xmlui/bitstream/handle/10900/49015/pdf/tech_21.pdf)
mult-class classification problem SVMs are two-class classifiers.They can be adapted for multi-class problems, but this is never straightforward because SVMs use direct decision functions. (An excellent source for modifying SVMs to multi-class problems is S. Abe, Support Vector Machines for Pattern Classification, Springer, 2005). This modification could wipe out any performance advantage SVMs have over NNs: So for instance, if your data has
more than two classes and you chose to configure the SVM using
successive classificstaion (aka one-against-many classification) in
which data is fed to a first SVM classifier which classifiers the
data point either class I or other; if the class is other then
the data point is fed to a second classifier which classifies it
class II or other, etc.
Prediction Performance (execution speed of the model)
Performance of an SVM is substantially higher compared to NN. For a three-layer (one hidden-layer) NN, prediction requires successive multiplication of an input vector by two 2D matrices (the weight matrices). For SVM, classification involves determining on which side of the decision boundary a given point lies, in other words a cosine product.
Prediction Accuracy
By "failure rate" i assume you mean error rate rather than failure of the classifier in production use. If the latter, then there is very little if any difference between SVM and NN--both models are generally numerically stable.
Comparing prediction accuracy of the two models, and assuming both are competently configured and trained, the SVM will outperform the NN.
The superior resolution of SVM versus NN is well documented in the scientific literature. It is true that such a comparison depends on the data, the configuration, and parameter choice of the two models. In fact, this comparison has been so widely studied--over perhaps all conceivable parameter space--and the results so consistent, that even the existence of a few exceptions (though i'm not aware of any) under impractical circumstances shouldn't interfere with the conclusion that SVMs outperform NNs.
Why does SVM outperform NN?
These two models are based on fundamentally different learing strategies.
In NN, network weights (the NN's fitting parameters, adjusted during training) are adjusted such that the sum-of-square error between the network output and the actual value (target) is minimized.
Training an SVM, by contrast, means an explicit determination of the decision boundaries directly from the training data. This is of course required as the predicate step to the optimization problem required to build an SVM model: minimizing the aggregate distance between the maximum-margin hyperplane and the support vectors.
In practice though it is harder to configure the algorithm to train an SVM. The reason is due to the large (compared to NN) number of parameters required for configuration:
choice of kernel
selection of kernel parameters
selection of the value of the margin parameter

Resources