The svm-train executable has a parameter e, which lets you set some epsilon. The description says only
set tolerance of termination criterion (default 0.001)
I don't find that informative enough and can't find a relevant explanation on the internet. Perhaps it is some well-known generic SVM parameter, but I'm not familiar enough with generic SVM.
I mean the epsilon used in classification, but not the epsilon used in regression ("in loss function of epsilon-SVR") and specified to libsvm with option -p.
The SVM problem is solved using numerical optimization. The solver is iterative, and you could in principle keep iterating until you reach an error of exactly zero, finding the exact solution to the problem (this never really happens in practice due to floating-point rounding errors). epsilon, in this case, is the tolerance for how close to zero the solution needs to be before we stop running iterations of the solver. 0.001 is generally a good value. Smaller values will take longer to train (requiring more iterations), but are not likely to result in a lower error rate, since the solution was close enough to begin with. 0.01 is also common; it takes less time to train (fewer iterations) but sometimes has a higher error rate on test data than a more exact solution.
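To see the practical effect, here is a small sketch of my own using scikit-learn (whose SVC wraps libsvm and exposes this same tolerance as tol); the synthetic data and the two tolerance values are just for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic data, purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for tol in (1e-3, 1e-2):
    acc = SVC(tol=tol).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"tol={tol}: test accuracy = {acc:.3f}")  # typically near-identical accuracies

The looser tolerance usually just finishes sooner; the resulting models are essentially the same.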
I came up with the following result, tested on many data sets, but I do not have a formal proof yet:
Theorem: The width $L$ of any confidence interval is asymptotically equal (as $n$ tends to infinity) to a power function of $n$, namely $L = A / n^B$, where $A$ and $B$ are two positive constants depending on the data set, and $n$ is the sample size.
See here and here for details. The $B$ exponent seems to be very similar to the Hurst exponent in time series, not only in terms of what it represents, but also in the values that it takes: $B = 1/2$ corresponds to perfect data (no auto-correlation or undesirable features) and $B = 1$ corresponds to "bad data", typically with strong auto-correlations.
Note that $B = 1/2$ is what everyone uses nowadays, assuming observations are independently and identically distributed, with an underlying normal distribution. I also devised a method to make the interval width converge faster to zero: $O(1/n)$ rather than $O(1/\sqrt{n})$. This is also described in section 3.3 of my article on re-sampling (here), and my approach in this context seems very much related to what are called second-order accurate intervals (usually achieved with modern versions of bootstrapping, see here).
My question is whether my theorem is original, ground-breaking, and correct, and how someone would prove it (or refute it).
Example of Confidence Interval
Perl code to produce confidence intervals for the correlation
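Separately from the linked Perl code, here is a rough, purely illustrative Python sketch (my own; it assumes simulated i.i.d. standard normal data and the classical 95% normal-theory interval for the mean) showing how one could estimate the exponent B empirically by regressing log-width on log n:

import numpy as np

rng = np.random.default_rng(0)

def mean_ci_width(n, n_rep=2000):
    # average width of the usual 95% normal-theory interval for the mean,
    # estimated by simulation on i.i.d. standard normal samples
    z = 1.959964  # 97.5% standard normal quantile
    widths = [2 * z * rng.normal(size=n).std(ddof=1) / np.sqrt(n) for _ in range(n_rep)]
    return np.mean(widths)

ns = np.array([100, 200, 400, 800, 1600, 3200])
widths = np.array([mean_ci_width(n) for n in ns])

# fit log(L) = log(A) + slope * log(n); the theorem's B is -slope
slope, log_A = np.polyfit(np.log(ns), np.log(widths), 1)
print("estimated B:", -slope)       # close to 0.5 for i.i.d. normal data
print("estimated A:", np.exp(log_A))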
The first problem is, what do you mean by confidence interval?
Let's say I do nonparametric estimation of a probability density function with a kernel density estimator.
A confidence interval has no meaning in this setting. However, you can compute something which is the "speed" of convergence of your kernel density estimator to your target function. Depending on the distance you choose between functions, you can get different speeds of convergence. For example, the best rate with the $L^{\infty}$ distance involves a $\log(n)$ factor.
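To illustrate, here is a small sketch of my own (scipy's gaussian_kde on simulated standard normal data; the grid and sample sizes are arbitrary) that tracks the sup-norm error of a kernel density estimate as $n$ grows, so you can see how this kind of error behaves with sample size:

import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(1)
grid = np.linspace(-3, 3, 301)

def sup_norm_error(n):
    # L-infinity distance between a Gaussian KDE and the true standard normal density
    kde = gaussian_kde(rng.normal(size=n))
    return np.max(np.abs(kde(grid) - norm.pdf(grid)))

for n in (200, 800, 3200, 12800):
    print(n, sup_norm_error(n))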
By the way, you give a counterexample yourself in your first article.
So for me your theorem cannot hold, for two reasons:
It is not clear: you need to specify exactly what you mean by a confidence interval, and what you mean by "depending on the data set" (does it depend on $N$, the number of observations?).
There is a counterexample, since the asymptotic speed of convergence of estimators can be more complicated than what you describe.
I am doing text classification with scikit-learn's logistic regression function (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). I am using grid search in order to choose a value for the C parameter. Do I need to do the same for the max_iter parameter? Why?
Both the C and max_iter parameters have default values in Sklearn, which means they need to be tuned. But, from what I understand, early stopping and L1/L2 regularization are two separate methods for avoiding overfitting, and performing one of them is enough. Am I incorrect in assuming that tuning the value of max_iter is equivalent to early stopping?
To summarize, here are my main questions:
1- Does max_iter need tuning? why? (the documentation says it is only useful for certain solvers)
2- Is tuning the max_iter equivalent to early stopping?
3- Should we perform early stopping and L1/L2 regularization at the same time?
Here are some simple responses to your numbered questions, grossly simplified:
Yes, sometimes you need to tune max_iter. Why? See next.
No. max_iter is the number of iterations that the logistic regression classifier's solver is allowed to step through before being stopped. The aim is to reach a "stable" solution for the parameters of the logistic regression model, i.e., it is an optimisation problem. If your max_iter is too low, you may not reach an optimal solution and your model is underfit. If your value is too high, you can essentially wait forever to have a solution for little gain in accuracy. You may also get stuck at local optima if max_iter is too low.
Yes or No.
a. L1/L2 regularisation is essentially "smoothing" of your complex model so that it does not overfit to the training data. If parameters become too large, they are penalised in the cost.
b. Early stopping is when you stop optimising your model (e.g., via gradient descent) at a stage you deem acceptable (before max_iter). For example, a metric such as RMSE can be used to define when to stop, or a comparison of the metrics from your test/training data.
c. When to use them? This is dependent on your problem. If you have a simple linear problem, with limited features, you will not need regularisation or early stopping. If you have thousands of features and experience overfitting then apply regularisation as one solution. If you do not want to wait for the optimisation to run to the end when you are playing with parameters as you only care about a certain level of accuracy, you could apply early stopping.
Finally, how do I tune max_iter correctly? This depends on your problem at hand. If you find your classification metric shows your model is performing poorly, it could be that your solver has not taken enough steps to reach a minimum. I'd suggest you do this by hand and look at the cost vs. max_iter to see if it is reaching a minimum properly rather than automate it.
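For completeness, here is a minimal sketch (synthetic data, arbitrary grid values, and a hypothetical max_iter of 200) of tuning C with cross-validated grid search while watching for convergence warnings, and only raising max_iter if the solver did not converge:

import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

# tune C via cross-validated grid search; keep max_iter fixed
grid = GridSearchCV(
    LogisticRegression(max_iter=200),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", ConvergenceWarning)
    grid.fit(X, y)

print("best C:", grid.best_params_["C"])
if any(issubclass(w.category, ConvergenceWarning) for w in caught):
    # the solver hit max_iter before converging: raise max_iter rather than "tuning" it
    print("solver did not converge; increase max_iter")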
Technically speaking, given a complex enough network and sufficient amounts of time, is it always possible to overfit any dataset to the point where training error is 0?
Neural networks are universal approximators, which pretty much means that as long as there exists a deterministic mapping f from input to output, there always exists a set of parameters (for a large enough network) that gives you an error arbitrarily close to the minimal possible error, but:
if the dataset is infinite (it is a distribution) then the minimal obtainable error (called the Bayes risk) can be greater than zero, but rather some value e (pretty much a measure of the "overlap" of different classes/values).
if the mapping f is non-deterministic then again there is a non-zero Bayes risk e (this is a mathematical way of saying that a given point can have "multiple" values, with given probabilities).
arbitrarily close does not mean minimal. So even if the minimal error is zero, it does not mean that you just need a "big enough" network to get to zero; you might always end up with a veeeery small epsilon (but you can decrease it as much as you want). For example, a network trained on a classification task with a sigmoid/softmax output can never attain minimal log loss (cross-entropy loss), since you can always move your activations "closer to 1" or "closer to 0", but you can never actually reach either of them.
So from a mathematical perspective the answer is no; from a practical point of view - under the assumption of a finite training set and a deterministic mapping - the answer is yes.
In particular, when you are asking about classification accuracy and you have a finite dataset with a unique label per data point, it is easy to construct by hand a neural network which has 100% accuracy. However, this does not mean minimal possible loss (as described above). Thus from the optimization perspective you are not obtaining "zero error".
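A quick illustration (my own sketch with scikit-learn's MLPClassifier on a tiny random dataset; the sizes and network width are arbitrary): a wide enough network typically memorises the training labels perfectly, yet the cross-entropy loss stays strictly positive:

import numpy as np
from sklearn.metrics import log_loss
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))       # 50 points, 10 features
y = rng.integers(0, 2, size=50)     # random binary labels, one label per point

clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=5000, random_state=0)
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))                    # typically 1.0
print("training log loss:", log_loss(y, clf.predict_proba(X)))  # > 0, never exactly 0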
Recall that when exponentially decaying the learning rate in TensorFlow one does:
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
The docs describe the staircase option as follows:
If the argument staircase is True, then global_step / decay_steps is an integer division and the decayed learning rate follows a staircase function.
When is it better to decay every X number of steps and follow a staircase function, rather than a smoother version that decays more and more with every step?
The existing answers didn't seem to describe this. There are two different behaviors being described as 'staircase' behavior.
From the feature request for staircase, the behavior is described as being a hand-tuned piecewise constant decay rate, so that a user could provide a set of iteration boundaries and a set of decay rates, to have the decay rate jump to the specified value after the iterations pass a given boundary.
If you look into the actual code for this feature pull request, you'll see that the PR isn't related much to the staircase option in the function arguments. Instead, it defines a wholly separate piecewise_constant operation, and the associated unit test shows how to define your own custom learning rate as a piecewise constant with learning_rate_decay.piecewise_constant.
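In plain Python, that hand-tuned behaviour amounts to something like the following (the boundaries and rates are made up for illustration; this is not the TensorFlow op itself):

boundaries = [10000, 20000]      # iteration boundaries (illustrative)
values = [0.1, 0.01, 0.001]      # one more rate than boundaries

def piecewise_constant_lr(step):
    # return the rate for the interval the current step falls into
    for boundary, value in zip(boundaries, values):
        if step <= boundary:
            return value
    return values[-1]

print(piecewise_constant_lr(5000), piecewise_constant_lr(15000), piecewise_constant_lr(30000))
# 0.1 0.01 0.001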
From the documentation on decaying the learning rate, the behavior is described as treating global_step / decay_steps as integer division, so for the first set of decay_steps steps, the division results in 0, and the learning rate is constant. Once you cross the decay_steps-th iteration, you get the decay rate raised to a power of 1, then a power of 2, etc. So you only observe decay rates at the particular powers, rather than smoothly varying across all the powers if you treated the global step as a float.
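Again in plain Python (a sketch of the formula above, not the TensorFlow code; the default values here are arbitrary), the only difference between the two behaviours is whether the exponent is floored:

def decayed_lr(step, lr=0.1, decay_rate=0.5, decay_steps=1000, staircase=False):
    # staircase=True floors the exponent, so the rate only drops every decay_steps steps
    exponent = step // decay_steps if staircase else step / decay_steps
    return lr * decay_rate ** exponent

for step in (0, 500, 1000, 1500, 2000):
    print(step, decayed_lr(step, staircase=True), round(decayed_lr(step), 4))
# staircase: 0.1, 0.1, 0.05, 0.05, 0.025  /  smooth: 0.1, 0.0707, 0.05, 0.0354, 0.025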
As to advantages, this is just a hyperparameter decision you should make based on your problem. Using the staircase option allows you to hold the decay rate constant, essentially like maintaining a higher temperature in simulated annealing for a longer time. This can allow you to explore more of the solution space by taking bigger strides in the gradient direction, at the cost of possibly noisy or unproductive updates. Meanwhile, smoothly increasing the decay rate power will steadily "cool" the exploration, which can limit you by leaving you stuck near a local optimum, but it can also prevent you from wasting time with noisily large gradient steps.
Whether one approach or the other is better (a) often doesn't matter very much and (b) usually needs to be specially tuned in the cases when it might matter.
Separately, as the feature request link mentions, the piecewise constant operation seems to be for very specifically tuned use cases, when you have separate evidence in favor of a hand-tuned decay rate based on collecting training metrics as a function of iteration. I would generally not recommend that for general use.
Good question.
For all I know, it is a preference of the research group.
Back in the old days, it was computationally more efficient to reduce the learning rate only every epoch. That's why some people prefer to use it nowadays.
Another, hand-wavy, story that people may tell is that it prevents getting stuck in local optima. By "suddenly" changing the learning rate, the weights might jump to a better basin. (I don't agree with this, but add it for completeness.)
I have been working on Support Vector Machines for about 2 months now. I have coded the SVM myself, and for the optimization problem of the SVM I have used Sequential Minimal Optimization (SMO) by Dr. John Platt.
Right now I am in the phase where I am going to grid search to find the optimal C value for my dataset. (Please find details of my project application and dataset here: SVM Classification - minimum number of input sets for each class)
I have successfully checked my custom implemented SVM's accuracy for C values ranging from 2^0 to 2^6. But now I am having some issues regarding the convergence of the SMO for C > 128.
For example, I have tried to find the alpha values for C = 128 and it takes a long time before it actually converges and successfully gives the alpha values.
The time taken for the SMO to converge is about 5 hours for C = 100. That is huge, I think (because SMO is supposed to be fast), though I'm getting good accuracy.
I am stuck right now because I cannot test the accuracy for higher values of C.
I am actually displaying the number of alphas changed in every pass of SMO and getting 10, 13, 8... alphas changing continuously. The KKT conditions assure convergence, so what is going on here?
Please note that my implementation is working fine for C <= 100 with good accuracy, though the execution time is long.
Please give me inputs on this issue.
Thank You and Cheers.
For most SVM implementations, training time can increase dramatically with larger values of C. To get a sense of how training time in a reasonably good implementation of SMO scales with C, take a look at the log-scale line for libSVM in the graph below.
SVM training time vs. C, from Sentelle et al.'s A Fast Revised Simplex Method for SVM Training: http://dmcer.net/StackOverflowImages/svm_scaling.png
You probably have two easy ways and one not so easy way to make things faster.
Let's start with the easy stuff. First, you could try loosening your convergence criterion. A strict criterion like epsilon = 0.001 will take much longer to train, while typically resulting in a model that is no better than a looser criterion like epsilon = 0.01. Second, you should try to profile your code to see if there are any obvious bottlenecks.
The not-so-easy fix would be to switch to a different optimization algorithm (e.g., SVM-RSQP from Sentelle et al.'s paper above). But if you have a working implementation of SMO, you should probably only do that as a last resort.
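If you want to see the scaling on your own machine, here is a rough sketch (using scikit-learn's SVC, which wraps libsvm, on synthetic data; the sizes and parameter values are arbitrary) that times training across C and tolerance values:

import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for C in (1, 100, 10000):
    for tol in (1e-3, 1e-2):
        start = time.perf_counter()
        SVC(C=C, tol=tol, kernel="rbf").fit(X, y)
        # training time generally grows with C; a looser tol usually helps
        print(f"C={C}, tol={tol}: {time.perf_counter() - start:.2f} s")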
If you want complete convergence, especially if C is large, it takes a very long time. You can consider using a looser stopping criterion and setting a maximum number of iterations (the default in libsvm is 1000000); with more iterations the time multiplies, but the reduction in loss is not worth the cost. The result may then not fully satisfy the KKT conditions (some support vectors end up inside the band and some non-support vectors outside it), but the error is small and acceptable. In my opinion, if higher accuracy is required, it is better to use another quadratic programming algorithm instead of the SMO algorithm.