I am solving an optimization problem with log_sum_exp as the objective function. I solved it using CVX in Matlab, but the process was slow compared to other optimization problems, and I suspect this is because of the successive approximation method CVX uses for the log_sum_exp function.
Successive approximation does not seem to be mentioned in the CVXPY documentation, so I am guessing CVXPY does not use it. If successive approximation is what slows down the optimization, can I speed things up by switching to CVXPY?
Thank you!
No, CVXPY does not use successive approximation: its default solver, ECOS, supports the exponential cone directly, which can model log_sum_exp exactly.
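For illustration, a minimal sketch of such a problem in CVXPY (synthetic data; assumes the ECOS solver is installed):

import cvxpy as cp
import numpy as np

np.random.seed(0)
A = np.random.randn(20, 5)

x = cp.Variable(5)
# log_sum_exp is handled natively through the exponential cone,
# so no successive approximation loop is needed
prob = cp.Problem(cp.Minimize(cp.log_sum_exp(A @ x)), [cp.norm(x, 2) <= 1])
prob.solve(solver=cp.ECOS)
print(prob.value, x.value)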
Let's say I am doing a DNN regression task on some skewed data distribution, and I am currently using mean absolute error as the loss function.
Typical approaches in machine learning minimize the mean loss, but for a skewed distribution that is inappropriate; from a practical point of view it is better to minimize the median loss. One way, I think, is to penalize big losses with some coefficient so that the mean gets closer to the median. But how do I calculate that coefficient for an unknown distribution type? Are there other approaches? What would you advise?
(I am using tensorflow/keras)
Just use the mean absolute error loss function in Keras, instead of the mean squared error.
Minimizing the mean absolute error is pretty much equivalent to optimizing for the median, and in any case it is more robust to outliers or skewed data. You should have a look at all of the possible Keras losses:
https://keras.io/losses/
and obviously, you can create your own too.
But for most data sets it just empirically turns out that mean squared error gets you better accuracy, so I would recommend at least trying both methods before settling on the mean absolute one.
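For concreteness, a minimal sketch of trying both (the model here is a hypothetical stand-in for your own):

from tensorflow import keras

# Hypothetical regression model; replace with your own architecture
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1),
])

# Try each loss in turn and compare validation error empirically
model.compile(optimizer="adam", loss="mean_absolute_error")
# model.compile(optimizer="adam", loss="mean_squared_error")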
If you have skewed error distributions, you can use tfp.stats.percentile as your Keras loss function, with something like:
import tensorflow as tf
import tensorflow_probability as tfp

def loss_fn(y_true, y_pred):
    # 50th percentile (median) of the absolute errors in the batch
    return tfp.stats.percentile(tf.abs(y_true - y_pred), q=50)

model.compile(loss=loss_fn)
It provides gradients, so it works with Keras, although it isn't as fast as MAE/MSE.
https://www.tensorflow.org/probability/api_docs/python/tfp/stats/percentile
Customizing loss (/objective) functions is tough. Keras does theoretically allow you to do this, though they seem to have removed the documentation specifically describing it in their 2.0 release.
You can check their docs on loss functions for ideas, and then head over to the source code to see what kind of an API you should implement.
However, there are a number of issues filed by people who are having trouble with this, and the fact that they've removed the documentation on it is not inspiring.
Just remember that you have to use Keras's own backend to compute your loss function. If you get it working, please write a blog post, or update with an answer here, because this is something quite a few other people have struggled or are struggling with!
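As a starting point, here is a minimal sketch of a custom loss written purely in backend operations (the model is your own, hypothetical here):

from tensorflow.keras import backend as K

def custom_mae(y_true, y_pred):
    # All tensor math must go through the backend so Keras can differentiate it
    return K.mean(K.abs(y_true - y_pred), axis=-1)

model.compile(optimizer="adam", loss=custom_mae)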
In my understanding, regularization is there to avoid over/underfitting and to make the calculation faster.
Is that right?
Your understanding is partially correct. Regularization will not help with underfitting; it can protect (to some extent) from overfitting. Furthermore, it will not speed up calculations (it is actually more expensive to compute something with added regularization), but it can lead to a simpler optimization problem, and thus fewer steps required for convergence (as the resulting error surface is smoother).
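To make this concrete, a small sketch with synthetic data (alpha is the L2 regularization strength):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = X[:, 0] + 0.1 * rng.normal(size=50)

plain = LinearRegression().fit(X, y)
# L2 regularization adds alpha to the diagonal of X^T X, which shrinks the
# weights and improves the conditioning of the problem (a smoother error
# surface, hence faster convergence for iterative solvers)
ridge = Ridge(alpha=1.0).fit(X, y)
print(np.abs(plain.coef_).max(), np.abs(ridge.coef_).max())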
I'm using WEKA/LibSVM to train a classifier for a term extraction system. My data is not linearly separable, so I used an RBF kernel instead of a linear one.
I followed the guide from Hsu et al. and iterated over several values for both C and gamma. The parameters which worked best for classifying known terms (test and training material differ, of course) are rather high: C = 2^10 and gamma = 2^3.
So far the high parameters seem to work OK, yet I wonder whether they may cause problems further on, especially regarding overfitting. I plan to do another evaluation by extracting new terms, but those are costly as I need human judges.
Could anything still be wrong with my parameters, even if both evaluations turn out positive? Do I perhaps need another kernel type?
Thank you very much!
In general you have to perform cross-validation to answer whether the parameters are all right or whether they lead to overfitting.
From the "intuition" perspective, it looks like a highly overfitted model. A high value of gamma means that your Gaussians are very narrow (condensed around each point), which combined with a high C value will result in memorizing most of the training set. If you check the number of support vectors, I would not be surprised if it were 50% of your whole data. Another possible explanation is that you did not scale your data. Most ML methods, and SVMs in particular, require the data to be properly preprocessed. In particular, you should normalize (standardize) the input data so that it is more or less contained in the unit sphere.
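For reference, a minimal sketch of scaling plus a cross-validated grid search over C and gamma, mirroring the Hsu et al. procedure in scikit-learn rather than WEKA/LibSVM (random stand-in data):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)

# Standardize inside the pipeline so scaling is fit on training folds only
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = {"svc__C": 2.0 ** np.arange(-5, 11, 2),
        "svc__gamma": 2.0 ** np.arange(-13, 4, 2)}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)

# With scaled inputs, the best C and gamma usually land in a moderate range;
# also check how many support vectors the winning model keeps
print(search.best_params_)
print(search.best_estimator_.named_steps["svc"].n_support_)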
RBF seems like a reasonable choice, so I would keep using it. A high value of gamma is not necessarily a bad thing; it depends on the scale your data lives on. While a high C value can lead to overfitting, it is also affected by the scale, so in some cases it might be just fine.
If you think that your dataset is a good representation of the whole data, then you can use cross-validation to test your parameters and have some peace of mind.
What is the benefit of using gradient descent in the linear regression setting? It looks like we can solve the problem (finding the theta_0..theta_n that minimizes the cost function) analytically, so why would we still want to use gradient descent to do the same thing? Thanks!
When you use the normal equations for solving the cost function analytically you have to compute

theta = (X^T X)^(-1) X^T y

where X is your matrix of input observations and y your output vector. The problem with this operation is the time complexity of calculating the inverse of an n x n matrix, which is O(n^3); as n increases it can take a very long time to finish.
When n is low (n < 1000 or n < 10000) you can think of the normal equations as the better option for calculating theta; however, for greater values gradient descent is much faster, so the only reason is time :)
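A minimal sketch (synthetic data) showing that the two approaches agree:

import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=(100, 3))]  # prepend a bias column
y = X @ np.array([1.0, 2.0, -0.5, 0.3]) + 0.01 * rng.normal(size=100)

# Closed form: solve (X^T X) theta = X^T y rather than inverting explicitly
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the same least-squares cost
theta, lr = np.zeros(4), 0.1
for _ in range(1000):
    theta -= lr * X.T @ (X @ theta - y) / len(y)

print(theta_closed)
print(theta)  # the two estimates should agree closely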
You should provide more details about your problem: what exactly are you asking about? Are we talking about linear regression in one or many dimensions? Simple or generalized?
In general, why do people use GD?
it is easy to implement
it is a very generic optimization technique, so even if you change your model to a more general one you can still use it
So what about analytical solutions? Well, we do use them; your claim is simply false here (if we are talking in general). For example, the OLS method is a closed-form, analytical solution which is widely used. If you can use an analytical solution and it is computationally affordable (as sometimes GD is simply cheaper or faster), then you can, and even should, use it.
Nevertheless, this is always a matter of pros and cons: analytical solutions are strongly tied to the model, so implementing them can be inefficient if you plan to generalize or change your models in the future. They are sometimes less efficient than their numerical approximations, and sometimes simply harder to implement. If none of the above is true, you should use the analytical solution, and people really do.
To sum up, you would rather use GD over an analytical solution if:
you are considering changes in the model: generalizations, adding more complex terms/regularization/modifications
you need a generic method because you do not know much about the future of the code and the model (you are only one of the developers)
the analytical solution is more expensive computationally and you need efficiency
the analytical solution requires more memory than you have
the analytical solution is hard to implement and you need easy, simple code
I saw a very good answer from https://stats.stackexchange.com/questions/23128/solving-for-regression-parameters-in-closed-form-vs-gradient-descent
Basically, the reasons are:
1. For most nonlinear regression problems there is no closed-form solution.
2. Even in linear regression (one of the few cases where a closed-form solution is available), it may be impractical to use the formula; the linked answer shows an example of how this can happen with a very large design matrix.
Another reason is that gradient descent is immediately useful when you generalize linear regression, especially if the problem does not have a closed-form solution, as for example in the Lasso (which adds a regularization term consisting of the sum of absolute values of the weight vector).
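To illustrate why the Lasso needs an iterative method, here is a minimal subgradient descent sketch (synthetic data; real Lasso solvers use more sophisticated methods such as coordinate descent):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.01 * rng.normal(size=100)

lam, lr = 0.1, 0.01
w = np.zeros(5)
for _ in range(5000):
    # Subgradient of (1/2m) * ||Xw - y||^2 + lam * ||w||_1
    g = X.T @ (X @ w - y) / len(y) + lam * np.sign(w)
    w -= lr * g

print(np.round(w, 3))  # coefficients with small true values are shrunk toward zero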
I need to do matrix operations (mainly multiply and inverse) of a sparse matrix SparseMat in OpenCV.
I noticed that you can only iterate over and insert values into a SparseMat.
Is there an external code I can use? (or am I missing something?)
It's just that sparse matrices are not really suited to inversion or matrix-matrix multiplication, so it's quite reasonable that there is no built-in function for that. Sparse matrices are used more for matrix-vector multiplication (usually when solving linear systems iteratively).
What you can do is solve N linear systems (with the columns of the identity matrix as right-hand sides) to get the inverse matrix. But then you need N*N storage for the inverse anyway, so using a dense matrix with a usual decomposition algorithm would be a better way to do it, as the performance gain from N iterative solves won't be that high. Alternatively, a sparse direct solver like SuperLU or TAUCS may help, but I doubt that OpenCV has such functionality.
You should also consider whether you really need the inverse matrix at all. Often such problems can be solved by just solving a linear system, which can be done with a sparse matrix quite easily and fast via e.g. CG or BiCGStab.
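As an illustration of the "solve, don't invert" idea, a minimal sketch in Python with scipy.sparse (OpenCV itself has no such solver; in C++ you could do the same with a library like Eigen):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000
# Symmetric positive definite tridiagonal matrix (a 1-D Laplacian)
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Conjugate gradient solves A x = b without ever forming A^{-1}
x, info = spla.cg(A, b)
print(info == 0, np.linalg.norm(A @ x - b))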
You can convert a SparseMat to a Mat, do what operations you need and then convert back.
You can use the Eigen library directly; Eigen works together with OpenCV very well.