Cross validation is very slow in Grid search (libsvm) - opencv

I am using libsvm on 62 classes with 2000 samples each. The problem is i wanted to optimize my parameters using grid search. i set the range to be C=[0.0313,0.125,0.5,2,8] and gamma=[0.0313,0.125,0.5,2,8] with 5-folds. the crossvalition does not finish at the first two parameters of each. Is there a faster way to do the optimization? Can i reduce the number of folds to 3 for instance? The number of iterations written keeps playing in (1629,1630,1627) range I don't know if that is related
optimization finished,
#iter = 1629 nu = 0.997175 obj = -81.734944, rho = -0.113838 nSV = 3250, nBSV = 3247

This is simply expensive task to find a good model. Lets do some calculations:
62 classes x 5 folds x 4 values of C x 4 values of Gamma = 4960 SVMs
You can always reduce the number of folds, which will decrease the quality of the search, but will reduce the whole amount of trained SVMs of about 40%.
The most expensive part is the fact, that SVM is not well suited for multi label classification. It needs to train at least O(log n) models (in the error correcting code scenario), O(n) (in libsvm one-vs-all) to even O(n^2) (in one-vs-one scenario, which achieves the best results).
Maybe it would be more valuable to switch to some fast multilabel model? Like for example some ELM (Extreme Learning Machine)?

Related

Can intercept and regression coefficients (Beta values) be very high?

I have 38 variables, like oxygen, temperature, pressure, etc and have a task to determine the total yield produced every day from these variables. When I calculate the regression coefficients and intercept value, they seem to be abnormal and very high (Impractical). For example, if 'temperature' coefficient was found to be +375.456, I could not give a meaning to them saying an increase in one unit in temperature would increase yield by 375.456g. That's impractical in my scenario. However, the prediction accuracy seems right. I would like to know, how to interpret these huge intercept( -5341.27355) and huge beta values shown below. One other important point is that I removed multicolinear columns and also, I am not scaling the variables/normalizing them because I need beta coefficients to have meaning such that I could say, increase in temperature by one unit increases yield by 10g or so. Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that all of these variables are linearly correlated, so I would suggest that you have a look at simple non-linear regression techniques, such as Decision Trees or Kernel Ridge Regression. These are however more difficult to interpret.
Going back to your issue, these high weights might well be due to there being some high amount of correlation between the variables, or that you simply don't have very much training data.
If you instead of linear regression use Lasso Regression, the solution is biased away from high regression coefficients, and the fit will likely improve as well.
A small example on how to do this in scikit-learn, including cross validation of the regularization hyper-parameter:
from sklearn.linear_model LassoCV
# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))
# Make y linear dependent on the features
y = np.sum(np.random.random((1,n_features)) * X, axis=1)
model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X,y)
print(model.intercept_)
If you have a linear regression, the formula looks like this (y= target, x= features inputs):
y= x1*b1 +x2*b2 + x3*b3 + x4*b4...+ c
where b1,b2,b3,b4... are your modl.coef_. AS you already realized one of your bigges number is 3.319+02 = 331 and the intercept is also quite big with -5431.
As you already mentioned the coeffiecient variables means how much the target variable changes, if the coeffiecient feature changes with 1 unit and all others features are constant.
so for your interpretation, the higher the absoult coeffienct, the higher the influence of your analysis. But it is important to note that the model is using a lot of high coefficient, that means your model is not depending only of one variable

How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar then the words should contain quite some overlap. However, a cosine theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I should find the optimal parameters to give to such functions and I do not know if these functions are satisfactory solutions. I was thinking that I have the cosine theta score, I can calculate the percentage of overlap between two sentences two (e.g. the amount of overlapping words divided by the amount of words in the article) and maybe some more interesting things. Then with the data, I could maybe write a function (what type of function I do not know and is part of the question!), after which I can minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with labels (0/1) in order to train a network. Is this worth the effort?
# Count words of two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate some sort of ratio between the overlap and the
# article length (since 1 overlapping word on 2 words is more important
# then 4 overlapping words on articles of 492 words).
p = min(len(v1), len(v2)) / len(v3)
numerator = sum([v1[w] * v2[w] for w in v3])
w1 = sum([v1[w]**2 for w in v1.keys()])
w2 = sum([v2[w]**2 for w in v2.keys()])
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity
if not denominator:
return 0.0
else:
return (float(numerator) / denominator)
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching" unless you have a labelled data set. If you have a labelled data set (I.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, in theory let's pretend you have a model which out of 100 samples only predicts 1 answer (Giving 99 unknowns). Technically if that one answer is correct, that is a model with 100% accuracy, but which has a very low recall. Generally in machine learning you will find a trade off between recall and accuracy.
Some people like to go for certain metrics which combine the two (The most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about accuracy - I would only want to target consumers who are likely to buy my product. If however, we are looking to test for a deadly disease or markers for bank fraud, then it's feasible for that test to be accurate only 10% of the time - if its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut off value which you believe indicates a good match. This is would then be more analogous to a binary clustering problem, and you could use some more abstract measure such as distance to a centroid to test which cluster (Either the "related" or "unrelated" cluster) the point belongs to. Note however that here your features feel like they would be incredibly hard to define.

Parameter tuning/model selection using resampling

I have been trying to get into more details of resampling methods and implemented them on a small data set of 1000 rows. The data was split into 800 training set and 200 validation set. I used K-fold cross validation and repeated K-fold cross validation to train the KNN using the training set. Based on my understanding I have done some interpretations of the results - however, I have certain doubts about them (see questions below):
Results :
10 Fold Cv
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.6600 0.07010791
7 0.6775 0.09432414
9 0.6800 0.07054371
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
Repeated 10 fold with 10 repeats
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.670250 0.10436607
7 0.676875 0.09288219
9 0.683125 0.08062622
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold, 1000 repeats
k Accuracy Kappa
5 0.6680438 0.09473128
7 0.6753375 0.08810406
9 0.6831800 0.07907891
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold with 2000 repeats
k Accuracy Kappa
5 0.6677981 0.09467347
7 0.6750369 0.08713170
9 0.6826894 0.07772184
Doubts:
While selecting the parameter, K=9 is the optimal value for highest accuracy. However, I don't understand how to take Kappa into consideration while finally choosing parameter value?
Repeat number has to be increased until we get stabilised result, the accuracy changes when the repeats are increased from 10 to 1000. However,the results are similar for 1000 repeats and 2000 repeats. Will it be right to consider the results of 1000/2000 repeats to be stabilised performance estimate?
Any thumb rule for the repeat number?
Finally,should I train the model on my complete training data (800 rows) now test the accuracy on the validation set ?
Accuracy and Kappa are just different classification performance metrics. In a nutshell, their difference is that Accuracy does not take possible class imbalance into account when calculating the metrics, while Kappa does. Therefore, with imbalanced classes, you might be better off using Kappa. With R caret you can do so via the train::metric parameter.
You could see a similar effect of slightly different performance results when running e.g. the 10CV with 10 repeats multiple times - you will just get slightly different results for those as well. Something you should look out for is the variance of classification performance over your partitions and repeats. In case you obtain a small variance you can derive that you by training on all your data, you likely obtain a model that will give you similar (hence stable) results on new data. But, in case you obtain a huge variance, you can derive that just by chance (being lucky or unlucky) you might instead obtain a model that either gives you rather good or rather bad performance on new data. BTW: the prediction performance variance is something e.g. R caret::train will give you automatically, hence I'd advice on using it.
See above: look at the variance and increase the repeats until you can e.g. repeat the whole process and obtain a similar average performance and variance of performance.
Yes, CV and resampling methods exist to give you information about how well your model will perform on new data. So, after performing CV and resampling and obtaining this information, you will usually use all your data to train a final model that you use in your e.g. application scenario (this includes both train and test partition!).

how can fixed parameters cost and gamma using libsvm matlab to improve accuracy?

I use libsvm to classify a data base that contain 1000 labels. I am new in libsvm and I found a problem to choose the parameters c and g to improve performance. First, here is the program that I use to set the parameters:
bestcv = 0;
for log2c = -1:3,
for log2g = -4:1,
cmd = ['-v 5 -c ', num2str(2^log2c), ' -g ', num2str(2^log2g)];
cv = svmtrain(yapp, xapp, cmd);
if (cv >= bestcv),
bestcv = cv; bestc = 2^log2c; bestg = 2^log2g;
end
fprintf('%g %g %g (best c=%g, g=%g, rate=%g)\n', log2c, log2g, cv, bestc, bestg, bestcv);
end
end
as a result, this program gives c = 8 and g = 2 and when I use these values
c and g, I found an accuracy rate of 55%. for classification, I use svm one against all.
numLabels=max(yapp);
numTest=size(ytest,1);
%# train one-against-all models
model = cell(numLabels,1);
for k=1:numLabels
model{k} = svmtrain(double(yapp==k),xapp, ' -c 1000 -g 10 -b 1 ');
end
%# get probability estimates of test instances using each model
prob_black = zeros(numTest,numLabels);
for k=1:numLabels
[~,~,p] = svmpredict(double(ytest==k), xtest, model{k}, '-b 1');
prob_black(:,k) = p(:,model{k}.Label==1); %# probability of class==k
end
%# predict the class with the highest probability
[~,pred_black] = max(prob_black,[],2);
acc = sum(pred_black == ytest) ./ numel(ytest) %# accuracy
The problem is that I need to change these parameters to increase performance. for example, when I put randomly c = 10000 and g = 100, I found a better accuracy rate: 70%.
Please I need help, how can I set theses parameters ( c and g) so to find the optimum accuracy rate? thank you in advance
Hyperparameter tuning is a nontrivial problem in machine learning. The simplest approach is what you've already implemented: define a grid of values, and compute the model on the grid until you find some optimal combination. A key assumption is that the grid itself is a good approximation of the surface: that it's fine enough to not miss anything important, but not so fine that you waste time computing values that are essentially the same as neighboring values. I'm not aware of any method to, in general, know ahead of time how fine a grid is necessary. As illustration: imagine that the global optimum is at $(5,5)$ and the function is basically flat elsewhere. If your grid is $(0,0),(0,10),(10,10),(0,10)$, you'll miss the optimum completely. Likewise, if the grid is $(0,0), (-10,-10),(-10,0),(0,-10)$, you'll never be anywhere near the optimum. In both cases, you have no hope of finding the optimum itself.
Some rules of thumb exist for SVM with RBF kernels, though: a grid of $\gamma\in\{2^{-15},2^{-14},...,2^5\}$ and $C \in \{2^{-5}, 2^{-4},...,2^{15}\}$ is one such recommendation.
If you found a better solution outside of the range of grid values that you tested, this suggests you should define a larger grid. But larger grids take more time to evaluate, so you'll either have to commit to waiting a while for your results, or move to a more efficient method of exploring the hyperparameter space.
Another alternative is random search: define a "budget" of the number of SVMs that you want to try out, and generate that many random tuples to test. This approach is mostly just useful for benchmarking purposes, since it's entirely unintelligent.
Both grid search and random search have the advantage of being stupidly easy to implement in parallel.
Better options fall in the domain of global optimization. Marc Claeson et al have devised the Optunity package, which uses particle swarm optimization. My research focuses on refinements of the Efficient Global Optimization algorithm (EGO), which builds up a Gaussian process as an approximation of the hyperparameter response surface and uses that to make educated predictions about which hyperparameter tuples are most likely to improve upon the current best estimate.
Imagine that you've evaluated the SVM at some hyperparameter tuple $(\gamma, C)$ and it has some out-of-sample performance metric $y$. An advantage to EGO-inspired methods is that it assumes that the values $y^*$ nearby $(\gamma,C)$ will be "close" to $y$, so we don't necessarily need to spend time exploring those tuples nearby, especially if $y-y_{min}$ is very large (where $y_{min}$ is the smallest $y$ value we've discovered). EGO will identify and evaluate the SVM at points where it estimates there is a high probability of improvement, so it will intelligently move through the hyper-parameter space: in the ideal case, it will skip over regions of low performance in favor of focusing on regions of high performance.

how to handle large number of features machine learning

I developed a image processing program that identifies what a number is given an image of numbers. Each image was 27x27 pixels = 729 pixels. I take each R, G and B value which means I have 2187 variables from each image (+1 for the intercept = total of 2188).
I used the below gradient descent formula:
Repeat {
θj = θj−α/m∑(hθ(x)−y)xj
}
Where θj is the coefficient on variable j; α is the learning rate; hθ(x) is the hypothesis; y is real value and xj is the value of variable j. m is the number of training sets. hθ(x), y are for each training set (i.e. that's what the summation sign is for). Further the hypothesis is defined as:
hθ(x) = 1/(1+ e^-z)
z= θo + θ1X1+θ2X2 +θ3X3...θnXn
With this, and 3000 training images, I was able to train my program in just over an hour and when tested on a cross validation set, it was able to identify the correct image ~ 67% of the time.
I wanted to improve that so I decided to attempt a polynomial of degree 2.
However the number of variables jumps from 2188 to 2,394,766 per image! It takes me an hour just to do 1 step of gradient descent.
So my question is, how is this vast number of variables handled in machine learning? On the one hand, I don't have enough space to even hold that many variables for each training set. On the other hand, I am currently storing 2188 variables per training sample, but I have to perform O(n^2) just to get the values of each variable multiplied by another variable (i.e. the polynomial to degree 2 values).
So any suggestions / advice is greatly appreciated.
try to use some dimensionality reduction first (PCA, kernel PCA, or LDA if you are classifying the images)
vectorize your gradient descent - with most math libraries or in matlab etc. it will run much faster
parallelize the algorithm and then run in on multiple CPUs (but maybe your library for multiplying vectors already supports parallel computations)
Along with Jirka-x1's answer, I would first say that this is one of the key differences in working with image data than say text data for ML: high dimensionality.
Second... this is a duplicate, see How to approach machine learning problems with high dimensional input space?

Resources