Wierd behavoir while training an SVM classifier - machine-learning

I am searching for the best value of C (Cost parameter) for training my SVM classifier. Here is my code:
clear all; close all; clc
% Load training features and labels
[y, x] = libsvmread('training_data.train'); %the training dataset is named training_data.train
cost=[2^-7,2^-5,2^-3,2^-1,2^1,2^3,2^5,2^7,2^9,2^11,2^13,2^15];
accuracy=zeros(1,length(cost)); %This array will store the accuracy values corresponding to each element in the cost array
for i = 1:length(cost)
opt = sprintf('-c %i -v 3',cost(i));
accuracy(i)=svmtrain(y,x,opt);
end
accuracy
I am using the LIBSVM library. When I run this program, the accuracy array is populated with pretty weird values:
Here is the output:
Columns 1 through 8:
67.335 93.696 91.404 92.550 93.696 93.553 93.553 93.553
Columns 9 through 12:
93.553 93.553 93.553 93.553
This means that I get the highest cross-validation accuracy on 2^-5. Should I get the highest accuracy on the highest value of C? (As much as I understand, it is a penalty factor for misclassification). Is this behavior expected of it?
(I am building a classifier for breast cancer identification using the UCI ML database).

Should I get the highest accuracy on the highest value of C? (As much as I understand, it is a penalty factor for misclassification).
No, there is no guarantee, as the SVM cost is not accuracy-based, it uses a specific surrogate function which only roughly behaves like accuracy, but you can expect many random fluctuations. In general, you should expect high values for high C, but not necessarily the highest one in general.
Is this behavior expected of it? (I am building a classifier for breast cancer identification using the UCI ML database).
Yes, it is a possible outcome.

Related

How can I get the index of features for each score of rfecv?

I am useing recursive feature elimination and cross-validated (rfecv) in order to find the best accuracy score for features.
As I see _grid_scoresis the score the estimator produced when trained with the i-th subset of features. Is there any way to get the index of subset features for each score in the _grid_score?
I can get the index of the selected features for highest score using get_support ( 5 subset of features).
subset_features, scores
5 , 0.976251
4 , 0.9762072
3 , 0.97322212
How can I get the indexes of 4 or 3 subset of features?
I checked the output of rfecv.ranking_ and the 5 features have rank =1 , but the Rank= 2 only has one feature and so on.
A (single) subset of 3 (or 4) features was (probably) never chosen!
This seems to be a common misconception on how RFECV works; see How does cross-validated recursive feature elimination drop features in each iteration (sklearn RFECV)?. There's an RFE for each cross-validation fold (say 5), and each will produce its own set of 3 features (probably different). Unfortunately (in this case at least), those RFE objects are not saved, so you cannot identify which sets of features each fold has selected; only the score is saved (source pt1, pt2) for choosing the optimal number of features, and then another RFE is trained on the entire dataset to reduce to the final set of features.

Can intercept and regression coefficients (Beta values) be very high?

I have 38 variables, like oxygen, temperature, pressure, etc and have a task to determine the total yield produced every day from these variables. When I calculate the regression coefficients and intercept value, they seem to be abnormal and very high (Impractical). For example, if 'temperature' coefficient was found to be +375.456, I could not give a meaning to them saying an increase in one unit in temperature would increase yield by 375.456g. That's impractical in my scenario. However, the prediction accuracy seems right. I would like to know, how to interpret these huge intercept( -5341.27355) and huge beta values shown below. One other important point is that I removed multicolinear columns and also, I am not scaling the variables/normalizing them because I need beta coefficients to have meaning such that I could say, increase in temperature by one unit increases yield by 10g or so. Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that all of these variables are linearly correlated, so I would suggest that you have a look at simple non-linear regression techniques, such as Decision Trees or Kernel Ridge Regression. These are however more difficult to interpret.
Going back to your issue, these high weights might well be due to there being some high amount of correlation between the variables, or that you simply don't have very much training data.
If you instead of linear regression use Lasso Regression, the solution is biased away from high regression coefficients, and the fit will likely improve as well.
A small example on how to do this in scikit-learn, including cross validation of the regularization hyper-parameter:
from sklearn.linear_model LassoCV
# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))
# Make y linear dependent on the features
y = np.sum(np.random.random((1,n_features)) * X, axis=1)
model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X,y)
print(model.intercept_)
If you have a linear regression, the formula looks like this (y= target, x= features inputs):
y= x1*b1 +x2*b2 + x3*b3 + x4*b4...+ c
where b1,b2,b3,b4... are your modl.coef_. AS you already realized one of your bigges number is 3.319+02 = 331 and the intercept is also quite big with -5431.
As you already mentioned the coeffiecient variables means how much the target variable changes, if the coeffiecient feature changes with 1 unit and all others features are constant.
so for your interpretation, the higher the absoult coeffienct, the higher the influence of your analysis. But it is important to note that the model is using a lot of high coefficient, that means your model is not depending only of one variable

understanding test and validation set usage for early stop and model selection

I implemented an ANN (1 hidden layer of 64 units, learning rate = 0.001, epsilon = 0.001, iters = 500) with pythons OpenCV module. Train error ~ 3% and test error ~ 12%
In order to improve the accruacy/ generalisation of my NN I decided to proceed by- implementing model selection (of #hidden units and learning rate) to get an accurate value of hyperparameters and plotting learning curves to determine if more data is needed (currently have 2.5k).
Having read some sources regarding NN training and model selection, I'm very confused on the following matter -
1) In order to perform model selection, I know the following needs to be done-
create set possibleHiddenUnits {4, 8, 16, 32, 64}
randomly select Tr & Va sets from the total set of Tr + Va with some split e.g. 80/20
foreach ele in possibleHiddenUnits
(*) compute weights for the NN using backpropagation and an iterative optimisation algorithm like Gradient Descent (where we provide the termination criteria in the form of number of iterations / epsilon)
compute Validation set error using these trained weights
select the number of hidden units which min Va set error
Alternatively, I believe we can also use k-fold cross validation.
a. how do you decide what the number of iterations/ epsilon for GD should be?
b. does 1 iteration out of x iterations of GD (where the entire training set is used to compute the gradients of cost wrt weights through backprop) constitute an 'epoch'?
2) Sources (whats is the difference between train, validation and test set, in neural networks? and How to use k-fold cross validation in a neural network) mention that the training for a NN is done in the following way as it prevents over-fitting
for each epoch
for each training data instance
propagate error through the network
adjust the weights
calculate the accuracy over training data
for each validation data instance
calculate the accuracy over the validation data
if the threshold validation accuracy is met
exit training
else
continue training
a. I believe this method should be executed once the model selection has been done. But then how do we avoid overfitting of the model in step (*) of the model selection process above?
b. Am I right in assuming that one epoch constitues one iteration of training where weights are calculated using the entire Tr set through GD + backprop and GD involves x (>1) iters over the entire Tr set to calculate the weights ?
Also, out off 1b and 2b which is correct?
This is more of a comment but since I cant make comments yet I write it here. Have you tried other methods like l2 regularization or dropout? I dont know a lot about model selection but dropout has a very similiar effect like taking lots of models and averaging them. Normaly dropout should do the trick and you wont have problems with overfitting anymore.

Normalizing feature values for SVM

I've been playing with some SVM implementations and I am wondering - what is the best way to normalize feature values to fit into one range? (from 0 to 1)
Let's suppose I have 3 features with values in ranges of:
3 - 5.
0.02 - 0.05
10-15.
How do I convert all of those values into range of [0,1]?
What If, during training, the highest value of feature number 1 that I will encounter is 5 and after I begin to use my model on much bigger datasets, I will stumble upon values as high as 7? Then in the converted range, it would exceed 1...
How do I normalize values during training to account for the possibility of "values in the wild" exceeding the highest(or lowest) values the model "seen" during training? How will the model react to that and how I make it work properly when that happens?
Besides scaling to unit length method provided by Tim, standardization is most often used in machine learning field. Please note that when your test data comes, it makes more sense to use the mean value and standard deviation from your training samples to do this scaling. If you have a very large amount of training data, it is safe to assume they obey the normal distribution, so the possibility that new test data is out-of-range won't be that high. Refer to this post for more details.
You normalise a vector by converting it to a unit vector. This trains the SVM on the relative values of the features, not the magnitudes. The normalisation algorithm will work on vectors with any values.
To convert to a unit vector, divide each value by the length of the vector. For example, a vector of [4 0.02 12] has a length of 12.6491. The normalised vector is then [4/12.6491 0.02/12.6491 12/12.6491] = [0.316 0.0016 0.949].
If "in the wild" we encounter a vector of [400 2 1200] it will normalise to the same unit vector as above. The magnitudes of the features is "cancelled out" by the normalisation and we are left with relative values between 0 and 1.

how to handle large number of features machine learning

I developed a image processing program that identifies what a number is given an image of numbers. Each image was 27x27 pixels = 729 pixels. I take each R, G and B value which means I have 2187 variables from each image (+1 for the intercept = total of 2188).
I used the below gradient descent formula:
Repeat {
θj = θj−α/m∑(hθ(x)−y)xj
}
Where θj is the coefficient on variable j; α is the learning rate; hθ(x) is the hypothesis; y is real value and xj is the value of variable j. m is the number of training sets. hθ(x), y are for each training set (i.e. that's what the summation sign is for). Further the hypothesis is defined as:
hθ(x) = 1/(1+ e^-z)
z= θo + θ1X1+θ2X2 +θ3X3...θnXn
With this, and 3000 training images, I was able to train my program in just over an hour and when tested on a cross validation set, it was able to identify the correct image ~ 67% of the time.
I wanted to improve that so I decided to attempt a polynomial of degree 2.
However the number of variables jumps from 2188 to 2,394,766 per image! It takes me an hour just to do 1 step of gradient descent.
So my question is, how is this vast number of variables handled in machine learning? On the one hand, I don't have enough space to even hold that many variables for each training set. On the other hand, I am currently storing 2188 variables per training sample, but I have to perform O(n^2) just to get the values of each variable multiplied by another variable (i.e. the polynomial to degree 2 values).
So any suggestions / advice is greatly appreciated.
try to use some dimensionality reduction first (PCA, kernel PCA, or LDA if you are classifying the images)
vectorize your gradient descent - with most math libraries or in matlab etc. it will run much faster
parallelize the algorithm and then run in on multiple CPUs (but maybe your library for multiplying vectors already supports parallel computations)
Along with Jirka-x1's answer, I would first say that this is one of the key differences in working with image data than say text data for ML: high dimensionality.
Second... this is a duplicate, see How to approach machine learning problems with high dimensional input space?

Resources