I'm training and cross-validating (10-fold) data using libSVM (with linear kernel).
The data consist 1800 fMRI intensity voxels represented as a single datapoint.
There are around 88 datapoints in the training-set-file for svm-train.
the training-set-file looks as follow:
+1 1:0.9 2:-0.2 ... 1800:0.1
-1 1:0.6 2:0.9 ... 1800:-0.98
...
I should also mention i'm using the svm-train script (came along with the libSVM package).
The problem is that when running svm-train - it's result as 100% accuracy!
This doesn't seem to reflect the true classification results!
The data isn't unbalanced since
#datapoints labeled +1 == #datpoints labeled -1
Iv'e also checked the scaler (scaling correctly), and also tried to change the labels randomly to see how it impacts the accuracy - and it's decreasing from 100% to 97.9%.
Could you please help me understand the problem?
If so, what can I do to fix it?
Thanks,
Gal Star
Make sure you include '-v 10' in the svmtrain option. I'm not sure your 100% accuracy comes from training sample or validation sample. It is very possible to get a 100% training accuracy since you have much less sample number than the feature number. But if your model suffers from overfitting, the validation accuracy may be low.
Related
I am playing some demos about recurrent neural network.
I noticed that the scale of my data in each column differs a lot. So I am considering to do some preprocess work before I throw data batches into my RNN. The close column is the target I want to predict in the future.
open high low volume price_change p_change ma5 ma10 \
0 20.64 20.64 20.37 163623.62 -0.08 -0.39 20.772 20.721
1 20.92 20.92 20.60 218505.95 -0.30 -1.43 20.780 20.718
2 21.00 21.15 20.72 269101.41 -0.08 -0.38 20.812 20.755
3 20.70 21.57 20.70 645855.38 0.32 1.55 20.782 20.788
4 20.60 20.70 20.20 458860.16 0.10 0.48 20.694 20.806
ma20 v_ma5 v_ma10 v_ma20 close
0 20.954 351189.30 388345.91 394078.37 20.56
1 20.990 373384.46 403747.59 411728.38 20.64
2 21.022 392464.55 405000.55 426124.42 20.94
3 21.054 445386.85 403945.59 473166.37 21.02
4 21.038 486615.13 378825.52 461835.35 20.70
My question is, is preprocessing the data with, say StandardScaler in sklearn necessary in my case? And why?
(You are welcome to edit my question)
It will be beneficial to normalize your training data. Having different features with widely different scales fed to your model will cause the network to weight the features not equally. This can cause a falsely prioritisation of some features over the others in the representation.
Despite that the whole discussion on data preprocessing is controversial either on when exactly it is necessary and how to correctly normalize the data for each given model and application domain there is a general consensus in Machine Learning that running a Mean subtraction as well as a general Normalization preprocessing step is helpful.
In the case of Mean subtraction, the mean of every individual feature is being subtracted from the data which can be interpreted as centering the data around the origin from a geometric point of view. This is true for every dimensionality.
Normalizing the data after the Mean subtraction step results in a normalization of the data dimensionality to approximately the same scale. Note that the different features will loose any prioritization over each other after this step as mentioned above. If you have good reasons to think that the different scales in your features bear important information that the network may need to truly understand the underlying patterns in your dataset, then a normalization will be harmful. A standard approach would be to scale the inputs to have mean of 0 and a variance of 1.
Further preprocessing operations may be helpful in specific cases such as performing PCA or Whitening on your data. Look into the awesome notes of CS231n (Setting up the data and the model) for further reference on these topics as well as for a more detailed explenation of the topics above.
Definetly yes. Most of neural networks work best with data beetwen 0-1 or -1 to 1(depends on output function). Also when some inputs are higher then others network will "think" they are more important. This can make learning very long. Network must first lower weights in this inputs.
I found this https://arxiv.org/abs/1510.01378
If you normalize it may improve convergence so you will get lower training times.
I am working on a classifier, by logistic regression, based on Spark ML.
and I wonder should I train the equal quantity of data for true , false.
I mean
When I want to classify people into male or female,
Is it ok that train a model with 100 male data + 100 female data.
The online people may 40% male and 60% female , but this percent is forcasted based on the past, so it can be change(like 30% female, 70% male)
In this situation.
what female/male percent of data should I train?
is this related with overfitting?
when If I trained a model with 40%female + 60%male, It is useless to classifying a field data composed with 70%female+30%male?
Spark classification sample data has 43 false, 57true.
https://github.com/apache/spark/blob/master/data/mllib/sample_binary_classification_data.txt
what means the true/false ratio of trainig data in logisticregression?
I am really not good at English, but hope you understand me.
It should not matter what ratio you use, as long as it is reasonable.
60:40, 30:70, 50:50, it's okay. Just make sure it's not too lopsided, like 99:1.
If the entire data set is 70:30 female:male, and you want to only use a subset of this dataset, going for a 60:40 female:male ratio will not kill you.
Consider the following example:
Your test data contains 99% males, and 1 % female.
Technically, you can classify all males correctly, ALL females incorrectly, and your algorithm would show an error of 1%. Seems pretty good right? No, because your data is too lopsided.
This low error is not a result of overfitting (high variance), but rather a result of a lopsided data set.
This is an extreme example, but you get the point.
here is the deal.
I am trying to make an SVM based POS tagger.
The feature vectors for the SVM was created with the help of format converters.
Now here is a screenshot of the training file that I am using.
http://tinypic.com/r/n4fn2r/8
I have 25 labels for various POS tags. when i use the java implementation or the command line tools for prediction i get the following results.
http://tinypic.com/r/2dtw5ky/8
I have tried with all the kernels available but it gave more or less the same results.
This is happening even when the training file is used as the testing file.
please help me out here..!!
P.S. I cannot share more than two links. Thus here is a snippet of the model file
svm_type c_svc
kernel_type rbf
gamma 0.000548546
nr_class 25
total_sv 431
rho -0.929467 1.01073 1.0531 1.03472 1.01585 0.953263 1.03027 -0.921365 0.984535 1.02796 1.01266 1.03374 0.949463 0.977925 0.986551 -0.920912 0.940926 -0.955562 0.975386 -0.981959 -0.884042 0.0516955 -0.980884 -0.966095 0.995091 1.023 1.01489 1.00308 0.948314 1.01137 -0.845876 0.968034 1.0076 1.00064 1.01335 0.942633 0.965703 0.979212 -0.861236 0.935055 -0.91739 0.970223 -0.97103 0.0743777 0.970321 -0.971215 -0.931582 0.972377 0.958193 0.931253 0.825797 0.954894 -0.972884 -0.941726 0.945077 0.922366 0.953999 -1.00503 0.840985 0.882229 -0.961742 0.791631 -0.984971 0.855911 -0.991528 -0.951211 -0.962096 -0.99213 -0.99708 -0.957557 -0.308987 -0.455442 -0.94881 -0.995319 -0.974945 -0.964637 -0.902152 -0.955258 -1.05287 -1.00614 -0.
update
Just trained the SVM with svm type as c-SVC and kernel type as linear. Which gave a non-zero(although very poor) accuracy.
As mentioned by #Pedrom, parameter choice is absolutely crucial when training SVMs. I suggest you have a look at this practical guide. Also, 431 words is nowhere near enough to train a 25-class model. You will definitely need more data.
That said, 0% accuracy is indeed odd. Can you please show us the commands you are using to train and evaluate the model?
I am using backpropogation algorithm for my model. It works perfectly fine a simple xor case and when I tested it for a smaller subset of my actual data.
There are 3 inputs in total and a single output(0,1,2)
I have split the data set into training set (80% amounting to approx 5.5k) and the rest 20% as validation data.
I use trainingRate and momentum for calculating the delta weights.
I have normalized the input as below
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(input_array)
I use 1 hidden layer with sigmoid and linear activation functions for input-hidden and hidden-output respectively.
I train with trainingRate = 0.0005, momentum = 0.6, Epochs = 100,000. Any higher trainingRate shoots up the error to Nan. momentum values between 0.5 and 0.9 works fine and any other value makes the error Nan.
I tried various number of nodes in the hidden layer such as 3,6,9,10 and the error converged to 4140.327574 in each case. I am not sure how to reduce this. Changing the activation functions doesn't help. I even tried adding another hidden layer with gaussian activation function but I cannot reduce the error whatsoever.
Is it because of the outliers? Do i need to clean those values from the training data?
Any suggestion would be of great help be it the activation function, hidden layers, etc. I had been trying to get this working for quite some time and I am sort of stuck now.
Well I'm having kind of a similar problem, still haven fixed it, but I can tell you a couple of things I have found. I think the net is overfitting, my error at some point goes down and then starts going up again, also the verification set... is this you case also?
Check if you are implementing well the "early stopping" algorithm, most of the times the problem is not the backpropagation, but the error analysis or the validation analysis.
Hope this helps!
I'm trying to use sk-learn's RandomForestClassifier for a binary classification task (positive and negative examples). My training data contains 1.177.245 examples with 40 features, in SVM-light format (sparse vectors) which I load using sklearn.dataset's load_svmlight_file. It produces a sparse matrix of 'feature values' (1.177.245 * 40) and one array of 'target classes' (1s and 0s, 1.177.245 of them). I don't know whether this is worrysome, but the trainingdata has 3552 positives and the rest are all negative.
As the sk-learn's RFC doesn't accept sparse matrices, I convert the sparse matrix to a dense array (if I'm saying that right? Lots of 0s for absent features) using .toarray(). I print the matrix before and after converting to arrays and that seems to be going all right.
When I initiate the classifier and start fitting it to the data, it takes this long:
[Parallel(n_jobs=40)]: Done 1 out of 40 | elapsed: 24.7min remaining: 963.3min
[Parallel(n_jobs=40)]: Done 40 out of 40 | elapsed: 27.2min finished
(is that output right? Those 963 minutes take about 2 and a half...)
I then dump it using joblib.dump.
When I re-load it:
RandomForestClassifier: RandomForestClassifier(bootstrap=True, compute_importances=True,
criterion=gini, max_depth=None, max_features=auto,
min_density=0.1, min_samples_leaf=1, min_samples_split=1,
n_estimators=1500, n_jobs=40, oob_score=False,
random_state=<mtrand.RandomState object at 0x2b2d076fa300>,
verbose=1)
And test it on real trainingdata (consisting out of 750.709 examples, exact same format as training data) I get "unexpected" results. To be exact; only one of the examples in the testingdata is classified as true. When I train on half the initial trainingdata and test on the other half, I get no positives at all.
Now I have no reason to believe anything is wrong with what's happening, it's just that I get weird results, and furthermore I think it's all done awfully quick. It's probably impossible to make a comparison, but training a RFClassifier on the same data using rt-rank (also with 1500 iterations, but with half the cores) takes over 12 hours...
Can anyone enlighten me whether I have any reason to believe something is not working the way it's supposed to? Could it be the ratio of positives to negatives in the training data? Cheers.
Indeed this dataset is very very imbalanced. I would advise you to subsample the negative examples (e.g. pick n_positive_samples of them at random) or to oversample the positive example (the latter is more expensive and but might yield better models).
Also are you sure that all your features are numerical features (larger values means something in real life)? If some of them are categorical integer markers, those feature should be exploded as one-of-k boolean encodings instead as scikit-learn implementation of random forest s cannot directly deal with categorical data.