Should I use cross validation or validation dataset? - machine-learning

I have a dataset of 9 classes and approximately 3000 images.
Should I use cross validation for a deep (4 conv layers, 4 max-pool layers, 2 FC layers, 2 dropouts and softmax) convolutional network under such conditions?

Probably yes, because the number of images per class isn't that large, especially if you create a 70-15-15% train-validation-test split. Hypothetically, if you train your classifier on 70% of your dataset and the dataset is equally divided over the classes, then each class contributes roughly 3000 * 0.7 / 9 ≈ 233 training images.
Another good reason to use cross validation is that it gives you a better picture of how well the classifier generalizes (and the number of training examples per fold is effectively higher).
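For reference, a minimal sketch of stratified k-fold cross-validation with scikit-learn. The images and labels are random placeholders, and a LogisticRegression stands in for the CNN described in the question so the example stays self-contained; with real data you would plug in your own model and features.

```python
# Stratified 5-fold cross-validation sketch for a small, 9-class dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((3000, 128))                 # placeholder: flattened image features
y = rng.integers(0, 9, size=3000)           # placeholder: labels for 9 classes

clf = LogisticRegression(max_iter=1000)     # stand-in for the CNN
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(clf, X, y, cv=cv)  # one accuracy score per fold
print("per-fold accuracy:", scores)
print("mean accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```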

Related

Using PCA on test set which has dimensionality less than number of principal components

I've been working on a classification problem with a dataset that has 800 samples and 5000 features. I used a dimensionality reduction technique, PCA, to reduce the dimensionality to around 120. This was done after I experimented with various numbers of principal components and chose the number that captured the most variance. I realize that the same principal components from the training stage must be used to transform the test set. However, I am confused about the situation where my test set has 100 samples and 5000 features: I realize the number of principal components cannot exceed 100, which is less than the 120 chosen during the training stage.
(https://stats.stackexchange.com/questions/28909/pca-when-the-dimensionality-is-greater-than-the-number-of-samples)
Should I be estimating the size of my test set with some certainty and then choosing my principal components during the training stage? I was wondering if somebody could point me to literature or any other Stack Overflow answer that deals with a similar problem. I'd really appreciate it.
Just to clarify and follow up on the previous comment: by "a dataset that has dimensionality around 800 x 5k" do you mean that you have a dataset consisting of 5000 samples with 800 features each? If so, then your test set should have the same number of features as your training dataset, i.e. 800. Training and test datasets are created by randomly splitting samples, not features.
As an example, let's say you randomly split your dataset into a training dataset of 4000 samples and a test dataset of 1000 samples. You would then train PCA on the training data set to reduce the number of features from 800 to something like 120. The PCA learned on the training dataset would then be applied to the 1000 samples in your test data set to reduce the number of features from 800 to 120.
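Here is a rough scikit-learn sketch of that point, using the numbers from the question (5000 features, 120 components, a test set of only 100 samples); the data is random and purely illustrative. The size of the test set places no limit on the number of components, because the test samples are only projected onto the components learned from the training set.

```python
# PCA is fit on training samples only and then applied to the test samples,
# however few of them there are.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.random((700, 5000))    # 700 training samples, 5000 features
X_test = rng.random((100, 5000))     # only 100 test samples

pca = PCA(n_components=120)
X_train_reduced = pca.fit_transform(X_train)   # learn the components on train
X_test_reduced = pca.transform(X_test)         # reuse them on test

print(X_train_reduced.shape)   # (700, 120)
print(X_test_reduced.shape)    # (100, 120) -- test size does not limit n_components
```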

Is this training dataset enough for training and testing classification model?

My training dataset contains just 2 classes with 40 features.
In case 1, class 1 has 35 samples and class 2 has 700 samples.
In case 2, class 1 has 65 samples and class 2 has the same number as above.
Is my training dataset enough for constructing a model using an SVM classifier or some other classifier?
I'm using WEKA. The testing options are 10-fold cross-validation and a 66% split, and I get very good results.
You are satisfied with the results, so that suggests you have enough data. It's hard to say how much data you need; it depends on exactly which problem you are solving, how much noise your data contains, which features you use, etc.
I described it here in the second part: https://stackoverflow.com/a/31567143/1030820
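Since the classes are heavily imbalanced (35 vs. 700), accuracy alone can look very good even for a weak model, so it is worth checking class-aware metrics as well. A hedged sketch of that check (scikit-learn rather than WEKA, random placeholder data shaped like case 1):

```python
# Stratified 10-fold CV on an imbalanced 2-class problem, reporting
# accuracy alongside balanced accuracy and macro F1.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((735, 40))                        # 40 features, as in the question
y = np.array([0] * 35 + [1] * 700)               # case 1: 35 vs 700 samples

clf = SVC(kernel="rbf", class_weight="balanced") # re-weight the rare class
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

results = cross_validate(clf, X, y, cv=cv,
                         scoring=["accuracy", "balanced_accuracy", "f1_macro"])
for name in ("test_accuracy", "test_balanced_accuracy", "test_f1_macro"):
    print(name, round(results[name].mean(), 3))
```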

How do I improve my Neural Network output?

I have a data set with 150 rows, 45 features and 40 outputs. I can overfit the training data easily, but I cannot obtain acceptable results on my cross-validation set.
With 25 hidden layers and quite a large number of iterations, I was able to get ~94% accuracy on my training set, which put a smile on my face. But the cross-validation result turned out to be less than 15%.
So to mitigate overfitting I started playing with the regularization parameter (lambda) and also the number of hidden layers. The best cross-validation result I could get was 24%, with 34% on the training set, using lambda=1, 70 hidden layers and 14000 iterations. Increasing the number of iterations also made it worse; I can't understand why I cannot improve the CV results with a larger lambda and more iterations.
Here is the lambda-hiddenLayer-iter combinations I have tried:
https://docs.google.com/spreadsheets/d/11ObRTg05lZENpjUj4Ei3CbHOh5mVzF7h9PKHq6Yn6T4/edit?usp=sharing
Any suggested way(s) of trying smarter regularizationParameter-hiddenLayer-iterations combinations? Or other ways of improving my NN? I'm using my MATLAB code from Andrew Ng's ML class (it uses the backpropagation algorithm).
It's very hard to learn anything from just 150 training examples with 45 features (and, if I read your question right, 40 possible output classes). You need far more labeled training examples if you want to learn a reasonable classifier - probably tens or hundreds of thousands if you really do have 40 possible classes. Even for binary classification or regression, you likely need thousands of examples if you have 45 meaningful features.
Some suggestions:
Overfitting occurs primarily when the structure of the neural network is too complex for the problem at hand. If the structure of the NN isn't too complex, increasing the number of iterations shouldn't decrease prediction accuracy.
70 hidden layers is quite a lot; you may try to dramatically decrease the number of hidden layers (to 3-15) and increase the number of iterations. It seems from your file that 15 hidden layers perform fine in comparison with 70 hidden layers.
While reducing the number of hidden layers, you may vary the number of neurons in the hidden layers (increase/decrease) and check how the results change.
I agree with Logan. What you see in your dataset makes perfect sense. If you simply train an NN classifier with 45 features for 40 classes, you will get great training accuracy because you have more features than output classes, so the model can basically "assign" each feature to one of the output classes; but the resulting model will be highly over-fitted and will probably not represent whatever you are modeling. Your significantly lower cross-validation results are therefore to be expected.
You should rethink your approach: why do you have 40 classes? Maybe you can turn your problem into a regression problem instead of a classification problem. Also try looking into other algorithms, Random Forest for example, or decrease the number of features significantly.
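As a hedged illustration of the advice above (smaller network plus stronger regularization), here is a scikit-learn sketch that cross-validates a small, heavily regularized network against a large, weakly regularized one. The data is a random placeholder with the same shape as the question (150 samples, 45 features, 40 classes), so the absolute scores are meaningless; the point is the comparison mechanics.

```python
# Compare a small, regularized MLP with a large one via cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((150, 45))
y = np.tile(np.arange(40), 4)[:150]          # 40 classes, 3-4 samples each

small_net = MLPClassifier(hidden_layer_sizes=(15,), alpha=1.0,   # strong L2
                          max_iter=2000, random_state=0)
large_net = MLPClassifier(hidden_layer_sizes=(70, 70), alpha=1e-4,
                          max_iter=2000, random_state=0)

for name, net in [("small + strong L2", small_net), ("large, weak L2", large_net)]:
    scores = cross_val_score(net, X, y, cv=3)
    print(name, "CV accuracy: %.3f" % scores.mean())
```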

big number of attributes best classifiers

I have a dataset built from 940 attributes and 450 instances, and I'm trying to find the best classifier to get the best results.
I have used every classifier that WEKA suggests (such as J48, costSensitive, combinations of several classifiers, etc.).
The best solution I have found is a J48 tree with an accuracy of 91.7778%,
and the confusion matrix is:
  a   b   <-- classified as
394  27 |  a = NON_C
 10  19 |  b = C
I want to get better results in the confusion matrix: at least 90% for both the TP and TN rates.
Is there something I can do to improve this (such as long-running classifiers that scan all options, or another idea I didn't think of)?
Here is the file:
https://googledrive.com/host/0B2HGuYghQl0nWVVtd3BZb2Qtekk/
Please help!!
I'd guess that you got a data set and just tried all possible algorithms...
Usually, it is good to think about the problem:
Find and work only with relevant features (attributes); otherwise the task can be noisy. Relevant features = features that have a high correlation with the class (NON_C, C).
Your dataset is biased, i.e. the number of NON_C instances is much higher than the number of C instances. Sometimes it can be helpful to train your algorithm on equal portions of positive and negative (in your case NON_C and C) examples, and cross-validate it on the natural (real) proportions.
The size of your training data is small in comparison with the number of features. Maybe increasing the number of instances would help...
...
There are quite a few things you can do to improve the classification results.
First, it seems that your training data is severely imbalanced. By training with that imbalance you are creating a significant bias in almost any classification algorithm.
Second, you have a larger number of features than examples. Consider using L1 and/or L2 regularization to improve the quality of your results.
Third, consider projecting your data into a lower-dimensional PCA space, say one retaining 90% of the variance. This will remove much of the noise in the training data.
Fourth, be sure you are training and testing on different portions of your data. From your description it seems like you are training and evaluating on the same data, which is a big no-no.
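A hedged sketch tying these points together (scikit-learn rather than WEKA, random placeholder data shaped like the question: 450 instances, 940 attributes, roughly 421 NON_C vs 29 C): hold out a stratified test set, keep about 90% of the variance with PCA, and use an L2-regularized classifier with class re-weighting.

```python
# Hold-out evaluation with PCA, L2 regularization and class re-weighting.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((450, 940))                        # 450 instances, 940 attributes
y = np.array([0] * 421 + [1] * 29)                # rough NON_C / C imbalance

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.9),                        # keep ~90% of the variance
    LogisticRegression(penalty="l2", C=1.0, class_weight="balanced",
                       max_iter=1000),
)
model.fit(X_tr, y_tr)
print(confusion_matrix(y_te, model.predict(X_te)))  # evaluated on held-out data
```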

Machine Learning - Support Vector Machines

I came across an SVM example, but I didn't understand it. I would appreciate it if somebody could explain how the prediction works. Please see the explanation below:
The dataset has 10,000 observations with 5 attributes (Sepal Width, Sepal Length, Petal Width, Petal Length, Label). The label is positive if the observation belongs to the I. setosa class, and negative if it belongs to some other class.
There are 6000 observations for which the outcome is known (i.e. they belong to the I. setosa class, so they get a positive label). The labels for the remaining 4000 are unknown, so the label was assumed to be negative. The 6000 observations plus 2500 randomly selected observations from the remaining 4000 form the set for 10-fold cross-validation. An SVM (with 10-fold cross-validation) is then used for machine learning on the 8500 observations and the ROC curve is plotted.
Where are we predicting here? The set has 6000 observations for which the values are already known. How did the remaining 2500 get negative labels? When the SVM is used, some observations that are positive get a negative prediction. The prediction didn't make any sense to me here. Why are those 1500 observations excluded?
I hope my explanation is clear. Please let me know if I haven't explained anything clearly.
I think that the issue is a semantic one: you refer to the set of 4000 samples as being both "unknown" and "negative" -- which of these applies is the critical difference.
If the labels for the 4000 samples are truly unknown, then I'd fit a one-class SVM using the 6000 labelled samples [c.f. validation below]. Predictions would then be generated by testing the N=4000 set to assess whether or not those samples belong to the setosa class.
If instead we have 6000 setosa and 4000 (known) non-setosa, we could construct a binary classifier on the basis of this data [c.f. validation below], and then use it to predict setosa vs. non-setosa on any other available unlabelled data.
Validation: usually, as part of the model construction process, you take only a subset of your labelled training data and use it to configure the model. For the unused subset, you apply the model to the data (ignoring the labels) and compare what your model predicts against the true labels in order to assess error rates. This applies to both the 1-class and the 2-class situations above.
Summary: if all of your data are labelled, then usually one will still make predictions for a subset of them (ignoring the known labels) as part of the model validation process.
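A minimal sketch of the two setups described above, assuming scikit-learn and toy Gaussian data (scaled down from the 6000/4000 in the question so it runs quickly):

```python
# (a) one-class SVM when the 4000 are truly unknown;
# (b) binary SVM when both positives and negatives are known.
import numpy as np
from sklearn.svm import OneClassSVM, SVC

rng = np.random.default_rng(0)
pos = rng.normal(loc=0.0, scale=1.0, size=(600, 4))   # "setosa-like" samples
unk = rng.normal(loc=3.0, scale=1.0, size=(400, 4))   # the other, unlabelled set

# Case (a): train on the positives only, then score the unknown set.
oc_svm = OneClassSVM(nu=0.05, gamma="scale").fit(pos)
print("fraction of the unknown set flagged as setosa-like:",
      (oc_svm.predict(unk) == 1).mean())

# Case (b): the second set is known to be negative -- ordinary binary SVM.
X = np.vstack([pos, unk])
y = np.array([1] * len(pos) + [0] * len(unk))
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print("binary SVM training accuracy:", clf.score(X, y))
```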
Your SVM classifier is trained to tell whether a new (unknown) instance is an instance of I. setosa or not. In other words, you are predicting whether the new, unlabeled instance is I. setosa or not.
You probably found incorrectly classified results because your training data has many more instances of the positive case than of the negative one. Also, it's common to have some margin of error.
Summarizing: your SVM classifier learned how to identify I. setosa instances; however, it was provided with too few examples of non-I. setosa instances, which is likely to give you a biased model.
