Inconsistency in cross-validation results - machine-learning

I have a set of dataset recorded from subjects as they perform some particular cognitive task. The data consists of 16 channels and a number of sample points per channel and I want to classify this data according to the cognitive task being performed (everything is labelled).
The issue is that I do not have a large amount of data (approximately 60 trials per session, 30 for each cognitive task) and I have 2 sessions. I am trying to train a Linear Discriminant Analysis (LDA) classifier to classify this data. The classifier is later to be used in real-time to give some form of output every number of samples.
I made use of a 5-fold cross-validation to measure the generalization error of my classifier. The problem is that when I run this 5-fold cross validation a number of times, the results I get are not constant at all. Instead there's a significant variation in the overall accuracy (for example, first 5-fold cross validation may yield an average accuracy of 80%, 2nd yields an accuracy of 65%, 3rd yields an average of 72% etc...). Is this normal? If not, what could be the causes?

It sounds like you may have some bad data or your classifier is overfitting. You can perform Leave-one-out cross-validation and note your results. It can help to find data that may be biasing the results.

Related

How do I increase the sensitivity of a CNN autoencoder for anomaly detection, when my dataset contains mostly zero values?

I am learning to use autoencoders for anomaly detection. I am using the model provided on the keras website, with data that I have created. My ‘correct data’ used for training the model is a single frequency sine wave with random amplitudes and my ‘anomaly data’ used for testing the model is the original data, with a sine wave of a different frequency superimposed.
In the time domain the model works very well. However, in the frequency domain (I did a fft), because the data is almost all zeros with either 1 or 2 nonzero values, the sensitivity of the loss function to these data points is extremely low. The model is ultimately able to detect the different between the correct data and the anomalies with 100% precision but the loss is near 0 in all cases. If we had a scenario where instead of the non-dominant frequencies being 0, they were simply a small number, the ML model would surely have no chance of making the correct prediction. Please see the following graphs
Example of correct data + prediction
Example of anomaly data + prediction
Histogram of predictions, showing clear separation into 2 group
What can I do to make my model loss more sensitive to the desired data? One idea I have is to remove the zero data. However, this means that my input data length is not fixed. I don’t even know if the CNN- can handle this. I don’t want to switch to an LSTM as the convolution is important to me so that the network can handle various frequencies without being explicitly trained on those frequencies.
What other options do I have? Perhaps a loss function which can give higher importance to these values? I do understand that if have noisy data I can clean it but I am still concerned that it will not insufficient for less 'black and white' scenarios.

Why does my Random Forest Classifier perform better on test and validation data than on training data?

I'm currently training a random forest on some data I have and I'm finding that the model performs better on the validation set, and even better on the test set, than on the train set. Here are some details of what I'm doing - please let me know if I've missed any important information and I will add it in.
My question
Am I doing anything obviously wrong and do you have any advice for how I should improve my approach because I just can't believe that I'm doing it right when my model predicts significantly better on unseen data than training data!
Data
My underlying data consists of tables of features describing customer behaviour and a binary target (so this is a binary classification problem). Technically I have one such table per month and I tend to use several months of data to train and then a different month to predict (e.g. Train on Apr, May and Predict on Jun)
Generally this means I end up with a training dataset of about 100k rows and 20 features (I've previously looked into feature selection and found a set of 7 features which seem to perform best, so have been using these lately). My prediction set generally has around 50k rows.
My dataset is heavily unbalanced (approximately 2% incidence of target feature), so I'm using oversampling techniques - more on that below.
Method
I've searched around online quite a lot and this has led me to the following approach:
Take scaleable (continuous) features in the training data and standardise them (currently using sklearn StandardScaler)
Take categorical features and encode them into separate binary columns (one-hot) using Pandas get_dummies function
Remove 10% of the training data to form a validation set (I'm currently using a random seed in this process for comparability whilst I vary different things such as hyperparameters in the model)
Take the remaining 90% of training data and perform a grid search across a few parameters of the RandomForestClassifier() (currently min_samples_split, max_depth, n_estimators and max_features)
Within each hyperparameter combination from the grid I perform kfold validation with 5 folds and using a random state
Within each fold I oversample my minority class for training data only (sometimes using imbalanced-learn's RandomOverSampler() and sometimes using SMOTE() from the same package), train the model on the training data and then apply the model to the kth fold and record performance metrics (precision, recall, F1 and AUC)
Once I've been through 5 folds on each hyperparameter combination I find the best F1 score (and best precision if two combinations are tied on F1 score) and retrain a random forest on the entire 90% training data using those hyperparameters. During this step I use the same oversampling technique as I did in the kfold process
I then use this model to make predictions on the 10% of training data that I put aside earlier as a validation set, evaluating the same metrics as above
Finally I have a test set, which is actually based on data from another month, which I apply the already trained model to and evaluate the same metrics
Outcome
At the moment I'm finding that my training set achieves an F1 score of around 30%, the validation set is consistently slightly higher than this at around 36% (mostly driven by a much better precision than the training data e.g. 60% vs. 30%) and then the testing set is getting an F1 score of between 45% and 50% which is again driven by a better precision (around 65%)
Notes
Please do ask about any details I haven't mentioned; I've had my stuck in this for weeks and so have doubtless omitted some details
I've had a brief look (not a systematic analysis) of the stability of metrics between folds in the kfold validation and it seems that they aren't varying very much, so I'm fairly happy with the stability of the model here
I'm actually performing the grid search manually rather than using a Python pipeline because try as I might I couldn't get imbalanced-learn's Pipeline function to work with the oversampling functions and so I run a loop with combinations of hyperparameters, but I'm confident that this isn't impacting the results I've talked about above in an adverse way
When I apply the final model to the prediction data (and get an F1 score around 45%) I also apply it back to the training data itself out of interest and get F1 scores around 90% - 100%. I suppose this is to be expected as the model is trained and predicts on almost exactly the same data (except the 10% holdout validation set)

Machine Learning - Huge Only positive text dataset

I have a dataset with thousand of sentences belonging to a subject. I would like to know what would be best to create a classifier that will predict a text as "True" or "False" depending on whether they talk about that subject or not.
I've been using solutions with Weka (basic classifiers) and Tensorflow (neural network approaches).
I use string to word vector to preprocess the data.
Since there are no negative samples, I deal with a single class. I've tried one-class classifier (libSVM in Weka) but the number of false positives is so high I cannot use it.
I also tried adding negative samples but when the text to predict does not fall in the negative space, the classifiers I've tried (NB, CNN,...) tend to predict it as a false positive. I guess it's because of the sheer amount of positive samples
I'm open to discard ML as the tool to predict the new incoming data if necessary
Thanks for any help
I have eventually added data for the negative class and build a Multilineal Naive Bayes classifier which is doing the job as expected.
(the size of the data added is around one million samples :) )
My answer is based on the assumption that that adding of at least 100 negative samples for author’s dataset with 1000 positive samples is acceptable for the author of the question, since I have no answer for my question about it to the author yet
Since this case with detecting of specific topic is looks like particular case of topics classification I would recommend using classification approach with the two simple classes 1 class – your topic and another – all other topics for beginning
I succeeded with the same approach for face recognition task – at the beginning I built model with one output neuron with high level of output for face detection and low if no face detected
Nevertheless such approach gave me too low accuracy – less than 80%
But when I tried using 2 output neurons – 1 class for face presence on image and another if no face detected on the image, then it gave me more than 90% accuracy for MLP, even without using of CNN
The key point here is using of SoftMax function for the output layer. It gives significant increase of accuracy. From my experience, it increased accuracy of the MNIST dataset even for MLP from 92% up to 97% for the same model
About dataset. Majority of classification algorithms with a trainer, at least from my experience are more efficient with equal quantity of samples for each class in a training data set. In fact, if I have for 1 class less than 10% of average quantity for other classes it makes model almost useless for the detection of this class. So if you have 1000 samples for your topic, then I suggest creating 1000 samples with as many different topics as possible
Alternatively, if you don’t want to create a such big set of negative samples for your dataset, you can create a smaller set of negative samples for your dataset and use batch training with a size of batch = 2x your negative sample quantity. In order to do so, split your positive samples in n chunks with the size of each chunk ~ negative samples quantity and when train your NN by N batches for each iteration of training process with chunk[i] of positive samples and all your negative samples for each batch. Just be aware, that lower accuracy will be the price for this trade-off
Also, you could consider creation of more generic detector of topics – figure out all possible topics which can present in texts which your model should analyze, for example – 10 topics and create a training dataset with 1000 samples per each topic. It also can give higher accuracy
One more point about the dataset. The best practice is to train your model only with part of a dataset, for example – 80% and use the rest 20% for cross-validation. This cross-validation of unknown previously data for model will give you a good estimation of your model accuracy in real life, not for the training data set and allows to avoid overfitting issues
About building of model. I like doing it by "from simple to complex" approach. So I would suggest starting from simple MLP with SoftMax output and dataset with 1000 positive and 1000 negative samples. After reaching 80%-90% accuracy you can consider using CNN for your model, and also I would suggest increasing training dataset quantity, because deep learning algorithms are more efficient with bigger dataset
For text data you can use Spy EM.
The basic idea is to combine your positive set with a whole bunch of random samples, some of which you hold out. You initially treat all the random documents as the negative class, and train a classifier with your positive samples and these negative samples.
Now some of those random samples will actually be positive, and you can conservatively relabel any documents that are scored higher than the lowest scoring held out true positive samples.
Then you iterate this process until it stablizes.

What does inconsistent test results mean?

I'm doing some research on CNN for text classification using tensorflow. When I run my model I get a very high training accuracy (arround 100%). However, on test split I get an inconsistent accuracy results (sometimes 11% and sometimes 90%).
Moreover, I noticed also that the loss in training is decreasing until it reaches small numbers like 0.000499564048368, while in testing it is not and sometimes it gets high values like 70. What does this mean? Any ideas?
If you get very high training accuracy and bad testing accuracy, you are almost definitely overfitting. To get a better picture of what your models real accuracy is, use cross-validation.
Cross validation splits the dataset into a training and validation set, and does this multiple times, slightly changing the training and validation data each time. This is beneficial because it can prevent scenarios where you train your model on one label, and it can't accurately identify another one. For example, picture a training set like this:
Feature1, Feature2, Label
x y 0
a y 0
b c 1
If we train the model only on the first two datapoints, it will not be able to identify the third datapoint because it is not built generally.

Why do Tensorflow tf.learn classification results vary a lot?

I use the TensorFlow high-level API tf.learn to train and evaluate a DNN classifier for a series of binary text classifications (actually I need multi-label classification but at the moment I check every label separately). My code is very similar to the tf.learn Tutorial
classifier = tf.contrib.learn.DNNClassifier(
hidden_units=[10],
n_classes=2,
dropout=0.1,
feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(training_set.data))
classifier.fit(x=training_set.data, y=training_set.target, steps=100)
val_accuracy_score = classifier.evaluate(x=validation_set.data, y=validation_set.target)["accuracy"]
Accuracy score varies roughly from 54% to 90%, with 21 documents in the validation (test) set which are always the same.
What does the very significant deviation mean? I understand there are some random factors (eg. dropout), but to my understanding the model should converge towards an optimum.
I use words (lemmas), bi- and trigrams, sentiment scores and LIWC scores as features, so I do have a very high-dimensional feature space, with only 28 training and 21 validation documents. Can this cause problems? How can I consistently improve the results apart from collecting more training data?
Update: To clarify, I generate a dictionary of occurring words and n-grams and discard those that occur only 1 time, so I only use words (n-grams) that exist in the corpus.
This has nothing to do with TensorFlow. This dataset is ridiculously small, thus you can obtain any results. You have 28 + 21 points, in a space which has "infinite" amount of dimensions (there are around 1,000,000 english words, thus 10^18 trigrams, however some of them do not exist, and for sure they do not exist in your 49 documents, but still you have at least 1,000,000 dimensions). For such problem, you have to expect huge variance of the results.
How can I consistently improve the results apart from collecting more training data?
You pretty much cannot. This is simply way to small sample to do any statistical analysis.
Consequently the best you can do is change evaluation scheme instead of splitting data to 28/21 do 10-fold cross validation, with ~50 points this means that you will have to run 10 experiments, each with 45 training documents and 4 testing ones, and average the result. This is the only thing you can do to reduce the variance, however remember that even with CV, dataset so small gives you no guarantees how well your model will actualy behave "in the wild" (once applied to never seen before data).

Resources