Deep learning: parameter selection and result interpretation [closed] - machine-learning

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 5 years ago.
Improve this question
I have a multiclass(7 labels) classification problem implemented in MLP. I am trying to classify 7 cancers based on some data. The overall accuracy is quite low, around 58%. However, some of the cancers is classified with accuracy around 90% for different parameters. Below cancer 1,2,3 etc means different types of cancer, For example 1= breast cancer, 2 = lung cancer etc. Now, for different parameter settings I get different classification accuracy. For example,
1. hyper parameters
learning_rate = 0.001
training_epochs = 10
batch_size = 100
hidden_size = 256
#overall accuracy 53%, cancer 2 accuracy 91%, cancer 5 accuracy 88%,
#cancer 6 accuracy 89%
2. hyper parameters
learning_rate = 0.01
training_epochs = 30
batch_size = 100
hidden_size = 128
#overall accuracy 56%, cancer 2 accuracy 86%, cancer 5 accuracy 93%,
#caner 6 accuracy 75%
As you can see, for different parameter settings I am getting totally different results. Cancer 1,3,4,7 have very low accuracy, so I excluded them. But cancer 2, 5,6 have comparatively better results. But, for cancer 6, the results vary by great number depending on the parameter settings.
An important note is, here overall accuracy is not important but if I can classify 2-3 cancers with more than 90% accuracy that is more important. So my question is, how do I interpret the results? In my paper how should I show the results? which parameter settings should I show/use? Or should I show different parameter settings for different cancer types? So basically, how to handle this type of situations?

Data Imbalance?
The first question you'll have to ask yourself is, do you have a balanced dataset, or do you have data imbalance? With this I mean, how many instances of each class do you have in your training and test datasets?
Suppose, for example, suppose that 90% of all the instances in your dataset are cancer 2, and the remaining 10% is spread out over the other classes. Then, you can very easily get 90% accuracy by implementing a very dumb classifier that simply classifies everything as cancer 2. This is probably not what you want out of your classifier though.
Interpretation of Results
I'd recommend reporting confusion matrices instead of just raw accuracy numbers. This will provide some information about which classes get confused for which other classes by the classifier, which may be interesting (e.g. different types of cancer may be similar to some extent if they often get confused for each other). Especially if you have data imbalance, I'd also recommend reporting other metrics such as Precision and/or Recall, instead of Accuracy.
Which parameter settings to use/show?
This depends on what problem you're really trying to solve. Is correct detection of every class equally important? If so, overall accuracy is probably the most important metric. Are certain classes more important to accurately detect than others? In this case you may want to look into "cost-sensitive classification", where different classification mistakes have different costs. If you simply don't know (don't have this domain knowledge), I'd recommend reporting as many different settings and metrics as you realistically can.

Related

Classification of Stock Prices Based on Probabilities

I'm trying to build a classifier to predict stock prices. I generated extra features using some of the well-known technical indicators and feed these values, as well as values at past points to the machine learning algorithm. I have about 45k samples, each representing an hour of ohlcv data.
The problem is actually a 3-class classification problem: with buy, sell and hold signals. I've built these 3 classes as my targets based on the (%) change at each time point. That is: I've classified only the largest positive (%) changes as buy signals, the opposite for sell signals and the rest as hold signals.
However, presenting this 3-class target to the algorithm has resulted in poor accuracy for the buy & sell classifiers. To improve this, I chose to manually assign classes based on the probabilities of each sample. That is, I set the targets as 1 or 0 based on whether there was a price increase or decrease.
The algorithm then returns a probability between 0 and 1(usually between 0.45 and 0.55) for its confidence on which class each sample belongs to. I then select some probability bound for each class within those probabilities. For example: I select p > 0.53 to be classified as a buy signal, p < 0.48 to be classified as a sell signal and anything in between as a hold signal.
This method has drastically improved the classification accuracy, at some points to above 65%. However, I'm failing to come up with a method to select these probability bounds without a large validation set. I've tried finding the best probability values within a validation set of 3000 and this has improved the classification accuracy, yet the larger the validation set becomes, it is clear that the prediction accuracy in the test set is decreasing.
So, what I'm looking for is any method by which I could discern what the specific decision probabilities for each training set should be, without large validation sets. I would also welcome any other ideas as to how to improve this process. Thanks for the help!
What you are experiencing is called non-stationary process. The market movement depends on time of the event.
One way I used to deal with it is to build your model with data in different time chunks.
For example, use data from day 1 to day 10 for training, and day 11 for testing/validation, then move up one day, day 2 to day 11 for training, and day 12 for testing/validation.
you can save all your testing results together to compute an overall score for your model. this way you have lots of test data and a model that will adapt to time.
and you get 3 more parameters to tune, #1 how much data to use for train, #2 how much data for test, # per how many days/hours/data points you retrain your data.

Why do Tensorflow tf.learn classification results vary a lot?

I use the TensorFlow high-level API tf.learn to train and evaluate a DNN classifier for a series of binary text classifications (actually I need multi-label classification but at the moment I check every label separately). My code is very similar to the tf.learn Tutorial
classifier = tf.contrib.learn.DNNClassifier(
hidden_units=[10],
n_classes=2,
dropout=0.1,
feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(training_set.data))
classifier.fit(x=training_set.data, y=training_set.target, steps=100)
val_accuracy_score = classifier.evaluate(x=validation_set.data, y=validation_set.target)["accuracy"]
Accuracy score varies roughly from 54% to 90%, with 21 documents in the validation (test) set which are always the same.
What does the very significant deviation mean? I understand there are some random factors (eg. dropout), but to my understanding the model should converge towards an optimum.
I use words (lemmas), bi- and trigrams, sentiment scores and LIWC scores as features, so I do have a very high-dimensional feature space, with only 28 training and 21 validation documents. Can this cause problems? How can I consistently improve the results apart from collecting more training data?
Update: To clarify, I generate a dictionary of occurring words and n-grams and discard those that occur only 1 time, so I only use words (n-grams) that exist in the corpus.
This has nothing to do with TensorFlow. This dataset is ridiculously small, thus you can obtain any results. You have 28 + 21 points, in a space which has "infinite" amount of dimensions (there are around 1,000,000 english words, thus 10^18 trigrams, however some of them do not exist, and for sure they do not exist in your 49 documents, but still you have at least 1,000,000 dimensions). For such problem, you have to expect huge variance of the results.
How can I consistently improve the results apart from collecting more training data?
You pretty much cannot. This is simply way to small sample to do any statistical analysis.
Consequently the best you can do is change evaluation scheme instead of splitting data to 28/21 do 10-fold cross validation, with ~50 points this means that you will have to run 10 experiments, each with 45 training documents and 4 testing ones, and average the result. This is the only thing you can do to reduce the variance, however remember that even with CV, dataset so small gives you no guarantees how well your model will actualy behave "in the wild" (once applied to never seen before data).

Cross Validation in Classification

I have two different datasets, datset X and dataset Y... From which I calculate features to use for classification..
Case 1. When I combine both together as one large datset then use 10 fold cross validation I get very good classification results with accuracy and AUC > 95%
Case2. Yet if I use one of the datasets for training and the other for testing, results fall severely low with both accuracy and AUC becoming ~ 50%
My questions are:
Which of the cases' results is more reliable??
And why the huge difference in results??
Thanks..
There could be a bias in the way the datasets were obtained that makes you get worst results.
Read this.
Another thing is that on one case you are training your classifier with a smaller dataset (the two combined is larger assuming they are about the same size, even with the 10 fold cross validation). This necessarily causes a poorer performance.
So my answers would be:
Depends on how you obtained both datasets and on how the final classifier will be used.
Differences in the size of the training set and bias on how they are obtained.

big number of attributes best classifiers

I have dataset which is built from 940 attributes and 450 instance and I'm trying to find the best classifier to get the best results.
I have used every classifier that WEKA suggest (such as J48, costSensitive, combinatin of several classifiers, etc..)
The best solution I have found is J48 tree with accuracy of 91.7778 %
and the confusion matrix is:
394 27 | a = NON_C
10 19 | b = C
I want to get better reuslts in the confution matrix for TN and TP at least 90% accuracy for each.
Is there something that I can do to improve this (such as long time run classifiers which scans all options? other idea I didn't think about?
Here is the file:
https://googledrive.com/host/0B2HGuYghQl0nWVVtd3BZb2Qtekk/
Please help!!
I'd guess that you got a data set and just tried all possible algorithms...
Usually, it is a good to think about the problem:
to find and work only with relevant features(attributes), otherwise
the task can be noisy. Relevant features = features that have high
correlation with class (NON_C,C).
your dataset is biased, i.e. number of NON_C is much higher than C.
Sometimes it can be helpful to train your algorithm on the same portion of positive and negative (in your case NON_C and C) examples. And cross-validate it on natural (real) portions
size of your training data is small in comparison with the number of
features. Maybe increasing number of instances would help ...
...
There are quite a few things you can do to improve the classification results.
First, it seems that your training data is severly imbalanced. By training with that imbalance you are creating a significant bias in almost any classification algorithm
Second, you have a larger number of features than examples. Consider using L1 and/or L2 regularization to improve the quality of your results.
Third, consider projecting your data into a lower dimension PCA space, say containing 90 % of the variance. This will remove much of the noise in the training data.
Fourth, be sure you are training and testing on different portions of your data. From your description it seems like you are training and evaluating on the same data, which is a big no no.

Is there a rule-of-thumb for how to divide a dataset into training and validation sets? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
Is there a rule-of-thumb for how to best divide data into training and validation sets? Is an even 50/50 split advisable? Or are there clear advantages of having more training data relative to validation data (or vice versa)? Or is this choice pretty much application dependent?
I have been mostly using an 80% / 20% of training and validation data, respectively, but I chose this division without any principled reason. Can someone who is more experienced in machine learning advise me?
There are two competing concerns: with less training data, your parameter estimates have greater variance. With less testing data, your performance statistic will have greater variance. Broadly speaking you should be concerned with dividing data such that neither variance is too high, which is more to do with the absolute number of instances in each category rather than the percentage.
If you have a total of 100 instances, you're probably stuck with cross validation as no single split is going to give you satisfactory variance in your estimates. If you have 100,000 instances, it doesn't really matter whether you choose an 80:20 split or a 90:10 split (indeed you may choose to use less training data if your method is particularly computationally intensive).
Assuming you have enough data to do proper held-out test data (rather than cross-validation), the following is an instructive way to get a handle on variances:
Split your data into training and testing (80/20 is indeed a good starting point)
Split the training data into training and validation (again, 80/20 is a fair split).
Subsample random selections of your training data, train the classifier with this, and record the performance on the validation set
Try a series of runs with different amounts of training data: randomly sample 20% of it, say, 10 times and observe performance on the validation data, then do the same with 40%, 60%, 80%. You should see both greater performance with more data, but also lower variance across the different random samples
To get a handle on variance due to the size of test data, perform the same procedure in reverse. Train on all of your training data, then randomly sample a percentage of your validation data a number of times, and observe performance. You should now find that the mean performance on small samples of your validation data is roughly the same as the performance on all the validation data, but the variance is much higher with smaller numbers of test samples
You'd be surprised to find out that 80/20 is quite a commonly occurring ratio, often referred to as the Pareto principle. It's usually a safe bet if you use that ratio.
However, depending on the training/validation methodology you employ, the ratio may change. For example: if you use 10-fold cross validation, then you would end up with a validation set of 10% at each fold.
There has been some research into what is the proper ratio between the training set and the validation set:
The fraction of patterns reserved for the validation set should be
inversely proportional to the square root of the number of free
adjustable parameters.
In their conclusion they specify a formula:
Validation set (v) to training set (t) size ratio, v/t, scales like
ln(N/h-max), where N is the number of families of recognizers and
h-max is the largest complexity of those families.
What they mean by complexity is:
Each family of recognizer is characterized by its complexity, which
may or may not be related to the VC-dimension, the description
length, the number of adjustable parameters, or other measures of
complexity.
Taking the first rule of thumb (i.e.validation set should be inversely proportional to the square root of the number of free adjustable parameters), you can conclude that if you have 32 adjustable parameters, the square root of 32 is ~5.65, the fraction should be 1/5.65 or 0.177 (v/t). Roughly 17.7% should be reserved for validation and 82.3% for training.
Last year, I took Prof: Andrew Ng’s online machine learning course. His recommendation was:
Training: 60%
Cross-validation: 20%
Testing: 20%
Well, you should think about one more thing.
If you have a really big dataset, like 1,000,000 examples, split 80/10/10 may be unnecessary, because 10% = 100,000 examples may be just too much for just saying that model works fine.
Maybe 99/0.5/0.5 is enough because 5,000 examples can represent most of the variance in your data and you can easily tell that model works good based on these 5,000 examples in test and dev.
Don't use 80/20 just because you've heard it's ok. Think about the purpose of the test set.
Perhaps a 63.2% / 36.8% is a reasonable choice. The reason would be that if you had a total sample size n and wanted to randomly sample with replacement (a.k.a. re-sample, as in the statistical bootstrap) n cases out of the initial n, the probability of an individual case being selected in the re-sample would be approximately 0.632, provided that n is not too small, as explained here: https://stats.stackexchange.com/a/88993/16263
For a sample of n=250, the probability of an individual case being selected for a re-sample to 4 digits is 0.6329.
For a sample of n=20000, the probability is 0.6321.
It all depends on the data at hand. If you have considerable amount of data then 80/20 is a good choice as mentioned above. But if you do not Cross-Validation with a 50/50 split might help you a lot more and prevent you from creating a model over-fitting your training data.
Suppose you have less data, I suggest to try 70%, 80% and 90% and test which is giving better result. In case of 90% there are chances that for 10% test you get poor accuracy.

Resources