I am trying to classify a data using Supervised machine learning algorithms.Everything's working fine, but just for my curiosity, I tried 6 classification algorithms simultaneously on a single data set. Steps followed are as follows-
1> Train all the algorithms.
2> predicted the result(either 1 or 0) for all test_data individually, by all algorithms.
3> If most of the algos gave 0, i considered the result for that data pair to be 0, similarly for result 1.
4> Then i found out the overall accuracy.
I expected the overall accuracy to be higher then the individual results(By each algorithm working individually), But i got almost the average accuracy.(Average here means average of accuracies of individual algos).
Can anyone please help me to find the reason?
This depends on the algorithms you picked. Many algorithms are sensitive to different things. For instance, k-means, linear SVM, and power iteration clustering will get markedly different results.
You got what you asked for: you averaged the votes, without coordinating the algorithms in any way. You got an average result.
I doubt that weighted averaging will help much; all you're doing there is training a meta-model. Instead, consider the data set you have. You need to research modelling algorithms and pick one that tends to work well on the statistical shape of your data set with respect to the desired purpose. Since you've given us none of this background, we can't help with specifics.
Related
I was using Keras' CNN to classify MNIST dataset. I found that using different batch-sizes gave different accuracies. Why is it so?
Using Batch-size 1000 (Acc = 0.97600)
Using Batch-size 10 (Acc = 0.97599)
Although, the difference is very small, why is there even a difference?
EDIT - I have found that the difference is only because of precision issues and they are in fact equal.
That is because of the Mini-batch gradient descent effect during training process. You can find good explanation Here that I mention some notes from that link here:
Batch size is a slider on the learning process.
Small values give a learning process that converges quickly at the
cost of noise in the training process.
Large values give a learning
process that converges slowly with accurate estimates of the error
gradient.
and also one important note from that link is :
The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a
given computational cost, across a wide range of experiments. In all
cases the best results have been obtained with batch sizes m = 32 or
smaller
Which is the result of this paper.
EDIT
I should mention two more points Here:
because of the inherent randomness in machine learning algorithms concept, generally you should not expect machine learning algorithms (like Deep learning algorithms) to have same results on different runs. You can find more details Here.
On the other hand both of your results are too close and somehow they are equal. So in your case we can say that the batch size has no effect on your network results based on the reported results.
This is not connected to Keras. The batch size, together with the learning rate, are critical hyper-parameters for training neural networks with mini-batch stochastic gradient descent (SGD), which entirely affect the learning dynamics and thus the accuracy, the learning speed, etc.
In a nutshell, SGD optimizes the weights of a neural network by iteratively updating them towards the (negative) direction of the gradient of the loss. In mini-batch SGD, the gradient is estimated at each iteration on a subset of the training data. It is a noisy estimation, which helps regularize the model and therefore the size of the batch matters a lot. Besides, the learning rate determines how much the weights are updated at each iteration. Finally, although this may not be obvious, the learning rate and the batch size are related to each other. [paper]
I want to add two points:
1) When use special treatments, it is possible to achieve similar performance for a very large batch size while speeding-up the training process tremendously. For example,
Accurate, Large Minibatch SGD:Training ImageNet in 1 Hour
2) Regarding your MNIST example, I really don't suggest you to over-read these numbers. Because the difference is so subtle that it could be caused by noise. I bet if you try models saved on a different epoch, you will see a different result.
I've trained a deep neural network of a few hundreds of features which analyzes geo data of a city, and calculate a score per sample based on the profile between the observer and the target location. That is, the longer the distance between the observer and target, the more features I will have for this sample. When I train my NN with samples from part of a city and test with other parts of the same city, the NN works very well, but when I apply my NN to other cities, the NN starts to give high standard deviation of errors, especially on cases which the samples of the city I'm applying the NN to generally has more features than samples of the city I used to train this NN. To deal with that, I've appended 10% of empty samples in training which was able to reduce the errors by half, but the remaining errors are still too large compare to the solutions calculated by hand. May I have some advise of generalize a regression neural network? Thanks!
I was going to ask for more examples of your data, and your network, but it wouldn't really matter.
How to improve the generalization of a regression neural network?
You can use exactly the same things you would use for a classification neural network. The only difference is what it does with the numbers that are output from the penultimate layer!
I've appended 10% of empty samples in training which was able to reduce the errors by half,
I didn't quite understand what that meant (so I'd still be interested if you expanded your question with some more concrete details), but it sounds a bit like using dropout. In Keras you append a Dropout() layer between your other layers:
...
model.append(Dense(...))
model.append(Dropout(0.2))
model.append(Dense(...))
...
0.2 means 20% dropout, which is a nice starting point: you could experiment with values up to about 0.5.
You could read the original paper or this article seems to be a good introduction with keras examples.
The other generic technique is to add some L1 and/or L2 regularization, here is the manual entry.
I typically use a grid search to experiment with each of these, e.g. trying each of 0, 1e-6, 1e-5 for each of L1 and L2, and each of 0, 0.2, 0.4 (usually using the same value between all layers, for simplicity) for dropout. (If 1e-5 is best, I might also experiment with 5e-4 and 1e-4.)
But, remember that even better than the above are more training data. Also consider using domain knowledge to add more data, or more features.
Based on data that our business department supplied to us, I used the sklearn decision tree algorithm to determine the ROC_AUC for a binary classification problem.
The data consists of 450 rows and there are 30 features in the data.
I used 10 times StratifiedKFold repetition/split of training and test data. As a result, I got the following ROC_AUC values:
0.624
0.594
0.522
0.623
0.585
0.656
0.629
0.719
0.589
0.589
0.592
As I am new in machine learning, I am unsure whether such a variation in the ROC_AUC values can be expected (with minimum values of 0.522 and maximum values of 0.719).
My questions are:
Is such a big variation to be expected?
Could it be reduced with more data (=rows)?
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Well, you do k-fold splits to actually evaluate how well your model generalizes.
Therefore, from your current results I would assume the following:
This is a difficult problem, the AUCs are usually low.
0.71 is an outlier, you were just lucky there (probably).
Important questions that will help us help you:
What is the proportion of the binary classes? Are they balanced?
What are the features? Are they all continuous? If categorical, are they ordinal or nominal?
Why Decision Tree? Have you tried other methods? Logistic Regression for instance is a good start before you move on to more advanced ML methods.
You should run more iterations, instead of k fold use the ShuffleSplit function and run at least 100 iterations, compute the Average AUC with 95% Confidence Intervals. That will give you a better idea of how well the models perform.
Hope this helps!
Is such a big variation to be expected?
This is a textbook case of high variance.
Depending on the difficulty of your problem, 405 training samples may not be enough for it to generalize properly, and the random forest may be too powerful.
Try adding some regularization, by limiting the number of splits that the trees are allowed to make. This should reduce the variance in your model, though you might expect a potentially lower average performance.
Could it be reduced with more data (=rows)?
Yes, adding data is the other popular way of lowering the variance of your model. If you're familiar with deep learning, you'll know that deep models usually need LOTS of samples to learn properly. That's because they are very powerful models with an intrinsically high variance, and therefore a lot of data is needed for them to generalize.
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Variance will decrease with regularization and adding data, it has no relation to the actual performance "number" that you get.
Cheers
I am trying to build a classifier to predict breast cancer using the UCI dataset. I am using support vector machines. Despite my most sincere efforts to improve upon the accuracy of the classifier, I cannot get beyond 97.062%. I've tried the following:
1. Finding the most optimal C and gamma using grid search.
2. Finding the most discriminative feature using F-score.
Can someone suggest me techniques to improve upon the accuracy? I am aiming at at least 99%.
1.Data are already normalized to the ranger of [0,10]. Will normalizing it to [0,1] help?
2. Some other method to find the best C and gamma?
For SVM, it's important to have the same scaling for all features and normally it is done through scaling the values in each (column) feature such that the mean is 0 and variance is 1. Another way is to scale it such that the min and max are for example 0 and 1. However, there isn't any difference between [0, 1] and [0, 10]. Both will show the same performance.
If you insist on using SVM for classification, another way that may result in improvement is ensembling multiple SVM. In case you are using Python, you can try BaggingClassifier from sklearn.ensemble.
Also notice that you can't expect to get any performance from a real set of training data. I think 97% is a very good performance. It is possible that you overfit the data if you go higher than this.
some thoughts that have come to my mind when reading your question and the arguments you putting forward with this author claiming to have achieved acc=99.51%.
My first thought was OVERFITTING. I can be wrong, because it might depend on the dataset - But the first thought will be overfitting. Now my questions;
1- Has the author in his article stated whether the dataset was split into training and testing set?
2- Is this acc = 99.51% achieved with the training set or the testing one?
With the training set you can hit this acc = 99.51% when your model is overfitting.
Generally, in this case the performance of the SVM classifier on unknown dataset is poor.
I'm developing an algorithm to classify different types of dogs based off of image data. The steps of the algorithm are:
Go through all training images, detect image features (ie SURF), and extract descriptors. Collect all descriptors for all images.
Cluster within the collected image descriptors and find k "words" or centroids within the collection.
Reiterate through all images, extract SURF descriptors, and match the extracted descriptor with the closest "word" found via clustering.
Represent each image as a histogram of the words found in clustering.
Feed these image representations (feature vectors) to a classifier and train...
Now, I have run into a bit of a problem. Finding the "words" within the collection of image descriptors is a very important step. Due to the random nature of clustering, different clusters are found each time I run my program. The unfortunate result is that sometimes the accuracy of my classifier will be very good, and other times, very bad. I have chalked this up to the clustering algorithm finding "good" words sometimes, and "bad" words other times.
Does anyone know how I can hedge against the clustering algorithm from finding "bad" words? Currently I just cluster several times and take the mean accuracy of my classifier, but there must be a better way.
Thanks for taking time to read through this, and thank you for your help!
EDIT:
I am not using KMeans for classification; I am using a Support Vector Machine for classification. I am using KMeans for finding image descriptor "words", and then using these words to create histograms which describe each image. These histograms serve as feature vectors that are fed to the Support Vector Machine for classification.
There are many possible ways of making clustering repeatable:
The most basic method of dealing with k-means randomness is simply running it multiple times and selecting the best one (the one that minimizes the inner cluster distances/maximizes the between clusters distance).
One can use some fixed initialization for your data instead of randomization. There are many heuristics for starting the k-means. Or at least minimize the variance by using algorithms like k-means++.
Use modification of k-means which guarantees global minimum of regularized function, ie. convex k-means
Use different clustering method, which is deterministic, ie. Data Nets
I would offer two possible suggestions, in addition to those provided.
K-means optimises an objective related to the distance between cluster points and their centroids. You care about classification accuracy. Depending on the computational cost, a simple brute-force approach is to induce multiple clusterings on a subset of your training data, and evaluate the performance of each on some held-out development set for the task you care about. Then use the highest performing variant as the final model. I don't like the use of non-random initialisation because this is only a solution to avoid the randomness, not find the true global minimum of the objective, and your chosen initialisation may be useless and just produce consistently bad classifiers.
The other approach, which is much harder, is to view the k-means step as a dimensionality reduction to enable classification, and incorporate this into the classifier directly. If you use a deep neural net, the layer(s) closest to the input are essentially dimensionality reducers in the same way as the k-means clustering you induce: the difference is their weights are set wrt the error of the net on the classification problem, rather than some unrelated intermediate step. The downside is that this is much closer to a current research problem: training deep nets is hard. You could start with a standard one-hidden-layer architecture (with binary activations on the hidden layer, and using cross-entropy loss on the output layer with outputs coded as one-of-n categories), and attempt to add layers incrementally, but as far as I'm aware standard training algorithms start to behave poorly beyond the single hidden layer, so you'd need to investigate layer-wise training to initialise, or some of the Hessian-Free stuff coming out of Geoff Hinton's group in Toronto.
That is actually an important problem with the BofW approach, and you should share this prominently. SIFT data may actually not have k-means clusters at all. However, due to the nature of the algorithm, k-means will always produce k clusters. One of the things to test with k-means is to validate that the results are stable. If you get a completely different result each time, they are not much better than random.
Nevertheless, if you just want to get some working results, you can just fix the dictionary once and choose one that is working well.
Or you might look into more advanced clustering (in particular one that is more robust wrt. noise!)