We all know that the objective function of SVM is iteratively trained. In order to continue training, at least we can store all the variables used in the iterations if we want to continue on the same training dataset.
While, if we want to train on a slightly different dataset, what should we do to make full use of the previously trained model? Or does this kind of thought make sense? I think it is quite reasonable if we train a K-means model. But I am not sure if it still makes sense for the SVM problem.
There are some literature on this topic:
alpha-seeding, in which the training data is divided into chunks. After you train a SVM on the ith chunk, you take those and use them to train your SVM with the (i+1)th chunk.
Incremental SVM serves as an online learning in which you update a classifier with new examples rather than retrain the entire data set.
SVM heavy package with online SVM training as well.
What you are describing is what an online learning algorithm does and unfortunately the classic definition for SVM is done in a batch fashion.
However, there are several solvers for SVM that produces quasy optimal hypothesis to the underneath optimization problem in an online learning way. In particular my favourite is Pegasos-SVM which can find a good near optimal solution in linear time:
http://ttic.uchicago.edu/~nati/Publications/PegasosMPB.pdf
In general this doesn't make sense. SVM training is an optimization process with regard to every training set vector. Each training vector has an associated coefficient, which as a result is either 0 (irrelevant) or > 0 (support vector). Adding another training vector imposes another, different, optimization problem.
The only way to reuse information from previous training I can think of is to choose support vectors from the previous training and add them to the new training set. I'm not sure, but this probably will negatively affect generalization - VC dimension of an SVM is related to the number of support vectors, so adding previous support vectors to the new dataset is likely to increase the support vector count.
Apparently, there are more possibilities, as noted in lennon310's answer.
Related
I am working to create an MLP model on a CEA Classification Dataset (Binary Classification). Each sample contains different 4 features, such as resistance and other values, each in its own range (resistance in hundreds, another in micros, etc.). I am still new to machine learning and this is the first real model to build. How can I deal with such data? I have tried feeding each sample to the neural network with a sigmoid activation function, but I am not getting accurate results. My assumption to deal with this kind of data is to scale it? If so, what are some resources which are useful to look at, since I do not quite understand when is scaling required.
Scaling your data can be an important step in building a machine-learning model, especially when working with neural networks. Scaling can help to ensure that all of the features in your dataset are on a similar scale, which can make it easier for the model to learn.
There are a few different ways to scale your data, such as normalization and standardization. Normalization is the process of scaling the data so that it has a minimum value of 0 and a maximum value of 1. Standardization is the process of scaling the data so that it has a mean of 0 and a standard deviation of 1.
When working with your CEA Classification dataset, it might be helpful to try both normalization and standardization to see which one works better for your specific dataset. You can use scikit-learn library's preprocessing functions like MinMaxScaler() and StandardScaler() for normalization and standardization respectively.
Additionally, it might be helpful to try different activation functions, such as ReLU or LeakyReLU, to see if they lead to more accurate results. Also, you can try adding more layers and neurons in your neural network to see if it improves the performance.
It's also important to remember that feature engineering, which includes the process of selecting the most important features, can be more important than scaling.
The Situation:
I am wondering how to use TensorFlow optimally when my training data is imbalanced in label distribution between 2 labels. For instance, suppose the MNIST tutorial is simplified to only distinguish between 1's and 0's, where all images available to us are either 1's or 0's. This is straightforward to train using the provided TensorFlow tutorials when we have roughly 50% of each type of image to train and test on. But what about the case where 90% of the images available in our data are 0's and only 10% are 1's? I observe that in this case, TensorFlow routinely predicts my entire test set to be 0's, achieving an accuracy of a meaningless 90%.
One strategy I have used to some success is to pick random batches for training that do have an even distribution of 0's and 1's. This approach ensures that I can still use all of my training data and produced decent results, with less than 90% accuracy, but a much more useful classifier. Since accuracy is somewhat useless to me in this case, my metric of choice is typically area under the ROC curve (AUROC), and this produces a result respectably higher than .50.
Questions:
(1) Is the strategy I have described an accepted or optimal way of training on imbalanced data, or is there one that might work better?
(2) Since the accuracy metric is not as useful in the case of imbalanced data, is there another metric that can be maximized by altering the cost function? I can certainly calculate AUROC post-training, but can I train in such a way as to maximize AUROC?
(3) Is there some other alteration I can make to my cost function to improve my results for imbalanced data? Currently, I am using a default suggestion given in TensorFlow tutorials:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
I have heard this may be possible by up-weighting the cost of miscategorizing the smaller label class, but I am unsure of how to do this.
(1)It's ok to use your strategy. I'm working with imbalanced data as well, which I try to use down-sampling and up-sampling methods first to make the training set even distributed. Or using ensemble method to train each classifier with an even distributed subset.
(2)I haven't seen any method to maximise the AUROC. My thought is that AUROC is based on true positive and false positive rate, which doesn't tell how well it works on each instance. Thus, it may not necessarily maximise the capability to separate the classes.
(3)Regarding weighting the cost by the ratio of class instances, it similar to Loss function for class imbalanced binary classifier in Tensor flow
and the answer.
Regarding imbalanced datasets, the first two methods that come to mind are (upweighting positive samples, sampling to achieve balanced batch distributions).
Upweighting positive samples
This refers to increasing the losses of misclassified positive samples when training on datasets that have much fewer positive samples. This incentivizes the ML algorithm to learn parameters that are better for positive samples. For binary classification, there is a simple API in tensorflow that achieves this. See (weighted_cross_entropy) referenced below
https://www.tensorflow.org/api_docs/python/tf/nn/weighted_cross_entropy_with_logits
Batch Sampling
This involves sampling the dataset so that each batch of training data has an even distribution positive samples to negative samples. This can be done using the rejections sampling API provided from tensorflow.
https://www.tensorflow.org/api_docs/python/tf/contrib/training/rejection_sample
I'm one who struggling with imbalanced data. What my strategy to counter imbalanced data are as below.
1) Use cost function calculating 0 and 1 labels at the same time like below.
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(_pred) + (1-y)*tf.log(1-_pred), reduction_indices=1))
2) Use SMOTE, oversampling method making number of 0 and 1 labels similar. Refer to here, http://comments.gmane.org/gmane.comp.python.scikit-learn/5278
Both strategy worked when I tried to make credit rating model.
Logistic regression is typical method to handle imbalanced data and binary classification such as predicting default rate. AUROC is one of the best metric to counter imbalanced data.
1) Yes. This is well received strategy to counter imbalanced data. But this strategy is good in Neural Nets only if you using SGD.
Another easy way to balance the training data is using weighted examples. Just amplify the per-instance loss by a larger weight/smaller when seeing imbalanced examples. If you use online gradient descent, it can be as simple as using a larger/smaller learning rate when seeing imbalanced examples.
Not sure about 2.
Most classification algorithms are developed to improve the training speed. However, is there any classifier or algorithm focusing on the decision making speed(low computation complexity and simple realizable structure)? I can get enough training dataļ¼and endure the long training time.
There are many methods which classify fast, you could more or less sort models by classification speed in a following way (first ones - the fastest, last- slowest)
Decision Tree (especially with limited depth)
Linear models (linear regression, logistic regression, linear svm, lda, ...) and Naive Bayes
Non-linear models based on explicit data transformation (Nystroem kernel approximation, RVFL, RBFNN, EEM), Kernel methods (such as kernel SVM) and shallow neural networks
Random Forest and other committees
Big Neural Networks (ie. CNN)
KNN with arbitrary distance
Obviously this list is not exhaustive, it just shows some general ideas.
One way of obtaining such model is to build a complex, slow model, then use it as a black box label generator to train a simplier model (but on potentialy infinite training set) - thus getting a fast classifier at the cost of very expensive training. There are many works showing that one can do that for example by training a shallow neural network on outputs of deep nn.
In general classification speed should not be a problem. Some exceptions are algorithms which have a time complexity depending on the number of samples you have for training. One example is k-Nearest-Neighbors which has no training time, but for classification it needs to check all points (if implemented in a naive way). Other examples are all classifiers which work with kernels since they compute the kernel between the current sample and all training samples.
Many classifiers work with a scalar product of the features and a learned coefficient vector. These should be fast enough in almost all cases. Examples are: Logistic regression, linear SVM, perceptrons and many more. See #lejlot's answer for a nice list.
If these are still too slow you might try to reduce the dimension of your feature space first and then try again (this also speeds up training time).
Btw, this question might not be suited for StackOverflow as it is quite broad and recommendation instead of problem oriented. Maybe try https://stats.stackexchange.com/ next time.
I have a decision tree which is represented in the compressed form and which is at least 4 times faster than the actual tree in classifying an unseen instance.
I am considering using random forest for a classification problem. The data comes in sequences. I plan to use first N(500) to train the classifier. Then, use the classifier to classify the data after that. It will make mistakes and the mistakes sometimes can be recorded.
My question is: can I use those mis-classified data to retrain the original classifier and how? If I simply add the mis-classified ones to original training set with size N, then the importance of the mis-classified ones will be exaggerated as the corrected classified ones are ignored. Do I have to retrain the classifier using all data? What other classifiers can do this kind of learning?
What you describe is a basic version of the Boosting meta-algorithm.
It's better if your underlying learner have a natural way to handle samples weights. I have not tried boosting random forests (generally boosting is used on individual shallow decision trees with a depth limit between 1 and 3) but that might work but will likely be very CPU intensive.
Alternatively you can train several independent boosted decision stumps in parallel with different PRNG seed values and then aggregate the final decision function as you would do with a random forests (e.g. voting or averaging class probability assignments).
If you are using Python, you should have a look at the scikit-learn documentation on the topic.
Disclaimer: I am a scikit-learn contributor.
Here is my understanding of your problem.
You have a dataset and create two subdata set with it say, training dataset and evaluation dataset. How can you use the evaluation dataset to improve classification performance ?
The point of this probleme is'nt to find a better classifier but to find a good way for the evaluation, then have a good classifier in the production environnement.
Evaluation purpose
As the evaluation dataset has been tag for evaluation there is now way yo do this. You must use another way for training and evaluation.
A common way to do is cross-validation;
Randomize your samples in your dataset. Create ten partitions from your initial dataset. Then do ten iteration of the following :
Take all partitions but the n-th for training and do the evaluation with the n-th.
After this take the median of the errors of the ten run.
This will give you the errors rate of yours classifiers.
The least run give you the worst case.
Production purpose
(no more evaluation)
You don't care anymore of evaluation. So take all yours samples of all your dataset and give it for training to your classifier (re-run a complet simple training). The result can be use in production environnement, but can't be evaluate any more with any of yours data. The result is as best as the worst case in previous partitions set.
Flow sample processing
(production or learning)
When you are in a flow where new samples are produce over time. You will face case where some sample correct errors case. This is the wanted behavior because we want the system to
improve itself. If you just correct in place the leaf in errors, after some times your
classifier will have nothing in common with the original random forest. You will be doing
a form of greedy learning, like meta taboo search. Clearly we don't wanna this.
If we try to reprocess all the dataset + the new sample every time a new sample is available we will experiment terrible low latency. The solution is like human, sometime
a background process run (when service is on low usage), and all data get a complet
re-learning; and at the end swap old and new classifier.
Sometime the sleep time is too short for a complet re-learning. So you have to use node computing clusturing like that. It cost lot of developpement because you probably need to re-write the algorithms; but at that time you already have the bigest computer you could have found.
note : Swap process is very important to master. You should already have it in your production plan. What do you do if you want to change algorithms? backup? benchmark? power-cut? etc...
I would simply add the new data and retrain the classifier periodically if it weren't too expensive.
A simple way to keep things in balance is to add weights.
If you weigh all positive samples by 1/n_positive and all negative samples by 1/n_negative ( including all the new negative samples you're getting ), then you don't have to worry about the classifier getting out of balance.
I am doing remote sensing image classification. I am using the object-oriented method: first I segmented the image to different regions, then I extract the features from regions such as color, shape and texture. The number of all features in a region may be 30 and commonly there are 2000 regions in all, and I will choose 5 classes with 15 samples for every class.
In summary:
Sample data 1530
Test data 197530
How do I choose the proper classifier? If there are 3 classifiers (ANN, SVM, and KNN), which should I choose for better classification?
KNN is the most basic machine learning algorithm to paramtise and implement, but as alluded to by #etov, would likely be outperformed by SVM due to the small training data sizes. ANNs have been observed to be limited by insufficient training data also. However, KNN makes the least number of assumptions regarding your data, other than that accurate training data should form relatively discrete clusters. ANN and SVM are notoriously difficult to paramtise, especially if you wish to repeat the process using multiple datasets and rely upon certain assumptions, such as that your data is linearly separable (SVM).
I would also recommend the Random Forests algorithm as this is easy to implement and is relatively insensitive to training data size, but I would advise against using very small training data sizes.
The scikit-learn module contains these algorithms and is able to cope with large training data sizes, so you could increase the number of training data samples. the best way to know for sure would be to investigate them yourself, as suggested by #etov
If your "sample data" is the train set, it seems very small. I'd first suggest using more than 15 examples per class.
As said in the comments, it's best to match the algorithm to the problem, so you can simply test to see which algorithm works better. But to start with, I'd suggest SVM: it works better than KNN with small train sets, and generally easier to train then ANN, as there are less choices to make.
Have a look at below mind map
KNN: KNN performs well when sample size < 100K records, for non textual data. If accuracy is not high, immediately move to SVC ( Support Vector Classifier of SVM)
SVM: When sample size > 100K records, go for SVM with SGDClassifier.
ANN: ANN has evolved overtime and they are powerful. You can use both ANN and SVM in combination to classify images
More details are available #semanticscholar.org