How easy/fast are support vector machines to create/update?

If I provided you with data sufficient to classify a bunch of objects as either apples, oranges or bananas, how long might it take you to build an SVM that could make that classification? I appreciate that it probably depends on the nature of the data, but are we more likely talking hours, days or weeks?
Ok. Now that you have that SVM, and you have an understanding of how the data behaves, how long would it likely take you to upgrade that SVM (or build a new one) to classify an extra class (tomatoes) as well? Seconds? Minutes? Hours?
The motivation for the question is trying to assess the practical suitability of SVMs to a situation in which not all data is available to be sampled at any time. Fruit are an obvious case - they change colour and availability with the season.
If you would expect SVMs to be too fiddly to be able to create inside 5 minutes on demand, despite experience with the problem domain, then suggestions of a more user-friendly form of classifier for such a situation would be appreciated.

Generally, adding a class to a one-vs-rest (1 vs. many) SVM classifier requires retraining all classes, because the "rest" side of every existing binary classifier now has to include the new class. With large data sets, this can be quite expensive. In the real world, when facing very large data sets, if performance and flexibility are more important than state-of-the-art accuracy, Naive Bayes is quite widely used (adding a class to an NB classifier requires training on the new class only).
However, according to your comment, which states that the data has tens of dimensions and up to thousands of samples, the problem is relatively small, so in practice retraining the SVM can be done very quickly (probably on the order of seconds to tens of seconds).
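To illustrate the Naive Bayes point, here is a minimal sketch of a Gaussian Naive Bayes classifier in plain Python. The class name `IncrementalGaussianNB`, the two features (hue, elongation) and all numbers are made up for illustration; this is not any library's API. Each class keeps its own statistics, so adding tomatoes only touches the tomato samples.

```python
import math
from statistics import mean, pvariance


class IncrementalGaussianNB:
    """Toy Gaussian Naive Bayes that stores independent statistics per class."""

    def __init__(self):
        self.stats = {}   # label -> list of (mean, variance), one pair per feature
        self.counts = {}  # label -> number of training samples

    def add_class(self, label, samples):
        # Only this class's samples are used; other classes are left untouched.
        columns = list(zip(*samples))
        self.stats[label] = [(mean(c), pvariance(c) + 1e-9) for c in columns]
        self.counts[label] = len(samples)

    def _log_score(self, label, x):
        total = sum(self.counts.values())
        score = math.log(self.counts[label] / total)  # class prior from counts
        for value, (mu, var) in zip(x, self.stats[label]):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        return score

    def predict(self, x):
        return max(self.stats, key=lambda label: self._log_score(label, x))


# Hypothetical two-feature data: (hue, elongation).
clf = IncrementalGaussianNB()
clf.add_class("apple", [(0.30, 1.0), (0.35, 1.1), (0.28, 0.9)])
clf.add_class("orange", [(0.10, 1.0), (0.12, 1.05), (0.09, 0.95)])
clf.add_class("banana", [(0.17, 4.0), (0.16, 3.8), (0.18, 4.2)])

# Later in the season: add tomatoes without revisiting the other fruit.
clf.add_class("tomato", [(0.01, 1.0), (0.02, 0.9), (0.00, 1.1)])
prediction = clf.predict((0.015, 1.0))
```

The class priors are recomputed from the stored counts at prediction time, so they stay consistent after a new class is added.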

You need to give us more details about your problem, since training time varies a lot between scenarios: an SVM can be trained fairly quickly (I have trained one in real time in a third-person shooter game without any noticeable latency) or it can take several minutes or more (I have a face detector whose training took an hour).
As a rule of thumb, the training time grows with the number of samples and the dimension of each vector. For kernel SVMs it typically grows faster than linearly in the number of samples.


What machine learning algorithm could be better for this scenario?

I have a dataset of roughly 15M observations, with approximately 3% of it belonging to the class of interest. I can train the model on a PC, but I need to deploy the classifier on a Raspberry Pi 3. Since the Raspberry Pi has such limited memory, which algorithms put the least load on it?
Additional info: the dataset is hard to differentiate. For example, ANNs can't get past the 80% detection rate for the interest class, no matter the architecture or activation function. Random forest has demonstrated great performance but the number of trees and nodes required aren't feasible for the implementation on a microcontroller.
You could potentially trim (prune) the trees in the Random Forest approach to balance classifier performance against memory and processing power requirements.
Also, I suspect you have strongly imbalanced train/test sets, so I wonder whether you have tried any of the approaches suggested for this case (e.g. SMOTE, ADASYN, etc.). If you are using Python, I strongly suggest looking at the imbalanced-learn library. Such an approach could lead to a smaller classifier with acceptably good performance that fits on the target device.
Last but not least, this question could easily go to Cross Validated or Data Science sites.

How to scale up a model in a training dataset to cover all aspects of training data

I was asked in an interview to solve a use case with the help of machine learning. I have to use a machine learning algorithm to identify fraud in transactions. My training dataset has, let's say, 100,200 transactions, of which 100,000 are legal transactions and 200 are fraud.
I cannot use the dataset as a whole to build the model because it would be a biased dataset and the model would be a very bad one.
Let's say, for example, that I take a sample of 200 good transactions that represents the good transactions well, plus the 200 fraud ones, and build the model using this as the training data.
The question I was asked was how I would scale up from the 200 good transactions to the whole dataset of 100,000 good records so that my result can be mapped to all types of transactions. I have never solved this kind of scenario, so I did not know how to approach it.
Any kind of guidance as to how I can go about it would be helpful.
This is a general question thrown out in an interview. Information about the problem is brief and vague (we don't know, for example, the number of features!). The first thing you need to ask yourself is: what does the interviewer want me to respond? Based on this context, the answer has to be formulated in a similarly general way. This means that we don't have to find 'the solution', but instead give arguments showing that we actually know how to approach the problem.
The problem we have been presented with is that the minority class (fraud) is only ~0.2% of the total. This is obviously a huge imbalance. A predictor that labelled every case as 'non fraud' would get a classification accuracy of 99.8%! Therefore, something definitely has to be done.
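As a quick check of that figure, the arithmetic for an "always not fraud" predictor on the numbers from the question looks like this (the variable names are just for illustration):

```python
legit, fraud = 100_000, 200
total = legit + fraud

# A classifier that always answers "not fraud" gets every legal transaction right...
accuracy = legit / total
# ...but it finds none of the fraud cases, so recall on the fraud class is zero.
recall = 0 / fraud

print(f"accuracy = {accuracy:.4f}, fraud recall = {recall:.1f}")
```

This is why accuracy alone says very little about a model on data like this.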
We will define our main task as a binary classification problem where we want to predict whether a transaction is labelled as positive (fraud) or negative (not fraud).
The first step would be to consider which techniques are available to reduce the imbalance. This can be done either by reducing the majority class (undersampling) or by increasing the number of minority samples (oversampling). Both have drawbacks, though. The first implies a severe loss of potentially useful information from the dataset, while the second can cause overfitting. Some techniques to mitigate that overfitting are SMOTE and ADASYN, which use strategies to increase variety when generating new synthetic samples.
Of course, cross-validation becomes paramount in this case. Additionally, if we do end up oversampling, it has to be 'coordinated' with the cross-validation approach: oversampling should be applied only to the training part of each fold, so that copies or synthetic neighbours of a minority sample never leak into the data used for evaluation. Check for more details.
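One way to read the 'coordinated' remark, assuming it means "resample inside each fold", is sketched below in plain Python. It uses simple random oversampling as a stand-in for SMOTE or ADASYN, and the toy data, helper names and fold count are all assumptions made for illustration.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def oversample(rows, labels, minority=1, seed=0):
    """Random oversampling: duplicate minority rows until the classes are balanced."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    if not minority_rows:
        return rows, labels  # nothing to duplicate in this split
    n_majority = len(rows) - len(minority_rows)
    extra = [rng.choice(minority_rows) for _ in range(n_majority - len(minority_rows))]
    return rows + extra, labels + [minority] * len(extra)


# Hypothetical toy data: row ids stand in for feature vectors, 10% positives.
rows = list(range(100))
labels = [1 if i < 10 else 0 for i in rows]

fold_summaries = []
for test_idx in k_fold_indices(len(rows), k=5):
    held_out = set(test_idx)
    train_rows = [r for r in rows if r not in held_out]
    train_labels = [labels[r] for r in train_rows]  # row id doubles as index here
    # Oversample the training part only; the held-out fold keeps its original distribution.
    bal_rows, bal_labels = oversample(train_rows, train_labels)
    # A model would be fitted on (bal_rows, bal_labels) and scored on the held-out rows.
    fold_summaries.append(
        (bal_labels.count(1), bal_labels.count(0), held_out.isdisjoint(bal_rows))
    )
```

Oversampling before splitting would instead place duplicates of the same minority sample on both sides of the split and make the validation score look better than it is.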
Apart from these sampling ideas, when selecting our learner, many ML methods can be trained/optimised for specific metrics. In our case, we definitely do not want to optimise accuracy. Instead, we want to train the model to optimise either ROC-AUC or, more specifically, a high recall even at the cost of precision, as we want to catch all the positive 'frauds', or at least raise an alarm, even though some will prove to be false alarms. Models can adapt internal parameters (thresholds) to find the optimal balance between these two metrics. Have a look at this nice blog for more info about metrics:
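The threshold trade-off can be made concrete with a small sketch. The scores and labels below are invented, and `precision_recall` is a hypothetical helper rather than a library function; lowering the threshold raises recall on the fraud class, usually at some cost in precision.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall for the positive class at a given score threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical fraud scores from some model (higher = more suspicious).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

for threshold in (0.7, 0.5, 0.35):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy data the lowest threshold is the only one that catches every fraud case, and it does so by also flagging one legal transaction.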
Finally, it is only a matter of evaluating the model empirically to check which options and parameters are the most suitable for the dataset. Following these ideas does not guarantee that we will be able to solve the problem at hand. But it ensures we are in a much better position to learn from the data and to get rid of those evil fraudsters out there, while perhaps getting a nice job along the way ;)
In this problem you want to classify transactions as good or fraud. However, your data is really imbalanced. In that case you will probably be interested in anomaly detection. I will let you read the whole article for more details, but I will quote a few parts in my answer.
I think this will convince you that this is what you are looking for to solve this problem:
Is it not just Classification?
The answer is yes if the following three conditions are met.
1) You have labeled training data
2) Anomalous and normal classes are balanced (say at least 1:5)
3) Data is not autocorrelated. (That one data point does not depend on earlier data points. This often breaks in time series data.)
If all of above is true, we do not need an anomaly detection techniques and we can use an algorithm like Random Forests or Support Vector Machines (SVM).
However, often it is very hard to find training data, and even when
you can find them, most anomalies are 1:1000 to 1:10^6 events where
classes are not balanced.
Now to answer your question:
Generally, the class imbalance is solved using an ensemble built by
resampling data many times. The idea is to first create new datasets
by taking all anomalous data points and adding a subset of normal data
points (e.g. as 4 times as anomalous data points). Then a classifier
is built for each data set using SVM or Random Forest, and those
classifiers are combined using ensemble learning. This approach has
worked well and produced very good results.
If the data points are autocorrelated with each other, then simple
classifiers would not work well. We handle those use cases using time
series classification techniques or Recurrent Neural networks.
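The resampling ensemble described in the quoted passage can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is one-dimensional and synthetic, and `train_stump` is a trivial threshold learner standing in for the SVM or Random Forest the article mentions.

```python
import random
from statistics import mean


def train_stump(points):
    """Stand-in base learner: threshold halfway between the two class means (1-D)."""
    pos = mean(x for x, y in points if y == 1)
    neg = mean(x for x, y in points if y == 0)
    cut = (pos + neg) / 2
    return (lambda x: int(x >= cut)) if pos > neg else (lambda x: int(x < cut))


def undersampled_ensemble(data, n_models=5, ratio=4, seed=0):
    """Train n_models learners, each on all anomalies plus ratio-times-as-many normals."""
    rng = random.Random(seed)
    anomalies = [d for d in data if d[1] == 1]
    normals = [d for d in data if d[1] == 0]
    models = []
    for _ in range(n_models):
        # Every model sees all anomalies and a fresh random subset of normal points.
        subset = rng.sample(normals, min(len(normals), ratio * len(anomalies)))
        models.append(train_stump(anomalies + subset))
    # Combine the models by majority vote.
    return lambda x: int(sum(m(x) for m in models) * 2 > len(models))


# Hypothetical data: (transaction amount, label) with 1% anomalies.
data_rng = random.Random(1)
data = [(data_rng.gauss(50, 10), 0) for _ in range(1000)]
data += [(data_rng.gauss(200, 20), 1) for _ in range(10)]

predict = undersampled_ensemble(data)
```

Because each model sees a different slice of the normal data, the ensemble as a whole uses far more of the majority class than a single undersampled classifier would.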
I would also suggest another approach to the problem. In this article the author says:
If you do not have training data, still it is possible to do anomaly
detection using unsupervised learning and semi-supervised learning.
However, after building the model, you will have no idea how well it
is doing as you have nothing to test it against. Hence, the results of
those methods need to be tested in the field before placing them in
the critical path.
However, you do have a few fraud examples with which to test whether your unsupervised algorithm is doing well or not, and if it is doing a good enough job, it can be a first solution that helps you gather more data to train a supervised classifier later.
Note that I am not an expert and this is just what I've come up with after mixing my knowledge and some articles I read recently on the subject.

Is it ok to only use one epoch?

I'm training a neural network in TensorFlow (using tflearn) on data that I generate. From what I can tell, each epoch we use all of the training data. Since I can control how many examples I have, it seems like it would be best to just generate more training data until one epoch is enough to train the network.
So my question is: Is there any downside to only using one epoch, assuming I have enough training data? Am I correct in assuming that 1 epoch of a million examples is better than 10 epochs of 100,000?
Following a discussion with @Prune:
Suppose you have the possibility to generate an infinite number of labeled examples, sampled from a fixed underlying probability distribution, i.e. from the same manifold.
The more examples the network sees, the better it will learn, and especially the better it will generalize. Ideally, if you train it long enough, it could reach 100% accuracy on this specific task.
The conclusion is that only running 1 epoch is fine, as long as the examples are sampled from the same distribution.
The limitations to this strategy could be:
if you need to store the generated examples, you might run out of memory
to handle unbalanced classes (cf. @jorgemf's answer), you just need to sample the same number of examples for each class.
e.g. if you have two classes, with 10% chance of sampling the first one, you should create batches of examples with a 50% / 50% distribution
it's possible that running multiple epochs might make it learn some uncommon cases better.
I disagree: using the same example multiple times is always worse than generating new, unseen examples. However, you might want to generate harder and harder examples over time to make your network better on uncommon cases.
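A minimal sketch of the 50% / 50% batching idea, assuming you have one generator per class (the two Gaussian generators and the names below are hypothetical placeholders for your own data generation code):

```python
import random


def balanced_batch(generators, batch_size, rng):
    """Draw the same number of examples from each class generator, then shuffle."""
    per_class = batch_size // len(generators)
    batch = [
        (gen(rng), label)
        for label, gen in generators.items()
        for _ in range(per_class)
    ]
    rng.shuffle(batch)
    return batch


# Hypothetical per-class generators; the "rare" class would naturally appear ~10% of the time.
generators = {
    "rare": lambda rng: rng.gauss(5.0, 1.0),
    "common": lambda rng: rng.gauss(0.0, 1.0),
}

rng = random.Random(0)
batch = balanced_batch(generators, batch_size=32, rng=rng)
```

Each call produces fresh examples, so no example is ever repeated and the classes stay balanced within every batch.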
You need training examples in order to make the network learn. Usually you don't have enough examples to make the network converge in a single pass, so you need to run more than one epoch.
It is ok to use only one epoch if you have enough examples and they are similar. If you have 100 classes but some of them have very few examples, you are not going to learn those classes with only one epoch. So you need balanced classes.
Moreover, it is a good idea to have a variable learning rate which decreases with the number of examples seen, so the network can fine-tune itself. It starts with a high learning rate and then decreases it over time. If you only run for one epoch, you need to bear this in mind when you set up the graph.
My suggestion is to run more than one epoch, mostly because the more examples you have, the more memory you need to store them. But if memory is fine and the learning rate is adjusted based on the number of examples and not on epochs, then it is fine to run one epoch.
Edit: I am assuming you are using a learning algorithm which updates the weights of the network every batch or similar.
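As a sketch of a learning rate tied to the number of examples seen rather than to epochs, here is a simple exponential decay. The function name and every default value are illustrative assumptions, not tflearn or TensorFlow settings.

```python
def learning_rate(examples_seen, base_lr=0.1, decay_examples=100_000, decay_rate=0.5):
    """Exponential decay driven by examples seen, not by epoch count.

    The rate is multiplied by decay_rate every decay_examples examples.
    All default values are illustrative assumptions.
    """
    return base_lr * decay_rate ** (examples_seen / decay_examples)


batch_size = 100
# Learning rate at a few points of a single pass over one million generated examples.
schedule = [learning_rate(step * batch_size) for step in range(0, 10_001, 2_500)]
```

With a schedule like this, one epoch of a million examples and ten epochs of 100,000 examples follow the same decay curve.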

How can I know training data is enough for machine learning

For example: if I want to train a classifier (maybe an SVM), how many samples do I need to collect? Is there a way to measure this?
It is not easy to know how many samples you need to collect. However, you can follow these steps for a typical ML problem:
1) Build a dataset with a few samples. How many depends on the kind of problem you have; don't spend a lot of time on this yet.
2) Split your dataset into training, cross-validation and test sets, and build your model.
3) Now that you've built the ML model, evaluate how good it is by calculating your test error.
4) If your test error is worse than you expected, collect new data and repeat steps 1-3 until you reach a test error rate you are comfortable with.
This method will work if your model is not suffering from "high bias" (underfitting); if it is, collecting more data will not help much.
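The collect-train-evaluate loop can be sketched as below. Everything here is a simplified assumption: the data source is synthetic, the "model" is a one-dimensional threshold standing in for a real classifier, and a single held-out test set replaces the full train / cross-validation / test split.

```python
import random
from statistics import mean


def make_samples(n, rng):
    """Hypothetical data source: two overlapping 1-D classes."""
    samples = []
    for _ in range(n):
        y = rng.randint(0, 1)
        samples.append((rng.gauss(3.0 * y, 1.0), y))
    return samples


def fit_threshold(train):
    """Stand-in model: cut halfway between the two class means."""
    pos = mean(x for x, y in train if y == 1)
    neg = mean(x for x, y in train if y == 0)
    return (pos + neg) / 2


def error_rate(cut, data):
    return mean(int((x >= cut) != (y == 1)) for x, y in data)


rng = random.Random(0)
test_set = make_samples(2000, rng)
train, target_error, max_samples = [], 0.10, 5000
history = []
while True:
    train += make_samples(50, rng)    # collect (more) data
    cut = fit_threshold(train)        # (re)build the model
    err = error_rate(cut, test_set)   # measure the test error
    history.append((len(train), err))
    # Stop when the error is acceptable, or give up once more data stops being affordable.
    if err <= target_error or len(train) >= max_samples:
        break
```

Plotting `history` gives a learning curve; if the error flattens out well above the target, that points to a model limitation rather than a lack of data.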
This video from Coursera's Machine Learning course explains it.
Unfortunately, there is no simple method for this.
The rule of thumb is the bigger, the better, but in practice you have to gather a sufficient amount of data. By sufficient I mean covering as big a part of the modeled space as you consider acceptable.
Also, amount is not everything. The quality of the training samples is very important too, e.g. training samples should not contain duplicates.
Personally, when I don't have all possible training data at once, I gather some training data and then train a classifier. If the classifier quality is not acceptable, I gather more data, and so on.
Here is some research on estimating training set quality.
This depends a lot on the nature of the data and the prediction you are trying to make, but as a simple rule to start with, your training data should be roughly 10X the number of your model parameters. For instance, while training a logistic regression with N features, try to start with 10N training instances.
For an empirical derivation of the "rule of 10", see

Using a feature as Input vs. using it to build Several Machines on SVM

I am an undergraduate student and for my graduation thesis I am using SVM to predict the arrival time of a bus to a bus stop in its route. After doing a lot of research and reading some papers I still have a key doubt about how to model my system.
We've decided which features to use and we are in the process of gathering the data required to perform the regression, but what is confusing us is the implications or consequences of using some features as input to the SVM versus building separate machines based on some of these features.
For instance, in this paper the authors built 4 SVMs for predicting bus arrival times: one for rush hour on sunny days, rush hour on rainy days, off-rush hour on sunny days and the last one for off-rush hours and rainy days.
But in a follow-up paper on the same subject they decided to use a single SVM with the weather condition and the rush/off-rush hour as inputs instead of splitting it into 4 SVMs as before.
I feel like this is the kind of thing that is more about experience so I would like to hear from you guys if anyone has any information about when to choose one of these approaches.
Thanks in advance.
There is no other way: you have to find out on your own. This is why you have to write this thesis. Nobody starts with a perfect solution. Everyone makes mistakes. Your problem is not easy and you cannot say what will work when you have never done anything similar. Try everything you found in the literature, compare the results, develop your own ideas, ...
Most important question: what is the data like?
Second question: what model do you expect to capture this?
So if you want to use SVMs for some reason, keep in mind that their basic mechanism is linear, and that they can only capture non-linear phenomena if the data is transformed by a suitable kernel.
For a particular problem at hand that means:
Do you have reason (plots, insights in the problem nature) to believe your problem is linear(ly separable)? Just use one linear svm.
Do you have reason to believe your problem consists of several linear subproblems? Use a linear svm on each of the subproblems.
Does your data seem non-linearly grouped? Try an svm with something like rbf kernel.
Of course, you can just plug in and try, but checking the above may increase understanding of the problem.
In your particular problem I would go for a single SVM.
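The two designs from the question differ mainly in how the training data is laid out. The sketch below shows only that layout step; the observations, field names and helper functions are invented for illustration, and the resulting arrays would then be passed to whatever SVM regression implementation you use.

```python
from collections import defaultdict

# Hypothetical observations: (distance_km, is_rush_hour, is_raining, travel_minutes).
observations = [
    (2.0, True, False, 9.0),
    (2.0, False, False, 6.0),
    (3.5, True, True, 19.0),
    (3.5, False, True, 12.0),
    (1.0, True, True, 7.0),
]


def single_model_rows(obs):
    """One machine: the conditions become extra input features (0/1 flags)."""
    X = [[d, float(rush), float(rain)] for d, rush, rain, _ in obs]
    y = [t for *_, t in obs]
    return X, y


def per_condition_rows(obs):
    """Several machines: the conditions select which model receives each sample."""
    groups = defaultdict(lambda: ([], []))
    for d, rush, rain, t in obs:
        X, y = groups[(rush, rain)]
        X.append([d])
        y.append(t)
    return dict(groups)


X_all, y_all = single_model_rows(observations)
datasets = per_condition_rows(observations)
```

Note that in the split design each of the four models is trained on only a fraction of the data, which matters when observations for some conditions (for example rainy rush hours) are scarce.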
With my not so extensive experience, I would consider breaking a problem into several SVMs for the following reasons:
1) The classes are too different, or there are classes and subclasses in your problem.
E.g. in my case: there are several types of antibodies in a microscope image and they all may be positive or negative. So instead of defining A_Pos, A_Neg, B_Pos, B_Neg, ... I first decide whether the image is positive or negative and determine the type in a second SVM.
2) The feature extraction is too expensive. Suppose you have groups of classes which can be identified with fewer features. Instead of extracting all features for a single machine, you may first extract only a small subset, and extract further features only if required (i.e. the result does not have a high enough probability).
3) Decide whether the instance belongs to the problem at all. Build a one-class model containing all instances of the training set. If the instance to be classified is an outlier, stop. Otherwise classify it with a second SVM containing all classes.
The keyword is "cascaded SVM".
