SVM Classification - minimum number of input sets for each class - machine-learning

I'm trying to build an app that detects which images on a webpage are advertisements; once I detect those, I won't allow them to be displayed on the client side.
From the help I got on this Stack Overflow question, I concluded that an SVM is the best approach for my goal.
So I have coded an SVM with SMO myself. The dataset I got from the UCI data repository has 3280 instances ( Link to Dataset ), of which around 400 belong to the class representing advertisement images and the rest to the class representing non-advertisement images.
Right now I'm taking the first 2800 instances and training the SVM on them. But after looking at the accuracy rate I realised that most of those 2800 instances belong to the non-advertisement class, so I'm only getting very good accuracy for that class.
So what can I do here? Roughly how many training instances should I give the SVM, and how many of them for each class?
Thanks. Cheers. (I basically made a new question because the context was different from my previous question, Optimization of Neural Network input data.)
Thanks for the reply.
I want to check whether I'm deriving the C values for the ad and non-ad classes correctly.
Please give me feedback on this.
Or you can see the doc version here.
You can see the graph for y1 equal to y2 here
and for y1 not equal to y2 here

There are two ways of going about this. One would be to balance the training data so it includes an equal number of advertisement and non-advertisement images. This could be done by either oversampling the 400 advertisement images or undersampling the thousands of non-advertisement images. Since training time can increase dramatically with the number of data points used, you should probably first try undersampling the non-advertisement images and create a training set with the 400 ad images and 400 randomly selected non-advertisements.
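For the undersampling route, a minimal sketch in Python with NumPy (the array names, feature count, and the 400/2880 split are assumptions based on the dataset described in the question):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Placeholder data standing in for the UCI ad dataset:
# X is the feature matrix, y holds 1 for ad images and 0 for non-ads.
X = rng.random((3280, 1558))
y = np.concatenate([np.ones(400, dtype=int), np.zeros(2880, dtype=int)])

ad_idx = np.flatnonzero(y == 1)
nonad_idx = np.flatnonzero(y == 0)

# Randomly pick as many non-ads as there are ads, then shuffle together.
sampled_nonad = rng.choice(nonad_idx, size=ad_idx.size, replace=False)
balanced = rng.permutation(np.concatenate([ad_idx, sampled_nonad]))

X_train, y_train = X[balanced], y[balanced]
```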
The other solution would be to use a weighted SVM, so that margin errors for the ad images are weighted more heavily than those for non-ads; in the libSVM package this is done with the -wi flag. From your description of the data, you could try weighting the ad images about 7 times more heavily than the non-ads.
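With the libSVM command-line tools, that weighting would look something like `svm-train -w1 7 train.scale model` (the exact ratio is a guess from the class counts). Equivalently, a hedged sketch with scikit-learn's SVC, which wraps libSVM, rather than the asker's own SMO code:

```python
from sklearn.svm import SVC

# Weight margin errors on the ad class (label 1) about 7x more heavily,
# mirroring libSVM's -w1 7. The ratio is an assumption derived from the
# ~400 ads vs ~2880 non-ads imbalance described in the question.
clf = SVC(kernel="rbf", C=1.0, class_weight={1: 7})
clf.fit(X_train, y_train)  # X_train, y_train as in the previous sketch
```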

The required size of your training set depends on the sparseness of the feature space. As far as I can see, you haven't discussed which image features you have chosen to use. Before you can train, you need to convert each image into a vector of numbers (features) that describes the image, hopefully capturing the aspects you care about.
Oh, and unless you are reimplementing SVM for sport, I'd recommend just using libSVM.

Related

How to compute similarity score between two images using their feature vectors?

I am working on a face recognition project using a deep learning architecture to classify images into their respective classes. The network's output at the softmax layer is the predicted class label, and the output of the last-but-one (dense) layer is a feature representation of the input image. Here the feature vector is a 1-D vector of length 1000 for each image. Predicting classes is a recognition-type problem, but I'm interested in the verification problem.
So, given two sample images, I need to compute a similarity/dissimilarity score between them using their feature representations. If the match score is greater than a threshold, it's a hit; otherwise it's not. Please let me know if there are any standard approaches.
Example of similar faces (which should ideally generate matchscore>threshold): https://3c1703fe8d.site.internapcdn.net/newman/gfx/news/hires/2014/yvyughbujh.jpg
Your problem has two possible solutions:
Train your own network (starting from a pretrained one) with an output of 1000 classes. This approach is not the simplest, because it requires a large amount of data for each class, roughly 1000 samples per class.
The other approach is to use distance metric learning, where by "distance" we usually mean the Euclidean norm. This field is much wider and deeper than simply extracting features and matching each one to its nearest neighbour, so it is worth reading up on.
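As a starting point for the verification setting, here is a minimal sketch of scoring a pair of embeddings with cosine similarity (Euclidean distance works analogously); the random vectors and the threshold value are placeholders, not standard values:

```python
import numpy as np

def match_score(f1, f2):
    """Cosine similarity between two L2-normalised feature vectors."""
    f1 = f1 / np.linalg.norm(f1)
    f2 = f2 / np.linalg.norm(f2)
    return float(np.dot(f1, f2))

# Hypothetical 1000-d embeddings from the penultimate dense layer.
rng = np.random.default_rng(0)
emb_a, emb_b = rng.random(1000), rng.random(1000)

THRESHOLD = 0.8  # tune on a held-out set of genuine/impostor pairs
print("hit" if match_score(emb_a, emb_b) > THRESHOLD else "no hit")
```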
Good luck!

Data Generation for Neural Networks

I have images that I want to process. First, features are extracted from those images, and then those features are fed into a neural network for training. I do not have many images, though, and would like to generate more data.
1) What yields less overfitting: should I generate more images from the original images and feed the entire pipeline with them, or should I introduce variation into the extracted features and simply train the neural network on more data that way?
The second approach would be computationally cheaper, but would it yield better results?
2) What techniques are tried and true for generating more data - either more images or the features?
It is true that when you don't have enough data, your model's performance can be poor. So you have a few things to try:
You can modify the data you have by applying translations, rotations, etc.; for example, shift all the pixels of the image a few pixels to the left. These are operations on the images themselves.
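A minimal sketch of such image-level operations in Python with NumPy (the toy 8x8 image and the shift amount are placeholders):

```python
import numpy as np

def shift_left(img, pixels=2):
    """Translate an image `pixels` (> 0) columns left, zero-padding the right."""
    out = np.zeros_like(img)
    out[:, :-pixels] = img[:, pixels:]
    return out

img = np.arange(64, dtype=np.float32).reshape(8, 8)  # toy grayscale image
augmented = [img, shift_left(img), np.rot90(img)]    # all keep the same label
```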
You can also generate more images with generative models: Restricted Boltzmann Machines, Deep Belief Networks, etc.
There is also a way to determine whether you need more training data: plot the score on the training data and on the validation data against the size of the training set (10% of the full set, 20%, ..., 90% on the x axis, score on the y axis) and look at the resulting curves. To understand this properly, I strongly recommend Andrew Ng's Machine Learning videos (https://www.coursera.org/learn/machine-learning), specifically Week 6 (Advice for Applying Machine Learning).
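Those curves are what scikit-learn calls learning curves; a hedged sketch using a stand-in dataset (digits) and a small MLP, purely to illustrate the diagnostic:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)  # stand-in for your own features

sizes, train_scores, val_scores = learning_curve(
    MLPClassifier(max_iter=500, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5)

# A persistent gap between the curves suggests more data could help;
# two low curves that plateau together point to a model that is too simple.
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")
```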

Minimum requirements for a Google TensorFlow image classifier

We are planning to build image classifiers using Google Tensorflow.
I wonder what the minimum and the optimum requirements are to train a custom image classifier using a convolutional deep neural network.
The questions are specifically:
how many images per class should be provided at a minimum?
do we need to provide approximately the same number of training images per class, or can the amount per class be disparate?
what is the impact of wrong image data in the training data? E.g. 500 images of a tennis shoe and 50 of other shoes.
is it possible to train a classifier with many more classes than the recently published Inception-v3 model? Let's say: 30,000.
"how many images per class should be provided at a minimum?"
Depends how you train.
If training a new model from scratch, purely supervised: for a rule of thumb on the number of images, you can look at the MNIST and CIFAR tasks, which seem to work OK with about 5,000 images per class when trained from scratch.
You can probably bootstrap your network by beginning with a model trained on ImageNet. This model will already have good features, so it should be able to learn to classify new categories without as many labeled examples. I don't think this is well-studied enough to tell you a specific number.
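As one way to realise that bootstrapping today, a hedged sketch with tf.keras (an assumption on my part: the answer itself names no API, and this one postdates it; NUM_CLASSES is a placeholder for your new categories):

```python
import tensorflow as tf

NUM_CLASSES = 10  # placeholder for the number of new categories

# Load Inception-v3 with ImageNet weights, dropping the original classifier.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")
base.trainable = False  # reuse the pretrained features as-is

# Attach a fresh softmax layer for the new classes and train only that.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```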
If also training with unlabeled data, maybe only 100 labeled images per class. There is a lot of recent research work on this topic, though it has not yet been scaled to tasks as large as ImageNet.
Simple to implement:
http://arxiv.org/abs/1507.00677
Complicated to implement:
http://arxiv.org/abs/1507.02672
http://arxiv.org/abs/1511.06390
http://arxiv.org/abs/1511.06440
"do we need to appx. provide the same amount of training images per class or can the amount per class be disparate?"
It should work with different numbers of examples per class.
"what is the impact of wrong image data in the training data? E.g. 500 images of a tennis shoe and 50 of other shoes."
You should use the label smoothing technique described in this paper:
http://arxiv.org/abs/1512.00567
Smooth the labels based on your estimate of the label error rate.
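The smoothing itself is simple; a minimal sketch of the uniform label smoothing from that paper (the epsilon value and the toy labels are placeholders):

```python
import numpy as np

def smooth_labels(onehot, eps=0.1):
    """Uniform label smoothing (arXiv:1512.00567): put 1 - eps on the
    true class and spread eps evenly over all K classes; choose eps
    near your estimated label error rate."""
    k = onehot.shape[-1]
    return onehot * (1.0 - eps) + eps / k

labels = np.eye(3)[[0, 2, 1]]  # three one-hot examples, K = 3
print(smooth_labels(labels, eps=0.1))
```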
"is it possible to train a classifier with much more classes than the recently published inception-v3 model? Let's say: 30.000."
Yes
How many images per class should be provided at a minimum?
do we need to provide approximately the same number of training images per class, or can the amount per class be disparate?
what is the impact of wrong image data in the training data? E.g. 500 images of a tennis shoe and 50 of other shoes.
These three questions are not really TensorFlow-specific. But the short answer is: it depends on the resiliency of your model in handling unbalanced data sets and noisy labels.
is it possible to train a classifier with many more classes than the recently published Inception-v3 model? Let's say: 30,000.
Yes, definitely. This would mean a much larger classifier layer, so your training time might be longer. Other than that, there are no limitations in TensorFlow.

Does SVM need to do learning each time when detecting people?

I'm using SVM and HOG in OpenCV to implement people detection.
Say using my own dataset: 3000 positive samples and 6000 negative samples.
My question is: does the SVM need to do the learning each time it detects people?
If so, the training and prediction time could make this very slow. Is there any way to implement real-time people detection?
Thank you in advance.
Thank you for your answers. I have obtained the XML result after training (3000 positives and 6000 negatives), so I can just write another standalone program that uses this result with svm.load() and svm.predict()? That's great. Besides, I found that the prediction time for 1000 detection windows of size 128x64 in an image is also quite long (about 10 seconds), so how can this handle a normal surveillance-camera capture (320x240 or higher) with a scanning step size of 1 or 2 pixels in real time?
I implemented HOG according to the original paper: 8x8 pixels per cell, 2x2 cells per block (50% overlap), giving a 3780-dimensional vector per 128x64 detection window. Is the time problem caused by the huge feature vector? Should I reduce the dimensionality for each window?
This is a very specific question on a general topic.
Short answer: no, you don't need to do the learning every time you want to use an SVM. It is a two-step process. The first step, learning (in your case, providing your learning algorithm with many labeled pictures that do or do not contain people), results in a model which is used in the second step: testing (in your case, detecting people).
No, you don't have to re-train an SVM each and every time.
You do the training once, then svm.save() the trained model to an XML/YAML file.
Later you just svm.load() that instead of (re-)training, and do your predictions.
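A hedged sketch of that train-once/load-later workflow with OpenCV's Python bindings (cv2.ml, OpenCV >= 3.1; the random features below are placeholders for real HOG vectors, and the sample counts are shrunk for brevity):

```python
import cv2
import numpy as np

# Placeholder HOG features: 100 positives and 200 negatives, 3780-d each.
features = np.random.rand(300, 3780).astype(np.float32)
labels = np.r_[np.ones(100), -np.ones(200)].astype(np.int32).reshape(-1, 1)

svm = cv2.ml.SVM_create()
svm.setType(cv2.ml.SVM_C_SVC)
svm.setKernel(cv2.ml.SVM_LINEAR)
svm.train(features, cv2.ml.ROW_SAMPLE, labels)
svm.save("people_detector.xml")  # train once, persist the model

# In a separate program: load and predict, no retraining needed.
detector = cv2.ml.SVM_load("people_detector.xml")
_, predictions = detector.predict(features[:5])
print(predictions.ravel())
```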

Random Perturbation of Data to get Training Data for Neural Networks

I am working on soil spectral classification using neural networks, and I have data from my professor, obtained in his lab, consisting of spectral reflectance from wavelength 1200 nm to 2400 nm. He only has 270 samples.
I have been unable to train the network to an accuracy above 74%, since the training data is very limited (only 270 samples). I was concerned that my Matlab code was incorrect, but when I used the Neural Net Toolbox in Matlab I got the same results: nothing more than 75% accuracy.
When I talked to my professor about it, he said that he does not have any more data, but asked me to apply random perturbation to this data to obtain more data. I have researched random perturbation of data online, but have come up short.
Can someone point me in the right direction for performing random perturbation on 270 samples of data so that I can get more data?
Also, since by doing this I will be constructing 'fake' data, I don't see how the neural network would be any better; isn't the point of neural nets to train the network on actual, valid data?
Thanks,
Faisal.
I think trying to fabricate more data is a bad idea: you can't create anything with higher information content than you already have, unless you know the true distribution of the data to sample from. If you did, however, you'd be able to classify with the Bayes optimal error rate, which would be impossible to beat.
What I'd be looking at instead is whether you can alter the parameters of your neural net to improve performance. The thing that immediately springs to mind with small amounts of training data is your weight regulariser (are you even using regularised weights?), which can be seen as a prior on the weights if you're that way inclined. I'd also look at altering the activation functions if you're using simple linear activations, and at the number of hidden nodes (with so few examples I'd use very few, or even bypass the hidden layer entirely, since it's hard to learn nonlinear interactions with limited data).
While I'd not normally recommend it, you should probably use cross-validation to set these hyper-parameters, given the limited dataset size; a single 10-20% test set would give you unreliable insight. You might still hold out 10-20% for final testing, however, so as not to bias the results in your favour.
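A minimal sketch of that split-then-cross-validate recipe with scikit-learn (the spectra, labels, and parameter grid below are invented placeholders; the answer itself names no library):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder data: 270 spectra with 50 features and 3 soil classes.
rng = np.random.default_rng(0)
X, y = rng.random((270, 50)), rng.integers(0, 3, 270)

# Hold out 20% for final testing; cross-validate the rest.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(
    MLPClassifier(max_iter=1000, random_state=0),
    param_grid={"alpha": [1e-4, 1e-2, 1.0],            # L2 weight penalty
                "hidden_layer_sizes": [(3,), (5,), (10,)]},
    cv=5)
grid.fit(X_dev, y_dev)
print(grid.best_params_, grid.score(X_test, y_test))
```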
First, some general advice:
Normalize each input and output variable to [0.0, 1.0]
When using a feedforward MLP, try to use 2 or more hidden layers
Make sure the number of neurons per hidden layer is large enough that the network can capture the complexity of your data
It should always be possible to get to 100% accuracy on a training set if the complexity of your model is sufficient. But be careful, 100% training set accuracy does not necessarily mean that your model does perform well on unseen data (generalization performance).
Random perturbation of your data can improve generalization performance if the perturbation you add also occurs in practice (or at least something similar does). It works because it teaches your network how the data can look different while still belonging to the same labels.
In the case of image classification, you could rotate, scale, or add noise to the input image (the output stays the same, naturally). You will need to figure out what kind of perturbation applies to your data. For some problems this is difficult or does not yield any improvement, so you need to try it out. If it does not work, it does not necessarily mean your implementation or data are broken.
The easiest way to add random noise to your data would be to apply Gaussian noise.
I suppose your measurements have errors associated with them (a measurement without error bars has almost no meaning). For each measured value M ± ΔM you can generate a new value from N(M, ΔM), where N is the normal distribution.
This will add new points as experimental noise around the previous ones, and it will help take the experimental errors of the measurements into account in the classification. I'm not sure it's possible to know in advance how helpful this will be, though!
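A minimal sketch of that N(M, ΔM) resampling in Python (the spectra shape, the error bars, and the number of copies are all placeholders; use your instrument's real ΔM):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder spectra: 270 samples, e.g. reflectance sampled at 1200-2400 nm.
M = rng.random((270, 601))
delta_m = 0.01 * np.ones_like(M)  # assumed per-point measurement error

def perturb(M, delta_m, copies=5):
    """Draw `copies` new samples from N(M, delta_m) around each measurement."""
    return np.concatenate([rng.normal(M, delta_m) for _ in range(copies)])

augmented = perturb(M, delta_m)  # repeat the corresponding labels 5x as well
```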
