Combining multiple Haar Classifiers with OpenCV - image-processing

I was wandering if there is a way to combine Haar-Classifiers from different trained cascades?
I have a scenario, where I detect one object that differs depending on the angle of the object. So I separated my training samples to train multiple classifiers. They work OK for their classes. Right now I run them sequentially which is costing me a lot of calculation time.
I figured that OpenCV is probably calculating all the features every time thus iterating newly every time. I thought, if I could combine my classifiers by an OR operation, then OpenCV might be able to just use one cascade thus only iterating once and only calculating the needed features once and so on. This might increase my performance dramatically. However I am not sure if (and how) this could be done. Maybe someone else has tried something similar before?
Cheers!
-- artur

Well, when you train a specific classifier, AdaBoost algorithm (at every stage) picks different features to minimize training error. That procedure is done for every stage of an cascade.
Unfortunately for every object those features are not the same (different size although you have fixed number of feature shapes), thus a feature space is not the same either.
So even that there is a way to combine those classifiers, the benefit would be marginal because you probably do not have the same features for different objects so you would need evaluate almost every feature again.

I run each of mine as a separate parallel task.

I don't wait for all but process each as they end by raising an event.

Related

Are there any ways to build an ML model using CBIR and SIFT for image comparison in my case?

I have this project I'm working on. A part of the project involves multiple test runs during which screenshots of an application window are taken. Now, we have to ensure that screenshots taken between consecutive runs match (barring some allowable changes). These changes could be things like filenames, dates, different logos, etc. within the application window that we're taking a screenshot of.
I had the bright idea to automate the process of doing this checking. Essentially my idea was this. If I could somehow mathematically quantify the difference between a screenshot from the N-1th run and the Nth run, I could create a binary labelled dataset that mapped feature vectors of some sort to a label (0 for pass or 1 for fail if the images do not adequately match up). The reason for all of this was so that my labelled data would help make the model understand what scale of changes are acceptable, because there are so many kinds that are acceptable.
Now lets say I have access to lots of data that I have meticulously labelled, in the thousands. So far I have tried using SIFT in opencv using keypoint matching to determine a similarity score between images. But this isn't an intelligent, learning process. Is there some way I could take some information from SIFT and use it as my x-value in my dataset?
Here are my questions:
what would that be the information I need as my x-value? It needs to be something that represents the difference between two images. So maybe the difference between feature vectors from SIFT? What do I do when those vectors are of slightly different dimensions?
Am I on the right track with thinking about using SIFT? Should I look elsewhere and if so where?
Thanks for your time!
The approach that is being suggested in the question goes like this -
Find SIFT features of two consecutive images.
Use those to somehow quantify the similarity between two images (sounds reasonable)
Use this metric to first classify the images into similar and non-similar.
Use this dataset to train a NN do to the same job.
I am not completely convinced if this is a good approach. Let's say that you created the initial classifier with SIFT features. You are then using this data to train a NN. But this data will definitely have a lot of wrong labels. Because if it didn't have a lot of wrong labels, what's stopping you from using your original SIFT based classifier as your final solution?
So if your SIFT based classification is good, why even train a NN? On the other hand, if it's bad, you are giving a lot of wrong labeled data to the NN for training. I think the latter is a probably a bad idea. I say probably because there is a possibility that maybe the wrong labels just encourage the NN to generalize better, but that would require a lot of data, I imagine.
Another way to look at this is, let's say that your initial classifier is 90% accurate. That's probably the upper limit of the performance for the NN that you are looking at when talking about training it with this data.
You said that the issue that you have with your first approach is that 'it's not a an intelligent, learning process'. I think it's the wrong approach to think that the former approach is always inferior to the latter. SIFT is a powerful tool that can solve a lot of problems without all the 'black-boxness' of an NN. If this problem can be solved with sufficient accuracy using SIFT, I think going after a learning based approach is not the way to go, because again, a learning based approach isn't necessarily superior.
However, if the SIFT approach isn't giving you good enough results, definitely start thinking of NN stuff, but at that point, using the "bad" method to label the data is probably a bad idea.
Also in relation, I think you could potentially be underestimating the amount of data that is needed for this. You mentioned data in the thousands, but that's honestly, not a lot. You would need a lot more, I think.
One way I would think about instead doing this -
Do SIFT keyponits detection for a sample reference image.
Manually filter out keypoints that does not belong to the things in the image that are invariant. That is, just take keypoints at the locations in the image that is guaranteed (or very likely) to be always present.
When you get a new image, compute the keypoints and do matching with the reference image.
Set some threshold of the ratio of good matches to the total number of matches.
Depending on your application, this might give you good enough results.
If not, and if you really want your solution to be NN based, I would say you need to manually label the dataset as opposed to using SIFT.

How to clarify which model layers to use for machine learning?

We are currently doing a little experiment with machine learning with Deeplearning4j.
We have voltage measurements in time series from different devices that I know that depends on each other.
We manage to labeling huge amount of those data with one and zeroes.
Our problem is to figure out the use of layers for the model.
For us it seems that it is experience that it is used among people and examples seems to be random.
We currently using LSTM and RNN
But how can we clarify if there is better models?
We would like to see if the model can figure out some dependencies through predictions that we haven’t noticed.
The best way to go about this, is to start by looking at your data and what you want to get out of it. Then you should start out by setting up a base line. Use the simplest possible modelling technique you are familiar with just so you have anything at all.
In your case it looks like you have a label for each timestep. So, you might just use simple linear regression for each timestep separately to get a feel for what you would get if you don't incorporate any sequence information at all. Anything that works fast is a good candidate for this step.
Once you have that baseline, you can start looking at building a deeplearning model that outperforms this baseline.
For time series data, you have two options at the moment in DL4J, either you use a recurrent layer like LSTM, or you use convolutions over time.
If you want to have an output at each timestep, then a recurrent layer is probably better for you. The convolutional approach usually works best if you want to have just a single result after reading in the whole sequence.
For choosing how wide those layers should be, and how many layers you should use, you will have to experiment a bit.
The first thing that you want to achieve is to build a model that can over-fit on a subset of your data. So you start out, by passing in only a single batch of examples over and over again. If the model can't overfit on that, you make the layers wider. If the layers start getting too wide, you add another layer on top.
If you use the deeplearning4j-ui module, it will tell you how many parameters your model currently has. They should usually be less than the number of total examples you have, or you risk overfitting on your full data set.
As soon as you can train a model to overfit on a small subset of your data, you can start training it with all of your data.
At that point you can then start looking into finding better hyperparameters and seeing by how much you can beat your baseline.

Multiple cross-validation + testing on a small dataset to improve confidence

I am currently working on a very small dataset of about 25 samples (200 features) and I need to perform model selection and also have a reliable classification accuracy. I was planning to split the dataset in a training set (for a 4-fold CV) and a test set (for testing on unseen data). The main problem is that the resulting accuracy obtained from the test set is not reliable enough.
So, performing multiple time the cross-validation and testing could solve the problem?
I was planning to perform multiple times this process in order to have a better confidence on the classification accuracy. For instance: I would run one cross-validation plus testing and the output would be one "best" model plus the accuracy on the test set. The next run I would perform the same process, however, the "best" model may not be the same. By performing this process multiple times I eventually end up with one predominant model and the accuracy will be the average of the accuracies obtained on that model.
Since I never heard about a testing framework like this one, does anyone have any suggestion or critics on the algorithm proposed?
Thanks in advance.
The algorithm seems interesting but you need to make lots of passes through data and ensure that some specific model is really dominant (that it surfaces in real majority of tests, not just 'more than others'). In general, in ML a real problem is having too little data. As anyone will tell you, not the team with the most complicated algorithm wins, but the team with biggest amount of data.
In your case I would also suggest one additional approach - bootstrapping. Details are here:
what is the bootstrapped data in data mining?
Or can be googled. Long story short it is a sampling with replacement, which should help you to expand your dataset from 25 samples to something more interesting.
When the data is small like yours you should consider 'LOOCV' or leave one out cross validation. In this case you partition the data into 25 different samples where and each one a single different observatin is held out. Performance is then calcluated using the 25 individual held out predictions.
This will allow you to use the most data in your modeling and you will still have a good measure of performance.

Should I remove test samples that are identical to some training sample?

I've been having a bit of a debate with my adviser about this issue, and I'd like to get your opinion on it.
I have a fairly large dataset that I've used to build a classifier. I have a separate, smaller testing dataset that was obtained independently from the training set (in fact, you could say that each sample in either set was obtained independently). Each sample has a class label, along with metadata such as collection date and location.
There is no sample in the testing set that has the same metadata as any sample in the training set (as each sample was collected at a different location or time). However, it is possible that the feature vector itself could be identical to some sample in the training set. For example, there could be two virus strains that were sampled in Africa and Canada, respectively, but which both have the same protein sequence (the feature vector).
My adviser thinks that I should remove such samples from the testing set. His reasoning is that these are like "freebies" when it comes to testing, and may artificially boost the reported accuracy.
However, I disagree and think they should be included, because it may actually happen in the real world that the classifier sees a sample that it has already seen before. To remove these samples would bring us even further from reality.
What do you think?
It would be nice to know if you're talking about a couple of repetitions in million samples or 10 repetitions in 15 samples.
In general I don't find what you're doing reasonable. I think your advisor has a very good point. Your evaluation needs to be as close as possible to using your classifier outside your control -- You can't just assume your going to be evaluated on a datapoint you've already seen. Even if each data point is independent, you're going to be evaluated on never-before-seen data.
My experience is in computer vision, and it would be very highly questionable to train and test with the same picture of a one subject. In fact I wouldn't be comfortable training and testing with frames of the same video (not even the same frame).
EDIT:
There are two questions:
The distribution permits that these repetitions naturally happen. I
believe you, you know your experiment, you know your data, you're
the expert.
The issue that you're getting a boost by doing this and that this
boost is possibly unfair. One possible way to address your advisor's
concerns is to evaluate how significant a leverage you're getting
from the repeated data points. Generate 20 test cases 10 in which
you train with 1000 and test on 33 making sure there are not
repetitions in the 33, and another 10 cases in which you train with
1000 and test on 33 with repetitions allowed as they occur
naturally. Report the mean and standard deviation of both
experiments.
It depends... Your adviser suggested the common practice. You usually test a classifier on samples which have not been used for training. If the samples of the test set matching the training set are very few, your results are not going to have statistical difference because of the reappearance of the same vectors. If you want to be formal and still keep your logic, you have to prove that the reappearance of the same vectors has no statistical significance on the testing process. If you proved this theoretically, I would accept your logic. See this ebook on statistics in general, and this chapter as a start point on statistical significance and null hypothesis testing.
Hope I helped!
In as much as the training and testing datasets are representative of the underlying data distribution, I think it's perfectly valid to leave in repetitions. The test data should be representative of the kind of data you would expect your method to perform on. If you genuinely can get exact replicates, that's fine. However, I would question what your domain is where it's possible to generate exactly the same sample multiple times. Are your data synthetic? Are you using a tiny feature set with few possible values for each of your features, such that different points in input space map to the same point in feature space?
The fact that you're able to encounter the same instance multiple times is suspicious to me. Also, if you have 1,033 instances, you should be using far more than 33 of them for testing. The variance in your test accuracy will be huge. See the answer here.
Having several duplicate or very similar samples seems somewhat analogous to the distribution of the population you're attempting to classify being non-uniform. That is, certain feature combinations are more common than others, and the high occurrence of them in your data is giving them more weight. Either that, or your samples are not representative.
Note: Of course, even if a population is uniformly distributed there is always some likelihood of drawing similar samples (perhaps even identical depending on the distribution).
You could probably make some argument that identical observations are a special case, but are they really? If your samples are representative it seems perfectly reasonable that some feature combinations would be more common than others (perhaps even identical depending on your problem domain).

Using a feature as Input vs. using it to build Several Machines on SVM

I am an undergraduate student and for my graduation thesis I am using SVM to predict the arrival time of a bus to a bus stop in its route. After doing a lot of research and reading some papers I still have a key doubt about how to model my system.
We've decided which features to use and we are in the process of gathering the data required to perform the regression, but what is confusing us are the implications or consequences of using some features as input for the SVM or building separated machines based on some of these features.
For instance, in this paper the authors built 4 SVMs for predicting bus arrival times: one for rush hour on sunny days, rush hour on rainy days, off-rush hour on sunny days and the last one for off-rush hours and rainy days.
But on a following paper on the same subejct they decided to use a single SVM with the weather condition and the rush/off-rush hour as input instead of breaking it in 4 SVMs as before.
I feel like this is the kind of thing that is more about experience so I would like to hear from you guys if anyone has any information about when to choose one of these approaches.
Thanks in advance.
There is no other way: you have to find out on your own. This is why you have to write this thesis. Nobody starts with a perfect solution. Everyone makes mistakes. Your problem is not easy and you cannot say what will work when you have never done anything similar. Try everything you found in the literature, compare the results, develop your own ideas, ...
Most important question: what is the data like?
Second question: what model do you expect to capture this?
So if you want to use SVMs for some reason, keep in mind their basic mechanism is linear, and can only capture non-linear phenomena if data is transformed by a suitable kernel.
For a particular problem at hand that means:
Do you have reason (plots, insights in the problem nature) to believe your problem is linear(ly separable)? Just use one linear svm.
Do you have reason your problem consist of several linear subproblems? Use a linear svm on each of the subproblems.
Does your data seem non-linearly grouped? Try an svm with something like rbf kernel.
Of course, you can just plug in and try, but checking the above may increase understanding of the problem.
In your particular problem I would go for single SVM.
With my not so extensive experience, I would consider breaking a problem in several SVMs for following reasons:
1)The classes are too different, or there are classes and subclasses in your problem.
E.g. in my case: there are several types of antibodies in a microscope image and they all may be positive or negative. So instead of defining A_Pos, A_Neg, B_Pos, B_Neg, ... I decide first if the image is positive or negative and determine the type in second SVM.
2)The feature extraction is too expensive. Provided you have groups of classes, which may be identified with fever features. Instead of extracting all features for a single machine, you may first extract only a small subset, and if required (result not with high enough probability) extract further features.
3)Decide whether the instance belongs to problem at all. Make a model containing one class and all instances of training set. If the instance to be classified is an outlier, stop. Otherwise classify with 2nd SVM containing all classes.
The key-word is "cascaded SVM"

Resources