Testing an image processing algorithm on noisy data - image-processing

I wrote an image processing program that trains a classifier to recognize an object in an image. Now I want to test how my algorithm responds to noise; I would like the algorithm to have some robustness to noise.
My question: should I train the classifier on a noisy version of the training dataset, or train it on the original dataset and then measure its performance on noisy data?
Thank you.

To show the robustness of a classifier, one approach is to test the originally trained classifier on highly noisy test data. Depending on that performance, you can retrain using noisy data and then test again. For application development, if including extremely noisy samples increases accuracy, that's the way to go. The literature recommends covering as large a range of training samples as possible; however, this sometimes degrades performance in specific cases.
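As a minimal sketch of that workflow (the dataset and model here are stand-ins, not the asker's: scikit-learn's digits data and a simple pixel-based classifier), you can train on clean images, measure accuracy as Gaussian noise increases, and compare against a model trained with noise augmentation:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in image data: 8x8 digit images flattened to feature vectors.
X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def add_noise(images, sigma):
    """Additive Gaussian noise, clipped back to the valid pixel range."""
    return np.clip(images + rng.normal(0.0, sigma, images.shape), 0.0, 1.0)

# Model A: trained on clean data only.
clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Model B: trained on clean + noisy copies (noise augmentation).
X_aug = np.vstack([X_train, add_noise(X_train, 0.2)])
y_aug = np.concatenate([y_train, y_train])
aug_model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)

# Evaluate both on increasingly noisy test sets.
for sigma in (0.0, 0.1, 0.2, 0.4):
    X_noisy = add_noise(X_test, sigma)
    print(f"sigma={sigma:.1f}  clean-trained={clean_model.score(X_noisy, y_test):.3f}"
          f"  noise-augmented={aug_model.score(X_noisy, y_test):.3f}")
```

If the augmented model holds up better at the noise levels you expect in deployment, augmentation is worth keeping; if it costs accuracy on clean data, that's the trade-off mentioned above.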

Related

Which algorithm should I choose between KNN and CNN for audio binary classification?

I am doing a thesis on baby cry detection. I built models with both CNN and KNN; the CNN's train accuracy is 99% and test accuracy is 98%, while KNN's train and test accuracy are both 98%.
Please suggest which algorithm I should choose, and why.
In KNN, the output relies entirely on the nearest neighbours, which may or may not be a good choice. KNN is also sensitive to the choice of distance metric; discussions of its distance metrics are worth reading (see the sketch below).
CNNs, on the other hand, extract features from the input data, which is very helpful for analysis. Given the recent success of CNNs in audio applications, notably WaveNet, I would prefer the CNN.
Edit: Considering your data size, a CNN is not a good option here.
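To see the distance-metric sensitivity concretely, here is a small hedged sketch (synthetic features standing in for extracted audio features such as MFCCs; scikit-learn's KNeighborsClassifier exposes the metric as a parameter):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary task standing in for extracted audio features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Same k, different distance metrics: accuracy can shift noticeably.
for metric in ("euclidean", "manhattan", "chebyshev", "cosine"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"{metric:>10}: {score:.3f}")
```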

Stratified sampling for regression

I need to do regression analysis using SVM kernels on large sets of data. My laptop is not able to handle it, and it takes hours to finish running. Is there a good way to reduce the dataset size without affecting the quality of the model (much)? Would stratified sampling work?
There are dozens of ways of reducing SVM complexity; probably the easiest ones involve approximating the kernel-space projection. In particular, libraries such as scikit-learn provide functions to do this kind of explicit projection, which, followed by a linear SVM, can be trained relatively fast.
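As a hedged sketch of that approach (synthetic data standing in for the asker's dataset): scikit-learn's Nystroem transformer approximates an RBF kernel map explicitly, and LinearSVR is the fast linear counterpart of kernel SVR:

```python
from sklearn.datasets import make_regression
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVR

# Synthetic regression data standing in for the large real dataset.
X, y = make_regression(n_samples=20000, n_features=50, noise=0.1,
                       random_state=0)

# Approximate the RBF kernel map explicitly, then fit a linear SVR on it.
# This scales far better with n_samples than exact kernel SVR.
model = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.02, n_components=300, random_state=0),
    LinearSVR(C=1.0, max_iter=5000),
)
model.fit(X, y)
print("R^2 on training data:", model.score(X, y))
```

Here n_components controls the accuracy/speed trade-off of the approximation; a few hundred components is a common starting point.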

Machine learning: RandomForest data pre-processing

Before fitting a RandomForest, what should be done with continuous features? Should they be standard scaled?
No. Decision tree approaches, and Random Forests for that matter, don't really care whether they are dealing with continuous or discrete data, because splits depend only on the ordering of feature values. So even if you don't standardize, it won't be an issue.
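A quick way to convince yourself (a hedged sketch on stand-in data): since tree splits depend only on feature ordering, a linear rescaling such as standard scaling should leave predictions unchanged, up to floating-point ties:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Same forest (same random_state) fit on raw vs. standard-scaled features.
raw = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
scaled = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    StandardScaler().fit_transform(X), y)

X_new = X[:100]
X_new_scaled = StandardScaler().fit(X).transform(X_new)
same = np.mean(raw.predict(X_new) == scaled.predict(X_new_scaled))
print(f"Fraction of identical predictions: {same:.3f}")  # expected ~1.0
```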

How can I know whether my training data is enough for machine learning?

For example, if I want to train a classifier (maybe an SVM), how many samples do I need to collect? Is there a method for measuring this?
It is not easy to know in advance how many samples you need to collect. However, you can follow these steps for a typical ML problem:
1. Build a dataset with a few samples. How many depends on the kind of problem you have; don't spend a lot of time on this now.
2. Split your dataset into train, cross-validation, and test sets, and build your model.
3. Now that you've built the ML model, evaluate how good it is: calculate your test error.
4. If your test error is above what you are willing to accept, collect more data and repeat steps 1-3 until you reach a test error rate you are comfortable with.
This method works as long as your model is not suffering from "high bias"; a learning curve (see the sketch below) shows whether adding data is likely to help.
This video from Coursera's Machine Learning course explains it.
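As a hedged sketch of that learning-curve check (scikit-learn's learning_curve; the estimator and dataset are placeholders): if the validation score is still rising as the training set grows, more data should help; if the train and validation scores have converged and plateaued, more data probably won't fix a high-bias model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Train and validation scores at increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="rbf"), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:5d}  train={tr:.3f}  validation={va:.3f}")
```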
Unfortunately, there is no simple method for this.
The rule of thumb is the bigger, the better, but in practice you have to gather a sufficient amount of data. By sufficient I mean covering as large a part of the modeled space as you consider acceptable.
Also, quantity is not everything. The quality of the samples is very important too; for example, training samples should not contain duplicates.
Personally, when I don't have all possible training data at once, I gather some training data and train a classifier. Then, if the classifier quality is not acceptable, I gather more data, and so on.
There is also published work on estimating training set quality.
This depends a lot on the nature of the data and the prediction you are trying to make, but as a simple rule to start with, your training data should be roughly 10x the number of your model parameters. For instance, when training a logistic regression with N features, try to start with 10N training instances.
For an empirical derivation of the "rule of 10", see
https://medium.com/@malay.haldar/how-much-training-data-do-you-need-da8ec091e956

OpenCV: Training a soft cascade classifier

I've built an algorithm for pedestrian detection using OpenCV tools. To perform classification I use a boosted classifier trained with the CvBoost class.
The problem with this implementation is that I need to feed my classifier the whole set of features I used for training. This makes the algorithm extremely slow, so much so that each image takes around 20 seconds to be fully analysed.
I need a different detection structure, and OpenCV has a Soft Cascade class that seems like exactly what I need. Its basic principle is that there is no need to examine all the features of a test sample, since the detector can reject most negative samples using only a small number of features. The problem is that I have no idea how to train one given a fully labeled set of negative and positive examples.
I can find no information about this online, so I am looking for any tips you can give me on how to use this soft cascade for classification.
Best regards
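The early-rejection principle the question describes can be sketched in plain Python. This is only an illustration of the idea, not OpenCV's soft cascade API or its training procedure: a boosted classifier's weak-learner responses are summed in order, and a candidate is rejected as soon as the running sum falls below a per-stage rejection threshold:

```python
import numpy as np

def soft_cascade_score(weak_responses, rejection_thresholds):
    """Evaluate weak learners in order; reject early when the running
    sum of responses drops below the stage's rejection threshold.

    weak_responses: per-sample responses of each weak learner, in order.
    rejection_thresholds: one (typically increasing) threshold per stage,
        calibrated during training so almost no positives are rejected early.
    """
    running_sum = 0.0
    for t, (response, threshold) in enumerate(
            zip(weak_responses, rejection_thresholds)):
        running_sum += response
        if running_sum < threshold:
            return running_sum, t + 1  # rejected after t+1 weak learners
    return running_sum, len(weak_responses)  # survived the full cascade

# Toy example: 100 weak-learner responses for one candidate window.
rng = np.random.default_rng(0)
responses = rng.normal(-0.05, 1.0, size=100)   # a negative-looking sample
thresholds = np.linspace(-5.0, 0.0, 100)       # hypothetical calibration
score, used = soft_cascade_score(responses, thresholds)
print(f"score={score:.2f} after evaluating {used} of 100 weak learners")
```

The speedup comes from most negative windows being rejected after only a handful of weak learners, which is exactly the behaviour the question is after.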
