Can we train a binary classifier with "A" to classify "a"? - machine-learning

I have a possibly naive question about the appropriateness of binary classification. This is a hypothetical example, so forgive me if it is too coarse.
Let’s say I want to train a support vector machine to distinguish the letter "a" from the letters "b" and "c". The problem is that I don’t have any letters "a" to train my SVM model, but I do have a lot of letters "A", which basically mean the same thing. Is a model like this valid?
My real problem is within the realm of image processing. I have some pixels that I want to classify using texture, but I don’t have a "class" for those pixels. I do have ground truth for something very similar (whose class I know), and I have been using the same kind of texture features (but from those similar pixels) to train an SVM and do the classification. This SVM works reasonably well (assessed by visual inspection of the images), with accuracies up to 90%. The problem, I think, is that this accuracy is tested on the training "A" and not on "a".
Does this make any sense?
Is something like this valid? Or is there anything else I can do to make this a valid classification? How do you test accuracy in something like this?

Theoretically, a binary classifier with a linear decision boundary can't distinguish between two classes that are not linearly separable in the feature space; this could be why you get low accuracy.
If you want to distinguish between several textures, one approach could be to train a binary classifier for each texture, and then use a voting scheme to decide the label of a new texture, as in the sketch below.
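For illustration, here is a minimal scikit-learn sketch of that idea, one binary SVM per texture plus a confidence vote; the feature matrix X and label vector y are placeholders for whatever texture features you extract:

# One-vs-rest binary SVMs with a simple confidence vote over textures.
import numpy as np
from sklearn.svm import SVC

def train_per_texture(X, y):
    """Train one binary SVM per texture label (that texture vs. the rest)."""
    classifiers = {}
    for label in np.unique(y):
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(X, (y == label).astype(int))  # 1 = this texture, 0 = rest
        classifiers[label] = clf
    return classifiers

def vote(classifiers, x):
    """Assign the texture whose classifier is most confident about x."""
    scores = {label: clf.predict_proba(x.reshape(1, -1))[0, 1]
              for label, clf in classifiers.items()}
    return max(scores, key=scores.get)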

Related

Is doc vector learned through PV-DBOW equivalent to the average/sum of the word vectors contained in the doc?

I've seen some posts saying that the average of the word vectors performs better on some tasks than the doc vectors learned through PV-DBOW. What is the relationship between a document's vector and the average/sum of its word vectors? Can we say that the document vector d is approximately equal to the average or sum of its word vectors?
Thanks!
No. The PV-DBOW vector is calculated by a different process, based on how well the PV-DBOW-vector can be incrementally nudged to predict each word in the text in turn, via a concurrently-trained shallow neural network.
But, a simple average-of-word-vectors often works fairly well as a summary vector for a text.
So, let's assume both the PV-DBOW vector and the simple-average-vector are of the same dimensionality. Since they're bootstrapped from exactly the same inputs (the same list of words), and the neural-network isn't significantly more sophisticated (in its internal state) than a good set of word-vectors, the performance of the vectors on downstream evaluations may not be very different.
For example, if the training data for the PV-DBOW model is meager, or meta-parameters not well optimized, but the word-vectors used for the average-vector are very well-fit to your domain, maybe the simple-average-vector would work better for some downstream task. On the other hand, a PV-DBOW model trained on sufficient domain text could provide vectors that outperform a simple-average based on word-vectors from another domain.
Note that FastText's classification mode (and similar modes in Facebook's StarSpace) actually optimizes word-vectors to work as parts of a simple-average-vector used to predict known text-classes. So if your end-goal is to have a text-vector for classification, and you have a good training dataset with known-labels, those techniques are worth considering as well.
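As a toy illustration of the two processes, here is a sketch with gensim (4.x API assumed); the corpus is a stand-in, and dm=0 selects PV-DBOW training:

# Compare a learned PV-DBOW doc vector with a simple average of word vectors.
import numpy as np
from gensim.models import Doc2Vec, Word2Vec
from gensim.models.doc2vec import TaggedDocument

corpus = [["the", "cat", "sat"], ["dogs", "bark", "loudly"]]  # toy stand-in

# PV-DBOW: dm=0 selects the distributed bag-of-words training mode.
tagged = [TaggedDocument(words, [i]) for i, words in enumerate(corpus)]
dbow = Doc2Vec(tagged, dm=0, vector_size=50, min_count=1, epochs=40)
doc_vec = dbow.dv[0]                  # vector learned for document 0

# Average-of-word-vectors summary for the same document.
w2v = Word2Vec(corpus, vector_size=50, min_count=1, epochs=40)
avg_vec = np.mean([w2v.wv[w] for w in corpus[0]], axis=0)

# The two vectors come from different processes: they are generally not
# equal, even though both summarize exactly the same list of words.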

Is it possible to train a binary classification neural network by feeding it input from only one class?

I'm basically trying to create a neural network that should tell me whether an input I'm giving it is valid or not. The problem is that I only have valid input with which I can train it.
Right now I am trying to come up with a working dense model that validates only MNIST digits between 0 and 4; all other digits should be seen as invalid. My first attempt was to train it with digits between 0 and 4 as valid and images of random pixels as invalid (with the same percentage of black pixels as a normal image), but unfortunately it doesn't work: when I test it with digits between 5 and 9, they are seen as valid.
So I'm starting to think if it's even possible to train a neural network this way.
Also, I realize there might be better ways to do this, maybe with an autoencoder or a different kind of network, but right now I want to try this with only dense layers.
Thank you.
What you are looking for is one-class classification, also known as unary classification or class-modelling.
A quick Google search suggests training an autoencoder and counting an object as in-class if its reconstruction error is below a specific threshold.
But before building something like that, I would suggest trying One-Class K-Nearest Neighbors or a One-Class SVM first to see whether you get acceptable results; a sketch follows below. If so, you can then improve on those results with the far-more-complicated-to-develop autoencoder solution.
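As a quick-start baseline, here is a minimal one-class SVM sketch with scikit-learn; the random arrays are placeholders for your flattened digit images, and nu=0.05 is just an assumed starting value:

# Train only on "valid" digits (0-4) and flag everything else as invalid.
import numpy as np
from sklearn.svm import OneClassSVM

X_valid = np.random.rand(500, 784)   # placeholder: flattened digits 0-4
X_test = np.random.rand(100, 784)    # placeholder: mix of all digits

# nu roughly bounds the fraction of training points treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(X_valid)

pred = ocsvm.predict(X_test)  # +1 = looks like the training class, -1 = invalid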

Binary classification with a sparse binary matrix

My crime classification dataset has indicator features, such as has_rifle.
The job is to train a model and predict whether data points are criminals or not. The metric is weighted mean absolute error: if the person is a criminal and the model predicts him/her as not, the weight is as large as 5; if the person is not a criminal and the model predicts he/she is, the weight is 1; otherwise the model predicts correctly, and the weight is 0.
I've used the classif:multinom method in mlr in R, and tuned the threshold to 1/6. The result is not that good. Adaboost is slightly better, though neither is perfect.
I'm wondering which method is typically used for this kind of binary classification problem with a sparse {0,1} matrix? And how can I improve the performance as measured by the weighted mean absolute error metric?
Dealing with sparse data is not a trivial task; the lack of information makes it difficult to capture properties such as variance. I would suggest looking into subspace clustering methods or, more specifically, soft subspace clustering. The latter usually identifies relevant/irrelevant data dimensions, and it is a good approach when you want to improve classification accuracy.
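One thing worth noting, independent of which model you use: the 1/6 threshold is exactly the cost-optimal cutoff for this metric, since predicting "criminal" has lower expected cost once p * 5 >= (1 - p) * 1, i.e. p >= 1/6. A small Python sketch of the metric and the cost-derived threshold (your pipeline is in R/mlr, so this is only illustrative):

# Weighted mean absolute error with 5:1 error costs, per the question:
# missed criminal (false negative) costs 5, false alarm costs 1, hits cost 0.
import numpy as np

def weighted_mae(y_true, y_pred):
    weights = np.where((y_true == 1) & (y_pred == 0), 5.0,
              np.where((y_true == 0) & (y_pred == 1), 1.0, 0.0))
    return weights.mean()

def predict_with_cost_threshold(probs, threshold=1/6):
    # Label as criminal whenever P(criminal) clears the cost-derived cutoff.
    return (probs >= threshold).astype(int)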

Methods to ignore missing word features on test data

I'm working on a text classification problem, and I have trouble with missing values for some features.
I'm calculating class probabilities of words from labeled training data.
For example:
Say the word foo belongs to class A 100 times and to class B 200 times. In this case, I find the class probability vector [0.33, 0.67] and give it, along with the word itself, to the classifier.
The problem is that, in the test set, there are some words that were not seen in the training data, so they have no probability vectors.
What can I do about this problem?
I've tried giving the average class probability vector of all words for missing values, but it did not improve accuracy.
Is there a way to make the classifier ignore some features during evaluation, just for the specific instances that have no value for a given feature?
Regards
There are many ways to achieve that:

1. Create and train classifiers for every subset of features you have. You can train each subset classifier on the same data as the main classifier. For each sample, look at the features it has and use the classifier that fits it best. Don't try to do boosting with those classifiers.
2. Create a special class for samples that can't be classified, or whose results are too poor with so few features. Sometimes even humans can't successfully classify samples; in many cases, samples that can't be classified should simply be ignored. The problem is not in the classifier but in the input, or can be explained by the context.
3. From an NLP point of view, many words have very similar meanings/usages across applications, so you can use stemming/lemmatization to create classes of words (see the sketch after this list). You can also use syntactic corrections, synonyms, and translations (does the word come from another part of the world?).

If this problem is important enough to you, you will likely end up with a combination of the three previous points.
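Here is a rough Python sketch of the back-off idea from point 3: fall back from an unseen word to its stem, and then to the overall class prior, instead of dropping the feature entirely. It uses NLTK's SnowballStemmer; class_probs, stem_probs, and prior are hypothetical structures you would build from your training counts:

# Back off from exact word -> stem -> global prior for unseen words.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def lookup(word, class_probs, stem_probs, prior):
    """Return a class probability vector, backing off for unseen words."""
    if word in class_probs:        # seen in training: use its own vector
        return class_probs[word]
    stem = stemmer.stem(word)
    if stem in stem_probs:         # unseen word, but its stem is known
        return stem_probs[stem]
    return prior                   # otherwise fall back to the class prior

# Example with the vector from the question (hypothetical counts):
class_probs = {"foo": [0.33, 0.67]}
stem_probs = {stemmer.stem(w): v for w, v in class_probs.items()}
print(lookup("fooing", class_probs, stem_probs, prior=[0.5, 0.5]))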

SVM for HOG descriptors in OpenCV

I am trying to classify the yard digits on a football field. I am able to detect them well (with a different method). I have a minimal bounding box drawn around the tens-place digits '1,2,3,4,5'. My goal is to classify them.
I've been trying to train an SVM classifier on HOG features I extract from the training set. A small subset of my training digits is here: http://ssadanand.imgur.com/all/
While training, I visualize my HOG descriptors and they look correct. I use a 64x128 training window and the other default parameters of OpenCV's HOGDescriptor.
Once I train my images (50 samples per class, 5 classes), I have a 250x3780 training matrix and a 1x250 label vector holding the class labels, which I feed to a CvSVM object. Here is where I have a problem.
I tried using the default CvSVMParams() while using CvSVM. Terrible performance when tested on the training set itself!
I tried customizing my CvSVMParams like this:
CvSVMParams params = CvSVMParams();
params.svm_type = CvSVM::EPS_SVR;    // epsilon support vector regression
params.kernel_type = CvSVM::POLY;    // polynomial kernel
params.C = 1;                        // regularization strength
params.p = 0.5;                      // epsilon-tube width (regression only)
params.degree = 1;                   // polynomial degree
and different variations of these parameters, and my SVM classifier performs terribly even when I test on the training set!
Can somebody help me out with parameterizing my SVM for this 5 class classifier?
I don't understand which kernel and which SVM type I should use for this problem. Also, how in the world am I supposed to find the values of C, p, and degree for my SVM?
I would assume this is an extremely easy classification problem, since all my objects are nicely bounded in a box with fairly good resolution, and the classes, i.e. the digits 1,2,3,4,5, are fairly distinct in appearance. I don't understand why my SVM is doing so poorly. What am I missing here?
A priori and without experimentation, it's very hard to give you good parameters, but I can give you some ideas.
First, you want a multi-class classifier, but you are using a regression algorithm (EPS_SVR). Not that you can't do that, but it is usually easier to start with C-SVC first.
Second, I would recommend using an RBF kernel instead of a polynomial one. Poly is very hard to get right, and RBF usually does a better job out of the box.
Third, I would play with several values of C. Don't be shy: try a bigger C (such as 100), which will force the algorithm to pick more support vectors. It can lead to overfitting, but if you can't even make the algorithm learn the training set, that's not your immediate problem.
Fourth, I would reduce the dimensionality of the images at first, and then, once you have a more stable model, try the original dimensionality again if needed.
I really recommend reading the LibSVM guide, which is very easy to follow: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
Hope it helps!
EDIT:
I forgot to mention that a good way to pick parameters for an SVM is to perform cross-validation: http://en.wikipedia.org/wiki/Cross-validation_(statistics)
http://www.autonlab.org/tutorials/overfit10.pdf
http://www.youtube.com/watch?v=hihuMBCuSlU
http://www.youtube.com/watch?v=m5StqDv-YlM
EDIT2:
I know it's silly because it's in the title of the question, but I didn't realize you were using HOG descriptors until you pointed it out in the comments.
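To make the cross-validation suggestion concrete, here is a minimal grid-search sketch using scikit-learn's SVC rather than OpenCV's CvSVM, so treat it as illustrative, not a drop-in; the random matrix merely stands in for the asker's 250x3780 HOG feature matrix:

# Cross-validated search over C and gamma for an RBF C-SVC.
# Replace the random placeholders with the real HOG features and labels.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.rand(250, 3780)        # placeholder for the HOG feature matrix
y = np.repeat(np.arange(5), 50)      # 5 digit classes, 50 samples each

param_grid = {
    "C": [1, 10, 100, 1000],         # include generously large C values
    "gamma": [1e-4, 1e-3, 1e-2, "scale"],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)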
