Beginner at machine learning here! I'd just like to get a sense of how I should approach a classification problem. Given that the problem at hand is to classify whether an object belongs to class A or class B, I am wondering whether I should use a generative or a discriminative model. I have 2 questions.
A discriminative model seems to do a better job at classification problems because it is purely concerned with how the decision boundary is drawn and nothing else.
Q: However, with a small dataset of around 80 class A objects and fewer than 10 class B objects for training and testing, would a discriminative model overfit, such that a generative model would perform better?
Also, with a very large difference between the number of class A and class B objects, the trained model is likely to only pick up on class A objects. Even if the model classifies every object as class A, it would still achieve a very high accuracy score.
Q: Any ideas on how to reduce this bias, given that there is no other way of increasing the size of class B's dataset?
Related
I have a use case where text needs to be classified into one of three categories. I started with Naive Bayes [Apache OpenNLP, Java], but I was informed that the algorithm is biased, meaning that if my training data has 60% of the data as classA, 30% as classB and 10% as classC, then the algorithm tends to be biased towards classA, and thus predicts texts of the other classes to be of classA.
If this is true, is there a way to overcome this issue?
There are other algorithms that I came across, like the SVM classifier or logistic regression (maximum entropy model), however I am not sure which will be more suitable for my use case. Please advise.
> Is there a way to overcome this issue?
Yes, there is. But first you need to understand why it happens.
Basically your dataset is imbalanced.
An imbalanced dataset means the number of instances of one class is higher than that of the others; in other words, the number of observations is not the same for all the classes in a classification dataset.
In this scenario, your model becomes biased towards the class with the majority of the samples, because you have more training data for that class.
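To see the problem concretely, here is a toy check in plain Python, using the 80-vs-10 split from the first question above:

```python
# A "model" that always predicts the majority class already scores ~89%
# accuracy on an 80-vs-10 dataset, while being useless for class B.
y_true = ["A"] * 80 + ["B"] * 10
y_pred = ["A"] * 90                 # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                     # 0.888... -> looks good, but isn't
```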
Solutions
Under sampling:
Randomly remove samples from the majority class to make the dataset balanced.
Over sampling:
Add more samples of the minority classes to make the dataset balanced.
Change performance metrics:
Use F1-score, `recall` or `precision` to measure the performance of your model.
There are a few more solutions; if you want to know more, refer to this blog.
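As a concrete illustration of the over-sampling and metric suggestions above, here is a minimal sketch with scikit-learn on synthetic data (all numbers and names are illustrative, not a recommendation of specific settings):

```python
# Randomly over-sample the minority class in the training split only,
# then score with F1 instead of plain accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(90, 5)                        # synthetic features
y = np.array([0] * 80 + [1] * 10)           # 80 majority, 10 minority

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Resample the minority class up to the majority count
X_min, y_min = X_train[y_train == 1], y_train[y_train == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=int((y_train == 0).sum()), random_state=0)

X_bal = np.vstack([X_train[y_train == 0], X_up])
y_bal = np.concatenate([y_train[y_train == 0], y_up])

clf = LogisticRegression().fit(X_bal, y_bal)
print("F1:", f1_score(y_test, clf.predict(X_test)))
```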
> There are other algorithms that I came across, like the SVM classifier or logistic regression (maximum entropy model), however I am not sure which will be more suitable for my use case.
You will never know unless you try; I would suggest trying 3-4 different algorithms on your data.
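For instance, a quick comparison harness along these lines (scikit-learn rather than OpenNLP, with a placeholder corpus, so purely illustrative):

```python
# Cross-validate a few candidate text classifiers on the same TF-IDF
# features and compare macro-F1, which treats all three classes equally.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus -- substitute your real documents and labels.
texts = (["great fast shipping"] * 12 +
         ["broken item bad support"] * 12 +
         ["average product nothing special"] * 12)
labels = ["classA"] * 12 + ["classB"] * 12 + ["classC"] * 12

for clf in (MultinomialNB(), LinearSVC(), LogisticRegression(max_iter=1000)):
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, scoring="f1_macro", cv=3)
    print(f"{type(clf).__name__}: {scores.mean():.3f}")
```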
I am training an object detection model for multi-class objects in images. The dataset is custom-collected and labelled, with bounding boxes and class labels in the ground truth data.
I trained the MobileNet+SSD, SqueezeDet and YOLOv3 networks with this custom data but get poor results. The rationale for choosing these models is their fast performance and light weight (low memory footprint). Their single-shot detector approach is shown to perform well in the literature as well.
The class instance distribution in the dataset is as below
Class 1 -- 2469
Class 2 -- 5660
Class 3 -- 7614
Class 4 -- 13253
Class 5 -- 35262
Each image can have objects from any of the five classes. Class 4 and 5 have very high incidence.
The performance is very skewed, with high recall scores and Average Precision for classes 4 and 5, and an order of magnitude lower for the other 3 classes.
I have tried fine-tuning different filtering parameters, the NMS threshold and model training parameters, to no avail.
Question:
How do I tackle such class imbalance to boost the Average Precision and detection accuracy for all classes in object detection models?
Low precision means your model is suffering from false positives, so you can try hard negative mining: run your model, find the false positives, and include them in your training data. You can even try using only those false positives as negative examples.
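A rough sketch of the mining step (the (x1, y1, x2, y2) box format and the 0.5 IoU threshold are assumptions; `detections` would come from running your trained model on the training images):

```python
# Collect "hard negatives": detections that overlap no ground-truth box,
# i.e. false positives that can be fed back in as negative examples.
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def hard_negatives(detections, gt_boxes, iou_thresh=0.5):
    return [d for d in detections
            if all(iou(d, g) < iou_thresh for g in gt_boxes)]
```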
As you might expect, another option is collecting more data, if possible.
If that is not possible, you may consider adding synthetic data, i.e. changing the brightness of the image, or the viewpoint (multiplying by a transformation matrix so it looks stretched).
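For example, a random brightness shift plus a small perspective jitter with OpenCV might look like the sketch below (the ranges are arbitrary; note that for detection the ground-truth boxes must be warped with the same matrix):

```python
import cv2
import numpy as np

def augment(img, rng=np.random):
    """Randomly change brightness, then apply a small perspective warp."""
    out = np.clip(img.astype(np.float32) * rng.uniform(0.7, 1.3),
                  0, 255).astype(np.uint8)
    h, w = out.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = rng.uniform(-0.05, 0.05, src.shape).astype(np.float32)
    dst = src + jitter * np.float32([w, h])   # move corners by up to ~5%
    M = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(out, M, (w, h))
```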
One last thing may be balancing the amount of data for each class, e.g. 5k samples for each.
PS: Keep in mind that the flexibility of your model has a great impact, so be aware of overfitting and underfitting.
When generating your synthetic data as mentioned by the previous author, do not apply illumination or viewpoint variations etc. to your entire dataset, but rather apply them randomly. The class counts are also way off balance, and it would be best to either cap the larger classes or gather more data for the under-represented ones. You could also try applying class weights so that the over-represented classes contribute less to the loss. You are making a lot of assumptions; simple experimentation may yield results that surprise you. Remember, deep learning is part science and a lot of art.
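A simple inverse-frequency weighting, using the class counts from the question, could look like this (the exact formula varies by framework; this one mirrors scikit-learn's "balanced" heuristic):

```python
# Inverse-frequency class weights: rare classes get a larger weight in the
# loss, so the model cannot simply ignore them.
counts = {1: 2469, 2: 5660, 3: 7614, 4: 13253, 5: 35262}
total = sum(counts.values())
weights = {c: total / (len(counts) * n) for c, n in counts.items()}
print(weights)   # class 1 ends up weighted ~14x more than class 5
```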
For machine learning binary classification problems with imbalanced classes, does it matter which class is considered the positive class? So if class A is the majority class, by convention do I want to predict that or the minority class (class B)? Does it even matter?
Mathematically it does not matter, but it depends on your underlying problem. For example, if you are classifying a medical test, where positive corresponds to "disease is present" and we assume that positive samples are the minority, you probably want to predict how high the probability is that a person is sick, i.e. belongs to the minority. Also, per-class metrics such as precision and recall are conventionally reported for the positive class, so labelling the minority (interesting) class as positive makes those numbers more informative.
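A quick scikit-learn check of that last point, using the `pos_label` parameter:

```python
# The predictions are unchanged; only which class the metric is
# reported for flips with pos_label.
from sklearn.metrics import recall_score

y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # 1 = sick (minority)
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(recall_score(y_true, y_pred, pos_label=1))  # 0.5: half the sick found
print(recall_score(y_true, y_pred, pos_label=0))  # 1.0: all healthy found
```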
I am currently working with a pre-trained MobileNet model that classifies images from a set of 1000 categories. For the purposes of my iOS application, I only need it to recognize/classify one type of object in the scene. How can I train the model so that it only classifies the one object I need, but does so extremely well?
I am new to machine learning and unfamiliar with transfer learning techniques. Would this type of training reduce the model size and make it more efficient at recognizing the one object I need? If yes, what resources can teach me how to continue training this pre-trained model for my objective?
Briefly, you want to turn your 1000-way classifier into a binary classifier.
The answer below assumes you have access to the original data, and that you know how to train the original model (that is, you have access to the training script). Here goes:
Assuming you're only interested in a single category C, you want to first map all instances (x, C) of the data to (x, 1) and all other instances (x, not_C) to (x, 0), then train a model on the resulting data (or, continue training the pre-trained model, if the training script also accepts a starting point for the model).
The model would then lose the ability to discern between non-C classes, and hopefully become better at discriminating C vs non-C instances.
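The relabelling itself is a one-liner; here is a sketch with NumPy (the category index `C` is hypothetical):

```python
import numpy as np

labels = np.array([3, 17, 3, 999, 42])   # original 1000-way labels
C = 3                                     # the one category of interest
binary = (labels == C).astype(int)        # -> array([1, 0, 1, 0, 0])
```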
Note: A less hacky approach would be to actually restrict the model to output only 0 or 1 and change the objective to a binary softmax. However, that would require some manipulation of the model's architecture, which you can do without.
I am a deep-learning newbie working on creating a vehicle classifier for images using Caffe, and I have a 3-part question:
1. Are there any best practices in organizing classes for training a CNN, i.e. the number of classes and the number of samples for each class? For example, would I be better off with (a) Vehicles - Car-Sedans/Car-Hatchback/Car-SUV/Truck-18-wheeler/.... (note this could mean several thousand classes), or (b) a higher-level model that classifies between car/truck/2-wheeler and so on, and if it is a car, then querying a car model to get the car type (sedan/hatchback etc.)?
2. How many training images per class is a typical best practice? I know there are several other variables that affect the accuracy of the CNN, but what rough number is good to shoot for in each class? Should it be a function of the number of classes in the model? For example, if I have many classes in my model, should I provide more samples per class?
3. How do we ensure we are not overfitting to a class? Is there a way to measure heterogeneity in training samples for a class?
Thanks in advance.
Well, the first choice you mentioned corresponds to a very challenging task in the computer vision community: fine-grained image classification, where you want to classify the subordinates of a base class, say Car. To get more info on this, you may see this paper.
According to the literature on image classification, high-level classes such as car/truck are much simpler for CNNs to learn, since more discriminative features may exist. I suggest following the second approach, that is, classifying all types of cars vs. trucks and so on.
The number of training samples needed is mainly proportional to the number of parameters; that is, if you train a shallow model, far fewer samples are required. It also depends on whether you fine-tune a pre-trained model or train a network from scratch. When sufficient samples are not available, you have to fine-tune a pre-trained model on your task.
Wrestling with over-fitting has always been a problematic issue in machine learning, and even CNNs are not free of it. In the literature, some practical suggestions have been introduced to reduce over-fitting, such as dropout layers and data-augmentation procedures.
This may not be covered by your questions, but it seems that you should follow the fine-tuning procedure: initialize the network with the pre-computed weights of a model trained on another task (say ILSVRC 201X) and adapt those weights to your new task. This procedure is known as transfer learning (and sometimes domain adaptation) in the community.
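The question is about Caffe, but the recipe is framework-agnostic; a hedged sketch of the usual setup in PyTorch/torchvision:

```python
# Fine-tuning sketch: start from ImageNet weights, freeze the backbone,
# and swap in a new classification head for the target classes.
import torch.nn as nn
from torchvision import models

num_classes = 3   # e.g. car / truck / 2-wheeler

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False              # keep pre-trained features fixed

model.fc = nn.Linear(model.fc.in_features, num_classes)  # trainable head
# Train model.fc on your data; optionally unfreeze deeper layers later.
```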