Balance problem for classification on Cleveland Dataset

Balance problem for classification on Cleveland Dataset - machine-learning

I’ve questioned the way famous Cleveland heart disease dataset labels its objects here
This dataset is very unbalanced (many objects of “no disease” class). I noticed that many papers that used this dataset used to combine all the other classes and reduce this to a binary classification (disease vs no disease)
Are there other ways to deal with this unbalancing class problem rather than reduce the number of classes to get a good result from a classifer?

Generally speaking, when handling a non balanced dataset, one should use a non-supervised learning approach.
You may use the Multivariate Normal Distribution.
In your case, if you have many elements in one class and very few in the other class, a supervised learning method is not appropriate. Therefore, the Multivariate Normal Distribution, which is a non supervised machine learning approach, may be the solution. The algorithm learns from the data and finds values which define the data (i.e. the most important part of the data, here the "no desease" cases). Once these values are outputed, one can search the elements which do not fit them, and these elements are the so called "abnormal elements" or "anomalies". In your case, these are the "disease" individuals.
A second solution would be to ballance you dataset, and use the initial supervised learning algorithm. You can do that using the following techniques. These statements are generally good, but they depend a lot on the data you have (mind, I do not have access to your input data!), so you should test them and see which one best fits your purpose.
Collecting more elements for the class with few elements.
Duplicate the elements in the class with less elements, in order to obtain the same amount of data for both classes, as for the class with more lements. There is a problem with this solution, in the case where you have a great difference of input data volume between the two classes, and you use a neural network, because the class with duplicated elements will not be very variate, and neural networks provide good results only when trained with a great amount of very variate data.
Use less data in the class with more lements, in order to have the same amount of elements in both classes as in the class with few elements. Here too there might be a problem when using a neural network, because training it with less data might not give the good results. be careful also in order to have more input elements than features, otherwise it would not work.

Related

Classification with Keras, unbalanced classes

I have a binary classification problem I'm trying to tackle in Keras. To start, I was following the usual MNIST example, using softmax as the activation function in my output layer.
However, in my problem, the 2 classes are highly unbalanced (1 appears ~10 times more often than the other). And what's even more critical, they are non-symmetrical in the way they may be mistaken.
Mistaking an A for a B is way less severe than mistaking a B for an A. Just like a caveman trying to classify animals into pets and predators: mistaking a pet for a predator is no big deal, but the other way round will be lethal.
So my question is: how would I model something like this with Keras?
thanks a lot

A non-exhaustive list of things you could do:
Generate a balanced data set using data augmentations. If the data are images, you can add image augmentations in a custom data generator that will output balanced amounts of data from each class per batch and save the results to a new data set. If the data are tabular, you can use a library like imbalanced-learn to perform over/under sampling.
As #Daniel said you can use class_weights during training (in the fit method) in a way that mistakes on important class are penalized more. See this tutorial: Classification on imbalanced data. The same idea can be implemented with a custom loss function with/without class_weights during training.

Classify visually distinct objects as one class

We are building a neural network to classify objects and have a large dataset of images for 1000 classes. One of the classes is “banana” and it contains 1000 images of banana. Some of those images (about 10%) are of mashed bananas, which are visually very different from the rest of the images in that class.
If we want both mashed bananas and regular bananas to be classified, should we split the banana images into two separate classes and train separately, or keep the two subsets merged?
I am trying to understand how the presence of a visually distinct subclass impacts the recognition of a given class.

The problem here is simple. You need your neural network to learn both groups of images. That means you need to back-propagate sensible error information. If you do have the ground truth information about mashed bananas, back-propagating that is definitely useful. It helps the first layers learn two sets of features.
Note that the nice thing about neural networks is that you can back-propagate any kind of error vector. If your output has 3 nodes banana, non-mashed banana, mashed banana, you basically sidestep the binary choice implied in your question. You can always drop output nodes during inference.

There is no standard answer to be given here; it might be very hard for your network to generalize over classes if their subclasses are distinct in the feature space, in which case introducing multiple dummy classes that you collapse into a single one via post-processing would be the ideal solution. You could also pretrain a model with distinct classes (so as to build representations that discriminate between them), and then pop the final network layer (the classifier) and replace it with a collapsed classifier, fitting with the initial labels. This would accomplish having discriminating representations which are simply classified commonly. In any case I would advise you to construct the subclass-specific labels and check per-subclass error while training with the original classes; this way you will be able to quantify the prediction error you get and avoid over-engineering your network in case it can learn the task by itself without stricter supervision.

Is that make sense to construct a learning Model using only one feature?

in order to improve the accuracy of an adaboost classifier (for image classification), I am using genetic programming to derive new statistical Measures. Every Time when a new feature is generated, i evaluate its fitness by training an adaboost Classifier and by testing its performances. But i want to know if that procedure is correct; I mean the use of a single feature to train a learning model.

You can build a model on one feature. I assume, that by "one feature" you mean simply one number in R (otherwise, it would be completely "traditional" usage). However this means, that you are building a classifier in one-dimensional space, and as such - many classifiers will be redundant (as it is really a simple problem). What is more important - checking whether you can correctly classify objects using one particular dimensions does not mean that it is a good/bad feature once you use combination of them. In particular it may be the case that:
Many features may "discover" the same phenomena in data, and so - each of them separatly can yield good results, but once combined - they won't be any better then each of them (as they simply capture same information)
Features may be useless until used in combination. Some phenomena can be described only in multi-dimensional space, and if you are analyzing only one-dimensional data - you won't ever discover their true value, as a simple example consider four points (0,0),(0,1),(1,0),(1,1) such that (0,0),(1,1) are elements of one class, and rest of another. If you look separatly on each dimension - then the best possible accuracy is 0.5 (as you always have points of two different classes in exactly same points - 0 and 1). Once combined - you can easily separate them, as it is a xor problem.
To sum up - it is ok to build a classifier in one dimensional space, but:
Such problem can be solved without "heavy machinery".
Results should not be used as a base of feature selection (or to be more strict - this can be very deceptive).

Machine learning what approach to use when the dataset contain only one-class instances?

I have a dataset of a particular domain (say sports - 1 class). What I want to do is when I fed a web page to the classifier/clusterer I want to get a result whether that instance (web page) is related to sports or not.
Most of the classifiers in weka are not capable of dealing with unary class datasets except the LibSVM (wrapper). I did some tests with the LibSVM, but the problem is during tests on a unrelated dataset, I get all of them correctly classified, even if the instances are empty! Any suggestions?
What if I use the cosine similarity measure here?

Have you seen this thread unary class text classification in weka? and this post https://list.scms.waikato.ac.nz/mailman/htdig/wekalist/2007-October/011631.html ?
I'm assuming you meant that when you run the classifier against another dataset that is not "sports" it gets the results incorrectly classified (i.e. false positives) e.g. "this is sports".
Are you certain your dataset only contains one class? Did you make sure the dataset does not contain any empty instances? (don't mock, this has happened to me before).
In the comments of the previously mentioned thread there is a linked to a PDF on tuning SVM: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf - I would say SVMs are a bit harder than other common classifiers.
As an alternative, can't you switch the problem to binary classification? It's much easier to get good results and for most problems there are plenty of examples of things that are not in that class e.g. sports websites vs funny image web sites, programming websites, etc ...
PS: you can use other algorithms for outlier detection: http://en.wikipedia.org/wiki/Outlier_detection

A few implementation details for a Support-Vector Machine (SVM)

In a particular application I was in need of machine learning (I know the things I studied in my undergraduate course). I used Support Vector Machines and got the problem solved. Its working fine.
Now I need to improve the system. Problems here are
I get additional training examples every week. Right now the system starts training freshly with updated examples (old examples + new examples). I want to make it incremental learning. Using previous knowledge (instead of previous examples) with new examples to get new model (knowledge)
Right my training examples has 3 classes. So, every training example is fitted into one of these 3 classes. I want functionality of "Unknown" class. Anything that doesn't fit these 3 classes must be marked as "unknown". But I can't treat "Unknown" as a new class and provide examples for this too.
Assuming, the "unknown" class is implemented. When class is "unknown" the user of the application inputs the what he thinks the class might be. Now, I need to incorporate the user input into the learning. I've no idea about how to do this too. Would it make any difference if the user inputs a new class (i.e.. a class that is not already in the training set)?
Do I need to choose a new algorithm or Support Vector Machines can do this?
PS: I'm using libsvm implementation for SVM.

I just wrote my Answer using the same organization as your Question (1., 2., 3).
Can SVMs do this--i.e., incremental learning? Multi-Layer Perceptrons of course can--because the subsequent training instances don't affect the basic network architecture, they'll just cause adjustment in the values of the weight matrices. But SVMs? It seems to me that (in theory) one additional training instance could change the selection of the support vectors. But again, i don't know.
I think you can solve this problem quite easily by configuring LIBSVM in one-against-many--i.e., as a one-class classifier. SVMs are one-class classifiers; application of an SVM for multi-class means that it has been coded to perform multiple, step-wise one-against-many classifications, but again the algorithm is trained (and tested) one class at a time. If you do this, then what's left after step-wise execution against the test set, is "unknown"--in other words, whatever data is not classified after performing multiple, sequential one-class classifications, is by definition in that 'unknown' class.
Why not make the user's guess a feature (i.e., just another dependent variable)? The only other option is to make it the class label itself, and you don't want that. So you would, for instance, add a column to your data matrix "user class guess", and just populate it with some value most likely to have no effect for those data points not in the 'unknown' category and therefore for which the user will not offer a guess--this value could be '0' or '1', but really it depends on how you have your data scaled and normalized).

Your first item will likely be the most difficult, since there are essentially no good incremental SVM implementations in existence.
A few months ago, I also researched online or incremental SVM algorithms. Unfortunately, the current state of implementations is quite sparse. All I found was a Matlab example, OnlineSVR (a thesis project only implementing regression support), and SVMHeavy (only binary class support).
I haven't used any of them personally. They all appear to be at the "research toy" stage. I couldn't even get SVMHeavy to compile.
For now, you can probably get away with doing periodic batch training to incorporate updates. I also use LibSVM, and it's quite fast, so it sould be a good substitute until a proper incremental version is implemented.
I also don't think SVM's can model the concept of an "unknown" sample by default. They typically work as a series of boolean classifiers, so a sample ends up as positively being classified as something, even if that sample is drastically different from anything seen previously. A possible workaround would be to model the ranges of your features, and randomly generate samples that exist outside of these ranges, and then add these to your training set.
For example, if you have an attribute called "color", which has a minimum value of 4 and a maximum value of 123, then you could add these to your training set
[({'color':3},'unknown'),({'color':125},'unknown')]
to give your SVM an idea of what an "unknown" color means.

There are algorithms to train an SVM incrementally, but I don't think libSVM implements this. I think you should consider whether you really need this feature. I see no problem with your current approach, unless the training process is really too slow. If it is, could you retrain in batches (i.e. after every 100 new examples)?
You can get libSVM to produce probabilities of class membership. I think this can be done for multiclass classification, but I'm not entirely sure about that. You will need to decide some threshold at which the classification is not certain enough and then output 'Unknown'. I suppose something like setting a threshold on the difference between the most likely and second most likely class would achieve this.
I think libSVM scales to any number of new classes. The accuracy of your model may well suffer by adding new classes, however.

Even though this question is probably out of date, I feel obliged to give some additional thoughts.
Since your first question has been answered by others (there is no production-ready SVM which implements incremental learning, even though it is possible), I will skip it. ;)
Adding 'Unknown' as a class is not a good idea. Depending on it's use, the reasons are different.
If you are using the 'Unknown' class as a tag for "this instance has not been classified, but belongs to one of the known classes", then your SVM is in deep trouble. The reason is, that libsvm builds several binary classifiers and combines them. So if you have three classes - let's say A, B and C - the SVM builds the first binary classifier by splitting the training examples into "classified as A" and "any other class". The latter will obviously contain all examples from the 'Unknown' class. When trying to build a hyperplane, examples in 'Unknown' (which really belong to the class 'A') will probably cause the SVM to build a hyperplane with a very small margin and will poorly recognizes future instances of A, i.e. it's generalization performance will diminish. That's due to the fact, that the SVM will try to build a hyperplane which separates most instances of A (those officially labeled as 'A') onto one side of the hyperplane and some instances (those officially labeled as 'Unknown') on the other side .
Another problem occurs if you are using the 'Unknown' class to store all examples, whose class is not yet known to the SVM. For example, the SVM knows the classes A, B and C, but you recently got example data for two new classes D and E. Since these examples are not classified and the new classes not known to the SVM, you may want to temporarily store them in 'Unknown'. In that case the 'Unknown' class may cause trouble, since it possibly contains examples with enormous variation in the values of it's features. That will make it very hard to create good separating hyperplanes and therefore the resulting classifier will poorly recognize new instances of D or E as 'Unknown'. Probably the classification of new instances belonging to A, B or C will be hindered as well.
To sum up: Introducing an 'Unknown' class which contains examples of known classes or examples of several new classes will result in a poor classifier. I think it's best to ignore all unclassified instances when training the classifier.
I would recommend, that you solve this issue outside the classification algorithm. I was asked for this feature myself and implemented a single webpage, which shows an image of the object in question and a button for each known class. If the object in question belongs to a class which is not known yet, the user can fill out another form to add a new class. If he goes back to the classification page, another button for that class will magically appear. After the instances have been classified, they can be used for training the classifier. (I used a database to store the known classes and reference which example belongs to which class. I implemented an export function to make the data SVM-ready.)

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart