How to do multi-label classification by hand? - machine-learning

I have a huge data set and would like to do multi-label classification, where each object can be assigned to more than one class. I'm using a Naive Bayes classifier in Apache Mahout to do that. However, it is not designed for multi-label classification and just assigns the single class with the highest probability to each object. How can I extend this classifier to my scenario?
One solution I considered was to set a threshold and assign every class whose probability exceeds it. But a good threshold is hard to find, so this does not work well. Does anyone have an idea?

You need to train a binary classifier for each class. Each training set should contain data from the target class plus other arbitrary data that does not match the target class.
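The one-vs-rest scheme described above can be sketched with scikit-learn (not Mahout, whose API differs): one binary Naive Bayes classifier is trained per label, and each decides membership independently, so an object can receive several labels. The data here is a toy placeholder.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy feature counts for 4 objects.
X = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 2], [0, 0, 4]])
# Multi-label targets: each row is a 0/1 indicator vector over 3 classes.
Y = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])

# One binary MultinomialNB per class, trained on "class vs. rest".
clf = OneVsRestClassifier(MultinomialNB()).fit(X, Y)
pred = clf.predict(X)   # one independent 0/1 decision per class, per object
print(pred.shape)       # (4, 3): an object may get 0, 1, or several labels
```

Because each classifier makes its own yes/no decision, no single global probability threshold is needed.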

Related

Evaluation of generative models like variational autoencoder

I hope everyone is doing well.
I need some help with generative models.
I'm working on a project where the main task is to build a binary classification model. The dataset contains 300,000 samples and 100 features, and there is an imbalance between the two classes: the majority class is much larger than the minority class.
To handle this problem, I'm using a VAE (variational autoencoder).
I train the VAE on the minority class and then use the decoder part of the VAE to generate new (fake) samples that are similar to the minority class, then concatenate this new data with the training set in order to obtain a balanced training set.
My question is: is there any way to evaluate generative models like VAEs, i.e., a way to know whether the generated data is similar to the real data?
I have read that there are metrics for evaluating generated data, such as the Inception Score and the Fréchet Inception Distance, but I have only seen them used on image data.
Can I use them on my dataset too?
Thanks in advance.
I believe your data is not image data, as you say there are 100 features. What you can do is check the similarity between the synthesized features and the original ones (those belonging to the minority class), and keep only the samples above a certain similarity. Cosine similarity would be useful for this problem.
It would also be worthwhile to make a scatter plot of the synthesized features together with the original ones to see whether they are close to each other. t-SNE would be useful at this point.
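The filtering step suggested above could look like the following sketch. The matrices and the 0.2 cutoff are illustrative assumptions; in practice the threshold would be tuned on your own data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
real = rng.normal(size=(50, 100))        # stand-in for minority-class rows
synthetic = rng.normal(size=(200, 100))  # stand-in for VAE-decoded samples

# Cosine similarity of every synthetic sample against every real sample.
sim = cosine_similarity(synthetic, real)   # shape (200, 50)
best = sim.max(axis=1)                     # best real match per synthetic row
kept = synthetic[best > 0.2]               # hypothetical similarity cutoff
print(kept.shape[1])                       # all 100 features are preserved
```

The kept samples would then be concatenated with the real training set as described in the question; the t-SNE scatter plot can be produced with `sklearn.manifold.TSNE` on the stacked real and synthetic rows.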

Machine Learning: Weighting Training Points by Importance

I have a set of labeled training data, and I am training a ML algorithm to predict the label. However, some of my data points are more important than others. Or, analogously, these points have less uncertainty than the others.
Is there a general method to include an importance-representing weight to each training point in the model? Are there instead some specific models which are capable of this while others are not?
I can imagine duplicating these points (and perhaps smearing their features slightly to avoid exact duplicates), or downsampling the less important points. Is there a more elegant way to approach this problem?
Scikit-learn allows you to pass an array of sample weights when fitting a model. Vowpal Wabbit (an online ML library) also has this option.
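A minimal sketch of the scikit-learn option: `fit` accepts a `sample_weight` array, and points with larger weight contribute more to the loss. The toy data here is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 1.0, 5.0, 5.0])   # the last two points matter 5x more

# sample_weight scales each point's contribution to the training loss.
clf = LogisticRegression().fit(X, y, sample_weight=w)
print(clf.predict([[2.5]]))          # → [1]
```

This is the more elegant alternative to duplicating points: a weight of 5 has the same effect on the loss as five exact copies, without inflating the data set.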

Should a deep-learning based image classifier include a negative class

I am building an image classifier similar to AlexNet (https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks).
I have 6 categories: [people, car, bikes, animals, hydroplanes, boats]. Right now, if I give it an object that doesn't belong to any of these classes, it still gets assigned to one of them with some probability.
To increase accuracy, is it wise to add more classes or to add a negative class?
And if I add a negative class, what kind of data should I train it on?
Thank you
Think about what you really want to produce in the end.
You need an algorithm that tells you whether the image you passed is a car, bike, animal, person, hydroplane, or boat.
Is the user expected to pass images that represent something else? If so, you can add an "other" class.
Well, it depends on what kind of classifier you want to build and the available training data.
If you have enough training data for a new class, e.g., train, you can easily add it; that part is straightforward. But the problem remains: what should happen when an unknown object appears at the input?
I think your question is how to handle the situation where an object that is not in the training set is presented to the network. Adding a negative class is tricky in such cases, because the network needs enough clear training data for the negative class as well. One way to deal with this is to put a check on the output probabilities: if no class reaches, say, 70% of the output probability, classify the input as ambiguous or negative.
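The probability-check idea above can be sketched as a small rejection rule applied to the network's softmax output; the 0.7 cutoff and the function name are illustrative, not a fixed recipe.

```python
import numpy as np

def classify_with_reject(probs, classes, threshold=0.7):
    """Return the top class, or "negative" if no class clears the cutoff."""
    i = int(np.argmax(probs))
    return classes[i] if probs[i] >= threshold else "negative"

classes = ["people", "car", "bikes", "animals", "hydroplanes", "boats"]

# A confident prediction passes the check.
print(classify_with_reject([0.90, 0.02, 0.02, 0.02, 0.02, 0.02], classes))
# A spread-out distribution is rejected as ambiguous/negative.
print(classify_with_reject([0.30, 0.20, 0.20, 0.10, 0.10, 0.10], classes))
```

Note that softmax confidence is not a calibrated measure of "in-distribution", so the threshold should be validated against held-out out-of-class images before relying on it.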

Classification Algorithm which can take predefined weights for attributes as input

I have 20 attributes and one target feature. All the attributes are binary (present or not present) and the target feature is multinomial (5 classes).
For each instance, apart from the presence of some attributes, I also know how much effect (on a scale of 1-5) each present attribute had on the target feature.
How do I make use of this extra information to build a classification model that predicts the test classes better?
Why not just use the weights as the features, instead of binary presence indicators? You can encode absence as a 0 on the continuous scale.
EDIT:
The classifier you choose will learn optimal weights on the features during training to separate the classes... thus I don't believe there's anything better you can do if you do not have access to the test weights. Essentially, a linear classifier learns a rule of the form:
c_i = sgn(w . x_i)
You're saying you have access to weights, but without an example of what the data look like and an explanation of where the weights come from, I don't see how you'd use them (or even why you'd want to: is standard classification with binary features not working well enough?)
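The first suggestion above, replacing the binary indicators with the 1-5 effect scores, can be sketched as follows. The data is a made-up illustration (3 attributes instead of 20, 3 of the 5 classes), with 0 encoding "attribute not present".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each value is the attribute's 1-5 effect score if present, else 0.
X = np.array([[5, 0, 2],
              [0, 3, 0],
              [4, 1, 0],
              [0, 0, 5]])
y = np.array([0, 1, 0, 2])   # multinomial target

# Any standard classifier can consume the continuous encoding directly.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[5, 0, 1]]))
```

As the answers note, this only helps if the effect scores are also available at prediction time; otherwise the classifier must fall back to the binary presence features it was not trained on.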
This clearly depends on the actual algorithms that you are using.
For decision trees, the information is useless. They are meant to learn which attributes have how much effect.
Similarly, support vector machines will learn the best linear split, so any kind of weight will disappear since the SVM already learns this automatically.
However, if you are doing nearest-neighbour (NN) classification, just scale the attributes as desired to emphasize differences in the influential attributes.
Sorry, you need to look at other algorithms yourself. There are just too many.
Use the knowledge as a prior over the feature weights. You can then compute a posterior estimate from the data to obtain the final model.

Getting a probability instead of a hard classification from a learning algorithm

I'm using Weka. I have a training set, and the class of the examples in the training set is Boolean.
Given the training set, I want to predict the probability of a new input being true or false. I want to get a number between 0 and 1, not only 0 or 1.
How can I do that? I have seen that the prediction only contains the possible classes.
Thanks in advance.
You can only make the same kind of prediction with the learned classifier: it learns to make the predictions you train it to make. The kind of prediction you want sounds more like regression. That is, you don't want a strict classification, but a continuous value designating the membership probability.
The easiest way to achieve what you want is to replace the Booleans in your training set with 0/1 values and learn a regression model. This will give you numbers, although not necessarily only between 0 and 1.
To get real probabilities, you would need to use a classifier that calculates probabilities (such as Naive Bayes) and write some custom code (using the Weka library) to retrieve them. See the javadoc of the method that gives you access to the class probabilities.
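The answer refers to Weka, whose API is Java (the method in question is `Classifier.distributionForInstance`). As an analogous illustration in scikit-learn, `predict_proba` plays the same role, returning per-class probabilities instead of a hard label; the data below is a toy stand-in.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Boolean target encoded as 0/1, one numeric feature.
X = np.array([[0.1], [0.2], [0.9], [1.1]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB().fit(X, y)
# predict_proba returns [P(false), P(true)] for each new input.
proba = clf.predict_proba([[0.85]])
print(proba.shape)   # (1, 2)
```

The second column is exactly the 0-1 membership probability the question asks for, with no regression workaround needed.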
