How can I use one-hot encoded labels with some sklearn classifiers? - machine-learning

I have a multiclass classification task with 10 classes. As such, I used sklearn's OneHotEncoder to transform the one-column labels to 10-columns labels. I was trying to fit the training data. Although I was able to do this with RandomForestClassifier, I got the below error message when fitting with GaussianNB:
ValueError: bad input shape (1203L, 10L)
I understand the allowed shape of y in these two classifiers is different:
GaussianNB:
y : array-like, shape (n_samples,)
RandomForest:
y : array-like, shape = [n_samples] or [n_samples, n_outputs]
The question is, why is this? Wouldn't this be contradictory to "All classifiers in scikit-learn do multiclass classification out-of-the-box"? Any way to go around it? Thanks!

The question is, why is this?
It is because of a slight misunderstanding. In scikit-learn you do not one-hot encode the labels; you pass them as a one-dimensional vector of labels. So instead of
1 0 0
0 1 0
0 0 1
you literally pass
1 2 3
So why does random forest accept a different scheme? Because that scheme is not for the multiclass setting! It is for multi-label problems, where each instance can have many labels, like
1 1 0
1 1 1
0 0 0
Wouldn't this be contradictory to "All classifiers in scikit-learn do multiclass classification out-of-the-box"?
On the contrary, it is the easiest solution: never ask for one-hot labels unless the problem is multi-label.
Any way to go around it?
Yup, just do not encode - pass raw labels :-)
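A minimal sketch of the advice above, assuming scikit-learn and NumPy are installed. The toy data and the names `X`, `y_onehot` and `y_labels` are made up for illustration. If the labels have already been one-hot encoded, taking `argmax` along the class axis recovers the 1-D label vector that GaussianNB expects.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 well-separated classes, 20 samples each, 2 features
X = np.vstack([rng.normal(loc=c * 5.0, scale=0.5, size=(20, 2)) for c in range(3)])
y = np.repeat(np.arange(3), 20)

# Pretend the labels were one-hot encoded earlier: shape (60, 3)
y_onehot = np.eye(3, dtype=int)[y]

# Undo the encoding: the column index of the 1 is the class label, shape (60,)
y_labels = y_onehot.argmax(axis=1)

# GaussianNB is fitted on the 1-D label vector, not on the one-hot matrix
clf = GaussianNB().fit(X, y_labels)
pred = clf.predict(X)
```

Passing `y_onehot` directly to `fit` is what triggers the "bad input shape" error from the question.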

Related

binary classification: why use +1/0 as label, what's the difference between +1/-1 or even +100/-100

In binary classification problems, we usually use +1 for the positive label and 0 for the negative label. Why is that? In particular, why use 0 rather than -1 for the negative label?
What's the difference if we use -1 for the negative label? More generally, can we use +100 for the positive label and -100 for the negative label?
As the name suggests, labeling is used for differentiating the classes. You can use 0/1, +1/-1, cat/dog, etc. (any name that fits your problem).
For example:
If you want to distinguish between cat and dog images, then use cat and dog labels.
If you want to detect spam, then labels will be spam/genuine.
However, because ML algorithms mostly work with numbers, labels are transformed to a numeric format before training.
Using labels of 0 and 1 arises naturally from some of the historically first methods used for binary classification. E.g. logistic regression directly models the probability of an event happening, the event in this case being that an object belongs to the positive or negative class. When we use training data with labels 0 and 1, it basically means that objects with label 0 have probability 0 of belonging to the given class, and objects with label 1 have probability 1 of belonging to it. E.g. for spam classification, emails that are not spam would have label 0, which means they have 0 probability of being spam, and emails that are spam would have label 1, because their probability of being spam is 1.
So using labels of 0 and 1 makes perfect sense mathematically. When a binary classification model outputs e.g. 0.4 for some input, we can usually interpret this as the probability of belonging to class 1 (although strictly it's not always the case, as pointed out for example here).
There are classification methods that don't make use of convenient properties of labels 0 and 1, such as support vector machines or linear discriminant analysis, but in their case no other labels would provide more convenience than 0 and 1, so using 0 and 1 is still okay.
Even the encoding of classes for multiclass classification makes use of probabilities of belonging to a given class. For example, in classification with three classes, objects from the first class would be encoded as [1 0 0], from the second class as [0 1 0] and from the third class as [0 0 1], which again can be interpreted as probabilities. (This is called one-hot encoding.) The output of a multiclass classification model is often a vector of the form [0.1 0.6 0.3], which can be conveniently interpreted as a vector of class probabilities for the given object.
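To make the probability interpretation concrete, here is a small NumPy sketch. The label arrays, the raw scores and the `softmax` helper are made up for illustration. It remaps +1/-1 labels to 0/1 so they can be read as target probabilities, and turns raw scores for three classes into a probability vector.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical labels given as -1/+1, remapped to 0/1
labels_pm = np.array([-1, 1, 1, -1])
labels_01 = (labels_pm + 1) // 2

# Hypothetical raw scores for one object in a three-class problem
probs = softmax(np.array([0.0, 1.8, 1.1]))

# The predicted class is the one with the highest probability
predicted = int(probs.argmax())
```

Any pair of distinct label values can be remapped this way; 0/1 is just the pair that needs no remapping before it is compared with a predicted probability.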

Majority layer in Keras?

Is there any kind of majority layer in Keras? The input would be some 1D vector, and the output would be a single number: the value that occurs most often in the input vector.
My use case is this: I'm building an ensemble of neural networks, but let's say that I want to have a single network. So I'm building a new network, with the previous models as inputs. I want to add a single output layer that simply runs a majority vote. Is it possible with Keras?
Note that such a layer would not make much sense with float activations, since the probability of two floats being exactly the same is 0. Consequently, this only makes sense if your inputs are categorical. For a categorical layer you usually one-hot encode the result, so imagine having N networks with a one-hot encoding of K possible values, giving an N x K tensor
Net 1: 0 0 1 0 0 0
Net 2: 1 0 0 0 0 0
Net 3: 0 0 1 0 0 0
Now all we have to do is to sum over N
1 0 2 0 0 0
and take an argmax to find what you want. So you just compose 2 standard operations from every NN library - summation, and argmaxing. Note, that this solution actually works also if your inputs are floats, you "sum" the votes, and take an argmax.
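The sum-then-argmax recipe can be sketched in NumPy with the same three votes as above. The float matrix `soft` is a made-up example of the float case. In Keras the same two operations could presumably be wrapped in a custom layer, which is not shown here.

```python
import numpy as np

# One row per network, one column per class (N = 3 networks, K = 6 classes)
votes = np.array([
    [0, 0, 1, 0, 0, 0],  # Net 1
    [1, 0, 0, 0, 0, 0],  # Net 2
    [0, 0, 1, 0, 0, 0],  # Net 3
])

# Sum over the networks, then pick the class with the most votes
counts = votes.sum(axis=0)
winner = int(counts.argmax())

# The same recipe with float outputs (hypothetical class probabilities)
soft = np.array([
    [0.1, 0.2, 0.7],
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],
])
soft_winner = int(soft.sum(axis=0).argmax())
```

Note that `argmax` returns the first maximum, so ties are broken in favor of the lowest class index.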

What is the correct way of implementing dice loss? Sigmoid or softmax?

I have a binary semantic segmentation problem and there are 2 methods in my mind.
Method 1:
The Unet outputs one class with sigmoid activation, then I use the dice loss to calculate the loss.
Method 2:
The ground truth is concatenated with its inverse, thus having 2 classes. The Unet outputs 2 classes with softmax activation applied to them. The dice loss is then used to calculate the loss.
Which is correct?
This question has been answered here. If you have a 2-class problem, output only 1 channel and use a sigmoid function (it outputs values between 0 and 1). Then you can calculate your dice loss with the output (continuous values) and the target (single channel, one-hot-encoded, discrete values). If your network outputs 2 channels, use a softmax function and calculate your loss with the output (continuous values) and the target (2 channels, one-hot-encoded). The former is preferred, as you will have fewer parameters.
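A minimal NumPy sketch of the single-channel variant, assuming the common "soft" dice formulation 1 - 2|P∩T| / (|P| + |T|) with a small smoothing term `eps`. The function names and the tiny 2x2 "images" are made up for illustration, and a real training loop would use the framework's differentiable tensors instead of NumPy.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_dice_loss(probs, target, eps=1e-6):
    # probs: sigmoid outputs in (0, 1); target: binary mask of the same shape
    intersection = np.sum(probs * target)
    dice = (2.0 * intersection + eps) / (np.sum(probs) + np.sum(target) + eps)
    return 1.0 - dice

# Hypothetical single-channel logits and ground-truth mask
logits = np.array([[4.0, -4.0], [3.0, -5.0]])
target = np.array([[1.0, 0.0], [1.0, 0.0]])

loss_good = soft_dice_loss(sigmoid(logits), target)   # prediction agrees with the mask
loss_bad = soft_dice_loss(sigmoid(-logits), target)   # prediction is inverted
```

The loss is close to 0 when the predicted mask overlaps the target and close to 1 when it does not.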
Method 2 is correct, since softmax is used for multi-class problems.

Classification with imbalanced dataset using Multi Layer Perceptrons

I am having trouble with a classification problem.
I have almost 400k vectors in my training data with two labels, and I'd like to train an MLP which classifies the data into two classes.
However, the dataset is very imbalanced. 95% of the vectors have label 1, and the others have label 0. The accuracy grows as training progresses, and stops after reaching 95%. I guess this is because the network predicts label 1 for all vectors.
So far, I have tried dropout layers with 0.5 probability, but the result is the same. Are there any ways to improve the accuracy?
I think the best way to deal with unbalanced data is to use weights for your class. For example, you can weight your classes such that sum of weights for each class will be equal.
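import pandas as pd

# Toy data: 2 samples of class 0 and 5 samples of class 1
df = pd.DataFrame({'x': range(7),
                   'y': [0] * 2 + [1] * 5})
# Weight per sample = n_samples / (n_classes * class_count),
# so the weights of each class sum to the same total
df['weight'] = df['y'].map(len(df) / 2 / df['y'].value_counts())
print(df)
# Named aggregation; recent pandas versions reject the dict-renaming form of agg()
print(df.groupby('y')['weight'].agg(samples='size', weight='sum'))
output:
   x  y  weight
0  0  0    1.75
1  1  0    1.75
2  2  1    0.70
3  3  1    0.70
4  4  1    0.70
5  5  1    0.70
6  6  1    0.70
   samples  weight
y
0        2     3.5
1        5     3.5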
You could try another classifier on a subset of the examples. SVMs may work well with small data, so you could take, let's say, only 10k examples with a 5/1 proportion between the classes.
You could also oversample the small class somehow and under-sample the other one.
You can also simply weight your classes.
Think also about proper metric. It's good that you noticed that the output you have predicts only one label. It is, however, not easily seen using accuracy.
Some nice ideas about unbalanced dataset here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Remember not to change your test set.
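The point about accuracy hiding a constant predictor can be checked with a few lines of NumPy. The 95/5 split below mirrors the proportions in the question; the arrays themselves are made up, and per-class recall and its mean (balanced accuracy) are used here as one possible choice of a more informative metric.

```python
import numpy as np

# Hypothetical test set with the same 95% / 5% imbalance as in the question
y_true = np.array([1] * 95 + [0] * 5)

# A "classifier" that always predicts the majority class
y_pred = np.ones_like(y_true)

# Plain accuracy looks good even though class 0 is never detected
accuracy = float((y_pred == y_true).mean())

# Recall per class: fraction of each class that is predicted correctly
recall_per_class = [
    float(((y_pred == c) & (y_true == c)).sum() / (y_true == c).sum())
    for c in (0, 1)
]

# Balanced accuracy is the mean of the per-class recalls
balanced_accuracy = float(np.mean(recall_per_class))
```

Here the accuracy is 95% while the recall for class 0 is zero, which is exactly the situation described in the question.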
That's a common situation: the network learns a constant and can't get out of this local minimum.
When the data is very unbalanced, like in your case, one possible solution is a weighted cross entropy loss function. For instance, in tensorflow, apply a built-in tf.nn.weighted_cross_entropy_with_logits function. There is also a good discussion of this idea in this post.
But I should say that getting more data to balance both classes (if that's possible) will always help.
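For illustration, a weighted cross entropy can be sketched in plain NumPy. This follows the formula documented for tf.nn.weighted_cross_entropy_with_logits as I understand it, `pos_weight * labels * -log(sigmoid(logits)) + (1 - labels) * -log(1 - sigmoid(logits))`, but it is a stand-alone sketch and not the TensorFlow implementation. The function name and the choice `pos_weight = n_neg / n_pos` are assumptions.

```python
import numpy as np

def weighted_bce_with_logits(logits, labels, pos_weight):
    # Numerically stable log(sigmoid(x)) and log(1 - sigmoid(x))
    log_p = -np.logaddexp(0.0, -logits)
    log_not_p = -np.logaddexp(0.0, logits)
    # Positive examples (label 1) are scaled by pos_weight
    return -(pos_weight * labels * log_p + (1.0 - labels) * log_not_p)

# In the question 95% of the labels are 1, so the positive class is
# down-weighted (assumption: weight = n_negative / n_positive)
pos_weight = 0.05 / 0.95

logits = np.array([0.0, 0.0, 2.0, -2.0])
labels = np.array([1.0, 0.0, 1.0, 0.0])
losses = weighted_bce_with_logits(logits, labels, pos_weight)
```

With this weighting, an error on a rare label-0 example costs as much as errors on 19 label-1 examples, so predicting 1 everywhere is no longer a cheap solution.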

Implementing a neural network classifier for my data, but is it solveable this way?

I will try to explain what the problem is.
I have 5 materials, each composed of 3 different minerals from a set of 10 different minerals. For each material I have measured the intensity vs wavelength. Each intensity vs wavelength vector can be mapped to a binary vector of ones and zeros corresponding to the minerals the material is composed of.
So material 1 has an intensity of [0.51 0.53 0.57 0.68...... ] measured at different wavelengths [470 480 490 500 510 ......] and a binary vector
[1 0 0 0 1 0 0 1 0 0]
and so on for each material.
For each material I have 5000 examples, so 25000 examples for all. Each example will have a 'similar' intensity vs wavelength behaviour but will give the 'same' binary vector.
I want to design a NN classifier so that if I give it as an input the intensity vs wavelength, it gives me the corresponding binary vector.
The intensity vs wavelength has a length of 450 so I will have 450 units in the input layer
the binary vector has a length of 10, so 10 output neurons
the hidden layer/s will have as a beginning 200 neurons.
Can I simply design a NN classifier this way, and would it solve the problem, or do I need something else?
You can do that; however, be careful to use the right cost function and output layer activation. In your case, you should use sigmoid units for your output layer and binary cross-entropy as the cost function.
Another way to go about this would be to use one-hot encoding so that you can use normal multi-class classification (will probably not make sense since your output is probably sparse).
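A forward-pass-only NumPy sketch of the suggested setup, using the layer sizes from the question (450 inputs, 200 hidden units, 10 outputs). The random weights, the ReLU hidden activation and the fake input batch are assumptions for illustration; training itself is not shown and would normally be done with a framework such as Keras.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 450, 200, 10

# Randomly initialised weights (untrained, for shape illustration only)
W1 = rng.normal(scale=0.05, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.05, size=(n_hidden, n_out))
b2 = np.zeros(n_out)

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)               # hidden layer (ReLU is an assumption)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # one independent sigmoid per mineral

def binary_cross_entropy(probs, targets, eps=1e-7):
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p)))

# Fake batch of 4 spectra, all with the target vector of material 1
x = rng.random((4, n_in))
t = np.tile(np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0], dtype=float), (4, 1))

probs = forward(x)
loss = binary_cross_entropy(probs, t)
pred = (probs >= 0.5).astype(int)   # threshold each output to get the binary vector
```

Each output unit answers "is this mineral present?" on its own, which is why sigmoid outputs with binary cross-entropy fit better here than a softmax over the 10 minerals.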
