Multi Label classification with Sklearn - machine-learning

I have tried using OneVsRestClassifier with LogisticRegression from sklearn, but it returns empty labels for some samples (i.e. it doesn't predict any label for them), even though I do not have any unlabelled training data.
Any idea what might be causing this or how to fix this?
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

clf = OneVsRestClassifier(LogisticRegression(multi_class='ovr', max_iter=1000, solver='lbfgs'))
clf.fit(X, Y)
self.classifier = clf
self.classifier.predict(test_data)

Whenever you perform multilabel classification, the OneVsRestClassifier documentation says the targets need to be "a sequence of sequences of labels".
Moreover, depending on how you encode these labels you may get the following warning: "DeprecationWarning: Direct support for sequence of sequences multilabel representation will be unavailable from version 0.17. Use sklearn.preprocessing.MultiLabelBinarizer to convert to a label indicator representation."
So, a neat way to encode your labels:
from sklearn import preprocessing
mlb = preprocessing.MultiLabelBinarizer()
Y = mlb.fit_transform([(1, 2), (1, 2), (1, 2), (4,)])
# This means sample one belongs to classes {1, 2}, and so on.
# Note the format when only one class is needed: (4,), not (4).
so Y turns out to be:
array([[1, 1, 0],
       [1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]])
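With Y in this label-indicator format you can fit the classifier directly and map predictions back to label sets with mlb.inverse_transform(). A minimal sketch with made-up features X (not from the original question):
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.0, 0.1], [1.0, 1.0]])  # dummy features, one row per sample
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform([(1, 2), (1, 2), (1, 2), (4,)])

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)
pred = clf.predict(X)                # binary indicator matrix, same shape as Y
print(mlb.inverse_transform(pred))   # back to label tuples, e.g. [(1, 2), ...]
Note that an all-zero row in pred (an empty tuple after inverse_transform) is still possible: one binary classifier is trained per label, and every one of them can output 0 for a given sample, which is exactly the empty-prediction behaviour described in the question.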

Related

SVM's support vectors decision function representation

I am currently using an SVM with the 'rbf' kernel for my project.
What I understand from the theory is that the decision function value for the support vectors must be either +1 or -1 (if I use clf.decision_function(x)).
But for some support vectors the decision function value is 0.76, -0.88, 0.93 and so on (not even close to +1 or -1, like 0.99 or -0.99).
What is wrong in this scenario? Or is my understanding wrong?
I guess there is no range limitation for the decision function value output in SVM.
The value of the decision function for those points will be a high positive number for high-confidence positive decisions and a low absolute value (near 0) for low-confidence decisions.
Source here
Code Example:
import numpy as np
from sklearn.svm import SVC
X = np.array([[-1, -1], [-2, -1], [0, 0], [0, 0], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2, 3, 3])
clf = SVC()
clf.fit(X, y)
print(clf.decision_function(X))
print(clf.predict(X))
Output:
# clf.decision_function(X)
array([[ 2.21034835,  0.96227609, -0.20427163],
       [ 2.22222707,  0.84702504, -0.17843569],
       [-0.16668475,  2.22222222,  0.83335142],
       [-0.16668475,  2.22222222,  0.83335142],
       [-0.20428472,  0.96227609,  2.21036024],
       [-0.17841683,  0.84702504,  2.22221737]])
# clf.predict(X)
array([1, 1, 2, 2, 3, 3])
What the SVM cares about is the sign of the decision value: if the sign is negative, the point lies (say) to the left of the hyperplane; if it is positive, the point lies to the right. The magnitude tells you how far the point is from the hyperplane, so -0.88 means the point is to the left of the hyperplane at a distance of 0.88. The closer a point is to the hyperplane, the higher the chance of misclassification.
Have a look here
To quote from scikit-learn:
the function values are proportional to the distance of the samples X
to the separating hyperplane.
sklearn uses soft margin SVMs. From the User Guide:
In general, when the problem isn’t linearly separable, the support vectors are the samples within the margin boundaries.
See also this sklearn example, where depending on C the margin changes and more or fewer points act as support vectors.
So support vectors can have a decision_function score of anything from -1 to +1. (In fact, misclassified points outside the margin will still be support vectors, and will have a score outside even that range. See https://stats.stackexchange.com/a/585832/232706)
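A quick sketch (my own, not part of the original answer) that makes this concrete: with a soft-margin linear SVM on overlapping data, the support vectors' decision values are scattered rather than pinned at +1/-1.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 0.5, rng.randn(50, 2) + 0.5])  # two overlapping blobs
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel='linear', C=1.0).fit(X, y)
sv_scores = clf.decision_function(X[clf.support_])  # decision values of the support vectors
print(sv_scores.min(), sv_scores.max())             # many values fall strictly inside (-1, +1)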

what to do after binning numerical feature?

I want to know what to do after I have done the binning. For example, one of the features is age, so my data is [11, 12, 35, 26].
Then I apply binning with a bin size of 10:
bin       name
[0, 10)   --> 1
[10, 20)  --> 2
[20, 30)  --> 3
[30, 40)  --> 4
Then my data becomes [2, 2, 4, 3]. Now assume I want to feed this data to a linear regression model. Should I treat [2, 2, 4, 3] as a numerical feature? Or should I treat it as a categorical feature, i.e. do one-hot encoding first and then feed it to the model?
If you are building a linear model, then one-hot encoding (OHE) of those bins might be a better option, so that if there is any linear relationship with the target, the OHE will preserve it.
If you are building tree-based models, like random forests, then you can use [2, 2, 4, 3] as a numerical feature, because these models are non-linear.
If you are building a regression model and do not want to expand the feature space with OHE, you could treat the bins as a categorical variable and encode it using mean/target encoding, or encode it with digits that follow the target mean per bin.
More details about the last 2 procedures in this article.
Disclaimer: I wrote the article.
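As a small sketch of the two encodings (my own illustration, not from the article), scikit-learn's KBinsDiscretizer can produce either the ordinal bin codes or the one-hot columns directly:
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

ages = np.array([[11], [12], [35], [26]])

# Ordinal bin codes: a single numerical column, fine for tree-based models.
ordinal = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
print(ordinal.fit_transform(ages).ravel())

# One-hot encoded bins: one binary column per bin, usually better for linear models.
onehot = KBinsDiscretizer(n_bins=4, encode='onehot-dense', strategy='uniform')
print(onehot.fit_transform(ages))
Note that KBinsDiscretizer derives the bin edges from the data itself (here 'uniform' over the observed range), so the codes will not match the hand-made [0, 10), [10, 20), ... bins exactly.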

Is it possible to train a sklearn model (eg SVM) incrementally? [duplicate]

This question already has answers here:
Does the SVM in sklearn support incremental (online) learning?
(6 answers)
Closed 4 years ago.
I'm trying to perform sentiment analysis on the Twitter dataset "Sentiment140", which consists of 1.6 million labelled tweets. I'm constructing my feature vectors using a Bag of Words (unigram) model, so each tweet is represented by about 20000 features. Now, to train my sklearn model (SVM, Logistic Regression, Naive Bayes) on this dataset, I have to load the entire 1.6m x 20000 feature matrix into one variable and then feed it to the model. Even on my server machine, which has a total of 115GB of memory, this causes the process to be killed.
So I wanted to know: can I train the model instance by instance, rather than loading the entire dataset into one variable?
If sklearn does not have this flexibility, are there any other libraries you could recommend (which support sequential learning)?
It is not really necessary (let alone efficient) to go to the other extreme and train instance by instance; what you are looking for is actually called incremental or online learning, and it is available in scikit-learn's SGDClassifier for linear SVM and logistic regression, which indeed contains a partial_fit method.
Here is a quick example with dummy data:
import numpy as np
from sklearn import linear_model

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
Y = np.array([1, 1, 2, 2])

clf = linear_model.SGDClassifier(max_iter=1000, tol=1e-3)
clf.partial_fit(X, Y, classes=np.unique(Y))  # classes must be provided on the first call

X_new = np.array([[-1, -1], [2, 0], [0, 1], [1, 1]])
Y_new = np.array([1, 1, 2, 1])
clf.partial_fit(X_new, Y_new)                # subsequent calls can omit classes
The default values for the loss and penalty arguments ('hinge' and 'l2' respectively) are those of a LinearSVC, so the above code essentially fits a linear SVM classifier with L2 regularization incrementally; these settings can of course be changed - check the docs for more details.
It is necessary to include the classes argument in the first call, which should contain all the existing classes in your problem (even though some of them might not be present in some of the partial fits); it can be omitted in subsequent calls of partial_fit - again, see the linked documentation for more details.
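For the tweet use case itself, a rough sketch (mine; iter_batches is a hypothetical generator standing in for however you read the data in chunks) would stream mini-batches through HashingVectorizer and partial_fit, so the full 1.6m x 20000 matrix never has to sit in memory:
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**15)   # stateless, nothing to fit
clf = SGDClassifier(max_iter=1000, tol=1e-3)       # default hinge loss = linear SVM
all_classes = np.array([0, 4])                     # Sentiment140 polarity labels

first_call = True
for texts, labels in iter_batches(batch_size=10000):   # hypothetical batch generator
    X_batch = vectorizer.transform(texts)              # sparse hashed features, this batch only
    if first_call:
        clf.partial_fit(X_batch, labels, classes=all_classes)
        first_call = False
    else:
        clf.partial_fit(X_batch, labels)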

How to understand SpatialDropout1D and when to use it?

Occasionally I see some models using SpatialDropout1D instead of Dropout. For example, in the part-of-speech tagging neural network, they use:
from keras.models import Sequential
from keras.layers import (Embedding, SpatialDropout1D, GRU, RepeatVector,
                          TimeDistributed, Dense, Activation)

model = Sequential()
model.add(Embedding(s_vocabsize, EMBED_SIZE,
                    input_length=MAX_SEQLEN))
model.add(SpatialDropout1D(0.2))  # <-- this
model.add(GRU(HIDDEN_SIZE, dropout=0.2, recurrent_dropout=0.2))
model.add(RepeatVector(MAX_SEQLEN))
model.add(GRU(HIDDEN_SIZE, return_sequences=True))
model.add(TimeDistributed(Dense(t_vocabsize)))
model.add(Activation("softmax"))
According to Keras' documentation, it says:
This version performs the same function as Dropout, however it drops
entire 1D feature maps instead of individual elements.
However, I am unable to understand the meaning of "entire 1D feature maps". More specifically, I am unable to visualize SpatialDropout1D in the same model explained on Quora.
Can someone explain this concept using the same model as on Quora?
Also, in what situations should we use SpatialDropout1D instead of Dropout?
To keep it simple, I would first note that so-called feature maps (1D, 2D, etc.) are just the regular channels. Let's look at examples:
Dropout(): Let's define a 2D input: [[1, 1, 1], [2, 2, 2]]. Dropout considers every element independently, and may result in something like [[1, 0, 1], [0, 2, 2]].
SpatialDropout1D(): In this case the result will look like [[1, 0, 1], [2, 0, 2]]. Notice that the second channel (the middle column) was zeroed across every timestep.
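A tiny runtime check (my own sketch with tf.keras) makes the difference visible; note that kept values are scaled by 1 / (1 - rate):
import tensorflow as tf

tf.random.set_seed(0)
x = tf.ones((1, 4, 3))  # (batch, timesteps, channels)

# Plain Dropout: each of the 12 elements is kept or dropped independently.
print(tf.keras.layers.Dropout(0.5)(x, training=True))

# SpatialDropout1D: whole channels are dropped, i.e. entire columns
# across all 4 timesteps are zeroed together.
print(tf.keras.layers.SpatialDropout1D(0.5)(x, training=True))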
The noise shape
In order to understand SpatialDropout1D, you should get used to the notion of the noise shape. In plain vanilla dropout, each element is kept or dropped independently. For example, if the tensor has shape [2, 2, 2], each of its 8 elements can be zeroed out depending on a random coin flip (with a certain "heads" probability); in total, there will be 8 independent coin flips, and any number of values may become zero, from 0 to 8.
Sometimes there is a need to do more than that. For example, one may need to drop a whole slice along the 0 axis. The noise_shape in this case is [1, 2, 2] and the dropout involves only 4 independent coin flips; along the first axis, elements are either kept together or dropped together. The number of zeroed elements can be 0, 2, 4, 6 or 8; it cannot be 1 or 5.
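Here is a small sketch (mine) of that [1, 2, 2] noise shape using tf.nn.dropout, where the keep/drop decision is shared along the first axis:
import tensorflow as tf

tf.random.set_seed(0)
x = tf.ones((2, 2, 2))

# One coin flip per (row, column) position, broadcast across the first axis,
# so zeros always come in pairs; kept values are scaled by 1 / (1 - rate).
print(tf.nn.dropout(x, rate=0.5, noise_shape=[1, 2, 2]))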
Another way to view this is to imagine that input tensor is in fact [2, 2], but each value is double-precision (or multi-precision). Instead of dropping the bytes in the middle, the layer drops the full multi-byte value.
Why is it useful?
The example above is just for illustration and isn't common in real applications. A more realistic example is this: shape(x) = [k, l, m, n] and noise_shape = [k, 1, 1, n]. In this case, each batch and channel component is kept or dropped independently, but each row and column is kept or dropped together. In other words, the whole [l, m] feature map is either kept or dropped.
You may want to do this to account for the correlation of adjacent pixels, especially in the early convolutional layers. Effectively, you want to prevent co-adaptation of pixels with their neighbors across the feature maps, and make them learn as if no other feature maps existed. This is exactly what SpatialDropout2D is doing: it promotes independence between feature maps.
The SpatialDropout1D is very similar: given shape(x) = [k, l, m] it uses noise_shape = [k, 1, m] and drops entire 1-D feature maps.
Reference: Efficient Object Localization Using Convolutional Networks
by Jonathan Tompson et al.

scikit multilabel classification: ValueError: bad input shape

I believe SGDClassifier() with loss='log' supports multilabel classification and that I do not have to use OneVsRestClassifier. Check this.
Now, my dataset is quite big and I am using HashingVectorizer, passing the result as input to SGDClassifier. My target has 42048 features.
When I run this, as follows:
clf.partial_fit(X_train_batch, y)
I get: ValueError: bad input shape (300000, 42048).
I have also used classes as the parameter as follows, but still same problem.
clf.partial_fit(X_train_batch, y, classes=np.arange(42048))
In the documentation of SGDClassifier, it says y : numpy array of shape [n_samples]
No, SGDClassifier does not do multilabel classification -- it does multiclass classification, which is a different problem, although both are solved using a one-vs-all problem reduction.
Then, neither SGD nor OneVsRestClassifier.fit will accept a sparse matrix for y. The former wants an array of labels, as you've already found out. The latter wants, for multilabel purposes, a list of lists of labels, e.g.
y = [[1], [2, 3], [1, 3]]
to denote that X[0] has label 1, X[1] has labels {2,3} and X[2] has labels {1,3}.
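In current scikit-learn versions the sequence-of-sequences form is no longer accepted either, so in practice you binarize y and wrap SGDClassifier in OneVsRestClassifier. A brief sketch (mine, with dummy features X):
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

X = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])   # dummy features
y = [[1], [2, 3], [1, 3]]                            # labels per sample, as above

Y = MultiLabelBinarizer().fit_transform(y)           # (3, 3) binary indicator matrix
clf = OneVsRestClassifier(SGDClassifier(max_iter=1000, tol=1e-3))
clf.fit(X, Y)
print(clf.predict(X))                                # indicator matrix, one column per label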
