I'm doing tweet classification, where each tweet can belong to one of a few classes.
The training set labels are given as the probability of each sample belonging to each class.
E.g.: tweet#1: C1-0.6, C2-0.4, C3-0.0 (C1, C2, C3 being classes)
I'm planning to use a Naive Bayes classifier from scikit-learn, but I couldn't find a fit method in naive_bayes.py that takes per-class probabilities for training.
I need a classifier that accepts per-class output probabilities for the training set
(i.e., y.shape = [n_samples, n_classes]).
How can I process my data set to apply a Naive Bayes classifier?
This is not so easy, as "class probabilities" can have many interpretations.
In the case of an NB classifier with sklearn, the easiest procedure I see is:
Split (duplicate) your training samples according to the following rule:
given a sample (x, [p1, p2, ..., pk]) (where pi is the probability of the ith class), create the artificial training samples
(x, 1, p1), (x, 2, p2), ..., (x, k, pk). So you get k new observations, each "attached" to one class, and pi is treated as a sample weight, which NB (in sklearn) accepts.
Train your NB with fit(X, Y, sample_weight) (where X is the matrix of your x observations, Y is the vector of classes from the previous step, and sample_weight is the vector of pi values from the previous step).
For example, if your training set consists of two points:
( [0 1], [0.6 0.4] )
( [1 3], [0.1 0.9] )
you transform them to:
( [0 1], 1, 0.6 )
( [0 1], 2, 0.4 )
( [1 3], 1, 0.1 )
( [1 3], 2, 0.9 )
and train NB with
X = [ [0 1], [0 1], [1 3], [1 3] ]
Y = [ 1, 2, 1, 2 ]
sample_weight = [ 0.6, 0.4, 0.1, 0.9 ]
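A minimal runnable sketch of this procedure (the choice of MultinomialNB is illustrative; any sklearn NB variant whose fit accepts sample_weight works the same way):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[0, 1], [1, 3]])          # original observations
P = np.array([[0.6, 0.4], [0.1, 0.9]])  # per-class label probabilities

n_samples, n_classes = P.shape

# Duplicate each sample once per class (classes are 0-indexed here)
# and use its class probability as the sample weight.
X_dup = np.repeat(X, n_classes, axis=0)           # [[0,1],[0,1],[1,3],[1,3]]
y_dup = np.tile(np.arange(n_classes), n_samples)  # [0, 1, 0, 1]
weights = P.ravel()                               # [0.6, 0.4, 0.1, 0.9]

clf = MultinomialNB()
clf.fit(X_dup, y_dup, sample_weight=weights)
print(clf.predict_proba([[0, 1]]))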
I have an image matrix A. I want to learn a convolution kernel H that does the following operations:
A*H gives a tensor "Intermediate", and
Intermediate*H gives "A".
Here * represents the convolution operation (possibly using FFT). I only have the image. I started with a random H matrix. I want to minimise the loss between the final output [(A*H)*H] and A, and use that to get the optimised H. Can someone suggest how I should proceed using Torch?
N.B.: I've written a function that does the convolution operations and returns a tensor that I want to be like A.
Does this code match your requirement?
import torch

A = torch.randn([1, 1, 4, 4])
conv = torch.nn.Conv2d(1, 1, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(conv.parameters(), lr=0.001)

for i in range(1000):
    optimizer.zero_grad()
    out = conv(conv(A))       # apply the kernel twice: (A*H)*H
    loss = criterion(out, A)  # compare the result against A
    loss.backward()
    optimizer.step()
    if i % 100 == 0:
        print(i, loss.item())
And of course, the convolution weight will converge to 1.
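If you want H to be a real spatial kernel rather than a 1×1 scalar, here is a sketch of the same loop with a 3×3 kernel (the kernel size and padding are assumptions, not part of the question); padding=1 keeps the spatial size so (A*H)*H stays comparable to A:

import torch

A = torch.randn(1, 1, 32, 32)
# 3x3 kernel H with "same" padding; bias disabled so the layer is a pure convolution
conv = torch.nn.Conv2d(1, 1, 3, padding=1, bias=False)
optimizer = torch.optim.SGD(conv.parameters(), lr=0.001)

for i in range(1000):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(conv(conv(A)), A)  # ||(A*H)*H - A||^2
    loss.backward()
    optimizer.step()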
I am trying to predict credit card approvals using the relevant dataset from the UCI ML Repo. The problem is that the target encodes credit card applications as '+' for approved and '-' for rejected.
As there are slightly more rejected applications in the target, all scorers and estimators treat the rejected class as positive when it should be the other way around. Because of this, my confusion matrix is all messed up: I think the True Positives and True Negatives, and the False Positives and False Negatives, get inverted.
How can I specify the positive class manually?
I do not know of any scikit-learn estimator or transformer that lets you flip the positive and negative class identifiers via a parameter, but I can think of two ways to work around this:
Method 1: You transform the array labels yourself before fitting the estimator
That can be easily achieved for numpy arrays:
import numpy as np

y = np.array(['+', '+', '+', '-', '-'])
y_transformed = [1 if i == '+' else 0 for i in y]
and also pandas Series objects:
import pandas as pd

y = pd.Series(['+', '+', '+', '-', '-'])
y_transformed = y.map({'+': 1, '-': 0})
In both cases the output will be [1, 1, 1, 0, 0]
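A quick end-to-end sketch of Method 1 (the estimator and feature values below are illustrative, not from the question):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[2.0], [1.8], [2.2], [0.3], [0.5]])  # toy feature values
y = np.array(['+', '+', '+', '-', '-'])

y_transformed = np.where(y == '+', 1, 0)  # '+' becomes the positive class 1

clf = LogisticRegression().fit(X, y_transformed)
print(clf.predict(X))  # predictions are now in {0, 1}, with 1 meaning '+'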
Method 2: You define the labels parameter in confusion_matrix
scikit-learn's confusion_matrix has a labels parameter that lets you reorder the labels. Use it like this:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0])

print(confusion_matrix(y_true, y_pred))
# output
[[2 0]
[1 2]]
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
# output
[[2 1]
[0 2]]
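As an aside (not covered in the answer above), many scikit-learn metric functions also accept a pos_label argument, which lets you keep the string labels as they are:

import numpy as np
from sklearn.metrics import f1_score, precision_score

y_true = np.array(['+', '+', '+', '-', '-'])
y_pred = np.array(['+', '-', '+', '-', '-'])

# Treat '+' as the positive class without remapping anything.
print(f1_score(y_true, y_pred, pos_label='+'))
print(precision_score(y_true, y_pred, pos_label='+'))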
I now have two point sets (tensors) A and B, shaped like
A.size() >> (50, 3), e.g. [[0, 0, 0], [0, 1, 2], ..., [1, 1, 1]]
B.size() >> (10, 3)
where the first dimension is the number of points and the second holds the coordinates (x, y, z).
To some extent, the question could also be simplified to "finding common elements between two tensors". Is there a quick way to do this without nested loops?
You can quickly compute all the 50x10 distances using:
d2 = ((A[:, None, :] - B[None, ...])**2).sum(dim=2)
Once you have all the pairwise distances, you can select "similar" pairs whose distance does not exceed a threshold thr:
(d2 < thr).nonzero()
returns the (a-idx, b-idx) pairs of "similar" points.
If you want to match the points exactly, you can instead do:
((A[:, None, :] == B[None, ...]).all(dim=2)).nonzero()
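A self-contained sketch of both variants on tiny tensors (the data is made up for illustration):

import torch

A = torch.tensor([[0., 0., 0.], [0., 1., 2.], [1., 1., 1.]])
B = torch.tensor([[0., 1., 2.], [5., 5., 5.]])

# Squared pairwise distances, shape (|A|, |B|)
d2 = ((A[:, None, :] - B[None, ...]) ** 2).sum(dim=2)

thr = 1e-6
print((d2 < thr).nonzero())  # tensor([[1, 0]]): A[1] matches B[0]

# Exact matching: compare coordinate-wise, require all three to agree
print((A[:, None, :] == B[None, ...]).all(dim=2).nonzero())  # tensor([[1, 0]])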
I was trying to implement an XOR gate with TensorFlow. I succeeded in implementing it, but I don't fully understand why it works. I got help from Stack Overflow posts here and here, covering both one-hot and non-one-hot outputs. Here is the network as I understood it, to set things clear.
My Question #1:
Notice the ReLU and sigmoid functions. Why do we need them (specifically the ReLU function)? You may say it's to achieve non-linearity, and I understand how ReLU achieves non-linearity; I got the answer from here. From what I understand, the difference between using ReLU and not using it is this (see the picture). [I tested the tf.nn.relu function; the output looks like this.]
Now, if the first function works, why not the second one? From my perspective, ReLU achieves non-linearity by combining multiple linear functions, and both (the upper two) are built from linear pieces. If the first one achieves non-linearity, the second one should too, shouldn't it? The question is: why does the network get stuck without the ReLU?
XOR gate with one-hot true outputs

import numpy as np
import tensorflow as tf

hidden1_neuron = 10

def Network(x, weights, bias):
    layer1 = tf.nn.relu(tf.matmul(x, weights['h1']) + bias['h1'])
    layer_final = tf.matmul(layer1, weights['out']) + bias['out']
    return layer_final

weight = {
    'h1' : tf.Variable(tf.random_normal([2, hidden1_neuron])),
    'out': tf.Variable(tf.random_normal([hidden1_neuron, 2]))
}
bias = {
    'h1' : tf.Variable(tf.random_normal([hidden1_neuron])),
    'out': tf.Variable(tf.random_normal([2]))
}

x = tf.placeholder(tf.float32, [None, 2])
y = tf.placeholder(tf.float32, [None, 2])

net = Network(x, weight, bias)

cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=net, labels=y)
loss = tf.reduce_mean(cross_entropy)
train_op = tf.train.AdamOptimizer(0.2).minimize(loss)
init_op = tf.initialize_all_variables()

xTrain = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
yTrain = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

with tf.Session() as sess:
    sess.run(init_op)
    for i in range(5000):
        sess.run(train_op, feed_dict={x: xTrain, y: yTrain})
        loss_val = sess.run(loss, feed_dict={x: xTrain, y: yTrain})
        if i % 500 == 0:
            print(loss_val)
    result = sess.run(net, feed_dict={x: xTrain})
    print(result)
The code above implements the XOR gate with one-hot true outputs. If I take out tf.nn.relu, the network gets stuck. Why?
My Question #2:
How can I tell whether a network is going to get stuck in some local minimum [or at some value]? Is it from the plot of the cost function (or loss function)? Say, for the network designed above, I used cross entropy as the loss function, but I could not find a plot of the cross-entropy function. (If you can provide one, that would be very helpful.)
My Question #3:
Notice the line hidden1_neuron = 10 in the code. It means I have set the number of neurons in the hidden layer to 10. Reducing the number of neurons to 5 makes the network get stuck. So what should the number of neurons in the hidden layer be?
The output when the network works the way it is supposed to :
2.42076
0.000456363
0.000149548
7.40216e-05
4.34194e-05
2.78939e-05
1.8924e-05
1.33214e-05
9.62602e-06
7.06308e-06
[[ 7.5128479 -7.58900356]
[-5.65254211 5.28509617]
[-6.96340656 6.62380219]
[ 7.26610374 -5.9665451 ]]
The output when the network gets stuck:
1.45679
0.346579
0.346575
0.346575
0.346574
0.346574
0.346574
0.346574
0.346574
0.346574
[[ 15.70696926 -18.21559143]
[ -7.1562047 9.75774956]
[ -0.03214722 -0.03214724]
[ -0.03214722 -0.03214724]]
Question 1
Both the ReLU and the sigmoid function are non-linear. By contrast, the function drawn to the right of the ReLU function is linear, and applying multiple linear activation functions still leaves the network linear.
Therefore, the network gets stuck when trying to fit a linear model to a non-linear problem.
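A quick numeric check of that claim (a sketch, not part of the original answer): stacking two linear layers with no activation in between collapses to a single linear layer.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 10)), rng.normal(size=10)
W2, b2 = rng.normal(size=(10, 2)), rng.normal(size=2)

x = np.array([[0.0, 1.0]])

# Two linear layers, no activation in between
two_layers = (x @ W1 + b1) @ W2 + b2

# The equivalent single linear layer: W = W1 W2, b = b1 W2 + b2
one_layer = x @ (W1 @ W2) + (b1 @ W2 + b2)

print(np.allclose(two_layers, one_layer))  # True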
Question 2
Yes, you will have to pay attention to the progression of the error rate. In larger problem instances, you would typically watch the development of the error function on your test set, by measuring the accuracy of the network after each period of training.
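As a sketch of what that looks like in practice (matplotlib assumed; the values are pasted from the "stuck" run above, where in a real run you would append loss_val inside the training loop):

import matplotlib.pyplot as plt

# Loss values logged every 500 steps, taken from the "stuck" run above
losses = [1.45679, 0.346579, 0.346575, 0.346575, 0.346574,
          0.346574, 0.346574, 0.346574, 0.346574, 0.346574]

plt.plot(range(0, 5000, 500), losses, marker="o")
plt.xlabel("training step")
plt.ylabel("cross-entropy loss")
plt.title("A loss curve that plateaus well above zero signals a stuck network")
plt.show()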
Question 3
The XOR problem requires at least 2 input, 2 hidden, and 1 output node; that is, five nodes are required to correctly model the XOR problem with a simple neural network.
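To make that concrete, here is a hand-constructed 2-2-1 ReLU network that solves XOR (the weights are chosen by hand for illustration; they are not from the answer above):

import numpy as np

def relu(z):
    return np.maximum(z, 0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Two hidden units: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
W1 = np.array([[1, 1],
               [1, 1]])
b1 = np.array([0, -1])

# Output: h1 - 2*h2, which equals XOR on the four inputs
W2 = np.array([[1], [-2]])

out = relu(X @ W1 + b1) @ W2
print(out.ravel())  # [0 1 1 0]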
I know that a Gaussian Process model is best suited to regression rather than classification. However, I would still like to apply a Gaussian Process to a classification task, but I am not sure of the best way to bin the predictions generated by the model. I have reviewed the Gaussian Process classification example available on the scikit-learn website at:
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
But I found this example confusing (I have listed the things I found confusing about it at the end of the question). To try to get a better understanding, I created a very basic Python code example using scikit-learn that generates classifications by applying a decision boundary to the predictions made by a Gaussian Process:
# A minimal example illustrating how to use a
# Gaussian Process for binary classification
import numpy as np
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.gaussian_process import GaussianProcess  # legacy API (sklearn < 0.18)

if __name__ == "__main__":
    # Define some basic training and test data.
    # If the descriptive features have large values
    # (i.e., 8s and 9s), the target is 1;
    # if the descriptive features have small values
    # (i.e., 2s and 3s), the target is 0.
    TRAININPUTS = np.array([[8, 9, 9, 9, 9],
                            [9, 8, 9, 9, 9],
                            [9, 9, 8, 9, 9],
                            [9, 9, 9, 8, 9],
                            [9, 9, 9, 9, 8],
                            [2, 3, 3, 3, 3],
                            [3, 2, 3, 3, 3],
                            [3, 3, 2, 3, 3],
                            [3, 3, 3, 2, 3],
                            [3, 3, 3, 3, 2]])
    TRAINTARGETS = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
    TESTINPUTS = np.array([[8, 8, 9, 9, 9],
                           [9, 9, 8, 8, 9],
                           [3, 3, 3, 3, 3],
                           [3, 2, 3, 2, 3],
                           [3, 2, 2, 3, 2],
                           [2, 2, 2, 2, 2]])
    TESTTARGETS = np.array([1, 1, 0, 0, 0, 0])
    DECISIONBOUNDARY = 0.5

    # Fit a Gaussian Process model to the data
    gp = GaussianProcess(theta0=10e-1, random_start=100)
    gp.fit(TRAININPUTS, TRAINTARGETS)

    # Generate a set of predictions for the test data
    y_pred = gp.predict(TESTINPUTS)
    print("Predicted Values:")
    print(y_pred)
    print("----------------")

    # Convert the continuous predictions into classes
    # by splitting on a decision boundary of 0.5
    predictions = []
    for y in y_pred:
        if y > DECISIONBOUNDARY:
            predictions.append(1)
        else:
            predictions.append(0)
    print("Binned Predictions (decision boundary = 0.5):")
    print(predictions)
    print("----------------")

    # Print the confusion matrix, specifying 1 as the positive class
    cm = confusion_matrix(TESTTARGETS, predictions, labels=[1, 0])
    print("Confusion Matrix (1 as positive class):")
    print(cm)
    print("----------------")

    print("Classification Report:")
    print(metrics.classification_report(TESTTARGETS, predictions))
When I run this code I get the following output:
Predicted Values:
[ 0.96914832 0.96914832 -0.03172673 0.03085167 0.06066993 0.11677634]
----------------
Binned Predictions (decision boundary = 0.5):
[1, 1, 0, 0, 0, 0]
----------------
Confusion Matrix (1 as positive class):
[[2 0]
[0 4]]
----------------
Classification Report:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00         4
          1       1.00      1.00      1.00         2

avg / total       1.00      1.00      1.00         6
The approach used in this basic example seems to work fine with this simple dataset. But this approach is very different from the classification example given on the scikit-learn website that I mentioned above (URL repeated here):
http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gp_probabilistic_classification_after_regression.html
So I'm wondering if I am missing something here, and I would appreciate it if anyone could:
With respect to the classification example given on the scikit-learn website:
1.1 explain what the probabilities generated in this example are probabilities of? Are they the probability of the query instance belonging to the class >0?
1.2 explain why the example uses a cumulative distribution function instead of a probability density function?
1.3 explain why the example divides the predictions made by the model by the square root of the mean squared error before they are input into the cumulative distribution function?
With respect to the basic code example I have listed here: clarify whether or not applying a simple decision boundary to the predictions generated by a Gaussian Process model is an appropriate way to do binary classification.
Sorry for such a long question and thanks for any help.
In the GP classifier, a standard GP distribution over functions is "squashed," usually using the standard normal CDF (also called the probit function), to map it to a distribution over binary categories.
Another interpretation of this process is through a hierarchical model (this paper has the derivation), with a hidden variable drawn from a Gaussian Process.
In sklearn's GP library, it looks like the outputs of y_pred, MSE = gp.predict(xx, eval_MSE=True) are the (approximate) posterior means (y_pred) and posterior variances (MSE) evaluated at the points in xx, before any squashing occurs.
To obtain the probability that a point from the test set belongs to the positive class, you can convert the normal distribution over y_pred into a binary distribution by applying the normal CDF (see the paper linked above for details).
Under the probit squashing function, the hierarchical model's decision boundary sits at 0 (the standard normal distribution is symmetric around 0, meaning Φ(0) = 0.5), so you should set DECISIONBOUNDARY = 0.
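Continuing from the example code above, a minimal sketch of that squashing step (assuming the legacy GaussianProcess API, and mirroring the linked scikit-learn example, which divides the latent mean by the predictive standard deviation before applying the CDF):

import numpy as np
from scipy.stats import norm

# Latent posterior mean and variance from the (legacy) GP regressor
y_pred, MSE = gp.predict(TESTINPUTS, eval_MSE=True)

# Squash through the standard normal CDF to get P(positive class)
prob_positive = norm.cdf(y_pred / np.sqrt(MSE))

# Thresholding the probability at 0.5 is equivalent to thresholding
# the latent mean y_pred at 0
predictions = (prob_positive > 0.5).astype(int)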