Converting Multilabel dataset into Single Label? - machine-learning

I am working on single-label text categorization with the Reuters-21578 dataset, but the dataset is multi-label by default. Many researchers have removed the multi-label instances from this dataset, and their instance counts per Reuters category are quite different from mine. How can I remove every instance that belongs to more than one category? Can I use Weka or RapidMiner to identify the multi-label instances in a dataset?
Example:
Input Dataset = {x1, x2, x3, x4, x5, x6, x7, x8, x9, x10}
Labels = {acq, earn, grain , corn}
Classification Results =
x1, x2, x3 = acq
x4, x5 = earn
x6, x7, x8 = grain
x9 = grain, corn
x10 = grain, acq
Output Dataset (what I want):
output dataset = {x1, x2, x3, x4, x5, x6, x7, x8}
output labels = {acq, earn, grain, corn}
Classification Results =
x1, x2, x3 = acq
x4, x5 = earn
x6, x7, x8 = grain
**OR**
(This is what I assume I have achieved with RapidMiner's Polynominal by Binominal Classification operator)
output dataset = {x1, x2, x3, x4, x5, x6, x7, x8, x9, x10}
output labels = {acq, earn, grain, corn}
Classification Results =
x1, x2, x3 = acq
x4, x5 = earn
x6, x7, x8, x9, x10 = grain (x9 and x10 each reduced to the single label grain)
Thanks in advance

The simplest way is to break the dataset into binary problems. If, for example, you have the instances
text1: science
text2: sports, politics
break the dataset into 3 binary datasets, one per class:
dataset1 (science): text1:true, text2:false
dataset2 (sports): text1:false, text2:true
dataset3 (politics): text1:false, text2:true
Create 3 binary classifiers, one for each class, use the corresponding datasets to train them, and combine the results.
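If the goal is just the first output above (drop every instance that carries more than one label), both Weka and RapidMiner can filter on label counts, but a few lines of scripting also work. Below is a minimal sketch in Python using the toy instances from the question; the data structure is an illustrative assumption, not tied to either tool:

# each instance maps to its list of labels (toy data from the question;
# in practice this would be loaded from Reuters-21578)
dataset = {
    "x1": ["acq"], "x2": ["acq"], "x3": ["acq"],
    "x4": ["earn"], "x5": ["earn"],
    "x6": ["grain"], "x7": ["grain"], "x8": ["grain"],
    "x9": ["grain", "corn"], "x10": ["grain", "acq"],
}

# option 1: keep only single-label instances (the first desired output)
single_label = {doc: labels[0] for doc, labels in dataset.items()
                if len(labels) == 1}
print(sorted(single_label))  # x1..x8 survive; x9 and x10 are removed

# option 2: one-vs-rest decomposition into one binary dataset per class
all_labels = {"acq", "earn", "grain", "corn"}
binary_datasets = {
    label: {doc: label in labels for doc, labels in dataset.items()}
    for label in all_labels
}
print(binary_datasets["grain"])  # True for x6..x10, False elsewhere

Option 2 corresponds to the binary decomposition described above: one classifier is trained per label, and their outputs are combined.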

Related

Difference in the accuracy of an ANN whose weights are initialized by using np.random.randn and np.random.rand

Using a tutorial on neural networks from scratch, I created an ANN with 1 hidden layer for the MNIST dataset. In the tutorial they use np.random.rand() - 0.5 to initialize the weights, and the network works fine. When I do the same it works, but if I use np.random.randn() instead of rand() - 0.5, the final accuracy drops significantly.
I don't understand why that would happen, because in this question the response was that using randn is better.
Code for the network:
import numpy as np

def init_params():
    W1 = np.random.rand(10, 784) - 0.5
    b1 = np.random.rand(10, 1) - 0.5
    W2 = np.random.rand(10, 10) - 0.5
    b2 = np.random.rand(10, 1) - 0.5
    return W1, b1, W2, b2

def ReLU(Z):
    return np.maximum(0, Z)

def dReLU(Z):
    return Z > 0

def softmax(Z):
    # built-in sum() iterates over axis 0, i.e. sums over the classes
    A = np.exp(Z) / sum(np.exp(Z))
    return A

def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1.dot(X) + b1
    A1 = ReLU(Z1)
    Z2 = W2.dot(A1) + b2
    A2 = softmax(Z2)
    return Z1, A1, Z2, A2

def one_hot(Y):
    one_hot_Y = np.zeros((Y.size, Y.max() + 1))
    one_hot_Y[np.arange(Y.size), Y] = 1
    one_hot_Y = one_hot_Y.T
    return one_hot_Y

def back_prop(Z1, A1, Z2, A2, W2, X, Y):
    m = Y.size
    one_hot_Y = one_hot(Y)
    dZ2 = A2 - one_hot_Y
    dW2 = 1 / m * np.dot(dZ2, A1.T)
    db2 = 1 / m * np.sum(dZ2)
    # the hidden-layer gradient must pass through the ReLU derivative;
    # the original line was missing the dReLU(Z1) factor
    dZ1 = np.dot(W2.T, dZ2) * dReLU(Z1)
    dW1 = 1 / m * np.dot(dZ1, X.T)
    db1 = 1 / m * np.sum(dZ1)
    return dW1, db1, dW2, db2

def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2

def get_predictions(A2):
    return np.argmax(A2, 0)

def get_accuracy(predictions, Y):
    print(predictions, Y)
    return np.sum(predictions == Y) / Y.size

def gradient_descent(X, Y, iterations, alpha):
    W1, b1, W2, b2 = init_params()
    for i in range(iterations):
        Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
        dW1, db1, dW2, db2 = back_prop(Z1, A1, Z2, A2, W2, X, Y)
        W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha)
        if i % 10 == 0:
            print('Iteration: ', i)
            print('Accuracy: ', get_accuracy(get_predictions(A2), Y))
    return W1, b1, W2, b2
Number of epochs = 500
Learning rate = 0.15
Accuracy of using np.random.rand() - 0.5 after 500 epochs: 80.6%
Accuracy of using np.random.randn() after 500 epochs: 55.2%
It also seems that the accuracy stops changing after the 100th epoch for the method that uses np.random.randn()
The data I am using is the MNIST digit classification dataset.
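A plausible explanation (an assumption on my part, not stated in the question): np.random.rand() - 0.5 gives weights uniformly distributed in [-0.5, 0.5], while np.random.randn() samples a standard normal with standard deviation 1. With 784 inputs per hidden unit, the larger weights drive the pre-activations to big magnitudes (np.exp in softmax can even overflow), training becomes unstable, and accuracy stalls, which would match the plateau after epoch 100. The usual remedy when using randn is to scale by the layer's fan-in, e.g. He initialization for ReLU networks. A minimal sketch of that variant (init_params_he is a hypothetical replacement for init_params above):

import numpy as np

def init_params_he():
    # He initialization: randn scaled by sqrt(2 / fan_in), a common
    # choice for ReLU layers; biases start at zero
    W1 = np.random.randn(10, 784) * np.sqrt(2 / 784)
    b1 = np.zeros((10, 1))
    W2 = np.random.randn(10, 10) * np.sqrt(2 / 10)
    b2 = np.zeros((10, 1))
    return W1, b1, W2, b2

With this scaling, randn-based weights should train at least as well as the rand() - 0.5 scheme.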

How to discard a branch after training a pytorch model

I am trying to implement an FCN in PyTorch with two encoder–decoder branches (the overall structure comes from a diagram that is not reproduced here).
The code so far looks like below:
class SNet(nn.Module):
    def __init__(self):
        super(SNet, self).__init__()
        self.enc_a = encoder(...)
        self.dec_a = decoder(...)
        self.enc_b = encoder(...)
        self.dec_b = decoder(...)

    def forward(self, x1, x2):
        x1 = self.enc_a(x1)
        x2 = self.enc_b(x2)
        x2 = self.dec_b(x2)
        x1 = self.dec_a(torch.cat((x1, x2), dim=-1))
        return x1, x2
In Keras it is relatively easy to do this using the functional API. However, I could not find any concrete example or tutorial for doing it in PyTorch.
How can I discard dec_a (the decoder part of the autoencoder branch) after training?
During joint training, will the loss be the (optionally weighted) sum of the losses from the two branches?
You can also define separate modes for your model for training and inference:
class SNet(nn.Module):
    def __init__(self):
        super(SNet, self).__init__()
        self.enc_a = encoder(...)
        self.dec_a = decoder(...)
        self.enc_b = encoder(...)
        self.dec_b = decoder(...)
        # nn.Module already defines self.training (toggled by
        # model.train() / model.eval()), so this line is redundant
        # but harmless
        self.training = True

    def forward(self, x1, x2):
        if self.training:
            x1 = self.enc_a(x1)
            x2 = self.enc_b(x2)
            x2 = self.dec_b(x2)
            x1 = self.dec_a(torch.cat((x1, x2), dim=-1))
            return x1, x2
        else:
            x1 = self.enc_a(x1)  # note: unused for the returned value; could be skipped
            x2 = self.enc_b(x2)
            x2 = self.dec_b(x2)
            return x2
These blocks are examples and may not do exactly what you want, because there is some ambiguity between how the training and inference operations are defined in your block chart versus your code. In any case, they show how to use certain modules only during training mode; you then set the flag accordingly.
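Two small additions, sketched under the assumption that real encoder/decoder modules replace the ... placeholders above: nn.Module already maintains self.training via model.train() and model.eval(), so you rarely need to set the flag by hand; and once joint training is done, you can delete dec_a outright or save only the submodules needed for inference:

import torch

model = SNet()
model.train()   # sets model.training = True; forward returns (x1, x2)
# ... joint training loop; the loss can indeed be a weighted sum,
# e.g. loss = w_a * loss_a + w_b * loss_b ...

model.eval()    # sets model.training = False; forward returns x2 only

# discard the decoder branch: deleting the attribute removes the
# submodule, so its parameters leave model.parameters() and state_dict()
del model.dec_a

# alternatively, persist only the parts needed at inference time
torch.save({
    "enc_a": model.enc_a.state_dict(),
    "enc_b": model.enc_b.state_dict(),
    "dec_b": model.dec_b.state_dict(),
}, "snet_inference.pt")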

How to break a long line in Lua

Util = {
    scale = function (x1, x2, x3, y1, y3) return (y1) + ( (y2) - (y1)) * \
        ( (x2) - (x1)) / ( (x3) - (x1)) end
}
print(Util.scale(1, 2, 3, 1, 3))
What is the proper syntax for breaking a long line in Lua?
In your particular case the convention would be as follows (note that y2 in your code is presumably a typo for the parameter y3, so the snippets below use y3):
Util = {
    scale = function (x1, x2, x3, y1, y3)
        return (y1) + ( (y3) - (y1)) * ( (x2) - (x1)) / ( (x3) - (x1))
    end
}
Here the break falls between statements.
The expression can be broken further if the multiplication needs to be split:
Util = {
    scale = function (x1, x2, x3, y1, y3)
        return (y1) + ( (y3) - (y1)) *
            ( (x2) - (x1)) / ( (x3) - (x1))
    end
}
Here the multiplication token is used to split the line. By leaving a binary operator at the end of the line, the parser requires more input to complete the expression, so it looks onto the next line.
Lua is generally blind to newline characters; they are just whitespace. However, there are cases where a line break can change the meaning, so I would limit breaks to places where the parser has an obvious need for more input. For example:
a = f
(g).x(a)
is a specific case where the code could be parsed either as the single statement a = f(g).x(a) or as two statements, a = f followed by the call (g).x(a).
By breaking after a token which requires continuation, you ensure you are working with the parser.

How to modify data-table columns with Highcharts

Here is the fiddle I'm referring to: http://jsfiddle.net/mm484L57/
Highcharts.tableLine = function (renderer, x1, y1, x2, y2) {
    renderer.path(['M', x1, y1, 'L', x2, y2]).attr({
        'stroke': 'silver',
        'stroke-width': 5
    }).add();
};
I would like the data table to be transposed: the city names (Tokyo, New York, Berlin, London) should run row-wise and the months column-wise, rather than the current layout.
Any help is much appreciated.
Thanks.

training random forest using caret package

I want to use my training data to train a random forest model, but an error occurs. The error message is below:
Error in train.default(x, y, weights = w, ...) :
At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities
are generated because the variables names will be converted to X1, X2, X3, X4, X5, X6, X7 . Please use factor
levels that can be used as valid R variable names (see ?make.names for help).
Below is my code:
rf.ctrl <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 10,
                        classProbs = TRUE,
                        summaryFunction = twoClassSummary)
set.seed(256)
# train the classification model with random forest
rf.model <- train(as.factor(response) ~ ., data = trainvals,
                  method = "rf",
                  trControl = rf.ctrl,
                  tuneLength = 10,
                  metric = "ROC")  # "metic" in the original was a typo
The class levels of response are 1, 2, 3, 4, 5, 6, and 7 (the str(trainvals) output is not reproduced here).
The message is not about column types; it appears because the factor levels of response (1 through 7) are not syntactically valid R variable names, which caret requires when classProbs = TRUE. Rename the levels to valid names before training:
levels(trainvals$response) <- make.names(levels(trainvals$response))
This converts the levels to X1 ... X7. Note also that twoClassSummary and metric = "ROC" are meant for two-class problems; with seven classes, use caret's multiClassSummary (or drop the custom summary) instead.
