I am trying to train a basic SVM model for multiclass text classification in Julia. My dataset has around 75K rows and 2 columns (text and label). The dataset consists of abstracts of scientific papers gathered from PubMed, and it has 10 labels.
I keep receiving two different MethodErrors. The first one is:
ERROR: MethodError: no method matching DocumentTermMatrix(::Vector{String})
I have tried:
convert(Array,data[:,:text])
and also:
convert(Matrix,data[:,:text])
Array conversion gives the same error, and matrix conversion gives:
ERROR: MethodError: no method matching (Matrix)(::Vector{String})
My code is:
using DataFrames, CSV, StatsBase, Printf, LIBSVM, TextAnalysis, Random
function ReadData(data)
    df = CSV.read(data, DataFrame)
    return df
end
function splitdf(df, pct)
    @assert 0 <= pct <= 1
    ids = collect(axes(df, 1))
    shuffle!(ids)
    sel = ids .<= nrow(df) .* pct
    return view(df, sel, :), view(df, .!sel, :)
end
function Feature_Extract(data)
    Text = convert(Array, data[:, :text])
    m = DocumentTermMatrix(Text)
    X = tfidf(m)
    return X
end
function Classify(data)
    data = ReadData(data)
    train, test = splitdf(data, 0.5)
    ytrain = train.label
    ytest = test.label
    Xtrain = Feature_Extract(train)
    Xtest = Feature_Extract(test)
    model = svmtrain(Xtrain, ytrain)
    ŷ, decision_values = svmpredict(model, Xtest);
    @printf "Accuracy: %.2f%%\n" mean(ŷ .== ytest) * 100
end
data = "data/composite_data.csv"
@time Classify(data)
I would appreciate your help in solving this problem.
EDIT:
I have managed to create the corpus, but I am now facing a DimensionMismatch error:
using DataFrames, CSV, StatsBase, Printf, LIBSVM, TextAnalysis, Random
function ReadData(data)
    df = CSV.read(data, DataFrame)
    # count = countmap(df.label)
    # println(count)
    # amt, lesslabel = findmin(count)
    # println(amt, lesslabel)
    # println(first(df, 5))
    return df
end
function splitdf(df, pct)
    @assert 0 <= pct <= 1
    ids = collect(axes(df, 1))
    shuffle!(ids)
    sel = ids .<= nrow(df) .* pct
    return view(df, sel, :), view(df, .!sel, :)
end
function Feature_Extract(data)
    crps = Corpus(StringDocument.(data.text))
    update_lexicon!(crps)
    m = DocumentTermMatrix(crps)
    X = tf_idf(m)
    return X
end
function Classify(data)
    data = ReadData(data)
    # println(labels)
    # println(first(instances))
    train, test = splitdf(data, 0.5)
    ytrain = train.label
    ytest = test.label
    Xtrain = Feature_Extract(train)
    Xtest = Feature_Extract(test)
    model = svmtrain(Xtrain, ytrain)
    ŷ, decision_values = svmpredict(model, Xtest);
    @printf "Accuracy: %.2f%%\n" mean(ŷ .== ytest) * 100
end
data = "data/composite_data.csv"
@time Classify(data)
Error:
ERROR: DimensionMismatch("Size of second dimension of training instance\n matrix (247317) does not match length of\n labels (38263)")
(Copying Bogumił Kamiński's solution from the comments, as a community wiki answer, for better visibility.)
The argument to DocumentTermMatrix should be of type Corpus, as in this example.
A Corpus can be created with:
Corpus(StringDocument.(data.text))
There's a DimensionMismatch error after that, which is due to a mismatch between what tf_idf returns and what svmtrain expects: tf_idf returns a matrix with one row per document, whereas svmtrain expects one column per document, i.e. each column should be an observation. So applying permutedims to the result before passing it to svmtrain resolves the mismatch.
Related
I need to implement a multi-label image classification model in PyTorch. However, my data is not balanced, so I used the WeightedRandomSampler in PyTorch to create a custom dataloader. But when I iterate through the custom dataloader, I get the error: IndexError: list index out of range
I implemented the following code using this link: https://discuss.pytorch.org/t/balanced-sampling-between-classes-with-torchvision-dataloader/2703/3?u=surajsubramanian
def make_weights_for_balanced_classes(images, nclasses):
    count = [0] * nclasses
    for item in images:
        count[item[1]] += 1
    weight_per_class = [0.] * nclasses
    N = float(sum(count))
    for i in range(nclasses):
        weight_per_class[i] = N/float(count[i])
    weight = [0] * len(images)
    for idx, val in enumerate(images):
        weight[idx] = weight_per_class[val[1]]
    return weight
weights = make_weights_for_balanced_classes(train_dataset.imgs, len(full_dataset.classes))
weights = torch.DoubleTensor(weights)
sampler = WeightedRandomSampler(weights, len(weights))
train_loader = DataLoader(train_dataset, batch_size=4,sampler = sampler, pin_memory=True)
Based on the answer in https://stackoverflow.com/a/60813495/10077354, the following is my updated code. But even then, when I create a dataloader: loader = DataLoader(full_dataset, batch_size=4, sampler=sampler), len(loader) returns 1.
class_counts = [1691, 743, 2278, 1271]
num_samples = np.sum(class_counts)
labels = [tag for _,tag in full_dataset.imgs]
class_weights = [num_samples/class_counts[i] for i in range(len(class_counts))]
weights = [class_weights[labels[i]] for i in range(num_samples)]
sampler = WeightedRandomSampler(torch.DoubleTensor(weights), num_samples)
Thanks a lot in advance!
I included a utility function based on the accepted answer below:
def sampler_(dataset):
    dataset_counts = imageCount(dataset)
    num_samples = sum(dataset_counts)
    labels = [tag for _, tag in dataset]
    class_weights = [num_samples/dataset_counts[i] for i in range(n_classes)]
    weights = [class_weights[labels[i]] for i in range(num_samples)]
    sampler = WeightedRandomSampler(torch.DoubleTensor(weights), int(num_samples))
    return sampler
The imageCount function finds the number of images of each class in the dataset. Each row in the dataset contains the image and the class, so we take the second element of the tuple into consideration.
def imageCount(dataset):
    image_count = [0]*(n_classes)
    for img in dataset:
        image_count[img[1]] += 1
    return image_count
That code looks a bit complex... You can try the following:
#Let there be 9 samples and 1 sample in class 0 and 1 respectively
class_counts = [9.0, 1.0]
num_samples = sum(class_counts)
labels = [0, 0,..., 0, 1] #corresponding labels of samples
class_weights = [num_samples/class_counts[i] for i in range(len(class_counts))]
weights = [class_weights[labels[i]] for i in range(int(num_samples))]
sampler = WeightedRandomSampler(torch.DoubleTensor(weights), int(num_samples))
Here is an alternative solution:
import numpy as np
from torch.utils.data.sampler import WeightedRandomSampler
counts = np.bincount(y)
labels_weights = 1. / counts
weights = labels_weights[y]
WeightedRandomSampler(weights, len(weights))
where y is a list of labels corresponding to each sample, with shape (n_samples,), and the labels are encoded as [0, ..., n_classes].
The weights won't add up to 1, which is OK according to the official docs.
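For completeness, here is a minimal sketch of how this sampler can be plugged into a DataLoader; the toy data and batch size below are made up for illustration, not taken from the question:
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import WeightedRandomSampler

# Toy imbalanced data: 90 samples of class 0 and 10 samples of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.random.randn(len(y), 4).astype(np.float32)

counts = np.bincount(y)          # samples per class
labels_weights = 1. / counts     # inverse-frequency weight per class
weights = labels_weights[y]      # one weight per sample
sampler = WeightedRandomSampler(weights, len(weights))

dataset = TensorDataset(torch.from_numpy(X), torch.from_numpy(y))
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

# Batches drawn from this loader are now roughly class-balanced.
xb, yb = next(iter(loader))
print(yb)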
Here I show a dummy example that represents my actual problem.
My neural network (NN) receives one input and gives the probabilities for two output nodes. The code for the NN is:
class Net(torch.nn.Module):
    def __init__(self, N, M):
        super(Net, self).__init__()
        self.fc1 = torch.nn.Linear(N, 4)
        self.fc2 = torch.nn.Linear(4, 4)
        self.fc3 = torch.nn.Linear(4, M)

    def forward(self, x):
        x = torch.sigmoid(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        x = torch.softmax(self.fc3(x), 0)
        return x
The ABM class is our model; it iteratively calls Net::forward and, based on the returned probabilities, chooses an action; if the chosen action is the first index, it increments agent_count. The inputs xx are stored in states, which will later be used for the backward pass.
class ABM:
    def __init__(self, _nn, _t_data):
        self.nn = _nn
        self.iteration_n = _t_data.iteration_n
        self.target_value = _t_data.target_value

    def run(self):
        for jj in range(self.iteration_n):
            xx = self.generate_input()
            self.states.append(xx)  # store inputs
            ys = nn.forward(xx)
            action = self.draw(ys)
            if action == 0:
                self.agent_count += 1
        loss = self.calculateReward()
        return loss

    def generate_input(self):
        return torch.ones((1), requires_grad=True)
--some other attributes--
When the run is over, the error is calculated as error = (target_value - agent_count) / target_value, which is a value between -1 and 1.
In order to train the model, the error is applied to the probability of the first output node of the NN. This is meant to correct the NN so that it predicts the right probability for the first output. The code is:
class ABM:
    def calculateReward(self):
        error = (self.target_value - self.agent_count) / self.target_value
        reward = torch.tensor((-error), requires_grad=True)
        # since all states are the same, we just choose the first one
        state = self.states[0]
        ys = nn.forward(state)
        actionProb = ys[0]
        action_reward = actionProb * reward
        return action_reward
--some other members--
The two parameters iteration_n and target_value used in the ABM are defined in the training data class as:
class Train:
    target_value = 0
    iteration_n = 0

    def __init__(self, tt, tv):
        self.iteration_n = tt
        self.target_value = tv
The different parts of the code are tied together as follows:
#### start optimization ####
nn = Net(1, 2)
optimizer = optim.Adam(nn.parameters(), lr=0.01)
# create training data values
training_items = []
training_items.append(Train(1000, 800))
training_items.append(Train(500, 200))
error_record = []
for ii in range(100):
    print("############ start iteration #%d ################" % ii)
    for t_item in training_items:
        model = ABM(nn, t_item)
        loss = model.run()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        error_record.append(loss.item())
Now let's present the problem.
If I only define one training item, e.g. Train(1000, 800) (iteration number 1000, target value 800), the NN is optimized as expected.
However, when I define two training items, the average error declines to zero, but the error on each individual training item stays high.
Does anyone have an idea how to solve this issue?
I have omitted some parts of the code here to make it more readable. The full running code is available on minimal ABM
I want to calculate the information gain on the 20_newsgroup data set.
I am using the code here (I also put a copy of the code at the bottom of the question).
As you see, the input to the algorithm is X, y.
My confusion is that X is going to be a matrix with documents in rows and features as columns (according to 20_newsgroup it is 11314 x 1000, in case I only consider 1000 features).
But according to the concept of information gain, it should calculate information gain for each feature.
(So I was expecting the code to loop through each feature, so that the input to the function would be a matrix where rows are features and columns are classes.)
But X is not the features here; X stands for the documents, and I cannot see the part in the code that takes care of this (I mean considering each document, and then going through each feature of that document; like looping through the rows but at the same time looping through the columns, as the features are stored in the columns).
I have read this and this and many similar questions, but they are not clear in terms of the input matrix shape.
This is the code for reading 20_newsgroup:
newsgroup_train = fetch_20newsgroups(subset='train')
X,y = newsgroup_train.data,newsgroup_train.target
cv = CountVectorizer(max_df=0.99,min_df=0.001, max_features=1000,stop_words='english',lowercase=True,analyzer='word')
X_vec = cv.fit_transform(X)
X_vec.shape is (11314, 1000), so the rows are not the features of the 20_newsgroup data set. Am I calculating information gain in an incorrect way?
This is the code for Information gain:
def information_gain(X, y):
    def _calIg():
        entropy_x_set = 0
        entropy_x_not_set = 0
        for c in classCnt:
            probs = classCnt[c] / float(featureTot)
            entropy_x_set = entropy_x_set - probs * np.log(probs)
            probs = (classTotCnt[c] - classCnt[c]) / float(tot - featureTot)
            entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        for c in classTotCnt:
            if c not in classCnt:
                probs = classTotCnt[c] / float(tot - featureTot)
                entropy_x_not_set = entropy_x_not_set - probs * np.log(probs)
        return entropy_before - ((featureTot / float(tot)) * entropy_x_set
                                 + ((tot - featureTot) / float(tot)) * entropy_x_not_set)

    tot = X.shape[0]
    classTotCnt = {}
    entropy_before = 0
    for i in y:
        if i not in classTotCnt:
            classTotCnt[i] = 1
        else:
            classTotCnt[i] = classTotCnt[i] + 1
    for c in classTotCnt:
        probs = classTotCnt[c] / float(tot)
        entropy_before = entropy_before - probs * np.log(probs)

    nz = X.T.nonzero()
    pre = 0
    classCnt = {}
    featureTot = 0
    information_gain = []
    for i in range(0, len(nz[0])):
        if (i != 0 and nz[0][i] != pre):
            for notappear in range(pre+1, nz[0][i]):
                information_gain.append(0)
            ig = _calIg()
            information_gain.append(ig)
            pre = nz[0][i]
            classCnt = {}
            featureTot = 0
        featureTot = featureTot + 1
        yclass = y[nz[1][i]]
        if yclass not in classCnt:
            classCnt[yclass] = 1
        else:
            classCnt[yclass] = classCnt[yclass] + 1
    ig = _calIg()
    information_gain.append(ig)
    return np.asarray(information_gain)
Well, after going through the code in detail, I learned more about X.T.nonzero().
It is indeed correct that information gain needs to loop through features.
It is also correct that the matrix scikit-learn gives us here is document-by-feature.
But:
the code uses X.T.nonzero(), which transposes the matrix and returns the indices of all nonzero values as arrays, and then the next line loops through the length of that array, range(0, len(X.T.nonzero()[0])).
Overall, X.T.nonzero()[0] is returning all the nonzero features to us :)
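To make this concrete, here is a small sketch with a made-up dense toy matrix (not the actual CountVectorizer output, which is sparse), just to illustrate what X.T.nonzero() returns:
import numpy as np

# Toy doc-by-feature matrix: 3 documents (rows) x 4 features (columns).
X = np.array([[1, 0, 2, 0],
              [0, 3, 0, 0],
              [4, 0, 0, 5]])

nz = X.T.nonzero()
# After the transpose, rows are features and columns are documents, so
# nz[0] holds feature indices and nz[1] the documents they occur in.
print(nz[0])  # [0 0 1 2 3] -> feature ids of the nonzero entries, grouped per feature
print(nz[1])  # [0 2 1 0 2] -> the documents that contain those entries
So iterating over nz[0] effectively walks over the features, which is why the loop in the code can accumulate classCnt and featureTot per feature.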
I am currently trying to adapt my TensorFlow classifier, which is able to tag a sequence of words as positive or negative, to handle much longer sequences without retraining. My model is an RNN with a max sequence length of 210. One input is one word (300 dim); I vectorised the words with Google's word2vec, so I can feed a sequence of at most 210 words. Now my question is: how can I change the max sequence length to, for example, 3000, for classifying movie reviews?
My working model with a fixed max sequence length of 210 (tf_version: 1.1.0):
n_chunks = 210
chunk_size = 300
x = tf.placeholder("float", [None, n_chunks, chunk_size])
y = tf.placeholder("float", None)
seq_length = tf.placeholder("int64", None)
with tf.variable_scope("rnn1"):
    lstm_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                        state_is_tuple=True)
    lstm_cell = tf.contrib.rnn.DropoutWrapper(lstm_cell,
                                              input_keep_prob=0.8)
    outputs, _ = tf.nn.dynamic_rnn(lstm_cell, x, dtype=tf.float32,
                                   sequence_length=self.seq_length)
fc = tf.contrib.layers.fully_connected(outputs, 1000,
                                       activation_fn=tf.nn.relu)
output = tf.contrib.layers.flatten(fc)
#*1
logits = tf.contrib.layers.fully_connected(output, self.n_classes,
                                           activation_fn=None)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits
                      (logits=logits, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)
...
#train
#train_x padded to fit (batch_size*n_chunks*chunk_size)
sess.run([optimizer, cost], feed_dict={x: train_x, y: train_y,
                                       seq_length: seq_length})
#predict:
...
pred = tf.nn.softmax(logits)
pred = sess.run(pred, feed_dict={x: word_vecs, seq_length: sq_l})
Modifications I have already tried:
1. Replacing n_chunks with None and simply feeding the data in:
x = tf.placeholder(tf.float32, [None,None,300])
#model fails to build
#ValueError: The last dimension of the inputs to `Dense` should be defined.
#Found `None`.
# at *1
...
#all entries in word_vecs still have the same length, for example
#3000 (batch_size*3000(!= n_chunks)*300)
pred = tf.nn.softmax(logits)
pred = sess.run(pred,feed_dict={x:word_vecs, seq_length:sq_l})
2. Changing x and then restoring the old model:
x = tf.placeholder(tf.float32, [None, n_chunks*10, chunk_size])
...
saver = tf.train.Saver(tf.all_variables(), reshape=True)
saver.restore(sess,"...")
#fails as well:
#InvalidArgumentError (see above for traceback): Input to reshape is a
#tensor with 420000 values, but the requested shape has 840000
#[[Node: save/Reshape_5 = Reshape[T=DT_FLOAT, Tshape=DT_INT32,
#_device="/job:localhost/replica:0/task:0/cpu:0"](save/RestoreV2_5,
#save/Reshape_5/shape)]]
# run prediction
If it is possible, could you please provide me with a working example, or explain why it isn't?
I am just wondering: why don't you simply assign n_chunks a value of 3000?
In your first attempt, you cannot use two Nones, since TF cannot tell what size to use for each of those dimensions. The first dimension is set to None because it depends on the batch size. In your second attempt, you only change one place, and the other places where n_chunks is used may conflict with the x placeholder.
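For illustration, here is a minimal sketch of that suggestion, assuming the rest of the graph stays as in the question; the padding helper pad_batch and the variable names word_vecs / sq_l are assumptions for the example, not code from the original model:
import numpy as np
import tensorflow as tf

n_chunks = 3000   # new, larger maximum sequence length
chunk_size = 300

# Same placeholder shapes as before, only with the larger n_chunks.
x = tf.placeholder(tf.float32, [None, n_chunks, chunk_size])
seq_length = tf.placeholder(tf.int64, [None])

def pad_batch(word_vecs, n_chunks, chunk_size):
    """Pad each sequence with zero vectors up to n_chunks time steps."""
    batch = np.zeros((len(word_vecs), n_chunks, chunk_size), dtype=np.float32)
    lengths = []
    for i, seq in enumerate(word_vecs):
        length = min(len(seq), n_chunks)
        batch[i, :length, :] = seq[:length]
        lengths.append(length)
    return batch, np.array(lengths, dtype=np.int64)

# Usage (assuming pred and a session exist as in the question):
# batch, lengths = pad_batch(word_vecs, n_chunks, chunk_size)
# out = sess.run(pred, feed_dict={x: batch, seq_length: lengths})
Since dynamic_rnn already receives sequence_length, the padded time steps beyond each real length should not affect the recurrent computation.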
Here is my code:
My target is a vector with shape (N,), which is a vector with only binary numbers.
However, I'm running into the following error:
/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/Lai/Dropbox/PersonalProject/MachineLearningForSports/models/NeuralNetwork.py
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
Traceback (most recent call last):
File "/Users/Lai/Dropbox/PersonalProject/MachineLearningForSports/models/NeuralNetwork.py", line 102, in <module>
_, c = sess.run([optimizer,cost],feed_dict = {x:batch_x,y:batch_y})
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 943, in _run
% (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (100,) for Tensor 'Placeholder_1:0', which has shape '(?, 2)'
Since my batch size is 100, I believe the error occurs when comparing my target to my predictions. The tf.placeholder seems to make the prediction with shape N*2, although I'm not sure. Any help? Thanks.
import tensorflow as tf
import DataPrepare as dp
import numpy as np

def random_init(x, num_feature_1st, num_feature_2nd, num_class):
    W1 = tf.Variable(tf.random_normal([num_feature_1st, num_feature_2nd]))
    bias1 = tf.Variable(tf.random_normal([num_feature_2nd]))
    W2 = tf.Variable(tf.random_normal([num_feature_2nd, num_class]))
    bias2 = tf.Variable(tf.random_normal([num_class]))
    return [W1, bias1, W2, bias2]

def softsign(z):
    """The softsign function, applied elementwise."""
    return z / (1. + np.abs(z))

def multilayer_perceptron(x, num_feature_1st, num_feature_2nd, num_class):
    params = random_init(x, num_feature_1st, num_feature_2nd, num_class)
    layer_1 = tf.add(tf.matmul(x, params[0]), params[1])
    layer_1 = softsign(layer_1)
    #layer_1 = tf.nn.relu(layer_1)
    layer_2 = tf.add(tf.matmul(layer_1, params[2]), params[3])
    #output = tf.nn.softmax(layer_2)
    output = tf.nn.sigmoid(layer_2)
    return output

def next_batch(num, dataX, dataY):
    idx = np.arange(0, len(dataX))
    np.random.shuffle(idx)
    idx = idx[0:num]
    dataX_shuffle = [dataX[i] for i in idx]
    dataY_shuffle = [dataY[i] for i in idx]
    dataX_shuffle = np.asarray(dataX_shuffle)
    dataY_shuffle = np.asarray(dataY_shuffle)
    return dataX_shuffle, dataY_shuffle

if __name__ == "__main__":
    #sess = tf.InteractiveSession()
    learning_rate = 0.001
    training_epochs = 10
    batch_size = 100
    display_step = 1
    num_feature_1st = 6
    num_feature_2nd = 500
    num_class = 2

    x = tf.placeholder('float', [None, 6])
    y = tf.placeholder('float', [None, 2])

    data = dp.dataPrepare(dp.datas, dp.path)
    trainX = data[0]
    testX = data[1]   # a matrix
    trainY = data[2]  # a vector with binary number
    testY = data[3]

    params = random_init(x, num_feature_1st, num_feature_2nd, num_class)

    # construct model
    pred = multilayer_perceptron(x, num_feature_1st, num_feature_2nd, num_class)
    cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(pred, y))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    init = tf.global_variables_initializer()

    with tf.Session() as sess:
        sess.run(init)
        #train
        for epoch in range(training_epochs):
            avg_cost = 0
            total_batch = int(len(trainX[:,0])/batch_size)
            for i in range(total_batch):
                batch_x, batch_y = next_batch(batch_size, trainX, trainY)
                _, c = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
                avg_cost += c/total_batch
            if epoch % display_step == 0:
                print("Epoch: ", "%04d" % (epoch+1), " cost= ", "{:.9f}".format(avg_cost))
        print("Optimization Finished!")
Whenever you execute a dynamic node from the computation graph - which is pretty much any node that is not an input - you need to specify all dependent variables. Think about it this way: if you had a mathematical function of the form
y = f(x) = Ax + b (for example)
and you wanted to evaluate that function, you would need to specify x as well. You need not, however, specify x if you wanted to evaluate (i.e. read) the value of A, since A is known (at least in this context).
Consequently, you can evaluate (by passing it to tf.Session.run(...)) the parameters of your network without specifying the inputs (A in the example above). You cannot, however, evaluate the output of your function without specifying the inputs (in the example, you need to specify x).
As for your code, the following line will thus not work:
print(sess.run(pred)), since you ask the session to evaluate a function without specifying its inputs.
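To illustrate the point, here is a minimal, self-contained sketch (not the poster's actual model) showing that a variable can be read directly, while anything that depends on a placeholder needs a feed_dict:
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 3])   # input: must be fed
A = tf.Variable(tf.ones([3, 1]))            # parameter: known to the session
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, A) + b                     # depends on x

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(A))                      # fine: A does not depend on x
    # print(sess.run(y))                    # would fail: x is not fed
    print(sess.run(y, feed_dict={x: np.ones((2, 3), dtype=np.float32)}))  # works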