How to get class names in classification? - machine-learning

I'm a beginner in ML, and I built an SVM model to classify some inputs.
I used pandas to read my dataset. The classification results are printed as indexes, each of which corresponds to the name of a label (class) in my dataset. How can I convert these indexes to their names (strings)?
For example, I have three classes: [Question, General, Info], but when I classify an input, the result is one of these numbers: [0, 1, 2].
I want to convert these numbers to the names of the classes I have.
Here is part of my code:
data = pandas.read_csv("classes.csv",encoding='utf-16' )
Train_X, Test_X, Train_Y, Test_Y = sklearn.model_selection.train_test_split(data['input'],data['Class'],test_size=0.3,random_state=None)
Test_Y and Train_Y are lists of numbers (classes), where each number refers to one class. How do I know what each number represents?

The first thing you need to know is that your model is working as expected. Most of the time, it'll output a probability for each label. So, if your model outputs something like [0.1, 0.1, 0.8], it means the sample you're classifying has an 80% probability of belonging to the label in position 2. If you pass the labels in the order you indicated in your question, that is, [question, general, info], this sample belongs to the class info. Note that the order matters here: it has to match the order of the labels you used when feeding the model in your code.
Therefore, to output a string instead of a number, take the number output by the model and look up the label in a list or dictionary that holds this relationship. Using a list as an example:
labels_str = ['question', 'general', 'info']
# preds is a np.array containing the probabilities
preds = model(some_sample)
# this function returns the position of the max value in the array
pos_pred = preds.argmax()
print("The label for this sample is {}".format(labels_str[pos_pred]))
Did you get the idea?
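If the model already returns class indexes rather than probabilities, the same lookup works with a dictionary. Here is a minimal sketch in plain Python; the mapping index_to_name and the sample predictions are made up for illustration, and the order is an assumption that has to match however the labels were encoded in your dataset:

```python
# Hypothetical mapping; it must match the encoding used when the labels were built.
index_to_name = {0: "Question", 1: "General", 2: "Info"}


def decode_predictions(predicted_indexes, mapping):
    """Translate integer class indexes into their class names."""
    return [mapping[int(i)] for i in predicted_indexes]


def argmax(scores):
    """Return the position of the largest score in a sequence."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Case 1: the model returned class indexes directly.
predicted = [2, 0, 1, 2]
names = decode_predictions(predicted, index_to_name)
print(names)

# Case 2: the model returned one probability per class.
probabilities = [0.1, 0.1, 0.8]
print(index_to_name[argmax(probabilities)])
```

A dictionary makes the relationship explicit, so a label that is missing from the mapping raises a KeyError instead of silently returning the wrong name.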

Related

Representing an array as a feature in ML training

I have a set of features x1,x2,x3,x4 where x1,x2,x3 are floats and x4 is an array of floats.
To give an example, say that I am trying to predict the price of a house. I could use the size of the house as an array (e.g. length, width, and height) along with other features like number of bedrooms, age of house, no of bathrooms etc.
This seems simple, but I am struggling with how to represent it.
Here is a similar sample based on heart attack prediction: https://colab.research.google.com/drive/1CQX2d0vkjlZKjX6wbG4ga6tRcI-zOMNA
I tried to add the array feature as a column at the end, using np.c_:
##################################-Check-########################
print("Before",X_s[:1])
X_s =np.c_[ X_s,np.random.rand(303,2)] # add a numpy array here as a feature
print("After",X_s[:1])
print("shape of X_s",X_s.shape)
print(X_s[:1])
dataset = tf.data.Dataset.from_tensor_slices((X_s, y_s))
But the problem is that the array is added as two extra columns at the end:
shape of X_s (303, 13)
shape of X_s (303, 15)
So if I have a feature array of, say, 330*300, the above approach will add 300 columns to the end, which is not what I want.
I am aware of CNNs, and one option is to model the whole problem as a CNN; that is, pad the other features as arrays too, create an n-dimensional tensor, and use a CNN.
Is there something simpler and better than these approaches?

How to update vocabulary of pre-trained bert model while doing my own training task?

I am working on a task of predicting a masked word using a BERT model. Unlike the usual setup, the answer needs to be chosen from specific options.
For instance:
sentence: "In my daily [MASKED], ..."
options: A.word1 B.word2 C.word3 D.word4
The predicted word will be chosen from the four given words.
I use Hugging Face's BertForMaskedLM for this task. The model gives me a probability matrix representing every word's probability of appearing at the [MASK] position, and I just need to compare the probabilities of the words in the options to select the answer.
# Predict all tokens
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)
#predicted_index = torch.argmax(predictions[0, masked_index]).item()
#predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
A = predictions[0, masked_pos][tokenizer.convert_tokens_to_ids([option1])]
B = predictions[0, masked_pos][tokenizer.convert_tokens_to_ids([option2])]
C = predictions[0, masked_pos][tokenizer.convert_tokens_to_ids([option3])]
D = predictions[0, masked_pos][tokenizer.convert_tokens_to_ids([option4])]
#And then select from ABCD
But the problem is:
If the options are not in the “bert-vocabulary.txt”, the above method is not going to work, since the output matrix does not give their probabilities. The same problem appears if an option is not a single word.
Should I update the vocabulary, and how do I do that? Or how can I train the model to add new words on top of the pre-training?

Model returning wrong predictions every single time for language translation

I have used an LSTM to design a model for language translation. But even with 10000 entries in my data set, the model is not able to translate properly.
I have converted every word to lowercase and have not removed any stop words. The training data is used as-is from the data set.
model = Sequential()
model.add(Embedding(src_vocab, n_units, input_length=src_timesteps,input_shape=trainX.shape,mask_zero=True))
model.add(LSTM(n_units))
model.add(RepeatVector(tar_timesteps))
model.add(LSTM(n_units, return_sequences=True))
model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
Ideally, this should have produced some translation, even if not the correct one, but most of the time it does not return a maximum probability for any of the tokenized words, and the string I get in the end is empty.
Here is the link for my kernel. Any help is appreciated.

Using sample_weights with fit_generator()

In an autoregressive continuous problem where zeros take up too much of the data, it is possible to treat the situation as a zero-inflated problem (i.e. ZIB). In other words, instead of fitting f(x), we want to fit g(x)*f(x), where f(x) is the function we want to approximate, i.e. y, and g(x) is a function that outputs a value between 0 and 1 depending on whether a value is zero or non-zero.
Currently, I have two models: one model that gives me g(x) and another model that fits g(x)*f(x).
The first model gives me a set of weights. This is where I need your help. I can use the sample_weights argument with model.fit(). As I work with a tremendous amount of data, I need to work with model.fit_generator(). However, fit_generator() does not have a sample_weights argument.
Is there a workaround to use sample_weights inside fit_generator()? Otherwise, how can I fit g(x)*f(x), knowing that I already have a trained model for g(x)?
You can provide sample weights as the third element of the tuple returned by the generator. From the Keras documentation on fit_generator:
generator: A generator or an instance of Sequence (keras.utils.Sequence) object in order to avoid duplicate data when using multiprocessing. The output of the generator must be either
a tuple (inputs, targets)
a tuple (inputs, targets, sample_weights).
Update: Here is a rough sketch of a generator that returns the input samples and targets as well as the sample weights obtained from model g(x):
def gen(args):
    while True:
        for i in range(num_batches):
            # get the i-th batch data
            inputs = ...
            targets = ...
            # get the sample weights
            weights = g.predict(inputs)
            yield inputs, targets, weights
model.fit_generator(gen(args), steps_per_epoch=num_batches, ...)
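To make the shape of the yielded tuple concrete, here is a self-contained sketch of such a generator that runs without Keras. The names batch_generator and fake_g_predict and the toy arrays are hypothetical; fake_g_predict only stands in for g.predict and is assumed to return one value per sample with shape (batch, 1), which is why the weights are flattened to one dimension before being yielded:

```python
import numpy as np


def batch_generator(x, y, weight_fn, batch_size):
    """Yield (inputs, targets, sample_weights) batches forever."""
    num_samples = len(x)
    while True:
        for start in range(0, num_samples, batch_size):
            inputs = x[start:start + batch_size]
            targets = y[start:start + batch_size]
            # One weight per sample: flatten a (batch, 1) output to (batch,).
            weights = np.asarray(weight_fn(inputs)).reshape(-1)
            yield inputs, targets, weights


# Hypothetical toy data: 5 samples with 2 features each.
x = np.arange(10, dtype=float).reshape(5, 2)
y = np.array([0.0, 1.5, 0.0, 2.0, 0.0])


def fake_g_predict(batch):
    """Stand-in for g.predict: returns an array of shape (batch, 1)."""
    return (batch.sum(axis=1, keepdims=True) > 4).astype(float)


gen_example = batch_generator(x, y, fake_g_predict, batch_size=2)
inputs, targets, weights = next(gen_example)
print(inputs.shape, targets.shape, weights.shape)
```

With a real model, weight_fn would be g.predict, and the generator would be passed to fit_generator together with steps_per_epoch, as in the call above.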

Gensim word embedding training with initial values

I have a dataset with documents separated into different years, and my objective is to train an embedding model for each year's data, while at the same time, the same word appearing in different years will have similar vector representations. Like this: for word 'compute', its vector in year 1 is
[0.22, 0.33, 0.20]
and in year 2 it's something around:
[0.20, 0.35, 0.18]
Is there a way to accomplish this? For example, train the model of year 2 with both initial values (if the word is trained already in year 1, modify its vector) and randomness (if this is a new word for the corpus).
I think the easiest solution is to save the embeddings after training on the first data set, then load the trained model and continue training for the second data set. This way you shouldn't expect the embeddings to drift away from the saved state much (unless your data sets are very different).
It would also make sense to create a single vocabulary from all documents: vocabulary words that aren't present in a particular document will get some random representation, but still it will be a working word2vec model.
Example from the documentation:
>>> model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
>>> model.save(fname)
>>> model = Word2Vec.load(fname) # continue training with the loaded model
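The idea of starting year 2 from year 1's vectors can also be shown without any library. The sketch below only covers the initialization step, not the training itself, and it is not gensim's API: the function init_vectors, the toy vocabulary, and the small uniform range used for new words are all assumptions made for illustration.

```python
import numpy as np


def init_vectors(vocab, previous, dim, seed=0):
    """Start each word from last year's vector if it exists, else from small random values."""
    rng = np.random.default_rng(seed)
    vectors = {}
    for word in vocab:
        if word in previous:
            # Known word: copy the vector trained on the earlier year.
            vectors[word] = np.array(previous[word], dtype=float)
        else:
            # New word: small random start (the range here is an assumption).
            vectors[word] = rng.uniform(-0.5 / dim, 0.5 / dim, size=dim)
    return vectors


year1 = {"compute": [0.22, 0.33, 0.20]}  # vectors saved after training on year 1
year2_vocab = ["compute", "cloud"]        # "cloud" first appears in year 2

start = init_vectors(year2_vocab, year1, dim=3)
print(start["compute"])
print(start["cloud"].shape)
```

Training on year 2 would then continue from these starting points, so words shared between the two years begin close to their earlier representation.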
