SpaCy-transformers regression output - machine-learning

I would like to have a regression output instead of the classification. For instance: instead of n classes I want a floating point output value from 0 to 1.
Here is the minimalistic example from the package github page:
import spacy
from spacy.util import minibatch
import random
import torch
is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
torch.set_default_tensor_type("torch.cuda.FloatTensor")
nlp = spacy.load("en_trf_bertbaseuncased_lg")
print(nlp.pipe_names) # ["sentencizer", "trf_wordpiecer", "trf_tok2vec"]
textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": True})
for label in ("POSITIVE", "NEGATIVE"):
textcat.add_label(label)
nlp.add_pipe(textcat)
optimizer = nlp.resume_training()
for i in range(10):
random.shuffle(TRAIN_DATA)
losses = {}
for batch in minibatch(TRAIN_DATA, size=8):
texts, cats = zip(*batch)
nlp.update(texts, cats, sgd=optimizer, losses=losses)
print(i, losses)
nlp.to_disk("/bert-textcat")
Is there an easy way to make trf_textcat work as a regressor? Or would it mean extending the library?

I have figured out a workaround: extract vector representations from the nlp pipeline as:
vector_repres = nlp('Test text').vector
After doing so for all the text entries, You end up with a fixed-dimensional representation of the texts. Assuming You have the continuous output values, feel free to use any estimator, including Neural Network with a linear output.
Note that the vector representation is an average of the vector embeddings of all the words in the text - it might be a sub-optimal solution for Your case.

Related

LightGBM predicts negative values

My LightGBM regressor model returns negative values.
For XGBoost there is objective='count:poisson' hyperparameter in order to prevent returning negative predicitons.
Is there any chance to do this ?
Github issue => https://github.com/microsoft/LightGBM/issues/5629
LightGBM also supports poisson regression. For example, consider the following Python code.
import lightgbm as lgb
import numpy as np
from matplotlib import pyplot
# random Poisson-distributed target and one informative feature
y = np.random.poisson(lam=15.0, size=1_000)
X = y + np.random.normal(loc=10.0, scale=2.0, size=(y.shape[0], ))
X = X.reshape(-1, 1)
# fit a Poisson regression model
reg = lgb.LGBMRegressor(
objective="poisson",
n_estimators=150,
min_data=1
)
reg.fit(X, y)
# get predictions
preds = reg.predict(X)
print("summary of predicted values")
print(f" * min: {round(np.min(preds), 3)}")
print(f" * max: {round(np.max(preds), 3)}")
# compare predicted distribution to the empirical one
bins = np.linspace(0, 30, 50)
pyplot.hist(y, bins, alpha=0.5, label='actual')
pyplot.hist(preds, bins, alpha=0.5, label='predicted')
pyplot.legend(loc='upper right')
pyplot.show()
This example uses Python 3.10 and lightgbm==3.3.3.
However... I don't recommend using Poisson regression just to achieve "no negative predictions". The Poisson loss function is intended to be used for cases where you believe your target is Poisson-distributed, e.g. it looks like counts of events observed over some regular interval like time or space.
Other options you might consider to try to achieve the behavior "never predict a negative number from LightGBM regression":
write a custom objective function in one of the interfaces that support it, like the R or Python package
post-process LightGBM's predictions, recoding negative values to 0
pre-process the target variable such that there are no negative values (e.g. dropping such observations, re-scaling, taking the absolute value)
LightGBM also facilitates an objective parameter which can be set to 'poisson'. Follow this link for more information.
An example for LGBMRegressor (scikit-learn API):
from lightgbm import LGBMRegressor
regressor = LGBMRegressor(objective='poisson')

How to load unlabelled data for sentiment classification after training SVM model?

I am trying to do sentiment classification and I used sklearn SVM model. I used the labeled data to train the model and got 89% accuracy. Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?
I used python 3.7. Below is the code.
import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)
train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics
clf = Pipeline([
('vectorizer', CountVectorizer(analyzer="word",
tokenizer=word_tokenize,
preprocessor=lambda text: text.replace("<br />", " "),
max_features=None)),
('classifier', LinearSVC())
])
clf.fit(train_x, train_y)
pred_y = clf.predict(test_x)
print("Accuracy : ", metrics.accuracy_score(test_y, pred_y))
print("Precision : ", metrics.precision_score(test_y, pred_y))
print("Recall : ", metrics.recall_score(test_y, pred_y))
When I run this code, I get the output:
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning)
Accuracy : 0.8977272727272727
Precision : 0.8604651162790697
Recall : 0.925
What is the meaning of ConvergenceWarning?
Thanks in Advance!
What is the meaning of ConvergenceWarning?
As Pavel already mention, ConvergenceWArning means that the max_iteris hitted, you can supress the warning here: How to disable ConvergenceWarning using sklearn?
Now I want to use the model to predict the sentiment of unlabeled
data. How can I do that?
You will do it with the command: pred_y = clf.predict(test_x), the only thing you will adjust is :pred_y (this is your free choice), and test_x, this should be your new unseen data, it has to have the same number of features as your data test_x and train_x.
In your case as you are doing:
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
You are forming a tuple: Check this out
then you are shuffling it and unzip the first 350 rows:
train_x, train_y = zip(*sentiment_data[:350])
Here you train_x is the column: data['Articles'], so all you have to do if you have new data:
new_ data = pd.read_csv("new_data.csv", header=0)
new_y = clf.predict(new_data['Articles'])
how to see whether it is classified as positive or negative?
You can run then: pred_yand there will be either a 1 or a 0 in your outcome. Normally 0 should be negativ, but it depends on your dataset-up
Check out this site about model's persistence. Then you just load it and call predict method. Model will return predicted label. If you used any encoder (LabelEncoder, OneHotEncoder), you need to dump and load it separately.
If I were you, I'd rather do full data-driven approach and use some pretrained embedder. It'll also work for dozens of languages out-of-the-box with is quite neat.
There's LASER from facebook. There's also pypi package, though unofficial. It works just fine.
Nowadays there's a lot of pretrained models, so it shouldn't be that hard to reach near-seminal scores.
Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?
Basically, you aggregate unlabeled data in same way as train_x or test_x is generated. Probably, it's 2D matrix of shape n_samples x 1, which you would then use in clf.predict to obtain predictions. clf.predict outputs most probable class. In your case 0 is negative and 1 is positive, but it's hard to tell without the dataset.
What is the meaning of ConvergenceWarning?
LinearSVC model is optimized using iterative algorithm. There is an argument max_iter (1000 by default) that controls maximum amount of iterations. If stopping criteria wasn't met during this process, you will get ConvergenceWarning. It shouldn't bother you much, as long as you have acceptable performance in terms of accuracy, or other metrics.

ROC AUC score for AutoEncoder and IsolationForest

I am a new in Machine Learning area & I am (trying to) implementing anomaly detection algorithms, one algorithm is Autoencoder implemented with help of keras from tensorflow library and the second one is IsolationForest implemented with help of sklearn library and I want to compare these algorithms with help of roc_auc_score ( function from Python), but I am not sure if I am doing it correct.
In documentation of roc_auc_score function I can see, that for input it should be like:
sklearn.metrics.roc_auc_score(y_true, y_score, average=’macro’, sample_weight=None, max_fpr=None
y_true :
True binary labels or binary label indicators.
y_score :
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers). For binary y_true, y_score is supposed to be the score of the class with greater label.
For AE I am computing roc_auc_score like this:
model.fit(...) # model from https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
pred = model.predict(x_test) # predict function from https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#predict
metric = np.mean(np.power(x_test - pred, 2), axis=1) #MSE
print(roc_auc_score(y_test, metric) # where y_test is true binary labels 0/1
For IsolationForest I am computing roc_auc_score like this:
model.fit(...) # model from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
metric = -(model.score_samples(x_test)) # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#sklearn.ensemble.IsolationForest.score_samples
print(roc_auc_score(y_test, metric) #where y_test is true binary labels 0/1
I am just curious if returned roc_auc_score from both implementations of AE and IsolationForest are comparable (I mean, if I am computing them in the correct way)? Especially in AE model, where I am putting MSE into the roc_auc_score (if not, what should be the input as y_score to this function?)
Comparing AE and IsolationForest in the context of anomaly dection using sklearn.metrics.roc_auc_score based on scores coming from AE MSE loss and IF decision_function() respectively is okay. Varying range of the y_score when switching classifier isn't an issue, since this range is taken into account for each classifier when computing the AUC.
To understand that AUC isn't range dependent, remember that you travel along the decision function values to obtain the ROC points. Rescaling the decision function values will only change the decision function thresholds accordingly, defining similar points of the ROC since the new thresholds will lead each to the same TPR and FPR as they did before the rescaling.
Couldn't find a convincing code line in sklearn.metrics.roc_auc_score's implementation, but you can easily observe this comparison in published code associated with a research paper. For example, in the Deep One-Class Classification paper's code (I'm not an author, I know the paper's code because I'm reproducing their results), AE MSE loss and IF decision_function() are the roc_auc_score inputs (whose outputs the paper is comparing):
AE roc_auc_score computation
Found in this script on github.
from sklearn.metrics import roc_auc_score
(...)
scores = torch.sum((outputs - inputs) ** 2, dim=tuple(range(1, outputs.dim())))
(...)
auc = roc_auc_score(labels, scores)
IsolationForest roc_auc_score computation
Found in this script on github.
from sklearn.metrics import roc_auc_score
(...)
scores = (-1.0) * self.isoForest.decision_function(X.astype(np.float32)) # compute anomaly score
y_pred = (self.isoForest.predict(X.astype(np.float32)) == -1) * 1 # get prediction
(...)
auc = roc_auc_score(y, scores.flatten())
Note: The two scripts come from two different repositories but are actually the source of a single paper's results. The authors only chose to create an extra repository for their PyTorch implementation of an AD method requiring a neural network.

Using pre-trained word2vec with LSTM for word generation

LSTM/RNN can be used for text generation.
This shows way to use pre-trained GloVe word embeddings for Keras model.
How to use pre-trained Word2Vec word embeddings with Keras LSTM
model? This post did help.
How to predict / generate next word when the model is provided with the sequence of words as its input?
Sample approach tried:
# Sample code to prepare word2vec word embeddings
import gensim
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
sentences = [[word for word in document.lower().split()] for document in documents]
word_model = gensim.models.Word2Vec(sentences, size=200, min_count = 1, window = 5)
# Code tried to prepare LSTM model for word generation
from keras.layers.recurrent import LSTM
from keras.layers.embeddings import Embedding
from keras.models import Model, Sequential
from keras.layers import Dense, Activation
embedding_layer = Embedding(input_dim=word_model.syn0.shape[0], output_dim=word_model.syn0.shape[1], weights=[word_model.syn0])
model = Sequential()
model.add(embedding_layer)
model.add(LSTM(word_model.syn0.shape[1]))
model.add(Dense(word_model.syn0.shape[0]))
model.add(Activation('softmax'))
model.compile(optimizer='sgd', loss='mse')
Sample code / psuedocode to train LSTM and predict will be appreciated.
I've created a gist with a simple generator that builds on top of your initial idea: it's an LSTM network wired to the pre-trained word2vec embeddings, trained to predict the next word in a sentence. The data is the list of abstracts from arXiv website.
I'll highlight the most important parts here.
Gensim Word2Vec
Your code is fine, except for the number of iterations to train it. The default iter=5 seems rather low. Besides, it's definitely not the bottleneck -- LSTM training takes much longer. iter=100 looks better.
word_model = gensim.models.Word2Vec(sentences, vector_size=100, min_count=1,
window=5, iter=100)
pretrained_weights = word_model.wv.syn0
vocab_size, emdedding_size = pretrained_weights.shape
print('Result embedding shape:', pretrained_weights.shape)
print('Checking similar words:')
for word in ['model', 'network', 'train', 'learn']:
most_similar = ', '.join('%s (%.2f)' % (similar, dist)
for similar, dist in word_model.most_similar(word)[:8])
print(' %s -> %s' % (word, most_similar))
def word2idx(word):
return word_model.wv.vocab[word].index
def idx2word(idx):
return word_model.wv.index2word[idx]
The result embedding matrix is saved into pretrained_weights array which has a shape (vocab_size, emdedding_size).
Keras model
Your code is almost correct, except for the loss function. Since the model predicts the next word, it's a classification task, hence the loss should be categorical_crossentropy or sparse_categorical_crossentropy. I've chosen the latter for efficiency reasons: this way it avoids one-hot encoding, which is pretty expensive for a big vocabulary.
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=emdedding_size,
weights=[pretrained_weights]))
model.add(LSTM(units=emdedding_size))
model.add(Dense(units=vocab_size))
model.add(Activation('softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
Note passing the pre-trained weights to weights.
Data preparation
In order to work with sparse_categorical_crossentropy loss, both sentences and labels must be word indices. Short sentences must be padded with zeros to the common length.
train_x = np.zeros([len(sentences), max_sentence_len], dtype=np.int32)
train_y = np.zeros([len(sentences)], dtype=np.int32)
for i, sentence in enumerate(sentences):
for t, word in enumerate(sentence[:-1]):
train_x[i, t] = word2idx(word)
train_y[i] = word2idx(sentence[-1])
Sample generation
This is pretty straight-forward: the model outputs the vector of probabilities, of which the next word is sampled and appended to the input. Note that the generated text would be better and more diverse if the next word is sampled, rather than picked as argmax. The temperature based random sampling I've used is described here.
def sample(preds, temperature=1.0):
if temperature <= 0:
return np.argmax(preds)
preds = np.asarray(preds).astype('float64')
preds = np.log(preds) / temperature
exp_preds = np.exp(preds)
preds = exp_preds / np.sum(exp_preds)
probas = np.random.multinomial(1, preds, 1)
return np.argmax(probas)
def generate_next(text, num_generated=10):
word_idxs = [word2idx(word) for word in text.lower().split()]
for i in range(num_generated):
prediction = model.predict(x=np.array(word_idxs))
idx = sample(prediction[-1], temperature=0.7)
word_idxs.append(idx)
return ' '.join(idx2word(idx) for idx in word_idxs)
Examples of generated text
deep convolutional... -> deep convolutional arithmetic initialization step unbiased effectiveness
simple and effective... -> simple and effective family of variables preventing compute automatically
a nonconvex... -> a nonconvex technique compared layer converges so independent onehidden markov
a... -> a function parameterization necessary both both intuitions with technique valpola utilizes
Doesn't make too much sense, but is able to produce sentences that look at least grammatically sound (sometimes).
The link to the complete runnable script.

Scikit learn - fit_transform on the test set

I am struggling to use Random Forest in Python with Scikit learn. My problem is that I use it for text classification (in 3 classes - positive/negative/neutral) and the features that I extract are mainly words/unigrams, so I need to convert these to numerical features. I found a way to do it with DictVectorizer's fit_transform:
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
rf = RandomForestClassifier(n_estimators = 100)
trainFeatures1 = vec.fit_transform(trainFeatures)
# Fit the training data to the training output and create the decision trees
rf = rf.fit(trainFeatures1.toarray(), LabelEncoder().fit_transform(trainLabels))
testFeatures1 = vec.fit_transform(testFeatures)
# Take the same decision trees and run on the test data
Output = rf.score(testFeatures1.toarray(), LabelEncoder().fit_transform(testLabels))
print "accuracy: " + str(Output)
My problem is that the fit_transform method is working on the train dataset, which contains around 8000 instances, but when I try to convert my test set to numerical features too, which is around 80000 instances, I get a memory error saying that:
testFeatures1 = vec.fit_transform(testFeatures)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 143, in fit_transform
return self.transform(X)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 251, in transform
Xa = np.zeros((len(X), len(vocab)), dtype=dtype)
MemoryError
What could possibly cause this and is there any workaround? Many thanks!
You are not supposed to do fit_transform on your test data, but only transform. Otherwise, you will get different vectorization than the one used during training.
For the memory issue, I recommend TfIdfVectorizer, which has numerous options of reducing the dimensionality (by removing rare unigrams etc.).
UPDATE
If the only problem is fitting test data, simply split it to small chunks. Instead of something like
x=vect.transform(test)
eval(x)
you can do
K=10
for i in range(K):
size=len(test)/K
x=vect.transform(test[ i*size : (i+1)*size ])
eval(x)
and record results/stats and analyze them afterwards.
in particular
predictions = []
K=10
for i in range(K):
size=len(test)/K
x=vect.transform(test[ i*size : (i+1)*size ])
predictions += rf.predict(x) # assuming it retuns a list of labels, otherwise - convert it to list
print accuracy_score( predictions, true_labels )

Resources