Classification of word2vec using weka - machine-learning

I have trained a word2vec model on a corpus of around 70k sentences. Each sentence contains a unique keyword such as 'abc-2011-100', followed by features that describe it. I now need to assign every abc ID to a category: for example, abc-2011-100 belongs to abc_category_1, abc-2999-0000 belongs to abc_category_20, and so on. A category can have multiple abc IDs assigned to it. I have around 70,000 unique abc IDs, of which 5,000 are already classified. I want to check my classification accuracy on those 5,000 IDs, using 80% as training data and 20% for testing. I can describe every abc ID as a d-dimensional vector. Given this, how can I use Weka to run this classification task? Any input would be highly appreciated.

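First, read in your ARFF file (the Instances(Reader) constructor below expects ARFF; a CSV file needs a converter such as weka.core.converters.CSVLoader instead):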
import weka.core.Instances;
import java.io.BufferedReader;
import java.io.FileReader;
...
BufferedReader reader = new BufferedReader(new FileReader("yourData.arff"));
Instances data = new Instances(reader);
reader.close();
// setting class attribute
data.setClassIndex(data.numAttributes() - 1); // This is category for you
Then instantiate and train a classifier
import weka.classifiers.trees.J48;
...
String[] options = new String[1];
options[0] = "-U"; // unpruned tree
J48 tree = new J48(); // new instance of tree
tree.setOptions(options); // set the options
tree.buildClassifier(data); // build classifier
Run Cross-validation to evaluate the learner
import weka.classifiers.Evaluation;
import java.util.Random;
...
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(tree, data, 10, new Random(1));
Or do training and testing on separate sets
import weka.core.Instances;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
...
/* train and test are of type Instances (see above) */
// train classifier
Classifier cls = new J48();
cls.buildClassifier(train);
// evaluate classifier and print some statistics
Evaluation eval = new Evaluation(train);
eval.evaluateModel(cls, test);
System.out.println(eval.toSummaryString("\nResults\n======\n", false));
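The answer above assumes the vectors already sit in an ARFF file. As a minimal sketch of how that file could be produced, the Python snippet below writes a train/test pair with an 80/20 split, placing the category last so that data.setClassIndex(data.numAttributes() - 1) picks it up. The helper write_arff, the attribute names dim_0, dim_1, ... and the random toy vectors are all hypothetical; in practice the vectors would come from the trained word2vec model.

```python
import random
import tempfile
from pathlib import Path


def write_arff(path, relation, rows, categories):
    """Write (vector, category) pairs as an ARFF file with the class last."""
    dim = len(rows[0][0])
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute dim_{i} numeric" for i in range(dim)]
    # Nominal class attribute; train and test declare the same value set.
    lines.append("@attribute category {" + ",".join(categories) + "}")
    lines += ["", "@data"]
    for vector, category in rows:
        lines.append(",".join(f"{v:.6f}" for v in vector) + "," + category)
    Path(path).write_text("\n".join(lines) + "\n")


# Hypothetical stand-in for the labeled IDs: 10 random 4-dimensional vectors.
rng = random.Random(42)
categories = ["abc_category_1", "abc_category_2"]
labeled = [
    ([rng.uniform(-1, 1) for _ in range(4)], categories[i % 2])
    for i in range(10)
]

rng.shuffle(labeled)
cut = int(0.8 * len(labeled))  # 80% train, 20% test
out_dir = Path(tempfile.mkdtemp())
train_path = out_dir / "train.arff"
test_path = out_dir / "test.arff"
write_arff(train_path, "abc_train", labeled[:cut], categories)
write_arff(test_path, "abc_test", labeled[cut:], categories)
print(train_path.read_text())
```

The abc ID itself is deliberately left out of the attributes, since a unique identifier carries no signal for the classifier. Both files list every category in the header, which keeps the train and test headers compatible when Weka evaluates one against the other.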

Related

How to load unlabelled data for sentiment classification after training SVM model?

I am trying to do sentiment classification and I used sklearn SVM model. I used the labeled data to train the model and got 89% accuracy. Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?
I used python 3.7. Below is the code.
import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)
train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics
clf = Pipeline([
    ('vectorizer', CountVectorizer(analyzer="word",
                                   tokenizer=word_tokenize,
                                   preprocessor=lambda text: text.replace("<br />", " "),
                                   max_features=None)),
    ('classifier', LinearSVC())
])
clf.fit(train_x, train_y)
pred_y = clf.predict(test_x)
print("Accuracy : ", metrics.accuracy_score(test_y, pred_y))
print("Precision : ", metrics.precision_score(test_y, pred_y))
print("Recall : ", metrics.recall_score(test_y, pred_y))
When I run this code, I get the output:
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
Accuracy : 0.8977272727272727
Precision : 0.8604651162790697
Recall : 0.925
What is the meaning of ConvergenceWarning?
Thanks in Advance!
What is the meaning of ConvergenceWarning?
As Pavel already mentioned, ConvergenceWarning means that max_iter was hit before the solver converged. You can suppress the warning as described here: How to disable ConvergenceWarning using sklearn?
Now I want to use the model to predict the sentiment of unlabeled
data. How can I do that?
You do it with the same call: pred_y = clf.predict(test_x). The only things you adjust are the name pred_y (your free choice) and the argument test_x, which should be your new, unseen data. It has to be in the same form as the train_x and test_x you used before (here, raw article texts), so that the vectorizer produces the same features.
In your case as you are doing:
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
you are forming a list of (article, sentiment) tuples,
then shuffling it and unzipping the first 350 rows:
train_x, train_y = zip(*sentiment_data[:350])
Here your train_x is the column data['Articles'], so all you have to do with new data is:
new_data = pd.read_csv("new_data.csv", header=0)
new_y = clf.predict(new_data['Articles'])
how to see whether it is classified as positive or negative?
You can then inspect the predictions (pred_y, or new_y above): each entry will be either 1 or 0. Normally 0 means negative, but it depends on how your dataset is set up.
Check out this site about model persistence. Then you just load the model and call its predict method, which returns the predicted label. If you used any encoder (LabelEncoder, OneHotEncoder), you need to dump and load it separately.
If I were you, I'd rather take a fully data-driven approach and use a pretrained embedder. It also works for dozens of languages out of the box, which is quite neat.
There's LASER from Facebook. There's also a PyPI package, though unofficial. It works just fine.
Nowadays there are a lot of pretrained models, so it shouldn't be that hard to reach near state-of-the-art scores.
Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?
Basically, you prepare the unlabeled data in the same way as train_x or test_x was generated. For this pipeline that is a sequence of n_samples raw article strings, which you then pass to clf.predict to obtain predictions. clf.predict outputs the predicted class for each sample. In your case 0 is probably negative and 1 positive, but it's hard to tell without the dataset.
What is the meaning of ConvergenceWarning?
The LinearSVC model is optimized with an iterative algorithm. The max_iter argument (1000 by default) controls the maximum number of iterations. If the stopping criterion isn't met within that budget, you get a ConvergenceWarning. It shouldn't bother you much, as long as you have acceptable performance in terms of accuracy or other metrics.
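To tie the answers above together, here is a minimal sketch that raises max_iter, persists the fitted pipeline, reloads it and labels new texts. The toy sentences and the value max_iter=10000 are assumptions for illustration, not tuned settings. The default CountVectorizer is used because a lambda preprocessor like the one in the question cannot be serialized by the standard pickle module.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical labeled data: 1 = positive, 0 = negative.
train_texts = [
    "great movie, loved it",
    "what a wonderful story",
    "terrible plot and awful acting",
    "boring and bad",
    "loved the wonderful acting",
    "awful, boring movie",
]
train_labels = [1, 1, 0, 0, 1, 0]

clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    # A larger iteration budget gives liblinear more room to converge.
    ("classifier", LinearSVC(max_iter=10000)),
])
clf.fit(train_texts, train_labels)

# Persist the whole pipeline so the vectorizer travels with the model.
restored = pickle.loads(pickle.dumps(clf))

# Unlabeled data is passed in the same form as the training texts.
unlabeled_texts = ["a wonderful movie", "bad and boring"]
predicted = restored.predict(unlabeled_texts)
for text, label in zip(unlabeled_texts, predicted):
    print(label, text)
```

Each printed label uses the same encoding as train_labels, so what 0 and 1 mean is decided entirely by how the training column was coded.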

Does the number of classifiers on stacking classifier have to be equal to the number of columns of my training/testing dataset?

I'm trying to solve a binary classification task. The training data set contains 9 features, and after my feature engineering I ended up with 14 features. I want to use a stacking classifier approach with mlxtend.classifier.StackingClassifier, using 4 different classifiers, but when trying to predict on the test data set I got the error: ValueError: query data dimension must match training data dimension
%%time
models=[KNeighborsClassifier(weights='distance'),
GaussianNB(),SGDClassifier(loss='hinge'),XGBClassifier()]
calibrated_models=Calibrated_classifier(models,return_names=False)
meta=LogisticRegression()
stacker=StackingCVClassifier(classifiers=calibrated_models,meta_classifier=meta,use_probas=True).fit(X.values,y.values)
Remark: In my code I just wrote a function that returns a list of calibrated classifiers for StackingCVClassifier. I have checked that this is not causing the error.
Remark 2: I had already tried to build a stacker from scratch, with the same results, so I had thought something was wrong with my own stacker:
import pandas as pd
from sklearn.linear_model import LogisticRegression

def StackingClassifier(X, y, models, stacker=LogisticRegression(), return_data=True):
    names, ls = [], []
    predictions = pd.DataFrame()
    # derive a column name from each model's class name
    for model in models:
        names.append(str(model)[:str(model).find('(')])
    # fit each base model and keep its positive-class probability
    for i, model in enumerate(models):
        model.fit(X, y)
        ls = model.predict_proba(X)[:, 1]
        predictions[names[i]] = ls
    if return_data:
        return predictions
    else:
        return stacker.fit(predictions, y)
Could you please help me to understand the correct usage of a stacking classifiers?
EDIT:
This is my code for the calibrated classifiers. The function takes a list of n classifiers, applies sklearn's CalibratedClassifierCV to each one, and returns a list of n calibrated classifiers. There is an option to return a zip of (name, classifier) pairs, since this function is mainly intended to be used along with sklearn's VotingClassifier.
from sklearn.calibration import CalibratedClassifierCV

def Calibrated_classifier(models, method='sigmoid', return_names=True):
    calibrated, names = [], []
    for model in models:
        names.append(str(model)[:str(model).find('(')])
    for model in models:
        # note: base_estimator was renamed to estimator in scikit-learn 1.2
        clf = CalibratedClassifierCV(base_estimator=model, method=method)
        calibrated.append(clf)
    if return_names:
        return zip(names, calibrated)
    else:
        return calibrated
I have tried your code with the Iris dataset and it works fine. I think the problem is with the dimensions of your test data and not with the calibration.
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.classifier import StackingCVClassifier
from sklearn import datasets
X, y = datasets.load_iris(return_X_y=True)
models=[KNeighborsClassifier(weights='distance'),
SGDClassifier(loss='hinge')]
calibrated_models=Calibrated_classifier(models,return_names=False)
meta=LogisticRegression( multi_class='ovr')
stacker = StackingCVClassifier(classifiers=calibrated_models,
meta_classifier=meta,use_probas=True,cv=3).fit(X,y)
Prediction
stacker.predict([X[0]])
#array([0])
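To make the point about dimensions concrete, here is a minimal sketch under a few assumptions: it swaps in scikit-learn's own sklearn.ensemble.StackingClassifier rather than mlxtend, uses synthetic data with 14 columns, and keeps only two base classifiers. It shows that the number of base classifiers is independent of the number of columns, and that the error appears when the data passed to predict has a different number of columns than the data passed to fit (for example the 9 raw features instead of the 14 engineered ones).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the engineered data set: 14 feature columns.
X, y = make_classification(n_samples=300, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

stacker = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(weights="distance")),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
    cv=3,
)
stacker.fit(X_train, y_train)

# Test data with the same 14 columns works.
predictions = stacker.predict(X_test)
# The meta-classifier sees columns built from base-model outputs, not the 14 inputs.
meta_features = stacker.transform(X_test)
print(predictions.shape, meta_features.shape)

# Test data with only 9 columns fails, whatever the number of classifiers.
try:
    stacker.predict(X_test[:, :9])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__)
```

If the real test set raises this kind of error, the likely fix is to apply exactly the same feature engineering to the test set before calling predict.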

can we save a partially trained Machine Learning model, reload it again and train from the point it was saved?

I want to know whether there is any way to save a partially trained scikit-learn machine learning model and reload it later to continue training from the point where it was saved.
For scikit-learn models applied to sentiment analysis, I suspect you need to save two important things: 1) your model, 2) your vectorizer.
Remember that after training your model, each text is represented by a vector of length N, where N is determined by the total number of words in your vocabulary.
Below is a piece of my test model and test vectorizer, saved in order to be used later.
SAVING THE MODEL
import pickle
pickle.dump(vectorizer, open("model5vectorizer.pickle", "wb"))
pickle.dump(classifier_fitted, open("model5.pickle", "wb"))
LOADING THE MODEL IN A NEW SCRIPT (.py)
import pickle
model = pickle.load(open("model5.pickle", "rb"))
vectorizer = pickle.load(open("model5vectorizer.pickle", "rb"))
TEST YOUR MODEL
sentence_test = ["Results by Andutta et al (2013), were completely wrong and unrealistic."]
USING THE VECTORIZER (model5vectorizer.pickle) !!
sentence_test_data = vectorizer.transform(sentence_test)
print("### sentence_test ###")
print(sentence_test)
print("### sentence_test_data ###")
print(sentence_test_data)
# OBS-1: VECTOR HERE WILL HAVE SAME LENGTH AS BEFORE :)
# OBS-2: If you load the default vectorizer or a different one, then you may see the following problems
# sklearn.exceptions.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.
# # ValueError: X has 8 features per sample; expecting 11
result1 = model.predict(sentence_test_data) # using saved vectorizer from calibrated model
print("### RESULT ###")
print(result1)
Hope that helps.
Regards,
Andutta
When a data set is fitted to a scikit-learn machine learning model, the model is trained and ready to be used for prediction. If you train a model with, say, 100 samples, use it, and then go back and call fit on it with another 50 samples, you will not make it better; you will rebuild it from those 50 samples.
If your purpose is to build a model that becomes more powerful as it sees more samples, you are thinking of an online (real-time) setting, such as a mobile robot mapping an environment with a Kalman filter.
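That said, some scikit-learn estimators do expose a partial_fit method for incremental training, and SGDClassifier is one of them. The sketch below, on assumed random toy data, trains on a first batch, pickles the model, reloads it and continues training on a second batch. It illustrates the mechanism only; it does not apply to estimators without partial_fit.

```python
import pickle

import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical data: 150 samples, 4 features, a simple linear rule as label.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
# The first call must list every class that can ever appear.
clf.partial_fit(X[:100], y[:100], classes=np.array([0, 1]))

# Save the partially trained model (here to bytes; a file works the same way).
blob = pickle.dumps(clf)

# Later, possibly in another script: reload and keep training.
restored = pickle.loads(blob)
coef_before = restored.coef_.copy()
restored.partial_fit(X[100:], y[100:])

print("coefficients changed:", not np.allclose(coef_before, restored.coef_))
```

Calling fit on the restored model instead of partial_fit would discard the earlier training, which is the behaviour described above.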

How to reset specific layer weights for transfer learning?

I am looking for a way to re-initialize a layer's weights in an existing pre-trained Keras model.
I am using Python with Keras and need to use transfer learning.
I use the following code to load the pre-trained Keras models:
from keras.applications import vgg16, inception_v3, resnet50, mobilenet
vgg_model = vgg16.VGG16(weights='imagenet')
I read that when using a dataset that is very different than the original dataset it might be beneficial to create new layers over the lower level features that we have in the trained net.
I found how to allow fine-tuning of parameters, and now I am looking for a way to reset a selected layer so that it can be re-trained. I know I can create a new model, use layer n-1 as input and add layer n to it, but I am looking for a way to reset the parameters of an existing layer in an existing model.
For whatever reason you may want to re-initialize the weights of a single layer k, here is a general way to do it:
from keras.applications import vgg16
from keras import backend as K
vgg_model = vgg16.VGG16(weights='imagenet')
sess = K.get_session()
initial_weights = vgg_model.get_weights()
from keras.initializers import glorot_uniform # Or your initializer of choice
k = 30 # index into the flat list from get_weights() (kernels and biases are separate entries), not a layer number
new_weights = [glorot_uniform()(initial_weights[i].shape).eval(session=sess) if i==k else initial_weights[i] for i in range(len(initial_weights))]
vgg_model.set_weights(new_weights)
You can easily verify that initial_weights[k]==new_weights[k] returns an array of False, while initial_weights[i]==new_weights[i] for any other i returns an array of True.
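To show what that list comprehension does without needing a Keras session, here is a NumPy-only sketch. It is not the Keras API: the helpers glorot_uniform and reinitialize are hypothetical re-implementations, and the toy weight list merely imitates the flat kernel/bias layout that get_weights() returns. The Glorot (Xavier) uniform rule draws from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)).

```python
import numpy as np


def glorot_uniform(shape, rng):
    """Sample a kernel of `shape` (at least 2-D) from Glorot uniform."""
    receptive_field = int(np.prod(shape[:-2]))  # 1 for a dense kernel
    fan_in = shape[-2] * receptive_field
    fan_out = shape[-1] * receptive_field
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape)


def reinitialize(weights, k, rng):
    """Return a new weight list where only entry k is freshly sampled."""
    return [
        glorot_uniform(w.shape, rng) if i == k else w
        for i, w in enumerate(weights)
    ]


rng = np.random.default_rng(0)
# Toy stand-in for model.get_weights(): conv kernel, bias, dense kernel, bias.
weights = [np.ones((3, 3, 3, 8)), np.zeros(8), np.ones((8, 4)), np.zeros(4)]

k = 2  # position of the dense kernel in the flat list
new_weights = reinitialize(weights, k, rng)

for i, (old, new) in enumerate(zip(weights, new_weights)):
    print(i, old.shape, "changed" if not np.array_equal(old, new) else "kept")
```

Because kernels and biases occupy separate positions, resetting a whole layer usually means re-sampling the kernel entry and zeroing the bias entry that follows it.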

Combining classifications for multiple observations of a single sample in scikit-learn

Let's say I have multiple observations of each sample to be classified. Examples of problems like this are:
Multiple patches of a painting, where you're trying to classify the style
Multiple windows of a signal, where you're trying to classify the signal
What is the most pythonic way to combine the answers into a single one?
p.s.: I don't want ensemble -- to combine the answers of multiple models taking as input a single sample. I want to combine the answers of a single model over multiple observations of a single sample.
You do not want ensemble, but you can mimic best practices from ensembles. There are two basic ways to aggregate predictions:
Arithmetic Average, if your model does regression or probabilistic classification.
Mode, if your model does straightforward classification.
Of course, you can use any other summary statistic for aggregation.
The following code implements this idea with pandas:
import numpy as np
import pandas as pd
import sklearn.tree
object_ids = [1,1,1,2,2,2,3,3,3,3]
x = np.arange(10).reshape(10,1)
y = [0,0,0,1,0,1,1,0,1,1]
# regression
model = sklearn.tree.DecisionTreeRegressor().fit(x, y)
prediction = pd.Series(model.predict(x)).groupby(object_ids).mean()
# probabilistic_classification
model = sklearn.tree.DecisionTreeClassifier().fit(x, y)
prediction = pd.DataFrame(model.predict_proba(x)).groupby(object_ids).mean()
# 'crisp' classification
model = sklearn.tree.DecisionTreeClassifier().fit(x, y)
def mode(x):
    return x.value_counts().index[0]
prediction = pd.Series(model.predict(x)).groupby(object_ids).apply(mode)
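The two aggregation rules can disagree, which is worth seeing once. The sketch below uses hand-written probabilities (an assumption, standing in for predict_proba output) and turns the averaged probabilities into a single label per object with idxmax, next to the majority vote over per-observation labels.

```python
import numpy as np
import pandas as pd

# Hypothetical per-observation probabilities for two objects, three observations each.
object_ids = np.array([1, 1, 1, 2, 2, 2])
classes = np.array(["negative", "positive"])
proba = np.array([
    [0.90, 0.10],
    [0.40, 0.60],
    [0.45, 0.55],
    [0.20, 0.80],
    [0.30, 0.70],
    [0.60, 0.40],
])

# Soft aggregation: average the probabilities, then take the most probable class.
mean_proba = pd.DataFrame(proba, columns=classes).groupby(object_ids).mean()
soft_label = mean_proba.idxmax(axis=1)

# Hard aggregation: label every observation first, then take the mode.
hard_votes = pd.Series(classes[proba.argmax(axis=1)])
hard_label = hard_votes.groupby(object_ids).agg(lambda s: s.value_counts().index[0])

print(pd.DataFrame({"soft": soft_label, "hard": hard_label}))
```

For object 1 the single confident observation outweighs the two marginal ones under averaging, while the majority vote ignores confidence entirely. Which behaviour is preferable depends on how well calibrated the model's probabilities are.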
