I trained an SVM text classifier with sklearn, using tf-idf (TfidfVectorizer) to extract the features.
Now I need to save the model and load it to predict unseen text. I will load the model in another file; what confuses me is how to extract the tf-idf features of the new text.
You need to save the model AND the tfidf transformer. You can either save them separately, or create a pipeline of the two and save the pipeline (this is the preferred option).
Example:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
import pickle
Tfidf = TfidfVectorizer()
LR = LogisticRegression()
pipe = Pipeline([("Tfidf", Tfidf), ("LR", LR)])
pipe.fit(X, y)  # X: iterable of raw training texts, y: their labels

with open('pipe.pickle', 'wb') as picklefile:
    pickle.dump(pipe, picklefile)
You can then load the whole pipeline; when you call predict, it will first apply the vectorizer and then pass the result to the model:
with open('pipe.pickle', 'rb') as picklefile:
    saved_pipe = pickle.load(picklefile)

saved_pipe.predict(X_test)
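If you would rather save them separately (the first option mentioned above), here is a minimal sketch; it assumes the vectorizer and classifier were fit on their own rather than through the pipeline, and new_texts is a hypothetical list of unseen documents:

# Sketch of the "save them separately" option (assumptions noted above).
X_tfidf = Tfidf.fit_transform(X)           # fit the vectorizer on the training texts
LR.fit(X_tfidf, y)                         # fit the classifier on the tf-idf features

with open('tfidf.pickle', 'wb') as f:
    pickle.dump(Tfidf, f)
with open('model.pickle', 'wb') as f:
    pickle.dump(LR, f)

# later, in the other file:
with open('tfidf.pickle', 'rb') as f:
    tfidf = pickle.load(f)
with open('model.pickle', 'rb') as f:
    model = pickle.load(f)

new_features = tfidf.transform(new_texts)  # transform only -- do NOT fit again
predictions = model.predict(new_features)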
I have a KNN model pickled together with a StandardScaler.
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
When I try to load the model and pass new values via StandardScaler().transform(), it gives me an error:
sklearn.exceptions.NotFittedError: This StandardScaler instance is not
fitted yet. Call 'fit' with appropriate arguments before using this estimator.
I'm trying to load values from a dictionary:
dic = {'a':1, 'b':32323, 'c':12}
sc = StandardScaler()
load = pickle.load(open('KNN.mod', 'rb'))
load.predict(sc.transform([[dic['a'], dic['b'], dic['c']]]))
As far as I understand from the error, I have to fit the new data to sc, but if I do so it gives me wrong predictions. I'm not sure if I'm overfitting or something; random forest and decision tree work fine with that data without sc, and logistic regression is semi OK.
You need to train and pickle the entire machine learning pipeline at the same time. This can be done with the Pipeline tool from sklearn. In your case it will look like:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
import pickle

pipeline = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])
pipeline.fit(X_train, y_train)

# save the ml pipeline
pickle.dump(pipeline, open('KNN_pipeline.pkl', 'wb'))

# load the ml pipeline and do prediction
pipeline = pickle.load(open('KNN_pipeline.pkl', 'rb'))
pipeline.predict(X_test)
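With the loaded pipeline, you can feed the raw values from your dictionary straight in; the pipeline scales them for you, so no separate StandardScaler call is needed (a small sketch reusing the dictionary from the question):

dic = {'a': 1, 'b': 32323, 'c': 12}
pipeline.predict([[dic['a'], dic['b'], dic['c']]])  # scaling happens inside the pipeline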
Let's say I have this Python code:
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from sklearn.decomposition import PCA
from sklearn import neighbors
selector = VarianceThreshold()
scaler = StandardScaler()
ros = RandomOverSampler()
pca = PCA()
clf = neighbors.KNeighborsClassifier(n_jobs=-1)
pipe = Pipeline(steps=[('selector', selector), ('scaler', scaler), ('sampler', ros), ('pca', pca), ('kNN', clf)])
pipe.fit(X_train,y_train)
preds = pipe.predict(X_test)
What this does is import four transformers and an estimator from scikit-learn. Then it fits these on the data, and lastly it predicts. If I understand correctly, the fit method applies the four transformers to the data and the predict method makes the final estimation (in our case using kNN). My question is this: for the scaler as well as PCA, the alterations that are done to the train data must also be applied to the test data. But in fit's parameters we don't give the test set, and as a result the test data won't be altered. How does that make sense? Is there something I am missing?
The transformers only learn their parameters (e.g. the scaler's mean and standard deviation, the PCA components) from the training data and assume that the test data follows similar patterns, so they transform it with those same parameters. When you call pipe.predict(X_test), the pipeline calls transform (not fit_transform) on each step, so the test data is scaled and projected using what was learned from the training set. You cannot have test data that is completely different from the training data and expect good predictions. If you re-fit the scaler or PCA on the (smaller) test set, the results might come out completely different from what the model was originally trained on.
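A tiny illustration of that point (not from the original answer): the scaler memorizes the training mean and standard deviation in fit, and transform reuses them on any new data.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler().fit(X_train)  # learns mean=2.0 and std from the training data only
print(scaler.transform(X_test))         # scales 10.0 with the training statistics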
I want to use PCA for dimensionality reduction and then feed its output to a one-class SVM classifier in Python. My training data set is of the order of 16000x60. Also, how do I map the principal components back to the original columns to use them in the SVM, or can I use the principal components directly?
It is unclear what the problem is and what you have tried already. Of course you can use the principal components directly. You can either add the PCA output to your original feature set or just use the output on its own. I encourage you to use sklearn pipelines.
Simple example:
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn import svm
svc = svm.SVC()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('svc', svc)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
pipe.fit(X_digits, y_digits)
print(pipe.score(X_digits, y_digits))
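Since the question mentions a one-class SVM, here is a hedged variant of the same pipeline using OneClassSVM; the data, n_components, and nu values are placeholders standing in for the 16000x60 training set:

import numpy as np
from sklearn import decomposition, svm
from sklearn.pipeline import Pipeline

X_train = np.random.rand(1000, 60)  # placeholder for (a slice of) the 16000x60 data
one_class_pipe = Pipeline(steps=[
    ('pca', decomposition.PCA(n_components=10)),  # assumed number of components
    ('ocsvm', svm.OneClassSVM(nu=0.05)),          # assumed outlier fraction
])
one_class_pipe.fit(X_train)               # unsupervised: no labels needed
labels = one_class_pipe.predict(X_train)  # +1 for inliers, -1 for outliers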
I have an arff file containing some sentences (in Persian) and, in front of each sentence, a word that shows its class in the @data part. I need to use SMO for classification. The questions:
1) Is it necessary to change the sentences to vectors?
2) I selected the "StringToWordVector" filter, but SMO is still inactive and doesn't work (and so are other algorithms like Naive Bayes).
How can I use this text data with SMO?
A very small sample file:
https://www.dropbox.com/s/ohpyortve8jbwhe/shoor.arff?dl=0
First, you need to apply the "StringToWordVector" filter. Then, on the Classify tab, you need to change the target class to "(Nom) class". This is enough to enable the Naive Bayes and SVM algorithms. I downloaded the dataset, and it worked well.
You can follow this tutorial:
https://www.youtube.com/watch?v=zlVJ2_N_Olo
Hope it can help you
from sklearn.feature_extraction.text import TfidfVectorizer
import arff
from sklearn import svm
import numpy as np
from sklearn.model_selection import train_test_split
data=list(arff.load('shoor.arff'))
text=[]
label=[]
for r in data:
    if len(r) > 1:
        text.append(r[0])
        label.append(r[1])
tfidf = TfidfVectorizer().fit_transform(text)
features = (tfidf * tfidf.T).A  # dense document-to-document similarity matrix used as features
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.5, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
1.0
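As a side note (not part of the original answer), the sparse tf-idf matrix can also be fed to the SVM directly instead of the similarity matrix:

X_train, X_test, y_train, y_test = train_test_split(tfidf, label, test_size=0.5, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))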
Just a brief idea of my situation:
I have 4 columns of input: id, text, category, label.
I used TfidfVectorizer on the text, which gives me a matrix of instances with TF-IDF scores for the word tokens.
Now I'd like to include the category (no need to pass it through TF-IDF) as another feature alongside the data output by the vectorizer.
Also note that prior to the vectorization, the data have passed train_test_split.
How could I achieve this?
Initial code:
#initialization
import pandas as pd
path = 'data\data.csv'
rappler= pd.read_csv(path)
X = rappler.text
y = rappler.label
#rappler.category - contains category for each instance
#split train test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
#feature extraction
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
#after or even prior to perform fit_transform, how can I properly add category as a feature?
X_test_dtm = vect.transform(X_test)
#actual classfication
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
#display result
from sklearn import metrics
print(metrics.accuracy_score(y_test,y_pred_class))
I would suggest doing your train test split after feature extraction.
Once you have the TF-IDF feature lists just add the other feature for each sample.
You will have to encode the category feature, a good choice would be sklearn's LabelEncoder. Then you should have two sets of numpy arrays that can be joined.
Here is a toy example:
import numpy as np

X_tfidf = np.array([[0.1, 0.4, 0.2], [0.5, 0.4, 0.6]])
X_category = np.array([[1], [2]])
X = np.concatenate((X_tfidf, X_category), axis=1)
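If your category column contains strings rather than numbers, a minimal LabelEncoder sketch (the category values here are made up) would be:

from sklearn.preprocessing import LabelEncoder

categories = ['sports', 'politics']  # hypothetical raw category values
X_category = LabelEncoder().fit_transform(categories).reshape(-1, 1)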
At this point you would continue as you were, starting with the train test split.
You should use FeatureUnion, as explained in the documentation:
FeatureUnion combines several transformer objects into a new
transformer that combines their output. A FeatureUnion takes a list of
transformer objects. During fitting, each of these is fit to the data
independently. For transforming data, the transformers are applied in
parallel, and the sample vectors they output are concatenated
end-to-end into larger vectors.
Another good example on how to use FeatureUnions can be found here: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
Just concatenating different matrices like @AlexG suggests is probably an easier option, but FeatureUnion is the scikit-learn way to do these things.
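For completeness, here is a rough sketch of the FeatureUnion approach, assuming the rappler DataFrame from the question (columns text, category, label); the lambda column selectors are just one way to pull the columns out:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

# column selectors; validate=False lets the DataFrame pass through untouched
get_text = FunctionTransformer(lambda df: df['text'], validate=False)
get_category = FunctionTransformer(lambda df: df[['category']], validate=False)

features = FeatureUnion([
    ('text', Pipeline([('select', get_text), ('vect', CountVectorizer())])),
    ('category', Pipeline([('select', get_category),
                           ('onehot', OneHotEncoder(handle_unknown='ignore'))])),
])

pipe = Pipeline([('features', features), ('nb', MultinomialNB())])

X_train, X_test, y_train, y_test = train_test_split(
    rappler[['text', 'category']], rappler.label, random_state=1)
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))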