Combing the NLTK text features with sklearn Vectorized features - machine-learning

I am trying to combine the dict type features used in NLTK along with the SKLEARN tfidf feature for each instance.
Sample Input:
instances=[["I am working with text data"],["This is my second sentence"]]
instance = "I am working with text data "
def generate_features(instance):
featureset["suffix"]=tokenize(instance)[-1]
featureset["tfidf"]=self.tfidf.transform(instance)
return features
from sklearn.linear_model import LogisticRegressionCV
from nltk.classify.scikitlearn import SklearnClasskifier
self.classifier = SklearnClassifier(LogisticRegressionCV())
self.classifier.train(feature_sets)
This tfidf is trained on all the instances. But when I train the nltk classifier using this featureset it throws the following error.
self.classifier.train(feature_sets)
File "/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.py", line 115, in train
X = self._vectorizer.fit_transform(X)
File "/Library/Python/2.7/site
packages/sklearn/feature_extraction/dict_vectorizer.py", line 226, in fit_transform
return self._transform(X, fitting=True)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 174, in _transform
values.append(dtype(v))
TypeError: float() argument must be a string or a number
I understand the issue here, that it cannot vectorize the already vectorized features. But is there a way to fix this ?

For those who might visit this question in future, I did the following thing which solved the issue.
from sklearn.linear_model import LogisticRegressionCV
from scipy.sparse import hstack
def generate_features(instance):
featureset["suffix"]=tokenize(instance)[-1]
return features
feature_sets=[(generate_features(instance),label) for instance in instances]
X = self.vec.fit_transform([item[0] for item in feature_sets]).toarray()
Y = [item[1] for item in feature_sets]
tfidf=TfidfVectorizer.fit_transform(instances)
X=hstack((X,tfidf))
classifier=LogisticRegressionCV()
classifier.fit(X,Y)

i don't know if it's help or not. In my case, featureset["suffix"]'s values must be string or number. For example :
featureset["suffix"] = "some value"

Related

GluonTS example airpassengers dataset not found

I am trying to run the GluonTS example code, going through some struggle to install the libraries, now I get the following error:
FileNotFoundError: C:\Users\abcde\.mxnet\gluon-ts\datasets\airpassengers\test
The C:\Users\abcde\.mxnet\gluon-ts\datasets\airpassengers\ does exist but contains only train folder. Have tried reinstalling but to no avail. Any ideas how to fix this and run the example, even if finding the dataset in correct format elsewhere?
EDIT: To clarify, I was referring to an example on https://ts.gluon.ai/stable/
import matplotlib.pyplot as plt
from gluonts.dataset.util import to_pandas
from gluonts.dataset.pandas import PandasDataset
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.mx import DeepAREstimator, Trainer
dataset = get_dataset("airpassengers")
deepar = DeepAREstimator(prediction_length=12, freq="M", trainer=Trainer(epochs=5))
model = deepar.train(dataset.train)
# Make predictions
true_values = to_pandas(list(dataset.test)[0])
true_values.to_timestamp().plot(color="k")
prediction_input = PandasDataset([true_values[:-36], true_values[:-24], true_values[:-12]])
predictions = model.predict(prediction_input)
for color, prediction in zip(["green", "blue", "purple"], predictions):
prediction.plot(color=f"tab:{color}")
plt.legend(["True values"], loc="upper left", fontsize="xx-large")
There was an incorrect import on the earlier version of the example, which was since corrected, also I needed to specify regenerate=True while getting the dataset, so:
dataset = get_dataset("airpassengers", regenerate=True).

when I run GridSearchCV() classifier with parameters so i get this kind of error:-ValueError: could not convert string to float: 'text'

so please how can I resolve this kind of error, anyone's guide me please
X = df.iloc[:,:-2]
y = df.My_Labels
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
from sklearn.model_selection import KFold
import numpy as np
from sklearn.model_selection import GridSearchCV
log_class=LogisticRegression()
#grid={'C':10.0 **np.arange(-2,3),'penalty':['l1','l2'],'solver': [ 'lbfgs', 'liblinear']}
grid={'C':10.0 **np.arange(-2,3),'penalty': ['l1'], 'solver': [ 'lbfgs', 'liblinear', 'sag', 'saga'],'penalty': ['l2'], 'solver': ['newton-cg']}
cv=KFold(n_splits=5,random_state=None,shuffle=False)
from sklearn.model_selection import train_test_split
X_train,X_test, y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=10)
clf=GridSearchCV(log_class,grid,cv=cv,n_jobs=-1,scoring='f1_macro')
clf.fit(X_train.astype(str),y_train)
this is the error section, which I get when i run the above code for purely TEXT Classification
ValueError Traceback (most recent call last)
<ipython-input-18-4d99cebd483c> in <module>
17
18 clf=GridSearchCV(log_class,grid,cv=cv,n_jobs=-1,scoring='f1_macro')
---> 19 clf.fit(X_train.astype(str),y_train)
ValueError: could not convert string to float: 'great kindle text sucks cant use calling feature phone im connected wifi makes great calls'
It looks like you have text in your dataset, use Scikit-Learn's LabelEncoder to encode it into a numeric value. Use This Link if you don't know how to use LabelEncoder.
Looking at your error, I think you have some review column, If you don't want to LabelEncode it, use some NLP techniques and perform sentimental analysis and all.

pyspark.sql.utils.IllegalArgumentException: 'Field "features" does not exist

I am trying to perform topic modelling and sentimental analysis on text data over SparkNLP. I have done all the pre-processing steps on the dataset but getting an error in LDA.
Error
Program is:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF
from pyspark.ml.clustering import LDA
from pyspark.sql.functions import col, lit, concat, regexp_replace
from pyspark.sql.utils import AnalysisException
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors
dataframe_new = spark.read.format('com.databricks.spark.csv') \
.options(header='true', inferschema='true') \
.load('/home/cdh#psnet.com/Gourav/chap3/abcnews-date-text.csv')
get_tokenizers = Tokenizer(inputCol="headline_text", outputCol="get_tokens")
get_tokenized = get_tokenizers.transform(dataframe_new)
remover = StopWordsRemover(inputCol="get_tokens", outputCol="row")
get_remover = remover.transform(get_tokenized)
counter_vectorized = CountVectorizer(inputCol="row", outputCol="get_features")
getmodel = counter_vectorized.fit(get_remover)
get_result = getmodel.transform(get_remover)
idf_function = IDF(inputCol="get_features", outputCol="get_idf_feature")
train_model = idf_function.fit(get_result)
outcome = train_model.transform(get_result)
lda = LDA(k=10, maxIter=10)
model = lda.fit(outcome)
Schema of DataFrame after the IDF :
According to the documentation, LDA includes a featuresCol argument, with default value featuresCol='features', i.e. the name of the column that holds the actual features; according to your shown schema, such a column is not present in your dataframe, hence the expected error.
It is not exactly clear which column contains the features in your dataframe - get_features or get_idf_feature (they look identical in the sample you show); assuming it is get_idf_feature, you should change the LDA call to:
lda = LDA(featuresCol='get_idf_feature', k=10, maxIter=10)
Spark (including pyspark) ML API has a quite distinct and different logic than, say, scikit-learn and similar frameworks; one of the differences is indeed that the features have to be all in a single column of the respective dataframe. For a general demonstration of the idea, see own answer in KMeans clustering in PySpark (it is about K-Means, but the logic is identical).

How can I do Sentiment analysis for specific dataset?

I have a dataset which contains reviews of hotels. I want to predict whether review is positive or negative. But i don't have a dependent variable y in my dataset.
I am tring to use NLTK and naive Bayes algorithm. Please help me to solve this problem.
Here is my code up to now.
Reviews = dataset.iloc[:,18]
#print(Reviews)
#Cleaning the texts
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for num in range(0,10000):
#nltk.download('stopwords')
review = re.sub('[^a-zA-Z]' , ' ' , str(Reviews[num]))
review = review.lower()
review = review.split()
ps = PorterStemmer()
review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
review = ' '.join(review)
corpus.append(review)
print(corpus)
#Creating the Bag of Words Model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
print(X)
Considering that you do not have a target class (dependent variable y) I believe that you should consider an unsupervised learning approach e.g clustering.
if you don't have target variable than you can give try to Textblob
from textblob import Textblob
testimonial = TextBlob("today is a bad day for me!")
print(testimonial.sentiment)
# o/p (polarity close to 1 means positive, close to -1 means negative)
Sentiment(polarity=-0.8749999999999998, subjectivity=0.6666666666666666)

Why the following partial fit is not working property?

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
Hello I have the following list of comments:
comments = ['I am very agry','this is not interesting','I am very happy']
These are the corresponding labels:
sents = ['angry','indiferent','happy']
I am using tfidf to vectorize these comments as follows:
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
from sklearn import preprocessing
I am using label encoder to vectorize the labels:
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
print(labels.shape)
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)
Here I am using passive aggressive to fit the model:
clf2 = PassiveAggressiveClassifier()
with open('passive.pickle','wb') as idxf:
pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)
with open('passive.pickle', 'rb') as infile:
clf2 = pickle.load(infile)
with open('tfidf_vectorizer.pickle', 'rb') as infile:
tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
tfidf = pickle.load(infile)
Here I am trying to test the usage of partial fit as follows with three new comments and their corresponding labels:
new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)
print(clf2.predict(vec_new_comments))
clf2.partial_fit(vec_new_comments, new_labels)
The problem is that I am not getting the right results after the partial fit as follows:
print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))
however I am getting this output:
[2 2 2]
So I really appreciate support to find, why the model is not being updated if I am testing it with the same examples that it has used to be trained the desired output should be:
[1,0,2]
I would like to appreciate support to ajust maybe the hyperparameters to see the desired output.
this is the complete code, to show the partial fit:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import sys
from sklearn.metrics.pairwise import cosine_similarity
import random
comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
tfidf = tfidf_vectorizer.fit_transform(comments)
#print(tfidf.shape)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(sents)
labels = le.transform(sents)
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
with open('tfidf.pickle','wb') as idxf:
pickle.dump(tfidf, idxf, pickle.HIGHEST_PROTOCOL)
with open('tfidf_vectorizer.pickle','wb') as idxf:
pickle.dump(tfidf_vectorizer, idxf, pickle.HIGHEST_PROTOCOL)
clf2 = PassiveAggressiveClassifier()
clf2.fit(tfidf, labels)
with open('passive.pickle','wb') as idxf:
pickle.dump(clf2, idxf, pickle.HIGHEST_PROTOCOL)
with open('passive.pickle', 'rb') as infile:
clf2 = pickle.load(infile)
with open('tfidf_vectorizer.pickle', 'rb') as infile:
tfidf_vectorizer = pickle.load(infile)
with open('tfidf.pickle', 'rb') as infile:
tfidf = pickle.load(infile)
new_comments = ['I love the life','I hate you','this is not important']
new_labels = [1,0,2]
vec_new_comments = tfidf_vectorizer.transform(new_comments)
clf2.partial_fit(vec_new_comments, new_labels)
print('AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??')
print(clf2.predict(vec_new_comments))
However I got:
AFTER THIS UPDATE THE RESULT SHOULD BE 1,0,2??
[2 2 2]
Well there are multiple problems with your code. I will start by stating the obvious ones to more complex ones:
You are pickling the clf2 before it has learnt anything. (ie. you pickle it as soon as it is defined, it doesnt serve any purpose). If you are only testing, then fine. Otherwise they should be pickled after the fit() or equivalent calls.
You are calling clf2.fit() before the clf2.partial_fit(). This defeats the whole purpose of partial_fit(). When you call fit(), you essentially fix the classes (labels) that the model will learn about. In your case it is acceptable, because on your subsequent call to partial_fit() you are giving the same labels. But still it is not a good practice.
See this for more details
In a partial_fit() scenario, dont call the fit() ever. Always call the partial_fit() with your starting data and new coming data. But make sure that you supply all the labels you want the model to learn in the first call to parital_fit() in a parameter classes.
Now the last part, about your tfidf_vectorizer. You call fit_transform()(which is essentially fit() and then transformed() combined) on tfidf_vectorizer with comments array. That means that it on subsequent calls to transform() (as you did in transform(new_comments)), it will not learn new words from new_comments, but only use the words which it saw during the call to fit()(words present in comments).
Same goes for LabelEncoder and sents.
This again is not prefereble in a online learning scenario. You should fit all the available data at once. But since you are trying to use the partial_fit(), we assume that you have very large dataset which may not fit in memory at once. So you would want to apply some sort of partial_fit to TfidfVectorizer as well. But TfidfVectorizer doesnt support partial_fit(). In fact its not made for large data. So you need to change your approach. See the following questions for more details:-
Updating the feature names into scikit TFIdfVectorizer
How can i reduce memory usage of Scikit-Learn Vectorizers?
All things aside, if you change just the tfidf part of fitting the whole data (comments and new_comments at once), you will get your desired results.
See the below code changes (I may have organized it a bit and renamed vec_new_comments to new_tfidf, please go through it with attention):
comments = ['I am very agry','this is not interesting','I am very happy']
sents = ['angry','indiferent','happy']
new_comments = ['I love the life','I hate you','this is not important']
new_sents = ['happy','angry','indiferent']
tfidf_vectorizer = TfidfVectorizer(analyzer='word')
le = preprocessing.LabelEncoder()
# The below lines are important
# I have given the whole data to fit in tfidf_vectorizer
tfidf_vectorizer.fit(comments + new_comments)
# same for `sents`, but since the labels dont change, it doesnt matter which you use, because it will be same
# le.fit(sents)
le.fit(sents + new_sents)
Below is the Not so preferred code (which you are using, and about which I talked in point 2), but results are good as long as you make the above changes.
tfidf = tfidf_vectorizer.transform(comments)
labels = le.transform(sents)
clf2.fit(tfidf, labels)
print(clf2.predict(tfidf))
# [0 2 1]
new_tfidf = tfidf_vectorizer.transform(new_comments)
new_labels = le.transform(new_sents)
clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2] As you wanted
Correct approach, or the way partial_fit() is intended to be used:
# Declare all labels that you want the model to learn
# Using classes learnt by labelEncoder for this
# In any calls to `partial_fit()`, all labels should be from this array only
all_classes = le.transform(le.classes_)
# Notice the parameter classes here
# It needs to present first time
clf2.partial_fit(tfidf, labels, classes=all_classes)
print(clf2.predict(tfidf))
# [0 2 1]
# classes is not present here
clf2.partial_fit(new_tfidf, new_labels)
print(clf2.predict(new_tfidf))
# [1 0 2]

Resources