I have a dataset that contains hotel reviews. I want to predict whether a review is positive or negative, but I don't have a dependent variable y in my dataset.
I am trying to use NLTK and the naive Bayes algorithm. Please help me solve this problem.
Here is my code so far.
Reviews = dataset.iloc[:, 18]
# Cleaning the texts
import re
import nltk
nltk.download('stopwords')  # first-time use only
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not per review
for num in range(0, 10000):
    # keep letters only, lower-case, split into words
    review = re.sub('[^a-zA-Z]', ' ', str(Reviews[num]))
    review = review.lower()
    review = review.split()
    # drop stop words and stem what is left
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)  # without this the corpus stays empty
# Creating the Bag of Words Model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()

Since you do not have a target class (dependent variable y), I believe you should consider an unsupervised learning approach, e.g. clustering.
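To make the clustering suggestion concrete, here is a minimal sketch that groups a few reviews with TF-IDF features and KMeans. The toy reviews and the choice of two clusters are assumptions for illustration only, and nothing guarantees that the resulting clusters line up with positive and negative sentiment; you would still have to inspect them.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy reviews; in practice use the cleaned corpus.
reviews = [
    "great room friendly staff",
    "lovely stay great staff",
    "dirty room rude staff",
    "terrible stay dirty bathroom",
]

# Turn each review into a TF-IDF weighted bag-of-words vector.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(reviews)

# Two clusters is an assumption; the clusters are not sentiment labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for label, text in zip(labels, reviews):
    print(label, text)
```

Reading a handful of reviews from each cluster is usually the quickest way to see what the clusters actually capture.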

If you don't have a target variable, you can give TextBlob a try:
from textblob import TextBlob
testimonial = TextBlob("today is a bad day for me!")
print(testimonial.sentiment)
# o/p (polarity close to 1 means positive, close to -1 means negative)
# Sentiment(polarity=-0.8749999999999998, subjectivity=0.6666666666666666)


How to calculate cluster coherence/quality?

I created embeddings with fastText and clustered them with KMeans.
I would like to calculate similarities inside each cluster to check whether its sentences are well clustered. I want to keep the sentences that have good similarity within each cluster. If a sentence's similarity is poor, I want to remove it from its cluster and then group together the similar sentences that do not belong to any cluster.
How can I do this properly? I thought of using cosine similarity, but I don't know how to compare all the sentences inside a cluster.
Maybe something like this...
# clustering words into similar groups:
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)  # fit before reading labels_ and cluster_centers_indices_
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
See these links for additional guidance on how to cluster text.
Here are a couple of examples using cosine similarity.
d1 = "plot: two teen couples go to a church party, drink and then drive."
d2 = "films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . "
d3 = "every now and then a movie comes along from a suspect studio , with every indication that it will be a stinker , and to everybody's surprise ( perhaps even the studio ) the film becomes a critical darling . "
d4 = "damn that y2k bug . "
documents = [d1, d2, d3, d4]
import nltk, string, numpy
nltk.download('punkt')  # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]
# translation table that deletes punctuation
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
nltk.download('wordnet')  # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
# the vocabulary has to be fitted before the documents can be transformed
tf_matrix = LemVectorizer.fit_transform(documents).toarray()
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)  # learn the idf weights
import math
def idf(n, df):
    result = math.log((n + 1.0) / (df + 1.0)) + 1
    return result
print("The idf for terms that appear in one document: " + str(idf(4, 1)))
print("The idf for terms that appear in two documents: " + str(idf(4, 2)))
tfidf_matrix = tfidfTran.transform(tf_matrix)
# rows are L2-normalised, so the matrix product gives cosine similarities
cos_similarity_matrix = (tfidf_matrix @ tfidf_matrix.T).toarray()
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf @ tfidf.T).toarray()
# Define the documents
doc_trump = "Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin"
doc_election = "President Trump says Putin had no political interference is the election outcome. He says it was a witchhunt by political parties. He claimed President Putin is a friend who had nothing to do with the election"
doc_putin = "Post elections, Vladimir Putin became President of Russia. President Putin had served as the Prime Minister earlier in his political career"
documents = [doc_trump, doc_election, doc_putin]
# Scikit Learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Create the Document Term Matrix
# Alternative that drops English stop words:
# count_vectorizer = CountVectorizer(stop_words='english')
count_vectorizer = CountVectorizer()
sparse_matrix = count_vectorizer.fit_transform(documents)
# OPTIONAL: Convert Sparse Matrix to Pandas Dataframe if you want to see the word frequencies.
doc_term_matrix = sparse_matrix.toarray()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names_out(),
                  index=['doc_trump', 'doc_election', 'doc_putin'])
# Compute Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(df, df))
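Coming back to the original question about checking each KMeans cluster, the following is a rough sketch of one way to do it: score every sentence by its cosine similarity to its cluster centroid, report the mean pairwise similarity of each cluster as a coherence value, and set aside the sentences that fall below a cut-off. The random vectors stand in for fastText sentence embeddings, and the threshold of 0.85 is an assumption you would have to tune on your own data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for sentence embeddings: two tight groups of
# vectors plus one stray vector (row 10) that fits neither group well.
rng = np.random.default_rng(0)
group_a = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.standard_normal((5, 3))
group_b = np.array([0.0, 1.0, 0.0]) + 0.05 * rng.standard_normal((5, 3))
stray = np.array([[0.6, 0.0, 0.8]])
embeddings = np.vstack([group_a, group_b, stray])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

threshold = 0.85  # assumed cut-off, not a recommended value
keep = np.zeros(len(embeddings), dtype=bool)
for cluster_id in np.unique(labels):
    idx = np.flatnonzero(labels == cluster_id)
    members = embeddings[idx]
    # Similarity of every member to the mean vector of its cluster.
    centroid = members.mean(axis=0, keepdims=True)
    sims = cosine_similarity(members, centroid).ravel()
    # Mean pairwise similarity (diagonal excluded) as a coherence score.
    n = len(idx)
    pairwise = cosine_similarity(members)
    coherence = (pairwise.sum() - n) / (n * (n - 1)) if n > 1 else 1.0
    print(f"cluster {cluster_id}: coherence {coherence:.3f}")
    keep[idx] = sims >= threshold

# Rows that did not fit their cluster well enough.
outliers = np.flatnonzero(~keep)
print("rejected rows:", outliers)
```

The rejected rows could then be clustered again on their own, which matches the second step described in the question.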

pyspark.sql.utils.IllegalArgumentException: 'Field "features" does not exist

I am trying to perform topic modelling and sentiment analysis on text data with SparkNLP. I have done all the pre-processing steps on the dataset, but I am getting an error in LDA.
The program is:
# NOTE: the module names were lost in the original post; the pyspark.ml
# modules below are assumed from the classes that are imported.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF
from pyspark.ml.feature import Tokenizer, RegexTokenizer, Normalizer
from pyspark.ml.clustering import LDA
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col, lit, concat, regexp_replace, udf
from pyspark.sql.types import IntegerType
from pyspark.sql.utils import AnalysisException
# NOTE: the reader object and the file path were also lost; `spark`
# (a SparkSession) is assumed and the path is left elided.
dataframe_new = spark.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load(...)
get_tokenizers = Tokenizer(inputCol="headline_text", outputCol="get_tokens")
get_tokenized = get_tokenizers.transform(dataframe_new)
remover = StopWordsRemover(inputCol="get_tokens", outputCol="row")
get_remover = remover.transform(get_tokenized)
counter_vectorized = CountVectorizer(inputCol="row", outputCol="get_features")
getmodel = counter_vectorized.fit(get_remover)
get_result = getmodel.transform(get_remover)
idf_function = IDF(inputCol="get_features", outputCol="get_idf_feature")
train_model = idf_function.fit(get_result)
outcome = train_model.transform(get_result)
lda = LDA(k=10, maxIter=10)
model = lda.fit(outcome)  # raises the error: no column named "features"
Schema of DataFrame after the IDF :
According to the documentation, LDA includes a featuresCol argument, with default value featuresCol='features', i.e. the name of the column that holds the actual features; according to your shown schema, such a column is not present in your dataframe, hence the expected error.
It is not exactly clear which column contains the features in your dataframe - get_features or get_idf_feature (they look identical in the sample you show); assuming it is get_idf_feature, you should change the LDA call to:
lda = LDA(featuresCol='get_idf_feature', k=10, maxIter=10)
The Spark ML API (including pyspark) follows a quite different logic from, say, scikit-learn and similar frameworks; one of the differences is indeed that the features all have to be in a single column of the respective dataframe. For a general demonstration of the idea, see my own answer in KMeans clustering in PySpark (it is about K-Means, but the logic is identical).

How to do a single value prediction in NLP

My dataset consists of restaurant reviews with two columns, review and liked.
The liked column shows whether the reviewer liked the restaurant, based on the review.
As a first step, I cleaned up the text data. As a second step, I used a bag-of-words model as below.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
This gave X with 1500 columns of 0s and 1s and 1000 rows, according to my dataset.
I predicted as below:
y_pred = classifier.predict(X_test)
So now I have a review such as "Food was good". How do I predict whether they liked it or not? It is a single value to predict.
Can you please help me out? Please let me know if additional information is required.
All you need is to apply cv.transform first just like so:
>>> test = ['Food was good']
>>> test_vec = cv.transform(test)
>>> classifier.predict(test_vec)
# returns predicted class
For training and testing, here is a simple example:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
text = ["This is good place", "Hyatt is awesome hotel"]
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train_counts = count_vect.fit_transform(text)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
pd.DataFrame(X_train_tfidf.toarray(), columns=count_vect.get_feature_names_out())
# Now apply any classifier you want on top of this dataset
Now testing:
Note: use the same transformation as in training (transform only, no fitting):
new = ["I like the ambiance of this hotel "]
X_new_tfidf = tfidf_transformer.transform(count_vect.transform(new))
pd.DataFrame(X_new_tfidf.toarray(), columns=count_vect.get_feature_names_out())
Apply model.predict on top of this now.
You can also use an sklearn Pipeline:
from sklearn.pipeline import Pipeline
model_pipeline = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('model', classifier())])  # replace classifier() with the model you want to use
model_pipeline.fit(x, y)  # x is your text data and y is your target
model_pipeline.predict(['Food was good'])  # predict your new sentence
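For completeness, here is a small end-to-end sketch of the pipeline idea that runs as written. The four toy reviews and the use of MultinomialNB are assumptions for illustration; the original question does not say which classifier was trained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the review / liked columns.
reviews = [
    "Food was good",
    "Loved the good service",
    "Food was bad",
    "Bad and slow service",
]
liked = [1, 1, 0, 0]

# The pipeline stores the fitted vectorizer, so new text is transformed
# with the training vocabulary automatically.
model = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
model.fit(reviews, liked)

# predict() expects a list of documents, even for a single review.
single = ["Food was good"]
prediction = model.predict(single)
print(prediction[0])
```

Passing the single review as a one-element list is the key point: the vectorizer works on an iterable of documents, not on a bare string.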

Difference in PCA with Scikit-Learn and SVD

I am working on a PCA example with Scikit-Learn and SVD on the following dataset. I thought I should get the same PCA components with both methods; however, I find that the signs are reversed. I followed different resources, correctly I assume, but I do not quite understand why I am getting this sign reversal. Below is what I have done. I thought Xpca and Xsvd should be the same.
import numpy as np  # needed for np.linalg.svd below
import pandas as pd
data = pd.read_csv("", header=None)
data.columns = ["V"+str(i) for i in range(1, len(data.columns)+1)] # rename column names to be similar to R naming convention
data.V1 = data.V1.astype(str)
X = data.loc[:, "V2":] # independent variables data
y = data.V1 # dependent variable data
# Using Scikit-Learn
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
standardisedX = scale(X)
standardisedX = pd.DataFrame(standardisedX, index=X.index, columns=X.columns)
pca = PCA().fit(standardisedX)
Xpca = pd.DataFrame(pca.transform(standardisedX))
# Using SVD
U, S, V = np.linalg.svd(standardisedX, full_matrices=False, compute_uv=True)
Xsvd = pd.DataFrame(
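On the sign reversal itself: singular vectors are only defined up to sign, so a component and its negation describe the same axis, and two routines may legitimately pick opposite signs. The sketch below uses random data as a stand-in for the dataset in the question (an assumption), and forces svd_solver="full" so both routes use a full SVD. It checks that the two score matrices agree in absolute value and shows one way to align the signs.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Hypothetical random data in place of the dataset from the question.
rng = np.random.default_rng(0)
X = scale(rng.standard_normal((50, 4)) @ rng.standard_normal((4, 4)))

# Route 1: scikit-learn PCA scores.
Xpca = PCA(svd_solver="full").fit_transform(X)

# Route 2: plain SVD of the standardised data; scores are U * S.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Xsvd = U * S

# The columns match up to a sign flip per component.
same_up_to_sign = np.allclose(np.abs(Xpca), np.abs(Xsvd), atol=1e-6)
print(same_up_to_sign)

# Flip each SVD column so that it points the same way as the PCA column.
signs = np.sign((Xpca * Xsvd).sum(axis=0))
aligned = np.allclose(Xpca, Xsvd * signs, atol=1e-6)
print(aligned)
```

Since a flipped component spans the same direction, the explained variance and any distances computed from the scores are unaffected.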

increase accuracy of model in sklearn

The decision tree classifier gives an accuracy of 0.52, but I want to increase it. How can I increase the accuracy using any of the classification models available in sklearn?
I have used kNN, a decision tree, and cross-validation, but all of them give low accuracy.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
#read from the csv file and return a Pandas DataFrame.
nba = pd.read_csv('wine.csv')
# print the column names
original_headers = list(nba.columns.values)
#print the first three rows.
# "Position (pos)" is the class attribute we are predicting.
class_column = 'quality'
#The dataset contains attributes such as player name and team name.
#We know that they are not useful for classification and thus do not
#include them as features.
feature_columns = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH','sulphates', 'alcohol']
#Pandas DataFrame allows you to select columns.
#We use column selection to split the data into features and class.
nba_feature = nba[feature_columns]
nba_class = nba[class_column]
train_feature, test_feature, train_class, test_class = \
    train_test_split(nba_feature, nba_class, stratify=nba_class,
                     train_size=0.75, test_size=0.25)
training_accuracy = []
test_accuracy = []
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)
knn.fit(train_feature, train_class)
prediction = knn.predict(test_feature)
print("Test set predictions:\n{}".format(prediction))
print("Test set accuracy: {:.2f}".format(knn.score(test_feature, test_class)))
train_class_df = pd.DataFrame(train_class,columns=[class_column])
train_data_df = pd.merge(train_class_df, train_feature, left_index=True, right_index=True)
train_data_df.to_csv('train_data.csv', index=False)
temp_df = pd.DataFrame(test_class,columns=[class_column])
temp_df['Predicted Pos']=pd.Series(prediction, index=temp_df.index)
test_data_df = pd.merge(temp_df, test_feature, left_index=True, right_index=True)
test_data_df.to_csv('test_data.csv', index=False)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(train_feature, train_class)
print("Training set score: {:.3f}".format(tree.score(train_feature, train_class)))
print("Test set score Decision: {:.3f}".format(tree.score(test_feature, test_class)))
prediction = tree.predict(test_feature)
print("Confusion matrix:")
print(pd.crosstab(test_class, prediction, rownames=['True'], colnames=['Predicted'], margins=True))
cancer = nba.to_numpy()  # as_matrix() was removed in pandas 1.0
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
scores = cross_val_score(tree, train_feature,train_class, cv=10)
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.2f}".format(scores.mean()))
Usually the next step after a decision tree (DT) is a random forest (RF) and its relatives, or XGBoost (which is not part of sklearn). Try them. Also note that DTs overfit very easily.
Remove outliers. Check the classes in your dataset: if they are unbalanced, most of the errors may come from there. In that case you need to use class weights while fitting or in the metric function (or use F1).
You can attach your confusion matrix here; it would be useful to see.
Also, a neural network (even the one from sklearn) may show better results.
Improve your preprocessing.
Methods such as DT and kNN may be sensitive to how you preprocess your columns. For example, a DT can benefit a lot from well-chosen thresholds on the continuous variables.
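As a starting point for these suggestions, the sketch below compares a random forest with balanced class weights against kNN on standardised features, both scored with macro-averaged F1 under cross-validation. The synthetic, imbalanced data from make_classification is an assumption standing in for wine.csv, and the hyperparameters are placeholders rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the wine data (11 features, 3 classes).
X, y = make_classification(n_samples=600, n_features=11, n_informative=6,
                           n_classes=3, weights=[0.6, 0.3, 0.1],
                           random_state=0)

models = {
    # class_weight="balanced" up-weights the rarer classes during fitting.
    "random forest (balanced class weights)": RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0),
    # kNN is distance based, so the features are standardised first.
    "kNN on standardised features": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

scores = {}
for name, model in models.items():
    # Macro-F1 gives every class the same weight, unlike plain accuracy.
    cv_scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    scores[name] = cv_scores.mean()
    print(f"{name}: {scores[name]:.3f}")
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no information from the validation fold leaks into the preprocessing.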
