I'm currently working on a Naive Bayes sentiment analysis program but I'm not quite sure how to determine its accuracy. My code is:
x = df["Text"]
y = df["Mood"]
test_size = 1785
x_train = x[:-test_size]
y_train = y[:-test_size]
x_test = x[-test_size:]
y_test = y[-test_size:]
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(x_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)
print(clf.predict(count_vect.transform(["Random text"])))
The prediction works just fine for a sentence that I give it, however I want to run it on 20% from my database (x_test and y_test) and calculate the accuracy. I'm not quite sure how to approach this. Any help would be appreciated.
I've also tried the following:
predictions = clf.predict(x_test)
print(accuracy_score(y_test, predictions))
Which gives me the following error:
ValueError: could not convert string to float: "A sentence from the dataset"

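Before calling predictions = clf.predict(x_test), convert the test set to numeric features as well, using the same fitted vectorizer and TF-IDF transformer that produced the training features:
x_test = tfidf_transformer.transform(count_vect.transform(x_test))
You can find a step-by-step guide to doing this [here]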
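One way to keep the training and test transformations consistent is to wrap the vectorizer, the TF-IDF transformer and the classifier in a single scikit-learn pipeline, so that predict applies the same fitted steps to raw text. The following is a minimal sketch under that assumption; the toy sentences and labels are invented stand-ins for df["Text"] and df["Mood"], not the asker's data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for df["Text"] and df["Mood"] (assumption, not the real dataset).
texts = [
    "I love this movie", "what a great day", "this is wonderful",
    "I hate this movie", "what a terrible day", "this is awful",
    "great and wonderful film", "terrible and awful film",
]
moods = ["pos", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]

# Same tail split as in the question: the last test_size rows are held out.
test_size = 2
x_train, y_train = texts[:-test_size], moods[:-test_size]
x_test, y_test = texts[-test_size:], moods[-test_size:]

# The pipeline fits the vectorizer and TF-IDF weights on the training text only,
# then reuses those fitted steps whenever predict() is called on raw strings.
model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(x_train, y_train)

predictions = model.predict(x_test)
accuracy = accuracy_score(y_test, predictions)
print("accuracy on held-out texts:", accuracy)
```

With this arrangement model.predict(["Random text"]) also works directly on a raw sentence, because the text never has to be transformed by hand.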


How to create an ANN regression model with several vectors as input and several vectors as output?

There is one input variable and one output variable.
However each data point of the input/output variable is a vector.
Size of each input vector is 141X1 and size of each output vector is 400X1.
I have attached the input data file(Ivec.xls) and output data file(Ovec.xls)
data link
For training:
Input vectors: Ivec(:,1:9) and output vectors: Ovec(:,1:9)
For testing:
Input vector: Ivec(:,10) and the predicted_Ovec10 can be compared with Ovec(:,10)
to know the performance of the model.
How to create a regression model from this?
dataset_Ivec = pd.read_excel(r'Ivec.xls',header = None)
dataset_Ovec = pd.read_excel(r'Ovec.xls',header = None)
dataset_Ivec_numpy = dataset_Ivec.to_numpy()
dataset_Ovec_numpy = dataset_Ovec.to_numpy()
X_train = dataset_Ivec_numpy[:,:-1]
y_train = dataset_Ovec_numpy[:,:-1]
X_test = dataset_Ivec_numpy[:,9]
y_test = dataset_Ovec_numpy[:,9]
model = Sequential()
model.add(Dense(activation="relu", input_dim=X_train.shape[0], units = X_train.shape[1], kernel_initializer="uniform"))
model.add(Dense(activation="linear", input_dim=y_train.shape[1], units = X_train.shape[1], kernel_initializer="uniform"))
model.compile(optimizer="adagrad", loss="mean_squared_error", metrics=["accuracy"])
# model = baseline_model()
result = model.fit(X_train, y_train, batch_size=2, epochs=20, validation_data=(X_test, y_test))
I tried to write this code, however, it is too confusing for me.
And now I am quite stuck for many days. Please help someone.
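Part of the confusion may come from the data layout: the files store one sample per column, whereas Keras and scikit-learn estimators expect one sample per row. The sketch below only illustrates that rearrangement with NumPy; the random arrays are stand-ins with the shapes described above, not the contents of Ivec.xls and Ovec.xls.

```python
import numpy as np

rng = np.random.default_rng(0)
Ivec = rng.normal(size=(141, 10))  # random stand-in for Ivec.xls (one sample per column)
Ovec = rng.normal(size=(400, 10))  # random stand-in for Ovec.xls (one sample per column)

# Transpose so that each row is one sample and each column is one feature.
X = Ivec.T  # shape (10, 141)
Y = Ovec.T  # shape (10, 400)

# First nine samples for training, the tenth for testing.
# Slicing with 9: (rather than indexing with 9) keeps the test arrays 2-D.
X_train, y_train = X[:9], Y[:9]
X_test, y_test = X[9:], Y[9:]

print(X_train.shape, y_train.shape)  # (9, 141) (9, 400)
print(X_test.shape, y_test.shape)    # (1, 141) (1, 400)
```

Under this layout, the network's input dimension would be 141 (X_train.shape[1]) and its final layer would need 400 units (y_train.shape[1]).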

making predictions using classification models with multiple independent variables in hand

I am trying to build a simple classifier using logistic regression. I scale the values with a standard scaler and fit the model. How can I make a single prediction after that? I am getting the same result for different values: for every value, I get 0. The predictions for single inputs do not match the predictions made on the test dataset. Can someone please give me a hand?
dataset = pd.read_csv("Social_Network_Ads.csv")
x = dataset.iloc[:, 2:4].values
y = dataset.iloc[:, 4].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
classifier = LogisticRegression()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
x_values = [36, 36000]
x_values = np.array(x_values).reshape(1, -1)
x_values = scaler.transform(x_values)
pred = classifier.predict(x_values)
print("single prediction: ", pred)
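A constant prediction of 0 is not necessarily a bug: the inputs tried may simply all fall on the same side of the decision boundary. One way to check is to look at predict_proba for the single sample. The sketch below assumes synthetic age and salary columns and a made-up labelling rule, since Social_Network_Ads.csv is not available here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two feature columns (assumption, not the real CSV).
age = rng.uniform(18, 60, size=200)
salary = rng.uniform(15000, 150000, size=200)
x = np.column_stack([age, salary])
y = ((age > 40) | (salary > 90000)).astype(int)  # made-up label rule

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

classifier = LogisticRegression()
classifier.fit(x_scaled, y)

# A single sample must be 2-D (1 row, 2 features) and must go through
# the same fitted scaler before it reaches the classifier.
x_values = np.array([[36, 36000]])
x_values_scaled = scaler.transform(x_values)

pred = classifier.predict(x_values_scaled)
proba = classifier.predict_proba(x_values_scaled)
print("single prediction:", pred)
print("class probabilities:", proba)
```

Trying a few clearly different inputs (for example a high age with a high salary) and comparing the probabilities shows whether the model ever crosses the 0.5 threshold.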

sklearn and weka kNN predictions exactly same for all except for one data point

I wrote a kNN regressor using sklearn and compared its predictions with those of the WEKA kNN. The comparison used the 10 test set predictions: only one of them shows a large difference (>1.5), while all the others are exactly the same. So I am not sure whether my code is working correctly. Here is my code:
df = pd.read_csv('xxxx.csv')
X = df.drop(['Name', 'activity'], axis=1)
y = df['activity']
Xstd = StandardScaler().fit_transform(X)
x_train, x_test, y_train, y_test = train_test_split(Xstd, y, test_size=0.2,
                                                    shuffle=False, random_state=None)
print(x_train.shape, x_test.shape)
X_train_trans = x_train
X_test_trans = x_test
for i in range(2, 3):
    knn_regressor = KNeighborsRegressor(n_neighbors=i, algorithm='brute',
                                        weights='uniform', metric='euclidean', n_jobs=1, p=2)
    CV_pred_train = cross_val_predict(knn_regressor, X_train_trans, y_train,
                                      n_jobs=-1, verbose=0, cv=LeaveOneOut())
    print("LOO Q2: ", metrics.r2_score(y_train, CV_pred_train).round(2))
    # Train Test predictions
    knn_regressor.fit(X_train_trans, y_train)
    train_r2 = knn_regressor.score(X_train_trans, y_train)
    y_train_pred = knn_regressor.predict(X_train_trans).round(3)
    train_r2_1 = metrics.r2_score(y_train, y_train_pred)
    y_test_pred = knn_regressor.predict(X_test_trans).round(3)
    train_r = stats.pearsonr(y_train, y_train_pred)
    abs_error_train = (y_train - y_train_pred)
    train_predictions = pd.DataFrame({'Actual': y_train, 'Predcited':
                                      y_train_pred, "error": abs_error_train.round(3)})
    MAE_train = metrics.mean_absolute_error(y_train, y_train_pred)
    abs_error_test = (y_test_pred - y_test)
    test_predictions = pd.DataFrame({'Actual': y_test, 'predcited':
                                     y_test_pred, 'error': abs_error_test.round(3)})
    test_r = stats.pearsonr(y_test, y_test_pred)
    test_r2 = metrics.r2_score(y_test, y_test_pred)
    MAE_test = metrics.mean_absolute_error(y_test, y_test_pred).round(3)
The train set statistics are almost the same in both sklearn and WEKA kNN.
The sklearn predictions are:
Actual predcited error
6.00 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.440 0.480
7.96 7.150 -0.810
7.30 6.660 -0.640
6.68 7.200 0.520
***4.60 6.950 2.350***
and the weka predictions are:
actual predicted error
6 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.44 0.48
7.96 7.15 -0.81
7.3 6.66 -0.64
6.68 7.2 0.52
***4.6 5.285 0.685***
The parameters used in both algorithms are: k = 2, brute-force distance calculation, Euclidean metric.
Any suggestions as to what causes the difference?
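One way to investigate a single disagreeing prediction is to ask the fitted regressor which training points it actually used as neighbours, and at what distances, and then compare those with the neighbours WEKA reports. The sketch below uses random synthetic data in place of xxxx.csv; it is a diagnostic aid and does not by itself identify the cause of the difference.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for the standardized features and the activity values.
X_train = rng.normal(size=(40, 5))
y_train = rng.uniform(4, 8, size=40)
X_test = rng.normal(size=(10, 5))

knn = KNeighborsRegressor(n_neighbors=2, algorithm="brute",
                          weights="uniform", metric="euclidean")
knn.fit(X_train, y_train)

# kneighbors returns, for every test row, the distances to and the
# training-set indices of the k nearest neighbours.
distances, indices = knn.kneighbors(X_test)
y_pred = knn.predict(X_test)

for row, (dist, idx) in enumerate(zip(distances, indices)):
    # With uniform weights the prediction is the mean of the neighbours' targets.
    print(row, idx, dist.round(3), y_train[idx].round(3), y_pred[row].round(3))
```

If the two nearest distances for the odd test point are equal or nearly equal to the third, or if the neighbour indices differ from WEKA's, that would point towards tie-breaking or preprocessing differences rather than an error in the regression code.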

Random Forest sklearn - Equals values predict

House Prices challenge Kaggle
I'm trying to predict prices with RandomForestClassifier. After calling predict, I get the same price for every Id. Do you have any idea what the problem is?
clf = RandomForestClassifier(n_estimators=50)
clf = clf.fit(X_train,y_train)
clf.score(X_train, y_train)
X = df_test2[feature_cols]
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
df_imp = imp.fit_transform(X)
df_test_scale = scaler.transform(df_imp)
y_pred = clf.predict(df_test_scale)
predict_prices = pd.DataFrame({"Id" : df_test2['Id'], "SalePrice":y_pred})
Since you have scaled the training set, scale the test set with the same scaler.
Change as below:
clf = RandomForestClassifier(n_estimators=50)
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
clf = clf.fit(X_train,y_train)
clf.score(X_train, y_train)
X = df_test2[feature_cols]
df_imp = imp.fit_transform(X)
df_test_scale = scaler.transform(df_imp)
y_pred = clf.predict(df_test_scale)
predict_prices = pd.DataFrame({"Id" : df_test2['Id'], "SalePrice":y_pred})
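The idea of reusing the same fitted preprocessing on the test set can also be expressed with a pipeline, which fits the imputer and scaler on the training data and applies them unchanged at prediction time. The sketch below is written under several assumptions: the data are synthetic, SimpleImputer is used because the older Imputer class was removed from recent scikit-learn releases, and RandomForestRegressor is swapped in for the classifier on the assumption that SalePrice is treated as a continuous target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the training and test feature tables.
X_train = rng.normal(size=(100, 4))
y_train = 200000 + 50000 * X_train[:, 0] + rng.normal(scale=1000, size=100)
X_test = rng.normal(size=(20, 4))

# Knock out a few values to mimic missing data in both sets.
X_train[rng.random(X_train.shape) < 0.05] = np.nan
X_test[rng.random(X_test.shape) < 0.05] = np.nan

# fit() learns the column means and scaling from the training data only;
# predict() reuses those fitted values on the test data.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(y_pred[:5])
```

Because the preprocessing lives inside the pipeline, there is no separate scaler or imputer object that could accidentally be refitted on the test set.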

Cross validation in classifying text documents using scikit-learn

When classifying text documents with scikit-learn, do you do cross validation first and then feature extraction, or the other way around?
Here is my pipeline:
union = FeatureUnion(
transformer_list = [
('tfidf', TfidfVectorizer()),
('featureEx', FeatureExtractor()),
('spell_chker', Spellingchecker()),
], n_jobs = -1)
I am doing it in the following way, but I wonder if I should extract the features first and then do the cross validation. In this example X is a list of documents and y is the list of labels.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size= 0.2)
X_train = union.fit_transform(X_train)
X_test = union.transform(X_test)
ch2 = SelectKBest(f_classif, k = 7000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)
clf = SVC(C=1, gamma=0.001, kernel = 'linear', probability=True).fit(
X_train , y_train)
print("classification report:")
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
Doing the feature selection first and then cross-validating on those features is common on text data, but it is less desirable. It can lead to over-fitting, and the cross-validation procedure may over-estimate your true accuracy.
When you do the feature selection first, the feature selection process gets to look at all the data. The point of cross-validation is to hide one fold from the others. By doing the feature selection first, you leak some knowledge of the held-out data to the other folds.
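A common way to avoid this leak is to put the feature extraction and the feature selection inside a pipeline and cross-validate the whole pipeline, so that both steps are refitted on the training part of every fold. The sketch below assumes a tiny invented corpus and uses only TfidfVectorizer as the feature extractor; the custom FeatureExtractor and Spellingchecker from the question are not reproduced, and k is reduced to suit the toy vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Invented toy corpus (assumption): six documents per class.
docs = [
    "the match ended with a late goal", "the striker scored twice today",
    "the team won the league title", "the coach praised the defence",
    "fans cheered the winning goal", "the keeper saved a penalty",
    "the bank raised interest rates", "shares fell on weak earnings",
    "the market rallied after the report", "investors sold bonds today",
    "the company posted record profit", "inflation pushed prices higher",
]
labels = ["sport"] * 6 + ["finance"] * 6

# Every step is refitted on the training folds only, so the held-out fold
# never influences the vocabulary or the selected features.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(f_classif, k=5)),
    ("clf", SVC(C=1, kernel="linear")),
])

scores = cross_val_score(pipeline, docs, labels, cv=3)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

The same pipeline object can afterwards be fitted on all the training documents and used to predict on a separate test set.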
