I am trying to do a simple classification using Logistic Regression. I fit the model and scale the values using a StandardScaler. How can I make a single prediction after that? I am getting the same result for different values: for every input I try, the prediction is 0, and the predictions I get from single inputs do not match the results of the predictions made on the testing dataset. Can someone please give me a hand?
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Load the data: columns 2-3 are the features, column 4 is the target
dataset = pd.read_csv("Social_Network_Ads.csv")
x = dataset.iloc[:, 2:4].values
y = dataset.iloc[:, 4].values
print(dataset)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Fit the scaler on the training data only, then reuse it everywhere else
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

classifier = LogisticRegression()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# A single prediction must go through the same fitted scaler
x_values = np.array([36, 36000]).reshape(1, -1)
x_values = scaler.transform(x_values)
pred = classifier.predict(x_values)
print("single prediction: ", pred)
I am using SMOTE to oversample the minority class of a dataset. My code is as follows:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features_coded, labels, test_size=0.2, random_state=42)

sm = SMOTE(random_state=42, sampling_strategy='all')
# also tried the following, same result
# sm = SMOTE(random_state=42, sampling_strategy=0.5)
X_train, y_train = sm.fit_resample(X_train, y_train)
I check features_coded, labels, X_train and y_train using statements like the following:
features_coded[features_coded.isnull().any(axis=1)]
I am fairly sure that none of them contain any NaN values before oversampling. However, after resampling, there are a lot of NaN values in the X_train dataframe.
Just in case you are wondering: in my dataframe (saved as a csv file) before oversampling, nothing is missing; in the dataframe saved after oversampling, there are a lot of empty values! Is anything wrong?
I had a similar issue. I converted my inputs X and Y to arrays using X_arr = numpy.array(X) and y_arr = numpy.array(Y), and fed them to train_test_split() as follows:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_arr, y_arr, test_size=0.2, random_state=2)
smote = SMOTE(random_state=2)
X_train_balanced, Y_train_balanced = smote.fit_resample(X_train, y_train)
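As a quick sanity check after resampling (a minimal sketch, assuming the features are numeric arrays), you can verify that SMOTE did not introduce NaNs:

import numpy as np

# The resampled feature matrix should contain no NaN values.
print(np.isnan(X_train_balanced).any())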
I wrote code for kNN using sklearn and then compared the predictions with the WEKA kNN. The comparison was done on the 10 test-set predictions, out of which only a single one shows a large difference of >1.5; all the others are exactly the same. So I am not sure whether my code is working correctly. Here is my code:
import pandas as pd
from scipy import stats
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_predict, LeaveOneOut
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('xxxx.csv')
X = df.drop(['Name', 'activity'], axis=1)
y = df['activity']

Xstd = StandardScaler().fit_transform(X)
x_train, x_test, y_train, y_test = train_test_split(Xstd, y, test_size=0.2,
                                                    shuffle=False, random_state=None)
print(x_train.shape, x_test.shape)

X_train_trans = x_train
X_test_trans = x_test

for i in range(2, 3):
    knn_regressor = KNeighborsRegressor(n_neighbors=i, algorithm='brute',
                                        weights='uniform', metric='euclidean',
                                        n_jobs=1, p=2)

    # Leave-one-out cross-validated predictions on the training set
    CV_pred_train = cross_val_predict(knn_regressor, X_train_trans, y_train,
                                      n_jobs=-1, verbose=0, cv=LeaveOneOut())
    print("LOO Q2: ", metrics.r2_score(y_train, CV_pred_train).round(2))

    # Train/test predictions
    knn_regressor.fit(X_train_trans, y_train)
    train_r2 = knn_regressor.score(X_train_trans, y_train)
    y_train_pred = knn_regressor.predict(X_train_trans).round(3)
    train_r2_1 = metrics.r2_score(y_train, y_train_pred)
    y_test_pred = knn_regressor.predict(X_test_trans).round(3)

    train_r = stats.pearsonr(y_train, y_train_pred)
    abs_error_train = (y_train - y_train_pred)
    train_predictions = pd.DataFrame({'Actual': y_train,
                                      'Predicted': y_train_pred,
                                      'error': abs_error_train.round(3)})
    MAE_train = metrics.mean_absolute_error(y_train, y_train_pred)

    abs_error_test = (y_test_pred - y_test)
    test_predictions = pd.DataFrame({'Actual': y_test,
                                     'Predicted': y_test_pred,
                                     'error': abs_error_test.round(3)})
    test_r = stats.pearsonr(y_test, y_test_pred)
    test_r2 = metrics.r2_score(y_test, y_test_pred)
    MAE_test = metrics.mean_absolute_error(y_test, y_test_pred).round(3)
    print(test_predictions)
The train-set statistics are almost the same in both sklearn and WEKA kNN.
The sklearn predictions are:
Actual predicted error
6.00 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.440 0.480
7.96 7.150 -0.810
7.30 6.660 -0.640
6.68 7.200 0.520
***4.60 6.950 2.350***
and the WEKA predictions are:
Actual predicted error
6 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.44 0.48
7.96 7.15 -0.81
7.3 6.66 -0.64
6.68 7.2 0.52
***4.6 5.285 0.685***
The parameters used in both algorithms are: k=2, brute force for the distance calculation, metric: euclidean.
Any suggestions as to what causes the difference?
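Not an answer, but one way to investigate the single diverging point is to ask sklearn which two training neighbours it actually used; with k=2, tied distances can be broken differently by different implementations. A small diagnostic sketch, assuming the diverging sample is the last row of the test set:

# Inspect the neighbours sklearn picked for the (assumed) diverging test point.
# kneighbors returns the distances and indices into the training set.
distances, indices = knn_regressor.kneighbors(X_test_trans[-1:], n_neighbors=2)
print("distances to the 2 nearest neighbours:", distances)
print("their target values:", y_train.iloc[indices[0]].values)
# With uniform weights, the k=2 prediction is just the mean of those two targets.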
I am using sklearn for a classification task. I want to train my model on data from table "train" and test it on data from a different table "test". Both tables have exactly the same features, but different numbers of rows. I have the code below, but I am getting the error:
(<class 'ValueError'>, ValueError('Found input variables with inconsistent numbers of samples: [123, 174]',), <traceback object at 0x0000016476E10C48>).
What am I doing wrong?
get_train_data = 'select * from train;'
get_test_data = 'select * from test;'
df_train = pd.read_sql_query(get_train_data, con=connection)
df_test = pd.read_sql_query(get_test_data, con=connection)
X = df_train[:, 2:30]
Y = df_test[:, :30]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
model.fit(X_train, Y_train)
predictions = model.predict(X_test)
split_mat=confusion_matrix(Y_test, predictions)
If you want to train on dataframe df_train and test on dataframe df_test, why are you taking the features of df_train and the target column of df_test and passing them to the train_test_split function?
You can simply do the following:
import numpy as np

get_train_data = 'select * from train;'
get_test_data = 'select * from test;'

df_train = pd.read_sql_query(get_train_data, con=connection)
df_test = pd.read_sql_query(get_test_data, con=connection)

# Features and target both come from the same table
X_train = df_train.iloc[:, 2:30]
y_train = df_train.y  # assuming y is the name of your target variable in df_train

X_test = df_test.iloc[:, i:j]  # choose i and j so that you select the same columns as X_train
y_test = df_test.y  # assuming y is the name of your target variable in df_test

model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Do something with predictions, e.g.
np.mean(predictions == y_test)
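If you prefer library metrics over computing the mean by hand, a small sketch using sklearn's scoring helpers (same model and splits as above):

from sklearn.metrics import accuracy_score, confusion_matrix

# accuracy_score is equivalent to np.mean(predictions == y_test)
print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))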
Here is slightly modified code that I found here...
I am using the same logic as the original author and still not getting good accuracy. The Mean Reciprocal Rank is close (mine: 52.79, example: 48.04).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import label_ranking_average_precision_score
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical  # assuming Keras' to_categorical was intended

cv = CountVectorizer(binary=True, max_df=0.95)
feature_set = cv.fit_transform(df["short_description"])

X_train, X_test, y_train, y_test = train_test_split(
    feature_set, df["category"].values, random_state=2000)

scikit_log_reg = LogisticRegression(
    verbose=1, solver="liblinear", random_state=0, C=5, penalty="l2", max_iter=1000)
model = scikit_log_reg.fit(X_train, y_train)

target = to_categorical(y_test)
y_pred = model.predict_proba(X_test)
label_ranking_average_precision_score(target, y_pred)
>> 0.5279108613021547
model.score(X_test, y_test)
>> 0.38620071684587814
But the accuracy of the notebook sample (59.80) does not match my code (38.62).
Is the following function, used in the sample notebook, correctly returning accuracy?
def compute_accuracy(eval_items: list):
    correct = 0
    total = 0
    for item in eval_items:
        true_pred = item[0]
        machine_pred = set(item[1])
        for cat in true_pred:
            if cat in machine_pred:
                correct += 1
                break
    accuracy = correct / float(len(eval_items))
    return accuracy
The notebook code is checking whether the actual category is in the top 3 returned from the model:
def get_top_k_predictions(model, X_test, k):
    probs = model.predict_proba(X_test)
    # indices of the k highest-probability classes, in ascending order
    best_n = np.argsort(probs, axis=1)[:, -k:]
    preds = [[model.classes_[predicted_cat] for predicted_cat in prediction] for prediction in best_n]
    preds = [item[::-1] for item in preds]  # most probable class first
    return preds
If you replace the evaluation part of your code with the below, you'll see that your model returns a top-3 accuracy of 0.5980 as well:
...
model = scikit_log_reg.fit(X_train, y_train)
top_preds = get_top_k_predictions(model, X_test, 3)
pred_pairs = list(zip([[v] for v in y_test], top_preds))
print(compute_accuracy(pred_pairs))
# below is a simpler & more Pythonic version of compute_accuracy
print(np.mean([actual in pred for actual, pred in zip(y_test, top_preds)]))
When classifying text documents using scikit-learn, do you do cross-validation first and then feature extraction, or the other way around?
Here is my pipeline:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

union = FeatureUnion(
    transformer_list=[
        ('tfidf', TfidfVectorizer()),
        ('featureEx', FeatureExtractor()),    # custom transformer
        ('spell_chker', Spellingchecker()),   # custom transformer
    ],
    n_jobs=-1)
I am doing it in the following way, but I wonder whether I should extract the features first and then do the cross-validation. In this example, X is a list of documents and y is the label.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Feature extraction is fit on the training split only
X_train = union.fit_transform(X_train)
X_test = union.transform(X_test)

# Univariate feature selection on the extracted features
ch2 = SelectKBest(f_classif, k=7000)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)

clf = SVC(C=1, gamma=0.001, kernel='linear', probability=True).fit(X_train, y_train)

print("classification report:")
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
print()
Doing the feature selection first and then cross-validating on those features is common on text data, but it is less desirable: it can lead to over-fitting, and the cross-validation procedure may over-estimate your true accuracy.
When you do the feature selection first, the feature-selection process gets to look at all of the data. The point of cross-validation is to hide each held-out fold from the others; by doing the feature selection first, you leak knowledge about the held-out data into the other folds.
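A minimal sketch of the leak-free alternative: wrap the feature extraction and selection in a Pipeline, so that during cross-validation they are re-fit on the training folds only (this reuses the union, k=7000, and SVC settings from the question; cv=5 is an arbitrary choice):

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Each CV fold re-fits the vectorizers and SelectKBest on its own training
# portion only, so no information from the held-out fold leaks in.
pipe = Pipeline([
    ('features', union),
    ('select', SelectKBest(f_classif, k=7000)),
    ('clf', SVC(C=1, gamma=0.001, kernel='linear')),
])
scores = cross_val_score(pipe, X, y, cv=5)  # X: raw documents, y: labels
print(scores.mean())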