I am building a NearestNeighbors recommendation model that takes a list of words and recommends similar words, and I want to tune the value of n_neighbors. This is the code I typed out.
from sklearn.model_selection import GridSearchCV
gs_clf = GridSearchCV(NearestNeighbors(algorithm='brute'),
                      {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]},
                      scoring='f1', cv=5)
gs_clf = gs_clf.fit(transformed_courses_new , np.array(courses.code))
transformed_courses_new is an array of shape (159, 120), and np.array(courses.code) has shape (159,), where each value is a unique label. My understanding was that the grid search would test all the values of n_neighbors and report the value of k for which the f1 score is highest. But when I ran the code, I got an error saying that NearestNeighbors has no .predict method.
Is there any workaround for this?
Any help is appreciated.
Use KNeighborsClassifier.
(Keep in mind that kNN suffers from the curse of dimensionality.)
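As a minimal sketch of that change (reusing transformed_courses_new and courses from your question; 'f1_macro' is swapped in because plain 'f1' only supports binary targets):
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# KNeighborsClassifier implements .predict, so GridSearchCV can score it
gs_clf = GridSearchCV(KNeighborsClassifier(algorithm='brute'),
                      {'n_neighbors': list(range(1, 11))},
                      scoring='f1_macro',  # multiclass-capable f1 variant
                      cv=5)
gs_clf.fit(transformed_courses_new, np.array(courses.code))
print(gs_clf.best_params_)
Note that cross-validation needs every label to appear in more than one fold; if each of your 159 rows really has a unique label, the stratified split will still fail, so this only works once labels repeat.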
The general terms I used to search on Google, such as "localised accuracy", "custom accuracy", and "biased cost functions", all seem wrong, and maybe I am not even asking the right questions.
Imagine I have some data, be it:
The famous iris classification problem
Pictures of felines
A dataset I made up on predicting house prices
In all these scenarios, I am really interested in the accuracy of one class / one range of the regression output.
For irises, I really need Iris setosa to be classified correctly; I really don't care if Iris virginica and Iris versicolor are all wrong.
For felines, I really need the model to tell me if it spotted a tiger (for obvious reasons); whether it is a Persian or a ragdoll I don't really care.
For house prices, I want the error on higher-end houses to be minimised, because errors on those are costly.
How do I do this? If I want setosa to be classified correctly, removing virginica or versicolor both seem wrong. Trying different algorithms like linear models or SVMs is all well and good, but it only improves the OVERALL accuracy. I really need, for example, "tigers" to be predicted correctly, even at the expense of the "overall" accuracy of the model.
Is there a way to use a custom cost function that gives me high accuracy in a localised region for a regression problem, or for a specific category in a classification problem?
If this cannot be answered, pointing me to some terms I can search/research would still be greatly appreciated.
You can use weights to achieve that. If you're using the SVC class of scikit-learn, you can pass class_weight to the constructor. You can also pass sample_weight to the fit method.
For example:
from sklearn import svm
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Class 0 (setosa) is weighted three times as heavily as the other classes
clf = svm.SVC(class_weight={0: 3, 1: 1, 2: 1})
clf.fit(X, y)
This way setosa is more important than the other classes.
Example for regression
from sklearn.linear_model import LinearRegression

X = ...  # features
y = ...  # house prices

# Weight houses above some chosen price threshold three times as heavily
weights = []
for house_price in y:
    if house_price > threshold:
        weights.append(3)
    else:
        weights.append(1)

clf = LinearRegression()
clf.fit(X, y, sample_weight=weights)
I have written the following code to generate a confusion matrix
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

mnb = MultinomialNB()
mnb.fit(X_train, Y_train)

sms2 = "REMINDER FROM O2: To get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode"
sms = "You've Won!"

X_test = [str(item) for item in X_test]
Y_pred = mnb.predict(vec.transform(X_test))

mat = confusion_matrix(Y_test, Y_pred)
print(mat)

names = ["non-spam", "spam"]
print(names)

sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=names, yticklabels=names)
plt.xlabel('Actual [Truth]')
plt.ylabel('Predicted')
plt.show()
It generated a confusion matrix heatmap, but I am not sure if the axes are labelled correctly, i.e. whether the x-axis should be Actual and the y-axis Predicted, or the other way around.
According to the scikit-learn documentation, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html, the first argument should be the ground-truth values and the second should be the values predicted by the classifier.
If that is the case, then the returned confusion matrix has rows corresponding to the true classes and columns corresponding to the predicted classes.
I think your labels are backward based on that. See also the example https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_display_object_visualization.html#sphx-glr-auto-examples-miscellaneous-plot-display-object-visualization-py
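For instance, a corrected version of the plotting lines from the question might look like this (just a sketch reusing mat and names from above):
# Rows of confusion_matrix(y_true, y_pred) are the true classes and
# columns are the predicted classes, so the axis labels go the other way:
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=names, yticklabels=names)
plt.xlabel('Predicted')
plt.ylabel('Actual [Truth]')
plt.show()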
I'm learning Machine Learning and I'm facing a mismatch I can't explain.
I have a grid to compute the best model, according to the accuracy returned by GridSearchCV.
model = sklearn.neighbors.KNeighborsClassifier()

n_neighbors = [3, 4, 5, 6, 7, 8, 9]
weights = ['uniform', 'distance']
algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute']
leaf_size = [20, 30, 40, 50]
p = [1]
param_grid = dict(n_neighbors=n_neighbors, weights=weights,
                  algorithm=algorithm, leaf_size=leaf_size, p=p)

grid = sklearn.model_selection.GridSearchCV(estimator=model, param_grid=param_grid,
                                            cv=5, n_jobs=1)
SGDgrid = grid.fit(data1, targetd_simp['VALUES'])
print("SGD Classifier: ")
print("Best: ")
print(SGDgrid.best_score_)
value=SGDgrid.best_score_
print("params:")
print(SGDgrid.best_params_)
print("Best estimator:")
print(SGDgrid.best_estimator_)
y_pred_train=SGDgrid.best_estimator_.predict(data1)
print(sklearn.metrics.confusion_matrix(targetd_simp['VALUES'],y_pred_train))
print(sklearn.metrics.accuracy_score(targetd_simp['VALUES'],y_pred_train))
The results I get are the following:
SGD Classifier:
Best:
0.38694539229180525
params:
{'algorithm': 'auto', 'leaf_size': 20, 'n_neighbors': 8, 'p': 1, 'weights': 'distance'}
Best estimator:
KNeighborsClassifier(leaf_size=20, n_neighbors=8, p=1, weights='distance')
[[4962    0    0]
 [   0 4802    0]
 [   0    0 4853]]
1.0
This model is probably highly overfitted. I still have to check that, but it's not the point of this question.
So, basically, if I understand correctly, GridSearchCV finds a best accuracy score of 0.3869 (quite poor) across the cross-validation folds, yet the final confusion matrix is perfect, as is the accuracy computed from it. That doesn't make much sense to me... how can a model that is, in theory, bad perform so well?
I also added scoring = 'accuracy' in GridSearchCV to be sure that the returned value is actually accuracy, and it returns exactly the same value.
What am I missing here?
The behavior you are describing is rather normal and to be expected. You should know that GridSearchCV has a parameter refit, which is set to True by default. It triggers the following:
Refit an estimator using the best found parameters on the whole dataset.
This means that the estimator returned by best_estimator_ has been refit on your whole dataset (data1 in your case). The estimator has therefore already seen this data during training and, as expected, performs especially well on it. You can easily reproduce this with the following example:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
X, y = make_classification(random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
search = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': [3, 4, 5]})
search.fit(X_train, y_train)
print(search.best_score_)
>>> 0.8533333333333333
print(accuracy_score(y_train, search.predict(X_train)))
>>> 0.9066666666666666
While this is not as impressive as in your case, it is still a clear result. During cross-validation, the model is validated against one fold that was not used for training the model, and thus against data the model has not seen before. In the second case, however, the model already saw all the data during training, and it is to be expected that it will perform better on it.
To get a better feeling of the true model performance, you should use a holdout set with data the model has not seen before:
print(accuracy_score(y_test, search.predict(X_test)))
>>> 0.76
As you can see, the model performs considerably worse on this data and shows us that the former metrics were all a bit too optimistic. The model did in fact not generalize that well.
In conclusion, your result is not surprising and has an easy explanation. The high discrepancy in scores is impressive but still follows the same logic and is actually just a clear indicator of overfitting.
I have the following code which performs 5-fold cross validation and returns several metric values.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

iris = load_iris()
clf = SVC()
scoring = {'acc': 'accuracy',
           'prec_macro': 'precision_macro',
           'rec_macro': 'recall_macro'}
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                        cv=5, return_train_score=True)
I want to know if this can be modified to print the predicted values for each fold.
If you're using sklearn you can use cross_val_predict:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(clf_name,X_train,y_train_5,cv=3)
cross_val_score gives a score for each fold, while cross_val_predict gives a prediction for each sample.
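Applied to the iris setup from the question above, a minimal sketch would be:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

iris = load_iris()
clf = SVC()

# Each sample is predicted by a model trained on the folds that did not contain it
y_pred = cross_val_predict(clf, iris.data, iris.target, cv=5)
print(y_pred[:10])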
Since I also needed this feature in scikit-learn, I've hacked the code in my sklearn repo.
If you still need this, you can find it on my GitHub, on the group_cv branch:
https://github.com/robbisg/scikit-learn/tree/group_cv
The modified cross_validate function is here:
https://github.com/robbisg/scikit-learn/blob/group_cv/sklearn/model_selection/_validation.py
You need to call cross_validate with return_predictions=True.
Hope this helps.
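Based on that description, the call would presumably look as follows; note that return_predictions exists only on the group_cv branch, not in official scikit-learn, and where the predictions land in the returned dict is an assumption:
# Requires the forked scikit-learn from the group_cv branch linked above
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring,
                        cv=5, return_train_score=True,
                        return_predictions=True)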
I am new to statistics, Python, machine learning, and scikit-learn. However, I am trying a project where I have a CSV with 35 columns of student data. The first column is an ID, which I think I can ignore. The last 3 columns are the grade 1, grade 2, and grade 3 scores. I have 400 rows. I want to see if I can learn some machine learning with it and make some sense of the data I have. Now, I understand scikit-learn works on NumPy arrays, which do not handle categorical data like sex ('male', 'female') and so on. So I encoded all 30 categorical columns numerically, with 1 for male, 2 for female, and so on and so forth. I then did the following:
import numpy as np
from sklearn.linear_model import LinearRegression

X = my_data[:, 1:33]
y = my_data[:, 34]

model = LinearRegression()
model.fit(X, y)

expected = y
predicted = model.predict(X)
mse = np.mean((predicted - expected) ** 2)
print(mse)
print(model.score(X, y))
I got an MSE of 6.0839840461 and a model score of 0.709407474898.
So I got some results; so far so good for a first attempt. However, I realized that since I assigned increasing code values (1 for male, 2 for female), the linear regression will have treated them as ordered numeric values. How do I replace the gender column with [1,0] or [0,1], which I've learned is the right way to represent categorical data? Would it be a dictionary-type column or a list-type column? If so, how would it become part of the NumPy array?
These are called indicator or dummy variables, and Pandas makes it easy to encode such categorical values:
>>> import pandas as pd
>>> pd.get_dummies(['male', 'female'])
female male
0 0 1
1 1 0
Don't forget about multicollinearity, though: algorithms like linear regression rely on the independence of variables, while in your case female=0 definitely implies male=1. In that case, simply remove one dummy variable (e.g. use only the female column and not male), as in the sketch below.
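For example, get_dummies can drop one level directly; a small sketch (drop_first removes the first level in sorted order, here female):
>>> import pandas as pd
>>> pd.get_dummies(['male', 'female', 'female'], drop_first=True)
   male
0     1
1     0
2     0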
There is also a LabelEncoder() in the sklearn.preprocessing package:
from sklearn import preprocessing

le1 = preprocessing.LabelEncoder()
y = le1.fit_transform(y)  # fit_transform, not just transform: the encoder must learn the classes first
You can also inverse transform back with le1.inverse_transform(y).
The encoding is determined automatically, though; you cannot change the order (classes are assigned codes in sorted order).
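A small round-trip sketch with made-up labels:
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> encoded = le.fit_transform(['male', 'female', 'female', 'male'])
>>> le.classes_   # classes are stored in sorted order
array(['female', 'male'], dtype='<U6')
>>> encoded
array([1, 0, 0, 1])
>>> le.inverse_transform(encoded)
array(['male', 'female', 'female', 'male'], dtype='<U6')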