I used sklearn's GridSearchCV to search the number of topics for an LDA model. After fitting, the best fitted model is saved in CV_model.best_estimator_. Based on the sklearn documentation, GridSearchCV has the default option 'refit, default=True', which will 'Refit an estimator using the best found parameters on the whole dataset.' (see the sklearn GridSearchCV documentation).
Since the documentation says it has already been fit on the full data, I believed 'CV_model.best_estimator_.fit_transform(full_train_data)' should give the same result as 'CV_model.best_estimator_.transform(full_train_data)'. However, the outputs of fit_transform and transform differ. What did I miss? Should I use fit_transform or transform after GridSearchCV?
I realized it might be due to the unfixed random state; after I assigned a fixed random state, .transform() and .fit_transform() returned the same results.
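For anyone hitting the same thing, here is a minimal, self-contained sketch of the fix; the toy count matrix and the n_components grid are made up and only stand in for the real setup:

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

# Toy document-term count matrix standing in for the real training data.
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(100, 20))

# Fixing random_state makes the LDA fit deterministic, so the refit
# best_estimator_ and a fresh fit_transform on the same data should agree.
lda = LatentDirichletAllocation(random_state=0)
CV_model = GridSearchCV(lda, {"n_components": [3, 5, 7]})  # refit=True by default
CV_model.fit(X)

topics_transform = CV_model.best_estimator_.transform(X)
topics_fit_transform = CV_model.best_estimator_.fit_transform(X)
print(np.allclose(topics_transform, topics_fit_transform))  # expected: True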
I am trying to use GridSearchCV to optimize a pipeline that does feature selection at the beginning and classification with KNN at the end. I have fitted the model on my data set, but when I look at the best parameters found by GridSearchCV, it only gives the best parameters for SelectKBest. I have no idea why it doesn't show the best parameters for KNN.
Here is my code.
Addition of KNN and SelectKBest
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

classifier = KNeighborsClassifier()
parameters = {"classify__n_neighbors": list(range(5, 15)),
              "classify__p": [1, 2]}
sel = SelectKBest(f_classif)
param = {'kbest__k': [10, 20, 30, 40, 50]}
GridSearchCV with pipeline and parameter grid
model = GridSearchCV(Pipeline([('kbest', sel), ('classify', classifier)]),
                     param_grid=[param, parameters], cv=10)
fitting the model
model.fit(X_new, y)
the result
print(model.best_params_)
{'kbest__k': 40}
That's an incorrect way of merging dicts, I believe. Try
param_grid={**param,**parameters}
or (Python 3.9+)
param_grid=param|parameters
When param_grid is a list, the disjoint union of the grids generated by each dictionary in the list is explored. So your search covers (1) the default k=10 selected features with every combination of classifier parameters, and separately (2) the default classifier parameters with each value of k. That the best parameters show only k=40 means that having more features, even with the default classifier, performed best. You can check your cv_results_ to verify.
As dx2-66's answer says, merging the dictionaries will generate the full grid you are probably after. You could also just define a single dictionary from the start, as in the sketch below.
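For concreteness, a sketch of the single merged grid, reusing sel, classifier, X_new and y from the question (so it assumes that code has already run):

# One dictionary -> the full Cartesian product of all three parameters.
param_grid = {
    "kbest__k": [10, 20, 30, 40, 50],
    "classify__n_neighbors": list(range(5, 15)),
    "classify__p": [1, 2],
}

model = GridSearchCV(Pipeline([('kbest', sel), ('classify', classifier)]),
                     param_grid=param_grid, cv=10)
model.fit(X_new, y)
print(model.best_params_)  # now reports kbest__k, n_neighbors and p together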
I am trying to figure out how to train a gbdt classifier with lightgbm in python, but getting confused with the example provided on the official website.
Following the steps listed, I find that the validation data comes from nowhere, and there is no clue about the format of valid_data, nor about the benefit of training the model with or without it.
Another question that comes with it: the documentation says that "the validation data should be aligned with training data", but when I look into the Dataset details, I find another statement saying "If this is Dataset for validation, training data should be used as reference".
My final questions are: why should validation data be aligned with training data? What is the meaning of reference in Dataset, and how is it used during training? Is the alignment goal accomplished by setting reference to the training data? What is the difference between this "reference" strategy and cross-validation?
Hope someone could help me out of this maze, thanks!
The idea of "validation data should be aligned with training data" is simple:
every preprocessing step you apply to the training data, you should apply in exactly the same way to the validation data, and in production of course. This applies to every ML algorithm.
For example, for neural networks you will often normalize your training inputs (subtract the mean and divide by the standard deviation).
Suppose you have a variable "age" with a mean of 26 in the training set. It will be mapped to 0 for the training of your neural network. For the validation data, you want to normalize in the same way as the training data (using the training mean and training std) so that 26 in validation is still mapped to 0 (same value -> same prediction).
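A tiny sketch of that idea with plain numpy (the age values are made up):

import numpy as np

age_train = np.array([22.0, 26.0, 30.0])
age_valid = np.array([26.0, 40.0])

# Statistics come from the training set only.
mu, sigma = age_train.mean(), age_train.std()

age_train_norm = (age_train - mu) / sigma
age_valid_norm = (age_valid - mu) / sigma
print(age_valid_norm[0])  # 26 in validation maps to 0.0, exactly as in training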
This is the same for LightGBM. The data will be "bucketed" (in short, every continuous value is discretized), and you want to map continuous values to the same bins in training and in validation. Those bins are calculated using the "reference" dataset.
Regarding training without validation: this is something you don't want to do most of the time! It is very easy to overfit the training data with boosted trees if you don't have a validation set to tune parameters such as num_boost_round.
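As a rough sketch of how this is usually wired up with the native lightgbm API (the toy data and parameter values are made up, and the lgb.early_stopping callback assumes a reasonably recent lightgbm version):

import numpy as np
import lightgbm as lgb

rng = np.random.RandomState(0)
X_train, y_train = rng.rand(200, 5), rng.randint(0, 2, 200)
X_valid, y_valid = rng.rand(50, 5), rng.randint(0, 2, 50)

dtrain = lgb.Dataset(X_train, label=y_train)
# reference=dtrain -> the validation features are bucketed with the bins
# computed on the training data, so the two sets stay "aligned".
dvalid = lgb.Dataset(X_valid, label=y_valid, reference=dtrain)

params = {"objective": "binary", "verbosity": -1}
booster = lgb.train(
    params,
    dtrain,
    num_boost_round=200,
    valid_sets=[dvalid],
    callbacks=[lgb.early_stopping(stopping_rounds=10)],  # stop once validation stops improving
)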
Still, everything is tricky. Can you share a full example with and without using this "reference="? For example, will this be different
import lightgbm as lgbm
importance_type_LGB = 'gain'
d_train = lgbm.Dataset(train_data_with_NANs, label=target_train)
d_valid = lgbm.Dataset(train_data_with_NANs, reference=target_train)
lgb_clf = lgbm.LGBMClassifier(class_weight='balanced', importance_type=importance_type_LGB)
lgb_clf.fit(test_data_with_NANs, target_train)
test_data_predict_proba_lgb = lgb_clf.predict_proba(test_data_with_NANs)
from
import lightgbm as lgbm
importance_type_LGB = 'gain'
lgb_clf = lgbm.LGBMClassifier(class_weight='balanced', importance_type=importance_type_LGB)
lgb_clf.fit(test_data_with_NANs, target_train)
test_data_predict_proba_lgb = lgb_clf.predict_proba(test_data_with_NANs)
I want to know whether there is any way to partially save a scikit-learn machine learning model and reload it again to continue training it from the point it was saved.
For scikit-learn models applied to, for example, sentiment analysis, I suspect you need to save two important things: 1) your model and 2) your vectorizer.
Remember that after training your model, each text is represented by a vector of length N, where N is defined by the total number of words in the vocabulary.
Below is a piece from my test model and test vectorizer, saved in order to be used later.
SAVING THE MODEL
import pickle
pickle.dump(vectorizer, open("model5vectorizer.pickle", "wb"))
pickle.dump(classifier_fitted, open("model5.pickle", "wb"))
LOADING THE MODEL IN A NEW SCRIPT (.py)
import pickle
model = pickle.load(open("model5.pickle", "rb"))
vectorizer = pickle.load(open("model5vectorizer.pickle", "rb"))
TEST YOUR MODEL
sentence_test = ["Results by Andutta et al (2013), were completely wrong and unrealistic."]
USING THE VECTORIZER (model5vectorizer.pickle) !!
sentence_test_data = vectorizer.transform(sentence_test)
print("### sentence_test ###")
print(sentence_test)
print("### sentence_test_data ###")
print(sentence_test_data)
# OBS-1: VECTOR HERE WILL HAVE SAME LENGTH AS BEFORE :)
# OBS-2: If you load the default vectorizer or a different one, then you may see the following problems
# sklearn.exceptions.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.
# # ValueError: X has 8 features per sample; expecting 11
result1 = model.predict(sentence_test_data) # using saved vectorizer from calibrated model
print("### RESULT ###")
print(result1)
Hope that helps.
Regards,
Andutta
When a data set is fitted to a scikit-learn machine learning model, the model is trained and ready to be used for prediction. If you train a model with, say, 100 samples, use it, and then go back and fit another 50 samples to it, you will not make it better incrementally; calling fit again rebuilds the model from scratch on those 50 samples.
If your purpose is to build a model that becomes more powerful as it sees more samples, you are thinking of an online, real-time setting, such as a mobile robot mapping an environment with a Kalman filter.
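A small sketch of that "rebuild" behaviour on toy data (LogisticRegression is just an arbitrary example estimator):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_first, y_first = rng.rand(100, 3), rng.randint(0, 2, 100)
X_later, y_later = rng.rand(50, 3), rng.randint(0, 2, 50)

clf = LogisticRegression()
clf.fit(X_first, y_first)
coef_after_first_fit = clf.coef_.copy()

# Calling fit again does NOT continue from the previous state:
# the model is re-estimated using only the 50 new samples.
clf.fit(X_later, y_later)
print(np.allclose(coef_after_first_fit, clf.coef_))  # almost surely False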
I have a document binomial classifier that uses a tf-idf representation of a training set of documents and applies Logistic Regression to it:
lr_tfidf = Pipeline([('vect', tfidf),('clf', LogisticRegression(random_state=0))])
lr_tfidf.fit(X_train, y_train)
I saved the model with pickle and use it to classify new documents:
text_model = pickle.load(open('text_model.pkl', 'rb'))
results = text_model.predict_proba(new_document)
How can I get the representation (features + frequencies) used by the model for this new document without explicitly computing it?
EDIT: I am trying to explain better what I want to get.
When I use predict_proba, I guess that the new document is represented as a vector of term frequencies (according to the rules used in the stored model), and those frequencies are multiplied by the coefficients learnt by the logistic regression model to predict the class. Am I right? If yes, how can I get the terms and term frequencies of this new document, as used by predict_proba?
I am using sklearn v 0.19
As I understand from the comments, you need to access the TfidfVectorizer from inside the pipeline. This can be done easily by:
tfidfVect = text_model.named_steps['vect']
Now you can use the transform() method of the vectorizer to get the tfidf values.
tfidf_vals = tfidfVect.transform(new_document)
The tfidf_vals will be a sparse matrix of single row containing the tfidf of terms found in the new_document. To check what terms are present in this matrix, you need to use tfidfVect.get_feature_names().
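To pair the terms with their tf-idf values for the new document, something along these lines should work, reusing tfidfVect and tfidf_vals from above (get_feature_names() exists in sklearn 0.19; on recent versions it has been replaced by get_feature_names_out()):

feature_names = tfidfVect.get_feature_names()  # get_feature_names_out() on recent sklearn
row = tfidf_vals.tocoo()  # sparse row -> (column index, value) pairs

for idx, value in zip(row.col, row.data):
    print(feature_names[idx], value)  # a term present in new_document and its tf-idf weight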
I am using pipeline and GridSearchCV to select features automatically. Since the data set is small, I set the parameter 'cv' of GridSearchCV to StratifiedShuffleSplit. The code looks as follows:
selection = SelectKBest()
clf = LinearSVC()
pipeline = Pipeline([("select", selection), ("classify", clf)])
cv = StratifiedShuffleSplit(n_splits=50)
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv = cv)
grid_search.fit(X, y)
It seems that SelectKBest acts on the training data of each split instead of on the whole data set (the latter is what I want), since the result differs if I separate 'select' and 'classify', in which case StratifiedShuffleSplit will for sure only act on the classifier.
What is the correct way of using pipeline and GridSearchCV in this case? Thanks a lot!
Cross-validating the whole pipeline (i.e. running SelectKBest only on the training part of each split) is the way to go. Otherwise the model is allowed to look at the test part, which means the quality estimates become wrong. The best hyperparameters found with these unfair quality estimates may also work worse on truly unseen data.
At prediction time you're not going to re-run SelectKBest on (training dataset + the example to be predicted) and then re-train the classifier, so why would you do that in evaluation?
You can also find the answer to this question on pages 245-246 of the book "The Elements of Statistical Learning" (2nd edition) by Hastie et al.
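A hedged sketch of why this matters, using pure-noise data where an honest score should hover around chance level (k=20, LinearSVC and the toy shapes stand in for whatever param_grid the question actually used):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedShuffleSplit

rng = np.random.RandomState(0)
X = rng.randn(50, 1000)  # pure noise: there is no real signal to find
y = rng.randint(0, 2, 50)

cv = StratifiedShuffleSplit(n_splits=50, random_state=0)

# Wrong: select features on the WHOLE data set, then cross-validate only the classifier.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(LinearSVC(), X_leaky, y, cv=cv).mean()

# Right: selection happens inside the pipeline, on the training part of each split only.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)), ("classify", LinearSVC())])
honest_score = cross_val_score(pipe, X, y, cv=cv).mean()

print(leaky_score, honest_score)  # the leaky score is typically far above ~0.5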