Reverse column transform - machine-learning

In a machine learning model y_train is a string so column transform was applied in y_train before training the model ,How to remove the transform while printing the end result

All scikit-learn transformers have .inverse_transform method to do so. e.g. LabelEncoder https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

Related

Is it necessary to use cross-validation after data is split using StratifiedShuffleSplit?

I used StratifiedShuffleSplit to split the data and now I am wondering if I need to use cross-validation again as I go for building the classification model(Logistic Regression,KNN,Random Forest etc.) I am confused about it because reading the documentation in Sklearn I get the impression that StratifiedShuffleSplit is a mix of splitting the data and cross-validating it at the same time.
StratifiedShuffleSplit provides you just a list with train/test indices. How it will be used depends on you.
You can fit the model with train set and predict on the test and calculate the score manually - so implementing cross validation by yourself
Or you can use cross_val_score and pass StratifiedShuffleSplit() to it and cross_val_score will do the same thing.
Example:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=1, random_state=1)
# Calculate scores automatically
accuracy_per_split = cross_val_score(model, X, y, scoring="accuracy", cv=sss, n_jobs=1)
print(f"Accuracies per splits: {accuracy_per_split}")
# Calculate scores manually
accuracy_per_split = []
for train_index, test_index in sss.split(X, y):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
accuracy_per_split.append(acc)
print(f"Accuracies per splits: {accuracy_per_split}")

Naive Bayes model from scikit learn

What is the difference between the two gnb.fit()?
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train.ravel())
gnb.fit(X_train, y_train).predict(X_test)
I see two differences:
np.ravel flattens arrays, e.g.
x = np.array([[1, 2, 3], [4, 5, 6]])
np.ravel(x)
array([1, 2, 3, 4, 5, 6])
gnb.fit returns the model itself, and it can be used to immediately predict. Predicting does not modify the model.
If your y_train is one-dimensional, you'll get the same model regardless of whether you call gbn.fit(x, y) or gbn.fit(x, y.ravel()) because y == y.ravel()

how to predict new values when a machine learning model was standardized StandardScaler

I'm working on a machine learning model, I have a dataframe with the data
I normalize the data with a standard distribution
scaler = StandardScaler()
df = scaler.fit_transform(df)
I divide the datasets into target and characteristics
X_df = df[X_characteristics_list]
y_df = df[target]
I split into train and test then I train the model
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size = 0.25)
forest = RandomForestRegressor()
forest.fit(X_train, y_train)
I predict the test to validate the effectiveness
y_test_pred = forest.predict(X2_test)
mse = mean_squared_error(y_test, y_test_pred)
But when is time to test in real life I need to leave the model ready to predict
If i Want to predict just one record
let say [100,20,34]
I can't because I need the record standardized, and transform it with StandardScaler does not work because it depends on standard deviation so I would need the original dataset
What's the best way to solve this problem.
See below:
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.preprocessing import StandardScaler
# Create our input and output matrices
>>> X, y = make_classification()
# Split train-test... "test" will be production/unobserved/"real-life" data
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
# What does X_train look like?
>>> X_train
array([[-0.08930702, -2.71113991, -0.93849926, ..., 0.21650905,
0.68952722, 0.61365789],
[-0.31143977, -1.87817904, 0.08287492, ..., -0.41332943,
-0.58967179, 1.7239411 ],
[-1.62287589, 1.10691318, -0.630556 , ..., -0.35060008,
1.11270562, 0.08106694],
...,
[-0.59797041, 0.90218081, 0.89983074, ..., -0.54374315,
1.18534841, -0.03397969],
[-1.2006559 , 1.01890955, -1.21617181, ..., 1.76263322,
1.38280423, -1.0192972 ],
[ 0.11883425, 1.42952643, -1.23647358, ..., 1.02509208,
-1.14308885, 0.72096531]])
# Let's scale it
>>> scaler = StandardScaler()
>>> X_train = scaler.fit_transform(X_train)
>>> X_train
array([[ 0.08867642, -1.97950269, -1.1214106 , ..., 0.22075623,
0.57844552, 0.46487917],
[-0.10736984, -1.34896243, 0.00808597, ..., -0.37670234,
-0.6045418 , 1.57819736],
[-1.26479555, 0.91071257, -0.78086855, ..., -0.3171979 ,
0.96979563, -0.06916763],
...,
[-0.36025134, 0.7557329 , 0.91152449, ..., -0.50041152,
1.03697478, -0.18452874],
[-0.89215959, 0.84409499, -1.42847749, ..., 1.68739437,
1.21957946, -1.17253964],
[ 0.27237431, 1.15492649, -1.4509284 , ..., 0.98777012,
-1.116335 , 0.57247992]])
# Fit the model
>>> model = LogisticRegression()
>>> model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
# Now let's use the already-fitted StandardScaler object to simply transform
# *not fit_transform* the test data
>>> X_test = scaler.transform(X_test)
>>> model.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0,
0, 0, 0])
Note that using joblib or pickle you can save the scaler object and re-load it for scaling in "real-time" later on.

Meta classifier based on "or" logic in scikit-learn

How can I build a meta-classifier in scikit-learn out of N binary classifiers which will return 1 if any of the classifiers returns 1?
Currently I've tried VotingClassifier, but it lacks the logic that I need, both with voting equal to hard and soft. Pipeline seems to be oriented towards sequential computation
I can write the logic by myself, but I am wondering if there is anything built-in?
The built-in options are only soft and hard voting. As you mentioned, we can create a custom function to this meta-classifier, which uses OR logic based on the source code here. This custom meta classifier can fit into the pipeline as well.
from sklearn.utils.validation import check_is_fitted
class CustomMetaClassifier(VotingClassifier):
def predict(self, X):
""" Predict class labels for X.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
The input samples.
Returns
----------
maj : array-like, shape = [n_samples]
Predicted class labels.
"""
check_is_fitted(self, 'estimators_')
maj = np.max(eclf1._predict(X), 1)
maj = self.le_.inverse_transform(maj)
return maj
>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.ensemble import RandomForestClassifier, VotingClassifier
>>> clf1 = LogisticRegression(solver='lbfgs', multi_class='multinomial',
... random_state=1)
>>> clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
>>> clf3 = GaussianNB()
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> y = np.array([1, 1, 1, 2, 2, 2])
>>> eclf1 = CustomMetaClassifier(estimators=[
... ('lr', clf1), ('rf', clf2), ('gnb', clf3)])
>>> eclf1 = eclf1.fit(X, y)
>>> eclf1.predict(X)
array([1, 1, 1, 2, 2, 2])

Combining Principal Component Analysis and Support Vector Machine in a pipeline

I want to combine PCA and SVM to a pipeline, to find the best combination of hyperparameters in a GridSearch.
The following code
from sklearn.svm import SVC
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
digits = datasets.load_digits()
X_train = digits.data
y_train = digits.target
#Use Principal Component Analysis to reduce dimensionality
# and improve generalization
pca = decomposition.PCA()
# Use a linear SVC
svm = SVC()
# Combine PCA and SVC to a pipeline
pipe = Pipeline(steps=[('pca', pca), ('svm', svm)])
# Check the training time for the SVC
n_components = [20, 40, 64]
svm_grid = [
{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
{'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
estimator = GridSearchCV(pipe,
dict(pca__n_components=n_components,
svm=svm_grid))
estimator.fit(X_train, y_train)
Results in an
AttributeError: 'dict' object has no attribute 'get_params'
There is probably something wrong with the way I define and use svm_grid. How can I pass this parameter combination to GridSearchCV correctly?
The problem was that when the GridSearchCV tried to give the estimator the parameters:
if parameters is not None:
estimator.set_params(**parameters)
the estimator here was a Pipeline object, not the actual svm because of the naming in your parameters grid.
I believe it should be like this:
from sklearn.svm import SVC
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
digits = datasets.load_digits()
X_train = digits.data
y_train = digits.target
# Use Principal Component Analysis to reduce dimensionality
# and improve generalization
pca = decomposition.PCA()
# Use a linear SVC
svm = SVC()
# Combine PCA and SVC to a pipeline
pipe = Pipeline(steps=[('pca', pca), ('svm', svm)])
# Check the training time for the SVC
n_components = [20, 40, 64]
params_grid = {
'svm__C': [1, 10, 100, 1000],
'svm__kernel': ['linear', 'rbf'],
'svm__gamma': [0.001, 0.0001],
'pca__n_components': n_components,
}
estimator = GridSearchCV(pipe, params_grid)
estimator.fit(X_train, y_train)
print estimator.best_params_, estimator.best_score_
Output:
{'pca__n_components': 64, 'svm__C': 10, 'svm__kernel': 'rbf', 'svm__gamma': 0.001} 0.976071229827
Incorporating all of your parameters in params_grid and naming them correspondingly to the named steps.
Hope this helps! Good luck!

Resources