Here is my code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
train = pd.DataFrame({
    'users': ['John Johnson', 'John Smith', 'Mary Williams']
})
test = pd.DataFrame({
    'users': [None, np.nan, 'John Smith', 'Mary Williams']
})
ohe = OneHotEncoder(sparse=False,handle_unknown='ignore')
train_transformed = ohe.fit_transform(train)
test_transformed = ohe.transform(test)
print(test_transformed)
I expected the OneHotEncoder to be able to handle the np.nan in the test dataset, since handle_unknown='ignore'. But it raises a ValueError. It is able to handle the None value, though. Why is it failing? And how do I get around it (besides an imputer)?
From the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
it seemed that this was what handle_unknown is for.
This option is for unseen categorical values: when the test set contains a category that was not in the train set, handle_unknown='ignore' encodes it as all zeros instead of raising an error. If you put 'Steve Stevenson' in the test set, it would not return an error; it would return a row of all zeros. It does not cover missing values, though: np.nan is rejected by the encoder's input validation before the unknown-category handling is ever reached, which is why you get the ValueError.
train = pd.DataFrame({
    'users': ['John Johnson', 'John Smith', 'Mary Williams']
})
test = pd.DataFrame({
    'users': ['John Smith', 'Mary Williams', 'Steve Stevenson']
})
ohe = OneHotEncoder(sparse=False, handle_unknown = 'ignore')
ohe.fit(train)
test_transformed = ohe.transform(test)
print(test_transformed)
A solution to the None/NaN problem is to replace the missing values with a placeholder category, such as 'unknown', before encoding.
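For example, a minimal sketch of that workaround (the 'unknown' label is an arbitrary placeholder choice):

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'users': ['John Johnson', 'John Smith', 'Mary Williams']})
test = pd.DataFrame({'users': [None, np.nan, 'John Smith', 'Mary Williams']})

# fillna replaces both None and np.nan with the placeholder category
test_filled = test.fillna('unknown')

ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit(train)
# 'unknown' was never seen during fit, so those rows encode as all zeros
print(ohe.transform(test_filled))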
Hope this helps
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

final_poly_converter = PolynomialFeatures(degree=3, include_bias=False)
final_poly_features = final_poly_converter.fit_transform(X)
final_scaler = StandardScaler()
scaled_X = final_scaler.fit_transform(final_poly_features)
from sklearn.linear_model import Lasso
final_model = Lasso(alpha=0.004943070909225827,max_iter=1000000)
final_model.fit(scaled_X,y)
from joblib import dump,load
dump(final_model,'lasso_model.joblib')
dump(final_poly_converter,'lasso_poly_coverter.joblib')
dump(final_scaler,'scaler.joblib')
loaded_converter = load('lasso_poly_coverter.joblib')
loaded_model = load('lasso_model.joblib')
loaded_scaler = load('scaler.joblib')
campaign = [[149,22,12]]
transformed_data = loaded_converter.fit_transform(campaign)
scaled_data = loaded_scaler.transform(transformed_data)# fit_transform or only transform
loaded_model.predict(scaled_data)
The output values change depending on whether I use fit_transform() or transform().
You should always use fit_transform on your training data and only transform on test data and further predictions. If you refit your scaler on the test pool, your test set would have a different feature distribution than your train set, which is something you don't want. Think of the scaler parameters that you fit as part of the model's parameters: naturally, you fit all parameters on the training set and then leave them unchanged for test evaluation and prediction.
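Applied to the code above, the prediction path should only call transform (a minimal sketch reusing the saved artifacts from the question):

from joblib import load

loaded_converter = load('lasso_poly_coverter.joblib')
loaded_model = load('lasso_model.joblib')
loaded_scaler = load('scaler.joblib')

campaign = [[149, 22, 12]]
# transform only: reuse the polynomial expansion and scaling learned on the training set
transformed_data = loaded_converter.transform(campaign)
scaled_data = loaded_scaler.transform(transformed_data)
print(loaded_model.predict(scaled_data))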
I have been trying to fine-tune a BERT model to give response sentences like a character, based on input sentences, but I am getting a rather odd error every time. The code is:
Here source_texts is a list of sentences that give the context, and target_texts is a list of sentences that give the response to the context statements.
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("bert-base-cased").to(device)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
input_ids = []
output_ids = []
for i in range(len(source_texts)):
    input_ids.append(tokenizer.encode(source_texts[i], return_tensors="pt"))
    output_ids.append(tokenizer.encode(target_texts[i], return_tensors="pt"))
import torch
device = torch.device("cuda")
from transformers import BertForMaskedLM, AdamW
model = BertForMaskedLM.from_pretrained("bert-base-cased")
optimizer = AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()
def train(input_id, output_id):
    input_id = input_id.to(device)
    output_id = output_id.to(device)
    model.zero_grad()
    logits, _ = model(input_id, labels=output_id)
    # Compute the loss
    loss = loss_fn(logits.view(-1, logits.size(-1)), output_id.view(-1))
    loss.backward()
    optimizer.step()
    return loss.item()

for epoch in range(50):
    # Train the model on the training dataset
    train_loss = 0.0
    for input_sequences, output_sequences in zip(input_ids, output_ids):
        input_sequences = input_sequences.to(device)
        output_sequences = output_sequences.to(device)
        train_loss += train(input_sequences, output_sequences)
This is the error that I am getting:
Any help would be really appreciated. Please help!
Hi, I saw your code: you didn't move your model to the GPU, only the inputs. PyTorch puts everything on the CPU by default.
import torch
from transformers import BertForMaskedLM

device = torch.device('cuda')
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.to(device)
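As a side note, recent versions of transformers return a ModelOutput object rather than a tuple from the forward call, so the unpacking logits, _ = model(...) can also fail. A minimal sketch of the train step adjusted for that (assuming transformers 4.x, and reusing the model, optimizer, and device from above):

def train(input_id, output_id):
    input_id = input_id.to(device)
    output_id = output_id.to(device)
    model.zero_grad()
    # With labels supplied, BertForMaskedLM computes the masked-LM loss itself;
    # labels must align token-for-token with input_ids
    outputs = model(input_ids=input_id, labels=output_id)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    return loss.item()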
I am trying to find the best algorithm for my claims data. The claims data include some diagnosis codes which are alphanumeric, like 'EA43454'. When I run the code below to evaluate the models
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=None)
    cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
I get the error:
ValueError: could not convert string to float: 'U0003'
How do I handle these alphanumeric values?
You need to convert your strings to indicator (dummy) variables. Each value of the string variable has to be mapped to a number so that the models can train on the data.
Scikit-learn has several preprocessors to help you with this, such as OneHotEncoder. You can also use pandas.get_dummies, but using sklearn's own classes is more composable: for example, you can use them as part of a pipeline (see the sketch after the example below).
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
rng = np.random.default_rng()
animals = pd.DataFrame({"animal": rng.choice(["cat", "dog"], size=10),
                        "age": rng.integers(1, 20, size=10)})
# One-hot encode the categorical column only
animals_ohe = OneHotEncoder().fit_transform(animals.drop(columns=["age"]))
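And a minimal sketch of the pipeline route (the column name 'diagnosis_code' and the choice of classifier are assumptions, just for illustration):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Encode the hypothetical 'diagnosis_code' column, pass other columns through
preprocess = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), ["diagnosis_code"])],
    remainder="passthrough")

clf = Pipeline([("preprocess", preprocess),
                ("model", LogisticRegression())])
# clf.fit(X, y) now encodes the strings and trains the model in one step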
I use the code below to get multi-output predictions. But I get an error message when I try to plot the importances:
"ValueError: tree must be Booster, XGBModel or dict instance"
How do I fix this problem? Or is there any other way to get the feature importances?
import numpy as np
import xgboost as xgb
from xgboost import plot_importance
from sklearn.multioutput import MultiOutputRegressor
import matplotlib.pyplot as plt
X = np.array([[0,1,2,3,4],[2,3,4,5,6],[3,4,5,6,7]])
y = np.array([[2,3,4],[3,4,5],[4,5,6]])
model_ = MultiOutputRegressor(xgb.XGBRegressor(objective='reg:linear',n_jobs=-1))
model_.fit(X, y)
pred = model_.predict(X)
fig,ax = plt.subplots(figsize=(15,15))
plot_importance(model_,height=0.5,ax=ax,max_num_features=3)
plt.show()
I found the solution: plot_importance expects a single XGBoost model, and MultiOutputRegressor fits one XGBRegressor per target, storing them in its estimators_ attribute, so plot each of those individually.
fig, ax = plt.subplots(ncols=3, figsize=(15, 6))
plot_importance(model_.estimators_[0], height=0.5, ax=ax[0], max_num_features=20)
plot_importance(model_.estimators_[1], height=0.5, ax=ax[1], max_num_features=20)
plot_importance(model_.estimators_[2], height=0.5, ax=ax[2], max_num_features=20)
plt.show()
I have a function called svc_param_selection(X, y, n) which returns best_params_. Now I want to use the returned best_params as the parameters of a classifier, like:
parameters = svc_param_selection(X, y, 2)
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC
param_grid = ParameterGrid(parameters)
for params in param_grid:
    svc_clf = SVC(**params)
    print(svc_clf)
classifier2 = SVC(**svc_clf)
It seems parameters is not a grid here.
You can use GridSearchCV to do this. There is an example here:
# Applying GridSearch to find best parameters
from sklearn.model_selection import GridSearchCV
parameters = [{ 'criterion' : ['gini'], 'splitter':['best','random'], 'min_samples_split':[0.1,0.2,0.3,0.4,0.5],
'min_samples_leaf': [1,2,3,4,5]},
{'criterion' : ['entropy'], 'splitter':['best','random'], 'min_samples_split':[0.1,0.2,0.3,0.4,0.5],
'min_samples_leaf': [1,2,3,4,5]} ]
gridsearch = GridSearchCV(estimator = classifier, param_grid = parameters,refit= False, scoring='accuracy', cv=10)
gridsearch = gridsearch.fit(x,y)
optimal_accuracy = gridsearch.best_score_
optimal_parameters = gridsearch.best_params_
But for the param_grid of GridSearchCV, you should pass a dictionary of parameter names and values for your classifier. For example, for a classifier like this:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(random_state=0, presort=True,
criterion='entropy')
classifier = classifier.fit(x_train,y_train)
Then, after finding the best parameters with GridSearchCV, you apply them to your model.
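For instance, a minimal sketch of that last step (reusing the gridsearch object from the example above; since refit=False was passed, the model is re-created by hand):

from sklearn.tree import DecisionTreeClassifier

# best_params_ is a plain dict of parameter names to values for the classifier,
# so it can be unpacked straight into the constructor
best_clf = DecisionTreeClassifier(random_state=0, **gridsearch.best_params_)
best_clf.fit(x_train, y_train)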
@Ben At the start of the grid search, you either specify the classifier outside the param_grid (if you have only one classification method to check) or inside the param_grid. I have covered the 'inside' case only.
First, I set a 'classifier' key in the param_grid. That is the key you need to ask for at the end.
param_grid = [
{'classifier' : [LogisticRegression()],
...
},
{'classifier' : [RandomForestClassifier()],
}
]
As an example, the result of clfGridSearchBest.best_params_ is:
{'classifier': RandomForestClassifier(criterion='entropy', max_depth=2, n_estimators=2),
'classifier__criterion': 'entropy',
'classifier__max_depth': 2,
'classifier__min_samples_leaf': 1,
'classifier__n_estimators': 2}
Then look up the 'classifier' key in this clfGridSearchBest.best_params_ dictionary:
clfBest = clfGridSearchBest.best_params_['classifier']
clfBest:
RandomForestClassifier(criterion='entropy', max_depth=2, n_estimators=2)
Now just fit clfBest.
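For completeness, a minimal end-to-end sketch of this 'classifier inside the param_grid' pattern (the Pipeline step name 'classifier' is what makes the classifier__ prefixes work; the dataset here is just a placeholder):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, random_state=0)

# The 'classifier' step is swapped out by each dict in param_grid
pipe = Pipeline([('classifier', LogisticRegression())])
param_grid = [
    {'classifier': [LogisticRegression(max_iter=1000)],
     'classifier__C': [0.1, 1.0, 10.0]},
    {'classifier': [RandomForestClassifier()],
     'classifier__n_estimators': [2, 10],
     'classifier__max_depth': [2, 4]},
]

clfGridSearchBest = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
clfBest = clfGridSearchBest.best_params_['classifier']
clfBest.fit(X, y)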