How to use plot_importance function with MultiOutputRegressor? - machine-learning

I use the code below to get multi-output predictions, but I get the following error message when I try to plot feature importance:
"ValueError: tree must be Booster, XGBModel or dict instance"
How can I fix this problem? Or is there another way to get the feature importance?
import numpy as np
import xgboost as xgb
from xgboost import plot_importance
from sklearn.multioutput import MultiOutputRegressor
import matplotlib.pyplot as plt

X = np.array([[0,1,2,3,4],[2,3,4,5,6],[3,4,5,6,7]])
y = np.array([[2,3,4],[3,4,5],[4,5,6]])

model_ = MultiOutputRegressor(xgb.XGBRegressor(objective='reg:linear', n_jobs=-1))
model_.fit(X, y)
pred = model_.predict(X)

# plot_importance expects a single Booster/XGBModel, not a MultiOutputRegressor,
# so this line raises the ValueError quoted above
fig, ax = plt.subplots(figsize=(15,15))
plot_importance(model_, height=0.5, ax=ax, max_num_features=3)
plt.show()

I found the solution.
fig, ax = plt.subplots(ncols=3, figsize=(15,6))
# each entry of estimators_ is one fitted XGBRegressor (one per output column of y)
plot_importance(model_.estimators_[0], height=0.5, ax=ax[0], max_num_features=20)
plot_importance(model_.estimators_[1], height=0.5, ax=ax[1], max_num_features=20)
plot_importance(model_.estimators_[2], height=0.5, ax=ax[2], max_num_features=20)
plt.show()
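If you prefer the raw numbers over the plots, each entry of estimators_ also exposes feature_importances_. A minimal sketch along those lines (the feature_names list is an illustrative stand-in for your real column names):

import numpy as np

feature_names = ['f0', 'f1', 'f2', 'f3', 'f4']  # hypothetical names for the 5 inputs above

for i, est in enumerate(model_.estimators_):
    # feature_importances_ is a 1-D array with one value per input feature
    order = np.argsort(est.feature_importances_)[::-1]
    print(f"output {i}:")
    for idx in order:
        print(f"  {feature_names[idx]}: {est.feature_importances_[idx]:.3f}")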

Related

inferentia neuron core usage is only 1 when 4 cores are available

I'm using the following code to load a Neuron-compiled model for inference. However, on my inf1.2xlarge instance, neuron-top shows four cores (NC0 to NC3) and only NC0 gets used during inference. How do I increase throughput by using all cores?
from transformers import BertTokenizer, BertModel
import torch
import torch_neuron
import os.path
import os

os.environ['NEURON_RT_NUM_CORES'] = str(4)
fname = 'modelneuron.pt'

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute and big", return_tensors="pt")

if not os.path.isfile(fname):
    model = BertModel.from_pretrained('bert-base-uncased', return_dict=False)
    neuron_model = torch_neuron.trace(model,
                                      example_inputs=(inputs['input_ids'], inputs['attention_mask']))
    neuron_model.save("modelneuron.pt")
    print('saved neuron model')
else:
    neuron_model = torch.jit.load('modelneuron.pt')
    print('loaded neuron model')

for i in range(10000):
    outputs = neuron_model(*(inputs['input_ids'], inputs['attention_mask']))
    print(outputs)
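One option worth trying, assuming your torch-neuron release ships the torch.neuron.DataParallel wrapper documented in the AWS Neuron SDK (an assumption to verify against the docs for your version, not something confirmed by the snippet above), is to replicate the traced model across the visible NeuronCores and send it batched input:

import torch
import torch_neuron

# load the traced model as before
neuron_model = torch.jit.load('modelneuron.pt')

# assumption: torch.neuron.DataParallel replicates the model on each visible
# NeuronCore and splits batched input along dim 0
parallel_model = torch.neuron.DataParallel(neuron_model)

# a batch of 4 sequences can keep all 4 cores busy at once
batched_ids = inputs['input_ids'].repeat(4, 1)
batched_mask = inputs['attention_mask'].repeat(4, 1)
outputs = parallel_model(batched_ids, batched_mask)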

How to handle alphanumeric values in machine learning

I am trying to find the best algorithm for my claims data. The claims data includes diagnosis codes which are alphanumeric, like 'EA43454'. When I run the code below to evaluate the models
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# X, y come from the claims dataset (not shown in the question)
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=None)
    cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
I get the error
ValueError: could not convert string to float: 'U0003'
How do I handle these alphanumeric values?
You need to convert your strings to indicator (dummy) variables. Each value of the string variable has to be mapped to a number so that the models can train on the data.
Scikit-learn has several preprocessors to help you with this, such as OneHotEncoder. You can also use pandas.get_dummies, but sklearn's own classes are more composable: for example, you can use them as part of a pipeline (see the sketch after the snippet below).
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng()
animals = pd.DataFrame({"animal": rng.choice(["cat", "dog"], size=10),
                        "age": rng.integers(1, 20, size=10)})
animals_ohe = OneHotEncoder().fit_transform(animals.drop(columns=["age"]))
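A minimal sketch of the "composable" point above: the encoder can sit inside a ColumnTransformer and Pipeline, so the encoding is fitted and applied together with the model. The claims-like data and the classifier choice here are illustrative, not taken from the question:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

# made-up claims-like data: an alphanumeric code column plus a numeric column
claims = pd.DataFrame({'diag_code': ['EA43454', 'U0003', 'EA43454', 'B1234'],
                       'age': [34, 51, 29, 62]})
y = [0, 1, 0, 1]

# encode the code column, pass the numeric column through unchanged
pre = ColumnTransformer(
    [('ohe', OneHotEncoder(handle_unknown='ignore'), ['diag_code'])],
    remainder='passthrough')

clf = Pipeline([('pre', pre), ('model', LogisticRegression())])
clf.fit(claims, y)
print(clf.predict(claims))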

How to use SlidingWindowSplitter in sktime

I need to fit an ARIMA model from the sktime package. I want to use SlidingWindowSplitter from sktime.forecasting.model_selection, but I don't really understand how it works.
If I wanted to fit a simple ARIMA I would do this
...
model = ARIMA(order = (p, d, q)).fit(y_train)
y_pred, y_conf = model.predict(fh, return_pred_int=True)
But how that works with the SlidingWindowSplitter?
This should work:
from sktime.forecasting.all import *
from sktime.forecasting.model_evaluation import evaluate
y = load_airline()
forecaster = ARIMA()
cv = SlidingWindowSplitter()
out = evaluate(forecaster, cv, y)
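To see what the splitter actually does, you can iterate over the windows it generates; a small sketch (the window and horizon sizes below are illustrative choices, not defaults from the answer):

from sktime.datasets import load_airline
from sktime.forecasting.model_selection import SlidingWindowSplitter

y = load_airline()

# 24-point training window, forecast 1-12 steps ahead, slide forward by 12 points
cv = SlidingWindowSplitter(window_length=24, fh=list(range(1, 13)), step_length=12)

for train_idx, test_idx in cv.split(y):
    print('train:', train_idx[0], '-', train_idx[-1],
          '| test:', test_idx[0], '-', test_idx[-1])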

Getting the column names chosen after a feature selection method

Given the simple feature selection code below, I want to know the columns that were selected (the dataset includes a header: V1 ... V20).
import pandas as pd
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression

def feature_selection(data):
    y = data['Class']
    X = data.drop(['Class'], axis=1)
    fs = SelectKBest(score_func=f_regression, k=10)
    # Applying feature selection
    X_selected = fs.fit_transform(X, y)
    # TODO: determine the columns being selected
    return X_selected

data = pd.read_csv("../dataset.csv")
new_data = feature_selection(data)
I appreciate any help.
I have used the iris dataset for my example but you can probably easily modify your code to match your use case.
SelectKBest has a scores_ attribute, which I used to sort the features.
Feel free to ask for any clarifications.
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression
from sklearn.datasets import load_iris

def feature_selection(data):
    y = data[1]
    X = data[0]
    column_names = ["A", "B", "C", "D"]  # Here you should use your dataframe's column names
    k = 2
    fs = SelectKBest(score_func=f_regression, k=k)
    # Applying feature selection
    X_selected = fs.fit_transform(X, y)
    # Find top features
    # I create a list like [[ColumnName1, Score1], [ColumnName2, Score2], ...]
    # Then I sort in descending order on the score
    top_features = sorted(zip(column_names, fs.scores_), key=lambda x: x[1], reverse=True)
    print(top_features[:k])
    return X_selected

data = load_iris(return_X_y=True)
new_data = feature_selection(data)
I don't know of a built-in method, but it can easily be coded (fs, X and X_selected below are the objects from your snippet):
n_columns_selected = X_selected.shape[1]
# take the names of the columns with the highest scores
new_columns = [name for score, name in sorted(zip(fs.scores_, X.columns))[-n_columns_selected:]]
# the sort above perturbs the column order; restore it using X.columns as a reference
new_columns = sorted(new_columns, key=lambda name: list(X.columns).index(name))
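SelectKBest also exposes a get_support() method that returns a boolean mask over the input columns, which is usually the most direct way to recover the selected names. A minimal drop-in revision of the question's function (it assumes the same dataset with a 'Class' column):

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

def feature_selection(data):
    y = data['Class']
    X = data.drop(['Class'], axis=1)
    fs = SelectKBest(score_func=f_regression, k=10)
    X_selected = fs.fit_transform(X, y)
    # get_support() is True for every column that was kept
    selected_columns = X.columns[fs.get_support()]
    print(list(selected_columns))
    return X_selected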

NaN giving ValueError in OneHotEncoder in scikit-learn

Here is my code
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
train = pd.DataFrame({
    'users': ['John Johnson', 'John Smith', 'Mary Williams']
})
test = pd.DataFrame({
    'users': [None, np.nan, 'John Smith', 'Mary Williams']
})

ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit(train)
train_transformed = ohe.fit_transform(train)
test_transformed = ohe.transform(test)
print(test_transformed)
I expected the OneHotEncoder to be able to handle the np.nan in the test dataset, since
handle_unknown='ignore'
was set. But it gives a ValueError, even though it handles the None value fine. Why is it failing? And how do I get around it (besides using an Imputer)?
From the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
it seemed that this was what handle_unknown is for.
This option gives a solution when the test set has a categorical value that was not seen in the training set. If you put 'Steve Stevenson' in the test set, it would not return an error; it would return a column with all zeros.
train = pd.DataFrame({
    'users': ['John Johnson', 'John Smith', 'Mary Williams']
})
test = pd.DataFrame({
    'users': ['John Smith', 'Mary Williams', 'Steve Stevenson']
})

ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit(train)
test_transformed = ohe.transform(test)
print(test_transformed)
A solution to the None/NaN problem would be to replace the missing values with a placeholder category, like 'unknown', before encoding.
Hope this helps
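A minimal sketch of that workaround, filling the missing values with a placeholder string before transforming (the 'unknown' label is just an illustrative choice):

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'users': ['John Johnson', 'John Smith', 'Mary Williams']})
test = pd.DataFrame({'users': [None, np.nan, 'John Smith', 'Mary Williams']})

# replace both None and np.nan with a placeholder category
test_filled = test.fillna('unknown')

ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit(train)
# 'unknown' was never seen during fit, so those rows come out as all zeros
print(ohe.transform(test_filled))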
