"LightGBMError: Do not support special JSON characters in feature name" - machine-learning

My 'X' data is a pandas data frame of time-series. I extracted features of X data using Tsfresh and try to apply LightGBM algorithm to classify the data into 0(Bad) and 1(Good). But it shows an error. Columns of my X data are`
Index(['0__ratio_beyond_r_sigma__r_1',
'0__change_quantiles__f_agg_"mean"isabs_True__qh_0.8__ql_0.0',
'0__cwt_coefficients__coeff_1__w_20__widths(2, 5, 10, 20)',
'0__cwt_coefficients__coeff_1__w_10__widths(2, 5, 10, 20)',
'0__change_quantiles__f_agg_"var"_isabs_False__qh_0.8__ql_0.0',
'0__change_quantiles__f_agg"mean"_isabs_True__qh_0.4__ql_0.0',
'0__change_quantiles__f_agg"mean"_isabs_True__qh_0.8__ql_0.6',
'0__change_quantiles__f_agg"mean"_isabs_False__qh_0.4__ql_0.0',
'0__fft_coefficient__attr"real"_coeff_3',
'0__change_quantiles__f_agg"mean"_isabs_True__qh_1.0__ql_0.0',
...
'0__quantile__q_0.4', '0__fft_coefficient__attr"imag"coeff_39',
'0__large_standard_deviation__r_0.2',
'0__cwt_coefficients__coeff_13__w_10__widths(2, 5, 10, 20)',
'0__fourier_entropy__bins_10',
'0__fft_coefficient__attr"angle"_coeff_9',
'0__fft_coefficient__attr"imag"_coeff_17',
'0__fft_coefficient__attr"angle"_coeff_92', '0__maximum',
'0__fft_coefficient__attr"imag"__coeff_32'],
dtype='object', length=225)
My code is
`
import lightgbm as lgb
d_train = lgb.Dataset(X_train, label=y_train)
lgbm_params = {'learning_rate':0.05, 'boosting_type':'dart',
'objective':'binary',
'metric':['auc', 'binary_logloss'],
'num_leaves':100,
'max_depth':10}
clf = lgb.train(lgbm_params, d_train, 50)
y_pred_lgbm=clf.predict(X_test)
for i in range(0, X_test.shape[0]):
if y_pred_lgbm[i]>=.5:
y_pred_lgbm[i]=1
else:
y_pred_lgbm[i]=0
cm_lgbm = confusion_matrix(y_test, y_pred_lgbm)
sns.heatmap(cm_lgbm, annot=True)
`
I tried below code to change my columns but it does not work.
`
import re
X = X.rename(columns = lambda u:re.sub('[^A-Za-z0-9_]+', '', u))
After applying that rename function the columns looks as below
`
Index(['0__ratio_beyond_r_sigma__r_1',
'0__change_quantiles__f_agg_mean__isabs_True__qh_08__ql_00',
'0__cwt_coefficients__coeff_1__w_20__widths_251020',
'0__cwt_coefficients__coeff_1__w_10__widths_251020',
'0__change_quantiles__f_agg_var__isabs_False__qh_08__ql_00',
'0__change_quantiles__f_agg_mean__isabs_True__qh_04__ql_00',
'0__change_quantiles__f_agg_mean__isabs_True__qh_08__ql_06',
'0__change_quantiles__f_agg_mean__isabs_False__qh_04__ql_00',
'0__fft_coefficient__attr_real__coeff_3',
'0__change_quantiles__f_agg_mean__isabs_True__qh_10__ql_00',
...
'0__quantile__q_04', '0__fft_coefficient__attr_imag__coeff_39',
'0__large_standard_deviation__r_02',
'0__cwt_coefficients__coeff_13__w_10__widths_251020',
'0__fourier_entropy__bins_10',
'0__fft_coefficient__attr_angle__coeff_9',
'0__fft_coefficient__attr_imag__coeff_17',
'0__fft_coefficient__attr_angle__coeff_92', '0__maximum',
'0__fft_coefficient__attr_imag__coeff_32'],
dtype='object', length=225)
`
What should I do to get rid of this error?

u cant put like '_' these kind of symbol in column names or the lgb will report this kind of error

Related

R: Error in predict.xgboost: Feature names stored in `object` and `newdata` are different

I wrote a script using xgboost to predict soil class for a certain area using data from field and satellite images. The script as below:
`
rm(list=ls())
library(xgboost)
library(caret)
library(raster)
library(sp)
library(rgeos)
library(ggplot2)
setwd("G:/DATA")
data <- read.csv('96PointsClay02finalone.csv')
head(data)
summary(data)
dim(data)
ras <- stack("Allindices04TIFF.tif")
names(ras) <- c("b1", "b2", "b3", "b4", "b5", "b6", "b7", "b10", "b11","DEM",
"R1011", "SCI", "SAVI", "NDVI", "NDSI", "NDSandI", "MBSI",
"GSI", "GSAVI", "EVI", "DryBSI", "BIL", "BI","SRCI")
set.seed(27) # set seed for generating random data.
# createDataPartition() function from the caret package to split the original dataset into a training and testing set and split data into training (80%) and testing set (20%)
parts = createDataPartition(data$Clay, p = .8, list = F)
train = data[parts, ]
test = data[-parts, ]
#define predictor and response variables in training set
train_x = data.matrix(train[, -1])
train_y = train[,1]
#define predictor and response variables in testing set
test_x = data.matrix(test[, -1])
test_y = test[, 1]
#define final training and testing sets
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
#defining a watchlist
watchlist = list(train=xgb_train, test=xgb_test)
#fit XGBoost model and display training and testing data at each iteartion
model = xgb.train(data = xgb_train, max.depth = 3, watchlist=watchlist, nrounds = 100)
#define final model
model_xgboost = xgboost(data = xgb_train, max.depth = 3, nrounds = 86, verbose = 0)
summary(model_xgboost)
#use model to make predictions on test data
pred_y = predict(model_xgboost, xgb_test)
# performance metrics on the test data
mean((test_y - pred_y)^2) #mse - Mean Squared Error
caret::RMSE(test_y, pred_y) #rmse - Root Mean Squared Error
y_test_mean = mean(test_y)
rmseE<- function(error)
{
sqrt(mean(error^2))
}
y = test_y
yhat = pred_y
rmseresult=rmseE(y-yhat)
(r2 = R2(yhat , y, form = "traditional"))
cat('The R-square of the test data is ', round(r2,4), ' and the RMSE is ', round(rmseresult,4), '\n')
#use model to make predictions on satellite image
result <- predict(model_xgboost, ras[1:(nrow(ras)*ncol(ras))])
#create a result raster
res <- raster(ras)
#fill in results and add a "1" to them (to get back to initial class numbering! - see above "Prepare data" for more information)
res <- setValues(res,result+1)
#Save the output .tif file into saved directory
writeRaster(res, "xgbmodel_output", format = "GTiff", overwrite=T)
`
The script works well till it reachs
result <- predict(model_xgboost, ras[1:(nrow(ras)*ncol(ras))])
it takes some time then gives this error:
Error in predict.xgb.Booster(model_xgboost, ras[1:(nrow(ras) * ncol(ras))]) :
Feature names stored in `object` and `newdata` are different!
I realize that I am doing something wrong in that line. However, I do not know how to apply the xgboost model to a raster image that represents my study area.
It would be highly appreciated if someone give a hand, enlightened me, and helped me solve this problem....
My data as csv and raster image can be found here.
Finally, I got the reason for this error.
It was my mistake as the number of columns in the traning data was not the same as in the number of layers in the satellite image.

How to handle alphanumeric values in machine learning

I am trying to the find the best algorithm for my claims data. The claims data include some diagnosis code which are alphanumeric like 'EA43454' . when i run the below code to evaluate the models
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
kfold = model_selection.KFold(n_splits=10, random_state=None)
cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
results.append(cv_results)
names.append(name)
msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)
i get the error
ValueError: could not convert string to float: 'U0003'
How to handle these alphanumeric values?
You need to convert your strings to an indicator variable (dummy variables). Each value of the string variable has to be associated with a number so that the models can train on that data.
Scikit-learn has several preprocessors to help you with this such as OneHotEncoder. You can also use pandas.get_dummies, but using sklearn's own classes is more composable - for example, you can use them as part of a pipeline.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
rng = np.random.default_rng()
animals = pd.DataFrame({"animal": rng.choice(["cat", "dog"], size=10),
"age": rng.integers(1, 20, size=10)})
animals_ohe = OneHotEncoder().fit_transform(animals.drop(columns=["age"]))

Getting the column names chosen after a feature selection method

Given a simple feature selection code below, I want to know the selected columns after the feature selection (The dataset includes a header V1 ... V20)
import pandas as pd
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression
def feature_selection(data):
y = data['Class']
X = data.drop(['Class'], axis=1)
fs = SelectKBest(score_func=f_regression, k=10)
# Applying feature selection
X_selected = fs.fit_transform(X, y)
# TODO: determine the columns being selected
return X_selected
data = pd.read_csv("../dataset.csv")
new_data = feature_selection(data)
I appreciate any help.
I have used the iris dataset for my example but you can probably easily modify your code to match your use case.
The SelectKBest method has the scores_ attribute I used to sort the features.
Feel free to ask for any clarifications.
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression
from sklearn.datasets import load_iris
def feature_selection(data):
y = data[1]
X = data[0]
column_names = ["A", "B", "C", "D"] # Here you should use your dataframe's column names
k = 2
fs = SelectKBest(score_func=f_regression, k=k)
# Applying feature selection
X_selected = fs.fit_transform(X, y)
# Find top features
# I create a list like [[ColumnName1, Score1] , [ColumnName2, Score2], ...]
# Then I sort in descending order on the score
top_features = sorted(zip(column_names, fs.scores_), key=lambda x: x[1], reverse=True)
print(top_features[:k])
return X_selected
data = load_iris(return_X_y=True)
new_data = feature_selection(data)
I don't know the in-build method, but it can be easily coded.
n_columns_selected = X_new.shape[0]
new_columns = list(sorted(zip(fs.scores_, X.columns))[-n_columns_selected:])
# new_columns order is perturbed, we need to restore it. We use the names of the columns of X as a reference
new_columns = list(sorted(cols_new, key=lambda x: list(X.columns).index(x)))

NaN giving ValueError in OneHotEncoder in scikit-learn

Here is my code
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
train = pd.DataFrame({
'users':['John Johnson','John Smith','Mary Williams']
})
test = pd.DataFrame({
'users':[None,np.nan,'John Smith','Mary Williams']
})
ohe = OneHotEncoder(sparse=False,handle_unknown='ignore')
ohe.fit(train)
train_transformed = ohe.fit_transform(train)
test_transformed = ohe.transform(test)
print(test_transformed)
I expected the OneHotEncoder to be able to handle the np.nan in the test dataset, since
handle_unknown='ignore'
But it gives ValueError. It is able to handle the None value though. Why is it failing?And how do I get around it (besides Imputer)?
From the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
it seemed that this was what handle_unknown is for.
This option gives a solution when test set has unseen categorical value in train set. If you would put ‘steve stevenson’ in the test set it would not return an error, it would return column with all zeros.
train = pd.DataFrame({
'users':['John Johnson','John Smith','Mary Williams']
})
test = pd.DataFrame({
'users':['John Smith','Mary Williams', 'Steve Stevenson']
})
ohe = OneHotEncoder(sparse=False, handle_unknown = 'ignore')
ohe.fit(train)
test_transformed = ohe.transform(test)
print(test_transformed)
Solution to None problem would be to replace None values with some category, like ‘unknown’.
Hope this helps

how to convert pandas str.split call to to dask

I have a dask data frame where the index is a string which looks like this:
12/09/2016 00:00;32.0046;-106.259
12/09/2016 00:00;32.0201;-108.838
12/09/2016 00:00;32.0224;-106.004
(its basically a string encoding the datetime;latitude;longitude of the row)
I'd like to split that while still in the dask context to individual columns representing each of the fields.
I can do that with a pandas dataframe as:
df['date'], df['Lat'], df['Lon'] = df.index.str.split(';', 2).str
But that doesn't work in dask for several of the attempts I've tried. If I directly substitute the df for a dask df I get the error:
'Index' object has no attribute 'str'
If I use the column name instead of index as:
forecastDf['date'], forecastDf['Lat'], forecastDf['Lon'] = forecastDf['dateLocation'].str.split(';', 2).str
I get the error:
TypeError: 'StringAccessor' object is not iterable
Here is an runnable example of this working in Pandas
import pandas as pd
df = pd.DataFrame()
df['dateLocation'] = ['12/09/2016 00:00;32.0046;-106.259','12/09/2016 00:00;32.0201;-108.838','12/09/2016 00:00;32.0224;-106.004']
df = df.set_index('dateLocation')
df['date'], df['Lat'], df['Lon'] = df.index.str.split(';', 2).str
df.head()
Here is the error I get if I directly convert that to dask
import dask.dataframe as dd
dd = dd.from_pandas(df, npartitions=1)
dd['date'], dd['Lat'], dd['Lon'] = dd.index.str.split(';', 2).str
>>TypeError: 'StringAccessor' object is not iterable
forecastDf['date'] = forecastDf['dateLocation'].str.partition(';')[0]
forecastDf['Lat'] = forecastDf['dateLocation'].str.partition(';')[2]
forecastDf['Lon'] = forecastDf['dateLocation'].str.partition(';')[4]
Let me know if this works for you!
First make sure the column is string dtype
forecastDD['dateLocation'] = forecastDD['dateLocation'].astype('str')
Then you can use this to split in dask
splitColumns = client.persist(forecastDD['dateLocation'].str.split(';',2))
You can then index the columns in the new dataframe splitColumns and add them back to the original data frame.
forecastDD = forecastDD.assign(Lat=splitColumns.apply(lambda x: x[0], meta=('Lat', 'f8')), Lon=splitColumns.apply(lambda x: x[1], meta=('Lat', 'f8')), date=splitColumns.apply(lambda x: x[2], meta=('Lat', np.dtype(str))))
Unfortunately I couldn't figure out how to do it without calling compute and creating the temp dataframe.

Resources