LassoCV: "axis -1 is out of bounds for array of dimension 0" and other questions - machine-learning

Good evening to all,
I am trying to implement for the first time LassoCV with sklearn.
My code is as follows:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_features = ['AGE_2019', 'Inhabitants']
categorical_features = ['familty_type', 'studying', 'Job_42', 'sex', 'DEGREE', 'Activity_type', 'Nom de la commune', 'city_type', 'DEP', 'INSEE', 'Nom du département', 'reg', 'Nom de la région']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())  # Scale the data to [0, 1]
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # Create binary variables for the categorical features
])
preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features)
])
# Creation of the pipeline
lassocv_piped = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LassoCV())
])
# Creation of the grid of parameters
dt_params = {'model__alphas': np.array([0.5])}
cv_folds = KFold(n_splits=5, shuffle=True, random_state=0)
lassocv_grid_piped = GridSearchCV(lassocv_piped, dt_params, cv=cv_folds, n_jobs=-1,
                                  scoring=['neg_mean_squared_error', 'r2'], refit='r2')
# Fitting our model
lassocv_grid_piped.fit(df_X_train, df_Y_train.values.ravel())
# Getting our metrics and predictions
Y_pred_lassocv = lassocv_grid_piped.predict(df_X_test)
metrics_lassocv = lassocv_grid_piped.cv_results_
best_lassocv_parameters = lassocv_grid_piped.best_params_
print('Best test negative MSE of the base model: ', max(metrics_lassocv['mean_test_neg_mean_squared_error']))
print('Best test R^2 of the base model: ', max(metrics_lassocv['mean_test_r2']))
print('Best parameters of the base model: ', best_lassocv_parameters)
# Graphical representation
results = pd.DataFrame(dt_params)
for k in range(5):
    results = pd.concat([results,
                         pd.DataFrame(lassocv_grid_piped.cv_results_['split' + str(k) + '_test_neg_mean_squared_error'])], axis=1)
sns.relplot(data=results.melt('model__alphas', value_name='neg_mean_squared_error'),
            x='model__alphas', y='neg_mean_squared_error', kind='line')
I am still a novice when it comes to using this model. So, I have some questions about the use of this estimator:
Is it useful to use cv_folds outside the estimator, as I do?
Is it useful to set up a GridSearchCV to test the different alpha values?
How is it possible to extract the R^2 from our model?
Also, I encounter this error:
AxisError: axis -1 is out of bounds for array of dimension 0
Would you have an idea to solve it?
I wish you a good evening!

After a good night's sleep, I was able to overcome some of my problems.
Is it useful to use cv_folds outside the estimator, as I do?
After studying the LassoCV documentation a bit, it seems not. So I can remove cv_folds from my code and use the cv argument of LassoCV instead.
Is it useful to set up a GridSearchCV to test the different alpha values?
I haven't fully settled this question yet, but it seems that LassoCV searches over its alpha grid by itself, so GridSearchCV should not be needed for that (see the sketch below).
How is it possible to extract the R^2 from our model?
This can be done simply with the .score(X, y) method, which returns R^2 for regressors.
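To make both points concrete, here is a minimal sketch of what I mean (reusing the preprocessor, data frames, and imports from above; the alpha grid is purely illustrative, not a recommendation):
# Let LassoCV search over an explicit alpha grid itself, without GridSearchCV
lassocv_piped = Pipeline([
    ('preprocessor', preprocessor),
    ('model', LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, random_state=0))
])
lassocv_piped.fit(df_X_train, df_Y_train.values.ravel())
print('Best alpha:', lassocv_piped['model'].alpha_)
print('Train R^2:', lassocv_piped.score(df_X_train, df_Y_train.values.ravel()))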
As for my error message, I was able to get rid of it once I deleted GridSearchCV.
Here's my final code:
numeric_features = ['AGE_2019', 'Inhabitants']
categorical_features = ['familty_type','studying','Job_42','sex','DEGREE', 'Activity_type', 'Nom de la commune', 'city_type', 'DEP', 'INSEE', 'Nom du département', 'reg', 'Nom de la région']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', MinMaxScaler())  # Scale the data to [0, 1]
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # Create binary variables for the categorical features
])
preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features)
])
# Creation of the pipeline
list_metrics_lassocv = []
list_best_lassocv_parameters = []
for i in range(1, 12):
    lassocv_piped = Pipeline([
        ('preprocessor', preprocessor),
        ('model', LassoCV(cv=5, n_alphas=i, random_state=0))
    ])
    # Fitting our model
    lassocv_piped.fit(df_X_train, df_Y_train.values.ravel())
    # Getting our metrics and predictions
    Y_pred_lassocv = lassocv_piped.predict(df_X_test)
    metrics_lassocv = lassocv_piped.score(df_X_train, df_Y_train.values.ravel())
    best_lassocv_parameters = lassocv_piped['model'].alpha_
    list_metrics_lassocv.append(metrics_lassocv)
    list_best_lassocv_parameters.append(best_lassocv_parameters)
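For what it's worth, a fitted LassoCV also exposes its cross-validation path, so a plot like the earlier seaborn one can be drawn without GridSearchCV. A minimal sketch, assuming the last fitted pipeline from the loop above:
import matplotlib.pyplot as plt

fitted_lasso = lassocv_piped['model']
# mse_path_ has shape (n_alphas, n_folds); average the MSE across folds
plt.plot(fitted_lasso.alphas_, fitted_lasso.mse_path_.mean(axis=1))
plt.xlabel('alpha')
plt.ylabel('mean CV MSE')
plt.show()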
Do not hesitate to correct me if you see an imprecision or an error.

Related

How can I synchronize two Deep Reinforcement Learning agents?

I am doing a project in which I simulate a computer network. Each node of the network is a Deep Reinforcement Learning agent, and its state depends on a global matrix from which the agents read and then modify data. I would like to know when would be the most appropriate time to update the state of these agents, and what the most correct approach would be.
The state has one row more than the matrix containing the MLU of the links. This row will store the packet to be worked on.
# Create a matrix that stores the MLU at each moment, initialized to 0 for connected nodes
matrizMLU = np.full((nodos_red, nodos_red), -1, int)
for i in range(nodos_red):
    for j in range(i+1, nodos_red):
        if j in puertos[i]:
            matrizMLU[i][j] = 0
            matrizMLU[j][i] = 0
class nodoEnv(Env):
    def __init__(self, idNodo):  # Environment initialization
        self.id = idNodo
        self.action_space = Discrete(5)  # Actions
        self.observation_space = Box(low=0, high=100, shape=(len(matrizMLU)+1, len(matrizMLU)))
        self.estado = np.array(np.zeros((len(matrizMLU)+1, len(matrizMLU)), dtype=int))
        self.camino = calcularCaminos(idNodo)

R: Error in predict.xgboost: Feature names stored in `object` and `newdata` are different

I wrote a script using xgboost to predict soil class for a certain area using field data and satellite images. The script is below:
rm(list=ls())
library(xgboost)
library(caret)
library(raster)
library(sp)
library(rgeos)
library(ggplot2)
setwd("G:/DATA")
data <- read.csv('96PointsClay02finalone.csv')
head(data)
summary(data)
dim(data)
ras <- stack("Allindices04TIFF.tif")
names(ras) <- c("b1", "b2", "b3", "b4", "b5", "b6", "b7", "b10", "b11","DEM",
"R1011", "SCI", "SAVI", "NDVI", "NDSI", "NDSandI", "MBSI",
"GSI", "GSAVI", "EVI", "DryBSI", "BIL", "BI","SRCI")
set.seed(27) # set seed so the random split is reproducible
# use createDataPartition() from the caret package to split the data into training (80%) and testing (20%) sets
parts = createDataPartition(data$Clay, p = .8, list = F)
train = data[parts, ]
test = data[-parts, ]
#define predictor and response variables in training set
train_x = data.matrix(train[, -1])
train_y = train[,1]
#define predictor and response variables in testing set
test_x = data.matrix(test[, -1])
test_y = test[, 1]
#define final training and testing sets
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
#defining a watchlist
watchlist = list(train=xgb_train, test=xgb_test)
#fit XGBoost model and display the training and testing error at each iteration
model = xgb.train(data = xgb_train, max.depth = 3, watchlist=watchlist, nrounds = 100)
#define final model
model_xgboost = xgboost(data = xgb_train, max.depth = 3, nrounds = 86, verbose = 0)
summary(model_xgboost)
#use model to make predictions on test data
pred_y = predict(model_xgboost, xgb_test)
# performance metrics on the test data
mean((test_y - pred_y)^2) #mse - Mean Squared Error
caret::RMSE(test_y, pred_y) #rmse - Root Mean Squared Error
y_test_mean = mean(test_y)
rmseE <- function(error) {
    sqrt(mean(error^2))
}
y = test_y
yhat = pred_y
rmseresult=rmseE(y-yhat)
(r2 = R2(yhat , y, form = "traditional"))
cat('The R-square of the test data is ', round(r2,4), ' and the RMSE is ', round(rmseresult,4), '\n')
#use model to make predictions on satellite image
result <- predict(model_xgboost, ras[1:(nrow(ras)*ncol(ras))])
#create a result raster
res <- raster(ras)
#fill in results and add a "1" to them (to get back to initial class numbering! - see above "Prepare data" for more information)
res <- setValues(res,result+1)
#Save the output .tif file into saved directory
writeRaster(res, "xgbmodel_output", format = "GTiff", overwrite=T)
The script works well until it reaches
result <- predict(model_xgboost, ras[1:(nrow(ras)*ncol(ras))])
it takes some time then gives this error:
Error in predict.xgb.Booster(model_xgboost, ras[1:(nrow(ras) * ncol(ras))]) :
Feature names stored in `object` and `newdata` are different!
I realize that I am doing something wrong in that line. However, I do not know how to apply the xgboost model to a raster image that represents my study area.
It would be highly appreciated if someone could give a hand, enlighten me, and help me solve this problem.
My data as csv and raster image can be found here.
Finally, I found the reason for this error: it was my mistake, as the number of columns in the training data did not match the number of layers in the satellite image, so the feature names stored in the model differed from those in newdata.

How to do batch inference for Hugging Face models?

I want to do batch inference on a MarianMT model. Here's the code:
from transformers import MarianMTModel, MarianTokenizer

model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-en-de')
tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')
src_texts = [ "I am a small frog.", "Tom asked his teacher for advice."]
tgt_texts = ["Ich bin ein kleiner Frosch.", "Tom bat seinen Lehrer um Rat."] # optional
inputs = tokenizer(src_texts, return_tensors="pt", padding=True)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, return_tensors="pt", padding=True)
inputs["labels"] = labels["input_ids"]
outputs = model(**inputs)
How do I do batch inference?
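In case it helps, translation with MarianMT is usually done with generate rather than a plain forward pass; a minimal sketch of batched inference, assuming the model and tokenizer loaded above:
# Tokenize the whole batch with padding, then generate translations in one pass
batch = tokenizer(src_texts, return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))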

use tff.learning.build_federated_evaluation instead of keras_evaluate

I'm new to TFF and I'm working from this tutorial. I would like to replace the keras_evaluate function with the predefined TFF function: evaluation = tff.learning.build_federated_evaluation(model)
How can I edit these lines:
def keras_evaluate(state, round_num):
    # Take our global model weights and push them back into a Keras model to
    # use its standard `.evaluate()` method.
    keras_model = load_model(batch_size=BATCH_SIZE)
    keras_model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[FlattenedCategoricalAccuracy()])
    state.model.assign_weights_to(keras_model)
    loss, accuracy = keras_model.evaluate(example_dataset, steps=2, verbose=0)
    print('\tEval: loss={l:.3f}, accuracy={a:.3f}'.format(l=loss, a=accuracy))

for round_num in range(NUM_ROUNDS):
    print('Round {r}'.format(r=round_num))
    keras_evaluate(state, round_num)
    state, metrics = fed_avg.next(state, train_datasets)
    train_metrics = metrics['train']
    print('\tTrain: loss={l:.3f}, accuracy={a:.3f}'.format(
        l=train_metrics['loss'], a=train_metrics['accuracy']))

print('Final evaluation')
keras_evaluate(state, NUM_ROUNDS + 1)
In this line:
loss, accuracy = keras_model.evaluate(example_dataset, steps=2, verbose=0)
the function evaluates only on an example dataset, whereas build_federated_evaluation evaluates on the whole of federated_test_data. How can I modify this function to evaluate on the totality of federated_test_data, as in the other tutorial: test_metrics = evaluation(state.model, federated_test_data)
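A minimal sketch of that replacement, assuming model_fn is the same model-building function used to construct fed_avg and federated_test_data is the list of client test datasets (both assumptions, following the linked tutorials):
# Build the federated evaluation computation once, outside the training loop
evaluation = tff.learning.build_federated_evaluation(model_fn)

for round_num in range(NUM_ROUNDS):
    state, metrics = fed_avg.next(state, train_datasets)
    # Evaluate the current global model weights on every federated test client
    test_metrics = evaluation(state.model, federated_test_data)
    print('Round {r}: {m}'.format(r=round_num, m=test_metrics))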

How to tune GaussianNB?

Trying to fit data with GaussianNB() gives me a low accuracy score.
I'd like to try grid search, but it seems that the parameters sigma and theta cannot be set. Is there any way to tune GaussianNB?
You can tune the 'var_smoothing' parameter like this:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

nb_classifier = GaussianNB()
params_NB = {'var_smoothing': np.logspace(0, -9, num=100)}
gs_NB = GridSearchCV(estimator=nb_classifier,
                     param_grid=params_NB,
                     cv=cv_method,  # use any cross-validation technique
                     verbose=1,
                     scoring='accuracy')
gs_NB.fit(x_train, y_train)
gs_NB.best_params_
As of version 0.20
GaussianNB().get_params().keys()
returns 'priors' and 'var_smoothing'
A grid search would look like:
pipeline = Pipeline([
    ('clf', GaussianNB())
])
parameters = {
    'clf__priors': [None],
    'clf__var_smoothing': [0.00000001, 0.000000001, 0.00000001]
}
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred_gnb = cv.predict(X_test)
In an sklearn pipeline it may look as follows:
pipe = Pipeline(steps=[
    ('pca', PCA()),
    ('estimator', GaussianNB()),
])
parameters = {'estimator__var_smoothing': [1e-11, 1e-10, 1e-9]}
Bayes = GridSearchCV(pipe, parameters, scoring='accuracy', cv=10).fit(X_train, y_train)
print(Bayes.best_estimator_)
print('best score:')
print(Bayes.best_score_)
predictions = Bayes.best_estimator_.predict(X_test)
Classically, Naive Bayes is said to have no hyperparameters to tune; in practice, though, GaussianNB does expose priors and var_smoothing, as shown above.
