Sklearn: Found input variables with inconsistent numbers of samples:

Sklearn: Found input variables with inconsistent numbers of samples: - machine-learning

I have built a model.
est1_pre = ColumnTransformer([('catONEHOT', OneHotEncoder(dtype='int',handle_unknown='ignore'),['Var1'])],remainder='drop')
est2_pre = ColumnTransformer([('BOW', TfidfVectorizer(ngram_range=(1, 3),max_features=1000),['Var2'])],remainder='drop')
m1= Pipeline([('FeaturePreprocessing', est1_pre),
('clf',alternative)])
m2= Pipeline([('FeaturePreprocessing', est2_pre),
('clf',alternative)])
model_combo = StackingClassifier(
estimators=[('cate',m1),('text',m2)],
final_estimator=RandomForestClassifier(n_estimators=10,
random_state=42)
)
I can successfully, fit and predict using m1 and m2.
However, when I look at the combination model_combo
Any attempt in calling .fit/.predict results in ValueError: Found input variables with inconsistent numbers of samples:
model_fitted=model_combo.fit(x_train,y_train)
x_train contains Var1 and Var2
How to fit model_combo?

The problem is that sklearn text preprocessors (TfidfVectorizer in this case) operate on one-dimensional data, not two-dimensional as most other preprocessors. So the vectorizer treats its input as an iterable of its columns, so there's only one "document". This can be fixed in the ColumnTransformer by specifying the column to operate on not in a list:
est2_pre = ColumnTransformer([('BOW', TfidfVectorizer(ngram_range=(1, 3),max_features=1000),'Var2')],remainder='drop')

Related

Scikit-Learn issues error for RandomForestClassifier for multilabel classification - Jagged arrays

Scikit-Learn RandomForestClassifier throws an error for a multilabel classification problem.
This code creates a RandomForestClassifier multilabel object, given predictors C and multi-labels out with no error.
C = np.array([[2,4,6],[4,2,1],[8,3,1]])
out = np.array([[0,1],[0,1],[1,0]])
rf = RandomForestClassifier(n_estimators=100, oob_score=True)
rf.fit(C,out)
If I modify the multilabels, so that all the elements at a certain index are the same, say (where all the first components of the multilabels equals zero)
out = np.array([[0,1],[0,1],[0,0]])
I get an error and traceback:
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a
list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated.
If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
y_pred = np.array(y_pred, copy=False)
raise ValueError(
507 "The type of target cannot be used to compute OOB "
508 f"estimates. Got {y_type} while only the following are "
509 "supported: continuous, continuous-multioutput, binary, "
510 "multiclass, multilabel-indicator."
511 )
ValueError: could not broadcast input array from shape (2,1) into shape (2,)
Not requesting OOB predictions does not result in an error:
rf_err = RandomForestClassifier(n_estimators=100, oob_score=False)
I cannot figure out why keeping the OOB predictions would trigger such an error, when all the n-component of a multilabel are equal.

In your setup out_err = np.array([[0,1],[0,1],[0,0]]) you do not have any examples of the second class, so you only have elements of 1 class.
That means that there is no 'class label' dimension and it can be omitted. That's why you see (2,) shape.
Please, describe your initial intent: why would you need to set a particular position in labels to 0. If you try to go with N-1 classes instead of N classes I suggest removing the position itself and the elements of the class from the dataset, not putting all zeros:
out=[[1,0,0],[0,1,0],[0,1,0],[0,0,1],[1,0,0]] # 3 classes
# remove the second class:
out=[[1,0],[0,1],[1,0]] # 2 classes

grid of mtry values while training random forests with ranger

I am working with a subset of the 'Ames Housing' dataset and have originally 17 features. Using the 'recipes' package, I have engineered the original set of features and created dummy variables for nominal predictors with the following code. That has resulted in 35 features in the 'baked_train' dataset below.
blueprint <- recipe(Sale_Price ~ ., data = _train) %>%
step_nzv(Street, Utilities, Pool_Area, Screen_Porch, Misc_Val) %>%
step_impute_knn(Gr_Liv_Area) %>%
step_integer(Overall_Qual) %>%
step_normalize(all_numeric_predictors()) %>%
step_other(Neighborhood, threshold = 0.01, other = "other") %>%
step_dummy(all_nominal_predictors(), one_hot = FALSE)
prepare <- prep(blueprint, data = ames_train)
baked_train <- bake(prepare, new_data = ames_train)
baked_test <- bake(prepare, new_data = ames_test)
Now, I am trying to train random forests with the 'ranger' package using the following code.
cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
param_grid_rf <- expand.grid(mtry = seq(1, 35, 1),
splitrule = "variance",
min.node.size = 2)
rf_cv <- train(blueprint,
data = ames_train,
method = "ranger",
trControl = cv_specs,
tuneGrid = param_grid_rf,
metric = "RMSE")
I have set the grid of 'mtry' values based on the number of features in the 'baked_train' data. It is my understanding that 'caret' will apply the blueprint within each resample of 'ames_train' creating a baked version at each CV step.
The text Hands-On Machine Learning with R by Boehmke & Greenwell says on section 3.8.3,
Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample.
However, when I run the code above I get an error,
mtry can not be larger than number of variables in data. Ranger will EXIT now.
I get the same error when I specify 'tuneLength = 20' instead of the 'tuneGrid'. Although the code works fine when the grid of 'mtry' values is specified to be from 1 to 17 (the number of features in the original training data 'ames_train').
When I specify a grid of 'mtry' values from 1 to 17, info about the final model after CV is shown below. Notice that it mentions Number of independent variables: 35 which corresponds to the 'baked_train' data, although specifying a grid from 1 to 35 throws an error.
Type: Regression
Number of trees: 500
Sample size: 618
Number of independent variables: 35
Mtry: 15
Target node size: 2
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 995351989
R squared (OOB): 0.8412147
What am I missing here? Specifically, why do I have to specify the number of features in 'ames_train' instead of 'baked_train' when essentially 'caret' is supposed to create a baked version before fitting and evaluating the model for each resample?
Thanks.

What is the meaning of in-place in dropout

def dropout(input, p=0.5, training=True, inplace=False)
inplace: If set to True, will do this operation in-place.
I would like to ask what is the meaning of in-place in dropout. What does it do?
Any performance changes when performing these operation?
Thanks

Keeping inplace=True will itself drop few values in the tensor input itself, whereas if you keep inplace=False, you will to save the result of droput(input) in some other variable to be retrieved.
Example:
import torch
import torch.nn as nn
inp = torch.tensor([1.0, 2.0, 3, 4, 5])
outplace_dropout = nn.Dropout(p=0.4)
print(inp)
output = outplace_dropout(inp)
print(output)
print(inp) # Notice that the input doesn't get changed here
inplace_droput = nn.Dropout(p=0.4, inplace=True)
inplace_droput(inp)
print(inp) # Notice that the input is changed now
PS: This is not related to what you have asked but try not using input as a variable name since input is a Python keyword. I am aware that Pytorch docs also does that, and it is kinda funny.

ValueError: setting an array element with a sequence when using Onehotencoder on multiple columns

I'm fairly new to machine learning and I am using the following code to encode my categorical data for preprocessing:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(handle_unknown = 'ignore'), [0])],remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype=np.float)
which works when I only have one categorical column of data in X.
However when I have multiple columns of categorical data I change my code to :
ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(handle_unknown = 'ignore'), [0,1,2,3,4,5,10,14,15])],remainder='passthrough')
but I get the following error when calling the np.array function:
Value Error: setting an array element with a sequence
on the np.array function call...
From what I understand all I need to do is specify which columns I'm hot encoding as in the above line of code...so why does one work and the other give an error? What should I do to fix it?
Also: if I remove the
dtype=np.float
from the np.array function I don't get an error - but I also don't get anything returned in X

Never mind I was able to answer my own question.
For anyone interested what I did was change the line
X = np.array(ct.fit_transform(X), dtype=np.float)
to:
X = ct.fit_transform(X).toarray()
The code works perfectly now.

How can one add batch mechanism to the input function in Tensorflow tutorial overcoming tf.Sparsetensor objects?

How can one add batch mechanism to the input_fn in the TensorFlow Wide & Deep Learning Tutorial overcoming that some features are represented as tf.Sparsetensor objects?
I have made many attempts around adding tf.train.slice_input_producer and tf.train.batchto the original code (see below), but have failed miserably so far.
I would like to keep the global working of that input_fn as it is handy while while training and evaluating the model.
Can someone help, please?
def input_fn(df):
# Creates a dictionary mapping from each continuous feature column name (k) to
# the values of that column stored in a constant Tensor.
continuous_cols = {k: tf.constant(df[k].values)
for k in CONTINUOUS_COLUMNS}
# Creates a dictionary mapping from each categorical feature column name (k)
# to the values of that column stored in a tf.SparseTensor.
categorical_cols = {k: tf.SparseTensor(indices=[[i, 0] for i in range(df[k].size)],
values=df[k].values,
shape=[df[k].size, 1]) for k in CATEGORICAL_COLUMNS}
# Merges the two dictionaries into one.
feature_cols = dict(continuous_cols.items() + categorical_cols.items())
# Converts the label column into a constant Tensor.
labels = tf.constant(df[LABEL_COLUMN].values)
'''
Changes from here:
'''
features_slices, features_slices = tf.train.slice_input_producer([features_cols, labels], ...)
features_batches, labels_batches = tf.train.batch([features_slices, features_slices], ...)
# Returns the feature and label batches.
return features_batches, labels_batches

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Sklearn: Found input variables with inconsistent numbers of samples: - machine-learning

Related

Scikit-Learn issues error for RandomForestClassifier for multilabel classification - Jagged arrays

grid of mtry values while training random forests with ranger

What is the meaning of in-place in dropout

ValueError: setting an array element with a sequence when using Onehotencoder on multiple columns

How can one add batch mechanism to the input function in Tensorflow tutorial overcoming tf.Sparsetensor objects?

Categories

Resources