How to use a ValidationMonitor for an Estimator in TensorFlow 1.0?

TensorFlow makes it possible to combine ValidationMonitors with several predefined estimators such as tf.contrib.learn.DNNClassifier.
But I want to use a ValidationMonitor for my own estimator, which I have created based on [1].
For my own estimator I first initialize a ValidationMonitor:
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(testX, testY, every_n_steps=50)
estimator = tf.contrib.learn.Estimator(model_fn=model, model_dir=direc,
                                       config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1))
input_fn = tf.contrib.learn.io.numpy_input_fn({"x": x}, y, 4, num_epochs=1000)
Then I pass the monitor, as shown in [2] for tf.contrib.learn.DNNClassifier:
estimator.fit(input_fn=input_fn, steps=1000, monitors=[validation_monitor])
This fails, and the following error is printed:
ValueError: Features are incompatible with given information. Given features: Tensor("input:0", shape=(?, 1), dtype=float64), required signatures: {'x': TensorSignature(dtype=tf.float64, shape=TensorShape([Dimension(None)]), is_sparse=False)}.
How can I use monitors for my own estimators?
Thanks.

The problem is solved by passing an input_fn that wraps testX and testY to the ValidationMonitor, instead of passing the tensors testX and testY directly.
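A minimal sketch of that fix (assuming testX and testY are NumPy arrays; batch size and epoch count are illustrative): build the validation input_fn the same way as the training one, so the monitor receives the same {"x": ...} feature dictionary.
test_input_fn = tf.contrib.learn.io.numpy_input_fn(
    {"x": testX}, testY, batch_size=4, num_epochs=1000)

# Pass input_fn instead of raw tensors; eval_steps bounds each evaluation run.
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    input_fn=test_input_fn, eval_steps=1, every_n_steps=50)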

For the record, your error was caused by the fact that ValidationMonitor expects x to be a dictionary like {'feature_name_as_a_string': feature_tensor}, which your input_fn builds internally via the call to tf.contrib.learn.io.numpy_input_fn(...).
More information about how to build features dictionaries can be found in the Building Input Functions with tf.contrib.learn article of the documentation.

Related

Scikit-Learn issues error for RandomForestClassifier for multilabel classification

Scikit-Learn's RandomForestClassifier throws an error for a multilabel classification problem.
This code fits a RandomForestClassifier on predictors C and multilabels out with no error:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

C = np.array([[2, 4, 6], [4, 2, 1], [8, 3, 1]])
out = np.array([[0, 1], [0, 1], [1, 0]])
rf = RandomForestClassifier(n_estimators=100, oob_score=True)
rf.fit(C, out)
If I modify the multilabels so that all the elements at a certain index are the same, say (where the first component of every multilabel equals zero)
out = np.array([[0,1],[0,1],[0,0]])
I get a warning and traceback:
VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  y_pred = np.array(y_pred, copy=False)
raise ValueError(
    "The type of target cannot be used to compute OOB "
    f"estimates. Got {y_type} while only the following are "
    "supported: continuous, continuous-multioutput, binary, "
    "multiclass, multilabel-indicator."
)
ValueError: could not broadcast input array from shape (2,1) into shape (2,)
Not requesting OOB predictions does not result in an error:
rf_err = RandomForestClassifier(n_estimators=100, oob_score=False)
I cannot figure out why requesting the OOB predictions triggers such an error when all the n-th components of the multilabels are equal.
In your setup out = np.array([[0,1],[0,1],[0,0]]) there are no positive examples of the first label, so that output effectively contains only one class.
That means there is no 'class label' dimension for it and it can be omitted; that's why you see the (2,) shape.
Please describe your initial intent: why do you need to set a particular position in the labels to 0? If you are trying to go from N classes to N-1 classes, I suggest removing the label position itself and the examples of that class from the dataset, rather than putting all zeros:
out=[[1,0,0],[0,1,0],[0,1,0],[0,0,1],[1,0,0]] # 3 classes
# remove the second class:
out=[[1,0],[0,1],[1,0]] # 2 classes
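A runnable sketch of that removal (the arrays are illustrative; the matching rows of the feature matrix would need to be dropped as well):
import numpy as np

out = np.array([[1,0,0],[0,1,0],[0,1,0],[0,0,1],[1,0,0]])  # 3 classes
keep = out[:, 1] == 0                          # rows that are not the second class
out_reduced = np.delete(out[keep], 1, axis=1)  # drop those rows and the column
print(out_reduced)  # [[1 0] [0 1] [1 0]]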

Sklearn: Found input variables with inconsistent numbers of samples:

I have built a model.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import StackingClassifier, RandomForestClassifier

est1_pre = ColumnTransformer([('catONEHOT', OneHotEncoder(dtype='int', handle_unknown='ignore'), ['Var1'])], remainder='drop')
est2_pre = ColumnTransformer([('BOW', TfidfVectorizer(ngram_range=(1, 3), max_features=1000), ['Var2'])], remainder='drop')
m1 = Pipeline([('FeaturePreprocessing', est1_pre),
               ('clf', alternative)])
m2 = Pipeline([('FeaturePreprocessing', est2_pre),
               ('clf', alternative)])
model_combo = StackingClassifier(
    estimators=[('cate', m1), ('text', m2)],
    final_estimator=RandomForestClassifier(n_estimators=10, random_state=42)
)
I can successfully fit and predict using m1 and m2 on their own.
However, with the combined model_combo, any attempt to call .fit/.predict results in ValueError: Found input variables with inconsistent numbers of samples:
model_fitted = model_combo.fit(x_train, y_train)
x_train contains Var1 and Var2.
How do I fit model_combo?
The problem is that sklearn's text preprocessors (TfidfVectorizer in this case) operate on one-dimensional data, not two-dimensional like most other preprocessors. Selecting the column with a list yields a 2-D frame, so the vectorizer treats its input as an iterable of its columns and sees only one "document". This can be fixed in the ColumnTransformer by specifying the column as a bare string rather than a list:
est2_pre = ColumnTransformer([('BOW', TfidfVectorizer(ngram_range=(1, 3),max_features=1000),'Var2')],remainder='drop')
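A short sketch of the difference (df is a hypothetical DataFrame with a Var2 column):
df[['Var2']]  # list selector   -> 2-D DataFrame of shape (n_samples, 1)
df['Var2']    # string selector -> 1-D Series of shape (n_samples,), which is what TfidfVectorizer expects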

Generating local interpretations using Shap Kernel Explainer

I am trying to interpret my model using the SHAP KernelExplainer. The dataset is of shape (176683, 42). The explainer (xgb_explainer) is successfully created, but when I use it to generate shap_values, it throws a MemoryError.
import shap

xgb_explainer = shap.KernelExplainer(trained_model.steps[-1][-1].predict, X_for_shap.values)
shap_val = xgb_explainer.shap_values(X_for_shap.loc[0], nsamples=1)
First I used the default nsamples = 2 * X_for_shap.shape[1] + 2048, which returned:
MemoryError: Unable to allocate array with shape (2132, 7420686) and data type float64
When I set nsamples=1, it runs for an indefinite time. Please help me understand where I am going wrong here.
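A sketch of one common mitigation (k and nsamples here are illustrative, not from the original post): KernelExplainer allocates arrays proportional to nsamples times the size of the background data, so summarizing the 176683-row background set with shap.kmeans keeps the allocation small.
# Summarize the background data to 50 weighted centroids instead of all rows.
background = shap.kmeans(X_for_shap, 50)
xgb_explainer = shap.KernelExplainer(trained_model.steps[-1][-1].predict, background)
# Explain a single row; a few hundred samples is usually enough for 42 features.
shap_val = xgb_explainer.shap_values(X_for_shap.iloc[0, :], nsamples=500)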
One thing that I don't understand about the KernelExplainer is why we need to impute the missing features with some strategy (mean, median, k-means, etc.). Why not just ignore them, fit a linear learner, and compare it with the model without observing that feature, i.e. P(y | S ∪ {i}) − P(y | S)? What added value does the SHAP approach provide over having the whole feature set but with some features unknown?

What is the meaning of in-place in dropout

def dropout(input, p=0.5, training=True, inplace=False)
inplace: If set to True, will do this operation in-place.
I would like to ask what the meaning of in-place is in dropout. What does it do?
Are there any performance changes when performing these operations?
Thanks
Setting inplace=True drops values directly in the input tensor itself, whereas with inplace=False you need to save the result of dropout(input) in some other variable to retrieve it.
Example:
import torch
import torch.nn as nn
inp = torch.tensor([1.0, 2.0, 3, 4, 5])
outplace_dropout = nn.Dropout(p=0.4)
print(inp)
output = outplace_dropout(inp)
print(output)
print(inp) # Notice that the input doesn't get changed here
inplace_dropout = nn.Dropout(p=0.4, inplace=True)
inplace_dropout(inp)
print(inp) # Notice that the input is changed now
PS: This is not related to what you asked, but try not to use input as a variable name, since it shadows the Python built-in function input(). I am aware that the PyTorch docs also do that, and it is kind of funny.

sklearn logistic regression parameter in GridSearch

Just wondering how to separate parameters into groups and pass them to grid search?
I want to pass penalties l1 and l2 to the grid search, with the solver newton-cg applied only to l2.
However, when I run the code below, the grid search first tries l1 with newton-cg and fails with the error:
ValueError: Solver newton-cg supports only l2 penalties, got l1 penalty.
Thanks
param_grid = [
    {'penalty': ['l1', 'l2'],
     'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
]
Try this example:
param_grid = [
    {'penalty': ['l1'], 'solver': ['liblinear', 'saga']},
    {'penalty': ['l2'], 'solver': ['newton-cg']},
]
Here l1 will be tried only with liblinear and saga (the only solvers in your list that support it), and l2 will be tried only with newton-cg.
The official docs say:
... or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
So just supply a list of dictionaries, each with a consistent set of parameters that work together.
There is also an explicit example of this pattern in the GridSearchCV section of the User Guide.
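A minimal usage sketch (assuming X and y are already defined; cv and max_iter are illustrative):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'penalty': ['l1'], 'solver': ['liblinear', 'saga']},
    {'penalty': ['l2'], 'solver': ['newton-cg']},
]

# Each dictionary expands into its own grid, so incompatible pairs never meet.
gs = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
gs.fit(X, y)
print(gs.best_params_)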
