Cross validation error of Gaussian process with noisy target - machine-learning

I created a Gaussian process model and trained it on a noisy target. I passed the noise as the alpha parameter, an array of shape [n_samples], following the documentation for the latest scikit-learn release (0.18).
model = GaussianProcessRegressor(kernel=kernel,n_restarts_optimizer=0, alpha=dy_train ** 2)
It works until I try to run cross-validation, which raises an error saying that the length of alpha does not match the length of the target:
scores = cross_val_score(model, X_test, y_test)
ValueError: alpha must be a scalar or an array with same number of entries as y.(35 != 10)
I understand the error, but I don't know how to define the alpha vector properly for cross-validation. Any suggestions?
Thanks

Alpha is usually a single number (and that would work just fine with your code). You can also pass a per-sample alpha, but this will not work with cross_val_score, because it does not slice alpha along with X and y when it builds the folds. Furthermore, what you are using looks like an extremely odd heuristic for assigning alpha; I am fairly sure it is not anywhere in the scikit-learn documentation. To use cross-validation you need the 'full' approach: iterate over the cross-validation splits and average the scores yourself. It is only a few lines of code, so it should not be a big burden:
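from sklearn.model_selection import KFold
import numpy as np
kf = KFold(n_splits=10)
scores = []
for train, test in kf.split(X):
    # train and test are index arrays; use them to slice X and y for this fold.
    # With a per-sample alpha, slice it too, e.g. model.set_params(alpha=dy[train] ** 2),
    # where dy (an assumed name) holds the noise level of every sample in X.
    model.fit(X[train], y[train])
    scores.append(model.score(X[test], y[test]))
print(np.mean(scores))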
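The manual loop above can be sketched end to end with a per-sample alpha. This is only an illustration on synthetic data: the arrays X, y and dy, the RBF kernel, and the 5-fold split are assumptions, not taken from the question. The point it shows is that alpha is sliced with the same training indices as X and y, so its length always matches the fold.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import KFold

# Synthetic noisy target (assumed data, for illustration only)
rng = np.random.RandomState(0)
X = np.linspace(0, 10, 40).reshape(-1, 1)
dy = 0.1 + 0.2 * rng.rand(40)              # per-sample noise level
y = np.sin(X).ravel() + rng.normal(0, dy)  # noisy observations

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train, test in kf.split(X):
    # Build the model with alpha restricted to the training fold
    model = GaussianProcessRegressor(kernel=RBF(), alpha=dy[train] ** 2)
    model.fit(X[train], y[train])
    # alpha is only used during fitting, so scoring needs no slicing
    scores.append(model.score(X[test], y[test]))

mean_score = np.mean(scores)
print(mean_score)
```

Creating a fresh estimator inside the loop avoids carrying state between folds; calling set_params(alpha=...) on a single estimator before each fit would be an alternative.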

Related

I am using RuleFit for binary classification; how do I interpret the rules?

I am using RuleFit with a GradientBoostingClassifier to generate rules for a binary classification problem (health-dataset on Kaggle). When I print out the rules with RuleFit.get_rules(), it shows rule, type, coef, support, and importance. But it doesn’t show which class (0 or 1) is the target of the rule. For example: does exang <= 0.5 describe a 0 or 1 class?
Summary: how do I know which target class a given rule is describing?
"RuleFit learns a sparse linear model with the original features and also a number of new features that are decision rules. These new features capture interactions between the original features. RuleFit automatically generates these features from decision trees. Each path through a tree can be transformed into a decision rule by combining the split decisions into a rule." (reference)
Also, from the RuleFit example we understand that rf.get_rules() returns the rules built from the original features and the generated features, but not the predictions.
I am therefore assuming the prediction results come from your GradientBoostingClassifier. If that is the case, the most natural thing to do is to choose a threshold on the predicted probability of class 1: above it a sample is predicted as 1, below it as 0. Here is a possible example:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
reg = GradientBoostingClassifier()
reg.fit(X_train, y_train)
# predict() already returns class labels, so threshold the class-1 probability instead
y_proba = reg.predict_proba(X_test)[:, 1]
thresh = 0.5
y_pred = (y_proba > thresh).astype(int)
Note that the threshold does not have to be 0.5; it depends on what you are aiming for.
For more about this, I encourage you to look into the ROC curve and the Area Under the Curve (AUC) metric.
I hope this helped!
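As a rough sketch of the AUC suggestion in the answer above, the snippet below scores class-1 probabilities with roc_auc_score and lists the candidate thresholds with roc_curve. The dataset comes from make_classification and is purely an assumption for illustration; it is not the health dataset from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem (assumed data, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)[:, 1]  # probability of class 1

# AUC summarises ranking quality over every possible threshold
auc = roc_auc_score(y_test, y_proba)

# roc_curve returns one (fpr, tpr) pair per candidate threshold
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
print(auc, len(thresholds))
```

Inspecting fpr and tpr at each entry of thresholds is one way to pick a cut-off other than 0.5.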

Unable to inverse_transform the value of feature because of different dimensionality

I'm designing a multivariate time series model. I feed 5 features into an LSTM model and try to predict 1 variable (whose value depends on itself and on the other 4 features).
For that, I'm scaling the features as follows:
# Feature scaling
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range=(0, 1))
training_set_scaled = sc.fit_transform(training_set)
print(training_set_scaled)
Output:-
At the output of the model, I got the predicted value as:
However, when I tried to inverse transform it as:
predicted_stock_price = sc.inverse_transform(predicted_stock_price)
I got the following error:
non-broadcastable output operand with shape (65,1) doesn't match the broadcast shape (65,5)
Please help. Thank you in advance :)
The problem is that sc was fitted to min-max-scale the five features. Therefore, sc can only inverse transform the scaled version of those five features (shown by you as output), which would give you back the original feature values.
The label (model output) is independent of that. You can scale your dependent variable, but you do not have to, and if you do, you should not use the same scaler object.
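One way to follow this advice is to keep a second scaler that is fitted on the target column only. The sketch below uses random data, and the names feature_scaler and target_scaler, as well as the choice of column 0 as the target, are assumptions for illustration. The scaled target stands in for the model's prediction.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the 5-feature training set (assumed data)
rng = np.random.RandomState(0)
training_set = rng.rand(65, 5) * 100

# One scaler for the model inputs, a separate one for the target column
feature_scaler = MinMaxScaler(feature_range=(0, 1))
target_scaler = MinMaxScaler(feature_range=(0, 1))

X_scaled = feature_scaler.fit_transform(training_set)
# Column 0 is assumed to be the variable being predicted
y_scaled = target_scaler.fit_transform(training_set[:, [0]])

# Stand-in for the model output: shape (65, 1), in scaled units
predicted_scaled = y_scaled.copy()

# The target scaler was fitted on one column, so shapes now agree
predicted = target_scaler.inverse_transform(predicted_scaled)
print(predicted.shape)
```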

Is there any method like `scaler.inverse_transform()` that uses only part of the scaler's parameters to de-normalize the answer?

I am normalizing my data (with shape (23687, 7)) and saving the mean and std of the original dataset to "normalized_param.pkl".
After fitting my LSTM model on the normalized data, I get an answer array (with shape (23687, 1)).
Now what I am going to do is:
test_sc_path = os.path.join('normalized_standard', 'normalized_param.pkl')
test_scaler = load(test_sc_path)
test_denorm_value = test_scaler.inverse_transform(test_normalized_data)
ValueError: non-broadcastable output operand with shape (23687,1) doesn't match the broadcast shape (23687,7)
I think that's because the test_scaler object holds parameters for 7 dimensions, so if I want to de-normalize data with only 1 dimension, I should use
test_scaler.mean_[-1] and test_scaler.scale_[-1] to get the last parameters for the computation.
However, that seems quite complicated. Is there a sklearn method like scaler.inverse_transform() that I can use to solve this easily?
thanks
Yes, there is a method for it. See the StandardScaler documentation.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)  # Fit: store the per-column means and standard deviations
scaler.transform(data)  # Standardize the data using the stored parameters
scaler.fit_transform(data)  # Fit and transform in one step
scaler.inverse_transform(data)  # Undo the scaling; data must have the same number of columns as the fitted data
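Because inverse_transform expects all fitted columns, a single predicted column can be de-normalized by hand using the mean_ and scale_ attributes mentioned in the question. The sketch below assumes random data with 7 columns and assumes the prediction corresponds to the last column; the normalized last column stands in for the model output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the 7-column dataset (assumed data)
rng = np.random.RandomState(0)
data = rng.normal(loc=5.0, scale=2.0, size=(100, 7))

scaler = StandardScaler().fit(data)
normalized = scaler.transform(data)

# Stand-in for the model output: one column, in normalized units
pred_normalized = normalized[:, [-1]]

# Invert the standardization for the last column only: x = z * std + mean
pred = pred_normalized * scaler.scale_[-1] + scaler.mean_[-1]
print(pred.shape)
```

An alternative with the same effect is to fit a second StandardScaler on the target column alone and call its inverse_transform.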

Are the k-fold cross-validation scores from scikit-learn's `cross_val_score` and `GridsearchCV` biased if we include transformers in the pipeline?

Data preprocessors such as StandardScaler should be used to fit_transform the training set and only transform (not fit) the test set. I expect the same fit/transform process to apply to cross-validation when tuning the model. However, it appears that cross_val_score and GridSearchCV fit_transform the entire training set with the preprocessor (rather than fit_transform the inner training set and transform the inner validation set). I believe this artificially removes variance from the inner validation set, which biases the CV score (the metric GridSearchCV uses to select the best model). Is this a concern, or am I missing something?
To demonstrate the issue, I tried the following three simple test cases with the Breast Cancer Wisconsin (Diagnostic) Data Set from Kaggle.
1. I intentionally fit and transform the entire X with StandardScaler():
X_sc = StandardScaler().fit_transform(X)
lr = LogisticRegression(penalty='l2', random_state=42)
cross_val_score(lr, X_sc, y, cv=5)
2. I include StandardScaler and LogisticRegression in a Pipeline and run cross_val_score:
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(penalty='l2', random_state=42))
])
cross_val_score(pipe, X, y, cv=5)
3. Same as 2, but with GridSearchCV:
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(random_state=42))
])
params = {
    'lr__penalty': ['l2']
}
gs = GridSearchCV(pipe, param_grid=params, cv=5).fit(X, y)
gs.cv_results_
They all produce the same validation scores.
[0.9826087 , 0.97391304, 0.97345133, 0.97345133, 0.99115044]
No, sklearn doesn't do fit_transform with entire dataset.
To check this, I subclassed StandardScaler to print the size of the dataset sent to it.
class StScaler(StandardScaler):
    def fit_transform(self, X, y=None):
        print(len(X))
        return super().fit_transform(X, y)
If you now replace StandardScaler with StScaler in your code, you'll see that the dataset passed in the first case is larger than the ones passed in the other cases.
But why does the accuracy remain exactly the same? I think this is because LogisticRegression is not very sensitive to feature scale. If we instead use a classifier that is very sensitive to scale, such as KNeighborsClassifier, the accuracy starts to differ between the two cases.
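from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
X, y = load_breast_cancer(return_X_y=True)
X_sc = StScaler().fit_transform(X)
lr = KNeighborsClassifier(n_neighbors=1)
cross_val_score(lr, X_sc, y, cv=5)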
Outputs:
569
[0.94782609 0.96521739 0.97345133 0.92920354 0.9380531 ]
And the 2nd case,
pipe = Pipeline([
    ('sc', StScaler()),
    ('lr', KNeighborsClassifier(n_neighbors=1))
])
print(cross_val_score(pipe, X, y, cv=5))
Outputs:
454
454
456
456
456
[0.95652174 0.97391304 0.97345133 0.92920354 0.9380531 ]
Not a big change accuracy-wise, but a change nonetheless.
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test.
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:
A model is trained using k-1 of the folds as training data;
the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
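The k-fold procedure described above can be written out by hand and compared with what cross_val_score does for a pipeline. In this sketch the scaler is fitted on the k-1 training folds only and then applied to the held-out fold. The choice of StratifiedKFold with 5 splits and max_iter=1000 are assumptions made so that both paths use identical folds and the solver converges.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5)

# Manual k-fold loop: fit the scaler and the model on k-1 folds,
# then evaluate on the remaining fold
manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    scaler = StandardScaler().fit(X[train_idx])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(scaler.transform(X[train_idx]), y[train_idx])
    manual_scores.append(clf.score(scaler.transform(X[val_idx]), y[val_idx]))

# The same procedure, delegated to a pipeline and cross_val_score
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe_scores = cross_val_score(pipe, X, y, cv=cv)

print(np.mean(manual_scores), np.mean(pipe_scores))
```

If the two score lists agree, the pipeline is being refitted inside each fold rather than on the whole training set.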
Moreover, if the target classes are imbalanced to begin with, they should be balanced, for example with SMOTE, by oversampling the minority class, or by undersampling the majority class.

Confusion matrix subset of classes not working properly

I have searched the internet for an answer to this question, including the suggestions shown when writing the title, but to no avail, so hopefully someone can help!
I am trying to construct a confusion matrix using scikit-learn. This comes after a Keras model.
This is bizarre because I am having the following problem: for the training and test sets of the original data, I can construct the confusion matrix as follows (please note this is a multi-label problem, so the data has to be subset for the different labels).
The following works fine:
cm = confusion_matrix(y_train[:,0:6].argmax(axis=1), trainpred[:,0:6].argmax(axis=1))
and then 6:18 etc., until all classes have been subset. The resulting confusion matrix reflects the true outcome of the Keras model.
The problem arises when I deploy the model on completely unseen data.
I deploy the model by calling model.predict() and get results as above. However, now I cannot subset confusion matrices in the same way.
The code cm = confusion_matrix(...) etc. produces a CM with the wrong dimensions, even when specifying 0:6 etc.
I therefore used the code from above, modified with the labels argument:
age = [0, 1, 2, 3, 4]
organ = [5, 6, 7, 8]
cm = confusion_matrix(y_train[:,0:6].argmax(axis=1), trainpred[:,0:6].argmax(axis=1), labels=age)
The FIRST label (1:5) works perfectly. However, the next labels do not! I don't get the right values in the confusion matrices, and the matching is also incorrect for the values that are there.
To put this in to context: there are over 400 samples in the unseen test data.
model.predict shows very high, correct classification scores for most labels.
Calling cm = ytest[:,4:8] etc. does indeed produce a 4x4 matrix; however, there are only about 5 values in it, not 400, and those values do not match correctly.
Also, with the age labels being 0, 1, 2, 3, 4, 5, subsetting ytest to 0:6 produces the correct confusion matrix (I am unsure why the 6 has to be included in the subset; nevertheless, I have tried different combinations with the same issue).
I have searched high and low for this answer, so I would really appreciate some assistance, as it is incredibly frustrating. I will be happy to provide any more code or information!
Many thanks!
This is happening because you are trying to subset the generated confusion matrix, but you actually have to generate a new confusion matrix manually for the specified class labels. If you have classes A, B, C you will get a 3x3 matrix. If you want a matrix focusing only on class A, the other classes become the negative class, and the false positives and false negatives change, so you cannot just take a subset of the initial matrix.
This is how you should actually do it:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def generate_matrix(y_true, predict, class_name):
    # One-vs-rest counts: class_name is the positive class, everything else is negative
    TP, FP, FN, TN = 0, 0, 0, 0
    for i in range(len(y_true)):
        if y_true[i] == class_name:
            if predict[i] == class_name:
                TP += 1
            else:
                FN += 1
        else:
            # True label is another class: it is a false positive only if class_name was predicted
            if predict[i] == class_name:
                FP += 1
            else:
                TN += 1
    return np.array([[TP, FP],
                     [FN, TN]])
# Build the new matrix for class 'A'
matrix = generate_matrix(actual_labels,
                         predicted_labels,
                         class_name='A')
This will generate a confusion matrix for class A.
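The same one-vs-rest matrix can also be obtained from scikit-learn by binarizing the labels before calling confusion_matrix. This is a sketch with made-up labels (the arrays actual and predicted are assumptions). Note that the layout differs from the function above: with labels=[1, 0], rows are the true class and columns the predicted class, giving [[TP, FN], [FP, TN]].

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes (assumed data, for illustration only)
actual = np.array(['A', 'B', 'C', 'A', 'B', 'C', 'A'])
predicted = np.array(['A', 'C', 'C', 'B', 'B', 'A', 'A'])

# Binarize: 1 where the label is 'A', 0 for every other class
actual_is_a = (actual == 'A').astype(int)
predicted_is_a = (predicted == 'A').astype(int)

# labels=[1, 0] puts the positive class first:
# rows = true class, columns = predicted class -> [[TP, FN], [FP, TN]]
cm_a = confusion_matrix(actual_is_a, predicted_is_a, labels=[1, 0])
print(cm_a)
```

Repeating this for each class of interest gives one 2x2 matrix per class, computed over all samples rather than over a slice of a larger matrix.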

Resources