Scaling the data in a decision tree changed my results? - machine-learning

I know that a decision tree doesn't get affected by scaling the data but when I scale the data within my decision tree it gives me a bad performance (bad recall, precision and accuracy)
But when I don't scale all the performance metrics the decision tree gives me an amazing result. How can this be?
Note: I use GridSearchCV but I don't think that the cross validation is the reason for my problem. Here is my code:
scaled = MinMaxScaler()
pca = PCA()
bestK = SelectKBest()
combined_transformers = FeatureUnion([ ("scale",scaled),("best", bestK),
("pca", pca)])
clf = tree.DecisionTreeClassifier(class_weight= "balanced")
pipeline = Pipeline([("features", combined_transformers), ("tree", clf)])
param_grid = dict(features__pca__n_components=[1, 2,3],
features__best__k=[1, 2,3],
tree__min_samples_split=[4,5],
tree__max_depth= [4,5],
)
grid_search = GridSearchCV(pipeline, param_grid=param_grid,scoring='f1')
grid_search.fit(features,labels)
With the scale function MinMaxScaler() my performance is:
f1 = 0.837209302326
recall = 1.0
precision = 0.72
accuracy = 0.948148148148
But without scaling:
f1 = 0.918918918919
recall = 0.944444444444
precision = 0.894736842105
accuracy = 0.977777777778

I am not familiar with scikit-learn, so excuse me if I misunderstand something.
First of all, does PCA standardize features? If it does not, it will give different results for scaled and non-scaled input.
Second, due to the randomness in splitting the samples, CV may give different results on each run. This will affect the results especially for small sample size. In addition, in case you have small sample size, the results may not be that different after all.
I have the following suggestions:
Scaling can be treated as an additional hyperparameter, which can be optimized by CV.
Perform an extra CV (called nested CV) or hold-out to estimate performance. This is done by keeping a test set, selecting your model using CV on the training data and then evaluate its performance on the test set (in case of nested CV you do this repeatedly for all folds and average the performance estimates). Of course, your final model should be trained on the whole dataset. In general, you should not use the performance estimate of the CV used for model selection, as it will be overly optimistic.

Related

When do I use scoring vs metrics to evaluate ML performance

hi what is the basic difference between 'scoring' and 'metrics'. these are used to measure performance but how do they differ?
if you see the example
in the below the cross val is using 'neg_mean_squared_error' for scoring
X = array[:, 0:13]
Y = array[:, 13]
seed = 7
kfold = model_selection.KFold(n_splits=10, random_state=seed)
model = LinearRegression()
scoring = 'neg_mean_squared_error'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MSE: %.3f (%.3f)") % (results.mean(), results.std())
but in the below xgboost example I am using metrics = 'rmse'
cmatrix = xgb.DMatrix(data=X, label=y)
params = {'objective': 'reg:linear', 'max_depth': 3}
cv_results = xgb.cv(dtrain=cmatrix, params=params, nfold=3, num_boost_round=5, metrics='rmse', as_pandas=True, seed=123)
print(cv_results)
how do they differ?
They don't; these are actually just different terms, to declare the same thing.
To be very precise, scoring is the process in which one measures the model performance, according to some metric (or score). The scikit-learn term choice for the argument scoring (as in your first snippet) is rather unfortunate (it actually implies a scoring function), as the MSE (and its variants, as negative MSE and RMSE) are metrics or scores. But practically speaking, as shown in your example snippets, these two terms are used as synonyms and are frequently used interchangeably.
The real distinction of interest here is not between "score" and "metric", but between loss (often referred to as cost) and metrics such as the accuracy (for classification problems); this is often a source of confusion among new users. You may find my answers in the following threads useful (ignore the Keras mentions in some titles, the answers are generally applicable):
Loss & accuracy - Are these reasonable learning curves?
How does Keras evaluate the accuracy?
Optimizing for accuracy instead of loss in Keras model

Can intercept and regression coefficients (Beta values) be very high?

I have 38 variables, like oxygen, temperature, pressure, etc and have a task to determine the total yield produced every day from these variables. When I calculate the regression coefficients and intercept value, they seem to be abnormal and very high (Impractical). For example, if 'temperature' coefficient was found to be +375.456, I could not give a meaning to them saying an increase in one unit in temperature would increase yield by 375.456g. That's impractical in my scenario. However, the prediction accuracy seems right. I would like to know, how to interpret these huge intercept( -5341.27355) and huge beta values shown below. One other important point is that I removed multicolinear columns and also, I am not scaling the variables/normalizing them because I need beta coefficients to have meaning such that I could say, increase in temperature by one unit increases yield by 10g or so. Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that all of these variables are linearly correlated, so I would suggest that you have a look at simple non-linear regression techniques, such as Decision Trees or Kernel Ridge Regression. These are however more difficult to interpret.
Going back to your issue, these high weights might well be due to there being some high amount of correlation between the variables, or that you simply don't have very much training data.
If you instead of linear regression use Lasso Regression, the solution is biased away from high regression coefficients, and the fit will likely improve as well.
A small example on how to do this in scikit-learn, including cross validation of the regularization hyper-parameter:
from sklearn.linear_model LassoCV
# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))
# Make y linear dependent on the features
y = np.sum(np.random.random((1,n_features)) * X, axis=1)
model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X,y)
print(model.intercept_)
If you have a linear regression, the formula looks like this (y= target, x= features inputs):
y= x1*b1 +x2*b2 + x3*b3 + x4*b4...+ c
where b1,b2,b3,b4... are your modl.coef_. AS you already realized one of your bigges number is 3.319+02 = 331 and the intercept is also quite big with -5431.
As you already mentioned the coeffiecient variables means how much the target variable changes, if the coeffiecient feature changes with 1 unit and all others features are constant.
so for your interpretation, the higher the absoult coeffienct, the higher the influence of your analysis. But it is important to note that the model is using a lot of high coefficient, that means your model is not depending only of one variable

Are the k-fold cross-validation scores from scikit-learn's `cross_val_score` and `GridsearchCV` biased if we include transformers in the pipeline?

Data pre-processers such as StandardScaler should be used to fit_transform the train set and only transform (not fit) the test set. I expect the same fit/transform process applies to cross-validation for tuning the model. However, I found cross_val_score and GridSearchCV fit_transform the entire train set with the preprocessor (rather than fit_transform the inner_train set, and transform the inner_validation set). I believe this artificially removes the variance from the inner_validation set which makes the cv score (the metric used to select the best model by GridSearch) biased. Is this a concern or did I actually miss anything?
To demonstrate the above issue, I tried the following three simple test cases with the Breast Cancer Wisconsin (Diagnostic) Data Set from Kaggle.
I intentionally fit and transform the entire X with StandardScaler()
X_sc = StandardScaler().fit_transform(X)
lr = LogisticRegression(penalty='l2', random_state=42)
cross_val_score(lr, X_sc, y, cv=5)
I include SC and LR in the Pipeline and run cross_val_score
pipe = Pipeline([
('sc', StandardScaler()),
('lr', LogisticRegression(penalty='l2', random_state=42))
])
cross_val_score(pipe, X, y, cv=5)
Same as 2 but with GridSearchCV
pipe = Pipeline([
('sc', StandardScaler()),
('lr', LogisticRegression(random_state=42))
])
params = {
'lr__penalty': ['l2']
}
gs=GridSearchCV(pipe,
param_grid=params, cv=5).fit(X, y)
gs.cv_results_
They all produce the same validation scores.
[0.9826087 , 0.97391304, 0.97345133, 0.97345133, 0.99115044]
No, sklearn doesn't do fit_transform with entire dataset.
To check this, I subclassed StandardScaler to print the size of the dataset sent to it.
class StScaler(StandardScaler):
def fit_transform(self,X,y=None):
print(len(X))
return super().fit_transform(X,y)
If you now replace StandardScaler in your code, you'll see dataset size passed in first case is actually bigger.
But why does the accuracy remain exactly same? I think this is because LogisticRegression is not very sensitive to feature scale. If we instead use a classifier that is very sensitive to scale, like KNeighborsClassifier for example, you'll find accuracy between two cases start to vary.
X,y = load_breast_cancer(return_X_y=True)
X_sc = StScaler().fit_transform(X)
lr = KNeighborsClassifier(n_neighbors=1)
cross_val_score(lr, X_sc,y, cv=5)
Outputs:
569
[0.94782609 0.96521739 0.97345133 0.92920354 0.9380531 ]
And the 2nd case,
pipe = Pipeline([
('sc', StScaler()),
('lr', KNeighborsClassifier(n_neighbors=1))
])
print(cross_val_score(pipe, X, y, cv=5))
Outputs:
454
454
456
456
456
[0.95652174 0.97391304 0.97345133 0.92920354 0.9380531 ]
Not big change accuracy-wise, but change nonetheless.
Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:
A model is trained using of the folds as training data;
the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
More over if your model is already biased from starting we have to make it balance by SMOTE /Oversampling of Less Target Variable/Under-sampling of High target variable.

How do I perform a differentiable operation selection in TensorFlow?

I am trying to produce a mathematical operation selection nn model, which is based on the scalar input. The operation is selected based on the softmax result which is produce by the nn. Then this operation has to be applied to the scalar input in order to produce the final output. So far I’ve come up with applying argmax and onehot on the softmax output in order to produce a mask which then is applied on the concated values matrix from all the possible operations to be performed (as show in the pseudo code below). The issue is that neither argmax nor onehot appears to be differentiable. I am new to this, so any would be highly appreciated. Thanks in advance.
#perform softmax
logits = tf.matmul(current_input, W) + b
softmax = tf.nn.softmax(logits)
#perform all possible operations on the input
op_1_val = tf_op_1(current_input)
op_2_val = tf_op_2(current_input)
op_3_val = tf_op_2(current_input)
values = tf.concat([op_1_val, op_2_val, op_3_val], 1)
#create a mask
argmax = tf.argmax(softmax, 1)
mask = tf.one_hot(argmax, num_of_operations)
#produce the input, by masking out those operation results which have not been selected
output = values * mask
I believe that this is not possible. This is similar to Hard Attention described in this paper. Hard attention is used in Image captioning to allow the model to focus only on a certain part of the image at each step. Hard attention is not differentiable but there are 2 ways to go around this:
1- Use Reinforcement Learning (RL): RL is made to train models that makes decisions. Even though, the loss function won't back-propagate any gradients to the softmax used for the decision, you can use RL techniques to optimize the decision. For a simplified example, you can consider the loss as penalty, and send to the node, with the maximum value in the softmax layer, a policy gradient proportional to the penalty in order to decrease the score of the decision if it was bad (results in a high loss).
2- Use something like soft attention: instead of picking only one operation, mix them with weights based on the softmax. so instead of:
output = values * mask
Use:
output = values * softmax
Now, the operations will converge down to zero based on how much the softmax will not select them. This is easier to train compared to RL but it won't work if you must completely remove the non-selected operations from the final result (set them to zero completely).
This is another answer that talks about Hard and Soft attention that you may find helpful: https://stackoverflow.com/a/35852153/6938290

Interpretation of a learning curve in machine learning

While following the Coursera-Machine Learning class, I wanted to test what I learned on another dataset and plot the learning curve for different algorithms.
I (quite randomly) chose the Online News Popularity Data Set, and tried to apply a linear regression to it.
Note : I'm aware it's probably a bad choice but I wanted to start with linear reg to see later how other models would fit better.
I trained a linear regression and plotted the following learning curve :
This result is particularly surprising for me, so I have questions about it :
Is this curve even remotely possible or is my code necessarily flawed?
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
If it is not, any hint to where I made a mistake?
Here's my code (Octave / Matlab) just in case:
Plot :
lambda = 0;
startPoint = 5000;
stepSize = 500;
[error_train, error_val] = ...
learningCurve([ones(mTrain, 1) X_train], y_train, ...
[ones(size(X_val, 1), 1) X_val], y_val, ...
lambda, startPoint, stepSize);
plot(error_train(:,1),error_train(:,2),error_val(:,1),error_val(:,2))
title('Learning curve for linear regression')
legend('Train', 'Cross Validation')
xlabel('Number of training examples')
ylabel('Error')
Learning curve :
S = ['Reg with '];
for i = startPoint:stepSize:m
temp_X = X(1:i,:);
temp_y = y(1:i);
% Initialize Theta
initial_theta = zeros(size(X, 2), 1);
% Create "short hand" for the cost function to be minimized
costFunction = #(t) linearRegCostFunction(X, y, t, lambda);
% Now, costFunction is a function that takes in only one argument
options = optimset('MaxIter', 50, 'GradObj', 'on');
% Minimize using fmincg
theta = fmincg(costFunction, initial_theta, options);
[J, grad] = linearRegCostFunction(temp_X, temp_y, theta, 0);
error_train = [error_train; [i J]];
[J, grad] = linearRegCostFunction(Xval, yval, theta, 0);
error_val = [error_val; [i J]];
fprintf('%s %6i examples \r', S, i);
fflush(stdout);
end
Edit : if I shuffle the whole dataset before splitting train/validation and doing the learning curve, I have very different results, like the 3 following :
Note : the training set size is always around 24k examples, and validation set around 8k examples.
Is this curve even remotely possible or is my code necessarily flawed?
It's possible, but not very likely. You might be picking the hard to predict instances for the training set and the easy ones for the test set all the time. Make sure you shuffle your data, and use 10 fold cross validation.
Even if you do all this, it is still possible for it to happen, without necessarily indicating a problem in the methodology or the implementation.
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
Let's assume that your data can only be properly fitted by a 3rd degree polynomial, and you're using linear regression. This means that the more data you add, the more obviously it will be that your model is inadequate (higher training error). Now, if you choose few instances for the test set, the error will be smaller, because linear vs 3rd degree might not show a big difference for too few test instances for this particular problem.
For example, if you do some regression on 2D points, and you always pick 2 points for your test set, you will always have 0 error for linear regression. An extreme example, but you get the idea.
How big is your test set?
Also, make sure that your test set remains constant throughout the plotting of the learning curves. Only the train set should increase.
If it is not, any hint to where I made a mistake?
Your test set might not be large enough or your train and test sets might not be properly randomized. You should shuffle the data and use 10 fold cross validation.
You might want to also try to find other research regarding that data set. What results are other people getting?
Regarding the update
That makes a bit more sense, I think. Test error is generally higher now. However, those errors look huge to me. Probably the most important information this gives you is that linear regression is very bad at fitting this data.
Once more, I suggest you do 10 fold cross validation for learning curves. Think of it as averaging all of your current plots into one. Also shuffle the data before running the process.

Resources