How to combine confusion matrices from the outputs of multiple algorithms?

How to combine confusion matrices from the outputs of multiple algorithms? - join

I want to extract prediction confusion matrices of multiple prediction models in one table for easier comparison.
I collected confusion matrix tables from each model and combined them to created the following table.
Table1 <- rbind(confmat_nb$table, confmat_nb_2$table, confmat_svm$table)
Table1
Reference
excluded included
naive_bayes excluded 6234 46
included 3107 470
naive_bayes_2 excluded 5774 60
included 3567 456
svm excluded 7197 101
included 2144 415
Table 2 was created using the following code
byClass <- rbind(naive_bayes =round(confmat_nb$byClass, 3),
naive_bayes_2=round(confmat_nb_2$byClass, 3),
svm = round(confmat_svm$byClass, 3)
overall <- rbind(naive_bayes =round(confmat_nb$overall, 3),
naive_bayes_2=round(confmat_nb_2$overall, 3),
svm = round(confmat_svm$overall, 3)
Table2 <- rbind(byClass, overall)
Table2
Accuracy Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
naive_bayes 0.680 0.911 0.667 0.131 0.993 0.131
naive_bayes_2 0.632 0.884 0.618 0.113 0.990 0.113
svm 0.772 0.804 0.770 0.162 0.986 0.162
My goal is to merge Table 1 and Table 2 and extract the outputs in one table. Something like this

Related

Why same paths of xgboost tree give 2 different predictions?

I'm trying to investigate the xgboost prediction.
It seems that 2 inputs with same 2 paths gives 2 different predictions.
I'm running on the following dateset:
f1,f2,f3,f4,f5,f6,f7,f8,y
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30.0,0.484,32,1
0,118,84,47,230,45.8,0.551,31,1
7,107,74,0,0,29.6,0.254,31,1
1,103,30,38,83,43.3,0.183,33,0
1,115,70,30,96,34.6,0.529,32,1
3,126,88,41,235,39.3,0.704,27,0
8,99,84,0,0,35.4,0.388,50,0
7,196,90,0,0,39.8,0.451,41,1
9,119,80,35,0,29.0,0.263,29,1
11,143,94,33,146,36.6,0.254,51,1
10,125,70,26,115,31.1,0.205,41,1
7,147,76,0,0,39.4,0.257,43,1
1,97,66,15,140,23.2,0.487,22,0
13,145,82,19,110,22.2,0.245,57,0
5,117,92,0,0,34.1,0.337,38,0
5,109,75,26,0,36.0,0.546,60,0
3,158,76,36,245,31.6,0.851,28,1
3,88,58,11,54,24.8,0.267,22,0
6,92,92,0,0,19.9,0.188,28,0
10,122,78,31,0,27.6,0.512,45,0
4,103,60,33,192,24.0,0.966,33,0
11,138,76,0,0,33.2,0.420,35,0
9,102,76,37,0,32.9,0.665,46,1
2,90,68,42,0,38.2,0.503,27,1
prediction and tree creation code:
df = pd.read_csv("input.csv")
x = df[['f1','f2','f3', 'f4', 'f5', 'f6','f7','f8']]
y = df[['y']]
X_train, X_test, y_train, y_test = train_test_split( x, y, test_size = 0.33, random_state = 42)
model = XGBClassifier(n_jobs=-1)
model.fit(X_train, y_train)
res = model.predict(X_test)
print ("X_test (first 2 rows:")
print(X_test.head(2))
print("Predictions (first 2 rows:")
print(res[0:2])
plot_tree(model)
plt.show()
Output:
X_test (first 2 rows:
f1 f2 f3 f4 f5 f6 f7 f8
33 6 92 92 0 0 19.9 0.188 28
36 11 138 76 0 0 33.2 0.420 35
Predictions (first 2 rows:
[0 1]
The same 2 inputs have f2<146.5 and f4=0 => get into same leaf (-0.34)
So why the prediction for those 2 are different ? (0 and 1) ?

What you have plotted in not the whole XGBoost model; it is only the first tree of it.
To see why this is so, check the source code of plot_tree:
def plot_tree(booster, fmap='', num_trees=0, rankdir=None, ax=None, **kwargs):
"""Plot specified tree.
and the documentation:
num_trees (int, default 0) – Specify the ordinal number of target tree
from where it is apparent that, when you do not specify the num_trees argument, like here, it takes the default value of 0, i.e. the first tree of the ensemble.
Using different values for num_trees you will get different trees, hence different decision paths for each sample.
You cannot plot all the trees of the boosting ensemble (and even if you could, it would not be of any practical use). plot_tree is just a utility function in order to be able to have a look at individual trees of the model. You can have a look at its use in How to Visualize Gradient Boosting Decision Trees With XGBoost in Python.

LSTM multiple feature regression data preparation

I am modeling an LSTM model that contains multiple features and one target value. It is a regression problem.
I have doubts that my data preparation for the LSTM is erroneous; mainly because the model learns nothing but the average of the target value.
The following code I wrote is for preparing the data for the LSTM:
# df is a pandas data frame that contains the feature columns (f1 to f5) and the target value named 'target'
# all columns of the df are time series data (including the 'target')
# seq_length is the sequence length
def prepare_data_multiple_feature(df):
X = []
y = []
for x in range(len(df)):
start_id = x
end_id = x + seq_length
one_data_point = []
if end_id + 1 <= len(df):
# prepare X
for col in ['f1', 'f2', 'f3', 'f4', 'f5']:
one_data_point.append(np.array(df[col].values[start_id:end_id]))
X.append(np.array(one_data_point))
# prepare y
y.append(np.array(df['target'].values[end_id ]))
assert len(y) == len(X)
return X, y
Then, I reshape the data as follows:
X, y = prepare_data_multiple_feature(df)
X = X.reshape((len(X), seq_length, 5)) #5 is the number of features, i.e., f1 to f5
is my data preparation method and data reshaping correct?

As #isp-zax mentioned, please provide a reprex so we could reproduce the outcome and see where the problem lies.
As an aside, you could use for col in df.columns instead of listing all the column names and (minor optimisation) the first loop should be executed for x in range(len(df) - seq_length), otherwise at the end you execute the loop seq_length - 1 many times without actually processing any data. Also, df.values[a, b] will not include the element at index b so if you want to include the "window" with last row inside your X the end_id can be equal to the len(df), i.e. you could execute your inner condition (prepare and append) for if end_id <= len(df):
Apart from that I think it would be simpler to read if you sliced the dataframe across columns and rows at the same time, without using one_data_point, i.e.
to select seq_length rows without the (last) target column, simply do:
df.values[start_id, end_id, :-1]

Recursive Feature Elimination (RFE) SKLearn

I created a table to test my understanding
F1 F2 Outcome
0 2 5 1
1 4 8 2
2 6 0 3
3 9 8 4
4 10 6 5
From F1 and F2 I tried to predict Outcome
As you can see F1 have a strong correlation to Outcome,F2 is random noise
I tested
pca = PCA(n_components=2)
fit = pca.fit(X)
print("Explained Variance")
print(fit.explained_variance_ratio_)
Explained Variance
[ 0.57554896 0.42445104]
Which is what I expected and shows that F1 is more important
However when I do RFE (Recursive Feature Elimination)
model = LogisticRegression()
rfe = RFE(model, 1)
fit = rfe.fit(X, Y)
print(fit.n_features_)
print(fit.support_)
print(fit.ranking_)
1
[False True]
[2 1]
It asked me to keep F2 instead? It should ask me to keep F1 since F1 is a strong predictor while F2 is random noise... why F2?
Thanks

It is advisable to do a Recursive Feature Elimination Cross Validation (RFECV) before running the Recursive Feature Elimination (RFE)
Here is an example:
Having columns :
df.columns = ['age', 'id', 'sex', 'height', 'gender', 'marital status', 'income', 'race']
Use RFECV to identify the optimal number of features needed.
from sklearn.ensemble import RandomForestClassifier
rfe = RandomForestClassifier(random_state = 32) # Instantiate the algo
rfecv = RFECV(estimator= rfe, step=1, cv=StratifiedKFold(2), scoring="accuracy") # Instantiate the RFECV and its parameters
fit = rfecv.fit(features(or X), target(or y))
print("Optimal number of features : %d" % rfecv.n_features_)
>>>> Optimal number of output is 4
Now that the optimal number of features has been known, we can use Recursive Feature Elimination to identify the Optimal features
from sklearn.feature_selection import RFE
min_features_to_select = 1
rfc = RandomForestClassifier()
rfe = RFE(estimator=rfc, n_features_to_select= 4, step=1)
fittings1 = rfe.fit(features, target)
for i in range(features.shape[1]):
print('Column: %d, Selected %s, Rank: %.3f' % (i, rfe.support_[i], rfe.ranking_[i]))
output will be something like:
>>> Column: 0, Selected True, Rank: 1.000
>>> Column: 1, Selected False, Rank: 4.000
>>> Column: 2, Selected False, Rank: 7.000
>>> Column: 3, Selected False, Rank: 10.000
>>> Column: 4, Selected True, Rank: 1.000
>>> Column: 5, Selected False, Rank: 3.000
Now display the features to remove based on recursive feature elimination done above
columns_to_remove = features.columns.values[np.logical_not(rfe.support_)]
columns_to_remove
output will be something like:
>>> array(['age', 'id', 'race'], dtype=object)
Now create your new dataset by dropping the un-needed features and selecting the needed one
new_df = df.drop(['age', 'id', 'race'], axis = 1)
Then you can cross validation to know how well this newly selected features (new_df) predicts the target column.
# Check how well the features predict the target variable using cross_validation
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
scores = cross_val_score(RandomForestClassifier(), new_df, target, cv= cv)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
>>> 0.84 accuracy with a standard deviation of 0.01
Don't forget you can also read up on the best Cross-Validation (CV) parameters to use in this documentation
Recursive Feature Elimination (RFE) documentation to learn more and understand better
Recursive Feature Elimination Cross validation (RFECV) documentation

You are using LogisticRegression model. This is a classifier, not a regressor. So your outcome here is treated as labels (not numbers). For good training and prediction, a classifier needs multiple samples of each class. But in your data, only single row is present for each class. Hence the results are garbage and not to be taken seriously.
Try replacing that with any regression model and you will see the outcome which you thought would be.
model = LinearRegression()
rfe = RFE(model, 1)
fit = rfe.fit(X, y)
print(fit.n_features_)
print(fit.support_)
print(fit.ranking_)
# Output
1
[ True False]
[1 2]

Model interpretation using timeslice method in CARET

Suppose you want to evaluate a simple glm model to forecast an economic data series.
Consider the following code:
library(caret)
library(ggplot2)
data(economics)
h <- 7
myTimeControl <- trainControl(method = "timeslice",
initialWindow = 24*h,
horizon = 12,
fixedWindow = TRUE)
fit.glm <- train(unemploy ~ pce + pop + psavert,
data = economics,
method = "glm",
preProc = c("center", "scale","BoxCox"),
trControl = myTimeControl)
Suppose that the covariates used into the train formula are predictions of values obtained by some other model.
This simple model gives the following results:
Generalized Linear Model
574 samples
3 predictor
Pre-processing: centered (3), scaled (3), Box-Cox transformation (3)
Resampling: Rolling Forecasting Origin Resampling (12 held-out with a fixed
window)
Summary of sample sizes: 168, 168, 168, 168, 168, 168, ...
Resampling results:
RMSE Rsquared
1446.335 0.2958317
Apart from the bad results obtained (this is only an example).
I wonder if it is correct:
To consider the above results as results obtained, on the entire dataset, by a GLM trained using only 24*h=24*7 samples and retrained after every horizon=12 samples
How evaluate RMSE as horizon grows from 1 to 12 (as reported here http://robjhyndman.com/hyndsight/tscvexample/ )?
if I show fit.glm summary I obtain:
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-5090.0 -1025.5 -208.1 833.4 4948.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7771.56 64.93 119.688 < 2e-16 ***
pce 5750.27 1153.03 4.987 8.15e-07 ***
pop -1483.01 1117.06 -1.328 0.185
psavert 2932.38 144.56 20.286 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 2420081)
Null deviance: 3999514594 on 573 degrees of freedom
Residual deviance: 1379446256 on 570 degrees of freedom
AIC: 10072
Number of Fisher Scoring iterations: 2
The parameters showed refer to the last trained GLM or are "average" paramters?
I hope I've been clear enough.

This resampling method is like any others. The RMSE is estimated using different subsets of the training data. Note that it says "Summary of sample sizes: 168, 168, 168, 168, 168, 168, ...". The final model uses all of the training data set.
The difference between Rob's results and these are primarily due to the difference between Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE)

How to calculate the number of parameters of an LSTM network?

Is there a way to calculate the total number of parameters in a LSTM network.
I have found a example but I'm unsure of how correct this is or If I have understood it correctly.
For eg consider the following example:-
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import LSTM
model = Sequential()
model.add(LSTM(256, input_dim=4096, input_length=16))
model.summary()
Output
____________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
lstm_1 (LSTM) (None, 256) 4457472 lstm_input_1[0][0]
====================================================================================================
Total params: 4457472
____________________________________________________________________________________________________
As per My understanding n is the input vector lenght.
And m is the number of time steps. and in this example they consider the number of hidden layers to be 1.
Hence according to the formula in the post. 4(nm+n^2) in my example m=16;n=4096;num_of_units=256
4*((4096*16)+(4096*4096))*256 = 17246978048
Why is there such a difference?
Did I misunderstand the example or was the formula wrong ?

No - the number of parameters of a LSTM layer in Keras equals to:
params = 4 * ((size_of_input + 1) * size_of_output + size_of_output^2)
Additional 1 comes from bias terms. So n is size of input (increased by the bias term) and m is size of output of a LSTM layer.
So finally :
4 * (4097 * 256 + 256^2) = 4457472

image via this post
num_params = [(num_units + input_dim + 1) * num_units] * 4
num_units + input_dim: concat [h(t-1), x(t)]
+ 1: bias
* 4: there are 4 neural network layers (yellow box) {W_forget, W_input, W_output, W_cell}
model.add(LSTM(units=256, input_dim=4096, input_length=16))
[(256 + 4096 + 1) * 256] * 4 = 4457472
PS: num_units = num_hidden_units = output_dims

I think it would be easier to understand if we start with a simple RNN.
Let's assume that we have 4 units (please ignore the ... in the network and concentrate only on visible units), and the input size (number of dimensions) is 3:
The number of weights is 28 = 16 (num_units * num_units) for the recurrent connections + 12 (input_dim * num_units) for input. The number of biases is simply num_units.
Recurrency means that each neuron output is fed back into the whole network, so if we unroll it in time sequence, it looks like two dense layers:
and that makes it clear why we have num_units * num_units weights for the recurrent part.
The number of parameters for this simple RNN is 32 = 4 * 4 + 3 * 4 + 4, which can be expressed as num_units * num_units + input_dim * num_units + num_units or num_units * (num_units + input_dim + 1)
Now, for LSTM, we must multiply the number of of these parameters by 4, as this is the number of sub-parameters inside each unit, and it was nicely illustrated in the answer by #FelixHo

Formula expanding for #JohnStrong :
4 means we have different weight and bias variables for 3 gates (read / write / froget) and - 4-th - for the cell state (within same hidden state).
(These mentioned are shared among timesteps along particular hidden state vector)
4 * lstm_hidden_state_size * (lstm_inputs_size + bias_variable + lstm_outputs_size)
as LSTM output (y) is h (hidden state) by approach, so, without an extra projection, for LSTM outputs we have :
lstm_hidden_state_size = lstm_outputs_size
let's say it's d :
d = lstm_hidden_state_size = lstm_outputs_size
Then
params = 4 * d * ((lstm_inputs_size + 1) + d) = 4 * ((lstm_inputs_size + 1) * d + d^2)

LSTM Equations (via deeplearning.ai Coursera)
It is evident from the equations that the final dimensions of all the 6 equations will be same and final dimension must necessarily be equal to the dimension of a(t).
Out of these 6 equations, only 4 equations contribute to the number of parameters and by looking at the equations, it can be deduced that all the 4 equations are symmetric. So,if we find out the number of parameters for 1 equation, we can just multiply it by 4 and tell the total number of parameters.
One important point is to note that the total number of parameters doesn't depend on the time-steps(or input_length) as same "W" and "b" is shared throughout the time-step.
Assuming, insider of LSTM cell having just one layer for a gate(as that in Keras).
Take equation 1 and lets relate. Let number of neurons in the layer be n and number of dimension of x be m (not including number of example and time-steps). Therefore, dimension of forget gate will be n too. Now,same as that in ANN, dimension of "Wf" will be n*(n+m) and dimension of "bf" will be n. Therefore, total number of parameters for one equation will be [{n*(n+m)} + n]. Therefore, total number of parameters will be 4*[{n*(n+m)} + n].Lets open the brackets and we will get -> 4*(nm + n2 + n).
So,as per your values. Feeding it into the formula gives:->(n=256,m=4096),total number of parameters is 4*((256*256) + (256*4096) + (256) ) = 4*(1114368) = 4457472.

The others have pretty much answered it. But just for further clarification, on creating an LSTM layer. The number of params is as follows:
No of params= 4*((num_features used+1)*num_units+
num_units^2)
The +1 is because of the additional bias we take.
Where the num_features is the num_features in your input shape to the LSTM:
Input_shape=(window_size,num_features)

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

How to combine confusion matrices from the outputs of multiple algorithms? - join

Related

Why same paths of xgboost tree give 2 different predictions?

LSTM multiple feature regression data preparation

Recursive Feature Elimination (RFE) SKLearn

Model interpretation using timeslice method in CARET

How to calculate the number of parameters of an LSTM network?

Categories

Resources