Get accuracy from true and predicted values - machine-learning

I have predicted_y and real_y.
Is there a faster way to get accuracy than:
import keras
from keras import backend as K

accuracy_array = K.eval(keras.metrics.categorical_accuracy(real_y, predicted_y))
print(sum(accuracy_array) / len(accuracy_array))

I would suggest using scikit-learn for this purpose, as I mentioned in my comment.
Example 1:
from sklearn import metrics
results = metrics.accuracy_score(real_y, predicted_y)
You can also get the classification report, which includes precision, recall, and F1-scores.
Example 2:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
              precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

 avg / total       0.70      0.60      0.61         5
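(Here support is the number of true instances per class. Class 1 scores zero across the board because its single instance was predicted as class 0, and the one class-1 prediction was actually a class 2.)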
Finally, for the confusion matrix use this:
Example 3:
from sklearn.metrics import confusion_matrix
y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
confusion_matrix(y_true, y_pred)
array([[1, 0, 0],
       [1, 0, 0],
       [0, 1, 2]])
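If you want rates instead of raw counts, a small sketch (not part of the original answer) row-normalizes the matrix so the diagonal holds each class's recall:
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1]
cm = confusion_matrix(y_true, y_pred)
# divide each row by its total; row i then sums to 1 and
# the diagonal entry is the recall of class i
print(cm.astype(float) / cm.sum(axis=1, keepdims=True))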

Try accuracy_score from scikit-learn.
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)                   # 0.5
accuracy_score(y_true, y_pred, normalize=False)  # 2 (count of correct predictions)

I wrote a Python lib for confusion matrix analysis; you can use it for your purpose.
>>> from pycm import *
>>> y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2] # or y_actu = numpy.array([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
>>> y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2] # or y_pred = numpy.array([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])
>>> cm = ConfusionMatrix(actual_vector=y_actu, predict_vector=y_pred) # Create CM From Data
>>> cm.classes
[0, 1, 2]
>>> cm.table
{0: {0: 3, 1: 0, 2: 0}, 1: {0: 0, 1: 1, 2: 2}, 2: {0: 2, 1: 1, 2: 3}}
>>> print(cm)
Predict  0  1  2
Actual
0        3  0  0
1        0  1  2
2        2  1  3
Overall Statistics :
95% CI (0.30439,0.86228)
Bennett_S 0.375
Chi-Squared 6.6
Chi-Squared DF 4
Conditional Entropy 0.95915
Cramer_V 0.5244
Cross Entropy 1.59352
Gwet_AC1 0.38931
Joint Entropy 2.45915
KL Divergence 0.09352
Kappa 0.35484
Kappa 95% CI (-0.07708,0.78675)
Kappa No Prevalence 0.16667
Kappa Standard Error 0.22036
Kappa Unbiased 0.34426
Lambda A 0.16667
Lambda B 0.42857
Mutual Information 0.52421
Overall_ACC 0.58333
Overall_RACC 0.35417
Overall_RACCU 0.36458
PPV_Macro 0.56667
PPV_Micro 0.58333
Phi-Squared 0.55
Reference Entropy 1.5
Response Entropy 1.48336
Scott_PI 0.34426
Standard Error 0.14232
Strength_Of_Agreement(Altman) Fair
Strength_Of_Agreement(Cicchetti) Poor
Strength_Of_Agreement(Fleiss) Poor
Strength_Of_Agreement(Landis and Koch) Fair
TPR_Macro 0.61111
TPR_Micro 0.58333
Class Statistics :
Classes 0 1 2
ACC(Accuracy) 0.83333 0.75 0.58333
BM(Informedness or bookmaker informedness) 0.77778 0.22222 0.16667
DOR(Diagnostic odds ratio) None 4.0 2.0
ERR(Error rate) 0.16667 0.25 0.41667
F0.5(F0.5 score) 0.65217 0.45455 0.57692
F1(F1 score - harmonic mean of precision and sensitivity) 0.75 0.4 0.54545
F2(F2 score) 0.88235 0.35714 0.51724
FDR(False discovery rate) 0.4 0.5 0.4
FN(False negative/miss/type 2 error) 0 2 3
FNR(Miss rate or false negative rate) 0.0 0.66667 0.5
FOR(False omission rate) 0.0 0.2 0.42857
FP(False positive/type 1 error/false alarm) 2 1 2
FPR(Fall-out or false positive rate) 0.22222 0.11111 0.33333
G(G-measure geometric mean of precision and sensitivity) 0.7746 0.40825 0.54772
LR+(Positive likelihood ratio) 4.5 3.0 1.5
LR-(Negative likelihood ratio) 0.0 0.75 0.75
MCC(Matthews correlation coefficient) 0.68313 0.2582 0.16903
MK(Markedness) 0.6 0.3 0.17143
N(Condition negative) 9 9 6
NPV(Negative predictive value) 1.0 0.8 0.57143
P(Condition positive) 3 3 6
POP(Population) 12 12 12
PPV(Precision or positive predictive value) 0.6 0.5 0.6
PRE(Prevalence) 0.25 0.25 0.5
RACC(Random accuracy) 0.10417 0.04167 0.20833
RACCU(Random accuracy unbiased) 0.11111 0.0434 0.21007
TN(True negative/correct rejection) 7 8 4
TNR(Specificity or true negative rate) 0.77778 0.88889 0.66667
TON(Test outcome negative) 7 10 7
TOP(Test outcome positive) 5 2 5
TP(True positive/hit) 3 1 3
TPR(Sensitivity, recall, hit rate, or true positive rate) 1.0 0.33333 0.5
>>> cm.matrix()
Predict  0  1  2
Actual
0        3  0  0
1        0  1  2
2        2  1  3
>>> cm.normalized_matrix()
Predict  0        1        2
Actual
0        1.0      0.0      0.0
1        0.0      0.33333  0.66667
2        0.33333  0.16667  0.5
Link: PyCM

Thanks to seralouk, I've found:
from sklearn import metrics
metrics.accuracy_score(real_y.argmax(axis=1), predicted_y.argmax(axis=1))
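As a quick sanity check (a toy sketch of my own, not from the original answers), the argmax round-trip works like this:
import numpy as np
from sklearn import metrics

real_y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])                          # one-hot labels
predicted_y = np.array([[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]])   # softmax outputs
# argmax recovers class indices, which is the label format accuracy_score expects
print(metrics.accuracy_score(real_y.argmax(axis=1), predicted_y.argmax(axis=1)))  # 1.0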

Related

How can I predict the weathers in the next days using LSTM or GRU?

I am a newbie in deep learning, using TensorFlow and Keras. I was given a small task: predict the weather in the coming days based on the past weather sequence, using neural networks. Example:
Given:  snowy -> sunny -> rainy -> sunny -> sunny -> snowy -> rainy -> snowy -> sunny -> ? -> ?
(days)    1        2        3        4        5        6        7        8        9     10   11
Problem: predict the weather in days 10, 11 based on "n" days in the past.
I am planning to use LSTM or GRU for this task, but I am not sure how to do it properly. All I can think of so far is to encode the weather with a one-hot encoder as
Snowy: 0 0 1
Rainy: 0 1 0
Sunny: 1 0 0
I do not know what to do next to train on the one-hot encoded data with LSTM or GRU networks. The code is below:
import numpy as np
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

lookBack = 3
stepsAhead = 2
categories = ["sunny", "rainy", "snowy"]
data = ["snowy", "sunny", "rainy", "sunny", "sunny", "snowy", "rainy", "snowy", "sunny"]
# Days     1        2        3        4        5        6        7        8        9

# Encode:
mapping = {}
for x in range(len(categories)):
    mapping[categories[x]] = x

one_hot_encode = []
for c in data:
    arr = list(np.zeros(len(categories), dtype=int))
    arr[mapping[c]] = 1
    one_hot_encode.append(arr)
print(one_hot_encode)
# The data after encoding: [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0],
#                           [0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

'''
? ? ?
I do not know how to do the next steps, such as:
how to create a proper validation dataset and testing dataset, or what LSTM
structure should be used...
'''
Could I ask for your help on how to do it? I would be thankful if you show me some sample codes.
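One common next step (a minimal sketch under assumptions of my own, continuing from the encoded list above; the layer sizes and epoch count are arbitrary) is to slice the sequence into sliding windows of lookBack days and train a small LSTM to predict the next day's one-hot vector:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

encoded = np.array(one_hot_encode, dtype=np.float32)  # shape (9, 3)
X, y = [], []
for i in range(len(encoded) - lookBack):
    X.append(encoded[i:i + lookBack])   # lookBack consecutive days
    y.append(encoded[i + lookBack])     # the day that follows
X, y = np.array(X), np.array(y)

model = Sequential([
    LSTM(16, input_shape=(lookBack, len(categories))),
    Dense(len(categories), activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=50, verbose=0)

# predict day 10, then feed the prediction back to predict day 11
window = encoded[-lookBack:]
for _ in range(stepsAhead):
    probs = model.predict(window[np.newaxis, ...], verbose=0)[0]
    print(categories[int(probs.argmax())])
    window = np.vstack([window[1:], np.eye(len(categories))[probs.argmax()]])
With so little data this is only illustrative; a real validation split would hold out the last few windows.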

How to get each class's results separately in a multiclass confusion matrix

I have the actual class and the result class here - https://extendsclass.com/csv-editor.html#46eaa9e
I want to calculate the sensitivity, specificity, and positive predictivity for each of the classes A, N, O. Here is my code:
from sklearn.metrics import multilabel_confusion_matrix, classification_report
import numpy as np

mcm = multilabel_confusion_matrix(act_class, pred_class)
tps = mcm[:, 1, 1]
tns = mcm[:, 0, 0]
recall = tps / (tps + mcm[:, 1, 0])        # Sensitivity
specificity = tns / (tns + mcm[:, 0, 1])   # Specificity
precision = tps / (tps + mcm[:, 0, 1])     # PPV
print(recall)
print(specificity)
print(precision)
print(classification_report(act_class, pred_class))
Which gives me results like this
[0.31818182 0.96186441 nan nan]
[0.99576271 0.86363636 0.86092715 0.99337748]
[0.95454545 0.96186441 0. 0. ]
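(The nan values appear because classes O and ~ never occur in act_class - their support is 0 - so the sensitivity denominator tps + fn is zero for them.)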
              precision    recall  f1-score   support

           A       0.95      0.32      0.48        66
           N       0.96      0.96      0.96       236
           O       0.00      0.00      0.00         0
           ~       0.00      0.00      0.00         0

    accuracy                           0.82       302
   macro avg       0.48      0.32      0.36       302
weighted avg       0.96      0.82      0.86       302
The problem is that I cannot clearly deduce the sensitivity, specificity, and positive predictivity for each of the classes A, N, O.
The key is the order of the labels. By default the labels occur in sorted order (for your problem: A, N, O, ~). If you want a different order, you can specify one with the labels= parameter. The following has two classes, and orders them as [3, 2]:
from sklearn.metrics import multilabel_confusion_matrix
from sklearn.metrics import classification_report
y_true = [2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3]
y_pred = [2, 2, 2, 3, 3, 2, 2, 2, 3, 3, 3, 3]
mcm = multilabel_confusion_matrix(y_true, y_pred, labels=[3, 2])
tps = mcm[:, 1, 1]
precision = tps / (tps + mcm[:, 0, 1])
print(precision)
print(f"Precision class 3: {precision[0]}. Precision class 2: {precision[1]}")
print(classification_report(y_true, y_pred, labels=[3, 2]))
Output:
[0.66666667 0.5 ]
Precision class 3: 0.6666666666666666. Precision class 2: 0.5
              precision    recall  f1-score   support

           3       0.67      0.57      0.62         7
           2       0.50      0.60      0.55         5

    accuracy                           0.58        12
   macro avg       0.58      0.59      0.58        12
weighted avg       0.60      0.58      0.59        12
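Applied back to the original question (a sketch assuming the act_class/pred_class vectors from above), pinning the label order makes each row of mcm unambiguous:
from sklearn.metrics import multilabel_confusion_matrix

labels = ['A', 'N', 'O', '~']
mcm = multilabel_confusion_matrix(act_class, pred_class, labels=labels)
tps, tns = mcm[:, 1, 1], mcm[:, 0, 0]
sensitivity = tps / (tps + mcm[:, 1, 0])   # TPR per class, in the order of `labels`
specificity = tns / (tns + mcm[:, 0, 1])   # TNR per class
ppv = tps / (tps + mcm[:, 0, 1])           # positive predictivity per class
for i, lab in enumerate(labels):
    print(lab, sensitivity[i], specificity[i], ppv[i])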

DCGAN error: Using a target size (torch.Size([128])) that is different to the input size (torch.Size([128, 1, 5, 5])) is deprecated

I keep getting this error when trying to run a DCGAN using the EMNIST dataset. I'm fairly new to this and am struggling to debug the issue.
class Generator(nn.Module):
    def __init__(self, ngpu):
        super(Generator, self).__init__()
        self.ngpu = ngpu
        self.main = nn.Sequential(
            # input is Z, going into a convolution
            nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            # state size. (ngf*8) x 4 x 4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # state size. (ngf*4) x 8 x 8
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # state size. (ngf*2) x 16 x 16
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # state size. (ngf) x 32 x 32
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh()
            # state size. (nc) x 64 x 64
        )

    def forward(self, input):
        return self.main(input)


class Discriminator(nn.Module):
    def __init__(self, ngpu):
        super(Discriminator, self).__init__()
        self.ngpu = ngpu
        self.main = nn.Sequential(
            # input is (nc) x 64 x 64
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf) x 32 x 32
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*2) x 16 x 16
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*4) x 8 x 8
            #nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            #nn.BatchNorm2d(ndf * 8),
            #nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*8) x 4 x 4
            nn.Conv2d(ndf * 4, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.main(input)
# Lists to keep track of progress
img_list = []
G_losses = []
D_losses = []
iters = 0

print("Starting Training Loop...")
# For each epoch
for epoch in range(num_epochs):
    # For each batch in the dataloader
    for i, data in enumerate(dataloader, 0):

        ############################
        # (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
        ###########################
        ## Train with all-real batch
        netD.zero_grad()
        # Format batch
        real_cpu = data[0].to(device)
        b_size = real_cpu.size(0)
        label = torch.full((b_size,), real_label, dtype=torch.float, device=device)
        # Forward pass real batch through D
        output = netD(real_cpu).view(-1)
        # Calculate loss on all-real batch
        errD_real = criterion(output, label)
        # Calculate gradients for D in backward pass
        errD_real.backward()
        D_x = output.mean().item()

        ## Train with all-fake batch
        # Generate batch of latent vectors
        noise = torch.randn(b_size, nz, 1, 1, device=device)
        # Generate fake image batch with G
        fake = netG(noise)
        label.fill_(fake_label)
        # Classify all fake batch with D
        output = netD(fake.detach()).view(-1)
        # Calculate D's loss on the all-fake batch
        errD_fake = criterion(output, label)
        # Calculate the gradients for this batch, accumulated (summed) with previous gradients
        errD_fake.backward()
        D_G_z1 = output.mean().item()
        # Compute error of D as sum over the fake and the real batches
        errD = errD_real + errD_fake
        # Update D
        optimizerD.step()

        ############################
        # (2) Update G network: maximize log(D(G(z)))
        ###########################
        netG.zero_grad()
        label.fill_(real_label)  # fake labels are real for generator cost
        # Since we just updated D, perform another forward pass of all-fake batch through D
        output = netD(fake).view(-1)
        # Calculate G's loss based on this output
        errG = criterion(output, label)
        # Calculate gradients for G
        errG.backward()
        D_G_z2 = output.mean().item()
        # Update G
        optimizerG.step()

        # Output training stats
        if i % 50 == 0:
            print('[%d/%d][%d/%d]\tLoss_D: %.4f\tLoss_G: %.4f\tD(x): %.4f\tD(G(z)): %.4f / %.4f'
                  % (epoch, num_epochs, i, len(dataloader),
                     errD.item(), errG.item(), D_x, D_G_z1, D_G_z2))

        # Save Losses for plotting later
        G_losses.append(errG.item())
        D_losses.append(errD.item())

        # Check how the generator is doing by saving G's output on fixed_noise
        if (iters % 500 == 0) or ((epoch == num_epochs-1) and (i == len(dataloader)-1)):
            with torch.no_grad():
                fake = netG(fixed_noise).detach().cpu()
            img_list.append(vutils.make_grid(fake, padding=2, normalize=True))

        iters += 1
I want to do this with just the Conv2d layers that are currently in use, and I have tried using .squeeze() to get it working, but had no luck.
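A likely cause, reading the discriminator above (a hedged diagnosis, not a confirmed fix): with the fourth downsampling block commented out, the final Conv2d receives an 8 x 8 feature map, and a 4 x 4 kernel with stride 1 and no padding maps 8 x 8 to 5 x 5 rather than 1 x 1, which is where the [128, 1, 5, 5] shape in the error comes from. squeeze() cannot help because there are 25 values per image, not 1. Assuming 64 x 64 inputs, either restore the commented-out (ndf*4 -> ndf*8) block, or widen the final kernel so it covers the whole 8 x 8 map, e.g.:
# sketch: final discriminator layers if only three downsampling blocks are kept
nn.Conv2d(ndf * 4, 1, 8, 1, 0, bias=False),  # 8x8 feature map -> 1x1 score
nn.Sigmoid()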

Isolation Forest getting accuracy score 0.0 [closed]

I'm trying to train on this dataset with IsolationForest(); I then need to use the trained model on another dataset with altered qualities to predict the quality values and fetch all wines with quality 8 and 9.
However, I'm having some problems with it, because the accuracy score from the classification report is 0.0:
print(classification_report(y_test, prediction))
              precision    recall  f1-score   support

          -1       0.00      0.00      0.00       0.0
           1       0.00      0.00      0.00       0.0
           3       0.00      0.00      0.00     866.0
           4       0.00      0.00      0.00     829.0
           5       0.00      0.00      0.00     841.0
           6       0.00      0.00      0.00     861.0
           7       0.00      0.00      0.00     822.0
           8       0.00      0.00      0.00     886.0
           9       0.00      0.00      0.00     851.0

    accuracy                           0.00    5956.0
   macro avg       0.00      0.00      0.00    5956.0
weighted avg       0.00      0.00      0.00    5956.0
I don't know if it's a hyperparameter issue, whether I'm cleaning the wrong data, or whether I'm passing wrong parameters. I have already tried with and without SMOTE; I want to reach an accuracy of at least 90%.
I'll leave the shared drive link public for dataset verification:
https://drive.google.com/drive/folders/18_sOSIZZw9DCW7ftEKuOG4aIzGXoasFe?usp=sharing
Here's my code:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import IsolationForest
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report,confusion_matrix
df = pd.read_csv('wines.csv')
df.head(5)
ordinalEncoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-99).fit(df[['color']])
df[['color']] = ordinalEncoder.transform(df[['color']])
df.info()
df['color'] = df['color'].astype(int)
df.head(3)
stm = SMOTE(k_neighbors=4)
x_smote = df.drop('quality',axis=1)
y_smote = df['quality']
x_smote,y_smote = stm.fit_resample(x_smote,y_smote)
print(x_smote.shape,y_smote.shape)
x_smote.columns
scaler = StandardScaler()
X = scaler.fit_transform(x_smote)
y = y_smote
X.shape, y.shape
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.3)
from sklearn.ensemble import IsolationForest
from sklearn.metrics import hamming_loss
iforest = IsolationForest(n_estimators=200, max_samples=0.1, contamination=0.10,
                          max_features=1.0, bootstrap=False, n_jobs=-1,
                          random_state=None, verbose=0, warm_start=False)
iforest_fit = iforest.fit(x_train,y_train)
prediction = iforest_fit.predict(x_test)
print (prediction.shape, y_test.shape)
y.value_counts()
prediction
print(confusion_matrix(y_test, prediction))
hamming_loss(y_test, prediction)
from sklearn.metrics import classification_report
print(classification_report(y_test, prediction))
May I ask why you chose Isolation Forest as your model? This article says that Isolation Forest is an unsupervised learning algorithm for anomaly detection.
When I print some samples of the prediction (by Isolation Forest) alongside samples of the actual truth, I get the following results, which show why the accuracy score is 0.0:
print(list(prediction[0:15]))
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(list(y_test[0:15]))
[9, 4, 4, 7, 9, 3, 6, 7, 4, 8, 8, 7, 3, 8, 5]
The wines.csv dataset and your code both point towards a multi-class classification problem. I have chosen RandomForestClassifier() to continue with the second part of your code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss
model = RandomForestClassifier()
model.fit(x_train,y_train)
prediction = model.predict(x_test)
print(prediction[0:15]) #see 15 samples of prediction
[3, 9, 5, 5, 7, 9, 7, 6, 9, 8, 5, 9, 8, 3, 3]
print(list(y_test[0:15])) #see 15 samples of actual truth
[3, 9, 5, 6, 6, 9, 7, 5, 9, 8, 5, 9, 8, 3, 3]
print(confusion_matrix(y_test, prediction))
[[842   0   0   0   0   0   0]
 [  2 815  17   8   1   1   0]
 [  8  50 690 130  26   2   0]
 [  2  28 152 531 128  16   0]
 [  4   1  15  66 716  32   3]
 [  0   1   0   4  12 833   0]
 [  0   0   0   0   0   0 820]]
print('hamming_loss =', hamming_loss(y_test, prediction))
hamming_loss = 0.11903962390866353
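(Aside: for a single-label multiclass problem, the Hamming loss is simply the fraction of misclassified samples, i.e. 1 - accuracy, so 0.119 is consistent with the 0.88 accuracy below.)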
print(classification_report(y_test, prediction))
              precision    recall  f1-score   support

           3       0.98      1.00      0.99       842
           4       0.91      0.97      0.94       844
           5       0.79      0.76      0.78       906
           6       0.72      0.62      0.67       857
           7       0.81      0.86      0.83       837
           8       0.94      0.98      0.96       850
           9       1.00      1.00      1.00       820

    accuracy                           0.88      5956
   macro avg       0.88      0.88      0.88      5956
weighted avg       0.88      0.88      0.88      5956
The accuracy is already 0.88 even before tuning hyperparameters.
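If you then want to push past 0.88, a small tuning sketch (the parameter grid here is a hypothetical starting point, not tested on this dataset):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 20]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, n_jobs=-1)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)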

TimeDistributed(Dense) vs Dense in Keras - Same number of parameters

I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.
My simplified model is the following:
InputSize = 15
MaxLen = 64
HiddenSize = 16
inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)
The summary of the network is:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536
_________________________________________________________________
time_distributed_1 (TimeDist (None, 64, 15)            255
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0
=================================================================
This makes sense to me as my understanding of TimeDistributed is that it applies the same layer at all timepoints, and so the Dense layer has 16*15+15=255 parameters (weights+biases).
However, if I switch to a simple Dense layer:
inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)
I still only have 255 parameters:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536
_________________________________________________________________
dense_1 (Dense)              (None, 64, 15)            255
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0
=================================================================
I wonder if this is because Dense() will only use the last dimension in the shape, and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).
Update: Looking at https://github.com/fchollet/keras/blob/master/keras/layers/core.py, it does seem that Dense uses the last dimension only to size itself:
def build(self, input_shape):
    assert len(input_shape) >= 2
    input_dim = input_shape[-1]
    self.kernel = self.add_weight(shape=(input_dim, self.units),
It also uses keras.dot to apply the weights:
def call(self, inputs):
    output = K.dot(inputs, self.kernel)
The docs of keras.dot imply that it works fine on n-dimensional tensors. I wonder if its exact behavior means that Dense() will in effect be called at every time step. If so, the question remains: what does TimeDistributed() achieve in this case?
TimeDistributedDense applies the same Dense layer to every time step during GRU/LSTM cell unrolling, so the error function is computed between the predicted label sequence and the actual label sequence (which is normally the requirement for sequence-to-sequence labeling problems).
However, with return_sequences=False, the Dense layer is applied only once, at the last cell. This is normally the case when RNNs are used for classification problems. If return_sequences=True, then the Dense layer is applied at every timestep, just like TimeDistributedDense.
So, as per your models, both are the same; but if you change your second model to return_sequences=False, the Dense layer will be applied only at the last cell. Try changing it and the model will throw an error, because then Y would have to be of size [Batch_size, InputSize]; it is no longer sequence-to-sequence but a sequence-to-label problem.
from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed, GRU
import numpy as np

InputSize = 15
MaxLen = 64
HiddenSize = 16
OutputSize = 8
n_samples = 1000

# sequence-to-sequence: TimeDistributed(Dense) after return_sequences=True
model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# sequence-to-sequence: plain Dense after return_sequences=True
model2 = Sequential()
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model2.add(Dense(OutputSize))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# sequence-to-label: Dense after return_sequences=False
model3 = Sequential()
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
model3.add(Dense(OutputSize))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')

X = np.random.random([n_samples, MaxLen, InputSize])
Y1 = np.random.random([n_samples, MaxLen, OutputSize])
Y2 = np.random.random([n_samples, OutputSize])

model1.fit(X, Y1, batch_size=128, epochs=1)
model2.fit(X, Y1, batch_size=128, epochs=1)
model3.fit(X, Y2, batch_size=128, epochs=1)

print(model1.summary())
print(model2.summary())
print(model3.summary())
In the above example, the architectures of model1 and model2 are the same (sequence-to-sequence models), while model3 is a full sequence-to-label model.
Here is a piece of code that verifies that TimeDistributed(Dense(X)) is identical to Dense(X):
import numpy as np
from keras.layers import Dense, TimeDistributed
import tensorflow as tf

X = np.array([[[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9],
               [10, 11, 12]],
              [[3, 1, 7],
               [8, 2, 5],
               [11, 10, 4],
               [9, 6, 12]]]).astype(np.float32)
print(X.shape)
# (2, 4, 3)

dense_weights = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
                          [0.2, 0.7, 0.9, 0.1, 0.2],
                          [0.1, 0.8, 0.6, 0.2, 0.4]])
bias = np.array([0.1, 0.3, 0.7, 0.8, 0.4])
print(dense_weights.shape)
# (3, 5)

dense = Dense(input_dim=3, units=5, weights=[dense_weights, bias])
input_tensor = tf.Variable(X, name='inputX')
output_tensor1 = dense(input_tensor)
output_tensor2 = TimeDistributed(dense)(input_tensor)
print(output_tensor1.shape)
print(output_tensor2.shape)
# (2, 4, 5)
# (2, ?, 5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    output1 = sess.run(output_tensor1)
    output2 = sess.run(output_tensor2)
print(output1 - output2)
And the difference is:
[[[0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]]

 [[0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]]]
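All zeros: on a 3-D sequence input, Dense and TimeDistributed(Dense) produce identical outputs, confirming that Dense is applied independently at every timestep.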
