Number of Images in Training and Test sets of tfds.load() - machine-learning

I just want to know if the total number of images in train_data are 60000 or 10000. I tried train_data.shape() but it just returns
DatasetV1Adapter shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}
Please let me know how can I know the number of images.
(train_data, test_data), metadata = tfds.load(name = 'mnist',
split = ['train', 'test'],
with_info = True,)

Given your syntax:
(train_data, test_data), metadata = tfds.load(name = 'mnist',
split = ['train', 'test'],
with_info = True)
The option with_info=True returns the additional info, which is captured by the 2nd variable metadata in your case.
Hence, the size of train_data and test_data can simply be checked as follows:
print(metadata)
Further info: print(train_data.shape) will not work as it's not numpy array.

You could use:
builder = tfds.builder('mnist')
info = builder.info
print(info)
Output:
tfds.core.DatasetInfo(
name='mnist',
version=3.0.1,
description='The MNIST database of handwritten digits.',
homepage='http://yann.lecun.com/exdb/mnist/',
features=FeaturesDict({
'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
}),
total_num_examples=70000,
splits={
'test': 10000,
'train': 60000,
},
supervised_keys=('image', 'label'),
citation="""#article{lecun2010mnist,
title={MNIST handwritten digit database},
author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
volume={2},
year={2010}
}""",
redistribution_info=,
)
From the info metadata you can extract information with the following code:
info.features
Output:
FeaturesDict({
'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
})
Number of classes, label names:
print(info.features["label"].num_classes)
print(info.features["label"].names)
Output:
10
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Shapes, data types:
print(info.features.shape)
print(info.features.dtype)
print(info.features['image'].shape)
print(info.features['image'].dtype)
Output:
{'image': (28, 28, 1), 'label': ()}
{'image': tf.uint8, 'label': tf.int64}
(28, 28, 1)
<dtype: 'uint8'>
Split information:
print(info.splits)
Output:
{'test': <tfds.core.SplitInfo num_examples=10000>, 'train': <tfds.core.SplitInfo num_examples=60000>}
Available splits:
print(list(info.splits.keys()))
Output:
['test', 'train']
And some more. You can see the above examples and more on the tensorflow website # https://www.tensorflow.org/datasets/overview

Related

How to compute mean/max of HuggingFace Transformers BERT token embeddings with attention mask?

I'm using the HuggingFace Transformers BERT model, and I want to compute a summary vector (a.k.a. embedding) over the tokens in a sentence, using either the mean or max function. The complication is that some tokens are [PAD], so I want to ignore the vectors for those tokens when computing the average or max.
Here's an example. I initially instantiate a BertTokenizer and a BertModel:
import torch
import transformers
from transformers import AutoTokenizer, AutoModel
transformer_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(transformer_name, use_fast=True)
model = AutoModel.from_pretrained(transformer_name)
I then input some sentences into the tokenizer and get out input_ids and attention_mask. Notably, an attention_mask value of 0 means that the token was a [PAD] that I can ignore.
sentences = ['Deep learning is difficult yet very rewarding.',
'Deep learning is not easy.',
'But is rewarding if done right.']
tokenizer_result = tokenizer(sentences, max_length=32, padding=True, return_attention_mask=True, return_tensors='pt')
input_ids = tokenizer_result.input_ids
attention_mask = tokenizer_result.attention_mask
print(input_ids.shape) # torch.Size([3, 11])
print(input_ids)
# tensor([[ 101, 2784, 4083, 2003, 3697, 2664, 2200, 10377, 2075, 1012, 102],
# [ 101, 2784, 4083, 2003, 2025, 3733, 1012, 102, 0, 0, 0],
# [ 101, 2021, 2003, 10377, 2075, 2065, 2589, 2157, 1012, 102, 0]])
print(attention_mask.shape) # torch.Size([3, 11])
print(attention_mask)
# tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
# [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])
Now, I call the BERT model to get the 768-D token embeddings (the top-layer hidden states).
model_result = model(input_ids, attention_mask=attention_mask, return_dict=True)
token_embeddings = model_result.last_hidden_state
print(token_embeddings.shape) # torch.Size([3, 11, 768])
So at this point, I have:
token embeddings in a [3, 11, 768] matrix: 3 sentences, 11 tokens, 768-D vector for each token.
attention mask in a [3, 11] matrix: 3 sentences, 11 tokens. A 1 value indicates non-[PAD].
How do I compute the mean / max over the vectors for the valid, non-[PAD] tokens?
I tried using the attention mask as a mask and then called torch.max(), but I don't get the right dimensions:
masked_token_embeddings = token_embeddings[attention_mask==1]
print(masked_token_embeddings.shape) # torch.Size([29, 768] <-- WRONG. SHOULD BE [3, 11, 768]
pooled = torch.max(masked_token_embeddings, 1)
print(pooled.values.shape) # torch.Size([29]) <-- WRONG. SHOULD BE [3, 768]
What I really want is a tensor of shape [3, 768]. That is, a 768-D vector for each of the 3 sentences.
For max, you can multiply with attention_mask:
pooled = torch.max((token_embeddings * attention_mask.unsqueeze(-1)), axis=1)
For mean, you can sum along the axis and divide by attention_mask along that axis:
mean_pooled = token_embeddings.sum(axis=1) / attention_mask.sum(axis=-1).unsqueeze(-1)
In addition to #Quang, you can have a look at sentence_transformers Pooling layer.
For max pooling, they do this:
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
token_embeddings[input_mask_expanded == 0] = -1e9 # Set padding tokens to large negative value
pooled = torch.max(token_embeddings, 1)[0]
And for mean pooling they do the following:
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
sum_mask = input_mask_expanded.sum(1)
sum_mask = torch.clamp(sum_mask, min=1e-9)
pooled = sum_embeddings / sum_mask
The max pooling presented in the accepted answer will suffer when the max is negative, and the implementation from sentence transformers changes token_embeddings, which throw an error when you want to use the embedding for back propagation:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
If you're interested on anything back-prop related, you can do something like this:
input_mask_expanded = torch.where(attention_mask==0, -1e-9, 0.).unsqueeze(-1).expand(token_embeddings.size()).float()
pooled = torch.max(token_embeddings-input_mask_expanded, 1)[0] # Set padding tokens to large negative value
It's the same idea of making all masked tokens to be very small, but it doesn't change the token_embeddings while at it.
Alex is right.
Look on hidden states for strings that go into tokenizer. For different strings, padding will have different embeddings.
So, in order to properly pool embeddings, you need to ignore those padding vectors.
Let's say you want to get embeddings out of the last 4 layers of BERT (as it yields the best classification results):
#iterate over the last 4 layers and get embeddings for
#strings without having embeddings from PAD tokens
m = []
for i in range(len(hidden_states[0])):
m.append([hidden_states[j+9][i,:,:][tokens["attention_mask"][i] !=0] for j in range(4)])
#average over all tokens embeddings
means = []
for i in range(len(hidden_states[0])):
means.append(torch.stack(m[i]).mean(dim=1))
#stack embeddings for all strings
pooled = torch.stack(means).reshape(-1,1,3072)

Multiple dimensionality reduction techniques with pipeline and GridSearchCV

we all know the common approach to define a pipeline with a dimensionality reduction technique and then a model for training and testing. Then we can apply the GridSearchCv for hyperparameter tuning.
grid = GridSearchCV(
Pipeline([
('reduce_dim', PCA()),
('classify', RandomForestClassifier(n_jobs = -1))
]),
param_grid=[
{
'reduce_dim__n_components': range(0.7,0.9,0.1),
'classify__n_estimators': range(10,50,5),
'classify__max_features': ['auto', 0.2],
'classify__min_samples_leaf': [40,50,60],
'classify__criterion': ['gini', 'entropy']
}
],
cv=5, scoring='f1')
grid.fit(X,y)
I can understand the above code.
Now i was going through the documentation today and there i found one part code which is little bit strange.
pipe = Pipeline([
# the reduce_dim stage is populated by the param_grid
('reduce_dim', 'passthrough'), # How does this work??
('classify', LinearSVC(dual=False, max_iter=10000))
])
N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
{
'reduce_dim': [PCA(iterated_power=7), NMF()],
'reduce_dim__n_components': N_FEATURES_OPTIONS, ### No PCA is used..??
'classify__C': C_OPTIONS
},
{
'reduce_dim': [SelectKBest(chi2)],
'reduce_dim__k': N_FEATURES_OPTIONS,
'classify__C': C_OPTIONS
},
]
reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']
grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid)
X, y = load_digits(return_X_y=True)
grid.fit(X, y)
First of all while defining a pipeline, it used a string 'passthrough' instead of a object.
('reduce_dim', 'passthrough'), ```
Then while defining different dimensionality reduction technique for the grid search, it used a different strategy. How does [PCA(iterated_power=7), NMF()] this work ?
'reduce_dim': [PCA(iterated_power=7), NMF()],
'reduce_dim__n_components': N_FEATURES_OPTIONS, # here
Please Someone explain the code to me .
Solved - in one line, the order is ['PCA', 'NMF', 'KBest(chi2)']
Courtesy of - seralouk (see answer below)
For Reference If someone looks for more details
1 2 3
It is equivalent as far as I know.
In the documentation you have this:
pipe = Pipeline([
# the reduce_dim stage is populated by the param_grid
('reduce_dim', 'passthrough'),
('classify', LinearSVC(dual=False, max_iter=10000))
])
N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
{
'reduce_dim': [PCA(iterated_power=7), NMF()],
'reduce_dim__n_components': N_FEATURES_OPTIONS,
'classify__C': C_OPTIONS
},
{
'reduce_dim': [SelectKBest(chi2)],
'reduce_dim__k': N_FEATURES_OPTIONS,
'classify__C': C_OPTIONS
},
]
Initially we have ('reduce_dim', 'passthrough'), and then 'reduce_dim': [PCA(iterated_power=7), NMF()]
The definition of the PCA is done in the second line.
You could define alternatively:
pipe = Pipeline([
# the reduce_dim stage is populated by the param_grid
('reduce_dim', PCA(iterated_power=7)),
('classify', LinearSVC(dual=False, max_iter=10000))
])
N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
{
'reduce_dim__n_components': N_FEATURES_OPTIONS,
'classify__C': C_OPTIONS
},
{
'reduce_dim': [SelectKBest(chi2)],
'reduce_dim__k': N_FEATURES_OPTIONS,
'classify__C': C_OPTIONS
},
]

Finding the appropriate CNN Model Architecture and Parameters

I am currently creating a CNN model that classifies whether the font is Arial, Verdana, Times New Roman and Georgia. All in all there are 16 classes since I considered also detecting whether the font is regular, bold, italics or bold italics. So 4 fonts * 4 styles = 16 classes.
The data that I have used in my training are the following:
Training data set : 800 image patches of 256 * 256 dimension (50 for each class)
Validation data set : 320 image patches of 256 * 256 dimension (20 for each class)
Testing data set : 160 image patches of 256 * 256 dimension (10 for each class)
Below is the sample screenshot of my data:
Below is my initial code:
import numpy as np
import keras
from keras import backend as K
from keras.models import Sequential
from keras.layers import Activation
from keras.layers.core import Dense, Flatten
from keras.optimizers import Adam
from keras.metrics import categorical_crossentropy
from keras.preprocessing.image import ImageDataGenerator
from keras.layers.normalization import BatchNormalization
from keras.layers.convolutional import *
from matplotlib import pyplot as plt
import itertools
import matplotlib.pyplot as plt
import pickle
image_width = 256
image_height = 256
train_path = 'font_model_data/train'
valid_path = 'font_model_data/valid'
test_path = 'font_model_data/test'
train_batches = ImageDataGenerator().flow_from_directory(train_path, target_size=(image_width, image_height), classes=['1','2','3','4', '5', '6', '7', '8', '9', '10', '11', '12','13', '14', '15', '16'], batch_size = 16)
valid_batches = ImageDataGenerator().flow_from_directory(valid_path, target_size=(image_width, image_height), classes=['1','2','3','4', '5', '6', '7', '8', '9', '10', '11', '12','13', '14', '15', '16'], batch_size = 16)
test_batches = ImageDataGenerator().flow_from_directory(test_path, target_size=(image_width, image_height), classes=['1','2','3','4', '5', '6', '7', '8', '9', '10', '11', '12','13', '14', '15', '16'], batch_size = 160)
imgs, labels = next(train_batches)
#CNN model
model = Sequential([
Conv2D(32, (3,3), activation='relu', input_shape=(image_width, image_height, 3)),
Flatten(),
Dense(16, activation='softmax'),
])
print(model.summary())
model.compile(Adam(lr=.0001),loss='categorical_crossentropy', metrics=['accuracy'])
model.fit_generator(train_batches, steps_per_epoch = 50, validation_data= valid_batches, validation_steps = 20, epochs = 1, verbose = 2)
model_pickle = open('cnn_font_model.pickle', 'wb')
pickle.dump(model, model_pickle)
model_pickle.close()
print('Training Done.')
test_imgs, test_labels = next(test_batches)
predictions = model.predict_generator(test_batches, steps = 160, verbose = 2)
print(predictions)
Can anyone suggest how will I know the right network architecture and parameters to get the optimal accuracy? How should I start tweaking my network?
Before going to choose Network you need to segmentize the image tile into subtitles with characters and feed to the following architecture...
# Initialising the CNN
classifier = Sequential()
# Step 1 - Convolution
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))
# Step 2 - Pooling
classifier.add(MaxPooling2D(pool_size = (2, 2)))
# Adding a second convolutional layer
classifier.add(Conv2D(32, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))
# Step 3 - Flattening
classifier.add(Flatten())
# Step 4 - Full connection
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dense(units = 1, activation = 'sigmoid'))
# Compiling the CNN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
classifier.fit_generator(training_set,
steps_per_epoch = XXX,
epochs = XX,
validation_data = test_set,
validation_steps = XXX)
from keras.models import load_model
classifier.save('your_classifier.h5')

multilayer_perceptron : ConvergenceWarning: Stochastic Optimizer: Maximum iterations reached and the optimization hasn't converged yet.Warning?

I have written a basic program to understand what's happening in MLP classifier?
from sklearn.neural_network import MLPClassifier
data: a dataset of body metrics (height, width, and shoe size) labeled male or female:
X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
[190, 90, 47], [175, 64, 39],
[177, 70, 40], [159, 55, 37], [171, 75, 42], [181, 85, 43]]
y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',
'female', 'male', 'male']
prepare the model:
clf= MLPClassifier(hidden_layer_sizes=(3,), activation='logistic',
solver='adam', alpha=0.0001,learning_rate='constant',
learning_rate_init=0.001)
train
clf= clf.fit(X, y)
attributes of the learned classifier:
print('current loss computed with the loss function: ',clf.loss_)
print('coefs: ', clf.coefs_)
print('intercepts: ',clf.intercepts_)
print(' number of iterations the solver: ', clf.n_iter_)
print('num of layers: ', clf.n_layers_)
print('Num of o/p: ', clf.n_outputs_)
test
print('prediction: ', clf.predict([ [179, 69, 40],[175, 72, 45] ]))
calc. accuracy
print( 'accuracy: ',clf.score( [ [179, 69, 40],[175, 72, 45] ], ['female','male'], sample_weight=None ))
RUN1
current loss computed with the loss function: 0.617580287851
coefs: [array([[ 0.17222046, -0.02541928, 0.02743722],
[-0.19425909, 0.14586716, 0.17447281],
[-0.4063903 , 0.148889 , 0.02523247]]), array([[-0.66332919],
[ 0.04249613],
[-0.10474769]])]
intercepts: [array([-0.05611057, 0.32634023, 0.51251098]), array([ 0.17996649])]
number of iterations the solver: 200
num of layers: 3
Num of o/p: 1
prediction: ['female' 'male']
accuracy: 1.0
/home/anubhav/anaconda3/envs/mytf/lib/python3.6/site-packages/sklearn/neural_network/multilayer_perceptron.py:563: ConvergenceWarning: Stochastic Optimizer: Maximum iterations reached and the optimization hasn't converged yet.
% (), ConvergenceWarning)
RUN2
current loss computed with the loss function: 0.639478303643
coefs: [array([[ 0.02300866, 0.21547873, -0.1272455 ],
[-0.2859666 , 0.40159542, 0.55881399],
[ 0.39902066, -0.02792529, -0.04498812]]), array([[-0.64446013],
[ 0.60580985],
[-0.22001532]])]
intercepts: [array([-0.10482234, 0.0281211 , -0.16791644]), array([-0.19614561])]
number of iterations the solver: 39
num of layers: 3
Num of o/p: 1
prediction: ['female' 'female']
accuracy: 0.5
RUN3
current loss computed with the loss function: 0.691966937074
coefs: [array([[ 0.21882191, -0.48037975, -0.11774392],
[-0.15890357, 0.06887471, -0.03684797],
[-0.28321762, 0.48392007, 0.34104955]]), array([[ 0.08672174],
[ 0.1071615 ],
[-0.46085333]])]
intercepts: [array([-0.36606747, 0.21969636, 0.10138625]), array([-0.05670653])]
number of iterations the solver: 4
num of layers: 3
Num of o/p: 1
prediction: ['male' 'male']
accuracy: 0.5
RUN4:
current loss computed with the loss function: 0.697102567593
coefs: [array([[ 0.32489731, -0.18529689, -0.08712877],
[-0.35425908, 0.04214241, 0.41249622],
[-0.19993622, -0.38873908, -0.33057999]]), array([[ 0.43304555],
[ 0.37959392],
[ 0.55998979]])]
intercepts: [array([ 0.11555407, -0.3473817 , -0.16852093]), array([ 0.31326347])]
number of iterations the solver: 158
num of layers: 3
Num of o/p: 1
prediction: ['male' 'male']
accuracy: 0.5
-----------------------------------------------------------------
I have following questions:
1.Why in the RUN1 the optimizer did not converge?
2.Why in RUN3 the number of iteration were suddenly becomes so low and in the RUN4 so high?
3.What else can be done to increase the accuracy which I get in RUN1.?
1: Your MLP didn't converge:
The algorithm is optimizing by a stepwise convergence to a minimum and in run 1 your minimum wasn't found.
2 Difference of runs:
You have some random starting values for your MLP, so you dont get the same results as you see in your data. Seems that you started very close to a minimum in your fourth run. You can change the random_state parameter of your MLP to a constant e.g. random_state=0 to get the same result over and over.
3 is the most difficult point.
You can optimize parameters with
from sklearn.model_selection import GridSearchCV
Gridsearch splits up your test set in eqally sized parts, uses one part as test data and the rest as training data. So it optimizes as many classifiers as parts you split your data into.
you need to specify (your data is small so i suggest 2 or 3) the number of parts you split, a classifier (your MLP), and a Grid of parameters you want to optimize like this:
param_grid = [
{
'activation' : ['identity', 'logistic', 'tanh', 'relu'],
'solver' : ['lbfgs', 'sgd', 'adam'],
'hidden_layer_sizes': [
(1,),(2,),(3,),(4,),(5,),(6,),(7,),(8,),(9,),(10,),(11,), (12,),(13,),(14,),(15,),(16,),(17,),(18,),(19,),(20,),(21,)
]
}
]
Beacuse you once got 100 percent accuracy with a hidden layer of three neurons, you can try to optimize parameters like learning rate and momentum instead of the hidden layers.
Use Gridsearch like that:
clf = GridSearchCV(MLPClassifier(), param_grid, cv=3,
scoring='accuracy')
clf.fit(X,y)
print("Best parameters set found on development set:")
print(clf.best_params_)
You can consider increasing the number of iterations eg.
clf = MLPClassifier(max_iter=500)
This cleared the error when I did same.

Keras model predict train/test shape

I am training a CNN with Keras but with 30x30 patches from an image. I want to test the network with a full image but I get the following error:
ValueError: GpuElemwise. Input dimension mis-match. Input 2 (indices
start at 0) has shape[1] == 30, but the output's size on that axis is
100. Apply node that caused the error: GpuElemwise{Composite{((i0 + i1) - i2)}}[(0, 0)](GpuDimShuffle{0,2,3,1}.0, GpuReshape{4}.0,
GpuFromHost.0) Toposort index: 79 Inputs types:
[CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, (True, True,
True, False)), CudaNdarrayType(float32, 4D)] Inputs shapes: [(10, 100,
100, 3), (1, 1, 1, 3), (10, 30, 30, 3)] Inputs strides: [(30000, 100,
1, 10000), (0, 0, 0, 1), (2700, 90, 3, 1)] Inputs values: ['not
shown', CudaNdarray([[[[ 0.01060364 0.00988821 0.00741314]]]]), 'not
shown'] Outputs clients:
[[GpuCAReduce{pre=sqr,red=add}{0,1,1,1}(GpuElemwise{Composite{((i0 +
i1) - i2)}}[(0, 0)].0)]]
This is my model.predict:
predict_image = model.predict(np.array([test_images[1]]), batch_size=1)[0]
It's seems like the issue is that the input size cannot be anything other than 30x30 but the first input shape for the first layer of my network is none, none, 3.
model.add(Convolution2D(n1, f1, f1, border_mode='same', input_shape=(None, None, 3), activation='relu'))
Is it simply not possible to test an image with different dimensions to the ones I trained with?
As fchollet himself described here, you should be able to define the input as so:
input_shape=(1, None, None)
However this will fail if you have layers that use the Flatten operation.
This suggests that you should be able to accomplish your goal with a fully convolutional NN.

Resources