How to Fine-tune HuggingFace BERT model for Text Classification [closed] - machine-learning

Is there a step-by-step explanation of how to fine-tune a HuggingFace BERT model for text classification?

Fine Tuning Approach
There are multiple approaches to fine-tune BERT for the target tasks.
Further Pre-training the base BERT model
Custom classification layer(s) on top of the base BERT model being trainable
Custom classification layer(s) on top of the base BERT model being non-trainable (frozen)
Note that the base BERT model has been pre-trained on only two tasks, as described in the original paper.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
3.1 Pre-training BERT ...we pre-train BERT using two unsupervised tasks
Task #1: Masked LM
Task #2: Next Sentence Prediction (NSP)
Hence, the base BERT model is half-baked: it can be fully baked for the target domain (1st approach), or used as part of a custom model with the base either trainable (2nd) or frozen (3rd).
1st approach
How to Fine-Tune BERT for Text Classification? demonstrated the 1st approach of further pre-training, and pointed out that the learning rate is the key to avoiding catastrophic forgetting, where the pre-trained knowledge is erased while learning new knowledge.
We find that a lower learning rate, such as 2e-5,
is necessary to make BERT overcome the catastrophic forgetting problem. With an aggressive learning rate of 4e-4, the training set fails to converge.
Probably this is the reason why the BERT paper used 5e-5, 4e-5, 3e-5, and 2e-5 for fine-tuning.
We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set
Note that the base model pre-training itself used a higher learning rate.
bert-base-uncased - pretraining
The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, β1=0.9 and β2=0.999, a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after.
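For reference, that warmup-then-linear-decay schedule can be reproduced with the transformers helper (a sketch, not part of the original answer; the numbers mirror the quote above):
from transformers import create_optimizer

# Linear warmup to 1e-4 over 10,000 steps, then linear decay over 1M steps,
# with a weight decay rate of 0.01 (AdamW), as in the quoted pre-training setup.
optimizer, lr_schedule = create_optimizer(
    init_lr=1e-4,
    num_train_steps=1_000_000,
    num_warmup_steps=10_000,
    weight_decay_rate=0.01,
)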
The 1st approach is described as part of the 3rd approach below.
FYI:
TFDistilBertModel is the bare base model with the name distilbert.
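The summary below can be reproduced with a couple of lines (a minimal sketch):
from transformers import TFDistilBertModel

# Load the bare DistilBERT base model and print its layer summary.
model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')
model.summary()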
Model: "tf_distil_bert_model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
distilbert (TFDistilBertMain multiple 66362880
=================================================================
Total params: 66,362,880
Trainable params: 66,362,880
Non-trainable params: 0
2nd approach
Huggingface takes the 2nd approach in Fine-tuning with native PyTorch/TensorFlow, where TFDistilBertForSequenceClassification adds a custom classification layer, classifier, on top of the base distilbert model, with both trainable. The small-learning-rate requirement applies here as well, to avoid catastrophic forgetting.
from transformers import TFDistilBertForSequenceClassification
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)  # batch_size must not be passed when x is a tf.data.Dataset
Model: "tf_distil_bert_for_sequence_classification_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
distilbert (TFDistilBertMain multiple 66362880
_________________________________________________________________
pre_classifier (Dense) multiple 590592
_________________________________________________________________
classifier (Dense) multiple 1538
_________________________________________________________________
dropout_59 (Dropout) multiple 0
=================================================================
Total params: 66,955,010
Trainable params: 66,955,010 <--- All parameters are trainable
Non-trainable params: 0
Implementation of the 2nd approach
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (
    DistilBertTokenizerFast,
    TFDistilBertForSequenceClassification,
)

DATA_COLUMN = 'text'
LABEL_COLUMN = 'category_index'
MAX_SEQUENCE_LENGTH = 512
LEARNING_RATE = 5e-5
BATCH_SIZE = 16
NUM_EPOCHS = 3

# --------------------------------------------------------------------------------
# Tokenizer
# --------------------------------------------------------------------------------
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Huggingface tokenizer
    Args:
        sentences: String or list of string to tokenize
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )

# --------------------------------------------------------------------------------
# Load data
# --------------------------------------------------------------------------------
raw_train = pd.read_csv("./train.csv")
NUM_LABELS = len(raw_train[LABEL_COLUMN].unique())  # Number of target classes
train_data, validation_data, train_label, validation_label = train_test_split(
    raw_train[DATA_COLUMN].tolist(),
    raw_train[LABEL_COLUMN].tolist(),
    test_size=.2,
    shuffle=True
)

# --------------------------------------------------------------------------------
# Prepare TF dataset
# --------------------------------------------------------------------------------
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(train_data)),  # Convert BatchEncoding instance to dictionary
    train_label
)).shuffle(1000).batch(BATCH_SIZE).prefetch(1)
validation_dataset = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(validation_data)),
    validation_label
)).batch(BATCH_SIZE).prefetch(1)

# --------------------------------------------------------------------------------
# Training
# --------------------------------------------------------------------------------
model = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=NUM_LABELS
)
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(
    x=train_dataset,          # batch_size is taken from the dataset itself
    validation_data=validation_dataset,
    epochs=NUM_EPOCHS,
)
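Once trained, inference might look like this (a sketch, not in the original; the sample text is made up):
# Classify new text with the fine-tuned model, reusing the tokenize() helper above.
texts = ["An example sentence to classify."]
encoded = dict(tokenize(texts))
logits = model(encoded).logits                    # shape (batch_size, NUM_LABELS)
probabilities = tf.nn.softmax(logits, axis=-1)
predicted_classes = tf.argmax(probabilities, axis=-1).numpy()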
3rd approach
Basics
Please note that the images are taken from A Visual Guide to Using BERT for the First Time and modified.
Tokenizer
The tokenizer generates an instance of BatchEncoding, which can be used like a Python dictionary and serves as the input to the BERT model.
BatchEncoding
Holds the output of the encode_plus() and batch_encode_plus() methods (tokens, attention_masks, etc).
This class is derived from a python dictionary and can be used as a dictionary. In addition, this class exposes utility methods to map from word/character space to token space.
Parameters
data (dict) – Dictionary of lists/arrays/tensors returned by the encode/batch_encode methods (‘input_ids’, ‘attention_mask’, etc.).
The data attribute of the class holds the generated tokens, which include the input_ids and attention_mask elements.
input_ids
The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.
attention_mask
This argument indicates to the model which tokens should be attended to, and which should not.
If the attention_mask is 0, the token id is ignored. For instance, if a sequence is padded to adjust its length, the padded tokens should be ignored; hence their attention_mask values are 0.
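A quick illustration (a sketch; the exact ids depend on the vocabulary):
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
encoding = tokenizer("hello world", padding='max_length', max_length=8)
print(encoding['input_ids'])       # e.g. [101, 7592, 2088, 102, 0, 0, 0, 0]
print(encoding['attention_mask'])  # [1, 1, 1, 1, 0, 0, 0, 0] <- padded positions are 0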
Special Tokens
BertTokenizer adds special tokens, enclosing a sequence with [CLS] and [SEP]. [CLS] represents Classification and [SEP] separates sequences. For Question Answering or Paraphrase tasks, [SEP] separates the two sentences to compare.
BertTokenizer
cls_token (str, optional, defaults to "[CLS]") - The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
sep_token (str, optional, defaults to "[SEP]") - The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
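Reusing the tokenizer above, decoding the ids shows where the special tokens land (illustrative):
single = tokenizer("hello world")
print(tokenizer.decode(single['input_ids']))
# [CLS] hello world [SEP]

pair = tokenizer("what is bert?", "bert is a language model.")
print(tokenizer.decode(pair['input_ids']))
# [CLS] what is bert? [SEP] bert is a language model. [SEP]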
A Visual Guide to Using BERT for the First Time shows the tokenization.
[CLS]
The embedding vector for [CLS] in the output of the base model's final layer represents the classification learned by the base model. Hence, feed the embedding vector of the [CLS] token into the classification layer added on top of the base model.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B.
The model structure is illustrated below.
Vector size
In the model distilbert-base-uncased, each token is embedded into a vector of size 768. The shape of the output from the base model is (batch_size, max_sequence_length, embedding_vector_size=768). This accords with the BERT paper's BERT/BASE model (as the base in distilbert-base-uncased indicates).
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT/BASE (L=12, H=768, A=12, Total Parameters=110M) and BERT/LARGE (L=24, H=1024, A=16, Total Parameters=340M).
Base Model - TFDistilBertModel
Hugging Face Transformers: Fine-tuning DistilBERT for Binary Classification Tasks
TFDistilBertModel class to instantiate the base DistilBERT model without any specific head on top (as opposed to other classes such as TFDistilBertForSequenceClassification that do have an added classification head).
We do not want any task-specific head attached because we simply want the pre-trained weights of the base model to provide a general understanding of the English language, and it will be our job to add our own classification head during the fine-tuning process in order to help the model distinguish between toxic comments.
TFDistilBertModel generates an instance of TFBaseModelOutput whose last_hidden_state parameter is the output from the model's last layer.
TFBaseModelOutput([(
'last_hidden_state',
<tf.Tensor: shape=(batch_size, sequence_length, 768), dtype=float32, numpy=array([[[...]]], dtype=float32)>
)])
TFBaseModelOutput
Parameters
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
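A minimal sketch of pulling the [CLS] embedding out of last_hidden_state (reusing the tokenizer from above):
from transformers import TFDistilBertModel

base = TFDistilBertModel.from_pretrained('distilbert-base-uncased')
encoded = dict(tokenizer("hello world", return_tensors="tf"))
output = base(encoded)
print(output.last_hidden_state.shape)              # (1, sequence_length, 768)
cls_embedding = output.last_hidden_state[:, 0, :]  # first token is [CLS]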
Implementation
Python modules
import datetime

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (
    DistilBertTokenizerFast,
    TFDistilBertModel,
)
Configuration
TIMESTAMP = datetime.datetime.now().strftime("%Y%b%d%H%M").upper()
DATA_COLUMN = 'text'
LABEL_COLUMN = 'category_index'
MAX_SEQUENCE_LENGTH = 512   # Max length allowed for BERT is 512.
raw_train = pd.read_csv("./train.csv")   # Loaded here because NUM_LABELS depends on it.
NUM_LABELS = len(raw_train[LABEL_COLUMN].unique())
MODEL_NAME = 'distilbert-base-uncased'
NUM_BASE_MODEL_OUTPUT = 768
# Flag to freeze base model
FREEZE_BASE = True
# Flag to add custom classification heads
USE_CUSTOM_HEAD = True
if not USE_CUSTOM_HEAD:
    # Make the base trainable when no classification head exists.
    FREEZE_BASE = False
BATCH_SIZE = 16
NUM_EPOCHS = 3
LEARNING_RATE = 1e-2 if FREEZE_BASE else 5e-5
L2 = 0.01
Tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)
def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Huggingface tokenizer
    Args:
        sentences: String or list of string to tokenize
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )
Input layer
The base model expects input_ids and attention_mask, each of shape (max_sequence_length,). Generate Keras tensors for them with respective Input layers.
# Inputs for token indices and attention masks
input_ids = tf.keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='attention_mask')
Base model layer
Generate the output from the base model. The base model generates TFBaseModelOutput. Feed the embedding of [CLS] to the next layer.
base = TFDistilBertModel.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS
)

# Freeze the base model weights.
if FREEZE_BASE:
    for layer in base.layers:
        layer.trainable = False

base.summary()

# [CLS] embedding is last_hidden_state[:, 0, :]
output = base([input_ids, attention_mask]).last_hidden_state[:, 0, :]
Classification layers
if USE_CUSTOM_HEAD:
    # --------------------------------------------------------------------------------
    # Classification layer 01
    # --------------------------------------------------------------------------------
    output = tf.keras.layers.Dropout(
        rate=0.15,
        name="01_dropout",
    )(output)
    output = tf.keras.layers.Dense(
        units=NUM_BASE_MODEL_OUTPUT,
        kernel_initializer='glorot_uniform',
        activation=None,
        name="01_dense_relu_no_regularizer",
    )(output)
    output = tf.keras.layers.BatchNormalization(
        name="01_bn"
    )(output)
    output = tf.keras.layers.Activation(
        "relu",
        name="01_relu"
    )(output)

    # --------------------------------------------------------------------------------
    # Classification layer 02
    # --------------------------------------------------------------------------------
    output = tf.keras.layers.Dense(
        units=NUM_BASE_MODEL_OUTPUT,
        kernel_initializer='glorot_uniform',
        activation=None,
        name="02_dense_relu_no_regularizer",
    )(output)
    output = tf.keras.layers.BatchNormalization(
        name="02_bn"
    )(output)
    output = tf.keras.layers.Activation(
        "relu",
        name="02_relu"
    )(output)
Softmax Layer
output = tf.keras.layers.Dense(
units=NUM_LABELS,
kernel_initializer='glorot_uniform',
kernel_regularizer=tf.keras.regularizers.l2(l2=L2),
activation='softmax',
name="softmax"
)(output)
Final Custom Model
name = f"{TIMESTAMP}_{MODEL_NAME.upper()}"
model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output, name=name)
model.compile(
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
metrics=['accuracy']
)
model.summary()
---
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_ids (InputLayer) [(None, 256)] 0
__________________________________________________________________________________________________
attention_mask (InputLayer) [(None, 256)] 0
__________________________________________________________________________________________________
tf_distil_bert_model (TFDistilB TFBaseModelOutput(la 66362880 input_ids[0][0]
attention_mask[0][0]
__________________________________________________________________________________________________
tf.__operators__.getitem_1 (Sli (None, 768) 0 tf_distil_bert_model[1][0]
__________________________________________________________________________________________________
01_dropout (Dropout) (None, 768) 0 tf.__operators__.getitem_1[0][0]
__________________________________________________________________________________________________
01_dense_relu_no_regularizer (D (None, 768) 590592 01_dropout[0][0]
__________________________________________________________________________________________________
01_bn (BatchNormalization) (None, 768) 3072 01_dense_relu_no_regularizer[0][0
__________________________________________________________________________________________________
01_relu (Activation) (None, 768) 0 01_bn[0][0]
__________________________________________________________________________________________________
02_dense_relu_no_regularizer (D (None, 768) 590592 01_relu[0][0]
__________________________________________________________________________________________________
02_bn (BatchNormalization) (None, 768) 3072 02_dense_relu_no_regularizer[0][0
__________________________________________________________________________________________________
02_relu (Activation) (None, 768) 0 02_bn[0][0]
__________________________________________________________________________________________________
softmax (Dense) (None, 2) 1538 02_relu[0][0]
==================================================================================================
Total params: 67,551,746
Trainable params: 1,185,794
Non-trainable params: 66,365,952 <--- Base BERT model is frozen
Data allocation
# --------------------------------------------------------------------------------
# Split data into training and validation
# --------------------------------------------------------------------------------
# raw_train was loaded in the Configuration section above.
train_data, validation_data, train_label, validation_label = train_test_split(
    raw_train[DATA_COLUMN].tolist(),
    raw_train[LABEL_COLUMN].tolist(),
    test_size=.2,
    shuffle=True
)
# X = dict(tokenize(train_data))
# Y = tf.convert_to_tensor(train_label)
X = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(train_data)),       # Convert BatchEncoding instance to dictionary
    train_label
)).batch(BATCH_SIZE).prefetch(1)
V = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(validation_data)),  # Convert BatchEncoding instance to dictionary
    validation_label
)).batch(BATCH_SIZE).prefetch(1)
Train
# --------------------------------------------------------------------------------
# Train the model
# https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
# Input data x can be a dict mapping input names to the corresponding array/tensors,
# if the model has named inputs. Beware of the "names". y should be consistent with x
# (you cannot have Numpy inputs and tensor targets, or inversely).
# --------------------------------------------------------------------------------
history = model.fit(
    x=X,                      # tf.data.Dataset of (inputs dict, labels)
    validation_data=V,
    epochs=NUM_EPOCHS,        # batch_size must not be set when x is a Dataset
)
To implement the 1st approach, change the configuration as below.
USE_CUSTOM_HEAD = False
Then FREEZE_BASE is set to False and LEARNING_RATE to 5e-5, which runs further pre-training on the base BERT model.
Saving the model
For the 3rd approach, saving the model causes issues. The save_pretrained method of the Huggingface model cannot be used, as the model is not a direct subclass of the Huggingface PreTrainedModel.
Keras save_model causes an error with the default save_traces=True, and a different error with save_traces=False when loading the model with Keras load_model.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-71-01d66991d115> in <module>()
----> 1 tf.keras.models.load_model(MODEL_DIRECTORY)
11 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/saving/saved_model/load.py in _unable_to_call_layer_due_to_serialization_issue(layer, *unused_args, **unused_kwargs)
865 'recorded when the object is called, and used when saving. To manually '
866 'specify the input shape/dtype, decorate the call function with '
--> 867 '`@tf.function(input_signature=...)`.'.format(layer.name, type(layer)))
868
869
ValueError: Cannot call custom layer tf_distil_bert_model of type <class 'tensorflow.python.keras.saving.saved_model.load.TFDistilBertModel'>, because the call function was not serialized to the SavedModel. Please try one of the following methods to fix this issue:
(1) Implement `get_config` and `from_config` in the layer/model class, and pass the object to the `custom_objects` argument when loading the model. For more details, see: https://www.tensorflow.org/guide/keras/save_and_serialize
(2) Ensure that the subclassed model or layer overwrites `call` and not `__call__`. The input shape and dtype will be automatically recorded when the object is called, and used when saving. To manually specify the input shape/dtype, decorate the call function with `@tf.function(input_signature=...)`.
Only Keras Model save_weights worked as far as I tested.
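A sketch of the weight-only workaround (MODEL_DIRECTORY and build_model are placeholders, not from the original):
# Save only the weights; the architecture is rebuilt in code when loading.
model.save_weights(MODEL_DIRECTORY + '/ckpt')

# To restore: re-run the model definition above, then load the weights.
restored_model = build_model()   # hypothetical helper that repeats the model definition
restored_model.load_weights(MODEL_DIRECTORY + '/ckpt')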
Experiments
As far as I tested with the Toxic Comment Classification Challenge, the 1st approach gave better recall (identifying true toxic and true non-toxic comments). The code can be accessed below. Please provide corrections/suggestions if anything is wrong.
Code for 1st and 3rd approach
Related
BERT Document Classification Tutorial with Code - Fine tuning using TFDistilBertForSequenceClassification and Pytorch
Hugging Face Transformers: Fine-tuning DistilBERT for Binary Classification Tasks - Fine tuning using TFDistilBertModel

Related

Can the number of units in NN input layer be different than the number of features in the data?

Based on the tensorflow keras API tutorial:
model = keras.Sequential([
    keras.layers.Dense(10, activation='softmax', input_shape=(32,)),
    keras.layers.Dense(10, activation='softmax')
])
I couldn't understand why the number of units in the input layer is 10 while the input shape is 32. Also, there are many examples like this one in the tensorflow tutorials.
This is a rather common confusion by new practitioners, and not without a reason: the answer, as it has already been hinted at in the comments, is that in the Keras Sequential API there is an implicit input layer, determined by the input_shape argument of the first explicit layer.
This is directly visible in the Keras Functional API (check the example in the docs), where Input is an explicit layer itself, and in which your model would be written as:
inputs = Input(shape=(32,)) # input layer
x = Dense(10, activation='softmax')(inputs) # hidden layer
outputs = Dense(10, activation='softmax')(x) # output layer
model = Model(inputs, outputs)
i.e. your model is actually an example of a "good old" neural net with three layers (input, hidden, and output), despite that it looks like a two-layer net in the Keras Sequential API.
(BTW, and irrelevant to the question, it does not make much sense to have softmax as activation for your hidden layer.)

Training LSTMs in Keras with time series of different length

I'm new to Keras and wondering how to train an LSTM with (interrupted) time series of different lengths. Consider, for example, a continuous series from day 1 to day 10 and another continuous series from day 15 to day 20. Simply concatenating them into a single series might yield wrong results. I see basically two options to bring them to shape (batch_size, timesteps, output_features):
Extend the shorter series by some default value (0), i.e. for the above example we would have the following batch:
d1, ..., d10
d15, ..., d20, 0, 0, 0, 0, 0
Compute the GCD of the lengths, cut the series into pieces, and use a stateful LSTM, i.e.:
d1, ..., d5
d6, ..., d10
reset_state
d15, ..., d20
Are there any other / better solutions? Is training a stateless LSTM with a complete sequence equivalent to training a stateful LSTM with pieces?
Have you tried feeding the LSTM layer with inputs of different lengths? The input time-series can be of different lengths when an LSTM is used (even the batch sizes can differ from one batch to another, but obviously the dimension of the features should be the same). Here is an example in Keras:
from keras import models, layers
n_feats = 32
latent_dim = 64
lstm_input = layers.Input(shape=(None, n_feats))
lstm_output = layers.LSTM(latent_dim)(lstm_input)
model = models.Model(lstm_input, lstm_output)
model.summary()
Output:
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) (None, None, 32) 0
_________________________________________________________________
lstm_2 (LSTM) (None, 64) 24832
=================================================================
Total params: 24,832
Trainable params: 24,832
Non-trainable params: 0
As you can see, the first and second axes of the Input layer are None, meaning they are not pre-specified and can take any value. You can think of an LSTM as a loop: no matter the input length, as long as there are remaining data vectors of the same length (i.e. n_feats), the LSTM layer processes them. Therefore, as you can see above, the number of parameters in an LSTM layer does not depend on the batch size or the time-series length (it only depends on the input feature vector's length and the latent dimension of the LSTM).
import numpy as np
# feed LSTM with: batch_size=10, timestamps=5
model.predict(np.random.rand(10, 5, n_feats)) # This works
# feed LSTM with: batch_size=5, timestamps=100
model.predict(np.random.rand(5, 100, n_feats)) # This also works
However, depending on the specific problem you are working on, this may not work, though I don't have a specific example in mind in which this behavior would be unsuitable; in such cases you should make sure all the time-series have the same length.
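As a side note (not part of the original answer): if you instead pad shorter series with zeros as in option 1 of the question, Keras provides a Masking layer so the LSTM skips the padded timesteps; a minimal sketch:
from keras import models, layers

n_feats = 32
inp = layers.Input(shape=(None, n_feats))
masked = layers.Masking(mask_value=0.0)(inp)  # timesteps whose features are all 0.0 are skipped
out = layers.LSTM(64)(masked)
model = models.Model(inp, out)
model.summary()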

How to Use Keras Model to Predict Output After Unpacking the Model

I can unpack my RNN model onto my website, but I am having trouble getting it to predict a numpy array of predictions using a list as input (the list contains only one string, text, but needs to be a list for preprocessing from what I've gathered), and I am coming across this problem:
ValueError: Error when checking : expected embedding_1_input to have shape (None, 72)
but got array with shape (1, 690)
Here is how I am currently preprocessing and predicting with the model:
tokenizer = Tokenizer(num_words = 5000, split=' ')
tokenizer.fit_on_texts([text])
X = tokenizer.texts_to_sequences([text])
X = pad_sequences(X)
prediction = loadedModel.predict(X)
print(prediction)
And this is how I trained my model:
HIDDEN_LAYER_SIZE = 195        # Details the amount of nodes in a hidden layer.
TOP_WORDS = 5000               # Most-used words in the dataset.
MAX_REVIEW_LENGTH = 500        # Char length of each text being sent in (necessary).
EMBEDDING_VECTOR_LENGTH = 128  # The Embedding layer will have 128-length vectors to
                               # represent each word.
BATCH_SIZE = 32                # Takes 32 sentences at a time and continually retrains RNN.
NUMBER_OF_EPOCHS = 10          # Fits RNN to more accurately guess the data's political bias.
DROPOUT = 0.2                  # Helps slow down overfitting of data (slower convergence rate)
# Define the model
model = Sequential()
model.add(Embedding(TOP_WORDS, EMBEDDING_VECTOR_LENGTH,
                    input_length=X.shape[1]))
model.add(SpatialDropout1D(DROPOUT))
model.add(LSTM(HIDDEN_LAYER_SIZE))
model.add(Dropout(DROPOUT))
model.add(Dense(2, activation='softmax'))
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
# printModelSummary(model)
# Fit the model
model.fit(X_train, Y_train, validation_data=(X_test, Y_test),
          epochs=NUMBER_OF_EPOCHS, batch_size=BATCH_SIZE)
How can I fix my preprocessing code in the codebox starting with "tokenizer" to stop getting the ValueError?
Thank you, and I can definitely provide more code or expand upon the purpose of the project.
So there are two problems here:
Set maxlen in pad_sequences: it seems that all of your training sequences were padded to have length 72, so you need to change the following line:
X = pad_sequences(X, maxlen=72)
Use the training Tokenizer: this is a subtle problem. You are creating and fitting a totally new Tokenizer, so it could be different from the one you used for training. This could cause problems, because different words could have different indexes, and this will make your model perform terribly. Try to pickle your training Tokenizer and load it during deployment in order to properly transform sentences into data points fed to your model.
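For example, a sketch of persisting and reloading the training Tokenizer (the file name is illustrative):
import pickle

# At training time: save the fitted tokenizer next to the model.
with open('tokenizer.pickle', 'wb') as f:
    pickle.dump(tokenizer, f)

# At deployment time: load it back and reuse the same word index.
with open('tokenizer.pickle', 'rb') as f:
    tokenizer = pickle.load(f)
X = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=72)
prediction = loadedModel.predict(X)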

Weird accuracy in multilabel classification keras

I have a multilabel classification problem. I used the following code, but the validation accuracy jumps to 99% in the first epoch, which is weird given the complexity of the data: the input features are 2048 values extracted from the inception model's pool3:0 layer, and the labels are of size [1000]. (Here is a link to a file containing samples of features and labels: https://drive.google.com/file/d/0BxI_8PO3YBPPYkp6dHlGeExpS1k/view?usp=sharing)
Is there something I am doing wrong here?
Note: labels are sparse vectors containing only 1~10 entries set to 1; the rest are zeros.
model.compile(optimizer='adadelta', loss='binary_crossentropy', metrics=['accuracy'])
The output of the prediction is all zeros!
What am I doing wrong in training the model that spoils the prediction?
# input is the features file and labels file
def generate_arrays_from_file(path, batch_size=100):
    x = np.empty([batch_size, 2048])
    y = np.empty([batch_size, 1000])
    while True:
        f = open(path)
        i = 1
        for line in f:
            # create Numpy arrays of input data
            # and labels, from each line in the file
            words = line.split(',')
            words = map(float, words[1:])
            x_ = np.array(words[0:2048])
            y_ = words[2048:]
            y_ = np.array(map(int, y_))
            x_ = x_.reshape((1, -1))
            # print np.squeeze(x_)
            y_ = y_.reshape((1, -1))
            x[i] = x_
            y[i] = y_
            i += 1
            if i == batch_size:
                i = 1
                yield (x, y)
        f.close()

model = Sequential()
model.add(Dense(units=2048, activation='sigmoid', input_dim=2048))
model.add(Dense(units=1000, activation="sigmoid",
                kernel_initializer="uniform"))
model.compile(optimizer='adadelta', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit_generator(generate_arrays_from_file('train.txt'),
                    validation_data=generate_arrays_from_file('test.txt'),
                    validation_steps=1000, epochs=100, steps_per_epoch=1000,
                    verbose=1)
I think the problem with the accuracy is that your outputs are sparse.
Keras computes accuracy using this formula:
K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
So, in your case, having only 1~10 non-zero labels out of 1000, a prediction of all 0s will yield an accuracy of 99% to 99.9%.
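To see this concretely (a small sketch of the arithmetic):
import numpy as np

y_true = np.zeros(1000)
y_true[:5] = 1                               # 5 positive labels out of 1000
y_pred = np.zeros(1000)                      # model predicts all zeros
print(np.mean(np.round(y_pred) == y_true))   # 0.995 -> 99.5% "accuracy"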
As for the model not learning, I think the problem is that you are using a sigmoid as the last activation and using 0 or 1 as the output values. This is bad practice since, in order for the sigmoid to return 0 or 1, the values it gets as input must be very large or very small, which means the net will have very large (in absolute value) weights. Furthermore, since in each training output there are far fewer 1s than 0s, the network will soon reach a stationary point in which it simply outputs all zeros (the loss in this case is not very large either; it should be around 0.016~0.16).
What you can do is scale your output labels so that they are between (0.2, 0.8), for example, so that the weights of the net won't become too big or too small. Alternatively, you can use a relu as the activation function.
Did you try to use the cosine similarity as a loss function?
I had the same multi-label + high-dimensionality problem.
The cosine distance takes into account the orientation of the model output (prediction) and the desired output (true class) vector.
It is the normalized dot-product between two vectors.
In Keras the cosine_proximity function is -1*cosine_distance, meaning that -1 corresponds to two vectors with the same orientation.
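For example (a sketch; 'cosine_proximity' is the older Keras loss name, while newer tf.keras exposes tf.keras.losses.CosineSimilarity):
# Swap the loss in the compile step of the question's model.
model.compile(optimizer='adadelta', loss='cosine_proximity', metrics=['accuracy'])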

Keras: model with one input and two outputs, trained jointly on different data (semi-supervised learning)

I would like to code with Keras a neural network that acts both as an autoencoder AND a classifier for semi-supervised learning. Take for example this dataset where there is a few labeled images and a lot of unlabeled images: https://cs.stanford.edu/~acoates/stl10/
Some papers listed here achieved that, or very similar things, successfully.
To sum up: the model would have the same input data shape and the same "encoding" convolutional layers, but would split into two heads (fork-style), a classification head and a decoding head, in a way that the unsupervised autoencoder contributes to good learning for the classification head.
With TensorFlow there would be no problem doing that as we have full control over the computational graph.
But with Keras, things are more high-level and I feel that all the calls to ".fit" must always provide all the data at once (so it would force me to tie together the classification head and the autoencoding head into one time-step).
One way in keras to almost do that would be with something that goes like this:
input = Input(shape=(32, 32, 3))
cnn_feature_map = sequential_cnn_trunk(input)
classification_predictions = Dense(10, activation='sigmoid')(cnn_feature_map)
autoencoded_predictions = decode_cnn_head_sequential(cnn_feature_map)
model = Model(inputs=[input], outputs=[classification_predictions, autoencoded_predictions])
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit([images], [labels, images], epochs=10)
However, I think and I fear that if I just want to fit things in that way it will fail and ask for the missing head:
for epoch in range(10):
    # classification step
    model.fit([images], [labels, None], epochs=1)
    # "semi-unsupervised" autoencoding step
    model.fit([images], [None, images], epochs=1)
    # note: ".train_on_batch" could probably be used rather than ".fit"
    # to avoid doing a whole epoch each time.
How should one implement that behavior with Keras? And could the training be done jointly without having to split the two calls to the ".fit" function?
Sometimes, when you don't have a label, you can pass a zero vector instead of a one-hot encoded vector. It should not change your result, because a zero vector doesn't produce any error signal with the categorical cross-entropy loss.
My custom to_categorical function looks like this:
def tricky_to_categorical(y, translator_dict):
    encoded = np.zeros((y.shape[0], len(translator_dict)))
    for i in range(y.shape[0]):
        if y[i] in translator_dict:
            encoded[i][translator_dict[y[i]]] = 1
    return encoded
Here y contains labels, and translator_dict is a Python dictionary which maps labels to their unique keys, like this:
{'unisex': 2, 'female': 1, 'male': 0}
If an UNK label can't be found in this dictionary, then its encoded label will be a zero vector.
If you use this trick you also have to modify your accuracy function to see real accuracy numbers: you have to filter out all zero vectors from the metrics.
import tensorflow as tf
from keras import backend as K

def tricky_accuracy(y_true, y_pred):
    mask = K.not_equal(K.sum(y_true, axis=-1), K.constant(0))  # zero-vector mask
    y_true = tf.boolean_mask(y_true, mask)
    y_pred = tf.boolean_mask(y_pred, mask)
    return K.cast(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)), K.floatx())
Note: you have to use larger batches (e.g. 32) in order to prevent zero-matrix updates, because they can make your accuracy metric go crazy; I don't know why.
Alternative solution
Use Pseudo Labeling :)
You can train jointly; you have to pass an array instead of a single label.
I used fit_generator, e.g.
model.fit_generator(
    batch_generator(),
    steps_per_epoch=len(dataset) / batch_size,
    epochs=epochs)

def batch_generator():
    batch_x = np.empty((batch_size, img_height, img_width, 3))
    gender_label_batch = np.empty((batch_size, len(gender_dict)))
    category_label_batch = np.empty((batch_size, len(category_dict)))
    while True:
        i = 0
        for idx in np.random.choice(len(dataset), batch_size):
            image_id = dataset[idx][0]
            batch_x[i] = load_and_convert_image(image_id)
            gender_label_batch[i] = gender_labels[idx]
            category_label_batch[i] = category_labels[idx]
            i += 1
        yield batch_x, [gender_label_batch, category_label_batch]
